Spaarsh commented on PR #982:
URL:
https://github.com/apache/datafusion-python/pull/982#issuecomment-2708491224
Key Points:
1. The ```|``` operator on types is not supported on Python < 3.10, so anyone
on an older version who pulls main after the merge will not be able to use
```SessionContext``` at all
2. ```global_ctx``` is already exposed to Python
Details:
The `|` operator used in the type annotations of all the ```read_*```
functions is supported only on Python >= 3.10. To even import
```SessionContext```, I had to replace every ```|``` with ```Union```. Until
then, I was getting this error:
```
$ python3
Python 3.9.7 (default, Oct 18 2021, 02:25:46)
[Clang 13.0.0 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> imp
KeyboardInterrupt
>>> from datafusion import SessionContext
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spaarsh/gsoc/df-py/datafusion-python/python/datafusion/__init__.py", line 48, in <module>
    from .io import read_avro, read_csv, read_json, read_parquet
  File "/home/spaarsh/gsoc/df-py/datafusion-python/python/datafusion/io.py", line 31, in <module>
    path: str | pathlib.Path,
TypeError: unsupported operand type(s) for |: 'type' and 'type'
```
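The failure does not depend on datafusion itself. As a minimal sketch (the helper name `union_syntax_supported` is hypothetical, not part of the library), the same expression from the annotation can be evaluated directly to see whether the running interpreter accepts it:

```python
import pathlib
import sys


def union_syntax_supported() -> bool:
    """Return True if `type | type` can be evaluated at runtime here."""
    try:
        # Same expression as the annotation on line 31 of io.py.
        str | pathlib.Path
    except TypeError:
        # Python < 3.10: plain classes do not implement `|`.
        return False
    return True


print(sys.version_info[:2], union_syntax_supported())
```

Function annotations are normally evaluated when the `def` statement runs, which is why the error appears at import time rather than when a `read_*` function is called.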
After replacing all ```|``` operations with ```Union```, it all works.
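
For illustration, this is roughly the shape of the change. The function `describe_path` is a hypothetical stand-in, not one of the actual `read_*` signatures:

```python
import pathlib
from typing import Union


# Hypothetical stand-in for a read_* function. On Python 3.9,
# `path: str | pathlib.Path` would raise TypeError when this def runs;
# Union[...] expresses the same type and works on older versions.
def describe_path(path: Union[str, pathlib.Path]) -> str:
    """Accept either a str or a Path and return the path as a string."""
    return str(pathlib.Path(path))


print(describe_path("data.csv"))
print(describe_path(pathlib.Path("data.csv")))
```

Another option, which I have not tried against this branch, is adding `from __future__ import annotations` at the top of the module. That defers evaluation of annotations, so the `|` form should no longer be evaluated at import time.
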
```global_ctx``` is already exposed to Python, unless I have
misunderstood something. I pulled the branch and tested it, and it works:
```
$ python3 test.py
DataFrame()
+----+----------+-----+---------+------------+-----------+
| id | name | age | salary | start_date | is_active |
+----+----------+-----+---------+------------+-----------+
| 1 | John Doe | 32 | 75000.5 | 2020-01-15 | true |
+----+----------+-----+---------+------------+-----------+
DataFrame()
+----+----------+-----+---------+------------+-----------+
| id | name | age | salary | start_date | is_active |
+----+----------+-----+---------+------------+-----------+
| 1 | John Doe | 32 | 75000.5 | 2020-01-15 | true |
+----+----------+-----+---------+------------+-----------+
DataFrame()
+-----+----+-----------+----------+---------+------------+
| age | id | is_active | name | salary | start_date |
+-----+----+-----------+----------+---------+------------+
| 32 | 1 | true | John Doe | 75000.5 | 2020-01-15 |
+-----+----+-----------+----------+---------+------------+
DataFrame()
+----+----------+-----+---------+------------+-----------+
| id | name | age | salary | start_date | is_active |
+----+----------+-----+---------+------------+-----------+
| 1 | John Doe | 32 | 75000.5 | 2020-01-15 | true |
+----+----------+-----+---------+------------+-----------+
```
Just for reference, these are the scripts I used to generate the data files
and test the functions:
<details>
```
# test.py
from datafusion import SessionContext

# Create a new session
ctx = SessionContext()

# Read different file formats
df1 = ctx.read_csv("data.csv")  # Accepts str or Path
df2 = ctx.read_parquet("data.parquet")
df3 = ctx.read_json("data.json")
df4 = ctx.read_avro("data.avro")

print(df1)
print(df2)
print(df3)
print(df4)
```
```
# create.py - to create the data files
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import json
import fastavro

# Sample data as a dictionary (one row)
data = {
    'id': [1],
    'name': ['John Doe'],
    'age': [32],
    'salary': [75000.50],
    'start_date': ['2020-01-15'],
    'is_active': [True]
}

# Create DataFrame
df = pd.DataFrame(data)

# Save as CSV (test.py reads data.csv)
df.to_csv('data.csv', index=False)

# Save as Parquet
df.to_parquet('data.parquet')

# Save as JSON (line-delimited)
with open('data.json', 'w') as f:
    for _, row in df.iterrows():
        json.dump(row.to_dict(), f)
        f.write('\n')

# Save as Avro
schema = {
    'name': 'Employee',
    'type': 'record',
    'fields': [
        {'name': 'id', 'type': 'int'},
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'},
        {'name': 'salary', 'type': 'double'},
        {'name': 'start_date', 'type': 'string'},
        {'name': 'is_active', 'type': 'boolean'}
    ]
}
records = df.to_dict('records')
with open('data.avro', 'wb') as f:
    fastavro.writer(f, schema, records)
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]