Re: [PR] feat: reads using global ctx [datafusion-python]

via GitHub Sat, 08 Mar 2025 13:20:04 -0800


Spaarsh commented on PR #982:
URL: 
https://github.com/apache/datafusion-python/pull/982#issuecomment-2708491224


   Key Points:
   1. The ```|``` operator on types is not supported on Python < 3.10, so anyone 
pulling main after the merge on an older interpreter will not be able to 
import ```SessionContext``` at all
   2. ```global_ctx``` is already exposed to Python
   
   Details:
   The `|` operator used in the type annotations of all the ```read_*``` 
functions is supported only on Python >= 3.10. To even import SessionContext, 
I had to replace every ```|``` with ```Union```. Until then, I was getting 
this error:
   ```
   $ python3
   Python 3.9.7 (default, Oct 18 2021, 02:25:46) 
   [Clang 13.0.0 ] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> imp   
   KeyboardInterrupt
   >>> from datafusion import SessionContext
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File 
"/home/spaarsh/gsoc/df-py/datafusion-python/python/datafusion/__init__.py", 
line 48, in <module>
       from .io import read_avro, read_csv, read_json, read_parquet
     File "/home/spaarsh/gsoc/df-py/datafusion-python/python/datafusion/io.py", 
line 31, in <module>
       path: str | pathlib.Path,
   TypeError: unsupported operand type(s) for |: 'type' and 'type'
   ```
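The incompatibility can be shown without datafusion. Below is a minimal sketch: `PathLike`, `describe_path`, and `pipe_union_supported` are hypothetical names made up for illustration, not part of the datafusion API. It uses a `Union` annotation, which evaluates on every Python version, and probes whether `type | type` is accepted at runtime.

```python
import pathlib
import sys
from typing import Optional, Union

# Hypothetical alias standing in for the `str | pathlib.Path` annotation
# seen in the traceback; Union works on Python < 3.10 as well.
PathLike = Union[str, pathlib.Path]


def describe_path(path: PathLike, extension: Optional[str] = None) -> str:
    """Accept a str or Path, optionally swap the suffix, return a string."""
    p = pathlib.Path(path)
    if extension is not None:
        p = p.with_suffix(extension)
    return str(p)


def pipe_union_supported() -> bool:
    """Return True if `type | type` evaluates at runtime on this interpreter."""
    try:
        _ = str | pathlib.Path  # raises TypeError on Python < 3.10
    except TypeError:
        return False
    return True


print(describe_path("data.csv"))
print(pipe_union_supported() == (sys.version_info >= (3, 10)))
```

Another option, which I have not tried on this branch, is to add `from __future__ import annotations` at the top of the module. That defers evaluation of annotations, so `str | pathlib.Path` in a signature is no longer evaluated when the function is defined.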
   After replacing every ```|``` with ```Union```, everything works. Unless I 
have misunderstood something, ```global_ctx``` is already exposed to Python. 
I pulled the branch and tested it, and all four reads succeed:
   ```
   $ python3 test.py 
   DataFrame()
   +----+----------+-----+---------+------------+-----------+
   | id | name     | age | salary  | start_date | is_active |
   +----+----------+-----+---------+------------+-----------+
   | 1  | John Doe | 32  | 75000.5 | 2020-01-15 | true      |
   +----+----------+-----+---------+------------+-----------+
   DataFrame()
   +----+----------+-----+---------+------------+-----------+
   | id | name     | age | salary  | start_date | is_active |
   +----+----------+-----+---------+------------+-----------+
   | 1  | John Doe | 32  | 75000.5 | 2020-01-15 | true      |
   +----+----------+-----+---------+------------+-----------+
   DataFrame()
   +-----+----+-----------+----------+---------+------------+
   | age | id | is_active | name     | salary  | start_date |
   +-----+----+-----------+----------+---------+------------+
   | 32  | 1  | true      | John Doe | 75000.5 | 2020-01-15 |
   +-----+----+-----------+----------+---------+------------+
   DataFrame()
   +----+----------+-----+---------+------------+-----------+
   | id | name     | age | salary  | start_date | is_active |
   +----+----------+-----+---------+------------+-----------+
   | 1  | John Doe | 32  | 75000.5 | 2020-01-15 | true      |
   +----+----------+-----+---------+------------+-----------+
   ```
   For reference, these are the scripts I used to generate the data files and 
test the functions:
   <details>
   ```
   # test.py
   from datafusion import SessionContext
   
   # Create a new session
   ctx = SessionContext()
   
   # Read different file formats
   df1 = ctx.read_csv("data.csv")  # Accepts str or Path
   df2 = ctx.read_parquet("data.parquet")
   df3 = ctx.read_json("data.json")
   df4 = ctx.read_avro("data.avro")
   
   print(df1)
   print(df2)
   print(df3)
   print(df4)
   ```
   
   ```
   # create.py - creates the data files read by test.py
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   import json
   import fastavro
   
   # Sample data as a dictionary (one row, every column a list)
   data = {
       'id': [1],
       'name': ['John Doe'],
       'age': [32],
       'salary': [75000.50],
       'start_date': ['2020-01-15'],
       'is_active': [True]
   }
   
   # Create DataFrame
   df = pd.DataFrame(data)
   
   # Save as CSV (test.py reads data.csv)
   df.to_csv('data.csv', index=False)
   
   # Save as Parquet
   df.to_parquet('data.parquet')
   
   # Save as JSON (line-delimited)
   with open('data.json', 'w') as f:
       for _, row in df.iterrows():
           json.dump(row.to_dict(), f)
           f.write('\n')
   
   # Save as Avro
   schema = {
       'name': 'Employee',
       'type': 'record',
       'fields': [
           {'name': 'id', 'type': 'int'},
           {'name': 'name', 'type': 'string'},
           {'name': 'age', 'type': 'int'},
           {'name': 'salary', 'type': 'double'},
           {'name': 'start_date', 'type': 'string'},
           {'name': 'is_active', 'type': 'boolean'}
       ]
   }
   
   records = df.to_dict('records')
   with open('data.avro', 'wb') as f:
       fastavro.writer(f, schema, records)
   ```
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

