[I] Performance question for to_arrow, to_pandas, to_duckdb [iceberg-python]

via GitHub Fri, 09 Aug 2024 09:23:57 -0700


jkleinkauff opened a new issue, #1032:
URL: https://github.com/apache/iceberg-python/issues/1032


   ### Question
   
   Hey, thanks for this very convenient library.
   
   This is not a bug; I just want to better understand something.
   
   I have a question about the performance (i.e., the time it takes to query the table) of these methods.
   
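   ```python
   from pyiceberg.catalog.sql import SqlCatalog

   if __name__ == "__main__":
       # SQL-backed catalog stored in a local Postgres instance.
       catalog = SqlCatalog(
           "default",
           **{
               "uri": "postgresql+psycopg2://postgres:Password1@localhost/postgres",
           },
       )
       table = catalog.load_table("bronze.curitiba_starts_june")
       # Build a scan capped at 100 rows; nothing is read until to_arrow().
       df = table.scan(limit=100)
       pa_table = df.to_arrow()
   ```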
   The code above runs fine. My question is about the last call: `to_arrow()` takes around 50 s to execute. I believe this is mostly due to the network itself?
   The execution time stays roughly the same with different row limits (10, 100, 1000).
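
To make the comparison across row limits reproducible, here is a minimal timing sketch. The helper name `time_scan_limits` and its `run_scan` parameter are hypothetical, introduced only for illustration; the stand-in workload in the demo is there so the snippet runs without a catalog.

```python
import time
from typing import Callable, Dict, Iterable


def time_scan_limits(
    run_scan: Callable[[int], object],
    limits: Iterable[int] = (10, 100, 1000),
) -> Dict[int, float]:
    """Return the wall-clock seconds taken by run_scan(limit) for each limit."""
    timings: Dict[int, float] = {}
    for limit in limits:
        start = time.perf_counter()
        run_scan(limit)  # hypothetical callable; see the usage note below
        timings[limit] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Stand-in workload (an assumption) so the sketch runs on its own.
    demo = time_scan_limits(lambda n: sum(range(n)))
    for limit, seconds in demo.items():
        print(f"limit={limit}: {seconds:.6f}s")
```

With the `table` object from the snippet above, the callable would be `lambda n: table.scan(limit=n).to_arrow()`.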
   
   
   Querying the same table in MotherDuck, using `iceberg_scan`, is faster:
   <img width="836" alt="image" src="https://github.com/user-attachments/assets/21a05d45-ebcd-4323-ba31-2689d2d12fe7">
   
   When running the same query locally, without MotherDuck, the execution time is similar to what pyiceberg takes; actually, it is a little slower. That's why I think this is mostly a network "issue". Can you help me understand what's happening? Thank you!
   
   #### Table Data
   The table has two Parquet files (110 MB and 127 MB).
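
As a rough sanity check on the network hypothesis, the implied throughput can be worked out from the numbers above. This assumes that both files are downloaded in full regardless of the row limit, which is an assumption and not something confirmed here.

```python
# File sizes and elapsed time as reported in this issue.
file_sizes_mb = [110, 127]
elapsed_s = 50

# Assumption: every byte of both files crosses the network on each scan.
total_mb = sum(file_sizes_mb)
throughput_mb_s = total_mb / elapsed_s
throughput_mbit_s = throughput_mb_s * 8

print(
    f"{total_mb} MB in {elapsed_s} s = {throughput_mb_s:.2f} MB/s "
    f"(about {throughput_mbit_s:.1f} Mbit/s)"
)
# prints: 237 MB in 50 s = 4.74 MB/s (about 37.9 Mbit/s)
```

If the measured bandwidth to the storage location is in that range, a full download of both files would account for most of the 50 s.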
   


