gitzwz commented on issue #1479:
URL:
https://github.com/apache/iceberg-python/issues/1479#issuecomment-2566291942
Here is my test code:
```python
from pyiceberg.catalog import load_catalog
from pyspark.sql import SparkSession
from pyiceberg import expressions as pyi_expr
import time
from line_profiler import LineProfiler

catalog = load_catalog("default")
table = catalog.load_table("b_ods.pyiceberg_test2")

def scan_plan_files(key, values):
    row_filter = pyi_expr.In(key, values)
    files = table.scan(
        row_filter=row_filter,
        limit=1000,
    ).plan_files()
    print(f"total plans {len(files)}")
    for file in files:
        print(file.file.file_path)

start_time = time.perf_counter()
scan_plan_files("cid", {"844"})
print(f"Time consumed: {time.perf_counter() - start_time:.3f} seconds")
```
I also modified the ~/.pyiceberg.yaml file, changing *max-workers: 1* to
*max-workers: 32*, but the total time is still around 64 seconds, with little
to no improvement.
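
For reference, this is roughly what the edited ~/.pyiceberg.yaml looks like. The catalog details below are placeholders, not my actual setup; `max-workers` is the top-level setting that (as I understand it) sizes PyIceberg's thread pool for concurrent work such as reading manifests during scan planning:

```yaml
# ~/.pyiceberg.yaml (sketch; catalog settings are placeholders)

# Size of PyIceberg's worker thread pool
max-workers: 32

catalog:
  default:
    # replace with your own catalog URI and credentials
    uri: http://localhost:8181
```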
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]