gitzwz opened a new issue, #1479:
URL: https://github.com/apache/iceberg-python/issues/1479
### Question
I encountered a problem with `table.scan(...).plan_files()`: there is no
noticeable time difference between single-threaded and multi-threaded
execution, and the total time is directly proportional to the number of
manifest entries. The table I used for testing has 6 manifest files, each
containing around 70,000 entries. The most time-consuming step is
`_open_manifest` inside `DataScan.plan_files()`, and it performs the same
whether or not a thread pool is used. Could someone help me investigate
whether there might be an issue?
Here is my test code:
```python
from pyiceberg.catalog import load_catalog
from pyiceberg import expressions as pyi_expr
import time

catalog = load_catalog("default")
table = catalog.load_table("b_ods.pyiceberg_test2")

def scan_plan_files(key, values):
    # Plan the scan with an In filter and a limit, then list the matched files.
    row_filter = pyi_expr.In(key, values)
    files = table.scan(
        row_filter=row_filter,
        limit=1000,
    ).plan_files()
    print(f"total plans {len(files)}")
    for file in files:
        print(file.file.file_path)

start_time = time.perf_counter()
scan_plan_files("cid", {"844"})
print(f"Time consumed: {time.perf_counter() - start_time:.3f} seconds")
```
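To make the timing breakdown easier to reproduce, here is a minimal profiling sketch using the standard-library `cProfile` around the same call (a simpler stand-in for `line_profiler`); the cumulative-time column should show whether `_open_manifest` dominates:

```python
import cProfile
import pstats

# Profile the same scan call; sorting by cumulative time means
# plan_files() and _open_manifest should appear near the top if
# they account for most of the runtime.
profiler = cProfile.Profile()
profiler.enable()
scan_plan_files("cid", {"844"})
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```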
I also modified the `~/.pyiceberg.yaml` file, changing `max-workers: 1` to
`max-workers: 32`, but the total time is still around 64 seconds, with little
to no change.
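To rule out the setting simply not being picked up, this sketch checks the worker count pyiceberg resolves. It assumes the internal `ExecutorFactory` in `pyiceberg.utils.concurrent`, which `plan_files()` uses to build its thread pool; since this is internal API, it may change between versions:

```python
from pyiceberg.utils.concurrent import ExecutorFactory

# max_workers() reads the max-workers setting from ~/.pyiceberg.yaml
# (or the PYICEBERG_MAX_WORKERS environment variable). If this prints
# 32 the setting is being read; None falls back to the
# ThreadPoolExecutor default.
print(ExecutorFactory.max_workers())

# get_or_create() returns the shared executor that plan_files()
# submits its _open_manifest tasks to.
executor = ExecutorFactory.get_or_create()
print(type(executor).__name__)
```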