Maarten Breddels created ARROW-9458:
---------------------------------------
Summary: [Python] Dataset singlethreaded only
Key: ARROW-9458
URL: https://issues.apache.org/jira/browse/ARROW-9458
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Maarten Breddels

I'm not sure whether this is a misunderstanding, a compilation issue (flags?), or an issue in the C++ layer.

I have 1000 Parquet files with a total of 1 billion rows (1 million rows per file, ~20 columns). I wanted to see if I could scan through all rows of 1 or 2 columns efficiently (the vaex use case).

{code:python}
import glob

import pyarrow as pa
import pyarrow.dataset

# named `dataset` so it does not shadow the pyarrow.dataset module
dataset = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
scanned = 0
for scan_task in dataset.scan(batch_size=1_000_000, columns=['passenger_count'], use_threads=True):
    for record_batch in scan_task.execute():
        scanned += record_batch.num_rows
scanned
{code}

This only seems to use 1 CPU.

Using a thread pool from Python:

{code:python}
# %%timeit
import concurrent.futures
import glob

import pyarrow as pa
import pyarrow.dataset

pool = concurrent.futures.ThreadPoolExecutor()
dataset = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))

def process(scan_task):
    scan_count = 0
    for record_batch in scan_task.execute():
        scan_count += len(record_batch)
    return scan_count

sum(pool.map(process, dataset.scan(batch_size=1_000_000, columns=['passenger_count'], use_threads=False)))
{code}

This gives me similar performance: again only 100% CPU usage (= 1 core).

py-spy (a profiler for Python) shows no GIL contention, so this might be something at the C++ layer.

Am I 'holding it wrong', or could this be a bug?

Note that IO speed is not a problem on this system (the data all comes from the OS page cache; no disk reads observed).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
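As a sanity check when debugging single-core behavior like this (a hedged suggestion, not part of the original report): Arrow sizes its internal CPU thread pool independently of Python, and that size can be inspected and adjusted via `pa.cpu_count()` / `pa.set_cpu_count()`. If the pool reports 1 thread, scans stay effectively single-threaded regardless of `use_threads=True`.

{code:python}
import os

import pyarrow as pa

# Arrow's internal CPU thread pool size; if this is 1,
# use_threads=True cannot parallelize the scan.
print(pa.cpu_count())

# The pool size can be raised explicitly, e.g. to the machine's core count:
pa.set_cpu_count(os.cpu_count() or 1)
{code}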