A. Coady created ARROW-15318:
--------------------------------

             Summary: [C++][Python] Regression reading partition keys of large batches.
                 Key: ARROW-15318
                 URL: https://issues.apache.org/jira/browse/ARROW-15318
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 7.0.0
            Reporter: A. Coady

In a partitioned dataset with chunks larger than the default 1Mi-row batch size (2 ** 20 rows), reading _only_ the partition keys hangs and consumes unbounded memory. The bug first appeared in nightly build `7.0.0.dev468`.

{code:python}
In [1]: import pyarrow as pa, pyarrow.parquet as pq, numpy as np

In [2]: pa.__version__
Out[2]: '7.0.0.dev468'

In [3]: table = pa.table({'key': pa.repeat(0, 2 ** 20 + 1), 'value': np.arange(2 ** 20 + 1)})

In [4]: pq.write_to_dataset(table[:2 ** 20], 'one', partition_cols=['key'])

In [5]: pq.write_to_dataset(table[:2 ** 20 + 1], 'two', partition_cols=['key'])

In [6]: pq.read_table('one', columns=['key'])['key'].num_chunks
Out[6]: 1

In [7]: pq.read_table('two', columns=['key', 'value'])['key'].num_chunks
Out[7]: 2

In [8]: pq.read_table('two', columns=['key'])['key'].num_chunks
zsh: killed     ipython  # hangs; killed
{code}
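
Because the failing read hangs rather than raising, it may help to run the reproduction in a child process with a timeout. The following is a minimal sketch, not part of the original report: `run_with_timeout` and `REPRO` are hypothetical names, the snippet writes to a temporary directory instead of `'two'`, and the timeout bounds only wall-clock time, not memory.

```python
import subprocess
import sys

# Condensed form of the reported session (assumption: same table shape,
# but written to a temporary directory rather than './two').
REPRO = """
import tempfile
import numpy as np, pyarrow as pa, pyarrow.parquet as pq

n = 2 ** 20 + 1  # one row more than the 2 ** 20 boundary used in the report
table = pa.table({'key': pa.repeat(0, n), 'value': np.arange(n)})
with tempfile.TemporaryDirectory() as root:
    pq.write_to_dataset(table, root, partition_cols=['key'])
    print(pq.read_table(root, columns=['key'])['key'].num_chunks)
"""


def run_with_timeout(code, timeout):
    """Run `code` in a fresh interpreter.

    Returns the CompletedProcess, or None if the child did not finish
    within `timeout` seconds (the child is killed in that case).
    """
    try:
        return subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
```

Calling `run_with_timeout(REPRO, timeout=60)` would then return `None` on an affected build if the read hangs as described, and otherwise a result whose `stdout` holds the chunk count (or whose `stderr` holds a traceback if the snippet failed for another reason).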