Joris Van den Bossche created ARROW-7702:
--------------------------------------------

             Summary: [C++][Dataset] Provide (optional) deterministic order of batches
                 Key: ARROW-7702
                 URL: https://issues.apache.org/jira/browse/ARROW-7702
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++ - Dataset, Python
            Reporter: Joris Van den Bossche


Example with Python:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': range(12)})
pq.write_table(table, "test_chunks.parquet", chunk_size=3)

# reading with dataset
import pyarrow.dataset as ds
ds.dataset("test_chunks.parquet").to_table().to_pandas()
{code}

gives a non-deterministic result (the order of the row groups in the Parquet file varies between calls):

```
In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[25]:
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11

In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[26]:
     a
0    0
1    1
2    2
3    3
4    8
5    9
6   10
7   11
8    4
9    5
10   6
11   7
```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)