Joris Van den Bossche created ARROW-10883: ---------------------------------------------
Summary: [C++][Dataset] Preserve order when writing dataset Key: ARROW-10883 URL: https://issues.apache.org/jira/browse/ARROW-10883 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Currently, when writing a dataset, e.g. from a table consisting of a set of record batches, there is no guarantee that the row order is preserved when reading the dataset. Small code example: {code} In [1]: import pyarrow.dataset as ds In [2]: table = pa.table({"a": range(10)}) In [3]: table.to_pandas() Out[3]: a 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 In [4]: batches = table.to_batches(max_chunksize=2) In [5]: ds.write_dataset(batches, "test_dataset_order", format="parquet") In [6]: ds.dataset("test_dataset_order").to_table().to_pandas() Out[6]: a 0 4 1 5 2 8 3 9 4 6 5 7 6 2 7 3 8 0 9 1 {code} Although this might seem normal in SQL world, typical dataframe users (R, pandas/dask, etc) will expect a preserved row order. Some applications might also rely on this, eg with dask you can have a sorted index column ("divisions" between the partitions) that would get lost this way (note, the dask parquet writer itself doesn't use {{pyarrow.dataset.write_dataset}} so isn't impacted by this.) Some discussion about this started in https://github.com/apache/arrow/pull/8305 (ARROW-9782), which changed to write all fragments to a single file instead of a file per fragment. I am not fully sure what the best way to solve this, but IMO at least having the _option_ to preserve the order would be good. cc [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)