[ https://issues.apache.org/jira/browse/ARROW-10883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-10883:
------------------------------------------
    Fix Version/s: 11.0.0

> [C++][Dataset] Preserve order when writing dataset
> --------------------------------------------------
>
>                 Key: ARROW-10883
>                 URL: https://issues.apache.org/jira/browse/ARROW-10883
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>             Fix For: 11.0.0
>
>
> Currently, when writing a dataset, e.g. from a table consisting of a set of
> record batches, there is no guarantee that the row order is preserved when
> reading the dataset back.
> Small code example:
> {code}
> In [1]: import pyarrow as pa
>
> In [2]: import pyarrow.dataset as ds
>
> In [3]: table = pa.table({"a": range(10)})
>
> In [4]: table.to_pandas()
> Out[4]:
>    a
> 0  0
> 1  1
> 2  2
> 3  3
> 4  4
> 5  5
> 6  6
> 7  7
> 8  8
> 9  9
>
> In [5]: batches = table.to_batches(max_chunksize=2)
>
> In [6]: ds.write_dataset(batches, "test_dataset_order", format="parquet")
>
> In [7]: ds.dataset("test_dataset_order").to_table().to_pandas()
> Out[7]:
>    a
> 0  4
> 1  5
> 2  8
> 3  9
> 4  6
> 5  7
> 6  2
> 7  3
> 8  0
> 9  1
> {code}
> Although this might seem normal in the SQL world, typical dataframe users (R,
> pandas/dask, etc.) will expect the row order to be preserved.
> Some applications might also rely on this, e.g. with dask you can have a
> sorted index column ("divisions" between the partitions) that would get lost
> this way (note: the dask parquet writer itself doesn't use
> {{pyarrow.dataset.write_dataset}}, so it isn't impacted by this).
> Some discussion about this started in
> https://github.com/apache/arrow/pull/8305 (ARROW-9782), which changed the
> writer to write all fragments to a single file instead of a file per
> fragment.
> I am not fully sure what the best way to solve this is, but IMO at least
> having the _option_ to preserve the order would be good.
> cc [~bkietz]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)