[jira] [Commented] (ARROW-7702) [C++][Dataset] Provide (optional) deterministic order of batches

    [ https://issues.apache.org/jira/browse/ARROW-7702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140294#comment-17140294 ]

Joris Van den Bossche commented on ARROW-7702:
----------------------------------------------

Indeed, this was fixed. Thanks for noting

> [C++][Dataset] Provide (optional) deterministic order of batches
> ----------------------------------------------------------------
>
>                 Key: ARROW-7702
>                 URL: https://issues.apache.org/jira/browse/ARROW-7702
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Joris Van den Bossche
>            Assignee: Francois Saint-Jacques
>            Priority: Major
>              Labels: dataset
>             Fix For: 1.0.0
>
>
> Example with python:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'a': range(12)})
> pq.write_table(table, "test_chunks.parquet", chunk_size=3)
> # reading with dataset
> import pyarrow.dataset as ds
> ds.dataset("test_chunks.parquet").to_table().to_pandas()
> {code}
> gives non-deterministic result (order of the row groups in the parquet file):
> {code}
> In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
> Out[25]:
>      a
> 0    0
> 1    1
> 2    2
> 3    3
> 4    4
> 5    5
> 6    6
> 7    7
> 8    8
> 9    9
> 10  10
> 11  11
> In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
> Out[26]:
>      a
> 0    0
> 1    1
> 2    2
> 3    3
> 4    8
> 5    9
> 6   10
> 7   11
> 8    4
> 9    5
> 10   6
> 11   7
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARROW-7702) [C++][Dataset] Provide (optional) deterministic order of batches

    [ https://issues.apache.org/jira/browse/ARROW-7702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140027#comment-17140027 ]

Daniel Nugent commented on ARROW-7702:
--------------------------------------

[~jorisvandenbossche] Please confirm that issue is now resolved.

> [C++][Dataset] Provide (optional) deterministic order of batches
> ----------------------------------------------------------------
>
>                 Key: ARROW-7702
>                 URL: https://issues.apache.org/jira/browse/ARROW-7702
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>
> Example with python:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'a': range(12)})
> pq.write_table(table, "test_chunks.parquet", chunk_size=3)
> # reading with dataset
> import pyarrow.dataset as ds
> ds.dataset("test_chunks.parquet").to_table().to_pandas()
> {code}
> gives non-deterministic result (order of the row groups in the parquet file):
> {code}
> In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
> Out[25]:
>      a
> 0    0
> 1    1
> 2    2
> 3    3
> 4    4
> 5    5
> 6    6
> 7    7
> 8    8
> 9    9
> 10  10
> 11  11
> In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
> Out[26]:
>      a
> 0    0
> 1    1
> 2    2
> 3    3
> 4    8
> 5    9
> 6   10
> 7   11
> 8    4
> 9    5
> 10   6
> 11   7
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)