[ https://issues.apache.org/jira/browse/ARROW-7702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-7702:
-----------------------------------------
Description:
Example with Python:
{code}
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({'a': range(12)})
pq.write_table(table, "test_chunks.parquet", chunk_size=3)
# reading with dataset
import pyarrow.dataset as ds
ds.dataset("test_chunks.parquet").to_table().to_pandas()
{code}
gives a non-deterministic result (the order of the row groups from the Parquet file varies between runs):
{code}
In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[25]:
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[26]:
a
0 0
1 1
2 2
3 3
4 8
5 9
6 10
7 11
8 4
9 5
10 6
11 7
{code}
was:
Example with Python:
{code}
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({'a': range(12)})
pq.write_table(table, "test_chunks.parquet", chunk_size=3)
# reading with dataset
import pyarrow.dataset as ds
ds.dataset("test_chunks.parquet").to_table().to_pandas()
{code}
gives a non-deterministic result (the order of the row groups from the Parquet file varies between runs):
```
In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[25]:
a
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[26]:
a
0 0
1 1
2 2
3 3
4 8
5 9
6 10
7 11
8 4
9 5
10 6
11 7
```
> [C++][Dataset] Provide (optional) deterministic order of batches
> ----------------------------------------------------------------
>
> Key: ARROW-7702
> URL: https://issues.apache.org/jira/browse/ARROW-7702
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++ - Dataset, Python
> Reporter: Joris Van den Bossche
> Priority: Major
>
> Example with Python:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'a': range(12)})
> pq.write_table(table, "test_chunks.parquet", chunk_size=3)
> # reading with dataset
> import pyarrow.dataset as ds
> ds.dataset("test_chunks.parquet").to_table().to_pandas()
> {code}
> gives a non-deterministic result (the order of the row groups from the Parquet file varies between runs):
> {code}
> In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
> Out[25]:
> a
> 0 0
> 1 1
> 2 2
> 3 3
> 4 4
> 5 5
> 6 6
> 7 7
> 8 8
> 9 9
> 10 10
> 11 11
> In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
> Out[26]:
> a
> 0 0
> 1 1
> 2 2
> 3 3
> 4 8
> 5 9
> 6 10
> 7 11
> 8 4
> 9 5
> 10 6
> 11 7
> {code}
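
To illustrate why the batch order can vary, and what an optional deterministic order could look like, here is a minimal stdlib-only sketch. It does not use pyarrow and is not how the Arrow C++ scanner is implemented; `ROW_GROUPS`, `read_row_group` and `scan` are hypothetical names made up for this example. The assumption is only that row groups are read in parallel and collected as they complete, and that tagging each batch with its row-group index is enough to restore file order.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for the Parquet file above: 12 values in row groups of 3.
ROW_GROUPS = [list(range(i, i + 3)) for i in range(0, 12, 3)]


def read_row_group(index):
    """Pretend to read one row group; the random sleep mimics variable I/O latency."""
    time.sleep(random.uniform(0, 0.01))
    return index, ROW_GROUPS[index]


def scan(ordered):
    """Read all row groups in parallel and return the concatenated values."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(read_row_group, i) for i in range(len(ROW_GROUPS))]
        # as_completed yields batches in completion order, which can vary between runs.
        batches = [future.result() for future in as_completed(futures)]
    if ordered:
        # Restore file order using the row-group index carried with each batch.
        batches.sort(key=lambda batch: batch[0])
    return [value for _, values in batches for value in values]


unordered = scan(ordered=False)  # same values, but the order may differ per run
in_order = scan(ordered=True)    # values in file order
```

The unordered scan always returns the same set of values, only the grouping order changes, which matches the behaviour shown in the output above. Sorting on the index costs little here, but a real scanner would presumably have to buffer out-of-order batches, which is one reason to keep the ordering optional.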