[ 
https://issues.apache.org/jira/browse/ARROW-7702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7702:
-----------------------------------------
    Description: 
Example with Python:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': range(12)}) 
pq.write_table(table, "test_chunks.parquet", chunk_size=3) 

# reading with dataset
import pyarrow.dataset as ds
ds.dataset("test_chunks.parquet").to_table().to_pandas()
{code}

gives a non-deterministic result (the order of the row groups from the Parquet file changes between calls):

{code}
In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[25]:
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11

In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[26]:
     a
0    0
1    1
2    2
3    3
4    8
5    9
6   10
7   11
8    4
9    5
10   6
11   7

{code}

  was:
Example with python:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': range(12)}) 
pq.write_table(table, "test_chunks.parquet", chunk_size=3) 

# reading with dataset
import pyarrow.dataset as ds
ds.dataset("test_chunks.parquet").to_table().to_pandas()
{code}

gives non-deterministic result (order of the row groups in the parquet file):

```
In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[25]:
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11

In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[26]:
     a
0    0
1    1
2    2
3    3
4    8
5    9
6   10
7   11
8    4
9    5
10   6
11   7

```


> [C++][Dataset] Provide (optional) deterministic order of batches
> ----------------------------------------------------------------
>
>                 Key: ARROW-7702
>                 URL: https://issues.apache.org/jira/browse/ARROW-7702
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++ - Dataset, Python
>            Reporter: Joris Van den Bossche
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
