[jira] [Commented] (ARROW-7702) [C++][Dataset] Provide (optional) deterministic order of batches

2020-06-19 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140294#comment-17140294
 ] 

Joris Van den Bossche commented on ARROW-7702:
--

Indeed, this was fixed. Thanks for noting.

> [C++][Dataset] Provide (optional) deterministic order of batches
> 
>
> Key: ARROW-7702
> URL: https://issues.apache.org/jira/browse/ARROW-7702
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Joris Van den Bossche
> Assignee: Francois Saint-Jacques
> Priority: Major
> Labels: dataset
> Fix For: 1.0.0
>
>
> Example with Python:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'a': range(12)}) 
> pq.write_table(table, "test_chunks.parquet", chunk_size=3) 
> # reading with dataset
> import pyarrow.dataset as ds
> ds.dataset("test_chunks.parquet").to_table().to_pandas()
> {code}
> gives a non-deterministic result (the order of the row groups in the Parquet file varies between reads):
> {code}
> In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
> Out[25]: 
>      a
> 0    0
> 1    1
> 2    2
> 3    3
> 4    4
> 5    5
> 6    6
> 7    7
> 8    8
> 9    9
> 10  10
> 11  11
> In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
> Out[26]: 
>      a
> 0    0
> 1    1
> 2    2
> 3    3
> 4    8
> 5    9
> 6   10
> 7   11
> 8    4
> 9    5
> 10   6
> 11   7
> {code}
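
For readers following the thread, here is a minimal sketch of why a parallel scan can return row groups in a varying order, and one way to restore file order. It does not use Arrow and is not how the Arrow scanner is implemented. `ROW_GROUPS`, `read_row_group`, `scan_in_completion_order` and `scan_in_file_order` are hypothetical names invented for this illustration, and the random sleep only simulates variable read latency.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for a file with four row groups of three rows each,
# mirroring the 12-row table written with chunk_size=3 in the report.
ROW_GROUPS = [list(range(start, start + 3)) for start in range(0, 12, 3)]


def read_row_group(index):
    """Simulate reading one row group with variable latency."""
    time.sleep(random.uniform(0, 0.01))
    return index, ROW_GROUPS[index]


def scan_in_completion_order():
    """Concatenate row groups as the reads finish; order may differ per run."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(read_row_group, i) for i in range(len(ROW_GROUPS))]
        return [value for f in as_completed(futures) for value in f.result()[1]]


def scan_in_file_order():
    """Read in parallel, then reassemble by row-group index; order is stable."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(read_row_group, i) for i in range(len(ROW_GROUPS))]
        results = [f.result() for f in as_completed(futures)]
    # Each result carries its row-group index, so sorting restores file order.
    results.sort(key=lambda pair: pair[0])
    return [value for _, rows in results for value in rows]


if __name__ == "__main__":
    print(scan_in_completion_order())  # all 12 values, row groups possibly shuffled
    print(scan_in_file_order())        # values in file order
```

Tagging each unit of work with its position and reordering at the end keeps the reads parallel while making the output order independent of completion timing.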



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7702) [C++][Dataset] Provide (optional) deterministic order of batches

2020-06-18 Thread Daniel Nugent (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140027#comment-17140027
 ] 

Daniel Nugent commented on ARROW-7702:
--

[~jorisvandenbossche] Please confirm that this issue is now resolved.

> [C++][Dataset] Provide (optional) deterministic order of batches
> 
>
> Key: ARROW-7702
> URL: https://issues.apache.org/jira/browse/ARROW-7702
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
>


