[jira] [Commented] (ARROW-9063) [Python][C++] Order of files are not respected using the new pyarrow.dataset

Joris Van den Bossche (Jira) Thu, 11 Jun 2020 05:06:15 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-9063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133203#comment-17133203
 ]


Joris Van den Bossche commented on ARROW-9063:
----------------------------------------------

[~brillliantz] thanks for the report.

ARROW-8447 should have fixed this, I think. That patch is not included in 
0.17 and will only be in the upcoming 1.0 release. At least with that patch, 
the data should always come back in the same order, and row groups of 
different files should no longer be interleaved.


> [Python][C++] Order of files are not respected using the new pyarrow.dataset
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-9063
>                 URL: https://issues.apache.org/jira/browse/ARROW-9063
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.17.1
>         Environment: ubuntu-18.04
>            Reporter: William Liu
>            Priority: Critical
>              Labels: bug, dataset
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> Say we have multiple parquet files under the same folder (a.parquet, 
> b.parquet, c.parquet). If I pass a list of file paths into either of the two 
> statements below
> {code:java}
> import pyarrow.dataset
> import pyarrow.parquet as pq
> ds = pq.ParquetDataset(fps, use_legacy_dataset=False)
> ds = pyarrow.dataset.dataset(fps){code}
> Then the rows of the resulting table are interleaved across files:
> aaaa...bbbb...aaa...bbbb...aaa...ccc..bbb...cccc
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
