[jira] [Commented] (ARROW-1459) [Python] PyArrow fails to load partitioned parquet files with non-primitive types

Jonas Amrich (JIRA) Tue, 05 Sep 2017 00:44:20 -0700

    [ 
https://issues.apache.org/jira/browse/ARROW-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153232#comment-16153232
 ]


Jonas Amrich commented on ARROW-1459:
-------------------------------------

I'll try to look deeper into this. However, I'm not very familiar with Arrow's 
internals, so I expect it will take some time.

> [Python] PyArrow fails to load partitioned parquet files with non-primitive 
> types
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-1459
>                 URL: https://issues.apache.org/jira/browse/ARROW-1459
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.6.0
>            Reporter: Jonas Amrich
>             Fix For: 0.7.0
>
>
> When reading partitioned parquet files (tested with those produced by Spark) 
> that contain lists, the resulting table seems to contain data loaded only 
> from one partition. Primitive types seem to be loaded correctly.
> It can be reproduced using the following code (arrow 0.6.0, spark 2.1.1):
> {noformat}
> >>> import numpy as np
> >>> import pyarrow.parquet as pq
> >>> df = spark.createDataFrame(list(zip(np.arange(10).tolist(),
> ...     np.arange(20).reshape((10, 2)).tolist())))
> >>> df.toPandas()
>    _1        _2
> 0   0    [0, 1]
> 1   1    [2, 3]
> 2   2    [4, 5]
> 3   3    [6, 7]
> 4   4    [8, 9]
> 5   5  [10, 11]
> 6   6  [12, 13]
> 7   7  [14, 15]
> 8   8  [16, 17]
> 9   9  [18, 19]
> >>> df.repartition(2).write.parquet('df_parts.parquet')
> >>> pq.read_table('df_parts.parquet').to_pandas()
>    _1        _2
> 0   0    [0, 1]
> 1   2    [4, 5]
> 2   4    [8, 9]
> 3   6  [12, 13]
> 4   8  [16, 17]
> 5   1    [0, 1]
> 6   3    [4, 5]
> 7   5    [8, 9]
> 8   7  [12, 13]
> 9   9  [16, 17]
> {noformat}
> When the data is loaded using Spark or coalesced into one partition, 
> everything works as expected:
> {noformat}
> >>> spark.read.parquet('df_parts.parquet').toPandas()
>    _1        _2
> 0   1    [2, 3]
> 1   3    [6, 7]
> 2   5  [10, 11]
> 3   7  [14, 15]
> 4   9  [18, 19]
> 5   0    [0, 1]
> 6   2    [4, 5]
> 7   4    [8, 9]
> 8   6  [12, 13]
> 9   8  [16, 17]
> >>> df.coalesce(1).write.parquet('df_single.parquet')
> >>> pq.read_table('df_single.parquet').to_pandas()
>    _1        _2
> 0   0    [0, 1]
> 1   1    [2, 3]
> 2   2    [4, 5]
> 3   3    [6, 7]
> 4   4    [8, 9]
> 5   5  [10, 11]
> 6   6  [12, 13]
> 7   7  [14, 15]
> 8   8  [16, 17]
> 9   9  [18, 19]
> {noformat}
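
The symptom in the report can be checked mechanically. The following is a minimal sketch in plain Python, not part of the original report: it copies the rows printed by pq.read_table for the partitioned file and compares each list against the value the original DataFrame holds for that key (row k holds [2k, 2k+1]). The names pyarrow_rows, expected_list and mismatched_rows are hypothetical helpers introduced only for illustration.

```python
# Rows copied from the pq.read_table('df_parts.parquet') output quoted above,
# as (_1, _2) pairs in the order they were printed.
pyarrow_rows = [
    (0, [0, 1]), (2, [4, 5]), (4, [8, 9]), (6, [12, 13]), (8, [16, 17]),
    (1, [0, 1]), (3, [4, 5]), (5, [8, 9]), (7, [12, 13]), (9, [16, 17]),
]


def expected_list(key):
    """List value the original DataFrame holds for a given _1 key."""
    return [2 * key, 2 * key + 1]


def mismatched_rows(rows):
    """Return the rows whose list column does not match the original data."""
    return [(key, value) for key, value in rows if value != expected_list(key)]


bad = mismatched_rows(pyarrow_rows)
distinct_lists = {tuple(value) for _, value in pyarrow_rows}

print("mismatched keys:", [key for key, _ in bad])
print("distinct list values:", len(distinct_lists))
```

On the rows above, the mismatched keys are 1, 3, 5, 7 and 9, and only five distinct list values appear among ten rows. That matches the description: the primitive column is read from both partitions, while the list column repeats the values of a single partition.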



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
