[ https://issues.apache.org/jira/browse/ARROW-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162391#comment-16162391 ]

Wes McKinney commented on ARROW-1459:
-------------------------------------

PR: https://github.com/apache/arrow/pull/1090

> [Python] PyArrow fails to load partitioned parquet files with non-primitive types
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-1459
>                 URL: https://issues.apache.org/jira/browse/ARROW-1459
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.6.0
>            Reporter: Jonas Amrich
>            Assignee: Wes McKinney
>             Fix For: 0.7.0
>
>
> When reading partitioned Parquet files that contain lists (tested with files
> produced by Spark), the list columns of the resulting table seem to contain
> data from only one partition. Primitive types seem to be loaded correctly.
> It can be reproduced using the following code (arrow 0.6.0, spark 2.1.1):
> {noformat}
> >>> import numpy as np
> >>> import pyarrow.parquet as pq
> >>> df = spark.createDataFrame(list(zip(np.arange(10).tolist(),
> ...                                     np.arange(20).reshape((10,2)).tolist())))
> >>> df.toPandas()
>    _1        _2
> 0   0    [0, 1]
> 1   1    [2, 3]
> 2   2    [4, 5]
> 3   3    [6, 7]
> 4   4    [8, 9]
> 5   5  [10, 11]
> 6   6  [12, 13]
> 7   7  [14, 15]
> 8   8  [16, 17]
> 9   9  [18, 19]
> >>> df.repartition(2).write.parquet('df_parts.parquet')
> >>> pq.read_table('df_parts.parquet').to_pandas()
>    _1        _2
> 0   0    [0, 1]
> 1   2    [4, 5]
> 2   4    [8, 9]
> 3   6  [12, 13]
> 4   8  [16, 17]
> 5   1    [0, 1]
> 6   3    [4, 5]
> 7   5    [8, 9]
> 8   7  [12, 13]
> 9   9  [16, 17]
> {noformat}
> When the data is loaded using Spark or coalesced into one partition,
> everything works as expected:
> {noformat}
> >>> spark.read.parquet('df_parts.parquet').toPandas()
>    _1        _2
> 0   1    [2, 3]
> 1   3    [6, 7]
> 2   5  [10, 11]
> 3   7  [14, 15]
> 4   9  [18, 19]
> 5   0    [0, 1]
> 6   2    [4, 5]
> 7   4    [8, 9]
> 8   6  [12, 13]
> 9   8  [16, 17]
> >>> df.coalesce(1).write.parquet('df_single.parquet')
> >>> pq.read_table('df_single.parquet').to_pandas()
>    _1        _2
> 0   0    [0, 1]
> 1   1    [2, 3]
> 2   2    [4, 5]
> 3   3    [6, 7]
> 4   4    [8, 9]
> 5   5  [10, 11]
> 6   6  [12, 13]
> 7   7  [14, 15]
> 8   8  [16, 17]
> 9   9  [18, 19]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
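
The symptom in the report can be checked mechanically. The sketch below is not part of the original report or of the fix in the linked PR. It only relies on the fact that, in the reproduction data, the row with key k holds the list [2k, 2k + 1]. The row data is copied by hand from the outputs shown in the report, and the helper names (expected_list, mismatched_keys) are hypothetical, introduced here for illustration.

```python
# Rows copied from the pq.read_table('df_parts.parquet') output in the report,
# as (_1, _2) pairs.
pyarrow_rows = [
    (0, [0, 1]), (2, [4, 5]), (4, [8, 9]), (6, [12, 13]), (8, [16, 17]),
    (1, [0, 1]), (3, [4, 5]), (5, [8, 9]), (7, [12, 13]), (9, [16, 17]),
]

# Rows copied from the spark.read.parquet('df_parts.parquet') output.
spark_rows = [
    (1, [2, 3]), (3, [6, 7]), (5, [10, 11]), (7, [14, 15]), (9, [18, 19]),
    (0, [0, 1]), (2, [4, 5]), (4, [8, 9]), (6, [12, 13]), (8, [16, 17]),
]


def expected_list(key):
    """Return the list the reproduction data pairs with `key`: [2k, 2k + 1]."""
    return [2 * key, 2 * key + 1]


def mismatched_keys(rows):
    """Return the sorted keys whose list value differs from the expected one."""
    return sorted(key for key, values in rows if values != expected_list(key))


print(mismatched_keys(pyarrow_rows))  # [1, 3, 5, 7, 9]
print(mismatched_keys(spark_rows))    # []
```

With the PyArrow output, every key from the second partition (1, 3, 5, 7, 9) is paired with a list that belongs to the first partition, while the primitive column _1 is intact. That matches the description that only the list column is affected.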