[jira] [Created] (ARROW-1459) [Python] PyArrow fails to load partitioned parquet files with non-primitive types

Jonas Amrich (JIRA) Mon, 04 Sep 2017 07:53:03 -0700

Jonas Amrich created ARROW-1459:
-----------------------------------

             Summary: [Python] PyArrow fails to load partitioned parquet files 
with non-primitive types
                 Key: ARROW-1459
                 URL: https://issues.apache.org/jira/browse/ARROW-1459
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.6.0
            Reporter: Jonas Amrich



When reading partitioned parquet files (tested with those produced by Spark) 
that contain lists, the list column of the resulting table seems to contain 
data loaded from only one partition. Primitive types seem to be loaded correctly.

It can be reproduced using the following code (arrow 0.6.0, spark 2.1.1):


{noformat}
>>> import numpy as np
>>> import pyarrow.parquet as pq
>>> df = spark.createDataFrame(list(zip(np.arange(10).tolist(),
...     np.arange(20).reshape((10,2)).tolist())))
>>> df.toPandas()
   _1        _2
0   0    [0, 1]
1   1    [2, 3]
2   2    [4, 5]
3   3    [6, 7]
4   4    [8, 9]
5   5  [10, 11]
6   6  [12, 13]
7   7  [14, 15]
8   8  [16, 17]
9   9  [18, 19]
>>> df.repartition(2).write.parquet('df_parts.parquet')
>>> pq.read_table('df_parts.parquet').to_pandas()
   _1        _2
0   0    [0, 1]
1   2    [4, 5]
2   4    [8, 9]
3   6  [12, 13]
4   8  [16, 17]
5   1    [0, 1]
6   3    [4, 5]
7   5    [8, 9]
8   7  [12, 13]
9   9  [16, 17]
{noformat}
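
The mismatch in the output above can be checked mechanically. The following is only a small sketch: it copies the rows printed by pq.read_table into a plain Python list (no Spark or PyArrow needed) and compares each list value against the relation used to build the DataFrame, where key k is paired with [2k, 2k + 1]. The names observed, expected_list, mismatched, first_half and second_half are introduced here for illustration and are not part of the original report.

```python
# Rows exactly as printed by pq.read_table('df_parts.parquet').to_pandas() above.
observed = [
    (0, [0, 1]), (2, [4, 5]), (4, [8, 9]), (6, [12, 13]), (8, [16, 17]),
    (1, [0, 1]), (3, [4, 5]), (5, [8, 9]), (7, [12, 13]), (9, [16, 17]),
]


def expected_list(key):
    # The source DataFrame pairs key k with row k of
    # np.arange(20).reshape((10, 2)), i.e. [2k, 2k + 1].
    return [2 * key, 2 * key + 1]


# Rows whose list column does not match the key in the primitive column.
mismatched = [(k, v) for k, v in observed if v != expected_list(k)]

# List values of the first five rows and of the last five rows.
first_half = [v for _, v in observed[:5]]
second_half = [v for _, v in observed[5:]]

print("mismatched rows:", mismatched)
print("second half repeats first half:", first_half == second_half)
```

For the rows shown above, the primitive column _1 covers all ten keys, while the list values in the last five rows are the same as those in the first five, which is consistent with the list column being read from a single partition.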

When the data is loaded using Spark or coalesced into one partition, everything 
works as expected:

{noformat}
>>> spark.read.parquet('df_parts.parquet').toPandas()
   _1        _2
0   1    [2, 3]
1   3    [6, 7]
2   5  [10, 11]
3   7  [14, 15]
4   9  [18, 19]
5   0    [0, 1]
6   2    [4, 5]
7   4    [8, 9]
8   6  [12, 13]
9   8  [16, 17]
>>> df.coalesce(1).write.parquet('df_single.parquet')
>>> pq.read_table('df_single.parquet').to_pandas()
   _1        _2
0   0    [0, 1]
1   1    [2, 3]
2   2    [4, 5]
3   3    [6, 7]
4   4    [8, 9]
5   5  [10, 11]
6   6  [12, 13]
7   7  [14, 15]
8   8  [16, 17]
9   9  [18, 19]
{noformat}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
