Jarno Seppanen created ARROW-1357:
-------------------------------------
Summary: Data corruption in reading multi-file parquet dataset
Key: ARROW-1357
URL: https://issues.apache.org/jira/browse/ARROW-1357
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.5.0
Environment: python 3.5.3
Reporter: Jarno Seppanen
I generated a Parquet dataset in Spark that consists of two files. When I read
both files at once by passing the directory to pyarrow.parquet.read_table,
PyArrow corrupts the data of the second file. Reading the second file on its
own does not show the problem.
$ ls -l data
total 28608
-rw-rw-r-- 1 jarno jarno 14651449 Aug 15 09:30 part-00000-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet
-rw-rw-r-- 1 jarno jarno 14636502 Aug 15 09:30 part-00001-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet
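import pyarrow.parquet as pq
# Read both files at once by pointing read_table at the directory.
tab1 = pq.read_table('data')
df1 = tab1.to_pandas()
# Values returned for one account via the directory read (corrupted):
df1[df1.account_id == 38658373328].legal_labels.tolist()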
# [array([ 2, 3, 5, 8, 10, 11, 13, 14, 17, 18, 19, 21, 22, 31, 60, 61, 63,
# 64, 65, 66, 69, 70, 74, 75, 77, 82, 0, 1, 2, 3, 5, 8, 10, 11,
# 13, 14, 17, 18, 19, 21, 22])]
# Read only the second file and look up the same account.
tab2 = pq.read_table('data/part-00001-fd2ff92f-201b-4ffc-b0b4-275e06a7fa02.snappy.parquet')
df2 = tab2.to_pandas()
df2[df2.account_id == 38658373328].legal_labels.tolist()
# [array([ 0, 1, 2, 3, 5, 8, 10, 11, 13, 14, 17, 18, 19, 21, 22, 24, 28,
# 30, 31, 36, 38, 39, 40, 41, 43, 49, 60, 61, 62, 63, 64, 65, 66, 67,
# 69, 70, 74, 75, 77, 82, 90])]
Unfortunately I cannot share the data files, and I was not able to create a
dummy pair of data files that triggers the bug. I am sending this bug report
in the hope that it is still useful without a minimal repro example.
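In the absence of a repro, the comparison above could be extended from one account to every row. The following is only a minimal sketch under stated assumptions: find_mismatches is a hypothetical helper (not part of pyarrow), it assumes each key such as account_id appears at most once per side, and the sample values are illustrative rather than taken from the real data.

```python
def find_mismatches(reference, candidate):
    """Return {key: (reference_values, candidate_values)} for differing rows.

    Both arguments are iterables of (key, sequence) pairs. Keys are assumed
    to be unique within each side. Keys that appear only in `candidate`
    (for example rows that came from the other file) are ignored.
    """
    ref = {key: list(values) for key, values in reference}
    mismatches = {}
    for key, values in candidate:
        values = list(values)
        if key in ref and ref[key] != values:
            mismatches[key] = (ref[key], values)
    return mismatches


# Illustrative values only, not the real data.
single_file = [(1, [0, 1, 2]), (2, [5, 8])]
directory = [(1, [2, 0, 1]), (2, [5, 8]), (3, [7])]
print(find_mismatches(single_file, directory))
```

With the frames from the snippet above, the inputs would be zip(df2.account_id, df2.legal_labels) as the reference and zip(df1.account_id, df1.legal_labels) as the candidate, so that every row of the second file is checked against the directory read.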