Vladyslav Shamaida created ARROW-6057:
-----------------------------------------
Summary: Parquet files v2.0 created by spark can't be read by
pyarrow
Key: ARROW-6057
URL: https://issues.apache.org/jira/browse/ARROW-6057
Project: Apache Arrow
Issue Type: Bug
Reporter: Vladyslav Shamaida
PyArrow uses the footer metadata to determine the format version of a Parquet file,
while the parquet-mr library (which is used by Spark) determines the version at the page
level from the page header type. Moreover, in ParquetFileWriter, parquet-mr hardcodes
the version in the footer to '1'. See:
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L913]
Thus, Spark can read the files it writes, and pyarrow can read the files it
writes, but when pyarrow tries to read a version 2.0 file written by Spark, it
throws an error about a malformed file (because it thinks that the format
version is 1.0).
Depending on the compression method, the error is one of:
- _Corrupt snappy compressed data_
- _GZipCodec failed: incorrect header check_
- _ArrowIOError: Unknown encoding type_
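
The mismatch described above can be illustrated with a toy model. This is only a sketch, not pyarrow or parquet-mr code: the dict-based "file", the helper names (write_page, read_by_footer, read_by_page_header) and the use of zlib are all assumptions made for illustration. It relies on one property of the Parquet format: a v1 data page compresses levels and values together, while a v2 data page stores the levels uncompressed in front of the compressed values. A reader that trusts a footer version of 1 therefore hands uncompressed level bytes to the decompressor, which is one plausible way for a "corrupt data" or "incorrect header check" error to appear.

```python
import zlib

# Toy page layout (hypothetical, for illustration only):
#   DATA_PAGE    (v1): levels + values compressed together
#   DATA_PAGE_V2 (v2): raw levels, followed by compressed values


def write_page(levels: bytes, values: bytes, page_type: str) -> dict:
    if page_type == "DATA_PAGE":
        body = zlib.compress(levels + values)
    elif page_type == "DATA_PAGE_V2":
        body = levels + zlib.compress(values)
    else:
        raise ValueError(f"unknown page type: {page_type}")
    return {"type": page_type, "levels_len": len(levels), "body": body}


def decode_v1(page: dict) -> tuple:
    # v1: decompress the whole body, then split levels from values.
    raw = zlib.decompress(page["body"])
    n = page["levels_len"]
    return raw[:n], raw[n:]


def decode_v2(page: dict) -> tuple:
    # v2: levels are read as-is, only the values section is decompressed.
    n = page["levels_len"]
    return page["body"][:n], zlib.decompress(page["body"][n:])


def read_by_footer(file: dict) -> list:
    # Strategy described for pyarrow in this report: trust the footer version.
    decode = decode_v1 if file["footer_version"] == 1 else decode_v2
    return [decode(p) for p in file["pages"]]


def read_by_page_header(file: dict) -> list:
    # Strategy described for parquet-mr: dispatch on each page header type.
    decoders = {"DATA_PAGE": decode_v1, "DATA_PAGE_V2": decode_v2}
    return [decoders[p["type"]](p) for p in file["pages"]]


# A file with v2 pages whose footer nevertheless claims version 1.
levels = b"\x03\x00\x00\x00"
values = b"hello parquet"
spark_like_file = {
    "footer_version": 1,
    "pages": [write_page(levels, values, "DATA_PAGE_V2")],
}

print(read_by_page_header(spark_like_file))
try:
    read_by_footer(spark_like_file)
except zlib.error as exc:
    print("footer-based reader failed:", exc)
```

In this model the page-header reader recovers the data, while the footer-based reader fails inside the decompressor. That mirrors the reported behaviour, where the exact message depends on the codec in use.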