Eunsoo Roh created PARQUET-68:
---------------------------------
Summary: Incompatible behavior for ColumnChunk.file_offset between
Parquet-mr and Impala
Key: PARQUET-68
URL: https://issues.apache.org/jira/browse/PARQUET-68
Project: Parquet
Issue Type: Bug
Components: parquet-format, parquet-mr
Reporter: Eunsoo Roh
According to the comments in
[parquet.thrift|https://github.com/apache/incubator-parquet-format/blob/master/src/thrift/parquet.thrift#L479],
this field is supposed to store the offset of the ColumnMetaData within the file
in which the column chunk is stored. My understanding is that this allows the
ColumnMetaData to be omitted from the ColumnChunk (it is an optional field,
after all).
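
To make that reading concrete, here is a minimal Python sketch of how a reader
could locate the ColumnMetaData if file_offset meant what the thrift comment
says. The class and field names mirror the thrift structs, but the classes are
stripped down, and resolve_metadata and read_metadata_at are hypothetical
helpers invented for this illustration, not part of any Parquet library.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ColumnMetaData:
    # Only one field is modelled here; the real struct carries many more.
    data_page_offset: int


@dataclass
class ColumnChunk:
    # Per the thrift comment: where the ColumnMetaData is stored in the file.
    file_offset: int
    # Optional in the format, so it may be absent.
    meta_data: Optional[ColumnMetaData] = None


def resolve_metadata(
    chunk: ColumnChunk,
    read_metadata_at: Callable[[int], ColumnMetaData],
) -> ColumnMetaData:
    """Return inline metadata if present, else fetch it via file_offset.

    read_metadata_at is a hypothetical callback standing in for "seek to
    this offset and deserialize a ColumnMetaData".
    """
    if chunk.meta_data is not None:
        return chunk.meta_data
    return read_metadata_at(chunk.file_offset)
```

The fallback branch only works if the writer really stored the ColumnMetaData
position in file_offset, which is the assumption this issue questions.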
Unfortunately, the two major implementations, Parquet-mr and Impala, deviate
from this definition when writing Parquet files. The Impala implementation
writes an offset pointing to the ColumnChunk rather than the ColumnMetaData, as
can be found in
[hdfs-parquet-table-writer.cc|https://github.com/cloudera/Impala/blob/24db37f4efdc493d218470dc045b61f5104c4fd0/be/src/exec/hdfs-parquet-table-writer.cc#L895].
Although this is incorrect according to the comments in parquet.thrift, it
still allows access to the ColumnMetaData necessary for reading data.
The Parquet-mr implementation can be found in
[ParquetMetadataConverter|https://github.com/Parquet/parquet-mr/blob/fd8d18f26af9ad7813dda71352b5dcb0080306eb/parquet-hadoop/src/main/java/parquet/format/converter/ParquetMetadataConverter.java#L149],
which writes the offset of the first data page. Not only is this incompatible
behavior, it also makes no sense, because you cannot read the data with just
the data page offset. There is even a comment on that line saying "verify this
is the right offset."
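
The three interpretations can be put side by side in a small Python sketch.
ChunkLayout, its fields, and file_offset_candidates are hypothetical names, and
the byte positions in the usage note are invented; the mapping simply restates
the behavior described in this issue rather than reproducing either writer's
code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ChunkLayout:
    """Hypothetical byte positions related to one column chunk."""
    column_chunk: int     # position of the ColumnChunk
    first_data_page: int  # position of the first data page
    column_metadata: int  # position of the ColumnMetaData


def file_offset_candidates(layout: ChunkLayout) -> Dict[str, int]:
    """Map each interpretation to the value it would store in file_offset."""
    return {
        "parquet.thrift comment": layout.column_metadata,
        "Impala (as described here)": layout.column_chunk,
        "Parquet-mr (as described here)": layout.first_data_page,
    }
```

For example, file_offset_candidates(ChunkLayout(column_chunk=4,
first_data_page=60, column_metadata=900)) yields three different values for the
same chunk, so a reader cannot interpret file_offset without knowing which
writer produced the file.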