Ryan Blue created PARQUET-207:
---------------------------------
Summary: ParquetInputSplit end calculation bug
Key: PARQUET-207
URL: https://issues.apache.org/jira/browse/PARQUET-207
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.6.0
Reporter: Ryan Blue
Fix For: 1.6.0
The calculation for the end of a split using the file metadata was broken by
PARQUET-108. The calculation was updated to use the requested schema so that
the end of a block would be the end of the last projected column. But [the end
logic|https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputSplit.java#L94]
actually calculates the total number of bytes that are selected, which is a
size rather than an end offset.
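
To make the difference concrete, here is a minimal sketch of the two calculations. It is not the parquet-mr code: the Chunk class, the method names, and the offsets are hypothetical stand-ins for projected column chunk metadata, chosen only to show that a sum of selected bytes and an end offset diverge as soon as a column is skipped.

```java
import java.util.Arrays;
import java.util.List;

public class SplitEndSketch {
    /** Hypothetical stand-in for a projected column chunk (not a parquet-mr class). */
    static final class Chunk {
        final long startOffset; // byte position of the chunk in the file
        final long totalSize;   // number of bytes the chunk occupies

        Chunk(long startOffset, long totalSize) {
            this.startOffset = startOffset;
            this.totalSize = totalSize;
        }
    }

    /** What the issue describes: the total number of selected bytes (a size). */
    static long selectedBytes(List<Chunk> projected) {
        long total = 0;
        for (Chunk c : projected) {
            total += c.totalSize;
        }
        return total;
    }

    /** What was intended: the end offset of the last projected column. */
    static long endOfLastProjectedColumn(List<Chunk> projected) {
        long end = 0;
        for (Chunk c : projected) {
            end = Math.max(end, c.startOffset + c.totalSize);
        }
        return end;
    }

    public static void main(String[] args) {
        // Two projected chunks with an unprojected column between them
        // (bytes 104..303 are skipped). All numbers are made up.
        List<Chunk> projected = Arrays.asList(new Chunk(4, 100), new Chunk(304, 50));

        System.out.println("selected bytes: " + selectedBytes(projected));            // 150
        System.out.println("end of block:   " + endOfLastProjectedColumn(projected)); // 354
    }
}
```

In this made-up layout the sum (150) falls inside the skipped column, well short of the real end of the last projected chunk (354), so it cannot be used as an end position.
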
The end of a split is only used to select row groups when a block has no row
group offsets, which doesn't happen when the constructor that uses the broken
method is called. However, the broken calculation should still be removed.
After 1.6.0, I want to move Hive to pass FileSplits directly rather than
wrapping them in ParquetInputSplit. The internal reader code can handle mapping
row groups to splits because it needs to do that for PARQUET-84.