Ryan Blue created PARQUET-207:
---------------------------------

             Summary: ParquetInputSplit end calculation bug
                 Key: PARQUET-207
                 URL: https://issues.apache.org/jira/browse/PARQUET-207
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.6.0
            Reporter: Ryan Blue
             Fix For: 1.6.0


The calculation for the end of a split using the file metadata was broken by 
PARQUET-108. The calculation was updated to use the requested schema so that 
the end of a block would be the end of the last projected column. But [the end 
logic|https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputSplit.java#L94] 
actually calculates the total number of bytes that are selected, which is a 
byte count rather than an end position in the file.
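
To make the difference concrete, here is a minimal sketch of the two 
calculations. It is not the code in ParquetInputSplit: the Chunk class, both 
method names, and the offsets are hypothetical and chosen only to illustrate 
why a sum of selected bytes is not the end of the last projected column.

```java
import java.util.Arrays;
import java.util.List;

public class SplitEndSketch {
    /** Hypothetical stand-in for the metadata of one projected column chunk. */
    static final class Chunk {
        final long startOffset;    // byte offset of the chunk within the file
        final long compressedSize; // number of bytes the chunk occupies

        Chunk(long startOffset, long compressedSize) {
            this.startOffset = startOffset;
            this.compressedSize = compressedSize;
        }
    }

    /** What the issue says the logic computes: total bytes selected. */
    static long totalSelectedBytes(List<Chunk> projected) {
        long total = 0;
        for (Chunk c : projected) {
            total += c.compressedSize;
        }
        return total;
    }

    /** What was intended: the end position of the last projected column. */
    static long endOfLastProjectedColumn(List<Chunk> projected) {
        long end = 0;
        for (Chunk c : projected) {
            end = Math.max(end, c.startOffset + c.compressedSize);
        }
        return end;
    }

    public static void main(String[] args) {
        // Two projected chunks with an unprojected column between them.
        List<Chunk> projected = Arrays.asList(new Chunk(4, 100), new Chunk(304, 50));
        System.out.println(totalSelectedBytes(projected));       // prints 150
        System.out.println(endOfLastProjectedColumn(projected)); // prints 354
    }
}
```

With these made-up offsets the two results differ by 204 bytes, so a byte 
count used as an end position would fall short of the last projected column.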

The end of a split is only used to select row groups when a block has no row 
group offsets, which doesn't happen when the constructor that uses the broken 
method is called, so the bug does not currently affect row group selection. 
However, the broken method should still be removed.
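
A rough sketch of that selection rule follows. All names here are 
hypothetical, and the midpoint test used for the range case is an assumption 
about how a byte range might be matched to row groups, not something this 
issue states.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupSelectionSketch {
    /** Hypothetical stand-in for one row group's position in the file. */
    static final class RowGroup {
        final long start;  // byte offset where the row group begins
        final long length; // total bytes in the row group

        RowGroup(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Selects row groups for a split. When explicit row group offsets are
     * present, the split's end is never consulted; the [splitStart, splitEnd)
     * range only matters when offsets is null.
     */
    static List<RowGroup> select(List<RowGroup> all, long[] offsets,
                                 long splitStart, long splitEnd) {
        List<RowGroup> selected = new ArrayList<>();
        for (RowGroup rg : all) {
            boolean keep;
            if (offsets != null) {
                keep = contains(offsets, rg.start);
            } else {
                // Assumed rule: a row group belongs to the split that
                // contains its midpoint.
                long mid = rg.start + rg.length / 2;
                keep = mid >= splitStart && mid < splitEnd;
            }
            if (keep) {
                selected.add(rg);
            }
        }
        return selected;
    }

    private static boolean contains(long[] values, long target) {
        for (long v : values) {
            if (v == target) {
                return true;
            }
        }
        return false;
    }
}
```

In this sketch a wrong end value can only change the result on the range 
path, which matches the observation that the bug is latent while offsets are 
always supplied.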

After 1.6.0, I want to move Hive to pass FileSplits directly rather than 
wrapping them in ParquetInputSplit. The internal reader code can handle mapping 
row groups to splits because it already has to do so for PARQUET-84.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
