[ https://issues.apache.org/jira/browse/PARQUET-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Szadovszky reassigned PARQUET-1947:
-----------------------------------------

    Assignee: Daniel Dai

> DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong data
> -------------------------------------------------------------------------------
>
>                 Key: PARQUET-1947
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1947
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cascading
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>            Priority: Major
>         Attachments: Part1.java
>
>
> When we read Parquet files using Cascading 2, we observe wrong data at the
> file boundary when input combining is turned on in Cascading
> (setUseCombinedInput set to true).
> This can be reproduced easily with two Parquet input files, each containing
> one record. A simple Cascading application (attached) reads the two inputs
> with setUseCombinedInput(true). What we get is the record from the first
> input file twice, while the record from the second input file is missing.
> Here is the call sequence that shows what happens after the last record of
> the first input:
> 1. Cascading invokes DeprecatedParquetInputFormat.createValue(), which yields
> the last record of the first input again
> 2. CombineFileRecordReader invokes RecordReader.next and reaches the EOF of
> the first input
> 3. CombineFileRecordReader creates a new
> DeprecatedParquetInputFormat.RecordReaderWrapper, which creates a new
> "value" variable containing the first record of the second input
> 4. CombineFileRecordReader invokes RecordReader.next on the new
> RecordReaderWrapper, but since the firstRecord flag is on, next does not do
> anything
> 5. Thus the "value" variable containing the first record of the second input
> is lost, and Cascading reuses the last record of the first input

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
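
The call sequence described in the report can be modelled without Hadoop or Parquet on the classpath. The sketch below is only a simplified illustration, not the actual parquet-mr or Hadoop code: Container, WrapperModel, readCombined and the copyFirstRecord switch are hypothetical names invented here. It assumes that each wrapper pre-reads its first record into its own value object and that the combining reader keeps passing the value object created for the first input. The copyFirstRecord variant shows one conceivable remedy and is not necessarily the fix adopted upstream.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the file-boundary behaviour described above.
class CombineBoundaryDemo {

    /** Mutable holder standing in for the reusable "value" object. */
    static final class Container {
        String record;
    }

    /** Hypothetical stand-in for DeprecatedParquetInputFormat.RecordReaderWrapper. */
    static final class WrapperModel {
        private final String[] records;
        private final Container own = new Container();
        private final boolean copyFirstRecord;
        private int pos = 0;
        private boolean firstRecord = false;
        private boolean eof = false;

        WrapperModel(String[] records, boolean copyFirstRecord) {
            this.records = records;
            this.copyFirstRecord = copyFirstRecord;
            if (records.length > 0) {
                own.record = records[pos++]; // first record is pre-read into "own"
                firstRecord = true;
            } else {
                eof = true;
            }
        }

        Container createValue() {
            return own;
        }

        boolean next(Container value) {
            if (eof) {
                return false;
            }
            if (firstRecord) {
                firstRecord = false;
                if (copyFirstRecord) {
                    value.record = own.record; // assumed remedy: hand over the pre-read record
                }
                return true; // without the copy, "value" is left untouched (step 4)
            }
            if (pos < records.length) {
                value.record = records[pos++];
                return true;
            }
            eof = true;
            return false;
        }
    }

    /** Models a combining reader: createValue() is called once, on the first input only. */
    static List<String> readCombined(String[][] files, boolean copyFirstRecord) {
        List<String> out = new ArrayList<>();
        if (files.length == 0) {
            return out;
        }
        int idx = 0;
        WrapperModel current = new WrapperModel(files[idx], copyFirstRecord);
        Container value = current.createValue();
        while (true) {
            if (current.next(value)) {
                out.add(value.record);
                continue;
            }
            idx++; // EOF of the current input: move to the next one (steps 2 and 3)
            if (idx >= files.length) {
                break;
            }
            current = new WrapperModel(files[idx], copyFirstRecord);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] files = { {"a"}, {"b"} };
        System.out.println(readCombined(files, false)); // prints [a, a]
        System.out.println(readCombined(files, true));  // prints [a, b]
    }
}
```

With two single-record inputs, the model reproduces the reported symptom: the first input's record appears twice and the second input's record never reaches the caller, unless the pre-read record is copied into the caller's value object.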