Rajmund Takacs created NIFI-12843:
-------------------------------------

             Summary: If record count is set, ParquetRecordReader does not read 
the whole file
                 Key: NIFI-12843
                 URL: https://issues.apache.org/jira/browse/NIFI-12843
             Project: Apache NiFi
          Issue Type: Bug
          Components: Extensions
    Affects Versions: 2.0.0-M2, 1.25.0
            Reporter: Rajmund Takacs
            Assignee: Rajmund Takacs


Earlier, ParquetRecordReader ignored the record.count attribute of the incoming 
FlowFile. With NIFI-12241 this was changed, and now the reader reads only 
the specified number of rows from the record set. If the Parquet file is 
not produced by a record writer, this attribute is normally not set, and 
in that case the record reader reads the whole file. However, processors 
that produce a Parquet file by processing record sets might have this attribute 
set, referring to the record set the Parquet file was taken from, and not the 
actual content. This leads to incorrect behavior.

For example: if ConsumeKafka produces a single-record FlowFile that is a Parquet 
file with 1000 rows, then record.count would be set to 1 instead of 1000, 
because it refers to the Kafka record set. So ParquetRecordReader now reads 
only the first record of the Parquet file.

The sole reason for changing the reader to take record.count into account is 
that the CalculateParquetOffsets processors generate FlowFiles with the same 
content but different offset and count attributes, each representing a slice of 
the original, big input. The Parquet reader then acts as if the big FlowFile 
were only a small one containing that slice, which makes processing more 
efficient. There is no need to support files that have no offset but do have a 
limit (count), so changing the reader to only take record.count into account 
when an offset is present too could be a reasonable fix.
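The proposed guard could be sketched roughly as follows. This is an illustrative standalone helper, not the actual NiFi reader code; the attribute names ("record.offset", "record.count") and the method name are assumptions based on the description above.

```java
import java.util.Map;
import java.util.Optional;

// Illustrative sketch of the proposed fix: honor record.count only when an
// offset attribute is also present, i.e. when the FlowFile is a slice produced
// by a CalculateParquetOffsets processor. Otherwise record.count may describe
// an unrelated record set (e.g. the Kafka record set), so the whole Parquet
// file should be read.
public class RecordLimitGuard {

    static Optional<Long> effectiveLimit(Map<String, String> attributes) {
        final String offset = attributes.get("record.offset");
        final String count = attributes.get("record.count");
        if (offset == null || count == null) {
            // No offset: ignore record.count and read the whole file.
            return Optional.empty();
        }
        // Offset present: the count describes this slice, so apply it.
        return Optional.of(Long.parseLong(count));
    }
}
```

With this guard, the ConsumeKafka example above (record.count=1, no offset) would fall into the "read the whole file" branch, while FlowFiles sliced by CalculateParquetOffsets would still be limited as intended.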



--
This message was sent by Atlassian Jira
(v8.20.10#820010)