[ https://issues.apache.org/jira/browse/NIFI-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rajmund Takacs updated NIFI-12843:
----------------------------------
    Attachment: parquet_reader_usecases.json

> If record count is set, ParquetRecordReader does not read the whole file
> ------------------------------------------------------------------------
>
>                 Key: NIFI-12843
>                 URL: https://issues.apache.org/jira/browse/NIFI-12843
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.25.0, 2.0.0-M2
>            Reporter: Rajmund Takacs
>            Assignee: Rajmund Takacs
>            Priority: Major
>         Attachments: parquet_reader_usecases.json
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Earlier, ParquetRecordReader ignored the record.count attribute of the
> incoming FlowFile. With NIFI-12241 this was changed, and the reader now reads
> only the specified number of rows from the record set. If the Parquet file was
> not produced by a record writer, this attribute is normally not set, and the
> reader reads the whole file. However, processors that produce a Parquet file
> by processing record sets may set this attribute to describe the record set
> the Parquet file was derived from, not the actual content. This leads to
> incorrect behavior.
> For example: ConsumeKafka produces a single-record FlowFile that is a Parquet
> file with 1000 rows. record.count is then set to 1 instead of 1000, because it
> refers to the Kafka record set, so ParquetRecordReader now reads only the
> first record of the Parquet file.
> The sole reason for changing the reader to take record.count into account was
> that the CalculateParquetOffsets processor generates FlowFiles with the same
> content but different offset and count attributes, each representing a slice
> of the original, large input. The Parquet reader then acts as if the large
> FlowFile contained only that slice, which makes processing more efficient.
> There is no need to support files that have no offset but do have a limit
> (count), so a reasonable fix would be to change the reader to take
> record.count into account only if the offset attribute is present too.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
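The guard proposed in the issue could be sketched as below. This is a minimal illustration, not the actual NiFi implementation: the helper name `effectiveRecordLimit` is hypothetical, and the attribute keys `record.count` and `record.offset` are assumed from the description above.

```java
import java.util.Map;

public class RecordLimitSketch {

    // Hypothetical helper: honor record.count only when record.offset is also
    // present, i.e. when the FlowFile is a slice produced by a processor such
    // as CalculateParquetOffsets. Returns null when the whole file should be read.
    static Long effectiveRecordLimit(Map<String, String> attributes) {
        final String count = attributes.get("record.count");
        final String offset = attributes.get("record.offset");
        if (count != null && offset != null) {
            return Long.parseLong(count); // read only this many rows of the slice
        }
        return null; // no offset: record.count may describe another record set, ignore it
    }

    public static void main(String[] args) {
        // Slice produced by CalculateParquetOffsets: the limit applies.
        System.out.println(effectiveRecordLimit(
                Map.of("record.count", "250", "record.offset", "1000")));

        // record.count inherited from e.g. a Kafka record set: ignored,
        // so the reader would read the whole 1000-row Parquet file.
        System.out.println(effectiveRecordLimit(Map.of("record.count", "1")));
    }
}
```

With this guard, the ConsumeKafka scenario from the description no longer truncates the file, while the CalculateParquetOffsets slicing optimization keeps working.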