[ https://issues.apache.org/jira/browse/NIFI-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821351#comment-17821351 ]
ASF subversion and git services commented on NIFI-12843:
--------------------------------------------------------

Commit 48fb538e685ae4c81faf67a65c1e790d5e1bf4e5 in nifi's branch refs/heads/support/nifi-1.x from Rajmund Takacs
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=48fb538e68 ]

NIFI-12843: Fix incorrect read of parquet data, when record.count is inherited

This closes #8452.

Signed-off-by: Tamas Palfy <tpa...@apache.org>

> If record count is set, ParquetRecordReader does not read the whole file
> ------------------------------------------------------------------------
>
>                 Key: NIFI-12843
>                 URL: https://issues.apache.org/jira/browse/NIFI-12843
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.25.0, 2.0.0-M2
>            Reporter: Rajmund Takacs
>            Assignee: Rajmund Takacs
>            Priority: Major
>         Attachments: parquet_reader_usecases.json
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Earlier, ParquetRecordReader ignored the record.count attribute of the incoming FlowFile. With NIFI-12241 this was changed, and the reader now reads only the specified number of rows from the record set. If the Parquet file was not produced by a record writer, this attribute is normally not set, and the record reader reads the whole file. However, processors that produce a Parquet file by processing record sets may have this attribute set, referring to the record set the Parquet file was taken from rather than to the actual content. This leads to incorrect behavior.
> For example: if ConsumeKafka produces a single-record FlowFile that is a Parquet file with 1000 rows, record.count is set to 1 instead of 1000, because it refers to the Kafka record set. ParquetRecordReader then reads only the first record of the Parquet file.
> The sole reason for changing the reader to take record.count into account is that the CalculateParquetOffsets processors generate FlowFiles with the same content but different offset and count attributes, each representing a slice of the original, large input. The Parquet reader then acts as if the large FlowFile contained only that small slice, which makes processing more efficient. There is no need to support files that have no offset but do have a limit (count), so changing the reader to take record.count into account only when an offset is also present would be a reasonable fix.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
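The fix proposed above can be sketched as a small decision rule: honor record.count only when an offset attribute accompanies it, since only the CalculateParquetOffsets slicing flow sets both together. This is a minimal illustration under assumed attribute names (`record.count`, `record.offset`) and a hypothetical helper, not the actual NiFi reader code:

```java
import java.util.Map;

public class RecordCountPolicy {

    /**
     * Decides how many records the reader should read from the Parquet file.
     * Returns -1 to mean "read the whole file".
     *
     * record.count is honored only when record.offset is also present, so an
     * inherited count (e.g. from a Kafka record set) does not truncate the read.
     */
    static long effectiveRecordCount(Map<String, String> attributes) {
        String count = attributes.get("record.count");
        String offset = attributes.get("record.offset");
        if (count != null && offset != null) {
            // Slice produced by the offset-calculation flow: read only the slice.
            return Long.parseLong(count);
        }
        // No offset: the count (if any) refers to something else; read everything.
        return -1;
    }

    public static void main(String[] args) {
        // ConsumeKafka case: record.count=1 is inherited, no offset -> whole file.
        System.out.println(effectiveRecordCount(Map.of("record.count", "1")));

        // CalculateParquetOffsets case: both attributes set -> read the slice.
        System.out.println(effectiveRecordCount(
                Map.of("record.count", "250", "record.offset", "1000")));
    }
}
```

With this rule, the ConsumeKafka example from the description reads all 1000 rows, while the slicing use case keeps its efficiency benefit.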