[ https://issues.apache.org/jira/browse/NIFI-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17821351#comment-17821351 ]
ASF subversion and git services commented on NIFI-12843:
--------------------------------------------------------

Commit 48fb538e685ae4c81faf67a65c1e790d5e1bf4e5 in nifi's branch refs/heads/support/nifi-1.x from Rajmund Takacs
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=48fb538e68 ]

NIFI-12843: Fix incorrect read of parquet data, when record.count is inherited

This closes #8452.

Signed-off-by: Tamas Palfy <tpa...@apache.org>

> If record count is set, ParquetRecordReader does not read the whole file
> ------------------------------------------------------------------------
>
>                 Key: NIFI-12843
>                 URL: https://issues.apache.org/jira/browse/NIFI-12843
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.25.0, 2.0.0-M2
>            Reporter: Rajmund Takacs
>            Assignee: Rajmund Takacs
>            Priority: Major
>         Attachments: parquet_reader_usecases.json
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Earlier, ParquetRecordReader ignored the record.count attribute of the incoming FlowFile. With NIFI-12241 this was changed, and the reader now reads only the specified number of rows from the record set. If the Parquet file was not produced by a record writer, this attribute is normally not set, and the record reader reads the whole file. However, processors that produce a Parquet file by processing record sets may have this attribute set, referring to the record set the Parquet file was taken from rather than to the actual content. This leads to incorrect behavior.
> For example: if ConsumeKafka produces a single-record FlowFile that is a Parquet file with 1000 rows, record.count is set to 1 instead of 1000, because it refers to the Kafka record set. ParquetRecordReader then reads only the first record of the Parquet file.
> The sole reason for changing the reader to take record.count into account is that the CalculateParquetOffsets processors generate FlowFiles with the same content but different offset and count attributes, each representing a slice of the original, large input. The Parquet reader then acts as if the large FlowFile contained only that small slice, which makes processing more efficient. There is no need to support files that have no offset but do have a limit (count), so changing the reader to take record.count into account only when an offset is also present would be a reasonable fix.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
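The fix proposed above can be sketched as a small decision rule: honor record.count only when an offset attribute accompanies it, since only the CalculateParquetOffsets slicing flow sets both together. This is a minimal illustration under assumed attribute names (`record.count`, `record.offset`) and a hypothetical helper, not the actual NiFi reader code:

```java
import java.util.Map;

public class RecordCountPolicy {

    /**
     * Decides how many records the reader should read from the Parquet file.
     * Returns -1 to mean "read the whole file".
     *
     * record.count is honored only when record.offset is also present, so an
     * inherited count (e.g. from a Kafka record set) does not truncate the read.
     */
    static long effectiveRecordCount(Map<String, String> attributes) {
        String count = attributes.get("record.count");
        String offset = attributes.get("record.offset");
        if (count != null && offset != null) {
            // Slice produced by the offset-calculation flow: read only the slice.
            return Long.parseLong(count);
        }
        // No offset: the count (if any) refers to something else; read everything.
        return -1;
    }

    public static void main(String[] args) {
        // ConsumeKafka case: record.count=1 is inherited, no offset -> whole file.
        System.out.println(effectiveRecordCount(Map.of("record.count", "1")));

        // CalculateParquetOffsets case: both attributes set -> read the slice.
        System.out.println(effectiveRecordCount(
                Map.of("record.count", "250", "record.offset", "1000")));
    }
}
```

With this rule, the ConsumeKafka example from the description reads all 1000 rows, while the slicing use case keeps its efficiency benefit.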