Re: [PR] NIFI-12241 Efficient Parquet Splitting [nifi]

via GitHub Wed, 18 Oct 2023 06:50:47 -0700


markap14 commented on PR #7893:
URL: https://github.com/apache/nifi/pull/7893#issuecomment-1768506240


   Hey @takraj I don't know much about Parquet, so I'm probably not the best 
person to review the Parquet-specific parts of this. But looking at what's 
happening here, the processor does not split Parquet at all. Instead, it clones 
the input and adds 'count' and 'offset' types of attributes. So the naming is 
problematic. If I send a 10 GB Parquet file to SplitParquet and get out 10 
FlowFiles, I expect each to be 1 GB. Here, each one will be 10 GB because it's 
a clone of the original. This would lead to a lot of confusion.
   Perhaps a name like 'CalculateParquetOffsets' is appropriate?
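
   To make the distinction concrete, here is a rough sketch of what "calculating offsets" amounts to, as opposed to splitting content. It is not the code in this PR: the class name `OffsetSketch`, the `Range` type, and the `ranges` method are all hypothetical, and it assumes the partitioning is by record count. Each returned pair would correspond to one output FlowFile that still carries the full original content and differs only in its attributes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from the PR.
public class OffsetSketch {

    // One (offset, count) pair per output FlowFile. In the behavior described
    // above, the FlowFile content itself would remain a clone of the input.
    public static final class Range {
        public final long offset;
        public final long count;

        public Range(long offset, long count) {
            this.offset = offset;
            this.count = count;
        }
    }

    // Assumption: the input is partitioned by record count, with the last
    // partition taking whatever remains.
    public static List<Range> ranges(long totalRecords, long recordsPerSplit) {
        if (totalRecords < 0 || recordsPerSplit <= 0) {
            throw new IllegalArgumentException("totalRecords must be >= 0 and recordsPerSplit > 0");
        }
        List<Range> result = new ArrayList<>();
        for (long offset = 0; offset < totalRecords; offset += recordsPerSplit) {
            long count = Math.min(recordsPerSplit, totalRecords - offset);
            result.add(new Range(offset, count));
        }
        return result;
    }

    public static void main(String[] args) {
        // 10 records, 4 per partition: three ranges, none of which shrinks the content.
        for (Range r : ranges(10, 4)) {
            System.out.println("offset=" + r.offset + " count=" + r.count);
        }
    }
}
```

   Nothing in this calculation touches the file bytes, which is why a name built around "offsets" describes the behavior better than "split".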


