[ https://issues.apache.org/jira/browse/NIFI-12241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tamas Palfy updated NIFI-12241:
-------------------------------
    Fix Version/s: 1.25.0
                   2.0.0
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> Efficient Parquet Splitting
> ---------------------------
>
>                 Key: NIFI-12241
>                 URL: https://issues.apache.org/jira/browse/NIFI-12241
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Rajmund Takacs
>            Assignee: Rajmund Takacs
>            Priority: Major
>              Labels: feature, performance, pull-request-available
>             Fix For: 1.25.0, 2.0.0
>
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> SplitParquet processor that expects as input a FlowFile with Parquet content
> and would take as parameter a number of records as the split configuration.
> The processor would generate X flow files with unmodified content and would
> add attributes with the offsets required to read the group of rows in the
> flowfile's content.
> Then the Parquet Reader would be improved to accept optional flow file
> attributes containing the information so that the reader can only read the
> required part of the data.
> Instead of having something like
> {noformat}
> X -> SplitRecord (Parquet / JSON) -> ...{noformat}
> It'd be something like
> {noformat}
> X -> SplitParquet -> ConvertRecord (Parquet / JSON) -> ...{noformat}
> The goal here is to increase the overall efficiency of this operation for
> extremely large Parquet files (hundreds of GBs). With the second approach, it
> could leverage multi-threading for processing a single file.
> SplitParquet processor should also have a property (true/false) to write
> zero-content flow files. The existing FetchParquet processor should be
> enhanced to accept the flow file attributes for giving offsets. It'd give
> something like
> {noformat}
> X -> SplitParquet -> FetchParquet (JSON Writer) -> ...{noformat}
> This way, a load balanced connection could be used between SplitParquet and
> FetchParquet in order to distribute the work across the nodes (without
> transferring a lot of data across the nodes of the cluster).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
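
The splitting scheme described in the issue (unmodified content plus offset attributes per flow file) can be illustrated with a minimal sketch. This is not the NiFi implementation. The function name `plan_splits` and the attribute keys `record.offset` and `record.count` are assumptions made for illustration only; the message above does not name the actual attributes. The sketch only shows how a record count and a split size could be turned into per-split ranges that a reader might later use to read just its share of the file.

```python
from typing import Dict, List


def plan_splits(total_records: int, records_per_split: int) -> List[Dict[str, str]]:
    """Return one attribute map per split, covering all records exactly once.

    The attribute keys are hypothetical placeholders, not confirmed NiFi names.
    """
    if total_records < 0:
        raise ValueError("total_records must be non-negative")
    if records_per_split <= 0:
        raise ValueError("records_per_split must be positive")

    splits = []
    # Step through the file in fixed-size windows; the last window may be shorter.
    for offset in range(0, total_records, records_per_split):
        count = min(records_per_split, total_records - offset)
        # Attributes are stored as strings, as flow file attributes are text values.
        splits.append({"record.offset": str(offset), "record.count": str(count)})
    return splits


if __name__ == "__main__":
    for attributes in plan_splits(total_records=10, records_per_split=4):
        print(attributes)
```

Because each attribute map is independent of the others, the resulting splits could in principle be handled by separate threads or separate nodes, which is the efficiency gain the issue is aiming for.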