[ 
https://issues.apache.org/jira/browse/NIFI-14702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Handermann resolved NIFI-14702.
-------------------------------------
    Fix Version/s: 2.6.0
       Resolution: Fixed

> Improve SplitExcel processor memory usage
> -----------------------------------------
>
>                 Key: NIFI-14702
>                 URL: https://issues.apache.org/jira/browse/NIFI-14702
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Piotr Zalas
>            Assignee: Piotr Zalas
>            Priority: Major
>             Fix For: 2.6.0
>
>          Time Spent: 5h
>  Remaining Estimate: 0h
>
> The SplitExcel processor consumes a huge amount of memory for large files.
> For a single-sheet test file consisting of 1,000,000 rows with 2 kb of data
> in each row, it consumed 14 GB of RAM, causing my environment to crash. The
> test file is 15 MB in size.
> The large memory consumption is caused by two factors:
>  # The whole sheet is read into memory and all rows read are added to a
> single ArrayList, even though a streaming XLSX reader is used.
>  # After all rows are read, the copy to a new file begins. The new file
> with the copied sheet is kept in memory until the whole file content has
> been created.
> To fix these problems, I have changed the XLSX writer to the optimised
> streaming SXSSFWorkbook. Additionally, I had to copy the row-copying logic
> from the Apache POI library and adjust it to use a streaming approach
> instead of taking the rows to copy as an ArrayList. As a result, overall
> memory consumption dropped to 550 MB.
> When copying the Apache POI code, I made the following adjustments:
>  # Removed validation of the input rows to copy - all required conditions
> are enforced by the processor.
>  # Removed unnecessary logic related to row number shifting - the generic
> method used previously allowed copying a subset of rows to a different
> position in an existing sheet (e.g. copying row 10 of the source sheet to
> row 50 of the destination sheet). This isn't needed in SplitExcel, as row
> numbers don't change when a sheet is copied.
>  # Removed the logic responsible for cleaning the destination sheet when a
> destination row already contains data that would be overwritten. In the
> processor, the destination sheet is always empty when the copy operation
> begins.
> [~dstiegli1], I believe this should simplify the implementation of XLS
> support in the processor. I would be very grateful for your careful review.
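
The difference between the buffered copy and the streaming copy described
above can be illustrated with a minimal, standard-library-only sketch. It
does not use Apache POI or NiFi classes and is not the code from the actual
change; the class StreamingCopySketch, its methods, and the string rows are
hypothetical stand-ins. The bounded window only loosely mirrors the way
SXSSFWorkbook keeps a limited number of rows in memory and flushes older ones.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.IntStream;

/** Hypothetical illustration only; rows are plain strings, not POI Row objects. */
class StreamingCopySketch {

    /** Rows written and the largest number of rows held in memory at once. */
    static final class CopyStats {
        final int rowsWritten;
        final int peakRowsHeld;

        CopyStats(int rowsWritten, int peakRowsHeld) {
            this.rowsWritten = rowsWritten;
            this.peakRowsHeld = peakRowsHeld;
        }
    }

    /** Lazily generated source rows, standing in for a streaming reader. */
    static Iterator<String> rows(int count) {
        return IntStream.range(0, count).mapToObj(i -> "row-" + i).iterator();
    }

    /** Buffered copy: every row is collected first, so the peak grows with the sheet. */
    static CopyStats copyBuffered(Iterator<String> source, Consumer<String> sink) {
        List<String> allRows = new ArrayList<>();
        while (source.hasNext()) {
            allRows.add(source.next());
        }
        for (String row : allRows) {
            sink.accept(row);
        }
        return new CopyStats(allRows.size(), allRows.size());
    }

    /** Streaming copy: once the window is full, the oldest row is flushed to the sink. */
    static CopyStats copyStreaming(Iterator<String> source, Consumer<String> sink, int windowSize) {
        Deque<String> window = new ArrayDeque<>();
        int written = 0;
        int peak = 0;
        while (source.hasNext()) {
            window.addLast(source.next());
            peak = Math.max(peak, window.size());
            if (window.size() >= windowSize) {
                sink.accept(window.removeFirst());
                written++;
            }
        }
        // Flush whatever is still held once the source is exhausted.
        while (!window.isEmpty()) {
            sink.accept(window.removeFirst());
            written++;
        }
        return new CopyStats(written, peak);
    }

    public static void main(String[] args) {
        CopyStats buffered = copyBuffered(rows(10_000), row -> { });
        CopyStats streamed = copyStreaming(rows(10_000), row -> { }, 100);
        System.out.println("buffered peak rows held:  " + buffered.peakRowsHeld);
        System.out.println("streaming peak rows held: " + streamed.peakRowsHeld);
    }
}
```

In this sketch the buffered copy holds as many rows as the sheet contains,
while the streaming copy never holds more than the window size, regardless
of how many rows pass through it.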



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
