[
https://issues.apache.org/jira/browse/NIFI-14702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Handermann resolved NIFI-14702.
-------------------------------------
Fix Version/s: 2.6.0
Resolution: Fixed
> Improve SplitExcel processor memory usage
> -----------------------------------------
>
> Key: NIFI-14702
> URL: https://issues.apache.org/jira/browse/NIFI-14702
> Project: Apache NiFi
> Issue Type: Improvement
> Components: Extensions
> Reporter: Piotr Zalas
> Assignee: Piotr Zalas
> Priority: Major
> Fix For: 2.6.0
>
> Time Spent: 5h
> Remaining Estimate: 0h
>
> The SplitExcel processor consumes a huge amount of memory for large files.
> For a single-sheet test file consisting of 1,000,000 rows with 2 KB of data
> in each row, it consumed 14 GB of RAM, causing my environment to crash. The
> test file has a size of 15 MB.
> The large memory consumption is caused by two factors:
> # The whole sheet is read into memory and every row read is added to a
> single ArrayList, even though a streaming XLSX reader is used.
> # After all rows are read, a copy operation to a new file begins. The new
> file with the copied sheet is kept in memory until the whole file content is
> created.
> To fix these problems, I have changed the XLSX writer to the optimised
> streaming SXSSFWorkbook. Additionally, I had to copy the logic responsible
> for copying rows from the Apache POI library and adjust it to use a
> streaming approach instead of taking the rows to copy as an ArrayList. As a
> result, overall memory consumption dropped to 550 MB.
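
To illustrate the difference between collecting every row in an ArrayList and streaming rows through a bounded window, here is a minimal sketch using only the JDK. It is not the NiFi or Apache POI code: the class `WindowedRowCopy`, its `copy` method and the `String` row type are hypothetical names chosen for the example, and the sink is a plain `Consumer` rather than a workbook.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.function.Consumer;

// Hypothetical illustration only; this is not NiFi or Apache POI code.
final class WindowedRowCopy {

    private WindowedRowCopy() {
    }

    /**
     * Copies rows from source to sink while holding at most windowSize rows
     * in memory. Returns the largest number of rows buffered at any moment.
     */
    static int copy(Iterator<String> source, Consumer<String> sink, int windowSize) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be at least 1");
        }
        Deque<String> window = new ArrayDeque<>();
        int peak = 0;
        while (source.hasNext()) {
            // Flush the oldest row before the window would exceed its limit.
            if (window.size() == windowSize) {
                sink.accept(window.removeFirst());
            }
            window.addLast(source.next());
            peak = Math.max(peak, window.size());
        }
        // Flush whatever is still buffered once the source is exhausted.
        while (!window.isEmpty()) {
            sink.accept(window.removeFirst());
        }
        return peak;
    }
}
```

With 1,000,000 input rows and a window of 100, the sketch never holds more than 100 rows at once, whereas an ArrayList-based copy holds all of them. In Apache POI, SXSSFWorkbook plays a comparable role on the writing side: its constructor accepts a row access window size and rows that fall outside the window are flushed to temporary storage.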
> When copying the Apache POI code, I made the following adjustments:
> # Removed validation of the input rows to copy - all required conditions are
> enforced by the processor.
> # Removed unnecessary logic related to row number shifting - the previously
> used generic method allowed copying a subset of rows into an already
> existing sheet at a different position (e.g. copying row 10 from the source
> sheet to row 50 in the destination sheet). This isn't needed in SplitExcel
> as the row number doesn't change when a sheet is copied.
> # Removed logic responsible for cleaning the destination sheet in case the
> destination row already contains data that would be overwritten. In the
> processor, the destination sheet is always empty when the copy operation
> begins.
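
The three adjustments listed above can be pictured with a small model, again using only the JDK. This is a hypothetical sketch, not the code from the pull request or from Apache POI: a sheet is modelled as a `Map` from row number to cell values, and `SameIndexRowCopy` is a name invented for the example. It assumes, as the description states for the processor, that the destination starts empty and that row numbers are kept unchanged.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical model of a sheet as "row number -> cell values"; not NiFi or POI code.
final class SameIndexRowCopy {

    private SameIndexRowCopy() {
    }

    /**
     * Streams rows into the destination one at a time.
     * Assumption: the caller guarantees that the destination is empty,
     * so no input validation and no cleanup of existing rows is done here.
     */
    static void copy(Iterator<Map.Entry<Integer, List<String>>> sourceRows,
                     Map<Integer, List<String>> destination) {
        while (sourceRows.hasNext()) {
            Map.Entry<Integer, List<String>> row = sourceRows.next();
            // The source row number is reused as-is, so no shifting offset is computed.
            destination.put(row.getKey(), new ArrayList<>(row.getValue()));
        }
    }
}
```

Because the source is consumed through an `Iterator`, the copy never needs the complete list of rows up front, which is what allows it to be combined with a streaming reader and a streaming writer.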
> [~dstiegli1], I believe this should simplify the implementation of XLS
> support in the processor. I would be very grateful for your careful review.