[
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gary Potagal updated PDFBOX-4188:
---------------------------------
Affects Version/s: 3.0.0 PDFBox
> "Maximum allowed scratch file memory exceeded." Exception when merging large
> number of small PDFs
> --------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.9, 3.0.0 PDFBox
> Reporter: Gary Potagal
> Priority: Major
>
> I have been running some tests trying to merge a large number (2618) of small
> PDF documents, between 100 KB and 130 KB each, into a single large PDF
> (288,433 KB).
> Memory consumption seems to be the main limitation.
> ScratchFileBuffer seems to account for the majority of the memory usage
> (see the screenshot from MAT in the attachment).
> (I would attach the hprof so you can analyze it yourselves, but it's rather
> large.)
> Note that it seems impossible to generate a large PDF using a small memory
> footprint.
> I personally thought that using MemoryUsageSetting with temporary file only
> would allow me to generate arbitrarily large PDF files, but it doesn't seem
> to help.
> I've run mergeDocuments with the following memory settings:
> * MemoryUsageSetting.setupMixed(1024L * 1024L,
>   1024L * 1024L * 1024L * 1024L * 1024L)
> * MemoryUsageSetting.setupTempFileOnly()
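
To make the two setupMixed arguments above concrete, here is a small arithmetic sketch. It does not call PDFBox at all; the class and method names are hypothetical and only evaluate the same expressions that appear in the report. As far as I know, the first argument is the main-memory limit in bytes and the second is the overall storage limit in bytes, but treat that reading as an assumption.

```java
public class ScratchLimits {
    // First argument from the report: 1024L * 1024L bytes, i.e. 1 MiB.
    static long mainMemoryBytes() {
        return 1024L * 1024L;
    }

    // Second argument from the report: 1024L multiplied five times, i.e. 2^50 bytes (1 PiB).
    static long storageBytes() {
        return 1024L * 1024L * 1024L * 1024L * 1024L;
    }

    // The same product without the L suffix is evaluated in 32-bit int
    // arithmetic and silently wraps around before being widened to long.
    static long storageBytesWithoutSuffix() {
        return 1024 * 1024 * 1024 * 1024 * 1024;
    }

    public static void main(String[] args) {
        System.out.println(mainMemoryBytes());           // 1048576
        System.out.println(storageBytes());              // 1125899906842624
        System.out.println(storageBytesWithoutSuffix()); // 0
    }
}
```

So the mixed setting in the report allows roughly 1 MiB of main memory and an effectively unbounded total, and the L suffixes matter: without them the limit would collapse to 0.
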
> The refactored version completes with a *4GB* heap:
> with temp file only, it merges 2618 documents in 1.760 min
> *VS*
> the original version, which needs an *8GB* heap:
> with temp file only, it merges 2618 documents in 2.0 min
> With the original version, heaps of 6GB or less result in OOM. (I didn't try
> sizes between 6GB and 8GB.)
> It looks like the loop in mergeDocuments accumulates PDDocument objects in a
> list, and these are only closed after the merge is completed.
> Refactoring the code to close them as they are used, instead of accumulating
> them and closing all at the end, improves memory usage considerably (although
> the problem doesn't seem to be eliminated completely, based on the MAT
> analysis).
> Another change I've implemented is to create the InputStream only when the
> file needs to be read, and to close it alongside the PDDocument.
> (Some InputStreams contain buffers, and depending on the size of the buffers
> and/or the stream type, accumulating all the streams is a potential memory
> hog.)
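
The two changes described above (close each source as soon as it has been used, and open its stream only when needed) can be sketched without PDFBox. This is not the attached PDFMergerUtilityUsingSupplier; StreamSupplier, TrackedStream and the other names are hypothetical stand-ins, and "merging" is reduced to counting bytes so that the number of simultaneously open sources can be observed. The sketch uses lambdas and try-with-resources for brevity; a Java 6 version would use anonymous classes and try/finally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class LazyMergeSketch {
    /** Hypothetical stand-in for the reporter's Supplier: opens the stream on demand. */
    interface StreamSupplier {
        InputStream get() throws IOException;
    }

    static int openNow = 0;   // sources currently open
    static int peakOpen = 0;  // highest number of sources open at the same time

    /** In-memory stream that records when it is opened and closed. */
    static final class TrackedStream extends ByteArrayInputStream {
        private boolean closed;

        TrackedStream(byte[] data) {
            super(data);
            openNow++;
            peakOpen = Math.max(peakOpen, openNow);
        }

        @Override
        public void close() throws IOException {
            if (!closed) {
                closed = true;
                openNow--;
            }
            super.close();
        }
    }

    /** Builds suppliers only; no stream exists until get() is called. */
    static List<StreamSupplier> sources(int count) {
        List<StreamSupplier> list = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            list.add(() -> new TrackedStream(new byte[] {1, 2, 3}));
        }
        return list;
    }

    static long drain(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    /** Pattern described for the original loop: keep everything open, close at the end. */
    static long mergeAccumulating(List<StreamSupplier> sources) throws IOException {
        List<InputStream> open = new ArrayList<>();
        long total = 0;
        try {
            for (StreamSupplier s : sources) {
                InputStream in = s.get();
                open.add(in);
                total += drain(in);
            }
        } finally {
            for (InputStream in : open) {
                in.close();
            }
        }
        return total;
    }

    /** Refactored pattern: open lazily and close each source right after use. */
    static long mergeClosingEagerly(List<StreamSupplier> sources) throws IOException {
        long total = 0;
        for (StreamSupplier s : sources) {
            try (InputStream in = s.get()) {
                total += drain(in);
            }
        }
        return total;
    }

    /** Runs one variant over `count` sources and reports the peak number open at once. */
    static int measurePeak(boolean eager, int count) {
        openNow = 0;
        peakOpen = 0;
        try {
            if (eager) {
                mergeClosingEagerly(sources(count));
            } else {
                mergeAccumulating(sources(count));
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return peakOpen;
    }

    public static void main(String[] args) {
        System.out.println("accumulating peak: " + measurePeak(false, 5));
        System.out.println("eager close peak:  " + measurePeak(true, 5));
    }
}
```

With the accumulating variant the peak grows with the number of inputs, while the eager variant never holds more than one source open. Whether real PDDocument objects can always be closed this early depends on how the merged document references their content, which this sketch does not model.
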
> These changes seem to be beneficial, in the sense that I can process the same
> number of PDFs with about half the memory.
> I'd appreciate it if you could roll these changes into the main codebase.
> (I've respected Java 6 compatibility.)
> I've attached the Java files of the new implementation:
> * Suppliers
> * Supplier
> * PDFMergerUtilityUsingSupplier
> PDFMergerUtilityUsingSupplier can replace the previous version: there are no
> signature changes, only internal code changes. (Just rename the class to
> PDFMergerUtility if you decide to implement the changes.)
> The attachment also contains some screenshots from VisualVM showing the
> memory usage of the original version and the refactored version, as well as
> some info produced by MAT after analysing the heap.
> If you know of any other means of merging large sets of PDF files into a
> single large PDF without running into memory issues, I'd love to hear about
> it!
> I'd also suggest making further improvements to memory usage in general, as
> PDFBox seems to consume a lot of memory.