[
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434416#comment-16434416
]
Gary Potagal commented on PDFBOX-4188:
--------------------------------------
[~tilman] - we created a breaking test; it is attached as
[^PDFBOX-4188-breakingTest.zip].
The patch is binary, so you need to apply it from the checked-out trunk
directory with the command:
trunk> patch -p0 --binary -i PDFBOX-4188-breakingTest.diff
patching file
pdfbox/src/test/java/org/apache/pdfbox/multipdf/PdfMergeUtilityPagesTest.java
patching file pdfbox/src/test/resources/input/merge/pages/pdf_sample_1.pdf
The test does the following:
# Creates four folders containing copies of the simple one-page
pdf_sample_1.pdf file. Each folder contains an increasing number of copies,
starting with 100: 100, 200, 300, 400. Each file is about 8K.
# Merges all files in each folder. The maxStorageBytes values in the test
are just enough to let the test pass. If you decrease them slightly, the
exception is thrown.
Output looks like this:
Apr 11, 2018 2:25:32 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest
runMergeTest
INFO: Test Name: pdf_sample_1-100pages; Files: 100; Pages: 100; Time(s): 0.781;
Pages/Second: 128.041; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 74;
Total Sources Size(K): 775; Merged File Size(K): 522; Ratio
MaxStorageBytes/Merged File Size: 145
Apr 11, 2018 2:25:34 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest
runMergeTest
INFO: Test Name: pdf_sample_1-200pages; Files: 200; Pages: 200; Time(s): 1.486;
Pages/Second: 134.590; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 315;
Total Sources Size(K): 1,551; Merged File Size(K): 1,042; Ratio
MaxStorageBytes/Merged File Size: 309
Apr 11, 2018 2:25:37 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest
runMergeTest
INFO: Test Name: pdf_sample_1-300pages; Files: 300; Pages: 300; Time(s): 3.532;
Pages/Second: 84.938; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 710;
Total Sources Size(K): 2,327; Merged File Size(K): 1,562; Ratio
MaxStorageBytes/Merged File Size: 465
Apr 11, 2018 2:25:42 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest
runMergeTest
INFO: Test Name: pdf_sample_1-400pages; Files: 400; Pages: 400; Time(s): 4.677;
Pages/Second: 85.525; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 1,240;
Total Sources Size(K): 3,103; Merged File Size(K): 2,082; Ratio
MaxStorageBytes/Merged File Size: 609
Apr 11, 2018 2:25:42 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest
testPerformanceMerge
INFO: Summary: Pages: 1000, Time(s): 10.476, Pages/Second: 95.456
As you can see, to merge 400 one-page 8K files, we need to set maxStorageBytes
to ~1.2 GB. The resulting file is ~2,000K.
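
The ratio column in the log can be reproduced from the other two columns. A minimal sketch, assuming the test divides MaxStorageBytes (MB, converted to K) by the merged file size (K) and truncates the result; the class and method names are illustrative and are not taken from PdfMergeUtilityPagesTest:

```java
class MergeStorageRatio {
    /**
     * Ratio of MaxStorageBytes (given in MB) to merged file size (given in K),
     * truncated to a whole number. The MB-to-K conversion is an assumption.
     */
    static long ratio(long maxStorageMb, long mergedFileK) {
        return maxStorageMb * 1024 / mergedFileK;
    }

    public static void main(String[] args) {
        // {files, MaxStorageBytes(MB), Merged File Size(K)} copied from the log above
        long[][] runs = {
            {100, 74, 522},
            {200, 315, 1042},
            {300, 710, 1562},
            {400, 1240, 2082},
        };
        for (long[] r : runs) {
            System.out.printf("files=%d ratio=%d%n", r[0], ratio(r[1], r[2]));
        }
    }
}
```

Under that assumption the four runs give 145, 309, 465 and 609, matching the log. The required storage roughly quadruples from 100 to 200 files, while the merged file only doubles.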
> "Maximum allowed scratch file memory exceeded." Exception when merging large
> number of small PDFs
> --------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.9, 3.0.0 PDFBox
> Reporter: Gary Potagal
> Priority: Major
> Attachments: PDFBOX-4188-breakingTest.zip
>
>
>
> On 06.04.2018 at 23:10, Gary Potagal wrote:
>
> We wanted to address one more merge issue in
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files. We use mixed mode, memory
> and disk for cache. Initially, we would often get "Maximum allowed scratch
> file memory exceeded.", unless we turned off the check by passing "-1" to
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting. I believe this
> is what the users who opened PDFBOX-3721 were running into.
> Our research indicates that the core issue with the memory model is that
> instead of sharing a single cache, it breaks it up into equal sized fixed
> partitions based on the number of input + output files being merged. This
> means that each partition must be big enough to hold the final output file.
> When 400 1-page files are merged, this creates 401 partitions, each of
> which needs to be big enough to hold the final 400 pages. Even worse, the
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page
> input files, and 1 x 400-page output file, or 800 pages.
> However, with the partitioned cache, we need to declare room for 401 x
> 400 pages, or 160,400 pages in total, when specifying "maxStorageBytes". This
> is a very large number, usually in the gigabytes.
>
> Given the current limitation that we need to keep all the input files open
> until the output file is written (HUGE), we came up with 2 options. (See
> PDFBOX-4182)
>
> 1. Good: Split the cache in ½, give ½ to the output file, and segment the
> other ½ across the input files. (Still keeping them open until the end.)
> 2. Better: Dynamically allocate in 16-page (64K) chunks from memory or disk
> on demand, and release cache as documents are closed after the merge. This is
> our current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are
> addressed.
>
> We would like to submit our current implementation as a patch for 2.0.10 and
> 3.0.0, unless this is already addressed.
>
> Thank you
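
The page counts in the issue description quoted above can be checked with a few lines of Java. This is only a sketch of the arithmetic as described there, not PDFBox code; the class and method names are made up for illustration:

```java
class PartitionedCacheEstimate {
    /**
     * Pages that must be declared when every file (inputs plus one output)
     * gets an equal partition, each sized to hold the complete output.
     */
    static long partitionedPages(long inputFiles, long pagesPerFile) {
        long partitions = inputFiles + 1;              // inputs plus one output
        long outputPages = inputFiles * pagesPerFile;  // each partition must fit the output
        return partitions * outputPages;
    }

    /** Pages actually held near the end of the merge: all inputs plus the output. */
    static long cachedPages(long inputFiles, long pagesPerFile) {
        long inputPages = inputFiles * pagesPerFile;
        long outputPages = inputPages;                 // the output contains every input page
        return inputPages + outputPages;
    }

    public static void main(String[] args) {
        long files = 400;
        long pagesPerFile = 1;
        System.out.println("declared (partitioned): " + partitionedPages(files, pagesPerFile));
        System.out.println("actually cached:        " + cachedPages(files, pagesPerFile));
    }
}
```

For 400 one-page files this gives 401 x 400 = 160,400 declared pages against 400 + 400 = 800 pages actually cached, which is why the partitioned limit grows roughly with the square of the number of files.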