[ 
https://issues.apache.org/jira/browse/PDFBOX-5602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722943#comment-17722943
 ] 

Zbigniew Minciel commented on PDFBOX-5602:
------------------------------------------

 

Attached again due to missing Self Time columns

 

*3.0.0-SNAPSHOT*

 

!cpu-hot-spots-3.0.0-SNAPSHOT.PNG!

 

*3.0.0-alpha3*

!cpu-hot-spots-3.0.0-alpha3.PNG!

 

> Consider adding support for PDF files Concatenation in addition to the  full 
> Merge
> ----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5602
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5602
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Utilities
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Zbigniew Minciel
>            Priority: Major
>         Attachments: CapturePdfDebugger.PNG, Large527MbytesPDF.PNG, 
> cpu-hot-spots-3.0.0-SNAPSHOT-1.PNG, cpu-hot-spots-3.0.0-SNAPSHOT.PNG, 
> cpu-hot-spots-3.0.0-alpha3-1.PNG, cpu-hot-spots-3.0.0-alpha3.PNG
>
>
> I decided to evaluate the limits of PDFBox 3.0.0-alpha3 when merging a large 
> number of PDF files.
> I attempted to merge 7500 emails stored as separate PDF files on Windows. 
> Because of the limit on the maximum size of the command-line arguments, I 
> merged the files in subsets and ended up with 5 large PDF files, each around 
> 500-600 MB. I then tried to merge these 5 files, but the merge eventually 
> failed after running for more than 6 hours. See the error log at the bottom. 
> My machine has 48 GB of RAM. PDFBox used up to 13 GB of memory; usage 
> fluctuated between 600 MB and 13 GB.
> I am wondering whether PDFBox could support a Concatenation mode in addition 
> to the full Merge mode, with no need to create an index table, etc. Given my 
> total lack of understanding of how PDF works, I suppose it could work as 
> follows:
>  # Read the first file, process it, and append it to the target PDF file. 
> Delete the PDF data and related metadata for this file, except perhaps the 
> last page number.
>  # Read the second file and process it in the same fashion as in step 1.
>  # etc.
> If Concatenation is possible, it would greatly reduce the CPU and memory 
> overhead and the processing time.
> I admit that merging such a large number of PDF files is not typical, but the 
> issue is valid.
> ^CException in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.base/java.util.Hashtable.rehash(Hashtable.java:419)
>     at java.base/java.util.Hashtable.addEntry(Hashtable.java:441)
>     at java.base/java.util.Hashtable.put(Hashtable.java:493)
>     at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBodyCompressed(COSWriter.java:481)
>     at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1260)
>     at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:402)
>     at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1542)
>     at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1418)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1018)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:963)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:982)
>     at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:476)
>     at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:355)
>     at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:339)
>     at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:76)
>     at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:37)
>     at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
>     at picocli.CommandLine.access$1300(CommandLine.java:145)
>     at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
>     at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
>     at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
>     at picocli.CommandLine.execute(CommandLine.java:2078)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
> Respectfully,
> Zbigniew
>  
>  
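
The subset merging described in the issue (splitting the 7500 input files so
that each merge command fits within the command-line length limit) can be
sketched as below. This is only an illustration and not part of PDFBox: the
class name MergeBatchPlanner, the method planBatches, the file names, and the
8000-character limit are all assumptions made for the example. It plans the
batches only; each batch would still have to be passed to the merge tool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
public class MergeBatchPlanner {

    /**
     * Splits the input paths into consecutive batches so that the combined
     * length of each batch (every path plus one separator character) stays
     * within maxChars. Input order is preserved.
     */
    public static List<List<String>> planBatches(List<String> paths, int maxChars) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int used = 0;
        for (String path : paths) {
            int cost = path.length() + 1; // +1 for the separating space
            if (cost > maxChars) {
                throw new IllegalArgumentException("Path longer than the limit: " + path);
            }
            // Start a new batch when this path would exceed the budget.
            if (used + cost > maxChars && !current.isEmpty()) {
                batches.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(path);
            used += cost;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 1; i <= 7500; i++) {
            paths.add(String.format("mail-%04d.pdf", i)); // hypothetical file names
        }
        // 8000 is an assumed budget; the real limit depends on the shell and OS.
        List<List<String>> batches = planBatches(paths, 8000);
        System.out.println(batches.size() + " batches planned");
    }
}
```

The budget should leave room for the rest of the command (the tool name, its
options, and the output file name), so in practice it would be set somewhat
below the actual limit of the shell in use.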



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
