[ https://issues.apache.org/jira/browse/PDFBOX-5602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722943#comment-17722943 ]
Zbigniew Minciel commented on PDFBOX-5602:
------------------------------------------

Attached again because the Self Time columns were missing.

*3.0.0-SNAPSHOT*

!cpu-hot-spots-3.0.0-SNAPSHOT.PNG!

*3.0.0-alpha3*

!cpu-hot-spots-3.0.0-alpha3.PNG!

> Consider adding support for PDF files Concatenation in addition to the full Merge
> ----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5602
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5602
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Utilities
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Zbigniew Minciel
>            Priority: Major
>         Attachments: CapturePdfDebugger.PNG, Large527MbytesPDF.PNG,
> cpu-hot-spots-3.0.0-SNAPSHOT-1.PNG, cpu-hot-spots-3.0.0-SNAPSHOT.PNG,
> cpu-hot-spots-3.0.0-alpha3-1.PNG, cpu-hot-spots-3.0.0-alpha3.PNG
>
>
> I decided to evaluate the limits of pdfbox 3.0.0-alpha3 when merging a large
> number of PDF files.
> I attempted to merge 7500 mails stored in separate PDF files on Windows. Given
> the limit on the maximum size of the command line arguments, I merged the
> files in subsets. I ended up with 5 large PDF files, each around 500-600
> MBytes. I then tried to merge these 5 files, but the merge eventually failed
> after running for more than 6 hours. See the error log at the bottom. My
> machine has 48 GBytes of RAM. PDFBox used at most 13 GB of memory; usage
> fluctuated between 600 MB and 13 GB.
> I am wondering whether PDFBox could support a Concatenation mode in addition
> to the full Merge mode, with no need to create an index table, etc. Given my
> total lack of understanding of how PDF works, I suppose it could work as
> follows:
> # Read the first file, process it and append it to the target PDF file. Delete the PDF data and related metadata for this file, except perhaps the last page number.
> # Read the second file and process it in the same fashion as in step 1.
> # etc.
> If Concatenation is possible, it would greatly reduce the CPU and memory
> overhead and shorten the processing time.
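The concatenation steps proposed in the description could be sketched roughly as follows. This is not PDFBox code: `ConcatSketch`, `Input`, `Target`, and `concatenate` are hypothetical names invented for illustration. The sketch models only the bookkeeping (one input handled at a time, with just the last page number carried forward) and does no actual PDF parsing or writing.

```java
import java.util.List;

// Hypothetical sketch, not PDFBox API: models only the bookkeeping of the
// proposed Concatenation mode.
class ConcatSketch {

    /** Stand-in for one input PDF: a file name and its page count. */
    static final class Input {
        final String name;
        final int pageCount;

        Input(String name, int pageCount) {
            this.name = name;
            this.pageCount = pageCount;
        }
    }

    /** Stand-in for the target file; a real sink would write the pages out. */
    interface Target {
        void append(Input input, int firstPageNumber);
    }

    /**
     * Appends the inputs one at a time. The only state kept between inputs
     * is the last page number written so far.
     */
    static int concatenate(List<Input> inputs, Target target) {
        int lastPage = 0;
        for (Input input : inputs) {
            target.append(input, lastPage + 1); // process and append this input
            lastPage += input.pageCount;        // carry over only the page number
        }
        return lastPage;
    }

    public static void main(String[] args) {
        List<Input> inputs = List.of(new Input("part1.pdf", 3), new Input("part2.pdf", 2));
        int total = concatenate(inputs, (input, first) ->
                System.out.println(input.name + " starts at page " + first));
        System.out.println("total pages: " + total);
    }
}
```

Whether a real PDF writer can discard an input's objects this early is exactly the open question of the issue; the sketch assumes it can and shows only what state would then remain.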
> I admit that merging such a large number of PDF files is not typical, but the
> issue is valid.
> ^CException in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.base/java.util.Hashtable.rehash(Hashtable.java:419)
>         at java.base/java.util.Hashtable.addEntry(Hashtable.java:441)
>         at java.base/java.util.Hashtable.put(Hashtable.java:493)
>         at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBodyCompressed(COSWriter.java:481)
>         at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1260)
>         at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:402)
>         at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1542)
>         at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1418)
>         at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1018)
>         at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:963)
>         at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:982)
>         at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:476)
>         at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:355)
>         at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:339)
>         at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:76)
>         at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:37)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
>         at picocli.CommandLine.access$1300(CommandLine.java:145)
>         at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
>         at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
>         at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
>         at picocli.CommandLine.execute(CommandLine.java:2078)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
> Respectfully,
> Zbigniew