It seems that PDFMergerUtility.appendDocument(PDDocument dest, PDDocument src) is the solution, since you can then load each document yourself with a RandomAccessFile scratch file.

On 11 November 2013 20:31, Jesus Jr M Salvo <[email protected]> wrote:
> Thanks.
>
> I tried adding the bookmark to the source PDFs upfront, but they are
> not merged into the merged PDF. However, using a scratch file /
> org.apache.pdfbox.io.RandomAccessFile worked pretty well to bring down
> memory usage. So am happy with that.
>
> Now the only thing left is the memory usage when actually merging. I
> was using PDFMergerUtility.addSource( File ) multiple times then doing
> a PDFMergerUtility.setDestinationStream() and
> PDFMergerUtility.mergeDocuments(). The memory usage when calling
> PDFMergerUtility.mergeDocuments() is the last bit where memory jumps
> quite high.
>
> On 9 November 2013 19:29, Maruan Sahyoun <[email protected]> wrote:
>> Hi,
>>
>> there are some possible improvements
>>
>> # add the bookmarks to the source files upfront - they will be merged into
>> the target
>> # use a scratch file when loading the PDFs e.g. PDDocument.load(InputStream
>> input, RandomAccess scratchFile) so temporary data is stored on file instead
>> of memory to lower the memory consumption during runtime
>> # enhance the way how the images are stored in the PDF e.g. by using a
>> different compression algorithm. This will be more complicated as you need
>> to preprocess your PDFs but maybe it's useful as it might help you to
>> produce smaller result files.
>>
>> BR
>>
>> Maruan Sahyoun
>>
>> On 09.11.2013 at 07:11, Jesus Jr M Salvo <[email protected]> wrote:
>>
>>> pdfbox-1.8.2
>>> tika-app-1.4 ( I'm including Apache Tika as I just found out that
>>> Apache Tika comes with pdfbox )
>>>
>>> I have various existing PDFs that I need to merge into one PDF. The
>>> number of PDFs to be merged into one can be varied .. anywhere from 2
>>> PDFs to 10 or 20 or more PDFs. The number of pages in each PDF to be
>>> merged can also be varied. These PDFs are mostly scanned via an EDRMS
>>> like HP TRIM7 ... so documents say like ... medical reports, etc ..
>>> end up as PDFs. Thus, each page of the PDF is an image instead of
>>> text.
>>>
>>> Merging them into a single PDF is no problem using the PDFMergerUtility.
>>>
>>> After I have merged them into a single PDF, I then need to add
>>> bookmarks so that the person reading the PDF ( e.g. insurer, trustee )
>>> can quickly jump to a section of the merged PDF to see one of the
>>> merged PDFs.
>>>
>>> The issue is the memory consumption .. the merged PDF tend to be quite
>>> large ( anywhere from 200MB to 1GB ... again because each individual
>>> PDF were scanned via an EDRMS like HP TRIM7 so each page tend to be an
>>> image ). Now having multiple of these merges run in parallel, and I
>>> can easily consume the entire heap allocated to the JVM.
>>>
>>> To create the bookmarks, I have to open the large / merged PDF.
>>>
>>> So the question is, is there a better way of creating bookmarks so as
>>> that the amount of memory consumed is minimal ?
>>>
>>> Note that I am making sure I am calling PDDocument.close() in a
>>> finally clause. See snippets below.
>>>
>>>
>>> 1) To create the bookmarks, I have to find out the number of pages in
>>> each PDF before they are merged. Something like in a loop:
>>>
>>> PDDocument document = null;
>>> try {
>>>     document = PDDocument.load(aDownload.getLocalFile());
>>>     aDownload.setNumberOfPages( document.getNumberOfPages() );
>>> } finally {
>>>     if( document != null ) {
>>>         document.close();
>>>     }
>>> }
>>>
>>> 2) Then I have to open the large / merged PDF file, then create the
>>> bookmarks using the number of pages as the guide from above ( And I
>>> also have to set the meta-data ... the author, date/time, subject on
>>> the PDF ):
>>>
>>> private void finaliseDocument(
>>>         final File pdfFile,
>>>         final List<DocumentDownloadEntry> downloadEntries )
>>>         throws Exception
>>> {
>>>     logger.log(Level.INFO, String.format("Finalising PDF document %s",
>>>         pdfFile.toString()));
>>>     PDDocument document = null;
>>>     try {
>>>         document = PDDocument.load(pdfFile);
>>>
>>>         document.getDocumentCatalog().setPageMode(PDDocumentCatalog.PAGE_MODE_USE_OUTLINES);
>>>
>>>         document.getDocumentInformation().setCreationDate(Calendar.getInstance());
>>>         document.getDocumentInformation().setAuthor(getUserName());
>>>
>>>         document.getDocumentInformation().setTitle(getClaimDocuments().getEquipId()
>>>             + " - " + getSubmissionType());
>>>         makeBookmarks( document, downloadEntries );
>>>         document.save(pdfFile);
>>>     } finally {
>>>         if( document != null ) {
>>>             document.close();
>>>         }
>>>     }
>>> }
>>>
>>> private void makeBookmarks(
>>>         final PDDocument document,
>>>         final List<DocumentDownloadEntry> downloadEntries)
>>>         throws Exception
>>> {
>>>     PDDocumentOutline outline = new PDDocumentOutline();
>>>     document.getDocumentCatalog().setDocumentOutline( outline );
>>>     PDOutlineItem pagesOutline = new PDOutlineItem();
>>>     pagesOutline.setTitle( document.getDocumentInformation().getTitle()
>>>         );
>>>     outline.appendChild( pagesOutline );
>>>
>>>     @SuppressWarnings("rawtypes")
>>>     List pages = document.getDocumentCatalog().getAllPages();
>>>     int pageIndex = 0;
>>>     for( DocumentDownloadEntry aDownload : downloadEntries ) {
>>>         if( aDownload.isDownload() && aDownload.isDownloaded() ) {
>>>             PDPage page = (PDPage)pages.get( pageIndex );
>>>             pageIndex += aDownload.getNumberOfPages();
>>>
>>>             PDPageFitWidthDestination dest = new
>>>                 PDPageFitWidthDestination();
>>>             dest.setPage( page );
>>>             PDOutlineItem bookmark = new PDOutlineItem();
>>>             bookmark.setDestination( dest );
>>>
>>>             bookmark.setTitle( aDownload.getDocumentName() );
>>>             pagesOutline.appendChild( bookmark );
>>>         }
>>>     }
>>>     pagesOutline.openNode();
>>>     outline.openNode();
>>> }
>>
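
For reference, the page arithmetic that makeBookmarks in the quoted message relies on can be isolated from PDFBox entirely: the first page of each source in the merged file is the running total of the page counts that precede it. The sketch below is only an illustration of that bookkeeping. The class and method names (BookmarkOffsets, firstPageIndexes) are hypothetical and not part of PDFBox, and it assumes the list already contains only the documents that were actually merged, in merge order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: pure page-offset bookkeeping.
public class BookmarkOffsets {

    /**
     * Returns the zero-based index, within the merged PDF, of the first page
     * of each source document. pageCounts must list the merged documents only,
     * in the order they were merged.
     */
    public static List<Integer> firstPageIndexes(List<Integer> pageCounts) {
        List<Integer> starts = new ArrayList<Integer>();
        int pageIndex = 0;
        for (int count : pageCounts) {
            starts.add(pageIndex);   // this document starts where the previous one ended
            pageIndex += count;      // advance past this document's pages
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three sources with 3, 10 and 1 pages respectively.
        List<Integer> starts = firstPageIndexes(Arrays.asList(3, 10, 1));
        System.out.println(starts); // prints [0, 3, 13]
    }
}
```

Since these indexes depend only on the page counts gathered in step 1, they can be computed before any merging happens, whichever way the merge itself is done.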

