Thanks. I tried adding the bookmarks to the source PDFs upfront, but they are not carried over into the merged PDF. However, using a scratch file (org.apache.pdfbox.io.RandomAccessFile) worked pretty well to bring down memory usage, so I am happy with that.
Now the only thing left is the memory usage during the merge itself. I was calling PDFMergerUtility.addSource( File ) multiple times, then PDFMergerUtility.setDestinationStream() and PDFMergerUtility.mergeDocuments(). The call to PDFMergerUtility.mergeDocuments() is the last place where memory usage jumps quite high.

On 9 November 2013 19:29, Maruan Sahyoun <[email protected]> wrote:
> Hi,
>
> there are some possible improvements
>
> # add the bookmarks to the source files upfront - they will be merged into
>   the target
> # use a scratch file when loading the PDFs e.g.
>   PDDocument.load(InputStream input, RandomAccess scratchFile) so temporary
>   data is stored on file instead of memory to lower the memory consumption
>   during runtime
> # enhance the way how the images are stored in the PDF e.g. by using a
>   different compression algorithm. This will be more complicated as you need
>   to preprocess your PDFs but maybe it's useful as it might help you to
>   produce smaller result files.
>
> BR
>
> Maruan Sahyoun
>
> On 09.11.2013 at 07:11, Jesus Jr M Salvo <[email protected]> wrote:
>
>> pdfbox-1.8.2
>> tika-app-1.4 ( I'm including Apache Tika as I just found out that
>> Apache Tika comes with pdfbox )
>>
>> I have various existing PDFs that I need to merge into one PDF. The
>> number of PDFs to be merged into one can be varied .. anywhere from 2
>> PDFs to 10 or 20 or more PDFs. The number of pages in each PDF to be
>> merged can also be varied. These PDFs are mostly scanned via an EDRMS
>> like HP TRIM7 ... so documents say like ... medical reports, etc ..
>> end up as PDFs. Thus, each page of the PDF is an image instead of
>> text.
>>
>> Merging them into a single PDF is no problem using the PDFMergerUtility.
>>
>> After I have merged them into a single PDF, I then need to add
>> bookmarks so that the person reading the PDF ( e.g. insurer, trustee )
>> can quickly jump to a section of the merged PDF to see one of the
>> merged PDFs.
>>
>> The issue is the memory consumption .. the merged PDFs tend to be quite
>> large ( anywhere from 200MB to 1GB ... again because each individual
>> PDF was scanned via an EDRMS like HP TRIM7 so each page tends to be an
>> image ). Now having multiple of these merges run in parallel, and I
>> can easily consume the entire heap allocated to the JVM.
>>
>> To create the bookmarks, I have to open the large / merged PDF.
>>
>> So the question is, is there a better way of creating bookmarks so
>> that the amount of memory consumed is minimal ?
>>
>> Note that I am making sure I am calling PDDocument.close() in a
>> finally clause. See snippets below.
>>
>>
>> 1) To create the bookmarks, I have to find out the number of pages in
>> each PDF before they are merged. Something like in a loop:
>>
>> PDDocument document = null;
>> try {
>>     document = PDDocument.load(aDownload.getLocalFile());
>>     aDownload.setNumberOfPages( document.getNumberOfPages() );
>> } finally {
>>     if( document != null ) {
>>         document.close();
>>     }
>> }
>>
>> 2) Then I have to open the large / merged PDF file, then create the
>> bookmarks using the number of pages as the guide from above ( And I
>> also have to set the meta-data ...
>> the author, date/time, subject on the PDF ):
>>
>> private void finaliseDocument(
>>         final File pdfFile,
>>         final List<DocumentDownloadEntry> downloadEntries )
>>         throws Exception
>> {
>>     logger.log(Level.INFO, String.format("Finalising PDF document %s",
>>             pdfFile.toString()));
>>     PDDocument document = null;
>>     try {
>>         document = PDDocument.load(pdfFile);
>>
>>         document.getDocumentCatalog().setPageMode(PDDocumentCatalog.PAGE_MODE_USE_OUTLINES);
>>
>>         document.getDocumentInformation().setCreationDate(Calendar.getInstance());
>>         document.getDocumentInformation().setAuthor(getUserName());
>>
>>         document.getDocumentInformation().setTitle(getClaimDocuments().getEquipId()
>>                 + " - " + getSubmissionType());
>>         makeBookmarks( document, downloadEntries );
>>         document.save(pdfFile);
>>     } finally {
>>         if( document != null ) {
>>             document.close();
>>         }
>>     }
>> }
>>
>> private void makeBookmarks(
>>         final PDDocument document,
>>         final List<DocumentDownloadEntry> downloadEntries)
>>         throws Exception
>> {
>>     PDDocumentOutline outline = new PDDocumentOutline();
>>     document.getDocumentCatalog().setDocumentOutline( outline );
>>     PDOutlineItem pagesOutline = new PDOutlineItem();
>>     pagesOutline.setTitle( document.getDocumentInformation().getTitle() );
>>     outline.appendChild( pagesOutline );
>>
>>     @SuppressWarnings("rawtypes")
>>     List pages = document.getDocumentCatalog().getAllPages();
>>     int pageIndex = 0;
>>     for( DocumentDownloadEntry aDownload : downloadEntries ) {
>>         if( aDownload.isDownload() && aDownload.isDownloaded() ) {
>>             PDPage page = (PDPage)pages.get( pageIndex );
>>             pageIndex += aDownload.getNumberOfPages();
>>
>>             PDPageFitWidthDestination dest = new PDPageFitWidthDestination();
>>             dest.setPage( page );
>>             PDOutlineItem bookmark = new PDOutlineItem();
>>             bookmark.setDestination( dest );
>>
>>             bookmark.setTitle( aDownload.getDocumentName() );
>>             pagesOutline.appendChild( bookmark );
>>         }
>>     }
>>     pagesOutline.openNode();
>>     outline.openNode();
>> }
>
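As a side note on step 2) in the quoted message: the page bookkeeping that makeBookmarks relies on can be checked on its own, without opening any PDF. Below is a minimal sketch, assuming the page counts from step 1) are collected in the same order the files are merged. The class name BookmarkOffsets and the method startPages are made up for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a PDFBox class): plain-Java version of the
// pageIndex accumulation used when creating the bookmarks.
public class BookmarkOffsets {

    /**
     * Given the page count of each source PDF in merge order, returns the
     * zero-based index of each source's first page in the merged PDF.
     */
    public static List<Integer> startPages(List<Integer> pageCounts) {
        List<Integer> starts = new ArrayList<Integer>();
        int pageIndex = 0;
        for (int count : pageCounts) {
            if (count < 0) {
                throw new IllegalArgumentException("negative page count: " + count);
            }
            starts.add(pageIndex);   // the bookmark for this source points here
            pageIndex += count;      // the next source starts after these pages
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three source PDFs with 3, 5 and 2 pages.
        System.out.println(startPages(Arrays.asList(3, 5, 2))); // prints [0, 3, 8]
    }
}
```

Each returned index would be the value passed to pages.get( pageIndex ) for the corresponding bookmark, provided only the entries that were actually merged are included in the list.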

