Re: Memory usage when counting the number of pages and creating bookmarks for large PDFs

Jesus Jr M Salvo Mon, 11 Nov 2013 01:32:58 -0800

Thanks.

I tried adding the bookmarks to the source PDFs upfront, but they were
not carried over into the merged PDF. However, loading with a scratch
file ( org.apache.pdfbox.io.RandomAccessFile ) worked well and brought
memory usage down, so I am happy with that.


Now the only thing left is the memory usage during the actual merge. I
call PDFMergerUtility.addSource( File ) once per source PDF, then
PDFMergerUtility.setDestinationStream() followed by
PDFMergerUtility.mergeDocuments(). The call to
PDFMergerUtility.mergeDocuments() is the last place where memory usage
jumps quite high.





On 9 November 2013 19:29, Maruan Sahyoun <[email protected]> wrote:
> Hi,
>
> there are some possible improvements
>
> # add the bookmarks to the source files upfront - they will be merged into
> the target
> # use a scratch file when loading the PDFs, e.g. PDDocument.load(InputStream
> input, RandomAccess scratchFile), so temporary data is stored on disk instead
> of in memory, which lowers memory consumption at runtime
> # improve how the images are stored in the PDF, e.g. by using a different
> compression algorithm. This is more complicated, as you need to preprocess
> your PDFs, but it might also help you produce smaller result files.
>
> BR
>
> Maruan Sahyoun
>
> Am 09.11.2013 um 07:11 schrieb Jesus Jr M Salvo <[email protected]>:
>
>> pdfbox-1.8.2
>> tika-app-1.4 ( I'm including Apache Tika as I just found out that
>> Apache Tika comes with pdfbox )
>>
>> I have various existing PDFs that I need to merge into one PDF. The
>> number of PDFs to be merged varies: anywhere from 2 to 10, 20 or
>> more. The number of pages in each PDF also varies. These PDFs are
>> mostly scanned via an EDRMS like HP TRIM7, so documents such as
>> medical reports end up as PDFs in which each page is an image instead
>> of text.
>>
>> Merging them into a single PDF is no problem using the PDFMergerUtility.
>>
>> After I have merged them into a single PDF, I then need to add
>> bookmarks so that the person reading the PDF ( e.g. insurer, trustee )
>> can quickly jump to a section of the merged PDF to see one of the
>> merged PDFs.
>>
>> The issue is the memory consumption. The merged PDFs tend to be quite
>> large ( anywhere from 200MB to 1GB, again because each individual PDF
>> was scanned via an EDRMS like HP TRIM7, so each page tends to be an
>> image ). With several of these merges running in parallel, I can
>> easily consume the entire heap allocated to the JVM.
>>
>> To create the bookmarks, I have to open the large / merged PDF.
>>
>> So the question is: is there a better way of creating the bookmarks so
>> that the amount of memory consumed is minimal?
>>
>> Note that I am making sure I am calling PDDocument.close() in a
>> finally clause. See snippets below.
>>
>>
>> 1) To create the bookmarks, I have to find out the number of pages in
>> each PDF before they are merged. Something like in a loop:
>>
>> PDDocument document = null;
>> try {
>>    document = PDDocument.load(aDownload.getLocalFile());
>>    aDownload.setNumberOfPages( document.getNumberOfPages() );
>> } finally {
>>    if( document != null ) {
>>        document.close();
>>    }
>> }
>>
>> 2) Then I have to open the large / merged PDF file, then create the
>> bookmarks using the number of pages as the guide from above ( And I
>> also have to set the meta-data ... the author, date/time, subject on
>> the PDF ):
>>
>> private void finaliseDocument(
>> final File pdfFile,
>> final List<DocumentDownloadEntry> downloadEntries )
>> throws Exception
>> {
>>    logger.log(Level.INFO,
>>        String.format("Finalising PDF document %s", pdfFile.toString()));
>>    PDDocument document = null;
>>    try {
>>        document = PDDocument.load(pdfFile);
>>        document.getDocumentCatalog().setPageMode(
>>            PDDocumentCatalog.PAGE_MODE_USE_OUTLINES);
>>        document.getDocumentInformation().setCreationDate(
>>            Calendar.getInstance());
>>        document.getDocumentInformation().setAuthor(getUserName());
>>        document.getDocumentInformation().setTitle(
>>            getClaimDocuments().getEquipId() + " - " + getSubmissionType());
>>        makeBookmarks( document, downloadEntries );
>>        document.save(pdfFile);
>>    } finally {
>>        if( document != null ) {
>>            document.close();
>>        }
>>    }
>> }
>>
>> private void makeBookmarks(
>> final PDDocument document,
>> final List<DocumentDownloadEntry> downloadEntries)
>> throws Exception
>> {
>>    PDDocumentOutline outline = new PDDocumentOutline();
>>    document.getDocumentCatalog().setDocumentOutline( outline );
>>    PDOutlineItem pagesOutline = new PDOutlineItem();
>>    pagesOutline.setTitle( document.getDocumentInformation().getTitle() );
>>    outline.appendChild( pagesOutline );
>>
>>    @SuppressWarnings("rawtypes")
>>    List pages = document.getDocumentCatalog().getAllPages();
>>    int pageIndex = 0;
>>    for( DocumentDownloadEntry aDownload : downloadEntries ) {
>>        if( aDownload.isDownload() && aDownload.isDownloaded() ) {
>>            PDPage page = (PDPage)pages.get( pageIndex );
>>            pageIndex += aDownload.getNumberOfPages();
>>
>>            PDPageFitWidthDestination dest = new PDPageFitWidthDestination();
>>            dest.setPage( page );
>>            PDOutlineItem bookmark = new PDOutlineItem();
>>            bookmark.setDestination( dest );
>>
>>            bookmark.setTitle( aDownload.getDocumentName() );
>>            pagesOutline.appendChild( bookmark );
>>        }
>>    }
>>    pagesOutline.openNode();
>>    outline.openNode();
>> }
>
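
The page-index bookkeeping in the quoted makeBookmarks() can be tried out
without PDFBox. The following is only a minimal sketch in plain Java: the
class name BookmarkOffsets and the method firstPageIndexes are hypothetical
and do not come from the thread or from the PDFBox API. It assumes the page
counts are supplied in the same order in which the sources are merged, and
returns the zero-based index of the first page of each source in the merged
PDF, which is the page each bookmark would point at.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: mirrors the pageIndex
// arithmetic of the quoted makeBookmarks() loop.
public class BookmarkOffsets {

    /**
     * Given the page count of each source PDF in merge order, returns the
     * zero-based index of each source's first page in the merged document.
     */
    public static List<Integer> firstPageIndexes(List<Integer> pageCounts) {
        List<Integer> starts = new ArrayList<>();
        int next = 0; // index of the next unclaimed page in the merged PDF
        for (int count : pageCounts) {
            if (count < 0) {
                throw new IllegalArgumentException("negative page count: " + count);
            }
            starts.add(next);
            next += count; // skip past this source's pages
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three sources with 3, 10 and 1 pages.
        System.out.println(firstPageIndexes(Arrays.asList(3, 10, 1)));
        // prints [0, 3, 13]
    }
}
```

Because the offsets depend only on the page counts gathered in step 1, they
can be computed before the merged file is opened.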
