Re: Memory usage when counting the number of pages and creating bookmarks for large PDFs

Jesus Jr M Salvo Mon, 11 Nov 2013 02:07:45 -0800

It seems that PDFMergerUtility.appendDocument(PDDocument dest, PDDocument
src) is the solution, as each PDDocument can then be loaded with a
RandomAccessFile scratch file.


On 11 November 2013 20:31, Jesus Jr M Salvo <[email protected]> wrote:
> Thanks.
>
> I tried adding the bookmarks to the source PDFs upfront, but they were
> not carried over into the merged PDF. However, using a scratch file /
> org.apache.pdfbox.io.RandomAccessFile worked pretty well to bring down
> memory usage. So I am happy with that.
>
> Now the only thing left is the memory usage when actually merging. I
> was using PDFMergerUtility.addSource( File ) multiple times then doing
> a PDFMergerUtility.setDestinationStream() and
> PDFMergerUtility.mergeDocuments(). The memory usage when calling
> PDFMergerUtility.mergeDocuments() is the last bit where memory jumps
> quite high.
>
> On 9 November 2013 19:29, Maruan Sahyoun <[email protected]> wrote:
>> Hi,
>>
>> There are some possible improvements:
>>
>> # Add the bookmarks to the source files upfront; they will be merged into
>> the target.
>> # Use a scratch file when loading the PDFs, e.g. PDDocument.load(InputStream
>> input, RandomAccess scratchFile), so that temporary data is stored on disk
>> instead of in memory, lowering memory consumption at runtime.
>> # Improve how the images are stored in the PDF, e.g. by using a
>> different compression algorithm. This is more complicated, as you need
>> to preprocess your PDFs, but it might help you produce smaller result
>> files.
>>
>> BR
>>
>> Maruan Sahyoun
>>
>> On 09.11.2013 at 07:11, Jesus Jr M Salvo <[email protected]> wrote:
>>
>>> pdfbox-1.8.2
>>> tika-app-1.4 ( I'm including Apache Tika as I just found out that
>>> Apache Tika comes with pdfbox )
>>>
>>> I have various existing PDFs that I need to merge into one PDF. The
>>> number of PDFs to be merged varies, anywhere from 2 to 10, 20 or
>>> more. The number of pages in each PDF also varies. These PDFs are
>>> mostly scanned via an EDRMS like HP TRIM7, so documents such as
>>> medical reports end up as PDFs. Thus, each page of the PDF is an
>>> image instead of text.
>>>
>>> Merging them into a single PDF is no problem using the PDFMergerUtility.
>>>
>>> After I have merged them into a single PDF, I then need to add
>>> bookmarks so that the person reading the PDF ( e.g. insurer, trustee )
>>> can quickly jump to a section of the merged PDF to see one of the
>>> merged PDFs.
>>>
>>> The issue is the memory consumption: the merged PDFs tend to be quite
>>> large ( anywhere from 200MB to 1GB, again because each individual
>>> PDF was scanned via an EDRMS like HP TRIM7, so each page tends to be an
>>> image ). With several of these merges running in parallel, I can
>>> easily consume the entire heap allocated to the JVM.
>>>
>>> To create the bookmarks, I have to open the large / merged PDF.
>>>
>>> So the question is: is there a better way of creating bookmarks so
>>> that the amount of memory consumed is minimal?
>>>
>>> Note that I am making sure I am calling PDDocument.close() in a
>>> finally clause. See snippets below.
>>>
>>>
>>> 1) To create the bookmarks, I have to find out the number of pages in
>>> each PDF before they are merged. Something like in a loop:
>>>
>>> PDDocument document = null;
>>> try {
>>>    document = PDDocument.load(aDownload.getLocalFile());
>>>    aDownload.setNumberOfPages( document.getNumberOfPages() );
>>> } finally {
>>>    if( document != null ) {
>>>        document.close();
>>>    }
>>> }
>>>
>>> 2) Then I have to open the large / merged PDF file, then create the
>>> bookmarks using the number of pages as the guide from above ( And I
>>> also have to set the meta-data ... the author, date/time, subject on
>>> the PDF ):
>>>
>>> private void finaliseDocument(
>>>        final File pdfFile,
>>>        final List<DocumentDownloadEntry> downloadEntries )
>>>        throws Exception
>>> {
>>>    logger.log(Level.INFO, String.format("Finalising PDF document %s", pdfFile.toString()));
>>>    PDDocument document = null;
>>>    try {
>>>        document = PDDocument.load(pdfFile);
>>>        document.getDocumentCatalog().setPageMode(PDDocumentCatalog.PAGE_MODE_USE_OUTLINES);
>>>        document.getDocumentInformation().setCreationDate(Calendar.getInstance());
>>>        document.getDocumentInformation().setAuthor(getUserName());
>>>        document.getDocumentInformation().setTitle(
>>>                getClaimDocuments().getEquipId() + " - " + getSubmissionType());
>>>        makeBookmarks( document, downloadEntries );
>>>        document.save(pdfFile);
>>>    } finally {
>>>        if( document != null ) {
>>>            document.close();
>>>        }
>>>    }
>>> }
>>>
>>> private void makeBookmarks(
>>>        final PDDocument document,
>>>        final List<DocumentDownloadEntry> downloadEntries)
>>>        throws Exception
>>> {
>>>    PDDocumentOutline outline = new PDDocumentOutline();
>>>    document.getDocumentCatalog().setDocumentOutline( outline );
>>>    PDOutlineItem pagesOutline = new PDOutlineItem();
>>>    pagesOutline.setTitle( document.getDocumentInformation().getTitle() );
>>>    outline.appendChild( pagesOutline );
>>>
>>>    @SuppressWarnings("rawtypes")
>>>    List pages = document.getDocumentCatalog().getAllPages();
>>>    int pageIndex = 0;
>>>    for( DocumentDownloadEntry aDownload : downloadEntries ) {
>>>        if( aDownload.isDownload() && aDownload.isDownloaded() ) {
>>>            PDPage page = (PDPage)pages.get( pageIndex );
>>>            pageIndex += aDownload.getNumberOfPages();
>>>
>>>            PDPageFitWidthDestination dest = new PDPageFitWidthDestination();
>>>            dest.setPage( page );
>>>            PDOutlineItem bookmark = new PDOutlineItem();
>>>            bookmark.setDestination( dest );
>>>
>>>            bookmark.setTitle( aDownload.getDocumentName() );
>>>            pagesOutline.appendChild( bookmark );
>>>        }
>>>    }
>>>    pagesOutline.openNode();
>>>    outline.openNode();
>>> }
>>
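
The bookmark loop in the original post depends on one piece of bookkeeping: each source PDF's first page sits at the running total of the page counts of the PDFs merged before it. Below is a minimal sketch of just that arithmetic in plain Java, with no PDFBox calls. The class and method names (BookmarkOffsets, firstPageIndexes) are hypothetical, and the sketch assumes the page counts are supplied in the same order in which the PDFs were merged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: computes where each source PDF
// starts inside the merged PDF, given the page count of every source PDF.
final class BookmarkOffsets {

    private BookmarkOffsets() {
        // static helper only
    }

    /**
     * Returns, for each source PDF, the zero-based index of its first page
     * in the merged PDF. Assumes pageCounts is in merge order.
     */
    static List<Integer> firstPageIndexes(List<Integer> pageCounts) {
        List<Integer> starts = new ArrayList<Integer>(pageCounts.size());
        int pageIndex = 0; // running total of pages merged so far
        for (int count : pageCounts) {
            if (count < 0) {
                throw new IllegalArgumentException("negative page count: " + count);
            }
            starts.add(pageIndex); // this PDF begins where the previous one ended
            pageIndex += count;    // advance past this PDF's pages
        }
        return starts;
    }
}
```

For page counts of 3, 5 and 2, the method returns the indexes 0, 3 and 8. Those are the values the loop in makeBookmarks passes to pages.get( pageIndex ) for the first, second and third bookmark.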
