[jira] [Commented] (PDFBOX-2301) RandomAccessBuffer consumes too much memory.

John Hewson (JIRA) Thu, 18 Sep 2014 12:27:58 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139369#comment-14139369
 ]


John Hewson commented on PDFBOX-2301:
-------------------------------------

In response to Andreas' third problem:

{quote}
find a way to release resources during processing a pdf, e.g. after each page
{quote}

Currently we cache page resources in PDResources, which belongs to a specific 
PDPage. This causes two problems: 1) users who hold many PDPage objects in 
memory will see high memory use (though this is often by accident*); 2) by 
caching resources in PDPage we only keep that cache for the lifetime of the 
page, which in PDFRenderer, for example, is a single page. That means a font 
which appears on 40 pages has to be parsed 40 times, which causes slow running 
times as well as memory thrashing, because objects are destroyed frequently 
only to be re-created.

What PDFRenderer really needs is not page-wide caching but document-wide 
caching, so that it can cache fonts, cmaps, color profiles, etc. only once. But 
that won't work for images, because they're too large. What we're beginning to 
realise is that caching is use-case specific and probably shouldn't be built-in 
to PDFBox's pdmodel. Instead we should remove resource caching from 
PDPage/PDResources and implement custom caching in PDFRenderer and other 
downstream classes such as PDFTextStripper. I'll happily volunteer myself. The 
existing high-level PDFBox APIs will continue to "just work" and power users 
will get a level of control that they appreciate.
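
As a rough illustration of the document-wide caching idea, here is a minimal 
sketch. DocumentResourceCache, its parser function and parseCount() are 
hypothetical names invented for this example and are not part of PDFBox; the 
parser stands in for whatever expensive step turns a resource key into a font, 
cmap or color profile.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical document-scoped cache (an assumption, not a PDFBox class).
// A consumer such as a renderer would own one instance per document.
final class DocumentResourceCache<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> parser; // expensive parse step (assumed)
    private int parseCount = 0;

    DocumentResourceCache(Function<K, V> parser) {
        this.parser = parser;
    }

    // Returns the cached value, parsing it only on the first request.
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = parser.apply(key);
            parseCount++;
            cache.put(key, value);
        }
        return value;
    }

    // Number of times the parser actually ran.
    int parseCount() {
        return parseCount;
    }

    // Lets the owner drop everything, e.g. when the document is closed.
    void clear() {
        cache.clear();
    }
}
```

With a cache like this held for the whole document, requesting the same font 
key on each of 40 pages would run the parser once rather than 40 times. Large 
objects such as images would simply not be routed through it.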

This strategy could be enhanced by removing memory-hungry methods on 
PDResources such as getFonts() and getXObjects() which force all resources of a 
particular type to be loaded, whether or not they are needed, or actually used 
in the content stream. They would be replaced by methods to retrieve a single 
resource, e.g. getFont(name).
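
The difference between bulk and single-resource lookup can be sketched as 
follows. LazyResources is a hypothetical stand-in for a resource dictionary, 
not the PDResources API; getAll() mimics the getFonts() pattern and get(name) 
mimics the proposed getFont(name) pattern, with a counter showing how much 
parsing each one triggers.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a page's resource dictionary; NOT the PDResources API.
final class LazyResources<R> {
    private final Map<String, String> rawEntries; // resource name -> unparsed data
    private final Function<String, R> parser;     // expensive parse step (assumed)
    private int parseCount = 0;

    LazyResources(Map<String, String> rawEntries, Function<String, R> parser) {
        this.rawEntries = rawEntries;
        this.parser = parser;
    }

    // Eager style, analogous to getFonts(): parses every entry, used or not.
    Map<String, R> getAll() {
        Map<String, R> all = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : rawEntries.entrySet()) {
            all.put(e.getKey(), parse(e.getValue()));
        }
        return all;
    }

    // Lazy style, analogous to the proposed getFont(name): parses one entry only.
    R get(String name) {
        String raw = rawEntries.get(name);
        return raw == null ? null : parse(raw);
    }

    private R parse(String raw) {
        parseCount++;
        return parser.apply(raw);
    }

    // Number of entries parsed so far.
    int parseCount() {
        return parseCount;
    }
}
```

In this sketch a caller that needs only one of three entries pays for one 
parse through get(name), whereas getAll() pays for all three.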

---

\* There probably isn't a legitimate use case for 1) any more, as we've solved 
the issues which we used to have with image caching (in fact, the clearCache() 
method no longer needs to be called by PDFRenderer, though it currently is). 
The real problem is that it's easy to accidentally retain PDPage objects. The 
PDDocument#getDocumentCatalog().getAllPages() method is dangerous, as looping 
over it will cause pages to be retained during processing, like so:

{code}
for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
{
     // ... this is idiomatic in PDFBox 1.8
} 
// List returned by getAllPages() kept in scope until here (bad)
{code}

I added a couple of methods a while ago to avoid this by fetching each PDPage 
one at a time, and these are now used internally in PDFBox to avoid the memory 
problems we used to have:

{code}
for (int i = 0; i < document.getNumberOfPages(); i++)
{
    PDPage page = document.getPage(i);
    // ... this is the new 2.0 way
    // current page falls out of scope here (good)
}
{code}

To solve this problem, we could change getAllPages() so that instead of 
returning a List it returns an Iterator<PDPage>, which would provide a nicer 
API than getPage(int) and most existing code will continue to work. This is 
also an opportunity to fix type safety issues due to PDPageNode and incorrect 
handling of the page tree (this is similar to the issue we had with the 
acroform field tree).
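
A lazily loading page sequence along these lines might look like the sketch 
below. LazyPages and its loader function are hypothetical names used only for 
illustration; the loader stands in for a call such as document.getPage(i). 
Since Java's for-each loop needs an Iterable rather than a bare Iterator, the 
sketch wraps the iterator in one, which is an assumption about how such an API 
could be shaped rather than a description of PDFBox.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

// Hypothetical lazy page sequence (an assumption, not a PDFBox class).
// Each element is fetched only when the iterator reaches it, so no list
// of all pages is ever held by the sequence itself.
final class LazyPages<P> implements Iterable<P> {
    private final int count;            // total number of pages
    private final IntFunction<P> loader; // fetches one page by index (assumed)

    LazyPages(int count, IntFunction<P> loader) {
        this.count = count;
        this.loader = loader;
    }

    @Override
    public Iterator<P> iterator() {
        return new Iterator<P>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < count;
            }

            @Override
            public P next() {
                if (index >= count) {
                    throw new NoSuchElementException();
                }
                // Load exactly one page; earlier pages are not retained here.
                return loader.apply(index++);
            }
        };
    }
}
```

A caller could then write a for-each loop over such an object and each page 
would become unreachable once the loop body moves on, provided the caller does 
not store it elsewhere.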

> RandomAccessBuffer consumes too much memory.
> --------------------------------------------
>
>                 Key: PDFBOX-2301
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2301
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: gee
>            Assignee: Andreas Lehmkühler
>             Fix For: 2.0.0
>
>         Attachments: clone.diff, clone2.diff, clone3.diff
>
>
> RandomAccessBuffer holds the uncompressed image during operation because that 
> is exactly what pdfbox ExtractImages does.
> But holding the uncompressed image in memory instead of the compressed one 
> consumes too much memory, not excluding the many PDF XObjects that can use a 
> filter to compress themselves. It would be good if pdfbox provided an option 
> that reverts to the COSObject state just before the RandomAccess object was 
> created (the state in which the PDF XObject stream has been parsed and the 
> COSDictionary objects haven't been created because the user hasn't requested 
> them using a get____() method). It is a crucial feature so that pdfbox can 
> analyze huge PDF files (>100MB).
> In the current source, one must close the COSStream unless required (and I 
> know a closed stream cannot be reopened again).
> Class Name                                                                           | Shallow Heap | Retained Heap
> -------------------------------------------------------------------------------------------------------------------
> org.apache.pdfbox.cos.COSObject @ 0x5ad4940                                          |           24 |     8,187,264
> |- <class> class org.apache.pdfbox.cos.COSObject @ 0x58c4020                         |            0 |             0
> |- generationNumber org.apache.pdfbox.cos.COSInteger @ 0x5ad0080                     |           24 |            24
> |- baseObject org.apache.pdfbox.cos.COSStream @ 0x5b25ea0                            |           32 |     8,187,216
> |  |- <class> class org.apache.pdfbox.cos.COSStream @ 0x58c3e00                      |            8 |             8
> |  |- items java.util.LinkedHashMap @ 0x5b2a0f0                                      |           56 |           552
> |  |- file org.apache.pdfbox.io.RandomAccessBuffer @ 0x5b2a128                       |           48 |     8,186,528
> |  |  |- <class> class org.apache.pdfbox.io.RandomAccessBuffer @ 0x5ad2b00           |            8 |             8
> |  |  |- currentBuffer byte[16384] @ 0x590f360                                       |       16,400 |        16,400
> |  |  |- bufferList java.util.ArrayList @ 0x5b2e200                                  |           24 |     8,170,080
> |  |  '- Total: 3 entries                                                            |              |
> |  |- filteredStream org.apache.pdfbox.io.RandomAccessFileOutputStream @ 0x5b2a158   |           32 |            32
> |  |- decodeResult org.apache.pdfbox.filter.DecodeResult @ 0xa65f618                 |           16 |            16
> |  |- unFilteredStream org.apache.pdfbox.io.RandomAccessFileOutputStream @ 0xa71ab18 |           32 |            32
> |  '- Total: 6 entries                                                               |              |
> |- objectNumber org.apache.pdfbox.cos.COSInteger @ 0x5b25ec0                         |           24 |            24
> -------------------------------------------------------------------------------------------------------------------



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
