[ 
https://issues.apache.org/jira/browse/PDFBOX-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138730#comment-14138730
 ] 

Maruan Sahyoun commented on PDFBOX-2301:
----------------------------------------

As the NonSeqParser can work on demand, maybe the initial focus could be to get 
that working internally, as this will already reduce the memory requirements as 
well as the processing time where only partial information is needed (e.g. 
importing a single page from a source document). The results of Andreas' initial 
change showed that we currently have too many open streams.

Apache Camel has a Stream Cache with a configurable size threshold: the stream 
is kept in memory, or written to a temp file if it is above that size 
(http://camel.apache.org/stream-caching.html). Maybe worth looking into.
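
To make the stream-cache idea concrete, here is a minimal Java sketch of a 
buffer that stays in memory up to a threshold and spills to a temp file beyond 
it. This is only an illustration under my own assumptions: the class name 
SpillingBuffer and its methods are hypothetical and are neither Camel's nor 
PDFBox's API; only standard java.io and java.nio.file calls are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: keeps data in memory until a threshold, then spills to a temp file. */
class SpillingBuffer extends OutputStream {
    private final int threshold;          // max bytes held in memory (assumed configurable)
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path tempFile;                // set only after spilling
    private OutputStream fileOut;         // non-null once data lives on disk
    private long size;

    SpillingBuffer(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        // Spill before the in-memory part would exceed the threshold.
        if (fileOut == null && size + len > threshold) {
            spillToDisk();
        }
        if (fileOut != null) {
            fileOut.write(buf, off, len);
        } else {
            memory.write(buf, off, len);
        }
        size += len;
    }

    private void spillToDisk() throws IOException {
        tempFile = Files.createTempFile("spill", ".tmp");
        fileOut = Files.newOutputStream(tempFile);
        memory.writeTo(fileOut);          // copy what was buffered so far
        memory = null;                    // release the heap copy
    }

    boolean isInMemory() {
        return fileOut == null;
    }

    long size() {
        return size;
    }

    /** Opens a fresh stream over everything written so far. */
    InputStream newInputStream() throws IOException {
        if (fileOut == null) {
            return new ByteArrayInputStream(memory.toByteArray());
        }
        fileOut.flush();
        return Files.newInputStream(tempFile);
    }

    @Override
    public void close() throws IOException {
        if (fileOut != null) {
            fileOut.close();
            Files.deleteIfExists(tempFile);   // temp data is discarded on close
        }
    }
}
```

Small streams never touch the disk, while a large decoded image would hold at 
most the threshold on the heap. Whether something like this fits behind 
COSStream is a separate question.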

> RandomAccessBuffer consumes too much memory.
> --------------------------------------------
>
>                 Key: PDFBOX-2301
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2301
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: gee
>            Assignee: Andreas Lehmkühler
>             Fix For: 2.0.0
>
>         Attachments: clone.diff, clone2.diff, clone3.diff
>
>
> RandomAccessBuffer holds the uncompressed image during operation, because 
> that is exactly what the PDFBox ExtractImages tool does.
> But holding the uncompressed image in memory instead of the compressed one 
> consumes too much memory, and the same applies to the many PDF XObjects that 
> can use a filter to compress themselves. It would be good if PDFBox provided 
> an option that reverts to the COSObject state just before the RandomAccess 
> object was created (the state in which the PDF XObject stream has been parsed 
> but the COSDictionary objects have not yet been created, because the user has 
> not requested them using a get____() method). This is a crucial feature so 
> that PDFBox can analyze huge PDF files (>100MB).
> In the current source, one must close the COSStream unless it is still 
> required (and I know a closed stream cannot be reopened).
> Class Name                                                                           | Shallow Heap | Retained Heap
> -------------------------------------------------------------------------------------------------------------------
> org.apache.pdfbox.cos.COSObject @ 0x5ad4940                                          |           24 |     8,187,264
> |- <class> class org.apache.pdfbox.cos.COSObject @ 0x58c4020                         |            0 |             0
> |- generationNumber org.apache.pdfbox.cos.COSInteger @ 0x5ad0080                     |           24 |            24
> |- baseObject org.apache.pdfbox.cos.COSStream @ 0x5b25ea0                            |           32 |     8,187,216
> |  |- <class> class org.apache.pdfbox.cos.COSStream @ 0x58c3e00                      |            8 |             8
> |  |- items java.util.LinkedHashMap @ 0x5b2a0f0                                      |           56 |           552
> |  |- file org.apache.pdfbox.io.RandomAccessBuffer @ 0x5b2a128                       |           48 |     8,186,528
> |  |  |- <class> class org.apache.pdfbox.io.RandomAccessBuffer @ 0x5ad2b00           |            8 |             8
> |  |  |- currentBuffer byte[16384] @ 0x590f360                                       |       16,400 |        16,400
> |  |  |- bufferList java.util.ArrayList @ 0x5b2e200                                  |           24 |     8,170,080
> |  |  '- Total: 3 entries                                                            |              |
> |  |- filteredStream org.apache.pdfbox.io.RandomAccessFileOutputStream @ 0x5b2a158   |           32 |            32
> |  |- decodeResult org.apache.pdfbox.filter.DecodeResult @ 0xa65f618                 |           16 |            16
> |  |- unFilteredStream org.apache.pdfbox.io.RandomAccessFileOutputStream @ 0xa71ab18 |           32 |            32
> |  '- Total: 6 entries                                                               |              |
> |- objectNumber org.apache.pdfbox.cos.COSInteger @ 0x5b25ec0                         |           24 |            24
> -------------------------------------------------------------------------------------------------------------------



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
