[ https://issues.apache.org/jira/browse/PDFBOX-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133180#comment-14133180 ]
Andreas Lehmkühler commented on PDFBOX-2301:
--------------------------------------------

I've overhauled the whole stream handling as follows:
- the user no longer has to provide a scratch file, so the close issue should be gone
- it is no longer necessary to clone the data, so the memory footprint should be smaller
- the user can decide whether to handle all data in memory only or to create scratch files for each COSStream
- the whole scratch file handling is encapsulated within COSStream
-- each stream has its own scratch file
-- scratch files are closed and deleted when closing the stream
-- scratch files aren't exposed to the user
- scratch file usage is available for both the non-sequential and the old parser

> RandomAccessBuffer consumes too much memory.
> --------------------------------------------
>
>                 Key: PDFBOX-2301
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2301
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: gee
>            Assignee: Andreas Lehmkühler
>             Fix For: 2.0.0
>
>         Attachments: clone.diff
>
>
> RandomAccessBuffer holds the uncompressed image during operation, because that
> is exactly what PDFBox's ExtractImages does.
> But holding the uncompressed image in memory instead of the compressed one
> consumes too much memory, including the many PDF XObjects that can use a
> filter to compress themselves. It would be good if PDFBox provided an option
> that reverts to the COSObject state just before the RandomAccess object was
> created (the state in which the PDF XObject stream has been parsed and the
> COSDictionary objects haven't been created yet because the user hasn't
> requested them via a get____() method). It is a crucial feature so that
> PDFBox can analyze huge PDF files (>100MB).
> In the current source, one must close a COSStream unless it is required (and
> I know a closed stream cannot be reopened).
> Class Name                                                                           | Shallow Heap | Retained Heap
> -------------------------------------------------------------------------------------------------------------------
> org.apache.pdfbox.cos.COSObject @ 0x5ad4940                                          |           24 |     8,187,264
> |- <class> class org.apache.pdfbox.cos.COSObject @ 0x58c4020                         |            0 |             0
> |- generationNumber org.apache.pdfbox.cos.COSInteger @ 0x5ad0080                     |           24 |            24
> |- baseObject org.apache.pdfbox.cos.COSStream @ 0x5b25ea0                            |           32 |     8,187,216
> |  |- <class> class org.apache.pdfbox.cos.COSStream @ 0x58c3e00                      |            8 |             8
> |  |- items java.util.LinkedHashMap @ 0x5b2a0f0                                      |           56 |           552
> |  |- file org.apache.pdfbox.io.RandomAccessBuffer @ 0x5b2a128                       |           48 |     8,186,528
> |  |  |- <class> class org.apache.pdfbox.io.RandomAccessBuffer @ 0x5ad2b00           |            8 |             8
> |  |  |- currentBuffer byte[16384] @ 0x590f360                                       |       16,400 |        16,400
> |  |  |- bufferList java.util.ArrayList @ 0x5b2e200                                  |           24 |     8,170,080
> |  |  '- Total: 3 entries                                                            |              |
> |  |- filteredStream org.apache.pdfbox.io.RandomAccessFileOutputStream @ 0x5b2a158   |           32 |            32
> |  |- decodeResult org.apache.pdfbox.filter.DecodeResult @ 0xa65f618                 |           16 |            16
> |  |- unFilteredStream org.apache.pdfbox.io.RandomAccessFileOutputStream @ 0xa71ab18 |           32 |            32
> |  '- Total: 6 entries                                                               |              |
> |- objectNumber org.apache.pdfbox.cos.COSInteger @ 0x5b25ec0                         |           24 |            24
> -------------------------------------------------------------------------------------------------------------------

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
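
For readers following the design described in the comment above (each stream owns its backing store, the user only chooses between memory and scratch file, and the scratch file is deleted on close), here is a minimal sketch of that idea. It is an illustration only and not the PDFBox implementation: the class name ScratchBackedStream, its constructor flag, and all of its methods are hypothetical and use only java.nio.

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration; NOT the actual COSStream code. */
final class ScratchBackedStream implements Closeable {
    private final Path scratchFile;             // null in memory-only mode
    private final ByteArrayOutputStream memory; // null in scratch-file mode
    private boolean closed;

    /** The caller only picks the mode; the file itself is never exposed. */
    ScratchBackedStream(boolean useScratchFile) throws IOException {
        if (useScratchFile) {
            scratchFile = Files.createTempFile("stream", ".tmp");
            memory = null;
        } else {
            scratchFile = null;
            memory = new ByteArrayOutputStream();
        }
    }

    /** Appends data to whichever backing store this stream owns. */
    void write(byte[] data) throws IOException {
        ensureOpen();
        if (scratchFile != null) {
            Files.write(scratchFile, data, StandardOpenOption.APPEND);
        } else {
            memory.write(data, 0, data.length);
        }
    }

    /** Returns everything written so far. */
    byte[] readAll() throws IOException {
        ensureOpen();
        return scratchFile != null
                ? Files.readAllBytes(scratchFile)
                : memory.toByteArray();
    }

    /** Helper for the demonstration below; reports whether a scratch file is on disk. */
    boolean scratchFileExists() {
        return scratchFile != null && Files.exists(scratchFile);
    }

    /** Closing the stream also deletes its private scratch file, if any. */
    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;
        if (scratchFile != null) {
            Files.deleteIfExists(scratchFile);
        }
    }

    private void ensureOpen() throws IOException {
        if (closed) {
            throw new IOException("stream is closed");
        }
    }
}
```

Because the stream creates and deletes its own temporary file, callers have nothing extra to close, which is the property the comment attributes to the reworked scratch file handling.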