[ https://issues.apache.org/jira/browse/PDFBOX-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129848#comment-14129848 ]
Andreas Lehmkühler commented on PDFBOX-2301: -------------------------------------------- The origin issue (PDFBOX-1625) was about merging pdfs and the underlying issue about the usage of the scratch file when merging pdfs (see PDFBOX-1586). PDFBOX-1586 reduces the usage of the scratch file and PDFBOX-1625 tries to detach the source and the destination pdf by cloning. Both are just workarounds and in the case of PDFBOX-1625 it has some side effects. IMHO, we have to overhaul the stream handling within the COS layer and we shouldn't expose the scratch file anymore. The whole stuff should be handled under the hood. The only thing the user may decide is wether to use the file system or the memory as temp area. > RandomAccessBuffer consumes too much memory. > -------------------------------------------- > > Key: PDFBOX-2301 > URL: https://issues.apache.org/jira/browse/PDFBOX-2301 > Project: PDFBox > Issue Type: Bug > Components: PDModel > Reporter: gee > Attachments: clone.diff > > > RandomAccessBuffer holds uncompressed image during operation because it is > what exactly pdfbox ExtractImages do. > but holding uncompressed image instead of compressed one in memory consumes > too much memory, not excluding many PDF XObjects that can use filter to > compress itself. It would be good if pdfbox provides option that reverts to > COSObject state just before the RandomAccess object created(the state that > pdf XObject stream parsed and COSDictionary objects haven't created because > user doesn't requested it using get____() method.) It is crucial feature so > that pdfbox can analyze huge pdf file(>100MB). > In current source, one must close COSStream unless required(and I know closed > stream cannot reopened again.) > Class Name > > > | > Shallow Heap | Retained Heap > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > org.apache.pdfbox.cos.COSObject @ 0x5ad4940 > > > | > 24 | 8,187,264 > |- <class> class org.apache.pdfbox.cos.COSObject @ 0x58c4020 > > > | > 0 | 0 > |- generationNumber org.apache.pdfbox.cos.COSInteger @ 0x5ad0080 > > > | > 24 | 24 > |- baseObject org.apache.pdfbox.cos.COSStream @ 0x5b25ea0 > > > | > 32 | 8,187,216 > | |- <class> class org.apache.pdfbox.cos.COSStream @ 0x58c3e00 > > > | > 8 | 8 > | |- items java.util.LinkedHashMap @ 0x5b2a0f0 > > > | > 56 | 552 > | |- file org.apache.pdfbox.io.RandomAccessBuffer @ 0x5b2a128 > > > | > 48 | 8,186,528 > | | |- <class> class org.apache.pdfbox.io.RandomAccessBuffer @ 0x5ad2b00 > > > | > 8 | 8 > | | |- currentBuffer byte[16384] @ 0x590f360 16,400 | 16,400 > | | |- bufferList java.util.ArrayList @ 0x5b2e200 > > > | > 24 | 8,170,080 > | | '- Total: 3 entries > > > | > | > | |- filteredStream org.apache.pdfbox.io.RandomAccessFileOutputStream @ > 0x5b2a158 > > > | 32 | 32 > | |- decodeResult org.apache.pdfbox.filter.DecodeResult @ 0xa65f618 > > > | > 16 | 16 > | |- unFilteredStream org.apache.pdfbox.io.RandomAccessFileOutputStream @ > 0xa71ab18 > > | > 32 | 32 > | '- Total: 6 entries > > > | > | > |- objectNumber org.apache.pdfbox.cos.COSInteger @ 0x5b25ec0 > > > | > 24 | 24 > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -- This message was sent by Atlassian JIRA (v6.3.4#6332)