[ https://issues.apache.org/jira/browse/PDFBOX-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137140#comment-14137140 ]
Timo Boehme commented on PDFBOX-2301:
-------------------------------------
[~lehmi] The NonSeqParser is by design an on-demand parser. It parses all
objects in the init procedure (see the parseMinimalCatalog variable) only as a
workaround, because other parts of PDFBox require the data to be parsed
already. So COSObject and its subclasses should be mere stubs in the beginning,
and any use (any method call) should trigger parsing of the object by the parser
(NonSequentialPDFParser.parseObjectDynamically). Thus COSDocument needs to have
a reference to the parser.
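
To illustrate the stub idea, here is a minimal sketch. DemoParser and
LazyObjectStub are hypothetical stand-ins, not PDFBox classes; only the method
name parseObjectDynamically is borrowed from the comment above, and its
signature here is an assumption. The stub keeps just the object number and asks
the parser for the content on first access.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a parser that can load a single object on request. */
final class DemoParser {
    private final Map<Long, String> rawObjects = new HashMap<>();
    int parseCount = 0; // counts how often an object was actually parsed

    DemoParser() {
        // Pretend these are objects located via the xref table.
        rawObjects.put(1L, "<< /Type /Catalog >>");
        rawObjects.put(2L, "<< /Type /Pages >>");
    }

    // Assumed signature, for illustration only.
    String parseObjectDynamically(long objectNumber) {
        parseCount++;
        return rawObjects.get(objectNumber);
    }
}

/** Hypothetical stub: holds only the object number until it is first used. */
final class LazyObjectStub {
    private final long objectNumber;
    private final DemoParser parser; // the reference back to the parser
    private String baseObject;
    private boolean loaded;

    LazyObjectStub(long objectNumber, DemoParser parser) {
        this.objectNumber = objectNumber;
        this.parser = parser;
    }

    /** First call triggers parsing; later calls reuse the parsed result. */
    String getObject() {
        if (!loaded) {
            baseObject = parser.parseObjectDynamically(objectNumber);
            loaded = true;
        }
        return baseObject;
    }
}
```

Creating a stub costs nothing in the parser; only the first getObject() call
does the parsing, which is why the document (or the stub) has to hold a parser
reference.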
For the scratch file workaround I'm still in favor of a split in-memory/file
usage, so that only large PDFs need to write to a file.
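
A rough sketch of what such a split could look like, under my own assumptions:
SpillingBuffer is a hypothetical class, not part of PDFBox, and the threshold
is an arbitrary parameter. Data stays in memory until the threshold would be
exceeded, then everything moves to a temporary file.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical buffer that stays in memory until it grows past a threshold. */
final class SpillingBuffer implements AutoCloseable {
    private final int thresholdBytes;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private File tempFile;
    private RandomAccessFile file;
    private long length;

    SpillingBuffer(int thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    void write(byte[] data) throws IOException {
        // Switch to the scratch file once the in-memory part would get too large.
        if (file == null && memory.size() + data.length > thresholdBytes) {
            spillToFile();
        }
        if (file != null) {
            file.write(data);
        } else {
            memory.write(data, 0, data.length);
        }
        length += data.length;
    }

    private void spillToFile() throws IOException {
        tempFile = File.createTempFile("scratch", ".tmp");
        tempFile.deleteOnExit();
        file = new RandomAccessFile(tempFile, "rw");
        file.write(memory.toByteArray()); // copy what was buffered so far
        memory = null;                    // release the heap copy
    }

    boolean isOnDisk() {
        return file != null;
    }

    byte[] readAll() throws IOException {
        if (file == null) {
            return memory.toByteArray();
        }
        byte[] out = new byte[(int) length];
        file.seek(0);
        file.readFully(out);
        file.seek(length); // restore the write position
        return out;
    }

    @Override
    public void close() throws IOException {
        if (file != null) {
            file.close();
            tempFile.delete();
        }
    }
}
```

With a threshold of, say, a few MB, small PDFs would never touch the disk,
while a large decoded image stream would end up in the scratch file instead of
on the heap.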
> RandomAccessBuffer consumes too much memory.
> --------------------------------------------
>
> Key: PDFBOX-2301
> URL: https://issues.apache.org/jira/browse/PDFBOX-2301
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 1.8.6, 2.0.0
> Reporter: gee
> Assignee: Andreas Lehmkühler
> Fix For: 2.0.0
>
> Attachments: clone.diff, clone2.diff, clone3.diff
>
>
> RandomAccessBuffer holds the uncompressed image during operation, because
> that is exactly what the PDFBox ExtractImages tool does.
> But holding the uncompressed image in memory instead of the compressed one
> consumes too much memory, and the same applies to the many PDF XObjects that
> can use a filter to compress themselves. It would be good if PDFBox provided
> an option to revert to the COSObject state just before the RandomAccess
> object was created (the state in which the PDF XObject stream has been parsed
> but the COSDictionary objects have not been created yet, because the user has
> not requested them via a get____() method). This is a crucial feature so that
> PDFBox can analyze huge PDF files (>100MB).
> In the current source, one must close a COSStream unless it is required (and
> I know a closed stream cannot be reopened).
> Class Name                                                                           | Shallow Heap | Retained Heap
> -------------------------------------------------------------------------------------------------------------------
> org.apache.pdfbox.cos.COSObject @ 0x5ad4940                                          |           24 |     8,187,264
> |- <class> class org.apache.pdfbox.cos.COSObject @ 0x58c4020                         |            0 |             0
> |- generationNumber org.apache.pdfbox.cos.COSInteger @ 0x5ad0080                     |           24 |            24
> |- baseObject org.apache.pdfbox.cos.COSStream @ 0x5b25ea0                            |           32 |     8,187,216
> | |- <class> class org.apache.pdfbox.cos.COSStream @ 0x58c3e00                       |            8 |             8
> | |- items java.util.LinkedHashMap @ 0x5b2a0f0                                       |           56 |           552
> | |- file org.apache.pdfbox.io.RandomAccessBuffer @ 0x5b2a128                        |           48 |     8,186,528
> | | |- <class> class org.apache.pdfbox.io.RandomAccessBuffer @ 0x5ad2b00             |            8 |             8
> | | |- currentBuffer byte[16384] @ 0x590f360                                         |       16,400 |        16,400
> | | |- bufferList java.util.ArrayList @ 0x5b2e200                                    |           24 |     8,170,080
> | | '- Total: 3 entries                                                              |              |
> | |- filteredStream org.apache.pdfbox.io.RandomAccessFileOutputStream @ 0x5b2a158    |           32 |            32
> | |- decodeResult org.apache.pdfbox.filter.DecodeResult @ 0xa65f618                  |           16 |            16
> | |- unFilteredStream org.apache.pdfbox.io.RandomAccessFileOutputStream @ 0xa71ab18  |           32 |            32
> | '- Total: 6 entries                                                                |              |
> |- objectNumber org.apache.pdfbox.cos.COSInteger @ 0x5b25ec0                         |           24 |            24
> -------------------------------------------------------------------------------------------------------------------
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)