[jira] [Resolved] (PDFBOX-4569) Implement an ondemand Parser

Jira Mon, 27 Jan 2020 11:01:28 -0800


     [ 
https://issues.apache.org/jira/browse/PDFBOX-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andreas Lehmkühler resolved PDFBOX-4569.
----------------------------------------
    Resolution: Fixed

I guess we are done here so far. Any further optimization should have its own 
ticket.

+Summary+

The parser starts by reading all cross-reference information and creates the 
trailer object holding the root dictionary. All other objects are read on 
demand, following these steps:

* create a COSObjectKey for the object number
* get the COSObject for the COSObjectKey by calling 
COSDocument#getObjectFromPool
* COSObject#getObject dereferences the COSBase we are looking for
* the interface ICOSParser was introduced to decouple COSObject and the parser 
used to dereference the object
* COSParser implements the interface and does the parsing
* the COSBase object is cached in COSObject for further use
* objects within an object stream are dereferenced one by one

All of this is done automagically so that the end user doesn't have to change 
anything to use the on demand parser.
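
As a rough illustration of the steps above, here is a minimal, self-contained 
sketch of lazy dereferencing with a cache. It does not use the PDFBox classes: 
Key, Loader, LazyObject, Document and CountingLoader are hypothetical stand-ins 
for COSObjectKey, ICOSParser, COSObject, COSDocument and COSParser, and the 
"parsing" is faked with a string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class OnDemandSketch {

    /** Hypothetical key: object number plus generation number. */
    static final class Key {
        final long number;
        final int generation;

        Key(long number, int generation) {
            this.number = number;
            this.generation = generation;
        }

        @Override public boolean equals(Object o) {
            return o instanceof Key
                    && ((Key) o).number == number
                    && ((Key) o).generation == generation;
        }

        @Override public int hashCode() {
            return Objects.hash(number, generation);
        }
    }

    /** Hypothetical parser interface; decouples the lazy object from the parser. */
    interface Loader {
        String load(Key key);
    }

    /** Lazy reference: loads on first access, then serves the cached value. */
    static final class LazyObject {
        private final Key key;
        private final Loader loader;
        private String value;
        private boolean loaded;

        LazyObject(Key key, Loader loader) {
            this.key = key;
            this.loader = loader;
        }

        String get() {
            if (!loaded) {
                value = loader.load(key); // the only place where parsing happens
                loaded = true;
            }
            return value;
        }
    }

    /** Hypothetical document holding one lazy object per key. */
    static final class Document {
        private final Map<Key, LazyObject> pool = new HashMap<>();
        private final Loader loader;

        Document(Loader loader) {
            this.loader = loader;
        }

        LazyObject getFromPool(Key key) {
            return pool.computeIfAbsent(key, k -> new LazyObject(k, loader));
        }
    }

    /** Fake parser that counts how often it is asked to load something. */
    static final class CountingLoader implements Loader {
        int calls;

        @Override public String load(Key key) {
            calls++;
            return "object " + key.number + " " + key.generation;
        }
    }

    public static void main(String[] args) {
        CountingLoader loader = new CountingLoader();
        Document doc = new Document(loader);
        LazyObject obj = doc.getFromPool(new Key(12, 0));
        System.out.println(loader.calls); // 0, nothing has been parsed yet
        System.out.println(obj.get());    // object 12 0
        obj.get();                        // second access is served from the cache
        System.out.println(loader.calls); // 1
    }
}
```

The caller only ever asks the lazy object for its value, which is why nothing 
has to change on the calling side in this sketch.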

+Some important details+
* less memory consumption if one doesn't need all objects, e.g. text extraction 
doesn't need to read image information
* no performance regression so far: loading is much faster, but the parser 
needs more time to load the objects on demand if the number of objects to be 
processed is nearly the same in both cases (on demand vs. old parser)
* the more objects are needed/loaded, the smaller the memory savings, as all 
objects are cached and in the end the memory footprint is nearly the same

+Some findings for further optimizations+
I've tried to deactivate the caching of objects within COSObject. Instead of 
storing them I've simply reloaded the objects. That doesn't work, as there may 
be changes made to the loaded objects which are reverted when reloading them. 
IMHO the main cause of this effect is the fact that the two layers (COS and PD) 
are glued together into one layer which doesn't support such changes. One idea 
could be to really separate both layers by creating PD objects from COS objects 
without using them for storage and drop the COS objects afterwards. That would 
be a huge effort.
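
The effect of dropping the cache can be shown with a small sketch. It is not 
PDFBox code: parseFromFile, Cached and Uncached are hypothetical names, and a 
plain map stands in for a parsed dictionary.

```java
import java.util.HashMap;
import java.util.Map;

final class ReloadSketch {

    /** Fakes parsing: always returns a fresh dictionary as stored in the file. */
    static Map<String, String> parseFromFile() {
        Map<String, String> dict = new HashMap<>();
        dict.put("Title", "original");
        return dict;
    }

    /** Keeps the parsed dictionary, so in-memory changes stay visible. */
    static final class Cached {
        private Map<String, String> value;

        Map<String, String> get() {
            if (value == null) {
                value = parseFromFile();
            }
            return value;
        }
    }

    /** Re-parses on every access, so in-memory changes are lost. */
    static final class Uncached {
        Map<String, String> get() {
            return parseFromFile();
        }
    }

    public static void main(String[] args) {
        Cached cached = new Cached();
        cached.get().put("Title", "changed");
        System.out.println(cached.get().get("Title"));   // changed

        Uncached uncached = new Uncached();
        uncached.get().put("Title", "changed");
        System.out.println(uncached.get().get("Title")); // original
    }
}
```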

I've tried to use memory mapped files as input but stumbled upon our scratch 
file implementation. IMHO we have to drop/change that first if we want to 
support memory mapped files in combination with on demand parsing.
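
For reference, reading a byte range through a memory mapped file with plain 
java.nio looks roughly like the sketch below. It only shows the input side and 
says nothing about the scratch file problem; MappedReadSketch, readAt and demo 
are hypothetical names and not part of PDFBox.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedReadSketch {

    /** Maps only the requested region of the file and copies it into an array. */
    static byte[] readAt(Path file, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] result = new byte[length];
            buffer.get(result);
            return result;
        }
    }

    /** Writes a tiny fake file and reads four bytes starting at offset 9. */
    static String demo() {
        try {
            Path file = Files.createTempFile("mapped-sketch", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, "%PDF-1.7\nxref\n".getBytes(StandardCharsets.US_ASCII));
            return new String(readAt(file, 9, 4), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "xref"
    }
}
```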



> Implement an ondemand Parser
> ----------------------------
>
>                 Key: PDFBOX-4569
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4569
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Andreas Lehmkühler
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: PDFBOX-1084.pdf
>
>
> There is a need to replace the big bang parser with an ondemand parser



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

