[
https://issues.apache.org/jira/browse/PDFBOX-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler resolved PDFBOX-4569.
----------------------------------------
Resolution: Fixed
I guess we are done here so far. Any further optimization should have its own
ticket.
+Summary+
The parser starts by reading all cross-reference information and creates the
trailer object holding the root dictionary. All other objects are read on
demand, following these steps:
* create a COSObjectKey for the object number
* get the COSObject for the COSObjectKey by calling
COSDocument#getObjectFromPool
* COSObject#getObject dereferences the COSBase we are looking for
* the interface ICOSParser was introduced to decouple COSObject and the parser
used to dereference the object
* COSParser implements the interface and does the parsing
* the COSBase object is cached in COSObject for further use
* objects within an object stream are dereferenced one by one
All of this is done automagically so that the end user doesn't have to change
anything to use the on demand parser.
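The steps above can be pictured with a minimal, self-contained sketch. It is
not the PDFBox code: ObjectKey, ObjectLoader, LazyObject, ObjectPool and
CountingLoader are hypothetical stand-ins for COSObjectKey, ICOSParser,
COSObject, the object pool in COSDocument and the parser, and the "parsing" is
faked with a string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for COSObjectKey: object number plus generation.
final class ObjectKey {
    final long number;
    final int generation;

    ObjectKey(long number, int generation) {
        this.number = number;
        this.generation = generation;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ObjectKey)) return false;
        ObjectKey other = (ObjectKey) o;
        return number == other.number && generation == other.generation;
    }

    @Override
    public int hashCode() {
        return Objects.hash(number, generation);
    }
}

// Hypothetical stand-in for ICOSParser: decouples the lazy object
// from whatever actually reads it from the file.
interface ObjectLoader {
    Object load(ObjectKey key);
}

// Hypothetical stand-in for COSObject: dereferences on first access
// and caches the result for further use.
final class LazyObject {
    private final ObjectKey key;
    private final ObjectLoader loader;
    private Object value;
    private boolean loaded;

    LazyObject(ObjectKey key, ObjectLoader loader) {
        this.key = key;
        this.loader = loader;
    }

    Object getObject() {
        if (!loaded) {
            value = loader.load(key); // parsing happens here, on demand
            loaded = true;
        }
        return value;
    }
}

// Hypothetical stand-in for the pool behind COSDocument#getObjectFromPool.
final class ObjectPool {
    private final Map<ObjectKey, LazyObject> pool = new HashMap<>();
    private final ObjectLoader loader;

    ObjectPool(ObjectLoader loader) {
        this.loader = loader;
    }

    LazyObject getObjectFromPool(ObjectKey key) {
        return pool.computeIfAbsent(key, k -> new LazyObject(k, loader));
    }
}

// Fake "parser" that only counts how often it is asked to load an object.
final class CountingLoader implements ObjectLoader {
    int loads;

    @Override
    public Object load(ObjectKey key) {
        loads++;
        return "object " + key.number + " " + key.generation;
    }
}
```

In this sketch, fetching an entry from the pool costs nothing; the loader is
only called on the first getObject call, and later calls return the cached
value.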
+Some important details+
* less memory consumption if one doesn't need all objects, e.g. text extraction
doesn't need to read image information
* no performance regression so far: loading is much faster, but the parser
needs more time to load the objects on demand if the number of objects to be
processed is nearly the same in both cases (on demand vs. old parser)
* the more objects are needed/loaded, the smaller the memory savings, as all
objects are cached and in the end the memory footprint is nearly the same
+Some findings for further optimizations+
I've tried to deactivate the caching of objects within COSObject. Instead of
storing them I've simply reloaded the objects. That doesn't work, as there may
be changes made to the loaded objects which are reverted when reloading them.
IMHO the main cause of this effect is the fact that the two layers (COS and PD)
are glued together into one layer which doesn't support such changes. One idea
could be to really separate both layers by creating PD objects from COS objects
without using them for storage and drop the COS objects afterwards. That would
be a huge effort.
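To illustrate why reloading instead of caching loses changes, here is a small
sketch under simplifying assumptions: a plain Map plays the role of a parsed
dictionary, and CachingHolder, ReloadingHolder and FakeFile are hypothetical
names that do not exist in PDFBox.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Keeps the parsed object, so in-memory changes survive later lookups.
final class CachingHolder {
    private final Supplier<Map<String, String>> parser;
    private Map<String, String> cached;

    CachingHolder(Supplier<Map<String, String>> parser) {
        this.parser = parser;
    }

    Map<String, String> get() {
        if (cached == null) {
            cached = parser.get();
        }
        return cached;
    }
}

// Re-parses on every access, so each lookup sees the state stored in the file.
final class ReloadingHolder {
    private final Supplier<Map<String, String>> parser;

    ReloadingHolder(Supplier<Map<String, String>> parser) {
        this.parser = parser;
    }

    Map<String, String> get() {
        return parser.get();
    }
}

// Stands in for parsing a dictionary from the source file.
final class FakeFile {
    static Map<String, String> parseInfo() {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("Title", "original");
        return dictionary;
    }
}
```

With CachingHolder, a put on the returned map is still visible on the next
get. With ReloadingHolder, the next get returns a freshly parsed map and the
change is gone.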
I've tried to use memory-mapped files as input but stumbled upon our scratch
file implementation. IMHO we have to drop/change that first if we want to
support memory-mapped files in combination with on-demand parsing.
> Implement an ondemand Parser
> ----------------------------
>
> Key: PDFBOX-4569
> URL: https://issues.apache.org/jira/browse/PDFBOX-4569
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 3.0.0 PDFBox
> Reporter: Andreas Lehmkühler
> Assignee: Andreas Lehmkühler
> Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: PDFBOX-1084.pdf
>
>
> There is a need to replace the big bang parser with an ondemand parser