Object scanning (was: Re: Apache PDFBox July 2012 board report due)

Timo Boehme Thu, 19 Jul 2012 04:01:42 -0700

Hi

On 19.07.2012 10:03, Maruan Sahyoun wrote:

Maybe we can join forces here, as I'm currently working on an Xref
class which parses xref tables and xref streams. One method should
also do the mentioned scanning.

Sure. I haven't started yet, so we can discuss the details. What I had
in mind was a fast scan for lines that start with an object start,
endobj or endstream. With this we can detect missing endobj/endstream
etc. Furthermore we can correct xref entries, which are sometimes a few
bytes off. Embedded PDFs that are not additionally encoded can cause
some trouble here, but as long as the embedding object and the embedded
PDF are correct this can be handled. In any case this method is only
needed for broken PDFs, and most of them won't have such embedded PDFs.
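
To make the idea concrete, here is a minimal sketch of such a line-start
scan. It is not existing PDFBox code: the class ObjectScanSketch, its
Marker type and both method names are made up for illustration, and it
assumes the whole file fits into a byte array, which a real
implementation would probably avoid.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not an existing PDFBox class.
final class ObjectScanSketch {

    /** A keyword found at the start of a line. */
    static final class Marker {
        final String kind;   // "obj", "endobj" or "endstream"
        final int offset;    // byte offset of the line start
        final int objNumber; // object number for "obj", otherwise -1

        Marker(String kind, int offset, int objNumber) {
            this.kind = kind;
            this.offset = offset;
            this.objNumber = objNumber;
        }
    }

    // A line start is the start of the data or the position right after CR or LF.
    private static final Pattern LINE_START = Pattern.compile(
            "(?<![^\\r\\n])(?:(\\d{1,9})[ \\t]+\\d{1,5}[ \\t]+obj|(endobj)|(endstream))(?![A-Za-z0-9])");

    /** Collects all "N G obj", "endobj" and "endstream" markers at line starts. */
    static List<Marker> scan(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so char index == byte offset.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<Marker> markers = new ArrayList<>();
        Matcher m = LINE_START.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                markers.add(new Marker("obj", m.start(), Integer.parseInt(m.group(1))));
            } else if (m.group(2) != null) {
                markers.add(new Marker("endobj", m.start(), -1));
            } else {
                markers.add(new Marker("endstream", m.start(), -1));
            }
        }
        return markers;
    }

    /** Returns the numbers of objects that are not closed by an endobj. */
    static List<Integer> missingEndobj(List<Marker> markers) {
        List<Integer> missing = new ArrayList<>();
        int open = -1; // number of the currently open object, -1 if none
        for (Marker mk : markers) {
            if (mk.kind.equals("obj")) {
                if (open >= 0) {
                    missing.add(open); // a new object started before the old one ended
                }
                open = mk.objNumber;
            } else if (mk.kind.equals("endobj")) {
                open = -1;
            }
        }
        if (open >= 0) {
            missing.add(open);
        }
        return missing;
    }
}
```

The same marker list could be checked for missing endstream in a similar
way. Binary stream data that happens to contain a matching line would
produce false positives, which is the embedded-PDF caveat mentioned
above.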



Kind regards,

Timo

On 19.07.2012 at 09:42, "Andreas Lehmkühler"<andr...@lehmi.de> wrote:

Timo Boehme<timo.boe...@ontochem.com> wrote on 16 July 2012 at 18:02:

On 16.07.2012 17:48, Andreas Lehmkuehler wrote:

On 10.07.2012 09:16, Timo Boehme wrote:

...

Next I plan to improve the parser's robustness against broken documents
by doing a first scan over the document (in case of a parsing
failure), collecting object start/end points and using them to repair
the xref table.
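
As a rough illustration of the repair step, the sketch below merges
object offsets found by such a scan into the offsets read from the xref
table. The class XrefRepairSketch and its repair method are hypothetical
names, and representing both sides as maps from object number to byte
offset is an assumption made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not an existing PDFBox class.
final class XrefRepairSketch {

    /**
     * Returns a corrected copy of the xref offsets.
     * Offsets found by the scan win over the xref entry. Entries that exist
     * only in the xref are kept, because a plain scan cannot see objects
     * stored inside compressed object streams.
     */
    static Map<Integer, Long> repair(Map<Integer, Long> xref, Map<Integer, Long> scanned) {
        Map<Integer, Long> fixed = new TreeMap<>(xref);
        for (Map.Entry<Integer, Long> e : scanned.entrySet()) {
            fixed.put(e.getKey(), e.getValue());
        }
        return fixed;
    }
}
```

Letting the scanned offset always win is the simplest policy. A more
careful version might only accept a scanned offset that lies within a
few bytes of the xref entry.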


Seems to be necessary, at least for some PDFs. :-(

Another task I would like to do is to reduce the amount of memory needed
by using the existing file as the input stream resource instead of
first copying an object stream to a temporary buffer (in cases where an
input file exists).
Maybe for this we should change from assuming that we have an input
stream to assuming that we have an input file, and if we only have an
input stream a temporary file is created on the fly - WDYT?
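
A minimal sketch of the "temporary file on the fly" idea follows. The
class FileBackedSourceSketch and its methods are invented for
illustration, and it assumes java.nio.file (Java 7) is available.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not an existing PDFBox class.
final class FileBackedSourceSketch {

    /** Spools an input stream into a temporary file so it can be read randomly. */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdf-input-", ".tmp");
        // createTempFile already created the file, so it has to be replaced.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Reads one byte range from the file without loading the rest into memory. */
    static byte[] readRange(Path file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            byte[] buf = new byte[length];
            raf.readFully(buf);
            return buf;
        }
    }
}
```

With this, a caller that already has a file would skip spool and use
readRange directly, and the caller would be responsible for deleting the
temporary file afterwards.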


I guess internally we have to use something abstract, and as everything
is a stream, that might be a good choice. AFAIU the current
implementation, one reason for the usage of a temporary buffer is the
fact that the data is modified (decompressing, decrypting) and we must
not alter the input data. It is perhaps a better idea to somehow split
the inputstream and the unfilteredinputstream, e.g. read from the
inputstream every time an object is dereferenced and store the
(decompressed) data in the corresponding object.
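
One possible shape of that split, as a sketch only: the object keeps a
reference to its byte range in the unmodified input and decodes it on
first access. LazyStreamObjectSketch is a hypothetical class, the input
is assumed to be a byte array, and only Flate decompression is shown
(no decryption, no other filters).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration; not an existing PDFBox class.
final class LazyStreamObjectSketch {
    private final byte[] source; // shared input data, never modified
    private final int offset;    // start of the raw stream data
    private final int length;    // length of the raw stream data
    private byte[] decoded;      // filled on first dereference

    LazyStreamObjectSketch(byte[] source, int offset, int length) {
        this.source = source;
        this.offset = offset;
        this.length = length;
    }

    /** Decompresses the raw range on first call and caches the result. */
    byte[] getDecoded() throws IOException {
        if (decoded == null) {
            try (InputStream in = new InflaterInputStream(
                    new ByteArrayInputStream(source, offset, length))) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                decoded = out.toByteArray();
            }
        }
        return decoded;
    }
}
```

The input bytes stay untouched, and the decoded data lives only in the
object that needed it.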



Kind regards,
Timo



BR
Andreas Lehmkühler



--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com

_____________________________________________________________________

 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
_____________________________________________________________________
