Hi
On 19.07.2012 at 10:03, Maruan Sahyoun wrote:
maybe we can join forces here as I'm currently working on an Xref
class which parses xref tables and xref streams. One method should
also do the mentioned scanning.
Sure. I haven't started yet, so we can discuss the details. What I had
in mind was a fast scan for lines that start with an object start
marker, endobj, or endstream. With this we can detect a missing
endobj/endstream etc. Furthermore, we can correct xref entries, which
are sometimes a few bytes off. Embedded PDFs that are not additionally
encoded can cause some trouble here, but as long as the embedding object
and the embedded PDF are correct this can be handled. Besides, this
method is only needed for broken PDFs, and most of them won't contain
such embedded PDFs.
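To make the scanning idea concrete, here is a minimal sketch in Java. It is only an illustration under assumptions: the class `LineStartScanner`, its `Marker` type and the 32-byte prefix limit are hypothetical names and choices, not existing PDFBox code, and the whole file is assumed to fit into a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not an existing PDFBox class.
final class LineStartScanner {

    // Matches "<object number> <generation> obj" at the start of a line.
    private static final Pattern OBJ_START =
            Pattern.compile("(\\d{1,10})\\s+(\\d{1,5})\\s+obj\\b");

    // Only the first bytes of a line are inspected, so long binary lines stay cheap.
    private static final int PREFIX_LEN = 32;

    /** One marker found at the start of a line. */
    static final class Marker {
        final String kind;  // "obj", "endobj" or "endstream"
        final int offset;   // byte offset of the line start
        final long objNr;   // object number, -1 unless kind is "obj"
        final int genNr;    // generation number, -1 unless kind is "obj"

        Marker(String kind, int offset, long objNr, int genNr) {
            this.kind = kind;
            this.offset = offset;
            this.objNr = objNr;
            this.genNr = genNr;
        }
    }

    /** Walks the data line by line and records every line that starts with a marker. */
    static List<Marker> scan(byte[] pdf) {
        List<Marker> result = new ArrayList<>();
        int pos = 0;
        while (pos < pdf.length) {
            // Find the end of the current line (CR or LF).
            int end = pos;
            while (end < pdf.length && pdf[end] != '\n' && pdf[end] != '\r') {
                end++;
            }
            int len = Math.min(end - pos, PREFIX_LEN);
            String prefix = new String(pdf, pos, len, StandardCharsets.ISO_8859_1);
            if (prefix.startsWith("endstream")) {
                result.add(new Marker("endstream", pos, -1, -1));
            } else if (prefix.startsWith("endobj")) {
                result.add(new Marker("endobj", pos, -1, -1));
            } else {
                Matcher m = OBJ_START.matcher(prefix);
                if (m.lookingAt()) {
                    result.add(new Marker("obj", pos,
                            Long.parseLong(m.group(1)), Integer.parseInt(m.group(2))));
                }
            }
            // Skip the end-of-line bytes to reach the next line start.
            pos = end;
            while (pos < pdf.length && (pdf[pos] == '\n' || pdf[pos] == '\r')) {
                pos++;
            }
        }
        return result;
    }
}
```

With such a list, an "obj" marker that is not followed by an "endobj" before the next "obj" would point to a missing endobj, and the recorded offsets could be compared with the xref entries to correct those that are a few bytes off. Markers inside embedded, unencoded PDFs would show up as well, which is the caveat mentioned above.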
Kind regards,
Timo
On 19.07.2012 at 09:42, "Andreas Lehmkühler" <andr...@lehmi.de> wrote:
On 16 July 2012 at 18:02, Timo Boehme <timo.boe...@ontochem.com> wrote:
On 16.07.2012 at 17:48, Andreas Lehmkuehler wrote:
On 10.07.2012 at 09:16, Timo Boehme wrote:
...
Next, I plan to improve the parser's robustness against broken documents
by doing a first scan over the document (in case of a parsing failure),
collecting object start/end points and using them to repair the
xref table.
Seems to be necessary, at least for some PDFs. :-(
Another task I would like to do is reducing the amount of memory needed
by using the existing file as input stream resource instead of copying
an object stream first to a temporary buffer (in cases where an input
file exists).
Maybe for this we should switch from assuming an input stream to
assuming an input file, and if we only have an input stream, a
temporary file is created on the fly - WDYT?
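As a rough sketch of that proposal in Java: a source that is always backed by a file, where an input stream is first spooled to a temporary file. The class `FileBackedSource` and its methods are hypothetical names for illustration only, not existing PDFBox API.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical class for illustration; not an existing PDFBox class.
final class FileBackedSource implements AutoCloseable {

    private final RandomAccessFile raf;
    private final File tempFile; // null when the caller supplied the file

    private FileBackedSource(File file, File tempFile) throws IOException {
        this.raf = new RandomAccessFile(file, "r");
        this.tempFile = tempFile;
    }

    /** Uses an existing file directly; nothing is copied. */
    static FileBackedSource of(File file) throws IOException {
        return new FileBackedSource(file, null);
    }

    /** Spools the stream to a temporary file that is deleted on close(). */
    static FileBackedSource of(InputStream in) throws IOException {
        File tmp = File.createTempFile("pdfsource", ".tmp");
        try {
            Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return new FileBackedSource(tmp, tmp);
        } catch (IOException e) {
            tmp.delete();
            throw e;
        }
    }

    /** Reads up to length bytes at the given offset; shorter at end of file. */
    byte[] read(long offset, int length) throws IOException {
        long remaining = Math.max(0L, raf.length() - offset);
        byte[] buf = new byte[(int) Math.min(length, remaining)];
        raf.seek(offset);
        raf.readFully(buf);
        return buf;
    }

    @Override
    public void close() throws IOException {
        raf.close();
        if (tempFile != null) {
            tempFile.delete();
        }
    }
}
```

The point of the sketch is that object data can be read at a known offset on demand instead of being copied into a temporary buffer first; whether a temporary file is acceptable for all callers is exactly the open question.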
I guess internally we have to use something abstract, and as everything is a
stream, that might be a good choice. AFAIU the current implementation, one
reason for the usage of a temporary buffer is the fact that the data is
modified (decompressing, decrypting) and we must not alter the input data.
It is perhaps a better idea to somehow split the input stream and the
unfiltered input stream, e.g. read from the input stream every time an
object is dereferenced and store the (decompressed) data in the
corresponding object.
Kind regards,
Timo
BR
Andreas Lehmkühler
--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boe...@ontochem.com