Re: Apache PDFBox July 2012 board report due

Andreas Lehmkühler Thu, 19 Jul 2012 00:43:17 -0700

Timo Boehme <timo.boe...@ontochem.com> wrote on 16 July 2012 at 18:02:


> Hi,
>
> On 16.07.2012 17:48, Andreas Lehmkuehler wrote:
> > On 10.07.2012 09:16, Timo Boehme wrote:
> >> ...
> >> Looks good to me. Should it mention the preflight module, which will be
> >> integrated in the next major release?
> > Thanks for your comment. I added some information about preflight/xmpbox,
> > as you may have already seen.
>
> Yes, thank you very much for all the time you spend on administrative
> tasks and improvements for PDFBox.
>
> Next, I plan to improve the parser's robustness against broken documents
> by doing a first scan over the document (in case of a parsing failure),
> collecting object start/end points and using them to repair the
> xref table.


Seems to be necessary, at least for some PDFs. :-(
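
To make the idea concrete, here is a minimal sketch of such a fallback scan. It
is not PDFBox code: the class and method names are made up for illustration, it
assumes the whole file fits into a byte array, and it only records where each
"N G obj" header starts (it does not look for "endobj" or skip false hits
inside stream data). A later header for the same object overwrites an earlier
one, which roughly matches how incremental updates are resolved.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class XrefScanSketch {

    // Matches "<object number> <generation number> obj" at a token boundary.
    private static final Pattern OBJ_START =
            Pattern.compile("(?<![0-9A-Za-z])(\\d+)\\s+(\\d+)\\s+obj\\b");

    // Returns "objNr genNr" -> byte offset of the object header.
    static Map<String, Long> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char,
        // so a char index in the string equals the byte offset in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            // A later definition of the same object replaces the earlier one.
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start(1));
        }
        return offsets;
    }
}
```

The resulting map could then be used to rebuild the xref entries when the
regular parse fails.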


> Another task I would like to do is reducing the amount of memory needed
> by using the existing file as input stream resource instead of copying
> an object stream first to a temporary buffer (in cases where an input
> file exists).
> Maybe for this we should switch from assuming an input stream to
> assuming an input file, and create a temporary file on the fly when we
> only have an input stream. WDYT?
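
A rough sketch of the "temporary file on the fly" part, assuming a Java version
that has java.nio.file (Java 7 or later). The class and method names are
hypothetical and not existing PDFBox API; the point is only that a stream-only
caller ends up with a file the parser can seek in.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PDFBox.
final class FileBackedSourceSketch {

    // Copies the stream to a temporary file and returns its path.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdfbox-sketch-", ".pdf");
        // Best-effort cleanup when the JVM exits normally.
        tmp.toFile().deleteOnExit();
        // createTempFile already created the (empty) file, so replace it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }
}
```

Callers that already have a file would skip this step and hand the file to the
parser directly.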


I guess internally we have to use something abstract, and since everything is
a stream, that might be a good choice. AFAIU the current implementation, one
reason for using a temporary buffer is that the data is modified
(decompressed, decrypted) and we must not alter the input data. It is perhaps
a better idea to somehow split the inputstream and the unfilteredinputstream,
e.g. read from the inputstream every time an object is dereferenced and store
the (decompressed) data in the corresponding object.
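
Something along these lines, as a minimal sketch only. The class is
hypothetical, a byte array stands in for the untouched input file, and only
zlib/Flate decompression is handled (no decryption, no other filters). The raw
bytes are read when the object is first dereferenced, and the decoded result is
kept in the object, so the input itself is never modified.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical class, not part of PDFBox.
final class LazyStreamObjectSketch {
    private final byte[] source; // stands in for the read-only input file
    private final int offset;    // start of the raw (filtered) stream data
    private final int length;    // length of the raw stream data
    private byte[] decoded;      // filled on first dereference

    LazyStreamObjectSketch(byte[] source, int offset, int length) {
        this.source = source;
        this.offset = offset;
        this.length = length;
    }

    // Decodes on first use and returns the cached data afterwards.
    byte[] dereference() throws IOException {
        if (decoded == null) {
            try (InputStream raw = new ByteArrayInputStream(source, offset, length);
                 InputStream in = new InflaterInputStream(raw)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                decoded = out.toByteArray();
            }
        }
        return decoded;
    }
}
```

Whether the decoded data should stay cached or be dropped again under memory
pressure is a separate question this sketch does not address.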

>
>
> Kind regards,
> Timo


BR
Andreas Lehmkühler
