Object scanning (was: Re: Apache PDFBox July 2012 board report due)

2012-07-19 Thread Timo Boehme

Hi,

On 19.07.2012 10:03, Maruan Sahyoun wrote:


maybe we can join forces here, as I'm currently working on an Xref
class which parses xref tables and xref streams. One method should
also do the mentioned scanning.


Sure. I haven't started yet, so we can discuss the details. What I had
in mind was a fast scan for line starts that contain an object start,
endobj or endstream. With this we can detect a missing endobj/endstream
etc. Furthermore, we can correct xref entries, which are sometimes a few
bytes off. Embedded PDFs that are not additionally encoded can cause some
trouble here, but as long as the embedding object and the embedded PDF
are correct, this can be handled. In any case, this method is only needed
for broken PDFs, and most of them won't contain such embedded PDFs.
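
To make the idea concrete, here is a minimal Java sketch of such a line-start scan. It is only an illustration under stated assumptions, not existing PDFBox code: the names ObjectScanner, Marker, scan, objectsMissingEndobj and repairOffsets are hypothetical, the whole file is assumed to fit into a byte array, and marker-like lines inside stream data (for example an embedded, unencoded PDF) are not filtered out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a PDFBox class): finds object markers at line starts. */
final class ObjectScanner {

    enum Kind { OBJ, ENDOBJ, ENDSTREAM }

    /** A marker found at a line start. */
    static final class Marker {
        final Kind kind;
        final int offset; // byte offset of the line start
        final int objNr;  // object number for OBJ markers, otherwise -1

        Marker(Kind kind, int offset, int objNr) {
            this.kind = kind;
            this.offset = offset;
            this.objNr = objNr;
        }
    }

    /** Collects all "<num> <gen> obj", "endobj" and "endstream" line starts. */
    static List<Marker> scan(byte[] pdf) {
        List<Marker> markers = new ArrayList<>();
        for (int i = 0; i < pdf.length; i++) {
            boolean lineStart = i == 0 || pdf[i - 1] == '\n' || pdf[i - 1] == '\r';
            if (!lineStart) {
                continue;
            }
            if (startsWith(pdf, i, "endobj")) {
                markers.add(new Marker(Kind.ENDOBJ, i, -1));
            } else if (startsWith(pdf, i, "endstream")) {
                markers.add(new Marker(Kind.ENDSTREAM, i, -1));
            } else {
                int objNr = parseObjectHeader(pdf, i);
                if (objNr >= 0) {
                    markers.add(new Marker(Kind.OBJ, i, objNr));
                }
            }
        }
        return markers;
    }

    /** Object numbers whose start is followed by another start (or EOF) without endobj. */
    static List<Integer> objectsMissingEndobj(List<Marker> markers) {
        List<Integer> missing = new ArrayList<>();
        int open = -1; // number of the currently open object, -1 if none
        for (Marker m : markers) {
            if (m.kind == Kind.OBJ) {
                if (open >= 0) {
                    missing.add(open);
                }
                open = m.objNr;
            } else if (m.kind == Kind.ENDOBJ) {
                open = -1;
            }
        }
        if (open >= 0) {
            missing.add(open);
        }
        return missing;
    }

    /** Copy of the xref map (object number to offset) updated with the scanned starts. */
    static Map<Integer, Integer> repairOffsets(Map<Integer, Integer> xref, List<Marker> markers) {
        Map<Integer, Integer> repaired = new HashMap<>(xref);
        for (Marker m : markers) {
            if (m.kind == Kind.OBJ) {
                repaired.put(m.objNr, m.offset); // assumption: the last definition wins
            }
        }
        return repaired;
    }

    private static boolean startsWith(byte[] data, int pos, String keyword) {
        byte[] k = keyword.getBytes(StandardCharsets.US_ASCII);
        if (pos + k.length > data.length) {
            return false;
        }
        for (int j = 0; j < k.length; j++) {
            if (data[pos + j] != k[j]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the object number if "<num> <gen> obj" starts at pos, otherwise -1. */
    private static int parseObjectHeader(byte[] d, int pos) {
        int p = pos;
        long objNr = 0;
        while (p < d.length && p - pos < 10 && isDigit(d[p])) {
            objNr = objNr * 10 + (d[p] - '0');
            p++;
        }
        if (p == pos || p >= d.length || d[p] != ' ') {
            return -1;
        }
        int genStart = ++p;
        while (p < d.length && isDigit(d[p])) {
            p++;
        }
        if (p == genStart || p >= d.length || d[p] != ' ') {
            return -1;
        }
        boolean ok = startsWith(d, p + 1, "obj") && objNr <= Integer.MAX_VALUE;
        return ok ? (int) objNr : -1;
    }

    private static boolean isDigit(byte b) {
        return b >= '0' && b <= '9';
    }
}
```

A caller would run scan once after a parsing failure, use objectsMissingEndobj to report damaged objects, and pass the existing xref entries through repairOffsets. Letting the last scanned definition of an object number win is a simplification that roughly mirrors incremental updates.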



Kind regards,

Timo



On 19.07.2012 at 09:42, "Andreas Lehmkühler" wrote:

On 16 July 2012 at 18:02, Timo Boehme wrote:

On 16.07.2012 17:48, Andreas Lehmkuehler wrote:

On 10.07.2012 09:16, Timo Boehme wrote:

...

Next, I plan to improve the parser's robustness against broken documents
by doing a first scan over the document (in case of parsing failure),
collecting object start/end points and using them to repair the xref
table.


Seems to be necessary, at least for some PDFs. :-(


Another task I would like to do is to reduce the amount of memory needed
by using the existing file as the input stream resource instead of first
copying an object stream to a temporary buffer (in cases where an input
file exists).
Maybe for this we should change from assuming we have an input stream to
assuming we have an input file, and if we only have an input stream, a
temporary file is created on the fly. WDYT?


I guess internally we have to use something abstract, and as everything is a
stream, that might be a good choice. AFAIU the current implementation, one
reason for using a temporary buffer is that the data is modified
(decompressed, decrypted) and we must not alter the input data. It is perhaps
a better idea to somehow split the input stream and the unfiltered input
stream, e.g. read from the input stream every time an object is dereferenced
and store the (decompressed) data in the corresponding object.




Kind regards,
Timo



BR
Andreas Lehmkühler



--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com

_

 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
_



Re: Apache PDFBox July 2012 board report due

2012-07-19 Thread Maruan Sahyoun
Hi,

maybe we can join forces here, as I'm currently working on an Xref class which
parses xref tables and xref streams. One method should also do the mentioned
scanning.

Kind regards

Maruan Sahyoun

On 19.07.2012 at 09:42, "Andreas Lehmkühler" wrote:

>
> On 16 July 2012 at 18:02, Timo Boehme wrote:
>
>> Hi,
>>
>> On 16.07.2012 17:48, Andreas Lehmkuehler wrote:
>>> On 10.07.2012 09:16, Timo Boehme wrote:
>>>> ...
>>>> looks good to me. Some mention about the preflight module which will be
>>>> integrated in the next major release?
>>> Thanks for your comment. I added some information about preflight/xmpbox
>>> as you may have already seen.
>>
>> Yes, thank you very much for all the time spent on administrative
>> tasks/improvements on PDFBOX.
>>
>> Next, I plan to improve the parser's robustness against broken documents
>> by doing a first scan over the document (in case of parsing failure),
>> collecting object start/end points and using them to repair the xref
>> table.
> 
> 
> Seems to be necessary, at least for some PDFs. :-(
> 
> 
>> Another task I would like to do is to reduce the amount of memory needed
>> by using the existing file as the input stream resource instead of first
>> copying an object stream to a temporary buffer (in cases where an input
>> file exists).
>> Maybe for this we should change from assuming we have an input stream to
>> assuming we have an input file, and if we only have an input stream, a
>> temporary file is created on the fly. WDYT?
> 
> 
> I guess internally we have to use something abstract, and as everything is a
> stream, that might be a good choice. AFAIU the current implementation, one
> reason for using a temporary buffer is that the data is modified
> (decompressed, decrypted) and we must not alter the input data. It is perhaps
> a better idea to somehow split the input stream and the unfiltered input
> stream, e.g. read from the input stream every time an object is dereferenced
> and store the (decompressed) data in the corresponding object.
> 
>> 
>> 
>> Kind regards,
>> Timo
> 
> 
> BR
> Andreas Lehmkühler


Re: Apache PDFBox July 2012 board report due

2012-07-19 Thread Andreas Lehmkühler

On 16 July 2012 at 18:02, Timo Boehme wrote:

> Hi,
>
> On 16.07.2012 17:48, Andreas Lehmkuehler wrote:
> > On 10.07.2012 09:16, Timo Boehme wrote:
> >> ...
> >> looks good to me. Some mention about the preflight module which will be
> >> integrated in the next major release?
> > Thanks for your comment. I added some information about preflight/xmpbox
> > as you may have already seen.
>
> Yes, thank you very much for all the time spent on administrative
> tasks/improvements on PDFBOX.
>
> Next, I plan to improve the parser's robustness against broken documents
> by doing a first scan over the document (in case of parsing failure),
> collecting object start/end points and using them to repair the xref
> table.


Seems to be necessary, at least for some PDFs. :-(


> Another task I would like to do is to reduce the amount of memory needed
> by using the existing file as the input stream resource instead of first
> copying an object stream to a temporary buffer (in cases where an input
> file exists).
> Maybe for this we should change from assuming we have an input stream to
> assuming we have an input file, and if we only have an input stream, a
> temporary file is created on the fly. WDYT?


I guess internally we have to use something abstract, and as everything is a
stream, that might be a good choice. AFAIU the current implementation, one
reason for using a temporary buffer is that the data is modified
(decompressed, decrypted) and we must not alter the input data. It is perhaps
a better idea to somehow split the input stream and the unfiltered input
stream, e.g. read from the input stream every time an object is dereferenced
and store the (decompressed) data in the corresponding object.
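
A minimal Java sketch of that split might look as follows. It is only an illustration under stated assumptions, not the existing PDFBox design: the class name LazyStreamObject and the method getDecoded are hypothetical, the raw input is assumed to be available as a shared byte array, and only Flate-compressed (zlib) data is handled; decryption and other filters are left out.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/** Hypothetical object wrapper: decodes its stream data on first access. */
final class LazyStreamObject {
    private final byte[] source; // shared raw input, never modified
    private final int offset;    // start of the encoded data within source
    private final int length;    // number of encoded bytes
    private byte[] decoded;      // unfiltered data, cached after first access

    LazyStreamObject(byte[] source, int offset, int length) {
        this.source = source;
        this.offset = offset;
        this.length = length;
    }

    /** Returns the decompressed data, inflating it on the first call only. */
    byte[] getDecoded() {
        if (decoded == null) {
            decoded = inflate();
        }
        return decoded;
    }

    private byte[] inflate() {
        Inflater inflater = new Inflater();
        try {
            // Reads directly from the shared input; no copy of the encoded bytes.
            inflater.setInput(source, offset, length);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && !inflater.finished()) {
                    // All input was supplied, so no progress means bad data.
                    throw new IllegalStateException("truncated or invalid Flate data");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("invalid Flate data", e);
        } finally {
            inflater.end();
        }
    }
}
```

The encoded bytes are only read, so the input stays untouched, and the decoded copy lives in the object that owns it. Whether the cached data should be released again under memory pressure is left open here.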

>
>
> Kind regards,
> Timo


BR
Andreas Lehmkühler

Re: Apache PDFBox July 2012 board report due

2012-07-16 Thread Timo Boehme

Hi,

On 16.07.2012 17:48, Andreas Lehmkuehler wrote:

On 10.07.2012 09:16, Timo Boehme wrote:

...
looks good to me. Some mention about the preflight module which will be
integrated in the next major release?

Thanks for your comment. I added some information about preflight/xmpbox
as you may have already seen.


Yes, thank you very much for all the time spent on administrative
tasks/improvements on PDFBOX.


Next, I plan to improve the parser's robustness against broken documents
by doing a first scan over the document (in case of parsing failure),
collecting object start/end points and using them to repair the xref
table.
Another task I would like to do is to reduce the amount of memory needed
by using the existing file as the input stream resource instead of first
copying an object stream to a temporary buffer (in cases where an input
file exists).
Maybe for this we should change from assuming we have an input stream to
assuming we have an input file, and if we only have an input stream, a
temporary file is created on the fly. WDYT?
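
The file-first idea can be sketched in Java roughly like this. It is a minimal illustration under stated assumptions, not existing PDFBox code: the class PdfSource and its methods fromFile, fromStream, file and read are hypothetical names, and it relies on java.nio.file, so it assumes Java 7 or later.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Hypothetical random-access source that is always backed by a file. */
final class PdfSource implements Closeable {
    private final File file;
    private final boolean temporary; // true if we created the file ourselves
    private final RandomAccessFile raf;

    private PdfSource(File file, boolean temporary) throws IOException {
        this.file = file;
        this.temporary = temporary;
        this.raf = new RandomAccessFile(file, "r");
    }

    /** Uses an existing input file directly, without copying it. */
    static PdfSource fromFile(File file) throws IOException {
        return new PdfSource(file, false);
    }

    /** Spools the stream into a temporary file created on the fly. */
    static PdfSource fromStream(InputStream in) throws IOException {
        File tmp = File.createTempFile("pdfsource", ".pdf");
        try {
            Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return new PdfSource(tmp, true);
        } catch (IOException e) {
            Files.deleteIfExists(tmp.toPath());
            throw e;
        }
    }

    /** The backing file (the temporary one if created from a stream). */
    File file() {
        return file;
    }

    /** Reads length bytes starting at offset without buffering the whole file. */
    byte[] read(long offset, int length) throws IOException {
        byte[] result = new byte[length];
        raf.seek(offset);
        raf.readFully(result);
        return result;
    }

    /** Closes the file and removes it if it was a temporary copy. */
    @Override
    public void close() throws IOException {
        try {
            raf.close();
        } finally {
            if (temporary) {
                Files.deleteIfExists(file.toPath());
            }
        }
    }
}
```

With this shape the parser only ever sees a seekable file, and object streams can be read at their offsets on demand instead of being copied into a buffer first. The cost is one full copy to disk when the caller only has a stream.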



Kind regards,
Timo




Re: Apache PDFBox July 2012 board report due

2012-07-10 Thread Timo Boehme

Hi,

On 10.07.2012 08:03, Andreas Lehmkuehler wrote:

find attached a quick draft of the board report we're expected to submit this
month (tomorrow, sorry for my lateness).

Any comments, objections or additions?


looks good to me. Some mention about the preflight module which will be 
integrated in the next major release?



Kind regards,
Timo





The Apache PDFBox library is an open source Java tool for working with PDF
documents.


General Comments
----------------

There are no issues that require Board attention.

Community
---------

There is a steady stream of contributions and bug reports from the community.
Wolfgang Glas offered to contribute some code to improve the Unicode support
when creating documents, one of the most requested features.
The new conforming parser works well and will replace the old one at least
in the next major release.

Releases
--------

PDFBox 1.7.0 was released on 29 May 2012.

We are planning to cut a 1.7.1 bugfix release in the near future.

Development:
------------

The development on the next release is still in progress. We are currently
working on

- improved font handling
- improved rendering
- refactoring + improved integration of preflight
- bugfixing

We just started a discussion on how to proceed with the next release(s); it
looks like the next release will probably be a major one.


BR
Andreas Lehmkühler


