Object scanning (was: Re: Apache PDFBox July 2012 board report due)
Hi,

On 19.07.2012 10:03, Maruan Sahyoun wrote:
> maybe we can join forces here, as I'm currently working on an Xref
> class which parses xref tables and xref streams. One method should
> also do the mentioned scanning.

Sure. I haven't started yet, so we can discuss the details. What I had
in mind was a fast scan for lines that start with an object start,
endobj or endstream. With this we can detect missing endobj/endstream
etc. Furthermore, we can correct xref entries, which are sometimes a
few bytes off.

Embedded PDFs that are not additionally encoded can cause some trouble
here, but as long as the embedding object and the embedded PDF are
correct this can be handled. Furthermore, this method is only needed
for broken PDFs, and most of them won't contain such embedded PDFs.

Kind regards,
Timo

> On 19.07.2012 at 09:42, "Andreas Lehmkühler" wrote:
>> Timo Boehme wrote on 16 July 2012 at 18:02:
>>> On 16.07.2012 17:48, Andreas Lehmkuehler wrote:
>>>> On 10.07.2012 09:16, Timo Boehme wrote:
>>>>> ...
>>>
>>> In the near future I plan to improve the parser's robustness
>>> against broken documents by doing a first scan over the document
>>> (in case of a parsing failure), collecting object start/end points
>>> and using them to repair the xref table.
>>
>> Seems to be necessary, at least for some PDFs. :-(
>>
>>> Another task I would like to do is reducing the amount of memory
>>> needed by using the existing file as the input stream resource
>>> instead of first copying an object stream to a temporary buffer
>>> (in cases where an input file exists).
>>> Maybe for this we should change from assuming we have an input
>>> stream to assuming we have an input file, and if we have an input
>>> stream a temporary file is created on the fly - WDYT?
>>
>> I guess internally we have to use something abstract, and as
>> everything is a stream, that might be a good choice. AFAIU the
>> current implementation, one reason for the usage of a temporary
>> buffer is the fact that the data is modified (decompressing,
>> decrypting) and we must not alter the input data.
>> It is perhaps a better idea to somehow split the inputstream and
>> the unfilteredinputstream, e.g. read from the inputstream every
>> time an object is dereferenced and store the (decompressed) data in
>> the corresponding object.
>>
>>> Kind regards,
>>> Timo
>>
>> BR
>> Andreas Lehmkühler

--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boe...@ontochem.com

OntoChem GmbH
Managing director: Dr. Lutz Weber
Registered office: Halle / Saale
Register court: Stendal
Registration number: HRB 215461
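
The line-start scan discussed in this message could look roughly like the Java sketch below. It is only an illustration under stated assumptions: `ObjectScanSketch`, `Marker` and all method names are hypothetical and not PDFBox classes, the whole file is assumed to fit into one byte array, and markers inside binary stream data (the embedded-PDF case mentioned above) are not filtered out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a PDFBox class): finds object markers at line starts. */
final class ObjectScanSketch {

    /** A marker found at a line start; kind is "obj", "endobj" or "endstream". */
    static final class Marker {
        final String kind;
        final long offset;      // byte offset of the marker
        final int objectNumber; // -1 unless kind is "obj"
        final int generation;   // -1 unless kind is "obj"

        Marker(String kind, long offset, int objectNumber, int generation) {
            this.kind = kind;
            this.offset = offset;
            this.objectNumber = objectNumber;
            this.generation = generation;
        }
    }

    // Matches only at the start of the data or directly after a CR or LF.
    private static final Pattern LINE_START = Pattern.compile(
            "(?:\\A|(?<=[\\r\\n]))"
            + "(?:(\\d{1,9})[ \\t]+(\\d{1,5})[ \\t]+obj\\b|(endobj)\\b|(endstream)\\b)");

    /** Scans the file content and returns all markers in file order. */
    static List<Marker> scan(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so char index == byte offset.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        List<Marker> markers = new ArrayList<>();
        Matcher m = LINE_START.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                markers.add(new Marker("obj", m.start(),
                        Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))));
            } else if (m.group(3) != null) {
                markers.add(new Marker("endobj", m.start(), -1, -1));
            } else {
                markers.add(new Marker("endstream", m.start(), -1, -1));
            }
        }
        return markers;
    }

    /** Returns the scanned offset of the object that is closest to the xref offset, or -1. */
    static long correctedOffset(List<Marker> markers, int objectNumber,
                                int generation, long xrefOffset) {
        long best = -1;
        for (Marker marker : markers) {
            if (!"obj".equals(marker.kind) || marker.objectNumber != objectNumber
                    || marker.generation != generation) {
                continue;
            }
            if (best < 0 || Math.abs(marker.offset - xrefOffset) < Math.abs(best - xrefOffset)) {
                best = marker.offset;
            }
        }
        return best;
    }

    /** Returns the numbers of objects that are not closed by an endobj before the next obj or EOF. */
    static List<Integer> objectsMissingEndobj(List<Marker> markers) {
        List<Integer> missing = new ArrayList<>();
        int open = -1; // number of the currently open object, -1 if none
        for (Marker marker : markers) {
            if ("obj".equals(marker.kind)) {
                if (open >= 0) {
                    missing.add(open);
                }
                open = marker.objectNumber;
            } else if ("endobj".equals(marker.kind)) {
                open = -1;
            }
        }
        if (open >= 0) {
            missing.add(open);
        }
        return missing;
    }
}
```

The scan result is kept as a plain list so that both uses named in the message, detecting a missing endobj and correcting an xref offset that is a few bytes off, can work from the same single pass.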
Re: Apache PDFBox July 2012 board report due
Hi,

maybe we can join forces here, as I'm currently working on an Xref
class which parses xref tables and xref streams. One method should also
do the mentioned scanning.

Kind regards
Maruan Sahyoun

On 19.07.2012 at 09:42, "Andreas Lehmkühler" wrote:

> Timo Boehme wrote on 16 July 2012 at 18:02:
>
>> Hi,
>>
>> On 16.07.2012 17:48, Andreas Lehmkuehler wrote:
>>> On 10.07.2012 09:16, Timo Boehme wrote:
>>>> ...
>>>> looks good to me. Some mention about the preflight module which
>>>> will be integrated in the next major release?
>>> Thanks for your comment. I added some information about
>>> preflight/xmpbox, as you may already have seen.
>>
>> Yes, thank you very much for all the time you spend on
>> administrative tasks and improvements for PDFBox.
>>
>> In the near future I plan to improve the parser's robustness
>> against broken documents by doing a first scan over the document
>> (in case of a parsing failure), collecting object start/end points
>> and using them to repair the xref table.
>
> Seems to be necessary, at least for some PDFs. :-(
>
>> Another task I would like to do is reducing the amount of memory
>> needed by using the existing file as the input stream resource
>> instead of first copying an object stream to a temporary buffer
>> (in cases where an input file exists).
>> Maybe for this we should change from assuming we have an input
>> stream to assuming we have an input file, and if we have an input
>> stream a temporary file is created on the fly - WDYT?
>
> I guess internally we have to use something abstract, and as
> everything is a stream, that might be a good choice. AFAIU the
> current implementation, one reason for the usage of a temporary
> buffer is the fact that the data is modified (decompressing,
> decrypting) and we must not alter the input data. It is perhaps a
> better idea to somehow split the inputstream and the
> unfilteredinputstream, e.g. read from the inputstream every time an
> object is dereferenced and store the (decompressed) data in the
> corresponding object.
>
>> Kind regards,
>> Timo
>
> BR
> Andreas Lehmkühler
Re: Apache PDFBox July 2012 board report due
Timo Boehme wrote on 16 July 2012 at 18:02:

> Hi,
>
> On 16.07.2012 17:48, Andreas Lehmkuehler wrote:
>> On 10.07.2012 09:16, Timo Boehme wrote:
>>> ...
>>> looks good to me. Some mention about the preflight module which
>>> will be integrated in the next major release?
>> Thanks for your comment. I added some information about
>> preflight/xmpbox, as you may already have seen.
>
> Yes, thank you very much for all the time you spend on
> administrative tasks and improvements for PDFBox.
>
> In the near future I plan to improve the parser's robustness against
> broken documents by doing a first scan over the document (in case of
> a parsing failure), collecting object start/end points and using
> them to repair the xref table.

Seems to be necessary, at least for some PDFs. :-(

> Another task I would like to do is reducing the amount of memory
> needed by using the existing file as the input stream resource
> instead of first copying an object stream to a temporary buffer (in
> cases where an input file exists).
> Maybe for this we should change from assuming we have an input
> stream to assuming we have an input file, and if we have an input
> stream a temporary file is created on the fly - WDYT?

I guess internally we have to use something abstract, and as everything
is a stream, that might be a good choice. AFAIU the current
implementation, one reason for the usage of a temporary buffer is the
fact that the data is modified (decompressing, decrypting) and we must
not alter the input data. It is perhaps a better idea to somehow split
the inputstream and the unfilteredinputstream, e.g. read from the
inputstream every time an object is dereferenced and store the
(decompressed) data in the corresponding object.

> Kind regards,
> Timo

BR
Andreas Lehmkühler
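
The split between the untouched input and the unfiltered data suggested in this message might be sketched as below. This is a hedged illustration, not the PDFBox implementation: `LazyStreamObjectSketch` is a hypothetical class, a byte array stands in for the input file, and the only filter assumed is Flate (zlib) decompression; decryption and other filters are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical stream object: the input stays untouched, decoded data is cached per object. */
final class LazyStreamObjectSketch {
    private final byte[] source; // stands in for the unmodified input file
    private final int offset;    // where the encoded stream data starts
    private final int length;    // number of encoded bytes
    private byte[] decoded;      // filled on first dereference

    LazyStreamObjectSketch(byte[] source, int offset, int length) {
        this.source = source;
        this.offset = offset;
        this.length = length;
    }

    /** Decodes on first use only; later calls return the cached array. */
    synchronized byte[] dereference() throws IOException {
        if (decoded == null) {
            // Read straight from the input range; nothing is written back to it.
            try (InputStream in = new InflaterInputStream(
                         new ByteArrayInputStream(source, offset, length));
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                decoded = out.toByteArray();
            }
        }
        return decoded;
    }
}
```

In this sketch the memory cost is paid only for objects that are actually dereferenced, which is the trade-off the message hints at; whether cached data should later be released again is left open here.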
Re: Apache PDFBox July 2012 board report due
Hi,

On 16.07.2012 17:48, Andreas Lehmkuehler wrote:
> On 10.07.2012 09:16, Timo Boehme wrote:
>> ...
>> looks good to me. Some mention about the preflight module which
>> will be integrated in the next major release?
> Thanks for your comment. I added some information about
> preflight/xmpbox, as you may already have seen.

Yes, thank you very much for all the time you spend on administrative
tasks and improvements for PDFBox.

In the near future I plan to improve the parser's robustness against
broken documents by doing a first scan over the document (in case of a
parsing failure), collecting object start/end points and using them to
repair the xref table.

Another task I would like to do is reducing the amount of memory needed
by using the existing file as the input stream resource instead of
first copying an object stream to a temporary buffer (in cases where an
input file exists).
Maybe for this we should change from assuming we have an input stream
to assuming we have an input file, and if we have an input stream a
temporary file is created on the fly - WDYT?

Kind regards,
Timo
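
The "assume a file, spool a stream to a temporary file on the fly" proposal from this message could be sketched as follows. All names (`FileBackedSourceSketch`, `fromFile`, `fromStream`, `read`) are hypothetical and chosen for illustration only; the sketch uses just standard `java.nio.file` and `java.io` calls and says nothing about how PDFBox itself handles this.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical input source that is always backed by a file on disk. */
final class FileBackedSourceSketch implements AutoCloseable {
    private final Path path;
    private final boolean deleteOnClose; // true only for temp files created here
    private final RandomAccessFile file;

    private FileBackedSourceSketch(Path path, boolean deleteOnClose) throws IOException {
        this.path = path;
        this.deleteOnClose = deleteOnClose;
        this.file = new RandomAccessFile(path.toFile(), "r");
    }

    /** An input file exists: use it directly, without copying. */
    static FileBackedSourceSketch fromFile(Path existing) throws IOException {
        return new FileBackedSourceSketch(existing, false);
    }

    /** Only a stream exists: spool it to a temporary file first. */
    static FileBackedSourceSketch fromStream(InputStream in) throws IOException {
        Path temp = Files.createTempFile("pdf-input-", ".tmp");
        try {
            // createTempFile already created the file, so it has to be replaced.
            Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
            return new FileBackedSourceSketch(temp, true);
        } catch (IOException e) {
            Files.deleteIfExists(temp);
            throw e;
        }
    }

    long length() throws IOException {
        return file.length();
    }

    /** Reads a byte range on demand instead of buffering whole objects up front. */
    byte[] read(long offset, int length) throws IOException {
        byte[] buffer = new byte[length];
        file.seek(offset);
        file.readFully(buffer);
        return buffer;
    }

    @Override
    public void close() throws IOException {
        file.close();
        if (deleteOnClose) {
            Files.deleteIfExists(path);
        }
    }
}
```

A caller would wrap either kind of input in try-with-resources, so that a temporary file created for a stream is removed again while a user-supplied file is left alone.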
Re: Apache PDFBox July 2012 board report due
Hi,

On 10.07.2012 08:03, Andreas Lehmkuehler wrote:
> find attached a quick draft of the board report we're expected to
> submit this month (tomorrow, sorry for my lateness).
>
> Any comments, objections or additions?

looks good to me. Some mention about the preflight module which will be
integrated in the next major release?

Kind regards,
Timo

> The Apache PDFBox library is an open source Java tool for working
> with PDF documents.
>
> General Comments
>
> There are no issues that require Board attention.
>
> Community
>
> - There is a steady stream of contributions and bug reports from the
>   community.
>
> Wolfgang Glas offered to contribute some code to improve the Unicode
> support when creating documents, one of the most requested features.
>
> The new conforming parser works well and will replace the old one,
> at least in the next major release.
>
> Releases
>
> PDFBox 1.7.0 was released on 29 May 2012.
>
> We are planning to cut a 1.7.1 bugfix release in the near future.
>
> Development:
>
> The development on the next release is still in progress. We are
> currently working on
> - improved font handling
> - improved rendering
> - refactoring + improved integration of preflight
> - bugfixing
>
> We just started a discussion on how to proceed with the next
> release(s). It looks like the next release will probably be a major
> one.
>
> BR
> Andreas Lehmkühler