Re: 2.0

Timo Boehme Wed, 15 Oct 2014 00:19:42 -0700

Hi,

The difference between the parsers stems from the fact that the old parser can cope with a completely broken xref table because it uses the objects as it encounters them while reading the file sequentially. What we need (as I proposed before) is a repair mechanism that scans the file for object start/end markers and uses them to re-create the xref table.
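
A minimal sketch of such a scan, not the actual PDFBox implementation: it assumes the whole file fits in memory and that every occurrence of "<number> <generation> obj" marks a real object start, which does not hold if that byte sequence happens to appear inside stream data. The class and method names (XrefRebuildSketch, scanObjectOffsets) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of PDFBox. */
public class XrefRebuildSketch {

    // Matches "<objNr> <genNr> obj"; the lookbehind keeps the match from
    // starting in the middle of a longer token.
    private static final Pattern OBJ_START =
            Pattern.compile("(?<![0-9A-Za-z])(\\d+)\\s+(\\d+)\\s+obj\\b");

    /**
     * Scans the raw file bytes and returns a map from "objNr genNr" to the
     * byte offset where that object starts. If an object is defined more
     * than once (incremental updates), the last definition wins.
     */
    public static Map<String, Long> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so a char index
        // in the string equals the byte offset in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            String key = m.group(1) + " " + m.group(2);
            offsets.put(key, (long) m.start(1));
        }
        return offsets;
    }
}
```

The resulting map holds what an xref table would normally provide (object number, generation, offset). A real repair would also have to locate the trailer and handle objects stored in object streams.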

I will see if I can find some time to do this.

The only other stopper is, as Andreas has pointed out, the signing. I'm not familiar with this and don't know what needs to be done here.



Best,
Timo


On 14.10.2014 at 21:18, Tilman Hausherr wrote:

Here are some:

055/055794.pdf
082/082463.pdf
108/108362.pdf
113/113223.pdf
115/115458.pdf
115/115463.pdf
122/122393.pdf
129/129416.pdf
133/133423.pdf
148/148020.pdf
152/152012.pdf
161/161466.pdf

to be found here:
http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/

Tilman

On 14.10.2014 at 21:06, John Hewson wrote:

Unless somebody provides us with a list of those files, then I think
this is an unreasonable request. As long as we continue to leave the
old parser in PDFBox, we won’t get the bug reports which we need to
fix the new parser, and the situation will never resolve itself.
Falling back to the old parser is just as bad - we won’t get bug reports.

-- John

On 14 Oct 2014, at 07:39, Tilman Hausherr <[email protected]> wrote:

I prefer that the "old" parser not be removed, because there are many
files that can only be parsed by the old parser. This came out in a
large scale test with TIKA.

The best idea (in my current opinion) is to use the nonSeq parser
first, and the old parser if there is an exception.
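
A rough sketch of that fallback order, using a hypothetical Parser interface rather than the real PDFBox parser classes (their exact constructors and methods are not shown in this thread, so none are assumed here):

```java
import java.io.IOException;

/** Hypothetical helper, not part of PDFBox. */
public class ParserFallbackSketch {

    /** Stand-in for a parser; T is whatever document type it produces. */
    public interface Parser<T> {
        T parse(byte[] data) throws IOException;
    }

    /**
     * Tries the primary (e.g. non-sequential) parser first and only falls
     * back to the secondary (e.g. old) parser if the primary throws.
     */
    public static <T> T parseWithFallback(byte[] data,
                                          Parser<T> primary,
                                          Parser<T> fallback) throws IOException {
        try {
            return primary.parse(data);
        } catch (IOException primaryFailure) {
            try {
                return fallback.parse(data);
            } catch (IOException fallbackFailure) {
                // Keep the first failure attached so it is not lost.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```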

Tilman

On 14.10.2014 at 09:45, Timo Boehme wrote:

Hi,

On 14.10.2014 at 07:22, John Hewson wrote:

Hi,

John Hewson <[email protected]> wrote on 10 October 2014 at 20:05:


        - Parsing (Andreas?)

I guess we won't get a completely new parser in 2.0, but I'll try to
improve the XRef and the COSStream stuff.

It would be great if we could get rid of the old parser and switch
to the non-sequential
parser, WDYT?

I would also propose to completely remove the old parser. That way
we are more flexible in parsing streams etc. since parts of the
non-sequential parser are a compromise to work side-by-side with the
old parser.
Possibly there are a small number of functions for which the old
parser is still needed - e.g. signing?


Best,
Timo



--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 [email protected]

_____________________________________________________________________

 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
_____________________________________________________________________
