Re: 2.0

Tilman Hausherr Tue, 14 Oct 2014 12:18:53 -0700

Here are some:

055/055794.pdf
082/082463.pdf
108/108362.pdf
113/113223.pdf
115/115458.pdf
115/115463.pdf
122/122393.pdf
129/129416.pdf
133/133423.pdf
148/148020.pdf
152/152012.pdf
161/161466.pdf


They can be found here:
http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/

Tilman

On 14.10.2014 at 21:06, John Hewson wrote:

Unless somebody provides us with a list of those files, I think this is an 
unreasonable request. As long as we continue to leave the old parser in PDFBox, 
we won’t get the bug reports which we need to fix the new parser, and the 
situation will never resolve itself. Falling back to the old parser is just as 
bad - we won’t get bug reports.

-- John

On 14 Oct 2014, at 07:39, Tilman Hausherr <[email protected]> wrote:

I prefer that the "old" parser not be removed, because there are many files 
that can only be parsed by the old parser. This came out of a large-scale test 
with Tika.

The best idea (in my current opinion) is to use the nonSeq parser first, and 
the old parser if there is an exception.
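
The fallback idea above could be sketched roughly as follows. This is only an illustration: the `Parser` interface, `parseWithFallback`, and `primaryFailures` are hypothetical names, not PDFBox API, and the failing parser in `main` is simulated. The sketch records each primary-parser failure before falling back, so that the failing files can still be reported.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FallbackParsing {

    /** Hypothetical stand-in for a parser; this is not a PDFBox type. */
    interface Parser {
        String parse(String path) throws IOException;
    }

    /** Failures of the primary parser, kept so they can still be reported as bugs. */
    static final List<String> primaryFailures = new ArrayList<>();

    /** Tries the primary parser first; on an exception, records it and uses the fallback. */
    static String parseWithFallback(Parser primary, Parser fallback, String path)
            throws IOException {
        try {
            return primary.parse(path);
        } catch (IOException e) {
            // Remember which file failed and why before switching parsers.
            primaryFailures.add(path + ": " + e.getMessage());
            return fallback.parse(path);
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulated "non-sequential" parser that rejects one made-up file name.
        Parser nonSeq = path -> {
            if (path.equals("broken.pdf")) {
                throw new IOException("simulated parse failure");
            }
            return "nonSeq:" + path;
        };
        // Simulated "old" parser that accepts everything.
        Parser old = path -> "old:" + path;

        System.out.println(parseWithFallback(nonSeq, old, "good.pdf"));
        System.out.println(parseWithFallback(nonSeq, old, "broken.pdf"));
        System.out.println("Primary parser failures: " + primaryFailures);
    }
}
```

Keeping a record of the primary parser's failures is one possible way to address the concern that a silent fallback would hide bug reports.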

Tilman

On 14.10.2014 at 09:45, Timo Boehme wrote:

Hi,

On 14.10.2014 at 07:22, John Hewson wrote:

Hi,

On 10 October 2014 at 20:05, John Hewson <[email protected]> wrote:


        - Parsing (Andreas?)

I guess we won't get a completely new parser in 2.0, but I will try to improve
the XRef and COSStream handling.

It would be great if we could get rid of the old parser and switch to the 
non-sequential parser. WDYT?

I would also propose to completely remove the old parser. That way we are more 
flexible in parsing streams etc. since parts of the non-sequential parser are a 
compromise to work side-by-side with the old parser.
Possibly there are a small number of functions for which the old parser is 
still needed - e.g. signing?


Best,
Timo
