> Timo Boehme <timo.boe...@ontochem.com> wrote on 15 October 2014 at 09:18:
>
> Hi,
>
> the difference between the parsers stems from the fact that the old
> parser can cope with a completely broken xref table because it uses the
> objects as it encounters them while reading the file sequentially. What
> we need (as I proposed before) is a repair mechanism that scans the file
> for object start/end markers and uses them to re-create the xref table.

I will see if I can find some time to do this. I already have a working
prototype but I'm not yet happy with the implementation.
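To make the repair idea concrete, here is a minimal sketch of such a scan. It is an illustration only: it is neither the prototype mentioned above nor PDFBox code, and the names XrefScanSketch and scanObjectOffsets are made up for this example. It assumes the whole file fits in memory, looks for "N G obj" headers in the raw bytes, and records the byte offset of each one. If an object appears more than once (as in incrementally updated files), the last occurrence wins.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class XrefScanSketch {

    // Matches an object header such as "12 0 obj" (object number, generation, keyword).
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(\\d+)\\s+(\\d+)\\s+obj\\b");

    // Returns a map from "objectNumber generation" to the byte offset of its header.
    public static Map<String, Integer> scanObjectOffsets(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so char index == byte offset.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        Map<String, Integer> offsets = new TreeMap<>();
        Matcher m = OBJ_HEADER.matcher(text);
        while (m.find()) {
            // Later occurrences overwrite earlier ones.
            offsets.put(m.group(1) + " " + m.group(2), m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] pdf = ("%PDF-1.4\n"
                + "1 0 obj\n<< /Type /Catalog >>\nendobj\n"
                + "2 0 obj\n<< /Type /Pages >>\nendobj\n")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(scanObjectOffsets(pdf));
    }
}
```

A scan this naive can also match byte sequences inside binary stream data that merely look like an object header, and it ignores compressed object streams and the trailer, so a real repair mechanism would need further checks before trusting the rebuilt table.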
> The only other stopper is, as Andreas has pointed out, the signing. I'm
> not familiar with this and don't know what needs to be done here.

Me neither.

> Best,
> Timo

BR
Andreas Lehmkühler

> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
> > Here are some:
> >
> > 055/055794.pdf
> > 082/082463.pdf
> > 108/108362.pdf
> > 113/113223.pdf
> > 115/115458.pdf
> > 115/115463.pdf
> > 122/122393.pdf
> > 129/129416.pdf
> > 133/133423.pdf
> > 148/148020.pdf
> > 152/152012.pdf
> > 161/161466.pdf
> >
> > to be found here:
> > http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
> >
> > Tilman
> >
> > On 14.10.2014 at 21:06, John Hewson wrote:
> >> Unless somebody provides us with a list of those files, then I think
> >> this is an unreasonable request. As long as we continue to leave the
> >> old parser in PDFBox, we won’t get the bug reports which we need to
> >> fix the new parser, and the situation will never resolve itself.
> >> Falling back to the old parser is just as bad - we won’t get bug reports.
> >>
> >> -- John
> >>
> >> On 14 Oct 2014, at 07:39, Tilman Hausherr <thaush...@t-online.de> wrote:
> >>
> >>> I prefer that the "old" parser not be removed, because there are many
> >>> files that can only be parsed by the old parser. This came out in a
> >>> large scale test with TIKA.
> >>>
> >>> The best idea (in my current opinion) is to use the nonSeq parser
> >>> first, and the old parser if there is an exception.
> >>>
> >>> Tilman
> >>>
> >>> On 14.10.2014 at 09:45, Timo Boehme wrote:
> >>>> Hi,
> >>>>
> >>>> On 14.10.2014 at 07:22, John Hewson wrote:
> >>>>> Hi,
> >>>>>>> John Hewson <j...@jahewson.com> wrote on 10 October 2014 at 20:05:
> >>>>>>>
> >>>>>>>
> >>>>>>> - Parsing (Andreas?)
> >>>>>> I guess we won't get a completely new parser in 2.0, but I will
> >>>>>> try to improve the XRef
> >>>>>> and the COSStream stuff
> >>>>> It would be great if we could get rid of the old parser and switch
> >>>>> to the non-sequential
> >>>>> parser, WDYT?
> >>>> I would also propose to completely remove the old parser. That way
> >>>> we are more flexible in parsing streams etc. since parts of the
> >>>> non-sequential parser are a compromise to work side-by-side with the
> >>>> old parser.
> >>>> Possibly there are a small number of functions for which the old
> >>>> parser is still needed - e.g. signing?
> >>>>
> >>>>
> >>>> Best,
> >>>> Timo
> >>>>
> >>>>
> >>
> >
>
> --
>
> Timo Boehme
> OntoChem GmbH
> H.-Damerow-Str. 4
> 06120 Halle/Saale
> T: +49 345 4780474
> F: +49 345 4780471
> timo.boe...@ontochem.com
>
> _____________________________________________________________________
>
> OntoChem GmbH
> Managing director: Dr. Lutz Weber
> Registered office: Halle / Saale
> Register court: Stendal
> Registration number: HRB 215461
> _____________________________________________________________________
>