On 15.10.2014 at 12:12, Andreas Lehmkühler <andr...@lehmi.de> wrote:
>
>
>> Timo Boehme <timo.boe...@ontochem.com> wrote on 15 October 2014 at 09:18:
>>
>>
>> Hi,
>>
>> the difference between the parsers stems from the fact that the old
>> parser can cope with a completely broken xref table because it uses the
>> objects as it finds them on its sequential way. What we need (as I
>> proposed before) is a repair mechanism scanning the file for object
>> start/end to be used for re-creating the xref table.
>> I will see if I can find some time to do this.
> I already have a working prototype but I'm not yet happy with the
> implementation.
>
>> The only other stopper is, as Andreas has pointed out, the signing. I'm
>> not familiar with this and don't know what needs to be done here.
> Me neither.
>

If we keep the old parser side by side with the new one, we can look at
implementing incremental updates correctly at a later stage, thus
supporting not only signing but other important use cases too. That is
something we can do behind the scenes.

>> Best,
>> Timo
>
> BR
> Andreas Lehmkühler
>
>> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
>>> Here are some:
>>>
>>> 055/055794.pdf
>>> 082/082463.pdf
>>> 108/108362.pdf
>>> 113/113223.pdf
>>> 115/115458.pdf
>>> 115/115463.pdf
>>> 122/122393.pdf
>>> 129/129416.pdf
>>> 133/133423.pdf
>>> 148/148020.pdf
>>> 152/152012.pdf
>>> 161/161466.pdf
>>>
>>> to be found here:
>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>>>
>>> Tilman
>>>
>>> On 14.10.2014 at 21:06, John Hewson wrote:
>>>> Unless somebody provides us with a list of those files, then I think
>>>> this is an unreasonable request. As long as we continue to leave the
>>>> old parser in PDFBox, we won’t get the bug reports which we need to
>>>> fix the new parser, and the situation will never resolve itself.
>>>> Falling back to the old parser is just as bad - we won’t get bug reports.
>>>>
>>>> -- John
>>>>
>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <thaush...@t-online.de> wrote:
>>>>
>>>>> I prefer that the "old" parser not be removed, because there are many
>>>>> files that can only be parsed by the old parser. This came out in a
>>>>> large-scale test with TIKA.
>>>>>
>>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>>> first, and the old parser if there is an exception.
>>>>>
>>>>> Tilman
>>>>>
>>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>>>> Hi,
>>>>>>>>> John Hewson <j...@jahewson.com> wrote on 10 October 2014 at 20:05:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Parsing (Andreas?)
>>>>>>>> I guess we won't get a complete new parser in 2.0, but I'll try to
>>>>>>>> improve the XRef and the COSStream stuff
>>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>>> to the non-sequential parser, WDYT?
>>>>>> I would also propose to completely remove the old parser. That way
>>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>>> old parser.
>>>>>> Possibly there are a small number of functions for which the old
>>>>>> parser is still needed - e.g. signing?
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>> Timo
>>>>>>
>>>>>>
>>>>
>>>
>>
>>
>> --
>>
>> Timo Boehme
>> OntoChem GmbH
>> H.-Damerow-Str. 4
>> 06120 Halle/Saale
>> T: +49 345 4780474
>> F: +49 345 4780471
>> timo.boe...@ontochem.com
>>
>> _____________________________________________________________________
>>
>> OntoChem GmbH
>> Managing Director: Dr. Lutz Weber
>> Registered office: Halle / Saale
>> Court of registration: Stendal
>> Registration number: HRB 215461
>> _____________________________________________________________________
>>
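
For reference, the repair mechanism Timo describes in the thread (scanning the file for object starts in order to re-create the xref table) might look roughly like the sketch below. This is not PDFBox code and not the prototype Andreas mentions; the class name `XrefRebuildSketch` and the method `scanObjectOffsets` are hypothetical. It assumes the whole file fits in memory and that every object header of the form "N G obj" begins at the start of a line.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: collect byte offsets of "N G obj" headers by brute-force scan. */
public class XrefRebuildSketch {

    // "<object number> <generation number> obj" at the start of a line
    private static final Pattern OBJ_START =
            Pattern.compile("(?m)^(\\d+)\\s+(\\d+)\\s+obj\\b");

    /**
     * Returns a map from "objNr genNr" to the byte offset of that object header.
     * If the same key occurs more than once, the last occurrence wins, which
     * roughly mirrors how later incremental updates override earlier objects.
     */
    public static Map<String, Long> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so a char index
        // in the decoded string equals the byte offset in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start());
        }
        return offsets;
    }
}
```

A real repair step would need more than this: the sketch can be fooled by byte sequences inside stream data that happen to look like an object header, and it does not handle objects stored in object streams or locate the trailer dictionary.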