Re: 2.0

Andreas Lehmkühler Wed, 15 Oct 2014 03:13:19 -0700


> On 15 October 2014 at 09:18, Timo Boehme <[email protected]> wrote:
>
>
> Hi,
>
> the difference between the parsers stems from the fact that the old
> parser can cope with a completely broken xref table, because it uses
> the objects as it encounters them while reading the file sequentially.
> What we need (as I proposed before) is a repair mechanism that scans
> the file for object start/end markers and uses them to re-create the
> xref table.
> I will see if I can find some time to do this.
I already have a working prototype but I'm not yet happy with the
implementation.
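
For illustration only, here is a minimal sketch of the kind of scan Timo describes. It is not the prototype mentioned above and not PDFBox code; the class name XrefScanSketch and the method scanObjectOffsets are hypothetical. It assumes the whole file fits in memory and simply records the byte offset of every "<number> <generation> obj" marker, letting later definitions win. A real repair would also have to deal with false hits inside stream data, object streams and the trailer.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class XrefScanSketch {

    // Matches "<objNr> <genNr> obj" when preceded by whitespace or start of input.
    private static final Pattern OBJ_START =
            Pattern.compile("(?<![^\\s])(\\d+)\\s+(\\d+)\\s+obj\\b");

    // Returns a map from "objNr genNr" to the byte offset of the object start.
    static Map<String, Long> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char,
        // so a char index in the string equals the byte offset in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            // A later definition overwrites an earlier one, since
            // incremental updates append newer object versions at the end.
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start(1));
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(scanObjectOffsets(sample));
    }
}
```

The offsets collected this way could then stand in for the entries of a broken xref table.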


> The only other stopper is, as Andreas has pointed out, the signing. I'm
> not familiar with this and don't know what needs to be done here.
Me neither.

> Best,
> Timo

BR
Andreas Lehmkühler

> On 14 Oct 2014, at 21:18, Tilman Hausherr wrote:
> > Here are some:
> >
> > 055/055794.pdf
> > 082/082463.pdf
> > 108/108362.pdf
> > 113/113223.pdf
> > 115/115458.pdf
> > 115/115463.pdf
> > 122/122393.pdf
> > 129/129416.pdf
> > 133/133423.pdf
> > 148/148020.pdf
> > 152/152012.pdf
> > 161/161466.pdf
> >
> > to be found here:
> > http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
> >
> > Tilman
> >
> > On 14 Oct 2014, at 21:06, John Hewson wrote:
> >> Unless somebody provides us with a list of those files, then I think
> >> this is an unreasonable request. As long as we continue to leave the
> >> old parser in PDFBox, we won’t get the bug reports which we need to
> >> fix the new parser, and the situation will never resolve itself.
> >> Falling back to the old parser is just as bad - we won’t get bug reports.
> >>
> >> -- John
> >>
> >> On 14 Oct 2014, at 07:39, Tilman Hausherr <[email protected]> wrote:
> >>
> >>> I prefer that the "old" parser not be removed, because there are many
> >>> files that can only be parsed by the old parser. This came out in a
> >>> large scale test with TIKA.
> >>>
> >>> The best idea (in my current opinion) is to use the nonSeq parser
> >>> first, and the old parser if there is an exception.
> >>>
> >>> Tilman
> >>>
> >>> On 14 Oct 2014, at 09:45, Timo Boehme wrote:
> >>>> Hi,
> >>>>
> >>>> On 14 Oct 2014, at 07:22, John Hewson wrote:
> >>>>> Hi,
> >>>>>>> On 10 October 2014 at 20:05, John Hewson <[email protected]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>         - Parsing (Andreas?)
> >>>>>> I guess we won't get a completely new parser in 2.0, but I will
> >>>>>> try to improve the XRef and COSStream handling
> >>>>> It would be great if we could get rid of the old parser and switch
> >>>>> to the non-sequential
> >>>>> parser, WDYT?
> >>>> I would also propose to completely remove the old parser. That way
> >>>> we are more flexible in parsing streams etc. since parts of the
> >>>> non-sequential parser are a compromise to work side-by-side with the
> >>>> old parser.
> >>>> Possibly there are a small number of functions for which the old
> >>>> parser is still needed - e.g. signing?
> >>>>
> >>>>
> >>>> Best,
> >>>> Timo
> >>>>
> >>>>
> >>
> >
>
>
> --
>
>   Timo Boehme
>   OntoChem GmbH
>   H.-Damerow-Str. 4
>   06120 Halle/Saale
>   T: +49 345 4780474
>   F: +49 345 4780471
>   [email protected]
>
> _____________________________________________________________________
>
>   OntoChem GmbH
>   Managing Director: Dr. Lutz Weber
>   Registered office: Halle / Saale
>   Register court: Stendal
>   Registration number: HRB 215461
> _____________________________________________________________________
>
