RE: 2.0

Andreas Lehmkühler Tue, 21 Oct 2014 23:35:42 -0700

Hi Tim,

First of all, thanks for the offer; it is highly appreciated!


I already have a first fix for PDFBOX-2441, but there is another issue. I hope
to fix it soon.

I'm just curious: do you run those comparisons manually, or do you plan to
implement a more or less automated test that can be started without much
effort?

BR
Andreas Lehmkühler

> "Allison, Timothy B." <talli...@mitre.org> wrote on 21 October 2014 at 22:19:
>
>
> Hi Tilman,
>   Sounds good.  Should I wait for PDFBOX-2441?
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, October 21, 2014 1:42 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0
>
> Hi Tim,
>
> 2.0 doesn't seem likely to be released soon... what might be useful again
> is a comparison between seq v non-seq. Andreas recently resolved an issue
> (PDFBOX-2250) that improves the nonSeq parser a lot. Although this isn't
> fully done, a follow-up issue PDFBOX-2441
> <https://issues.apache.org/jira/browse/PDFBOX-2441> has been opened
> which will improve a few more complex files.
>
> Tilman
>
>
>
> On 21.10.2014 at 13:00, Allison, Timothy B. wrote:
> > Been too busy over in Tika-land...just noticing this now.
> >
> > Let me know which comparisons you'd like to run (2.0 v 1.8.x or seq v
> > non-seq).  I won't have time to integrate 2.0 into our Tika PDFParser any
> > time soon (Jeremy Anderson on TIKA-1285 has already started this), but I
> > could easily write a lightweight wrapper around PDFBox's TextStripper +
> > metadata inside of the tika-batch/tika-eval framework.
> >
> > Cheers,
> >
> >        Tim
> > ________________________________________
> > From: Andreas Lehmkühler [andr...@lehmi.de]
> > Sent: Wednesday, October 15, 2014 6:20 AM
> > To: dev@pdfbox.apache.org
> > Subject: Re: 2.0
> >
> > Hi,
> >
> >
> >> Maruan Sahyoun <sahy...@fileaffairs.de> wrote on 15 October 2014 at 09:32:
> >>
> >>
> >> What about keeping both for the 2.0 release and phasing the old one out
> >> for 3, but making the NonSequential parser the default?
> >> Would also give us some time to work with Tim (TIKA) on the test suite.
> > I agree, that's the only thing we can manage in a timely manner.
> >
> >
> >> Maybe we could simplify the variations of PDDocument.load to something like
> >>
> >> PDDocument.load(input, raf, enforce, useLegacyParser) or
> >> PDDocument.load(input, raf, enforce, withSignatureSupport) .
> >>
> >> and introduce PDDocument.load(input) to use the NonSequential
> >>
> >>
> >> WDYT?
> > Good idea, I've already created PDFBOX-2430 for this.
> >
> >> Maruan
> >
> > BR
> > Andreas Lehmkühler
> >> On 15.10.2014 at 09:18, Timo Boehme <timo.boe...@ontochem.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> the difference between the parsers stems from the fact that the old parser
> >>> can cope with a completely broken xref table because it uses the objects
> >>> as it finds them on its sequential way. What we need (as I proposed
> >>> before) is a repair mechanism scanning the file for object start/end to be
> >>> used for re-creating the xref table.
> >>> I will see if I can find some time to do this.
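A rough sketch of such a repair scan, in Java, might look like the following. It assumes the whole file fits in memory and that every `N G obj` token marks a real object start; text that merely looks like an object header inside a stream would be picked up as well, so a real implementation would need more checks. The class and method names are hypothetical and not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not PDFBox code.
class XrefRebuildSketch {
    // Matches "<objNr> <genNr> obj" when it does not continue a preceding token.
    private static final Pattern OBJ_START =
            Pattern.compile("(?<![0-9A-Za-z])(\\d+)\\s+(\\d+)\\s+obj\\b");

    /**
     * Scans the raw bytes for object headers and returns a map from
     * "objNr genNr" to the byte offset of the header. If the same key occurs
     * more than once, the occurrence found later in the file wins.
     */
    static Map<String, Long> rebuild(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so a char index
        // in this string equals the byte offset in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        // Prints each recovered object key with the offset of its header.
        rebuild(sample).forEach((key, offset) -> System.out.println(key + " -> " + offset));
    }
}
```

The recovered offsets could then be used in place of a broken xref table, presumably only after the regular xref lookup has failed.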
> >>>
> >>> The only other stopper is, as Andreas has pointed out, the signing. I'm
> >>> not familiar with this and don't know what needs to be done here.
> >>>
> >>>
> >>> Best,
> >>> Timo
> >>>
> >>>
> >>> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
> >>>> Here are some:
> >>>>
> >>>> 055/055794.pdf
> >>>> 082/082463.pdf
> >>>> 108/108362.pdf
> >>>> 113/113223.pdf
> >>>> 115/115458.pdf
> >>>> 115/115463.pdf
> >>>> 122/122393.pdf
> >>>> 129/129416.pdf
> >>>> 133/133423.pdf
> >>>> 148/148020.pdf
> >>>> 152/152012.pdf
> >>>> 161/161466.pdf
> >>>>
> >>>> to be found here:
> >>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
> >>>>
> >>>> Tilman
> >>>>
> >>>> On 14.10.2014 at 21:06, John Hewson wrote:
> >>>>> Unless somebody provides us with a list of those files, then I think
> >>>>> this is an unreasonable request. As long as we continue to leave the
> >>>>> old parser in PDFBox, we won't get the bug reports which we need to
> >>>>> fix the new parser, and the situation will never resolve itself.
> >>>>> Falling back to the old parser is just as bad - we won't get bug
> >>>>> reports.
> >>>>>
> >>>>> -- John
> >>>>>
> >>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <thaush...@t-online.de> wrote:
> >>>>>
> >>>>>> I prefer that the "old" parser not be removed, because there are many
> >>>>>> files that can only be parsed by the old parser. This came out in a
> >>>>>> large scale test with TIKA.
> >>>>>>
> >>>>>> The best idea (in my current opinion) is to use the nonSeq parser
> >>>>>> first, and the old parser if there is an exception.
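The fallback strategy described above can be sketched in Java as follows. This is only an illustration under stated assumptions: `Parser`, `parseWithFallback`, and the `onPrimaryFailure` callback are hypothetical names, not PDFBox API. The callback is there so that a failure of the first parser is still recorded rather than silently hidden by the fallback.

```java
import java.io.IOException;
import java.util.function.Consumer;

// Hypothetical illustration only; none of these types exist in PDFBox.
class ParserFallbackSketch {
    /** Stand-in for a parser that turns raw bytes into some document type. */
    interface Parser<T> {
        T parse(byte[] data) throws IOException;
    }

    /**
     * Tries the primary parser first. If it throws, the failure is handed to
     * onPrimaryFailure and the legacy parser is tried on the same input.
     */
    static <T> T parseWithFallback(byte[] data,
                                   Parser<T> primary,
                                   Parser<T> legacy,
                                   Consumer<Exception> onPrimaryFailure) throws IOException {
        try {
            return primary.parse(data);
        } catch (IOException | RuntimeException primaryFailure) {
            // Report the failure so it can still end up in a bug report.
            onPrimaryFailure.accept(primaryFailure);
            try {
                return legacy.parse(data);
            } catch (IOException | RuntimeException legacyFailure) {
                // Keep both causes visible to the caller.
                legacyFailure.addSuppressed(primaryFailure);
                throw legacyFailure;
            }
        }
    }
}
```

A caller would pass the non-sequential parser as `primary` and the old parser as `legacy`, and could log or collect whatever arrives in `onPrimaryFailure`.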
> >>>>>>
> >>>>>> Tilman
> >>>>>>
> >>>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
> >>>>>>>> Hi,
> >>>>>>>>>> John Hewson <j...@jahewson.com> wrote on 10 October 2014 at 20:05:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>          - Parsing (Andreas?)
> >>>>>>>>> I guess we won't get a completely new parser in 2.0, but I'll try to
> >>>>>>>>> improve the XRef and the COSStream stuff
> >>>>>>>> It would be great if we could get rid of the old parser and switch
> >>>>>>>> to the non-sequential parser, WDYT?
> >>>>>>> I would also propose to completely remove the old parser. That way
> >>>>>>> we are more flexible in parsing streams etc. since parts of the
> >>>>>>> non-sequential parser are a compromise to work side-by-side with the
> >>>>>>> old parser.
> >>>>>>> Possibly there are a small number of functions for which the old
> >>>>>>> parser is still needed - e.g. signing?
> >>>>>>>
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Timo
> >>>>>>>
> >>>>>>>
> >>>
> >>> --
> >>>
> >>> Timo Boehme
> >>> OntoChem GmbH
> >>> H.-Damerow-Str. 4
> >>> 06120 Halle/Saale
> >>> T: +49 345 4780474
> >>> F: +49 345 4780471
> >>> timo.boe...@ontochem.com
> >>>
> >>> _____________________________________________________________________
> >>>
> >>> OntoChem GmbH
> >>> Geschäftsführer: Dr. Lutz Weber
> >>> Sitz: Halle / Saale
> >>> Registergericht: Stendal
> >>> Registernummer: HRB 215461
> >>> _____________________________________________________________________
> >>>
>
