Re: 2.0

Maruan Sahyoun Wed, 15 Oct 2014 03:23:03 -0700

On 15.10.2014 at 12:12, Andreas Lehmkühler <[email protected]> wrote:


> 
> 
>> On 15 October 2014 at 09:18, Timo Boehme <[email protected]> wrote:
>> 
>> 
>> Hi,
>> 
>> the difference between the parsers stems from the fact that the old
>> parser can cope with a completely broken xref table because it uses the
>> objects as it finds them while reading the file sequentially. What we
>> need (as I proposed before) is a repair mechanism that scans the file
>> for object start/end markers and uses them to re-create the xref table.
>> I will see if I can find some time to do this.
> I already have a working prototype but I'm not yet happy with the
> implementation.
> 
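As a rough illustration of the repair idea Timo describes above (scanning the file for object starts in order to re-create the xref table), a minimal Java sketch might look like the following. It is neither the prototype mentioned in this thread nor PDFBox code: the class name `XrefScanSketch` and the method `scanObjectOffsets` are made up for this example, and it assumes the whole file fits in memory. It only records the byte offset of each `N G obj` header; a real repair would also have to handle object streams and byte sequences inside stream data that merely look like object headers.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of PDFBox.
class XrefScanSketch {
    // Matches "<objectNumber> <generationNumber> obj" starting at the first digit.
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(?<![0-9])(\\d+)\\s+(\\d+)\\s+obj\\b");

    // Returns a map from "objectNumber generationNumber" to the byte offset
    // of the object header, in the order the keys were first seen.
    static Map<String, Long> scanObjectOffsets(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char,
        // so regex match indices are identical to byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Long> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_HEADER.matcher(text);
        while (m.find()) {
            // A later definition of the same object overwrites an earlier one
            // (assumption: the last occurrence in the file is the current one).
            offsets.put(m.group(1) + " " + m.group(2), (long) m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(scanObjectOffsets(pdf));
    }
}
```

For the tiny input in `main`, this prints `{1 0=9, 2 0=30}`, i.e. the offsets an xref table would need for the two objects.
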
>> The only other stopper is, as Andreas has pointed out, the signing. I'm
>> not familiar with this and don't know what needs to be done here.
> Me neither.
> 

If we keep the old parser side by side with the new one, we can implement
incremental updates correctly at a later stage, which would support not only
signing but other important use cases too. That is something we can do
behind the scenes.



>> Best,
>> Timo
> 
> BR
> Andreas Lehmkühler
> 
>> On 14.10.2014 at 21:18, Tilman Hausherr wrote:
>>> Here are some:
>>> 
>>> 055/055794.pdf
>>> 082/082463.pdf
>>> 108/108362.pdf
>>> 113/113223.pdf
>>> 115/115458.pdf
>>> 115/115463.pdf
>>> 122/122393.pdf
>>> 129/129416.pdf
>>> 133/133423.pdf
>>> 148/148020.pdf
>>> 152/152012.pdf
>>> 161/161466.pdf
>>> 
>>> to be found here:
>>> http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
>>> 
>>> Tilman
>>> 
>>> On 14.10.2014 at 21:06, John Hewson wrote:
>>>> Unless somebody provides us with a list of those files, then I think
>>>> this is an unreasonable request. As long as we continue to leave the
>>>> old parser in PDFBox, we won’t get the bug reports which we need to
>>>> fix the new parser, and the situation will never resolve itself.
>>>> Falling back to the old parser is just as bad - we won’t get bug reports.
>>>> 
>>>> -- John
>>>> 
>>>> On 14 Oct 2014, at 07:39, Tilman Hausherr <[email protected]> wrote:
>>>> 
>>>>> I prefer that the "old" parser not be removed, because there are many
>>>>> files that can only be parsed by the old parser. This came out in a
>>>>> large scale test with TIKA.
>>>>> 
>>>>> The best idea (in my current opinion) is to use the nonSeq parser
>>>>> first, and the old parser if there is an exception.
>>>>> 
>>>>> Tilman
>>>>> 
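Tilman's suggestion above (non-sequential parser first, old parser only if there is an exception) can be sketched generically as follows. The names `ParserFallbackSketch` and `parseWithFallback` are hypothetical, and the two parsers are modeled as plain `Function`s rather than actual PDFBox classes; parsers that throw checked exceptions such as `IOException` would need to be wrapped first. This is only one possible shape for such a fallback, under those assumptions.

```java
import java.util.function.Function;

// Hypothetical sketch, not part of PDFBox.
class ParserFallbackSketch {
    // Tries the primary parser; if it throws, tries the fallback parser.
    static <T> T parseWithFallback(byte[] data,
                                   Function<byte[], T> primary,
                                   Function<byte[], T> fallback) {
        try {
            return primary.apply(data);
        } catch (RuntimeException primaryFailure) {
            try {
                return fallback.apply(data);
            } catch (RuntimeException fallbackFailure) {
                // Keep the primary failure visible instead of losing it.
                if (fallbackFailure != primaryFailure) {
                    fallbackFailure.addSuppressed(primaryFailure);
                }
                throw fallbackFailure;
            }
        }
    }

    public static void main(String[] args) {
        String result = ParserFallbackSketch.<String>parseWithFallback(
                new byte[0],
                d -> { throw new IllegalStateException("broken xref"); },
                d -> "parsed by old parser");
        System.out.println(result);
    }
}
```

If both parsers fail, the exception from the fallback is thrown with the primary parser's exception attached as a suppressed exception, so neither failure is silently dropped.
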
>>>>> On 14.10.2014 at 09:45, Timo Boehme wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> On 14.10.2014 at 07:22, John Hewson wrote:
>>>>>>> Hi,
>>>>>>>>> On 10 October 2014 at 20:05, John Hewson <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>          - Parsing (Andreas?)
>>>>>>>> I guess we won't get a completely new parser in 2.0, but I will try
>>>>>>>> to improve the XRef and the COSStream handling.
>>>>>>> It would be great if we could get rid of the old parser and switch
>>>>>>> to the non-sequential
>>>>>>> parser, WDYT?
>>>>>> I would also propose to completely remove the old parser. That way
>>>>>> we are more flexible in parsing streams etc. since parts of the
>>>>>> non-sequential parser are a compromise to work side-by-side with the
>>>>>> old parser.
>>>>>> Possibly there are a small number of functions for which the old
>>>>>> parser is still needed - e.g. signing?
>>>>>> 
>>>>>> 
>>>>>> Best,
>>>>>> Timo
>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
>> 
>> --
>> 
>>    Timo Boehme
>>    OntoChem GmbH
>>    H.-Damerow-Str. 4
>>    06120 Halle/Saale
>>    T: +49 345 4780474
>>    F: +49 345 4780471
>>    [email protected]
>> 
>> _____________________________________________________________________
>> 
>>    OntoChem GmbH
>>    Managing Director: Dr. Lutz Weber
>>    Registered office: Halle / Saale
>>    Register court: Stendal
>>    Registration number: HRB 215461
>> _____________________________________________________________________
>>
