Tilman,
  This is fantastic!  If you send me an example of the code you used to call
preflight (#parse() or #parse(Format format)?), I'd like to run it within
tika-batch to see what our batch performance is.
  Ideally, once we can turn our public VM on, it would be fun to run these
tests there.
  

         Best,

                    Tim

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Friday, December 05, 2014 2:45 PM
To: dev@pdfbox.apache.org
Subject: Re: preflight mass tests

Some numbers... it took 4-5 days

total: 231223, failed: 142, percentage failed: 0.06141257472336292

Of these, one can subtract 33 OutOfMemoryErrors that happened near the 
end of the test. Isolated runs of those files went fine.

about the rest:
18 are the isSymbol stackoverflow
9 are the getFontMatrix NPE
33 are the "root must be of type Pages" errors

The rest is mostly related to very broken PDF files.

Tilman


On 04.12.2014 at 14:55, Maruan Sahyoun wrote:
> Hi Tilman,
>
> that's very good news. I trust a lot of time went into reviewing the test 
> results. Without your and Tim's efforts this achievement wouldn't have been 
> possible.
>
> BR
>
> Maruan
>
> On 03.12.2014 at 21:04, Tilman Hausherr <thaush...@t-online.de> wrote:
>
>> I've now run preflight on half of the govdocs files. Every issue I have 
>> opened on preflight is related to that test. The failure rate (exceptions 
>> other than the "allowed" ValidationExceptions) is down from 1% when I 
>> started to 0.05% now. Most of the frequent exceptions (e.g. the one with 
>> NonTerminalField) have been fixed. What's left now are exceptions related to 
>> messy files, and some of the font-related issues.
>>
>> Tilman
>>
>> On 03.11.2014 at 22:58, Tilman Hausherr wrote:
>>> On 03.11.2014 at 19:00, Tilman Hausherr wrote:
>>>> It is not looking good, there is at least one NPE issue coming.
>>> No more NPE after solving the two issues I opened today except 
>>> PDFBOX-1743.pdf which is a known problem.
>>>
>>> Coming up soon: run preflight on the 231227 PDF files from digitalcorpora 
>>> to see what happens.
>>>
>>> Tilman
>>>
>
