Tilman,

This is fantastic! If you send me an example of the code you used to call preflight (#parse() or #parse(Format format)???), I'd like to run it within tika-batch to see what our batch performance is. Ideally, once we can turn our public VM on, it would be fun to run these tests there.
Best,

Tim

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Friday, December 05, 2014 2:45 PM
To: dev@pdfbox.apache.org
Subject: Re: preflight mass tests

Some numbers... it took 4-5 days

total: 231223, failed: 142, percentage failed: 0.06141257472336292

Of these, one can subtract 33 OutOfMemoryErrors that happened near the end of the test. Isolated runs went fine.

About the rest:
18 are the isSymbol stackoverflow
9 are the getFontMatrix NPE
33 are the "root must be of type Pages" errors

The rest is mostly related to very broken PDF files.

Tilman

On 04.12.2014 at 14:55, Maruan Sahyoun wrote:
> Hi Tilman,
>
> that's very good news. I trust a lot of time went into reviewing the test
> results. Without your and Tim's efforts this achievement wouldn't have been
> possible.
>
> BR
>
> Maruan
>
> On 03.12.2014 at 21:04, Tilman Hausherr <thaush...@t-online.de> wrote:
>
>> I've now run preflight on half of the govdocs files. Every issue I have
>> opened on preflight is related to that test. The failure rate (exceptions
>> other than the "allowed" ValidationExceptions) is down from 1% when I
>> started to 0.05% now. Most of the frequent exceptions (e.g. the one with
>> NonTerminalField) have been fixed. What's left now are exceptions related to
>> messy files, and some of the font-related issues.
>>
>> Tilman
>>
>> On 03.11.2014 at 22:58, Tilman Hausherr wrote:
>>> On 03.11.2014 at 19:00, Tilman Hausherr wrote:
>>>> It is not looking good, there is at least one NPE issue coming.
>>> No more NPEs after solving the two issues I opened today, except
>>> PDFBOX-1743.pdf, which is a known problem.
>>>
>>> Coming up soon: run preflight on the 231227 PDF files from digitalcorpora
>>> to see what happens.
>>>
>>> Tilman
>>>
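
The figures in Tilman's 5 December message can be rechecked with a short calculation. The sketch below is only an illustration of that arithmetic, not code from the thread: the class name `PreflightStats` and both method names are made up here, and it assumes "percentage failed" means failed files divided by total files, times 100.

```java
public class PreflightStats {

    // Percentage of files that failed, assuming failed / total * 100.
    public static double failurePercentage(long failed, long total) {
        return 100.0 * failed / total;
    }

    // Failures left over once the named categories are subtracted.
    public static long remaining(long failed, long... categories) {
        long sum = 0;
        for (long c : categories) {
            sum += c;
        }
        return failed - sum;
    }

    public static void main(String[] args) {
        long total = 231223;   // files processed, as reported in the thread
        long failed = 142;     // files that failed, as reported in the thread

        // Overall failure percentage.
        System.out.println("percentage failed: " + failurePercentage(failed, total));

        // Failure percentage if the 33 OutOfMemoryErrors are not counted.
        System.out.println("without OOM: " + failurePercentage(failed - 33, total));

        // Failures outside the four named categories:
        // 33 OOM, 18 isSymbol, 9 getFontMatrix, 33 "root must be of type Pages".
        System.out.println("uncategorised: " + remaining(failed, 33, 18, 9, 33));
    }
}
```

Under these assumptions, 49 of the 142 failures fall outside the four named categories, which would be the "very broken PDF files" group, and leaving out the OutOfMemoryErrors puts the failure rate at roughly 0.047%.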