Hi Maruan

Thanks for your thoughts...

> # Tests #
> In addition to rendering we shall be covering metadata and text extraction as 
> well as PDF/A validation. 

Yes, we could add extracted text and validation results to the “regression” SVN 
repo also.

> # Testfiles # 
> Recently there were a number of test sets made available which we can use. […]

Excellent.

> In addition we can put additional files into our own repository as you 
> suggested.
> So there is no shortage of test files. 

Some people seem to have downloaded many (or all) of the JIRA files; I guess we 
could add those too.

> TIKA-1300/TIKA-1302 has a discussion around the same topic together with some 
> development for an infrastructure (VM, Jenkins …). IMHO we should join forces 
> with them.

I see that in TIKA-1302 the Tika developers suggest that PDFBox should set up 
its own regression tests, so I guess that’s our starting point. We should make 
sure that it’s easy to run just the text extraction regression tests using 
maven, and also ask them to give us any test files they have.
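
A text extraction regression test of this kind could come down to comparing 
freshly extracted text with a stored known-good copy. Below is a minimal sketch 
in plain Java. It assumes the extraction itself has already happened, and the 
class and method names are hypothetical, not part of PDFBox or Tika.

```java
public class TextRegressionCheck {

    /**
     * Compares two texts line by line, ignoring CRLF vs LF differences.
     * Returns the 1-based number of the first differing line, or -1 if
     * both texts are identical.
     */
    public static int firstDifference(String expected, String actual) {
        String[] exp = expected.split("\\r?\\n", -1);
        String[] act = actual.split("\\r?\\n", -1);
        int common = Math.min(exp.length, act.length);
        for (int i = 0; i < common; i++) {
            if (!exp[i].equals(act[i])) {
                return i + 1;
            }
        }
        // One text is a prefix of the other: the first extra line differs.
        return exp.length == act.length ? -1 : common + 1;
    }

    public static void main(String[] args) {
        // Hypothetical inputs standing in for stored and fresh extraction output.
        String knownGood = "Title\nFirst paragraph\nSecond paragraph";
        String extracted = "Title\nFirst paragraph\nSecond  paragraph";
        int line = firstDifference(knownGood, extracted);
        if (line == -1) {
            System.out.println("Extracted text matches the known-good copy");
        } else {
            System.out.println("Regression: first difference at line " + line);
        }
    }
}
```

Reporting the first differing line rather than a plain pass/fail should make it 
easier to see what changed when a test file regresses.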

-- John

PS. Nice job handling those tough questions at PDFDays, I watched the video.

On 3 Jul 2014, at 23:43, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:

> Hi John,
> 
> thanks for bringing this up. This is a very important topic which was also 
> discussed at the PDFDays in Germany.
> 
> # Tests #
> In addition to rendering we shall be covering metadata and text extraction as 
> well as PDF/A validation. 
> 
> # Testfiles # 
> Recently there were a number of test sets made available which we can use. 
> http://digitalcorpora.org/corpora/files , 
> https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors …
> For PDF/A validation there is the Isartor test suite 
> http://www.pdfa.org/2011/08/download-isartor-test-suite/. Some restrictions 
> apply there.
> In addition we can put additional files into our own repository as you 
> suggested.
> So there is no shortage of test files. 
> 
> TIKA-1300/TIKA-1302 has a discussion around the same topic together with some 
> development for an infrastructure (VM, Jenkins …). IMHO we should join forces 
> with them.
> 
> BR
> 
> Maruan
> 
> 
> On 04.07.2014 at 02:16, John Hewson <j...@jahewson.com> wrote:
> 
>> Hi All
>> 
>> I’ve been thinking about regression testing recently and how we can improve
>> our tests for rendering. There are currently two problems:
>> 
>> 1) Different JDKs produce slightly different renderings (see PDFBOX-1843).
>>   (I suspect that AWT fonts are a big part of this, so the problem might get
>>   a lot better soon once we render all fonts ourselves).
>> 
>> 2) Most PDF test files we have are not under an Apache-friendly license, so
>>   we can’t put the test files into the trunk SVN.
>> 
>> It seems that some of you have your own collections of test PDF files which
>> you are running regression tests on: that’s great but it would be much
>> better if we had a central repository of test files and sample renderings.
>> 
>> I’d like to suggest the following solutions to the above issues:
>> 
>> 1) We should choose a “blessed” JDK which will be used to perform the
>>   renderings; this should be whatever is a convenient and sensible default
>>   for committers. (My preference would be for Oracle’s JDK 7 because JDK 6
>>   is deprecated and has known rendering bugs). We should make sure that
>>   Jenkins runs tests using the “blessed” JDK.
>> 
>>   The regression test can then check to see if it is running on the
>>   “blessed” JDK and if not then the tests can be skipped and we can warn
>>   the user.
>> 
>> 2) We should create a new “regression” branch in SVN which contains only
>>   PDF files for testing and PNG images which contain known-good renderings
>>   created using the “blessed” JDK. This branch would not be part of the
>>   source of PDFBox but will still allow us to version control the test PDFs
>>   (it also simplifies the workflow for adding new test PDFs and new
>>   known-good renderings: simply do an “svn add”).
>> 
>>   As far as copyright and licensing are concerned we can put any PDF files
>>   which are available publicly on the web into this branch without too much
>>   worry.
>> 
>> What does everybody think?
>> 
>> -- John
>> 
> 
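
The “blessed” JDK check from point 1 of the original proposal could be sketched 
roughly as follows. This is only an illustration: it assumes Oracle’s JDK 7 is 
the blessed JDK (the preference stated above, not a decision), and the class, 
constants and method names are hypothetical. It relies only on the standard 
java.version and java.vendor system properties.

```java
public class BlessedJdkCheck {

    // Assumption: JDK 7 is the "blessed" JDK; its java.version starts with "1.7.".
    public static final String BLESSED_VERSION_PREFIX = "1.7.";

    // Assumption: vendor string reported by Oracle's JDK 7. OpenJDK builds may
    // report the same vendor, so a real check might also need java.vm.name.
    public static final String BLESSED_VENDOR = "Oracle Corporation";

    /** Returns true if the given version and vendor match the blessed JDK. */
    public static boolean isBlessed(String version, String vendor) {
        return version != null
                && vendor != null
                && version.startsWith(BLESSED_VERSION_PREFIX)
                && vendor.equals(BLESSED_VENDOR);
    }

    /** Checks the JVM that is currently running. */
    public static boolean runningOnBlessedJdk() {
        return isBlessed(System.getProperty("java.version"),
                         System.getProperty("java.vendor"));
    }

    public static void main(String[] args) {
        if (!runningOnBlessedJdk()) {
            // Skip rather than fail, and tell the user why.
            System.err.println("WARNING: skipping rendering regression tests, "
                    + "not running on the blessed JDK (found "
                    + System.getProperty("java.vendor") + " "
                    + System.getProperty("java.version") + ")");
            return;
        }
        System.out.println("Running rendering regression tests");
    }
}
```

In a JUnit 4 test the same condition could be passed to Assume.assumeTrue, so 
that the test is reported as skipped instead of failed on other JDKs.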
