Re: Regression Testing

2014-07-15 Thread Tilman Hausherr
As yet another proof that IT people always solve things in similar ways, 
see this interesting blog post by one of our competitors:

http://blog.idrsolutions.com/2013/06/save-time-test/

Tilman

On 04.07.2014 23:05, Petr Slabý wrote:

Hi,
following is a description of what we are doing in our company.

With our software, we run regression tests after each nightly build
and sometimes it is a tough fight. If there is a regression, it is not
so easy to find which commit caused it, because there are potentially
many between the nightly builds. Then, the decision whether the change
is wanted and expected is in some cases also difficult (this part
might be easier with PDF, where there is the gold standard
rendering in Acrobat). If the change is expected and the new rendering
is better, then one has to commit the new reference. This means that the
files produced on the nightly build machine must be available somehow;
it is almost impossible to produce them locally, as the rendering
results differ slightly between Java versions and for many other
reasons. All this has to be done before the next regression test is
run, so that new regressions are not hidden by earlier ones.
Our complete build with all tests runs for several hours...


To improve this workflow, we now use the following scheme in addition:
- there is a smaller set of regression tests which runs relatively fast
- these tests are triggered by each commit in formatting and rendering
related projects
- before running the test itself, the modified project(s) are compiled
locally, without publishing the result to Maven

- the reference rendering files are stored in SVN
- if a test finds a regression, it immediately stores the new result
as a new reference in SVN. This makes sure that a) the test
renderings do not get lost and b) each regression points exactly
to the commit that caused it - the one that triggered the test.
The failed test creates a new issue in JIRA with pointers into SVN to
the before and after renderings and a bitmap of the differences. The
issue is then processed. If we find the change to be expected then the
issue is simply closed, otherwise we take action to fix the problem.
The only annoying thing about this scheme is that, after committing the
correction, the test runs again and reports a regression because it
now compares to the faulty version of the rendering.
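
The "store the new result as the new reference" step can be sketched in Java with plain files standing in for SVN. Everything in this sketch is hypothetical (the class name ReferenceUpdater, the method name, the ".before" suffix); the setup described above commits to SVN and opens a JIRA issue, and both of those steps are left out here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical helper: plain files stand in for the SVN working copy.
final class ReferenceUpdater {

    /**
     * Compares a fresh rendering with the stored reference byte for byte.
     * On a mismatch the old reference is kept next to it as "<name>.before"
     * and the new rendering becomes the reference.
     *
     * @return true if a regression was detected and the reference replaced
     */
    static boolean checkAndPromote(Path reference, byte[] newRendering) {
        try {
            byte[] old = Files.readAllBytes(reference);
            if (Arrays.equals(old, newRendering)) {
                return false; // nothing changed, reference stays as it is
            }
            // Keep the "before" rendering so that an issue can point to both.
            Path before = reference.resolveSibling(reference.getFileName() + ".before");
            Files.move(reference, before, StandardCopyOption.REPLACE_EXISTING);
            Files.write(reference, newRendering);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the reference is replaced right away, the next commit is compared against the rendering that triggered the report, which is also why a later fix shows up as a regression again.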


Best regards,
Petr.

-Original Message- From: John Hewson
Sent: Friday, July 04, 2014 7:39 PM
To: dev@pdfbox.apache.org
Subject: Re: Regression Testing

Hi Tilman

Thanks for your thoughts, I think that your concerns are already 
covered by my original proposal, I’ll try to explain why and how:


Of course I agree with the need for regression tests, however it 
isn't easy: besides the problems of the different JDKs (I use JDK7 
Windows 64 bit), there is the problem that some enhancements create 
slight changes in rendering that are not errors, i.e. both the 
before and the after files look OK by themselves. This has happened
when we changed the text rendering recently, and has happened again
when the clipping was improved. The causes are probably slight changes
in color or in boundaries.


If a rendering has changed then the regression test should fail. When 
a failure occurs the developer needs to manually inspect the 
differences (we could generate a visual diff which highlights what 
changed to make this easier) and if ok then they can replace the 
known-good PNG with the ones just rendered. Indeed this will be the 
basic workflow for working with regression tests.
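
The visual diff mentioned above could be sketched in Java roughly as follows. This is only an illustration, not PDFBox code: the class name VisualDiff and its methods are hypothetical, and marking changed pixels in red over a faded copy of the known-good image is just one possible choice.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper that highlights what changed between two renderings.
final class VisualDiff {

    private static final int RED = 0xFF0000;

    /**
     * Builds an image as large as the larger of the two inputs.
     * Pixels that differ, or that exist in only one input, are painted red;
     * identical pixels are shown as a lightened copy of the expected image.
     */
    static BufferedImage diff(BufferedImage expected, BufferedImage actual) {
        int w = Math.max(expected.getWidth(), actual.getWidth());
        int h = Math.max(expected.getHeight(), actual.getHeight());
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                boolean inBoth = x < expected.getWidth() && y < expected.getHeight()
                        && x < actual.getWidth() && y < actual.getHeight();
                if (inBoth && expected.getRGB(x, y) == actual.getRGB(x, y)) {
                    out.setRGB(x, y, fade(expected.getRGB(x, y)));
                } else {
                    out.setRGB(x, y, RED);
                }
            }
        }
        return out;
    }

    // Lightens a color so that the red markers stand out against it.
    private static int fade(int rgb) {
        int r = lighten((rgb >> 16) & 0xFF);
        int g = lighten((rgb >> 8) & 0xFF);
        int b = lighten(rgb & 0xFF);
        return (r << 16) | (g << 8) | b;
    }

    // Moves a channel value three quarters of the way towards white.
    private static int lighten(int c) {
        return c + (255 - c) * 3 / 4;
    }
}
```

The resulting image could be written out with ImageIO next to the failing test's rendering, so the developer only has to look at the red areas.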


Copyright is a problem: I'm testing mostly with JIRA attachments
that I've downloaded over the years. While uploading such files to
JIRA might count as fair use, I doubt that this would still be true
if they are included in a distribution. Instead, they should be
stored somewhere on Apache servers where only committers and build
software (Travis, Jenkins, ...) can access them. The public PDFs
that Maruan mentions don't possibly have all the problem cases that
we solved before. However I have started working with these files and
there are at least 5 recent issues that deal with them.


The PDFs won’t be in a distribution. They will just happen to be 
stored in an SVN repo but not our source code repo, in the same way 
that the website is stored in the “cmssite” branch of SVN or indeed, 
are on JIRA. The law doesn’t distinguish between JIRA and SVN, both 
are publicly available via HTTP, so using SVN will simply be a 
continuation of what we’re already doing with JIRA.


The crucial factor is that we’re only storing publicly available PDFs, 
because we have the right to do so, just like Google’s cache, and like 
we currently do with JIRA.


Additionally, the PDFs need to be version controlled otherwise we 
won’t be able to reliably recreate previous builds, so storing the 
files on a web server won’t be practical. Also committers will 
frequently

Re: Regression Testing

2014-07-15 Thread John Hewson
Interesting, it certainly looks pretty similar.

-- John

On 14 Jul 2014, at 23:15, Tilman Hausherr thaush...@t-online.de wrote:

 As yet another proof that IT people always solve things in similar ways, see 
 this interesting blog post by one of our competitors:
 http://blog.idrsolutions.com/2013/06/save-time-test/
 
 Tilman
 

Re: Regression Testing

2014-07-08 Thread John Hewson
On 6 Jul 2014, at 01:28, Guillaume Bailleul gbm.baill...@gmail.com wrote:

 About why the Isartor tests are not run by default:
 
 In the early days of preflight in PDFBox, I did not enable them by default
 because some manual steps were needed to make them work, and I was not good
 with Maven at that time. When I changed that by using a Maven download
 plugin, I did not change the default mode, only so as not to
 break the build, as the preflight code was not so stable.
 
 I have no objection to changing the default mode. One idea could
 be to move the tests into the integration tests, maybe using the failsafe
 plugin. I can work on it.

Great, I’m going to enable these tests by default in the trunk.

Running these tests as unit tests with surefire looks good to me. As there
isn’t a test environment which needs tearing down I’m not sure that we’d
stand to gain from moving to failsafe?

-- John

Re: Regression Testing

2014-07-08 Thread John Hewson

On 4 Jul 2014, at 14:05, Petr Slabý sl...@kadel.cz wrote:

 Hi,
 following is a description of what we are doing in our company.
 
 With our software, we run regression tests after each nightly build and 
 sometimes it is a tough fight. If there is a regression, it is not so easy to 
 find which commit caused it, because there are potentially many between the 
 nightly builds.

Our Jenkins build should run after each commit, so that will simplify things a
bit. Sometimes it doesn’t, but we also have TravisCI, which always does.

 Then, the decision whether the change is wanted and expected is in some cases
 also difficult (this part might be easier with PDF, where there is the gold
 standard rendering in Acrobat).

Yes, Acrobat is the answer here for PDF. Most of the time the decision should 
be straightforward.

 If the change is expected and the new rendering is better, then one has to
 commit the new reference. This means that the files produced on the nightly
 build machine must be available somehow; it is almost impossible to produce
 them locally, as the rendering results differ slightly between Java versions
 and for many other reasons.

Yes, I really want to get local renderings working if possible. That might
include some basic restrictions on which JVMs can be used (my “blessed JVM”
proposal) but also introducing some fuzziness into the image comparisons,
perhaps allowing a small per-pixel error. I’m hoping that once AWT rendering of
fonts is removed we’ll see more consistent rendering across JVMs.
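
A per-pixel tolerance of the kind described above might look like the following Java sketch. The class FuzzyImageCompare, the method names and the choice to measure the error per color channel (ignoring alpha) are assumptions for illustration; the thread does not say how the error would be measured.

```java
import java.awt.image.BufferedImage;

// Hypothetical fuzzy comparison: small per-channel deviations are tolerated.
final class FuzzyImageCompare {

    /**
     * Returns true if both images have the same size and no pixel differs
     * by more than maxChannelError in any of its red, green or blue channels.
     */
    static boolean similar(BufferedImage expected, BufferedImage actual, int maxChannelError) {
        if (expected.getWidth() != actual.getWidth()
                || expected.getHeight() != actual.getHeight()) {
            return false;
        }
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                int delta = maxChannelDelta(expected.getRGB(x, y), actual.getRGB(x, y));
                if (delta > maxChannelError) {
                    return false;
                }
            }
        }
        return true;
    }

    // Largest absolute difference over the blue, green and red channels.
    static int maxChannelDelta(int rgbA, int rgbB) {
        int max = 0;
        for (int shift = 0; shift <= 16; shift += 8) {
            int a = (rgbA >> shift) & 0xFF;
            int b = (rgbB >> shift) & 0xFF;
            max = Math.max(max, Math.abs(a - b));
        }
        return max;
    }
}
```

With maxChannelError set to 0 this degenerates to an exact comparison, so the tolerance could start strict and be loosened only where JVM differences make it necessary.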

 All this has to be done before the next regression test is run, so that new
 regressions are not hidden by earlier ones. Our complete build with all tests
 runs for several hours…
 
 To improve this workflow, we now use the following scheme in addition:
 - there is a smaller set of regression tests which runs relatively fast
 - these tests are triggered by each commit in formatting and rendering
 related projects
 - before running the test itself, the modified project(s) are compiled
 locally, without publishing the result to Maven
 - the reference rendering files are stored in SVN
 - if a test finds a regression, it immediately stores the new result as a new
 reference in SVN. This makes sure that a) the test renderings do not get
 lost and b) each regression points exactly to the commit that caused
 it - the one that triggered the test. The failed test creates a new issue in
 JIRA with pointers into SVN to the before and after renderings and a bitmap of
 the differences. The issue is then processed. If we find the change to be
 expected then the issue is simply closed, otherwise we take action to fix
 the problem. The only annoying thing about this scheme is that, after
 committing the correction, the test runs again and reports a regression
 because it now compares to the faulty version of the rendering.

That sounds fairly similar to my proposal, I like the aspect of pushing the 
server build’s PNGs to SVN. If we can’t get robust local rendering to work then 
that sounds like a good way to make the images easily available.

Thanks

-- John



Re: Regression Testing

2014-07-08 Thread John Hewson
Hi Tim,

  My initial plan for TIKA-1302 is very similar to what Tilman outlined, and 
 my understanding/concerns/thoughts were very much in line with what he 
 articulated.  The idea is that there should be a small Apache license-able 
 gold truth set like both projects now have for specific unit tests 
 (patient-based care), but that we should also occasionally take a 
 public-health view and compare the outputs of  different versions of our 
 parsers on a large set of docs to identify new exceptions or large changes in 
 extracted content/metadata. 

I’m not aware of a good supply of Apache license-able PDF files, we have very 
few such tests currently. For regression tests to be useful we really have to 
run our tests on a large corpus of real files every time.

   I'm persuaded by your points about fair use and the importance of open 
 data.  Before proceeding on TIKA-1302, I'd like to get broader feedback on 
 the way ahead via legal-discuss or maybe jira's Legal.  Do you mind if I 
 quote your arguments?

Yes, certainly, obviously I’m not a lawyer. My reasoning is basically that 
Google do essentially the same thing that we want to and they have plenty of 
lawyers who presumably know what they’re doing.

   Also, I was on my way to requesting a vm from infra for TIKA-1302.  Do you 
 see any way that we could share resources so that we're not double-storing 
 files on Apache infrastructure?  There may be easy ways to share some eval 
 code as well.

I was thinking of just storing our test files in an SVN branch, the Tika 
project should already have read access (obviously write access would be for 
PDFBox committers only otherwise our builds will get broken). The tests could 
run on Jenkins as part of the normal build process. For eval code I was
planning to simply have a single parameterized JUnit test which runs in
parallel, that way it’s easy to run from an IDE and to debug and integrate with 
Maven. The unit test would look for source files in ../../regression which 
would be a directory above the SVN trunk (i.e. a separate repo). It would do a 
full rendering of each file to a PNG and compare the results, we’ll probably 
have a text extraction test too: perhaps that’s more like what Tika will need?
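
As a rough illustration of how such a test could find its inputs, the following Java sketch lists the PDFs in a regression directory and pairs each one with a reference PNG. The class RegressionCases and the "<name>.pdf.png" naming convention are assumptions, not something decided in this thread; a parameterized JUnit test could use the returned list as its parameters.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical discovery of regression test inputs in a directory.
final class RegressionCases {

    /** One source PDF together with the PNG it is expected to render to. */
    static final class Case {
        final Path pdf;
        final Path expectedPng;

        Case(Path pdf, Path expectedPng) {
            this.pdf = pdf;
            this.expectedPng = expectedPng;
        }
    }

    /**
     * Lists every *.pdf directly inside dir, sorted by path, and pairs it
     * with a sibling file named "<pdf file name>.png" (assumed convention).
     */
    static List<Case> discover(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .map(p -> new Case(p, p.resolveSibling(p.getFileName() + ".png")))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would pass something like Paths.get("../../regression") here, matching the directory layout described above.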

Thanks

-- John



Re: Regression Testing

2014-07-07 Thread Petr Slabý

Hi,
following is a description of what we are doing in our company.

With our software, we run regression tests after each nightly build and
sometimes it is a tough fight. If there is a regression, it is not so easy
to find which commit caused it, because there are potentially many between
the nightly builds. Then, the decision whether the change is wanted and
expected is in some cases also difficult (this part might be easier with PDF,
where there is the gold standard rendering in Acrobat). If the change is
expected and the new rendering is better, then one has to commit the new
reference. This means that the files produced on the nightly build machine
must be available somehow; it is almost impossible to produce them locally,
as the rendering results differ slightly between Java versions and for many
other reasons. All this has to be done before the next regression test is
run, so that new regressions are not hidden by earlier ones. Our complete
build with all tests runs for several hours...


To improve this workflow, we now use the following scheme in addition:
- there is a smaller set of regression tests which runs relatively fast
- these tests are triggered by each commit in formatting and rendering
related projects
- before running the test itself, the modified project(s) are compiled
locally, without publishing the result to Maven

- the reference rendering files are stored in SVN
- if a test finds a regression, it immediately stores the new result as a
new reference in SVN. This makes sure that a) the test renderings do not
get lost and b) each regression points exactly to the commit that
caused it - the one that triggered the test. The failed test creates a new
issue in JIRA with pointers into SVN to the before and after renderings and a
bitmap of the differences. The issue is then processed. If we find the
change to be expected then the issue is simply closed, otherwise we take
action to fix the problem. The only annoying thing about this scheme is
that, after committing the correction, the test runs again and reports a
regression because it now compares to the faulty version of the rendering.


Best regards,
Petr.

-Original Message-
From: John Hewson

Sent: Friday, July 04, 2014 7:39 PM
To: dev@pdfbox.apache.org
Subject: Re: Regression Testing

Hi Tilman

Thanks for your thoughts, I think that your concerns are already covered by 
my original proposal, I’ll try to explain why and how:


Of course I agree with the need for regression tests, however it isn't 
easy: besides the problems of the different JDKs (I use JDK7 Windows 64 
bit), there is the problem that some enhancements create slight changes in 
rendering that are not errors, i.e. both the before and the after 
files look OK by themselves. This has happened when we changed the text
rendering recently, and has happened again when the clipping was improved.
The causes are probably slight changes in color or in boundaries.


If a rendering has changed then the regression test should fail. When a 
failure occurs the developer needs to manually inspect the differences (we 
could generate a visual diff which highlights what changed to make this 
easier) and if ok then they can replace the known-good PNG with the ones 
just rendered. Indeed this will be the basic workflow for working with 
regression tests.


Copyright is a problem: I'm testing mostly with JIRA attachments that
I've downloaded over the years. While uploading such files to JIRA might
count as fair use, I doubt that this would still be true if they are
included in a distribution. Instead, they should be stored somewhere on
Apache servers where only committers and build software (Travis,
Jenkins, ...) can access them. The public PDFs that Maruan mentions
don't possibly have all the problem cases that we solved before. However I
have started working with these files and there are at least 5 recent
issues that deal with them.


The PDFs won’t be in a distribution. They will just happen to be stored in 
an SVN repo but not our source code repo, in the same way that the website 
is stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law 
doesn’t distinguish between JIRA and SVN, both are publicly available via 
HTTP, so using SVN will simply be a continuation of what we’re already doing 
with JIRA.


The crucial factor is that we’re only storing publicly available PDFs, 
because we have the right to do so, just like Google’s cache, and like we 
currently do with JIRA.


Additionally, the PDFs need to be version controlled otherwise we won’t be 
able to reliably recreate previous builds, so storing the files on a web 
server won’t be practical. Also committers will frequently be updating the 
renderings as bugs are fixed and we’ll need to version-control the rendered 
PNG files for the same reason. Finally, having committers-only files doesn’t 
fit well with the Apache goal of open development and would be unnecessary

RE: Regression Testing

2014-07-07 Thread Allison, Timothy B.
John,

   My initial plan for TIKA-1302 is very similar to what Tilman outlined, and 
my understanding/concerns/thoughts were very much in line with what he 
articulated.  The idea is that there should be a small Apache license-able gold 
truth set like both projects now have for specific unit tests (patient-based 
care), but that we should also occasionally take a public-health view and 
compare the outputs of  different versions of our parsers on a large set of 
docs to identify new exceptions or large changes in extracted content/metadata. 

   I'm persuaded by your points about fair use and the importance of open 
data.  Before proceeding on TIKA-1302, I'd like to get broader feedback on the 
way ahead via legal-discuss or maybe jira's Legal.  Do you mind if I quote your 
arguments?

   Also, I was on my way to requesting a vm from infra for TIKA-1302.  Do you 
see any way that we could share resources so that we're not double-storing 
files on Apache infrastructure?  There may be easy ways to share some eval code 
as well.

  Best,

   Tim

-Original Message-
From: John Hewson [mailto:j...@jahewson.com] 
Sent: Saturday, July 05, 2014 5:01 PM
To: dev@pdfbox.apache.org
Subject: Re: Regression Testing


On 5 Jul 2014, at 13:47, Tilman Hausherr thaush...@t-online.de wrote:

 On 05.07.2014 22:12, John Hewson wrote:
 Copyright is a problem: I'm testing mostly with JIRA attachments that
 I've downloaded over the years. While uploading such files to JIRA might
 count as fair use, I doubt that this would still be true if they are
 included in a distribution. Instead, they should be stored somewhere on
 Apache servers where only committers and build software (Travis,
 Jenkins, ...) can access them. The public PDFs that Maruan mentions
 don't possibly have all the problem cases that we solved before. However
 I have started working with these files and there are at least 5 recent
 issues that deal with them.
 The PDFs won't be in a distribution. They will just happen to be stored in 
 an SVN repo but not our source code repo, in the same way that the website 
 is stored in the cmssite branch of SVN or indeed, are on JIRA. The law 
 doesn't distinguish between JIRA and SVN, both are publicly available via 
 HTTP, so using SVN will simply be a continuation of what we're already 
 doing with JIRA.
 
 The crucial factor is that we're only storing publicly available PDFs,  
 because we have the right to do so, just like Google's cache, and like we 
 currently do with JIRA.
 Yes but many PDFs we got aren't really public. If this svn repo is only 
 accessible to committers, and if the publicly available build scripts won't 
 break because of this, then it is OK.
 Any non-public PDFs will not be permitted in our test suite, just as they 
 shouldn't be on JIRA.
 
 Note that even if something is publicly available, it may still be 
 copyrighted. Other risks can be that some people upload PDFs that include 
 personal data. One really good test PDF was apparently a loan application. 
 I remember that the user insisted that 1. it was test data, and 2. that it 
 be removed.
 All Apache development should be in the open, this is a key ASF principle, 
 having a committers-only test suite is basically a no-no. It's important to 
 understand that fair use allows us to use copyrighted works - this is 
 expressly permitted, it's the same legal principle as Google's cache. There 
 is no need to seek permission. This is what we've been doing with JIRA 
 already for years, so we are already doing this - it's fine.
 
 The problem is that this has all happened before. A few years ago, many files 
 were deleted, see PDFBOX-391.

That issue is about including files in the source code repo as part of the 
PDFBox distribution, where there is a need to put files under an Apache 2.0 
compatible license. What I'm advocating is keeping a separate public repository 
of test files which are not a part of the PDFBox source, like we currently have 
on JIRA.

-- John


Re: Regression Testing

2014-07-06 Thread Guillaume Bailleul
About why the Isartor tests are not run by default:

In the early days of preflight in PDFBox, I did not enable them by default
because some manual steps were needed to make them work, and I was not good
with Maven at that time. When I changed that by using a Maven download
plugin, I did not change the default mode, only so as not to
break the build, as the preflight code was not so stable.

I have no objection to changing the default mode. One idea could
be to move the tests into the integration tests, maybe using the failsafe
plugin. I can work on it.





Re: Regression Testing

2014-07-05 Thread Maruan Sahyoun

 Hi Tilman
 
 Thanks for your thoughts, I think that your concerns are already covered by 
 my original proposal, I’ll try to explain why and how:
 
 Of course I agree with the need for regression tests, however it isn't easy: 
 besides the problems of the different JDKs (I use JDK7 Windows 64 bit), 
 there is the problem that some enhancements create slight changes in 
 rendering that are not errors, i.e. both the before and the after files 
 look OK by themselves. This has happened when we changed the text rendering
 recently, and has happened again when the clipping was improved. The causes
 are probably slight changes in color or in boundaries.
 
 If a rendering has changed then the regression test should fail. When a 
 failure occurs the developer needs to manually inspect the differences (we 
 could generate a visual diff which highlights what changed to make this 
 easier) and if ok then they can replace the known-good PNG with the ones just 
 rendered. Indeed this will be the basic workflow for working with regression 
 tests.
 

I think this is the only way to handle that situation. The same applies to
text extraction etc.: if an improvement changes the results, the “base” needs
to be reset by adding the new image, text etc. as the validation source.

A basic testbed could also run against other JDKs, e.g. without validating
against the known-good files, so we pick up potential issues early. This should
be easy with Jenkins and would be treated as a hint.


 Copyright is a problem: I'm testing mostly with JIRA attachments that I've
 downloaded over the years. While uploading such files to JIRA might count as
 fair use, I doubt that this would still be true if they are included in a
 distribution. Instead, they should be stored somewhere on Apache servers
 where only committers and build software (Travis, Jenkins, ...) can
 access them. The public PDFs that Maruan mentions don't possibly have all
 the problem cases that we solved before. However I have started working with
 these files and there are at least 5 recent issues that deal with them.
 
 The PDFs won’t be in a distribution. They will just happen to be stored in an 
 SVN repo but not our source code repo, in the same way that the website is 
 stored in the “cmssite” branch of SVN or indeed, are on JIRA. The law doesn’t 
 distinguish between JIRA and SVN, both are publicly available via HTTP, so 
 using SVN will simply be a continuation of what we’re already doing with JIRA.
 
 The crucial factor is that we’re only storing publicly available PDFs,  
 because we have the right to do so, just like Google’s cache, and like we 
 currently do with JIRA.
 
 Additionally, the PDFs need to be version controlled otherwise we won’t be 
 able to reliably recreate previous builds, so storing the files on a web 
 server won’t be practical. Also committers will frequently be updating the 
 renderings as bugs are fixed and we’ll need to version-control the rendered 
 PNG files for the same reason. Finally, having committers-only files doesn’t 
 fit well with the Apache goal of open development and would be unnecessary 
 anyway given that all the PDFs are to be taken from public sources only.
 
 In summary, I’m proposing that we just keep doing what we’re currently doing 
 with JIRA but we move it into its own SVN repo along with some pre-rendered 
 PNGs.

In addition, if we put in workarounds to handle nonconforming PDFs, a unit test 
should be added to make sure that we don’t break them, e.g. when rewriting 
the parser. 

 
 Re preflight: the default mode should be to have the Isartor tests on. 
 Individuals could still disable them locally, but the central build software 
 should always use them.
 
 Yes - does anybody know why this isn’t the default?
 

No.

+1 for enabling it by default


 -- John



Re: Regression Testing

2014-07-05 Thread Tilman Hausherr

On 04.07.2014 19:39, John Hewson wrote:

Hi Tilman

Thanks for your thoughts, I think that your concerns are already covered by my 
original proposal, I’ll try to explain why and how:


Of course I agree with the need for regression tests, however it isn't easy: besides the problems 
of the different JDKs (I use JDK7 Windows 64 bit), there is the problem that some enhancements 
create slight changes in rendering that are not errors, i.e. both the before and the 
after files look OK by themselves. This happened when we changed the text rendering 
recently, and happened again when the clipping was improved. The cause is probably slight 
changes in color or in boundaries.

If a rendering has changed then the regression test should fail. When a failure 
occurs the developer needs to manually inspect the differences (we could 
generate a visual diff which highlights what changed to make this easier) and 
if ok then they can replace the known-good PNG with the ones just rendered. 
Indeed this will be the basic workflow for working with regression tests.


That's exactly what I do now: I generate a visual diff and decide 
whether it is relevant or not. If I think not, then I replace 
the PNG.
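
Such a visual diff can be sketched as follows. This is an illustration only, in 
Python and without any imaging library: images are modelled as lists of rows of 
(r, g, b) tuples, and the name visual_diff and the red highlight colour are 
assumptions, not the tool actually in use.

```python
def visual_diff(reference, rendered, highlight=(255, 0, 0)):
    """Build a diff image from two equally sized RGB pixel grids.

    Each image is a list of rows, each row a list of (r, g, b) tuples.
    Changed pixels are painted in the highlight colour; unchanged pixels are
    faded towards white so that the changes stand out.
    Returns (diff_image, number_of_changed_pixels).
    """
    if len(reference) != len(rendered) or any(
            len(a) != len(b) for a, b in zip(reference, rendered)):
        raise ValueError("images differ in size")
    diff, changed = [], 0
    for ref_row, new_row in zip(reference, rendered):
        out_row = []
        for ref_px, new_px in zip(ref_row, new_row):
            if ref_px != new_px:
                out_row.append(highlight)
                changed += 1
            else:
                # Fade the pixel by averaging each channel with white.
                out_row.append(tuple((c + 255) // 2 for c in ref_px))
        diff.append(out_row)
    return diff, changed
```

A changed-pixel count of zero means the known-good PNG still matches; anything 
else goes to a human for the relevant-or-not decision.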





Copyright is a problem: I'm testing mostly with JIRA attachments that I've downloaded over the 
years. While uploading such files to JIRA might count as fair use, I doubt that this would still be 
true if they are included in a distribution. Instead, they should be stored somewhere on Apache 
servers where only committers and build software (Travis, Jenkins, ...) can 
access them. The public PDFs that Maruan mentions can't possibly have all the problem cases that we 
solved before. However, I have started working with these files and there are at least 5 recent 
issues that deal with them.

The PDFs won’t be in a distribution. They will just happen to be stored in an 
SVN repo, but not our source code repo, in the same way that the website is 
stored in the “cmssite” branch of SVN or, indeed, the way the PDFs are on 
JIRA. The law doesn’t distinguish between JIRA and SVN: both are publicly 
available via HTTP, so using SVN will simply be a continuation of what we’re 
already doing with JIRA.

The crucial factor is that we’re only storing publicly available PDFs,  because 
we have the right to do so, just like Google’s cache, and like we currently do 
with JIRA.


Yes but many PDFs we got aren't really public. If this svn repo is 
only accessible to committers, and if the publicly available build 
scripts won't break because of this, then it is OK.


Note that even if something is publicly available, it may still be 
copyrighted. Other risks can be that some people upload PDFs that 
include personal data. One really good test PDF was apparently a loan 
application. I remember that the user insisted that 1. it was test data, 
and 2. that it be removed.


Tilman




Re: Regression Testing

2014-07-05 Thread John Hewson

 Yes but many PDFs we got aren't really public. If this svn repo is only 
 accessible to committers, and if the publicly available build scripts won't 
 break because of this, then it is OK.

Any non-public PDFs will not be permitted in our test suite, just as they 
shouldn't be on JIRA.

 Note that even if something is publicly available, it may still be 
 copyrighted. Other risks can be that some people upload PDFs that include 
 personal data. One really good test PDF was apparently a loan application. I 
 remember that the user insisted that 1. it was test data, and 2. that it be 
 removed.

All Apache development should be in the open, this is a key ASF principle, 
having a committers-only test suite is basically a no-no. It's important to 
understand that fair use allows us to use copyrighted works - this is 
expressly permitted, it's the same legal principle as Google’s cache. There is 
no need to seek permission. This is what we’ve been doing with JIRA already for 
years, so we are already doing this - it’s fine.

Naturally, if anybody objects to their PDF being in our test suite, we can 
always remove it, but it shouldn’t include anything which isn’t already on the 
public web.

-- John

Re: Regression Testing

2014-07-05 Thread Tilman Hausherr

On 05.07.2014 22:12, John Hewson wrote:

All Apache development should be in the open, this is a key ASF principle, having a 
committers-only test suite is basically a no-no. It's important to understand that 
fair use allows us to use copyrighted works - this is expressly permitted, 
it's the same legal principle as Google’s cache. There is no need to seek permission. 
This is what we’ve been doing with JIRA already for years, so we are already doing this - 
it’s fine.


The problem is that this has all happened before. A few years ago, many 
files were deleted, see PDFBOX-391.


Tilman



Naturally, if anybody objects to their PDF being in our test suite, we can 
always remove it, but it shouldn’t include anything which isn’t already on the 
public web.

-- John




Re: Regression Testing

2014-07-05 Thread John Hewson

On 5 Jul 2014, at 13:47, Tilman Hausherr thaush...@t-online.de wrote:

 The problem is that this has all happened before. A few years ago, many files 
 were deleted, see PDFBOX-391.

That issue is about including files in the source code repo as part of the 
PDFBox distribution, where there is a need to put files under an Apache 2.0 
compatible license. What I’m advocating is keeping a separate public repository 
of test files which are not a part of the PDFBox source, like we currently have 
on JIRA.

-- John

Re: Regression Testing

2014-07-04 Thread Maruan Sahyoun
Hi John,

thanks for bringing this up. This is a very important topic which was also 
discussed at the PDFDays in Germany.

# Tests #
In addition to rendering we should cover metadata and text extraction as 
well as PDF/A validation. 

# Testfiles # 
Recently there were a number of test sets made available which we can use. 
http://digitalcorpora.org/corpora/files , 
https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors …
For PDF/A validation there is the Isartor test suite 
http://www.pdfa.org/2011/08/download-isartor-test-suite/. Some restrictions 
apply there.
In addition we can put further files into our own repository as you 
suggested.
So there is no shortage of test files. 

TIKA-1300/TIKA-1302 has a discussion around the same topic together with some 
development for an infrastructure (VM, Jenkins …). IMHO we should join forces 
with them.

BR

Maruan





Re: Regression Testing

2014-07-04 Thread Tilman Hausherr
Of course I agree with the need for regression tests, however it isn't 
easy: besides the problems of the different JDKs (I use JDK7 Windows 64 
bit), there is the problem that some enhancements create slight changes 
in rendering that are not errors, i.e. both the before and the after 
files look OK by themselves. This happened when we changed the text 
rendering recently, and happened again when the clipping was 
improved. The cause is probably slight changes in color or in boundaries.
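
To get a feeling for how "slight" such a change is, one could measure which 
fraction of pixels moved by more than a few colour steps. The sketch below is 
only an illustration in Python; the name changed_fraction and the per-channel 
tolerance are assumptions of mine, and the result is meant as a triage aid for 
the person reviewing a changed rendering, not as a pass/fail rule.

```python
def changed_fraction(reference, rendered, tolerance=0):
    """Fraction of pixels whose colour differs by more than `tolerance`
    in at least one RGB channel.

    Images are lists of rows of (r, g, b) tuples and are assumed to have
    identical dimensions.
    """
    total = changed = 0
    for ref_row, new_row in zip(reference, rendered):
        for ref_px, new_px in zip(ref_row, new_row):
            total += 1
            if any(abs(a - b) > tolerance for a, b in zip(ref_px, new_px)):
                changed += 1
    return changed / total if total else 0.0
```

With tolerance=0 every altered pixel counts; raising the tolerance filters out 
small colour shifts, so what remains points at boundary or layout changes.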


Copyright is a problem: I'm testing mostly with JIRA attachments that 
I've downloaded over the years. While uploading such files to JIRA might 
count as fair use, I doubt that this would still be true if they are 
included in a distribution. Instead, they should be stored somewhere on 
Apache servers where only committers and build software (Travis, 
Jenkins, ...) can access them. The public PDFs that Maruan mentions 
can't possibly have all the problem cases that we solved before. However, 
I have started working with these files and there are at least 5 recent 
issues that deal with them.


I'm using an improved version of the TestPDFToImage class and I will 
commit it within a few days, but I must clean it up first.


Re preflight: the default mode should be to have the Isartor tests on. 
Individuals could still disable them locally, but the central build 
software should always use them.


Tilman




Re: Regression Testing

2014-07-04 Thread John Hewson
Hi Tilman

Thanks for your thoughts. I think that your concerns are already covered by my 
original proposal; I’ll try to explain why and how:

 Of course I agree with the need for regression tests, however it isn't easy: 
 besides the problems of the different JDKs (I use JDK7 Windows 64 bit), there 
 is the problem that some enhancements create slight changes in rendering that 
 are not errors, i.e. both the before and the after files look OK by 
 themselves. This happened when we changed the text rendering recently, and 
 happened again when the clipping was improved. The cause is probably 
 slight changes in color or in boundaries.

If a rendering has changed then the regression test should fail. When a failure 
occurs the developer needs to manually inspect the differences (we could 
generate a visual diff which highlights what changed to make this easier) and 
if ok then they can replace the known-good PNG with the ones just rendered. 
Indeed this will be the basic workflow for working with regression tests.

 Copyright is a problem: I'm testing mostly with JIRA attachments that I've 
 downloaded over the years. While uploading such files to JIRA might count as 
 fair use, I doubt that this would still be true if they are included in a 
 distribution. Instead, they should be stored somewhere on Apache servers 
 where only committers and build software (Travis, Jenkins, ...) can 
 access them. The public PDFs that Maruan mentions can't possibly have all the 
 problem cases that we solved before. However, I have started working with 
 these files and there are at least 5 recent issues that deal with them.

The PDFs won’t be in a distribution. They will just happen to be stored in an 
SVN repo, but not our source code repo, in the same way that the website is 
stored in the “cmssite” branch of SVN or, indeed, the way the PDFs are on 
JIRA. The law doesn’t distinguish between JIRA and SVN: both are publicly 
available via HTTP, so using SVN will simply be a continuation of what we’re 
already doing with JIRA.

The crucial factor is that we’re only storing publicly available PDFs,  because 
we have the right to do so, just like Google’s cache, and like we currently do 
with JIRA.

Additionally, the PDFs need to be version controlled otherwise we won’t be able 
to reliably recreate previous builds, so storing the files on a web server 
won’t be practical. Also committers will frequently be updating the renderings 
as bugs are fixed and we’ll need to version-control the rendered PNG files for 
the same reason. Finally, having committers-only files doesn’t fit well with 
the Apache goal of open development and would be unnecessary anyway given that 
all the PDFs are to be taken from public sources only.

In summary, I’m proposing that we just keep doing what we’re currently doing 
with JIRA but we move it into its own SVN repo along with some pre-rendered 
PNGs.

 Re preflight: the default mode should be to have the Isartor tests on. 
 Individuals could still disable them locally, but the central build software 
 should always use them.

Yes - does anybody know why this isn’t the default?

-- John

Re: Regression Testing

2014-07-04 Thread John Hewson
Hi Maruan

Thanks for your thoughts...

 # Tests #
 In addition to rendering we should cover metadata and text extraction as 
 well as PDF/A validation. 

Yes, we could add extracted text and validation results to the “regression” SVN 
repo also.

 # Testfiles # 
 Recently there were a number of test sets made available which we can use. […]

Excellent.

 In addition we can put further files into our own repository as you 
 suggested.
 So there is no shortage of test files. 

Some people seem to have downloaded many (or all) of the JIRA files, I guess we 
could add those too.

 TIKA-1300/TIKA-1302 has a discussion around the same topic together with some 
 development for an infrastructure (VM, Jenkins …). IMHO we should join forces 
 with them.

I see that in TIKA-1302 the Tika developers suggest that PDFBox should set up 
its own regression tests, so I guess that’s our starting point. We should make 
sure that it’s easy to run just the text extraction regression tests using 
maven, and also ask them to give us any test files they have.

-- John

PS. Nice job handling those tough questions at PDFDays, I watched the video.




Regression Testing

2014-07-03 Thread John Hewson
Hi All

I’ve been thinking about regression testing recently and how we can improve
our tests for rendering. There are currently two problems:

1) Different JDKs produce slightly different renderings (see PDFBOX-1843).
   (I suspect that AWT fonts are a big part of this, so the problem might get
   a lot better soon once we render all fonts ourselves).

2) Most PDF test files we have are not under an Apache-friendly license, so
   we can’t put the test files into the trunk SVN.

It seems that some of you have your own collections of test PDF files which you
are running regression tests on: that’s great but it would be much better if we
had a central repository of test files and sample renderings.

I’d like to suggest the following solutions to the above issues:

1) We should choose a “blessed” JDK which will be used to perform the renderings;
   this should be whatever is a convenient and sensible default for committers.
   (My preference would be for Oracle’s JDK 7 because JDK 6 is deprecated and has
   known rendering bugs). We should make sure that Jenkins runs tests using the
   “blessed” JDK.

   The regression test can then check to see if it is running on the “blessed”
   JDK and if not then the tests can be skipped and we can warn the user.

2) We should create a new “regression” branch in SVN which contains only PDF
   files for testing and PNG images which contain known-good renderings created
   using the “blessed” JDK. This branch would not be part of the source of PDFBox
   but will still allow us to version control the test PDFs (it also simplifies
   the workflow for adding new test PDFs and new known-good renderings: simply do
   an “svn add”).

   As far as copyright and licensing is concerned we can put any PDF files which
   are available publicly on the web into this branch without too much worry.
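
The JDK check from point 1 could look roughly like this. It is a sketch in 
Python for brevity; a real version would be Java and read the "java.vendor" 
and "java.version" system properties. The function name and the default values 
("Oracle Corporation", version strings starting with "1.7.") are assumptions 
standing in for whichever JDK ends up being blessed.

```python
def rendering_tests_enabled(java_vendor, java_version,
                            blessed_vendor="Oracle Corporation",
                            blessed_version_prefix="1.7."):
    """Decide whether rendering regression tests should run on this JVM.

    java_vendor / java_version correspond to the Java system properties
    "java.vendor" and "java.version". Returns (run_tests, message), where
    message is the warning to show the user when the tests are skipped.
    """
    if (java_vendor == blessed_vendor
            and java_version.startswith(blessed_version_prefix)):
        return True, ""
    return False, (
        "Skipping rendering regression tests: running on %s %s, but the "
        "known-good renderings were created with %s %s*"
        % (java_vendor, java_version, blessed_vendor, blessed_version_prefix))
```

Skipping with a warning, rather than failing, keeps the build usable for 
people on other JDKs while Jenkins remains the authoritative run.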
  
What does everybody think?

-- John