Hi Jukka,
On Dec 7, 2008, at 7:24 PM, Jukka Zitting wrote:
> To keep the tests we could repackage the test suite as an independent
> component and release it under the PDFBox SourceForge project with
> proper disclaimers about the copyright status of the included files.
> Then in our developer documentation at Apache we could have a pointer
> to that test suite and instructions on how to integrate it with a
> normal PDFBox checkout.
This sounds good. For the text extraction tests, and possibly the
others, we could place the SF files in their own directory and have
the test suite check whether that directory exists, skipping those
tests (rather than reporting an error) if it does not.
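A minimal sketch of that existence check, assuming plain java.io.File
and nothing PDFBox-specific. The class and method names
(OptionalTestDir, pdfFilesIn) are hypothetical and only for
illustration:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class OptionalTestDir {
    /**
     * Returns the PDF files found directly in dir, sorted by path.
     * A missing directory yields an empty list instead of an error,
     * so an optional test-file directory can simply be absent.
     */
    static List<File> pdfFilesIn(File dir) {
        List<File> result = new ArrayList<File>();
        File[] files = dir.listFiles(); // null if dir is missing or not a directory
        if (files == null) {
            return result;
        }
        Arrays.sort(files);
        for (File f : files) {
            if (f.isFile() && f.getName().toLowerCase().endsWith(".pdf")) {
                result.add(f);
            }
        }
        return result;
    }
}
```

A test that loops over pdfFilesIn(new File("input-sf")) would then run
zero comparisons on a checkout that does not have the SF files.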
Before we move the test files though, I would like to think about how
we can improve the text extraction regression tests. Currently, there
is a single "input" directory containing PDF files, each with a
corresponding text file (same name, with the extension changed to
.txt). For example, "test1.pdf" is paired with "test1.txt". The tests
extract text from each PDF file and compare the result with its text
file. This works great, except that it does not allow us to test the
'sort' feature, which is what some of the page rotation and Arabic
text direction fixes need.
It seems like we could add tests that use the sorting feature by either:
a) Storing the PDF files in one directory, with a separate directory
of text files for each test (i.e. a directory for non-sorted text
files and a directory for sorted text files). Each text file keeps the
name of its PDF file, with the extension changed to .txt.
b) Storing the PDF files and text files in the same directory, but
naming the 'sorted' text files so that they end in "-sort.txt". For
example, "test1.pdf" would have "test1.txt" for its non-sorted gold
standard and "test1-sort.txt" for its sorted gold standard.
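To make the two layouts concrete, here is a rough sketch of how the
gold-standard file for a PDF could be located under each option. The
class and method names are hypothetical, and the "sorted" and
"unsorted" directory names used for option a are an assumption; only
the "-sort.txt" suffix comes from the proposal above:

```java
import java.io.File;

// Hypothetical helper, not part of PDFBox.
class GoldFiles {
    /** File name without its last extension, e.g. "test1.pdf" -> "test1". */
    static String baseName(File pdf) {
        String name = pdf.getName();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? name : name.substring(0, dot);
    }

    /** Option a: one gold directory per mode (directory names are assumed). */
    static File goldByDirectory(File pdf, File goldRoot, boolean sorted) {
        File dir = new File(goldRoot, sorted ? "sorted" : "unsorted");
        return new File(dir, baseName(pdf) + ".txt");
    }

    /** Option b: gold file sits next to the PDF; sorted output ends in "-sort.txt". */
    static File goldBySuffix(File pdf, boolean sorted) {
        String suffix = sorted ? "-sort.txt" : ".txt";
        return new File(pdf.getParentFile(), baseName(pdf) + suffix);
    }
}
```

With option b, goldBySuffix(new File("input/test1.pdf"), true) points
at "test1-sort.txt" in the same "input" directory as the PDF.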
If we do approach b, then we do not need to change the current
directory structure. If we do approach a, then we do. 'b' seems a
little more clumsy, but it could be easier if we are going to have
multiple directories of test files, since each directory would carry
its own gold files. For example, we could have an 'input' directory
for the files in Apache and an 'input-sf' directory for the files in
SourceForge.
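As a sketch of how approach b might work across several directories,
the following collects one test case per PDF and mode, skipping any
directory that is missing and any mode that has no gold file. The
class names (ExtractionCases, Case) and the collect method are
hypothetical; the actual extraction and comparison are left out:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class ExtractionCases {
    /** One comparison to run: a PDF, its gold text file, and the sort mode. */
    static final class Case {
        final File pdf;
        final File gold;
        final boolean sorted;

        Case(File pdf, File gold, boolean sorted) {
            this.pdf = pdf;
            this.gold = gold;
            this.sorted = sorted;
        }
    }

    /** Collects cases from each named directory under root that exists. */
    static List<Case> collect(File root, String... dirNames) {
        List<Case> cases = new ArrayList<Case>();
        for (String dirName : dirNames) {
            File dir = new File(root, dirName);
            File[] files = dir.listFiles(); // null if the directory is missing
            if (files == null) {
                continue; // e.g. 'input-sf' not checked out: skip, no error
            }
            Arrays.sort(files);
            for (File f : files) {
                String name = f.getName();
                if (!name.toLowerCase().endsWith(".pdf")) {
                    continue;
                }
                String base = name.substring(0, name.length() - 4);
                for (boolean sorted : new boolean[] { false, true }) {
                    File gold = new File(dir, base + (sorted ? "-sort.txt" : ".txt"));
                    if (gold.isFile()) {
                        cases.add(new Case(f, gold, sorted));
                    }
                }
            }
        }
        return cases;
    }
}
```

Calling collect(root, "input", "input-sf") would then cover both sets
of files when the SF directory is present and only the Apache files
when it is not.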
Thoughts?
brian