>> To keep the tests we could repackage the test suite as an independent
>> component and release it under the PDFBox SourceForge project with
>> proper disclaimers about the copyright status of the included files.
>> Then in our developer documentation at Apache we could have a pointer
>> to that test suite and instructions on how to integrate it with a
>> normal PDFBox checkout.
>
>This sounds good. For the text extraction tests, and possibly the
>others, we could do it such that the SF files are placed in their own
>directory and the test suite will test for the existence of the
>directory (and not give an error if it does not exist).

That's a good idea.

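As a rough sketch of that existence check: the helper below returns an empty list when the directory is absent, so a test looping over the result simply has nothing to do instead of failing. The class and method names are hypothetical and not part of PDFBox; only standard java.nio calls are used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not an existing PDFBox class.
public class OptionalTestDir {

    /**
     * Returns the PDF files found directly in dir, sorted by path.
     * If dir does not exist (e.g. the SourceForge files were never
     * downloaded), an empty list is returned instead of an error.
     */
    public static List<Path> listPdfs(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return Collections.emptyList();
        }
        List<Path> pdfs = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dir)) {
            entries.filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                   .sorted()
                   .forEach(pdfs::add);
        }
        return pdfs;
    }
}
```

A test would then iterate over the returned list and compare each file against its gold standard; with the directory missing, the loop body never runs.
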
>...
>It seems like we could add tests that use the sorting feature by either:
>a) Store the PDF files in one directory and a separate directory
>exists for each test (i.e. a directory for non-sorted text files and
>a directory for sorted text files). The text files for each test are
>stored in the directory and renamed to have a .txt extension.
>b) Store the PDF files and text files in the same directory, but
>rename the 'sorted' text files to have "-sort.txt" at the end. For
>example, "test1.pdf" would have "test1.txt" for its non-sorted gold
>standard and "test1-sort.txt" for its sorted gold standard.
>
>If we do approach b, then we do not need to change the current
>directory structure. If we do approach a, then we do. 'b' seems a
>little more clumsy, but it could be easier if we are going to have
>multiple directories of test files. For example, we could have an
>'input' directory of the files in Apache and an 'input-sf' directory
>of the files in SourceForge.

I agree with Brian. We should extend the extraction tests with an
additional test that has sorting enabled. I prefer approach b).

I like the idea of two directories, one for each source. That way we can
replace the documents we need in 'input-sf' with other suitable
documents step by step.

Andreas
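
P.S. For illustration, the naming convention of approach b) could be mapped in code roughly as follows. This is only a sketch: the class and method names are hypothetical and not part of PDFBox, and it assumes each gold-standard file sits next to its PDF.

```java
import java.nio.file.Path;

// Hypothetical helper, not an existing PDFBox class.
public class GoldFileNames {

    /**
     * Maps a PDF to its gold-standard text file in the same directory:
     * "test1.pdf" -> "test1.txt" (non-sorted) or "test1-sort.txt" (sorted).
     */
    public static Path goldFile(Path pdf, boolean sorted) {
        String name = pdf.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Strip the extension if there is one.
        String base = dot < 0 ? name : name.substring(0, dot);
        return pdf.resolveSibling(base + (sorted ? "-sort.txt" : ".txt"));
    }
}
```

The same method works unchanged for files in 'input' and in 'input-sf', since it only looks at the file name and keeps the parent directory.
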
