I have modified PDFBox to properly handle all of the text rotation
issues that have been discussed, but then I tried the regression
tests. There were lots of failures and while debugging them, I
realized that the regression tests do not specify that the text
should be written in sorted order (background: the regression tests
extract text from a collection of PDF files and then compare the
output with a "gold standard" text file). By default,
PDFTextStripper does not sort the text by its location. It is written
in the order it is declared in the PDF file (which can be any order).
As a result, we really can't test the page rotation issues using the
regression tests because the rotation is taken into account only
when sorting. The errors that I was getting were caused by extra or
missing white space, which could cause hundreds of errors because
everything was off by a line.
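
To illustrate the "off by a line" effect, here is a minimal sketch. It assumes the comparison is done line by line at matching indexes; I have not checked that the regression tests compare exactly this way, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or its test suite.
class LineDiffDemo {
    // Counts the indexes at which the two line lists differ,
    // comparing strictly position by position.
    static int countMismatches(List<String> expected, List<String> actual) {
        int max = Math.max(expected.size(), actual.size());
        int mismatches = 0;
        for (int i = 0; i < max; i++) {
            String e = i < expected.size() ? expected.get(i) : null;
            String a = i < actual.size() ? actual.get(i) : null;
            if (e == null || !e.equals(a)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        List<String> gold = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            gold.add("line " + i);
        }
        // Same text, but with one extra blank line at the top.
        List<String> extracted = new ArrayList<>(gold);
        extracted.add(0, "");
        System.out.println(countMismatches(gold, extracted)); // prints 101
    }
}
```

A single extra blank line makes every following line mismatch, even though the extracted text is otherwise identical.
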
I think that the text should be sorted by default. There doesn't
seem to be much point in extracting text from a PDF file if it isn't
in a readable order. A flag could exist to not sort, if people care
more about performance than readability.
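
As a rough sketch of what I mean by sorting by default with a flag to turn it off: the Fragment type, the extract method and the coordinate convention below (y grows down the page) are assumptions for illustration only, not PDFBox's actual classes or sort logic. A real implementation would also have to deal with rotation and with text on slightly different baselines.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a text extractor; not PDFBox code.
class SortedTextDemo {
    // A piece of text with the position of its start on the page.
    // Assumption: y is measured downward from the top of the page.
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins fragments with spaces. When sortByPosition is true the
    // fragments are ordered top to bottom, then left to right;
    // otherwise they stay in the order they were declared.
    static String extract(List<Fragment> fragments, boolean sortByPosition) {
        List<Fragment> ordered = new ArrayList<>(fragments);
        if (sortByPosition) {
            ordered.sort(Comparator
                    .comparingDouble((Fragment f) -> f.y)
                    .thenComparingDouble(f -> f.x));
        }
        StringBuilder sb = new StringBuilder();
        for (Fragment f : ordered) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> declared = new ArrayList<>();
        declared.add(new Fragment("world", 50f, 10f));
        declared.add(new Fragment("Title", 10f, 0f));
        declared.add(new Fragment("Hello", 10f, 10f));

        System.out.println(extract(declared, false)); // prints "world Title Hello"
        System.out.println(extract(declared, true));  // prints "Title Hello world"
    }
}
```

The unsorted path skips the sort entirely, so anyone who cares more about speed than reading order pays nothing extra.
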
If the regression tests are going to be used as a test for committing
the page rotation patches, I think we should first enable sorting by
default and regenerate the gold standard files in the regression
tests. Then we can more easily test the rotation patches.
brian