I have modified PDFBox to properly handle all of the text rotation issues that have been discussed, but then I tried the regression tests. There were lots of failures, and while debugging them I realized that the regression tests do not specify that the text should be written in sorted order (background: the regression tests extract text from a collection of PDF files and then compare the output with a "gold standard" text file). By default, PDFTextStripper does not sort the text by its location. It writes the text in the order it is declared in the PDF file, which can be any order. As a result, we really can't test the page rotation issues using the regression tests, because the rotation is taken into account only when sorting. The errors that I was getting were caused by extra or missing white space, and a single difference could cause hundreds of errors because everything after it was off by a line.
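To illustrate the difference between declared order and sorted order, here is a minimal, self-contained sketch. It does not use PDFBox at all: the Fragment class, the extract method, and the coordinates are hypothetical stand-ins made up for this example, and the top-to-bottom, left-to-right comparison is only an assumption about what "sorted by location" means, not the actual PDFTextStripper logic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortedExtractionSketch {

    /** Hypothetical positioned piece of text; NOT a PDFBox class. */
    static final class Fragment {
        final String text;
        final float x; // distance from the left edge of the page
        final float y; // distance from the top edge of the page

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Joins fragment text with single spaces. With sort == false the
     * fragments are emitted in declared order; with sort == true they are
     * first ordered top-to-bottom, then left-to-right (an assumed ordering).
     */
    static String extract(List<Fragment> declared, boolean sort) {
        List<Fragment> fragments = new ArrayList<>(declared);
        if (sort) {
            fragments.sort(
                Comparator.comparingDouble((Fragment f) -> f.y)
                          .thenComparingDouble(f -> f.x));
        }
        StringBuilder out = new StringBuilder();
        for (Fragment f : fragments) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(f.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragments listed in an arbitrary "declared" order.
        List<Fragment> declared = new ArrayList<>();
        declared.add(new Fragment("world", 60f, 10f));
        declared.add(new Fragment("second", 10f, 30f));
        declared.add(new Fragment("Hello", 10f, 10f));
        declared.add(new Fragment("line", 70f, 30f));

        System.out.println(extract(declared, false)); // world second Hello line
        System.out.println(extract(declared, true));  // Hello world second line
    }
}
```

A gold standard file generated from the unsorted output only records whatever order the PDF producer happened to use, so it cannot show whether position-dependent logic such as rotation handling is correct.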

I think that the text should be sorted by default. There doesn't seem to be much point in extracting text from a PDF file if it isn't in a readable order. A flag could turn sorting off for people who care more about performance than readability.

If the regression tests are going to be used as a test for committing the page rotation patches, I think we should first enable sorting by default and regenerate the gold standard files in the regression tests. Then we can more easily test the rotation patches.

brian
