Sorting text order

Brian Carrier Thu, 13 Nov 2008 09:59:18 -0800

I have modified PDFBox to properly handle all of the text rotation issues that have been discussed, but then I tried the regression tests. There were lots of failures and, while debugging them, I realized that the regression tests do not specify that the text should be written in sorted order (background: the regression tests extract text from a collection of PDF files and then compare the output with a "gold standard" text file). By default, PDFTextStripper does not sort the text by its location. It is written in the order it is declared in the PDF file (which can be any order). As a result, we really can't test the page rotation issues using the regression tests because the rotation is taken into account only when sorting. The errors that I was getting were caused by extra or missing white space, which could cause hundreds of errors because everything was off by a line.
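
To make the difference between declaration order and sorted order concrete, here is a rough, self-contained sketch. It does not use PDFBox at all: the Fragment class, the coordinates, and the render/sortByPosition helpers are made up for illustration, and the real sorting in PDFTextStripper (switched on with setSortByPosition(true)) is more involved than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortOrderSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox class. */
    static final class Fragment {
        final float x;      // distance from the left edge of the page
        final float y;      // distance from the top of the page (smaller = higher up)
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Writes fragments in the order given, starting a new line whenever y changes. */
    static String render(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        boolean first = true;
        float lastY = 0f;
        for (Fragment f : fragments) {
            if (!first) {
                out.append(f.y == lastY ? " " : "\n");
            }
            out.append(f.text);
            lastY = f.y;
            first = false;
        }
        return out.toString();
    }

    /** Returns a copy sorted top to bottom, then left to right. */
    static List<Fragment> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted;
    }

    /** Made-up fragments in an arbitrary "declared" order, as a PDF might store them. */
    static List<Fragment> sampleDeclaredOrder() {
        return Arrays.asList(
                new Fragment(60f, 10f, "world"),
                new Fragment(10f, 30f, "second line"),
                new Fragment(10f, 10f, "Hello"));
    }

    public static void main(String[] args) {
        List<Fragment> declared = sampleDeclaredOrder();
        System.out.println("Declared order:");
        System.out.println(render(declared));
        System.out.println("Sorted order:");
        System.out.println(render(sortByPosition(declared)));
    }
}
```

In this toy case the unsorted output has three lines and the sorted output has two, which is the kind of line-count shift that makes a line-by-line comparison against a gold standard file fail everywhere after the first difference.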

I think that the text should be sorted by default. There doesn't seem to be much point in extracting text from a PDF file if it isn't in a readable order. A flag could exist to not sort, if people care more about performance than readability.

If the regression tests are going to be used as a test for committing the page rotation patches, I think we should first enable sorting by default and regenerate the gold standard files in the regression tests. Then we can more easily test the rotation patches.


brian
