I would like to make a case against sorting by default:

1. PDF's primary mechanism for providing the reading order is "Logical Structure" (or "Tagged PDF"). With this mechanism a PDF author can express the explicit and correct reading order. Your sorting will always be an approximation. Without the structure information you will, for example, have trouble separating headers/footers from body text. Of course, not all PDFs have tagged content.
2. You're changing the default behaviour of PDFBox, which might be seen as a backwards compatibility issue.
3. You cannot be sure that your sorting will in all cases be better than the original order in which the text is painted inside the content streams.
4. You don't need to fix the regression tests, you only have to write new ones. ;-)

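To make points 2 and 3 concrete, here is a minimal sketch of opt-in sorting. It is not PDFBox code: SortSketch, Chunk and extract are hypothetical names, and the top/left coordinates (measured from the top-left corner of the page) are an assumed convention chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not PDFBox code: every name here is invented for illustration.
final class SortSketch {

    // A piece of text plus the position where it was painted on the page.
    static final class Chunk {
        final String text;
        final double top;   // distance from the top edge of the page (assumed convention)
        final double left;  // distance from the left edge of the page

        Chunk(String text, double top, double left) {
            this.text = text;
            this.top = top;
            this.left = left;
        }
    }

    // Joins the chunk texts with single spaces. With sortByPosition == false the
    // content-stream order is kept; with true, chunks are ordered top to bottom,
    // then left to right.
    static String extract(List<Chunk> streamOrder, boolean sortByPosition) {
        List<Chunk> chunks = new ArrayList<>(streamOrder); // leave the caller's list untouched
        if (sortByPosition) {
            chunks.sort(Comparator.comparingDouble((Chunk c) -> c.top)
                    .thenComparingDouble(c -> c.left));
        }
        StringBuilder out = new StringBuilder();
        for (Chunk c : chunks) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(c.text);
        }
        return out.toString();
    }

    // Existing callers keep the unsorted behaviour; sorting has to be requested.
    static String extract(List<Chunk> streamOrder) {
        return extract(streamOrder, false);
    }
}
```

Take a two-column page painted column by column: A1 (top 10, left 0), A2 (top 20, left 0), B1 (top 10, left 100), B2 (top 20, left 100). In this sketch, extract(chunks) returns "A1 A2 B1 B2", which is the intended reading order, while extract(chunks, true) returns "A1 B1 A2 B2", which interleaves the two columns. A naive positional sort is not always an improvement, which is why the flag defaults to off here.
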
That said, I believe that an optional sorting functionality is an interesting addition. But IMO it shouldn't be enabled by default.

On 13.11.2008 18:58:46 Brian Carrier wrote:
> I have modified PDFBox to properly handle all of the text rotation
> issues that have been discussed, but then I tried the regression
> tests. There were lots of failures and while debugging them, I
> realized that the regression tests do not specify that the text
> should be written in sorted order (background: the regression tests
> extract text from a collection of PDF files and then compare the
> output with a "gold standard" text file). By default,
> PDFTextStripper does not sort the text by its location. It is written
> in the order it is declared in the PDF file (which can be any order).
> As a result, we really can't test the page rotation issues using the
> regression tests because the rotation is taken into account only
> when sorting. The errors that I was getting were caused by extra or
> missing white space, which could cause hundreds of errors because
> everything was off by a line.
>
> I think that the text should be sorted by default. There doesn't
> seem to be much point in extracting text from a PDF file if it isn't
> in a readable order. A flag could exist to not sort, if people care
> more about performance than readability.
>
> If the regression tests are going to be used as a test for committing
> the page rotation patches, I think we should first enable sorting by
> default and regenerate the gold standard files in the regression
> tests. Then we can more easily test the rotation patches.
>
> brian

Jeremias Maerki
