Hi Brian

On 14.11.2008 15:51:56 Brian Carrier wrote:
> Hi Jeremias,
>
>
> On Nov 13, 2008, at 2:41 PM, Jeremias Maerki wrote:
>
> > I would like to make a case against sorting by default:
> > 1. PDF's primary mechanism to provide the reading order is "Logical
> > Structure" (or "Tagged PDF"). With this mechanism a PDF author can
> > express the explicit and correct reading order. Your sorting will
> > always be an approximation. Without the structure information, you
> > will have problems to separate headers/footers from body text, for
> > example. Of course, not all PDFs have tagged content.
>
> Maybe I'm missing something, but I don't see any references to these
> things in PDFStreamEngine and PDFTextStripper (context: I'm only
> using PDFBox to get a plain text file from a PDF file).
> PDFStreamEngine simply creates TextPosition objects at a given
> coordinate and adds them to a list. PDFTextStripper, by default, pops
> text off the list in the order they exist in the content stream and
> prints them (and adds spaces if two consecutive text chunks are
> sufficiently far away). If you enable sorting, then PDFTextStripper
> sorts the list based on the coordinates before it prints them and
> adds spaces.
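
To make the difference between content-stream order and position-sorted order concrete, here is a minimal Java sketch. It does not use PDFBox at all: the Chunk class and both helper methods are hypothetical names invented for illustration, and the comparator is a plain top-to-bottom, left-to-right sort, which is only an assumption about the general idea and not a copy of TextPositionComparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {
    // Hypothetical stand-in for a positioned text chunk (not PDFBox's TextPosition).
    static final class Chunk {
        final String text;
        final double x;        // distance from the left edge of the page
        final double yFromTop; // distance from the top edge of the page

        Chunk(String text, double x, double yFromTop) {
            this.text = text;
            this.x = x;
            this.yFromTop = yFromTop;
        }
    }

    // Join chunks in the order given, i.e. the order they were painted.
    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    // Return a copy sorted top-to-bottom, then left-to-right.
    // A real implementation would also need a baseline tolerance and rotation handling.
    static List<Chunk> sortByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.yFromTop)
                .thenComparingDouble(c -> c.x));
        return sorted;
    }

    public static void main(String[] args) {
        // The footer is painted first, as in the whats_new.pdf example.
        List<Chunk> stream = Arrays.asList(
                new Chunk("Page 1", 280, 760),
                new Chunk("Title", 72, 72),
                new Chunk("Body", 72, 100));
        System.out.println(join(stream));                 // Page 1 Title Body
        System.out.println(join(sortByPosition(stream))); // Title Body Page 1
    }
}
```

Note that the sorted output still treats the footer as ordinary text at the end of the page, which is the kind of approximation that only structure information could resolve.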

I didn't want to mislead you. I have no clue about how the text
extraction in PDFBox works. I've looked at this simply from the
perspective of the PDF specification. It can very well be that PDFBox
doesn't support marked content and document structure yet.

> My reading of the PDF spec (and experience with some files I've been
> debugging with) is that the stream can contain the text and objects
> on a page in any order that it wants. As an example, the
> whats_new.pdf file in the regression tests stores the page numbers as
> one of the first text chunks, even though the page numbers are
> located at the bottom of the page. Right now, I think PDFBox is
> ignoring a lot of the structure that PDF files provide (at least for
> text extraction).

Looks like it.

> > 2. You're changing the default behaviour of PDFBox which might be seen
> > as a backwards compatibility issue.
>
> True.
>
> > 3. You cannot be sure that in all cases your sorting will be better
> > than the original order in which the text is painted inside the
> > content streams.
>
> But the order of the text stored in the PDF file can be any order and
> does not need any type of logic associated with it. I think that
> some order is better than arbitrary order.

Not necessarily. Look at the recent thread by "Duseja, Sushil" on
pdfbox-users. He might actually profit from not ordering when he tries
to extract values from the PDF file. I'm not saying that one or the
other is more correct. It really depends on the situation. But what I'm
saying is that a sorting algorithm might improve things in a large
number of scenarios but it can also be wrong. The most reliable
mechanism would be to work with the optional document structure feature
of PDF.

> > 4. You don't need to fix the regression tests, you only have to write
> > new ones. ;-)
>
> Actually, I've been able to decrease the number of regression test
> errors.
> The text extraction regression tests compare a line of text
> from a gold standard to a line of text from the test. There isn't a
> diff-like comparison, so any difference in newlines causes every
> following line to be an error, which creates thousands of errors. It
> turned out that a few tweaks got rid of most of them.

OK, cool.

> > That said, I believe that an optional sorting functionality is an
> > interesting addition. But IMO it shouldn't be enabled by default.
>
> To be clear, the sorting functionality already exists. The logic is
> in TextPositionComparator and is enabled by
> PDFTextStripper.setSortByPosition().

See? That uncovers my ignorance about many of PDFBox's features. ;-)
I'm just a project mentor after all, not a committer. Thanks for
bearing with me.

> thanks,
> brian
>
>
> > On 13.11.2008 18:58:46 Brian Carrier wrote:
> >> I have modified PDFBox to properly handle all of the text rotation
> >> issues that have been discussed, but then I tried the regression
> >> tests. There were lots of failures and while debugging them, I
> >> realized that the regression tests do not specify that the text
> >> should be written in sorted order (background: the regression tests
> >> extract text from a collection of PDF files and then compare the
> >> output with a "gold standard" text file). By default,
> >> PDFTextStripper does not sort the text by its location. It is written
> >> in the order it is declared in the PDF file (which can be any order).
> >> As a result, we really can't test the page rotation issues using the
> >> regression tests because the rotation is taken into account only
> >> when sorting. The errors that I was getting were caused by extra or
> >> missing white space, which could cause hundreds of errors because
> >> everything was off by a line.
> >>
> >> I think that the text should be sorted by default. There doesn't
> >> seem to be much point in extracting text from a PDF file if it isn't
> >> in a readable order.
> >> A flag could exist to not sort, if people care
> >> more about performance than readability.
> >>
> >> If the regression tests are going to be used as a test for committing
> >> the page rotation patches, I think we should first enable sorting by
> >> default and regenerate the gold standard files in the regression
> >> tests. Then we can more easily test the rotation patches.
> >>
> >> brian
> >
> >
> > Jeremias Maerki


Jeremias Maerki
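
On the regression-test point above (line-by-line comparison without a diff), a small Java sketch may help show why one stray newline can turn into a long list of errors. This is not the PDFBox test code: the class and method names are hypothetical, and the diff-like count uses a textbook longest-common-subsequence table purely as an illustrative assumption of what "diff-like" could mean.

```java
import java.util.Arrays;
import java.util.List;

class LineCompareSketch {
    // Compare line i of the gold standard with line i of the output.
    // Once the two files drift apart by one line, every later line mismatches.
    static int positionalMismatches(List<String> expected, List<String> actual) {
        int n = Math.max(expected.size(), actual.size());
        int errors = 0;
        for (int i = 0; i < n; i++) {
            String e = i < expected.size() ? expected.get(i) : null;
            String a = i < actual.size() ? actual.get(i) : null;
            if (e == null || !e.equals(a)) {
                errors++;
            }
        }
        return errors;
    }

    // Diff-like measure: number of lines that must be inserted or deleted,
    // computed from the length of the longest common subsequence of lines.
    static int diffSize(List<String> expected, List<String> actual) {
        int[][] lcs = new int[expected.size() + 1][actual.size() + 1];
        for (int i = 1; i <= expected.size(); i++) {
            for (int j = 1; j <= actual.size(); j++) {
                if (expected.get(i - 1).equals(actual.get(j - 1))) {
                    lcs[i][j] = lcs[i - 1][j - 1] + 1;
                } else {
                    lcs[i][j] = Math.max(lcs[i - 1][j], lcs[i][j - 1]);
                }
            }
        }
        int common = lcs[expected.size()][actual.size()];
        return expected.size() + actual.size() - 2 * common;
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("Title", "Line one", "Line two", "Line three");
        // Same text, but with one extra blank line after the title.
        List<String> out = Arrays.asList("Title", "", "Line one", "Line two", "Line three");
        System.out.println(positionalMismatches(gold, out)); // 4
        System.out.println(diffSize(gold, out));             // 1
    }
}
```

With only four lines of gold-standard text, a single extra blank line already produces four positional errors but a diff size of one; on real extracted documents the same effect scales to the hundreds or thousands of errors described in the thread.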
