Paul,

Could you explain why you use "stripper.setSortByPosition(true);" and what it actually does?
When I use "stripper.setSortByPosition(true);", I get the following error:

Exception in thread "main" java.lang.IllegalArgumentException: Comparison method violates its general contract!
        at java.util.TimSort.mergeLo(TimSort.java:747)
        at java.util.TimSort.mergeAt(TimSort.java:483)
        at java.util.TimSort.mergeCollapse(TimSort.java:408)
        at java.util.TimSort.sort(TimSort.java:214)
        at java.util.TimSort.sort(TimSort.java:173)
        at java.util.Arrays.sort(Arrays.java:659)
        at java.util.Collections.sort(Collections.java:217)
        at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:565)
        at org.apache.pdfbox.util.PDFTextStripperByArea.writePage(PDFTextStripperByArea.java:190)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:457)
        at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)

Do you know why?

PS: The PDF file I use is attached.

On Tue, Apr 29, 2014 at 9:00 PM, Paul Monday <[email protected]> wrote:

> It's not really PDFBox that mixed the main content up. It's just a basic
> algorithm for extracting text. You run into this quite often when
> interpreting PDF files. I've been playing with this all week so I actually
> have some code.
>
> There are two things you can try. You could get the rectangle that the
> cropbox defines and have the text stripper attempt to sort by position.
> Depending on how your headers and footers were inserted, this may sort it
> out.
> Here is where I did that on a per page basis:
>
>         for (PDPage page : pages) {
>             PDRectangle pdr = page.getCropBox();
>             Rectangle rec = new Rectangle();
>             rec.setBounds(
>                     Math.round(pdr.getLowerLeftX()),
>                     Math.round(pdr.getLowerLeftY()),
>                     Math.round(pdr.getWidth()),
>                     Math.round(pdr.getHeight()));
>             System.out.println("Cropbox: " + rec);
>             PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>             stripper.addRegion("cropbox", rec);
>             stripper.setSortByPosition(true);
>             stripper.extractRegions(page);
>             List<String> regions = stripper.getRegions();
>             for (String region : regions) {
>                 String text = stripper.getTextForRegion(region);
>             }
>         }
>
> This may sort your strings in the order you want.
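
For reference, here is a minimal sketch of what "sort by position" means conceptually: text fragments arrive in content-stream order and are re-ordered by their coordinates before being joined. This is not PDFBox code. The Chunk class, the READING_ORDER comparator, and the sortByPosition method are hypothetical names made up for illustration, and the sketch assumes y is measured downward from the top of the page. The comparator is written as a strict total order because Collections.sort may throw "Comparison method violates its general contract!" when it detects an inconsistent comparator. Whether that is what happens inside PDFBox in the trace above is not established by this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** Hypothetical stand-in for a positioned text fragment; not a PDFBox class. */
    static final class Chunk {
        final float x;
        final float y;
        final String text;

        Chunk(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Strict total order: smaller y first (assumed to be nearer the top of
     * the page), then left to right. Exact comparisons keep the ordering
     * transitive, which is what the sort contract requires.
     */
    static final Comparator<Chunk> READING_ORDER = new Comparator<Chunk>() {
        @Override
        public int compare(Chunk a, Chunk b) {
            int byY = Float.compare(a.y, b.y);
            return byY != 0 ? byY : Float.compare(a.x, b.x);
        }
    };

    /** Sorts a copy of the chunks into reading order and joins their text with spaces. */
    static String sortByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<Chunk>(chunks);
        Collections.sort(sorted, READING_ORDER);
        StringBuilder out = new StringBuilder();
        for (Chunk c : sorted) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(c.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Content-stream order: footer drawn first, header drawn third.
        List<Chunk> streamOrder = Arrays.asList(
                new Chunk(50f, 700f, "footer"),
                new Chunk(120f, 100f, "text"),
                new Chunk(50f, 20f, "header"),
                new Chunk(50f, 100f, "body"));
        System.out.println(sortByPosition(streamOrder)); // prints "header body text footer"
    }
}
```

Without the sort, the fragments would come out in the order they were drawn ("footer text header body"), which is the kind of mix-up described in the quoted message. If the exception comes from a comparator you cannot change, the JDK system property -Djava.util.Arrays.useLegacyMergeSort=true restores the pre-Java 7 merge sort, which does not perform this check. That only hides the inconsistency and does not fix the ordering.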

