On Apr 29, 2014, at 8:59 AM, Qingchao Kong <[email protected]> wrote:
> Paul,
> Could you explain to me why you use "stripper.setSortByPosition(true);"
> and what it actually does?

I copied this from the JavaDoc for you:

The order of the text tokens in a PDF file may not be in the same as they appear visually on the screen. For example, a PDF writer may write out all text by font, so all bold or larger text, then make a second pass and write out the normal text. The default is to not sort by position. A PDF writer could choose to write each character in a different order. By default PDFBox does not sort the text tokens before processing them due to performance reasons.

> When I use "stripper.setSortByPosition(true);", I get the following error:
> Exception in thread "main" java.lang.IllegalArgumentException:
> Comparison method violates its general contract!
>     at java.util.TimSort.mergeLo(TimSort.java:747)
>     at java.util.TimSort.mergeAt(TimSort.java:483)
>     at java.util.TimSort.mergeCollapse(TimSort.java:408)
>     at java.util.TimSort.sort(TimSort.java:214)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:565)
>     at org.apache.pdfbox.util.PDFTextStripperByArea.writePage(PDFTextStripperByArea.java:190)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:457)
>     at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)
>
> Do you know why?

I don't know why you would get that. Perhaps you have a different version of PDFBox than I'm using. I don't have time to debug your PDF, and I can't tell from the stack trace what your program is doing. If you went with the manual rectangle setup, you may not have adapted the code I gave you to your particular cropbox size, or perhaps your PDF is unusual. Perhaps try using the mediabox or bleed box dimensions instead. I am rather new to this approach as well.

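One general note on the exception itself: the TimSort used by Java 7 and later throws "Comparison method violates its general contract!" when it notices a Comparator giving inconsistent answers, for example a > b and b > c but a < c. I can't say that this is what happens inside PDFBox for your file. As a minimal sketch in plain Java (no PDFBox involved; Token, BY_POSITION and LINE_TOLERANCE are names I made up for illustration, not PDFBox's code), here is how a position comparator that treats nearby y values as "the same line" can end up inconsistent:

```java
import java.util.Comparator;

class PositionCompareSketch {

    /** Hypothetical stand-in for a piece of text with a position on the page. */
    static final class Token {
        final String text;
        final float x;
        final float y;

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Assumed tolerance: tokens whose y values differ by less than this count as one line. */
    static final float LINE_TOLERANCE = 1.0f;

    /** Orders by y, but by x when two tokens are "on the same line". Illustration only. */
    static final Comparator<Token> BY_POSITION = new Comparator<Token>() {
        @Override
        public int compare(Token a, Token b) {
            if (Math.abs(a.y - b.y) < LINE_TOLERANCE) {
                return Float.compare(a.x, b.x); // same line: order by x
            }
            return Float.compare(a.y, b.y);     // different lines: order by y
        }
    };

    public static void main(String[] args) {
        // a and b are within the tolerance, and so are b and c, but a and c are not.
        Token a = new Token("a", 30f, 0.0f);
        Token b = new Token("b", 20f, 0.8f);
        Token c = new Token("c", 10f, 1.6f);

        // a sorts after b and b sorts after c, yet a sorts before c.
        System.out.println("a after b:  " + (BY_POSITION.compare(a, b) > 0));
        System.out.println("b after c:  " + (BY_POSITION.compare(b, c) > 0));
        System.out.println("a before c: " + (BY_POSITION.compare(a, c) < 0));
    }
}
```

If something like this is the cause, starting the JVM with -Djava.util.Arrays.useLegacyMergeSort=true makes it fall back to the older merge sort, which does not perform this check. That only hides the symptom, and the resulting order may still not be what you expect.
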
> PS: The PDF file I use is attached.
>
>
> On Tue, Apr 29, 2014 at 9:00 PM, Paul Monday <[email protected]>
> wrote:
>> It's not really PDFBox that mixed the main content up. It's just a basic
>> algorithm for extracting text. You run into this quite often when
>> interpreting PDF files. I've been playing with this all week so I actually
>> have some code.
>>
>> There are two things you can try. You could get the rectangle that the
>> cropbox defines and have the text stripper attempt to sort by position.
>> Depending on how your headers and footers were inserted, this may sort it
>> out. Here is where I did that on a per page basis:
>>
>> for (PDPage page : pages) {
>>     PDRectangle pdr = page.getCropBox();
>>     Rectangle rec = new Rectangle();
>>     rec.setBounds(
>>         Math.round(pdr.getLowerLeftX())
>>         , Math.round(pdr.getLowerLeftY())
>>         , Math.round(pdr.getWidth())
>>         , Math.round(pdr.getHeight()));
>>     System.out.println("Cropbox: " + rec);
>>     PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>>     stripper.addRegion("cropbox", rec);
>>     stripper.setSortByPosition(true);
>>     stripper.extractRegions(page);
>>     List<String> regions = stripper.getRegions();
>>     for (String region : regions) {
>>         String text = stripper.getTextForRegion(region);
>>
>> This may sort your strings in the order you want.

Paul Monday
[email protected]

