Re: How to define regions in PDFTextStripperByArea?

Qingchao Kong Tue, 29 Apr 2014 08:00:33 -0700

Paul,
Could you explain why you use "stripper.setSortByPosition(true);"
and what it actually does?

When I call "stripper.setSortByPosition(true);", I get the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeLo(TimSort.java:747)
    at java.util.TimSort.mergeAt(TimSort.java:483)
    at java.util.TimSort.mergeCollapse(TimSort.java:408)
    at java.util.TimSort.sort(TimSort.java:214)
    at java.util.TimSort.sort(TimSort.java:173)
    at java.util.Arrays.sort(Arrays.java:659)
    at java.util.Collections.sort(Collections.java:217)
    at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:565)
    at org.apache.pdfbox.util.PDFTextStripperByArea.writePage(PDFTextStripperByArea.java:190)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:457)
    at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)

Do you know why?
PS: The PDF file I used is attached.

On Tue, Apr 29, 2014 at 9:00 PM, Paul Monday <[email protected]> wrote:
> It's not really PDFBox that mixed the main content up.  It's just a basic 
> algorithm for extracting text.  You run into this quite often when 
> interpreting PDF files.  I've been playing with this all week so I actually 
> have some code.
>
> There are two things you can try.  You could get the rectangle that the 
> cropbox defines and have the text stripper attempt to sort by position.  
> Depending on how your headers and footers were inserted, this may sort it 
> out.  Here is where I did that on a per page basis:
>
>                 for (PDPage page : pages) {
>                         PDRectangle pdr = page.getCropBox();
>                         Rectangle rec = new Rectangle();
>                         rec.setBounds(
>                                         Math.round(pdr.getLowerLeftX())
>                                         , Math.round(pdr.getLowerLeftY())
>                                         , Math.round(pdr.getWidth())
>                                         , Math.round(pdr.getHeight()));
>                         System.out.println("Cropbox: " + rec);
>                         PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>                         stripper.addRegion("cropbox", rec);
>                         stripper.setSortByPosition(true);
>                         stripper.extractRegions(page);
>                         List<String> regions = stripper.getRegions();
>                         for (String region : regions) {
>                                 String text = stripper.getTextForRegion(region);
>
> This may sort your strings in the order you want.
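
For anyone following the thread, here is a rough sketch of what "sort by
position" means in general: order the text fragments top to bottom, then
left to right, instead of in the order they appear in the content stream.
This is NOT PDFBox's implementation. The Fragment class, the assumption that
y grows downward, and all method names are made up for illustration. The
sketch also shows one general way a comparator can break the contract that
TimSort checks (a "same line" tolerance makes the ordering non-transitive).
Whether that is what triggers the exception above is only an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** Hypothetical stand-in for a piece of text with page coordinates (not a PDFBox class). */
    static final class Fragment {
        final float x;
        final float y;
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Reading order, assuming y grows downward: smaller y first, then smaller x. */
    static final Comparator<Fragment> READING_ORDER =
            Comparator.<Fragment>comparingDouble(f -> f.y).thenComparingDouble(f -> f.x);

    /**
     * A comparator that treats fragments within `tolerance` of each other
     * vertically as being on the same line. It looks reasonable but is not
     * transitive, so it can violate the Comparator contract.
     */
    static Comparator<Fragment> fuzzyLineOrder(float tolerance) {
        return (a, b) -> {
            if (Math.abs(a.y - b.y) <= tolerance) {
                return Float.compare(a.x, b.x);
            }
            return Float.compare(a.y, b.y);
        };
    }

    /** True if a > b and b > c but a is not greater than c under cmp. */
    static boolean hasCycle(Comparator<Fragment> cmp, Fragment a, Fragment b, Fragment c) {
        return cmp.compare(a, b) > 0 && cmp.compare(b, c) > 0 && cmp.compare(a, c) <= 0;
    }

    /** Returns the fragment texts in reading order without modifying the input list. */
    static List<String> sortedText(List<Fragment> fragments) {
        List<Fragment> copy = new ArrayList<>(fragments);
        copy.sort(READING_ORDER);
        List<String> out = new ArrayList<>();
        for (Fragment f : copy) {
            out.add(f.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Fragments listed in an arbitrary "content stream" order.
        List<Fragment> fragments = Arrays.asList(
                new Fragment(50f, 700f, "footer"),
                new Fragment(100f, 100f, "World"),
                new Fragment(50f, 100f, "Hello"),
                new Fragment(50f, 20f, "header"));
        System.out.println(sortedText(fragments)); // [header, Hello, World, footer]

        // Three fragments on slightly different baselines, each further left than the last.
        Fragment a = new Fragment(30f, 0f, "a");
        Fragment b = new Fragment(20f, 1f, "b");
        Fragment c = new Fragment(10f, 2f, "c");
        System.out.println(hasCycle(fuzzyLineOrder(1.0f), a, b, c)); // true
        System.out.println(hasCycle(READING_ORDER, a, b, c));        // false
    }
}
```

With the fuzzy comparator, a sorts after b and b sorts after c, yet a sorts
before c. The sort introduced in Java 7 can detect that kind of inconsistency
and throw the IllegalArgumentException shown in the stack trace. Starting the
JVM with -Djava.util.Arrays.useLegacyMergeSort=true switches back to the older
merge sort, which does not perform that check. It hides the symptom rather
than fixing the comparator.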
