Re: How to define regions in PDFTextStripperByArea?

Paul Monday Tue, 29 Apr 2014 08:31:51 -0700

On Apr 29, 2014, at 8:59 AM, Qingchao Kong <[email protected]> wrote:


> Paul,
> Could you explain to me why you use "stripper.setSortByPosition(true);"
> and what it actually does?
I copied this from the JavaDoc for you:

The order of the text tokens in a PDF file may not be the same as the order they 
appear visually on the screen. For example, a PDF writer may write out all text 
by font, so all bold or larger text, then make a second pass and write out the 
normal text.

The default is to not sort by position.

A PDF writer could choose to write each character in a different order. By 
default PDFBox does not sort the text tokens before processing them due to 
performance reasons.
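
To make that concrete, here is a minimal sketch of the idea in plain Java. It 
does not use PDFBox at all: the Token class, its coordinates, and the sample 
words are hypothetical stand-ins, and the real stripper's ordering is more 
involved than a simple top-to-bottom, left-to-right sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {
    /** Hypothetical stand-in for a text token with a page position (not a PDFBox class). */
    public static final class Token {
        final String text;
        final int x;   // distance from the left edge
        final int y;   // distance from the top edge
        Token(String text, int x, int y) { this.text = text; this.x = x; this.y = y; }
    }

    /** Joins tokens in the order given, i.e. the order they appear in the file. */
    public static String join(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t.text);
        }
        return sb.toString();
    }

    /** Returns a copy sorted top-to-bottom, then left-to-right. */
    public static List<Token> sortByPosition(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingInt((Token t) -> t.y).thenComparingInt(t -> t.x));
        return sorted;
    }

    /** Sample stream from a writer that emitted the bold words first. */
    public static List<Token> sample() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("Title", 10, 10));    // bold, first line
        tokens.add(new Token("Summary", 10, 50));  // bold, third line
        tokens.add(new Token("body", 10, 30));     // normal, second line
        tokens.add(new Token("text", 40, 30));     // normal, second line
        return tokens;
    }

    public static void main(String[] args) {
        List<Token> tokens = sample();
        System.out.println(join(tokens));                 // file order
        System.out.println(join(sortByPosition(tokens))); // visual order
    }
}
```

Without sorting, the two bold words come out together ahead of the body text; 
with sorting, the words follow the visual reading order.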


> When I use "stripper.setSortByPosition(true);", I got the following errors:
> Exception in thread "main" java.lang.IllegalArgumentException:
> Comparison method violates its general contract!
> at java.util.TimSort.mergeLo(TimSort.java:747)
> at java.util.TimSort.mergeAt(TimSort.java:483)
> at java.util.TimSort.mergeCollapse(TimSort.java:408)
> at java.util.TimSort.sort(TimSort.java:214)
> at java.util.TimSort.sort(TimSort.java:173)
> at java.util.Arrays.sort(Arrays.java:659)
> at java.util.Collections.sort(Collections.java:217)
> at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:565)
> at org.apache.pdfbox.util.PDFTextStripperByArea.writePage(PDFTextStripperByArea.java:190)
> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:457)
> at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)
> 
> Do you know why?

I don't know why you would get that.  Perhaps you have a different version of 
PDFBox than I'm using.  I don't have time to debug your PDF and I'm not sure 
what your program is doing from the stack trace.  You may not have adapted the 
code I gave you to your particular cropbox size if you went with the manual 
rectangle setup, or perhaps your PDF is funny.  Try using the mediabox or bleed 
box dimensions perhaps.
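
One general note on the message itself: since Java 7, Collections.sort uses 
TimSort, which throws this IllegalArgumentException when it notices that a 
comparator gives inconsistent answers. Comparators that treat "close enough" 
values as equal are a common way to end up there. Whether that is what 
happens inside your PDFBox version is an assumption on my part. The sketch 
below uses a made-up tolerance comparator (FUZZY) and a made-up helper 
(violatesTransitivity), neither of which is PDFBox code, to show the kind of 
inconsistency involved.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ComparatorContractSketch {
    /** Hypothetical comparator: values within 2 units of each other count as equal. */
    public static final Comparator<Integer> FUZZY = (a, b) ->
            Math.abs(a - b) <= 2 ? 0 : Integer.compare(a, b);

    /** Returns true if some triple has a == b and b == c but a != c under cmp. */
    public static <T> boolean violatesTransitivity(Comparator<T> cmp, List<T> items) {
        for (T a : items) {
            for (T b : items) {
                for (T c : items) {
                    if (cmp.compare(a, b) == 0 && cmp.compare(b, c) == 0
                            && cmp.compare(a, c) != 0) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Integer> ys = Arrays.asList(0, 2, 4);
        // 0 "equals" 2 and 2 "equals" 4, yet 0 is less than 4: inconsistent.
        System.out.println(violatesTransitivity(FUZZY, ys));
        // The natural ordering has no such triple.
        System.out.println(violatesTransitivity(Comparator.<Integer>naturalOrder(), ys));
    }
}
```

If the comparator cannot be changed, a commonly used workaround is to start 
the JVM with -Djava.util.Arrays.useLegacyMergeSort=true, which restores the 
older merge sort that does not perform this check.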

I am rather new to this approach as well.

> PS: The PDF file I use is attached.
> 
> 
> On Tue, Apr 29, 2014 at 9:00 PM, Paul Monday <[email protected]> 
> wrote:
>> It's not really PDFBox that mixed the main content up.  It's just a basic 
>> algorithm for extracting text.  You run into this quite often when 
>> interpreting PDF files.  I've been playing with this all week so I actually 
>> have some code.
>> 
>> There are two things you can try.  You could get the rectangle that the 
>> cropbox defines and have the text stripper attempt to sort by position.  
>> Depending on how your headers and footers were inserted, this may sort it 
>> out.  Here is where I did that on a per page basis:
>> 
>>                for (PDPage page : pages) {
>>                        // Build an AWT rectangle covering the page's crop box.
>>                        PDRectangle pdr = page.getCropBox();
>>                        Rectangle rec = new Rectangle();
>>                        rec.setBounds(
>>                                        Math.round(pdr.getLowerLeftX())
>>                                        , Math.round(pdr.getLowerLeftY())
>>                                        , Math.round(pdr.getWidth())
>>                                        , Math.round(pdr.getHeight()));
>>                        System.out.println("Cropbox: " + rec);
>>                        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>>                        stripper.addRegion("cropbox", rec);
>>                        // Sort tokens by page position rather than file order.
>>                        stripper.setSortByPosition(true);
>>                        stripper.extractRegions(page);
>>                        List<String> regions = stripper.getRegions();
>>                        for (String region : regions) {
>>                                String text = stripper.getTextForRegion(region);
>>                                // use text as needed
>>                        }
>>                }
>> 
>> This may sort your strings in the order you want.

Paul Monday
[email protected]
