Re: How to define regions in PDFTextStripperByArea?

Paul Monday Wed, 30 Apr 2014 07:21:06 -0700

On Apr 30, 2014, at 12:57 AM, Qingchao Kong <[email protected]> wrote:


> Paul,
>> 
>>                int width = 612;
>>                int height = 792;
>> 
>>                int hX = 320, tX = 340, cX = 100;
>>                int hY = 0, tY = 580, cY = 200;
>>                int hW = width - hX, tW = width - tX, cW = 100;
>>                int hH = 80, tH = height - tY, cH = 60;
>> 
>>                Rectangle header = new Rectangle();
>>                header.setBounds(hX, hY, hW, hH);
>>                Rectangle totals = new Rectangle();
>>                totals.setBounds(tX, tY, tW, tH);
>>                Rectangle customer = new Rectangle();
>>                customer.setBounds(cX, cY, cW, cH);
>> 
>>                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>>                stripper.addRegion("header", header);
>>                stripper.addRegion("totals", totals);
>>                stripper.addRegion("customer", customer);
>>                stripper.setSortByPosition(true);
>> 
> 
> So it means that you have set the bounds empirically, like header,
> totals and customer, is that correct? The problem is PDF files may be
> of various sizes and you only know the header/footer are at the
> front/end of a PDF page, you would never know the exact locations.

The document that I'm looking at puts a lot of the information in drawn 
rectangles, so I was able to look at the rectangles drawn in the document, 
study where they are, and then determine the boundaries I wanted.  I don't 
know if that works for the document you are looking at, but to get all of 
the existing rectangles on a page:

a) get the tokens on the page
b) for each token that is an "re" operator
b1) get the previous 4 tokens (token location - 4 is x, - 3 is y, - 2 is w, 
- 1 is h)
b2) store the rectangle (I actually wrote a routine to see if the rectangle 
was a part of another rectangle or intersected one; if it intersects, I store 
the union of the two rectangles and remove the two originals)
c) then I wrote a comparator so I could easily sort rectangles by the y 
coordinate
d) stare at the output, compare it to the page, and determine your regions

That only works if your PDF is drawn with rectangles though.
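
Steps a) through c) can be sketched roughly as below.  This is only an 
illustration and not the routine I actually use: the class and method names 
(RectangleFinder, collect, addMerged) are made up, and the token list is a 
plain List<Object> of numbers and operator names standing in for whatever 
your PDFBox version's stream parser returns.  It assumes the four operands 
sit directly before each "re", and it folds any rectangle that intersects or 
contains another into their union.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: names and token representation are assumptions.
class RectangleFinder {

    // Steps a, b, b1: walk the tokens; at each "re" the four preceding
    // tokens are x, y, w, h.
    static List<Rectangle2D> collect(List<Object> tokens) {
        List<Rectangle2D> found = new ArrayList<>();
        for (int i = 4; i < tokens.size(); i++) {
            if (!"re".equals(tokens.get(i))) {
                continue;
            }
            double[] v = new double[4];
            boolean numeric = true;
            for (int k = 0; k < 4; k++) {
                Object t = tokens.get(i - 4 + k);
                if (t instanceof Number) {
                    v[k] = ((Number) t).doubleValue();
                } else {
                    numeric = false;
                }
            }
            if (!numeric) {
                continue; // malformed operands, skip this "re"
            }
            // Width and height may be negative in a PDF; normalize them.
            double x = Math.min(v[0], v[0] + v[2]);
            double y = Math.min(v[1], v[1] + v[3]);
            addMerged(found, new Rectangle2D.Double(
                    x, y, Math.abs(v[2]), Math.abs(v[3])));
        }
        // Step c: sort by y. In default PDF user space the origin is at the
        // bottom-left, so ascending y lists the lowest rectangle first.
        found.sort(Comparator.comparingDouble(Rectangle2D::getY));
        return found;
    }

    // Step b2: replace every stored rectangle that overlaps r with the
    // union, repeating because the grown union may reach further ones.
    static void addMerged(List<Rectangle2D> rects, Rectangle2D r) {
        boolean merged = true;
        while (merged) {
            merged = false;
            Iterator<Rectangle2D> it = rects.iterator();
            while (it.hasNext()) {
                Rectangle2D other = it.next();
                if (other.intersects(r)) {
                    r = other.createUnion(r);
                    it.remove();
                    merged = true;
                }
            }
        }
        rects.add(r);
    }
}
```

Calling RectangleFinder.collect(tokens) gives you the merged rectangles for 
step d).  Keep in mind that these are content stream coordinates, so they 
may need flipping or transforming before you compare them against the 
regions you hand to the stripper.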

I believe the FIRST way I showed you originally is the better approach to your 
problem, though, since it sorts your tokens, which was your original complaint.

There is no magical "one size fits all" for parsing a PDF.  You need to do the 
hard work of understanding the PDF specification, how PDFBox interprets that 
specification, and then understand how the AUTHOR of the PDF assembled it in 
the first place.  This takes time and experience.


> Btw, which version of PDFBox do you use? You never encounter the
> "Exception in thread "main" java.lang.IllegalArgumentException:" ?

Really, an IllegalArgumentException most likely has to do with your code.  
Post your code here and maybe it's obvious.  Your exception and stack trace 
are sort of irrelevant since you probably have a simple coding error.

Paul Monday
[email protected]
