On Apr 30, 2014, at 12:57 AM, Qingchao Kong <[email protected]> wrote:
> Paul,
>>
>> int width = 612;
>> int height = 792;
>>
>> int hX = 320, tX = 340, cX = 100;
>> int hY = 0, tY = 580, cY = 200;
>> int hW = width - hX, tW = width - tX, cW = 100;
>> int hH = 80, tH = height - tY, cH = 60;
>>
>> Rectangle header = new Rectangle();
>> header.setBounds(hX, hY, hW, hH);
>> Rectangle totals = new Rectangle();
>> totals.setBounds(tX, tY, tW, tH);
>> Rectangle customer = new Rectangle();
>> customer.setBounds(cX, cY, cW, cH);
>>
>> PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>> stripper.addRegion("header", header);
>> stripper.addRegion("totals", totals);
>> stripper.addRegion("customer", customer);
>> stripper.setSortByPosition(true);
>>
>
> So it means that you have set the bounds empirically, like header,
> totals and customer, is that correct? The problem is PDF files may be
> of various sizes and you only know the header/footer are at the
> front/end of a PDF page, you would never know the exact locations.
The document that I'm looking at puts a lot of the information in drawn
rectangles, so I was able to look at the rectangles that are drawn in the
document, study where they are, and then determine the boundaries I wanted. I
don't know if that works for the document you are looking at, but to get all
of the existing rectangles on a page:
a) get the tokens on the page
b) for each token that is an "re" operator
   b1) get the previous 4 tokens (relative to the "re" token, position -4 is
       x, -3 is y, -2 is w, -1 is h)
   b2) store the rectangle (I actually wrote a routine to see if the rectangle
       was part of another rectangle or intersected one; if it intersected, I
       store a union of the two rectangles and remove the two originals)
c) sort the rectangles by the y coordinate (I wrote a comparator so I could
   do this easily)
d) stare at the output, compare it to the page, and determine your regions
That only works if your PDF is drawn with rectangles though.
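
Steps a) through c) could be sketched roughly like this. It is not the code I
actually use, just a minimal illustration with some assumptions: the tokens
are modeled as a plain List<Object> in which numbers are java.lang.Number and
operators are Strings (in PDFBox they would be the number and operator objects
parsed from the page's content stream, and how you get them depends on the
PDFBox version), the class and method names (RectangleFinder, findRectangles,
addMerging) are made up, and any transformation matrix in the content stream
is ignored.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: collects the rectangles drawn by "re" operators.
final class RectangleFinder {

    // Steps a) to c): scan the tokens, build rectangles, merge, sort by y.
    // ASSUMPTION: numbers are java.lang.Number and operators are Strings;
    // this stands in for the real PDFBox token objects.
    static List<Rectangle2D> findRectangles(List<Object> tokens) {
        List<Rectangle2D> found = new ArrayList<>();
        for (int i = 4; i < tokens.size(); i++) {
            if (!"re".equals(tokens.get(i))) {
                continue;
            }
            // b1) the four operands directly before "re" are x, y, w, h
            Object x = tokens.get(i - 4);
            Object y = tokens.get(i - 3);
            Object w = tokens.get(i - 2);
            Object h = tokens.get(i - 1);
            if (x instanceof Number && y instanceof Number
                    && w instanceof Number && h instanceof Number) {
                addMerging(found, normalize(
                        ((Number) x).doubleValue(), ((Number) y).doubleValue(),
                        ((Number) w).doubleValue(), ((Number) h).doubleValue()));
            }
        }
        // c) sort by y; content-stream y normally grows upward from the
        // bottom of the page, so ascending order reads bottom to top
        found.sort(Comparator.comparingDouble(Rectangle2D::getY));
        return found;
    }

    // "re" may carry a negative width or height; flip it to a positive size.
    static Rectangle2D normalize(double x, double y, double w, double h) {
        if (w < 0) { x += w; w = -w; }
        if (h < 0) { y += h; h = -h; }
        return new Rectangle2D.Double(x, y, w, h);
    }

    // b2) replace every stored rectangle that overlaps the candidate
    // (containment included) with the union of the two.
    static void addMerging(List<Rectangle2D> found, Rectangle2D candidate) {
        Rectangle2D merged = candidate;
        boolean changed = true;
        while (changed) { // a grown union may now overlap other rectangles
            changed = false;
            for (Iterator<Rectangle2D> it = found.iterator(); it.hasNext();) {
                Rectangle2D other = it.next();
                if (other.intersects(merged)) {
                    merged = merged.createUnion(other);
                    it.remove();
                    changed = true;
                }
            }
        }
        found.add(merged);
    }
}
```

Step d) is still manual: print the sorted list, compare it to the page, and
pick your regions. Keep in mind that the regions passed to addRegion() may
use a different y origin than the content stream, so a flip against the page
height may be needed.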
I believe the FIRST way I showed you originally is the better approach to your
problem, though, since it sorts your tokens, and that was your original
complaint.
There is no magical "one size fits all" for parsing a PDF. You need to do the
hard work of understanding the PDF specification, how PDFBox interprets that
specification, and how the AUTHOR of the PDF assembled it in the first place.
This takes time and experience.
> Btw, which version of PDFBox do you use? You never encounter the
> "Exception in thread "main" java.lang.IllegalArgumentException:" ?
Really, an IllegalArgumentException most likely has to do with your code.
Post your code here and maybe it's obvious. Your exception and stack trace are
sort of irrelevant since you most likely have a simple coding error.
Paul Monday
[email protected]