1.2.1 - PDFTextStripper* uses different Y values when cropbox has non-zero Y: 
not so for X coordinates.
-------------------------------------------------------------------------------------------------------

                 Key: PDFBOX-816
                 URL: https://issues.apache.org/jira/browse/PDFBOX-816
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 1.2.1
         Environment: Mac OS X 10.6.4 + JDK 1.6.0_20, and Ubuntu 10.4 (kernel 
2.6.32-24) with Sun Java 1.6.0_20. 
            Reporter: Larry West


[First off, kudos to the folks who work on PDFBox.  It's got some great 
functionality.]

The issue is that a cropbox with non-zero "lower-left-corner" changes positions 
reported for text by PDFTextStripper.  In the Y coordinate only.

See page 5 of the attached PDF (which is a phony tax return, not real data).

As an example, near the top is the tax year, "2009".   Using a program such as 
Apple's Preview, one would estimate that a snug bounding rectangle for that 
text would be x=300, y=54, w=41, and h=18.

And on other PDFs, that would be fine with PDFTextStripperByArea.  But this PDF 
has a non-zero-origin cropbox set, one with the alleged lower-left-corner at 
[-24.0, -24.0].   So the region coordinates that PDFTextStripperByArea wants to 
see need to be offset by subtracting -24 from x and y, i.e., yielding x=324, 
y=78.

Or so you would think.  It turns out that the X coordinate stays the same, only 
the Y coordinate gets affected by the cropbox setting.

Using the sample program PrintTextLocations, which, like PDFTextStripperByArea, 
derives from PDFTextStripper, reports both coordinates as being offset by 24 in 
its processTextPosition():

...
String[92.0,94.0 fs=12.0 xscale=1.0 height=9.0720005 space=3.3360004 
width=186.71997]U.S. Individual Income Tax Retur
String[278.71997,94.0 fs=12.0 xscale=1.0 height=9.0720005 space=3.3360004 
width=7.3320007]n
String[301.0,94.0 fs=18.0 xscale=1.0 height=13.122001 space=5.0040007 
width=30.023987]200
String[331.024,94.0 fs=18.0 xscale=1.0 height=13.122001 space=5.0040007 
width=10.007996]9
String[368.0,94.0 fs=8.0 xscale=1.0 height=7.5200005 space=2.2240002 
width=11.559998](99
String[379.56,94.0 fs=8.0 xscale=1.0 height=7.5200005 space=2.2240002 
width=2.6640015])
String[399.0,94.0 fs=6.0 xscale=1.0 height=4.5360003 space=1.6680002 
width=36.34201]IRS Use Only
...

(Lines 3 and 4 are the only key ones, the others for comparison).  

To make sense of this: 301.0 is close enough to 300 for the X coordinate: if I 
go to 324, I don't get the "200", just the "9".

Also, the y coordinate of 94 is the bottom of the text, with a height of 18 
that roughly extends up to 76, but 78 works as far as extractRegions() is 
concerned (I think it only cares about the lower-left corner of each character).

So the bounding rectangle reported above for "2009" is lower-left corner 
(301.0, 94.0) to upper-right corner approx (341.03, 76).
In those coordinates, the region that works with extractRegions() is LL (300, 
96) to UR (341, 78).
(Or, the exact Rectangle2D I pass to extractRegions: x=300, y=78, w=41, h=18).

This applies to any field you choose on this page.

So:
(a) it doesn't seem to me that a cropbox has any business changing the 
coordinates.  But I could be wrong.
(b) if it does make sense for a cropbox to affect the coordinates, it should do 
so in both X and Y dimensions, shouldn't it?
(c) I suppose it would be too much to ask for notes explaining the coordinates 
used for each method, but it's a nice thought.

I tried looking through PDFTextStripper* but I'm not sufficiently familiar with 
the code to determine where the coordinate perturbation occurs.   It might be 
in how PDFStreamEngine.processEncodedText() is using the graphicsState 
(initialized with the cropbox) to transform textMatrixStDisp, but that seems to 
be initialized with a Dimension, so I don't see how an offset would affect it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to