Thanks for the quick answer.
Sorry if it's not clear (it's only as clear as it is in my head right now!).
The coordinates I get from the OCR are in pixels, with (0,0) at the upper-left
corner of the image. I verified this with Gimp and it's correct.
Additional questions:
# Does the PrintImageLocation class transform the coordinates to a
system where (0,0) is the upper-left corner, near this code (in the
processOperator() method):
    float ph = page.findMediaBox().getHeight();
    float pw = page.findMediaBox().getWidth();
    Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
    double rotationInRadians = (page.findRotation() * Math.PI) / 180; ...
Or does that only happen when the page is rotated?
Are the scaling factors taken into account at this level too? If so, I
could bypass this and deal with my problem more easily.
# As I want to draw annotations on what I've found, I guess I have to
transform to a system where (0,0) is the lower-left corner of the PDF page. Am
I right? (And probably do some unit conversions as well.)
[EMAIL PROTECTED] wrote:
I'm losing my hair over coordinate conversion and image extraction.
Here is what I'm trying to do:
I want to perform keyword searches on non-searchable PDFs, or on PDFs where
the text layer is not well positioned behind the images (and then underline the
results using annotations), using PDFBox and an OCR engine.
I've extended PrintImageLocation in the following way:
On a given page I extract all images and generate PNG images with JAI
for better quality (I tried getting a single image for the whole page, but
the OCR results are not good enough, due to layout issues I think;
with JAI I expect to be able to posterize, reduce noise if necessary,
etc. to make the OCR happy).
I run an OCR on them externally (ocropus/tesseract; it's C++, so I have
some "Process p = Runtime.getRuntime().exec(cmd); " code), which
produces hOCR files giving the text and coordinates of each character. I'm then
able to determine the coordinates of a keyword by parsing the
hOCR file.
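
As an illustration of the parsing step, here is a minimal sketch in plain Java. It relies on the hOCR convention of storing each box in the title attribute as "bbox x0 y0 x1 y1" (upper left, then lower right, in pixels). The class HocrBoxes and its method names are made up for this example, and it assumes word-level span elements with plain text content; per-character output would need the character boxes merged first, and a real HTML parser would be more robust than a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or of any OCR tool.
class HocrBoxes {
    /** Pixel box in the OCR frame: origin top-left, (x0, y0) upper left, (x1, y1) lower right. */
    static final class Box {
        final String text;
        final int x0, y0, x1, y1;

        Box(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }
    }

    // Simplifying assumption: every element of interest is a <span> whose title
    // contains "bbox x0 y0 x1 y1" and whose content is plain, un-nested text.
    private static final Pattern SPAN = Pattern.compile(
            "<span[^>]*title=[\"'][^\"']*bbox (\\d+) (\\d+) (\\d+) (\\d+)[^\"']*[\"'][^>]*>([^<]*)</span>");

    /** Returns the boxes of all spans whose text equals the keyword (case-insensitive). */
    static List<Box> find(String hocr, String keyword) {
        List<Box> hits = new ArrayList<>();
        Matcher m = SPAN.matcher(hocr);
        while (m.find()) {
            String text = m.group(5).trim();
            if (text.equalsIgnoreCase(keyword)) {
                hits.add(new Box(text,
                        Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                        Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
            }
        }
        return hits;
    }
}
```

Calling HocrBoxes.find(hocrText, "keyword") returns one Box per matching span, still in the OCR's top-left pixel frame.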
At this point, I have the coordinates of the keyword in the image, the
position of the image on the page and the size of the image.
I then try to translate the coordinates I got from the parsed image into
coordinates on the PDF page.
First I invert the bounding box, as the OCR gives me an upper-left /
lower-right pair of points.
Then... I'm stuck: I expected the origin of a PDF page to be the lower-left
corner, but it seems to be the upper-left corner here.
And to be honest, I can hardly figure out which corner of the image is used
to determine its location, or which unit is used.
Inside the image, the coordinates I retrieve are in dots (pixels).
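
For what it's worth, the mapping from image pixels to page coordinates can be sketched as follows. In PDF an image is painted into the unit square (0,0)-(1,1), with the top row of pixels at y = 1, and the current transformation matrix (CTM) carries that square into user space. The sketch below uses only java.awt.geom; the class ImagePixelToPdf is invented for this example, and it assumes the six CTM values [a b c d e f] in effect when the image was drawn have already been read out of PDFBox (how to do that is not shown). Page rotation and a media box that does not start at (0,0) are not handled.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical helper, not part of PDFBox.
class ImagePixelToPdf {
    /** Maps one pixel position (origin top-left) of an imgW x imgH image into user space. */
    static Point2D toUserSpace(double px, double py, int imgW, int imgH, AffineTransform ctm) {
        // Normalise to the unit square and flip y, because pixel rows grow downwards.
        Point2D unit = new Point2D.Double(px / imgW, 1.0 - py / imgH);
        return ctm.transform(unit, null);
    }

    /**
     * Maps a pixel box (upper left x0,y0 / lower right x1,y1) to an axis-aligned
     * rectangle {llx, lly, urx, ury} in user space. All four corners are transformed
     * so that a rotated or mirrored CTM still yields a correct bounding rectangle.
     */
    static double[] boxToUserSpace(double x0, double y0, double x1, double y1,
                                   int imgW, int imgH, AffineTransform ctm) {
        double[][] corners = {{x0, y0}, {x1, y0}, {x1, y1}, {x0, y1}};
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (double[] c : corners) {
            Point2D p = toUserSpace(c[0], c[1], imgW, imgH, ctm);
            minX = Math.min(minX, p.getX());
            maxX = Math.max(maxX, p.getX());
            minY = Math.min(minY, p.getY());
            maxY = Math.max(maxY, p.getY());
        }
        return new double[] {minX, minY, maxX, maxY};
    }
}
```

The AffineTransform constructor takes its six arguments in the same order as the PDF matrix [a b c d e f], so an unrotated image 200 wide and 100 high placed at (50,60) would be new AffineTransform(200, 0, 0, 100, 50, 60). The resulting rectangle has a lower-left origin, which is the form an annotation rectangle needs.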
For example, here are the images I've found :
[I0] at 571.26746,71.80139 size=796.0658,93.23215 (small logo)
[I1] at 368.0984,85.12024 size=92.90973,196.4537 (small logo)
[I2] at 583.11694,707.5416 size=12841.42,15587.612 (the scanned article)
[I3] at 176.53192,341.2494 size=402.6675,1046.7035 (image attached to the article)
Visually, [I0] is in the upper left, [I1] is to the right of [I0], and [I3] is
in the upper right but below the line formed by [I0] and [I1].
[I2] is the "body" of the page, actually a press article, where I find
the keyword's occurrences.
Here is a set of coordinates retrieved from the OCR processing (upper
left / lower right):
keyword: (2056.0/2484.0) (2193.0/2501.0)
which gives (lower left / upper right):
(2056.0/2501.0) (2193.0/2484.0)
Here are the coordinates of the same occurrence in the PDF (the result I
would expect after a conversion to lower left / upper right; obtained here
by parsing the text layer, which is hopefully well positioned):
START : String[ xy=511.5022,665.7338 fontsize=33.0 xscale=0.24686399 yscale=0.225744 height=5.579715 space=2.2647307 width=128.40302] = keyword
END : String[ xy=511.5022,665.7338 fontsize=33.0 xscale=0.24686399 yscale=0.225744 height=5.579715 space=2.2647307 width=128.40302] = keyword
--------->keyword : 1790.4882,2322.2808, 1881.345,2348.561 (the
bounding box converted in a suitable metric system to put annotations on it)
I guess I have to set up a transformation matrix, but I don't know which
parameters I have to take into account (or whether they are available in
one way or another!).
Could someone give me some advice?
I don't understand every point of your problem, but here are some details you
are perhaps looking for:
- the PDF (0,0) reference is the lower-left corner (as you already mentioned)
- a possible dimension of a page is something like this: 612, 792 (Letter) or
595, 842 (DIN A4), both portrait
- images are drawn starting at their lower left corner, with the given width
and height in the pdf
- the image may be stored in the pdf-document with a larger/smaller dimension
than used for displaying/printing
If you want to compare your ocr-results with the pdf, you have to have a look
at the possible scaling of the image.
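
To make the scaling point concrete, here is a small sketch in plain Java. It only relies on the default PDF user-space unit being 1/72 inch; the class ImageScale and the numbers in the usage note are illustrative and are not taken from the document discussed in this thread.

```java
// Hypothetical helper, not part of PDFBox.
class ImageScale {
    /** Size of one stored pixel on the page, in points, along one axis. */
    static double pointsPerPixel(double displayedSizePt, int storedSizePx) {
        return displayedSizePt / storedSizePx;
    }

    /** Effective resolution of the image as placed on the page (1 pt = 1/72 inch). */
    static double effectiveDpi(double displayedSizePt, int storedSizePx) {
        return storedSizePx * 72.0 / displayedSizePt;
    }
}
```

For instance, an image stored 2480 pixels wide and displayed 595.2 points wide has a factor of 0.24 points per pixel, i.e. an effective resolution of about 300 dpi. An OCR offset measured in pixels has to be multiplied by that factor (computed separately for width and height) before it can be compared with positions on the page.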
HTH,
Andreas