>I'm losing my hair over coordinate conversion and image extraction.
>
>Here is what I'm trying to do:
>
>I want to perform keyword searches on non-searchable PDFs, or PDFs whose
>text layer is not well positioned behind the images, and then underline
>the results using annotations, using PDFBOX and an OCR engine.
>
>I've extended printImageLocation in the following way:
>On a given page I extract all the images and generate PNG images with JAI
>for better quality (I tried getting a single image for the whole page, but
>the results are not good enough with the OCR, I think due to layout
>issues; with JAI I expect to be able to posterize, reduce noise if
>necessary, etc., to make the OCR happy).
>I then run an OCR on them externally (ocropus/tesseract; it's C++, so I
>have some "Process p = Runtime.getRuntime().exec(cmd);" code), which
>produces hOCR files giving the text and coordinates of each character.
>By parsing the hOCR file I am then able to determine the coordinates of
>a keyword.
>At this point I have the coordinates of the keyword in the image, the
>position of the image on the page, and the size of the image.
>I then try to translate the coordinates I got from the parsed image into
>coordinates in the PDF page.
>First I invert the bounding box, as the OCR gives me an upper-left /
>lower-right pair of points.
>Then... I'm stuck: I expected the origin to be the lower left in a PDF
>page, but it seems to be the upper left here.
>And to be honest, I can hardly figure out which corner of the image is
>used to determine its location, or what unit is used.
>Inside the image, I retrieve coordinates in dots.
>
>For example, here are the images I've found:
>
>[I0] at 571.26746,71.80139 size=796.0658,93.23215 (small logo)
>[I1] at 368.0984,85.12024 size=92.90973,196.4537 (small logo)
>[I2] at 583.11694,707.5416 size=12841.42,15587.612 (the scanned article)
>[I3] at 176.53192,341.2494 size=402.6675,1046.7035 (image attached to the article)
>
>Visually, [I0] is at the upper left, [I1] is to the right of [I0], and
>[I3] is at the upper right but below the [I0]/[I1] line.
>[I2] is the "body" of the page, actually a press article, and is where I
>find the keyword's occurrences.
>
>Here is a set of coordinates retrieved from the OCR processing
>(upper left / lower right):
>
>keyword: (2056.0/2484.0) (2193.0/2501.0)
>
>which gives (lower left / upper right):
>
>(2056.0/2501.0) (2193.0/2484.0)
>
>And here are the coordinates of the same occurrence in the PDF (the
>result I would expect to find after a conversion, lower left / upper
>right; obtained here by parsing a text layer that happens to be well
>positioned):
>
>START: String[xy=511.5022,665.7338 fontsize=33.0 xscale=0.24686399 yscale=0.225744 height=5.579715 space=2.2647307 width=128.40302] = keyword
>END: String[xy=511.5022,665.7338 fontsize=33.0 xscale=0.24686399 yscale=0.225744 height=5.579715 space=2.2647307 width=128.40302] = keyword
>---------> keyword: 1790.4882,2322.2808, 1881.345,2348.561 (the bounding
>box converted into a suitable coordinate system to put annotations on it)
>
>I guess I have to set up a transformation matrix, but I don't know which
>parameters I have to take into account (or whether they are available in
>one way or another!).
>Could someone provide some advice?
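The inversion step described above may be the snag: hOCR bbox values are pixel coordinates with the origin at the image's top-left corner, and converting them to a bottom-left-origin convention requires mirroring each y value against the image's pixel height, not merely swapping the corner points. A minimal sketch of that flip; the 3000 px image height used in the example is a made-up value, not a number from this thread:

```java
public class HocrBox {
    /**
     * Convert an hOCR bounding box (x1,y1,x2,y2 in pixels, origin at the
     * image's TOP-left corner, y growing downwards) into a lower-left /
     * upper-right pair with the origin at the image's BOTTOM-left corner.
     * Each y must be mirrored against the image's pixel height.
     */
    static double[] toBottomLeftOrigin(double x1, double y1,
                                       double x2, double y2,
                                       double imagePixelHeight) {
        // returns { lowerLeftX, lowerLeftY, upperRightX, upperRightY }
        return new double[] {
            x1, imagePixelHeight - y2,
            x2, imagePixelHeight - y1
        };
    }

    public static void main(String[] args) {
        // Keyword box from the post: UL (2056,2484), LR (2193,2501),
        // assuming a (hypothetical) scan height of 3000 px.
        double[] box = toBottomLeftOrigin(2056, 2484, 2193, 2501, 3000);
        System.out.println(box[0] + "," + box[1] + "  " + box[2] + "," + box[3]);
        // → 2056.0,499.0  2193.0,516.0
    }
}
```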
I don't understand every point of your problem, but here are some details you are perhaps looking for:

- the PDF 0,0 reference is the lower left corner (as you already mentioned)
- a possible page dimension is something like 612 x 792 (Letter) or 596 x 843 (DIN A4), both portrait
- images are drawn starting at their lower left corner, with the given width and height in the PDF
- the image may be stored in the PDF document at a larger/smaller dimension than the one used for displaying/printing

If you want to compare your OCR results with the PDF, you have to look at the possible scaling of the image.

HTH,
Andreas
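Combining those points, the mapping the original poster is after is an affine one: scale each pixel coordinate by drawnSize/pixelSize, flip y against the image's pixel height, then translate by the image's lower-left placement on the page. A sketch under assumed inputs; all concrete numbers below are illustrative, not taken from the thread:

```java
public class OcrToPdf {
    /**
     * Map a point in image pixel coordinates (origin at the image's
     * top-left corner, y growing downwards, as in hOCR output) into PDF
     * user space (origin at the page's lower left, y growing upwards).
     *
     * imgX/imgY     : lower-left corner where the image is drawn on the page
     * drawnW/drawnH : size the image is drawn at, in PDF units
     * pixW/pixH     : the image's stored size in pixels (the OCR's space)
     */
    static double[] pixelToPdf(double px, double py,
                               double imgX, double imgY,
                               double drawnW, double drawnH,
                               double pixW, double pixH) {
        double scaleX = drawnW / pixW;           // PDF units per pixel, horizontal
        double scaleY = drawnH / pixH;           // PDF units per pixel, vertical
        double x = imgX + px * scaleX;           // shift right of the image origin
        double y = imgY + (pixH - py) * scaleY;  // flip y, then shift up
        return new double[] { x, y };
    }

    public static void main(String[] args) {
        // Hypothetical numbers: an image drawn at (100, 200) on the page,
        // 400 x 600 PDF units, stored as a 2000 x 3000 pixel scan.
        // The pixel center of the scan should land at the drawn center.
        double[] p = pixelToPdf(1000, 1500, 100, 200, 400, 600, 2000, 3000);
        System.out.println(p[0] + "," + p[1]);  // → 300.0,500.0
    }
}
```

The same idea extends to a whole bounding box by mapping its two corners; the drawn position and size per image are what PDFBox's PrintImageLocations-style traversal reports, while the pixel size is what the OCR sees.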
