Hi John,

I looked at the processTextPosition method in PDFTextStripper, but I couldn't understand what actually happens inside it. What should the input to that method be? In my case I have words with their bounding box coordinates. How can I make that data compatible with the input of processTextPosition? Also, what is the output of the method?

Thanks
Dimuthu

On Wed, Mar 19, 2014 at 11:19 PM, John Hewson <[email protected]> wrote:

> Hi Dimuthu
>
>> 1 Print those data into PDDocument again and pass through TextStripper
>> of PDFBox. This could reduce the performance of overall process.
>
> This was what I had in mind, but rather than printing the text into the
> PDDocument you can inject it directly into PDFTextStripper as TextPosition
> instances. I mentioned something like this a while ago:
>
>> You could subclass PDFTextStripper and override the startDocument method and
>> use it to create a PDFRenderer and store it in a field. Then override the
>> processPage method and use the previously created PDFRenderer to render the
>> current page to a buffered image and perform OCR on the image. Once you have
>> the OCR text + positions, instead of calling processStream you can call
>> processTextPosition once for each character + position.
>
> Let's see how well it works and then re-evaluate.
>
> -- John

--
Regards
W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering
University of Moratuwa, Sri Lanka
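
Since the suggestion quoted above is to call processTextPosition once for each character + position, while the OCR data here is per word, one possible bridging step is to split each word box into per-character boxes first. The following is only a rough sketch of that step: WordBoxSplitter and CharBox are hypothetical helper names (not PDFBox classes), and dividing the word width evenly among its characters is an assumption that will be inexact for proportional fonts. Building the actual TextPosition from each box is left out, because that part depends on the PDFBox version in use.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a word-level OCR box into per-character boxes. */
final class WordBoxSplitter {

    /** One character with its estimated box. Not a PDFBox type. */
    static final class CharBox {
        final char ch;
        final float x;
        final float y;
        final float width;
        final float height;

        CharBox(char ch, float x, float y, float width, float height) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Assumption: every character gets an equal share of the word width.
     * Each UTF-16 char is treated as one character, so surrogate pairs
     * are not handled.
     */
    static List<CharBox> split(String word, float x, float y, float width, float height) {
        List<CharBox> boxes = new ArrayList<>();
        if (word == null || word.isEmpty()) {
            return boxes;
        }
        float charWidth = width / word.length();
        for (int i = 0; i < word.length(); i++) {
            // Characters are laid out left to right inside the word box.
            boxes.add(new CharBox(word.charAt(i), x + i * charWidth, y, charWidth, height));
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Example word box: origin (90, 700), 30 units wide, 12 units high.
        for (CharBox b : split("PDF", 90f, 700f, 30f, 12f)) {
            System.out.println(b.ch + " x=" + b.x + " width=" + b.width);
        }
    }
}
```

Each resulting CharBox would then be the source for one TextPosition passed to processTextPosition. If the OCR engine can report character-level boxes directly, those would likely be more accurate than this even split.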
