Hi John,

I looked at the processTextPosition method in PDFTextStripper, but I
couldn't understand what actually happens inside it. What should the
input to that method be? In my case I have words with bounding-box
coordinates. How can I make that data compatible with the input of
processTextPosition? Also, what is the output of the method?
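
To make the question concrete, here is a minimal sketch of the
"once for each character + position" step from the quoted message
below. It assumes every character in a word gets an equal share of the
word's width, which is only a rough approximation for proportional
fonts. WordBoxSplitter and CharBox are hypothetical names, not PDFBox
classes, and the sketch stops before any TextPosition is constructed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: splits one OCR word box into
// equal-width per-character boxes.
final class WordBoxSplitter {

    /** One character with its estimated bounding box (illustrative type). */
    static final class CharBox {
        final String text;
        final float x;
        final float y;
        final float width;
        final float height;

        CharBox(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Divides the word's bounding box evenly among its characters.
     * Assumption: equal-width characters; real glyph widths will differ.
     */
    static List<CharBox> split(String word, float x, float y,
                               float width, float height) {
        List<CharBox> boxes = new ArrayList<>();
        int count = word.codePointCount(0, word.length());
        if (count == 0) {
            return boxes; // nothing to split
        }
        float charWidth = width / count;
        int index = 0;
        // Walk by code point so surrogate pairs stay together.
        for (int i = 0; i < word.length(); ) {
            int codePoint = word.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            boxes.add(new CharBox(ch, x + index * charWidth, y,
                                  charWidth, height));
            i += Character.charCount(codePoint);
            index++;
        }
        return boxes;
    }
}
```

Each CharBox would still have to be converted into a TextPosition
before it could be passed to processTextPosition. That conversion is
the part I am unsure about, so it is left out here.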

Thanks
Dimuthu

On Wed, Mar 19, 2014 at 11:19 PM, John Hewson <[email protected]> wrote:
> Hi Dimuthu
>
>> 1 Print those data into PDDocument again and pass through TextStripper
>> of PDFBox. This could reduce the performance of overall process.
>
> This was what I had in mind, but rather than printing the text into the
> PDDocument you can inject it directly into PDFTextStripper as TextPosition
> instances. I mentioned something like this a while ago:
>
>> You could subclass PDFTextStripper and override the startDocument method and 
>> use it to create a PDFRenderer and store it in a field. Then override the 
>> processPage method and use the previously created PDFRenderer to render the 
>> current page to a buffered image and perform OCR on the image. Once you have 
>> the OCR text + positions, instead of calling processStream you can call 
>> processTextPosition once for each character + position.
>
> Let's see how well it works and then re-evaluate.
>
> -- John
>



-- 
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
