Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Tue, 25 Mar 2014 11:25:58 -0700

Hi Dimuthu

Each line of text is handled by the processEncodedText method in PDFStreamEngine,
which calls processTextPosition once for each character. The processTextPosition
override in PDFTextStripper collects the text positions into lines, paragraphs and
columns (also called “articles”). Text on a PDF page does not have to be drawn in
order, so text at any position can occur at any time and processTextPosition will
sort the text and insert it into the relevant line/paragraph/column.
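
To illustrate why the sorting matters, here is a simplified, standalone sketch
of grouping out-of-order glyphs into lines. It is not PDFBox's actual algorithm:
the LineGrouper and Glyph names, the yTolerance parameter, and the assumption
that y grows down the page are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these types exist in PDFBox.
final class LineGrouper {

    /** One piece of text with the position it was drawn at. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups glyphs into lines: glyphs whose y is within yTolerance of the
     * first glyph of the current line share that line. Each line is then
     * ordered left to right by x. Assumes y increases down the page.
     */
    static List<String> toLines(List<Glyph> glyphs, float yTolerance) {
        // Sort a copy top to bottom so the drawing order no longer matters.
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last == null || Math.abs(g.y - last.get(0).y) > yTolerance) {
                last = new ArrayList<>();
                lines.add(last);
            }
            last.add(g);
        }

        // Order each line left to right and join its text.
        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : line) {
                sb.append(g.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

The real text stripper also has to deal with paragraphs, articles, rotation and
duplicate glyphs, so treat this only as a picture of the general idea.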


To make your words with bounding boxes compatible with processTextPosition, you
should convert each character in the word into a TextPosition and then call
processTextPosition once for each one.
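
As a rough sketch of the first half of that conversion, the code below splits
one OCR word box into per-character boxes by dividing the width evenly. The
WordSplitter and CharBox names are hypothetical and are not PDFBox classes, and
equal-width characters are an assumption that real fonts will not satisfy
exactly. Building the actual TextPosition from each box is left out, since its
constructor arguments depend on the PDFBox version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of PDFBox.
final class WordSplitter {

    /** One character and the box assumed to contain it. */
    static final class CharBox {
        final char ch;
        final float x;
        final float y;
        final float width;
        final float height;

        CharBox(char ch, float x, float y, float width, float height) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Splits a word's bounding box into one box per character, giving every
     * character the same share of the word's width.
     */
    static List<CharBox> split(String word, float x, float y, float width, float height) {
        List<CharBox> boxes = new ArrayList<>();
        if (word.isEmpty()) {
            return boxes;
        }
        float charWidth = width / word.length();
        for (int i = 0; i < word.length(); i++) {
            boxes.add(new CharBox(word.charAt(i), x + i * charWidth, y, charWidth, height));
        }
        return boxes;
    }
}
```

Each CharBox would then be turned into a TextPosition and passed to
processTextPosition from your PDFTextStripper subclass.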

Thanks

-- John

On 24 Mar 2014, at 01:32, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:

> Hi John,
> 
> I looked at the processTextPosition method in PDFTextStripper, but I
> couldn't understand what actually happens inside the method. What
> should the input to that method be? In my case I have words with
> bounding box coordinates. How can I make that data compatible
> with the input of the processTextPosition method? Also, what is the
> output of the method?
> 
> Thanks
> Dimuthu
> 
> On Wed, Mar 19, 2014 at 11:19 PM, John Hewson <j...@jahewson.com> wrote:
>> Hi Dimuthu
>> 
>>> 1 Print those data into PDDocument again and pass through TextStripper
>>> of PDFBox. This could reduce the performance of overall process.
>> 
>> This was what I had in mind, but rather than printing the text into the
>> PDDocument you can inject it directly into PDFTextStripper as TextPosition
>> instances. I mentioned something like this a while ago:
>> 
>>> You could subclass PDFTextStripper and override the startDocument method 
>>> and use it to create a PDFRenderer and store it in a field. Then override 
>>> the processPage method and use the previously created PDFRenderer to render 
>>> the current page to a buffered image and perform OCR on the image. Once you 
>>> have the OCR text + positions, instead of calling processStream you can 
>>> call processTextPosition once for each character + position.
>> 
>> Let's see how well it works and then re-evaluate.
>> 
>> -- John
>> 
> 
> 
> 
> -- 
> Regards
> 
> W.Dimuthu Upeksha
> Undergraduate
> 
> Department of Computer Science And Engineering
> 
> University of Moratuwa, Sri Lanka
