Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Wed, 16 Apr 2014 17:24:13 -0700

Hi Dimuthu

I'm travelling for the next week so I'm going to be a little slow at replying 
and somewhat brief.


The scale can simply be 1.0 at all times. The font size should be the height of 
the current line of text in points (1/72 inch). To calculate this from the 
height of the text in pixels you need to take into account the DPI (dots per 
inch) at which PDFRenderer rendered the image.
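
As a rough sketch of that conversion, assuming the OCR engine reports the line 
height in pixels of the rendered image and that you know the DPI you asked 
PDFRenderer for. The class and method names below are hypothetical and not 
part of PDFBox:

```java
// Hypothetical helper, not part of PDFBox: converts a text height measured in
// pixels on the rendered page image into a font size in points (1/72 inch).
final class OcrFontSize {
    private static final float POINTS_PER_INCH = 72f;

    // heightInPixels: height of the current line of text in the rendered image
    // dpi: resolution at which the page image was rendered
    static float pixelsToPoints(float heightInPixels, float dpi) {
        if (dpi <= 0f) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        // pixels / dpi gives inches; inches * 72 gives points
        return heightInPixels * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        // A line 50 pixels tall in an image rendered at 300 DPI
        System.out.println(pixelsToPoints(50f, 300f)); // prints 12.0
    }
}
```

The result would be the value to pass as fontSizeText, with the scale in the 
text matrix left at 1.0.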

I'm not sure what totalVerticalDisplacementDisp does. I'm not at my computer 
currently so I'll have to get back to you on that.

-- John

> On 14 Apr 2014, at 15:46, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> 
> Hi John,
> 
> I managed to override processStream method and pass some hardcoded
> text position values to processStream method.
> I still have doubts about the totalVerticalDisplacementDisp and
> fontSizeText variables. Is there a standard way to calculate the
> fontSizeText variable? What is the use of the
> totalVerticalDisplacementDisp variable and how can we fetch it?
> 
> For text matrix how can we calculate the scale x and scale y? For this
> scenario I put 1 for each.
> 
>    @Override
>    public void processStream(PDResources resources, COSStream cosStream,
>            PDRectangle drawingSize, int rotation) throws IOException {
>    float pageWidth = drawingSize.getWidth();
>    float pageHeight = drawingSize.getHeight();
>    Matrix textMatrixStart = new Matrix();
> 
>    textMatrixStart.setValue(0, 0, 1); //scale x
>    textMatrixStart.setValue(0, 1, 0);
>    textMatrixStart.setValue(0, 2, 0);
> 
>    textMatrixStart.setValue(1, 0, 0);
>    textMatrixStart.setValue(1, 1, 1); //scale y
>    textMatrixStart.setValue(1, 2, 0);
> 
>    textMatrixStart.setValue(2, 0, 10);
>    textMatrixStart.setValue(2, 1, 100);
>    textMatrixStart.setValue(2, 2, 1);
> 
>    float endXPosition = 29.34f;
>    float endYPosition = 0.0f;
>    float totalVerticalDisplacementDisp = 8.0f;
>    float widthText = 29.34f;
>    float spaceWidthDisp = 12.0f;
>    String c = "Hello";
>    int[] codePoints = {72, 101, 108, 108, 111};
>    PDFont font = new PDType1Font();
>    float fontSizeText = 12.0f;
> 
>    TextPosition textPosition = new TextPosition(rotation, pageWidth,
>        pageHeight, textMatrixStart, endXPosition, endYPosition,
>        totalVerticalDisplacementDisp, widthText, spaceWidthDisp, c,
>        codePoints, font, fontSizeText, 12);
> 
>    processTextPosition(textPosition);
>    }
> 
>> On Sat, Apr 12, 2014 at 6:36 AM, John Hewson <j...@jahewson.com> wrote:
>> These are the values of the "text matrix" at the start and end of the given 
>> text. Take a look at the PDF spec for a complete description of how the text 
>> matrix is calculated. It's an affine transform which can rotate, scale, and 
>> skew text and it represents "text space", the coordinate system for 
>> rendering text. Usually it just contains a translation component, though 
>> often a scale too (default is 1.0).
>> 
>> Another way of describing this is to say that transforming (0,0) by the text 
>> matrix gives you the (x,y) coordinate of the text.
>> 
>> In order to generate a fake text matrix for OCR all you need is to start 
>> with the identity matrix and then set the translation components to the 
>> current (x,y) position, where St (start) is the left-hand side of the glyph 
>> and End is its right-hand side.
>> 
>> -- John
>> 
>>> On 11 Apr 2014, at 16:05, DImuthu Upeksha <dimuthu.upeks...@gmail.com> 
>>> wrote:
>>> 
>>> I looked at the processTextPosition method in the PDFTextStripper class.
>>> It takes a TextPosition object as its parameter. The TextPosition object
>>> takes two matrices as parameters in its constructor:
>>> 
>>> Matrix textPositionSt
>>> Matrix textPositionEnd
>>> 
>>> 1. What is the task of these matrices?
>>> 2. What should be the format of its data?
>>> 
>>> I debugged one textPositionSt matrix and for that sample its value was
>>> 
>>> [12.0, 0.0, 0.0, 0.0, 12.0, 0.0, 15.336001, 0.0, 1.0]
>>> 
>>> What is the meaning of these values?
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>>> On Tue, Mar 25, 2014 at 11:54 PM, John Hewson <j...@jahewson.com> wrote:
>>>> Hi Dimuthu
>>>> 
>>>> Each line of text is handled by the processEncodedText method in
>>>> PDFStreamEngine, which calls processTextPosition once for each character.
>>>> The processTextPosition method in PDFStreamEngine collects the text
>>>> positions into lines, paragraphs and columns (also called "articles"). Text
>>>> on a PDF page does not have to be drawn in order, so text at any position
>>>> can occur at any time and processTextPosition will sort the text and insert
>>>> it into the relevant line/paragraph/column.
>>>> 
>>>> To make your words with bounding boxes compatible with processTextPosition
>>>> you should convert each character in the word into a TextPosition and then
>>>> you can call processTextPosition.
>>>> 
>>>> Thanks
>>>> 
>>>> -- John
>>>> 
>>>>> On 24 Mar 2014, at 01:32, DImuthu Upeksha <dimuthu.upeks...@gmail.com> 
>>>>> wrote:
>>>>> 
>>>>> Hi John,
>>>>> 
>>>>> I looked at the processTextPosition method in PDFTextStripper, but I
>>>>> couldn't understand the actual process happening inside the method. What
>>>>> should be the input for that method? In my case I have words with
>>>>> bounding box coordinates. How can I make that data compatible
>>>>> with the input of the processTextPosition method? Also, what is the
>>>>> output of the method?
>>>>> 
>>>>> Thanks
>>>>> Dimuthu
>>>>> 
>>>>>> On Wed, Mar 19, 2014 at 11:19 PM, John Hewson <j...@jahewson.com> wrote:
>>>>>> Hi Dimuthu
>>>>>> 
>>>>>>> 1 Print those data into PDDocument again and pass through TextStripper
>>>>>>> of PDFBox. This could reduce the performance of overall process.
>>>>>> 
>>>>>> This was what I had in mind, but rather than printing the text into the
>>>>>> PDDocument you can inject it directly into PDFTextStripper as
>>>>>> TextPosition instances. I mentioned something like this a while ago:
>>>>>> 
>>>>>>> You could subclass PDFTextStripper and override the startDocument 
>>>>>>> method and use it to create a PDFRenderer and store it in a field. Then 
>>>>>>> override the processPage method and use the previously created 
>>>>>>> PDFRenderer to render the current page to a buffered image and perform 
>>>>>>> OCR on the image. Once you have the OCR text + positions, instead of 
>>>>>>> calling processStream you can call processTextPosition once for each 
>>>>>>> character + position.
>>>>>> 
>>>>>> Let's see how well it works and then re-evaluate.
>>>>>> 
>>>>>> -- John
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Regards
>>>>> 
>>>>> W.Dimuthu Upeksha
>>>>> Undergraduate
>>>>> 
>>>>> Department of Computer Science And Engineering
>>>>> 
>>>>> University of Moratuwa, Sri Lanka
>>> 
>>> 
>>> 
>>> --
>>> Regards
>>> 
>>> W.Dimuthu Upeksha
>>> Undergraduate
>>> 
>>> Department of Computer Science And Engineering
>>> 
>>> University of Moratuwa, Sri Lanka
> 
> 
> 
> -- 
> Regards
> 
> W.Dimuthu Upeksha
> Undergraduate
> 
> Department of Computer Science And Engineering
> 
> University of Moratuwa, Sri Lanka
