These are the values of the "text matrix" at the start and end of the given
text. See the PDF spec for a complete description of how the text matrix is
calculated. It is an affine transform which can rotate, scale, and skew text,
and it defines "text space", the coordinate system for rendering text.
Usually it contains just a translation component, though often a scale too
(the default scale is 1.0).

Another way of describing this is to say that transforming (0,0) by the text 
matrix gives you the (x,y) coordinate of the text.
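
To make that concrete, here is a minimal sketch in plain Java (no PDFBox
needed) using the nine sample values quoted further down. It assumes the
numbers are a 3x3 matrix stored row by row as [a, b, 0, c, d, 0, tx, ty, 1];
that is an assumption about the debugger output, so check it against your
PDFBox version. TextMatrixDemo, fromNineValues and origin are made-up names
for illustration only.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class TextMatrixDemo {
    // Builds an AffineTransform from nine values laid out row by row as
    // [a, b, 0, c, d, 0, tx, ty, 1] (assumed layout, see the note above).
    static AffineTransform fromNineValues(double[] m) {
        if (m.length != 9) {
            throw new IllegalArgumentException("expected 9 values");
        }
        // AffineTransform takes (m00, m10, m01, m11, m02, m12),
        // which corresponds to (a, b, c, d, tx, ty).
        return new AffineTransform(m[0], m[1], m[3], m[4], m[6], m[7]);
    }

    // Transforms the text-space origin (0,0) to get the text position.
    static Point2D origin(AffineTransform at) {
        return at.transform(new Point2D.Double(0, 0), null);
    }

    public static void main(String[] args) {
        double[] st = {12.0, 0.0, 0.0, 0.0, 12.0, 0.0, 15.336001, 0.0, 1.0};
        AffineTransform at = fromNineValues(st);
        Point2D p = origin(at);
        System.out.println("position = (" + p.getX() + ", " + p.getY() + ")");
        System.out.println("scale = (" + at.getScaleX() + ", " + at.getScaleY() + ")");
    }
}
```

Under that assumption the two 12.0 values are the horizontal and vertical
scale, and 15.336001 and 0.0 are the x and y translation, i.e. the position
of the text.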

In order to generate a fake text matrix for OCR, all you need to do is start
with the identity matrix and set the translation components to the current
(x,y) position, where St (start) is the left-hand side of the glyph and End is
its right-hand side.
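
As a rough sketch of that recipe, the following plain Java builds the two
sets of nine values for one OCR glyph. It assumes the same row-by-row layout
[a, b, 0, c, d, 0, tx, ty, 1] and that the OCR coordinates have already been
converted to the page's coordinate system. OcrTextMatrices and its methods
are hypothetical helpers, not PDFBox API; turning the arrays into Matrix and
TextPosition objects is left out because it depends on your PDFBox version.

```java
final class OcrTextMatrices {
    // Identity matrix with the translation components set to (x, y),
    // laid out as [a, b, 0, c, d, 0, tx, ty, 1] (assumed layout).
    static float[] translation(float x, float y) {
        return new float[] {1f, 0f, 0f, 0f, 1f, 0f, x, y, 1f};
    }

    // St (start): the left-hand side of the glyph.
    static float[] startMatrix(float left, float y) {
        return translation(left, y);
    }

    // End: the right-hand side of the glyph, i.e. left edge plus width.
    static float[] endMatrix(float left, float width, float y) {
        return translation(left + width, y);
    }

    public static void main(String[] args) {
        // Example glyph box: left edge at x=100, 8.5 units wide, at y=700.
        float[] st = startMatrix(100f, 700f);
        float[] end = endMatrix(100f, 8.5f, 700f);
        System.out.println(java.util.Arrays.toString(st));
        System.out.println(java.util.Arrays.toString(end));
    }
}
```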

-- John

> On 11 Apr 2014, at 16:05, DImuthu Upeksha <[email protected]> wrote:
> 
> I looked at processTextPosition method in PDFTextStripper class. It
> takes a TextPosition object as the parameter. TextPosition object
> takes two Matrices as parameters in its constructor
> 
> Matrix textPositionSt
> Matrix textPositionEnd
> 
> 1. What is the task of these matrices?
> 2. What should be the format of its data?
> 
> I debugged one textPositionSt matrix and for that sample its value was
> 
> [12.0, 0.0, 0.0, 0.0, 12.0, 0.0, 15.336001, 0.0, 1.0]
> 
> What is the meaning of these values?
> 
> Thanks
> Dimuthu
> 
>> On Tue, Mar 25, 2014 at 11:54 PM, John Hewson <[email protected]> wrote:
>> Hi Dimuthu
>> 
>> Each line of text is handled by the processEncodedText method in
>> PDFStreamEngine which calls processTextPosition once for each character.
>> The processTextPosition method in PDFStreamEngine collects the text
>> positions into lines, paragraphs and columns (also called "articles").
>> Text on a PDF page does not have to be drawn in order, so text at any
>> position can occur at any time and processTextPosition will sort the text
>> and insert it into the relevant line/paragraph/column.
>> 
>> To make your words with bounding boxes compatible with
>> processTextPosition you should convert each character in the word into a
>> TextPosition and then you can call processTextPosition.
>> 
>> Thanks
>> 
>> -- John
>> 
>>> On 24 Mar 2014, at 01:32, DImuthu Upeksha <[email protected]> wrote:
>>> 
>>> Hi John,
>>> 
>>> I looked at the processTextPosition method in PDFTextStripper, but I
>>> couldn't understand what actually happens inside the method. What
>>> should be the input for that method? In my case I have words with
>>> bounding box coordinates. How can I make that data compatible with
>>> the input of the processTextPosition method? Also, what is the
>>> output of the method?
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>>> On Wed, Mar 19, 2014 at 11:19 PM, John Hewson <[email protected]> wrote:
>>>> Hi Dimuthu
>>>> 
>>>>> 1 Print those data into PDDocument again and pass through TextStripper
>>>>> of PDFBox. This could reduce the performance of overall process.
>>>> 
>>>> This was what I had in mind, but rather than printing the text into
>>>> the PDDocument you can inject it directly into PDFTextStripper as
>>>> TextPosition instances. I mentioned something like this a while ago:
>>>> 
>>>>> You could subclass PDFTextStripper and override the startDocument method 
>>>>> and use it to create a PDFRenderer and store it in a field. Then override 
>>>>> the processPage method and use the previously created PDFRenderer to 
>>>>> render the current page to a buffered image and perform OCR on the image. 
>>>>> Once you have the OCR text + positions, instead of calling processStream 
>>>>> you can call processTextPosition once for each character + position.
>>>> 
>>>> Let's see how well it works and then re-evaluate.
>>>> 
>>>> -- John
>>> 
>>> 
>>> 
>>> --
>>> Regards
>>> 
>>> W.Dimuthu Upeksha
>>> Undergraduate
>>> 
>>> Department of Computer Science And Engineering
>>> 
>>> University of Moratuwa, Sri Lanka
