Yes, as Alin says, the y-axis in PDF uses y=0 as the bottom of the page,
instead of the top as is usually the case in Java. PDFBox uses both styles
of coordinates internally at various points.
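
As a rough illustration of the flip between the two conventions (this is
not PDFBox code; the class and method names are made up for this sketch,
and the 792-point page height assumes a US Letter page, which matches the
792 that Alin mentions):

```java
// Hypothetical helper, not part of PDFBox: converts a y value between
// bottom-left-origin (PDF) and top-left-origin (typical Java) coordinates.
class PageCoordinates {

    // Assumed page height in PDF points (US Letter).
    static final float LETTER_HEIGHT = 792f;

    // Mirrors y across the horizontal centre line of the page.
    static float flipY(float y, float pageHeight) {
        return pageHeight - y;
    }

    public static void main(String[] args) {
        // A y offset of 0 in PDF space sits on the bottom edge of the page.
        System.out.println(flipY(0f, LETTER_HEIGHT));
        // The top edge in PDF space becomes 0 in top-left-origin space.
        System.out.println(flipY(LETTER_HEIGHT, LETTER_HEIGHT));
    }
}
```

Applying the flip twice gives back the original value, so the same helper
works in either direction.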

-- John

On 17 May 2014, at 11:45, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:

> Hi Alin,
> Thank you. It helped me a lot. I'll look into that further.
> 
> About OCR.
> I use the Tesseract C library to do OCR and I have written some native
> calls to communicate with the Tesseract API. [2]
> 
> [2] https://github.com/DImuthuUpe/Tesseract-API
> 
> On Sat, May 17, 2014 at 10:43 PM, Alin Mazilu <impet...@gmail.com> wrote:
>> Hello,
>> 
>> I commented on the gist. You have to use setSortByPosition(true) in the
>> constructor right after super(). Be careful with your coordinate system.
>> When you do textPosition1.getY() you get 792 not 0. I don't remember
>> exactly where, but there is a class that uses the lower left corner of the
>> page as the origin (0,0), not the upper left corner as it is natural.
>> 
>> I hope that helps.
>> 
>> Alin
>> 
>> PS Is the OCR going to be pure Java, or will you be writing it in another
>> language and using native calls?
>> 
>> 
>> On Sat, May 17, 2014 at 8:13 AM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>> 
>>> Hi Alin,
>>> 
>>> You can find my source code from here
>>> https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>>> As you can see I set
>>> X-offset : 0 and Y-offset : 0 for "H"
>>> X-offset : 32 and Y-offset : 0 for "W"
>>> in Text Matrices. Is that enough? Is there another way to set the X,Y
>>> co-ordinates?
>>> 
>>> 
>>> On Sat, May 17, 2014 at 12:18 PM, Alin Mazilu <impet...@gmail.com> wrote:
>>>> What are the x and y coordinates of H and W?
>>>> 
>>>> Alin Mazilu
>>>> SKE GlobalTech, LLC
>>>> 3250 West Market St. Suite 307D
>>>> Fairlawn, OH 44333
>>>> 
>>>> Sent from my Galaxy S3
>>>> On May 17, 2014 2:42 AM, "DImuthu Upeksha" <dimuthu.upeks...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I was trying to manually feed text position objects to the
>>>>> processTextPosition method in the PDFTextStripper class. I created a
>>>>> subclass of PDFTextStripper and overrode the processStream method. In
>>>>> processStream I manually created two text position objects for the
>>>>> words "W" and "H". At the end I passed them to processTextPosition:
>>>>> 
>>>>> processTextPosition(textPosition1);
>>>>> processTextPosition(textPosition2);
>>>>> 
>>>>> Then I tested it using
>>>>> 
>>>>> PDFTextStripper ocrStripper = new PDFOCRTextStripper();
>>>>> PDDocument document = PDDocument.load("some pdf file");
>>>>> String data = ocrStripper.getText(document);
>>>>> System.out.println(data);
>>>>> 
>>>>> Output was : H W
>>>>> 
>>>>> Then I changed the sequence of passing TextPosition objects in [1]
>>>>> 
>>>>> processTextPosition(textPosition2);
>>>>> processTextPosition(textPosition1);
>>>>> 
>>>>> Output was : WH
>>>>> 
>>>>> ------------------------------
>>>>> 
>>>>> As far as I understood, processTextPosition works with the text
>>>>> position metadata, like the x and y co-ordinates of the input text. It
>>>>> should not depend on the order of the input sequence. But in this case
>>>>> it seems like the processTextPosition method works according to the
>>>>> order of input.
>>>>> E.g. if I input W first, it prints W first without considering its
>>>>> actual position.
>>>>> 
>>>>> Is this the normal behaviour? Or am I missing something here?
>>>>> 
>>>>> [1] https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>>>>> --
>>>>> Regards
>>>>> 
>>>>> W.Dimuthu Upeksha
>>>>> Undergraduate
>>>>> 
>>>>> Department of Computer Science And Engineering
>>>>> 
>>>>> University of Moratuwa, Sri Lanka
>>>>> 
