Hello,

I commented on the gist. You have to call setSortByPosition(true) in the
constructor right after super(). Without it, the text comes out in the
order it was processed, which is what you are seeing. Be careful with your
coordinate system, too. When you call textPosition1.getY() you get 792,
not 0. I don't remember exactly where, but there is a class that uses the
lower left corner of the page as the origin (0,0), not the upper left
corner as would be natural.
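
To make the two points above concrete, here is a minimal, self-contained
sketch. It does not use PDFBox at all: Glyph, toTopLeftY and
readInPositionOrder are hypothetical names made up for illustration, and
the page height of 792 is only assumed from the value you are seeing. It
shows the idea of flipping a y measured from the bottom edge into one
measured from the top edge, and then ordering by position instead of by
input order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortByPositionSketch {

    // Hypothetical stand-in for a text position; NOT PDFBox's TextPosition.
    static final class Glyph {
        final String text;
        final float x;
        final float yFromBottom; // y measured up from the lower left corner

        Glyph(String text, float x, float yFromBottom) {
            this.text = text;
            this.x = x;
            this.yFromBottom = yFromBottom;
        }
    }

    // Convert a y measured from the bottom edge into a y measured
    // from the top edge of a page with the given height.
    static float toTopLeftY(float yFromBottom, float pageHeight) {
        return pageHeight - yFromBottom;
    }

    // Order glyphs top to bottom, then left to right, ignoring the
    // order in which they were fed in, and join their text.
    static String readInPositionOrder(List<Glyph> glyphs, float pageHeight) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator
                .comparingDouble((Glyph g) -> toTopLeftY(g.yFromBottom, pageHeight))
                .thenComparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // assumed page height
        List<Glyph> fed = new ArrayList<>();
        fed.add(new Glyph("W", 32f, 792f)); // fed first, but further right
        fed.add(new Glyph("H", 0f, 792f));
        System.out.println(readInPositionOrder(fed, pageHeight)); // prints "HW"
    }
}
```

With a bottom-left origin, a glyph at the very top of the page has
y = 792, and toTopLeftY turns that into 0, which is the value you were
probably expecting.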

I hope that helps.

Alin

PS Is the OCR going to be pure Java, or will you be writing it in another
language and using native calls?


On Sat, May 17, 2014 at 8:13 AM, DImuthu Upeksha <dimuthu.upeks...@gmail.com
> wrote:

> Hi Alin,
>
> You can find my source code here:
> https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
> As you can see I set
> X-offset : 0 and Y-offset : 0 for "H"
> X-offset : 32 and Y-offset : 0 for "W"
> in the text matrices. Is that enough? Is there another way to set the
> X,Y co-ordinates?
>
>
> On Sat, May 17, 2014 at 12:18 PM, Alin Mazilu <impet...@gmail.com> wrote:
> > What are the x and y coordinates of H and W?
> >
> > Alin Mazilu
> > SKE GlobalTech, LLC
> > 3250 West Market St. Suite 307D
> > Fairlawn, OH 44333
> >
> > Sent from my Galaxy S3
> > On May 17, 2014 2:42 AM, "DImuthu Upeksha" <dimuthu.upeks...@gmail.com>
> > wrote:
> >
> >> Hi all,
> >>
> >> I was trying to manually feed text position objects to the
> >> processTextPosition method in the PDFTextStripper class. I created a
> >> subclass of PDFTextStripper and overrode the processStream method. In
> >> processStream I manually created two text position objects for the
> >> words "W" and "H". At the end I passed them to processTextPosition:
> >>
> >> processTextPosition(textPosition1);
> >> processTextPosition(textPosition2);
> >>
> >> Then I tested it using
> >>
> >> PDFTextStripper ocrStripper = new PDFOCRTextStripper();
> >> PDDocument document = PDDocument.load("some pdf file");
> >> String data = ocrStripper.getText(document);
> >> System.out.println(data);
> >>
> >> Output was : H W
> >>
> >> Then I changed the sequence of passing TextPosition objects in [1]
> >>
> >> processTextPosition(textPosition2);
> >> processTextPosition(textPosition1);
> >>
> >> Output was : WH
> >>
> >> ------------------------------
> >>
> >> As far as I understand, processTextPosition works with the text
> >> position metadata, such as the x and y co-ordinates of the input
> >> text. It should not depend on the order of the input sequence. But in
> >> this case it seems that processTextPosition works according to the
> >> order of input.
> >> Ex. If I input W first, it prints W first without considering its
> >> actual position.
> >>
> >> Is this the normal behaviour? Or am I missing something here?
> >>
> >> [1] https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
> >> --
> >> Regards
> >>
> >> W.Dimuthu Upeksha
> >> Undergraduate
> >>
> >> Department of Computer Science And Engineering
> >>
> >> University of Moratuwa, Sri Lanka
> >>
>
>
>
> --
> Regards
>
> W.Dimuthu Upeksha
> Undergraduate
>
> Department of Computer Science And Engineering
>
> University of Moratuwa, Sri Lanka
>
