Yes, as Alin says, the y-axis in PDF uses y=0 as the bottom of the page, instead of the top as is usually the case in Java. PDFBox uses both styles of coordinates internally at various points.
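As a rough illustration of the flip (not PDFBox code; the class and method names below are made up for this sketch), converting between the two conventions is just a subtraction from the page height. The 792 that Alin mentions is what you would expect on a US Letter page, which is 792 points tall: a glyph placed at y=0 in PDF space comes back as 792 when measured from the top.

```java
// Minimal sketch, assuming the page height is known in points.
// PageCoords and flipY are hypothetical names, not part of the PDFBox API.
final class PageCoords {
    private PageCoords() {}

    /**
     * Flips a y value between a bottom-left origin (PDF user space, y grows up)
     * and a top-left origin (usual Java convention, y grows down).
     * The same formula works in both directions.
     */
    static float flipY(float y, float pageHeight) {
        return pageHeight - y;
    }
}
```

For example, PageCoords.flipY(0f, 792f) gives 792, and applying it a second time gives back 0.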
-- John

On 17 May 2014, at 11:45, DImuthu Upeksha <[email protected]> wrote:

> Hi Alin,
> Thank you. It helped me a lot. I'll look into that further.
>
> About OCR.
> I use Tesseract C library to do OCR and I have written some native
> calls to communicate with Tesseract API. [2]
>
> [2] https://github.com/DImuthuUpe/Tesseract-API
>
> On Sat, May 17, 2014 at 10:43 PM, Alin Mazilu <[email protected]> wrote:
>> Hello,
>>
>> I commented on the gist. You have to use setSortByPosition(true) in the
>> constructor right after super(). Be careful with your coordinate system.
>> When you do textPosition1.getY() you get 792 not 0. I don't remember
>> exactly where, but there is a class that uses the lower left corner of the
>> page as the origin (0,0), not the upper left corner as it is natural.
>>
>> I hope that helps.
>>
>> Alin
>>
>> PS Is the OCR going to be pure Java or will you be writing it in other
>> language and use native calls?
>>
>>
>> On Sat, May 17, 2014 at 8:13 AM, DImuthu Upeksha <[email protected]> wrote:
>>
>>> Hi Alin,
>>>
>>> You can find my source code from here
>>> https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>>> As you can see I set
>>> X-offset : 0 and Y-offset : 0 for "H"
>>> X-offset : 32 and Y-offset : 0 for "W"
>>> in Text Matrices. Is that enough? Is there other way to set X,Y
>>> co-ordinates?
>>>
>>>
>>> On Sat, May 17, 2014 at 12:18 PM, Alin Mazilu <[email protected]> wrote:
>>>> What are the x and y coordinates of H and W?
>>>>
>>>> Alin Mazilu
>>>> SKE GlobalTech, LLC
>>>> 3250 West Market St. Suite 307D
>>>> Fairlawn, OH 44333
>>>>
>>>> Sent from my Galaxy S3
>>>> On May 17, 2014 2:42 AM, "DImuthu Upeksha" <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I was tying to manually feed text position objects to
>>>>> processTextPosition method in PDFTextStripper class. I created a sub
>>>>> class of PDFTextStripper and override processStream method. In
>>>>> processStream method I manually created two text position objects for
>>>>> words "W" and "H". At the end I passed them to processTextPosition
>>>>>
>>>>> processTextPosition(textPosition1);
>>>>> processTextPosition(textPosition2);
>>>>>
>>>>> Then I tested it using
>>>>>
>>>>> PDFTextStripper ocrStripper = new PDFOCRTextStripper();
>>>>> PDDocument document = PDDocument.load("some pdf file");
>>>>> String data = ocrStripper.getText(document);
>>>>> System.out.println(data);
>>>>>
>>>>> Output was : H W
>>>>>
>>>>> Then I changed the sequence of passing TextPosition objects in [1]
>>>>>
>>>>> processTextPosition(textPosition2);
>>>>> processTextPosition(textPosition1);
>>>>>
>>>>> Output was : WH
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> As far as I understood processTextPosition works with the text
>>>>> position metadata like x and y co-ordinates of the input text. It
>>>>> should not depend on the order of the input sequence. But in case It
>>>>> seems like processTextPosition method works according to order of
>>>>> input.
>>>>> Ex. If I input W first, it prints W first without considering it's
>>>>> actual position.
>>>>>
>>>>> Is this the normal behaviour? Or am I missing something here?
>>>>>
>>>>> [1] https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>>>>> --
>>>>> Regards
>>>>>
>>>>> W.Dimuthu Upeksha
>>>>> Undergraduate
>>>>>
>>>>> Department of Computer Science And Engineering
>>>>>
>>>>> University of Moratuwa, Sri Lanka
>>>>>
>>>
>>>
>>>
>>> --
>>> Regards
>>>
>>> W.Dimuthu Upeksha
>>> Undergraduate
>>>
>>> Department of Computer Science And Engineering
>>>
>>> University of Moratuwa, Sri Lanka
>>>
>
>
>
> --
> Regards
>
> W.Dimuthu Upeksha
> Undergraduate
>
> Department of Computer Science And Engineering
>
> University of Moratuwa, Sri Lanka
