Hi Dimuthu

Each line of text is handled by the processEncodedText method in PDFStreamEngine, which calls processTextPosition once for each character. The processTextPosition method in PDFTextStripper collects the text positions into lines, paragraphs and columns (also called “articles”). Text on a PDF page does not have to be drawn in order, so text at any position can occur at any time, and processTextPosition will sort the text and insert it into the relevant line/paragraph/column.
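To illustrate the idea of characters arriving in any order and still ending up in the right line, here is a minimal, self-contained sketch. It is not PDFBox's actual algorithm and it does not use the PDFBox API: the Glyph class is a hypothetical stand-in for TextPosition, and the baseline tolerance is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified illustration only; this is NOT how PDFBox implements it.
class LineGroupingSketch {

    // Hypothetical stand-in for a TextPosition: one character and its origin.
    static final class Glyph {
        final String text;
        final float x;
        final float y; // baseline position, increasing down the page

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Groups glyphs whose baselines lie within `tolerance` of the first glyph
    // of a line, then orders each line from left to right.
    static List<String> toLines(List<Glyph> glyphs, float tolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - g.y) <= tolerance) {
                last.add(g); // same baseline, same line
            } else {
                List<Glyph> line = new ArrayList<>();
                line.add(g);
                lines.add(line); // baseline moved too far, start a new line
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : line) {
                sb.append(g.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Characters supplied deliberately out of reading order.
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("b", 20f, 100f));
        glyphs.add(new Glyph("d", 10f, 120f));
        glyphs.add(new Glyph("a", 10f, 100.5f));
        glyphs.add(new Glyph("c", 30f, 100f));
        System.out.println(toLines(glyphs, 2f)); // prints [abc, d]
    }
}
```

The real method also deals with paragraphs, columns and overlapping duplicates, so treat this only as a picture of why the order of the calls does not matter.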
To make your words with bounding boxes compatible with processTextPosition, you should convert each character in the word into a TextPosition and then you can call processTextPosition.

Thanks

-- John

On 24 Mar 2014, at 01:32, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:

> Hi John,
>
> I looked at processTextPosition method in PDFTextStripper. But I
> couldn't understand actual process happening inside the method. What
> should be the input for that method? In my case I have words with
> bounding box's coordinates. How can I make those data to compatible
> with the input of processTextPosition method. As well, what is the
> output of the method?
>
> Thanks
> Dimuthu
>
> On Wed, Mar 19, 2014 at 11:19 PM, John Hewson <j...@jahewson.com> wrote:
>> Hi Dimuthu
>>
>>> 1 Print those data into PDDocument again and pass through TextStripper
>>> of PDFBox. This could reduce the performance of overall process.
>>
>> This was what I had in mind, but rather than printing the text into the PDDocument
>> you can inject it directly into PDFTextStripper as TextPosition instances. I mentioned
>> something like this a while ago:
>>
>>> You could subclass PDFTextStripper and override the startDocument method
>>> and use it to create a PDFRenderer and store it in a field. Then override
>>> the processPage method and use the previously created PDFRenderer to render
>>> the current page to a buffered image and perform OCR on the image. Once you
>>> have the OCR text + positions, instead of calling processStream you can
>>> call processTextPosition once for each character + position.
>>
>> Let's see how well it works and then re-evaluate.
>>
>> -- John
>>
>
> --
> Regards
>
> W.Dimuthu Upeksha
> Undergraduate
>
> Department of Computer Science And Engineering
>
> University of Moratuwa, Sri Lanka