> On 21 Apr 2015, at 13:21, Hesham G. <heshamgne...@gmail.com> wrote:
>
> Frank,
>
> Thanks for explaining this.
>
> What I am trying to do is reading sentences from the PDF using TextPosition.
> Your explanation is clear and I can detect the new line using X & Y, but what
> if a sentence is written on 2 lines? ... Reading the Y-coordinate for the
> second line will result with dealing with it as a new sentence instead of
> considering it a completion for the first line of the sentence.

Could you just take the output of PDFToText as a text file and then run it
through an NLP sentence segmenter? Or is there some special case which you're
trying to handle?

> Best regards,
> Hesham
>
> ------------------------------------------------------------------------
> Included message:
>
> Hi Hesham,
>
> There is no newline character in a PDF. Only printable characters are
> saved, each with its X and Y coordinates.
> If you sort the TextPositions by Y and X, you can detect 'newlines' by
> finding an increase in Y and a decrease in X. However, this isn't
> foolproof, since things like subscripts and superscripts are out of order
> when sorted by Y. Where there are multiple columns, this won't work.
>
> Frank
>
>
>> On Wed, Apr 22, 2015 at 7:33 AM, Hesham G. <heshamgne...@gmail.com> wrote:
>>
>> Hello,
>>
>> When reading PDF text using TextPosition, is there a way to know if the
>> current character is a new line character?
>>
>> protected void processTextPosition( TextPosition text ) {
>>     // Prints space if this is a new line character in the PDF file.
>>     System.out.println( text.getCharacter() );
>> }
>>
>>
>> Best regards,
>> Hesham
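
For what it's worth, the two ideas in this thread (sort by Y and X to recover lines, then let a sentence segmenter decide where sentences end) might be combined roughly as in the sketch below. It does not call PDFBox at all: Glyph is a hypothetical stand-in for TextPosition, the class and method names are made up for illustration, Y is assumed to grow down the page, and java.text.BreakIterator stands in for a proper NLP segmenter. It only covers single-column text; multiple columns and sub/superscripts would need more work, as Frank noted.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class SentenceSketch {

    // Hypothetical stand-in for a PDFBox TextPosition: a piece of text plus its coordinates.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Group glyphs into lines: sort by Y, start a new line when Y exceeds the Y of the
    // current line's first glyph by more than yTolerance, then order each line by X.
    static List<String> toLines(List<Glyph> glyphs, float yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<List<Glyph>> rows = new ArrayList<>();
        float rowY = 0f;
        for (Glyph g : sorted) {
            if (rows.isEmpty() || g.y - rowY > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = g.y;
            }
            rows.get(rows.size() - 1).add(g);
        }

        List<String> lines = new ArrayList<>();
        for (List<Glyph> row : rows) {
            row.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : row) {
                sb.append(g.text);
            }
            lines.add(sb.toString().trim());
        }
        return lines;
    }

    // Join the visual lines with spaces so a wrapped sentence becomes continuous text,
    // then let BreakIterator (not the line breaks) decide where sentences end.
    static List<String> toSentences(List<String> lines) {
        String text = String.join(" ", lines);
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Two visual lines; the first sentence wraps onto the second line.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("onto two lines. ", 10f, 112f),
                new Glyph("This sentence ", 10f, 100f),
                new Glyph("Next one.", 90f, 112f),
                new Glyph("wraps", 80f, 100f));
        for (String sentence : toSentences(toLines(glyphs, 2f))) {
            System.out.println(sentence);
        }
    }
}
```

The point of the design is that a change in Y only ends a visual line, never a sentence, so a sentence that wraps onto the next line is rejoined before segmentation. In a real PDFBox subclass the Glyph list would be filled from processTextPosition, and the tolerance would have to be tuned to the font size.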