A sentence could also end with a question mark, an exclamation mark, etc. I think there will be many cases to handle.
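To make those cases concrete, here is a minimal sketch of the joining heuristic discussed in this thread: a line is appended to the current sentence unless the previous line ends with '.', '?' or '!', or the vertical gap is clearly larger than one line. It does not use PDFBox at all. `SentenceJoiner`, `Line`, and the 1.5 gap factor are hypothetical names and values chosen for illustration; in practice the `y` and `height` fields would presumably be filled from the TextPosition objects of each line.

```java
import java.util.ArrayList;
import java.util.List;

final class SentenceJoiner {

    /** Hypothetical stand-in for one extracted line: its text, baseline Y and character height. */
    static final class Line {
        final String text;
        final float y;      // assumed to grow downward on the page
        final float height; // character height of the line's font

        Line(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /** True if the trimmed text ends with '.', '?' or '!'. */
    static boolean endsSentence(String s) {
        String t = s.trim();
        if (t.isEmpty()) {
            return false;
        }
        char last = t.charAt(t.length() - 1);
        return last == '.' || last == '?' || last == '!';
    }

    /** Joins lines (already sorted top to bottom) into sentences, splitting only at line boundaries. */
    static List<String> joinIntoSentences(List<Line> lines) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line prev = null;
        for (Line line : lines) {
            if (prev != null) {
                // Assumption: a gap above 1.5 character heights means a new block of text.
                boolean bigGap = (line.y - prev.y) > 1.5f * prev.height;
                if (endsSentence(prev.text) || bigGap) {
                    sentences.add(current.toString());
                    current.setLength(0);
                } else {
                    current.append(' '); // same sentence continued on the next line
                }
            }
            current.append(line.text.trim());
            prev = line;
        }
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<Line> lines = new ArrayList<>();
        lines.add(new Line("What if a sentence is written", 100f, 12f));
        lines.add(new Line("on two lines?", 114f, 12f));
        lines.add(new Line("A sentence here", 128f, 12f));
        lines.add(new Line("Another sentence here", 170f, 12f));
        for (String s : joinIntoSentences(lines)) {
            System.out.println(s);
        }
    }
}
```

This only splits at line boundaries, so a sentence that ends in the middle of a line is not detected, and abbreviations, multiple columns, and superscripts would still need their own handling.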
I also wonder: when reading text from the book using PDFTextStripper, it can read the new line characters, right? TextPosition seems to read the PDF text in a different way.

On Apr 21, 2015 10:40 PM, "Eric Douglas" <[email protected]> wrote:

> A proper sentence ends with a period, so text that is one character height
> below other text is assumed to be tacked onto the same sentence (with a
> space between).
> If you have the font, you know the font size, you should be able to
> calculate one character height.
> If sentences aren't ended with periods, text may be assumed to be a new
> sentence on a new line if it's more than a character height down.
>
> ie
> A sentence here
>
>
> Another sentence here
>
> On Tue, Apr 21, 2015 at 4:21 PM, Hesham G. <[email protected]> wrote:
>
> > Frank ,
> >
> > Thanks for explaining this.
> >
> > What I am trying to do is reading sentences from the PDF using
> > TextPosition. Your explanation is clear and I can detect the new line using
> > X & Y, but what if a sentence is written on 2 lines ? ... Reading the
> > Y-coordinate for the second line will result with dealing with it as a new
> > sentence instead of considering it a completion for the first line of the
> > sentence.
> >
> >
> > Best regards ,
> > Hesham
> >
> > ------------------------------------------------------------------------
> > Included message :
> >
> > Hi Hesham,
> >
> > There is no newline character in a PDF. Only printable characters are
> > saved, each with its X and Y coordinates.
> > If you sort the TextPositions by Y and X, you can detect 'newlines' by
> > finding an increase in Y and a decrease in X. However, this isn't
> > foolproof, since things like subscripts and superscripts are out of order
> > when sorted by Y. Where there are multiple columns, this won't work.
> >
> > Frank
> >
> >
> > On Wed, Apr 22, 2015 at 7:33 AM, Hesham G. <[email protected]> wrote:
> >
> > > Hello ,
> > >
> > > When reading PDF text using TextPosition, is there a way to know if the
> > > current character is a new line character ?
> > >
> > > protected void processTextPosition( TextPosition text ) {
> > >     System.out.println( text.getCharacter() ); // Prints space if this is a new line character in the PDF file.
> > > }
> > >
> > >
> > > Best regards ,
> > > Hesham
> > >

