> On 21 Apr 2015, at 13:21, Hesham G. <heshamgne...@gmail.com> wrote:
> 
> Frank ,
> 
> Thanks for explaining this. 
> 
> What I am trying to do is read sentences from the PDF using TextPosition.
> Your explanation is clear and I can detect a new line using X and Y, but what
> if a sentence is written on two lines? Reading the Y-coordinate of the
> second line will cause it to be treated as a new sentence instead of
> a continuation of the first line of the sentence.

Could you just take the output of PDFToText as a text file and then run it
through an NLP sentence segmenter? Or is there some special case which you're
trying to handle?
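
As a rough sketch of that idea, assuming the extracted text is already
available as a plain String: collapse the line breaks first, then let a
sentence segmenter find the boundaries. The JDK's java.text.BreakIterator
stands in here for a real NLP segmenter, and the class and method names
(SentenceSplitter, sentences) are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    // Joins hard-wrapped lines into one string, then splits it on sentence boundaries.
    static List<String> sentences(String extractedText) {
        // Treat every run of whitespace (including line breaks) as a single space,
        // so a sentence wrapped over two lines becomes one continuous piece of text.
        String flat = extractedText.replaceAll("\\s+", " ").trim();

        // Assumption: English text. A dedicated NLP segmenter could replace this.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(flat);

        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = flat.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "This sentence is written\non two lines. This one is not.";
        for (String s : sentences(text)) {
            System.out.println(s);
        }
    }
}
```

This only works when the line breaks carry no meaning of their own. Headings,
list items and table cells usually end without punctuation, so they would be
merged into the following sentence.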

> Best regards ,
> Hesham
> 
> ------------------------------------------------------------------------
> Included message :
> 
> Hi Hesham,
> 
> There is no newline character in a PDF. Only printable characters are
> saved, each with its X and Y coordinates.
> If you sort the TextPositions by Y and X, you can detect 'newlines' by
> finding an increase in Y and a decrease in X. However, this isn't
> foolproof, since things like subscripts and superscripts are out of order
> when sorted by Y. Where there are multiple columns, this won't work.
> 
> Frank
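
A minimal sketch of the coordinate check Frank describes, under some
assumptions: Glyph is a hypothetical stand-in for the X, Y and character
values one would read from each TextPosition, the glyphs are already in
reading order, and the yTolerance parameter is an addition (not part of
Frank's description) so that small vertical shifts do not start a new line.
It is not PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

public class LineGrouper {

    // Hypothetical holder for the values read from one TextPosition.
    static final class Glyph {
        final float x;
        final float y;
        final String text;

        Glyph(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Groups glyphs into lines: a new line starts when Y increases by more
    // than yTolerance AND X decreases relative to the previous glyph.
    static List<String> lines(List<Glyph> glyphs, float yTolerance) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;

        for (Glyph g : glyphs) {
            boolean newLine = previous != null
                    && g.y - previous.y > yTolerance
                    && g.x < previous.x;
            if (newLine) {
                result.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            previous = g;
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph(10f, 100f, "H"));
        glyphs.add(new Glyph(16f, 100f, "i"));
        glyphs.add(new Glyph(10f, 112f, "y"));
        glyphs.add(new Glyph(16f, 112f, "o"));
        System.out.println(lines(glyphs, 2f)); // two lines: "Hi" and "yo"
    }
}
```

As noted above, this check does not cope with multi-column layouts, and a
tolerance only partly helps with subscripts and superscripts.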
> 
> 
>> On Wed, Apr 22, 2015 at 7:33 AM, Hesham G. <heshamgne...@gmail.com> wrote:
>> 
>> Hello ,
>> 
>> When reading PDF text using TextPosition, is there a way to know if the
>> current character is a newline character?
>> 
>> protected void processTextPosition( TextPosition text )  {
>>    // Prints a space if this is a newline character in the PDF file.
>>    System.out.println( text.getCharacter() );
>> }
>> 
>> 
>> Best regards ,
>> Hesham

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
