The NLP sentence segmenter was a really helpful idea.
Thanks a lot, John & Frank.
Best regards,
Hesham
Included message:
What have you got so far? Can you provide sample code to work with?
On Wed, Apr 22, 2015 at 12:02
> On 21 Apr 2015, at 13:21, Hesham G. wrote:
>
> Frank,
>
> Thanks for explaining this.
>
> What I am trying to do is read sentences from the PDF using TextPosition.
> Your explanation is clear and I can detect the new line using X & Y, but what
> if a sentence is written on 2 lines?
What have you got so far? Can you provide sample code to work with?
On Wed, Apr 22, 2015 at 12:02 PM, Hesham G. wrote:
Frank,
I have handled TextPositions using X & Y coordinates, as you suggested,
to detect new lines. It works fine, but if a sentence is written on 2 lines
I can't detect it. If you know a trick to detect that, it would help a lot.
Best regards,
Hesham
On 21.04.2015 at 23:00, Hesham Gneady wrote:
A sentence could also end with a question mark, exclamation mark, etc.
I think there will be many cases to handle.
I also wonder: when reading text from the book using PDFTextStripper, it
can read the newline characters, right? TextPosition seems to be reading
the PDF text in a different wa
A proper sentence ends with a period, so text that is one character height
below other text is assumed to be tacked onto the same sentence (with a
space between).
If you have the font and know the font size, you should be able to
calculate one character height.
If sentences aren't ended with perio
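
The joining rule described above could be sketched roughly as follows. This is only an illustration, not PDFBox API: the `Line` class, the `joinLines` method, and the 1.5x gap factor are assumptions made for the example, Y is assumed to grow downward, and `?` and `!` are treated as sentence endings alongside the period.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SentenceJoinSketch {
    // Hypothetical holder for one extracted line of text and its Y position.
    static final class Line {
        final String text;
        final float y; // assumed to grow downward on the page
        Line(String text, float y) { this.text = text; this.y = y; }
    }

    // True if the text ends with '.', '?' or '!' (a deliberately naive check).
    static boolean endsSentence(String s) {
        String t = s.trim();
        if (t.isEmpty()) return false;
        char last = t.charAt(t.length() - 1);
        return last == '.' || last == '?' || last == '!';
    }

    // Appends a line to the previous one (with a space) when the previous line
    // has no sentence ending and the new line sits about one character height
    // below it. The 1.5 factor is an assumed allowance for line spacing.
    static List<String> joinLines(List<Line> lines, float charHeight) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line prev = null;
        for (Line line : lines) {
            float gap = (prev == null) ? 0f : line.y - prev.y;
            boolean continues = prev != null
                    && !endsSentence(prev.text)
                    && gap > 0f
                    && gap <= 1.5f * charHeight;
            if (prev != null && !continues) {
                out.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append(' ');
            current.append(line.text.trim());
            prev = line;
        }
        if (current.length() > 0) out.add(current.toString());
        return out;
    }

    public static void main(String[] args) {
        List<Line> lines = Arrays.asList(
                new Line("It works fine, but if a sentence", 100f),
                new Line("is written on 2 lines I can't detect it.", 114f),
                new Line("Is there a trick?", 128f));
        for (String s : joinLines(lines, 12f)) {
            System.out.println(s);
        }
    }
}
```

A line that contains a sentence ending in its middle is not split by this sketch; that would need a real sentence segmenter on top of the joined text.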
Frank,
Thanks for explaining this.
What I am trying to do is read sentences from the PDF using TextPosition.
Your explanation is clear and I can detect the new line using X & Y, but what
if a sentence is written on 2 lines? Reading the Y-coordinate for the
second line will result wit
Hi Hesham,
There is no newline character in a PDF. Only printable characters are
saved, each with its X and Y coordinates.
If you sort the TextPositions by Y and X, you can detect 'newlines' by
finding an increase in Y and a decrease in X. However, this isn't
foolproof, since things like subscript
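
As a rough sketch of the sort-and-compare idea above: the `Glyph` class below is a hypothetical stand-in for a positioned character (it is not PDFBox's `TextPosition`), and `toLines` and the `yTolerance` parameter are assumptions for illustration. Y is assumed to grow downward, matching the "increase in Y" described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class LineBreakSketch {
    // Hypothetical stand-in for a positioned character; not PDFBox's TextPosition.
    static final class Glyph {
        final String text;
        final float x;
        final float y; // assumed to grow downward on the page
        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Sorts glyphs by Y, then X, and starts a new line whenever Y increases
    // by more than yTolerance while X decreases.
    static List<String> toLines(List<Glyph> glyphs, float yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                .thenComparingDouble(g -> g.x));
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            boolean newLine = prev != null
                    && g.y - prev.y > yTolerance
                    && g.x < prev.x;
            if (newLine) {
                lines.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            prev = g;
        }
        if (current.length() > 0) lines.add(current.toString());
        return lines;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("H", 10f, 100f), new Glyph("i", 16f, 100f),
                new Glyph("y", 10f, 112f), new Glyph("o", 16f, 112f));
        System.out.println(toLines(glyphs, 2f));
    }
}
```

As noted above, this is not foolproof: a strict sort by Y will misorder characters whose baseline differs slightly from the rest of their line, and a following line that starts further right than the previous line ended is not detected.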