Extracting text wrapped by PDAnnotationLink

Navendu Garg Wed, 16 Sep 2009 08:07:15 -0700

Hi,

I am having trouble extracting the text covered by a
PDAnnotationLink. First, a little background:


I am using the PDFTextStripper class to extract the bounding box of
each character on the page. Then I get the rectangle from the
PDAnnotationLink instance. Finally, I traverse the list of characters
and check which of them lie inside the bounding rectangle of the
link. This works fine in most cases, but it fails in two scenarios:

a) The link text breaks at the end of one line and continues on the
next line. The bounding rectangle then covers the entire text of both
lines, and as a result my algorithm fails.
b) Sometimes a character's bounding rectangle lies partly outside the
bounding rectangle of the link, even though the character visibly
appears to be inside the link. As a result, I am unable to select
those characters.
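
For reference, here is a minimal sketch of a looser variant of the
test described above. It is not PDFBox code. It assumes the character
boxes and the link rectangle have already been converted to
java.awt.geom.Rectangle2D in the same coordinate space (same origin
and same y direction). The class name LinkTextSketch, the methods
overlapFraction and textInside, and the minOverlap threshold are all
hypothetical names made up for this illustration. Instead of requiring
a character box to lie fully inside the link, it accepts a character
when a chosen fraction of its box overlaps the link, which may help
with scenario (b).

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox. All rectangles are assumed to
// share one coordinate space before they are passed in.
class LinkTextSketch {

    // Fraction (0.0 to 1.0) of the character box area that lies inside the link box.
    static double overlapFraction(Rectangle2D ch, Rectangle2D link) {
        if (ch.isEmpty()) {
            return 0.0;
        }
        Rectangle2D inter = ch.createIntersection(link);
        if (inter.isEmpty()) {
            return 0.0; // boxes do not overlap at all
        }
        double interArea = inter.getWidth() * inter.getHeight();
        double charArea = ch.getWidth() * ch.getHeight();
        return interArea / charArea;
    }

    // Concatenates, in list order, every character whose box overlaps at least
    // one link region by minOverlap or more. Passing several regions (one per
    // line of a wrapped link) is an assumption about what data is available.
    static String textInside(List<String> chars, List<Rectangle2D> boxes,
                             List<Rectangle2D> linkRegions, double minOverlap) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chars.size(); i++) {
            for (Rectangle2D region : linkRegions) {
                if (overlapFraction(boxes.get(i), region) >= minOverlap) {
                    sb.append(chars.get(i));
                    break; // count each character once
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> chars = Arrays.asList("L", "i", "x");
        List<Rectangle2D> boxes = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(10, 10, 5, 10),  // sticks out slightly on the left
                new Rectangle2D.Double(15, 10, 3, 10),  // fully inside the link
                new Rectangle2D.Double(40, 10, 5, 10)); // well outside the link
        List<Rectangle2D> link = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(10.5, 9, 9, 12));
        System.out.println(textInside(chars, boxes, link, 0.5));
    }
}
```

With a threshold of 0.5, the "L" in this made-up data is accepted even
though a strict containment check would reject it. The overlap test on
its own does nothing for scenario (a). That case would need one
rectangle per line of the link, for example derived from the
annotation's QuadPoints entry when the file provides one, passed in as
separate link regions.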

Does anyone have a better idea about how to approach this problem?

thanks,

Navendu
