On Thu, Nov 12, 2015 at 9:28 AM, Matt Lundin <m...@imapmail.org> wrote:

> Ramon Diaz-Uriarte <rdia...@gmail.com> writes:
> >
> > I'll do. In the meantime, I think this is a limitation coming from
> > poppler. Other people have mentioned similar things (e.g.,
> > http://coda.caseykuhlman.com/entries/2014/pdf-extract.html) and using other
> > tools that depend on poppler (such as Leela:
> > https://github.com/TrilbyWhite/Leela) also will not give us the text
> > itself.
>
> I don't think this is a limitation of poppler so much as the way that
> pdf annotations work. Typically, the subject/text field is not populated
> by the text of the highlighted region. Rather, a highlight annotation
> specifies bounds, color, style, etc. Basically what Repligo does (I
> wouldn't recommend using it, as it is closed source and severely out of
> date) is to grab the text *at the time of highlighting* and add it to
> the notes field. I don't know of any other annotation tool that does the
> same thing. Applications built on poppler could do it, though they
> currently do not.
>
> For extracting the text of highlighted regions *after the fact*, I've
> had good luck with this script that relies on the pdf-reader gem for
> ruby:
>
> https://gist.github.com/danlucraft/5277732
>
This looks interesting. It searches for file "./markup_receiver", but
doesn't provide that file, which does not appear to be a gem. Any hints?

With politza's help I am getting close to being able to extract annotation
text from within pdf-tools, but I am not quite there yet.
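
For reference, the after-the-fact approach can be sketched without any PDF
library. Since a highlight annotation stores only geometry, recovering its
text comes down to intersecting that geometry with the positions of the words
on the page. Below is a minimal sketch in Python, assuming the word boxes and
the highlight rectangles have already been obtained by some other means. The
names (Box, highlighted_text) and the sample coordinates are hypothetical
illustrations, not output from poppler, pdf-tools, or the pdf-reader gem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains_point(self, x, y):
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def center(box):
    """Return the midpoint of a box as an (x, y) pair."""
    return ((box.x0 + box.x1) / 2, (box.y0 + box.y1) / 2)


def highlighted_text(words, highlights):
    """Join the words whose center falls inside any highlight rectangle.

    words:      list of (text, Box) pairs in reading order
    highlights: list of Box, one per highlighted region
    """
    picked = [
        text
        for text, box in words
        if any(h.contains_point(*center(box)) for h in highlights)
    ]
    return " ".join(picked)


# Made-up page content: one line of five words with their bounding boxes.
words = [
    ("Annotations", Box(72, 700, 140, 712)),
    ("store", Box(144, 700, 172, 712)),
    ("geometry,", Box(176, 700, 230, 712)),
    ("not", Box(234, 700, 252, 712)),
    ("text.", Box(256, 700, 282, 712)),
]

# Made-up highlight covering only the second and third words.
highlight = [Box(142, 698, 232, 714)]

print(highlighted_text(words, highlight))
```

Testing the word's center rather than requiring full containment is a
deliberate simplification: highlight bounds rarely line up exactly with word
boxes, so a stricter test would tend to drop the first or last word.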


> Matt
>
