Re: keyword-in-content for PDF document

Alexandre Rafalovitch Thu, 13 Apr 2017 08:38:13 -0700

With great difficulty. PDF does not usually preserve the text flow;
instead, it uses absolute positioning for text fragments. Extraction will
try to approximate the right thing, but it is only an approximation. And
if you have two columns, it is harder still. Some documents may have an
accessibility layer, which would help.

I'd start by running Tika (or the extract handler with extractOnly=true)
on the documents you have and seeing what comes out. See
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
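
As a rough sketch of that first step, something like the Python below
would send one PDF to the extract handler with extractOnly=true and hand
back whatever Solr returns, without indexing anything. The base URL, the
core name "mycore" and the helper names are my assumptions for
illustration, so adjust them to your setup and inspect the response
rather than trusting any particular structure.

```python
import json
import urllib.parse
import urllib.request


def extract_only_url(solr_base, core, extract_format="text"):
    """Build the extract-handler URL for a dry run (nothing gets indexed)."""
    params = urllib.parse.urlencode({
        "extractOnly": "true",            # return the extracted content only
        "extractFormat": extract_format,  # "text" or "xml"
        "wt": "json",
    })
    return f"{solr_base.rstrip('/')}/{core}/update/extract?{params}"


def extract_text(pdf_path, solr_base="http://localhost:8983/solr", core="mycore"):
    """POST the raw PDF bytes to Solr and return the parsed JSON response.

    solr_base and core are placeholder values, not anything from this thread.
    """
    with open(pdf_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        extract_only_url(solr_base, core),
        data=body,
        headers={"Content-Type": "application/pdf"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Printing the result for a handful of your real documents (especially
multi-column ones) should show quickly how usable the text order is.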

Then you have to figure out whether you are searching for just a word or
across sentence boundaries. You could probably (somehow) split on
sentence boundaries if you want to store each sentence as a value in a
multivalued field. Or you could try using the highlighter to return only
the matching sentence.
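
To make the "split on sentence boundary" idea concrete, here is a
deliberately naive Python sketch that could run on the extracted text
before indexing (or just to answer the question directly). The function
names are made up for this example, and the regex-based split is an
assumption of mine, not anything Solr or Tika does for you.

```python
import re


def split_sentences(text):
    """Naively split extracted text into sentences."""
    # PDF extraction tends to leave hard line breaks mid-sentence,
    # so collapse all whitespace runs into single spaces first.
    flat = " ".join(text.split())
    # Break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def sentences_with(text, word):
    """Return the sentences containing `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in split_sentences(text) if pattern.search(s)]


sample = "Revenue growth was strong.\nCosts fell. Growth in Asia\nslowed! No change here."
print(sentences_with(sample, "growth"))
# ['Revenue growth was strong.', 'Growth in Asia slowed!']
```

Each string from split_sentences() could become one value of the
multivalued field. Note that the whole-word match will not find
"growing" or "growths"; Solr-side analysis (stemming) would handle that
differently.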

Of course, defining the sentence boundary is a lot trickier than it
seems at first... (e.g. "He works for B.B.C.")
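
A small Python illustration of why, using two regex heuristics I picked
for the example (neither is a recommended splitter): breaking after every
period cuts the sentence at the abbreviation, and requiring a capital
letter after the break only moves the problem.

```python
import re


def naive_split(text):
    """Break after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def split_before_capital(text):
    """Same, but only when the next word starts with an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


print(naive_split("He works for B.B.C. in London."))
# ['He works for B.B.C.', 'in London.']   <- wrong: one sentence cut in two

print(split_before_capital("He works for B.B.C. in London."))
# ['He works for B.B.C. in London.']      <- fixed for this input

print(split_before_capital("He works for B.B.C. News in London."))
# ['He works for B.B.C.', 'News in London.']   <- wrong again
```

And "He works for B.B.C. She likes it." really is two sentences, so no
rule based only on the neighbouring characters gets every case right.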

Regards,
   Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced

On 13 April 2017 at 15:54, ankur <ankur.sancheti.netw...@gmail.com> wrote:
> If I am searching for the word "growth" in a PDF, I want to output all
> the sentences with the word "growth" in them.
>
> How can that be done?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/keyword-in-content-for-PDF-document-tp4329754.html
> Sent from the Solr - User mailing list archive at Nabble.com.
