With great difficulty. PDF does not usually preserve the text flow; instead it uses absolute positioning for text fragments. Extraction will try to approximate the right thing, but it is only an approximation. And if you have two columns, it is harder again. Some documents may have an accessibility layer, which would help.
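
To see why absolute positioning makes the text flow hard to recover, here is a small Python sketch. Everything in it is an assumption made up for illustration: the fragment tuples, the coordinates, and the `column_split_x` parameter are hypothetical, and no real extractor is this simple.

```python
# Hypothetical text fragments as a PDF might store them: (x, y, text),
# with y measured from the top of the page. The values are invented
# for illustration; real PDFs are far messier.
fragments = [
    (50, 100, "Revenue growth was"),
    (300, 100, "Costs fell in the"),
    (50, 115, "strong this year."),
    (300, 115, "second quarter."),
]


def naive_reading_order(frags):
    """Order fragments top-to-bottom, then left-to-right."""
    ordered = sorted(frags, key=lambda f: (f[1], f[0]))
    return " ".join(text for _, _, text in ordered)


def column_aware_order(frags, column_split_x):
    """Read the whole left column first, then the right one.

    column_split_x is a hypothetical, hand-picked boundary between columns.
    """
    left = sorted((f for f in frags if f[0] < column_split_x), key=lambda f: f[1])
    right = sorted((f for f in frags if f[0] >= column_split_x), key=lambda f: f[1])
    return " ".join(text for _, _, text in left + right)


naive_text = naive_reading_order(fragments)
column_text = column_aware_order(fragments, column_split_x=200)
print(naive_text)
print(column_text)
```

The naive ordering interleaves the two columns, so sentences get mixed together. The column-aware version only works here because the column boundary was supplied by hand.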

I'd start by running Tika (or the extract handler with extractOnly=true) on the documents you have and seeing what comes out. See
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Then you have to figure out whether you are searching for just a word or for something that spans sentence boundaries. You could probably (somehow) split on sentence boundaries if you want to store each sentence as a value in a multivalued field. Or you could try using the highlighter to return only the sentence.

Of course, defining the sentence boundary is a lot trickier than it seems at first (e.g. "He works for B.B.C.").

Regards,
   Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced

On 13 April 2017 at 15:54, ankur <ankur.sancheti.netw...@gmail.com> wrote:
> If I am searching for the word "growth" in a PDF, I want to output all the
> sentences with the word "growth" in them.
>
> How can that be done?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/keyword-in-content-for-PDF-document-tp4329754.html
> Sent from the Solr - User mailing list archive at Nabble.com.
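
To make the sentence-splitting idea from the reply concrete, here is a minimal Python sketch. It assumes the text has already been extracted (for example by Tika or the extract handler) and is available as a plain string. The function names and the sample text are hypothetical, and the splitter is deliberately naive rather than a recommended approach.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentences_with(word, text):
    """Return the sentences that contain `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in split_sentences(text) if pattern.search(s)]


# Stand-in for text extracted from a PDF (invented sample).
extracted = "Revenue growth was strong. Costs fell. Growth is expected to continue."
matches = sentences_with("growth", extracted)
print(matches)

# The abbreviation case from the reply: the naive rule splits after "B.B.C."
tricky = split_sentences("He works for B.B.C. in London.")
print(tricky)
```

Each string returned by `split_sentences` could be sent as one value of a multivalued field. The second call shows the boundary problem: the abbreviation is treated as the end of a sentence, so a real setup would need a smarter splitter.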