On 8/27/2019 7:18 AM, Khare, Kushal (MIND) wrote:
Basically, the problem I am facing is that I am getting the textual content plus 
other metadata in my _text_ field, but I want only the textual content written 
inside the document.
I tried various update/extract request handler configurations, but none of them 
worked for me.
Please help me resolve this, as I am badly stuck.

Controlling exactly what gets indexed in which fields is likely going to require that you write the indexing software yourself -- a program that extracts the data you want and sends it to Solr for indexing.

We do not recommend running the Extracting Request Handler in production -- Tika is known to crash when given some documents (usually PDF files are the problematic ones, but other formats can cause it too), and if it crashes while running inside Solr, it will take Solr down with it.

Here is an example program that uses Tika for rich document parsing. It also talks to a database, but that part could be easily removed or modified:

https://lucidworks.com/post/indexing-with-solrj/
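
As a rough illustration of the same idea (not the program from that link), here is a minimal Python sketch of the indexing side. It assumes the body text has already been extracted by a separate process running outside Solr, keeps only that text, and builds a JSON request for Solr's /update handler. The function names, the collection URL in the usage note, and the choice of "id" and "_text_" as field names are assumptions for illustration; adjust them to your schema.

```python
import json
import urllib.request

# Assumed field names; change these to match your own schema.
ID_FIELD = "id"
TEXT_FIELD = "_text_"


def build_solr_doc(doc_id, body_text):
    """Return a Solr document holding only the id and the body text.

    Metadata from the extractor is deliberately not passed in, so it
    can never end up in the indexed field.
    """
    # Collapse runs of whitespace left behind by the text extractor.
    cleaned = " ".join(body_text.split())
    return {ID_FIELD: doc_id, TEXT_FIELD: cleaned}


def build_update_request(update_url, docs):
    """Build (but do not send) a JSON update request for Solr."""
    payload = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        update_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_docs(update_url, docs):
    """Send the documents to Solr and return the HTTP status code."""
    with urllib.request.urlopen(build_update_request(update_url, docs)) as resp:
        return resp.status
```

With a collection named "mycollection" (a made-up name), the call would look something like post_docs("http://localhost:8983/solr/mycollection/update?commit=true", [build_solr_doc("doc-1", text)]). Because the extraction runs in your own program, a parser crash only affects that program, not Solr.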

Thanks,
Shawn
