Shawn, thanks again for your input.
As I said in my last email, I successfully completed this in Solr standalone. My requirement is to index emails that have already been converted to text files (there are no attachments). Once these text files are indexed, a Solr search result should bring back the entire text file as-is. I am able to achieve this in Solr standalone.

To test my code in SolrCloud, I kept a small file with just 3 characters in it. Solr does not throw any error, but it is also not indexing the file. I tried the approaches below:

1. DataImportHandler -- ZooKeeper is not able to read the tikaConfig.conf file at run time.
2. Flume SolrSink -- no error is shown and it does not index, though I see that once in a while it does index, even though I made no code changes.

As you mentioned, I never saw Solr crashing or eating up CPU or RAM. The file I am indexing is very small (it contains "ABC \n DEF").

My worry is that Solr is not throwing any error, even with the log level set to TRACE.

Thanks & Regards,
~Sri

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, February 08, 2017 4:15 PM
To: solr-user@lucene.apache.org
Cc: Shawn Heisey <apa...@elyograg.org>
Subject: RE: DataImportHandler - Unable to load Tika Config Processing Document # 1

> Thank you, I will follow Erick's steps.
> BTW, I am also trying to ingest using Flume; Flume uses Morphlines
> along with Tika. Will Flume SolrSink have the same issue?

Yes, when using Tika you run the risk of it choking on a document, eating CPU and/or RAM until everything dies. This is also true when you run it standalone. The problem is usually caused by PDF and Office documents that are unusual, corrupt, incomplete (e.g. truncated in size), or extremely large. But even ordinary HTML can get you into trouble due to extreme size or very deeply nested elements. In general, though, it is not a problem you will experience frequently. We operate broad and large-scale web crawlers, ingesting all kinds of bad stuff all the time.
The trick to avoid problems is to run each Tika parse in a separate thread, with a timer, and kill the thread if it reaches a limit. It can still go wrong, but trouble is very rare. Running Tika standalone and talking to it over the network is safest, but that is not very portable or easily distributable on Hadoop or other platforms.
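The per-parse thread with a timeout described above can be sketched with standard java.util.concurrent primitives. This is a minimal illustration, not Markus's actual code: slowParse is a hypothetical stand-in for a real Tika Parser.parse call, and the timings are arbitrary.

```java
import java.util.concurrent.*;

public class ParseWithTimeout {

    // Hypothetical stand-in for a Tika parse; sleeps to simulate work.
    static String slowParse(String doc, long workMillis) throws InterruptedException {
        Thread.sleep(workMillis);
        return "parsed:" + doc;
    }

    // Run the parse in its own thread and give up if it exceeds timeoutMillis.
    static String parseWithTimeout(String doc, long workMillis, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(() -> slowParse(doc, workMillis));
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the runaway parse
            return null;
        } catch (Exception e) {
            return null;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithTimeout("ok.txt", 10, 500));   // fast doc completes
        System.out.println(parseWithTimeout("bad.pdf", 5000, 200)); // slow doc times out -> null
    }
}
```

Note that interrupting the thread only works if the parser checks its interrupt flag; a parse stuck in a tight loop may ignore it, which is one reason running Tika in a separate process (as Markus suggests) is the safest option.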
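On Sri's issue 1 (ZooKeeper unable to read tikaConfig.conf): in SolrCloud, DIH resolves resources through the collection's config set stored in ZooKeeper, so the Tika config file generally has to be uploaded alongside solrconfig.xml and data-config.xml rather than sitting on local disk. A minimal data-config.xml sketch, assuming TikaEntityProcessor with its tikaConfig attribute; the paths, entity name, and field names are illustrative only:

```
<!-- data-config.xml (sketch) - tikaConfig.xml must live in the
     same config set uploaded to ZooKeeper for the collection -->
<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="emailFiles"
            processor="TikaEntityProcessor"
            url="/data/emails/mail1.txt"
            format="text"
            tikaConfig="tikaConfig.xml">
      <field column="text" name="content"/>
    </entity>
  </document>
</dataConfig>
```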