Shawn,

Thanks again for your input.

As I mentioned in my last email, I have successfully completed this in standalone Solr.
My requirement is to index emails that have already been converted to text
files (there are no attachments). Once these text files are indexed, a Solr search
result should return the entire text file as-is. I am able to achieve
this in standalone Solr.
To test my code in SolrCloud I kept a small file with just 3 characters in
it; Solr does not throw any error, but it also does not index the file.

I tried the approaches below:
1. DataImportHandler -- ZooKeeper is not able to read the
tikaConfig.conf file at run time.
2. Flume SolrSink -- no error is shown and it does not index, though once
in a while I see it index even though I made no code changes.

As you mentioned, I have never seen Solr crash or eat up CPU or RAM. The file
I am indexing is very small (it contains ABC \n DEF).
My worry is that Solr is not throwing any error; I have set the log level to TRACE.

Thanks & Regards,
~Sri



-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, February 08, 2017 4:15 PM
To: solr-user@lucene.apache.org
Cc: Shawn Heisey <apa...@elyograg.org>
Subject: RE: DataImportHandler - Unable to load Tika Config Processing Document 
# 1

> Thank you, I will follow Erick's steps.
> BTW I am also trying to ingest using Flume; Flume uses Morphlines
> along with Tika. Will Flume SolrSink have the same issue?

Yes, when using Tika you run the risk of it choking on a document, eating CPU 
and/or RAM until everything dies. This is also true when you run it standalone. 
The problem is usually caused by PDF and Office documents that are unusual, 
corrupt or incomplete (e.g. truncated), or extremely large. But even 
ordinary HTML can get you into trouble due to extreme sizes or deeply nested 
elements.

But, in general, it is not a problem you will experience frequently. We operate 
broad and large scale web crawlers, ingesting all kinds of bad stuff all the 
time. The trick to avoiding problems is to run each Tika parse in a separate 
thread, start a timer, and kill the thread if it reaches a limit. It can still go 
wrong, but trouble is very rare.
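The per-parse timeout described above can be sketched with plain JDK concurrency. This is only an illustration, not Flume's or Solr's actual implementation: the `Callable` here is a hypothetical stand-in for a real Tika parse call, and `parseWithTimeout` is a name invented for this sketch.

```java
import java.util.concurrent.*;

public class TikaTimeoutSketch {

    // Run a parse task in its own thread and give up after timeoutMs.
    // The Callable stands in for a real Tika parse (e.g. parser.parse(...)).
    static String parseWithTimeout(Callable<String> parseTask, long timeoutMs)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(parseTask);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Parse exceeded the limit: interrupt the worker thread
            // and treat the document as unparseable.
            future.cancel(true);
            return null;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast "parse" finishes within the limit.
        String ok = parseWithTimeout(() -> "extracted text", 500);

        // A hanging "parse" (simulated with sleep) is killed by the timer.
        String hung = parseWithTimeout(() -> {
            Thread.sleep(5000);
            return "never reached";
        }, 200);

        System.out.println(ok + " / " + hung);
    }
}
```

Cancelling the future only interrupts the thread, so a parser that ignores interruption can still hang; that is one reason running Tika as a separate process (as mentioned below) is the safest option.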

Running it standalone and talking to it over the network is safest, but not very 
portable or easily distributable on Hadoop or other platforms.
