> Thank you, I will follow Erick's steps.
> BTW, I am also trying to ingest using Flume. Flume uses Morphlines along
> with Tika.
> Will the Flume SolrSink have the same issue?

Yes, whenever you use Tika you run the risk of it choking on a document, eating
CPU and/or RAM until everything dies. This is also true when you run it
standalone. The problem is usually caused by PDF and Office documents that are
unusual, corrupt, incomplete (e.g. truncated) or extremely large. But even
ordinary HTML can get you into trouble through extreme size or very deeply
nested elements.

In general, though, it is not a problem you will experience frequently. We
operate broad, large-scale web crawlers that ingest all kinds of bad stuff all
the time. The trick to avoiding problems is to run each Tika parse in a separate
thread, start a timer, and kill the thread if it exceeds a time limit. It can
still go wrong, but trouble is very rare.
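
A minimal sketch of that thread-plus-timer idea in Java, not our actual crawler
code. The class name TimedParse, the method parseWithTimeout and the timeout
values are made up for illustration, and the Callable stands in for whatever
Tika call you use. Note that cancelling only interrupts the worker thread. A
parser that never checks for interruption can keep running, which is one of the
ways it can still go wrong.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs one parse on its own thread with a time limit.
public class TimedParse {

    // Returns the parse result, or empty if the task timed out or threw.
    public static Optional<String> parseWithTimeout(Callable<String> parseTask,
                                                    long timeoutMillis) {
        // One worker thread per parse; daemon so a stuck parse cannot
        // keep the JVM alive on shutdown.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true);
            return t;
        });
        Future<String> future = executor.submit(parseTask);
        try {
            // Wait for the result, but no longer than the limit.
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Interrupts the worker; it only stops if the task honours interrupts.
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            // The parse itself failed (e.g. corrupt document).
            return Optional.empty();
        } catch (InterruptedException e) {
            // The calling thread was interrupted; restore the flag and give up.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for a real parse: one fast, one that hangs past the limit.
        System.out.println(parseWithTimeout(() -> "parsed text", 1000));
        System.out.println(parseWithTimeout(() -> {
            Thread.sleep(5000);
            return "too late";
        }, 200));
    }
}
```

In real use you would replace the lambda with the actual parse of one document
and treat an empty result as "skip this document and log it".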

Running it standalone and talking to it over the network is safest, but that is
not very portable or easy to distribute on Hadoop or other platforms.
