Here's a write-up that should help:

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
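For what it's worth, the "run Tika in a separate JVM" advice quoted below can be sketched roughly as follows. This is only a minimal illustration, not how Solr or the write-up above does it. It assumes a tika-app jar is available at a path you supply and that its "--text" option prints plain text to stdout; the class and method names (OutOfProcessExtractor, buildCommand, extract) are made up for this example. The returned text would then be sent to Solr with SolrJ, which is not shown here.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Sketch only: runs text extraction in a child JVM so that a parser crash,
// hang, or OutOfMemoryError cannot take down the long-running crawler.
class OutOfProcessExtractor {

    // Builds the child-JVM command line. tikaAppJar is a placeholder path
    // to the tika-app jar; "--text" asks tika-app for plain-text output.
    static List<String> buildCommand(String tikaAppJar, Path file) {
        return Arrays.asList("java", "-jar", tikaAppJar, "--text", file.toString());
    }

    // Runs the command and returns what it wrote to stdout, or null if the
    // child exited with an error or exceeded the timeout. Output goes to a
    // temp file so a full pipe buffer cannot block the child process.
    static String extract(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("extracted-", ".txt");
        try {
            Process p = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly().waitFor();  // kill a hung parser
                return null;
            }
            if (p.exitValue() != 0) {
                return null;                    // parser failed on this file
            }
            // Assumption: the child writes in the platform default charset.
            return new String(Files.readAllBytes(out), Charset.defaultCharset());
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

A crawler loop would call extract(buildCommand(jarPath, file), 60) for each new or updated file and simply skip the file (and log it) when null comes back, so one bad document never stops the 24x7 process. Starting a JVM per file is slow; at roughly 10 updates a minute that is probably tolerable, but tika-batch or a long-lived Tika server process would be the next step if it is not.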

On Tue, Feb 9, 2016 at 2:49 PM, Alexandre Rafalovitch
<arafa...@gmail.com> wrote:
> Solr uses Tika directly. And not in the most efficient way. It is
> there mostly for convenience rather than performance.
>
> So, for performance, the Solr recommendation is also to run Tika
> separately and only send Solr the processed documents.
>
> Regards,
>     Alex.
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 10 February 2016 at 09:46, Steven White <swhite4...@gmail.com> wrote:
>> Hi folks,
>>
>> I'm writing a file-system crawler that will index files.  The file system
>> is going to be very busy and I anticipate on average 10 new updates per
>> min.  My application checks for new or updated files once every 1 min.  I
>> use Tika to extract the raw text from those files and send it over to Solr
>> for indexing.  My application will be running 24x7xN-days.  It will not
>> recycle unless the OS is restarted.
>>
>> Over at Tika mailing list, I was told the following:
>>
>> "As a side note, if you are handling a bunch of files from the wild in a
>> production environment, I encourage separating Tika into a separate jvm vs
>> tying it into any post processing – consider tika-batch and writing
>> separate text files for each file processed (not so efficient, but
>> exceedingly robust).  If this is demo code or you know your document set
>> well enough, you should be good to go with keeping Tika and your
>> postprocessing steps in the same jvm."
>>
>> My question is, how does Solr utilize Tika?  Does it run Tika in its own
>> JVM as an out-of-process application or does it link with Tika JARs
>> directly?  If it links in directly, are there known issues with Solr
>> integrated with Tika because of Tika issues?
>>
>> Thanks
>>
>> Steve
