Here's a writeup on indexing with SolrJ that should help: https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
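
For what it's worth, here is a rough sketch of the "run Tika separately" idea from the thread quoted below, using only the JDK (Java 11+). It shells out to the tika-app jar in a child JVM with a timeout, so a hung or crashing parse can't take the crawler down with it. The class and method names, the jar path, and the 60-second timeout are my own assumptions. `--text` and `--encoding` are tika-app command-line options as far as I know, but check them against your Tika version. The extracted text would then be sent to Solr from your own code, e.g. with SolrJ as in the writeup above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Solr or Tika.
final class OutOfProcessExtractor {

    // Builds the child-JVM command line. The tika-app jar location is an
    // assumption; point it at wherever the jar lives on your system.
    static List<String> tikaCommand(Path tikaAppJar, Path input) {
        return List.of("java", "-jar", tikaAppJar.toString(),
                "--text", "--encoding=UTF-8", input.toString());
    }

    // Runs the command in a separate process and returns its stdout.
    // Output goes to a temp file so a large document cannot fill the pipe
    // buffer and block the child while we wait on it.
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("extract", ".txt");
        try {
            Process p = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                // Parser is stuck: kill the child, the crawler keeps running.
                p.destroyForcibly().waitFor();
                throw new IOException("extractor timed out: " + command);
            }
            if (p.exitValue() != 0) {
                throw new IOException("extractor exited with code " + p.exitValue());
            }
            return Files.readString(out, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java OutOfProcessExtractor /path/to/tika-app.jar /path/to/file
        String text = run(tikaCommand(Path.of(args[0]), Path.of(args[1])), 60);
        System.out.println(text.length() + " characters extracted");
    }
}
```

Starting a JVM per file is slow, so this trades throughput for robustness. At around 10 updates a minute that is probably acceptable; for heavier loads, tika-batch (mentioned in the thread) or a long-running Tika process would be the thing to look at.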

On Tue, Feb 9, 2016 at 2:49 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> Solr uses Tika directly. And not in the most efficient way. It is
> there mostly for convenience rather than performance.
>
> So, for performance, the Solr recommendation is also to run Tika
> separately and only send Solr the processed documents.
>
> Regards,
>    Alex.
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 10 February 2016 at 09:46, Steven White <swhite4...@gmail.com> wrote:
>> Hi folks,
>>
>> I'm writing a file-system crawler that will index files. The file system
>> is going to be very busy and I anticipate on average 10 new updates per
>> min. My application checks for new or updated files once every 1 min. I
>> use Tika to extract the raw text from those files and send it over to Solr
>> for indexing. My application will be running 24x7xN-days. It will not
>> recycle unless the OS is restarted.
>>
>> Over at the Tika mailing list, I was told the following:
>>
>> "As a side note, if you are handling a bunch of files from the wild in a
>> production environment, I encourage separating Tika into a separate jvm vs
>> tying it into any post processing – consider tika-batch and writing
>> separate text files for each file processed (not so efficient, but
>> exceedingly robust). If this is demo code or you know your document set
>> well enough, you should be good to go with keeping Tika and your
>> postprocessing steps in the same jvm."
>>
>> My question is, how does Solr utilize Tika? Does it run Tika in its own
>> JVM as an out-of-process application or does it link with Tika JARs
>> directly? If it links in directly, are there known issues with Solr
>> integrated with Tika because of Tika issues?
>>
>> Thanks
>>
>> Steve