> I can tell you that Tika is quite the resource hog.  It is likely chewing
> up CPU and memory resources at an incredible rate, slowing down your Solr
> server.  You would probably see better performance than ERH if you
> incorporate Tika and SolrJ into a client indexing program that runs on a
> different machine than Solr.

+1
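
To sketch the client-side approach Shawn describes, here's a minimal, untested SolrJ + Tika indexer.  The Solr URL, collection and field names are placeholders for your setup, and it assumes a SolrJ version with HttpSolrClient.Builder plus tika-parsers on the classpath:

import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaSolrIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL/collection -- point this at your Solr instance.
        SolrClient solr = new HttpSolrClient.Builder(
                "http://solrhost:8983/solr/mycollection").build();
        AutoDetectParser parser = new AutoDetectParser();

        try (DirectoryStream<Path> docs =
                 Files.newDirectoryStream(Paths.get(args[0]))) {
            for (Path doc : docs) {
                if (!Files.isRegularFile(doc)) continue;
                // -1 removes the default 100k character write limit
                BodyContentHandler handler = new BodyContentHandler(-1);
                Metadata metadata = new Metadata();
                try (InputStream in = Files.newInputStream(doc)) {
                    // Extraction happens here, on the client machine,
                    // so Solr never has to run Tika itself.
                    parser.parse(in, handler, metadata, new ParseContext());
                }
                SolrInputDocument sdoc = new SolrInputDocument();
                sdoc.addField("id", doc.getFileName().toString());
                // "content" is a placeholder field name -- match your schema
                sdoc.addField("content", handler.toString());
                solr.add(sdoc);
            }
        }
        solr.commit();
        solr.close();
    }
}

Because all the parsing happens in the client JVM, the Solr server only has to handle plain document adds.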

It'd be interesting to run standalone tika-batch and see what the
performance is:

java -jar tika-app.jar -i <input_dir> -o <output_dir>

and if you're feeling adventurous:

java -jar tika-app.jar -i <input_dir> -o <output_dir> -J -t
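
(-J writes recursive JSON output, i.e. metadata plus extracted content for
the container file and any embedded documents, and -t selects the plain
text form of that content.)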

You can specify the number of consumer threads with -numConsumers (don't
use many more than the number of CPUs!).
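
For example, with the same placeholder directories as above:

java -jar tika-app.jar -i <input_dir> -o <output_dir> -numConsumers 5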

Content extraction with Tika is usually slower (sometimes far slower) than the 
indexing step.  If you have any crazily slow docs, open an issue on Tika's JIRA.

Cheers,
 
          Tim



-----Original Message-----
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com] 
Sent: Thursday, April 21, 2016 12:13 AM
To: solr-user@lucene.apache.org
Subject: Re: Overall large size in Solr across collections

Hi Shawn,

Yes, I'm using the Extracting Request Handler.

The 0.7GB/hr is the rate at which the original documents get ingested into 
Solr. This means that in each hour, only 0.7GB of my documents gets ingested 
into Solr, so it will take 10 hours just to index documents which are 7GB in 
size.

Regards,
Edwin

