Re: Overall large size in Solr across collections

Shawn Heisey Wed, 20 Apr 2016 06:51:41 -0700

On 4/19/2016 10:12 PM, Zheng Lin Edwin Yeo wrote:
> Thanks for the information Shawn.
>
> I believe it could be due to the types of files being indexed.
> Currently, I'm indexing EML files that are in HTML format, and they
> are richer in content (with inline images and full text), while
> previously the EML files were in plain text format, with the images as
> attachments.
>
> Will this be the cause of the slow indexing speed which I'm facing now? It
> is more than 3 times slower than what I had previously.


I assume that you are using the Extracting Request Handler (ERH) for
this.  I know almost nothing about Tika, but I would imagine that
extracting data from rich text documents is not a fast process, and that
plain text documents would be a lot faster.  I could be wrong -- I've
never used the ERH myself.

If you want a setup like this to go faster, you probably need to make
your indexing process multi-threaded.  Ideally, such an application
would be written in Java and would incorporate Tika into the client-side
code.  Tika can be very unstable, so running it inside Solr (the
Extracting Request Handler) can make Solr itself unstable.
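
As a rough illustration of the multi-threaded, client-side approach
described above, here is a minimal sketch using only the Java standard
library.  The class name ParallelIndexer and both callbacks are
hypothetical: in a real application the "extractor" would presumably
wrap a Tika parser running in the client, and the "indexer" would wrap
whatever client sends documents to Solr.  Neither is shown here, and the
thread count is something you would have to tune yourself.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical helper: runs extraction and indexing on a fixed thread pool.
public class ParallelIndexer {
    // Assumption: wraps client-side text extraction (e.g. a Tika parser).
    private final Function<Path, String> extractor;
    // Assumption: sends (document id, extracted text) to Solr.
    private final BiConsumer<String, String> indexer;
    private final int threads;

    public ParallelIndexer(Function<Path, String> extractor,
                           BiConsumer<String, String> indexer,
                           int threads) {
        this.extractor = extractor;
        this.indexer = indexer;
        this.threads = threads;
    }

    // Processes every file in parallel and returns how many files failed.
    // A failure in one file (for example a parser crash) does not stop
    // the others, and it happens in the client rather than inside Solr.
    public int indexAll(List<Path> files) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Path file : files) {
                futures.add(pool.submit(() -> {
                    String text = extractor.apply(file);
                    indexer.accept(file.toString(), text);
                }));
            }
            int failures = 0;
            for (Future<?> future : futures) {
                try {
                    future.get(); // wait for this file to finish
                } catch (ExecutionException e) {
                    failures++;   // extraction or indexing threw
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while indexing", e);
                }
            }
            return failures;
        } finally {
            pool.shutdown();
        }
    }
}
```

The indexer callback is called from several threads at once, so whatever
it wraps needs to be safe for concurrent use.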

Thanks,
Shawn
