Re: Overall large size in Solr across collections

Zheng Lin Edwin Yeo Wed, 20 Apr 2016 19:10:28 -0700

Hi Shawn,

I'm currently running 4 threads concurrently for the indexing, which
means I run the script from the command prompt in 4 different command
windows. The IDs have been configured so that the runs will not overwrite
each other's documents during the indexing. Is that considered
multi-threading?
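
For comparison, a single process with 4 worker threads would look roughly
like the sketch below, as opposed to 4 separate command windows. This is
only a minimal illustration in Python: `index_file` and `index_all` are
hypothetical names, and the body of `index_file` is a placeholder for the
real extract-and-send-to-Solr step, not the actual script.

```python
from concurrent.futures import ThreadPoolExecutor


def index_file(path):
    """Hypothetical placeholder for indexing one file.

    A real version would extract the content and send it to Solr;
    here it only returns a marker string so the sketch is runnable.
    """
    return f"indexed:{path}"


def index_all(paths, workers=4):
    # One process, one pool of worker threads.
    # pool.map() returns results in the same order as the input paths.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(index_file, paths))


if __name__ == "__main__":
    print(index_all(["a.eml", "b.eml", "c.eml"]))
```

With this layout the work is split inside one program, so there is no need
to partition the files by hand across several windows.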


The rate is below 0.2GB/hr for each individual thread, and the overall
rate is just 0.7GB/hr.

Regards,
Edwin


On 20 April 2016 at 21:43, Shawn Heisey <apa...@elyograg.org> wrote:

> On 4/19/2016 10:12 PM, Zheng Lin Edwin Yeo wrote:
> > Thanks for the information Shawn.
> >
> > I believe it could be due to the type of file being indexed.
> > Currently, I'm indexing EML files in HTML format, which are richer in
> > content (with inline images and full text), while previously the EML
> > files were in plain text format, with the images as attachments.
> >
> > Will this be the cause of the slow indexing speed which I'm facing now?
> > It is more than 3 times slower than what I had previously.
>
> I assume that you are using the Extracting Request Handler for this.  I
> know almost nothing about Tika, but I would imagine that extracting data
> from rich text documents is not a fast process, and that plain text
> documents would be a lot faster.  I could be wrong -- I've never used
> the ERH myself.
>
> If you want a setup like this to go faster, you probably need to make
> your indexing process multi-threaded.  Ideally, such an application
> would be written in Java and would incorporate Tika into the client-side
> code.  Tika can be very unstable, so running it inside Solr (the
> Extracting Request Handler) can make Solr itself unstable.
>
> Thanks,
> Shawn
>
>
