Hi Shawn,

Yes, I'm using the Extracting Request Handler.

The 0.7GB/hr is the indexing rate measured by the size of the original
documents ingested into Solr. In other words, only 0.7GB of my documents
is ingested into Solr per hour, so it would take 10 hours just to index
7GB of documents.

Regards,
Edwin


On 21 April 2016 at 11:40, Shawn Heisey <apa...@elyograg.org> wrote:

> On 4/20/2016 8:10 PM, Zheng Lin Edwin Yeo wrote:
> > I'm currently running 4 threads concurrently to run the indexing, which
> > means I run the script in command prompt in 4 different command windows.
> > The ID has been configured in such a way that it will not overwrite each
> > other during the indexing. Is that considered multi-threading?
> >
> > The rates are all below 0.2GB/hr for each individual thread, and the
> > overall rate is just 0.7GB/hr.
>
> Was I right to think you're using the Extracting Request Handler?
>
> If you have enough CPU resources on the Solr server, you could start
> even more copies of the program -- effectively, more threads.
>
> What are you measuring at 0.7GB/hr?  The size of the rich text documents
> you are ingesting?  The size of the text extracted from the documents?
> The size of the index directory in Solr?
>
> Using the dataimport handler importing from MySQL, I can simultaneously
> build six separate 60GB indexes in about 18 hours, on two servers.  Each
> of those indexes has more than 50 million documents.  These are not rich
> text documents, though.  DIH is single-threaded, so each of those
> indexes is only being built with one thread.  Saying the important thing
> again:  These are NOT rich text documents.
>
> If you're using ERH, which runs Tika, I can tell you that Tika is quite
> the resource hog.  It is likely chewing up CPU and memory resources at
> an incredible rate, slowing down your Solr server.  You would probably
> see better performance than ERH if you incorporate Tika and SolrJ into a
> client indexing program that runs on a different machine than Solr.
>
> Thanks,
> Shawn
>
>
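As a rough sketch of the client-side approach Shawn suggests, the program
below extracts text with Tika on the client machine and sends plain
documents to Solr with SolrJ, keeping the heavy parsing off the Solr
server. The Solr URL, core name, and field names ("id", "content") are
assumptions; adjust them to match your schema.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

import java.io.File;

public class TikaSolrIndexer {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Assumed Solr URL and core name -- change to your setup.
        try (SolrClient solr =
                 new HttpSolrClient("http://localhost:8983/solr/collection1")) {
            for (String path : args) {
                File f = new File(path);
                SolrInputDocument doc = new SolrInputDocument();
                // Assumed unique key field; use your own ID scheme so
                // concurrent instances don't overwrite each other.
                doc.addField("id", f.getAbsolutePath());
                // Text extraction happens here, on the client, not in Solr.
                doc.addField("content", tika.parseToString(f));
                solr.add(doc);
            }
            solr.commit();
        }
    }
}
```

Running several copies of a program like this (or using a thread pool
inside one JVM) spreads the Tika CPU cost across client machines instead
of the Solr server.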