Thanks for the info, Anshum.

Writing up a SolrJ program to do this is entirely within my wheelhouse.

Read through some of the SolrJ docs and found some examples to start.

A handful of questions if anyone has some pointers.

1. From a performance perspective, is it worth it to use
ConcurrentUpdateSolrServer? Also, documentation says best for updates;
does that include adding documents?

2. When I run the importer via my SolrJ program to distribute the
indexing, I'll create some kind of Solr client within SolrJ and point it
at ZooKeeper.  But the communication with the SQL Server db is independent
of the communication with ZooKeeper, right?  In that case, is it
possible/does it make sense to run the SolrJ program on each node, so that
each node communicates with the DB but they're both communicating with zk?
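
A minimal sketch of how the work in question 2 could be split, assuming
each row has a numeric key and each node is given an indexer number. The
class and method names here (RowPartitioner, ownsRow, sliceFor) are
hypothetical and are not part of SolrJ; this only shows how two
independent programs could each take a disjoint slice of the same table.

```java
import java.util.ArrayList;
import java.util.List;

public class RowPartitioner {

    /** True if the indexer numbered workerIndex (0-based) of workerCount should index this row. */
    public static boolean ownsRow(long rowId, int workerIndex, int workerCount) {
        // floorMod keeps the result non-negative even for negative keys
        return Math.floorMod(rowId, (long) workerCount) == workerIndex;
    }

    /** Filters a batch of row ids down to the ones this indexer is responsible for. */
    public static List<Long> sliceFor(List<Long> rowIds, int workerIndex, int workerCount) {
        List<Long> mine = new ArrayList<>();
        for (long id : rowIds) {
            if (ownsRow(id, workerIndex, workerCount)) {
                mine.add(id);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long i = 1; i <= 10; i++) {
            ids.add(i);
        }
        // Two indexers, one per Solr node in this setup
        System.out.println("indexer 0: " + sliceFor(ids, 0, 2));
        System.out.println("indexer 1: " + sliceFor(ids, 1, 2));
    }
}
```

In a real importer the same modulo condition would more likely go into
the SQL query itself, so that each node only pulls its own rows from the
database.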

One more question: for document routing to specific shards, the particular
documents I have don't really have a natural way for routing.  Even if
they did, my intuition is that I want the documents randomly and evenly
distributed across all the machines in the cluster that will perform the
querying.  Or is that intuition wrong, and it's better to have documents
that fit a given search criterion sorted in some way and placed near each
other on a single or small number of machines?
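
For what it's worth, my understanding is that SolrCloud's default
(compositeId) router already hashes each document's unique key onto shard
hash ranges, which should give the even spread described above without
any custom routing. A minimal sketch of the idea follows. It uses CRC32
as a stand-in hash, which is an assumption for illustration only and is
not the hash Solr uses; HashRoutingSketch, shardFor and countPerShard are
hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class HashRoutingSketch {

    /** Maps a document id to a shard number in [0, shardCount) by hashing the id. */
    public static int shardFor(String docId, int shardCount) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so plain % is safe here
        return (int) (crc.getValue() % shardCount);
    }

    /** Counts how many of the ids "doc-0" .. "doc-(n-1)" land on each shard. */
    public static int[] countPerShard(int docCount, int shardCount) {
        int[] counts = new int[shardCount];
        for (int i = 0; i < docCount; i++) {
            counts[shardFor("doc-" + i, shardCount)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countPerShard(100_000, 2);
        for (int shard = 0; shard < counts.length; shard++) {
            System.out.println("shard " + shard + ": " + counts[shard] + " docs");
        }
    }
}
```

The point of the sketch is only that the same id always maps to the same
shard while ids as a whole spread across shards, with no natural routing
key needed.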

Any insights much appreciated!

-Colin



On 2/18/16, 2:01 AM, "Anshum Gupta" <ans...@anshumgupta.net> wrote:

>Hi Colin,
>
>As of when I last checked, DIH works with SolrCloud but has its
>limitations. It was designed for the non-cloud mode and is single
>threaded.
>It runs on whatever node you set it up on and that node might not host the
>leader for the shard a document belongs to, adding an extra hop for those
>documents.
>
>SolrCloud is designed for multi-threaded indexing and I'd highly recommend
>using SolrJ to speed up your indexing. Yes, that would involve writing
>some code but it would speed things up considerably.
>
>
>On Wed, Feb 17, 2016 at 10:51 PM, Colin Freas <cfr...@stsci.edu> wrote:
>
>>
>> I just set up a SolrCloud instance with 2 Solr nodes & another machine
>> running zookeeper.
>>
>> I've imported 200M records from a SQL Server database, and those records
>> are split nicely between the 2 nodes.  Everything seems ok.
>>
>> I did the data import via the admin ui.  It took not quite 8 hours,
>> which I guess is fine.  So, in the middle of the import I checked to see
>> what was connected to the SQL Server machine.  It turned out that only
>> the node that I had started the import on was actually connected to my
>> database server.
>>
>> Is that the expected behavior?  Is there any way to have all nodes of a
>> SolrCloud index communicate with the database during the indexing?
>> Would that speed up indexing?  Maybe this isn't a bottleneck I should be
>> worried about.
>>
>> Thanks,
>> -Colin
>>
>
>
>
>-- 
>Anshum Gupta
