Re: ConcurrentUpdateSolrClient vs CloudSolrClient for bulk update to SolrCloud

Erick Erickson Wed, 13 Jan 2016 09:12:35 -0800

It's usually not all that difficult to write a multi-threaded
client that uses CloudSolrClient, or even fire up multiple
instances of the SolrJ client (assuming they can work
on discrete sections of the documents you need to index).

That avoids the problem Shawn alludes to, plus other
issues. If you do _not_ use CloudSolrClient, then all the
docs go to some node in the system (and you really should
update in batches, see:
https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/).
The node that receives the packet sub-divides it
into groups based on which shard each doc belongs to
and forwards each group to the leader of that shard, very
significantly increasing the number of conversations
being carried on between Solr nodes. Multiply that by the
number of threads you're specifying with CUSC (I really regret
the renaming from ConcurrentUpdateSolrServer, I liked
writing CUSS).
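
To make the sub-dividing step concrete, here is a toy sketch of what
the receiving node has to do with each packet. The class name, the
method name and the hash-modulo rule are all made up for illustration;
Solr's real document router works on hash ranges and is more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShardGrouping {

    /**
     * Groups document ids by a toy shard index.
     * Assumption: hash modulo stands in for Solr's real routing logic.
     */
    static Map<Integer, List<String>> groupByShard(List<String> ids, int numShards) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (String id : ids) {
            // floorMod keeps the index non-negative even for negative hash codes
            int shard = Math.floorMod(id.hashCode(), numShards);
            groups.computeIfAbsent(shard, k -> new ArrayList<>()).add(id);
        }
        return groups;
    }
}
```

Every entry in the returned map is one more sub-request that has to be
forwarded to a shard leader, and that happens for every packet from
every sending thread.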

With CloudSolrClient, you can scale nearly linearly with
the number of shards. Not so with CUSC.
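
A rough sketch of the multi-threaded client idea, using only the JDK so
it stands on its own. BatchSender, ParallelBatchIndexer and indexAll
are hypothetical names, not SolrJ classes; the sender is a stand-in for
whatever batched update call you make on your one shared client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class ParallelBatchIndexer {

    /** Hypothetical stand-in for one batched update call on a shared client. */
    interface BatchSender<T> {
        void send(List<T> batch) throws Exception;
    }

    /**
     * Splits docs into batches and sends them from a fixed thread pool.
     * Returns the number of docs sent. If any batch fails, Future.get()
     * throws an ExecutionException, so the caller sees the error.
     */
    static <T> int indexAll(List<T> docs, int batchSize, int threads,
                            BatchSender<T> sender) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                List<T> batch = new ArrayList<>(docs.subList(start, end));
                pending.add(pool.submit(() -> {
                    sender.send(batch);
                    return batch.size();
                }));
            }
            int sent = 0;
            for (Future<Integer> f : pending) {
                sent += f.get(); // rethrows a failed batch's exception, wrapped
            }
            return sent;
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }
}
```

With SolrJ, the sender would presumably be something like
batch -> client.add(batch) on a single CloudSolrClient shared by all
threads. Unlike CUSC, a failed batch comes back to the caller here.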

FWIW,
Erick

On Tue, Jan 12, 2016 at 8:06 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 1/12/2016 7:42 PM, Shivaji Dutta wrote:
>> Since ConcurrentUpdateSolrClient lets me use a queue and a pool
>> of threads, it looks more attractive than CloudSolrClient, which
>> will use an HttpSolrClient once it gets a set of nodes to do the updates.
>>
>> What is the recommended API for updating large numbers of documents at
>> a higher throughput rate?
>
> ConcurrentUpdateSolrClient has one flaw -- it swallows all exceptions
> that happen during indexing.  Your application will never know about any
> problems that occur during indexing.  The entire cluster could be down,
> and your application would never know about it until you tried an
> explicit commit operation.  Commit is an operation that is not handled
> in the background by CUSC, so I would expect any exception to be
> returned for that operation.
>
> This flaw is inherent to its design; the behavior would be very
> difficult to change.
>
> If you don't care about your application getting error messages when
> indexing requests fail, then CUSC is perfect.  This might be the case if
> you are doing initial bulk loading.  For normal index updates after
> initial loading, you would not want to use CUSC.
>
> If you do care about getting error messages when bulk indexing requests
> fail, then you'll want to build a program with CloudSolrClient where you
> create multiple indexing threads that all use the same client object.
>
> Thanks,
> Shawn
>
