Re: how to get high-availability for Solr csv update handler?

Walter Underwood Mon, 25 Feb 2019 13:52:28 -0800

We send batches of updates to a load balancer. The cluster gets the updates to 
the right leader with very little overhead. When we get an error, we resend the 
update batch. The load balancer will find a healthy node to receive it. This is 
simple, robust, and fast.
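
For anyone who wants the resend loop spelled out, here is a rough Python sketch, not our actual code. The load balancer hostname, the attempt count, and the exponential delay are all assumptions for illustration, and returning 4xx responses without retrying is a choice of this sketch.

```python
import time
import urllib.error
import urllib.request

# Hypothetical load balancer address in front of the Solr nodes.
UPDATE_URL = "http://solr-lb.example.com:8983/solr/my_collection/update"


def post_csv(body: bytes, url: str = UPDATE_URL) -> int:
    """POST one CSV batch and return the HTTP status code."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/csv"}
    )
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Solr answered, but with an error status (400, 503, ...).
        return err.code


def send_with_retry(body, post=post_csv, attempts=5, delay=1.0, sleep=time.sleep):
    """Resend the same batch until a node answers with a non-5xx status.

    Returns the last status seen, or None if no node could be reached.
    """
    status = None
    for attempt in range(attempts):
        try:
            status = post(body)
        except OSError:
            # Connection refused, timeout, etc.: the node behind the
            # load balancer is probably down, so try again.
            status = None
        if status is not None and status < 500:
            # Accepted, or a client error that a resend will not fix.
            return status
        if attempt < attempts - 1:
            sleep(delay * 2 ** attempt)  # assumed backoff schedule
    return status
```

Calling `send_with_retry(open("books.csv", "rb").read())` would resend the whole file on each failed attempt, so smaller batches keep retries cheap.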


One handy tip: if a batch fails with a 400, we back off and resend it in 
batches of 1 document each so we can identify the bad one. This saves a ton of 
time trying to manually find the bad document.
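
A minimal Python sketch of that one-document fallback, assuming CSV input with one document per line (no embedded newlines) and a caller-supplied `send` function that posts a CSV string and returns the HTTP status. The function names are made up for illustration. Note that the CSV header line has to be repeated in every single-document batch.

```python
def make_csv(header, rows):
    """Join a header line and document rows into one CSV payload."""
    return "\n".join([header, *rows]) + "\n"


def send_batch(header, rows, send):
    """Send a batch; if it gets a 400, resend row by row.

    Returns the list of rows that were rejected with a 400 on their own.
    Other errors (5xx, connection failures) are not handled here and
    are assumed to be retried by the caller.
    """
    if send(make_csv(header, rows)) != 400:
        return []
    # The batch was rejected: resend in batches of one document each
    # so the bad document(s) identify themselves.
    return [row for row in rows if send(make_csv(header, [row])) == 400]
```

The good rows from a rejected batch are indexed by the single-document pass, so only the returned rows need manual attention.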

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 25, 2019, at 1:31 PM, Ganesh Sethuraman <ganeshmail...@gmail.com> 
> wrote:
> 
> Thanks for the details and updates. We are looking at load balancers not
> for the small improvement in performance, but for high
> availability. The other alternative is that, if the update fails on one
> server using curl, we call another Solr server on error. I was looking to
> see if there is any other way to get the working leader from ZooKeeper
> before the update. Is there a way to query ZooKeeper for that? I understand
> there is no guarantee that the leader won't change during a large CSV file
> update, but at least some protection during planned server restarts can be
> managed.
> 
> Regarding the SolrJ option, it certainly seems to be the best option. Is
> there a Python Solr client that is leader-aware, the way the SolrJ (Java)
> client is?
> 
> Regards,
> Ganesh
> 
> On Mon, Feb 25, 2019 at 3:00 PM Shawn Heisey <apa...@elyograg.org> wrote:
> 
>> On 2/25/2019 11:15 AM, Ganesh Sethuraman wrote:
>>> We are using Solr Cloud 7.2.1. We are using the Solr CSV update handler to
>>> do bulk updates (several million docs) into multiple collections. When we
>>> make a call to the CSV update handler using the curl command line (as
>>> below), we are pointing to a single server in Solr. If one of the Solr
>>> servers goes down, this approach could fail. Is there any way to send the
>>> write to the leader, as SolrJ does, through simple curl command(s)?
>> 
>> The SolrJ client named CloudSolrClient is able to do this because it is
>> a full ZooKeeper client that has instant access to the clusterstate
>> maintained by your Solr servers.
>> 
>> To get that capability in any other client would require that the client
>> is aware of the ZooKeeper ensemble in the same way.  Curl cannot do this.
>> 
>>> 
>>> In the request below for some reason, if the SOLR1-SERVER is down, the
>>> request will fail, even though the new leader say SOLR2-SERVER is up.
>>> 
>>> curl 'http://<<SOLR1-SERVER>>:8983/solr/my_collection/update?commit=true'
>>> --data-binary @example/exampledocs/books.csv -H
>>> 'Content-type:application/csv'
>>> 
>>> 1. I can create a load balancer / ALB in front of Solr, but that still may
>>> not identify the leader for efficiency.
>> 
>> A load balancer won't be able to identify the leader unless it is
>> capable of talking to ZooKeeper and knows how Solr represents data in
>> ZK.  Have you measured the efficiency improvement that comes from
>> sending to the leader?  If that improvement is small, it's probably not
>> worth implementing something that talks to ZooKeeper.  I know there are
>> people achieving very fast indexing rates who don't try to send to
>> leaders ... I suspect that the improvement obtained by sending to
>> leaders is relatively small.
>> 
>>> 2. I can write a SolrJ client to do the update, but I am not sure if I will
>>> get the efficiency of a bulk update, or the simplicity of curl.
>> 
>> SolrJ is probably more efficient than something like curl, because it
>> utilizes a compact binary format for data transfer in both directions,
>> called javabin.  With curl, you would most likely be using a text format
>> like json, xml, or csv.
>> 
>> SolrJ clients are fully thread-safe, which means you can use a single
>> instance to send updates in parallel with multiple threads.  That is the
>> best way to achieve good indexing performance with Solr.
>> 
>> Thanks,
>> Shawn
>>
