Tomás,

Good find, but I don’t think the rate of updates was high enough during the 
network outage to create the overrun situation described in the ticket.

I did notice that one of the proposed fixes, 
https://issues.apache.org/jira/browse/SOLR-8586, is an entire-index consistency 
check between leader and replica.  I really hope they are able to get this to 
work.  Ideally, the replicas would never become (permanently) inconsistent, but 
given that they do, it is crucial that SolrCloud can internally detect and repair 
the inconsistency, no matter what caused it or how long ago it happened.
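
For what it’s worth, the check itself is conceptually simple; the hard part is 
doing it efficiently inside Solr.  A rough sketch of the idea in plain Java (not 
Solr’s actual implementation; it assumes the id -> _version_ pairs have already 
been pulled from the leader and from a replica, e.g. by querying each core with 
distrib=false):

    import java.util.HashMap;
    import java.util.Map;

    public class ReplicaDiff {

        // Print documents that are missing or stale on the replica, given
        // id -> _version_ maps taken from the leader and from the replica.
        static void report(Map<String, Long> leader, Map<String, Long> replica) {
            for (Map.Entry<String, Long> e : leader.entrySet()) {
                Long replicaVersion = replica.get(e.getKey());
                if (replicaVersion == null) {
                    System.out.println("missing on replica: " + e.getKey());
                } else if (!replicaVersion.equals(e.getValue())) {
                    System.out.println("stale on replica: " + e.getKey()
                            + " leader=" + e.getValue()
                            + " replica=" + replicaVersion);
                }
            }
            for (String id : replica.keySet()) {
                if (!leader.containsKey(id)) {
                    System.out.println("unexpected on replica (missed delete?): " + id);
                }
            }
        }

        public static void main(String[] args) {
            Map<String, Long> leader = new HashMap<>();
            Map<String, Long> replica = new HashMap<>();
            leader.put("doc1", 152L);
            leader.put("doc2", 153L);
            replica.put("doc1", 151L);   // stale copy of doc1, doc2 missing
            report(leader, replica);
        }
    }

Any document flagged this way is also one we could re-index individually, per 
Jeff’s earlier suggestion, rather than rebuilding the whole collection.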


Regards,

David



On 1/28/16, 1:08 PM, "Tomás Fernández Löbbe" <tomasflo...@gmail.com> wrote:

>Maybe you are hitting the reordering issue described in SOLR-8129?
>
>Tomás
>
>On Wed, Jan 27, 2016 at 11:32 AM, David Smith <dsmiths...@yahoo.com.invalid>
>wrote:
>
>> Sure.  Here is our SolrCloud cluster:
>>
>>    + Three (3) instances of Zookeeper on three separate (physical)
>> servers.  The ZK servers are beefy and fairly recently built, with 2x10
>> GigE (bonded) Ethernet connectivity to the rest of the data center.  We
>> recognize the importance of the stability and responsiveness of ZK to the
>> stability of SolrCloud as a whole.
>>
>>    + 364 collections, each with a single shard and a replication factor of
>> 3.  Currently housing only 100,000,000 documents in aggregate, but expected to
>> grow to 25 billion+.  A single document would be considered “large” by the
>> standards of what I’ve seen posted elsewhere on this mailing list.
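>>
>> For concreteness, each of these collections is created with numShards=1 and
>> replicationFactor=3 through the Collections API.  A rough sketch of that call
>> in Java (the host, collection name, and config name are placeholders):
>>
>>     import java.net.HttpURLConnection;
>>     import java.net.URL;
>>
>>     public class CreateCollection {
>>         public static void main(String[] args) throws Exception {
>>             // Placeholder host and names; numShards/replicationFactor
>>             // match the layout described above.
>>             String url = "http://solr-host:8983/solr/admin/collections"
>>                     + "?action=CREATE&name=example_collection"
>>                     + "&numShards=1&replicationFactor=3"
>>                     + "&collection.configName=example_config";
>>             HttpURLConnection conn =
>>                     (HttpURLConnection) new URL(url).openConnection();
>>             System.out.println("HTTP " + conn.getResponseCode());
>>         }
>>     }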
>>
>> We are always open to ZK recommendations from you or anyone else,
>> particularly for running a SolrCloud cluster of this size.
>>
>> Kind Regards,
>>
>> David
>>
>>
>>
>> On 1/27/16, 12:46 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:
>>
>> >
>> >If you can identify the problem documents, you can just re-index those
>> after forcing a sync. Might save a full rebuild and downtime.
>> >
>> >You might describe your cluster setup, including ZK. It sounds like
>> you’ve done your research, but improper ZK node distribution could
>> certainly invalidate some of Solr’s assumptions.
>> >
>> >
>> >
>> >
>> >On 1/27/16, 7:59 AM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:
>> >
>> >>Jeff, again, very much appreciate your feedback.
>> >>
>> >>It is interesting: the article by Shalin that you linked to is exactly why
>> we picked SolrCloud over ES, because (eventual) consistency is critical for
>> our application and we will sacrifice availability for it.  To be clear,
>> after the outage, NONE of our three replicas are correct or complete.
>> >>
>> >>So we definitely don’t have CP yet — our very first network outage
>> resulted in multiple overlapping lost updates.  As a result, I can’t pick
>> one replica and make it the new “master”.  I must rebuild this collection
>> from scratch, which I can do, but that requires downtime, which is a problem
>> in our app (24/7 High Availability with few maintenance windows).
>> >>
>> >>
>> >>So, I definitely need to “fix” this somehow.  I wish I could outline a
>> reproducible test case, but as the root cause likely involves very tight timing
>> issues and complicated interactions with Zookeeper, that is not really an
>> option.  I’m happy to share the full logs of all 3 replicas though if that
>> helps.
>> >>
>> >>I am curious, though, whether the thinking has changed since
>> https://issues.apache.org/jira/browse/SOLR-5468 about seriously considering
>> a “majority quorum” model with rollback.  Done properly, this should be
>> free of all lost-update problems, at the cost of availability.  Some
>> SolrCloud users (like us!!!) would gladly accept that tradeoff.
>> >>
>> >>Regards
>> >>
>> >>David
>> >>
>> >>
>>
>>
