[ 
https://issues.apache.org/jira/browse/SOLR-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403041#comment-13403041
 ] 

Per Steffensen commented on SOLR-3582:
--------------------------------------

Trym didn't mention it, but this is not a negligible problem that will never 
matter in real-world usage. We actually discovered it during a 
performance/endurance test of our real-world application, in a real-world 
setup under a high real-world workload. We are running numerous Solr instances 
in a SolrCloud cluster, with numerous collections, each having about 25 slices 
with 2 shards each (one replica for each slice). During the test the Solr 
instances lose their ZK connection (probably due to overly long GC pauses) and 
reconnect, resulting in more watchers. The next time a disconnect/reconnect to 
ZK happens, each instance gets many watcher events, resulting in even more 
watchers for the next time. Seen from the outside, this breaks our 
performance/endurance test: at first things start to slow down, and eventually 
the JVMs break down with OOM errors. The problem is self-reinforcing: on every 
iteration the garbage collector has to spend more time collecting watchers 
(twice as many as last time), which increases the probability of new ZK 
timeouts, and more time has to be spent creating new watchers (twice as many 
as last time).
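
To make the doubling concrete, here is a minimal, self-contained sketch of the 
growth. It is a simulation only, not Solr or ZooKeeper code: the class, enum 
and method names are hypothetical, and it assumes that connection-state events 
are delivered to every registered watcher without removing it, and that the 
buggy handler reacts to such an event by registering a fresh watcher.

```java
import java.util.ArrayList;
import java.util.List;

// Simulation only: all names here are hypothetical, not Solr/ZooKeeper APIs.
class WatcherGrowthSketch {
    // NONE models a connection-state notification (connect/disconnect),
    // NODE_CHANGED models a real change on the watched node.
    enum EventType { NONE, NODE_CHANGED }

    // Stand-in for a watcher registered for leader election.
    interface Watcher { void process(EventType type); }

    private final List<Watcher> registered = new ArrayList<>();
    private final boolean ignoreConnectionEvents;

    WatcherGrowthSketch(boolean ignoreConnectionEvents) {
        this.ignoreConnectionEvents = ignoreConnectionEvents;
    }

    // Registers a watcher that re-registers a new watcher when it fires.
    void registerWatcher() {
        registered.add(new Watcher() {
            @Override
            public void process(EventType type) {
                // The fix: connection-state events are not node changes.
                if (ignoreConnectionEvents && type == EventType.NONE) {
                    return;
                }
                // Buggy path: treat the event as a node change and
                // set a fresh watcher while the old one stays registered.
                registerWatcher();
            }
        });
    }

    // Assumption: a connection event reaches every watcher and consumes none.
    void deliverConnectionEvent() {
        List<Watcher> snapshot = new ArrayList<>(registered);
        for (Watcher w : snapshot) {
            w.process(EventType.NONE);
        }
    }

    int watcherCount() {
        return registered.size();
    }

    // Starts with one watcher and returns the count after the given
    // number of connection-state events.
    public static int simulate(int connectionEvents, boolean ignoreConnectionEvents) {
        WatcherGrowthSketch sketch = new WatcherGrowthSketch(ignoreConnectionEvents);
        sketch.registerWatcher();
        for (int i = 0; i < connectionEvents; i++) {
            sketch.deliverConnectionEvent();
        }
        return sketch.watcherCount();
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + simulate(10, false)); // prints "buggy: 1024"
        System.out.println("fixed: " + simulate(10, true));  // prints "fixed: 1"
    }
}
```

Under these assumptions a single watcher turns into 1024 after ten connection 
events, while a handler that ignores connection-state events stays at one.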

I think you should commit the fix, basically because it makes a (our) 
real-world application able to run for a long time, which it couldn't before. 
Commit the fix not so much for our sake, because we are using our own build of 
Solr anyway (incl. this fix, other fixes, and a nice implementation of 
optimistic locking etc. (SOLR-3173, SOLR-3178, etc.)), but to save others (who 
might also be among the "first movers" using Solr 4.0 for high-scale 
real-world applications) from having to spend weeks tracking down the essence 
of this issue and making a fix.

If you think this observation/fix should lead to a walk-through of the code, 
to check whether watchers are used undesirably in other places, and maybe even 
to a more generic fix, I would endorse such a task. But for now I urge you to 
commit, to save others from weeks of debugging. If/when you come up with a 
better or more generic solution, you can always refactor.

Regards, Per Steffensen
                
> Leader election zookeeper watcher is responding to con/discon notifications 
> incorrectly.
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-3582
>                 URL: https://issues.apache.org/jira/browse/SOLR-3582
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 4.0, 5.0
>
>
> As brought up by Trym R. Møller on the mailing list, we are responding to 
> watcher events about connection/disconnection as if they were notifications 
> about node changes.
> http://www.lucidimagination.com/search/document/e13ef390b88eeee2



