[
https://issues.apache.org/jira/browse/SOLR-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436175#comment-13436175
]
Mark Miller commented on SOLR-3721:
-----------------------------------
bq. Believe there is nothing solid to do!
Well, we can do some practical things, right? I don't think we need to support
a node coming back from the dead a year later with updates the cluster doesn't
have. A node coming up 2 minutes later is something we do want to worry about,
though.
So basically we need something either timing-based or admin-command-based that
lets you start a cold shard (slice :)): each node waits around for X amount of
time or until command X is received, and only then does leader election begin.
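
A minimal sketch of that "wait for X time or until command X" gate, to make the
idea concrete. Everything here is hypothetical: ColdStartGate, release() and
awaitElectionStart() are illustrative names, not existing Solr or ZooKeeper
APIs, and how the admin command would actually reach each node is left open.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical gate (not a Solr API) that holds a node back from leader election. */
final class ColdStartGate {
    /** Why the node was allowed to proceed to leader election. */
    enum Release { ADMIN_COMMAND, TIMEOUT, INTERRUPTED }

    private final CountDownLatch released = new CountDownLatch(1);
    private final long maxWaitMillis;

    /** maxWaitMillis is the "X amount of time" each node is willing to wait. */
    ColdStartGate(long maxWaitMillis) {
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Called when the assumed admin "go ahead" command is received. */
    void release() {
        released.countDown();
    }

    /** Blocks until release() is called or maxWaitMillis elapses, whichever comes first. */
    Release awaitElectionStart() {
        try {
            boolean commanded = released.await(maxWaitMillis, TimeUnit.MILLISECONDS);
            return commanded ? Release.ADMIN_COMMAND : Release.TIMEOUT;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return Release.INTERRUPTED;
        }
    }
}
```

A node would call awaitElectionStart() before joining the election. The timeout
covers the "node comes up 2 minutes later" case; anything that arrives after
the gate has opened would have to recover from whichever leader was elected.
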
> Multiple concurrent recoveries of same shard?
> ---------------------------------------------
>
> Key: SOLR-3721
> URL: https://issues.apache.org/jira/browse/SOLR-3721
> Project: Solr
> Issue Type: Bug
> Components: multicore, SolrCloud
> Affects Versions: 4.0
> Environment: Using our own Solr release based on Apache revision
> 1355667 from the 4.x branch. Our changes to this Solr version are our
> solutions to TLT-3178 etc., and should have no effect on this issue.
> Reporter: Per Steffensen
> Labels: concurrency, multicore, recovery, solrcloud
> Fix For: 4.0
>
> Attachments: recovery_in_progress.png, recovery_start_finish.log
>
>
> We run a performance/endurance test on a SolrCloud setup with 7 Solr
> instances, and eventually the Solr instances lose their ZK connections and go
> into recovery. BTW, the recovery often never succeeds, but we are looking into
> that. While doing so I noticed that, according to the logs, multiple
> recoveries are in progress at the same time for the same shard. That cannot be
> intended, and I can certainly imagine that it will cause some problems.
> Is it just the logs that are wrong, did I make some mistake, or is this a
> real bug?
> See attached grep from log, grepping only on "Finished recovery" and
> "Starting recovery" logs.
> {code}
> grep -B 1 "Finished recovery\|Starting recovery" solr9.log solr8.log \
>   solr7.log solr6.log solr5.log solr4.log solr3.log solr2.log solr1.log \
>   solr0.log > recovery_start_finish.log
> {code}
> It can be hard to get an overview of the log, so I have generated a graph
> showing (based solely on the "Starting recovery" and "Finished recovery" log
> lines) how many recoveries are in progress at any time for the different
> shards. See the attached recovery_in_progress.png. The graph is also a little
> hard to read (due to the many shards), but it is clear that for several shards
> there are multiple recoveries going on at the same time, and that several
> recoveries never succeed.
> Regards, Per Steffensen
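
The counting behind a graph like the one described in the quoted report can be
sketched as follows. This is only an illustration under stated assumptions: it
works on events that have already been parsed out of the "Starting recovery" /
"Finished recovery" lines, because the exact log line format is not shown here,
and RecoveryOverlap, Event and peakConcurrent are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: peak number of simultaneous recoveries per shard. */
final class RecoveryOverlap {
    /** One already-parsed log event; this field layout is an assumption. */
    static final class Event {
        final long timeMillis;
        final String shard;
        final boolean start; // true = "Starting recovery", false = "Finished recovery"

        Event(long timeMillis, String shard, boolean start) {
            this.timeMillis = timeMillis;
            this.shard = shard;
            this.start = start;
        }
    }

    /** Events must be sorted by timeMillis. Returns shard -> peak recoveries in progress. */
    static Map<String, Integer> peakConcurrent(List<Event> events) {
        Map<String, Integer> inProgress = new HashMap<>();
        Map<String, Integer> peak = new HashMap<>();
        for (Event e : events) {
            int now = inProgress.getOrDefault(e.shard, 0) + (e.start ? 1 : -1);
            now = Math.max(now, 0); // ignore a finish that has no matching start
            inProgress.put(e.shard, now);
            peak.merge(e.shard, now, Math::max);
        }
        return peak;
    }
}
```

Any shard whose peak is above 1 had overlapping recoveries according to the
logs; a shard whose in-progress count never returns to 0 has a recovery that
did not log a finish.
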