[
https://issues.apache.org/jira/browse/SOLR-3755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631384#comment-13631384
]
Shalin Shekhar Mangar commented on SOLR-3755:
---------------------------------------------
bq. Anshum suggested over chat that we should think about combining
ShardSplitTest and ChaosMonkeyShardSplit tests into one to avoid code
duplication. I'll try to see if we can do that.
I've changed ChaosMonkeyShardSplitTest to extend ShardSplitTest so that we can
share most of the code. The ChaosMonkey test is not completely correct and I
intend to improve it.
bq. The original change around this made preRegister start taking a core rather
than a core descriptor. I'd like to work that out so it doesn't need to be the
case.
I'll revert the change to the preRegister method signature and find another way.
I've found two kinds of test failures in (ChaosMonkey)ShardSplitTest.
The first is because of the following sequence of events:
# A doc addition fails (because of the kill leader jetty command), the client
throws an exception, and therefore the docCount variable is not incremented
inside the index thread.
# However, the doc addition is recorded in the update logs (of the proxy node?)
and replayed on the new leader, so in reality the doc does get added to the
shard.
# The split happens and we assert that our docCounts equal those in the server,
which fails because the server has the document that we have not counted.
This happens mostly with Lucene-Solr-Tests-4.x-Java6 builds. The bug is in the
tests and not in the split code. Following is the stack trace:
{code}
[junit4:junit4] 1> ERROR - 2013-04-14 14:24:27.697;
org.apache.solr.cloud.ChaosMonkeyShardSplitTest$1; Exception while adding doc
[junit4:junit4] 1> org.apache.solr.client.solrj.SolrServerException: No live
SolrServers available to handle this
request:[http://127.0.0.1:34203/h/y/collection1,
http://127.0.0.1:34304/h/y/collection1, http://127.0.0.1:34311/h/y/collection1,
http://127.0.0.1:34270/h/y/collection1]
[junit4:junit4] 1> at
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:333)
[junit4:junit4] 1> at
org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:306)
[junit4:junit4] 1> at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
[junit4:junit4] 1> at
org.apache.solr.cloud.AbstractFullDistribZkTestBase.indexDoc(AbstractFullDistribZkTestBase.java:561)
[junit4:junit4] 1> at
org.apache.solr.cloud.ChaosMonkeyShardSplitTest.indexr(ChaosMonkeyShardSplitTest.java:434)
[junit4:junit4] 1> at
org.apache.solr.cloud.ChaosMonkeyShardSplitTest$1.run(ChaosMonkeyShardSplitTest.java:158)
[junit4:junit4] 1> Caused by: org.apache.solr.common.SolrException: Server at
http://127.0.0.1:34311/h/y/collection1 returned non ok status:503,
message:Service Unavailable
[junit4:junit4] 1> at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
[junit4:junit4] 1> at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
[junit4:junit4] 1> at
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)
[junit4:junit4] 1> ... 5 more
{code}
Perhaps we should check the exception message and continue to count such a
document?
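A minimal sketch of that idea, assuming the test tracks document ids rather than a bare counter. The class DocCountTracker and its methods are hypothetical and not part of the Solr test framework; the only detail taken from the trace above is the "No live SolrServers available" message fragment. Treating that message as "the add may still have been applied" is an assumption, so the sketch accepts a range of server counts instead of a single value.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for the indexing thread; not an existing Solr class.
class DocCountTracker {
    // Assumption: this fragment (copied from the stack trace) marks a failure
    // after which the add may still have been logged and replayed.
    static final String AMBIGUOUS_MARKER = "No live SolrServers available";

    // Ids whose add was acknowledged by the server.
    private final Set<String> confirmed = ConcurrentHashMap.newKeySet();
    // Ids whose add threw an ambiguous exception; they may or may not exist.
    private final Set<String> uncertain = ConcurrentHashMap.newKeySet();

    void recordSuccess(String id) {
        uncertain.remove(id);
        confirmed.add(id);
    }

    // Records a failed add and returns true if the failure was ambiguous.
    boolean recordFailure(String id, Throwable error) {
        boolean ambiguous = isAmbiguous(error);
        if (ambiguous && !confirmed.contains(id)) {
            uncertain.add(id);
        }
        return ambiguous;
    }

    // Walks the cause chain looking for the marker message.
    static boolean isAmbiguous(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            String msg = t.getMessage();
            if (msg != null && msg.contains(AMBIGUOUS_MARKER)) {
                return true;
            }
        }
        return false;
    }

    // The server count is acceptable if it lies between the number of
    // confirmed docs and confirmed plus uncertain docs.
    boolean isConsistentWith(long serverCount) {
        long min = confirmed.size();
        long max = min + uncertain.size();
        return serverCount >= min && serverCount <= max;
    }
}
```

With this shape the final assertion would become a range check such as tracker.isConsistentWith(numFound), which is weaker than an exact count. Querying the server for each uncertain id after the split would be a stricter alternative.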
The second kind of test failure is where a document add fails due to a version
conflict. This exception is always seen just after "updateshardstate" is
called to switch the shard states. The relevant log follows:
{code}
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.861;
org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state invoked
for collection: collection1
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.861;
org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1
to inactive
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.861;
org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1_0
to active
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.861;
org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1_1
to active
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.873;
org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=
path=/update params={wt=javabin&version=2} {add=[169 (1432319507166134272)]} 0 2
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.877;
org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change:
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json,
has occurred - updating... (live nodes size: 5)
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.877;
org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change:
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json,
has occurred - updating... (live nodes size: 5)
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.877;
org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change:
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json,
has occurred - updating... (live nodes size: 5)
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.877;
org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change:
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json,
has occurred - updating... (live nodes size: 5)
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.877;
org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change:
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json,
has occurred - updating... (live nodes size: 5)
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.877;
org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change:
WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json,
has occurred - updating... (live nodes size: 5)
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.884;
org.apache.solr.update.processor.LogUpdateProcessor;
[collection1_shard1_1_replica1] webapp= path=/update
params={distrib.from=http://127.0.0.1:41028/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2}
{} 0 1
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.885;
org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=
path=/update
params={distrib.from=http://127.0.0.1:41028/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2}
{add=[169 (1432319507173474304)]} 0 2
[junit4:junit4] 1> ERROR - 2013-04-14 19:05:26.885;
org.apache.solr.common.SolrException; shard update error StdNode:
http://127.0.0.1:41028/collection1_shard1_1_replica1/:org.apache.solr.common.SolrException:
version conflict for 169 expected=1432319507173474304 actual=-1
[junit4:junit4] 1> at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:404)
[junit4:junit4] 1> at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
[junit4:junit4] 1> at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
[junit4:junit4] 1> at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
[junit4:junit4] 1> at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
[junit4:junit4] 1> at
java.util.concurrent.FutureTask.run(FutureTask.java:166)
[junit4:junit4] 1> at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
[junit4:junit4] 1> at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
[junit4:junit4] 1> at
java.util.concurrent.FutureTask.run(FutureTask.java:166)
[junit4:junit4] 1> at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
[junit4:junit4] 1> at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[junit4:junit4] 1> at java.lang.Thread.run(Thread.java:679)
[junit4:junit4] 1>
[junit4:junit4] 1> INFO - 2013-04-14 19:05:26.886;
org.apache.solr.update.processor.DistributedUpdateProcessor; try and ask
http://127.0.0.1:41028 to recover
{code}
I'm not sure yet why a version conflict happens or why it follows an
"updateshardstate" command.
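As a reading aid for the error in the log above, here is a simplified model of the optimistic-concurrency rule documented for Solr's _version_ field. The class VersionCheck is hypothetical and is not Solr's implementation; it assumes, as the log message suggests, that a missing document is reported as actual=-1.

```java
// Hypothetical, simplified model of the documented _version_ semantics.
final class VersionCheck {
    private VersionCheck() {}

    /**
     * @param expected the _version_ value sent with the update
     * @param actual   the version stored for the document, or -1 if absent (assumption)
     * @return true if the update would pass the version check
     */
    static boolean passes(long expected, long actual) {
        if (expected == 0) {
            return true;              // no version check requested
        }
        if (expected < 0) {
            return actual < 0;        // document must not exist
        }
        if (expected == 1) {
            return actual > 0;        // document must exist, any version
        }
        return expected == actual;    // versions must match exactly
    }
}
```

Under this model, expected=1432319507173474304 with actual=-1 fails because the receiving core has no document 169 at the time of the check. That only restates the symptom; it does not explain why the sub-shard lacks the document.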
> shard splitting
> ---------------
>
> Key: SOLR-3755
> URL: https://issues.apache.org/jira/browse/SOLR-3755
> Project: Solr
> Issue Type: New Feature
> Components: SolrCloud
> Reporter: Yonik Seeley
> Assignee: Shalin Shekhar Mangar
> Fix For: 4.3, 5.0
>
> Attachments: SOLR-3755-combined.patch,
> SOLR-3755-combinedWithReplication.patch, SOLR-3755-CoreAdmin.patch,
> SOLR-3755.patch, SOLR-3755.patch, SOLR-3755.patch, SOLR-3755.patch,
> SOLR-3755.patch, SOLR-3755.patch, SOLR-3755.patch, SOLR-3755.patch,
> SOLR-3755.patch, SOLR-3755.patch, SOLR-3755-testSplitter.patch,
> SOLR-3755-testSplitter.patch
>
>
> We can currently easily add replicas to handle increases in query volume, but
> we should also add a way to add additional shards dynamically by splitting
> existing shards.