replica placement more reproducible

Hoss Man (JIRA) Fri, 25 Mar 2016 15:23:49 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212498#comment-15212498
 ]


Hoss Man commented on SOLR-8907:
--------------------------------


The motivation for creating this issue came out of a situation I noticed while 
working on SOLR-445.

The goal was to test that updates were working reliably regardless of which 
node they were routed to.

The test, in a nutshell, looked like this...

{code}
// tests setup...
cluster.createCollection(...);
CLOUD_CLIENT = cluster.getSolrClient();
NODE_CLIENTS = new ArrayList<SolrClient>(numServers);
for (JettySolrRunner jetty : cluster.getJettySolrRunners()) {
  URL jettyURL = jetty.getBaseUrl();
  NODE_CLIENTS.add(new HttpSolrClient(jettyURL.toString() + "/" + COLLECTION_NAME + "/"));
}


// in a loop...
SolrRequest req = makeRandomUpdateRequest(random());
SolrClient client = random().nextBoolean() ? CLOUD_CLIENT
   : NODE_CLIENTS.get(TestUtil.nextInt(random(), 0, NODE_CLIENTS.size()-1));
assertSomeStuffAboutResponse(req.process(client));
{code}

There was a bug in the code such that in some specific situations (based on the 
output of {{makeRandomUpdateRequest(...)}}) updates meeting certain criteria 
would fail _unless_ they were sent to the leader of a particular shard 
(particular because it was the leader for all the Ids generated by 
{{makeRandomUpdateRequest(...)}} in that particular loop iteration).

This meant that there were particular seeds that would reliably reproduce the 
failure _most of the time_, but in roughly {{1 / numServers}} of the attempts the 
leader for the particular shard in question would randomly be assigned to the 
jetty instance whose HttpSolrClient was randomly (but consistently for this 
seed) being selected at this point, and the failure would not reproduce.

That made the test far more confusing to try and debug than if the leaders for 
the shards were being consistently assigned to the same jetty nodes (relative 
to their ordering in the list returned by {{cluster.getJettySolrRunners()}}) 
... like how older, pre-cloud, distributed update tests used to work.

In short: given a fixed seed, the test code was doing everything in its power 
to be 100% consistent w/ the requests it generated and the jetty nodes those 
requests were sent to -- but the test still wasn't very reproducible because 
the shard & leader assignments were random.
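
The effect described above can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions: it does not use Solr at all, the class and method names ({{LeaderPlacementDemo}}, {{pickClient}}, {{fractionLeaderOnPickedNode}}) are hypothetical, and it assumes the leader lands on each node with equal probability. The test seed fixes which node client is picked, while a separate, unseeded source of randomness stands in for leader assignment.

```java
import java.util.Random;

public class LeaderPlacementDemo {

  /** Index of the node client the test picks; depends only on the test seed. */
  static int pickClient(long testSeed, int numServers) {
    return new Random(testSeed).nextInt(numServers);
  }

  /**
   * Fraction of simulated runs in which the shard leader happens to land on
   * the node whose client the test picked. Leader placement is modelled as
   * uniformly random (an assumption) and is NOT controlled by the test seed.
   */
  static double fractionLeaderOnPickedNode(long testSeed, int numServers,
                                           int runs, Random clusterRandomness) {
    int picked = pickClient(testSeed, numServers);
    int hits = 0;
    for (int i = 0; i < runs; i++) {
      int leaderNode = clusterRandomness.nextInt(numServers);
      if (leaderNode == picked) {
        hits++;
      }
    }
    return (double) hits / runs;
  }

  public static void main(String[] args) {
    int numServers = 5;
    double f = fractionLeaderOnPickedNode(42L, numServers, 100_000, new Random());
    System.out.printf("leader on picked node in %.3f of runs (1/numServers = %.3f)%n",
        f, 1.0 / numServers);
  }
}
```

With the same seed, {{pickClient}} always returns the same index, yet the fraction of runs in which the leader coincides with that node stays near {{1 / numServers}}, which is why a failing seed would only reproduce most of the time.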

----

I suspect that the best way to try and implement something like this would be 
to use the [rule based replica placement|https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement] 
feature -- perhaps with a special "Snitch" designed for use in 
MiniSolrCloudCluster tests? ... But I'm not really sure how it would work 
because I don't really understand how to use / extend that feature.


So assuming for the sake of argument that it's not possible using the rule 
based placement stuff, here's a description of the approach that initially 
occurred to me to serve as a straw man for discussion...

* If it's not already, {{MiniSolrCloudCluster}} should ensure every Jetty 
instance is started up with a consistent node name (sequentially numbered or 
whatever)
* If it's not already, {{MiniSolrCloudCluster.getJettySolrRunners()}} should 
return the jetty instances in a consistently sorted order (based on something 
like node name -- not something non-deterministic like the port#, or order that 
they started up)
* {{MiniSolrCloudCluster.createCollection(...)}} (or some new method with a 
similar signature) should be changed to more explicitly do a lot of work 
currently done implicitly by the {{CREATE}} API call...
** use the {{shards}} param to provide explicitly generated names for every 
shard 
** use the {{createNodeSet=EMPTY}} param
** Once the collection is created (w/o any replicas)...
*** {{ADDREPLICA}} and {{ADDREPLICAPROP}} should be used explicitly to create a 
preferredLeader for each (named) {{shard}} and assign it to a predictably chosen 
{{node}} (by name).
*** Additional {{ADDREPLICA}} calls should then be made as needed to add the 
expected number of replicas for each {{shard}} on predictably chosen {{node}}s 
(by name).
* {{MiniSolrCloudCluster}} could then support some new convenience methods for 
tests to use:
** Things like...
*** {{List<HttpSolrClient> getClientsForAllReplicas(String collectionName)}}
*** {{List<HttpSolrClient> getClientsForShard(String collectionName, String 
shardName)}}
*** {{SortedMap<String,HttpSolrClient> getClientsForLeaders(String 
collectionName) // keyed by shardName}}
*** {{HttpSolrClient getClientForLeader(String collectionName, String 
shardName)}}
** These methods should do a "live" lookup of the data currently in ZK, so that 
even if a test shuts down nodes, or adds replicas, or triggers some bit of 
chaos they can still subsequently look up a useful SolrClient to test some 
action with
** Obviously these methods should return all clients in a consistent order (ie: 
sort by core node name)
** (See {{TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster()}} for 
some sample code of building up SolrClients targeting shard leaders)
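
As a rough illustration of the "predictably chosen {{node}}" part of the straw man above, here is a minimal sketch of a placement plan computed purely from sorted node names. It is not existing {{MiniSolrCloudCluster}} code: the class name {{DeterministicPlacement}}, the {{plan}} method, the {{shardN}} naming scheme and the round-robin strategy are all assumptions made for the example. The first node listed for each shard is the one that would receive the first {{ADDREPLICA}} call and be marked as preferred leader.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class DeterministicPlacement {

  /**
   * Returns shardName -> node names. The first node in each list is the
   * intended preferred leader; the rest host the additional replicas.
   */
  public static Map<String, List<String>> plan(Collection<String> nodeNames,
                                               int numShards,
                                               int replicasPerShard) {
    // Sort (and de-duplicate) node names so the plan never depends on
    // startup order or port numbers.
    List<String> nodes = new ArrayList<>(new TreeSet<>(nodeNames));
    if (replicasPerShard > nodes.size()) {
      throw new IllegalArgumentException(
          "need at least " + replicasPerShard + " distinct nodes, got " + nodes.size());
    }
    Map<String, List<String>> plan = new LinkedHashMap<>();
    for (int s = 0; s < numShards; s++) {
      List<String> replicaNodes = new ArrayList<>();
      for (int r = 0; r < replicasPerShard; r++) {
        // Round-robin: shard s starts at node s and wraps around the list.
        replicaNodes.add(nodes.get((s + r) % nodes.size()));
      }
      plan.put("shard" + (s + 1), replicaNodes); // hypothetical shard naming
    }
    return plan;
  }

  public static void main(String[] args) {
    // Input order is deliberately scrambled; the output is still stable.
    System.out.println(plan(List.of("node3", "node1", "node2"), 2, 2));
    // prints {shard1=[node1, node2], shard2=[node2, node3]}
  }
}
```

Note that plain string sorting puts {{node10}} before {{node2}}; that is still deterministic, but zero-padded node names would give a more intuitive order.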



...what do folks think?

Is this possible/easy using a custom "snitch"?

> add features to MiniSolrCloudCluster to make shard/leader/replica placement 
> more reproducible
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8907
>                 URL: https://issues.apache.org/jira/browse/SOLR-8907
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Hoss Man
>
> I think MiniSolrCloudCluster would be greatly improved if (by default) 
> collections created for test purposes had predictable shard/leader/core 
> assignment across the jetty instances that are spun up.  Even though the 
> port#s used by the jettys will obviously vary every time a test is run, 
> ideally a given seed should ensure that the following are all consistent:
> * the node_name used by each JettySolrRunner
> * which nodes host which shards
> * the core names used on each jetty instance
> * which core is the leader for each shard
> Obviously this wouldn't make sense for tests where the entire purpose is to 
> ensure that the automatic assignment of these things works properly when 
> creating a collection, or when explicitly testing things like 
> "preferredLeader", but for tests of non-collection API related features (ie: 
> update requests, search requests, sorting, etc...) where the test setup 
> already takes advantage of methods like 
> {{MiniSolrCloudCluster.createCollection(...)}} as a short cut to using the 
> API directly, this type of consistency would make potential test failures a 
> lot more reproducible && easier to diagnose.


