Re: Terminology question: Core vs. Collection vs...

Per Steffensen Fri, 04 Jan 2013 11:15:16 -0800

It was a very good explanation, Jack!

I believe I have heard most of it before, so it is really not new for me. I DO understand that the names "replica" and "replication-factor" CAN be justified, but it requires a long and thorough explanation. And that's the point. A good name for a concept means that:
* The name is among the first that pop up in your mind when you think about the concept, or at least you can give a very short explanation of why you chose this name for that concept
* When a (fairly educated) newcomer hears the name for the first time, his first thoughts about the concept it covers are as close as possible to the actual concept


Good metrics for whether or not we have good names must therefore be
1) The frequency of questions about the concepts behind the names

2) The frequency of wrong usage of names (cases where people actually didn't understand the concept behind the name, didn't ask (1. above) and just used the name for what they thought it meant)

3) The length of the explanation of why you chose this name for that concept

Ad 1)

I counted several questions just this week. In particular I noted "Replica (Replica of _what_?)" in the original post of this thread. Whether we want it or not, newcomers will keep "not getting" the concept of replica or getting it wrong. Why? Because it is a bad name.

Ad 2)

I also counted several cases where names were used completely wrong this week.

Ad 3)

Take a look at the length of Jack's great post below, and take a look at the length of this mail thread.

I believe we will do better on the metrics if we use node/collection/shard/shard-instance/index instead of node/collection/shard/replica/(core/)index, and use instances-per-shard instead of replication-factor. And say that "core" is the same as a "shard-instance", but typically used in a non/pre-Cloud context. That index is a physical Lucene thing - and nothing but that. That collections and shards are logical concepts. That a shard-instance is a physical instance of a shard implemented using a Lucene index persisting its data on physical disk.

My only interest here is to try to pull the project in a good direction. You just get my opinion. Keep it simple and no bullshit.

This entire discussion is great I think, but it probably belongs on the dev list (or maybe in a JIRA). I believe Alexandre Rafalovitch got his answer already :-) To the extent a clean answer exists at the moment.


Regards, Per Steffensen

On 1/4/13 2:54 PM, Jack Krupansky wrote:

Replication makes perfect sense even if our explanations so far do not.

A shard is an abstraction of a subset of the data for a collection.
A replica is an instance of the data of the shard and instances of Solr servers that have indicated a readiness to service queries and updates for the data. Alternatively, a replica is a node which has indicated a readiness to receive and serve the data of a shard, but may not have any data at the moment.
Let's describe it operationally for SolrCloud: If data comes in to any replica of a shard it will automatically and quickly be "replicated" to all other replicas of the shard. If a new replica of a shard comes up it will be streamed all of the data from another replica of the shard. If an existing replica of a shard restarts or reconnects to the cluster, it will be streamed updates of any new data since it was last updated from another replica of the shard.
Replication is simply the process of assuring that all replicas are kept up to date. That's the same abstract meaning as for Master/Slave even though the operational details are somewhat different. The goal remains the same.
Replication factor is the number of instances of the data of the shard and instances of Solr servers that can service queries and updates for the data. Alternatively, the replication factor is the number of nodes of the SolrCloud cluster which have indicated a readiness to receive and serve the data of a shard, but may not have any data at the moment.
A node is an instance of Solr running in a Java JVM that has indicated to the Zookeeper ensemble of a SolrCloud cluster that it is ready to be a replica for a shard of a collection. [The latter part of that is a bit too fuzzy - I'm not sure what the node tells Zookeeper and who does shard assignment. I mean, does a node explicitly say what shard it wants to be, or is that assigned by Zookeeper, or is that a node's choice/option? But none of that changes the fact that a node "registers" with Zookeeper and then somehow becomes a replica for a shard.]
A node (instance of a Solr server) can be a replica of shards from multiple collections (potentially multiple shards per collection). A node is not a replica per se, but a container that can serve multiple collections. A node can serve as multiple replicas, each of a different collection.
My only interest here on this user list is to understand and explain the terms we have today and that SEEM to be working for the most part, even though we may not have defined them carefully enough and used them consistently enough.
If somebody wants to propose an alternative terminology - fine, discuss that on the dev list and/or file a Jira.
I won't claim that my definitions are perfect (yet), but perfecting the definitions (for users) should be separated from changing the terms themselves.
-- Jack Krupansky
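
The definitions above can be sketched as a small data model. This is a minimal illustration in Python, not Solr's API: every class, field, and function name here (Replica, Shard, Collection, replicas_on_node) is an assumption made up for the example, and the "replication" is simulated with in-memory dicts rather than anything SolrCloud actually does.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical instance of a shard, hosted on a node (hypothetical model)."""
    node: str                                  # name of the hosting node
    docs: dict = field(default_factory=dict)   # may be empty at first


@dataclass
class Shard:
    """Logical subset of a collection's data, served by its replicas."""
    name: str
    replicas: list = field(default_factory=list)

    @property
    def replication_factor(self) -> int:
        # Number of replicas registered for this shard, with or without data.
        return len(self.replicas)

    def add_replica(self, node: str) -> Replica:
        # A new replica is streamed all data from an existing replica, if any.
        replica = Replica(node)
        if self.replicas:
            replica.docs.update(self.replicas[0].docs)
        self.replicas.append(replica)
        return replica

    def update(self, doc_id: str, doc: dict) -> None:
        # Data arriving for the shard is replicated to every replica.
        for replica in self.replicas:
            replica.docs[doc_id] = doc


@dataclass
class Collection:
    """Logical grouping of shards."""
    name: str
    shards: dict = field(default_factory=dict)


def replicas_on_node(collections, node):
    """Return (collection, shard) pairs for which the node hosts a replica."""
    return [
        (c.name, s.name)
        for c in collections
        for s in c.shards.values()
        for r in s.replicas
        if r.node == node
    ]
```

In this sketch a node is just a name shared by several Replica objects, which mirrors the point that a node is a container and not a replica per se.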

-----Original Message----- From: Per Steffensen
Sent: Friday, January 04, 2013 2:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

On 1/3/13 5:58 PM, Walter Underwood wrote:
A "factor" is multiplied, so multiplying the leader by a replicationFactor of 1 means you have exactly one copy of that shard.
I think that recycling the term "replication" within Solr was confusing, but it is a bit late to change that.
wunder
Yes, the term "factor" is not misleading, but the term "replication" is.
If we keep calling shard-instances "Replica", I guess "replicaFactor"
will be ok - at least much better than "replicationFactor". But it would
still be better with e.g. "ShardInstance" and "InstancesPerShard"
