Re: Terminology question: Core vs. Collection vs...
This is a good explanation and makes sense. The one inconsistency is referring to a replica of a shard that has no replication. But it's not that big of a problem. If you wove the term 'core' into your writeup below it would be complete and should be posted on the wiki.

Original message
From: Jack Krupansky j...@basetechnology.com
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Replication makes perfect sense even if our explanations so far do not.

A shard is an abstraction of a subset of the data for a collection.

A replica is an instance of the data of the shard and instances of Solr servers that have indicated a readiness to service queries and updates for the data. Alternatively, a replica is a node which has indicated a readiness to receive and serve the data of a shard, but may not have any data at the moment.

Let's describe it operationally for SolrCloud:

- If data comes in to any replica of a shard it will automatically and quickly be replicated to all other replicas of the shard.
- If a new replica of a shard comes up it will be streamed all of the data from another replica of the shard.
- If an existing replica of a shard restarts or reconnects to the cluster, it will be streamed updates of any new data since it was last updated from another replica of the shard.

Replication is simply the process of assuring that all replicas are kept up to date. That's the same abstract meaning as for Master/Slave even though the operational details are somewhat different. The goal remains the same.

Replication factor is the number of instances of the data of the shard and instances of Solr servers that can service queries and updates for the data. Alternatively, the replication factor is the number of nodes of the SolrCloud cluster which have indicated a readiness to receive and serve the data of a shard, but may not have any data at the moment.

A node is an instance of Solr running in a Java JVM that has indicated to the Zookeeper ensemble of a SolrCloud cluster that it is ready to be a replica for a shard of a collection. [The latter part of that is a bit too fuzzy - I'm not sure what the node tells Zookeeper and who does shard assignment. I mean, does a node explicitly say what shard it wants to be, or is that assigned by Zookeeper, or is that a node's choice/option? But none of that changes the fact that a node registers with Zookeeper and then somehow becomes a replica for a shard.]

A node (instance of a Solr server) can be a replica of shards from multiple collections (potentially multiple shards per collection). A node is not a replica per se, but a container that can serve multiple collections. A node can serve as multiple replicas, each of a different collection.

My only interest here on this user list is to understand and explain the terms we have today and that SEEM to be working for the most part, even though we may not have defined them carefully enough and used them consistently enough. If somebody wants to propose an alternative terminology - fine, discuss that on the dev list and/or file a Jira. I won't claim that my definitions are perfect (yet), but perfecting the definitions (for users) should be separated from changing the terms themselves.

-- Jack Krupansky

-----Original Message-----
From: Per Steffensen
Sent: Friday, January 04, 2013 2:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

On 1/3/13 5:58 PM, Walter Underwood wrote:
A factor is multiplied, so multiplying the leader by a replicationFactor of 1 means you have exactly one copy of that shard. I think that recycling the term replication within Solr was confusing, but it is a bit late to change that.
wunder

Yes, the term factor is not misleading, but the term replication is. If we keep calling shard-instances "Replica", I guess replicaFactor will be OK - at least much better than replicationFactor. But it would still be better with e.g. ShardInstance and InstancesPerShard.
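
The operational description quoted above (an update reaching any replica ends up on all replicas, a new replica is streamed the existing data, and the replication factor counts the copies) can be sketched as a toy model. This is only an illustration of the definitions in this thread, not Solr code: the class names `Replica` and `Shard` and the methods `add_replica` and `update` are hypothetical, and real SolrCloud replication involves leaders, transaction logs and Zookeeper state that are left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of a shard's data (hypothetical toy model, not Solr code)."""
    name: str
    docs: dict = field(default_factory=dict)


@dataclass
class Shard:
    """A logical subset of a collection, served by one or more replicas."""
    name: str
    replicas: list = field(default_factory=list)

    def add_replica(self, name):
        # A new replica is handed all existing data from another replica.
        new = Replica(name)
        if self.replicas:
            new.docs = dict(self.replicas[0].docs)
        self.replicas.append(new)
        return new

    def update(self, doc_id, doc):
        # An update arriving at the shard ends up on every replica.
        for replica in self.replicas:
            replica.docs[doc_id] = doc

    @property
    def replication_factor(self):
        # Number of copies of the shard, counting the first one.
        return len(self.replicas)


shard = Shard("shard1")
shard.add_replica("replica1")
shard.update("doc1", {"title": "hello"})
shard.add_replica("replica2")   # catches up with doc1 on creation
shard.update("doc2", {"title": "world"})
print(shard.replication_factor)
print([sorted(r.docs) for r in shard.replicas])
```

In this model a shard with a single replica has a replication factor of 1, which is the case darren calls "a replica of a shard that has no replication": the one copy is still counted as a replica.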
Re: Terminology question: Core vs. Collection vs...
I thought about adding Solr core, but it only muddies the water. Yes, it needs to be added, but carefully.

In the context of SolrCloud, a Solr core is the underlying representation of a replica.

Alternatively, a replica of a shard of a collection is implemented as a Solr core. [Need to factor in the potential for multiple shards on a single node.]

Or, a Solr core is capable of serving as a replica of a shard.

A Solr core has a collection name but can exist without being registered with Zookeeper, so it may not be a replica of a Zookeeper-registered collection.

Something like that. Not quite there yet.

The main point, I think, is that when we talk about SolrCloud or a Solr cluster it would be better for people to speak of replicas and shards and collections than cores, since core is the implementation rather than the abstraction. I mean, at the level of cores, they know of only documents and fields, not shards, replicas, and the overall structure of collections and the cluster. Sure, the core has the name of the collection, but cores on other nodes can use that same name.

-- Jack Krupansky

-----Original Message-----
From: darren
Sent: Friday, January 04, 2013 9:00 AM
To: j...@basetechnology.com ; solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

This is a good explanation and makes sense. The one inconsistency is referring to a replica of a shard that has no replication. But it's not that big of a problem. If you wove the term 'core' into your writeup below it would be complete and should be posted on the wiki.
Re: Terminology question: Core vs. Collection vs...
Yes. That's it. It's clear if we separate logical terms from physical terms. A simple cake diagram on the wiki along with perhaps a UML diagram will solidify these concepts.
Re: Terminology question: Core vs. Collection vs...
On Fri, Jan 4, 2013 at 2:26 AM, Per Steffensen st...@designware.dk wrote:
Our biggest problem is that we really haven't decided once and for all and made sure to reflect the decision consistently across code and documentation. As long as we haven't, I believe it is still OK to change our minds.

IMO, I *think* it's settled: a collection consists of 1 or more shards, which each consist of one or more replicas.

A *long* time ago (3 years actually), I tried to get slice used in place of shard just because shard was already used ambiguously by people for both physical and logical shards, but it never caught on, and as I recall no one could really agree on a set of terms that satisfied everyone.

Attempting to replace Replica with something like Shard Instance could actually end up being worse since it's a mouthful and people would tend to shorten it to shard when talking about it.

From a practical standpoint, I don't think people will be confused by the current terminology once we document it well (we should probably start with collection/shard/replica). It's mostly an issue of when one goes looking for inconsistencies or things that might not make sense. And as has been pointed out, others use the exact same terminology: http://www.datastax.com/docs/1.0/cluster_architecture/replication

In the *code* I have been migrating away from shard as the physical kind. I've also used slice as a synonym for logical shard in the code because of this mixed history of shard and since removing all remnants of the use of shard as physical all at once would be impractical. Anyone who works on the code should not be bothered by an extra synonym, and things will continue to be cleaned up over time.

-Yonik
http://lucidworks.com
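
The containment described here (a collection consists of one or more shards, each consisting of one or more replicas) can be written down as a small layout builder. This is a minimal sketch under stated assumptions: `build_collection` and `total_replicas` are hypothetical helper names, and the `<collection>_shard<N>_replica<M>` naming is used purely for illustration, not as a statement of what Solr generates.

```python
def build_collection(name, num_shards, replication_factor):
    """Return a toy collection layout: shard name -> list of replica names."""
    if num_shards < 1 or replication_factor < 1:
        raise ValueError("need at least one shard and one replica per shard")
    return {
        f"shard{s}": [
            f"{name}_shard{s}_replica{r}"
            for r in range(1, replication_factor + 1)
        ]
        for s in range(1, num_shards + 1)
    }


def total_replicas(layout):
    # If each replica is implemented as one core, as suggested earlier in
    # this thread, this is also the number of cores across the cluster.
    return sum(len(replicas) for replicas in layout.values())


layout = build_collection("collection1", num_shards=2, replication_factor=3)
print(total_replicas(layout))  # 2 shards x 3 replicas per shard
```

The multiplication is the sense in which the factor in replicationFactor "is multiplied": with one shard and a factor of 1 the layout collapses to a single replica.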
Re: Terminology question: Core vs. Collection vs...
Agreed. But for completeness can it be node/collection/shard/replica/core?
Re: Terminology question: Core vs. Collection vs...
Actually. Node/collection/shard/replica/core/index
Re: Terminology question: Core vs. Collection vs...
Can I just start by saying that this was AMAZING. :-) When I asked the question, I certainly did not expect this level of detail.

And I vote on the cake diagram for WIKI as well. Perhaps two, with the first one showing the trivial collapsed state of single collection/shard/replica/core. The trivial one will also help to explain why the example is now called 'collection1'.

I think I followed everything, except for the just-added term 'index'. Isn't that the same as 'core'? Or can we have several indexes in one core?

Regards,
Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Fri, Jan 4, 2013 at 10:11 AM, darren dar...@ontrenet.com wrote:
This is the containment hierarchy i understand but includes both physical and logical.

Original message
From: darren dar...@ontrenet.com
To: dar...@ontrenet.com,yo...@lucidworks.com,solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Actually. Node/collection/shard/replica/core/index
Re: Terminology question: Core vs. Collection vs...
The entire collection does have an index - a distributed index - which consists of a Lucene index on each core/replica for the subset of the data in that shard.

-- Jack Krupansky

-----Original Message-----
From: Alexandre Rafalovitch
Sent: Friday, January 04, 2013 1:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

I think I followed everything, except for just added term of 'index'. Isn't that the same as 'core'? Or can we have several indexes in one core?
Re: Terminology question: Core vs. Collection vs...
My understanding is core is a logical solr term. Index is a physical lucene term. A solr core is backed by a physical lucene index. One index per core. Solr team can correct me if it's not accurate. :)
Re: Terminology question: Core vs. Collection vs...
Hmm. Doesn't that make (logical) index=collection? And (physical) index=core? Which creates duplication of terminology and at the same time can cause confusion between highest logical and lowest physical level.

Regards,
Alex.
P.s. Hoping not to start a new terminology war.

On Fri, Jan 4, 2013 at 1:21 PM, Jack Krupansky j...@basetechnology.com wrote:
The entire collection does have an index - a distributed index - which consists of a Lucene index on each core/replica for the subset of the data in that shard.
Re: Terminology question: Core vs. Collection vs...
On Fri, Jan 4, 2013 at 1:35 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:
Hmm. Doesn't that make (logical) index=collection? And (physical) index=core? Which creates duplication of terminology and at the same time can cause confusion between highest logical and lowest physical level.

That's why I've avoided index to mean anything other than the lowest level physical lucene index, and used collection for the logical meaning instead. A solr core is essentially a replica (currently... core is more of an implementation thing), and it has a lucene index.

-Yonik
http://lucidworks.com
Re: Terminology question: Core vs. Collection vs...
I agree. In my opinion index is a low level lucene thing. I never say a collection has an index directly. That confuses levels and creates confusion. To me at least. I think the terminology discussed is good. Just some lingering usage inconsistencies.
Re: Terminology question: Core vs. Collection vs...
Using your terminology, I'd say core is a physical solr term, and index is a physical lucene term. A collection or a shard is a logical solr term.

Upayavira

On Fri, Jan 4, 2013, at 06:28 PM, darren wrote:
My understanding is core is a logical solr term. Index is a physical lucene term. A solr core is backed by a physical lucene index. One index per core. Solr team can correct me if it's not accurate. :)
Re: Terminology question: Core vs. Collection vs...
Good point. Agree.

Original message
From: Upayavira u...@odoko.co.uk
Date:
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Using your terminology, I'd say core is a physical Solr term, and index is a physical Lucene term. A collection or a shard is a logical Solr term.

Upayavira
Re: Terminology question: Core vs. Collection vs...
Currently a SolrCore is 1:1 with a low-level Lucene index. There is no reason that needs to always be that way. It's possible that we may at some point add built-in micro-sharding support, which would mean a SolrCore could have multiple underlying Lucene indexes. Or we may not.

- Mark

On Jan 4, 2013, at 1:49 PM, darren dar...@ontrenet.com wrote:
Good point. Agree.
Re: Terminology question: Core vs. Collection vs...
Yes. In that case, core is best described as a logical Solr entity with various managed attributes and qualities above the physical layer (sorry, not trying to perpetuate this thread so much).

On 01/04/2013 01:55 PM, Mark Miller wrote:
Currently a SolrCore is 1:1 with a low-level Lucene index. There is no reason that needs to always be that way. It's possible that we may at some point add built-in micro-sharding support, which would mean a SolrCore could have multiple underlying Lucene indexes. Or we may not.

- Mark
Re: Terminology question: Core vs. Collection vs...
it wants to be, or is that assigned by Zookeeper, or is that a node's choice/option? But none of that changes the fact that a node registers with Zookeeper and then somehow becomes a replica for a shard.]

A node (instance of a Solr server) can be a replica of shards from multiple collections (potentially multiple shards per collection). A node is not a replica per se, but a container that can serve multiple collections. A node can serve as multiple replicas, each of a different collection.

My only interest here on this user list is to understand and explain the terms we have today, which SEEM to be working for the most part, even though we may not have defined them carefully enough or used them consistently enough. If somebody wants to propose an alternative terminology, fine: discuss that on the dev list and/or file a Jira. I won't claim that my definitions are perfect (yet), but perfecting the definitions (for users) should be separated from changing the terms themselves.

-- Jack Krupansky

-Original Message-
From: Per Steffensen
Sent: Friday, January 04, 2013 2:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

On 1/3/13 5:58 PM, Walter Underwood wrote:
A factor is multiplied, so multiplying the leader by a replicationFactor of 1 means you have exactly one copy of that shard. I think that recycling the term replication within Solr was confusing, but it is a bit late to change that.

wunder

Yes, the term factor is not misleading, but the term replication is. If we keep calling shard-instances Replica, I guess replicaFactor will be ok - at least much better than replicationFactor. But it would still be better with e.g. ShardInstance and InstancesPerShard.
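The "factor is multiplied" reading quoted above can be written down as plain arithmetic. This is only a sketch of that reading, in which replicationFactor counts every copy of a shard, the leader included. The function name `cores_needed` is made up for illustration and is not part of any Solr API.

```python
def cores_needed(num_shards: int, replication_factor: int) -> int:
    """Total shard instances (cores) one collection needs.

    Assumes the reading from the thread: replication_factor counts every
    copy of a shard, including the leader, so a factor of 1 means exactly
    one copy per shard.
    """
    if num_shards < 1 or replication_factor < 1:
        raise ValueError("both values must be at least 1")
    return num_shards * replication_factor


# One shard with replicationFactor 1: a single copy, no extra replicas.
single = cores_needed(1, 1)
# Two shards kept in three copies each.
larger = cores_needed(2, 3)
```

Under this reading, `single` is 1 and `larger` is 6.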
Re: Terminology question: Core vs. Collection vs...
On Jan 4, 2013, at 2:14 PM, Per Steffensen st...@designware.dk wrote:
I'm not sure what the node tells Zookeeper and who does shard assignment. I mean, does a node explicitly say what shard it wants to be, or is that assigned by Zookeeper, or is that a node's choice/option?

It's basically both. If you don't explicitly specify a shard assignment on SolrCore creation, the Overseer will use ZooKeeper to assign a shard for you. It's up to the user which road to take.

- mark
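The two roads described above can be sketched as a small decision function. This is a toy model, not Solr's Overseer code: the function name `choose_shard`, the `replica_counts` map, and above all the "fewest cores first" rule for the automatic road are assumptions made for illustration. The thread does not say how the Overseer actually picks.

```python
from typing import Dict, Optional


def choose_shard(replica_counts: Dict[str, int],
                 requested: Optional[str] = None) -> str:
    """Return the shard a newly created core should join.

    replica_counts maps shard name -> number of cores already serving it.
    """
    if requested is not None:
        # Explicit road: the user named the shard at core creation.
        return requested
    if not replica_counts:
        raise ValueError("no shards are defined")
    # Automatic road. ASSUMPTION: pick the shard with the fewest cores,
    # breaking ties by shard name. The real policy is not given in the thread.
    return min(sorted(replica_counts), key=lambda shard: replica_counts[shard])
```

For example, `choose_shard({"shard1": 2, "shard2": 1})` picks `shard2` under the assumed rule, while passing `requested="shard1"` always returns `shard1`.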
Terminology question: Core vs. Collection vs...
Hello,

I am trying to understand the core Solr terminology. I am looking for the correct rather than the loose meaning, as I am trying to teach an example that starts from an easy scenario and may scale to a multi-core, multi-machine situation.

Here are the terms that all seem to be overlapping and/or crossing over in my mind at the moment.

1) Index
2) Core
3) Collection
4) Instance
5) Replica (Replica of _what_?)
6) Others?

I tried looking through the documentation, but either there is terminology drift or I am having trouble understanding the distinctions.

If anybody has a clear picture in their mind, I would appreciate a clarification.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Terminology question: Core vs. Collection vs...
Hi,

If you haven't already, please refer to:
http://www.ngdata.com/site/blog/57-ng.html
http://lucene.472066.n3.nabble.com/solr-cloud-concepts-td3726292.html
http://wiki.apache.org/solr/SolrCloud#FAQ

Regards,
Aloke

On Thu, Jan 3, 2013 at 3:12 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:
Hello, I am trying to understand the core Solr terminology. [...]
Re: Terminology question: Core vs. Collection vs...
Haven't seen these yet. These look like a great start, though now I see even more terms to figure out.

Thank you,
Alex.

On Thu, Jan 3, 2013 at 5:34 AM, Aloke Ghoshal alghos...@gmail.com wrote:
Hi,

If you haven't already, please refer to:
http://www.ngdata.com/site/blog/57-ng.html
http://lucene.472066.n3.nabble.com/solr-cloud-concepts-td3726292.html
http://wiki.apache.org/solr/SolrCloud#FAQ

Regards,
Aloke
Re: Terminology question: Core vs. Collection vs...
Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection.

Instance is a general term, but is commonly used to refer to a running Solr server, each of which can service any number of cores. A sharded collection would typically require multiple instances of Solr, each with a shard of the collection.

Multiple collections can be supported on a single instance of Solr. They don't have to be sharded or replicated. But if they are, each Solr instance will have a copy or replica of the data (index) of one shard of each sharded collection - to the degree that each collection needs that many shards.

At the API level, you talk to a Solr instance, using a host and port, and giving the collection name. Some operations will refer only to the portion of a multi-shard collection on that Solr instance, but typically Solr will distribute the operation, whether it be an update or a query, to all of the shards of the named collection. In the case of an update, the update will be distributed to all replicas as well, but in the case of a query only one replica of each shard of the collection is needed.

Before SolrCloud, Solr had master and slave, and the slaves were replicas of the master. With SolrCloud there is no master and all the replicas of the shard are peers, although at any moment in time one of them will be considered the leader for coordination purposes, but not in the sense that it is a master of the other replicas in that shard. A SolrCloud replica is a replica of the data, in an abstract sense, for a single shard of a collection. A SolrCloud replica is more of an instance of the data/index.

An index exists at two levels: the portion of a collection on a single Solr core will have a Lucene index, but collectively the Lucene indexes for the shards of a collection can be referred to as the index of the collection. Each replica is a copy or instance of a portion of the collection's index.

The term slice is sometimes used to refer collectively to all of the cores/replicas of a single shard, or sometimes to a single replica, as it contains only a slice of the full collection data.

-- Jack Krupansky

-Original Message-
From: Alexandre Rafalovitch
Sent: Thursday, January 03, 2013 4:42 AM
To: solr-user@lucene.apache.org
Subject: Terminology question: Core vs. Collection vs...

Hello, I am trying to understand the core Solr terminology. [...]
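To make the collection/shard/replica relationship above concrete, here is a minimal Python model. It is a sketch for teaching purposes only, not Solr code: the class names, the methods `update_targets` and `query_targets`, and the core names `core_a` to `core_d` are all hypothetical. How a document is mapped to a shard is deliberately left out.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Shard:
    """A logical subset of a collection; each replica is a core holding a copy."""
    name: str
    replicas: List[str] = field(default_factory=list)  # core names, all peers


@dataclass
class Collection:
    name: str
    shards: Dict[str, Shard] = field(default_factory=dict)

    def update_targets(self, shard_name: str) -> List[str]:
        # An update that lands in a shard goes to every replica of that shard.
        return list(self.shards[shard_name].replicas)

    def query_targets(self, rng: random.Random) -> List[str]:
        # A query needs only one replica from each shard of the collection.
        return [rng.choice(shard.replicas) for shard in self.shards.values()]


# Two shards, two replicas each (names are made up for the example).
demo = Collection(
    name="collection1",
    shards={
        "shard1": Shard("shard1", ["core_a", "core_b"]),
        "shard2": Shard("shard2", ["core_c", "core_d"]),
    },
)
```

With this model, `demo.update_targets("shard1")` lists both cores of shard1, while `demo.query_targets(random.Random())` returns one core per shard, which is the asymmetry between updates and queries described above.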
RE: Re: Terminology question: Core vs. Collection vs...
Good write up.

And what about node?

I think there needs to be an official glossary of terms that is sanctioned by the Solr team, and some terms still in use may need to be labeled deprecated. After so many years, it's still confusing.

--- Original Message ---
On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection. [...]
Re: Terminology question: Core vs. Collection vs...
A node is a machine in a cluster or cloud (graph). It could be a real machine or a virtualized machine. Technically, you could have multiple virtual nodes on the same physical box. Each Solr replica would be on a different node.

Technically, you could have multiple Solr instances running on a single hardware node, each with a different port. They are simply instances of Solr, although you could consider each Solr instance a node in a Solr cloud as well, a virtual node. So, technically, you could have multiple replicas on the same node, but that sort of defeats most of the purpose of having replicas in the first place - to distribute the data for performance and fault tolerance. But you could have replicas of different shards on the same node/box for a partial improvement of performance and fault tolerance.

A Solr "cloud" is really a cluster.

-- Jack Krupansky

-Original Message-
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Good write up.

And what about node?

I think there needs to be an official glossary of terms that is sanctioned by the Solr team, and some terms still in use may need to be labeled deprecated. After so many years, it's still confusing.
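The placement point above (replicas of the same shard on one node defeat most of the purpose, replicas of different shards on one node are a partial win) can be checked mechanically. The helper below is a hypothetical sketch: `colocated_shards` and the node and shard names are invented for the example and do not correspond to any Solr API.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def colocated_shards(placement: Iterable[Tuple[str, str]]) -> List[str]:
    """Return shards that have more than one replica on the same node.

    placement is a sequence of (node, shard) pairs, one pair per replica.
    """
    per_node_shard = Counter(placement)  # (node, shard) -> replica count
    return sorted({shard for (_node, shard), n in per_node_shard.items() if n > 1})


# Replicas of different shards share nodes: acceptable, nothing is flagged.
spread = [("node1", "shard1"), ("node2", "shard1"),
          ("node1", "shard2"), ("node2", "shard2")]

# Both replicas of shard1 sit on node1: losing node1 loses all of shard1.
stacked = [("node1", "shard1"), ("node1", "shard1"), ("node2", "shard2")]
```

Here `colocated_shards(spread)` is empty, while `colocated_shards(stacked)` reports `shard1`.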
RE: Re: Terminology question: Core vs. Collection vs...
Thanks again. (And sorry to jump into this convo)

But I had a question on your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection.

A collection is sharded, meaning it is distributed across cores. A shard itself is not distributed across cores in the same sense. Rather, a shard exists on a single core and is replicated on other cores. Is that right? The way it's worded above, it sounds like a shard can also be sharded...

--- Original Message ---
On 1/3/2013 08:28 AM Jack Krupansky wrote:
A node is a machine in a cluster or cloud (graph). It could be a real machine or a virtualized machine. [...]
Re: Terminology question: Core vs. Collection vs...
No, a shard is a subset (or slice) of the collection. Sharding is a way of slicing the original data, before we talk about how the shards get stored and replicated on actual Solr cores. Replicas are instances of the data for a shard.

Sometimes people may loosely speak of a replica as being a shard, but that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky

-----Original Message-----
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:51 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Thanks again. (And sorry to jump into this convo)

But I had a question on your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection.

A collection is sharded, meaning it is distributed across cores. A shard itself is not distributed across cores in the same sense. Rather, a shard exists on a single core and is replicated on other cores. Is that right? The way it's worded above, it sounds like a shard can also be sharded...

--- Original Message ---
On 1/3/2013 08:28 AM Jack Krupansky wrote:

A node is a machine in a cluster or cloud (graph). It could be a real machine or a virtualized machine. Technically, you could have multiple virtual nodes on the same physical box. Each Solr replica would be on a different node.

Technically, you could have multiple Solr instances running on a single hardware node, each with a different port. They are simply instances of Solr, although you could consider each Solr instance a node in a Solr cloud as well, a virtual node. So, technically, you could have multiple replicas on the same node, but that sort of defeats most of the purpose of having replicas in the first place - to distribute the data for performance and fault tolerance. But, you could have replicas of different shards on the same node/box for a partial improvement of performance and fault tolerance.

A 'Solr cloud' is really a cluster.

-- Jack Krupansky

-----Original Message-----
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Good write up.

And what about node?

I think there needs to be an official glossary of terms that is sanctioned by the Solr team, and some terms still in use may need to be labeled deprecated. After so many years, it's still confusing.
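The point in the 08:28 message about replica placement (several replicas of the same shard on one node defeats the purpose, while replicas of different shards can share a node) can be checked mechanically. The following is a minimal Python sketch under the assumption that each replica is described by a `(shard, node)` pair; the function name `colocated_replicas` and the shard and node names are made up for illustration and are not part of Solr.

```python
from collections import Counter


def colocated_replicas(placements):
    """Return {(shard, node): count} for shards with 2+ replicas on one node.

    `placements` holds one (shard, node) pair per replica.
    """
    counts = Counter(placements)
    return {pair: n for pair, n in counts.items() if n > 1}


# Replicas of *different* shards share nodes: nothing is flagged.
spread = [("shard1", "nodeA"), ("shard1", "nodeB"),
          ("shard2", "nodeA"), ("shard2", "nodeB")]

# Both replicas of shard1 sit on nodeA: losing nodeA loses the whole shard.
stacked = [("shard1", "nodeA"), ("shard1", "nodeA"),
           ("shard2", "nodeB")]

print(colocated_replicas(spread))   # {}
print(colocated_replicas(stacked))  # {('shard1', 'nodeA'): 2}
```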
Re: Terminology question: Core vs. Collection vs...
Oops... let me word that a little more carefully:

...we are replicating the data of each shard.

-- Jack Krupansky

-----Original Message-----
From: Jack Krupansky
Sent: Thursday, January 03, 2013 9:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

No, a shard is a subset (or slice) of the collection. Sharding is a way of slicing the original data, before we talk about how the shards get stored and replicated on actual Solr cores. Replicas are instances of the data for a shard.

Sometimes people may loosely speak of a replica as being a shard, but that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...
Thanks. I got that part.

A group of shards (and therefore cores) represents a collection, yes. But does a single shard exist only on a single core?

--- Original Message ---
On 1/3/2013 09:03 AM Jack Krupansky wrote:

No, a shard is a subset (or slice) of the collection. Sharding is a way of slicing the original data, before we talk about how the shards get stored and replicated on actual Solr cores. Replicas are instances of the data for a shard.

Sometimes people may loosely speak of a replica as being a shard, but that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky
Re: Terminology question: Core vs. Collection vs...
And I would revise "node" to note that in SolrCloud a node is simply an instance of a Solr server.

And, technically, you can have multiple shards in a single instance of Solr, separating the logical sharding of keys from the distribution of the data.

-- Jack Krupansky

-----Original Message-----
From: Jack Krupansky
Sent: Thursday, January 03, 2013 9:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Oops... let me word that a little more carefully:

...we are replicating the data of each shard.

-- Jack Krupansky
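The separation between the logical sharding of keys and the physical distribution of the data, mentioned in the 9:16 message, can be illustrated with a short Python sketch. It assumes a simple hash-modulo assignment with `zlib.crc32` purely as a stand-in; it does not reproduce Solr's actual document router. The shard names, node names, and function names are hypothetical.

```python
import zlib


def shard_for_key(doc_id, num_shards):
    """Logical step: map a document key to a shard name.

    crc32 modulo the shard count is a stand-in, not Solr's own routing.
    """
    index = zlib.crc32(doc_id.encode("utf-8")) % num_shards
    return f"shard{index + 1}"


def node_for_key(doc_id, num_shards, shard_to_node):
    """Physical step: look up which node hosts the key's shard."""
    return shard_to_node[shard_for_key(doc_id, num_shards)]


# Two deployments of the same two-shard collection.
SINGLE_NODE = {"shard1": "nodeA", "shard2": "nodeA"}  # both shards in one instance
TWO_NODES = {"shard1": "nodeA", "shard2": "nodeB"}    # one shard per instance

for doc_id in ["doc-1", "doc-2", "doc-3"]:
    shard = shard_for_key(doc_id, 2)
    print(doc_id, shard,
          node_for_key(doc_id, 2, SINGLE_NODE),
          node_for_key(doc_id, 2, TWO_NODES))
```

The shard chosen for a key is the same in both deployments; only the shard-to-node mapping changes, which is the sense in which sharding is logical and distribution is physical.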
RE: Re: Terminology question: Core vs. Collection vs...
I think what's confusing about your explanation below is when you have a situation where there is no replication factor. That's possible too, yes? So in that case, is each core of a shard of a collection still referred to as a replica? To me a replica is a duplicate/backup of a shard's core, not the sharded core itself. Or is there just no difference? Is even a non-replicated core called a replica?

--- Original Message ---
On 1/3/2013 09:08 AM Jack Krupansky wrote:

Oops... let me word that a little more carefully:

...we are replicating the data of each shard.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...
Yes. And it's worth noting that when you have multiple shards in a single node (@deprecated), they are shards of different collections...

--- Original Message ---
On 1/3/2013 09:16 AM Jack Krupansky wrote:

And I would revise "node" to note that in SolrCloud a node is simply an instance of a Solr server.

And, technically, you can have multiple shards in a single instance of Solr, separating the logical sharding of keys from the distribution of the data.

-- Jack Krupansky
Re: Terminology question: Core vs. Collection vs...
A single shard MAY exist on a single core, but only if it is not replicated. Generally, a single shard will exist on multiple cores, each a replica of the source data as it comes into the update handler. -- Jack Krupansky -Original Message- From: Darren Govoni Sent: Thursday, January 03, 2013 9:10 AM To: solr-user@lucene.apache.org Subject: RE: Re: Terminology question: Core vs. Collection vs... Thanks. I got that part. A group of shards (and therefore cores) represent a collection, yes. But a single shard exist only on a single core? brbrbr--- Original Message --- On 1/3/2013 09:03 AM Jack Krupansky wrote:brNo, a shard is a subset (or slice) of the collection. Sharding is a way of brslicing the original data, before we talk about how the shards get stored brand replicated on actual Solr cores. Replicas are instances of the data for bra shard. br brSometimes people may loosely speak of a replica as being a shard, but brthat's just loose use of the terminology. br brSo, we're not sharding shards, but we are replicating shards. br br-- Jack Krupansky br br-Original Message- brFrom: Darren Govoni brSent: Thursday, January 03, 2013 8:51 AM brTo: solr-user@lucene.apache.org brSubject: RE: Re: Terminology question: Core vs. Collection vs... br brThanks again. (And sorry to jump into this convo) br brBut I had a question on your statement: br brOn 1/3/2013 08:07 AM Jack Krupansky wrote: br brCollection is the more modern term and incorporates the fact that the brbrcollection may be sharded, with each shard on one or more cores, with breach brcore being a replica of the other cores within that shard of that brbrcollection. br brA collection is sharded, meaning it is distributed across cores. A shard britself is not distributed across cores in the same since. Rather a shard brexist on a single core and is replicated on other cores. Is that right? The brway its worded above, it sounds like a shard can also be sharded... 
br br brbrbrbr--- Original Message --- brOn 1/3/2013 08:28 AM Jack Krupansky wrote:brA node is a machine in a brcluster or cloud (graph). It could be a real brbrmachine or a virtualized machine. Technically, you could have multiple brbrvirtual nodes on the same physical box. Each Solr replica would be on bra brbrdifferent node. brbr brbrTechnically, you could have multiple Solr instances running on a single brbrhardware node, each with a different port. They are simply instances of brbrSolr, although you could consider each Solr instance a node in a Solr brcloud brbras well, a virtual node. So, technically, you could have multiple brreplicas brbron the same node, but that sort of defeats most of the purpose of having brbrreplicas in the first place - to distribute the data for performance and brbrfault tolerance. But, you could have replicas of different shards on the brbrsame node/box for a partial improvement of performance and fault brtolerance. brbr brbrA Solr cloud' is really a cluster. brbr brbr-- Jack Krupansky brbr brbr-Original Message- brbrFrom: Darren Govoni brbrSent: Thursday, January 03, 2013 8:16 AM brbrTo: solr-user@lucene.apache.org brbrSubject: RE: Re: Terminology question: Core vs. Collection vs... brbr brbrGood write up. brbr brbrAnd what about node? brbr brbrI think there needs to be an official glossary of terms that is brsanctioned brbrby the solr team and some terms still ni use may need to be labeled brbrdeprecated. After so many years, its still confusing. brbr brbrbrbrbr--- Original Message --- brbrOn 1/3/2013 08:07 AM Jack Krupansky wrote:brCollection is the more brmodern brbrterm and incorporates the fact that the brbrbrcollection may be sharded, with each shard on one or more cores, brwith brbreach brbrbrcore being a replica of the other cores within that shard of that brbrbrcollection. 

Instance is a general term, but is commonly used to refer to a running Solr server, each of which can service any number of cores. A sharded collection would typically require multiple instances of Solr, each with a shard of the collection.

Multiple collections can be supported on a single instance of Solr. They don't have to be sharded or replicated. But if they are, each Solr instance will have a copy or replica of the data (index) of one shard of each sharded collection - to the degree that each collection needs that many shards.

At the API level, you talk to a Solr instance, using a host and port, and giving the collection name. Some operations will refer only to the portion of a multi-shard collection on that Solr instance, but typically Solr will distribute the operation, whether it be an update or a query, to all of the shards of the named collection. In the case of update, the update will be distributed to all replicas
RE: Re: Terminology question: Core vs. Collection vs...
Ah, ok. Good. Makes sense. I think I will draw all this up in a UML that includes the distinction between the logical terms and the physical terms (and their mapping) as they do get intertwined. I'll post it here when I'm done.

--- Original Message ---
On 1/3/2013 09:19 AM Jack Krupansky wrote:
> A single shard MAY exist on a single core, but only if it is not replicated. Generally, a single shard will exist on multiple cores, each a replica of the source data as it comes into the update handler.
>
> -- Jack Krupansky
Re: Terminology question: Core vs. Collection vs...
On Jan 3, 2013, at 9:17 AM, Darren Govoni dar...@ontrenet.com wrote:

> Even a non-replicated core is called a replica?

To some :)

Forcing agreement on terminology has been … challenging… And even if there is some agreement, new people come, old people that were not around for the agreement come back, etc. Usually you have to figure it out by context.

I started trying to put a stake in the ground on the wiki - but it's still solidifying and does not include everything yet - e.g. I don't think it makes a call about whether a replica is just a copy or can also be the leader.

There was some discussion about this very thing recently. Because all cores in a slice / logical shard are pretty much equal (anyone can become a leader), it doesn't seem crazy to consider them all replicas.

If a leader goes down briefly and comes back - perhaps it just lost its connection for a moment - it will come back and no longer be a leader. Did it change from a non-replica to a replica then? Gosh I don't know. Stick a fork in my eye :)

- Mark
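
Mark's point that a leader which drops out and returns is no longer the leader can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Shard` class, its `join`/`leave` methods, and the "longest-connected core takes over" rule are hypothetical, and it does not reproduce how SolrCloud actually elects leaders (that is coordinated through ZooKeeper).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Shard:
    """Hypothetical model of one logical shard and its cores (not a Solr API)."""
    name: str
    live: List[str] = field(default_factory=list)  # connected cores, in join order
    leader: Optional[str] = None

    def join(self, core: str) -> None:
        # A core that (re)connects goes to the back of the line.
        if core not in self.live:
            self.live.append(core)
        # Only an empty leadership slot triggers a new choice.
        if self.leader is None:
            self.leader = self.live[0]

    def leave(self, core: str) -> None:
        # Losing the connection drops the core; if it was the leader,
        # the longest-connected remaining core takes over (assumed rule).
        if core in self.live:
            self.live.remove(core)
        if self.leader == core:
            self.leader = self.live[0] if self.live else None


shard = Shard("shard1")
shard.join("core_a")
shard.join("core_b")
first_leader = shard.leader   # core_a joined first

shard.leave("core_a")         # brief disconnect
shard.join("core_a")          # comes back as an ordinary member
second_leader = shard.leader  # core_b keeps the role
```

In this model every core is interchangeable, which is the argument for calling all of them replicas: the only thing that changed for `core_a` is the role it currently holds.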
Re: Terminology question: Core vs. Collection vs...
Ah... the multiple shards (of the same collection) in a single node is about planning for future expansion of your cluster - create more shards than you need today, put more of them on a single node and then migrate them to their own nodes as the data outgrows the smaller number of nodes. In other words, add nodes incrementally without having to reindex all the data.

-- Jack Krupansky

-Original Message-
From: Darren Govoni
Sent: Thursday, January 03, 2013 9:18 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Yes. And it's worth noting that when you have multiple shards in a single node (@deprecated), they are shards of different collections...

--- Original Message ---
On 1/3/2013 09:16 AM Jack Krupansky wrote:
And I would revise node to note that in SolrCloud a node is simply an instance of a Solr server.

And, technically, you can have multiple shards in a single instance of Solr, separating the logical sharding of keys from the distribution of the data.

-- Jack Krupansky

-Original Message-
From: Jack Krupansky
Sent: Thursday, January 03, 2013 9:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Oops... let me word that a little more carefully:

...we are replicating the data of each shard.

-- Jack Krupansky
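
Jack's remark about creating more shards than you need today separates two mappings: document key to shard (logical) and shard to node (physical). A minimal sketch of that separation follows. The shard count, node names, and the CRC32-modulo routing are illustrative assumptions; Solr's actual document router is not modeled here.

```python
import zlib

NUM_SHARDS = 8  # hypothetical; fixed when the collection is created


def shard_for(doc_id: str) -> int:
    """Logical sharding: depends only on the key and the shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def place_shards(nodes):
    """Physical distribution: spread the fixed set of shards over the nodes."""
    return {shard: nodes[shard % len(nodes)] for shard in range(NUM_SHARDS)}


docs = ["doc-%d" % i for i in range(100)]
before = {d: shard_for(d) for d in docs}

# Start small: eight shards packed onto two nodes.
small_cluster = place_shards(["node1", "node2"])
# Later: grow to four nodes and migrate some shards.
large_cluster = place_shards(["node1", "node2", "node3", "node4"])

after = {d: shard_for(d) for d in docs}
moved = [s for s in range(NUM_SHARDS) if small_cluster[s] != large_cluster[s]]
```

Growing the cluster changes only which node hosts a shard. No document changes shard, which is why nothing has to be reindexed in this model.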
Re: Terminology question: Core vs. Collection vs...
Hi

Here is my version - I do not believe the explanations have been very clear.

We have the following concepts (here I will try to explain what each concept covers without naming it - it's hard):
1) Machines (virtual or physical) running Solr server JVMs (one machine can run several Solr server JVMs if you like)
2) Solr server JVMs
3) Logical stores where you can add/update/delete data-instances (closest to logical tables in an RDBMS)
4) Logical slices of a store (closest to non-overlapping logical sets of rows for the logical table in an RDBMS)
5) Physical instances of slices (a physical (disk/memory) instance of a logical slice). This is where data actually goes on disk - the logical stores and slices above are just non-physical concepts

Terminology
1) I believe we have no name for this (except of course machine :-) ), even though Jack claims that this is called a node. Maybe sometimes it is called a node, but I believe node is more often used to refer to a Solr server JVM.
2) Node
3) Collection
4) Shard. Used to be called Slice but I believe now it is officially called Shard. I agree with that change, because I believe most of the industry also uses the term Shard for this logical/non-physical concept - it just needs to be reflected across documentation and code.
5) Replica. Used to be called Shard but I believe now it is officially called Replica. I certainly do not agree with the name Replica, because it suggests that it is a copy of an original, but it isn't. I would prefer Shard-instance here, to avoid the confusion. I understand that you can argue (if you argue long enough) that Replica is a fine name, but you really need the explanation to understand why Replica can be defended as the name for this. It is not immediately obvious what this is as long as it is called Replica. A Replica is basically a Solr Cloud managed Core, and behind every Replica/Core lives a physical Lucene index. So a Replica (=Core) contains/maintains a Lucene index behind the scenes. The term Replica also needs to be reflected across documentation and code.

Regards, Per Steffensen

On 1/3/13 10:42 AM, Alexandre Rafalovitch wrote:
> Hello,
>
> I am trying to understand the core Solr terminology. I am looking for correct rather than loose meaning as I am trying to teach an example that starts from an easy scenario and may scale to a multi-core, multi-machine situation.
>
> Here are the terms that seem to be all overlapping and/or crossing over in my mind at the moment.
>
> 1) Index
> 2) Core
> 3) Collection
> 4) Instance
> 5) Replica (Replica of _what_?)
> 6) Others?
>
> I tried looking through documentation, but either there is a terminology drift or I am having trouble understanding the distinctions.
>
> If anybody has a clear picture in their mind, I would appreciate a clarification.
>
> Regards,
> Alex.
>
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
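
Per's five concepts can be written down as a small data model, which makes the logical/physical split easier to see. This is only a sketch: the class names follow his terminology list, but the fields, the `cores_on` helper, and the host, port and core names are assumptions for illustration, not Solr's API or naming scheme.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Replica:
    """Concept 5: a physical instance of a shard (a core with its Lucene index)."""
    core_name: str
    node: str  # concept 2: the Solr server JVM hosting the core, written "host:port"


@dataclass
class Shard:
    """Concept 4: a logical slice of a collection; holds no data itself."""
    name: str
    replicas: List[Replica] = field(default_factory=list)


@dataclass
class Collection:
    """Concept 3: a logical store, split into shards."""
    name: str
    shards: Dict[str, Shard] = field(default_factory=dict)

    def cores_on(self, node: str) -> List[str]:
        """List the cores of this collection hosted by one node."""
        return [r.core_name
                for s in self.shards.values()
                for r in s.replicas
                if r.node == node]


books = Collection("books")
books.shards["shard1"] = Shard("shard1", [Replica("books_shard1_a", "host1:8983"),
                                          Replica("books_shard1_b", "host2:8983")])
books.shards["shard2"] = Shard("shard2", [Replica("books_shard2_a", "host2:8983")])
```

Only `Replica` points at something physical. `Collection` and `Shard` are bookkeeping, which matches the remark that the logical stores and slices are non-physical concepts. Concept 1 (the machine) appears only as the host part of the node string.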
Re: Terminology question: Core vs. Collection vs...
For the same reasons that Replica shouldn't be called Replica (it requires too long an explanation to agree that it is an ok name), replicationFactor shouldn't be called replicationFactor as long as it refers to the TOTAL number of cores you get for your Shard. replicationFactor would be an ok name if replicationFactor=0 meant one core, replicationFactor=1 meant two cores etc., but as long as replicationFactor=1 means one core and replicationFactor=2 means two cores, it is bad naming (you will not get any replication with replicationFactor=1 - WTF!?!?). If we want to insist that you specify the total number of cores, at least use replicaPerShard instead of replicationFactor, or even better rename Replica to Shard-instance and use instancesPerShard instead of replicationFactor.

Regards, Per Steffensen
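
The arithmetic behind the replicationFactor complaint is short enough to spell out. The sketch below encodes the semantics as described in this thread (replicationFactor is the total number of cores per shard); the function names are hypothetical helpers, not Solr parameters.

```python
def cores_per_shard(replication_factor: int) -> int:
    """As described in the thread: replicationFactor is the TOTAL per shard."""
    return replication_factor


def extra_copies(replication_factor: int) -> int:
    """Cores beyond the first one, i.e. what 'replication' intuitively suggests."""
    return cores_per_shard(replication_factor) - 1


def total_cores(num_shards: int, replication_factor: int) -> int:
    """Cores needed across the whole collection."""
    return num_shards * cores_per_shard(replication_factor)


no_redundancy = extra_copies(1)   # replicationFactor=1 gives no second copy
one_backup = extra_copies(2)      # replicationFactor=2 gives one extra core
cluster_size = total_cores(3, 2)  # three shards, two cores each
```

Under the alternative naming proposed in the message, `instancesPerShard` would carry the meaning of `cores_per_shard`, and a zero-based "replication factor" would correspond to `extra_copies`.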
Re: Terminology question: Core vs. Collection vs...
Yes, in the context of SolrCloud, Node = Solr server JVM. So, a node is an instance of Solr, which can support multiple cores and multiple collections - or at least shards of multiple collections.

-- Jack Krupansky
Re: Terminology question: Core vs. Collection vs...
This has pretty much become the standard across other distributed systems and in the literat…err…books.

I first implemented it as you mention you'd like, but Yonik correctly pointed out that we were going against the grain.

- Mark

On Jan 3, 2013, at 10:01 AM, Per Steffensen st...@designware.dk wrote:

> For the same reasons that Replica shouldn't be called Replica (it requires too long an explanation to agree that it is an ok name), replicationFactor shouldn't be called replicationFactor as long as it refers to the TOTAL number of cores you get for your Shard. replicationFactor would be an ok name if replicationFactor=0 meant one core, replicationFactor=1 meant two cores etc., but as long as replicationFactor=1 means one core and replicationFactor=2 means two cores, it is bad naming (you will not get any replication with replicationFactor=1 - WTF!?!?). If we want to insist that you specify the total number of cores, at least use replicaPerShard instead of replicationFactor, or even better rename Replica to Shard-instance and use instancesPerShard instead of replicationFactor.
>
> Regards, Per Steffensen
Re: Terminology question: Core vs. Collection vs...
On 1/3/13 4:33 PM, Mark Miller wrote:
> This has pretty much become the standard across other distributed systems and in the literat…err…books.

Hmmm, I'm not sure you are right about that. Maybe more than one distributed system calls them Replica, but there are also a lot that don't. But if you are right, that's at least a good valid argument to do it this way, even though I generally prefer good logical naming over following bad naming from the industry :-) Just because there is a lot of crap out there doesn't mean that we also want to make crap. Maybe good logical naming could even be a small entry in the "Why Solr is better than its competitors" list :-)
RE: Re: Terminology question: Core vs. Collection vs...
Great point.
Re: Terminology question: Core vs. Collection vs...
On Jan 3, 2013, at 10:42 AM, Per Steffensen st...@designware.dk wrote:

> "Why Solr is better than its competitors" list :-)

The problem is that it's not just Solr competitors. It seems to be pretty much everyone. If you can provide counter examples, I'd be interested to see them, but I've found confirmation examples in projects and books left and right. Trying to forge our own path here seems more confusing than helpful IMO. We have enough issues with terminology right now - where we can go with the industry standard, I think we should.

- Mark
Re: Terminology question: Core vs. Collection vs...
On Jan 3, 2013, at 10:55 AM, Mark Miller markrmil...@gmail.com wrote:

> On Jan 3, 2013, at 10:42 AM, Per Steffensen st...@designware.dk wrote:
>
>> "Why Solr is better than its competitors" list :-)
>
> The problem is that it's not just Solr competitors. It seems to be pretty much everyone. If you can provide counter examples, I'd be interested to see them, but I've found confirmation examples in projects and books left and right. Trying to forge our own path here seems more confusing than helpful IMO. We have enough issues with terminology right now - where we can go with the industry standard, I think we should.
>
> - Mark

P.S. I'm referring specifically to replication factor and not replica. While I think it's probably a similar deal, I've only researched replication factor specifically.

- Mark
Re: Terminology question: Core vs. Collection vs...
A factor is multiplied, so multiplying the leader by a replicationFactor of 1 means you have exactly one copy of that shard.

I think that recycling the term replication within Solr was confusing, but it is a bit late to change that.

wunder

On Jan 3, 2013, at 7:33 AM, Mark Miller wrote:

> This has pretty much become the standard across other distributed systems and in the literat…err…books.
>
> I first implemented it as you mention you'd like, but Yonik correctly pointed out that we were going against the grain.
>
> - Mark

-- Walter Underwood wun...@wunderwood.org
RE: Re: Terminology question: Core vs. Collection vs...
And based on the previous explanation there is never a copy of a shard. A shard represents and contains only replicas for itself, replicas being copies of cores within the shard. brbrbr--- Original Message --- On 1/3/2013 11:58 AM Walter Underwood wrote:brA factor is multiplied, so multiplying the leader by a replicationFactor of 1 means you have exactly one copy of that shard. br brI think that recycling the term replication within Solr was confusing, but it is a bit late to change that. br brwunder br brOn Jan 3, 2013, at 7:33 AM, Mark Miller wrote: br br This has pretty much become the standard across other distributed systems and in the literat…err…books. br br I first implemented it as you mention you'd like, but Yonik correctly pointed out that we were going against the grain. br br - Mark br br On Jan 3, 2013, at 10:01 AM, Per Steffensen st...@designware.dk wrote: br br For the same reasons that Replica shouldnt be called Replica (it requires to long an explanation to agree that it is an ok name), replicationFactor shouldnt be called replicationFactor and long as it referes to the TOTAL number of cores you get for your Shard. replicationFactor would be an ok name if replicationFactor=0 meant one core, replicationFactor=1 meant two cores etc., but as long as replicationFactor=1 means one core, replicationFactor=2 means two cores, it is bad naming (you will not get any replication with replicationFactor=1 - WTF!?!?). If we want to insist that you specify the total number of cores at least use replicaPerShard instead of replicationFactor, or even better rename Replica to Shard-instance and use instancesPerShard instead of replicationFactor. 
Regards, Per Steffensen

--
Walter Underwood
wun...@wunderwood.org
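For reference, replicationFactor is the parameter you pass, together with numShards, when creating a collection through the Collections API. A minimal sketch that only builds the request URL and never contacts a server; the base URL, the collection name `books` and the helper names are made up for the example, and the argument check reflects the reading in this thread (replicationFactor=1 means one core per shard) rather than anything Solr is known to validate.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build (but do not send) a Collections API CREATE request URL."""
    if num_shards < 1 or replication_factor < 1:
        # Assumption of this sketch: at least one shard and one core per shard.
        raise ValueError("need at least one shard and one core per shard")
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


def total_cores(num_shards, replication_factor):
    """Total physical cores if replicationFactor counts ALL cores of a shard."""
    return num_shards * replication_factor


url = create_collection_url("http://localhost:8983/solr", "books", 2, 1)
print(url)
print("cores created:", total_cores(2, 1))
```

With numShards=2 and replicationFactor=1 this reading gives two cores in total, one per shard, and therefore no second copy of anything.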
Re: Terminology question: Core vs. Collection vs...
Also, searching can be much faster if you put all of the shards, and the search distributor, on one machine. That way, you search with multiple simultaneous threads inside one machine. I've seen this make searches several times faster.

On 01/03/2013 06:36 AM, Jack Krupansky wrote:

Ah... the multiple shards (of the same collection) in a single node is about planning for future expansion of your cluster - create more shards than you need today, put more of them on a single node and then migrate them to their own nodes as the data outgrows the smaller number of nodes. In other words, add nodes incrementally without having to reindex all the data.

-- Jack Krupansky

-----Original Message-----
From: Darren Govoni
Sent: Thursday, January 03, 2013 9:18 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Yes. And it's worth noting that when you have multiple shards in a single node (@deprecated), they are shards of different collections...

--- Original Message ---
On 1/3/2013 09:16 AM Jack Krupansky wrote:

And I would revise node to note that in SolrCloud a node is simply an instance of a Solr server.

And, technically, you can have multiple shards in a single instance of Solr, separating the logical sharding of keys from the distribution of the data.

-- Jack Krupansky

-----Original Message-----
From: Jack Krupansky
Sent: Thursday, January 03, 2013 9:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Oops... let me word that a little more carefully:

...we are replicating the data of each shard.

-- Jack Krupansky

-----Original Message-----
From: Jack Krupansky
Sent: Thursday, January 03, 2013 9:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

No, a shard is a subset (or slice) of the collection. Sharding is a way of slicing the original data, before we talk about how the shards get stored and replicated on actual Solr cores. Replicas are instances of the data for a shard.

Sometimes people may loosely speak of a replica as being a shard, but that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky

-----Original Message-----
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:51 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Thanks again. (And sorry to jump into this convo)

But I had a question on your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:

Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection.

A collection is sharded, meaning it is distributed across cores. A shard itself is not distributed across cores in the same sense. Rather, a shard exists on a single core and is replicated on other cores. Is that right? The way it's worded above, it sounds like a shard can also be sharded...

--- Original Message ---
On 1/3/2013 08:28 AM Jack Krupansky wrote:

A node is a machine in a cluster or cloud (graph). It could be a real machine or a virtualized machine. Technically, you could have multiple virtual nodes on the same physical box. Each Solr replica would be on a different node.

Technically, you could have multiple Solr instances running on a single hardware node, each with a different port. They are simply instances of Solr, although you could consider each Solr instance a node in a Solr cloud as well, a virtual node. So, technically, you could have multiple replicas on the same node, but that sort of defeats most of the purpose of having replicas in the first place - to distribute the data for performance and fault tolerance. But, you could have replicas of different shards on the same node/box for a partial improvement of performance and fault tolerance.

A Solr "cloud" is really a cluster.

-- Jack Krupansky

-----Original Message-----
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Good write up.

And what about node?

I think there needs to be an official glossary of terms that is sanctioned by the Solr team, and some terms still in use may need to be labeled deprecated. After so many years, it's still confusing.

--- Original Message ---
On 1/3/2013 08:07 AM Jack Krupansky wrote:

Collection is the more modern term and incorporates the fact that the collection
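Jack's point about creating more shards than you need today comes down to keeping two mappings apart: document key to shard, and shard to node. A rough sketch under stated assumptions: the MD5-modulo hash is only a stand-in and is not how Solr routes documents, and the node names and round-robin placement are invented for the example.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document key to a logical shard.

    Stand-in hash (MD5 modulo N) chosen for illustration, NOT Solr's router.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def place_shards(num_shards, nodes):
    """Assign shards to nodes round-robin; returns {shard index: node}."""
    return {s: nodes[s % len(nodes)] for s in range(num_shards)}


NUM_SHARDS = 4
doc_ids = [f"doc-{i}" for i in range(1000)]

# Today: all four shards live on one node.
today = place_shards(NUM_SHARDS, ["node1"])
# Later: the same four shards are migrated to their own nodes.
later = place_shards(NUM_SHARDS, ["node1", "node2", "node3", "node4"])

# The key -> shard mapping never looks at the nodes, so it is unaffected
# by the migration and nothing has to be reindexed.
print("doc-1 lives in shard", shard_for("doc-1", NUM_SHARDS),
      "on", today[shard_for("doc-1", NUM_SHARDS)], "today and on",
      later[shard_for("doc-1", NUM_SHARDS)], "later")

# Changing the NUMBER of shards is different: many keys land elsewhere.
resharded = sum(1 for d in doc_ids
                if shard_for(d, NUM_SHARDS) != shard_for(d, NUM_SHARDS + 1))
print(resharded, "of", len(doc_ids), "documents would move with 5 shards")
```

Moving a shard to another node changes only the second mapping, whereas changing the shard count changes the first one, which is why oversharding up front avoids a reindex later.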
Re: Terminology question: Core vs. Collection vs...
I see. So sharding and distributing/replicating can have separate and different advantages.

On 01/03/2013 01:06 PM, Lance Norskog wrote:

Also, searching can be much faster if you put all of the shards, and the search distributor, on one machine. That way, you search with multiple simultaneous threads inside one machine. I've seen this make searches several times faster.
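Lance's "search distributor" with several shards on one machine is a fan-out and merge: query every shard at the same time, then combine the partial results. A toy sketch of that shape, where the in-memory dictionaries, the scores and the function names are all invented for illustration. It shows the structure only and makes no claim about how Solr implements distributed search; Python threads also will not reproduce the speedup Lance describes for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical in-memory "shards": each maps a doc id to a relevance score.
SHARDS = [
    {"a": 0.9, "b": 0.2},
    {"c": 0.7, "d": 0.4},
    {"e": 0.8, "f": 0.1},
]


def search_shard(shard, top_k):
    """Return this shard's top_k (score, doc_id) pairs, best first."""
    return heapq.nlargest(top_k, ((score, doc) for doc, score in shard.items()))


def distributed_search(shards, top_k):
    """Query every shard on its own thread, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, top_k), shards))
    merged = heapq.nlargest(top_k, (hit for part in partials for hit in part))
    return [doc for _score, doc in merged]


top = distributed_search(SHARDS, top_k=3)
print(top)
```

Each shard only has to return its own best `top_k` hits, because no document outside a shard's local top `top_k` can appear in the global top `top_k`.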
Re: Terminology question: Core vs. Collection vs...
On 1/3/13 4:55 PM, Mark Miller wrote:

Trying to forge our own path here seems more confusing than helpful IMO. We have enough issues with terminology right now - where we can go with the industry standard, I think we should.

- Mark

Fair enough. I don't think our biggest problem is whether we decide to call it Replica/replicationFactor or ShardInstance/InstancesPerShard. Our biggest problem is that we really haven't decided once and for all and made sure to reflect the decision consistently across code and documentation. As long as we haven't, I believe it is still ok to change our minds.
Re: Terminology question: Core vs. Collection vs...
On 1/3/13 5:58 PM, Walter Underwood wrote:

A factor is multiplied, so multiplying the leader by a replicationFactor of 1 means you have exactly one copy of that shard.

I think that recycling the term replication within Solr was confusing, but it is a bit late to change that.

wunder

Yes, the term factor is not misleading, but the term replication is. If we keep calling shard-instances Replica, I guess replicaFactor will be ok - at least much better than replicationFactor. But it would still be better with e.g. ShardInstance and InstancesPerShard.
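The two readings of the number being argued about can be put side by side. A small sketch; both function names are hypothetical, and the second one models the alternative Per described earlier in the thread (0 meaning one core), not anything Solr offers.

```python
def cores_with_total_semantics(replication_factor):
    """Reading used in this thread: the value is the TOTAL cores per shard."""
    if replication_factor < 1:
        raise ValueError("a shard needs at least one core")
    return replication_factor


def cores_with_extra_copy_semantics(extra_copies):
    """Hypothetical alternative: the value counts copies beyond the first core."""
    if extra_copies < 0:
        raise ValueError("cannot have a negative number of extra copies")
    return 1 + extra_copies


rows = [(n, cores_with_total_semantics(n), cores_with_extra_copy_semantics(n))
        for n in (1, 2, 3)]
for value, total, extra in rows:
    print(f"value={value}: total semantics -> {total} core(s), "
          f"extra-copy semantics -> {extra} core(s)")
```

Under the total reading, a value of 1 yields a single core and so no redundancy, which is the case Per objects to; under Walter's "factor is multiplied" reading the same value is simply 1 times the leader.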