gus-asf commented on code in PR #4246: URL: https://github.com/apache/solr/pull/4246#discussion_r3006758621
########## solr/solr-ref-guide/modules/getting-started/pages/solr-glossary.adoc: ########## @@ -44,28 +44,37 @@ These control the inclusion or exclusion of keywords in a query by using operato [[SolrGlossary-C]] === C -[[cluster]]Cluster:: +[[cluster]]xref:cluster-types.adoc[Cluster]:: In Solr, a cluster is a set of Solr nodes operating in coordination with each other via <<zookeeper,ZooKeeper>>, and managed as a unit. A cluster may contain many collections. -See also <<solrclouddef,SolrCloud>>. +See also xref:cluster-types.adoc[] and <<solrclouddef,SolrCloud>>. [[collection]]Collection:: -In Solr, one or more <<document,Documents>> grouped together in a single logical index using a single configuration and Schema. +The complete logical set of searchable documents that share a schema and configuration. + -In <<solrclouddef,SolrCloud>> a collection may be divided up into multiple logical shards, which may in turn be distributed across many nodes. +In <<solrclouddef,SolrCloud>>, a collection may be divided up into multiple logical <<shard,shards>>, which may in turn be distributed across many <<node,nodes>> for scalability and fault tolerance. +Each collection encompasses all the shards and their <<replica,replicas>>. + -Single-node installations and user-managed clusters use instead the concept of a <<core,Core>>. -"Collection" is most frequently used in the SolrCloud context, but as it represents a "logical index", the term may be used to refer to individual cores in a user-managed cluster as well. +Single-node installations and user-managed clusters do not manage collections as first-class entities; instead they work directly with individual <<core,cores>>. + [[defcommit]]Commit:: To make document changes permanent in the index. In the case of added documents, they would be searchable after a _commit_. [[core]]Core:: -An individual Solr instance (represents a logical index). -Multiple cores can run on a single node. 
+In Solr's implementation, a core is the physical instance that represents a <<replica,Replica>>. Review Comment: Same comment as in cluster-types.adoc ########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ########## @@ -0,0 +1,158 @@ += Solr Cluster Types +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +A Solr cluster is a group of servers that each run one or more Solr _nodes_. + +There are two general modes of operating a cluster of Solr nodes. +One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>). + +TIP: "User Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code. + +Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features. + +First let's cover a few general concepts and then outline the differences between the two modes. + +== Cluster Concepts + +=== Servers and Nodes + +A _server_ is the hardware or virtual machine that hosts Solr software. +A _node_ is an instance of a running Solr process that services search and indexing requests. 
+Large servers may run multiple Solr nodes, though typically one node per server is most common. + +=== Shards + +In both cluster modes, a logical collection of documents can be divided across nodes as _shards_. +Each shard represents a logical slice of the overall collection and contains a subset of the documents. + +The number of shards determines the theoretical limit to the number of documents that can be stored. +It also dictates the amount of parallelization possible for an individual search request. + +=== Replicas + +A shard is a logical concept—a slice of your collection. +A _replica_ is the physical manifestation of that logical shard. +It is the actual running instance that holds and serves the documents belonging to that shard. + +A shard must have at least one replica to exist physically. +If you have one shard with one physical copy, you have one replica. +If you add redundancy by creating additional copies of that shard, you have multiple replicas—each is equally a replica, including the first one. + +IMPORTANT: There is no "original shard" separate from its replicas. +The replicas ARE how the shard exists. +This is why we say "a shard with 2 replicas" has 2 total physical copies, not an original plus 2 additional copies. + +All replicas of the same shard contain the same subset of documents and share the same configuration. + +The number of replicas determines the level of fault tolerance the cluster has in the event of a node failure. +It also dictates the theoretical limit on the number of concurrent search requests that can be processed under heavy load. + +=== Leaders and Followers + +Among the replicas for a given shard, one replica is designated as the _leader_. +The leader serves as the source-of-truth for its shard. +When document updates are made, they are first processed by the leader replica and then propagated to the other replicas (the exact mechanism varies by cluster mode). + +The replicas which are not leaders are called _followers_. 
+ +=== Cores + +In Solr's implementation, each replica is represented as a _core_. +The term "core" is primarily an internal implementation detail—when you create a replica, Solr creates a core to represent it. +Multiple cores can be hosted on any one node. + +NOTE: The term "core" can be confusing because in everyday English it implies something central and singular, but in Solr it actually refers to one of potentially many replicas distributed across the cluster. +In most contexts, thinking of "core" as synonymous with "replica" will help clarify discussions about Solr's architecture. + +=== Collections and Indexes + +A _collection_ is the complete logical set of searchable documents that share a schema and configuration. +In SolrCloud mode (described below), a collection encompasses all the shards and their replicas. + +An _index_ refers to the physical data structures written to disk by Apache Lucene. +Each core (replica) maintains exactly one Lucene index on disk, containing the actual inverted indexes, stored fields, and other data structures that enable search. + +This creates a clear hierarchy from logical concepts to physical storage: + +[source,text] +---- +Collection (logical grouping of all searchable documents) + └─> Shard 1 (logical partition) + │ └─> Replica 1 / Core 1 (physical instance) + │ │ └─> Lucene Index (disk structures) + │ └─> Replica 2 / Core 2 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Shard 2 (logical partition) + └─> Replica 1 / Core 3 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Replica 2 / Core 4 (physical instance) + └─> Lucene Index (disk structures) +---- + +In this example, a collection is divided into 2 shards, each shard has 2 replicas for redundancy, and each replica maintains its own Lucene index on disk. + +== SolrCloud Mode + +SolrCloud mode (also called "SolrCloud") uses Apache ZooKeeper to provide the centralized cluster management that is its main feature. 
+ZooKeeper tracks each node of the cluster and the state of each core on each node. + +In this mode, configuration files are stored in ZooKeeper and not on the file system of each node. +When configuration changes are made, they must be uploaded to ZooKeeper, which in turn makes sure each node knows changes have been made. + +SolrCloud manages collections as first-class entities. +A collection represents the entire group of shards and replicas that together provide access to a corpus of documents. +Collections share the same configurations (schema, `solrconfig.xml`, etc.). +This centralization of cluster management means that operations can be performed on the entire collection at one time. + +When changes are made to configurations, a single command to reload the collection will automatically reload each individual core (replica) that is a member of the collection. + +Sharding is handled automatically, simply by telling Solr during collection creation how many shards you'd like the collection to have. +Document updates are then generally balanced between each shard automatically. +Some degree of control over what documents are stored in which shards is also available, if needed. + +ZooKeeper also handles load balancing and failover. +Incoming requests, either to index documents or for user queries, can be sent to any node of the cluster and ZooKeeper will route the request to an appropriate replica of each shard. + +In SolrCloud, the leader is flexible, with built-in mechanisms for automatic leader election in case the current leader fails. Review Comment: "the leader replica within a shard is flexible" - let's keep the leader's domain crystal clear. ########## solr/solr-ref-guide/modules/getting-started/pages/solr-glossary.adoc: ########## @@ -179,12 +201,17 @@ Logic and configuration parameters that tell Solr how to handle incoming "reques Logic and configuration parameters used by request handlers to process query requests. 
Examples of search components include faceting, highlighting, and "more like this" functionality. +[[server]]Server:: +The hardware or virtual machine that hosts Solr software. +A server may run one or more Solr <<node,Nodes>>. + [[shard]]Shard:: -In SolrCloud, a logical partition of a single <<collection,Collection>>. -Every shard consists of at least one physical <<replica,Replica>>, but there may be multiple Replicas distributed across multiple <<node,Nodes>> for fault tolerance. +A logical slice of a <<collection,Collection>>. +Each shard represents a logical partition containing a subset of the collection's documents. Review Comment: ```suggestion Each shard represents a partition containing a subset of the collection's documents. ``` ########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ########## @@ -0,0 +1,158 @@ += Solr Cluster Types +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +A Solr cluster is a group of servers that each run one or more Solr _nodes_. + +There are two general modes of operating a cluster of Solr nodes. 
+One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>). + +TIP: "User Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code. + +Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features. + +First let's cover a few general concepts and then outline the differences between the two modes. + +== Cluster Concepts + +=== Servers and Nodes + +A _server_ is the hardware or virtual machine that hosts Solr software. +A _node_ is an instance of a running Solr process that services search and indexing requests. +Large servers may run multiple Solr nodes, though typically one node per server is most common. + +=== Shards + +In both cluster modes, a logical collection of documents can be divided across nodes as _shards_. +Each shard represents a logical slice of the overall collection and contains a subset of the documents. + +The number of shards determines the theoretical limit to the number of documents that can be stored. +It also dictates the amount of parallelization possible for an individual search request. + +=== Replicas + +A shard is a logical concept—a slice of your collection. +A _replica_ is the physical manifestation of that logical shard. +It is the actual running instance that holds and serves the documents belonging to that shard. + +A shard must have at least one replica to exist physically. +If you have one shard with one physical copy, you have one replica. +If you add redundancy by creating additional copies of that shard, you have multiple replicas—each is equally a replica, including the first one. + +IMPORTANT: There is no "original shard" separate from its replicas. +The replicas ARE how the shard exists. +This is why we say "a shard with 2 replicas" has 2 total physical copies, not an original plus 2 additional copies. 
+ +All replicas of the same shard contain the same subset of documents and share the same configuration. + +The number of replicas determines the level of fault tolerance the cluster has in the event of a node failure. +It also dictates the theoretical limit on the number of concurrent search requests that can be processed under heavy load. + +=== Leaders and Followers + +Among the replicas for a given shard, one replica is designated as the _leader_. +The leader serves as the source-of-truth for its shard. +When document updates are made, they are first processed by the leader replica and then propagated to the other replicas (the exact mechanism varies by cluster mode). + +The replicas which are not leaders are called _followers_. + +=== Cores + +In Solr's implementation, each replica is represented as a _core_. +The term "core" is primarily an internal implementation detail—when you create a replica, Solr creates a core to represent it. +Multiple cores can be hosted on any one node. + +NOTE: The term "core" can be confusing because in everyday English it implies something central and singular, but in Solr it actually refers to one of potentially many replicas distributed across the cluster. +In most contexts, thinking of "core" as synonymous with "replica" will help clarify discussions about Solr's architecture. + +=== Collections and Indexes + +A _collection_ is the complete logical set of searchable documents that share a schema and configuration. +In SolrCloud mode (described below), a collection encompasses all the shards and their replicas. + +An _index_ refers to the physical data structures written to disk by Apache Lucene. +Each core (replica) maintains exactly one Lucene index on disk, containing the actual inverted indexes, stored fields, and other data structures that enable search. 
+ +This creates a clear hierarchy from logical concepts to physical storage: + +[source,text] +---- +Collection (logical grouping of all searchable documents) + └─> Shard 1 (logical partition) + │ └─> Replica 1 / Core 1 (physical instance) + │ │ └─> Lucene Index (disk structures) + │ └─> Replica 2 / Core 2 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Shard 2 (logical partition) + └─> Replica 1 / Core 3 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Replica 2 / Core 4 (physical instance) + └─> Lucene Index (disk structures) +---- + +In this example, a collection is divided into 2 shards, each shard has 2 replicas for redundancy, and each replica maintains its own Lucene index on disk. + +== SolrCloud Mode + +SolrCloud mode (also called "SolrCloud") uses Apache ZooKeeper to provide the centralized cluster management that is its main feature. +ZooKeeper tracks each node of the cluster and the state of each core on each node. + +In this mode, configuration files are stored in ZooKeeper and not on the file system of each node. Review Comment: I'd skip the "like most things" sentence. This will be dealt with in the section on user managed clusters. ########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ########## @@ -0,0 +1,158 @@ += Solr Cluster Types +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +A Solr cluster is a group of servers that each run one or more Solr _nodes_. + +There are two general modes of operating a cluster of Solr nodes. +One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>). + +TIP: "User Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code. + +Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features. + +First let's cover a few general concepts and then outline the differences between the two modes. + +== Cluster Concepts + +=== Servers and Nodes + +A _server_ is the hardware or virtual machine that hosts Solr software. +A _node_ is an instance of a running Solr process that services search and indexing requests. +Large servers may run multiple Solr nodes, though typically one node per server is most common. + +=== Shards + +In both cluster modes, a logical collection of documents can be divided across nodes as _shards_. +Each shard represents a logical slice of the overall collection and contains a subset of the documents. Review Comment: Or maybe: Shards slice a collection of documents into discrete non-overlapping subsets, and may be based on data values you specify or ranges of a hash on the document ID. 
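To make the "non-overlapping subsets" wording concrete, here is a minimal sketch of the two assignment styles mentioned above: ranges of a hash on the document ID, and assignment by a data value you specify. Everything in it is an assumption for illustration only: the function names are hypothetical, and `zlib.crc32` stands in for whatever hash function Solr actually uses, so this is not Solr's router.

```python
import zlib

# Assumption: a 32-bit hash space. zlib.crc32 returns an unsigned 32-bit int
# in Python 3; it is only a stand-in hash for this illustration.
HASH_SPACE = 2 ** 32


def shard_ranges(num_shards):
    """Split the hash space into contiguous, non-overlapping [start, end) ranges."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE if i == num_shards - 1 else start + step
        ranges.append((start, end))
    return ranges


def shard_for_id(doc_id, ranges):
    """Return the index of the single range that contains the hash of doc_id."""
    h = zlib.crc32(doc_id.encode("utf-8"))
    for index, (start, end) in enumerate(ranges):
        if start <= h < end:
            return index
    raise ValueError("hash fell outside every range")


def shard_for_field(doc, field, value_to_shard):
    """Return the shard chosen by an explicit mapping from a field value."""
    return value_to_shard[doc[field]]


if __name__ == "__main__":
    ranges = shard_ranges(4)
    for doc_id in ("doc-1", "doc-2", "doc-3"):
        print(doc_id, "-> shard", shard_for_id(doc_id, ranges))
    print(shard_for_field({"id": "doc-4", "region": "eu"}, "region", {"eu": 0, "us": 1}))
```

Because the ranges are contiguous and half-open, every document ID lands in exactly one shard, which is the property the suggested sentence is trying to state.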
########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ########## @@ -0,0 +1,158 @@ += Solr Cluster Types +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +A Solr cluster is a group of servers that each run one or more Solr _nodes_. + +There are two general modes of operating a cluster of Solr nodes. +One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>). + +TIP: "User Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code. + +Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features. + +First let's cover a few general concepts and then outline the differences between the two modes. + +== Cluster Concepts + +=== Servers and Nodes + +A _server_ is the hardware or virtual machine that hosts Solr software. +A _node_ is an instance of a running Solr process that services search and indexing requests. +Large servers may run multiple Solr nodes, though typically one node per server is most common. 
+ +=== Shards + +In both cluster modes, a logical collection of documents can be divided across nodes as _shards_. +Each shard represents a logical slice of the overall collection and contains a subset of the documents. + +The number of shards determines the theoretical limit to the number of documents that can be stored. +It also dictates the amount of parallelization possible for an individual search request. + +=== Replicas + +A shard is a logical concept—a slice of your collection. +A _replica_ is the physical manifestation of that logical shard. +It is the actual running instance that holds and serves the documents belonging to that shard. + +A shard must have at least one replica to exist physically. +If you have one shard with one physical copy, you have one replica. +If you add redundancy by creating additional copies of that shard, you have multiple replicas—each is equally a replica, including the first one. + +IMPORTANT: There is no "original shard" separate from its replicas. +The replicas ARE how the shard exists. +This is why we say "a shard with 2 replicas" has 2 total physical copies, not an original plus 2 additional copies. + +All replicas of the same shard contain the same subset of documents and share the same configuration. + +The number of replicas determines the level of fault tolerance the cluster has in the event of a node failure. +It also dictates the theoretical limit on the number of concurrent search requests that can be processed under heavy load. + +=== Leaders and Followers + +Among the replicas for a given shard, one replica is designated as the _leader_. +The leader serves as the source-of-truth for its shard. +When document updates are made, they are first processed by the leader replica and then propagated to the other replicas (the exact mechanism varies by cluster mode). + +The replicas which are not leaders are called _followers_. + +=== Cores + +In Solr's implementation, each replica is represented as a _core_. 
+The term "core" is primarily an internal implementation detail—when you create a replica, Solr creates a core to represent it. +Multiple cores can be hosted on any one node. + +NOTE: The term "core" can be confusing because in everyday English it implies something central and singular, but in Solr it actually refers to one of potentially many replicas distributed across the cluster. +In most contexts, thinking of "core" as synonymous with "replica" will help clarify discussions about Solr's architecture. + +=== Collections and Indexes + +A _collection_ is the complete logical set of searchable documents that share a schema and configuration. +In SolrCloud mode (described below), a collection encompasses all the shards and their replicas. + +An _index_ refers to the physical data structures written to disk by Apache Lucene. +Each core (replica) maintains exactly one Lucene index on disk, containing the actual inverted indexes, stored fields, and other data structures that enable search. + +This creates a clear hierarchy from logical concepts to physical storage: + +[source,text] +---- +Collection (logical grouping of all searchable documents) + └─> Shard 1 (logical partition) + │ └─> Replica 1 / Core 1 (physical instance) + │ │ └─> Lucene Index (disk structures) + │ └─> Replica 2 / Core 2 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Shard 2 (logical partition) + └─> Replica 1 / Core 3 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Replica 2 / Core 4 (physical instance) + └─> Lucene Index (disk structures) +---- + +In this example, a collection is divided into 2 shards, each shard has 2 replicas for redundancy, and each replica maintains its own Lucene index on disk. + +== SolrCloud Mode + +SolrCloud mode (also called "SolrCloud") uses Apache ZooKeeper to provide the centralized cluster management that is its main feature. +ZooKeeper tracks each node of the cluster and the state of each core on each node. 
+ +In this mode, configuration files are stored in ZooKeeper and not on the file system of each node. +When configuration changes are made, they must be uploaded to ZooKeeper, which in turn makes sure each node knows changes have been made. + +SolrCloud manages collections as first-class entities. +A collection represents the entire group of shards and replicas that together provide access to a corpus of documents. +Collections share the same configurations (schema, `solrconfig.xml`, etc.). +This centralization of cluster management means that operations can be performed on the entire collection at one time. + +When changes are made to configurations, a single command to reload the collection will automatically reload each individual core (replica) that is a member of the collection. + +Sharding is handled automatically, simply by telling Solr during collection creation how many shards you'd like the collection to have. +Document updates are then generally balanced between each shard automatically. +Some degree of control over what documents are stored in which shards is also available, if needed. + +ZooKeeper also handles load balancing and failover. +Incoming requests, either to index documents or for user queries, can be sent to any node of the cluster and ZooKeeper will route the request to an appropriate replica of each shard. + +In SolrCloud, the leader is flexible, with built-in mechanisms for automatic leader election in case the current leader fails. +This means another replica can become the leader, and from that point forward it is the source-of-truth for all other replicas of that shard. + +As long as one replica of each relevant shard is available, a user query or indexing request can still be satisfied when running in SolrCloud mode. + +== User-Managed Mode + +Solr's user-managed mode requires that cluster coordination activities that SolrCloud normally uses ZooKeeper for be performed manually or with local scripts. 
+ +If the corpus of documents is too large for a single shard, the logic to create multiple shards is entirely left to the user. +There are no automated or programmatic ways for Solr to create shards during indexing. + +Routing documents to shards is handled manually, either with a simple hashing system or a simple round-robin list of shards that sends each document to a different shard. Review Comment: ```suggestion Routing documents to shards is handled manually, either with a hashing system (that you design and implement), assignment of documents to shards based on the value of a field (implicit routing), or a simple round-robin list of shards that sends each document to a different shard. ``` (Do we even want to mention the round-robin case, since it makes updates challenging/slow? The only valid case I can imagine for it is as an optimization for super-high-volume immutable event data with little or no text analysis, where the cost of calculating a hash might become significant.) ########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ########## @@ -0,0 +1,158 @@ += Solr Cluster Types +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +A Solr cluster is a group of servers that each run one or more Solr _nodes_. + +There are two general modes of operating a cluster of Solr nodes. +One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>). + +TIP: "User Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code. + +Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features. + +First let's cover a few general concepts and then outline the differences between the two modes. + +== Cluster Concepts + +=== Servers and Nodes + +A _server_ is the hardware or virtual machine that hosts Solr software. +A _node_ is an instance of a running Solr process that services search and indexing requests. +Large servers may run multiple Solr nodes, though typically one node per server is most common. + +=== Shards + +In both cluster modes, a logical collection of documents can be divided across nodes as _shards_. +Each shard represents a logical slice of the overall collection and contains a subset of the documents. + +The number of shards determines the theoretical limit to the number of documents that can be stored. +It also dictates the amount of parallelization possible for an individual search request. + +=== Replicas + +A shard is a logical concept—a slice of your collection. +A _replica_ is the physical manifestation of that logical shard. +It is the actual running instance that holds and serves the documents belonging to that shard. + +A shard must have at least one replica to exist physically. +If you have one shard with one physical copy, you have one replica. +If you add redundancy by creating additional copies of that shard, you have multiple replicas—each is equally a replica, including the first one. + +IMPORTANT: There is no "original shard" separate from its replicas. 
+The replicas ARE how the shard exists. +This is why we say "a shard with 2 replicas" has 2 total physical copies, not an original plus 2 additional copies. + +All replicas of the same shard contain the same subset of documents and share the same configuration. + +The number of replicas determines the level of fault tolerance the cluster has in the event of a node failure. +It also dictates the theoretical limit on the number of concurrent search requests that can be processed under heavy load. + +=== Leaders and Followers + +Among the replicas for a given shard, one replica is designated as the _leader_. +The leader serves as the source-of-truth for its shard. +When document updates are made, they are first processed by the leader replica and then propagated to the other replicas (the exact mechanism varies by cluster mode). + +The replicas which are not leaders are called _followers_. + +=== Cores + +In Solr's implementation, each replica is represented as a _core_. +The term "core" is primarily an internal implementation detail—when you create a replica, Solr creates a core to represent it. +Multiple cores can be hosted on any one node. + +NOTE: The term "core" can be confusing because in everyday English it implies something central and singular, but in Solr it actually refers to one of potentially many replicas distributed across the cluster. +In most contexts, thinking of "core" as synonymous with "replica" will help clarify discussions about Solr's architecture. + +=== Collections and Indexes + +A _collection_ is the complete logical set of searchable documents that share a schema and configuration. +In SolrCloud mode (described below), a collection encompasses all the shards and their replicas. + +An _index_ refers to the physical data structures written to disk by Apache Lucene. +Each core (replica) maintains exactly one Lucene index on disk, containing the actual inverted indexes, stored fields, and other data structures that enable search. 
+ +This creates a clear hierarchy from logical concepts to physical storage: + +[source,text] +---- +Collection (logical grouping of all searchable documents) + └─> Shard 1 (logical partition) + │ └─> Replica 1 / Core 1 (physical instance) + │ │ └─> Lucene Index (disk structures) + │ └─> Replica 2 / Core 2 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Shard 2 (logical partition) + └─> Replica 1 / Core 3 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Replica 2 / Core 4 (physical instance) + └─> Lucene Index (disk structures) +---- + Review Comment: I like this tree. It represents the conceptual organization. We need one somewhere for the physical organization too, i.e., Cluster --> Node --> Replica. ########## solr/solr-ref-guide/modules/getting-started/pages/solr-glossary.adoc: ########## @@ -96,6 +105,11 @@ The arrangement of search results into categories based on indexed terms. [[field]]Field:: The content to be indexed/searched along with metadata defining how the content should be processed by Solr. +[[follower]]Follower:: +A <<replica,Replica>> that is not the <<leader,Leader>> for its <<shard,Shard>>. Review Comment: This is replica-level in cloud and node-level in standalone, which should probably be called out and clarified. ########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ########## @@ -0,0 +1,158 @@ += Solr Cluster Types +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +A Solr cluster is a group of servers that each run one or more Solr _nodes_. + +There are two general modes of operating a cluster of Solr nodes. +One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>). + +TIP: "User Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code. + +Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features. + +First let's cover a few general concepts and then outline the differences between the two modes. + +== Cluster Concepts + +=== Servers and Nodes + +A _server_ is the hardware or virtual machine that hosts Solr software. +A _node_ is an instance of a running Solr process that services search and indexing requests. +Large servers may run multiple Solr nodes, though typically one node per server is most common. + +=== Shards + +In both cluster modes, a logical collection of documents can be divided across nodes as _shards_. +Each shard represents a logical slice of the overall collection and contains a subset of the documents. + +The number of shards determines the theoretical limit to the number of documents that can be stored. +It also dictates the amount of parallelization possible for an individual search request. + +=== Replicas + +A shard is a logical concept—a slice of your collection. +A _replica_ is the physical manifestation of that logical shard. 
+It is the actual running instance that holds and serves the documents belonging to that shard. + +A shard must have at least one replica to exist physically. +If you have one shard with one physical copy, you have one replica. +If you add redundancy by creating additional copies of that shard, you have multiple replicas—each is equally a replica, including the first one. + +IMPORTANT: There is no "original shard" separate from its replicas. +The replicas ARE how the shard exists. +This is why we say "a shard with 2 replicas" has 2 total physical copies, not an original plus 2 additional copies. + +All replicas of the same shard contain the same subset of documents and share the same configuration. + +The number of replicas determines the level of fault tolerance the cluster has in the event of a node failure. +It also dictates the theoretical limit on the number of concurrent search requests that can be processed under heavy load. + +=== Leaders and Followers + +Among the replicas for a given shard, one replica is designated as the _leader_. +The leader serves as the source-of-truth for its shard. +When document updates are made, they are first processed by the leader replica and then propagated to the other replicas (the exact mechanism varies by cluster mode). + +The replicas which are not leaders are called _followers_. + +=== Cores + +In Solr's implementation, each replica is represented as a _core_. +The term "core" is primarily an internal implementation detail—when you create a replica, Solr creates a core to represent it. +Multiple cores can be hosted on any one node. + +NOTE: The term "core" can be confusing because in everyday English it implies something central and singular, but in Solr it actually refers to one of potentially many replicas distributed across the cluster. +In most contexts, thinking of "core" as synonymous with "replica" will help clarify discussions about Solr's architecture. 
+ +=== Collections and Indexes + +A _collection_ is the complete logical set of searchable documents that share a schema and configuration. +In SolrCloud mode (described below), a collection encompasses all the shards and their replicas. + +An _index_ refers to the physical data structures written to disk by Apache Lucene. +Each core (replica) maintains exactly one Lucene index on disk, containing the actual inverted indexes, stored fields, and other data structures that enable search. Review Comment: no need to use "core" here, just say replica ########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ########## @@ -0,0 +1,158 @@ += Solr Cluster Types +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +A Solr cluster is a group of servers that each run one or more Solr _nodes_. + +There are two general modes of operating a cluster of Solr nodes. +One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>). + +TIP: "User Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code. 
+ +Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features. + +First let's cover a few general concepts and then outline the differences between the two modes. + +== Cluster Concepts + +=== Servers and Nodes + +A _server_ is the hardware or virtual machine that hosts Solr software. +A _node_ is an instance of a running Solr process that services search and indexing requests. +Large servers may run multiple Solr nodes, though typically one node per server is most common. + +=== Shards + +In both cluster modes, a logical collection of documents can be divided across nodes as _shards_. +Each shard represents a logical slice of the overall collection and contains a subset of the documents. + +The number of shards determines the theoretical limit to the number of documents that can be stored. +It also dictates the amount of parallelization possible for an individual search request. + +=== Replicas + +A shard is a logical concept—a slice of your collection. +A _replica_ is the physical manifestation of that logical shard. +It is the actual running instance that holds and serves the documents belonging to that shard. + +A shard must have at least one replica to exist physically. +If you have one shard with one physical copy, you have one replica. +If you add redundancy by creating additional copies of that shard, you have multiple replicas—each is equally a replica, including the first one. + +IMPORTANT: There is no "original shard" separate from its replicas. +The replicas ARE how the shard exists. +This is why we say "a shard with 2 replicas" has 2 total physical copies, not an original plus 2 additional copies. + +All replicas of the same shard contain the same subset of documents and share the same configuration. + +The number of replicas determines the level of fault tolerance the cluster has in the event of a node failure. 
+It also dictates the theoretical limit on the number of concurrent search requests that can be processed under heavy load. + +=== Leaders and Followers + +Among the replicas for a given shard, one replica is designated as the _leader_. +The leader serves as the source-of-truth for its shard. +When document updates are made, they are first processed by the leader replica and then propagated to the other replicas (the exact mechanism varies by cluster mode). + +The replicas which are not leaders are called _followers_. + +=== Cores Review Comment: I think I would write this section somewhat differently... > Historically the term "core" has _mostly_ been used as a synonym for replica, but the term "core" can be confusing because in everyday English it implies something central and singular. Since there may be many replicas in Solr, and they are distributed across the cluster, "replica" is the preferred term. "Core" is mostly used only for historical reasons in the code base and in other places where renaming things would be disruptive. ########## solr/solr-ref-guide/modules/getting-started/pages/solr-glossary.adoc: ########## @@ -213,6 +240,12 @@ Synonyms generally are terms which are near to each other in meaning and may sub In a search engine implementation, synonyms may be abbreviations as well as words, or terms that are not consistently hyphenated. Examples of synonyms in this context would be "Inc." and "Incorporated" or "iPod" and "i-pod". +[[standalone]]Standalone:: +An informal term referring to Solr deployments that do not use <<solrclouddef,SolrCloud>> mode. Review Comment: ```suggestion An informal term referring to Solr deployments that do not utilize Apache ZooKeeper and thus do not provide the centralized configuration management that is available in <<solrclouddef,SolrCloud>> mode. 
``` ########## solr/solr-ref-guide/modules/getting-started/pages/solr-glossary.adoc: ########## @@ -163,7 +181,11 @@ The ability of a search engine to retrieve _all_ of the possible matches to a us The appropriateness of a document to the search conducted by the user. [[replica]]Replica:: -A <<core,Core>> that acts as a physical copy of a <<shard,Shard>> in a <<solrclouddef,SolrCloud>> <<collection,Collection>>. +The physical manifestation of a logical <<shard,Shard>>. +A replica is the actual running instance (represented as a <<core,Core>>) that holds and serves the documents belonging to that shard. Review Comment: No need to mention core here, we should promote one favored name for each entity. ########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ########## @@ -0,0 +1,158 @@ += Solr Cluster Types +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +A Solr cluster is a group of servers that each run one or more Solr _nodes_. + +There are two general modes of operating a cluster of Solr nodes. +One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>). 
+ +TIP: "User Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code. + +Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features. + +First let's cover a few general concepts and then outline the differences between the two modes. + +== Cluster Concepts + +=== Servers and Nodes + +A _server_ is the hardware or virtual machine that hosts Solr software. +A _node_ is an instance of a running Solr process that services search and indexing requests. +Large servers may run multiple Solr nodes, though typically one node per server is most common. + +=== Shards + +In both cluster modes, a logical collection of documents can be divided across nodes as _shards_. Review Comment: I don't think the word "logical" actually adds clarity. ########## solr/solr-ref-guide/modules/getting-started/pages/solr-glossary.adoc: ########## @@ -114,8 +132,8 @@ Since users search using terms they expect to be in documents, finding the term === L [[leader]]Leader:: -A single <<replica,Replica>> for each <<shard,Shard>> that takes charge of coordinating index updates (document additions or deletions) to other replicas in the same shard. -This is a transient responsibility assigned to a node via an election, if the current Shard Leader goes down, a new node will automatically be elected to take its place. +A single <<replica,Replica>> for each <<shard,Shard>> that serves as the source-of-truth and coordinates index updates (document additions or deletions) to the <<follower,follower>> replicas in the same shard. Review Comment: Again, the cloud/standalone differences should be called out. ########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ########## @@ -0,0 +1,158 @@ += Solr Cluster Types +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. 
See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +A Solr cluster is a group of servers that each run one or more Solr _nodes_. + +There are two general modes of operating a cluster of Solr nodes. +One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>). + +TIP: "User Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code. + +Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features. + +First let's cover a few general concepts and then outline the differences between the two modes. + +== Cluster Concepts + +=== Servers and Nodes + +A _server_ is the hardware or virtual machine that hosts Solr software. +A _node_ is an instance of a running Solr process that services search and indexing requests. +Large servers may run multiple Solr nodes, though typically one node per server is most common. + +=== Shards + +In both cluster modes, a logical collection of documents can be divided across nodes as _shards_. +Each shard represents a logical slice of the overall collection and contains a subset of the documents. 
+ +The number of shards determines the theoretical limit to the number of documents that can be stored. +It also dictates the amount of parallelization possible for an individual search request. + +=== Replicas + +A shard is a logical concept—a slice of your collection. +A _replica_ is the physical manifestation of that logical shard. +It is the actual running instance that holds and serves the documents belonging to that shard. + +A shard must have at least one replica to exist physically. +If you have one shard with one physical copy, you have one replica. +If you add redundancy by creating additional copies of that shard, you have multiple replicas—each is equally a replica, including the first one. + +IMPORTANT: There is no "original shard" separate from its replicas. +The replicas ARE how the shard exists. +This is why we say "a shard with 2 replicas" has 2 total physical copies, not an original plus 2 additional copies. + +All replicas of the same shard contain the same subset of documents and share the same configuration. + +The number of replicas determines the level of fault tolerance the cluster has in the event of a node failure. +It also dictates the theoretical limit on the number of concurrent search requests that can be processed under heavy load. + +=== Leaders and Followers + +Among the replicas for a given shard, one replica is designated as the _leader_. +The leader serves as the source-of-truth for its shard. +When document updates are made, they are first processed by the leader replica and then propagated to the other replicas (the exact mechanism varies by cluster mode). + +The replicas which are not leaders are called _followers_. + +=== Cores + +In Solr's implementation, each replica is represented as a _core_. +The term "core" is primarily an internal implementation detail—when you create a replica, Solr creates a core to represent it. +Multiple cores can be hosted on any one node. 
+ +NOTE: The term "core" can be confusing because in everyday English it implies something central and singular, but in Solr it actually refers to one of potentially many replicas distributed across the cluster. +In most contexts, thinking of "core" as synonymous with "replica" will help clarify discussions about Solr's architecture. + +=== Collections and Indexes + +A _collection_ is the complete logical set of searchable documents that share a schema and configuration. +In SolrCloud mode (described below), a collection encompasses all the shards and their replicas. + +An _index_ refers to the physical data structures written to disk by Apache Lucene. +Each core (replica) maintains exactly one Lucene index on disk, containing the actual inverted indexes, stored fields, and other data structures that enable search. + +This creates a clear hierarchy from logical concepts to physical storage: + +[source,text] +---- +Collection (logical grouping of all searchable documents) + └─> Shard 1 (logical partition) + │ └─> Replica 1 / Core 1 (physical instance) + │ │ └─> Lucene Index (disk structures) + │ └─> Replica 2 / Core 2 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Shard 2 (logical partition) + └─> Replica 1 / Core 3 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Replica 2 / Core 4 (physical instance) + └─> Lucene Index (disk structures) +---- + Review Comment: (maybe put cluster over the top of this one too) ########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ########## @@ -0,0 +1,158 @@ += Solr Cluster Types +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +A Solr cluster is a group of servers that each run one or more Solr _nodes_. + +There are two general modes of operating a cluster of Solr nodes. +One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>). + +TIP: "User Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code. + +Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features. + +First let's cover a few general concepts and then outline the differences between the two modes. + +== Cluster Concepts + +=== Servers and Nodes + +A _server_ is the hardware or virtual machine that hosts Solr software. +A _node_ is an instance of a running Solr process that services search and indexing requests. +Large servers may run multiple Solr nodes, though typically one node per server is most common. + +=== Shards + +In both cluster modes, a logical collection of documents can be divided across nodes as _shards_. +Each shard represents a logical slice of the overall collection and contains a subset of the documents. + +The number of shards determines the theoretical limit to the number of documents that can be stored. +It also dictates the amount of parallelization possible for an individual search request. + +=== Replicas + +A shard is a logical concept—a slice of your collection. +A _replica_ is the physical manifestation of that logical shard. 
+It is the actual running instance that holds and serves the documents belonging to that shard. + +A shard must have at least one replica to exist physically. +If you have one shard with one physical copy, you have one replica. +If you add redundancy by creating additional copies of that shard, you have multiple replicas—each is equally a replica, including the first one. + +IMPORTANT: There is no "original shard" separate from its replicas. +The replicas ARE how the shard exists. +This is why we say "a shard with 2 replicas" has 2 total physical copies, not an original plus 2 additional copies. + +All replicas of the same shard contain the same subset of documents and share the same configuration. + +The number of replicas determines the level of fault tolerance the cluster has in the event of a node failure. +It also dictates the theoretical limit on the number of concurrent search requests that can be processed under heavy load. + +=== Leaders and Followers + +Among the replicas for a given shard, one replica is designated as the _leader_. +The leader serves as the source-of-truth for its shard. +When document updates are made, they are first processed by the leader replica and then propagated to the other replicas (the exact mechanism varies by cluster mode). + +The replicas which are not leaders are called _followers_. + +=== Cores + +In Solr's implementation, each replica is represented as a _core_. +The term "core" is primarily an internal implementation detail—when you create a replica, Solr creates a core to represent it. +Multiple cores can be hosted on any one node. + +NOTE: The term "core" can be confusing because in everyday English it implies something central and singular, but in Solr it actually refers to one of potentially many replicas distributed across the cluster. +In most contexts, thinking of "core" as synonymous with "replica" will help clarify discussions about Solr's architecture. 
+ +=== Collections and Indexes + +A _collection_ is the complete logical set of searchable documents that share a schema and configuration. +In SolrCloud mode (described below), a collection encompasses all the shards and their replicas. + +An _index_ refers to the physical data structures written to disk by Apache Lucene. +Each core (replica) maintains exactly one Lucene index on disk, containing the actual inverted indexes, stored fields, and other data structures that enable search. + +This creates a clear hierarchy from logical concepts to physical storage: + +[source,text] +---- +Collection (logical grouping of all searchable documents) + └─> Shard 1 (logical partition) + │ └─> Replica 1 / Core 1 (physical instance) + │ │ └─> Lucene Index (disk structures) + │ └─> Replica 2 / Core 2 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Shard 2 (logical partition) + └─> Replica 1 / Core 3 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Replica 2 / Core 4 (physical instance) + └─> Lucene Index (disk structures) +---- + +In this example, a collection is divided into 2 shards, each shard has 2 replicas for redundancy, and each replica maintains its own Lucene index on disk. + +== SolrCloud Mode + +SolrCloud mode (also called "SolrCloud") uses Apache ZooKeeper to provide the centralized cluster management that is its main feature. +ZooKeeper tracks each node of the cluster and the state of each core on each node. + +In this mode, configuration files are stored in ZooKeeper and not on the file system of each node. +When configuration changes are made, they must be uploaded to ZooKeeper, which in turn makes sure each node knows changes have been made. + +SolrCloud manages collections as first-class entities. +A collection represents the entire group of shards and replicas that together provide access to a corpus of documents. +Collections share the same configurations (schema, `solrconfig.xml`, etc.). 
+This centralization of cluster management means that operations can be performed on the entire collection at one time. + +When changes are made to configurations, a single command to reload the collection will automatically reload each individual core (replica) that is a member of the collection. + +Sharding is handled automatically, simply by telling Solr during collection creation how many shards you'd like the collection to have. +Document updates are then generally balanced between each shard automatically. +Some degree of control over what documents are stored in which shards is also available, if needed. + +ZooKeeper also handles load balancing and failover. +Incoming requests, either to index documents or for user queries, can be sent to any node of the cluster and ZooKeeper will route the request to an appropriate replica of each shard. + +In SolrCloud, the leader is flexible, with built-in mechanisms for automatic leader election in case the current leader fails. +This means another replica can become the leader, and from that point forward it is the source-of-truth for all other replicas of that shard. + +As long as one replica of each relevant shard is available, a user query or indexing request can still be satisfied when running in SolrCloud mode. + +== User-Managed Mode + +Solr's user-managed mode requires that cluster coordination activities that SolrCloud normally uses ZooKeeper for be performed manually or with local scripts. Review Comment: "...thus they have no concept of collections or shards, and ZooKeeper is not used as centralized storage for any configuration or real-time state." I'm leaving out "replica" on purpose there because followers can wind up looking somewhat similar. I also think we can sell/clarify the benefit of ZooKeeper here too. 
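For illustration only, a minimal Python sketch of the two views discussed in this review: the conceptual Collection -> Shard -> Replica tree from the page, and the physical Cluster -> Node -> Replica organization. The `Replica` class, the `films` collection, and the node and replica names are all hypothetical; nothing here is a Solr API, it only models the terminology.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """One physical copy of a shard (hypothetical model, not a Solr class)."""
    collection: str
    shard: str
    name: str
    node: str  # hypothetical identifier of the node hosting this replica


def conceptual_view(replicas):
    """Group replicas as Collection -> Shard -> [replica names]."""
    tree = defaultdict(lambda: defaultdict(list))
    for r in replicas:
        tree[r.collection][r.shard].append(r.name)
    # Convert nested defaultdicts to plain dicts for easy comparison/printing.
    return {coll: dict(shards) for coll, shards in tree.items()}


def physical_view(replicas):
    """Group the same replicas as Node -> [replica names].

    The cluster is simply the set of all nodes (the keys of the result).
    """
    nodes = defaultdict(list)
    for r in replicas:
        nodes[r.node].append(r.name)
    return dict(nodes)


# Assumed example layout: 2 shards x 2 replicas, spread over 2 nodes.
REPLICAS = [
    Replica("films", "shard1", "films_shard1_replica1", "node1"),
    Replica("films", "shard1", "films_shard1_replica2", "node2"),
    Replica("films", "shard2", "films_shard2_replica1", "node1"),
    Replica("films", "shard2", "films_shard2_replica2", "node2"),
]

if __name__ == "__main__":
    print(conceptual_view(REPLICAS))
    print(physical_view(REPLICAS))
```

The same four replicas appear in both groupings, which is the point of having two diagrams: the conceptual tree answers "which shard does this replica belong to", while the physical one answers "which node hosts it".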
########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ########## @@ -0,0 +1,158 @@ += Solr Cluster Types +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +A Solr cluster is a group of servers that each run one or more Solr _nodes_. + +There are two general modes of operating a cluster of Solr nodes. +One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>). + +TIP: "User Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code. + +Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features. + +First let's cover a few general concepts and then outline the differences between the two modes. + +== Cluster Concepts + +=== Servers and Nodes + +A _server_ is the hardware or virtual machine that hosts Solr software. +A _node_ is an instance of a running Solr process that services search and indexing requests. +Large servers may run multiple Solr nodes, though typically one node per server is most common. 
+ +=== Shards + +In both cluster modes, a logical collection of documents can be divided across nodes as _shards_. +Each shard represents a logical slice of the overall collection and contains a subset of the documents. + +The number of shards determines the theoretical limit to the number of documents that can be stored. +It also dictates the amount of parallelization possible for an individual search request. + +=== Replicas + +A shard is a logical concept—a slice of your collection. +A _replica_ is the physical manifestation of that logical shard. +It is the actual running instance that holds and serves the documents belonging to that shard. + +A shard must have at least one replica to exist physically. +If you have one shard with one physical copy, you have one replica. +If you add redundancy by creating additional copies of that shard, you have multiple replicas—each is equally a replica, including the first one. + +IMPORTANT: There is no "original shard" separate from its replicas. +The replicas ARE how the shard exists. +This is why we say "a shard with 2 replicas" has 2 total physical copies, not an original plus 2 additional copies. + +All replicas of the same shard contain the same subset of documents and share the same configuration. + +The number of replicas determines the level of fault tolerance the cluster has in the event of a node failure. +It also dictates the theoretical limit on the number of concurrent search requests that can be processed under heavy load. + +=== Leaders and Followers + +Among the replicas for a given shard, one replica is designated as the _leader_. +The leader serves as the source-of-truth for its shard. +When document updates are made, they are first processed by the leader replica and then propagated to the other replicas (the exact mechanism varies by cluster mode). + +The replicas which are not leaders are called _followers_. + +=== Cores + +In Solr's implementation, each replica is represented as a _core_. 
+The term "core" is primarily an internal implementation detail—when you create a replica, Solr creates a core to represent it. +Multiple cores can be hosted on any one node. + +NOTE: The term "core" can be confusing because in everyday English it implies something central and singular, but in Solr it actually refers to one of potentially many replicas distributed across the cluster. +In most contexts, thinking of "core" as synonymous with "replica" will help clarify discussions about Solr's architecture. + +=== Collections and Indexes + +A _collection_ is the complete logical set of searchable documents that share a schema and configuration. +In SolrCloud mode (described below), a collection encompasses all the shards and their replicas. + +An _index_ refers to the physical data structures written to disk by Apache Lucene. +Each core (replica) maintains exactly one Lucene index on disk, containing the actual inverted indexes, stored fields, and other data structures that enable search. + +This creates a clear hierarchy from logical concepts to physical storage: + +[source,text] +---- +Collection (logical grouping of all searchable documents) + └─> Shard 1 (logical partition) + │ └─> Replica 1 / Core 1 (physical instance) + │ │ └─> Lucene Index (disk structures) + │ └─> Replica 2 / Core 2 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Shard 2 (logical partition) + └─> Replica 1 / Core 3 (physical instance) + │ └─> Lucene Index (disk structures) + └─> Replica 2 / Core 4 (physical instance) + └─> Lucene Index (disk structures) +---- + +In this example, a collection is divided into 2 shards, each shard has 2 replicas for redundancy, and each replica maintains its own Lucene index on disk. + +== SolrCloud Mode + +SolrCloud mode (also called "SolrCloud") uses Apache ZooKeeper to provide the centralized cluster management that is its main feature. +ZooKeeper tracks each node of the cluster and the state of each core on each node. 
+ +In this mode, configuration files are stored in ZooKeeper and not on the file system of each node. +When configuration changes are made, they must be uploaded to ZooKeeper, which in turn makes sure each node knows changes have been made. + +SolrCloud manages collections as first-class entities. +A collection represents the entire group of shards and replicas that together provide access to a corpus of documents. +Collections share the same configurations (schema, `solrconfig.xml`, etc.). +This centralization of cluster management means that operations can be performed on the entire collection at one time. + +When changes are made to configurations, a single command to reload the collection will automatically reload each individual core (replica) that is a member of the collection.
Review Comment: Actually re-open seems vague to me. Reload has a clear sense of out with the old, in with the new (for me at least). But maybe it should say "...will automatically reload the configuration for each replica that is a member..." Since it's not really messing with all the data, just the config?

########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ##########
+All replicas of the same shard contain the same subset of documents and share the same configuration.
Review Comment: +1 and hyperlink "collection" to the section below

########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ##########
+Large servers may run multiple Solr nodes, though typically one node per server is most common.
Review Comment: Or just be specific? In special cases where oversized **_pre-existing_** hardware must be utilized, a server might host two or more nodes. Note that such configurations are typically sub-optimal.

########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ##########
+A _replica_ is the physical manifestation of that logical shard.
+It is the actual running instance that holds and serves the documents belonging to that shard.
Review Comment: "likely" raises the question when does it not have an update log? Can we clarify when or omit? Discussion of what a node does seems better placed in the node section? (with the word replica as a hyperlink to this section). "SolrCore" is a class in the code, details like that are developer documentation, not relevant to the user. No need to say anything other than "replica" here? I do like the idea of noting that there is one Lucene index per replica here, but it seems better (to me) to remain focused on the idea, behind "replica" not the implementation.

########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ##########
+Sharding is handled automatically, simply by telling Solr during collection creation how many shards you'd like the collection to have.
+Document updates are then generally balanced between each shard automatically.
Review Comment: This is optional... implicit routing can still be used with cloud. Perhaps this and the previous sentence about automatic sharding could be combined into "Collections may also be configured to provide automatic routing of documents to shards by hashing document ids and automatically assigning ranges of the possible hash values to shards." Hopefully that captures/sells the value of the automation without overstating it as a requirement.

########## solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc: ##########
+Some degree of control over what documents are stored in which shards is also available, if needed. + +ZooKeeper also handles load balancing and failover. +Incoming requests, either to index documents or for user queries, can be sent to any node of the cluster and ZooKeeper will route the request to an appropriate replica of each shard. Review Comment: Right. Zookeeper merely records the range of hash values for the shard. Once Solr has read those values zookeeper isn't (or at least shouldn't be!) consulted again unless the node containing the shard information zk was updated for some reason. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
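
For what it's worth, the hash-range routing discussed in the review comments above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Solr's actual implementation: the helper names `assign_ranges` and `route`, the unsigned 32-bit hash space, and the use of CRC32 as the hash function are all hypothetical choices made for the example.

```python
import zlib
from bisect import bisect_right

# Assumption for this sketch: hashes fall in an unsigned 32-bit space.
HASH_SPACE = 2 ** 32


def assign_ranges(shard_names):
    """Split the hash space into contiguous, near-equal ranges, one per shard.

    Returns the sorted list of range start values and the matching shard names.
    """
    n = len(shard_names)
    starts = [i * HASH_SPACE // n for i in range(n)]
    return starts, list(shard_names)


def route(doc_id, starts, shard_names):
    """Return the shard whose hash range contains the hash of doc_id."""
    # CRC32 is a stand-in hash chosen for the example, not the one Solr uses.
    h = zlib.crc32(doc_id.encode("utf-8"))
    # starts[0] is always 0, so the index below is never negative.
    return shard_names[bisect_right(starts, h) - 1]


# The ranges are computed (or read from cluster state) once and kept locally;
# routing each document afterwards is a purely local lookup.
starts, names = assign_ranges(["shard1", "shard2"])
for doc_id in ("book-1", "book-2", "book-3"):
    print(doc_id, "->", route(doc_id, starts, names))
```

The point of the sketch is the one made in the last comment: once the ranges are known, choosing a shard for a document is a local computation on the document id, so nothing needs to be consulted per request unless the recorded ranges change.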
