dsmiley commented on code in PR #4246: URL: https://github.com/apache/solr/pull/4246#discussion_r3007075073
##########
solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc:
##########

@@ -0,0 +1,158 @@
+= Solr Cluster Types
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+A Solr cluster is a group of servers that each run one or more Solr _nodes_.
+
+There are two general modes of operating a cluster of Solr nodes.
+One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>).
+
+TIP: "User-Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code.
+
+Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features.
+
+First let's cover a few general concepts and then outline the differences between the two modes.
+
+== Cluster Concepts
+
+=== Servers and Nodes
+
+A _server_ is the hardware or virtual machine that hosts Solr software.
+A _node_ is an instance of a running Solr process that services search and indexing requests.
+Large servers may run multiple Solr nodes, though one node per server is most common.
+
+=== Shards
+
+In both cluster modes, a logical collection of documents can be divided across nodes as _shards_.
+Each shard represents a logical slice of the overall collection and contains a subset of the documents.
+
+The number of shards determines the theoretical limit on the number of documents that can be stored.
+It also dictates the amount of parallelization possible for an individual search request.
+
+=== Replicas
+
+A shard is a logical concept—a slice of your collection.
+A _replica_ is the physical manifestation of that logical shard.
+It is the actual running instance that holds and serves the documents belonging to that shard.
+
+A shard must have at least one replica to exist physically.
+If you have one shard with one physical copy, you have one replica.
+If you add redundancy by creating additional copies of that shard, you have multiple replicas—each is equally a replica, including the first one.
+
+IMPORTANT: There is no "original shard" separate from its replicas.
+The replicas _are_ how the shard exists.
+This is why we say "a shard with 2 replicas" has 2 total physical copies, not an original plus 2 additional copies.
+
+All replicas of the same shard contain the same subset of documents and share the same configuration.
+
+The number of replicas determines the level of fault tolerance the cluster has in the event of a node failure.
+It also dictates the theoretical limit on the number of concurrent search requests that can be processed under heavy load.
+
+=== Leaders and Followers
+
+Among the replicas for a given shard, one replica is designated as the _leader_.
+The leader serves as the source of truth for its shard.
+When document updates are made, they are first processed by the leader replica and then propagated to the other replicas (the exact mechanism varies by cluster mode).
+
+The replicas which are not leaders are called _followers_.
+
+=== Cores
+
+In Solr's implementation, each replica is represented as a _core_.
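To make the shard concept above concrete, here is a simplified sketch of hash-based document routing in Python. This is an illustration only: Solr's actual `compositeId` router assigns documents to shards by MurmurHash3 hash ranges, not by the modulo arithmetic shown here, and the document ids are made up.

```python
# Simplified illustration of hash-based shard routing.
# NOTE: Solr's real compositeId router uses MurmurHash3 hash ranges,
# not modulo arithmetic; this sketch only conveys the idea that a
# document's id deterministically selects exactly one shard.
import hashlib

NUM_SHARDS = 2  # matches the 2-shard example in the text

def shard_for(doc_id: str) -> str:
    """Map a document id to one shard of the collection."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return f"shard{int(digest, 16) % NUM_SHARDS + 1}"

for doc_id in ["SOLR-1001", "SOLR-1002", "SOLR-1003"]:
    # Every document lands on exactly one shard, deterministically.
    print(doc_id, "->", shard_for(doc_id))
```

Because the mapping is deterministic, a query for a given id can be answered by a single shard, while a full-collection search fans out to all shards in parallel.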
+The term "core" is primarily an internal implementation detail—when you create a replica, Solr creates a core to represent it.
+Multiple cores can be hosted on any one node.
+
+NOTE: The term "core" can be confusing because in everyday English it implies something central and singular, but in Solr it refers to one of potentially many replicas distributed across the cluster.
+In most contexts, thinking of "core" as synonymous with "replica" will help clarify discussions about Solr's architecture.
+
+=== Collections and Indexes
+
+A _collection_ is the complete logical set of searchable documents that share a schema and configuration.
+In SolrCloud mode (described below), a collection encompasses all the shards and their replicas.
+
+An _index_ refers to the physical data structures written to disk by Apache Lucene.
+Each core (replica) maintains exactly one Lucene index on disk, containing the actual inverted indexes, stored fields, and other data structures that enable search.
+
+This creates a clear hierarchy from logical concepts to physical storage:
+
+[source,text]
+----
+Collection (logical grouping of all searchable documents)
+ ├─> Shard 1 (logical partition)
+ │    ├─> Replica 1 / Core 1 (physical instance)
+ │    │    └─> Lucene Index (disk structures)
+ │    └─> Replica 2 / Core 2 (physical instance)
+ │         └─> Lucene Index (disk structures)
+ └─> Shard 2 (logical partition)
+      ├─> Replica 1 / Core 3 (physical instance)
+      │    └─> Lucene Index (disk structures)
+      └─> Replica 2 / Core 4 (physical instance)
+           └─> Lucene Index (disk structures)
+----
+
+In this example, a collection is divided into 2 shards, each shard has 2 replicas for redundancy, and each replica maintains its own Lucene index on disk.
+
+== SolrCloud Mode
+
+SolrCloud mode (also called "SolrCloud") uses Apache ZooKeeper to provide the centralized cluster management that is its main feature.
+ZooKeeper tracks each node of the cluster and the state of each core on each node.
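As a loose illustration of the kind of per-collection state ZooKeeper tracks, here is a sketch in Python. The nested-dict shape, collection name, node addresses, and replica names below are all assumptions for illustration; they do not reproduce Solr's actual `state.json` format.

```python
# Illustrative sketch (assumed shape, NOT Solr's actual state.json format)
# of the cluster state ZooKeeper tracks: which replicas (cores) exist for
# each shard, which node hosts each one, and which replica is the leader.
cluster_state = {
    "techproducts": {                     # hypothetical collection name
        "shard1": {
            "core_node1": {"node": "10.0.0.1:8983", "leader": True},
            "core_node2": {"node": "10.0.0.2:8983", "leader": False},
        },
        "shard2": {
            "core_node3": {"node": "10.0.0.1:8983", "leader": False},
            "core_node4": {"node": "10.0.0.2:8983", "leader": True},
        },
    }
}

def leader_of(collection: str, shard: str) -> str:
    """Return the name of the leader replica for a given shard."""
    replicas = cluster_state[collection][shard]
    return next(name for name, r in replicas.items() if r["leader"])

print(leader_of("techproducts", "shard1"))  # -> core_node1
```

Because this state lives centrally rather than on each node, every node (and every client that reads it) shares one consistent view of where each shard's replicas live and which one currently leads.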
+
+In this mode, configuration files are stored in ZooKeeper and not on the file system of each node.
+When configuration changes are made, they must be uploaded to ZooKeeper, which in turn ensures each node learns of the changes.
+
+SolrCloud manages collections as first-class entities.
+A collection represents the entire group of shards and replicas that together provide access to a corpus of documents.
+All replicas in a collection share the same configuration (schema, `solrconfig.xml`, etc.).
+This centralization of cluster management means that operations can be performed on the entire collection at one time.
+
+When changes are made to configurations, a single command to reload the collection will automatically reload each individual core (replica) that is a member of the collection.

Review Comment:
   The word "reload" by itself is ambiguous as to the scope of what is being reloaded. Even simply adding "core" or "collection" to it is still ambiguous. My first thought, and my instinctive thought, is that this has to do with the *data*. *If* the word "configuration" were to be tacked on here... _then_ that would bring great clarity.
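On the reload discussion above: in SolrCloud, reloading a collection's configuration is done through the Collections API `RELOAD` action. Here is a minimal sketch of building that request URL; the host and collection name are hypothetical placeholders.

```python
# Sketch of building a Collections API RELOAD request, which re-reads
# configuration for every replica (core) in a collection.
# The base URL and collection name below are placeholder assumptions.
from urllib.parse import urlencode

def reload_collection_url(base: str, collection: str) -> str:
    """Build the Collections API URL that reloads a collection's configuration."""
    params = urlencode({"action": "RELOAD", "name": collection})
    return f"{base}/admin/collections?{params}"

url = reload_collection_url("http://localhost:8983/solr", "techproducts")
print(url)
# -> http://localhost:8983/solr/admin/collections?action=RELOAD&name=techproducts
```

Note that this reloads configuration only; it is a different operation from re-indexing or replicating the collection's data, which is exactly the ambiguity the review comment calls out.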
