dsmiley commented on code in PR #4246: URL: https://github.com/apache/solr/pull/4246#discussion_r3007075073
##########
solr/solr-ref-guide/modules/getting-started/pages/cluster-types.adoc:
##########

@@ -0,0 +1,158 @@
+= Solr Cluster Types
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+A Solr cluster is a group of servers that each run one or more Solr _nodes_.
+
+There are two general modes of operating a cluster of Solr nodes.
+One mode provides central coordination of the Solr nodes (<<SolrCloud Mode>>), while the other allows you to operate a cluster without this central coordination (<<User-Managed Mode>>).
+
+TIP: "User-Managed" and "Single Node" are sometimes referred to as "Standalone", especially in source code.
+
+Both modes share general concepts, but ultimately differ in how those concepts are reflected in functionality and features.
+
+First let's cover a few general concepts and then outline the differences between the two modes.
+
+== Cluster Concepts
+
+=== Servers and Nodes
+
+A _server_ is the hardware or virtual machine that hosts Solr software.
+A _node_ is an instance of a running Solr process that services search and indexing requests.
+Large servers may run multiple Solr nodes, though one node per server is most common.
+
+=== Shards
+
+In both cluster modes, a logical collection of documents can be divided across nodes as _shards_.
+Each shard represents a logical slice of the overall collection and contains a subset of the documents.
+
+The number of shards determines the theoretical limit on the number of documents that can be stored.
+It also dictates the amount of parallelization possible for an individual search request.
+
+=== Replicas
+
+A shard is a logical concept—a slice of your collection.
+A _replica_ is the physical manifestation of that logical shard.
+It is the actual running instance that holds and serves the documents belonging to that shard.
+
+A shard must have at least one replica to exist physically.
+If you have one shard with one physical copy, you have one replica.
+If you add redundancy by creating additional copies of that shard, you have multiple replicas—each is equally a replica, including the first one.
+
+IMPORTANT: There is no "original shard" separate from its replicas.
+The replicas _are_ how the shard exists.
+This is why we say "a shard with 2 replicas" has 2 total physical copies, not an original plus 2 additional copies.
+
+All replicas of the same shard contain the same subset of documents and share the same configuration.
+
+The number of replicas determines the level of fault tolerance the cluster has in the event of a node failure.
+It also dictates the theoretical limit on the number of concurrent search requests that can be processed under heavy load.
+
+=== Leaders and Followers
+
+Among the replicas for a given shard, one replica is designated as the _leader_.
+The leader serves as the source of truth for its shard.
+When document updates are made, they are first processed by the leader replica and then propagated to the other replicas (the exact mechanism varies by cluster mode).
+
+The replicas which are not leaders are called _followers_.
+
+=== Cores
+
+In Solr's implementation, each replica is represented as a _core_.
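To make the shard concept above concrete, here is a simplified sketch of hash-based document routing in Python. This is an illustration only: Solr's actual `compositeId` router assigns documents to shards by MurmurHash3 hash ranges, not by the modulo arithmetic shown here, and the document ids are made up.

```python
# Simplified illustration of hash-based shard routing.
# NOTE: Solr's real compositeId router uses MurmurHash3 hash ranges,
# not modulo arithmetic; this sketch only conveys the idea that a
# document's id deterministically selects exactly one shard.
import hashlib

NUM_SHARDS = 2  # matches the 2-shard example in the text

def shard_for(doc_id: str) -> str:
    """Map a document id to one shard of the collection."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return f"shard{int(digest, 16) % NUM_SHARDS + 1}"

for doc_id in ["SOLR-1001", "SOLR-1002", "SOLR-1003"]:
    # Every document lands on exactly one shard, deterministically.
    print(doc_id, "->", shard_for(doc_id))
```

Because the mapping is deterministic, a query for a given id can be answered by a single shard, while a full-collection search fans out to all shards in parallel.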
+The term "core" is primarily an internal implementation detail—when you create a replica, Solr creates a core to represent it.
+Multiple cores can be hosted on any one node.
+
+NOTE: The term "core" can be confusing because in everyday English it implies something central and singular, but in Solr it refers to one of potentially many replicas distributed across the cluster.
+In most contexts, thinking of "core" as synonymous with "replica" will help clarify discussions about Solr's architecture.
+
+=== Collections and Indexes
+
+A _collection_ is the complete logical set of searchable documents that share a schema and configuration.
+In SolrCloud mode (described below), a collection encompasses all the shards and their replicas.
+
+An _index_ refers to the physical data structures written to disk by Apache Lucene.
+Each core (replica) maintains exactly one Lucene index on disk, containing the actual inverted indexes, stored fields, and other data structures that enable search.
+
+This creates a clear hierarchy from logical concepts to physical storage:
+
+[source,text]
+----
+Collection (logical grouping of all searchable documents)
+ ├─> Shard 1 (logical partition)
+ │    ├─> Replica 1 / Core 1 (physical instance)
+ │    │    └─> Lucene Index (disk structures)
+ │    └─> Replica 2 / Core 2 (physical instance)
+ │         └─> Lucene Index (disk structures)
+ └─> Shard 2 (logical partition)
+      ├─> Replica 1 / Core 3 (physical instance)
+      │    └─> Lucene Index (disk structures)
+      └─> Replica 2 / Core 4 (physical instance)
+           └─> Lucene Index (disk structures)
+----
+
+In this example, a collection is divided into 2 shards, each shard has 2 replicas for redundancy, and each replica maintains its own Lucene index on disk.
+
+== SolrCloud Mode
+
+SolrCloud mode (also called "SolrCloud") uses Apache ZooKeeper to provide the centralized cluster management that is its main feature.
+ZooKeeper tracks each node of the cluster and the state of each core on each node.
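As a loose illustration of the kind of per-collection state ZooKeeper tracks, here is a sketch in Python. The nested-dict shape, collection name, node addresses, and replica names below are all assumptions for illustration; they do not reproduce Solr's actual `state.json` format.

```python
# Illustrative sketch (assumed shape, NOT Solr's actual state.json format)
# of the cluster state ZooKeeper tracks: which replicas (cores) exist for
# each shard, which node hosts each one, and which replica is the leader.
cluster_state = {
    "techproducts": {                     # hypothetical collection name
        "shard1": {
            "core_node1": {"node": "10.0.0.1:8983", "leader": True},
            "core_node2": {"node": "10.0.0.2:8983", "leader": False},
        },
        "shard2": {
            "core_node3": {"node": "10.0.0.1:8983", "leader": False},
            "core_node4": {"node": "10.0.0.2:8983", "leader": True},
        },
    }
}

def leader_of(collection: str, shard: str) -> str:
    """Return the name of the leader replica for a given shard."""
    replicas = cluster_state[collection][shard]
    return next(name for name, r in replicas.items() if r["leader"])

print(leader_of("techproducts", "shard1"))  # -> core_node1
```

Because this state lives centrally rather than on each node, every node (and every client that reads it) shares one consistent view of where each shard's replicas live and which one currently leads.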
+
+In this mode, configuration files are stored in ZooKeeper and not on the file system of each node.
+When configuration changes are made, they must be uploaded to ZooKeeper, which in turn ensures each node learns of the changes.
+
+SolrCloud manages collections as first-class entities.
+A collection represents the entire group of shards and replicas that together provide access to a corpus of documents.
+All replicas in a collection share the same configuration (schema, `solrconfig.xml`, etc.).
+This centralization of cluster management means that operations can be performed on the entire collection at one time.
+
+When changes are made to configurations, a single command to reload the collection will automatically reload each individual core (replica) that is a member of the collection.

Review Comment:
   The word "reload" by itself is ambiguous as to the scope of what is being reloaded. Even simply adding "core" or "collection" to it is still ambiguous. My first thought, and my instinctive thought, is that this has to do with the *data*. *If* the word "configuration" were to be tacked on here... _then_ that would bring great clarity.
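On the reload discussion above: in SolrCloud, reloading a collection's configuration is done through the Collections API `RELOAD` action. Here is a minimal sketch of building that request URL; the host and collection name are hypothetical placeholders.

```python
# Sketch of building a Collections API RELOAD request, which re-reads
# configuration for every replica (core) in a collection.
# The base URL and collection name below are placeholder assumptions.
from urllib.parse import urlencode

def reload_collection_url(base: str, collection: str) -> str:
    """Build the Collections API URL that reloads a collection's configuration."""
    params = urlencode({"action": "RELOAD", "name": collection})
    return f"{base}/admin/collections?{params}"

url = reload_collection_url("http://localhost:8983/solr", "techproducts")
print(url)
# -> http://localhost:8983/solr/admin/collections?action=RELOAD&name=techproducts
```

Note that this reloads configuration only; it is a different operation from re-indexing or replicating the collection's data, which is exactly the ambiguity the review comment calls out.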
