Hi Jason,

Good questions! Note there is a related question on the users list, see the thread "Node roles vs SIP-20 Separation of Compute and Storage" <https://lists.apache.org/thread/ox7xbl1hd2j87ccvlyjho4kqqv2jnfmc>.
When I share the code (later this week in all likelihood) I will share a detailed write-up. I'm open to discussing/presenting/writing more as needed (and I plan to attend the Thursday 18/1 Meetup <https://cwiki.apache.org/confluence/display/SOLR/2024-01-18+Meeting+notes> to present the SIP).

To your questions:

- The name "Zero": "Shared" was too broad (and already used), "Blob" was already used as well in Solr code, and we didn't find a better name. The name can be changed, though there are a couple thousand occurrences of it in different forms in the code (I know because "Zero" is not the name we use internally at Salesforce for that replica type; I've renamed everything for sharing here). The reason for "Zero" is that there is only one instance (zero copies) in persistent storage for each shard, and it also evokes the (longer term) option of having zero (i.e. no) replicas for a shard on the cluster: pick a node, load and materialize a replica when needed. But that's longer term.

- Your understanding is correct: currently it's a kind of new type of collection, given that all replicas of a collection are ZERO (or none are). ZERO could have been a new type of shard, as one could imagine different shards having different types (but shards do not have types), or a given shard could be allowed to have ZERO replicas as well as other types of replicas (this is not supported in the current implementation for simplicity, and we didn't really feel the need). If collectively we think that PULL + ZERO makes sense, why not: PULL would then fetch its content from ZERO replicas rather than directly from the shared storage. I don't see ZERO coexisting with NRT or TLOG in a given shard though. Currently, since the implementation forces all replicas of all shards of a "Zero" collection to be ZERO, there is a flag on the collection saying it is a Zero collection.

- How the compute behaves: with the implementation in its current state, an indexing batch (as sent by the client) is processed by the leader replica, committed to local disk, then written to the remote storage (S3) before SolrCloud acknowledges the batch. The transaction log is not used and a commit is forced. This slows down indexing, but a design guideline from the start was that the SolrCloud nodes are stateless (they can restart with an empty disk) to simplify managing elasticity (and, in a public cloud setting, Availability Zone failures). The next evolution step of this code, which we plan to start upstream once the branch has been shared, is to enable the transaction log for ZERO replicas and have indexing behave more like normal SolrCloud: persist the transaction log before replying to the client, and do not force a commit. The transaction log would be moved to shared storage as well, with a single copy per shard rather than one log per node/replica. All replicas of the shard would, when becoming leader, access the same transaction log. Some of the logic used for implementing the shared storage for the index segments will be reused.

For serving queries, the non-leader replicas (also of type ZERO) update themselves from the shared storage directly. They behave mostly like PULL replicas (except the data doesn't come from the leader but from the shared storage), but they can become leader because the shared storage is the "source of truth": by reading all the data present there, any replica can bring itself up to date with all acknowledged updates and become leader (there is protection so that two replicas that think they're leader at the same time do not overwrite each other; I can describe this on another occasion). Currently the (basic) approach is that a replica checks whether it is up to date as long as it's getting queries. If it is already up to date, the cost of that check is a ZooKeeper node read. If it is not up to date, it then fetches the updated content from the shared storage. Other strategies (check less often, check while not getting queries, etc.) are easy to implement. The updates are currently done asynchronously so they do not delay queries (which serve the "previous" content, like normal SolrCloud replication does).
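To make the indexing path above more concrete, here is a very rough Java sketch of the ordering of operations on the leader. The interfaces and names are invented purely for illustration (they are not the classes in the branch); the shared store stands in for S3:

// Illustrative sketch of the current ZERO indexing path on the shard leader.
// Nothing here is the actual SIP-20 code; all names are made up.
import java.io.IOException;
import java.util.Collection;
import java.util.List;

interface LocalIndex {
  void addDocuments(Collection<Object> docs) throws IOException;
  void hardCommit() throws IOException;                      // commit is forced, no transaction log today
  List<String> newSegmentFilesSinceLastPush();
}

interface SharedStore {                                       // e.g. backed by S3
  void pushFiles(String shardName, List<String> files) throws IOException;
  void writeShardMetadata(String shardName, long generation) throws IOException;
}

class ZeroLeaderIndexingSketch {
  private final LocalIndex localIndex;
  private final SharedStore sharedStore;

  ZeroLeaderIndexingSketch(LocalIndex localIndex, SharedStore sharedStore) {
    this.localIndex = localIndex;
    this.sharedStore = sharedStore;
  }

  /** Handles one client batch; the client is acknowledged only once the data is durable on the shared store. */
  void processBatch(String shardName, Collection<Object> docs, long newGeneration) throws IOException {
    localIndex.addDocuments(docs);                            // 1. index locally on the leader
    localIndex.hardCommit();                                  // 2. forced commit to local disk
    sharedStore.pushFiles(shardName,
        localIndex.newSegmentFilesSinceLastPush());           // 3. upload the new segment files
    sharedStore.writeShardMetadata(shardName, newGeneration); // 4. publish the new shard state
    // 5. only now does SolrCloud acknowledge the batch to the client
  }
}

The only important part is the ordering: acknowledgement happens after the segments and the shard metadata are durable on the shared store, which is what lets nodes restart with an empty disk.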
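Similarly, here is a rough sketch of the query-side freshness check on a non-leader ZERO replica. Again the names are purely illustrative; it assumes the shard's latest published state can be obtained with a single ZooKeeper node read and that the refresh runs asynchronously:

// Illustrative sketch only: how a non-leader ZERO replica might check freshness
// on the query path. All names are invented; this is not the SIP-20 code.
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface ShardStateReader {                 // e.g. a single ZooKeeper node read
  long latestPublishedGeneration(String shardName) throws IOException;
}

interface SegmentFetcher {                   // pulls missing segment files from shared storage
  void syncToGeneration(String shardName, long generation) throws IOException;
}

class ZeroReplicaQuerySketch {
  private final ShardStateReader stateReader;
  private final SegmentFetcher fetcher;
  private final ExecutorService refresher = Executors.newSingleThreadExecutor();
  private volatile long localGeneration = 0;

  ZeroReplicaQuerySketch(ShardStateReader stateReader, SegmentFetcher fetcher) {
    this.stateReader = stateReader;
    this.fetcher = fetcher;
  }

  /** Called on the query path: cheap check, asynchronous refresh if behind. */
  void onQuery(String shardName) throws IOException {
    long published = stateReader.latestPublishedGeneration(shardName); // the cost of the check when already up to date
    if (published > localGeneration) {
      // Behind: refresh asynchronously so the current query is served from the
      // "previous" content, as with normal SolrCloud replication.
      refresher.submit(() -> {
        try {
          fetcher.syncToGeneration(shardName, published);
          localGeneration = published;
        } catch (IOException e) {
          // log and retry on a later query (omitted in this sketch)
        }
      });
    }
    // ... execute the query against the locally available index ...
  }
}

The other strategies mentioned above (check less often, check while not getting queries) would only change where and how often latestPublishedGeneration is consulted.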
I'd be happy to discuss and explain this in more detail: briefly tomorrow, then more once I've shared the source code branch and the design doc. I'd of course also be happy to follow up here with any questions, so don't hesitate to ask!

Ilan

On Wed, Jan 17, 2024 at 8:30 PM Jason Gerlowski <[email protected]> wrote:

> Hey Ilan,
>
> Thanks for putting together this writeup. I think I understand the goal
> conceptually, and it sounds like a good one for Solr! But I'm still having
> trouble understanding how this all would actually work. So a few
> questions, inline:
>
> > A fourth replica type called ZERO is introduced
>
> Why the name "Zero"? Is it conveying something about the design that I'm
> not picking up on?
>
> > At Collection creation time, it is possible to specify that the
> > collection exclusively uses replicas of type ZERO rather than being a
> > “normal” collection that uses NRT/TLOG/PULL.
>
> Am I correct in understanding this to mean that if "zero" is used, it must
> be used for every replica in the collection? If so, it almost sounds like
> this isn't a new type of replica but a new "collection type" altogether?
>
> > This allows scaling compute (more queries, more indexing) independently
> > of storage
>
> I think the biggest question I have is: how does the "compute" side of this
> actually work?
>
> On the indexing side: what all happens in Solr before giving a response
> back to users? What happens on a commit? Are updates indexed only on the
> leader (like TLOG/PULL) or on all replicas (like NRT), or some other
> arrangement altogether?
>
> On the querying side: what situations cause index data to be pulled from
> the remote store?
>
> (These last questions might be a bit lengthy to get into via email, but
> they should probably be in the writeup? Not sure what's best there...)
>
> Best,
>
> Jason
>
> On Sat, Jan 13, 2024 at 9:15 PM Ishan Chattopadhyaya <[email protected]> wrote:
>
> > +1, thanks for the contribution Ilan! Looking forward to seeing this coming
> > to fruition.
> >
> > On Sun, 14 Jan 2024 at 03:40, Ilan Ginzburg <[email protected]> wrote:
> >
> > > I have created SIP-20
> > >
> > > https://cwiki.apache.org/confluence/display/SOLR/SIP-20%3A+Separation+of+Compute+and+Storage+in+SolrCloud
> > >
> > > In the next few days I will create a Jira + a branch that implements
> > > the SIP proposal and that includes documentation on how to approach
> > > that branch and what's in it.
> > >
> > > This proposed contribution is based on work done at Salesforce these
> > > last few years and currently running at scale in multiple regions.
> > >
> > > Thanks,
> > > Ilan
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
