Thanks for the detailed response! I've got a few follow-ups that I'll ask in line, but I think I've got the core idea now, thanks!
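Before the inline questions, here's my rough mental model of the indexing
path from your description. Every class and method name below is invented
just for illustration (nothing from your branch), so please correct me where
I'm off:

    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.List;

    interface LocalIndex {             // stand-in for the Lucene index on the node
      void addDocs(List<String> docs) throws IOException;
      List<Path> hardCommit() throws IOException;  // forced commit, returns new segment files
    }

    interface SharedStore {            // stand-in for the S3-backed shard store
      void push(String shard, List<Path> newSegments) throws IOException;
    }

    class ZeroLeaderSketch {
      // No transaction log: the batch only becomes durable once it is on
      // shared storage, so the client ack waits for the S3 write.
      static void handleBatch(String shard, List<String> docs,
                              LocalIndex index, SharedStore store) throws IOException {
        index.addDocs(docs);                   // 1. index locally on the leader
        List<Path> segs = index.hardCommit();  // 2. forced hard commit to local disk
        store.push(shard, segs);               // 3. write the new segments to S3
        // 4. only now acknowledge the batch to the client
      }
    }

With that picture in mind: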
> With the implementation in its current state, an indexing batch (as sent
> by the client) is processed by the leader replica, committed to local disk
> then written into the remote storage (S3) before SolrCloud acknowledges the
> batch. Transaction log is not used and the commit is forced.

To clarify - since you mentioned durability, I'm assuming the commit that
happens on each update batch is a "hard commit"? If the user made an explicit
hard-commit or had an autoCommit configured, would that be a no-op for ZERO
collections? How do soft-commits behave - I imagine they're handled similarly
to whatever is done for PULL replicas currently?

> Currently the (basic) approach is that a replica checks if it is up to
> date as long as it's getting queries. If it is already up to date, the cost
> of that check is a ZooKeeper node read.

And that ZK read happens on each query? Or does the current strategy have it
check "every X seconds" like how PULL/TLOG replicas check for updates? Is
that ZK check looking at the "shard term" stuff, or would Zero
replicas/collections store some additional state in ZK to manage this?
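On the query path, here's the kind of check I'm picturing (again, every name
below is invented, nothing from your branch):

    import java.util.concurrent.ExecutorService;

    interface ZkReader {                          // one ZooKeeper node read
      long latestShardVersion(String shard);
    }

    interface ZeroReplica {
      long localVersion(String shard);
      Runnable pullFromSharedStore(String shard); // fetch newer segments from S3
    }

    class ZeroQueryPathSketch {
      static void beforeQuery(String shard, ZkReader zk, ZeroReplica replica,
                              ExecutorService puller) {
        // Cheap check: a single ZK read when a query comes in.
        if (zk.latestShardVersion(shard) > replica.localVersion(shard)) {
          // Stale: kick off an async pull from shared storage; this query is
          // still served from the existing (older) searcher, much like normal
          // SolrCloud replication.
          puller.submit(replica.pullFromSharedStore(shard));
        }
      }
    }

Is that roughly the shape of it?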
Best,

Jason

On Wed, Jan 17, 2024 at 8:05 PM Ilan Ginzburg <[email protected]> wrote:

> Hi Jason,
>
> Good questions!
> Note there is a related question on the users list, see thread Node roles
> vs SIP-20 Separation of Compute and Storage
> <https://lists.apache.org/thread/ox7xbl1hd2j87ccvlyjho4kqqv2jnfmc>
>
> When I share the code (later this week in all likelihood) I will share a
> detailed write up. I'm open to discuss/present/write more as needed (and
> plan to attend Thursday 18/1 Meetup
> <https://cwiki.apache.org/confluence/display/SOLR/2024-01-18+Meeting+notes>
> to present the SIP).
>
> To your questions:
>
> - The name "Zero" is because "Shared" was too broad (and already used)
> and "Blob" was already used as well in Solr code, and we didn't find any
> better name (but the name can be changed, there are a couple thousand
> occurrences of it in different forms in the code, I know because "Zero" is
> not the name we use internally at Salesforce for that replica type, I've
> renamed everything for sharing here). The reason for "Zero" is that there
> is only one instance (zero copies) in persistent storage for each shard,
> and also this evokes the (longer term) option to have zero (i.e. no)
> replicas for a shard on the cluster, pick a node, load and materialize a
> replica when needed. But that's longer term.
> - Your understanding is correct, currently it's a kind of new type of
> collection given all replicas are ZERO (or none are) for a collection.
> ZERO could have been a new type of shard, as one could imagine different
> shards having different types (but shards do not have types), or why not
> also allow a given shard to have ZERO replicas as well as other types of
> replicas (this is not supported in the current implementation for
> simplicity and we didn't really feel the need). If collectively we think
> that PULL + ZERO do make sense, why not. PULL would then be fetching its
> content from ZERO replicas rather than directly from the shared storage. I
> don't see ZERO coexisting with NRT or TLOG in a given shard though.
> Currently as the implementation forces all replicas of all shards of a
> "Zero" collection to be ZERO, there is a flag on that collection saying it
> is a Zero collection.
> - How the compute behaves:
> With the implementation in its current state, an indexing batch (as sent
> by the client) is processed by the leader replica, committed to local disk
> then written into the remote storage (S3) before SolrCloud acknowledges
> the batch. Transaction log is not used and the commit is forced. This
> slows down indexing, but a design guideline from the start was that the
> SolrCloud nodes are stateless (can restart with empty disk) to simplify
> managing elasticity (and in a public cloud setting, Availability Zone
> failures).
> The next evolution step of this code that we plan to start upstream once
> the branch has been shared is to enable the transaction log for ZERO
> replicas, and have indexing behave more like a normal SolrCloud: persist
> the transaction log before replying to a client, do not force a commit.
> The transaction log would be changed to be on shared storage as well, with
> a single copy for a shard, not one log per node/replica. All replicas of
> the shard, when they're becoming leaders, would access the same
> transaction log. Some of the logic used for implementing the shared
> storage for the index segments will be reused.
> For serving queries, the non-leader replicas (also of type ZERO) update
> themselves from the shared storage directly. They behave mostly like PULL
> replicas (except the data doesn't come from the leader but from the shared
> storage), but can become leader because the shared storage is the "source
> of truth" and by reading all the data present there, any replica can get
> itself up to date with all acknowledged updates to become leader (there is
> protection so two replicas that think they're leader at the same time do
> not overwrite each other, I can describe this on another occasion).
> Currently the (basic) approach is that a replica checks if it is up to
> date as long as it's getting queries. If it is already up to date, the
> cost of that check is a ZooKeeper node read. If it is not up to date, it
> then fetches the updated content from the shared storage. Other strategies
> (check less often, check while not getting queries, etc.) are easy to
> implement. The updates are currently done asynchronously so do not delay
> the queries (that serve the "previous" content, like normal SolrCloud
> replication does).
>
> I'd be happy to discuss and explain this in more detail. Quickly tomorrow
> then more once I've shared the source code branch and the design doc.
> I'd of course be happy to also follow up here with any questions, so don't
> hesitate to ask!
>
> Ilan
>
> On Wed, Jan 17, 2024 at 8:30 PM Jason Gerlowski <[email protected]>
> wrote:
>
> > Hey Ilan,
> >
> > Thanks for putting together this writeup. I think I understand the goal
> > conceptually, and it sounds like a good one for Solr! But I'm still
> > having trouble understanding how this all would actually work. So a few
> > questions, inline:
> >
> > > A fourth replica type called ZERO is introduced
> >
> > Why the name "Zero"? Is it conveying something about the design that I'm
> > not picking up on?
> >
> > > At Collection creation time, it is possible to specify that the
> > > collection exclusively uses replicas of type ZERO rather than being a
> > > “normal” collection that uses NRT/TLOG/PULL.
> >
> > Am I correct in understanding this to mean that if "zero" is used, it
> > must be used for every replica in the collection?
> > If so, it almost sounds like this isn't a new type of replica but a new
> > "collection type" altogether?
> >
> > > This allows scaling compute (more queries, more indexing) independently
> > > of storage
> >
> > I think the biggest question I have is: how does the "compute" side of
> > this actually work?
> >
> > On the indexing side: what all happens in Solr before giving a response
> > back to users? What happens on a commit? Are updates indexed only on the
> > leader (like TLOG/PULL) or on all replicas (like NRT), or some other
> > arrangement altogether?
> >
> > On the querying side: what situations cause index data to be pulled from
> > the remote store?
> >
> > (These last questions might be a bit lengthy to get into via email, but
> > they should probably be in the writeup? Not sure what's best there...)
> >
> > Best,
> >
> > Jason
> >
> > On Sat, Jan 13, 2024 at 9:15 PM Ishan Chattopadhyaya <
> > [email protected]> wrote:
> >
> > > +1, thanks for the contribution Ilan! Looking forward to seeing this
> > > coming to fruition.
> > >
> > > On Sun, 14 Jan 2024 at 03:40, Ilan Ginzburg <[email protected]> wrote:
> > >
> > > > I have created SIP-20
> > > >
> > > > https://cwiki.apache.org/confluence/display/SOLR/SIP-20%3A+Separation+of+Compute+and+Storage+in+SolrCloud
> > > >
> > > > In the next few days I will create a Jira + a branch that implements
> > > > the SIP proposal and that includes documentation on how to approach
> > > > that branch and what's in it.
> > > >
> > > > This proposed contribution is based on work done at Salesforce these
> > > > last few years and currently running at scale in multiple regions.
> > > >
> > > > Thanks,
> > > > Ilan
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: [email protected]
> > > >
> > >
