Hi Jason,

Good questions! Note there is a related question on the users list, see the thread "Node roles vs SIP-20 Separation of Compute and Storage" <https://lists.apache.org/thread/ox7xbl1hd2j87ccvlyjho4kqqv2jnfmc>.
When I share the code (later this week in all likelihood) I will share a detailed write-up. I'm open to discussing/presenting/writing more as needed (and I plan to attend the Thursday 18/1 Meetup <https://cwiki.apache.org/confluence/display/SOLR/2024-01-18+Meeting+notes> to present the SIP).

To your questions:

- The name "Zero": "Shared" was too broad (and already used), "Blob" was already used as well in Solr code, and we didn't find a better name. The name can be changed, though there are a couple thousand occurrences of it in different forms in the code (I know because "Zero" is not the name we use internally at Salesforce for that replica type; I've renamed everything for sharing here). The reason for "Zero" is that there is only one instance (zero copies) in persistent storage for each shard, and it also evokes the (longer term) option of having zero (i.e. no) replicas for a shard on the cluster: pick a node, load and materialize a replica when needed. But that's longer term.

- Your understanding is correct: currently it's a kind of new type of collection, given that all replicas of a collection are ZERO (or none are). ZERO could have been a new type of shard, as one could imagine different shards having different types (but shards do not have types), or a given shard could be allowed to have ZERO replicas as well as other types of replicas (this is not supported in the current implementation for simplicity, and we didn't really feel the need). If collectively we think that PULL + ZERO makes sense, why not: PULL would then fetch its content from ZERO replicas rather than directly from the shared storage. I don't see ZERO coexisting with NRT or TLOG in a given shard though. Currently, since the implementation forces all replicas of all shards of a "Zero" collection to be ZERO, there is a flag on the collection saying it is a Zero collection.

- How the compute behaves: with the implementation in its current state, an indexing batch (as sent by the client) is processed by the leader replica, committed to local disk, then written to the remote storage (S3) before SolrCloud acknowledges the batch. The transaction log is not used and a commit is forced. This slows down indexing, but a design guideline from the start was that the SolrCloud nodes are stateless (they can restart with an empty disk) to simplify managing elasticity (and, in a public cloud setting, Availability Zone failures). The next evolution step of this code, which we plan to start upstream once the branch has been shared, is to enable the transaction log for ZERO replicas and have indexing behave more like normal SolrCloud: persist the transaction log before replying to the client, and do not force a commit. The transaction log would be moved to shared storage as well, with a single copy per shard rather than one log per node/replica. All replicas of the shard would, when becoming leader, access the same transaction log. Some of the logic used for implementing the shared storage for the index segments will be reused.

For serving queries, the non-leader replicas (also of type ZERO) update themselves from the shared storage directly. They behave mostly like PULL replicas (except the data doesn't come from the leader but from the shared storage), but they can become leader because the shared storage is the "source of truth": by reading all the data present there, any replica can bring itself up to date with all acknowledged updates and become leader (there is protection so that two replicas that think they're leader at the same time do not overwrite each other; I can describe this on another occasion). Currently the (basic) approach is that a replica checks whether it is up to date as long as it's getting queries. If it is already up to date, the cost of that check is a ZooKeeper node read. If it is not up to date, it then fetches the updated content from the shared storage. Other strategies (check less often, check while not getting queries, etc.) are easy to implement. The updates are currently done asynchronously so they do not delay queries (which serve the "previous" content, like normal SolrCloud replication does).
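To make the indexing path above more concrete, here is a very rough Java sketch of the ordering of operations on the leader. The interfaces and names are invented purely for illustration (they are not the classes in the branch); the shared store stands in for S3:

// Illustrative sketch of the current ZERO indexing path on the shard leader.
// Nothing here is the actual SIP-20 code; all names are made up.
import java.io.IOException;
import java.util.Collection;
import java.util.List;

interface LocalIndex {
  void addDocuments(Collection<Object> docs) throws IOException;
  void hardCommit() throws IOException;                      // commit is forced, no transaction log today
  List<String> newSegmentFilesSinceLastPush();
}

interface SharedStore {                                       // e.g. backed by S3
  void pushFiles(String shardName, List<String> files) throws IOException;
  void writeShardMetadata(String shardName, long generation) throws IOException;
}

class ZeroLeaderIndexingSketch {
  private final LocalIndex localIndex;
  private final SharedStore sharedStore;

  ZeroLeaderIndexingSketch(LocalIndex localIndex, SharedStore sharedStore) {
    this.localIndex = localIndex;
    this.sharedStore = sharedStore;
  }

  /** Handles one client batch; the client is acknowledged only once the data is durable on the shared store. */
  void processBatch(String shardName, Collection<Object> docs, long newGeneration) throws IOException {
    localIndex.addDocuments(docs);                            // 1. index locally on the leader
    localIndex.hardCommit();                                  // 2. forced commit to local disk
    sharedStore.pushFiles(shardName,
        localIndex.newSegmentFilesSinceLastPush());           // 3. upload the new segment files
    sharedStore.writeShardMetadata(shardName, newGeneration); // 4. publish the new shard state
    // 5. only now does SolrCloud acknowledge the batch to the client
  }
}

The only important part is the ordering: acknowledgement happens after the segments and the shard metadata are durable on the shared store, which is what lets nodes restart with an empty disk.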
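Similarly, here is a rough sketch of the query-side freshness check on a non-leader ZERO replica. Again the names are purely illustrative; it assumes the shard's latest published state can be obtained with a single ZooKeeper node read and that the refresh runs asynchronously:

// Illustrative sketch only: how a non-leader ZERO replica might check freshness
// on the query path. All names are invented; this is not the SIP-20 code.
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface ShardStateReader {                 // e.g. a single ZooKeeper node read
  long latestPublishedGeneration(String shardName) throws IOException;
}

interface SegmentFetcher {                   // pulls missing segment files from shared storage
  void syncToGeneration(String shardName, long generation) throws IOException;
}

class ZeroReplicaQuerySketch {
  private final ShardStateReader stateReader;
  private final SegmentFetcher fetcher;
  private final ExecutorService refresher = Executors.newSingleThreadExecutor();
  private volatile long localGeneration = 0;

  ZeroReplicaQuerySketch(ShardStateReader stateReader, SegmentFetcher fetcher) {
    this.stateReader = stateReader;
    this.fetcher = fetcher;
  }

  /** Called on the query path: cheap check, asynchronous refresh if behind. */
  void onQuery(String shardName) throws IOException {
    long published = stateReader.latestPublishedGeneration(shardName); // the cost of the check when already up to date
    if (published > localGeneration) {
      // Behind: refresh asynchronously so the current query is served from the
      // "previous" content, as with normal SolrCloud replication.
      refresher.submit(() -> {
        try {
          fetcher.syncToGeneration(shardName, published);
          localGeneration = published;
        } catch (IOException e) {
          // log and retry on a later query (omitted in this sketch)
        }
      });
    }
    // ... execute the query against the locally available index ...
  }
}

The other strategies mentioned above (check less often, check while not getting queries) would only change where and how often latestPublishedGeneration is consulted.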
I'd be happy to discuss and explain this in more detail: briefly tomorrow, then more once I've shared the source code branch and the design doc. I'd of course also be happy to follow up here with any questions, so don't hesitate to ask!

Ilan

On Wed, Jan 17, 2024 at 8:30 PM Jason Gerlowski <[email protected]> wrote:

> Hey Ilan,
>
> Thanks for putting together this writeup. I think I understand the goal
> conceptually, and it sounds like a good one for Solr! But I'm still having
> trouble understanding how this all would actually work. So a few
> questions, inline:
>
> > A fourth replica type called ZERO is introduced
>
> Why the name "Zero"? Is it conveying something about the design that I'm
> not picking up on?
>
> > At Collection creation time, it is possible to specify that the
> > collection exclusively uses replicas of type ZERO rather than being a
> > “normal” collection that uses NRT/TLOG/PULL.
>
> Am I correct in understanding this to mean that if "zero" is used, it must
> be used for every replica in the collection? If so, it almost sounds like
> this isn't a new type of replica but a new "collection type" altogether?
>
> > This allows scaling compute (more queries, more indexing) independently
> > of storage
>
> I think the biggest question I have is: how does the "compute" side of this
> actually work?
>
> On the indexing side: what all happens in Solr before giving a response
> back to users? What happens on a commit? Are updates indexed only on the
> leader (like TLOG/PULL) or on all replicas (like NRT), or some other
> arrangement altogether?
>
> On the querying side: what situations cause index data to be pulled from
> the remote store?
>
> (These last questions might be a bit lengthy to get into via email, but
> they should probably be in the writeup? Not sure what's best there...)
>
> Best,
>
> Jason
>
> On Sat, Jan 13, 2024 at 9:15 PM Ishan Chattopadhyaya <[email protected]> wrote:
>
> > +1, thanks for the contribution Ilan! Looking forward to seeing this coming
> > to fruition.
> >
> > On Sun, 14 Jan 2024 at 03:40, Ilan Ginzburg <[email protected]> wrote:
> >
> > > I have created SIP-20
> > >
> > > https://cwiki.apache.org/confluence/display/SOLR/SIP-20%3A+Separation+of+Compute+and+Storage+in+SolrCloud
> > >
> > > In the next few days I will create a Jira + a branch that implements
> > > the SIP proposal and that includes documentation on how to approach
> > > that branch and what's in it.
> > >
> > > This proposed contribution is based on work done at Salesforce these
> > > last few years and currently running at scale in multiple regions.
> > >
> > > Thanks,
> > > Ilan
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
