Thanks for the detailed response! I've got a few follow-ups that I'll ask in line, but I think I've got the core idea now, thanks!
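Before the inline questions, here's my rough mental model of the indexing
path from your description. Every class and method name below is invented
just for illustration (nothing from your branch), so please correct me where
I'm off:

    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.List;

    interface LocalIndex {             // stand-in for the Lucene index on the node
      void addDocs(List<String> docs) throws IOException;
      List<Path> hardCommit() throws IOException;  // forced commit, returns new segment files
    }

    interface SharedStore {            // stand-in for the S3-backed shard store
      void push(String shard, List<Path> newSegments) throws IOException;
    }

    class ZeroLeaderSketch {
      // No transaction log: the batch only becomes durable once it is on
      // shared storage, so the client ack waits for the S3 write.
      static void handleBatch(String shard, List<String> docs,
                              LocalIndex index, SharedStore store) throws IOException {
        index.addDocs(docs);                   // 1. index locally on the leader
        List<Path> segs = index.hardCommit();  // 2. forced hard commit to local disk
        store.push(shard, segs);               // 3. write the new segments to S3
        // 4. only now acknowledge the batch to the client
      }
    }

With that picture in mind: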
> With the implementation in its current state, an indexing batch (as sent
> by the client) is processed by the leader replica, committed to local disk
> then written into the remote storage (S3) before SolrCloud acknowledges the
> batch. Transaction log is not used and the commit is forced.

To clarify - since you mentioned durability, I'm assuming the commit that
happens on each update batch is a "hard commit"? If the user made an explicit
hard-commit or had an autoCommit configured, would that be a no-op for ZERO
collections? How do soft-commits behave - I imagine they're handled similarly
to whatever is done for PULL replicas currently?

> Currently the (basic) approach is that a replica checks if it is up to
> date as long as it's getting queries. If it is already up to date, the cost
> of that check is a ZooKeeper node read.

And that ZK read happens on each query? Or does the current strategy have it
check "every X seconds" like how PULL/TLOG replicas check for updates? Is
that ZK check looking at the "shard term" stuff, or would Zero
replicas/collections store some additional state in ZK to manage this?
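On the query path, here's the kind of check I'm picturing (again, every name
below is invented, nothing from your branch):

    import java.util.concurrent.ExecutorService;

    interface ZkReader {                          // one ZooKeeper node read
      long latestShardVersion(String shard);
    }

    interface ZeroReplica {
      long localVersion(String shard);
      Runnable pullFromSharedStore(String shard); // fetch newer segments from S3
    }

    class ZeroQueryPathSketch {
      static void beforeQuery(String shard, ZkReader zk, ZeroReplica replica,
                              ExecutorService puller) {
        // Cheap check: a single ZK read when a query comes in.
        if (zk.latestShardVersion(shard) > replica.localVersion(shard)) {
          // Stale: kick off an async pull from shared storage; this query is
          // still served from the existing (older) searcher, much like normal
          // SolrCloud replication.
          puller.submit(replica.pullFromSharedStore(shard));
        }
      }
    }

Is that roughly the shape of it?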
Best,

Jason

On Wed, Jan 17, 2024 at 8:05 PM Ilan Ginzburg <[email protected]> wrote:

> Hi Jason,
>
> Good questions!
> Note there is a related question on the users list, see thread Node roles
> vs SIP-20 Separation of Compute and Storage
> <https://lists.apache.org/thread/ox7xbl1hd2j87ccvlyjho4kqqv2jnfmc>
>
> When I share the code (later this week in all likelihood) I will share a
> detailed write up. I'm open to discuss/present/write more as needed (and
> plan to attend Thursday 18/1 Meetup
> <https://cwiki.apache.org/confluence/display/SOLR/2024-01-18+Meeting+notes>
> to present the SIP).
>
> To your questions:
>
> - The name "Zero" is because "Shared" was too broad (and already used)
> and "Blob" was already used as well in Solr code, and we didn't find any
> better name (but the name can be changed, there are a couple thousand
> occurrences of it in different forms in the code, I know because "Zero" is
> not the name we use internally at Salesforce for that replica type, I've
> renamed everything for sharing here). The reason for "Zero" is that there
> is only one instance (zero copies) in persistent storage for each shard,
> and also this evokes the (longer term) option to have zero (i.e. no)
> replicas for a shard on the cluster, pick a node, load and materialize a
> replica when needed. But that's longer term.
> - Your understanding is correct, currently it's a kind of new type of
> collection given all replicas are ZERO (or none are) for a collection.
> ZERO could have been a new type of shard, as one could imagine different
> shards having different types (but shards do not have types), or why not
> also allow a given shard to have ZERO replicas as well as other types of
> replicas (this is not supported in the current implementation for
> simplicity and we didn't really feel the need). If collectively we think
> that PULL + ZERO do make sense, why not. PULL would then be fetching its
> content from ZERO replicas rather than directly from the shared storage. I
> don't see ZERO coexisting with NRT or TLOG in a given shard though.
> Currently as the implementation forces all replicas of all shards of a
> "Zero" collection to be ZERO, there is a flag on that collection saying it
> is a Zero collection.
> - How the compute behaves:
> With the implementation in its current state, an indexing batch (as sent
> by the client) is processed by the leader replica, committed to local disk
> then written into the remote storage (S3) before SolrCloud acknowledges
> the batch. Transaction log is not used and the commit is forced. This
> slows down indexing, but a design guideline from the start was that the
> SolrCloud nodes are stateless (can restart with empty disk) to simplify
> managing elasticity (and in a public cloud setting, Availability Zone
> failures).
> The next evolution step of this code that we plan to start upstream once
> the branch has been shared is to enable the transaction log for ZERO
> replicas, and have indexing behave more like a normal SolrCloud: persist
> the transaction log before replying to a client, do not force a commit.
> The transaction log would be changed to be on shared storage as well, with
> a single copy for a shard, not one log per node/replica. All replicas of
> the shard, when they're becoming leaders, would access the same
> transaction log. Some of the logic used for implementing the shared
> storage for the index segments will be reused.
> For serving queries, the non-leader replicas (also of type ZERO) update
> themselves from the shared storage directly. They behave mostly like PULL
> replicas (except the data doesn't come from the leader but from the shared
> storage), but can become leader because the shared storage is the "source
> of truth" and by reading all the data present there, any replica can get
> itself up to date with all acknowledged updates to become leader (there is
> protection so two replicas that think they're leader at the same time do
> not overwrite each other, I can describe this on another occasion).
> Currently the (basic) approach is that a replica checks if it is up to
> date as long as it's getting queries. If it is already up to date, the
> cost of that check is a ZooKeeper node read. If it is not up to date, it
> then fetches the updated content from the shared storage. Other strategies
> (check less often, check while not getting queries, etc.) are easy to
> implement. The updates are currently done asynchronously so do not delay
> the queries (that serve the "previous" content, like normal SolrCloud
> replication does).
>
> I'd be happy to discuss and explain this in more detail. Quickly tomorrow
> then more once I've shared the source code branch and the design doc.
> I'd of course be happy to also follow up here with any questions, so don't
> hesitate to ask!
>
> Ilan
>
> On Wed, Jan 17, 2024 at 8:30 PM Jason Gerlowski <[email protected]>
> wrote:
>
> > Hey Ilan,
> >
> > Thanks for putting together this writeup. I think I understand the goal
> > conceptually, and it sounds like a good one for Solr! But I'm still
> > having trouble understanding how this all would actually work. So a few
> > questions, inline:
> >
> > > A fourth replica type called ZERO is introduced
> >
> > Why the name "Zero"? Is it conveying something about the design that I'm
> > not picking up on?
> >
> > > At Collection creation time, it is possible to specify that the
> > > collection exclusively uses replicas of type ZERO rather than being a
> > > “normal” collection that uses NRT/TLOG/PULL.
> >
> > Am I correct in understanding this to mean that if "zero" is used, it
> > must be used for every replica in the collection?
> > If so, it almost sounds like this isn't a new type of replica but a new
> > "collection type" altogether?
> >
> > > This allows scaling compute (more queries, more indexing) independently
> > > of storage
> >
> > I think the biggest question I have is: how does the "compute" side of
> > this actually work?
> >
> > On the indexing side: what all happens in Solr before giving a response
> > back to users? What happens on a commit? Are updates indexed only on the
> > leader (like TLOG/PULL) or on all replicas (like NRT), or some other
> > arrangement altogether?
> >
> > On the querying side: what situations cause index data to be pulled from
> > the remote store?
> >
> > (These last questions might be a bit lengthy to get into via email, but
> > they should probably be in the writeup? Not sure what's best there...)
> >
> > Best,
> >
> > Jason
> >
> > On Sat, Jan 13, 2024 at 9:15 PM Ishan Chattopadhyaya <
> > [email protected]> wrote:
> >
> > > +1, thanks for the contribution Ilan! Looking forward to seeing this
> > > coming to fruition.
> > >
> > > On Sun, 14 Jan 2024 at 03:40, Ilan Ginzburg <[email protected]> wrote:
> > >
> > > > I have created SIP-20
> > > >
> > > > https://cwiki.apache.org/confluence/display/SOLR/SIP-20%3A+Separation+of+Compute+and+Storage+in+SolrCloud
> > > >
> > > > In the next few days I will create a Jira + a branch that implements
> > > > the SIP proposal and that includes documentation on how to approach
> > > > that branch and what's in it.
> > > >
> > > > This proposed contribution is based on work done at Salesforce these
> > > > last few years and currently running at scale in multiple regions.
> > > >
> > > > Thanks,
> > > > Ilan
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: [email protected]
> > > >
> > >
