>
> To clarify - since you mentioned durability, I'm assuming the commit that
> happens on each update batch is a "hard commit"?
>
Yes.

> If the user made an explicit hard-commit or had an autoCommit configured,
> would that be a no-op for ZERO collections?
>
Yes.


> How do soft-commits behave - I imagine they're handled similarly to
> whatever is done for PULL replicas currently?
>
Currently a hard commit is automatically added after the last document of a
batch, so I'm not sure what would happen if that weren't the case and a soft
commit were issued instead.
In any case, the soft commit would not propagate to ZERO replicas other than
the leader, that's for sure (I assume that when you mention soft commits and
PULL you mean a soft commit on another replica?). The soft commit would also
not write the files to the shared repository. So the batch would likely be
acknowledged to the client but lost if a failure hit the node before the
next (hard) commit (there is no transaction log, nodes are assumed
stateless, so ZERO replicas skip recovery on node startup).

> And that ZK read happens on each query?  Or does the current strategy have
> it check "every X seconds" like how PULL/TLOG replicas check for updates?
>
(About a replica checking with a ZooKeeper read whether it is up to date.)
In the current implementation such a read is enqueued each time there's a
query, but until that read has happened and another query has arrived, there
will not be a second read. So if, for example, a lot of queries arrive in
the same second before the read happens, there is a single read. And if the
cluster is loaded (many replicas get queries and enqueue such ZooKeeper
reads), the reads are not executed right away and no new read is enqueued
until the previous one has completed.
Spacing such reads by some minimum delay would not be a big change.
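
A minimal sketch of that enqueue/deduplicate behavior (class and method names
are made up, not the actual implementation): every query tries to enqueue a
freshness check, but at most one check is in flight at any time, so a burst
of queries results in a single ZooKeeper read.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

final class FreshnessChecker {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private final AtomicBoolean checkPending = new AtomicBoolean(false);

  // Called on every query; cheap when a check is already pending.
  void maybeEnqueueCheck(Runnable zkReadAndMaybeFetch) {
    if (checkPending.compareAndSet(false, true)) {
      executor.submit(() -> {
        try {
          zkReadAndMaybeFetch.run();  // ZK read, fetch segments if stale
        } finally {
          checkPending.set(false);    // allow the next query to enqueue one
        }
      });
    }
    // If a check is already pending, this query piggybacks on it: many
    // queries arriving in the same second lead to a single ZooKeeper read.
  }
}

Adding the minimum delay mentioned above would just be a timestamp comparison
before the compareAndSet.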

> Is that ZK check looking at the "shard term" stuff, or would Zero
> replicas/collections store some additional state in ZK to manage this?

It is not shard terms. Terms are used when a follower replica does not
manage to get an indexing update from the leader. Given that ZERO replicas
do not send indexing updates to other replicas, terms are never increased
and are therefore not used for this replica type.
The value read in ZooKeeper is basically the pointer to the current metadata
file for the shard in the [Backup]Repository (it's called BackupRepository
but the Zero implementation uses it as a general-purpose repository). When
the shard changes in the repository, the shard metadata file changes (you
can think of it as listing the current stored commit point in the
repository). The shard's ZooKeeper node references the current valid shard
metadata file in the repository. A node that has read that value (then the
repository metadata file, then all the segments) can check whether the
ZooKeeper node has changed; if it has not, the node already has the current
version of the shard.
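
A hedged sketch of that check (invented names standing in for the real
classes): the shard's ZooKeeper node holds a pointer to the current shard
metadata file in the repository, and a replica that has already pulled that
version pays only the single ZooKeeper read.

import java.util.Objects;

final class ShardFreshness {
  interface ZkClient { String readShardMetadataPointer(String shard); }
  interface Repo { void pullMetadataAndSegments(String metadataFile); }

  private volatile String lastPulledMetadataFile;  // e.g. "shard1/metadata-42" (hypothetical)

  // Returns true if the replica was already up to date.
  boolean ensureUpToDate(String shard, ZkClient zk, Repo repo) {
    String current = zk.readShardMetadataPointer(shard);  // single ZK read
    if (Objects.equals(current, lastPulledMetadataFile)) {
      return true;   // already serving the current commit point
    }
    repo.pullMetadataAndSegments(current);  // metadata file first, then segments
    lastPulledMetadataFile = current;
    return false;    // was stale, now refreshed
  }
}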

Ilan

On Mon, Jan 22, 2024 at 10:20 PM Jason Gerlowski <gerlowsk...@gmail.com>
wrote:

> Thanks for the detailed response!  I've got a few follow-ups that I'll ask
> in line, but I think I've got the core idea now, thanks!
>
> > With the implementation in its current state, an indexing batch (as sent
> > by the client) is processed by the leader replica, committed to local
> > disk then written into the remote storage (S3) before SolrCloud
> > acknowledges the batch. Transaction log is not used and the commit is
> > forced.
>
> To clarify - since you mentioned durability, I'm assuming the commit that
> happens on each update batch is a "hard commit"?
>
> If the user made an explicit hard-commit or had an autoCommit configured,
> would that be a no-op for ZERO collections?
>
> How do soft-commits behave - I imagine they're handled similarly to
> whatever is done for PULL replicas currently?
>
> > Currently the (basic) approach is that a replica checks if it is up to
> > date as long as it's getting queries. If it is already up to date, the
> > cost of that check is a ZooKeeper node read.
>
> And that ZK read happens on each query?  Or does the current strategy have
> it check "every X seconds" like how PULL/TLOG replicas check for updates?
>
> Is that ZK check looking at the "shard term" stuff, or would Zero
> replicas/collections store some additional state in ZK to manage this?
>
> Best,
>
> Jason
>
> On Wed, Jan 17, 2024 at 8:05 PM Ilan Ginzburg <ilans...@gmail.com> wrote:
>
> > Hi Jason,
> >
> > Good questions!
> > Note there is a related question on the users list, see thread Node roles
> > vs SIP-20 Separation of Compute and Storage
> > <https://lists.apache.org/thread/ox7xbl1hd2j87ccvlyjho4kqqv2jnfmc>
> >
> > When I share the code (later this week in all likelihood) I will share a
> > detailed write up. I'm open to discuss/present/write more as needed (and
> > plan to attend Thursday 18/1 Meetup
> > <https://cwiki.apache.org/confluence/display/SOLR/2024-01-18+Meeting+notes>
> > to present the SIP).
> >
> > To your questions:
> >
> >    - The name "Zero" is because "Shared" was too broad (and already used)
> >    and "Blob" was already used as well in Solr code, and we didn't find
> any
> >    better name (but the name can be changed, there are a couple thousand
> >    occurrences of it in different forms in the code, I know because
> "Zero"
> > is
> >    not the name we use internally at Salesforce for that replica type,
> I've
> >    renamed everything for sharing here). The reason for "Zero" is that
> > there
> >    is only one instance (zero copies) in persistent storage for each
> shard,
> >    and also this evoques the (longer term) option to have zero (i.e. no)
> >    replicas for a shard on the cluster, pick a node, load and
> materialize a
> >    replica when needed. But that's longer term.
> >    - Your understanding is correct, currently it's a kind of new type of
> >    collection given all replicas are ZERO (or none are) for a collection.
> >    ZERO could have been a new type of shard, as one could imagine
> >    different shards having different types (but shards do not have
> >    types), or why not also allow a given shard to have ZERO replicas as
> >    well as other types of replicas (this is not supported in the current
> >    implementation for simplicity and we didn't really feel the need). If
> >    collectively we think that PULL + ZERO do make sense, why not. PULL
> >    would then be fetching its content from ZERO replicas rather than
> >    directly from the shared storage. I don't see ZERO coexisting with NRT
> >    or TLOG in a given shard though. Currently, as the implementation
> >    forces all replicas of all shards of a "Zero" collection to be ZERO,
> >    there is a flag on that collection saying it is a Zero collection.
> >    - How the compute behaves:
> >    With the implementation in its current state, an indexing batch (as
> >    sent by the client) is processed by the leader replica, committed to
> >    local disk then written into the remote storage (S3) before SolrCloud
> >    acknowledges the batch. Transaction log is not used and the commit is
> >    forced. This slows down indexing, but a design guideline from the
> >    start was that the SolrCloud nodes are stateless (can restart with
> >    empty disk) to simplify managing elasticity (and in a public cloud
> >    setting, Availability Zone failures).
> >    The next evolution step of this code that we plan to start upstream
> >    once the branch has been shared is to enable the transaction log for
> >    ZERO replicas, and have indexing behave more like a normal SolrCloud:
> >    persist the transaction log before replying to a client, do not force
> >    a commit. The transaction log would be changed to be on shared storage
> >    as well, with a single copy for a shard, not one log per node/replica.
> >    All replicas of the shard, when they're becoming leaders, would access
> >    the same transaction log. Some of the logic used for implementing the
> >    shared storage for the index segments will be reused.
> >    For serving queries, the non-leader replicas (also of type ZERO)
> >    update themselves from the shared storage directly. They behave mostly
> >    like PULL replicas (except the data doesn't come from the leader but
> >    from the shared storage), but can become leader because the shared
> >    storage is the "source of truth" and by reading all the data present
> >    there, any replica can get itself up to date with all acknowledged
> >    updates to become leader (there is protection so two replicas that
> >    think they're leader at the same time do not overwrite each other, I
> >    can describe this on another occasion).
> >    Currently the (basic) approach is that a replica checks if it is up to
> >    date as long as it's getting queries. If it is already up to date, the
> >    cost of that check is a ZooKeeper node read. If it is not up to date,
> >    it then fetches the updated content from the shared storage. Other
> >    strategies (check less often, check while not getting queries, etc.)
> >    are easy to implement. The updates are currently done asynchronously
> >    so do not delay the queries (which serve the "previous" content, like
> >    normal SolrCloud replication does).
> >
> > I'd be happy to discuss and explain this in more detail. Quickly tomorrow
> > then more once I've shared the source code branch and the design doc.
> > I'd of course be happy to also follow up here with any questions, so
> > don't hesitate to ask!
> >
> > Ilan
> >
> >
> >
> > On Wed, Jan 17, 2024 at 8:30 PM Jason Gerlowski <gerlowsk...@gmail.com>
> > wrote:
> >
> > > Hey Ilan,
> > >
> > > Thanks for putting together this writeup.  I think I understand the
> > > goal conceptually, and it sounds like a good one for Solr!  But I'm
> > > still having trouble understanding how this all would actually work.
> > > So a few questions, inline:
> > >
> > > > A fourth replica type called ZERO is introduced
> > >
> > > Why the name "Zero"? Is it conveying something about the design that
> > > I'm not picking up on?
> > >
> > > > At Collection creation time, it is possible to specify that the
> > > > collection exclusively uses replicas of type ZERO rather than being a
> > > > “normal” collection that uses NRT/TLOG/PULL.
> > >
> > > Am I correct in understanding this to mean that if "zero" is used, it
> > > must be used for every replica in the collection?  If so, it almost
> > > sounds like this isn't a new type of replica but a new "collection
> > > type" altogether?
> > >
> > > > This allows scaling compute (more queries, more indexing)
> > > > independently of storage
> > >
> > > I think the biggest question I have is: how does the "compute" side of
> > > this actually work?
> > >
> > > On the indexing side: what all happens in Solr before giving a response
> > > back to users?  What happens on a commit?  Are updates indexed only on
> > > the leader (like TLOG/PULL) or on all replicas (like NRT), or some
> > > other arrangement altogether?
> > >
> > > On the querying side: what situations cause index data to be pulled
> > > from the remote store?
> > >
> > > (These last questions might be a bit lengthy to get into via email, but
> > > they should probably be in the writeup?  Not sure what's best there...)
> > >
> > > Best,
> > >
> > > Jason
> > >
> > > On Sat, Jan 13, 2024 at 9:15 PM Ishan Chattopadhyaya <
> > > ichattopadhy...@gmail.com> wrote:
> > >
> > > > +1, thanks for the contribution Ilan! Looking forward to seeing this
> > > > coming to fruition.
> > > >
> > > > On Sun, 14 Jan 2024 at 03:40, Ilan Ginzburg <ilans...@gmail.com>
> > > > wrote:
> > > >
> > > > > I have created SIP-20
> > > > > https://cwiki.apache.org/confluence/display/SOLR/SIP-20%3A+Separation+of+Compute+and+Storage+in+SolrCloud
> > > > >
> > > > > In the next few days I will create a Jira + a branch that
> > > > > implements the SIP proposal and that includes documentation on how
> > > > > to approach that branch and what's in it.
> > > > >
> > > > > This proposed contribution is based on work done at Salesforce
> > > > > these last few years and currently running at scale in multiple
> > > > > regions.
> > > > >
> > > > > Thanks,
> > > > > Ilan
> > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> > > > > For additional commands, e-mail: dev-h...@solr.apache.org
