Hey Viktor, Please do! This is a draft, and it's open for you to edit to include your new ideas :)
I don't think I understand what you mean here. Are you suggesting an alternative to the Admin API? An external project could certainly build such a component with the Admin API. Thanks, Greg On Thu, Oct 5, 2023 at 6:33 AM Viktor Somogyi-Vass <viktor.somo...@cloudera.com.invalid> wrote: > > Hi Greg, > > Sure, I'll expand it with my thoughts. Is it fine if I add it to the KIP > and update this discussion? > > Another thing that crossed my mind is that in MM2 you can handle configs > and replication flow in a central place because it is a separate component. > I think that for use-cases where there are many replication flows, this > aspect can be useful (as Kafka itself is useful for microservices). For CCR > too it could be useful to have some kind of separated service that collects > this information. It could also serve as an admin endpoint (swagger maybe?) > for managing flows and configuration. With this you could instruct clusters > to create/pause/delete replications. What do you think? > > Thanks, > Viktor > > > > On Wed, Oct 4, 2023 at 6:20 PM Greg Harris <greg.har...@aiven.io.invalid> > wrote: > > > Hey Viktor, > > > > Thanks for thinking about Tiered Storage. I'm not so familiar there, > > so if you could add some of your expectations about how the two > > features will interact, I would appreciate that. > > > > It appears to me that follower-fetch-from-remote is a significant > > optimization within TS, and so similar optimizations to support > > cross-cluster-replicate-from-remote and out-of-band remote replication > > could also be desirable. > > I think we can explore the idea further, and make sure that CCR is > > extensible to tiered topics if it doesn't make it into the initial > > implementation. > > > > Thanks! > > Greg > > > > On Wed, Oct 4, 2023 at 6:13 AM Viktor Somogyi-Vass > > <viktor.somo...@cloudera.com.invalid> wrote: > > > > > > Hi Greg, > > > > > > Thanks for the answers. I think they all make sense. 
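[Editor's note] Viktor's "central place for configs and replication flows" suggestion above can be made concrete with a minimal sketch of the bookkeeping such a separate service might keep. All names here (`FlowRegistry`, `ReplicationFlow`, the state values) are hypothetical illustrations, not anything from the KIP; a real service would translate these state changes into Admin API calls against each cluster rather than mutating an in-memory dict.

```python
from dataclasses import dataclass
from enum import Enum


class FlowState(Enum):
    ACTIVE = "active"
    PAUSED = "paused"


@dataclass
class ReplicationFlow:
    source: str   # source cluster alias, e.g. "us-east" (illustrative)
    target: str   # target cluster alias, e.g. "eu-west" (illustrative)
    topics: str   # topic selector, mirroring the KIP's regex configs
    state: FlowState = FlowState.ACTIVE


class FlowRegistry:
    """Central bookkeeping for replication flows (hypothetical sketch).

    A real implementation would drive each cluster's Admin API to
    create/pause/delete the underlying replication links.
    """

    def __init__(self) -> None:
        self._flows: dict[str, ReplicationFlow] = {}

    def create(self, name: str, source: str, target: str, topics: str) -> ReplicationFlow:
        if name in self._flows:
            raise ValueError(f"flow {name!r} already exists")
        flow = ReplicationFlow(source, target, topics)
        self._flows[name] = flow
        return flow

    def pause(self, name: str) -> None:
        self._flows[name].state = FlowState.PAUSED

    def delete(self, name: str) -> None:
        del self._flows[name]

    def flows(self) -> dict[str, ReplicationFlow]:
        return dict(self._flows)
```

The point of the sketch is only that create/pause/delete are ordinary state transitions on a per-flow record, which is exactly the kind of thing an external project could build on top of the Admin API.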
> > > > > > Another point I realized last evening is that now that tiered storage > > (TS) > > > is available, it might complicate things with CCR. What I'm thinking of > > is > > > that if you have multiple clusters in multiple regions, enabling the > > object > > > storage's replication between zones could be much more cost-efficient > > than > > > replicating local+remote offsets through Kafka. You'd only need to copy > > > local segments over and remote partition replication would be done by the > > > remote layer. Or the user could simply choose to not replicate remote > > > segments between regions but instead just reference them (so that the > > > backup cluster's remote offsets point to the original region). These > > > options, however, likely require closer coordination between clusters than > > in > > > pre-TS Kafka. Do you think we should take this into consideration in the > > > design and in the UX? > > > > > > Thanks, > > > Viktor > > > > > > On Tue, Oct 3, 2023 at 6:30 PM Greg Harris <greg.har...@aiven.io.invalid > > > > > > wrote: > > > > > > > Hi Viktor, > > > > > > > > Thanks for your questions! I agree, replication is very fundamental in > > > > Kafka, so it's been implemented in many different ways by different > > > > people. I hope that this is the last implementation we'll need, but > > > > every software engineer says that :) > > > > > > > > GT-1: I think that, as this KIP is very focused on the UX of the feature, > > > > user stories are appropriate to include. I think it isn't > > > > necessary to explain how the different applications are accomplished > > > > with MM2 or other solutions, but describing what they will look like > > > > after this KIP would be a wonderful addition. +1 > > > > > > > > MM2-1: I think that replacing the consumer is insufficient, as we need > > > > a more expressive producer as well. This is not possible within the > > > > design constraints of MM2 as a Connector, as MM2 uses the > > > > connect-managed producer.
This could be implemented in MM3 as a new > > > > process that can use more expressive "internal clients", but then > > > > we've thrown away the Connect runtime that made MM2 easier to run for > > > > some users. > > > > MM2-2: This is technically possible, but sounds operationally > > hazardous to > > > > me. > > > > MM2-3: From the user perspective, I believe that CCR can be made more > > > > simple to use and operate than MM2, while providing better guarantees. > > > > From the implementation standpoint, I think that CCR will be > > > > significantly more complex, as the architecture of MM2 leverages a lot > > > > of the Connect infrastructure. > > > > > > > > LaK-1: Yes, I think you understand what I was going for. > > > > LaK-2: I don't think that this is a user experience that we could add > > > > to CCR without changing the Kafka clients to be aware of both clusters > > > > concurrently. In order to redirect clients away from a failed cluster > > > > with a metadata refresh, the cluster that they're currently connected > > > > to must give them that data. But because the cluster failed, that > > > > refresh will not be reliable. With a proxy between the client and > > > > Kafka, that proxy can be available while the original Kafka cluster is > > > > not. Failovers would happen between distinct sets of clients that are > > > > part of the same logical application. > > > > > > > > Thanks for taking a look at the rejected alternatives! > > > > Greg > > > > > > > > On Tue, Oct 3, 2023 at 3:24 AM Viktor Somogyi-Vass > > > > <viktor.somo...@cloudera.com.invalid> wrote: > > > > > > > > > > Hi Greg, > > > > > > > > > > Seems like finding the perfect replication solution is a never ending > > > > story > > > > > for Kafka :). > > > > > > > > > > Some general thoughts: > > > > > GT-1. 
While, as you say, it would be good to have some kind of built-in > > > > > replication in Kafka, we definitely need to understand the problem > better > > > > > to provide a better solution. Replication has lots of user stories, > as you > > > > > iterated over a few, and I think it's very well worth the time to > detail > > > > > each one in the KIP. This may help others who may want to contribute > > > > > understand the problem on a > deeper > > > > > level, somewhat sets the scope, > and > > > > > describes the problem in a way that a good solution can be deduced > from > > > > it. > > > > > > > > > > I also have a few questions regarding some of the rejected solutions: > > > > > > > > > > MM2: > > > > > I think your points about MM2 are fair (offset transparency and > > > > operational > > > > > complexity), however I think it needs more reasoning about why we are > > > > > moving in a different direction. > > > > > A few points I can think of that we could improve in MM2 to > > > > > transform it into something more like the solution you aim for: > > > > > MM2-1. What if we consider replacing the client-based mechanism with > a > > > > > follower fetch protocol? > > > > > MM2-2. Operating an MM2 cluster might be familiar to those who > operate > > > > > Connect anyway. For those who don't, can we provide a "built-in" > version > > > > > that runs in the same process as Kafka, like an embedded dedicated > MM2 > > > > > cluster? > > > > > MM2-3. Will we actually be able to achieve less complexity with a > > > > built-in > > > > > solution? > > > > > > > > > > Layer above Kafka: > > > > > LaK-1. Would you please add more details about this? What I can > currently > > > > > think of is that this "layer above Kafka" would be some kind of a > proxy > > > > > which would proactively send an incoming request to multiple clusters > > > > like > > > > > "broadcast" it. Is that a correct assumption? > > > > > LaK-2.
In case of a cluster failover a client needs to change > > bootstrap > > > > > servers to a different cluster. A layer above Kafka or a proxy can > > solve > > > > > this by abstracting away the cluster itself. It could force out a > > > > metadata > > > > > refresh and from that point on clients can fetch from the other > > cluster. > > > > Is > > > > > this problem within the scope of this KIP or not? > > > > > > > > > > Thanks, > > > > > Viktor > > > > > > > > > > > > > > > On Tue, Oct 3, 2023 at 2:55 AM Greg Harris > > <greg.har...@aiven.io.invalid > > > > > > > > > > wrote: > > > > > > > > > > > Hey Tom, > > > > > > > > > > > > Thanks for the high-level questions, as I am certainly approaching > > > > > > this KIP differently than I've seen before. > > > > > > > > > > > > I think that ideally this KIP will expand to include lots of > > > > > > requirements and possible implementations, and that through > > discussion > > > > > > we can narrow the scope and form a roadmap for implementation > > across > > > > > > multiple KIPs. I don't plan to be the decision-maker for this > > project, > > > > > > as I'm more interested in building consensus among the co-authors. > > I > > > > > > can certainly poll that consensus and update the KIP to keep the > > > > > > project moving, and any other co-author can do the same. And to > > set an > > > > > > example, I'll clarify your questions and for anything that I agree > > > > > > with, I'll ask that you make the update to the KIP, so that the KIP > > > > > > captures your understanding of the problem and your requirements. > > If > > > > > > you don't get the chance to make the changes yourself, I'll make > > sure > > > > > > they get included eventually, as they're very good ideas :) > > > > > > > > > > > > For your remaining questions: > > > > > > > > > > > > M1: I was trying to draw analogies to databases, but your suggested > > > > > > properties are much more compelling and informative. 
I'd love it if > > > > > > you added some formalism here, so that we have a better grasp on > > what > > > > > > we're trying to accomplish. +1 > > > > > > M2: I think the "asynchronous" problem corresponds to the goal of > > > > > > "exactly once semantics" but the two are not obviously opposites. I > > > > > > think the MM2 deficiencies could focus less on the architecture > > > > > > (asynchronicity) and more on the user-facing effect (semantics). +1 > > > > > > M3: I had a "non-goals" section that ended up becoming the > > "rejected > > > > > > alternatives" section instead. If you have some non-goals in mind, > > > > > > please add them. > > > > > > M4+M5: I think it's too early to nail down the assumptions > > directly, > > > > > > but if you believe that "separate operators of source and target" > > is a > > > > > > requirement, that would be good to write down. +1 > > > > > > M6: That is a concerning edge case, and I don't know how to handle > > it. > > > > > > I was imagining that there would be a many:many relationship of > > > > > > clusters and links, but I understand that the book-keeping of that > > > > > > decision may be significant. > > > > > > M7: I think this may be appropriate to cover in a "user story" or > > > > > > "example usages". I naturally thought that the feature would > > describe > > > > > > some minimal way of linking two topics, and the applications > > > > > > (combining multiple links, performing failovers, or running > > > > > > active-active, etc) would be left to users to define. I included > > the > > > > > > regex configurations because I imagine that creating 100s or 1000s > > of > > > > > > links would be unnecessarily tedious. The feature may also encode > > > > > > those use-cases directly as first-class citizens as well. 
> > > > > > U1: These are states that can happen in reality, and I meant for > that > > > > > > section to imply that we should expect these states and model them > for > > > > > > operations and observability. > > > > > > > > > > > > D1: I think I may have introduced this confusion by trying to be > > > > > > terse. I imagined that there would be two different topics on the > > > > > > source and target, which would be synced to have the same > > > > > > configuration contents, similar to MM2's implementation. This would > > > > > > allow for the replication link to be permanently disconnected and > the > > > > > > target topic to become just a regular topic. Later, a new > replication > > > > > > link and new target topic (with another separate topic-id) can be > > > > > > created to rebuild the replication. I also thought that it was > > > > > > possible that two clusters had already chosen the same topic-id, > and > > > > > > that attempting to interpret one topic-id in two different clusters > > > > > > was error-prone. As far as replicating __cluster_metadata: I hadn't > > > > > > considered that, but that might be required depending on the > semantics > > > > > > we choose. > > > > > > D2: Thanks, that's a good clarification. Uptime and bandwidth > should > > > > > > be assumed to be lower, and latency should be assumed to be > higher. +1 > > > > > > D3: I included this restriction because it would not be > transparent to > > > > > > source consumers. They would need special support for connecting to > > > > > > brokers from multiple clusters, with potentially distinct metadata. > > > > > > > > > > > > Thanks so much! > > > > > > Greg > > > > > > > > > > > > On Mon, Oct 2, 2023 at 4:24 PM Tom Bentley <tbent...@redhat.com> > > > > wrote: > > > > > > > > > > > > > > Hi Greg, > > > > > > > > > > > > > > Thanks for this KIP!
It is obviously very ambitious, but it's > > great > > > > to > > > > > > have > > > > > > > a conversation about it. > > > > > > > > > > > > > > I'll start with some general points: > > > > > > > > > > > > > > Do you have a plan in mind for how to proceed with elaborating > > this > > > > KIP? > > > > > > > While I like how you're involving the community in elaborating > > the > > > > KIP, I > > > > > > > think there is a danger, which is more likely with this inclusive > > > > > > approach, > > > > > > > in trying to attempt too much at once. > > > > > > > > > > > > > > In my opinion someone needs to take the difficult decisions > > > > necessary to > > > > > > > limit the initial scope (and, just as importantly, communicate > > that > > > > > > > clearly) in order to maximise the chances of actually getting > > > > something > > > > > > > accepted and implemented. Can we assume that you're that person? > > > > Defining > > > > > > > the what and how of the metadata replication, and the log > > replication > > > > > > seem > > > > > > > to me to be the core of what you're trying to achieve here. We > > should > > > > > > make > > > > > > > anything that is not crucial to that (i.e. NAT punching) a > > non-goal > > > > of > > > > > > this > > > > > > > KIP. Future KIPs can easily add those features. > > > > > > > > > > > > > > I also had a few specific points: > > > > > > > > > > > > > > Motivation > > > > > > > M1. I don't find the "logical replication" vs "physical > > replication" > > > > > > > particularly helpful. I think one key property is "offset > > > > preserving", > > > > > > > which is also self-explanatory. Slightly more generally, we could > > > > define > > > > > > > the concept of "consumer transparency", i.e. a consumer could > > > > reconnect > > > > > > to > > > > > > > either cluster and observe the same sequences of records (same > > order, > > > > > > same > > > > > > > offsets, and same transaction visibility). 
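[Editor's note] Tom's "offset preserving" property can be stated as a small predicate over two logs. The dict-of-offsets model below is an assumption for illustration only, not the KIP's data model; full consumer transparency would additionally require the replica to be caught up and to show the same transaction visibility, which a prefix check like this cannot capture.

```python
def is_offset_preserving(source_log, replica_log):
    """Check the 'offset preserving' property over the replicated prefix:
    every offset present on the replica must hold the same record as the
    same offset on the source. Logs are modeled as offset -> record dicts
    (an illustrative simplification)."""
    for offset, record in replica_log.items():
        if source_log.get(offset) != record:
            return False  # replica diverges from the source at this offset
    return True
```

Under this formulation a lagging replica can still be offset preserving, which matches the observation that offset preservation does not by itself require synchronous replication.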
Consumer transparency > > > > requires > > > > > > > synchronous replication, but offset preserving does not. > > > > > > > M2. In the motivation you mention that MM offers asynchronous > > > > > > replication, > > > > > > > but the Goals subsection doesn't mention support for synchronous > > > > > > > replication. We should be clear which (or both) we're aiming for. > > > > > > > M3. A Non-Goals section would be useful, especially for a KIP > > that's > > > > > > large > > > > > > > and ambitious like this one. > > > > > > > M4. It might also be worth having a list of Assumptions. Here we > > > > could > > > > > > list > > > > > > > all the things we want to assume in order to make the initial KIP > > > > > > feasible. > > > > > > > M5. For example we should be explicit about whether or not it is > > > > assumed > > > > > > > that the same people are operating (and thus have visibility > > into) > > > > both > > > > > > > clusters. > > > > > > > M6. One thing worth calling out is whether the clusters > > themselves > > > > are > > > > > > in a > > > > > > > leader/follower relationship (e.g. the DR scenario), or whether > > this > > > > is a > > > > > > > topic-level concern. I guess it's topic level from the topic and > > > > consumer > > > > > > > group regexes. But this has consequences we should explore. For > > > > example > > > > > > > what if a transaction includes records in topics X and Y, where > > X is > > > > > > > replicated but Y is not? > > > > > > > M7. I think you should be clear about whether this > > leader/follower > > > > > > > relationship can be reversed, and in what circumstances. In the > > user > > > > > > > interface section you talk about "disconnected", but not this > > kind of > > > > > > > fail-back. > > > > > > > > > > > > > > > > > > > > > User interface > > > > > > > U1. "Links can be temporarily or permanently disconnected." 
Are > > you > > > > > > > describing a fact about the network between the two clusters, or > > is > > > > this > > > > > > > disconnection something actively managed by the system, or by the > > > > > > operator? > > > > > > > > > > > > > > Data semantics > > > > > > > D1. The KIP says "both cross-cluster topics and intra-cluster > > > > replicas: > > > > > > > Have the same configuration as their source" but you also say > > > > > > > "cross-cluster replicas: Have a separate topic-id", this seems > > like a > > > > > > > contradiction, on the face of it. It seems like there's a whole > > host > > > > of > > > > > > > devils in the detail behind this. It implies replication of (some > > > > of) the > > > > > > > __cluster_metadata, I think, but not all (since you say ACLs are > > not > > > > > > > replicated). If that's right, then what does it imply about > > > > referential > > > > > > > integrity between metadata records? i.e. what if metadata record > > A > > > > (which > > > > > > > is replicated) references record B (which is not)? Even if this > > is > > > > not > > > > > > > possible by design initially, how does it constrain the future > > > > evolution > > > > > > of > > > > > > > metadata record schemas? Is any such metadata replication going > > to be > > > > > > > transaction preserving? If the topic ids can differ then what is > > > > > > > responsible for the mapping and rewriting of metadata records > > which > > > > > > include > > > > > > > topic ids? > > > > > > > D2. "The network path between Kafka clusters is assumed to be > > less > > > > > > reliable > > > > > > > than the intra-cluster network," we should be explicit about > > whether > > > > or > > > > > > not > > > > > > > we're assuming similar network latencies and bandwidth for the > > > > > > > inter-cluster network links as for the in-cluster ones. 
> > > > > > > D3 "Are not eligible for fetch-from-follower on the source > > cluster" > > > > the > > > > > > > reason for this isn't immediately apparent to me. > > > > > > > > > > > > > > Thanks again, > > > > > > > > > > > > > > Tom > > > > > > > > > > > > > > On Tue, 3 Oct 2023 at 09:37, Greg Harris > > > > <greg.har...@aiven.io.invalid> > > > > > > > wrote: > > > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > I've opened an extremely early draft of the Cross-Cluster > > > > Replication > > > > > > > > feature, and I would like to invite any and all co-authors to > > > > expand > > > > > > > > on it. Find the page here: > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-986%3A+Cross-Cluster+Replication > > > > > > > > > > > > > > > > This is not strictly an invitation to "review" the KIP, as the > > > > > > > > document has much less detail than other KIPs of similar > > > > complexity. > > > > > > > > But if you are knowledgeable in this area, some early sanity > > checks > > > > > > > > would be greatly appreciated. > > > > > > > > > > > > > > > > I've included a "shopping list" of properties that appear to > > me to > > > > be > > > > > > > > desirable, but I don't have an implementation in mind that > > meets > > > > these > > > > > > > > requirements. If you have additional requirements, an > > alternative > > > > UX > > > > > > > > in mind, or wish to propose some implementation details, please > > > > edit > > > > > > > > the KIP with your contributions. > > > > > > > > > > > > > > > > Thanks everyone! > > > > > > > > > > > > > > > > Greg Harris > > > > > > > > Aiven, Inc > > > > > > > > > > > > > > > > > > > > > > > > > > > >