Hi Colin,

> On a related note, what do you think about the idea of storing the
> reassigning replicas in
> /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in the
> reassignment znode?  I don't think this requires a major change to the
> proposal-- when the controller becomes aware that it should do a
> reassignment, the controller could make the changes.  This also helps keep
> the reassignment znode from getting larger, which has been a problem.


Yeah, I think it's a good idea to store the reassignment state at a finer
granularity. I'm not sure the LeaderAndIsr znode is the right place,
though. Another option is /brokers/topics/{topic}, which is where we
currently store the replica assignment. I think we basically want to
represent both the current state and the desired state. This would also
open the door to a cleaner way to update a reassignment while it is still
in progress.
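
To make that a bit more concrete, here is a rough sketch of what the
/brokers/topics/{topic} znode could look like with both states stored
side by side. This is just an illustration, not a concrete proposal; in
particular the "target_replicas" field name and the version bump are
placeholders:

  {
    "version": 2,
    "partitions": {
      "0": [1, 2, 3],
      "1": [2, 3, 4]
    },
    "target_replicas": {
      "0": [1, 2, 5]
    }
  }

In a scheme like this, the controller would treat a partition as
reassigning whenever its target differs from its current assignment, and
updating an in-progress reassignment would just mean rewriting the target
for that partition.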

-Jason




On Mon, Apr 8, 2019 at 11:14 PM George Li <sql_consult...@yahoo.com.invalid>
wrote:

>  Hi Colin / Jason,
>
> Reassignment should really be done in batches.  I am not too worried about
> the reassignment znode getting larger.  In a real production environment,
> too many concurrent reassignments and too-frequent submission of
> reassignments seemed to cause latency spikes in the Kafka cluster.  So
> batching/staggering/throttling of submitted reassignments is recommended.
>
> In KIP-236, the "originalReplicas" are only kept for the current
> reassigning partitions (a small number), and kept in memory in the
> controller context's partitionsBeingReassigned as well as in the znode
> /admin/reassign_partitions.  I think the suggestion below of using null in
> the RPC to mean that no replicas are reassigning is a good idea.
>
> There seem to be some issues with the mail archive server of this mailing
> list?  I didn't receive any email after April 7th, and the archive for
> April 2019 has only 50 messages (
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201904.mbox/thread)?
>
> Thanks,
> George
>
>    On Mon, 08 Apr 2019 17:54:48 GMT, Colin McCabe wrote:
>
>   Yeah, I think adding this information to LeaderAndIsr makes sense.  It
> would be better to track
> "reassigningReplicas" than "originalReplicas", I think.  Tracking
> "originalReplicas" is going
> to involve sending a lot more data, since most replicas in the system are
> not reassigning
> at any given point.  Or we would need a hack in the RPC like null = no
> replicas are reassigning.
>
> On a related note, what do you think about the idea of storing the
> reassigning replicas in
>  /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
> the reassignment znode?
>  I don't think this requires a major change to the proposal-- when the
> controller becomes
> aware that it should do a reassignment, the controller could make the
> changes.  This also
> helps keep the reassignment znode from getting larger, which has been a
> problem.
>
> best,
> Colin
>
>
> On Mon, Apr 8, 2019, at 09:29, Jason Gustafson wrote:
> > Hey George,
> >
> > > For the URP during a reassignment, if the "original_replicas" is kept
> > > for the current pending reassignment, I think it will be very easy to
> > > compare that with the topic/partition's ISR.  If all "original_replicas"
> > > are in ISR, then URP should be 0 for that topic/partition.
> >
> >
> > Yeah, that makes sense. But I guess we would need "original_replicas" to
> > be propagated to partition leaders in the LeaderAndIsr request since
> > leaders are the ones that are computing URPs. That is basically what
> > KIP-352 had proposed, but we also need the changes to the reassignment
> > path. Perhaps it makes more sense to address this problem in KIP-236
> > since that is where you have already introduced "original_replicas"? I'm
> > also happy to do KIP-352 as a follow-up to KIP-236.
> >
> > Best,
> > Jason
> >
> >
> > On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <isma...@gmail.com> wrote:
> >
> > > Good discussion about where we should do batching. I think if there is
> > > a clear, great way to batch, then it makes a lot of sense to just do it
> > > once. However, if we think there is scope for experimenting with
> > > different approaches, then an API that tools can use makes a lot of
> > > sense. They can experiment and innovate. Eventually, we can integrate
> > > something into Kafka if it makes sense.
> > >
> > > Ismael
> > >
> > > On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <cmcc...@apache.org> wrote:
> > >
> > > > Hi George,
> > > >
> > > > As Jason was saying, it seems like there are two directions we could
> > > > go here: an external system handling batching, and the controller
> > > > handling batching.  I think the controller handling batching would be
> > > > better, since the controller has more information about the state of
> > > > the system.  If the controller handles batching, then the controller
> > > > could also handle things like setting up replication quotas for
> > > > individual partitions.  The controller could do things like throttle
> > > > replication down if the cluster was having problems.
> > > >
> > > > We kind of need to figure out which way we're going to go on this one
> > > > before we set up big new APIs, I think.  If we want an external system
> > > > to handle batching, then we can keep the idea that there is only one
> > > > reassignment in progress at once.  If we want the controller to handle
> > > > batching, we will need to get away from that idea.  Instead, we should
> > > > just have a bunch of "ideal assignments" that we tell the controller
> > > > about, and let it decide how to do the batching.  These ideal
> > > > assignments could change continuously over time, so from the admin's
> > > > point of view, there would be no start/stop/cancel, but just individual
> > > > partition reassignments that we submit, perhaps over a long period of
> > > > time.  And then cancellation might just mean cancelling that individual
> > > > partition reassignment, not all partition reassignments.
> > > >
> > > > best,
> > > > Colin
> > > >
> > > > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> > > > >  Hi Jason / Viktor,
> > > > >
> > > > > For the URP during a reassignment, if the "original_replicas" is
> > > > > kept for the current pending reassignment, I think it will be very
> > > > > easy to compare that with the topic/partition's ISR.  If all
> > > > > "original_replicas" are in ISR, then URP should be 0 for that
> > > > > topic/partition.
> > > > >
> > > > > It would also be nice to separate the metrics MaxLag/TotalLag for
> > > > > reassignments. I think that will also require "original_replicas"
> > > > > (the topic/partition's replicas just before reassignment, when the AR
> > > > > (Assigned Replicas) is set to Set(original_replicas) +
> > > > > Set(new_replicas_in_reassign_partitions)).
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >     On Friday, April 5, 2019, 6:29:55 PM PDT, Jason Gustafson
> > > > > <ja...@confluent.io> wrote:
> > > > >
> > > > >  Hi Viktor,
> > > > >
> > > > > Thanks for writing this up. As far as questions about overlap with
> > > > > KIP-236, I agree it seems mostly orthogonal. I think KIP-236 may have
> > > > > had a larger initial scope, but now it focuses on cancellation, and
> > > > > batching is left for future work.
> > > > >
> > > > > With that said, I think we may not actually need a KIP for the
> > > > > current proposal since it doesn't change any APIs. To make it more
> > > > > generally useful, however, it would be nice to handle batching at the
> > > > > partition level as well, as Jun suggests. The basic question is at
> > > > > what level the batching should be determined. You could rely on
> > > > > external processes (e.g. cruise control) or it could be built into
> > > > > the controller. There are tradeoffs either way, but I think it
> > > > > simplifies such tools if it is handled internally. Then it would be
> > > > > much safer to submit a larger reassignment even just using the simple
> > > > > tools that come with Kafka.
> > > > >
> > > > > By the way, since you are looking into some of the reassignment
> > > > > logic, another problem that we might want to address is the
> > > > > misleading way we report URPs during a reassignment. I had a naive
> > > > > proposal for this previously, but it didn't really work:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> > > > > Potentially fixing that could fall under this work as well if you
> > > > > think it makes sense.
> > > > >
> > > > > Best,
> > > > > Jason
> > > > >
> > > > > On Thu, Apr 4, 2019 at 4:49 PM Jun Rao <j...@confluent.io> wrote:
> > > > >
> > > > > > Hi, Viktor,
> > > > > >
> > > > > > Thanks for the KIP. A couple of comments below.
> > > > > >
> > > > > > 1. Another potential way to do reassignment incrementally is to
> > > > > > move a batch of partitions at a time, instead of all partitions.
> > > > > > This may lead to less data replication, since by the time the first
> > > > > > batch of partitions has been completely moved, some data of the
> > > > > > next batch may have been deleted due to retention and won't need to
> > > > > > be replicated.
> > > > > >
> > > > > > 2. "Update CR in Zookeeper with TR for the given partition".  Which
> > > > > > ZK path is this for?
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass <
> > > > > > viktorsomo...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Harsha,
> > > > > > >
> > > > > > > As far as I understand, KIP-236 is about enabling reassignment
> > > > > > > cancellation and, as a future plan, providing a queue of replica
> > > > > > > reassignment steps to allow manual reassignment chains. While I
> > > > > > > agree that the reassignment chain has a specific use case that
> > > > > > > allows fine-grained control over the reassignment process, my
> > > > > > > proposal on the other hand doesn't talk about cancellation; it
> > > > > > > only provides an automatic way to incrementalize an arbitrary
> > > > > > > reassignment, which I think fits the general use case where users
> > > > > > > don't want that level of control but still would like a balanced
> > > > > > > way of doing reassignments. Therefore I think it's still relevant
> > > > > > > as an improvement of the current algorithm.
> > > > > > > Nevertheless, I'm happy to add my ideas to KIP-236 as I think it
> > > > > > > would be a great improvement to Kafka.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Viktor
> > > > > > >
> > > > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha <ka...@harsha.io> wrote:
> > > > > > >
> > > > > > > > Hi Viktor,
> > > > > > > > There is already KIP-236 for the same feature, and George made
> > > > > > > > a PR for this as well.
> > > > > > > > Let's consolidate these two discussions. If you have any cases
> > > > > > > > that are not being solved by KIP-236, can you please mention
> > > > > > > > them in that thread? We can address them as part of KIP-236.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Harsha
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor Somogyi-Vass wrote:
> > > > > > > > > Hi Folks,
> > > > > > > > >
> > > > > > > > > I've created a KIP about an improvement of the reassignment
> > > > > > > > > algorithm we have. It aims to enable partition-wise
> > > > > > > > > incremental reassignment. The motivation for this is to avoid
> > > > > > > > > the excess load that the current replication algorithm
> > > > > > > > > implicitly carries, as there are points in the algorithm
> > > > > > > > > where both the new and old replica sets could be online and
> > > > > > > > > replicating, which puts double (or almost double) pressure on
> > > > > > > > > the brokers and could cause problems.
> > > > > > > > > Instead, my proposal would slice this up into several steps,
> > > > > > > > > where each step is calculated based on the final target
> > > > > > > > > replicas and the current replica assignment, taking into
> > > > > > > > > account scenarios where brokers could be offline and where
> > > > > > > > > there are not enough replicas to fulfil the
> > > > > > > > > min.insync.replicas requirement.
> > > > > > > > >
> > > > > > > > > The link to the KIP:
> > > > > > > > >
> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> > > > > > > > >
> > > > > > > > > I'd be happy to receive any feedback.
> > > > > > > > >
> > > > > > > > > An important note is that this KIP and another one, KIP-236,
> > > > > > > > > which is about interruptible reassignment (
> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment
> > > > > > > > > ), should be compatible.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Viktor
> > > > > > > > >
