Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Jun Rao Mon, 18 Dec 2017 10:12:54 -0800

Hi, Tom,

Thanks for the reply. A few more followup comments below.


10. Another concern of mine is on consistency with the current pattern. The
current pattern for change notification based on ZK is (1) we first write
the actual value in the entity path and then write the change notification
path, and (2)  the change notification path only includes what entity has
changed but not the actual changes. If we want to follow this pattern for
consistency, /admin/reassignment_requests/request_xxx will only have the
partitions whose reassignment have changed, but not the actual reassignment.

11. Ok. I am not sure that I fully understand the description of that part.
Does "assigned" refer to the current assignment? Could you also describe
where the length of the original assignment is stored in ZK?

13. Hmm, I am not sure that the cancellation needs to be done for the whole
batch. The reason that I brought this up is for consistency. The KIP allows
override when using the new approach. It just seems that it's simpler to
extend this model when resolving multiple changes between the old and the
new approach.

Jun



On Mon, Dec 18, 2017 at 2:45 AM, Tom Bentley <t.j.bent...@gmail.com> wrote:

> Hi Jun,
>
> Thanks for replying, some answers below:
>
>
> > 10. The proposal now stores the reassignment for all partitions in
> > /admin/reassignment_requests/request_xxx. If the number of reassigned
> > partitions is larger, the ZK write may hit the default 1MB limit and
> fail.
> > An alternative approach is to have the reassignment requester first write
> > the new assignment for each partition under
> > /admin/reassignments/$topic/$partition and then write
> > /admin/reassignment_requests/request_xxx with an empty value. The
> > controller can then read all values under /admin/reassignments.
> >
>
> You're right that reassigning enough partitions would hit the 1MB limit,
> but I don't think this would be a problem in practice because it would be
> trivial to split the partitions into several requests (i.e. mutleiple
> request_xxx).
> I don't think the non-atomicity this would imply is a problem. By writing
> the partitions whose /admin/reassignments/$topic/$partition has been
> created or changed it makes it much more efficient to know which of those
> znodes we need to read. If I understand your suggestion, you would have to
> read every node under /admin/reassignments to figure out which had changed.
>
>
> > 11. The improvement you suggested in onPartitionReassignment() sounds
> good
> > at the high level. The computation of those dropped partitions seems a
> bit
> > complicated. Perhaps a simple approach is to drop the replicas not in the
> > original assignment and newest reassignment?
> >
>
> This was what I came up with originally too, but when I looked into
> implementing it I found a couple of things which made me reconsider it.
> Consider the reassignments [0,1] -> [2,3] -> [3,4]. In words: we start
> reassigning to [2,3], but then change our minds about 2 and switch it to 4
> (maybe we've figured out a better overall balance). At that point it is
> perfectly possible that broker 2 is in-sync and broker 1 is not in-sync. It
> seems silly to drop broker 2 in favour of broker 1: We're needlessly giving
> the cluster more work to do.
>
> The second thing that made me reconsider was in that same scenario it's
> even possible that broker 2 is the leader of the partition. Obviously we
> can elect a new leader before dropping it, but not without causing
> disruption to producers and consumers.
>
> By accepting a little more complexity in choosing which brokers to drop we
> make the dropping simpler (no need for leader election) and ensure the
> cluster has less work to do.
>
>
> > 12. You brought up the need of remembering the original assignment. This
> > will be lost if the assignment is changed multiple times if we follow the
> > approach described in 10. One way is to store the original assignment in
> > /brokers/topics/[topic] as the following. When the final reassignment
> > completes, we can remove the original field.
> > {
> >   "version": 1,
> >   "partitions": {"0": [0, 1, 3] },
> >   "originals": {"0": [0, 1, 2] }
> > }
> >
>
> While I was implementing my first version of onPartitionReassignment(),
> where I preferred the originals, I was storing the originals in the
> /admin/reassignments/$topic/$partition znodes. Since we will remove that
> znode at the end of reassignment anyway, I would suggest this is a better
> place to store that data (if it's necessary to do so), so that we can avoid
> another ZK round trip.
>
>
> > 13. For resolving the conflict between /admin/reassign_partitions and
> > /admin/reassignments/$topic/$partition, perhaps it's more natural to
> just
> > let the assignment with a newer timestamp to override the older one?
> >
>
> That would work but with slightly different semantics to what I have: Since
> /admin/reassign_partitions contains multiple partitions, using the
> timestamp means the whole batch wins or losses. By tracking how each
> request was made we can be more fine-grained. I'm to use the modification
> time if such granularity is not required.
>
>
> > 14. Implementation wise, currently, we register a watcher of the isr path
> > of each partition being reassigned. This has the potential issue of
> > registering many listeners. An improvement could be just piggybacking on
> > the existing IsrChangeNotificationHandler, which only watches a single ZK
> > path and is triggered on a batch of isr changes. This is kind of
> orthogonal
> > to the KIP. However, if we are touching the reassignment logic, it may be
> > worth considering.
>
>
> Let me look into that.
>
> Thanks,
>
> Tom
>
> On 16 December 2017 at 02:19, Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Tom,
> >
> > Thanks for the updated KIP. A few more comments below.
> >
> > 10. The proposal now stores the reassignment for all partitions in
> > /admin/reassignment_requests/request_xxx. If the number of reassigned
> > partitions is larger, the ZK write may hit the default 1MB limit and
> fail.
> > An alternative approach is to have the reassignment requester first write
> > the new assignment for each partition under
> > /admin/reassignments/$topic/$partition and then write
> > /admin/reassignment_requests/request_xxx with an empty value. The
> > controller can then read all values under /admin/reassignments.
> >
> > 11. The improvement you suggested in onPartitionReassignment() sounds
> good
> > at the high level. The computation of those dropped partitions seems a
> bit
> > complicated. Perhaps a simple approach is to drop the replicas not in the
> > original assignment and newest reassignment?
> >
> > 12. You brought up the need of remembering the original assignment. This
> > will be lost if the assignment is changed multiple times if we follow the
> > approach described in 10. One way is to store the original assignment in
> > /brokers/topics/[topic] as the following. When the final reassignment
> > completes, we can remove the original field.
> > {
> >   "version": 1,
> >   "partitions": {"0": [0, 1, 3] },
> >   "originals": {"0": [0, 1, 2] }
> > }
> >
> > 13. For resolving the conflict between /admin/reassign_partitions and
> > /admin/reassignments/$topic/$partition, perhaps it's more natural to
> just
> > let the assignment with a newer timestamp to override the older one?
> >
> > 14. Implementation wise, currently, we register a watcher of the isr path
> > of each partition being reassigned. This has the potential issue of
> > registering many listeners. An improvement could be just piggybacking on
> > the existing IsrChangeNotificationHandler, which only watches a single ZK
> > path and is triggered on a batch of isr changes. This is kind of
> orthogonal
> > to the KIP. However, if we are touching the reassignment logic, it may be
> > worth considering.
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Dec 15, 2017 at 10:17 AM, Tom Bentley <t.j.bent...@gmail.com>
> > wrote:
> >
> > > Just wanted to mention that I've started KIP-240, which builds on top
> of
> > > this one to provide an AdminClient API for listing and describing
> > > reassignments.
> > >
> > > On 15 December 2017 at 14:34, Tom Bentley <t.j.bent...@gmail.com>
> wrote:
> > >
> > > > > Should we seek to improve this algorithm in this KIP, or leave that
> > as
> > > > a later optimisation?
> > > >
> > > > I've updated the KIP with a proposed algorithm.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 14 December 2017 at 09:57, Tom Bentley <t.j.bent...@gmail.com>
> > wrote:
> > > >
> > > >> Thanks Ted, now fixed.
> > > >>
> > > >> On 13 December 2017 at 18:38, Ted Yu <yuzhih...@gmail.com> wrote:
> > > >>
> > > >>> Tom:
> > > >>> bq. create a znode /admin/reassignments/$topic-$partition
> > > >>>
> > > >>> Looks like the tree structure above should be:
> > > >>>
> > > >>> /admin/reassignments/$topic/$partition
> > > >>>
> > > >>> bq. The controller removes /admin/reassignment/$topic/$partition
> > > >>>
> > > >>> Note the lack of 's' for reassignment. It would be good to make
> > > zookeeper
> > > >>> paths consistent.
> > > >>>
> > > >>> Thanks
> > > >>>
> > > >>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <
> t.j.bent...@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>> > Hi Jun and Ted,
> > > >>> >
> > > >>> > Jun, you're right that needing one watcher per reassigned
> partition
> > > >>> > presents a scalability problem, and using a separate notification
> > > path
> > > >>> > solves that. I also agree that it makes sense to prevent users
> from
> > > >>> using
> > > >>> > both methods on the same reassignment.
> > > >>> >
> > > >>> > Ted, naming the reassignments like mytopic-42 was simpler while I
> > was
> > > >>> > proposing a watcher-per-reassignment (I'd have needed a child
> > watcher
> > > >>> on
> > > >>> > /admin/reassignments and also on /admin/reassignments/mytopic).
> > Using
> > > >>> the
> > > >>> > separate notification path means I don't need any watchers in the
> > > >>> > /admin/reassignments subtree, so switching to
> > > >>> /admin/reassignments/mytopic/
> > > >>> > 42
> > > >>> > would work, and avoid /admin/reassignments having a very large
> > number
> > > >>> of
> > > >>> > child nodes. On the other hand it also means I have to create and
> > > >>> delete
> > > >>> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs
> > the
> > > >>> cost
> > > >>> > of extra round trips to zookeeper. I suppose that since
> > reassignment
> > > is
> > > >>> > generally a slow process it makes little difference if we
> increase
> > > the
> > > >>> > latency of the interactions with zookeeper.
> > > >>> >
> > > >>> > I have updated the KIP with these improvements, and a more
> detailed
> > > >>> > description of exactly how we would manage these znodes.
> > > >>> >
> > > >>> > Reading the algorithm in KafkaController.
> > onPartitionReassignment(),
> > > it
> > > >>> > seems that it would be suboptimal for changing reassignments
> > > in-flight.
> > > >>> > Consider an initial assignment of [1,2], reassigned to [2,3] and
> > then
> > > >>> > changed to [2,4]. Broker 3 will remain in the assigned replicas
> > until
> > > >>> > broker 4 is in sync, even though 3 wasn't actually one of the
> > > original
> > > >>> > assigned replicas and is no longer a new assigned replica. I
> think
> > > this
> > > >>> > also affects the case where the reassignment is cancelled
> > > >>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up,
> > even
> > > >>> though
> > > >>> > its replica will then be deleted.
> > > >>> >
> > > >>> > Should we seek to improve this algorithm in this KIP, or leave
> that
> > > as
> > > >>> a
> > > >>> > later optimisation?
> > > >>> >
> > > >>> > Cheers,
> > > >>> >
> > > >>> > Tom
> > > >>> >
> > > >>> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote:
> > > >>> >
> > > >>> > > Another question is on the compatibility. Since now there are 2
> > > ways
> > > >>> of
> > > >>> > > specifying a partition reassignment, one under
> > > >>> /admin/reassign_partitions
> > > >>> > > and the other under /admin/reassignments, we probably want to
> > > >>> prevent the
> > > >>> > > same topic being reassigned under both paths at the same time?
> > > >>> > > Thanks,
> > > >>> > >
> > > >>> > > Jun
> > > >>> > >
> > > >>> > >
> > > >>> > >
> > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io>
> > wrote:
> > > >>> > >
> > > >>> > > > Hi, Tom,
> > > >>> > > >
> > > >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> > > points
> > > >>> in
> > > >>> > > > partition reassignment. Another issue that it also addresses
> is
> > > >>> the ZK
> > > >>> > > node
> > > >>> > > > size limit when writing the reassignment JSON.
> > > >>> > > >
> > > >>> > > > My only concern is that the KIP needs to create one watcher
> per
> > > >>> > > reassigned
> > > >>> > > > partition. This could add overhead in ZK and complexity for
> > > >>> debugging
> > > >>> > > when
> > > >>> > > > lots of partitions are being reassigned simultaneously. We
> > could
> > > >>> > > > potentially improve this by introducing a separate ZK path
> for
> > > >>> change
> > > >>> > > > notification as we do for configs. For example, every time we
> > > >>> change
> > > >>> > the
> > > >>> > > > assignment for a set of partitions, we could further write a
> > > >>> sequential
> > > >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> > > >>> controller
> > > >>> > > > only needs to watch the change path. Once a change is
> > triggered,
> > > >>> the
> > > >>> > > > controller can read everything under /admin/reassignments/.
> > > >>> > > >
> > > >>> > > > Jun
> > > >>> > > >
> > > >>> > > >
> > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > > t.j.bent...@gmail.com
> > > >>> >
> > > >>> > > wrote:
> > > >>> > > >
> > > >>> > > >> Hi,
> > > >>> > > >>
> > > >>> > > >> This is still very new, but I wanted some quick feedback on
> a
> > > >>> > > preliminary
> > > >>> > > >> KIP which could, I think, help with providing an AdminClient
> > API
> > > >>> for
> > > >>> > > >> partition reassignment.
> > > >>> > > >>
> > > >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> > > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > > >>> > > >>
> > > >>> > > >> I wasn't sure whether to start fleshing out a whole
> > AdminClient
> > > >>> API in
> > > >>> > > >> this
> > > >>> > > >> KIP (which would make it very big, and difficult to read),
> or
> > > >>> whether
> > > >>> > to
> > > >>> > > >> break it down into smaller KIPs (which makes it easier to
> read
> > > and
> > > >>> > > >> implement in pieces, but harder to get a high-level picture
> of
> > > the
> > > >>> > > >> ultimate
> > > >>> > > >> destination). For now I've gone for a very small initial
> KIP,
> > > but
> > > >>> I'm
> > > >>> > > >> happy
> > > >>> > > >> to sketch the bigger picture here if people are interested.
> > > >>> > > >>
> > > >>> > > >> Cheers,
> > > >>> > > >>
> > > >>> > > >> Tom
> > > >>> > > >>
> > > >>> > > >
> > > >>> > > >
> > > >>> > >
> > > >>> >
> > > >>> >
> > > >>> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote:
> > > >>> >
> > > >>> > > Another question is on the compatibility. Since now there are 2
> > > ways
> > > >>> of
> > > >>> > > specifying a partition reassignment, one under
> > > >>> /admin/reassign_partitions
> > > >>> > > and the other under /admin/reassignments, we probably want to
> > > >>> prevent the
> > > >>> > > same topic being reassigned under both paths at the same time?
> > > >>> > > Thanks,
> > > >>> > >
> > > >>> > > Jun
> > > >>> > >
> > > >>> > >
> > > >>> > >
> > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io>
> > wrote:
> > > >>> > >
> > > >>> > > > Hi, Tom,
> > > >>> > > >
> > > >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> > > points
> > > >>> in
> > > >>> > > > partition reassignment. Another issue that it also addresses
> is
> > > >>> the ZK
> > > >>> > > node
> > > >>> > > > size limit when writing the reassignment JSON.
> > > >>> > > >
> > > >>> > > > My only concern is that the KIP needs to create one watcher
> per
> > > >>> > > reassigned
> > > >>> > > > partition. This could add overhead in ZK and complexity for
> > > >>> debugging
> > > >>> > > when
> > > >>> > > > lots of partitions are being reassigned simultaneously. We
> > could
> > > >>> > > > potentially improve this by introducing a separate ZK path
> for
> > > >>> change
> > > >>> > > > notification as we do for configs. For example, every time we
> > > >>> change
> > > >>> > the
> > > >>> > > > assignment for a set of partitions, we could further write a
> > > >>> sequential
> > > >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> > > >>> controller
> > > >>> > > > only needs to watch the change path. Once a change is
> > triggered,
> > > >>> the
> > > >>> > > > controller can read everything under /admin/reassignments/.
> > > >>> > > >
> > > >>> > > > Jun
> > > >>> > > >
> > > >>> > > >
> > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > > t.j.bent...@gmail.com
> > > >>> >
> > > >>> > > wrote:
> > > >>> > > >
> > > >>> > > >> Hi,
> > > >>> > > >>
> > > >>> > > >> This is still very new, but I wanted some quick feedback on
> a
> > > >>> > > preliminary
> > > >>> > > >> KIP which could, I think, help with providing an AdminClient
> > API
> > > >>> for
> > > >>> > > >> partition reassignment.
> > > >>> > > >>
> > > >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> > > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > > >>> > > >>
> > > >>> > > >> I wasn't sure whether to start fleshing out a whole
> > AdminClient
> > > >>> API in
> > > >>> > > >> this
> > > >>> > > >> KIP (which would make it very big, and difficult to read),
> or
> > > >>> whether
> > > >>> > to
> > > >>> > > >> break it down into smaller KIPs (which makes it easier to
> read
> > > and
> > > >>> > > >> implement in pieces, but harder to get a high-level picture
> of
> > > the
> > > >>> > > >> ultimate
> > > >>> > > >> destination). For now I've gone for a very small initial
> KIP,
> > > but
> > > >>> I'm
> > > >>> > > >> happy
> > > >>> > > >> to sketch the bigger picture here if people are interested.
> > > >>> > > >>
> > > >>> > > >> Cheers,
> > > >>> > > >>
> > > >>> > > >> Tom
> > > >>> > > >>
> > > >>> > > >
> > > >>> > > >
> > > >>> > >
> > > >>> >
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Reply via email to