Hi, Tom,

Thanks for the updated KIP. A few more comments below.

10. The proposal now stores the reassignment for all partitions in
/admin/reassignment_requests/request_xxx. If the number of reassigned
partitions is larger, the ZK write may hit the default 1MB limit and fail.
An alternative approach is to have the reassignment requester first write
the new assignment for each partition under
/admin/reassignments/$topic/$partition and then write
/admin/reassignment_requests/request_xxx with an empty value. The
controller can then read all values under /admin/reassignments.

11. The improvement you suggested in onPartitionReassignment() sounds good
at the high level. The computation of those dropped partitions seems a bit
complicated. Perhaps a simple approach is to drop the replicas not in the
original assignment and newest reassignment?

12. You brought up the need of remembering the original assignment. This
will be lost if the assignment is changed multiple times if we follow the
approach described in 10. One way is to store the original assignment in
/brokers/topics/[topic] as the following. When the final reassignment
completes, we can remove the original field.
{
  "version": 1,
  "partitions": {"0": [0, 1, 3] },
  "originals": {"0": [0, 1, 2] }
}

13. For resolving the conflict between /admin/reassign_partitions and
/admin/reassignments/$topic/$partition, perhaps it's more natural to just
let the assignment with a newer timestamp to override the older one?

14. Implementation wise, currently, we register a watcher of the isr path
of each partition being reassigned. This has the potential issue of
registering many listeners. An improvement could be just piggybacking on
the existing IsrChangeNotificationHandler, which only watches a single ZK
path and is triggered on a batch of isr changes. This is kind of orthogonal
to the KIP. However, if we are touching the reassignment logic, it may be
worth considering.

Thanks,

Jun

On Fri, Dec 15, 2017 at 10:17 AM, Tom Bentley <t.j.bent...@gmail.com> wrote:

> Just wanted to mention that I've started KIP-240, which builds on top of
> this one to provide an AdminClient API for listing and describing
> reassignments.
>
> On 15 December 2017 at 14:34, Tom Bentley <t.j.bent...@gmail.com> wrote:
>
> > > Should we seek to improve this algorithm in this KIP, or leave that as
> > a later optimisation?
> >
> > I've updated the KIP with a proposed algorithm.
> >
> >
> >
> >
> >
> >
> > On 14 December 2017 at 09:57, Tom Bentley <t.j.bent...@gmail.com> wrote:
> >
> >> Thanks Ted, now fixed.
> >>
> >> On 13 December 2017 at 18:38, Ted Yu <yuzhih...@gmail.com> wrote:
> >>
> >>> Tom:
> >>> bq. create a znode /admin/reassignments/$topic-$partition
> >>>
> >>> Looks like the tree structure above should be:
> >>>
> >>> /admin/reassignments/$topic/$partition
> >>>
> >>> bq. The controller removes /admin/reassignment/$topic/$partition
> >>>
> >>> Note the lack of 's' for reassignment. It would be good to make
> zookeeper
> >>> paths consistent.
> >>>
> >>> Thanks
> >>>
> >>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <t.j.bent...@gmail.com>
> >>> wrote:
> >>>
> >>> > Hi Jun and Ted,
> >>> >
> >>> > Jun, you're right that needing one watcher per reassigned partition
> >>> > presents a scalability problem, and using a separate notification
> path
> >>> > solves that. I also agree that it makes sense to prevent users from
> >>> using
> >>> > both methods on the same reassignment.
> >>> >
> >>> > Ted, naming the reassignments like mytopic-42 was simpler while I was
> >>> > proposing a watcher-per-reassignment (I'd have needed a child watcher
> >>> on
> >>> > /admin/reassignments and also on /admin/reassignments/mytopic). Using
> >>> the
> >>> > separate notification path means I don't need any watchers in the
> >>> > /admin/reassignments subtree, so switching to
> >>> /admin/reassignments/mytopic/
> >>> > 42
> >>> > would work, and avoid /admin/reassignments having a very large number
> >>> of
> >>> > child nodes. On the other hand it also means I have to create and
> >>> delete
> >>> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs the
> >>> cost
> >>> > of extra round trips to zookeeper. I suppose that since reassignment
> is
> >>> > generally a slow process it makes little difference if we increase
> the
> >>> > latency of the interactions with zookeeper.
> >>> >
> >>> > I have updated the KIP with these improvements, and a more detailed
> >>> > description of exactly how we would manage these znodes.
> >>> >
> >>> > Reading the algorithm in KafkaController.onPartitionReassignment(),
> it
> >>> > seems that it would be suboptimal for changing reassignments
> in-flight.
> >>> > Consider an initial assignment of [1,2], reassigned to [2,3] and then
> >>> > changed to [2,4]. Broker 3 will remain in the assigned replicas until
> >>> > broker 4 is in sync, even though 3 wasn't actually one of the
> original
> >>> > assigned replicas and is no longer a new assigned replica. I think
> this
> >>> > also affects the case where the reassignment is cancelled
> >>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up, even
> >>> though
> >>> > its replica will then be deleted.
> >>> >
> >>> > Should we seek to improve this algorithm in this KIP, or leave that
> as
> >>> a
> >>> > later optimisation?
> >>> >
> >>> > Cheers,
> >>> >
> >>> > Tom
> >>> >
> >>> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote:
> >>> >
> >>> > > Another question is on the compatibility. Since now there are 2
> ways
> >>> of
> >>> > > specifying a partition reassignment, one under
> >>> /admin/reassign_partitions
> >>> > > and the other under /admin/reassignments, we probably want to
> >>> prevent the
> >>> > > same topic being reassigned under both paths at the same time?
> >>> > > Thanks,
> >>> > >
> >>> > > Jun
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io> wrote:
> >>> > >
> >>> > > > Hi, Tom,
> >>> > > >
> >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> points
> >>> in
> >>> > > > partition reassignment. Another issue that it also addresses is
> >>> the ZK
> >>> > > node
> >>> > > > size limit when writing the reassignment JSON.
> >>> > > >
> >>> > > > My only concern is that the KIP needs to create one watcher per
> >>> > > reassigned
> >>> > > > partition. This could add overhead in ZK and complexity for
> >>> debugging
> >>> > > when
> >>> > > > lots of partitions are being reassigned simultaneously. We could
> >>> > > > potentially improve this by introducing a separate ZK path for
> >>> change
> >>> > > > notification as we do for configs. For example, every time we
> >>> change
> >>> > the
> >>> > > > assignment for a set of partitions, we could further write a
> >>> sequential
> >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> >>> controller
> >>> > > > only needs to watch the change path. Once a change is triggered,
> >>> the
> >>> > > > controller can read everything under /admin/reassignments/.
> >>> > > >
> >>> > > > Jun
> >>> > > >
> >>> > > >
> >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> t.j.bent...@gmail.com
> >>> >
> >>> > > wrote:
> >>> > > >
> >>> > > >> Hi,
> >>> > > >>
> >>> > > >> This is still very new, but I wanted some quick feedback on a
> >>> > > preliminary
> >>> > > >> KIP which could, I think, help with providing an AdminClient API
> >>> for
> >>> > > >> partition reassignment.
> >>> > > >>
> >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> >>> > > >> 3A+Interruptible+Partition+Reassignment
> >>> > > >>
> >>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient
> >>> API in
> >>> > > >> this
> >>> > > >> KIP (which would make it very big, and difficult to read), or
> >>> whether
> >>> > to
> >>> > > >> break it down into smaller KIPs (which makes it easier to read
> and
> >>> > > >> implement in pieces, but harder to get a high-level picture of
> the
> >>> > > >> ultimate
> >>> > > >> destination). For now I've gone for a very small initial KIP,
> but
> >>> I'm
> >>> > > >> happy
> >>> > > >> to sketch the bigger picture here if people are interested.
> >>> > > >>
> >>> > > >> Cheers,
> >>> > > >>
> >>> > > >> Tom
> >>> > > >>
> >>> > > >
> >>> > > >
> >>> > >
> >>> >
> >>> >
> >>> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote:
> >>> >
> >>> > > Another question is on the compatibility. Since now there are 2
> ways
> >>> of
> >>> > > specifying a partition reassignment, one under
> >>> /admin/reassign_partitions
> >>> > > and the other under /admin/reassignments, we probably want to
> >>> prevent the
> >>> > > same topic being reassigned under both paths at the same time?
> >>> > > Thanks,
> >>> > >
> >>> > > Jun
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io> wrote:
> >>> > >
> >>> > > > Hi, Tom,
> >>> > > >
> >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> points
> >>> in
> >>> > > > partition reassignment. Another issue that it also addresses is
> >>> the ZK
> >>> > > node
> >>> > > > size limit when writing the reassignment JSON.
> >>> > > >
> >>> > > > My only concern is that the KIP needs to create one watcher per
> >>> > > reassigned
> >>> > > > partition. This could add overhead in ZK and complexity for
> >>> debugging
> >>> > > when
> >>> > > > lots of partitions are being reassigned simultaneously. We could
> >>> > > > potentially improve this by introducing a separate ZK path for
> >>> change
> >>> > > > notification as we do for configs. For example, every time we
> >>> change
> >>> > the
> >>> > > > assignment for a set of partitions, we could further write a
> >>> sequential
> >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> >>> controller
> >>> > > > only needs to watch the change path. Once a change is triggered,
> >>> the
> >>> > > > controller can read everything under /admin/reassignments/.
> >>> > > >
> >>> > > > Jun
> >>> > > >
> >>> > > >
> >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> t.j.bent...@gmail.com
> >>> >
> >>> > > wrote:
> >>> > > >
> >>> > > >> Hi,
> >>> > > >>
> >>> > > >> This is still very new, but I wanted some quick feedback on a
> >>> > > preliminary
> >>> > > >> KIP which could, I think, help with providing an AdminClient API
> >>> for
> >>> > > >> partition reassignment.
> >>> > > >>
> >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> >>> > > >> 3A+Interruptible+Partition+Reassignment
> >>> > > >>
> >>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient
> >>> API in
> >>> > > >> this
> >>> > > >> KIP (which would make it very big, and difficult to read), or
> >>> whether
> >>> > to
> >>> > > >> break it down into smaller KIPs (which makes it easier to read
> and
> >>> > > >> implement in pieces, but harder to get a high-level picture of
> the
> >>> > > >> ultimate
> >>> > > >> destination). For now I've gone for a very small initial KIP,
> but
> >>> I'm
> >>> > > >> happy
> >>> > > >> to sketch the bigger picture here if people are interested.
> >>> > > >>
> >>> > > >> Cheers,
> >>> > > >>
> >>> > > >> Tom
> >>> > > >>
> >>> > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>

Reply via email to