Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Dong Lin Thu, 10 Aug 2017 12:44:41 -0700

Hey Jun,

This is a very good idea. I have updated the KIP-113 so that
DescribeDirResponse returns lag instead of LEO. If the replica is not a
temporary replica, then lag = max(0, HW - LEO). Otherwise, lag = primary
Replica's LEO - temporary Replica's LEO.


Thanks!
Dong

On Thu, Aug 10, 2017 at 10:21 AM, Jun Rao <j...@confluent.io> wrote:

> Hi, Tom, Dong,
>
> A couple of comments on that.
>
> 1. I think we can unify the reporting of lags. Basically, the lag will be
> reported on every replica (temporary or permanent), not just at the leader
> replica. If it's permanent, lag is max(0, HW - LEO) as it is now.
> Otherwise, lag is (LEO of permanent replica - LEO of temporary replica).
> That way, it seems that we can use a single request to monitor the progress
> of both inter and intra replica movement and it would be more accurate than
> relying on LEO directly.
>
> 2. Tom, note that currently, the LeaderAndIsrRequest doesn't specify the
> log dir. So, I am not sure in your new proposal, how the log dir info is
> communicated to all brokers. Is the broker receiving the
> ReassignPartitionsRequest
> going to forward that to all brokers?
>
> Thanks,
>
> Jun
>
>
>
> On Thu, Aug 10, 2017 at 6:57 AM, Tom Bentley <t.j.bent...@gmail.com>
> wrote:
>
> > I've spent some time thinking about KIP-179 and KIP-113, the proposed
> > algorithms and APIs, and trying to weigh the pros and cons of various
> > alternative options.
> >
> > I think Dong's reasons for the algorithm for inter-broker move in KIP-113
> > make a lot of sense. I don't think it would be at all simple to try to
> > change that algorithm to one where the whole thing can be triggered by a
> > single call to an AdminClient method. So I guess we should try to keep as
> > much of that algorithm as possible.
> >
> > KIP-179 will need to change this step
> >
> >  - The script creates reassignment znode in zookeeper.
> > >
> >
> > with an AdminClient API call. This call can the same one as currently
> > specified in KIP-179 -- reassignPartitions() -- except the argument needs
> > to take into account the need to pass log dirs as well as broker ids.
> Thus
> > I would suggest
> >
> >     ReassignPartitionsResult reassignPartitions(Map<TopicPartition,
> > List<ReplicAssignment>> assignments)
> >
> > where:
> >
> >     class ReplicAssignment {
> >         int broker()
> >         String logDirectory()// can be null
> >     }
> >
> > (This is just a Java representation of the reassignment json in KIP-113,
> > which itself is a superset of the reassignment json currently in use)
> >
> > The corresponding protocol would look like this:
> >
> >     ReassignPartitionsRequest => timeout validate_only log_dirs
> > [topic_assignment]
> >       timout => int32
> >       validate_only => boolean
> >       log_dirs => [string]
> >       topic_assignment => topic [partition_assignment]
> >         topic => string
> >         partition_assignment => partition [replica]
> >           partition => int32
> >           replica => broker log_dir_index
> >             broker => int32
> >             log_dir_index => int16
> >
> > The purpose of log_dirs is to serialize each log dir in the request only
> > once. These are then referred to by index in log_dir_index. The
> > log_dir_index can be -1, which means the caller doesn't care which log
> > directory should be used on the receiving broker.
> >
> > This request can be sent to *any* broker. The broker which receives a
> > ReassignPartitionsRequest essentially converts it into reassignment JSON
> > and writes that JSON to the znode, then returns a
> > ReassignPartitionsResponse:
> >
> >     ReassignPartitionsResponse => throttle_time_ms
> > [topic_assignment_result]
> >       throttle_time_ms => INT32
> >           log_dirs => [string]
> >       topic_assignment_result => topic partition_assignment_result
> >             topic => STRING
> >             partition_assignment_result => partition [replica_result]
> >               partition => int32
> >               replica_result => broker log_dir_index error_code
> > error_message
> >                 broker => int32
> >                 log_dir_index => int16
> >             error_code => INT16
> >                  error_message => NULLABLE_STRING
> >
> > This is using the same encoding scheme as wrt log_dirs as described
> above.
> >
> > Meanwhile the controller is notified by ZK of the change in value of the
> > znode and proceeds, as currently, by sending LeaderAndIsrRequest and
> > StopReplicaRequest in order to complete the reassignments.
> >
> > The remaining problem is around how to measure progress of reassignment.
> As
> > mentioned in the email I wrote this morning, I think we really need two
> > different lag calculations if we're using the lag to measure progress and
> > we want the property that lag=0 means reassignment has finished. The
> > problem with that, I now realise, is the script might be called with
> > reassignments which are a mix of:
> >
> > * inter-broker moves without a log dir (=> use HW-replicaLEO)
> > * inter-broker moves with a log dir (=> use HW-replicaLEO)
> > * intra-broker moves with a log dir (=> use .log_LEO - .move_LEO)
> >
> > And if there were two APIs we'd end up needing to make both kinds of
> query
> > to each broker in the cluster. This morning I said:
> >
> > But AFAICS this observation doesn't really help much in terms of the APIs
> > > concerned though. Since the requests would still need to go to
> different
> > > brokers depending on which kind of movement is being performed.
> > >
> >
> > But I wonder if that's *really* such a problem: In the case of an
> > inter-broker move we just need to ask the leader, and in the case of an
> > intra-broker move we just have to ask that broker. In generally we'd
> need a
> > single request to each broker in the cluster. Then each broker would need
> > to service that request, but presumably it's just pulling a number out
> of a
> > ConcurrentHashMap, which is updated by the replica movement code in each
> of
> > the two cases (inter-broker and intra-broker). WDYT?
> >
> > Assuming no one can see any glaring holes in what I'm proposing here, or
> > wants to suggest a workable alternative set of APIs and algorithms, then
> > I'll update KIP-179 to this effect.
> >
> > Thanks for taking the time to read this far!
> >
> > Tom
> >
> > On 10 August 2017 at 11:56, Tom Bentley <t.j.bent...@gmail.com> wrote:
> >
> > > Hi Dong and Jun,
> > >
> > > It seems that KIP-179 does not explicitly specify the definition of
> this
> > >> lag.
> > >
> > >
> > > Given that the definition of "caught up" is "is the replica in the
> ISR?",
> > > I found the code in Partition.maybeExpandIsr() which decides whether a
> > > replica should be added to the to the ISR and it uses
> > replica.logEndOffset.
> > > offsetDiff(leaderHW) >= 0, so for this purpose I would define the lag
> as
> > > max(leaderHW - replicaLEO, 0). I think we agree this should work for
> > > inter-broker movement, where the leader knows these quantities.
> > >
> > > As Dong says, this doesn't work for the intra-broker case:
> > >
> > > Note that we can not calculate lag as max(0, HW - LEO)
> > >> because we still need the difference between two lags to measure the
> > >> progress of intra-broker replica movement.
> > >>
> > >
> > > It seems to me that the intra-broker case is actually a special case of
> > > the inter-broker case. Conceptually with an intra-broker move the
> ".log"
> > > replica is the leader, the ".move" directory is the follower, the ISR
> is
> > > the singleton containing the leader, thus the HW if the LEO of the
> > ".log".
> > > Viewed in this way, Dong's method of leaderLEO - followerLEO is the
> same
> > > thing for the intra-broker case as HW-LEO is for the inter-broker case.
> > >
> > > But AFAICS this observation doesn't really help much in terms of the
> APIs
> > > concerned though. Since the requests would still need to go to
> different
> > > brokers depending on which kind of movement is being performed.
> > >
> > > So perhaps this is another case where maybe it makes sense to keep the
> > two
> > > APIs separate, one API for measuring inter-broker movement progress an
> > > another for the intra-broker case. WDYT?
> > >
> > > Thanks for the continuing discussion on this!
> > >
> > > Tom
> > >
> > >
> > > On 10 August 2017 at 05:28, Dong Lin <lindon...@gmail.com> wrote:
> > >
> > >> Hey Jun,
> > >>
> > >> I have been thinking about whether it is better to return lag (i.e.
> HW -
> > >> LEO) instead of LEO. Note that the lag in the DescribeDirsResponse may
> > be
> > >> negative if LEO > HW. It will almost always be negative for leader and
> > >> in-sync replicas. Note that we can not calculate lag as max(0, HW -
> LEO)
> > >> because we still need the difference between two lags to measure the
> > >> progress of intra-broker replica movement. The AdminClient API can
> > choose
> > >> to return max(0, HW - LEO) depending on whether it is used for
> tracking
> > >> progress of inter-broker reassignment or intra-broker movement. Is it
> > OK?
> > >> If so, I will update the KIP-113 accordingly to return lag in the
> > >> DescribeDirsResponse .
> > >>
> > >> Thanks,
> > >> Dong
> > >>
> > >>
> > >>
> > >> <https://www.avast.com/sig-email?utm_medium=email&utm_source
> > >> =link&utm_campaign=sig-email&utm_content=webmail&utm_term=icon>
> > >> Virus-free.
> > >> www.avast.com
> > >> <https://www.avast.com/sig-email?utm_medium=email&utm_source
> > >> =link&utm_campaign=sig-email&utm_content=webmail&utm_term=link>
> > >> <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
> > >>
> > >> On Wed, Aug 9, 2017 at 5:06 PM, Jun Rao <j...@confluent.io> wrote:
> > >>
> > >> > Hi, Dong,
> > >> >
> > >> > Yes, the lag in a replica is calculated as the difference of LEO of
> > the
> > >> > replica and the HW. So, as long as a replica is in sync, the lag is
> > >> always
> > >> > 0.
> > >> >
> > >> > So, I was suggesting to return lag instead of LEO in
> > >> DescribeDirsResponse
> > >> > for each replica. I am not sure if we need to return HW though.
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jun
> > >> >
> > >> > On Wed, Aug 9, 2017 at 5:01 PM, Dong Lin <lindon...@gmail.com>
> wrote:
> > >> >
> > >> > > Hey Jun,
> > >> > >
> > >> > > It just came to me that you may be assuming that folower_lag = HW
> -
> > >> > > follower_LEO. If that is the case, then we need to have new
> > >> > > request/response to retrieve this lag since the
> DescribeDirsResponse
> > >> > > doesn't even include HW. It seems that KIP-179 does not explicitly
> > >> > specify
> > >> > > the definition of this lag.
> > >> > >
> > >> > > I have been assuming that follow_lag = leader_LEO - follower_LEO
> > given
> > >> > that
> > >> > > the request is used to query the reassignment status. Strictly
> > >> speaking
> > >> > the
> > >> > > difference between leader_LEO and the HW is limited by the amount
> of
> > >> data
> > >> > > produced in KafkaConfig.replicaLagTimeMaxMs, which is 10
> seconds. I
> > >> also
> > >> > > assumed that 10 seconds is probably not a big deal given the
> typical
> > >> time
> > >> > > length of the reassignment.
> > >> > >
> > >> > > Thanks,
> > >> > > Dong
> > >> > >
> > >> > > On Wed, Aug 9, 2017 at 4:40 PM, Dong Lin <lindon...@gmail.com>
> > wrote:
> > >> > >
> > >> > > > Hey Jun,
> > >> > > >
> > >> > > > If I understand you right, you are suggesting that, in the case
> > when
> > >> > > there
> > >> > > > is continuous incoming traffic, the approach in the KIP-179 will
> > >> report
> > >> > > lag
> > >> > > > as 0 whereas the approach using DescribeDirsRequest will report
> > lag
> > >> as
> > >> > > > non-zero. But I think the approach in KIP-179 will also report
> > >> non-zero
> > >> > > lag
> > >> > > > when there is continuous traffic. This is because at the time
> the
> > >> > leader
> > >> > > > receives ReplicaStatusRequest, it is likely that some data has
> > been
> > >> > > > appended to the partition after the last FetchRequest from the
> > >> > follower.
> > >> > > > Does this make sense?
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Dong
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Wed, Aug 9, 2017 at 4:24 PM, Jun Rao <j...@confluent.io>
> wrote:
> > >> > > >
> > >> > > >> Hi, Dong,
> > >> > > >>
> > >> > > >> As for whether to return LEO or lag, my point was the
> following.
> > >> What
> > >> > > you
> > >> > > >> are concerned about is that an in-sync replica could become out
> > of
> > >> > sync
> > >> > > >> again. However, the more common case is that once a replica is
> > >> caught
> > >> > > up,
> > >> > > >> it will stay in sync afterwards. In that case, once the
> > >> reassignment
> > >> > > >> process completes, if we report based on lag, all lags will be
> 0.
> > >> If
> > >> > we
> > >> > > >> report based on Math.max(0, leaderLEO - followerLEO), the value
> > may
> > >> > not
> > >> > > be
> > >> > > >> 0 if there is continuous incoming traffic, which will be
> > confusing
> > >> to
> > >> > > the
> > >> > > >> user.
> > >> > > >>
> > >> > > >> Thanks,
> > >> > > >>
> > >> > > >> Jun
> > >> > > >>
> > >> > > >>
> > >> > > >>
> > >> > > >> On Tue, Aug 8, 2017 at 6:26 PM, Dong Lin <lindon...@gmail.com>
> > >> wrote:
> > >> > > >>
> > >> > > >> > Hey Jun,
> > >> > > >> >
> > >> > > >> > Thanks for the comment!
> > >> > > >> >
> > >> > > >> > Yes, it should work. The tool can send request to any broker
> > and
> > >> > > broker
> > >> > > >> can
> > >> > > >> > just write the reassignment znode. My previous intuition is
> > that
> > >> it
> > >> > > may
> > >> > > >> be
> > >> > > >> > better to only send this request to controller. But I don't
> > have
> > >> > good
> > >> > > >> > reasons for this restriction.
> > >> > > >> >
> > >> > > >> > My intuition is that we can keep them separate as well.
> Becket
> > >> and I
> > >> > > >> have
> > >> > > >> > discussed this both offline and in
> https://github.com/apache/
> > >> > > >> > kafka/pull/3621.
> > >> > > >> > Currently I don't have a strong opinion on this and I am open
> > to
> > >> > using
> > >> > > >> only
> > >> > > >> > one API to do both if someone can come up with a reasonable
> API
> > >> > > >> signature
> > >> > > >> > for this method. For now I have added the method
> > >> alterReplicaDir()
> > >> > in
> > >> > > >> > KafkaAdminClient instead of the AdminClient interface so that
> > the
> > >> > > >> > reassignment script can use this method without concluding
> what
> > >> the
> > >> > > API
> > >> > > >> > would look like in AdminClient in the future.
> > >> > > >> >
> > >> > > >> > Regarding DescribeDirsResponse, I think it is probably OK to
> > have
> > >> > > >> slightly
> > >> > > >> > more lag. The script can calculate the lag of the follower
> > >> replica
> > >> > as
> > >> > > >> > Math.max(0, leaderLEO - followerLEO). I agree that it will be
> > >> > slightly
> > >> > > >> less
> > >> > > >> > accurate than the current approach in KIP-179. But even with
> > the
> > >> > > current
> > >> > > >> > approach in KIP-179, the result provided by the script is an
> > >> > > >> approximation
> > >> > > >> > anyway, since there is delay from the time that leader
> returns
> > >> > > response
> > >> > > >> to
> > >> > > >> > the time that the script collects response from all brokers
> and
> > >> > prints
> > >> > > >> > result to user. I think if the slight difference in the
> > accuracy
> > >> > > between
> > >> > > >> > the two approaches does not make a difference to the intended
> > >> > use-case
> > >> > > >> of
> > >> > > >> > this API, then we probably want to re-use the exiting
> > >> > request/response
> > >> > > >> to
> > >> > > >> > keep the protocol simple.
> > >> > > >> >
> > >> > > >> > Thanks,
> > >> > > >> > Dong
> > >> > > >> >
> > >> > > >> >
> > >> > > >> >
> > >> > > >> >
> > >> > > >> >
> > >> > > >> > On Tue, Aug 8, 2017 at 5:56 PM, Jun Rao <j...@confluent.io>
> > >> wrote:
> > >> > > >> >
> > >> > > >> > > Hi, Dong,
> > >> > > >> > >
> > >> > > >> > > I think Tom was suggesting to have the AlterTopicsRequest
> > sent
> > >> to
> > >> > > any
> > >> > > >> > > broker, which just writes the reassignment json to ZK. The
> > >> > > controller
> > >> > > >> > will
> > >> > > >> > > pick up the reassignment and act on it as usual. This
> should
> > >> work,
> > >> > > >> right?
> > >> > > >> > >
> > >> > > >> > > Having a separate AlterTopicsRequest and
> > AlterReplicaDirRequest
> > >> > > seems
> > >> > > >> > > simpler to me. The former is handled by the controller and
> > the
> > >> > > latter
> > >> > > >> is
> > >> > > >> > > handled by the affected broker. They don't always have to
> be
> > >> done
> > >> > > >> > together.
> > >> > > >> > > Merging the two into a single request probably will make
> both
> > >> the
> > >> > > api
> > >> > > >> and
> > >> > > >> > > the implementation a bit more complicated. If we do keep
> the
> > >> two
> > >> > > >> separate
> > >> > > >> > > requests, it seems that we should just add
> > >> AlterReplicaDirRequest
> > >> > to
> > >> > > >> the
> > >> > > >> > > AdminClient interface?
> > >> > > >> > >
> > >> > > >> > > Now, regarding DescribeDirsResponse. I agree that it can be
> > >> used
> > >> > for
> > >> > > >> the
> > >> > > >> > > status reporting in KIP-179 as well. However, it seems that
> > >> > > reporting
> > >> > > >> the
> > >> > > >> > > log end offset of each replica may not be easy to use. The
> > log
> > >> end
> > >> > > >> offset
> > >> > > >> > > will be returned from different brokers in slightly
> different
> > >> > time.
> > >> > > If
> > >> > > >> > > there is continuous producing traffic, the difference in
> log
> > >> end
> > >> > > >> offset
> > >> > > >> > > between the leader and the follower could be larger than 0
> > >> even if
> > >> > > the
> > >> > > >> > > follower has fully caught up. I am wondering if it's better
> > to
> > >> > > instead
> > >> > > >> > > return the lag in offset per replica. This way, the status
> > can
> > >> > > >> probably
> > >> > > >> > be
> > >> > > >> > > reported more reliably.
> > >> > > >> > >
> > >> > > >> > > Thanks,
> > >> > > >> > >
> > >> > > >> > > Jun
> > >> > > >> > >
> > >> > > >> > > On Tue, Aug 8, 2017 at 11:23 AM, Dong Lin <
> > lindon...@gmail.com
> > >> >
> > >> > > >> wrote:
> > >> > > >> > >
> > >> > > >> > > > Hey Tom,
> > >> > > >> > > >
> > >> > > >> > > > Thanks for the quick reply. Please see my comment inline.
> > >> > > >> > > >
> > >> > > >> > > > On Tue, Aug 8, 2017 at 11:06 AM, Tom Bentley <
> > >> > > t.j.bent...@gmail.com
> > >> > > >> >
> > >> > > >> > > > wrote:
> > >> > > >> > > >
> > >> > > >> > > > > Hi Dong,
> > >> > > >> > > > >
> > >> > > >> > > > > Replies inline, as usual
> > >> > > >> > > > >
> > >> > > >> > > > > > As I originally envisaged it, KIP-179's support for
> > >> > > reassigning
> > >> > > >> > > > > partitions
> > >> > > >> > > > > >
> > >> > > >> > > > > > would have more-or-less taken the logic currently in
> > the
> > >> > > >> > > > > > > ReassignPartitionsCommand (that is, writing JSON to
> > the
> > >> > > >> > > > > > > ZkUtils.ReassignPartitionsPath)
> > >> > > >> > > > > > > and put it behind a suitable network protocol API.
> > >> Thus it
> > >> > > >> > wouldn't
> > >> > > >> > > > > > matter
> > >> > > >> > > > > > > which broker received the protocol call: It would
> be
> > >> acted
> > >> > > on
> > >> > > >> by
> > >> > > >> > > > > brokers
> > >> > > >> > > > > > > being notified of the change in the ZK path, just
> as
> > >> > > >> currently.
> > >> > > >> > > This
> > >> > > >> > > > > > would
> > >> > > >> > > > > > > have kept the ReassignPartitionsCommand relatively
> > >> simple,
> > >> > > as
> > >> > > >> it
> > >> > > >> > > > > > currently
> > >> > > >> > > > > > > is.
> > >> > > >> > > > > > >
> > >> > > >> > > > > >
> > >> > > >> > > > > > I am not sure I fully understand your proposal. I
> think
> > >> you
> > >> > > are
> > >> > > >> > > saying
> > >> > > >> > > > > that
> > >> > > >> > > > > > any broker can receive and handle the
> > AlterTopicRequest.
> > >> > > >> > > > >
> > >> > > >> > > > >
> > >> > > >> > > > > That's right.
> > >> > > >> > > > >
> > >> > > >> > > > >
> > >> > > >> > > > > > Let's say a
> > >> > > >> > > > > > non-controller broker received AlterTopicRequest, is
> > this
> > >> > > broker
> > >> > > >> > > going
> > >> > > >> > > > to
> > >> > > >> > > > > > send LeaderAndIsrRequest to other brokers? Or is this
> > >> broker
> > >> > > >> create
> > >> > > >> > > the
> > >> > > >> > > > > > reassignment znode in zookeper?
> > >> > > >> > > > >
> > >> > > >> > > > >
> > >> > > >> > > > > Exactly: It's going to write some JSON to the relevant
> > >> znode.
> > >> > > >> Other
> > >> > > >> > > > brokers
> > >> > > >> > > > > will get notified by zk when the contents of this znode
> > >> > changes,
> > >> > > >> and
> > >> > > >> > do
> > >> > > >> > > > as
> > >> > > >> > > > > they do now. This is what the tool/script does now.
> > >> > > >> > > > >
> > >> > > >> > > > > I will confess that I don't completely understand the
> > role
> > >> of
> > >> > > >> > > > > LeaderAndIsrRequest, since the current code just seems
> to
> > >> > write
> > >> > > to
> > >> > > >> > the
> > >> > > >> > > > > znode do get the brokers to do the reassignment. If you
> > >> could
> > >> > > >> explain
> > >> > > >> > > the
> > >> > > >> > > > > role of LeaderAndIsrRequest that would be great.
> > >> > > >> > > > >
> > >> > > >> > > >
> > >> > > >> > > > Currently only the controller will listen to the
> > reassignment
> > >> > > znode
> > >> > > >> and
> > >> > > >> > > > sends LeaderAndIsrRequest and StopReplicaRequest to
> brokers
> > >> in
> > >> > > >> order to
> > >> > > >> > > > complete reassignment. Brokers won't need to listen to
> > >> zookeeper
> > >> > > for
> > >> > > >> > any
> > >> > > >> > > > reassignment -- brokers only reacts to the request from
> > >> > > controller.
> > >> > > >> > > > Currently Kafka's design replies a lot on the controller
> to
> > >> > keep a
> > >> > > >> > > > consistent view of who are the leader of partitions and
> > what
> > >> is
> > >> > > the
> > >> > > >> ISR
> > >> > > >> > > > etc. It will be a pretty drastic change, if not
> impossible,
> > >> for
> > >> > > the
> > >> > > >> > > script
> > >> > > >> > > > to reassign partitions without going through controller.
> > >> > > >> > > >
> > >> > > >> > > > Thus I think it is likely that your AlterTopicsRequest
> can
> > >> only
> > >> > be
> > >> > > >> sent
> > >> > > >> > > to
> > >> > > >> > > > controller. Then the controller can create the
> reassignment
> > >> > znode
> > >> > > in
> > >> > > >> > > > zookeeper so that the information is persisted across
> > >> controller
> > >> > > >> fail
> > >> > > >> > > over.
> > >> > > >> > > > I haven't think through this in detail though.
> > >> > > >> > > >
> > >> > > >> > > >
> > >> > > >> > > >
> > >> > > >> > > > >
> > >> > > >> > > > >
> > >> > > >> > > > > > I may have missed it. But I couldn't find
> > >> > > >> > > > > > the explanation of AlterTopicRequest handling in
> > KIP-179.
> > >> > > >> > > > > >
> > >> > > >> > > > >
> > >> > > >> > > > > You're right, it doesn't go into that much detail. I
> will
> > >> fix
> > >> > > >> that.
> > >> > > >> > > > >
> > >> > > >> > > > >
> > >> > > >> > > > > > >
> > >> > > >> > > > > > > KIP-113 is obviously seeking to make more radical
> > >> changes.
> > >> > > The
> > >> > > >> > > > > algorithm
> > >> > > >> > > > > > > described for moving a replica to a particular
> > >> directory
> > >> > on
> > >> > > a
> > >> > > >> > > > different
> > >> > > >> > > > > > > broker (
> > >> > > >> > > > > > > https://cwiki.apache.org/
> > confluence/display/KAFKA/KIP-
> > >> > > >> > > > > > > 113%3A+Support+replicas+movement+between+log+
> > >> > > >> > directories#KIP-113:
> > >> > > >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2)
> > >> > > >> > > > > > > Howtoreassignreplicabetweenlog
> > directoriesacrossbrokers
> > >> > > >> > > > > > > <https://cwiki.apache.org/conf
> > >> luence/display/KAFKA/KIP-
> > >> > > >> > > > > > > 113%3A+Support+replicas+movement+between+log+
> > >> > > >> > directories#KIP-113:
> > >> > > >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2%
> > >> > > >> > > > > > > 29Howtoreassignreplicabetweenl
> > >> > ogdirectoriesacrossbrokers>)
> > >> > > >> > > > > > > involves both sending AlterReplicaDirRequest to
> "the"
> > >> > broker
> > >> > > >> (the
> > >> > > >> > > > > > receiving
> > >> > > >> > > > > > > broker, I assume, but it's not spelled out), _as
> well
> > >> as_
> > >> > > >> writing
> > >> > > >> > > to
> > >> > > >> > > > > the
> > >> > > >> > > > > > ZK
> > >> > > >> > > > > > > node.
> > >> > > >> > > > > > >
> > >> > > >> > > > > > > This assumes the script (ReassignPartitionsCommand)
> > has
> > >> > > direct
> > >> > > >> > > access
> > >> > > >> > > > > to
> > >> > > >> > > > > > > ZooKeeper, which is what KIP-179 is seeking to
> > >> deprecate.
> > >> > It
> > >> > > >> > seems
> > >> > > >> > > a
> > >> > > >> > > > > > waste
> > >> > > >> > > > > > > of time to put the logic in the script as part of
> > >> KIP-113,
> > >> > > >> only
> > >> > > >> > for
> > >> > > >> > > > > > KIP-179
> > >> > > >> > > > > > > to have to move it back to the controller.
> > >> > > >> > > > > > >
> > >> > > >> > > > > >
> > >> > > >> > > > > > I am not sure I understand what you mean by "It
> seems a
> > >> > waste
> > >> > > of
> > >> > > >> > time
> > >> > > >> > > > to
> > >> > > >> > > > > > put the logic in the script as part of KIP-113, only
> > for
> > >> > > >> KIP-179 to
> > >> > > >> > > > have
> > >> > > >> > > > > to
> > >> > > >> > > > > > move it back to the controller".
> > >> > > >> > > > >
> > >> > > >> > > > >
> > >> > > >> > > > > Sorry, I misunderstood slightly what you were proposing
> > in
> > >> > > >> KIP-113,
> > >> > > >> > so
> > >> > > >> > > > the
> > >> > > >> > > > > "waste of time" comment isn't quite right, but I'm
> still
> > >> not
> > >> > > >> > convinced
> > >> > > >> > > > that
> > >> > > >> > > > > KIP-113+KIP-179 (in its current form) ends with a
> > >> satisfactory
> > >> > > >> > result.
> > >> > > >> > > > >
> > >> > > >> > > > > Let me elaborate... KIP-113 says that to support
> > >> reassigning
> > >> > > >> replica
> > >> > > >> > > > > between log directories across brokers:
> > >> > > >> > > > > * ...
> > >> > > >> > > > > * The script sends AlterReplicaDirRequest to those
> > brokers
> > >> > which
> > >> > > >> need
> > >> > > >> > > to
> > >> > > >> > > > > move replicas...
> > >> > > >> > > > > * The script creates reassignment znode in zookeeper.
> > >> > > >> > > > > * The script retries AlterReplicaDirRequest to those
> > >> broker...
> > >> > > >> > > > > * ...
> > >> > > >> > > > >
> > >> > > >> > > > > So the ReassignPartitionsCommand still talks to ZK
> > >> directly,
> > >> > but
> > >> > > >> now
> > >> > > >> > > it's
> > >> > > >> > > > > bracketed by calls to the AdminClient. KIP-179 could
> > >> replace
> > >> > > that
> > >> > > >> > > talking
> > >> > > >> > > > > to ZK directly with a new call to the AdminClient. But
> > then
> > >> > > we've
> > >> > > >> > got a
> > >> > > >> > > > > pretty weird API, where we have to make three
> AdminClient
> > >> > calls
> > >> > > >> (two
> > >> > > >> > of
> > >> > > >> > > > > them to the same method), to move a replica. I don't
> > really
> > >> > > >> > understand
> > >> > > >> > > > why
> > >> > > >> > > > > the admin client can't present a single API method to
> > >> achieve
> > >> > > >> this,
> > >> > > >> > and
> > >> > > >> > > > > encapsulate on the server side the careful sequence of
> > >> events
> > >> > > >> > necessary
> > >> > > >> > > > to
> > >> > > >> > > > > coordinate the movement. I understood this position is
> > what
> > >> > > Ismael
> > >> > > >> > was
> > >> > > >> > > > > advocating when he said it was better to put the logic
> in
> > >> the
> > >> > > >> > > controller
> > >> > > >> > > > > than spread between the script and the controller. But
> > >> maybe I
> > >> > > >> > > > > misunderstood him.
> > >> > > >> > > > >
> > >> > > >> > > >
> > >> > > >> > > > I have some concern with putting this logic in controller
> > >> which
> > >> > > can
> > >> > > >> be
> > >> > > >> > > > found in my previous email. Before that is addressed, the
> > >> script
> > >> > > (or
> > >> > > >> > > > AdminClient) seems to be the simplest place to have this
> > >> logic.
> > >> > > >> > > >
> > >> > > >> > > > I agree it is better to have a single API to achieve both
> > >> > > partition
> > >> > > >> and
> > >> > > >> > > > replica -> dir assignment. I think it is likely that we
> > will
> > >> > find
> > >> > > a
> > >> > > >> > good
> > >> > > >> > > > API to do both. I have updated the KIP-113 to remove API
> > >> > > >> > alterReplicaDir
> > >> > > >> > > > from AdminClient interface.
> > >> > > >> > > >
> > >> > > >> > > >
> > >> > > >> > > > >
> > >> > > >> > > > >
> > >> > > >> > > > > > I assume that the logic you mentioned is
> > >> > > >> > > > > > "movement of replica to the specified log directory".
> > >> This
> > >> > > logic
> > >> > > >> > (or
> > >> > > >> > > > the
> > >> > > >> > > > > > implementation of this logic) resides mainly in the
> > >> > > >> > KafkaAdminClient
> > >> > > >> > > > and
> > >> > > >> > > > > > broker. The script only needs to parse the json file
> as
> > >> > > >> appropriate
> > >> > > >> > > and
> > >> > > >> > > > > > call the new API in AdminClient as appropriate. The
> > >> logic in
> > >> > > the
> > >> > > >> > > script
> > >> > > >> > > > > is
> > >> > > >> > > > > > therefore not much and can be easily moved to other
> > >> classes
> > >> > if
> > >> > > >> > > needed.
> > >> > > >> > > > > >
> > >> > > >> > > > > > Can you clarify why this logic, i.e. movement of
> > replica
> > >> to
> > >> > > the
> > >> > > >> > > > specified
> > >> > > >> > > > > > log directory, needs to be moved to controller in
> > >> KIP-179? I
> > >> > > >> think
> > >> > > >> > it
> > >> > > >> > > > can
> > >> > > >> > > > > > still be done in the script and controller should not
> > >> need
> > >> > to
> > >> > > >> worry
> > >> > > >> > > > about
> > >> > > >> > > > > > log directory of any replica.
> > >> > > >> > > > > >
> > >> > > >> > > > > > Thanks,
> > >> > > >> > > > > > Dong
> > >> > > >> > > > > >
> > >> > > >> > > > >
> > >> > > >> > > >
> > >> > > >> > >
> > >> > > >> >
> > >> > > >>
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Reply via email to