Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Jun Rao Thu, 10 Aug 2017 10:22:30 -0700

Hi, Tom, Dong,

A couple of comments on that.


1. I think we can unify the reporting of lags. Basically, the lag will be
reported on every replica (temporary or permanent), not just at the leader
replica. If it's permanent, lag is max(0, HW - LEO) as it is now.
Otherwise, lag is (LEO of permanent replica - LEO of temporary replica).
That way, it seems that we can use a single request to monitor the progress
of both inter and intra replica movement and it would be more accurate than
relying on LEO directly.

2. Tom, note that currently, the LeaderAndIsrRequest doesn't specify the
log dir. So, I am not sure in your new proposal, how the log dir info is
communicated to all brokers. Is the broker receiving the
ReassignPartitionsRequest
going to forward that to all brokers?

Thanks,

Jun



On Thu, Aug 10, 2017 at 6:57 AM, Tom Bentley <[email protected]> wrote:

> I've spent some time thinking about KIP-179 and KIP-113, the proposed
> algorithms and APIs, and trying to weigh the pros and cons of various
> alternative options.
>
> I think Dong's reasons for the algorithm for inter-broker move in KIP-113
> make a lot of sense. I don't think it would be at all simple to try to
> change that algorithm to one where the whole thing can be triggered by a
> single call to an AdminClient method. So I guess we should try to keep as
> much of that algorithm as possible.
>
> KIP-179 will need to change this step
>
>  - The script creates reassignment znode in zookeeper.
> >
>
> with an AdminClient API call. This call can the same one as currently
> specified in KIP-179 -- reassignPartitions() -- except the argument needs
> to take into account the need to pass log dirs as well as broker ids. Thus
> I would suggest
>
>     ReassignPartitionsResult reassignPartitions(Map<TopicPartition,
> List<ReplicAssignment>> assignments)
>
> where:
>
>     class ReplicAssignment {
>         int broker()
>         String logDirectory()// can be null
>     }
>
> (This is just a Java representation of the reassignment json in KIP-113,
> which itself is a superset of the reassignment json currently in use)
>
> The corresponding protocol would look like this:
>
>     ReassignPartitionsRequest => timeout validate_only log_dirs
> [topic_assignment]
>       timout => int32
>       validate_only => boolean
>       log_dirs => [string]
>       topic_assignment => topic [partition_assignment]
>         topic => string
>         partition_assignment => partition [replica]
>           partition => int32
>           replica => broker log_dir_index
>             broker => int32
>             log_dir_index => int16
>
> The purpose of log_dirs is to serialize each log dir in the request only
> once. These are then referred to by index in log_dir_index. The
> log_dir_index can be -1, which means the caller doesn't care which log
> directory should be used on the receiving broker.
>
> This request can be sent to *any* broker. The broker which receives a
> ReassignPartitionsRequest essentially converts it into reassignment JSON
> and writes that JSON to the znode, then returns a
> ReassignPartitionsResponse:
>
>     ReassignPartitionsResponse => throttle_time_ms
> [topic_assignment_result]
>       throttle_time_ms => INT32
>           log_dirs => [string]
>       topic_assignment_result => topic partition_assignment_result
>             topic => STRING
>             partition_assignment_result => partition [replica_result]
>               partition => int32
>               replica_result => broker log_dir_index error_code
> error_message
>                 broker => int32
>                 log_dir_index => int16
>             error_code => INT16
>                  error_message => NULLABLE_STRING
>
> This is using the same encoding scheme as wrt log_dirs as described above.
>
> Meanwhile the controller is notified by ZK of the change in value of the
> znode and proceeds, as currently, by sending LeaderAndIsrRequest and
> StopReplicaRequest in order to complete the reassignments.
>
> The remaining problem is around how to measure progress of reassignment. As
> mentioned in the email I wrote this morning, I think we really need two
> different lag calculations if we're using the lag to measure progress and
> we want the property that lag=0 means reassignment has finished. The
> problem with that, I now realise, is the script might be called with
> reassignments which are a mix of:
>
> * inter-broker moves without a log dir (=> use HW-replicaLEO)
> * inter-broker moves with a log dir (=> use HW-replicaLEO)
> * intra-broker moves with a log dir (=> use .log_LEO - .move_LEO)
>
> And if there were two APIs we'd end up needing to make both kinds of query
> to each broker in the cluster. This morning I said:
>
> But AFAICS this observation doesn't really help much in terms of the APIs
> > concerned though. Since the requests would still need to go to different
> > brokers depending on which kind of movement is being performed.
> >
>
> But I wonder if that's *really* such a problem: In the case of an
> inter-broker move we just need to ask the leader, and in the case of an
> intra-broker move we just have to ask that broker. In generally we'd need a
> single request to each broker in the cluster. Then each broker would need
> to service that request, but presumably it's just pulling a number out of a
> ConcurrentHashMap, which is updated by the replica movement code in each of
> the two cases (inter-broker and intra-broker). WDYT?
>
> Assuming no one can see any glaring holes in what I'm proposing here, or
> wants to suggest a workable alternative set of APIs and algorithms, then
> I'll update KIP-179 to this effect.
>
> Thanks for taking the time to read this far!
>
> Tom
>
> On 10 August 2017 at 11:56, Tom Bentley <[email protected]> wrote:
>
> > Hi Dong and Jun,
> >
> > It seems that KIP-179 does not explicitly specify the definition of this
> >> lag.
> >
> >
> > Given that the definition of "caught up" is "is the replica in the ISR?",
> > I found the code in Partition.maybeExpandIsr() which decides whether a
> > replica should be added to the to the ISR and it uses
> replica.logEndOffset.
> > offsetDiff(leaderHW) >= 0, so for this purpose I would define the lag as
> > max(leaderHW - replicaLEO, 0). I think we agree this should work for
> > inter-broker movement, where the leader knows these quantities.
> >
> > As Dong says, this doesn't work for the intra-broker case:
> >
> > Note that we can not calculate lag as max(0, HW - LEO)
> >> because we still need the difference between two lags to measure the
> >> progress of intra-broker replica movement.
> >>
> >
> > It seems to me that the intra-broker case is actually a special case of
> > the inter-broker case. Conceptually with an intra-broker move the ".log"
> > replica is the leader, the ".move" directory is the follower, the ISR is
> > the singleton containing the leader, thus the HW if the LEO of the
> ".log".
> > Viewed in this way, Dong's method of leaderLEO - followerLEO is the same
> > thing for the intra-broker case as HW-LEO is for the inter-broker case.
> >
> > But AFAICS this observation doesn't really help much in terms of the APIs
> > concerned though. Since the requests would still need to go to different
> > brokers depending on which kind of movement is being performed.
> >
> > So perhaps this is another case where maybe it makes sense to keep the
> two
> > APIs separate, one API for measuring inter-broker movement progress an
> > another for the intra-broker case. WDYT?
> >
> > Thanks for the continuing discussion on this!
> >
> > Tom
> >
> >
> > On 10 August 2017 at 05:28, Dong Lin <[email protected]> wrote:
> >
> >> Hey Jun,
> >>
> >> I have been thinking about whether it is better to return lag (i.e. HW -
> >> LEO) instead of LEO. Note that the lag in the DescribeDirsResponse may
> be
> >> negative if LEO > HW. It will almost always be negative for leader and
> >> in-sync replicas. Note that we can not calculate lag as max(0, HW - LEO)
> >> because we still need the difference between two lags to measure the
> >> progress of intra-broker replica movement. The AdminClient API can
> choose
> >> to return max(0, HW - LEO) depending on whether it is used for tracking
> >> progress of inter-broker reassignment or intra-broker movement. Is it
> OK?
> >> If so, I will update the KIP-113 accordingly to return lag in the
> >> DescribeDirsResponse .
> >>
> >> Thanks,
> >> Dong
> >>
> >>
> >>
> >> <https://www.avast.com/sig-email?utm_medium=email&utm_source
> >> =link&utm_campaign=sig-email&utm_content=webmail&utm_term=icon>
> >> Virus-free.
> >> www.avast.com
> >> <https://www.avast.com/sig-email?utm_medium=email&utm_source
> >> =link&utm_campaign=sig-email&utm_content=webmail&utm_term=link>
> >> <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
> >>
> >> On Wed, Aug 9, 2017 at 5:06 PM, Jun Rao <[email protected]> wrote:
> >>
> >> > Hi, Dong,
> >> >
> >> > Yes, the lag in a replica is calculated as the difference of LEO of
> the
> >> > replica and the HW. So, as long as a replica is in sync, the lag is
> >> always
> >> > 0.
> >> >
> >> > So, I was suggesting to return lag instead of LEO in
> >> DescribeDirsResponse
> >> > for each replica. I am not sure if we need to return HW though.
> >> >
> >> > Thanks,
> >> >
> >> > Jun
> >> >
> >> > On Wed, Aug 9, 2017 at 5:01 PM, Dong Lin <[email protected]> wrote:
> >> >
> >> > > Hey Jun,
> >> > >
> >> > > It just came to me that you may be assuming that folower_lag = HW -
> >> > > follower_LEO. If that is the case, then we need to have new
> >> > > request/response to retrieve this lag since the DescribeDirsResponse
> >> > > doesn't even include HW. It seems that KIP-179 does not explicitly
> >> > specify
> >> > > the definition of this lag.
> >> > >
> >> > > I have been assuming that follow_lag = leader_LEO - follower_LEO
> given
> >> > that
> >> > > the request is used to query the reassignment status. Strictly
> >> speaking
> >> > the
> >> > > difference between leader_LEO and the HW is limited by the amount of
> >> data
> >> > > produced in KafkaConfig.replicaLagTimeMaxMs, which is 10 seconds. I
> >> also
> >> > > assumed that 10 seconds is probably not a big deal given the typical
> >> time
> >> > > length of the reassignment.
> >> > >
> >> > > Thanks,
> >> > > Dong
> >> > >
> >> > > On Wed, Aug 9, 2017 at 4:40 PM, Dong Lin <[email protected]>
> wrote:
> >> > >
> >> > > > Hey Jun,
> >> > > >
> >> > > > If I understand you right, you are suggesting that, in the case
> when
> >> > > there
> >> > > > is continuous incoming traffic, the approach in the KIP-179 will
> >> report
> >> > > lag
> >> > > > as 0 whereas the approach using DescribeDirsRequest will report
> lag
> >> as
> >> > > > non-zero. But I think the approach in KIP-179 will also report
> >> non-zero
> >> > > lag
> >> > > > when there is continuous traffic. This is because at the time the
> >> > leader
> >> > > > receives ReplicaStatusRequest, it is likely that some data has
> been
> >> > > > appended to the partition after the last FetchRequest from the
> >> > follower.
> >> > > > Does this make sense?
> >> > > >
> >> > > > Thanks,
> >> > > > Dong
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Wed, Aug 9, 2017 at 4:24 PM, Jun Rao <[email protected]> wrote:
> >> > > >
> >> > > >> Hi, Dong,
> >> > > >>
> >> > > >> As for whether to return LEO or lag, my point was the following.
> >> What
> >> > > you
> >> > > >> are concerned about is that an in-sync replica could become out
> of
> >> > sync
> >> > > >> again. However, the more common case is that once a replica is
> >> caught
> >> > > up,
> >> > > >> it will stay in sync afterwards. In that case, once the
> >> reassignment
> >> > > >> process completes, if we report based on lag, all lags will be 0.
> >> If
> >> > we
> >> > > >> report based on Math.max(0, leaderLEO - followerLEO), the value
> may
> >> > not
> >> > > be
> >> > > >> 0 if there is continuous incoming traffic, which will be
> confusing
> >> to
> >> > > the
> >> > > >> user.
> >> > > >>
> >> > > >> Thanks,
> >> > > >>
> >> > > >> Jun
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> On Tue, Aug 8, 2017 at 6:26 PM, Dong Lin <[email protected]>
> >> wrote:
> >> > > >>
> >> > > >> > Hey Jun,
> >> > > >> >
> >> > > >> > Thanks for the comment!
> >> > > >> >
> >> > > >> > Yes, it should work. The tool can send request to any broker
> and
> >> > > broker
> >> > > >> can
> >> > > >> > just write the reassignment znode. My previous intuition is
> that
> >> it
> >> > > may
> >> > > >> be
> >> > > >> > better to only send this request to controller. But I don't
> have
> >> > good
> >> > > >> > reasons for this restriction.
> >> > > >> >
> >> > > >> > My intuition is that we can keep them separate as well. Becket
> >> and I
> >> > > >> have
> >> > > >> > discussed this both offline and in https://github.com/apache/
> >> > > >> > kafka/pull/3621.
> >> > > >> > Currently I don't have a strong opinion on this and I am open
> to
> >> > using
> >> > > >> only
> >> > > >> > one API to do both if someone can come up with a reasonable API
> >> > > >> signature
> >> > > >> > for this method. For now I have added the method
> >> alterReplicaDir()
> >> > in
> >> > > >> > KafkaAdminClient instead of the AdminClient interface so that
> the
> >> > > >> > reassignment script can use this method without concluding what
> >> the
> >> > > API
> >> > > >> > would look like in AdminClient in the future.
> >> > > >> >
> >> > > >> > Regarding DescribeDirsResponse, I think it is probably OK to
> have
> >> > > >> slightly
> >> > > >> > more lag. The script can calculate the lag of the follower
> >> replica
> >> > as
> >> > > >> > Math.max(0, leaderLEO - followerLEO). I agree that it will be
> >> > slightly
> >> > > >> less
> >> > > >> > accurate than the current approach in KIP-179. But even with
> the
> >> > > current
> >> > > >> > approach in KIP-179, the result provided by the script is an
> >> > > >> approximation
> >> > > >> > anyway, since there is delay from the time that leader returns
> >> > > response
> >> > > >> to
> >> > > >> > the time that the script collects response from all brokers and
> >> > prints
> >> > > >> > result to user. I think if the slight difference in the
> accuracy
> >> > > between
> >> > > >> > the two approaches does not make a difference to the intended
> >> > use-case
> >> > > >> of
> >> > > >> > this API, then we probably want to re-use the exiting
> >> > request/response
> >> > > >> to
> >> > > >> > keep the protocol simple.
> >> > > >> >
> >> > > >> > Thanks,
> >> > > >> > Dong
> >> > > >> >
> >> > > >> >
> >> > > >> >
> >> > > >> >
> >> > > >> >
> >> > > >> > On Tue, Aug 8, 2017 at 5:56 PM, Jun Rao <[email protected]>
> >> wrote:
> >> > > >> >
> >> > > >> > > Hi, Dong,
> >> > > >> > >
> >> > > >> > > I think Tom was suggesting to have the AlterTopicsRequest
> sent
> >> to
> >> > > any
> >> > > >> > > broker, which just writes the reassignment json to ZK. The
> >> > > controller
> >> > > >> > will
> >> > > >> > > pick up the reassignment and act on it as usual. This should
> >> work,
> >> > > >> right?
> >> > > >> > >
> >> > > >> > > Having a separate AlterTopicsRequest and
> AlterReplicaDirRequest
> >> > > seems
> >> > > >> > > simpler to me. The former is handled by the controller and
> the
> >> > > latter
> >> > > >> is
> >> > > >> > > handled by the affected broker. They don't always have to be
> >> done
> >> > > >> > together.
> >> > > >> > > Merging the two into a single request probably will make both
> >> the
> >> > > api
> >> > > >> and
> >> > > >> > > the implementation a bit more complicated. If we do keep the
> >> two
> >> > > >> separate
> >> > > >> > > requests, it seems that we should just add
> >> AlterReplicaDirRequest
> >> > to
> >> > > >> the
> >> > > >> > > AdminClient interface?
> >> > > >> > >
> >> > > >> > > Now, regarding DescribeDirsResponse. I agree that it can be
> >> used
> >> > for
> >> > > >> the
> >> > > >> > > status reporting in KIP-179 as well. However, it seems that
> >> > > reporting
> >> > > >> the
> >> > > >> > > log end offset of each replica may not be easy to use. The
> log
> >> end
> >> > > >> offset
> >> > > >> > > will be returned from different brokers in slightly different
> >> > time.
> >> > > If
> >> > > >> > > there is continuous producing traffic, the difference in log
> >> end
> >> > > >> offset
> >> > > >> > > between the leader and the follower could be larger than 0
> >> even if
> >> > > the
> >> > > >> > > follower has fully caught up. I am wondering if it's better
> to
> >> > > instead
> >> > > >> > > return the lag in offset per replica. This way, the status
> can
> >> > > >> probably
> >> > > >> > be
> >> > > >> > > reported more reliably.
> >> > > >> > >
> >> > > >> > > Thanks,
> >> > > >> > >
> >> > > >> > > Jun
> >> > > >> > >
> >> > > >> > > On Tue, Aug 8, 2017 at 11:23 AM, Dong Lin <
> [email protected]
> >> >
> >> > > >> wrote:
> >> > > >> > >
> >> > > >> > > > Hey Tom,
> >> > > >> > > >
> >> > > >> > > > Thanks for the quick reply. Please see my comment inline.
> >> > > >> > > >
> >> > > >> > > > On Tue, Aug 8, 2017 at 11:06 AM, Tom Bentley <
> >> > > [email protected]
> >> > > >> >
> >> > > >> > > > wrote:
> >> > > >> > > >
> >> > > >> > > > > Hi Dong,
> >> > > >> > > > >
> >> > > >> > > > > Replies inline, as usual
> >> > > >> > > > >
> >> > > >> > > > > > As I originally envisaged it, KIP-179's support for
> >> > > reassigning
> >> > > >> > > > > partitions
> >> > > >> > > > > >
> >> > > >> > > > > > would have more-or-less taken the logic currently in
> the
> >> > > >> > > > > > > ReassignPartitionsCommand (that is, writing JSON to
> the
> >> > > >> > > > > > > ZkUtils.ReassignPartitionsPath)
> >> > > >> > > > > > > and put it behind a suitable network protocol API.
> >> Thus it
> >> > > >> > wouldn't
> >> > > >> > > > > > matter
> >> > > >> > > > > > > which broker received the protocol call: It would be
> >> acted
> >> > > on
> >> > > >> by
> >> > > >> > > > > brokers
> >> > > >> > > > > > > being notified of the change in the ZK path, just as
> >> > > >> currently.
> >> > > >> > > This
> >> > > >> > > > > > would
> >> > > >> > > > > > > have kept the ReassignPartitionsCommand relatively
> >> simple,
> >> > > as
> >> > > >> it
> >> > > >> > > > > > currently
> >> > > >> > > > > > > is.
> >> > > >> > > > > > >
> >> > > >> > > > > >
> >> > > >> > > > > > I am not sure I fully understand your proposal. I think
> >> you
> >> > > are
> >> > > >> > > saying
> >> > > >> > > > > that
> >> > > >> > > > > > any broker can receive and handle the
> AlterTopicRequest.
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > > That's right.
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > > > Let's say a
> >> > > >> > > > > > non-controller broker received AlterTopicRequest, is
> this
> >> > > broker
> >> > > >> > > going
> >> > > >> > > > to
> >> > > >> > > > > > send LeaderAndIsrRequest to other brokers? Or is this
> >> broker
> >> > > >> create
> >> > > >> > > the
> >> > > >> > > > > > reassignment znode in zookeper?
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > > Exactly: It's going to write some JSON to the relevant
> >> znode.
> >> > > >> Other
> >> > > >> > > > brokers
> >> > > >> > > > > will get notified by zk when the contents of this znode
> >> > changes,
> >> > > >> and
> >> > > >> > do
> >> > > >> > > > as
> >> > > >> > > > > they do now. This is what the tool/script does now.
> >> > > >> > > > >
> >> > > >> > > > > I will confess that I don't completely understand the
> role
> >> of
> >> > > >> > > > > LeaderAndIsrRequest, since the current code just seems to
> >> > write
> >> > > to
> >> > > >> > the
> >> > > >> > > > > znode do get the brokers to do the reassignment. If you
> >> could
> >> > > >> explain
> >> > > >> > > the
> >> > > >> > > > > role of LeaderAndIsrRequest that would be great.
> >> > > >> > > > >
> >> > > >> > > >
> >> > > >> > > > Currently only the controller will listen to the
> reassignment
> >> > > znode
> >> > > >> and
> >> > > >> > > > sends LeaderAndIsrRequest and StopReplicaRequest to brokers
> >> in
> >> > > >> order to
> >> > > >> > > > complete reassignment. Brokers won't need to listen to
> >> zookeeper
> >> > > for
> >> > > >> > any
> >> > > >> > > > reassignment -- brokers only reacts to the request from
> >> > > controller.
> >> > > >> > > > Currently Kafka's design replies a lot on the controller to
> >> > keep a
> >> > > >> > > > consistent view of who are the leader of partitions and
> what
> >> is
> >> > > the
> >> > > >> ISR
> >> > > >> > > > etc. It will be a pretty drastic change, if not impossible,
> >> for
> >> > > the
> >> > > >> > > script
> >> > > >> > > > to reassign partitions without going through controller.
> >> > > >> > > >
> >> > > >> > > > Thus I think it is likely that your AlterTopicsRequest can
> >> only
> >> > be
> >> > > >> sent
> >> > > >> > > to
> >> > > >> > > > controller. Then the controller can create the reassignment
> >> > znode
> >> > > in
> >> > > >> > > > zookeeper so that the information is persisted across
> >> controller
> >> > > >> fail
> >> > > >> > > over.
> >> > > >> > > > I haven't think through this in detail though.
> >> > > >> > > >
> >> > > >> > > >
> >> > > >> > > >
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > > > I may have missed it. But I couldn't find
> >> > > >> > > > > > the explanation of AlterTopicRequest handling in
> KIP-179.
> >> > > >> > > > > >
> >> > > >> > > > >
> >> > > >> > > > > You're right, it doesn't go into that much detail. I will
> >> fix
> >> > > >> that.
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > > > >
> >> > > >> > > > > > > KIP-113 is obviously seeking to make more radical
> >> changes.
> >> > > The
> >> > > >> > > > > algorithm
> >> > > >> > > > > > > described for moving a replica to a particular
> >> directory
> >> > on
> >> > > a
> >> > > >> > > > different
> >> > > >> > > > > > > broker (
> >> > > >> > > > > > > https://cwiki.apache.org/
> confluence/display/KAFKA/KIP-
> >> > > >> > > > > > > 113%3A+Support+replicas+movement+between+log+
> >> > > >> > directories#KIP-113:
> >> > > >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2)
> >> > > >> > > > > > > Howtoreassignreplicabetweenlog
> directoriesacrossbrokers
> >> > > >> > > > > > > <https://cwiki.apache.org/conf
> >> luence/display/KAFKA/KIP-
> >> > > >> > > > > > > 113%3A+Support+replicas+movement+between+log+
> >> > > >> > directories#KIP-113:
> >> > > >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2%
> >> > > >> > > > > > > 29Howtoreassignreplicabetweenl
> >> > ogdirectoriesacrossbrokers>)
> >> > > >> > > > > > > involves both sending AlterReplicaDirRequest to "the"
> >> > broker
> >> > > >> (the
> >> > > >> > > > > > receiving
> >> > > >> > > > > > > broker, I assume, but it's not spelled out), _as well
> >> as_
> >> > > >> writing
> >> > > >> > > to
> >> > > >> > > > > the
> >> > > >> > > > > > ZK
> >> > > >> > > > > > > node.
> >> > > >> > > > > > >
> >> > > >> > > > > > > This assumes the script (ReassignPartitionsCommand)
> has
> >> > > direct
> >> > > >> > > access
> >> > > >> > > > > to
> >> > > >> > > > > > > ZooKeeper, which is what KIP-179 is seeking to
> >> deprecate.
> >> > It
> >> > > >> > seems
> >> > > >> > > a
> >> > > >> > > > > > waste
> >> > > >> > > > > > > of time to put the logic in the script as part of
> >> KIP-113,
> >> > > >> only
> >> > > >> > for
> >> > > >> > > > > > KIP-179
> >> > > >> > > > > > > to have to move it back to the controller.
> >> > > >> > > > > > >
> >> > > >> > > > > >
> >> > > >> > > > > > I am not sure I understand what you mean by "It seems a
> >> > waste
> >> > > of
> >> > > >> > time
> >> > > >> > > > to
> >> > > >> > > > > > put the logic in the script as part of KIP-113, only
> for
> >> > > >> KIP-179 to
> >> > > >> > > > have
> >> > > >> > > > > to
> >> > > >> > > > > > move it back to the controller".
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > > Sorry, I misunderstood slightly what you were proposing
> in
> >> > > >> KIP-113,
> >> > > >> > so
> >> > > >> > > > the
> >> > > >> > > > > "waste of time" comment isn't quite right, but I'm still
> >> not
> >> > > >> > convinced
> >> > > >> > > > that
> >> > > >> > > > > KIP-113+KIP-179 (in its current form) ends with a
> >> satisfactory
> >> > > >> > result.
> >> > > >> > > > >
> >> > > >> > > > > Let me elaborate... KIP-113 says that to support
> >> reassigning
> >> > > >> replica
> >> > > >> > > > > between log directories across brokers:
> >> > > >> > > > > * ...
> >> > > >> > > > > * The script sends AlterReplicaDirRequest to those
> brokers
> >> > which
> >> > > >> need
> >> > > >> > > to
> >> > > >> > > > > move replicas...
> >> > > >> > > > > * The script creates reassignment znode in zookeeper.
> >> > > >> > > > > * The script retries AlterReplicaDirRequest to those
> >> broker...
> >> > > >> > > > > * ...
> >> > > >> > > > >
> >> > > >> > > > > So the ReassignPartitionsCommand still talks to ZK
> >> directly,
> >> > but
> >> > > >> now
> >> > > >> > > it's
> >> > > >> > > > > bracketed by calls to the AdminClient. KIP-179 could
> >> replace
> >> > > that
> >> > > >> > > talking
> >> > > >> > > > > to ZK directly with a new call to the AdminClient. But
> then
> >> > > we've
> >> > > >> > got a
> >> > > >> > > > > pretty weird API, where we have to make three AdminClient
> >> > calls
> >> > > >> (two
> >> > > >> > of
> >> > > >> > > > > them to the same method), to move a replica. I don't
> really
> >> > > >> > understand
> >> > > >> > > > why
> >> > > >> > > > > the admin client can't present a single API method to
> >> achieve
> >> > > >> this,
> >> > > >> > and
> >> > > >> > > > > encapsulate on the server side the careful sequence of
> >> events
> >> > > >> > necessary
> >> > > >> > > > to
> >> > > >> > > > > coordinate the movement. I understood this position is
> what
> >> > > Ismael
> >> > > >> > was
> >> > > >> > > > > advocating when he said it was better to put the logic in
> >> the
> >> > > >> > > controller
> >> > > >> > > > > than spread between the script and the controller. But
> >> maybe I
> >> > > >> > > > > misunderstood him.
> >> > > >> > > > >
> >> > > >> > > >
> >> > > >> > > > I have some concern with putting this logic in controller
> >> which
> >> > > can
> >> > > >> be
> >> > > >> > > > found in my previous email. Before that is addressed, the
> >> script
> >> > > (or
> >> > > >> > > > AdminClient) seems to be the simplest place to have this
> >> logic.
> >> > > >> > > >
> >> > > >> > > > I agree it is better to have a single API to achieve both
> >> > > partition
> >> > > >> and
> >> > > >> > > > replica -> dir assignment. I think it is likely that we
> will
> >> > find
> >> > > a
> >> > > >> > good
> >> > > >> > > > API to do both. I have updated the KIP-113 to remove API
> >> > > >> > alterReplicaDir
> >> > > >> > > > from AdminClient interface.
> >> > > >> > > >
> >> > > >> > > >
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > > > I assume that the logic you mentioned is
> >> > > >> > > > > > "movement of replica to the specified log directory".
> >> This
> >> > > logic
> >> > > >> > (or
> >> > > >> > > > the
> >> > > >> > > > > > implementation of this logic) resides mainly in the
> >> > > >> > KafkaAdminClient
> >> > > >> > > > and
> >> > > >> > > > > > broker. The script only needs to parse the json file as
> >> > > >> appropriate
> >> > > >> > > and
> >> > > >> > > > > > call the new API in AdminClient as appropriate. The
> >> logic in
> >> > > the
> >> > > >> > > script
> >> > > >> > > > > is
> >> > > >> > > > > > therefore not much and can be easily moved to other
> >> classes
> >> > if
> >> > > >> > > needed.
> >> > > >> > > > > >
> >> > > >> > > > > > Can you clarify why this logic, i.e. movement of
> replica
> >> to
> >> > > the
> >> > > >> > > > specified
> >> > > >> > > > > > log directory, needs to be moved to controller in
> >> KIP-179? I
> >> > > >> think
> >> > > >> > it
> >> > > >> > > > can
> >> > > >> > > > > > still be done in the script and controller should not
> >> need
> >> > to
> >> > > >> worry
> >> > > >> > > > about
> >> > > >> > > > > > log directory of any replica.
> >> > > >> > > > > >
> >> > > >> > > > > > Thanks,
> >> > > >> > > > > > Dong
> >> > > >> > > > > >
> >> > > >> > > > >
> >> > > >> > > >
> >> > > >> > >
> >> > > >> >
> >> > > >>
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Reply via email to