Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Jun Rao Wed, 09 Aug 2017 17:07:17 -0700

Hi, Dong,

Yes, the lag in a replica is calculated as the difference of LEO of the
replica and the HW. So, as long as a replica is in sync, the lag is always
0.


So, I was suggesting to return lag instead of LEO in DescribeDirsResponse
for each replica. I am not sure if we need to return HW though.

Thanks,

Jun

On Wed, Aug 9, 2017 at 5:01 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Jun,
>
> It just came to me that you may be assuming that folower_lag = HW -
> follower_LEO. If that is the case, then we need to have new
> request/response to retrieve this lag since the DescribeDirsResponse
> doesn't even include HW. It seems that KIP-179 does not explicitly specify
> the definition of this lag.
>
> I have been assuming that follow_lag = leader_LEO - follower_LEO given that
> the request is used to query the reassignment status. Strictly speaking the
> difference between leader_LEO and the HW is limited by the amount of data
> produced in KafkaConfig.replicaLagTimeMaxMs, which is 10 seconds. I also
> assumed that 10 seconds is probably not a big deal given the typical time
> length of the reassignment.
>
> Thanks,
> Dong
>
> On Wed, Aug 9, 2017 at 4:40 PM, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hey Jun,
> >
> > If I understand you right, you are suggesting that, in the case when
> there
> > is continuous incoming traffic, the approach in the KIP-179 will report
> lag
> > as 0 whereas the approach using DescribeDirsRequest will report lag as
> > non-zero. But I think the approach in KIP-179 will also report non-zero
> lag
> > when there is continuous traffic. This is because at the time the leader
> > receives ReplicaStatusRequest, it is likely that some data has been
> > appended to the partition after the last FetchRequest from the follower.
> > Does this make sense?
> >
> > Thanks,
> > Dong
> >
> >
> >
> > On Wed, Aug 9, 2017 at 4:24 PM, Jun Rao <j...@confluent.io> wrote:
> >
> >> Hi, Dong,
> >>
> >> As for whether to return LEO or lag, my point was the following. What
> you
> >> are concerned about is that an in-sync replica could become out of sync
> >> again. However, the more common case is that once a replica is caught
> up,
> >> it will stay in sync afterwards. In that case, once the reassignment
> >> process completes, if we report based on lag, all lags will be 0. If we
> >> report based on Math.max(0, leaderLEO - followerLEO), the value may not
> be
> >> 0 if there is continuous incoming traffic, which will be confusing to
> the
> >> user.
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >>
> >>
> >> On Tue, Aug 8, 2017 at 6:26 PM, Dong Lin <lindon...@gmail.com> wrote:
> >>
> >> > Hey Jun,
> >> >
> >> > Thanks for the comment!
> >> >
> >> > Yes, it should work. The tool can send request to any broker and
> broker
> >> can
> >> > just write the reassignment znode. My previous intuition is that it
> may
> >> be
> >> > better to only send this request to controller. But I don't have good
> >> > reasons for this restriction.
> >> >
> >> > My intuition is that we can keep them separate as well. Becket and I
> >> have
> >> > discussed this both offline and in https://github.com/apache/
> >> > kafka/pull/3621.
> >> > Currently I don't have a strong opinion on this and I am open to using
> >> only
> >> > one API to do both if someone can come up with a reasonable API
> >> signature
> >> > for this method. For now I have added the method alterReplicaDir() in
> >> > KafkaAdminClient instead of the AdminClient interface so that the
> >> > reassignment script can use this method without concluding what the
> API
> >> > would look like in AdminClient in the future.
> >> >
> >> > Regarding DescribeDirsResponse, I think it is probably OK to have
> >> slightly
> >> > more lag. The script can calculate the lag of the follower replica as
> >> > Math.max(0, leaderLEO - followerLEO). I agree that it will be slightly
> >> less
> >> > accurate than the current approach in KIP-179. But even with the
> current
> >> > approach in KIP-179, the result provided by the script is an
> >> approximation
> >> > anyway, since there is delay from the time that leader returns
> response
> >> to
> >> > the time that the script collects response from all brokers and prints
> >> > result to user. I think if the slight difference in the accuracy
> between
> >> > the two approaches does not make a difference to the intended use-case
> >> of
> >> > this API, then we probably want to re-use the exiting request/response
> >> to
> >> > keep the protocol simple.
> >> >
> >> > Thanks,
> >> > Dong
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Tue, Aug 8, 2017 at 5:56 PM, Jun Rao <j...@confluent.io> wrote:
> >> >
> >> > > Hi, Dong,
> >> > >
> >> > > I think Tom was suggesting to have the AlterTopicsRequest sent to
> any
> >> > > broker, which just writes the reassignment json to ZK. The
> controller
> >> > will
> >> > > pick up the reassignment and act on it as usual. This should work,
> >> right?
> >> > >
> >> > > Having a separate AlterTopicsRequest and AlterReplicaDirRequest
> seems
> >> > > simpler to me. The former is handled by the controller and the
> latter
> >> is
> >> > > handled by the affected broker. They don't always have to be done
> >> > together.
> >> > > Merging the two into a single request probably will make both the
> api
> >> and
> >> > > the implementation a bit more complicated. If we do keep the two
> >> separate
> >> > > requests, it seems that we should just add AlterReplicaDirRequest to
> >> the
> >> > > AdminClient interface?
> >> > >
> >> > > Now, regarding DescribeDirsResponse. I agree that it can be used for
> >> the
> >> > > status reporting in KIP-179 as well. However, it seems that
> reporting
> >> the
> >> > > log end offset of each replica may not be easy to use. The log end
> >> offset
> >> > > will be returned from different brokers in slightly different time.
> If
> >> > > there is continuous producing traffic, the difference in log end
> >> offset
> >> > > between the leader and the follower could be larger than 0 even if
> the
> >> > > follower has fully caught up. I am wondering if it's better to
> instead
> >> > > return the lag in offset per replica. This way, the status can
> >> probably
> >> > be
> >> > > reported more reliably.
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Jun
> >> > >
> >> > > On Tue, Aug 8, 2017 at 11:23 AM, Dong Lin <lindon...@gmail.com>
> >> wrote:
> >> > >
> >> > > > Hey Tom,
> >> > > >
> >> > > > Thanks for the quick reply. Please see my comment inline.
> >> > > >
> >> > > > On Tue, Aug 8, 2017 at 11:06 AM, Tom Bentley <
> t.j.bent...@gmail.com
> >> >
> >> > > > wrote:
> >> > > >
> >> > > > > Hi Dong,
> >> > > > >
> >> > > > > Replies inline, as usual
> >> > > > >
> >> > > > > > As I originally envisaged it, KIP-179's support for
> reassigning
> >> > > > > partitions
> >> > > > > >
> >> > > > > > would have more-or-less taken the logic currently in the
> >> > > > > > > ReassignPartitionsCommand (that is, writing JSON to the
> >> > > > > > > ZkUtils.ReassignPartitionsPath)
> >> > > > > > > and put it behind a suitable network protocol API. Thus it
> >> > wouldn't
> >> > > > > > matter
> >> > > > > > > which broker received the protocol call: It would be acted
> on
> >> by
> >> > > > > brokers
> >> > > > > > > being notified of the change in the ZK path, just as
> >> currently.
> >> > > This
> >> > > > > > would
> >> > > > > > > have kept the ReassignPartitionsCommand relatively simple,
> as
> >> it
> >> > > > > > currently
> >> > > > > > > is.
> >> > > > > > >
> >> > > > > >
> >> > > > > > I am not sure I fully understand your proposal. I think you
> are
> >> > > saying
> >> > > > > that
> >> > > > > > any broker can receive and handle the AlterTopicRequest.
> >> > > > >
> >> > > > >
> >> > > > > That's right.
> >> > > > >
> >> > > > >
> >> > > > > > Let's say a
> >> > > > > > non-controller broker received AlterTopicRequest, is this
> broker
> >> > > going
> >> > > > to
> >> > > > > > send LeaderAndIsrRequest to other brokers? Or is this broker
> >> create
> >> > > the
> >> > > > > > reassignment znode in zookeper?
> >> > > > >
> >> > > > >
> >> > > > > Exactly: It's going to write some JSON to the relevant znode.
> >> Other
> >> > > > brokers
> >> > > > > will get notified by zk when the contents of this znode changes,
> >> and
> >> > do
> >> > > > as
> >> > > > > they do now. This is what the tool/script does now.
> >> > > > >
> >> > > > > I will confess that I don't completely understand the role of
> >> > > > > LeaderAndIsrRequest, since the current code just seems to write
> to
> >> > the
> >> > > > > znode do get the brokers to do the reassignment. If you could
> >> explain
> >> > > the
> >> > > > > role of LeaderAndIsrRequest that would be great.
> >> > > > >
> >> > > >
> >> > > > Currently only the controller will listen to the reassignment
> znode
> >> and
> >> > > > sends LeaderAndIsrRequest and StopReplicaRequest to brokers in
> >> order to
> >> > > > complete reassignment. Brokers won't need to listen to zookeeper
> for
> >> > any
> >> > > > reassignment -- brokers only reacts to the request from
> controller.
> >> > > > Currently Kafka's design replies a lot on the controller to keep a
> >> > > > consistent view of who are the leader of partitions and what is
> the
> >> ISR
> >> > > > etc. It will be a pretty drastic change, if not impossible, for
> the
> >> > > script
> >> > > > to reassign partitions without going through controller.
> >> > > >
> >> > > > Thus I think it is likely that your AlterTopicsRequest can only be
> >> sent
> >> > > to
> >> > > > controller. Then the controller can create the reassignment znode
> in
> >> > > > zookeeper so that the information is persisted across controller
> >> fail
> >> > > over.
> >> > > > I haven't think through this in detail though.
> >> > > >
> >> > > >
> >> > > >
> >> > > > >
> >> > > > >
> >> > > > > > I may have missed it. But I couldn't find
> >> > > > > > the explanation of AlterTopicRequest handling in KIP-179.
> >> > > > > >
> >> > > > >
> >> > > > > You're right, it doesn't go into that much detail. I will fix
> >> that.
> >> > > > >
> >> > > > >
> >> > > > > > >
> >> > > > > > > KIP-113 is obviously seeking to make more radical changes.
> The
> >> > > > > algorithm
> >> > > > > > > described for moving a replica to a particular directory on
> a
> >> > > > different
> >> > > > > > > broker (
> >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >> > > > > > > 113%3A+Support+replicas+movement+between+log+
> >> > directories#KIP-113:
> >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2)
> >> > > > > > > Howtoreassignreplicabetweenlogdirectoriesacrossbrokers
> >> > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >> > > > > > > 113%3A+Support+replicas+movement+between+log+
> >> > directories#KIP-113:
> >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2%
> >> > > > > > > 29Howtoreassignreplicabetweenlogdirectoriesacrossbrokers>)
> >> > > > > > > involves both sending AlterReplicaDirRequest to "the" broker
> >> (the
> >> > > > > > receiving
> >> > > > > > > broker, I assume, but it's not spelled out), _as well as_
> >> writing
> >> > > to
> >> > > > > the
> >> > > > > > ZK
> >> > > > > > > node.
> >> > > > > > >
> >> > > > > > > This assumes the script (ReassignPartitionsCommand) has
> direct
> >> > > access
> >> > > > > to
> >> > > > > > > ZooKeeper, which is what KIP-179 is seeking to deprecate. It
> >> > seems
> >> > > a
> >> > > > > > waste
> >> > > > > > > of time to put the logic in the script as part of KIP-113,
> >> only
> >> > for
> >> > > > > > KIP-179
> >> > > > > > > to have to move it back to the controller.
> >> > > > > > >
> >> > > > > >
> >> > > > > > I am not sure I understand what you mean by "It seems a waste
> of
> >> > time
> >> > > > to
> >> > > > > > put the logic in the script as part of KIP-113, only for
> >> KIP-179 to
> >> > > > have
> >> > > > > to
> >> > > > > > move it back to the controller".
> >> > > > >
> >> > > > >
> >> > > > > Sorry, I misunderstood slightly what you were proposing in
> >> KIP-113,
> >> > so
> >> > > > the
> >> > > > > "waste of time" comment isn't quite right, but I'm still not
> >> > convinced
> >> > > > that
> >> > > > > KIP-113+KIP-179 (in its current form) ends with a satisfactory
> >> > result.
> >> > > > >
> >> > > > > Let me elaborate... KIP-113 says that to support reassigning
> >> replica
> >> > > > > between log directories across brokers:
> >> > > > > * ...
> >> > > > > * The script sends AlterReplicaDirRequest to those brokers which
> >> need
> >> > > to
> >> > > > > move replicas...
> >> > > > > * The script creates reassignment znode in zookeeper.
> >> > > > > * The script retries AlterReplicaDirRequest to those broker...
> >> > > > > * ...
> >> > > > >
> >> > > > > So the ReassignPartitionsCommand still talks to ZK directly, but
> >> now
> >> > > it's
> >> > > > > bracketed by calls to the AdminClient. KIP-179 could replace
> that
> >> > > talking
> >> > > > > to ZK directly with a new call to the AdminClient. But then
> we've
> >> > got a
> >> > > > > pretty weird API, where we have to make three AdminClient calls
> >> (two
> >> > of
> >> > > > > them to the same method), to move a replica. I don't really
> >> > understand
> >> > > > why
> >> > > > > the admin client can't present a single API method to achieve
> >> this,
> >> > and
> >> > > > > encapsulate on the server side the careful sequence of events
> >> > necessary
> >> > > > to
> >> > > > > coordinate the movement. I understood this position is what
> Ismael
> >> > was
> >> > > > > advocating when he said it was better to put the logic in the
> >> > > controller
> >> > > > > than spread between the script and the controller. But maybe I
> >> > > > > misunderstood him.
> >> > > > >
> >> > > >
> >> > > > I have some concern with putting this logic in controller which
> can
> >> be
> >> > > > found in my previous email. Before that is addressed, the script
> (or
> >> > > > AdminClient) seems to be the simplest place to have this logic.
> >> > > >
> >> > > > I agree it is better to have a single API to achieve both
> partition
> >> and
> >> > > > replica -> dir assignment. I think it is likely that we will find
> a
> >> > good
> >> > > > API to do both. I have updated the KIP-113 to remove API
> >> > alterReplicaDir
> >> > > > from AdminClient interface.
> >> > > >
> >> > > >
> >> > > > >
> >> > > > >
> >> > > > > > I assume that the logic you mentioned is
> >> > > > > > "movement of replica to the specified log directory". This
> logic
> >> > (or
> >> > > > the
> >> > > > > > implementation of this logic) resides mainly in the
> >> > KafkaAdminClient
> >> > > > and
> >> > > > > > broker. The script only needs to parse the json file as
> >> appropriate
> >> > > and
> >> > > > > > call the new API in AdminClient as appropriate. The logic in
> the
> >> > > script
> >> > > > > is
> >> > > > > > therefore not much and can be easily moved to other classes if
> >> > > needed.
> >> > > > > >
> >> > > > > > Can you clarify why this logic, i.e. movement of replica to
> the
> >> > > > specified
> >> > > > > > log directory, needs to be moved to controller in KIP-179? I
> >> think
> >> > it
> >> > > > can
> >> > > > > > still be done in the script and controller should not need to
> >> worry
> >> > > > about
> >> > > > > > log directory of any replica.
> >> > > > > >
> >> > > > > > Thanks,
> >> > > > > > Dong
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Reply via email to