Hey Jun,

It just came to me that you may be assuming that folower_lag = HW -
follower_LEO. If that is the case, then we need to have new
request/response to retrieve this lag since the DescribeDirsResponse
doesn't even include HW. It seems that KIP-179 does not explicitly specify
the definition of this lag.

I have been assuming that follow_lag = leader_LEO - follower_LEO given that
the request is used to query the reassignment status. Strictly speaking the
difference between leader_LEO and the HW is limited by the amount of data
produced in KafkaConfig.replicaLagTimeMaxMs, which is 10 seconds. I also
assumed that 10 seconds is probably not a big deal given the typical time
length of the reassignment.

Thanks,
Dong

On Wed, Aug 9, 2017 at 4:40 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Jun,
>
> If I understand you right, you are suggesting that, in the case when there
> is continuous incoming traffic, the approach in the KIP-179 will report lag
> as 0 whereas the approach using DescribeDirsRequest will report lag as
> non-zero. But I think the approach in KIP-179 will also report non-zero lag
> when there is continuous traffic. This is because at the time the leader
> receives ReplicaStatusRequest, it is likely that some data has been
> appended to the partition after the last FetchRequest from the follower.
> Does this make sense?
>
> Thanks,
> Dong
>
>
>
> On Wed, Aug 9, 2017 at 4:24 PM, Jun Rao <j...@confluent.io> wrote:
>
>> Hi, Dong,
>>
>> As for whether to return LEO or lag, my point was the following. What you
>> are concerned about is that an in-sync replica could become out of sync
>> again. However, the more common case is that once a replica is caught up,
>> it will stay in sync afterwards. In that case, once the reassignment
>> process completes, if we report based on lag, all lags will be 0. If we
>> report based on Math.max(0, leaderLEO - followerLEO), the value may not be
>> 0 if there is continuous incoming traffic, which will be confusing to the
>> user.
>>
>> Thanks,
>>
>> Jun
>>
>>
>>
>> On Tue, Aug 8, 2017 at 6:26 PM, Dong Lin <lindon...@gmail.com> wrote:
>>
>> > Hey Jun,
>> >
>> > Thanks for the comment!
>> >
>> > Yes, it should work. The tool can send request to any broker and broker
>> can
>> > just write the reassignment znode. My previous intuition is that it may
>> be
>> > better to only send this request to controller. But I don't have good
>> > reasons for this restriction.
>> >
>> > My intuition is that we can keep them separate as well. Becket and I
>> have
>> > discussed this both offline and in https://github.com/apache/
>> > kafka/pull/3621.
>> > Currently I don't have a strong opinion on this and I am open to using
>> only
>> > one API to do both if someone can come up with a reasonable API
>> signature
>> > for this method. For now I have added the method alterReplicaDir() in
>> > KafkaAdminClient instead of the AdminClient interface so that the
>> > reassignment script can use this method without concluding what the API
>> > would look like in AdminClient in the future.
>> >
>> > Regarding DescribeDirsResponse, I think it is probably OK to have
>> slightly
>> > more lag. The script can calculate the lag of the follower replica as
>> > Math.max(0, leaderLEO - followerLEO). I agree that it will be slightly
>> less
>> > accurate than the current approach in KIP-179. But even with the current
>> > approach in KIP-179, the result provided by the script is an
>> approximation
>> > anyway, since there is delay from the time that leader returns response
>> to
>> > the time that the script collects response from all brokers and prints
>> > result to user. I think if the slight difference in the accuracy between
>> > the two approaches does not make a difference to the intended use-case
>> of
>> > this API, then we probably want to re-use the exiting request/response
>> to
>> > keep the protocol simple.
>> >
>> > Thanks,
>> > Dong
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Aug 8, 2017 at 5:56 PM, Jun Rao <j...@confluent.io> wrote:
>> >
>> > > Hi, Dong,
>> > >
>> > > I think Tom was suggesting to have the AlterTopicsRequest sent to any
>> > > broker, which just writes the reassignment json to ZK. The controller
>> > will
>> > > pick up the reassignment and act on it as usual. This should work,
>> right?
>> > >
>> > > Having a separate AlterTopicsRequest and AlterReplicaDirRequest seems
>> > > simpler to me. The former is handled by the controller and the latter
>> is
>> > > handled by the affected broker. They don't always have to be done
>> > together.
>> > > Merging the two into a single request probably will make both the api
>> and
>> > > the implementation a bit more complicated. If we do keep the two
>> separate
>> > > requests, it seems that we should just add AlterReplicaDirRequest to
>> the
>> > > AdminClient interface?
>> > >
>> > > Now, regarding DescribeDirsResponse. I agree that it can be used for
>> the
>> > > status reporting in KIP-179 as well. However, it seems that reporting
>> the
>> > > log end offset of each replica may not be easy to use. The log end
>> offset
>> > > will be returned from different brokers in slightly different time. If
>> > > there is continuous producing traffic, the difference in log end
>> offset
>> > > between the leader and the follower could be larger than 0 even if the
>> > > follower has fully caught up. I am wondering if it's better to instead
>> > > return the lag in offset per replica. This way, the status can
>> probably
>> > be
>> > > reported more reliably.
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > > On Tue, Aug 8, 2017 at 11:23 AM, Dong Lin <lindon...@gmail.com>
>> wrote:
>> > >
>> > > > Hey Tom,
>> > > >
>> > > > Thanks for the quick reply. Please see my comment inline.
>> > > >
>> > > > On Tue, Aug 8, 2017 at 11:06 AM, Tom Bentley <t.j.bent...@gmail.com
>> >
>> > > > wrote:
>> > > >
>> > > > > Hi Dong,
>> > > > >
>> > > > > Replies inline, as usual
>> > > > >
>> > > > > > As I originally envisaged it, KIP-179's support for reassigning
>> > > > > partitions
>> > > > > >
>> > > > > > would have more-or-less taken the logic currently in the
>> > > > > > > ReassignPartitionsCommand (that is, writing JSON to the
>> > > > > > > ZkUtils.ReassignPartitionsPath)
>> > > > > > > and put it behind a suitable network protocol API. Thus it
>> > wouldn't
>> > > > > > matter
>> > > > > > > which broker received the protocol call: It would be acted on
>> by
>> > > > > brokers
>> > > > > > > being notified of the change in the ZK path, just as
>> currently.
>> > > This
>> > > > > > would
>> > > > > > > have kept the ReassignPartitionsCommand relatively simple, as
>> it
>> > > > > > currently
>> > > > > > > is.
>> > > > > > >
>> > > > > >
>> > > > > > I am not sure I fully understand your proposal. I think you are
>> > > saying
>> > > > > that
>> > > > > > any broker can receive and handle the AlterTopicRequest.
>> > > > >
>> > > > >
>> > > > > That's right.
>> > > > >
>> > > > >
>> > > > > > Let's say a
>> > > > > > non-controller broker received AlterTopicRequest, is this broker
>> > > going
>> > > > to
>> > > > > > send LeaderAndIsrRequest to other brokers? Or is this broker
>> create
>> > > the
>> > > > > > reassignment znode in zookeper?
>> > > > >
>> > > > >
>> > > > > Exactly: It's going to write some JSON to the relevant znode.
>> Other
>> > > > brokers
>> > > > > will get notified by zk when the contents of this znode changes,
>> and
>> > do
>> > > > as
>> > > > > they do now. This is what the tool/script does now.
>> > > > >
>> > > > > I will confess that I don't completely understand the role of
>> > > > > LeaderAndIsrRequest, since the current code just seems to write to
>> > the
>> > > > > znode do get the brokers to do the reassignment. If you could
>> explain
>> > > the
>> > > > > role of LeaderAndIsrRequest that would be great.
>> > > > >
>> > > >
>> > > > Currently only the controller will listen to the reassignment znode
>> and
>> > > > sends LeaderAndIsrRequest and StopReplicaRequest to brokers in
>> order to
>> > > > complete reassignment. Brokers won't need to listen to zookeeper for
>> > any
>> > > > reassignment -- brokers only reacts to the request from controller.
>> > > > Currently Kafka's design replies a lot on the controller to keep a
>> > > > consistent view of who are the leader of partitions and what is the
>> ISR
>> > > > etc. It will be a pretty drastic change, if not impossible, for the
>> > > script
>> > > > to reassign partitions without going through controller.
>> > > >
>> > > > Thus I think it is likely that your AlterTopicsRequest can only be
>> sent
>> > > to
>> > > > controller. Then the controller can create the reassignment znode in
>> > > > zookeeper so that the information is persisted across controller
>> fail
>> > > over.
>> > > > I haven't think through this in detail though.
>> > > >
>> > > >
>> > > >
>> > > > >
>> > > > >
>> > > > > > I may have missed it. But I couldn't find
>> > > > > > the explanation of AlterTopicRequest handling in KIP-179.
>> > > > > >
>> > > > >
>> > > > > You're right, it doesn't go into that much detail. I will fix
>> that.
>> > > > >
>> > > > >
>> > > > > > >
>> > > > > > > KIP-113 is obviously seeking to make more radical changes. The
>> > > > > algorithm
>> > > > > > > described for moving a replica to a particular directory on a
>> > > > different
>> > > > > > > broker (
>> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> > > > > > > 113%3A+Support+replicas+movement+between+log+
>> > directories#KIP-113:
>> > > > > > > Supportreplicasmovementbetweenlogdirectories-2)
>> > > > > > > Howtoreassignreplicabetweenlogdirectoriesacrossbrokers
>> > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> > > > > > > 113%3A+Support+replicas+movement+between+log+
>> > directories#KIP-113:
>> > > > > > > Supportreplicasmovementbetweenlogdirectories-2%
>> > > > > > > 29Howtoreassignreplicabetweenlogdirectoriesacrossbrokers>)
>> > > > > > > involves both sending AlterReplicaDirRequest to "the" broker
>> (the
>> > > > > > receiving
>> > > > > > > broker, I assume, but it's not spelled out), _as well as_
>> writing
>> > > to
>> > > > > the
>> > > > > > ZK
>> > > > > > > node.
>> > > > > > >
>> > > > > > > This assumes the script (ReassignPartitionsCommand) has direct
>> > > access
>> > > > > to
>> > > > > > > ZooKeeper, which is what KIP-179 is seeking to deprecate. It
>> > seems
>> > > a
>> > > > > > waste
>> > > > > > > of time to put the logic in the script as part of KIP-113,
>> only
>> > for
>> > > > > > KIP-179
>> > > > > > > to have to move it back to the controller.
>> > > > > > >
>> > > > > >
>> > > > > > I am not sure I understand what you mean by "It seems a waste of
>> > time
>> > > > to
>> > > > > > put the logic in the script as part of KIP-113, only for
>> KIP-179 to
>> > > > have
>> > > > > to
>> > > > > > move it back to the controller".
>> > > > >
>> > > > >
>> > > > > Sorry, I misunderstood slightly what you were proposing in
>> KIP-113,
>> > so
>> > > > the
>> > > > > "waste of time" comment isn't quite right, but I'm still not
>> > convinced
>> > > > that
>> > > > > KIP-113+KIP-179 (in its current form) ends with a satisfactory
>> > result.
>> > > > >
>> > > > > Let me elaborate... KIP-113 says that to support reassigning
>> replica
>> > > > > between log directories across brokers:
>> > > > > * ...
>> > > > > * The script sends AlterReplicaDirRequest to those brokers which
>> need
>> > > to
>> > > > > move replicas...
>> > > > > * The script creates reassignment znode in zookeeper.
>> > > > > * The script retries AlterReplicaDirRequest to those broker...
>> > > > > * ...
>> > > > >
>> > > > > So the ReassignPartitionsCommand still talks to ZK directly, but
>> now
>> > > it's
>> > > > > bracketed by calls to the AdminClient. KIP-179 could replace that
>> > > talking
>> > > > > to ZK directly with a new call to the AdminClient. But then we've
>> > got a
>> > > > > pretty weird API, where we have to make three AdminClient calls
>> (two
>> > of
>> > > > > them to the same method), to move a replica. I don't really
>> > understand
>> > > > why
>> > > > > the admin client can't present a single API method to achieve
>> this,
>> > and
>> > > > > encapsulate on the server side the careful sequence of events
>> > necessary
>> > > > to
>> > > > > coordinate the movement. I understood this position is what Ismael
>> > was
>> > > > > advocating when he said it was better to put the logic in the
>> > > controller
>> > > > > than spread between the script and the controller. But maybe I
>> > > > > misunderstood him.
>> > > > >
>> > > >
>> > > > I have some concern with putting this logic in controller which can
>> be
>> > > > found in my previous email. Before that is addressed, the script (or
>> > > > AdminClient) seems to be the simplest place to have this logic.
>> > > >
>> > > > I agree it is better to have a single API to achieve both partition
>> and
>> > > > replica -> dir assignment. I think it is likely that we will find a
>> > good
>> > > > API to do both. I have updated the KIP-113 to remove API
>> > alterReplicaDir
>> > > > from AdminClient interface.
>> > > >
>> > > >
>> > > > >
>> > > > >
>> > > > > > I assume that the logic you mentioned is
>> > > > > > "movement of replica to the specified log directory". This logic
>> > (or
>> > > > the
>> > > > > > implementation of this logic) resides mainly in the
>> > KafkaAdminClient
>> > > > and
>> > > > > > broker. The script only needs to parse the json file as
>> appropriate
>> > > and
>> > > > > > call the new API in AdminClient as appropriate. The logic in the
>> > > script
>> > > > > is
>> > > > > > therefore not much and can be easily moved to other classes if
>> > > needed.
>> > > > > >
>> > > > > > Can you clarify why this logic, i.e. movement of replica to the
>> > > > specified
>> > > > > > log directory, needs to be moved to controller in KIP-179? I
>> think
>> > it
>> > > > can
>> > > > > > still be done in the script and controller should not need to
>> worry
>> > > > about
>> > > > > > log directory of any replica.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Dong
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Reply via email to