Hi, Dong, Yes, the lag in a replica is calculated as the difference of LEO of the replica and the HW. So, as long as a replica is in sync, the lag is always 0.
So, I was suggesting to return lag instead of LEO in DescribeDirsResponse for each replica. I am not sure if we need to return HW though. Thanks, Jun On Wed, Aug 9, 2017 at 5:01 PM, Dong Lin <[email protected]> wrote: > Hey Jun, > > It just came to me that you may be assuming that folower_lag = HW - > follower_LEO. If that is the case, then we need to have new > request/response to retrieve this lag since the DescribeDirsResponse > doesn't even include HW. It seems that KIP-179 does not explicitly specify > the definition of this lag. > > I have been assuming that follow_lag = leader_LEO - follower_LEO given that > the request is used to query the reassignment status. Strictly speaking the > difference between leader_LEO and the HW is limited by the amount of data > produced in KafkaConfig.replicaLagTimeMaxMs, which is 10 seconds. I also > assumed that 10 seconds is probably not a big deal given the typical time > length of the reassignment. > > Thanks, > Dong > > On Wed, Aug 9, 2017 at 4:40 PM, Dong Lin <[email protected]> wrote: > > > Hey Jun, > > > > If I understand you right, you are suggesting that, in the case when > there > > is continuous incoming traffic, the approach in the KIP-179 will report > lag > > as 0 whereas the approach using DescribeDirsRequest will report lag as > > non-zero. But I think the approach in KIP-179 will also report non-zero > lag > > when there is continuous traffic. This is because at the time the leader > > receives ReplicaStatusRequest, it is likely that some data has been > > appended to the partition after the last FetchRequest from the follower. > > Does this make sense? > > > > Thanks, > > Dong > > > > > > > > On Wed, Aug 9, 2017 at 4:24 PM, Jun Rao <[email protected]> wrote: > > > >> Hi, Dong, > >> > >> As for whether to return LEO or lag, my point was the following. What > you > >> are concerned about is that an in-sync replica could become out of sync > >> again. However, the more common case is that once a replica is caught > up, > >> it will stay in sync afterwards. In that case, once the reassignment > >> process completes, if we report based on lag, all lags will be 0. If we > >> report based on Math.max(0, leaderLEO - followerLEO), the value may not > be > >> 0 if there is continuous incoming traffic, which will be confusing to > the > >> user. > >> > >> Thanks, > >> > >> Jun > >> > >> > >> > >> On Tue, Aug 8, 2017 at 6:26 PM, Dong Lin <[email protected]> wrote: > >> > >> > Hey Jun, > >> > > >> > Thanks for the comment! > >> > > >> > Yes, it should work. The tool can send request to any broker and > broker > >> can > >> > just write the reassignment znode. My previous intuition is that it > may > >> be > >> > better to only send this request to controller. But I don't have good > >> > reasons for this restriction. > >> > > >> > My intuition is that we can keep them separate as well. Becket and I > >> have > >> > discussed this both offline and in https://github.com/apache/ > >> > kafka/pull/3621. > >> > Currently I don't have a strong opinion on this and I am open to using > >> only > >> > one API to do both if someone can come up with a reasonable API > >> signature > >> > for this method. For now I have added the method alterReplicaDir() in > >> > KafkaAdminClient instead of the AdminClient interface so that the > >> > reassignment script can use this method without concluding what the > API > >> > would look like in AdminClient in the future. > >> > > >> > Regarding DescribeDirsResponse, I think it is probably OK to have > >> slightly > >> > more lag. The script can calculate the lag of the follower replica as > >> > Math.max(0, leaderLEO - followerLEO). I agree that it will be slightly > >> less > >> > accurate than the current approach in KIP-179. But even with the > current > >> > approach in KIP-179, the result provided by the script is an > >> approximation > >> > anyway, since there is delay from the time that leader returns > response > >> to > >> > the time that the script collects response from all brokers and prints > >> > result to user. I think if the slight difference in the accuracy > between > >> > the two approaches does not make a difference to the intended use-case > >> of > >> > this API, then we probably want to re-use the exiting request/response > >> to > >> > keep the protocol simple. > >> > > >> > Thanks, > >> > Dong > >> > > >> > > >> > > >> > > >> > > >> > On Tue, Aug 8, 2017 at 5:56 PM, Jun Rao <[email protected]> wrote: > >> > > >> > > Hi, Dong, > >> > > > >> > > I think Tom was suggesting to have the AlterTopicsRequest sent to > any > >> > > broker, which just writes the reassignment json to ZK. The > controller > >> > will > >> > > pick up the reassignment and act on it as usual. This should work, > >> right? > >> > > > >> > > Having a separate AlterTopicsRequest and AlterReplicaDirRequest > seems > >> > > simpler to me. The former is handled by the controller and the > latter > >> is > >> > > handled by the affected broker. They don't always have to be done > >> > together. > >> > > Merging the two into a single request probably will make both the > api > >> and > >> > > the implementation a bit more complicated. If we do keep the two > >> separate > >> > > requests, it seems that we should just add AlterReplicaDirRequest to > >> the > >> > > AdminClient interface? > >> > > > >> > > Now, regarding DescribeDirsResponse. I agree that it can be used for > >> the > >> > > status reporting in KIP-179 as well. However, it seems that > reporting > >> the > >> > > log end offset of each replica may not be easy to use. The log end > >> offset > >> > > will be returned from different brokers in slightly different time. > If > >> > > there is continuous producing traffic, the difference in log end > >> offset > >> > > between the leader and the follower could be larger than 0 even if > the > >> > > follower has fully caught up. I am wondering if it's better to > instead > >> > > return the lag in offset per replica. This way, the status can > >> probably > >> > be > >> > > reported more reliably. > >> > > > >> > > Thanks, > >> > > > >> > > Jun > >> > > > >> > > On Tue, Aug 8, 2017 at 11:23 AM, Dong Lin <[email protected]> > >> wrote: > >> > > > >> > > > Hey Tom, > >> > > > > >> > > > Thanks for the quick reply. Please see my comment inline. > >> > > > > >> > > > On Tue, Aug 8, 2017 at 11:06 AM, Tom Bentley < > [email protected] > >> > > >> > > > wrote: > >> > > > > >> > > > > Hi Dong, > >> > > > > > >> > > > > Replies inline, as usual > >> > > > > > >> > > > > > As I originally envisaged it, KIP-179's support for > reassigning > >> > > > > partitions > >> > > > > > > >> > > > > > would have more-or-less taken the logic currently in the > >> > > > > > > ReassignPartitionsCommand (that is, writing JSON to the > >> > > > > > > ZkUtils.ReassignPartitionsPath) > >> > > > > > > and put it behind a suitable network protocol API. Thus it > >> > wouldn't > >> > > > > > matter > >> > > > > > > which broker received the protocol call: It would be acted > on > >> by > >> > > > > brokers > >> > > > > > > being notified of the change in the ZK path, just as > >> currently. > >> > > This > >> > > > > > would > >> > > > > > > have kept the ReassignPartitionsCommand relatively simple, > as > >> it > >> > > > > > currently > >> > > > > > > is. > >> > > > > > > > >> > > > > > > >> > > > > > I am not sure I fully understand your proposal. I think you > are > >> > > saying > >> > > > > that > >> > > > > > any broker can receive and handle the AlterTopicRequest. > >> > > > > > >> > > > > > >> > > > > That's right. > >> > > > > > >> > > > > > >> > > > > > Let's say a > >> > > > > > non-controller broker received AlterTopicRequest, is this > broker > >> > > going > >> > > > to > >> > > > > > send LeaderAndIsrRequest to other brokers? Or is this broker > >> create > >> > > the > >> > > > > > reassignment znode in zookeper? > >> > > > > > >> > > > > > >> > > > > Exactly: It's going to write some JSON to the relevant znode. > >> Other > >> > > > brokers > >> > > > > will get notified by zk when the contents of this znode changes, > >> and > >> > do > >> > > > as > >> > > > > they do now. This is what the tool/script does now. > >> > > > > > >> > > > > I will confess that I don't completely understand the role of > >> > > > > LeaderAndIsrRequest, since the current code just seems to write > to > >> > the > >> > > > > znode do get the brokers to do the reassignment. If you could > >> explain > >> > > the > >> > > > > role of LeaderAndIsrRequest that would be great. > >> > > > > > >> > > > > >> > > > Currently only the controller will listen to the reassignment > znode > >> and > >> > > > sends LeaderAndIsrRequest and StopReplicaRequest to brokers in > >> order to > >> > > > complete reassignment. Brokers won't need to listen to zookeeper > for > >> > any > >> > > > reassignment -- brokers only reacts to the request from > controller. > >> > > > Currently Kafka's design replies a lot on the controller to keep a > >> > > > consistent view of who are the leader of partitions and what is > the > >> ISR > >> > > > etc. It will be a pretty drastic change, if not impossible, for > the > >> > > script > >> > > > to reassign partitions without going through controller. > >> > > > > >> > > > Thus I think it is likely that your AlterTopicsRequest can only be > >> sent > >> > > to > >> > > > controller. Then the controller can create the reassignment znode > in > >> > > > zookeeper so that the information is persisted across controller > >> fail > >> > > over. > >> > > > I haven't think through this in detail though. > >> > > > > >> > > > > >> > > > > >> > > > > > >> > > > > > >> > > > > > I may have missed it. But I couldn't find > >> > > > > > the explanation of AlterTopicRequest handling in KIP-179. > >> > > > > > > >> > > > > > >> > > > > You're right, it doesn't go into that much detail. I will fix > >> that. > >> > > > > > >> > > > > > >> > > > > > > > >> > > > > > > KIP-113 is obviously seeking to make more radical changes. > The > >> > > > > algorithm > >> > > > > > > described for moving a replica to a particular directory on > a > >> > > > different > >> > > > > > > broker ( > >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP- > >> > > > > > > 113%3A+Support+replicas+movement+between+log+ > >> > directories#KIP-113: > >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2) > >> > > > > > > Howtoreassignreplicabetweenlogdirectoriesacrossbrokers > >> > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP- > >> > > > > > > 113%3A+Support+replicas+movement+between+log+ > >> > directories#KIP-113: > >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2% > >> > > > > > > 29Howtoreassignreplicabetweenlogdirectoriesacrossbrokers>) > >> > > > > > > involves both sending AlterReplicaDirRequest to "the" broker > >> (the > >> > > > > > receiving > >> > > > > > > broker, I assume, but it's not spelled out), _as well as_ > >> writing > >> > > to > >> > > > > the > >> > > > > > ZK > >> > > > > > > node. > >> > > > > > > > >> > > > > > > This assumes the script (ReassignPartitionsCommand) has > direct > >> > > access > >> > > > > to > >> > > > > > > ZooKeeper, which is what KIP-179 is seeking to deprecate. It > >> > seems > >> > > a > >> > > > > > waste > >> > > > > > > of time to put the logic in the script as part of KIP-113, > >> only > >> > for > >> > > > > > KIP-179 > >> > > > > > > to have to move it back to the controller. > >> > > > > > > > >> > > > > > > >> > > > > > I am not sure I understand what you mean by "It seems a waste > of > >> > time > >> > > > to > >> > > > > > put the logic in the script as part of KIP-113, only for > >> KIP-179 to > >> > > > have > >> > > > > to > >> > > > > > move it back to the controller". > >> > > > > > >> > > > > > >> > > > > Sorry, I misunderstood slightly what you were proposing in > >> KIP-113, > >> > so > >> > > > the > >> > > > > "waste of time" comment isn't quite right, but I'm still not > >> > convinced > >> > > > that > >> > > > > KIP-113+KIP-179 (in its current form) ends with a satisfactory > >> > result. > >> > > > > > >> > > > > Let me elaborate... KIP-113 says that to support reassigning > >> replica > >> > > > > between log directories across brokers: > >> > > > > * ... > >> > > > > * The script sends AlterReplicaDirRequest to those brokers which > >> need > >> > > to > >> > > > > move replicas... > >> > > > > * The script creates reassignment znode in zookeeper. > >> > > > > * The script retries AlterReplicaDirRequest to those broker... > >> > > > > * ... > >> > > > > > >> > > > > So the ReassignPartitionsCommand still talks to ZK directly, but > >> now > >> > > it's > >> > > > > bracketed by calls to the AdminClient. KIP-179 could replace > that > >> > > talking > >> > > > > to ZK directly with a new call to the AdminClient. But then > we've > >> > got a > >> > > > > pretty weird API, where we have to make three AdminClient calls > >> (two > >> > of > >> > > > > them to the same method), to move a replica. I don't really > >> > understand > >> > > > why > >> > > > > the admin client can't present a single API method to achieve > >> this, > >> > and > >> > > > > encapsulate on the server side the careful sequence of events > >> > necessary > >> > > > to > >> > > > > coordinate the movement. I understood this position is what > Ismael > >> > was > >> > > > > advocating when he said it was better to put the logic in the > >> > > controller > >> > > > > than spread between the script and the controller. But maybe I > >> > > > > misunderstood him. > >> > > > > > >> > > > > >> > > > I have some concern with putting this logic in controller which > can > >> be > >> > > > found in my previous email. Before that is addressed, the script > (or > >> > > > AdminClient) seems to be the simplest place to have this logic. > >> > > > > >> > > > I agree it is better to have a single API to achieve both > partition > >> and > >> > > > replica -> dir assignment. I think it is likely that we will find > a > >> > good > >> > > > API to do both. I have updated the KIP-113 to remove API > >> > alterReplicaDir > >> > > > from AdminClient interface. > >> > > > > >> > > > > >> > > > > > >> > > > > > >> > > > > > I assume that the logic you mentioned is > >> > > > > > "movement of replica to the specified log directory". This > logic > >> > (or > >> > > > the > >> > > > > > implementation of this logic) resides mainly in the > >> > KafkaAdminClient > >> > > > and > >> > > > > > broker. The script only needs to parse the json file as > >> appropriate > >> > > and > >> > > > > > call the new API in AdminClient as appropriate. The logic in > the > >> > > script > >> > > > > is > >> > > > > > therefore not much and can be easily moved to other classes if > >> > > needed. > >> > > > > > > >> > > > > > Can you clarify why this logic, i.e. movement of replica to > the > >> > > > specified > >> > > > > > log directory, needs to be moved to controller in KIP-179? I > >> think > >> > it > >> > > > can > >> > > > > > still be done in the script and controller should not need to > >> worry > >> > > > about > >> > > > > > log directory of any replica. > >> > > > > > > >> > > > > > Thanks, > >> > > > > > Dong > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > > > >
