Hey Jun, It just came to me that you may be assuming that folower_lag = HW - follower_LEO. If that is the case, then we need to have new request/response to retrieve this lag since the DescribeDirsResponse doesn't even include HW. It seems that KIP-179 does not explicitly specify the definition of this lag.
I have been assuming that follow_lag = leader_LEO - follower_LEO given that the request is used to query the reassignment status. Strictly speaking the difference between leader_LEO and the HW is limited by the amount of data produced in KafkaConfig.replicaLagTimeMaxMs, which is 10 seconds. I also assumed that 10 seconds is probably not a big deal given the typical time length of the reassignment. Thanks, Dong On Wed, Aug 9, 2017 at 4:40 PM, Dong Lin <lindon...@gmail.com> wrote: > Hey Jun, > > If I understand you right, you are suggesting that, in the case when there > is continuous incoming traffic, the approach in the KIP-179 will report lag > as 0 whereas the approach using DescribeDirsRequest will report lag as > non-zero. But I think the approach in KIP-179 will also report non-zero lag > when there is continuous traffic. This is because at the time the leader > receives ReplicaStatusRequest, it is likely that some data has been > appended to the partition after the last FetchRequest from the follower. > Does this make sense? > > Thanks, > Dong > > > > On Wed, Aug 9, 2017 at 4:24 PM, Jun Rao <j...@confluent.io> wrote: > >> Hi, Dong, >> >> As for whether to return LEO or lag, my point was the following. What you >> are concerned about is that an in-sync replica could become out of sync >> again. However, the more common case is that once a replica is caught up, >> it will stay in sync afterwards. In that case, once the reassignment >> process completes, if we report based on lag, all lags will be 0. If we >> report based on Math.max(0, leaderLEO - followerLEO), the value may not be >> 0 if there is continuous incoming traffic, which will be confusing to the >> user. >> >> Thanks, >> >> Jun >> >> >> >> On Tue, Aug 8, 2017 at 6:26 PM, Dong Lin <lindon...@gmail.com> wrote: >> >> > Hey Jun, >> > >> > Thanks for the comment! >> > >> > Yes, it should work. The tool can send request to any broker and broker >> can >> > just write the reassignment znode. My previous intuition is that it may >> be >> > better to only send this request to controller. But I don't have good >> > reasons for this restriction. >> > >> > My intuition is that we can keep them separate as well. Becket and I >> have >> > discussed this both offline and in https://github.com/apache/ >> > kafka/pull/3621. >> > Currently I don't have a strong opinion on this and I am open to using >> only >> > one API to do both if someone can come up with a reasonable API >> signature >> > for this method. For now I have added the method alterReplicaDir() in >> > KafkaAdminClient instead of the AdminClient interface so that the >> > reassignment script can use this method without concluding what the API >> > would look like in AdminClient in the future. >> > >> > Regarding DescribeDirsResponse, I think it is probably OK to have >> slightly >> > more lag. The script can calculate the lag of the follower replica as >> > Math.max(0, leaderLEO - followerLEO). I agree that it will be slightly >> less >> > accurate than the current approach in KIP-179. But even with the current >> > approach in KIP-179, the result provided by the script is an >> approximation >> > anyway, since there is delay from the time that leader returns response >> to >> > the time that the script collects response from all brokers and prints >> > result to user. I think if the slight difference in the accuracy between >> > the two approaches does not make a difference to the intended use-case >> of >> > this API, then we probably want to re-use the exiting request/response >> to >> > keep the protocol simple. >> > >> > Thanks, >> > Dong >> > >> > >> > >> > >> > >> > On Tue, Aug 8, 2017 at 5:56 PM, Jun Rao <j...@confluent.io> wrote: >> > >> > > Hi, Dong, >> > > >> > > I think Tom was suggesting to have the AlterTopicsRequest sent to any >> > > broker, which just writes the reassignment json to ZK. The controller >> > will >> > > pick up the reassignment and act on it as usual. This should work, >> right? >> > > >> > > Having a separate AlterTopicsRequest and AlterReplicaDirRequest seems >> > > simpler to me. The former is handled by the controller and the latter >> is >> > > handled by the affected broker. They don't always have to be done >> > together. >> > > Merging the two into a single request probably will make both the api >> and >> > > the implementation a bit more complicated. If we do keep the two >> separate >> > > requests, it seems that we should just add AlterReplicaDirRequest to >> the >> > > AdminClient interface? >> > > >> > > Now, regarding DescribeDirsResponse. I agree that it can be used for >> the >> > > status reporting in KIP-179 as well. However, it seems that reporting >> the >> > > log end offset of each replica may not be easy to use. The log end >> offset >> > > will be returned from different brokers in slightly different time. If >> > > there is continuous producing traffic, the difference in log end >> offset >> > > between the leader and the follower could be larger than 0 even if the >> > > follower has fully caught up. I am wondering if it's better to instead >> > > return the lag in offset per replica. This way, the status can >> probably >> > be >> > > reported more reliably. >> > > >> > > Thanks, >> > > >> > > Jun >> > > >> > > On Tue, Aug 8, 2017 at 11:23 AM, Dong Lin <lindon...@gmail.com> >> wrote: >> > > >> > > > Hey Tom, >> > > > >> > > > Thanks for the quick reply. Please see my comment inline. >> > > > >> > > > On Tue, Aug 8, 2017 at 11:06 AM, Tom Bentley <t.j.bent...@gmail.com >> > >> > > > wrote: >> > > > >> > > > > Hi Dong, >> > > > > >> > > > > Replies inline, as usual >> > > > > >> > > > > > As I originally envisaged it, KIP-179's support for reassigning >> > > > > partitions >> > > > > > >> > > > > > would have more-or-less taken the logic currently in the >> > > > > > > ReassignPartitionsCommand (that is, writing JSON to the >> > > > > > > ZkUtils.ReassignPartitionsPath) >> > > > > > > and put it behind a suitable network protocol API. Thus it >> > wouldn't >> > > > > > matter >> > > > > > > which broker received the protocol call: It would be acted on >> by >> > > > > brokers >> > > > > > > being notified of the change in the ZK path, just as >> currently. >> > > This >> > > > > > would >> > > > > > > have kept the ReassignPartitionsCommand relatively simple, as >> it >> > > > > > currently >> > > > > > > is. >> > > > > > > >> > > > > > >> > > > > > I am not sure I fully understand your proposal. I think you are >> > > saying >> > > > > that >> > > > > > any broker can receive and handle the AlterTopicRequest. >> > > > > >> > > > > >> > > > > That's right. >> > > > > >> > > > > >> > > > > > Let's say a >> > > > > > non-controller broker received AlterTopicRequest, is this broker >> > > going >> > > > to >> > > > > > send LeaderAndIsrRequest to other brokers? Or is this broker >> create >> > > the >> > > > > > reassignment znode in zookeper? >> > > > > >> > > > > >> > > > > Exactly: It's going to write some JSON to the relevant znode. >> Other >> > > > brokers >> > > > > will get notified by zk when the contents of this znode changes, >> and >> > do >> > > > as >> > > > > they do now. This is what the tool/script does now. >> > > > > >> > > > > I will confess that I don't completely understand the role of >> > > > > LeaderAndIsrRequest, since the current code just seems to write to >> > the >> > > > > znode do get the brokers to do the reassignment. If you could >> explain >> > > the >> > > > > role of LeaderAndIsrRequest that would be great. >> > > > > >> > > > >> > > > Currently only the controller will listen to the reassignment znode >> and >> > > > sends LeaderAndIsrRequest and StopReplicaRequest to brokers in >> order to >> > > > complete reassignment. Brokers won't need to listen to zookeeper for >> > any >> > > > reassignment -- brokers only reacts to the request from controller. >> > > > Currently Kafka's design replies a lot on the controller to keep a >> > > > consistent view of who are the leader of partitions and what is the >> ISR >> > > > etc. It will be a pretty drastic change, if not impossible, for the >> > > script >> > > > to reassign partitions without going through controller. >> > > > >> > > > Thus I think it is likely that your AlterTopicsRequest can only be >> sent >> > > to >> > > > controller. Then the controller can create the reassignment znode in >> > > > zookeeper so that the information is persisted across controller >> fail >> > > over. >> > > > I haven't think through this in detail though. >> > > > >> > > > >> > > > >> > > > > >> > > > > >> > > > > > I may have missed it. But I couldn't find >> > > > > > the explanation of AlterTopicRequest handling in KIP-179. >> > > > > > >> > > > > >> > > > > You're right, it doesn't go into that much detail. I will fix >> that. >> > > > > >> > > > > >> > > > > > > >> > > > > > > KIP-113 is obviously seeking to make more radical changes. The >> > > > > algorithm >> > > > > > > described for moving a replica to a particular directory on a >> > > > different >> > > > > > > broker ( >> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP- >> > > > > > > 113%3A+Support+replicas+movement+between+log+ >> > directories#KIP-113: >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2) >> > > > > > > Howtoreassignreplicabetweenlogdirectoriesacrossbrokers >> > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP- >> > > > > > > 113%3A+Support+replicas+movement+between+log+ >> > directories#KIP-113: >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2% >> > > > > > > 29Howtoreassignreplicabetweenlogdirectoriesacrossbrokers>) >> > > > > > > involves both sending AlterReplicaDirRequest to "the" broker >> (the >> > > > > > receiving >> > > > > > > broker, I assume, but it's not spelled out), _as well as_ >> writing >> > > to >> > > > > the >> > > > > > ZK >> > > > > > > node. >> > > > > > > >> > > > > > > This assumes the script (ReassignPartitionsCommand) has direct >> > > access >> > > > > to >> > > > > > > ZooKeeper, which is what KIP-179 is seeking to deprecate. It >> > seems >> > > a >> > > > > > waste >> > > > > > > of time to put the logic in the script as part of KIP-113, >> only >> > for >> > > > > > KIP-179 >> > > > > > > to have to move it back to the controller. >> > > > > > > >> > > > > > >> > > > > > I am not sure I understand what you mean by "It seems a waste of >> > time >> > > > to >> > > > > > put the logic in the script as part of KIP-113, only for >> KIP-179 to >> > > > have >> > > > > to >> > > > > > move it back to the controller". >> > > > > >> > > > > >> > > > > Sorry, I misunderstood slightly what you were proposing in >> KIP-113, >> > so >> > > > the >> > > > > "waste of time" comment isn't quite right, but I'm still not >> > convinced >> > > > that >> > > > > KIP-113+KIP-179 (in its current form) ends with a satisfactory >> > result. >> > > > > >> > > > > Let me elaborate... KIP-113 says that to support reassigning >> replica >> > > > > between log directories across brokers: >> > > > > * ... >> > > > > * The script sends AlterReplicaDirRequest to those brokers which >> need >> > > to >> > > > > move replicas... >> > > > > * The script creates reassignment znode in zookeeper. >> > > > > * The script retries AlterReplicaDirRequest to those broker... >> > > > > * ... >> > > > > >> > > > > So the ReassignPartitionsCommand still talks to ZK directly, but >> now >> > > it's >> > > > > bracketed by calls to the AdminClient. KIP-179 could replace that >> > > talking >> > > > > to ZK directly with a new call to the AdminClient. But then we've >> > got a >> > > > > pretty weird API, where we have to make three AdminClient calls >> (two >> > of >> > > > > them to the same method), to move a replica. I don't really >> > understand >> > > > why >> > > > > the admin client can't present a single API method to achieve >> this, >> > and >> > > > > encapsulate on the server side the careful sequence of events >> > necessary >> > > > to >> > > > > coordinate the movement. I understood this position is what Ismael >> > was >> > > > > advocating when he said it was better to put the logic in the >> > > controller >> > > > > than spread between the script and the controller. But maybe I >> > > > > misunderstood him. >> > > > > >> > > > >> > > > I have some concern with putting this logic in controller which can >> be >> > > > found in my previous email. Before that is addressed, the script (or >> > > > AdminClient) seems to be the simplest place to have this logic. >> > > > >> > > > I agree it is better to have a single API to achieve both partition >> and >> > > > replica -> dir assignment. I think it is likely that we will find a >> > good >> > > > API to do both. I have updated the KIP-113 to remove API >> > alterReplicaDir >> > > > from AdminClient interface. >> > > > >> > > > >> > > > > >> > > > > >> > > > > > I assume that the logic you mentioned is >> > > > > > "movement of replica to the specified log directory". This logic >> > (or >> > > > the >> > > > > > implementation of this logic) resides mainly in the >> > KafkaAdminClient >> > > > and >> > > > > > broker. The script only needs to parse the json file as >> appropriate >> > > and >> > > > > > call the new API in AdminClient as appropriate. The logic in the >> > > script >> > > > > is >> > > > > > therefore not much and can be easily moved to other classes if >> > > needed. >> > > > > > >> > > > > > Can you clarify why this logic, i.e. movement of replica to the >> > > > specified >> > > > > > log directory, needs to be moved to controller in KIP-179? I >> think >> > it >> > > > can >> > > > > > still be done in the script and controller should not need to >> worry >> > > > about >> > > > > > log directory of any replica. >> > > > > > >> > > > > > Thanks, >> > > > > > Dong >> > > > > > >> > > > > >> > > > >> > > >> > >> > >