Hey Jun, This is a very good idea. I have updated the KIP-113 so that DescribeDirResponse returns lag instead of LEO. If the replica is not a temporary replica, then lag = max(0, HW - LEO). Otherwise, lag = primary Replica's LEO - temporary Replica's LEO.
Thanks! Dong On Thu, Aug 10, 2017 at 10:21 AM, Jun Rao <j...@confluent.io> wrote: > Hi, Tom, Dong, > > A couple of comments on that. > > 1. I think we can unify the reporting of lags. Basically, the lag will be > reported on every replica (temporary or permanent), not just at the leader > replica. If it's permanent, lag is max(0, HW - LEO) as it is now. > Otherwise, lag is (LEO of permanent replica - LEO of temporary replica). > That way, it seems that we can use a single request to monitor the progress > of both inter and intra replica movement and it would be more accurate than > relying on LEO directly. > > 2. Tom, note that currently, the LeaderAndIsrRequest doesn't specify the > log dir. So, I am not sure in your new proposal, how the log dir info is > communicated to all brokers. Is the broker receiving the > ReassignPartitionsRequest > going to forward that to all brokers? > > Thanks, > > Jun > > > > On Thu, Aug 10, 2017 at 6:57 AM, Tom Bentley <t.j.bent...@gmail.com> > wrote: > > > I've spent some time thinking about KIP-179 and KIP-113, the proposed > > algorithms and APIs, and trying to weigh the pros and cons of various > > alternative options. > > > > I think Dong's reasons for the algorithm for inter-broker move in KIP-113 > > make a lot of sense. I don't think it would be at all simple to try to > > change that algorithm to one where the whole thing can be triggered by a > > single call to an AdminClient method. So I guess we should try to keep as > > much of that algorithm as possible. > > > > KIP-179 will need to change this step > > > > - The script creates reassignment znode in zookeeper. > > > > > > > with an AdminClient API call. This call can the same one as currently > > specified in KIP-179 -- reassignPartitions() -- except the argument needs > > to take into account the need to pass log dirs as well as broker ids. > Thus > > I would suggest > > > > ReassignPartitionsResult reassignPartitions(Map<TopicPartition, > > List<ReplicAssignment>> assignments) > > > > where: > > > > class ReplicAssignment { > > int broker() > > String logDirectory()// can be null > > } > > > > (This is just a Java representation of the reassignment json in KIP-113, > > which itself is a superset of the reassignment json currently in use) > > > > The corresponding protocol would look like this: > > > > ReassignPartitionsRequest => timeout validate_only log_dirs > > [topic_assignment] > > timout => int32 > > validate_only => boolean > > log_dirs => [string] > > topic_assignment => topic [partition_assignment] > > topic => string > > partition_assignment => partition [replica] > > partition => int32 > > replica => broker log_dir_index > > broker => int32 > > log_dir_index => int16 > > > > The purpose of log_dirs is to serialize each log dir in the request only > > once. These are then referred to by index in log_dir_index. The > > log_dir_index can be -1, which means the caller doesn't care which log > > directory should be used on the receiving broker. > > > > This request can be sent to *any* broker. The broker which receives a > > ReassignPartitionsRequest essentially converts it into reassignment JSON > > and writes that JSON to the znode, then returns a > > ReassignPartitionsResponse: > > > > ReassignPartitionsResponse => throttle_time_ms > > [topic_assignment_result] > > throttle_time_ms => INT32 > > log_dirs => [string] > > topic_assignment_result => topic partition_assignment_result > > topic => STRING > > partition_assignment_result => partition [replica_result] > > partition => int32 > > replica_result => broker log_dir_index error_code > > error_message > > broker => int32 > > log_dir_index => int16 > > error_code => INT16 > > error_message => NULLABLE_STRING > > > > This is using the same encoding scheme as wrt log_dirs as described > above. > > > > Meanwhile the controller is notified by ZK of the change in value of the > > znode and proceeds, as currently, by sending LeaderAndIsrRequest and > > StopReplicaRequest in order to complete the reassignments. > > > > The remaining problem is around how to measure progress of reassignment. > As > > mentioned in the email I wrote this morning, I think we really need two > > different lag calculations if we're using the lag to measure progress and > > we want the property that lag=0 means reassignment has finished. The > > problem with that, I now realise, is the script might be called with > > reassignments which are a mix of: > > > > * inter-broker moves without a log dir (=> use HW-replicaLEO) > > * inter-broker moves with a log dir (=> use HW-replicaLEO) > > * intra-broker moves with a log dir (=> use .log_LEO - .move_LEO) > > > > And if there were two APIs we'd end up needing to make both kinds of > query > > to each broker in the cluster. This morning I said: > > > > But AFAICS this observation doesn't really help much in terms of the APIs > > > concerned though. Since the requests would still need to go to > different > > > brokers depending on which kind of movement is being performed. > > > > > > > But I wonder if that's *really* such a problem: In the case of an > > inter-broker move we just need to ask the leader, and in the case of an > > intra-broker move we just have to ask that broker. In generally we'd > need a > > single request to each broker in the cluster. Then each broker would need > > to service that request, but presumably it's just pulling a number out > of a > > ConcurrentHashMap, which is updated by the replica movement code in each > of > > the two cases (inter-broker and intra-broker). WDYT? > > > > Assuming no one can see any glaring holes in what I'm proposing here, or > > wants to suggest a workable alternative set of APIs and algorithms, then > > I'll update KIP-179 to this effect. > > > > Thanks for taking the time to read this far! > > > > Tom > > > > On 10 August 2017 at 11:56, Tom Bentley <t.j.bent...@gmail.com> wrote: > > > > > Hi Dong and Jun, > > > > > > It seems that KIP-179 does not explicitly specify the definition of > this > > >> lag. > > > > > > > > > Given that the definition of "caught up" is "is the replica in the > ISR?", > > > I found the code in Partition.maybeExpandIsr() which decides whether a > > > replica should be added to the to the ISR and it uses > > replica.logEndOffset. > > > offsetDiff(leaderHW) >= 0, so for this purpose I would define the lag > as > > > max(leaderHW - replicaLEO, 0). I think we agree this should work for > > > inter-broker movement, where the leader knows these quantities. > > > > > > As Dong says, this doesn't work for the intra-broker case: > > > > > > Note that we can not calculate lag as max(0, HW - LEO) > > >> because we still need the difference between two lags to measure the > > >> progress of intra-broker replica movement. > > >> > > > > > > It seems to me that the intra-broker case is actually a special case of > > > the inter-broker case. Conceptually with an intra-broker move the > ".log" > > > replica is the leader, the ".move" directory is the follower, the ISR > is > > > the singleton containing the leader, thus the HW if the LEO of the > > ".log". > > > Viewed in this way, Dong's method of leaderLEO - followerLEO is the > same > > > thing for the intra-broker case as HW-LEO is for the inter-broker case. > > > > > > But AFAICS this observation doesn't really help much in terms of the > APIs > > > concerned though. Since the requests would still need to go to > different > > > brokers depending on which kind of movement is being performed. > > > > > > So perhaps this is another case where maybe it makes sense to keep the > > two > > > APIs separate, one API for measuring inter-broker movement progress an > > > another for the intra-broker case. WDYT? > > > > > > Thanks for the continuing discussion on this! > > > > > > Tom > > > > > > > > > On 10 August 2017 at 05:28, Dong Lin <lindon...@gmail.com> wrote: > > > > > >> Hey Jun, > > >> > > >> I have been thinking about whether it is better to return lag (i.e. > HW - > > >> LEO) instead of LEO. Note that the lag in the DescribeDirsResponse may > > be > > >> negative if LEO > HW. It will almost always be negative for leader and > > >> in-sync replicas. Note that we can not calculate lag as max(0, HW - > LEO) > > >> because we still need the difference between two lags to measure the > > >> progress of intra-broker replica movement. The AdminClient API can > > choose > > >> to return max(0, HW - LEO) depending on whether it is used for > tracking > > >> progress of inter-broker reassignment or intra-broker movement. Is it > > OK? > > >> If so, I will update the KIP-113 accordingly to return lag in the > > >> DescribeDirsResponse . > > >> > > >> Thanks, > > >> Dong > > >> > > >> > > >> > > >> <https://www.avast.com/sig-email?utm_medium=email&utm_source > > >> =link&utm_campaign=sig-email&utm_content=webmail&utm_term=icon> > > >> Virus-free. > > >> www.avast.com > > >> <https://www.avast.com/sig-email?utm_medium=email&utm_source > > >> =link&utm_campaign=sig-email&utm_content=webmail&utm_term=link> > > >> <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> > > >> > > >> On Wed, Aug 9, 2017 at 5:06 PM, Jun Rao <j...@confluent.io> wrote: > > >> > > >> > Hi, Dong, > > >> > > > >> > Yes, the lag in a replica is calculated as the difference of LEO of > > the > > >> > replica and the HW. So, as long as a replica is in sync, the lag is > > >> always > > >> > 0. > > >> > > > >> > So, I was suggesting to return lag instead of LEO in > > >> DescribeDirsResponse > > >> > for each replica. I am not sure if we need to return HW though. > > >> > > > >> > Thanks, > > >> > > > >> > Jun > > >> > > > >> > On Wed, Aug 9, 2017 at 5:01 PM, Dong Lin <lindon...@gmail.com> > wrote: > > >> > > > >> > > Hey Jun, > > >> > > > > >> > > It just came to me that you may be assuming that folower_lag = HW > - > > >> > > follower_LEO. If that is the case, then we need to have new > > >> > > request/response to retrieve this lag since the > DescribeDirsResponse > > >> > > doesn't even include HW. It seems that KIP-179 does not explicitly > > >> > specify > > >> > > the definition of this lag. > > >> > > > > >> > > I have been assuming that follow_lag = leader_LEO - follower_LEO > > given > > >> > that > > >> > > the request is used to query the reassignment status. Strictly > > >> speaking > > >> > the > > >> > > difference between leader_LEO and the HW is limited by the amount > of > > >> data > > >> > > produced in KafkaConfig.replicaLagTimeMaxMs, which is 10 > seconds. I > > >> also > > >> > > assumed that 10 seconds is probably not a big deal given the > typical > > >> time > > >> > > length of the reassignment. > > >> > > > > >> > > Thanks, > > >> > > Dong > > >> > > > > >> > > On Wed, Aug 9, 2017 at 4:40 PM, Dong Lin <lindon...@gmail.com> > > wrote: > > >> > > > > >> > > > Hey Jun, > > >> > > > > > >> > > > If I understand you right, you are suggesting that, in the case > > when > > >> > > there > > >> > > > is continuous incoming traffic, the approach in the KIP-179 will > > >> report > > >> > > lag > > >> > > > as 0 whereas the approach using DescribeDirsRequest will report > > lag > > >> as > > >> > > > non-zero. But I think the approach in KIP-179 will also report > > >> non-zero > > >> > > lag > > >> > > > when there is continuous traffic. This is because at the time > the > > >> > leader > > >> > > > receives ReplicaStatusRequest, it is likely that some data has > > been > > >> > > > appended to the partition after the last FetchRequest from the > > >> > follower. > > >> > > > Does this make sense? > > >> > > > > > >> > > > Thanks, > > >> > > > Dong > > >> > > > > > >> > > > > > >> > > > > > >> > > > On Wed, Aug 9, 2017 at 4:24 PM, Jun Rao <j...@confluent.io> > wrote: > > >> > > > > > >> > > >> Hi, Dong, > > >> > > >> > > >> > > >> As for whether to return LEO or lag, my point was the > following. > > >> What > > >> > > you > > >> > > >> are concerned about is that an in-sync replica could become out > > of > > >> > sync > > >> > > >> again. However, the more common case is that once a replica is > > >> caught > > >> > > up, > > >> > > >> it will stay in sync afterwards. In that case, once the > > >> reassignment > > >> > > >> process completes, if we report based on lag, all lags will be > 0. > > >> If > > >> > we > > >> > > >> report based on Math.max(0, leaderLEO - followerLEO), the value > > may > > >> > not > > >> > > be > > >> > > >> 0 if there is continuous incoming traffic, which will be > > confusing > > >> to > > >> > > the > > >> > > >> user. > > >> > > >> > > >> > > >> Thanks, > > >> > > >> > > >> > > >> Jun > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> On Tue, Aug 8, 2017 at 6:26 PM, Dong Lin <lindon...@gmail.com> > > >> wrote: > > >> > > >> > > >> > > >> > Hey Jun, > > >> > > >> > > > >> > > >> > Thanks for the comment! > > >> > > >> > > > >> > > >> > Yes, it should work. The tool can send request to any broker > > and > > >> > > broker > > >> > > >> can > > >> > > >> > just write the reassignment znode. My previous intuition is > > that > > >> it > > >> > > may > > >> > > >> be > > >> > > >> > better to only send this request to controller. But I don't > > have > > >> > good > > >> > > >> > reasons for this restriction. > > >> > > >> > > > >> > > >> > My intuition is that we can keep them separate as well. > Becket > > >> and I > > >> > > >> have > > >> > > >> > discussed this both offline and in > https://github.com/apache/ > > >> > > >> > kafka/pull/3621. > > >> > > >> > Currently I don't have a strong opinion on this and I am open > > to > > >> > using > > >> > > >> only > > >> > > >> > one API to do both if someone can come up with a reasonable > API > > >> > > >> signature > > >> > > >> > for this method. For now I have added the method > > >> alterReplicaDir() > > >> > in > > >> > > >> > KafkaAdminClient instead of the AdminClient interface so that > > the > > >> > > >> > reassignment script can use this method without concluding > what > > >> the > > >> > > API > > >> > > >> > would look like in AdminClient in the future. > > >> > > >> > > > >> > > >> > Regarding DescribeDirsResponse, I think it is probably OK to > > have > > >> > > >> slightly > > >> > > >> > more lag. The script can calculate the lag of the follower > > >> replica > > >> > as > > >> > > >> > Math.max(0, leaderLEO - followerLEO). I agree that it will be > > >> > slightly > > >> > > >> less > > >> > > >> > accurate than the current approach in KIP-179. But even with > > the > > >> > > current > > >> > > >> > approach in KIP-179, the result provided by the script is an > > >> > > >> approximation > > >> > > >> > anyway, since there is delay from the time that leader > returns > > >> > > response > > >> > > >> to > > >> > > >> > the time that the script collects response from all brokers > and > > >> > prints > > >> > > >> > result to user. I think if the slight difference in the > > accuracy > > >> > > between > > >> > > >> > the two approaches does not make a difference to the intended > > >> > use-case > > >> > > >> of > > >> > > >> > this API, then we probably want to re-use the exiting > > >> > request/response > > >> > > >> to > > >> > > >> > keep the protocol simple. > > >> > > >> > > > >> > > >> > Thanks, > > >> > > >> > Dong > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > On Tue, Aug 8, 2017 at 5:56 PM, Jun Rao <j...@confluent.io> > > >> wrote: > > >> > > >> > > > >> > > >> > > Hi, Dong, > > >> > > >> > > > > >> > > >> > > I think Tom was suggesting to have the AlterTopicsRequest > > sent > > >> to > > >> > > any > > >> > > >> > > broker, which just writes the reassignment json to ZK. The > > >> > > controller > > >> > > >> > will > > >> > > >> > > pick up the reassignment and act on it as usual. This > should > > >> work, > > >> > > >> right? > > >> > > >> > > > > >> > > >> > > Having a separate AlterTopicsRequest and > > AlterReplicaDirRequest > > >> > > seems > > >> > > >> > > simpler to me. The former is handled by the controller and > > the > > >> > > latter > > >> > > >> is > > >> > > >> > > handled by the affected broker. They don't always have to > be > > >> done > > >> > > >> > together. > > >> > > >> > > Merging the two into a single request probably will make > both > > >> the > > >> > > api > > >> > > >> and > > >> > > >> > > the implementation a bit more complicated. If we do keep > the > > >> two > > >> > > >> separate > > >> > > >> > > requests, it seems that we should just add > > >> AlterReplicaDirRequest > > >> > to > > >> > > >> the > > >> > > >> > > AdminClient interface? > > >> > > >> > > > > >> > > >> > > Now, regarding DescribeDirsResponse. I agree that it can be > > >> used > > >> > for > > >> > > >> the > > >> > > >> > > status reporting in KIP-179 as well. However, it seems that > > >> > > reporting > > >> > > >> the > > >> > > >> > > log end offset of each replica may not be easy to use. The > > log > > >> end > > >> > > >> offset > > >> > > >> > > will be returned from different brokers in slightly > different > > >> > time. > > >> > > If > > >> > > >> > > there is continuous producing traffic, the difference in > log > > >> end > > >> > > >> offset > > >> > > >> > > between the leader and the follower could be larger than 0 > > >> even if > > >> > > the > > >> > > >> > > follower has fully caught up. I am wondering if it's better > > to > > >> > > instead > > >> > > >> > > return the lag in offset per replica. This way, the status > > can > > >> > > >> probably > > >> > > >> > be > > >> > > >> > > reported more reliably. > > >> > > >> > > > > >> > > >> > > Thanks, > > >> > > >> > > > > >> > > >> > > Jun > > >> > > >> > > > > >> > > >> > > On Tue, Aug 8, 2017 at 11:23 AM, Dong Lin < > > lindon...@gmail.com > > >> > > > >> > > >> wrote: > > >> > > >> > > > > >> > > >> > > > Hey Tom, > > >> > > >> > > > > > >> > > >> > > > Thanks for the quick reply. Please see my comment inline. > > >> > > >> > > > > > >> > > >> > > > On Tue, Aug 8, 2017 at 11:06 AM, Tom Bentley < > > >> > > t.j.bent...@gmail.com > > >> > > >> > > > >> > > >> > > > wrote: > > >> > > >> > > > > > >> > > >> > > > > Hi Dong, > > >> > > >> > > > > > > >> > > >> > > > > Replies inline, as usual > > >> > > >> > > > > > > >> > > >> > > > > > As I originally envisaged it, KIP-179's support for > > >> > > reassigning > > >> > > >> > > > > partitions > > >> > > >> > > > > > > > >> > > >> > > > > > would have more-or-less taken the logic currently in > > the > > >> > > >> > > > > > > ReassignPartitionsCommand (that is, writing JSON to > > the > > >> > > >> > > > > > > ZkUtils.ReassignPartitionsPath) > > >> > > >> > > > > > > and put it behind a suitable network protocol API. > > >> Thus it > > >> > > >> > wouldn't > > >> > > >> > > > > > matter > > >> > > >> > > > > > > which broker received the protocol call: It would > be > > >> acted > > >> > > on > > >> > > >> by > > >> > > >> > > > > brokers > > >> > > >> > > > > > > being notified of the change in the ZK path, just > as > > >> > > >> currently. > > >> > > >> > > This > > >> > > >> > > > > > would > > >> > > >> > > > > > > have kept the ReassignPartitionsCommand relatively > > >> simple, > > >> > > as > > >> > > >> it > > >> > > >> > > > > > currently > > >> > > >> > > > > > > is. > > >> > > >> > > > > > > > > >> > > >> > > > > > > > >> > > >> > > > > > I am not sure I fully understand your proposal. I > think > > >> you > > >> > > are > > >> > > >> > > saying > > >> > > >> > > > > that > > >> > > >> > > > > > any broker can receive and handle the > > AlterTopicRequest. > > >> > > >> > > > > > > >> > > >> > > > > > > >> > > >> > > > > That's right. > > >> > > >> > > > > > > >> > > >> > > > > > > >> > > >> > > > > > Let's say a > > >> > > >> > > > > > non-controller broker received AlterTopicRequest, is > > this > > >> > > broker > > >> > > >> > > going > > >> > > >> > > > to > > >> > > >> > > > > > send LeaderAndIsrRequest to other brokers? Or is this > > >> broker > > >> > > >> create > > >> > > >> > > the > > >> > > >> > > > > > reassignment znode in zookeper? > > >> > > >> > > > > > > >> > > >> > > > > > > >> > > >> > > > > Exactly: It's going to write some JSON to the relevant > > >> znode. > > >> > > >> Other > > >> > > >> > > > brokers > > >> > > >> > > > > will get notified by zk when the contents of this znode > > >> > changes, > > >> > > >> and > > >> > > >> > do > > >> > > >> > > > as > > >> > > >> > > > > they do now. This is what the tool/script does now. > > >> > > >> > > > > > > >> > > >> > > > > I will confess that I don't completely understand the > > role > > >> of > > >> > > >> > > > > LeaderAndIsrRequest, since the current code just seems > to > > >> > write > > >> > > to > > >> > > >> > the > > >> > > >> > > > > znode do get the brokers to do the reassignment. If you > > >> could > > >> > > >> explain > > >> > > >> > > the > > >> > > >> > > > > role of LeaderAndIsrRequest that would be great. > > >> > > >> > > > > > > >> > > >> > > > > > >> > > >> > > > Currently only the controller will listen to the > > reassignment > > >> > > znode > > >> > > >> and > > >> > > >> > > > sends LeaderAndIsrRequest and StopReplicaRequest to > brokers > > >> in > > >> > > >> order to > > >> > > >> > > > complete reassignment. Brokers won't need to listen to > > >> zookeeper > > >> > > for > > >> > > >> > any > > >> > > >> > > > reassignment -- brokers only reacts to the request from > > >> > > controller. > > >> > > >> > > > Currently Kafka's design replies a lot on the controller > to > > >> > keep a > > >> > > >> > > > consistent view of who are the leader of partitions and > > what > > >> is > > >> > > the > > >> > > >> ISR > > >> > > >> > > > etc. It will be a pretty drastic change, if not > impossible, > > >> for > > >> > > the > > >> > > >> > > script > > >> > > >> > > > to reassign partitions without going through controller. > > >> > > >> > > > > > >> > > >> > > > Thus I think it is likely that your AlterTopicsRequest > can > > >> only > > >> > be > > >> > > >> sent > > >> > > >> > > to > > >> > > >> > > > controller. Then the controller can create the > reassignment > > >> > znode > > >> > > in > > >> > > >> > > > zookeeper so that the information is persisted across > > >> controller > > >> > > >> fail > > >> > > >> > > over. > > >> > > >> > > > I haven't think through this in detail though. > > >> > > >> > > > > > >> > > >> > > > > > >> > > >> > > > > > >> > > >> > > > > > > >> > > >> > > > > > > >> > > >> > > > > > I may have missed it. But I couldn't find > > >> > > >> > > > > > the explanation of AlterTopicRequest handling in > > KIP-179. > > >> > > >> > > > > > > > >> > > >> > > > > > > >> > > >> > > > > You're right, it doesn't go into that much detail. I > will > > >> fix > > >> > > >> that. > > >> > > >> > > > > > > >> > > >> > > > > > > >> > > >> > > > > > > > > >> > > >> > > > > > > KIP-113 is obviously seeking to make more radical > > >> changes. > > >> > > The > > >> > > >> > > > > algorithm > > >> > > >> > > > > > > described for moving a replica to a particular > > >> directory > > >> > on > > >> > > a > > >> > > >> > > > different > > >> > > >> > > > > > > broker ( > > >> > > >> > > > > > > https://cwiki.apache.org/ > > confluence/display/KAFKA/KIP- > > >> > > >> > > > > > > 113%3A+Support+replicas+movement+between+log+ > > >> > > >> > directories#KIP-113: > > >> > > >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2) > > >> > > >> > > > > > > Howtoreassignreplicabetweenlog > > directoriesacrossbrokers > > >> > > >> > > > > > > <https://cwiki.apache.org/conf > > >> luence/display/KAFKA/KIP- > > >> > > >> > > > > > > 113%3A+Support+replicas+movement+between+log+ > > >> > > >> > directories#KIP-113: > > >> > > >> > > > > > > Supportreplicasmovementbetweenlogdirectories-2% > > >> > > >> > > > > > > 29Howtoreassignreplicabetweenl > > >> > ogdirectoriesacrossbrokers>) > > >> > > >> > > > > > > involves both sending AlterReplicaDirRequest to > "the" > > >> > broker > > >> > > >> (the > > >> > > >> > > > > > receiving > > >> > > >> > > > > > > broker, I assume, but it's not spelled out), _as > well > > >> as_ > > >> > > >> writing > > >> > > >> > > to > > >> > > >> > > > > the > > >> > > >> > > > > > ZK > > >> > > >> > > > > > > node. > > >> > > >> > > > > > > > > >> > > >> > > > > > > This assumes the script (ReassignPartitionsCommand) > > has > > >> > > direct > > >> > > >> > > access > > >> > > >> > > > > to > > >> > > >> > > > > > > ZooKeeper, which is what KIP-179 is seeking to > > >> deprecate. > > >> > It > > >> > > >> > seems > > >> > > >> > > a > > >> > > >> > > > > > waste > > >> > > >> > > > > > > of time to put the logic in the script as part of > > >> KIP-113, > > >> > > >> only > > >> > > >> > for > > >> > > >> > > > > > KIP-179 > > >> > > >> > > > > > > to have to move it back to the controller. > > >> > > >> > > > > > > > > >> > > >> > > > > > > > >> > > >> > > > > > I am not sure I understand what you mean by "It > seems a > > >> > waste > > >> > > of > > >> > > >> > time > > >> > > >> > > > to > > >> > > >> > > > > > put the logic in the script as part of KIP-113, only > > for > > >> > > >> KIP-179 to > > >> > > >> > > > have > > >> > > >> > > > > to > > >> > > >> > > > > > move it back to the controller". > > >> > > >> > > > > > > >> > > >> > > > > > > >> > > >> > > > > Sorry, I misunderstood slightly what you were proposing > > in > > >> > > >> KIP-113, > > >> > > >> > so > > >> > > >> > > > the > > >> > > >> > > > > "waste of time" comment isn't quite right, but I'm > still > > >> not > > >> > > >> > convinced > > >> > > >> > > > that > > >> > > >> > > > > KIP-113+KIP-179 (in its current form) ends with a > > >> satisfactory > > >> > > >> > result. > > >> > > >> > > > > > > >> > > >> > > > > Let me elaborate... KIP-113 says that to support > > >> reassigning > > >> > > >> replica > > >> > > >> > > > > between log directories across brokers: > > >> > > >> > > > > * ... > > >> > > >> > > > > * The script sends AlterReplicaDirRequest to those > > brokers > > >> > which > > >> > > >> need > > >> > > >> > > to > > >> > > >> > > > > move replicas... > > >> > > >> > > > > * The script creates reassignment znode in zookeeper. > > >> > > >> > > > > * The script retries AlterReplicaDirRequest to those > > >> broker... > > >> > > >> > > > > * ... > > >> > > >> > > > > > > >> > > >> > > > > So the ReassignPartitionsCommand still talks to ZK > > >> directly, > > >> > but > > >> > > >> now > > >> > > >> > > it's > > >> > > >> > > > > bracketed by calls to the AdminClient. KIP-179 could > > >> replace > > >> > > that > > >> > > >> > > talking > > >> > > >> > > > > to ZK directly with a new call to the AdminClient. But > > then > > >> > > we've > > >> > > >> > got a > > >> > > >> > > > > pretty weird API, where we have to make three > AdminClient > > >> > calls > > >> > > >> (two > > >> > > >> > of > > >> > > >> > > > > them to the same method), to move a replica. I don't > > really > > >> > > >> > understand > > >> > > >> > > > why > > >> > > >> > > > > the admin client can't present a single API method to > > >> achieve > > >> > > >> this, > > >> > > >> > and > > >> > > >> > > > > encapsulate on the server side the careful sequence of > > >> events > > >> > > >> > necessary > > >> > > >> > > > to > > >> > > >> > > > > coordinate the movement. I understood this position is > > what > > >> > > Ismael > > >> > > >> > was > > >> > > >> > > > > advocating when he said it was better to put the logic > in > > >> the > > >> > > >> > > controller > > >> > > >> > > > > than spread between the script and the controller. But > > >> maybe I > > >> > > >> > > > > misunderstood him. > > >> > > >> > > > > > > >> > > >> > > > > > >> > > >> > > > I have some concern with putting this logic in controller > > >> which > > >> > > can > > >> > > >> be > > >> > > >> > > > found in my previous email. Before that is addressed, the > > >> script > > >> > > (or > > >> > > >> > > > AdminClient) seems to be the simplest place to have this > > >> logic. > > >> > > >> > > > > > >> > > >> > > > I agree it is better to have a single API to achieve both > > >> > > partition > > >> > > >> and > > >> > > >> > > > replica -> dir assignment. I think it is likely that we > > will > > >> > find > > >> > > a > > >> > > >> > good > > >> > > >> > > > API to do both. I have updated the KIP-113 to remove API > > >> > > >> > alterReplicaDir > > >> > > >> > > > from AdminClient interface. > > >> > > >> > > > > > >> > > >> > > > > > >> > > >> > > > > > > >> > > >> > > > > > > >> > > >> > > > > > I assume that the logic you mentioned is > > >> > > >> > > > > > "movement of replica to the specified log directory". > > >> This > > >> > > logic > > >> > > >> > (or > > >> > > >> > > > the > > >> > > >> > > > > > implementation of this logic) resides mainly in the > > >> > > >> > KafkaAdminClient > > >> > > >> > > > and > > >> > > >> > > > > > broker. The script only needs to parse the json file > as > > >> > > >> appropriate > > >> > > >> > > and > > >> > > >> > > > > > call the new API in AdminClient as appropriate. The > > >> logic in > > >> > > the > > >> > > >> > > script > > >> > > >> > > > > is > > >> > > >> > > > > > therefore not much and can be easily moved to other > > >> classes > > >> > if > > >> > > >> > > needed. > > >> > > >> > > > > > > > >> > > >> > > > > > Can you clarify why this logic, i.e. movement of > > replica > > >> to > > >> > > the > > >> > > >> > > > specified > > >> > > >> > > > > > log directory, needs to be moved to controller in > > >> KIP-179? I > > >> > > >> think > > >> > > >> > it > > >> > > >> > > > can > > >> > > >> > > > > > still be done in the script and controller should not > > >> need > > >> > to > > >> > > >> worry > > >> > > >> > > > about > > >> > > >> > > > > > log directory of any replica. > > >> > > >> > > > > > > > >> > > >> > > > > > Thanks, > > >> > > >> > > > > > Dong > > >> > > >> > > > > > > > >> > > >> > > > > > > >> > > >> > > > > > >> > > >> > > > > >> > > >> > > > >> > > >> > > >> > > > > > >> > > > > > >> > > > > >> > > > >> > > > > > > > > >