Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Dong Lin Wed, 09 Aug 2017 16:41:25 -0700

Hey Jun,

If I understand you right, you are suggesting that, in the case when there
is continuous incoming traffic, the approach in the KIP-179 will report lag
as 0 whereas the approach using DescribeDirsRequest will report lag as
non-zero. But I think the approach in KIP-179 will also report non-zero lag
when there is continuous traffic. This is because at the time the leader
receives ReplicaStatusRequest, it is likely that some data has been
appended to the partition after the last FetchRequest from the follower.
Does this make sense?


Thanks,
Dong



On Wed, Aug 9, 2017 at 4:24 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Dong,
>
> As for whether to return LEO or lag, my point was the following. What you
> are concerned about is that an in-sync replica could become out of sync
> again. However, the more common case is that once a replica is caught up,
> it will stay in sync afterwards. In that case, once the reassignment
> process completes, if we report based on lag, all lags will be 0. If we
> report based on Math.max(0, leaderLEO - followerLEO), the value may not be
> 0 if there is continuous incoming traffic, which will be confusing to the
> user.
>
> Thanks,
>
> Jun
>
>
>
> On Tue, Aug 8, 2017 at 6:26 PM, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hey Jun,
> >
> > Thanks for the comment!
> >
> > Yes, it should work. The tool can send request to any broker and broker
> can
> > just write the reassignment znode. My previous intuition is that it may
> be
> > better to only send this request to controller. But I don't have good
> > reasons for this restriction.
> >
> > My intuition is that we can keep them separate as well. Becket and I have
> > discussed this both offline and in https://github.com/apache/
> > kafka/pull/3621.
> > Currently I don't have a strong opinion on this and I am open to using
> only
> > one API to do both if someone can come up with a reasonable API signature
> > for this method. For now I have added the method alterReplicaDir() in
> > KafkaAdminClient instead of the AdminClient interface so that the
> > reassignment script can use this method without concluding what the API
> > would look like in AdminClient in the future.
> >
> > Regarding DescribeDirsResponse, I think it is probably OK to have
> slightly
> > more lag. The script can calculate the lag of the follower replica as
> > Math.max(0, leaderLEO - followerLEO). I agree that it will be slightly
> less
> > accurate than the current approach in KIP-179. But even with the current
> > approach in KIP-179, the result provided by the script is an
> approximation
> > anyway, since there is delay from the time that leader returns response
> to
> > the time that the script collects response from all brokers and prints
> > result to user. I think if the slight difference in the accuracy between
> > the two approaches does not make a difference to the intended use-case of
> > this API, then we probably want to re-use the exiting request/response to
> > keep the protocol simple.
> >
> > Thanks,
> > Dong
> >
> >
> >
> >
> >
> > On Tue, Aug 8, 2017 at 5:56 PM, Jun Rao <j...@confluent.io> wrote:
> >
> > > Hi, Dong,
> > >
> > > I think Tom was suggesting to have the AlterTopicsRequest sent to any
> > > broker, which just writes the reassignment json to ZK. The controller
> > will
> > > pick up the reassignment and act on it as usual. This should work,
> right?
> > >
> > > Having a separate AlterTopicsRequest and AlterReplicaDirRequest seems
> > > simpler to me. The former is handled by the controller and the latter
> is
> > > handled by the affected broker. They don't always have to be done
> > together.
> > > Merging the two into a single request probably will make both the api
> and
> > > the implementation a bit more complicated. If we do keep the two
> separate
> > > requests, it seems that we should just add AlterReplicaDirRequest to
> the
> > > AdminClient interface?
> > >
> > > Now, regarding DescribeDirsResponse. I agree that it can be used for
> the
> > > status reporting in KIP-179 as well. However, it seems that reporting
> the
> > > log end offset of each replica may not be easy to use. The log end
> offset
> > > will be returned from different brokers in slightly different time. If
> > > there is continuous producing traffic, the difference in log end offset
> > > between the leader and the follower could be larger than 0 even if the
> > > follower has fully caught up. I am wondering if it's better to instead
> > > return the lag in offset per replica. This way, the status can probably
> > be
> > > reported more reliably.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Aug 8, 2017 at 11:23 AM, Dong Lin <lindon...@gmail.com> wrote:
> > >
> > > > Hey Tom,
> > > >
> > > > Thanks for the quick reply. Please see my comment inline.
> > > >
> > > > On Tue, Aug 8, 2017 at 11:06 AM, Tom Bentley <t.j.bent...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Dong,
> > > > >
> > > > > Replies inline, as usual
> > > > >
> > > > > > As I originally envisaged it, KIP-179's support for reassigning
> > > > > partitions
> > > > > >
> > > > > > would have more-or-less taken the logic currently in the
> > > > > > > ReassignPartitionsCommand (that is, writing JSON to the
> > > > > > > ZkUtils.ReassignPartitionsPath)
> > > > > > > and put it behind a suitable network protocol API. Thus it
> > wouldn't
> > > > > > matter
> > > > > > > which broker received the protocol call: It would be acted on
> by
> > > > > brokers
> > > > > > > being notified of the change in the ZK path, just as currently.
> > > This
> > > > > > would
> > > > > > > have kept the ReassignPartitionsCommand relatively simple, as
> it
> > > > > > currently
> > > > > > > is.
> > > > > > >
> > > > > >
> > > > > > I am not sure I fully understand your proposal. I think you are
> > > saying
> > > > > that
> > > > > > any broker can receive and handle the AlterTopicRequest.
> > > > >
> > > > >
> > > > > That's right.
> > > > >
> > > > >
> > > > > > Let's say a
> > > > > > non-controller broker received AlterTopicRequest, is this broker
> > > going
> > > > to
> > > > > > send LeaderAndIsrRequest to other brokers? Or is this broker
> create
> > > the
> > > > > > reassignment znode in zookeper?
> > > > >
> > > > >
> > > > > Exactly: It's going to write some JSON to the relevant znode. Other
> > > > brokers
> > > > > will get notified by zk when the contents of this znode changes,
> and
> > do
> > > > as
> > > > > they do now. This is what the tool/script does now.
> > > > >
> > > > > I will confess that I don't completely understand the role of
> > > > > LeaderAndIsrRequest, since the current code just seems to write to
> > the
> > > > > znode do get the brokers to do the reassignment. If you could
> explain
> > > the
> > > > > role of LeaderAndIsrRequest that would be great.
> > > > >
> > > >
> > > > Currently only the controller will listen to the reassignment znode
> and
> > > > sends LeaderAndIsrRequest and StopReplicaRequest to brokers in order
> to
> > > > complete reassignment. Brokers won't need to listen to zookeeper for
> > any
> > > > reassignment -- brokers only reacts to the request from controller.
> > > > Currently Kafka's design replies a lot on the controller to keep a
> > > > consistent view of who are the leader of partitions and what is the
> ISR
> > > > etc. It will be a pretty drastic change, if not impossible, for the
> > > script
> > > > to reassign partitions without going through controller.
> > > >
> > > > Thus I think it is likely that your AlterTopicsRequest can only be
> sent
> > > to
> > > > controller. Then the controller can create the reassignment znode in
> > > > zookeeper so that the information is persisted across controller fail
> > > over.
> > > > I haven't think through this in detail though.
> > > >
> > > >
> > > >
> > > > >
> > > > >
> > > > > > I may have missed it. But I couldn't find
> > > > > > the explanation of AlterTopicRequest handling in KIP-179.
> > > > > >
> > > > >
> > > > > You're right, it doesn't go into that much detail. I will fix that.
> > > > >
> > > > >
> > > > > > >
> > > > > > > KIP-113 is obviously seeking to make more radical changes. The
> > > > > algorithm
> > > > > > > described for moving a replica to a particular directory on a
> > > > different
> > > > > > > broker (
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > > > 113%3A+Support+replicas+movement+between+log+
> > directories#KIP-113:
> > > > > > > Supportreplicasmovementbetweenlogdirectories-2)
> > > > > > > Howtoreassignreplicabetweenlogdirectoriesacrossbrokers
> > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > > > 113%3A+Support+replicas+movement+between+log+
> > directories#KIP-113:
> > > > > > > Supportreplicasmovementbetweenlogdirectories-2%
> > > > > > > 29Howtoreassignreplicabetweenlogdirectoriesacrossbrokers>)
> > > > > > > involves both sending AlterReplicaDirRequest to "the" broker
> (the
> > > > > > receiving
> > > > > > > broker, I assume, but it's not spelled out), _as well as_
> writing
> > > to
> > > > > the
> > > > > > ZK
> > > > > > > node.
> > > > > > >
> > > > > > > This assumes the script (ReassignPartitionsCommand) has direct
> > > access
> > > > > to
> > > > > > > ZooKeeper, which is what KIP-179 is seeking to deprecate. It
> > seems
> > > a
> > > > > > waste
> > > > > > > of time to put the logic in the script as part of KIP-113, only
> > for
> > > > > > KIP-179
> > > > > > > to have to move it back to the controller.
> > > > > > >
> > > > > >
> > > > > > I am not sure I understand what you mean by "It seems a waste of
> > time
> > > > to
> > > > > > put the logic in the script as part of KIP-113, only for KIP-179
> to
> > > > have
> > > > > to
> > > > > > move it back to the controller".
> > > > >
> > > > >
> > > > > Sorry, I misunderstood slightly what you were proposing in KIP-113,
> > so
> > > > the
> > > > > "waste of time" comment isn't quite right, but I'm still not
> > convinced
> > > > that
> > > > > KIP-113+KIP-179 (in its current form) ends with a satisfactory
> > result.
> > > > >
> > > > > Let me elaborate... KIP-113 says that to support reassigning
> replica
> > > > > between log directories across brokers:
> > > > > * ...
> > > > > * The script sends AlterReplicaDirRequest to those brokers which
> need
> > > to
> > > > > move replicas...
> > > > > * The script creates reassignment znode in zookeeper.
> > > > > * The script retries AlterReplicaDirRequest to those broker...
> > > > > * ...
> > > > >
> > > > > So the ReassignPartitionsCommand still talks to ZK directly, but
> now
> > > it's
> > > > > bracketed by calls to the AdminClient. KIP-179 could replace that
> > > talking
> > > > > to ZK directly with a new call to the AdminClient. But then we've
> > got a
> > > > > pretty weird API, where we have to make three AdminClient calls
> (two
> > of
> > > > > them to the same method), to move a replica. I don't really
> > understand
> > > > why
> > > > > the admin client can't present a single API method to achieve this,
> > and
> > > > > encapsulate on the server side the careful sequence of events
> > necessary
> > > > to
> > > > > coordinate the movement. I understood this position is what Ismael
> > was
> > > > > advocating when he said it was better to put the logic in the
> > > controller
> > > > > than spread between the script and the controller. But maybe I
> > > > > misunderstood him.
> > > > >
> > > >
> > > > I have some concern with putting this logic in controller which can
> be
> > > > found in my previous email. Before that is addressed, the script (or
> > > > AdminClient) seems to be the simplest place to have this logic.
> > > >
> > > > I agree it is better to have a single API to achieve both partition
> and
> > > > replica -> dir assignment. I think it is likely that we will find a
> > good
> > > > API to do both. I have updated the KIP-113 to remove API
> > alterReplicaDir
> > > > from AdminClient interface.
> > > >
> > > >
> > > > >
> > > > >
> > > > > > I assume that the logic you mentioned is
> > > > > > "movement of replica to the specified log directory". This logic
> > (or
> > > > the
> > > > > > implementation of this logic) resides mainly in the
> > KafkaAdminClient
> > > > and
> > > > > > broker. The script only needs to parse the json file as
> appropriate
> > > and
> > > > > > call the new API in AdminClient as appropriate. The logic in the
> > > script
> > > > > is
> > > > > > therefore not much and can be easily moved to other classes if
> > > needed.
> > > > > >
> > > > > > Can you clarify why this logic, i.e. movement of replica to the
> > > > specified
> > > > > > log directory, needs to be moved to controller in KIP-179? I
> think
> > it
> > > > can
> > > > > > still be done in the script and controller should not need to
> worry
> > > > about
> > > > > > log directory of any replica.
> > > > > >
> > > > > > Thanks,
> > > > > > Dong
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Reply via email to