Hey Ismael,

Thanks for the comments! Here are my answers:

1. Yes, it has been considered. Here are the reasons why we don't do it
through the controller:

- There can be use-cases where we only want to rebalance the load of the
log directories on a given broker. It seems unnecessary to go through the
controller in this case.

- If the controller is responsible for sending ChangeReplicaDirRequest,
and the user-specified log directory is either invalid or offline, then the
controller needs a way to tell the user that the partition reassignment has
failed. We currently don't have a way to do this, since
kafka-reassign-partitions.sh simply creates the reassignment znode without
waiting for a response. I am not sure there is a good solution to this.

- If the controller is responsible for sending ChangeReplicaDirRequest, the
controller logic would be more complicated: the controller needs to first
send ChangeReplicaDirRequest so that the broker memorizes the partition ->
log directory mapping, then send LeaderAndIsrRequest, and then keep sending
ChangeReplicaDirRequest (in case the broker restarted) until the replica is
created. Note that the last step needs retries and a timeout, as proposed in
KIP-113.

Overall I think this adds quite a bit of complexity to the controller, and
we probably want to do it only if there is a strong, clear benefit to doing
so. Currently in KIP-113, kafka-reassign-partitions.sh is responsible for
sending ChangeReplicaDirRequest with retries, and it reports an error to the
user if the request fails or times out. This seems much simpler, and the
user shouldn't care whether it is done through the controller.
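
To make the intended tool behavior concrete, here is a rough sketch of the
retry loop that kafka-reassign-partitions.sh could run. This is only an
illustration: the alterReplicaDir method, the AlterReplicaDirResult type and
the error handling are assumptions based on the current KIP draft, not a
finalized API.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterReplicaDirResult; // assumed result type from the KIP
import org.apache.kafka.common.KafkaFuture;
import org.apache.kafka.common.TopicPartitionReplica;

// Hypothetical sketch of the tool-side retry loop: keep sending
// AlterReplicaDirRequest (via the AdminClient) until every replica is in its
// requested log directory or the timeout expires, then report what's left.
class ReassignmentToolSketch {
    static void moveReplicasWithRetry(AdminClient adminClient,
                                      Map<TopicPartitionReplica, String> replicaToDir,
                                      long timeoutMs,
                                      long retryBackoffMs) throws InterruptedException {
        Map<TopicPartitionReplica, String> remaining = new HashMap<>(replicaToDir);
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!remaining.isEmpty() && System.currentTimeMillis() < deadline) {
            // Ask each broker to (re)create the replica in the requested log directory.
            AlterReplicaDirResult result = adminClient.alterReplicaDir(remaining);
            for (Map.Entry<TopicPartitionReplica, KafkaFuture<Void>> e :
                    result.values().entrySet()) {
                try {
                    e.getValue().get();
                    remaining.remove(e.getKey());   // broker accepted the new log dir
                } catch (ExecutionException ex) {
                    // e.g. replica not created yet or broker just restarted; retry below
                }
            }
            if (!remaining.isEmpty()) {
                Thread.sleep(retryBackoffMs);
            }
        }
        if (!remaining.isEmpty()) {
            System.err.println("Timed out moving replicas: " + remaining.keySet());
        }
    }
}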

And thanks for the suggestion. I will add this to the Rejected Alternatives
section of KIP-113.

2. I think the user needs to be able to specify different log directories
for the replicas of the same partition in order to rebalance load across the
log directories of all brokers. I am not sure I understand the question. Can
you explain a bit more why "the log directory has to be the same for all
replicas of a given partition"?

3. Good point. I think alterReplicaDir is better than changeReplicaDir for
the reason you provided. I will also update the names of the request/response
in the KIP.
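
For reference, the renamed AdminClient method would then look roughly like
this (the signature is a sketch of my current thinking and may still change
in the KIP):

AlterReplicaDirResult alterReplicaDir(Map<TopicPartitionReplica, String> replicaAssignment,
                                      AlterReplicaDirOptions options);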


Thanks,
Dong

On Fri, Aug 4, 2017 at 9:49 AM, Ismael Juma <ism...@juma.me.uk> wrote:

> Thanks Dong. I have a few initial questions, sorry if it has been
> discussed and I missed it.
>
> 1. The KIP suggests that the reassignment tool is responsible for sending
> the ChangeReplicaDirRequests to the relevant brokers. I had imagined that
> this would be done by the Controller, like the rest of the reassignment
> process. Was this considered? If so, it would be good to include the
> details of why it was rejected in the "Rejected Alternatives" section.
>
> 2. The reassignment JSON format was extended so that one can choose the log
> directory for a partition. This means that the log directory has to be the
> same for all replicas of a given partition. The alternative would be for
> the log dir to be assignable for each replica. Similar to the other
> question, it would be good to have a section in "Rejected Alternatives" for
> this approach. It's generally very helpful to have more information on the
> rationale for the design choices that were made and rejected.
>
> 3. Should changeReplicaDir be alterReplicaDir? We have used `alter` for
> other methods.
>
> Thanks,
> Ismael
>
>
>
>
> On Fri, Aug 4, 2017 at 5:37 AM, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hi all,
> >
> > I realized that we need new API in AdminClient in order to use the new
> > request/response added in KIP-113. Since this is required by KIP-113, I
> > choose to add the new interface in this KIP instead of creating a new
> KIP.
> >
> > The documentation of the new API in AdminClient can be found here
> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 113%3A+Support+replicas+movement+between+log+directories#KIP-113:
> > Supportreplicasmovementbetweenlogdirectories-AdminClient>.
> > Can you please review and comment if you have any concern?
> >
> > Thanks!
> > Dong
> >
> >
> >
> > On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com> wrote:
> >
> > > The protocol change has been updated in KIP-113
> > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 113%3A+Support+replicas+movement+between+log+directories>
> > > .
> > >
> > > On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com>
> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I have made a minor change to the DescribeDirsRequest so that user can
> > >> choose to query the status for a specific list of partitions. This is
> a
> > bit
> > >> more fine-grained than the previous format that allows user to query
> > the
> > >> status for a specific list of topics. I realized that querying the
> > status
> > >> of selected partitions can be useful to check whether the
> > reassignment
> > >> of the replicas to the specific log directories has been completed.
> > >>
> > >> I will assume this minor change is OK if there is no concern with it
> in
> > >> the community :)
> > >>
> > >> Thanks,
> > >> Dong
> > >>
> > >>
> > >>
> > >> On Mon, Jun 12, 2017 at 10:46 AM, Dong Lin <lindon...@gmail.com>
> wrote:
> > >>
> > >>> Hey Colin,
> > >>>
> > >>> Thanks for the suggestion. We have actually considered this and list
> > >>> this as the first future work in KIP-112
> > >>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 112%3A+Handle+disk+failure+for+JBOD>.
> > >>> The two advantages that you mentioned are exactly the motivation for
> > this
> > >>> feature. Also as you have mentioned, this involves the tradeoff
> between
> > >>> disk performance and availability -- the more you distribute topic
> > across
> > >>> disks, the more topics will be offline due to a single disk failure.
> > >>>
> > >>> Despite its complexity, it is not clear to me that the reduced
> > rebalance
> > >>> overhead is worth the reduction in availability. I am optimistic that
> > the
> > >>> rebalance overhead will not be that big a problem since we are not
> too
> > >>> bothered by cross-broker rebalance as of now.
> > >>>
> > >>> Thanks,
> > >>> Dong
> > >>>
> > >>> On Mon, Jun 12, 2017 at 10:36 AM, Colin McCabe <cmcc...@apache.org>
> > >>> wrote:
> > >>>
> > >>>> Has anyone considered a scheme for sharding topic data across
> multiple
> > >>>> disks?
> > >>>>
> > >>>> For example, if you sharded topics across 3 disks, and you had 10
> > disks,
> > >>>> you could pick a different set of 3 disks for each topic.  If you
> > >>>> distribute them randomly then you have 10 choose 3 = 120 different
> > >>>> combinations.  You would probably never need rebalancing if you had
> a
> > >>>> reasonable distribution of topic sizes (could probably prove this
> > with a
> > >>>> Monte Carlo or something).
> > >>>>
> > >>>> The disadvantage is that if one of the 3 disks fails, then you have
> to
> > >>>> take the topic offline.  But if we assume independent disk failure
> > >>>> probabilities, probability of failure with RAID 0 is: 1 -
> > >>>> Psuccess^(num_disks) whereas the probability of failure with this
> > scheme
> > >>>> is 1 - Psuccess ^ 3.
> > >>>>
> > >>>> This addresses the biggest downsides of JBOD now:
> > >>>> * limiting a topic to the size of a single disk limits scalability
> > >>>> * the topic movement process is tricky to get right and involves
> > "racing
> > >>>> against producers" and wasted double I/Os
> > >>>>
> > >>>> Of course, one other question is how frequently we add new disk
> drives
> > >>>> to an existing broker.  In this case, you might reasonably want disk
> > >>>> rebalancing to avoid overloading the new disk(s) with writes.
> > >>>>
> > >>>> cheers,
> > >>>> Colin
> > >>>>
> > >>>>
> > >>>> On Fri, Jun 9, 2017, at 18:46, Jun Rao wrote:
> > >>>> > Just a few comments on this.
> > >>>> >
> > >>>> > 1. One of the issues with using RAID 0 is that a single disk
> failure
> > >>>> > causes
> > >>>> > a hard failure of the broker. Hard failure increases the
> > >>>> unavailability
> > >>>> > window for all the partitions on the failed broker, which includes
> > the
> > >>>> > failure detection time (tied to ZK session timeout right now) and
> > >>>> leader
> > >>>> > election time by the controller. If we support JBOD natively,
> when a
> > >>>> > single
> > >>>> > disk fails, only partitions on the failed disk will experience a
> > hard
> > >>>> > failure. The availability for partitions on the rest of the disks
> > are
> > >>>> not
> > >>>> > affected.
> > >>>> >
> > >>>> > 2. For running things on the Cloud such as AWS. Currently, each
> EBS
> > >>>> > volume
> > >>>> > has a throughput limit of about 300MB/sec. If you get an enhanced
> > EC2
> > >>>> > instance, you can get 20Gb/sec network. To saturate the network,
> you
> > >>>> may
> > >>>> > need about 7 EBS volumes. So, being able to support JBOD in the
> > Cloud
> > >>>> is
> > >>>> > still potentially useful.
> > >>>> >
> > >>>> > 3. On the benefit of balancing data across disks within the same
> > >>>> broker.
> > >>>> > Data imbalance can happen across brokers as well as across disks
> > >>>> within
> > >>>> > the
> > >>>> > same broker. Balancing the data across disks within the broker has
> > the
> > >>>> > benefit of saving network bandwidth as Dong mentioned. So, if
> intra
> > >>>> > broker
> > >>>> > load balancing is possible, it's probably better to avoid the more
> > >>>> > expensive inter broker load balancing. One of the reasons for disk
> > >>>> > imbalance right now is that partitions within a broker are
> assigned
> > to
> > >>>> > disks just based on the partition count. So, it does seem possible
> > for
> > >>>> > disks to get imbalanced from time to time. If someone can share
> some
> > >>>> > stats
> > >>>> > for that in practice, that will be very helpful.
> > >>>> >
> > >>>> > Thanks,
> > >>>> >
> > >>>> > Jun
> > >>>> >
> > >>>> >
> > >>>> > On Wed, Jun 7, 2017 at 2:30 PM, Dong Lin <lindon...@gmail.com>
> > wrote:
> > >>>> >
> > >>>> > > Hey Sriram,
> > >>>> > >
> > >>>> > > I think there is one way to explain why the ability to move
> > replica
> > >>>> between
> > >>>> > > disks can save space. Let's say the load is distributed to disks
> > >>>> > > independent of the broker. Sooner or later, the load imbalance
> > will
> > >>>> exceed
> > >>>> > > a threshold and we will need to rebalance load across disks. Now
> > our
> > >>>> > > questions is whether our rebalancing algorithm will be able to
> > take
> > >>>> > > advantage of locality by moving replicas between disks on the
> same
> > >>>> broker.
> > >>>> > >
> > >>>> > > Say for a given disk, there is 20% probability it is overloaded,
> > 20%
> > >>>> > > probability it is underloaded, and 60% probability its load is
> > >>>> around the
> > >>>> > > expected average load if the cluster is well balanced. Then for
> a
> > >>>> broker of
> > >>>> > > 10 disks, we would 2 disks need to have in-bound replica
> movement,
> > >>>> 2 disks
> > >>>> > > need to have out-bound replica movement, and 6 disks do not need
> > >>>> replica
> > >>>> > > movement. Thus we would expect KIP-113 to be useful since we
> will
> > >>>> be able
> > >>>> > > to move replica from the two over-loaded disks to the two
> > >>>> under-loaded
> > >>>> > > disks on the same broker. Does this make sense?
> > >>>> > >
> > >>>> > > Thanks,
> > >>>> > > Dong
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> > > On Wed, Jun 7, 2017 at 2:12 PM, Dong Lin <lindon...@gmail.com>
> > >>>> wrote:
> > >>>> > >
> > >>>> > > > Hey Sriram,
> > >>>> > > >
> > >>>> > > > Thanks for raising these concerns. Let me answer these
> questions
> > >>>> below:
> > >>>> > > >
> > >>>> > > > - The benefit of those additional complexity to move the data
> > >>>> stored on a
> > >>>> > > > disk within the broker is to avoid network bandwidth usage.
> > >>>> Creating
> > >>>> > > > replica on another broker is less efficient than creating
> > replica
> > >>>> on
> > >>>> > > > another disk in the same broker IF there is actually
> > >>>> lightly-loaded disk
> > >>>> > > on
> > >>>> > > > the same broker.
> > >>>> > > >
> > >>>> > > > - In my opinion the rebalance algorithm would be this: 1) we
> > balance
> > >>>> the
> > >>>> > > load
> > >>>> > > > across brokers using the same algorithm we are using today. 2)
> > we
> > >>>> balance
> > >>>> > > > load across disk on a given broker using a greedy algorithm,
> > i.e.
> > >>>> move
> > >>>> > > > replica from the overloaded disk to lightly loaded disk. The
> > >>>> greedy
> > >>>> > > > algorithm would only consider the capacity and replica size.
> We
> > >>>> can
> > >>>> > > improve
> > >>>> > > > it to consider throughput in the future.
> > >>>> > > >
> > >>>> > > > - With 30 brokers with each having 10 disks, using the
> > rebalancing
> > >>>> > > algorithm,
> > >>>> > > > the chances of choosing disks within the broker can be high.
> > >>>> There will
> > >>>> > > > always be load imbalance across disks of the same broker for
> the
> > >>>> same
> > >>>> > > > reason that there will always be load imbalance across
> brokers.
> > >>>> The
> > >>>> > > > algorithm specified above will take advantage of the locality,
> > >>>> i.e. first
> > >>>> > > > balance load across disks of the same broker, and only balance
> > >>>> across
> > >>>> > > > brokers if some brokers are much more loaded than others.
> > >>>> > > >
> > >>>> > > > I think it is useful to note that the load imbalance across
> > disks
> > >>>> of the
> > >>>> > > > same broker is independent of the load imbalance across
> brokers.
> > >>>> Both are
> > >>>> > > > guaranteed to happen in any Kafka cluster for the same reason,
> > >>>> i.e.
> > >>>> > > > variation in the partition size. Say broker 1 have two disks
> > that
> > >>>> are 80%
> > >>>> > > > loaded and 20% loaded. And broker 2 have two disks that are
> also
> > >>>> 80%
> > >>>> > > > loaded and 20%. We can balance them without inter-broker
> traffic
> > >>>> with
> > >>>> > > > KIP-113.  This is why I think KIP-113 can be very useful.
> > >>>> > > >
> > >>>> > > > Do these explanations sound reasonable?
> > >>>> > > >
> > >>>> > > > Thanks,
> > >>>> > > > Dong
> > >>>> > > >
> > >>>> > > >
> > >>>> > > > On Wed, Jun 7, 2017 at 1:33 PM, Sriram Subramanian <
> > >>>> r...@confluent.io>
> > >>>> > > > wrote:
> > >>>> > > >
> > >>>> > > >> Hey Dong,
> > >>>> > > >>
> > >>>> > > >> Thanks for the explanation. I don't think anyone is denying
> > that
> > >>>> we
> > >>>> > > should
> > >>>> > > >> rebalance at the disk level. I think it is important to
> restore
> > >>>> the disk
> > >>>> > > >> and not wait for disk replacement. There are also other
> > benefits
> > >>>> of
> > >>>> > > doing
> > >>>> > > >> that which is that you don't need to opt for hot swap racks
> > that
> > >>>> can
> > >>>> > > save
> > >>>> > > >> cost.
> > >>>> > > >>
> > >>>> > > >> The question here is what do you save by trying to add
> > >>>> complexity to
> > >>>> > > move
> > >>>> > > >> the data stored on a disk within the broker? Why would you
> not
> > >>>> simply
> > >>>> > > >> create another replica on the disk that results in a balanced
> > >>>> load
> > >>>> > > across
> > >>>> > > >> brokers and have it catch up. We are missing a few things
> here
> > -
> > >>>> > > >> 1. What would your data balancing algorithm be? Would it
> > include
> > >>>> just
> > >>>> > > >> capacity or will it also consider throughput on disk to
> decide
> > >>>> on the
> > >>>> > > >> final
> > >>>> > > >> location of a partition?
> > >>>> > > >> 2. With 30 brokers with each having 10 disks, using the
> > >>>> rebalancing
> > >>>> > > >> algorithm, the chances of choosing disks within the broker is
> > >>>> going to
> > >>>> > > be
> > >>>> > > >> low. This probability further decreases with more brokers and
> > >>>> disks.
> > >>>> > > Given
> > >>>> > > >> that, why are we trying to save network cost? How much would
> > >>>> that saving
> > >>>> > > >> be
> > >>>> > > >> if you go that route?
> > >>>> > > >>
> > >>>> > > >> These questions are hard to answer without having to verify
> > >>>> empirically.
> > >>>> > > >> My
> > >>>> > > >> suggestion is to avoid doing premature optimization that
> > >>>> brings in the
> > >>>> > > >> added complexity to the code and treat inter and intra broker
> > >>>> movements
> > >>>> > > of
> > >>>> > > >> partition the same. Deploy the code, use it and see if it is
> an
> > >>>> actual
> > >>>> > > >> problem and you get great savings by avoiding the network
> route
> > >>>> to move
> > >>>> > > >> partitions within the same broker. If so, add this
> > optimization.
> > >>>> > > >>
> > >>>> > > >> On Wed, Jun 7, 2017 at 1:03 PM, Dong Lin <
> lindon...@gmail.com>
> > >>>> wrote:
> > >>>> > > >>
> > >>>> > > >> > Hey Jay, Sriram,
> > >>>> > > >> >
> > >>>> > > >> > Great point. If I understand you right, you are suggesting
> > >>>> that we can
> > >>>> > > >> > simply use RAID-0 so that the load can be evenly
> distributed
> > >>>> across
> > >>>> > > >> disks.
> > >>>> > > >> > And even though a disk failure will bring down the entire
> > >>>> broker, the
> > >>>> > > >> > reduced availability as compared to using KIP-112 and
> KIP-113
> > >>>> will may
> > >>>> > > >> be
> > >>>> > > >> > negligible. And it may be better to just accept the
> slightly
> > >>>> reduced
> > >>>> > > >> > availability instead of introducing the complexity from
> > >>>> KIP-112 and
> > >>>> > > >> > KIP-113.
> > >>>> > > >> >
> > >>>> > > >> > Let's assume the following:
> > >>>> > > >> >
> > >>>> > > >> > - There are 30 brokers in a cluster and each broker has 10
> > >>>> disks
> > >>>> > > >> > - The replication factor is 3 and min.isr = 2.
> > >>>> > > >> > - The probability of annual disk failure rate is 2%
> according
> > >>>> to this
> > >>>> > > >> > <https://www.backblaze.com/blo
> g/hard-drive-failure-rates-q1-
> > >>>> 2017/>
> > >>>> > > >> blog.
> > >>>> > > >> > - It takes 3 days to replace a disk.
> > >>>> > > >> >
> > >>>> > > >> > Here is my calculation for probability of data loss due to
> > disk
> > >>>> > > failure:
> > >>>> > > >> > probability of a given disk fails in a given year: 2%
> > >>>> > > >> > probability of a given disk stays offline for one day in a
> > >>>> given day:
> > >>>> > > >> 2% /
> > >>>> > > >> > 365 * 3
> > >>>> > > >> > probability of a given broker stays offline for one day in
> a
> > >>>> given day
> > >>>> > > >> due
> > >>>> > > >> > to disk failure: 2% / 365 * 3 * 10
> > >>>> > > >> > probability of any broker stays offline for one day in a
> > given
> > >>>> day due
> > >>>> > > >> to
> > >>>> > > >> > disk failure: 2% / 365 * 3 * 10 * 30 = 5%
> > >>>> > > >> > probability of any three broker stays offline for one day
> in
> > a
> > >>>> given
> > >>>> > > day
> > >>>> > > >> > due to disk failure: 5% * 5% * 5% = 0.0125%
> > >>>> > > >> > probability of data loss due to disk failure: 0.0125%
> > >>>> > > >> >
> > >>>> > > >> > Here is my calculation for probability of service
> > >>>> unavailability due
> > >>>> > > to
> > >>>> > > >> > disk failure:
> > >>>> > > >> > probability of a given disk fails in a given year: 2%
> > >>>> > > >> > probability of a given disk stays offline for one day in a
> > >>>> given day:
> > >>>> > > >> 2% /
> > >>>> > > >> > 365 * 3
> > >>>> > > >> > probability of a given broker stays offline for one day in
> a
> > >>>> given day
> > >>>> > > >> due
> > >>>> > > >> > to disk failure: 2% / 365 * 3 * 10
> > >>>> > > >> > probability of any broker stays offline for one day in a
> > given
> > >>>> day due
> > >>>> > > >> to
> > >>>> > > >> > disk failure: 2% / 365 * 3 * 10 * 30 = 5%
> > >>>> > > >> > probability of any two broker stays offline for one day in
> a
> > >>>> given day
> > >>>> > > >> due
> > >>>> > > >> > to disk failure: 5% * 5% = 0.25%
> > >>>> > > >> > probability of unavailability due to disk failure: 0.25%
> > >>>> > > >> >
> > >>>> > > >> > Note that the unavailability due to disk failure will be
> > >>>> unacceptably
> > >>>> > > >> high
> > >>>> > > >> > in this case. And the probability of data loss due to disk
> > >>>> failure
> > >>>> > > will
> > >>>> > > >> be
> > >>>> > > >> > higher than 0.01%. Neither is acceptable if Kafka is
> intended
> > >>>> to
> > >>>> > > achieve
> > >>>> > > >> > four nines availability.
> > >>>> > > >> >
> > >>>> > > >> > Thanks,
> > >>>> > > >> > Dong
> > >>>> > > >> >
> > >>>> > > >> >
> > >>>> > > >> > On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps <
> j...@confluent.io
> > >
> > >>>> wrote:
> > >>>> > > >> >
> > >>>> > > >> > > I think Ram's point is that in place failure is pretty
> > >>>> complicated,
> > >>>> > > >> and
> > >>>> > > >> > > this is meant to be a cost saving feature, we should
> > >>>> construct an
> > >>>> > > >> > argument
> > >>>> > > >> > > for it grounded in data.
> > >>>> > > >> > >
> > >>>> > > >> > > Assume an annual failure rate of 1% (reasonable, but data
> > is
> > >>>> > > available
> > >>>> > > >> > > online), and assume it takes 3 days to get the drive
> > >>>> replaced. Say
> > >>>> > > you
> > >>>> > > >> > have
> > >>>> > > >> > > 10 drives per server. Then the expected downtime for each
> > >>>> server is
> > >>>> > > >> > roughly
> > >>>> > > >> > > 1% * 3 days * 10 = 0.3 days/year (this is slightly off
> > since
> > >>>> I'm
> > >>>> > > >> ignoring
> > >>>> > > >> > > the case of multiple failures, but I don't know that
> > changes
> > >>>> it
> > >>>> > > >> much). So
> > >>>> > > >> > > the savings from this feature is 0.3/365 = 0.08%. Say you
> > >>>> have 1000
> > >>>> > > >> > servers
> > >>>> > > >> > > and they cost $3000/year fully loaded including power,
> the
> > >>>> cost of
> > >>>> > > >> the hw
> > >>>> > > >> > > amortized over it's life, etc. Then this feature saves
> you
> > >>>> $3000 on
> > >>>> > > >> your
> > >>>> > > >> > > total server cost of $3m which seems not very worthwhile
> > >>>> compared to
> > >>>> > > >> > other
> > >>>> > > >> > > optimizations...?
> > >>>> > > >> > >
> > >>>> > > >> > > Anyhow, not sure the arithmetic is right there, but i
> think
> > >>>> that is
> > >>>> > > >> the
> > >>>> > > >> > > type of argument that would be helpful to think about the
> > >>>> tradeoff
> > >>>> > > in
> > >>>> > > >> > > complexity.
> > >>>> > > >> > >
> > >>>> > > >> > > -Jay
> > >>>> > > >> > >
> > >>>> > > >> > >
> > >>>> > > >> > >
> > >>>> > > >> > > On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <
> > >>>> lindon...@gmail.com>
> > >>>> > > wrote:
> > >>>> > > >> > >
> > >>>> > > >> > > > Hey Sriram,
> > >>>> > > >> > > >
> > >>>> > > >> > > > Thanks for taking time to review the KIP. Please see
> > below
> > >>>> my
> > >>>> > > >> answers
> > >>>> > > >> > to
> > >>>> > > >> > > > your questions:
> > >>>> > > >> > > >
> > >>>> > > >> > > > >1. Could you pick a hardware/Kafka configuration and
> go
> > >>>> over what
> > >>>> > > >> is
> > >>>> > > >> > the
> > >>>> > > >> > > > >average disk/partition repair/restore time that we are
> > >>>> targeting
> > >>>> > > >> for a
> > >>>> > > >> > > > >typical JBOD setup?
> > >>>> > > >> > > >
> > >>>> > > >> > > > We currently don't have this data. I think the
> > >>>> disk/partition
> > >>>> > > >> > > repair/store
> > >>>> > > >> > > > time depends on availability of hardware, the response
> > >>>> time of
> > >>>> > > >> > > > site-reliability engineer, the amount of data on the
> bad
> > >>>> disk etc.
> > >>>> > > >> > These
> > >>>> > > >> > > > vary between companies and even clusters within the
> same
> > >>>> company
> > >>>> > > >> and it
> > >>>> > > >> > > is
> > >>>> > > >> > > > probably hard to determine what is the average
> situation.
> > >>>> > > >> > > >
> > >>>> > > >> > > > I am not very sure why we need this. Can you explain a
> > bit
> > >>>> why
> > >>>> > > this
> > >>>> > > >> > data
> > >>>> > > >> > > is
> > >>>> > > >> > > > useful to evaluate the motivation and design of this
> KIP?
> > >>>> > > >> > > >
> > >>>> > > >> > > > >2. How often do we believe disks are going to fail (in
> > >>>> your
> > >>>> > > example
> > >>>> > > >> > > > >configuration) and what do we gain by avoiding the
> > network
> > >>>> > > overhead
> > >>>> > > >> > and
> > >>>> > > >> > > > >doing all the work of moving the replica within the
> > >>>> broker to
> > >>>> > > >> another
> > >>>> > > >> > > disk
> > >>>> > > >> > > > >instead of balancing it globally?
> > >>>> > > >> > > >
> > >>>> > > >> > > > I think the chance of disk failure depends mainly on
> the
> > >>>> disk
> > >>>> > > itself
> > >>>> > > >> > > rather
> > >>>> > > >> > > > than the broker configuration. I don't have this data
> > now.
> > >>>> I will
> > >>>> > > >> ask
> > >>>> > > >> > our
> > >>>> > > >> > > > SRE whether they know the mean-time-to-fail for our
> disk.
> > >>>> What I
> > >>>> > > was
> > >>>> > > >> > told
> > >>>> > > >> > > > by SRE is that disk failure is the most common type of
> > >>>> hardware
> > >>>> > > >> > failure.
> > >>>> > > >> > > >
> > >>>> > > >> > > > When there is disk failure, I think it is reasonable to
> > >>>> move
> > >>>> > > >> replica to
> > >>>> > > >> > > > another broker instead of another disk on the same
> > broker.
> > >>>> The
> > >>>> > > >> reason
> > >>>> > > >> > we
> > >>>> > > >> > > > want to move replica within broker is mainly to
> optimize
> > >>>> the Kafka
> > >>>> > > >> > > cluster
> > >>>> > > >> > > > performance when we balance load across disks.
> > >>>> > > >> > > >
> > >>>> > > >> > > > In comparison to balancing replicas globally, the
> benefit
> > >>>> of
> > >>>> > > moving
> > >>>> > > >> > > replica
> > >>>> > > >> > > > within broker is that:
> > >>>> > > >> > > >
> > >>>> > > >> > > > 1) the movement is faster since it doesn't go through
> > >>>> socket or
> > >>>> > > >> rely on
> > >>>> > > >> > > the
> > >>>> > > >> > > > available network bandwidth;
> > >>>> > > >> > > > 2) much less impact on the replication traffic between
> > >>>> broker by
> > >>>> > > not
> > >>>> > > >> > > taking
> > >>>> > > >> > > > up bandwidth between brokers. Depending on the pattern
> of
> > >>>> traffic,
> > >>>> > > >> we
> > >>>> > > >> > may
> > >>>> > > >> > > > need to balance load across disk frequently and it is
> > >>>> necessary to
> > >>>> > > >> > > prevent
> > >>>> > > >> > > > this operation from slowing down the existing operation
> > >>>> (e.g.
> > >>>> > > >> produce,
> > >>>> > > >> > > > consume, replication) in the Kafka cluster.
> > >>>> > > >> > > > 3) It gives us opportunity to do automatic broker
> > rebalance
> > >>>> > > between
> > >>>> > > >> > disks
> > >>>> > > >> > > > on the same broker.
> > >>>> > > >> > > >
> > >>>> > > >> > > >
> > >>>> > > >> > > > >3. Even if we had to move the replica within the
> broker,
> > >>>> why
> > >>>> > > >> cannot we
> > >>>> > > >> > > > just
> > >>>> > > >> > > > >treat it as another replica and have it go through the
> > >>>> same
> > >>>> > > >> > replication
> > >>>> > > >> > > > >code path that we have today? The downside here is
> > >>>> obviously that
> > >>>> > > >> you
> > >>>> > > >> > > need
> > >>>> > > >> > > > >to catchup from the leader but it is completely free!
> > >>>> What do we
> > >>>> > > >> think
> > >>>> > > >> > > is
> > >>>> > > >> > > > >the impact of the network overhead in this case?
> > >>>> > > >> > > >
> > >>>> > > >> > > > Good point. My initial proposal actually used the
> > existing
> > >>>> > > >> > > > ReplicaFetcherThread (i.e. the existing code path) to
> > move
> > >>>> replica
> > >>>> > > >> > > between
> > >>>> > > >> > > > disks. However, I switched to use separate thread pool
> > >>>> after
> > >>>> > > >> discussion
> > >>>> > > >> > > > with Jun and Becket.
> > >>>> > > >> > > >
> > >>>> > > >> > > > The main argument for using separate thread pool is to
> > >>>> actually
> > >>>> > > keep
> > >>>> > > >> > the
> > >>>> > > >> > > > design simply and easy to reason about. There are a
> > number
> > >>>> of
> > >>>> > > >> > difference
> > >>>> > > >> > > > between inter-broker replication and intra-broker
> > >>>> replication
> > >>>> > > which
> > >>>> > > >> > makes
> > >>>> > > >> > > > it cleaner to do them in separate code path. I will
> list
> > >>>> them
> > >>>> > > below:
> > >>>> > > >> > > >
> > >>>> > > >> > > > - The throttling mechanism for inter-broker replication
> > >>>> traffic
> > >>>> > > and
> > >>>> > > >> > > > intra-broker replication traffic is different. For
> > >>>> example, we may
> > >>>> > > >> want
> > >>>> > > >> > > to
> > >>>> > > >> > > > specify per-topic quota for inter-broker replication
> > >>>> traffic
> > >>>> > > >> because we
> > >>>> > > >> > > may
> > >>>> > > >> > > > want some topic to be moved faster than other topic.
> But
> > >>>> we don't
> > >>>> > > >> care
> > >>>> > > >> > > > about priority of topics for intra-broker movement. So
> > the
> > >>>> current
> > >>>> > > >> > > proposal
> > >>>> > > >> > > > only allows user to specify per-broker quota for
> > >>>> inter-broker
> > >>>> > > >> > replication
> > >>>> > > >> > > > traffic.
> > >>>> > > >> > > >
> > >>>> > > >> > > > - The quota value for inter-broker replication traffic
> > and
> > >>>> > > >> intra-broker
> > >>>> > > >> > > > replication traffic is different. The available
> bandwidth
> > >>>> for
> > >>>> > > >> > > inter-broker
> > >>>> > > >> > > > replication can probably be much higher than the
> > bandwidth
> > >>>> for
> > >>>> > > >> > > inter-broker
> > >>>> > > >> > > > replication.
> > >>>> > > >> > > >
> > >>>> > > >> > > > - The ReplicaFetchThread is per broker. Intuitively,
> the
> > >>>> number of
> > >>>> > > >> > > threads
> > >>>> > > >> > > > doing intra broker data movement should be related to
> the
> > >>>> number
> > >>>> > > of
> > >>>> > > >> > disks
> > >>>> > > >> > > > in the broker, not the number of brokers in the
> cluster.
> > >>>> > > >> > > >
> > >>>> > > >> > > > - The leader replica has no ReplicaFetchThread to start
> > >>>> with. It
> > >>>> > > >> seems
> > >>>> > > >> > > > weird to
> > >>>> > > >> > > > start one just for intra-broker replication.
> > >>>> > > >> > > >
> > >>>> > > >> > > > Because of these difference, we think it is simpler to
> > use
> > >>>> > > separate
> > >>>> > > >> > > thread
> > >>>> > > >> > > > pool and code path so that we can configure and
> throttle
> > >>>> them
> > >>>> > > >> > separately.
> > >>>> > > >> > > >
> > >>>> > > >> > > >
> > >>>> > > >> > > > >4. What are the chances that we will be able to
> identify
> > >>>> another
> > >>>> > > >> disk
> > >>>> > > >> > to
> > >>>> > > >> > > > >balance within the broker instead of another disk on
> > >>>> another
> > >>>> > > >> broker?
> > >>>> > > >> > If
> > >>>> > > >> > > we
> > >>>> > > >> > > > >have 100's of machines, the probability of finding a
> > >>>> better
> > >>>> > > >> balance by
> > >>>> > > >> > > > >choosing another broker is much higher than balancing
> > >>>> within the
> > >>>> > > >> > broker.
> > >>>> > > >> > > > >Could you add some info on how we are determining
> this?
> > >>>> > > >> > > >
> > >>>> > > >> > > > It is possible that we can find available space on a
> > remote
> > >>>> > > broker.
> > >>>> > > >> The
> > >>>> > > >> > > > benefit of allowing intra-broker replication is that,
> > when
> > >>>> there
> > >>>> > > are
> > >>>> > > >> > > > available space in both the current broker and a remote
> > >>>> broker,
> > >>>> > > the
> > >>>> > > >> > > > rebalance can be completed faster with much less impact
> > on
> > >>>> the
> > >>>> > > >> > > inter-broker
> > >>>> > > >> > > > replication or the users traffic. It is about taking
> > >>>> advantage of
> > >>>> > > >> > > locality
> > >>>> > > >> > > > when balance the load.
> > >>>> > > >> > > >
> > >>>> > > >> > > > >5. Finally, in a cloud setup where more users are
> going
> > to
> > >>>> > > >> leverage a
> > >>>> > > >> > > > >shared filesystem (example, EBS in AWS), all this
> change
> > >>>> is not
> > >>>> > > of
> > >>>> > > >> > much
> > >>>> > > >> > > > >gain since you don't need to balance between the
> volumes
> > >>>> within
> > >>>> > > the
> > >>>> > > >> > same
> > >>>> > > >> > > > >broker.
> > >>>> > > >> > > >
> > >>>> > > >> > > > You are right. This KIP-113 is useful only if user uses
> > >>>> JBOD. If
> > >>>> > > >> user
> > >>>> > > >> > > uses
> > >>>> > > >> > > > an extra storage layer of replication, such as RAID-10
> or
> > >>>> EBS,
> > >>>> > > they
> > >>>> > > >> > don't
> > >>>> > > >> > > > need KIP-112 or KIP-113. Note that user will replicate
> > >>>> data more
> > >>>> > > >> times
> > >>>> > > >> > > than
> > >>>> > > >> > > > the replication factor of the Kafka topic if an extra
> > >>>> storage
> > >>>> > > layer
> > >>>> > > >> of
> > >>>> > > >> > > > replication is used.
> > >>>> > > >> > > >
> > >>>> > > >> > >
> > >>>> > > >> >
> > >>>> > > >>
> > >>>> > > >
> > >>>> > > >
> > >>>> > >
> > >>>>
> > >>>
> > >>>
> > >>
> > >
> >
>
