Hi Dong,

Your comments on KIP-179 prompted me to look at KIP-113, and I have a
question:

AFAICS the DescribeDirsResponse (via ReplicaInfo) can be used to get the
size of a partition on a disk, but I don't see a mechanism for knowing the
total capacity of a disk (and/or the free capacity of a disk). That would
be very useful information to have, for instance to help figure out in
advance that certain assignments are impossible. Is there a reason you've
left this out?
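
Purely to illustrate the kind of information I mean (the field names below
are hypothetical, not a concrete schema proposal), something like this per
log directory in the response would be enough:

    // Hypothetical addition to the DescribeDirsResponse payload
    class LogDirInfo {
        String path;          // log directory path
        long totalBytes;      // total capacity of the underlying disk
        long usableBytes;     // free capacity of the underlying disk
        Map<TopicPartition, ReplicaInfo> replicas;
    }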

Cheers,

Tom

On 4 August 2017 at 18:47, Dong Lin <lindon...@gmail.com> wrote:

> Hey Ismael,
>
> Thanks for the comments! Here are my answers:
>
> 1. Yes, it has been considered. Here are the reasons why we don't do it
> through the controller.
>
> - There can be use-cases where we only want to rebalance the load across the
> log directories on a given broker. It seems unnecessary to go through the
> controller in this case.
>
> - If the controller is responsible for sending ChangeReplicaDirRequest, and
> if the user-specified log directory is either invalid or offline, then the
> controller probably needs a way to tell the user that the partition
> reassignment has failed. We currently don't have a way to do this, since
> kafka-reassign-partitions.sh simply creates the reassignment znode without
> waiting for a response. I am not sure what a good solution to this would be.
>
> - If the controller is responsible for sending ChangeReplicaDirRequest, the
> controller logic would be more complicated, because the controller needs to
> first send ChangeReplicaDirRequest so that the broker memorizes the
> partition -> log directory mapping, then send LeaderAndIsrRequest, and then
> keep sending ChangeReplicaDirRequest (in case the broker restarted) until
> the replica is created. Note that the last step needs retries and a timeout,
> as proposed in KIP-113.
>
> Overall I think this adds quite a bit of complexity to the controller, and
> we probably want to do this only if there is a strong, clear benefit to
> doing so. Currently in KIP-113, kafka-reassign-partitions.sh is responsible
> for sending ChangeReplicaDirRequest with retries, and it reports an error to
> the user if the request either fails or times out. This seems much simpler,
> and the user shouldn't care whether it is done through the controller.
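>
> To illustrate, the tool-side logic is roughly the following (a sketch only;
> sendChangeReplicaDirRequest is a hypothetical helper, not the actual code):
>
>     // Keep sending ChangeReplicaDirRequest until the replica has been
>     // created in the requested log directory, or the timeout is reached.
>     long deadline = System.currentTimeMillis() + timeoutMs;
>     while (true) {
>         ChangeReplicaDirResponse response =
>             sendChangeReplicaDirRequest(broker, partition, logDir);
>         if (response.error() == Errors.NONE)
>             break;  // success: replica is in the requested log directory
>         if (System.currentTimeMillis() > deadline)
>             throw new TimeoutException("Replica was not moved to " + logDir);
>         Thread.sleep(retryBackoffMs);  // e.g. replica not created yet; retry
>     }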
>
> And thanks for the suggestion. I will add this to the Rejected Alternatives
> section in KIP-113.
>
> 2) I think the user needs to be able to specify different log directories
> for the replicas of the same partition in order to rebalance load across the
> log directories of all brokers. I am not sure I understand the question. Can
> you explain a bit more why "the log directory has to be the same for all
> replicas of a given partition"?
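>
> For context, I am thinking of something along these lines in the extended
> reassignment JSON (a sketch only; the field names are illustrative, not the
> final format):
>
>     {"version": 1,
>      "partitions": [
>        {"topic": "foo", "partition": 0,
>         "replicas": [1, 2],
>         "log_dirs": ["/data/disk1", "/data/disk3"]}
>      ]}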
>
> 3) Good point. I think alterReplicaDir is better than changeReplicaDir for
> the reason you provided. I will also update the names of the
> request/response in the KIP.
>
>
> Thanks,
> Dong
>
> On Fri, Aug 4, 2017 at 9:49 AM, Ismael Juma <ism...@juma.me.uk> wrote:
>
> > Thanks Dong. I have a few initial questions; sorry if this has been
> > discussed and I missed it.
> >
> > 1. The KIP suggests that the reassignment tool is responsible for sending
> > the ChangeReplicaDirRequests to the relevant brokers. I had imagined that
> > this would be done by the Controller, like the rest of the reassignment
> > process. Was this considered? If so, it would be good to include the
> > details of why it was rejected in the "Rejected Alternatives" section.
> >
> > 2. The reassignment JSON format was extended so that one can choose the
> log
> > directory for a partition. This means that the log directory has to be
> the
> > same for all replicas of a given partition. The alternative would be for
> > the log dir to be assignable for each replica. Similar to the other
> > question, it would be good to have a section in "Rejected Alternatives"
> for
> > this approach. It's generally very helpful to have more information on
> the
> > rationale for the design choices that were made and rejected.
> >
> > 3. Should changeReplicaDir be alterReplicaDir? We have used `alter` for
> > other methods.
> >
> > Thanks,
> > Ismael
> >
> >
> >
> >
> > On Fri, Aug 4, 2017 at 5:37 AM, Dong Lin <lindon...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I realized that we need a new API in AdminClient in order to use the new
> > > request/response added in KIP-113. Since this is required by KIP-113, I
> > > chose to add the new interface in this KIP instead of creating a new KIP.
> > >
> > > The documentation of the new API in AdminClient can be found here
> > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 113%3A+Support+replicas+movement+between+log+directories#KIP-113:
> > > Supportreplicasmovementbetweenlogdirectories-AdminClient>.
> > > Can you please review and comment if you have any concern?
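> > >
> > > As a rough illustration of how the new API might be used (the names and
> > > signatures here are a sketch only -- please see the wiki page above for
> > > the actual proposal):
> > >
> > >     // ask broker 1 to move its replica of foo-0 to the given log dir
> > >     Map<TopicPartitionReplica, String> assignment = new HashMap<>();
> > >     assignment.put(new TopicPartitionReplica("foo", 0, 1), "/data/disk2");
> > >     adminClient.alterReplicaDir(assignment).all().get();  // sketch only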
> > >
> > > Thanks!
> > > Dong
> > >
> > >
> > >
> > > On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com>
> wrote:
> > >
> > > > The protocol change has been updated in KIP-113
> > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 113%3A+Support+replicas+movement+between+log+directories>
> > > > .
> > > >
> > > > On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com>
> > wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> I have made a minor change to the DescribeDirsRequest so that the user
> > > >> can choose to query the status of a specific list of partitions. This
> > > >> is a bit more fine-grained than the previous format, which allows the
> > > >> user to query the status of a specific list of topics. I realized that
> > > >> querying the status of selected partitions can be useful for checking
> > > >> whether the reassignment of replicas to the specified log directories
> > > >> has completed.
> > > >>
> > > >> I will assume this minor change is OK if there is no concern with it
> > > >> in the community :)
> > > >>
> > > >> Thanks,
> > > >> Dong
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Jun 12, 2017 at 10:46 AM, Dong Lin <lindon...@gmail.com>
> > wrote:
> > > >>
> > > >>> Hey Colin,
> > > >>>
> > > >>> Thanks for the suggestion. We have actually considered this and listed
> > > >>> it as the first future work in KIP-112
> > > >>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD>.
> > > >>> The two advantages that you mentioned are exactly the motivation for
> > > >>> this feature. Also, as you have mentioned, this involves a tradeoff
> > > >>> between disk performance and availability -- the more you distribute
> > > >>> topics across disks, the more topics will be taken offline by a single
> > > >>> disk failure.
> > > >>>
> > > >>> Given its complexity, it is not clear to me that the reduced rebalance
> > > >>> overhead is worth the reduction in availability. I am optimistic that
> > > >>> the rebalance overhead will not be that big a problem, since we are not
> > > >>> too bothered by cross-broker rebalancing as of now.
> > > >>>
> > > >>> Thanks,
> > > >>> Dong
> > > >>>
> > > >>> On Mon, Jun 12, 2017 at 10:36 AM, Colin McCabe <cmcc...@apache.org
> >
> > > >>> wrote:
> > > >>>
> > > >>>> Has anyone considered a scheme for sharding topic data across
> > multiple
> > > >>>> disks?
> > > >>>>
> > > >>>> For example, if you sharded topics across 3 disks, and you had 10
> > > disks,
> > > >>>> you could pick a different set of 3 disks for each topic.  If you
> > > >>>> distribute them randomly then you have 10 choose 3 = 120 different
> > > >>>> combinations.  You would probably never need rebalancing if you
> had
> > a
> > > >>>> reasonable distribution of topic sizes (could probably prove this
> > > with a
> > > >>>> Monte Carlo or something).
> > > >>>>
> > > >>>> The disadvantage is that if one of the 3 disks fails, then you
> have
> > to
> > > >>>> take the topic offline.  But if we assume independent disk failure
> > > >>>> probabilities, probability of failure with RAID 0 is: 1 -
> > > >>>> Psuccess^(num_disks) whereas the probability of failure with this
> > > scheme
> > > >>>> is 1 - Psuccess ^ 3.
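> > > >>>>
> > > >>>> (For example, assuming an independent 2% annual failure probability per
> > > >>>> disk, i.e. Psuccess = 0.98: RAID 0 across 10 disks gives 1 - 0.98^10,
> > > >>>> which is about 18%, while a topic sharded across 3 disks gives
> > > >>>> 1 - 0.98^3, which is about 6%.)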
> > > >>>>
> > > >>>> This addresses the biggest downsides of JBOD now:
> > > >>>> * limiting a topic to the size of a single disk limits scalability
> > > >>>> * the topic movement process is tricky to get right and involves
> > > "racing
> > > >>>> against producers" and wasted double I/Os
> > > >>>>
> > > >>>> Of course, one other question is how frequently we add new disk
> > drives
> > > >>>> to an existing broker.  In this case, you might reasonably want
> disk
> > > >>>> rebalancing to avoid overloading the new disk(s) with writes.
> > > >>>>
> > > >>>> cheers,
> > > >>>> Colin
> > > >>>>
> > > >>>>
> > > >>>> On Fri, Jun 9, 2017, at 18:46, Jun Rao wrote:
> > > >>>> > Just a few comments on this.
> > > >>>> >
> > > >>>> > 1. One of the issues with using RAID 0 is that a single disk
> > failure
> > > >>>> > causes
> > > >>>> > a hard failure of the broker. Hard failure increases the
> > > >>>> unavailability
> > > >>>> > window for all the partitions on the failed broker, which
> includes
> > > the
> > > >>>> > failure detection time (tied to ZK session timeout right now)
> and
> > > >>>> leader
> > > >>>> > election time by the controller. If we support JBOD natively,
> > when a
> > > >>>> > single
> > > >>>> > disk fails, only partitions on the failed disk will experience a
> > > hard
> > > >>>> > failure. The availability of partitions on the rest of the disks is
> > > >>>> > not affected.
> > > >>>> >
> > > >>>> > 2. For running things in the cloud, such as on AWS: currently, each
> > > >>>> > EBS volume has a throughput limit of about 300MB/sec. If you get an
> > > >>>> > enhanced EC2 instance, you can get a 20Gb/sec network. To saturate
> > > >>>> > the network, you may need about 7 EBS volumes. So, being able to
> > > >>>> > support JBOD in the cloud is still potentially useful.
> > > >>>> >
> > > >>>> > 3. On the benefit of balancing data across disks within the same
> > > >>>> broker.
> > > >>>> > Data imbalance can happen across brokers as well as across disks
> > > >>>> within
> > > >>>> > the
> > > >>>> > same broker. Balancing the data across disks within the broker
> has
> > > the
> > > >>>> > benefit of saving network bandwidth as Dong mentioned. So, if
> > intra
> > > >>>> > broker
> > > >>>> > load balancing is possible, it's probably better to avoid the
> more
> > > >>>> > expensive inter broker load balancing. One of the reasons for
> disk
> > > >>>> > imbalance right now is that partitions within a broker are
> > assigned
> > > to
> > > >>>> > disks just based on the partition count. So, it does seem
> possible
> > > for
> > > >>>> > disks to get imbalanced from time to time. If someone can share
> > some
> > > >>>> > stats
> > > >>>> > for that in practice, that will be very helpful.
> > > >>>> >
> > > >>>> > Thanks,
> > > >>>> >
> > > >>>> > Jun
> > > >>>> >
> > > >>>> >
> > > >>>> > On Wed, Jun 7, 2017 at 2:30 PM, Dong Lin <lindon...@gmail.com>
> > > wrote:
> > > >>>> >
> > > >>>> > > Hey Sriram,
> > > >>>> > >
> > > >>>> > > I think there is one way to explain why the ability to move replicas
> > > >>>> > > between disks can save network bandwidth. Let's say the load is
> > > >>>> > > distributed to disks independently of the broker. Sooner or later,
> > > >>>> > > the load imbalance will exceed a threshold and we will need to
> > > >>>> > > rebalance load across disks. Now the question is whether our
> > > >>>> > > rebalancing algorithm will be able to take advantage of locality by
> > > >>>> > > moving replicas between disks on the same broker.
> > > >>>> > >
> > > >>>> > > Say for a given disk, there is a 20% probability it is overloaded, a
> > > >>>> > > 20% probability it is underloaded, and a 60% probability its load is
> > > >>>> > > around the expected average load if the cluster is well balanced.
> > > >>>> > > Then for a broker of 10 disks, we would expect 2 disks to need
> > > >>>> > > in-bound replica movement, 2 disks to need out-bound replica
> > > >>>> > > movement, and 6 disks to need no replica movement. Thus we would
> > > >>>> > > expect KIP-113 to be useful, since we will be able to move replicas
> > > >>>> > > from the two over-loaded disks to the two under-loaded disks on the
> > > >>>> > > same broker. Does this make sense?
> > > >>>> > >
> > > >>>> > > Thanks,
> > > >>>> > > Dong
> > > >>>> > >
> > > >>>> > >
> > > >>>> > >
> > > >>>> > >
> > > >>>> > >
> > > >>>> > >
> > > >>>> > > On Wed, Jun 7, 2017 at 2:12 PM, Dong Lin <lindon...@gmail.com
> >
> > > >>>> wrote:
> > > >>>> > >
> > > >>>> > > > Hey Sriram,
> > > >>>> > > >
> > > >>>> > > > Thanks for raising these concerns. Let me answer these
> > questions
> > > >>>> below:
> > > >>>> > > >
> > > >>>> > > > - The benefit of the additional complexity to move data stored on a
> > > >>>> > > > disk within the broker is to avoid network bandwidth usage. Creating
> > > >>>> > > > a replica on another broker is less efficient than creating a
> > > >>>> > > > replica on another disk in the same broker IF there is actually a
> > > >>>> > > > lightly-loaded disk on the same broker.
> > > >>>> > > >
> > > >>>> > > > - In my opinion the rebalance algorithm would be this: 1) we balance
> > > >>>> > > > the load across brokers using the same algorithm we are using today;
> > > >>>> > > > 2) we balance the load across the disks on a given broker using a
> > > >>>> > > > greedy algorithm, i.e. move replicas from the overloaded disks to the
> > > >>>> > > > lightly loaded disks. The greedy algorithm would only consider the
> > > >>>> > > > capacity and replica size. We can improve it to consider throughput
> > > >>>> > > > in the future.
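> > > >>>> > > >
> > > >>>> > > > A rough Java-style sketch of step 2 (illustrative only; maxDisk,
> > > >>>> > > > minDisk, pickReplicaThatFits and moveReplica are hypothetical
> > > >>>> > > > helpers, not actual code):
> > > >>>> > > >
> > > >>>> > > >     // greedily move replicas from the most loaded disk to the
> > > >>>> > > >     // least loaded disk of the same broker
> > > >>>> > > >     while (maxDisk().usedBytes() - minDisk().usedBytes() > threshold) {
> > > >>>> > > >         Replica r = pickReplicaThatFits(maxDisk(), minDisk());
> > > >>>> > > >         moveReplica(r, minDisk());  // intra-broker move (KIP-113)
> > > >>>> > > >     }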
> > > >>>> > > >
> > > >>>> > > > - With 30 brokers, each having 10 disks, using the rebalancing
> > > >>>> > > > algorithm, the chances of choosing disks within the broker can be
> > > >>>> > > > high. There will always be load imbalance across the disks of the
> > > >>>> > > > same broker, for the same reason that there will always be load
> > > >>>> > > > imbalance across brokers. The algorithm specified above will take
> > > >>>> > > > advantage of the locality, i.e. first balance load across the disks
> > > >>>> > > > of the same broker, and only balance across brokers if some brokers
> > > >>>> > > > are much more loaded than others.
> > > >>>> > > >
> > > >>>> > > > I think it is useful to note that the load imbalance across the disks
> > > >>>> > > > of the same broker is independent of the load imbalance across
> > > >>>> > > > brokers. Both are guaranteed to happen in any Kafka cluster for the
> > > >>>> > > > same reason, i.e. variation in the partition size. Say broker 1 has
> > > >>>> > > > two disks that are 80% loaded and 20% loaded, and broker 2 has two
> > > >>>> > > > disks that are also 80% loaded and 20% loaded. We can balance them
> > > >>>> > > > without inter-broker traffic with KIP-113. This is why I think
> > > >>>> > > > KIP-113 can be very useful.
> > > >>>> > > >
> > > >>>> > > > Does this explanation sound reasonable?
> > > >>>> > > >
> > > >>>> > > > Thanks,
> > > >>>> > > > Dong
> > > >>>> > > >
> > > >>>> > > >
> > > >>>> > > > On Wed, Jun 7, 2017 at 1:33 PM, Sriram Subramanian <
> > > >>>> r...@confluent.io>
> > > >>>> > > > wrote:
> > > >>>> > > >
> > > >>>> > > >> Hey Dong,
> > > >>>> > > >>
> > > >>>> > > >> Thanks for the explanation. I don't think anyone is denying
> > > that
> > > >>>> we
> > > >>>> > > should
> > > >>>> > > >> rebalance at the disk level. I think it is important to
> > restore
> > > >>>> the disk
> > > >>>> > > >> and not wait for disk replacement. There are also other benefits to
> > > >>>> > > >> doing that, such as not needing to opt for hot-swap racks, which can
> > > >>>> > > >> save cost.
> > > >>>> > > >>
> > > >>>> > > >> The question here is what do you save by adding the complexity to
> > > >>>> > > >> move the data stored on a disk within the broker? Why would you not
> > > >>>> > > >> simply create another replica on a disk that results in a balanced
> > > >>>> > > >> load across brokers and have it catch up? We are missing a few
> > > >>>> > > >> things here:
> > > >>>> > > >> 1. What would your data balancing algorithm be? Would it
> > > include
> > > >>>> just
> > > >>>> > > >> capacity or will it also consider throughput on disk to
> > decide
> > > >>>> on the
> > > >>>> > > >> final
> > > >>>> > > >> location of a partition?
> > > >>>> > > >> 2. With 30 brokers, each having 10 disks, using the rebalancing
> > > >>>> > > >> algorithm, the chances of choosing disks within the broker are going
> > > >>>> > > >> to be low. This probability further decreases with more brokers and
> > > >>>> > > >> disks. Given that, why are we trying to save network cost? How much
> > > >>>> > > >> would that saving be if you go that route?
> > > >>>> > > >>
> > > >>>> > > >> These questions are hard to answer without verifying empirically. My
> > > >>>> > > >> suggestion is to avoid doing premature optimization that brings added
> > > >>>> > > >> complexity to the code, and to treat inter- and intra-broker
> > > >>>> > > >> movements of partitions the same. Deploy the code, use it, and see
> > > >>>> > > >> whether it is an actual problem and whether you get great savings by
> > > >>>> > > >> avoiding the network route to move partitions within the same broker.
> > > >>>> > > >> If so, add this optimization.
> > > >>>> > > >>
> > > >>>> > > >> On Wed, Jun 7, 2017 at 1:03 PM, Dong Lin <
> > lindon...@gmail.com>
> > > >>>> wrote:
> > > >>>> > > >>
> > > >>>> > > >> > Hey Jay, Sriram,
> > > >>>> > > >> >
> > > >>>> > > >> > Great point. If I understand you right, you are
> suggesting
> > > >>>> that we can
> > > >>>> > > >> > simply use RAID-0 so that the load can be evenly
> > distributed
> > > >>>> across
> > > >>>> > > >> disks.
> > > >>>> > > >> > And even though a disk failure will bring down the entire broker,
> > > >>>> > > >> > the reduced availability as compared to using KIP-112 and KIP-113
> > > >>>> > > >> > may be negligible. And it may be better to just accept the slightly
> > > >>>> > > >> > reduced availability instead of introducing the complexity from
> > > >>>> > > >> > KIP-112 and KIP-113.
> > > >>>> > > >> >
> > > >>>> > > >> > Let's assume the following:
> > > >>>> > > >> >
> > > >>>> > > >> > - There are 30 brokers in a cluster and each broker has
> 10
> > > >>>> disks
> > > >>>> > > >> > - The replication factor is 3 and min.isr = 2.
> > > >>>> > > >> > - The annual disk failure rate is 2% according to this
> > > >>>> > > >> > <https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/>
> > > >>>> > > >> > blog.
> > > >>>> > > >> > - It takes 3 days to replace a disk.
> > > >>>> > > >> >
> > > >>>> > > >> > Here is my calculation for the probability of data loss due to disk
> > > >>>> > > >> > failure:
> > > >>>> > > >> > probability that a given disk fails in a given year: 2%
> > > >>>> > > >> > probability that a given disk is offline on a given day: 2% / 365 * 3
> > > >>>> > > >> > probability that a given broker is offline on a given day due to
> > > >>>> > > >> > disk failure: 2% / 365 * 3 * 10
> > > >>>> > > >> > probability that any broker is offline on a given day due to disk
> > > >>>> > > >> > failure: 2% / 365 * 3 * 10 * 30 = 5%
> > > >>>> > > >> > probability that any three brokers are offline on a given day due to
> > > >>>> > > >> > disk failure: 5% * 5% * 5% = 0.0125%
> > > >>>> > > >> > probability of data loss due to disk failure: 0.0125%
> > > >>>> > > >> >
> > > >>>> > > >> > Here is my calculation for the probability of service unavailability
> > > >>>> > > >> > due to disk failure:
> > > >>>> > > >> > probability that a given disk fails in a given year: 2%
> > > >>>> > > >> > probability that a given disk is offline on a given day: 2% / 365 * 3
> > > >>>> > > >> > probability that a given broker is offline on a given day due to
> > > >>>> > > >> > disk failure: 2% / 365 * 3 * 10
> > > >>>> > > >> > probability that any broker is offline on a given day due to disk
> > > >>>> > > >> > failure: 2% / 365 * 3 * 10 * 30 = 5%
> > > >>>> > > >> > probability that any two brokers are offline on a given day due to
> > > >>>> > > >> > disk failure: 5% * 5% = 0.25%
> > > >>>> > > >> > probability of unavailability due to disk failure: 0.25%
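> > > >>>> > > >> >
> > > >>>> > > >> > A quick sanity check of the numbers above (an illustrative snippet
> > > >>>> > > >> > only, mirroring the approximations used here):
> > > >>>> > > >> >
> > > >>>> > > >> >     double pDiskDown = 0.02 / 365 * 3;      // disk offline on a given day
> > > >>>> > > >> >     double pBrokerDown = pDiskDown * 10;    // broker with 10 disks
> > > >>>> > > >> >     double pAnyDown = pBrokerDown * 30;     // ~0.05 across 30 brokers
> > > >>>> > > >> >     double pUnavailable = Math.pow(pAnyDown, 2);  // ~0.25% (min.isr = 2)
> > > >>>> > > >> >     double pDataLoss = Math.pow(pAnyDown, 3);     // ~0.0125% (RF = 3)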
> > > >>>> > > >> >
> > > >>>> > > >> > Note that the unavailability due to disk failure would be
> > > >>>> > > >> > unacceptably high in this case, and the probability of data loss due
> > > >>>> > > >> > to disk failure would be higher than 0.01%. Neither is acceptable if
> > > >>>> > > >> > Kafka is intended to achieve four nines of availability.
> > > >>>> > > >> >
> > > >>>> > > >> > Thanks,
> > > >>>> > > >> > Dong
> > > >>>> > > >> >
> > > >>>> > > >> >
> > > >>>> > > >> > On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps <
> > j...@confluent.io
> > > >
> > > >>>> wrote:
> > > >>>> > > >> >
> > > >>>> > > >> > > I think Ram's point is that in-place failure handling is pretty
> > > >>>> > > >> > > complicated, and since this is meant to be a cost-saving feature,
> > > >>>> > > >> > > we should construct an argument for it grounded in data.
> > > >>>> > > >> > >
> > > >>>> > > >> > > Assume an annual failure rate of 1% (reasonable, but
> data
> > > is
> > > >>>> > > available
> > > >>>> > > >> > > online), and assume it takes 3 days to get the drive
> > > >>>> replaced. Say
> > > >>>> > > you
> > > >>>> > > >> > have
> > > >>>> > > >> > > 10 drives per server. Then the expected downtime for
> each
> > > >>>> server is
> > > >>>> > > >> > roughly
> > > >>>> > > >> > > 1% * 3 days * 10 = 0.3 days/year (this is slightly off
> > > since
> > > >>>> I'm
> > > >>>> > > >> ignoring
> > > >>>> > > >> > > the case of multiple failures, but I don't know that
> > > changes
> > > >>>> it
> > > >>>> > > >> much). So
> > > >>>> > > >> > > the savings from this feature is 0.3/365 = 0.08%. Say
> you
> > > >>>> have 1000
> > > >>>> > > >> > servers
> > > >>>> > > >> > > and they cost $3000/year fully loaded including power, the cost of
> > > >>>> > > >> > > the hw amortized over its life, etc. Then this feature saves you
> > > >>>> > > >> > > $3000 on your total server cost of $3m, which seems not very
> > > >>>> > > >> > > worthwhile compared to other optimizations...?
> > > >>>> > > >> > >
> > > >>>> > > >> > > Anyhow, not sure the arithmetic is right there, but I think that is
> > > >>>> > > >> > > the type of argument that would be helpful for thinking about the
> > > >>>> > > >> > > tradeoff in complexity.
> > > >>>> > > >> > >
> > > >>>> > > >> > > -Jay
> > > >>>> > > >> > >
> > > >>>> > > >> > >
> > > >>>> > > >> > >
> > > >>>> > > >> > > On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <
> > > >>>> lindon...@gmail.com>
> > > >>>> > > wrote:
> > > >>>> > > >> > >
> > > >>>> > > >> > > > Hey Sriram,
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > Thanks for taking time to review the KIP. Please see
> > > below
> > > >>>> my
> > > >>>> > > >> answers
> > > >>>> > > >> > to
> > > >>>> > > >> > > > your questions:
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > >1. Could you pick a hardware/Kafka configuration and
> > go
> > > >>>> over what
> > > >>>> > > >> is
> > > >>>> > > >> > the
> > > >>>> > > >> > > > >average disk/partition repair/restore time that we
> are
> > > >>>> targeting
> > > >>>> > > >> for a
> > > >>>> > > >> > > > >typical JBOD setup?
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > We currently don't have this data. I think the disk/partition
> > > >>>> > > >> > > > repair/restore time depends on the availability of hardware, the
> > > >>>> > > >> > > > response time of the site-reliability engineers, the amount of data
> > > >>>> > > >> > > > on the bad disk, etc. These vary between companies and even between
> > > >>>> > > >> > > > clusters within the same company, and it is probably hard to
> > > >>>> > > >> > > > determine what the average situation is.
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > I am not very sure why we need this. Can you explain
> a
> > > bit
> > > >>>> why
> > > >>>> > > this
> > > >>>> > > >> > data
> > > >>>> > > >> > > is
> > > >>>> > > >> > > > useful to evaluate the motivation and design of this
> > KIP?
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > >2. How often do we believe disks are going to fail
> (in
> > > >>>> your
> > > >>>> > > example
> > > >>>> > > >> > > > >configuration) and what do we gain by avoiding the
> > > network
> > > >>>> > > overhead
> > > >>>> > > >> > and
> > > >>>> > > >> > > > >doing all the work of moving the replica within the
> > > >>>> broker to
> > > >>>> > > >> another
> > > >>>> > > >> > > disk
> > > >>>> > > >> > > > >instead of balancing it globally?
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > I think the chance of disk failure depends mainly on
> > the
> > > >>>> disk
> > > >>>> > > itself
> > > >>>> > > >> > > rather
> > > >>>> > > >> > > > than the broker configuration. I don't have this data
> > > now.
> > > >>>> I will
> > > >>>> > > >> ask
> > > >>>> > > >> > our
> > > >>>> > > >> > > > SRE whether they know the mean time to failure for our disks. What
> > > >>>> > > >> > > > I was told by SRE is that disk failure is the most common type of
> > > >>>> > > >> > > > hardware failure.
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > When there is a disk failure, I think it is reasonable to move the
> > > >>>> > > >> > > > replica to another broker instead of another disk on the same
> > > >>>> > > >> > > > broker. The reason we want to move replicas within a broker is
> > > >>>> > > >> > > > mainly to optimize Kafka cluster performance when we balance load
> > > >>>> > > >> > > > across disks.
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > In comparison to balancing replicas globally, the benefits of
> > > >>>> > > >> > > > moving replicas within a broker are that:
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > 1) the movement is faster since it doesn't go through a socket or
> > > >>>> > > >> > > > rely on the available network bandwidth;
> > > >>>> > > >> > > > 2) there is much less impact on the replication traffic between
> > > >>>> > > >> > > > brokers, since it does not take up bandwidth between brokers.
> > > >>>> > > >> > > > Depending on the traffic pattern, we may need to balance load
> > > >>>> > > >> > > > across disks frequently, and it is necessary to prevent this
> > > >>>> > > >> > > > operation from slowing down the existing operations (e.g. produce,
> > > >>>> > > >> > > > consume, replication) in the Kafka cluster;
> > > >>>> > > >> > > > 3) it gives us the opportunity to do automatic rebalancing between
> > > >>>> > > >> > > > disks on the same broker.
> > > >>>> > > >> > > >
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > >3. Even if we had to move the replica within the
> > broker,
> > > >>>> why
> > > >>>> > > >> cannot we
> > > >>>> > > >> > > > just
> > > >>>> > > >> > > > >treat it as another replica and have it go through
> the
> > > >>>> same
> > > >>>> > > >> > replication
> > > >>>> > > >> > > > >code path that we have today? The downside here is
> > > >>>> obviously that
> > > >>>> > > >> you
> > > >>>> > > >> > > need
> > > >>>> > > >> > > > >to catchup from the leader but it is completely
> free!
> > > >>>> What do we
> > > >>>> > > >> think
> > > >>>> > > >> > > is
> > > >>>> > > >> > > > >the impact of the network overhead in this case?
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > Good point. My initial proposal actually used the existing
> > > >>>> > > >> > > > ReplicaFetcherThread (i.e. the existing code path) to move replicas
> > > >>>> > > >> > > > between disks. However, I switched to using a separate thread pool
> > > >>>> > > >> > > > after discussion with Jun and Becket.
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > The main argument for using a separate thread pool is actually to
> > > >>>> > > >> > > > keep the design simple and easy to reason about. There are a number
> > > >>>> > > >> > > > of differences between inter-broker replication and intra-broker
> > > >>>> > > >> > > > replication which make it cleaner to do them in separate code
> > > >>>> > > >> > > > paths. I will list them below:
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > - The throttling mechanism for inter-broker replication traffic
> > > >>>> > > >> > > > and intra-broker replication traffic is different. For example, we
> > > >>>> > > >> > > > may want to specify a per-topic quota for inter-broker replication
> > > >>>> > > >> > > > traffic because we may want some topics to be moved faster than
> > > >>>> > > >> > > > other topics. But we don't care about the priority of topics for
> > > >>>> > > >> > > > intra-broker movement. So the current proposal only allows the user
> > > >>>> > > >> > > > to specify a per-broker quota for intra-broker replication traffic.
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > - The quota value for inter-broker replication traffic and
> > > >>>> > > >> > > > intra-broker replication traffic is different. The available
> > > >>>> > > >> > > > bandwidth for intra-broker replication can probably be much higher
> > > >>>> > > >> > > > than the bandwidth for inter-broker replication.
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > - The ReplicaFetcherThread is per broker. Intuitively, the number
> > > >>>> > > >> > > > of threads doing intra-broker data movement should be related to
> > > >>>> > > >> > > > the number of disks in the broker, not the number of brokers in
> > > >>>> > > >> > > > the cluster.
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > - The leader replica has no ReplicaFetcherThread to start with. It
> > > >>>> > > >> > > > seems weird to start one just for intra-broker replication.
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > Because of these differences, we think it is simpler to use a
> > > >>>> > > >> > > > separate thread pool and code path so that we can configure and
> > > >>>> > > >> > > > throttle them separately.
> > > >>>> > > >> > > >
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > >4. What are the chances that we will be able to
> > identify
> > > >>>> another
> > > >>>> > > >> disk
> > > >>>> > > >> > to
> > > >>>> > > >> > > > >balance within the broker instead of another disk on
> > > >>>> another
> > > >>>> > > >> broker?
> > > >>>> > > >> > If
> > > >>>> > > >> > > we
> > > >>>> > > >> > > > >have 100's of machines, the probability of finding a
> > > >>>> better
> > > >>>> > > >> balance by
> > > >>>> > > >> > > > >choosing another broker is much higher than
> balancing
> > > >>>> within the
> > > >>>> > > >> > broker.
> > > >>>> > > >> > > > >Could you add some info on how we are determining
> > this?
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > It is possible that we can find available space on a remote broker.
> > > >>>> > > >> > > > The benefit of allowing intra-broker replication is that, when there
> > > >>>> > > >> > > > is available space on both the current broker and a remote broker,
> > > >>>> > > >> > > > the rebalance can be completed faster with much less impact on
> > > >>>> > > >> > > > inter-broker replication or user traffic. It is about taking
> > > >>>> > > >> > > > advantage of locality when balancing the load.
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > >5. Finally, in a cloud setup where more users are
> > going
> > > to
> > > >>>> > > >> leverage a
> > > >>>> > > >> > > > >shared filesystem (example, EBS in AWS), all this
> > change
> > > >>>> is not
> > > >>>> > > of
> > > >>>> > > >> > much
> > > >>>> > > >> > > > >gain since you don't need to balance between the
> > volumes
> > > >>>> within
> > > >>>> > > the
> > > >>>> > > >> > same
> > > >>>> > > >> > > > >broker.
> > > >>>> > > >> > > >
> > > >>>> > > >> > > > You are right. KIP-113 is useful only if the user uses JBOD. If the
> > > >>>> > > >> > > > user uses an extra storage layer of replication, such as RAID-10 or
> > > >>>> > > >> > > > EBS, they don't need KIP-112 or KIP-113. Note that the user will
> > > >>>> > > >> > > > replicate data more times than the replication factor of the Kafka
> > > >>>> > > >> > > > topic if an extra storage layer of replication is used.
> > > >>>> > > >> > > >
> > > >>>> > > >> > >
> > > >>>> > > >> >
> > > >>>> > > >>
> > > >>>> > > >
> > > >>>> > > >
> > > >>>> > >
> > > >>>>
> > > >>>
> > > >>>
> > > >>
> > > >
> > >
> >
>
