Thanks Dong. I have a few initial questions; sorry if this has been
discussed and I missed it.

1. The KIP suggests that the reassignment tool is responsible for sending
the ChangeReplicaDirRequests to the relevant brokers. I had imagined that
this would be done by the Controller, like the rest of the reassignment
process. Was this considered? If so, it would be good to include the
details of why it was rejected in the "Rejected Alternatives" section.

2. The reassignment JSON format was extended so that one can choose the log
directory for a partition. This means that the log directory has to be the
same for all replicas of a given partition. The alternative would be for
the log dir to be assignable for each replica. Similar to the other
question, it would be good to have a section in "Rejected Alternatives" for
this approach. It's generally very helpful to have more information on the
rationale for the design choices that were made and rejected.
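
For example (field names below are purely illustrative, not a concrete
proposal), the per-replica alternative could look something like:

    {"version": 1,
     "partitions": [
       {"topic": "foo",
        "partition": 0,
        "replicas": [1, 2],
        "log_dirs": ["/data/disk1", "any"]}]}

where "log_dirs" would have one entry per replica, with "any" meaning the
broker picks the directory as it does today.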

3. Should changeReplicaDir be alterReplicaDir? We have used `alter` for
other methods.

Thanks,
Ismael




On Fri, Aug 4, 2017 at 5:37 AM, Dong Lin <lindon...@gmail.com> wrote:

> Hi all,
>
> I realized that we need new API in AdminClient in order to use the new
> request/response added in KIP-113. Since this is required by KIP-113, I
> choose to add the new interface in this KIP instead of creating a new KIP.
>
> The documentation of the new API in AdminClient can be found here
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 113%3A+Support+replicas+movement+between+log+directories#KIP-113:
> Supportreplicasmovementbetweenlogdirectories-AdminClient>.
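>
> To give a rough idea (the wiki has the authoritative definition; the exact
> names below are only a sketch), the new method looks roughly like this:
>
>     // Sketch of the proposed AdminClient addition -- see the wiki for the
>     // full definition. It moves each listed replica to the given log
>     // directory on its broker.
>     ChangeReplicaDirResult changeReplicaDir(
>         Map<TopicPartitionReplica, String> replicaAssignment);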
> Can you please review and comment if you have any concern?
>
> Thanks!
> Dong
>
>
>
> On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com> wrote:
>
> > The protocol change has been updated in KIP-113
> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 113%3A+Support+replicas+movement+between+log+directories>
> > .
> >
> > On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> I have made a minor change to the DescribeDirsRequest so that the user can
> >> choose to query the status for a specific list of partitions. This is a bit
> >> more fine-grained than the previous format, which only allowed querying the
> >> status for a specific list of topics. I realized that querying the status
> >> of selected partitions can be useful to check whether the reassignment of
> >> the replicas to the specific log directories has been completed.
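> >>
> >> For example (just to illustrate the idea, not an exact schema), a tool that
> >> reassigned partitions 0 and 3 of topic "foo" to new log directories could
> >> now query just those two partitions in the DescribeDirsRequest, rather than
> >> all partitions of "foo", and check that each replica reports the intended
> >> log directory.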
> >>
> >> I will assume this minor change is OK if there is no concern with it in
> >> the community :)
> >>
> >> Thanks,
> >> Dong
> >>
> >>
> >>
> >> On Mon, Jun 12, 2017 at 10:46 AM, Dong Lin <lindon...@gmail.com> wrote:
> >>
> >>> Hey Colin,
> >>>
> >>> Thanks for the suggestion. We have actually considered this and listed
> >>> it as the first item of future work in KIP-112
> >>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 112%3A+Handle+disk+failure+for+JBOD>.
> >>> The two advantages that you mentioned are exactly the motivation for
> this
> >>> feature. Also as you have mentioned, this involves the tradeoff between
> >>> disk performance and availability -- the more you distribute topics across
> >>> disks, the more topics will be offline due to a single disk failure.
> >>>
> >>> Beyond its complexity, it is not clear to me that the reduced rebalance
> >>> overhead is worth the reduction in availability. I am optimistic that the
> >>> rebalance overhead will not be that big a problem, since we are not too
> >>> bothered by cross-broker rebalance as of now.
> >>>
> >>> Thanks,
> >>> Dong
> >>>
> >>> On Mon, Jun 12, 2017 at 10:36 AM, Colin McCabe <cmcc...@apache.org>
> >>> wrote:
> >>>
> >>>> Has anyone considered a scheme for sharding topic data across multiple
> >>>> disks?
> >>>>
> >>>> For example, if you sharded topics across 3 disks, and you had 10
> disks,
> >>>> you could pick a different set of 3 disks for each topic.  If you
> >>>> distribute them randomly then you have 10 choose 3 = 120 different
> >>>> combinations.  You would probably never need rebalancing if you had a
> >>>> reasonable distribution of topic sizes (could probably prove this
> with a
> >>>> Monte Carlo or something).
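> >>>>
> >>>> A rough sketch of such a simulation (purely illustrative -- the topic-size
> >>>> distribution and counts below are made up) might look like:
> >>>>
> >>>>     import java.util.Random;
> >>>>
> >>>>     public class ShardBalanceSim {
> >>>>         public static void main(String[] args) {
> >>>>             int disks = 10, shards = 3, topics = 200, trials = 1000;
> >>>>             Random rnd = new Random(42);
> >>>>             double worst = 0;  // worst max/avg disk load over all trials
> >>>>             for (int t = 0; t < trials; t++) {
> >>>>                 double[] load = new double[disks];
> >>>>                 for (int i = 0; i < topics; i++) {
> >>>>                     // skewed topic sizes, spread evenly over 3 random disks
> >>>>                     double size = Math.exp(rnd.nextGaussian());
> >>>>                     int a = rnd.nextInt(disks), b, c;
> >>>>                     do { b = rnd.nextInt(disks); } while (b == a);
> >>>>                     do { c = rnd.nextInt(disks); } while (c == a || c == b);
> >>>>                     load[a] += size / shards;
> >>>>                     load[b] += size / shards;
> >>>>                     load[c] += size / shards;
> >>>>                 }
> >>>>                 double max = 0, sum = 0;
> >>>>                 for (double l : load) { max = Math.max(max, l); sum += l; }
> >>>>                 worst = Math.max(worst, max / (sum / disks));
> >>>>             }
> >>>>             System.out.println("worst max/avg disk load: " + worst);
> >>>>         }
> >>>>     }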
> >>>>
> >>>> The disadvantage is that if one of the 3 disks fails, then you have to
> >>>> take the topic offline.  But if we assume independent disk failure
> >>>> probabilities, probability of failure with RAID 0 is: 1 -
> >>>> Psuccess^(num_disks) whereas the probability of failure with this
> scheme
> >>>> is 1 - Psuccess ^ 3.
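> >>>>
> >>>> For instance, if each disk independently survives a year with probability
> >>>> 0.98, that is roughly 1 - 0.98^10 ≈ 18% per year for a broker with RAID 0
> >>>> over 10 disks, versus 1 - 0.98^3 ≈ 6% per year for a topic sharded across
> >>>> 3 disks.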
> >>>>
> >>>> This addresses the biggest downsides of JBOD now:
> >>>> * limiting a topic to the size of a single disk limits scalability
> >>>> * the topic movement process is tricky to get right and involves
> "racing
> >>>> against producers" and wasted double I/Os
> >>>>
> >>>> Of course, one other question is how frequently we add new disk drives
> >>>> to an existing broker.  In this case, you might reasonably want disk
> >>>> rebalancing to avoid overloading the new disk(s) with writes.
> >>>>
> >>>> cheers,
> >>>> Colin
> >>>>
> >>>>
> >>>> On Fri, Jun 9, 2017, at 18:46, Jun Rao wrote:
> >>>> > Just a few comments on this.
> >>>> >
> >>>> > 1. One of the issues with using RAID 0 is that a single disk failure
> >>>> > causes
> >>>> > a hard failure of the broker. Hard failure increases the
> >>>> unavailability
> >>>> > window for all the partitions on the failed broker, which includes
> the
> >>>> > failure detection time (tied to ZK session timeout right now) and
> >>>> leader
> >>>> > election time by the controller. If we support JBOD natively, when a
> >>>> > single
> >>>> > disk fails, only partitions on the failed disk will experience a
> hard
> >>>> > failure. The availability for partitions on the rest of the disks
> are
> >>>> not
> >>>> > affected.
> >>>> >
> >>>> > 2. For running things in the cloud such as AWS: currently, each EBS
> >>>> > volume has a throughput limit of about 300MB/sec. If you get an enhanced
> EC2
> >>>> > instance, you can get 20Gb/sec network. To saturate the network, you
> >>>> may
> >>>> > need about 7 EBS volumes. So, being able to support JBOD in the
> Cloud
> >>>> is
> >>>> > still potentially useful.
> >>>> >
> >>>> > 3. On the benefit of balancing data across disks within the same
> >>>> broker.
> >>>> > Data imbalance can happen across brokers as well as across disks
> >>>> within
> >>>> > the
> >>>> > same broker. Balancing the data across disks within the broker has
> the
> >>>> > benefit of saving network bandwidth as Dong mentioned. So, if intra-broker
> >>>> > load balancing is possible, it's probably better to avoid the more
> >>>> > expensive inter broker load balancing. One of the reasons for disk
> >>>> > imbalance right now is that partitions within a broker are assigned
> to
> >>>> > disks just based on the partition count. So, it does seem possible
> for
> >>>> > disks to get imbalanced from time to time. If someone can share some
> >>>> > stats
> >>>> > for that in practice, that will be very helpful.
> >>>> >
> >>>> > Thanks,
> >>>> >
> >>>> > Jun
> >>>> >
> >>>> >
> >>>> > On Wed, Jun 7, 2017 at 2:30 PM, Dong Lin <lindon...@gmail.com>
> wrote:
> >>>> >
> >>>> > > Hey Sriram,
> >>>> > >
> >>>> > > I think there is one way to explain why the ability to move
> replica
> >>>> between
> >>>> > > disks can save space. Let's say the load is distributed to disks
> >>>> > > independent of the broker. Sooner or later, the load imbalance
> will
> >>>> exceed
> >>>> > > a threshold and we will need to rebalance load across disks. Now
> our
> >>>> > > questions is whether our rebalancing algorithm will be able to
> take
> >>>> > > advantage of locality by moving replicas between disks on the same
> >>>> broker.
> >>>> > >
> >>>> > > Say for a given disk, there is 20% probability it is overloaded,
> 20%
> >>>> > > probability it is underloaded, and 60% probability its load is
> >>>> around the
> >>>> > > expected average load if the cluster is well balanced. Then for a
> >>>> broker of
> >>>> > > 10 disks, we would expect 2 disks to need in-bound replica movement,
> >>>> > > 2 disks to need out-bound replica movement, and 6 disks to need no
> >>>> > > replica movement. Thus we would expect KIP-113 to be useful, since we
> >>>> > > will be able to move replicas from the two over-loaded disks to the two
> >>>> > > under-loaded disks on the same broker. Does this make sense?
> >>>> > >
> >>>> > > Thanks,
> >>>> > > Dong
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > > On Wed, Jun 7, 2017 at 2:12 PM, Dong Lin <lindon...@gmail.com>
> >>>> wrote:
> >>>> > >
> >>>> > > > Hey Sriram,
> >>>> > > >
> >>>> > > > Thanks for raising these concerns. Let me answer these questions
> >>>> below:
> >>>> > > >
> >>>> > > > - The benefit of the additional complexity to move the data stored
> >>>> > > > on a disk within the broker is to avoid network bandwidth usage.
> >>>> > > > Creating a replica on another broker is less efficient than creating
> >>>> > > > one on another disk in the same broker IF there is actually a
> >>>> > > > lightly-loaded disk on the same broker.
> >>>> > > >
> >>>> > > > - In my opinion the rebalance algorithm would be this: 1) we balance
> >>>> > > > the load across brokers using the same algorithm we are using today.
> >>>> > > > 2) we balance load across disks on a given broker using a greedy
> >>>> > > > algorithm, i.e. move replicas from the overloaded disk to the lightly
> >>>> > > > loaded disk (a rough sketch follows after these points). The greedy
> >>>> > > > algorithm would only consider the capacity and replica size. We can
> >>>> > > > improve it to consider throughput in the future.
> >>>> > > >
> >>>> > > > - With 30 brokers, each having 10 disks, using the rebalancing
> >>>> > > > algorithm, the chances of choosing disks within the broker can be high.
> >>>> There will
> >>>> > > > always be load imbalance across disks of the same broker for the
> >>>> same
> >>>> > > > reason that there will always be load imbalance across brokers.
> >>>> The
> >>>> > > > algorithm specified above will take advantage of the locality,
> >>>> i.e. first
> >>>> > > > balance load across disks of the same broker, and only balance
> >>>> across
> >>>> > > > brokers if some brokers are much more loaded than others.
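> >>>> > > >
> >>>> > > > Here is the rough sketch of the greedy step mentioned in 2) above
> >>>> > > > (purely illustrative -- the real implementation would work on actual
> >>>> > > > replica and log-directory metadata, and it ignores throughput):
> >>>> > > >
> >>>> > > >     import java.util.*;
> >>>> > > >
> >>>> > > >     public class GreedyDiskBalancer {
> >>>> > > >         // disks: log dir -> sizes (bytes) of the replicas on that disk
> >>>> > > >         static List<String> plan(Map<String, List<Long>> disks, long threshold) {
> >>>> > > >             List<String> moves = new ArrayList<>();
> >>>> > > >             if (disks.size() < 2) return moves;  // nothing to balance
> >>>> > > >             while (true) {
> >>>> > > >                 String src = null, dst = null;
> >>>> > > >                 long max = Long.MIN_VALUE, min = Long.MAX_VALUE;
> >>>> > > >                 for (Map.Entry<String, List<Long>> e : disks.entrySet()) {
> >>>> > > >                     long load = 0;
> >>>> > > >                     for (long s : e.getValue()) load += s;
> >>>> > > >                     if (load > max) { max = load; src = e.getKey(); }
> >>>> > > >                     if (load < min) { min = load; dst = e.getKey(); }
> >>>> > > >                 }
> >>>> > > >                 if (max - min <= threshold) break;
> >>>> > > >                 // move the largest replica that fits in half the gap so
> >>>> > > >                 // that every step strictly reduces the imbalance
> >>>> > > >                 List<Long> candidates = disks.get(src);
> >>>> > > >                 candidates.sort(Collections.reverseOrder());
> >>>> > > >                 Long pick = null;
> >>>> > > >                 for (Long size : candidates)
> >>>> > > >                     if (size > 0 && size <= (max - min) / 2) { pick = size; break; }
> >>>> > > >                 if (pick == null) break;  // nothing movable without overshooting
> >>>> > > >                 candidates.remove(pick);
> >>>> > > >                 disks.get(dst).add(pick);
> >>>> > > >                 moves.add("move " + pick + "B from " + src + " to " + dst);
> >>>> > > >             }
> >>>> > > >             return moves;
> >>>> > > >         }
> >>>> > > >     }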
> >>>> > > >
> >>>> > > > I think it is useful to note that the load imbalance across
> disks
> >>>> of the
> >>>> > > > same broker is independent of the load imbalance across brokers.
> >>>> Both are
> >>>> > > > guaranteed to happen in any Kafka cluster for the same reason,
> >>>> i.e.
> >>>> > > > variation in the partition size. Say broker 1 has two disks that are
> >>>> > > > 80% loaded and 20% loaded, and broker 2 also has two disks that are
> >>>> > > > 80% loaded and 20% loaded. We can balance them without inter-broker traffic
> >>>> with
> >>>> > > > KIP-113.  This is why I think KIP-113 can be very useful.
> >>>> > > >
> >>>> > > > Do these explanations sound reasonable?
> >>>> > > >
> >>>> > > > Thanks,
> >>>> > > > Dong
> >>>> > > >
> >>>> > > >
> >>>> > > > On Wed, Jun 7, 2017 at 1:33 PM, Sriram Subramanian <
> >>>> r...@confluent.io>
> >>>> > > > wrote:
> >>>> > > >
> >>>> > > >> Hey Dong,
> >>>> > > >>
> >>>> > > >> Thanks for the explanation. I don't think anyone is denying
> that
> >>>> we
> >>>> > > should
> >>>> > > >> rebalance at the disk level. I think it is important to restore
> >>>> the disk
> >>>> > > >> and not wait for disk replacement. Another benefit of doing that is
> >>>> > > >> that you don't need to opt for hot-swap racks, which can save cost.
> >>>> > > >>
> >>>> > > >> The question here is what do you save by trying to add
> >>>> complexity to
> >>>> > > move
> >>>> > > >> the data stored on a disk within the broker? Why would you not
> >>>> simply
> >>>> > > >> create another replica on the disk that results in a balanced
> >>>> load
> >>>> > > across
> >>>> > > >> brokers and have it catch up? We are missing a few things here -
> >>>> > > >> 1. What would your data balancing algorithm be? Would it
> include
> >>>> just
> >>>> > > >> capacity or will it also consider throughput on disk to decide
> >>>> on the
> >>>> > > >> final
> >>>> > > >> location of a partition?
> >>>> > > >> 2. With 30 brokers, each having 10 disks, using the rebalancing
> >>>> > > >> algorithm, the chances of choosing disks within the broker are going
> >>>> > > >> to be low. This probability further decreases with more brokers and
> >>>> disks.
> >>>> > > Given
> >>>> > > >> that, why are we trying to save network cost? How much would
> >>>> that saving
> >>>> > > >> be
> >>>> > > >> if you go that route?
> >>>> > > >>
> >>>> > > >> These questions are hard to answer without verifying empirically. My
> >>>> > > >> suggestion is to avoid doing premature optimization that brings added
> >>>> > > >> complexity to the code, and to treat inter- and intra-broker movements
> >>>> > > >> of partitions the same. Deploy the code, use it and see if it is an
> >>>> actual
> >>>> > > >> problem and you get great savings by avoiding the network route
> >>>> to move
> >>>> > > >> partitions within the same broker. If so, add this
> optimization.
> >>>> > > >>
> >>>> > > >> On Wed, Jun 7, 2017 at 1:03 PM, Dong Lin <lindon...@gmail.com>
> >>>> wrote:
> >>>> > > >>
> >>>> > > >> > Hey Jay, Sriram,
> >>>> > > >> >
> >>>> > > >> > Great point. If I understand you right, you are suggesting
> >>>> that we can
> >>>> > > >> > simply use RAID-0 so that the load can be evenly distributed
> >>>> across
> >>>> > > >> disks.
> >>>> > > >> > And even though a disk failure will bring down the entire broker,
> >>>> > > >> > the reduced availability as compared to using KIP-112 and KIP-113
> >>>> > > >> > may be negligible. And it may be better to just accept the slightly
> >>>> reduced
> >>>> > > >> > availability instead of introducing the complexity from
> >>>> KIP-112 and
> >>>> > > >> > KIP-113.
> >>>> > > >> >
> >>>> > > >> > Let's assume the following:
> >>>> > > >> >
> >>>> > > >> > - There are 30 brokers in a cluster and each broker has 10
> >>>> disks
> >>>> > > >> > - The replication factor is 3 and min.isr = 2.
> >>>> > > >> > - The annual disk failure rate is 2% according to this
> >>>> > > >> > <https://www.backblaze.com/blog/hard-drive-failure-rates-q1-
> >>>> 2017/>
> >>>> > > >> blog.
> >>>> > > >> > - It takes 3 days to replace a disk.
> >>>> > > >> >
> >>>> > > >> > Here is my calculation for the probability of data loss due to
> >>>> > > >> > disk failure:
> >>>> > > >> > probability that a given disk fails in a given year: 2%
> >>>> > > >> > probability that a given disk is offline on a given day: 2% / 365 * 3
> >>>> > > >> > probability that a given broker is offline on a given day due to
> >>>> > > >> > disk failure: 2% / 365 * 3 * 10
> >>>> > > >> > probability that some broker is offline on a given day due to disk
> >>>> > > >> > failure: 2% / 365 * 3 * 10 * 30 = 5%
> >>>> > > >> > probability that some three brokers are offline on a given day due
> >>>> > > >> > to disk failure: 5% * 5% * 5% = 0.0125%
> >>>> > > >> > probability of data loss due to disk failure: 0.0125%
> >>>> > > >> >
> >>>> > > >> > Here is my calculation for the probability of service
> >>>> > > >> > unavailability due to disk failure:
> >>>> > > >> > probability that a given disk fails in a given year: 2%
> >>>> > > >> > probability that a given disk is offline on a given day: 2% / 365 * 3
> >>>> > > >> > probability that a given broker is offline on a given day due to
> >>>> > > >> > disk failure: 2% / 365 * 3 * 10
> >>>> > > >> > probability that some broker is offline on a given day due to disk
> >>>> > > >> > failure: 2% / 365 * 3 * 10 * 30 = 5%
> >>>> > > >> > probability that some two brokers are offline on a given day due to
> >>>> > > >> > disk failure: 5% * 5% = 0.25%
> >>>> > > >> > probability of unavailability due to disk failure: 0.25%
> >>>> > > >> >
> >>>> > > >> > Note that the unavailability due to disk failure will be
> >>>> unacceptably
> >>>> > > >> high
> >>>> > > >> > in this case. And the probability of data loss due to disk
> >>>> failure
> >>>> > > will
> >>>> > > >> be
> >>>> > > >> > higher than 0.01%. Neither is acceptable if Kafka is intended to
> >>>> > > >> > achieve four nines of availability.
> >>>> > > >> >
> >>>> > > >> > Thanks,
> >>>> > > >> > Dong
> >>>> > > >> >
> >>>> > > >> >
> >>>> > > >> > On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps <j...@confluent.io
> >
> >>>> wrote:
> >>>> > > >> >
> >>>> > > >> > > I think Ram's point is that handling failures in place is pretty
> >>>> > > >> > > complicated, and since this is meant to be a cost-saving feature,
> >>>> > > >> > > we should construct an argument for it grounded in data.
> >>>> > > >> > >
> >>>> > > >> > > Assume an annual failure rate of 1% (reasonable, but data
> is
> >>>> > > available
> >>>> > > >> > > online), and assume it takes 3 days to get the drive
> >>>> replaced. Say
> >>>> > > you
> >>>> > > >> > have
> >>>> > > >> > > 10 drives per server. Then the expected downtime for each
> >>>> server is
> >>>> > > >> > roughly
> >>>> > > >> > > 1% * 3 days * 10 = 0.3 days/year (this is slightly off
> since
> >>>> I'm
> >>>> > > >> ignoring
> >>>> > > >> > > the case of multiple failures, but I don't think that changes it
> >>>> > > >> > > much). So
> >>>> > > >> > > the savings from this feature is 0.3/365 = 0.08%. Say you
> >>>> have 1000
> >>>> > > >> > servers
> >>>> > > >> > > and they cost $3000/year fully loaded including power, the
> >>>> cost of
> >>>> > > >> the hw
> >>>> > > >> > > amortized over its life, etc. Then this feature saves you
> >>>> $3000 on
> >>>> > > >> your
> >>>> > > >> > > total server cost of $3m which seems not very worthwhile
> >>>> compared to
> >>>> > > >> > other
> >>>> > > >> > > optimizations...?
> >>>> > > >> > >
> >>>> > > >> > > Anyhow, not sure the arithmetic is right there, but I think that
> >>>> > > >> > > is the type of argument that would be helpful for thinking about
> >>>> > > >> > > the tradeoff in complexity.
> >>>> > > >> > >
> >>>> > > >> > > -Jay
> >>>> > > >> > >
> >>>> > > >> > >
> >>>> > > >> > >
> >>>> > > >> > > On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <
> >>>> lindon...@gmail.com>
> >>>> > > wrote:
> >>>> > > >> > >
> >>>> > > >> > > > Hey Sriram,
> >>>> > > >> > > >
> >>>> > > >> > > > Thanks for taking time to review the KIP. Please see
> below
> >>>> my
> >>>> > > >> answers
> >>>> > > >> > to
> >>>> > > >> > > > your questions:
> >>>> > > >> > > >
> >>>> > > >> > > > >1. Could you pick a hardware/Kafka configuration and go
> >>>> over what
> >>>> > > >> is
> >>>> > > >> > the
> >>>> > > >> > > > >average disk/partition repair/restore time that we are
> >>>> targeting
> >>>> > > >> for a
> >>>> > > >> > > > >typical JBOD setup?
> >>>> > > >> > > >
> >>>> > > >> > > > We currently don't have this data. I think the disk/partition
> >>>> > > >> > > > repair/restore time depends on the availability of hardware, the
> >>>> > > >> > > > response time of the site-reliability engineers, the amount of
> >>>> > > >> > > > data on the bad disk, etc. These vary between companies and even
> >>>> > > >> > > > between clusters within the same company, and it is probably
> >>>> > > >> > > > hard to determine what the average situation is.
> >>>> > > >> > > >
> >>>> > > >> > > > I am not very sure why we need this. Can you explain a
> bit
> >>>> why
> >>>> > > this
> >>>> > > >> > data
> >>>> > > >> > > is
> >>>> > > >> > > > useful to evaluate the motivation and design of this KIP?
> >>>> > > >> > > >
> >>>> > > >> > > > >2. How often do we believe disks are going to fail (in
> >>>> your
> >>>> > > example
> >>>> > > >> > > > >configuration) and what do we gain by avoiding the
> network
> >>>> > > overhead
> >>>> > > >> > and
> >>>> > > >> > > > >doing all the work of moving the replica within the
> >>>> broker to
> >>>> > > >> another
> >>>> > > >> > > disk
> >>>> > > >> > > > >instead of balancing it globally?
> >>>> > > >> > > >
> >>>> > > >> > > > I think the chance of disk failure depends mainly on the
> >>>> disk
> >>>> > > itself
> >>>> > > >> > > rather
> >>>> > > >> > > > than the broker configuration. I don't have this data
> now.
> >>>> I will
> >>>> > > >> ask
> >>>> > > >> > our
> >>>> > > >> > > > SRE whether they know the mean time to failure for our disks.
> >>>> > > >> > > > What I was told by our SREs is that disk failure is the most
> >>>> > > >> > > > common type of hardware failure.
> >>>> > > >> > > >
> >>>> > > >> > > > When there is a disk failure, I think it is reasonable to move
> >>>> > > >> > > > the replica to another broker instead of another disk on the
> >>>> > > >> > > > same broker. The reason we want to move replicas within the
> >>>> > > >> > > > broker is mainly to optimize the Kafka cluster performance when
> >>>> > > >> > > > we balance load across disks.
> >>>> > > >> > > >
> >>>> > > >> > > > In comparison to balancing replicas globally, the benefits of
> >>>> > > >> > > > moving replicas within the broker are that:
> >>>> > > >> > > >
> >>>> > > >> > > > 1) the movement is faster since it doesn't go through a socket
> >>>> > > >> > > > or rely on the available network bandwidth;
> >>>> > > >> > > > 2) there is much less impact on the replication traffic between
> >>>> > > >> > > > brokers, since it does not take up bandwidth between brokers.
> >>>> > > >> > > > Depending on the pattern of traffic, we may need to balance load
> >>>> > > >> > > > across disks frequently, and it is necessary to prevent this
> >>>> > > >> > > > operation from slowing down the existing operations (e.g.
> >>>> > > >> > > > produce, consume, replication) in the Kafka cluster;
> >>>> > > >> > > > 3) it gives us the opportunity to do automatic rebalancing
> >>>> > > >> > > > between disks on the same broker.
> >>>> > > >> > > >
> >>>> > > >> > > >
> >>>> > > >> > > > >3. Even if we had to move the replica within the broker,
> >>>> why
> >>>> > > >> cannot we
> >>>> > > >> > > > just
> >>>> > > >> > > > >treat it as another replica and have it go through the
> >>>> same
> >>>> > > >> > replication
> >>>> > > >> > > > >code path that we have today? The downside here is
> >>>> obviously that
> >>>> > > >> you
> >>>> > > >> > > need
> >>>> > > >> > > > >to catchup from the leader but it is completely free!
> >>>> What do we
> >>>> > > >> think
> >>>> > > >> > > is
> >>>> > > >> > > > >the impact of the network overhead in this case?
> >>>> > > >> > > >
> >>>> > > >> > > > Good point. My initial proposal actually used the existing
> >>>> > > >> > > > ReplicaFetcherThread (i.e. the existing code path) to move
> >>>> > > >> > > > replicas between disks. However, I switched to using a separate
> >>>> > > >> > > > thread pool after discussion with Jun and Becket.
> >>>> > > >> > > >
> >>>> > > >> > > > The main argument for using a separate thread pool is to
> >>>> > > >> > > > actually keep the design simple and easy to reason about. There
> >>>> > > >> > > > are a number of differences between inter-broker replication
> >>>> > > >> > > > and intra-broker replication which make it cleaner to do them
> >>>> > > >> > > > in separate code paths. I will list them below:
> >>>> > > >> > > >
> >>>> > > >> > > > - The throttling mechanism for inter-broker replication traffic
> >>>> > > >> > > > and intra-broker replication traffic is different. For example,
> >>>> > > >> > > > we may want to specify a per-topic quota for inter-broker
> >>>> > > >> > > > replication traffic because we may want some topics to be moved
> >>>> > > >> > > > faster than others. But we don't care about the priority of
> >>>> > > >> > > > topics for intra-broker movement. So the current proposal only
> >>>> > > >> > > > allows the user to specify a per-broker quota for intra-broker
> >>>> > > >> > > > replication traffic.
> >>>> > > >> > > >
> >>>> > > >> > > > - The quota value for inter-broker replication traffic and
> >>>> > > >> > > > intra-broker replication traffic is different. The available
> >>>> > > >> > > > bandwidth for intra-broker replication can probably be much
> >>>> > > >> > > > higher than the bandwidth for inter-broker replication.
> >>>> > > >> > > >
> >>>> > > >> > > > - The ReplicaFetcherThread is per broker. Intuitively, the
> >>>> > > >> > > > number of threads doing intra-broker data movement should be
> >>>> > > >> > > > related to the number of disks in the broker, not the number of
> >>>> > > >> > > > brokers in the cluster.
> >>>> > > >> > > >
> >>>> > > >> > > > - The leader replica has no ReplicaFetcherThread to start with.
> >>>> > > >> > > > It seems weird to start one just for intra-broker replication.
> >>>> > > >> > > >
> >>>> > > >> > > > Because of these differences, we think it is simpler to use a
> >>>> > > >> > > > separate thread pool and code path so that we can configure and
> >>>> > > >> > > > throttle them separately.
> >>>> > > >> > > >
> >>>> > > >> > > >
> >>>> > > >> > > > >4. What are the chances that we will be able to identify
> >>>> another
> >>>> > > >> disk
> >>>> > > >> > to
> >>>> > > >> > > > >balance within the broker instead of another disk on
> >>>> another
> >>>> > > >> broker?
> >>>> > > >> > If
> >>>> > > >> > > we
> >>>> > > >> > > > >have 100's of machines, the probability of finding a
> >>>> better
> >>>> > > >> balance by
> >>>> > > >> > > > >choosing another broker is much higher than balancing
> >>>> within the
> >>>> > > >> > broker.
> >>>> > > >> > > > >Could you add some info on how we are determining this?
> >>>> > > >> > > >
> >>>> > > >> > > > It is possible that we can find available space on a
> remote
> >>>> > > broker.
> >>>> > > >> The
> >>>> > > >> > > > benefit of allowing intra-broker replication is that, when
> >>>> > > >> > > > there is available space on both the current broker and a
> >>>> > > >> > > > remote broker, the rebalance can be completed faster with much
> >>>> > > >> > > > less impact on the inter-broker replication or the users'
> >>>> > > >> > > > traffic. It is about taking advantage of locality when
> >>>> > > >> > > > balancing the load.
> >>>> > > >> > > >
> >>>> > > >> > > > >5. Finally, in a cloud setup where more users are going
> to
> >>>> > > >> leverage a
> >>>> > > >> > > > >shared filesystem (example, EBS in AWS), all this change
> >>>> is not
> >>>> > > of
> >>>> > > >> > much
> >>>> > > >> > > > >gain since you don't need to balance between the volumes
> >>>> within
> >>>> > > the
> >>>> > > >> > same
> >>>> > > >> > > > >broker.
> >>>> > > >> > > >
> >>>> > > >> > > > You are right. This KIP-113 is useful only if the user uses
> >>>> > > >> > > > JBOD. If the user uses an extra storage layer of replication,
> >>>> > > >> > > > such as RAID-10 or EBS, they don't need KIP-112 or KIP-113.
> >>>> > > >> > > > Note that the user will replicate data more times than the
> >>>> > > >> > > > replication factor of the Kafka topic if an extra storage layer
> >>>> > > >> > > > of replication is used.
> >>>> > > >> > > >
> >>>> > > >> > >
> >>>> > > >> >
> >>>> > > >>
> >>>> > > >
> >>>> > > >
> >>>> > >
> >>>>
> >>>
> >>>
> >>
> >
>
