Hi all, I realized that we need a new API in AdminClient in order to use the new request/response added in KIP-113. Since this is required by KIP-113, I chose to add the new interface in this KIP instead of creating a new KIP.
The documentation of the new API in AdminClient can be found here <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories#KIP-113:Supportreplicasmovementbetweenlogdirectories-AdminClient>. Can you please review and comment if you have any concerns? Thanks! Dong On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com> wrote: > The protocol change has been updated in KIP-113 > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories> > . > > On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com> wrote: > >> Hi all, >> >> I have made a minor change to the DescribeDirsRequest so that users can >> choose to query the status for a specific list of partitions. This is a bit >> more fine-grained than the previous format that allowed users to query the >> status for a specific list of topics. I realized that querying the status >> of selected partitions can be useful to check whether the reassignment >> of the replicas to the specific log directories has been completed. >> >> I will assume this minor change is OK if there is no concern with it in >> the community :) >> >> Thanks, >> Dong >> >> >> >> On Mon, Jun 12, 2017 at 10:46 AM, Dong Lin <lindon...@gmail.com> wrote: >> >>> Hey Colin, >>> >>> Thanks for the suggestion. We have actually considered this and listed >>> it as the first future work in KIP-112 >>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD>. >>> The two advantages that you mentioned are exactly the motivation for this >>> feature. Also as you have mentioned, this involves the tradeoff between >>> disk performance and availability -- the more you distribute topics across >>> disks, the more topics will be offline due to a single disk failure. >>> >>> Even setting aside its complexity, it is not clear to me that the reduced rebalance >>> overhead is worth the reduction in availability. 
I am optimistic that the >>> rebalance overhead will not be that big a problem since we are not too >>> bothered by cross-broker rebalance as of now. >>> >>> Thanks, >>> Dong >>> >>> On Mon, Jun 12, 2017 at 10:36 AM, Colin McCabe <cmcc...@apache.org> >>> wrote: >>> >>>> Has anyone considered a scheme for sharding topic data across multiple >>>> disks? >>>> >>>> For example, if you sharded topics across 3 disks, and you had 10 disks, >>>> you could pick a different set of 3 disks for each topic. If you >>>> distribute them randomly then you have 10 choose 3 = 120 different >>>> combinations. You would probably never need rebalancing if you had a >>>> reasonable distribution of topic sizes (could probably prove this with a >>>> Monte Carlo or something). >>>> >>>> The disadvantage is that if one of the 3 disks fails, then you have to >>>> take the topic offline. But if we assume independent disk failure >>>> probabilities, the probability of failure with RAID 0 is 1 - >>>> Psuccess^(num_disks), whereas the probability of failure with this scheme >>>> is 1 - Psuccess^3. >>>> >>>> This addresses the biggest downsides of JBOD now: >>>> * limiting a topic to the size of a single disk limits scalability >>>> * the topic movement process is tricky to get right and involves "racing >>>> against producers" and wasted double I/Os >>>> >>>> Of course, one other question is how frequently we add new disk drives >>>> to an existing broker. In this case, you might reasonably want disk >>>> rebalancing to avoid overloading the new disk(s) with writes. >>>> >>>> cheers, >>>> Colin >>>> >>>> >>>> On Fri, Jun 9, 2017, at 18:46, Jun Rao wrote: >>>> > Just a few comments on this. >>>> > >>>> > 1. One of the issues with using RAID 0 is that a single disk failure >>>> > causes >>>> > a hard failure of the broker. 
Hard failure increases the >>>> unavailability >>>> > window for all the partitions on the failed broker, which includes the >>>> > failure detection time (tied to ZK session timeout right now) and >>>> leader >>>> > election time by the controller. If we support JBOD natively, when a >>>> > single >>>> > disk fails, only partitions on the failed disk will experience a hard >>>> > failure. The availability for partitions on the rest of the disks is >>>> not >>>> > affected. >>>> > >>>> > 2. For running things in the Cloud such as AWS: currently, each EBS >>>> > volume >>>> > has a throughput limit of about 300MB/sec. If you get an enhanced EC2 >>>> > instance, you can get 20Gb/sec network. To saturate the network, you >>>> may >>>> > need about 7 EBS volumes. So, being able to support JBOD in the Cloud >>>> is >>>> > still potentially useful. >>>> > >>>> > 3. On the benefit of balancing data across disks within the same >>>> broker. >>>> > Data imbalance can happen across brokers as well as across disks >>>> within >>>> > the >>>> > same broker. Balancing the data across disks within the broker has the >>>> > benefit of saving network bandwidth as Dong mentioned. So, if intra >>>> > broker >>>> > load balancing is possible, it's probably better to avoid the more >>>> > expensive inter broker load balancing. One of the reasons for disk >>>> > imbalance right now is that partitions within a broker are assigned to >>>> > disks just based on the partition count. So, it does seem possible for >>>> > disks to get imbalanced from time to time. If someone can share some >>>> > stats >>>> > for that in practice, that would be very helpful. >>>> > >>>> > Thanks, >>>> > >>>> > Jun >>>> > >>>> > >>>> > On Wed, Jun 7, 2017 at 2:30 PM, Dong Lin <lindon...@gmail.com> wrote: >>>> > >>>> > > Hey Sriram, >>>> > > >>>> > > I think there is one way to explain why the ability to move replicas >>>> between >>>> > > disks can save space. 
Let's say the load is distributed to disks >>>> > > independently of the broker. Sooner or later, the load imbalance will >>>> exceed >>>> > > a threshold and we will need to rebalance load across disks. Now the >>>> > > question is whether our rebalancing algorithm will be able to take >>>> > > advantage of locality by moving replicas between disks on the same >>>> broker. >>>> > > >>>> > > Say for a given disk, there is a 20% probability it is overloaded, a 20% >>>> > > probability it is underloaded, and a 60% probability its load is >>>> around the >>>> > > expected average load if the cluster is well balanced. Then for a >>>> broker of >>>> > > 10 disks, we would expect 2 disks to need in-bound replica movement, >>>> 2 disks >>>> > > to need out-bound replica movement, and 6 disks to need no >>>> replica >>>> > > movement. Thus we would expect KIP-113 to be useful since we will >>>> be able >>>> > > to move replicas from the two over-loaded disks to the two >>>> under-loaded >>>> > > disks on the same broker. Does this make sense? >>>> > > >>>> > > Thanks, >>>> > > Dong >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > On Wed, Jun 7, 2017 at 2:12 PM, Dong Lin <lindon...@gmail.com> >>>> wrote: >>>> > > >>>> > > > Hey Sriram, >>>> > > > >>>> > > > Thanks for raising these concerns. Let me answer these questions >>>> below: >>>> > > > >>>> > > > - The benefit of this additional complexity to move the data >>>> stored on a >>>> > > > disk within the broker is to avoid network bandwidth usage. >>>> Creating >>>> > > > a replica on another broker is less efficient than creating a replica >>>> on >>>> > > > another disk in the same broker IF there is actually a >>>> lightly-loaded disk >>>> > > on >>>> > > > the same broker. >>>> > > > >>>> > > > - In my opinion the rebalance algorithm would be this: 1) we balance >>>> the >>>> > > load >>>> > > > across brokers using the same algorithm we are using today. 
2) we >>>> balance >>>> > > > load across disks on a given broker using a greedy algorithm, i.e. >>>> move >>>> > > > replicas from overloaded disks to lightly loaded disks. The >>>> greedy >>>> > > > algorithm would only consider the capacity and replica size. We >>>> can >>>> > > improve >>>> > > > it to consider throughput in the future. >>>> > > > >>>> > > > - With 30 brokers each having 10 disks, using the rebalancing >>>> > > algorithm, >>>> > > > the chances of choosing disks within the broker can be high. >>>> There will >>>> > > > always be load imbalance across disks of the same broker for the >>>> same >>>> > > > reason that there will always be load imbalance across brokers. >>>> The >>>> > > > algorithm specified above will take advantage of the locality, >>>> i.e. first >>>> > > > balance load across disks of the same broker, and only balance >>>> across >>>> > > > brokers if some brokers are much more loaded than others. >>>> > > > >>>> > > > I think it is useful to note that the load imbalance across disks >>>> of the >>>> > > > same broker is independent of the load imbalance across brokers. >>>> Both are >>>> > > > guaranteed to happen in any Kafka cluster for the same reason, >>>> i.e. >>>> > > > variation in partition sizes. Say broker 1 has two disks that >>>> are 80% >>>> > > > loaded and 20% loaded, and broker 2 has two disks that are also >>>> 80% >>>> > > > loaded and 20% loaded. We can balance them without inter-broker traffic >>>> with >>>> > > > KIP-113. This is why I think KIP-113 can be very useful. >>>> > > > >>>> > > > Do these explanations sound reasonable? >>>> > > > >>>> > > > Thanks, >>>> > > > Dong >>>> > > > >>>> > > > >>>> > > > On Wed, Jun 7, 2017 at 1:33 PM, Sriram Subramanian < >>>> r...@confluent.io> >>>> > > > wrote: >>>> > > > >>>> > > >> Hey Dong, >>>> > > >> >>>> > > >> Thanks for the explanation. I don't think anyone is denying that >>>> we >>>> > > should >>>> > > >> rebalance at the disk level. 
I think it is important to restore >>>> the disk >>>> > > >> and not wait for disk replacement. There is also another benefit >>>> of >>>> > > doing >>>> > > >> that: you don't need to opt for hot-swap racks, which >>>> can >>>> > > save >>>> > > >> cost. >>>> > > >> >>>> > > >> The question here is what do you save by trying to add >>>> complexity to >>>> > > move >>>> > > >> the data stored on a disk within the broker? Why would you not >>>> simply >>>> > > >> create another replica on the disk that results in a balanced >>>> load >>>> > > across >>>> > > >> brokers and have it catch up? We are missing a few things here - >>>> > > >> 1. What would your data balancing algorithm be? Would it include >>>> just >>>> > > >> capacity or would it also consider throughput on disk to decide >>>> on the >>>> > > >> final >>>> > > >> location of a partition? >>>> > > >> 2. With 30 brokers each having 10 disks, using the >>>> rebalancing >>>> > > >> algorithm, the chances of choosing disks within the broker are >>>> going to >>>> > > be >>>> > > >> low. This probability further decreases with more brokers and >>>> disks. >>>> > > Given >>>> > > >> that, why are we trying to save network cost? How much would >>>> that saving >>>> > > >> be >>>> > > >> if you go that route? >>>> > > >> >>>> > > >> These questions are hard to answer without verifying >>>> empirically. >>>> > > >> My >>>> > > >> suggestion is to avoid doing premature optimization that >>>> brings >>>> > > >> added complexity to the code, and to treat inter- and intra-broker >>>> movements >>>> > > of >>>> > > >> partitions the same. Deploy the code, use it and see if it is an >>>> actual >>>> > > >> problem and whether you get great savings by avoiding the network route >>>> to move >>>> > > >> partitions within the same broker. If so, add this optimization. 
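[Editor's note: the two-step approach Dong describes above -- balance across brokers first, then greedily move replicas between disks within a broker -- can be sketched in a few lines. This is only an illustration of the greedy step under the stated simplification (consider only capacity and replica size); the function, disk names, and tolerance threshold are invented for illustration and are not the actual KIP-113 implementation.]

```python
# Illustrative sketch of the greedy intra-broker step: repeatedly move the
# largest replica that reduces the load spread from the most-loaded disk to
# the least-loaded disk on the same broker. Considers only replica sizes.

def greedy_intra_broker_rebalance(disks, tolerance=0.05):
    """disks: {disk: {replica: size}}. Mutates disks; returns the moves made."""
    moves = []
    for _ in range(100):  # safety bound for this sketch
        load = {d: sum(r.values()) for d, r in disks.items()}
        avg = sum(load.values()) / len(load)
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        if load[src] - avg <= avg * tolerance:
            break  # close enough to balanced
        cur_dev = max(load[src] - avg, avg - load[dst])
        best = None
        for replica, size in disks[src].items():
            # deviation of the src/dst pair if this replica were moved
            new_dev = max(abs(load[src] - size - avg), abs(load[dst] + size - avg))
            if new_dev < cur_dev and (best is None or size > best[1]):
                best = (replica, size)
        if best is None:
            break  # no single move improves the spread
        replica, size = best
        disks[dst][replica] = disks[src].pop(replica)
        moves.append((replica, src, dst))
    return moves

disks = {"disk1": {"t0-0": 40, "t0-1": 40}, "disk2": {"t1-0": 10, "t1-1": 10}}
print(greedy_intra_broker_rebalance(disks))  # two moves balance the disks 50/50
```

Note that the sketch stops as soon as no single move reduces the spread, which matches the "greedy, capacity and replica size only" description; throughput-aware scoring would slot into the `new_dev` computation.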
>>>> > > >> >>>> > > >> On Wed, Jun 7, 2017 at 1:03 PM, Dong Lin <lindon...@gmail.com> >>>> wrote: >>>> > > >> >>>> > > >> > Hey Jay, Sriram, >>>> > > >> > >>>> > > >> > Great point. If I understand you right, you are suggesting >>>> that we can >>>> > > >> > simply use RAID-0 so that the load can be evenly distributed >>>> across >>>> > > >> disks. >>>> > > >> > And even though a disk failure will bring down the entire >>>> broker, the >>>> > > >> > reduced availability as compared to using KIP-112 and KIP-113 >>>> may >>>> > > >> be >>>> > > >> > negligible. And it may be better to just accept the slightly >>>> reduced >>>> > > >> > availability instead of introducing the complexity from >>>> KIP-112 and >>>> > > >> > KIP-113. >>>> > > >> > >>>> > > >> > Let's assume the following: >>>> > > >> > >>>> > > >> > - There are 30 brokers in a cluster and each broker has 10 >>>> disks >>>> > > >> > - The replication factor is 3 and min.isr = 2. >>>> > > >> > - The annual disk failure rate is 2% according >>>> to this >>>> > > >> > <https://www.backblaze.com/blog/hard-drive-failure-rates-q1- >>>> 2017/> >>>> > > >> blog. >>>> > > >> > - It takes 3 days to replace a disk. 
>>>> > > >> > >>>> > > >> > Here is my calculation for the probability of data loss due to disk >>>> > > failure: >>>> > > >> > probability that a given disk fails in a given year: 2% >>>> > > >> > probability that a given disk is offline on a >>>> given day: >>>> > > >> 2% / >>>> > > >> > 365 * 3 >>>> > > >> > probability that a given broker is offline on a >>>> given day >>>> > > >> due >>>> > > >> > to disk failure: 2% / 365 * 3 * 10 >>>> > > >> > probability that some broker is offline on a given >>>> day due >>>> > > >> to >>>> > > >> > disk failure: 2% / 365 * 3 * 10 * 30 = 5% >>>> > > >> > probability that some three brokers are offline on a >>>> given >>>> > > day >>>> > > >> > due to disk failure: 5% * 5% * 5% = 0.0125% >>>> > > >> > probability of data loss due to disk failure: 0.0125% >>>> > > >> > >>>> > > >> > Here is my calculation for the probability of service >>>> unavailability due >>>> > > to >>>> > > >> > disk failure: >>>> > > >> > probability that a given disk fails in a given year: 2% >>>> > > >> > probability that a given disk is offline on a >>>> given day: >>>> > > >> 2% / >>>> > > >> > 365 * 3 >>>> > > >> > probability that a given broker is offline on a >>>> given day >>>> > > >> due >>>> > > >> > to disk failure: 2% / 365 * 3 * 10 >>>> > > >> > probability that some broker is offline on a given >>>> day due >>>> > > >> to >>>> > > >> > disk failure: 2% / 365 * 3 * 10 * 30 = 5% >>>> > > >> > probability that some two brokers are offline on a >>>> given day >>>> > > >> due >>>> > > >> > to disk failure: 5% * 5% = 0.25% >>>> > > >> > probability of unavailability due to disk failure: 0.25% >>>> > > >> > >>>> > > >> > Note that the unavailability due to disk failure will be >>>> unacceptably >>>> > > >> high >>>> > > >> > in this case. And the probability of data loss due to disk >>>> failure >>>> > > will >>>> > > >> be >>>> > > >> > higher than 0.01%. 
Neither is acceptable if Kafka is intended >>>> to >>>> > > achieve >>>> > > >> > four nines of availability. >>>> > > >> > >>>> > > >> > Thanks, >>>> > > >> > Dong >>>> > > >> > >>>> > > >> > >>>> > > >> > On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps <j...@confluent.io> >>>> wrote: >>>> > > >> > >>>> > > >> > > I think Ram's point is that in-place failure handling is pretty >>>> complicated, >>>> > > >> and >>>> > > >> > > since this is meant to be a cost-saving feature, we should >>>> construct an >>>> > > >> > argument >>>> > > >> > > for it grounded in data. >>>> > > >> > > >>>> > > >> > > Assume an annual failure rate of 1% (reasonable, but data is >>>> > > available >>>> > > >> > > online), and assume it takes 3 days to get the drive >>>> replaced. Say >>>> > > you >>>> > > >> > have >>>> > > >> > > 10 drives per server. Then the expected downtime for each >>>> server is >>>> > > >> > roughly >>>> > > >> > > 1% * 3 days * 10 = 0.3 days/year (this is slightly off since >>>> I'm >>>> > > >> ignoring >>>> > > >> > > the case of multiple failures, but I don't think that changes >>>> it >>>> > > >> much). So >>>> > > >> > > the savings from this feature is 0.3/365 = 0.08%. Say you >>>> have 1000 >>>> > > >> > servers >>>> > > >> > > and they cost $3000/year fully loaded including power, the >>>> cost of >>>> > > >> the hw >>>> > > >> > > amortized over its life, etc. Then this feature saves you >>>> $3000 on >>>> > > >> your >>>> > > >> > > total server cost of $3m which seems not very worthwhile >>>> compared to >>>> > > >> > other >>>> > > >> > > optimizations...? >>>> > > >> > > >>>> > > >> > > Anyhow, not sure the arithmetic is right there, but I think >>>> that is >>>> > > >> the >>>> > > >> > > type of argument that would be helpful to think about the >>>> tradeoff >>>> > > in >>>> > > >> > > complexity. 
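[Editor's note: both back-of-envelope calculations in this thread -- Dong's failure probabilities above and Jay's cost estimate -- are easy to redo in a few lines. The script below just restates the thread's own arithmetic under the same independence assumptions; it introduces no new data, and the variable names are illustrative.]

```python
# Re-deriving the back-of-envelope numbers from this thread (same
# assumptions as stated in the emails above; purely illustrative).

# Dong's setup: 30 brokers x 10 disks, 2% annual disk failure rate,
# 3 days to replace a disk, replication factor 3, min.isr = 2.
p_disk_offline = 0.02 / 365 * 3           # a given disk offline on a given day
p_broker_offline = p_disk_offline * 10    # a given broker (union bound over disks)
p_any_broker = p_broker_offline * 30      # some broker, ~5%
p_unavailable = p_any_broker ** 2         # two brokers down, ~0.25%
p_data_loss = p_any_broker ** 3           # three brokers down, ~0.0125%

# Jay's setup: 1% annual drive failure, 3-day replacement, 10 drives/server,
# 1000 servers at $3000/year fully loaded.
downtime_days_per_server = 0.01 * 3 * 10              # ~0.3 days/year
savings = downtime_days_per_server / 365 * 1000 * 3000  # ~$2.5k/year on $3M

print(f"{p_any_broker:.1%} {p_unavailable:.2%} {p_data_loss:.4%} ${savings:.0f}")
```

The exact values come out slightly under the rounded figures quoted in the thread (4.9% rather than 5%, about $2,466 rather than $3,000), which does not change either conclusion.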
>>>> > > >> > > >>>> > > >> > > -Jay >>>> > > >> > > >>>> > > >> > > >>>> > > >> > > >>>> > > >> > > On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin < >>>> lindon...@gmail.com> >>>> > > wrote: >>>> > > >> > > >>>> > > >> > > > Hey Sriram, >>>> > > >> > > > >>>> > > >> > > > Thanks for taking time to review the KIP. Please see below >>>> my >>>> > > >> answers >>>> > > >> > to >>>> > > >> > > > your questions: >>>> > > >> > > > >>>> > > >> > > > >1. Could you pick a hardware/Kafka configuration and go >>>> over what >>>> > > >> is >>>> > > >> > the >>>> > > >> > > > >average disk/partition repair/restore time that we are >>>> targeting >>>> > > >> for a >>>> > > >> > > > >typical JBOD setup? >>>> > > >> > > > >>>> > > >> > > > We currently don't have this data. I think the >>>> disk/partition >>>> > > >> > > repair/restore >>>> > > >> > > > time depends on the availability of hardware, the response >>>> time of >>>> > > >> > > > the site-reliability engineers, the amount of data on the bad >>>> disk etc. >>>> > > >> > These >>>> > > >> > > > vary between companies and even clusters within the same >>>> company, >>>> > > >> and it >>>> > > >> > > is >>>> > > >> > > > probably hard to determine the average situation. >>>> > > >> > > > >>>> > > >> > > > I am not very sure why we need this. Can you explain a bit >>>> why >>>> > > this >>>> > > >> > data >>>> > > >> > > is >>>> > > >> > > > useful to evaluate the motivation and design of this KIP? >>>> > > >> > > > >>>> > > >> > > > >2. How often do we believe disks are going to fail (in >>>> your >>>> > > example >>>> > > >> > > > >configuration) and what do we gain by avoiding the network >>>> > > overhead >>>> > > >> > and >>>> > > >> > > > >doing all the work of moving the replica within the >>>> broker to >>>> > > >> another >>>> > > >> > > disk >>>> > > >> > > > >instead of balancing it globally? 
>>>> > > >> > > > >>>> > > >> > > > I think the chance of disk failure depends mainly on the >>>> disk >>>> > > itself >>>> > > >> > > rather >>>> > > >> > > > than the broker configuration. I don't have this data now. >>>> I will >>>> > > >> ask >>>> > > >> > our >>>> > > >> > > > SRE whether they know the mean-time-to-fail for our disks. >>>> What I >>>> > > was >>>> > > >> > told >>>> > > >> > > > by SRE is that disk failure is the most common type of >>>> hardware >>>> > > >> > failure. >>>> > > >> > > > >>>> > > >> > > > When there is disk failure, I think it is reasonable to >>>> move the >>>> > > >> replica to >>>> > > >> > > > another broker instead of another disk on the same broker. >>>> The >>>> > > >> reason >>>> > > >> > we >>>> > > >> > > > want to move replicas within a broker is mainly to optimize >>>> the Kafka >>>> > > >> > > cluster >>>> > > >> > > > performance when we balance load across disks. >>>> > > >> > > > >>>> > > >> > > > In comparison to balancing replicas globally, the benefit >>>> of >>>> > > moving >>>> > > >> > > replicas >>>> > > >> > > > within a broker is that: >>>> > > >> > > > >>>> > > >> > > > 1) the movement is faster since it doesn't go through a >>>> socket or >>>> > > >> rely on >>>> > > >> > > the >>>> > > >> > > > available network bandwidth; >>>> > > >> > > > 2) there is much less impact on the replication traffic between >>>> brokers, since >>>> > > it >>>> > > >> > > avoids taking >>>> > > >> > > > up bandwidth between brokers. Depending on the pattern of >>>> traffic, >>>> > > >> we >>>> > > >> > may >>>> > > >> > > > need to balance load across disks frequently and it is >>>> necessary to >>>> > > >> > > prevent >>>> > > >> > > > this operation from slowing down the existing operations >>>> (e.g. >>>> > > >> produce, >>>> > > >> > > > consume, replication) in the Kafka cluster. >>>> > > >> > > > 3) It gives us the opportunity to do automatic rebalancing >>>> > > between >>>> > > >> > disks >>>> > > >> > > > on the same broker. 
>>>> > > >> > > > >>>> > > >> > > > >>>> > > >> > > > >3. Even if we had to move the replica within the broker, >>>> why >>>> > > >> cannot we >>>> > > >> > > > just >>>> > > >> > > > >treat it as another replica and have it go through the >>>> same >>>> > > >> > replication >>>> > > >> > > > >code path that we have today? The downside here is >>>> obviously that >>>> > > >> you >>>> > > >> > > need >>>> > > >> > > > >to catch up from the leader but it is completely free! >>>> What do we >>>> > > >> think >>>> > > >> > > is >>>> > > >> > > > >the impact of the network overhead in this case? >>>> > > >> > > > >>>> > > >> > > > Good point. My initial proposal actually used the existing >>>> > > >> > > > ReplicaFetcherThread (i.e. the existing code path) to move replicas >>>> > > >> > > between >>>> > > >> > > > disks. However, I switched to use a separate thread pool >>>> after >>>> > > >> discussion >>>> > > >> > > > with Jun and Becket. >>>> > > >> > > > >>>> > > >> > > > The main argument for using a separate thread pool is to >>>> actually >>>> > > keep >>>> > > >> > the >>>> > > >> > > > design simple and easy to reason about. There are a number >>>> of >>>> > > >> > differences >>>> > > >> > > > between inter-broker replication and intra-broker >>>> replication >>>> > > which >>>> > > >> > make >>>> > > >> > > > it cleaner to do them in separate code paths. I will list >>>> them >>>> > > below: >>>> > > >> > > > >>>> > > >> > > > - The throttling mechanism for inter-broker replication >>>> traffic >>>> > > and >>>> > > >> > > > intra-broker replication traffic is different. For >>>> example, we may >>>> > > >> want >>>> > > >> > > to >>>> > > >> > > > specify a per-topic quota for inter-broker replication >>>> traffic >>>> > > >> because we >>>> > > >> > > may >>>> > > >> > > > want some topics to be moved faster than others. But >>>> we don't >>>> > > >> care >>>> > > >> > > > about the priority of topics for intra-broker movement. 
So the >>>> current >>>> > > >> > > proposal >>>> > > >> > > > only allows users to specify a per-broker quota for >>>> intra-broker >>>> > > >> > replication >>>> > > >> > > > traffic. >>>> > > >> > > > >>>> > > >> > > > - The quota value for inter-broker replication traffic and >>>> > > >> intra-broker >>>> > > >> > > > replication traffic is different. The available bandwidth >>>> for >>>> > > >> > > intra-broker >>>> > > >> > > > replication can probably be much higher than the bandwidth >>>> for >>>> > > >> > > inter-broker >>>> > > >> > > > replication. >>>> > > >> > > > >>>> > > >> > > > - The ReplicaFetcherThread is per broker. Intuitively, the >>>> number of >>>> > > >> > > threads >>>> > > >> > > > doing intra-broker data movement should be related to the >>>> number >>>> > > of >>>> > > >> > disks >>>> > > >> > > > in the broker, not the number of brokers in the cluster. >>>> > > >> > > > >>>> > > >> > > > - The leader replica has no ReplicaFetcherThread to start >>>> with. It >>>> > > >> seems >>>> > > >> > > > weird to >>>> > > >> > > > start one just for intra-broker replication. >>>> > > >> > > > >>>> > > >> > > > Because of these differences, we think it is simpler to use >>>> a > > separate >>>> > > >> > > thread >>>> > > >> > > > pool and code path so that we can configure and throttle >>>> them >>>> > > >> > separately. >>>> > > >> > > > >>>> > > >> > > > >>>> > > >> > > > >4. What are the chances that we will be able to identify >>>> another >>>> > > >> disk >>>> > > >> > to >>>> > > >> > > > >balance within the broker instead of another disk on >>>> another >>>> > > >> broker? >>>> > > >> > If >>>> > > >> > > we >>>> > > >> > > > >have 100's of machines, the probability of finding a >>>> better >>>> > > >> balance by >>>> > > >> > > > >choosing another broker is much higher than balancing >>>> within the >>>> > > >> > broker. >>>> > > >> > > > >Could you add some info on how we are determining this? 
>>>> > > > >>>> > > > It is possible that we can find available space on a remote >>>> > > broker. >>>> The >>>> > > > benefit of allowing intra-broker replication is that, when >>>> there >>>> > > is >>>> > > > available space in both the current broker and a remote >>>> broker, >>>> > > the >>>> > > > rebalance can be completed faster with much less impact on >>>> the >>>> > > >> > > inter-broker >>>> > > > replication or the users' traffic. It is about taking >>>> advantage of >>>> > > >> > > locality >>>> > > > when balancing the load. >>>> > > > >>>> > > > >5. Finally, in a cloud setup where more users are going to >>>> > > >> leverage a >>>> > > > >shared filesystem (example, EBS in AWS), all this change >>>> is not >>>> > > of >>>> > > >> > much >>>> > > > >gain since you don't need to balance between the volumes >>>> within >>>> > > the >>>> > > >> > same >>>> > > > >broker. >>>> > > > >>>> > > > You are right. KIP-113 is useful only if the user uses >>>> JBOD. If >>>> > > >> the user >>>> > > uses >>>> > > > an extra storage layer of replication, such as RAID-10 or >>>> EBS, >>>> > > they >>>> > > >> > don't >>>> > > > need KIP-112 or KIP-113. Note that the user will replicate >>>> data more >>>> > > >> times >>>> > > than >>>> > > > the replication factor of the Kafka topic if an extra >>>> storage >>>> > > layer >>>> > > >> of >>>> > > > replication is used.
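[Editor's note: as a footnote to Colin's sharding suggestion earlier in the thread, the "could probably prove this with a Monte Carlo" step might look roughly like the sketch below. Only the 10-choose-3 = 120 figure comes from the thread; the topic-size distribution and function name are invented for illustration.]

```python
import math
import random

# Monte Carlo sketch of Colin's idea: each topic is sharded across 3 of 10
# disks picked at random; measure how uneven the per-disk load ends up.
# The Pareto topic-size distribution is a made-up stand-in for real sizes.

def max_to_mean_load(num_disks=10, shards=3, num_topics=100, seed=0):
    rng = random.Random(seed)
    load = [0.0] * num_disks
    for _ in range(num_topics):
        size = rng.paretovariate(2.0)  # skewed sizes: a few large topics
        for disk in rng.sample(range(num_disks), shards):
            load[disk] += size / shards
    return max(load) / (sum(load) / num_disks)

print(math.comb(10, 3))              # 120 possible disk sets, as Colin notes
print(round(max_to_mean_load(), 2))  # close to 1.0 means well balanced
```

Sweeping `num_topics` and the Pareto shape parameter would show under which topic-size distributions random sharding stays balanced without any rebalancing, which is the claim Colin suggests verifying.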