Hi Dong,

The reason I thought this would be useful is that it seems likely to me that people will want to write tools to help them generate allocations. If, as you say, all the brokers have the same number of disks of the same size, then it's not too difficult to tell the tool the size of the disks. But if they're not the same, then using the tool becomes a lot harder. Obviously, if the size of each disk is included in the DescribeDirsResponse, then you can literally just point the tool at the cluster.
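To make that concrete, the kind of check I'd expect such a tool to perform is roughly the sketch below. The types are hypothetical stand-ins for whatever the DescribeDirs API ends up exposing, not a real Kafka interface:

import java.util.Map;

// Hypothetical stand-ins for whatever the DescribeDirs API ends up returning;
// the only point is that the tool needs per-disk capacity next to the replica sizes.
public class AssignmentCheck {

    record LogDirInfo(String path, long capacityBytes, Map<String, Long> replicaSizeBytes) {

        long usedBytes() {
            long sum = 0;
            for (long size : replicaSizeBytes.values()) sum += size;
            return sum;
        }

        long freeBytes() {
            return capacityBytes - usedBytes();
        }
    }

    // Would moving a replica of the given size onto this log dir even fit?
    static boolean fits(LogDirInfo dir, long replicaSizeBytes) {
        return replicaSizeBytes <= dir.freeBytes();
    }

    public static void main(String[] args) {
        LogDirInfo d = new LogDirInfo("/data/d1", 1_000_000_000_000L,
                Map.of("topicA-0", 400_000_000_000L, "topicB-3", 250_000_000_000L));
        System.out.println(fits(d, 500_000_000_000L)); // false: only ~350 GB free
    }
}

With the capacity in the response, ruling out impossible assignments like that needs nothing beyond the response itself.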
On the other hand, it seems likely that tools might also want to take into account other things when trying to find a good assignment (per-device IO for example) between the disks on a broker, so maybe including the total disk capacity is only of limited use. Cheers, Tom On 7 August 2017 at 17:54, Dong Lin <lindon...@gmail.com> wrote: > Hey Tom, > > Good question. We have actually considered having DescribeDirsResponse > provide the capacity of each disk as well. This was not included because we > believe Kafka cluster admin will always configure all brokers with same > number of disks of the same size. This is because it is generally easier to > manager a homogeneous cluster. If this is not the case then I think we > should include this information in the response. > > Thanks, > Dong > > > On Mon, Aug 7, 2017 at 3:44 AM, Tom Bentley <t.j.bent...@gmail.com> wrote: > > > Hi Dong, > > > > Your comments on KIP-179 prompted me to look at KIP-113, and I have a > > question: > > > > AFAICS the DescribeDirsResponse (via ReplicaInfo) can be used to get the > > size of a partition on a disk, but I don't see a mechanism for knowing > the > > total capacity of a disk (and/or the free capacity of a disk). That would > > be very useful information to have to help figure out that certain > > assignments are impossible, for instance. Is there a reason you've left > > this out? > > > > Cheers, > > > > Tom > > > > On 4 August 2017 at 18:47, Dong Lin <lindon...@gmail.com> wrote: > > > > > Hey Ismael, > > > > > > Thanks for the comments! Here are my answers: > > > > > > 1. Yes it has been considered. Here are the reasons why we don't do it > > > through controller. > > > > > > - There can be use-cases where we only want to rebalance the load of > log > > > directories on a given broker. It seems unnecessary to go through > > > controller in this case. > > > > > > - If controller is responsible for sending ChangeReplicaDirRequest, > and > > if > > > the user-specified log directory is either invalid or offline, then > > > controller probably needs a way to tell user that the partition > > > reassignment has failed. We currently don't have a way to do this since > > > kafka-reassign-partition.sh simply creates the reassignment znode > without > > > waiting for response. I am not sure that is a good solution to this. > > > > > > - If controller is responsible for sending ChangeReplicaDirRequest, the > > > controller logic would be more complicated because controller needs to > > > first send ChangeReplicaRequest so that the broker memorize the > partition > > > -> log directory mapping, send LeaderAndIsrRequest, and keep sending > > > ChangeReplicaDirRequest (just in case broker restarted) until replica > is > > > created. Note that the last step needs repeat and timeout as the > proposed > > > in the KIP-113. > > > > > > Overall I think this adds quite a bit complexity to controller and we > > > probably want to do this only if there is strong clear of doing so. > > > Currently in KIP-113 the kafka-reassign-partitions.sh is responsible > for > > > sending ChangeReplicaDirRequest with repeat and provides error to user > if > > > it either fails or timeout. It seems to be much simpler and user > > shouldn't > > > care whether it is done through controller. > > > > > > And thanks for the suggestion. I will add this to the Rejected > > Alternative > > > Section in the KIP-113. 
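For illustration, the repeat-and-timeout behaviour described above could be sketched roughly as below; sendChangeReplicaDir() and replicaCreatedInDir() are hypothetical placeholders rather than the actual reassignment tool code:

import java.time.Duration;
import java.time.Instant;

// Rough sketch of the "repeat with timeout" behaviour described above.
// sendChangeReplicaDir() and replicaCreatedInDir() are hypothetical placeholders
// for whatever the tool ends up calling; only the loop shape matters here.
public class ChangeReplicaDirRetry {

    interface DirMover {
        void sendChangeReplicaDir(String topicPartition, int brokerId, String logDir) throws Exception;
        boolean replicaCreatedInDir(String topicPartition, int brokerId, String logDir) throws Exception;
    }

    static void moveWithRetry(DirMover mover, String tp, int brokerId, String logDir,
                              Duration timeout, Duration retryInterval) throws Exception {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            // Re-send on every attempt in case the broker restarted and forgot the mapping.
            mover.sendChangeReplicaDir(tp, brokerId, logDir);
            if (mover.replicaCreatedInDir(tp, brokerId, logDir)) {
                return; // replica now exists in the requested log directory
            }
            Thread.sleep(retryInterval.toMillis());
        }
        throw new RuntimeException("Timed out waiting for " + tp + " to be created in " + logDir
                + " on broker " + brokerId);
    }
}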
> > > > > > 2) I think user needs to be able to specify different log directories > for > > > the replicas of the same partition in order to rebalance load across > log > > > directories of all brokers. I am not sure I understand the question. > Can > > > you explain a bit more why "that the log directory has to be the same > for > > > all replicas of a given partition"? > > > > > > 3) Good point. I think the alterReplicaDir is a better than > > > changeReplicaDir for the reason you provided. I will also update names > of > > > the request/response as well in the KIP. > > > > > > > > > Thanks, > > > Dong > > > > > > On Fri, Aug 4, 2017 at 9:49 AM, Ismael Juma <ism...@juma.me.uk> wrote: > > > > > > > Thanks Dong. I have a few initial questions, sorry if I it has been > > > > discussed and I missed it. > > > > > > > > 1. The KIP suggests that the reassignment tool is responsible for > > sending > > > > the ChangeReplicaDirRequests to the relevant brokers. I had imagined > > that > > > > this would be done by the Controller, like the rest of the > reassignment > > > > process. Was this considered? If so, it would be good to include the > > > > details of why it was rejected in the "Rejected Alternatives" > section. > > > > > > > > 2. The reassignment JSON format was extended so that one can choose > the > > > log > > > > directory for a partition. This means that the log directory has to > be > > > the > > > > same for all replicas of a given partition. The alternative would be > > for > > > > the log dir to be assignable for each replica. Similar to the other > > > > question, it would be good to have a section in "Rejected > Alternatives" > > > for > > > > this approach. It's generally very helpful to have more information > on > > > the > > > > rationale for the design choices that were made and rejected. > > > > > > > > 3. Should changeReplicaDir be alterReplicaDir? We have used `alter` > for > > > > other methods. > > > > > > > > Thanks, > > > > Ismael > > > > > > > > > > > > > > > > > > > > On Fri, Aug 4, 2017 at 5:37 AM, Dong Lin <lindon...@gmail.com> > wrote: > > > > > > > > > Hi all, > > > > > > > > > > I realized that we need new API in AdminClient in order to use the > > new > > > > > request/response added in KIP-113. Since this is required by > > KIP-113, I > > > > > choose to add the new interface in this KIP instead of creating a > new > > > > KIP. > > > > > > > > > > The documentation of the new API in AdminClient can be found here > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > > > 113%3A+Support+replicas+movement+between+log+directories#KIP-113: > > > > > Supportreplicasmovementbetweenlogdirectories-AdminClient>. > > > > > Can you please review and comment if you have any concern? > > > > > > > > > > Thanks! > > > > > Dong > > > > > > > > > > > > > > > > > > > > On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com> > > > wrote: > > > > > > > > > > > The protocol change has been updated in KIP-113 > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > > > 113%3A+Support+replicas+movement+between+log+directories> > > > > > > . > > > > > > > > > > > > On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com> > > > > wrote: > > > > > > > > > > > >> Hi all, > > > > > >> > > > > > >> I have made a minor change to the DescribeDirsRequest so that > user > > > can > > > > > >> choose to query the status for a specific list of partitions. 
> This > > > is > > > > a > > > > > bit > > > > > >> more fine-granular than the previous format that allows user to > > > query > > > > > the > > > > > >> status for a specific list of topics. I realized that querying > the > > > > > status > > > > > >> of selected partitions can be useful to check the whether the > > > > > reassignment > > > > > >> of the replicas to the specific log directories has been > > completed. > > > > > >> > > > > > >> I will assume this minor change is OK if there is no concern > with > > it > > > > in > > > > > >> the community :) > > > > > >> > > > > > >> Thanks, > > > > > >> Dong > > > > > >> > > > > > >> > > > > > >> > > > > > >> On Mon, Jun 12, 2017 at 10:46 AM, Dong Lin <lindon...@gmail.com > > > > > > wrote: > > > > > >> > > > > > >>> Hey Colin, > > > > > >>> > > > > > >>> Thanks for the suggestion. We have actually considered this and > > > list > > > > > >>> this as the first future work in KIP-112 > > > > > >>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > > > 112%3A+Handle+disk+failure+for+JBOD>. > > > > > >>> The two advantages that you mentioned are exactly the > motivation > > > for > > > > > this > > > > > >>> feature. Also as you have mentioned, this involves the tradeoff > > > > between > > > > > >>> disk performance and availability -- the more you distribute > > topic > > > > > across > > > > > >>> disks, the more topics will be offline due to a single disk > > > failure. > > > > > >>> > > > > > >>> Despite its complexity, it is not clear to me that the reduced > > > > > rebalance > > > > > >>> overhead is worth the reduction in availability. I am > optimistic > > > that > > > > > the > > > > > >>> rebalance overhead will not be that a big problem since we are > > not > > > > too > > > > > >>> bothered by cross-broker rebalance as of now. > > > > > >>> > > > > > >>> Thanks, > > > > > >>> Dong > > > > > >>> > > > > > >>> On Mon, Jun 12, 2017 at 10:36 AM, Colin McCabe < > > cmcc...@apache.org > > > > > > > > > >>> wrote: > > > > > >>> > > > > > >>>> Has anyone considered a scheme for sharding topic data across > > > > multiple > > > > > >>>> disks? > > > > > >>>> > > > > > >>>> For example, if you sharded topics across 3 disks, and you had > > 10 > > > > > disks, > > > > > >>>> you could pick a different set of 3 disks for each topic. If > > you > > > > > >>>> distribute them randomly then you have 10 choose 3 = 120 > > different > > > > > >>>> combinations. You would probably never need rebalancing if > you > > > had > > > > a > > > > > >>>> reasonable distribution of topic sizes (could probably prove > > this > > > > > with a > > > > > >>>> Monte Carlo or something). > > > > > >>>> > > > > > >>>> The disadvantage is that if one of the 3 disks fails, then you > > > have > > > > to > > > > > >>>> take the topic offline. But if we assume independent disk > > failure > > > > > >>>> probabilities, probability of failure with RAID 0 is: 1 - > > > > > >>>> Psuccess^(num_disks) whereas the probability of failure with > > this > > > > > scheme > > > > > >>>> is 1 - Psuccess ^ 3. 
> > > > > >>>> > > > > > >>>> This addresses the biggest downsides of JBOD now: > > > > > >>>> * limiting a topic to the size of a single disk limits > > scalability > > > > > >>>> * the topic movement process is tricky to get right and > involves > > > > > "racing > > > > > >>>> against producers" and wasted double I/Os > > > > > >>>> > > > > > >>>> Of course, one other question is how frequently we add new > disk > > > > drives > > > > > >>>> to an existing broker. In this case, you might reasonably > want > > > disk > > > > > >>>> rebalancing to avoid overloading the new disk(s) with writes. > > > > > >>>> > > > > > >>>> cheers, > > > > > >>>> Colin > > > > > >>>> > > > > > >>>> > > > > > >>>> On Fri, Jun 9, 2017, at 18:46, Jun Rao wrote: > > > > > >>>> > Just a few comments on this. > > > > > >>>> > > > > > > >>>> > 1. One of the issues with using RAID 0 is that a single disk > > > > failure > > > > > >>>> > causes > > > > > >>>> > a hard failure of the broker. Hard failure increases the > > > > > >>>> unavailability > > > > > >>>> > window for all the partitions on the failed broker, which > > > includes > > > > > the > > > > > >>>> > failure detection time (tied to ZK session timeout right > now) > > > and > > > > > >>>> leader > > > > > >>>> > election time by the controller. If we support JBOD > natively, > > > > when a > > > > > >>>> > single > > > > > >>>> > disk fails, only partitions on the failed disk will > > experience a > > > > > hard > > > > > >>>> > failure. The availability for partitions on the rest of the > > > disks > > > > > are > > > > > >>>> not > > > > > >>>> > affected. > > > > > >>>> > > > > > > >>>> > 2. For running things on the Cloud such as AWS. Currently, > > each > > > > EBS > > > > > >>>> > volume > > > > > >>>> > has a throughout limit of about 300MB/sec. If you get an > > > enhanced > > > > > EC2 > > > > > >>>> > instance, you can get 20Gb/sec network. To saturate the > > network, > > > > you > > > > > >>>> may > > > > > >>>> > need about 7 EBS volumes. So, being able to support JBOD in > > the > > > > > Cloud > > > > > >>>> is > > > > > >>>> > still potentially useful. > > > > > >>>> > > > > > > >>>> > 3. On the benefit of balancing data across disks within the > > same > > > > > >>>> broker. > > > > > >>>> > Data imbalance can happen across brokers as well as across > > disks > > > > > >>>> within > > > > > >>>> > the > > > > > >>>> > same broker. Balancing the data across disks within the > broker > > > has > > > > > the > > > > > >>>> > benefit of saving network bandwidth as Dong mentioned. So, > if > > > > intra > > > > > >>>> > broker > > > > > >>>> > load balancing is possible, it's probably better to avoid > the > > > more > > > > > >>>> > expensive inter broker load balancing. One of the reasons > for > > > disk > > > > > >>>> > imbalance right now is that partitions within a broker are > > > > assigned > > > > > to > > > > > >>>> > disks just based on the partition count. So, it does seem > > > possible > > > > > for > > > > > >>>> > disks to get imbalanced from time to time. If someone can > > share > > > > some > > > > > >>>> > stats > > > > > >>>> > for that in practice, that will be very helpful. 
> > > > > >>>> > > > > > > >>>> > Thanks, > > > > > >>>> > > > > > > >>>> > Jun > > > > > >>>> > > > > > > >>>> > > > > > > >>>> > On Wed, Jun 7, 2017 at 2:30 PM, Dong Lin < > lindon...@gmail.com > > > > > > > > wrote: > > > > > >>>> > > > > > > >>>> > > Hey Sriram, > > > > > >>>> > > > > > > > >>>> > > I think there is one way to explain why the ability to > move > > > > > replica > > > > > >>>> between > > > > > >>>> > > disks can save space. Let's say the load is distributed to > > > disks > > > > > >>>> > > independent of the broker. Sooner or later, the load > > imbalance > > > > > will > > > > > >>>> exceed > > > > > >>>> > > a threshold and we will need to rebalance load across > disks. > > > Now > > > > > our > > > > > >>>> > > questions is whether our rebalancing algorithm will be > able > > to > > > > > take > > > > > >>>> > > advantage of locality by moving replicas between disks on > > the > > > > same > > > > > >>>> broker. > > > > > >>>> > > > > > > > >>>> > > Say for a given disk, there is 20% probability it is > > > overloaded, > > > > > 20% > > > > > >>>> > > probability it is underloaded, and 60% probability its > load > > is > > > > > >>>> around the > > > > > >>>> > > expected average load if the cluster is well balanced. > Then > > > for > > > > a > > > > > >>>> broker of > > > > > >>>> > > 10 disks, we would 2 disks need to have in-bound replica > > > > movement, > > > > > >>>> 2 disks > > > > > >>>> > > need to have out-bound replica movement, and 6 disks do > not > > > need > > > > > >>>> replica > > > > > >>>> > > movement. Thus we would expect KIP-113 to be useful since > we > > > > will > > > > > >>>> be able > > > > > >>>> > > to move replica from the two over-loaded disks to the two > > > > > >>>> under-loaded > > > > > >>>> > > disks on the same broKER. Does this make sense? > > > > > >>>> > > > > > > > >>>> > > Thanks, > > > > > >>>> > > Dong > > > > > >>>> > > > > > > > >>>> > > > > > > > >>>> > > > > > > > >>>> > > > > > > > >>>> > > > > > > > >>>> > > > > > > > >>>> > > On Wed, Jun 7, 2017 at 2:12 PM, Dong Lin < > > lindon...@gmail.com > > > > > > > > > >>>> wrote: > > > > > >>>> > > > > > > > >>>> > > > Hey Sriram, > > > > > >>>> > > > > > > > > >>>> > > > Thanks for raising these concerns. Let me answer these > > > > questions > > > > > >>>> below: > > > > > >>>> > > > > > > > > >>>> > > > - The benefit of those additional complexity to move the > > > data > > > > > >>>> stored on a > > > > > >>>> > > > disk within the broker is to avoid network bandwidth > > usage. > > > > > >>>> Creating > > > > > >>>> > > > replica on another broker is less efficient than > creating > > > > > replica > > > > > >>>> on > > > > > >>>> > > > another disk in the same broker IF there is actually > > > > > >>>> lightly-loaded disk > > > > > >>>> > > on > > > > > >>>> > > > the same broker. > > > > > >>>> > > > > > > > > >>>> > > > - In my opinion the rebalance algorithm would this: 1) > we > > > > > balance > > > > > >>>> the > > > > > >>>> > > load > > > > > >>>> > > > across brokers using the same algorithm we are using > > today. > > > 2) > > > > > we > > > > > >>>> balance > > > > > >>>> > > > load across disk on a given broker using a greedy > > algorithm, > > > > > i.e. > > > > > >>>> move > > > > > >>>> > > > replica from the overloaded disk to lightly loaded disk. > > The > > > > > >>>> greedy > > > > > >>>> > > > algorithm would only consider the capacity and replica > > size. 
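For illustration, a minimal sketch of that per-broker greedy pass, considering only replica sizes as described; the data structures are hypothetical and this is not the actual KIP-113 tooling:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the per-broker greedy pass described above: repeatedly move the
// largest replica on the most loaded disk to the least loaded disk, stopping once no
// move would shrink the imbalance. Only capacity/replica size is considered here.
public class IntraBrokerGreedy {

    // diskLoads: logDir -> (replica -> size in bytes). Returns moves as [replica, fromDir, toDir].
    static List<String[]> planMoves(Map<String, Map<String, Long>> diskLoads) {
        Map<String, Map<String, Long>> disks = new HashMap<>();
        diskLoads.forEach((dir, replicas) -> disks.put(dir, new HashMap<>(replicas)));
        List<String[]> moves = new ArrayList<>();
        if (disks.size() < 2) return moves;

        while (true) {
            String hottest = null, coldest = null;
            for (String dir : disks.keySet()) {
                if (hottest == null || total(disks.get(dir)) > total(disks.get(hottest))) hottest = dir;
                if (coldest == null || total(disks.get(dir)) < total(disks.get(coldest))) coldest = dir;
            }
            long gap = total(disks.get(hottest)) - total(disks.get(coldest));

            // Largest replica on the hottest disk whose move still reduces the imbalance.
            String best = null;
            for (Map.Entry<String, Long> e : disks.get(hottest).entrySet()) {
                if (e.getValue() * 2 < gap && (best == null || e.getValue() > disks.get(hottest).get(best))) {
                    best = e.getKey();
                }
            }
            if (best == null) return moves;

            long size = disks.get(hottest).remove(best);
            disks.get(coldest).put(best, size);
            moves.add(new String[]{best, hottest, coldest});
        }
    }

    static long total(Map<String, Long> replicas) {
        long sum = 0;
        for (long s : replicas.values()) sum += s;
        return sum;
    }
}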
> > > > We > > > > > >>>> can > > > > > >>>> > > improve > > > > > >>>> > > > it to consider throughput in the future. > > > > > >>>> > > > > > > > > >>>> > > > - With 30 brokers with each having 10 disks, using the > > > > > rebalancing > > > > > >>>> > > algorithm, > > > > > >>>> > > > the chances of choosing disks within the broker can be > > high. > > > > > >>>> There will > > > > > >>>> > > > always be load imbalance across disks of the same broker > > for > > > > the > > > > > >>>> same > > > > > >>>> > > > reason that there will always be load imbalance across > > > > brokers. > > > > > >>>> The > > > > > >>>> > > > algorithm specified above will take advantage of the > > > locality, > > > > > >>>> i.e. first > > > > > >>>> > > > balance load across disks of the same broker, and only > > > balance > > > > > >>>> across > > > > > >>>> > > > brokers if some brokers are much more loaded than > others. > > > > > >>>> > > > > > > > > >>>> > > > I think it is useful to note that the load imbalance > > across > > > > > disks > > > > > >>>> of the > > > > > >>>> > > > same broker is independent of the load imbalance across > > > > brokers. > > > > > >>>> Both are > > > > > >>>> > > > guaranteed to happen in any Kafka cluster for the same > > > reason, > > > > > >>>> i.e. > > > > > >>>> > > > variation in the partition size. Say broker 1 have two > > disks > > > > > that > > > > > >>>> are 80% > > > > > >>>> > > > loaded and 20% loaded. And broker 2 have two disks that > > are > > > > also > > > > > >>>> 80% > > > > > >>>> > > > loaded and 20%. We can balance them without inter-broker > > > > traffic > > > > > >>>> with > > > > > >>>> > > > KIP-113. This is why I think KIP-113 can be very > useful. > > > > > >>>> > > > > > > > > >>>> > > > Do these explanation sound reasonable? > > > > > >>>> > > > > > > > > >>>> > > > Thanks, > > > > > >>>> > > > Dong > > > > > >>>> > > > > > > > > >>>> > > > > > > > > >>>> > > > On Wed, Jun 7, 2017 at 1:33 PM, Sriram Subramanian < > > > > > >>>> r...@confluent.io> > > > > > >>>> > > > wrote: > > > > > >>>> > > > > > > > > >>>> > > >> Hey Dong, > > > > > >>>> > > >> > > > > > >>>> > > >> Thanks for the explanation. I don't think anyone is > > denying > > > > > that > > > > > >>>> we > > > > > >>>> > > should > > > > > >>>> > > >> rebalance at the disk level. I think it is important to > > > > restore > > > > > >>>> the disk > > > > > >>>> > > >> and not wait for disk replacement. There are also other > > > > > benefits > > > > > >>>> of > > > > > >>>> > > doing > > > > > >>>> > > >> that which is that you don't need to opt for hot swap > > racks > > > > > that > > > > > >>>> can > > > > > >>>> > > save > > > > > >>>> > > >> cost. > > > > > >>>> > > >> > > > > > >>>> > > >> The question here is what do you save by trying to add > > > > > >>>> complexity to > > > > > >>>> > > move > > > > > >>>> > > >> the data stored on a disk within the broker? Why would > > you > > > > not > > > > > >>>> simply > > > > > >>>> > > >> create another replica on the disk that results in a > > > balanced > > > > > >>>> load > > > > > >>>> > > across > > > > > >>>> > > >> brokers and have it catch up. We are missing a few > things > > > > here > > > > > - > > > > > >>>> > > >> 1. What would your data balancing algorithm be? Would > it > > > > > include > > > > > >>>> just > > > > > >>>> > > >> capacity or will it also consider throughput on disk to > > > > decide > > > > > >>>> on the > > > > > >>>> > > >> final > > > > > >>>> > > >> location of a partition? 
> > > > > >>>> > > >> 2. With 30 brokers with each having 10 disks, using the > > > > > >>>> rebalancing > > > > > >>>> > > >> algorithm, the chances of choosing disks within the > > broker > > > is > > > > > >>>> going to > > > > > >>>> > > be > > > > > >>>> > > >> low. This probability further decreases with more > brokers > > > and > > > > > >>>> disks. > > > > > >>>> > > Given > > > > > >>>> > > >> that, why are we trying to save network cost? How much > > > would > > > > > >>>> that saving > > > > > >>>> > > >> be > > > > > >>>> > > >> if you go that route? > > > > > >>>> > > >> > > > > > >>>> > > >> These questions are hard to answer without having to > > verify > > > > > >>>> empirically. > > > > > >>>> > > >> My > > > > > >>>> > > >> suggestion is to avoid doing pre matured optimization > > that > > > > > >>>> brings in the > > > > > >>>> > > >> added complexity to the code and treat inter and intra > > > broker > > > > > >>>> movements > > > > > >>>> > > of > > > > > >>>> > > >> partition the same. Deploy the code, use it and see if > it > > > is > > > > an > > > > > >>>> actual > > > > > >>>> > > >> problem and you get great savings by avoiding the > network > > > > route > > > > > >>>> to move > > > > > >>>> > > >> partitions within the same broker. If so, add this > > > > > optimization. > > > > > >>>> > > >> > > > > > >>>> > > >> On Wed, Jun 7, 2017 at 1:03 PM, Dong Lin < > > > > lindon...@gmail.com> > > > > > >>>> wrote: > > > > > >>>> > > >> > > > > > >>>> > > >> > Hey Jay, Sriram, > > > > > >>>> > > >> > > > > > > >>>> > > >> > Great point. If I understand you right, you are > > > suggesting > > > > > >>>> that we can > > > > > >>>> > > >> > simply use RAID-0 so that the load can be evenly > > > > distributed > > > > > >>>> across > > > > > >>>> > > >> disks. > > > > > >>>> > > >> > And even though a disk failure will bring down the > > enter > > > > > >>>> broker, the > > > > > >>>> > > >> > reduced availability as compared to using KIP-112 and > > > > KIP-113 > > > > > >>>> will may > > > > > >>>> > > >> be > > > > > >>>> > > >> > negligible. And it may be better to just accept the > > > > slightly > > > > > >>>> reduced > > > > > >>>> > > >> > availability instead of introducing the complexity > from > > > > > >>>> KIP-112 and > > > > > >>>> > > >> > KIP-113. > > > > > >>>> > > >> > > > > > > >>>> > > >> > Let's assume the following: > > > > > >>>> > > >> > > > > > > >>>> > > >> > - There are 30 brokers in a cluster and each broker > has > > > 10 > > > > > >>>> disks > > > > > >>>> > > >> > - The replication factor is 3 and min.isr = 2. > > > > > >>>> > > >> > - The probability of annual disk failure rate is 2% > > > > according > > > > > >>>> to this > > > > > >>>> > > >> > <https://www.backblaze.com/blo > > > > g/hard-drive-failure-rates-q1- > > > > > >>>> 2017/> > > > > > >>>> > > >> blog. > > > > > >>>> > > >> > - It takes 3 days to replace a disk. 
> > > > > >>>> > > >> > > > > > > >>>> > > >> > Here is my calculation for probability of data loss > due > > > to > > > > > disk > > > > > >>>> > > failure: > > > > > >>>> > > >> > probability of a given disk fails in a given year: 2% > > > > > >>>> > > >> > probability of a given disk stays offline for one day > > in > > > a > > > > > >>>> given day: > > > > > >>>> > > >> 2% / > > > > > >>>> > > >> > 365 * 3 > > > > > >>>> > > >> > probability of a given broker stays offline for one > day > > > in > > > > a > > > > > >>>> given day > > > > > >>>> > > >> due > > > > > >>>> > > >> > to disk failure: 2% / 365 * 3 * 10 > > > > > >>>> > > >> > probability of any broker stays offline for one day > in > > a > > > > > given > > > > > >>>> day due > > > > > >>>> > > >> to > > > > > >>>> > > >> > disk failure: 2% / 365 * 3 * 10 * 30 = 5% > > > > > >>>> > > >> > probability of any three broker stays offline for one > > day > > > > in > > > > > a > > > > > >>>> given > > > > > >>>> > > day > > > > > >>>> > > >> > due to disk failure: 5% * 5% * 5% = 0.0125% > > > > > >>>> > > >> > probability of data loss due to disk failure: 0.0125% > > > > > >>>> > > >> > > > > > > >>>> > > >> > Here is my calculation for probability of service > > > > > >>>> unavailability due > > > > > >>>> > > to > > > > > >>>> > > >> > disk failure: > > > > > >>>> > > >> > probability of a given disk fails in a given year: 2% > > > > > >>>> > > >> > probability of a given disk stays offline for one day > > in > > > a > > > > > >>>> given day: > > > > > >>>> > > >> 2% / > > > > > >>>> > > >> > 365 * 3 > > > > > >>>> > > >> > probability of a given broker stays offline for one > day > > > in > > > > a > > > > > >>>> given day > > > > > >>>> > > >> due > > > > > >>>> > > >> > to disk failure: 2% / 365 * 3 * 10 > > > > > >>>> > > >> > probability of any broker stays offline for one day > in > > a > > > > > given > > > > > >>>> day due > > > > > >>>> > > >> to > > > > > >>>> > > >> > disk failure: 2% / 365 * 3 * 10 * 30 = 5% > > > > > >>>> > > >> > probability of any two broker stays offline for one > day > > > in > > > > a > > > > > >>>> given day > > > > > >>>> > > >> due > > > > > >>>> > > >> > to disk failure: 5% * 5% * 5% = 0.25% > > > > > >>>> > > >> > probability of unavailability due to disk failure: > > 0.25% > > > > > >>>> > > >> > > > > > > >>>> > > >> > Note that the unavailability due to disk failure will > > be > > > > > >>>> unacceptably > > > > > >>>> > > >> high > > > > > >>>> > > >> > in this case. And the probability of data loss due to > > > disk > > > > > >>>> failure > > > > > >>>> > > will > > > > > >>>> > > >> be > > > > > >>>> > > >> > higher than 0.01%. Neither is acceptable if Kafka is > > > > intended > > > > > >>>> to > > > > > >>>> > > achieve > > > > > >>>> > > >> > four nigh availability. > > > > > >>>> > > >> > > > > > > >>>> > > >> > Thanks, > > > > > >>>> > > >> > Dong > > > > > >>>> > > >> > > > > > > >>>> > > >> > > > > > > >>>> > > >> > On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps < > > > > j...@confluent.io > > > > > > > > > > > >>>> wrote: > > > > > >>>> > > >> > > > > > > >>>> > > >> > > I think Ram's point is that in place failure is > > pretty > > > > > >>>> complicated, > > > > > >>>> > > >> and > > > > > >>>> > > >> > > this is meant to be a cost saving feature, we > should > > > > > >>>> construct an > > > > > >>>> > > >> > argument > > > > > >>>> > > >> > > for it grounded in data. 
> > > > > >>>> > > >> > > > > > > > >>>> > > >> > > Assume an annual failure rate of 1% (reasonable, > but > > > data > > > > > is > > > > > >>>> > > available > > > > > >>>> > > >> > > online), and assume it takes 3 days to get the > drive > > > > > >>>> replaced. Say > > > > > >>>> > > you > > > > > >>>> > > >> > have > > > > > >>>> > > >> > > 10 drives per server. Then the expected downtime > for > > > each > > > > > >>>> server is > > > > > >>>> > > >> > roughly > > > > > >>>> > > >> > > 1% * 3 days * 10 = 0.3 days/year (this is slightly > > off > > > > > since > > > > > >>>> I'm > > > > > >>>> > > >> ignoring > > > > > >>>> > > >> > > the case of multiple failures, but I don't know > that > > > > > changes > > > > > >>>> it > > > > > >>>> > > >> much). So > > > > > >>>> > > >> > > the savings from this feature is 0.3/365 = 0.08%. > Say > > > you > > > > > >>>> have 1000 > > > > > >>>> > > >> > servers > > > > > >>>> > > >> > > and they cost $3000/year fully loaded including > > power, > > > > the > > > > > >>>> cost of > > > > > >>>> > > >> the hw > > > > > >>>> > > >> > > amortized over it's life, etc. Then this feature > > saves > > > > you > > > > > >>>> $3000 on > > > > > >>>> > > >> your > > > > > >>>> > > >> > > total server cost of $3m which seems not very > > > worthwhile > > > > > >>>> compared to > > > > > >>>> > > >> > other > > > > > >>>> > > >> > > optimizations...? > > > > > >>>> > > >> > > > > > > > >>>> > > >> > > Anyhow, not sure the arithmetic is right there, > but i > > > > think > > > > > >>>> that is > > > > > >>>> > > >> the > > > > > >>>> > > >> > > type of argument that would be helpful to think > about > > > the > > > > > >>>> tradeoff > > > > > >>>> > > in > > > > > >>>> > > >> > > complexity. > > > > > >>>> > > >> > > > > > > > >>>> > > >> > > -Jay > > > > > >>>> > > >> > > > > > > > >>>> > > >> > > > > > > > >>>> > > >> > > > > > > > >>>> > > >> > > On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin < > > > > > >>>> lindon...@gmail.com> > > > > > >>>> > > wrote: > > > > > >>>> > > >> > > > > > > > >>>> > > >> > > > Hey Sriram, > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > Thanks for taking time to review the KIP. Please > > see > > > > > below > > > > > >>>> my > > > > > >>>> > > >> answers > > > > > >>>> > > >> > to > > > > > >>>> > > >> > > > your questions: > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > >1. Could you pick a hardware/Kafka configuration > > and > > > > go > > > > > >>>> over what > > > > > >>>> > > >> is > > > > > >>>> > > >> > the > > > > > >>>> > > >> > > > >average disk/partition repair/restore time that > we > > > are > > > > > >>>> targeting > > > > > >>>> > > >> for a > > > > > >>>> > > >> > > > >typical JBOD setup? > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > We currently don't have this data. I think the > > > > > >>>> disk/partition > > > > > >>>> > > >> > > repair/store > > > > > >>>> > > >> > > > time depends on availability of hardware, the > > > response > > > > > >>>> time of > > > > > >>>> > > >> > > > site-reliability engineer, the amount of data on > > the > > > > bad > > > > > >>>> disk etc. > > > > > >>>> > > >> > These > > > > > >>>> > > >> > > > vary between companies and even clusters within > the > > > > same > > > > > >>>> company > > > > > >>>> > > >> and it > > > > > >>>> > > >> > > is > > > > > >>>> > > >> > > > probably hard to determine what is the average > > > > situation. > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > I am not very sure why we need this. 
Can you > > explain > > > a > > > > > bit > > > > > >>>> why > > > > > >>>> > > this > > > > > >>>> > > >> > data > > > > > >>>> > > >> > > is > > > > > >>>> > > >> > > > useful to evaluate the motivation and design of > > this > > > > KIP? > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > >2. How often do we believe disks are going to > fail > > > (in > > > > > >>>> your > > > > > >>>> > > example > > > > > >>>> > > >> > > > >configuration) and what do we gain by avoiding > the > > > > > network > > > > > >>>> > > overhead > > > > > >>>> > > >> > and > > > > > >>>> > > >> > > > >doing all the work of moving the replica within > > the > > > > > >>>> broker to > > > > > >>>> > > >> another > > > > > >>>> > > >> > > disk > > > > > >>>> > > >> > > > >instead of balancing it globally? > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > I think the chance of disk failure depends mainly > > on > > > > the > > > > > >>>> disk > > > > > >>>> > > itself > > > > > >>>> > > >> > > rather > > > > > >>>> > > >> > > > than the broker configuration. I don't have this > > data > > > > > now. > > > > > >>>> I will > > > > > >>>> > > >> ask > > > > > >>>> > > >> > our > > > > > >>>> > > >> > > > SRE whether they know the mean-time-to-fail for > our > > > > disk. > > > > > >>>> What I > > > > > >>>> > > was > > > > > >>>> > > >> > told > > > > > >>>> > > >> > > > by SRE is that disk failure is the most common > type > > > of > > > > > >>>> hardware > > > > > >>>> > > >> > failure. > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > When there is disk failure, I think it is > > reasonable > > > to > > > > > >>>> move > > > > > >>>> > > >> replica to > > > > > >>>> > > >> > > > another broker instead of another disk on the > same > > > > > broker. > > > > > >>>> The > > > > > >>>> > > >> reason > > > > > >>>> > > >> > we > > > > > >>>> > > >> > > > want to move replica within broker is mainly to > > > > optimize > > > > > >>>> the Kafka > > > > > >>>> > > >> > > cluster > > > > > >>>> > > >> > > > performance when we balance load across disks. > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > In comparison to balancing replicas globally, the > > > > benefit > > > > > >>>> of > > > > > >>>> > > moving > > > > > >>>> > > >> > > replica > > > > > >>>> > > >> > > > within broker is that: > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > 1) the movement is faster since it doesn't go > > through > > > > > >>>> socket or > > > > > >>>> > > >> rely on > > > > > >>>> > > >> > > the > > > > > >>>> > > >> > > > available network bandwidth; > > > > > >>>> > > >> > > > 2) much less impact on the replication traffic > > > between > > > > > >>>> broker by > > > > > >>>> > > not > > > > > >>>> > > >> > > taking > > > > > >>>> > > >> > > > up bandwidth between brokers. Depending on the > > > pattern > > > > of > > > > > >>>> traffic, > > > > > >>>> > > >> we > > > > > >>>> > > >> > may > > > > > >>>> > > >> > > > need to balance load across disk frequently and > it > > is > > > > > >>>> necessary to > > > > > >>>> > > >> > > prevent > > > > > >>>> > > >> > > > this operation from slowing down the existing > > > operation > > > > > >>>> (e.g. > > > > > >>>> > > >> produce, > > > > > >>>> > > >> > > > consume, replication) in the Kafka cluster. > > > > > >>>> > > >> > > > 3) It gives us opportunity to do automatic broker > > > > > rebalance > > > > > >>>> > > between > > > > > >>>> > > >> > disks > > > > > >>>> > > >> > > > on the same broker. 
> > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > >3. Even if we had to move the replica within the > > > > broker, > > > > > >>>> why > > > > > >>>> > > >> cannot we > > > > > >>>> > > >> > > > just > > > > > >>>> > > >> > > > >treat it as another replica and have it go > through > > > the > > > > > >>>> same > > > > > >>>> > > >> > replication > > > > > >>>> > > >> > > > >code path that we have today? The downside here > is > > > > > >>>> obviously that > > > > > >>>> > > >> you > > > > > >>>> > > >> > > need > > > > > >>>> > > >> > > > >to catchup from the leader but it is completely > > > free! > > > > > >>>> What do we > > > > > >>>> > > >> think > > > > > >>>> > > >> > > is > > > > > >>>> > > >> > > > >the impact of the network overhead in this case? > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > Good point. My initial proposal actually used the > > > > > existing > > > > > >>>> > > >> > > > ReplicaFetcherThread (i.e. the existing code > path) > > to > > > > > move > > > > > >>>> replica > > > > > >>>> > > >> > > between > > > > > >>>> > > >> > > > disks. However, I switched to use separate thread > > > pool > > > > > >>>> after > > > > > >>>> > > >> discussion > > > > > >>>> > > >> > > > with Jun and Becket. > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > The main argument for using separate thread pool > is > > > to > > > > > >>>> actually > > > > > >>>> > > keep > > > > > >>>> > > >> > the > > > > > >>>> > > >> > > > design simply and easy to reason about. There > are a > > > > > number > > > > > >>>> of > > > > > >>>> > > >> > difference > > > > > >>>> > > >> > > > between inter-broker replication and intra-broker > > > > > >>>> replication > > > > > >>>> > > which > > > > > >>>> > > >> > makes > > > > > >>>> > > >> > > > it cleaner to do them in separate code path. I > will > > > > list > > > > > >>>> them > > > > > >>>> > > below: > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > - The throttling mechanism for inter-broker > > > replication > > > > > >>>> traffic > > > > > >>>> > > and > > > > > >>>> > > >> > > > intra-broker replication traffic is different. > For > > > > > >>>> example, we may > > > > > >>>> > > >> want > > > > > >>>> > > >> > > to > > > > > >>>> > > >> > > > specify per-topic quota for inter-broker > > replication > > > > > >>>> traffic > > > > > >>>> > > >> because we > > > > > >>>> > > >> > > may > > > > > >>>> > > >> > > > want some topic to be moved faster than other > > topic. > > > > But > > > > > >>>> we don't > > > > > >>>> > > >> care > > > > > >>>> > > >> > > > about priority of topics for intra-broker > movement. > > > So > > > > > the > > > > > >>>> current > > > > > >>>> > > >> > > proposal > > > > > >>>> > > >> > > > only allows user to specify per-broker quota for > > > > > >>>> inter-broker > > > > > >>>> > > >> > replication > > > > > >>>> > > >> > > > traffic. > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > - The quota value for inter-broker replication > > > traffic > > > > > and > > > > > >>>> > > >> intra-broker > > > > > >>>> > > >> > > > replication traffic is different. The available > > > > bandwidth > > > > > >>>> for > > > > > >>>> > > >> > > inter-broker > > > > > >>>> > > >> > > > replication can probably be much higher than the > > > > > bandwidth > > > > > >>>> for > > > > > >>>> > > >> > > inter-broker > > > > > >>>> > > >> > > > replication. > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > - The ReplicaFetchThread is per broker. 
> > Intuitively, > > > > the > > > > > >>>> number of > > > > > >>>> > > >> > > threads > > > > > >>>> > > >> > > > doing intra broker data movement should be > related > > to > > > > the > > > > > >>>> number > > > > > >>>> > > of > > > > > >>>> > > >> > disks > > > > > >>>> > > >> > > > in the broker, not the number of brokers in the > > > > cluster. > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > - The leader replica has no ReplicaFetchThread to > > > start > > > > > >>>> with. It > > > > > >>>> > > >> seems > > > > > >>>> > > >> > > > weird to > > > > > >>>> > > >> > > > start one just for intra-broker replication. > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > Because of these difference, we think it is > simpler > > > to > > > > > use > > > > > >>>> > > separate > > > > > >>>> > > >> > > thread > > > > > >>>> > > >> > > > pool and code path so that we can configure and > > > > throttle > > > > > >>>> them > > > > > >>>> > > >> > separately. > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > >4. What are the chances that we will be able to > > > > identify > > > > > >>>> another > > > > > >>>> > > >> disk > > > > > >>>> > > >> > to > > > > > >>>> > > >> > > > >balance within the broker instead of another > disk > > on > > > > > >>>> another > > > > > >>>> > > >> broker? > > > > > >>>> > > >> > If > > > > > >>>> > > >> > > we > > > > > >>>> > > >> > > > >have 100's of machines, the probability of > > finding a > > > > > >>>> better > > > > > >>>> > > >> balance by > > > > > >>>> > > >> > > > >choosing another broker is much higher than > > > balancing > > > > > >>>> within the > > > > > >>>> > > >> > broker. > > > > > >>>> > > >> > > > >Could you add some info on how we are > determining > > > > this? > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > It is possible that we can find available space > on > > a > > > > > remote > > > > > >>>> > > broker. > > > > > >>>> > > >> The > > > > > >>>> > > >> > > > benefit of allowing intra-broker replication is > > that, > > > > > when > > > > > >>>> there > > > > > >>>> > > are > > > > > >>>> > > >> > > > available space in both the current broker and a > > > remote > > > > > >>>> broker, > > > > > >>>> > > the > > > > > >>>> > > >> > > > rebalance can be completed faster with much less > > > impact > > > > > on > > > > > >>>> the > > > > > >>>> > > >> > > inter-broker > > > > > >>>> > > >> > > > replication or the users traffic. It is about > > taking > > > > > >>>> advantage of > > > > > >>>> > > >> > > locality > > > > > >>>> > > >> > > > when balance the load. > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > >5. Finally, in a cloud setup where more users > are > > > > going > > > > > to > > > > > >>>> > > >> leverage a > > > > > >>>> > > >> > > > >shared filesystem (example, EBS in AWS), all > this > > > > change > > > > > >>>> is not > > > > > >>>> > > of > > > > > >>>> > > >> > much > > > > > >>>> > > >> > > > >gain since you don't need to balance between the > > > > volumes > > > > > >>>> within > > > > > >>>> > > the > > > > > >>>> > > >> > same > > > > > >>>> > > >> > > > >broker. > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > You are right. This KIP-113 is useful only if > user > > > uses > > > > > >>>> JBOD. 
If > > > > > >>>> > > >> user > > > > > >>>> > > >> > > uses > > > > > >>>> > > >> > > > an extra storage layer of replication, such as > > > RAID-10 > > > > or > > > > > >>>> EBS, > > > > > >>>> > > they > > > > > >>>> > > >> > don't > > > > > >>>> > > >> > > > need KIP-112 or KIP-113. Note that user will > > > replicate > > > > > >>>> data more > > > > > >>>> > > >> times > > > > > >>>> > > >> > > than > > > > > >>>> > > >> > > > the replication factor of the Kafka topic if an > > extra > > > > > >>>> storage > > > > > >>>> > > layer > > > > > >>>> > > >> of > > > > > >>>> > > >> > > > replication is used. > > > > > >>>> > > >> > > > > > > > > >>>> > > >> > > > > > > > >>>> > > >> > > > > > > >>>> > > >> > > > > > >>>> > > > > > > > > >>>> > > > > > > > > >>>> > > > > > > > >>>> > > > > > >>> > > > > > >>> > > > > > >> > > > > > > > > > > > > > > > > > > > > >