Hey Colin, Thanks for the suggestion. We have actually considered this and list this as the first future work in KIP-112 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD>. The two advantages that you mentioned are exactly the motivation for this feature. Also as you have mentioned, this involves the tradeoff between disk performance and availability -- the more you distribute topic across disks, the more topics will be offline due to a single disk failure.
Despite its complexity, it is not clear to me that the reduced rebalance overhead is worth the reduction in availability. I am optimistic that the rebalance overhead will not be that a big problem since we are not too bothered by cross-broker rebalance as of now. Thanks, Dong On Mon, Jun 12, 2017 at 10:36 AM, Colin McCabe <cmcc...@apache.org> wrote: > Has anyone considered a scheme for sharding topic data across multiple > disks? > > For example, if you sharded topics across 3 disks, and you had 10 disks, > you could pick a different set of 3 disks for each topic. If you > distribute them randomly then you have 10 choose 3 = 120 different > combinations. You would probably never need rebalancing if you had a > reasonable distribution of topic sizes (could probably prove this with a > Monte Carlo or something). > > The disadvantage is that if one of the 3 disks fails, then you have to > take the topic offline. But if we assume independent disk failure > probabilities, probability of failure with RAID 0 is: 1 - > Psuccess^(num_disks) whereas the probability of failure with this scheme > is 1 - Psuccess ^ 3. > > This addresses the biggest downsides of JBOD now: > * limiting a topic to the size of a single disk limits scalability > * the topic movement process is tricky to get right and involves "racing > against producers" and wasted double I/Os > > Of course, one other question is how frequently we add new disk drives > to an existing broker. In this case, you might reasonably want disk > rebalancing to avoid overloading the new disk(s) with writes. > > cheers, > Colin > > > On Fri, Jun 9, 2017, at 18:46, Jun Rao wrote: > > Just a few comments on this. > > > > 1. One of the issues with using RAID 0 is that a single disk failure > > causes > > a hard failure of the broker. Hard failure increases the unavailability > > window for all the partitions on the failed broker, which includes the > > failure detection time (tied to ZK session timeout right now) and leader > > election time by the controller. If we support JBOD natively, when a > > single > > disk fails, only partitions on the failed disk will experience a hard > > failure. The availability for partitions on the rest of the disks are not > > affected. > > > > 2. For running things on the Cloud such as AWS. Currently, each EBS > > volume > > has a throughout limit of about 300MB/sec. If you get an enhanced EC2 > > instance, you can get 20Gb/sec network. To saturate the network, you may > > need about 7 EBS volumes. So, being able to support JBOD in the Cloud is > > still potentially useful. > > > > 3. On the benefit of balancing data across disks within the same broker. > > Data imbalance can happen across brokers as well as across disks within > > the > > same broker. Balancing the data across disks within the broker has the > > benefit of saving network bandwidth as Dong mentioned. So, if intra > > broker > > load balancing is possible, it's probably better to avoid the more > > expensive inter broker load balancing. One of the reasons for disk > > imbalance right now is that partitions within a broker are assigned to > > disks just based on the partition count. So, it does seem possible for > > disks to get imbalanced from time to time. If someone can share some > > stats > > for that in practice, that will be very helpful. > > > > Thanks, > > > > Jun > > > > > > On Wed, Jun 7, 2017 at 2:30 PM, Dong Lin <lindon...@gmail.com> wrote: > > > > > Hey Sriram, > > > > > > I think there is one way to explain why the ability to move replica > between > > > disks can save space. Let's say the load is distributed to disks > > > independent of the broker. Sooner or later, the load imbalance will > exceed > > > a threshold and we will need to rebalance load across disks. Now our > > > questions is whether our rebalancing algorithm will be able to take > > > advantage of locality by moving replicas between disks on the same > broker. > > > > > > Say for a given disk, there is 20% probability it is overloaded, 20% > > > probability it is underloaded, and 60% probability its load is around > the > > > expected average load if the cluster is well balanced. Then for a > broker of > > > 10 disks, we would 2 disks need to have in-bound replica movement, 2 > disks > > > need to have out-bound replica movement, and 6 disks do not need > replica > > > movement. Thus we would expect KIP-113 to be useful since we will be > able > > > to move replica from the two over-loaded disks to the two under-loaded > > > disks on the same broKER. Does this make sense? > > > > > > Thanks, > > > Dong > > > > > > > > > > > > > > > > > > > > > On Wed, Jun 7, 2017 at 2:12 PM, Dong Lin <lindon...@gmail.com> wrote: > > > > > > > Hey Sriram, > > > > > > > > Thanks for raising these concerns. Let me answer these questions > below: > > > > > > > > - The benefit of those additional complexity to move the data stored > on a > > > > disk within the broker is to avoid network bandwidth usage. Creating > > > > replica on another broker is less efficient than creating replica on > > > > another disk in the same broker IF there is actually lightly-loaded > disk > > > on > > > > the same broker. > > > > > > > > - In my opinion the rebalance algorithm would this: 1) we balance the > > > load > > > > across brokers using the same algorithm we are using today. 2) we > balance > > > > load across disk on a given broker using a greedy algorithm, i.e. > move > > > > replica from the overloaded disk to lightly loaded disk. The greedy > > > > algorithm would only consider the capacity and replica size. We can > > > improve > > > > it to consider throughput in the future. > > > > > > > > - With 30 brokers with each having 10 disks, using the rebalancing > > > algorithm, > > > > the chances of choosing disks within the broker can be high. There > will > > > > always be load imbalance across disks of the same broker for the same > > > > reason that there will always be load imbalance across brokers. The > > > > algorithm specified above will take advantage of the locality, i.e. > first > > > > balance load across disks of the same broker, and only balance across > > > > brokers if some brokers are much more loaded than others. > > > > > > > > I think it is useful to note that the load imbalance across disks of > the > > > > same broker is independent of the load imbalance across brokers. > Both are > > > > guaranteed to happen in any Kafka cluster for the same reason, i.e. > > > > variation in the partition size. Say broker 1 have two disks that > are 80% > > > > loaded and 20% loaded. And broker 2 have two disks that are also 80% > > > > loaded and 20%. We can balance them without inter-broker traffic with > > > > KIP-113. This is why I think KIP-113 can be very useful. > > > > > > > > Do these explanation sound reasonable? > > > > > > > > Thanks, > > > > Dong > > > > > > > > > > > > On Wed, Jun 7, 2017 at 1:33 PM, Sriram Subramanian <r...@confluent.io > > > > > > wrote: > > > > > > > >> Hey Dong, > > > >> > > > >> Thanks for the explanation. I don't think anyone is denying that we > > > should > > > >> rebalance at the disk level. I think it is important to restore the > disk > > > >> and not wait for disk replacement. There are also other benefits of > > > doing > > > >> that which is that you don't need to opt for hot swap racks that can > > > save > > > >> cost. > > > >> > > > >> The question here is what do you save by trying to add complexity to > > > move > > > >> the data stored on a disk within the broker? Why would you not > simply > > > >> create another replica on the disk that results in a balanced load > > > across > > > >> brokers and have it catch up. We are missing a few things here - > > > >> 1. What would your data balancing algorithm be? Would it include > just > > > >> capacity or will it also consider throughput on disk to decide on > the > > > >> final > > > >> location of a partition? > > > >> 2. With 30 brokers with each having 10 disks, using the rebalancing > > > >> algorithm, the chances of choosing disks within the broker is going > to > > > be > > > >> low. This probability further decreases with more brokers and disks. > > > Given > > > >> that, why are we trying to save network cost? How much would that > saving > > > >> be > > > >> if you go that route? > > > >> > > > >> These questions are hard to answer without having to verify > empirically. > > > >> My > > > >> suggestion is to avoid doing pre matured optimization that brings > in the > > > >> added complexity to the code and treat inter and intra broker > movements > > > of > > > >> partition the same. Deploy the code, use it and see if it is an > actual > > > >> problem and you get great savings by avoiding the network route to > move > > > >> partitions within the same broker. If so, add this optimization. > > > >> > > > >> On Wed, Jun 7, 2017 at 1:03 PM, Dong Lin <lindon...@gmail.com> > wrote: > > > >> > > > >> > Hey Jay, Sriram, > > > >> > > > > >> > Great point. If I understand you right, you are suggesting that > we can > > > >> > simply use RAID-0 so that the load can be evenly distributed > across > > > >> disks. > > > >> > And even though a disk failure will bring down the enter broker, > the > > > >> > reduced availability as compared to using KIP-112 and KIP-113 > will may > > > >> be > > > >> > negligible. And it may be better to just accept the slightly > reduced > > > >> > availability instead of introducing the complexity from KIP-112 > and > > > >> > KIP-113. > > > >> > > > > >> > Let's assume the following: > > > >> > > > > >> > - There are 30 brokers in a cluster and each broker has 10 disks > > > >> > - The replication factor is 3 and min.isr = 2. > > > >> > - The probability of annual disk failure rate is 2% according to > this > > > >> > <https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/ > > > > > >> blog. > > > >> > - It takes 3 days to replace a disk. > > > >> > > > > >> > Here is my calculation for probability of data loss due to disk > > > failure: > > > >> > probability of a given disk fails in a given year: 2% > > > >> > probability of a given disk stays offline for one day in a given > day: > > > >> 2% / > > > >> > 365 * 3 > > > >> > probability of a given broker stays offline for one day in a > given day > > > >> due > > > >> > to disk failure: 2% / 365 * 3 * 10 > > > >> > probability of any broker stays offline for one day in a given > day due > > > >> to > > > >> > disk failure: 2% / 365 * 3 * 10 * 30 = 5% > > > >> > probability of any three broker stays offline for one day in a > given > > > day > > > >> > due to disk failure: 5% * 5% * 5% = 0.0125% > > > >> > probability of data loss due to disk failure: 0.0125% > > > >> > > > > >> > Here is my calculation for probability of service unavailability > due > > > to > > > >> > disk failure: > > > >> > probability of a given disk fails in a given year: 2% > > > >> > probability of a given disk stays offline for one day in a given > day: > > > >> 2% / > > > >> > 365 * 3 > > > >> > probability of a given broker stays offline for one day in a > given day > > > >> due > > > >> > to disk failure: 2% / 365 * 3 * 10 > > > >> > probability of any broker stays offline for one day in a given > day due > > > >> to > > > >> > disk failure: 2% / 365 * 3 * 10 * 30 = 5% > > > >> > probability of any two broker stays offline for one day in a > given day > > > >> due > > > >> > to disk failure: 5% * 5% * 5% = 0.25% > > > >> > probability of unavailability due to disk failure: 0.25% > > > >> > > > > >> > Note that the unavailability due to disk failure will be > unacceptably > > > >> high > > > >> > in this case. And the probability of data loss due to disk failure > > > will > > > >> be > > > >> > higher than 0.01%. Neither is acceptable if Kafka is intended to > > > achieve > > > >> > four nigh availability. > > > >> > > > > >> > Thanks, > > > >> > Dong > > > >> > > > > >> > > > > >> > On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps <j...@confluent.io> > wrote: > > > >> > > > > >> > > I think Ram's point is that in place failure is pretty > complicated, > > > >> and > > > >> > > this is meant to be a cost saving feature, we should construct > an > > > >> > argument > > > >> > > for it grounded in data. > > > >> > > > > > >> > > Assume an annual failure rate of 1% (reasonable, but data is > > > available > > > >> > > online), and assume it takes 3 days to get the drive replaced. > Say > > > you > > > >> > have > > > >> > > 10 drives per server. Then the expected downtime for each > server is > > > >> > roughly > > > >> > > 1% * 3 days * 10 = 0.3 days/year (this is slightly off since I'm > > > >> ignoring > > > >> > > the case of multiple failures, but I don't know that changes it > > > >> much). So > > > >> > > the savings from this feature is 0.3/365 = 0.08%. Say you have > 1000 > > > >> > servers > > > >> > > and they cost $3000/year fully loaded including power, the cost > of > > > >> the hw > > > >> > > amortized over it's life, etc. Then this feature saves you > $3000 on > > > >> your > > > >> > > total server cost of $3m which seems not very worthwhile > compared to > > > >> > other > > > >> > > optimizations...? > > > >> > > > > > >> > > Anyhow, not sure the arithmetic is right there, but i think > that is > > > >> the > > > >> > > type of argument that would be helpful to think about the > tradeoff > > > in > > > >> > > complexity. > > > >> > > > > > >> > > -Jay > > > >> > > > > > >> > > > > > >> > > > > > >> > > On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <lindon...@gmail.com> > > > wrote: > > > >> > > > > > >> > > > Hey Sriram, > > > >> > > > > > > >> > > > Thanks for taking time to review the KIP. Please see below my > > > >> answers > > > >> > to > > > >> > > > your questions: > > > >> > > > > > > >> > > > >1. Could you pick a hardware/Kafka configuration and go over > what > > > >> is > > > >> > the > > > >> > > > >average disk/partition repair/restore time that we are > targeting > > > >> for a > > > >> > > > >typical JBOD setup? > > > >> > > > > > > >> > > > We currently don't have this data. I think the disk/partition > > > >> > > repair/store > > > >> > > > time depends on availability of hardware, the response time of > > > >> > > > site-reliability engineer, the amount of data on the bad disk > etc. > > > >> > These > > > >> > > > vary between companies and even clusters within the same > company > > > >> and it > > > >> > > is > > > >> > > > probably hard to determine what is the average situation. > > > >> > > > > > > >> > > > I am not very sure why we need this. Can you explain a bit why > > > this > > > >> > data > > > >> > > is > > > >> > > > useful to evaluate the motivation and design of this KIP? > > > >> > > > > > > >> > > > >2. How often do we believe disks are going to fail (in your > > > example > > > >> > > > >configuration) and what do we gain by avoiding the network > > > overhead > > > >> > and > > > >> > > > >doing all the work of moving the replica within the broker to > > > >> another > > > >> > > disk > > > >> > > > >instead of balancing it globally? > > > >> > > > > > > >> > > > I think the chance of disk failure depends mainly on the disk > > > itself > > > >> > > rather > > > >> > > > than the broker configuration. I don't have this data now. I > will > > > >> ask > > > >> > our > > > >> > > > SRE whether they know the mean-time-to-fail for our disk. > What I > > > was > > > >> > told > > > >> > > > by SRE is that disk failure is the most common type of > hardware > > > >> > failure. > > > >> > > > > > > >> > > > When there is disk failure, I think it is reasonable to move > > > >> replica to > > > >> > > > another broker instead of another disk on the same broker. The > > > >> reason > > > >> > we > > > >> > > > want to move replica within broker is mainly to optimize the > Kafka > > > >> > > cluster > > > >> > > > performance when we balance load across disks. > > > >> > > > > > > >> > > > In comparison to balancing replicas globally, the benefit of > > > moving > > > >> > > replica > > > >> > > > within broker is that: > > > >> > > > > > > >> > > > 1) the movement is faster since it doesn't go through socket > or > > > >> rely on > > > >> > > the > > > >> > > > available network bandwidth; > > > >> > > > 2) much less impact on the replication traffic between broker > by > > > not > > > >> > > taking > > > >> > > > up bandwidth between brokers. Depending on the pattern of > traffic, > > > >> we > > > >> > may > > > >> > > > need to balance load across disk frequently and it is > necessary to > > > >> > > prevent > > > >> > > > this operation from slowing down the existing operation (e.g. > > > >> produce, > > > >> > > > consume, replication) in the Kafka cluster. > > > >> > > > 3) It gives us opportunity to do automatic broker rebalance > > > between > > > >> > disks > > > >> > > > on the same broker. > > > >> > > > > > > >> > > > > > > >> > > > >3. Even if we had to move the replica within the broker, why > > > >> cannot we > > > >> > > > just > > > >> > > > >treat it as another replica and have it go through the same > > > >> > replication > > > >> > > > >code path that we have today? The downside here is obviously > that > > > >> you > > > >> > > need > > > >> > > > >to catchup from the leader but it is completely free! What > do we > > > >> think > > > >> > > is > > > >> > > > >the impact of the network overhead in this case? > > > >> > > > > > > >> > > > Good point. My initial proposal actually used the existing > > > >> > > > ReplicaFetcherThread (i.e. the existing code path) to move > replica > > > >> > > between > > > >> > > > disks. However, I switched to use separate thread pool after > > > >> discussion > > > >> > > > with Jun and Becket. > > > >> > > > > > > >> > > > The main argument for using separate thread pool is to > actually > > > keep > > > >> > the > > > >> > > > design simply and easy to reason about. There are a number of > > > >> > difference > > > >> > > > between inter-broker replication and intra-broker replication > > > which > > > >> > makes > > > >> > > > it cleaner to do them in separate code path. I will list them > > > below: > > > >> > > > > > > >> > > > - The throttling mechanism for inter-broker replication > traffic > > > and > > > >> > > > intra-broker replication traffic is different. For example, > we may > > > >> want > > > >> > > to > > > >> > > > specify per-topic quota for inter-broker replication traffic > > > >> because we > > > >> > > may > > > >> > > > want some topic to be moved faster than other topic. But we > don't > > > >> care > > > >> > > > about priority of topics for intra-broker movement. So the > current > > > >> > > proposal > > > >> > > > only allows user to specify per-broker quota for inter-broker > > > >> > replication > > > >> > > > traffic. > > > >> > > > > > > >> > > > - The quota value for inter-broker replication traffic and > > > >> intra-broker > > > >> > > > replication traffic is different. The available bandwidth for > > > >> > > inter-broker > > > >> > > > replication can probably be much higher than the bandwidth for > > > >> > > inter-broker > > > >> > > > replication. > > > >> > > > > > > >> > > > - The ReplicaFetchThread is per broker. Intuitively, the > number of > > > >> > > threads > > > >> > > > doing intra broker data movement should be related to the > number > > > of > > > >> > disks > > > >> > > > in the broker, not the number of brokers in the cluster. > > > >> > > > > > > >> > > > - The leader replica has no ReplicaFetchThread to start with. > It > > > >> seems > > > >> > > > weird to > > > >> > > > start one just for intra-broker replication. > > > >> > > > > > > >> > > > Because of these difference, we think it is simpler to use > > > separate > > > >> > > thread > > > >> > > > pool and code path so that we can configure and throttle them > > > >> > separately. > > > >> > > > > > > >> > > > > > > >> > > > >4. What are the chances that we will be able to identify > another > > > >> disk > > > >> > to > > > >> > > > >balance within the broker instead of another disk on another > > > >> broker? > > > >> > If > > > >> > > we > > > >> > > > >have 100's of machines, the probability of finding a better > > > >> balance by > > > >> > > > >choosing another broker is much higher than balancing within > the > > > >> > broker. > > > >> > > > >Could you add some info on how we are determining this? > > > >> > > > > > > >> > > > It is possible that we can find available space on a remote > > > broker. > > > >> The > > > >> > > > benefit of allowing intra-broker replication is that, when > there > > > are > > > >> > > > available space in both the current broker and a remote > broker, > > > the > > > >> > > > rebalance can be completed faster with much less impact on the > > > >> > > inter-broker > > > >> > > > replication or the users traffic. It is about taking > advantage of > > > >> > > locality > > > >> > > > when balance the load. > > > >> > > > > > > >> > > > >5. Finally, in a cloud setup where more users are going to > > > >> leverage a > > > >> > > > >shared filesystem (example, EBS in AWS), all this change is > not > > > of > > > >> > much > > > >> > > > >gain since you don't need to balance between the volumes > within > > > the > > > >> > same > > > >> > > > >broker. > > > >> > > > > > > >> > > > You are right. This KIP-113 is useful only if user uses JBOD. > If > > > >> user > > > >> > > uses > > > >> > > > an extra storage layer of replication, such as RAID-10 or EBS, > > > they > > > >> > don't > > > >> > > > need KIP-112 or KIP-113. Note that user will replicate data > more > > > >> times > > > >> > > than > > > >> > > > the replication factor of the Kafka topic if an extra storage > > > layer > > > >> of > > > >> > > > replication is used. > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > > > > > > >