Hey Dong,

Thanks for the explanation. I don't think anyone is denying that we should rebalance at the disk level. I think it is important to restore the disk and not wait for a disk replacement. Doing so has other benefits as well, such as not needing to opt for hot-swap racks, which can save cost.
The question here is: what do you save by adding the complexity to move the data stored on a disk within the broker? Why would you not simply create another replica on a disk that results in a balanced load across brokers and have it catch up? We are missing a few things here:

1. What would your data balancing algorithm be? Would it consider just capacity, or would it also consider throughput on the disk when deciding the final location of a partition?

2. With 30 brokers, each having 10 disks, the chance that the rebalancing algorithm chooses a disk within the same broker is going to be low. This probability decreases further with more brokers and disks. Given that, why are we trying to save network cost? How much would that saving be if you go that route?

These questions are hard to answer without verifying empirically. My suggestion is to avoid premature optimization that adds complexity to the code, and to treat inter- and intra-broker movements of a partition the same. Deploy the code, use it, and see whether moving partitions over the network is an actual problem and whether avoiding the network route for partitions within the same broker yields significant savings. If so, add this optimization.

On Wed, Jun 7, 2017 at 1:03 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Jay, Sriram,
>
> Great point. If I understand you right, you are suggesting that we can simply use RAID-0 so that the load can be evenly distributed across disks. And even though a disk failure will bring down the entire broker, the reduced availability as compared to using KIP-112 and KIP-113 may be negligible. And it may be better to just accept the slightly reduced availability instead of introducing the complexity from KIP-112 and KIP-113.
>
> Let's assume the following:
>
> - There are 30 brokers in a cluster and each broker has 10 disks.
> - The replication factor is 3 and min.isr = 2.
> - The annual disk failure rate is 2%, according to this <https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/> blog.
> - It takes 3 days to replace a disk.
>
> Here is my calculation for the probability of data loss due to disk failure:
> probability that a given disk fails in a given year: 2%
> probability that a given disk is offline on a given day: 2% / 365 * 3
> probability that a given broker is offline on a given day due to disk failure: 2% / 365 * 3 * 10
> probability that any broker is offline on a given day due to disk failure: 2% / 365 * 3 * 10 * 30 = 5%
> probability that any three brokers are offline on a given day due to disk failure: 5% * 5% * 5% = 0.0125%
> probability of data loss due to disk failure: 0.0125%
>
> Here is my calculation for the probability of service unavailability due to disk failure:
> probability that a given disk fails in a given year: 2%
> probability that a given disk is offline on a given day: 2% / 365 * 3
> probability that a given broker is offline on a given day due to disk failure: 2% / 365 * 3 * 10
> probability that any broker is offline on a given day due to disk failure: 2% / 365 * 3 * 10 * 30 = 5%
> probability that any two brokers are offline on a given day due to disk failure: 5% * 5% = 0.25%
> probability of unavailability due to disk failure: 0.25%
>
> Note that the unavailability due to disk failure would be unacceptably high in this case, and the probability of data loss due to disk failure would be higher than 0.01%. Neither is acceptable if Kafka is intended to achieve four nines of availability.
>
> Thanks,
> Dong
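For reference, the back-of-the-envelope model above can be sketched in a few lines of Python. This simply restates the figures listed in the email and, like the email, assumes disk failures are independent and approximates "any broker offline" by summing per-broker probabilities:

```python
# Rough sketch of the back-of-the-envelope model above.
# Assumes disk failures are independent and uses the figures stated in the email.

brokers = 30
disks_per_broker = 10
annual_disk_failure_rate = 0.02
replacement_days = 3

# Probability that a given disk is offline on a given day
p_disk_down = annual_disk_failure_rate / 365 * replacement_days

# Probability that a given broker is down on a given day due to a disk failure
p_broker_down = p_disk_down * disks_per_broker

# Probability that some broker in the cluster is down on a given day (sum over brokers)
p_any_broker_down = p_broker_down * brokers            # ~5%

# With replication factor 3, data loss needs 3 affected brokers;
# with min.isr = 2, unavailability needs 2 affected brokers.
p_data_loss      = p_any_broker_down ** 3              # ~0.012% (0.0125% with the rounded 5%)
p_unavailability = p_any_broker_down ** 2              # ~0.24%  (0.25% with the rounded 5%)

print(f"p(any broker down on a given day): {p_any_broker_down:.2%}")
print(f"p(data loss):                      {p_data_loss:.4%}")
print(f"p(unavailability):                 {p_unavailability:.4%}")
```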
> On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps <j...@confluent.io> wrote:
>
> > I think Ram's point is that in-place failure handling is pretty complicated, and since this is meant to be a cost-saving feature, we should construct an argument for it grounded in data.
> >
> > Assume an annual failure rate of 1% (reasonable, but data is available online), and assume it takes 3 days to get the drive replaced. Say you have 10 drives per server. Then the expected downtime for each server is roughly 1% * 3 days * 10 = 0.3 days/year (this is slightly off since I'm ignoring the case of multiple failures, but I don't think that changes it much). So the savings from this feature is 0.3/365 = 0.08%. Say you have 1000 servers and they cost $3000/year fully loaded, including power, the cost of the hardware amortized over its life, etc. Then this feature saves you $3000 on your total server cost of $3m, which seems not very worthwhile compared to other optimizations...?
> >
> > Anyhow, not sure the arithmetic is right there, but I think that is the type of argument that would be helpful to think about the tradeoff in complexity.
> >
> > -Jay
> >
> > On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <lindon...@gmail.com> wrote:
> >
> > > Hey Sriram,
> > >
> > > Thanks for taking time to review the KIP. Please see below my answers to your questions:
> > >
> > > >1. Could you pick a hardware/Kafka configuration and go over what is the average disk/partition repair/restore time that we are targeting for a typical JBOD setup?
> > >
> > > We currently don't have this data. I think the disk/partition repair/restore time depends on the availability of hardware, the response time of the site-reliability engineers, the amount of data on the bad disk, etc. These vary between companies, and even between clusters within the same company, and it is probably hard to determine what the average situation is.
> > >
> > > I am not very sure why we need this. Can you explain a bit why this data is useful for evaluating the motivation and design of this KIP?
> > >
> > > >2. How often do we believe disks are going to fail (in your example configuration) and what do we gain by avoiding the network overhead and doing all the work of moving the replica within the broker to another disk instead of balancing it globally?
> > >
> > > I think the chance of disk failure depends mainly on the disk itself rather than the broker configuration. I don't have this data now. I will ask our SREs whether they know the mean time to failure for our disks. What I was told by the SREs is that disk failure is the most common type of hardware failure.
> > >
> > > When there is a disk failure, I think it is reasonable to move the replica to another broker instead of another disk on the same broker. The reason we want to move replicas within a broker is mainly to optimize Kafka cluster performance when we balance load across disks.
> > >
> > > In comparison to balancing replicas globally, the benefits of moving replicas within a broker are that:
> > >
> > > 1) the movement is faster since it doesn't go through a socket or rely on the available network bandwidth;
> > > 2) it has much less impact on the replication traffic between brokers, since it does not take up inter-broker bandwidth.
> > > Depending on the traffic pattern, we may need to balance load across disks frequently, and it is necessary to prevent this operation from slowing down the existing operations (e.g. produce, consume, replication) in the Kafka cluster.
> > > 3) it gives us the opportunity to do automatic rebalancing between disks on the same broker.
> > >
> > > >3. Even if we had to move the replica within the broker, why can't we just treat it as another replica and have it go through the same replication code path that we have today? The downside here is obviously that you need to catch up from the leader but it is completely free! What do we think is the impact of the network overhead in this case?
> > >
> > > Good point. My initial proposal actually used the existing ReplicaFetcherThread (i.e. the existing code path) to move replicas between disks. However, I switched to using a separate thread pool after discussion with Jun and Becket.
> > >
> > > The main argument for using a separate thread pool is actually to keep the design simple and easy to reason about. There are a number of differences between inter-broker replication and intra-broker replication which make it cleaner to handle them in separate code paths. I will list them below:
> > >
> > > - The throttling mechanism for inter-broker replication traffic and intra-broker replication traffic is different. For example, we may want to specify a per-topic quota for inter-broker replication traffic because we may want some topics to be moved faster than others. But we don't care about the priority of topics for intra-broker movement. So the current proposal only allows the user to specify a per-broker quota for intra-broker replication traffic.
> > >
> > > - The quota value for inter-broker replication traffic and intra-broker replication traffic is different. The available bandwidth for intra-broker replication can probably be much higher than the bandwidth for inter-broker replication.
> > >
> > > - The ReplicaFetcherThreads are per broker. Intuitively, the number of threads doing intra-broker data movement should be related to the number of disks in the broker, not the number of brokers in the cluster.
> > >
> > > - The leader replica has no ReplicaFetcherThread to start with. It seems weird to start one just for intra-broker replication.
> > >
> > > Because of these differences, we think it is simpler to use a separate thread pool and code path so that we can configure and throttle them separately.
> > >
> > > >4. What are the chances that we will be able to identify another disk to balance within the broker instead of another disk on another broker? If we have hundreds of machines, the probability of finding a better balance by choosing another broker is much higher than balancing within the broker. Could you add some info on how we are determining this?
> > >
> > > It is possible that we can find available space on a remote broker. The benefit of allowing intra-broker replication is that, when there is available space on both the current broker and a remote broker, the rebalance can be completed faster, with much less impact on inter-broker replication or user traffic. It is about taking advantage of locality when balancing the load.
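To put a rough number on the locality question (Sriram's point 2 at the top of this thread and question 4 here), the following is a minimal sketch. It assumes the 30-broker, 10-disk example used earlier and, purely for illustration, that the destination disk is chosen uniformly at random; a real balancing algorithm would not pick disks at random:

```python
# Rough sketch: if the destination disk for a replica movement were picked
# uniformly at random from all other disks in the cluster, how often would
# it land on the same broker?  Uses the 30-broker / 10-disk example from
# this thread.

brokers = 30
disks_per_broker = 10

total_disks = brokers * disks_per_broker
same_broker_choices = disks_per_broker - 1    # other disks on the source broker
all_choices = total_disks - 1                 # every disk except the source disk

p_same_broker = same_broker_choices / all_choices
print(f"p(destination on same broker): {p_same_broker:.1%}")   # ~3.0%
```

This is the basis of Sriram's point that purely random placement rarely stays within a broker; Dong's counterpoint is that when a same-broker placement is available, taking it avoids the network entirely.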
> > > >5. Finally, in a cloud setup where more users are going to leverage a shared filesystem (for example, EBS in AWS), all this change is not of much gain since you don't need to balance between the volumes within the same broker.
> > >
> > > You are right. KIP-113 is useful only if the user uses JBOD. If the user uses an extra storage layer of replication, such as RAID-10 or EBS, they don't need KIP-112 or KIP-113. Note that the user will then replicate data more times than the replication factor of the Kafka topic, since an extra storage layer of replication is used.
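To make that last storage-overhead point concrete, here is a minimal sketch, assuming a topic replication factor of 3 and an extra storage layer that mirrors every write once (as RAID-10 does):

```python
# Rough sketch of the physical copy count with and without an extra
# storage layer of replication, as discussed above.

topic_replication_factor = 3
storage_layer_copies_jbod = 1      # JBOD: each replica's log is stored once
storage_layer_copies_raid10 = 2    # RAID-10: every write is mirrored

physical_copies_jbod = topic_replication_factor * storage_layer_copies_jbod      # 3
physical_copies_raid10 = topic_replication_factor * storage_layer_copies_raid10  # 6

print(f"physical copies with JBOD:    {physical_copies_jbod}")
print(f"physical copies with RAID-10: {physical_copies_raid10}")
```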