Makes sense, thank you Dong.

Eno
> On 28 Feb 2017, at 01:51, Dong Lin <lindon...@gmail.com> wrote:
> 
> Hi Jun,
> 
> In addition to Eno's reference on why rebuild time with RAID-5 is
> expensive, another concern is that RAID-5 will fail if more than one disk
> fails. JBOD still works with one or more disk failures and has better
> performance when one disk has failed. These seem like good arguments for
> using JBOD instead of RAID-5.
> 
> If a leader replica goes offline, the broker should first take all actions
> (e.g. remove the partition from the fetcher thread) as if it had received a
> StopReplicaRequest for this partition, because the replica can no longer
> work anyway. It will also respond with an error to any ProduceRequest and
> FetchRequest for this partition. The broker notifies the controller by
> writing a notification znode in ZK. The controller learns of the disk
> failure event from ZK, sends a LeaderAndIsrRequest and learns from the
> LeaderAndIsrResponse that the replica is offline. The controller will then
> elect a new leader for this partition and send
> LeaderAndIsrRequest/UpdateMetadataRequest to the relevant brokers. The
> broker should stop adjusting the ISR for this partition as if the broker
> were already offline. I am not sure there is any inconsistency in the
> broker's behavior between the leader and follower cases. Is there any
> concern with this approach?
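> 
> To make this concrete, here is a rough sketch in Java of the broker-side
> handling; the class, method and znode names are illustrative placeholders I
> made up for this email, not the actual broker code:
> 
>   import java.util.Set;
>   import java.util.concurrent.ConcurrentHashMap;
> 
>   // Sketch of a broker reacting to the failure of the log directory that
>   // holds one of its replicas (leader or follower).
>   public class LogDirFailureHandler {
> 
>       // Partitions whose log directory has failed; the Produce/Fetch
>       // request handlers would consult this set and answer with an error.
>       private final Set<String> offlinePartitions =
>           ConcurrentHashMap.newKeySet();
> 
>       public void onLogDirFailure(String topicPartition) {
>           // 1. Act as if a StopReplicaRequest had been received: stop the
>           //    fetcher for this partition and stop adjusting its ISR.
>           stopFetcher(topicPartition);
> 
>           // 2. Mark the replica offline so later ProduceRequest and
>           //    FetchRequest for this partition get an error response.
>           offlinePartitions.add(topicPartition);
> 
>           // 3. Write a notification znode; the controller picks it up,
>           //    sends LeaderAndIsrRequest, learns from the response that
>           //    the replica is offline, elects a new leader and sends
>           //    LeaderAndIsrRequest/UpdateMetadataRequest to the relevant
>           //    brokers.
>           writeZkNotification("/log_dir_event_notification");
>       }
> 
>       public boolean isOffline(String topicPartition) {
>           return offlinePartitions.contains(topicPartition);
>       }
> 
>       private void stopFetcher(String topicPartition) { /* placeholder */ }
>       private void writeZkNotification(String path) { /* placeholder */ }
>   }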
> 
> Thanks for catching this. I have removed that reference from the KIP.
> 
> Hi Eno,
> 
> Thank you for providing the reference on RAID-5. At LinkedIn we have 10
> disks per Kafka machine. It will not be a show-stopper operationally for
> LinkedIn if we have to deploy one-broker-per-disk. On the other hand, we
> previously discussed the advantages of JBOD vs. one-broker-per-disk or
> one-broker-per-machine. One-broker-per-disk suffers from the problems
> described in the KIP, and one-broker-per-machine increases the rate of
> broker failures caused by disk failure by 10X. Since JBOD is strictly
> better than either of the two, it is also better than
> one-broker-per-multiple-disks, which is somewhere between
> one-broker-per-disk and one-broker-per-machine.
> 
> I personally think the benefits of the JBOD design are worth the
> implementation complexity it introduces. I would also argue that it is
> reasonable for Kafka to manage this low-level detail because Kafka is
> already exposing and managing the replication factor of its data. But
> whether the complexity is worthwhile can be subjective and I cannot prove
> my opinion. I am contributing a significant amount of time to this KIP
> because Kafka developers at LinkedIn believe it is useful and worth the
> effort. It will be useful to see what everyone else thinks about it.
> 
> 
> Thanks,
> Dong
> 
> 
> On Mon, Feb 27, 2017 at 1:16 PM, Jun Rao <j...@confluent.io> wrote:
> 
>> Hi, Dong,
>> 
>> For RAID5, I am not sure the rebuild cost is a big concern. If a disk
>> fails, typically an admin has to bring down the broker, replace the failed
>> disk with a new one, trigger the RAID rebuild, and bring up the broker.
>> This way, there is no performance impact at runtime due to rebuild. The
>> benefit is that a broker doesn't fail in a hard way when there is a disk
>> failure and can be brought down in a controlled way for maintenance. While
>> the broker is running with a failed disk, reads may be more expensive since
>> they have to be computed from the parity. However, if most reads are from
>> page cache, this may not be a big issue either. So, it would be useful to
>> do some tests on RAID5 before we completely rule it out.
>> 
>> Regarding whether to remove an offline replica from the fetcher thread
>> immediately. What do we do when a failed replica is a leader? Do we do
>> nothing or mark the replica as not the leader immediately? Intuitively, it
>> seems it's better if the broker acts consistently on a failed replica
>> whether it's a leader or a follower. Regarding ISR churn, I was just
>> pointing out that if we don't send a StopReplicaRequest to a broker being
>> shut down in a controlled way, then the leader will shrink the ISR, expand
>> it and shrink it again after the timeout.
>> 
>> The KIP seems to still reference "
>> /broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
>> 
>> Thanks,
>> 
>> Jun
>> 
>> On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:
>> 
>>> Hey Jun,
>>> 
>>> Thanks for the suggestion. I think it is a good idea not to put the
>>> created flag in ZK and to simply specify isNewReplica=true in the
>>> LeaderAndIsrRequest if the replica was in the NewReplica state. It will
>>> only fail the replica creation in the scenario where the controller fails
>>> after topic-creation/partition-reassignment/partition-number-change but
>>> before it actually sends out the LeaderAndIsrRequest while there is an
>>> ongoing disk failure, which should be pretty rare and acceptable. This
>>> should simplify the design of this KIP.
>>> 
>>> Regarding RAID-5, I think the concern with RAID-5/6 is not just about
>>> performance when there is no failure. For example, RAID-5 can tolerate at
>>> most one disk failure, and it takes time to rebuild the array after a
>>> disk fails. RAID-5 implementations are susceptible to system failures
>>> because of trends in array rebuild time and the chance of a drive failure
>>> during rebuild. There is no such performance degradation for JBOD, and
>>> JBOD can tolerate multiple log directory failures without reducing the
>>> performance of the good log directories. Would this be a reasonable
>>> reason for using JBOD instead of RAID-5/6?
>>> 
>>> Previously we discussed whether the broker should remove an offline
>>> replica from the replica fetcher thread. I still think it should do so
>>> instead of printing a lot of errors in the log4j log. We can still let
>>> the controller send a StopReplicaRequest to the broker. I am not sure I
>>> understand why allowing the broker to remove the offline replica from the
>>> fetcher thread would increase churn in the ISR. Do you see a concern with
>>> this approach?
>>> 
>>> I have updated the KIP to remove the created flag from ZK and change the
>>> field name to isNewReplica. Can you check if there is any issue with the
>>> latest KIP? Thanks for your time!
>>> 
>>> Regards,
>>> Dong
>>> 
>>> 
>>> On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:
>>> 
>>>> Hi, Dong,
>>>> 
>>>> Thanks for the reply.
>>>> 
>>>> Personally, I'd prefer not to write the created flag per replica in ZK.
>>>> Your suggestion of disabling replica creation if there is a bad log
>>>> directory on the broker could work. The only thing is that it may delay
>>>> the creation of new replicas. I was thinking that an alternative is to
>>>> extend LeaderAndIsrRequest by adding an isNewReplica field per replica.
>>>> That field will be set when a replica is transitioning from the
>>>> NewReplica state to the Online state. Then, when a broker receives a
>>>> LeaderAndIsrRequest, if a replica is marked as a new replica, it will be
>>>> created on a good log directory, if not already present. Otherwise, it
>>>> only creates the replica if all log directories are good and the replica
>>>> is not already present. This way, we don't delay the processing of new
>>>> replicas in the common case.
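>>>> 
>>>> In Java-like pseudo code, the rule I have in mind is roughly the
>>>> following (the names are made up for illustration, not the actual broker
>>>> code):
>>>> 
>>>>   import java.util.List;
>>>> 
>>>>   // Sketch of how a broker could decide whether to create the local log
>>>>   // for a replica listed in a LeaderAndIsrRequest.
>>>>   public class ReplicaCreationPolicy {
>>>> 
>>>>       public boolean shouldCreateReplica(boolean isNewReplica,
>>>>                                          boolean alreadyPresent,
>>>>                                          List<String> goodLogDirs,
>>>>                                          int totalLogDirs) {
>>>>           if (alreadyPresent) {
>>>>               return false; // nothing to create
>>>>           }
>>>>           if (isNewReplica) {
>>>>               // Transitioning from NewReplica to Online: create it on
>>>>               // any good log directory, even if some are offline.
>>>>               return !goodLogDirs.isEmpty();
>>>>           }
>>>>           // An existing replica that is missing locally may live on one
>>>>           // of the bad directories, so only create it if all log
>>>>           // directories are good.
>>>>           return goodLogDirs.size() == totalLogDirs;
>>>>       }
>>>>   }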
>>>> 
>>>> I am ok with not persisting the offline replicas in ZK and just
>>>> discovering them through the LeaderAndIsrRequest. It handles the cases
>>>> when a broker starts up with bad log directories better. So, the
>>>> additional overhead of rediscovering the offline replicas is justified.
>>>> 
>>>> 
>>>> Another high level question. The proposal rejected RAID5/6 since it adds
>>>> additional I/Os. The main issue with RAID5 is that to write a block that
>>>> doesn't match the RAID stripe size, we have to first read the old parity
>>>> to compute the new one, which increases the number of I/Os
>>>> (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you
>>>> have tested RAID5's performance by creating a file system whose block
>>>> size matches the RAID stripe size
>>>> (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/).
>>>> This way, writing a block doesn't require a read first. A large block
>>>> size may increase the amount of data written when the same block has to
>>>> be written to disk multiple times. However, this is probably ok in
>>>> Kafka's use case since we batch the I/O flush already. As you can see, we
>>>> will be adding some complexity to support JBOD in Kafka one way or
>>>> another. If we can tune the performance of RAID5 to match that of RAID10,
>>>> perhaps using RAID5 is a simpler solution.
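>>>> 
>>>> As a back-of-the-envelope illustration of the write penalty (just the
>>>> arithmetic behind the first link above, not a benchmark):
>>>> 
>>>>   // Rough I/O cost per logical block written on RAID5.
>>>>   public class Raid5WritePenalty {
>>>>       public static void main(String[] args) {
>>>>           // A write smaller than a full stripe must read the old data
>>>>           // block and the old parity, then write the new data and the
>>>>           // new parity: 2 reads + 2 writes per block.
>>>>           int ioPerSubStripeWrite = 4;
>>>> 
>>>>           // A full-stripe write computes parity from the new data
>>>>           // alone, so with N data disks it costs N data writes plus
>>>>           // 1 parity write.
>>>>           int dataDisks = 9; // e.g. 10 disks => 9 data + 1 parity
>>>>           double ioPerBlockFullStripe = (dataDisks + 1.0) / dataDisks;
>>>> 
>>>>           System.out.printf("sub-stripe write: %d I/Os per block%n",
>>>>               ioPerSubStripeWrite);
>>>>           System.out.printf("full-stripe write: %.2f I/Os per block%n",
>>>>               ioPerBlockFullStripe);
>>>>       }
>>>>   }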
>>>> 
>>>> Thanks,
>>>> 
>>>> Jun
>>>> 
>>>> 
>>>> On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:
>>>> 
>>>>> Hey Jun,
>>>>> 
>>>>> I don't think we should allow failed replicas to be re-created on the
>>>>> good disks. Say there are 2 disks and each of them is 51% loaded. If
>>>>> either disk fails, and we allow its replicas to be re-created on the
>>>>> other disk, both disks will fail. Alternatively we can disable replica
>>>>> creation if there is a bad disk on a broker. I personally think it is
>>>>> worth the additional complexity in the broker to store created replicas
>>>>> in ZK so that we allow new replicas to be created on the broker even
>>>>> when there is a bad log directory. This approach won't add complexity in
>>>>> the controller. But I am fine with disabling replica creation when there
>>>>> is a bad log directory if that is the only blocking issue for this KIP.
>>>>> 
>>>>> Whether we store created flags is independent of whether/how we store
>>>>> offline replicas. Per our previous discussion, do you think it is OK not
>>>>> to store offline replicas in ZK and to propagate the offline replicas
>>>>> from broker to controller via the LeaderAndIsrRequest?
>>>>> 
>>>>> Thanks,
>>>>> Dong
>>>>> 
>>>> 
>>> 
>> 
