Andrii,

1. I was wondering what happens if the controller fails over after step 4).
Since the ZK node is gone at that point, how does the new controller know
which replicas failed due to disk failures? Otherwise, the controller will
assume those replicas are alive again.

2. Just to clarify. In the proposal, those failed replicas will not be auto
repaired, and the affected partitions will just run in under-replicated
mode, right? To repair the failed replicas, the admin still needs to stop
the broker?
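For what it's worth, the restart workflow in Andrii's reply below can be modeled as a small simulation (hypothetical names only, not actual Kafka code) that illustrates why a duplicate LeaderAndIsr request after a controller failover between steps 3) and 4) is harmless, and also why the restart information is lost once step 4) completes:

```python
# Toy model of the proposed partition-restart protocol. All class and
# method names here are hypothetical stand-ins, not Kafka internals.

class Broker:
    def __init__(self):
        self.partitions = set()        # stands in for ReplicaManager/LogManager state
        self.restart_requests = set()  # stands in for the ZK notification node

    def on_disk_error(self, partition):
        # Steps 0)-1): drop the partition locally, then request a restart.
        self.partitions.discard(partition)
        self.restart_requests.add(partition)

    def handle_leader_and_isr(self, partition):
        # A LeaderAndIsr request re-creates the partition; a duplicate
        # (partition already present) is simply ignored.
        if partition in self.partitions:
            return "ignored"
        self.partitions.add(partition)
        return "restarted"

class Controller:
    def process(self, broker, crash_before_cleanup=False):
        # Steps 2)-3): pick up the requests and send LeaderAndIsr.
        for p in list(broker.restart_requests):
            broker.handle_leader_and_isr(p)
        if crash_before_cleanup:
            return  # failed between 3) and 4): the ZK node survives
        broker.restart_requests.clear()  # step 4): delete the ZK node

broker = Broker()
broker.on_disk_error("topic-0")
Controller().process(broker, crash_before_cleanup=True)  # dies after step 3)
Controller().process(broker)  # new controller resends; duplicate is ignored
```

Note that after the second `process` call the restart-request state is cleared, which is exactly the failover-after-step-4) gap raised in question 1 above: a controller elected after that point has nothing left to read.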

Thanks,

Jun



On Fri, Apr 10, 2015 at 10:29 AM, Andrii Biletskyi <
andrii.bilets...@stealth.ly> wrote:

> Todd, Jun,
>
> Thanks for comments.
>
> I agree we might want to address fair on-disk partition assignment as
> part of these changes. I'm open to suggestions; I didn't bring it up
> here because of the point Todd mentioned - there is still no clear
> understanding of who should be responsible for assignment - the broker
> or the controller.
>
> 1. Yes, the way the broker initiates a partition restart should be
> discussed. But I don't understand the problem with controller failover.
> The intended workflow is the following:
> 0) On error Broker removes partitions from ReplicaManager and LogManager
> 1) Broker creates zk node
> 2) Controller picks up, re-generates leaders and followers for partitions
> 3) Controller sends new LeaderAndIsr and UpdateMetadata to the cluster
> 4) Controller deletes zk node
> Now, if the controller fails between 3) and 4), yes, the broker will
> receive L&I requests twice, but the broker which requested the partition
> restart will "ignore" the second one, because the partition will already
> have been created at that point - while handling the "first" L&I request.
>
> 2. The main benefit, from my perspective: currently any file IO error
> means the broker halts, and you have to remove the disk and restart the
> broker. With this KIP, on an IO error we simply reject that single
> request (or whatever action the IO error occurred during), and the broker
> detects the affected partitions and silently restarts them, while
> normally handling other requests at the same time (as long as those don't
> touch the broken disk).
>
> 3. I agree, the lack of tools to perform such operational commands won't
> let us fully leverage the JBOD architecture. That's why I think we should
> design it in such a way that implementing those tools is simple. But
> before that, it'd be good to understand whether we are on the right path
> in general.
>
> Thanks,
> Andrii Biletskyi
>
> On Fri, Apr 10, 2015 at 6:56 PM, Jun Rao <j...@confluent.io> wrote:
>
> > Andrii,
> >
> > Thanks for writing up the proposal. A few thoughts on this.
> >
> > 1. Your proposal is to have the broker notify the controller about
> > failed replicas. We need to think through this a bit more. The
> > controller may fail later. During the controller failover, it needs to
> > be able to detect those failed replicas again. Otherwise, it may revert
> > some of the decisions that it has made earlier. In the current
> > proposal, it seems that the info about the failed replicas will be lost
> > during controller failover?
> >
> > 2. Overall, it's not very clear to me what benefit this proposal
> > provides. The proposal seems to detect failed disks and then just marks
> > the associated replicas as offline. How do we bring those replicas
> > online again? Do we have to stop the broker and either fix the failed
> > disk or remove it from the configured log dir? If so, there will still
> > be a down time of the broker. The changes in the proposal are
> > non-trivial. So, we need to be certain that we get some significant
> > benefits.
> >
> > 3. As Todd pointed out, it will be worth thinking through other issues
> > related to JBOD.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Apr 9, 2015 at 5:36 AM, Andrii Biletskyi <
> > andrii.bilets...@stealth.ly> wrote:
> >
> > > Hi,
> > >
> > > Let me start discussion thread for KIP-18 - JBOD Support.
> > >
> > > Link to wiki:
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support
> > >
> > >
> > > Thanks,
> > > Andrii Biletskyi
> > >
> >
>
