Andrii,

2. So the idea is to immediately start re-replicating the replicas on the
failed directory to the other directories? An IOException can be caused by
a disk running out of space. In that case, perhaps an admin would rather
bring down the broker, free up some disk space and restart the broker?
That approach involves less data movement.
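
For illustration, here is a rough Scala sketch of how a broker might tell
the disk-full case apart from an actual disk failure when it catches the
IOException; all names and the 64 MB threshold are made-up assumptions,
not part of the KIP:

    import java.io.File

    // Illustrative only: "disk full" suggests freeing space and
    // restarting, while "disk failed" suggests re-replicating the
    // replicas to other directories.
    object DiskErrorClassifier {
      sealed trait DiskError
      case object DiskFull   extends DiskError
      case object DiskFailed extends DiskError

      def classify(logDir: File): DiskError =
        if (!logDir.exists || !logDir.canWrite)
          DiskFailed                        // mount gone or unwritable
        else if (logDir.getUsableSpace < 64L * 1024 * 1024)
          DiskFull                          // < 64 MB left: out of space
        else
          DiskFailed                        // IO error despite free space
    }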

Thanks,

Jun

On Sun, Apr 12, 2015 at 2:59 PM, Andrii Biletskyi <
andrii.bilets...@stealth.ly> wrote:

> Jun
>
> 1. Hm, it looks like I didn't take this case into account in the KIP.
> I see your point. Why don't we do the same thing as with partition
> reassignment - let's set up a new (or reuse the existing)
> ReassignedPartitionsIsrChangeListener that will check whether the brokers
> that requested a partition restart have caught up (are back in the ISR)
> and update the zk node /restart_partitions to remove the replicas that
> are no longer relevant. This would replace step 4) below - the controller
> deleting the zk node.
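>
> (A rough sketch of that listener's core logic - purely illustrative; the
> callback shape, the node format and all names below are made-up
> assumptions, not actual Kafka code:)
>
>     case class TopicAndPartition(topic: String, partition: Int)
>
>     // Hypothetical listener body: on every ISR change, drop the
>     // /restart_partitions entries whose replica is back in sync, and
>     // delete the node once nothing is left (replacing step 4 below).
>     class RestartPartitionsPruner(
>         read: () => Set[(TopicAndPartition, Int)],    // parse zk node
>         write: Set[(TopicAndPartition, Int)] => Unit, // rewrite zk node
>         delete: () => Unit) {                         // remove zk node
>
>       def onIsrChange(tp: TopicAndPartition, isr: Set[Int]): Unit = {
>         val pending = read()
>         val remaining = pending.filterNot {
>           case (p, replica) => p == tp && isr.contains(replica)
>         }
>         if (remaining.isEmpty) delete()
>         else if (remaining != pending) write(remaining)
>       }
>     }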
>
> 2. No, the intent _is_ actually to make replicas auto-repair.
> There are two parts to it: 1) catch IO exceptions, so that the whole
> broker doesn't crash; 2) request a partition restart on IO errors - i.e.
> re-fetch the lost partitions through the mechanism described in the KIP.
> At least, I believe, this is our goal; otherwise there are no benefits.
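>
> (Again purely illustrative - a sketch of those two parts wired together;
> requestRestart would create the zk node, and every name here is an
> assumption rather than actual Kafka code:)
>
>     import java.io.IOException
>
>     // Wrap disk-touching work so an IOException fails only the current
>     // request and triggers a partition-restart request instead of
>     // crashing the whole broker.
>     class IoErrorHandler(requestRestart: Set[Int] => Unit,
>                          partitionsOnDir: String => Set[Int]) {
>
>       def protect[T](logDir: String)(work: => T): Either[IOException, T] =
>         try Right(work)
>         catch {
>           case e: IOException =>
>             // part 1: don't crash - fail only this request
>             // part 2: ask for a restart of the partitions on the bad dir
>             requestRestart(partitionsOnDir(logDir))
>             Left(e)
>         }
>     }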
>
> Thanks,
> Andrii Biletskyi
>
>
> On Sat, Apr 11, 2015 at 2:20 AM, Jun Rao <j...@confluent.io> wrote:
>
> > Andrii,
> >
> > 1. I was wondering what happens if the controller fails over after step
> > 4). Since the ZK node is gone, how does the new controller know which
> > replicas failed due to disk failures? Without that information, the
> > controller will assume those replicas are alive again.
> >
> > 2. Just to clarify: in the proposal, those failed replicas will not be
> > auto-repaired and the affected partitions will just keep running in
> > under-replicated mode, right? To repair the failed replicas, the admin
> > still needs to stop the broker?
> >
> > Thanks,
> >
> > Jun
> >
> >
> >
> > On Fri, Apr 10, 2015 at 10:29 AM, Andrii Biletskyi <
> > andrii.bilets...@stealth.ly> wrote:
> >
> > > Todd, Jun,
> > >
> > > Thanks for the comments.
> > >
> > > I agree we might want to change the "fair" on-disk partition
> > > assignment as part of these changes. I'm open to suggestions; I
> > > didn't bring it up here because of the point Todd mentioned - there
> > > is still no clear understanding of who should be responsible for the
> > > assignment, the broker or the controller.
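> > >
> > > (One possible broker-side policy, just as a strawman with made-up
> > > names: pick the log dir holding the fewest partitions, breaking ties
> > > by free space, instead of plain round-robin:)
> > >
> > >     import java.io.File
> > >
> > >     object LogDirSelector {
> > >       // Fewest partitions first; more usable space wins a tie.
> > >       def pick(logDirs: Seq[File],
> > >                partitionCount: File => Int): File =
> > >         logDirs.minBy(d => (partitionCount(d), -d.getUsableSpace))
> > >     }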
> > >
> > > 1. Yes, the way the broker initiates a partition restart should be
> > > discussed. But I don't understand the problem with controller
> > > failover. The intended workflow is the following:
> > > 0) On error, the broker removes the affected partitions from
> > > ReplicaManager and LogManager
> > > 1) The broker creates the zk node
> > > 2) The controller picks it up and re-generates leaders and followers
> > > for the partitions
> > > 3) The controller sends new LeaderAndIsr and UpdateMetadata requests
> > > to the cluster
> > > 4) The controller deletes the zk node
> > > Now, if the controller fails between 3) and 4), yes, the LeaderAndIsr
> > > requests will be sent twice, but the broker that requested the
> > > partition restart will "ignore" the second one, because the
> > > partitions will already have been re-created while handling the
> > > "first" LeaderAndIsr request (sketched below).
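> > >
> > > (Sketch of why the duplicate is harmless - illustrative only; all
> > > names and the epoch-based check are assumptions, not Kafka code:)
> > >
> > >     import scala.collection.mutable
> > >
> > >     class RestartingReplicaManager {
> > >       private val localEpochs = mutable.Map.empty[(String, Int), Int]
> > >
> > >       // Returns true if the request actually (re)created the replica.
> > >       def makeFollower(topic: String, partition: Int,
> > >                        epoch: Int): Boolean =
> > >         localEpochs.get((topic, partition)) match {
> > >           case Some(e) if e >= epoch =>
> > >             false      // duplicate after controller failover: ignore
> > >           case _ =>
> > >             localEpochs((topic, partition)) = epoch
> > >             true       // first request after restart: re-create
> > >         }
> > >     }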
> > >
> > > 2. The main benefit, from my perspective: currently any file IO error
> > > halts the broker - you have to remove the disk and restart the
> > > broker. With this KIP, on an IO error we simply fail that single
> > > request (or whatever action the IO error occurred during); the broker
> > > detects the affected partitions and silently restarts them, while
> > > continuing to handle other requests normally (as long as those don't
> > > touch the broken disk).
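> > >
> > > ("Detects the affected partitions" could be as simple as this sketch -
> > > hypothetical names, assuming the usual <logDir>/<topic>-<partition>
> > > directory layout:)
> > >
> > >     import java.io.File
> > >
> > >     object AffectedPartitions {
> > >       // Partition dirs whose parent is the log dir that just threw.
> > >       def onDir(partitionDirs: Seq[File],
> > >                 failedLogDir: File): Seq[String] =
> > >         partitionDirs
> > >           .filter(_.getParentFile.getAbsoluteFile ==
> > >                   failedLogDir.getAbsoluteFile)
> > >           .map(_.getName)            // e.g. "mytopic-3"
> > >     }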
> > >
> > > 3. I agree, the lack of tools to perform such operational commands
> > > won't let us fully leverage a JBOD architecture. That's why I think
> > > we should design this in such a way that implementing those tools is
> > > simple. But before that, it'd be good to understand whether we are on
> > > the right path in general.
> > >
> > > Thanks,
> > > Andrii Biletskyi
> > >
> > > On Fri, Apr 10, 2015 at 6:56 PM, Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Andrii,
> > > >
> > > > Thanks for writing up the proposal. A few thoughts on this.
> > > >
> > > > 1. Your proposal is to have the broker notify the controller about
> > > > failed replicas. We need to think through this a bit more. The
> > > > controller may fail later. During the controller failover, the new
> > > > controller needs to be able to detect those failed replicas again;
> > > > otherwise, it may revert some of the decisions made earlier. In the
> > > > current proposal, it seems that the info about the failed replicas
> > > > will be lost during a controller failover?
> > > >
> > > > 2. Overall, it's not very clear to me what benefit this proposal
> > > > provides. The proposal seems to detect failed disks and then just
> > > > mark the associated replicas as offline. How do we bring those
> > > > replicas back online? Do we have to stop the broker and either fix
> > > > the failed disk or remove it from the configured log dirs? If so,
> > > > there will still be downtime for the broker. The changes in the
> > > > proposal are non-trivial, so we need to be certain that we get some
> > > > significant benefits.
> > > >
> > > > 3. As Todd pointed out, it will be worth thinking through other
> > > > issues related to JBOD.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Apr 9, 2015 at 5:36 AM, Andrii Biletskyi <
> > > > andrii.bilets...@stealth.ly> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Let me start the discussion thread for KIP-18 - JBOD Support.
> > > > >
> > > > > Link to wiki:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Andrii Biletskyi
> > > > >
> > > >
> > >
> >
>
