Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-04-26 Thread Dong Lin
Hey Jun, Ismael,

Thanks for all the reviews! Could you vote for KIP-112 if you are OK with the
latest design doc?

Thanks,
Dong


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-30 Thread Dong Lin
Hi all,

Thanks for all the comments. I am going to open the voting thread if there
are no further concerns about the KIP.

Dong


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-15 Thread Ismael Juma
Thanks for the updates Dong, they look good to me.

Ismael


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-15 Thread Dong Lin
Hey Ismael,

Sure, I have updated the "Changes in Operational Procedures" section in KIP-113
to specify the problem and solution for known disk failures. I also updated
the "Test Plan" section to note that we have a test in KIP-113 to verify that
replicas already created on the good log directories will not be affected
by failure of other log directories.

Please let me know if there is any other improvement I can make. Thanks for
your comment.

Dong



Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-15 Thread Ismael Juma
Hi Dong,

Yes, that sounds good to me. I'd list option 2 first since that is safe
and, as you said, no worse than what happens today. The file approach is a
bit hacky as you said, so it may be a bit fragile. Not sure if we really
want to mention that. :)

About the note in KIP-112 versus adding the test in KIP-113, I think it
would make sense to add a short sentence stating that this scenario is
covered in KIP-113. People won't necessarily read both KIPs at the same
time and it's helpful to cross-reference when it makes sense.

Thanks for your work on this.

Ismael


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-14 Thread Dong Lin
Hey Ismael,

I get your concern that it is more likely for a disk to be slow, or to exhibit
other forms of non-fatal symptoms, after some known fatal error. It is then
weird for the user to start the broker with the likely-problematic disk still
in the broker config. In that case, I think there are two things the user can do:

1) Intentionally change the log directory in the config to point to a file.
This is a bit hacky, but it works until we make a more appropriate
long-term change in Kafka to handle this case.
2) Just don't start the broker with bad log directories. Always fix the disk
before restarting the broker. This is a safe approach that is no worse than
current practice (a minimal pre-start check along these lines is sketched below).
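
For illustration only (this is an operator-side helper, not part of Kafka or of this KIP), here is a minimal sketch of the pre-start check referenced in option 2. It assumes the broker's server.properties path is passed as the first argument and that log.dirs (falling back to log.dir) holds a comma-separated list of directories; an entry that is missing, not a directory (for example, deliberately pointed at a file as in option 1), or not writable is reported so the operator can fix the disk or adjust the config before restarting.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical operator-side helper: checks every configured log directory
// before the broker is restarted. Not part of Kafka or of KIP-112.
public class LogDirPreStartCheck {

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: LogDirPreStartCheck <path-to-server.properties>");
            return;
        }
        Properties brokerConfig = new Properties();
        try (FileInputStream in = new FileInputStream(args[0])) {
            brokerConfig.load(in);
        }
        String logDirs = brokerConfig.getProperty("log.dirs",
                brokerConfig.getProperty("log.dir", "/tmp/kafka-logs"));

        boolean allGood = true;
        for (String dir : logDirs.split(",")) {
            Path path = Paths.get(dir.trim());
            if (!Files.isDirectory(path) || !Files.isWritable(path)) {
                // Either the disk is still bad, or the entry was deliberately
                // pointed at a file (option 1 above), so the broker will treat
                // it as a bad log directory on startup.
                System.out.println("NOT USABLE: " + path);
                allGood = false;
            } else {
                System.out.println("ok:         " + path);
            }
        }
        System.exit(allGood ? 0 : 1);
    }
}
```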

Would this address your concern if I specify the problem and the two
solutions in the KIP?

Thanks,
Dong


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-14 Thread Dong Lin
Hey Ismael,

Thanks for the comment. Please see my reply below.

On Tue, Mar 14, 2017 at 10:31 AM, Ismael Juma  wrote:

> Thanks Dong. Comments inline.
>
> On Fri, Mar 10, 2017 at 6:25 PM, Dong Lin  wrote:
> >
> > I get your point. But I am not sure we should recommend user to simply
> > remove disk from the broker config. If user simply does this without
> > checking the utilization of good disks, replica on the bad disk will be
> > re-created on the good disk and may overload the good disks, causing
> > cascading failure.
> >
>
> Good point.
>
>
> >
> > I agree with you and Colin that slow disk may cause problem. However,
> > performance degradation due to slow disk is an existing problem that
> > is not detected or handled by Kafka or KIP-112.
>
>
> I think an important difference is that a number of disk errors are
> currently fatal and won't be after KIP-112. So it introduces new scenarios
> (for example, bouncing a broker that is working fine although some disks
> have been marked bad).
>

Hmm.. I am still trying to understand why KIP-112 creates new scenarios.
A slow disk is not considered a fatal error and won't be caught by either
the existing Kafka design or this KIP. If any disk is marked bad, it means
the broker encountered an IOException when accessing the disk; most likely
the broker will encounter an IOException again when accessing this disk and
will mark it as bad after the bounce. I guess you are talking about the case
where a disk is marked bad, the broker is bounced, and then the disk provides
degraded performance without being marked bad, right? But this seems to be an
existing problem we already have today with slow disks.

Here are the possible scenarios with a bad disk after a broker bounce:

1) bad disk -> broker bounce -> good disk. This would be great.
2) bad disk -> broker bounce -> slow disk. A slow disk is an existing problem
that is not addressed by Kafka today.
3) bad disk -> broker bounce -> bad disk. This is handled by this KIP such
that only the replicas on the bad disk go offline (a minimal sketch of this
handling follows below).
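
To make the IOException-driven handling in case 3 concrete, here is a minimal, self-contained sketch of the idea discussed above — not Kafka's actual implementation: the first IOException on a log directory marks that directory offline, further writes to replicas on it are rejected, and replicas on other directories keep working. All class and method names are illustrative only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the behaviour described in the discussion: an IOException on
// one log directory takes only that directory (and its replicas) offline.
public class OfflineLogDirTracker {

    private final Set<Path> offlineLogDirs = ConcurrentHashMap.newKeySet();

    /** Appends bytes to a segment file that lives under the given log directory. */
    public void append(Path logDir, Path segmentFile, byte[] records) throws IOException {
        if (offlineLogDirs.contains(logDir)) {
            throw new IOException("Log directory is offline: " + logDir);
        }
        try {
            Files.write(segmentFile, records,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            // First I/O failure on this directory: mark it offline so that only
            // the replicas stored on it are affected; other directories keep serving.
            offlineLogDirs.add(logDir);
            throw e;
        }
    }

    public boolean isOffline(Path logDir) {
        return offlineLogDirs.contains(logDir);
    }
}
```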


>
> > Detection and handling of
> > slow disk is a separate problem that needs to be addressed in a future
> KIP.
> > It is currently listed in the future work. Does this sound OK?
> >
>
> I'm OK with it being handled in the future. In the meantime, I was just
> hoping that we can make it clear to users about the potential issue of a
> disk marked as bad becoming good again after a bounce (which can be
> dangerous).
>
> The main benefit of creating the second topic after log directory goes
> > offline is that we can make sure the second topic is created on the good
> > log directory. I am not sure we can simply assume that the first topic
> will
> > always be created on the first log directory in the broker config and the
> > second topic will be created on the second log directory in the broker
> > config.
>
>
>
> > However, I can add this test in KIP-113 which allows user to
> > re-assign replica to specific log directory of a broker. Is this OK?
> >
>
> OK. Please add a note to KIP-112 about this as well (so that it's clear why
> we only do it in KIP-113).
>

Sure. Instead of adding a note to KIP-112, I have added a test in KIP-113 to
verify that bad log directories discovered during runtime will not affect
replicas on the good log directories. Does this address the problem?


> Ismael
>


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-14 Thread Ismael Juma
Thanks Dong. Comments inline.

On Fri, Mar 10, 2017 at 6:25 PM, Dong Lin  wrote:
>
> I get your point. But I am not sure we should recommend user to simply
> remove disk from the broker config. If user simply does this without
> checking the utilization of good disks, replica on the bad disk will be
> re-created on the good disk and may overload the good disks, causing
> cascading failure.
>

Good point.


>
> I agree with you and Colin that slow disk may cause problem. However,
> performance degradation due to slow disk is an existing problem that
> is not detected or handled by Kafka or KIP-112.


I think an important difference is that a number of disk errors are
currently fatal and won't be after KIP-112. So it introduces new scenarios
(for example, bouncing a broker that is working fine although some disks
have been marked bad).


> Detection and handling of
> slow disk is a separate problem that needs to be addressed in a future KIP.
> It is currently listed in the future work. Does this sound OK?
>

I'm OK with it being handled in the future. In the meantime, I was just
hoping that we can make it clear to users about the potential issue of a
disk marked as bad becoming good again after a bounce (which can be
dangerous).

The main benefit of creating the second topic after log directory goes
> offline is that we can make sure the second topic is created on the good
> log directory. I am not sure we can simply assume that the first topic will
> always be created on the first log directory in the broker config and the
> second topic will be created on the second log directory in the broker
> config.



> However, I can add this test in KIP-113 which allows user to
> re-assign replica to specific log directory of a broker. Is this OK?
>

OK. Please add a note to KIP-112 about this as well (so that it's clear why
we only do it in KIP-113).

Ismael


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-10 Thread Dong Lin
Hey Ismael,

Thanks for your comments. Please see my reply below.

On Fri, Mar 10, 2017 at 9:12 AM, Ismael Juma  wrote:

> Hi Dong,
>
> Thanks for the updates, they look good. A couple of comments below.
>
> On Tue, Mar 7, 2017 at 7:30 PM, Dong Lin  wrote:
> >
> > >
> > > 3. Another point regarding operational procedures, with a large enough
> > > cluster, disk failures may not be that uncommon. It may be worth
> > explaining
> > > the recommended procedure if someone needs to do a rolling bounce of a
> > > cluster with some bad disks. One option is to simply do the bounce and
> > hope
> > > that the bad disks are detected during restart, but we know that this
> is
> > > not guaranteed to happen immediately. A better option may be to remove
> > the
> > > bad log dirs from the broker config until the disk is replaced.
> > >
> >
> > I am not sure if I understand your suggestion here. I think user doesn't
> > need to differentiate between log directory failure during rolling bounce
> > and log directory failure during runtime. All they need to do is to
> detect
> > and handle log directory failure specified above. And they don't have to
> > remove the bad log directory immediately from broker config. The only
> > drawback of keeping log directory there is that a new replica may not be
> > created on the broker. But the chance of that happening is really low,
> > since the controller has to fail in a small window after user initiated
> the
> > topic creation but before it sends LeaderAndIsrRequest with
> > is_new_replica=true to the broker. In practice this shouldn't matter.
> >
>
>  Let me try to clarify what I mean. The document states that a broker
> assumes that a log directory is good if it can read from it when it starts.
> So, bouncing a broker with a bad disk without doing anything is a bit
> dangerous because it may be considered good again and cause issues due to
> slow performance, for example. As Colin pointed out, this is not uncommon.
> So, perhaps we should state that it is safer to remove the bad log dir from
> the broker config if a bounce is required before the disk is fixed. Does
> that make sense?
>


I get your point. But I am not sure we should recommend that users simply
remove the disk from the broker config. If a user does this without
checking the utilization of the good disks, replicas on the bad disk will be
re-created on the good disks and may overload them, causing
cascading failure.
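
As an illustration of what "checking the utilization of the good disks" could look like, here is a small operator-side sketch (not part of Kafka or of this KIP) that estimates whether the remaining log directories have enough free space to absorb the data currently on the bad one. The directory paths are placeholders, and walking the bad directory may itself fail if the disk is already dead.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper: before removing a bad log dir from log.dirs, estimate
// whether the good dirs can absorb the replicas that will be re-created.
public class HeadroomCheck {

    /** Total bytes currently stored under a directory tree. */
    static long usedBytes(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        Path badDir = Paths.get("/data/disk3/kafka-logs");            // placeholder
        List<Path> goodDirs = Arrays.asList(
                Paths.get("/data/disk1/kafka-logs"),                  // placeholder
                Paths.get("/data/disk2/kafka-logs"));                 // placeholder

        long toMove = usedBytes(badDir);   // may itself fail if the disk is truly dead
        long freeOnGood = 0;
        for (Path dir : goodDirs) {
            FileStore store = Files.getFileStore(dir);
            freeOnGood += store.getUsableSpace();
        }

        System.out.printf("bytes on bad dir: %,d, free on good dirs: %,d%n",
                toMove, freeOnGood);
        if (toMove > freeOnGood * 0.8) {   // keep some safety margin
            System.out.println("Removing the bad dir from log.dirs risks overloading the good disks.");
        } else {
            System.out.println("Good dirs appear to have enough headroom.");
        }
    }
}
```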

I agree with you and Colin that a slow disk may cause problems. However,
performance degradation due to a slow disk is an existing problem that
is not detected or handled by Kafka or KIP-112. Detection and handling of
slow disks is a separate problem that needs to be addressed in a future KIP.
It is currently listed in the future work. Does this sound OK?


>
> Sure. I have updated the test description to specify that each broker will
> > have two log directories.
> >
> > The existing test case will actually create 2 topics to validate that
> > failed log directory won't affect the good ones. You can find them after
> > "Now validate that the previous leader can still serve replicas on the
> good
> > log directories" and "Now validate that the follower can still serve
> > replicas on the good log directories".
>
>
> The current plan suggests creating a second topic after the log directory
> has been marked as bad via the permission change. I am suggesting that we
> should ideally have more than one topic (or partition) before the log
> directory is marked as bad. Both cases are important and should be tested,
> in my opinion.
>

It is simpler to have multiple topics of one partition each instead of one
topic with multiple partitions. This is because, in the latter case, some
partition of the topic may be offline, and we cannot simply consume from the
topic to validate that the partitions on the good disks can still be consumed.

The main benefit of creating the second topic after the log directory goes
offline is that we can make sure the second topic is created on a good log
directory. I am not sure we can simply assume that the first topic will
always be created on the first log directory in the broker config and the
second topic on the second log directory. However, I can add this test in
KIP-113, which allows the user to re-assign a replica to a specific log
directory of a broker. Is this OK?
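
For illustration, here is a rough sketch of the kind of test step described above, written against the AdminClient methods that eventually shipped with KIP-113 (alterReplicaLogDirs); those APIs did not exist at the time of this discussion, and the bootstrap server, topic names, broker id, and log directory paths are placeholders.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.TopicPartitionReplica;

// Sketch: create two single-partition topics and pin each replica to a
// specific log directory of broker 0, so the test controls which topic
// lands on the directory that will later be marked bad.
public class PinReplicasToLogDirs {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            admin.createTopics(Arrays.asList(
                    new NewTopic("jbod-test-1", 1, (short) 1),
                    new NewTopic("jbod-test-2", 1, (short) 1))).all().get();

            Map<TopicPartitionReplica, String> assignment = new HashMap<>();
            assignment.put(new TopicPartitionReplica("jbod-test-1", 0, 0),
                    "/data/disk1/kafka-logs");   // directory that stays healthy
            assignment.put(new TopicPartitionReplica("jbod-test-2", 0, 0),
                    "/data/disk2/kafka-logs");   // directory the test will break

            admin.alterReplicaLogDirs(assignment).all().get();
            // The test can now make /data/disk2 unwritable and verify that
            // jbod-test-1 is still fully readable and writable.
        }
    }
}
```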



>
> Ismael
>


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-10 Thread Ismael Juma
Hi Dong,

Thanks for the updates, they look good. A couple of comments below.

On Tue, Mar 7, 2017 at 7:30 PM, Dong Lin  wrote:
>
> >
> > 3. Another point regarding operational procedures, with a large enough
> > cluster, disk failures may not be that uncommon. It may be worth
> explaining
> > the recommended procedure if someone needs to do a rolling bounce of a
> > cluster with some bad disks. One option is to simply do the bounce and
> hope
> > that the bad disks are detected during restart, but we know that this is
> > not guaranteed to happen immediately. A better option may be to remove
> the
> > bad log dirs from the broker config until the disk is replaced.
> >
>
> I am not sure if I understand your suggestion here. I think user doesn't
> need to differentiate between log directory failure during rolling bounce
> and log directory failure during runtime. All they need to do is to detect
> and handle log directory failure specified above. And they don't have to
> remove the bad log directory immediately from broker config. The only
> drawback of keeping log directory there is that a new replica may not be
> created on the broker. But the chance of that happening is really low,
> since the controller has to fail in a small window after user initiated the
> topic creation but before it sends LeaderAndIsrRequest with
> is_new_replica=true to the broker. In practice this shouldn't matter.
>

 Let me try to clarify what I mean. The document states that a broker
assumes that a log directory is good if it can read from it when it starts.
So, bouncing a broker with a bad disk without doing anything is a bit
dangerous because it may be considered good again and cause issues due to
slow performance, for example. As Colin pointed out, this is not uncommon.
So, perhaps we should state that it is safer to remove the bad log dir from
the broker config if a bounce is required before the disk is fixed. Does
that make sense?

Sure. I have updated the test description to specify that each broker will
> have two log directories.
>
> The existing test case will actually create 2 topics to validate that
> failed log directory won't affect the good ones. You can find them after
> "Now validate that the previous leader can still serve replicas on the good
> log directories" and "Now validate that the follower can still serve
> replicas on the good log directories".


The current plan suggests creating a second topic after the log directory
has been marked as bad via the permission change. I am suggesting that we
should ideally have more than one topic (or partition) before the log
directory is marked as bad. Both cases are important and should be tested,
in my opinion.

Ismael


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-07 Thread Dong Lin
Hey Becket,

Thanks for the review.

1. I have thought about this before. I think it is fine to delete the node
after the controller reads it. On controller failover, the new controller will
always send LeaderAndIsrRequest for all partitions to each broker in order
to learn about offline replicas.

2. This field is not necessary now because we currently only use this znode
for the LogDirFailure event. But I envision that it may be useful in the
future, e.g. for fixing a log directory without having to reboot the broker.
I have updated the KIP to specify that we use 1 to indicate the LogDirFailure
event. I think this field makes the znode more general and extensible, but I
am OK with removing it from the znode for now and adding it in the future.
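
For illustration, here is a minimal sketch of how a controller could consume these notification znodes, following the process-first, delete-last ordering Becket suggested. The znode path, the JSON layout, and the Event value of 1 for LogDirFailure follow the discussion above, but the exact format is defined in the KIP and the processing step is left as comments.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

// Sketch of controller-side handling of log-dir failure notifications.
// Assumes sequential children under /log_dir_event_notification whose data
// is a small JSON blob containing the broker id and an Event field
// (1 = LogDirFailure). Not the actual controller code.
public class LogDirEventNotificationHandler {

    private static final String NOTIFICATION_PATH = "/log_dir_event_notification";

    public static void handlePendingNotifications(ZooKeeper zk) throws Exception {
        List<String> children = zk.getChildren(NOTIFICATION_PATH, false);
        for (String child : children) {
            String path = NOTIFICATION_PATH + "/" + child;
            byte[] data = zk.getData(path, false, null);
            String json = new String(data, StandardCharsets.UTF_8);

            // e.g. {"version":1,"broker":1,"event":1} where event 1 = LogDirFailure.
            // 1. Parse the broker id and event type from the payload.
            // 2. Send LeaderAndIsrRequest to that broker to discover which of its
            //    replicas are now offline, and update partition state accordingly.
            System.out.println("Processing log dir event: " + json);

            // Delete only after processing, so that if this controller dies
            // mid-way, the next controller still sees the notification.
            // (As noted above, deleting earlier is also tolerable because a new
            // controller re-learns offline replicas via LeaderAndIsrRequest.)
            zk.delete(path, -1);
        }
    }
}
```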

Thanks,
Dong






Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-07 Thread Becket Qin
Hi Dong,

Thanks for the KIP, a few more comments:

1. In the KIP wiki section "A log directory stops working on a broker
during runtime", the controller deletes the notification node right after
it reads the znode. It seems safer to do this last, so that even if the
controller fails, the new controller will still see the notification.
2. In the notification znode we have an Event field as an integer. Can we
document what the value of LogDirFailure is? And are there any other
possible values?

Thanks,

Jiangjie (Becket) Qin



Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-07 Thread Dong Lin
Hey Ismael,

Thanks much for taking time to review the KIP and read through all the
discussion!

Please see my reply inline.

On Tue, Mar 7, 2017 at 9:47 AM, Ismael Juma  wrote:

> Hi Dong,
>
> It took me a while, but I finally went through the whole thread. I have a
> few minor comments:
>
> 1. Regarding the metrics, can we include the full name (e.g.
> kafka.cluster:type=Partition,name=InSyncReplicasCount,
> topic={topic},partition={partition} was defined in KIP-96)?
>
Certainly. I have updated the KIP to specify the full name.


> 2. We talk about changes in operational procedures for people switching
> from RAID to JBOD, but what about people who are already using JBOD? Since
> disk failures won't necessarily cause broker failures, some adjustments may
> be needed.
>

Good point. I indeed missed one operational procedure for both existing
RAID and JBOD users. I have updated the KIP to specify the following:

The administrator will need to detect log directory failure by looking at
OfflineLogDirectoriesCount. After a log directory failure is detected, the
administrator needs to fix the disks and reboot the broker.
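
For illustration, here is a minimal sketch of how an administrator could poll this metric over JMX. It assumes the broker exposes JMX on port 9999 and that the gauge is registered under kafka.log:type=LogManager,name=OfflineLogDirectoriesCount with a Value attribute, matching the name used in this thread; the exact MBean path should be taken from the KIP document.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: poll the broker's offline-log-directories gauge so an operator can
// alert when it becomes non-zero, then fix the disks and reboot the broker.
public class OfflineLogDirMonitor {

    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi"); // placeholder host/port
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName gauge = new ObjectName(
                    "kafka.log:type=LogManager,name=OfflineLogDirectoriesCount"); // name as used in this thread
            Number offline = (Number) conn.getAttribute(gauge, "Value");
            if (offline.intValue() > 0) {
                System.out.println("ALERT: " + offline + " offline log directories; "
                        + "fix the disks and restart the broker.");
            } else {
                System.out.println("All log directories are online.");
            }
        } finally {
            connector.close();
        }
    }
}
```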


>
> 3. Another point regarding operational procedures, with a large enough
> cluster, disk failures may not be that uncommon. It may be worth explaining
> the recommended procedure if someone needs to do a rolling bounce of a
> cluster with some bad disks. One option is to simply do the bounce and hope
> that the bad disks are detected during restart, but we know that this is
> not guaranteed to happen immediately. A better option may be to remove the
> bad log dirs from the broker config until the disk is replaced.
>

I am not sure I understand your suggestion here. I think the user doesn't
need to differentiate between a log directory failure during a rolling bounce
and a log directory failure during runtime. All they need to do is detect
and handle log directory failure as specified above. And they don't have to
remove the bad log directory from the broker config immediately. The only
drawback of keeping the log directory there is that a new replica may not be
created on the broker. But the chance of that happening is really low,
since the controller has to fail in the small window after the user initiates
topic creation but before it sends LeaderAndIsrRequest with
is_new_replica=true to the broker. In practice this shouldn't matter.


>
> 4. The test plan doesn't mention the number of log directories per broker.
> It could be good to specify this. Also, we seem to create one topic with
> one partition, which means that only one log directory will be populated.
> It seems like we should have partitions in more than one log directory to
> verify that the failed log directory doesn't affect the ones that are still
> good.
>

Sure. I have updated the test description to specify that each broker will
have two log directories.

The existing test case will actually create two topics to validate that a
failed log directory won't affect the good ones. You can find them after
"Now validate that the previous leader can still serve replicas on the good
log directories" and "Now validate that the follower can still serve
replicas on the good log directories".


>
> 5. In the protocol definition, we have isNewReplica, but it should probably
> be is_new_replica.
>

Good point. My bad. It is fixed now.


>
> Thanks,
> Ismael
>

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-07 Thread Ismael Juma
Hi Dong,

It took me a while, but I finally went through the whole thread. I have a
few minor comments:

1. Regarding the metrics, can we include the full name (e.g.
kafka.cluster:type=Partition,name=InSyncReplicasCount,topic={topic},partition={partition},
which was defined in KIP-96)?

2. We talk about changes in operational procedures for people switching
from RAID to JBOD, but what about people who are already using JBOD? Since
disk failures won't necessarily cause broker failures, some adjustments may
be needed.

3. Another point regarding operational procedures, with a large enough
cluster, disk failures may not be that uncommon. It may be worth explaining
the recommended procedure if someone needs to do a rolling bounce of a
cluster with some bad disks. One option is to simply do the bounce and hope
that the bad disks are detected during restart, but we know that this is
not guaranteed to happen immediately. A better option may be to remove the
bad log dirs from the broker config until the disk is replaced.

4. The test plan doesn't mention the number of log directories per broker.
It could be good to specify this. Also, we seem to create one topic with
one partition, which means that only one log directory will be populated.
It seems like we should have partitions in more than one log directory to
verify that the failed log directory doesn't affect the ones that are still
good.

5. In the protocol definition, we have isNewReplica, but it should probably
be is_new_replica.

Thanks,
Ismael


On Thu, Jan 12, 2017 at 6:46 PM, Dong Lin  wrote:

> Hi all,
>
> We created KIP-112: Handle disk failure for JBOD. Please find the KIP wiki
> in the link https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 112%3A+Handle+disk+failure+for+JBOD.
>
> This KIP is related to KIP-113
>  113%3A+Support+replicas+movement+between+log+directories>:
> Support replicas movement between log directories. They are needed in order
> to support JBOD in Kafka. Please help review the KIP. Your feedback is
> appreciated!
>
> Thanks,
> Dong
>


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-05 Thread Dong Lin
Hey Jun,

I am happy to spend a few days on this if that is what it takes to discuss
KIP-113. But if it takes 2+ weeks to discuss KIP-113, I am wondering if we can
vote for KIP-112 first. We aim to start the JBOD test in a test cluster by the
end of Q1, and there are only three weeks left until then. If we miss the
schedule, we may have to do the one-broker-per-disk setup first, which causes
unnecessary trouble when migrating to JBOD later. It will greatly help me meet
this schedule if we can vote for KIP-112 and get the patch reviewed while
KIP-113 is under discussion.

It seems reasonable to vote for KIP-112 before KIP-113. The reason is that
we have agreed on the motivation of both KIPs and we already have a
compelling case for JBOD. What we need to discuss for KIP-113 is its
implementation details rather than its motivation. Whatever implementation we
end up with for KIP-113 should be implemented on top of KIP-112 without
backward-incompatible changes.

I understand that you are busy reviewing many things, so I would prefer
not to ask you for a quick response on KIP-113. If KIP-112 is blocked on
KIP-113, I would be under pressure to bother you repeatedly.
Allowing me to work on the KIP-112 implementation would make our schedule
easier :)

Thank you for all the time you have spent reviewing these KIPs!

Dong



On Thu, Mar 2, 2017 at 9:36 AM, Jun Rao  wrote:

> Hi, Dong,
>
> Ok. We can keep LeaderAndIsrRequest as it is in the wiki.
>
> Since we need both KIP-112 and KIP-113 to make a compelling case for JBOD,
> perhaps we should discuss KIP-113 before voting for both? I left some
> comments in the other thread.
>
> Thanks,
>
> Jun
>
> On Wed, Mar 1, 2017 at 1:58 PM, Dong Lin  wrote:
>
> > Hey Jun,
> >
> > Do you think it is OK to keep the existing wire protocol in the KIP? I am
> > wondering if we can initiate vote for this KIP.
> >
> > Thanks,
> > Dong
> >
> >
> >
> > On Tue, Feb 28, 2017 at 2:41 PM, Dong Lin  wrote:
> >
> > > Hey Jun,
> > >
> > > I just realized that StopReplicaRequest itself doesn't specify the
> > > replicaId in the wire protocol. Thus controller would need to log the
> > > brokerId with StopReplicaRequest in the log. Thus it may be
> > > reasonable for controller to do the same with LeaderAndIsrRequest and
> > only
> > > specify the isNewReplica for the broker that receives
> > LeaderAndIsrRequest.
> > >
> > > Thanks,
> > > Dong
> > >
> > > On Tue, Feb 28, 2017 at 2:14 PM, Dong Lin  wrote:
> > >
> > >> Hi Jun,
> > >>
> > >> Yeah there is tradeoff between controller's implementation complexity
> > vs.
> > >> wire-protocol complexity. I personally think it is more important to
> > keep
> > >> wire-protocol concise and only add information in wire-protocol if
> > >> necessary. It seems fine to add a little bit complexity to
> controller's
> > >> implementation, e.g. log destination broker per LeaderAndIsrRequet.
> > Becket
> > >> also shares this opinion with me. Is the only purpose of doing so to
> > make
> > >> controller log simpler?
> > >>
> > >> And certainly, I have added Todd's comment in the wiki.
> > >>
> > >> Thanks,
> > >> Dong
> > >>
> > >>
> > >> On Tue, Feb 28, 2017 at 1:37 PM, Jun Rao  wrote:
> > >>
> > >>> Hi, Dong,
> > >>>
> > >>> 52. What you suggested would work. However, I am thinking that it's
> > >>> probably simpler to just set isNewReplica at the replica level. That
> > way,
> > >>> the LeaderAndIsrRequest can be created a bit simpler. When reading a
> > >>> LeaderAndIsrRequest in the controller log, it's easier to see which
> > >>> replicas are new without looking at which broker the request is
> > intended
> > >>> for.
> > >>>
> > >>> Could you also add those additional points from Todd's on 1 broker
> per
> > >>> disk
> > >>> vs JBOD vs RAID5/6 to the KIP?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Hi, Todd,
> > >>>
> > >>> Thanks for the feedback. That's very useful.
> > >>>
> > >>> Jun
> > >>>
> > >>> On Tue, Feb 28, 2017 at 10:25 AM, Dong Lin 
> > wrote:
> > >>>
> > >>> > Hey Jun,
> > >>> >
> > >>> > Certainly, I have added Todd to reply to the thread. And I have
> > >>> updated the
> > >>> > item to in the wiki.
> > >>> >
> > >>> > 50. The full statement is "Broker assumes a log directory to be
> good
> > >>> after
> > >>> > it starts, and mark log directory as bad once there is IOException
> > when
> > >>> > broker attempts to access (i.e. read or write) the log directory".
> > This
> > >>> > statement seems reasonable, right? If a log directory is actually
> > bad,
> > >>> then
> > >>> > the broker will first assume it is OK, try to read logs on this log
> > >>> > directory, encounter IOException, and then mark it as bad.
> > >>> >
> > >>> > 51. My bad. I thought I removed it but I didn't. It is removed now.
> > >>> >
> > >>> > 52. I don't think so.. The isNewReplica field in the
> > >>> LeaderAndIsrRequest is
> > >>> > only relevant to the replica (i.e. broker) that receives the
> 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-02 Thread Jun Rao
Hi, Dong,

Ok. We can keep LeaderAndIsrRequest as it is in the wiki.

Since we need both KIP-112 and KIP-113 to make a compelling case for JBOD,
perhaps we should discuss KIP-113 before voting for both? I left some
comments in the other thread.

Thanks,

Jun

On Wed, Mar 1, 2017 at 1:58 PM, Dong Lin  wrote:

> Hey Jun,
>
> Do you think it is OK to keep the existing wire protocol in the KIP? I am
> wondering if we can initiate vote for this KIP.
>
> Thanks,
> Dong
>
>
>
> On Tue, Feb 28, 2017 at 2:41 PM, Dong Lin  wrote:
>
> > Hey Jun,
> >
> > I just realized that StopReplicaRequest itself doesn't specify the
> > replicaId in the wire protocol. Thus controller would need to log the
> > brokerId with StopReplicaRequest in the log. Thus it may be
> > reasonable for controller to do the same with LeaderAndIsrRequest and
> only
> > specify the isNewReplica for the broker that receives
> LeaderAndIsrRequest.
> >
> > Thanks,
> > Dong
> >
> > On Tue, Feb 28, 2017 at 2:14 PM, Dong Lin  wrote:
> >
> >> Hi Jun,
> >>
> >> Yeah there is tradeoff between controller's implementation complexity
> vs.
> >> wire-protocol complexity. I personally think it is more important to
> keep
> >> wire-protocol concise and only add information in wire-protocol if
> >> necessary. It seems fine to add a little bit complexity to controller's
> >> implementation, e.g. log destination broker per LeaderAndIsrRequet.
> Becket
> >> also shares this opinion with me. Is the only purpose of doing so to
> make
> >> controller log simpler?
> >>
> >> And certainly, I have added Todd's comment in the wiki.
> >>
> >> Thanks,
> >> Dong
> >>
> >>
> >> On Tue, Feb 28, 2017 at 1:37 PM, Jun Rao  wrote:
> >>
> >>> Hi, Dong,
> >>>
> >>> 52. What you suggested would work. However, I am thinking that it's
> >>> probably simpler to just set isNewReplica at the replica level. That
> way,
> >>> the LeaderAndIsrRequest can be created a bit simpler. When reading a
> >>> LeaderAndIsrRequest in the controller log, it's easier to see which
> >>> replicas are new without looking at which broker the request is
> intended
> >>> for.
> >>>
> >>> Could you also add those additional points from Todd's on 1 broker per
> >>> disk
> >>> vs JBOD vs RAID5/6 to the KIP?
> >>>
> >>> Thanks,
> >>>
> >>> Hi, Todd,
> >>>
> >>> Thanks for the feedback. That's very useful.
> >>>
> >>> Jun
> >>>
> >>> On Tue, Feb 28, 2017 at 10:25 AM, Dong Lin 
> wrote:
> >>>
> >>> > Hey Jun,
> >>> >
> >>> > Certainly, I have added Todd to reply to the thread. And I have
> >>> updated the
> >>> > item to in the wiki.
> >>> >
> >>> > 50. The full statement is "Broker assumes a log directory to be good
> >>> after
> >>> > it starts, and mark log directory as bad once there is IOException
> when
> >>> > broker attempts to access (i.e. read or write) the log directory".
> This
> >>> > statement seems reasonable, right? If a log directory is actually
> bad,
> >>> then
> >>> > the broker will first assume it is OK, try to read logs on this log
> >>> > directory, encounter IOException, and then mark it as bad.
> >>> >
> >>> > 51. My bad. I thought I removed it but I didn't. It is removed now.
> >>> >
> >>> > 52. I don't think so.. The isNewReplica field in the
> >>> LeaderAndIsrRequest is
> >>> > only relevant to the replica (i.e. broker) that receives the
> >>> > LeaderAndIsrRequest. There is no need to specify whether each replica
> >>> is
> >>> > new inside LeaderAndIsrRequest. In other words, if a broker sends
> >>> > LeaderAndIsrRequest to three different replicas of a given partition,
> >>> the
> >>> > isNewReplica field can be different across these three requests.
> >>> >
> >>> > Yeah, I would definitely want to start discussion on KIP-113 after we
> >>> have
> >>> > reached agreement on KIP-112. I have actually opened KIP-113
> discussion
> >>> > thread on 1/12 together with this thread. I have yet to add the
> >>> ability to
> >>> > list offline directories in KIP-113 which we discussed in this
> thread.
> >>> >
> >>> > Thanks for all your reviews! Is there further concern with the latest
> >>> KIP?
> >>> >
> >>> > Thanks!
> >>> > Dong
> >>> >
> >>> > On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao  wrote:
> >>> >
> >>> > > Hi, Dong,
> >>> > >
> >>> > > RAID6 is an improvement over RAID5 and can tolerate 2 disks
> failure.
> >>> > Eno's
> >>> > > point is that the rebuild of RAID5/RAID6 requires reading more data
> >>> > > compared with RAID10, which increases the probability of error
> during
> >>> > > rebuild. This makes sense. In any case, do you think you could ask
> >>> the
> >>> > SREs
> >>> > > at LinkedIn to share their opinions on RAID5/RAID6?
> >>> > >
> >>> > > Yes, when a replica is offline due to a bad disk, it makes sense to
> >>> > handle
> >>> > > it immediately as if a StopReplicaRequest is received (i.e.,
> replica
> >>> is
> >>> > no
> >>> > > longer considered a 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-03-01 Thread Dong Lin
Hey Jun,

Do you think it is OK to keep the existing wire protocol in the KIP? I am
wondering if we can initiate a vote for this KIP.

Thanks,
Dong



On Tue, Feb 28, 2017 at 2:41 PM, Dong Lin  wrote:

> Hey Jun,
>
> I just realized that StopReplicaRequest itself doesn't specify the
> replicaId in the wire protocol. Thus controller would need to log the
> brokerId with StopReplicaRequest in the log. Thus it may be
> reasonable for controller to do the same with LeaderAndIsrRequest and only
> specify the isNewReplica for the broker that receives LeaderAndIsrRequest.
>
> Thanks,
> Dong
>
> On Tue, Feb 28, 2017 at 2:14 PM, Dong Lin  wrote:
>
>> Hi Jun,
>>
>> Yeah there is tradeoff between controller's implementation complexity vs.
>> wire-protocol complexity. I personally think it is more important to keep
>> wire-protocol concise and only add information in wire-protocol if
>> necessary. It seems fine to add a little bit complexity to controller's
>> implementation, e.g. log destination broker per LeaderAndIsrRequet. Becket
>> also shares this opinion with me. Is the only purpose of doing so to make
>> controller log simpler?
>>
>> And certainly, I have added Todd's comment in the wiki.
>>
>> Thanks,
>> Dong
>>
>>
>> On Tue, Feb 28, 2017 at 1:37 PM, Jun Rao  wrote:
>>
>>> Hi, Dong,
>>>
>>> 52. What you suggested would work. However, I am thinking that it's
>>> probably simpler to just set isNewReplica at the replica level. That way,
>>> the LeaderAndIsrRequest can be created a bit simpler. When reading a
>>> LeaderAndIsrRequest in the controller log, it's easier to see which
>>> replicas are new without looking at which broker the request is intended
>>> for.
>>>
>>> Could you also add those additional points from Todd's on 1 broker per
>>> disk
>>> vs JBOD vs RAID5/6 to the KIP?
>>>
>>> Thanks,
>>>
>>> Hi, Todd,
>>>
>>> Thanks for the feedback. That's very useful.
>>>
>>> Jun
>>>
>>> On Tue, Feb 28, 2017 at 10:25 AM, Dong Lin  wrote:
>>>
>>> > Hey Jun,
>>> >
>>> > Certainly, I have added Todd to reply to the thread. And I have
>>> updated the
>>> > item to in the wiki.
>>> >
>>> > 50. The full statement is "Broker assumes a log directory to be good
>>> after
>>> > it starts, and mark log directory as bad once there is IOException when
>>> > broker attempts to access (i.e. read or write) the log directory". This
>>> > statement seems reasonable, right? If a log directory is actually bad,
>>> then
>>> > the broker will first assume it is OK, try to read logs on this log
>>> > directory, encounter IOException, and then mark it as bad.
>>> >
>>> > 51. My bad. I thought I removed it but I didn't. It is removed now.
>>> >
>>> > 52. I don't think so.. The isNewReplica field in the
>>> LeaderAndIsrRequest is
>>> > only relevant to the replica (i.e. broker) that receives the
>>> > LeaderAndIsrRequest. There is no need to specify whether each replica
>>> is
>>> > new inside LeaderAndIsrRequest. In other words, if a broker sends
>>> > LeaderAndIsrRequest to three different replicas of a given partition,
>>> the
>>> > isNewReplica field can be different across these three requests.
>>> >
>>> > Yeah, I would definitely want to start discussion on KIP-113 after we
>>> have
>>> > reached agreement on KIP-112. I have actually opened KIP-113 discussion
>>> > thread on 1/12 together with this thread. I have yet to add the
>>> ability to
>>> > list offline directories in KIP-113 which we discussed in this thread.
>>> >
>>> > Thanks for all your reviews! Is there further concern with the latest
>>> KIP?
>>> >
>>> > Thanks!
>>> > Dong
>>> >
>>> > On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao  wrote:
>>> >
>>> > > Hi, Dong,
>>> > >
>>> > > RAID6 is an improvement over RAID5 and can tolerate 2 disks failure.
>>> > Eno's
>>> > > point is that the rebuild of RAID5/RAID6 requires reading more data
>>> > > compared with RAID10, which increases the probability of error during
>>> > > rebuild. This makes sense. In any case, do you think you could ask
>>> the
>>> > SREs
>>> > > at LinkedIn to share their opinions on RAID5/RAID6?
>>> > >
>>> > > Yes, when a replica is offline due to a bad disk, it makes sense to
>>> > handle
>>> > > it immediately as if a StopReplicaRequest is received (i.e., replica
>>> is
>>> > no
>>> > > longer considered a leader and is removed from any replica fetcher
>>> > thread).
>>> > > Could you add that detail in item 2. in the wiki?
>>> > >
>>> > > 50. The wiki says "Broker assumes a log directory to be good after it
>>> > > starts" : A log directory actually could be bad during startup.
>>> > >
>>> > > 51. In item 4, the wiki says "The controller watches the path
>>> > > /log_dir_event_notification for new znode.". This doesn't seem be
>>> needed
>>> > > now?
>>> > >
>>> > > 52. The isNewReplica field in LeaderAndIsrRequest should be for each
>>> > > replica inside the replicas field, right?
>>> > >
>>> > > Other 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-28 Thread Dong Lin
Hey Jun,

I just realized that StopReplicaRequest itself doesn't specify the
replicaId in the wire protocol. Thus the controller would need to log the
brokerId together with the StopReplicaRequest in its log. It may therefore be
reasonable for the controller to do the same with LeaderAndIsrRequest and only
specify the isNewReplica for the broker that receives the LeaderAndIsrRequest.
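
To make "only specify the isNewReplica for the receiving broker" concrete,
here is a rough sketch of the controller-side construction; all class and
method names are invented for illustration and this is not the real
controller code.

// Rough illustration only.
import java.util.Set;

public class PerBrokerIsNewReplicaSketch {

    // Minimal stand-in for the per-partition state sent to one broker.
    static final class PartitionStateForBroker {
        final String topicPartition;
        final boolean isNewReplica;  // meaningful only for the receiving broker

        PartitionStateForBroker(String topicPartition, boolean isNewReplica) {
            this.topicPartition = topicPartition;
            this.isNewReplica = isNewReplica;
        }
    }

    /** Build the partition state for one destination broker. */
    static PartitionStateForBroker buildFor(int destinationBrokerId,
                                            String topicPartition,
                                            Set<Integer> brokersWithNewReplica) {
        boolean isNew = brokersWithNewReplica.contains(destinationBrokerId);
        // Since the flag is relative to the destination, the controller logs the
        // destination broker id together with the request (just as it has to do
        // for StopReplicaRequest anyway).
        System.out.printf("LeaderAndIsrRequest to broker %d: %s isNewReplica=%b%n",
                destinationBrokerId, topicPartition, isNew);
        return new PartitionStateForBroker(topicPartition, isNew);
    }
}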

Thanks,
Dong

On Tue, Feb 28, 2017 at 2:14 PM, Dong Lin  wrote:

> Hi Jun,
>
> Yeah there is tradeoff between controller's implementation complexity vs.
> wire-protocol complexity. I personally think it is more important to keep
> wire-protocol concise and only add information in wire-protocol if
> necessary. It seems fine to add a little bit complexity to controller's
> implementation, e.g. log destination broker per LeaderAndIsrRequet. Becket
> also shares this opinion with me. Is the only purpose of doing so to make
> controller log simpler?
>
> And certainly, I have added Todd's comment in the wiki.
>
> Thanks,
> Dong
>
>
> On Tue, Feb 28, 2017 at 1:37 PM, Jun Rao  wrote:
>
>> Hi, Dong,
>>
>> 52. What you suggested would work. However, I am thinking that it's
>> probably simpler to just set isNewReplica at the replica level. That way,
>> the LeaderAndIsrRequest can be created a bit simpler. When reading a
>> LeaderAndIsrRequest in the controller log, it's easier to see which
>> replicas are new without looking at which broker the request is intended
>> for.
>>
>> Could you also add those additional points from Todd's on 1 broker per
>> disk
>> vs JBOD vs RAID5/6 to the KIP?
>>
>> Thanks,
>>
>> Hi, Todd,
>>
>> Thanks for the feedback. That's very useful.
>>
>> Jun
>>
>> On Tue, Feb 28, 2017 at 10:25 AM, Dong Lin  wrote:
>>
>> > Hey Jun,
>> >
>> > Certainly, I have added Todd to reply to the thread. And I have updated
>> the
>> > item to in the wiki.
>> >
>> > 50. The full statement is "Broker assumes a log directory to be good
>> after
>> > it starts, and mark log directory as bad once there is IOException when
>> > broker attempts to access (i.e. read or write) the log directory". This
>> > statement seems reasonable, right? If a log directory is actually bad,
>> then
>> > the broker will first assume it is OK, try to read logs on this log
>> > directory, encounter IOException, and then mark it as bad.
>> >
>> > 51. My bad. I thought I removed it but I didn't. It is removed now.
>> >
>> > 52. I don't think so.. The isNewReplica field in the
>> LeaderAndIsrRequest is
>> > only relevant to the replica (i.e. broker) that receives the
>> > LeaderAndIsrRequest. There is no need to specify whether each replica is
>> > new inside LeaderAndIsrRequest. In other words, if a broker sends
>> > LeaderAndIsrRequest to three different replicas of a given partition,
>> the
>> > isNewReplica field can be different across these three requests.
>> >
>> > Yeah, I would definitely want to start discussion on KIP-113 after we
>> have
>> > reached agreement on KIP-112. I have actually opened KIP-113 discussion
>> > thread on 1/12 together with this thread. I have yet to add the ability
>> to
>> > list offline directories in KIP-113 which we discussed in this thread.
>> >
>> > Thanks for all your reviews! Is there further concern with the latest
>> KIP?
>> >
>> > Thanks!
>> > Dong
>> >
>> > On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao  wrote:
>> >
>> > > Hi, Dong,
>> > >
>> > > RAID6 is an improvement over RAID5 and can tolerate 2 disks failure.
>> > Eno's
>> > > point is that the rebuild of RAID5/RAID6 requires reading more data
>> > > compared with RAID10, which increases the probability of error during
>> > > rebuild. This makes sense. In any case, do you think you could ask the
>> > SREs
>> > > at LinkedIn to share their opinions on RAID5/RAID6?
>> > >
>> > > Yes, when a replica is offline due to a bad disk, it makes sense to
>> > handle
>> > > it immediately as if a StopReplicaRequest is received (i.e., replica
>> is
>> > no
>> > > longer considered a leader and is removed from any replica fetcher
>> > thread).
>> > > Could you add that detail in item 2. in the wiki?
>> > >
>> > > 50. The wiki says "Broker assumes a log directory to be good after it
>> > > starts" : A log directory actually could be bad during startup.
>> > >
>> > > 51. In item 4, the wiki says "The controller watches the path
>> > > /log_dir_event_notification for new znode.". This doesn't seem be
>> needed
>> > > now?
>> > >
>> > > 52. The isNewReplica field in LeaderAndIsrRequest should be for each
>> > > replica inside the replicas field, right?
>> > >
>> > > Other than those, the current KIP looks good to me. Do you want to
>> start
>> > a
>> > > separate discussion thread on KIP-113? I do have some comments there.
>> > >
>> > > Thanks for working on this!
>> > >
>> > > Jun
>> > >
>> > >
>> > > On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin 
>> wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > In addition to 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-28 Thread Dong Lin
Hi Jun,

Yeah, there is a tradeoff between the controller's implementation complexity
and the wire-protocol complexity. I personally think it is more important to
keep the wire protocol concise and only add information to the wire protocol
if necessary. It seems fine to add a little bit of complexity to the
controller's implementation, e.g. logging the destination broker per
LeaderAndIsrRequest. Becket also shares this opinion with me. Is the only
purpose of doing so to make the controller log simpler?

And certainly, I have added Todd's comment in the wiki.

Thanks,
Dong


On Tue, Feb 28, 2017 at 1:37 PM, Jun Rao  wrote:

> Hi, Dong,
>
> 52. What you suggested would work. However, I am thinking that it's
> probably simpler to just set isNewReplica at the replica level. That way,
> the LeaderAndIsrRequest can be created a bit simpler. When reading a
> LeaderAndIsrRequest in the controller log, it's easier to see which
> replicas are new without looking at which broker the request is intended
> for.
>
> Could you also add those additional points from Todd's on 1 broker per disk
> vs JBOD vs RAID5/6 to the KIP?
>
> Thanks,
>
> Hi, Todd,
>
> Thanks for the feedback. That's very useful.
>
> Jun
>
> On Tue, Feb 28, 2017 at 10:25 AM, Dong Lin  wrote:
>
> > Hey Jun,
> >
> > Certainly, I have added Todd to reply to the thread. And I have updated
> the
> > item to in the wiki.
> >
> > 50. The full statement is "Broker assumes a log directory to be good
> after
> > it starts, and mark log directory as bad once there is IOException when
> > broker attempts to access (i.e. read or write) the log directory". This
> > statement seems reasonable, right? If a log directory is actually bad,
> then
> > the broker will first assume it is OK, try to read logs on this log
> > directory, encounter IOException, and then mark it as bad.
> >
> > 51. My bad. I thought I removed it but I didn't. It is removed now.
> >
> > 52. I don't think so.. The isNewReplica field in the LeaderAndIsrRequest
> is
> > only relevant to the replica (i.e. broker) that receives the
> > LeaderAndIsrRequest. There is no need to specify whether each replica is
> > new inside LeaderAndIsrRequest. In other words, if a broker sends
> > LeaderAndIsrRequest to three different replicas of a given partition, the
> > isNewReplica field can be different across these three requests.
> >
> > Yeah, I would definitely want to start discussion on KIP-113 after we
> have
> > reached agreement on KIP-112. I have actually opened KIP-113 discussion
> > thread on 1/12 together with this thread. I have yet to add the ability
> to
> > list offline directories in KIP-113 which we discussed in this thread.
> >
> > Thanks for all your reviews! Is there further concern with the latest
> KIP?
> >
> > Thanks!
> > Dong
> >
> > On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao  wrote:
> >
> > > Hi, Dong,
> > >
> > > RAID6 is an improvement over RAID5 and can tolerate 2 disks failure.
> > Eno's
> > > point is that the rebuild of RAID5/RAID6 requires reading more data
> > > compared with RAID10, which increases the probability of error during
> > > rebuild. This makes sense. In any case, do you think you could ask the
> > SREs
> > > at LinkedIn to share their opinions on RAID5/RAID6?
> > >
> > > Yes, when a replica is offline due to a bad disk, it makes sense to
> > handle
> > > it immediately as if a StopReplicaRequest is received (i.e., replica is
> > no
> > > longer considered a leader and is removed from any replica fetcher
> > thread).
> > > Could you add that detail in item 2. in the wiki?
> > >
> > > 50. The wiki says "Broker assumes a log directory to be good after it
> > > starts" : A log directory actually could be bad during startup.
> > >
> > > 51. In item 4, the wiki says "The controller watches the path
> > > /log_dir_event_notification for new znode.". This doesn't seem be
> needed
> > > now?
> > >
> > > 52. The isNewReplica field in LeaderAndIsrRequest should be for each
> > > replica inside the replicas field, right?
> > >
> > > Other than those, the current KIP looks good to me. Do you want to
> start
> > a
> > > separate discussion thread on KIP-113? I do have some comments there.
> > >
> > > Thanks for working on this!
> > >
> > > Jun
> > >
> > >
> > > On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin  wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > In addition to the Eno's reference of why rebuild time with RAID-5 is
> > > more
> > > > expensive, another concern is that RAID-5 will fail if more than one
> > disk
> > > > fails. JBOD is still works with 1+ disk failure and has better
> > > performance
> > > > with one disk failure. These seems like good argument for using JBOD
> > > > instead of RAID-5.
> > > >
> > > > If a leader replica goes offline, the broker should first take all
> > > actions
> > > > (i.e. remove the partition from fetcher thread) as if it has received
> > > > StopReplicaRequest for this partition because the replica can 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-28 Thread Jun Rao
Hi, Dong,

52. What you suggested would work. However, I am thinking that it's
probably simpler to just set isNewReplica at the replica level. That way,
the LeaderAndIsrRequest can be created a bit simpler. When reading a
LeaderAndIsrRequest in the controller log, it's easier to see which
replicas are new without looking at which broker the request is intended
for.
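
As a quick illustration of the two options being compared, here is a rough
sketch; the field names are simplified and invented, and this is not the
actual protocol definition.

// Option A: the KIP's current proposal -- one flag per request, interpreted
// relative to the broker that receives the request.
// Option B: the alternative suggested here -- the flag sits next to each
// replica id, so the request reads the same regardless of its destination.
public class IsNewReplicaPlacementSketch {

    static final class PartitionStateA {
        int leader;
        int[] isr;
        int[] replicas;
        boolean isNewReplica;       // "is the receiving broker's replica new?"
    }

    static final class ReplicaState {
        int brokerId;
        boolean isNewReplica;       // "is the replica on this broker new?"
    }

    static final class PartitionStateB {
        int leader;
        int[] isr;
        ReplicaState[] replicas;    // per-replica flag
    }
}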

Could you also add those additional points from Todd on 1 broker per disk
vs JBOD vs RAID5/6 to the KIP?

Thanks,

Hi, Todd,

Thanks for the feedback. That's very useful.

Jun

On Tue, Feb 28, 2017 at 10:25 AM, Dong Lin  wrote:

> Hey Jun,
>
> Certainly, I have added Todd to reply to the thread. And I have updated the
> item to in the wiki.
>
> 50. The full statement is "Broker assumes a log directory to be good after
> it starts, and mark log directory as bad once there is IOException when
> broker attempts to access (i.e. read or write) the log directory". This
> statement seems reasonable, right? If a log directory is actually bad, then
> the broker will first assume it is OK, try to read logs on this log
> directory, encounter IOException, and then mark it as bad.
>
> 51. My bad. I thought I removed it but I didn't. It is removed now.
>
> 52. I don't think so.. The isNewReplica field in the LeaderAndIsrRequest is
> only relevant to the replica (i.e. broker) that receives the
> LeaderAndIsrRequest. There is no need to specify whether each replica is
> new inside LeaderAndIsrRequest. In other words, if a broker sends
> LeaderAndIsrRequest to three different replicas of a given partition, the
> isNewReplica field can be different across these three requests.
>
> Yeah, I would definitely want to start discussion on KIP-113 after we have
> reached agreement on KIP-112. I have actually opened KIP-113 discussion
> thread on 1/12 together with this thread. I have yet to add the ability to
> list offline directories in KIP-113 which we discussed in this thread.
>
> Thanks for all your reviews! Is there further concern with the latest KIP?
>
> Thanks!
> Dong
>
> On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao  wrote:
>
> > Hi, Dong,
> >
> > RAID6 is an improvement over RAID5 and can tolerate 2 disks failure.
> Eno's
> > point is that the rebuild of RAID5/RAID6 requires reading more data
> > compared with RAID10, which increases the probability of error during
> > rebuild. This makes sense. In any case, do you think you could ask the
> SREs
> > at LinkedIn to share their opinions on RAID5/RAID6?
> >
> > Yes, when a replica is offline due to a bad disk, it makes sense to
> handle
> > it immediately as if a StopReplicaRequest is received (i.e., replica is
> no
> > longer considered a leader and is removed from any replica fetcher
> thread).
> > Could you add that detail in item 2. in the wiki?
> >
> > 50. The wiki says "Broker assumes a log directory to be good after it
> > starts" : A log directory actually could be bad during startup.
> >
> > 51. In item 4, the wiki says "The controller watches the path
> > /log_dir_event_notification for new znode.". This doesn't seem be needed
> > now?
> >
> > 52. The isNewReplica field in LeaderAndIsrRequest should be for each
> > replica inside the replicas field, right?
> >
> > Other than those, the current KIP looks good to me. Do you want to start
> a
> > separate discussion thread on KIP-113? I do have some comments there.
> >
> > Thanks for working on this!
> >
> > Jun
> >
> >
> > On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin  wrote:
> >
> > > Hi Jun,
> > >
> > > In addition to the Eno's reference of why rebuild time with RAID-5 is
> > more
> > > expensive, another concern is that RAID-5 will fail if more than one
> disk
> > > fails. JBOD is still works with 1+ disk failure and has better
> > performance
> > > with one disk failure. These seems like good argument for using JBOD
> > > instead of RAID-5.
> > >
> > > If a leader replica goes offline, the broker should first take all
> > actions
> > > (i.e. remove the partition from fetcher thread) as if it has received
> > > StopReplicaRequest for this partition because the replica can no longer
> > > work anyway. It will also respond with error to any ProduceRequest and
> > > FetchRequest for partition. The broker notifies controller by writing
> > > notification znode in ZK. The controller learns the disk failure event
> > from
> > > ZK, sends LeaderAndIsrRequest and receives LeaderAndIsrResponse to
> learn
> > > that the replica is offline. The controller will then elect new leader
> > for
> > > this partition and sends LeaderAndIsrRequest/MetadataUpdateRequest to
> > > relevant brokers. The broker should stop adjusting the ISR for this
> > > partition as if the broker is already offline. I am not sure there is
> any
> > > inconsistency in broker's behavior when it is leader or follower. Is
> > there
> > > any concern with this approach?
> > >
> > > Thanks for catching this. I have removed that reference 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-28 Thread Eno Thereska
Thanks Todd for the explanation.

Eno
> On 28 Feb 2017, at 18:15, Todd Palino  wrote:
> 
> We have tested RAID 5/6 in the past (and recently) and found it to be
> lacking. So, as noted, rebuild takes more time than RAID 10 because all the
> disks need to be accessed to recalculate parity. In addition, there’s a
> significant performance loss just in normal operations. It’s been a while
> since I ran those tests, but it was in the 30-50% range - nothing to shrug
> off. We didn’t even get to failure testing because of that.
> 
> Jun - to your question, we ran the tests with numerous combinations of
> block sizes and FS parameters. The performance varied, but it was never
> good enough to warrant more than a superficial look at using RAID 5/6. We
> also tested both software RAID and hardware RAID.
> 
> As far as the operational concerns around broker-per-disk and
> broker-per-server, we’ve been talking about this internally. Running one
> broker per disk adds a good bit of administrative overhead and complexity.
> If you perform a one by one rolling bounce of the cluster, you’re talking
> about a 10x increase in time. That means a cluster that restarts in 30
> minutes now takes 5 hours. If you try and optimize this by shutting down
> all the brokers on one host at a time, you can get close to the original
> number, but you now have added operational complexity by having to
> micro-manage the bounce. The broker count increase will percolate down to
> the rest of the administrative domain as well - maintaining ports for all
> the instances, monitoring more instances, managing configs, etc.
> 
> You also have the overhead of running the extra processes - extra heap,
> task switching, etc. We don’t have a problem with page cache really, since
> the VM subsystem is fairly efficient about how it works. But just because
> cache works doesn’t mean we’re not wasting other resources. And that gets
> pushed downstream to clients as well, because they all have to maintain
> more network connections and the resources that go along with it.
> 
> Running more brokers in a cluster also exposes you to more corner cases and
> race conditions within the Kafka code. Bugs in the brokers, bugs in the
> controllers, more complexity in balancing load in a cluster (though trying
> to balance load across disks in a single broker doing JBOD negates that).
> 
> -Todd
> 
> 
> On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao  wrote:
> 
>> Hi, Dong,
>> 
>> RAID6 is an improvement over RAID5 and can tolerate 2 disks failure. Eno's
>> point is that the rebuild of RAID5/RAID6 requires reading more data
>> compared with RAID10, which increases the probability of error during
>> rebuild. This makes sense. In any case, do you think you could ask the SREs
>> at LinkedIn to share their opinions on RAID5/RAID6?
>> 
>> Yes, when a replica is offline due to a bad disk, it makes sense to handle
>> it immediately as if a StopReplicaRequest is received (i.e., replica is no
>> longer considered a leader and is removed from any replica fetcher thread).
>> Could you add that detail in item 2. in the wiki?
>> 
>> 50. The wiki says "Broker assumes a log directory to be good after it
>> starts" : A log directory actually could be bad during startup.
>> 
>> 51. In item 4, the wiki says "The controller watches the path
>> /log_dir_event_notification for new znode.". This doesn't seem be needed
>> now?
>> 
>> 52. The isNewReplica field in LeaderAndIsrRequest should be for each
>> replica inside the replicas field, right?
>> 
>> Other than those, the current KIP looks good to me. Do you want to start a
>> separate discussion thread on KIP-113? I do have some comments there.
>> 
>> Thanks for working on this!
>> 
>> Jun
>> 
>> 
>> On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin  wrote:
>> 
>>> Hi Jun,
>>> 
>>> In addition to the Eno's reference of why rebuild time with RAID-5 is
>> more
>>> expensive, another concern is that RAID-5 will fail if more than one disk
>>> fails. JBOD is still works with 1+ disk failure and has better
>> performance
>>> with one disk failure. These seems like good argument for using JBOD
>>> instead of RAID-5.
>>> 
>>> If a leader replica goes offline, the broker should first take all
>> actions
>>> (i.e. remove the partition from fetcher thread) as if it has received
>>> StopReplicaRequest for this partition because the replica can no longer
>>> work anyway. It will also respond with error to any ProduceRequest and
>>> FetchRequest for partition. The broker notifies controller by writing
>>> notification znode in ZK. The controller learns the disk failure event
>> from
>>> ZK, sends LeaderAndIsrRequest and receives LeaderAndIsrResponse to learn
>>> that the replica is offline. The controller will then elect new leader
>> for
>>> this partition and sends LeaderAndIsrRequest/MetadataUpdateRequest to
>>> relevant brokers. The broker should stop adjusting the ISR for this
>>> partition as if 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-28 Thread Dong Lin
Hey Jun,

Certainly, I have added Todd to the thread so he can reply. And I have updated
the item in the wiki.

50. The full statement is "Broker assumes a log directory to be good after
it starts, and mark log directory as bad once there is IOException when
broker attempts to access (i.e. read or write) the log directory". This
statement seems reasonable, right? If a log directory is actually bad, then
the broker will first assume it is OK, try to read logs on this log
directory, encounter IOException, and then mark it as bad.

51. My bad. I thought I removed it but I didn't. It is removed now.

52. I don't think so.. The isNewReplica field in the LeaderAndIsrRequest is
only relevant to the replica (i.e. broker) that receives the
LeaderAndIsrRequest. There is no need to specify whether each replica is
new inside LeaderAndIsrRequest. In other words, if a broker sends
LeaderAndIsrRequest to three different replicas of a given partition, the
isNewReplica field can be different across these three requests.
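
Regarding 50, here is a tiny sketch of the "assume good until an IOException
is seen" behaviour described above; the names are hypothetical and this is not
the broker's actual code.

import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LogDirHealthSketch {
    // Every configured log directory starts out as good when the broker starts.
    private final Set<String> offlineLogDirs = ConcurrentHashMap.newKeySet();

    interface LogDirAction<T> {
        T run() throws IOException;
    }

    /** Run a read/write against a log dir and mark the dir bad on IOException. */
    <T> T access(String logDir, LogDirAction<T> action, T fallback) {
        if (offlineLogDirs.contains(logDir)) {
            return fallback;                 // already known to be bad
        }
        try {
            return action.run();
        } catch (IOException e) {
            offlineLogDirs.add(logDir);      // mark bad on the first I/O failure
            // ... then stop replicas on this dir and notify the controller.
            return fallback;
        }
    }
}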

Yeah, I would definitely want to start discussion on KIP-113 after we have
reached agreement on KIP-112. I have actually opened KIP-113 discussion
thread on 1/12 together with this thread. I have yet to add the ability to
list offline directories in KIP-113 which we discussed in this thread.

Thanks for all your reviews! Is there further concern with the latest KIP?

Thanks!
Dong

On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao  wrote:

> Hi, Dong,
>
> RAID6 is an improvement over RAID5 and can tolerate 2 disks failure. Eno's
> point is that the rebuild of RAID5/RAID6 requires reading more data
> compared with RAID10, which increases the probability of error during
> rebuild. This makes sense. In any case, do you think you could ask the SREs
> at LinkedIn to share their opinions on RAID5/RAID6?
>
> Yes, when a replica is offline due to a bad disk, it makes sense to handle
> it immediately as if a StopReplicaRequest is received (i.e., replica is no
> longer considered a leader and is removed from any replica fetcher thread).
> Could you add that detail in item 2. in the wiki?
>
> 50. The wiki says "Broker assumes a log directory to be good after it
> starts" : A log directory actually could be bad during startup.
>
> 51. In item 4, the wiki says "The controller watches the path
> /log_dir_event_notification for new znode.". This doesn't seem be needed
> now?
>
> 52. The isNewReplica field in LeaderAndIsrRequest should be for each
> replica inside the replicas field, right?
>
> Other than those, the current KIP looks good to me. Do you want to start a
> separate discussion thread on KIP-113? I do have some comments there.
>
> Thanks for working on this!
>
> Jun
>
>
> On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin  wrote:
>
> > Hi Jun,
> >
> > In addition to the Eno's reference of why rebuild time with RAID-5 is
> more
> > expensive, another concern is that RAID-5 will fail if more than one disk
> > fails. JBOD is still works with 1+ disk failure and has better
> performance
> > with one disk failure. These seems like good argument for using JBOD
> > instead of RAID-5.
> >
> > If a leader replica goes offline, the broker should first take all
> actions
> > (i.e. remove the partition from fetcher thread) as if it has received
> > StopReplicaRequest for this partition because the replica can no longer
> > work anyway. It will also respond with error to any ProduceRequest and
> > FetchRequest for partition. The broker notifies controller by writing
> > notification znode in ZK. The controller learns the disk failure event
> from
> > ZK, sends LeaderAndIsrRequest and receives LeaderAndIsrResponse to learn
> > that the replica is offline. The controller will then elect new leader
> for
> > this partition and sends LeaderAndIsrRequest/MetadataUpdateRequest to
> > relevant brokers. The broker should stop adjusting the ISR for this
> > partition as if the broker is already offline. I am not sure there is any
> > inconsistency in broker's behavior when it is leader or follower. Is
> there
> > any concern with this approach?
> >
> > Thanks for catching this. I have removed that reference from the KIP.
> >
> > Hi Eno,
> >
> > Thank you for providing the reference of the RAID-5. In LinkedIn we have
> 10
> > disks per Kafka machine. It will not be a show-stopper operationally for
> > LinkedIn if we have to deploy one-broker-per-disk. On the other hand we
> > previously discussed the advantage of JBOD vs. one-broker-per-disk or
> > one-broker-per-machine. One-broker-per-disk suffers from the problems
> > described in the KIP and one-broker-per-machine increases the failure
> > caused by disk failure by 10X. Since JBOD is strictly better than either
> of
> > the two, it is also better then one-broker-per-multiple-disk which is
> > somewhere between one-broker-per-disk and one-broker-per-machine.
> >
> > I personally think the benefits of JBOD design is worth the
> implementation
> > complexity it 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-28 Thread Todd Palino
We have tested RAID 5/6 in the past (and recently) and found it to be
lacking. So, as noted, rebuild takes more time than RAID 10 because all the
disks need to be accessed to recalculate parity. In addition, there’s a
significant performance loss just in normal operations. It’s been a while
since I ran those tests, but it was in the 30-50% range - nothing to shrug
off. We didn’t even get to failure testing because of that.

Jun - to your question, we ran the tests with numerous combinations of
block sizes and FS parameters. The performance varied, but it was never
good enough to warrant more than a superficial look at using RAID 5/6. We
also tested both software RAID and hardware RAID.

As far as the operational concerns around broker-per-disk and
broker-per-server, we’ve been talking about this internally. Running one
broker per disk adds a good bit of administrative overhead and complexity.
If you perform a one by one rolling bounce of the cluster, you’re talking
about a 10x increase in time. That means a cluster that restarts in 30
minutes now takes 5 hours. If you try and optimize this by shutting down
all the brokers on one host at a time, you can get close to the original
number, but you now have added operational complexity by having to
micro-manage the bounce. The broker count increase will percolate down to
the rest of the administrative domain as well - maintaining ports for all
the instances, monitoring more instances, managing configs, etc.

You also have the overhead of running the extra processes - extra heap,
task switching, etc. We don’t have a problem with page cache really, since
the VM subsystem is fairly efficient about how it works. But just because
cache works doesn’t mean we’re not wasting other resources. And that gets
pushed downstream to clients as well, because they all have to maintain
more network connections and the resources that go along with it.

Running more brokers in a cluster also exposes you to more corner cases and
race conditions within the Kafka code. Bugs in the brokers, bugs in the
controllers, more complexity in balancing load in a cluster (though trying
to balance load across disks in a single broker doing JBOD negates that).

-Todd


On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao  wrote:

> Hi, Dong,
>
> RAID6 is an improvement over RAID5 and can tolerate 2 disks failure. Eno's
> point is that the rebuild of RAID5/RAID6 requires reading more data
> compared with RAID10, which increases the probability of error during
> rebuild. This makes sense. In any case, do you think you could ask the SREs
> at LinkedIn to share their opinions on RAID5/RAID6?
>
> Yes, when a replica is offline due to a bad disk, it makes sense to handle
> it immediately as if a StopReplicaRequest is received (i.e., replica is no
> longer considered a leader and is removed from any replica fetcher thread).
> Could you add that detail in item 2. in the wiki?
>
> 50. The wiki says "Broker assumes a log directory to be good after it
> starts" : A log directory actually could be bad during startup.
>
> 51. In item 4, the wiki says "The controller watches the path
> /log_dir_event_notification for new znode.". This doesn't seem be needed
> now?
>
> 52. The isNewReplica field in LeaderAndIsrRequest should be for each
> replica inside the replicas field, right?
>
> Other than those, the current KIP looks good to me. Do you want to start a
> separate discussion thread on KIP-113? I do have some comments there.
>
> Thanks for working on this!
>
> Jun
>
>
> On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin  wrote:
>
> > Hi Jun,
> >
> > In addition to the Eno's reference of why rebuild time with RAID-5 is
> more
> > expensive, another concern is that RAID-5 will fail if more than one disk
> > fails. JBOD is still works with 1+ disk failure and has better
> performance
> > with one disk failure. These seems like good argument for using JBOD
> > instead of RAID-5.
> >
> > If a leader replica goes offline, the broker should first take all
> actions
> > (i.e. remove the partition from fetcher thread) as if it has received
> > StopReplicaRequest for this partition because the replica can no longer
> > work anyway. It will also respond with error to any ProduceRequest and
> > FetchRequest for partition. The broker notifies controller by writing
> > notification znode in ZK. The controller learns the disk failure event
> from
> > ZK, sends LeaderAndIsrRequest and receives LeaderAndIsrResponse to learn
> > that the replica is offline. The controller will then elect new leader
> for
> > this partition and sends LeaderAndIsrRequest/MetadataUpdateRequest to
> > relevant brokers. The broker should stop adjusting the ISR for this
> > partition as if the broker is already offline. I am not sure there is any
> > inconsistency in broker's behavior when it is leader or follower. Is
> there
> > any concern with this approach?
> >
> > Thanks for catching this. I have removed that reference from 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-28 Thread Jun Rao
Hi, Dong,

RAID6 is an improvement over RAID5 and can tolerate 2 disk failures. Eno's
point is that the rebuild of RAID5/RAID6 requires reading more data
compared with RAID10, which increases the probability of error during
rebuild. This makes sense. In any case, do you think you could ask the SREs
at LinkedIn to share their opinions on RAID5/RAID6?

Yes, when a replica is offline due to a bad disk, it makes sense to handle
it immediately as if a StopReplicaRequest is received (i.e., replica is no
longer considered a leader and is removed from any replica fetcher thread).
Could you add that detail in item 2. in the wiki?

50. The wiki says "Broker assumes a log directory to be good after it
starts" : A log directory actually could be bad during startup.

51. In item 4, the wiki says "The controller watches the path
/log_dir_event_notification for new znode.". This doesn't seem to be needed
now?

52. The isNewReplica field in LeaderAndIsrRequest should be for each
replica inside the replicas field, right?

Other than those, the current KIP looks good to me. Do you want to start a
separate discussion thread on KIP-113? I do have some comments there.

Thanks for working on this!

Jun


On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin  wrote:

> Hi Jun,
>
> In addition to the Eno's reference of why rebuild time with RAID-5 is more
> expensive, another concern is that RAID-5 will fail if more than one disk
> fails. JBOD is still works with 1+ disk failure and has better performance
> with one disk failure. These seems like good argument for using JBOD
> instead of RAID-5.
>
> If a leader replica goes offline, the broker should first take all actions
> (i.e. remove the partition from fetcher thread) as if it has received
> StopReplicaRequest for this partition because the replica can no longer
> work anyway. It will also respond with error to any ProduceRequest and
> FetchRequest for partition. The broker notifies controller by writing
> notification znode in ZK. The controller learns the disk failure event from
> ZK, sends LeaderAndIsrRequest and receives LeaderAndIsrResponse to learn
> that the replica is offline. The controller will then elect new leader for
> this partition and sends LeaderAndIsrRequest/MetadataUpdateRequest to
> relevant brokers. The broker should stop adjusting the ISR for this
> partition as if the broker is already offline. I am not sure there is any
> inconsistency in broker's behavior when it is leader or follower. Is there
> any concern with this approach?
>
> Thanks for catching this. I have removed that reference from the KIP.
>
> Hi Eno,
>
> Thank you for providing the reference of the RAID-5. In LinkedIn we have 10
> disks per Kafka machine. It will not be a show-stopper operationally for
> LinkedIn if we have to deploy one-broker-per-disk. On the other hand we
> previously discussed the advantage of JBOD vs. one-broker-per-disk or
> one-broker-per-machine. One-broker-per-disk suffers from the problems
> described in the KIP and one-broker-per-machine increases the failure
> caused by disk failure by 10X. Since JBOD is strictly better than either of
> the two, it is also better then one-broker-per-multiple-disk which is
> somewhere between one-broker-per-disk and one-broker-per-machine.
>
> I personally think the benefits of JBOD design is worth the implementation
> complexity it introduces. I would also argue that it is reasonable for
> Kafka to manage this low level detail because Kafka is already exposing and
> managing replication factor of its data. But whether the complexity is
> worthwhile can be subjective and I can not prove my opinion. I am
> contributing significant amount of time to do this KIP because Kafka
> develops at LinkedIn believes it is useful and worth the effort. Yeah, it
> will be useful to see what everyone else think about it.
>
>
> Thanks,
> Dong
>
>
> On Mon, Feb 27, 2017 at 1:16 PM, Jun Rao  wrote:
>
> > Hi, Dong,
> >
> > For RAID5, I am not sure the rebuild cost is a big concern. If a disk
> > fails, typically an admin has to bring down the broker, replace the
> failed
> > disk with a new one, trigger the RAID rebuild, and bring up the broker.
> > This way, there is no performance impact at runtime due to rebuild. The
> > benefit is that a broker doesn't fail in a hard way when there is a disk
> > failure and can be brought down in a controlled way for maintenance.
> While
> > the broker is running with a failed disk, reads may be more expensive
> since
> > they have to be computed from the parity. However, if most reads are from
> > page cache, this may not be a big issue either. So, it would be useful to
> > do some tests on RAID5 before we completely rule it out.
> >
> > Regarding whether to remove an offline replica from the fetcher thread
> > immediately. What do we do when a failed replica is a leader? Do we do
> > nothing or mark the replica as not the leader immediately? Intuitively,
> it
> > seems it's better if 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-28 Thread Eno Thereska
Makes sense, thank you Dong.

Eno
> On 28 Feb 2017, at 01:51, Dong Lin  wrote:
> 
> Hi Jun,
> 
> In addition to the Eno's reference of why rebuild time with RAID-5 is more
> expensive, another concern is that RAID-5 will fail if more than one disk
> fails. JBOD is still works with 1+ disk failure and has better performance
> with one disk failure. These seems like good argument for using JBOD
> instead of RAID-5.
> 
> If a leader replica goes offline, the broker should first take all actions
> (i.e. remove the partition from fetcher thread) as if it has received
> StopReplicaRequest for this partition because the replica can no longer
> work anyway. It will also respond with error to any ProduceRequest and
> FetchRequest for partition. The broker notifies controller by writing
> notification znode in ZK. The controller learns the disk failure event from
> ZK, sends LeaderAndIsrRequest and receives LeaderAndIsrResponse to learn
> that the replica is offline. The controller will then elect new leader for
> this partition and sends LeaderAndIsrRequest/MetadataUpdateRequest to
> relevant brokers. The broker should stop adjusting the ISR for this
> partition as if the broker is already offline. I am not sure there is any
> inconsistency in broker's behavior when it is leader or follower. Is there
> any concern with this approach?
> 
> Thanks for catching this. I have removed that reference from the KIP.
> 
> Hi Eno,
> 
> Thank you for providing the reference of the RAID-5. In LinkedIn we have 10
> disks per Kafka machine. It will not be a show-stopper operationally for
> LinkedIn if we have to deploy one-broker-per-disk. On the other hand we
> previously discussed the advantage of JBOD vs. one-broker-per-disk or
> one-broker-per-machine. One-broker-per-disk suffers from the problems
> described in the KIP and one-broker-per-machine increases the failure
> caused by disk failure by 10X. Since JBOD is strictly better than either of
> the two, it is also better then one-broker-per-multiple-disk which is
> somewhere between one-broker-per-disk and one-broker-per-machine.
> 
> I personally think the benefits of JBOD design is worth the implementation
> complexity it introduces. I would also argue that it is reasonable for
> Kafka to manage this low level detail because Kafka is already exposing and
> managing replication factor of its data. But whether the complexity is
> worthwhile can be subjective and I can not prove my opinion. I am
> contributing significant amount of time to do this KIP because Kafka
> develops at LinkedIn believes it is useful and worth the effort. Yeah, it
> will be useful to see what everyone else think about it.
> 
> 
> Thanks,
> Dong
> 
> 
> On Mon, Feb 27, 2017 at 1:16 PM, Jun Rao  wrote:
> 
>> Hi, Dong,
>> 
>> For RAID5, I am not sure the rebuild cost is a big concern. If a disk
>> fails, typically an admin has to bring down the broker, replace the failed
>> disk with a new one, trigger the RAID rebuild, and bring up the broker.
>> This way, there is no performance impact at runtime due to rebuild. The
>> benefit is that a broker doesn't fail in a hard way when there is a disk
>> failure and can be brought down in a controlled way for maintenance. While
>> the broker is running with a failed disk, reads may be more expensive since
>> they have to be computed from the parity. However, if most reads are from
>> page cache, this may not be a big issue either. So, it would be useful to
>> do some tests on RAID5 before we completely rule it out.
>> 
>> Regarding whether to remove an offline replica from the fetcher thread
>> immediately. What do we do when a failed replica is a leader? Do we do
>> nothing or mark the replica as not the leader immediately? Intuitively, it
>> seems it's better if the broker acts consistently on a failed replica
>> whether it's a leader or a follower. For ISR churns, I was just pointing
>> out that if we don't send StopReplicaRequest to a broker to be shut down in
>> a controlled way, then the leader will shrink ISR, expand it and shrink it
>> again after the timeout.
>> 
>> The KIP seems to still reference "
>> /broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
>> 
>> Thanks,
>> 
>> Jun
>> 
>> On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin  wrote:
>> 
>>> Hey Jun,
>>> 
> >>> Thanks for the suggestion. I think it is a good idea to not put a created
> >>> flag in ZK and simply specify isNewReplica=true in LeaderAndIsrRequest if
> >>> the replica was in NewReplica state. It will only fail the replica creation
>> in
>>> the scenario that the controller fails after
>>> topic-creation/partition-reassignment/partition-number-change but before
>>> actually sends out the LeaderAndIsrRequest while there is ongoing disk
>>> failure, which should be pretty rare and acceptable. This should simplify
>>> the design of this KIP.
>>> 
>>> Regarding RAID-5, I think the concern with RAID-5/6 is not just about

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-27 Thread Dong Lin
Hi Jun,

In addition to Eno's reference on why the rebuild time with RAID-5 is more
expensive, another concern is that RAID-5 will fail if more than one disk
fails. JBOD still works with 1+ disk failures and has better performance
with one disk failure. These seem like good arguments for using JBOD
instead of RAID-5.

If a leader replica goes offline, the broker should first take all actions
(e.g. remove the partition from the fetcher thread) as if it had received a
StopReplicaRequest for this partition, because the replica can no longer
work anyway. It will also respond with an error to any ProduceRequest and
FetchRequest for the partition. The broker notifies the controller by writing
a notification znode in ZK. The controller learns of the disk failure event
from ZK, sends a LeaderAndIsrRequest and receives the LeaderAndIsrResponse to
learn that the replica is offline. The controller will then elect a new leader
for this partition and send LeaderAndIsrRequest/UpdateMetadataRequest to the
relevant brokers. The broker should stop adjusting the ISR for this partition
as if the broker were already offline. I am not sure there is any
inconsistency in the broker's behavior when it is leader or follower. Is there
any concern with this approach?
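
To make this sequence concrete, here is a rough, self-contained sketch of the
broker-side handling (the class and method names below are purely illustrative
placeholders, not the actual Kafka code):

    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative placeholder only; not the actual Kafka broker code.
    public class LogDirFailureHandlerSketch {

        // Partitions currently marked offline on this broker.
        private final Set<String> offlinePartitions = ConcurrentHashMap.newKeySet();

        // Called when an IOException is hit on a log directory.
        public void handleLogDirFailure(String failedDir,
                                        Map<String, List<String>> partitionsByDir) {
            for (String topicPartition : partitionsByDir.getOrDefault(failedDir, List.of())) {
                // 1. Act as if a StopReplicaRequest had been received: stop follower fetching.
                removeFromFetcherThreads(topicPartition);
                // 2. Mark the replica offline so Produce/Fetch requests get an error response.
                offlinePartitions.add(topicPartition);
                // 3. Stop ISR maintenance for partitions this broker was leading.
                stopIsrMaintenance(topicPartition);
            }
            // 4. Notify the controller via a notification znode; the controller then sends
            //    a LeaderAndIsrRequest and learns the offline replicas from the response.
            writeZkNotification(failedDir);
        }

        public boolean isOffline(String topicPartition) {
            return offlinePartitions.contains(topicPartition);
        }

        private void removeFromFetcherThreads(String tp) { /* placeholder */ }
        private void stopIsrMaintenance(String tp) { /* placeholder */ }
        private void writeZkNotification(String dir) { /* placeholder */ }
    }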

Thanks for catching this. I have removed that reference from the KIP.

Hi Eno,

Thank you for providing the reference on RAID-5. At LinkedIn we have 10
disks per Kafka machine. It will not be a show-stopper operationally for
LinkedIn if we have to deploy one-broker-per-disk. On the other hand, we
previously discussed the advantages of JBOD vs. one-broker-per-disk or
one-broker-per-machine. One-broker-per-disk suffers from the problems
described in the KIP, and one-broker-per-machine increases the broker failure
rate caused by disk failure by 10X. Since JBOD is strictly better than either
of the two, it is also better than one-broker-per-multiple-disks, which is
somewhere between one-broker-per-disk and one-broker-per-machine.

I personally think the benefits of the JBOD design are worth the implementation
complexity it introduces. I would also argue that it is reasonable for
Kafka to manage this low-level detail because Kafka is already exposing and
managing the replication factor of its data. But whether the complexity is
worthwhile can be subjective and I cannot prove my opinion. I am
contributing a significant amount of time to this KIP because Kafka
developers at LinkedIn believe it is useful and worth the effort. Yeah, it
will be useful to see what everyone else thinks about it.


Thanks,
Dong


On Mon, Feb 27, 2017 at 1:16 PM, Jun Rao  wrote:

> Hi, Dong,
>
> For RAID5, I am not sure the rebuild cost is a big concern. If a disk
> fails, typically an admin has to bring down the broker, replace the failed
> disk with a new one, trigger the RAID rebuild, and bring up the broker.
> This way, there is no performance impact at runtime due to rebuild. The
> benefit is that a broker doesn't fail in a hard way when there is a disk
> failure and can be brought down in a controlled way for maintenance. While
> the broker is running with a failed disk, reads may be more expensive since
> they have to be computed from the parity. However, if most reads are from
> page cache, this may not be a big issue either. So, it would be useful to
> do some tests on RAID5 before we completely rule it out.
>
> Regarding whether to remove an offline replica from the fetcher thread
> immediately. What do we do when a failed replica is a leader? Do we do
> nothing or mark the replica as not the leader immediately? Intuitively, it
> seems it's better if the broker acts consistently on a failed replica
> whether it's a leader or a follower. For ISR churns, I was just pointing
> out that if we don't send StopReplicaRequest to a broker to be shut down in
> a controlled way, then the leader will shrink ISR, expand it and shrink it
> again after the timeout.
>
> The KIP seems to still reference "
> /broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
>
> Thanks,
>
> Jun
>
> On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin  wrote:
>
> > Hey Jun,
> >
> > Thanks for the suggestion. I think it is a good idea to not put the created
> > flag in ZK and simply specify isNewReplica=true in LeaderAndIsrRequest if
> > the replica was in the NewReplica state. It will only fail the replica creation
> in
> > the scenario that the controller fails after
> > topic-creation/partition-reassignment/partition-number-change but before
> > actually sends out the LeaderAndIsrRequest while there is ongoing disk
> > failure, which should be pretty rare and acceptable. This should simplify
> > the design of this KIP.
> >
> > Regarding RAID-5, I think the concern with RAID-5/6 is not just about
> > performance when there is no failure. For example, RAID-5 can support up
> to
> > one disk failure and it takes time to rebuild disk after one disk
> > failure. RAID 5 implementations are susceptible to system failures
> because
> > of trends 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-27 Thread Eno Thereska
RAID-10's code is much simpler (just stripe plus mirror) and under failure the 
recovery is much faster since it just has to read from a mirror, not several 
disks to reconstruct the data. Of course, the price paid is that mirroring is 
more expensive in terms of storage space. 

E.g., see discussion at
https://community.spiceworks.com/topic/1155094-raid-10-and-raid-6-is-either-one-really-better-than-the-other.
So yes, if you can afford the space, go for RAID-10.

If utilising storage space well is what you care about, nothing beats utilising 
the JBOD disks one-by-one (while replicating at a higher level as Kafka does). 
However, there is now more complexity for Kafka.
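
As a back-of-the-envelope illustration of the storage-efficiency point (the
disk count, disk size and replication factor below are made-up numbers):

    public class UsableCapacitySketch {
        public static void main(String[] args) {
            int disks = 10;            // disks per broker (illustrative)
            double diskTb = 4.0;       // TB per disk (illustrative)
            int replicationFactor = 3; // Kafka-level replication

            double raw = disks * diskTb;

            // RAID-10 mirrors locally (x2) and Kafka still replicates across brokers (xRF).
            double raid10Usable = raw / 2 / replicationFactor;

            // JBOD uses each disk directly; only Kafka-level replication applies.
            double jbodUsable = raw / replicationFactor;

            System.out.printf("Raw per broker:  %.1f TB%n", raw);
            System.out.printf("RAID-10 + RF=3:  %.2f TB of unique data per broker%n", raid10Usable);
            System.out.printf("JBOD    + RF=3:  %.2f TB of unique data per broker%n", jbodUsable);
        }
    }

With these numbers JBOD holds twice as much unique data per broker as RAID-10
for the same Kafka replication factor.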

Dong, how many disks do you typically expect in a JBOD? 12 or 24 or higher? 
Are we absolutely sure that running 2-3 brokers/JBOD is a show-stopper 
operationally? I guess that would increase the rolling restart time (more 
brokers), but it would be great if we could have a conclusive strong argument 
against it. I don't have operational experience with Kafka, so I don't have a 
strong opinion, but is everyone else convinced? 

Eno

> On 27 Feb 2017, at 22:10, Jun Rao  wrote:
> 
> Hi, Eno,
> 
> Thanks for the pointers. Doesn't RAID-10 have a similar issue during
> rebuild? In both cases, all data on existing disks have to be read during
> rebuild? RAID-10 seems to still be used widely.
> 
> Jun
> 
> On Mon, Feb 27, 2017 at 1:38 PM, Eno Thereska 
> wrote:
> 
>> Unfortunately RAID-5/6 is not typically advised anymore due to failure
>> issues, as Dong mentions, e.g.:
>> http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/
>> 
>> Eno
>> 
>> 
>>> On 27 Feb 2017, at 21:16, Jun Rao  wrote:
>>> 
>>> Hi, Dong,
>>> 
>>> For RAID5, I am not sure the rebuild cost is a big concern. If a disk
>>> fails, typically an admin has to bring down the broker, replace the
>> failed
>>> disk with a new one, trigger the RAID rebuild, and bring up the broker.
>>> This way, there is no performance impact at runtime due to rebuild. The
>>> benefit is that a broker doesn't fail in a hard way when there is a disk
>>> failure and can be brought down in a controlled way for maintenance.
>> While
>>> the broker is running with a failed disk, reads may be more expensive
>> since
>>> they have to be computed from the parity. However, if most reads are from
>>> page cache, this may not be a big issue either. So, it would be useful to
>>> do some tests on RAID5 before we completely rule it out.
>>> 
>>> Regarding whether to remove an offline replica from the fetcher thread
>>> immediately. What do we do when a failed replica is a leader? Do we do
>>> nothing or mark the replica as not the leader immediately? Intuitively,
>> it
>>> seems it's better if the broker acts consistently on a failed replica
>>> whether it's a leader or a follower. For ISR churns, I was just pointing
>>> out that if we don't send StopReplicaRequest to a broker to be shut down
>> in
>>> a controlled way, then the leader will shrink ISR, expand it and shrink
>> it
>>> again after the timeout.
>>> 
>>> The KIP seems to still reference "
>>> /broker/topics/[topic]/partitions/[partitionId]/
>> controller_managed_state".
>>> 
>>> Thanks,
>>> 
>>> Jun
>>> 
>>> On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin  wrote:
>>> 
 Hey Jun,
 
 Thanks for the suggestion. I think it is a good idea to not put the created
 flag in ZK and simply specify isNewReplica=true in LeaderAndIsrRequest
>> if
 the replica was in the NewReplica state. It will only fail the replica
>> creation in
 the scenario that the controller fails after
 topic-creation/partition-reassignment/partition-number-change but
>> before
 actually sends out the LeaderAndIsrRequest while there is ongoing disk
 failure, which should be pretty rare and acceptable. This should
>> simplify
 the design of this KIP.
 
 Regarding RAID-5, I think the concern with RAID-5/6 is not just about
 performance when there is no failure. For example, RAID-5 can support
>> up to
 one disk failure and it takes time to rebuild disk after one disk
 failure. RAID 5 implementations are susceptible to system failures
>> because
 of trends regarding array rebuild time and the chance of drive failure
 during rebuild. There is no such performance degradation for JBOD and
>> JBOD
 can support multiple log directory failure without reducing performance
>> of
 good log directories. Would this be a reasonable reason for using JBOD
 instead of RAID-5/6?
 
 Previously we discussed wether broker should remove offline replica from
 replica fetcher thread. I still think it should do it instead of
>> printing a
 lot of error in the log4j 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-27 Thread Jun Rao
Hi, Eno,

Thanks for the pointers. Doesn't RAID-10 have a similar issue during
rebuild? In both cases, all data on existing disks have to be read during
rebuild? RAID-10 seems to still be used widely.

Jun

On Mon, Feb 27, 2017 at 1:38 PM, Eno Thereska 
wrote:

> Unfortunately RAID-5/6 is not typically advised anymore due to failure
> issues, as Dong mentions, e.g.:
> http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/
>
> Eno
>
>
> > On 27 Feb 2017, at 21:16, Jun Rao  wrote:
> >
> > Hi, Dong,
> >
> > For RAID5, I am not sure the rebuild cost is a big concern. If a disk
> > fails, typically an admin has to bring down the broker, replace the
> failed
> > disk with a new one, trigger the RAID rebuild, and bring up the broker.
> > This way, there is no performance impact at runtime due to rebuild. The
> > benefit is that a broker doesn't fail in a hard way when there is a disk
> > failure and can be brought down in a controlled way for maintenance.
> While
> > the broker is running with a failed disk, reads may be more expensive
> since
> > they have to be computed from the parity. However, if most reads are from
> > page cache, this may not be a big issue either. So, it would be useful to
> > do some tests on RAID5 before we completely rule it out.
> >
> > Regarding whether to remove an offline replica from the fetcher thread
> > immediately. What do we do when a failed replica is a leader? Do we do
> > nothing or mark the replica as not the leader immediately? Intuitively,
> it
> > seems it's better if the broker acts consistently on a failed replica
> > whether it's a leader or a follower. For ISR churns, I was just pointing
> > out that if we don't send StopReplicaRequest to a broker to be shut down
> in
> > a controlled way, then the leader will shrink ISR, expand it and shrink
> it
> > again after the timeout.
> >
> > The KIP seems to still reference "
> > /broker/topics/[topic]/partitions/[partitionId]/
> controller_managed_state".
> >
> > Thanks,
> >
> > Jun
> >
> > On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin  wrote:
> >
> >> Hey Jun,
> >>
> >> Thanks for the suggestion. I think it is a good idea to not put the created
> >> flag in ZK and simply specify isNewReplica=true in LeaderAndIsrRequest
> if
> >> the replica was in the NewReplica state. It will only fail the replica
> creation in
> >> the scenario that the controller fails after
> >> topic-creation/partition-reassignment/partition-number-change but
> before
> >> actually sends out the LeaderAndIsrRequest while there is ongoing disk
> >> failure, which should be pretty rare and acceptable. This should
> simplify
> >> the design of this KIP.
> >>
> >> Regarding RAID-5, I think the concern with RAID-5/6 is not just about
> >> performance when there is no failure. For example, RAID-5 can support
> up to
> >> one disk failure and it takes time to rebuild disk after one disk
> >> failure. RAID 5 implementations are susceptible to system failures
> because
> >> of trends regarding array rebuild time and the chance of drive failure
> >> during rebuild. There is no such performance degradation for JBOD and
> JBOD
> >> can support multiple log directory failure without reducing performance
> of
> >> good log directories. Would this be a reasonable reason for using JBOD
> >> instead of RAID-5/6?
> >>
> >> Previously we discussed wether broker should remove offline replica from
> >> replica fetcher thread. I still think it should do it instead of
> printing a
> >> lot of error in the log4j log. We can still let controller send
> >> StopReplicaRequest to the broker. I am not sure I undertand why allowing
> >> broker to remove offline replica from fetcher thread will increase
> churns
> >> in ISR. Do you think this is concern with this approach?
> >>
> >> I have updated the KIP to remove created flag from ZK and change the
> filed
> >> name to isNewReplica. Can you check if there is any issue with the
> latest
> >> KIP? Thanks for your time!
> >>
> >> Regards,
> >> Dong
> >>
> >>
> >> On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao  wrote:
> >>
> >>> Hi, Dong,
> >>>
> >>> Thanks for the reply.
> >>>
> >>> Personally, I'd prefer not to write the created flag per replica in ZK.
> >>> Your suggestion of disabling replica creation if there is a bad log
> >>> directory on the broker could work. The only thing is that it may delay
> >> the
> >>> creation of new replicas. I was thinking that an alternative is to
> extend
> >>> LeaderAndIsrRequest by adding a isNewReplica field per replica. That
> >> field
> >>> will be set when a replica is transitioning from the NewReplica state
> to
> >>> Online state. Then, when a broker receives a LeaderAndIsrRequest, if a
> >>> replica is marked as the new replica, it will be created on a good log
> >>> directory, if not already present. Otherwise, it only creates the
> replica
> >>> 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-27 Thread Eno Thereska
Unfortunately RAID-5/6 is not typically advised anymore due to failure issues, 
as Dong mentions, e.g.: 
http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/ 


Eno


> On 27 Feb 2017, at 21:16, Jun Rao  wrote:
> 
> Hi, Dong,
> 
> For RAID5, I am not sure the rebuild cost is a big concern. If a disk
> fails, typically an admin has to bring down the broker, replace the failed
> disk with a new one, trigger the RAID rebuild, and bring up the broker.
> This way, there is no performance impact at runtime due to rebuild. The
> benefit is that a broker doesn't fail in a hard way when there is a disk
> failure and can be brought down in a controlled way for maintenance. While
> the broker is running with a failed disk, reads may be more expensive since
> they have to be computed from the parity. However, if most reads are from
> page cache, this may not be a big issue either. So, it would be useful to
> do some tests on RAID5 before we completely rule it out.
> 
> Regarding whether to remove an offline replica from the fetcher thread
> immediately. What do we do when a failed replica is a leader? Do we do
> nothing or mark the replica as not the leader immediately? Intuitively, it
> seems it's better if the broker acts consistently on a failed replica
> whether it's a leader or a follower. For ISR churns, I was just pointing
> out that if we don't send StopReplicaRequest to a broker to be shut down in
> a controlled way, then the leader will shrink ISR, expand it and shrink it
> again after the timeout.
> 
> The KIP seems to still reference "
> /broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
> 
> Thanks,
> 
> Jun
> 
> On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin  wrote:
> 
>> Hey Jun,
>> 
>> Thanks for the suggestion. I think it is a good idea to not put the created
>> flag in ZK and simply specify isNewReplica=true in LeaderAndIsrRequest if
>> the replica was in the NewReplica state. It will only fail the replica creation in
>> the scenario that the controller fails after
>> topic-creation/partition-reassignment/partition-number-change but before
>> actually sends out the LeaderAndIsrRequest while there is ongoing disk
>> failure, which should be pretty rare and acceptable. This should simplify
>> the design of this KIP.
>> 
>> Regarding RAID-5, I think the concern with RAID-5/6 is not just about
>> performance when there is no failure. For example, RAID-5 can support up to
>> one disk failure and it takes time to rebuild disk after one disk
>> failure. RAID 5 implementations are susceptible to system failures because
>> of trends regarding array rebuild time and the chance of drive failure
>> during rebuild. There is no such performance degradation for JBOD and JBOD
>> can support multiple log directory failure without reducing performance of
>> good log directories. Would this be a reasonable reason for using JBOD
>> instead of RAID-5/6?
>> 
>> Previously we discussed wether broker should remove offline replica from
>> replica fetcher thread. I still think it should do it instead of printing a
>> lot of error in the log4j log. We can still let controller send
>> StopReplicaRequest to the broker. I am not sure I undertand why allowing
>> broker to remove offline replica from fetcher thread will increase churns
>> in ISR. Do you think this is concern with this approach?
>> 
>> I have updated the KIP to remove created flag from ZK and change the filed
>> name to isNewReplica. Can you check if there is any issue with the latest
>> KIP? Thanks for your time!
>> 
>> Regards,
>> Dong
>> 
>> 
>> On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao  wrote:
>> 
>>> Hi, Dong,
>>> 
>>> Thanks for the reply.
>>> 
>>> Personally, I'd prefer not to write the created flag per replica in ZK.
>>> Your suggestion of disabling replica creation if there is a bad log
>>> directory on the broker could work. The only thing is that it may delay
>> the
>>> creation of new replicas. I was thinking that an alternative is to extend
>>> LeaderAndIsrRequest by adding a isNewReplica field per replica. That
>> field
>>> will be set when a replica is transitioning from the NewReplica state to
>>> Online state. Then, when a broker receives a LeaderAndIsrRequest, if a
>>> replica is marked as the new replica, it will be created on a good log
>>> directory, if not already present. Otherwise, it only creates the replica
>>> if all log directories are good and the replica is not already present.
>>> This way, we don't delay the processing of new replicas in the common
>> case.
>>> 
>>> I am ok with not persisting the offline replicas in ZK and just
>> discovering
>>> them through the LeaderAndIsrRequest. It handles the cases when a broker
>>> starts up with bad log directories better. So, the additional overhead of
>>> rediscovering the offline replicas is justified.
>>> 
>>> 
>>> Another high level question. The 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-27 Thread Jun Rao
Hi, Dong,

For RAID5, I am not sure the rebuild cost is a big concern. If a disk
fails, typically an admin has to bring down the broker, replace the failed
disk with a new one, trigger the RAID rebuild, and bring up the broker.
This way, there is no performance impact at runtime due to rebuild. The
benefit is that a broker doesn't fail in a hard way when there is a disk
failure and can be brought down in a controlled way for maintenance. While
the broker is running with a failed disk, reads may be more expensive since
they have to be computed from the parity. However, if most reads are from
page cache, this may not be a big issue either. So, it would be useful to
do some tests on RAID5 before we completely rule it out.

Regarding whether to remove an offline replica from the fetcher thread
immediately. What do we do when a failed replica is a leader? Do we do
nothing or mark the replica as not the leader immediately? Intuitively, it
seems it's better if the broker acts consistently on a failed replica
whether it's a leader or a follower. For ISR churns, I was just pointing
out that if we don't send StopReplicaRequest to a broker to be shut down in
a controlled way, then the leader will shrink ISR, expand it and shrink it
again after the timeout.

The KIP seems to still reference "
/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".

Thanks,

Jun

On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin  wrote:

> Hey Jun,
>
> Thanks for the suggestion. I think it is a good idea to not put the created
> flag in ZK and simply specify isNewReplica=true in LeaderAndIsrRequest if
> the replica was in the NewReplica state. It will only fail the replica creation in
> the scenario that the controller fails after
> topic-creation/partition-reassignment/partition-number-change but before
> actually sends out the LeaderAndIsrRequest while there is ongoing disk
> failure, which should be pretty rare and acceptable. This should simplify
> the design of this KIP.
>
> Regarding RAID-5, I think the concern with RAID-5/6 is not just about
> performance when there is no failure. For example, RAID-5 can support up to
> one disk failure and it takes time to rebuild disk after one disk
> failure. RAID 5 implementations are susceptible to system failures because
> of trends regarding array rebuild time and the chance of drive failure
> during rebuild. There is no such performance degradation for JBOD and JBOD
> can support multiple log directory failure without reducing performance of
> good log directories. Would this be a reasonable reason for using JBOD
> instead of RAID-5/6?
>
> Previously we discussed wether broker should remove offline replica from
> replica fetcher thread. I still think it should do it instead of printing a
> lot of error in the log4j log. We can still let controller send
> StopReplicaRequest to the broker. I am not sure I undertand why allowing
> broker to remove offline replica from fetcher thread will increase churns
> in ISR. Do you think this is concern with this approach?
>
> I have updated the KIP to remove created flag from ZK and change the filed
> name to isNewReplica. Can you check if there is any issue with the latest
> KIP? Thanks for your time!
>
> Regards,
> Dong
>
>
> On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao  wrote:
>
> > Hi, Dong,
> >
> > Thanks for the reply.
> >
> > Personally, I'd prefer not to write the created flag per replica in ZK.
> > Your suggestion of disabling replica creation if there is a bad log
> > directory on the broker could work. The only thing is that it may delay
> the
> > creation of new replicas. I was thinking that an alternative is to extend
> > LeaderAndIsrRequest by adding a isNewReplica field per replica. That
> field
> > will be set when a replica is transitioning from the NewReplica state to
> > Online state. Then, when a broker receives a LeaderAndIsrRequest, if a
> > replica is marked as the new replica, it will be created on a good log
> > directory, if not already present. Otherwise, it only creates the replica
> > if all log directories are good and the replica is not already present.
> > This way, we don't delay the processing of new replicas in the common
> case.
> >
> > I am ok with not persisting the offline replicas in ZK and just
> discovering
> > them through the LeaderAndIsrRequest. It handles the cases when a broker
> > starts up with bad log directories better. So, the additional overhead of
> > rediscovering the offline replicas is justified.
> >
> >
> > Another high level question. The proposal rejected RAID5/6 since it adds
> > additional I/Os. The main issue with RAID5 is that to write a block that
> > doesn't match the RAID stripe size, we have to first read the old parity
> to
> > compute the new one, which increases the number of I/Os (
> > http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you
> have
> > tested RAID5's performance by creating a file system whose block size
> > 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-25 Thread Dong Lin
Hey Jun,

Thanks for the suggestion. I think it is a good idea to not put the created
flag in ZK and simply specify isNewReplica=true in LeaderAndIsrRequest if
the replica was in the NewReplica state. This will only fail replica creation in
the scenario where the controller fails after
topic-creation/partition-reassignment/partition-number-change but before
it actually sends out the LeaderAndIsrRequest while there is an ongoing disk
failure, which should be pretty rare and acceptable. This should simplify
the design of this KIP.

Regarding RAID-5, I think the concern with RAID-5/6 is not just about
performance when there is no failure. For example, RAID-5 can tolerate at most
one disk failure, and it takes time to rebuild the disk after a disk
failure. RAID-5 implementations are susceptible to system failures because
of trends regarding array rebuild time and the chance of drive failure
during rebuild. There is no such performance degradation for JBOD, and JBOD
can survive multiple log directory failures without reducing the performance
of the good log directories. Would this be a reasonable reason for using JBOD
instead of RAID-5/6?

Previously we discussed whether the broker should remove an offline replica
from the replica fetcher thread. I still think it should do so instead of
printing a lot of errors in the log4j log. We can still let the controller send
a StopReplicaRequest to the broker. I am not sure I understand why allowing the
broker to remove an offline replica from the fetcher thread will increase churn
in the ISR. Do you think this is a concern with this approach?

I have updated the KIP to remove the created flag from ZK and change the field
name to isNewReplica. Can you check if there is any issue with the latest
KIP? Thanks for your time!

Regards,
Dong


On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao  wrote:

> Hi, Dong,
>
> Thanks for the reply.
>
> Personally, I'd prefer not to write the created flag per replica in ZK.
> Your suggestion of disabling replica creation if there is a bad log
> directory on the broker could work. The only thing is that it may delay the
> creation of new replicas. I was thinking that an alternative is to extend
> LeaderAndIsrRequest by adding a isNewReplica field per replica. That field
> will be set when a replica is transitioning from the NewReplica state to
> Online state. Then, when a broker receives a LeaderAndIsrRequest, if a
> replica is marked as the new replica, it will be created on a good log
> directory, if not already present. Otherwise, it only creates the replica
> if all log directories are good and the replica is not already present.
> This way, we don't delay the processing of new replicas in the common case.
>
> I am ok with not persisting the offline replicas in ZK and just discovering
> them through the LeaderAndIsrRequest. It handles the cases when a broker
> starts up with bad log directories better. So, the additional overhead of
> rediscovering the offline replicas is justified.
>
>
> Another high level question. The proposal rejected RAID5/6 since it adds
> additional I/Os. The main issue with RAID5 is that to write a block that
> doesn't match the RAID stripe size, we have to first read the old parity to
> compute the new one, which increases the number of I/Os (
> http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have
> tested RAID5's performance by creating a file system whose block size
> matches the RAID stripe size (https://www.percona.com/blog/
> 2011/12/16/setting-up-xfs-the-simple-edition/). This way, writing a block
> doesn't require a read first. A large block size may increase the amount of
> data writes, when the same block has to be written to disk multiple times.
> However, this is probably ok in Kafka's use case since we batch the I/O
> flush already. As you can see, we will be adding some complexity to support
> JBOD in Kafka one way or another. If we can tune the performance of RAID5
> to match that of RAID10, perhaps using RAID5 is a simpler solution.
>
> Thanks,
>
> Jun
>
>
> On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin  wrote:
>
> > Hey Jun,
> >
> > I don't think we should allow failed replicas to be re-created on the
> good
> > disks. Say there are 2 disks and each of them is 51% loaded. If any disk
> > fail, and we allow replicas to be re-created on the other disks, both
> disks
> > will fail. Alternatively we can disable replica creation if there is bad
> > disk on a broker. I personally think it is worth the additional
> complexity
> > in the broker to store created replicas in ZK so that we allow new
> replicas
> > to be created on the broker even when there is bad log directory. This
> > approach won't add complexity in the controller. But I am fine with
> > disabling replica creation when there is bad log directory that if it is
> > the only blocking issue for this KIP.
> >
> > Whether we store created flags is independent of whether/how we store
> > offline replicas. Per our previous discussion, do you think it is OK not
> 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-25 Thread Jun Rao
Hi, Dong,

Thanks for the reply.

Personally, I'd prefer not to write the created flag per replica in ZK.
Your suggestion of disabling replica creation if there is a bad log
directory on the broker could work. The only thing is that it may delay the
creation of new replicas. I was thinking that an alternative is to extend
LeaderAndIsrRequest by adding an isNewReplica field per replica. That field
will be set when a replica is transitioning from the NewReplica state to
Online state. Then, when a broker receives a LeaderAndIsrRequest, if a
replica is marked as the new replica, it will be created on a good log
directory, if not already present. Otherwise, it only creates the replica
if all log directories are good and the replica is not already present.
This way, we don't delay the processing of new replicas in the common case.
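
To spell the proposed rule out, a minimal sketch (illustrative names only; the
real check would live in the broker's LeaderAndIsrRequest handling):

    public class ReplicaCreationRuleSketch {

        /**
         * Decide whether the broker should create the local log for a replica
         * named in a LeaderAndIsrRequest, per the rule described above.
         *
         * @param isNewReplica     replica is transitioning NewReplica -> Online
         * @param alreadyPresent   a log for this replica exists on some good log directory
         * @param allLogDirsOnline none of the configured log directories have failed
         */
        public static boolean shouldCreateLog(boolean isNewReplica,
                                              boolean alreadyPresent,
                                              boolean allLogDirsOnline) {
            if (alreadyPresent) {
                return false;   // nothing to do, the replica already exists locally
            }
            if (isNewReplica) {
                return true;    // brand-new replica: create it on a good log directory
            }
            // Existing replica that is missing locally: only create it when we are sure
            // it is not sitting on a failed (offline) log directory.
            return allLogDirsOnline;
        }
    }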

I am ok with not persisting the offline replicas in ZK and just discovering
them through the LeaderAndIsrRequest. It handles the cases when a broker
starts up with bad log directories better. So, the additional overhead of
rediscovering the offline replicas is justified.


Another high level question. The proposal rejected RAID5/6 since it adds
additional I/Os. The main issue with RAID5 is that to write a block that
doesn't match the RAID stripe size, we have to first read the old parity to
compute the new one, which increases the number of I/Os (
http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have
tested RAID5's performance by creating a file system whose block size
matches the RAID stripe size (https://www.percona.com/blog/
2011/12/16/setting-up-xfs-the-simple-edition/). This way, writing a block
doesn't require a read first. A large block size may increase the amount of
data writes, when the same block has to be written to disk multiple times.
However, this is probably ok in Kafka's use case since we batch the I/O
flush already. As you can see, we will be adding some complexity to support
JBOD in Kafka one way or another. If we can tune the performance of RAID5
to match that of RAID10, perhaps using RAID5 is a simpler solution.
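
To put a rough number on the small-write penalty described in the linked
article (a sub-stripe RAID-5 write costs read old data + read old parity +
write data + write parity; the workload size below is made up for
illustration):

    public class Raid5WritePenaltySketch {
        public static void main(String[] args) {
            long applicationWrites = 10_000;  // logical writes issued by the broker (illustrative)

            // JBOD / full-stripe-aligned writes: roughly 1 device write per logical write.
            long jbodIos = applicationWrites;

            // Unaligned RAID-5 writes: 2 reads (old data, old parity) + 2 writes (new data, new parity).
            long raid5SmallWriteIos = applicationWrites * 4;

            System.out.println("Device I/Os, JBOD:                    " + jbodIos);
            System.out.println("Device I/Os, RAID-5 sub-stripe write: " + raid5SmallWriteIos);
            System.out.println("Amplification: " + (raid5SmallWriteIos / (double) jbodIos) + "x");
            // Matching the file system block size to the RAID stripe size (as suggested above)
            // avoids the read-modify-write and brings the penalty back toward 1x.
        }
    }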

Thanks,

Jun


On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin  wrote:

> Hey Jun,
>
> I don't think we should allow failed replicas to be re-created on the good
> disks. Say there are 2 disks and each of them is 51% loaded. If any disk
> fail, and we allow replicas to be re-created on the other disks, both disks
> will fail. Alternatively we can disable replica creation if there is bad
> disk on a broker. I personally think it is worth the additional complexity
> in the broker to store created replicas in ZK so that we allow new replicas
> to be created on the broker even when there is bad log directory. This
> approach won't add complexity in the controller. But I am fine with
> disabling replica creation when there is bad log directory that if it is
> the only blocking issue for this KIP.
>
> Whether we store created flags is independent of whether/how we store
> offline replicas. Per our previous discussion, do you think it is OK not
> store offline replicas in ZK and propagate the offline replicas from broker
> to controller via LeaderAndIsrRequest?
>
> Thanks,
> Dong
>


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-24 Thread Dong Lin
Hey Jun,

I don't think we should allow failed replicas to be re-created on the good
disks. Say there are 2 disks and each of them is 51% loaded. If either disk
fails, and we allow its replicas to be re-created on the other disk, both disks
will fail. Alternatively, we can disable replica creation if there is a bad
disk on a broker. I personally think it is worth the additional complexity
in the broker to store created replicas in ZK so that we allow new replicas
to be created on the broker even when there is a bad log directory. This
approach won't add complexity in the controller. But I am fine with
disabling replica creation when there is a bad log directory if that is
the only blocking issue for this KIP.

Whether we store created flags is independent of whether/how we store
offline replicas. Per our previous discussion, do you think it is OK not
store offline replicas in ZK and propagate the offline replicas from broker
to controller via LeaderAndIsrRequest?

Thanks,
Dong


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-23 Thread Dong Lin
Hey Jun,

I think there is one simpler design that doesn't need to add a "create" flag
to LeaderAndIsrRequest and also removes the need for the controller to
track/update which replicas are created. The idea is for each broker to
persist the created replicas in a per-broker-per-topic znode. When a replica
is created or deleted, the broker updates the znode accordingly. When the
broker receives a LeaderAndIsrRequest, it learns the "create" flag from its
cache of these znodes' data. When a broker starts, it does need to read a
number of znodes proportional to the number of topics on its disks. But the
controller still needs to learn about offline replicas from LeaderAndIsrResponse.
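
Roughly, the per-broker bookkeeping could look like the sketch below (the znode
layout and names are not part of the KIP and are purely illustrative; the ZK
reads/writes are elided, and the offline-derivation at the end reflects the
earlier point that offline replicas can be derived as created replicas minus
those found on good log directories):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch only: the "znode" here is an in-memory map standing in for the
    // per-broker-per-topic data; not real Kafka code.
    public class CreatedReplicaCacheSketch {

        // topic -> partitions this broker has created (mirrors the per-broker-per-topic znode)
        private final Map<String, Set<Integer>> created = new HashMap<>();

        public void markCreated(String topic, int partition) {
            created.computeIfAbsent(topic, t -> new HashSet<>()).add(partition);
            // ... also write the updated set back to the per-broker-per-topic znode
        }

        public void markDeleted(String topic, int partition) {
            Set<Integer> parts = created.get(topic);
            if (parts != null) {
                parts.remove(partition);
                // ... also update the znode
            }
        }

        // On startup: offline replicas = created replicas minus replicas found on good log dirs.
        public Set<String> deriveOfflineReplicas(Set<String> foundOnGoodLogDirs) {
            Set<String> offline = new HashSet<>();
            created.forEach((topic, parts) ->
                parts.forEach(p -> {
                    String tp = topic + "-" + p;
                    if (!foundOnGoodLogDirs.contains(tp)) {
                        offline.add(tp);
                    }
                }));
            return offline;
        }
    }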

I think this is better than the current design. Do you have any concern
with this design?

Thanks,
Dong


On Thu, Feb 23, 2017 at 7:12 PM, Dong Lin  wrote:

> Hey Jun,
>
> Sure, here is my explanation.
>
> Design B would not work if it doesn't store created replicas in the ZK.
> For example, say broker B is healthy when it is shut down. At this moment no
> offline replica is written in ZK for this broker. Suppose log directory is
> damaged when broker is offline, then when this broker starts, it won't know
> which replicas are in the bad log directory. And it won't be able to
> specify those offline replicas in /failed-log-directory either.
>
> Let's say design B stores created replica in ZK. Then the next problem is
> that, in the scenario that multiple log directories are damaged while
> broker is offline, when broker starts, it won't be able to know the exact
> list of offline replicas on each bad log directory. All it knows is the
> offline replicas on all those bad log directories. Thus it is impossible
> for broker to specify offline replicas per log directory in this scenario.
>
> I agree with your observation that, if the admin replaces dir1 with a
> good empty disk but leaves dir2 untouched, design A won't create the replica
> whereas design B can create it. But I am not sure that is a problem which
> we want to optimize. It seems reasonable for admin to fix both log
> directories in practice. If admin fixes only one of the two log
> directories, we can say it is a partial fix and Kafka won't re-create any
> offline replicas on dir1 and dir2. Similar to extra round of
> LeaderAndIsrRequest in case of log failure, I think this is also a pretty
> minor issue with design B.
>
> Thanks,
> Dong
>
>
> On Thu, Feb 23, 2017 at 6:46 PM, Jun Rao  wrote:
>
>> Hi, Dong,
>>
>> My replies are inlined below.
>>
>> On Thu, Feb 23, 2017 at 4:47 PM, Dong Lin  wrote:
>>
>> > Hey Jun,
>> >
>> > Thanks for you reply! Let me first comment on the things that you
>> listed as
>> > advantage of B over A.
>> >
>> > 1) No change in LeaderAndIsrRequest protocol.
>> >
>> > I agree with this.
>> >
>> > 2) Step 1. One less round of LeaderAndIsrRequest and no additional ZK
>> > writes to record the created flag.
>> >
>> > I don't think this is true. There will be one round of
>> LeaderAndIsrRequest
>> > in both A and B. In the design A controller needs to write to ZK once to
>> > record this replica as created. The design B the broker needs to write
>> > zookeeper once to record this replica as created. So there is same
>> number
>> > of LeaderAndIsrRequest and ZK writes.
>> >
>> > Broker needs to record created replica in design B so that when it
>> > bootstraps with failed log directory, the broker can derive the offline
>> > replicas as the difference between created replicas and replicas found
>> on
>> > good log directories.
>> >
>> >
>> Design B actually doesn't write created replicas in ZK. When a broker
>> starts up, all offline replicas are stored in the /failed-log-directory
>> path in ZK. So if a replica is not there and is not in the live log
>> directories either, it's never created. Does this work?
>>
>>
>>
>> > 3) Step 2. One less round of LeaderAndIsrRequest and no additional
>> logic to
>> > handle LeaderAndIsrResponse.
>> >
>> > While I agree there is one less round of LeaderAndIsrRequest in design
>> B, I
>> > don't think one additional LeaderAndIsrRequest to handle log directory
>> > failure is a big deal given that it doesn't happen frequently.
>> >
>> > Also, while there is no additional logic to handle LeaderAndIsrResponse
>> in
>> > design B, I actually think this is something that controller should do
>> > anyway. Say the broker stops responding to any requests without removing
>> > itself from zookeeper, the only way for controller to realize this and
>> > re-elect leader is to send request to this broker and handle response.
>> The
>> > is a problem that we don't do it as of now.
>> >
>> > 4) Step 6. Additional ZK reads proportional to # of failed log
>> directories,
>> > instead of # of partitions.
>> >
>> > If one znode is able to describe all topic partitions in a log
>> directory,
>> > then the existing znode /brokers/topics/[topic] should be able to
>> describe
>> > created replicas in 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-23 Thread Dong Lin
Hey Jun,

Sure, here is my explanation.

Design B would not work if it doesn't store created replicas in ZK. For
example, say broker B is healthy when it is shut down. At this moment no
offline replica is written in ZK for this broker. Suppose a log directory is
damaged while the broker is offline; then when this broker starts, it won't
know which replicas are in the bad log directory. And it won't be able to
specify those offline replicas in /failed-log-directory either.

Let's say design B stores created replicas in ZK. Then the next problem is
that, in the scenario where multiple log directories are damaged while the
broker is offline, when the broker starts, it won't be able to know the exact
list of offline replicas on each bad log directory. All it knows is the set of
offline replicas across all those bad log directories. Thus it is impossible
for the broker to specify offline replicas per log directory in this scenario.

I agree with your observation that, if the admin replaces dir1 with a
good empty disk but leaves dir2 untouched, design A won't create the replica
whereas design B can create it. But I am not sure that is a problem which
we want to optimize for. It seems reasonable for the admin to fix both log
directories in practice. If the admin fixes only one of the two log
directories, we can say it is a partial fix and Kafka won't re-create any
offline replicas on dir1 and dir2. Similar to the extra round of
LeaderAndIsrRequest in case of log directory failure, I think this is also a
pretty minor issue with design B.

Thanks,
Dong


On Thu, Feb 23, 2017 at 6:46 PM, Jun Rao  wrote:

> Hi, Dong,
>
> My replies are inlined below.
>
> On Thu, Feb 23, 2017 at 4:47 PM, Dong Lin  wrote:
>
> > Hey Jun,
> >
> > Thanks for you reply! Let me first comment on the things that you listed
> as
> > advantage of B over A.
> >
> > 1) No change in LeaderAndIsrRequest protocol.
> >
> > I agree with this.
> >
> > 2) Step 1. One less round of LeaderAndIsrRequest and no additional ZK
> > writes to record the created flag.
> >
> > I don't think this is true. There will be one round of
> LeaderAndIsrRequest
> > in both A and B. In the design A controller needs to write to ZK once to
> > record this replica as created. The design B the broker needs to write
> > zookeeper once to record this replica as created. So there is same number
> > of LeaderAndIsrRequest and ZK writes.
> >
> > Broker needs to record created replica in design B so that when it
> > bootstraps with failed log directory, the broker can derive the offline
> > replicas as the difference between created replicas and replicas found on
> > good log directories.
> >
> >
> Design B actually doesn't write created replicas in ZK. When a broker
> starts up, all offline replicas are stored in the /failed-log-directory
> path in ZK. So if a replica is not there and is not in the live log
> directories either, it's never created. Does this work?
>
>
>
> > 3) Step 2. One less round of LeaderAndIsrRequest and no additional logic
> to
> > handle LeaderAndIsrResponse.
> >
> > While I agree there is one less round of LeaderAndIsrRequest in design
> B, I
> > don't think one additional LeaderAndIsrRequest to handle log directory
> > failure is a big deal given that it doesn't happen frequently.
> >
> > Also, while there is no additional logic to handle LeaderAndIsrResponse
> in
> > design B, I actually think this is something that controller should do
> > anyway. Say the broker stops responding to any requests without removing
> > itself from zookeeper, the only way for controller to realize this and
> > re-elect leader is to send request to this broker and handle response.
> The
> > is a problem that we don't do it as of now.
> >
> > 4) Step 6. Additional ZK reads proportional to # of failed log
> directories,
> > instead of # of partitions.
> >
> > If one znode is able to describe all topic partitions in a log directory,
> > then the existing znode /brokers/topics/[topic] should be able to
> describe
> > created replicas in addition to the assigned replicas for every partition
> > of the topic. In this case, design A requires no additional ZK reads
> > whereas design B ZK reads proportional to # of failed log directories.
> >
> > 5) Step 3. In design A, if a broker is restarted and the failed log
> > directory is unreadable, the broker doesn't know which replicas are on
> the
> > failed log directory. So, when the broker receives the LeadAndIsrRequest
> > with created = false, it's bit hard for the broker to decide whether it
> > should create the missing replica on other log directories. This is
> easier
> > in design B since the list of failed replicas are persisted in ZK.
> >
> > I don't understand why it is hard for broker to make decision in design
> A.
> > With design A, if a broker is started with a failed log directory and it
> > receives LeaderAndIsrRequest with created=false for a replica that can
> not
> > be found on any good log directory, broker will not create this 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-23 Thread Dong Lin
Hey Jun,

Thanks for your reply! Let me first comment on the things that you listed as
advantage of B over A.

1) No change in LeaderAndIsrRequest protocol.

I agree with this.

2) Step 1. One less round of LeaderAndIsrRequest and no additional ZK
writes to record the created flag.

I don't think this is true. There will be one round of LeaderAndIsrRequest
in both A and B. In design A the controller needs to write to ZK once to
record this replica as created. In design B the broker needs to write to
ZooKeeper once to record this replica as created. So there is the same number
of LeaderAndIsrRequests and ZK writes.

The broker needs to record created replicas in design B so that when it
bootstraps with a failed log directory, it can derive the offline
replicas as the difference between the created replicas and the replicas found
on good log directories.

3) Step 2. One less round of LeaderAndIsrRequest and no additional logic to
handle LeaderAndIsrResponse.

While I agree there is one less round of LeaderAndIsrRequest in design B, I
don't think one additional LeaderAndIsrRequest to handle log directory
failure is a big deal given that it doesn't happen frequently.

Also, while there is no additional logic to handle LeaderAndIsrResponse in
design B, I actually think this is something that the controller should do
anyway. Say the broker stops responding to any requests without removing
itself from ZooKeeper; the only way for the controller to realize this and
re-elect the leader is to send a request to this broker and handle the
response. It is a problem that we don't do this as of now.

4) Step 6. Additional ZK reads proportional to # of failed log directories,
instead of # of partitions.

If one znode is able to describe all topic partitions in a log directory,
then the existing znode /brokers/topics/[topic] should be able to describe
created replicas in addition to the assigned replicas for every partition
of the topic. In this case, design A requires no additional ZK reads
whereas design B requires ZK reads proportional to # of failed log directories.

5) Step 3. In design A, if a broker is restarted and the failed log
directory is unreadable, the broker doesn't know which replicas are on the
failed log directory. So, when the broker receives the LeaderAndIsrRequest
with created = false, it's a bit hard for the broker to decide whether it
should create the missing replica on other log directories. This is easier
in design B since the list of failed replicas is persisted in ZK.

I don't understand why it is hard for the broker to make this decision in
design A. With design A, if a broker is started with a failed log directory and
it receives a LeaderAndIsrRequest with created=false for a replica that cannot
be found on any good log directory, the broker will not create this replica. Is
there any drawback with this approach?


Here is my summary of pros and cons of design B as compared to design A.

pros:

1) No change to LeaderAndIsrRequest.
2) One less round of LeaderAndIsrRequest in case of log directory failure.

cons:

1) It is impossible for the broker to figure out the log directory of offline
replicas for failed-log-directory/[directory] if multiple log directories
are unreadable when the broker starts.

2) The znode size limit of failed-log-directory/[directory] essentially
limits the number of topic partitions that can exist on a log directory. It
becomes more of a problem when a broker is configured to use multiple log
directories, each of which is a RAID-10 array of large capacity. While this may
not be a problem in practice with an additional requirement (e.g. don't use
more than one log directory if using RAID-10), ideally we want to avoid
such a limit.

3) Extra ZK reads of failed-log-directory/[directory] when the broker starts


My main concern with the design B is the use of znode
/brokers/ids/[brokerId]/failed-log-directory/[directory]. I don't really
think other pros/cons of design B matter to us. Does my summary make sense?

Thanks,
Dong


On Thu, Feb 23, 2017 at 2:20 PM, Jun Rao  wrote:

> Hi, Dong,
>
> Just so that we are on the same page. Let me spec out the alternative
> design a bit more and then compare. Let's call the current design A and the
> alternative design B.
>
> Design B:
>
> New ZK path
> failed log directory path (persistent): This is created by a broker when a
> log directory fails and is potentially removed when the broker is
> restarted.
> /brokers/ids/[brokerId]/failed-log-directory/directory1 => { json of the
> replicas in the log directory }.
>
> *1. Topic gets created*
> - Works the same as before.
>
> *2. A log directory stops working on a broker during runtime*
>
> - The controller watches the path /failed-log-directory for the new znode.
>
> - The broker detects an offline log directory during runtime and marks
> affected replicas as offline in memory.
>
> - The broker writes the failed directory and all replicas in the failed
> directory under /failed-log-directory/directory1.
>
> - The controller reads 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-23 Thread Jun Rao
Hi, Dong,

Just so that we are on the same page. Let me spec out the alternative
design a bit more and then compare. Let's call the current design A and the
alternative design B.

Design B:

New ZK path
failed log directory path (persistent): This is created by a broker when a
log directory fails and is potentially removed when the broker is
restarted.
/brokers/ids/[brokerId]/failed-log-directory/directory1 => { json of the
replicas in the log directory }.

*1. Topic gets created*
- Works the same as before.

*2. A log directory stops working on a broker during runtime*

- The controller watches the path /failed-log-directory for the new znode.

- The broker detects an offline log directory during runtime and marks
affected replicas as offline in memory.

- The broker writes the failed directory and all replicas in the failed
directory under /failed-log-directory/directory1.

- The controller reads /failed-log-directory/directory1 and stores in
memory a list of failed replicas due to disk failures.

- The controller moves those replicas due to disk failure to offline state
and triggers the state change in replica state machine.


*3. Broker is restarted*

- The broker reads /brokers/ids/[brokerId]/failed-log-directory, if any.

- For each failed log directory it reads from ZK: if the log directory is
listed in log.dirs and is accessible now, or if the log directory is no longer
listed in log.dirs, remove that log directory from failed-log-directory.
Otherwise, the broker loads the replicas in the failed log directory in memory
as offline (a rough sketch of this check follows after step 3).

- The controller handles the failed log directory change event, if needed
(same as #2).

- The controller handles the broker registration event.
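
A rough sketch of the restart-time check in step 3 (illustrative names; the
actual ZK client calls and error handling are omitted, and the accessibility
test is simplified to a readability probe):

    import java.io.File;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustrative only: stands in for the broker-restart logic of design B.
    public class FailedLogDirRestartSketch {

        /**
         * @param configuredLogDirs log.dirs from the broker config
         * @param failedDirsFromZk  failed dir -> replicas recorded under
         *                          /brokers/ids/[brokerId]/failed-log-directory
         * @return replicas that should be loaded in memory as offline
         */
        public static Set<String> reconcile(Set<String> configuredLogDirs,
                                            Map<String, List<String>> failedDirsFromZk) {
            Set<String> offlineReplicas = new HashSet<>();
            for (Map.Entry<String, List<String>> entry : failedDirsFromZk.entrySet()) {
                String dir = entry.getKey();
                boolean stillConfigured = configuredLogDirs.contains(dir);
                boolean accessibleNow = stillConfigured && new File(dir).canRead();

                if (!stillConfigured || accessibleNow) {
                    // Directory was removed from log.dirs, or it is healthy again:
                    // drop its entry from the failed-log-directory path.
                    removeFailedDirZnode(dir);
                } else {
                    // Directory is still configured but still broken: its replicas stay offline.
                    offlineReplicas.addAll(entry.getValue());
                }
            }
            return offlineReplicas;
        }

        private static void removeFailedDirZnode(String dir) { /* placeholder for ZK delete */ }
    }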


*6. Controller failover*
- Controller reads all child paths under /failed-log-directory to rebuild
the list of failed replicas due to disk failures. Those replicas will be
transitioned to the offline state during controller initialization.

Comparing this with design A, I think the following are the things that
design B simplifies.
* No change in LeaderAndIsrRequest protocol.
* Step 1. One less round of LeaderAndIsrRequest and no additional ZK writes
to record the created flag.
* Step 2. One less round of LeaderAndIsrRequest and no additional logic to
handle LeaderAndIsrResponse.
* Step 6. Additional ZK reads proportional to # of failed log directories,
instead of # of partitions.
* Step 3. In design A, if a broker is restarted and the failed log
directory is unreadable, the broker doesn't know which replicas are on the
failed log directory. So, when the broker receives the LeaderAndIsrRequest
with created = false, it's a bit hard for the broker to decide whether it
should create the missing replica on other log directories. This is easier
in design B since the list of failed replicas is persisted in ZK.

Now, for some of the other things that you mentioned.

* What happens if a log directory is renamed?
I think this can be handled in the same way as non-existing log directory
during broker restart.

* What happens if replicas are moved manually across disks?
Good point. Well, if all log directories are available, the failed log
directory path will be cleared. In the rarer case that a log directory is
still offline and one of the replicas registered in the failed log
directory shows up in another available log directory, I am not quite sure.
Perhaps the simplest approach is to just error out and let the admin fix
things manually?

Thanks,

Jun



On Wed, Feb 22, 2017 at 3:39 PM, Dong Lin  wrote:

> Hey Jun,
>
> Thanks much for the explanation. I have some questions about 21 but that is
> less important than 20. 20 would require considerable change to the KIP and
> probably requires weeks to discuss again. Thus I would like to be very sure
> that we agree on the problems with the current design as you mentioned and
> there is no foreseeable problem with the alternate design.
>
> Please see below I detail response. To summarize my points, I couldn't
> figure out any non-trival drawback of the current design as compared to the
> alternative design; and I couldn't figure out a good way to store offline
> replicas in the alternative design. Can you see if these points make sense?
> Thanks in advance for your time!!
>
>
> 1) The alternative design requires slightly more dependency on ZK. While
> both solutions store created replicas in the ZK, the alternative design
> would also store offline replicas in ZK but the current design doesn't.
> Thus
>
> 2) I am not sure that we should store offline replicas in znode
> /brokers/ids/[brokerId]/failed-log-directory/[directory]. We probably
> don't
> want to expose log directory path in zookeeper based on the concept that we
> should only store logical information (e.g. topic, brokerId) in zookeeper's
> path name. More specifically, we probably don't want to rename path in
> zookeeper simply because user renamed a log director. And we probably don't
> want to read/write these znode 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-22 Thread Dong Lin
Hey Jun,

Thanks much for the explanation. I have some questions about 21 but that is
less important than 20. 20 would require considerable change to the KIP and
probably requires weeks to discuss again. Thus I would like to be very sure
that we agree on the problems with the current design as you mentioned and
there is no foreseeable problem with the alternate design.

Please see below I detail response. To summarize my points, I couldn't
figure out any non-trival drawback of the current design as compared to the
alternative design; and I couldn't figure out a good way to store offline
replicas in the alternative design. Can you see if these points make sense?
Thanks in advance for your time!!


1) The alternative design requires slightly more dependency on ZK. While
both solutions store created replicas in the ZK, the alternative design
would also store offline replicas in ZK but the current design doesn't. Thus

2) I am not sure that we should store offline replicas in the znode
/brokers/ids/[brokerId]/failed-log-directory/[directory]. We probably don't
want to expose the log directory path in ZooKeeper, based on the principle that
we should only store logical information (e.g. topic, brokerId) in ZooKeeper's
path names. More specifically, we probably don't want to rename a path in
ZooKeeper simply because the user renamed a log directory. And we probably
don't want to read/write these znodes just because the user manually moved
replicas between log directories.

3) I couldn't find a good way to store offline replicas in ZK in the
alternative design. We can store this information in one znode per topic,
per brokerId, or per brokerId-topic. All these choices have their own
problems. If we store it in a per-topic znode, then multiple brokers may need
to read/write offline replicas in the same znode, which is generally bad. If
we store it per brokerId, then we effectively limit the maximum number of
topic-partitions that can be stored on a broker to the znode size limit.
This contradicts the idea of expanding a single broker's capacity by throwing
in more disks. If we store it per brokerId-topic, then when the controller
starts, it needs to read on the order of brokerId*topic znodes, which may
double the overall znode reads during controller startup.
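
For a rough sense of that limit (ZooKeeper's default znode data limit is about
1 MB via jute.maxbuffer; the per-entry byte count below is a guess for
illustration):

    public class ZnodeSizeEstimateSketch {
        public static void main(String[] args) {
            long znodeLimitBytes = 1_000_000;   // ~1 MB default ZooKeeper znode data limit
            long bytesPerReplicaEntry = 60;     // rough guess: topic name + partition + JSON overhead

            long maxReplicasInOneZnode = znodeLimitBytes / bytesPerReplicaEntry;
            System.out.println("Approx. replicas that fit in a single per-broker znode: "
                    + maxReplicasInOneZnode);
            // Roughly 16,000 entries under these assumptions: a hard ceiling on partitions
            // per broker if all of them had to be listed in one per-brokerId znode.
        }
    }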

4) The alternative design is less efficient than the current design in case
of log directory failure. The alternative design requires extra znode reads
in order to read offline replicas from zk while the current design requires
only one pair of LeaderAndIsrRequest and LeaderAndIsrResponse. The extra
znode reads will be proportional to the number of topics on the broker if
we store offline replicas per-brokerId-topic.

5) While I agree that the failure reporting should be done where the
failure originated, I think the current design is consistent with what
we are already doing. With the current design, the broker will send a
notification via ZooKeeper and the controller will send a LeaderAndIsrRequest
to the broker. This is similar to how the broker sends an ISR change
notification and the controller reads the latest ISR from the broker. If we do
want the broker to report failure directly to the controller, we should
probably have the broker send an RPC directly to the controller as it does for
ControlledShutdownRequest. I can do this as well.

6) I don't think the current design requires additional state management in
each of the existing state handling paths such as topic creation or controller
failover. All this existing logic should stay exactly the same, except that
the controller should recognize offline replicas on a live broker instead
of assuming all replicas on live brokers are live. But this additional
change is required in both the current design and the alternate design.
Thus there should be no difference between the current design and the alternate
design with respect to this existing state handling logic in the controller.

7) While I agree that the current design requires additional complexity in
the controller in order to handle LeaderAndIsrResponse and potentially
change partition and replica state to offline in the state machines, I think
such logic is necessary in a well-designed controller, either with the
alternate design or even without JBOD. The controller should be able to
handle errors (e.g. ClusterAuthorizationException) in LeaderAndIsrResponse,
and in responses in general. For example, if the controller hasn't received
a LeaderAndIsrResponse after a given period of time, it probably means the
broker has hung, and the controller should consider this broker offline
and re-elect leaders from other brokers. This would actually fix a
problem we have seen before at LinkedIn, where a broker hangs due to a
RAID-controller failure. In other words, I think it is a good idea for the
controller to handle responses.
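As a rough illustration of the kind of response handling I mean, a
controller-side callback could look something like the sketch below. The
class and method names are made up for illustration and are not the actual
controller code.

import java.util.Map;

public class LeaderAndIsrResponseHandler {

    // Hypothetical hook into the controller's replica state machine.
    interface ReplicaStateMachine {
        void markOffline(int brokerId, String topicPartition);
    }

    private final ReplicaStateMachine stateMachine;

    public LeaderAndIsrResponseHandler(ReplicaStateMachine stateMachine) {
        this.stateMachine = stateMachine;
    }

    // errors maps a topic-partition name to the error code the broker reported
    // in its LeaderAndIsrResponse; 0 means no error.
    public void onResponse(int brokerId, Map<String, Short> errors) {
        for (Map.Entry<String, Short> entry : errors.entrySet()) {
            if (entry.getValue() != 0) {
                // Any non-zero error (e.g. caused by a failed log directory)
                // transitions that replica to the offline state.
                stateMachine.markOffline(brokerId, entry.getKey());
            }
        }
    }
}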

8) I am not sure that the additional state management to handle
LeaderAndIsrResponse causes new types of synchronization. It is true that
the logic is not handled by the ZK event handling
thread. But the existing ControlledShutdownRequest is also not handled by
Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-22 Thread Jun Rao
Hi, Dong, Jiangjie,

20. (1) I agree that ideally we'd like to use direct RPC for
broker-to-broker communication instead of ZK. However, in the alternative
design, the failed log directory path also serves as the persistent state
for remembering the offline partitions. This is similar to the
controller_managed_state path in your design. The difference is that the
alternative design stores the state in fewer ZK paths, which helps reduce
the controller failover time. (2) I agree that we want the controller to be
the single place to make decisions. However, intuitively, the failure
reporting should be done where the failure is originated. For example, if a
broker fails, the broker reports failure by de-registering from ZK. The
failed log directory path is similar in that regard. (3) I am not worried
about the additional load from extra LeaderAndIsrRequest. What I worry
about is any unnecessary additional complexity in the controller. To me,
the additional complexity in the current design is the additional state
management in each of the existing state handling (e.g., topic creation,
controller failover, etc), and the additional synchronization since the
additional state management is not initiated from the ZK event handling
thread.

21. One of the reasons that we need to send a StopReplicaRequest to offline
replica is to handle controlled shutdown. In that case, a broker is still
alive, but indicates to the controller that it plans to shut down. Being
able to stop the replica in the shutting down broker reduces churns in ISR.
So, for simplicity, it's probably easier to always send a StopReplicaRequest
to any offline replica.

Thanks,

Jun


On Tue, Feb 21, 2017 at 2:37 PM, Dong Lin  wrote:

> Hey Jun,
>
> Thanks much for your comments.
>
> I actually proposed the design to store both offline replicas and created
> replicas in per-broker znode before switching to the design in the current
> KIP. The current design stores created replicas in per-partition znode and
> transmits offline replicas via LeaderAndIsrResponse. The original solution
> is roughly the same as what you suggested. The advantage of the current
> solution is kind of philosophical: 1) we want to transmit data (e.g.
> offline replicas) using RPC and reduce dependency on zookeeper; 2) we want
> controller to be the only one that determines any state (e.g. offline
> replicas) that will be exposed to user. The advantage of the solution to
> store offline replica in zookeeper is that we can save one roundtrip time
> for controller to handle log directory failure. However, this extra
> roundtrip time should not be a big deal since the log directory failure is
> rare and inefficiency of extra latency is less of a problem when there is
> log directory failure.
>
> Do you think the two philosophical advantages of the current KIP make
> sense? If not, then I can switch to the original design that stores offline
> replicas in zookeeper. It is actually written already. One disadvantage is
> that we have to make non-trivial changes to the KIP (e.g. no create flag in
> LeaderAndIsrRequest and no created flag in zookeeper) and restart this KIP
> discussion.
>
> Regarding 21, it seems to me that LeaderAndIsrRequest/StopReplicaRequest
> only makes sense when broker can make the choice (e.g. fetch data for this
> replica or not). In the case that the log directory of the replica is
> already offline, the broker has to stop fetching data for this replica
> regardless of what the controller tells it to do. Thus it seems cleaner for
> the broker to stop fetching data for this replica immediately. The advantage of
> this solution is that the controller logic is simpler since it doesn't need
> to send StopReplicaRequest in case of log directory failure, and the log4j
> log is also cleaner. Is there a specific advantage in having the controller
> tell the broker to stop fetching data for offline replicas?
>
> Regarding 22, I agree with your observation that it will happen. I will
> update the KIP and specify that the broker will exit with a proper error message
> in the log and user needs to manually remove partitions and restart the
> broker.
>
> Thanks!
> Dong
>
>
>
> On Mon, Feb 20, 2017 at 10:17 PM, Jun Rao  wrote:
>
> > Hi, Dong,
> >
> > Sorry for the delay. A few more comments.
> >
> > 20. One complexity that I found in the current KIP is that the way the
> > broker communicates failed replicas to the controller is inefficient.
> When
> > a log directory fails, the broker only sends an indication through ZK to
> > the controller and the controller has to issue a LeaderAndIsrRequest to
> > discover which replicas are offline due to log directory failure. An
> > alternative approach is that when a log directory fails, the broker just
> > writes the failed directory and the corresponding topic partitions
> in a
> > new failed log directory ZK path like the following.
> >
> > Failed log directory path:
> > 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-21 Thread Dong Lin
Hey Jun,

Motivated by your suggestion, I think we can also store the information of
created replicas in per-broker znode at /brokers/created_replicas/ids/[id].
Does this sound good?

Regards,
Dong


On Tue, Feb 21, 2017 at 2:37 PM, Dong Lin  wrote:

> Hey Jun,
>
> Thanks much for your comments.
>
> I actually proposed the design to store both offline replicas and created
> replicas in per-broker znode before switching to the design in the current
> KIP. The current design stores created replicas in per-partition znode and
> transmits offline replicas via LeaderAndIsrResponse. The original solution
> is roughly the same as what you suggested. The advantage of the current
> solution is kind of philosophical: 1) we want to transmit data (e.g.
> offline replicas) using RPC and reduce dependency on zookeeper; 2) we want
> controller to be the only one that determines any state (e.g. offline
> replicas) that will be exposed to user. The advantage of the solution to
> store offline replica in zookeeper is that we can save one roundtrip time
> for controller to handle log directory failure. However, this extra
> roundtrip time should not be a big deal since the log directory failure is
> rare and inefficiency of extra latency is less of a problem when there is
> log directory failure.
>
> Do you think the two philosophical advantages of the current KIP make
> sense? If not, then I can switch to the original design that stores offline
> replicas in zookeeper. It is actually written already. One disadvantage is
> that we have to make non-trivial changes to the KIP (e.g. no create flag in
> LeaderAndIsrRequest and no created flag in zookeeper) and restart this KIP
> discussion.
>
> Regarding 21, it seems to me that LeaderAndIsrRequest/StopReplicaRequest
> only makes sense when broker can make the choice (e.g. fetch data for this
> replica or not). In the case that the log directory of the replica is
> already offline, the broker has to stop fetching data for this replica
> regardless of what the controller tells it to do. Thus it seems cleaner for
> the broker to stop fetching data for this replica immediately. The advantage of
> this solution is that the controller logic is simpler since it doesn't need
> to send StopReplicaRequest in case of log directory failure, and the log4j
> log is also cleaner. Is there a specific advantage in having the controller
> tell the broker to stop fetching data for offline replicas?
>
> Regarding 22, I agree with your observation that it will happen. I will
> update the KIP and specify that the broker will exit with a proper error message
> in the log and user needs to manually remove partitions and restart the
> broker.
>
> Thanks!
> Dong
>
>
>
> On Mon, Feb 20, 2017 at 10:17 PM, Jun Rao  wrote:
>
>> Hi, Dong,
>>
>> Sorry for the delay. A few more comments.
>>
>> 20. One complexity that I found in the current KIP is that the way the
>> broker communicates failed replicas to the controller is inefficient. When
>> a log directory fails, the broker only sends an indication through ZK to
>> the controller and the controller has to issue a LeaderAndIsrRequest to
>> discover which replicas are offline due to log directory failure. An
>> alternative approach is that when a log directory fails, the broker just
>> writes the failed directory and the corresponding topic partitions in
>> a
>> new failed log directory ZK path like the following.
>>
>> Failed log directory path:
>> /brokers/ids/[brokerId]/failed-log-directory/directory1 => { json of the
>> topic partitions in the log directory }.
>>
>> The controller just watches for child changes in
>> /brokers/ids/[brokerId]/failed-log-directory.
>> After reading this path, the controller knows the exact set of replicas that
>> are offline and can trigger that replica state change accordingly. This
>> saves an extra round of LeaderAndIsrRequest handling.
>>
>> With this new ZK path, we can probably get rid of /broker/topics/[topic]/
>> partitions/[partitionId]/controller_managed_state. The creation of a new
>> replica is expected to always succeed unless all log directories fail, in
>> which case, the broker goes down anyway. Then, during controller failover,
>> the controller just needs to additionally read from ZK the extra failed
>> log
>> directory paths, which is many fewer than topics or partitions.
>>
>> On broker startup, if a log directory becomes available, the corresponding
>> log directory path in ZK will be removed.
>>
>> The downside of this approach is that the value of this new ZK path can be
>> large. However, even with 5K partition per log directory and 100 bytes per
>> partition, the size of the value is 500KB, still less than the default 1MB
>> znode limit in ZK.
>>
>> 21. "Broker will remove offline replica from its replica fetcher threads."
>> The proposal lets the broker remove the replica from the replica fetcher
>> thread when it detects a directory failure. An alternative is to only do
>> that until the 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-21 Thread Dong Lin
Hey Jun,

Thanks much for your comments.

I actually proposed the design to store both offline replicas and created
replicas in per-broker znode before switching to the design in the current
KIP. The current design stores created replicas in per-partition znode and
transmits offline replicas via LeaderAndIsrResponse. The original solution
is roughly the same as what you suggested. The advantage of the current
solution is kind of philosophical: 1) we want to transmit data (e.g.
offline replicas) using RPC and reduce dependency on zookeeper; 2) we want
controller to be the only one that determines any state (e.g. offline
replicas) that will be exposed to user. The advantage of the solution to
store offline replica in zookeeper is that we can save one roundtrip time
for controller to handle log directory failure. However, this extra
roundtrip time should not be a big deal since the log directory failure is
rare and inefficiency of extra latency is less of a problem when there is
log directory failure.

Do you think the two philosophical advantages of the current KIP make
sense? If not, then I can switch to the original design that stores offline
replicas in zookeeper. It is actually written already. One disadvantage is
that we have to make non-trivial changes to the KIP (e.g. no create flag in
LeaderAndIsrRequest and no created flag in zookeeper) and restart this KIP
discussion.

Regarding 21, it seems to me that LeaderAndIsrRequest/StopReplicaRequest
only makes sense when the broker can make the choice (e.g. whether to fetch
data for this replica or not). In the case that the log directory of the
replica is already offline, the broker has to stop fetching data for this
replica regardless of what the controller tells it to do. Thus it seems
cleaner for the broker to stop fetching data for this replica immediately.
The advantage of this solution is that the controller logic is simpler since
it doesn't need to send StopReplicaRequest in case of log directory failure,
and the log4j log is also cleaner. Is there a specific advantage in having
the controller tell the broker to stop fetching data for offline replicas?

Regarding 22, I agree with your observation that it can happen. I will
update the KIP and specify that the broker will exit with a proper error
message in the log and that the user needs to manually remove the redundant
partitions and restart the broker.

Thanks!
Dong



On Mon, Feb 20, 2017 at 10:17 PM, Jun Rao  wrote:

> Hi, Dong,
>
> Sorry for the delay. A few more comments.
>
> 20. One complexity that I found in the current KIP is that the way the
> broker communicates failed replicas to the controller is inefficient. When
> a log directory fails, the broker only sends an indication through ZK to
> the controller and the controller has to issue a LeaderAndIsrRequest to
> discover which replicas are offline due to log directory failure. An
> alternative approach is that when a log directory fails, the broker just
> writes the failed directory and the corresponding topic partitions in a
> new failed log directory ZK path like the following.
>
> Failed log directory path:
> /brokers/ids/[brokerId]/failed-log-directory/directory1 => { json of the
> topic partitions in the log directory }.
>
> The controller just watches for child changes in
> /brokers/ids/[brokerId]/failed-log-directory.
> After reading this path, the controller knows the exact set of replicas that
> are offline and can trigger that replica state change accordingly. This
> saves an extra round of LeaderAndIsrRequest handling.
>
> With this new ZK path, we can probably get rid of /broker/topics/[topic]/
> partitions/[partitionId]/controller_managed_state. The creation of a new
> replica is expected to always succeed unless all log directories fail, in
> which case, the broker goes down anyway. Then, during controller failover,
> the controller just needs to additionally read from ZK the extra failed log
> directory paths, which is many fewer than topics or partitions.
>
> On broker startup, if a log directory becomes available, the corresponding
> log directory path in ZK will be removed.
>
> The downside of this approach is that the value of this new ZK path can be
> large. However, even with 5K partition per log directory and 100 bytes per
> partition, the size of the value is 500KB, still less than the default 1MB
> znode limit in ZK.
>
> 21. "Broker will remove offline replica from its replica fetcher threads."
> The proposal lets the broker remove the replica from the replica fetcher
> thread when it detects a directory failure. An alternative is to only do
> that when the broker receives the LeaderAndIsrRequest/StopReplicaRequest.
> The benefit of this is that the controller is the only one who decides
> which replica to be removed from the replica fetcher threads. The broker
> also doesn't need additional logic to remove the replica from replica
> fetcher threads. The downside is that in a small window, the replica fetch
> thread will keep writing to the failed log 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-21 Thread Becket Qin
Hey Jun,

Using Zookeeper to propagate the offline replica state as you suggest
sounds like a simpler approach.

However, I am wondering if we want to avoid using zookeeper to propagate
information. In the past this caused a lot of problems for us, including
missing notifications, performance issues, etc.

At this point it might make sense to add a new request to allow the broker
to propagate state to the controller. Currently we have ISR information and
local broker information, which are maintained by the broker, and the broker
notifies the controller if any change occurs. We put quite a bit of effort
into improving the ISR propagation performance in the past. So, if possible,
I think we may want to avoid using the same mechanism again for offline
replica propagation from the broker to the controller.

That said, we can do that in a separate KIP also.

Thanks,

Jiangjie (Becket) Qin

On Mon, Feb 20, 2017 at 10:17 PM, Jun Rao  wrote:

> Hi, Dong,
>
> Sorry for the delay. A few more comments.
>
> 20. One complexity that I found in the current KIP is that the way the
> broker communicates failed replicas to the controller is inefficient. When
> a log directory fails, the broker only sends an indication through ZK to
> the controller and the controller has to issue a LeaderAndIsrRequest to
> discover which replicas are offline due to log directory failure. An
> alternative approach is that when a log directory fails, the broker just
> writes the failed directory and the corresponding topic partitions in a
> new failed log directory ZK path like the following.
>
> Failed log directory path:
> /brokers/ids/[brokerId]/failed-log-directory/directory1 => { json of the
> topic partitions in the log directory }.
>
> The controller just watches for child changes in
> /brokers/ids/[brokerId]/failed-log-directory.
> After reading this path, the controller knows the exact set of replicas that
> are offline and can trigger that replica state change accordingly. This
> saves an extra round of LeaderAndIsrRequest handling.
>
> With this new ZK path, we can probably get rid of /broker/topics/[topic]/
> partitions/[partitionId]/controller_managed_state. The creation of a new
> replica is expected to always succeed unless all log directories fail, in
> which case, the broker goes down anyway. Then, during controller failover,
> the controller just needs to additionally read from ZK the extra failed log
> directory paths, which is many fewer than topics or partitions.
>
> On broker startup, if a log directory becomes available, the corresponding
> log directory path in ZK will be removed.
>
> The downside of this approach is that the value of this new ZK path can be
> large. However, even with 5K partition per log directory and 100 bytes per
> partition, the size of the value is 500KB, still less than the default 1MB
> znode limit in ZK.
>
> 21. "Broker will remove offline replica from its replica fetcher threads."
> The proposal lets the broker remove the replica from the replica fetcher
> thread when it detects a directory failure. An alternative is to only do
> that when the broker receives the LeaderAndIsrRequest/StopReplicaRequest.
> The benefit of this is that the controller is the only one who decides
> which replica to be removed from the replica fetcher threads. The broker
> also doesn't need additional logic to remove the replica from replica
> fetcher threads. The downside is that in a small window, the replica fetch
> thread will keep writing to the failed log directory and may pollute the
> log4j log.
>
> 22. In the current design, there is a potential corner case issue that the
> same partition may exist in more than one log directory at some point.
> Consider the following steps: (1) a new topic t1 is created and the
> controller sends LeaderAndIsrRequest to a broker; (2) the broker creates
> partition t1-p1 in log dir1; (3) before the broker sends a response, it
> goes down; (4) the broker is restarted with log dir1 unreadable; (5) the
> broker receives a new LeaderAndIsrRequest and creates partition t1-p1 on
> log dir2; (6) at some point, the broker is restarted with log dir1 fixed.
> Now partition t1-p1 is in two log dirs. The alternative approach that I
> suggested above may suffer from a similar corner case issue. Since this is
> rare, if the broker detects this during broker startup, it can probably
> just log an error and exit. The admin can remove the redundant partitions
> manually and then restart the broker.
>
> Thanks,
>
> Jun
>
> On Sat, Feb 18, 2017 at 9:31 PM, Dong Lin  wrote:
>
> > Hey Jun,
> >
> > Could you please let me know if the solutions above could address your
> > concern? I really want to move the discussion forward.
> >
> > Thanks,
> > Dong
> >
> >
> > On Tue, Feb 14, 2017 at 8:17 PM, Dong Lin  wrote:
> >
> > > Hey Jun,
> > >
> > > Thanks for all your help and time to discuss this KIP. When you get the
> > > time, could you let me know if the previous answers 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-20 Thread Jun Rao
Hi, Dong,

Sorry for the delay. A few more comments.

20. One complexity that I found in the current KIP is that the way the
broker communicates failed replicas to the controller is inefficient. When
a log directory fails, the broker only sends an indication through ZK to
the controller and the controller has to issue a LeaderAndIsrRequest to
discover which replicas are offline due to log directory failure. An
alternative approach is that when a log directory fails, the broker just
writes the failed directory and the corresponding topic partitions in a
new failed log directory ZK path like the following.

Failed log directory path:
/brokers/ids/[brokerId]/failed-log-directory/directory1 => { json of the
topic partitions in the log directory }.

The controller just watches for child changes in
/brokers/ids/[brokerId]/failed-log-directory.
After reading this path, the controller knows the exact set of replicas that
are offline and can trigger the replica state changes accordingly. This
saves an extra round of LeaderAndIsrRequest handling.
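For what it's worth, a minimal sketch of how a controller could consume this
path with the plain ZooKeeper Java client might look like the following. The
path layout is the one proposed above; everything else (the class, the error
handling, the JSON parsing) is illustrative only.

import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class FailedLogDirWatcher {

    private final ZooKeeper zk;

    public FailedLogDirWatcher(ZooKeeper zk) {
        this.zk = zk;
    }

    public void watchBroker(int brokerId) throws KeeperException, InterruptedException {
        String base = "/brokers/ids/" + brokerId + "/failed-log-directory";
        // Re-register the child watch on every notification so that newly
        // failed directories keep triggering the controller.
        List<String> failedDirs = zk.getChildren(base, event -> {
            try {
                watchBroker(brokerId);
            } catch (Exception e) {
                // A real controller would route this back to its event thread.
            }
        });
        for (String dir : failedDirs) {
            byte[] json = zk.getData(base + "/" + dir, false, null);
            // json holds the proposed list of topic partitions on the failed
            // directory; the controller would mark those replicas offline here.
        }
    }
}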

With this new ZK path, we can probably get rid of /broker/topics/[topic]/
partitions/[partitionId]/controller_managed_state. The creation of a new
replica is expected to always succeed unless all log directories fail, in
which case, the broker goes down anyway. Then, during controller failover,
the controller just needs to additionally read the extra failed log
directory paths from ZK, which are far fewer than the topic or partition paths.

On broker startup, if a log directory becomes available, the corresponding
log directory path in ZK will be removed.

The downside of this approach is that the value of this new ZK path can be
large. However, even with 5K partitions per log directory and 100 bytes per
partition, the size of the value is 500KB, still less than the default 1MB
znode limit in ZK.

21. "Broker will remove offline replica from its replica fetcher threads."
The proposal lets the broker remove the replica from the replica fetcher
thread when it detects a directory failure. An alternative is to only do
that when the broker receives the LeaderAndIsrRequest/StopReplicaRequest.
The benefit of this is that the controller is the only one who decides
which replicas are removed from the replica fetcher threads. The broker
also doesn't need additional logic to remove the replica from replica
fetcher threads. The downside is that in a small window, the replica fetch
thread will keep writing to the failed log directory and may pollute the
log4j log.

22. In the current design, there is a potential corner case issue that the
same partition may exist in more than one log directory at some point.
Consider the following steps: (1) a new topic t1 is created and the
controller sends LeaderAndIsrRequest to a broker; (2) the broker creates
partition t1-p1 in log dir1; (3) before the broker sends a response, it
goes down; (4) the broker is restarted with log dir1 unreadable; (5) the
broker receives a new LeaderAndIsrRequest and creates partition t1-p1 on
log dir2; (6) at some point, the broker is restarted with log dir1 fixed.
Now partition t1-p1 is in two log dirs. The alternative approach that I
suggested above may suffer from a similar corner case issue. Since this is
rare, if the broker detects this during broker startup, it can probably
just log an error and exit. The admin can remove the redundant partitions
manually and then restart the broker.
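A rough sketch of that startup check, assuming the broker simply scans the
partition directories under each configured log directory (the class and
method names are illustrative, not the actual broker code):

import java.io.File;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicatePartitionCheck {

    public static void checkOrExit(List<File> logDirs) {
        Map<String, File> seen = new HashMap<>();
        for (File logDir : logDirs) {
            File[] partitionDirs = logDir.listFiles(File::isDirectory);
            if (partitionDirs == null) continue; // unreadable (offline) directory
            for (File partitionDir : partitionDirs) {
                File previous = seen.put(partitionDir.getName(), logDir);
                if (previous != null) {
                    System.err.printf("Partition %s found in both %s and %s; "
                            + "remove one copy and restart the broker.%n",
                            partitionDir.getName(), previous, logDir);
                    System.exit(1);
                }
            }
        }
    }
}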

Thanks,

Jun

On Sat, Feb 18, 2017 at 9:31 PM, Dong Lin  wrote:

> Hey Jun,
>
> Could you please let me know if the solutions above could address your
> concern? I really want to move the discussion forward.
>
> Thanks,
> Dong
>
>
> On Tue, Feb 14, 2017 at 8:17 PM, Dong Lin  wrote:
>
> > Hey Jun,
> >
> > Thanks for all your help and time to discuss this KIP. When you get the
> > time, could you let me know if the previous answers address the concern?
> >
> > I think the more interesting question in your last email is where we
> > should store the "created" flag in ZK. I proposed the solution that I
> like
> > most, i.e. store it together with the replica assignment data in the
> /brokers/topics/[topic].
> > In order to expedite discussion, let me provide another two ideas to
> > address the concern just in case the first idea doesn't work:
> >
> > - We can avoid extra controller ZK read when there is no disk failure
> > (95% of time?). When controller starts, it doesn't
> > read controller_managed_state in ZK and sends LeaderAndIsrRequest with
> > "create = false". Only if LeaderAndIsrResponse shows failure for any
> > replica, then controller will read controller_managed_state for this
> > partition and re-send LeaderAndIsrRequest with "create=true" if this
> > replica has not been created.
> >
> > - We can significantly reduce this ZK read time by making
> > controller_managed_state a topic level information in ZK, e.g.
> > /brokers/topics/[topic]/state. Given that most topic has 10+ partition,

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-18 Thread Dong Lin
Hey Jun,

Could you please let me know if the solutions above could address your
concern? I really want to move the discussion forward.

Thanks,
Dong


On Tue, Feb 14, 2017 at 8:17 PM, Dong Lin  wrote:

> Hey Jun,
>
> Thanks for all your help and time to discuss this KIP. When you get the
> time, could you let me know if the previous answers address the concern?
>
> I think the more interesting question in your last email is where we
> should store the "created" flag in ZK. I proposed the solution that I like
> most, i.e. store it together with the replica assignment data in the 
> /brokers/topics/[topic].
> In order to expedite discussion, let me provide another two ideas to
> address the concern just in case the first idea doesn't work:
>
> - We can avoid extra controller ZK read when there is no disk failure
> (95% of time?). When controller starts, it doesn't
> read controller_managed_state in ZK and sends LeaderAndIsrRequest with
> "create = false". Only if LeaderAndIsrResponse shows failure for any
> replica, then controller will read controller_managed_state for this
> partition and re-send LeaderAndIsrRequest with "create=true" if this
> replica has not been created.
>
> - We can significantly reduce this ZK read time by making
> controller_managed_state a topic level information in ZK, e.g.
> /brokers/topics/[topic]/state. Given that most topic has 10+ partition,
> the extra ZK read time should be less than 10% of the existing total zk
> read time during controller failover.
>
> Thanks!
> Dong
>
>
> On Tue, Feb 14, 2017 at 7:30 AM, Dong Lin  wrote:
>
>> Hey Jun,
>>
>> I just realized that you may be suggesting that a tool for listing
>> offline directories is necessary for KIP-112 by asking whether KIP-112 and
>> KIP-113 will be in the same release. I think such a tool is useful but
>> doesn't have to be included in KIP-112. This is because as of now admin
>> needs to log into broker machine and check broker log to figure out the
>> cause of broker failure and the bad log directory in case of disk failure.
>> The KIP-112 won't make it harder since admin can still figure out the bad
>> log directory by doing the same thing. Thus it is probably OK to just
>> include this script in KIP-113. Regardless, my hope is to finish both KIPs
>> ASAP and make them in the same release since both KIPs are needed for the
>> JBOD setup.
>>
>> Thanks,
>> Dong
>>
>> On Mon, Feb 13, 2017 at 5:52 PM, Dong Lin  wrote:
>>
>>> And the test plan has also been updated to simulate disk failure by
>>> changing log directory permission to 000.
>>>
>>> On Mon, Feb 13, 2017 at 5:50 PM, Dong Lin  wrote:
>>>
 Hi Jun,

 Thanks for the reply. These comments are very helpful. Let me answer
 them inline.


 On Mon, Feb 13, 2017 at 3:25 PM, Jun Rao  wrote:

> Hi, Dong,
>
> Thanks for the reply. A few more replies and new comments below.
>
> On Fri, Feb 10, 2017 at 4:27 PM, Dong Lin  wrote:
>
> > Hi Jun,
> >
> > Thanks for the detailed comments. Please see answers inline:
> >
> > On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao  wrote:
> >
> > > Hi, Dong,
> > >
> > > Thanks for the updated wiki. A few comments below.
> > >
> > > 1. Topics get created
> > > 1.1 Instead of storing successfully created replicas in ZK, could
> we
> > store
> > > unsuccessfully created replicas in ZK? Since the latter is less
> common,
> > it
> > > probably reduces the load on ZK.
> > >
> >
> > We can store unsuccessfully created replicas in ZK. But I am not
> sure if
> > that can reduce write load on ZK.
> >
> > If we want to reduce write load on ZK using by store unsuccessfully
> created
> > replicas in ZK, then broker should not write to ZK if all replicas
> are
> > successfully created. It means that if /broker/topics/[topic]/partiti
> > ons/[partitionId]/controller_managed_state doesn't exist in ZK for
> a given
> > partition, we have to assume all replicas of this partition have been
> > successfully created and send LeaderAndIsrRequest with create =
> false. This
> > becomes a problem if controller crashes before receiving
> > LeaderAndIsrResponse to validate whether a replica has been created.
> >
> > I think this approach and reduce the number of bytes stored in ZK.
> But I am
> > not sure if this is a concern.
> >
> >
> >
> I was mostly concerned about the controller failover time. Currently,
> the
> controller failover is likely dominated by the cost of reading
> topic/partition level information from ZK. If we add another partition
> level path in ZK, it probably will double the controller failover
> time. If
> the approach of representing the non-created 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-14 Thread Dong Lin
Hey Jun,

Thanks for all your help and time to discuss this KIP. When you get the
time, could you let me know if the previous answers address the concern?

I think the more interesting question in your last email is where we should
store the "created" flag in ZK. I proposed the solution that I like most,
i.e. storing it together with the replica assignment data in
/brokers/topics/[topic].
In order to expedite discussion, let me provide another two ideas to
address the concern just in case the first idea doesn't work:

- We can avoid the extra controller ZK read when there is no disk failure (95%
of the time?). When the controller starts, it doesn't read
controller_managed_state in ZK and sends LeaderAndIsrRequest with "create =
false". Only if the LeaderAndIsrResponse shows a failure for some replica will
the controller read controller_managed_state for this partition and re-send
LeaderAndIsrRequest with "create=true" if the replica has not been created.

- We can significantly reduce this ZK read time by making
controller_managed_state a topic level information in ZK, e.g.
/brokers/topics/[topic]/state. Given that most topics have 10+ partitions, the
extra ZK read time should be less than 10% of the existing total zk read
time during controller failover.

Thanks!
Dong


On Tue, Feb 14, 2017 at 7:30 AM, Dong Lin  wrote:

> Hey Jun,
>
> I just realized that you may be suggesting that a tool for listing offline
> directories is necessary for KIP-112 by asking whether KIP-112 and KIP-113
> will be in the same release. I think such a tool is useful but doesn't have
> to be included in KIP-112. This is because as of now admin needs to log
> into broker machine and check broker log to figure out the cause of broker
> failure and the bad log directory in case of disk failure. The KIP-112
> won't make it harder since admin can still figure out the bad log directory
> by doing the same thing. Thus it is probably OK to just include this script
> in KIP-113. Regardless, my hope is to finish both KIPs ASAP and make them
> in the same release since both KIPs are needed for the JBOD setup.
>
> Thanks,
> Dong
>
> On Mon, Feb 13, 2017 at 5:52 PM, Dong Lin  wrote:
>
>> And the test plan has also been updated to simulate disk failure by
>> changing log directory permission to 000.
>>
>> On Mon, Feb 13, 2017 at 5:50 PM, Dong Lin  wrote:
>>
>>> Hi Jun,
>>>
>>> Thanks for the reply. These comments are very helpful. Let me answer
>>> them inline.
>>>
>>>
>>> On Mon, Feb 13, 2017 at 3:25 PM, Jun Rao  wrote:
>>>
 Hi, Dong,

 Thanks for the reply. A few more replies and new comments below.

 On Fri, Feb 10, 2017 at 4:27 PM, Dong Lin  wrote:

 > Hi Jun,
 >
 > Thanks for the detailed comments. Please see answers inline:
 >
 > On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao  wrote:
 >
 > > Hi, Dong,
 > >
 > > Thanks for the updated wiki. A few comments below.
 > >
 > > 1. Topics get created
 > > 1.1 Instead of storing successfully created replicas in ZK, could we
 > store
 > > unsuccessfully created replicas in ZK? Since the latter is less
 common,
 > it
 > > probably reduces the load on ZK.
 > >
 >
 > We can store unsuccessfully created replicas in ZK. But I am not sure
 if
 > that can reduce write load on ZK.
 >
 > If we want to reduce write load on ZK using by store unsuccessfully
 created
 > replicas in ZK, then broker should not write to ZK if all replicas are
 > successfully created. It means that if /broker/topics/[topic]/partiti
 > ons/[partitionId]/controller_managed_state doesn't exist in ZK for a
 given
 > partition, we have to assume all replicas of this partition have been
 > successfully created and send LeaderAndIsrRequest with create =
 false. This
 > becomes a problem if controller crashes before receiving
 > LeaderAndIsrResponse to validate whether a replica has been created.
 >
 > I think this approach and reduce the number of bytes stored in ZK.
 But I am
 > not sure if this is a concern.
 >
 >
 >
 I was mostly concerned about the controller failover time. Currently,
 the
 controller failover is likely dominated by the cost of reading
 topic/partition level information from ZK. If we add another partition
 level path in ZK, it probably will double the controller failover time.
 If
 the approach of representing the non-created replicas doesn't work, have
 you considered just adding the created flag in the leaderAndIsr path in
 ZK?


>>> Yes, I have considered adding the created flag in the leaderAndIsr path
>>> in ZK. If we were to add created flag per replica in the
>>> LeaderAndIsrRequest, then it requires a lot of change in the code base.
>>>
>>> If we don't add created flag per replica 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-14 Thread Dong Lin
Hey Jun,

I just realized that you may be suggesting that a tool for listing offline
directories is necessary for KIP-112 by asking whether KIP-112 and KIP-113
will be in the same release. I think such a tool is useful but doesn't have
to be included in KIP-112. This is because as of now admin needs to log
into broker machine and check broker log to figure out the cause of broker
failure and the bad log directory in case of disk failure. The KIP-112
won't make it harder since admin can still figure out the bad log directory
by doing the same thing. Thus it is probably OK to just include this script
in KIP-113. Regardless, my hope is to finish both KIPs ASAP and make them
in the same release since both KIPs are needed for the JBOD setup.

Thanks,
Dong

On Mon, Feb 13, 2017 at 5:52 PM, Dong Lin  wrote:

> And the test plan has also been updated to simulate disk failure by
> changing log directory permission to 000.
>
> On Mon, Feb 13, 2017 at 5:50 PM, Dong Lin  wrote:
>
>> Hi Jun,
>>
>> Thanks for the reply. These comments are very helpful. Let me answer them
>> inline.
>>
>>
>> On Mon, Feb 13, 2017 at 3:25 PM, Jun Rao  wrote:
>>
>>> Hi, Dong,
>>>
>>> Thanks for the reply. A few more replies and new comments below.
>>>
>>> On Fri, Feb 10, 2017 at 4:27 PM, Dong Lin  wrote:
>>>
>>> > Hi Jun,
>>> >
>>> > Thanks for the detailed comments. Please see answers inline:
>>> >
>>> > On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao  wrote:
>>> >
>>> > > Hi, Dong,
>>> > >
>>> > > Thanks for the updated wiki. A few comments below.
>>> > >
>>> > > 1. Topics get created
>>> > > 1.1 Instead of storing successfully created replicas in ZK, could we
>>> > store
>>> > > unsuccessfully created replicas in ZK? Since the latter is less
>>> common,
>>> > it
>>> > > probably reduces the load on ZK.
>>> > >
>>> >
>>> > We can store unsuccessfully created replicas in ZK. But I am not sure
>>> if
>>> > that can reduce write load on ZK.
>>> >
>>> > If we want to reduce write load on ZK using by store unsuccessfully
>>> created
>>> > replicas in ZK, then broker should not write to ZK if all replicas are
>>> > successfully created. It means that if /broker/topics/[topic]/partiti
>>> > ons/[partitionId]/controller_managed_state doesn't exist in ZK for a
>>> given
>>> > partition, we have to assume all replicas of this partition have been
>>> > successfully created and send LeaderAndIsrRequest with create = false.
>>> This
>>> > becomes a problem if controller crashes before receiving
>>> > LeaderAndIsrResponse to validate whether a replica has been created.
>>> >
>>> > I think this approach and reduce the number of bytes stored in ZK. But
>>> I am
>>> > not sure if this is a concern.
>>> >
>>> >
>>> >
>>> I was mostly concerned about the controller failover time. Currently, the
>>> controller failover is likely dominated by the cost of reading
>>> topic/partition level information from ZK. If we add another partition
>>> level path in ZK, it probably will double the controller failover time.
>>> If
>>> the approach of representing the non-created replicas doesn't work, have
>>> you considered just adding the created flag in the leaderAndIsr path in
>>> ZK?
>>>
>>>
>> Yes, I have considered adding the created flag in the leaderAndIsr path
>> in ZK. If we were to add created flag per replica in the
>> LeaderAndIsrRequest, then it requires a lot of change in the code base.
>>
>> If we don't add created flag per replica in the LeaderAndIsrRequest, then
>> the information in leaderAndIsr path in ZK and LeaderAndIsrRequest would be
>> different. Further, the procedure for broker to update ISR in ZK will be a
>> bit complicated. When leader updates leaderAndIsr path in ZK, it will have
>> to first read created flags from ZK, change isr, and write leaderAndIsr
>> back to ZK. And it needs to check znode version and re-try write operation
>> in ZK if controller has updated ZK during this period. This is in contrast
>> to the current implementation where the leader either gets all the
>> information from LeaderAndIsrRequest sent by controller, or determine the
>> infromation by itself (e.g. ISR), before writing to leaderAndIsr path in ZK.
>>
>> It seems to me that the above solution is a bit complicated and not
>> clean. Thus I come up with the design in this KIP to store this created
>> flag in a separate zk path. The path is named controller_managed_state to
>> indicate that we can store in this znode all information that is managed by
>> controller only, as opposed to ISR.
>>
>> I agree with your concern of increased ZK read time during controller
>> failover. How about we store the "created" information in the
>> znode /brokers/topics/[topic]? We can change that znode to have the
>> following data format:
>>
>> {
>>   "version" : 2,
>>   "created" : {
>> "1" : [1, 2, 3],
>> ...
>>   }
>>   "partition" : {
>> "1" : [1, 2, 3],
>> ...
>>   }

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-13 Thread Dong Lin
And the test plan has also been updated to simulate disk failure by
changing log directory permission to 000.
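For reference, a system test could simulate this in Java roughly as follows,
assuming a POSIX file system; the log directory path below is just a
placeholder.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.util.EnumSet;

public class SimulateDiskFailure {
    public static void main(String[] args) throws IOException {
        Path logDir = Paths.get("/tmp/kafka-logs-1"); // placeholder log directory
        // Equivalent of "chmod 000": strip all permissions so the broker's next
        // read or write on this directory hits an IOException.
        Files.setPosixFilePermissions(logDir, EnumSet.noneOf(PosixFilePermission.class));
    }
}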

On Mon, Feb 13, 2017 at 5:50 PM, Dong Lin  wrote:

> Hi Jun,
>
> Thanks for the reply. These comments are very helpful. Let me answer them
> inline.
>
>
> On Mon, Feb 13, 2017 at 3:25 PM, Jun Rao  wrote:
>
>> Hi, Dong,
>>
>> Thanks for the reply. A few more replies and new comments below.
>>
>> On Fri, Feb 10, 2017 at 4:27 PM, Dong Lin  wrote:
>>
>> > Hi Jun,
>> >
>> > Thanks for the detailed comments. Please see answers inline:
>> >
>> > On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao  wrote:
>> >
>> > > Hi, Dong,
>> > >
>> > > Thanks for the updated wiki. A few comments below.
>> > >
>> > > 1. Topics get created
>> > > 1.1 Instead of storing successfully created replicas in ZK, could we
>> > store
>> > > unsuccessfully created replicas in ZK? Since the latter is less
>> common,
>> > it
>> > > probably reduces the load on ZK.
>> > >
>> >
>> > We can store unsuccessfully created replicas in ZK. But I am not sure if
>> > that can reduce write load on ZK.
>> >
>> > If we want to reduce write load on ZK using by store unsuccessfully
>> created
>> > replicas in ZK, then broker should not write to ZK if all replicas are
>> > successfully created. It means that if /broker/topics/[topic]/partiti
>> > ons/[partitionId]/controller_managed_state doesn't exist in ZK for a
>> given
>> > partition, we have to assume all replicas of this partition have been
>> > successfully created and send LeaderAndIsrRequest with create = false.
>> This
>> > becomes a problem if controller crashes before receiving
>> > LeaderAndIsrResponse to validate whether a replica has been created.
>> >
>> > I think this approach and reduce the number of bytes stored in ZK. But
>> I am
>> > not sure if this is a concern.
>> >
>> >
>> >
>> I was mostly concerned about the controller failover time. Currently, the
>> controller failover is likely dominated by the cost of reading
>> topic/partition level information from ZK. If we add another partition
>> level path in ZK, it probably will double the controller failover time. If
>> the approach of representing the non-created replicas doesn't work, have
>> you considered just adding the created flag in the leaderAndIsr path in
>> ZK?
>>
>>
> Yes, I have considered adding the created flag in the leaderAndIsr path in
> ZK. If we were to add created flag per replica in the LeaderAndIsrRequest,
> then it requires a lot of change in the code base.
>
> If we don't add created flag per replica in the LeaderAndIsrRequest, then
> the information in leaderAndIsr path in ZK and LeaderAndIsrRequest would be
> different. Further, the procedure for broker to update ISR in ZK will be a
> bit complicated. When leader updates leaderAndIsr path in ZK, it will have
> to first read created flags from ZK, change isr, and write leaderAndIsr
> back to ZK. And it needs to check znode version and re-try write operation
> in ZK if controller has updated ZK during this period. This is in contrast
> to the current implementation where the leader either gets all the
> information from LeaderAndIsrRequest sent by controller, or determine the
> infromation by itself (e.g. ISR), before writing to leaderAndIsr path in ZK.
>
> It seems to me that the above solution is a bit complicated and not clean.
> Thus I come up with the design in this KIP to store this created flag in a
> separate zk path. The path is named controller_managed_state to indicate
> that we can store in this znode all information that is managed by
> controller only, as opposed to ISR.
>
> I agree with your concern of increased ZK read time during controller
> failover. How about we store the "created" information in the
> znode /brokers/topics/[topic]? We can change that znode to have the
> following data format:
>
> {
>   "version" : 2,
>   "created" : {
> "1" : [1, 2, 3],
> ...
>   }
>   "partition" : {
> "1" : [1, 2, 3],
> ...
>   }
> }
>
> We won't have extra zk read using this solution. It also seems reasonable
> to put the partition assignment information together with replica creation
> information. The latter is only changed once after the partition is created
> or re-assigned.
>
>
>>
>>
>> >
>> > > 1.2 If an error is received for a follower, does the controller
>> eagerly
>> > > remove it from ISR or do we just let the leader removes it after
>> timeout?
>> > >
>> >
>> > No, Controller will not actively remove it from ISR. But controller will
>> > recognize it as offline replica and propagate this information to all
>> > brokers via UpdateMetadataRequest. Each leader can use this information
>> to
>> > actively remove offline replica from ISR set. I have updated to wiki to
>> > clarify it.
>> >
>> >
>>
>> That seems inconsistent with how the controller deals with offline
>> replicas
>> due to broker failures. When that happens, the broker will (1) select a

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-13 Thread Dong Lin
Hi Jun,

Thanks for the reply. These comments are very helpful. Let me answer them
inline.


On Mon, Feb 13, 2017 at 3:25 PM, Jun Rao  wrote:

> Hi, Dong,
>
> Thanks for the reply. A few more replies and new comments below.
>
> On Fri, Feb 10, 2017 at 4:27 PM, Dong Lin  wrote:
>
> > Hi Jun,
> >
> > Thanks for the detailed comments. Please see answers inline:
> >
> > On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao  wrote:
> >
> > > Hi, Dong,
> > >
> > > Thanks for the updated wiki. A few comments below.
> > >
> > > 1. Topics get created
> > > 1.1 Instead of storing successfully created replicas in ZK, could we
> > store
> > > unsuccessfully created replicas in ZK? Since the latter is less common,
> > it
> > > probably reduces the load on ZK.
> > >
> >
> > We can store unsuccessfully created replicas in ZK. But I am not sure if
> > that can reduce write load on ZK.
> >
> > If we want to reduce write load on ZK using by store unsuccessfully
> created
> > replicas in ZK, then broker should not write to ZK if all replicas are
> > successfully created. It means that if /broker/topics/[topic]/partiti
> > ons/[partitionId]/controller_managed_state doesn't exist in ZK for a
> given
> > partition, we have to assume all replicas of this partition have been
> > successfully created and send LeaderAndIsrRequest with create = false.
> This
> > becomes a problem if controller crashes before receiving
> > LeaderAndIsrResponse to validate whether a replica has been created.
> >
> > I think this approach and reduce the number of bytes stored in ZK. But I
> am
> > not sure if this is a concern.
> >
> >
> >
> I was mostly concerned about the controller failover time. Currently, the
> controller failover is likely dominated by the cost of reading
> topic/partition level information from ZK. If we add another partition
> level path in ZK, it probably will double the controller failover time. If
> the approach of representing the non-created replicas doesn't work, have
> you considered just adding the created flag in the leaderAndIsr path in ZK?
>
>
Yes, I have considered adding the created flag in the leaderAndIsr path in
ZK. If we were to add a created flag per replica in the LeaderAndIsrRequest,
it would require a lot of change in the code base.

If we don't add a created flag per replica in the LeaderAndIsrRequest, then
the information in the leaderAndIsr path in ZK and in the LeaderAndIsrRequest
would differ. Further, the procedure for the broker to update ISR in ZK would
be a bit complicated. When the leader updates the leaderAndIsr path in ZK, it
would have to first read the created flags from ZK, change the ISR, and write
leaderAndIsr back to ZK. And it would need to check the znode version and
re-try the write operation in ZK if the controller has updated ZK during this
period. This is in contrast to the current implementation, where the leader
either gets all the information from the LeaderAndIsrRequest sent by the
controller, or determines the information by itself (e.g. ISR), before
writing to the leaderAndIsr path in ZK.
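To make the read-modify-write cycle with the version check concrete, a
minimal sketch with the plain ZooKeeper client could look like the following;
the path and the update function are placeholders, not the actual broker code.

import java.util.function.UnaryOperator;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalZkUpdate {

    public static void updateWithRetry(ZooKeeper zk, String path, UnaryOperator<byte[]> update)
            throws KeeperException, InterruptedException {
        while (true) {
            Stat stat = new Stat();
            byte[] current = zk.getData(path, false, stat);
            byte[] updated = update.apply(current);
            try {
                // setData succeeds only if nobody (e.g. the controller) has
                // modified the znode since we read it.
                zk.setData(path, updated, stat.getVersion());
                return;
            } catch (KeeperException.BadVersionException e) {
                // Someone else updated the znode in between; re-read and retry.
            }
        }
    }
}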

It seems to me that the above solution is a bit complicated and not clean.
Thus I came up with the design in this KIP to store this created flag in a
separate zk path. The path is named controller_managed_state to indicate
that we can store in this znode all information that is managed by the
controller only, as opposed to the ISR.

I agree with your concern about increased ZK read time during controller
failover. How about storing the "created" information in the
znode /brokers/topics/[topic]? We can change that znode to have the
following data format:

{
  "version" : 2,
  "created" : {
    "1" : [1, 2, 3],
    ...
  },
  "partition" : {
    "1" : [1, 2, 3],
    ...
  }
}

We won't have extra zk read using this solution. It also seems reasonable
to put the partition assignment information together with replica creation
information. The latter is only changed once after the partition is created
or re-assigned.
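Just to make the proposed format concrete, a small sketch of reading either
section of this version-2 payload could look like the following; it assumes
the exact field names shown above and uses Jackson, and is not meant as
controller code.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TopicZnodeV2 {

    // section is "partition" for the assignment or "created" for created replicas.
    public static Map<Integer, List<Integer>> readSection(byte[] json, String section)
            throws Exception {
        JsonNode root = new ObjectMapper().readTree(json);
        Map<Integer, List<Integer>> result = new HashMap<>();
        Iterator<Map.Entry<String, JsonNode>> fields = root.get(section).fields();
        while (fields.hasNext()) {
            Map.Entry<String, JsonNode> entry = fields.next();
            List<Integer> brokerIds = new ArrayList<>();
            for (JsonNode id : entry.getValue()) {
                brokerIds.add(id.asInt());
            }
            result.put(Integer.parseInt(entry.getKey()), brokerIds);
        }
        return result;
    }
}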


>
>
> >
> > > 1.2 If an error is received for a follower, does the controller eagerly
> > > remove it from ISR or do we just let the leader removes it after
> timeout?
> > >
> >
> > No, Controller will not actively remove it from ISR. But controller will
> > recognize it as offline replica and propagate this information to all
> > brokers via UpdateMetadataRequest. Each leader can use this information
> to
> > actively remove offline replica from ISR set. I have updated to wiki to
> > clarify it.
> >
> >
>
> That seems inconsistent with how the controller deals with offline replicas
> due to broker failures. When that happens, the controller will (1) select a new
> leader if the offline replica is the leader; (2) remove the replica from
> ISR if the offline replica is the follower. So, intuitively, it seems that
> we should be doing the same thing when dealing with offline replicas due to
> disk failure.
>

My bad. I misunderstood how the controller currently handles broker failure
and ISR change. Yes we should

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-13 Thread Jun Rao
Hi, Dong,

Thanks for the reply. A few more replies and new comments below.

On Fri, Feb 10, 2017 at 4:27 PM, Dong Lin  wrote:

> Hi Jun,
>
> Thanks for the detailed comments. Please see answers inline:
>
> On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao  wrote:
>
> > Hi, Dong,
> >
> > Thanks for the updated wiki. A few comments below.
> >
> > 1. Topics get created
> > 1.1 Instead of storing successfully created replicas in ZK, could we
> store
> > unsuccessfully created replicas in ZK? Since the latter is less common,
> it
> > probably reduces the load on ZK.
> >
>
> We can store unsuccessfully created replicas in ZK. But I am not sure if
> that can reduce write load on ZK.
>
> If we want to reduce write load on ZK using by store unsuccessfully created
> replicas in ZK, then broker should not write to ZK if all replicas are
> successfully created. It means that if /broker/topics/[topic]/partiti
> ons/[partitionId]/controller_managed_state doesn't exist in ZK for a given
> partition, we have to assume all replicas of this partition have been
> successfully created and send LeaderAndIsrRequest with create = false. This
> becomes a problem if controller crashes before receiving
> LeaderAndIsrResponse to validate whether a replica has been created.
>
> I think this approach and reduce the number of bytes stored in ZK. But I am
> not sure if this is a concern.
>
>
>
I was mostly concerned about the controller failover time. Currently, the
controller failover is likely dominated by the cost of reading
topic/partition level information from ZK. If we add another partition
level path in ZK, it probably will double the controller failover time. If
the approach of representing the non-created replicas doesn't work, have
you considered just adding the created flag in the leaderAndIsr path in ZK?



>
> > 1.2 If an error is received for a follower, does the controller eagerly
> > remove it from ISR or do we just let the leader removes it after timeout?
> >
>
> No, Controller will not actively remove it from ISR. But controller will
> recognize it as offline replica and propagate this information to all
> brokers via UpdateMetadataRequest. Each leader can use this information to
> actively remove offline replica from ISR set. I have updated to wiki to
> clarify it.
>
>

That seems inconsistent with how the controller deals with offline replicas
due to broker failures. When that happens, the controller will (1) select a new
leader if the offline replica is the leader; (2) remove the replica from
ISR if the offline replica is the follower. So, intuitively, it seems that
we should be doing the same thing when dealing with offline replicas due to
disk failure.



>
> > 1.3 Similar, if an error is received for a leader, should the controller
> > trigger leader election again?
> >
>
> Yes, controller will trigger leader election if leader replica is offline.
> I have updated the wiki to clarify it.
>
>
> >
> > 2. A log directory stops working on a broker during runtime:
> > 2.1 It seems the broker remembers the failed directory after hitting an
> > IOException and the failed directory won't be used for creating new
> > partitions until the broker is restarted? If so, could you add that to
> the
> > wiki.
> >
>
> Right, broker assumes a log directory to be good after it starts, and mark
> log directory as bad once there is IOException when broker attempts to
> access the log directory. New replicas will only be created on good log
> directory. I just added this to the KIP.
>
>
> > 2.2 Could you be a bit more specific on how and during which operation
> the
> > broker detects directory failure? Is it when the broker hits an
> IOException
> > during writes, or both reads and writes?  For example, during broker
> > startup, it only reads from each of the log directories, if it hits an
> > IOException there, does the broker immediately mark the directory as
> > offline?
> >
>
> Broker marks log directory as bad once there is IOException when broker
> attempts to access the log directory. This includes read and write. These
> operations include log append, log read, log cleaning, watermark checkpoint
> etc. If broker hits IOException when it reads from each of the log
> directory during startup, it immediately mark the directory as offline.
>
> I just updated the KIP to clarify it.
>
>
> > 3. Partition reassignment: If we know a replica is offline, do we still
> > want to send StopReplicaRequest to it?
> >
>
> No, controller doesn't send StopReplicaRequest for an offline replica.
> Controller treats this scenario in the same way that the existing Kafka
> implementation does when the broker of this replica is offline.
>
>
> >
> > 4. UpdateMetadataRequestPartitionState: For offline_replicas, do they
> only
> > include offline replicas due to log directory failures or do they also
> > include offline replicas due to broker failure?
> >
>
> UpdateMetadataRequestPartitionState's offline_replicas include 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-10 Thread Dong Lin
Hi Jun,

Thanks for the detailed comments. Please see answers inline:

On Fri, Feb 10, 2017 at 3:08 PM, Jun Rao  wrote:

> Hi, Dong,
>
> Thanks for the updated wiki. A few comments below.
>
> 1. Topics get created
> 1.1 Instead of storing successfully created replicas in ZK, could we store
> unsuccessfully created replicas in ZK? Since the latter is less common, it
> probably reduces the load on ZK.
>

We can store unsuccessfully created replicas in ZK. But I am not sure that
this reduces the write load on ZK.

If we want to reduce the write load on ZK by storing unsuccessfully created
replicas, then the broker should not write to ZK if all replicas are
successfully created. It means that if
/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state
doesn't exist in ZK for a given partition, we have to assume all replicas of
this partition have been successfully created and send LeaderAndIsrRequest
with create = false. This becomes a problem if the controller crashes before
receiving the LeaderAndIsrResponse needed to validate whether a replica has
been created.

I think this approach can reduce the number of bytes stored in ZK. But I am
not sure that this is a concern.
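
To spell out that corner case, here is a rough sketch of the inference the
controller would be forced to make if only unsuccessfully created replicas
were stored (illustrative Java; the method and argument names are made up):

    import java.util.Set;

    public class CreateFlagSketch {
        // If the controller_managed_state znode is missing, the controller
        // has to assume every replica was already created and send
        // create = false. If the controller crashed before it could record a
        // failed creation, that assumption is wrong and the replica is never
        // created.
        static boolean createFlagFor(boolean znodeExists,
                                     Set<Integer> unsuccessfullyCreated,
                                     int replicaId) {
            if (!znodeExists) {
                return false;                  // forced assumption: already created
            }
            return unsuccessfullyCreated.contains(replicaId);
        }
    }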



> 1.2 If an error is received for a follower, does the controller eagerly
> remove it from ISR or do we just let the leader removes it after timeout?
>

No, the controller will not actively remove it from the ISR. But the
controller will recognize it as an offline replica and propagate this
information to all brokers via UpdateMetadataRequest. Each leader can use
this information to actively remove the offline replica from its ISR set. I
have updated the wiki to clarify it.


> 1.3 Similar, if an error is received for a leader, should the controller
> trigger leader election again?
>

Yes, controller will trigger leader election if leader replica is offline.
I have updated the wiki to clarify it.


>
> 2. A log directory stops working on a broker during runtime:
> 2.1 It seems the broker remembers the failed directory after hitting an
> IOException and the failed directory won't be used for creating new
> partitions until the broker is restarted? If so, could you add that to the
> wiki.
>

Right, the broker assumes a log directory to be good after it starts, and
marks a log directory as bad once there is an IOException when the broker
attempts to access it. New replicas will only be created on good log
directories. I just added this to the KIP.


> 2.2 Could you be a bit more specific on how and during which operation the
> broker detects directory failure? Is it when the broker hits an IOException
> during writes, or both reads and writes?  For example, during broker
> startup, it only reads from each of the log directories, if it hits an
> IOException there, does the broker immediately mark the directory as
> offline?
>

The broker marks a log directory as bad once there is an IOException when
the broker attempts to access the log directory. This includes both reads
and writes. These operations include log append, log read, log cleaning,
watermark checkpointing, etc. If the broker hits an IOException when it
reads from a log directory during startup, it immediately marks that
directory as offline.

I just updated the KIP to clarify it.
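
As a rough illustration of that rule (a sketch only, with made-up names; the
real broker code is of course more involved), every disk access could be
funneled through a wrapper like this:

    import java.io.IOException;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class LogDirFailureSketch {
        private final Set<String> offlineLogDirs = ConcurrentHashMap.newKeySet();

        interface DiskOperation { void run() throws IOException; }

        // Used for log append, log read, log cleaning, checkpointing and the
        // startup load: the first IOException marks the whole directory offline.
        void runOrMarkOffline(String logDir, DiskOperation op) {
            if (offlineLogDirs.contains(logDir)) return;   // already offline, skip
            try {
                op.run();
            } catch (IOException e) {
                offlineLogDirs.add(logDir);                // stays bad until restart
                // ...then mark all replicas on this directory offline and
                // notify the controller as described in the KIP.
            }
        }
    }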


> 3. Partition reassignment: If we know a replica is offline, do we still
> want to send StopReplicaRequest to it?
>

No, controller doesn't send StopReplicaRequest for an offline replica.
Controller treats this scenario in the same way that the existing Kafka
implementation does when the broker of this replica is offline.


>
> 4. UpdateMetadataRequestPartitionState: For offline_replicas, do they only
> include offline replicas due to log directory failures or do they also
> include offline replicas due to broker failure?
>

UpdateMetadataRequestPartitionState's offline_replicas include offline
replicas due to both log directory failure and broker failure. This is to
make the semantics of this field easier to understand. A broker can
distinguish whether a replica is offline due to broker failure or disk
failure by checking whether the replica's broker is live in the
UpdateMetadataRequest.
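
For example, a broker receiving the UpdateMetadataRequest could classify an
offline replica with something like the sketch below (illustrative only; it
assumes the request exposes a partition's offline replicas and the set of
live brokers):

    import java.util.Set;

    public class OfflineReasonSketch {
        enum Reason { ONLINE, DISK_FAILURE, BROKER_FAILURE }

        static Reason reasonFor(int replicaBrokerId,
                                Set<Integer> offlineReplicas,   // per partition
                                Set<Integer> liveBrokers) {     // from the same request
            if (!offlineReplicas.contains(replicaBrokerId)) return Reason.ONLINE;
            // Broker still alive but replica offline -> it must be a disk failure.
            return liveBrokers.contains(replicaBrokerId) ? Reason.DISK_FAILURE
                                                         : Reason.BROKER_FAILURE;
        }

        public static void main(String[] args) {
            System.out.println(reasonFor(2, Set.of(2, 3), Set.of(1, 2)));  // DISK_FAILURE
            System.out.println(reasonFor(3, Set.of(2, 3), Set.of(1, 2)));  // BROKER_FAILURE
        }
    }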


>
> 5. Tools: Could we add some kind of support in the tool to list offline
> directories?
>

In KIP-112 we don't have tools to list offline directories because we have
intentionally avoided exposing log directory information (e.g. log
directory path) to user or other brokers. I think we can add this feature
in KIP-113, in which we will have DescribeDirsRequest to list log directory
information (e.g. partition assignment, path, size) needed for rebalance.


>
> 6. Metrics: Could we add some metrics to show offline directories?
>

Sure. I think it makes sense to have each broker report its number of
offline replicas and offline log directories. These metrics were previously
put in KIP-113. I just added both metrics to KIP-112.
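
Conceptually the two gauges are just per-broker counters, roughly like the
sketch below (names and wiring are illustrative, not necessarily the final
metric names in the KIP):

    import java.util.concurrent.atomic.AtomicInteger;

    public class JbodMetricsSketch {
        // Exposed as gauges through the metrics library Kafka already uses.
        private final AtomicInteger offlineLogDirectoryCount = new AtomicInteger();
        private final AtomicInteger offlineReplicaCount = new AtomicInteger();

        void onLogDirFailure(int replicasOnThatDir) {
            offlineLogDirectoryCount.incrementAndGet();
            offlineReplicaCount.addAndGet(replicasOnThatDir);
        }
    }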


>
> 7. There are still references to kafka-log-dirs.sh. Are they valid?
>

My bad. I just removed this from "Changes in Operational Procedures" 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-10 Thread Jun Rao
Hi, Dong,

Thanks for the updated wiki. A few comments below.

1. Topics get created
1.1 Instead of storing successfully created replicas in ZK, could we store
unsuccessfully created replicas in ZK? Since the latter is less common, it
probably reduces the load on ZK.
1.2 If an error is received for a follower, does the controller eagerly
remove it from ISR or do we just let the leader removes it after timeout?
1.3 Similar, if an error is received for a leader, should the controller
trigger leader election again?

2. A log directory stops working on a broker during runtime:
2.1 It seems the broker remembers the failed directory after hitting an
IOException and the failed directory won't be used for creating new
partitions until the broker is restarted? If so, could you add that to the
wiki.
2.2 Could you be a bit more specific on how and during which operation the
broker detects directory failure? Is it when the broker hits an IOException
during writes, or both reads and writes?  For example, during broker
startup, it only reads from each of the log directories, if it hits an
IOException there, does the broker immediately mark the directory as
offline?

3. Partition reassignment: If we know a replica is offline, do we still
want to send StopReplicaRequest to it?

4. UpdateMetadataRequestPartitionState: For offline_replicas, do they only
include offline replicas due to log directory failures or do they also
include offline replicas due to broker failure?

5. Tools: Could we add some kind of support in the tool to list offline
directories?

6. Metrics: Could we add some metrics to show offline directories?

7. There are still references to kafka-log-dirs.sh. Are they valid?

8. Do you think KIP-113 is ready for review? One thing that KIP-113
mentions during partition reassignment is to first send
LeaderAndIsrRequest, followed by ChangeReplicaDirRequest. It seems it's
better if the replicas are created in the right log directory in the first
place? The reason that I brought it up here is because it may affect the
protocol of LeaderAndIsrRequest.

Jun

On Fri, Feb 10, 2017 at 9:53 AM, Dong Lin  wrote:

> Hi Jun,
>
> Can I replace zookeeper access with direct RPC for both ISR notification
> and disk failure notification in a future KIP, or do you feel we should do
> it in this KIP?
>
> Hi Eno, Grant and everyone,
>
> Is there further improvement you would like to see with this KIP?
>
> Thank you all for the comments,
>
> Dong
>
>
>
> On Thu, Feb 9, 2017 at 4:45 PM, Dong Lin  wrote:
>
> >
> >
> > On Thu, Feb 9, 2017 at 3:37 PM, Colin McCabe  wrote:
> >
> >> On Thu, Feb 9, 2017, at 11:40, Dong Lin wrote:
> >> > Thanks for all the comments Colin!
> >> >
> >> > To answer your questions:
> >> > - Yes, a broker will shutdown if all its log directories are bad.
> >>
> >> That makes sense.  Can you add this to the writeup?
> >>
> >
> > Sure. This has already been added. You can find it here
> >  pageId=67638402=9=10>
> > .
> >
> >
> >>
> >> > - I updated the KIP to explicitly state that a log directory will be
> >> > assumed to be good until broker sees IOException when it tries to
> access
> >> > the log directory.
> >>
> >> Thanks.
> >>
> >> > - Controller doesn't explicitly know whether there is new log
> directory
> >> > or
> >> > not. All controller knows is whether replicas are online or offline
> >> based
> >> > on LeaderAndIsrResponse. According to the existing Kafka
> implementation,
> >> > controller will always send LeaderAndIsrRequest to a broker after it
> >> > bounces.
> >>
> >> I thought so.  It's good to clarify, though.  Do you think it's worth
> >> adding a quick discussion of this on the wiki?
> >>
> >
> > Personally I don't think it is needed. If the broker starts with no bad
> > log directory, everything should work as it is and we should not need to
> > clarify it. The KIP has already covered the scenario when a broker starts
> > with a bad log directory. Also, the KIP doesn't claim or hint that we
> > support dynamic addition of new log directories. I think we are good.
> >
> >
> >> best,
> >> Colin
> >>
> >> >
> >> > Please see this
> >> >  >> n.action?pageId=67638402=9&
> selectedPageVersions=10>
> >> > for the change of the KIP.
> >> >
> >> > On Thu, Feb 9, 2017 at 11:04 AM, Colin McCabe 
> >> wrote:
> >> >
> >> > > On Thu, Feb 9, 2017, at 11:03, Colin McCabe wrote:
> >> > > > Thanks, Dong L.
> >> > > >
> >> > > > Do we plan on bringing down the broker process when all log
> >> directories
> >> > > > are offline?
> >> > > >
> >> > > > Can you explicitly state on the KIP that the log dirs are all
> >> considered
> >> > > > good after the broker process is bounced?  It seems like an
> >> important
> >> > > > thing to be clear about.  Also, perhaps discuss how the controller
> >> > > > 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-10 Thread Dong Lin
Hi Jun,

Can I replace zookeeper access with direct RPC for both ISR notification
and disk failure notification in a future KIP, or do you feel we should do
it in this KIP?

Hi Eno, Grant and everyone,

Is there further improvement you would like to see with this KIP?

Thank you all for the comments,

Dong



On Thu, Feb 9, 2017 at 4:45 PM, Dong Lin  wrote:

>
>
> On Thu, Feb 9, 2017 at 3:37 PM, Colin McCabe  wrote:
>
>> On Thu, Feb 9, 2017, at 11:40, Dong Lin wrote:
>> > Thanks for all the comments Colin!
>> >
>> > To answer your questions:
>> > - Yes, a broker will shutdown if all its log directories are bad.
>>
>> That makes sense.  Can you add this to the writeup?
>>
>
> Sure. This has already been added. You can find it here
> 
> .
>
>
>>
>> > - I updated the KIP to explicitly state that a log directory will be
>> > assumed to be good until broker sees IOException when it tries to access
>> > the log directory.
>>
>> Thanks.
>>
>> > - Controller doesn't explicitly know whether there is new log directory
>> > or
>> > not. All controller knows is whether replicas are online or offline
>> based
>> > on LeaderAndIsrResponse. According to the existing Kafka implementation,
>> > controller will always send LeaderAndIsrRequest to a broker after it
>> > bounces.
>>
>> I thought so.  It's good to clarify, though.  Do you think it's worth
>> adding a quick discussion of this on the wiki?
>>
>
> Personally I don't think it is needed. If the broker starts with no bad log
> directory, everything should work as it is and we should not need to clarify
> it. The KIP has already covered the scenario when a broker starts with a bad
> log directory. Also, the KIP doesn't claim or hint that we support dynamic
> addition of new log directories. I think we are good.
>
>
>> best,
>> Colin
>>
>> >
>> > Please see this
>> > > n.action?pageId=67638402=9=10>
>> > for the change of the KIP.
>> >
>> > On Thu, Feb 9, 2017 at 11:04 AM, Colin McCabe 
>> wrote:
>> >
>> > > On Thu, Feb 9, 2017, at 11:03, Colin McCabe wrote:
>> > > > Thanks, Dong L.
>> > > >
>> > > > Do we plan on bringing down the broker process when all log
>> directories
>> > > > are offline?
>> > > >
>> > > > Can you explicitly state on the KIP that the log dirs are all
>> considered
>> > > > good after the broker process is bounced?  It seems like an
>> important
>> > > > thing to be clear about.  Also, perhaps discuss how the controller
>> > > > becomes aware of the newly good log directories after a broker
>> bounce
>> > > > (and whether this triggers re-election).
>> > >
>> > > I meant to write, all the log dirs where the broker can still read the
>> > > index and some other files.  Clearly, log dirs that are completely
>> > > inaccessible will still be considered bad after a broker process
>> bounce.
>> > >
>> > > best,
>> > > Colin
>> > >
>> > > >
>> > > > +1 (non-binding) aside from that
>> > > >
>> > > >
>> > > >
>> > > > On Wed, Feb 8, 2017, at 00:47, Dong Lin wrote:
>> > > > > Hi all,
>> > > > >
>> > > > > Thank you all for the helpful suggestion. I have updated the KIP
>> to
>> > > > > address
>> > > > > the comments received so far. See here
>> > > > > > n.action?
>> > > pageId=67638402=8=9>to
>> > > > > read the changes of the KIP. Here is a summary of change:
>> > > > >
>> > > > > - Updated the Proposed Change section to change the recovery
>> steps.
>> > > After
>> > > > > this change, broker will also create replica as long as all log
>> > > > > directories
>> > > > > are working.
>> > > > > - Removed kafka-log-dirs.sh from this KIP since user no longer
>> needs to
>> > > > > use
>> > > > > it for recovery from bad disks.
>> > > > > - Explained how the znode controller_managed_state is managed in
>> the
>> > > > > Public
>> > > > > interface section.
>> > > > > - Explained what happens during controller failover, partition
>> > > > > reassignment
>> > > > > and topic deletion in the Proposed Change section.
>> > > > > - Updated Future Work section to include the following potential
>> > > > > improvements
>> > > > >   - Let broker notify controller of ISR change and disk state
>> change
>> > > via
>> > > > > RPC instead of using zookeeper
>> > > > >   - Handle various failure scenarios (e.g. slow disk) on a
>> case-by-case
>> > > > > basis. For example, we may want to detect slow disk and consider
>> it as
>> > > > > offline.
>> > > > >   - Allow admin to mark a directory as bad so that it will not be
>> used.
>> > > > >
>> > > > > Thanks,
>> > > > > Dong
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Tue, Feb 7, 2017 at 5:23 PM, Dong Lin 
>> wrote:
>> > > > >
>> > > > > > Hey Eno,
>> > > > > >
>> > > > > > Thanks much for the comment!
>> > > > > >
>> > > > > > I 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-09 Thread Dong Lin
On Thu, Feb 9, 2017 at 3:37 PM, Colin McCabe  wrote:

> On Thu, Feb 9, 2017, at 11:40, Dong Lin wrote:
> > Thanks for all the comments Colin!
> >
> > To answer your questions:
> > - Yes, a broker will shutdown if all its log directories are bad.
>
> That makes sense.  Can you add this to the writeup?
>

Sure. This has already been added. You can find it here

.


>
> > - I updated the KIP to explicitly state that a log directory will be
> > assumed to be good until broker sees IOException when it tries to access
> > the log directory.
>
> Thanks.
>
> > - Controller doesn't explicitly know whether there is new log directory
> > or
> > not. All controller knows is whether replicas are online or offline based
> > on LeaderAndIsrResponse. According to the existing Kafka implementation,
> > controller will always send LeaderAndIsrRequest to a broker after it
> > bounces.
>
> I thought so.  It's good to clarify, though.  Do you think it's worth
> adding a quick discussion of this on the wiki?
>

Personally I don't think it is needed. If the broker starts with no bad log
directory, everything should work as it is and we should not need to clarify
it. The KIP has already covered the scenario when a broker starts with a bad
log directory. Also, the KIP doesn't claim or hint that we support dynamic
addition of new log directories. I think we are good.


> best,
> Colin
>
> >
> > Please see this
> >  n.action?pageId=67638402=9=10>
> > for the change of the KIP.
> >
> > On Thu, Feb 9, 2017 at 11:04 AM, Colin McCabe 
> wrote:
> >
> > > On Thu, Feb 9, 2017, at 11:03, Colin McCabe wrote:
> > > > Thanks, Dong L.
> > > >
> > > > Do we plan on bringing down the broker process when all log
> directories
> > > > are offline?
> > > >
> > > > Can you explicitly state on the KIP that the log dirs are all
> considered
> > > > good after the broker process is bounced?  It seems like an important
> > > > thing to be clear about.  Also, perhaps discuss how the controller
> > > > becomes aware of the newly good log directories after a broker bounce
> > > > (and whether this triggers re-election).
> > >
> > > I meant to write, all the log dirs where the broker can still read the
> > > index and some other files.  Clearly, log dirs that are completely
> > > inaccessible will still be considered bad after a broker process
> bounce.
> > >
> > > best,
> > > Colin
> > >
> > > >
> > > > +1 (non-binding) aside from that
> > > >
> > > >
> > > >
> > > > On Wed, Feb 8, 2017, at 00:47, Dong Lin wrote:
> > > > > Hi all,
> > > > >
> > > > > Thank you all for the helpful suggestion. I have updated the KIP to
> > > > > address
> > > > > the comments received so far. See here
> > > > >  n.action?
> > > pageId=67638402=8=9>to
> > > > > read the changes of the KIP. Here is a summary of change:
> > > > >
> > > > > - Updated the Proposed Change section to change the recovery steps.
> > > After
> > > > > this change, broker will also create replica as long as all log
> > > > > directories
> > > > > are working.
> > > > > - Removed kafka-log-dirs.sh from this KIP since user no longer
> needs to
> > > > > use
> > > > > it for recovery from bad disks.
> > > > > - Explained how the znode controller_managed_state is managed in
> the
> > > > > Public
> > > > > interface section.
> > > > > - Explained what happens during controller failover, partition
> > > > > reassignment
> > > > > and topic deletion in the Proposed Change section.
> > > > > - Updated Future Work section to include the following potential
> > > > > improvements
> > > > >   - Let broker notify controller of ISR change and disk state
> change
> > > via
> > > > > RPC instead of using zookeeper
> > > > >   - Handle various failure scenarios (e.g. slow disk) on a
> case-by-case
> > > > > basis. For example, we may want to detect slow disk and consider
> it as
> > > > > offline.
> > > > >   - Allow admin to mark a directory as bad so that it will not be
> used.
> > > > >
> > > > > Thanks,
> > > > > Dong
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Feb 7, 2017 at 5:23 PM, Dong Lin 
> wrote:
> > > > >
> > > > > > Hey Eno,
> > > > > >
> > > > > > Thanks much for the comment!
> > > > > >
> > > > > > I still think the complexity added to Kafka is justified by its
> > > benefit.
> > > > > > Let me provide my reasons below.
> > > > > >
> > > > > > 1) The additional logic is easy to understand and thus its
> complexity
> > > > > > should be reasonable.
> > > > > >
> > > > > > On the broker side, it needs to catch exception when access log
> > > directory,
> > > > > > mark log directory and all its replicas as offline, notify
> > > controller by
> > > > > > writing the zookeeper notification path, and specify error in
> > > > > > 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-09 Thread Colin McCabe
On Thu, Feb 9, 2017, at 11:40, Dong Lin wrote:
> Thanks for all the comments Colin!
> 
> To answer your questions:
> - Yes, a broker will shutdown if all its log directories are bad.

That makes sense.  Can you add this to the writeup?

> - I updated the KIP to explicitly state that a log directory will be
> assumed to be good until broker sees IOException when it tries to access
> the log directory.

Thanks.

> - Controller doesn't explicitly know whether there is new log directory
> or
> not. All controller knows is whether replicas are online or offline based
> on LeaderAndIsrResponse. According to the existing Kafka implementation,
> controller will always send LeaderAndIsrRequest to a broker after it
> bounces.

I thought so.  It's good to clarify, though.  Do you think it's worth
adding a quick discussion of this on the wiki?

best,
Colin

> 
> Please see this
> 
> for the change of the KIP.
> 
> On Thu, Feb 9, 2017 at 11:04 AM, Colin McCabe  wrote:
> 
> > On Thu, Feb 9, 2017, at 11:03, Colin McCabe wrote:
> > > Thanks, Dong L.
> > >
> > > Do we plan on bringing down the broker process when all log directories
> > > are offline?
> > >
> > > Can you explicitly state on the KIP that the log dirs are all considered
> > > good after the broker process is bounced?  It seems like an important
> > > thing to be clear about.  Also, perhaps discuss how the controller
> > > becomes aware of the newly good log directories after a broker bounce
> > > (and whether this triggers re-election).
> >
> > I meant to write, all the log dirs where the broker can still read the
> > index and some other files.  Clearly, log dirs that are completely
> > inaccessible will still be considered bad after a broker process bounce.
> >
> > best,
> > Colin
> >
> > >
> > > +1 (non-binding) aside from that
> > >
> > >
> > >
> > > On Wed, Feb 8, 2017, at 00:47, Dong Lin wrote:
> > > > Hi all,
> > > >
> > > > Thank you all for the helpful suggestion. I have updated the KIP to
> > > > address
> > > > the comments received so far. See here
> > > >  > pageId=67638402=8=9>to
> > > > read the changes of the KIP. Here is a summary of change:
> > > >
> > > > - Updated the Proposed Change section to change the recovery steps.
> > After
> > > > this change, broker will also create replica as long as all log
> > > > directories
> > > > are working.
> > > > - Removed kafka-log-dirs.sh from this KIP since user no longer needs to
> > > > use
> > > > it for recovery from bad disks.
> > > > - Explained how the znode controller_managed_state is managed in the
> > > > Public
> > > > interface section.
> > > > - Explained what happens during controller failover, partition
> > > > reassignment
> > > > and topic deletion in the Proposed Change section.
> > > > - Updated Future Work section to include the following potential
> > > > improvements
> > > >   - Let broker notify controller of ISR change and disk state change
> > via
> > > > RPC instead of using zookeeper
> > > >   - Handle various failure scenarios (e.g. slow disk) on a case-by-case
> > > > basis. For example, we may want to detect slow disk and consider it as
> > > > offline.
> > > >   - Allow admin to mark a directory as bad so that it will not be used.
> > > >
> > > > Thanks,
> > > > Dong
> > > >
> > > >
> > > >
> > > > On Tue, Feb 7, 2017 at 5:23 PM, Dong Lin  wrote:
> > > >
> > > > > Hey Eno,
> > > > >
> > > > > Thanks much for the comment!
> > > > >
> > > > > I still think the complexity added to Kafka is justified by its
> > benefit.
> > > > > Let me provide my reasons below.
> > > > >
> > > > > 1) The additional logic is easy to understand and thus its complexity
> > > > > should be reasonable.
> > > > >
> > > > > On the broker side, it needs to catch exception when access log
> > directory,
> > > > > mark log directory and all its replicas as offline, notify
> > controller by
> > > > > writing the zookeeper notification path, and specify error in
> > > > > LeaderAndIsrResponse. On the controller side, it will listener to
> > > > > zookeeper for disk failure notification, learn about offline
> > replicas in
> > > > > the LeaderAndIsrResponse, and take offline replicas into
> > consideration when
> > > > > electing leaders. It also mark replica as created in zookeeper and
> > use it
> > > > > to determine whether a replica is created.
> > > > >
> > > > > That is all the logic we need to add in Kafka. I personally feel
> > this is
> > > > > easy to reason about.
> > > > >
> > > > > 2) The additional code is not much.
> > > > >
> > > > > I expect the code for KIP-112 to be around 1100 lines new code.
> > Previously
> > > > > I have implemented a prototype of a slightly different design (see
> > here
> > > > >  > 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-09 Thread Dong Lin
Thanks for all the comments Colin!

To answer your questions:
- Yes, a broker will shutdown if all its log directories are bad.
- I updated the KIP to explicitly state that a log directory will be
assumed to be good until broker sees IOException when it tries to access
the log directory.
- Controller doesn't explicitly know whether there is new log directory or
not. All controller knows is whether replicas are online or offline based
on LeaderAndIsrResponse. According to the existing Kafka implementation,
controller will always send LeaderAndIsrRequest to a broker after it
bounces.

Please see this

for the change of the KIP.

On Thu, Feb 9, 2017 at 11:04 AM, Colin McCabe  wrote:

> On Thu, Feb 9, 2017, at 11:03, Colin McCabe wrote:
> > Thanks, Dong L.
> >
> > Do we plan on bringing down the broker process when all log directories
> > are offline?
> >
> > Can you explicitly state on the KIP that the log dirs are all considered
> > good after the broker process is bounced?  It seems like an important
> > thing to be clear about.  Also, perhaps discuss how the controller
> > becomes aware of the newly good log directories after a broker bounce
> > (and whether this triggers re-election).
>
> I meant to write, all the log dirs where the broker can still read the
> index and some other files.  Clearly, log dirs that are completely
> inaccessible will still be considered bad after a broker process bounce.
>
> best,
> Colin
>
> >
> > +1 (non-binding) aside from that
> >
> >
> >
> > On Wed, Feb 8, 2017, at 00:47, Dong Lin wrote:
> > > Hi all,
> > >
> > > Thank you all for the helpful suggestion. I have updated the KIP to
> > > address
> > > the comments received so far. See here
> > >  pageId=67638402=8=9>to
> > > read the changes of the KIP. Here is a summary of change:
> > >
> > > - Updated the Proposed Change section to change the recovery steps.
> After
> > > this change, broker will also create replica as long as all log
> > > directories
> > > are working.
> > > - Removed kafka-log-dirs.sh from this KIP since user no longer needs to
> > > use
> > > it for recovery from bad disks.
> > > - Explained how the znode controller_managed_state is managed in the
> > > Public
> > > interface section.
> > > - Explained what happens during controller failover, partition
> > > reassignment
> > > and topic deletion in the Proposed Change section.
> > > - Updated Future Work section to include the following potential
> > > improvements
> > >   - Let broker notify controller of ISR change and disk state change
> via
> > > RPC instead of using zookeeper
> > >   - Handle various failure scenarios (e.g. slow disk) on a case-by-case
> > > basis. For example, we may want to detect slow disk and consider it as
> > > offline.
> > >   - Allow admin to mark a directory as bad so that it will not be used.
> > >
> > > Thanks,
> > > Dong
> > >
> > >
> > >
> > > On Tue, Feb 7, 2017 at 5:23 PM, Dong Lin  wrote:
> > >
> > > > Hey Eno,
> > > >
> > > > Thanks much for the comment!
> > > >
> > > > I still think the complexity added to Kafka is justified by its
> benefit.
> > > > Let me provide my reasons below.
> > > >
> > > > 1) The additional logic is easy to understand and thus its complexity
> > > > should be reasonable.
> > > >
> > > > On the broker side, it needs to catch exception when access log
> directory,
> > > > mark log directory and all its replicas as offline, notify
> controller by
> > > > writing the zookeeper notification path, and specify error in
> > > > LeaderAndIsrResponse. On the controller side, it will listener to
> > > > zookeeper for disk failure notification, learn about offline
> replicas in
> > > > the LeaderAndIsrResponse, and take offline replicas into
> consideration when
> > > > electing leaders. It also mark replica as created in zookeeper and
> use it
> > > > to determine whether a replica is created.
> > > >
> > > > That is all the logic we need to add in Kafka. I personally feel
> this is
> > > > easy to reason about.
> > > >
> > > > 2) The additional code is not much.
> > > >
> > > > I expect the code for KIP-112 to be around 1100 lines new code.
> Previously
> > > > I have implemented a prototype of a slightly different design (see
> here
> > > >  -Dqi3D8e0KGJQYW8xgEdRsgAI/edit>)
> > > > and uploaded it to github (see here
> > > > ). The patch changed
> 33
> > > > files, added 1185 lines and deleted 183 lines. The size of prototype
> patch
> > > > is actually smaller than patch of KIP-107 (see here
> > > > ) which is already
> accepted.
> > > > The KIP-107 patch changed 49 files, added 1349 lines and deleted 141
> lines.
> > > >
> > > > 3) Comparison with 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-09 Thread Colin McCabe
On Thu, Feb 9, 2017, at 11:03, Colin McCabe wrote:
> Thanks, Dong L.
> 
> Do we plan on bringing down the broker process when all log directories
> are offline?
> 
> Can you explicitly state on the KIP that the log dirs are all considered
> good after the broker process is bounced?  It seems like an important
> thing to be clear about.  Also, perhaps discuss how the controller
> becomes aware of the newly good log directories after a broker bounce
> (and whether this triggers re-election).

I meant to write, all the log dirs where the broker can still read the
index and some other files.  Clearly, log dirs that are completely
inaccessible will still be considered bad after a broker process bounce.

best,
Colin

> 
> +1 (non-binding) aside from that
> 
> 
> 
> On Wed, Feb 8, 2017, at 00:47, Dong Lin wrote:
> > Hi all,
> > 
> > Thank you all for the helpful suggestion. I have updated the KIP to
> > address
> > the comments received so far. See here
> > to
> > read the changes of the KIP. Here is a summary of change:
> > 
> > - Updated the Proposed Change section to change the recovery steps. After
> > this change, broker will also create replica as long as all log
> > directories
> > are working.
> > - Removed kafka-log-dirs.sh from this KIP since user no longer needs to
> > use
> > it for recovery from bad disks.
> > - Explained how the znode controller_managed_state is managed in the
> > Public
> > interface section.
> > - Explained what happens during controller failover, partition
> > reassignment
> > and topic deletion in the Proposed Change section.
> > - Updated Future Work section to include the following potential
> > improvements
> >   - Let broker notify controller of ISR change and disk state change via
> > RPC instead of using zookeeper
> >   - Handle various failure scenarios (e.g. slow disk) on a case-by-case
> > basis. For example, we may want to detect slow disk and consider it as
> > offline.
> >   - Allow admin to mark a directory as bad so that it will not be used.
> > 
> > Thanks,
> > Dong
> > 
> > 
> > 
> > On Tue, Feb 7, 2017 at 5:23 PM, Dong Lin  wrote:
> > 
> > > Hey Eno,
> > >
> > > Thanks much for the comment!
> > >
> > > I still think the complexity added to Kafka is justified by its benefit.
> > > Let me provide my reasons below.
> > >
> > > 1) The additional logic is easy to understand and thus its complexity
> > > should be reasonable.
> > >
> > > On the broker side, it needs to catch exception when access log directory,
> > > mark log directory and all its replicas as offline, notify controller by
> > > writing the zookeeper notification path, and specify error in
> > > LeaderAndIsrResponse. On the controller side, it will listener to
> > > zookeeper for disk failure notification, learn about offline replicas in
> > > the LeaderAndIsrResponse, and take offline replicas into consideration 
> > > when
> > > electing leaders. It also mark replica as created in zookeeper and use it
> > > to determine whether a replica is created.
> > >
> > > That is all the logic we need to add in Kafka. I personally feel this is
> > > easy to reason about.
> > >
> > > 2) The additional code is not much.
> > >
> > > I expect the code for KIP-112 to be around 1100 lines new code. Previously
> > > I have implemented a prototype of a slightly different design (see here
> > > )
> > > and uploaded it to github (see here
> > > ). The patch changed 33
> > > files, added 1185 lines and deleted 183 lines. The size of prototype patch
> > > is actually smaller than patch of KIP-107 (see here
> > > ) which is already accepted.
> > > The KIP-107 patch changed 49 files, added 1349 lines and deleted 141 
> > > lines.
> > >
> > > 3) Comparison with one-broker-per-multiple-volumes
> > >
> > > This KIP can improve the availability of Kafka in this case such that one
> > > failed volume doesn't bring down the entire broker.
> > >
> > > 4) Comparison with one-broker-per-volume
> > >
> > > If each volume maps to multiple disks, then we still have similar problem
> > > such that the broker will fail if any disk of the volume failed.
> > >
> > > If each volume maps to one disk, it means that we need to deploy 10
> > > brokers on a machine if the machine has 10 disks. I will explain the
> > > concern with this approach in order of their importance.
> > >
> > > - It is weird if we were to tell kafka user to deploy 50 brokers on a
> > > machine of 50 disks.
> > >
> > > - Either when user deploys Kafka on a commercial cloud platform or when
> > > user deploys their own cluster, the size or largest disk is usually
> > > limited. There will be scenarios where user want to increase broker
> > > capacity by having multiple disks per broker. This 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-09 Thread Colin McCabe
Thanks, Dong L.

Do we plan on bringing down the broker process when all log directories
are offline?

Can you explicitly state on the KIP that the log dirs are all considered
good after the broker process is bounced?  It seems like an important
thing to be clear about.  Also, perhaps discuss how the controller
becomes aware of the newly good log directories after a broker bounce
(and whether this triggers re-election).

+1 (non-binding) aside from that



On Wed, Feb 8, 2017, at 00:47, Dong Lin wrote:
> Hi all,
> 
> Thank you all for the helpful suggestion. I have updated the KIP to
> address
> the comments received so far. See here
> to
> read the changes of the KIP. Here is a summary of change:
> 
> - Updated the Proposed Change section to change the recovery steps. After
> this change, broker will also create replica as long as all log
> directories
> are working.
> - Removed kafka-log-dirs.sh from this KIP since user no longer needs to
> use
> it for recovery from bad disks.
> - Explained how the znode controller_managed_state is managed in the
> Public
> interface section.
> - Explained what happens during controller failover, partition
> reassignment
> and topic deletion in the Proposed Change section.
> - Updated Future Work section to include the following potential
> improvements
>   - Let broker notify controller of ISR change and disk state change via
> RPC instead of using zookeeper
>   - Handle various failure scenarios (e.g. slow disk) on a case-by-case
> basis. For example, we may want to detect slow disk and consider it as
> offline.
>   - Allow admin to mark a directory as bad so that it will not be used.
> 
> Thanks,
> Dong
> 
> 
> 
> On Tue, Feb 7, 2017 at 5:23 PM, Dong Lin  wrote:
> 
> > Hey Eno,
> >
> > Thanks much for the comment!
> >
> > I still think the complexity added to Kafka is justified by its benefit.
> > Let me provide my reasons below.
> >
> > 1) The additional logic is easy to understand and thus its complexity
> > should be reasonable.
> >
> > On the broker side, it needs to catch exception when access log directory,
> > mark log directory and all its replicas as offline, notify controller by
> > writing the zookeeper notification path, and specify error in
> > LeaderAndIsrResponse. On the controller side, it will listener to
> > zookeeper for disk failure notification, learn about offline replicas in
> > the LeaderAndIsrResponse, and take offline replicas into consideration when
> > electing leaders. It also mark replica as created in zookeeper and use it
> > to determine whether a replica is created.
> >
> > That is all the logic we need to add in Kafka. I personally feel this is
> > easy to reason about.
> >
> > 2) The additional code is not much.
> >
> > I expect the code for KIP-112 to be around 1100 lines new code. Previously
> > I have implemented a prototype of a slightly different design (see here
> > )
> > and uploaded it to github (see here
> > ). The patch changed 33
> > files, added 1185 lines and deleted 183 lines. The size of prototype patch
> > is actually smaller than patch of KIP-107 (see here
> > ) which is already accepted.
> > The KIP-107 patch changed 49 files, added 1349 lines and deleted 141 lines.
> >
> > 3) Comparison with one-broker-per-multiple-volumes
> >
> > This KIP can improve the availability of Kafka in this case such that one
> > failed volume doesn't bring down the entire broker.
> >
> > 4) Comparison with one-broker-per-volume
> >
> > If each volume maps to multiple disks, then we still have similar problem
> > such that the broker will fail if any disk of the volume failed.
> >
> > If each volume maps to one disk, it means that we need to deploy 10
> > brokers on a machine if the machine has 10 disks. I will explain the
> > concern with this approach in order of their importance.
> >
> > - It is weird if we were to tell kafka user to deploy 50 brokers on a
> > machine of 50 disks.
> >
> > - Either when user deploys Kafka on a commercial cloud platform or when
> > user deploys their own cluster, the size or largest disk is usually
> > limited. There will be scenarios where user want to increase broker
> > capacity by having multiple disks per broker. This JBOD KIP makes it
> > feasible without hurting availability due to single disk failure.
> >
> > - Automatic load rebalance across disks will be easier and more flexible
> > if one broker has multiple disks. This can be future work.
> >
> > - There is performance concern when you deploy 10 broker vs. 1 broker on
> > one machine. The metadata the cluster, including FetchRequest,
> > ProduceResponse, MetadataRequest and so on will all be 10X more. The
> > packet-per-second will be 10X higher which may 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-08 Thread Dong Lin
Hi all,

Thank you all for the helpful suggestion. I have updated the KIP to address
the comments received so far. See here
to
read the changes of the KIP. Here is a summary of change:

- Updated the Proposed Change section to change the recovery steps. After
this change, broker will also create replica as long as all log directories
are working.
- Removed kafka-log-dirs.sh from this KIP since user no longer needs to use
it for recovery from bad disks.
- Explained how the znode controller_managed_state is managed in the Public
interface section.
- Explained what happens during controller failover, partition reassignment
and topic deletion in the Proposed Change section.
- Updated Future Work section to include the following potential
improvements
  - Let broker notify controller of ISR change and disk state change via
RPC instead of using zookeeper
  - Handle various failure scenarios (e.g. slow disk) on a case-by-case
basis. For example, we may want to detect slow disk and consider it as
offline.
  - Allow admin to mark a directory as bad so that it will not be used.

Thanks,
Dong



On Tue, Feb 7, 2017 at 5:23 PM, Dong Lin  wrote:

> Hey Eno,
>
> Thanks much for the comment!
>
> I still think the complexity added to Kafka is justified by its benefit.
> Let me provide my reasons below.
>
> 1) The additional logic is easy to understand and thus its complexity
> should be reasonable.
>
> On the broker side, it needs to catch exception when access log directory,
> mark log directory and all its replicas as offline, notify controller by
> writing the zookeeper notification path, and specify error in
> LeaderAndIsrResponse. On the controller side, it will listener to
> zookeeper for disk failure notification, learn about offline replicas in
> the LeaderAndIsrResponse, and take offline replicas into consideration when
> electing leaders. It also mark replica as created in zookeeper and use it
> to determine whether a replica is created.
>
> That is all the logic we need to add in Kafka. I personally feel this is
> easy to reason about.
>
> 2) The additional code is not much.
>
> I expect the code for KIP-112 to be around 1100 lines new code. Previously
> I have implemented a prototype of a slightly different design (see here
> )
> and uploaded it to github (see here
> ). The patch changed 33
> files, added 1185 lines and deleted 183 lines. The size of prototype patch
> is actually smaller than patch of KIP-107 (see here
> ) which is already accepted.
> The KIP-107 patch changed 49 files, added 1349 lines and deleted 141 lines.
>
> 3) Comparison with one-broker-per-multiple-volumes
>
> This KIP can improve the availability of Kafka in this case such that one
> failed volume doesn't bring down the entire broker.
>
> 4) Comparison with one-broker-per-volume
>
> If each volume maps to multiple disks, then we still have similar problem
> such that the broker will fail if any disk of the volume failed.
>
> If each volume maps to one disk, it means that we need to deploy 10
> brokers on a machine if the machine has 10 disks. I will explain the
> concern with this approach in order of their importance.
>
> - It is weird if we were to tell kafka user to deploy 50 brokers on a
> machine of 50 disks.
>
> - Either when user deploys Kafka on a commercial cloud platform or when
> user deploys their own cluster, the size or largest disk is usually
> limited. There will be scenarios where user want to increase broker
> capacity by having multiple disks per broker. This JBOD KIP makes it
> feasible without hurting availability due to single disk failure.
>
> - Automatic load rebalance across disks will be easier and more flexible
> if one broker has multiple disks. This can be future work.
>
> - There is performance concern when you deploy 10 broker vs. 1 broker on
> one machine. The metadata the cluster, including FetchRequest,
> ProduceResponse, MetadataRequest and so on will all be 10X more. The
> packet-per-second will be 10X higher which may limit performance if pps is
> the performance bottleneck. The number of socket on the machine is 10X
> higher. And the number of replication thread will be 100X more. The impact
> will be more significant with increasing number of disks per machine. Thus
> it will limit Kafka's scalability in the long term.
>
> Thanks,
> Dong
>
>
> On Tue, Feb 7, 2017 at 1:51 AM, Eno Thereska 
> wrote:
>
>> Hi Dong,
>>
>> To simplify the discussion today, on my part I'll zoom into one thing
>> only:
>>
>> - I'll discuss the options called below : "one-broker-per-disk" or
>> "one-broker-per-few-disks".
>>
>> - I completely buy the JBOD vs RAID arguments so there is no need to
>> discuss that part 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-07 Thread Dong Lin
Hey Eno,

Thanks much for the comment!

I still think the complexity added to Kafka is justified by its benefit.
Let me provide my reasons below.

1) The additional logic is easy to understand and thus its complexity
should be reasonable.

On the broker side, it needs to catch exceptions when accessing a log
directory, mark the log directory and all its replicas as offline, notify
the controller by writing the zookeeper notification path, and specify the
error in the LeaderAndIsrResponse. On the controller side, it will listen to
zookeeper for disk failure notifications, learn about offline replicas from
the LeaderAndIsrResponse, and take offline replicas into consideration when
electing leaders. It also marks replicas as created in zookeeper and uses
that to determine whether a replica has been created.

That is all the logic we need to add in Kafka. I personally feel this is
easy to reason about.
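
For the notification part, the mechanism mirrors the existing ISR-change
notification: the broker writes a sequential znode that the controller
watches. A minimal sketch using the plain ZooKeeper client (the znode path
and payload here are assumptions for illustration, not the exact ones in the
KIP):

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class DiskFailureNotificationSketch {
        // The controller watches the parent path; on a new child it sends
        // LeaderAndIsrRequest to the broker to learn which replicas are offline.
        static void notifyControllerOfDiskFailure(ZooKeeper zk, int brokerId)
                throws KeeperException, InterruptedException {
            String payload = "{\"version\":1,\"broker\":" + brokerId + "}";
            zk.create("/disk_failure_notification/event_",      // assumed path
                      payload.getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT_SEQUENTIAL);
        }
    }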

2) The additional code is not much.

I expect the code for KIP-112 to be around 1100 lines new code. Previously
I have implemented a prototype of a slightly different design (see here
)
and uploaded it to github (see here
). The patch changed 33
files, added 1185 lines and deleted 183 lines. The size of prototype patch
is actually smaller than patch of KIP-107 (see here
) which is already accepted. The
KIP-107 patch changed 49 files, added 1349 lines and deleted 141 lines.

3) Comparison with one-broker-per-multiple-volumes

This KIP can improve the availability of Kafka in this case such that one
failed volume doesn't bring down the entire broker.

4) Comparison with one-broker-per-volume

If each volume maps to multiple disks, then we still have similar problem
such that the broker will fail if any disk of the volume failed.

If each volume maps to one disk, it means that we need to deploy 10 brokers
on a machine if the machine has 10 disks. I will explain the concerns with
this approach in order of their importance.

- It is weird if we were to tell Kafka users to deploy 50 brokers on a
machine with 50 disks.

- Whether the user deploys Kafka on a commercial cloud platform or runs
their own cluster, the size of the largest disk is usually limited. There
will be scenarios where users want to increase broker capacity by having
multiple disks per broker. This JBOD KIP makes that feasible without hurting
availability due to a single disk failure.

- Automatic load rebalance across disks will be easier and more flexible if
one broker has multiple disks. This can be future work.

- There is a performance concern when you deploy 10 brokers vs. 1 broker on
one machine. The metadata of the cluster, including FetchRequest,
ProduceResponse, MetadataRequest and so on, will all be 10X more. The
packets-per-second will be 10X higher, which may limit performance if pps is
the bottleneck. The number of sockets on the machine is 10X higher. And the
number of replication threads per machine will be roughly 100X more, since
each broker now fetches from about 10X as many source brokers and there are
10 brokers per machine. The impact will be more significant with an
increasing number of disks per machine. Thus it will limit Kafka's
scalability in the long term.
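
A quick back-of-the-envelope illustration of that last point, under the
simplifying assumption that each broker runs a fixed number of fetcher
threads per source broker and may replicate from any other broker:

    public class PerMachineOverheadSketch {
        public static void main(String[] args) {
            int machines = 30, disksPerMachine = 10, fetchersPerSourceBroker = 1;

            int brokersOnePerMachine = machines;
            int brokersOnePerDisk = machines * disksPerMachine;

            // Fetcher threads running on one machine in each deployment.
            int threadsOnePerMachine = fetchersPerSourceBroker * (brokersOnePerMachine - 1);
            int threadsOnePerDisk = disksPerMachine * fetchersPerSourceBroker * (brokersOnePerDisk - 1);

            System.out.println(threadsOnePerMachine + " vs " + threadsOnePerDisk); // 29 vs 2990, ~100X
        }
    }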

Thanks,
Dong


On Tue, Feb 7, 2017 at 1:51 AM, Eno Thereska  wrote:

> Hi Dong,
>
> To simplify the discussion today, on my part I'll zoom into one thing only:
>
> - I'll discuss the options called below : "one-broker-per-disk" or
> "one-broker-per-few-disks".
>
> - I completely buy the JBOD vs RAID arguments so there is no need to
> discuss that part for me. I buy it that JBODs are good.
>
> I find the terminology can be improved a bit. Ideally we'd be talking
> about volumes, not disks. Just to make it clear that Kafka understand
> volumes/directories, not individual raw disks. So by
> "one-broker-per-few-disks" what I mean is that the admin can pool a few
> disks together to create a volume/directory and give that to Kafka.
>
>
> The kernel of my question will be that the admin already has tools to 1)
> create volumes/directories from a JBOD and 2) start a broker on a desired
> machine and 3) assign a broker resources like a directory. I claim that
> those tools are sufficient to optimise resource allocation.  I understand
> that a broker could manage point 3) itself, ie juggle the directories. My
> question is whether the complexity added to Kafka is justified.
> Operationally it seems to me an admin will still have to do all the three
> items above.
>
> Looking forward to the discussion
> Thanks
> Eno
>
>
> > On 1 Feb 2017, at 17:21, Dong Lin  wrote:
> >
> > Hey Eno,
> >
> > Thanks much for the review.
> >
> > I think your suggestion is to split disks of a machine into multiple disk
> > sets and run one broker per disk set. Yeah this is similar to Colin's
> > suggestion of one-broker-per-disk, which we have evaluated at LinkedIn
> and
> > considered it to be a good short term approach.
> >
> > As of now I 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-07 Thread Dong Lin
Hey Jun,

Thanks for all the comments. I should have written the summary earlier but
got delayed. I think Grant has summarized pretty much every major issue we
discussed in the KIP meeting. I have provided an answer to each issue. Let
me try to address your questions here.

I will update the KIP to explain how that zookeeper path is managed and
used. I will also describe in the KIP what happens during (a) controller
failover, (b) partition reassignment, (c) topic deletion. I will let you
and everyone know once the change has been made in the KIP.

I actually think there should be no performance issue with having 5 RPCs per
disk failure in the cluster, given that we don't have a performance issue
with having 5 RPCs per partition ISR change. My gut feel is that the
frequency of ISR changes should be 100X higher than that of disk failures,
so the RPC load should not be a concern even if disk failures turn out to be
frequent.

It seems fine if the broker doesn't receive a response when notifying the
controller of the disk failure, as long as the controller is guaranteed to
send a LeaderAndIsrRequest to query the replica state on the broker.

I agree it is useful to replace zookeeper access with direct RPC for both
notification events. But I am wondering if we can do it in a future KIP.
Notification via zookeeper is pretty straightforward because we are already
doing this with no performance concern. On the other hand, letting the
broker notify the controller via RPC requires non-trivial design. We need to
decide the wire protocol(s), whether the broker should retry the
notification, which thread should send this RPC to the controller, etc.
Currently the broker doesn't have a well-established way to send RPCs to the
controller. When the broker sends a ControlledShutdownRequest to the
controller, it creates a NetworkClient on the fly. This means that
non-trivial work needs to be done in order to support a long-term solution
for the broker to send notifications to the controller.

Thanks,
Dong

On Tue, Feb 7, 2017 at 2:23 PM, Jun Rao  wrote:

> Hi, Dong,
>
> Thanks for the discussion in the KIP meeting today. A few comments inlined
> below.
>
> On Mon, Feb 6, 2017 at 7:22 PM, Dong Lin  wrote:
>
> > Hey Jun,
> >
> > Thanks for the review! Please see reply inline.
> >
> > On Mon, Feb 6, 2017 at 6:21 PM, Jun Rao  wrote:
> >
> > > Hi, Dong,
> > >
> > > Thanks for the proposal. A few quick questions/comments.
> > >
> > > 1. Do you know why your stress test loses 15% of the throughput with
> the
> > > one-broker-per-disk setup?
> > >
> >
> > I think it is probably related to thread scheduling and socket
> management,
> > though I haven't validated this theory.
> >
> > With one-broker-per-disk setup, each broker has 16 io threads, 12 network
> > threads, 14 replica fetcher threads.
> > With one-broker-per-machine setup, each broker has 160 io threads, 120
> > network threads, 140 replica fetcher threads.
> >
> > I can test this theory by increasing the thread of broker by 10 in an
> > existing cluster and see if throughput capacity changes. It is not
> > surprising if performance does degrade with 10X threads. But I haven't
> > validated this yet.
> >
>
>
> > > 2. In the KIP, it wasn't super clear to me what
> > > /broker/topics/[topic]/partitions/[partitionId]/controller_m
> anaged_state
> > > represents
> > > and when it's updated. It seems that this path should be updated every
> > time
> > > the disk associated with one of the replica goes bad or is fixed.
> > However,
> > > the wiki only mentions updating this path when a new topic is created.
> It
> > > would be useful to provide a high level description of what this path
> > > stores, when it's updated and by who.
> > >
> >
> > This path will only be updated by the controller. When a replica is
> > successfully created on a broker, the controller adds this replica id to
> > the "created" list of the corresponding partition. When a replica needs to
> > be re-created because the bad disk is replaced with an empty good disk,
> > the user executes kafka-log-dirs.sh so that the controller will remove
> > this replica id from the "created" list of the corresponding partition.
> >
> > The first part is described in "Topic gets created" scenario. The second
> > part is kind of mentioned in the "The disk (or log directory) gets fixed"
> > scenario but not clear as it doesn't mention the full zookeeper path. I
> > have made this
> >  > pageId=67638402=7=8>
> > change in the KIP to clarify the second part.
> >
> > Currently I have used steps per scenario to describe how the KIP works.
> > Would you like me to have a section to describe how this ZK path is
> > managed?
> >
> >
> >
> So, it seems that
> /broker/topics/[topic]/partitions/[partitionId]/controller_managed_state
> is a reflection of the log directory state in the broker. It would be
> useful to describe how the broker maintains the directory state and whether
> that state is reset during broker 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-07 Thread Dong Lin
Hey Grant,

Thanks much for the detailed summary! Yes, this is pretty much my
understanding of the KIP meeting.

I also think everyone agreed on the point you outlined in the email. Here
is my reply to the five issues you mentioned.

1) Automatic vs Manual Recovery

In the case where a disk is replaced rather than "repaired", we can allow
automatic recovery by doing this: a broker should always create the replica
on its disk if it has no bad log directory, regardless of the create flag in
the LeaderAndIsrRequest. I think this is a reasonable solution. If the user
removes the bad log directory from the broker config or replaces it with a
good log directory, the broker should feel free to create the replica if it
doesn't exist.

After this change, the user will no longer need to manually execute a script
to reset the created flag to recover a broker. The created flag in zookeeper
will only be used by the broker to ensure that it doesn't re-create a
replica on a good log directory if it restarts with some bad log
directories.

I will update the KIP to use this solution. Does this address everyone's
concern on this issue?
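
A minimal sketch of the resulting creation rule, as I read it (illustrative
only; the names are made up):

    import java.util.Set;

    public class ReplicaCreationSketch {
        // If the broker has no bad log directory it always creates the
        // missing replica, regardless of the create flag in the
        // LeaderAndIsrRequest. The "created" flag only matters when some log
        // directories are offline, to avoid re-creating a replica that may
        // live on a bad directory.
        static boolean shouldCreateReplica(boolean replicaExistsLocally,
                                           Set<String> offlineLogDirs,
                                           boolean createFlagInRequest) {
            if (replicaExistsLocally) return false;
            if (offlineLogDirs.isEmpty()) return true;
            return createFlagInRequest;
        }
    }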


2) Client communication

After the change described above, the client will no longer need to write a
notification to zookeeper.

Though the broker will still notify the controller via zookeeper, that is
something the broker already does for ISR change notifications. I am not
aware of any good reason for the broker to notify the controller via direct
RPC instead of zookeeper. If there is such a reason, can we change both of
them together in a follow-up KIP?

3) What is failure

I agree it is useful to handle various failures in different ways. This KIP
does not attempt to address that and simply classifies a log directory as
good or bad. A log directory is good if the broker can read from and write
to it without exceptions; otherwise it is bad. This is strictly better than
the existing broker implementation, where the broker does not handle a
"slow" disk and will fail if there is any exception.

I will include this as future work in the KIP. Does this sound good?


4) Manually making a directory as bad

I agree it is a good idea and can be handled by a future KIP. I will
include this as future work in the KIP.

5) Implementation complexity

Implementation complexity, and how it compares with the benefit, can
sometimes be subjective. I will try to provide reasons why I think the
implementation complexity is not that high.

- The broker logic is not complex. It catches exceptions when accessing a
log directory, marks the log directory and all its replicas as offline,
notifies the controller by writing the zookeeper notification path, and
specifies the error in the LeaderAndIsrResponse.
- The controller logic is not complex. It listens to zookeeper for disk
failure notifications, learns about offline replicas from the
LeaderAndIsrResponse, and takes offline replicas into consideration when
electing leaders. It also marks replicas as created in zookeeper and uses
that to determine whether a replica has been created.

That is all the logic we need to add to the Kafka code. I don't find it very
complex.

Also, I expect the code for KIP-112 to be around 1100 lines of new code.
Previously I implemented a prototype of a slightly different design (see
here
)
and uploaded it to github (see here
). The patch changed 33
files, added 1185 lines and deleted 183 lines. The size of prototype patch
is actually smaller than patch of KIP-107 (see here
), which changed 49 files, added
1349 lines and deleted 141 lines.

Thanks,
Dong


On Tue, Feb 7, 2017 at 2:22 PM, Grant Henke  wrote:

> Hi Dong,
>
> Thanks for proposing the KIP and all the hard work on it!
>
> In order to help summarize the discussion from the KIP call today I wanted
> to list the things I heard as the main discussion points that people would
> like to be considered or discussed. However, this is strictly from memory
> so please add anything I missed.
>
> First I want to list the thing I think everyone agreed on. If you don't
> agree with this please correct me:
>
>- Partial broker failure on a single bad directory is an improvement
>over existing behavior
>- Any "simplified" implementation should not prevent improvements going
>forward
>- "one-broker-per-disk" or "one-broker-per-few-disks" is an option but
>doesn't satisfy everyones requirements and has complexities of its own.
>Improving the handling of failed directories doesn't prevent this
>deployment pattern.
>- It may still be worth listing, even if theoretical the overhead of
>   this type of deployment
>- This KIP and in general (at this time) no one wants Kafka to worry
>about physical disks or logical volumes. The discussion of that should
> be
>handled by any future KIPs intending to introduce related functionality.
>Today in Kafka and this KIP deal strictly with the top level
> 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-07 Thread Jun Rao
Hi, Dong,

Thanks for the discussion in the KIP meeting today. A few comments inlined
below.

On Mon, Feb 6, 2017 at 7:22 PM, Dong Lin  wrote:

> Hey Jun,
>
> Thanks for the review! Please see reply inline.
>
> On Mon, Feb 6, 2017 at 6:21 PM, Jun Rao  wrote:
>
> > Hi, Dong,
> >
> > Thanks for the proposal. A few quick questions/comments.
> >
> > 1. Do you know why your stress test loses 15% of the throughput with the
> > one-broker-per-disk setup?
> >
>
> I think it is probably related to thread scheduling and socket management,
> though I haven't validated this theory.
>
> With one-broker-per-disk setup, each broker has 16 io threads, 12 network
> threads, 14 replica fetcher threads.
> With one-broker-per-machine setup, each broker has 160 io threads, 120
> network threads, 140 replica fetcher threads.
>
> I can test this theory by increasing the thread of broker by 10 in an
> existing cluster and see if throughput capacity changes. It is not
> surprising if performance does degrade with 10X threads. But I haven't
> validated this yet.
>


> > 2. In the KIP, it wasn't super clear to me what
> > /broker/topics/[topic]/partitions/[partitionId]/controller_managed_state
> > represents
> > and when it's updated. It seems that this path should be updated every
> time
> > the disk associated with one of the replica goes bad or is fixed.
> However,
> > the wiki only mentions updating this path when a new topic is created. It
> > would be useful to provide a high level description of what this path
> > stores, when it's updated and by who.
> >
>
> This path will only be updated by controller. When a replica is
> successfully created on a broker, the controller adds this replica id to the
> "created" list of the corresponding partition. When a replica needs to be
> re-created because the bad disk is replaced with an empty good disk, user
> executes kafka-log-dirs.sh so that controller will remove this replica id
> from the "created" list of the corresponding partition.
>
> The first part is described in "Topic gets created" scenario. The second
> part is kind of mentioned in the "The disk (or log directory) gets fixed"
> scenario but not clear as it doesn't mention the full zookeeper path. I
> have made this change in the KIP to clarify the second part.
>
> Currently I have used steps per scenario to describe how the KIP works.
> Would you like me to have a section to describe how this ZK path is
> managed?
>
>
>
So, it seems that
/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state
is a reflection of the log directory state in the broker. It would be
useful to describe how the broker maintains the directory state and whether
that state is reset during broker restart.

For completeness, it would be useful to also describe what happens during
(a) controller failover, (b) partition reassignment, (c) topic deletion
(for example, what happens when a replica to be deleted is on a failed log
directory).


> > 3. The proposal uses ZK to propagate disk failure/recovery from the
> broker
> > to the controller. Not sure if this is the best approach in the long
> term.
> > It may be better for the broker to send RPC requests directly to the
> > controller?
> >
>
> I choose to propagate this information via ZK for simplicity of the design
> and implementation since isr notification is passed via ZK and most events
> (e.g. broker offline, partition reassignment) are triggered in controller
> via ZK listener. Yes it can be implemented using RPC. But I am not very
> sure what we gain by using RPC instead of ZK. Should we have a separate KIP
> in the future to migrate all existing notification to using RPC?
>
>
My concern with ZK-based communication is efficiency. To send a message
from the broker to the controller in this approach, the sender needs to do
1 write to ZK and the receiver needs to do 1 read from ZK, followed by 1
delete to ZK. So, we will need a total of 5 RPCs (a read from ZK requires 1
RPC and a write/delete to ZK requires at least 2 RPCs). If the broker can
send a message directly to the controller, it just needs 1 RPC. Another
potential issue with the ZK-based approach is that it's hard for the sender
to receive a response. We made an exception for using ZK-based notification
for ISR propagation since it's a quicker way to fix an existing problem.
Since we are adding a new feature, it would be useful to think through
what's the best way for the broker to communicate with the controller in
the long term.
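
To make the count concrete, here is an illustrative Java sketch of the ZK
operations involved on each side (the path and method names are placeholders,
not a proposal):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class NotificationCost {
        // Broker side: one sequential create (at least 2 RPCs as counted above).
        static void send(ZooKeeper zk, byte[] payload)
                throws KeeperException, InterruptedException {
            zk.create("/log_dir_event_notification/log_dir_event_", payload,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        }

        // Controller side: one getData (1 RPC) plus one delete (at least 2 RPCs)
        // per notification child it consumes, for roughly 5 in total per event,
        // versus a single request if the broker talked to the controller directly.
        static byte[] consume(ZooKeeper zk, String childPath)
                throws KeeperException, InterruptedException {
            byte[] data = zk.getData(childPath, false, null);
            zk.delete(childPath, -1);
            return data;
        }
    }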


Thanks,

Jun


> >
> > Jun
> >
> >
> > On Wed, Jan 25, 2017 at 1:50 PM, Dong Lin  wrote:
> >
> > > Hey Colin,
> > >
> > > Good point! Yeah we have actually considered and tested this solution,
> > > which we call one-broker-per-disk. It would work and should require no
> > > major change in Kafka as compared to this JBOD KIP. So it would be a

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-07 Thread Grant Henke
Hi Dong,

Thanks for proposing the KIP and all the hard work on it!

In order to help summarize the discussion from the KIP call today I wanted
to list the things I heard as the main discussion points that people would
like to be considered or discussed. However, this is strictly from memory
so please add anything I missed.

First I want to list the thing I think everyone agreed on. If you don't
agree with this please correct me:

   - Partial broker failure on a single bad directory is an improvement
   over existing behavior
   - Any "simplified" implementation should not prevent improvements going
   forward
   - "one-broker-per-disk" or "one-broker-per-few-disks" is an option but
   doesn't satisfy everyones requirements and has complexities of its own.
   Improving the handling of failed directories doesn't prevent this
   deployment pattern.
   - It may still be worth listing, even if theoretical the overhead of
  this type of deployment
   - This KIP and in general (at this time) no one wants Kafka to worry
   about physical disks or logical volumes. The discussion of that should be
   handled by any future KIPs intending to introduce related functionality.
   Today in Kafka and this KIP deal strictly with the top level "directories"
   (log.dirs) configured by Kafka and knows nothing more than that they are a
   directory.

Below are other ideas and feedback discussed.

*Automatic vs Manual Recovery:*

There are 2 ways to handle directory IO failures. One extreme is to try to
recover automatically (on restart in this case) and the other is to require
manual intervention from an administrator (even through restarts). Some
expressed that it would be valuable to track directories and make enabling
them an explicit admin action as opposed to broker state in memory that is
reset on broker restart. The sentiment from the discussion is that this
could be a goal for the future but doesn't need to block this KIP.

I noted that I think either choice is okay, but I don't think a mixed mode
where new topics are handled automatically and old topics are manual would
be easy to understand. We found this scenario likely applies only when a disk
is replaced and not "repaired" (which would leave the data intact).
Regardless, defining this explicitly is important.

*Client Communication:*

Given the above discussion I thought it was important to be sure we need
any manual interaction at all from admins/clients. Today the KIP suggests
using a notification on zookeeper under /log_dir_event_notification. If
manual client interaction is still required, I request that clients
communicate with the broker via the wire protocol instead of directly with
Zookeeper. I think the AdminClient in KIP-117 would be a great place for
this api.

*What is Failure:*

There was a lot of discussion based around defining failure and the various
permanent and temporary ways a disk can "fail". It was suggested this can
be fairly complicated and it may be worth evaluating or at least
documenting the specific scenarios handled and how (read failures, write
failures, slow IO, etc). Going forward improvements to detect and handle
more interesting failures may be useful.

*Manually Marking a Directory as Bad:*

Because of the topics above it was mentioned that it would be useful to
allow an admin to mark a directory that appears "good" as "bad". Often in
soft failures an admin may know a directory is bad before Kafka does and it
would be nice to be able to mark the directory as bad without a restart. In
the case of a restart it was noted that you could simply update the
log.dirs configuration on the broker.
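
For example (the paths below are made up; only the log.dirs key is real), the
admin would restart the broker with the failed volume removed from the
configuration:

    # server.properties before the restart
    log.dirs=/data/disk1/kafka-logs,/data/disk2/kafka-logs,/data/disk3/kafka-logs

    # after dropping the failed /data/disk2 volume
    log.dirs=/data/disk1/kafka-logs,/data/disk3/kafka-logs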

The general consensus was that this was a good idea, but could be handled
by a future KIP.

*Implementation complexity: *

Some people were concerned about the complexity of the implementation vs
the benefits. It sounds like many agreed it was "worth it", but that
justifying and describing the complexity tradeoffs in the KIP would be
useful. I will let others describe their concerns more in follow-up emails.

Those were the main talking points I remember, please add more details or
follow ups on anything I missed.

Thanks,
Grant



On Tue, Feb 7, 2017 at 3:51 AM, Eno Thereska  wrote:

> Hi Dong,
>
> To simplify the discussion today, on my part I'll zoom into one thing only:
>
> - I'll discuss the options called below : "one-broker-per-disk" or
> "one-broker-per-few-disks".
>
> - I completely buy the JBOD vs RAID arguments so there is no need to
> discuss that part for me. I buy it that JBODs are good.
>
> I find the terminology can be improved a bit. Ideally we'd be talking
> about volumes, not disks. Just to make it clear that Kafka understand
> volumes/directories, not individual raw disks. So by
> "one-broker-per-few-disks" what I mean is that the admin can pool a few
> disks together to create a volume/directory and give that to Kafka.
>
>
> The kernel of my question will be that the admin already has tools 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-07 Thread Eno Thereska
Hi Dong,

To simplify the discussion today, on my part I'll zoom into one thing only:

- I'll discuss the options called below : "one-broker-per-disk" or 
"one-broker-per-few-disks". 

- I completely buy the JBOD vs RAID arguments so there is no need to discuss 
that part for me. I buy it that JBODs are good.

I find the terminology can be improved a bit. Ideally we'd be talking about 
volumes, not disks. Just to make it clear that Kafka understands 
volumes/directories, not individual raw disks. So by "one-broker-per-few-disks" 
what I mean is that the admin can pool a few disks together to create a 
volume/directory and give that to Kafka.


The kernel of my question will be that the admin already has tools to 1) create 
volumes/directories from a JBOD and 2) start a broker on a desired machine and 
3) assign a broker resources like a directory. I claim that those tools are 
sufficient to optimise resource allocation.  I understand that a broker could 
manage point 3) itself, i.e. juggle the directories. My question is whether the 
complexity added to Kafka is justified. Operationally it seems to me an admin 
will still have to do all the three items above.

Looking forward to the discussion
Thanks
Eno


> On 1 Feb 2017, at 17:21, Dong Lin  wrote:
> 
> Hey Eno,
> 
> Thanks much for the review.
> 
> I think your suggestion is to split disks of a machine into multiple disk
> sets and run one broker per disk set. Yeah this is similar to Colin's
> suggestion of one-broker-per-disk, which we have evaluated at LinkedIn and
> considered it to be a good short term approach.
> 
> As of now I don't think any of these approach is a better alternative in
> the long term. I will summarize these here. I have put these reasons in the
> KIP's motivation section and rejected alternative section. I am happy to
> discuss more and I would certainly like to use an alternative solution that
> is easier to do with better performance.
> 
> - JBOD vs. RAID-10: if we switch from RAID-10 with replication-factor=2 to
> JBOD with replication-factor=3, we get a 25% reduction in disk usage and
> double the tolerance of broker failures before data unavailability, from 1
> to 2. This is a pretty huge gain for any company that uses Kafka at large
> scale.
> 
> - JBOD vs. one-broker-per-disk: The benefit of one-broker-per-disk is that
> no major code change is needed in Kafka. Among the disadvantage of
> one-broker-per-disk summarized in the KIP and previous email with Colin,
> the biggest one is the 15% throughput loss compared to JBOD and less
> flexibility to balance across disks. Further, it probably requires change
> to internal deployment tools at various companies to deal with
> one-broker-per-disk setup.
> 
> - JBOD vs. RAID-0: This is the setup used at Microsoft. The problem is
> that a broker becomes unavailable if any disk fails. Suppose
> replication-factor=2 and there are 10 disks per machine. Then the
> probability of any message becoming unavailable due to disk failure with
> RAID-0 is 100X higher than that with JBOD.
> 
> - JBOD vs. one-broker-per-few-disks: one-broker-per-few-disks is somewhere
> between one-broker-per-disk and RAID-0. So it carries an average of the
> disadvantages of these two approaches.
> 
> To answer your question, I think it is reasonable to manage disks
> in Kafka. By "managing disks" we mean the management of the assignment of
> replicas across disks. Here are my reasons in more detail:
> 
> - I don't think this KIP is a big step change. By allowing users to
> configure Kafka to run with multiple log directories or disks as of now, it is
> already implicit that Kafka manages disks. It is just not a complete feature.
> Microsoft and probably other companies are using this feature under the
> undesirable effect that a broker will fail if any disk fails. It is good
> to complete this feature.
> 
> - I think it is reasonable to manage disk in Kafka. One of the most
> important work that Kafka is doing is to determine the replica assignment
> across brokers and make sure enough copies of a given replica are available.
> I would argue that it is not much different than determining the replica
> assignment across disk conceptually.
> 
> - I would agree that this KIP improves the performance of Kafka at the cost
> of more complexity inside Kafka, by switching from RAID-10 to JBOD. I would
> argue that this is the right direction. If we can gain 20%+ performance by
> managing NIC in Kafka as compared to existing approach and other
> alternatives, I would say we should just do it. Such a gain in performance,
> or equivalently reduction in cost, can save millions of dollars per year
> for any company running Kafka at large scale.
> 
> Thanks,
> Dong
> 
> 
> On Wed, Feb 1, 2017 at 5:41 AM, Eno Thereska  wrote:
> 
>> I'm coming somewhat late to the discussion, apologies for that.
>> 
>> I'm worried about this proposal. It's moving Kafka to a world where it
>> manages disks. So in a 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-06 Thread Dong Lin
Hey Jun,

Thanks for the review! Please see reply inline.

On Mon, Feb 6, 2017 at 6:21 PM, Jun Rao  wrote:

> Hi, Dong,
>
> Thanks for the proposal. A few quick questions/comments.
>
> 1. Do you know why your stress test loses 15% of the throughput with the
> one-broker-per-disk setup?
>

I think it is probably related to thread scheduling and socket management,
though I haven't validated this theory.

With one-broker-per-disk setup, each broker has 16 io threads, 12 network
threads, 14 replica fetcher threads.
With one-broker-per-machine setup, each broker has 160 io threads, 120
network threads, 140 replica fetcher threads.

I can test this theory by increasing the broker's thread counts by 10X in an
existing cluster and seeing if the throughput capacity changes. It is not
surprising if performance does degrade with 10X threads. But I haven't
validated this yet.


> 2. In the KIP, it wasn't super clear to me what
> /broker/topics/[topic]/partitions/[partitionId]/controller_managed_state
> represents
> and when it's updated. It seems that this path should be updated every time
> the disk associated with one of the replica goes bad or is fixed. However,
> the wiki only mentions updating this path when a new topic is created. It
> would be useful to provide a high level description of what this path
> stores, when it's updated and by who.
>

This path will only be updated by the controller. When a replica is
successfully created on a broker, the controller adds this replica id to the
"created" list of the corresponding partition. When a replica needs to be
re-created because the bad disk is replaced with an empty good disk, the user
executes kafka-log-dirs.sh so that the controller will remove this replica id
from the "created" list of the corresponding partition.

The first part is described in "Topic gets created" scenario. The second
part is kind of mentioned in the "The disk (or log directory) gets fixed"
scenario but not clear as it doesn't mention the full zookeeper path. I
have made this change in the KIP to clarify the second part.

Currently I have used steps per scenario to describe how the KIP works.
Would you like me to have a section to describe how this ZK path is managed?


> 3. The proposal uses ZK to propagate disk failure/recovery from the broker
> to the controller. Not sure if this is the best approach in the long term.
> It may be better for the broker to send RPC requests directly to the
> controller?
>

I choose to propagate this information via ZK for simplicity of the design
and implementation since isr notification is passed via ZK and most events
(e.g. broker offline, partition reassignment) are triggered in controller
via ZK listener. Yes it can be implemented using RPC. But I am not very
sure what we gain by using RPC instead of ZK. Should we have a separate KIP
in the future to migrate all existing notification to using RPC?


>
> Jun
>
>
> On Wed, Jan 25, 2017 at 1:50 PM, Dong Lin  wrote:
>
> > Hey Colin,
> >
> > Good point! Yeah we have actually considered and tested this solution,
> > which we call one-broker-per-disk. It would work and should require no
> > major change in Kafka as compared to this JBOD KIP. So it would be a good
> > short term solution.
> >
> > But it has a few drawbacks which makes it less desirable in the long
> term.
> > Assume we have 10 disks on a machine. Here are the problems:
> >
> > 1) Our stress test result shows that one-broker-per-disk has 15% lower
> > throughput
> >
> > 2) Controller would need to send 10X as many LeaderAndIsrRequest,
> > MetadataUpdateRequest and StopReplicaRequest. This increases the burden
> on
> > controller which can be the performance bottleneck.
> >
> > 3) Less efficient use of physical resource on the machine. The number of
> > socket on each machine will increase by 10X. The number of connection
> > between any two machine will increase by 100X.
> >
> > 4) Less efficient way to management memory and quota.
> >
> > 5) Rebalance between disks/brokers on the same machine will less
> efficient
> > and less flexible. Broker has to read data from another broker on the
> same
> > machine via socket. It is also harder to do automatic load balance
> between
> > disks on the same machine in the future.
> >
> > I will put this and the explanation in the rejected alternative section.
> I
> > have a few questions:
> >
> > - Can you explain why this solution can help avoid scalability
> bottleneck?
> > I actually think it will exacerbate the scalability problem due the 2)
> > above.
> > - Why can we push more RPC with this solution?
> > - It is true that a garbage collection in one broker would not affect
> > others. But that is after every broker only uses 1/10 of the memory. Can
> we
> > be sure that this will actually help performance?
> >
> > Thanks,
> > Dong
> >
> > On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-06 Thread Jun Rao
Hi, Dong,

Thanks for the proposal. A few quick questions/comments.

1. Do you know why your stress test loses 15% of the throughput with the
one-broker-per-disk setup?

2. In the KIP, it wasn't super clear to me what
/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state
represents
and when it's updated. It seems that this path should be updated every time
the disk associated with one of the replicas goes bad or is fixed. However,
the wiki only mentions updating this path when a new topic is created. It
would be useful to provide a high level description of what this path
stores, when it's updated and by who.

3. The proposal uses ZK to propagate disk failure/recovery from the broker
to the controller. Not sure if this is the best approach in the long term.
It may be better for the broker to send RPC requests directly to the
controller?

Jun


On Wed, Jan 25, 2017 at 1:50 PM, Dong Lin  wrote:

> Hey Colin,
>
> Good point! Yeah we have actually considered and tested this solution,
> which we call one-broker-per-disk. It would work and should require no
> major change in Kafka as compared to this JBOD KIP. So it would be a good
> short term solution.
>
> But it has a few drawbacks which makes it less desirable in the long term.
> Assume we have 10 disks on a machine. Here are the problems:
>
> 1) Our stress test result shows that one-broker-per-disk has 15% lower
> throughput
>
> 2) Controller would need to send 10X as many LeaderAndIsrRequest,
> MetadataUpdateRequest and StopReplicaRequest. This increases the burden on
> controller which can be the performance bottleneck.
>
> 3) Less efficient use of physical resource on the machine. The number of
> socket on each machine will increase by 10X. The number of connection
> between any two machine will increase by 100X.
>
> 4) Less efficient way to manage memory and quota.
>
> 5) Rebalance between disks/brokers on the same machine will be less efficient
> and less flexible. A broker has to read data from another broker on the same
> machine via socket. It is also harder to do automatic load balance between
> disks on the same machine in the future.
>
> I will put this and the explanation in the rejected alternative section. I
> have a few questions:
>
> - Can you explain why this solution can help avoid scalability bottleneck?
> I actually think it will exacerbate the scalability problem due to 2)
> above.
> - Why can we push more RPC with this solution?
> - It is true that a garbage collection in one broker would not affect
> others. But that is after every broker only uses 1/10 of the memory. Can we
> be sure that this will actually help performance?
>
> Thanks,
> Dong
>
> On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe  wrote:
>
> > Hi Dong,
> >
> > Thanks for the writeup!  It's very interesting.
> >
> > I apologize in advance if this has been discussed somewhere else.  But I
> > am curious if you have considered the solution of running multiple
> > brokers per node.  Clearly there is a memory overhead with this solution
> > because of the fixed cost of starting multiple JVMs.  However, running
> > multiple JVMs would help avoid scalability bottlenecks.  You could
> > probably push more RPCs per second, for example.  A garbage collection
> > in one broker would not affect the others.  It would be interesting to
> > see this considered in the "alternate designs" design, even if you end
> > up deciding it's not the way to go.
> >
> > best,
> > Colin
> >
> >
> > On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
> > > Hi all,
> > >
> > > We created KIP-112: Handle disk failure for JBOD. Please find the KIP
> > > wiki
> > > in the link https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 112%3A+Handle+disk+failure+for+JBOD.
> > >
> > > This KIP is related to KIP-113
> > >  > 113%3A+Support+replicas+movement+between+log+directories>:
> > > Support replicas movement between log directories. They are needed in
> > > order
> > > to support JBOD in Kafka. Please help review the KIP. You feedback is
> > > appreciated!
> > >
> > > Thanks,
> > > Dong
> >
>


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-02 Thread Dong Lin
Hey Eno,

I forgot that. Sure, that works for us.

Thanks,
Dong

On Thu, Feb 2, 2017 at 2:03 AM, Eno Thereska  wrote:

> Hi Dong,
>
> The KIP meetings are traditionally held at 11am. Would that also work? So
> Tuesday 7th at 11am?
>
> Thanks
> Eno
>
> > On 2 Feb 2017, at 02:53, Dong Lin  wrote:
> >
> > Hey Eno, Colin,
> >
> > Would you have time next Tuesday morning to discuss the KIP? How about
> 10 -
> > 11 am?
> >
> > To make best use of our time, can you please invite one or more committer
> > from Confluent to join the meeting? I hope the KIP can receive one or
> more
> > +1 from committer at Confluent if we have no concern the KIP after the
> KIP
> > meeting.
> >
> > In the meeting time, please feel free to provide comment in the thread so
> > that discussion in the KIP meeting can be more efficient.
> >
> > Thanks,
> > Dong
> >
> > On Wed, Feb 1, 2017 at 5:43 PM, Dong Lin  wrote:
> >
> >> Hey Colin,
> >>
> >> Thanks much for the comment. Please see my reply inline.
> >>
> >> On Wed, Feb 1, 2017 at 1:54 PM, Colin McCabe 
> wrote:
> >>
> >>> On Wed, Feb 1, 2017, at 11:35, Dong Lin wrote:
>  Hey Grant, Colin,
> 
>  My bad, I misunderstood Grant's suggestion initially. Indeed this is a
>  very
>  interesting idea to just wait for replica.max.lag.ms for the replica
> on
>  the
>  bad disk to drop out of ISR instead of having broker actively
> reporting
>  this to the controller.
> 
>  I have several concerns with this approach.
> 
>  - Broker needs to maintain persist the information of all partitions
> >>> that
>  it has created, in a file in each disk. This is needed for broker to
> >>> know
>  the replicas already created on the bad disks that it can not access.
> If
>  we
>  don't do it, then controller sends LeaderAndIsrRequest to a broker to
>  become follower for a partition on the bad disk, the broker will
> create
>  partition on a good disk. The good disks may be overloaded as
> cascading
>  effect.
> 
>  While it is possible to let broker keep track of the replicas that it
> >>> has
>  created, I think it is less clean than the approach in the current KIP
>  for
>  reason provided in the rejective alternative section.
> 
>  - Change is needed in the controller logic to handle failure to make a
>  broker as leader when controller receives LeaderAndIsrResponse.
>  Otherwise,
>  things go wrong if partition on the bad disk is requested to become
>  leader.
>  As of now, broker doesn't handle error in LeaderAndIsrResponse.
> 
>  - We still need tools and mechanism for administrator to know whether
> a
>  replica is offline due to bad disk. I am worried that asking
>  administrator
>  to log into a machine and get this information in the log is not
> >>> scalable
>  when the broker number is large. Although each company can develop
> their
>  internal tools to get this information, it is a waste of developer
> time
>  to
>  reinvent the wheel. Reading this information in the log also seems
> less
>  reliable then getting it from Kafka request/response.
> 
>  I guess the goal of this alternative approach is to avoid making major
>  change in Kafka at the cost of increased disk failure discovery time
> >>> etc.
>  But I think the changes required for fixing the problems above won't
> be
>  much less.
> >>>
> >>> Thanks for the thoughtful replies, Dong L.
> >>>
> >>> Instead of having an "offline" state, how about having a "creating"
> >>> state for replicas and a "created" state?  Then if a replica was not
> >>> accessible on any disk, but still in "created" state, the broker could
> >>> know that something had gone wrong.  This also would catch issues like
> >>> the broker being started without all log directories configured, or
> >>> disks not being correctly mounted at the expected mount points, leading
> >>> to empty directories.
> >>>
> >>
> >> Indeed, we need to have an additional state per replica to solve this
> >> problem. The current KIP design addresses the problem by putting the
> >> "created" state in zookeeper, as you can see in the public interface
> change
> >> of the KIP. Are you suggesting to solve the problem by storing this
> >> information in local disk of the broker instead of zookeeper? I have two
> >> concerns with this approach:
> >>
> >> - It requires broker to keep track of the replicas it has created. This
> >> solution will split the task of determining offline replicas among
> >> controller and brokers as opposed to the current Kafka design, where the
> >> controller determines states of replicas and propagate this information
> to
> >> brokers. We think it is less error-prone to still let controller be the
> >> only entity that maintains metadata (e.g. replica state) of Kafka
> cluster.
> >>
> >> - If we 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-02 Thread Eno Thereska
Hi Dong,

The KIP meetings are traditionally held at 11am. Would that also work? So 
Tuesday 7th at 11am?

Thanks
Eno

> On 2 Feb 2017, at 02:53, Dong Lin  wrote:
> 
> Hey Eno, Colin,
> 
> Would you have time next Tuesday morning to discuss the KIP? How about 10 -
> 11 am?
> 
> To make best use of our time, can you please invite one or more committer
> from Confluent to join the meeting? I hope the KIP can receive one or more
> +1 from committer at Confluent if we have no concern the KIP after the KIP
> meeting.
> 
> In the meeting time, please feel free to provide comment in the thread so
> that discussion in the KIP meeting can be more efficient.
> 
> Thanks,
> Dong
> 
> On Wed, Feb 1, 2017 at 5:43 PM, Dong Lin  wrote:
> 
>> Hey Colin,
>> 
>> Thanks much for the comment. Please see my reply inline.
>> 
>> On Wed, Feb 1, 2017 at 1:54 PM, Colin McCabe  wrote:
>> 
>>> On Wed, Feb 1, 2017, at 11:35, Dong Lin wrote:
 Hey Grant, Colin,
 
 My bad, I misunderstood Grant's suggestion initially. Indeed this is a
 very
 interesting idea to just wait for replica.max.lag.ms for the replica on
 the
 bad disk to drop out of ISR instead of having broker actively reporting
 this to the controller.
 
 I have several concerns with this approach.
 
 - Broker needs to maintain persist the information of all partitions
>>> that
 it has created, in a file in each disk. This is needed for broker to
>>> know
 the replicas already created on the bad disks that it can not access. If
 we
 don't do it, then controller sends LeaderAndIsrRequest to a broker to
 become follower for a partition on the bad disk, the broker will create
 partition on a good disk. The good disks may be overloaded as cascading
 effect.
 
 While it is possible to let broker keep track of the replicas that it
>>> has
 created, I think it is less clean than the approach in the current KIP
 for
 reason provided in the rejective alternative section.
 
 - Change is needed in the controller logic to handle failure to make a
 broker as leader when controller receives LeaderAndIsrResponse.
 Otherwise,
 things go wrong if partition on the bad disk is requested to become
 leader.
 As of now, broker doesn't handle error in LeaderAndIsrResponse.
 
 - We still need tools and mechanism for administrator to know whether a
 replica is offline due to bad disk. I am worried that asking
 administrator
 to log into a machine and get this information in the log is not
>>> scalable
 when the broker number is large. Although each company can develop their
 internal tools to get this information, it is a waste of developer time
 to
 reinvent the wheel. Reading this information in the log also seems less
 reliable then getting it from Kafka request/response.
 
 I guess the goal of this alternative approach is to avoid making major
 change in Kafka at the cost of increased disk failure discovery time
>>> etc.
 But I think the changes required for fixing the problems above won't be
 much less.
>>> 
>>> Thanks for the thoughtful replies, Dong L.
>>> 
>>> Instead of having an "offline" state, how about having a "creating"
>>> state for replicas and a "created" state?  Then if a replica was not
>>> accessible on any disk, but still in "created" state, the broker could
>>> know that something had gone wrong.  This also would catch issues like
>>> the broker being started without all log directories configured, or
>>> disks not being correctly mounted at the expected mount points, leading
>>> to empty directories.
>>> 
>> 
>> Indeed, we need to have an additional state per replica to solve this
>> problem. The current KIP design addresses the problem by putting the
>> "created" state in zookeeper, as you can see in the public interface change
>> of the KIP. Are you suggesting to solve the problem by storing this
>> information in local disk of the broker instead of zookeeper? I have two
>> concerns with this approach:
>> 
>> - It requires broker to keep track of the replicas it has created. This
>> solution will split the task of determining offline replicas among
>> controller and brokers as opposed to the current Kafka design, where the
>> controller determines states of replicas and propagate this information to
>> brokers. We think it is less error-prone to still let controller be the
>> only entity that maintains metadata (e.g. replica state) of Kafka cluster.
>> 
>> - If we store this information in local disk, then we need to have
>> additional request/response protocol in order to request broker to reset
>> this information, e.g. after a bad disk is replaced with good disk, so that
>> the replica can be re-created on a good disk. Things would be easier if we
>> store this information in zookeeper.
>> 
>> 
>>> 
 
 To answer Colin's 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-01 Thread Dong Lin
Sorry for the typo. I meant that before the KIP meeting, please feel free to
provide comment in this email thread so that discussion in the KIP meeting
can be more efficient.

On Wed, Feb 1, 2017 at 6:53 PM, Dong Lin  wrote:

> Hey Eno, Colin,
>
> Would you have time next Tuesday morning to discuss the KIP? How about 10
> - 11 am?
>
> To make best use of our time, can you please invite one or more committer
> from Confluent to join the meeting? I hope the KIP can receive one or more
> +1 from committer at Confluent if we have no concern the KIP after the KIP
> meeting.
>
> In the meeting time, please feel free to provide comment in the thread so
> that discussion in the KIP meeting can be more efficient.
>
> Thanks,
> Dong
>
> On Wed, Feb 1, 2017 at 5:43 PM, Dong Lin  wrote:
>
>> Hey Colin,
>>
>> Thanks much for the comment. Please see my reply inline.
>>
>> On Wed, Feb 1, 2017 at 1:54 PM, Colin McCabe  wrote:
>>
>>> On Wed, Feb 1, 2017, at 11:35, Dong Lin wrote:
>>> > Hey Grant, Colin,
>>> >
>>> > My bad, I misunderstood Grant's suggestion initially. Indeed this is a
>>> > very
>>> > interesting idea to just wait for replica.max.lag.ms for the replica
>>> on
>>> > the
>>> > bad disk to drop out of ISR instead of having broker actively reporting
>>> > this to the controller.
>>> >
>>> > I have several concerns with this approach.
>>> >
>>> > - Broker needs to maintain persist the information of all partitions
>>> that
>>> > it has created, in a file in each disk. This is needed for broker to
>>> know
>>> > the replicas already created on the bad disks that it can not access.
>>> If
>>> > we
>>> > don't do it, then controller sends LeaderAndIsrRequest to a broker to
>>> > become follower for a partition on the bad disk, the broker will create
>>> > partition on a good disk. The good disks may be overloaded as cascading
>>> > effect.
>>> >
>>> > While it is possible to let broker keep track of the replicas that it
>>> has
>>> > created, I think it is less clean than the approach in the current KIP
>>> > for
>>> > reason provided in the rejective alternative section.
>>> >
>>> > - Change is needed in the controller logic to handle failure to make a
>>> > broker as leader when controller receives LeaderAndIsrResponse.
>>> > Otherwise,
>>> > things go wrong if partition on the bad disk is requested to become
>>> > leader.
>>> > As of now, broker doesn't handle error in LeaderAndIsrResponse.
>>> >
>>> > - We still need tools and mechanism for administrator to know whether a
>>> > replica is offline due to bad disk. I am worried that asking
>>> > administrator
>>> > to log into a machine and get this information in the log is not
>>> scalable
>>> > when the broker number is large. Although each company can develop
>>> their
>>> > internal tools to get this information, it is a waste of developer time
>>> > to
>>> > reinvent the wheel. Reading this information in the log also seems less
>>> > reliable then getting it from Kafka request/response.
>>> >
>>> > I guess the goal of this alternative approach is to avoid making major
>>> > change in Kafka at the cost of increased disk failure discovery time
>>> etc.
>>> > But I think the changes required for fixing the problems above won't be
>>> > much less.
>>>
>>> Thanks for the thoughtful replies, Dong L.
>>>
>>> Instead of having an "offline" state, how about having a "creating"
>>> state for replicas and a "created" state?  Then if a replica was not
>>> accessible on any disk, but still in "created" state, the broker could
>>> know that something had gone wrong.  This also would catch issues like
>>> the broker being started without all log directories configured, or
>>> disks not being correctly mounted at the expected mount points, leading
>>> to empty directories.
>>>
>>
>> Indeed, we need to have an additional state per replica to solve this
>> problem. The current KIP design addresses the problem by putting the
>> "created" state in zookeeper, as you can see in the public interface change
>> of the KIP. Are you suggesting to solve the problem by storing this
>> information in local disk of the broker instead of zookeeper? I have two
>> concerns with this approach:
>>
>> - It requires broker to keep track of the replicas it has created. This
>> solution will split the task of determining offline replicas among
>> controller and brokers as opposed to the current Kafka design, where the
>> controller determines states of replicas and propagate this information to
>> brokers. We think it is less error-prone to still let controller be the
>> only entity that maintains metadata (e.g. replica state) of Kafka cluster.
>>
>> - If we store this information in local disk, then we need to have
>> additional request/response protocol in order to request broker to reset
>> this information, e.g. after a bad disk is replaced with good disk, so that
>> the replica can be re-created on a good disk. Things would be 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-01 Thread Dong Lin
Hey Eno, Colin,

Would you have time next Tuesday morning to discuss the KIP? How about 10 -
11 am?

To make the best use of our time, can you please invite one or more committers
from Confluent to join the meeting? I hope the KIP can receive one or more
+1 from committers at Confluent if there is no concern about the KIP after the
KIP meeting.

In the meeting time, please feel free to provide comment in the thread so
that discussion in the KIP meeting can be more efficient.

Thanks,
Dong

On Wed, Feb 1, 2017 at 5:43 PM, Dong Lin  wrote:

> Hey Colin,
>
> Thanks much for the comment. Please see my reply inline.
>
> On Wed, Feb 1, 2017 at 1:54 PM, Colin McCabe  wrote:
>
>> On Wed, Feb 1, 2017, at 11:35, Dong Lin wrote:
>> > Hey Grant, Colin,
>> >
>> > My bad, I misunderstood Grant's suggestion initially. Indeed this is a
>> > very
>> > interesting idea to just wait for replica.max.lag.ms for the replica on
>> > the
>> > bad disk to drop out of ISR instead of having broker actively reporting
>> > this to the controller.
>> >
>> > I have several concerns with this approach.
>> >
>> > - Broker needs to maintain persist the information of all partitions
>> that
>> > it has created, in a file in each disk. This is needed for broker to
>> know
>> > the replicas already created on the bad disks that it can not access. If
>> > we
>> > don't do it, then controller sends LeaderAndIsrRequest to a broker to
>> > become follower for a partition on the bad disk, the broker will create
>> > partition on a good disk. The good disks may be overloaded as cascading
>> > effect.
>> >
>> > While it is possible to let broker keep track of the replicas that it
>> has
>> > created, I think it is less clean than the approach in the current KIP
>> > for
>> > reason provided in the rejective alternative section.
>> >
>> > - Change is needed in the controller logic to handle failure to make a
>> > broker as leader when controller receives LeaderAndIsrResponse.
>> > Otherwise,
>> > things go wrong if partition on the bad disk is requested to become
>> > leader.
>> > As of now, broker doesn't handle error in LeaderAndIsrResponse.
>> >
>> > - We still need tools and mechanism for administrator to know whether a
>> > replica is offline due to bad disk. I am worried that asking
>> > administrator
>> > to log into a machine and get this information in the log is not
>> scalable
>> > when the broker number is large. Although each company can develop their
>> > internal tools to get this information, it is a waste of developer time
>> > to
>> > reinvent the wheel. Reading this information in the log also seems less
>> > reliable then getting it from Kafka request/response.
>> >
>> > I guess the goal of this alternative approach is to avoid making major
>> > change in Kafka at the cost of increased disk failure discovery time
>> etc.
>> > But I think the changes required for fixing the problems above won't be
>> > much less.
>>
>> Thanks for the thoughtful replies, Dong L.
>>
>> Instead of having an "offline" state, how about having a "creating"
>> state for replicas and a "created" state?  Then if a replica was not
>> accessible on any disk, but still in "created" state, the broker could
>> know that something had gone wrong.  This also would catch issues like
>> the broker being started without all log directories configured, or
>> disks not being correctly mounted at the expected mount points, leading
>> to empty directories.
>>
>
> Indeed, we need to have an additional state per replica to solve this
> problem. The current KIP design addresses the problem by putting the
> "created" state in zookeeper, as you can see in the public interface change
> of the KIP. Are you suggesting to solve the problem by storing this
> information in local disk of the broker instead of zookeeper? I have two
> concerns with this approach:
>
> - It requires broker to keep track of the replicas it has created. This
> solution will split the task of determining offline replicas among
> controller and brokers as opposed to the current Kafka design, where the
> controller determines states of replicas and propagate this information to
> brokers. We think it is less error-prone to still let controller be the
> only entity that maintains metadata (e.g. replica state) of Kafka cluster.
>
> - If we store this information in local disk, then we need to have
> additional request/response protocol in order to request broker to reset
> this information, e.g. after a bad disk is replaced with good disk, so that
> the replica can be re-created on a good disk. Things would be easier if we
> store this information in zookeeper.
>
>
>>
>> >
>> > To answer Colin's questions:
>> >
>> > - There is no action required on the side of administrator in case of
>> log
>> > directory failure.
>> >
>> > - Broker itself is going to discover log directory failure and declare
>> > offline replicas. Broker doesn't explicitly declare log directory
>> > failure.
>> 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-01 Thread Dong Lin
Hey Colin,

Thanks much for the comment. Please see my reply inline.

On Wed, Feb 1, 2017 at 1:54 PM, Colin McCabe  wrote:

> On Wed, Feb 1, 2017, at 11:35, Dong Lin wrote:
> > Hey Grant, Colin,
> >
> > My bad, I misunderstood Grant's suggestion initially. Indeed this is a
> > very
> > interesting idea to just wait for replica.max.lag.ms for the replica on
> > the
> > bad disk to drop out of ISR instead of having broker actively reporting
> > this to the controller.
> >
> > I have several concerns with this approach.
> >
> > - Broker needs to maintain persist the information of all partitions that
> > it has created, in a file in each disk. This is needed for broker to know
> > the replicas already created on the bad disks that it can not access. If
> > we
> > don't do it, then controller sends LeaderAndIsrRequest to a broker to
> > become follower for a partition on the bad disk, the broker will create
> > partition on a good disk. The good disks may be overloaded as cascading
> > effect.
> >
> > While it is possible to let broker keep track of the replicas that it has
> > created, I think it is less clean than the approach in the current KIP
> > for
> > reason provided in the rejective alternative section.
> >
> > - Change is needed in the controller logic to handle failure to make a
> > broker as leader when controller receives LeaderAndIsrResponse.
> > Otherwise,
> > things go wrong if partition on the bad disk is requested to become
> > leader.
> > As of now, broker doesn't handle error in LeaderAndIsrResponse.
> >
> > - We still need tools and mechanism for administrator to know whether a
> > replica is offline due to bad disk. I am worried that asking
> > administrator
> > to log into a machine and get this information in the log is not scalable
> > when the broker number is large. Although each company can develop their
> > internal tools to get this information, it is a waste of developer time
> > to
> > reinvent the wheel. Reading this information in the log also seems less
> > reliable then getting it from Kafka request/response.
> >
> > I guess the goal of this alternative approach is to avoid making major
> > change in Kafka at the cost of increased disk failure discovery time etc.
> > But I think the changes required for fixing the problems above won't be
> > much less.
>
> Thanks for the thoughtful replies, Dong L.
>
> Instead of having an "offline" state, how about having a "creating"
> state for replicas and a "created" state?  Then if a replica was not
> accessible on any disk, but still in "created" state, the broker could
> know that something had gone wrong.  This also would catch issues like
> the broker being started without all log directories configured, or
> disks not being correctly mounted at the expected mount points, leading
> to empty directories.
>

Indeed, we need to have an additional state per replica to solve this
problem. The current KIP design addresses the problem by putting the
"created" state in zookeeper, as you can see in the public interface change
of the KIP. Are you suggesting to solve the problem by storing this
information in local disk of the broker instead of zookeeper? I have two
concerns with this approach:

- It requires the broker to keep track of the replicas it has created. This
solution would split the task of determining offline replicas between the
controller and the brokers, as opposed to the current Kafka design, where the
controller determines the states of replicas and propagates this information to
brokers. We think it is less error-prone to still let the controller be the
only entity that maintains metadata (e.g. replica state) of the Kafka cluster.

- If we store this information on local disk, then we need an
additional request/response protocol in order to ask the broker to reset
this information, e.g. after a bad disk is replaced with a good disk, so that
the replica can be re-created on a good disk. Things would be easier if we
store this information in zookeeper.


>
> >
> > To answer Colin's questions:
> >
> > - There is no action required on the side of administrator in case of log
> > directory failure.
> >
> > - Broker itself is going to discover log directory failure and declare
> > offline replicas. Broker doesn't explicitly declare log directory
> > failure.
> > But administrator can learn from the MetadataResponse that replica is
> > offline due to disk failure, i.e. if replica is offline but broker is
> > online.
>
> Can you expand on this a little bit?  It sounds like you are considering
> dealing with failures on a replica-by-replica basis, rather than a
> disk-by-disk basis.  But it's disks that fail, not really individual
> files or directories on disks.  This decision interacts poorly with the
> lack of a periodic scanner.  It's easy to imagine a scenario where an
> infrequently used replica sits on a dead disk for a long time without us
> declaring it dead.
>

Sure. The broker will fail deal with failures on a disk-by-disk 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-01 Thread Colin McCabe
On Wed, Feb 1, 2017, at 11:35, Dong Lin wrote:
> Hey Grant, Colin,
> 
> My bad, I misunderstood Grant's suggestion initially. Indeed this is a
> very
> interesting idea to just wait for replica.max.lag.ms for the replica on
> the
> bad disk to drop out of ISR instead of having broker actively reporting
> this to the controller.
> 
> I have several concerns with this approach.
> 
> - Broker needs to maintain persist the information of all partitions that
> it has created, in a file in each disk. This is needed for broker to know
> the replicas already created on the bad disks that it can not access. If
> we
> don't do it, then controller sends LeaderAndIsrRequest to a broker to
> become follower for a partition on the bad disk, the broker will create
> partition on a good disk. The good disks may be overloaded as cascading
> effect.
> 
> While it is possible to let broker keep track of the replicas that it has
> created, I think it is less clean than the approach in the current KIP
> for
> reason provided in the rejective alternative section.
> 
> - Change is needed in the controller logic to handle failure to make a
> broker as leader when controller receives LeaderAndIsrResponse.
> Otherwise,
> things go wrong if partition on the bad disk is requested to become
> leader.
> As of now, broker doesn't handle error in LeaderAndIsrResponse.
> 
> - We still need tools and mechanism for administrator to know whether a
> replica is offline due to bad disk. I am worried that asking
> administrator
> to log into a machine and get this information in the log is not scalable
> when the broker number is large. Although each company can develop their
> internal tools to get this information, it is a waste of developer time
> to
> reinvent the wheel. Reading this information in the log also seems less
> reliable then getting it from Kafka request/response.
> 
> I guess the goal of this alternative approach is to avoid making major
> change in Kafka at the cost of increased disk failure discovery time etc.
> But I think the changes required for fixing the problems above won't be
> much less.

Thanks for the thoughtful replies, Dong L.

Instead of having an "offline" state, how about having a "creating"
state for replicas and a "created" state?  Then if a replica was not
accessible on any disk, but still in "created" state, the broker could
know that something had gone wrong.  This also would catch issues like
the broker being started without all log directories configured, or
disks not being correctly mounted at the expected mount points, leading
to empty directories.
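
To sketch what I mean (Java, names made up purely for illustration):

    public final class ReplicaStartupCheck {
        public enum CreationState { CREATING, CREATED }

        // Illustrative only: if the controller believes the replica is CREATED
        // but the broker cannot find it in any configured log directory, then a
        // disk is bad, a mount point is missing, or log.dirs is misconfigured.
        public static boolean looksLikeFailure(CreationState state,
                                               boolean foundOnAnyLogDir) {
            return state == CreationState.CREATED && !foundOnAnyLogDir;
        }
    }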

> 
> To answer Colin's questions:
> 
> - There is no action required on the side of administrator in case of log
> directory failure.
> 
> - Broker itself is going to discover log directory failure and declare
> offline replicas. Broker doesn't explicitly declare log directory
> failure.
> But administrator can learn from the MetadataResponse that replica is
> offline due to disk failure, i.e. if replica is offline but broker is
> online.

Can you expand on this a little bit?  It sounds like you are considering
dealing with failures on a replica-by-replica basis, rather than a
disk-by-disk basis.  But it's disks that fail, not really individual
files or directories on disks.  This decision interacts poorly with the
lack of a periodic scanner.  It's easy to imagine a scenario where an
infrequently used replica sits on a dead disk for a long time without us
declaring it dead.

> 
> - This KIP does not handle cases where a few disks on a broker are full,
> but the others have space. If a disk is full and can not be written then
> the disk is considered to have failed. The imbalance across disks is an
> existing problem and will be handled in KIP-113.

OK.

> 
> - This KIP does not do a disk scanner that will periodically check for
> error conditions. It doesn't handle any performance degradation of disks.
> We wait for a failure to happen before declaring a disk bad.
> 
> Yes, this KIP requires us to fix cases in the code where we are
> suppressing
> disk errors or ignoring their root cause. But decision of which Exception
> should be considered disk failure and how to handle each of these are
> more
> like implementation detail. I hope we can focus on the detail and high
> level idea of this KIP and only worry about specific exception when the
> patch is being reviewed.

Hmm... I think we should discuss how we are going to harden the code
against disk failures, and verify that it has been hardened.  Or maybe
we could do this in a follow-up KIP.

> After all we probably only know the list of
> exceptions and ways to handle them when we start to implement the KIP.
> And
> we need to improve this list over time as we discover various failure in
> the deployment.
> 
> 
> Hey Eno,
> 
> Sure thing. Thanks for offering time to have a KIP meeting to discuss
> this.
> I will ask other Kafka developer at LinkedIn about their availability.

Yeah, it would be nice to 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-01 Thread Dong Lin
Hey Grant, Colin,

My bad, I misunderstood Grant's suggestion initially. Indeed this is a very
interesting idea to just wait for replica.max.lag.ms for the replica on the
bad disk to drop out of the ISR instead of having the broker actively report
this to the controller.

I have several concerns with this approach.

- The broker needs to maintain and persist the information of all partitions
that it has created, in a file on each disk. This is needed for the broker to
know which replicas were already created on the bad disks that it can not
access. If we don't do this, then when the controller sends a
LeaderAndIsrRequest to a broker to become follower for a partition on the bad
disk, the broker will create the partition on a good disk. The good disks may
be overloaded as a cascading effect.

While it is possible to let the broker keep track of the replicas that it has
created, I think it is less clean than the approach in the current KIP, for
the reasons provided in the rejected alternatives section.

- A change is needed in the controller logic to handle the failure to make a
broker the leader when the controller receives the LeaderAndIsrResponse.
Otherwise, things go wrong if a partition on the bad disk is requested to
become leader. As of now, errors in the LeaderAndIsrResponse are not handled.

- We still need tools and a mechanism for administrators to know whether a
replica is offline due to a bad disk. I am worried that asking administrators
to log into a machine and dig this information out of the logs is not scalable
when the broker count is large. Although each company can develop their own
internal tools to get this information, it is a waste of developer time to
reinvent the wheel. Reading this information from the logs also seems less
reliable than getting it from a Kafka request/response.

I guess the goal of this alternative approach is to avoid making major
changes in Kafka at the cost of increased disk failure discovery time etc.
But I think the changes required for fixing the problems above won't be
much less.

To answer Colin's questions:

- There is no action required on the side of administrator in case of log
directory failure.

- Broker itself is going to discover log directory failure and declare
offline replicas. Broker doesn't explicitly declare log directory failure.
But administrator can learn from the MetadataResponse that replica is
offline due to disk failure, i.e. if replica is offline but broker is
online.

- This KIP does not handle cases where a few disks on a broker are full,
but the others have space. If a disk is full and can not be written then
the disk is considered to have failed. The imbalance across disks is an
existing problem and will be handled in KIP-113.

- This KIP does not add a disk scanner that periodically checks for error
conditions, and it doesn't handle performance degradation of disks. We wait
for a failure to happen before declaring a disk bad.
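
To make the second bullet concrete, a rough sketch of how an administrator
tool could spot this condition using the Java AdminClient (which appeared
around the same time as this work; the topic name and bootstrap address are
placeholders, and a replica missing from the ISR on a live broker is only a
hint, since it may simply be lagging rather than offline):

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.Node;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class OfflineReplicaCheck {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
          Set<Integer> liveBrokers = new HashSet<>();
          for (Node n : admin.describeCluster().nodes().get()) {
            liveBrokers.add(n.id());
          }
          TopicDescription desc = admin.describeTopics(Collections.singleton("my-topic"))
              .all().get().get("my-topic");
          for (TopicPartitionInfo p : desc.partitions()) {
            for (Node replica : p.replicas()) {
              boolean inIsr = p.isr().stream().anyMatch(n -> n.id() == replica.id());
              // Out of the ISR while its broker is alive: a candidate for
              // "offline replica on a bad log directory" (it may also just be lagging).
              if (!inIsr && liveBrokers.contains(replica.id())) {
                System.out.printf("partition %d: replica %d looks offline on a live broker%n",
                    p.partition(), replica.id());
              }
            }
          }
        }
      }
    }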

Yes, this KIP requires us to fix cases in the code where we are suppressing
disk errors or ignoring their root cause. But the decision of which
exceptions should be considered disk failures, and how to handle each of
them, is more of an implementation detail. I hope we can focus on the design
details and high-level idea of this KIP and only worry about specific
exceptions when the patch is being reviewed. After all, we probably only know
the list of exceptions and ways to handle them once we start to implement the
KIP, and we need to improve this list over time as we discover various
failures in deployment.
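
Purely as an illustration of what such a list might grow into (the helper and
its classification below are hypothetical, not from the KIP), one first cut
is to map IOException subtypes to "take the log directory offline" vs. "not a
disk problem":

    import java.io.IOException;
    import java.nio.file.AccessDeniedException;
    import java.nio.file.FileSystemException;
    import java.nio.file.NoSuchFileException;

    // Hypothetical helper, not from the KIP: one possible first cut at deciding
    // whether an I/O error should take the whole log directory offline.
    final class DiskErrorClassifier {
      static boolean shouldMarkLogDirOffline(IOException e) {
        if (e instanceof NoSuchFileException) {
          return false;   // a missing file is not evidence of a bad disk
        }
        if (e instanceof AccessDeniedException) {
          return false;   // misconfiguration rather than hardware failure
        }
        if (e instanceof FileSystemException) {
          return true;    // e.g. read-only filesystem, low-level I/O error
        }
        return true;      // unknown IOException: be conservative, take the dir offline
      }
    }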


Hey Eno,

Sure thing. Thanks for offering your time to have a KIP meeting to discuss this.
I will ask other Kafka developers at LinkedIn about their availability.

Thanks,
Dong


On Wed, Feb 1, 2017 at 10:37 AM, Eno Thereska 
wrote:

> Hi Dong,
>
> Would it make sense to do a discussion over video/voice about this? I
> think it's sufficiently complex that we can probably make quicker progress
> that way? So shall we do a KIP meeting soon? I can do this week (Thu/Fri)
> or next week.
>
> Thanks
> Eno
> > On 1 Feb 2017, at 18:29, Colin McCabe  wrote:
> >
> > Hmm.  Maybe I misinterpreted, but I got the impression that Grant was
> > suggesting that we avoid introducing this concept of "offline replicas"
> > for now.  Is that feasible?
> >
> > What is the strategy for declaring a log directory bad?  Is it an
> > administrative action?  Or is the broker itself going to be responsible
> > for this?  How do we handle cases where a few disks on a broker are
> > full, but the others have space?
> >
> > Are we going to have a disk scanner that will periodically check for
> > error conditions (similar to the background checks that RAID controllers
> > do)?  Or will we wait for a failure to happen before declaring a disk
> > bad?
> >
> > It seems to me that if we want this to work well we will need to fix
> > cases in the code where we are suppressing disk errors or ignoring their
> > root cause.  For example, any place where we are using the old Java APIs
> > that just 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-01 Thread Eno Thereska
Hi Dong,

Would it make sense to do a discussion over video/voice about this? I think 
it's sufficiently complex that we can probably make quicker progress that way? 
So shall we do a KIP meeting soon? I can do this week (Thu/Fri) or next week.

Thanks
Eno
> On 1 Feb 2017, at 18:29, Colin McCabe  wrote:
> 
> Hmm.  Maybe I misinterpreted, but I got the impression that Grant was
> suggesting that we avoid introducing this concept of "offline replicas"
> for now.  Is that feasible?
> 
> What is the strategy for declaring a log directory bad?  Is it an
> administrative action?  Or is the broker itself going to be responsible
> for this?  How do we handle cases where a few disks on a broker are
> full, but the others have space?
> 
> Are we going to have a disk scanner that will periodically check for
> error conditions (similar to the background checks that RAID controllers
> do)?  Or will we wait for a failure to happen before declaring a disk
> bad?
> 
> It seems to me that if we want this to work well we will need to fix
> cases in the code where we are suppressing disk errors or ignoring their
> root cause.  For example, any place where we are using the old Java APIs
> that just return a boolean on failure will need to be fixed, since the
> failure could now be disk full, permission denied, or IOE, and we will
> need to handle those cases differently.  Also, we will need to harden
> the code against disk errors.  Formerly it was OK to just crash on a
> disk error; now it is not.  It would be nice to see more in the test
> plan about injecting IOExceptions into disk handling code and verifying
> that we can handle it correctly.
> 
> regards,
> Colin
> 
> 
> On Wed, Feb 1, 2017, at 10:02, Dong Lin wrote:
>> Hey Grant,
>> 
>> Yes, this KIP does exactly what you described:)
>> 
>> Thanks,
>> Dong
>> 
>> On Wed, Feb 1, 2017 at 9:45 AM, Grant Henke  wrote:
>> 
>>> Hi Dong,
>>> 
>>> Thanks for putting this together.
>>> 
>>> Since we are discussing alternative/simplified options. Have you considered
>>> handling the disk failures broker side to prevent a crash, marking the disk
>>> as "bad" to that individual broker, and continuing as normal? I imagine the
>>> broker would then fall out of sync for the replicas hosted on the bad disk
>>> and the ISR would shrink. This would allow people using min.isr to keep
>>> their data safe and the cluster operators would see a shrink in many ISRs
>>> and hopefully an obvious log message leading to a quick fix. I haven't
>>> thought through this idea in depth though. So there could be some
>>> shortfalls.
>>> 
>>> Thanks,
>>> Grant
>>> 
>>> 
>>> 
>>> On Wed, Feb 1, 2017 at 11:21 AM, Dong Lin  wrote:
>>> 
 Hey Eno,
 
 Thanks much for the review.
 
 I think your suggestion is to split disks of a machine into multiple disk
 sets and run one broker per disk set. Yeah this is similar to Colin's
 suggestion of one-broker-per-disk, which we have evaluated at LinkedIn
>>> and
 considered it to be a good short term approach.
 
 As of now I don't think any of these approach is a better alternative in
 the long term. I will summarize these here. I have put these reasons in
>>> the
 KIP's motivation section and rejected alternative section. I am happy to
 discuss more and I would certainly like to use an alternative solution
>>> that
 is easier to do with better performance.
 
 - JBOD vs. RAID-10: if we switch from RAID-10 with replication-factoer=2
>>> to
 JBOD with replicatio-factor=3, we get 25% reduction in disk usage and
 doubles the tolerance of broker failure before data unavailability from 1
 to 2. This is pretty huge gain for any company that uses Kafka at large
 scale.
 
 - JBOD vs. one-broker-per-disk: The benefit of one-broker-per-disk is
>>> that
 no major code change is needed in Kafka. Among the disadvantage of
 one-broker-per-disk summarized in the KIP and previous email with Colin,
 the biggest one is the 15% throughput loss compared to JBOD and less
 flexibility to balance across disks. Further, it probably requires change
 to internal deployment tools at various companies to deal with
 one-broker-per-disk setup.
 
 - JBOD vs. RAID-0: This is the setup that used at Microsoft. The problem
>>> is
 that a broker becomes unavailable if any disk fail. Suppose
 replication-factor=2 and there are 10 disks per machine. Then the
 probability of of any message becomes unavailable due to disk failure
>>> with
 RAID-0 is 100X higher than that with JBOD.
 
 - JBOD vs. one-broker-per-few-disks: one-broker-per-few-disk is somewhere
 between one-broker-per-disk and RAID-0. So it carries an averaged
 disadvantages of these two approaches.
 
 To answer your question regarding, I think it is reasonable to mange disk
 in Kafka. By "managing disks" we mean the 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-01 Thread Colin McCabe
Hmm.  Maybe I misinterpreted, but I got the impression that Grant was
suggesting that we avoid introducing this concept of "offline replicas"
for now.  Is that feasible?

What is the strategy for declaring a log directory bad?  Is it an
administrative action?  Or is the broker itself going to be responsible
for this?  How do we handle cases where a few disks on a broker are
full, but the others have space?

Are we going to have a disk scanner that will periodically check for
error conditions (similar to the background checks that RAID controllers
do)?  Or will we wait for a failure to happen before declaring a disk
bad?

It seems to me that if we want this to work well we will need to fix
cases in the code where we are suppressing disk errors or ignoring their
root cause.  For example, any place where we are using the old Java APIs
that just return a boolean on failure will need to be fixed, since the
failure could now be disk full, permission denied, or IOE, and we will
need to handle those cases differently.  Also, we will need to harden
the code against disk errors.  Formerly it was OK to just crash on a
disk error; now it is not.  It would be nice to see more in the test
plan about injecting IOExceptions into disk handling code and verifying
that we can handle it correctly.
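
A minimal sketch of the kind of change described above, contrasting the old
boolean-returning java.io API with the java.nio.file API whose exceptions
preserve the root cause (the handling policy in the comments is illustrative
only, not the project's actual policy):

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.AccessDeniedException;
    import java.nio.file.Files;
    import java.nio.file.NoSuchFileException;
    import java.nio.file.Path;

    public class SegmentDeletion {
      // Old style: the boolean loses the root cause, so the caller cannot tell
      // "file already gone" from "permission problem" from "the disk is dying".
      static boolean deleteOldStyle(File f) {
        return f.delete();
      }

      // NIO style: distinct exception types let the caller react differently,
      // e.g. ignore a missing file but treat a raw I/O error as a failed log dir.
      static void deleteNewStyle(Path p) throws IOException {
        try {
          Files.delete(p);
        } catch (NoSuchFileException e) {
          // Already deleted: nothing to do.
        } catch (AccessDeniedException e) {
          // Likely misconfiguration, not a bad disk: surface it separately.
          throw e;
        } catch (IOException e) {
          // EIO, read-only filesystem, etc.: candidate for marking the log dir offline.
          throw e;
        }
      }
    }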

regards,
Colin


On Wed, Feb 1, 2017, at 10:02, Dong Lin wrote:
> Hey Grant,
> 
> Yes, this KIP does exactly what you described:)
> 
> Thanks,
> Dong
> 
> On Wed, Feb 1, 2017 at 9:45 AM, Grant Henke  wrote:
> 
> > Hi Dong,
> >
> > Thanks for putting this together.
> >
> > Since we are discussing alternative/simplified options. Have you considered
> > handling the disk failures broker side to prevent a crash, marking the disk
> > as "bad" to that individual broker, and continuing as normal? I imagine the
> > broker would then fall out of sync for the replicas hosted on the bad disk
> > and the ISR would shrink. This would allow people using min.isr to keep
> > their data safe and the cluster operators would see a shrink in many ISRs
> > and hopefully an obvious log message leading to a quick fix. I haven't
> > thought through this idea in depth though. So there could be some
> > shortfalls.
> >
> > Thanks,
> > Grant
> >
> >
> >
> > On Wed, Feb 1, 2017 at 11:21 AM, Dong Lin  wrote:
> >
> > > Hey Eno,
> > >
> > > Thanks much for the review.
> > >
> > > I think your suggestion is to split disks of a machine into multiple disk
> > > sets and run one broker per disk set. Yeah this is similar to Colin's
> > > suggestion of one-broker-per-disk, which we have evaluated at LinkedIn
> > and
> > > considered it to be a good short term approach.
> > >
> > > As of now I don't think any of these approach is a better alternative in
> > > the long term. I will summarize these here. I have put these reasons in
> > the
> > > KIP's motivation section and rejected alternative section. I am happy to
> > > discuss more and I would certainly like to use an alternative solution
> > that
> > > is easier to do with better performance.
> > >
> > > - JBOD vs. RAID-10: if we switch from RAID-10 with replication-factoer=2
> > to
> > > JBOD with replicatio-factor=3, we get 25% reduction in disk usage and
> > > doubles the tolerance of broker failure before data unavailability from 1
> > > to 2. This is pretty huge gain for any company that uses Kafka at large
> > > scale.
> > >
> > > - JBOD vs. one-broker-per-disk: The benefit of one-broker-per-disk is
> > that
> > > no major code change is needed in Kafka. Among the disadvantage of
> > > one-broker-per-disk summarized in the KIP and previous email with Colin,
> > > the biggest one is the 15% throughput loss compared to JBOD and less
> > > flexibility to balance across disks. Further, it probably requires change
> > > to internal deployment tools at various companies to deal with
> > > one-broker-per-disk setup.
> > >
> > > - JBOD vs. RAID-0: This is the setup that used at Microsoft. The problem
> > is
> > > that a broker becomes unavailable if any disk fail. Suppose
> > > replication-factor=2 and there are 10 disks per machine. Then the
> > > probability of of any message becomes unavailable due to disk failure
> > with
> > > RAID-0 is 100X higher than that with JBOD.
> > >
> > > - JBOD vs. one-broker-per-few-disks: one-broker-per-few-disk is somewhere
> > > between one-broker-per-disk and RAID-0. So it carries an averaged
> > > disadvantages of these two approaches.
> > >
> > > To answer your question regarding, I think it is reasonable to mange disk
> > > in Kafka. By "managing disks" we mean the management of assignment of
> > > replicas across disks. Here are my reasons in more detail:
> > >
> > > - I don't think this KIP is a big step change. By allowing user to
> > > configure Kafka to run multiple log directories or disks as of now, it is
> > > implicit that Kafka manages disks. It is just not a complete feature.
> > > Microsoft and probably other companies are 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-01 Thread Dong Lin
Hey Grant,

Yes, this KIP does exactly what you described:)

Thanks,
Dong

On Wed, Feb 1, 2017 at 9:45 AM, Grant Henke  wrote:

> Hi Dong,
>
> Thanks for putting this together.
>
> Since we are discussing alternative/simplified options. Have you considered
> handling the disk failures broker side to prevent a crash, marking the disk
> as "bad" to that individual broker, and continuing as normal? I imagine the
> broker would then fall out of sync for the replicas hosted on the bad disk
> and the ISR would shrink. This would allow people using min.isr to keep
> their data safe and the cluster operators would see a shrink in many ISRs
> and hopefully an obvious log message leading to a quick fix. I haven't
> thought through this idea in depth though. So there could be some
> shortfalls.
>
> Thanks,
> Grant
>
>
>
> On Wed, Feb 1, 2017 at 11:21 AM, Dong Lin  wrote:
>
> > Hey Eno,
> >
> > Thanks much for the review.
> >
> > I think your suggestion is to split disks of a machine into multiple disk
> > sets and run one broker per disk set. Yeah this is similar to Colin's
> > suggestion of one-broker-per-disk, which we have evaluated at LinkedIn
> and
> > considered it to be a good short term approach.
> >
> > As of now I don't think any of these approach is a better alternative in
> > the long term. I will summarize these here. I have put these reasons in
> the
> > KIP's motivation section and rejected alternative section. I am happy to
> > discuss more and I would certainly like to use an alternative solution
> that
> > is easier to do with better performance.
> >
> > - JBOD vs. RAID-10: if we switch from RAID-10 with replication-factoer=2
> to
> > JBOD with replicatio-factor=3, we get 25% reduction in disk usage and
> > doubles the tolerance of broker failure before data unavailability from 1
> > to 2. This is pretty huge gain for any company that uses Kafka at large
> > scale.
> >
> > - JBOD vs. one-broker-per-disk: The benefit of one-broker-per-disk is
> that
> > no major code change is needed in Kafka. Among the disadvantage of
> > one-broker-per-disk summarized in the KIP and previous email with Colin,
> > the biggest one is the 15% throughput loss compared to JBOD and less
> > flexibility to balance across disks. Further, it probably requires change
> > to internal deployment tools at various companies to deal with
> > one-broker-per-disk setup.
> >
> > - JBOD vs. RAID-0: This is the setup that used at Microsoft. The problem
> is
> > that a broker becomes unavailable if any disk fail. Suppose
> > replication-factor=2 and there are 10 disks per machine. Then the
> > probability of of any message becomes unavailable due to disk failure
> with
> > RAID-0 is 100X higher than that with JBOD.
> >
> > - JBOD vs. one-broker-per-few-disks: one-broker-per-few-disk is somewhere
> > between one-broker-per-disk and RAID-0. So it carries an averaged
> > disadvantages of these two approaches.
> >
> > To answer your question regarding, I think it is reasonable to mange disk
> > in Kafka. By "managing disks" we mean the management of assignment of
> > replicas across disks. Here are my reasons in more detail:
> >
> > - I don't think this KIP is a big step change. By allowing user to
> > configure Kafka to run multiple log directories or disks as of now, it is
> > implicit that Kafka manages disks. It is just not a complete feature.
> > Microsoft and probably other companies are using this feature under the
> > undesirable effect that a broker will fail any if any disk fail. It is
> good
> > to complete this feature.
> >
> > - I think it is reasonable to manage disk in Kafka. One of the most
> > important work that Kafka is doing is to determine the replica assignment
> > across brokers and make sure enough copies of a given replica is
> available.
> > I would argue that it is not much different than determining the replica
> > assignment across disk conceptually.
> >
> > - I would agree that this KIP is improve performance of Kafka at the cost
> > of more complexity inside Kafka, by switching from RAID-10 to JBOD. I
> would
> > argue that this is a right direction. If we can gain 20%+ performance by
> > managing NIC in Kafka as compared to existing approach and other
> > alternatives, I would say we should just do it. Such a gain in
> performance,
> > or equivalently reduction in cost, can save millions of dollars per year
> > for any company running Kafka at large scale.
> >
> > Thanks,
> > Dong
> >
> >
> > On Wed, Feb 1, 2017 at 5:41 AM, Eno Thereska 
> > wrote:
> >
> > > I'm coming somewhat late to the discussion, apologies for that.
> > >
> > > I'm worried about this proposal. It's moving Kafka to a world where it
> > > manages disks. So in a sense, the scope of the KIP is limited, but the
> > > direction it sets for Kafka is quite a big step change. Fundamentally
> > this
> > > is about balancing resources for a Kafka broker. This can be done by a
> > > 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-01 Thread Grant Henke
Hi Dong,

Thanks for putting this together.

Since we are discussing alternative/simplified options: have you considered
handling the disk failures broker-side to prevent a crash, marking the disk
as "bad" to that individual broker, and continuing as normal? I imagine the
broker would then fall out of sync for the replicas hosted on the bad disk
and the ISR would shrink. This would allow people using min.isr to keep
their data safe, and the cluster operators would see a shrink in many ISRs
and hopefully an obvious log message leading to a quick fix. I haven't
thought through this idea in depth though, so there could be some
shortfalls.
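
For reference, a rough sketch of the min.isr setup referred to above, i.e.
the min.insync.replicas topic config combined with acks=all on the producer
(the topic name and bootstrap address are placeholders, and the Java
AdminClient used here postdates this thread):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.Config;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.config.ConfigResource;

    public class MinIsrExample {
      public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(adminProps)) {
          // Require at least 2 in-sync replicas before a write is acknowledged.
          ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
          Config cfg = new Config(Collections.singleton(new ConfigEntry("min.insync.replicas", "2")));
          admin.alterConfigs(Collections.singletonMap(topic, cfg)).all().get();
        }

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for the full ISR
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringSerializer");
        try (Producer<String, String> producer = new KafkaProducer<>(producerProps)) {
          // If the ISR shrinks below 2 (e.g. a replica falls behind because its
          // disk went bad), this send fails with NotEnoughReplicasException
          // instead of acknowledging a write that only one copy has.
          producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
        }
      }
    }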

Thanks,
Grant



On Wed, Feb 1, 2017 at 11:21 AM, Dong Lin  wrote:

> Hey Eno,
>
> Thanks much for the review.
>
> I think your suggestion is to split disks of a machine into multiple disk
> sets and run one broker per disk set. Yeah this is similar to Colin's
> suggestion of one-broker-per-disk, which we have evaluated at LinkedIn and
> considered it to be a good short term approach.
>
> As of now I don't think any of these approach is a better alternative in
> the long term. I will summarize these here. I have put these reasons in the
> KIP's motivation section and rejected alternative section. I am happy to
> discuss more and I would certainly like to use an alternative solution that
> is easier to do with better performance.
>
> - JBOD vs. RAID-10: if we switch from RAID-10 with replication-factoer=2 to
> JBOD with replicatio-factor=3, we get 25% reduction in disk usage and
> doubles the tolerance of broker failure before data unavailability from 1
> to 2. This is pretty huge gain for any company that uses Kafka at large
> scale.
>
> - JBOD vs. one-broker-per-disk: The benefit of one-broker-per-disk is that
> no major code change is needed in Kafka. Among the disadvantage of
> one-broker-per-disk summarized in the KIP and previous email with Colin,
> the biggest one is the 15% throughput loss compared to JBOD and less
> flexibility to balance across disks. Further, it probably requires change
> to internal deployment tools at various companies to deal with
> one-broker-per-disk setup.
>
> - JBOD vs. RAID-0: This is the setup that used at Microsoft. The problem is
> that a broker becomes unavailable if any disk fail. Suppose
> replication-factor=2 and there are 10 disks per machine. Then the
> probability of of any message becomes unavailable due to disk failure with
> RAID-0 is 100X higher than that with JBOD.
>
> - JBOD vs. one-broker-per-few-disks: one-broker-per-few-disk is somewhere
> between one-broker-per-disk and RAID-0. So it carries an averaged
> disadvantages of these two approaches.
>
> To answer your question regarding, I think it is reasonable to mange disk
> in Kafka. By "managing disks" we mean the management of assignment of
> replicas across disks. Here are my reasons in more detail:
>
> - I don't think this KIP is a big step change. By allowing user to
> configure Kafka to run multiple log directories or disks as of now, it is
> implicit that Kafka manages disks. It is just not a complete feature.
> Microsoft and probably other companies are using this feature under the
> undesirable effect that a broker will fail any if any disk fail. It is good
> to complete this feature.
>
> - I think it is reasonable to manage disk in Kafka. One of the most
> important work that Kafka is doing is to determine the replica assignment
> across brokers and make sure enough copies of a given replica is available.
> I would argue that it is not much different than determining the replica
> assignment across disk conceptually.
>
> - I would agree that this KIP is improve performance of Kafka at the cost
> of more complexity inside Kafka, by switching from RAID-10 to JBOD. I would
> argue that this is a right direction. If we can gain 20%+ performance by
> managing NIC in Kafka as compared to existing approach and other
> alternatives, I would say we should just do it. Such a gain in performance,
> or equivalently reduction in cost, can save millions of dollars per year
> for any company running Kafka at large scale.
>
> Thanks,
> Dong
>
>
> On Wed, Feb 1, 2017 at 5:41 AM, Eno Thereska 
> wrote:
>
> > I'm coming somewhat late to the discussion, apologies for that.
> >
> > I'm worried about this proposal. It's moving Kafka to a world where it
> > manages disks. So in a sense, the scope of the KIP is limited, but the
> > direction it sets for Kafka is quite a big step change. Fundamentally
> this
> > is about balancing resources for a Kafka broker. This can be done by a
> > tool, rather than by changing Kafka. E.g., the tool would take a bunch of
> > disks together, create a volume over them and export that to a Kafka
> broker
> > (in addition to setting the memory limits for that broker or limiting
> other
> > resources). A different bunch of disks can then make up a second volume,
> > and be used by another Kafka broker. This is 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-01 Thread Dong Lin
Hey Eno,

Thanks much for the review.

I think your suggestion is to split the disks of a machine into multiple disk
sets and run one broker per disk set. Yeah, this is similar to Colin's
suggestion of one-broker-per-disk, which we have evaluated at LinkedIn and
consider to be a good short-term approach.

As of now I don't think any of these approaches is a better alternative in
the long term. I will summarize them here; I have also put these reasons in
the KIP's motivation and rejected alternatives sections. I am happy to
discuss more, and I would certainly use an alternative solution that is
easier to implement and has better performance.

- JBOD vs. RAID-10: if we switch from RAID-10 with replication-factor=2 to
JBOD with replication-factor=3, we get a 25% reduction in disk usage and
double the tolerance of broker failures before data unavailability, from 1
to 2 (the arithmetic is spelled out after this list). This is a pretty huge
gain for any company that uses Kafka at large scale.

- JBOD vs. one-broker-per-disk: the benefit of one-broker-per-disk is that
no major code change is needed in Kafka. Among the disadvantages of
one-broker-per-disk summarized in the KIP and in the previous email with
Colin, the biggest are the 15% throughput loss compared to JBOD and less
flexibility to balance across disks. Further, it probably requires changes
to internal deployment tools at various companies to deal with the
one-broker-per-disk setup.

- JBOD vs. RAID-0: this is the setup used at Microsoft. The problem is
that a broker becomes unavailable if any disk fails. Suppose
replication-factor=2 and there are 10 disks per machine. Then the
probability that any message becomes unavailable due to disk failure with
RAID-0 is 100X higher than with JBOD.

- JBOD vs. one-broker-per-few-disks: one-broker-per-few-disks is somewhere
between one-broker-per-disk and RAID-0, so it carries an average of the
disadvantages of these two approaches.
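
Spelling out the arithmetic behind the RAID-10 comparison above (assuming the
only change is the storage layout): with RAID-10 and replication-factor=2,
every logical byte is written twice by Kafka and each copy is mirrored once
more by RAID-1, i.e. 4 raw bytes; with JBOD and replication-factor=3 it is
written 3 times, a (4 - 3) / 4 = 25% reduction in disk usage. And since
replication-factor=2 leaves a single copy after one broker failure, only 1
failure is tolerated before data can become unavailable, while
replication-factor=3 tolerates 2.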

To answer your question, I think it is reasonable to manage disks in Kafka.
By "managing disks" we mean the management of the assignment of replicas
across disks. Here are my reasons in more detail:

- I don't think this KIP is a big step change. By already allowing users to
configure Kafka with multiple log directories or disks, it is implicit that
Kafka manages disks; it is just not a complete feature. Microsoft and
probably other companies are using this feature despite the undesirable
effect that a broker will fail if any disk fails. It is good to complete
this feature.

- I think it is reasonable to manage disks in Kafka. One of the most
important things Kafka does is determine the replica assignment across
brokers and make sure enough copies of a given partition are available.
I would argue that, conceptually, it is not much different from determining
the replica assignment across disks.

- I would agree that this KIP improves the performance of Kafka at the cost
of more complexity inside Kafka, by switching from RAID-10 to JBOD. I would
argue that this is the right direction. If we could gain 20%+ performance by
managing NICs in Kafka compared to the existing approach and other
alternatives, I would say we should just do it. Such a gain in performance,
or equivalently reduction in cost, can save millions of dollars per year
for any company running Kafka at large scale.

Thanks,
Dong


On Wed, Feb 1, 2017 at 5:41 AM, Eno Thereska  wrote:

> I'm coming somewhat late to the discussion, apologies for that.
>
> I'm worried about this proposal. It's moving Kafka to a world where it
> manages disks. So in a sense, the scope of the KIP is limited, but the
> direction it sets for Kafka is quite a big step change. Fundamentally this
> is about balancing resources for a Kafka broker. This can be done by a
> tool, rather than by changing Kafka. E.g., the tool would take a bunch of
> disks together, create a volume over them and export that to a Kafka broker
> (in addition to setting the memory limits for that broker or limiting other
> resources). A different bunch of disks can then make up a second volume,
> and be used by another Kafka broker. This is aligned with what Colin is
> saying (as I understand it).
>
> Disks are not the only resource on a machine, there are several instances
> where multiple NICs are used for example. Do we want fine grained
> management of all these resources? I'd argue that opens us the system to a
> lot of complexity.
>
> Thanks
> Eno
>
>
> > On 1 Feb 2017, at 01:53, Dong Lin  wrote:
> >
> > Hi all,
> >
> > I am going to initiate the vote If there is no further concern with the
> KIP.
> >
> > Thanks,
> > Dong
> >
> >
> > On Fri, Jan 27, 2017 at 8:08 PM, radai 
> wrote:
> >
> >> a few extra points:
> >>
> >> 1. broker per disk might also incur more client <--> broker sockets:
> >> suppose every producer / consumer "talks" to >1 partition, there's a
> very
> >> good chance that partitions that were co-located on a single 10-disk
> broker
> >> 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-02-01 Thread Eno Thereska
I'm coming somewhat late to the discussion, apologies for that.

I'm worried about this proposal. It's moving Kafka to a world where it manages 
disks. So in a sense, the scope of the KIP is limited, but the direction it 
sets for Kafka is quite a big step change. Fundamentally this is about 
balancing resources for a Kafka broker. This can be done by a tool, rather than 
by changing Kafka. E.g., the tool would take a bunch of disks together, create 
a volume over them and export that to a Kafka broker (in addition to setting 
the memory limits for that broker or limiting other resources). A different 
bunch of disks can then make up a second volume, and be used by another Kafka 
broker. This is aligned with what Colin is saying (as I understand it). 

Disks are not the only resource on a machine; there are several instances where
multiple NICs are used, for example. Do we want fine-grained management of all
these resources? I'd argue that opens the system up to a lot of complexity.

Thanks
Eno


> On 1 Feb 2017, at 01:53, Dong Lin  wrote:
> 
> Hi all,
> 
> I am going to initiate the vote If there is no further concern with the KIP.
> 
> Thanks,
> Dong
> 
> 
> On Fri, Jan 27, 2017 at 8:08 PM, radai  wrote:
> 
>> a few extra points:
>> 
>> 1. broker per disk might also incur more client <--> broker sockets:
>> suppose every producer / consumer "talks" to >1 partition, there's a very
>> good chance that partitions that were co-located on a single 10-disk broker
>> would now be split between several single-disk broker processes on the same
>> machine. hard to put a multiplier on this, but likely >x1. sockets are a
>> limited resource at the OS level and incur some memory cost (kernel
>> buffers)
>> 
>> 2. there's a memory overhead to spinning up a JVM (compiled code and byte
>> code objects etc). if we assume this overhead is ~300 MB (order of
>> magnitude, specifics vary) than spinning up 10 JVMs would lose you 3 GB of
>> RAM. not a ton, but non negligible.
>> 
>> 3. there would also be some overhead downstream of kafka in any management
>> / monitoring / log aggregation system. likely less than x10 though.
>> 
>> 4. (related to above) - added complexity of administration with more
>> running instances.
>> 
>> is anyone running kafka with anywhere near 100GB heaps? i thought the point
>> was to rely on kernel page cache to do the disk buffering 
>> 
>> On Thu, Jan 26, 2017 at 11:00 AM, Dong Lin  wrote:
>> 
>>> Hey Colin,
>>> 
>>> Thanks much for the comment. Please see me comment inline.
>>> 
>>> On Thu, Jan 26, 2017 at 10:15 AM, Colin McCabe 
>> wrote:
>>> 
 On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
> Hey Colin,
> 
> Good point! Yeah we have actually considered and tested this
>> solution,
> which we call one-broker-per-disk. It would work and should require
>> no
> major change in Kafka as compared to this JBOD KIP. So it would be a
>>> good
> short term solution.
> 
> But it has a few drawbacks which makes it less desirable in the long
> term.
> Assume we have 10 disks on a machine. Here are the problems:
 
 Hi Dong,
 
 Thanks for the thoughtful reply.
 
> 
> 1) Our stress test result shows that one-broker-per-disk has 15%
>> lower
> throughput
> 
> 2) Controller would need to send 10X as many LeaderAndIsrRequest,
> MetadataUpdateRequest and StopReplicaRequest. This increases the
>> burden
> on
> controller which can be the performance bottleneck.
 
 Maybe I'm misunderstanding something, but there would not be 10x as
>> many
 StopReplicaRequest RPCs, would there?  The other requests would
>> increase
 10x, but from a pretty low base, right?  We are not reassigning
 partitions all the time, I hope (or else we have bigger problems...)
 
>>> 
>>> I think the controller will group StopReplicaRequest per broker and send
>>> only one StopReplicaRequest to a broker during controlled shutdown.
>> Anyway,
>>> we don't have to worry about this if we agree that other requests will
>>> increase by 10X. One MetadataRequest to send to each broker in the
>> cluster
>>> every time there is leadership change. I am not sure this is a real
>>> problem. But in theory this makes the overhead complexity O(number of
>>> broker) and may be a concern in the future. Ideally we should avoid it.
>>> 
>>> 
 
> 
> 3) Less efficient use of physical resource on the machine. The number
>>> of
> socket on each machine will increase by 10X. The number of connection
> between any two machine will increase by 100X.
> 
> 4) Less efficient way to management memory and quota.
> 
> 5) Rebalance between disks/brokers on the same machine will less
> efficient
> and less flexible. Broker has to read data from another broker on the
> same
> machine via socket. It is also harder to 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-01-31 Thread Dong Lin
Hi all,

I am going to initiate the vote if there is no further concern with the KIP.

Thanks,
Dong


On Fri, Jan 27, 2017 at 8:08 PM, radai  wrote:

> a few extra points:
>
> 1. broker per disk might also incur more client <--> broker sockets:
> suppose every producer / consumer "talks" to >1 partition, there's a very
> good chance that partitions that were co-located on a single 10-disk broker
> would now be split between several single-disk broker processes on the same
> machine. hard to put a multiplier on this, but likely >x1. sockets are a
> limited resource at the OS level and incur some memory cost (kernel
> buffers)
>
> 2. there's a memory overhead to spinning up a JVM (compiled code and byte
> code objects etc). if we assume this overhead is ~300 MB (order of
> magnitude, specifics vary) than spinning up 10 JVMs would lose you 3 GB of
> RAM. not a ton, but non negligible.
>
> 3. there would also be some overhead downstream of kafka in any management
> / monitoring / log aggregation system. likely less than x10 though.
>
> 4. (related to above) - added complexity of administration with more
> running instances.
>
> is anyone running kafka with anywhere near 100GB heaps? i thought the point
> was to rely on kernel page cache to do the disk buffering 
>
> On Thu, Jan 26, 2017 at 11:00 AM, Dong Lin  wrote:
>
> > Hey Colin,
> >
> > Thanks much for the comment. Please see me comment inline.
> >
> > On Thu, Jan 26, 2017 at 10:15 AM, Colin McCabe 
> wrote:
> >
> > > On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
> > > > Hey Colin,
> > > >
> > > > Good point! Yeah we have actually considered and tested this
> solution,
> > > > which we call one-broker-per-disk. It would work and should require
> no
> > > > major change in Kafka as compared to this JBOD KIP. So it would be a
> > good
> > > > short term solution.
> > > >
> > > > But it has a few drawbacks which makes it less desirable in the long
> > > > term.
> > > > Assume we have 10 disks on a machine. Here are the problems:
> > >
> > > Hi Dong,
> > >
> > > Thanks for the thoughtful reply.
> > >
> > > >
> > > > 1) Our stress test result shows that one-broker-per-disk has 15%
> lower
> > > > throughput
> > > >
> > > > 2) Controller would need to send 10X as many LeaderAndIsrRequest,
> > > > MetadataUpdateRequest and StopReplicaRequest. This increases the
> burden
> > > > on
> > > > controller which can be the performance bottleneck.
> > >
> > > Maybe I'm misunderstanding something, but there would not be 10x as
> many
> > > StopReplicaRequest RPCs, would there?  The other requests would
> increase
> > > 10x, but from a pretty low base, right?  We are not reassigning
> > > partitions all the time, I hope (or else we have bigger problems...)
> > >
> >
> > I think the controller will group StopReplicaRequest per broker and send
> > only one StopReplicaRequest to a broker during controlled shutdown.
> Anyway,
> > we don't have to worry about this if we agree that other requests will
> > increase by 10X. One MetadataRequest to send to each broker in the
> cluster
> > every time there is leadership change. I am not sure this is a real
> > problem. But in theory this makes the overhead complexity O(number of
> > broker) and may be a concern in the future. Ideally we should avoid it.
> >
> >
> > >
> > > >
> > > > 3) Less efficient use of physical resource on the machine. The number
> > of
> > > > socket on each machine will increase by 10X. The number of connection
> > > > between any two machine will increase by 100X.
> > > >
> > > > 4) Less efficient way to management memory and quota.
> > > >
> > > > 5) Rebalance between disks/brokers on the same machine will less
> > > > efficient
> > > > and less flexible. Broker has to read data from another broker on the
> > > > same
> > > > machine via socket. It is also harder to do automatic load balance
> > > > between
> > > > disks on the same machine in the future.
> > > >
> > > > I will put this and the explanation in the rejected alternative
> > section.
> > > > I
> > > > have a few questions:
> > > >
> > > > - Can you explain why this solution can help avoid scalability
> > > > bottleneck?
> > > > I actually think it will exacerbate the scalability problem due the
> 2)
> > > > above.
> > > > - Why can we push more RPC with this solution?
> > >
> > > To really answer this question we'd have to take a deep dive into the
> > > locking of the broker and figure out how effectively it can parallelize
> > > truly independent requests.  Almost every multithreaded process is
> going
> > > to have shared state, like shared queues or shared sockets, that is
> > > going to make scaling less than linear when you add disks or
> processors.
> > >  (And clearly, another option is to improve that scalability, rather
> > > than going multi-process!)
> > >
> >
> > Yeah I also think it is better to improve scalability inside kafka code
> if
> > possible. I am 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-01-27 Thread radai
a few extra points:

1. broker per disk might also incur more client <--> broker sockets:
suppose every producer / consumer "talks" to >1 partition, there's a very
good chance that partitions that were co-located on a single 10-disk broker
would now be split between several single-disk broker processes on the same
machine. hard to put a multiplier on this, but likely >x1. sockets are a
limited resource at the OS level and incur some memory cost (kernel buffers)

2. there's a memory overhead to spinning up a JVM (compiled code and byte
code objects etc). if we assume this overhead is ~300 MB (order of
magnitude, specifics vary) then spinning up 10 JVMs would lose you 3 GB of
RAM. not a ton, but non negligible.

3. there would also be some overhead downstream of kafka in any management
/ monitoring / log aggregation system. likely less than x10 though.

4. (related to above) - added complexity of administration with more
running instances.

is anyone running kafka with anywhere near 100GB heaps? i thought the point
was to rely on kernel page cache to do the disk buffering 

On Thu, Jan 26, 2017 at 11:00 AM, Dong Lin  wrote:

> Hey Colin,
>
> Thanks much for the comment. Please see me comment inline.
>
> On Thu, Jan 26, 2017 at 10:15 AM, Colin McCabe  wrote:
>
> > On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
> > > Hey Colin,
> > >
> > > Good point! Yeah we have actually considered and tested this solution,
> > > which we call one-broker-per-disk. It would work and should require no
> > > major change in Kafka as compared to this JBOD KIP. So it would be a
> good
> > > short term solution.
> > >
> > > But it has a few drawbacks which makes it less desirable in the long
> > > term.
> > > Assume we have 10 disks on a machine. Here are the problems:
> >
> > Hi Dong,
> >
> > Thanks for the thoughtful reply.
> >
> > >
> > > 1) Our stress test result shows that one-broker-per-disk has 15% lower
> > > throughput
> > >
> > > 2) Controller would need to send 10X as many LeaderAndIsrRequest,
> > > MetadataUpdateRequest and StopReplicaRequest. This increases the burden
> > > on
> > > controller which can be the performance bottleneck.
> >
> > Maybe I'm misunderstanding something, but there would not be 10x as many
> > StopReplicaRequest RPCs, would there?  The other requests would increase
> > 10x, but from a pretty low base, right?  We are not reassigning
> > partitions all the time, I hope (or else we have bigger problems...)
> >
>
> I think the controller will group StopReplicaRequest per broker and send
> only one StopReplicaRequest to a broker during controlled shutdown. Anyway,
> we don't have to worry about this if we agree that other requests will
> increase by 10X. One MetadataRequest to send to each broker in the cluster
> every time there is leadership change. I am not sure this is a real
> problem. But in theory this makes the overhead complexity O(number of
> broker) and may be a concern in the future. Ideally we should avoid it.
>
>
> >
> > >
> > > 3) Less efficient use of physical resource on the machine. The number
> of
> > > socket on each machine will increase by 10X. The number of connection
> > > between any two machine will increase by 100X.
> > >
> > > 4) Less efficient way to management memory and quota.
> > >
> > > 5) Rebalance between disks/brokers on the same machine will less
> > > efficient
> > > and less flexible. Broker has to read data from another broker on the
> > > same
> > > machine via socket. It is also harder to do automatic load balance
> > > between
> > > disks on the same machine in the future.
> > >
> > > I will put this and the explanation in the rejected alternative
> section.
> > > I
> > > have a few questions:
> > >
> > > - Can you explain why this solution can help avoid scalability
> > > bottleneck?
> > > I actually think it will exacerbate the scalability problem due the 2)
> > > above.
> > > - Why can we push more RPC with this solution?
> >
> > To really answer this question we'd have to take a deep dive into the
> > locking of the broker and figure out how effectively it can parallelize
> > truly independent requests.  Almost every multithreaded process is going
> > to have shared state, like shared queues or shared sockets, that is
> > going to make scaling less than linear when you add disks or processors.
> >  (And clearly, another option is to improve that scalability, rather
> > than going multi-process!)
> >
>
> Yeah I also think it is better to improve scalability inside kafka code if
> possible. I am not sure we currently have any scalability issue inside
> Kafka that can not be removed without using multi-process.
>
>
> >
> > > - It is true that a garbage collection in one broker would not affect
> > > others. But that is after every broker only uses 1/10 of the memory.
> Can
> > > we be sure that this will actually help performance?
> >
> > The big question is, how much memory do Kafka brokers use now, and how
> 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-01-26 Thread Dong Lin
Hey Colin,

Thanks much for the comment. Please see my comments inline.

On Thu, Jan 26, 2017 at 10:15 AM, Colin McCabe  wrote:

> On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
> > Hey Colin,
> >
> > Good point! Yeah we have actually considered and tested this solution,
> > which we call one-broker-per-disk. It would work and should require no
> > major change in Kafka as compared to this JBOD KIP. So it would be a good
> > short term solution.
> >
> > But it has a few drawbacks which makes it less desirable in the long
> > term.
> > Assume we have 10 disks on a machine. Here are the problems:
>
> Hi Dong,
>
> Thanks for the thoughtful reply.
>
> >
> > 1) Our stress test result shows that one-broker-per-disk has 15% lower
> > throughput
> >
> > 2) Controller would need to send 10X as many LeaderAndIsrRequest,
> > MetadataUpdateRequest and StopReplicaRequest. This increases the burden
> > on
> > controller which can be the performance bottleneck.
>
> Maybe I'm misunderstanding something, but there would not be 10x as many
> StopReplicaRequest RPCs, would there?  The other requests would increase
> 10x, but from a pretty low base, right?  We are not reassigning
> partitions all the time, I hope (or else we have bigger problems...)
>

I think the controller will group StopReplicaRequests per broker and send
only one StopReplicaRequest to a broker during controlled shutdown. Anyway,
we don't have to worry about this if we agree that the other requests will
increase by 10X. One MetadataRequest is sent to each broker in the cluster
every time there is a leadership change. I am not sure this is a real
problem, but in theory it makes the overhead O(number of brokers) and may be
a concern in the future. Ideally we should avoid it.


>
> >
> > 3) Less efficient use of physical resource on the machine. The number of
> > socket on each machine will increase by 10X. The number of connection
> > between any two machine will increase by 100X.
> >
> > 4) Less efficient way to management memory and quota.
> >
> > 5) Rebalance between disks/brokers on the same machine will less
> > efficient
> > and less flexible. Broker has to read data from another broker on the
> > same
> > machine via socket. It is also harder to do automatic load balance
> > between
> > disks on the same machine in the future.
> >
> > I will put this and the explanation in the rejected alternative section.
> > I
> > have a few questions:
> >
> > - Can you explain why this solution can help avoid scalability
> > bottleneck?
> > I actually think it will exacerbate the scalability problem due the 2)
> > above.
> > - Why can we push more RPC with this solution?
>
> To really answer this question we'd have to take a deep dive into the
> locking of the broker and figure out how effectively it can parallelize
> truly independent requests.  Almost every multithreaded process is going
> to have shared state, like shared queues or shared sockets, that is
> going to make scaling less than linear when you add disks or processors.
>  (And clearly, another option is to improve that scalability, rather
> than going multi-process!)
>

Yeah, I also think it is better to improve scalability inside the Kafka code
if possible. I am not sure we currently have any scalability issue inside
Kafka that can only be removed by going multi-process.


>
> > - It is true that a garbage collection in one broker would not affect
> > others. But that is after every broker only uses 1/10 of the memory. Can
> > we be sure that this will actually help performance?
>
> The big question is, how much memory do Kafka brokers use now, and how
> much will they use in the future?  Our experience in HDFS was that once
> you start getting more than 100-200GB Java heap sizes, full GCs start
> taking minutes to finish when using the standard JVMs.  That alone is a
> good reason to go multi-process or consider storing more things off the
> Java heap.
>

I see. Now I agree one-broker-per-disk should be more efficient in terms of
GC, since each broker probably needs less than 1/10 of the memory available
on a typical machine nowadays. I will remove this from the reasons for
rejection.


>
> Disk failure is the "easy" case.  The "hard" case, which is
> unfortunately also the much more common case, is disk misbehavior.
> Towards the end of their lives, disks tend to start slowing down
> unpredictably.  Requests that would have completed immediately before
> start taking 20, 100 500 milliseconds.  Some files may be readable and
> other files may not be.  System calls hang, sometimes forever, and the
> Java process can't abort them, because the hang is in the kernel.  It is
> not fun when threads are stuck in "D state"
> http://stackoverflow.com/questions/20423521/process-perminan
> tly-stuck-on-d-state
> .  Even kill -9 cannot abort the thread then.  Fortunately, this is
> rare.
>

I agree it is a harder problem and it is rare. We probably don't have to
worry about it in this KIP 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-01-26 Thread Colin McCabe
On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
> Hey Colin,
> 
> Good point! Yeah we have actually considered and tested this solution,
> which we call one-broker-per-disk. It would work and should require no
> major change in Kafka as compared to this JBOD KIP. So it would be a good
> short term solution.
> 
> But it has a few drawbacks which makes it less desirable in the long
> term.
> Assume we have 10 disks on a machine. Here are the problems:

Hi Dong,

Thanks for the thoughtful reply.

> 
> 1) Our stress test result shows that one-broker-per-disk has 15% lower
> throughput
> 
> 2) Controller would need to send 10X as many LeaderAndIsrRequest,
> MetadataUpdateRequest and StopReplicaRequest. This increases the burden
> on
> controller which can be the performance bottleneck.

Maybe I'm misunderstanding something, but there would not be 10x as many
StopReplicaRequest RPCs, would there?  The other requests would increase
10x, but from a pretty low base, right?  We are not reassigning
partitions all the time, I hope (or else we have bigger problems...)

> 
> 3) Less efficient use of physical resource on the machine. The number of
> socket on each machine will increase by 10X. The number of connection
> between any two machine will increase by 100X.
> 
> 4) Less efficient way to management memory and quota.
> 
> 5) Rebalance between disks/brokers on the same machine will less
> efficient
> and less flexible. Broker has to read data from another broker on the
> same
> machine via socket. It is also harder to do automatic load balance
> between
> disks on the same machine in the future.
> 
> I will put this and the explanation in the rejected alternative section.
> I
> have a few questions:
> 
> - Can you explain why this solution can help avoid scalability
> bottleneck?
> I actually think it will exacerbate the scalability problem due the 2)
> above.
> - Why can we push more RPC with this solution?

To really answer this question we'd have to take a deep dive into the
locking of the broker and figure out how effectively it can parallelize
truly independent requests.  Almost every multithreaded process is going
to have shared state, like shared queues or shared sockets, that is
going to make scaling less than linear when you add disks or processors.
 (And clearly, another option is to improve that scalability, rather
than going multi-process!)

> - It is true that a garbage collection in one broker would not affect
> others. But that is after every broker only uses 1/10 of the memory. Can
> we be sure that this will actually help performance?

The big question is, how much memory do Kafka brokers use now, and how
much will they use in the future?  Our experience in HDFS was that once
you start getting more than 100-200GB Java heap sizes, full GCs start
taking minutes to finish when using the standard JVMs.  That alone is a
good reason to go multi-process or consider storing more things off the
Java heap.

Disk failure is the "easy" case.  The "hard" case, which is
unfortunately also the much more common case, is disk misbehavior. 
Towards the end of their lives, disks tend to start slowing down
unpredictably.  Requests that would have completed immediately before
start taking 20, 100, 500 milliseconds.  Some files may be readable and
other files may not be.  System calls hang, sometimes forever, and the
Java process can't abort them, because the hang is in the kernel.  It is
not fun when threads are stuck in "D state"
http://stackoverflow.com/questions/20423521/process-perminantly-stuck-on-d-state
.  Even kill -9 cannot abort the thread then.  Fortunately, this is
rare.

Another approach we should consider is for Kafka to implement its own
storage layer that would stripe across multiple disks.  This wouldn't
have to be done at the block level, but could be done at the file level.
 We could use consistent hashing to determine which disks a file should
end up on, for example.
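
A minimal sketch of that file-level striping idea, assuming a consistent-hash
ring that maps each file (e.g. a log segment) to one of the configured log
directories; the class name, the number of virtual nodes and the hash choice
are all illustrative:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Consistent-hash ring over log directories: each file name maps to the
    // first ring position at or after its hash, so adding or losing one dir
    // only moves a fraction of the files.
    final class LogDirRing {
      private final TreeMap<Long, String> ring = new TreeMap<>();

      LogDirRing(List<String> logDirs) {
        for (String dir : logDirs) {
          for (int v = 0; v < 100; v++) {      // virtual nodes smooth the distribution
            ring.put(hash(dir + "#" + v), dir);
          }
        }
      }

      String dirFor(String fileName) {
        long h = hash(fileName);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
      }

      private static long hash(String s) {
        try {
          byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
          long h = 0;
          for (int i = 0; i < 8; i++) {
            h = (h << 8) | (d[i] & 0xff);
          }
          return h;
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
    }

Usage would be along the lines of
new LogDirRing(Arrays.asList("/data1", "/data2", "/data3"))
    .dirFor("my-topic-0/00000000000000000000.log").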

best,
Colin

> 
> Thanks,
> Dong
> 
> On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe 
> wrote:
> 
> > Hi Dong,
> >
> > Thanks for the writeup!  It's very interesting.
> >
> > I apologize in advance if this has been discussed somewhere else.  But I
> > am curious if you have considered the solution of running multiple
> > brokers per node.  Clearly there is a memory overhead with this solution
> > because of the fixed cost of starting multiple JVMs.  However, running
> > multiple JVMs would help avoid scalability bottlenecks.  You could
> > probably push more RPCs per second, for example.  A garbage collection
> > in one broker would not affect the others.  It would be interesting to
> > see this considered in the "alternate designs" design, even if you end
> > up deciding it's not the way to go.
> >
> > best,
> > Colin
> >
> >
> > On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
> > > Hi all,
> > >
> > > We created KIP-112: Handle disk failure for JBOD. Please find the KIP
> > > wiki
> > > in the link 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-01-25 Thread Dong Lin
Hey Colin,

Good point! Yeah we have actually considered and tested this solution,
which we call one-broker-per-disk. It would work and should require no
major change in Kafka as compared to this JBOD KIP. So it would be a good
short term solution.

But it has a few drawbacks which make it less desirable in the long term.
Assume we have 10 disks on a machine. Here are the problems:

1) Our stress test result shows that one-broker-per-disk has 15% lower
throughput

2) The controller would need to send 10X as many LeaderAndIsrRequests,
UpdateMetadataRequests and StopReplicaRequests. This increases the burden on
the controller, which can be the performance bottleneck.

3) Less efficient use of physical resources on the machine. The number of
sockets on each machine will increase by 10X, and the number of connections
between any two machines will increase by 100X (see the rough arithmetic
after this list).

4) Less efficient management of memory and quotas.

5) Rebalancing between disks/brokers on the same machine will be less
efficient and less flexible. A broker has to read data from another broker
on the same machine via a socket. It is also harder to do automatic load
balancing between disks on the same machine in the future.
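
Rough arithmetic behind the 10X/100X figures in 3), assuming each pair of
broker processes that replicate from or talk to each other keeps at least one
TCP connection open: with one broker per machine there is on the order of one
such connection between two machines, while with 10 brokers per machine there
can be up to 10 x 10 = 100, and the per-machine socket count similarly grows
with the 10X broker count.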

I will put this and the explanation in the rejected alternatives section. I
have a few questions:

- Can you explain why this solution helps avoid scalability bottlenecks?
I actually think it will exacerbate the scalability problem due to 2)
above.
- Why can we push more RPC with this solution?
- It is true that a garbage collection in one broker would not affect the
others, but that is only after every broker is limited to 1/10 of the memory.
Can we be sure that this will actually help performance?

Thanks,
Dong

On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe  wrote:

> Hi Dong,
>
> Thanks for the writeup!  It's very interesting.
>
> I apologize in advance if this has been discussed somewhere else.  But I
> am curious if you have considered the solution of running multiple
> brokers per node.  Clearly there is a memory overhead with this solution
> because of the fixed cost of starting multiple JVMs.  However, running
> multiple JVMs would help avoid scalability bottlenecks.  You could
> probably push more RPCs per second, for example.  A garbage collection
> in one broker would not affect the others.  It would be interesting to
> see this considered in the "alternate designs" design, even if you end
> up deciding it's not the way to go.
>
> best,
> Colin
>
>
> On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
> > Hi all,
> >
> > We created KIP-112: Handle disk failure for JBOD. Please find the KIP
> > wiki
> > in the link https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 112%3A+Handle+disk+failure+for+JBOD.
> >
> > This KIP is related to KIP-113
> >  113%3A+Support+replicas+movement+between+log+directories>:
> > Support replicas movement between log directories. They are needed in
> > order
> > to support JBOD in Kafka. Please help review the KIP. You feedback is
> > appreciated!
> >
> > Thanks,
> > Dong
>


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-01-25 Thread Colin McCabe
Hi Dong,

Thanks for the writeup!  It's very interesting.

I apologize in advance if this has been discussed somewhere else.  But I
am curious if you have considered the solution of running multiple
brokers per node.  Clearly there is a memory overhead with this solution
because of the fixed cost of starting multiple JVMs.  However, running
multiple JVMs would help avoid scalability bottlenecks.  You could
probably push more RPCs per second, for example.  A garbage collection
in one broker would not affect the others.  It would be interesting to
see this considered in the "alternate designs" design, even if you end
up deciding it's not the way to go.

best,
Colin


On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
> Hi all,
> 
> We created KIP-112: Handle disk failure for JBOD. Please find the KIP
> wiki
> in the link https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 112%3A+Handle+disk+failure+for+JBOD.
> 
> This KIP is related to KIP-113
> :
> Support replicas movement between log directories. They are needed in
> order
> to support JBOD in Kafka. Please help review the KIP. You feedback is
> appreciated!
> 
> Thanks,
> Dong


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-01-24 Thread Guozhang Wang
Thanks for the detailed explanations Dong. That makes sense to me.


Guozhang

On Sun, Jan 22, 2017 at 4:00 PM, Dong Lin  wrote:

> Hey Guozhang,
>
> Thanks for the review! Yes we have considered this approach and briefly
> explained why we don't do it in the rejected alternative section. Here is
> my concern with this approach in more detail:
>
> - This approach introduces tight coupling between kafka's logical leader
> election with broker's local file OS config. My intuition is that this
> tight coupling may make future development a bit harder and we should try
> to avoid that. Note that we only use logical information (e.g. partition,
> broker id) in the zookeeper and controller as of now.
>
> - Encoding log directory in the replica identifier requires much more
> change in the code. In addition to changing znode data format in zookeeper,
> we probably need to update every protocol that touches replica id, such as
> StopReplicaRequest, ListOffsetRequest, LeaderAndIsrResponse and so on. Many
> Java classes need to be changes as well to recognize log directory in
> replica identifier. Arguably it is still possible to use broker id without
> log directory to identify replica in some protocols and Java classes under
> the assumption that no two replicas of the same partition can reside on the
> same broker. But we need to think carefully for each protocol and Java
> class and the result may be error prone and controversial. For simplicity
> of the discussion and code review, I prefer to only do this if there is
> strong benefit of this design.
>
> - Current approach in the KIP make it easier to move replicas between
> replicas on the same broker because that operation can be completely hidden
> from controller and other brokers. On the other hand, if we were to move
> replica between disk in the suggested approach, broker needs to write to
> some notification zookeeper path after movement is completed so that broker
> can send LeaderAndIsrRequest to get the new replica identifier, update it
> cache and write to znode /brokers/topics/[topic]/partitions/[partitionId]/
> state.
>
> Dong
>
>
> On Sun, Jan 22, 2017 at 10:50 AM, Guozhang Wang 
> wrote:
>
> > Hello Dong,
> >
> > Thanks for the very well written KIP. I had a general thought on the ZK
> > path management, wondering if the following alternative would work:
> >
> > 1. Bump up versions in "brokers/topics/[topic]" and
> > "/brokers/topics/[topic]/partitions/[partitionId]/state"
> > to 2, in which the replica id is no longer an int but a string.
> >
> > 2. Bump up versions in "/brokers/ids/[brokerId]" to add another field:
> >
> > { "fields":
> > [ {"name": "version", "type": "int", "doc": "version id"},
> >   {"name": "host", "type": "string", "doc": "ip address or host name
> of
> > the broker"},
> >   {"name": "port", "type": "int", "doc": "port of the broker"},
> >   {"name": "jmx_port", "type": "int", "doc": "port for jmx"}
> >   {"name": "log_dirs",
> >"type": {"type": "array",
> > "items": "int",
> > "doc": "an array of the id of the log dirs in broker"}
> >   },
> > ]
> > }
> >
> > 3. The replica id can now either be an string-typed integer indicating
> that
> > all partitions on the broker still treated as failed or not as a whole,
> > i.e. no support needed for JBOD; or be a string typed
> "[brokerID]-[dirID]",
> > in which brokers / controllers can still parse to determine which broker
> is
> > hosting this replica: in this case the management of replicas is finer
> > grained, no longer at the broker level (i.e. if broker dies all replicas
> go
> > offline) but broker-dir level.
> >
> > 4. When broker had one of the dir failed, it can modify its "
> > /brokers/ids/[brokerId]" registry and remove the dir id, controller
> already
> > listening on this path can then be notified and run the replica
> assignment
> > accordingly where replica id is computed as above.
> >
> >
> > By doing this controller can also naturally reassign replicas between
> dirs
> > within the same broker.
> >
> >
> > Guozhang
> >
> >
> > On Thu, Jan 12, 2017 at 6:25 PM, Ismael Juma  wrote:
> >
> > > Thanks for the KIP. Just wanted to quickly say that it's great to see
> > > proposals for improving JBOD (KIP-113 too). More feedback soon,
> > hopefully.
> > >
> > > Ismael
> > >
> > > On Thu, Jan 12, 2017 at 6:46 PM, Dong Lin  wrote:
> > >
> > > > Hi all,
> > > >
> > > > We created KIP-112: Handle disk failure for JBOD. Please find the KIP
> > > wiki
> > > > in the link https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > 112%3A+Handle+disk+failure+for+JBOD.
> > > >
> > > > This KIP is related to KIP-113
> > > >  > > > 113%3A+Support+replicas+movement+between+log+directories>:
> > > > Support replicas movement between log directories. They are needed in
> > > 

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-01-22 Thread Dong Lin
Hey Guozhang,

Thanks for the review! Yes, we have considered this approach and briefly
explained why we decided against it in the rejected alternatives section.
Here are my concerns with this approach in more detail:

- This approach introduces tight coupling between Kafka's logical leader
election and the broker's local file system configuration. My intuition is
that this tight coupling may make future development a bit harder, and we
should try to avoid it. Note that as of now we only use logical information
(e.g. partition, broker id) in zookeeper and the controller.

- Encoding the log directory in the replica identifier requires much more
change in the code. In addition to changing the znode data format in
zookeeper, we probably need to update every protocol that touches the
replica id, such as StopReplicaRequest, ListOffsetRequest,
LeaderAndIsrResponse and so on. Many Java classes need to be changed as
well to recognize the log directory in the replica identifier. Arguably it
is still possible to use the broker id without the log directory to
identify a replica in some protocols and Java classes, under the assumption
that no two replicas of the same partition can reside on the same broker.
But we would need to think carefully about each protocol and Java class,
and the result may be error prone and controversial. For simplicity of the
discussion and code review, I prefer to only do this if there is a strong
benefit to this design.

- The current approach in the KIP makes it easier to move replicas between
log directories on the same broker, because that operation can be
completely hidden from the controller and other brokers. On the other hand,
if we were to move a replica between disks in the suggested approach, the
broker would need to write to some notification zookeeper path after the
movement is completed so that the controller can send a LeaderAndIsrRequest
with the new replica identifier, update its cache and write to the znode
/brokers/topics/[topic]/partitions/[partitionId]/state.
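
For reference, the content of that state znode today is plain JSON with
integer broker ids, roughly as below (the values are made up for
illustration), so under the suggested approach the replica-bearing fields
would have to move to the string-typed "[brokerID]-[dirID]" ids:

  {"controller_epoch": 7,
   "leader": 1,
   "version": 1,
   "leader_epoch": 3,
   "isr": [1, 2, 3]}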

Dong


On Sun, Jan 22, 2017 at 10:50 AM, Guozhang Wang  wrote:

> Hello Dong,
>
> Thanks for the very well written KIP. I had a general thought on the ZK
> path management, wondering if the following alternative would work:
>
> 1. Bump up versions in "brokers/topics/[topic]" and
> "/brokers/topics/[topic]/partitions/[partitionId]/state"
> to 2, in which the replica id is no longer an int but a string.
>
> 2. Bump up versions in "/brokers/ids/[brokerId]" to add another field:
>
> { "fields":
> [ {"name": "version", "type": "int", "doc": "version id"},
>   {"name": "host", "type": "string", "doc": "ip address or host name of
> the broker"},
>   {"name": "port", "type": "int", "doc": "port of the broker"},
>   {"name": "jmx_port", "type": "int", "doc": "port for jmx"}
>   {"name": "log_dirs",
>"type": {"type": "array",
> "items": "int",
> "doc": "an array of the id of the log dirs in broker"}
>   },
> ]
> }
>
> 3. The replica id can now either be an string-typed integer indicating that
> all partitions on the broker still treated as failed or not as a whole,
> i.e. no support needed for JBOD; or be a string typed "[brokerID]-[dirID]",
> in which brokers / controllers can still parse to determine which broker is
> hosting this replica: in this case the management of replicas is finer
> grained, no longer at the broker level (i.e. if broker dies all replicas go
> offline) but broker-dir level.
>
> 4. When broker had one of the dir failed, it can modify its "
> /brokers/ids/[brokerId]" registry and remove the dir id, controller already
> listening on this path can then be notified and run the replica assignment
> accordingly where replica id is computed as above.
>
>
> By doing this controller can also naturally reassign replicas between dirs
> within the same broker.
>
>
> Guozhang
>
>
> On Thu, Jan 12, 2017 at 6:25 PM, Ismael Juma  wrote:
>
> > Thanks for the KIP. Just wanted to quickly say that it's great to see
> > proposals for improving JBOD (KIP-113 too). More feedback soon,
> hopefully.
> >
> > Ismael
> >
> > On Thu, Jan 12, 2017 at 6:46 PM, Dong Lin  wrote:
> >
> > > Hi all,
> > >
> > > We created KIP-112: Handle disk failure for JBOD. Please find the KIP
> > wiki
> > > in the link https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 112%3A+Handle+disk+failure+for+JBOD.
> > >
> > > This KIP is related to KIP-113
> > >  > > 113%3A+Support+replicas+movement+between+log+directories>:
> > > Support replicas movement between log directories. They are needed in
> > order
> > > to support JBOD in Kafka. Please help review the KIP. You feedback is
> > > appreciated!
> > >
> > > Thanks,
> > > Dong
> > >
> >
>
>
>
> --
> -- Guozhang
>


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-01-22 Thread Guozhang Wang
I think this also affects the KIP-113 design, but I'll just leave it as a
single comment here.

On Sun, Jan 22, 2017 at 10:50 AM, Guozhang Wang  wrote:

> Hello Dong,
>
> Thanks for the very well written KIP. I had a general thought on the ZK
> path management, wondering if the following alternative would work:
>
> 1. Bump up versions in "brokers/topics/[topic]" and "
> /brokers/topics/[topic]/partitions/[partitionId]/state" to 2, in which
> the replica id is no longer an int but a string.
>
> 2. Bump up versions in "/brokers/ids/[brokerId]" to add another field:
>
> { "fields":
> [ {"name": "version", "type": "int", "doc": "version id"},
>   {"name": "host", "type": "string", "doc": "ip address or host name
> of the broker"},
>   {"name": "port", "type": "int", "doc": "port of the broker"},
>   {"name": "jmx_port", "type": "int", "doc": "port for jmx"}
>   {"name": "log_dirs",
>"type": {"type": "array",
> "items": "int",
> "doc": "an array of the id of the log dirs in broker"}
>   },
> ]
> }
>
> 3. The replica id can now either be an string-typed integer indicating
> that all partitions on the broker still treated as failed or not as a
> whole, i.e. no support needed for JBOD; or be a string typed
> "[brokerID]-[dirID]", in which brokers / controllers can still parse to
> determine which broker is hosting this replica: in this case the management
> of replicas is finer grained, no longer at the broker level (i.e. if broker
> dies all replicas go offline) but broker-dir level.
>
> 4. When broker had one of the dir failed, it can modify its "
> /brokers/ids/[brokerId]" registry and remove the dir id, controller
> already listening on this path can then be notified and run the replica
> assignment accordingly where replica id is computed as above.
>
>
> By doing this controller can also naturally reassign replicas between dirs
> within the same broker.
>
>
> Guozhang
>
>
> On Thu, Jan 12, 2017 at 6:25 PM, Ismael Juma  wrote:
>
>> Thanks for the KIP. Just wanted to quickly say that it's great to see
>> proposals for improving JBOD (KIP-113 too). More feedback soon, hopefully.
>>
>> Ismael
>>
>> On Thu, Jan 12, 2017 at 6:46 PM, Dong Lin  wrote:
>>
>> > Hi all,
>> >
>> > We created KIP-112: Handle disk failure for JBOD. Please find the KIP
>> wiki
>> > in the link https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> > 112%3A+Handle+disk+failure+for+JBOD.
>> >
>> > This KIP is related to KIP-113
>> > > > 113%3A+Support+replicas+movement+between+log+directories>:
>> > Support replicas movement between log directories. They are needed in
>> order
>> > to support JBOD in Kafka. Please help review the KIP. You feedback is
>> > appreciated!
>> >
>> > Thanks,
>> > Dong
>> >
>>
>
>
>
> --
> -- Guozhang
>



-- 
-- Guozhang


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-01-22 Thread Guozhang Wang
Hello Dong,

Thanks for the very well written KIP. I had a general thought on the ZK
path management, wondering if the following alternative would work:

1. Bump up the versions in "/brokers/topics/[topic]" and
"/brokers/topics/[topic]/partitions/[partitionId]/state" to 2, in which the
replica id is no longer an int but a string.

2. Bump up the version in "/brokers/ids/[brokerId]" to add another field:

{ "fields":
    [ {"name": "version", "type": "int", "doc": "version id"},
      {"name": "host", "type": "string", "doc": "ip address or host name of the broker"},
      {"name": "port", "type": "int", "doc": "port of the broker"},
      {"name": "jmx_port", "type": "int", "doc": "port for jmx"},
      {"name": "log_dirs",
       "type": {"type": "array",
                "items": "int",
                "doc": "an array of the ids of the log dirs in the broker"}
      }
    ]
}

3. The replica id can now either be a string-typed integer, indicating that
all partitions on the broker are still treated as failed or not as a whole
(i.e. no JBOD support needed); or be a string of the form
"[brokerID]-[dirID]", which brokers / controllers can still parse to
determine which broker is hosting this replica. In this case the management
of replicas is finer grained: no longer at the broker level (i.e. if the
broker dies, all replicas go offline) but at the broker-dir level.

4. When a broker has one of its dirs fail, it can modify its
"/brokers/ids/[brokerId]" registry and remove the dir id; the controller,
already listening on this path, can then be notified and run the replica
assignment accordingly, where the replica id is computed as above.


By doing this the controller can also naturally reassign replicas between
dirs within the same broker.
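
To make the proposed encoding concrete, here is a rough Java sketch of how a
controller or broker might parse such a replica id; the class name and the
fallback to a plain integer id are illustrative assumptions, not existing
Kafka code:

  /** Parses the proposed replica id: either "[brokerID]" (whole-broker
   *  granularity, no JBOD support) or "[brokerID]-[dirID]" (broker-dir
   *  granularity). */
  public final class ReplicaLocation {
      public final int brokerId;
      public final Integer dirId;  // null when no log dir is encoded

      private ReplicaLocation(int brokerId, Integer dirId) {
          this.brokerId = brokerId;
          this.dirId = dirId;
      }

      public static ReplicaLocation parse(String replicaId) {
          int sep = replicaId.indexOf('-');
          if (sep < 0) {
              // Plain broker id: the broker is one failure domain, as today
              return new ReplicaLocation(Integer.parseInt(replicaId), null);
          }
          int brokerId = Integer.parseInt(replicaId.substring(0, sep));
          int dirId = Integer.parseInt(replicaId.substring(sep + 1));
          return new ReplicaLocation(brokerId, dirId);
      }
  }

For example, "5" maps to broker 5 with no dir information, while "5-2" maps
to dir 2 on broker 5.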


Guozhang


On Thu, Jan 12, 2017 at 6:25 PM, Ismael Juma  wrote:

> Thanks for the KIP. Just wanted to quickly say that it's great to see
> proposals for improving JBOD (KIP-113 too). More feedback soon, hopefully.
>
> Ismael
>
> On Thu, Jan 12, 2017 at 6:46 PM, Dong Lin  wrote:
>
> > Hi all,
> >
> > We created KIP-112: Handle disk failure for JBOD. Please find the KIP
> wiki
> > in the link https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 112%3A+Handle+disk+failure+for+JBOD.
> >
> > This KIP is related to KIP-113
> >  > 113%3A+Support+replicas+movement+between+log+directories>:
> > Support replicas movement between log directories. They are needed in
> order
> > to support JBOD in Kafka. Please help review the KIP. You feedback is
> > appreciated!
> >
> > Thanks,
> > Dong
> >
>



-- 
-- Guozhang


Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

2017-01-12 Thread Ismael Juma
Thanks for the KIP. Just wanted to quickly say that it's great to see
proposals for improving JBOD (KIP-113 too). More feedback soon, hopefully.

Ismael

On Thu, Jan 12, 2017 at 6:46 PM, Dong Lin  wrote:

> Hi all,
>
> We created KIP-112: Handle disk failure for JBOD. Please find the KIP wiki
> in the link https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 112%3A+Handle+disk+failure+for+JBOD.
>
> This KIP is related to KIP-113
>  113%3A+Support+replicas+movement+between+log+directories>:
> Support replicas movement between log directories. They are needed in order
> to support JBOD in Kafka. Please help review the KIP. You feedback is
> appreciated!
>
> Thanks,
> Dong
>