Hi Dong,
I think AtMinIsr is still valuable as an indication that the cluster is in a
critical state and that something needs to be done ASAP to restore it.
Regarding your example:
"let's say min_isr = 1 and replica_set_size = 3, it is
> still possible that planned maintenance (e.g. one broker restart +
> partition reassignment) can cause the ISR size to drop to 1. Since AtMinIsr
> can also cause false positives (i.e. the fact that AtMinIsr > 0 does not
> necessarily need attention from the user),"

One broker restart shouldn't cause the ISR to drop from 3 to 1 unless 2
partitions are co-located on the same broker.
Even then, this is still a valuable indicator to the admins that the partition
assignment needs to be changed.

In our case, we run 4 replicas for critical topics with min.isr = 2. URPs are
not a good indicator for taking immediate action when only one of the replicas
is down. If 2 replicas are down and we are at 2 alive replicas, that is a
stop-everything situation: the cluster must be restored to a good state.
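
To make the distinction concrete, here is a minimal Python sketch of the two
conditions for the 4-replica, min.isr = 2 setup above. It is illustrative only
and not Kafka code: the class and function names are hypothetical, and it
assumes AtMinIsr means "ISR size equals min.insync.replicas", as the KIP
describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition (not a Kafka class)."""
    replicas: int  # size of the assigned replica set
    isr: int       # current number of in-sync replicas
    min_isr: int   # configured min.insync.replicas


def is_under_replicated(p: PartitionState) -> bool:
    # URP: at least one assigned replica is out of sync.
    return p.isr < p.replicas


def is_at_min_isr(p: PartitionState) -> bool:
    # AtMinIsr (assumed definition): the ISR has shrunk to exactly min.isr,
    # so one more failure would make acks=all producers fail.
    return p.isr == p.min_isr


# Walk the ISR down from 4 to 2 for a 4-replica topic with min.isr = 2.
for isr in (4, 3, 2):
    p = PartitionState(replicas=4, isr=isr, min_isr=2)
    print(f"isr={isr} under_replicated={is_under_replicated(p)} "
          f"at_min_isr={is_at_min_isr(p)}")
```

With one replica down (isr=3) only the URP condition holds; AtMinIsr holds
only once the ISR is down to 2, which is the point where we would page
someone.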

Thanks,
Harsha

On Wed, Feb 27, 2019, at 11:17 PM, Dong Lin wrote:
> Hey Kevin,
> 
> Thanks for the update.
> 
> The KIP suggests that AtMinIsr is better than UnderReplicatedPartition as
> an indicator for alerting. However, in most cases, where min_isr =
> replica_set_size - 1, these two metrics are exactly the same, and planned
> maintenance can easily cause a positive AtMinIsr value. In the other
> scenario, for example let's say min_isr = 1 and replica_set_size = 3, it is
> still possible that planned maintenance (e.g. one broker restart +
> partition reassignment) can cause the ISR size to drop to 1. Since AtMinIsr
> can also cause false positives (i.e. the fact that AtMinIsr > 0 does not
> necessarily need attention from the user), I am not sure it is worth adding
> this metric.
> 
> In the Usage section, it is mentioned that the user needs to manually check
> whether there is ongoing maintenance after AtMinIsr is triggered. Could you
> explain how this is different from the current approach, where we use
> UnderReplicatedPartition to trigger alerts? More specifically, can we just
> replace AtMinIsr with UnderReplicatedPartition in the Usage section?
> 
> Thanks,
> Dong
> 
> 
> On Tue, Feb 26, 2019 at 6:49 PM Kevin Lu <lu.ke...@berkeley.edu> wrote:
> 
> > Hi Dong!
> >
> > Thanks for the feedback!
> >
> > You bring up a good point in that the AtMinIsr metric cannot be used to
> > identify failure in the mentioned scenarios. I admit the motivation section
> > placed too much emphasis on "identifying failure".
> >
> > I have modified the KIP to reflect the implementation: the AtMinIsr
> > metric is intended to serve as a warning, since one more failure on a
> > partition that is AtMinIsr will cause producers configured with acks=ALL
> > to fail. It has an additional benefit when minIsr=1, as it will warn us
> > that the entire partition is at risk of going offline, but that is more of
> > a side effect that only applies in that scenario (minIsr=1).
> >
> > Regards,
> > Kevin
> >
> > On Tue, Feb 26, 2019 at 5:11 PM Dong Lin <lindon...@gmail.com> wrote:
> >
> > > Hey Kevin,
> > >
> > > Thanks for the proposal!
> > >
> > > It seems that the proposed implementation does not match the motivation.
> > > The motivation suggests that the operator wants to tell planned
> > > maintenance (e.g. broker restart) from unplanned failure (e.g. network
> > > failure). But the use of the metric AtMinIsr does not really
> > > differentiate between these causes of the reduced number of ISR. For
> > > example, an unplanned failure can cause the ISR to drop from 3 to 2 but
> > > it can still be higher than the minIsr (say 1). And a planned maintenance
> > > can cause the ISR to drop from 3 to 2, which triggers the AtMinIsr metric
> > > if minIsr=2. Can you update the design doc to fix or explain this issue?
> > >
> > > Thanks,
> > > Dong
> > >
> > > On Tue, Feb 12, 2019 at 9:02 AM Kevin Lu <lu.ke...@berkeley.edu> wrote:
> > >
> > > > Hi All,
> > > >
> > > > Getting the discussion thread started for KIP-427 in case anyone is
> > > > free right now.
> > > >
> > > > I’d like to propose a new category of topic partitions, *AtMinIsr*,
> > > > which are partitions that only have the minimum number of in-sync
> > > > replicas left in the ISR set (as configured by min.insync.replicas).
> > > >
> > > > This would add two new metrics, *ReplicaManager.AtMinIsrPartitionCount*
> > > > & *Partition.AtMinIsr*, and a new TopicCommand option,
> > > > *--at-min-isr-partitions*, to help in monitoring and alerting.
> > > >
> > > > KIP link: KIP-427: Add AtMinIsr topic partition category (new metric &
> > > > TopicCommand option)
> > > > <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103089398>
> > > >
> > > > Please take a look and let me know what you think.
> > > >
> > > > Regards,
> > > > Kevin
> > > >
> > >
> >
>
