It seems to me that what this is driving at is listing partitions that have
no remaining fault tolerance. Maybe the wording could reflect that? That
rewording would have the benefit of covering both 1 of 3 replicas in sync,
and 2 of 3 in sync with min.insync.replicas=2.
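
For illustration only, a rough sketch of what such a check could look like
with the Java AdminClient; the topic name, bootstrap address, and the way the
effective min.insync.replicas is resolved are all placeholder assumptions,
not a concrete proposal for the tool:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class NoFaultToleranceCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(Collections.singleton("my-topic"))
                    .all().get().get("my-topic");
                int minIsr = 2; // effective min.insync.replicas, assumed resolved separately
                for (TopicPartitionInfo p : desc.partitions()) {
                    // Zero headroom: losing one more in-sync replica puts the
                    // partition under min ISR.
                    if (p.isr().size() - minIsr <= 0) {
                        System.out.printf("partition %d: isr=%d, min.isr=%d, no fault tolerance left%n",
                            p.partition(), p.isr().size(), minIsr);
                    }
                }
            }
        }
    }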

I'm not convinced by your labelling of WARNING / CRITICAL. It seems to make
assumptions on behalf of the topic owners. For example:

If there's a "two rack" cluster (say two DCs) and keeping written data
is absolutely critical, it's advisable to configure for rf=4, min.insync=3.
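
For concreteness, a topic like that could be created along these lines with
the Java AdminClient; the topic name, partition count, and broker address
below are placeholders, and this is only a sketch:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateDurableTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // rf=4 across the two DCs, and require 3 in-sync replicas
                // before an acks=all write is considered committed.
                NewTopic topic = new NewTopic("critical-writes", 12, (short) 4)
                    .configs(Collections.singletonMap("min.insync.replicas", "3"));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }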

Having only 3 out of 4 in sync would mean no more HA. Depending on the ops
culture there, that may be considered a warning or critical. I would call it
a warning; if the topic becoming unavailable must be avoided, a warning
requiring immediate attention.

If it dropped to 2/4, then with acks=all that would be critical/fatal.
But according to your KIP you'd probably label that a warning.

I'd suggest considering wording that doesn't introduce ambiguity or
assume interpretations of what should be a warning or more severe.



On Sat, Aug 25, 2018 at 3:46 AM Kevin Lu <lu.ke...@berkeley.edu> wrote:

> Hi All,
>
> I am having some trouble re-formulating this KIP to output partitions that
> are under the configured "min.insync.replicas" as I am not sure how to
> reliably get the configured "min.insync.replicas" in all cases.
>
> The challenge I am facing is when "min.insync.replicas" is configured to a
> non-default value on the broker, and topics are created without
> "min.insync.replicas" specified. Since the topic is created without
> specifying "min.insync.replicas", the value is not saved in ZooKeeper and
> the broker's configured value is used directly.
>
> The TopicCommand hits ZooKeeper, so I am unable to get the configured value
> without querying the brokers somehow...
>
> Example:
> - Start a broker with min.insync.replicas=2 in server.properties
> - Use kafka-topics.sh to create a topic without specifying
> min.insync.replicas
>
> The ZK node /config/ for the created topic will only have direct overrides,
> and will not have the broker's configured min.insync.replicas.
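>
> One option I am considering is to ask a broker directly through
> AdminClient#describeConfigs, which I believe returns the effective topic
> config including values picked up from the broker's static configuration.
> A rough sketch (topic name and bootstrap address are placeholders):
>
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.Config;
> import org.apache.kafka.clients.admin.ConfigEntry;
> import org.apache.kafka.common.config.ConfigResource;
>
> public class EffectiveMinIsr {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>         try (AdminClient admin = AdminClient.create(props)) {
>             ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
>             Config config = admin.describeConfigs(Collections.singleton(topic))
>                 .all().get().get(topic);
>             ConfigEntry minIsr = config.get("min.insync.replicas");
>             // source() should say whether this is a topic override, a broker
>             // config, or the default.
>             System.out.println(minIsr.value() + " (source: " + minIsr.source() + ")");
>         }
>     }
> }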
>
> Any ideas on how to approach this?
>
> Regards,
> Kevin
>
> On Mon, Aug 6, 2018 at 8:21 AM Kevin Lu <lu.ke...@berkeley.edu> wrote:
>
> > Hi Jason,
> >
> > Thanks for the response!
> >
> > I completely agree with you and Mickael about adding a
> > --under-minisr-partitions option to match the existing metric. I will
> > create a separate KIP to discuss the --under-minisr-partitions option. I
> > believe there is a technical challenge with retrieving the
> > min.insync.replicas configuration from ZooKeeper currently, as it may also
> > be stored as a broker configuration, but let me do some digging to
> > confirm.
> >
> > I am going to modify KIP-351 to represent the gap that you have
> > mentioned (exactly at min.isr), as this is an important state that we
> > specifically monitor and alert on.
> >
> > Any other thoughts?
> >
> > Regards,
> > Kevin
> >
> > On Thu, Aug 2, 2018 at 11:23 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> >> Hey Kevin,
> >>
> >> Thanks for the KIP. I like Mickael's suggestion to
> >> add --under-minisr-partitions since it fits with the metric we already
> >> expose. It's also a good question whether there should be a separate
> >> category for partitions which are right at min.isr. I'm reluctant to add
> >> new categories, but I agree there might be a gap at the moment. Say you
> >> have a replication factor of 3 and the min isr is set to 1. Our notion of
> >> URP does not capture the difference between having an ISR down to a size
> >> of 1 and one down to a size of 2. The reason this might be significant
> >> is that a shrink of the ISR down to 2 may just be caused by a rolling
> >> restart or a transient network blip. A shrink to 1, on the other hand,
> >> might be indicative of a more severe problem and could be cause for a
> >> call from pagerduty.
> >>
> >> -Jason
> >>
> >> On Thu, Aug 2, 2018 at 9:28 AM, Kevin Lu <lu.ke...@berkeley.edu> wrote:
> >>
> >> > Hi Mickael,
> >> >
> >> > Thanks for the suggestion!
> >> >
> >> > Correct me if I am mistaken, but if a producer attempts to send to a
> >> > partition that is under min ISR (and acks=all or -1), then the send will
> >> > fail with a NotEnoughReplicas or NotEnoughReplicasAfterAppend exception?
> >> > At this point, the client side has already suffered a failure but the
> >> > server side is still fine for now?
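> >> >
> >> > For reference, a minimal sketch of that failure mode as I understand
> >> > it: a producer with acks=all writing to a partition that is under min
> >> > ISR. The topic name and broker address are placeholders.
> >> >
> >> > import java.util.Properties;
> >> > import org.apache.kafka.clients.producer.KafkaProducer;
> >> > import org.apache.kafka.clients.producer.Producer;
> >> > import org.apache.kafka.clients.producer.ProducerConfig;
> >> > import org.apache.kafka.clients.producer.ProducerRecord;
> >> > import org.apache.kafka.common.errors.NotEnoughReplicasAfterAppendException;
> >> > import org.apache.kafka.common.errors.NotEnoughReplicasException;
> >> > import org.apache.kafka.common.serialization.StringSerializer;
> >> >
> >> > public class MinIsrSendExample {
> >> >     public static void main(String[] args) {
> >> >         Properties props = new Properties();
> >> >         props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
> >> >         props.put(ProducerConfig.ACKS_CONFIG, "all");
> >> >         props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
> >> >         props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
> >> >         try (Producer<String, String> producer = new KafkaProducer<>(props)) {
> >> >             producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
> >> >                 if (exception instanceof NotEnoughReplicasException
> >> >                         || exception instanceof NotEnoughReplicasAfterAppendException) {
> >> >                     // Client-side failure: the partition has fewer in-sync
> >> >                     // replicas than min.insync.replicas, so the write is rejected.
> >> >                     System.err.println("Send failed: " + exception.getMessage());
> >> >                 }
> >> >             });
> >> >             producer.flush();
> >> >         }
> >> >     }
> >> > }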
> >> >
> >> > If the above is true, then this would be a FATAL case for producers.
> >> >
> >> > Would it be valuable to include the CRITICAL case where a topic
> >> > partition has exactly min ISR, so that Kafka operators can take action
> >> > before it becomes FATAL? This could be in the same option or a new one.
> >> >
> >> > Thanks!
> >> >
> >> > Regards,
> >> > Kevin
> >> >
> >> > On Thu, Aug 2, 2018 at 2:27 AM Mickael Maison <mickael.mai...@gmail.com>
> >> > wrote:
> >> >
> >> > > What about also adding a --under-minisr-partitions option?
> >> > >
> >> > > That would match the
> >> > > "kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount"
> >> > > broker metric and it's usually pretty relevant when investigating
> >> > > issues.
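> >> > >
> >> > > If it helps, a quick sketch of checking that metric over JMX; the JMX
> >> > > port and hostname are assumptions and depend on how the broker was
> >> > > started (e.g. with JMX_PORT set):
> >> > >
> >> > > import javax.management.MBeanServerConnection;
> >> > > import javax.management.ObjectName;
> >> > > import javax.management.remote.JMXConnector;
> >> > > import javax.management.remote.JMXConnectorFactory;
> >> > > import javax.management.remote.JMXServiceURL;
> >> > >
> >> > > public class UnderMinIsrMetric {
> >> > >     public static void main(String[] args) throws Exception {
> >> > >         // Assumes the broker exposes JMX on localhost:9999.
> >> > >         JMXServiceURL url =
> >> > >             new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
> >> > >         try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
> >> > >             MBeanServerConnection conn = connector.getMBeanServerConnection();
> >> > >             ObjectName name = new ObjectName(
> >> > >                 "kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount");
> >> > >             // Kafka gauges expose their current value via the "Value" attribute.
> >> > >             System.out.println("UnderMinIsrPartitionCount = " + conn.getAttribute(name, "Value"));
> >> > >         }
> >> > >     }
> >> > > }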
> >> > >
> >> > > On Thu, Aug 2, 2018 at 8:54 AM, Kevin Lu <lu.ke...@berkeley.edu> wrote:
> >> > > > Hi friends!
> >> > > >
> >> > > > This thread is to discuss KIP-351
> >> > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-351+Add+--critical-partitions+option+to+describe+topics+command>!
> >> > > >
> >> > > > I am proposing to add a --critical-partitions option to the describe
> >> > > > topics command that will only list out topic partitions that have 1
> >> > > > ISR left (RF > 1) as they would be in a critical state and need
> >> > > > immediate repartitioning.
> >> > > >
> >> > > > I wonder if the name "critical" is appropriate?
> >> > > >
> >> > > > Thoughts?
> >> > > >
> >> > > > Thanks!
> >> > > >
> >> > > > Regards,
> >> > > > Kevin
> >> > >
> >> >
> >>
> >
>
