Hi,
Apologies for the late response. Thanks for the clarification. Some numbers
would be great to set the foundation.

Regards,
Sushant Mahajan

On Fri, 29 May 2026, 01:47 Muralidhar Basani via dev, <[email protected]>
wrote:

> Hi Sushant, thanks for the questions.
>
> sm01 - Is there any evidence behind the new bounds?
> The bounds are not based on benchmarks. They are based on the discussion in
> PR #21291. A value below ~200 causes the coordinator to write a snapshot
> for nearly every state change, which is not useful for real workloads.
>
> Regarding the ceiling, a higher value means longer replay time and delayed
> log pruning. Since the per-group override lets individual groups configure
> the value independently up to the max (the ceiling), we may not need a very
> high cluster-wide ceiling.
>
> However, as these floor and ceiling values are not measured, I am happy to
> benchmark and reconsider the values before the vote if needed.
>
> sm02 - Guidance for choosing a value?
> Agreed, it would be helpful to have this guidance, but as share groups are
> fairly new, I don't have that operational data. However, I can run
> benchmarks before the vote and update the KIP if needed.
> Regarding metrics, there are a few in ShareCoordinatorMetrics (write-rate,
> write-latency-avg, etc.), but there are no metrics related to
> snapshot-vs-update. I can add a new metric to the KIP or in a follow-up if
> you think it would be useful.
>
> Regarding the default of 500, it is kept as is for backwards
> compatibility. Any cluster that has not touched these configs is
> implicitly using 500 as the default, so changing this value would be a
> change for every cluster that has share groups.
>
> sm03 - Expected disk and recovery impact?
> I don't have any numbers for these, but I can run benchmarks and add the
> results to the KIP before the vote if needed.
>
> Please let me know your thoughts.
>
> Thanks,
> Murali
>
> On Thu, May 28, 2026 at 9:27 AM Sushant Mahajan <[email protected]>
> wrote:
>
> > Hi,
> > Thanks for the great writeup.
> >
> > A few questions, though:
> >
> >
> > sm01 - Is there any evidence behind the new bounds?
> >
> > What data informed [200, 1000] and the floor of 200? Specifically, the
> > measured ShareUpdate vs ShareSnapshot record sizes that make values <200
> > "mostly snapshots," and what motivated 1000 as the ceiling rather than
> > higher?
> >
> > sm02 - Guidance for choosing a value?
> >
> > Could the KIP offer a starting point as a function of observable behavior
> > (records/sec, in-flight count, or __share_group_state write rate), plus
> > which metrics to watch when tuning? Also, what's the rationale for keeping
> > the default at 500 under the new bounds?
> >
> > sm03 - Expected disk and recovery impact?
> >
> > Any rough before/after numbers for moving a high-throughput group from
> > 500 → 1000: disk saved per day and added replay time on restart? A
> > concrete example would help operators weigh the tradeoff.
> >
> > Regards,
> > Sushant Mahajan
> >
> >
> > On Sat, 23 May 2026, 01:02 Muralidhar Basani via dev, <
> > [email protected]>
> > wrote:
> >
> > > Hi all,
> > >
> > > I would like to start a discussion on KIP-1349, which allows
> > > configurable snapshot frequency of share groups.
> > >
> > > KIP:
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1349%3A+Configurable+snapshot+frequency+for+share+groups
> > >
> > > Thanks,
> > > Murali
> > >
> >
>
