+1 (binding)

Regards
Jiwei Guo (Tboy)


On Tue, Sep 16, 2025 at 2:54 PM Yike Xiao wrote:

> +1 (non-binding)
>
> It's very useful. I've left a minor comment on the implementation PR.
>
> On 2025/09/10 18:29:00 PengHui Li wrote:
> > Hi Team,
> >
> > This is the official VOTE thread for PIP-441: Add Broker-Level Metrics
> > for Skipped Non-Recoverable Data
> >
> > Currently, when Pulsar's autoSkipNonRecoverableData feature skips
> > corrupted data to maintain topic availability, there is no visibility
> > into when and how frequently this occurs. This creates operational blind
> > spots: administrators cannot be alerted when data loss happens, have no
> > audit trail for compliance requirements, and cannot distinguish between
> > healthy systems and those silently losing data.
> >
> > Without these metrics, operators cannot determine whether issues are
> > systemic (entire ledgers lost) or localized (partial corruption).
> >
> > Proposed Solution: This PIP proposes adding two new broker-level metrics
> > to the BrokerOperabilityMetrics class:
> >
> >   1. pulsar_broker_non_recoverable_ledgers_skipped_total:
> >       A counter incremented in ManagedLedgerImpl.skipNonRecoverableLedger()
> >       each time an entire ledger is skipped because it is completely
> >       unrecoverable.
> >   2. pulsar_broker_non_recoverable_entries_skipped_total:
> >       A counter incremented in ManagedCursorImpl.skipNonRecoverableEntries()
> >       by the number of entries skipped when only part of a ledger is
> >       corrupted.
> >
> > The broker-level approach avoids the high-cardinality burden that
> > topic-level metrics would place on the metrics system in large clusters.
> > Operators can use these broker-level metrics for alerting and trend
> > monitoring, then rely on existing broker logs for detailed forensic
> > analysis of the specific topics affected.
> >
> > The full proposal is available for review here:
> > https://github.com/apache/pulsar/pull/24716
> >
> > The discussion thread:
> > https://lists.apache.org/thread/b638towc7o4qb8dsozys4c14s00yflfj
> >
> > The implementation PR has been pushed:
> > https://github.com/apache/pulsar/pull/24726
> >
> > Regards,
> > Penghui
> >
>
> Regards,
> Yike
>
