+1 (binding)
Regards
Jiwei Guo (Tboy)

On Tue, Sep 16, 2025 at 2:54 PM Yike Xiao <[email protected]> wrote:

> +1 (non-binding)
>
> It's very useful -- I've left a minor comment on the implementation pr.
>
> On 2025/09/10 18:29:00 PengHui Li wrote:
> > Hi Team,
> >
> > This is the official VOTE thread for PIP-441: Add Broker-Level Metrics for
> > Skipped Non-Recoverable Data
> >
> > Currently, when Pulsar's autoSkipNonRecoverableData feature skips
> > corrupted data to maintain topic availability, there is no visibility into
> > when and how frequently this occurs. This creates operational blind spots
> > where administrators cannot be alerted when data loss happens, have no
> > audit trail for compliance requirements, and cannot distinguish between
> > healthy systems and those silently losing data.
> >
> > Without these metrics, operators cannot determine whether issues are
> > systematic (entire ledgers lost) or localized (partial corruption
> > scenarios).
> >
> > Proposed Solution: This PIP proposes adding two new broker-level metrics to
> > the BrokerOperabilityMetrics class:
> >
> > 1. pulsar_broker_non_recoverable_ledgers_skipped_total:
> >    A counter incremented in ManagedLedgerImpl.skipNonRecoverableLedger()
> >    each time an entire ledger is skipped due to complete
> >    unrecoverability.
> > 2. pulsar_broker_non_recoverable_entries_skipped_total:
> >    A counter incremented in ManagedCursorImpl.skipNonRecoverableEntries()
> >    by the number of entries skipped when only partial ledger corruption
> >    occurs.
> >
> > The broker-level approach avoids adding a high-cardinality burden to the
> > metrics system that would occur with topic-level metrics in large clusters.
> > Operators can use these broker-level metrics for alerting and monitoring
> > trends, then leverage existing broker logs for detailed forensic analysis
> > of specific affected topics.
> >
> > The full proposal is available for review here:
> > https://github.com/apache/pulsar/pull/24716
> >
> > The discussion mailing list:
> > https://lists.apache.org/thread/b638towc7o4qb8dsozys4c14s00yflfj
> >
> > Pushed out the implementation PR:
> > https://github.com/apache/pulsar/pull/24726
> >
> > Regards,
> > Penghui
> >
>
> Regards,
> Yike
>
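
For readers following the thread, the two counters described in the quoted
proposal could be sketched roughly as below. This is only an illustration
under stated assumptions, not the code from the implementation PR: the class
name SkippedDataMetricsSketch and all of its methods are hypothetical, and
only the two metric names are taken from the proposal text. The sketch keeps
one broker-wide counter per metric (no topic labels), which is the
low-cardinality property the proposal argues for.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the two broker-level counters described in PIP-441.
// It does not reproduce the real BrokerOperabilityMetrics class.
class SkippedDataMetricsSketch {
    // Metric names as given in the proposal.
    static final String LEDGERS_SKIPPED =
            "pulsar_broker_non_recoverable_ledgers_skipped_total";
    static final String ENTRIES_SKIPPED =
            "pulsar_broker_non_recoverable_entries_skipped_total";

    // LongAdder handles concurrent increments from many threads cheaply.
    private final LongAdder ledgersSkipped = new LongAdder();
    private final LongAdder entriesSkipped = new LongAdder();

    // Called once per entire ledger that is skipped.
    void recordLedgerSkipped() {
        ledgersSkipped.increment();
    }

    // Called with the number of entries skipped in a partially corrupted ledger.
    void recordEntriesSkipped(long count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        entriesSkipped.add(count);
    }

    long ledgersSkippedTotal() {
        return ledgersSkipped.sum();
    }

    long entriesSkippedTotal() {
        return entriesSkipped.sum();
    }

    // Renders both counters in Prometheus text exposition format.
    String toPrometheusText() {
        return "# TYPE " + LEDGERS_SKIPPED + " counter\n"
                + LEDGERS_SKIPPED + " " + ledgersSkippedTotal() + "\n"
                + "# TYPE " + ENTRIES_SKIPPED + " counter\n"
                + ENTRIES_SKIPPED + " " + entriesSkippedTotal() + "\n";
    }
}
```

In this sketch, a caller playing the role of the ledger-skip path would call
recordLedgerSkipped() once per skipped ledger, and a caller playing the role
of the cursor path would call recordEntriesSkipped(n) with the number of
entries it skipped. Neither counter identifies the affected topic, so the
broker logs remain the place to look for that detail.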
