Lokesh Khurana created PHOENIX-7872:
---------------------------------------
Summary: Add client-side metrics for HA failover, mutation-block, and CRR cache health
Key: PHOENIX-7872
URL: https://issues.apache.org/jira/browse/PHOENIX-7872
Project: Phoenix
Issue Type: Sub-task
Reporter: Lokesh Khurana
Assignee: Lokesh Khurana

The HA client currently emits metrics only for the PARALLEL policy
(HA_PARALLEL_COUNT_*) and for the HA executor pool (taskRejectedCounter,
taskExecutedCounter, taskEndToEndCounter). Nothing on the client side covers
the FAILOVER policy's transitions, the mutation-block path, or the health of
the cluster role record (CRR) cache. Without scraping logs, operators cannot
easily detect failover events, measure failover duration, alert on the
mutation-block hit rate, or diagnose CRR cache staleness.

Proposed metrics (split into two tiers):
Tier 1 — client-side counters mirroring the existing
PhoenixHAGroupMetrics.HAMetricType enum pattern:
┌─────────────────────────────┬───────────┬─────────────────────────────────────────────────────────────────────┐
│ Metric                      │ Type      │ Emission point                                                      │
├─────────────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────┤
│ HA_FAILOVER_COUNT           │ Counter   │ FailoverPhoenixConnection.failover()                                │
├─────────────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────┤
│ HA_FAILOVER_DURATION_MS     │ Histogram │ Same site, around try/finally                                       │
├─────────────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────┤
│ HA_MUTATION_BLOCKED_COUNT   │ Counter   │ MutationState.send catch site for MutationBlockedIOException causes │
├─────────────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────┤
│ HA_STALE_CRR_DETECTED_COUNT │ Counter   │ FailoverPhoenixConnection.wrapActionDuringFailover SCRE catch site  │
└─────────────────────────────┴───────────┴─────────────────────────────────────────────────────────────────────┘
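
For illustration only, the first three Tier 1 emission points could be instrumented roughly as below. This is a sketch, not the proposed patch: HaTier1MetricsSketch, BlockedStandIn, timedFailover and recordIfMutationBlocked are hypothetical names, BlockedStandIn is a local stand-in for MutationBlockedIOException, and the "histogram" is reduced to count/sum/max rather than real buckets. It does not use the PhoenixHAGroupMetrics API.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative holder only; hypothetical, not Phoenix's PhoenixHAGroupMetrics API. */
final class HaTier1MetricsSketch {

    /** Local stand-in for MutationBlockedIOException (assumption made for this sketch). */
    static final class BlockedStandIn extends IOException {
        BlockedStandIn(String message) { super(message); }
    }

    final LongAdder failoverCount = new LongAdder();            // HA_FAILOVER_COUNT
    final LongAdder failoverDurationSamples = new LongAdder();  // HA_FAILOVER_DURATION_MS: samples
    final LongAdder failoverDurationTotalMs = new LongAdder();  // HA_FAILOVER_DURATION_MS: sum
    final AtomicLong failoverDurationMaxMs = new AtomicLong();  // HA_FAILOVER_DURATION_MS: max
    final LongAdder mutationBlockedCount = new LongAdder();     // HA_MUTATION_BLOCKED_COUNT

    /** Counts one failover attempt and records its duration, whether it succeeds or throws. */
    <T> T timedFailover(Callable<T> failoverBody) throws Exception {
        failoverCount.increment();
        long startNanos = System.nanoTime();
        try {
            return failoverBody.call();
        } finally {
            // Runs on both the normal and the exceptional path.
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
            failoverDurationSamples.increment();
            failoverDurationTotalMs.add(elapsedMs);
            failoverDurationMaxMs.accumulateAndGet(elapsedMs, Math::max);
        }
    }

    /** Walks the cause chain; increments the counter once if a blocked-mutation cause is found. */
    boolean recordIfMutationBlocked(Throwable thrown) {
        for (Throwable t = thrown; t != null; t = t.getCause()) {
            if (t instanceof BlockedStandIn) {
                mutationBlockedCount.increment();
                return true;
            }
        }
        return false;
    }
}
```

Recording the duration in the finally block means failed failovers are measured too, which matches the "around try/finally" wording in the table. Walking the cause chain reflects the "MutationBlockedIOException causes" wording, on the assumption that the exception may arrive wrapped.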
Tier 2 — cross-cutting + server-side:
┌──────────────────────────────────┬─────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Metric                           │ Type    │ Emission point                                                                                             │
├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ HA_CRR_REFRESH_COUNT             │ Counter │ HighAvailabilityGroup.refreshClusterRoleRecord()                                                           │
├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ HA_CRR_CACHE_AGE_MS              │ Gauge   │ Sample at every connect()                                                                                  │
├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ HA_POLLER_TICK_COUNT             │ Counter │ GetClusterRoleRecordUtil.schedulePoller lambda                                                             │
├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ HA_POLLER_TICK_FAILURES          │ Counter │ Same site, catch block                                                                                     │
├──────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ HA_BYPASSED_MUTATION_BLOCK_COUNT │ Counter │ Server-side IndexRegionObserver.preBatchMutate when _HAGroupName attribute is absent and mutation proceeds │
└──────────────────────────────────┴─────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
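
As a rough sketch of the client-side Tier 2 rows, the refresh counter, the cache-age gauge, and the poller tick counters might fit together as follows. All names here (CrrCacheHealthSketch, onRefresh, sampleAgeOnConnect, instrumentPollerTick) are hypothetical, and the injected clock is an assumption made to keep the example testable. None of this is the actual HighAvailabilityGroup or GetClusterRoleRecordUtil code.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

/** Illustrative only; hypothetical names, not Phoenix's HA client classes. */
final class CrrCacheHealthSketch {
    private final LongSupplier clockMs;      // injected clock, e.g. System::currentTimeMillis
    private final AtomicLong lastRefreshMs;  // time of the last successful CRR refresh

    final LongAdder refreshCount = new LongAdder();        // HA_CRR_REFRESH_COUNT
    final AtomicLong cacheAgeMs = new AtomicLong();        // HA_CRR_CACHE_AGE_MS (last sample)
    final LongAdder pollerTickCount = new LongAdder();     // HA_POLLER_TICK_COUNT
    final LongAdder pollerTickFailures = new LongAdder();  // HA_POLLER_TICK_FAILURES

    CrrCacheHealthSketch(LongSupplier clockMs) {
        this.clockMs = clockMs;
        this.lastRefreshMs = new AtomicLong(clockMs.getAsLong());
    }

    /** Would be called from the refresh path after a successful refresh. */
    void onRefresh() {
        refreshCount.increment();
        lastRefreshMs.set(clockMs.getAsLong());
    }

    /** Would be called at connect time: stores and returns the current cache age. */
    long sampleAgeOnConnect() {
        long age = clockMs.getAsLong() - lastRefreshMs.get();
        cacheAgeMs.set(age);
        return age;
    }

    /** Wraps a poller tick so every run is counted and failures are counted separately. */
    Runnable instrumentPollerTick(Runnable tick) {
        return () -> {
            pollerTickCount.increment();
            try {
                tick.run();
            } catch (RuntimeException e) {
                // Counted and swallowed so that a periodic schedule keeps running.
                pollerTickFailures.increment();
            }
        };
    }
}
```

Because the gauge is sampled only at connect(), it reports the age seen by the most recent connection rather than a continuously updated value. If the poller runs under ScheduledExecutorService.scheduleAtFixedRate, an exception escaping the task suppresses later executions, which is why the sketch catches the failure inside the wrapper.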
Tier 1 lands first as a self-contained client-side change. Tier 2 builds on
Tier 1 and adds the server-side counter, which needs a shared
IndexRegionObserverMetrics JMX surface.
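
A minimal sketch of the server-side check behind HA_BYPASSED_MUTATION_BLOCK_COUNT is shown below. It assumes the mutation's attributes can be viewed as a plain Map, which is a simplification for illustration. BypassCounterSketch and hasHaGroupOrCountBypass are hypothetical names, and the JMX exposure through IndexRegionObserverMetrics is not shown.

```java
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative only; hypothetical, not the IndexRegionObserver implementation. */
final class BypassCounterSketch {
    static final String HA_GROUP_ATTR = "_HAGroupName";       // attribute name from the table above

    final LongAdder bypassedMutationBlockCount = new LongAdder(); // HA_BYPASSED_MUTATION_BLOCK_COUNT

    /**
     * Returns true when the HA group attribute is present. When it is absent, the
     * mutation would proceed without the mutation-block check, so the bypass is counted.
     */
    boolean hasHaGroupOrCountBypass(Map<String, byte[]> mutationAttributes) {
        if (mutationAttributes.get(HA_GROUP_ATTR) == null) {
            bypassedMutationBlockCount.increment();
            return false;
        }
        return true;
    }
}
```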