[
https://issues.apache.org/jira/browse/PHOENIX-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Palash Chauhan resolved PHOENIX-7884.
-------------------------------------
Resolution: Fixed
> cdcIndexUpdateLag is silent during idle / failure / parent-replay and
> misattributed during ancestor replay
> ----------------------------------------------------------------------------------------------------------
>
> Key: PHOENIX-7884
> URL: https://issues.apache.org/jira/browse/PHOENIX-7884
> Project: Phoenix
> Issue Type: Sub-task
> Affects Versions: 5.3.1
> Reporter: Palash Chauhan
> Assignee: Palash Chauhan
> Priority: Major
> Fix For: 5.4.0, 5.3.2
>
>
> h4. Background
> {{cdcIndexUpdateLag}} is the primary freshness signal for eventually
> consistent secondary indexes. It is registered as a {{MetricHistogram}} in
> {{MetricsIndexCDCConsumerSource}} and is intended to drive freshness SLOs per
> data table per RegionServer.
> h4. Problem
> The metric is emitted at exactly two places — inside {{processCDCBatch}}
> (line 988) and {{processCDCBatchGenerated}} (line 1109) — both inside an {{if
> (!batchMutations.isEmpty())}} block, immediately after a successful non-empty
> batch. This produces three distinct bugs:
> 1. *Silent during idle / failure / startup*. No sample is emitted when:
> * the data table is idle (the main loop sleeps on {{pollIntervalMs}} or
> backoff),
> * batches are repeatedly failing (the catch block only increments
> {{cdcBatchFailureCount}}),
> * the consumer is in {{startupDelayMs}}, {{waitForCDCStreamEntry()}}
> retries, or {{checkTrackerStatus()}} retries.
> 2. *Silent during parent-region replay, exactly when freshness matters
> most*. After a region split/merge, {{run()}} calls
> {{replayAndCompleteParentRegions(...)}} before the main loop starts. During
> this phase — which can take hours on busy tables — the region's own new
> writes accumulate in its CDC partition and are not processed. The lag metric
> reports nothing about that growing backlog. The 15 s
> {{parentProgressPauseMs}} sleeps inside {{processPartitionToCompletion}} are
> also silent.
> 3. *Mis-attribution: parent-replay timestamps pollute the per-data-table
> histogram*. The per-batch {{updateCdcLag}} calls fire from both {{run()}}
> and {{processPartitionToCompletion}}. During parent replay,
> {{newLastTimestamp}} is an ancestor partition's processed timestamp
> (potentially hours/days old). Those samples are tagged with {{dataTableName}}
> and mix into the same histogram that represents this region's own freshness,
> blurring the SLO signal.
> h4. Proposed fix
> Decouple lag _measurement_ from batch completion, and use the consumer's own
> empty-poll signal to distinguish "caught up" from "behind".
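
The proposed fix could be sketched roughly as follows. This is an illustration only, not the actual patch: the class CdcLagTracker and its methods are hypothetical names that do not exist in Phoenix, and the sketch assumes that lag is measured as wall-clock time minus the newest timestamp processed from the region's own partition. A periodic reporter would sample currentLagMs() on its own schedule, so a value is available during idle periods, failures, startup, and parent replay.

```java
/** Hypothetical helper, not a Phoenix class: tracks lag independently of batch completion. */
final class CdcLagTracker {
    // Newest event timestamp (ms) processed from this region's OWN CDC partition.
    private long lastOwnProcessedTs;
    // True when the most recent poll of the own partition returned no rows.
    private boolean caughtUp;

    CdcLagTracker(long startTs) {
        this.lastOwnProcessedTs = startTs;
        this.caughtUp = false;
    }

    /** Call after a successful non-empty batch from the region's own partition only. */
    synchronized void onOwnBatchProcessed(long newestEventTs) {
        lastOwnProcessedTs = Math.max(lastOwnProcessedTs, newestEventTs);
        caughtUp = false;
    }

    /** Call when a poll of the own partition returns no rows ("caught up"). */
    synchronized void onEmptyPoll() {
        caughtUp = true;
    }

    /** Call from a periodic reporter; never depends on a batch having completed. */
    synchronized long currentLagMs(long nowMs) {
        if (caughtUp) {
            return 0L;
        }
        // Clamp at zero in case the caller's clock is behind the event timestamp.
        return Math.max(0L, nowMs - lastOwnProcessedTs);
    }

    public static void main(String[] args) {
        CdcLagTracker tracker = new CdcLagTracker(1_000L);
        System.out.println(tracker.currentLagMs(4_000L)); // 3000: no batch yet, lag still visible
        tracker.onOwnBatchProcessed(3_500L);
        System.out.println(tracker.currentLagMs(4_000L)); // 500: behind by half a second
        tracker.onEmptyPoll();
        System.out.println(tracker.currentLagMs(9_000L)); // 0: empty poll means caught up
    }
}
```

Because ancestor-partition batches never call onOwnBatchProcessed() in this sketch, old parent-replay timestamps stay out of the sample, while the reported lag keeps growing with wall-clock time until the region processes its own backlog.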