[ 
https://issues.apache.org/jira/browse/PHOENIX-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Palash Chauhan resolved PHOENIX-7884.
-------------------------------------
    Resolution: Fixed

> cdcIndexUpdateLag is silent during idle / failure / parent-replay and 
> misattributed during ancestor replay
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-7884
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7884
>             Project: Phoenix
>          Issue Type: Sub-task
>    Affects Versions: 5.3.1
>            Reporter: Palash Chauhan
>            Assignee: Palash Chauhan
>            Priority: Major
>             Fix For: 5.4.0, 5.3.2
>
>
> h4. Background
> {{cdcIndexUpdateLag}} is the primary freshness signal for eventually 
> consistent secondary indexes. It is registered as a {{MetricHistogram}} in 
> {{MetricsIndexCDCConsumerSource}} and is intended to drive freshness SLOs per 
> data table per RegionServer.
> h4. Problem
> The metric is emitted at exactly two places — inside {{processCDCBatch}} 
> (line 988) and {{processCDCBatchGenerated}} (line 1109) — both inside an {{if 
> (!batchMutations.isEmpty())}} block, immediately after a successful non-empty 
> batch. This produces three distinct bugs:
> 1. {*}Silent during idle / failure / startup{*}. No sample is emitted when:
>  * the data table is idle (the main loop sleeps on {{pollIntervalMs}} or 
> backoff),
>  * batches are repeatedly failing (the catch block only increments 
> {{cdcBatchFailureCount}}),
>  * the consumer is in {{startupDelayMs}}, {{waitForCDCStreamEntry()}} 
> retries, or {{checkTrackerStatus()}} retries.
> 2. {*}Silent during parent-region replay, exactly when freshness matters 
> most{*}. After a region split/merge, {{run()}} calls 
> {{replayAndCompleteParentRegions(...)}} before the main loop starts. During 
> this phase — which can take hours on busy tables — the region's own new 
> writes accumulate in its CDC partition and are not processed. The lag metric 
> reports nothing about that growing backlog. The 15 s 
> {{parentProgressPauseMs}} sleeps inside {{processPartitionToCompletion}} are 
> also silent.
> 3. {*}Mis-attribution: parent-replay timestamps pollute the per-data-table 
> histogram{*}. The per-batch {{updateCdcLag}} calls fire from both {{run()}} 
> and {{processPartitionToCompletion}}. During parent replay, 
> {{newLastTimestamp}} is an ancestor partition's processed timestamp 
> (potentially hours/days old). Those samples are tagged with {{dataTableName}} 
> and mix into the same histogram that represents this region's own freshness, 
> blurring the SLO signal.
> h4. Proposed fix
> Decouple lag _measurement_ from batch completion, and use the consumer's own 
> empty-poll signal to distinguish "caught up" from "behind".
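
One way to picture the proposed decoupling is the minimal sketch below. It is not the committed patch: the class {{CdcLagSampler}} and all of its method names are hypothetical and do not appear in the Phoenix code base. It assumes that an empty poll of the region's own CDC partition at time T means everything up to T has been indexed, which is a simplification. Lag is computed on demand from the region's own last processed timestamp, so a periodic sampler can keep emitting values during idle periods, failures, startup delays and parent replay, and ancestor-partition timestamps never feed into it.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper for illustration; not part of Phoenix. */
final class CdcLagSampler {
    // Newest timestamp (ms) known to be fully indexed for this region's OWN partition.
    private final AtomicLong ownLastProcessedTs;
    // True only while the most recent poll of the own partition came back empty.
    private final AtomicBoolean caughtUp = new AtomicBoolean(false);

    /** startTs is an assumed starting point, e.g. a persisted tracker timestamp. */
    CdcLagSampler(long startTs) {
        this.ownLastProcessedTs = new AtomicLong(startTs);
    }

    /** Main loop applied a non-empty batch from the region's own partition. */
    void onOwnBatchProcessed(long newLastTimestamp) {
        // Never move backwards, so an out-of-order report cannot inflate the lag.
        ownLastProcessedTs.accumulateAndGet(newLastTimestamp, Math::max);
        caughtUp.set(false);
    }

    /** Main loop polled the own partition at nowMs and found nothing to process. */
    void onEmptyPoll(long nowMs) {
        // Assumption: nothing older than nowMs is pending, so advance the watermark.
        // This keeps a long idle period from showing up as lag after a later failure.
        ownLastProcessedTs.accumulateAndGet(nowMs, Math::max);
        caughtUp.set(true);
    }

    /** Batch failure, startup wait, or parent replay: "caught up" can no longer be claimed. */
    void onNotPolling() {
        caughtUp.set(false);
    }

    /** Called on a timer, independent of batch completion. */
    long sampleLagMs(long nowMs) {
        if (caughtUp.get()) {
            return 0L; // idle table: report zero instead of staying silent
        }
        return Math.max(0L, nowMs - ownLastProcessedTs.get());
    }
}
```

In such a design, batches replayed from parent or ancestor partitions would simply not call {{onOwnBatchProcessed}}, so the sampled value keeps growing while the region's own backlog accumulates. A scheduled task could feed {{sampleLagMs(...)}} into the existing histogram at a fixed interval.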



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
