Himanshu Gwalani created PHOENIX-7938:
-----------------------------------------
Summary: Fix consistency point calculation for sync replication
replay when files within a round are processed out of order
Key: PHOENIX-7938
URL: https://issues.apache.org/jira/browse/PHOENIX-7938
Project: Phoenix
Issue Type: Sub-task
Reporter: Himanshu Gwalani
Assignee: Himanshu Gwalani
A primary-side HA fix (Ritesh Garg) enables a direct ANISTS → AISTS transition.
On the standby, this shows up as a local HAGroupState transition
DEGRADED_STANDBY → STANDBY_TO_ACTIVE that skips STANDBY. The replay side has no
listener for this path today: replicationReplayState stays at DEGRADED, no
rewind to lastRoundInSync happens, and shouldTriggerFailover() (line 493)
blocks promotion forever because it requires state == SYNC.
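To make the blocking condition concrete, here is a minimal, hypothetical model of the replay-side gate as described above. ReplayGateSketch, its nested State enum, and onDirectPromotionBeforeFix() are illustrative names, not Phoenix code. The failoverPending check inside shouldTriggerFailover() is an assumption; the issue only states that the gate requires state == SYNC.

```java
// Hypothetical, simplified model of the replay-side gate; not Phoenix code.
final class ReplayGateSketch {
    enum State { SYNC, DEGRADED, SYNCED_RECOVERY }

    private volatile State state = State.DEGRADED;
    private volatile boolean failoverPending = false;

    // Assumed shape of the gate: promotion needs a pending failover AND state == SYNC.
    boolean shouldTriggerFailover() {
        return failoverPending && state == State.SYNC;
    }

    // Pre-fix behavior on DEGRADED_STANDBY -> STANDBY_TO_ACTIVE: the failover
    // signal is raised, but nothing moves the state out of DEGRADED and no
    // rewind to lastRoundInSync is scheduled.
    void onDirectPromotionBeforeFix() {
        failoverPending = true;
    }

    State state() {
        return state;
    }
}
```

Nothing on the direct path moves the state out of DEGRADED, so in this model the gate never opens. That matches the "blocked forever" behavior described above.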
File:
phoenix-core-server/src/main/java/org/apache/phoenix/replication/reader/ReplicationLogDiscoveryReplay.java
*Fix on three fronts (all must land together):*
1. triggerFailoverListener (148-160): add
replicationReplayState.compareAndSet(DEGRADED, SYNCED_RECOVERY) before
failoverPending.set(true). The CAS is conditional so that the happy-path
STANDBY → STANDBY_TO_ACTIVE transition does not pay for a redundant rewind;
failoverPending.set(true) still runs unconditionally, so the failover signal is
never lost.
2. initializeLastRoundProcessed() (215-263): add a parallel branch for
STANDBY_TO_ACTIVE. When the reader restarts in this state and
lastSyncStateTimeInMs indicates a prior DEGRADED state, this branch initializes
lastRoundInSync from lastSyncStateTimeInMs and sets the state to
SYNCED_RECOVERY. Without it, a restart after the direct transition silently
skips the files between the pre-crash sync point and the crash, and promotion
proceeds with a hole in the replayed data.
3. Declare lastRoundProcessed and lastRoundInSync volatile to close a
visibility gap between the ZK watcher thread and the scheduler thread. The new
path makes this gap easier to hit.
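Taken together, the three fixes might look roughly like the sketch below. It is a hypothetical, single-class simplification, not the actual ReplicationLogDiscoveryReplay code. ReplayFixSketch, initializeForStandbyToActive, roundDurationMs, and the round-flooring arithmetic are all assumptions. The round fields are modeled as round-start timestamps (long), whereas the real fields may be objects. How the reader decides "prior DEGRADED" from lastSyncStateTimeInMs is not spelled out in the issue, so it is passed in as a boolean.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical simplification of the replay reader; not Phoenix code.
final class ReplayFixSketch {
    enum State { SYNC, DEGRADED, SYNCED_RECOVERY }

    final AtomicReference<State> replicationReplayState;
    final AtomicBoolean failoverPending = new AtomicBoolean(false);

    // Fix 3: volatile guarantees a write on the ZK watcher thread is visible
    // to the scheduler thread (visibility only, not atomic read-modify-write).
    volatile long lastRoundProcessed;
    volatile long lastRoundInSync;

    ReplayFixSketch(State initial) {
        replicationReplayState = new AtomicReference<>(initial);
    }

    // Fix 1: body of the listener fired on entry to STANDBY_TO_ACTIVE.
    void triggerFailoverListener() {
        // Moves DEGRADED -> SYNCED_RECOVERY only; from SYNC this is a no-op,
        // so the STANDBY -> STANDBY_TO_ACTIVE path does no redundant rewind.
        replicationReplayState.compareAndSet(State.DEGRADED, State.SYNCED_RECOVERY);
        // Always set, so the failover signal is never lost.
        failoverPending.set(true);
    }

    // Fix 2: restart-time branch when the reader starts in STANDBY_TO_ACTIVE.
    // priorDegraded stands in for the real check on lastSyncStateTimeInMs.
    void initializeForStandbyToActive(long lastSyncStateTimeInMs,
                                      boolean priorDegraded,
                                      long roundDurationMs) {
        if (!priorDegraded) {
            return; // leave the existing initialization path untouched
        }
        // Start of the round containing the last in-sync timestamp (assumed alignment).
        lastRoundInSync = lastSyncStateTimeInMs - (lastSyncStateTimeInMs % roundDurationMs);
        replicationReplayState.set(State.SYNCED_RECOVERY);
    }
}
```

Using compareAndSet rather than an unconditional set keeps the listener harmless on the SYNC path. If it fires while the reader is already SYNC, the only thing that changes is failoverPending.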
*Dependencies:* Coordinate landing with Ritesh's primary-side change that widens
HAGroupStoreRecord.HAGroupState.DEGRADED_STANDBY.allowedTransitions (currently
{STANDBY}), and with the writer signaling for ANISTS → AISTS. Neither side ships
the new transition without the other.
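On the standby-state side, the dependency amounts to widening one entry of the transition table. The sketch below is a hypothetical model of that shape, not the real HAGroupStoreRecord.HAGroupState definition. HaStateSketch and canTransitionTo are illustrative names, and HaStateSketch models only the three standby-side states named in this issue. The STANDBY entry is an assumption, and other transitions are omitted.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical subset of HAGroupState; not the real HAGroupStoreRecord code.
enum HaStateSketch {
    STANDBY, DEGRADED_STANDBY, STANDBY_TO_ACTIVE;

    Set<HaStateSketch> allowedTransitions() {
        switch (this) {
            case DEGRADED_STANDBY:
                // Today: only STANDBY. Widened: the direct promotion path is added.
                return EnumSet.of(STANDBY, STANDBY_TO_ACTIVE);
            case STANDBY:
                // Assumed for this sketch: degradation and the happy-path promotion.
                return EnumSet.of(DEGRADED_STANDBY, STANDBY_TO_ACTIVE);
            default:
                // Transitions out of STANDBY_TO_ACTIVE (e.g. ABORT_TO_STANDBY) are omitted.
                return EnumSet.noneOf(HaStateSketch.class);
        }
    }

    boolean canTransitionTo(HaStateSketch target) {
        return allowedTransitions().contains(target);
    }
}
```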
*Tests:* two listener unit cases (CAS fires from DEGRADED; no-op from SYNC), a
full IT for the direct path with files in OUT, a restart IT for a crash
mid-transition, an end-to-end cycle IT including an ABORT_TO_STANDBY retry, and
an update to HAGroupStoreRecordTest.testHAGroupStateValidTransitions.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)