Lokesh Khurana created PHOENIX-7870:
---------------------------------------
Summary: GetClusterRoleRecordUtil: per-HA-group poller futures + url1/url2 alternation
Key: PHOENIX-7870
URL: https://issues.apache.org/jira/browse/PHOENIX-7870
Project: Phoenix
Issue Type: Sub-task
Reporter: Lokesh Khurana
Assignee: Lokesh Khurana
GetClusterRoleRecordUtil has two related bugs in its non-active poller logic.
Bug 1: Cross-group cancel collision via shared static pollerFuture
The class declares a single static volatile ScheduledFuture<?> pollerFuture
field that every call to schedulePoller overwrites, regardless of the HA
group name. The companion schedulerMap is correctly keyed by HA group name,
but the future itself is not. When two HA groups poll concurrently, the
second group's schedulePoller overwrites pollerFuture with its own future.
When the first group later reaches its cancel-on-active branch, it calls
pollerFuture.cancel(false) and cancels the second group's future instead of
its own. The first group's poller is left orphaned: still running on the
scheduler, but no longer tracked, so it can never be cancelled cleanly. The
affected group's CRR cache stops refreshing, and the client keeps routing to
the last-known active cluster even after the operator promotes a new active.
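One possible shape for a fix, sketched under assumptions: track the future in
a map keyed by HA group name, mirroring how schedulerMap is keyed. The class
PerGroupPollers and its methods cancelPoller and isPolling are hypothetical
names for illustration, not the actual Phoenix code, and the real
schedulePoller signature may differ.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not the Phoenix class): one tracked future per HA group.
final class PerGroupPollers {
    private final ScheduledExecutorService scheduler;
    // Keyed by HA group name, so one group can never cancel another's poller.
    private final ConcurrentMap<String, ScheduledFuture<?>> futures =
        new ConcurrentHashMap<>();

    PerGroupPollers(ScheduledExecutorService scheduler) {
        this.scheduler = scheduler;
    }

    /** Starts the poller for one HA group, cancelling any earlier one for it. */
    void schedulePoller(String haGroupName, Runnable task, long periodMs) {
        ScheduledFuture<?> next =
            scheduler.scheduleWithFixedDelay(task, 0, periodMs, TimeUnit.MILLISECONDS);
        ScheduledFuture<?> previous = futures.put(haGroupName, next);
        if (previous != null) {
            previous.cancel(false); // avoid orphaning the replaced poller
        }
    }

    /** Cancels only the named group's poller; returns true if one was tracked. */
    boolean cancelPoller(String haGroupName) {
        ScheduledFuture<?> future = futures.remove(haGroupName);
        if (future == null) {
            return false;
        }
        future.cancel(false);
        return true;
    }

    /** True while the named group has a tracked, non-cancelled poller. */
    boolean isPolling(String haGroupName) {
        ScheduledFuture<?> future = futures.get(haGroupName);
        return future != null && !future.isCancelled();
    }
}
```

With this layout, the cancel-on-active branch for one group would call
cancelPoller with its own group name and leave every other group's future
untouched.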
Bug 2: Poller pins to a single URL with no alternation or peer fallback
schedulePoller accepts a single url parameter, and the polling lambda closes
over it. Every tick calls getClusterRoleRecord(url, ...) against the same
URL. There is no alternation between url1 and url2, and no fallback on
SQLException. If the cluster behind the bound URL becomes unreachable after
the poller starts, every tick throws and the poller never recovers: no
peer-side check happens, even when the peer cluster is healthy and would
correctly report the new role.
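A minimal sketch of the alternation-plus-fallback behaviour this bug calls
for, under stated assumptions: AlternatingFetcher, its Fetch interface, and
pollOnce are hypothetical names, and Fetch stands in for a call such as
getClusterRoleRecord(url, ...), whose full signature is not shown in this
issue.

```java
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: alternate the first URL per tick, fall back to the peer.
final class AlternatingFetcher<T> {
    /** Stand-in for the real lookup call; assumed to throw SQLException. */
    interface Fetch<T> {
        T fetch(String url) throws SQLException;
    }

    private final String url1;
    private final String url2;
    private final Fetch<T> fetch;
    private final AtomicLong tick = new AtomicLong();

    AlternatingFetcher(String url1, String url2, Fetch<T> fetch) {
        this.url1 = url1;
        this.url2 = url2;
        this.fetch = fetch;
    }

    /** One poll tick: try this tick's first URL, then the other one on failure. */
    T pollOnce() throws SQLException {
        boolean even = (tick.getAndIncrement() % 2) == 0;
        String first = even ? url1 : url2;
        String second = even ? url2 : url1;
        try {
            return fetch.fetch(first);
        } catch (SQLException firstFailure) {
            try {
                return fetch.fetch(second); // peer-side check
            } catch (SQLException secondFailure) {
                // Both clusters failed; keep the first failure for diagnosis.
                secondFailure.addSuppressed(firstFailure);
                throw secondFailure;
            }
        }
    }
}
```

In this sketch an unreachable cluster costs at most one failed call per tick,
and the healthy peer still gets a chance to report the new role.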