xiangfu0 opened a new pull request, #18258:
URL: https://github.com/apache/pinot/pull/18258

   ## Summary
   
   Adds a controller-side view of per-instance Pinot versions and an advisory 
rollout-order compatibility check — reusing the `pinotVersion` field already 
present in Helix `InstanceConfig`. No new wire protocol or controller→server 
RPC; the controller reads ZK only and caches snapshots.
   
   - New REST endpoints: `GET /cluster/versions` (optionally 
`?componentType=SERVER`) and `GET /cluster/compatibility`.
   - `GET /instances/{instanceName}` now includes a `"version"` field.
   - New periodic task `ClusterVersionHealthCheckTask` logs WARN when the 
rollout-order invariant is violated. Cluster-level (extends `BasePeriodicTask`, 
gated by `isLeaderForTable(TASK_NAME)`), so it runs even with zero tables.
   
   Rollout-order invariant is a DAG, not a sliding window:
   
   ```
   min(BROKER)  <= min(CONTROLLER)
   min(SERVER)  <= min(BROKER) (or min(CONTROLLER) if no live brokers)
   min(MINION)  <= min(BROKER) (or min(CONTROLLER) if no live brokers)
   ```
   
   SERVER and MINION are peers beneath BROKER. All findings are advisory — no 
operation is blocked.
   
   ## Design notes
   
   - **Live-only min/max**: dead InstanceConfigs are tracked separately as 
`offlineInstanceCount` so stale entries from a decommissioned host cannot drag 
the cluster min backwards during a rolling upgrade.
   - **TTL cache (default 30s)**: one Helix read per expiry. TTL is 
hot-reloadable via Helix cluster config 
(`controller.versionHealthCheck.cacheTtlSeconds`), clamped to `[1s, 1h]`.
   - **Failure handling**: a failed Helix read still publishes an "unavailable" 
snapshot with a fresh timestamp, so the TTL backoff prevents a thundering herd 
during a sustained ZK outage.
   - **Unknown prefixes are surfaced** rather than bucketed into SERVER 
(stricter than `InstanceTypeUtils.isServer`'s catch-all) so operators see 
genuine data-quality issues.
   - **`UNKNOWN` versions** sort less than any parseable version and are 
reported separately in the warnings list — they never drag the cluster min 
downward.
   - **Authorization**: reuses the existing `Actions.Cluster.GET_VERSION` 
action.
   - **Config**: two new properties on `ControllerConf`:
     - `controller.versionHealthCheck.frequencyPeriod` (default `5m`)
     - `controller.versionHealthCheck.cacheTtlSeconds` (default `30`)
   
   ## Test plan
   
   - [x] `VersionParsingUtilsTest` — 10 unit tests covering release / SNAPSHOT 
/ rcN parsing, numeric `rc10 > rc2` ordering, `UNKNOWN`-as-least semantics, 
malformed inputs, and `isAtLeast` / `isSnapshot` / `isKnown` helpers.
   - [x] `spotless:apply`, `checkstyle:check`, `license:check` pass on 
`pinot-common` and `pinot-controller`.
   - [x] Full compile of `pinot-controller` (with transitive deps) succeeds.
   - [ ] Manual: spin up a quickstart cluster, hit `/cluster/versions` and 
`/cluster/compatibility`, and verify JSON shape and the new `version` field on 
`/instances/{name}`.
   - [ ] Manual: with a mixed-version cluster, confirm 
`ClusterVersionHealthCheckTask` logs WARN on rollout violations and the 
compatibility endpoint reflects them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to