jmsperu opened a new pull request, #13076: URL: https://github.com/apache/cloudstack/pull/13076
## Summary

The LINSTOR plugin treats a successful HTTP response from `resourceDefinitionDelete` as proof the resource is gone and immediately drops the volume from CloudStack's accounting. In practice LINSTOR can return success while the resource lingers in DELETING state, for example when a DRBD peer is unreachable, quorum was lost, or a satellite is down. The plugin has no retry, no verification, and no sweeper.

We've found hundreds of stuck DELETING resources accumulated over weeks because nothing surfaces the divergence between the CS view and the LINSTOR view.

## What this PR does

Adds two helpers to `LinstorUtil`:

- `isResourceDefinitionGone(api, rscName)`: existence check via `resourceDefinitionList`
- `waitForResourceDefinitionDeleted(api, rscName, timeoutMillis)`: polls every 1s until the resource is gone OR the timeout elapses; returns `true` on confirmed-gone, `false` on timeout

Calls `waitForResourceDefinitionDeleted` from both delete paths:

- `LinstorPrimaryDataStoreDriverImpl.deleteResourceDefinition` (driver / management-server side)
- `LinstorStorageAdaptor.deRefOrDeleteResource` (host / KVM agent side)

Default timeout is 30 seconds (`DEFAULT_RD_DELETE_VERIFY_TIMEOUT_MILLIS`).

## Behaviour

- **Resource deletes within 30s** (the normal case): no log change, no behaviour change.
- **Resource still in DELETING after 30s**: emits a WARN naming the resource and pointing the operator at `linstor resource list`. Returns success to the caller (the CS-side accounting has already moved on; throwing here would create a different inconsistency).
- **Controller transiently unreachable during poll**: debug-logs each failed poll, keeps trying until the deadline.

## What this does NOT do (deferred)

- A Tier-2 sweeper that periodically re-attempts delete on long-stuck resources is not in this PR. The Tier-1 surface here is the minimum needed to make the problem visible in operator logs. A sweeper can land separately if maintainers want it.

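
For reviewers who want the shape of the verification loop without opening the diff, here is a rough standalone sketch of the poll-until-gone behaviour described above. It is not the code in this PR: `DeleteVerifySketch` and `GoneCheck` are hypothetical names, and `GoneCheck` stands in for the `resourceDefinitionList`-based existence check so the example runs without a LINSTOR client. The poll interval is a parameter here only to keep the demo fast; the PR describes a fixed 1s interval.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the LinstorUtil implementation.
final class DeleteVerifySketch {

    /** Stand-in for the existence lookup; may throw if the controller is unreachable. */
    @FunctionalInterface
    interface GoneCheck {
        boolean isGone() throws Exception;
    }

    /**
     * Polls until the check reports the resource gone or the timeout elapses.
     * Returns true on confirmed-gone, false on timeout or interruption.
     */
    static boolean waitForDeleted(GoneCheck check, long timeoutMillis, long pollIntervalMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            try {
                if (check.isGone()) {
                    return true;
                }
            } catch (Exception e) {
                // A failed poll is treated as "not yet confirmed": log and keep trying.
                System.err.println("DEBUG poll failed, will retry: " + e.getMessage());
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            long remainingMillis = Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(pollIntervalMillis, remainingMillis));
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated resource that disappears on the third poll.
        int[] polls = {0};
        boolean gone = waitForDeleted(() -> ++polls[0] >= 3, 2_000, 10);
        System.out.println("confirmed gone: " + gone);

        // Simulated stuck resource: warn, but do not fail the caller.
        if (!waitForDeleted(() -> false, 50, 10)) {
            System.out.println("WARN resource still present after timeout; check `linstor resource list`");
        }
    }
}
```

The caller in `main` mirrors the behaviour listed above: a timeout produces a warning rather than an exception, because the CloudStack-side accounting has already moved on.
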
## Test plan

- [ ] CI build + unit tests pass (no public API changes; only behaviour additions in private methods)
- [ ] Manual: delete a volume on a healthy LINSTOR cluster. Expect no new logs, and the call should return as soon as the first poll confirms the resource is gone rather than waiting out the 30s timeout
- [ ] Manual: delete a volume on a cluster with a downed satellite. Observe the new WARN line in the agent log and confirm the resource is still in `linstor resource list` until manually cleared
