rhtyd commented on pull request #5552:
URL: https://github.com/apache/cloudstack/pull/5552#issuecomment-941998947
Thanks @mlsorensen for the PR, I may get back to reviewing it soon.
While the PR is a good first step, I think there are some fundamental
RPC/programming model limitations, and it may be worth discussing how to
address these deficiencies and the tech debt. Pasting some thoughts I shared
on the ML too:
```
I think you've hit on a fundamental RPC programming model issue in
CloudStack, wherein the communication from the management server (control
plane) to agents is unidirectional and the management server does not know how
to process a response outside of the immediate thread of context.
This limitation is clearly visible and causes side effects for long-running
Commands whose Answer has not yet been sent back when the control plane becomes
unavailable or restarts; since the management server no longer has the thread
of context, any Answers sent back are ignored. Furthermore, when this happens
agents get disconnected but continue to process all commands before
reconnecting to the management server. This is a more serious problem for
connected agents (such as KVM agents, SSVM/CPVM agents) than for direct agents
(those for VMware/XenServer etc.), as direct agents are killed/stopped with the
management server. The general side effects include resources that were created
but later ignored (for example, snapshots that then require manual cleanup).
In the past I've had discussions with colleagues (both at work and in the
community) and my recollection is this can be solved with: (brain dump of ideas
and thoughts, some from old conversations, and some new)
* Refactor long-running Commands: introduce a new child/abstract class or
interface that separates normal Commands from long-running Commands, so that we
know which commands are long-running and should have special handlers. (Off the
top of my head, Commands that do any storage work, such as taking a snapshot,
are long-running.)
* Rolling ownership: safely delegate ownership to another management
server, passing along the context needed to handle an Answer for a set of
long-running Commands (usually a Java method/class which is the handler,
perhaps using DB + reflection)
* Bi-directional communication, message-bus based handlers: just like
we have the Command-Answer pattern, we perhaps need a new RPC mechanism that is
bi-directional and secured (with the CA framework), where agents can both
announce streaming progress of some task (say, a template download) and
support long-running tasks/Answers that aren't ignored when the control plane
is unavailable.
* I had some thoughts around having a plugin-framework based embedded
locking service within CloudStack (so turnkey, and doesn't require separate
infra, brokers etc.) that implements both (a) a lock server (so it would
replace the MySQL DB based GLOBAL_LOCK() too) and (b) a distributed message bus
which can be used to store/update/delete/announce/queue tasks. This sort of
locking/message bus framework can be implemented via plugins that are backed
by, say, MySQL/DB, embedded ZooKeeper, or Hazelcast. We did a PoC as part of an
internal hackathon in the past
(https://github.com/shapeblue/cloudstack/tree/locking-service).
* Maybe a more modern approach would be to look at how other projects
are solving this problem, maybe explore other RPC frameworks such as gRPC.
```
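To make the first two ideas above a bit more concrete, here is a rough Java
sketch of a marker interface for long-running Commands combined with a shared
"pending" store that lets a different management server pick up the Answer.
Everything in it is hypothetical and simplified for illustration: the
`LongRunningCommand` interface, `PendingStore` (an in-memory stand-in for a DB
table), `ManagementServer`, and the `snapshot-done` handler key are assumptions,
not existing CloudStack classes or APIs.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: none of these types are CloudStack's real classes.
final class LongRunningCommandSketch {

    /** Simplified stand-in for a Command sent to an agent. */
    interface Command {
        String name();
    }

    /** Marker for Commands whose Answer may outlive the sending thread. */
    interface LongRunningCommand extends Command {
        /** Durable key naming the handler that should process the Answer. */
        String handlerKey();
    }

    static final class SnapshotCommand implements LongRunningCommand {
        private final String volume;
        SnapshotCommand(String volume) { this.volume = volume; }
        @Override public String name() { return "snapshot:" + volume; }
        @Override public String handlerKey() { return "snapshot-done"; }
    }

    static final class Answer {
        final String commandId;
        final boolean success;
        Answer(String commandId, boolean success) {
            this.commandId = commandId;
            this.success = success;
        }
    }

    /** In-memory stand-in for a DB table shared by all management servers. */
    static final class PendingStore {
        private final Map<String, String> handlerKeyByCommandId = new ConcurrentHashMap<>();

        void save(String commandId, String handlerKey) {
            handlerKeyByCommandId.put(commandId, handlerKey);
        }

        /** Removes and returns the row, so only one server handles the Answer. */
        Optional<String> claim(String commandId) {
            return Optional.ofNullable(handlerKeyByCommandId.remove(commandId));
        }
    }

    static final class ManagementServer {
        private final String id;
        private final PendingStore store;
        private final Map<String, Function<Answer, String>> handlers = new ConcurrentHashMap<>();

        ManagementServer(String id, PendingStore store) {
            this.id = id;
            this.store = store;
        }

        void registerHandler(String key, Function<Answer, String> handler) {
            handlers.put(key, handler);
        }

        /** Records long-running Commands in the shared store before dispatch. */
        void send(String commandId, Command command) {
            if (command instanceof LongRunningCommand) {
                store.save(commandId, ((LongRunningCommand) command).handlerKey());
            }
            // Actual dispatch to the agent is omitted in this sketch.
        }

        /** Handles an Answer on whichever server receives it, if a row is pending. */
        Optional<String> onAnswer(Answer answer) {
            return store.claim(answer.commandId)
                    .map(handlers::get)
                    .map(handler -> id + ": " + handler.apply(answer));
        }
    }

    public static void main(String[] args) {
        PendingStore store = new PendingStore();
        ManagementServer ms1 = new ManagementServer("ms1", store);
        ManagementServer ms2 = new ManagementServer("ms2", store);
        ms2.registerHandler("snapshot-done",
                a -> "snapshot " + (a.success ? "kept" : "cleaned up"));

        ms1.send("job-1", new SnapshotCommand("vol-42"));
        // ms1 restarts here; the agent later reports to ms2 instead.
        System.out.println(ms2.onAnswer(new Answer("job-1", true)).orElse("ignored"));
        // A second Answer for the same id finds no pending row.
        System.out.println(ms2.onAnswer(new Answer("job-1", true)).orElse("ignored"));
    }
}
```

The point of the sketch is only that the handler is looked up by a durable key
rather than held in the sending thread, so the Answer can still be processed
after the original management server goes away. A real implementation would
also need persistence, timeouts, and authentication of the reporting agent.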