rhtyd commented on pull request #5552:
URL: https://github.com/apache/cloudstack/pull/5552#issuecomment-941998947


   Thanks @mlsorensen for the PR, I may get back on PR review soon.
   
   While the PR is a good first step, I think there are some fundamental 
RPC/programming model limitations, and it may be worth discussing how to address 
these deficiencies and tech debt. Pasting some thoughts I shared on the ML too:
   
   ```
   I think you've hit on a fundamental RPC programming model issue in 
CloudStack, wherein communication from the management server (control plane) 
to agents is uni-directional and the management server does not know how to 
process a response outside of the immediate thread of context.
   
   This limitation is clearly visible and causes side effects for long-running 
Commands: if the control plane becomes unavailable or restarts before the 
Answer is sent back, the management server no longer has the thread of 
context, so any Answer that arrives later is ignored. Furthermore, when this 
happens agents get disconnected but continue to process all commands before 
reconnecting to the management server. This is a more serious problem for 
connected agents (such as KVM agents, ssvm/cpvm agents) than for direct agents 
(those for VMware/XenServer etc.), as direct agents are killed/stopped with the 
management server. The general side effects include resources that were created 
but later ignored (for example, snapshots that then require manual cleanup).
   
   In the past I've had discussions with colleagues (both at work and in the 
community) and my recollection is this can be solved with: (brain dump of ideas 
and thoughts, some from old conversations, and some new)
   
     *   Refactor long-running Commands: introduce a new child/abstract class or 
interface that separates normal Commands from long-running Commands, so that we 
know which commands are long-running and should have special handlers. (Off the 
top of my head, Commands that do any storage work, such as taking a snapshot, 
are long-running.)
     *   Rolling ownership: safely delegate ownership to another management 
server, passing along the context for handling an Answer for a set of 
long-running Commands (usually a Java method/class which is the handler, 
perhaps using DB + reflection)
     *   Bi-directional communication, message-bus based handlers: just as 
we have the Command-Answer pattern, we perhaps need a new RPC mechanism that is 
bi-directional and secured (with the CA framework), where agents can announce 
streaming progress of some task (say, a template download) and that also 
supports long-running tasks/answers that aren't ignored when the control plane 
is unavailable.
        *   I had some thoughts around having a plugin-framework based embedded 
locking service within CloudStack (so turnkey and doesn't require separate 
infra, brokers etc.) that implements both (a) a lock server (so replace MySQL 
DB based GLOBAL_LOCK() too) and (b) a distributed message bus which can be used 
to store/update/delete/announce/queue tasks. This sort of locking/message bus 
framework can be implemented via plugins backed by, say, MySQL/DB, embedded 
ZooKeeper, or Hazelcast. We had done a PoC as part of an internal hackathon in 
the past (https://github.com/shapeblue/cloudstack/tree/locking-service).
        *   Maybe a more modern approach would be to look at how other projects 
are solving this problem, maybe explore other RPC frameworks such as gRPC.
    ```
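
To make the first two ideas in the quoted list a bit more concrete, here is a rough sketch of a marker interface for long-running Commands plus a shared record of who should handle the eventual Answer, so that a different management server can pick it up. Every name in it (`SketchCommand`, `LongRunning`, `PendingAnswerStore`, `ManagementServer`, the handler names) is hypothetical and not an existing CloudStack class, and the in-memory map only stands in for something durable such as a DB table. It is an illustration under those assumptions, not a proposed implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an agent Command; not the real CloudStack class.
abstract class SketchCommand {
    private final String id;
    SketchCommand(String id) { this.id = id; }
    String getId() { return id; }
}

// Hypothetical marker: the Answer may arrive after the sending thread is gone.
interface LongRunning {
    // Name of the handler to run when the Answer eventually arrives.
    String answerHandlerName();
}

// A normal, short-lived command: no special handling.
class PingCommand extends SketchCommand {
    PingCommand(String id) { super(id); }
}

// A storage-related command, treated here as long-running.
class TakeSnapshotCommand extends SketchCommand implements LongRunning {
    TakeSnapshotCommand(String id) { super(id); }
    @Override
    public String answerHandlerName() { return "snapshotCompleted"; }
}

// Assumption: stands in for a store shared by all management servers (e.g. a DB table).
class PendingAnswerStore {
    private final Map<String, String> handlerByCommandId = new ConcurrentHashMap<>();

    void register(String commandId, String handlerName) {
        handlerByCommandId.put(commandId, handlerName);
    }

    // Atomically claims the pending entry so that only one server handles the Answer.
    Optional<String> claim(String commandId) {
        return Optional.ofNullable(handlerByCommandId.remove(commandId));
    }
}

class ManagementServer {
    private final String name;
    private final PendingAnswerStore store;
    final List<String> handled = new ArrayList<>();

    ManagementServer(String name, PendingAnswerStore store) {
        this.name = name;
        this.store = store;
    }

    // Only long-running commands leave a record outside the sending thread.
    void send(SketchCommand cmd) {
        if (cmd instanceof LongRunning) {
            store.register(cmd.getId(), ((LongRunning) cmd).answerHandlerName());
        }
        // Actual transport to the agent is omitted in this sketch.
    }

    // Any server can process a late Answer, not just the one that sent the command.
    boolean onAnswer(String commandId, String result) {
        Optional<String> handler = store.claim(commandId);
        if (!handler.isPresent()) {
            return false; // nothing pending: already handled, or not long-running
        }
        handled.add(name + ":" + handler.get() + ":" + result);
        return true;
    }
}
```

In this sketch, if one server sends a `TakeSnapshotCommand` and then goes away, a second server sharing the same `PendingAnswerStore` can still call `onAnswer` for that command id and run the recorded handler, while a late Answer for a `PingCommand` is simply dropped.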


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

