On 16 September 2014 00:40, Julian Edwards <[email protected]> wrote:
...
> We need a way to *recover* operations in the case of a pserv and or region
> failure, and to do this the database needs to store the *desired* state of the
> power in addition to its current state. As I have previously said, the pserv
> needs to issue a "recovery" call to the region when it restarts so it can
> converge on the desired state; for power the region would send back a list of
> outstanding power ops on nodes and the desired state for each.

A way to recover operations in the cluster when it fails or is restarted
is desirable, but it's something we can do next cycle; it's out of scope
for this one, and was never in scope to begin with.

Celery with RabbitMQ gave us /an/ ability to recover from a restart or a
crash, assuming Celery uses AMQP settlement correctly (which I assume it
probably does). I don't think it gave us a way to prevent hundreds of
power-on and power-off tasks from being queued, it didn't give us a way
to make immediate queries of clusters, and it didn't give clusters a way
to talk to the region.

So, with the move to RPC we've lost some of that ability for a cluster to
pick up where it left off, but we have gained a lot more control over
MAAS's infrastructure, and we have a strong basis for layering more HA
and HA-like features on top.

I agree that putting "truth" into the database and designing code to
converge on that is the right approach.

(Aside: once a node has been deployed, MAAS can no longer have a desired
power state to converge on. The node belongs to the user at that point,
and they are free to turn it off and on as needed, by mechanisms other
than MAAS.)
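
To make the "truth in the database" idea a little more concrete, here is
a rough sketch of how the region might work out which power operations
are still outstanding when a cluster asks to recover. Everything in it
is hypothetical: `NodePower`, `outstanding_power_ops`, and the field
names are made up for illustration and are not existing MAAS code. A
desired state of `None` stands for a deployed node that MAAS should
leave alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodePower:
    """Hypothetical database row holding power "truth" for one node."""
    system_id: str
    current: str              # last observed state: "on" or "off"
    desired: Optional[str]    # None once deployed: the user owns the node


def outstanding_power_ops(rows):
    """Return (system_id, desired) for nodes that have not yet converged.

    This is roughly what a region could send back in answer to a
    cluster's recovery call after a restart (an assumption, not an
    existing API).
    """
    return [
        (row.system_id, row.desired)
        for row in rows
        if row.desired is not None and row.desired != row.current
    ]


rows = [
    NodePower("node-a", current="off", desired="on"),   # op lost in a crash
    NodePower("node-b", current="on", desired="on"),    # already converged
    NodePower("node-c", current="off", desired=None),   # deployed; skip
]
print(outstanding_power_ops(rows))  # [('node-a', 'on')]
```

Because the list is derived from stored state rather than from a queue,
asking for the same power change repeatedly still yields at most one
outstanding op per node.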

