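Hi All,

I would like to propose using XS HA to switch the XS master host when the XS master host is down.

Reason:

We recently found the issue below:
https://issues.apache.org/jira/browse/CLOUDSTACK-6177

When the XS master is down, CS uses the pool-emergency-transition-to-master and pool-recover-slaves APIs to choose a new master. These APIs are not safe and should only be used in an emergency. They may cause XS to use a slightly old (5 seconds old) version of the XS DB. Some objects may be missing from the old XS DB, which can cause weird behavior; for example, you may not be able to start a VM.

Short term solution:

CS no longer does the XS master switch, to avoid this issue.

Impact:
1. When the master host is down, CS loses its connection to the whole XS pool (CS cluster). CS cannot get VM info in this cluster, and the whole cluster is not operable.
2. The admin has to recover the XS master host manually. If recovering the XS master host is not possible, the admin can use pool-emergency-transition-to-master and pool-recover-slaves to recover the pool. Per the issue I mentioned above, this should be the last resort.

Long term solution:

Integrate XS HA, and use XS HA to do the XS master switch.

1. It might take some time to integrate XS HA.
2. Old free versions of XS don't have the XS HA feature; users might need to upgrade to XS 6.2 (which is free) to get it.

I think we can fix this issue in two steps:
1. Since this issue is very critical, CS should stop doing the XS master switch right away to avoid this issue.
2. Integrate XS HA.

Comments and suggestions are highly appreciated!

Best Regards,
Anthony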
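
P.S. To make the proposal concrete, here is a rough sketch of what should happen when the XS master goes down, once CS itself no longer switches the master. It is only an illustration, not CloudStack code: the names Action and action_when_master_down are made up for this mail, the inputs are plain booleans, and nothing here calls XenServer.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for who acts when the XS master is down."""
    NONE = "master is up, nothing to do"
    WAIT_FOR_XS_HA = "XS HA switches the master, CS does nothing"
    RECOVER_MASTER_MANUALLY = "admin recovers the XS master host"
    EMERGENCY_TRANSITION = (
        "admin runs pool-emergency-transition-to-master "
        "and pool-recover-slaves (last resort)"
    )


def action_when_master_down(master_up: bool,
                            xs_ha_enabled: bool,
                            master_recoverable: bool) -> Action:
    """Pick the recovery path described in the mail above.

    All three flags are assumptions supplied by the caller; how CS
    would actually detect them is out of scope for this sketch.
    """
    if master_up:
        return Action.NONE
    if xs_ha_enabled:
        # Long term solution: XS HA does the master switch.
        return Action.WAIT_FOR_XS_HA
    if master_recoverable:
        # Short term solution: CS does not switch, admin fixes the host.
        return Action.RECOVER_MASTER_MANUALLY
    # Unsafe path: the pool may come back with a slightly old XS DB.
    return Action.EMERGENCY_TRANSITION
```

The point of the ordering is that the emergency APIs are reached only when XS HA is not available and the master host cannot be brought back.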
Reason, We found below issue recently, https://issues.apache.org/jira/browse/CLOUDSTACK-6177 When XS master is down, CS uses pool-emergency-transition-to-master and pool-recover-slaves API to choose a new master, this API is not safe, and should be only used in emergent situation, this API may cause XS use a little bit old(5 seconds old) version of XS DB, some of object may be missing in the old XS DB, which may cause weird behavior, you may not be able to start VM. Short term solution CS doesn't do XS master switch any more to avoid this issue. Impact, 1. When master host is down, CS loses connect to the whole XS pool(CS cluster), CS cannot get VMs info in this cluster, and the whole cluster is not operable. 2. Require admin to recover the XS master host manually, if recovering XS master host is not possible, admin can use uses pool-emergency-transition-to-master and pool-recover-slaves to recover the pool, per the issue I mentioned before , this should be the last resort. Long term solution Integrate XS HA, use XS HA to do XS master switch. 1. It might take some time to integrate XS HA. 2. Old free version XS doesn't have XS HA feature, user might need to upgrade to XS 6.2( which is free) to get the feature. I think we can fix this issue in two steps. 1. Since this issue is very critical, CS should not do XS master switch immediately to avoid this issue. 2. Integrate XS HA. Comments, suggestions are highly appreciated! Best Regards. Anthony