Hi,

This is a XenServer question, but since it is part of my ACS setup, I hope the XenServer experts on this list can provide some help.
Every month or so, one of my XenServer pools (6.5 SP1 with most patches installed, ten hypervisor nodes) goes crazy: in CS, only the pool master stays in the Up state, all slaves are in either the Alert or Connecting state, and CS can't perform any VM operations on a VM that is running on one of the slaves.

On the hypervisor CLI, xe commands are extremely slow on the slaves and often just fail, but on the pool master xe commands behave normally. It seems that the pool slaves just can't communicate with the master properly.

I have managed to recover the pool each time by switching the pool master to another hypervisor (this step often proceeds with great difficulty due to the poor communication between the master and slaves), followed by running the xe-toolstack-restart command on all pool members.

What is the root cause of this condition? How can I avoid getting into this situation in the first place?

Thanks,
Yiping