On Wed, May 08, 2013 at 09:11:43PM +0200, Jori Liesenborgs wrote:
>
> Hi everyone,
>
> On our cloudstack setup (4.0.2), I noticed that after a reboot of
> the management server, I was no longer able to start new instances.
> A secondary problem was that the management-server.log file filled
> up extremely fast (gigabytes in a few hours), with messages like
> these:
>
>   2013-05-08 05:26:10,627 DEBUG [agent.manager.ClusteredAgentAttache]
>   (AgentManager-Handler-4:null) Seq 7-1033568320: Forwarding Seq
>   7-1033568320: { Cmd , MgmtId: 38424150221294, via: 7, Ver: v1,
>   Flags: 100111,
>   [{"StopCommand":{"isProxy":false,"vmName":"i-2-6-VM","wait":0}}] }
>   to 130450099353672
>
> This turned out to contain an important clue: when looking at the
> 'mshost' table in the 'cloud' database, instead of seeing one entry
> for the management server ID, there now were two:
>
>   | id | msid            | runid         | name          | ...
>   | 1  | 130450099353672 | 1367919381740 | cloud-manager | ...
>   | 2  | 38424150221294  | 1367950608087 | cloud-manager | ...
>
> And these two IDs were those that were mentioned in the logfile. In
> fact, every reboot a new entry in the 'mshost' table appeared, and
> that new ID was being inserted into the 'host' entries, for system
> VMs 'v-2-VM' and 's-1-VM'.
>
> Browsing through the code, it appears that in the
> ManagementServerNode.java file, the function getManagementServerId()
> returns a static value created by the MacAddress class. Now, on a
> Linux platform (we are using ubuntu), this address is obtained from
> the first entry that the command "/sbin/ifconfig -a" shows as
> output. And this turned out to be the address of the cloud0 bridge
> interface, which changed after a reboot (or after deleting the
> bridge using brctl and restarting the entire cloudstack).
>
> To avoid having to modify and recompile cloudstack, I created a fake
> ifconfig: a simple python process that most of the time just runs
> the real ifconfig (which I renamed to ifconfig-bin), but when called
> as "/sbin/ifconfig -a", it rearranges the output so that eth0 is
> shown first (and not cloud0). This way, the management server id is
> basically the MAC address of eth0, which stays the same after a
> reboot.
>
> I haven't had the time to create a long running test yet (I only
> figured it out this afternoon), but after several reboots, the
> management server id now stays the same, and I am still able to
> start new instances.
>
> Hope someone finds this useful.
>
> Cheers,
> Jori
>
Jori,

This is really interesting. Would you mind opening a bug about it with
your findings? And if you're interested in submitting a patch, we'd
love that too!

-chip
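
For anyone who wants to try the workaround described in the quoted message, here is a rough sketch of what such an ifconfig wrapper might look like. It is not Jori's actual script, which was not posted. It assumes the classic net-tools output layout, where each interface block starts with the interface name and blocks are separated by a blank line, and it assumes the real binary was renamed to /sbin/ifconfig-bin as described above. The names reorder_interfaces, REAL_IFCONFIG and main are made up for illustration.

```python
import re
import subprocess
import sys

# Assumption: the real binary was renamed as described in the quoted message.
REAL_IFCONFIG = "/sbin/ifconfig-bin"


def reorder_interfaces(ifconfig_output, preferred="eth0"):
    """Return ifconfig-style text with the block for `preferred` moved first.

    Assumes interface blocks are separated by blank lines and that each
    block begins with the interface name (optionally followed by ':').
    """
    blocks = [b for b in re.split(r"\n\s*\n", ifconfig_output.strip("\n"))
              if b.strip()]

    def is_preferred(block):
        name = block.split(None, 1)[0].rstrip(":")
        return name == preferred

    first = [b for b in blocks if is_preferred(b)]
    rest = [b for b in blocks if not is_preferred(b)]
    # If `preferred` is absent, the original order is kept unchanged.
    return "\n\n".join(first + rest) + "\n\n"


def main(argv):
    """Pass everything through to the real ifconfig, except a bare '-a'."""
    if argv[1:] == ["-a"]:
        result = subprocess.run([REAL_IFCONFIG, "-a"], capture_output=True,
                                text=True, check=True)
        sys.stdout.write(reorder_interfaces(result.stdout))
        return 0
    return subprocess.call([REAL_IFCONFIG] + argv[1:])

# Installed as /sbin/ifconfig, the script would end with:
#     sys.exit(main(sys.argv))
```

Keeping the reordering in a pure function makes it easy to check against saved ifconfig output before replacing the system binary. This only works around the symptom; the ID would still be derived from whichever interface is listed first.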