Re: Problems after management server reboot & workaround

Jori Liesenborgs Wed, 08 May 2013 13:05:15 -0700

Hi Kelven,

We have never used KVM in our cloud setup and the management server is a separate machine, not a VM. I'm not sure what the code logic is supposed to do, but in our case the problem did prevent the management server from functioning: no new instances could be started.


Cheers,
Jori

This is a known issue when you are running the management server together
with a KVM host. After the KVM host is added to the running management
server, it creates a bridge that can cause the management server ID to
change after a reboot, but only once.

A similar issue can happen when you run management server in a VM and
later on clone the VM.

We have code logic to handle these cases: apart from some annoying
messages in the log, it will not keep the management server from
functioning normally. But it would be really nice to see a fix that gives
us a stable management server ID acquisition process.

Kelven

On 5/8/13 12:15 PM, "Chip Childers" wrote:

On Wed, May 08, 2013 at 09:11:43PM +0200, Jori Liesenborgs wrote:

Hi everyone,

On our CloudStack setup (4.0.2), I noticed that after a reboot of
the management server, I was no longer able to start new instances.
A secondary problem was that the management-server.log file filled
up extremely fast (gigabytes in a few hours), with messages like
these:

2013-05-08 05:26:10,627 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-4:null) Seq 7-1033568320: Forwarding Seq
7-1033568320:  { Cmd , MgmtId: 38424150221294, via: 7, Ver: v1,
Flags: 100111,
[{"StopCommand":{"isProxy":false,"vmName":"i-2-6-VM","wait":0}}] }
to 130450099353672

This turned out to contain an important clue: when looking at the
'mshost' table in the 'cloud' database, instead of seeing one entry
for the management server ID, there now were two:

| id | msid            | runid         | name          | ...
|  1 | 130450099353672 | 1367919381740 | cloud-manager | ...
|  2 |  38424150221294 | 1367950608087 | cloud-manager | ...

And these two IDs were the ones mentioned in the logfile. In fact,
on every reboot a new entry appeared in the 'mshost' table, and
that new ID was being inserted into the 'host' entries for the
system VMs 'v-2-VM' and 's-1-VM'.

Browsing through the code, it appears that in the
ManagementServerNode.java file, the function getManagementServerId()
returns a static value created by the MacAddress class. Now, on a
Linux platform (we are using Ubuntu), this address is obtained from
the first entry that the command "/sbin/ifconfig -a" shows as
output. And this turned out to be the address of the cloud0 bridge
interface, which changed after a reboot (or after deleting the
bridge using brctl and restarting all of CloudStack).
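
As a rough illustration of why the interface order matters, here is a
minimal Python sketch of the behaviour described above. It is not
CloudStack's actual Java code: the function names, the regular
expression and the MAC-to-integer conversion are assumptions made for
illustration only. It takes the first MAC address that appears in
ifconfig-style output and turns it into a number, so whichever
interface is listed first determines the resulting ID.

```python
import re

# Six colon-separated hex pairs, e.g. 00:11:22:33:44:55.
MAC_RE = re.compile(r"\b([0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5})\b")


def first_mac(ifconfig_output):
    """Return the first MAC-like token in the text, or None if there is none."""
    match = MAC_RE.search(ifconfig_output)
    return match.group(1).lower() if match else None


def mac_to_int(mac):
    """Interpret a colon-separated MAC address as one hexadecimal number."""
    return int(mac.replace(":", ""), 16)


def derive_server_id(ifconfig_output):
    """Hypothetical helper: derive an ID from the first interface listed."""
    mac = first_mac(ifconfig_output)
    return None if mac is None else mac_to_int(mac)
```

With this logic, a bridge such as cloud0 that is listed before eth0
and gets a new MAC address on reboot yields a different ID each time.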

To avoid having to modify and recompile CloudStack, I created a fake
ifconfig: a simple Python process that most of the time just runs
the real ifconfig (which I renamed to ifconfig-bin), but when called
as "/sbin/ifconfig -a", it rearranges the output so that eth0 is
shown first (and not cloud0). This way, the management server id is
basically the MAC address of eth0, which stays the same after a
reboot.
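
A wrapper along these lines could look roughly like the sketch below.
This is not the original script, only a minimal reconstruction under
some assumptions: the real binary has been renamed to
/sbin/ifconfig-bin as described above, interface blocks in the output
are separated by blank lines, and the names reorder() and main() are
made up for this example.

```python
import subprocess
import sys

REAL_IFCONFIG = "/sbin/ifconfig-bin"  # the renamed real binary
PREFERRED = "eth0"                    # interface that should be listed first


def block_name(block):
    """Interface name of one block; newer ifconfig versions append a colon."""
    return block.split(None, 1)[0].rstrip(":")


def reorder(output, preferred=PREFERRED):
    """Move the block of the preferred interface to the front of the output."""
    blocks = [b for b in output.split("\n\n") if b.strip()]
    if not blocks:
        return output
    first = [b for b in blocks if block_name(b) == preferred]
    rest = [b for b in blocks if block_name(b) != preferred]
    return "\n\n".join(first + rest) + "\n\n"


def main(argv):
    """Reorder the output only for a plain '-a' call; otherwise pass through."""
    args = argv[1:]
    if args == ["-a"]:
        result = subprocess.run([REAL_IFCONFIG, "-a"],
                                stdout=subprocess.PIPE,
                                universal_newlines=True)
        sys.stdout.write(reorder(result.stdout))
        return result.returncode
    return subprocess.call([REAL_IFCONFIG] + args)
```

Installed as /sbin/ifconfig, such a script would end by calling
sys.exit(main(sys.argv)). All other invocations are handed to the real
binary unchanged, so only the "-a" listing order is affected.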

I haven't had the time to create a long running test yet (I only
figured it out this afternoon), but after several reboots, the
management server id now stays the same, and I am still able to
start new instances.

Hope someone finds this useful.

Cheers,
Jori

Jori,

This is really interesting.  Would you mind opening a bug about it with
your findings?  And if you're interested in submitting a patch, we'd
love that too!

-chip
