Then this is slightly different from the
management-server-and-KVM-in-one-box issue that I'm aware of.

When a management server is restarted with a new ID, it appears to the
cluster as a new management server instance; we have some code logic to
handle that. From what you described, the logic that should make it
behave fully as a new management node may be broken in this case.

But anyway, having a stable management server ID acquisition process is
much preferred and needed.

Kelven 

On 5/8/13 1:04 PM, "Jori Liesenborgs" <jori.liesenbo...@gmail.com> wrote:

>Hi Kelven,
>
>We have never used KVM in our cloud setup and the management server is a
>separate machine, not a VM. I'm not sure what the code logic is supposed
>to do, but in our case the problem did prevent the management server
>from functioning: no new instances could be started.
>
>Cheers,
>Jori
>
>> This is a known issue when you are running the management server
>> together with a KVM host. After a KVM host is added to the running
>> management server, it creates a bridge that can cause the management
>> server ID to change after a reboot, but only once.
>>
>> A similar issue can happen when you run the management server in a VM
>> and later clone the VM.
>>
>> We have code logic to handle these cases; apart from some annoying
>> messages in the log, it should not prevent the management server from
>> functioning normally. But it would be really nice to see a fix that
>> provides a stable management server ID acquisition process.
>>
>> Kelven
>>
>> On 5/8/13 12:15 PM, "Chip Childers" <chip.child...@sungard.com> wrote:
>>
>>> On Wed, May 08, 2013 at 09:11:43PM +0200, Jori Liesenborgs wrote:
>>>> Hi everyone,
>>>>
>>>> On our cloudstack setup (4.0.2), I noticed that after a reboot of
>>>> the management server, I was no longer able to start new instances.
>>>> A secondary problem was that the management-server.log file filled
>>>> up extremely fast (gigabytes in a few hours), with messages like
>>>> these:
>>>>
>>>> 2013-05-08 05:26:10,627 DEBUG [agent.manager.ClusteredAgentAttache]
>>>> (AgentManager-Handler-4:null) Seq 7-1033568320: Forwarding Seq
>>>> 7-1033568320:  { Cmd , MgmtId: 38424150221294, via: 7, Ver: v1,
>>>> Flags: 100111,
>>>> [{"StopCommand":{"isProxy":false,"vmName":"i-2-6-VM","wait":0}}] }
>>>> to 130450099353672
>>>>
>>>> This turned out to contain an important clue: when looking at the
>>>> 'mshost' table in the 'cloud' database, instead of seeing one entry
>>>> for the management server ID, there now were two:
>>>>
>>>> | id | msid            | runid         | name          | ...
>>>> |  1 | 130450099353672 | 1367919381740 | cloud-manager | ...
>>>> |  2 |  38424150221294 | 1367950608087 | cloud-manager | ...
>>>>
>>>> And these two IDs were the ones mentioned in the logfile. In fact,
>>>> after every reboot a new entry appeared in the 'mshost' table, and
>>>> that new ID was being inserted into the 'host' entries for the
>>>> system VMs 'v-2-VM' and 's-1-VM'.
>>>>
>>>> Browsing through the code, it appears that in the
>>>> ManagementServerNode.java file, the function getManagementServerId()
>>>> returns a static value created by the MacAddress class. Now, on a
>>>> Linux platform (we are using ubuntu), this address is obtained from
>>>> the first entry that the command "/sbin/ifconfig -a" shows as
>>>> output. And this turned out to be the address of the cloud0 bridge
>>>> interface, which changed after a reboot (or after deleting the
>>>> bridge using brctl and restarting the entire cloudstack).
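>>>>
>>>> For illustration, here is a rough sketch (in Python, not the actual
>>>> Java code) of what the ID effectively depends on, namely the first
>>>> HWaddr that "/sbin/ifconfig -a" prints:
>>>>
>>>> #!/usr/bin/env python
>>>> # Rough sketch, not the actual CloudStack code: print the interface
>>>> # and MAC that the management server ID is effectively derived from,
>>>> # i.e. the first HWaddr listed by "/sbin/ifconfig -a".
>>>> import re, subprocess
>>>>
>>>> out = subprocess.check_output(["/sbin/ifconfig", "-a"]).decode()
>>>> m = re.search(r"^(\S+).*HWaddr\s+([0-9a-fA-F:]{17})", out, re.M)
>>>> if m:
>>>>     print("first interface: %s, MAC: %s" % (m.group(1), m.group(2)))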
>>>>
>>>> To avoid having to modify and recompile cloudstack, I created a fake
>>>> ifconfig: a simple python process that most of the time just runs
>>>> the real ifconfig (which I renamed to ifconfig-bin), but when called
>>>> as "/sbin/ifconfig -a", it rearranges the output so that eth0 is
>>>> shown first (and not cloud0). This way, the management server id is
>>>> basically the MAC address of eth0, which stays the same after a
>>>> reboot.
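>>>>
>>>> Roughly, the wrapper does something like this (a simplified sketch
>>>> of the idea; error handling and exit codes are omitted):
>>>>
>>>> #!/usr/bin/env python
>>>> # Fake /sbin/ifconfig: pass everything through to the real binary
>>>> # (renamed to /sbin/ifconfig-bin), but for "ifconfig -a" move the
>>>> # eth0 block to the front so that its MAC is listed first.
>>>> import subprocess, sys
>>>>
>>>> args = sys.argv[1:]
>>>> out = subprocess.check_output(["/sbin/ifconfig-bin"] + args).decode()
>>>> if args == ["-a"]:
>>>>     # Old-style ifconfig separates interface blocks with blank lines.
>>>>     blocks = [b for b in out.split("\n\n") if b.strip()]
>>>>     # Stable sort: eth0 first, everything else keeps its order.
>>>>     blocks.sort(key=lambda b: 0 if b.startswith("eth0") else 1)
>>>>     out = "\n\n".join(blocks) + "\n"
>>>> sys.stdout.write(out)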
>>>>
>>>> I haven't had the time to create a long running test yet (I only
>>>> figured it out this afternoon), but after several reboots, the
>>>> management server id now stays the same, and I am still able to
>>>> start new instances.
>>>>
>>>> Hope someone finds this useful.
>>>>
>>>> Cheers,
>>>> Jori
>>>>
>>>>
>>> Jori,
>>>
>>> This is really interesting.  Would you mind opening a bug about it with
>>> your findings?  And if you're interested in submitting a patch, we'd
>>> love that too!
>>>
>>> -chip
>
