Hi Guys,

I have just had a VMHost randomly disconnect in production and subsequently
take down some VMs.
I have attached the logs (happened to be running agent trace on this node),
but it would seem that the agent (or management?) waited 25 seconds before
erroring, and then the cloudstack agent froze until 1800.
I assume the agent syslog stack traces were caused by force closes of VMs,
no other nodes were affected during this time period.

While the host was in disconnect mode, I could connect to a VM which was
running on that host, although Cloudstack was already reporting that is was
down.
 Would it be a good idea to ping VM's (their allocated IPs before
attempting to start them on other nodes - especially in a HA setup)?

If someone could look at the logs and let me know if there is something
obvious it would be most appreciated, I have included the management bond
for reference that the link didn't go down.

Thanks in advance,
Marty
Aug 17 17:40:11 aurora dnsmasq-dhcp[6150]: DHCP packet received on brbond0-152 
which has no address
Aug 17 17:40:11 aurora dnsmasq-dhcp[6150]: DHCP packet received on brbond0-152 
which has no address
Aug 17 17:41:53 aurora puppet-agent[9852]: Finished catalog run in 7.21 seconds
Aug 17 17:59:01 aurora kernel: [522967.085451] INFO: task kvm:8402 blocked for 
more than 120 seconds.
Aug 17 17:59:01 aurora kernel: [522967.085788] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 17 17:59:01 aurora kernel: [522967.086214] kvm             D 
ffffffff8180cbe0     0  8402      1 0x00000000
Aug 17 17:59:01 aurora kernel: [522967.086219]  ffff881cf3c43cd8 
0000000000000082 ffff881cf3c43cb8 ffffffff81080cd8
Aug 17 17:59:01 aurora kernel: [522967.086226]  ffff881cf3c43fd8 
ffff881cf3c43fd8 ffff881cf3c43fd8 00000000000139c0
Aug 17 17:59:01 aurora kernel: [522967.086230]  ffff881f92dc9700 
ffff881c0cdfdc00 0000000000000282 ffff883f912d0a80
Aug 17 17:59:01 aurora kernel: [522967.086235] Call Trace:
Aug 17 17:59:01 aurora kernel: [522967.086244]  [<ffffffff81080cd8>] ? 
__wake_up_common+0x58/0x90
Aug 17 17:59:01 aurora kernel: [522967.086250]  [<ffffffff8169bee9>] 
schedule+0x29/0x70
Aug 17 17:59:01 aurora kernel: [522967.086255]  [<ffffffff81058f15>] 
exit_mm+0x85/0x130
Aug 17 17:59:01 aurora kernel: [522967.086259]  [<ffffffff81059131>] 
do_exit+0x171/0x480
Aug 17 17:59:01 aurora kernel: [522967.086264]  [<ffffffff81066cfa>] ? 
__dequeue_signal+0x6a/0xb0
Aug 17 17:59:01 aurora kernel: [522967.086267]  [<ffffffff810595e4>] 
do_group_exit+0x44/0xa0
Aug 17 17:59:01 aurora kernel: [522967.086271]  [<ffffffff810697fb>] 
get_signal_to_deliver+0x22b/0x440
Aug 17 17:59:01 aurora kernel: [522967.086277]  [<ffffffff810147e9>] 
do_signal+0x29/0x130
Aug 17 17:59:01 aurora kernel: [522967.086282]  [<ffffffff810affcc>] ? 
do_futex+0x7c/0x1b0
Aug 17 17:59:01 aurora kernel: [522967.086288]  [<ffffffff8119a10a>] ? 
do_vfs_ioctl+0x8a/0x340
Aug 17 17:59:01 aurora kernel: [522967.086291]  [<ffffffff810b0247>] ? 
sys_futex+0x147/0x1a0
Aug 17 17:59:01 aurora kernel: [522967.086295]  [<ffffffff810149a0>] 
do_notify_resume+0x90/0xd0
Aug 17 17:59:01 aurora kernel: [522967.086300]  [<ffffffff816a5922>] 
int_signal+0x12/0x17
Aug 17 17:59:01 aurora kernel: [522967.086303] INFO: task kvm:8403 blocked for 
more than 120 seconds.
Aug 17 17:59:01 aurora kernel: [522967.086625] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 17 17:59:01 aurora kernel: [522967.087050] kvm             D 
ffffffff8180cbe0     0  8403      1 0x00000000
Aug 17 17:59:01 aurora kernel: [522967.087053]  ffff881f7913bcd8 
0000000000000082 0000000000000000 ffffffffffffffe0
Aug 17 17:59:01 aurora kernel: [522967.087058]  ffff881f7913bfd8 
ffff881f7913bfd8 ffff881f7913bfd8 00000000000139c0
Aug 17 17:59:01 aurora kernel: [522967.087062]  ffff881f92db1700 
ffff881c0cdf8000 ffff881c0cdf8000 ffff883f912d0a80
Aug 17 17:59:01 aurora kernel: [522967.087067] Call Trace:
Aug 17 17:59:01 aurora kernel: [522967.087071]  [<ffffffff8169bee9>] 
schedule+0x29/0x70
Aug 17 17:59:01 aurora kernel: [522967.087074]  [<ffffffff81058f15>] 
exit_mm+0x85/0x130
Aug 17 17:59:01 aurora kernel: [522967.087077]  [<ffffffff81059131>] 
do_exit+0x171/0x480
Aug 17 17:59:01 aurora kernel: [522967.087081]  [<ffffffff81066cfa>] ? 
__dequeue_signal+0x6a/0xb0
Aug 17 17:59:01 aurora kernel: [522967.087084]  [<ffffffff810595e4>] 
do_group_exit+0x44/0xa0
Aug 17 17:59:01 aurora kernel: [522967.087088]  [<ffffffff810697fb>] 
get_signal_to_deliver+0x22b/0x440
Aug 17 17:59:01 aurora kernel: [522967.087092]  [<ffffffff810147e9>] 
do_signal+0x29/0x130
Aug 17 17:59:01 aurora kernel: [522967.087095]  [<ffffffff810affcc>] ? 
do_futex+0x7c/0x1b0
Aug 17 17:59:01 aurora kernel: [522967.087099]  [<ffffffff8119a10a>] ? 
do_vfs_ioctl+0x8a/0x340
Aug 17 17:59:01 aurora kernel: [522967.087102]  [<ffffffff810b0247>] ? 
sys_futex+0x147/0x1a0
Aug 17 17:59:01 aurora kernel: [522967.087106]  [<ffffffff810149a0>] 
do_notify_resume+0x90/0xd0
Aug 17 17:59:01 aurora kernel: [522967.087109]  [<ffffffff816a5922>] 
int_signal+0x12/0x17
Aug 17 17:59:01 aurora kernel: [522967.087112] INFO: task kvm:17450 blocked for 
more than 120 seconds.
Aug 17 17:59:01 aurora kernel: [522967.087436] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 17 17:59:01 aurora kernel: [522967.098800] kvm             D 
ffffffff8180cbe0     0 17450      1 0x00000000
Aug 17 17:59:01 aurora kernel: [522967.098823]  ffff88140dfc3cd8 
0000000000000082 0000000000000000 ffffffffffffffe0
Aug 17 17:59:01 aurora kernel: [522967.098856]  ffff88140dfc3fd8 
ffff88140dfc3fd8 ffff88140dfc3fd8 00000000000139c0
Aug 17 17:59:01 aurora kernel: [522967.098872]  ffff881f9318c500 
ffff881f8fe9dc00 ffff881f8fe9dc00 ffff883f912d0a80
Aug 17 17:59:01 aurora kernel: [522967.098877] Call Trace:
Aug 17 17:59:01 aurora kernel: [522967.098882]  [<ffffffff8169bee9>] 
schedule+0x29/0x70
Aug 17 17:59:01 aurora kernel: [522967.098886]  [<ffffffff81058f15>] 
exit_mm+0x85/0x130
Aug 17 17:59:01 aurora kernel: [522967.098889]  [<ffffffff81059131>] 
do_exit+0x171/0x480
Aug 17 17:59:01 aurora kernel: [522967.098893]  [<ffffffff81066cfa>] ? 
__dequeue_signal+0x6a/0xb0
Aug 17 17:59:01 aurora kernel: [522967.098897]  [<ffffffff810595e4>] 
do_group_exit+0x44/0xa0
Aug 17 17:59:01 aurora kernel: [522967.098901]  [<ffffffff810697fb>] 
get_signal_to_deliver+0x22b/0x440
Aug 17 17:59:01 aurora kernel: [522967.098905]  [<ffffffff810147e9>] 
do_signal+0x29/0x130
Aug 17 17:59:01 aurora kernel: [522967.098909]  [<ffffffff810affcc>] ? 
do_futex+0x7c/0x1b0
Aug 17 17:59:01 aurora kernel: [522967.098912]  [<ffffffff810b0247>] ? 
sys_futex+0x147/0x1a0
Aug 17 17:59:01 aurora kernel: [522967.098917]  [<ffffffff81188550>] ? 
vfs_write+0x110/0x180
Aug 17 17:59:01 aurora kernel: [522967.098921]  [<ffffffff810149a0>] 
do_notify_resume+0x90/0xd0
Aug 17 17:59:01 aurora kernel: [522967.098925]  [<ffffffff816a5922>] 
int_signal+0x12/0x17
Aug 17 18:00:32 aurora kernel: [523057.947100] brbond0-152: port 3(vnet2) 
entered disabled state
Aug 17 18:00:32 aurora kernel: [523057.951419] brbond0-152: port 3(vnet2) 
entered disabled state
Aug 17 18:00:32 aurora kernel: [523057.952182] device vnet2 left promiscuous 
mode
Aug 17 18:00:32 aurora kernel: [523057.952186] brbond0-152: port 3(vnet2) 
entered disabled state
Aug 17 18:00:34 aurora ntpd[5738]: Deleting interface #75 vnet2, 
fe80::fc89:aaff:fe00:5fa#123, interface stats: received=0, sent=0, dropped=0, 
active_time=1284 secs
Aug 17 18:00:34 aurora ntpd[5738]: peers refreshed
Aug 17 18:00:39 aurora kernel: [523064.894862] cloud0: port 3(vnet10) entered 
disabled state
Aug 17 18:00:39 aurora kernel: [523064.899361] cloud0: port 3(vnet10) entered 
disabled state
Aug 17 18:00:39 aurora kernel: [523064.900143] device vnet10 left promiscuous 
mode
Aug 17 18:00:39 aurora kernel: [523064.900146] cloud0: port 3(vnet10) entered 
disabled state
Aug 17 18:00:39 aurora kernel: [523064.957760] brbond0-153: port 2(vnet11) 
entered disabled state
Aug 17 18:00:39 aurora kernel: [523064.958224] device vnet11 left promiscuous 
mode
Aug 17 18:00:39 aurora kernel: [523064.958230] brbond0-153: port 2(vnet11) 
entered disabled state
Aug 17 18:00:39 aurora kernel: [523065.013749] brbond0-819: port 2(vnet12) 
entered disabled state
Aug 17 18:00:39 aurora kernel: [523065.014508] device vnet12 left promiscuous 
mode
Aug 17 18:00:39 aurora kernel: [523065.014517] brbond0-819: port 2(vnet12) 
entered disabled state
Aug 17 18:00:40 aurora ntpd[5738]: Deleting interface #44 vnet12, 
fe80::fc00:3bff:fe0e:2#123, interface stats: received=0, sent=0, dropped=0, 
active_time=118026 secs
Aug 17 18:00:40 aurora ntpd[5738]: Deleting interface #43 vnet11, 
fe80::fc77:acff:fe00:257#123, interface stats: received=0, sent=0, dropped=0, 
active_time=118026 secs
Aug 17 18:00:40 aurora ntpd[5738]: Deleting interface #42 vnet10, 
fe80::fc00:a9ff:fefe:9b#123, interface stats: received=0, sent=0, dropped=0, 
active_time=118026 secs
Aug 17 18:00:40 aurora ntpd[5738]: peers refreshed
Aug 17 18:02:41 aurora kernel: [523187.658419] brbond0-152: port 4(vnet3) 
entered disabled state
Aug 17 18:02:41 aurora kernel: [523187.662509] brbond0-152: port 4(vnet3) 
entered disabled state
Aug 17 18:02:41 aurora kernel: [523187.663595] device vnet3 left promiscuous 
mode
Aug 17 18:02:41 aurora kernel: [523187.663601] brbond0-152: port 4(vnet3) 
entered disabled state
Aug 17 18:02:43 aurora ntpd[5738]: Deleting interface #23 vnet3, 
fe80::fc41:7cff:fe00:61a#123, interface stats: received=0, sent=0, dropped=0, 
active_time=522318 secs
Aug 17 18:02:43 aurora ntpd[5738]: peers refreshed
Aug 17 18:04:49 aurora kernel: [523315.013106] brbond0-152: port 7(vnet6) 
entered disabled state
Aug 17 18:04:49 aurora kernel: [523315.017106] brbond0-152: port 7(vnet6) 
entered disabled state
Aug 17 18:04:49 aurora kernel: [523315.017876] device vnet6 left promiscuous 
mode
Aug 17 18:04:49 aurora kernel: [523315.017883] brbond0-152: port 7(vnet6) 
entered disabled state
Aug 17 18:04:50 aurora ntpd[5738]: Deleting interface #27 vnet6, 
fe80::fc5e:d0ff:fe00:5ee#123, interface stats: received=0, sent=0, dropped=0, 
active_time=522380 secs
Aug 17 18:04:50 aurora ntpd[5738]: peers refreshed
Aug 17 18:04:58 aurora kernel: [523324.135111] brbond0-851: port 2(vnet8) 
entered disabled state
Aug 17 18:04:58 aurora kernel: [523324.138860] brbond0-851: port 2(vnet8) 
entered disabled state
Aug 17 18:04:58 aurora kernel: [523324.139636] device vnet8 left promiscuous 
mode
Aug 17 18:04:58 aurora kernel: [523324.139642] brbond0-851: port 2(vnet8) 
entered disabled state
Aug 17 18:04:58 aurora kernel: [523324.173700] cloud0: port 2(vnet9) entered 
disabled state
Aug 17 18:04:58 aurora kernel: [523324.174395] device vnet9 left promiscuous 
mode
Aug 17 18:04:58 aurora kernel: [523324.174399] cloud0: port 2(vnet9) entered 
disabled state
Aug 17 18:04:59 aurora ntpd[5738]: Deleting interface #41 vnet9, 
fe80::fc00:a9ff:fefe:1f6#123, interface stats: received=0, sent=0, dropped=0, 
active_time=118355 secs
Aug 17 18:04:59 aurora ntpd[5738]: Deleting interface #40 vnet8, 
fe80::fcd7:7aff:fe00:6ea#123, interface stats: received=0, sent=0, dropped=0, 
active_time=118355 secs
Aug 17 18:04:59 aurora ntpd[5738]: peers refreshed
Aug 17 18:07:01 aurora kernel: [523447.242027] brbond0-152: port 5(vnet4) 
entered disabled state
Aug 17 18:07:01 aurora kernel: [523447.246056] brbond0-152: port 5(vnet4) 
entered disabled state
Aug 17 18:07:01 aurora kernel: [523447.246740] device vnet4 left promiscuous 
mode
Aug 17 18:07:01 aurora kernel: [523447.246746] brbond0-152: port 5(vnet4) 
entered disabled state
Aug 17 18:07:02 aurora ntpd[5738]: Deleting interface #25 vnet4, 
fe80::fcbb:a2ff:fe00:60d#123, interface stats: received=0, sent=0, dropped=0, 
active_time=522573 secs
Aug 17 18:07:02 aurora ntpd[5738]: peers refreshed
root@aurora:~# cat /proc/net/bonding/management0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: ac:16:2d:77:21:51
Slave queue ID: 0

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: ac:16:2d:77:21:50
Slave queue ID: 0

Reply via email to