Looking into it now.

On 27 Feb 2014 15:56, "Derek Higgins" <der...@redhat.com> wrote:
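Derek's symptom below — VMs not getting IPs, with no DHCP requests showing up in dnsmasq — can be narrowed down hop by hop with tcpdump. A rough sketch of the checks (the namespace and interface names are placeholders, not taken from this thread):

```shell
# Hedged sketch: follow a VM's DHCP traffic from the compute host toward
# the neutron DHCP agent. qdhcp-<network-id> and tapXXXXXXXX are
# placeholders -- substitute the real IDs from the deployment.

# On the compute host: is the VM emitting DHCPDISCOVER on its tap device?
sudo tcpdump -lnni tapXXXXXXXX 'port 67 or port 68'

# On the controller: find the DHCP namespace for the tenant network...
sudo ip netns list | grep qdhcp

# ...and watch for requests arriving where dnsmasq actually listens.
sudo ip netns exec qdhcp-<network-id> tcpdump -lnni any 'port 67 or port 68'
```

If requests appear on the tap device but never inside the namespace, the fault is somewhere in the path between them (bridges, OpenFlow rules, tunnels) rather than in dnsmasq itself.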
> On 25/02/14 00:08, Robert Collins wrote:
> > Today we had an outage of the tripleo test cloud :(.
> >
> > tl;dr:
> > - we were down for 14 hours
> > - we don't know the fundamental cause
> > - infra were not inconvenienced - yaaay
> > - it's all OK now.
>
> Looks like we've hit the same problem again tonight. I've:
> o rebooted the server
> o fixed up the hostname
> o restarted nova and neutron services on the controller
>
> VMs are still not getting IPs, and I'm not seeing DHCP requests from
> them coming into dnsmasq. I spent some time trying to figure out the
> problem with no luck; I'll pick this up again in a few hours if nobody
> else has before then.
>
> > Read on for more information, what little we have.
> >
> > We don't know exactly why it happened yet, but the control plane
> > dropped off the network. The console showed the node still had a
> > correct networking configuration, including OpenFlow rules and
> > bridges. The node was arpingable, and could arping out, but could
> > not be pinged. Tcpdump showed the node sending a ping reply on its
> > raw ethernet device, but other machines on the same LAN did not see
> > the packet.
> >
> > From syslog we can see
> >
> >   Feb 24 06:28:31 ci-overcloud-notcompute0-gxezgcvv4v2q kernel:
> >   [1454708.543053] hpsa 0000:06:00.0: cmd_alloc returned NULL!
> >
> > events around the time frame in which the drop-off would have
> > happened, but they go back many hours before and after that.
> >
> > After exhausting everything that came to mind we rebooted the
> > machine, which promptly spat an NMI trace into the console:
> >
> > [1502354.552431] [<ffffffff810fdf98>] rcu_eqs_enter_common.isra.43+0x208/0x220
> > [1502354.552491] [<ffffffff810ff9ed>] rcu_irq_exit+0x5d/0x90
> > [1502354.552549] [<ffffffff81067670>] irq_exit+0x80/0xc0
> > [1502354.552605] [<ffffffff816f9605>] smp_apic_timer_interrupt+0x45/0x60
> > [1502354.552665] [<ffffffff816f7f9d>] apic_timer_interrupt+0x6d/0x80
> > [1502354.552722] <EOI> <NMI> [<ffffffff816e1384>] ? panic+0x193/0x1d7
> > [1502354.552880] [<ffffffffa02d18e5>] hpwdt_pretimeout+0xe5/0xe5 [hpwdt]
> > [1502354.552939] [<ffffffff816efc88>] nmi_handle.isra.3+0x88/0x180
> > [1502354.552997] [<ffffffff816eff11>] do_nmi+0x191/0x330
> > [1502354.553053] [<ffffffff816ef201>] end_repeat_nmi+0x1e/0x2e
> > [1502354.553111] [<ffffffff813d46c2>] ? intel_idle+0xc2/0x120
> > [1502354.553168] [<ffffffff813d46c2>] ? intel_idle+0xc2/0x120
> > [1502354.553226] [<ffffffff813d46c2>] ? intel_idle+0xc2/0x120
> > [1502354.553282] <<EOE>> [<ffffffff8159fe90>] cpuidle_enter_state+0x40/0xc0
> > [1502354.553408] [<ffffffff8159ffd9>] cpuidle_idle_call+0xc9/0x210
> > [1502354.553466] [<ffffffff8101bafe>] arch_cpu_idle+0xe/0x30
> > [1502354.553523] [<ffffffff810b54c5>] cpu_startup_entry+0xe5/0x280
> > [1502354.553581] [<ffffffff816d64b7>] rest_init+0x77/0x80
> > [1502354.553638] [<ffffffff81d26ef7>] start_kernel+0x40a/0x416
> > [1502354.553695] [<ffffffff81d268f6>] ? repair_env_string+0x5c/0x5c
> > [1502354.553753] [<ffffffff81d26120>] ? early_idt_handlers+0x120/0x120
> > [1502354.553812] [<ffffffff81d265de>] x86_64_start_reservations+0x2a/0x2c
> > [1502354.553871] [<ffffffff81d266e8>] x86_64_start_kernel+0x108/0x117
> > [1502354.553929] ---[ end trace 166b62e89aa1f54b ]---
> >
> > 'yay'. After that, and a power reset in the console, it came up OK;
> > it just needed a minor nudge to refresh its heat configuration and
> > we were up and running again.
> >
> > For some reason, neutron decided to rename its agents at this point,
> > and we had to remove and reattach the l3 agent before VM
> > connectivity was restored.
> > https://bugs.launchpad.net/tripleo/+bug/1284354
> >
> > However, about 90 nodepool nodes were stuck in states like "ACTIVE
> > deleting", and did not clear until we did a rolling restart of every
> > nova compute process.
> > https://bugs.launchpad.net/tripleo/+bug/1284356
> >
> > Cheers,
> > Rob
> >
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
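The l3-agent remove/reattach step Rob mentions can be sketched with the neutron CLI of that era. The router and agent IDs below are placeholders, not values from this incident:

```shell
# Hedged sketch of moving a router off a misbehaving L3 agent.
# <router-id>, <old-agent-id>, and <new-agent-id> are placeholders.

neutron agent-list                                   # note the L3 agents' IDs
neutron l3-agent-list-hosting-router <router-id>     # which agent hosts it now
neutron l3-agent-router-remove <old-agent-id> <router-id>
neutron l3-agent-router-add <new-agent-id> <router-id>
```

After the re-add, the new agent rebuilds the router's namespace and NAT rules, which is typically when VM connectivity returns.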