Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
Hi. Sorry for the slow follow-up.

Stefan Hajnoczi writes:

> Yes, it's odd that QEMU changes make the issue go away but tcpdump
> suggests the packet is not being sent from the bridge to the tap
> device.

Indeed. I really don't understand how the two could possibly interact!

> e1000 and rtl8139 both use the same QEMU network subsystem code. I
> don't see an obvious difference between the two.

I wondered whether it could be some sort of checksum offloading which
doesn't apply to virtio or rtl8139, but otherwise I agree it's extremely
strange.

> Since this issue only happens once in many QEMU runs are you sure that
> -usbdevice tablet really makes the issue go away?

I ran in a tight loop without it for around 500 iterations, and have done
so again to confirm. Typically it fails within 20-40 iterations with
-usbdevice tablet present.

> Are you using ebtables? I know you mentioned disabling iptables and
> it would be good to try the same for ebtables if you use it.

We're normally using ebtables, but I completely flushed all the tables and
set a policy of ACCEPT for these tests to eliminate the possibility of bugs
in my table-creation code.

> In order to debug the host networking issue you may be able to use
> ebtables/iptables LOG targets to collect information on how far
> exactly the packets are getting. For example, you could try logging
> all packets destined for the guest MAC address - and if the log
> information includes the network interface you should see the packet
> move between its source, the bridge, and the destination interface. I
> have never tried this but it might work.

I will have a play with this.

Cheers,

Chris.
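As a rough check on whether 500 clean iterations could be a fluke, here is a back-of-the-envelope estimate. It assumes (and this is only an assumption, nothing in the thread establishes it) that each boot fails independently with a constant probability p. The two values of p are read off the figures quoted in this thread: "19 times out of 20 it works" and "fails within 20-40 iterations".

```latex
P(\text{no failure in } n \text{ boots}) = (1 - p)^n

p = \tfrac{1}{20},\ n = 500:\quad 0.95^{500}  = e^{500 \ln 0.95}  \approx e^{-25.6} \approx 7 \times 10^{-12}

p = \tfrac{1}{40},\ n = 500:\quad 0.975^{500} = e^{500 \ln 0.975} \approx e^{-12.7} \approx 3 \times 10^{-6}
```

Under that assumption, a run of 500 successes would be very unlikely to occur by chance if the failure rate were unchanged, which is consistent with -usbdevice tablet really being involved.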
Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
On Thu, Apr 12, 2012 at 10:37 AM, Chris Webb wrote:
> Stefan Hajnoczi writes:
>
>> On Tue, Apr 3, 2012 at 5:37 PM, Chris Webb wrote:
>> > Stefan Hajnoczi writes:
>> >
>> >> >> Are you sure no other guest has the same MAC address or IP address?
>> >> >> This weird behavior sounds similar to what happens when you have
>> >> >> multiple devices on a network using the same address - the results are
>> >> >> very confusing :).
>> >> >
>> >> > Yes, I agree! However, in this case there's no other guest with the
>> >> > same MAC or IP address on the network. I've explicitly rechecked this
>> >> > to be sure, and also deliberately varied the MAC address to something
>> >> > I know can't be generated by our scripts. In any case, I'm using the
>> >> > same MAC and IP address for every reboot of this VM, and usually
>> >> > (19 times out of 20) it works fine.
>> >>
>> >> The lack of ARP reply is a host networking problem. Have you checked
>> >> host dmesg(1) output just in case there was a kernel message related
>> >> to this?
>> >
>> > Nothing there I'm afraid. Just the usual
>> >
>> >  device tap1 entered promiscuous mode
>> >  ADDRCONF(NETDEV_UP): tap1: link is not ready
>> >  ADDRCONF(NETDEV_CHANGE): tap1: link becomes ready
>> >  br0: port 2(tap1) entering forwarding state
>> >  br0: port 2(tap1) entering forwarding state
>> >  kvm: 20288: cpu0 unhandled rdmsr: 0xc0010112
>> >  kvm: 20288: cpu0 unhandled rdmsr: 0xc0010048
>> >  tap1: no IPv6 routers present
>> >  br0: port 2(tap1) entering forwarding state
>> >  br0: port 2(tap1) entering forwarding state
>> >  br0: port 2(tap1) entering forwarding state
>> >  br0: port 2(tap1) entering forwarding state
>> >  br0: port 2(tap1) entering disabled state
>> >
>> > cycle. It looks just the same for a working guest as for a non-working
>> > guest.
>>
>> Is the "disabled state" because QEMU exited?
>
> Yes, that's right.
>
>> I'm afraid I don't have any suggestions beyond debugging the
>> bridge->tap code in the kernel since packets are not being forwarded
>> for some reason.
>
> Many thanks for your help and suggestions nonetheless. It's reassuring to
> hear it's not something completely obvious I'm overlooking.
>
> Does the fact that this only happens with model=e1000, not model=virtio or
> model=rtl8139 give us a clue as to what might be going wrong in the host
> kernel? The observation which particularly baffles me if it's a host kernel
> issue is that removing -usbdevice tablet from the guest makes the problem go
> away!
>
> More generally, my confusion with this bug is that guest changes like
> model=e1000 -> model=rtl8139 fixing it or removing -usbdevice tablet fixing
> it seem to imply a qemu problem rather than a host kernel bug, but -net tap
> -> -net user fixing it seems to imply a host kernel bug rather than a qemu
> problem!

Yes, it's odd that QEMU changes make the issue go away but tcpdump
suggests the packet is not being sent from the bridge to the tap
device.

e1000 and rtl8139 both use the same QEMU network subsystem code. I
don't see an obvious difference between the two.

Since this issue only happens once in many QEMU runs are you sure that
-usbdevice tablet really makes the issue go away?

Are you using ebtables? I know you mentioned disabling iptables and
it would be good to try the same for ebtables if you use it.

In order to debug the host networking issue you may be able to use
ebtables/iptables LOG targets to collect information on how far
exactly the packets are getting. For example, you could try logging
all packets destined for the guest MAC address - and if the log
information includes the network interface you should see the packet
move between its source, the bridge, and the destination interface. I
have never tried this but it might work.

Stefan
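The logging idea above could be sketched roughly as follows. This is untested against the setup in this thread. The MAC address 52:54:00:12:34:56 is a placeholder for the guest's real MAC, the log prefixes and the DRY_RUN switch are made up for illustration, and by default the script only prints the ebtables commands rather than running them.

```shell
#!/bin/sh
# Sketch: log bridged frames addressed to the guest MAC.
# GUEST_MAC is a placeholder; substitute the MAC given to the guest NIC.
GUEST_MAC="${GUEST_MAC:-52:54:00:12:34:56}"
# DRY_RUN is a hypothetical switch: 1 prints commands, 0 runs them (needs root).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

log_guest_frames() {
    # FORWARD sees frames bridged between ports (e.g. eth0 -> tap1);
    # OUTPUT sees frames originated by the host itself.
    for chain in FORWARD OUTPUT; do
        run ebtables -I "$chain" -d "$GUEST_MAC" \
            --log-level info --log-prefix "br-$chain" -j CONTINUE
    done
}

log_guest_frames
```

With the rules installed, matching frames should show up in the host kernel log (dmesg) tagged with the chosen prefix. If a unicast ARP reply for the guest is logged in FORWARD with the tap device as the output interface but still never appears in tcpdump on that tap device, that would narrow the problem to the bridge-to-tap path.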
Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
Stefan Hajnoczi writes:

> On Tue, Apr 3, 2012 at 5:37 PM, Chris Webb wrote:
> > Stefan Hajnoczi writes:
> >
> >> >> Are you sure no other guest has the same MAC address or IP address?
> >> >> This weird behavior sounds similar to what happens when you have
> >> >> multiple devices on a network using the same address - the results are
> >> >> very confusing :).
> >> >
> >> > Yes, I agree! However, in this case there's no other guest with the
> >> > same MAC or IP address on the network. I've explicitly rechecked this
> >> > to be sure, and also deliberately varied the MAC address to something
> >> > I know can't be generated by our scripts. In any case, I'm using the
> >> > same MAC and IP address for every reboot of this VM, and usually
> >> > (19 times out of 20) it works fine.
> >>
> >> The lack of ARP reply is a host networking problem. Have you checked
> >> host dmesg(1) output just in case there was a kernel message related
> >> to this?
> >
> > Nothing there I'm afraid. Just the usual
> >
> >  device tap1 entered promiscuous mode
> >  ADDRCONF(NETDEV_UP): tap1: link is not ready
> >  ADDRCONF(NETDEV_CHANGE): tap1: link becomes ready
> >  br0: port 2(tap1) entering forwarding state
> >  br0: port 2(tap1) entering forwarding state
> >  kvm: 20288: cpu0 unhandled rdmsr: 0xc0010112
> >  kvm: 20288: cpu0 unhandled rdmsr: 0xc0010048
> >  tap1: no IPv6 routers present
> >  br0: port 2(tap1) entering forwarding state
> >  br0: port 2(tap1) entering forwarding state
> >  br0: port 2(tap1) entering forwarding state
> >  br0: port 2(tap1) entering forwarding state
> >  br0: port 2(tap1) entering disabled state
> >
> > cycle. It looks just the same for a working guest as for a non-working
> > guest.
>
> Is the "disabled state" because QEMU exited?

Yes, that's right.

> I'm afraid I don't have any suggestions beyond debugging the
> bridge->tap code in the kernel since packets are not being forwarded
> for some reason.

Many thanks for your help and suggestions nonetheless. It's reassuring to
hear it's not something completely obvious I'm overlooking.

Does the fact that this only happens with model=e1000, not model=virtio or
model=rtl8139 give us a clue as to what might be going wrong in the host
kernel? The observation which particularly baffles me if it's a host kernel
issue is that removing -usbdevice tablet from the guest makes the problem go
away!

More generally, my confusion with this bug is that guest changes like
model=e1000 -> model=rtl8139 fixing it or removing -usbdevice tablet fixing
it seem to imply a qemu problem rather than a host kernel bug, but -net tap
-> -net user fixing it seems to imply a host kernel bug rather than a qemu
problem!

Cheers,

Chris.
Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
On Tue, Apr 3, 2012 at 5:37 PM, Chris Webb wrote:
> Stefan Hajnoczi writes:
>
>> >> Are you sure no other guest has the same MAC address or IP address?
>> >> This weird behavior sounds similar to what happens when you have
>> >> multiple devices on a network using the same address - the results are
>> >> very confusing :).
>> >
>> > Yes, I agree! However, in this case there's no other guest with the
>> > same MAC or IP address on the network. I've explicitly rechecked this
>> > to be sure, and also deliberately varied the MAC address to something
>> > I know can't be generated by our scripts. In any case, I'm using the
>> > same MAC and IP address for every reboot of this VM, and usually
>> > (19 times out of 20) it works fine.
>>
>> The lack of ARP reply is a host networking problem. Have you checked
>> host dmesg(1) output just in case there was a kernel message related
>> to this?
>
> Nothing there I'm afraid. Just the usual
>
>  device tap1 entered promiscuous mode
>  ADDRCONF(NETDEV_UP): tap1: link is not ready
>  ADDRCONF(NETDEV_CHANGE): tap1: link becomes ready
>  br0: port 2(tap1) entering forwarding state
>  br0: port 2(tap1) entering forwarding state
>  kvm: 20288: cpu0 unhandled rdmsr: 0xc0010112
>  kvm: 20288: cpu0 unhandled rdmsr: 0xc0010048
>  tap1: no IPv6 routers present
>  br0: port 2(tap1) entering forwarding state
>  br0: port 2(tap1) entering forwarding state
>  br0: port 2(tap1) entering forwarding state
>  br0: port 2(tap1) entering forwarding state
>  br0: port 2(tap1) entering disabled state
>
> cycle. It looks just the same for a working guest as for a non-working
> guest.

Is the "disabled state" because QEMU exited?

I'm afraid I don't have any suggestions beyond debugging the
bridge->tap code in the kernel since packets are not being forwarded
for some reason.

Stefan
Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
Stefan Hajnoczi writes:

> >> Are you sure no other guest has the same MAC address or IP address?
> >> This weird behavior sounds similar to what happens when you have
> >> multiple devices on a network using the same address - the results are
> >> very confusing :).
> >
> > Yes, I agree! However, in this case there's no other guest with the same MAC
> > or IP address on the network. I've explicitly rechecked this to be sure, and
> > also deliberately varied the MAC address to something I know can't be
> > generated by our scripts. In any case, I'm using the same MAC and IP address
> > for every reboot of this VM, and usually (19 times out of 20) it works fine.
>
> The lack of ARP reply is a host networking problem. Have you checked
> host dmesg(1) output just in case there was a kernel message related
> to this?

Nothing there I'm afraid. Just the usual

  device tap1 entered promiscuous mode
  ADDRCONF(NETDEV_UP): tap1: link is not ready
  ADDRCONF(NETDEV_CHANGE): tap1: link becomes ready
  br0: port 2(tap1) entering forwarding state
  br0: port 2(tap1) entering forwarding state
  kvm: 20288: cpu0 unhandled rdmsr: 0xc0010112
  kvm: 20288: cpu0 unhandled rdmsr: 0xc0010048
  tap1: no IPv6 routers present
  br0: port 2(tap1) entering forwarding state
  br0: port 2(tap1) entering forwarding state
  br0: port 2(tap1) entering forwarding state
  br0: port 2(tap1) entering forwarding state
  br0: port 2(tap1) entering disabled state

cycle. It looks just the same for a working guest as for a non-working
guest.

Best wishes,

Chris.
Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
On Tue, Apr 3, 2012 at 2:41 PM, Chris Webb wrote:
> Stefan Hajnoczi writes:
>
>> No, that's weird. I would have simply tried "b start_xmit" and as
>> long as the binary has symbols gdb would know what to do.
>
> I'll get another one and give it a go. There wasn't any particular reason I
> gave a line number rather than a function except that I worried there might
> be start_xmit in a variety of different nic models compiled into qemu so I
> might end up setting the breakpoint on the wrong one. b hw/e1000.c:start_xmit
> maybe?

There is only one start_xmit() in qemu.

>> Are you sure no other guest has the same MAC address or IP address?
>> This weird behavior sounds similar to what happens when you have
>> multiple devices on a network using the same address - the results are
>> very confusing :).
>
> Yes, I agree! However, in this case there's no other guest with the same MAC
> or IP address on the network. I've explicitly rechecked this to be sure, and
> also deliberately varied the MAC address to something I know can't be
> generated by our scripts. In any case, I'm using the same MAC and IP address
> for every reboot of this VM, and usually (19 times out of 20) it works fine.

The lack of ARP reply is a host networking problem. Have you checked
host dmesg(1) output just in case there was a kernel message related
to this?

Stefan
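Since gdb reported Function "start_xmit" not defined, one quick sanity check is whether the running binary exposes that symbol at all. The helper below is only a sketch: has_text_symbol is a made-up name, and it filters the output of nm(1) for a text (code) symbol. A static function that the compiler inlined would not show up this way even in an unstripped binary.

```shell
#!/bin/sh
# Sketch: read `nm` output on stdin and succeed if the named function
# appears as a text symbol ("t" = local/static, "T" = global).
has_text_symbol() {
    awk -v sym="$1" '
        ($2 == "t" || $2 == "T") && $3 == sym { found = 1 }
        END { exit !found }
    '
}
```

Usage would be something like `nm /path/to/qemu-binary | has_text_symbol start_xmit && echo present`, with the path filled in for the binary that is actually running. If the symbol is present, `break start_xmit` in an attached gdb should resolve without needing a file or line number.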
Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
On Tue, Apr 3, 2012 at 1:42 PM, Chris Webb wrote:
> Stefan Hajnoczi writes:
>
>> In a case like this it might be most effective to catch a VM in the
>> bad state and then go in with gdb to see what is broken. The basic
>> approach would be putting breakpoints on the e1000 device model's
>> transmit/receive paths to see if the guest is giving us packets and
>> whether the tap device is transmitting/receiving. If guest and host
>> appear to be working then QEMU's e1000 model must be in a bad state
>> and it's a question of looking at the tx/rx rings and other hardware
>> emulation state to figure out what went wrong.
>
> Hi Stefan. I tried setting a breakpoint on start_xmit, but the qemu blew up
> when I hit it:
>
> (gdb) break /home/root/packages/qemu-kvm-1.0/src-hrw66F/hw/e1000.c:start_xmit
> Function "start_xmit" not defined.
> Make breakpoint pending on future shared library load? (y or [n]) n
> (gdb) break /home/root/packages/qemu-kvm-1.0/src-hrw66F/hw/e1000.c:528
> Breakpoint 1 at 0x46dcd6: file /home/root/packages/qemu-kvm-1.0/src-hrw66F/hw/e1000.c, line 528.
> (gdb) cont
> Continuing.
>
> Program terminated with signal SIGTRAP, Trace/breakpoint trap.
> The program no longer exists.
>
> I assume this is some subtlety with breakpointing threaded code?

No, that's weird. I would have simply tried "b start_xmit" and as
long as the binary has symbols gdb would know what to do.

> However, along these lines, I note that the guest appears to have received
> packets, though this count is stuck at 1993 bytes. The TX count marches
> upwards as I ping outbound from the guest.
>
> If I attach a tcpdump to tap1 on the host, I see the ARP requests going out
> and apparently no reply:
>
> 0024# tcpdump -i tap1
> tcpdump: WARNING: tap1: no IPv4 address assigned
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on tap1, link-type EN10MB (Ethernet), capture size 65535 bytes
> 12:08:35.654992 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:08:36.654976 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:08:37.654975 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:08:38.670933 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:08:39.670922 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:08:40.670908 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
>
> Looking on br0, I do seem to see the replies:
>
> 12:12:53.509471 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:12:53.509914 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
> 12:12:54.509455 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:12:54.509875 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
> 12:12:55.509447 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:12:55.509878 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
> 12:12:56.525424 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:12:56.525854 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
> 12:12:57.525408 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:12:57.525837 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
>
> but they never get to tap1 despite STP being disabled and no bridge port
> filtering:
>
> # ebtables -L
> Bridge table: filter
>
> Bridge chain: INPUT, entries: 0, policy: ACCEPT
>
> Bridge chain: FORWARD, entries: 0, policy: ACCEPT
>
> Bridge chain: OUTPUT, entries: 0, policy: ACCEPT
>
> # brctl show br0
> bridge name     bridge id               STP enabled     interfaces
> br0             8000.002590224ffa       no              eth0
>
> This looks uncannily like a kernel problem doesn't it? However, remove the
> -usbdevice tablet, and it goes away, which is truly weird! I've just done a
> hundred successful reboots without it once again to confirm to myself that I'm
> definitely not imagining that behaviour.

Are you sure no other guest has the same MAC address or IP address?
This weird behavior sounds similar to what happens when you have
multiple devices on a network using the same address - the results are
very confusing :).

Stefan
Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
Stefan Hajnoczi writes:

> No, that's weird. I would have simply tried "b start_xmit" and as
> long as the binary has symbols gdb would know what to do.

I'll get another one and give it a go. There wasn't any particular reason I
gave a line number rather than a function except that I worried there might
be start_xmit in a variety of different nic models compiled into qemu so I
might end up setting the breakpoint on the wrong one. b hw/e1000.c:start_xmit
maybe?

> Are you sure no other guest has the same MAC address or IP address?
> This weird behavior sounds similar to what happens when you have
> multiple devices on a network using the same address - the results are
> very confusing :).

Yes, I agree! However, in this case there's no other guest with the same MAC
or IP address on the network. I've explicitly rechecked this to be sure, and
also deliberately varied the MAC address to something I know can't be
generated by our scripts. In any case, I'm using the same MAC and IP address
for every reboot of this VM, and usually (19 times out of 20) it works fine.

Cheers,

Chris.
Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
Stefan Hajnoczi writes:

> In a case like this it might be most effective to catch a VM in the
> bad state and then go in with gdb to see what is broken. The basic
> approach would be putting breakpoints on the e1000 device model's
> transmit/receive paths to see if the guest is giving us packets and
> whether the tap device is transmitting/receiving. If guest and host
> appear to be working then QEMU's e1000 model must be in a bad state
> and it's a question of looking at the tx/rx rings and other hardware
> emulation state to figure out what went wrong.

Hi Stefan. I tried setting a breakpoint on start_xmit, but the qemu blew up
when I hit it:

  (gdb) break /home/root/packages/qemu-kvm-1.0/src-hrw66F/hw/e1000.c:start_xmit
  Function "start_xmit" not defined.
  Make breakpoint pending on future shared library load? (y or [n]) n
  (gdb) break /home/root/packages/qemu-kvm-1.0/src-hrw66F/hw/e1000.c:528
  Breakpoint 1 at 0x46dcd6: file /home/root/packages/qemu-kvm-1.0/src-hrw66F/hw/e1000.c, line 528.
  (gdb) cont
  Continuing.

  Program terminated with signal SIGTRAP, Trace/breakpoint trap.
  The program no longer exists.

I assume this is some subtlety with breakpointing threaded code?

However, along these lines, I note that the guest appears to have received
packets, though this count is stuck at 1993 bytes. The TX count marches upwards
as I ping outbound from the guest.

If I attach a tcpdump to tap1 on the host, I see the ARP requests going out and
apparently no reply:

  0024# tcpdump -i tap1
  tcpdump: WARNING: tap1: no IPv4 address assigned
  tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
  listening on tap1, link-type EN10MB (Ethernet), capture size 65535 bytes
  12:08:35.654992 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:08:36.654976 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:08:37.654975 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:08:38.670933 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:08:39.670922 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:08:40.670908 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28

Looking on br0, I do seem to see the replies:

  12:12:53.509471 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:12:53.509914 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
  12:12:54.509455 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:12:54.509875 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
  12:12:55.509447 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:12:55.509878 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
  12:12:56.525424 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:12:56.525854 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
  12:12:57.525408 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:12:57.525837 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46

but they never get to tap1 despite STP being disabled and no bridge port
filtering:

  # ebtables -L
  Bridge table: filter

  Bridge chain: INPUT, entries: 0, policy: ACCEPT

  Bridge chain: FORWARD, entries: 0, policy: ACCEPT

  Bridge chain: OUTPUT, entries: 0, policy: ACCEPT

  # brctl show br0
  bridge name     bridge id               STP enabled     interfaces
  br0             8000.002590224ffa       no              eth0

This looks uncannily like a kernel problem doesn't it? However, remove the
-usbdevice tablet, and it goes away, which is truly weird! I've just done a
hundred successful reboots without it once again to confirm to myself that I'm
definitely not imagining that behaviour.

> Have you tried unloading the e1000 kernel module inside the guest and
> then modprobing it again? Does this "fix" the issue?

Hadn't thought of that, but no, it apparently has no effect. It's still broken
after I rmmod it, modprobe it again, and reconfigure the networking.

Cheers,

Chris.
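The tap1-versus-br0 comparison above can be made less eyeball-dependent by counting ARP requests and replies in each capture. The helper below is a sketch only: arp_summary is a made-up name, and it assumes tcpdump's default one-line-per-packet text output in the form shown in the captures above.

```shell
#!/bin/sh
# Sketch: summarise tcpdump text output (stdin) as ARP request/reply counts.
arp_summary() {
    awk '
        /ARP,.*Request who-has/ { req++ }
        /ARP,.*Reply/           { rep++ }
        END { printf "requests=%d replies=%d\n", req, rep }
    '
}
```

For example, `tcpdump -n -l -c 20 -i tap1 arp | arp_summary` and the same on br0 (run as root, interface names as in this thread). In the failure described here one would expect a non-zero reply count on br0 and zero replies on tap1.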
Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
On Tue, Apr 3, 2012 at 9:13 AM, Chris Webb wrote:
> Stefan Hajnoczi writes:
>> On Mon, Apr 02, 2012 at 04:37:23PM +0100, Chris Webb wrote:
>> It sounds like this is not the issue, but are you sure the bridge has
>> forwarding delay set to 0 or Spanning Tree Protocol disabled? With STP
>> enabled no traffic will be forwarded by the bridge for a configured
>> timeout, and depending on the timing of your VM bootup you could see
>> weird things. You can check with brctl showstp br0.
>
> No STP enabled, but the networking is permanently broken on these guests in
> any case, not just slow to get started. Usually they've been sat there for
> half an hour or more by the time I get back to the stopped reboot loop, and
> I left one broken over a weekend without it fixing itself. The network is
> statically configured, so if it were down temporarily and came back, pings
> would then start working fine.

In a case like this it might be most effective to catch a VM in the
bad state and then go in with gdb to see what is broken. The basic
approach would be putting breakpoints on the e1000 device model's
transmit/receive paths to see if the guest is giving us packets and
whether the tap device is transmitting/receiving. If guest and host
appear to be working then QEMU's e1000 model must be in a bad state
and it's a question of looking at the tx/rx rings and other hardware
emulation state to figure out what went wrong.

Have you tried unloading the e1000 kernel module inside the guest and
then modprobing it again? Does this "fix" the issue?

Stefan
Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
Stefan Hajnoczi writes:

> On Mon, Apr 02, 2012 at 04:37:23PM +0100, Chris Webb wrote:
> > We initially saw a problem after an upgrade from 0.15.x to 1.0.
>
> Perhaps git-bisect(1) can help you track down the change that introduced
> this between 0.15 and 1.0.

Hi. I attempted this, but the bug is so intermittent and there are so many
unrelated red-herring breakages along the branchy path between the two that
I had to abandon the effort after a week I'm afraid. It's phenomenally time
consuming, unlike any other bug I've tried to bisect.

> It sounds like this is not the issue, but are you sure the bridge has
> forwarding delay set to 0 or Spanning Tree Protocol disabled? With STP
> enabled no traffic will be forwarded by the bridge for a configured
> timeout, and depending on the timing of your VM bootup you could see
> weird things. You can check with brctl showstp br0.

No STP enabled, but the networking is permanently broken on these guests in
any case, not just slow to get started. Usually they've been sat there for
half an hour or more by the time I get back to the stopped reboot loop, and
I left one broken over a weekend without it fixing itself. The network is
statically configured, so if it were down temporarily and came back, pings
would then start working fine.

Cheers,

Chris.
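For what it's worth, an intermittent failure like this can in principle be automated with `git bisect run` by repeating the test many times per commit and treating any single failure as "bad". The wrapper below is a sketch, not something that was used in this thread: repeat_check is a made-up name, and the boot-and-ping script in the usage note is hypothetical.

```shell
#!/bin/sh
# Sketch: run a command up to N times; fail as soon as one run fails.
# usage: repeat_check N command [args...]
repeat_check() {
    n="$1"
    shift
    i=1
    while [ "$i" -le "$n" ]; do
        "$@" || return 1    # one failure is enough to call the commit bad
        i=$((i + 1))
    done
    return 0                # N clean runs: treat the commit as good
}
```

A bisect script could then build the tree, exit with status 125 if the build or an unrelated breakage gets in the way (git bisect run treats 125 as "skip this commit"), and otherwise finish with something like `repeat_check 100 ./boot-and-ping.sh`. The cost is N boots per commit, and a commit can still be wrongly marked good: assuming independent failures at roughly 1 in 20, 100 clean runs on a bad commit would happen about 0.6% of the time.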
Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
On Mon, Apr 02, 2012 at 04:37:23PM +0100, Chris Webb wrote:
> We initially saw a problem after an upgrade from 0.15.x to 1.0.

Perhaps git-bisect(1) can help you track down the change that introduced
this between 0.15 and 1.0.

> Once I've got a guest with broken networking, the network stays down even if
> I do things like 'ip link set eth0 down; sleep 5; ip link set eth0 up'.
> Killing and restarting the same VM, it runs fine next time.

It sounds like this is not the issue, but are you sure the bridge has
forwarding delay set to 0 or Spanning Tree Protocol disabled? With STP
enabled no traffic will be forwarded by the bridge for a configured
timeout, and depending on the timing of your VM bootup you could see
weird things. You can check with brctl showstp br0.

Stefan
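As an alternative to reading brctl showstp output, the same two settings can be read from sysfs on a Linux host. This is a sketch under assumptions: bridge_delay_report is a made-up name, and SYSFS_ROOT is a hypothetical override that exists only so the function can be exercised against a fake directory tree. The forward_delay value is printed raw, in whatever units the kernel reports.

```shell
#!/bin/sh
# Sketch: print a bridge's STP state and forwarding delay from sysfs.
# Returns 0 if STP is off or the delay is 0, 1 otherwise, 2 on read errors.
bridge_delay_report() {
    br="$1"
    dir="${SYSFS_ROOT:-/sys}/class/net/$br/bridge"
    stp=$(cat "$dir/stp_state") || return 2
    delay=$(cat "$dir/forward_delay") || return 2
    echo "$br: stp_state=$stp forward_delay=$delay"
    if [ "$stp" -eq 0 ] || [ "$delay" -eq 0 ]; then
        return 0
    fi
    return 1
}
```

Calling `bridge_delay_report br0` on the host prints one line per bridge. A non-zero stp_state together with a non-zero forward_delay would be the configuration in which a freshly added tap port might not forward traffic for a while after the VM starts.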