Re: [Xen-devel] Bug in hash changes to netback in 4.7.2 kernel
On 06/09/2016 13:57, Paul Durrant wrote:
>> -----Original Message-----
>> From: Anthony Wright [mailto:anth...@overnetdata.com]
>> Sent: 06 September 2016 13:23
>> To: Xen-devel
>> Cc: Paul Durrant
>> Subject: Bug in hash changes to netback in 4.7.2 kernel
>>
>> When I run Xen (4.7.0) nested in VirtualBox (5.0.24_Ubuntu r108355) with a
>> linux-4.7.2 Dom0 kernel, none of my DomUs (linux-3.17.3) have network
>> connectivity because they reject all packets with the error 'Invalid extra
>> type: 4'. When I run exactly the same setup on bare metal, I don't get the
>> error messages.
>>
>> From poking around in the code this seems to be because the 4.7.2 kernel
>> wrongly decides that the DomUs will understand EXTRA_TYPE_HASH, and so
>> attaches it to the network packet. Since the DomUs don't understand the
>> extra info, their netfront driver rejects the whole packet.
> The code in xenvif_select_queue() deliberately clears the skb->sw_hash field
> (which gates adding the new extra type) if the hash algorithm selected by the
> frontend is 'none', which should be the default. So, unless you have a
> frontend that is implementing the control ring protocol but failing to
> recognize the new extra type, I'm not sure how you're seeing the problem...
> unless somehow a packet with a hash is getting into netback's start_xmit
> without first having gone through select_queue?

I very much doubt that the frontend is implementing the control ring protocol; the DomUs are running stock linux-3.17.3. I build the system from source, so I'm happy to re-compile with debug code.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
Re: [Xen-devel] xen/xen vs xen/kvm nesting with pv drivers
On 06/09/2016 14:05, Andrew Cooper wrote:
> On 06/09/16 13:47, Anthony Wright wrote:
>> I tried to install Xen (4.7.0 with linux 4.7.2 Dom0) on an AWS virtual
>> machine and it failed, because while AWS uses Xen it requires that you use
>> the PVHVM network driver. I then tried to install Xen on a Google Cloud
>> virtual machine and, despite it also requiring you to use PV drivers, that
>> succeeded because Google Cloud uses KVM.
>>
>> I think this means that if you nest Xen in KVM you can use high-performance
>> drivers, but if you nest Xen in Xen you have to use slower drivers, which
>> seems to be the wrong way around!
>>
>> I'd like to be able to install Xen on an AWS virtual machine, and wondered
>> what the challenges are to getting the PV drivers working in a nested
>> environment. Is this a problem with the Dom0 kernel only expecting there to
>> be a single XenStore, or is there also a problem in Xen?
> Nesting Xen inside Xen and getting high-speed drivers at L1 is a hard
> problem, which is why no one has tackled it yet.
>
> The problems all revolve around L1's dom0. It can't issue hypercalls to
> L0, meaning that it can't find or connect the xenstore ring. Even if it
> could, there is the problem of multiple xenstores, which doesn't fit in
> the current architecture.
>
> It would be lovely if someone would work on this, but it is a very large
> swamp.
>
> ~Andrew

Does the L1 Dom0 have to issue the hypercalls directly? Would it be possible for the L1 Dom0 to issue the request to the L1 hypervisor, and for that to call the L0 hypervisor? This would seem to fit the current architecture fairly closely. (Sorry if I've got the terminology wrong.)

Regarding multiple XenStores, I appreciate there would be significant problems, but you'd only have a maximum of two XenStores: one for the xenback drivers (the current XenStore) and one for the xenfront drivers (which talks to the parent hypervisor).
Anthony
[Xen-devel] xen/xen vs xen/kvm nesting with pv drivers
I tried to install Xen (4.7.0 with linux 4.7.2 Dom0) on an AWS virtual machine and it failed, because while AWS uses Xen it requires that you use the PVHVM network driver. I then tried to install Xen on a Google Cloud virtual machine and, despite it also requiring you to use PV drivers, that succeeded because Google Cloud uses KVM.

I think this means that if you nest Xen in KVM you can use high-performance drivers, but if you nest Xen in Xen you have to use slower drivers, which seems to be the wrong way around!

I'd like to be able to install Xen on an AWS virtual machine, and wondered what the challenges are to getting the PV drivers working in a nested environment. Is this a problem with the Dom0 kernel only expecting there to be a single XenStore, or is there also a problem in Xen?

Thanks,

Anthony Wright
[Xen-devel] Bug in hash changes to netback in 4.7.2 kernel
When I run Xen (4.7.0) nested in VirtualBox (5.0.24_Ubuntu r108355) with a linux-4.7.2 Dom0 kernel, none of my DomUs (linux-3.17.3) have network connectivity because they reject all packets with the error 'Invalid extra type: 4'. When I run exactly the same setup on bare metal, I don't get the error messages.

From poking around in the code this seems to be because the 4.7.2 kernel wrongly decides that the DomUs will understand EXTRA_TYPE_HASH, and so attaches it to the network packet. Since the DomUs don't understand the extra info, their netfront driver rejects the whole packet. I'm guessing that the nesting is confusing the new hash code. I also wonder if the DomUs should simply ignore extra info that they don't understand rather than rejecting the packet.

Cheers,

Anthony
Re: [Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0
On 09/12/2014 22:30, Siegmann Joseph wrote:
> Would you mind sharing what it would take to correct this issue... is
> there a file I could just replace until a patch is released?

We simply applied David Vrabel's patch from 8/12/14 to the stock 3.17.3 kernel we were running in the DomU and it fixed the problem.
Re: [Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0
On 08/12/2014 12:03, David Vrabel wrote:
> Does this patch to netfront fix it?
>
> 8<-
> xen-netfront: use correct linear area after linearizing an skb
>
> Commit 97a6d1bb2b658ac85ed88205ccd1ab809899884d (xen-netfront: Fix
> handling packets on compound pages with skb_linearize) attempted to
> fix a problem where an skb that would have required too many slots
> would be dropped, causing TCP connections to stall.
>
> However, it filled in the first slot using the original buffer and not
> the new one, and would use the wrong offset and grant access to the
> wrong page.
>
> Netback would notice the malformed request and stop all traffic on the
> VIF, reporting:
>
>   vif vif-3-0 vif3.0: txreq.offset: 85e, size: 4002, end: 6144
>   vif vif-3-0 vif3.0: fatal error; disabling device
>
> Signed-off-by: David Vrabel
> ---
>  drivers/net/xen-netfront.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index ece8d18..eeed0ce 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -627,6 +627,9 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  			   slots, skb->len);
>  		if (skb_linearize(skb))
>  			goto drop;
> +		data = skb->data;
> +		offset = offset_in_page(data);
> +		len = skb_headlen(skb);
>  	}
>
>  	spin_lock_irqsave(&queue->tx_lock, flags);

The patch seems to have worked. Before, we'd managed to reproduce the problem in under 10 seconds; with the patch we haven't seen the problem on either the test or production systems. Thank you.

Anthony.
Re: [Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0
> On 01/12/14 14:22, David Vrabel wrote:
> This VIF protocol is weird. The first slot contains a txreq with a size
> for the total length of the packet; subsequent slots have sizes for that
> fragment only.
>
> netback then has to calculate how long the first slot is, by subtracting
> the sizes of all the following slots.
>
> So something has gone wrong, but it's not obvious what it is. Any chance
> you can dump the ring state when it happens?

We think we've worked out how to dump the ring state, please see below.

dmesg output:

[76571.687014] vif vif-6-0 vif6.0: txreq.offset: a5e, size: 4002, end: 6656
[76571.687035] vif vif-6-0 vif6.0: fatal error; disabling device
[76571.700304] br-primary-1: port 2(vif6.0) entered disabled state

/sys/kernel/debug/xen-netback/vif6.0/io_ring_q0:

Queue 0
TX: nr_ents 256
  req prod 10164 (39) cons 10127 (2) event 10126 (1)
  rsp prod 10125 (base) pvt 10125 (0) event 10145 (20)
  pending prod 9589 pending cons 9333 nr_pending_reqs 0
  dealloc prod 8501 dealloc cons 8501 dealloc_queue 0
RX: nr_ents 256
  req prod 1321 (41) cons 1280 (0) event 1 (-1279)
  rsp prod 1280 (base) pvt 1280 (0) event 1281 (1)
NAPI state: 1 NAPI weight: 64 TX queue len 0
Credit timer_pending: 0, credit: 18446744073709551615, usec: 0
remaining: 18446744073678062682, expires: 0, now: 4314107964

/sys/kernel/debug/xen-netback/vif6.0/io_ring_q1:

Queue 1
TX: nr_ents 256
  req prod 10106 (0) cons 10106 (0) event 10107 (1)
  rsp prod 10106 (base) pvt 10106 (0) event 10107 (1)
  pending prod 9573 pending cons 9317 nr_pending_reqs 0
  dealloc prod 8503 dealloc cons 8503 dealloc_queue 0
RX: nr_ents 256
  req prod 594 (39) cons 555 (0) event 1 (-554)
  rsp prod 555 (base) pvt 555 (0) event 556 (1)
NAPI state: 1 NAPI weight: 64 TX queue len 0
Credit timer_pending: 0, credit: 18446744073709551615, usec: 0
remaining: 18446744073678038030, expires: 0, now: 4314118667
Re: [Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0
----- Original Message -----
> On 01/12/14 14:22, David Vrabel wrote:
> > On 28/11/14 15:19, Anthony Wright wrote:
> > The guest's frontend driver isn't putting valid requests onto the
> > ring (it crosses a page boundary), so this is a frontend bug.
>
> This VIF protocol is weird. The first slot contains a txreq with a size
> for the total length of the packet; subsequent slots have sizes for that
> fragment only.
>
> netback then has to calculate how long the first slot is, by subtracting
> the sizes of all the following slots.
>
> So something has gone wrong, but it's not obvious what it is. Any chance
> you can dump the ring state when it happens?

Really sorry, but how do I dump the ring state? I can get a root shell on both the Dom0 and the DomU, but I don't know the command to use to dump the ring state.

Anthony.
Re: [Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0
> On 28/11/14 15:19, Anthony Wright wrote:
> > We have a 64 bit PV DomU that we recently upgraded from linux 3.3.2
> > to 3.17.3, running on a 64 bit 3.17.3 Dom0 with Xen 4.4.0.
> >
> > Shortly after the upgrade we started to lose network connectivity to
> > the DomU a few times a day, which required a reboot to fix. We see
> > nothing in the xen logs or xl dmesg, but when we looked at the dmesg
> > output we saw the following output for the two incidents we
> > investigated in detail:
> >
> > [69332.026586] vif vif-4-0 vif4.0: txreq.offset: 85e, size: 4002, end: 6144
> > [69332.026607] vif vif-4-0 vif4.0: fatal error; disabling device
> > [69332.031069] br-default: port 2(vif4.0) entered disabled state
>
> The guest's frontend driver isn't putting valid requests onto the ring
> (it crosses a page boundary), so this is a frontend bug.
>
> What guest are you running?

We're running a custom-built 64 bit para-virtualised DomU with a stock Linux 3.17.3 downloaded from kernel.org. The problem only started happening when we upgraded the DomU Linux kernel from 3.3.2.
Re: [Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0
On 28/11/2014 15:23, Ian Campbell wrote:
> On Fri, 2014-11-28 at 15:19 +0000, Anthony Wright wrote:
>> We have a 64 bit PV DomU that we recently upgraded from linux 3.3.2 to
>> 3.17.3
> Is this a Debian kernel? In which case you might be seeing

It's a stock kernel from kernel.org; we have a custom system with no relation to Debian.

> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=767261 , this will be
> fixed in the next upload of the kernel, test binaries with the fixes are
> referenced in the bug log.

The error messages we're seeing are different from those reported: both the Dom0 and DomU continue to run correctly, and the vif doesn't degrade slowly; it fails the test in netback.c below, which disables the interface:

	/* No crossing a page as the payload mustn't fragment. */
	if (unlikely((txreq.offset + txreq.size) > PAGE_SIZE)) {
		netdev_err(queue->vif->dev,
			   "txreq.offset: %x, size: %u, end: %lu\n",
			   txreq.offset, txreq.size,
			   (txreq.offset & ~PAGE_MASK) + txreq.size);
		xenvif_fatal_tx_err(queue->vif);
		break;
	}

> Even if not Debian then you'll probably want the same set of backports.

I'm happy to apply the backports if you think it's likely to fix the problem despite the different symptoms, but from what I can see it looks like a different problem.

thanks,

Anthony

> Ian.
>> running on a 64 bit 3.17.3 Dom0 with Xen 4.4.0.
>>
>> Shortly after the upgrade we started to lose network connectivity to the
>> DomU a few times a day that required a reboot to fix. We see nothing in
>> the xen logs or xl dmesg, but when we looked at the dmesg output we saw
>> the following output for the two incidents we investigated in detail:
>>
>> [69332.026586] vif vif-4-0 vif4.0: txreq.offset: 85e, size: 4002, end: 6144
>> [69332.026607] vif vif-4-0 vif4.0: fatal error; disabling device
>> [69332.031069] br-default: port 2(vif4.0) entered disabled state
>>
>> [824365.530740] vif vif-9-0 vif9.0: txreq.offset: a5e, size: 4002, end: 6656
>> [824365.530748] vif vif-9-0 vif9.0: fatal error; disabling device
>> [824365.531191] br-default: port 2(vif9.0) entered disabled state
>>
>> We have a very similar setup running on another machine with a 3.17.3
>> DomU, 3.17.3 Dom0 and Xen 4.4.0, but we can't reproduce the issue on this
>> machine. This is a test system rather than a production system, so it has a
>> different workload and fewer CPUs.
>>
>> The piece of code that outputs the error is in
>> drivers/net/xen-netback/netback.c.
>>
>> The DomU has 4000MB of RAM and 8 CPUs.
>>
>> Any ideas?
>>
>> Thanks,
>>
>> Anthony.
[Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0
We have a 64 bit PV DomU that we recently upgraded from linux 3.3.2 to 3.17.3, running on a 64 bit 3.17.3 Dom0 with Xen 4.4.0.

Shortly after the upgrade we started to lose network connectivity to the DomU a few times a day, which required a reboot to fix. We see nothing in the xen logs or xl dmesg, but when we looked at the dmesg output we saw the following for the two incidents we investigated in detail:

[69332.026586] vif vif-4-0 vif4.0: txreq.offset: 85e, size: 4002, end: 6144
[69332.026607] vif vif-4-0 vif4.0: fatal error; disabling device
[69332.031069] br-default: port 2(vif4.0) entered disabled state

[824365.530740] vif vif-9-0 vif9.0: txreq.offset: a5e, size: 4002, end: 6656
[824365.530748] vif vif-9-0 vif9.0: fatal error; disabling device
[824365.531191] br-default: port 2(vif9.0) entered disabled state

We have a very similar setup running on another machine with a 3.17.3 DomU, 3.17.3 Dom0 and Xen 4.4.0, but we can't reproduce the issue on this machine. This is a test system rather than a production system, so it has a different workload and fewer CPUs.

The piece of code that outputs the error is in drivers/net/xen-netback/netback.c.

The DomU has 4000MB of RAM and 8 CPUs.

Any ideas?

Thanks,

Anthony.