Re: [Xen-devel] Bug in hash changes to netback in 4.7.2 kernel

2016-09-06 Thread Anthony Wright
On 06/09/2016 13:57, Paul Durrant wrote:
>> -----Original Message-----
>> From: Anthony Wright [mailto:anth...@overnetdata.com]
>> Sent: 06 September 2016 13:23
>> To: Xen-devel 
>> Cc: Paul Durrant 
>> Subject: Bug in hash changes to netback in 4.7.2 kernel
>>
>> When I run Xen (4.7.0) nested in VirtualBox (5.0.24_Ubuntu r108355) with a
>> linux-4.7.2 Dom0 kernel, none of my DomUs (linux-3.17.3) have network
>> connectivity because they reject all packets with the error 'Invalid extra
>> type: 4'. When I run exactly the same setup on bare metal, I don't get the
>> error messages.
>>
>> From poking around in the code, this seems to be because the 4.7.2 kernel
>> wrongly decides that the DomUs will understand EXTRA_TYPE_HASH, and so
>> attaches it to the network packet. Since the DomUs don't understand the
>> extra info, their netfront driver rejects the whole packet.
> The code in xenvif_select_queue() deliberately clears the skb->sw_hash field 
> (which gates adding the new extra type) if the hash algorithm selected by the 
> frontend is 'none', which should be the default. So, unless you have a 
> frontend that implements the control ring protocol but fails to recognize 
> the new extra type, I'm not sure how you're seeing the problem... unless 
> somehow a packet with a hash is getting into netback's start_xmit without 
> first having gone through select_queue?
I very much doubt that the frontend is implementing the control ring
protocol; the DomUs are running stock linux-3.17.3. I build the system
from source, so I'm happy to re-compile with debug code.
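
For reference, the gating Paul describes amounts to something like the
following (a paraphrase of the idea, not the exact 4.7 netback source):

	/* In xenvif_select_queue(): if the frontend never selected a hash
	 * algorithm over the control ring, clear any software hash so that
	 * start_xmit later sees no skb->sw_hash and never emits the new
	 * XEN_NETIF_EXTRA_TYPE_HASH extra segment.
	 */
	if (vif->hash.alg == XEN_NETIF_CTRL_HASH_ALGORITHM_NONE)
		skb_clear_hash(skb);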



Re: [Xen-devel] xen/xen vs xen/kvm nesting with pv drivers

2016-09-06 Thread Anthony Wright
On 06/09/2016 14:05, Andrew Cooper wrote:
> On 06/09/16 13:47, Anthony Wright wrote:
>> I tried to install Xen (4.7.0 with linux 4.7.2 Dom0) on an AWS virtual 
>> machine, and it failed because, while AWS itself runs on Xen, it requires 
>> that you use the PVHVM network driver. I then tried to install Xen on a 
>> Google Cloud virtual machine and, although Google Cloud also requires you 
>> to use PV drivers, that succeeded because Google Cloud uses KVM.
>>
>> I think this means that if you nest Xen in KVM you can use high performance 
>> drivers, but if you nest Xen in Xen you have to use slower drivers, which 
>> seems to be the wrong way around!
>>
>> I'd like to be able to install Xen on an AWS virtual machine, and wondered 
>> what the challenges are to getting the PV drivers working in a nested 
>> environment. Is this a problem with the Dom0 kernel only expecting there to 
>> be a single XenStore, or is there also a problem in Xen?
> Nesting Xen inside Xen and getting high-speed drivers at L1 is a hard
> problem, which is why no one has tackled it yet.
>
> The problems all revolve around L1's dom0.  It can't issue hypercalls to
> L0, meaning that it can't find or connect the xenstore ring.  Even if it
> could, there is the problem of multiple xenstores, which doesn't fit in
> the current architecture.
>
> It would be lovely if someone would work on this, but it is a very large
> swamp.
>
> ~Andrew
Does the L1's Dom0 have to issue the hypercalls directly? Would it be
possible to get the L1's Dom0 to issue the request to the L1 hypervisor,
and have that call the L0 hypervisor? This would seem to fit the current
architecture fairly closely. (Sorry if I've got the terminology wrong.)

Regarding multiple XenStores, I appreciate there would be significant
problems, but you'd only have a maximum of two XenStores: one for the
xenback drivers (the current XenStore) and one for the xenfront drivers
(which talks to the parent hypervisor).

Anthony




[Xen-devel] xen/xen vs xen/kvm nesting with pv drivers

2016-09-06 Thread Anthony Wright
I tried to install Xen (4.7.0 with linux 4.7.2 Dom0) on an AWS virtual machine, 
and it failed because, while AWS itself runs on Xen, it requires that you use 
the PVHVM network driver. I then tried to install Xen on a Google Cloud virtual 
machine and, although Google Cloud also requires you to use PV drivers, that 
succeeded because Google Cloud uses KVM.

I think this means that if you nest Xen in KVM you can use high performance 
drivers, but if you nest Xen in Xen you have to use slower drivers, which seems 
to be the wrong way around!

I'd like to be able to install Xen on an AWS virtual machine, and wondered what 
the challenges are to getting the PV drivers working in a nested environment. 
Is this a problem with the Dom0 kernel only expecting there to be a single 
XenStore, or is there also a problem in Xen?

Thanks,

Anthony Wright



[Xen-devel] Bug in hash changes to netback in 4.7.2 kernel

2016-09-06 Thread Anthony Wright
When I run Xen (4.7.0) nested in VirtualBox (5.0.24_Ubuntu r108355) with a 
linux-4.7.2 Dom0 kernel, none of my DomUs (linux-3.17.3) have network 
connectivity because they reject all packets with the error 'Invalid extra 
type: 4'. When I run exactly the same setup on bare metal, I don't get the 
error messages.

From poking around in the code, this seems to be because the 4.7.2 kernel 
wrongly decides that the DomUs will understand EXTRA_TYPE_HASH, and so attaches 
it to the network packet. Since the DomUs don't understand the extra info, 
their netfront driver rejects the whole packet.
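
For context, the rejection path in a 3.17-era netfront looks roughly like
this (paraphrased from xennet_get_extras(), not verbatim; the tree's
XEN_NETIF_EXTRA_TYPE_MAX predates the hash type, so the new type 4 fails
the range check, matching the 'Invalid extra type: 4' message):

	if (unlikely(!extra->type ||
		     extra->type >= XEN_NETIF_EXTRA_TYPE_MAX)) {
		if (net_ratelimit())
			dev_warn(dev, "Invalid extra type: %d\n",
				 extra->type);
		err = -EINVAL;
	}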

I'm guessing that the nesting is confusing the new hash code.

I also wonder if the DomUs should simply ignore extra info that they don't 
understand rather than rejecting the packet.

Cheers,

Anthony



Re: [Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0

2014-12-09 Thread Anthony Wright
On 09/12/2014 22:30, Siegmann Joseph wrote:
>
> Would you mind sharing what it would take to correct this issue… is
> there a file I could just replace until a patch is released?
>
We simply applied David Vrabel's patch from 8/12/14 to the stock 3.17.3
kernel we were running in the DomU, and it fixed the problem.


Re: [Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0

2014-12-09 Thread Anthony Wright
On 08/12/2014 12:03, David Vrabel wrote:
> Does this patch to netfront fix it?
>
> 8<------------------------------------------------
> xen-netfront: use correct linear area after linearizing an skb
>
> Commit 97a6d1bb2b658ac85ed88205ccd1ab809899884d (xen-netfront: Fix
> handling packets on compound pages with skb_linearize) attempted to
> fix a problem where an skb that would have required too many slots
> would be dropped causing TCP connections to stall.
>
> However, it filled in the first slot using the original buffer and not
> the new one and would use the wrong offset and grant access to the
> wrong page.
>
> Netback would notice the malformed request and stop all traffic on the
> VIF, reporting:
>
> vif vif-3-0 vif3.0: txreq.offset: 85e, size: 4002, end: 6144
> vif vif-3-0 vif3.0: fatal error; disabling device
>
> Signed-off-by: David Vrabel 
> ---
>  drivers/net/xen-netfront.c |3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index ece8d18..eeed0ce 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -627,6 +627,9 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  				    slots, skb->len);
>  		if (skb_linearize(skb))
>  			goto drop;
> +		data = skb->data;
> +		offset = offset_in_page(data);
> +		len = skb_headlen(skb);
>  	}
>  
>  	spin_lock_irqsave(&queue->tx_lock, flags);
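
In effect (an annotated restatement of the three added lines, not further
patch content): skb_linearize() replaces the skb's linear buffer, so the
values computed from the old buffer earlier in xennet_start_xmit() must be
re-derived before the first slot is filled in:

	data   = skb->data;		/* now points into the new buffer */
	offset = offset_in_page(data);	/* offset within the new page */
	len    = skb_headlen(skb);	/* length of the new linear area */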
The patch seems to have worked. Before, we'd managed to reproduce the
problem in under 10 seconds; with the patch we haven't seen the problem
on the test or production systems.

Thank you.

Anthony.





Re: [Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0

2014-12-04 Thread Anthony Wright
> On 01/12/14 14:22, David Vrabel wrote:
> This VIF protocol is weird. The first slot contains a txreq with a size
> for the total length of the packet; subsequent slots have sizes for that
> fragment only.
> 
> netback then has to calculate how long the first slot is, by subtracting
> all the sizes from the following slots.
> 
> So something has gone wrong, but it's not obvious what it is. Any chance
> you can dump the ring state when it happens?
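
For reference, the first-slot calculation David describes amounts to
something like this (an illustrative sketch with hypothetical variable
names, not the actual netback source):

	/* Slot 0's size field carries the total packet length; later
	 * slots carry per-fragment sizes, so the first (linear) area
	 * is whatever remains after subtracting them.
	 */
	first_len = slots[0].size;
	for (i = 1; i < nr_slots; i++)
		first_len -= slots[i].size;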

We think we've worked out how to dump the ring state; please see below.

dmesg output

[76571.687014] vif vif-6-0 vif6.0: txreq.offset: a5e, size: 4002, end: 6656
[76571.687035] vif vif-6-0 vif6.0: fatal error; disabling device
[76571.700304] br-primary-1: port 2(vif6.0) entered disabled state

/sys/kernel/debug/xen-netback/vif6.0/io_ring_q0
===
Queue 0
TX: nr_ents 256
req prod 10164 (39) cons 10127 (2) event 10126 (1)
rsp prod 10125 (base) pvt 10125 (0) event 10145 (20)
pending prod 9589 pending cons 9333 nr_pending_reqs 0
dealloc prod 8501 dealloc cons 8501 dealloc_queue 0

RX: nr_ents 256
req prod 1321 (41) cons 1280 (0) event 1 (-1279)
rsp prod 1280 (base) pvt 1280 (0) event 1281 (1)

NAPI state: 1 NAPI weight: 64 TX queue len 0
Credit timer_pending: 0, credit: 18446744073709551615, usec: 0
remaining: 18446744073678062682, expires: 0, now: 4314107964


/sys/kernel/debug/xen-netback/vif6.0/io_ring_q1
===
Queue 1
TX: nr_ents 256
req prod 10106 (0) cons 10106 (0) event 10107 (1)
rsp prod 10106 (base) pvt 10106 (0) event 10107 (1)
pending prod 9573 pending cons 9317 nr_pending_reqs 0
dealloc prod 8503 dealloc cons 8503 dealloc_queue 0

RX: nr_ents 256
req prod 594 (39) cons 555 (0) event 1 (-554)
rsp prod 555 (base) pvt 555 (0) event 556 (1)

NAPI state: 1 NAPI weight: 64 TX queue len 0
Credit timer_pending: 0, credit: 18446744073709551615, usec: 0
remaining: 18446744073678038030, expires: 0, now: 4314118667




Re: [Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0

2014-12-02 Thread Anthony Wright


----- Original Message -----
> On 01/12/14 14:22, David Vrabel wrote:
> > On 28/11/14 15:19, Anthony Wright wrote:
> > The guest's frontend driver isn't putting valid requests onto the ring
> > (it crosses a page boundary), so this is a frontend bug.
> 
> This VIF protocol is weird. The first slot contains a txreq with a size
> for the total length of the packet; subsequent slots have sizes for that
> fragment only.
> 
> netback then has to calculate how long the first slot is, by subtracting
> all the sizes from the following slots.
> 
> So something has gone wrong, but it's not obvious what it is. Any chance
> you can dump the ring state when it happens?
Really sorry, but how do I dump the ring state? I have a root shell on both the 
Dom0 & DomU, but I don't know the command to use to dump the ring state.
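
(The 2014-12-04 follow-up above shows the answer that emerged: netback
exposes the ring state through debugfs, e.g. the io_ring_q0 and io_ring_q1
nodes under /sys/kernel/debug/xen-netback/vif6.0/, assuming debugfs is
mounted in Dom0.)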

Anthony.



Re: [Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0

2014-12-01 Thread Anthony Wright
> On 28/11/14 15:19, Anthony Wright wrote:
> > We have a 64 bit PV DomU that we recently upgraded from linux 3.3.2 to
> > 3.17.3, running on a 64 bit 3.17.3 Dom0 with Xen 4.4.0.
> >
> > Shortly after the upgrade we started to lose network connectivity to the
> > DomU a few times a day, which required a reboot to fix. We see nothing in
> > the xen logs or xl dmesg, but when we looked at the dmesg output we saw
> > the following output for the two incidents we investigated in detail:
> >
> > [69332.026586] vif vif-4-0 vif4.0: txreq.offset: 85e, size: 4002, end: 6144
> > [69332.026607] vif vif-4-0 vif4.0: fatal error; disabling device
> > [69332.031069] br-default: port 2(vif4.0) entered disabled state
> 
> The guest's frontend driver isn't putting valid requests onto the ring
> (it crosses a page boundary), so this is a frontend bug.
> 
> What guest are you running?

We're running a custom-built 64 bit para-virtualised DomU with a stock Linux 
3.17.3 downloaded from kernel.org. The problem only started happening when we 
upgraded the DomU Linux kernel from 3.3.2.



Re: [Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0

2014-11-28 Thread Anthony Wright
On 28/11/2014 15:23, Ian Campbell wrote:
> On Fri, 2014-11-28 at 15:19 +0000, Anthony Wright wrote:
>> We have a 64 bit PV DomU that we recently upgraded from linux 3.3.2 to
>> 3.17.3
> Is this a Debian kernel? In which case you might be seeing
It's a stock kernel from kernel.org; we have a custom system with no
relation to Debian.
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=767261 , this will be
> fixed in the next upload of the kernel, test binaries with the fixes are
> referenced in the bug log.
The error messages we're seeing are different from those reported: both the
Dom0 and DomU continue to run correctly, and the vif doesn't degrade slowly;
it fails the test in netback.c below, which disables the interface:

/* No crossing a page as the payload mustn't fragment. */
if (unlikely((txreq.offset + txreq.size) > PAGE_SIZE)) {
	netdev_err(queue->vif->dev,
		   "txreq.offset: %x, size: %u, end: %lu\n",
		   txreq.offset, txreq.size,
		   (txreq.offset & ~PAGE_MASK) + txreq.size);
	xenvif_fatal_tx_err(queue->vif);
	break;
}
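
Plugging the logged values into that check: txreq.offset 0x85e is 2142, and
2142 + 4002 = 6144, the 'end' value in the first dmesg report and well past
PAGE_SIZE (4096), so xenvif_fatal_tx_err() fires. The second incident is the
same arithmetic: 0xa5e = 2654, and 2654 + 4002 = 6656.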
> Even if not Debian then you'll probably want the same set of backports.
I'm happy to apply the backports if you think it's likely to fix the
problem despite the different symptoms, but from what I can see it looks
like a different problem.

thanks,

Anthony
> Ian.
>>  running on a 64 bit 3.17.3 Dom0 with Xen 4.4.0.
>>
>> Shortly after the upgrade we started to lose network connectivity to the
>> DomU a few times a day, which required a reboot to fix. We see nothing in
>> the xen logs or xl dmesg, but when we looked at the dmesg output we saw
>> the following output for the two incidents we investigated in detail:
>>
>> [69332.026586] vif vif-4-0 vif4.0: txreq.offset: 85e, size: 4002, end: 6144
>> [69332.026607] vif vif-4-0 vif4.0: fatal error; disabling device
>> [69332.031069] br-default: port 2(vif4.0) entered disabled state
>>
>>
>> [824365.530740] vif vif-9-0 vif9.0: txreq.offset: a5e, size: 4002, end: 6656
>> [824365.530748] vif vif-9-0 vif9.0: fatal error; disabling device
>> [824365.531191] br-default: port 2(vif9.0) entered disabled state
>>
>> We have a very similar setup running on another machine with a 3.17.3
>> DomU, 3.17.3 Dom0 and Xen 4.4.0, but we can't reproduce the issue on this
>> machine. This is a test system rather than a production system, so it has
>> a different workload and fewer CPUs.
>>
>> The piece of code that outputs the error is in
>> drivers/net/xen-netback/netback.c.
>>
>> The DomU has 4000MB of RAM and 8 CPUs.
>>
>> Any ideas?
>>
>> Thanks,
>>
>> Anthony.
>>




[Xen-devel] PV DomU running linux 3.17.3 causing xen-netback fatal error in Dom0

2014-11-28 Thread Anthony Wright
We have a 64 bit PV DomU that we recently upgraded from linux 3.3.2 to
3.17.3, running on a 64 bit 3.17.3 Dom0 with Xen 4.4.0.

Shortly after the upgrade we started to lose network connectivity to the
DomU a few times a day, which required a reboot to fix. We see nothing in
the xen logs or xl dmesg, but when we looked at the dmesg output we saw
the following output for the two incidents we investigated in detail:

[69332.026586] vif vif-4-0 vif4.0: txreq.offset: 85e, size: 4002, end: 6144
[69332.026607] vif vif-4-0 vif4.0: fatal error; disabling device
[69332.031069] br-default: port 2(vif4.0) entered disabled state


[824365.530740] vif vif-9-0 vif9.0: txreq.offset: a5e, size: 4002, end: 6656
[824365.530748] vif vif-9-0 vif9.0: fatal error; disabling device
[824365.531191] br-default: port 2(vif9.0) entered disabled state

We have a very similar setup running on another machine with a 3.17.3
DomU, 3.17.3 Dom0 and Xen 4.4.0, but we can't reproduce the issue on this
machine. This is a test system rather than a production system, so it has a
different workload and fewer CPUs.

The piece of code that outputs the error is in
drivers/net/xen-netback/netback.c.

The DomU has 4000MB of RAM and 8 CPUs.

Any ideas?

Thanks,

Anthony.
