Re: [CentOS-virt] Seeing dropped packets / tcp retrans on latest 4.4.1-10el6

2015-04-20 Thread George Dunlap
[cc'ing xen-users to see if anyone there has any familiarity with ebtables]

On Fri, Apr 17, 2015 at 8:20 PM, Nathan March wrote:
> Hi All,
>
> I've tracked this down... We do rate limiting of our vms with a mix of 
> ebtables/tc.
>
> Running these commands (replace vif1.0 with the correct vif for your VM) will 
> reproduce this:
>
> ebtables -A FORWARD -i vif1.0 -j mark --set-mark 990 --mark-target CONTINUE
>
> tc qdisc add dev bond0 root handle 1: htb default 2
> tc class add dev bond0 parent 1: classid 1:0 htb rate 1mbit
>
> tc class add dev bond0 parent 1: classid 1:990 htb rate 1mbit
> tc filter add dev bond0 protocol ip parent 1:0 prio 990 handle 990 fw flowid 1:990
>
> Note that the speed limits being applied here are 10gb and I'm testing this 
> on a 1gb network, so TC shouldn't really be doing anything here except 
> letting the packets through. These same commands worked fine on gentoo xen 
> 4.1 / kernel 3.2.57, compared to this now not working on centos xen 4.4.1 / 
> kernel 3.10.68.

So just to be clear, we have 3 variables to consider here?

* gentoo -> CentOS
* Xen 4.1 -> Xen 4.4
* kernel 3.2.57 -> 3.10.68

Unfortunately I'm not very familiar with ebtables, so I don't have a
clear idea what sort of thing might cause duplicate ACKs.

Are you able to narrow down any of those?

You can find CentOS packages for Xen 4.2 and kernel 3.4 here:

http://vault.centos.org/6.4/xen4/x86_64/Packages/

If you could build a more recent kernel and see if it's been fixed,
that might be helpful as well.

 -George

___
CentOS-virt mailing list
CentOS-virt@centos.org
http://lists.centos.org/mailman/listinfo/centos-virt


Re: [CentOS-virt] Seeing dropped packets / tcp retrans on latest 4.4.1-10el6

2015-04-17 Thread Nathan March
Hi All,

I've tracked this down... We do rate limiting of our vms with a mix of 
ebtables/tc.

Running these commands (replace vif1.0 with the correct vif for your VM) will 
reproduce this:

ebtables -A FORWARD -i vif1.0 -j mark --set-mark 990 --mark-target CONTINUE

tc qdisc add dev bond0 root handle 1: htb default 2 
tc class add dev bond0 parent 1: classid 1:0 htb rate 1mbit 

tc class add dev bond0 parent 1: classid 1:990 htb rate 1mbit
tc filter add dev bond0 protocol ip parent 1:0 prio 990 handle 990 fw flowid 1:990

Note that the speed limits being applied here are 10gb and I'm testing this on 
a 1gb network, so TC shouldn't really be doing anything here except letting the 
packets through. These same commands worked fine on gentoo xen 4.1 / kernel 
3.2.57, compared to this now not working on centos xen 4.4.1 / kernel 3.10.68.
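
One way to check whether the ebtables mark is actually steering traffic into class 1:990 is to dump the counters on each piece. A minimal sketch, assuming bond0 as in the commands above; `run` and `show_shaping_counters` are helper names made up for illustration, and DRY_RUN is a hypothetical switch that prints the commands instead of executing them:

```shell
# Sketch only: helper names and the DRY_RUN switch are illustrative assumptions.
# Real execution needs root plus the ebtables and tc binaries.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Dump counters for the ebtables rule and the htb qdisc, classes and filter.
show_shaping_counters() {
    local dev="${1:-bond0}"
    run ebtables -L FORWARD --Lc                  # per-rule packet/byte counters
    run tc -s qdisc show dev "$dev"               # dropped / overlimits on the root qdisc
    run tc -s class show dev "$dev"               # bytes sent per class, including 1:990
    run tc -s filter show dev "$dev" parent 1:    # filters attached under handle 1:
}
```

Calling `show_shaping_counters bond0` before and after a test transfer and comparing the numbers should show whether class 1:990 is seeing the traffic at all.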

The easiest way to reproduce is to generate a large file, scp it to a remote 
host, and on the remote host run:
tshark -Y "tcp.analysis.duplicate_ack_num"

If you run the ssh in a loop + tshark in another window, you can see the Dup 
ACKs begin immediately after adding the last filter rule:

25790294 1752.756733 xxx.xxx.xxx.13 -> xxx.xxx.xxx.205 TCP 78 [TCP Dup ACK 25790286#4] ssh > 51515 [ACK] Seq=15994 Ack=50769840 Win=1544704 Len=0 TSval=738150929 TSecr=4294944346 SLE=50785768 SRE=50790596
25790296 1752.756742 xxx.xxx.xxx.13 -> xxx.xxx.xxx.205 TCP 78 [TCP Dup ACK 25790286#5] ssh > 51515 [ACK] Seq=15994 Ack=50769840 Win=1544704 Len=0 TSval=738150929 TSecr=4294944346 SLE=50785768 SRE=50792044
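
To compare runs with and without the filter rule, it may help to reduce the tshark output to a single number. A rough sketch; `count_dup_acks` is a made-up helper name, and it simply counts lines containing the "TCP Dup ACK" label that tshark prints in the output above:

```shell
# Sketch only: count_dup_acks is an illustrative helper, not part of tshark.
# Reads tshark text output on stdin and prints how many lines carry the
# "TCP Dup ACK" label.
count_dup_acks() {
    # grep -c exits non-zero when the count is 0, so mask that exit status.
    grep -c 'TCP Dup ACK' || true
}
```

For example, the output of the tshark command above could be saved to a file during a transfer and then fed through `count_dup_acks < saved-output.txt` (the file name is a placeholder), once with the tc filter in place and once without.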

- Nathan



Re: [CentOS-virt] Seeing dropped packets / tcp retrans on latest 4.4.1-10el6

2015-04-15 Thread Nathan March
So I might have been misinterpreting things here and might be way off base. I 
think you can ignore this thread and I'll follow up if I get anything concrete 
down the road =) The retransmissions I'm seeing and reproducing are probably 
within normal allowances, and I can't reproduce the issue that originally led me 
down this path.

- Nathan




Re: [CentOS-virt] Seeing dropped packets / tcp retrans on latest 4.4.1-10el6

2015-04-15 Thread Nathan March
Hi All,

Some more data on this: I've reproduced it on another host that's a 
completely stock centos/xen deployment with a centos 6.6 domU.

Since I’m seeing the retransmissions on the VIF, I don't think it's related to 
the network stack, but just in case: each host is connected via LACP with vlan 
tagging to a pair of stacked cisco 3750's. Host networking config is here: 

http://dpaste.com/1Q6NY3Y

The vm is on br99 here.

This is easily reproducible by just generating a 250mb random file and doing an 
scp, while watching with tshark: 

tshark -R "tcp.analysis.retransmission"

There's no visible impact to the connection the vast majority of the time, 
which is why I think this has gone unnoticed.
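
The reproduction steps above could be scripted roughly as follows. This is a sketch under assumptions: `make_test_file` and `push_in_loop` are illustrative names, the 250 MiB default mirrors the file size mentioned above, and the scp target is whatever remote host you are testing against.

```shell
# Sketch only: function names and defaults are illustrative assumptions.

# make_test_file PATH [SIZE_MB]: write SIZE_MB MiB of random data to PATH.
make_test_file() {
    local path="$1" size_mb="${2:-250}"
    head -c "$((size_mb * 1024 * 1024))" /dev/urandom > "$path"
}

# push_in_loop FILE TARGET [COUNT]: scp FILE to TARGET COUNT times,
# stopping at the first failed copy.
push_in_loop() {
    local file="$1" target="$2" count="${3:-10}" i=1
    while [ "$i" -le "$count" ]; do
        scp -q "$file" "$target" || return 1
        i=$((i + 1))
    done
}
```

Usage would look like `make_test_file /tmp/random.bin 250` followed by `push_in_loop /tmp/random.bin user@remote-host:/tmp/`, with the tshark command above running in another window; `user@remote-host` is a placeholder.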

Just to confirm this wasn't related to hardware / nics, I've reproduced this on:

 - Dell PowerEdge M620 with broadcom nics
 - Dell C6220 with intel nics
 - Supermicro X8DTT with intel nics

Any ideas? =)

- Nathan



[CentOS-virt] Seeing dropped packets / tcp retrans on latest 4.4.1-10el6

2015-04-14 Thread Nathan March
Hi All,

I was troubleshooting some odd VM network issues and discovered that we're seeing 
dropped packets + retransmissions across multiple domU OSes and dom0 hardware 
platforms.

xendev01 ~ # tshark -R "tcp.analysis.retransmission" -i vif7.0
Running as user "root" and group "root". This could be dangerous.
Capturing on vif7.0
  3.054257 xxx.xxx.xxx.196 -> xxx.xxx.xxx.145 SSH 110 [TCP Fast Retransmission] Encrypted response packet len=44
  3.061949 xxx.xxx.xxx.196 -> xxx.xxx.xxx.145 SSH 1434 [TCP Fast Retransmission] Encrypted response packet len=1368
  3.383880 xxx.xxx.xxx.196 -> xxx.xxx.xxx.145 SSH 1434 [TCP Fast Retransmission] Encrypted response packet len=1368
  3.630911 xxx.xxx.xxx.196 -> xxx.xxx.xxx.145 SSH 1434 [TCP Fast Retransmission] Encrypted response packet len=1368
  3.635964 xxx.xxx.xxx.196 -> xxx.xxx.xxx.145 SSH 1434 [TCP Fast Retransmission] Encrypted response packet len=1368

I've confirmed this is happening with linux, windows and pfsense (bsd) domUs. 
I've turned off every feature I can with ethtool on the underlying bridge 
on the host, the vifs, and the eths inside the domUs. I also see it on 
traffic between vms on the same host.

The domU sees packet errors on incoming traffic, while outgoing traffic looks 
fine. Dumping on the dom0 indicates that incoming packets are fine, but the 
reply from the domU is broken. This does not happen when running the exact same 
VMs on some older xen 4.1.3 hosts. Reproduction is easy (for me at least): any 
burst of traffic will do it. I've just been running "ps auxf" over ssh to a vm 
to trigger it.
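
For anyone trying to repeat the ethtool step, a sketch of turning offloads off on one interface is below. The exact feature list is an assumption (the message above does not say which features were disabled), and `run`, `disable_offloads` and the DRY_RUN switch are made-up names for illustration:

```shell
# Sketch only: feature list, helper names and DRY_RUN are assumptions.
# Real execution needs root and the ethtool binary.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# disable_offloads DEV: switch off common offload features one at a time,
# then print the resulting feature state.
disable_offloads() {
    local dev="$1" feature
    for feature in rx tx sg tso gso gro; do
        # Some devices refuse individual features; keep going regardless.
        run ethtool -K "$dev" "$feature" off || true
    done
    run ethtool -k "$dev"
}
```

It would be run once per interface involved, for example `disable_offloads vif7.0` on the dom0 and the equivalent for the eth device inside the domU.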

 

Since I'm seeing it on the host when I sniff the vif, this feels like a bug?

- Nathan
