Hi,

Thanks for your reply, Matias.

I tried a simpler setup and the bug persists. With a Linux bridge it works
fine.

My setup is the following:

| nginx <---> veth0 -|- veth1 <---> l2fwd <---> veth2 -|- veth3 <---> wget |

where the | delimiters mark network namespace boundaries.
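
In case it helps reproduction, this is roughly how the simplified setup is
created (the namespace names ns_srv and ns_cli are just placeholders;
addresses as in my original report; l2fwd runs in the default namespace):

ip netns add ns_srv
ip netns add ns_cli
ip link add veth0 type veth peer name veth1
ip link add veth2 type veth peer name veth3
ip link set veth0 netns ns_srv
ip link set veth3 netns ns_cli
ip netns exec ns_srv ip addr add 10.52.34.5/30 dev veth0
ip netns exec ns_cli ip addr add 10.52.34.6/30 dev veth3
ip netns exec ns_srv ip link set veth0 up
ip netns exec ns_cli ip link set veth3 up
ip link set veth1 up
ip link set veth2 up

nginx is then started inside ns_srv and wget inside ns_cli (via ip netns
exec), and odp_l2fwd bridges veth1 and veth2 in the default namespace.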

I have tried it with ODP_PKTIO_DISABLE_SOCKET_MMAP set, with all the
different scheduling modes, and with -c 1.
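
For reference, the invocation looks roughly like this (I am assuming that
setting ODP_PKTIO_DISABLE_SOCKET_MMAP to any non-empty value is enough to
disable the socket mmap pktio):

sudo env ODP_PKTIO_DISABLE_SOCKET_MMAP=1 ./odp_l2fwd -i veth1,veth2 -d 0 -s 0 -c 1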

Our next test would be using DPDK or netmap, if they can be used with veth
interfaces.

--
Oriol Arcas
Software Engineer
Starflow Networks

On Fri, Feb 17, 2017 at 11:46 AM, Elo, Matias (Nokia - FI/Espoo) <matias....@nokia-bell-labs.com> wrote:

> Hi Oriol,
>
> This seems rather odd indeed (especially point e). Just to be clear, are
> you using OFP in any part of the test setup or is the simplified setup as
> follows?
>
>         standard nginx <----> odp_l2fwd <----> standard wget
>
> You could try testing with different odp pktio types (preferably netmap or
> dpdk) to see if the problem persists. You can disable mmap pktio with
> ODP_PKTIO_DISABLE_SOCKET_MMAP environment variable.
>
> Second thing to try would be to run odp_l2fwd with a single core (-c 1) to
> rule out possible synchronisation problems.
>
> -Matias
>
>
> > On 16 Feb 2017, at 17:37, Oriol Arcas <or...@starflownetworks.com> wrote:
> >
> > Hi,
> >
> > We have been using ODP for a while, and we found a weird bug that is
> > difficult to explain, involving data corruption in TCP transfers. I hope
> > somebody can reproduce it and shed some light on this.
> >
> > To reproduce this bug, we set up the following environment:
> >
> > 1- Two Debian Jessie VMs, running on QEMU/libvirt
> > 2- Each VM has Linux kernel 3.16.39 (any other version should experience
> > the same issues)
> > 3- The eth0 "physical" interfaces of the VMs are for management; the eth1
> > interfaces are connected through a bridge in the host
> > 4- We have OpenVPN taps through the eth1 interfaces (10.52.34.1/30 and
> > 10.52.34.2/30)
> > 5- In each VM, there is a pair of veth interfaces, vethi (10.52.34.5/30
> > and 10.52.34.6/30) and vethe (no IP)
> > 6- We "bridge" the vethe and the tap interfaces with the odp_l2fwd
> > example app
> > 7- We have an nginx server and wget clients (curl produces the same
> > result)
> >
> > The setup looks like this:
> >
> > Server VM
> > nginx - vethi (10.52.34.5) - vethe - odp_l2fwd - tap (10.52.34.1) - ...
> >
> > [host bridge]
> >
> > Client VM
> > ... - tap (10.52.34.2) - odp_l2fwd - vethe - vethi (10.52.34.6) - wget
> >
> > The idea is that there should be tunnelled IP connections through the
> > corresponding vethi endpoints.
> >
> > The unmodified odp_l2fwd is run with the following command:
> >
> > sudo /usr/lib/odp/linux/examples/odp_l2fwd -i tap,vethe -d 0 -s 0 -m 1
> >
> > To do our tests, we have a 10 MB text file called "download" with the
> > following contents:
> >
> > 1 0000000000000000000000000000000000000000000000000000000000000000
> > 2 0000000000000000000000000000000000000000000000000000000000000000
> > 3 0000000000000000000000000000000000000000000000000000000000000000
> > ...
> > 147178 0000000000000000000000000000000000000000000000000000000000000000
> > 147179 000000000000000000000000000000000000000000
> >
> > We download the data from the client VM with the following command:
> >
> > $> wget http://10.52.34.5/download
> >
> > The data arrives completely (and in this case, correctly), and both
> > odp_l2fwd apps report the processed packets.
> >
> > However, when we perform several parallel downloads:
> >
> > $> for i in `seq 30`; do wget http://10.52.34.5/download -O download_${i} & done
> >
> > The downloads end, but the downloaded data is wrong:
> >
> > $> for i in `seq 1 30`; do cmp download_${i} download; done
> > download_7 download differ: byte 5140175, line 72554
> > download_19 download differ: byte 4739, line 70
> > download_25 download differ: byte 39677, line 577
> >
> > To be clear, we add the following comments:
> > a) We tried this with the ODP official packages 1.10.1, and also ODP LTS
> > 1.11 and the current master head (~1.13)
> > b) We have tried this with a bridge instead of the odp_l2fwd app, and it
> > worked fine
> > c) It seems to happen when the client has ODP, regardless of whether the
> > server has ODP or a bridge; if only the server has ODP, it works fine
> > d) The data corruption presumably consists of packets from one TCP flow
> > being interleaved with packets from another flow.
> >
> > We tried downloading files containing '0's and files containing '1's
> > simultaneously; the result was files with chunks of '0's and '1's
> > interleaved:
> >
> > $> diff download_1 download
> > 71692,71700c72149
> > < 72149 000000000000000000000000000000000000000000000000000000000000001111111111111111111111111111111111111111111
> > < 94804 1111111111111111111111111111111111111111111111111111111111111111
> > < 94805 1111111111111111111111111111111111111111111111111111111111111111
> > < 94806 1111111111111111111111111111111111111111111111111111111111111111
> > < 94807 1111111111111111111111111111111111111111111111111111111111111111
> > < 94808 1111111111111111111111111111111111111111111111111111111111111111
> > < 94809 1111111111111111111111111111111111111111111111111111111111111111
> > < 94810 1111111111111111111111111111111111111111111111111111111111111111
> > < 94811 1111111111111111111111111111111111111111111111111111111111111100
> > ---
> >> 72149 0000000000000000000000000000000000000000000000000000000000000000
> >
> > e) If there were data corruption during the transmission, then since we
> > are using TCP, the protocol should not allow this to happen, right?
> > f) We have PCAP traces from vethi and tap, and Wireshark shows that the
> > TCP conversation is OK; we cannot explain this
> > g) The TCP checksums and lengths seem to be OK; just in case, we disabled
> > checksum offloading on all the interfaces so that the checksums are set
> > and verified in software (see the ethtool commands after these comments)
> > h) Enabling or disabling hugepages seems to have no effect
> > i) A low number of flows doesn't always trigger the problem
> > j) We tried 2 tunnels, with several TCP downloads through each one; the
> > interleaving seems to happen only between flows within the same tunnel
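> >
> > For completeness, this is roughly how we disable the checksum offloads
> > mentioned in g) on each interface (repeated for vethi, vethe and tap in
> > both VMs):
> >
> > $> sudo ethtool -K vethi rx off tx off
> > $> sudo ethtool -K vethe rx off tx off
> > $> sudo ethtool -K tap rx off tx off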
> >
> > So it would seem that the data transmission is OK (by e, f and g), but
> > after the TCP stack reassembles the stream the files are 'corrupted'
> > (i.e., as shown in d, interleaved). And it seems that this happens when
> > the client has ODP (by b and c), with any recent ODP version (by a).
> >
> > Maybe f) can be explained by the fact that we are using the socket_mmap
> > PKTIO interface, and the captured data is somehow different from the data
> > actually sent by ODP.
> >
> > Other hypotheses include:
> > - The packet mmap is somehow messing with the TCP stack
> > - OpenVPN+ODP is somehow causing this
> > - ODP on a VM causes this behavior
> >
> > I hope I explained the problem in an understandable way. Please do not
> > hesitate to ask for clarifications or new experiments.
> >
> > We would appreciate any comments, questions and suggestions, especially
> > regarding the reproducibility of this error.
> >
> > Thank you!
> >
> > --
> > Oriol Arcas
> > Software Engineer
> > Starflow Networks
>
>
