On 3/26/24 03:30, Chris Mi wrote:
> This patch set adds offload support for sFlow.
>
> Psample is a genetlink channel for packet sampling. TC action act_sample
> uses psample to send sampled packets to userspace.
>
> When offloading sample action to TC, userspace creates a unique ID to
> map sFlow action and tunnel info and passes this ID to kernel instead
> of the sFlow info. psample will send this ID and sampled packet to
> userspace. Using the ID, userspace can recover the sFlow info and send
> sampled packet to the right sFlow monitoring host.
Hi, Chris, others. I've been looking through the set and psample in general for the past few days.

There are a few fairly minor issues in the set, like a UBSan crash because of an incorrect OVS_RETURNS_NONNULL annotation, and a few style and formatting issues, which are not hard to fix. However, I encountered an issue with the way tunnel metadata is recovered that I'm not sure we can actually fix.

The algorithm in this patch set is described in the cover letter above. The key point is: when packet-1 goes through a MISS upcall, we install a new datapath flow into TC and remember the tunnel metadata of packet-1, associated with a sample ID. When packet-2 hits the datapath flow (tc actions), it gets sent to psample, and ovs-vswitchd reads this packet from the psample netlink socket and presents it as an ACTION upcall. In doing that, it looks up the tunnel metadata by the sample ID and uses it as the tunnel metadata for packet-2. The crux of the problem is that this is metadata from packet-1, and it can differ from the metadata of packet-2.

An example would be two separate TCP streams going through the same VXLAN tunnel. For load-balancing reasons, most UDP tunnels choose the source port of the outer UDP header based on RSS or a 5-tuple hash of the inner packet. This means that two TCP streams going through the same tunnel will have different UDP source ports in the outer header. When a packet from the first stream hits a MISS upcall, a datapath flow will be installed and the sample ID will be associated with the metadata of the first stream. Chances are this datapath flow will not have an exact match on the source port of the tunnel metadata. That means a packet from the second stream will successfully match on the installed datapath flow and will go to psample with the ID of the first stream. OVS will read it from the psample socket and send it to the sFlow collector with the metadata it found by the ID, i.e.
with tunnel metadata that contains the tunnel source port of the first TCP stream, which is incorrect.

The test below reproduces the issue. It's not a real test (it always fails on purpose) and it's not very pretty, but what it does is create a tunnel, start a netcat server on one side, and send two files to it from the other side. These two transfers are separate TCP connections, so they use different source ports, which causes the tunnel to choose different UDP source ports for them. After the test failure we can see in sflow.log that all the values of 'tunnel4_in_src_port' are the same for both streams and for some exchanges prior to that. However, if offload is disabled, the values are different, as they should be.

So, unless we can ensure that packets from different streams do not match on the same datapath flow, this scheme doesn't work. The only way to ensure that packets from different streams within the same tunnel do not match on the same datapath flow is to always have an exact match on the whole tunnel metadata. But I don't think that is a good idea, because this way we'd have a separate datapath flow per TCP stream, which would be very bad for performance. It would also be bad for hardware offload, since hardware NICs normally have a more limited amount of available resources.

This leads us to the conclusion that only non-tunneled traffic can be offloaded this way. For that to work we'd have to add an explicit match on the tunnel metadata being empty (can that be expressed in TC?). But at this point the question arises whether this is even worth implementing, taking into account that it requires a lot of infrastructure changes throughout all the layers of OVS code just for sampling of non-tunnel packets, i.e. a very narrow use case.

Also, my understanding was that there is a plan to offload other types of userspace() action in the future, such as the controller() action, using the same psample infrastructure. But that will not be possible for the same reason.
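The source-port entropy described above can be sketched with a small model. This is illustrative Python, not kernel or OVS code; the hash function and port range are assumptions, in the spirit of how RFC 7348 suggests deriving the outer source port from the inner flow:

```python
# Toy model of outer UDP source-port selection for a VXLAN-like tunnel:
# derive the port from a hash of the inner packet's 5-tuple so that
# different inner flows spread across links and queues. Illustrative only.
import zlib

def vxlan_src_port(inner_5tuple, lo=49152, hi=65535):
    """Pick an outer UDP source port from the ephemeral range based on
    a hash of the inner flow's 5-tuple (src, dst, proto, sport, dport)."""
    h = zlib.crc32(repr(inner_5tuple).encode())
    return lo + h % (hi - lo + 1)

# Two TCP streams between the same tunnel endpoints, differing only in
# the inner TCP source port, will almost always get different outer
# UDP source ports, while the rest of the outer header is identical.
stream1 = ("10.1.1.1", "10.1.1.100", 6, 33001, 1234)
stream2 = ("10.1.1.1", "10.1.1.100", 6, 33002, 1234)
port1 = vxlan_src_port(stream1)
port2 = vxlan_src_port(stream2)
print(port1, port2)
```

Any datapath flow that wildcards this port will be hit by both streams.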
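And a minimal model of the metadata-recovery problem itself. All names here (miss_upcall, psample_report, the list-based megaflow table) are hypothetical stand-ins for the real OVS/TC machinery, just to show how packet-2 ends up reported with packet-1's tunnel metadata:

```python
# Toy model: a megaflow installed at MISS time does not match exactly on
# the tunnel's UDP source port, so a second stream hits the same flow and
# the sample ID resolves to the first stream's (stale) tunnel metadata.
# Illustrative stand-ins only, not actual OVS structures.

sample_id_map = {}  # sample ID -> tunnel metadata captured at MISS time
megaflows = []      # list of (match_fn, sample_id)

def miss_upcall(pkt):
    """Install a megaflow for pkt and remember its tunnel metadata."""
    sid = len(sample_id_map) + 1
    sample_id_map[sid] = dict(pkt["tunnel"])  # snapshot of this packet's metadata
    # The megaflow wildcard does NOT include the tunnel source port:
    match = lambda p, vni=pkt["tunnel"]["vni"]: p["tunnel"]["vni"] == vni
    megaflows.append((match, sid))

def psample_report(pkt):
    """Datapath hit: recover metadata for the sampled packet by its ID."""
    for match, sid in megaflows:
        if match(pkt):
            return sample_id_map[sid]
    return None

pkt1 = {"tunnel": {"vni": 100, "src_port": 50001}}  # first TCP stream
pkt2 = {"tunnel": {"vni": 100, "src_port": 50002}}  # second TCP stream

miss_upcall(pkt1)                # packet-1 takes the MISS upcall path
reported = psample_report(pkt2)  # packet-2 hits packet-1's megaflow
print(reported["src_port"])      # 50001: packet-1's port, not packet-2's
```

The only way to avoid the stale src_port here would be to fold it into the match function, i.e. an exact match on the whole tunnel metadata, with the per-stream flow explosion discussed above.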
In addition to tunnel metadata, we would also have to provide connection tracking information, which is even harder, because conntrack state is more dynamic and is only actually known to the datapath; psample can't supply this information to userspace.

Thoughts?

Best regards, Ilya Maximets.

P.S. The test that reproduces the issue:

---
diff --git a/tests/system-offloads-traffic.at b/tests/system-offloads-traffic.at
index ee4817e8e..51abc2782 100644
--- a/tests/system-offloads-traffic.at
+++ b/tests/system-offloads-traffic.at
@@ -100,6 +100,102 @@ OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP
+AT_SETUP([offloads - two TCP streams over vxlan with sFlow - offloads enabled qwe])
+OVS_CHECK_TUNNEL_TSO()
+OVS_CHECK_VXLAN()
+OVS_LOAD_MODULE([psample])
+
+OVS_TRAFFIC_VSWITCHD_START([], [], [-- set Open_vSwitch . other_config:hw-offload=true])
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([vxlan], [br0], [at_vxlan0], [172.31.1.1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL([vxlan], [at_vxlan1], [at_ns0], [172.31.1.100], [10.1.1.1/24],
+                  [id 0 dstport 4789])
+
+on_exit 'kill $(cat test-sflow.pid)'
+AT_CHECK([ovstest test-sflow --log-file --detach --no-chdir --pidfile \
+          0:127.0.0.1 > sflow.log], [0], [], [ignore])
+AT_CAPTURE_FILE([sflow.log])
+PARSE_LISTENING_PORT([test-sflow.log], [SFLOW_PORT])
+AT_CHECK([ovs-vsctl -- \
+         --id=@sflow create sflow agent=lo target=\"127.0.0.1:$SFLOW_PORT\" \
+         header=128 sampling=1 polling=100 -- \
+         set bridge br0 sflow=@sflow], [0], [ignore])
+
+ovs-appctl vlog/set dpif:file:dbg
+ovs-appctl vlog/disable-rate-limit dpif
+ovs-appctl vlog/set dpif_netlink:file:dbg
+ovs-appctl vlog/set ofproto_dpif:file:dbg
+ovs-appctl vlog/set tc:file:dbg
+ovs-appctl vlog/set netdev_offload_tc:file:dbg
+
+dnl First, check the underlay.
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -W 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay.
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.1 -W 2 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+sleep 2
+OVS_APP_EXIT_AND_WAIT([test-sflow])
+echo "================================= 1 ========================" >> sflow.log
+AT_CHECK([ovstest test-sflow --log-file --detach --no-chdir --pidfile \
+          ${SFLOW_PORT}:127.0.0.1 >> sflow.log], [0], [], [ignore])
+
+echo "================================= 1 ========================" >> ovs-vswitchd.log
+dnl Start ncat listener.
+OVS_DAEMONIZE([nc -k -l 10.1.1.100 1234 > tcp_data], [nc.pid])
+
+dnl Verify that ncat is ready.
+OVS_WAIT_UNTIL([netstat -ln | grep :1234])
+
+dnl Check large TCP.
+AT_CHECK([dd if=/dev/urandom of=payload.bin bs=60000 count=1 2> /dev/null])
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT 10.1.1.100 1234 < payload.bin])
+OVS_WAIT_UNTIL([diff -q payload.bin tcp_data])
+
+sleep 2
+OVS_APP_EXIT_AND_WAIT([test-sflow])
+echo "================================= 2 ========================" >> sflow.log
+AT_CHECK([ovstest test-sflow --log-file --detach --no-chdir --pidfile \
+          ${SFLOW_PORT}:127.0.0.1 >> sflow.log], [0], [], [ignore])
+
+
+echo "================================= 2 ========================" >> ovs-vswitchd.log
+dnl Verify that ncat is ready.
+OVS_WAIT_UNTIL([netstat -ln | grep :1234])
+cp payload.bin payload0.bin
+dnl Check large TCP.
+AT_CHECK([dd if=/dev/urandom of=payload.bin bs=60000 count=1 2> /dev/null])
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT 10.1.1.100 1234 < payload.bin])
+cat payload.bin >> payload0.bin
+OVS_WAIT_UNTIL([diff -q payload0.bin tcp_data])
+
+sleep 2
+OVS_APP_EXIT_AND_WAIT([test-sflow])
+
+echo "================================= Done======================" >> sflow.log
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_FAIL_IF([:])
+AT_CLEANUP
+
+
 AT_SETUP([offloads - sflow with sampling=1 - offloads enabled])
 OVS_LOAD_MODULE([psample])
 OVS_TRAFFIC_VSWITCHD_START([], [], [-- set Open_vSwitch . other_config:hw-offload=true])
---

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev