On 10/20/25 11:59, Ilya Maximets wrote:
On 10/18/25 11:22 AM, Frode Nordahl wrote:
On 7/8/25 13:34, Ilya Maximets wrote:
When a packet enters the kernel datapath and there is no flow to handle
it, the packet goes to userspace through a MISS upcall.  With the
per-CPU upcall dispatch mechanism, the current CPU id is used to select
the Netlink PID to which the packet is sent.  This allows us to send
packets from the same traffic flow through the same handler.
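
Roughly, the dispatch can be pictured like this (an illustrative
sketch only; the function and field names are made up and do not
match the actual openvswitch module code):

    /* Sketch: with per-CPU dispatch, the Netlink PID receiving a
     * MISS upcall is derived from the CPU processing the packet. */
    static u32 miss_upcall_portid(const struct datapath *dp)
    {
        u32 cpu = smp_processor_id();

        /* Userspace registers one handler socket PID per CPU;
         * wrap around if there are fewer PIDs than CPUs. */
        return dp->upcall_pids[cpu % dp->n_upcall_pids];
    }

As long as a traffic flow stays on the same CPU, its packets keep
landing on the same handler thread.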

The handler will process the packet, install the required flow into the
kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
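
In userspace, the re-injection roughly corresponds to encoding a
request like the following (a simplified sketch in the spirit of
OVS's lib/netlink helpers; 'packet_data', 'key', 'actions', the
buffer sizing and 'dp_ifindex' are placeholders, and error handling
is omitted):

    /* Sketch: encode an OVS_PACKET_CMD_EXECUTE request. */
    struct ofpbuf *buf = ofpbuf_new(1024);
    struct ovs_header *ovs_hdr;

    nl_msg_put_genlmsghdr(buf, 0, ovs_packet_family, NLM_F_REQUEST,
                          OVS_PACKET_CMD_EXECUTE, OVS_PACKET_VERSION);
    ovs_hdr = ofpbuf_put_uninit(buf, sizeof *ovs_hdr);
    ovs_hdr->dp_ifindex = dp_ifindex;

    nl_msg_put_unspec(buf, OVS_PACKET_ATTR_PACKET, packet_data, packet_len);
    nl_msg_put_unspec(buf, OVS_PACKET_ATTR_KEY, key, key_len);
    nl_msg_put_unspec(buf, OVS_PACKET_ATTR_ACTIONS, actions, actions_len);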

While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
recirculation action that will pass the (likely modified) packet
through the flow lookup again.  And if the flow is not found, the
packet will be sent to userspace again through another MISS upcall.

However, the handler thread in userspace is likely running on a
different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
in the syscall context of that thread.  So, when the time comes to
send the packet through another upcall, the per-CPU dispatch will
choose a different Netlink PID, and this packet will end up processed
by a different handler thread on a different CPU.

The process continues as long as there are new recirculations; each
time, the packet goes to a different handler thread before it is
finally sent out of the OVS datapath to the destination port.  In real
setups the number of recirculations can go up to 4 or 5, sometimes more.

There is always a chance of re-ordering packets while processing
upcalls, because userspace will first install the flow and then
re-inject the original packet.  So, there is a race window when the
flow is already installed and a second packet can match it inside the
kernel and be forwarded to the destination before the first packet is
re-injected.  But the fact that packets go through multiple upcalls
handled by different userspace threads makes the re-ordering noticeably
more likely, because we not only have a race between the kernel and a
userspace handler (which is hard to avoid), but also between multiple
userspace handlers.

For example, assume that 10 packets get enqueued through a MISS upcall
for handler-1.  Handler-1 starts processing them, installs the flow
into the kernel and starts re-injecting the packets, from where they
go through another MISS to handler-2.  Handler-2 installs the flow
into the kernel and starts re-injecting the packets.  While handler-1
is still re-injecting the last of the 10 packets, those hit the flow
installed by handler-2 and are forwarded without going to handler-2 at
all, even though handler-2 is still re-injecting the first of these 10
packets.  Given multiple recirculations and misses, these 10 packets
may end up completely mixed up on the output from the datapath.

Let's provide the original upcall PID via the new netlink attribute
OVS_PACKET_ATTR_UPCALL_PID.  This way, an upcall triggered during the
execution will go to the same handler.  Packets will be enqueued to
the same socket and re-injected in the same order.  This doesn't
eliminate re-ordering entirely, as stated above, since we still have a
race between the kernel and the handler thread, but it does eliminate
the races between multiple handlers.
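
On the userspace side, the change then boils down to one more
attribute in the execute request, along these lines (a sketch; the
'upcall_pid' variable is illustrative):

    /* Sketch: carry the PID of the socket the original upcall
     * arrived on, so that an upcall triggered during execution is
     * dispatched back to the same handler thread. */
    if (upcall_pid) {
        nl_msg_put_u32(buf, OVS_PACKET_ATTR_UPCALL_PID, upcall_pid);
    }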

The openvswitch kernel module ignores unknown attributes for
OVS_PACKET_CMD_EXECUTE, so it's safe to provide the new attribute even
on older kernels.
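
This relies on the usual liberal netlink parsing behavior, which can
be pictured like this (simplified; not the literal module code):

    /* Sketch: attribute types the kernel doesn't know about are
     * skipped rather than rejected, so OVS_PACKET_ATTR_UPCALL_PID
     * is simply ignored by older openvswitch modules. */
    struct nlattr *nla;
    int rem;

    nla_for_each_attr(nla, attrs, attrs_len, rem) {
        int type = nla_type(nla);

        if (type == 0 || type > maxattr)
            continue;   /* unknown attribute: silently skipped */
        tb[type] = nla;
    }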

Reported-at: https://issues.redhat.com/browse/FDP-1479
Link: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ilya Maximets <[email protected]>
---

Hello, Ilya,

There appears to be a regression caused by either the kernel or the
userspace change on this subject.

While investigating test failures in the OVN system test suite [0]
starting with the 6.17 kernel, which includes "net: openvswitch: allow
providing upcall pid for the 'execute' command", I found that reverting
either the kernel commit or this OVS commit makes the following OVN
system tests pass again:

"DNAT and SNAT on distributed router - N/S - IPv6"
"Traffic to router port via LLA"

(Note that there are other tests discussed in that bug that break on a
different issue; these two are the ones I used in the latest validation
steps before sending this e-mail.)

I have yet to dive into the details of why this may be the case, and
I'll likely poke some more at this, but I wanted to hear your immediate
thoughts / insights on what could possibly be the issue.

0: https://launchpad.net/bugs/2127061


Hi, Frode.  I didn't try the tests, but if packet re-ordering is in
play, I wonder if the root cause of these test failures is somewhat
similar to the one described in OVS commit:
   e0ee785b1a67 ("tests: system-traffic: Fix flaky floating IP test.")

Providing the original upcall PID can make connection processing
faster and cause the last ACK of a 3WHS to arrive very late and
trigger an RST reply.  This re-ordering of the last ACK has existed in
OVS always, or for a very long time, but it may be more visible now
that the rest of the traffic is no longer re-ordered.

A retis capture is needed to confirm.

Thanks for your pointer, retis was indeed enlightening.

To me there appears to be something nefarious going on, as the packets
appear to be dropped because the kernel ends up attempting to send
upcalls to a Netlink PID on which Open vSwitch is not listening [1].

I'll try to track down where this apparently incorrect PID might be
coming from.

1: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/2127061/comments/3

--
Frode Nordahl


Best regards, Ilya Maximets.


