On 10/18/25 11:22 AM, Frode Nordahl wrote:
> On 7/8/25 13:34, Ilya Maximets wrote:
>> When a packet enters the kernel datapath and there is no flow to handle
>> it, the packet goes to userspace through a MISS upcall.  With the
>> per-CPU upcall dispatch mechanism, we use the current CPU id to select
>> the Netlink PID to which to send this packet.  This allows us to send
>> packets from the same traffic flow through the same handler.
>>
>> The handler will process the packet, install required flow into the
>> kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
>>
>> While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
>> recirculation action that will pass the (likely modified) packet
>> through the flow lookup again.  And if the flow is not found, the
>> packet will be sent to userspace again through another MISS upcall.
>>
>> However, the handler thread in userspace is likely running on a
>> different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
>> in the syscall context of that thread.  So, when the time comes to
>> send the packet through another upcall, the per-CPU dispatch will
>> choose a different Netlink PID, and this packet will end up processed
>> by a different handler thread on a different CPU.
>>
>> The process continues as long as there are new recirculations, each
>> time the packet goes to a different handler thread before it is sent
>> out of the OVS datapath to the destination port.  In real setups the
>> number of recirculations can go up to 4 or 5, sometimes more.
>>
>> There is always a chance of packet re-ordering while processing upcalls,
>> because userspace will first install the flow and then re-inject the
>> original packet.  So, there is a race window when the flow is already
>> installed and the second packet can match it inside the kernel and be
>> forwarded to the destination before the first packet is re-injected.
>> But the fact that packets are going through multiple upcalls handled
>> by different userspace threads makes the reordering noticeably more
>> likely, because we not only have a race between the kernel and a
>> userspace handler (which is hard to avoid), but also between multiple
>> userspace handlers.
>>
>> For example, assume 10 packets got enqueued through a MISS upcall for
>> handler-1.  It will start processing them, install the flow into the
>> kernel, and begin re-injecting the packets, from where they go through
>> another MISS upcall to handler-2.  Handler-2 will install the flow into
>> the kernel and start re-injecting the packets.  Meanwhile, as handler-1
>> re-injects the last of the 10 packets, they hit the flow installed by
>> handler-2 and are forwarded without going to handler-2 at all, while
>> handler-2 is still re-injecting the first of these 10 packets.  Given
>> multiple recirculations and misses, these 10 packets may end up
>> completely mixed up on the output from the datapath.
>>
>> Let's provide the original upcall PID via the new Netlink attribute
>> OVS_PACKET_ATTR_UPCALL_PID.  This way, an upcall triggered during the
>> execution will go to the same handler.  Packets will be enqueued to
>> the same socket and re-injected in the same order.  This doesn't
>> eliminate re-ordering entirely, as stated above, since we still have
>> a race between the kernel and the handler thread, but it does
>> eliminate the races between multiple handlers.
>>
>> The openvswitch kernel module ignores unknown attributes for the
>> OVS_PACKET_CMD_EXECUTE, so it's safe to provide it even on older
>> kernels.
>>
>> Reported-at: https://issues.redhat.com/browse/FDP-1479
>> Link: https://lore.kernel.org/netdev/[email protected]/
>> Signed-off-by: Ilya Maximets <[email protected]>
>> ---
> 
> Hello, Ilya,
> 
> There appears to be a regression caused by either the kernel or the 
> userspace change on this subject.
> 
> While investigating test failures in the OVN system test suite [0], 
> starting with the 6.17 kernel, which includes "net: openvswitch: allow 
> providing upcall pid for the 'execute' command", I found that reverting 
> either the kernel commit or this OVS commit makes the following OVN 
> system tests pass again:
> 
> "DNAT and SNAT on distributed router - N/S - IPv6"
> "Traffic to router port via LLA"
> 
> (Note that there are other tests discussed in that bug that break due to 
> a different issue; these two are the ones I used in the latest validation 
> steps before sending this e-mail).
> 
> I have yet to dive into the details of why this may be the case, and I'll 
> likely poke some more at this, but I wanted to hear your immediate 
> thoughts / insights on what could possibly be the issue.
> 
> 0: https://launchpad.net/bugs/2127061
> 

Hi, Frode.  I didn't try the tests, but if packet re-ordering is at play,
I wonder if the root cause of these test failures is similar to the one
described in OVS commit:
  e0ee785b1a67 ("tests: system-traffic: Fix flaky floating IP test.")

Providing the original upcall PID can make connection processing faster and
cause the last ACK of a 3-way handshake to arrive very late and trigger an
RST reply.  This re-ordering of the last ACK has been in OVS always, or for
a very long time, but it may be more visible now that the rest of the
traffic is no longer re-ordered.

A traffic capture is needed to confirm.

Best regards, Ilya Maximets.
