Hi Elias,

Could you try running a build that includes https://gerrit.fd.io/r/#/c/vpp/+/23461/ ?
I cherry-picked it from master today, but it is not in the 19.08 branch yet.

--a

> On 15 Nov 2019, at 11:05, Elias Rudberg <elias.rudb...@bahnhof.net> wrote:
>
> We are using VPP 19.08 for NAT (nat44) and are struggling with the
> following problem: it first works seemingly fine for a while, like
> several days or weeks, but then suddenly VPP stops forwarding traffic.
> Even ping to the "outside" IP address fails.
>
> The VPP process is still running so we try to investigate further using
> vppctl, enabling packet trace as follows:
>
> clear trace
> trace add rdma-input 5
>
> then doing ping to "outside" and then "show trace".
>
> To see the normal behavior we have compared to another server running
> VPP without the strange problem happening; there we can see that the
> normal behavior is that one worker starts processing the packet and
> then does NAT44_OUT2IN_WORKER_HANDOFF after which another worker takes
> over: "handoff_trace" and then "HANDED-OFF: from thread..." and then
> that worker continues processing the packet.
> So the relevant parts of the trace look like this (abbreviated to show
> only node names and handoff info) for a case when thread 8 hands off
> work to thread 3:
>
> ------------------- Start of thread 3 vpp_wk_2 -------------------
> Packet 1
>
> 08:15:10:781992: handoff_trace
>   HANDED-OFF: from thread 8 trace index 0
> 08:15:10:781992: nat44-out2in
> 08:15:10:782008: ip4-lookup
> 08:15:10:782009: ip4-local
> 08:15:10:782010: ip4-icmp-input
> 08:15:10:782011: ip4-icmp-echo-request
> 08:15:10:782011: ip4-load-balance
> 08:15:10:782013: ip4-rewrite
> 08:15:10:782014: BondEthernet0-output
>
> ------------------- Start of thread 8 vpp_wk_7 -------------------
> Packet 1
>
> 08:15:10:781986: rdma-input
> 08:15:10:781988: bond-input
> 08:15:10:781989: ethernet-input
> 08:15:10:781989: ip4-input
> 08:15:10:781990: nat44-out2in-worker-handoff
>   NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
>
> The above is what it looks like normally. The problem is that
> sometimes, for some reason, the handoff stops working so that we only
> get the initial processing by a worker and that worker saying
> NAT44_OUT2IN_WORKER_HANDOFF but the other worker does not pick up the
> work, it is seemingly ignored.
>
> Here is what it looks like then, when the problem has happened, thread
> 7 trying to handoff to thread 3:
>
> ------------------- Start of thread 3 vpp_wk_2 -------------------
> No packets in trace buffer
>
> ------------------- Start of thread 7 vpp_wk_6 -------------------
> Packet 1
>
> 08:38:41:904654: rdma-input
> 08:38:41:904656: bond-input
> 08:38:41:904658: ethernet-input
> 08:38:41:904660: ip4-input
> 08:38:41:904663: nat44-out2in-worker-handoff
>   NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
>
> So, work is also in this case handed off to thread 3 but thread 3 does
> not pick it up. There is no "HANDED-OFF" message in the trace at all,
> not for any worker. It seems like the handed-off work was ignored. Then
> of course it is understandable that the ping does not work and packet
> forwarding does not work, the question is: why does that hand-off
> procedure fail?
>
> Are there some known reasons that can cause this behavior?
>
> When there is a NAT44_OUT2IN_WORKER_HANDOFF message in the packet
> trace, should there always be a corresponding "HANDED-OFF" message for
> another thread picking it up?
>
> One more question related to the above: sometimes when looking at trace
> for ICMP packets to investigate this problem we have seen a worker
> apparently handing off work to itself, which seems strange. Example:
>
> ------------------- Start of thread 3 vpp_wk_2 -------------------
> Packet 1
>
> 08:31:23:871274: rdma-input
> 08:31:23:871279: bond-input
> 08:31:23:871282: ethernet-input
> 08:31:23:871285: ip4-input
> 08:31:23:871289: nat44-out2in-worker-handoff
>   NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
>
> If the purpose of "handoff" is to let another thread take over, then
> this seems strange by itself (even without considering that there is no
> "HANDED-OFF" for any thread): why is thread 3 trying to handoff work to
> itself? Does that indicate something wrong or are there legitimate
> cases where a thread "hands off" something to itself?
>
> We have encountered this problem several times but unfortunately we
> have not yet found a way to reproduce it in a lab environment, we do
> not know exactly what triggers the problem. Previous times, when we
> have restarted vpp it starts working normally again.
>
> Any input on this or ideas for how to troubleshoot further would be
> much appreciated.
>
> Best regards,
> Elias
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
>
> View/Reply Online (#14602): https://lists.fd.io/g/vpp-dev/message/14602
> Mute This Topic: https://lists.fd.io/mt/59112885/675608
> Group Owner: vpp-dev+ow...@lists.fd.io
> Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [ayour...@gmail.com]
> -=-=-=-=-=-=-=-=-=-=-=-
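The manual check described in the quoted message (does every NAT44_OUT2IN_WORKER_HANDOFF line have a matching "HANDED-OFF" line on the target thread?) could be automated against saved "show trace" output. The following is only a rough sketch, not part of VPP. It assumes the trace text has the layout shown in the quoted traces, that the "next-worker N" value names the same thread number that appears in the "Start of thread N" header (as it does in the working example), and that the trace index is enough to pair the two sides within one capture. The names `Handoff` and `find_unmatched_handoffs` are made up for illustration.

```python
import re
from typing import List, NamedTuple

# Patterns follow the trace layout quoted in this thread (an assumption;
# other VPP versions may format these lines differently).
THREAD_RE = re.compile(r"Start of thread (\d+) ")
HANDOFF_RE = re.compile(
    r"NAT44_OUT2IN_WORKER_HANDOFF\s*:\s*next-worker (\d+) trace index (\d+)"
)
PICKUP_RE = re.compile(r"HANDED-OFF: from thread (\d+) trace index (\d+)")


class Handoff(NamedTuple):
    src_thread: int   # thread that logged NAT44_OUT2IN_WORKER_HANDOFF
    dst_thread: int   # "next-worker" value, assumed to be a thread number
    trace_index: int


def find_unmatched_handoffs(trace_text: str) -> List[Handoff]:
    """Return handoffs with no matching HANDED-OFF line on the target thread."""
    handoffs = []
    pickups = set()
    current_thread = None
    for line in trace_text.splitlines():
        match = THREAD_RE.search(line)
        if match:
            current_thread = int(match.group(1))
            continue
        if current_thread is None:
            continue  # ignore anything before the first thread header
        match = HANDOFF_RE.search(line)
        if match:
            handoffs.append(
                Handoff(current_thread, int(match.group(1)), int(match.group(2)))
            )
            continue
        match = PICKUP_RE.search(line)
        if match:
            # The receiving thread names the sender; rebuild the same key.
            pickups.add(
                Handoff(int(match.group(1)), current_thread, int(match.group(2)))
            )
    # Thread sections can appear in any order, so compare only at the end.
    return [h for h in handoffs if h not in pickups]
```

Run on the working trace above (thread 8 to thread 3) this returns an empty list. On the failing trace it returns one entry with src_thread=7, dst_thread=3, trace_index=0. A handoff whose source and destination are the same thread, as in the last example, is also reported when no "HANDED-OFF" line is present, so it can be looked at separately.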