Hi Elias,

Could you try running a build with this change included?
https://gerrit.fd.io/r/#/c/vpp/+/23461/

I cherry-picked it from master today, but it is not in the 19.08 branch yet.
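
When comparing traces before and after, a small script can check whether every
NAT44_OUT2IN_WORKER_HANDOFF line in saved "show trace" output has a matching
"HANDED-OFF" line on the target thread. This is only a rough sketch, not part
of VPP. The function name check_handoffs and the Handoff tuple are made up for
illustration. It assumes that "next-worker N" refers to the thread numbered N
in the "Start of thread N" headers, and that the trace index printed on both
sides is the same, as in the working trace quoted below.

```python
import re
from typing import List, NamedTuple, Optional, Set, Tuple


class Handoff(NamedTuple):
    """One NAT44_OUT2IN_WORKER_HANDOFF line seen in the trace (hypothetical helper type)."""
    from_thread: int
    next_worker: int
    trace_index: int


# Patterns are taken from the trace excerpts in this thread.
THREAD_RE = re.compile(r"Start of thread (\d+)\b")
SENT_RE = re.compile(
    r"NAT44_OUT2IN_WORKER_HANDOFF\s*:\s*next-worker (\d+) trace index (\d+)"
)
RECV_RE = re.compile(r"HANDED-OFF: from thread (\d+) trace index (\d+)")


def check_handoffs(trace_text: str) -> Tuple[List[Handoff], List[Handoff]]:
    """Return (unmatched, self_handoffs) for saved "show trace" output.

    unmatched:     handoffs with no HANDED-OFF line on the target thread
    self_handoffs: handoffs where the sending thread is also the target
    """
    current: Optional[int] = None          # thread whose section we are in
    sent: List[Handoff] = []
    received: Set[Tuple[int, int, int]] = set()  # (to_thread, from_thread, index)

    for line in trace_text.splitlines():
        m = THREAD_RE.search(line)
        if m:
            current = int(m.group(1))
            continue
        if current is None:
            continue  # ignore anything before the first thread header
        m = SENT_RE.search(line)
        if m:
            sent.append(Handoff(current, int(m.group(1)), int(m.group(2))))
            continue
        m = RECV_RE.search(line)
        if m:
            received.add((current, int(m.group(1)), int(m.group(2))))

    # Assumption: next-worker N is picked up by the thread numbered N.
    unmatched = [
        h for h in sent
        if (h.next_worker, h.from_thread, h.trace_index) not in received
    ]
    self_handoffs = [h for h in sent if h.from_thread == h.next_worker]
    return unmatched, self_handoffs
```

To use it, save the "show trace" output to a file and pass the file contents to
check_handoffs(). An empty first list means every handoff in that capture was
picked up. Since the trace buffer only holds the packets that were captured, a
missing match is a hint rather than proof.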

--a

> On 15 Nov 2019, at 11:05, Elias Rudberg <elias.rudb...@bahnhof.net> wrote:
> 
> We are using VPP 19.08 for NAT (nat44) and are struggling with the
> following problem: it first works seemingly fine for a while, like
> several days or weeks, but then suddenly VPP stops forwarding traffic.
> Even ping to the "outside" IP address fails.
> 
> The VPP process is still running so we try to investigate further using
> vppctl, enabling packet trace as follows:
> 
> clear trace
> trace add rdma-input 5
> 
> then doing ping to "outside" and then "show trace".
> 
> To see the normal behavior we have compared to another server running
> VPP without the strange problem happening; there we can see that the
> normal behavior is that one worker starts processing the packet and
> then does NAT44_OUT2IN_WORKER_HANDOFF after which another worker takes
> over: "handoff_trace" and then "HANDED-OFF: from thread..." and then
> that worker continues processing the packet.
> So the relevant parts of the trace look like this (abbreviated to show
> only node names and handoff info) for a case when thread 8 hands off
> work to thread 3:
> 
> ------------------- Start of thread 3 vpp_wk_2 -------------------
> Packet 1
> 
> 08:15:10:781992: handoff_trace
>  HANDED-OFF: from thread 8 trace index 0
> 08:15:10:781992: nat44-out2in
> 08:15:10:782008: ip4-lookup
> 08:15:10:782009: ip4-local
> 08:15:10:782010: ip4-icmp-input
> 08:15:10:782011: ip4-icmp-echo-request
> 08:15:10:782011: ip4-load-balance
> 08:15:10:782013: ip4-rewrite
> 08:15:10:782014: BondEthernet0-output
> 
> ------------------- Start of thread 8 vpp_wk_7 -------------------
> Packet 1
> 
> 08:15:10:781986: rdma-input
> 08:15:10:781988: bond-input
> 08:15:10:781989: ethernet-input
> 08:15:10:781989: ip4-input
> 08:15:10:781990: nat44-out2in-worker-handoff
>  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
> 
> The above is what it looks like normally. The problem is that
> sometimes, for some reason, the handoff stops working, so that we only
> get the initial processing by a worker and that worker saying
> NAT44_OUT2IN_WORKER_HANDOFF, but the other worker does not pick up the
> work; it is seemingly ignored.
> 
> Here is what it looks like when the problem has happened, with thread
> 7 trying to hand off to thread 3:
> 
> ------------------- Start of thread 3 vpp_wk_2 -------------------
> No packets in trace buffer
> 
> ------------------- Start of thread 7 vpp_wk_6 -------------------
> Packet 1
> 
> 08:38:41:904654: rdma-input
> 08:38:41:904656: bond-input
> 08:38:41:904658: ethernet-input
> 08:38:41:904660: ip4-input
> 08:38:41:904663: nat44-out2in-worker-handoff
>  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
> 
> So, work is also in this case handed off to thread 3 but thread 3 does
> not pick it up. There is no "HANDED-OFF" message in the trace at all,
> not for any worker. It seems like the handed-off work was ignored. Then
> of course it is understandable that the ping does not work and packet
> forwarding does not work. The question is: why does that hand-off
> procedure fail?
> 
> Are there some known reasons that can cause this behavior?
> 
> When there is a NAT44_OUT2IN_WORKER_HANDOFF message in the packet
> trace, should there always be a corresponding "HANDED-OFF" message for
> another thread picking it up?
> 
> One more question related to the above: sometimes when looking at trace
> for ICMP packets to investigate this problem we have seen a worker
> apparently handing off work to itself, which seems strange. Example:
> 
> ------------------- Start of thread 3 vpp_wk_2 -------------------
> Packet 1
> 
> 08:31:23:871274: rdma-input
> 08:31:23:871279: bond-input
> 08:31:23:871282: ethernet-input
> 08:31:23:871285: ip4-input
> 08:31:23:871289: nat44-out2in-worker-handoff
>  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
> 
> If the purpose of "handoff" is to let another thread take over, then
> this seems strange by itself (even without considering that there is no
> "HANDED-OFF" for any thread): why is thread 3 trying to hand off work to
> itself? Does that indicate something wrong, or are there legitimate
> cases where a thread "hands off" something to itself?
> 
> We have encountered this problem several times, but unfortunately we
> have not yet found a way to reproduce it in a lab environment, and we do
> not know exactly what triggers the problem. On previous occasions, VPP
> started working normally again after we restarted it.
> 
> Any input on this or ideas for how to troubleshoot further would be
> much appreciated.
> 
> Best regards,
> Elias
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> 
> View/Reply Online (#14602): https://lists.fd.io/g/vpp-dev/message/14602
> Mute This Topic: https://lists.fd.io/mt/59112885/675608
> Group Owner: vpp-dev+ow...@lists.fd.io
> Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  [ayour...@gmail.com]
> -=-=-=-=-=-=-=-=-=-=-=-
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.

View/Reply Online (#14603): https://lists.fd.io/g/vpp-dev/message/14603
Mute This Topic: https://lists.fd.io/mt/59112885/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub  [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-
