Hi Andrew,

Thanks, that looks promising. The issue 
https://jira.fd.io/browse/VPP-1734 that the fix refers to seems like it
could be the same issue we are seeing.

We have just restarted vpp with the fix, it will be interesting to see
if it helps. Thanks again for your help!
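
In case it helps with checking whether the problem comes back, below is a minimal sketch in Python of how the check from my first mail could be automated: for every NAT44_OUT2IN_WORKER_HANDOFF line in the "show trace" output, look for a matching "HANDED-OFF" line on the target thread. It assumes the trace text looks like the excerpts quoted below, that "next-worker N" refers to thread N (as it appears to in those excerpts), and that the trace index is the same in both messages. The function name find_unmatched_handoffs is made up for illustration; it is not part of VPP.

```python
import re

# Patterns taken from the trace excerpts quoted in this thread.
THREAD_RE = re.compile(r"Start of thread (\d+)")
SENT_RE = re.compile(
    r"NAT44_OUT2IN_WORKER_HANDOFF\s*:\s*next-worker (\d+) trace index (\d+)"
)
RECV_RE = re.compile(r"HANDED-OFF: from thread (\d+) trace index (\d+)")


def find_unmatched_handoffs(trace_text):
    """Return (from_thread, next_worker, trace_index) tuples for handoffs
    that have no corresponding HANDED-OFF entry on the target thread.

    Assumption: "next-worker N" names the same thread that later logs
    "HANDED-OFF: from thread ..." under "Start of thread N".
    """
    sent = []          # handoffs announced by the sending thread
    received = set()   # handoffs acknowledged by the receiving thread
    current = None     # thread whose trace section is being read

    for line in trace_text.splitlines():
        m = THREAD_RE.search(line)
        if m:
            current = int(m.group(1))
            continue
        if current is None:
            continue  # ignore anything before the first thread header
        m = SENT_RE.search(line)
        if m:
            sent.append((current, int(m.group(1)), int(m.group(2))))
            continue
        m = RECV_RE.search(line)
        if m:
            # Stored as (from_thread, receiving_thread, trace_index) so it
            # can be compared directly with the entries in `sent`.
            received.add((int(m.group(1)), current, int(m.group(2))))

    return [handoff for handoff in sent if handoff not in received]
```

The idea would be to save the output of "show trace" to a file and pass its contents to find_unmatched_handoffs(); an empty list means every handoff in the trace was picked up, while any entries returned are handoffs that were seemingly ignored.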

/ Elias


On Fri, 2019-11-15 at 11:26 +0100, Andrew 👽 Yourtchenko wrote:
> Hi Elias,
> 
> Could you try running a build that includes
> https://gerrit.fd.io/r/#/c/vpp/+/23461/ ?
> 
> I cherry-picked it from master today but it is not in 19.08 branch
> yet.
> 
> --a
> 
> > On 15 Nov 2019, at 11:05, Elias Rudberg <elias.rudb...@bahnhof.net>
> > wrote:
> > 
> > We are using VPP 19.08 for NAT (nat44) and are struggling with the
> > following problem: it first works seemingly fine for a while, like
> > several days or weeks, but then suddenly VPP stops forwarding
> > traffic.
> > Even ping to the "outside" IP address fails.
> > 
> > The VPP process is still running so we try to investigate further
> > using
> > vppctl, enabling packet trace as follows:
> > 
> > clear trace
> > trace add rdma-input 5
> > 
> > then doing ping to "outside" and then "show trace".
> > 
> > To see the normal behavior we have compared to another server
> > running
> > VPP without the strange problem happening; there we can see that
> > the
> > normal behavior is that one worker starts processing the packet and
> > then does NAT44_OUT2IN_WORKER_HANDOFF after which another worker
> > takes
> > over: "handoff_trace" and then "HANDED-OFF: from thread..." and
> > then
> > that worker continues processing the packet.
> > So the relevant parts of the trace look like this (abbreviated to
> > show
> > only node names and handoff info) for a case when thread 8 hands
> > off
> > work to thread 3:
> > 
> > ------------------- Start of thread 3 vpp_wk_2 -------------------
> > Packet 1
> > 
> > 08:15:10:781992: handoff_trace
> >  HANDED-OFF: from thread 8 trace index 0
> > 08:15:10:781992: nat44-out2in
> > 08:15:10:782008: ip4-lookup
> > 08:15:10:782009: ip4-local
> > 08:15:10:782010: ip4-icmp-input
> > 08:15:10:782011: ip4-icmp-echo-request
> > 08:15:10:782011: ip4-load-balance
> > 08:15:10:782013: ip4-rewrite
> > 08:15:10:782014: BondEthernet0-output
> > 
> > ------------------- Start of thread 8 vpp_wk_7 -------------------
> > Packet 1
> > 
> > 08:15:10:781986: rdma-input
> > 08:15:10:781988: bond-input
> > 08:15:10:781989: ethernet-input
> > 08:15:10:781989: ip4-input
> > 08:15:10:781990: nat44-out2in-worker-handoff
> >  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
> > 
> > The above is what it looks like normally. The problem is that
> > sometimes, for some reason, the handoff stops working: we only get
> > the initial processing by one worker, with that worker saying
> > NAT44_OUT2IN_WORKER_HANDOFF, but the other worker does not pick up
> > the work; it is seemingly ignored.
> > 
> > Here is what it looks like when the problem has happened, with
> > thread 7 trying to hand off to thread 3:
> > 
> > ------------------- Start of thread 3 vpp_wk_2 -------------------
> > No packets in trace buffer
> > 
> > ------------------- Start of thread 7 vpp_wk_6 -------------------
> > Packet 1
> > 
> > 08:38:41:904654: rdma-input
> > 08:38:41:904656: bond-input
> > 08:38:41:904658: ethernet-input
> > 08:38:41:904660: ip4-input
> > 08:38:41:904663: nat44-out2in-worker-handoff
> >  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
> > 
> > So in this case, too, work is handed off to thread 3, but thread 3
> > does not pick it up. There is no "HANDED-OFF" message in the trace
> > at all, not for any worker. It seems the handed-off work was ignored,
> > which explains why ping and packet forwarding do not work. The
> > question is: why does the hand-off procedure fail?
> > 
> > Are there some known reasons that can cause this behavior?
> > 
> > When there is a NAT44_OUT2IN_WORKER_HANDOFF message in the packet
> > trace, should there always be a corresponding "HANDED-OFF" message
> > for
> > another thread picking it up?
> > 
> > One more question related to the above: sometimes when looking at
> > trace
> > for ICMP packets to investigate this problem we have seen a worker
> > apparently handing off work to itself, which seems strange.
> > Example:
> > 
> > ------------------- Start of thread 3 vpp_wk_2 -------------------
> > Packet 1
> > 
> > 08:31:23:871274: rdma-input
> > 08:31:23:871279: bond-input
> > 08:31:23:871282: ethernet-input
> > 08:31:23:871285: ip4-input
> > 08:31:23:871289: nat44-out2in-worker-handoff
> >  NAT44_OUT2IN_WORKER_HANDOFF : next-worker 3 trace index 0
> > 
> > If the purpose of "handoff" is to let another thread take over, then
> > this seems strange in itself (even without considering that there is
> > no "HANDED-OFF" for any thread): why is thread 3 trying to hand off
> > work to itself? Does that indicate something wrong, or are there
> > legitimate cases where a thread "hands off" something to itself?
> > 
> > We have encountered this problem several times, but unfortunately we
> > have not yet found a way to reproduce it in a lab environment, and we
> > do not know exactly what triggers it. On previous occasions, VPP
> > started working normally again after we restarted it.
> > 
> > Any input on this or ideas for how to troubleshoot further would be
> > much appreciated.
> > 
> > Best regards,
> > Elias