+1, can't tell which poison pattern is involved without a scorecard.

load_balance_alloc_i (...) is clearly not thread-safe due to calls to 
pool_get_aligned (...) and vlib_validate_combined_counter(...). 

Judicious use of pool_get_aligned_will_expand(...), 
_vec_resize_will_expand(...) and a manual barrier sync will fix this problem 
without resorting to draconian measures.
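
A minimal sketch of the shape such a fix could take (untested; exact
macro signatures should be checked against vppinfra, and the
combined-counter validation path would need the same will-expand
treatment via _vec_resize_will_expand(...)):

static load_balance_t *
load_balance_alloc_i (void)
{
  load_balance_t *lb;
  int will_expand = 0;
  vlib_main_t *vm = vlib_get_main ();

  /* LB pool updates must come from the main thread */
  ASSERT (vm->thread_index == 0);

  /* Ask whether pool_get_aligned would move the pool (and grow its
     free-element bitmap) before actually doing it */
  pool_get_aligned_will_expand (load_balance_pool, will_expand,
                                CLIB_CACHE_LINE_BYTES);

  /* Stop the parade only when the pool is about to move; the
     common case stays barrier-free */
  if (will_expand)
    vlib_worker_thread_barrier_sync (vm);

  pool_get_aligned (load_balance_pool, lb, CLIB_CACHE_LINE_BYTES);

  if (will_expand)
    vlib_worker_thread_barrier_release (vm);

  return (lb);
}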

It'd sure be nice to hear from Neale before we code something like that. 

D. 

-----Original Message-----
From: Benoit Ganne (bganne) <bga...@cisco.com> 
Sent: Wednesday, June 3, 2020 3:17 AM
To: raj...@rtbrick.com; Dave Barach (dbarach) <dbar...@cisco.com>
Cc: vpp-dev <vpp-dev@lists.fd.io>; Neale Ranns (nranns) <nra...@cisco.com>
Subject: RE: [vpp-dev] SEGMENTATION FAULT in load_balance_get()

Neale is away and might be slow to react.
I suspect the issue arises when creating a new load balance entry through 
load_balance_create(), which gets a new element from the load balance pool. 
This in turn updates the pool's free-element bitmap, which can grow. As the 
bitmap is backed by a vector, it can be reallocated somewhere else to fit the 
new size.
If that happens concurrently with dataplane processing, bad things happen. The 
0x13 fill pattern is written by dlmalloc free(), which is exactly what you 
would see in that case. I think the same could happen to the pool itself, not 
only the bitmap.
If I am correct, I am not sure how we should fix this: the FIB update API is 
marked mp_safe, so we could create a fixed-size load balance pool to prevent 
runtime reallocation, but that would waste memory and impose a maximum size.
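
To make the race concrete, here is roughly what the worker executes
(paraphrased from src/vnet/dpo/load_balance.h, simplified):

static inline load_balance_t *
load_balance_get (index_t lbi)
{
  /* pool_elt_at_index() asserts the element is in use by consulting
     the pool's free-element bitmap, with no lock taken */
  return (pool_elt_at_index (load_balance_pool, lbi));
}

/* Meanwhile, on the main thread, the mp_safe FIB update path calls
   load_balance_create() -> pool_get_aligned(), which may vec_resize()
   the free bitmap (or the pool vector itself). The old memory goes back
   to dlmalloc, which poisons it with 0x13 bytes in debug images. A
   worker that loaded the old bitmap pointer just before the resize then
   dereferences 0x1313131313131313: exactly the faulting address in the
   backtrace below. */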

ben

> -----Original Message-----
> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Rajith PR 
> via lists.fd.io
> Sent: Wednesday, June 3, 2020 05:46
> To: Dave Barach (dbarach) <dbar...@cisco.com>
> Cc: vpp-dev <vpp-dev@lists.fd.io>; Neale Ranns (nranns) 
> <nra...@cisco.com>
> Subject: Re: [vpp-dev] SEGMENTATION FAULT in load_balance_get()
> 
> Hi Dave/Neal,
> 
> The adj_poison fill pattern seems to be 0xfefe. Am I looking at the
> right code, or have I interpreted it incorrectly?
> 
> Thanks,
> Rajith
> 
> On Tue, Jun 2, 2020 at 7:44 PM Dave Barach (dbarach) 
> <dbar...@cisco.com> wrote:
> 
> 
>       The code manages to access a poisoned adjacency – 0x131313 fill 
> pattern – copying Neale for an opinion.
> 
> 
> 
>       D.
> 
> 
> 
>       From: vpp-dev@lists.fd.io On Behalf Of Rajith PR via lists.fd.io
>       Sent: Tuesday, June 2, 2020 10:00 AM
>       To: vpp-dev <vpp-dev@lists.fd.io>
>       Subject: [vpp-dev] SEGMENTATION FAULT in load_balance_get()
> 
> 
> 
>       Hello All,
> 
> 
> 
>       In VPP 19.08 we are seeing a crash while accessing the
> load_balance_pool in the load_balance_get() function. This happens
> after enabling worker threads.
> 
>       The FIB programming happens in the main thread, and the crash
> occurs in one of the worker threads.
> 
>       Also, this is seen when we scale to 300K+ IPv4 routes.
> 
> 
> 
>       Here is the complete stack,
> 
> 
> 
>       Thread 10 "vpp_wk_0" received signal SIGSEGV, Segmentation fault.
> 
>       [Switching to Thread 0x7fbe4aa8e700 (LWP 333)]
>       0x00007fbef10636f8 in clib_bitmap_get (ai=0x1313131313131313, i=61) 
> at /home/ubuntu/Scale/libvpp/src/vppinfra/bitmap.h:201
>       201  return i0 < vec_len (ai) && 0 != ((ai[i0] >> i1) & 1);
> 
> 
> 
>       Thread 10 (Thread 0x7fbe4aa8e700 (LWP 333)):
>       #0  0x00007fbef10636f8 in clib_bitmap_get (ai=0x1313131313131313,
> i=61) at /home/ubuntu/Scale/libvpp/src/vppinfra/bitmap.h:201
>       #1  0x00007fbef10676a8 in load_balance_get (lbi=61) at
> /home/ubuntu/Scale/libvpp/src/vnet/dpo/load_balance.h:222
>       #2  0x00007fbef106890c in ip4_lookup_inline (vm=0x7fbe8a5aa080, 
> node=0x7fbe8b3fd380, frame=0x7fbe8a5edb40) at
> /home/ubuntu/Scale/libvpp/src/vnet/ip/ip4_forward.h:369
>       #3  0x00007fbef1068ead in ip4_lookup_node_fn_avx2 (vm=0x7fbe8a5aa080, 
> node=0x7fbe8b3fd380, frame=0x7fbe8a5edb40)
>           at /home/ubuntu/Scale/libvpp/src/vnet/ip/ip4_forward.c:95
>       #4  0x00007fbef0c6afec in dispatch_node (vm=0x7fbe8a5aa080, 
> node=0x7fbe8b3fd380, type=VLIB_NODE_TYPE_INTERNAL, 
> dispatch_state=VLIB_NODE_STATE_POLLING,
>           frame=0x7fbe8a5edb40, last_time_stamp=381215594286358) at
> /home/ubuntu/Scale/libvpp/src/vlib/main.c:1207
>       #5  0x00007fbef0c6b7ad in dispatch_pending_node (vm=0x7fbe8a5aa080, 
> pending_frame_index=2, last_time_stamp=381215594286358)
>           at /home/ubuntu/Scale/libvpp/src/vlib/main.c:1375
>       #6  0x00007fbef0c6d3f0 in vlib_main_or_worker_loop 
> (vm=0x7fbe8a5aa080, is_main=0) at
> /home/ubuntu/Scale/libvpp/src/vlib/main.c:1826
>       #7  0x00007fbef0c6dc73 in vlib_worker_loop (vm=0x7fbe8a5aa080) at
> /home/ubuntu/Scale/libvpp/src/vlib/main.c:1934
>       #8  0x00007fbef0cac791 in vlib_worker_thread_fn (arg=0x7fbe8de2a340) 
> at /home/ubuntu/Scale/libvpp/src/vlib/threads.c:1754
>       #9  0x00007fbef092da48 in clib_calljmp () from
> /home/ubuntu/Scale/libvpp/build-root/install-vpp_debug-
> native/vpp/lib/libvppinfra.so.1.0.1
>       #10 0x00007fbe4aa8dec0 in ?? ()
>       #11 0x00007fbef0ca700c in vlib_worker_thread_bootstrap_fn
> (arg=0x7fbe8de2a340) at 
> /home/ubuntu/Scale/libvpp/src/vlib/threads.c:573
> 
>       Thanks in Advance,
> 
>       Rajith
