Hi Ping,

I’ve already posted a comment ☺

Florin

From: "Yu, Ping" <ping...@intel.com>
Date: Tuesday, August 28, 2018 at 8:20 AM
To: Florin Coras <fcoras.li...@gmail.com>
Cc: "Florin Coras (fcoras)" <fco...@cisco.com>, "vpp-dev@lists.fd.io" 
<vpp-dev@lists.fd.io>, "Yu, Ping" <ping...@intel.com>
Subject: RE: [vpp-dev] TLS half open lock

Yes, this is what I did to fix the code.

Please help review the patch below. Besides adding the lock, I also 
optimized it a bit to reduce the lock scope.

https://gerrit.fd.io/r/14534

Ping

From: Florin Coras [mailto:fcoras.li...@gmail.com]
Sent: Tuesday, August 28, 2018 10:58 PM
To: Yu, Ping <ping...@intel.com>
Cc: Florin Coras (fcoras) <fco...@cisco.com>; vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] TLS half open lock

Yes, this probably happens because free is called by a different thread. As 
mentioned in my previous reply, could you try protecting the branch that does 
not expand the pool with clib_rwlock_reader_lock (&tm->half_open_rwlock)?
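
Something along these lines (a sketch only, untested; the full function context 
is in the patch fragment quoted at the bottom of this thread):

  else
    {
      /* Non-expanding path: hold the reader lock so the get cannot race
         with a concurrent free of a half-open context on a worker. */
      clib_rwlock_reader_lock (&tm->half_open_rwlock);
      pool_get (tm->half_open_ctx_pool, ctx);
      memset (ctx, 0, sizeof (*ctx));
      ctx_index = ctx - tm->half_open_ctx_pool;
      clib_rwlock_reader_unlock (&tm->half_open_rwlock);
    }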

Thanks,
Florin



On Aug 28, 2018, at 7:51 AM, Yu, Ping <ping...@intel.com> wrote:

Hello, Florin,

From the info below, we can see that a real race condition happens.

thread 2 free: 149
thread 3 free: 155
thread 0: alloc: 149
thread 0: alloc: 149


[Time 0]

thread 2 free: 149
Thread 2 frees 149, and 149 now sits at the top of the pool's free-index 
vector (vec_len (_pool_var (p)->free_indices)).

[Time 1&Time 2 ]

Now thread 0 wants to get an element, and thread 3 wants to put one back at the same time.
Before thread 3 puts, thread 0 reads the current vec_len and takes 149.
Thread 3 executes:  vec_add1 (_pool_var (p)->free_indices, _pool_var (l));
Thread 0 executes:  _vec_len (_pool_var (p)->free_indices) = _pool_var (l) - 1;


[Time 3]
Due to the race condition, the current vec_len ends up the same as at time 0, 
and the old element 149 is returned again.

Even though performance is important, we still need to use a lock to resolve 
this race condition.
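
To make the interleaving concrete, here is a minimal single-threaded replay of 
one such schedule. It models the free-index vector as a plain C array plus a 
length instead of the real vppinfra vec/pool macros, so it is only a simplified 
sketch of the race described above, not the actual pool code:

#include <stdio.h>

#define N_FREE_T0 3             /* free-index count at [Time 0]; arbitrary */

/* 7 and 33 are arbitrary placeholders; 149 was just freed by thread 2. */
static int free_indices[8] = { 7, 33, 149 };
static int n_free = N_FREE_T0;  /* models vec_len (free_indices) */

/* Models pool_put / vec_add1, but using a length value that was read
   earlier, which is what happens when the put races with a get. */
static void
put_with_stale_len (int index, int stale_len)
{
  free_indices[stale_len] = index;
  n_free = stale_len + 1;
}

/* Models the lockless pool_get fast path: read the length, take the last
   free index, then store length - 1. */
static int
get (void)
{
  int l = n_free;
  int index = free_indices[l - 1];
  n_free = l - 1;
  return index;
}

int
main (void)
{
  /* [Time 1] thread 0 allocates: it reads l = 3, picks free_indices[2],
     i.e. 149, and stores n_free = 2. */
  printf ("thread 0: alloc: %d\n", get ());

  /* [Time 2] thread 3's free of 155 completes using the length it read
     before thread 0's store (3), so it appends at slot 3 and stores
     n_free = 4, pulling the already-allocated 149 back into range. */
  put_with_stale_len (155, N_FREE_T0);

  /* [Time 3] the next allocations on thread 0 hand out 155 and then 149
     a second time: the duplicate index seen in the trace. */
  printf ("thread 0: alloc: %d\n", get ());
  printf ("thread 0: alloc: %d\n", get ());
  return 0;
}

Compiled and run, this prints index 149 twice (with 155 in between), which is 
the same kind of duplicate allocation shown in the trace.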


From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Yu, Ping
Sent: Tuesday, August 28, 2018 5:29 PM
To: Florin Coras <fcoras.li...@gmail.com>
Cc: Florin Coras (fcoras) <fco...@cisco.com>; vpp-dev@lists.fd.io; Yu, Ping <ping...@intel.com>
Subject: Re: [vpp-dev] TLS half open lock

Hi, Florin,

Yes, you are right: all alloc operations are performed by thread 0. An 
interesting thing is that if I run “test echo clients nclients 300 uri 
tls://10.10.1.1/1111” on the client side with 4 threads, I can easily catch a 
case where the same index is allocated twice by thread 0.

thread 0: alloc: 145
thread 0: alloc: 69
thread 4 free: 151
thread 0: alloc: 151
thread 2 free: 149
thread 3 free: 155
thread 0: alloc: 149
thread 0: alloc: 149
thread 0: alloc: 58
thread 0: alloc: 9
thread 0: alloc: 29
thread 3 free: 146
thread 0: alloc: 146
thread 2 free: 153
thread 0: alloc: 144
thread 0: alloc: 153
thread 0: alloc: 124
thread 3 free: 25
thread 0: alloc: 25

From: Florin Coras [mailto:fcoras.li...@gmail.com]
Sent: Tuesday, August 28, 2018 10:24 AM
To: Yu, Ping <ping...@intel.com>
Cc: Florin Coras (fcoras) <fco...@cisco.com>; vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] TLS half open lock

Hi Ping,

The expectation is that all connects/listens come on the main thread (with the 
worker barrier held). In other words, we only need to support a one writer, 
multiple readers scenario.

Florin

On Aug 27, 2018, at 6:29 PM, Yu, Ping <ping...@intel.com> wrote:

Hi, Florin,

The check for whether the pool is about to expand is also lockless. Is there 
any issue if two threads check the pool simultaneously when just one slot is 
available? One thread would do a normal get while the other thread is 
expanding the pool.

Thanks
Ping

From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Florin Coras
Sent: Tuesday, August 28, 2018 12:51 AM
To: Yu, Ping <ping...@intel.com>
Cc: Florin Coras (fcoras) <fco...@cisco.com>; vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] TLS half open lock

Hi Ping,

The current implementation only locks the half-open pool if the pool is about 
to expand. This is done to increase speed by avoiding unnecessary locking, 
i.e., if the pool is not about to expand, it should be safe to get a new 
element from it without affecting readers. Now the thing to figure out is why 
this is happening. Does the slowdown due to the “big lock” avoid some race, or 
is there more to it?
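
For reference, that scheme looks roughly like this (reconstructed from the 
patch fragment quoted at the bottom of the thread; illustrative only):

  pool_get_aligned_will_expand (tm->half_open_ctx_pool, will_expand, 0);
  if (PREDICT_FALSE (will_expand && vlib_num_workers ()))
    {
      /* The pool may be reallocated, so serialize against readers for the
         duration of the expansion. */
      clib_rwlock_writer_lock (&tm->half_open_rwlock);
      pool_get (tm->half_open_ctx_pool, ctx);
      memset (ctx, 0, sizeof (*ctx));
      ctx_index = ctx - tm->half_open_ctx_pool;
      clib_rwlock_writer_unlock (&tm->half_open_rwlock);
    }
  else
    {
      /* Fast path: no lock when the pool is not about to expand. */
      pool_get (tm->half_open_ctx_pool, ctx);
      memset (ctx, 0, sizeof (*ctx));
      ctx_index = ctx - tm->half_open_ctx_pool;
    }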

First of all, how many workers do you have configured and how many sessions are 
you allocating/connecting? Do you see failed connects?

Your tls_ctx_half_open_get line numbers don’t match my code. Did you by chance 
modify something else?

Thanks,
Florin




On Aug 27, 2018, at 9:22 AM, Yu, Ping <ping...@intel.com> wrote:

Hello, all

Recently I found that the TLS half-open lock is not well implemented, and if 
multiple threads are enabled, there is a chance of getting the following core 
dump in debug mode.

(gdb) where
#0  0x00007f7a0848e428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f7a0849002a in __GI_abort () at abort.c:89
#2  0x0000000000407f0b in os_panic () at 
/home/pyu4/git-home/vpp_clean/vpp/src/vpp/vnet/main.c:331
#3  0x00007f7a08867bd0 in debugger () at 
/home/pyu4/git-home/vpp_clean/vpp/src/vppinfra/error.c:84
#4  0x00007f7a08868008 in _clib_error (how_to_die=2, function_name=0x0, 
line_number=0,
    fmt=0x7f7a0a0add78 "%s:%d (%s) assertion `%s' fails") at 
/home/pyu4/git-home/vpp_clean/vpp/src/vppinfra/error.c:143
#5  0x00007f7a09e10be0 in tls_ctx_half_open_get (ctx_index=48) at 
/home/pyu4/git-home/vpp_clean/vpp/src/vnet/tls/tls.c:126
#6  0x00007f7a09e11889 in tls_session_connected_callback (tls_app_index=0, 
ho_ctx_index=48, tls_session=0x7f79c9b6d1c0,
    is_fail=0 '\000') at 
/home/pyu4/git-home/vpp_clean/vpp/src/vnet/tls/tls.c:404
#7  0x00007f7a09d5ea6e in session_stream_connect_notify (tc=0x7f79c9b655fc, 
is_fail=0 '\000')
    at /home/pyu4/git-home/vpp_clean/vpp/src/vnet/session/session.c:648
#8  0x00007f7a099cb969 in tcp46_syn_sent_inline (vm=0x7f79c8a25100, 
node=0x7f79c9a60500, from_frame=0x7f79c8b2a9c0, is_ip4=1)
    at /home/pyu4/git-home/vpp_clean/vpp/src/vnet/tcp/tcp_input.c:2306
#9  0x00007f7a099cbe00 in tcp4_syn_sent (vm=0x7f79c8a25100, 
node=0x7f79c9a60500, from_frame=0x7f79c8b2a9c0)
    at /home/pyu4/git-home/vpp_clean/vpp/src/vnet/tcp/tcp_input.c:2387
#10 0x00007f7a08fefa35 in dispatch_node (vm=0x7f79c8a25100, 
node=0x7f79c9a60500, type=VLIB_NODE_TYPE_INTERNAL,
    dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7f79c8b2a9c0, 
last_time_stamp=902372436923868)
    at /home/pyu4/git-home/vpp_clean/vpp/src/vlib/main.c:988
#11 0x00007f7a08feffee in dispatch_pending_node (vm=0x7f79c8a25100, 
pending_frame_index=7, last_time_stamp=902372436923868)
    at /home/pyu4/git-home/vpp_clean/vpp/src/vlib/main.c:1138
#12 0x00007f7a08ff1bed in vlib_main_or_worker_loop (vm=0x7f79c8a25100, 
is_main=0)
    at /home/pyu4/git-home/vpp_clean/vpp/src/vlib/main.c:1554
#13 0x00007f7a08ff240c in vlib_worker_loop (vm=0x7f79c8a25100) at 
/home/pyu4/git-home/vpp_clean/vpp/src/vlib/main.c:1634
#14 0x00007f7a09035541 in vlib_worker_thread_fn (arg=0x7f79ca4a41c0) at 
/home/pyu4/git-home/vpp_clean/vpp/src/vlib/threads.c:1760
#15 0x00007f7a0888aa38 in clib_calljmp ()
   from 
/home/pyu4/git-home/vpp_clean/vpp/build-root/install-vpp_debug-native/vpp/lib/libvppinfra.so
#16 0x00007f7761198d70 in ?? ()
#17 0x00007f7a090300be in vlib_worker_thread_bootstrap_fn (arg=0x7f79ca4a41c0)
at /home/pyu4/git-home/vpp_clean/vpp/src/vlib/threads.c:684

Seemingly the current code design has a race condition that can make the same 
index be returned twice; after one context frees that index, the other one 
causes the core dump.
I did a simple fix, adding a big lock as follows, and this kind of core dump 
never happens again.
Do you guys have a better solution for the lock?

+      clib_rwlock_writer_lock (&tm->half_open_rwlock);
   pool_get_aligned_will_expand (tm->half_open_ctx_pool, will_expand, 0);
   if (PREDICT_FALSE (will_expand && vlib_num_workers ()))
     {
-      clib_rwlock_writer_lock (&tm->half_open_rwlock);
       pool_get (tm->half_open_ctx_pool, ctx);
       memset (ctx, 0, sizeof (*ctx));
       ctx_index = ctx - tm->half_open_ctx_pool;
-      clib_rwlock_writer_unlock (&tm->half_open_rwlock);
     }
   else
     {
@@ -104,6 +103,8 @@ tls_ctx_half_open_alloc (void)
       memset (ctx, 0, sizeof (*ctx));
       ctx_index = ctx - tm->half_open_ctx_pool;
     }
+      clib_rwlock_writer_unlock (&tm->half_open_rwlock);
   return ctx_index;
}
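
With this fix applied, the function reads roughly as follows (the declarations 
at the top are assumed from the surrounding upstream code and are not part of 
the quoted hunk):

u32
tls_ctx_half_open_alloc (void)
{
  tls_main_t *tm = &tls_main;
  u8 will_expand = 0;
  tls_ctx_t *ctx;
  u32 ctx_index;

  /* Big lock: the whole allocation is serialized against concurrent frees
     from the workers. */
  clib_rwlock_writer_lock (&tm->half_open_rwlock);
  pool_get_aligned_will_expand (tm->half_open_ctx_pool, will_expand, 0);
  if (PREDICT_FALSE (will_expand && vlib_num_workers ()))
    {
      pool_get (tm->half_open_ctx_pool, ctx);
      memset (ctx, 0, sizeof (*ctx));
      ctx_index = ctx - tm->half_open_ctx_pool;
    }
  else
    {
      pool_get (tm->half_open_ctx_pool, ctx);
      memset (ctx, 0, sizeof (*ctx));
      ctx_index = ctx - tm->half_open_ctx_pool;
    }
  clib_rwlock_writer_unlock (&tm->half_open_rwlock);
  return ctx_index;
}

This trades the lockless fast path for correctness; the gerrit change linked 
earlier in the thread then narrows the lock scope again.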




