Hello bihash experts!

There's an old thread claiming that bihash lookup can produce a value=-1 under 
intense add/delete concurrent activity: 
https://lists.fd.io/g/vpp-dev/message/15606

We had a seemingly related crash recently when a lookup in snat_main.flow_hash 
yielded a value=-1 which was subsequently used as a destination thread index to 
offload to. This crash prompted me to study bihash more closely.

The rest of the message is structured as follows:
1. Presenting reasons why I believe that bihash is not thread-safe.
2. Proposing a fix.

1 Bihash is probably not thread-safe

The number of buckets in a hash table never changes. Every bucket has a lock 
bit. Updates happen via clib_bihash_add_del_inline_with_hash. The function 
grabs the bucket lock early on and performs the update while holding the lock. 
Writers are therefore safe with respect to each other; let's focus on readers.

Lookups happen via clib_bihash_search_inline_with_hash / 
clib_bihash_search_inline_2_with_hash. The function locates the bucket and 
waits until the lock bit is cleared.

b = BV (clib_bihash_get_bucket) (h, hash);

if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
  return -1;

if (PREDICT_FALSE (b->lock))
  {
    volatile BVT (clib_bihash_bucket) * bv = b;
    while (bv->lock)
      CLIB_PAUSE ();
  }

From this point on the function examines the data structure without ever 
bothering to check the lock again. Nothing prevents an updater from changing 
the data the reader is currently looking at, or even deallocating it right 
away. The only way this could work is if we could make assumptions about the 
relative performance of lookup and update operations. Checking the lock early 
in lookup ensures that no update is currently in progress. If lookup is faster 
than update, then no future updater will manage to progress to the point where 
the memory is written BEFORE the lookup completes. Indeed, we have the 
following comment in clib_bihash_add_del_inline_with_hash:

/*
* Because reader threads are looking at live data,
* we have to be extra careful. Readers do NOT hold the
* bucket lock. We need to be SLOWER than a search, past the
* point where readers CHECK the bucket lock.
*/

Unfortunately, the assumption doesn't hold. Any thread can be preempted at an 
arbitrary time. Even if we rule out preemption, there are microarchitectural 
effects (e.g. cache misses, branch misprediction) that can slow a lookup down 
to the point where the memory read and the update overlap.

The core of lookup is the following loop. Note that checking the key and 
fetching the value are not atomic; if we are preempted in between, the result 
can be bogus.

for (i = 0; i < limit; i++)
  {
    if (BV (clib_bihash_key_compare) (v->kvp[i].key, key_result->key))
      {
        *key_result = v->kvp[i];
        return 0;
      }
  }

Different ways the key-value pair could get updated:

(1) Add using a previously empty slot:

clib_memcpy_fast (&(v->kvp[i].value),
                  &add_v->value, sizeof (add_v->value));
CLIB_MEMORY_STORE_BARRIER ();   /* Make sure the value has settled */
clib_memcpy_fast (&(v->kvp[i]), &add_v->key,
                  sizeof (add_v->key));

The key update is not atomic; a reader could observe a half-written key.

(2) Add that recycles a stale slot:

clib_memcpy_fast (&(v->kvp[i]), add_v, sizeof (*add_v));

Yet again not atomic. A reader could witness (old_k, new_v) or (new_k, old_v) 
or even an arbitrary interleaving of chunks from old and new keys.

(3) Deleting an entry:

clib_memset_u8 (&(v->kvp[i]), 0xff, sizeof (*(add_v)));

Not atomic.

2 A fix

It's worth noting that bihash itself never crashes. It does, however, 
occasionally yield bogus results. While -1 is easy to check for, the analysis 
above shows that other bogus results are possible. In particular:

1. A value updated half-way, possible with bihash_8_16.
2. Observing a key that never existed, due to a partial key update. The 
probability is low since the hash should map such a key to the same bucket.
3. An old key matched with a new value. The probability is low since the 
lookup must be preempted at exactly the right spot for this to happen.

Even though these anomalies are unlikely they are still possible and 
exploitable.

Should we consider a fix?

The proposal is to introduce read locks for buckets. An implementation 
favouring readers could be as follows:

Extend clib_bihash with "u64 rlocks[MAX_THREADS]". Based on its thread index, 
each reader publishes the bucket number it is currently examining in the 
respective array item. Padding is introduced to avoid false sharing.

The writer lock sequence would be: 1) lock bucket; 2) wait until the bucket 
number is not in rlocks.

Reader lock sequence: 1) publish the bucket number in rlocks; 2) if the bucket 
is not locked, done; 3) otherwise clear the bucket number from rlocks, wait 
for the bucket lock to be released, and restart.

Thoughts?
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#20401): https://lists.fd.io/g/vpp-dev/message/20401
Mute This Topic: https://lists.fd.io/mt/86744671/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-
