Hi Damjan, 

Definitely like the scheme but the change bit might not be enough, unless I’m 
misunderstanding. For instance, two consecutive updates to a bucket before 
reader grabs b1 will hide the change. 

Florin

> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io 
> <dmarion=me....@lists.fd.io> wrote:
> 
> 
> Agree with Dave on atomic ops being bad on the reader side.
> 
> What about following schema:
> 
> As bucket is just u64 value on the reader side we grab bucket before (b0) and 
> after (b1) search operation.
> 
> If search finds entry, we simply do 2 checks:
> - that b0 is equal to b1
> - that lock bit is not set in both of them
> If check fails, we simply retry.
> 
> On the writer side, we have add, remove and replace operations.
> First 2 alter refcnt which is part of bucket.
> To deal with replace case we introduce another bit (change bit) which is 
> flipped every time data is changed in the bucket.
> 
> Here are possible scenarios:
> 
> - reader grabs b0 before lock and b1 after unlock
>    - add, del - refcnt and change bit will be different between b0 and b1 
> causing retry
>    - replace - change bit will be different between b0 and b1 causing retry
> 
> - reader grabs b0 after lock and/or b1 before unlock
>    - lock bit will be set causing retry  
> 
> Of course, this to work properly we need to ensure proper memory ordering 
> (i.e. avoid bucket change to be visible to remote thread before kvp change).
> 
> I crafted WIP patch to present my idea:
> 
> https://gerrit.fd.io/r/c/vpp/+/34326 <https://gerrit.fd.io/r/c/vpp/+/34326>
> 
> In this patch I get a rid of all store barriers and replaced them with more 
> lightweight:
> 
> __atomic_store_n (ptr, val, __ATOMIC_RELEASE);
> 
> On platforms with strong memory ordering (like x86_64) this will result with 
> just normal stores (but compiler will know that it should not reorder them).
> On platforms with weak memory ordering (like arch64) this will result in 
> special store instruction, but that one is still cheaper than full memory 
> barrier.
> 
> Thoughts? Comments?
> 
> Thanks,
> 
> — 
> Damjan
> 
> 
> 
>> On 02.11.2021., at 12:14, Dave Barach <v...@barachs.net> wrote:
>> 
>> Dear Nick,
>> 
>> As the code comment suggests, we tiptoe right up to the line to extract 
>> performance. Have you tried e.g. ISOLCPUS, thread priority, or some other 
>> expedients to make the required assumptions true?
>> 
>> It’s easy enough to change the code in various ways so this use-case cannot 
>> backfire. High on the list: always make a working copy of the bucket, vs. 
>> update in place. Won’t help write performance, but it’s likely to make the 
>> pain go away.
>> 
>> Bucket-level reader-locks would involve adding Avogadro’s number of atomic 
>> ops to the predominant case. I’m pretty sure that’s a non-starter.
>> 
>> FWIW... Dave
>> 
>> 
>> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Nick Zavaritsky
>> Sent: Monday, November 1, 2021 12:12 PM
>> To: vpp-dev@lists.fd.io
>> Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't
>> 
>> Hello bihash experts!
>> 
>> There's an old thread claiming that bihash lookup can produce a value=-1 
>> under intense add/delete concurrent activity: 
>> https://lists.fd.io/g/vpp-dev/message/15606
>> 
>> We had a seemingly related crash recently when a lookup in 
>> snat_main.flow_hash yielded a value=-1 which was subsequently used as a 
>> destination thread index to offload to. This crash prompted me to study 
>> bihash more closely.
>> 
>> The rest of the message is structured as follows:
>>  1. Presenting reasons why I believe that bihash is not thread-safe.
>>  2. Proposing a fix.
>> 
>> 1 Bihash is probably not thread-safe
>> 
>> The number of buckets in a hash table never changes. Every bucket has a lock 
>> bit. Updates happen via clib_bihash_add_del_inline_with_hash. The function 
>> grabs the bucket lock early on and performs update while holding the lock. 
>> Obviously this is safe, let's focus on readers.
>> 
>> Lookups happen via clib_bihash_search_inline_with_hash / 
>> clib_bihash_search_inline_2_with_hash. The function locates the bucket and 
>> waits until the lock bit is cleared.
>> 
>>  b = BV (clib_bihash_get_bucket) (h, hash);
>> 
>>  if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
>>    return -1;
>> 
>>  if (PREDICT_FALSE (b->lock))
>>    {
>>      volatile BVT (clib_bihash_bucket) * bv = b;
>>      while (bv->lock)
>>        CLIB_PAUSE ();
>>    }
>> 
>> From this point on the function examines the data structure without ever 
>> bothering to check the lock again. Nothing prevents an updater from changing 
>> the data the reader is currently looking at, or even deallocating it right 
>> away. The only way it could work is if we could make assumptions about 
>> relative performance of lookup and update operations. Checking the lock 
>> early in lookup ensures that there's no update currently in progress. If 
>> lookup is faster than update, then no future updater will manage to progress 
>> to the point where the memory is written BEFORE the lookup was complete. 
>> Indeed, we have the following comment in 
>> clib_bihash_add_del_inline_with_hash:
>> 
>>      /*
>>       * Because reader threads are looking at live data,
>>       * we have to be extra careful. Readers do NOT hold the
>>       * bucket lock. We need to be SLOWER than a search, past the
>>       * point where readers CHECK the bucket lock.
>>       */
>> 
>> Unfortunately, the assumption doesn't hold. Any thread could get preempted 
>> at arbitrary time. Even if we rule out preemption, there are 
>> microarchitectural quirks (e.g. caches, branch misprediction) that could 
>> slow down lookup to the point that memory read and update will overlap. 
>> 
>> The core of lookup is the following loop. Please note that checking a key 
>> and fetching the value is not atomic, hence if we are preempted in-between 
>> the result could be bogus.
>> 
>>  for (i = 0; i < limit; i++)
>>    {
>>      if (BV (clib_bihash_key_compare) (v->kvp[i].key, key_result->key))
>>        {
>>          *key_result = v->kvp[i];
>>          return 0;
>>        }
>>    }
>> 
>> Different ways the key-value pair could get updated:
>> 
>> (1) Add using a previously empty slot:
>> 
>>              clib_memcpy_fast (&(v->kvp[i].value),
>>                                &add_v->value, sizeof (add_v->value));
>>              CLIB_MEMORY_STORE_BARRIER ();     /* Make sure the value has 
>> settled */
>>              clib_memcpy_fast (&(v->kvp[i]), &add_v->key,
>>                                sizeof (add_v->key));
>> 
>> The key update is not atomic, reader could observe a key updated half-way.
>> 
>> (2) Add that recycles a stale slot:
>> 
>>                  clib_memcpy_fast (&(v->kvp[i]), add_v, sizeof (*add_v));
>> 
>> Yet again not atomic. A reader could witness (old_k, new_v) or (new_k, 
>> old_v) or even an arbitrary interleaving of chunks from old and new keys.
>> 
>> (3) Deleting an entry:
>> 
>> clib_memset_u8 (&(v->kvp[i]), 0xff, sizeof (*(add_v)));
>> 
>> Not atomic.
>> 
>> 
>> 2 A fix
>> 
>> It's worth noting that bihash never crashes. It does yield bogus results 
>> occasionally, though. While -1 is easy to check for, the analysis shows that 
>> other bogus results are possible. In particular:
>> 
>>  1. Value updated half-way, possible with bihash_8_16.
>>  2. Observing a key that never existed due to a key partial update. The 
>> probability is low since the hash should map it to the same bucket.
>>  3. Old key matched with a new value. The probability is low since lookup 
>> should get preempted at the particular spot to make it happen.
>> 
>> Even though these anomalies are unlikely they are still possible and 
>> exploitable.
>> 
>> Should we consider a fix?
>> 
>> The proposal is to introduce read locks for buckets. An implementation 
>> favouring readers could be as follows:
>> 
>> Extend clib_bihash wirh "u64 rlocks[MAX_THREADS]". Based on the thread 
>> index, each reader publishes the bucket number it is currently examining in 
>> the respective array item. Padding is introduced to avoid false sharing.
>> 
>> The writer lock sequence would be: 1) lock bucket; 2) wait until the bucket 
>> number is not in rlocks.
>> 
>> Reader lock sequence: 1) publish bucket number in rlocks; 2) if bucket not 
>> locked then done; 3) otherwise clear bucket number from rlocks, wait for 
>> bucket lock to be released and restart.
>> 
>> Thoughts?
>> 
>> 
> 
> 
> 

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#20416): https://lists.fd.io/g/vpp-dev/message/20416
Mute This Topic: https://lists.fd.io/mt/86744671/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-

Reply via email to