Let me resurrect this discussion, as my patch is still in Gerrit.

Nick, any chance you can submit your proposal as discussed below?

Thanks,

— 
Damjan



> On 04.11.2021., at 19:35, Damjan Marion via lists.fd.io 
> <dmarion=me....@lists.fd.io> wrote:
> 
> 
> Dear Nick,
> 
> It would be great if you could support your proposal with running code so we 
> can understand exactly what it means.
> 
> — 
> Damjan
> 
>> On 04.11.2021., at 19:24, Nick Zavaritsky <nick.zavarit...@emnify.com> wrote:
>> 
>>  Hi, thanks for an insightful discussion!
>> 
>> I do understand that high performance is one of the most important goals of 
>> vpp, therefore certain solutions might not fly. From my POV, the version 
>> counter would be an improvement. It definitely decreases the probability of 
>> triggering the bug.
>> 
>> Concerning isolcpus, currently this is presented as an optimisation, not a 
>> prerequisite. Without isolcpus, a thread could get preempted for arbitrarily 
>> long. Meaning that no matter how many bits we allocate for the version 
>> field, occasionally they won’t be enough.
>> 
>> I’d love to have something that’s robust no matter how the threads are 
>> scheduled. Would it be possible to use vpp benchmarking lab to evaluate the 
>> performance impact of the proposed solutions?
>> 
>> Finally, I'd like to rehash the reader lock proposal. The idea was that we 
>> don’t introduce any atomic operations in the reader path. A reader 
>> *publishes* the bucket number it is about to examine in an int 
>> rlock[MAX_THREADS] array. Every thread uses a distinct cell in rlock 
>> (determined by the thread id), therefore it could be a regular write 
>> followed by a barrier. False sharing is eliminated with padding.
>> 
>> Writer locks a bucket as currently implemented (CAS) and then waits until 
>> the bucket number disappears from rlock[].
>> 
>> Reader publishes the bucket number and then checks if the bucket is locked 
>> (regular write, barrier, regular read). Good to go if not locked; otherwise, 
>> remove the bucket number from rlock, wait for the lock to get released, and 
>> restart.
>> 
>> The proposal doesn’t introduce any new atomic operations. There still might 
>> be a slowdown due to cache line ping-pong in the rlock array. In the worst 
>> case, it costs us 1 extra cache miss for the reader. Could be coalesced with 
>> the bucket prefetch, making it essentially free (few if any bihash users 
>> prefetch buckets).
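>> 
>> To make this concrete, here is a rough sketch of both sides (using VPP's 
>> clib types; rlock_entry_t, bucket_is_locked and MAX_THREADS are placeholder 
>> names, not existing bihash API):
>> 
>> #define MAX_THREADS 64
>> 
>> typedef struct
>> {
>>   volatile u64 bucket;         /* bucket number + 1; 0 means "none" */
>>   u8 pad[64 - sizeof (u64)];   /* one cache line, avoids false sharing */
>> } rlock_entry_t;
>> 
>> static rlock_entry_t rlock[MAX_THREADS];
>> 
>> /* Reader: publish the bucket number, then verify the bucket is unlocked.
>>    The publish store needs store-load ordering vs. the lock check, hence
>>    seq_cst -- on x86_64 that is the one barrier mentioned above; everything
>>    else stays plain loads and stores. */
>> static inline void
>> reader_enter_bucket (u32 thread_index, u64 bucket_index)
>> {
>>   while (1)
>>     {
>>       __atomic_store_n (&rlock[thread_index].bucket, bucket_index + 1,
>>                         __ATOMIC_SEQ_CST);
>>       if (!bucket_is_locked (bucket_index))
>>         return;                /* safe to search */
>>       /* writer holds the bucket: unpublish, wait, restart */
>>       __atomic_store_n (&rlock[thread_index].bucket, 0, __ATOMIC_RELEASE);
>>       while (bucket_is_locked (bucket_index))
>>         CLIB_PAUSE ();
>>     }
>> }
>> 
>> /* Reader: clear the slot once the search is done. */
>> static inline void
>> reader_exit_bucket (u32 thread_index)
>> {
>>   __atomic_store_n (&rlock[thread_index].bucket, 0, __ATOMIC_RELEASE);
>> }
>> 
>> /* Writer, after taking the bucket lock with CAS: drain readers that
>>    have this bucket published. */
>> static inline void
>> writer_wait_readers (u64 bucket_index)
>> {
>>   for (int i = 0; i < MAX_THREADS; i++)
>>     while (__atomic_load_n (&rlock[i].bucket, __ATOMIC_ACQUIRE) ==
>>            bucket_index + 1)
>>       CLIB_PAUSE ();
>> }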
>> 
>> Best,
>> 
>> Nick
>> 
>> 
>>> On 3. Nov 2021, at 21:29, Florin Coras via lists.fd.io 
>>> <fcoras.lists=gmail....@lists.fd.io> wrote:
>>> 
>>> 
>>> Agreed it’s unlikely, so maybe just use the 2 bits left for the epoch 
>>> counter as a middle ground? The new approach should be better either way :-)
>>> 
>>> Florin
>>> 
>>> 
>>>> On Nov 3, 2021, at 11:55 AM, Damjan Marion <dmar...@me.com> wrote:
>>>> 
>>>> What about the following: we shift the offset by 6, as all buckets are 
>>>> aligned to 64 anyway, and that gives us 6 more bits, so we can have an 
>>>> 8-bit epoch counter…?
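>>>> 
>>>> Roughly like this (field widths are a sketch, not the exact current 
>>>> bucket definition):
>>>> 
>>>> /* buckets are 64-byte aligned, so the low 6 bits of the byte offset
>>>>    are always zero and can be reclaimed by storing offset >> 6 */
>>>> typedef union
>>>> {
>>>>   struct
>>>>   {
>>>>     u64 offset : 31;        /* stored shifted right by 6 */
>>>>     u64 lock : 1;
>>>>     u64 linear_search : 1;
>>>>     u64 log2_pages : 8;
>>>>     u64 refcnt : 15;
>>>>     u64 epoch : 8;          /* bumped on every change to the bucket */
>>>>   };
>>>>   u64 as_u64;
>>>> } bucket_sketch_t;
>>>> 
>>>> static inline u64
>>>> bucket_get_offset (bucket_sketch_t *b)
>>>> {
>>>>   return (u64) b->offset << 6;
>>>> }
>>>> 
>>>> static inline void
>>>> bucket_set_offset (bucket_sketch_t *b, u64 byte_offset)
>>>> {
>>>>   b->offset = byte_offset >> 6;   /* 64-byte aligned by construction */
>>>> }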
>>>> 
>>>> — 
>>>> Damjan
>>>> 
>>>>> On 03.11.2021., at 19:45, Damjan Marion <dmar...@me.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> yes, I am aware of that. It is extremely unlikely, and the only way I 
>>>>> can see this fixed is introducing an epoch at the bucket level, but we 
>>>>> don’t have enough space there…. 
>>>>> 
>>>>> — 
>>>>> Damjan
>>>>> 
>>>>>> On 03.11.2021., at 19:16, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi Damjan, 
>>>>>> 
>>>>>> Definitely like the scheme, but the change bit might not be enough, 
>>>>>> unless I’m misunderstanding. For instance, two consecutive updates to a 
>>>>>> bucket before the reader grabs b1 will hide the change. 
>>>>>> 
>>>>>> Florin
>>>>>> 
>>>>>>> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io 
>>>>>>> <dmarion=me....@lists.fd.io> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Agree with Dave on atomic ops being bad on the reader side.
>>>>>>> 
>>>>>>> What about the following scheme:
>>>>>>> 
>>>>>>> As the bucket is just a u64 value, on the reader side we grab the 
>>>>>>> bucket before (b0) and after (b1) the search operation.
>>>>>>> 
>>>>>>> If the search finds an entry, we simply do 2 checks:
>>>>>>> - that b0 is equal to b1
>>>>>>> - that the lock bit is not set in either of them
>>>>>>> If either check fails, we simply retry.
>>>>>>> 
>>>>>>> On the writer side, we have add, remove and replace operations.
>>>>>>> The first 2 alter the refcnt, which is part of the bucket.
>>>>>>> To deal with the replace case we introduce another bit (a change bit) 
>>>>>>> which is flipped every time data is changed in the bucket.
>>>>>>> 
>>>>>>> Here are possible scenarios:
>>>>>>> 
>>>>>>> - reader grabs b0 before lock and b1 after unlock
>>>>>>>    - add, del - refcnt and change bit will be different between b0 and 
>>>>>>> b1, causing a retry
>>>>>>>    - replace - change bit will be different between b0 and b1, causing 
>>>>>>> a retry
>>>>>>> 
>>>>>>> - reader grabs b0 after lock and/or b1 before unlock
>>>>>>>    - lock bit will be set, causing a retry
>>>>>>> 
>>>>>>> Of course, for this to work properly we need to ensure proper memory 
>>>>>>> ordering (i.e. avoid the bucket change becoming visible to a remote 
>>>>>>> thread before the kvp change).
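>>>>>>> 
>>>>>>> On the reader side this boils down to something like the following 
>>>>>>> (a sketch; LOCK_BIT and search_kv are placeholders for the actual 
>>>>>>> lock-bit mask and kvp search, not the real bihash names):
>>>>>>> 
>>>>>>> #define LOCK_BIT (1ULL << 63)   /* placeholder bit position */
>>>>>>> 
>>>>>>> static inline int
>>>>>>> search_with_bucket_check (u64 *bucket, u64 key, u64 *value)
>>>>>>> {
>>>>>>>   u64 b0, b1;
>>>>>>>   int rv;
>>>>>>> 
>>>>>>>   do
>>>>>>>     {
>>>>>>>       b0 = __atomic_load_n (bucket, __ATOMIC_ACQUIRE);
>>>>>>>       /* walk the kvp pages referenced by b0; may observe torn
>>>>>>>          data, which is why the result is validated below */
>>>>>>>       rv = search_kv (b0, key, value);
>>>>>>>       /* order the kvp reads before the re-read of the bucket */
>>>>>>>       __atomic_thread_fence (__ATOMIC_ACQUIRE);
>>>>>>>       b1 = __atomic_load_n (bucket, __ATOMIC_RELAXED);
>>>>>>>     }
>>>>>>>   while (b0 != b1 || (b0 & LOCK_BIT) || (b1 & LOCK_BIT));
>>>>>>> 
>>>>>>>   return rv;
>>>>>>> }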
>>>>>>> 
>>>>>>> I crafted a WIP patch to present my idea:
>>>>>>> 
>>>>>>> https://gerrit.fd.io/r/c/vpp/+/34326
>>>>>>> 
>>>>>>> In this patch I got rid of all store barriers and replaced them with 
>>>>>>> something more lightweight:
>>>>>>> 
>>>>>>> __atomic_store_n (ptr, val, __ATOMIC_RELEASE);
>>>>>>> 
>>>>>>> On platforms with strong memory ordering (like x86_64) this will result 
>>>>>>> in just normal stores (but the compiler will know that it should not 
>>>>>>> reorder them).
>>>>>>> On platforms with weak memory ordering (like AArch64) this will result 
>>>>>>> in a special store instruction, but that one is still cheaper than a 
>>>>>>> full memory barrier.
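>>>>>>> 
>>>>>>> For example, in the add path the pattern changes roughly like this 
>>>>>>> (sketched for a single-u64 key and value, as in bihash_8_8):
>>>>>>> 
>>>>>>> /* before: plain stores separated by a full store barrier */
>>>>>>> v->kvp[i].value = add_v->value;
>>>>>>> CLIB_MEMORY_STORE_BARRIER ();
>>>>>>> v->kvp[i].key = add_v->key;
>>>>>>> 
>>>>>>> /* after: a release store on the key orders it after the value
>>>>>>>    store without a full barrier */
>>>>>>> v->kvp[i].value = add_v->value;
>>>>>>> __atomic_store_n (&v->kvp[i].key, add_v->key, __ATOMIC_RELEASE);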
>>>>>>> 
>>>>>>> Thoughts? Comments?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> — 
>>>>>>> Damjan
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 02.11.2021., at 12:14, Dave Barach <v...@barachs.net> wrote:
>>>>>>>> 
>>>>>>>> Dear Nick,
>>>>>>>> 
>>>>>>>> As the code comment suggests, we tiptoe right up to the line to 
>>>>>>>> extract performance. Have you tried e.g. ISOLCPUS, thread priority, or 
>>>>>>>> some other expedients to make the required assumptions true?
>>>>>>>> 
>>>>>>>> It’s easy enough to change the code in various ways so this use-case 
>>>>>>>> cannot backfire. High on the list: always make a working copy of the 
>>>>>>>> bucket, vs. update in place. Won’t help write performance, but it’s 
>>>>>>>> likely to make the pain go away.
>>>>>>>> 
>>>>>>>> Bucket-level reader-locks would involve adding Avogadro’s number of 
>>>>>>>> atomic ops to the predominant case. I’m pretty sure that’s a 
>>>>>>>> non-starter.
>>>>>>>> 
>>>>>>>> FWIW... Dave
>>>>>>>> 
>>>>>>>> 
>>>>>>>> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Nick 
>>>>>>>> Zavaritsky
>>>>>>>> Sent: Monday, November 1, 2021 12:12 PM
>>>>>>>> To: vpp-dev@lists.fd.io
>>>>>>>> Subject: [vpp-dev] Bihash is considered thread-safe but probably 
>>>>>>>> shouldn't
>>>>>>>> 
>>>>>>>> Hello bihash experts!
>>>>>>>> 
>>>>>>>> There's an old thread claiming that bihash lookup can produce a 
>>>>>>>> value=-1 under intense add/delete concurrent activity: 
>>>>>>>> https://lists.fd.io/g/vpp-dev/message/15606
>>>>>>>> 
>>>>>>>> We had a seemingly related crash recently when a lookup in 
>>>>>>>> snat_main.flow_hash yielded a value=-1 which was subsequently used as 
>>>>>>>> a destination thread index to offload to. This crash prompted me to 
>>>>>>>> study bihash more closely.
>>>>>>>> 
>>>>>>>> The rest of the message is structured as follows:
>>>>>>>>  1. Presenting reasons why I believe that bihash is not thread-safe.
>>>>>>>>  2. Proposing a fix.
>>>>>>>> 
>>>>>>>> 1 Bihash is probably not thread-safe
>>>>>>>> 
>>>>>>>> The number of buckets in a hash table never changes. Every bucket has 
>>>>>>>> a lock bit. Updates happen via clib_bihash_add_del_inline_with_hash. 
>>>>>>>> The function grabs the bucket lock early on and performs the update 
>>>>>>>> while holding the lock. Obviously this is safe; let's focus on readers.
>>>>>>>> 
>>>>>>>> Lookups happen via clib_bihash_search_inline_with_hash / 
>>>>>>>> clib_bihash_search_inline_2_with_hash. The function locates the bucket 
>>>>>>>> and waits until the lock bit is cleared.
>>>>>>>> 
>>>>>>>>  b = BV (clib_bihash_get_bucket) (h, hash);
>>>>>>>> 
>>>>>>>>  if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
>>>>>>>>    return -1;
>>>>>>>> 
>>>>>>>>  if (PREDICT_FALSE (b->lock))
>>>>>>>>    {
>>>>>>>>      volatile BVT (clib_bihash_bucket) * bv = b;
>>>>>>>>      while (bv->lock)
>>>>>>>>        CLIB_PAUSE ();
>>>>>>>>    }
>>>>>>>> 
>>>>>>>> From this point on the function examines the data structure without 
>>>>>>>> ever bothering to check the lock again. Nothing prevents an updater 
>>>>>>>> from changing the data the reader is currently looking at, or even 
>>>>>>>> deallocating it right away. The only way it could work is if we could 
>>>>>>>> make assumptions about relative performance of lookup and update 
>>>>>>>> operations. Checking the lock early in lookup ensures that there's no 
>>>>>>>> update currently in progress. If lookup is faster than update, then no 
>>>>>>>> future updater will manage to progress to the point where the memory 
>>>>>>>> is written BEFORE the lookup was complete. Indeed, we have the 
>>>>>>>> following comment in clib_bihash_add_del_inline_with_hash:
>>>>>>>> 
>>>>>>>>      /*
>>>>>>>>       * Because reader threads are looking at live data,
>>>>>>>>       * we have to be extra careful. Readers do NOT hold the
>>>>>>>>       * bucket lock. We need to be SLOWER than a search, past the
>>>>>>>>       * point where readers CHECK the bucket lock.
>>>>>>>>       */
>>>>>>>> 
>>>>>>>> Unfortunately, the assumption doesn't hold. Any thread could get 
>>>>>>>> preempted at an arbitrary time. Even if we rule out preemption, there 
>>>>>>>> are microarchitectural quirks (e.g. caches, branch misprediction) that 
>>>>>>>> could slow down the lookup to the point that the memory read and the 
>>>>>>>> update overlap. 
>>>>>>>> 
>>>>>>>> The core of lookup is the following loop. Please note that checking a 
>>>>>>>> key and fetching the value is not atomic, hence if we are preempted 
>>>>>>>> in-between the result could be bogus.
>>>>>>>> 
>>>>>>>>  for (i = 0; i < limit; i++)
>>>>>>>>    {
>>>>>>>>      if (BV (clib_bihash_key_compare) (v->kvp[i].key, key_result->key))
>>>>>>>>        {
>>>>>>>>          *key_result = v->kvp[i];
>>>>>>>>          return 0;
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>> 
>>>>>>>> Different ways the key-value pair could get updated:
>>>>>>>> 
>>>>>>>> (1) Add using a previously empty slot:
>>>>>>>> 
>>>>>>>>              clib_memcpy_fast (&(v->kvp[i].value),
>>>>>>>>                                &add_v->value, sizeof (add_v->value));
>>>>>>>>              CLIB_MEMORY_STORE_BARRIER ();     /* Make sure the value 
>>>>>>>> has settled */
>>>>>>>>              clib_memcpy_fast (&(v->kvp[i]), &add_v->key,
>>>>>>>>                                sizeof (add_v->key));
>>>>>>>> 
>>>>>>>> The key update is not atomic; a reader could observe a key updated 
>>>>>>>> half-way.
>>>>>>>> 
>>>>>>>> (2) Add that recycles a stale slot:
>>>>>>>> 
>>>>>>>>                  clib_memcpy_fast (&(v->kvp[i]), add_v, sizeof 
>>>>>>>> (*add_v));
>>>>>>>> 
>>>>>>>> Yet again not atomic. A reader could witness (old_k, new_v) or (new_k, 
>>>>>>>> old_v) or even an arbitrary interleaving of chunks from old and new 
>>>>>>>> keys.
>>>>>>>> 
>>>>>>>> (3) Deleting an entry:
>>>>>>>> 
>>>>>>>> clib_memset_u8 (&(v->kvp[i]), 0xff, sizeof (*(add_v)));
>>>>>>>> 
>>>>>>>> Not atomic.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2 A fix
>>>>>>>> 
>>>>>>>> It's worth noting that bihash never crashes. It does yield bogus 
>>>>>>>> results occasionally, though. While -1 is easy to check for, the 
>>>>>>>> analysis shows that other bogus results are possible. In particular:
>>>>>>>> 
>>>>>>>>  1. Value updated half-way, possible with bihash_8_16.
>>>>>>>>  2. Observing a key that never existed due to a partial key update. 
>>>>>>>> The probability is low since the hash should map it to the same bucket.
>>>>>>>>  3. Old key matched with a new value. The probability is low since 
>>>>>>>> lookup should get preempted at the particular spot to make it happen.
>>>>>>>> 
>>>>>>>> Even though these anomalies are unlikely, they are still possible and 
>>>>>>>> exploitable.
>>>>>>>> 
>>>>>>>> Should we consider a fix?
>>>>>>>> 
>>>>>>>> The proposal is to introduce read locks for buckets. An implementation 
>>>>>>>> favouring readers could be as follows:
>>>>>>>> 
>>>>>>>> Extend clib_bihash with "u64 rlocks[MAX_THREADS]". Based on the thread 
>>>>>>>> index, each reader publishes the bucket number it is currently 
>>>>>>>> examining in the respective array item. Padding is introduced to avoid 
>>>>>>>> false sharing.
>>>>>>>> 
>>>>>>>> The writer lock sequence would be: 1) lock bucket; 2) wait until the 
>>>>>>>> bucket number is not in rlocks.
>>>>>>>> 
>>>>>>>> Reader lock sequence: 1) publish bucket number in rlocks; 2) if bucket 
>>>>>>>> not locked then done; 3) otherwise clear bucket number from rlocks, 
>>>>>>>> wait for bucket lock to be released and restart.
>>>>>>>> 
>>>>>>>> Thoughts?
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>> 
>>> 
>> 
> 
> 
> 
