Re: [vpp-dev] Bihash is considered thread-safe but probably shouldn't

2022-01-20 Thread Damjan Marion via lists.fd.io

Let me resurrect this discussion as my patch is still in gerrit.

Nick, any chance you can submit your proposal as discussed below?

Thanks,

— 
Damjan



> On 04.11.2021., at 19:35, Damjan Marion via lists.fd.io 
>  wrote:
> 
> 
> Dear Nick,
> 
> It would be great if you could support your proposal with running code so we 
> can understand exactly what it means.
> 
> — 
> Damjan
> 
>> On 04.11.2021., at 19:24, Nick Zavaritsky  wrote:
>> 
>>  Hi, thanks for an insightful discussion!
>> 
>> I do understand that high performance is one of the most important goals of 
>> vpp, therefore certain solutions might not fly. From my POV, the version 
>> counter would be an improvement. It definitely decreases the probability of 
>> triggering the bug.
>> 
>> Concerning isolcpus, currently this is presented as an optimisation, not a 
>> prerequisite. Without isolcpus, a thread could get preempted for arbitrarily 
>> long. Meaning that no matter how many bits we allocate for the version 
>> field, occasionally they won’t be enough.
>> 
>> I’d love to have something that’s robust no matter how the threads are 
>> scheduled. Would it be possible to use vpp benchmarking lab to evaluate the 
>> performance impact of the proposed solutions?
>> 
>> Finally, I'd like to rehash the reader lock proposal. The idea was that we 
>> don’t introduce any atomic operations in the reader path. A reader 
>> *publishes* the bucket number it is about to examine in int 
>> rlock[MAX_THREADS] array. Every thread uses a distinct cell in rlock 
>> (determined by the thread id), therefore it could be a regular write 
>> followed by a barrier. Eliminate false sharing with padding.
>> 
>> Writer locks a bucket as currently implemented (CAS) and then waits until 
>> the bucket number disappears from rlock[].
>> 
>> Reader publishes the bucket number and then checks if the bucket is locked 
>> (regular write, barrier, regular read). Good to go if  not locked, otherwise 
>> remove the bucket number from rlock, wait for the lock to get released, 
>> restart.
>> 
>> The proposal doesn’t introduce any new atomic operations. There still might 
>> be a slowdown due to cache line ping-pong in the rlock array. In the worst 
>> case, it costs us 1 extra cache miss for the reader. Could be coalesced with 
>> the bucket prefetch, making it essentially free (few if any bihash users 
>> prefetch buckets).
>> 
>> Best,
>> 
>> Nick
>> 
>> 
>>> On 3. Nov 2021, at 21:29, Florin Coras via lists.fd.io 
>>>  wrote:
>>> 
>>> 
>>> Agreed it’s unlikely so maybe just use the 2 bits left for the epoch 
>>> counter as a middle ground? The new approach should be better either way :-)
>>> 
>>> Florin
>>> 
>>> 
 On Nov 3, 2021, at 11:55 AM, Damjan Marion  wrote:
 
 What about the following, we shift offset by 6, as all buckets are aligned 
 to 64, anyway,  and that gives us 6 more bits so we can have 8 bit epoch 
 counter…. ?
 
 — 
 Damjan
 
> On 03.11.2021., at 19:45, Damjan Marion  wrote:
> 
> 
> 
> yes, i am aware of that, it is extremelly unlikely and only way i can see 
> this fixed is introducing epoch on the bucket level but we dont have 
> enough space there…. 
> 
> — 
> Damjan
> 
>> On 03.11.2021., at 19:16, Florin Coras  wrote:
>> 
>> Hi Damjan, 
>> 
>> Definitely like the scheme but the change bit might not be enough, 
>> unless I’m misunderstanding. For instance, two consecutive updates to a 
>> bucket before reader grabs b1 will hide the change. 
>> 
>> Florin
>> 
>>> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io 
>>>  wrote:
>>> 
>>> 
>>> Agree with Dave on atomic ops being bad on the reader side.
>>> 
>>> What about following schema:
>>> 
>>> As bucket is just u64 value on the reader side we grab bucket before 
>>> (b0) and after (b1) search operation.
>>> 
>>> If search finds entry, we simply do 2 checks:
>>> - that b0 is equal to b1
>>> - that lock bit is not set in both of them
>>> If check fails, we simply retry.
>>> 
>>> On the writer side, we have add, remove and replace operations.
>>> First 2 alter refcnt which is part of bucket.
>>> To deal with replace case we introduce another bit (change bit) which 
>>> is flipped every time data is changed in the bucket.
>>> 
>>> Here are possible scenarios:
>>> 
>>> - reader grabs b0 before lock and b1 after unlock
>>>- add, del - refcnt and change bit will be different between b0 and 
>>> b1 causing retry
>>>- replace - change bit will be different between b0 and b1 causing 
>>> retry
>>> 
>>> - reader grabs b0 after lock and/or b1 before unlock
>>>- lock bit will be set causing retry  
>>> 
>>> Of course, this to work properly we need to ensure proper memory 
>>> ordering (i.e. avoid bucket change to be visible to remote thread 

Re: [vpp-dev] Bihash is considered thread-safe but probably shouldn't

2021-11-04 Thread Florin Coras
Hi Nick, 

Not sure I understood everything correctly but here are a couple of comments:
- max_threads cannot be known upfront and that makes the use of bihashes a bit 
more complicated 
- the writer needs to scan the whole rlock array to check for readers. If 
max_threads is large (say 100) it might be a bit slow and, even if there are no 
readers, the scan must be done at least once. 
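For illustration, a minimal sketch of the writer-side wait the proposal
implies; rlock, MAX_THREADS and the spin loop are assumptions taken from the
proposal text, not existing bihash code:

  #define MAX_THREADS 64   /* placeholder; VPP cannot know this upfront */

  /* One cell per thread; a real implementation would pad each cell to its
   * own cache line to avoid false sharing. -1 means "no bucket". */
  extern volatile long rlock[MAX_THREADS];

  /* Writer side: after taking the bucket lock with CAS (as today), wait
   * until no reader has this bucket published. The scan is O(MAX_THREADS)
   * per update even when there are no readers at all, which is the cost
   * noted above. */
  static void
  writer_wait_for_readers (long bucket_index)
  {
    for (int t = 0; t < MAX_THREADS; t++)
      while (rlock[t] == bucket_index)
        ;   /* CLIB_PAUSE () or a similar cpu relax hint would go here */
  }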

Regards,
Florin

> On Nov 4, 2021, at 11:24 AM, Nick Zavaritsky  
> wrote:
> 
> Hi, thanks for an insightful discussion!
> 
> I do understand that high performance is one of the most important goals of 
> vpp, therefore certain solutions might not fly. From my POV, the version 
> counter would be an improvement. It definitely decreases the probability of 
> triggering the bug.
> 
> Concerning isolcpus, currently this is presented as an optimisation, not a 
> prerequisite. Without isolcpus, a thread could get preempted for arbitrarily 
> long. Meaning that no matter how many bits we allocate for the version field, 
> occasionally they won’t be enough.
> 
> I’d love to have something that’s robust no matter how the threads are 
> scheduled. Would it be possible to use vpp benchmarking lab to evaluate the 
> performance impact of the proposed solutions?
> 
> Finally, I'd like to rehash the reader lock proposal. The idea was that we 
> don’t introduce any atomic operations in the reader path. A reader 
> *publishes* the bucket number it is about to examine in int 
> rlock[MAX_THREADS] array. Every thread uses a distinct cell in rlock 
> (determined by the thread id), therefore it could be a regular write followed 
> by a barrier. Eliminate false sharing with padding.
> 
> Writer locks a bucket as currently implemented (CAS) and then waits until the 
> bucket number disappears from rlock[].
> 
> Reader publishes the bucket number and then checks if the bucket is locked 
> (regular write, barrier, regular read). Good to go if  not locked, otherwise 
> remove the bucket number from rlock, wait for the lock to get released, 
> restart.
> 
> The proposal doesn’t introduce any new atomic operations. There still might 
> be a slowdown due to cache line ping-pong in the rlock array. In the worst 
> case, it costs us 1 extra cache miss for the reader. Could be coalesced with 
> the bucket prefetch, making it essentially free (few if any bihash users 
> prefetch buckets).
> 
> Best,
> 
> Nick
> 
> 
>> On 3. Nov 2021, at 21:29, Florin Coras via lists.fd.io wrote:
>> 
>> 
>> Agreed it’s unlikely so maybe just use the 2 bits left for the epoch counter 
>> as a middle ground? The new approach should be better either way :-)
>> 
>> Florin
>> 
>> 
>>> On Nov 3, 2021, at 11:55 AM, Damjan Marion wrote:
>>> 
>>> What about the following, we shift offset by 6, as all buckets are aligned 
>>> to 64, anyway,  and that gives us 6 more bits so we can have 8 bit epoch 
>>> counter…. ?
>>> 
>>> — 
>>> Damjan
>>> 
 On 03.11.2021., at 19:45, Damjan Marion wrote:
 
 
 
 yes, i am aware of that, it is extremelly unlikely and only way i can see 
 this fixed is introducing epoch on the bucket level but we dont have 
 enough space there…. 
 
 — 
 Damjan
 
> On 03.11.2021., at 19:16, Florin Coras wrote:
> 
> Hi Damjan, 
> 
> Definitely like the scheme but the change bit might not be enough, unless 
> I’m misunderstanding. For instance, two consecutive updates to a bucket 
> before reader grabs b1 will hide the change. 
> 
> Florin
> 
>> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io wrote:
>> 
>> 
>> Agree with Dave on atomic ops being bad on the reader side.
>> 
>> What about following schema:
>> 
>> As bucket is just u64 value on the reader side we grab bucket before 
>> (b0) and after (b1) search operation.
>> 
>> If search finds entry, we simply do 2 checks:
>> - that b0 is equal to b1
>> - that lock bit is not set in both of them
>> If check fails, we simply retry.
>> 
>> On the writer side, we have add, remove and replace operations.
>> First 2 alter refcnt which is part of bucket.
>> To deal with replace case we introduce another bit (change bit) which is 
>> flipped every time data is changed in the bucket.
>> 
>> Here are possible scenarios:
>> 
>> - reader grabs b0 before lock and b1 

Re: [vpp-dev] Bihash is considered thread-safe but probably shouldn't

2021-11-04 Thread Damjan Marion via lists.fd.io

Dear Nick,

It would be great if you could support your proposal with running code so we 
can understand exactly what it means.

— 
Damjan

> On 04.11.2021., at 19:24, Nick Zavaritsky  wrote:
> 
>  Hi, thanks for an insightful discussion!
> 
> I do understand that high performance is one of the most important goals of 
> vpp, therefore certain solutions might not fly. From my POV, the version 
> counter would be an improvement. It definitely decreases the probability of 
> triggering the bug.
> 
> Concerning isolcpus, currently this is presented as an optimisation, not a 
> prerequisite. Without isolcpus, a thread could get preempted for arbitrarily 
> long. Meaning that no matter how many bits we allocate for the version field, 
> occasionally they won’t be enough.
> 
> I’d love to have something that’s robust no matter how the threads are 
> scheduled. Would it be possible to use vpp benchmarking lab to evaluate the 
> performance impact of the proposed solutions?
> 
> Finally, I'd like to rehash the reader lock proposal. The idea was that we 
> don’t introduce any atomic operations in the reader path. A reader 
> *publishes* the bucket number it is about to examine in int 
> rlock[MAX_THREADS] array. Every thread uses a distinct cell in rlock 
> (determined by the thread id), therefore it could be a regular write followed 
> by a barrier. Eliminate false sharing with padding.
> 
> Writer locks a bucket as currently implemented (CAS) and then waits until the 
> bucket number disappears from rlock[].
> 
> Reader publishes the bucket number and then checks if the bucket is locked 
> (regular write, barrier, regular read). Good to go if  not locked, otherwise 
> remove the bucket number from rlock, wait for the lock to get released, 
> restart.
> 
> The proposal doesn’t introduce any new atomic operations. There still might 
> be a slowdown due to cache line ping-pong in the rlock array. In the worst 
> case, it costs us 1 extra cache miss for the reader. Could be coalesced with 
> the bucket prefetch, making it essentially free (few if any bihash users 
> prefetch buckets).
> 
> Best,
> 
> Nick
> 
> 
>>> On 3. Nov 2021, at 21:29, Florin Coras via lists.fd.io 
>>>  wrote:
>>> 
>>> 
>>> Agreed it’s unlikely so maybe just use the 2 bits left for the epoch 
>>> counter as a middle ground? The new approach should be better either way :-)
>>> 
>>> Florin
>>> 
>>> 
>>> On Nov 3, 2021, at 11:55 AM, Damjan Marion  wrote:
>>> 
>>> What about the following, we shift offset by 6, as all buckets are aligned 
>>> to 64, anyway,  and that gives us 6 more bits so we can have 8 bit epoch 
>>> counter…. ?
>>> 
>>> — 
>>> Damjan
>>> 
 On 03.11.2021., at 19:45, Damjan Marion  wrote:
 
 
 
 yes, i am aware of that, it is extremelly unlikely and only way i can see 
 this fixed is introducing epoch on the bucket level but we dont have 
 enough space there…. 
 
 — 
 Damjan
 
> On 03.11.2021., at 19:16, Florin Coras  wrote:
> 
> Hi Damjan, 
> 
> Definitely like the scheme but the change bit might not be enough, unless 
> I’m misunderstanding. For instance, two consecutive updates to a bucket 
> before reader grabs b1 will hide the change. 
> 
> Florin
> 
>> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io 
>>  wrote:
>> 
>> 
>> Agree with Dave on atomic ops being bad on the reader side.
>> 
>> What about following schema:
>> 
>> As bucket is just u64 value on the reader side we grab bucket before 
>> (b0) and after (b1) search operation.
>> 
>> If search finds entry, we simply do 2 checks:
>> - that b0 is equal to b1
>> - that lock bit is not set in both of them
>> If check fails, we simply retry.
>> 
>> On the writer side, we have add, remove and replace operations.
>> First 2 alter refcnt which is part of bucket.
>> To deal with replace case we introduce another bit (change bit) which is 
>> flipped every time data is changed in the bucket.
>> 
>> Here are possible scenarios:
>> 
>> - reader grabs b0 before lock and b1 after unlock
>>- add, del - refcnt and change bit will be different between b0 and 
>> b1 causing retry
>>- replace - change bit will be different between b0 and b1 causing 
>> retry
>> 
>> - reader grabs b0 after lock and/or b1 before unlock
>>- lock bit will be set causing retry  
>> 
>> Of course, this to work properly we need to ensure proper memory 
>> ordering (i.e. avoid bucket change to be visible to remote thread before 
>> kvp change).
>> 
>> I crafted WIP patch to present my idea:
>> 
>> https://gerrit.fd.io/r/c/vpp/+/34326
>> 
>> In this patch I get a rid of all store barriers and replaced them with 
>> more lightweight:
>> 
>> __atomic_store_n (ptr, val, __ATOMIC_RELEASE);
>> 
>> On platforms with strong memory 

Re: [vpp-dev] Bihash is considered thread-safe but probably shouldn't

2021-11-04 Thread Nick Zavaritsky
Hi, thanks for an insightful discussion!

I do understand that high performance is one of the most important goals of 
vpp, therefore certain solutions might not fly. From my POV, the version 
counter would be an improvement. It definitely decreases the probability of 
triggering the bug.

Concerning isolcpus, currently this is presented as an optimisation, not a 
prerequisite. Without isolcpus, a thread could get preempted for arbitrarily 
long. Meaning that no matter how many bits we allocate for the version field, 
occasionally they won’t be enough.

I’d love to have something that’s robust no matter how the threads are 
scheduled. Would it be possible to use vpp benchmarking lab to evaluate the 
performance impact of the proposed solutions?

Finally, I'd like to rehash the reader lock proposal. The idea was that we 
don’t introduce any atomic operations in the reader path. A reader *publishes* 
the bucket number it is about to examine in an int rlock[MAX_THREADS] array. 
Every thread uses a distinct cell in rlock (determined by the thread id), so it 
can be a regular write followed by a barrier. False sharing is eliminated with 
padding.

Writer locks a bucket as currently implemented (CAS) and then waits until the 
bucket number disappears from rlock[].

The reader publishes the bucket number and then checks whether the bucket is 
locked (regular write, barrier, regular read). Good to go if not locked; 
otherwise remove the bucket number from rlock, wait for the lock to be 
released, and restart.
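
A minimal sketch of the reader-side sequence just described, assuming the same
hypothetical rlock[] array; the lock-bit test is schematic, not the actual
bucket layout:

  #include <stdatomic.h>
  #include <stdint.h>

  extern volatile long rlock[];   /* one (padded) cell per thread, -1 = none */

  /* Reader: publish the bucket number (regular write), issue a barrier,
   * then check the bucket lock with a regular read. If the bucket is
   * locked, withdraw the publication, wait for the writer and restart. */
  static void
  reader_enter_bucket (long thread_id, long bucket_index,
                       const volatile uint64_t *bucket_word)
  {
    for (;;)
      {
        rlock[thread_id] = bucket_index;
        atomic_thread_fence (memory_order_seq_cst);

        if (!(*bucket_word & 1 /* schematic lock bit */))
          return;               /* safe to search; clear rlock[] when done */

        rlock[thread_id] = -1;  /* withdraw so the writer can make progress */
        while (*bucket_word & 1)
          ;                     /* wait for the lock to be released */
      }
  }

Whether the store-then-check barrier here is actually cheaper than an atomic
read-modify-write is exactly the kind of question the benchmarking run
mentioned above would answer.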

The proposal doesn’t introduce any new atomic operations. There still might be 
a slowdown due to cache line ping-pong in the rlock array. In the worst case, 
it costs us 1 extra cache miss for the reader. Could be coalesced with the 
bucket prefetch, making it essentially free (few if any bihash users prefetch 
buckets).

Best,

Nick


On 3. Nov 2021, at 21:29, Florin Coras via lists.fd.io wrote:


Agreed it’s unlikely so maybe just use the 2 bits left for the epoch counter as 
a middle ground? The new approach should be better either way :-)

Florin


On Nov 3, 2021, at 11:55 AM, Damjan Marion wrote:

What about the following, we shift offset by 6, as all buckets are aligned to 
64, anyway,  and that gives us 6 more bits so we can have 8 bit epoch counter…. 
?

—
Damjan

On 03.11.2021., at 19:45, Damjan Marion wrote:



yes, i am aware of that, it is extremelly unlikely and only way i can see this 
fixed is introducing epoch on the bucket level but we dont have enough space 
there….

—
Damjan

On 03.11.2021., at 19:16, Florin Coras wrote:

Hi Damjan,

Definitely like the scheme but the change bit might not be enough, unless I’m 
misunderstanding. For instance, two consecutive updates to a bucket before 
reader grabs b1 will hide the change.

Florin

On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io wrote:


Agree with Dave on atomic ops being bad on the reader side.

What about following schema:

As bucket is just u64 value on the reader side we grab bucket before (b0) and 
after (b1) search operation.

If search finds entry, we simply do 2 checks:
- that b0 is equal to b1
- that lock bit is not set in both of them
If check fails, we simply retry.

On the writer side, we have add, remove and replace operations.
First 2 alter refcnt which is part of bucket.
To deal with replace case we introduce another bit (change bit) which is 
flipped every time data is changed in the bucket.

Here are possible scenarios:

- reader grabs b0 before lock and b1 after unlock
   - add, del - refcnt and change bit will be different between b0 and b1 
causing retry
   - replace - change bit will be different between b0 and b1 causing retry

- reader grabs b0 after lock and/or b1 before unlock
   - lock bit will be set causing retry

Of course, this to work properly we need to ensure proper memory ordering (i.e. 
avoid bucket change to be visible to remote thread before kvp change).

I crafted WIP patch to present my idea:

https://gerrit.fd.io/r/c/vpp/+/34326

In this patch I get a rid of all 

Re: [vpp-dev] Bihash is considered thread-safe but probably shouldn't

2021-11-03 Thread Florin Coras
Agreed it’s unlikely so maybe just use the 2 bits left for the epoch counter as 
a middle ground? The new approach should be better either way :-)

Florin


> On Nov 3, 2021, at 11:55 AM, Damjan Marion  wrote:
> 
> What about the following, we shift offset by 6, as all buckets are aligned to 
> 64, anyway,  and that gives us 6 more bits so we can have 8 bit epoch 
> counter…. ?
> 
> — 
> Damjan
> 
>> On 03.11.2021., at 19:45, Damjan Marion  wrote:
>> 
>> 
>> 
>> yes, i am aware of that, it is extremelly unlikely and only way i can see 
>> this fixed is introducing epoch on the bucket level but we dont have enough 
>> space there…. 
>> 
>> — 
>> Damjan
>> 
>>> On 03.11.2021., at 19:16, Florin Coras  wrote:
>>> 
>>> Hi Damjan, 
>>> 
>>> Definitely like the scheme but the change bit might not be enough, unless 
>>> I’m misunderstanding. For instance, two consecutive updates to a bucket 
>>> before reader grabs b1 will hide the change. 
>>> 
>>> Florin
>>> 
 On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io wrote:
 
 
 Agree with Dave on atomic ops being bad on the reader side.
 
 What about following schema:
 
 As bucket is just u64 value on the reader side we grab bucket before (b0) 
 and after (b1) search operation.
 
 If search finds entry, we simply do 2 checks:
 - that b0 is equal to b1
 - that lock bit is not set in both of them
 If check fails, we simply retry.
 
 On the writer side, we have add, remove and replace operations.
 First 2 alter refcnt which is part of bucket.
 To deal with replace case we introduce another bit (change bit) which is 
 flipped every time data is changed in the bucket.
 
 Here are possible scenarios:
 
 - reader grabs b0 before lock and b1 after unlock
- add, del - refcnt and change bit will be different between b0 and b1 
 causing retry
- replace - change bit will be different between b0 and b1 causing retry
 
 - reader grabs b0 after lock and/or b1 before unlock
- lock bit will be set causing retry  
 
 Of course, this to work properly we need to ensure proper memory ordering 
 (i.e. avoid bucket change to be visible to remote thread before kvp 
 change).
 
 I crafted WIP patch to present my idea:
 
 https://gerrit.fd.io/r/c/vpp/+/34326 
 
 In this patch I get a rid of all store barriers and replaced them with 
 more lightweight:
 
 __atomic_store_n (ptr, val, __ATOMIC_RELEASE);
 
 On platforms with strong memory ordering (like x86_64) this will result 
 with just normal stores (but compiler will know that it should not reorder 
 them).
 On platforms with weak memory ordering (like arch64) this will result in 
 special store instruction, but that one is still cheaper than full memory 
 barrier.
 
 Thoughts? Comments?
 
 Thanks,
 
 — 
 Damjan
 
 
 
> On 02.11.2021., at 12:14, Dave Barach wrote:
> 
> Dear Nick,
> 
> As the code comment suggests, we tiptoe right up to the line to extract 
> performance. Have you tried e.g. ISOLCPUS, thread priority, or some other 
> expedients to make the required assumptions true?
> 
> It’s easy enough to change the code in various ways so this use-case 
> cannot backfire. High on the list: always make a working copy of the 
> bucket, vs. update in place. Won’t help write performance, but it’s 
> likely to make the pain go away.
> 
> Bucket-level reader-locks would involve adding Avogadro’s number of 
> atomic ops to the predominant case. I’m pretty sure that’s a non-starter.
> 
> FWIW... Dave
> 
> 
> From: vpp-dev@lists.fd.io On Behalf Of Nick Zavaritsky
> Sent: Monday, November 1, 2021 12:12 PM
> To: vpp-dev@lists.fd.io 
> Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't
> 
> Hello bihash experts!
> 
> There's an old thread claiming that bihash lookup can produce a value=-1 
> under intense add/delete concurrent activity: 
> https://lists.fd.io/g/vpp-dev/message/15606 
> 
> 
> We had a seemingly related crash recently when a lookup in 
> snat_main.flow_hash yielded a value=-1 which was subsequently used as a 
> destination thread index to offload to. This crash prompted me to study 
> bihash more closely.
> 
> The rest of the message is structured as follows:
>  1. Presenting reasons why I believe that bihash is not thread-safe.
>  2. Proposing a fix.
> 
> 1 Bihash is probably not thread-safe
> 
> The number of 

Re: [vpp-dev] Bihash is considered thread-safe but probably shouldn't

2021-11-03 Thread Damjan Marion via lists.fd.io
What about the following: we shift the offset by 6, as all buckets are aligned 
to 64 anyway, and that gives us 6 more bits, so we can have an 8-bit epoch 
counter…?
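
For illustration, one possible layout of the 64-bit bucket word with the offset
stored in 64-byte units; the field names and widths below are assumptions for
the sketch, not the actual clib_bihash_bucket definition:

  #include <stdint.h>

  /* Hypothetical bucket word: storing the offset as (byte_offset >> 6)
   * frees 6 bits, leaving room for an 8-bit epoch counter that a writer
   * bumps on every change to the bucket. */
  typedef union
  {
    struct
    {
      uint64_t offset : 31;        /* byte offset divided by 64 */
      uint64_t lock : 1;
      uint64_t linear_search : 1;
      uint64_t log2_pages : 8;
      uint64_t refcnt : 15;
      uint64_t epoch : 8;          /* wraps after 256 updates */
    };
    uint64_t as_u64;
  } bucket_word_t;

With b0 and b1 compared as single u64 values, a reader would then miss a change
only if a multiple of 256 updates landed between its two reads while it was
preempted.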

— 
Damjan

> On 03.11.2021., at 19:45, Damjan Marion  wrote:
> 
> 
> 
> yes, i am aware of that, it is extremelly unlikely and only way i can see 
> this fixed is introducing epoch on the bucket level but we dont have enough 
> space there…. 
> 
> — 
> Damjan
> 
>>> On 03.11.2021., at 19:16, Florin Coras  wrote:
>>> 
>> Hi Damjan, 
>> 
>> Definitely like the scheme but the change bit might not be enough, unless 
>> I’m misunderstanding. For instance, two consecutive updates to a bucket 
>> before reader grabs b1 will hide the change. 
>> 
>> Florin
>> 
>>> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io 
>>>  wrote:
>>> 
>>> 
>>> Agree with Dave on atomic ops being bad on the reader side.
>>> 
>>> What about following schema:
>>> 
>>> As bucket is just u64 value on the reader side we grab bucket before (b0) 
>>> and after (b1) search operation.
>>> 
>>> If search finds entry, we simply do 2 checks:
>>> - that b0 is equal to b1
>>> - that lock bit is not set in both of them
>>> If check fails, we simply retry.
>>> 
>>> On the writer side, we have add, remove and replace operations.
>>> First 2 alter refcnt which is part of bucket.
>>> To deal with replace case we introduce another bit (change bit) which is 
>>> flipped every time data is changed in the bucket.
>>> 
>>> Here are possible scenarios:
>>> 
>>> - reader grabs b0 before lock and b1 after unlock
>>>- add, del - refcnt and change bit will be different between b0 and b1 
>>> causing retry
>>>- replace - change bit will be different between b0 and b1 causing retry
>>> 
>>> - reader grabs b0 after lock and/or b1 before unlock
>>>- lock bit will be set causing retry  
>>> 
>>> Of course, this to work properly we need to ensure proper memory ordering 
>>> (i.e. avoid bucket change to be visible to remote thread before kvp change).
>>> 
>>> I crafted WIP patch to present my idea:
>>> 
>>> https://gerrit.fd.io/r/c/vpp/+/34326
>>> 
>>> In this patch I get a rid of all store barriers and replaced them with more 
>>> lightweight:
>>> 
>>> __atomic_store_n (ptr, val, __ATOMIC_RELEASE);
>>> 
>>> On platforms with strong memory ordering (like x86_64) this will result 
>>> with just normal stores (but compiler will know that it should not reorder 
>>> them).
>>> On platforms with weak memory ordering (like arch64) this will result in 
>>> special store instruction, but that one is still cheaper than full memory 
>>> barrier.
>>> 
>>> Thoughts? Comments?
>>> 
>>> Thanks,
>>> 
>>> — 
>>> Damjan
>>> 
>>> 
>>> 
 On 02.11.2021., at 12:14, Dave Barach  wrote:
 
 Dear Nick,
 
 As the code comment suggests, we tiptoe right up to the line to extract 
 performance. Have you tried e.g. ISOLCPUS, thread priority, or some other 
 expedients to make the required assumptions true?
 
 It’s easy enough to change the code in various ways so this use-case 
 cannot backfire. High on the list: always make a working copy of the 
 bucket, vs. update in place. Won’t help write performance, but it’s likely 
 to make the pain go away.
 
 Bucket-level reader-locks would involve adding Avogadro’s number of atomic 
 ops to the predominant case. I’m pretty sure that’s a non-starter.
 
 FWIW... Dave
 
 
 From: vpp-dev@lists.fd.io  On Behalf Of Nick 
 Zavaritsky
 Sent: Monday, November 1, 2021 12:12 PM
 To: vpp-dev@lists.fd.io
 Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't
 
 Hello bihash experts!
 
 There's an old thread claiming that bihash lookup can produce a value=-1 
 under intense add/delete concurrent activity: 
 https://lists.fd.io/g/vpp-dev/message/15606
 
 We had a seemingly related crash recently when a lookup in 
 snat_main.flow_hash yielded a value=-1 which was subsequently used as a 
 destination thread index to offload to. This crash prompted me to study 
 bihash more closely.
 
 The rest of the message is structured as follows:
  1. Presenting reasons why I believe that bihash is not thread-safe.
  2. Proposing a fix.
 
 1 Bihash is probably not thread-safe
 
 The number of buckets in a hash table never changes. Every bucket has a 
 lock bit. Updates happen via clib_bihash_add_del_inline_with_hash. The 
 function grabs the bucket lock early on and performs update while holding 
 the lock. Obviously this is safe, let's focus on readers.
 
 Lookups happen via clib_bihash_search_inline_with_hash / 
 clib_bihash_search_inline_2_with_hash. The function locates the bucket and 
 waits until the lock bit is cleared.
 
  b = BV (clib_bihash_get_bucket) (h, hash);
 
  if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
return -1;
 
  if 

Re: [vpp-dev] Bihash is considered thread-safe but probably shouldn't

2021-11-03 Thread Damjan Marion via lists.fd.io

Yes, I am aware of that. It is extremely unlikely, and the only way I can see 
this fixed is introducing an epoch at the bucket level, but we don't have 
enough space there…. 

— 
Damjan

> On 03.11.2021., at 19:16, Florin Coras  wrote:
> 
> Hi Damjan, 
> 
> Definitely like the scheme but the change bit might not be enough, unless I’m 
> misunderstanding. For instance, two consecutive updates to a bucket before 
> reader grabs b1 will hide the change. 
> 
> Florin
> 
>> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io 
>>  wrote:
>> 
>> 
>> Agree with Dave on atomic ops being bad on the reader side.
>> 
>> What about following schema:
>> 
>> As bucket is just u64 value on the reader side we grab bucket before (b0) 
>> and after (b1) search operation.
>> 
>> If search finds entry, we simply do 2 checks:
>> - that b0 is equal to b1
>> - that lock bit is not set in both of them
>> If check fails, we simply retry.
>> 
>> On the writer side, we have add, remove and replace operations.
>> First 2 alter refcnt which is part of bucket.
>> To deal with replace case we introduce another bit (change bit) which is 
>> flipped every time data is changed in the bucket.
>> 
>> Here are possible scenarios:
>> 
>> - reader grabs b0 before lock and b1 after unlock
>>- add, del - refcnt and change bit will be different between b0 and b1 
>> causing retry
>>- replace - change bit will be different between b0 and b1 causing retry
>> 
>> - reader grabs b0 after lock and/or b1 before unlock
>>- lock bit will be set causing retry  
>> 
>> Of course, this to work properly we need to ensure proper memory ordering 
>> (i.e. avoid bucket change to be visible to remote thread before kvp change).
>> 
>> I crafted WIP patch to present my idea:
>> 
>> https://gerrit.fd.io/r/c/vpp/+/34326
>> 
>> In this patch I get a rid of all store barriers and replaced them with more 
>> lightweight:
>> 
>> __atomic_store_n (ptr, val, __ATOMIC_RELEASE);
>> 
>> On platforms with strong memory ordering (like x86_64) this will result with 
>> just normal stores (but compiler will know that it should not reorder them).
>> On platforms with weak memory ordering (like arch64) this will result in 
>> special store instruction, but that one is still cheaper than full memory 
>> barrier.
>> 
>> Thoughts? Comments?
>> 
>> Thanks,
>> 
>> — 
>> Damjan
>> 
>> 
>> 
>>> On 02.11.2021., at 12:14, Dave Barach  wrote:
>>> 
>>> Dear Nick,
>>> 
>>> As the code comment suggests, we tiptoe right up to the line to extract 
>>> performance. Have you tried e.g. ISOLCPUS, thread priority, or some other 
>>> expedients to make the required assumptions true?
>>> 
>>> It’s easy enough to change the code in various ways so this use-case cannot 
>>> backfire. High on the list: always make a working copy of the bucket, vs. 
>>> update in place. Won’t help write performance, but it’s likely to make the 
>>> pain go away.
>>> 
>>> Bucket-level reader-locks would involve adding Avogadro’s number of atomic 
>>> ops to the predominant case. I’m pretty sure that’s a non-starter.
>>> 
>>> FWIW... Dave
>>> 
>>> 
>>> From: vpp-dev@lists.fd.io  On Behalf Of Nick Zavaritsky
>>> Sent: Monday, November 1, 2021 12:12 PM
>>> To: vpp-dev@lists.fd.io
>>> Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't
>>> 
>>> Hello bihash experts!
>>> 
>>> There's an old thread claiming that bihash lookup can produce a value=-1 
>>> under intense add/delete concurrent activity: 
>>> https://lists.fd.io/g/vpp-dev/message/15606
>>> 
>>> We had a seemingly related crash recently when a lookup in 
>>> snat_main.flow_hash yielded a value=-1 which was subsequently used as a 
>>> destination thread index to offload to. This crash prompted me to study 
>>> bihash more closely.
>>> 
>>> The rest of the message is structured as follows:
>>>  1. Presenting reasons why I believe that bihash is not thread-safe.
>>>  2. Proposing a fix.
>>> 
>>> 1 Bihash is probably not thread-safe
>>> 
>>> The number of buckets in a hash table never changes. Every bucket has a 
>>> lock bit. Updates happen via clib_bihash_add_del_inline_with_hash. The 
>>> function grabs the bucket lock early on and performs update while holding 
>>> the lock. Obviously this is safe, let's focus on readers.
>>> 
>>> Lookups happen via clib_bihash_search_inline_with_hash / 
>>> clib_bihash_search_inline_2_with_hash. The function locates the bucket and 
>>> waits until the lock bit is cleared.
>>> 
>>>  b = BV (clib_bihash_get_bucket) (h, hash);
>>> 
>>>  if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
>>>return -1;
>>> 
>>>  if (PREDICT_FALSE (b->lock))
>>>{
>>>  volatile BVT (clib_bihash_bucket) * bv = b;
>>>  while (bv->lock)
>>>CLIB_PAUSE ();
>>>}
>>> 
>>> From this point on the function examines the data structure without ever 
>>> bothering to check the lock again. Nothing prevents an updater from 
>>> changing the data the reader is currently looking at, or even deallocating 

Re: [vpp-dev] Bihash is considered thread-safe but probably shouldn't

2021-11-03 Thread Florin Coras
Hi Damjan, 

Definitely like the scheme but the change bit might not be enough, unless I’m 
misunderstanding. For instance, two consecutive updates to a bucket before 
reader grabs b1 will hide the change. 

Florin

> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io 
>  wrote:
> 
> 
> Agree with Dave on atomic ops being bad on the reader side.
> 
> What about following schema:
> 
> As bucket is just u64 value on the reader side we grab bucket before (b0) and 
> after (b1) search operation.
> 
> If search finds entry, we simply do 2 checks:
> - that b0 is equal to b1
> - that lock bit is not set in both of them
> If check fails, we simply retry.
> 
> On the writer side, we have add, remove and replace operations.
> First 2 alter refcnt which is part of bucket.
> To deal with replace case we introduce another bit (change bit) which is 
> flipped every time data is changed in the bucket.
> 
> Here are possible scenarios:
> 
> - reader grabs b0 before lock and b1 after unlock
>- add, del - refcnt and change bit will be different between b0 and b1 
> causing retry
>- replace - change bit will be different between b0 and b1 causing retry
> 
> - reader grabs b0 after lock and/or b1 before unlock
>- lock bit will be set causing retry  
> 
> Of course, this to work properly we need to ensure proper memory ordering 
> (i.e. avoid bucket change to be visible to remote thread before kvp change).
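
A minimal sketch of the reader-side check described above; the lock-bit mask is
schematic and do_search() stands in for the existing key/value search:

  #include <stdatomic.h>
  #include <stdint.h>

  #define BUCKET_LOCK_BIT (1ULL << 63)  /* schematic position of the lock bit */

  /* Snapshot the bucket word before (b0) and after (b1) the search and
   * retry if a writer was active (lock bit set) or the bucket changed
   * (refcnt / change bit differ, so b0 != b1). */
  static int
  search_with_bucket_check (_Atomic uint64_t *bucket, int (*do_search) (void))
  {
    for (;;)
      {
        uint64_t b0 = atomic_load_explicit (bucket, memory_order_acquire);
        int rv = do_search ();
        uint64_t b1 = atomic_load_explicit (bucket, memory_order_acquire);

        if (b0 == b1 && !(b0 & BUCKET_LOCK_BIT) && !(b1 & BUCKET_LOCK_BIT))
          return rv;   /* consistent, unlocked snapshot: result is valid */
      }
  }

On x86_64 the acquire loads compile to plain loads, so the reader path stays
free of atomic read-modify-write operations, which is the point of the scheme.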
> 
> I crafted WIP patch to present my idea:
> 
> https://gerrit.fd.io/r/c/vpp/+/34326 
> 
> In this patch I got rid of all store barriers and replaced them with the more 
> lightweight:
> 
> __atomic_store_n (ptr, val, __ATOMIC_RELEASE);
> 
> On platforms with strong memory ordering (like x86_64) this will result in 
> just normal stores (but the compiler will know that it should not reorder 
> them). On platforms with weak memory ordering (like aarch64) this will result 
> in a special store instruction, but that one is still cheaper than a full 
> memory barrier.
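
In plain C11 terms, the pairing the quoted patch relies on looks like the
sketch below (generic pointers, not the actual bihash fields); the release
store keeps the kvp write ordered before the bucket publication without a full
barrier:

  #include <stdatomic.h>
  #include <stdint.h>

  /* Writer: update the key/value first, then publish the new bucket word
   * with a release store so the kvp write cannot be reordered after it. */
  static void
  publish (uint64_t *kvp, _Atomic uint64_t *bucket,
           uint64_t new_kv, uint64_t new_bucket)
  {
    kvp[0] = new_kv;
    atomic_store_explicit (bucket, new_bucket, memory_order_release);
  }

  /* Reader: an acquire load of the bucket word guarantees that the kvp
   * contents it then reads are at least as new as that bucket value. */
  static uint64_t
  consume (const uint64_t *kvp, _Atomic uint64_t *bucket)
  {
    uint64_t b = atomic_load_explicit (bucket, memory_order_acquire);
    (void) b;   /* lock/epoch checks would go here */
    return kvp[0];
  }

On x86_64 both compile to ordinary stores and loads; on AArch64 they become
STLR/LDAR instead of a full barrier, which is the saving described above.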
> 
> Thoughts? Comments?
> 
> Thanks,
> 
> — 
> Damjan
> 
> 
> 
>> On 02.11.2021., at 12:14, Dave Barach  wrote:
>> 
>> Dear Nick,
>> 
>> As the code comment suggests, we tiptoe right up to the line to extract 
>> performance. Have you tried e.g. ISOLCPUS, thread priority, or some other 
>> expedients to make the required assumptions true?
>> 
>> It’s easy enough to change the code in various ways so this use-case cannot 
>> backfire. High on the list: always make a working copy of the bucket, vs. 
>> update in place. Won’t help write performance, but it’s likely to make the 
>> pain go away.
>> 
>> Bucket-level reader-locks would involve adding Avogadro’s number of atomic 
>> ops to the predominant case. I’m pretty sure that’s a non-starter.
>> 
>> FWIW... Dave
>> 
>> 
>> From: vpp-dev@lists.fd.io  On Behalf Of Nick Zavaritsky
>> Sent: Monday, November 1, 2021 12:12 PM
>> To: vpp-dev@lists.fd.io
>> Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't
>> 
>> Hello bihash experts!
>> 
>> There's an old thread claiming that bihash lookup can produce a value=-1 
>> under intense add/delete concurrent activity: 
>> https://lists.fd.io/g/vpp-dev/message/15606
>> 
>> We had a seemingly related crash recently when a lookup in 
>> snat_main.flow_hash yielded a value=-1 which was subsequently used as a 
>> destination thread index to offload to. This crash prompted me to study 
>> bihash more closely.
>> 
>> The rest of the message is structured as follows:
>>  1. Presenting reasons why I believe that bihash is not thread-safe.
>>  2. Proposing a fix.
>> 
>> 1 Bihash is probably not thread-safe
>> 
>> The number of buckets in a hash table never changes. Every bucket has a lock 
>> bit. Updates happen via clib_bihash_add_del_inline_with_hash. The function 
>> grabs the bucket lock early on and performs update while holding the lock. 
>> Obviously this is safe, let's focus on readers.
>> 
>> Lookups happen via clib_bihash_search_inline_with_hash / 
>> clib_bihash_search_inline_2_with_hash. The function locates the bucket and 
>> waits until the lock bit is cleared.
>> 
>>  b = BV (clib_bihash_get_bucket) (h, hash);
>> 
>>  if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
>>return -1;
>> 
>>  if (PREDICT_FALSE (b->lock))
>>{
>>  volatile BVT (clib_bihash_bucket) * bv = b;
>>  while (bv->lock)
>>CLIB_PAUSE ();
>>}
>> 
>> From this point on the function examines the data structure without ever 
>> bothering to check the lock again. Nothing prevents an updater from changing 
>> the data the reader is currently looking at, or even deallocating it right 
>> away. The only way it could work is if we could make assumptions about 
>> relative performance of lookup and update operations. Checking the lock 
>> early in lookup ensures that there's no update currently in progress. If 
>> lookup is faster than update, then no future updater will manage to progress 
>> to the point 

Re: [vpp-dev] Bihash is considered thread-safe but probably shouldn't

2021-11-02 Thread Dave Barach
Dear Nick,

 

As the code comment suggests, we tiptoe right up to the line to extract 
performance. Have you tried e.g. ISOLCPUS, thread priority, or some other 
expedients to make the required assumptions true?

 

It’s easy enough to change the code in various ways so this use-case cannot 
backfire. High on the list: always make a working copy of the bucket, vs. 
update in place. Won’t help write performance, but it’s likely to make the pain 
go away.
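
A rough sketch of what "always make a working copy of the bucket" could look
like, with generic types standing in for the bihash ones; reclamation of the
old copy is deliberately left open and the names here are illustrative:

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  typedef struct { uint64_t key, value; } kv_t;

  /* Hypothetical copy-on-write update: build a modified copy of the bucket's
   * key/value page off to the side, then publish it with one release store
   * of the page pointer. Readers see either the complete old page or the
   * complete new one; freeing the old page safely is the part this skips. */
  static void
  bucket_replace_cow (kv_t *_Atomic *page_ptr, int n, int slot, kv_t new_kv)
  {
    kv_t *old_page = atomic_load_explicit (page_ptr, memory_order_acquire);
    kv_t *new_page = malloc (n * sizeof (kv_t));

    memcpy (new_page, old_page, n * sizeof (kv_t));
    new_page[slot] = new_kv;                        /* modify the copy */
    atomic_store_explicit (page_ptr, new_page, memory_order_release);
    /* old_page must be freed only after readers are done with it,
     * e.g. via deferred / epoch-based reclamation. */
  }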

 

Bucket-level reader-locks would involve adding Avogadro’s number of atomic ops 
to the predominant case. I’m pretty sure that’s a non-starter.

 

FWIW... Dave

 

 

From: vpp-dev@lists.fd.io  On Behalf Of Nick Zavaritsky
Sent: Monday, November 1, 2021 12:12 PM
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't

 

Hello bihash experts!

There's an old thread claiming that bihash lookup can produce a value=-1 under 
intense add/delete concurrent activity: 
https://lists.fd.io/g/vpp-dev/message/15606

We had a seemingly related crash recently when a lookup in snat_main.flow_hash 
yielded a value=-1 which was subsequently used as a destination thread index to 
offload to. This crash prompted me to study bihash more closely.

The rest of the message is structured as follows:
  1. Presenting reasons why I believe that bihash is not thread-safe.
  2. Proposing a fix.

1 Bihash is probably not thread-safe

The number of buckets in a hash table never changes. Every bucket has a lock 
bit. Updates happen via clib_bihash_add_del_inline_with_hash. The function 
grabs the bucket lock early on and performs update while holding the lock. 
Obviously this is safe, let's focus on readers.

Lookups happen via clib_bihash_search_inline_with_hash / 
clib_bihash_search_inline_2_with_hash. The function locates the bucket and 
waits until the lock bit is cleared.

  b = BV (clib_bihash_get_bucket) (h, hash);

  if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
    return -1;

  if (PREDICT_FALSE (b->lock))
    {
      volatile BVT (clib_bihash_bucket) * bv = b;
      while (bv->lock)
        CLIB_PAUSE ();
    }


From this point on the function examines the data structure without ever 
bothering to check the lock again. Nothing prevents an updater from changing 
the data the reader is currently looking at, or even deallocating it right 
away. The only way it could work is if we could make assumptions about relative 
performance of lookup and update operations. Checking the lock early in lookup 
ensures that there's no update currently in progress. If lookup is faster than 
update, then no future updater will manage to progress to the point where the 
memory is written BEFORE the lookup was complete. Indeed, we have the following 
comment in clib_bihash_add_del_inline_with_hash:

  /*
   * Because reader threads are looking at live data,
   * we have to be extra careful. Readers do NOT hold the
   * bucket lock. We need to be SLOWER than a search, past the
   * point where readers CHECK the bucket lock.
   */

Unfortunately, the assumption doesn't hold. Any thread could get preempted at 
arbitrary time. Even if we rule out preemption, there are microarchitectural 
quirks (e.g. caches, branch misprediction) that could slow down lookup to the 
point that memory read and update will overlap. 


The core of lookup is the following loop. Please note that checking a key and 
fetching the value is not atomic, hence if we are preempted in-between the 
result could be bogus.

  for (i = 0; i < limit; i++)
    {
      if (BV (clib_bihash_key_compare) (v->kvp[i].key, key_result->key))
        {
          *key_result = v->kvp[i];
          return 0;
        }
    }


Different ways the key-value pair could get updated:

(1) Add using a previously empty slot:

  clib_memcpy_fast (&(v->kvp[i].value),
                    &add_v->value, sizeof (add_v->value));
  CLIB_MEMORY_STORE_BARRIER (); /* Make sure the value has settled */
  clib_memcpy_fast (&(v->kvp[i]), &add_v->key,
                    sizeof (add_v->key));


The key update is not atomic, so a reader could observe a key updated half-way.


(2) Add that recycles a stale slot:

  clib_memcpy_fast (&(v->kvp[i]), add_v, sizeof (*add_v));


Yet again not atomic. A reader could witness (old_k, new_v) or (new_k, old_v) 
or even an arbitrary interleaving of chunks from old and new keys.

(3) Deleting an entry:

clib_memset_u8 (&(v->kvp[i]), 0xff, sizeof (*(add_v)));

Not atomic.


2 A fix

It's worth noting that bihash never crashes. It does yield bogus results 
occasionally, though. While -1 is easy to check for, the analysis shows that 
other bogus results are possible. In particular:

  1. Value updated half-way, possible with bihash_8_16.
  2. Observing a key that never existed due to a key partial update. The 
probability is low since the hash should