Hi Chris,

Comments inline...

On 15/04/2020 15:14, "Christian Hopps" <cho...@chopps.org> wrote:

    Hi Neale,

    I agree that something like 4 is probably the correct approach. I had a 
side-meeting with some of the ARM folks (Govind and Honnappa), and we thought 
using a generation number for the state rather than just waiting "long-enough" 
to recycle it could work. The generation number would be the atomic value 
associated with the state. So consider this API:

     - MP-safe pools store generation numbers alongside each object.
     - When you allocate a new object from the pool you get an index and 
generation number.
     - When storing the object index you also save the generation number.
     - When getting a pointer to the object you pass the API the index and 
generation number; it returns NULL if the generation number does not match the 
one stored with the object in the pool.
     - When you delete a pool object its generation number is incremented (with 
barrier).

    The size of the generation number needs to be large enough to guarantee 
there is no wrap with objects still in the system that have stored the 
generation number. Technically this is a "long-enough" aspect of the scheme. :) 
One could imagine using less than 64 bits for the combination of index and 
generation, if that was important.

It's a good scheme, I like it.
I assume the pool indices would be 64-bit and the separation between vector 
index and generation would be hidden from the user. Maybe a 32-bit value would 
suffice in most cases, but why skimp...
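
To make it concrete, here's a rough sketch of how I'd imagine the API you 
describe could look (purely hypothetical names and layout, not an existing 
vppinfra API, and assuming vppinfra's basic types are available):

#include <vppinfra/types.h>   /* u32/u64; assumption: vppinfra is on the path */

/* Opaque 64-bit handle: low 32 bits = pool index, high 32 bits = generation. */
typedef u64 gen_pool_handle_t;

typedef struct
{
  u32 data;                   /* stand-in for the real object, e.g. an SA */
} gen_pool_obj_t;

typedef struct
{
  gen_pool_obj_t *objs;       /* the pool of objects */
  u32 *generations;           /* one generation number per pool slot */
} gen_pool_t;

static inline gen_pool_handle_t
gen_pool_mk_handle (u32 index, u32 generation)
{
  return ((u64) generation << 32) | index;
}

/* Return the object only if the caller's generation still matches; the
 * acquire load pairs with the release increment in the delete path. */
static inline gen_pool_obj_t *
gen_pool_get (gen_pool_t *gp, gen_pool_handle_t h)
{
  u32 index = (u32) h;
  u32 generation = (u32) (h >> 32);

  if (__atomic_load_n (&gp->generations[index], __ATOMIC_ACQUIRE) != generation)
    return 0;                 /* stale handle: object was deleted (maybe reused) */

  return gp->objs + index;
}

/* On delete, bump the generation so every outstanding handle goes stale;
 * the slot itself can then be recycled immediately. */
static inline void
gen_pool_put (gen_pool_t *gp, u32 index)
{
  __atomic_add_fetch (&gp->generations[index], 1, __ATOMIC_RELEASE);
}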

    The advantage over just waiting N seconds to recycle the index is that the 
system scales better, i.e., if you just wait N seconds to reuse, and are 
creating and deleting objects at a significant rate, your pool can blow up in 
those N seconds. With the generation number this is not a problem as you 
can re-use the object immediately. Another advantage is that you don't have to 
have the timer logic (looping per pool or processing all pools) to free up old 
indices.

Yes, for my time-based scheme the size of the pool becomes dependent on some 
integration over the rate of change, which is not deterministic and so not 
great, but I don't suppose all APIs are subject to large churn.
With the generation scheme the pool always requires more memory, since you're 
storing a generation value for each index, but since that size is deterministic 
(even if probably bigger), I'd probably take that.
I wouldn't use timer logic in my scheme. I'd make the pool's free-list a FIFO 
(as opposed to the stack it is today) where each entry in the list has the 
index and the time it was added. If t_now - t_head > t_wrap I can pull from the 
free-list, else the pool needs to grow.
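
Sketched out, it'd be something like this (again purely hypothetical, just to 
illustrate the free-list check; t_wrap is the assumed worst-case in-flight 
lifetime of a stored index):

#include <vppinfra/types.h>   /* u32/f64; assumption: vppinfra is on the path */

/* One entry on the (FIFO) free-list: the recycled index plus the time it
 * was freed. */
typedef struct
{
  u32 index;
  f64 t_freed;
} pool_free_elt_t;

/* Reuse the head of the FIFO only if it has sat on the free-list longer
 * than t_wrap, i.e. no in-flight packet can still be holding the index;
 * otherwise the caller must grow the pool instead. */
static inline int
pool_free_list_try_get (pool_free_elt_t *head, f64 t_now, f64 t_wrap,
                        u32 *index)
{
  if (head && (t_now - head->t_freed) > t_wrap)
    {
      *index = head->index;
      return 1;               /* old enough: safe to reuse */
    }
  return 0;                   /* too fresh (or list empty): grow the pool */
}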

    The generation number scheme will still need the thread barrier to 
increment the generation number to make sure no-one is using the object in 
parallel. But this is a common problem with deleting non-reference-counted 
shared state I believe.

I don't think you strictly need the barrier; you can still use a 
make-before-break update. One downside of the generation approach is that nodes 
that try to fetch the state using the index will get NULL, so the only option 
is to drop, as opposed to whatever the make-before-break change determined. Mind 
you, this is probably fine for most practical purposes. Again, if we're talking 
SAs, then at this point the SA is decoupled from the graph (i.e. it's no longer 
protecting the tunnel or it's not linked to a policy in the SPD), so drop is 
all we can do anyway.
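
i.e. something along the lines of this (hypothetical) make-before-break swap, 
where the old state stays valid for any packet that already fetched it:

#include <vppinfra/types.h>   /* assumption: vppinfra is on the path */

typedef struct
{
  u32 action;                 /* stand-in for whatever the state tells a node to do */
} my_state_t;

/* The pointer the data-plane nodes dereference. */
static my_state_t *current_state;

static void
state_replace (my_state_t *new_state)
{
  /* 1. "make": atomically publish the replacement; workers see either the
   *    old or the new state, never a torn one, so no barrier is required. */
  my_state_t *old = __atomic_exchange_n (&current_state, new_state,
                                         __ATOMIC_ACQ_REL);

  /* 2. "break": only once no in-flight packet can still hold 'old' (via the
   *    generation bump or the time-based free-list discussed above) is it
   *    actually recycled. */
  (void) old;                 /* e.g. hand 'old' to a deferred-free mechanism here */
}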

    When you mentioned packet counters, that's really a reference count I 
guess. The trade-off here seems to me to be 2 cache-line invalidates per packet 
(once on ingress, once on egress) for the counter versus a barrier hit (all 
packet processing stops) per delete of the state. For the setup on which you 
measured the packet counter solution, how long does it take from the barrier 
sync request to release (i.e., how long is the system not processing packets)?

As an example, on the basic test setup I used to measure the increase in clock 
cycles for adj counters, here's the time taken for the CLI to execute the 
addition of two ipsec tunnels:
       3.786220665: cli-cmd: create ipsec tunnel
       3.786540648: cli-cmd: create ipsec tunnel OK
       3.786544389: cli-cmd: create ipsec tunnel
       3.786577392: cli-cmd: create ipsec tunnel OK

(collected with 'elog trace cli' and 'sh event-logger')

I see it as a trade-off between a cost for every packet forwarded versus how 
many packets may be dropped during API calls. I wouldn't want the scheme 
employed to ensure safe delete to affect the overall packet throughput - most 
of the time I'm not changing the state...

Now that we have a few potential schemes in mind: IIRC your focus was on the 
deletion of SAs. Can you remind me again what additional state you had 
associated with the SA that you needed to deal with?


/neale


    Thanks,
    Chris.

    > On Apr 15, 2020, at 5:38 AM, Neale Ranns (nranns) <nra...@cisco.com> 
wrote:
    > 
    > 
    > Hi Chris,
    > 
    > Firstly, apologies for the lengthy delay. 
    > 
    > When I say 'state' in the following I'm referring to some object[s] that 
are used to forward packets. 
    > 
    > I'd classify the possible solution space as:
    > 1) maintain per-packet counters for the state to indicate how many 
packets currently refer to that state.
    >     Pros; we know exactly when the state is no longer required and can be 
safely removed.
    >     Cons; significant per-packet cost, similar to maintaining counters. 
For reference, on my [aging] system enabling adjacency counters takes 
ip4-rewrite from 2.52e1 to 3.49e1 clocks. The wait times could be large 
(equivalent to flushing queues). 
    > 2) flush queues; ensure that there are no packets in flight, anywhere, 
when the workers stop at the barrier.
    >    Pros; It's certainly safe to delete state under these conditions.
    >    Cons; for handoff this could be known, though the wait time would be 
long. For async crypto HW this may not be knowable and if it is the wait times 
would be large. Either way we may end up waiting for a worst-case scenario, 
which is way longer than actually needed.
    > 3) epochs; maintain a global epoch; each time an API is called, the epoch 
is bumped. Packets entering the system get stamped with the current epoch. If a 
node sees a packet whose epoch does not match the global one, it is dropped.
    >    Pros: simple scheme, low/negligible DP cost.
    >    Cons: all inflight packets would be dropped on all API calls, not just 
the packets that would use the state that is being deleted.
    > 4) MP safe: remove the state with the workers unblocked. This is a 
multi-stage process. Firstly, unlink the state from the lookup data-structures 
so no more packets can find it. Secondly, 'atomically' update the state so that 
packets using it still perform a consistent action (probably drop). Thirdly, 
don't reuse that state (i.e. recycle its pool index) until all the inflight 
packets pass through the system (mis-forwarding must be avoided). 
Make-before-break, if that term means anything to you :)
    >    Pros; MP safe is always good, since there are fewer packet drops. Zero 
per-packet DP cost. 
    >    Cons; it's not easy to get right nor test.
    > 
    > IMHO the drawbacks of options 1, 2 & 3 rule them out, which leaves us 
only 4.
    > 
    > For option 4, the first and second steps are very much dependent on the 
type of state we're talking about. For SAs for example, unlinking the SA from 
the lookup data-structure is accomplished using a separate API from the SA 
delete*. The final step we can easily accomplish with a new version of the pool 
allocator whose free-list prevents reuse for say 5 seconds (an age in DP terms).
    > 
    > Thoughts?
    > 
    > /neale
    > 
    > * I note that a SA delete is already (optimistically) marked MP safe, 
which assumes the system flushes in between these API calls.
    > 
    > 
    > 
    > 
    > On 26/03/2020 16:09, "Christian Hopps" <cho...@chopps.org> wrote:
    > 
    > 
    > 
    >> On Mar 25, 2020, at 1:39 PM, Dave Barach via Lists.Fd.Io 
<dbarach=cisco....@lists.fd.io> wrote:
    >> 
    >> vlib_main_t *vm->main_loop_count.
    >> 
    >> One trip around the main loop accounts for all per-worker local graph 
edges / acyclic graph behaviors. 
    >> 
    >> As to the magic number E (not to be confused with e): repeatedly handing 
off packets from thread to thread seems like a bad implementation strategy. The 
packet tracer will tell you how many handoffs are involved in a certain path, 
as will a bit of code inspection.
    > 
    >    No, it would not be a good implementation strategy. :)
    > 
    >    However, I was looking at trying to code this in an upstreamable way, 
and I didn't think I got to make assumptions about how others might wire things 
together. I suppose we could just define a maximum number of handoffs and then 
if users violated that number they would need to increase it?
    > 
    >> Neale has some experience with this scenario, maybe he can share some 
thoughts...
    > 
    >    Hoping so. :)
    > 
    >    I noticed that crypto engine handoffs were added to the non-dpdk ipsec 
encrypt/decrypt in master, which seems somewhat relevant.
    > 
    >    Thanks,
    >    Chris.
    > 

