On 04/15/2017 05:00, Michael S. Tsirkin wrote:
On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:
On 04/12/2017 16:03, Jason Wang wrote:
On 04/07/2017 13:49, Michael S. Tsirkin wrote:
A known weakness in the ptr_ring design is that it does not handle well the
situation when the ring is almost full: as entries are consumed they are
immediately reused by the producer, so consumer and producer end up writing
to a shared cache line.

To fix this, add batching to consume calls: as entries are consumed, do not
write NULL into the ring until we get a multiple (2x in the current
implementation) of cache lines away from the producer. At that point, write
them all out. We do the write-out in reverse order to keep the producer from
sharing a cache line with the consumer for as long as possible.

Write-out also triggers when the ring wraps around; there is no special
reason to do this, but it helps keep the code a bit simpler.
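To make the scheme above concrete, here is a minimal sketch of a batched
consumer. It is not the actual include/linux/ptr_ring.h code: the struct
layout and names (queue, batch, producer, consumer_head, consumer_tail) are
assumptions for illustration, and locking and memory barriers are omitted.

/*
 * Illustrative sketch only, not the real ptr_ring code: consumed
 * slots are not handed back (NULLed) one by one, but in batches,
 * and in reverse order, once the consumer is a batch ahead of the
 * slots it still holds or the ring wraps.
 */
struct ring {
	void **queue;		/* a slot is NULL when free for the producer */
	int size;
	int batch;		/* roughly 2 cache lines' worth of slots */
	int producer;
	int consumer_head;	/* next slot to consume */
	int consumer_tail;	/* oldest slot not yet returned to the producer */
};

static void *ring_consume(struct ring *r)
{
	void *ptr = r->queue[r->consumer_head];

	if (!ptr)
		return NULL;	/* ring is empty */

	/* Advance, but keep the slot: the producer still sees it as used. */
	if (++r->consumer_head >= r->size)
		r->consumer_head = 0;

	/*
	 * Hand slots back only once we are a full batch away from the
	 * slots we still own, or when the ring wraps.  Writing the NULLs
	 * in reverse order keeps the producer off the cache line the
	 * consumer will touch next for as long as possible.
	 */
	if (r->consumer_head - r->consumer_tail >= r->batch ||
	    r->consumer_head == 0) {
		int head = r->consumer_head ? r->consumer_head - 1 : r->size - 1;

		while (head != r->consumer_tail) {
			r->queue[head] = NULL;
			if (--head < 0)
				head = r->size - 1;
		}
		r->queue[r->consumer_tail] = NULL;
		r->consumer_tail = r->consumer_head;
	}
	return ptr;
}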
What should we do if getting away from the producer by 2 cache lines
would mean we keep the ring more than half empty?
Maybe we should reduce the batching in this case;
the current patch simply reduces the batching.
Notes:
- it is no longer true that a call to consume guarantees
that the following call to produce will succeed.
No users seem to assume that.
- batching can also in theory reduce the signalling rate:
users that would previously send interrupts to the producer
to wake it up after consuming each entry would now only
need to do this once per batch.
Doing this would be easy by returning a flag to the caller.
No users do signalling on consume yet, so this has not been
implemented.
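As a purely hypothetical illustration of the "flag to the caller" idea
(nothing like this is in the patch), a wrapper around the sketch above could
report when a write-out just happened, so callers only signal the producer
once per batch instead of per entry:

/*
 * Hypothetical only - not part of the patch.  In the sketch above,
 * consumer_tail catches up with consumer_head exactly when a batch
 * of slots has just been handed back to the producer.
 */
static void *ring_consume_signal(struct ring *r, bool *kick_producer)
{
	void *ptr = ring_consume(r);

	/* Only ask the caller to wake the producer at batch boundaries. */
	*kick_producer = ptr && r->consumer_head == r->consumer_tail;
	return ptr;
}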
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
Jason, I am curious whether the following gives you some of
the performance boost that you see with vhost batching
patches. Is vhost batching on top still helpful?
The patch looks good to me; I will test it with the vhost batching patches.
Thanks
Still helpful:
Before this patch: 1.84Mpps
With this patch: 2.00Mpps
With batch dequeuing: 2.30Mpps
Acked-by: Jason Wang <jasow...@redhat.com>
Thanks
Fascinating. How do we explain the gain with batch dequeue?
I counted the drop rate (pktgen on tun, counting tun TX) and maybe it can
more or less explain it:
Before this patch:     TX xmit 1.80Mpps, TX dropped 0.23Mpps, TX total 2.04Mpps (11% dropped)
After this patch:      TX xmit 1.95Mpps, TX dropped 0.33Mpps, TX total 2.28Mpps (14% dropped)
With batched dequeuing: TX xmit 2.24Mpps, TX dropped 0.01Mpps, TX total 2.26Mpps (~0% dropped)
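(The drop percentages here are simply TX dropped / TX total, e.g.
0.23 / 2.04 ≈ 11%, 0.33 / 2.28 ≈ 14%, and 0.01 / 2.26 ≈ 0.4%, i.e. effectively zero.)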
With this patch, we more or less remove the cache contention by blocking the
producer. With batch dequeuing, the ring is not full in 99% of cases, which
probably means the producer is not blocked most of the time.
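For the "blocking the producer" point, a companion sketch (same hypothetical
struct as above, not the real ptr_ring code) shows why holding on to consumed
slots makes the ring look full on the producer side:

/*
 * Companion to the consumer sketch above (illustrative only).  The
 * producer may only write into slots the consumer has NULLed, so slots
 * that have not yet been handed back make the ring appear full and the
 * producer fails/backs off - the "blocking" referred to here.
 */
static int ring_produce(struct ring *r, void *ptr)
{
	if (r->queue[r->producer])
		return -ENOSPC;		/* slot not released yet: looks full */

	r->queue[r->producer] = ptr;
	if (++r->producer >= r->size)
		r->producer = 0;
	return 0;
}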
Is it just the lock overhead?
I removed the spinlocks for peeking and dequeuing on top of this patch.
The TX pps increased from ~2Mpps to ~2.08Mpps, so it was not only
the lock overhead.
Can you please try to replace
the lock with a simple non-fair atomic and see what happens?
Not sure I get the idea; we are almost certainly hitting the spinlock
fast path anyway, which is just a cmpxchg().
Thanks
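For reference, the "simple non-fair atomic" being suggested would presumably
look something like a bare test-and-set built from cmpxchg (illustrative only,
not taken from the patch or from vhost/tun), as opposed to the kernel's fair,
queued spinlock:

#include <linux/atomic.h>

/*
 * Illustrative non-fair lock: whoever wins the cmpxchg enters, with no
 * queueing or FIFO ordering between waiters.
 */
static atomic_t consumer_busy = ATOMIC_INIT(0);

static inline void consumer_enter(void)
{
	/* Spin until we transition 0 -> 1. */
	while (atomic_cmpxchg_acquire(&consumer_busy, 0, 1) != 0)
		cpu_relax();
}

static inline void consumer_exit(void)
{
	atomic_set_release(&consumer_busy, 0);
}

On the uncontended fast path this is essentially the same single cmpxchg as a
spinlock, which is Jason's point above; any difference would only show up
under contention.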