On 04/15/2017 05:00, Michael S. Tsirkin wrote:
On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:
On 04/12/2017 16:03, Jason Wang wrote:
On 04/07/2017 13:49, Michael S. Tsirkin wrote:
A known weakness in the ptr_ring design is that it does not handle well the
situation when the ring is almost full: as entries are consumed they are
immediately reused by the producer, so consumer and producer end up writing
to a shared cache line.

To fix this, add batching to consume calls: as entries are consumed, do not
write NULL into the ring until we get a multiple (2x in the current
implementation) of cache lines away from the producer. At that point, write
them all out. We do the write-out in reverse order to keep the producer from
sharing a cache line with the consumer for as long as possible.

Write-out also triggers when the ring wraps around; there is no special
reason to do this, but it helps keep the code a bit simpler.
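To make the scheme above concrete, here is a minimal sketch of a batched
consumer. It is not the actual include/linux/ptr_ring.h code: the struct
layout and names (queue, batch, producer, consumer_head, consumer_tail) are
assumptions for illustration, and locking and memory barriers are omitted.

/*
 * Illustrative sketch only, not the real ptr_ring code: consumed
 * slots are not handed back (NULLed) one by one, but in batches,
 * and in reverse order, once the consumer is a batch ahead of the
 * slots it still holds or the ring wraps.
 */
struct ring {
	void **queue;		/* a slot is NULL when free for the producer */
	int size;
	int batch;		/* roughly 2 cache lines' worth of slots */
	int producer;
	int consumer_head;	/* next slot to consume */
	int consumer_tail;	/* oldest slot not yet returned to the producer */
};

static void *ring_consume(struct ring *r)
{
	void *ptr = r->queue[r->consumer_head];

	if (!ptr)
		return NULL;	/* ring is empty */

	/* Advance, but keep the slot: the producer still sees it as used. */
	if (++r->consumer_head >= r->size)
		r->consumer_head = 0;

	/*
	 * Hand slots back only once we are a full batch away from the
	 * slots we still own, or when the ring wraps.  Writing the NULLs
	 * in reverse order keeps the producer off the cache line the
	 * consumer will touch next for as long as possible.
	 */
	if (r->consumer_head - r->consumer_tail >= r->batch ||
	    r->consumer_head == 0) {
		int head = r->consumer_head ? r->consumer_head - 1 : r->size - 1;

		while (head != r->consumer_tail) {
			r->queue[head] = NULL;
			if (--head < 0)
				head = r->size - 1;
		}
		r->queue[r->consumer_tail] = NULL;
		r->consumer_tail = r->consumer_head;
	}
	return ptr;
}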
What should we do if getting away from the producer by 2 cache lines
would mean we keep the ring more than half empty?
Maybe we should reduce the batching in this case;
the current patch simply reduces the batching.
Notes:
- it is no longer true that a call to consume guarantees
that the following call to produce will succeed.
No users seem to assume that.
- batching can also in theory reduce the signalling rate:
users that would previously send interrupts to the producer
to wake it up after consuming each entry would now only
need to do this once per batch.
Doing this would be easy by returning a flag to the caller.
No users do signalling on consume yet, so this has not been
implemented.
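As a purely hypothetical illustration of the "flag to the caller" idea
(nothing like this is in the patch), a wrapper around the sketch above could
report when a write-out just happened, so callers only signal the producer
once per batch instead of per entry:

/*
 * Hypothetical only - not part of the patch.  In the sketch above,
 * consumer_tail catches up with consumer_head exactly when a batch
 * of slots has just been handed back to the producer.
 */
static void *ring_consume_signal(struct ring *r, bool *kick_producer)
{
	void *ptr = ring_consume(r);

	/* Only ask the caller to wake the producer at batch boundaries. */
	*kick_producer = ptr && r->consumer_head == r->consumer_tail;
	return ptr;
}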
Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---
Jason, I am curious whether the following gives you some of
the performance boost that you see with vhost batching
patches. Is vhost batching on top still helpful?
The patch looks good to me; I will test it with the vhost batching patches.
Thanks
Still helpful:
Before this patch: 1.84Mpps
With this patch: 2.00Mpps
With batch dequeuing: 2.30Mpps
Acked-by: Jason Wang <jasow...@redhat.com>
Thanks
Fascinating. How do we explain the gain with batch dequeue?
I counted the drop rate (pktgen on tun, counting tun TX) and maybe it can
more or less explain it:
Before this patch:     TX xmit 1.80Mpps, TX dropped 0.23Mpps, TX total 2.04Mpps (11% dropped)
After this patch:      TX xmit 1.95Mpps, TX dropped 0.33Mpps, TX total 2.28Mpps (14% dropped)
With batched dequeuing: TX xmit 2.24Mpps, TX dropped 0.01Mpps, TX total 2.26Mpps (~0% dropped)
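(The drop percentages here are simply TX dropped / TX total, e.g.
0.23 / 2.04 ≈ 11%, 0.33 / 2.28 ≈ 14%, and 0.01 / 2.26 ≈ 0.4%, i.e. effectively zero.)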
With this patch, we more or less remove the cache contention by blocking the
producer. With batch dequeuing, the ring is not full in 99% of cases, which
probably means the producer is not blocked most of the time.
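For the "blocking the producer" point, a companion sketch (same hypothetical
struct as above, not the real ptr_ring code) shows why holding on to consumed
slots makes the ring look full on the producer side:

/*
 * Companion to the consumer sketch above (illustrative only).  The
 * producer may only write into slots the consumer has NULLed, so slots
 * that have not yet been handed back make the ring appear full and the
 * producer fails/backs off - the "blocking" referred to here.
 */
static int ring_produce(struct ring *r, void *ptr)
{
	if (r->queue[r->producer])
		return -ENOSPC;		/* slot not released yet: looks full */

	r->queue[r->producer] = ptr;
	if (++r->producer >= r->size)
		r->producer = 0;
	return 0;
}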
Is it just the lock overhead?
I removed the spinlocks for peeking and dequeuing on top of this patch.
The TX pps increased from ~2Mpps to ~2.08Mpps, so it was not only
the lock overhead.
Can you please try to replace
the lock with a simple non-fair atomic and see what happens?
Not sure I get the idea; we are almost certainly hitting the spinlock
fast path anyway, which is just a cmpxchg().
Thanks
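For reference, the "simple non-fair atomic" being suggested would presumably
look something like a bare test-and-set built from cmpxchg (illustrative only,
not taken from the patch or from vhost/tun), as opposed to the kernel's fair,
queued spinlock:

#include <linux/atomic.h>

/*
 * Illustrative non-fair lock: whoever wins the cmpxchg enters, with no
 * queueing or FIFO ordering between waiters.
 */
static atomic_t consumer_busy = ATOMIC_INIT(0);

static inline void consumer_enter(void)
{
	/* Spin until we transition 0 -> 1. */
	while (atomic_cmpxchg_acquire(&consumer_busy, 0, 1) != 0)
		cpu_relax();
}

static inline void consumer_exit(void)
{
	atomic_set_release(&consumer_busy, 0);
}

On the uncontended fast path this is essentially the same single cmpxchg as a
spinlock, which is Jason's point above; any difference would only show up
under contention.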