Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume

2017-04-25 Thread Jason Wang



On 2017年04月25日 23:35, Michael S. Tsirkin wrote:

On Tue, Apr 25, 2017 at 12:07:01PM +0800, Jason Wang wrote:


On 2017年04月24日 20:00, Michael S. Tsirkin wrote:

On Mon, Apr 24, 2017 at 07:54:18PM +0800, Jason Wang wrote:

On 2017年04月24日 07:28, Michael S. Tsirkin wrote:

On Tue, Apr 18, 2017 at 11:07:42AM +0800, Jason Wang wrote:

On 2017年04月17日 07:19, Michael S. Tsirkin wrote:

Applications that consume a batch of entries in one go
can benefit from ability to return some of them back
into the ring.

Add an API for that - assuming there's space. If there's no space
naturally we can't do this and have to drop entries, but this implies
ring is full so we'd likely drop some anyway.

Signed-off-by: Michael S. Tsirkin
---

Jason, in my mind the biggest issue with your batching patchset is the
backet drops on disconnect.  This API will help avoid that in the common
case.

Ok, I will rebase the series on top of this. (Though I don't think we care about the packet loss.)

E.g. I care - I often start sending packets to VM before it's
fully booted. Several vhost resets might follow.

Ok.


I would still prefer that we understand what's going on,

I tried to reply in another thread; does it make sense?


 and I would
like to know what's the smallest batch size that's still helpful,

Yes, I've replied in another thread, the result is:


no batching   1.88Mpps
RX_BATCH=1    1.93Mpps
RX_BATCH=4    2.11Mpps
RX_BATCH=16   2.14Mpps
RX_BATCH=64   2.25Mpps
RX_BATCH=256  2.18Mpps

Essentially 4 is enough, other stuff looks more like noise
to me. What about 2?

The numbers are pretty stable, so probably not noise. Retested on top of
batch zeroing:

no  1.97Mpps
1   2.09Mpps
2   2.11Mpps
4   2.16Mpps
8   2.19Mpps
16  2.21Mpps
32  2.25Mpps
64  2.30Mpps
128 2.21Mpps
256 2.21Mpps

64 performs best.

Thanks

OK but it might be e.g. a function of the ring size, host cache size or
whatever. As we don't really understand the why, if we just optimize for
your setup we risk regressions in others.  64 entries is a lot, it
increases the queue size noticeably.  Could this be part of the effect?
Could you try changing the queue size to see what happens?

I increased tx_queue_len to 1100, but only see less than a 1% improvement in the
pps number (batch = 1) on my machine. If you care about the regression, we could
probably leave the choice to the user through e.g. a module parameter, but I'm
afraid we already have too many choices for them. Or I can test this with
different CPU types.

Thanks


I agree here. Let's keep it a constant. Testing on more machines would
be nice but not strictly required.


Ok, I will give a full benchmark (batch=1,4,64) on TCP stream to see how 
it will perform. Let's decide then.



  I just dislike not understanding why
it helps because it means we can easily break it by mistake.  So my only
request really is that you wrap access to this internal buffer in an
API. Let's see - I think we need

struct vhost_net_buf
vhost_net_buf_get_ptr
vhost_net_buf_get_size
vhost_net_buf_is_empty
vhost_net_buf_peek
vhost_net_buf_consume
vhost_net_buf_produce


Ok. Will do in next version.

Thanks.
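A minimal sketch of what such a vhost_net_buf wrapper could look like (the struct
layout, the VHOST_RX_BATCH constant and the use of ptr_ring_consume_batched from
the consumer batching series are assumptions; only the function names come from
the list above):

#include <linux/ptr_ring.h>

#define VHOST_RX_BATCH 64	/* assumed constant, based on the numbers above */

struct vhost_net_buf {
	void *queue[VHOST_RX_BATCH];	/* entries pulled from the ring */
	int tail;			/* number of valid entries */
	int head;			/* next entry to hand out */
};

static void **vhost_net_buf_get_ptr(struct vhost_net_buf *rxq)
{
	return rxq->queue + rxq->head;
}

static int vhost_net_buf_get_size(struct vhost_net_buf *rxq)
{
	return rxq->tail - rxq->head;
}

static int vhost_net_buf_is_empty(struct vhost_net_buf *rxq)
{
	return rxq->tail == rxq->head;
}

static void *vhost_net_buf_peek(struct vhost_net_buf *rxq)
{
	return vhost_net_buf_is_empty(rxq) ? NULL : rxq->queue[rxq->head];
}

static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
{
	return vhost_net_buf_is_empty(rxq) ? NULL : rxq->queue[rxq->head++];
}

static int vhost_net_buf_produce(struct vhost_net_buf *rxq, struct ptr_ring *ring)
{
	/* Refill the buffer with one batch taken from the ring. */
	rxq->head = 0;
	rxq->tail = ptr_ring_consume_batched(ring, rxq->queue, VHOST_RX_BATCH);
	return rxq->tail;
}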



Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume

2017-04-25 Thread Michael S. Tsirkin
On Tue, Apr 25, 2017 at 12:07:01PM +0800, Jason Wang wrote:
> 
> 
> On 2017年04月24日 20:00, Michael S. Tsirkin wrote:
> > On Mon, Apr 24, 2017 at 07:54:18PM +0800, Jason Wang wrote:
> > > On 2017年04月24日 07:28, Michael S. Tsirkin wrote:
> > > > On Tue, Apr 18, 2017 at 11:07:42AM +0800, Jason Wang wrote:
> > > > > On 2017年04月17日 07:19, Michael S. Tsirkin wrote:
> > > > > > Applications that consume a batch of entries in one go
> > > > > > can benefit from ability to return some of them back
> > > > > > into the ring.
> > > > > > 
> > > > > > Add an API for that - assuming there's space. If there's no space
> > > > > > naturally we can't do this and have to drop entries, but this 
> > > > > > implies
> > > > > > ring is full so we'd likely drop some anyway.
> > > > > > 
> > > > > > Signed-off-by: Michael S. Tsirkin
> > > > > > ---
> > > > > > 
> > > > > > Jason, in my mind the biggest issue with your batching patchset is 
> > > > > > the
> > > > > > backet drops on disconnect.  This API will help avoid that in the 
> > > > > > common
> > > > > > case.
> > > > > Ok, I will rebase the series on top of this. (Though I don't think we 
> > > > > care
> > > > > the packet loss).
> > > > E.g. I care - I often start sending packets to VM before it's
> > > > fully booted. Several vhost resets might follow.
> > > Ok.
> > > 
> > > > > > I would still prefer that we understand what's going on,
> > > > > I try to reply in another thread, does it make sense?
> > > > > 
> > > > > > and I would
> > > > > > like to know what's the smallest batch size that's still helpful,
> > > > > Yes, I've replied in another thread, the result is:
> > > > > 
> > > > > 
> > > > > no batching   1.88Mpps
> > > > > RX_BATCH=1    1.93Mpps
> > > > > RX_BATCH=4    2.11Mpps
> > > > > RX_BATCH=16   2.14Mpps
> > > > > RX_BATCH=64   2.25Mpps
> > > > > RX_BATCH=256  2.18Mpps
> > > > Essentially 4 is enough, other stuff looks more like noise
> > > > to me. What about 2?
> > > The numbers are pretty stable, so probably not noise. Retested on top of
> > > batch zeroing:
> > > 
> > > no  1.97Mpps
> > > 1   2.09Mpps
> > > 2   2.11Mpps
> > > 4   2.16Mpps
> > > 8   2.19Mpps
> > > 16  2.21Mpps
> > > 32  2.25Mpps
> > > 64  2.30Mpps
> > > 128 2.21Mpps
> > > 256 2.21Mpps
> > > 
> > > 64 performs best.
> > > 
> > > Thanks
> > OK but it might be e.g. a function of the ring size, host cache size or
> > whatever. As we don't really understand the why, if we just optimize for
> > your setup we risk regressions in others.  64 entries is a lot, it
> > increases the queue size noticeably.  Could this be part of the effect?
> > Could you try changing the queue size to see what happens?
> 
> I increase tx_queue_len to 1100, but only see less than 1% improvement on
> pps number (batch = 1) in my machine. If you care about the regression, we
> probably can leave the choice to user through e.g module parameter. But I'm
> afraid we have already had too much choices for them. Or I can test this
> with different CPU types.
> 
> Thanks
> 

I agree here. Let's keep it a constant. Testing on more machines would
be nice but not strictly required.  I just dislike not understanding why
it helps because it means we can easily break it by mistake.  So my only
request really is that you wrap access to this internal buffer in an
API. Let's see - I think we need

struct vhost_net_buf
vhost_net_buf_get_ptr
vhost_net_buf_get_size
vhost_net_buf_is_empty
vhost_net_buf_peek
vhost_net_buf_consume
vhost_net_buf_produce



-- 
MST


Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume

2017-04-24 Thread Jason Wang



On 2017年04月24日 20:00, Michael S. Tsirkin wrote:

On Mon, Apr 24, 2017 at 07:54:18PM +0800, Jason Wang wrote:

On 2017年04月24日 07:28, Michael S. Tsirkin wrote:

On Tue, Apr 18, 2017 at 11:07:42AM +0800, Jason Wang wrote:

On 2017年04月17日 07:19, Michael S. Tsirkin wrote:

Applications that consume a batch of entries in one go
can benefit from ability to return some of them back
into the ring.

Add an API for that - assuming there's space. If there's no space
naturally we can't do this and have to drop entries, but this implies
ring is full so we'd likely drop some anyway.

Signed-off-by: Michael S. Tsirkin
---

Jason, in my mind the biggest issue with your batching patchset is the
backet drops on disconnect.  This API will help avoid that in the common
case.

Ok, I will rebase the series on top of this. (Though I don't think we care about the packet loss.)

E.g. I care - I often start sending packets to VM before it's
fully booted. Several vhost resets might follow.

Ok.


I would still prefer that we understand what's going on,

I tried to reply in another thread; does it make sense?


and I would
like to know what's the smallest batch size that's still helpful,

Yes, I've replied in another thread, the result is:


no batching   1.88Mpps
RX_BATCH=1    1.93Mpps
RX_BATCH=4    2.11Mpps
RX_BATCH=16   2.14Mpps
RX_BATCH=64   2.25Mpps
RX_BATCH=256  2.18Mpps

Essentially 4 is enough, other stuff looks more like noise
to me. What about 2?

The numbers are pretty stable, so probably not noise. Retested on top of
batch zeroing:

no  1.97Mpps
1   2.09Mpps
2   2.11Mpps
4   2.16Mpps
8   2.19Mpps
16  2.21Mpps
32  2.25Mpps
64  2.30Mpps
128 2.21Mpps
256 2.21Mpps

64 performs best.

Thanks

OK but it might be e.g. a function of the ring size, host cache size or
whatever. As we don't really understand the why, if we just optimize for
your setup we risk regressions in others.  64 entries is a lot, it
increases the queue size noticeably.  Could this be part of the effect?
Could you try changing the queue size to see what happens?


I increased tx_queue_len to 1100, but only see less than a 1% improvement in
the pps number (batch = 1) on my machine. If you care about the regression,
we could probably leave the choice to the user through e.g. a module parameter,
but I'm afraid we already have too many choices for them. Or I can test this
with different CPU types.


Thanks
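As for the module parameter mentioned above, a sketch of what such a knob could
look like (hypothetical name; not part of the posted patches, and the thread
settles on a constant instead):

#include <linux/moduleparam.h>

/* Hypothetical tunable exposing the batch size. */
static int rx_batched = 64;
module_param(rx_batched, int, 0444);
MODULE_PARM_DESC(rx_batched, "Number of ring entries consumed per batch");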







Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume

2017-04-24 Thread Michael S. Tsirkin
On Mon, Apr 24, 2017 at 07:54:18PM +0800, Jason Wang wrote:
> 
> 
> On 2017年04月24日 07:28, Michael S. Tsirkin wrote:
> > On Tue, Apr 18, 2017 at 11:07:42AM +0800, Jason Wang wrote:
> > > 
> > > On 2017年04月17日 07:19, Michael S. Tsirkin wrote:
> > > > Applications that consume a batch of entries in one go
> > > > can benefit from ability to return some of them back
> > > > into the ring.
> > > > 
> > > > Add an API for that - assuming there's space. If there's no space
> > > > naturally we can't do this and have to drop entries, but this implies
> > > > ring is full so we'd likely drop some anyway.
> > > > 
> > > > Signed-off-by: Michael S. Tsirkin 
> > > > ---
> > > > 
> > > > Jason, in my mind the biggest issue with your batching patchset is the
> > > > backet drops on disconnect.  This API will help avoid that in the common
> > > > case.
> > > Ok, I will rebase the series on top of this. (Though I don't think we care
> > > the packet loss).
> > E.g. I care - I often start sending packets to VM before it's
> > fully booted. Several vhost resets might follow.
> 
> Ok.
> 
> > 
> > > > I would still prefer that we understand what's going on,
> > > I try to reply in another thread, does it make sense?
> > > 
> > > >and I would
> > > > like to know what's the smallest batch size that's still helpful,
> > > Yes, I've replied in another thread, the result is:
> > > 
> > > 
> > > no batching   1.88Mpps
> > > RX_BATCH=1    1.93Mpps
> > > RX_BATCH=4    2.11Mpps
> > > RX_BATCH=16   2.14Mpps
> > > RX_BATCH=64   2.25Mpps
> > > RX_BATCH=256  2.18Mpps
> > Essentially 4 is enough, other stuff looks more like noise
> > to me. What about 2?
> 
> The numbers are pretty stable, so probably not noise. Retested on top of
> batch zeroing:
> 
> no  1.97Mpps
> 1   2.09Mpps
> 2   2.11Mpps
> 4   2.16Mpps
> 8   2.19Mpps
> 16  2.21Mpps
> 32  2.25Mpps
> 64  2.30Mpps
> 128 2.21Mpps
> 256 2.21Mpps
> 
> 64 performs best.
> 
> Thanks

OK but it might be e.g. a function of the ring size, host cache size or
whatever. As we don't really understand the why, if we just optimize for
your setup we risk regressions in others.  64 entries is a lot, it
increases the queue size noticeably.  Could this be part of the effect?
Could you try changing the queue size to see what happens?

> > 
> > > >but
> > > > I'm not going to block the patch on these grounds assuming packet drops
> > > > are fixed.
> > > Thanks a lot.
> > > 
> > > > Lightly tested - this is on top of consumer batching patches.
> > > > 
> > > > Thanks!
> > > > 
> > > >include/linux/ptr_ring.h | 57 
> > > > 
> > > >1 file changed, 57 insertions(+)
> > > > 
> > > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > > > index 783e7f5..5fbeab4 100644
> > > > --- a/include/linux/ptr_ring.h
> > > > +++ b/include/linux/ptr_ring.h
> > > > @@ -457,6 +457,63 @@ static inline int ptr_ring_init(struct ptr_ring 
> > > > *r, int size, gfp_t gfp)
> > > > return 0;
> > > >}
> > > > +/*
> > > > + * Return entries into ring. Destroy entries that don't fit.
> > > > + *
> > > > + * Note: this is expected to be a rare slow path operation.
> > > > + *
> > > > + * Note: producer lock is nested within consumer lock, so if you
> > > > + * resize you must make sure all uses nest correctly.
> > > > + * In particular if you consume ring in interrupt or BH context, you 
> > > > must
> > > > + * disable interrupts/BH when doing so.
> > > > + */
> > > > +static inline void ptr_ring_unconsume(struct ptr_ring *r, void 
> > > > **batch, int n,
> > > > + void (*destroy)(void *))
> > > > +{
> > > > +   unsigned long flags;
> > > > +   int head;
> > > > +
> > > > +   spin_lock_irqsave(&(r)->consumer_lock, flags);
> > > > +   spin_lock(&(r)->producer_lock);
> > > > +
> > > > +   if (!r->size)
> > > > +   goto done;
> > > > +
> > > > +   /*
> > > > +* Clean out buffered entries (for simplicity). This way 
> > > > following code
> > > > +* can test entries for NULL and if not assume they are valid.
> > > > +*/
> > > > +   head = r->consumer_head - 1;
> > > > +   while (likely(head >= r->consumer_tail))
> > > > +   r->queue[head--] = NULL;
> > > > +   r->consumer_tail = r->consumer_head;
> > > > +
> > > > +   /*
> > > > +* Go over entries in batch, start moving head back and copy 
> > > > entries.
> > > > +* Stop when we run into previously unconsumed entries.
> > > > +*/
> > > > +   while (n--) {
> > > > +   head = r->consumer_head - 1;
> > > > +   if (head < 0)
> > > > +   head = r->size - 1;
> > > > +   if (r->queue[head]) {
> > > > +   /* This batch entry will have to be destroyed. 
> > > > */
> > > > +   ++n;
> > > > +   goto done;
> > > > +

Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume

2017-04-24 Thread Jason Wang



On 2017年04月24日 07:28, Michael S. Tsirkin wrote:

On Tue, Apr 18, 2017 at 11:07:42AM +0800, Jason Wang wrote:


On 2017年04月17日 07:19, Michael S. Tsirkin wrote:

Applications that consume a batch of entries in one go
can benefit from ability to return some of them back
into the ring.

Add an API for that - assuming there's space. If there's no space
naturally we can't do this and have to drop entries, but this implies
ring is full so we'd likely drop some anyway.

Signed-off-by: Michael S. Tsirkin 
---

Jason, in my mind the biggest issue with your batching patchset is the
backet drops on disconnect.  This API will help avoid that in the common
case.

Ok, I will rebase the series on top of this. (Though I don't think we care about the packet loss.)

E.g. I care - I often start sending packets to VM before it's
fully booted. Several vhost resets might follow.


Ok.




I would still prefer that we understand what's going on,

I tried to reply in another thread; does it make sense?


   and I would
like to know what's the smallest batch size that's still helpful,

Yes, I've replied in another thread, the result is:


no batching   1.88Mpps
RX_BATCH=1    1.93Mpps
RX_BATCH=4    2.11Mpps
RX_BATCH=16   2.14Mpps
RX_BATCH=64   2.25Mpps
RX_BATCH=256  2.18Mpps

Essentially 4 is enough, other stuff looks more like noise
to me. What about 2?


The numbers are pretty stable, so probably not noise. Retested on top of 
batch zeroing:


no  1.97Mpps
1   2.09Mpps
2   2.11Mpps
4   2.16Mpps
8   2.19Mpps
16  2.21Mpps
32  2.25Mpps
64  2.30Mpps
128 2.21Mpps
256 2.21Mpps

64 performs best.

Thanks




   but
I'm not going to block the patch on these grounds assuming packet drops
are fixed.

Thanks a lot.


Lightly tested - this is on top of consumer batching patches.

Thanks!

   include/linux/ptr_ring.h | 57 

   1 file changed, 57 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 783e7f5..5fbeab4 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -457,6 +457,63 @@ static inline int ptr_ring_init(struct ptr_ring *r, int 
size, gfp_t gfp)
return 0;
   }
+/*
+ * Return entries into ring. Destroy entries that don't fit.
+ *
+ * Note: this is expected to be a rare slow path operation.
+ *
+ * Note: producer lock is nested within consumer lock, so if you
+ * resize you must make sure all uses nest correctly.
+ * In particular if you consume ring in interrupt or BH context, you must
+ * disable interrupts/BH when doing so.
+ */
+static inline void ptr_ring_unconsume(struct ptr_ring *r, void **batch, int n,
+ void (*destroy)(void *))
+{
+   unsigned long flags;
+   int head;
+
+   spin_lock_irqsave(&(r)->consumer_lock, flags);
+   spin_lock(&(r)->producer_lock);
+
+   if (!r->size)
+   goto done;
+
+   /*
+* Clean out buffered entries (for simplicity). This way following code
+* can test entries for NULL and if not assume they are valid.
+*/
+   head = r->consumer_head - 1;
+   while (likely(head >= r->consumer_tail))
+   r->queue[head--] = NULL;
+   r->consumer_tail = r->consumer_head;
+
+   /*
+* Go over entries in batch, start moving head back and copy entries.
+* Stop when we run into previously unconsumed entries.
+*/
+   while (n--) {
+   head = r->consumer_head - 1;
+   if (head < 0)
+   head = r->size - 1;
+   if (r->queue[head]) {
+   /* This batch entry will have to be destroyed. */
+   ++n;
+   goto done;
+   }
+   r->queue[head] = batch[n];
+   r->consumer_tail = r->consumer_head = head;
+   }
+
+done:
+   /* Destroy all entries left in the batch. */
+   while (n--) {
+   destroy(batch[n]);
+   }
+   spin_unlock(&(r)->producer_lock);
+   spin_unlock_irqrestore(&(r)->consumer_lock, flags);
+}
+
   static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
   int size, gfp_t gfp,
   void (*destroy)(void *))
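As a usage sketch for the new API (the caller and the destroy callback here are
hypothetical; the real vhost/tun call sites are not part of this patch), a
consumer that pulled a batch but processed only part of it before a reset could
return the remainder like this:

#include <linux/ptr_ring.h>
#include <linux/skbuff.h>

/* Free one entry that could not be pushed back into the ring. */
static void example_destroy(void *ptr)
{
	kfree_skb(ptr);
}

/*
 * "batch" holds "total" consumed entries, of which the first "done" were
 * already handled.  Return the unprocessed tail to the ring instead of
 * dropping it; anything that no longer fits is destroyed by the callback.
 */
static void example_return_unprocessed(struct ptr_ring *ring, void **batch,
				       int done, int total)
{
	ptr_ring_unconsume(ring, batch + done, total - done, example_destroy);
}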




Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume

2017-04-23 Thread Michael S. Tsirkin
On Tue, Apr 18, 2017 at 11:07:42AM +0800, Jason Wang wrote:
> 
> 
> On 2017年04月17日 07:19, Michael S. Tsirkin wrote:
> > Applications that consume a batch of entries in one go
> > can benefit from ability to return some of them back
> > into the ring.
> > 
> > Add an API for that - assuming there's space. If there's no space
> > naturally we can't do this and have to drop entries, but this implies
> > ring is full so we'd likely drop some anyway.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> > 
> > Jason, in my mind the biggest issue with your batching patchset is the
> > backet drops on disconnect.  This API will help avoid that in the common
> > case.
> 
> Ok, I will rebase the series on top of this. (Though I don't think we care
> the packet loss).

E.g. I care - I often start sending packets to VM before it's
fully booted. Several vhost resets might follow.

> > 
> > I would still prefer that we understand what's going on,
> 
> I try to reply in another thread, does it make sense?
> 
> >   and I would
> > like to know what's the smallest batch size that's still helpful,
> 
> Yes, I've replied in another thread, the result is:
> 
> 
> no batching   1.88Mpps
> RX_BATCH=1    1.93Mpps
> RX_BATCH=4    2.11Mpps
> RX_BATCH=16   2.14Mpps
> RX_BATCH=64   2.25Mpps
> RX_BATCH=256  2.18Mpps

Essentially 4 is enough, other stuff looks more like noise
to me. What about 2?

> >   but
> > I'm not going to block the patch on these grounds assuming packet drops
> > are fixed.
> 
> Thanks a lot.
> 
> > 
> > Lightly tested - this is on top of consumer batching patches.
> > 
> > Thanks!
> > 
> >   include/linux/ptr_ring.h | 57 
> > 
> >   1 file changed, 57 insertions(+)
> > 
> > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > index 783e7f5..5fbeab4 100644
> > --- a/include/linux/ptr_ring.h
> > +++ b/include/linux/ptr_ring.h
> > @@ -457,6 +457,63 @@ static inline int ptr_ring_init(struct ptr_ring *r, 
> > int size, gfp_t gfp)
> > return 0;
> >   }
> > +/*
> > + * Return entries into ring. Destroy entries that don't fit.
> > + *
> > + * Note: this is expected to be a rare slow path operation.
> > + *
> > + * Note: producer lock is nested within consumer lock, so if you
> > + * resize you must make sure all uses nest correctly.
> > + * In particular if you consume ring in interrupt or BH context, you must
> > + * disable interrupts/BH when doing so.
> > + */
> > +static inline void ptr_ring_unconsume(struct ptr_ring *r, void **batch, 
> > int n,
> > + void (*destroy)(void *))
> > +{
> > +   unsigned long flags;
> > +   int head;
> > +
> > +   spin_lock_irqsave(&(r)->consumer_lock, flags);
> > +   spin_lock(&(r)->producer_lock);
> > +
> > +   if (!r->size)
> > +   goto done;
> > +
> > +   /*
> > +* Clean out buffered entries (for simplicity). This way following code
> > +* can test entries for NULL and if not assume they are valid.
> > +*/
> > +   head = r->consumer_head - 1;
> > +   while (likely(head >= r->consumer_tail))
> > +   r->queue[head--] = NULL;
> > +   r->consumer_tail = r->consumer_head;
> > +
> > +   /*
> > +* Go over entries in batch, start moving head back and copy entries.
> > +* Stop when we run into previously unconsumed entries.
> > +*/
> > +   while (n--) {
> > +   head = r->consumer_head - 1;
> > +   if (head < 0)
> > +   head = r->size - 1;
> > +   if (r->queue[head]) {
> > +   /* This batch entry will have to be destroyed. */
> > +   ++n;
> > +   goto done;
> > +   }
> > +   r->queue[head] = batch[n];
> > +   r->consumer_tail = r->consumer_head = head;
> > +   }
> > +
> > +done:
> > +   /* Destroy all entries left in the batch. */
> > +   while (n--) {
> > +   destroy(batch[n]);
> > +   }
> > +   spin_unlock(&(r)->producer_lock);
> > +   spin_unlock_irqrestore(&(r)->consumer_lock, flags);
> > +}
> > +
> >   static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void 
> > **queue,
> >int size, gfp_t gfp,
> >void (*destroy)(void *))


Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume

2017-04-17 Thread Jason Wang



On 2017年04月17日 07:19, Michael S. Tsirkin wrote:

Applications that consume a batch of entries in one go
can benefit from ability to return some of them back
into the ring.

Add an API for that - assuming there's space. If there's no space
naturally we can't do this and have to drop entries, but this implies
ring is full so we'd likely drop some anyway.

Signed-off-by: Michael S. Tsirkin 
---

Jason, in my mind the biggest issue with your batching patchset is the
backet drops on disconnect.  This API will help avoid that in the common
case.


Ok, I will rebase the series on top of this. (Though I don't think we 
care about the packet loss.)




I would still prefer that we understand what's going on,


I tried to reply in another thread; does it make sense?


  and I would
like to know what's the smallest batch size that's still helpful,


Yes, I've replied in another thread, the result is:


no batching   1.88Mpps
RX_BATCH=1    1.93Mpps
RX_BATCH=4    2.11Mpps
RX_BATCH=16   2.14Mpps
RX_BATCH=64   2.25Mpps
RX_BATCH=256  2.18Mpps


  but
I'm not going to block the patch on these grounds assuming packet drops
are fixed.


Thanks a lot.



Lightly tested - this is on top of consumer batching patches.

Thanks!

  include/linux/ptr_ring.h | 57 
  1 file changed, 57 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 783e7f5..5fbeab4 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -457,6 +457,63 @@ static inline int ptr_ring_init(struct ptr_ring *r, int 
size, gfp_t gfp)
return 0;
  }
  
+/*

+ * Return entries into ring. Destroy entries that don't fit.
+ *
+ * Note: this is expected to be a rare slow path operation.
+ *
+ * Note: producer lock is nested within consumer lock, so if you
+ * resize you must make sure all uses nest correctly.
+ * In particular if you consume ring in interrupt or BH context, you must
+ * disable interrupts/BH when doing so.
+ */
+static inline void ptr_ring_unconsume(struct ptr_ring *r, void **batch, int n,
+ void (*destroy)(void *))
+{
+   unsigned long flags;
+   int head;
+
+   spin_lock_irqsave(&(r)->consumer_lock, flags);
+   spin_lock(&(r)->producer_lock);
+
+   if (!r->size)
+   goto done;
+
+   /*
+* Clean out buffered entries (for simplicity). This way following code
+* can test entries for NULL and if not assume they are valid.
+*/
+   head = r->consumer_head - 1;
+   while (likely(head >= r->consumer_tail))
+   r->queue[head--] = NULL;
+   r->consumer_tail = r->consumer_head;
+
+   /*
+* Go over entries in batch, start moving head back and copy entries.
+* Stop when we run into previously unconsumed entries.
+*/
+   while (n--) {
+   head = r->consumer_head - 1;
+   if (head < 0)
+   head = r->size - 1;
+   if (r->queue[head]) {
+   /* This batch entry will have to be destroyed. */
+   ++n;
+   goto done;
+   }
+   r->queue[head] = batch[n];
+   r->consumer_tail = r->consumer_head = head;
+   }
+
+done:
+   /* Destroy all entries left in the batch. */
+   while (n--) {
+   destroy(batch[n]);
+   }
+   spin_unlock(&(r)->producer_lock);
+   spin_unlock_irqrestore(&(r)->consumer_lock, flags);
+}
+
  static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
   int size, gfp_t gfp,
   void (*destroy)(void *))
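For context, the consumer batching these patches sit on top of boils down to
something like the following sketch (ptr_ring_consume_batched comes from that
series; the drain helper and the process callback are hypothetical):

#include <linux/ptr_ring.h>

#define EXAMPLE_BATCH 64

/* Drain the ring a batch at a time instead of one entry per call. */
static void example_drain(struct ptr_ring *ring, void (*process)(void *))
{
	void *batch[EXAMPLE_BATCH];
	int i, n;

	while ((n = ptr_ring_consume_batched(ring, batch, EXAMPLE_BATCH)) > 0)
		for (i = 0; i < n; i++)
			process(batch[i]);
}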




Re: [PATCH RFC] ptr_ring: add ptr_ring_unconsume

2017-04-17 Thread Sergei Shtylyov

Hello!

On 4/17/2017 2:19 AM, Michael S. Tsirkin wrote:


Applications that consume a batch of entries in one go
can benefit from ability to return some of them back
into the ring.

Add an API for that - assuming there's space. If there's no space
naturally we can't do this and have to drop entries, but this implies
ring is full so we'd likely drop some anyway.

Signed-off-by: Michael S. Tsirkin 
---

Jason, in my mind the biggest issue with your batching patchset is the
backet drops on disconnect.  This API will help avoid that in the common


   Packet?

[...]

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 783e7f5..5fbeab4 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -457,6 +457,63 @@ static inline int ptr_ring_init(struct ptr_ring *r, int 
size, gfp_t gfp)
return 0;
 }

+/*
+ * Return entries into ring. Destroy entries that don't fit.
+ *
+ * Note: this is expected to be a rare slow path operation.
+ *
+ * Note: producer lock is nested within consumer lock, so if you
+ * resize you must make sure all uses nest correctly.
+ * In particular if you consume ring in interrupt or BH context, you must
+ * disable interrupts/BH when doing so.
+ */
+static inline void ptr_ring_unconsume(struct ptr_ring *r, void **batch, int n,
+ void (*destroy)(void *))
+{
+   unsigned long flags;
+   int head;
+
+   spin_lock_irqsave(&(r)->consumer_lock, flags);
+   spin_lock(&(r)->producer_lock);


   The innermost parens seem pointless here

[...]

+done:
+   /* Destroy all entries left in the batch. */
+   while (n--) {
+   destroy(batch[n]);
+   }


   Braces not needed here.


+   spin_unlock(&(r)->producer_lock);
+   spin_unlock_irqrestore(&(r)->consumer_lock, flags);


   Same comment about the innermost parens...

[...]

MBR, Sergei
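
Applied to the hunks quoted above, the cleanups Sergei suggests would read
roughly as:

	spin_lock_irqsave(&r->consumer_lock, flags);
	spin_lock(&r->producer_lock);

and

done:
	/* Destroy all entries left in the batch. */
	while (n--)
		destroy(batch[n]);

	spin_unlock(&r->producer_lock);
	spin_unlock_irqrestore(&r->consumer_lock, flags);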


