Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-06 Thread Avi Kivity

On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:

> +	/* We publish the last-seen used index at the end of the available ring.
> +	 * It is at the end for backwards compatibility. */
> +	vr->last_used_idx = &(vr)->avail->ring[num];
> +	/* Verify that last used index does not spill over the used ring. */
> +	BUG_ON((void *)vr->last_used_idx +
> +	       sizeof *vr->last_used_idx > (void *)vr->used);
>  }


Shouldn't this be on its own cache line?

One way to achieve this is to allocate it at the end.

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-06 Thread Rusty Russell
On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
> On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
> > +	/* We publish the last-seen used index at the end of the available ring.
> > +	 * It is at the end for backwards compatibility. */
> > +	vr->last_used_idx = &(vr)->avail->ring[num];
> > +	/* Verify that last used index does not spill over the used ring. */
> > +	BUG_ON((void *)vr->last_used_idx +
> > +	       sizeof *vr->last_used_idx > (void *)vr->used);
> >   }
> >
> 
> Shouldn't this be on its own cache line?

It's next to the available ring; because that's where the guest publishes
its data.  That whole page is guest-write, host-read.

Putting it on a cacheline by itself would be a slight pessimization; the host
cpu would have to get the last_used_idx cacheline and the avail descriptor
cacheline every time.  This way, they are sometimes the same cacheline.

Hope that clarifies,
Rusty.


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-11 Thread Avi Kivity

On 05/07/2010 06:23 AM, Rusty Russell wrote:
> On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
>> On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
>>> +	/* We publish the last-seen used index at the end of the available ring.
>>> +	 * It is at the end for backwards compatibility. */
>>> +	vr->last_used_idx = &(vr)->avail->ring[num];
>>> +	/* Verify that last used index does not spill over the used ring. */
>>> +	BUG_ON((void *)vr->last_used_idx +
>>> +	       sizeof *vr->last_used_idx > (void *)vr->used);
>>> }
>>
>> Shouldn't this be on its own cache line?
>
> It's next to the available ring; because that's where the guest publishes
> its data.  That whole page is guest-write, host-read.
>
> Putting it on a cacheline by itself would be a slight pessimization; the host
> cpu would have to get the last_used_idx cacheline and the avail descriptor
> cacheline every time.  This way, they are sometimes the same cacheline.


If one peer writes the tail of the available ring, while the other reads 
last_used_idx, it's a false bounce, no?


Having things on the same cacheline is only worthwhile if they are 
accessed at the same time.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-11 Thread Michael S. Tsirkin
On Tue, May 11, 2010 at 10:27:22PM +0300, Avi Kivity wrote:
> On 05/07/2010 06:23 AM, Rusty Russell wrote:
>> On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
>>
>>> On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
>>>> +	/* We publish the last-seen used index at the end of the available ring.
>>>> +	 * It is at the end for backwards compatibility. */
>>>> +	vr->last_used_idx = &(vr)->avail->ring[num];
>>>> +	/* Verify that last used index does not spill over the used ring. */
>>>> +	BUG_ON((void *)vr->last_used_idx +
>>>> +	       sizeof *vr->last_used_idx > (void *)vr->used);
>>>> }
>>>
>>> Shouldn't this be on its own cache line?
>>>
>> It's next to the available ring; because that's where the guest publishes
>> its data.  That whole page is guest-write, host-read.
>>
>> Putting it on a cacheline by itself would be a slight pessimization; the host
>> cpu would have to get the last_used_idx cacheline and the avail descriptor
>> cacheline every time.  This way, they are sometimes the same cacheline.
>>
>
> If one peer writes the tail of the available ring, while the other reads  
> last_used_idx, it's a false bounce, no?
>
> Having things on the same cacheline is only worthwhile if they are  
> accessed at the same time.

Yes, this is what I was trying to say.
avail flags and used index *are* accessed at the same time, so
there could be an advantage to sharing a cache line there.

All this should be kept in mind if we ever do
VIRTIO_RING_F_NEW_LAYOUT.
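
For concreteness, a purely illustrative sketch of what such a layout
could co-locate (nothing like this is specified anywhere; all names
are invented):

/* Hypothetical VIRTIO_RING_F_NEW_LAYOUT sketch: keep the fields the
 * host reads together -- the guest's notification flags and its
 * published last-seen used index -- on the same cache line. */
struct vring_avail_new {
	uint16_t flags;		/* interrupt suppression hints */
	uint16_t idx;		/* next free avail entry */
	uint16_t last_used_idx;	/* last used entry the guest saw */
	uint16_t ring[];	/* descriptor heads follow */
};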

-- 
MST


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-19 Thread Rusty Russell
On Wed, 12 May 2010 04:57:22 am Avi Kivity wrote:
> On 05/07/2010 06:23 AM, Rusty Russell wrote:
> > On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
> >
> >> On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
> >>  
> >>> +	/* We publish the last-seen used index at the end of the available ring.
> >>> +	 * It is at the end for backwards compatibility. */
> >>> +	vr->last_used_idx = &(vr)->avail->ring[num];
> >>> +	/* Verify that last used index does not spill over the used ring. */
> >>> +	BUG_ON((void *)vr->last_used_idx +
> >>> +	       sizeof *vr->last_used_idx > (void *)vr->used);
> >>>}
> >>>
> >>>
> >> Shouldn't this be on its own cache line?
> >>  
> > It's next to the available ring; because that's where the guest publishes
> > its data.  That whole page is guest-write, host-read.
> >
> > Putting it on a cacheline by itself would be a slight pessimization; the host
> > cpu would have to get the last_used_idx cacheline and the avail descriptor
> > cacheline every time.  This way, they are sometimes the same cacheline.
> 
> If one peer writes the tail of the available ring, while the other reads 
> last_used_idx, it's a false bounce, no?

I think we're talking about the last 2 entries of the avail ring.  That means
the worst case is 1 false bounce every time around the ring.  I think that's
why we're debating it instead of measuring it :)

Note that this is an exclusive->shared->exclusive bounce only, too.

Cheers,
Rusty.


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-19 Thread Avi Kivity

On 05/19/2010 10:39 AM, Rusty Russell wrote:
> I think we're talking about the last 2 entries of the avail ring.  That means
> the worst case is 1 false bounce every time around the ring.

It's low, but why introduce an inefficiency when you can avoid doing it
for the same effort?

> I think that's
> why we're debating it instead of measuring it :)

Measure before optimize is good for code but not for protocols.
Protocols have to be robust against future changes.  Virtio is warty
enough already, we can't keep doing local optimizations.

> Note that this is an exclusive->shared->exclusive bounce only, too.


A bounce is a bounce.

Virtio is already way too bouncy due to the indirection between the 
avail/used rings and the descriptor pool.  A device with out of order 
completion (like virtio-blk) will quickly randomize the unused 
descriptor indexes, so every descriptor fetch will require a bounce.


In contrast, if the rings hold the descriptors themselves instead of 
pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for 
every descriptor, amortized.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-19 Thread Michael S. Tsirkin
On Wed, May 19, 2010 at 11:06:42AM +0300, Avi Kivity wrote:
> On 05/19/2010 10:39 AM, Rusty Russell wrote:
>>
>> I think we're talking about the last 2 entries of the avail ring.  That means
>> the worst case is 1 false bounce every time around the ring.
>
> It's low, but why introduce an inefficiency when you can avoid doing it  
> for the same effort?
> 
>> I think that's
>> why we're debating it instead of measuring it :)
>>
>
> Measure before optimize is good for code but not for protocols.   
> Protocols have to be robust against future changes.  Virtio is warty  
> enough already, we can't keep doing local optimizations.
>
>> Note that this is an exclusive->shared->exclusive bounce only, too.
>>
>
> A bounce is a bounce.
>
> Virtio is already way too bouncy due to the indirection between the  
> avail/used rings and the descriptor pool.  A device with out of order  
> completion (like virtio-blk) will quickly randomize the unused  
> descriptor indexes, so every descriptor fetch will require a bounce.
>
> In contrast, if the rings hold the descriptors themselves instead of  
> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for  
> every descriptor, amortized.

On the other hand, consider that on the fast path we are never using all
of the ring.  With a good allocator we might be able to keep
reusing only a small part of the ring, instead of wrapping around
all of it all of the time.


> -- 
> Do not meddle in the internals of kernels, for they are subtle and quick to 
> panic.


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-19 Thread Rusty Russell
On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
> > Note that this is an exclusive->shared->exclusive bounce only, too.
> >
> 
> A bounce is a bounce.

I tried to measure this to show that you were wrong, but I was only able
to show that you're right.  How annoying.  Test code below.

> Virtio is already way too bouncy due to the indirection between the 
> avail/used rings and the descriptor pool.

I tried to do a more careful analysis below, and I think this is an
overstatement.

> A device with out of order 
> completion (like virtio-blk) will quickly randomize the unused 
> descriptor indexes, so every descriptor fetch will require a bounce.
> 
> In contrast, if the rings hold the descriptors themselves instead of 
> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for 
> every descriptor, amortized.

We already have indirect, this would be a logical next step.  So let's
think about it.  The avail ring would contain 64-bit values, the used ring
would contain indexes into the avail ring.

So client writes descriptor page and adds to avail ring, then writes to
index.  Server reads index, avail ring, descriptor page (3).  Writes used
entry (1).  Updates last_used (1).  Client reads used (1), derefs avail (1),
updates last_used (1), cleans descriptor page (1).

That's 9 cacheline transfers, worst case.  Best case of a half-full ring
in steady state, assuming 128-byte cache lines, the avail ring costs are
1/16, the used entry is 1/64.  This drops it to 6 and 9/64 transfers.
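
Spelling out that worst-case tally from the steps above:

  server reads:  index (1) + avail ring (1) + descriptor page (1)  = 3
  server writes: used entry (1) + last_used (1)                    = 2
  client:        reads used (1) + derefs avail (1)
                 + updates last_used (1) + cleans desc page (1)    = 4
                                                             total = 9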

(Note, the current scheme adds 2 more cacheline transfers, for the descriptor
table, worst case.  Assuming indirect, we get 2/8 xfer best case.  Either way,
it's not the main source of cacheline xfers).

Can we do better?  The obvious idea is to try to get rid of last_used and
used, and use the ring itself.  We would use an invalid entry to mark the
head of the ring.

Any other thoughts?
Rusty.


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-19 Thread Rusty Russell
On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
> On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
> > > Note that this is an exclusive->shared->exclusive bounce only, too.
> > >
> > 
> > A bounce is a bounce.
> 
> I tried to measure this to show that you were wrong, but I was only able
> to show that you're right.  How annoying.  Test code below.

This time for sure!

#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <err.h>
#include <sys/time.h>
#include <sys/mman.h>

/* We share memory via an mmap. */
struct counter {
	unsigned int cacheline1;
	char pad[256];
	unsigned int cacheline2;
};

#define MAX_BOUNCES 100000000	/* value reconstructed; matches the ~66 ns/bounce timings quoted in later replies */

enum mode {
	SHARE,
	UNSHARE,
	LOCKSHARE,
	LOCKUNSHARE,
};

int main(int argc, char *argv[])
{
	cpu_set_t cpuset;
	volatile struct counter *counter;
	struct timeval start, stop;
	bool child;
	unsigned int count;
	uint64_t usec;
	enum mode mode;

	if (argc != 4)
		errx(1, "Usage: cachebounce share|unshare|lockshare|lockunshare <cpu1> <cpu2>");

	if (strcmp(argv[1], "share") == 0)
		mode = SHARE;
	else if (strcmp(argv[1], "unshare") == 0)
		mode = UNSHARE;
	else if (strcmp(argv[1], "lockshare") == 0)
		mode = LOCKSHARE;
	else if (strcmp(argv[1], "lockunshare") == 0)
		mode = LOCKUNSHARE;
	else
		errx(1, "Usage: cachebounce share|unshare|lockshare|lockunshare <cpu1> <cpu2>");

	CPU_ZERO(&cpuset);

	counter = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE,
		       MAP_ANONYMOUS|MAP_SHARED, -1, 0);
	if (counter == MAP_FAILED)
		err(1, "Mapping page");

	/* Fault it in. */
	counter->cacheline1 = counter->cacheline2 = 0;

	child = (fork() == 0);

	CPU_SET(atoi(argv[2 + child]), &cpuset);
	if (sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset) != 0)
		err(1, "Calling sched_setaffinity()");

	gettimeofday(&start, NULL);

	if (child) {
		count = 1;
		switch (mode) {
		case SHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case UNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline2 = count;
				count++;
			}
			break;
		case LOCKSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (__sync_val_compare_and_swap(&counter->cacheline1, count, count+1)
				       != count);
				count += 2;
			}
			break;
		case LOCKUNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				__sync_val_compare_and_swap(&counter->cacheline2, count, count+1);
				count += 2;
			}
			break;
		}
	} else {
		count = 0;
		switch (mode) {
		case SHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case UNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline2 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		/* (The archived message is truncated below this point; the
		 * remaining cases mirror the child's, and the printout
		 * matches the output quoted in later replies.) */
		case LOCKSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (__sync_val_compare_and_swap(&counter->cacheline1, count, count+1)
				       != count);
				count += 2;
			}
			break;
		case LOCKUNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline2 != count);
				__sync_val_compare_and_swap(&counter->cacheline1, count, count+1);
				count += 2;
			}
			break;
		}
	}

	gettimeofday(&stop, NULL);
	usec = (stop.tv_sec - start.tv_sec) * 1000000ULL
		+ stop.tv_usec - start.tv_usec;
	printf("CPU %s: %s cacheline: %llu usec\n",
	       argv[2 + child], argv[1], (unsigned long long)usec);

	return 0;
}

Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-19 Thread Avi Kivity

On 05/20/2010 01:33 AM, Michael S. Tsirkin wrote:
>> Virtio is already way too bouncy due to the indirection between the
>> avail/used rings and the descriptor pool.  A device with out of order
>> completion (like virtio-blk) will quickly randomize the unused
>> descriptor indexes, so every descriptor fetch will require a bounce.
>>
>> In contrast, if the rings hold the descriptors themselves instead of
>> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
>> every descriptor, amortized.
>
> On the other hand, consider that on the fast path we are never using all
> of the ring.  With a good allocator we might be able to keep
> reusing only a small part of the ring, instead of wrapping around
> all of it all of the time.


It's still suboptimal, we have to bounce both the avail/used rings and 
the descriptor pool, compared to just the descriptor ring with a direct 
design.  Plus we don't need a fancy allocator.


When amortizing cachelines, simple data structures win.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Avi Kivity

On 05/20/2010 08:01 AM, Rusty Russell wrote:
>> A device with out of order
>> completion (like virtio-blk) will quickly randomize the unused
>> descriptor indexes, so every descriptor fetch will require a bounce.
>>
>> In contrast, if the rings hold the descriptors themselves instead of
>> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
>> every descriptor, amortized.
>
> We already have indirect, this would be a logical next step.  So let's
> think about it.  The avail ring would contain 64-bit values, the used ring
> would contain indexes into the avail ring.
Have just one ring, no indexes.  The producer places descriptors into
the ring and updates the head.  The consumer copies out descriptors to
be processed and copies back in completed descriptors.  Chaining is
always linear.  The descriptors contain a tag that allows the producer to
identify the completion.
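
For concreteness, a sketch of what one entry of such a ring could look
like (all names invented; this is just one reading of the idea, not a
worked-out layout):

/* Illustrative only: descriptors live in the ring itself, no indexes
 * into a separate descriptor pool. */
struct ring_desc {
	uint64_t addr;	/* buffer address */
	uint32_t len;	/* buffer length */
	uint16_t flags;	/* e.g. "chain continues in next entry" */
	uint16_t tag;	/* producer's cookie to match completions */
};

struct one_ring {
	uint32_t head;			/* producer-written */
	uint32_t tail;			/* consumer-written */
	struct ring_desc desc[];	/* the ring proper */
};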


Indirect only pays when there are enough descriptors in it to fill a 
couple of cache lines.  Otherwise it's an extra bounce.


We will always bounce here; that's what happens when transferring data.
The question is how many cache lines per descriptor.  A pointer adds 1
bounce, linear descriptors cost 1/4 bounce (16-byte descriptors, 64-byte
lines), and chained descriptors cost a bounce.  So best is one ring of
linearly chained descriptors.  Indirect works when you have large
requests (like block).



> So client writes descriptor page and adds to avail ring, then writes to
> index.  Server reads index, avail ring, descriptor page (3).  Writes used
> entry (1).  Updates last_used (1).  Client reads used (1), derefs avail (1),
> updates last_used (1), cleans descriptor page (1).
>
> That's 9 cacheline transfers, worst case.  Best case of a half-full ring
> in steady state, assuming 128-byte cache lines, the avail ring costs are
> 1/16, the used entry is 1/64.  This drops it to 6 and 9/64 transfers.


Cache lines are 64 bytes these days.

With a single ring, client writes descriptors (ceil(N/4)), updates head
(1).  Server reads head (1), copies out descriptors (ceil(N/4)), issues
requests, copies back completions (ceil(N/4)), updates tail (1).
Client reads back tail and descriptors (1 + ceil(N/4)).

Worst case: 4 + 4 * ceil(N/4).  Best case I think this drops by half.




> (Note, the current scheme adds 2 more cacheline transfers, for the descriptor
> table, worst case.


2 bounces per descriptor due to random access.


> Assuming indirect, we get 2/8 xfer best case.  Either way,
> it's not the main source of cacheline xfers).


Indirect adds a double bounce to get to the descriptor table, but any 
descriptors there are accessed linearly.  It's only good when you have 
large chains.



> Can we do better?  The obvious idea is to try to get rid of last_used and
> used, and use the ring itself.  We would use an invalid entry to mark the
> head of the ring.


Interesting!  So a peer will read until it hits a wall.  But how to 
update the wall atomically?


Maybe we can have a flag in the descriptor that indicates headness or
tailness.  Update looks ugly though: write descriptor with head flag,
write next descriptor with head flag, remove flag from previous descriptor.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Michael S. Tsirkin
On Thu, May 20, 2010 at 02:31:50PM +0930, Rusty Russell wrote:
> Can we do better?  The obvious idea is to try to get rid of last_used and
> used, and use the ring itself.  We would use an invalid entry to mark the
> head of the ring.
> 
> Any other thoughts?
> Rusty.

We also need a way to avoid interrupts at least while we are
processing the ring.
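
For reference, a sketch of how that is done today with the existing
avail-ring flag (vq here is the driver's usual virtqueue state, as in
Linux's virtio_ring):

/* Hint the host that no interrupt is needed while we poll. */
vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
/* ... drain the used ring ... */
vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;

Whatever replaces the avail ring would need an equivalent.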

-- 
MST


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Rusty Russell
On Thu, 20 May 2010 04:30:56 pm Avi Kivity wrote:
> On 05/20/2010 08:01 AM, Rusty Russell wrote:
> >
> >> A device with out of order
> >> completion (like virtio-blk) will quickly randomize the unused
> >> descriptor indexes, so every descriptor fetch will require a bounce.
> >>
> >> In contrast, if the rings hold the descriptors themselves instead of
> >> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
> >> every descriptor, amortized.
> >>  
> > We already have indirect, this would be a logical next step.  So let's
> > think about it.  The avail ring would contain 64-bit values, the used ring
> > would contain indexes into the avail ring.
> 
> Have just one ring, no indexes.  The producer places descriptors into
> the ring and updates the head.  The consumer copies out descriptors to
> be processed and copies back in completed descriptors.  Chaining is
> always linear.  The descriptors contain a tag that allows the producer to
> identify the completion.

This could definitely work.  The original reason for the page boundaries
was for untrusted inter-guest communication: with appropriate page protections
they could see each other's rings and a simple inter-guest copy hypercall
could verify that the other guest really exposed that data via virtio ring.

But, cute as that is, we never did that.  And it's not clear that it wins
much over simply having the hypervisor read both rings directly.

> > Can we do better?  The obvious idea is to try to get rid of last_used and
> > used, and use the ring itself.  We would use an invalid entry to mark the
> > head of the ring.
> 
> Interesting!  So a peer will read until it hits a wall.  But how to 
> update the wall atomically?
> 
> Maybe we can have a flag in the descriptor that indicates headness or
> tailness.  Update looks ugly though: write descriptor with head flag,
> write next descriptor with head flag, remove flag from previous descriptor.

I was thinking a separate magic "invalid" entry.  To publish a 3-descriptor
chain, you would write descriptors 2 and 3, write an invalid entry at 4,
barrier, write entry 1.  It is a bit ugly, yes, but not terrible.
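
A sketch of that sequence, reusing the one-ring layout sketched earlier
(RING_SIZE, DESC_INVALID and wmb() are assumed/invented names; wmb() is
the usual write barrier):

#define DESC_INVALID 0xffff	/* invented flag value: the "wall" */

static void publish_chain3(struct ring_desc *ring, unsigned int head,
			   const struct ring_desc d[3])
{
	ring[(head + 1) % RING_SIZE] = d[1];	/* descriptors 2 and 3 first */
	ring[(head + 2) % RING_SIZE] = d[2];
	ring[(head + 3) % RING_SIZE].flags = DESC_INVALID; /* the new wall */
	wmb();			/* bodies and new wall visible before entry 1 */
	ring[head % RING_SIZE] = d[0];	/* old wall overwritten: chain is live */
}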

I think that a simple simulator for this is worth writing, which tracks
cacheline moves under various fullness scenarios...

Cheers,
Rusty.


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Avi Kivity

On 05/20/2010 05:34 PM, Rusty Russell wrote:
>> Have just one ring, no indexes.  The producer places descriptors into
>> the ring and updates the head.  The consumer copies out descriptors to
>> be processed and copies back in completed descriptors.  Chaining is
>> always linear.  The descriptors contain a tag that allows the producer to
>> identify the completion.
>
> This could definitely work.  The original reason for the page boundaries
> was for untrusted inter-guest communication: with appropriate page protections
> they could see each other's rings and a simple inter-guest copy hypercall
> could verify that the other guest really exposed that data via virtio ring.
>
> But, cute as that is, we never did that.  And it's not clear that it wins
> much over simply having the hypervisor read both rings directly.

AFAICS having separate avail_ring/used_ring/desc_pool is orthogonal to
this cuteness.

>>> Can we do better?  The obvious idea is to try to get rid of last_used and
>>> used, and use the ring itself.  We would use an invalid entry to mark the
>>> head of the ring.
>>
>> Interesting!  So a peer will read until it hits a wall.  But how to
>> update the wall atomically?
>>
>> Maybe we can have a flag in the descriptor that indicates headness or
>> tailness.  Update looks ugly though: write descriptor with head flag,
>> write next descriptor with head flag, remove flag from previous descriptor.
>
> I was thinking a separate magic "invalid" entry.  To publish a 3-descriptor
> chain, you would write descriptors 2 and 3, write an invalid entry at 4,
> barrier, write entry 1.  It is a bit ugly, yes, but not terrible.


Worth exploring. This amortizes the indexes into the ring, a good thing.

Another thing we can do is place the tail a half ring away from the head 
(and limit ring utilization to 50%), reducing bounces on short kicks.  
Or equivalently have an avail ring and used ring, but both containing 
tagged descriptors instead of pointers to descriptors.



> I think that a simple simulator for this is worth writing, which tracks
> cacheline moves under various fullness scenarios...


Yup.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Michael S. Tsirkin
On Thu, May 20, 2010 at 02:38:16PM +0930, Rusty Russell wrote:
> On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
> > On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
> > > > Note that this is an exclusive->shared->exclusive bounce only, too.
> > > >
> > > 
> > > A bounce is a bounce.
> > 
> > I tried to measure this to show that you were wrong, but I was only able
> > to show that you're right.  How annoying.  Test code below.
> 
> This time for sure!


What do you see?
On my laptop:
[...@tuck testring]$ ./rusty1 share 0 1
CPU 1: share cacheline: 2820410 usec
CPU 0: share cacheline: 2823441 usec
[...@tuck testring]$ ./rusty1 unshare 0 1
CPU 0: unshare cacheline: 2783014 usec
CPU 1: unshare cacheline: 2782951 usec
[...@tuck testring]$ ./rusty1 lockshare 0 1
CPU 1: lockshare cacheline: 1888495 usec
CPU 0: lockshare cacheline: 1888544 usec
[...@tuck testring]$ ./rusty1 lockunshare 0 1
CPU 0: lockunshare cacheline: 1889854 usec
CPU 1: lockunshare cacheline: 1889804 usec
So locked version seems to be faster than unlocked,
and share/unshare not to matter?

same on a workstation:
[r...@qus19 ~]# ./rusty1 unshare 0 1
CPU 0: unshare cacheline: 6037002 usec
CPU 1: unshare cacheline: 6036977 usec
[r...@qus19 ~]# ./rusty1 lockunshare 0 1
CPU 1: lockunshare cacheline: 5734362 usec
CPU 0: lockunshare cacheline: 5734389 usec
[r...@qus19 ~]# ./rusty1 lockshare 0 1
CPU 1: lockshare cacheline: 5733537 usec
CPU 0: lockshare cacheline: 5733564 usec

using another pair of CPUs gives more drastic
results:

[r...@qus19 ~]# ./rusty1 lockshare 0 2
CPU 2: lockshare cacheline: 4226990 usec
CPU 0: lockshare cacheline: 4227038 usec
[r...@qus19 ~]# ./rusty1 lockunshare 0 2
CPU 0: lockunshare cacheline: 4226707 usec
CPU 2: lockunshare cacheline: 4226662 usec
[r...@qus19 ~]# ./rusty1 unshare 0 2
CPU 0: unshare cacheline: 14815048 usec
CPU 2: unshare cacheline: 14815006 usec


The share test seems to never finish on the
workstation. I am debugging this.

-- 
MST


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Avi Kivity

On 05/23/2010 06:31 PM, Michael S. Tsirkin wrote:
> On Thu, May 20, 2010 at 02:38:16PM +0930, Rusty Russell wrote:
>> On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
>>> On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
>>>>> Note that this is an exclusive->shared->exclusive bounce only, too.
>>>>
>>>> A bounce is a bounce.
>>>
>>> I tried to measure this to show that you were wrong, but I was only able
>>> to show that you're right.  How annoying.  Test code below.
>>
>> This time for sure!
>
> What do you see?
> On my laptop:
> [...@tuck testring]$ ./rusty1 share 0 1
> CPU 1: share cacheline: 2820410 usec
> CPU 0: share cacheline: 2823441 usec
> [...@tuck testring]$ ./rusty1 unshare 0 1
> CPU 0: unshare cacheline: 2783014 usec
> CPU 1: unshare cacheline: 2782951 usec
> [...@tuck testring]$ ./rusty1 lockshare 0 1
> CPU 1: lockshare cacheline: 1888495 usec
> CPU 0: lockshare cacheline: 1888544 usec
> [...@tuck testring]$ ./rusty1 lockunshare 0 1
> CPU 0: lockunshare cacheline: 1889854 usec
> CPU 1: lockunshare cacheline: 1889804 usec

Ugh, can the timing be normalized per operation?  This is unreadable.

> So locked version seems to be faster than unlocked,
> and share/unshare not to matter?

May be due to the processor using the LOCK operation as a hint to
reserve the cacheline for a bit.

> same on a workstation:
> [r...@qus19 ~]# ./rusty1 unshare 0 1
> CPU 0: unshare cacheline: 6037002 usec
> CPU 1: unshare cacheline: 6036977 usec
> [r...@qus19 ~]# ./rusty1 lockunshare 0 1
> CPU 1: lockunshare cacheline: 5734362 usec
> CPU 0: lockunshare cacheline: 5734389 usec
> [r...@qus19 ~]# ./rusty1 lockshare 0 1
> CPU 1: lockshare cacheline: 5733537 usec
> CPU 0: lockshare cacheline: 5733564 usec
>
> using another pair of CPUs gives more drastic
> results:
>
> [r...@qus19 ~]# ./rusty1 lockshare 0 2
> CPU 2: lockshare cacheline: 4226990 usec
> CPU 0: lockshare cacheline: 4227038 usec
> [r...@qus19 ~]# ./rusty1 lockunshare 0 2
> CPU 0: lockunshare cacheline: 4226707 usec
> CPU 2: lockunshare cacheline: 4226662 usec
> [r...@qus19 ~]# ./rusty1 unshare 0 2
> CPU 0: unshare cacheline: 14815048 usec
> CPU 2: unshare cacheline: 14815006 usec

That's expected.  Hyperthread will be fastest (shared L1), shared L2/L3
will be slower, cross-socket will suck.



--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Michael S. Tsirkin
On Sun, May 23, 2010 at 06:41:33PM +0300, Avi Kivity wrote:
> On 05/23/2010 06:31 PM, Michael S. Tsirkin wrote:
>> On Thu, May 20, 2010 at 02:38:16PM +0930, Rusty Russell wrote:
>>
>>> On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
>>>> On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
>>>>>> Note that this is an exclusive->shared->exclusive bounce only, too.
>>>>>
>>>>> A bounce is a bounce.
>>>>
>>>> I tried to measure this to show that you were wrong, but I was only able
>>>> to show that you're right.  How annoying.  Test code below.
>>>
>>> This time for sure!
>>
>> What do you see?
>> On my laptop:
>>  [...@tuck testring]$ ./rusty1 share 0 1
>>  CPU 1: share cacheline: 2820410 usec
>>  CPU 0: share cacheline: 2823441 usec
>>  [...@tuck testring]$ ./rusty1 unshare 0 1
>>  CPU 0: unshare cacheline: 2783014 usec
>>  CPU 1: unshare cacheline: 2782951 usec
>>  [...@tuck testring]$ ./rusty1 lockshare 0 1
>>  CPU 1: lockshare cacheline: 1888495 usec
>>  CPU 0: lockshare cacheline: 1888544 usec
>>  [...@tuck testring]$ ./rusty1 lockunshare 0 1
>>  CPU 0: lockunshare cacheline: 1889854 usec
>>  CPU 1: lockunshare cacheline: 1889804 usec
>>
>
> Ugh, can the timing be normalized per operation?  This is unreadable.
>
>> So locked version seems to be faster than unlocked,
>> and share/unshare not to matter?
>>
>
> May be due to the processor using the LOCK operation as a hint to  
> reserve the cacheline for a bit.

Maybe we should use atomics on index then?

>> same on a workstation:
>> [r...@qus19 ~]# ./rusty1 unshare 0 1
>> CPU 0: unshare cacheline: 6037002 usec
>> CPU 1: unshare cacheline: 6036977 usec
>> [r...@qus19 ~]# ./rusty1 lockunshare 0 1
>> CPU 1: lockunshare cacheline: 5734362 usec
>> CPU 0: lockunshare cacheline: 5734389 usec
>> [r...@qus19 ~]# ./rusty1 lockshare 0 1
>> CPU 1: lockshare cacheline: 5733537 usec
>> CPU 0: lockshare cacheline: 5733564 usec
>>
>> using another pair of CPUs gives more drastic
>> results:
>>
>> [r...@qus19 ~]# ./rusty1 lockshare 0 2
>> CPU 2: lockshare cacheline: 4226990 usec
>> CPU 0: lockshare cacheline: 4227038 usec
>> [r...@qus19 ~]# ./rusty1 lockunshare 0 2
>> CPU 0: lockunshare cacheline: 4226707 usec
>> CPU 2: lockunshare cacheline: 4226662 usec
>> [r...@qus19 ~]# ./rusty1 unshare 0 2
>> CPU 0: unshare cacheline: 14815048 usec
>> CPU 2: unshare cacheline: 14815006 usec
>>
>>
>
> That's expected.  Hyperthread will be fastest (shared L1), shared L2/L3  
> will be slower, cross-socket will suck.

OK, after adding mb in the code (patch will be sent separately), the test
works on my workstation.  Locked is still fastest; unshared sometimes
wins and sometimes loses against shared.

[r...@qus19 ~]# ./cachebounce share 0 1
CPU 0: share cacheline: 6638521 usec
CPU 1: share cacheline: 6638478 usec
[r...@qus19 ~]# ./cachebounce unshare 0 1
CPU 0: unshare cacheline: 6037415 usec
CPU 1: unshare cacheline: 6037374 usec
[r...@qus19 ~]# ./cachebounce lockshare 0 1
CPU 0: lockshare cacheline: 5734017 usec
CPU 1: lockshare cacheline: 5733978 usec
[r...@qus19 ~]# ./cachebounce lockunshare 0 1
CPU 1: lockunshare cacheline: 5733260 usec
CPU 0: lockunshare cacheline: 5733307 usec
[r...@qus19 ~]# ./cachebounce share 0 2
CPU 0: share cacheline: 14529198 usec
CPU 2: share cacheline: 14529156 usec
[r...@qus19 ~]# ./cachebounce unshare 0 2
CPU 2: unshare cacheline: 14815328 usec
CPU 0: unshare cacheline: 14815374 usec
[r...@qus19 ~]# ./cachebounce lockshare 0 2
CPU 0: lockshare cacheline: 4226878 usec
CPU 2: lockshare cacheline: 4226842 usec
[r...@qus19 ~]# ./cachebounce locknushare 0 2
cachebounce: Usage: cachebounce share|unshare|lockshare|lockunshare <cpu1> <cpu2>

[r...@qus19 ~]# ./cachebounce lockunshare 0 2
CPU 0: lockunshare cacheline: 4227432 usec
CPU 2: lockunshare cacheline: 4227375 usec



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Michael S. Tsirkin
On Thu, May 20, 2010 at 02:38:16PM +0930, Rusty Russell wrote:
> On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
> > On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
> > > > Note that this is an exclusive->shared->exclusive bounce only, too.
> > > >
> > > 
> > > A bounce is a bounce.
> > 
> > I tried to measure this to show that you were wrong, but I was only able
> > to show that you're right.  How annoying.  Test code below.
> 
> This time for sure!

The share option does not work on some
boxes unless I apply the following:
essentially, this adds mb() after each write
and before each read.  It seems to make sense to
me: we must update our counter before we
wait for the other side.

diff --git a/cachebounce.c b/cachebounce.c
index 0387027..ebe5a37 100644
--- a/cachebounce.c
+++ b/cachebounce.c
@@ -77,6 +77,7 @@ int main(int argc, char *argv[])
 				count++;
 				counter->cacheline1 = count;
 				count++;
+				__sync_synchronize();
 			}
 			break;
 		case UNSHARE:
@@ -86,6 +87,7 @@ int main(int argc, char *argv[])
 				count++;
 				counter->cacheline2 = count;
 				count++;
+				__sync_synchronize();
 			}
 			break;
 		case LOCKSHARE:
@@ -98,6 +100,7 @@ int main(int argc, char *argv[])
 			break;
 		case LOCKUNSHARE:
 			while (count < MAX_BOUNCES) {
+				__sync_synchronize();
 				/* Spin waiting for other side to change it. */
 				while (counter->cacheline1 != count);
 				__sync_val_compare_and_swap(&counter->cacheline2, count, count+1);
@@ -115,6 +118,7 @@ int main(int argc, char *argv[])
 				count++;
 				counter->cacheline1 = count;
 				count++;
+				__sync_synchronize();
 			}
 			break;
 		case UNSHARE:
@@ -124,6 +128,7 @@ int main(int argc, char *argv[])
 				count++;
 				counter->cacheline1 = count;
 				count++;
+				__sync_synchronize();
 			}
 			break;
 		case LOCKSHARE:


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Avi Kivity

On 05/23/2010 06:51 PM, Michael S. Tsirkin wrote:
>>> So locked version seems to be faster than unlocked,
>>> and share/unshare not to matter?
>>
>> May be due to the processor using the LOCK operation as a hint to
>> reserve the cacheline for a bit.
>
> Maybe we should use atomics on index then?

This should only be helpful if you access the cacheline several times in
a row.  That's not the case in virtio (or here).

I think the problem is that LOCKSHARE and SHARE are not symmetric, so
they can't be directly compared.

> OK, after adding mb in the code (patch will be sent separately), the test
> works on my workstation.  Locked is still fastest; unshared sometimes
> wins and sometimes loses against shared.
>
> [r...@qus19 ~]# ./cachebounce share 0 1
> CPU 0: share cacheline: 6638521 usec
> CPU 1: share cacheline: 6638478 usec

66 ns? nice.

> [r...@qus19 ~]# ./cachebounce share 0 2
> CPU 0: share cacheline: 14529198 usec
> CPU 2: share cacheline: 14529156 usec

140 ns, not too bad.  I hope I'm not misinterpreting the results.

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Michael S. Tsirkin
On Sun, May 23, 2010 at 07:03:10PM +0300, Avi Kivity wrote:
> On 05/23/2010 06:51 PM, Michael S. Tsirkin wrote:
>>>> So locked version seems to be faster than unlocked,
>>>> and share/unshare not to matter?
>>>
>>> May be due to the processor using the LOCK operation as a hint to
>>> reserve the cacheline for a bit.
>>>  
>> Maybe we should use atomics on index then?
>>
>
> This should only be helpful if you access the cacheline several times in  
> a row.  That's not the case in virtio (or here).

So why does it help?

> I think the problem is that LOCKSHARE and SHARE are not symmetric, so  
> they can't be directly compared.

In what sense are they not symmetric?

>> OK, after adding mb in the code (patch will be sent separately), the test
>> works on my workstation.  Locked is still fastest; unshared sometimes
>> wins and sometimes loses against shared.
>>
>> [r...@qus19 ~]# ./cachebounce share 0 1
>> CPU 0: share cacheline: 6638521 usec
>> CPU 1: share cacheline: 6638478 usec
>>
>
> 66 ns? nice.
>
>> [r...@qus19 ~]# ./cachebounce share 0 2
>> CPU 0: share cacheline: 14529198 usec
>> CPU 2: share cacheline: 14529156 usec
>>
>
> 140 ns, not too bad.  I hope I'm not misinterpreting the results.
>
> -- 
> error compiling committee.c: too many arguments to function


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Michael S. Tsirkin
On Sun, May 23, 2010 at 07:03:10PM +0300, Avi Kivity wrote:
> On 05/23/2010 06:51 PM, Michael S. Tsirkin wrote:
>>>> So locked version seems to be faster than unlocked,
>>>> and share/unshare not to matter?
>>>
>>> May be due to the processor using the LOCK operation as a hint to
>>> reserve the cacheline for a bit.
>>>  
>> Maybe we should use atomics on index then?
>>
>
> This should only be helpful if you access the cacheline several times in  
> a row.  That's not the case in virtio (or here).
>
> I think the problem is that LOCKSHARE and SHARE are not symmetric, so  
> they can't be directly compared.
>
>> OK, after adding mb in the code (patch will be sent separately), the test
>> works on my workstation.  Locked is still fastest; unshared sometimes
>> wins and sometimes loses against shared.
>>
>> [r...@qus19 ~]# ./cachebounce share 0 1
>> CPU 0: share cacheline: 6638521 usec
>> CPU 1: share cacheline: 6638478 usec
>>
>
> 66 ns? nice.
>
>> [r...@qus19 ~]# ./cachebounce share 0 2
>> CPU 0: share cacheline: 14529198 usec
>> CPU 2: share cacheline: 14529156 usec
>>
>
> 140 ns, not too bad.  I hope I'm not misinterpreting the results.
>
> -- 
> error compiling committee.c: too many arguments to function


Here's another box: the fastest option
is shared, the slowest unshared, and lock is in the middle.



[r...@virtlab16 testring]# sh run 0 2
CPU 2: share cacheline: 3304728 usec
CPU 0: share cacheline: 3304784 usec
CPU 0: unshare cacheline: 6283248 usec
CPU 2: unshare cacheline: 6283224 usec
CPU 2: lockshare cacheline: 4018567 usec
CPU 0: lockshare cacheline: 4018609 usec


CPU 2: lockunshare cacheline: 4041791 usec
CPU 0: lockunshare cacheline: 4041832 usec
[r...@virtlab16 testring]# 
[r...@virtlab16 testring]# 
[r...@virtlab16 testring]# 
[r...@virtlab16 testring]# sh run 0 1
CPU 1: share cacheline: 8306326 usec
CPU 0: share cacheline: 8306324 usec
CPU 0: unshare cacheline: 19571697 usec
CPU 1: unshare cacheline: 19571578 usec
CPU 0: lockshare cacheline: 11281566 usec
CPU 1: lockshare cacheline: 11281424 usec
CPU 0: lockunshare cacheline: 11276093 usec
CPU 1: lockunshare cacheline: 11275957 usec


[r...@virtlab16 testring]# sh run 0 3
CPU 0: share cacheline: 8288335 usec
CPU 3: share cacheline: 8288334 usec
CPU 0: unshare cacheline: 19107202 usec
CPU 3: unshare cacheline: 19107139 usec
CPU 0: lockshare cacheline: 11238915 usec
CPU 3: lockshare cacheline: 11238848 usec
CPU 3: lockunshare cacheline: 11132134 usec
CPU 0: lockunshare cacheline: 11132249 usec



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Avi Kivity

On 05/23/2010 07:30 PM, Michael S. Tsirkin wrote:
>>> Maybe we should use atomics on index then?
>>
>> This should only be helpful if you access the cacheline several times in
>> a row.  That's not the case in virtio (or here).
>
> So why does it help?

We actually do access the cacheline several times in a row here (but not
in virtio?):

> 	case SHARE:
> 		while (count < MAX_BOUNCES) {
> 			/* Spin waiting for other side to change it. */
> 			while (counter->cacheline1 != count);

Broadcast a read request.

> 			count++;
> 			counter->cacheline1 = count;

Broadcast an invalidate request.

> 			count++;
> 		}
> 		break;

> 	case LOCKSHARE:
> 		while (count < MAX_BOUNCES) {
> 			/* Spin waiting for other side to change it. */
> 			while (__sync_val_compare_and_swap(&counter->cacheline1, count, count+1)
> 			       != count);

Broadcast a 'read for ownership' request.

> 			count += 2;
> 		}
> 		break;


So RMW should certainly be faster using single-instruction RMW
operations (or using prefetchw).
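
Illustratively (hypothetical helper functions; the point is the
coherence traffic each access pattern generates on a remote line):

/* Plain increment: a load (line becomes shared) followed by a store
 * (invalidate the other copy) -- two coherence transactions. */
static void bump_plain(volatile unsigned int *p)
{
	(*p)++;
}

/* Locked RMW: a single read-for-ownership transaction does it all. */
static void bump_locked(volatile unsigned int *p)
{
	__sync_fetch_and_add(p, 1);
}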


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-24 Thread Michael S. Tsirkin
On Mon, May 24, 2010 at 09:37:05AM +0300, Avi Kivity wrote:
> On 05/23/2010 07:30 PM, Michael S. Tsirkin wrote:
>>>> Maybe we should use atomics on index then?
>>>
>>> This should only be helpful if you access the cacheline several times in
>>> a row.  That's not the case in virtio (or here).
>>
>> So why does it help?
>>
>
> We actually do access the cacheline several times in a row here (but not  
> in virtio?):
>
>>  case SHARE:
>>  while (count<  MAX_BOUNCES) {
>>  /* Spin waiting for other side to change it. */
>>  while (counter->cacheline1 != count);
>>
>
> Broadcast a read request.
>
>>  count++;
>>  counter->cacheline1 = count;
>>
>
> Broadcast an invalidate request.
>
>>  count++;
>>  }
>>  break;
>>
>>  case LOCKSHARE:
>>  while (count<  MAX_BOUNCES) {
>>  /* Spin waiting for other side to change it. */
>>  while (__sync_val_compare_and_swap(&counter->cacheline1, count, count+1)
>>         != count);
>>
>
> Broadcast a 'read for ownership' request.
>
>>  count += 2;
>>  }
>>  break;
>>
>
> So RMW should certainly be faster using single-instruction RMW
> operations (or using prefetchw).

Okay, but why is lockunshare faster than unshare?

> -- 
> Do not meddle in the internals of kernels, for they are subtle and quick to 
> panic.


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-24 Thread Avi Kivity

On 05/24/2010 11:05 AM, Michael S. Tsirkin wrote:
> Okay, but why is lockunshare faster than unshare?

No idea.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.
