Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On 05/24/2010 11:05 AM, Michael S. Tsirkin wrote:
> Okay, but why is lockunshare faster than unshare?

No idea.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Mon, May 24, 2010 at 09:37:05AM +0300, Avi Kivity wrote:
> On 05/23/2010 07:30 PM, Michael S. Tsirkin wrote:
>>>> Maybe we should use atomics on index then?
>>> This should only be helpful if you access the cacheline several times in
>>> a row.  That's not the case in virtio (or here).
>>
>> So why does it help?
>
> We actually do access the cacheline several times in a row here (but not
> in virtio?):
>
>> case SHARE:
>>         while (count < MAX_BOUNCES) {
>>                 /* Spin waiting for other side to change it. */
>>                 while (counter->cacheline1 != count);
>
> Broadcast a read request.
>
>>                 count++;
>>                 counter->cacheline1 = count;
>
> Broadcast an invalidate request.
>
>>                 count++;
>>         }
>>         break;
>>
>> case LOCKSHARE:
>>         while (count < MAX_BOUNCES) {
>>                 /* Spin waiting for other side to change it. */
>>                 while (__sync_val_compare_and_swap(&counter->cacheline1, count, count+1)
>>                        != count);
>
> Broadcast a 'read for ownership' request.
>
>>                 count += 2;
>>         }
>>         break;
>
> So RMW should certainly be faster using single-instruction RMW
> operations (or using prefetchw).

Okay, but why is lockunshare faster than unshare?

> --
> Do not meddle in the internals of kernels, for they are subtle and quick to
> panic.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On 05/23/2010 07:30 PM, Michael S. Tsirkin wrote:
>>> Maybe we should use atomics on index then?
>> This should only be helpful if you access the cacheline several times in
>> a row.  That's not the case in virtio (or here).
> So why does it help?

We actually do access the cacheline several times in a row here (but not in virtio?):

> case SHARE:
>         while (count < MAX_BOUNCES) {
>                 /* Spin waiting for other side to change it. */
>                 while (counter->cacheline1 != count);

Broadcast a read request.

>                 count++;
>                 counter->cacheline1 = count;

Broadcast an invalidate request.

>                 count++;
>         }
>         break;
>
> case LOCKSHARE:
>         while (count < MAX_BOUNCES) {
>                 /* Spin waiting for other side to change it. */
>                 while (__sync_val_compare_and_swap(&counter->cacheline1, count, count+1)
>                        != count);

Broadcast a 'read for ownership' request.

>                 count += 2;
>         }
>         break;

So RMW should certainly be faster using single-instruction RMW operations (or using prefetchw).

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Sun, May 23, 2010 at 07:03:10PM +0300, Avi Kivity wrote:
> On 05/23/2010 06:51 PM, Michael S. Tsirkin wrote:
>>> So locked version seems to be faster than unlocked, and share/unshare not to matter?
>>> May be due to the processor using the LOCK operation as a hint to
>>> reserve the cacheline for a bit.
>>
>> Maybe we should use atomics on index then?
>
> This should only be helpful if you access the cacheline several times in
> a row.  That's not the case in virtio (or here).
>
> I think the problem is that LOCKSHARE and SHARE are not symmetric, so
> they can't be directly compared.
>
>> OK, after adding mb in code patch will be sent separately,
>> the test works for my workstation. locked is still fastest,
>> unshared sometimes shows wins and sometimes loses over shared.
>>
>> [r...@qus19 ~]# ./cachebounce share 0 1
>> CPU 0: share cacheline: 6638521 usec
>> CPU 1: share cacheline: 6638478 usec
>
> 66 ns? nice.
>
>> [r...@qus19 ~]# ./cachebounce share 0 2
>> CPU 0: share cacheline: 14529198 usec
>> CPU 2: share cacheline: 14529156 usec
>
> 140 ns, not too bad.  I hope I'm not misinterpreting the results.
>
> --
> error compiling committee.c: too many arguments to function

Here's another box: here the fastest option is shared, slowest unshared, lock is in the middle.
[r...@virtlab16 testring]# sh run 0 2
CPU 2: share cacheline: 3304728 usec
CPU 0: share cacheline: 3304784 usec
CPU 0: unshare cacheline: 6283248 usec
CPU 2: unshare cacheline: 6283224 usec
CPU 2: lockshare cacheline: 4018567 usec
CPU 0: lockshare cacheline: 4018609 usec
CPU 2: lockunshare cacheline: 4041791 usec
CPU 0: lockunshare cacheline: 4041832 usec
[r...@virtlab16 testring]# sh run 0 1
CPU 1: share cacheline: 8306326 usec
CPU 0: share cacheline: 8306324 usec
CPU 0: unshare cacheline: 19571697 usec
CPU 1: unshare cacheline: 19571578 usec
CPU 0: lockshare cacheline: 11281566 usec
CPU 1: lockshare cacheline: 11281424 usec
CPU 0: lockunshare cacheline: 11276093 usec
CPU 1: lockunshare cacheline: 11275957 usec
[r...@virtlab16 testring]# sh run 0 3
CPU 0: share cacheline: 8288335 usec
CPU 3: share cacheline: 8288334 usec
CPU 0: unshare cacheline: 19107202 usec
CPU 3: unshare cacheline: 19107139 usec
CPU 0: lockshare cacheline: 11238915 usec
CPU 3: lockshare cacheline: 11238848 usec
CPU 3: lockunshare cacheline: 11132134 usec
CPU 0: lockunshare cacheline: 11132249 usec
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Sun, May 23, 2010 at 07:03:10PM +0300, Avi Kivity wrote:
> On 05/23/2010 06:51 PM, Michael S. Tsirkin wrote:
>>> So locked version seems to be faster than unlocked, and share/unshare not to matter?
>>> May be due to the processor using the LOCK operation as a hint to
>>> reserve the cacheline for a bit.
>>
>> Maybe we should use atomics on index then?
>
> This should only be helpful if you access the cacheline several times in
> a row.  That's not the case in virtio (or here).

So why does it help?

> I think the problem is that LOCKSHARE and SHARE are not symmetric, so
> they can't be directly compared.

In what sense are they not symmetric?

>> OK, after adding mb in code patch will be sent separately,
>> the test works for my workstation. locked is still fastest,
>> unshared sometimes shows wins and sometimes loses over shared.
>>
>> [r...@qus19 ~]# ./cachebounce share 0 1
>> CPU 0: share cacheline: 6638521 usec
>> CPU 1: share cacheline: 6638478 usec
>
> 66 ns? nice.
>
>> [r...@qus19 ~]# ./cachebounce share 0 2
>> CPU 0: share cacheline: 14529198 usec
>> CPU 2: share cacheline: 14529156 usec
>
> 140 ns, not too bad.  I hope I'm not misinterpreting the results.
>
> --
> error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On 05/23/2010 06:51 PM, Michael S. Tsirkin wrote:
>>> So locked version seems to be faster than unlocked, and share/unshare not to matter?
>> May be due to the processor using the LOCK operation as a hint to
>> reserve the cacheline for a bit.
> Maybe we should use atomics on index then?

This should only be helpful if you access the cacheline several times in a row.  That's not the case in virtio (or here).

I think the problem is that LOCKSHARE and SHARE are not symmetric, so they can't be directly compared.

> OK, after adding mb in code patch will be sent separately,
> the test works for my workstation. locked is still fastest,
> unshared sometimes shows wins and sometimes loses over shared.
>
> [r...@qus19 ~]# ./cachebounce share 0 1
> CPU 0: share cacheline: 6638521 usec
> CPU 1: share cacheline: 6638478 usec

66 ns? nice.

> [r...@qus19 ~]# ./cachebounce share 0 2
> CPU 0: share cacheline: 14529198 usec
> CPU 2: share cacheline: 14529156 usec

140 ns, not too bad.  I hope I'm not misinterpreting the results.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, May 20, 2010 at 02:38:16PM +0930, Rusty Russell wrote:
> On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
> > On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
> > > > Note that this is an exclusive->shared->exclusive bounce only, too.
> > >
> > > A bounce is a bounce.
> >
> > I tried to measure this to show that you were wrong, but I was only able
> > to show that you're right.  How annoying.  Test code below.
>
> This time for sure!

The share option does not work on some boxes unless I apply the following: essentially, this adds mb() after each write and before read.  It seems to make sense to me: we must update our counter before we wait for the other side.

diff --git a/cachebounce.c b/cachebounce.c
index 0387027..ebe5a37 100644
--- a/cachebounce.c
+++ b/cachebounce.c
@@ -77,6 +77,7 @@ int main(int argc, char *argv[])
 				count++;
 				counter->cacheline1 = count;
 				count++;
+				__sync_synchronize();
 			}
 			break;
 		case UNSHARE:
@@ -86,6 +87,7 @@ int main(int argc, char *argv[])
 				count++;
 				counter->cacheline2 = count;
 				count++;
+				__sync_synchronize();
 			}
 			break;
 		case LOCKSHARE:
@@ -98,6 +100,7 @@ int main(int argc, char *argv[])
 			break;
 		case LOCKUNSHARE:
 			while (count < MAX_BOUNCES) {
+				__sync_synchronize();
 				/* Spin waiting for other side to change it. */
 				while (counter->cacheline1 != count);
 				__sync_val_compare_and_swap(&counter->cacheline2, count, count+1);
@@ -115,6 +118,7 @@ int main(int argc, char *argv[])
 				count++;
 				counter->cacheline1 = count;
 				count++;
+				__sync_synchronize();
 			}
 			break;
 		case UNSHARE:
@@ -124,6 +128,7 @@ int main(int argc, char *argv[])
 				count++;
 				counter->cacheline1 = count;
 				count++;
+				__sync_synchronize();
 			}
 			break;
 		case LOCKSHARE:
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Sun, May 23, 2010 at 06:41:33PM +0300, Avi Kivity wrote: > On 05/23/2010 06:31 PM, Michael S. Tsirkin wrote: >> On Thu, May 20, 2010 at 02:38:16PM +0930, Rusty Russell wrote: >> >>> On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote: >>> On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote: >> Note that this is a exclusive->shared->exclusive bounce only, too. >> >> > A bounce is a bounce. > I tried to measure this to show that you were wrong, but I was only able to show that you're right. How annoying. Test code below. >>> This time for sure! >>> >> >> What do you see? >> On my laptop: >> [...@tuck testring]$ ./rusty1 share 0 1 >> CPU 1: share cacheline: 2820410 usec >> CPU 0: share cacheline: 2823441 usec >> [...@tuck testring]$ ./rusty1 unshare 0 1 >> CPU 0: unshare cacheline: 2783014 usec >> CPU 1: unshare cacheline: 2782951 usec >> [...@tuck testring]$ ./rusty1 lockshare 0 1 >> CPU 1: lockshare cacheline: 1888495 usec >> CPU 0: lockshare cacheline: 1888544 usec >> [...@tuck testring]$ ./rusty1 lockunshare 0 1 >> CPU 0: lockunshare cacheline: 1889854 usec >> CPU 1: lockunshare cacheline: 1889804 usec >> > > Ugh, can the timing be normalized per operation? This is unreadable. > >> So locked version seems to be faster than unlocked, >> and share/unshare not to matter? >> > > May be due to the processor using the LOCK operation as a hint to > reserve the cacheline for a bit. Maybe we should use atomics on index then? 
>> same on a workstation:
>> [r...@qus19 ~]# ./rusty1 unshare 0 1
>> CPU 0: unshare cacheline: 6037002 usec
>> CPU 1: unshare cacheline: 6036977 usec
>> [r...@qus19 ~]# ./rusty1 lockunshare 0 1
>> CPU 1: lockunshare cacheline: 5734362 usec
>> CPU 0: lockunshare cacheline: 5734389 usec
>> [r...@qus19 ~]# ./rusty1 lockshare 0 1
>> CPU 1: lockshare cacheline: 5733537 usec
>> CPU 0: lockshare cacheline: 5733564 usec
>>
>> using another pair of CPUs gives more drastic results:
>>
>> [r...@qus19 ~]# ./rusty1 lockshare 0 2
>> CPU 2: lockshare cacheline: 4226990 usec
>> CPU 0: lockshare cacheline: 4227038 usec
>> [r...@qus19 ~]# ./rusty1 lockunshare 0 2
>> CPU 0: lockunshare cacheline: 4226707 usec
>> CPU 2: lockunshare cacheline: 4226662 usec
>> [r...@qus19 ~]# ./rusty1 unshare 0 2
>> CPU 0: unshare cacheline: 14815048 usec
>> CPU 2: unshare cacheline: 14815006 usec
>
> That's expected.  Hyperthread will be fastest (shared L1), shared L2/L3
> will be slower, cross-socket will suck.

OK, after adding mb in code (patch will be sent separately), the test works on my workstation.  locked is still fastest, unshared sometimes shows wins and sometimes loses over shared.
[r...@qus19 ~]# ./cachebounce share 0 1
CPU 0: share cacheline: 6638521 usec
CPU 1: share cacheline: 6638478 usec
[r...@qus19 ~]# ./cachebounce unshare 0 1
CPU 0: unshare cacheline: 6037415 usec
CPU 1: unshare cacheline: 6037374 usec
[r...@qus19 ~]# ./cachebounce lockshare 0 1
CPU 0: lockshare cacheline: 5734017 usec
CPU 1: lockshare cacheline: 5733978 usec
[r...@qus19 ~]# ./cachebounce lockunshare 0 1
CPU 1: lockunshare cacheline: 5733260 usec
CPU 0: lockunshare cacheline: 5733307 usec
[r...@qus19 ~]# ./cachebounce share 0 2
CPU 0: share cacheline: 14529198 usec
CPU 2: share cacheline: 14529156 usec
[r...@qus19 ~]# ./cachebounce unshare 0 2
CPU 2: unshare cacheline: 14815328 usec
CPU 0: unshare cacheline: 14815374 usec
[r...@qus19 ~]# ./cachebounce lockshare 0 2
CPU 0: lockshare cacheline: 4226878 usec
CPU 2: lockshare cacheline: 4226842 usec
[r...@qus19 ~]# ./cachebounce locknushare 0 2
cachebounce: Usage: cachebounce share|unshare|lockshare|lockunshare <cpu1> <cpu2>
[r...@qus19 ~]# ./cachebounce lockunshare 0 2
CPU 0: lockunshare cacheline: 4227432 usec
CPU 2: lockunshare cacheline: 4227375 usec
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On 05/23/2010 06:31 PM, Michael S. Tsirkin wrote:
> On Thu, May 20, 2010 at 02:38:16PM +0930, Rusty Russell wrote:
>> On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
>>> On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
>>>>> Note that this is an exclusive->shared->exclusive bounce only, too.
>>>>
>>>> A bounce is a bounce.
>>>
>>> I tried to measure this to show that you were wrong, but I was only able
>>> to show that you're right.  How annoying.  Test code below.
>>
>> This time for sure!
>
> What do you see?
> On my laptop:
> [...@tuck testring]$ ./rusty1 share 0 1
> CPU 1: share cacheline: 2820410 usec
> CPU 0: share cacheline: 2823441 usec
> [...@tuck testring]$ ./rusty1 unshare 0 1
> CPU 0: unshare cacheline: 2783014 usec
> CPU 1: unshare cacheline: 2782951 usec
> [...@tuck testring]$ ./rusty1 lockshare 0 1
> CPU 1: lockshare cacheline: 1888495 usec
> CPU 0: lockshare cacheline: 1888544 usec
> [...@tuck testring]$ ./rusty1 lockunshare 0 1
> CPU 0: lockunshare cacheline: 1889854 usec
> CPU 1: lockunshare cacheline: 1889804 usec

Ugh, can the timing be normalized per operation?  This is unreadable.

> So locked version seems to be faster than unlocked,
> and share/unshare not to matter?

May be due to the processor using the LOCK operation as a hint to reserve the cacheline for a bit.
> same on a workstation:
> [r...@qus19 ~]# ./rusty1 unshare 0 1
> CPU 0: unshare cacheline: 6037002 usec
> CPU 1: unshare cacheline: 6036977 usec
> [r...@qus19 ~]# ./rusty1 lockunshare 0 1
> CPU 1: lockunshare cacheline: 5734362 usec
> CPU 0: lockunshare cacheline: 5734389 usec
> [r...@qus19 ~]# ./rusty1 lockshare 0 1
> CPU 1: lockshare cacheline: 5733537 usec
> CPU 0: lockshare cacheline: 5733564 usec
>
> using another pair of CPUs gives more drastic results:
>
> [r...@qus19 ~]# ./rusty1 lockshare 0 2
> CPU 2: lockshare cacheline: 4226990 usec
> CPU 0: lockshare cacheline: 4227038 usec
> [r...@qus19 ~]# ./rusty1 lockunshare 0 2
> CPU 0: lockunshare cacheline: 4226707 usec
> CPU 2: lockunshare cacheline: 4226662 usec
> [r...@qus19 ~]# ./rusty1 unshare 0 2
> CPU 0: unshare cacheline: 14815048 usec
> CPU 2: unshare cacheline: 14815006 usec

That's expected.  Hyperthread will be fastest (shared L1), shared L2/L3 will be slower, cross-socket will suck.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, May 20, 2010 at 02:38:16PM +0930, Rusty Russell wrote:
> On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
> > On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
> > > > Note that this is an exclusive->shared->exclusive bounce only, too.
> > >
> > > A bounce is a bounce.
> >
> > I tried to measure this to show that you were wrong, but I was only able
> > to show that you're right.  How annoying.  Test code below.
>
> This time for sure!

What do you see?  On my laptop:

[...@tuck testring]$ ./rusty1 share 0 1
CPU 1: share cacheline: 2820410 usec
CPU 0: share cacheline: 2823441 usec
[...@tuck testring]$ ./rusty1 unshare 0 1
CPU 0: unshare cacheline: 2783014 usec
CPU 1: unshare cacheline: 2782951 usec
[...@tuck testring]$ ./rusty1 lockshare 0 1
CPU 1: lockshare cacheline: 1888495 usec
CPU 0: lockshare cacheline: 1888544 usec
[...@tuck testring]$ ./rusty1 lockunshare 0 1
CPU 0: lockunshare cacheline: 1889854 usec
CPU 1: lockunshare cacheline: 1889804 usec

So locked version seems to be faster than unlocked, and share/unshare not to matter?

same on a workstation:

[r...@qus19 ~]# ./rusty1 unshare 0 1
CPU 0: unshare cacheline: 6037002 usec
CPU 1: unshare cacheline: 6036977 usec
[r...@qus19 ~]# ./rusty1 lockunshare 0 1
CPU 1: lockunshare cacheline: 5734362 usec
CPU 0: lockunshare cacheline: 5734389 usec
[r...@qus19 ~]# ./rusty1 lockshare 0 1
CPU 1: lockshare cacheline: 5733537 usec
CPU 0: lockshare cacheline: 5733564 usec

using another pair of CPUs gives more drastic results:

[r...@qus19 ~]# ./rusty1 lockshare 0 2
CPU 2: lockshare cacheline: 4226990 usec
CPU 0: lockshare cacheline: 4227038 usec
[r...@qus19 ~]# ./rusty1 lockunshare 0 2
CPU 0: lockunshare cacheline: 4226707 usec
CPU 2: lockunshare cacheline: 4226662 usec
[r...@qus19 ~]# ./rusty1 unshare 0 2
CPU 0: unshare cacheline: 14815048 usec
CPU 2: unshare cacheline: 14815006 usec

The share test seems to never finish on the workstation.  I am debugging this.
--
MST
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On 05/20/2010 05:34 PM, Rusty Russell wrote:
>> Have just one ring, no indexes.  The producer places descriptors into
>> the ring and updates the head.  The consumer copies out descriptors to
>> be processed and copies back in completed descriptors.  Chaining is
>> always linear.  The descriptors contain a tag that allows the producer
>> to identify the completion.
>
> This could definitely work.  The original reason for the page boundaries
> was for untrusted inter-guest communication: with appropriate page
> protections they could see each other's rings and a simple inter-guest
> copy hypercall could verify that the other guest really exposed that
> data via virtio ring.  But, cute as that is, we never did that.  And
> it's not clear that it wins much over simply having the hypervisor read
> both rings directly.

AFAICS having separate avail_ring/used_ring/desc_pool is orthogonal to this cuteness.

>>> Can we do better?  The obvious idea is to try to get rid of last_used
>>> and used, and use the ring itself.  We would use an invalid entry to
>>> mark the head of the ring.
>>
>> Interesting!  So a peer will read until it hits a wall.  But how to
>> update the wall atomically?
>>
>> Maybe we can have a flag in the descriptor indicate headness or
>> tailness.  Update looks ugly though: write descriptor with head flag,
>> write next descriptor with head flag, remove flag from previous
>> descriptor.
>
> I was thinking a separate magic "invalid" entry.  To publish a
> 3-descriptor chain, you would write descriptors 2 and 3, write an
> invalid entry at 4, barrier, write entry 1.  It is a bit ugly, yes, but
> not terrible.

Worth exploring.  This amortizes the indexes into the ring, a good thing.  Another thing we can do is place the tail a half ring away from the head (and limit ring utilization to 50%), reducing bounces on short kicks.  Or equivalently have an avail ring and used ring, but both containing tagged descriptors instead of pointers to descriptors.
> I think that a simple simulator for this is worth writing, which tracks
> cacheline moves under various fullness scenarios...

Yup.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 20 May 2010 04:30:56 pm Avi Kivity wrote:
> On 05/20/2010 08:01 AM, Rusty Russell wrote:
> >> A device with out of order
> >> completion (like virtio-blk) will quickly randomize the unused
> >> descriptor indexes, so every descriptor fetch will require a bounce.
> >>
> >> In contrast, if the rings hold the descriptors themselves instead of
> >> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
> >> every descriptor, amortized.
> >
> > We already have indirect, this would be a logical next step.  So let's
> > think about it.  The avail ring would contain 64 bit values, the used ring
> > would contain indexes into the avail ring.
>
> Have just one ring, no indexes.  The producer places descriptors into
> the ring and updates the head.  The consumer copies out descriptors to
> be processed and copies back in completed descriptors.  Chaining is
> always linear.  The descriptors contain a tag that allows the producer to
> identify the completion.

This could definitely work.  The original reason for the page boundaries was for untrusted inter-guest communication: with appropriate page protections they could see each other's rings and a simple inter-guest copy hypercall could verify that the other guest really exposed that data via virtio ring.  But, cute as that is, we never did that.  And it's not clear that it wins much over simply having the hypervisor read both rings directly.

> > Can we do better?  The obvious idea is to try to get rid of last_used and
> > used, and use the ring itself.  We would use an invalid entry to mark the
> > head of the ring.
>
> Interesting!  So a peer will read until it hits a wall.  But how to
> update the wall atomically?
>
> Maybe we can have a flag in the descriptor indicate headness or
> tailness.  Update looks ugly though: write descriptor with head flag,
> write next descriptor with head flag, remove flag from previous descriptor.

I was thinking a separate magic "invalid" entry.
To publish a 3-descriptor chain, you would write descriptors 2 and 3, write an invalid entry at 4, barrier, write entry 1.  It is a bit ugly, yes, but not terrible.

I think that a simple simulator for this is worth writing, which tracks cacheline moves under various fullness scenarios...

Cheers,
Rusty.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, May 20, 2010 at 02:31:50PM +0930, Rusty Russell wrote:
> Can we do better?  The obvious idea is to try to get rid of last_used and
> used, and use the ring itself.  We would use an invalid entry to mark the
> head of the ring.
>
> Any other thoughts?
> Rusty.

We also need a way to avoid interrupts at least while we are processing the ring.

--
MST
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On 05/20/2010 08:01 AM, Rusty Russell wrote:
>> A device with out of order completion (like virtio-blk) will quickly
>> randomize the unused descriptor indexes, so every descriptor fetch will
>> require a bounce.
>>
>> In contrast, if the rings hold the descriptors themselves instead of
>> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
>> every descriptor, amortized.
>
> We already have indirect, this would be a logical next step.  So let's
> think about it.  The avail ring would contain 64 bit values, the used
> ring would contain indexes into the avail ring.

Have just one ring, no indexes.  The producer places descriptors into the ring and updates the head.  The consumer copies out descriptors to be processed and copies back in completed descriptors.  Chaining is always linear.  The descriptors contain a tag that allows the producer to identify the completion.

Indirect only pays when there are enough descriptors in it to fill a couple of cache lines.  Otherwise it's an extra bounce.  We will always bounce here, that's what happens when transferring data.  The question is how many cache lines per descriptor.  A pointer adds 1 bounce, linear descriptors cost 1/4 bounce, chained descriptors cost a bounce.  So best is one ring of linearly chained descriptors.  Indirect works when you have large requests (like block).

> So client writes descriptor page and adds to avail ring, then writes to
> index.  Server reads index, avail ring, descriptor page (3).  Writes used
> entry (1).  Updates last_used (1).  Client reads used (1), derefs avail
> (1), updates last_used (1), cleans descriptor page (1).  That's 9
> cacheline transfers, worst case.  Best case of a half-full ring in steady
> state, assuming 128-byte cache lines, the avail ring costs are 1/16, the
> used entry is 1/64.  This drops it to 6 and 9/64 transfers.

Cache lines are 64 bytes these days.

With a single ring, client writes descriptors (ceil(N/4)), updates head (1).
Server reads head (1), copies out descriptors (ceil(N/4)), issues requests, copies back completions (ceil(N/4)), updates tail (1).  Client reads back tail and descriptors (1 + ceil(N/4)).

Worst case: 4 + 4 * ceil(N/4).  Best case I think this drops by half.

> (Note, the current scheme adds 2 more cacheline transfers, for the
> descriptor table, worst case.

2 bounces per descriptor due to random access.

> Assuming indirect, we get 2/8 xfer best case.  Either way, it's not the
> main source of cacheline xfers).

Indirect adds a double bounce to get to the descriptor table, but any descriptors there are accessed linearly.  It's only good when you have large chains.

> Can we do better?  The obvious idea is to try to get rid of last_used and
> used, and use the ring itself.  We would use an invalid entry to mark the
> head of the ring.

Interesting!  So a peer will read until it hits a wall.  But how to update the wall atomically?

Maybe we can have a flag in the descriptor indicate headness or tailness.  Update looks ugly though: write descriptor with head flag, write next descriptor with head flag, remove flag from previous descriptor.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On 05/20/2010 01:33 AM, Michael S. Tsirkin wrote:
>> Virtio is already way too bouncy due to the indirection between the
>> avail/used rings and the descriptor pool.  A device with out of order
>> completion (like virtio-blk) will quickly randomize the unused
>> descriptor indexes, so every descriptor fetch will require a bounce.
>>
>> In contrast, if the rings hold the descriptors themselves instead of
>> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
>> every descriptor, amortized.
>
> On the other hand, consider that on fast path we are never using all of
> the ring.  With a good allocator we might be able to keep reusing only a
> small part of the ring, instead of wrapping around all of it all of the
> time.

It's still suboptimal: we have to bounce both the avail/used rings and the descriptor pool, compared to just the descriptor ring with a direct design.  Plus we don't need a fancy allocator.  When amortizing cachelines, simple data structures win.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
> On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
> > > Note that this is an exclusive->shared->exclusive bounce only, too.
> >
> > A bounce is a bounce.
>
> I tried to measure this to show that you were wrong, but I was only able
> to show that you're right.  How annoying.  Test code below.

This time for sure!

#define _GNU_SOURCE
#include <sched.h>
#include <err.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

/* We share memory via an mmap. */
struct counter {
	unsigned int cacheline1;
	char pad[256];
	unsigned int cacheline2;
};

#define MAX_BOUNCES 100000000

enum mode {
	SHARE,
	UNSHARE,
	LOCKSHARE,
	LOCKUNSHARE,
};

int main(int argc, char *argv[])
{
	cpu_set_t cpuset;
	volatile struct counter *counter;
	struct timeval start, stop;
	bool child;
	unsigned int count;
	uint64_t usec;
	enum mode mode;

	if (argc != 4)
		errx(1, "Usage: cachebounce share|unshare|lockshare|lockunshare <cpu1> <cpu2>");

	if (strcmp(argv[1], "share") == 0)
		mode = SHARE;
	else if (strcmp(argv[1], "unshare") == 0)
		mode = UNSHARE;
	else if (strcmp(argv[1], "lockshare") == 0)
		mode = LOCKSHARE;
	else if (strcmp(argv[1], "lockunshare") == 0)
		mode = LOCKUNSHARE;
	else
		errx(1, "Usage: cachebounce share|unshare|lockshare|lockunshare <cpu1> <cpu2>");

	CPU_ZERO(&cpuset);
	counter = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0);
	if (counter == MAP_FAILED)
		err(1, "Mapping page");

	/* Fault it in. */
	counter->cacheline1 = counter->cacheline2 = 0;

	child = (fork() == 0);
	CPU_SET(atoi(argv[2 + child]), &cpuset);
	if (sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset) != 0)
		err(1, "Calling sched_setaffinity()");

	gettimeofday(&start, NULL);

	if (child) {
		count = 1;
		switch (mode) {
		case SHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case UNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline2 = count;
				count++;
			}
			break;
		case LOCKSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (__sync_val_compare_and_swap(&counter->cacheline1, count, count+1)
				       != count);
				count += 2;
			}
			break;
		case LOCKUNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				__sync_val_compare_and_swap(&counter->cacheline2, count, count+1);
				count += 2;
			}
			break;
		}
	} else {
		count = 0;
		switch (mode) {
		case SHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case UNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline2 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case LOCKSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (__sync_val_compare_and_swap(&counter->cacheline1, count, count+1)
				       != count);
				count += 2;
			}
			break;
		case LOCKUNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline2 != count);
				__sync_val_compare_and_swap(&counter->cacheline1, count, count+1);
				count += 2;
			}
			break;
		}
	}

	gettimeofday(&stop, NULL);
	usec = (stop.tv_sec - start.tv_sec) * 1000000ULL
		+ (stop.tv_usec - start.tv_usec);
	printf("CPU %s: %s cacheline: %llu usec\n",
	       argv[2 + child], argv[1], (unsigned long long)usec);
	return 0;
}
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
> > Note that this is an exclusive->shared->exclusive bounce only, too.
>
> A bounce is a bounce.

I tried to measure this to show that you were wrong, but I was only able
to show that you're right.  How annoying.  Test code below.

> Virtio is already way too bouncy due to the indirection between the
> avail/used rings and the descriptor pool.

I tried to do a more careful analysis below, and I think this is an
overstatement.

> A device with out of order completion (like virtio-blk) will quickly
> randomize the unused descriptor indexes, so every descriptor fetch
> will require a bounce.
>
> In contrast, if the rings hold the descriptors themselves instead of
> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines
> for every descriptor, amortized.

We already have indirect descriptors; this would be a logical next step.
So let's think about it.  The avail ring would contain 64-bit values,
the used ring would contain indexes into the avail ring.

So the client writes the descriptor page and adds to the avail ring,
then writes to the index.  The server reads the index, the avail ring,
and the descriptor page (3).  It writes the used entry (1) and updates
last_used (1).  The client reads used (1), derefs avail (1), updates
last_used (1), and cleans the descriptor page (1).

That's 9 cacheline transfers, worst case.  In the best case of a
half-full ring in steady state, assuming 128-byte cache lines, the
avail ring costs are 1/16 and the used entry is 1/64.  This drops it
to 6 and 9/64 transfers.

(Note: the current scheme adds 2 more cacheline transfers, for the
descriptor table, worst case.  Assuming indirect, we get 2/8 xfer best
case.  Either way, it's not the main source of cacheline xfers.)

Can we do better?  The obvious idea is to try to get rid of last_used
and used, and use the ring itself: we would use an invalid entry to
mark the head of the ring.

Any other thoughts?
Rusty.
-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Wed, May 19, 2010 at 11:06:42AM +0300, Avi Kivity wrote:
> On 05/19/2010 10:39 AM, Rusty Russell wrote:
>> I think we're talking about the last 2 entries of the avail ring.
>> That means the worst case is 1 false bounce every time around the
>> ring.
>
> It's low, but why introduce an inefficiency when you can avoid doing
> it for the same effort?
>
>> I think that's why we're debating it instead of measuring it :)
>
> Measure before optimize is good for code but not for protocols.
> Protocols have to be robust against future changes.  Virtio is warty
> enough already, we can't keep doing local optimizations.
>
>> Note that this is an exclusive->shared->exclusive bounce only, too.
>
> A bounce is a bounce.
>
> Virtio is already way too bouncy due to the indirection between the
> avail/used rings and the descriptor pool.  A device with out of order
> completion (like virtio-blk) will quickly randomize the unused
> descriptor indexes, so every descriptor fetch will require a bounce.
>
> In contrast, if the rings hold the descriptors themselves instead of
> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines
> for every descriptor, amortized.

On the other hand, consider that on the fast path we are never using
all of the ring.  With a good allocator we might be able to keep
reusing only a small part of the ring, instead of wrapping around all
of it all of the time.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On 05/19/2010 10:39 AM, Rusty Russell wrote:
> I think we're talking about the last 2 entries of the avail ring.
> That means the worst case is 1 false bounce every time around the
> ring.

It's low, but why introduce an inefficiency when you can avoid doing it
for the same effort?

> I think that's why we're debating it instead of measuring it :)

Measure before optimize is good for code but not for protocols.
Protocols have to be robust against future changes.  Virtio is warty
enough already, we can't keep doing local optimizations.

> Note that this is an exclusive->shared->exclusive bounce only, too.

A bounce is a bounce.

Virtio is already way too bouncy due to the indirection between the
avail/used rings and the descriptor pool.  A device with out of order
completion (like virtio-blk) will quickly randomize the unused
descriptor indexes, so every descriptor fetch will require a bounce.

In contrast, if the rings hold the descriptors themselves instead of
pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
every descriptor, amortized.

--
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Wed, 12 May 2010 04:57:22 am Avi Kivity wrote:
> On 05/07/2010 06:23 AM, Rusty Russell wrote:
>> On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
>>> On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
>>>> +	/* We publish the last-seen used index at the end of the available ring.
>>>> +	 * It is at the end for backwards compatibility. */
>>>> +	vr->last_used_idx = &(vr)->avail->ring[num];
>>>> +	/* Verify that last used index does not spill over the used ring. */
>>>> +	BUG_ON((void *)vr->last_used_idx +
>>>> +	       sizeof *vr->last_used_idx > (void *)vr->used);
>>>>  }
>>>
>>> Shouldn't this be on its own cache line?
>>
>> It's next to the available ring; because that's where the guest
>> publishes its data.  That whole page is guest-write, host-read.
>>
>> Putting it on a cacheline by itself would be a slight pessimization;
>> the host cpu would have to get the last_used_idx cacheline and the
>> avail descriptor cacheline every time.  This way, they are sometimes
>> the same cacheline.
>
> If one peer writes the tail of the available ring, while the other
> reads last_used_idx, it's a false bounce, no?

I think we're talking about the last 2 entries of the avail ring.  That
means the worst case is 1 false bounce every time around the ring.  I
think that's why we're debating it instead of measuring it :)

Note that this is an exclusive->shared->exclusive bounce only, too.

Cheers,
Rusty.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Tue, May 11, 2010 at 10:27:22PM +0300, Avi Kivity wrote:
> On 05/07/2010 06:23 AM, Rusty Russell wrote:
>> On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
>>> On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
>>>> +	/* We publish the last-seen used index at the end of the available ring.
>>>> +	 * It is at the end for backwards compatibility. */
>>>> +	vr->last_used_idx = &(vr)->avail->ring[num];
>>>> +	/* Verify that last used index does not spill over the used ring. */
>>>> +	BUG_ON((void *)vr->last_used_idx +
>>>> +	       sizeof *vr->last_used_idx > (void *)vr->used);
>>>>  }
>>>
>>> Shouldn't this be on its own cache line?
>>
>> It's next to the available ring; because that's where the guest
>> publishes its data.  That whole page is guest-write, host-read.
>>
>> Putting it on a cacheline by itself would be a slight pessimization;
>> the host cpu would have to get the last_used_idx cacheline and the
>> avail descriptor cacheline every time.  This way, they are sometimes
>> the same cacheline.
>
> If one peer writes the tail of the available ring, while the other
> reads last_used_idx, it's a false bounce, no?
>
> Having things on the same cacheline is only worthwhile if they are
> accessed at the same time.

Yes, this is what I was trying to say.  avail flags and used index *are*
accessed at the same time, so there could be an advantage to sharing a
cache line there.  All this should be kept in mind if we ever do
VIRTIO_RING_F_NEW_LAYOUT.

--
MST
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On 05/07/2010 06:23 AM, Rusty Russell wrote:
> On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
>> On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
>>> +	/* We publish the last-seen used index at the end of the available ring.
>>> +	 * It is at the end for backwards compatibility. */
>>> +	vr->last_used_idx = &(vr)->avail->ring[num];
>>> +	/* Verify that last used index does not spill over the used ring. */
>>> +	BUG_ON((void *)vr->last_used_idx +
>>> +	       sizeof *vr->last_used_idx > (void *)vr->used);
>>>  }
>>
>> Shouldn't this be on its own cache line?
>
> It's next to the available ring; because that's where the guest
> publishes its data.  That whole page is guest-write, host-read.
>
> Putting it on a cacheline by itself would be a slight pessimization;
> the host cpu would have to get the last_used_idx cacheline and the
> avail descriptor cacheline every time.  This way, they are sometimes
> the same cacheline.

If one peer writes the tail of the available ring, while the other reads
last_used_idx, it's a false bounce, no?

Having things on the same cacheline is only worthwhile if they are
accessed at the same time.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
> On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
>> +	/* We publish the last-seen used index at the end of the available ring.
>> +	 * It is at the end for backwards compatibility. */
>> +	vr->last_used_idx = &(vr)->avail->ring[num];
>> +	/* Verify that last used index does not spill over the used ring. */
>> +	BUG_ON((void *)vr->last_used_idx +
>> +	       sizeof *vr->last_used_idx > (void *)vr->used);
>>  }
>
> Shouldn't this be on its own cache line?

It's next to the available ring; because that's where the guest
publishes its data.  That whole page is guest-write, host-read.

Putting it on a cacheline by itself would be a slight pessimization; the
host cpu would have to get the last_used_idx cacheline and the avail
descriptor cacheline every time.  This way, they are sometimes the same
cacheline.

Hope that clarifies,
Rusty.
Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself
On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
> +	/* We publish the last-seen used index at the end of the available ring.
> +	 * It is at the end for backwards compatibility. */
> +	vr->last_used_idx = &(vr)->avail->ring[num];
> +	/* Verify that last used index does not spill over the used ring. */
> +	BUG_ON((void *)vr->last_used_idx +
> +	       sizeof *vr->last_used_idx > (void *)vr->used);
>  }

Shouldn't this be on its own cache line?  One way to achieve this is to
allocate it at the end.

--
error compiling committee.c: too many arguments to function