Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin (levinsasha...@gmail.com) wrote:
[...]
> Mathieu,
>
> I've started working on converting our MMIO code to use RCU rbtree.
>
> It looks like each node contains one key, and the search functions
> search for a node with a key in a specific range.
>
> Instead, the key in interval tree nodes is a range, and when searching
> we try to find which node's range contains our search param.
>
> For example, our MMIO mapper maps an address space into devices, so we
> can have one node which holds the range (0x100-0x200) which maps to a
> VGA card, a (0x400-0x500) which maps to a sound card, and so on. Then,
> when a guest is running and tries to write to 0x150, we want to know
> which device it maps to.

Hi Sasha,

I finished updating the RCU RBTree internals and API to store ranges and
query points or ranges instead of simple values. My tests are passing
fine so far. I also added some documentation in the code explaining how
I deal with search/prev/next reads vs concurrent writers.

Feedback about the API would be very welcome! It's all available in the
urcu rbtree2 branch.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Prasad Joshi wrote:

> May be the DD of 1G file was a wrong test to calculate the
> performance. Sasha had asked me to run bonnie++ for performance
> numbers.
[...]

It's difficult to test IO performance. One way to 'stabilize' such
measurements would be to create a 'make test io' variant, which builds
not just a simple 'Hello World!' but also an IO performance benchmark.

The advantage of such an approach (beyond the fun of writing it) would
be that it's executed at a very predictable point in the bootup
sequence, so if you keep the whole disk image in ramdisk (in /dev/shm/
for example) then it would allow very precise measurements that
converge very quickly with very little noise.

Thanks,

Ingo
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Tue, May 31, 2011 at 4:25 PM, Ingo Molnar wrote:
>
> * Sasha Levin wrote:
>
>> On Tue, 2011-05-31 at 15:09 +0200, Ingo Molnar wrote:
>> > * Sasha Levin wrote:
>> >
>> > > I've started working on converting our MMIO code to use RCU rbtree.
>> >
>> > Well, why would we want to switch the MMIO code away from the brlock
>> > right now? mmio tree reconfigurations are very, very rare.
>> >
>> > Won't something like the qcow cache be more suitable for that? That
>> > has both frequent read and write activities.
>>
>> We don't have a qcow cache at the moment :)
>
> Wasn't one in the works by Prasad?

I have the code changes, which work. I think I had sent patches a few
days back; there were a few improvements as well. The only reason I did
not send the v2 of the patch is that I am not seeing a performance
improvement from adding the cache.

Maybe the DD of a 1G file was the wrong test to measure the
performance. Sasha had asked me to run bonnie++ for performance
numbers.

I will ensure the patches can be applied on the latest code base and
send the patches. Sorry for the delay.

Thanks and Regards,
Prasad

>
>> It was either the MMIO or the ioport tree. We don't have to pull
>> them into our master tree either - it's mostly a proof of concept
>> I'm doing on a separate tree.
>
> Sure.
>
> Thanks,
>
> Ingo
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin wrote:

> On Tue, 2011-05-31 at 15:09 +0200, Ingo Molnar wrote:
> > * Sasha Levin wrote:
> >
> > > I've started working on converting our MMIO code to use RCU rbtree.
> >
> > Well, why would we want to switch the MMIO code away from the brlock
> > right now? mmio tree reconfigurations are very, very rare.
> >
> > Won't something like the qcow cache be more suitable for that? That
> > has both frequent read and write activities.
>
> We don't have a qcow cache at the moment :)

Wasn't one in the works by Prasad?

> It was either the MMIO or the ioport tree. We don't have to pull
> them into our master tree either - it's mostly a proof of concept
> I'm doing on a separate tree.

Sure.

Thanks,

Ingo
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Tue, 2011-05-31 at 15:09 +0200, Ingo Molnar wrote:
> * Sasha Levin wrote:
>
> > I've started working on converting our MMIO code to use RCU rbtree.
>
> Well, why would we want to switch the MMIO code away from the brlock
> right now? mmio tree reconfigurations are very, very rare.
>
> Won't something like the qcow cache be more suitable for that? That
> has both frequent read and write activities.

We don't have a qcow cache at the moment :)

It was either the MMIO or the ioport tree. We don't have to pull them
into our master tree either - it's mostly a proof of concept I'm doing
on a separate tree.

-- 
Sasha.
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin wrote:

> I've started working on converting our MMIO code to use RCU rbtree.

Well, why would we want to switch the MMIO code away from the brlock
right now? mmio tree reconfigurations are very, very rare.

Won't something like the qcow cache be more suitable for that? That has
both frequent read and write activities.

Thanks,

Ingo
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Mon, 2011-05-30 at 14:57 -0400, Mathieu Desnoyers wrote:
> * Sasha Levin (levinsasha...@gmail.com) wrote:
> > On Mon, 2011-05-30 at 13:38 -0400, Mathieu Desnoyers wrote:
> > > * Sasha Levin (levinsasha...@gmail.com) wrote:
> > > > On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > > > > Please note that what I currently have is a normal rbtree, not an
> > > > > interval rbtree. Can you elaborate on your use-case so I can try
> > > > > to figure out how we could augment it to support the interval
> > > > > rbtree you need ?
> > > >
> > > > We don't need anything specific for the interval rbtree. The rbtree
> > > > used in the kernel provides augmentation functions for insert and
> > > > erase (see rb_augment_insert() and rb_augment_erase_begin() +
> > > > rb_augment_erase_end()). What they basically do is call a
> > > > user-provided callback for each node from the newly inserted (or
> > > > deepest after deletion) node up to the root of the tree. You can
> > > > see our code at 'tools/kvm/util/rbtree-interval.c'; basically all
> > > > we need are the 2 augmentation functions I've mentioned above.
> > >
> > > Just a little question about Documentation/rbtree.txt:
> > >
> > > I see that find_lowest_match(lo, hi, node) seems to use ">" and "<"
> > > to compare the lo value with the max_hi and node->lo. I think it
> > > would be more natural to do range comparison functions with inclusive
> > > ranges (>= and <=). Or maybe I am missing something about the way
> > > find_lowest_match works ?
> >
> > It's just an implementation of an interval search() function. Since
> > kernel rbtree users implement their own search() and insert()
> > functions (look in our util/rbtree-interval.c) it shouldn't be a
> > consideration in designing the tree - we (the users of urcu/rbtree)
> > want to control the search code anyway.
>
> The reason why I provide this facility as part of the RCU rbtree is
> because, unlike the kernel rbtree where every user is free to use
> "inheritance" to put their object in the same cache line as the rbtree
> node, the RCU implementation needs to make copies of its inner nodes,
> so the rbtree user cannot expect the node pointer to stay valid.
> Therefore, I use a more usual scheme where the pointer to the
> user-provided data structure is kept as a "void *key" in the node. So
> basically, the rbtree user doesn't have to provide the search, next and
> prev functions: this is all done within the rbtree code, especially
> because these have to be RCU-aware, and because the code that augments
> the rbtree with ranges needs to be RCU-aware too.
>
> Finally, my tests showed that "<=" and ">=" need to have the equal for
> the ranges to be inclusive. I tested this by using the same test case
> as the search, duplicating the key value as both lower and upper bound
> of the range searched for. (See the updated rbtree2 branch for the
> tested range search.)
>
> My stress-test now tests the range lookups, and it passes fine so far:
>
> e.g. test_urcu_rbtree 6 2 200 -g 40 (on an 8-core machine)

Mathieu,

I've started working on converting our MMIO code to use RCU rbtree.

It looks like each node contains one key, and the search functions
search for a node with a key in a specific range.

Instead, the key in interval tree nodes is a range, and when searching
we try to find which node's range contains our search param.

For example, our MMIO mapper maps an address space into devices, so we
can have one node which holds the range (0x100-0x200) which maps to a
VGA card, a (0x400-0x500) which maps to a sound card, and so on. Then,
when a guest is running and tries to write to 0x150, we want to know
which device it maps to.

-- 
Sasha.
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Mon, 2011-05-30 at 14:57 -0400, Mathieu Desnoyers wrote:
> * Sasha Levin (levinsasha...@gmail.com) wrote:
> > On Mon, 2011-05-30 at 13:38 -0400, Mathieu Desnoyers wrote:
> > > * Sasha Levin (levinsasha...@gmail.com) wrote:
> > > > On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > > > > Please note that what I currently have is a normal rbtree, not an
> > > > > interval rbtree. Can you elaborate on your use-case so I can try
> > > > > to figure out how we could augment it to support the interval
> > > > > rbtree you need ?
> > > >
> > > > We don't need anything specific for the interval rbtree. The rbtree
> > > > used in the kernel provides augmentation functions for insert and
> > > > erase (see rb_augment_insert() and rb_augment_erase_begin() +
> > > > rb_augment_erase_end()). What they basically do is call a
> > > > user-provided callback for each node from the newly inserted (or
> > > > deepest after deletion) node up to the root of the tree. You can
> > > > see our code at 'tools/kvm/util/rbtree-interval.c'; basically all
> > > > we need are the 2 augmentation functions I've mentioned above.
> > >
> > > Just a little question about Documentation/rbtree.txt:
> > >
> > > I see that find_lowest_match(lo, hi, node) seems to use ">" and "<"
> > > to compare the lo value with the max_hi and node->lo. I think it
> > > would be more natural to do range comparison functions with inclusive
> > > ranges (>= and <=). Or maybe I am missing something about the way
> > > find_lowest_match works ?
> >
> > It's just an implementation of an interval search() function. Since
> > kernel rbtree users implement their own search() and insert()
> > functions (look in our util/rbtree-interval.c) it shouldn't be a
> > consideration in designing the tree - we (the users of urcu/rbtree)
> > want to control the search code anyway.
>
> The reason why I provide this facility as part of the RCU rbtree is
> because, unlike the kernel rbtree where every user is free to use
> "inheritance" to put their object in the same cache line as the rbtree
> node, the RCU implementation needs to make copies of its inner nodes,
> so the rbtree user cannot expect the node pointer to stay valid.
> Therefore, I use a more usual scheme where the pointer to the
> user-provided data structure is kept as a "void *key" in the node. So
> basically, the rbtree user doesn't have to provide the search, next and
> prev functions: this is all done within the rbtree code, especially
> because these have to be RCU-aware, and because the code that augments
> the rbtree with ranges needs to be RCU-aware too.
>
> Finally, my tests showed that "<=" and ">=" need to have the equal for
> the ranges to be inclusive. I tested this by using the same test case
> as the search, duplicating the key value as both lower and upper bound
> of the range searched for. (See the updated rbtree2 branch for the
> tested range search.)
>
> My stress-test now tests the range lookups, and it passes fine so far:
>
> e.g. test_urcu_rbtree 6 2 200 -g 40 (on an 8-core machine)

Alright, so if urcu has rbtrees I'll make sure tools/kvm/ starts using
urcu. I'll send an update tomorrow once I have something working.

-- 
Sasha.
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Mon, 2011-05-30 at 13:38 -0400, Mathieu Desnoyers wrote:
> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> > > On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > > > Please note that what I currently have is a normal rbtree, not an
> > > > interval rbtree. Can you elaborate on your use-case so I can try to
> > > > figure out how we could augment it to support the interval rbtree
> > > > you need ?
> > >
> > > We don't need anything specific for the interval rbtree. The rbtree
> > > used in the kernel provides augmentation functions for insert and
> > > erase (see rb_augment_insert() and rb_augment_erase_begin() +
> > > rb_augment_erase_end()). What they basically do is call a
> > > user-provided callback for each node from the newly inserted (or
> > > deepest after deletion) node up to the root of the tree. You can see
> > > our code at 'tools/kvm/util/rbtree-interval.c'; basically all we need
> > > are the 2 augmentation functions I've mentioned above.
> >
> > Just a little question about Documentation/rbtree.txt:
> >
> > I see that find_lowest_match(lo, hi, node) seems to use ">" and "<" to
> > compare the lo value with the max_hi and node->lo. I think it would be
> > more natural to do range comparison functions with inclusive ranges
> > (>= and <=). Or maybe I am missing something about the way
> > find_lowest_match works ?
>
> It's just an implementation of an interval search() function. Since
> kernel rbtree users implement their own search() and insert() functions
> (look in our util/rbtree-interval.c) it shouldn't be a consideration in
> designing the tree - we (the users of urcu/rbtree) want to control the
> search code anyway.

The reason why I provide this facility as part of the RCU rbtree is
because, unlike the kernel rbtree where every user is free to use
"inheritance" to put their object in the same cache line as the rbtree
node, the RCU implementation needs to make copies of its inner nodes, so
the rbtree user cannot expect the node pointer to stay valid. Therefore,
I use a more usual scheme where the pointer to the user-provided data
structure is kept as a "void *key" in the node. So basically, the rbtree
user doesn't have to provide the search, next and prev functions: this
is all done within the rbtree code, especially because these have to be
RCU-aware, and because the code that augments the rbtree with ranges
needs to be RCU-aware too.

Finally, my tests showed that "<=" and ">=" need to have the equal for
the ranges to be inclusive. I tested this by using the same test case as
the search, duplicating the key value as both lower and upper bound of
the range searched for. (See the updated rbtree2 branch for the tested
range search.)

My stress-test now tests the range lookups, and it passes fine so far:

e.g. test_urcu_rbtree 6 2 200 -g 40 (on an 8-core machine)

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Mon, 2011-05-30 at 13:38 -0400, Mathieu Desnoyers wrote:
> * Sasha Levin (levinsasha...@gmail.com) wrote:
> > On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > > Please note that what I currently have is a normal rbtree, not an
> > > interval rbtree. Can you elaborate on your use-case so I can try to
> > > figure out how we could augment it to support the interval rbtree
> > > you need ?
> >
> > We don't need anything specific for interval rbtree. The rbtree used
> > in the kernel provides augmentation functions for insert and erase
> > (see rb_augment_insert() and rb_augment_erase_begin() +
> > rb_augment_erase_end()). What they basically do is call a
> > user-provided callback for each node from the newly inserted (or
> > deepest after deletion) node up to the root of the tree. You can see
> > our code at 'tools/kvm/util/rbtree-interval.c', basically all we need
> > are the 2 augmentation functions I've mentioned above.
>
> Just a little question about Documentation/rbtree.txt:
>
> I see that find_lowest_match(lo, hi, node) seems to use ">" and "<" to
> compare the lo value with the max_hi and node->lo. I think it would be
> more natural to do range comparison functions with inclusive ranges (>=
> and <=). Or maybe I am missing something about the way
> find_lowest_match works ?

It's just an implementation of an interval search() function. Since
kernel rbtree users implement their own search() and insert() functions
(look in our util/rbtree-interval.c) it shouldn't be a consideration in
designing the tree - we (the users of urcu/rbtree) want to control the
search code anyway.

-- 
Sasha.
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
> * Sasha Levin (levinsasha...@gmail.com) wrote:
> > On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > > Please note that what I currently have is a normal rbtree, not an
> > > interval rbtree. Can you elaborate on your use-case so I can try to
> > > figure out how we could augment it to support the interval rbtree
> > > you need ?
> >
> > We don't need anything specific for interval rbtree. The rbtree used
> > in the kernel provides augmentation functions for insert and erase
> > (see rb_augment_insert() and rb_augment_erase_begin() +
> > rb_augment_erase_end()). What they basically do is call a
> > user-provided callback for each node from the newly inserted (or
> > deepest after deletion) node up to the root of the tree. You can see
> > our code at 'tools/kvm/util/rbtree-interval.c', basically all we need
> > are the 2 augmentation functions I've mentioned above.
>
> Just a little question about Documentation/rbtree.txt:
>
> I see that find_lowest_match(lo, hi, node) seems to use ">" and "<" to
> compare the lo value with the max_hi and node->lo. I think it would be
> more natural to do range comparison functions with inclusive ranges (>=
> and <=). Or maybe I am missing something about the way
> find_lowest_match works ?

Sorry for the self-reply; I actually got my head around this detail:
these tests express exclusive ranges. So to get the inclusive range, we
must take the negation of the opposite test (which describes a range
excluding the limit values).

The code for augmented RBtree search by range is now in the rbtree2
branch. I still need to find a good way to test the range search
functions, though.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > Please note that what I currently have is a normal rbtree, not an
> > interval rbtree. Can you elaborate on your use-case so I can try to
> > figure out how we could augment it to support the interval rbtree you
> > need ?
>
> We don't need anything specific for interval rbtree. The rbtree used in
> the kernel provides augmentation functions for insert and erase (see
> rb_augment_insert() and rb_augment_erase_begin() +
> rb_augment_erase_end()).
> What they basically do is call a user-provided callback for each node
> from the newly inserted (or deepest after deletion) node up to the root
> of the tree. You can see our code at 'tools/kvm/util/rbtree-interval.c',
> basically all we need are the 2 augmentation functions I've mentioned
> above.

Just a little question about Documentation/rbtree.txt:

I see that find_lowest_match(lo, hi, node) seems to use ">" and "<" to
compare the lo value with the max_hi and node->lo. I think it would be
more natural to do range comparison functions with inclusive ranges (>=
and <=). Or maybe I am missing something about the way
find_lowest_match works ?

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > Please note that what I currently have is a normal rbtree, not an
> > interval rbtree. Can you elaborate on your use-case so I can try to
> > figure out how we could augment it to support the interval rbtree you
> > need ?
>
> We don't need anything specific for interval rbtree. The rbtree used in
> the kernel provides augmentation functions for insert and erase (see
> rb_augment_insert() and rb_augment_erase_begin() +
> rb_augment_erase_end()).
> What they basically do is call a user-provided callback for each node
> from the newly inserted (or deepest after deletion) node up to the root
> of the tree. You can see our code at 'tools/kvm/util/rbtree-interval.c',
> basically all we need are the 2 augmentation functions I've mentioned
> above.

Given we have to update the parent nodes when the interval values
change, we need to work on a copy of these parent nodes to ensure that
their information about the children's min/max corresponds to the
children's left/right pointers they contain. Any discrepancy between
their left/right pointers and the children's min/max values they store
would be invalid from a reader's POV.

I'll see if I can embed this in my tree. It should be doable with the
"decay" approach I am using. We'll need a way to test this, though:
possibly by walking the tree with range-aware lookups that also make
sure that the ranges that were promised by the upper nodes are
contained within their children at all times.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
> On Sun, May 29, 2011 at 01:01:04PM -0400, Mathieu Desnoyers wrote:
> > * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
> > > * Sasha Levin (levinsasha...@gmail.com) wrote:
> > [...]
> > > > Hi Mathieu!
> > > >
> > > > In tools/kvm/ we use a rb-tree (same one used by the kernel) with
> > > > the augmentation feature to support an interval rb-tree - which
> > > > means that every update to the tree not only updates the nodes
> > > > directly related to the updated node but also all the nodes on the
> > > > path to the root of the tree.
> > >
> > > Cool !!
> > >
> > > I'm adding in copy Phil Howard who has been working on RCU RB tree
> > > for much longer than myself.
> > >
> > > > I see that in liburcu there is an implementation of a rcu linked
> > > > list but no implementation of a rb-tree.
> > > >
> > > > Are you currently working on one? or maybe I should try writing
> > > > one and sending it to you?
> > >
> > > Actually, I started working on one last year, but had to interrupt
> > > my effort before I got it even working right.
> > [...]
> > > We'd have to see how we can go from this implementation of a
> > > standard RB tree to an interval RB tree too. I guess it will depend
> > > whether you need the updates from the target node up to the root to
> > > be done "all at once" from a reader perspective (then you would
> > > probably need to replace a copy of a part of the tree all at once),
> > > or if you can allow the update to be done piece-wise on a
> > > node-by-node basis as readers go through the tree (from root to
> > > leafs).
> >
> > I've revisited the RCU rbtree implementation this weekend, and it
> > works much better now. I reimplemented the whole thing from 0 starting
> > from the CLRS chapter 12 algorithms (to get the non-rcu
> > (insertion/removal)-only stress-tests working) and incrementally
> > RCU-ized the updates and then added read-side tests. All along, I used
> > the test_urcu_rbtree test case that does some basic coherency tests by
> > searching for some random elements that *should* be there in parallel
> > with insertions and removals. The implementation I currently have
> > survives the "search for known elements in parallel with updates"
> > stress test (so far). (e.g. test_urcu_rbtree 6 2 10 -g 30 : 6 readers,
> > 2 writers, 30 known random elements, writers are adding/removing 6
> > random elements, on an 8-core machine)
> >
> > See: git://git.lttng.org/userspace-rcu.git
> > branch : rbtree2
> >
> > The key idea I used in this implementation is to "decay" the old nodes
> > (AFAIK, I just made this up): "decaying" a node could be best
> > described as creating an exact copy of a node, and putting a pointer
> > to this new node into the old node to form a "decay chain". This
> > allowed me to keep the algorithm very similar to CLRS by just walking
> > the decay chains whenever needed. The old node "decays" by using
> > call_rcu to free it after a grace period passes. This implies that the
> > updates must hold the RCU read-side lock in addition to a mutex to
> > make sure the decaying nodes stay valid for the duration of their use.
> >
> > This implementation never requires the read-side to loop, thus
> > guaranteeing wait-free read-side behavior (so search operations will
> > always be strictly log(n) without any busy-loop delay).
> >
> > I have not created stress-tests for next/prev walks of the tree yet.
> > It is therefore entirely possible that this does not work as expected.
> >
> > Comments are welcome,
>
> Very cool!
>
> The trick Phil Howard used allowed him to avoid duplicating the nodes
> in some cases in the rotations. I might be missing something, but it
> looks like you are duplicating in all cases.

The duplications I do are (following CLRS 3rd ed., chap. 12, 13):

- x, y and beta for left and right rotation (p. 313)
- v for transplant (p. 296)
- the whole branch between z.right and y (inclusive) for lines 9--20 of
  rb_delete() (p. 324, chap. 13) (at most log(n) items), for the case I
  call rcu_rbtree_remove_nonil() in my code.

> Would using Phil's trick result in significant performance gain?

I just read through Phil's paper at http://www.cs.pdx.edu/pdfs/tr1006.pdf.
It looks like we have different targets: Phil's structure of RB tree is
heavily tuned to allow RCU search, but it uses an RW lock for in-order
traversal. Mine allows both search and in-order traversal to be
performed under RCU read-side.

One impact of my different goal is that I need to keep pointers to
parent nodes (and must know if a node is a left or right child) -- and
update both of these atomically. E.g., at least one optimisation done in
Phil's work would not work with my scheme (his optimized swap, 4.1.2):
it generates an intermediate tree state where in-order traversal could
loop between C -> B -> A -> C (trying to do multiple rcu_rbtree_next)
for a while, which goes against the time guarantees I want
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Paul E. McKenney wrote:

> > The stop_machine_run()-alike thing is for brlocks - for which
> > Sasha sent patches already, see these patches on the
> > kvm@vger.kernel.org list:
> >
> >    [PATCH 3/4] kvm tools: Add a brlock
> >    [PATCH 4/4] kvm tools: Use brlock in MMIO and IOPORT
> >
> > Wrt. brlocks, we'll keep them as simple as possible and indeed no
> > involved tricks are needed AFAICS. read_lock() will be a compiler
> > barrier(), that's as fast as it gets :-)
>
> Makes sense!
>
> The other debugging use for the read-side primitives is to execute
> the read-side ready-do-respond-to-kvm-pause code. This can help
> catch bugs where the developer put the br_read_lock() and the
> br_read_unlock() in the wrong place.

That's a very good suggestion - it might in fact be simpler to
implement than the 'replace by rw_lock' debugging variant.

Thanks,

Ingo
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> Please note that what I currently have is a normal rbtree, not an
> interval rbtree. Can you elaborate on your use-case so I can try to
> figure out how we could augment it to support the interval rbtree you
> need ?

We don't need anything specific for the interval rbtree. The rbtree
used in the kernel provides augmentation functions for insert and erase
(see rb_augment_insert() and rb_augment_erase_begin() +
rb_augment_erase_end()).

What they basically do is call a user-provided callback for each node
from the newly inserted (or deepest after deletion) node up to the root
of the tree. You can see our code at 'tools/kvm/util/rbtree-interval.c';
basically all we need are the 2 augmentation functions I've mentioned
above.

-- 
Sasha.
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Sun, May 29, 2011 at 01:01:04PM -0400, Mathieu Desnoyers wrote: > * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote: > > * Sasha Levin (levinsasha...@gmail.com) wrote: > [...] > > > Hi Mathieu! > > > > > > In tools/kvm/ we use a rb-tree (same one used by the kernel) with the > > > augmentation feature to support an interval rb-tree - which means that > > > every update to the tree not only updates the nodes directly related to > > > the updated node but also all the nodes on the path to the root of the > > > tree. > > > > Cool !! > > > > I'm adding in copy Phil Howard who has been working on RCU RB tree for > > much longer than myself. > > > > > I see that in liburcu there is an implementation of a rcu linked list > > > but no implementation of a rb-tree. > > > > > > Are you currently working on one? or maybe I should try writing one and > > > sending it to you? > > > > Actually, I started working on one last year, but had to interrupt my > > effort before I got it even working right. > [...] > > We'd have to see how we can go from this implementation of a standard RB > > tree to an interval RB tree too. I guess it will depend whether you need > > the updates from the target node up to the root to be done "all at once" > > from a reader perspective (then you would probably need to replace a > > copy of a part of the tree all at once), or if you can allow the update > > to be done piece-wise on a node-by-node basis as readers go through the > > tree (from root to leafs). > > I've revisited the RCU rbtree implementation this weekend, and it works > much better now. I reimplemented the whole thing from 0 starting from > the CLRS chapter 12 algorithms (to get the non-rcu > (insertion/removal)-only stress-tests working) and incrementally > RCU-ized the updates and then added read-side tests. 
All along, I used > the test_urcu_rbtree test case that does some basic coherency tests by > searching for some random elements that *should* be there in parellel > with insertion and removals. The implementation I currently have > survives the "search for known elements in parallel with updates" stress > test (so far). (e.g. test_urcu_rbtree 6 2 10 -g 30 : 6 readers, 2 > writers, 30 known random elements, writers are adding/removing 6 random > elements, on a 8-core machine) > > See: git://git.lttng.org/userspace-rcu.git > branch : rbtree2 > > The key idea I used in this implementation is to "decay" the old nodes > (AFAIK, I just made this up) : "decaying" a node could be best described > as creating an exact copy of a node, and putting a pointer to this new > node into the old node to form a "decay chain". This allowed me to keep > the algorithm very much similar to CLRS by just walking the decay chains > whenever needed. The old node "decays" by using call_rcu to free it > after a grace period passes. This imply that the updates must hold the > RCU read-side lock in addition to a mutex to make sure the decaying > nodes stay valid for the duration of their use. > > This implementation never requires the read-side to loop, thus > guaranteeing a wait-free read-side behavior (so search operations will > always be strictly log(n) without any busy-loop delay). > > I have not created stress-tests for next/prev walk of the tree yet. It > is therefore entirely possible that this does not work as expected. > > Comments are welcome, Very cool! The trick Phil Howard used allowed him to avoid duplicating the nodes in some cases in the rotations. I might be missing something, but it looks like you are duplicating in all cases. Would using Phil's trick result in significant performance gain? Thanx, Paul > Thanks, > > Mathieu > > > > > > Thanks, > > > > Mathieu > > > > > > -- > > Mathieu Desnoyers > > Operating System Efficiency R&D Consultant > > EfficiOS Inc. 
> > http://www.efficios.com > > -- > Mathieu Desnoyers > Operating System Efficiency R&D Consultant > EfficiOS Inc. > http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sun, May 29, 2011 at 09:54:50PM +0200, Ingo Molnar wrote: > > * Paul E. McKenney wrote: > > > And the other reason that you want to mark the readers is for debug > > purposes. Murphy being who he is, you will some day need to check > > for someone calling the "OK to update" function while they are > > acting as a reader. > > Correct. In one of the previous mails i suggested a debug mode that > switches it all to pthread rwlocks. > > We do not want to lose the identify of what the read paths are: this > could grow into a BKL-alike nasty-to-fix assumption over a couple of > years! Then if someone finds a usecase that intensifies the frequency > of one of the key writepaths, we'll be in trouble ... "BKR" -- Big Kernel Reader! ;-) Thanx, Paul
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sun, May 29, 2011 at 09:33:27PM +0200, Ingo Molnar wrote: > > * Paul E. McKenney wrote: > > > On Sun, May 29, 2011 at 06:00:00PM +0300, Avi Kivity wrote: > > > On 05/29/2011 05:27 PM, Ingo Molnar wrote: > > > >* Avi Kivity wrote: > > > > > > > >> I don't understand how you expect per_cpu to work in userspace. As > > > >> soon as you calculate the per-cpu address, it can be invalidated. > > > >> It doesn't help that you get a signal; you've already calculated > > > >> the address. > > > > > > > >I was thinking of some sort of transactional mechanism, a tightly > > > >controlled set of assembly instructions updating the percpu fields, > > > >where the migration event would be able to 'unroll' incomplete > > > >modifications done to the 'wrong' percpu data structure. (It would be > > > >rather complex and every percpu op would have to be an atomic because > > > >there's always the chance that it's executed on the wrong CPU.) > > > > > > > >But note that we do not even need any notification if there's a > > > >(local) lock on the percpu fields: > > > > > > > >It will work because it's statistically percpu the lock will not > > > >SMP-bounce between CPUs generally so it will be very fast to > > > >acquire/release it, and we get the full cache benefits of percpu > > > >variables. > > > > > > > >The migration notification would still be useful to detect grace > > > >periods at natural points - but as Paul pointed out polling it via > > > >SIGALRM works as well. The two (migration and SIGALRM) could be > > > >combined as well. > > > > > > I think it's way simpler to map cpu == thread. And in fact, when > > > you run a Linux kernel in a kvm guest, that's what happens, since > > > each vcpu _is_ a host thread. > > > > I have to agree with Avi here. If a stop_machine()-like approach is > > going to work, the updates have to be very rare, so any additional > > cache-nonlocality from having lots of threads should not be a problem. 
> > Especially given that in this particular case, there are exactly as > many CPUs as threads anyway. The readers should only need to touch a > constant number of cache lines either way. > > > > Or am I missing something here? > > I was talking about the (still imaginery!) user-space tree-RCU code! > :-) Ah! I did miss a turn in the discussion, then. ;-) My initial thought is to start with CPU==thread, with the CPU online and offline operations used at thread creation and destruction time. But longer term, you are right that there would be cache-locality benefits to splitting. Perhaps more important, user-space testing could then achieve much better coverage of the various race conditions. > The stop_machine_run()-alike thing is for brlocks - for which Sasha > sent patches already, see these patches on the kvm@vger.kernel.org > list: > >[PATCH 3/4] kvm tools: Add a brlock >[PATCH 4/4] kvm tools: Use brlock in MMIO and IOPORT > > Wrt. brlocks, we'll keep them as simple as possible and indeed no > involved tricks are needed AFAICS. read_lock() will be a compiler > barrier(), that's as fast as it gets :-) Makes sense! The other debugging use for the read-side primitives is to execute the read-side ready-to-respond-to-kvm-pause code. This can help catch bugs where the developer put the br_read_lock() and the br_read_unlock() in the wrong place. Thanx, Paul
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin (levinsasha...@gmail.com) wrote: > On Sun, 2011-05-29 at 13:01 -0400, Mathieu Desnoyers wrote: > > * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote: > > > * Sasha Levin (levinsasha...@gmail.com) wrote: > > [...] > > > > Hi Mathieu! > > > > > > > > In tools/kvm/ we use a rb-tree (same one used by the kernel) with the > > > > augmentation feature to support an interval rb-tree - which means that > > > > every update to the tree not only updates the nodes directly related to > > > > the updated node but also all the nodes on the path to the root of the > > > > tree. > > > > > > Cool !! > > > > > > I'm adding in copy Phil Howard who has been working on RCU RB tree for > > > much longer than myself. > > > > > > > I see that in liburcu there is an implementation of a rcu linked list > > > > but no implementation of a rb-tree. > > > > > > > > Are you currently working on one? or maybe I should try writing one and > > > > sending it to you? > > > > > > Actually, I started working on one last year, but had to interrupt my > > > effort before I got it even working right. > > [...] > > > We'd have to see how we can go from this implementation of a standard RB > > > tree to an interval RB tree too. I guess it will depend whether you need > > > the updates from the target node up to the root to be done "all at once" > > > from a reader perspective (then you would probably need to replace a > > > copy of a part of the tree all at once), or if you can allow the update > > > to be done piece-wise on a node-by-node basis as readers go through the > > > tree (from root to leafs). > > > > I've revisited the RCU rbtree implementation this weekend, and it works > > much better now. I reimplemented the whole thing from 0 starting from > > the CLRS chapter 12 algorithms (to get the non-rcu > > (insertion/removal)-only stress-tests working) and incrementally > > Hi Mathieu! 
> > Can't we use the rbtree provided by the kernel, and just add _rcu > functions to provide rcu protected rbtree? Typical RB trees expect mutual exclusion between writers and readers, and between writers. AFAIK, the in-kernel rbtree implementation has this requirement. "RCU-izing" the RBtree structure requires more than just adding some functions: its whole structure must be adapted to support concurrent reader activity while updates are performed. > > > RCU-ized the updates and then added read-side tests. All along, I used > > the test_urcu_rbtree test case that does some basic coherency tests by > > searching for some random elements that *should* be there in parellel > > with insertion and removals. The implementation I currently have > > survives the "search for known elements in parallel with updates" stress > > test (so far). (e.g. test_urcu_rbtree 6 2 10 -g 30 : 6 readers, 2 > > writers, 30 known random elements, writers are adding/removing 6 random > > elements, on a 8-core machine) > > I've grabbed it and it works great for me here too :) > test_urcu_rbtree does print a lot of mess, making it somewhat hard to > work with. I've now removed the "debug" printfs from the rbtree2 head. > > > > > See: git://git.lttng.org/userspace-rcu.git > > branch : rbtree2 > > > > The key idea I used in this implementation is to "decay" the old nodes > > (AFAIK, I just made this up) : "decaying" a node could be best described > > as creating an exact copy of a node, and putting a pointer to this new > > node into the old node to form a "decay chain". This allowed me to keep > > the algorithm very much similar to CLRS by just walking the decay chains > > whenever needed. The old node "decays" by using call_rcu to free it > > after a grace period passes. This imply that the updates must hold the > > RCU read-side lock in addition to a mutex to make sure the decaying > > nodes stay valid for the duration of their use. 
> > > > This implementation never requires the read-side to loop, thus > > guaranteeing a wait-free read-side behavior (so search operations will > > always be strictly log(n) without any busy-loop delay). > > > > I have not created stress-tests for next/prev walk of the tree yet. It > > is therefore entirely possible that this does not work as expected. I just added some min + next and max + prev walk stress tests into test_urcu_rbtree. They seem to work fine so far with 6 readers + 2 writers running concurrently with 50 random items (it caught a stupid bug in prev(), which I fixed immediately). The idea behind these stress tests is to walk forward or backward over the entire tree and flag the position that corresponds to the global array of values in an array of flags. At the end, it checks that all the values that "must" be there were indeed flagged. > > I've 'forked' tools/kvm/, and started working on integrating urcu into > it (Will be located in https://github.com/levinsasha/linux-kvm once some > progress has been made), this should make it easier to review > work-in-progress. I'll switch to the rbtree2 branch in urcu and work with it from now.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Paul E. McKenney wrote: > And the other reason that you want to mark the readers is for debug > purposes. Murphy being who he is, you will some day need to check > for someone calling the "OK to update" function while they are > acting as a reader. Correct. In one of the previous mails i suggested a debug mode that switches it all to pthread rwlocks. We do not want to lose the identity of what the read paths are: this could grow into a BKL-alike nasty-to-fix assumption over a couple of years! Then if someone finds a usecase that intensifies the frequency of one of the key writepaths, we'll be in trouble ... Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Paul E. McKenney wrote: > On Sun, May 29, 2011 at 06:00:00PM +0300, Avi Kivity wrote: > > On 05/29/2011 05:27 PM, Ingo Molnar wrote: > > >* Avi Kivity wrote: > > > > > >> I don't understand how you expect per_cpu to work in userspace. As > > >> soon as you calculate the per-cpu address, it can be invalidated. > > >> It doesn't help that you get a signal; you've already calculated > > >> the address. > > > > > >I was thinking of some sort of transactional mechanism, a tightly > > >controlled set of assembly instructions updating the percpu fields, > > >where the migration event would be able to 'unroll' incomplete > > >modifications done to the 'wrong' percpu data structure. (It would be > > >rather complex and every percpu op would have to be an atomic because > > >there's always the chance that it's executed on the wrong CPU.) > > > > > >But note that we do not even need any notification if there's a > > >(local) lock on the percpu fields: > > > > > >It will work because it's statistically percpu the lock will not > > >SMP-bounce between CPUs generally so it will be very fast to > > >acquire/release it, and we get the full cache benefits of percpu > > >variables. > > > > > >The migration notification would still be useful to detect grace > > >periods at natural points - but as Paul pointed out polling it via > > >SIGALRM works as well. The two (migration and SIGALRM) could be > > >combined as well. > > > > I think it's way simpler to map cpu == thread. And in fact, when > > you run a Linux kernel in a kvm guest, that's what happens, since > > each vcpu _is_ a host thread. > > I have to agree with Avi here. If a stop_machine()-like approach is > going to work, the updates have to be very rare, so any additional > cache-nonlocality from having lots of threads should not be a problem. > Especially given that in this particular case, there are exactly as > many CPUs as threads anyway. 
The readers should only need to touch a > constant number of cache lines either way. > > Or am I missing something here? I was talking about the (still imaginary!) user-space tree-RCU code! :-) The stop_machine_run()-alike thing is for brlocks - for which Sasha sent patches already, see these patches on the kvm@vger.kernel.org list: [PATCH 3/4] kvm tools: Add a brlock [PATCH 4/4] kvm tools: Use brlock in MMIO and IOPORT Wrt. brlocks, we'll keep them as simple as possible and indeed no involved tricks are needed AFAICS. read_lock() will be a compiler barrier(), that's as fast as it gets :-) Thanks, Ingo
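The brlock shape Ingo describes can be sketched as below. This is a hedged sketch, not Sasha's actual patches: kvm__pause()/kvm__continue() are empty placeholders for the stop_machine_run()-style step that would bring every vcpu thread to rest, and the writer-side spinlock is just an arbitrary way to serialize the (rare) writers:

```c
#include <assert.h>
#include <stdatomic.h>

/* Read side: a pure compiler barrier -- free at run time, but it keeps
 * every read path explicitly marked in the source. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static inline void br_read_lock(void)   { barrier(); }
static inline void br_read_unlock(void) { barrier(); }

/* Placeholders: the real write side would stop/resume all vcpu threads. */
static void kvm__pause(void)    { }
static void kvm__continue(void) { }

static atomic_flag br_writer_lock = ATOMIC_FLAG_INIT;

static void br_write_lock(void)
{
	while (atomic_flag_test_and_set(&br_writer_lock))
		;			/* writers are rare, spinning is fine */
	kvm__pause();			/* bring all readers to rest */
}

static void br_write_unlock(void)
{
	kvm__continue();		/* let the readers run again */
	atomic_flag_clear(&br_writer_lock);
}
```

The point of keeping the br_read_lock()/br_read_unlock() calls even though they compile to nothing is exactly the one made earlier in the thread: the read paths stay identified, so a later switch to rwlocks or urcu is mechanical.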
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Sun, 2011-05-29 at 13:01 -0400, Mathieu Desnoyers wrote: > * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote: > > * Sasha Levin (levinsasha...@gmail.com) wrote: > [...] > > > Hi Mathieu! > > > > > > In tools/kvm/ we use a rb-tree (same one used by the kernel) with the > > > augmentation feature to support an interval rb-tree - which means that > > > every update to the tree not only updates the nodes directly related to > > > the updated node but also all the nodes on the path to the root of the > > > tree. > > > > Cool !! > > > > I'm adding in copy Phil Howard who has been working on RCU RB tree for > > much longer than myself. > > > > > I see that in liburcu there is an implementation of a rcu linked list > > > but no implementation of a rb-tree. > > > > > > Are you currently working on one? or maybe I should try writing one and > > > sending it to you? > > > > Actually, I started working on one last year, but had to interrupt my > > effort before I got it even working right. > [...] > > We'd have to see how we can go from this implementation of a standard RB > > tree to an interval RB tree too. I guess it will depend whether you need > > the updates from the target node up to the root to be done "all at once" > > from a reader perspective (then you would probably need to replace a > > copy of a part of the tree all at once), or if you can allow the update > > to be done piece-wise on a node-by-node basis as readers go through the > > tree (from root to leafs). > > I've revisited the RCU rbtree implementation this weekend, and it works > much better now. I reimplemented the whole thing from 0 starting from > the CLRS chapter 12 algorithms (to get the non-rcu > (insertion/removal)-only stress-tests working) and incrementally Hi Mathieu! Can't we use the rbtree provided by the kernel, and just add _rcu functions to provide rcu protected rbtree? > RCU-ized the updates and then added read-side tests. 
All along, I used > the test_urcu_rbtree test case that does some basic coherency tests by > searching for some random elements that *should* be there in parellel > with insertion and removals. The implementation I currently have > survives the "search for known elements in parallel with updates" stress > test (so far). (e.g. test_urcu_rbtree 6 2 10 -g 30 : 6 readers, 2 > writers, 30 known random elements, writers are adding/removing 6 random > elements, on a 8-core machine) I've grabbed it and it works great for me here too :) test_urcu_rbtree does print a lot of mess, making it somewhat hard to work with. > > See: git://git.lttng.org/userspace-rcu.git > branch : rbtree2 > > The key idea I used in this implementation is to "decay" the old nodes > (AFAIK, I just made this up) : "decaying" a node could be best described > as creating an exact copy of a node, and putting a pointer to this new > node into the old node to form a "decay chain". This allowed me to keep > the algorithm very much similar to CLRS by just walking the decay chains > whenever needed. The old node "decays" by using call_rcu to free it > after a grace period passes. This imply that the updates must hold the > RCU read-side lock in addition to a mutex to make sure the decaying > nodes stay valid for the duration of their use. > > This implementation never requires the read-side to loop, thus > guaranteeing a wait-free read-side behavior (so search operations will > always be strictly log(n) without any busy-loop delay). > > I have not created stress-tests for next/prev walk of the tree yet. It > is therefore entirely possible that this does not work as expected. I've 'forked' tools/kvm/, and started working on integrating urcu into it (Will be located in https://github.com/levinsasha/linux-kvm once some progress has been made), this should make it easier to review work-in-progress. I'll switch to the rbtree2 branch in urcu and work with it from now. 
> Comments are welcome, > > Thanks, > > Mathieu > > > > > > Thanks, > > > > Mathieu > > > > > > -- > > Mathieu Desnoyers > > Operating System Efficiency R&D Consultant > > EfficiOS Inc. > > http://www.efficios.com > -- Sasha.
RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote: > * Sasha Levin (levinsasha...@gmail.com) wrote: [...] > > Hi Mathieu! > > > > In tools/kvm/ we use a rb-tree (same one used by the kernel) with the > > augmentation feature to support an interval rb-tree - which means that > > every update to the tree not only updates the nodes directly related to > > the updated node but also all the nodes on the path to the root of the > > tree. > > Cool !! > > I'm adding in copy Phil Howard who has been working on RCU RB tree for > much longer than myself. > > > I see that in liburcu there is an implementation of a rcu linked list > > but no implementation of a rb-tree. > > > > Are you currently working on one? or maybe I should try writing one and > > sending it to you? > > Actually, I started working on one last year, but had to interrupt my > effort before I got it even working right. [...] > We'd have to see how we can go from this implementation of a standard RB > tree to an interval RB tree too. I guess it will depend whether you need > the updates from the target node up to the root to be done "all at once" > from a reader perspective (then you would probably need to replace a > copy of a part of the tree all at once), or if you can allow the update > to be done piece-wise on a node-by-node basis as readers go through the > tree (from root to leafs). I've revisited the RCU rbtree implementation this weekend, and it works much better now. I reimplemented the whole thing from 0 starting from the CLRS chapter 12 algorithms (to get the non-rcu (insertion/removal)-only stress-tests working) and incrementally RCU-ized the updates and then added read-side tests. All along, I used the test_urcu_rbtree test case that does some basic coherency tests by searching for some random elements that *should* be there in parallel with insertion and removals. The implementation I currently have survives the "search for known elements in parallel with updates" stress test (so far). 
(e.g. test_urcu_rbtree 6 2 10 -g 30 : 6 readers, 2 writers, 30 known random elements, writers are adding/removing 6 random elements, on an 8-core machine) See: git://git.lttng.org/userspace-rcu.git branch : rbtree2 The key idea I used in this implementation is to "decay" the old nodes (AFAIK, I just made this up) : "decaying" a node could be best described as creating an exact copy of a node, and putting a pointer to this new node into the old node to form a "decay chain". This allowed me to keep the algorithm very much similar to CLRS by just walking the decay chains whenever needed. The old node "decays" by using call_rcu to free it after a grace period passes. This implies that the updates must hold the RCU read-side lock in addition to a mutex to make sure the decaying nodes stay valid for the duration of their use. This implementation never requires the read-side to loop, thus guaranteeing a wait-free read-side behavior (so search operations will always be strictly log(n) without any busy-loop delay). I have not created stress-tests for next/prev walk of the tree yet. It is therefore entirely possible that this does not work as expected. Comments are welcome, Thanks, Mathieu > > Thanks, > > Mathieu > > > -- > Mathieu Desnoyers > Operating System Efficiency R&D Consultant > EfficiOS Inc. > http://www.efficios.com -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
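The "decay chain" idea can be shown with a toy structure. Everything below is invented for illustration (the real implementation is in the rbtree2 branch), and the call_rcu() deferred free is deliberately left out, so this sketch simply leaks the old copy instead of reclaiming it after a grace period:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy node: just a key plus the decay pointer to its newer copy. */
struct toy_node {
	int key;
	struct toy_node *decay_next;	/* newer copy, NULL if current */
};

/* "Decay" a node: make an exact copy and chain the old node to it.
 * In the real code the old node would then be handed to call_rcu()
 * so it stays valid for readers until a grace period elapses. */
static struct toy_node *decay(struct toy_node *old)
{
	struct toy_node *new = malloc(sizeof(*new));

	memcpy(new, old, sizeof(*new));
	new->decay_next = NULL;
	old->decay_next = new;
	return new;
}

/* Walk a decay chain to the most recent copy of a node. */
static struct toy_node *get_decay(struct toy_node *n)
{
	while (n->decay_next)
		n = n->decay_next;
	return n;
}
```

An updater holding stale pointers (for example, the parent link of a node that was copied during a rotation) just calls get_decay() to reach the current copy, which is what lets the update code stay close to the CLRS algorithms.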
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sun, 2011-05-29 at 08:31 -0700, Paul E. McKenney wrote: > This is very important even if no write path ever becomes more than > just occasional. If you don't mark the read paths like Ingo suggests, > your one-year-from-now self will be very annoyed at you, as the code > will be quite difficult to understand. And anyone else trying to > read your code will be even more annoyed. ;-) I very much agree with that. Working on the code, I'm trying to separate the pause/unpause of the guest from a 'brlock' which allows for very cheap reads and very expensive writes. -- Sasha.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sun, May 29, 2011 at 08:31:30AM -0700, Paul E. McKenney wrote: > On Sun, May 29, 2011 at 09:19:48AM +0200, Ingo Molnar wrote: > > > > * Avi Kivity wrote: > > > > > Yes, this is equivalent to the kernel's stop_machine_run(). It's a > > > heavyweight method but it should work just fine. > > > > Yeah. It is fine for reconfiguration/configuration-only kind of write > > paths - i.e. the mmio lookup path should be ok. > > > > There's only one thing i'd like to warn about: please still shape it > > as a br_read_lock()/unlock(), br_write_lock()/unlock() operation. > > > > This way all the SMP read paths remain identified properly, even if > > br_write_lock() does a stop_machine_run() equivalent. This still > > leaves options open for an easy conversion to rwlock or urcu (should > > any particular write path become more than just occasional). > > This is very important even if no write path ever becomes more than > just occasional. If you don't mark the read paths like Ingo suggests, > your one-year-from-now self will be very annoyed at you, as the code > will be quite difficult to understand. And anyone else trying to > read your code will be even more annoyed. ;-) And the other reason that you want to mark the readers is for debug purposes. Murphy being who he is, you will some day need to check for someone calling the "OK to update" function while they are acting as a reader. Thanx, Paul
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sun, May 29, 2011 at 06:00:00PM +0300, Avi Kivity wrote: > On 05/29/2011 05:27 PM, Ingo Molnar wrote: > >* Avi Kivity wrote: > > > >> I don't understand how you expect per_cpu to work in userspace. As > >> soon as you calculate the per-cpu address, it can be invalidated. > >> It doesn't help that you get a signal; you've already calculated > >> the address. > > > >I was thinking of some sort of transactional mechanism, a tightly > >controlled set of assembly instructions updating the percpu fields, > >where the migration event would be able to 'unroll' incomplete > >modifications done to the 'wrong' percpu data structure. (It would be > >rather complex and every percpu op would have to be an atomic because > >there's always the chance that it's executed on the wrong CPU.) > > > >But note that we do not even need any notification if there's a > >(local) lock on the percpu fields: > > > >It will work because it's statistically percpu the lock will not > >SMP-bounce between CPUs generally so it will be very fast to > >acquire/release it, and we get the full cache benefits of percpu > >variables. > > > >The migration notification would still be useful to detect grace > >periods at natural points - but as Paul pointed out polling it via > >SIGALRM works as well. The two (migration and SIGALRM) could be > >combined as well. > > I think it's way simpler to map cpu == thread. And in fact, when > you run a Linux kernel in a kvm guest, that's what happens, since > each vcpu _is_ a host thread. I have to agree with Avi here. If a stop_machine()-like approach is going to work, the updates have to be very rare, so any additional cache-nonlocality from having lots of threads should not be a problem. Especially given that in this particular case, there are exactly as many CPUs as threads anyway. The readers should only need to touch a constant number of cache lines either way. Or am I missing something here? 
Thanx, Paul
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sun, May 29, 2011 at 09:19:48AM +0200, Ingo Molnar wrote: > > * Avi Kivity wrote: > > > Yes, this is equivalent to the kernel's stop_machine_run(). It's a > > heavyweight method but it should work just fine. > > Yeah. It is fine for reconfiguration/configuration-only kind of write > paths - i.e. the mmio lookup path should be ok. > > There's only one thing i'd like to warn about: please still shape it > as a br_read_lock()/unlock(), br_write_lock()/unlock() operation. > > This way all the SMP read paths remain identified properly, even if > br_write_lock() does a stop_machine_run() equivalent. This still > leaves options open for an easy conversion to rwlock or urcu (should > any particular write path become more than just occasional). This is very important even if no write path ever becomes more than just occasional. If you don't mark the read paths like Ingo suggests, your one-year-from-now self will be very annoyed at you, as the code will be quite difficult to understand. And anyone else trying to read your code will be even more annoyed. ;-) Thanx, Paul
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/29/2011 05:27 PM, Ingo Molnar wrote: * Avi Kivity wrote: > I don't understand how you expect per_cpu to work in userspace. As > soon as you calculate the per-cpu address, it can be invalidated. > It doesn't help that you get a signal; you've already calculated > the address. I was thinking of some sort of transactional mechanism, a tightly controlled set of assembly instructions updating the percpu fields, where the migration event would be able to 'unroll' incomplete modifications done to the 'wrong' percpu data structure. (It would be rather complex and every percpu op would have to be an atomic because there's always the chance that it's executed on the wrong CPU.) But note that we do not even need any notification if there's a (local) lock on the percpu fields: It will work because it's statistically percpu the lock will not SMP-bounce between CPUs generally so it will be very fast to acquire/release it, and we get the full cache benefits of percpu variables. The migration notification would still be useful to detect grace periods at natural points - but as Paul pointed out polling it via SIGALRM works as well. The two (migration and SIGALRM) could be combined as well. I think it's way simpler to map cpu == thread. And in fact, when you run a Linux kernel in a kvm guest, that's what happens, since each vcpu _is_ a host thread. -- error compiling committee.c: too many arguments to function
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Avi Kivity wrote: > I don't understand how you expect per_cpu to work in userspace. As > soon as you calculate the per-cpu address, it can be invalidated. > It doesn't help that you get a signal; you've already calculated > the address. I was thinking of some sort of transactional mechanism, a tightly controlled set of assembly instructions updating the percpu fields, where the migration event would be able to 'unroll' incomplete modifications done to the 'wrong' percpu data structure. (It would be rather complex and every percpu op would have to be an atomic because there's always the chance that it's executed on the wrong CPU.) But note that we do not even need any notification if there's a (local) lock on the percpu fields: It will work because, statistically, the percpu lock will not SMP-bounce between CPUs, so it will generally be very fast to acquire/release, and we get the full cache benefits of percpu variables. The migration notification would still be useful to detect grace periods at natural points - but as Paul pointed out polling it via SIGALRM works as well. The two (migration and SIGALRM) could be combined as well. Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/29/2011 03:37 PM, Ingo Molnar wrote: > > > > It's not transparent at all if you index RCU data structures by > > the current CPU index, which the kernel implementation does. > > But that's completely broken for userspace. The "current cpu > index" doesn't even exist, since you can't disable preemption. It does exist, if the signal handler notification of a migration is instantaneous (which it is). Something like rcu_preempt_qs(), which expects to be called with interrupts disabled, cannot be made to work. I don't understand how you expect per_cpu to work in userspace. As soon as you calculate the per-cpu address, it can be invalidated. It doesn't help that you get a signal; you've already calculated the address. Also, in the kernel, using per-cpu data implies mutual exclusion (since you've disabled preemption). That doesn't apply to userspace. > > Doing that has the advantage of being much more cache-compressed > > than the TID index, > > If you have more tasks than cpus; which isn't a given. Sure there are special cases but in general there can be many more tasks (threads) than CPUs ;-) Sure; that was just a side note. Note that for server virtualization you usually have less tasks (in a process, not globally) than host cpus. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Avi Kivity wrote: > On 05/29/2011 10:35 AM, Ingo Molnar wrote: > >* Avi Kivity wrote: > > > >> On 05/28/2011 09:32 PM, Ingo Molnar wrote: > >> >* Avi Kivity wrote: > >> > > >> >> > So if you set a notification signal via fcntl(F_SETOWN) on the > >> >> > scheduler context switch event fd, the user-space RCU code will > >> >> > get a signal on every context switch. > >> >> > >> >> Context switches are completely uninteresting for userspace rcu: > >> >> > >> >> rcu_read_lock(); > >> >> ---> context switch > >> >> > >> >> have we learned anything from that? no. User code is always > >> >> preemptible and migratable. If rcu_read_lock() prevented migration > >> >> somehow, then we'd know that a context switch means we've started a > >> >> grace period for this thread. But it doesn't, so we don't. > >> > > >> >Well, in the next mail i mentioned that we can do migration events as > >> >well, which would be useful: instead of having to keep track of > >> >nr_tasks RCU grace periods we could simplify it down to nr_cpus. > >> > >> I don't see how a migration event helps. It is completely > >> transparent from the task's point of view. > > > > It's not transparent at all if you index RCU data structures by > > the current CPU index, which the kernel implementation does. > > But that's completely broken for userspace. The "current cpu > index" doesn't even exist, since you can't disable preemption. It does exist, if the signal handler notification of a migration is instantaneous (which it is). > > Doing that has the advantage of being much more cache-compressed > > than the TID index, > > If you have more tasks than cpus; which isn't a given. Sure there are special cases but in general there can be many more tasks (threads) than CPUs ;-) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/29/2011 10:35 AM, Ingo Molnar wrote: * Avi Kivity wrote: > On 05/28/2011 09:32 PM, Ingo Molnar wrote: > >* Avi Kivity wrote: > > > >> > So if you set a notification signal via fcntl(F_SETOWN) on the > >> > scheduler context switch event fd, the user-space RCU code will > >> > get a signal on every context switch. > >> > >> Context switches are completely uninteresting for userspace rcu: > >> > >> rcu_read_lock(); > >> ---> context switch > >> > >> have we learned anything from that? no. User code is always > >> preemptible and migratable. If rcu_read_lock() prevented migration > >> somehow, then we'd know that a context switch means we've started a > >> grace period for this thread. But it doesn't, so we don't. > > > >Well, in the next mail i mentioned that we can do migration events as > >well, which would be useful: instead of having to keep track of > >nr_tasks RCU grace periods we could simplify it down to nr_cpus. > > I don't see how a migration event helps. It is completely > transparent from the task's point of view. It's not transparent at all if you index RCU data structures by the current CPU index, which the kernel implementation does. But that's completely broken for userspace. The "current cpu index" doesn't even exist, since you can't disable preemption. Doing that has the advantage of being much more cache-compressed than the TID index, If you have more tasks than cpus; which isn't a given. and also having better worst-case grace period latency properties than a TID index. > > But if we indexed by the TID then we wouldnt need any scheduler > > bindings at all - this is the simpler approach. > > Yes, and it maps 1:1 to the kernel implementation (cpu = task). No, the kernel indexes grace period tracking (and the write-completion queues) by CPU index. Do a conceptual #define cpu task and it all works out. > >> What's needed are explicit notifications about grace periods. For > >> the vcpu threads, calling KVM_VCPU_RUN seems like a good point. 
> >> For I/O threads, completion of processing of an event is also a > >> good point. > > > > Grace period notifications are needed too, obviously. > > I'd think they're sufficient, no? Is something else needed? I think you are missing the fact that in the kernel we index RCU data structures by CPU number: static void rcu_preempt_qs(int cpu) { struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu); ... s/per_cpu/__thread/ static void rcu_preempt_note_context_switch(int cpu) { struct task_struct *t = current; unsigned long flags; struct rcu_data *rdp; struct rcu_node *rnp; if (t->rcu_read_lock_nesting && (t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED) == 0) { /* Possibly blocking in an RCU read-side critical section. */ rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu); ... Which could be changed over to be per task in user-space by treating the TID as a 'virtual CPU' equivalent. This probably lengthens worst-case rcu_sync() latencies rather significantly though - possibly turning urcu into a stop_machine_run() equivalent in the worst-case. (but i could be wrong about this last bit) I believe you are. urcu does stress scaling, since it's much easier to add tasks than it is to add cpus, but it's conceptually the same problem. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Avi Kivity wrote: > On 05/28/2011 09:32 PM, Ingo Molnar wrote: > >* Avi Kivity wrote: > > > >> > So if you set a notification signal via fcntl(F_SETOWN) on the > >> > scheduler context switch event fd, the user-space RCU code will > >> > get a signal on every context switch. > >> > >> Context switches are completely uninteresting for userspace rcu: > >> > >>rcu_read_lock(); > >>---> context switch > >> > >> have we learned anything from that? no. User code is always > >> preemptible and migratable. If rcu_read_lock() prevented migration > >> somehow, then we'd know that a context switch means we've started a > >> grace period for this thread. But it doesn't, so we don't. > > > >Well, in the next mail i mentioned that we can do migration events as > >well, which would be useful: instead of having to keep track of > >nr_tasks RCU grace periods we could simplify it down to nr_cpus. > > I don't see how a migration event helps. It is completely > transparent from the task's point of view. It's not transparent at all if you index RCU data structures by the current CPU index, which the kernel implementation does. Doing that has the advantage of being much more cache-compressed than the TID index, and also having better worst-case grace period latency properties than a TID index. > > But if we indexed by the TID then we wouldnt need any scheduler > > bindings at all - this is the simpler approach. > > Yes, and it maps 1:1 to the kernel implementation (cpu = task). No, the kernel indexes grace period tracking (and the write-completion queues) by CPU index. > >> What's needed are explicit notifications about grace periods. For > >> the vcpu threads, calling KVM_VCPU_RUN seems like a good point. > >> For I/O threads, completion of processing of an event is also a > >> good point. > > > > Grace period notifications are needed too, obviously. > > I'd think they're sufficient, no? Is something else needed? 
I think you are missing the fact that in the kernel we index RCU data structures by CPU number: static void rcu_preempt_qs(int cpu) { struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu); ... static void rcu_preempt_note_context_switch(int cpu) { struct task_struct *t = current; unsigned long flags; struct rcu_data *rdp; struct rcu_node *rnp; if (t->rcu_read_lock_nesting && (t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED) == 0) { /* Possibly blocking in an RCU read-side critical section. */ rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu); ... Which could be changed over to be per task in user-space by treating the TID as a 'virtual CPU' equivalent. This probably lengthens worst-case rcu_sync() latencies rather significantly though - possibly turning urcu into a stop_machine_run() equivalent in the worst-case. (but i could be wrong about this last bit) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
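Ingo's s/per_cpu/__thread/ substitution can be made concrete. The sketch below is purely illustrative (the names are made up; it is not the urcu or kernel implementation): it moves the per-CPU RCU bookkeeping into thread-local storage, so the "which CPU am I on" question that breaks in preemptible userspace simply disappears.

```c
#include <assert.h>

/* Kernel style: one rcu_data per CPU, indexed by CPU number. */
struct rcu_data {
	unsigned long passed_quiesc;	/* quiescent states seen */
	int qs_pending;			/* core is waiting for a QS from us */
};

/* User-space analogue of the kernel's per_cpu(rcu_preempt_data, cpu):
 * each thread is its own 'virtual CPU', so the lookup needs no cpu
 * index and cannot be invalidated by migration. */
static __thread struct rcu_data rcu_thread_data;

/* Analogue of rcu_preempt_qs(cpu), minus the cpu argument. */
static void rcu_thread_qs(void)
{
	struct rcu_data *rdp = &rcu_thread_data;

	rdp->passed_quiesc++;
	rdp->qs_pending = 0;
}
```

The address of `rcu_thread_data` is stable for the lifetime of the thread, which is exactly the property `per_cpu()` lacks once preemption and migration are uncontrollable.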
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Avi Kivity wrote: > Yes, this is equivalent to the kernel's stop_machine_run(). It's a > heavyweight method but it should work just fine. Yeah. It is fine for reconfiguration/configuration-only kind of write paths - i.e. the mmio lookup path should be ok. There's only one thing i'd like to warn about: please still shape it as a br_read_lock()/unlock(), br_write_lock()/unlock() operation. This way all the SMP read paths remain identified properly, even if br_write_lock() does a stop_machine_run() equivalent. This still leaves options open for an easy conversion to rwlock or urcu (should any particular write path become more than just occasional). Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/28/2011 10:45 PM, Sasha Levin wrote: On Sat, 2011-05-28 at 17:24 +0200, Ingo Molnar wrote: > * Sasha Levin wrote: > > > So the basic plan here is to allocate a futex(?) for each VCPU > > thread, and have the writer thread lock all futexes when it needs > > to write? > > > > If we assume we only have one writer thread, it can stay pretty > > simple. > > We can use an even simpler and more scalable method: > >- writers can 'stop' all other threads, by sending them a > threadpool signal and waiting for each thread to have completed > processing their current work and notifying the writer back that > they have stopped running. > > This means that the read-side lock is _zero instructions_, basically > just a barrier() to make sure the compiler does not move instructions > across threadpool functions (it wont). > > This method requires that we know about every worker thread - i.e. > no-one does a stray pthread_create() and uses data structures from > there. It also requires that each worker thread can 'stop' within a > reasonable amount of time. In this case, maybe instead of implementing it as a 'lock', we can implement it as a way to stop all vcpu threads from reentering the kernel (KVM_RUN): 1. Set a 'vcpu-stop' flag. 2. Signal all VCPUs to exit KVM_RUN. 3. VCPU threads now wait on our lock before reentering into KVM_RUN - the writer thread waits until waiting threads = VCPU count. 4. Writer thread writes, releases lock. So instead of it being a lock in MMIO, IO-ports, etc - it's a method to stop the entire guest which could be used during configuration updates (and anything else we might think of). It could also be used as a method for users to 'pause' the guest. Yes, this is equivalent to the kernel's stop_machine_run(). It's a heavyweight method but it should work just fine. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. 
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/28/2011 09:32 PM, Ingo Molnar wrote: * Avi Kivity wrote: > > So if you set a notification signal via fcntl(F_SETOWN) on the > > scheduler context switch event fd, the user-space RCU code will > > get a signal on every context switch. > > Context switches are completely uninteresting for userspace rcu: > >rcu_read_lock(); >---> context switch > > have we learned anything from that? no. User code is always > preemptible and migratable. If rcu_read_lock() prevented migration > somehow, then we'd know that a context switch means we've started a > grace period for this thread. But it doesn't, so we don't. Well, in the next mail i mentioned that we can do migration events as well, which would be useful: instead of having to keep track of nr_tasks RCU grace periods we could simplify it down to nr_cpus. I don't see how a migration event helps. It is completely transparent from the task's point of view. But if we indexed by the TID then we wouldnt need any scheduler bindings at all - this is the simpler approach. Yes, and it maps 1:1 to the kernel implementation (cpu = task). > What's needed are explicit notifications about grace periods. For > the vcpu threads, calling KVM_VCPU_RUN seems like a good point. > For I/O threads, completion of processing of an event is also a > good point. Grace period notifications are needed too, obviously. I'd think they're sufficient, no? Is something else needed? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sat, 2011-05-28 at 17:24 +0200, Ingo Molnar wrote: > * Sasha Levin wrote: > > > So the basic plan here is to allocate a futex(?) for each VCPU > > thread, and have the writer thread lock all futexes when it needs > > to write? > > > > If we assume we only have one writer thread, it can stay pretty > > simple. > > We can use an even simpler and more scalable method: > > - writers can 'stop' all other threads, by sending them a > threadpool signal and waiting for each thread to have completed > processing their current work and notifying the writer back that > they have stopped running. > > This means that the read-side lock is _zero instructions_, basically > just a barrier() to make sure the compiler does not move instructions > across threadpool functions (it wont). > > This method requires that we know about every worker thread - i.e. > no-one does a stray pthread_create() and uses data structures from > there. It also requires that each worker thread can 'stop' within a > reasonable amount of time. In this case, maybe instead of implementing it as a 'lock', we can implement it as a way to stop all vcpu threads from reentering the kernel (KVM_RUN): 1. Set a 'vcpu-stop' flag. 2. Signal all VCPUs to exit KVM_RUN. 3. VCPU threads now wait on our lock before reentering into KVM_RUN - the writer thread waits until waiting threads = VCPU count. 4. Writer thread writes, releases lock. So instead of it being a lock in MMIO, IO-ports, etc - it's a method to stop the entire guest which could be used during configuration updates (and anything else we might think of). It could also be used as a method for users to 'pause' the guest. -- Sasha. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
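Sasha's four steps can be sketched with ordinary pthread primitives. This is only an illustration under assumed names (the real kvm tools code would additionally signal the vcpu threads so they exit the KVM_RUN ioctl, which is not modeled here):

```c
#include <pthread.h>
#include <assert.h>

static pthread_mutex_t pause_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pause_cond = PTHREAD_COND_INITIALIZER;
static int vcpu_stop;		/* step 1: writer raises this flag */
static int nr_stopped;		/* vcpu threads currently parked */
static int nr_vcpus = 4;

/* Step 3: called by each vcpu thread before (re)entering KVM_RUN. */
static void vcpu_run_gate(void)
{
	pthread_mutex_lock(&pause_lock);
	nr_stopped++;
	pthread_cond_broadcast(&pause_cond);	/* writer may be waiting */
	while (vcpu_stop)
		pthread_cond_wait(&pause_cond, &pause_lock);
	nr_stopped--;
	pthread_mutex_unlock(&pause_lock);
}

/* Steps 1-2 plus the wait: stop the whole guest. */
static void guest_pause(void)
{
	pthread_mutex_lock(&pause_lock);
	vcpu_stop = 1;
	/* (real code: signal the vcpus so they exit the KVM_RUN ioctl) */
	while (nr_stopped < nr_vcpus)
		pthread_cond_wait(&pause_cond, &pause_lock);
	pthread_mutex_unlock(&pause_lock);
}

/* Step 4: writer is done; let the vcpus reenter KVM_RUN. */
static void guest_resume(void)
{
	pthread_mutex_lock(&pause_lock);
	vcpu_stop = 0;
	pthread_cond_broadcast(&pause_cond);
	pthread_mutex_unlock(&pause_lock);
}
```

Between `guest_pause()` and `guest_resume()` the writer knows every vcpu is parked at the gate, so it can rewrite the MMIO/ioport tables without any locking on the read side.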
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Avi Kivity wrote: > > So if you set a notification signal via fcntl(F_SETOWN) on the > > scheduler context switch event fd, the user-space RCU code will > > get a signal on every context switch. > > Context switches are completely uninteresting for userspace rcu: > > rcu_read_lock(); > ---> context switch > > have we learned anything from that? no. User code is always > preemptible and migratable. If rcu_read_lock() prevented migration > somehow, then we'd know that a context switch means we've started a > grace period for this thread. But it doesn't, so we don't. Well, in the next mail i mentioned that we can do migration events as well, which would be useful: instead of having to keep track of nr_tasks RCU grace periods we could simplify it down to nr_cpus. But if we indexed by the TID then we wouldnt need any scheduler bindings at all - this is the simpler approach. > What's needed are explicit notifications about grace periods. For > the vcpu threads, calling KVM_VCPU_RUN seems like a good point. > For I/O threads, completion of processing of an event is also a > good point. Grace period notifications are needed too, obviously. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/27/2011 04:31 PM, Ingo Molnar wrote: One further optimization would be possible: in case you think we can write the signal handler in assembly, or build it with gcc flags that do not use SSE, we might also add a 'lightweight signal handler' kind of flag to the kernel, which does not save FPU/vector-CPU (SSE) state. In this case signals become *really* fast on x86, almost as fast as interrupts. My old fpu rewrite had the potential to do this without a flag. We simply allocate two fpu contexts per thread (main_fpu and signal_fpu); when we deliver a signal we clear cr0.ts. If the signal touches the fpu, we save the fpu state to main_fpu and initialize the fpu; we then save/restore to signal_fpu until the signal handler exits. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/27/2011 02:07 PM, Ingo Molnar wrote: * Ingo Molnar wrote: > > This code is very much tied with the kernel scheduler. [...] > > It would not be particularly complex to enable user-space to > request a callback on context switch events. > > I was thinking on and off about allowing perf events to generate a > per sampling event notification signal on specific events, such as > page faults or context switches. I was thinking about that on and off so loudly that Peter implemented it long ago via fasync support on the perf event fd! :-) So if you set a notification signal via fcntl(F_SETOWN) on the scheduler context switch event fd, the user-space RCU code will get a signal on every context switch. Context switches are completely uninteresting for userspace rcu: rcu_read_lock(); ---> context switch have we learned anything from that? no. User code is always preemptible and migratable. If rcu_read_lock() prevented migration somehow, then we'd know that a context switch means we've started a grace period for this thread. But it doesn't, so we don't. What's needed are explicit notifications about grace periods. For the vcpu threads, calling KVM_VCPU_RUN seems like a good point. For I/O threads, completion of processing of an event is also a good point. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
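Avi's point - announce quiescence explicitly at natural points (just before KVM_RUN for vcpu threads, after finishing an event for I/O threads) rather than hooking the scheduler - is essentially what liburcu's QSBR flavour does. A hand-rolled, purely illustrative sketch (all names made up; real code would futex-wait instead of spinning):

```c
#include <stdatomic.h>
#include <assert.h>

#define MAX_THREADS 64

static atomic_ulong global_gp = 1;		/* grace-period counter */
static atomic_ulong thread_gp[MAX_THREADS];	/* last GP each thread saw */
static int nr_threads;

static int thread_register(void)
{
	int id = nr_threads++;

	atomic_store(&thread_gp[id], atomic_load(&global_gp));
	return id;
}

/* Called at a natural quiescent point, e.g. right before KVM_RUN:
 * "I currently hold no RCU-protected references". */
static void quiescent_state(int id)
{
	atomic_store(&thread_gp[id], atomic_load(&global_gp));
}

/* Writer side: bump the counter, then wait until every thread has
 * passed through a quiescent state since the bump. */
static void synchronize_rcu_sketch(void)
{
	unsigned long gp = atomic_fetch_add(&global_gp, 1) + 1;

	for (int i = 0; i < nr_threads; i++)
		while (atomic_load(&thread_gp[i]) < gp)
			;	/* real code would block, not spin */
}
```

The read side between quiescent points is completely free; the cost is that a thread which stops announcing quiescence (e.g. blocks indefinitely) stalls writers, which is why the thread notes that blocking points need separate treatment.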
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sat, May 28, 2011 at 05:24:08PM +0200, Ingo Molnar wrote: > > * Sasha Levin wrote: > > > So the basic plan here is to allocate a futex(?) for each VCPU > > thread, and have the writer thread lock all futexes when it needs > > to write? > > > > If we assume we only have one writer thread, it can stay pretty > > simple. > > We can use an even simpler and more scalable method: > > - writers can 'stop' all other threads, by sending them a > threadpool signal and waiting for each thread to have completed > processing their current work and notifying the writer back that > they have stopped running. > > This means that the read-side lock is _zero instructions_, basically > just a barrier() to make sure the compiler does not move instructions > across threadpool functions (it wont). > > This method requires that we know about every worker thread - i.e. > no-one does a stray pthread_create() and uses data structures from > there. It also requires that each worker thread can 'stop' within a > reasonable amount of time. Yep, this should work -- similar to use of stop_machine() in the kernel, but without the need for worker threads to disable preemption because there is no need for separate "cpu" and "task" concepts like there is in the kernel. The updates are quite heavyweight and they must block all readers, but if updates are infrequent enough, this clearly won't be a problem. Simpler than porting TREE_RCU to userspace as well, though I might do that some time just for grins. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Sasha Levin wrote: > So the basic plan here is to allocate a futex(?) for each VCPU > thread, and have the writer thread lock all futexes when it needs > to write? > > If we assume we only have one writer thread, it can stay pretty > simple. We can use an even simpler and more scalable method: - writers can 'stop' all other threads, by sending them a threadpool signal and waiting for each thread to have completed processing their current work and notifying the writer back that they have stopped running. This means that the read-side lock is _zero instructions_, basically just a barrier() to make sure the compiler does not move instructions across threadpool functions (it wont). This method requires that we know about every worker thread - i.e. no-one does a stray pthread_create() and uses data structures from there. It also requires that each worker thread can 'stop' within a reasonable amount of time. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Fri, 2011-05-27 at 19:10 +0200, Ingo Molnar wrote: > * Sasha Levin wrote: > > > On Fri, 2011-05-27 at 12:36 +0200, Ingo Molnar wrote: > > > * Sasha Levin wrote: > > > > > > > I see that in liburcu there is an implementation of a rcu linked > > > > list but no implementation of a rb-tree. > > > > > > Another approach would be, until the RCU interactions are sorted out, > > > to implement a 'big reader lock' thing that is completely lockless on > > > the read side (!). > > > > > > It works well if the write side is expensive, but very rare: which is > > > certainly the case for these ioport registration data structures used > > > in the mmio event demux fast path! > > > > > > The write_lock() side signals all worker threads to finish whatever > > > they are doing now and to wait for the write_unlock(). Then the > > > modification can be done and the worker threads can be resumed. > > > > > > This can be updated to RCU later on without much trouble. > > > > > > The advantage is that this could be implemented with the existing > > > thread-pool primitives straight away i think, we'd need five > > > primitives: > > > > > > bread_lock(); > > > bread_unlock(); > > > bwrite_lock(); > > > bwrite_lock(); > > > > > > brlock_init(); > > > > > > and a data type: > > > > > > struct brlock; > > > > > > bread_lock()/bread_unlock() is basically just a compiler barrier. > > > bwrite_lock() stops all (other) worker threads. > > > bwrite_unlock() resumes them. > > > > > > That's all - should be 50 lines of code, unless i'm missing something > > > :-) > > > > > > Thanks, > > > > > > Ingo > > > > Isn't there something similar to this in the kernel? > > Yeah, there's include/linux/lglock.h. > > > I prefer not implementing a new lock type at the moment mostly because > > we're not tackling a bug or an immediate problem, we don't really need > > locking at the moment (we add all devices at init and don't support > > hotplug yet). 
So I'd rather not write code just to solve it faster but > > have it thrown away later. > > We don't have to throw it away: RCU is rather complex to pull off > here, and in many cases, where writes are very rare, brlocks are the > best solution even with RCU present. So the basic plan here is to allocate a futex(?) for each VCPU thread, and have the writer thread lock all futexes when it needs to write? If we assume we only have one writer thread, it can stay pretty simple. -- Sasha. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Fri, May 27, 2011 at 11:12:20AM +0200, Ingo Molnar wrote: > > * Paul E. McKenney wrote: > > > > > > I'm CC'ing Paul and Mathieu as well for urcu. > > > > I am hoping we can get better convergence between the user-level > > and kernel-level URCU implementations once I get SRCU merged into > > the TREE_RCU and TINY_RCU implementations. [...] > > Yeah. > > > [...] But it is early days for user-level RCU implementations -- > > for example, the kernel-level implementations have deep > > dependencies on being able to lock themselves cheaply to a given > > CPU, which does not exist at user level. > > Correct - this is why i suggested a plain copy first, then look back > how we (and whether we!) want to share logic. OK, here is an approach that I rejected long ago due to its not handling existing code bases nicely. But it should work fine for user-space applications that are willing to adapt themselves to RCU, so it is well worth considering for this set of use cases. The basic trick is to pretend that each user-level thread is its own CPU. This is easiest if the application does not do any RCU activity from signal handlers, though app-code signal handlers can be dealt with as well, if needed. (But I hate POSIX signals...) Given this trick, the code that is currently invoked from the scheduling-clock interrupt can be instead invoked from a per-thread SIGALRM. Given the current implementation in -tip, the RCU core processing can be done from separate threads, but things must be tweaked because TREE_RCU assumes that the RCU core processing happens on the same CPU (which now becomes in the same thread) that it corresponds to. In other words, the rcuc0 thread is hard-affinitied to CPU 0, the rcuc1 thread to CPU 1, and so on. One way to handle this would be to do the per-CPU kthread processing in signal-handler context. Then code segments that disable interrupts (the __call_rcu() function comes immediately to mind) must block the corresponding signals. 
Which can easily be abstracted so that common code handles it. We could handle dyntick-idle if there is a convenient way to get notification when a thread blocks (as opposed to being preempted). There are a number of strategies that might work here, the first that comes to mind is to notify only if the block is TASK_INTERRUPTIBLE, which indicates a relatively long-term sleep. This notification could call rcu_enter_nohz() and friends. So, is there a way to get notification on TASK_INTERRUPTIBLE blocking and unblocking? This is not a general-purpose solution (which is why I rejected it when thinking along these lines some years ago), but it would be an interesting way to share the in-kernel code. And I believe that this approach would be quite useful to a great number of user-level apps/tools/utilities that were willing to live within its constraints. The each-thread-is-a-CPU might seem limiting, but the current TREE_RCU implementation would allow up to 4,194,304 threads on a 64-bit system and up to 524,288 on a 32-bit system, which should prove sufficient for most uses. Famous last words... But it would be easy to add a fifth level of hierarchy if someone really does have a legitimate need for more threads, which would bring us to 268,435,456 threads for 64-bit systems and 16,777,216 threads for 32-bit systems. And it is easy to add more levels -- and the extra levels don't penalize people who don't need them. With the current implementation, the maximum number of threads would need to be specified at compile time, but again, this should be OK in almost all cases. Default to (say) 131,072 threads maximum and be happy. > > But there seems to be an assumption that there should be only one > > URCU implementation, and I am not sure that this assumption holds. > > I'm not sure about that either. And sice tools/kvm/ lives in the > kernel repo it would be a mortal sin [*] to not explore the code > sharing angle!!! 
:-) > > If a reasonable amount of sharing of logic is possible without making > it painful for the kernel RCU code we could do other nice things like > change the RCU logic and test it in user-space first and run > user-space rcutorture on some really big cluster. That would be cool -- also, it would make the Linux-kernel code more accessible, because people could play with it in userspace, single-stepping, setting breakpoints, and so on. > > [ ... ] > > > > All that aside, one advantage of http://lttng.org/urcu is that it > > already exists, which allows prototyping to proceed immediately. > > it's offline right now: > > $ git clone git://git.lttng.org/urcu > Cloning into urcu... > fatal: The remote end hung up unexpectedly > > One complication is that it's LGPL while tools/kvm/ is GPLv2. I guess > we could copy a suitable implementation into tools/kvm/rcu/? That is another reason why I believe that an in-kernel-tree version of URCU is not a replacement for the variant that Mathieu is maintaining (and that
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Sasha Levin wrote: > On Fri, 2011-05-27 at 12:36 +0200, Ingo Molnar wrote: > > * Sasha Levin wrote: > > > > > I see that in liburcu there is an implementation of a rcu linked > > > list but no implementation of a rb-tree. > > > > Another approach would be, until the RCU interactions are sorted out, > > to implement a 'big reader lock' thing that is completely lockless on > > the read side (!). > > > > It works well if the write side is expensive, but very rare: which is > > certainly the case for these ioport registration data structures used > > in the mmio event demux fast path! > > > > The write_lock() side signals all worker threads to finish whatever > > they are doing now and to wait for the write_unlock(). Then the > > modification can be done and the worker threads can be resumed. > > > > This can be updated to RCU later on without much trouble. > > > > The advantage is that this could be implemented with the existing > > thread-pool primitives straight away i think, we'd need five > > primitives: > > > > bread_lock(); > > bread_unlock(); > > bwrite_lock(); > > bwrite_lock(); > > > > brlock_init(); > > > > and a data type: > > > > struct brlock; > > > > bread_lock()/bread_unlock() is basically just a compiler barrier. > > bwrite_lock() stops all (other) worker threads. > > bwrite_unlock() resumes them. > > > > That's all - should be 50 lines of code, unless i'm missing something > > :-) > > > > Thanks, > > > > Ingo > > Isn't there something similar to this in the kernel? Yeah, there's include/linux/lglock.h. > I prefer not implementing a new lock type at the moment mostly because > we're not tackling a bug or an immediate problem, we don't really need > locking at the moment (we add all devices at init and don't support > hotplug yet). So I'd rather not write code just to solve it faster but > have it thrown away later. 
We don't have to throw it away: RCU is rather complex to pull off here, and in many cases, where writes are very rare, brlocks are the best solution even with RCU present. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Fri, 2011-05-27 at 12:36 +0200, Ingo Molnar wrote: > * Sasha Levin wrote: > > > I see that in liburcu there is an implementation of a rcu linked > > list but no implementation of a rb-tree. > > Another approach would be, until the RCU interactions are sorted out, > to implement a 'big reader lock' thing that is completely lockless on > the read side (!). > > It works well if the write side is expensive, but very rare: which is > certainly the case for these ioport registration data structures used > in the mmio event demux fast path! > > The write_lock() side signals all worker threads to finish whatever > they are doing now and to wait for the write_unlock(). Then the > modification can be done and the worker threads can be resumed. > > This can be updated to RCU later on without much trouble. > > The advantage is that this could be implemented with the existing > thread-pool primitives straight away i think, we'd need five > primitives: > > bread_lock(); > bread_unlock(); > bwrite_lock(); > bwrite_lock(); > > brlock_init(); > > and a data type: > > struct brlock; > > bread_lock()/bread_unlock() is basically just a compiler barrier. > bwrite_lock() stops all (other) worker threads. > bwrite_unlock() resumes them. > > That's all - should be 50 lines of code, unless i'm missing something > :-) > > Thanks, > > Ingo Isn't there something similar to this in the kernel? I prefer not implementing a new lock type at the moment mostly because we're not tackling a bug or an immediate problem, we don't really need locking at the moment (we add all devices at init and don't support hotplug yet). So I'd rather not write code just to solve it faster but have it thrown away later. -- Sasha. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar (mi...@elte.hu) wrote: > > * Ingo Molnar wrote: > > > Note that you do not want the context switch event, but the CPU > > migration event: that will notify user-space when it gets migrated > > to another CPU. This is the case that RCU really needs. > > Also note that the main current use-case of perf events is > instrumentation, thus if you make use of this facility for user-space > RCU you need to check whether the events are all precise and arrive > in time to the target process, etc. > > Statistical behavior isn't a big problem for instrumentation but it's > a showstopper for RCU, obviously! :-) > > If you find such bugs then we want to fix them, so there's no > fundamental *desire* from us for these events to be statistical and > inaccurate anywhere. The accuracy vs speed tradeoff is actually quite different from the instrumentation vs low-level-synchronization points of view. It might be acceptable in some sampling situations to get some inaccuracy due to lack of locking if it makes the data collection tool reasonably fast for real-life use (note that I am talking about "sampling", not event-based tracing here). So there are situations where adding locking will make the overhead prohibitive for sampling, but would be required for the perfect accuracy needed by RCU. So although the desire might not be to get inaccurate data, the actual desire to get it in a low-overhead fashion can lead to different decisions regarding the accuracy vs speed tradeoff. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar (mi...@elte.hu) wrote: > > * Ingo Molnar wrote: > > > > This code is very much tied with the kernel scheduler. [...] > > > > It would not be particularly complex to enable user-space to > > request a callback on context switch events. > > > > I was thinking on and off about allowing perf events to generate a > > per sampling event notification signal on specific events, such as > > page faults or context switches. > > I was thinking about that on and off so loudly that Peter implemented > it long ago via fasync support on the perf event fd! :-) > > So if you set a notification signal via fcntl(F_SETOWN) on the > scheduler context switch event fd, the user-space RCU code will get a > signal on every context switch. > > I have not tried it for this purpose yet, so let us know if there are > unexpected complications :) I'm worried about the following complications: Let's define per-process, per-cpu data structures, mapped in userland, that keep track of the per-cpu read-side C.S. nesting count. Let's say we use this signal-based mechanism to inform userland of migrations. 1) The first thing I notice is that we're not informed of threads being blocked. So multiple threads taking read-side C.S. in turn on the same CPU as they are being scheduled will corrupt each other's read-side C.S. counter. 2) The second thing I notice is that there is no form of synchronization between the userland execution and delivery of this signal, which leads to interesting race conditions. Any thoughts on how to address these? Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
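Complication 1) is exactly what per-thread (rather than per-process, per-cpu) read-side state avoids: threads time-sharing a CPU each own their counter, so no scheduling pattern can corrupt it. A minimal sketch in that spirit, with hypothetical names — this is not liburcu's actual implementation, and the grace-period side (which must walk all registered reader threads) is omitted:

```c
/* Hypothetical sketch: per-thread read-side nesting sidesteps the
 * counter-corruption problem described in complication 1) above.
 * Illustrative only -- not liburcu's real implementation. */
#include <stdatomic.h>

atomic_ulong gp_ctr;			/* advanced by the writer's grace period */

static __thread struct {
	unsigned long nesting;		/* read-side C.S. nesting count */
	unsigned long snapshot;		/* gp_ctr seen at the outermost lock */
} reader;

static inline void sketch_rcu_read_lock(void)
{
	if (reader.nesting++ == 0)
		reader.snapshot = atomic_load_explicit(&gp_ctr,
						       memory_order_acquire);
}

static inline void sketch_rcu_read_unlock(void)
{
	atomic_thread_fence(memory_order_release);
	reader.nesting--;
}
```

Nothing here depends on which CPU the thread runs on, which is precisely why it also needs no migration notification — the cost being that a writer must poll every reader thread instead of a handful of per-CPU counters.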
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Mathieu Desnoyers wrote: > > So yes, kernel code was obviously used in the making of urcu - > > just not the RCU kernel code it appears. > > > > Which is a pity i think! :-) > > Heh :) You know, I really like the Linux kernel coding style, which > is what I tried to stick to within this library. So although I > initially imported some of the core Linux kernel macroisms, I had > to reimplement them (derived from a MIT-licensed code-base) as soon > as I decided to go for MIT-licensed low-level primitives and > LGPL-licensed library. Well, that might be a somewhat fragile assumption in certain jurisdictions: you do know that this is not necessarily a clean-room reimplementation of the urcu code anymore, so the chain of derivation might still be present legally, right? :-) [ Not that i can imagine Linus ever bothering you about barrier() or atomic_inc() ;-) ] Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Mathieu Desnoyers wrote: > I'm worried about "self-recursion" behaviors that could be > triggered though: if the userland callback code called from a page > fault triggers a page fault all by itself, then it looks like a > good way to bring the system to its knees. [...] Not really, SIGIO isn't being reprocessed until the signal handler returns. > [...] The same applies to context switches. Do you have a way to > handle this in mind? Shouldn't be a problem in theory: yes, in case of repeat migrations repeat signals will be sent, but they should not nest in any nasty fashion. That's the theory, it needs checking! :-) One further optimization would be possible: in case you think we can write the signal handler in assembly or build it with gcc flags that do not use SSE, we might also add a 'lightweight signal handler' kind of flag to the kernel, which does not save FPU/vector-CPU (SSE) state. In this case signals become *really* fast on x86, almost as fast as interrupts. One detail: you'd not want to use a queueing signal, because the siginfo queue might overflow. It's also unnecessary: RCU only needs the last migration event, previous history is uninteresting. Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar (mi...@elte.hu) wrote: > > * Mathieu Desnoyers wrote: > > > > it's offline right now: > > > > > > $ git clone git://git.lttng.org/urcu > > > Cloning into urcu... > > > fatal: The remote end hung up unexpectedly > > > > This would be: > > > > git clone git://git.lttng.org/userspace-rcu.git > > Hey, my impression wasn't *entirely* wrong, your initial urcu commit: > > From 27b012e271a82b9a0d94543688904f207cd154ea Mon Sep 17 00:00:00 2001 > From: Mathieu Desnoyers > Date: Thu, 5 Feb 2009 19:06:44 -0500 > Subject: [PATCH] init version > > --- > Makefile |6 ++ > urcu.c | 250 > ++ > urcu.h | 69 + > 3 files changed, 325 insertions(+), 0 deletions(-) > > Has: > > +/* The "volatile" is due to gcc bugs */ > +#define barrier() __asm__ __volatile__("": : :"memory") > + > > Which code sequence i recognize very well as a kernel maintainer ;-) > Here's the kernel's compiler.h definition of the same: > > /* The "volatile" is due to gcc bugs */ > #define barrier() __asm__ __volatile__("": : :"memory") > > This: > > +/* x86 32/64 specific */ > +#define mb()asm volatile("mfence":::"memory") > +#define rmb() asm volatile("lfence":::"memory") > +#define wmb() asm volatile("sfence" ::: "memory") > + > + > + > +/* x86 32 */ > +static inline void atomic_inc(int *v) > +{ > + asm volatile("lock; incl %0" > +: "+m" (v->counter)); > > is familiar to an arch/x86/ maintainer as well :-) > > So yes, kernel code was obviously used in the making of urcu - just > not the RCU kernel code it appears. > > Which is a pity i think! :-) Heh :) You know, I really like the Linux kernel coding style, which is what I tried to stick to within this library. So although I initially imported some of the core Linux kernel macroisms, I had to reimplement them (derived from a MIT-licensed code-base) as soon as I decided to go for MIT-licensed low-level primitives and LGPL-licensed library. 
About RCU, the picture seemed very much different in the userspace landscape compared to the kernel (we needed to use per-thread RCU nesting counters compared to per-CPU in the kernel because of the lack of integration with the scheduler), but more on that in my other follow-up reply. Thanks, Mathieu > > Thanks, > > Ingo -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar (mi...@elte.hu) wrote: > > * Mathieu Desnoyers wrote: > > > > - Check kernel/tinyrcu.c to see how RCU is implemented in its > > >simplest form. :) > > > > ...so simplistic it only works on UP systems, which are not so common > > these days on the systems targeted by kvm. > > As i said above, in its simplest form - which is UP. > > Obviously it's not tinyrcu.c that should be used by tools/kvm/ but > what i suggested, tree-RCU: I agree that tree-RCU has some grace-period management scalability benefits that would be interesting to have. > > > - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/ > > > > This code is very much tied with the kernel scheduler. [...] > > It would not be particularly complex to enable user-space to request > a callback on context switch events. > > I was thinking on and off about allowing perf events to generate a > per sampling event notification signal on specific events, such as > page faults or context switches. > > Obviously this won't be enabled from NMI contexts due to atomicity > constraints, but the pagefault and maybe the context switch path > looks doable. > > That capability would be a rather simple kernel change and it would > allow a user-space RCU implementation to be notified of various key > events, context switches in particular. I'm worried about "self-recursion" behaviors that could be triggered though: if the userland callback code called from a page fault triggers a page fault all by itself, then it looks like a good way to bring the system to its knees. The same apply to context switches. Do you have a way to handle this in mind ? > > Would you be interested in helping code up such a facility? The urcu > library could make good use of it i think, regardless of what we do > in tools/kvm/. I'm open to try finding out the best possible approach to support RCU in user-space (disclaimer: I might need some help on this due to my time being fully taken by contracts currently). 
I guess the sweet spot will end up being at a crossroads between kernel-only and userland-only solutions. Thanks, Mathieu > > Thanks, > > Ingo -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Mathieu Desnoyers wrote: > > it's offline right now: > > > > $ git clone git://git.lttng.org/urcu > > Cloning into urcu... > > fatal: The remote end hung up unexpectedly > > This would be: > > git clone git://git.lttng.org/userspace-rcu.git Hey, my impression wasn't *entirely* wrong, your initial urcu commit: From 27b012e271a82b9a0d94543688904f207cd154ea Mon Sep 17 00:00:00 2001 From: Mathieu Desnoyers Date: Thu, 5 Feb 2009 19:06:44 -0500 Subject: [PATCH] init version --- Makefile |6 ++ urcu.c | 250 ++ urcu.h | 69 + 3 files changed, 325 insertions(+), 0 deletions(-) Has: +/* The "volatile" is due to gcc bugs */ +#define barrier() __asm__ __volatile__("": : :"memory") + Which code sequence i recognize very well as a kernel maintainer ;-) Here's the kernel's compiler.h definition of the same: /* The "volatile" is due to gcc bugs */ #define barrier() __asm__ __volatile__("": : :"memory") This: +/* x86 32/64 specific */ +#define mb()asm volatile("mfence":::"memory") +#define rmb() asm volatile("lfence":::"memory") +#define wmb() asm volatile("sfence" ::: "memory") + + + +/* x86 32 */ +static inline void atomic_inc(int *v) +{ + asm volatile("lock; incl %0" +: "+m" (v->counter)); is familiar to an arch/x86/ maintainer as well :-) So yes, kernel code was obviously used in the making of urcu - just not the RCU kernel code it appears. Which is a pity i think! :-) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Sasha Levin (levinsasha...@gmail.com) wrote: > On Thu, 2011-05-26 at 19:09 -0400, Mathieu Desnoyers wrote: > > * Sasha Levin (levinsasha...@gmail.com) wrote: > > > On Thu, 2011-05-26 at 21:21 +0300, Pekka Enberg wrote: > > > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > > > > >> > > > > >> > > > > > >> > I've added some rwlocks because of what Ingo said yesterday about > > > > >> > adding/removing devices after the first initialization phase. > > > > >> > > > > > >> > Take MMIO lock for example: Since we can now run SMP guests, we > > > > >> > may > > > > >> > have multiple MMIO exits (one from each VCPU thread). Each of > > > > >> > those > > > > >> > exits leads to searching the MMIO rbtree. > > > > >> > > > > > >> > We can use a mutex to lock it, but it just means that those > > > > >> > threads > > > > >> > will be blocked there instead of concurrently searching the MMIO > > > > >> > tree which makes the search linear instead of parallel. > > > > >> > > > > > >> > It's hard to bring 'real' numbers at this stage because the only > > > > >> > 'real' device we have which uses MMIO is the VESA driver, and we > > > > >> > can't really simulate many VCPUs writing to it :) > > > > >> > > > > >> I'd suggest keeping it simple first - rwlocks are nasty and will > > > > >> bounce a cacheline just as much. > > > > > > > > > > Well, this is the first case where tools/kvm can do better than qemu > > > > > with > > > > > its global lock, so I think it's worth it. > > > > > > > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > > > > > > > Definitely rcu is a perfect patch for mmio dispatch. > > > > > > > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > > > > > > > http://lttng.org/urcu > > > > > > > > :-) > > > > > > > > I'm CC'ing Paul and Mathieu as well for urcu. > > > > > > Sounds good! 
> > > > > > Should be quite an addition and could be used in more places than just > > > the MMIO dispatcher. > > > > Hi Sasha, > > > > Feel free to let me know if you need any help, have questions, or need > > improvements to liburcu. I'd be interested to know about the specifics > > of you read vs update workload rate. Also, if you need more thorough > > information, we have a paper describing all the liburcu flavors. It > > might help you choose the one better suited for your needs. (if you > > don't care that much about fine-tuning, my recommendation is to stick > > with the "urcu.h"/"liburcu" flavor). Link to the paper preprint can be > > found at http://www.efficios.com/publications > > Hi Mathieu! > > In tools/kvm/ we use a rb-tree (same one used by the kernel) with the > augmentation feature to support an interval rb-tree - which means that > every update to the tree not only updates the nodes directly related to > the updated node but also all the nodes on the path to the root of the > tree. Cool !! I'm adding in copy Phil Howard who has been working on RCU RB tree for much longer than myself. > I see that in liburcu there is an implementation of a rcu linked list > but no implementation of a rb-tree. > > Are you currently working on one? or maybe I should try writing one and > sending it to you? Actually, I started working on one last year, but had to interrupt my effort before I got it even working right. The state of this (disclaimer: unfinished!!) work is available in the "rbtree" branch of the urcu library. This tree has insertion/removals protected by a mutex, and uses a RCU read lock to protect traversal. The main problem I was facing when I had to stop working on that code is that the "nil" node: 56 /* Sentinel (bottom nodes). 57 * Don't care about p, left, right, pos and key values */ 58 struct rcu_rbtree_node rcu_rbtree_nil = { 59 .color = COLOR_BLACK, 60 }; ended up being written to as temporary node by the algorithm presented in CLRS, chap. 12. 
So sharing it as a common node, as proposed in their book, made sense only if you consider you have no concurrency, but seems to break left and right in the presence of concurrency, and I did not have time to review their entire algo to find out where I should check for accesses to this nil node. This implementation is trying to think of the RB tree in terms of how each operation is affecting the read-side visibility of the tree nodes. It uses the fact that readers only ever go down into the tree extensively. I'd be glad to help out if someone want to have a look and try to complete that work, which should only be considered as "work in progress" level of (in)completeness. We'd have to see how we can go from this implementation of a standard RB tree to an interval RB tree too. I guess it will depend whether you need the updates from the target node up to the root to be done "all at once" from a reader perspective (then you would probably need to replace a copy of a part of the tree all at once), or
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Mathieu Desnoyers wrote: > > instead of bringing in yet another library which is a IIRC a > > distant copy of the kernel code to begin with. > > This is either a lie, or immensely misinformed. You should go and > look at the source before doing nonsensical assumptions like this. > What you are saying cannot be further from the truth. I was merely immensely misinformed, which (distinct!) possibility i tried to signal via the 'IIRC' qualifier as well ;-) While right now i cannot clone the repository itself: $ git clone git://git.lttng.org/urcu Cloning into urcu... fatal: The remote end hung up unexpectedly (tried it several times today, same failure) I found a tarball of the package so could have a look again. Despite my initial impression of it being kernel RCU code related (it has system.h, smp_mb, list_head, etc.), a closer look indeed suggests that the RCU implementation itself does not use the kernel RCU concepts so it sadly cannot be a distant copy of the kernel RCU code! Which is too bad IMO: i don't think user-space RCU should necessarily be so different from kernel-space RCU. I think the two codebases could be shared, or at least they could become closer relatives! :-) That could be done using the migration event notification trick i suggested in the previous mail. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar (mi...@elte.hu) wrote: > > [ ... ] > > > > All that aside, one advantage of http://lttng.org/urcu is that it > > already exists, which allows prototyping to proceed immediately. > > it's offline right now: > > $ git clone git://git.lttng.org/urcu > Cloning into urcu... > fatal: The remote end hung up unexpectedly This would be: git clone git://git.lttng.org/userspace-rcu.git I'll make sure a symlink for urcu -> userspace-rcu.git gets set up shortly. > One complication is that it's LGPL while tools/kvm/ is GPLv2. I guess > we could copy a suitable implementation into tools/kvm/rcu/? That might make sense, yes. The intent of having liburcu LGPL is really to give apps developers as much liberty to link to it as they have calling Linux kernel system calls. This peculiarity of the GPL that Linux is able to use to let non-GPL apps use it does not apply so clearly to libraries, hence the use of LGPL for all my libs that support application execution. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar wrote: > Note that you do not want the context switch event, but the CPU > migration event: that will notify user-space when it gets migrated > to another CPU. This is the case that RCU really needs. Also note that the main current use-case of perf events is instrumentation, thus if you make use of this facility for user-space RCU you need to check whether the events are all precise and arrive in time to the target process, etc. Statistical behavior isn't a big problem for instrumentation but it's a showstopper for RCU, obviously! :-) If you find such bugs then we want to fix them, so there's no fundamental *desire* from us for these events to be statistical and inaccurate anywhere. Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar wrote: > I was thinking about that on and off so loudly that Peter > implemented it long ago via fasync support on the perf event fd! > :-) > > So if you set a notification signal via fcntl(F_SETOWN) on the > scheduler context switch event fd, the user-space RCU code will get > a signal on every context switch. > > I have not tried it for this purpose yet, so let us know if there > are unexpected complications :) Note that you do not want the context switch event, but the CPU migration event: that will notify user-space when it gets migrated to another CPU. This is the case that RCU really needs. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar wrote: > > This code is very much tied with the kernel scheduler. [...] > > It would not be particularly complex to enable user-space to > request a callback on context switch events. > > I was thinking on and off about allowing perf events to generate a > per sampling event notification signal on specific events, such as > page faults or context switches. I was thinking about that on and off so loudly that Peter implemented it long ago via fasync support on the perf event fd! :-) So if you set a notification signal via fcntl(F_SETOWN) on the scheduler context switch event fd, the user-space RCU code will get a signal on every context switch. I have not tried it for this purpose yet, so let us know if there are unexpected complications :) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Sasha Levin wrote: > I see that in liburcu there is an implementation of a rcu linked > list but no implementation of a rb-tree. Another approach would be, until the RCU interactions are sorted out, to implement a 'big reader lock' thing that is completely lockless on the read side (!). It works well if the write side is expensive, but very rare: which is certainly the case for these ioport registration data structures used in the mmio event demux fast path! The write_lock() side signals all worker threads to finish whatever they are doing now and to wait for the write_unlock(). Then the modification can be done and the worker threads can be resumed. This can be updated to RCU later on without much trouble. The advantage is that this could be implemented with the existing thread-pool primitives straight away i think, we'd need five primitives: bread_lock(); bread_unlock(); bwrite_lock(); bwrite_unlock(); brlock_init(); and a data type: struct brlock; bread_lock()/bread_unlock() is basically just a compiler barrier. bwrite_lock() stops all (other) worker threads. bwrite_unlock() resumes them. That's all - should be 50 lines of code, unless i'm missing something :-) Thanks, Ingo
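A rough pthread sketch of these five primitives, plus the worker-side checkpoint they imply (hypothetical, not existing tools/kvm code): the read side costs only a compiler barrier, while bwrite_lock() returns only once every worker has parked at a checkpoint.

```c
/* Sketch of the 'big reader lock' described above, assuming a fixed pool
 * of worker threads that periodically call brlock_checkpoint() from their
 * event loop. Names and structure are illustrative. */
#include <pthread.h>

struct brlock {
	pthread_mutex_t mtx;
	pthread_cond_t  all_parked;	/* last worker reached the checkpoint */
	pthread_cond_t  writer_done;	/* bwrite_unlock() happened */
	int stop;			/* a writer wants the workers parked */
	int parked;
	int nworkers;
};

#define barrier() __asm__ __volatile__("" ::: "memory")

void brlock_init(struct brlock *b, int nworkers)
{
	pthread_mutex_init(&b->mtx, NULL);
	pthread_cond_init(&b->all_parked, NULL);
	pthread_cond_init(&b->writer_done, NULL);
	b->stop = b->parked = 0;
	b->nworkers = nworkers;
}

/* Read side: free -- just a compiler barrier, as described above. */
static inline void bread_lock(struct brlock *b)   { (void)b; barrier(); }
static inline void bread_unlock(struct brlock *b) { (void)b; barrier(); }

/* Workers call this at a safe point in their loop, outside any read-side
 * critical section; it blocks for as long as a writer holds the lock. */
void brlock_checkpoint(struct brlock *b)
{
	if (!b->stop)			/* racy fast path, rechecked below */
		return;
	pthread_mutex_lock(&b->mtx);
	while (b->stop) {
		if (++b->parked == b->nworkers)
			pthread_cond_signal(&b->all_parked);
		pthread_cond_wait(&b->writer_done, &b->mtx);
		b->parked--;
	}
	pthread_mutex_unlock(&b->mtx);
}

void bwrite_lock(struct brlock *b)
{
	pthread_mutex_lock(&b->mtx);
	b->stop = 1;
	while (b->parked < b->nworkers)
		pthread_cond_wait(&b->all_parked, &b->mtx);
	/* every worker is parked or blocked: modify the shared data at will */
}

void bwrite_unlock(struct brlock *b)
{
	b->stop = 0;
	pthread_cond_broadcast(&b->writer_done);
	pthread_mutex_unlock(&b->mtx);
}
```

One subtlety the 50-line estimate hides: bwrite_lock() cannot return until every worker reaches its checkpoint, so checkpoint placement in the worker event loop bounds the writer's latency.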
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Mathieu Desnoyers wrote: > > - Check kernel/tinyrcu.c to see how RCU is implemented in its > >simplest form. :) > > ...so simplistic it only works on UP systems, which are not so common > these days on the systems targeted by kvm. As i said above, in its simplest form - which is UP. Obviously it's not tinyrcu.c that should be used by tools/kvm/ but what i suggested, tree-RCU: > > - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/ > > This code is very much tied with the kernel scheduler. [...] It would not be particularly complex to enable user-space to request a callback on context switch events. I was thinking on and off about allowing perf events to generate a per sampling event notification signal on specific events, such as page faults or context switches. Obviously this won't be enabled from NMI contexts due to atomicity constraints, but the pagefault and maybe the context switch path looks doable. That capability would be a rather simple kernel change and it would allow a user-space RCU implementation to be notified of various key events, context switches in particular. Would you be interested in helping code up such a facility? The urcu library could make good use of it i think, regardless of what we do in tools/kvm/. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Thu, 2011-05-26 at 19:09 -0400, Mathieu Desnoyers wrote: > * Sasha Levin (levinsasha...@gmail.com) wrote: > > On Thu, 2011-05-26 at 21:21 +0300, Pekka Enberg wrote: > > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > > > >> > > > >> > > > > >> > I've added some rwlocks because of what Ingo said yesterday about > > > >> > adding/removing devices after the first initialization phase. > > > >> > > > > >> > Take MMIO lock for example: Since we can now run SMP guests, we may > > > >> > have multiple MMIO exits (one from each VCPU thread). Each of those > > > >> > exits leads to searching the MMIO rbtree. > > > >> > > > > >> > We can use a mutex to lock it, but it just means that those threads > > > >> > will be blocked there instead of concurrently searching the MMIO > > > >> > tree which makes the search linear instead of parallel. > > > >> > > > > >> > It's hard to bring 'real' numbers at this stage because the only > > > >> > 'real' device we have which uses MMIO is the VESA driver, and we > > > >> > can't really simulate many VCPUs writing to it :) > > > >> > > > >> I'd suggest keeping it simple first - rwlocks are nasty and will > > > >> bounce a cacheline just as much. > > > > > > > > Well, this is the first case where tools/kvm can do better than qemu > > > > with > > > > its global lock, so I think it's worth it. > > > > > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > > > > > Definitely rcu is a perfect patch for mmio dispatch. > > > > > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > > > > > http://lttng.org/urcu > > > > > > :-) > > > > > > I'm CC'ing Paul and Mathieu as well for urcu. > > > > Sounds good! > > > > Should be quite an addition and could be used in more places than just > > the MMIO dispatcher. > > Hi Sasha, > > Feel free to let me know if you need any help, have questions, or need > improvements to liburcu. 
I'd be interested to know about the specifics > of you read vs update workload rate. Also, if you need more thorough > information, we have a paper describing all the liburcu flavors. It > might help you choose the one better suited for your needs. (if you > don't care that much about fine-tuning, my recommendation is to stick > with the "urcu.h"/"liburcu" flavor). Link to the paper preprint can be > found at http://www.efficios.com/publications Hi Mathieu! In tools/kvm/ we use a rb-tree (same one used by the kernel) with the augmentation feature to support an interval rb-tree - which means that every update to the tree not only updates the nodes directly related to the updated node but also all the nodes on the path to the root of the tree. I see that in liburcu there is an implementation of a rcu linked list but no implementation of a rb-tree. Are you currently working on one? or maybe I should try writing one and sending it to you? -- Sasha. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
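The augmentation Sasha describes can be sketched as the classic interval-tree scheme: each node caches the maximum interval end in its subtree, so changing a node means fixing up every ancestor on the path to the root — which is exactly what makes RCU-friendly updates hard. A simplified, hypothetical version (not the kernel's augmented rbtree API; rebalancing omitted):

```c
/* Simplified interval-tree augmentation sketch; names are illustrative
 * and red-black rebalancing is left out for clarity. */
#include <stddef.h>

struct int_node {
	unsigned long low, high;	/* the interval, e.g. an MMIO range */
	unsigned long max_high;		/* max 'high' anywhere in this subtree */
	struct int_node *left, *right, *parent;
};

static unsigned long subtree_max(struct int_node *n)
{
	return n ? n->max_high : 0;	/* assumes addresses > 0 */
}

/* Walk from a changed node up to the root, refreshing the cached maxima:
 * this is the "update all nodes on the path to the root" part. */
void propagate_max(struct int_node *n)
{
	while (n) {
		unsigned long m = n->high;

		if (subtree_max(n->left) > m)
			m = subtree_max(n->left);
		if (subtree_max(n->right) > m)
			m = subtree_max(n->right);
		if (n->max_high == m)
			break;		/* nothing above can change either */
		n->max_high = m;
		n = n->parent;
	}
}

/* Point query: find a node whose [low, high] contains addr, if any. */
struct int_node *interval_search(struct int_node *root, unsigned long addr)
{
	struct int_node *n = root;

	while (n) {
		if (addr >= n->low && addr <= n->high)
			return n;
		if (n->left && subtree_max(n->left) >= addr)
			n = n->left;
		else
			n = n->right;
	}
	return NULL;
}
```

This matches the MMIO example from earlier in the thread: map (0x100-0x200) and (0x400-0x500), then ask which node a write to 0x150 lands in.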
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Paul E. McKenney wrote: > > > > I'm CC'ing Paul and Mathieu as well for urcu. > > I am hoping we can get better convergence between the user-level > and kernel-level URCU implementations once I get SRCU merged into > the TREE_RCU and TINY_RCU implementations. [...] Yeah. > [...] But it is early days for user-level RCU implementations -- > for example, the kernel-level implementations have deep > dependencies on being able to lock themselves cheaply to a given > CPU, which does not exist at user level. Correct - this is why i suggested a plain copy first, then look back how we (and whether we!) want to share logic. > But there seems to be an assumption that there should be only one > URCU implementation, and I am not sure that this assumption holds. I'm not sure about that either. And since tools/kvm/ lives in the kernel repo it would be a mortal sin [1] to not explore the code sharing angle!!! :-) If a reasonable amount of sharing of logic is possible without making it painful for the kernel RCU code we could do other nice things like change the RCU logic and test it in user-space first and run user-space rcutorture on some really big cluster. > [ ... ] > > All that aside, one advantage of http://lttng.org/urcu is that it > already exists, which allows prototyping to proceed immediately. it's offline right now: $ git clone git://git.lttng.org/urcu Cloning into urcu... fatal: The remote end hung up unexpectedly One complication is that it's LGPL while tools/kvm/ is GPLv2. I guess we could copy a suitable implementation into tools/kvm/rcu/? Thanks, Ingo [1] punishable by death or eternal hacking of a Windows driver (i'd pick the former)
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Thu, May 26, 2011 at 07:05:08PM -0400, Mathieu Desnoyers wrote: > * Ingo Molnar (mi...@elte.hu) wrote: > > > > * Pekka Enberg wrote: > > > > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > > > >> > > > >> > > > > >> > I've added some rwlocks because of what Ingo said yesterday about > > > >> > adding/removing devices after the first initialization phase. > > > >> > > > > >> > Take MMIO lock for example: Since we can now run SMP guests, we may > > > >> > have multiple MMIO exits (one from each VCPU thread). Each of those > > > >> > exits leads to searching the MMIO rbtree. > > > >> > > > > >> > We can use a mutex to lock it, but it just means that those threads > > > >> > will be blocked there instead of concurrently searching the MMIO > > > >> > tree which makes the search linear instead of parallel. > > > >> > > > > >> > It's hard to bring 'real' numbers at this stage because the only > > > >> > 'real' device we have which uses MMIO is the VESA driver, and we > > > >> > can't really simulate many VCPUs writing to it :) > > > >> > > > >> I'd suggest keeping it simple first - rwlocks are nasty and will > > > >> bounce a cacheline just as much. > > > > > > > > Well, this is the first case where tools/kvm can do better than qemu > > > > with > > > > its global lock, so I think it's worth it. > > > > > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > > > > > Definitely rcu is a perfect patch for mmio dispatch. > > > > > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > > > > > http://lttng.org/urcu > > > > > > :-) > > > > > > I'm CC'ing Paul and Mathieu as well for urcu. I am hoping we can get better convergence between the user-level and kernel-level URCU implementations once I get SRCU merged into the TREE_RCU and TINY_RCU implementations. 
But it is early days for user-level RCU implementations -- for example, the kernel-level implementations have deep dependencies on being able to lock themselves cheaply to a given CPU, which does not exist at user level. But there seems to be an assumption that there should be only one URCU implementation, and I am not sure that this assumption holds. For example, there are several in http://lttng.org/urcu, each corresponding to different use cases. This should not be too much of a surprise, given that there are quite a few implementations in the Linux kernel: TINY_RCU, TINY_PREEMPT_RCU, TREE_RCU, TREE_PREEMPT_RCU, and SRCU. Of course, each of the first four variants provides RCU-bh and RCU-sched, and TINY_PREEMPT_RCU and TREE_PREEMPT_RCU provide preemptible RCU in addition. And back in the mid-1990s, I would never have imagined a need for more than one implementation of RCU. ;-) All that aside, one advantage of http://lttng.org/urcu is that it already exists, which allows prototyping to proceed immediately. If it turns out that URCU doesn't help for whatever reason, then there is no point in worrying further. And if URCU does turn out to help, we will know more about exactly what this particular situation requires of URCU, which will likely help us better understand what the implementation should look like -- perhaps very close to one of the URCU implementations, perhaps very close to one of the in-kernel implementations. Seem reasonable, or am I missing something here? Thanx, Paul > > I think we should rather share some of the kernel RCU code in an > > intelligent way > > Hi Ingo, > > By all means feel free to redo all the work Paul has spent care and > time coding and testing. > > > instead of bringing in yet another library which is, > > IIRC, a distant copy of the kernel code to begin with. > > This is either a lie or immensely misinformed. You should go and look > at the source before making nonsensical assumptions like this.
What you > are saying cannot be further from the truth. > > > > That way we could lazily benefit from all the enhancements Paul does > > to the kernel RCU code! :-) > > Maybe there is a reason why Paul has been working with me on the > userspace RCU implementation in parallel with working on the Linux > kernel one? > > > > Note that kernel/treercu.c is pretty tied to the kernel right now, so > > a first approach could be to: > > > > - Check Paul's excellent documentation about RCU and make sure > >you don't miss Paul's excellent 3-part primer on LWN.net: > > > > http://lwn.net/Articles/262464/ > > http://lwn.net/Articles/263130/ > > http://lwn.net/Articles/264090/ > > > >There are also lots of very good RCU articles on the LWN Kernel > >Index page: > > > > http://lwn.net/Kernel/Index/ > > Or just (see README) > > git clone git://git.lttng.org/urcu > cd userspace-rcu > ./bootstrap > ./configure > make > make install > ldconfig > > #include <urcu.h> > gcc -lurcu ... > > and be done with it? >
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Sasha Levin (levinsasha...@gmail.com) wrote: > On Thu, 2011-05-26 at 21:21 +0300, Pekka Enberg wrote: > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > > >> > > >> > > > >> > I've added some rwlocks because of what Ingo said yesterday about > > >> > adding/removing devices after the first initialization phase. > > >> > > > >> > Take MMIO lock for example: Since we can now run SMP guests, we may > > >> > have multiple MMIO exits (one from each VCPU thread). Each of those > > >> > exits leads to searching the MMIO rbtree. > > >> > > > >> > We can use a mutex to lock it, but it just means that those threads > > >> > will be blocked there instead of concurrently searching the MMIO > > >> > tree which makes the search linear instead of parallel. > > >> > > > >> > It's hard to bring 'real' numbers at this stage because the only > > >> > 'real' device we have which uses MMIO is the VESA driver, and we > > >> > can't really simulate many VCPUs writing to it :) > > >> > > >> I'd suggest keeping it simple first - rwlocks are nasty and will > > >> bounce a cacheline just as much. > > > > > > Well, this is the first case where tools/kvm can do better than qemu with > > > its global lock, so I think it's worth it. > > > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > > > Definitely rcu is a perfect patch for mmio dispatch. > > > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > > > http://lttng.org/urcu > > > > :-) > > > > I'm CC'ing Paul and Mathieu as well for urcu. > > Sounds good! > > Should be quite an addition and could be used in more places than just > the MMIO dispatcher. Hi Sasha, Feel free to let me know if you need any help, have questions, or need improvements to liburcu. I'd be interested to know about the specifics of your read vs. update workload rates. Also, if you need more thorough information, we have a paper describing all the liburcu flavors. 
It might help you choose the one better suited for your needs. (If you don't care that much about fine-tuning, my recommendation is to stick with the "urcu.h"/"liburcu" flavor). Link to the paper preprint can be found at http://www.efficios.com/publications Thanks! Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar (mi...@elte.hu) wrote: > > * Pekka Enberg wrote: > > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > > > >> > > > >> > > > > >> > I've added some rwlocks because of what Ingo said yesterday about > > > >> > adding/removing devices after the first initialization phase. > > > >> > > > > >> > Take MMIO lock for example: Since we can now run SMP guests, we may > > > >> > have multiple MMIO exits (one from each VCPU thread). Each of those > > > >> > exits leads to searching the MMIO rbtree. > > > >> > > > > >> > We can use a mutex to lock it, but it just means that those threads > > > >> > will be blocked there instead of concurrently searching the MMIO > > > >> > tree which makes the search linear instead of parallel. > > > >> > > > > >> > It's hard to bring 'real' numbers at this stage because the only > > > >> > 'real' device we have which uses MMIO is the VESA driver, and we > > > >> > can't really simulate many VCPUs writing to it :) > > > >> > > > >> I'd suggest keeping it simple first - rwlocks are nasty and will > > > >> bounce a cacheline just as much. > > > > > > > > Well, this is the first case where tools/kvm can do better than qemu > > > > with > > > > its global lock, so I think it's worth it. > > > > > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > > > > > Definitely rcu is a perfect patch for mmio dispatch. > > > > > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > > > > > http://lttng.org/urcu > > > > > > :-) > > > > > > I'm CC'ing Paul and Mathieu as well for urcu. > > I think we should rather share some of the kernel RCU code in an > intelligent way Hi Ingo, By all means feel free to redo all the work Paul has spent care and time coding and testing. > instead of bringing in yet another library which is, > IIRC, a distant copy of the kernel code to begin with. This is either a lie or immensely misinformed. You should go and look at the source before making nonsensical assumptions like this. 
What you are saying cannot be further from the truth. > > That way we could lazily benefit from all the enhancements Paul does > to the kernel RCU code! :-) Maybe there is a reason why Paul has been working with me on the userspace RCU implementation in parallel with working on the Linux kernel one? > > Note that kernel/treercu.c is pretty tied to the kernel right now, so > a first approach could be to: > > - Check Paul's excellent documentation about RCU and make sure >you don't miss Paul's excellent 3-part primer on LWN.net: > > http://lwn.net/Articles/262464/ > http://lwn.net/Articles/263130/ > http://lwn.net/Articles/264090/ > >There are also lots of very good RCU articles on the LWN Kernel >Index page: > > http://lwn.net/Kernel/Index/ Or just (see README) git clone git://git.lttng.org/urcu cd userspace-rcu ./bootstrap ./configure make make install ldconfig #include <urcu.h> gcc -lurcu ... and be done with it? > - Check kernel/tinyrcu.c to see how RCU is implemented in its >simplest form. :) ...so simplistic it only works on UP systems, which are not so common these days on the systems targeted by kvm. > > - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/ This code is very much tied to the kernel scheduler. This is actually one of the main reasons why we had to reimplement RCU for userspace rather than "simply copy the kernel one" as you recommend. > > - Massage it so far that it is suitable for tools/kvm/. We really >only need a few core RCU facilities initially: > > struct rcu_head; > > rcu_read_lock(); > rcu_read_unlock(); > > rcu_dereference() > > call_rcu(head, func); > > rcu_synchronize(); > >The rest, _bh(), etc. are all kernelisms. rcu_synchronize and the rcu read lock/unlock, in tree RCU, are tied to the scheduler to deal with preemption. User-land does not have this luxury. 
> > - Then once it's working we could look at doing the code sharing >*for real*: splitting the functionality out of the original >treercu.c code into kernel/treercu-lib.c and rcuupdate-lib.h or so >and include that one in tools/kvm/rcu/. > > - [ You might also benefit from porting rcutorture code to > user-space. It will catch RCU bugs like nothing else. ] Userspace RCU already includes the torture test suite you are referring to. > > That way the 'core RCU' logic would be contained in treercu-lib.c, > all kernel glue would be in kernel/rcutree.c. > > Some other approach might be possible as well, this was just a first > rough idea :) "wheel not invented here" syndrome? Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Pekka Enberg wrote: > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > >> > >> > > >> > I've added some rwlocks because of what Ingo said yesterday about > >> > adding/removing devices after the first initialization phase. > >> > > >> > Take MMIO lock for example: Since we can now run SMP guests, we may > >> > have multiple MMIO exits (one from each VCPU thread). Each of those > >> > exits leads to searching the MMIO rbtree. > >> > > >> > We can use a mutex to lock it, but it just means that those threads > >> > will be blocked there instead of concurrently searching the MMIO > >> > tree which makes the search linear instead of parallel. > >> > > >> > It's hard to bring 'real' numbers at this stage because the only > >> > 'real' device we have which uses MMIO is the VESA driver, and we > >> > can't really simulate many VCPUs writing to it :) > >> > >> I'd suggest keeping it simple first - rwlocks are nasty and will > >> bounce a cacheline just as much. > > > > Well, this is the first case where tools/kvm can do better than qemu with > > its global lock, so I think it's worth it. > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > Definitely rcu is a perfect patch for mmio dispatch. > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > http://lttng.org/urcu > > :-) > > I'm CC'ing Paul and Mathieu as well for urcu. I think we should rather share some of the kernel RCU code in an intelligent way instead of bringing in yet another library which is, IIRC, a distant copy of the kernel code to begin with. That way we could lazily benefit from all the enhancements Paul does to the kernel RCU code! 
:-) Note that kernel/treercu.c is pretty tied to the kernel right now, so a first approach could be to: - Check Paul's excellent documentation about RCU and make sure you don't miss Paul's excellent 3-part primer on LWN.net: http://lwn.net/Articles/262464/ http://lwn.net/Articles/263130/ http://lwn.net/Articles/264090/ There are also lots of very good RCU articles on the LWN Kernel Index page: http://lwn.net/Kernel/Index/ - Check kernel/tinyrcu.c to see how RCU is implemented in its simplest form. :) - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/ - Massage it so that it is suitable for tools/kvm/. We really only need a few core RCU facilities initially: struct rcu_head; rcu_read_lock(); rcu_read_unlock(); rcu_dereference() call_rcu(head, func); rcu_synchronize(); The rest, _bh(), etc. are all kernelisms. - Then once it's working we could look at doing the code sharing *for real*: splitting the functionality out of the original treercu.c code into kernel/treercu-lib.c and rcuupdate-lib.h or so and include that one in tools/kvm/rcu/. - [ You might also benefit from porting rcutorture code to user-space. It will catch RCU bugs like nothing else. ] That way the 'core RCU' logic would be contained in treercu-lib.c, all kernel glue would be in kernel/rcutree.c. Some other approach might be possible as well, this was just a first rough idea :) Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Thu, 2011-05-26 at 21:21 +0300, Pekka Enberg wrote: > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > >> > >> > > >> > I've added some rwlocks because of what Ingo said yesterday about > >> > adding/removing devices after the first initialization phase. > >> > > >> > Take MMIO lock for example: Since we can now run SMP guests, we may > >> > have multiple MMIO exits (one from each VCPU thread). Each of those > >> > exits leads to searching the MMIO rbtree. > >> > > >> > We can use a mutex to lock it, but it just means that those threads > >> > will be blocked there instead of concurrently searching the MMIO > >> > tree which makes the search linear instead of parallel. > >> > > >> > It's hard to bring 'real' numbers at this stage because the only > >> > 'real' device we have which uses MMIO is the VESA driver, and we > >> > can't really simulate many VCPUs writing to it :) > >> > >> I'd suggest keeping it simple first - rwlocks are nasty and will > >> bounce a cacheline just as much. > > > > Well, this is the first case where tools/kvm can do better than qemu with > > its global lock, so I think it's worth it. > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > Definitely rcu is a perfect patch for mmio dispatch. > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > http://lttng.org/urcu > > :-) > > I'm CC'ing Paul and Mathieu as well for urcu. Sounds good! Should be quite an addition and could be used in more places than just the MMIO dispatcher. -- Sasha.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > On 05/26/2011 09:05 PM, Ingo Molnar wrote: >> >> > >> > I've added some rwlocks because of what Ingo said yesterday about >> > adding/removing devices after the first initialization phase. >> > >> > Take MMIO lock for example: Since we can now run SMP guests, we may >> > have multiple MMIO exits (one from each VCPU thread). Each of those >> > exits leads to searching the MMIO rbtree. >> > >> > We can use a mutex to lock it, but it just means that those threads >> > will be blocked there instead of concurrently searching the MMIO >> > tree which makes the search linear instead of parallel. >> > >> > It's hard to bring 'real' numbers at this stage because the only >> > 'real' device we have which uses MMIO is the VESA driver, and we >> > can't really simulate many VCPUs writing to it :) >> >> I'd suggest keeping it simple first - rwlocks are nasty and will >> bounce a cacheline just as much. > > Well, this is the first case where tools/kvm can do better than qemu with > its global lock, so I think it's worth it. > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > Definitely rcu is a perfect patch for mmio dispatch. Userspace RCU code is here, Sasha, if you feel like tackling this: http://lttng.org/urcu :-) I'm CC'ing Paul and Mathieu as well for urcu.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/26/2011 09:05 PM, Ingo Molnar wrote: > > I've added some rwlocks because of what Ingo said yesterday about > > adding/removing devices after the first initialization phase. > > > > Take MMIO lock for example: Since we can now run SMP guests, we may > > have multiple MMIO exits (one from each VCPU thread). Each of those > > exits leads to searching the MMIO rbtree. > > > > We can use a mutex to lock it, but it just means that those threads > > will be blocked there instead of concurrently searching the MMIO > > tree which makes the search linear instead of parallel. > > > > It's hard to bring 'real' numbers at this stage because the only > > 'real' device we have which uses MMIO is the VESA driver, and we > > can't really simulate many VCPUs writing to it :) > I'd suggest keeping it simple first - rwlocks are nasty and will > bounce a cacheline just as much. Well, this is the first case where tools/kvm can do better than qemu with its global lock, so I think it's worth it. > If lookup scalability is an issue we can extend RCU to tools/kvm/. Definitely rcu is a perfect match for mmio dispatch. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Sasha Levin wrote: > On Thu, 2011-05-26 at 19:02 +0300, Pekka Enberg wrote: > > On Thu, 26 May 2011, Sasha Levin wrote: > > > Adds a rwlock wrapper which like the mutex wrapper makes rwlock calls > > > similar to their kernel counterparts. > > > > > > Signed-off-by: Sasha Levin > > > > There's no explanation why a mutex isn't sufficient. The pthread > > locking primitives aren't all that great in practice so unless > > you have some correctness issue that requires a rwlock or some > > numbers, I'd prefer you go for a mutex. > > I've added some rwlocks because of what Ingo said yesterday about > adding/removing devices after the first initialization phase. > > Take MMIO lock for example: Since we can now run SMP guests, we may > have multiple MMIO exits (one from each VCPU thread). Each of those > exits leads to searching the MMIO rbtree. > > We can use a mutex to lock it, but it just means that those threads > will be blocked there instead of concurrently searching the MMIO > tree which makes the search linear instead of parallel. > > It's hard to bring 'real' numbers at this stage because the only > 'real' device we have which uses MMIO is the VESA driver, and we > can't really simulate many VCPUs writing to it :) I'd suggest keeping it simple first - rwlocks are nasty and will bounce a cacheline just as much. If lookup scalability is an issue we can extend RCU to tools/kvm/. Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Thu, 2011-05-26 at 19:02 +0300, Pekka Enberg wrote: > On Thu, 26 May 2011, Sasha Levin wrote: > > Adds a rwlock wrapper which, like the mutex wrapper, makes rwlock calls > > similar to their kernel counterparts. > > > > Signed-off-by: Sasha Levin > > There's no explanation why a mutex isn't sufficient. The pthread locking > primitives aren't all that great in practice so unless you have some > correctness issue that requires a rwlock or some numbers, I'd prefer you > go for a mutex. I've added some rwlocks because of what Ingo said yesterday about adding/removing devices after the first initialization phase. Take MMIO lock for example: Since we can now run SMP guests, we may have multiple MMIO exits (one from each VCPU thread). Each of those exits leads to searching the MMIO rbtree. We can use a mutex to lock it, but it just means that those threads will be blocked there instead of concurrently searching the MMIO tree which makes the search linear instead of parallel. It's hard to bring 'real' numbers at this stage because the only 'real' device we have which uses MMIO is the VESA driver, and we can't really simulate many VCPUs writing to it :) -- Sasha.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Thu, 26 May 2011, Sasha Levin wrote: > Adds a rwlock wrapper which, like the mutex wrapper, makes rwlock calls > similar to their kernel counterparts. > > Signed-off-by: Sasha Levin There's no explanation why a mutex isn't sufficient. The pthread locking primitives aren't all that great in practice so unless you have some correctness issue that requires a rwlock or some numbers, I'd prefer you go for a mutex. Pekka