Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin (levinsasha...@gmail.com) wrote:
[...]
> Mathieu,
>
> I've started working on converting our MMIO code to use RCU rbtree.
>
> It looks like each node contains one key, and the search functions
> search for a node with a key in a specific range.
>
> Instead, the key in interval tree nodes is a range, and when searching
> we try to find which node's range contains our search param.
>
> For example, our MMIO mapper maps an address space into devices, so we
> can have one node which holds the range (0x100-0x200) which maps to a
> VGA card, a (0x400-0x500) which maps to a sound card, and so on. Then,
> when a guest is running and tries to write to 0x150, we want to know
> which device it maps to.

Hi Sasha,

I finished updating the RCU RBTree internals and API to store ranges and
query points or ranges instead of simple values. My tests are passing
fine so far. I also added some documentation in the code explaining how
I deal with search/prev/next reads vs concurrent writers.

Feedback about the API would be very welcome! It's all available in the
urcu rbtree2 branch.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Prasad Joshi wrote:

> May be the DD of 1G file was a wrong test to calculate the
> performance. Sasha had asked me to run bonnie++ for performance
> numbers.
[...]

It's difficult to test IO performance. One way to 'stabilize' such
measurements would be to create a 'make test io' variant, which builds
not just a simple 'Hello World!' but also an IO performance benchmark.

The advantage of such an approach (beyond the fun of writing it) would
be that it's executed at a very predictable point in the bootup
sequence, so if you keep the whole disk image in ramdisk (in /dev/shm/
for example) then it would allow very precise measurements that
converge very quickly with very little noise.

Thanks,

Ingo
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Tue, May 31, 2011 at 4:25 PM, Ingo Molnar wrote:
>
> * Sasha Levin wrote:
>
>> On Tue, 2011-05-31 at 15:09 +0200, Ingo Molnar wrote:
>> > * Sasha Levin wrote:
>> >
>> > > I've started working on converting our MMIO code to use RCU rbtree.
>> >
>> > Well, why would we want to switch the MMIO code away from the brlock
>> > right now? mmio tree reconfigurations are very, very rare.
>> >
>> > Won't something like the qcow cache be more suitable for that? That
>> > has both frequent read and write activities.
>>
>> We don't have a qcow cache at the moment :)
>
> Wasn't one in the works by Prasad?

I have the code changes, which work. I think I had sent patches a few
days back; there were a few improvements as well. The only reason I did
not send the v2 of the patch is that I am not seeing a performance
improvement from adding the cache.

Maybe the DD of a 1G file was the wrong test to measure the
performance. Sasha had asked me to run bonnie++ for performance
numbers.

I will ensure the patches can be applied on the latest code base and
send the patches. Sorry for the delay.

Thanks and Regards,
Prasad

>
>> It was either the MMIO or the ioport tree. We don't have to pull
>> them into our master tree either - it's mostly a proof of concept
>> I'm doing on a separate tree.
>
> Sure.
>
> Thanks,
>
> Ingo
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin wrote:

> On Tue, 2011-05-31 at 15:09 +0200, Ingo Molnar wrote:
> > * Sasha Levin wrote:
> >
> > > I've started working on converting our MMIO code to use RCU rbtree.
> >
> > Well, why would we want to switch the MMIO code away from the brlock
> > right now? mmio tree reconfigurations are very, very rare.
> >
> > Won't something like the qcow cache be more suitable for that? That
> > has both frequent read and write activities.
>
> We don't have a qcow cache at the moment :)

Wasn't one in the works by Prasad?

> It was either the MMIO or the ioport tree. We don't have to pull
> them into our master tree either - it's mostly a proof of concept
> I'm doing on a separate tree.

Sure.

Thanks,

Ingo
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Tue, 2011-05-31 at 15:09 +0200, Ingo Molnar wrote:
> * Sasha Levin wrote:
>
> > I've started working on converting our MMIO code to use RCU rbtree.
>
> Well, why would we want to switch the MMIO code away from the brlock
> right now? mmio tree reconfigurations are very, very rare.
>
> Won't something like the qcow cache be more suitable for that? That
> has both frequent read and write activities.

We don't have a qcow cache at the moment :)

It was either the MMIO or the ioport tree. We don't have to pull them
into our master tree either - it's mostly a proof of concept I'm doing
on a separate tree.

-- 
Sasha.
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin wrote:

> I've started working on converting our MMIO code to use RCU rbtree.

Well, why would we want to switch the MMIO code away from the brlock
right now? mmio tree reconfigurations are very, very rare.

Won't something like the qcow cache be more suitable for that? That has
both frequent read and write activities.

Thanks,

Ingo
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Mon, 2011-05-30 at 14:57 -0400, Mathieu Desnoyers wrote:
> * Sasha Levin (levinsasha...@gmail.com) wrote:
> > On Mon, 2011-05-30 at 13:38 -0400, Mathieu Desnoyers wrote:
> > > * Sasha Levin (levinsasha...@gmail.com) wrote:
> > > > On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > > > > Please note that what I currently have is a normal rbtree, not an
> > > > > interval rbtree. Can you elaborate on your use-case so I can try
> > > > > to figure out how we could augment it to support the interval
> > > > > rbtree you need ?
> > > >
> > > > We don't need anything specific for the interval rbtree. The rbtree
> > > > used in the kernel provides augmentation functions for insert and
> > > > erase (see rb_augment_insert() and rb_augment_erase_begin() +
> > > > rb_augment_erase_end()). What they basically do is call a
> > > > user-provided callback for each node from the newly inserted (or
> > > > deepest after deletion) node up to the root of the tree. You can
> > > > see our code at 'tools/kvm/util/rbtree-interval.c'; basically all
> > > > we need are the 2 augmentation functions I've mentioned above.
> > >
> > > Just a little question about Documentation/rbtree.txt:
> > >
> > > I see that find_lowest_match(lo, hi, node) seems to use ">" and "<"
> > > to compare the lo value with the max_hi and node->lo. I think it
> > > would be more natural to do range comparison functions with inclusive
> > > ranges (>= and <=). Or maybe I am missing something about the way
> > > find_lowest_match works ?
> >
> > It's just an implementation of an interval search() function. Since
> > kernel rbtree users implement their own search() and insert()
> > functions (look in our util/rbtree-interval.c) it shouldn't be a
> > consideration in designing the tree - we (the users of urcu/rbtree)
> > want to control the search code anyway.
>
> The reason why I provide this facility as part of the RCU rbtree is
> because, unlike the kernel rbtree where every user is free to use
> "inheritance" to put their object in the same cache line as the rbtree
> node, the RCU implementation needs to make copies of its inner nodes,
> so the rbtree user cannot expect the node pointer to stay valid.
> Therefore, I use a more usual scheme where the pointer to the
> user-provided data structure is kept as a "void *key" in the node. So
> basically, the rbtree user doesn't have to provide the search, next and
> prev functions: this is all done within the rbtree code, especially
> because these have to be RCU-aware, and because the code that augments
> the rbtree with ranges needs to be RCU-aware too.
>
> Finally, my tests showed that "<=" and ">=" need to have the equal for
> the ranges to be inclusive. I tested this by using the same test case
> as the search, duplicating the key value as both lower and upper bound
> of the range searched for. (See the updated rbtree2 branch for the
> tested range search.)
>
> My stress-test now tests the range lookups, and it passes fine so far:
>
> e.g. test_urcu_rbtree 6 2 200 -g 40 (on an 8-core machine)

Mathieu,

I've started working on converting our MMIO code to use RCU rbtree.

It looks like each node contains one key, and the search functions
search for a node with a key in a specific range.

Instead, the key in interval tree nodes is a range, and when searching
we try to find which node's range contains our search param.

For example, our MMIO mapper maps an address space into devices, so we
can have one node which holds the range (0x100-0x200) which maps to a
VGA card, a (0x400-0x500) which maps to a sound card, and so on. Then,
when a guest is running and tries to write to 0x150, we want to know
which device it maps to.

-- 
Sasha.
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Mon, 2011-05-30 at 14:57 -0400, Mathieu Desnoyers wrote:
> * Sasha Levin (levinsasha...@gmail.com) wrote:
> > On Mon, 2011-05-30 at 13:38 -0400, Mathieu Desnoyers wrote:
> > > * Sasha Levin (levinsasha...@gmail.com) wrote:
> > > > On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > > > > Please note that what I currently have is a normal rbtree, not an
> > > > > interval rbtree. Can you elaborate on your use-case so I can try
> > > > > to figure out how we could augment it to support the interval
> > > > > rbtree you need ?
> > > >
> > > > We don't need anything specific for the interval rbtree. The rbtree
> > > > used in the kernel provides augmentation functions for insert and
> > > > erase (see rb_augment_insert() and rb_augment_erase_begin() +
> > > > rb_augment_erase_end()). What they basically do is call a
> > > > user-provided callback for each node from the newly inserted (or
> > > > deepest after deletion) node up to the root of the tree. You can
> > > > see our code at 'tools/kvm/util/rbtree-interval.c'; basically all
> > > > we need are the 2 augmentation functions I've mentioned above.
> > >
> > > Just a little question about Documentation/rbtree.txt:
> > >
> > > I see that find_lowest_match(lo, hi, node) seems to use ">" and "<"
> > > to compare the lo value with the max_hi and node->lo. I think it
> > > would be more natural to do range comparison functions with inclusive
> > > ranges (>= and <=). Or maybe I am missing something about the way
> > > find_lowest_match works ?
> >
> > It's just an implementation of an interval search() function. Since
> > kernel rbtree users implement their own search() and insert()
> > functions (look in our util/rbtree-interval.c) it shouldn't be a
> > consideration in designing the tree - we (the users of urcu/rbtree)
> > want to control the search code anyway.
>
> The reason why I provide this facility as part of the RCU rbtree is
> because, unlike the kernel rbtree where every user is free to use
> "inheritance" to put their object in the same cache line as the rbtree
> node, the RCU implementation needs to make copies of its inner nodes,
> so the rbtree user cannot expect the node pointer to stay valid.
> Therefore, I use a more usual scheme where the pointer to the
> user-provided data structure is kept as a "void *key" in the node. So
> basically, the rbtree user doesn't have to provide the search, next and
> prev functions: this is all done within the rbtree code, especially
> because these have to be RCU-aware, and because the code that augments
> the rbtree with ranges needs to be RCU-aware too.
>
> Finally, my tests showed that "<=" and ">=" need to have the equal for
> the ranges to be inclusive. I tested this by using the same test case
> as the search, duplicating the key value as both lower and upper bound
> of the range searched for. (See the updated rbtree2 branch for the
> tested range search.)
>
> My stress-test now tests the range lookups, and it passes fine so far:
>
> e.g. test_urcu_rbtree 6 2 200 -g 40 (on an 8-core machine)

Alright, so if urcu has rbtrees I'll make sure tools/kvm/ starts using
urcu. I'll send an update tomorrow once I have something working.

-- 
Sasha.
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Mon, 2011-05-30 at 13:38 -0400, Mathieu Desnoyers wrote:
> > * Sasha Levin (levinsasha...@gmail.com) wrote:
> > > On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > > > Please note that what I currently have is a normal rbtree, not an
> > > > interval rbtree. Can you elaborate on your use-case so I can try to
> > > > figure out how we could augment it to support the interval rbtree
> > > > you need ?
> > >
> > > We don't need anything specific for the interval rbtree. The rbtree
> > > used in the kernel provides augmentation functions for insert and
> > > erase (see rb_augment_insert() and rb_augment_erase_begin() +
> > > rb_augment_erase_end()). What they basically do is call a
> > > user-provided callback for each node from the newly inserted (or
> > > deepest after deletion) node up to the root of the tree. You can see
> > > our code at 'tools/kvm/util/rbtree-interval.c'; basically all we need
> > > are the 2 augmentation functions I've mentioned above.
> >
> > Just a little question about Documentation/rbtree.txt:
> >
> > I see that find_lowest_match(lo, hi, node) seems to use ">" and "<" to
> > compare the lo value with the max_hi and node->lo. I think it would be
> > more natural to do range comparison functions with inclusive ranges
> > (>= and <=). Or maybe I am missing something about the way
> > find_lowest_match works ?
>
> It's just an implementation of an interval search() function. Since
> kernel rbtree users implement their own search() and insert() functions
> (look in our util/rbtree-interval.c) it shouldn't be a consideration in
> designing the tree - we (the users of urcu/rbtree) want to control the
> search code anyway.

The reason why I provide this facility as part of the RCU rbtree is
because, unlike the kernel rbtree where every user is free to use
"inheritance" to put their object in the same cache line as the rbtree
node, the RCU implementation needs to make copies of its inner nodes, so
the rbtree user cannot expect the node pointer to stay valid. Therefore,
I use a more usual scheme where the pointer to the user-provided data
structure is kept as a "void *key" in the node. So basically, the rbtree
user doesn't have to provide the search, next and prev functions: this
is all done within the rbtree code, especially because these have to be
RCU-aware, and because the code that augments the rbtree with ranges
needs to be RCU-aware too.

Finally, my tests showed that "<=" and ">=" need to have the equal for
the ranges to be inclusive. I tested this by using the same test case as
the search, duplicating the key value as both lower and upper bound of
the range searched for. (See the updated rbtree2 branch for the tested
range search.)

My stress-test now tests the range lookups, and it passes fine so far:

e.g. test_urcu_rbtree 6 2 200 -g 40 (on an 8-core machine)

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Mon, 2011-05-30 at 13:38 -0400, Mathieu Desnoyers wrote:
> * Sasha Levin (levinsasha...@gmail.com) wrote:
> > On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > > Please note that what I currently have is a normal rbtree, not an
> > > interval rbtree. Can you elaborate on your use-case so I can try to
> > > figure out how we could augment it to support the interval rbtree
> > > you need ?
> >
> > We don't need anything specific for interval rbtree. The rbtree used
> > in the kernel provides augmentation functions for insert and erase
> > (see rb_augment_insert() and rb_augment_erase_begin() +
> > rb_augment_erase_end()). What they basically do is call a
> > user-provided callback for each node from the newly inserted (or
> > deepest after deletion) node up to the root of the tree. You can see
> > our code at 'tools/kvm/util/rbtree-interval.c', basically all we need
> > are the 2 augmentation functions I've mentioned above.
>
> Just a little question about Documentation/rbtree.txt:
>
> I see that find_lowest_match(lo, hi, node) seems to use ">" and "<" to
> compare the lo value with the max_hi and node->lo. I think it would be
> more natural to do range comparison functions with inclusive ranges (>=
> and <=). Or maybe I am missing something about the way
> find_lowest_match works ?

It's just an implementation of an interval search() function. Since
kernel rbtree users implement their own search() and insert() functions
(look in our util/rbtree-interval.c) it shouldn't be a consideration in
designing the tree - we (the users of urcu/rbtree) want to control the
search code anyway.

-- 
Sasha.
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
> * Sasha Levin (levinsasha...@gmail.com) wrote:
> > On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > > Please note that what I currently have is a normal rbtree, not an
> > > interval rbtree. Can you elaborate on your use-case so I can try to
> > > figure out how we could augment it to support the interval rbtree
> > > you need ?
> >
> > We don't need anything specific for interval rbtree. The rbtree used
> > in the kernel provides augmentation functions for insert and erase
> > (see rb_augment_insert() and rb_augment_erase_begin() +
> > rb_augment_erase_end()). What they basically do is call a
> > user-provided callback for each node from the newly inserted (or
> > deepest after deletion) node up to the root of the tree. You can see
> > our code at 'tools/kvm/util/rbtree-interval.c', basically all we need
> > are the 2 augmentation functions I've mentioned above.
>
> Just a little question about Documentation/rbtree.txt:
>
> I see that find_lowest_match(lo, hi, node) seems to use ">" and "<" to
> compare the lo value with the max_hi and node->lo. I think it would be
> more natural to do range comparison functions with inclusive ranges (>=
> and <=). Or maybe I am missing something about the way
> find_lowest_match works ?

Sorry for the self-reply; I actually got my head around this detail:
these tests express exclusive ranges. So to get the inclusive range, we
must take the negation of the opposite test (which describes a range
excluding the limit values).

The code for augmented RBtree search by range is now in the rbtree2
branch. I still need to find a good way to test the range search
functions, though.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > Please note that what I currently have is a normal rbtree, not an
> > interval rbtree. Can you elaborate on your use-case so I can try to
> > figure out how we could augment it to support the interval rbtree you
> > need ?
>
> We don't need anything specific for interval rbtree. The rbtree used in
> the kernel provides augmentation functions for insert and erase (see
> rb_augment_insert() and rb_augment_erase_begin() +
> rb_augment_erase_end()).
> What they basically do is call a user-provided callback for each node
> from the newly inserted (or deepest after deletion) node up to the root
> of the tree. You can see our code at 'tools/kvm/util/rbtree-interval.c',
> basically all we need are the 2 augmentation functions I've mentioned
> above.

Just a little question about Documentation/rbtree.txt:

I see that find_lowest_match(lo, hi, node) seems to use ">" and "<" to
compare the lo value with the max_hi and node->lo. I think it would be
more natural to do range comparison functions with inclusive ranges (>=
and <=). Or maybe I am missing something about the way
find_lowest_match works ?

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin (levinsasha...@gmail.com) wrote:
> On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> > Please note that what I currently have is a normal rbtree, not an
> > interval rbtree. Can you elaborate on your use-case so I can try to
> > figure out how we could augment it to support the interval rbtree you
> > need ?
>
> We don't need anything specific for interval rbtree. The rbtree used in
> the kernel provides augmentation functions for insert and erase (see
> rb_augment_insert() and rb_augment_erase_begin() +
> rb_augment_erase_end()).
> What they basically do is call a user-provided callback for each node
> from the newly inserted (or deepest after deletion) node up to the root
> of the tree. You can see our code at 'tools/kvm/util/rbtree-interval.c',
> basically all we need are the 2 augmentation functions I've mentioned
> above.

Given we have to update the parent nodes when the interval values
change, we need to work on a copy of these parent nodes to ensure that
their information about the children's min/max corresponds to the
children's left/right pointers they contain. Any discrepancy between
their left/right pointers and the children's min/max values they store
would be invalid from a reader's POV.

I'll see if I can embed this in my tree. It should be doable with the
"decay" approach I am using. We'll need a way to test this, though:
possibly by walking the tree with range-aware lookups that also make
sure that the ranges that were promised by the upper nodes are
contained within their children at all times.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
> On Sun, May 29, 2011 at 01:01:04PM -0400, Mathieu Desnoyers wrote:
> > * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote:
> > > * Sasha Levin (levinsasha...@gmail.com) wrote:
> > [...]
> > > > Hi Mathieu!
> > > >
> > > > In tools/kvm/ we use a rb-tree (same one used by the kernel) with
> > > > the augmentation feature to support an interval rb-tree - which
> > > > means that every update to the tree not only updates the nodes
> > > > directly related to the updated node but also all the nodes on the
> > > > path to the root of the tree.
> > >
> > > Cool !!
> > >
> > > I'm adding in copy Phil Howard who has been working on RCU RB tree
> > > for much longer than myself.
> > >
> > > > I see that in liburcu there is an implementation of a rcu linked
> > > > list but no implementation of a rb-tree.
> > > >
> > > > Are you currently working on one? or maybe I should try writing
> > > > one and sending it to you?
> > >
> > > Actually, I started working on one last year, but had to interrupt
> > > my effort before I got it even working right.
> > [...]
> > > We'd have to see how we can go from this implementation of a
> > > standard RB tree to an interval RB tree too. I guess it will depend
> > > whether you need the updates from the target node up to the root to
> > > be done "all at once" from a reader perspective (then you would
> > > probably need to replace a copy of a part of the tree all at once),
> > > or if you can allow the update to be done piece-wise on a
> > > node-by-node basis as readers go through the tree (from root to
> > > leafs).
> >
> > I've revisited the RCU rbtree implementation this weekend, and it
> > works much better now. I reimplemented the whole thing from 0 starting
> > from the CLRS chapter 12 algorithms (to get the non-rcu
> > (insertion/removal)-only stress-tests working) and incrementally
> > RCU-ized the updates and then added read-side tests. All along, I used
> > the test_urcu_rbtree test case that does some basic coherency tests by
> > searching for some random elements that *should* be there in parallel
> > with insertions and removals. The implementation I currently have
> > survives the "search for known elements in parallel with updates"
> > stress test (so far). (e.g. test_urcu_rbtree 6 2 10 -g 30 : 6 readers,
> > 2 writers, 30 known random elements, writers are adding/removing 6
> > random elements, on an 8-core machine)
> >
> > See: git://git.lttng.org/userspace-rcu.git
> > branch : rbtree2
> >
> > The key idea I used in this implementation is to "decay" the old nodes
> > (AFAIK, I just made this up): "decaying" a node could be best
> > described as creating an exact copy of a node, and putting a pointer
> > to this new node into the old node to form a "decay chain". This
> > allowed me to keep the algorithm very similar to CLRS by just walking
> > the decay chains whenever needed. The old node "decays" by using
> > call_rcu to free it after a grace period passes. This implies that the
> > updates must hold the RCU read-side lock in addition to a mutex to
> > make sure the decaying nodes stay valid for the duration of their use.
> >
> > This implementation never requires the read-side to loop, thus
> > guaranteeing wait-free read-side behavior (so search operations will
> > always be strictly log(n) without any busy-loop delay).
> >
> > I have not created stress-tests for next/prev walks of the tree yet.
> > It is therefore entirely possible that this does not work as expected.
> >
> > Comments are welcome,
>
> Very cool!
>
> The trick Phil Howard used allowed him to avoid duplicating the nodes
> in some cases in the rotations. I might be missing something, but it
> looks like you are duplicating in all cases.

The duplications I do are (following CLRS 3rd ed., chap. 12, 13):

- x, y and beta for left and right rotation (p. 313)
- v for transplant (p. 296)
- the whole branch between z.right and y (inclusive) for lines 9--20 of
  rb_delete() (p. 324, chap. 13) (at most log(n) items), for the case I
  call rcu_rbtree_remove_nonil() in my code.

> Would using Phil's trick result in significant performance gain?

I just read through Phil's paper at http://www.cs.pdx.edu/pdfs/tr1006.pdf.
It looks like we have different targets: Phil's structure of RB tree is
heavily tuned to allow RCU search, but it uses an RW lock for in-order
traversal. Mine allows both search and in-order traversal to be
performed under RCU read-side.

One impact of my different goal is that I need to keep pointers to
parent nodes (and must know if a node is a left or right child) -- and
update both of these atomically. E.g., at least one optimisation done in
Phil's work would not work with my scheme (his optimized swap, 4.1.2):
it generates an intermediate tree state where in-order traversal could
loop between C -> B -> A -> C (trying to do multiple rcu_rbtree_next)
for a while, which goes against the time guarantees I want
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Paul E. McKenney wrote:

> > The stop_machine_run()-alike thing is for brlocks - for which
> > Sasha sent patches already, see these patches on the
> > kvm@vger.kernel.org list:
> >
> >    [PATCH 3/4] kvm tools: Add a brlock
> >    [PATCH 4/4] kvm tools: Use brlock in MMIO and IOPORT
> >
> > Wrt. brlocks, we'll keep them as simple as possible and indeed no
> > involved tricks are needed AFAICS. read_lock() will be a compiler
> > barrier(), that's as fast as it gets :-)
>
> Makes sense!
>
> The other debugging use for the read-side primitives is to execute
> the read-side ready-do-respond-to-kvm-pause code. This can help
> catch bugs where the developer put the br_read_lock() and the
> br_read_unlock() in the wrong place.

That's a very good suggestion - it might in fact be simpler to
implement than the 'replace by rw_lock' debugging variant.

Thanks,

Ingo
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Sun, 2011-05-29 at 22:54 -0400, Mathieu Desnoyers wrote:
> Please note that what I currently have is a normal rbtree, not an
> interval rbtree. Can you elaborate on your use-case so I can try to
> figure out how we could augment it to support the interval rbtree you
> need ?

We don't need anything specific for the interval rbtree. The rbtree
used in the kernel provides augmentation functions for insert and erase
(see rb_augment_insert() and rb_augment_erase_begin() +
rb_augment_erase_end()).

What they basically do is call a user-provided callback for each node
from the newly inserted (or deepest after deletion) node up to the root
of the tree. You can see our code at 'tools/kvm/util/rbtree-interval.c';
basically all we need are the 2 augmentation functions I've mentioned
above.

-- 
Sasha.
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Sun, May 29, 2011 at 01:01:04PM -0400, Mathieu Desnoyers wrote: > * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote: > > * Sasha Levin (levinsasha...@gmail.com) wrote: > [...] > > > Hi Mathieu! > > > > > > In tools/kvm/ we use a rb-tree (same one used by the kernel) with the > > > augmentation feature to support an interval rb-tree - which means that > > > every update to the tree not only updates the nodes directly related to > > > the updated node but also all the nodes on the path to the root of the > > > tree. > > > > Cool !! > > > > I'm adding in copy Phil Howard who has been working on RCU RB tree for > > much longer than myself. > > > > > I see that in liburcu there is an implementation of a rcu linked list > > > but no implementation of a rb-tree. > > > > > > Are you currently working on one? or maybe I should try writing one and > > > sending it to you? > > > > Actually, I started working on one last year, but had to interrupt my > > effort before I got it even working right. > [...] > > We'd have to see how we can go from this implementation of a standard RB > > tree to an interval RB tree too. I guess it will depend whether you need > > the updates from the target node up to the root to be done "all at once" > > from a reader perspective (then you would probably need to replace a > > copy of a part of the tree all at once), or if you can allow the update > > to be done piece-wise on a node-by-node basis as readers go through the > > tree (from root to leafs). > > I've revisited the RCU rbtree implementation this weekend, and it works > much better now. I reimplemented the whole thing from 0 starting from > the CLRS chapter 12 algorithms (to get the non-rcu > (insertion/removal)-only stress-tests working) and incrementally > RCU-ized the updates and then added read-side tests. 
All along, I used > the test_urcu_rbtree test case that does some basic coherency tests by > searching for some random elements that *should* be there in parellel > with insertion and removals. The implementation I currently have > survives the "search for known elements in parallel with updates" stress > test (so far). (e.g. test_urcu_rbtree 6 2 10 -g 30 : 6 readers, 2 > writers, 30 known random elements, writers are adding/removing 6 random > elements, on a 8-core machine) > > See: git://git.lttng.org/userspace-rcu.git > branch : rbtree2 > > The key idea I used in this implementation is to "decay" the old nodes > (AFAIK, I just made this up) : "decaying" a node could be best described > as creating an exact copy of a node, and putting a pointer to this new > node into the old node to form a "decay chain". This allowed me to keep > the algorithm very much similar to CLRS by just walking the decay chains > whenever needed. The old node "decays" by using call_rcu to free it > after a grace period passes. This imply that the updates must hold the > RCU read-side lock in addition to a mutex to make sure the decaying > nodes stay valid for the duration of their use. > > This implementation never requires the read-side to loop, thus > guaranteeing a wait-free read-side behavior (so search operations will > always be strictly log(n) without any busy-loop delay). > > I have not created stress-tests for next/prev walk of the tree yet. It > is therefore entirely possible that this does not work as expected. > > Comments are welcome, Very cool! The trick Phil Howard used allowed him to avoid duplicating the nodes in some cases in the rotations. I might be missing something, but it looks like you are duplicating in all cases. Would using Phil's trick result in significant performance gain? Thanx, Paul > Thanks, > > Mathieu > > > > > > Thanks, > > > > Mathieu > > > > > > -- > > Mathieu Desnoyers > > Operating System Efficiency R&D Consultant > > EfficiOS Inc. 
> > http://www.efficios.com > > -- > Mathieu Desnoyers > Operating System Efficiency R&D Consultant > EfficiOS Inc. > http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sun, May 29, 2011 at 09:54:50PM +0200, Ingo Molnar wrote: > > * Paul E. McKenney wrote: > > > And the other reason that you want to mark the readers is for debug > > purposes. Murphy being who he is, you will some day need to check > > for someone calling the "OK to update" function while they are > > acting as a reader. > > Correct. In one of the previous mails i suggested a debug mode that > switches it all to pthread rwlocks. > > We do not want to lose the identify of what the read paths are: this > could grow into a BKL-alike nasty-to-fix assumption over a couple of > years! Then if someone finds a usecase that intensifies the frequency > of one of the key writepaths, we'll be in trouble ... "BKR" -- Big Kernel Reader! ;-) Thanx, Paul
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sun, May 29, 2011 at 09:33:27PM +0200, Ingo Molnar wrote: > > * Paul E. McKenney wrote: > > > On Sun, May 29, 2011 at 06:00:00PM +0300, Avi Kivity wrote: > > > On 05/29/2011 05:27 PM, Ingo Molnar wrote: > > > >* Avi Kivity wrote: > > > > > > > >> I don't understand how you expect per_cpu to work in userspace. As > > > >> soon as you calculate the per-cpu address, it can be invalidated. > > > >> It doesn't help that you get a signal; you've already calculated > > > >> the address. > > > > > > > >I was thinking of some sort of transactional mechanism, a tightly > > > >controlled set of assembly instructions updating the percpu fields, > > > >where the migration event would be able to 'unroll' incomplete > > > >modifications done to the 'wrong' percpu data structure. (It would be > > > >rather complex and every percpu op would have to be an atomic because > > > >there's always the chance that it's executed on the wrong CPU.) > > > > > > > >But note that we do not even need any notification if there's a > > > >(local) lock on the percpu fields: > > > > > > > >It will work because it's statistically percpu the lock will not > > > >SMP-bounce between CPUs generally so it will be very fast to > > > >acquire/release it, and we get the full cache benefits of percpu > > > >variables. > > > > > > > >The migration notification would still be useful to detect grace > > > >periods at natural points - but as Paul pointed out polling it via > > > >SIGALRM works as well. The two (migration and SIGALRM) could be > > > >combined as well. > > > > > > I think it's way simpler to map cpu == thread. And in fact, when > > > you run a Linux kernel in a kvm guest, that's what happens, since > > > each vcpu _is_ a host thread. > > > > I have to agree with Avi here. If a stop_machine()-like approach is > > going to work, the updates have to be very rare, so any additional > > cache-nonlocality from having lots of threads should not be a problem. 
> > Especially given that in this particular case, there are exactly as > many CPUs as threads anyway. The readers should only need to touch a > constant number of cache lines either way. > > > > Or am I missing something here? > > I was talking about the (still imaginery!) user-space tree-RCU code! > :-) Ah! I did miss a turn in the discussion, then. ;-) My initial thought is to start with CPU==thread, with the CPU online and offline operations used at thread creation and destruction time. But longer term, you are right that there would be cache-locality benefits to splitting. Perhaps more important, user-space testing could then achieve much better coverage of the various race conditions. > The stop_machine_run()-alike thing is for brlocks - for which Sasha > sent patches already, see these patches on the kvm@vger.kernel.org > list: > >[PATCH 3/4] kvm tools: Add a brlock >[PATCH 4/4] kvm tools: Use brlock in MMIO and IOPORT > > Wrt. brlocks, we'll keep them as simple as possible and indeed no > involved tricks are needed AFAICS. read_lock() will be a compiler > barrier(), that's as fast as it gets :-) Makes sense! The other debugging use for the read-side primitives is to execute the read-side ready-to-respond-to-kvm-pause code. This can help catch bugs where the developer put the br_read_lock() and the br_read_unlock() in the wrong place. Thanx, Paul
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Sasha Levin (levinsasha...@gmail.com) wrote: > On Sun, 2011-05-29 at 13:01 -0400, Mathieu Desnoyers wrote: > > * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote: > > > * Sasha Levin (levinsasha...@gmail.com) wrote: > > [...] > > > > Hi Mathieu! > > > > > > > > In tools/kvm/ we use a rb-tree (same one used by the kernel) with the > > > > augmentation feature to support an interval rb-tree - which means that > > > > every update to the tree not only updates the nodes directly related to > > > > the updated node but also all the nodes on the path to the root of the > > > > tree. > > > > > > Cool !! > > > > > > I'm adding in copy Phil Howard who has been working on RCU RB tree for > > > much longer than myself. > > > > > > > I see that in liburcu there is an implementation of a rcu linked list > > > > but no implementation of a rb-tree. > > > > > > > > Are you currently working on one? or maybe I should try writing one and > > > > sending it to you? > > > > > > Actually, I started working on one last year, but had to interrupt my > > > effort before I got it even working right. > > [...] > > > We'd have to see how we can go from this implementation of a standard RB > > > tree to an interval RB tree too. I guess it will depend whether you need > > > the updates from the target node up to the root to be done "all at once" > > > from a reader perspective (then you would probably need to replace a > > > copy of a part of the tree all at once), or if you can allow the update > > > to be done piece-wise on a node-by-node basis as readers go through the > > > tree (from root to leafs). > > > > I've revisited the RCU rbtree implementation this weekend, and it works > > much better now. I reimplemented the whole thing from 0 starting from > > the CLRS chapter 12 algorithms (to get the non-rcu > > (insertion/removal)-only stress-tests working) and incrementally > > Hi Mathieu! 
> > Can't we use the rbtree provided by the kernel, and just add _rcu > functions to provide rcu protected rbtree? Typical RB trees expect mutual exclusion between writers and readers, and between writers. AFAIK, the in-kernel rbtree implementation has this requirement. "RCU-izing" the RBtree structure requires more than just adding some functions: its whole structure must be adapted to support concurrent reader activity while updates are performed. > > > RCU-ized the updates and then added read-side tests. All along, I used > > the test_urcu_rbtree test case that does some basic coherency tests by > > searching for some random elements that *should* be there in parellel > > with insertion and removals. The implementation I currently have > > survives the "search for known elements in parallel with updates" stress > > test (so far). (e.g. test_urcu_rbtree 6 2 10 -g 30 : 6 readers, 2 > > writers, 30 known random elements, writers are adding/removing 6 random > > elements, on a 8-core machine) > > I've grabbed it and it works great for me here too :) > test_urcu_rbtree does print a lot of mess, making it somewhat hard to > work with. I've now removed the "debug" printfs from the rbtree2 head. > > > > > See: git://git.lttng.org/userspace-rcu.git > > branch : rbtree2 > > > > The key idea I used in this implementation is to "decay" the old nodes > > (AFAIK, I just made this up) : "decaying" a node could be best described > > as creating an exact copy of a node, and putting a pointer to this new > > node into the old node to form a "decay chain". This allowed me to keep > > the algorithm very much similar to CLRS by just walking the decay chains > > whenever needed. The old node "decays" by using call_rcu to free it > > after a grace period passes. This imply that the updates must hold the > > RCU read-side lock in addition to a mutex to make sure the decaying > > nodes stay valid for the duration of their use. 
> > > > This implementation never requires the read-side to loop, thus > > guaranteeing a wait-free read-side behavior (so search operations will > > always be strictly log(n) without any busy-loop delay). > > > > I have not created stress-tests for next/prev walk of the tree yet. It > > is therefore entirely possible that this does not work as expected. I just added some min + next and max + prev walk stress tests into test_urcu_rbtree. They seem to work fine so far with 6 readers + 2 writers running concurrently with 50 random items (it caught a stupid bug in prev(), which I fixed immediately). The idea behind these stress tests is to walk forward or backward over the entire tree and flag the position that corresponds to the global array of values in an array of flags. At the end, it checks that all the values that "must" be there were indeed flagged. > > I've 'forked' tools/kvm/, and started working on integrating urcu into > it (Will be located in https://github.com/levinsasha/linux-kvm once some > progress has been made), this should make it easier to review > work-in-progress. I'll switch to the rbtree2 branch in urcu and work with it from now.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Paul E. McKenney wrote: > And the other reason that you want to mark the readers is for debug > purposes. Murphy being who he is, you will some day need to check > for someone calling the "OK to update" function while they are > acting as a reader. Correct. In one of the previous mails i suggested a debug mode that switches it all to pthread rwlocks. We do not want to lose the identity of what the read paths are: this could grow into a BKL-alike nasty-to-fix assumption over a couple of years! Then if someone finds a usecase that intensifies the frequency of one of the key writepaths, we'll be in trouble ... Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Paul E. McKenney wrote: > On Sun, May 29, 2011 at 06:00:00PM +0300, Avi Kivity wrote: > > On 05/29/2011 05:27 PM, Ingo Molnar wrote: > > >* Avi Kivity wrote: > > > > > >> I don't understand how you expect per_cpu to work in userspace. As > > >> soon as you calculate the per-cpu address, it can be invalidated. > > >> It doesn't help that you get a signal; you've already calculated > > >> the address. > > > > > >I was thinking of some sort of transactional mechanism, a tightly > > >controlled set of assembly instructions updating the percpu fields, > > >where the migration event would be able to 'unroll' incomplete > > >modifications done to the 'wrong' percpu data structure. (It would be > > >rather complex and every percpu op would have to be an atomic because > > >there's always the chance that it's executed on the wrong CPU.) > > > > > >But note that we do not even need any notification if there's a > > >(local) lock on the percpu fields: > > > > > >It will work because it's statistically percpu the lock will not > > >SMP-bounce between CPUs generally so it will be very fast to > > >acquire/release it, and we get the full cache benefits of percpu > > >variables. > > > > > >The migration notification would still be useful to detect grace > > >periods at natural points - but as Paul pointed out polling it via > > >SIGALRM works as well. The two (migration and SIGALRM) could be > > >combined as well. > > > > I think it's way simpler to map cpu == thread. And in fact, when > > you run a Linux kernel in a kvm guest, that's what happens, since > > each vcpu _is_ a host thread. > > I have to agree with Avi here. If a stop_machine()-like approach is > going to work, the updates have to be very rare, so any additional > cache-nonlocality from having lots of threads should not be a problem. > Especially given that in this particular case, there are exactly as > many CPUs as threads anyway. 
The readers should only need to touch a > constant number of cache lines either way. > > Or am I missing something here? I was talking about the (still imaginary!) user-space tree-RCU code! :-) The stop_machine_run()-alike thing is for brlocks - for which Sasha sent patches already, see these patches on the kvm@vger.kernel.org list: [PATCH 3/4] kvm tools: Add a brlock [PATCH 4/4] kvm tools: Use brlock in MMIO and IOPORT Wrt. brlocks, we'll keep them as simple as possible and indeed no involved tricks are needed AFAICS. read_lock() will be a compiler barrier(), that's as fast as it gets :-) Thanks, Ingo
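The brlock shape Ingo describes can be sketched as below. This is a hedged sketch, not Sasha's actual patches: kvm__pause()/kvm__continue() are empty placeholders for the stop_machine_run()-style step that would bring every vcpu thread to rest, and the writer-side spinlock is just an arbitrary way to serialize the (rare) writers:

```c
#include <assert.h>
#include <stdatomic.h>

/* Read side: a pure compiler barrier -- free at run time, but it keeps
 * every read path explicitly marked in the source. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static inline void br_read_lock(void)   { barrier(); }
static inline void br_read_unlock(void) { barrier(); }

/* Placeholders: the real write side would stop/resume all vcpu threads. */
static void kvm__pause(void)    { }
static void kvm__continue(void) { }

static atomic_flag br_writer_lock = ATOMIC_FLAG_INIT;

static void br_write_lock(void)
{
	while (atomic_flag_test_and_set(&br_writer_lock))
		;			/* writers are rare, spinning is fine */
	kvm__pause();			/* bring all readers to rest */
}

static void br_write_unlock(void)
{
	kvm__continue();		/* let the readers run again */
	atomic_flag_clear(&br_writer_lock);
}
```

The point of keeping the br_read_lock()/br_read_unlock() calls even though they compile to nothing is exactly the one made earlier in the thread: the read paths stay identified, so a later switch to rwlocks or urcu is mechanical.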
Re: RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
On Sun, 2011-05-29 at 13:01 -0400, Mathieu Desnoyers wrote: > * Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote: > > * Sasha Levin (levinsasha...@gmail.com) wrote: > [...] > > > Hi Mathieu! > > > > > > In tools/kvm/ we use a rb-tree (same one used by the kernel) with the > > > augmentation feature to support an interval rb-tree - which means that > > > every update to the tree not only updates the nodes directly related to > > > the updated node but also all the nodes on the path to the root of the > > > tree. > > > > Cool !! > > > > I'm adding in copy Phil Howard who has been working on RCU RB tree for > > much longer than myself. > > > > > I see that in liburcu there is an implementation of a rcu linked list > > > but no implementation of a rb-tree. > > > > > > Are you currently working on one? or maybe I should try writing one and > > > sending it to you? > > > > Actually, I started working on one last year, but had to interrupt my > > effort before I got it even working right. > [...] > > We'd have to see how we can go from this implementation of a standard RB > > tree to an interval RB tree too. I guess it will depend whether you need > > the updates from the target node up to the root to be done "all at once" > > from a reader perspective (then you would probably need to replace a > > copy of a part of the tree all at once), or if you can allow the update > > to be done piece-wise on a node-by-node basis as readers go through the > > tree (from root to leafs). > > I've revisited the RCU rbtree implementation this weekend, and it works > much better now. I reimplemented the whole thing from 0 starting from > the CLRS chapter 12 algorithms (to get the non-rcu > (insertion/removal)-only stress-tests working) and incrementally Hi Mathieu! Can't we use the rbtree provided by the kernel, and just add _rcu functions to provide rcu protected rbtree? > RCU-ized the updates and then added read-side tests. 
All along, I used > the test_urcu_rbtree test case that does some basic coherency tests by > searching for some random elements that *should* be there in parellel > with insertion and removals. The implementation I currently have > survives the "search for known elements in parallel with updates" stress > test (so far). (e.g. test_urcu_rbtree 6 2 10 -g 30 : 6 readers, 2 > writers, 30 known random elements, writers are adding/removing 6 random > elements, on a 8-core machine) I've grabbed it and it works great for me here too :) test_urcu_rbtree does print a lot of mess, making it somewhat hard to work with. > > See: git://git.lttng.org/userspace-rcu.git > branch : rbtree2 > > The key idea I used in this implementation is to "decay" the old nodes > (AFAIK, I just made this up) : "decaying" a node could be best described > as creating an exact copy of a node, and putting a pointer to this new > node into the old node to form a "decay chain". This allowed me to keep > the algorithm very much similar to CLRS by just walking the decay chains > whenever needed. The old node "decays" by using call_rcu to free it > after a grace period passes. This imply that the updates must hold the > RCU read-side lock in addition to a mutex to make sure the decaying > nodes stay valid for the duration of their use. > > This implementation never requires the read-side to loop, thus > guaranteeing a wait-free read-side behavior (so search operations will > always be strictly log(n) without any busy-loop delay). > > I have not created stress-tests for next/prev walk of the tree yet. It > is therefore entirely possible that this does not work as expected. I've 'forked' tools/kvm/, and started working on integrating urcu into it (Will be located in https://github.com/levinsasha/linux-kvm once some progress has been made), this should make it easier to review work-in-progress. I'll switch to the rbtree2 branch in urcu and work with it from now. 
> Comments are welcome, > > Thanks, > > Mathieu > > > > > > Thanks, > > > > Mathieu > > > > > > -- > > Mathieu Desnoyers > > Operating System Efficiency R&D Consultant > > EfficiOS Inc. > > http://www.efficios.com > -- Sasha.
RCU red-black tree (was: Re: [PATCH 4/6] kvm tools: Add rwlock wrapper)
* Mathieu Desnoyers (mathieu.desnoy...@efficios.com) wrote: > * Sasha Levin (levinsasha...@gmail.com) wrote: [...] > > Hi Mathieu! > > > > In tools/kvm/ we use a rb-tree (same one used by the kernel) with the > > augmentation feature to support an interval rb-tree - which means that > > every update to the tree not only updates the nodes directly related to > > the updated node but also all the nodes on the path to the root of the > > tree. > > Cool !! > > I'm adding in copy Phil Howard who has been working on RCU RB tree for > much longer than myself. > > > I see that in liburcu there is an implementation of a rcu linked list > > but no implementation of a rb-tree. > > > > Are you currently working on one? or maybe I should try writing one and > > sending it to you? > > Actually, I started working on one last year, but had to interrupt my > effort before I got it even working right. [...] > We'd have to see how we can go from this implementation of a standard RB > tree to an interval RB tree too. I guess it will depend whether you need > the updates from the target node up to the root to be done "all at once" > from a reader perspective (then you would probably need to replace a > copy of a part of the tree all at once), or if you can allow the update > to be done piece-wise on a node-by-node basis as readers go through the > tree (from root to leafs). I've revisited the RCU rbtree implementation this weekend, and it works much better now. I reimplemented the whole thing from 0 starting from the CLRS chapter 12 algorithms (to get the non-rcu (insertion/removal)-only stress-tests working) and incrementally RCU-ized the updates and then added read-side tests. All along, I used the test_urcu_rbtree test case that does some basic coherency tests by searching for some random elements that *should* be there in parallel with insertion and removals. The implementation I currently have survives the "search for known elements in parallel with updates" stress test (so far). 
(e.g. test_urcu_rbtree 6 2 10 -g 30 : 6 readers, 2 writers, 30 known random elements, writers are adding/removing 6 random elements, on an 8-core machine) See: git://git.lttng.org/userspace-rcu.git branch : rbtree2 The key idea I used in this implementation is to "decay" the old nodes (AFAIK, I just made this up) : "decaying" a node could be best described as creating an exact copy of a node, and putting a pointer to this new node into the old node to form a "decay chain". This allowed me to keep the algorithm very much similar to CLRS by just walking the decay chains whenever needed. The old node "decays" by using call_rcu to free it after a grace period passes. This implies that the updates must hold the RCU read-side lock in addition to a mutex to make sure the decaying nodes stay valid for the duration of their use. This implementation never requires the read-side to loop, thus guaranteeing a wait-free read-side behavior (so search operations will always be strictly log(n) without any busy-loop delay). I have not created stress-tests for next/prev walk of the tree yet. It is therefore entirely possible that this does not work as expected. Comments are welcome, Thanks, Mathieu > > Thanks, > > Mathieu > > > -- > Mathieu Desnoyers > Operating System Efficiency R&D Consultant > EfficiOS Inc. > http://www.efficios.com -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
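The "decay chain" idea can be shown with a toy structure. Everything below is invented for illustration (the real implementation is in the rbtree2 branch), and the call_rcu() deferred free is deliberately left out, so this sketch simply leaks the old copy instead of reclaiming it after a grace period:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy node: just a key plus the decay pointer to its newer copy. */
struct toy_node {
	int key;
	struct toy_node *decay_next;	/* newer copy, NULL if current */
};

/* "Decay" a node: make an exact copy and chain the old node to it.
 * In the real code the old node would then be handed to call_rcu()
 * so it stays valid for readers until a grace period elapses. */
static struct toy_node *decay(struct toy_node *old)
{
	struct toy_node *new = malloc(sizeof(*new));

	memcpy(new, old, sizeof(*new));
	new->decay_next = NULL;
	old->decay_next = new;
	return new;
}

/* Walk a decay chain to the most recent copy of a node. */
static struct toy_node *get_decay(struct toy_node *n)
{
	while (n->decay_next)
		n = n->decay_next;
	return n;
}
```

An updater holding stale pointers (for example, the parent link of a node that was copied during a rotation) just calls get_decay() to reach the current copy, which is what lets the update code stay close to the CLRS algorithms.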
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sun, 2011-05-29 at 08:31 -0700, Paul E. McKenney wrote: > This is very important even if no write path ever becomes more than > just occasional. If you don't mark the read paths like Ingo suggests, > your one-year-from-now self will be very annoyed at you, as the code > will be quite difficult to understand. And anyone else trying to > read your code will be even more annoyed. ;-) I very much agree with that. Working on the code, I'm trying to separate the pause/unpause of the guest from a 'brlock' which allows for very cheap reads and very expensive writes. -- Sasha.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sun, May 29, 2011 at 08:31:30AM -0700, Paul E. McKenney wrote: > On Sun, May 29, 2011 at 09:19:48AM +0200, Ingo Molnar wrote: > > > > * Avi Kivity wrote: > > > > > Yes, this is equivalent to the kernel's stop_machine_run(). It's a > > > heavyweight method but it should work just fine. > > > > Yeah. It is fine for reconfiguration/configuration-only kind of write > > paths - i.e. the mmio lookup path should be ok. > > > > There's only one thing i'd like to warn about: please still shape it > > as a br_read_lock()/unlock(), br_write_lock()/unlock() operation. > > > > This way all the SMP read paths remain identified properly, even if > > br_write_lock() does a stop_machine_run() equivalent. This still > > leaves options open for an easy conversion to rwlock or urcu (should > > any particular write path become more than just occasional). > > This is very important even if no write path ever becomes more than > just occasional. If you don't mark the read paths like Ingo suggests, > your one-year-from-now self will be very annoyed at you, as the code > will be quite difficult to understand. And anyone else trying to > read your code will be even more annoyed. ;-) And the other reason that you want to mark the readers is for debug purposes. Murphy being who he is, you will some day need to check for someone calling the "OK to update" function while they are acting as a reader. Thanx, Paul
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sun, May 29, 2011 at 06:00:00PM +0300, Avi Kivity wrote: > On 05/29/2011 05:27 PM, Ingo Molnar wrote: > >* Avi Kivity wrote: > > > >> I don't understand how you expect per_cpu to work in userspace. As > >> soon as you calculate the per-cpu address, it can be invalidated. > >> It doesn't help that you get a signal; you've already calculated > >> the address. > > > >I was thinking of some sort of transactional mechanism, a tightly > >controlled set of assembly instructions updating the percpu fields, > >where the migration event would be able to 'unroll' incomplete > >modifications done to the 'wrong' percpu data structure. (It would be > >rather complex and every percpu op would have to be an atomic because > >there's always the chance that it's executed on the wrong CPU.) > > > >But note that we do not even need any notification if there's a > >(local) lock on the percpu fields: > > > >It will work because it's statistically percpu the lock will not > >SMP-bounce between CPUs generally so it will be very fast to > >acquire/release it, and we get the full cache benefits of percpu > >variables. > > > >The migration notification would still be useful to detect grace > >periods at natural points - but as Paul pointed out polling it via > >SIGALRM works as well. The two (migration and SIGALRM) could be > >combined as well. > > I think it's way simpler to map cpu == thread. And in fact, when > you run a Linux kernel in a kvm guest, that's what happens, since > each vcpu _is_ a host thread. I have to agree with Avi here. If a stop_machine()-like approach is going to work, the updates have to be very rare, so any additional cache-nonlocality from having lots of threads should not be a problem. Especially given that in this particular case, there are exactly as many CPUs as threads anyway. The readers should only need to touch a constant number of cache lines either way. Or am I missing something here? 
Thanx, Paul
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sun, May 29, 2011 at 09:19:48AM +0200, Ingo Molnar wrote: > > * Avi Kivity wrote: > > > Yes, this is equivalent to the kernel's stop_machine_run(). It's a > > heavyweight method but it should work just fine. > > Yeah. It is fine for reconfiguration/configuration-only kind of write > paths - i.e. the mmio lookup path should be ok. > > There's only one thing i'd like to warn about: please still shape it > as a br_read_lock()/unlock(), br_write_lock()/unlock() operation. > > This way all the SMP read paths remain identified properly, even if > br_write_lock() does a stop_machine_run() equivalent. This still > leaves options open for an easy conversion to rwlock or urcu (should > any particular write path become more than just occasional). This is very important even if no write path ever becomes more than just occasional. If you don't mark the read paths like Ingo suggests, your one-year-from-now self will be very annoyed at you, as the code will be quite difficult to understand. And anyone else trying to read your code will be even more annoyed. ;-) Thanx, Paul
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/29/2011 05:27 PM, Ingo Molnar wrote: * Avi Kivity wrote: > I don't understand how you expect per_cpu to work in userspace. As > soon as you calculate the per-cpu address, it can be invalidated. > It doesn't help that you get a signal; you've already calculated > the address. I was thinking of some sort of transactional mechanism, a tightly controlled set of assembly instructions updating the percpu fields, where the migration event would be able to 'unroll' incomplete modifications done to the 'wrong' percpu data structure. (It would be rather complex and every percpu op would have to be an atomic because there's always the chance that it's executed on the wrong CPU.) But note that we do not even need any notification if there's a (local) lock on the percpu fields: It will work because it's statistically percpu the lock will not SMP-bounce between CPUs generally so it will be very fast to acquire/release it, and we get the full cache benefits of percpu variables. The migration notification would still be useful to detect grace periods at natural points - but as Paul pointed out polling it via SIGALRM works as well. The two (migration and SIGALRM) could be combined as well. I think it's way simpler to map cpu == thread. And in fact, when you run a Linux kernel in a kvm guest, that's what happens, since each vcpu _is_ a host thread. -- error compiling committee.c: too many arguments to function
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Avi Kivity wrote: > I don't understand how you expect per_cpu to work in userspace. As > soon as you calculate the per-cpu address, it can be invalidated. > It doesn't help that you get a signal; you've already calculated > the address. I was thinking of some sort of transactional mechanism, a tightly controlled set of assembly instructions updating the percpu fields, where the migration event would be able to 'unroll' incomplete modifications done to the 'wrong' percpu data structure. (It would be rather complex and every percpu op would have to be an atomic because there's always the chance that it's executed on the wrong CPU.) But note that we do not even need any notification if there's a (local) lock on the percpu fields: It will work because, statistically, the percpu lock will not SMP-bounce between CPUs, so it will generally be very fast to acquire/release, and we get the full cache benefits of percpu variables. The migration notification would still be useful to detect grace periods at natural points - but as Paul pointed out polling it via SIGALRM works as well. The two (migration and SIGALRM) could be combined as well. Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/29/2011 03:37 PM, Ingo Molnar wrote: > > > > It's not transparent at all if you index RCU data structures by > > the current CPU index, which the kernel implementation does. > > But that's completely broken for userspace. The "current cpu > index" doesn't even exist, since you can't disable preemption. It does exist, if the signal handler notification of a migration is instantaneous (which it is). Something like rcu_preempt_qs(), which expects to be called with interrupts disabled, cannot be made to work. I don't understand how you expect per_cpu to work in userspace. As soon as you calculate the per-cpu address, it can be invalidated. It doesn't help that you get a signal; you've already calculated the address. Also, in the kernel, using per-cpu data implies mutual exclusion (since you've disabled preemption). That doesn't apply to userspace. > > Doing that has the advantage of being much more cache-compressed > > than the TID index, > > If you have more tasks than cpus; which isn't a given. Sure there are special cases but in general there can be many more tasks (threads) than CPUs ;-) Sure; that was just a side note. Note that for server virtualization you usually have less tasks (in a process, not globally) than host cpus. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Avi Kivity wrote: > On 05/29/2011 10:35 AM, Ingo Molnar wrote: > >* Avi Kivity wrote: > > > >> On 05/28/2011 09:32 PM, Ingo Molnar wrote: > >> >* Avi Kivity wrote: > >> > > >> >> > So if you set a notification signal via fcntl(F_SETOWN) on the > >> >> > scheduler context switch event fd, the user-space RCU code will > >> >> > get a signal on every context switch. > >> >> > >> >> Context switches are completely uninteresting for userspace rcu: > >> >> > >> >> rcu_read_lock(); > >> >> ---> context switch > >> >> > >> >> have we learned anything from that? no. User code is always > >> >> preemptible and migratable. If rcu_read_lock() prevented migration > >> >> somehow, then we'd know that a context switch means we've started a > >> >> grace period for this thread. But it doesn't, so we don't. > >> > > >> >Well, in the next mail i mentioned that we can do migration events as > >> >well, which would be useful: instead of having to keep track of > >> >nr_tasks RCU grace periods we could simplify it down to nr_cpus. > >> > >> I don't see how a migration event helps. It is completely > >> transparent from the task's point of view. > > > > It's not transparent at all if you index RCU data structures by > > the current CPU index, which the kernel implementation does. > > But that's completely broken for userspace. The "current cpu > index" doesn't even exist, since you can't disable preemption. It does exist, if the signal handler notification of a migration is instantaneous (which it is). > > Doing that has the advantage of being much more cache-compressed > > than the TID index, > > If you have more tasks than cpus; which isn't a given. Sure there are special cases but in general there can be many more tasks (threads) than CPUs ;-) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/29/2011 10:35 AM, Ingo Molnar wrote: * Avi Kivity wrote: > On 05/28/2011 09:32 PM, Ingo Molnar wrote: > >* Avi Kivity wrote: > > > >> > So if you set a notification signal via fcntl(F_SETOWN) on the > >> > scheduler context switch event fd, the user-space RCU code will > >> > get a signal on every context switch. > >> > >> Context switches are completely uninteresting for userspace rcu: > >> > >> rcu_read_lock(); > >> ---> context switch > >> > >> have we learned anything from that? no. User code is always > >> preemptible and migratable. If rcu_read_lock() prevented migration > >> somehow, then we'd know that a context switch means we've started a > >> grace period for this thread. But it doesn't, so we don't. > > > >Well, in the next mail i mentioned that we can do migration events as > >well, which would be useful: instead of having to keep track of > >nr_tasks RCU grace periods we could simplify it down to nr_cpus. > > I don't see how a migration event helps. It is completely > transparent from the task's point of view. It's not transparent at all if you index RCU data structures by the current CPU index, which the kernel implementation does. But that's completely broken for userspace. The "current cpu index" doesn't even exist, since you can't disable preemption. Doing that has the advantage of being much more cache-compressed than the TID index, If you have more tasks than cpus; which isn't a given. and also having better worst-case grace period latency properties than a TID index. > > But if we indexed by the TID then we wouldnt need any scheduler > > bindings at all - this is the simpler approach. > > Yes, and it maps 1:1 to the kernel implementation (cpu = task). No, the kernel indexes grace period tracking (and the write-completion queues) by CPU index. Do a conceptual #define cpu task and it all works out. > >> What's needed are explicit notifications about grace periods. For > >> the vcpu threads, calling KVM_VCPU_RUN seems like a good point. 
> >> For I/O threads, completion of processing of an event is also a > >> good point. > > > > Grace period notifications are needed too, obviously. > > I'd think they're sufficient, no? Is something else needed? I think you are missing the fact that in the kernel we index RCU data structures by CPU number: static void rcu_preempt_qs(int cpu) { struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu); ... s/per_cpu/__thread/ static void rcu_preempt_note_context_switch(int cpu) { struct task_struct *t = current; unsigned long flags; struct rcu_data *rdp; struct rcu_node *rnp; if (t->rcu_read_lock_nesting && (t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED) == 0) { /* Possibly blocking in an RCU read-side critical section. */ rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu); ... Which could be changed over to be per task in user-space by treating the TID as a 'virtual CPU' equivalent. This probably lengthens worst-case rcu_sync() latencies rather significantly though - possibly turning urcu into a stop_machine_run() equivalent in the worst-case. (but i could be wrong about this last bit) I believe you are. urcu does stress scaling, since it's much easier to add tasks than it is to add cpus, but it's conceptually the same problem. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Avi Kivity wrote: > On 05/28/2011 09:32 PM, Ingo Molnar wrote: > >* Avi Kivity wrote: > > > >> > So if you set a notification signal via fcntl(F_SETOWN) on the > >> > scheduler context switch event fd, the user-space RCU code will > >> > get a signal on every context switch. > >> > >> Context switches are completely uninteresting for userspace rcu: > >> > >>rcu_read_lock(); > >>---> context switch > >> > >> have we learned anything from that? no. User code is always > >> preemptible and migratable. If rcu_read_lock() prevented migration > >> somehow, then we'd know that a context switch means we've started a > >> grace period for this thread. But it doesn't, so we don't. > > > >Well, in the next mail i mentioned that we can do migration events as > >well, which would be useful: instead of having to keep track of > >nr_tasks RCU grace periods we could simplify it down to nr_cpus. > > I don't see how a migration event helps. It is completely > transparent from the task's point of view. It's not transparent at all if you index RCU data structures by the current CPU index, which the kernel implementation does. Doing that has the advantage of being much more cache-compressed than the TID index, and also having better worst-case grace period latency properties than a TID index. > > But if we indexed by the TID then we wouldnt need any scheduler > > bindings at all - this is the simpler approach. > > Yes, and it maps 1:1 to the kernel implementation (cpu = task). No, the kernel indexes grace period tracking (and the write-completion queues) by CPU index. > >> What's needed are explicit notifications about grace periods. For > >> the vcpu threads, calling KVM_VCPU_RUN seems like a good point. > >> For I/O threads, completion of processing of an event is also a > >> good point. > > > > Grace period notifications are needed too, obviously. > > I'd think they're sufficient, no? Is something else needed? 
I think you are missing the fact that in the kernel we index RCU data structures by CPU number: static void rcu_preempt_qs(int cpu) { struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu); ... static void rcu_preempt_note_context_switch(int cpu) { struct task_struct *t = current; unsigned long flags; struct rcu_data *rdp; struct rcu_node *rnp; if (t->rcu_read_lock_nesting && (t->rcu_read_unlock_special & RCU_READ_UNLOCK_BLOCKED) == 0) { /* Possibly blocking in an RCU read-side critical section. */ rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu); ... Which could be changed over to be per task in user-space by treating the TID as a 'virtual CPU' equivalent. This probably lengthens worst-case rcu_sync() latencies rather significantly though - possibly turning urcu into a stop_machine_run() equivalent in the worst-case. (but i could be wrong about this last bit) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
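Ingo's s/per_cpu/__thread/ substitution can be made concrete. The sketch below is purely illustrative (the names are made up; it is not the urcu or kernel implementation): it moves the per-CPU RCU bookkeeping into thread-local storage, so the "which CPU am I on" question that breaks in preemptible userspace simply disappears.

```c
#include <assert.h>

/* Kernel style: one rcu_data per CPU, indexed by CPU number. */
struct rcu_data {
	unsigned long passed_quiesc;	/* quiescent states seen */
	int qs_pending;			/* core is waiting for a QS from us */
};

/* User-space analogue of the kernel's per_cpu(rcu_preempt_data, cpu):
 * each thread is its own 'virtual CPU', so the lookup needs no cpu
 * index and cannot be invalidated by migration. */
static __thread struct rcu_data rcu_thread_data;

/* Analogue of rcu_preempt_qs(cpu), minus the cpu argument. */
static void rcu_thread_qs(void)
{
	struct rcu_data *rdp = &rcu_thread_data;

	rdp->passed_quiesc++;
	rdp->qs_pending = 0;
}
```

The address of `rcu_thread_data` is stable for the lifetime of the thread, which is exactly the property `per_cpu()` lacks once preemption and migration are uncontrollable.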
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Avi Kivity wrote: > Yes, this is equivalent to the kernel's stop_machine_run(). It's a > heavyweight method but it should work just fine. Yeah. It is fine for reconfiguration/configuration-only kind of write paths - i.e. the mmio lookup path should be ok. There's only one thing i'd like to warn about: please still shape it as a br_read_lock()/unlock(), br_write_lock()/unlock() operation. This way all the SMP read paths remain identified properly, even if br_write_lock() does a stop_machine_run() equivalent. This still leaves options open for an easy conversion to rwlock or urcu (should any particular write path become more than just occasional). Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/28/2011 10:45 PM, Sasha Levin wrote: On Sat, 2011-05-28 at 17:24 +0200, Ingo Molnar wrote: > * Sasha Levin wrote: > > > So the basic plan here is to allocate a futex(?) for each VCPU > > thread, and have the writer thread lock all futexes when it needs > > to write? > > > > If we assume we only have one writer thread, it can stay pretty > > simple. > > We can use an even simpler and more scalable method: > >- writers can 'stop' all other threads, by sending them a > threadpool signal and waiting for each thread to have completed > processing their current work and notifying the writer back that > they have stopped running. > > This means that the read-side lock is _zero instructions_, basically > just a barrier() to make sure the compiler does not move instructions > across threadpool functions (it wont). > > This method requires that we know about every worker thread - i.e. > no-one does a stray pthread_create() and uses data structures from > there. It also requires that each worker thread can 'stop' within a > reasonable amount of time. In this case, maybe instead of implementing it as a 'lock', we can implement it as a way to stop all vcpu threads from reentering the kernel (KVM_RUN): 1. Set a 'vcpu-stop' flag. 2. Signal all VCPUs to exit KVM_RUN. 3. VCPU threads now wait on our lock before reentering into KVM_RUN - the writer thread waits until waiting threads = VCPU count. 4. Writer thread writes, releases lock. So instead of it being a lock in MMIO, IO-ports, etc - it's a method to stop the entire guest which could be used during configuration updates (and anything else we might think of). It could also be used as a method for users to 'pause' the guest. Yes, this is equivalent to the kernel's stop_machine_run(). It's a heavyweight method but it should work just fine. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. 
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/28/2011 09:32 PM, Ingo Molnar wrote: * Avi Kivity wrote: > > So if you set a notification signal via fcntl(F_SETOWN) on the > > scheduler context switch event fd, the user-space RCU code will > > get a signal on every context switch. > > Context switches are completely uninteresting for userspace rcu: > >rcu_read_lock(); >---> context switch > > have we learned anything from that? no. User code is always > preemptible and migratable. If rcu_read_lock() prevented migration > somehow, then we'd know that a context switch means we've started a > grace period for this thread. But it doesn't, so we don't. Well, in the next mail i mentioned that we can do migration events as well, which would be useful: instead of having to keep track of nr_tasks RCU grace periods we could simplify it down to nr_cpus. I don't see how a migration event helps. It is completely transparent from the task's point of view. But if we indexed by the TID then we wouldnt need any scheduler bindings at all - this is the simpler approach. Yes, and it maps 1:1 to the kernel implementation (cpu = task). > What's needed are explicit notifications about grace periods. For > the vcpu threads, calling KVM_VCPU_RUN seems like a good point. > For I/O threads, completion of processing of an event is also a > good point. Grace period notifications are needed too, obviously. I'd think they're sufficient, no? Is something else needed? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sat, 2011-05-28 at 17:24 +0200, Ingo Molnar wrote: > * Sasha Levin wrote: > > > So the basic plan here is to allocate a futex(?) for each VCPU > > thread, and have the writer thread lock all futexes when it needs > > to write? > > > > If we assume we only have one writer thread, it can stay pretty > > simple. > > We can use an even simpler and more scalable method: > > - writers can 'stop' all other threads, by sending them a > threadpool signal and waiting for each thread to have completed > processing their current work and notifying the writer back that > they have stopped running. > > This means that the read-side lock is _zero instructions_, basically > just a barrier() to make sure the compiler does not move instructions > across threadpool functions (it wont). > > This method requires that we know about every worker thread - i.e. > no-one does a stray pthread_create() and uses data structures from > there. It also requires that each worker thread can 'stop' within a > reasonable amount of time. In this case, maybe instead of implementing it as a 'lock', we can implement it as a way to stop all vcpu threads from reentering the kernel (KVM_RUN): 1. Set a 'vcpu-stop' flag. 2. Signal all VCPUs to exit KVM_RUN. 3. VCPU threads now wait on our lock before reentering into KVM_RUN - the writer thread waits until waiting threads = VCPU count. 4. Writer thread writes, releases lock. So instead of it being a lock in MMIO, IO-ports, etc - it's a method to stop the entire guest which could be used during configuration updates (and anything else we might think of). It could also be used as a method for users to 'pause' the guest. -- Sasha. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
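Sasha's four steps can be sketched with ordinary pthread primitives. This is only an illustration under assumed names (the real kvm tools code would additionally signal the vcpu threads so they exit the KVM_RUN ioctl, which is not modeled here):

```c
#include <pthread.h>
#include <assert.h>

static pthread_mutex_t pause_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pause_cond = PTHREAD_COND_INITIALIZER;
static int vcpu_stop;		/* step 1: writer raises this flag */
static int nr_stopped;		/* vcpu threads currently parked */
static int nr_vcpus = 4;

/* Step 3: called by each vcpu thread before (re)entering KVM_RUN. */
static void vcpu_run_gate(void)
{
	pthread_mutex_lock(&pause_lock);
	nr_stopped++;
	pthread_cond_broadcast(&pause_cond);	/* writer may be waiting */
	while (vcpu_stop)
		pthread_cond_wait(&pause_cond, &pause_lock);
	nr_stopped--;
	pthread_mutex_unlock(&pause_lock);
}

/* Steps 1-2 plus the wait: stop the whole guest. */
static void guest_pause(void)
{
	pthread_mutex_lock(&pause_lock);
	vcpu_stop = 1;
	/* (real code: signal the vcpus so they exit the KVM_RUN ioctl) */
	while (nr_stopped < nr_vcpus)
		pthread_cond_wait(&pause_cond, &pause_lock);
	pthread_mutex_unlock(&pause_lock);
}

/* Step 4: writer is done; let the vcpus reenter KVM_RUN. */
static void guest_resume(void)
{
	pthread_mutex_lock(&pause_lock);
	vcpu_stop = 0;
	pthread_cond_broadcast(&pause_cond);
	pthread_mutex_unlock(&pause_lock);
}
```

Between `guest_pause()` and `guest_resume()` the writer knows every vcpu is parked at the gate, so it can rewrite the MMIO/ioport tables without any locking on the read side.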
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Avi Kivity wrote: > > So if you set a notification signal via fcntl(F_SETOWN) on the > > scheduler context switch event fd, the user-space RCU code will > > get a signal on every context switch. > > Context switches are completely uninteresting for userspace rcu: > > rcu_read_lock(); > ---> context switch > > have we learned anything from that? no. User code is always > preemptible and migratable. If rcu_read_lock() prevented migration > somehow, then we'd know that a context switch means we've started a > grace period for this thread. But it doesn't, so we don't. Well, in the next mail i mentioned that we can do migration events as well, which would be useful: instead of having to keep track of nr_tasks RCU grace periods we could simplify it down to nr_cpus. But if we indexed by the TID then we wouldnt need any scheduler bindings at all - this is the simpler approach. > What's needed are explicit notifications about grace periods. For > the vcpu threads, calling KVM_VCPU_RUN seems like a good point. > For I/O threads, completion of processing of an event is also a > good point. Grace period notifications are needed too, obviously. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/27/2011 04:31 PM, Ingo Molnar wrote: One further optimization would be possible: in case you think we can write the signal handler in assembly, or build it with gcc flags that do not use SSE, we might also add a 'lightweight signal handler' kind of flag to the kernel, which does not save FPU/vector-CPU (SSE) state. In this case signals become *really* fast on x86, almost as fast as interrupts. My old fpu rewrite had the potential to do this without a flag. We simply allocate two fpu contexts per thread (main_fpu and signal_fpu); when we deliver a signal we clear cr0.ts. If the signal touches the fpu, we save the fpu state to main_fpu and initialize the fpu; we then save/restore to signal_fpu until the signal handler exits. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/27/2011 02:07 PM, Ingo Molnar wrote: * Ingo Molnar wrote: > > This code is very much tied with the kernel scheduler. [...] > > It would not be particularly complex to enable user-space to > request a callback on context switch events. > > I was thinking on and off about allowing perf events to generate a > per sampling event notification signal on specific events, such as > page faults or context switches. I was thinking about that on and off so loudly that Peter implemented it long ago via fasync support on the perf event fd! :-) So if you set a notification signal via fcntl(F_SETOWN) on the scheduler context switch event fd, the user-space RCU code will get a signal on every context switch. Context switches are completely uninteresting for userspace rcu: rcu_read_lock(); ---> context switch have we learned anything from that? no. User code is always preemptible and migratable. If rcu_read_lock() prevented migration somehow, then we'd know that a context switch means we've started a grace period for this thread. But it doesn't, so we don't. What's needed are explicit notifications about grace periods. For the vcpu threads, calling KVM_VCPU_RUN seems like a good point. For I/O threads, completion of processing of an event is also a good point. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
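Avi's point - announce quiescence explicitly at natural points (just before KVM_RUN for vcpu threads, after finishing an event for I/O threads) rather than hooking the scheduler - is essentially what liburcu's QSBR flavour does. A hand-rolled, purely illustrative sketch (all names made up; real code would futex-wait instead of spinning):

```c
#include <stdatomic.h>
#include <assert.h>

#define MAX_THREADS 64

static atomic_ulong global_gp = 1;		/* grace-period counter */
static atomic_ulong thread_gp[MAX_THREADS];	/* last GP each thread saw */
static int nr_threads;

static int thread_register(void)
{
	int id = nr_threads++;

	atomic_store(&thread_gp[id], atomic_load(&global_gp));
	return id;
}

/* Called at a natural quiescent point, e.g. right before KVM_RUN:
 * "I currently hold no RCU-protected references". */
static void quiescent_state(int id)
{
	atomic_store(&thread_gp[id], atomic_load(&global_gp));
}

/* Writer side: bump the counter, then wait until every thread has
 * passed through a quiescent state since the bump. */
static void synchronize_rcu_sketch(void)
{
	unsigned long gp = atomic_fetch_add(&global_gp, 1) + 1;

	for (int i = 0; i < nr_threads; i++)
		while (atomic_load(&thread_gp[i]) < gp)
			;	/* real code would block, not spin */
}
```

The read side between quiescent points is completely free; the cost is that a thread which stops announcing quiescence (e.g. blocks indefinitely) stalls writers, which is why the thread notes that blocking points need separate treatment.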
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Sat, May 28, 2011 at 05:24:08PM +0200, Ingo Molnar wrote: > > * Sasha Levin wrote: > > > So the basic plan here is to allocate a futex(?) for each VCPU > > thread, and have the writer thread lock all futexes when it needs > > to write? > > > > If we assume we only have one writer thread, it can stay pretty > > simple. > > We can use an even simpler and more scalable method: > > - writers can 'stop' all other threads, by sending them a > threadpool signal and waiting for each thread to have completed > processing their current work and notifying the writer back that > they have stopped running. > > This means that the read-side lock is _zero instructions_, basically > just a barrier() to make sure the compiler does not move instructions > across threadpool functions (it wont). > > This method requires that we know about every worker thread - i.e. > no-one does a stray pthread_create() and uses data structures from > there. It also requires that each worker thread can 'stop' within a > reasonable amount of time. Yep, this should work -- similar to use of stop_machine() in the kernel, but without the need for worker threads to disable preemption because there is no need for separate "cpu" and "task" concepts like there is in the kernel. The updates are quite heavyweight and they must block all readers, but if updates are infrequent enough, this clearly won't be a problem. Simpler than porting TREE_RCU to userspace as well, though I might do that some time just for grins. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Sasha Levin wrote: > So the basic plan here is to allocate a futex(?) for each VCPU > thread, and have the writer thread lock all futexes when it needs > to write? > > If we assume we only have one writer thread, it can stay pretty > simple. We can use an even simpler and more scalable method: - writers can 'stop' all other threads, by sending them a threadpool signal and waiting for each thread to have completed processing their current work and notifying the writer back that they have stopped running. This means that the read-side lock is _zero instructions_, basically just a barrier() to make sure the compiler does not move instructions across threadpool functions (it wont). This method requires that we know about every worker thread - i.e. no-one does a stray pthread_create() and uses data structures from there. It also requires that each worker thread can 'stop' within a reasonable amount of time. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Fri, 2011-05-27 at 19:10 +0200, Ingo Molnar wrote: > * Sasha Levin wrote: > > > On Fri, 2011-05-27 at 12:36 +0200, Ingo Molnar wrote: > > > * Sasha Levin wrote: > > > > > > > I see that in liburcu there is an implementation of a rcu linked > > > > list but no implementation of a rb-tree. > > > > > > Another approach would be, until the RCU interactions are sorted out, > > > to implement a 'big reader lock' thing that is completely lockless on > > > the read side (!). > > > > > > It works well if the write side is expensive, but very rare: which is > > > certainly the case for these ioport registration data structures used > > > in the mmio event demux fast path! > > > > > > The write_lock() side signals all worker threads to finish whatever > > > they are doing now and to wait for the write_unlock(). Then the > > > modification can be done and the worker threads can be resumed. > > > > > > This can be updated to RCU later on without much trouble. > > > > > > The advantage is that this could be implemented with the existing > > > thread-pool primitives straight away i think, we'd need five > > > primitives: > > > > > > bread_lock(); > > > bread_unlock(); > > > bwrite_lock(); > > > bwrite_lock(); > > > > > > brlock_init(); > > > > > > and a data type: > > > > > > struct brlock; > > > > > > bread_lock()/bread_unlock() is basically just a compiler barrier. > > > bwrite_lock() stops all (other) worker threads. > > > bwrite_unlock() resumes them. > > > > > > That's all - should be 50 lines of code, unless i'm missing something > > > :-) > > > > > > Thanks, > > > > > > Ingo > > > > Isn't there something similar to this in the kernel? > > Yeah, there's include/linux/lglock.h. > > > I prefer not implementing a new lock type at the moment mostly because > > we're not tackling a bug or an immediate problem, we don't really need > > locking at the moment (we add all devices at init and don't support > > hotplug yet). 
So I'd rather not write code just to solve it faster but > > have it thrown away later. > > We don't have to throw it away: RCU is rather complex to pull off > here, and in many cases, where writes are very rare, brlocks are the > best solution even with RCU present. So the basic plan here is to allocate a futex(?) for each VCPU thread, and have the writer thread lock all futexes when it needs to write? If we assume we only have one writer thread, it can stay pretty simple. -- Sasha. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Fri, May 27, 2011 at 11:12:20AM +0200, Ingo Molnar wrote: > > * Paul E. McKenney wrote: > > > > > > I'm CC'ing Paul and Mathieu as well for urcu. > > > > I am hoping we can get better convergence between the user-level > > and kernel-level URCU implementations once I get SRCU merged into > > the TREE_RCU and TINY_RCU implementations. [...] > > Yeah. > > > [...] But it is early days for user-level RCU implementations -- > > for example, the kernel-level implementations have deep > > dependencies on being able to lock themselves cheaply to a given > > CPU, which does not exist at user level. > > Correct - this is why i suggested a plain copy first, then look back > how we (and whether we!) want to share logic. OK, here is an approach that I rejected long ago due to its not handling existing code bases nicely. But it should work fine for user-space applications that are willing to adapt themselves to RCU, so it is well worth considering for this set of use cases. The basic trick is to pretend that each user-level thread is its own CPU. This is easiest if the application does not do any RCU activity from signal handlers, though app-code signal handlers can be dealt with as well, if needed. (But I hate POSIX signals...) Given this trick, the code that is currently invoked from the scheduling-clock interrupt can be instead invoked from a per-thread SIGALRM. Given the current implementation in -tip, the RCU core processing can be done from separate threads, but things must be tweaked because TREE_RCU assumes that the RCU core processing happens on the same CPU (which now becomes in the same thread) that it corresponds to. In other words, the rcuc0 thread is hard-affinitied to CPU 0, the rcuc1 thread to CPU 1, and so on. One way to handle this would be to do the per-CPU kthread processing in signal-handler context. Then code segments that disable interrupts (the __call_rcu() function comes immediately to mind) must block the corresponding signals. 
Which can easily be abstracted so that common code handles it. We could handle dyntick-idle if there is a convenient way to get notification when a thread blocks (as opposed to being preempted). There are a number of strategies that might work here, the first that comes to mind is to notify only if the block is TASK_INTERRUPTIBLE, which indicates a relatively long-term sleep. This notification could call rcu_enter_nohz() and friends. So, is there a way to get notification on TASK_INTERRUPTIBLE blocking and unblocking? This is not a general-purpose solution (which is why I rejected it when thinking along these lines some years ago), but it would be an interesting way to share the in-kernel code. And I believe that this approach would be quite useful to a great number of user-level apps/tools/utilities that were willing to live within its constraints. The each-thread-is-a-CPU might seem limiting, but the current TREE_RCU implementation would allow up to 4,194,304 threads on a 64-bit system and up to 524,288 on a 32-bit system, which should prove sufficient for most uses. Famous last words... But it would be easy to add a fifth level of hierarchy if someone really does have a legitimate need for more threads, which would bring us to 268,435,456 threads for 64-bit systems and 16,777,216 threads for 32-bit systems. And it is easy to add more levels -- and the extra levels don't penalize people who don't need them. With the current implementation, the maximum number of threads would need to be specified at compile time, but again, this should be OK in almost all cases. Default to (say) 131,072 threads maximum and be happy. > > But there seems to be an assumption that there should be only one > > URCU implementation, and I am not sure that this assumption holds. > > I'm not sure about that either. And sice tools/kvm/ lives in the > kernel repo it would be a mortal sin [*] to not explore the code > sharing angle!!! 
:-) > > If a reasonable amount of sharing of logic is possible without making > it painful for the kernel RCU code we could do other nice things like > change the RCU logic and test it in user-space first and run > user-space rcutorture on some really big cluster. That would be cool -- also, it would make the Linux-kernel code more accessible, because people could play with it in userspace, single-stepping, setting breakpoints, and so on. > > [ ... ] > > > > All that aside, one advantage of http://lttng.org/urcu is that it > > already exists, which allows prototyping to proceed immediately. > > it's offline right now: > > $ git clone git://git.lttng.org/urcu > Cloning into urcu... > fatal: The remote end hung up unexpectedly > > One complication is that it's LGPL while tools/kvm/ is GPLv2. I guess > we could copy a suitable implementation into tools/kvm/rcu/? That is another reason why I believe that an in-kernel-tree version of URCU is not a replacement for the variant that Mathieu is maintaining (and that
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Sasha Levin wrote: > On Fri, 2011-05-27 at 12:36 +0200, Ingo Molnar wrote: > > * Sasha Levin wrote: > > > > > I see that in liburcu there is an implementation of a rcu linked > > > list but no implementation of a rb-tree. > > > > Another approach would be, until the RCU interactions are sorted out, > > to implement a 'big reader lock' thing that is completely lockless on > > the read side (!). > > > > It works well if the write side is expensive, but very rare: which is > > certainly the case for these ioport registration data structures used > > in the mmio event demux fast path! > > > > The write_lock() side signals all worker threads to finish whatever > > they are doing now and to wait for the write_unlock(). Then the > > modification can be done and the worker threads can be resumed. > > > > This can be updated to RCU later on without much trouble. > > > > The advantage is that this could be implemented with the existing > > thread-pool primitives straight away i think, we'd need five > > primitives: > > > > bread_lock(); > > bread_unlock(); > > bwrite_lock(); > > bwrite_lock(); > > > > brlock_init(); > > > > and a data type: > > > > struct brlock; > > > > bread_lock()/bread_unlock() is basically just a compiler barrier. > > bwrite_lock() stops all (other) worker threads. > > bwrite_unlock() resumes them. > > > > That's all - should be 50 lines of code, unless i'm missing something > > :-) > > > > Thanks, > > > > Ingo > > Isn't there something similar to this in the kernel? Yeah, there's include/linux/lglock.h. > I prefer not implementing a new lock type at the moment mostly because > we're not tackling a bug or an immediate problem, we don't really need > locking at the moment (we add all devices at init and don't support > hotplug yet). So I'd rather not write code just to solve it faster but > have it thrown away later. 
We don't have to throw it away: RCU is rather complex to pull off here, and in many cases, where writes are very rare, brlocks are the best solution even with RCU present. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Fri, 2011-05-27 at 12:36 +0200, Ingo Molnar wrote: > * Sasha Levin wrote: > > > I see that in liburcu there is an implementation of a rcu linked > > list but no implementation of a rb-tree. > > Another approach would be, until the RCU interactions are sorted out, > to implement a 'big reader lock' thing that is completely lockless on > the read side (!). > > It works well if the write side is expensive, but very rare: which is > certainly the case for these ioport registration data structures used > in the mmio event demux fast path! > > The write_lock() side signals all worker threads to finish whatever > they are doing now and to wait for the write_unlock(). Then the > modification can be done and the worker threads can be resumed. > > This can be updated to RCU later on without much trouble. > > The advantage is that this could be implemented with the existing > thread-pool primitives straight away i think, we'd need five > primitives: > > bread_lock(); > bread_unlock(); > bwrite_lock(); > bwrite_lock(); > > brlock_init(); > > and a data type: > > struct brlock; > > bread_lock()/bread_unlock() is basically just a compiler barrier. > bwrite_lock() stops all (other) worker threads. > bwrite_unlock() resumes them. > > That's all - should be 50 lines of code, unless i'm missing something > :-) > > Thanks, > > Ingo Isn't there something similar to this in the kernel? I prefer not implementing a new lock type at the moment mostly because we're not tackling a bug or an immediate problem, we don't really need locking at the moment (we add all devices at init and don't support hotplug yet). So I'd rather not write code just to solve it faster but have it thrown away later. -- Sasha. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar (mi...@elte.hu) wrote: > > * Ingo Molnar wrote: > > > Note that you do not want the context switch event, but the CPU > > migration event: that will notify user-space when it gets migrated > > to another CPU. This is the case that RCU really needs. > > Also note that the main current use-case of perf events is > instrumentation, thus if you make use of this facility for user-space > RCU you need to check whether the events are all precise and arrive > in time to the target process, etc. > > Statistical behavior isn't a big problem for instrumentation but it's > a showstopper for RCU, obviously! :-) > > If you find such bugs then we want to fix them, so there's no > fundamental *desire* from us for these events to be statistical and > inaccurate anywhere. The accuracy vs speed tradeoff is actually quite different from the instrumentation vs low-level-synchronization points of view. It might be acceptable in some sampling situations to get some inaccuracy due to lack of locking if it makes the data collection tool reasonably fast for real-life use (note that I am talking about "sampling", not event-based tracing here). So there are situations where adding locking will make the overhead prohibitive for sampling, but would be required for the perfect accuracy needed by RCU. So although the desire might not be to get inaccurate data, the actual desire to get it in a low-overhead fashion can lead to different decisions regarding the accuracy vs speed tradeoff. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar (mi...@elte.hu) wrote: > > * Ingo Molnar wrote: > > > > This code is very much tied with the kernel scheduler. [...] > > > > It would not be particularly complex to enable user-space to > > request a callback on context switch events. > > > > I was thinking on and off about allowing perf events to generate a > > per sampling event notification signal on specific events, such as > > page faults or context switches. > > I was thinking about that on and off so loudly that Peter implemented > it long ago via fasync support on the perf event fd! :-) > > So if you set a notification signal via fcntl(F_SETOWN) on the > scheduler context switch event fd, the user-space RCU code will get a > signal on every context switch. > > I have not tried it for this purpose yet, so let us know if there are > unexpected complications :) I'm worried about the following complications: Let's define per-process, per-cpu data structures, mapped in userland, that keep track of the per-cpu read-side C.S. nesting count. Let's say we use this signal-based mechanism to inform userland of migrations. 1) The first thing I notice is that we're not informed of threads being blocked. So multiple threads taking read-side C.S. in turn on the same CPU as they are being scheduled will corrupt each other's read-side C.S. counter. 2) The second thing I notice is that there is no form of synchronization between the userland execution and delivery of this signal, which leads to interesting race conditions. Any thoughts on how to address these? Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
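Complication 1) is exactly what per-thread (rather than per-process, per-cpu) read-side state avoids: threads time-sharing a CPU each own their counter, so no scheduling pattern can corrupt it. A minimal sketch in that spirit, with hypothetical names — this is not liburcu's actual implementation, and the grace-period side (which must walk all registered reader threads) is omitted:

```c
/* Hypothetical sketch: per-thread read-side nesting sidesteps the
 * counter-corruption problem described in complication 1) above.
 * Illustrative only -- not liburcu's real implementation. */
#include <stdatomic.h>

atomic_ulong gp_ctr;			/* advanced by the writer's grace period */

static __thread struct {
	unsigned long nesting;		/* read-side C.S. nesting count */
	unsigned long snapshot;		/* gp_ctr seen at the outermost lock */
} reader;

static inline void sketch_rcu_read_lock(void)
{
	if (reader.nesting++ == 0)
		reader.snapshot = atomic_load_explicit(&gp_ctr,
						       memory_order_acquire);
}

static inline void sketch_rcu_read_unlock(void)
{
	atomic_thread_fence(memory_order_release);
	reader.nesting--;
}
```

Nothing here depends on which CPU the thread runs on, which is precisely why it also needs no migration notification — the cost being that a writer must poll every reader thread instead of a handful of per-CPU counters.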
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Mathieu Desnoyers wrote: > > So yes, kernel code was obviously used in the making of urcu - > > just not the RCU kernel code it appears. > > > > Which is a pity i think! :-) > > Heh :) You know, I really like the Linux kernel coding style, which > is what I tried to stick to within this library. So although I > initially imported some of the core Linux kernel macroisms, I had > to reimplement them (derived from a MIT-licensed code-base) as soon > as I decided to go for MIT-licensed low-level primitives and > LGPL-licensed library. Well, that might be a somewhat fragile assumption in certain jurisdictions: you do know that this is not necessarily a clean-room reimplementation of the urcu code anymore, so the chain of derivation might still be present legally, right? :-) [ Not that i can imagine Linus ever bothering you about barrier() or atomic_inc() ;-) ] Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Mathieu Desnoyers wrote: > I'm worried about "self-recursion" behaviors that could be > triggered though: if the userland callback code called from a page > fault triggers a page fault all by itself, then it looks like a > good way to bring the system to its knees. [...] Not really, SIGIO isn't being reprocessed until the signal handler returns. > [...] The same applies to context switches. Do you have a way to > handle this in mind? Shouldn't be a problem in theory: yes, in case of repeat migrations repeat signals will be sent, but they should not nest in any nasty fashion. That's the theory, it needs checking! :-) One further optimization would be possible: in case you think we can write the signal handler in assembly or build it with gcc flags that do not use SSE, we might also add a 'lightweight signal handler' kind of flag to the kernel, which does not save FPU/vector-CPU (SSE) state. In this case signals become *really* fast on x86, almost as fast as interrupts. One detail: you'd not want to use a queueing signal, because the siginfo queue might overflow. It's also unnecessary: RCU only needs the last migration event, previous history is uninteresting. Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar (mi...@elte.hu) wrote: > > * Mathieu Desnoyers wrote: > > > > it's offline right now: > > > > > > $ git clone git://git.lttng.org/urcu > > > Cloning into urcu... > > > fatal: The remote end hung up unexpectedly > > > > This would be: > > > > git clone git://git.lttng.org/userspace-rcu.git > > Hey, my impression wasn't *entirely* wrong, your initial urcu commit: > > From 27b012e271a82b9a0d94543688904f207cd154ea Mon Sep 17 00:00:00 2001 > From: Mathieu Desnoyers > Date: Thu, 5 Feb 2009 19:06:44 -0500 > Subject: [PATCH] init version > > --- > Makefile |6 ++ > urcu.c | 250 > ++ > urcu.h | 69 + > 3 files changed, 325 insertions(+), 0 deletions(-) > > Has: > > +/* The "volatile" is due to gcc bugs */ > +#define barrier() __asm__ __volatile__("": : :"memory") > + > > Which code sequence i recognize very well as a kernel maintainer ;-) > Here's the kernel's compiler.h definition of the same: > > /* The "volatile" is due to gcc bugs */ > #define barrier() __asm__ __volatile__("": : :"memory") > > This: > > +/* x86 32/64 specific */ > +#define mb()asm volatile("mfence":::"memory") > +#define rmb() asm volatile("lfence":::"memory") > +#define wmb() asm volatile("sfence" ::: "memory") > + > + > + > +/* x86 32 */ > +static inline void atomic_inc(int *v) > +{ > + asm volatile("lock; incl %0" > +: "+m" (v->counter)); > > is familiar to an arch/x86/ maintainer as well :-) > > So yes, kernel code was obviously used in the making of urcu - just > not the RCU kernel code it appears. > > Which is a pity i think! :-) Heh :) You know, I really like the Linux kernel coding style, which is what I tried to stick to within this library. So although I initially imported some of the core Linux kernel macroisms, I had to reimplement them (derived from a MIT-licensed code-base) as soon as I decided to go for MIT-licensed low-level primitives and LGPL-licensed library. 
About RCU, the picture seemed very much different in the userspace landscape compared to the kernel (we needed to use per-thread RCU nesting counters compared to per-CPU in the kernel because of the lack of integration with the scheduler), but more on that in my other follow-up reply. Thanks, Mathieu > > Thanks, > > Ingo -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar (mi...@elte.hu) wrote: > > * Mathieu Desnoyers wrote: > > > > - Check kernel/tinyrcu.c to see how RCU is implemented in its > > >simplest form. :) > > > > ...so simplistic it only works on UP systems, which are not so common > > these days on the systems targeted by kvm. > > As i said above, in its simplest form - which is UP. > > Obviously it's not tinyrcu.c that should be used by tools/kvm/ but > what i suggested, tree-RCU: I agree that tree-RCU has some grace-period management scalability benefits that would be interesting to have. > > > - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/ > > > > This code is very much tied with the kernel scheduler. [...] > > It would not be particularly complex to enable user-space to request > a callback on context switch events. > > I was thinking on and off about allowing perf events to generate a > per sampling event notification signal on specific events, such as > page faults or context switches. > > Obviously this won't be enabled from NMI contexts due to atomicity > constraints, but the pagefault and maybe the context switch path > looks doable. > > That capability would be a rather simple kernel change and it would > allow a user-space RCU implementation to be notified of various key > events, context switches in particular. I'm worried about "self-recursion" behaviors that could be triggered though: if the userland callback code called from a page fault triggers a page fault all by itself, then it looks like a good way to bring the system to its knees. The same apply to context switches. Do you have a way to handle this in mind ? > > Would you be interested in helping code up such a facility? The urcu > library could make good use of it i think, regardless of what we do > in tools/kvm/. I'm open to try finding out the best possible approach to support RCU in user-space (disclaimer: I might need some help on this due to my time being fully taken by contracts currently). 
I guess the sweet spot will end up being at a crossroads between kernel-only and userland-only solutions. Thanks, Mathieu > > Thanks, > > Ingo -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Mathieu Desnoyers wrote: > > it's offline right now: > > > > $ git clone git://git.lttng.org/urcu > > Cloning into urcu... > > fatal: The remote end hung up unexpectedly > > This would be: > > git clone git://git.lttng.org/userspace-rcu.git Hey, my impression wasn't *entirely* wrong, your initial urcu commit: From 27b012e271a82b9a0d94543688904f207cd154ea Mon Sep 17 00:00:00 2001 From: Mathieu Desnoyers Date: Thu, 5 Feb 2009 19:06:44 -0500 Subject: [PATCH] init version --- Makefile |6 ++ urcu.c | 250 ++ urcu.h | 69 + 3 files changed, 325 insertions(+), 0 deletions(-) Has: +/* The "volatile" is due to gcc bugs */ +#define barrier() __asm__ __volatile__("": : :"memory") + Which code sequence i recognize very well as a kernel maintainer ;-) Here's the kernel's compiler.h definition of the same: /* The "volatile" is due to gcc bugs */ #define barrier() __asm__ __volatile__("": : :"memory") This: +/* x86 32/64 specific */ +#define mb()asm volatile("mfence":::"memory") +#define rmb() asm volatile("lfence":::"memory") +#define wmb() asm volatile("sfence" ::: "memory") + + + +/* x86 32 */ +static inline void atomic_inc(int *v) +{ + asm volatile("lock; incl %0" +: "+m" (v->counter)); is familiar to an arch/x86/ maintainer as well :-) So yes, kernel code was obviously used in the making of urcu - just not the RCU kernel code it appears. Which is a pity i think! :-) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Sasha Levin (levinsasha...@gmail.com) wrote: > On Thu, 2011-05-26 at 19:09 -0400, Mathieu Desnoyers wrote: > > * Sasha Levin (levinsasha...@gmail.com) wrote: > > > On Thu, 2011-05-26 at 21:21 +0300, Pekka Enberg wrote: > > > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > > > > >> > > > > >> > > > > > >> > I've added some rwlocks because of what Ingo said yesterday about > > > > >> > adding/removing devices after the first initialization phase. > > > > >> > > > > > >> > Take MMIO lock for example: Since we can now run SMP guests, we > > > > >> > may > > > > >> > have multiple MMIO exits (one from each VCPU thread). Each of > > > > >> > those > > > > >> > exits leads to searching the MMIO rbtree. > > > > >> > > > > > >> > We can use a mutex to lock it, but it just means that those > > > > >> > threads > > > > >> > will be blocked there instead of concurrently searching the MMIO > > > > >> > tree which makes the search linear instead of parallel. > > > > >> > > > > > >> > It's hard to bring 'real' numbers at this stage because the only > > > > >> > 'real' device we have which uses MMIO is the VESA driver, and we > > > > >> > can't really simulate many VCPUs writing to it :) > > > > >> > > > > >> I'd suggest keeping it simple first - rwlocks are nasty and will > > > > >> bounce a cacheline just as much. > > > > > > > > > > Well, this is the first case where tools/kvm can do better than qemu > > > > > with > > > > > its global lock, so I think it's worth it. > > > > > > > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > > > > > > > Definitely rcu is a perfect patch for mmio dispatch. > > > > > > > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > > > > > > > http://lttng.org/urcu > > > > > > > > :-) > > > > > > > > I'm CC'ing Paul and Mathieu as well for urcu. > > > > > > Sounds good! 
> > > > > > Should be quite an addition and could be used in more places than just > > > the MMIO dispatcher. > > > > Hi Sasha, > > > > Feel free to let me know if you need any help, have questions, or need > > improvements to liburcu. I'd be interested to know about the specifics > > of you read vs update workload rate. Also, if you need more thorough > > information, we have a paper describing all the liburcu flavors. It > > might help you choose the one better suited for your needs. (if you > > don't care that much about fine-tuning, my recommendation is to stick > > with the "urcu.h"/"liburcu" flavor). Link to the paper preprint can be > > found at http://www.efficios.com/publications > > Hi Mathieu! > > In tools/kvm/ we use a rb-tree (same one used by the kernel) with the > augmentation feature to support an interval rb-tree - which means that > every update to the tree not only updates the nodes directly related to > the updated node but also all the nodes on the path to the root of the > tree. Cool !! I'm adding in copy Phil Howard who has been working on RCU RB tree for much longer than myself. > I see that in liburcu there is an implementation of a rcu linked list > but no implementation of a rb-tree. > > Are you currently working on one? or maybe I should try writing one and > sending it to you? Actually, I started working on one last year, but had to interrupt my effort before I got it even working right. The state of this (disclaimer: unfinished!!) work is available in the "rbtree" branch of the urcu library. This tree has insertion/removals protected by a mutex, and uses a RCU read lock to protect traversal. The main problem I was facing when I had to stop working on that code is that the "nil" node: 56 /* Sentinel (bottom nodes). 57 * Don't care about p, left, right, pos and key values */ 58 struct rcu_rbtree_node rcu_rbtree_nil = { 59 .color = COLOR_BLACK, 60 }; ended up being written to as temporary node by the algorithm presented in CLRS, chap. 12. 
So sharing it as a common node, as proposed in their book, made sense only if you consider you have no concurrency, but seems to break left and right in the presence of concurrency, and I did not have time to review their entire algo to find out where I should check for accesses to this nil node. This implementation is trying to think of the RB tree in terms of how each operation is affecting the read-side visibility of the tree nodes. It uses the fact that readers only ever go down into the tree extensively. I'd be glad to help out if someone want to have a look and try to complete that work, which should only be considered as "work in progress" level of (in)completeness. We'd have to see how we can go from this implementation of a standard RB tree to an interval RB tree too. I guess it will depend whether you need the updates from the target node up to the root to be done "all at once" from a reader perspective (then you would probably need to replace a copy of a part of the tree all at once), or
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Mathieu Desnoyers wrote: > > instead of bringing in yet another library which is a IIRC a > > distant copy of the kernel code to begin with. > > This is either a lie, or immensely misinformed. You should go and > look at the source before doing nonsensical assumptions like this. > What you are saying cannot be further from the truth. I was merely immensely misinformed, which (distinct!) possibility i tried to signal via the 'IIRC' qualifier as well ;-) While right now i cannot clone the repository itself: $ git clone git://git.lttng.org/urcu Cloning into urcu... fatal: The remote end hung up unexpectedly (tried it several times today, same failure) I found a tarball of the package so could have a look again. Despite my initial impression of it being kernel RCU code related (it has system.h, smp_mb, list_head, etc.), a closer look indeed suggests that the RCU implementation itself does not use the kernel RCU concepts so it sadly cannot be a distant copy of the kernel RCU code! Which is too bad IMO: i don't think user-space RCU should necessarily be so different from kernel-space RCU. I think the two codebases could be shared, or at least they could become closer relatives! :-) That could be done using the migration event notification trick i suggested in the previous mail. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar (mi...@elte.hu) wrote: > > [ ... ] > > > > All that aside, one advantage of http://lttng.org/urcu is that it > > already exists, which allows prototyping to proceed immediately. > > it's offline right now: > > $ git clone git://git.lttng.org/urcu > Cloning into urcu... > fatal: The remote end hung up unexpectedly This would be: git clone git://git.lttng.org/userspace-rcu.git I'll make sure a symlink for urcu -> userspace-rcu.git gets set up shortly. > One complication is that it's LGPL while tools/kvm/ is GPLv2. I guess > we could copy a suitable implementation into tools/kvm/rcu/? That might make sense, yes. The intent of having liburcu LGPL is really to give apps developers as much liberty to link to it as they have calling Linux kernel system calls. This peculiarity of the GPL that Linux is able to use to let non-GPL apps use it does not apply so clearly to libraries, hence the use of LGPL for all my libs that support application execution. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar wrote: > Note that you do not want the context switch event, but the CPU > migration event: that will notify user-space when it gets migrated > to another CPU. This is the case that RCU really needs. Also note that the main current use-case of perf events is instrumentation, thus if you make use of this facility for user-space RCU you need to check whether the events are all precise and arrive in time to the target process, etc. Statistical behavior isn't a big problem for instrumentation but it's a showstopper for RCU, obviously! :-) If you find such bugs then we want to fix them, so there's no fundamental *desire* from us for these events to be statistical and inaccurate anywhere. Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar wrote: > I was thinking about that on and off so loudly that Peter > implemented it long ago via fasync support on the perf event fd! > :-) > > So if you set a notification signal via fcntl(F_SETOWN) on the > scheduler context switch event fd, the user-space RCU code will get > a signal on every context switch. > > I have not tried it for this purpose yet, so let us know if there > are unexpected complications :) Note that you do not want the context switch event, but the CPU migration event: that will notify user-space when it gets migrated to another CPU. This is the case that RCU really needs. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar wrote: > > This code is very much tied with the kernel scheduler. [...] > > It would not be particularly complex to enable user-space to > request a callback on context switch events. > > I was thinking on and off about allowing perf events to generate a > per sampling event notification signal on specific events, such as > page faults or context switches. I was thinking about that on and off so loudly that Peter implemented it long ago via fasync support on the perf event fd! :-) So if you set a notification signal via fcntl(F_SETOWN) on the scheduler context switch event fd, the user-space RCU code will get a signal on every context switch. I have not tried it for this purpose yet, so let us know if there are unexpected complications :) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Sasha Levin wrote: > I see that in liburcu there is an implementation of a rcu linked > list but no implementation of a rb-tree. Another approach would be, until the RCU interactions are sorted out, to implement a 'big reader lock' thing that is completely lockless on the read side (!). It works well if the write side is expensive, but very rare: which is certainly the case for these ioport registration data structures used in the mmio event demux fast path! The write_lock() side signals all worker threads to finish whatever they are doing now and to wait for the write_unlock(). Then the modification can be done and the worker threads can be resumed. This can be updated to RCU later on without much trouble. The advantage is that this could be implemented with the existing thread-pool primitives straight away i think, we'd need five primitives: bread_lock(); bread_unlock(); bwrite_lock(); bwrite_unlock(); brlock_init(); and a data type: struct brlock; bread_lock()/bread_unlock() is basically just a compiler barrier. bwrite_lock() stops all (other) worker threads. bwrite_unlock() resumes them. That's all - should be 50 lines of code, unless i'm missing something :-) Thanks, Ingo
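A rough pthread sketch of these five primitives, plus the worker-side checkpoint they imply (hypothetical, not existing tools/kvm code): the read side costs only a compiler barrier, while bwrite_lock() returns only once every worker has parked at a checkpoint.

```c
/* Sketch of the 'big reader lock' described above, assuming a fixed pool
 * of worker threads that periodically call brlock_checkpoint() from their
 * event loop. Names and structure are illustrative. */
#include <pthread.h>

struct brlock {
	pthread_mutex_t mtx;
	pthread_cond_t  all_parked;	/* last worker reached the checkpoint */
	pthread_cond_t  writer_done;	/* bwrite_unlock() happened */
	int stop;			/* a writer wants the workers parked */
	int parked;
	int nworkers;
};

#define barrier() __asm__ __volatile__("" ::: "memory")

void brlock_init(struct brlock *b, int nworkers)
{
	pthread_mutex_init(&b->mtx, NULL);
	pthread_cond_init(&b->all_parked, NULL);
	pthread_cond_init(&b->writer_done, NULL);
	b->stop = b->parked = 0;
	b->nworkers = nworkers;
}

/* Read side: free -- just a compiler barrier, as described above. */
static inline void bread_lock(struct brlock *b)   { (void)b; barrier(); }
static inline void bread_unlock(struct brlock *b) { (void)b; barrier(); }

/* Workers call this at a safe point in their loop, outside any read-side
 * critical section; it blocks for as long as a writer holds the lock. */
void brlock_checkpoint(struct brlock *b)
{
	if (!b->stop)			/* racy fast path, rechecked below */
		return;
	pthread_mutex_lock(&b->mtx);
	while (b->stop) {
		if (++b->parked == b->nworkers)
			pthread_cond_signal(&b->all_parked);
		pthread_cond_wait(&b->writer_done, &b->mtx);
		b->parked--;
	}
	pthread_mutex_unlock(&b->mtx);
}

void bwrite_lock(struct brlock *b)
{
	pthread_mutex_lock(&b->mtx);
	b->stop = 1;
	while (b->parked < b->nworkers)
		pthread_cond_wait(&b->all_parked, &b->mtx);
	/* every worker is parked or blocked: modify the shared data at will */
}

void bwrite_unlock(struct brlock *b)
{
	b->stop = 0;
	pthread_cond_broadcast(&b->writer_done);
	pthread_mutex_unlock(&b->mtx);
}
```

One subtlety the 50-line estimate hides: bwrite_lock() cannot return until every worker reaches its checkpoint, so checkpoint placement in the worker event loop bounds the writer's latency.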
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Mathieu Desnoyers wrote: > > - Check kernel/tinyrcu.c to see how RCU is implemented in its > >simplest form. :) > > ...so simplistic it only works on UP systems, which are not so common > these days on the systems targeted by kvm. As i said above, in its simplest form - which is UP. Obviously it's not tinyrcu.c that should be used by tools/kvm/ but what i suggested, tree-RCU: > > - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/ > > This code is very much tied with the kernel scheduler. [...] It would not be particularly complex to enable user-space to request a callback on context switch events. I was thinking on and off about allowing perf events to generate a per sampling event notification signal on specific events, such as page faults or context switches. Obviously this won't be enabled from NMI contexts due to atomicity constraints, but the pagefault and maybe the context switch path looks doable. That capability would be a rather simple kernel change and it would allow a user-space RCU implementation to be notified of various key events, context switches in particular. Would you be interested in helping code up such a facility? The urcu library could make good use of it i think, regardless of what we do in tools/kvm/. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Thu, 2011-05-26 at 19:09 -0400, Mathieu Desnoyers wrote: > * Sasha Levin (levinsasha...@gmail.com) wrote: > > On Thu, 2011-05-26 at 21:21 +0300, Pekka Enberg wrote: > > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > > > >> > > > >> > > > > >> > I've added some rwlocks because of what Ingo said yesterday about > > > >> > adding/removing devices after the first initialization phase. > > > >> > > > > >> > Take MMIO lock for example: Since we can now run SMP guests, we may > > > >> > have multiple MMIO exits (one from each VCPU thread). Each of those > > > >> > exits leads to searching the MMIO rbtree. > > > >> > > > > >> > We can use a mutex to lock it, but it just means that those threads > > > >> > will be blocked there instead of concurrently searching the MMIO > > > >> > tree which makes the search linear instead of parallel. > > > >> > > > > >> > It's hard to bring 'real' numbers at this stage because the only > > > >> > 'real' device we have which uses MMIO is the VESA driver, and we > > > >> > can't really simulate many VCPUs writing to it :) > > > >> > > > >> I'd suggest keeping it simple first - rwlocks are nasty and will > > > >> bounce a cacheline just as much. > > > > > > > > Well, this is the first case where tools/kvm can do better than qemu > > > > with > > > > its global lock, so I think it's worth it. > > > > > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > > > > > Definitely rcu is a perfect patch for mmio dispatch. > > > > > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > > > > > http://lttng.org/urcu > > > > > > :-) > > > > > > I'm CC'ing Paul and Mathieu as well for urcu. > > > > Sounds good! > > > > Should be quite an addition and could be used in more places than just > > the MMIO dispatcher. > > Hi Sasha, > > Feel free to let me know if you need any help, have questions, or need > improvements to liburcu. 
I'd be interested to know about the specifics > of you read vs update workload rate. Also, if you need more thorough > information, we have a paper describing all the liburcu flavors. It > might help you choose the one better suited for your needs. (if you > don't care that much about fine-tuning, my recommendation is to stick > with the "urcu.h"/"liburcu" flavor). Link to the paper preprint can be > found at http://www.efficios.com/publications Hi Mathieu! In tools/kvm/ we use a rb-tree (same one used by the kernel) with the augmentation feature to support an interval rb-tree - which means that every update to the tree not only updates the nodes directly related to the updated node but also all the nodes on the path to the root of the tree. I see that in liburcu there is an implementation of a rcu linked list but no implementation of a rb-tree. Are you currently working on one? or maybe I should try writing one and sending it to you? -- Sasha. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
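The augmentation Sasha describes can be sketched as the classic interval-tree scheme: each node caches the maximum interval end in its subtree, so changing a node means fixing up every ancestor on the path to the root — which is exactly what makes RCU-friendly updates hard. A simplified, hypothetical version (not the kernel's augmented rbtree API; rebalancing omitted):

```c
/* Simplified interval-tree augmentation sketch; names are illustrative
 * and red-black rebalancing is left out for clarity. */
#include <stddef.h>

struct int_node {
	unsigned long low, high;	/* the interval, e.g. an MMIO range */
	unsigned long max_high;		/* max 'high' anywhere in this subtree */
	struct int_node *left, *right, *parent;
};

static unsigned long subtree_max(struct int_node *n)
{
	return n ? n->max_high : 0;	/* assumes addresses > 0 */
}

/* Walk from a changed node up to the root, refreshing the cached maxima:
 * this is the "update all nodes on the path to the root" part. */
void propagate_max(struct int_node *n)
{
	while (n) {
		unsigned long m = n->high;

		if (subtree_max(n->left) > m)
			m = subtree_max(n->left);
		if (subtree_max(n->right) > m)
			m = subtree_max(n->right);
		if (n->max_high == m)
			break;		/* nothing above can change either */
		n->max_high = m;
		n = n->parent;
	}
}

/* Point query: find a node whose [low, high] contains addr, if any. */
struct int_node *interval_search(struct int_node *root, unsigned long addr)
{
	struct int_node *n = root;

	while (n) {
		if (addr >= n->low && addr <= n->high)
			return n;
		if (n->left && subtree_max(n->left) >= addr)
			n = n->left;
		else
			n = n->right;
	}
	return NULL;
}
```

This matches the MMIO example from earlier in the thread: map (0x100-0x200) and (0x400-0x500), then ask which node a write to 0x150 lands in.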
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Paul E. McKenney wrote: > > > > I'm CC'ing Paul and Mathieu as well for urcu. > > I am hoping we can get better convergence between the user-level > and kernel-level URCU implementations once I get SRCU merged into > the TREE_RCU and TINY_RCU implementations. [...] Yeah. > [...] But it is early days for user-level RCU implementations -- > for example, the kernel-level implementations have deep > dependencies on being able to lock themselves cheaply to a given > CPU, which does not exist at user level. Correct - this is why i suggested a plain copy first, then look back how we (and whether we!) want to share logic. > But there seems to be an assumption that there should be only one > URCU implementation, and I am not sure that this assumption holds. I'm not sure about that either. And since tools/kvm/ lives in the kernel repo it would be a mortal sin [1] to not explore the code sharing angle!!! :-) If a reasonable amount of sharing of logic is possible without making it painful for the kernel RCU code we could do other nice things like change the RCU logic and test it in user-space first and run user-space rcutorture on some really big cluster. > [ ... ] > > All that aside, one advantage of http://lttng.org/urcu is that it > already exists, which allows prototyping to proceed immediately. it's offline right now: $ git clone git://git.lttng.org/urcu Cloning into urcu... fatal: The remote end hung up unexpectedly One complication is that it's LGPL while tools/kvm/ is GPLv2. I guess we could copy a suitable implementation into tools/kvm/rcu/? Thanks, Ingo [1] punishable by death or eternal hacking of a Windows driver (i'd pick the former)
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Thu, May 26, 2011 at 07:05:08PM -0400, Mathieu Desnoyers wrote: > * Ingo Molnar (mi...@elte.hu) wrote: > > > > * Pekka Enberg wrote: > > > > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > > > >> > > > >> > > > > >> > I've added some rwlocks because of what Ingo said yesterday about > > > >> > adding/removing devices after the first initialization phase. > > > >> > > > > >> > Take MMIO lock for example: Since we can now run SMP guests, we may > > > >> > have multiple MMIO exits (one from each VCPU thread). Each of those > > > >> > exits leads to searching the MMIO rbtree. > > > >> > > > > >> > We can use a mutex to lock it, but it just means that those threads > > > >> > will be blocked there instead of concurrently searching the MMIO > > > >> > tree which makes the search linear instead of parallel. > > > >> > > > > >> > It's hard to bring 'real' numbers at this stage because the only > > > >> > 'real' device we have which uses MMIO is the VESA driver, and we > > > >> > can't really simulate many VCPUs writing to it :) > > > >> > > > >> I'd suggest keeping it simple first - rwlocks are nasty and will > > > >> bounce a cacheline just as much. > > > > > > > > Well, this is the first case where tools/kvm can do better than qemu > > > > with > > > > its global lock, so I think it's worth it. > > > > > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > > > > > Definitely rcu is a perfect patch for mmio dispatch. > > > > > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > > > > > http://lttng.org/urcu > > > > > > :-) > > > > > > I'm CC'ing Paul and Mathieu as well for urcu. I am hoping we can get better convergence between the user-level and kernel-level URCU implementations once I get SRCU merged into the TREE_RCU and TINY_RCU implementations. 
But it is early days for user-level RCU implementations -- for example, the kernel-level implementations have deep dependencies on being able to lock themselves cheaply to a given CPU, which does not exist at user level. But there seems to be an assumption that there should be only one URCU implementation, and I am not sure that this assumption holds. For example, there are several in http://lttng.org/urcu, each corresponding to different use cases. This should not be too much of a surprise, given that there are quite a few implementations in the Linux kernel: TINY_RCU, TINY_PREEMPT_RCU, TREE_RCU, TREE_PREEMPT_RCU, and SRCU. Of course, each of the first four variants provides RCU-bh and RCU-sched, and TINY_PREEMPT_RCU and TREE_PREEMPT_RCU provide preemptible RCU in addition. And back in the mid-1990s, I would never have imagined a need for more than one implementation of RCU. ;-) All that aside, one advantage of http://lttng.org/urcu is that it already exists, which allows prototyping to proceed immediately. If it turns out that URCU doesn't help for whatever reason, then there is no point in worrying further. And if URCU does turn out to help, we will know more about exactly what this particular situation requires of URCU, which will likely help us better understand what the implementation should look like -- perhaps very close to one of the URCU implementations, perhaps very close to one of the in-kernel implementations. Seem reasonable, or am I missing something here? Thanx, Paul > > I think we should rather share some of the kernel RCU code in an > > intelligent way > > Hi Ingo, > > By all means feel free to redo all the work Paul has spent care and > time coding and testing. > > > instead of bringing in yet another library which is, > > IIRC, a distant copy of the kernel code to begin with. > > This is either a lie or immensely misinformed. You should go and look > at the source before making nonsensical assumptions like this.
What you > are saying cannot be further from the truth. > > > > That way we could lazily benefit from all the enhancements Paul does > > to the kernel RCU code! :-) > > Maybe there is a reason why Paul has been working with me on the > userspace RCU implementation in parallel with working on the Linux > kernel one? > > > > Note that kernel/treercu.c is pretty tied to the kernel right now, so > > a first approach could be to: > > > > - Check Paul's excellent documentation about RCU and make sure > >you don't miss Paul's excellent 3-part primer on LWN.net: > > > > http://lwn.net/Articles/262464/ > > http://lwn.net/Articles/263130/ > > http://lwn.net/Articles/264090/ > > > >There are also lots of very good RCU articles on the LWN Kernel > >Index page: > > > > http://lwn.net/Kernel/Index/ > > Or just (see README) > > git clone git://git.lttng.org/urcu > cd userspace-rcu > ./bootstrap > ./configure > make > make install > ldconfig > > #include <urcu.h> > gcc -lurcu ... > > and be done with it? >
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Sasha Levin (levinsasha...@gmail.com) wrote: > On Thu, 2011-05-26 at 21:21 +0300, Pekka Enberg wrote: > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > > >> > > >> > > > >> > I've added some rwlocks because of what Ingo said yesterday about > > >> > adding/removing devices after the first initialization phase. > > >> > > > >> > Take MMIO lock for example: Since we can now run SMP guests, we may > > >> > have multiple MMIO exits (one from each VCPU thread). Each of those > > >> > exits leads to searching the MMIO rbtree. > > >> > > > >> > We can use a mutex to lock it, but it just means that those threads > > >> > will be blocked there instead of concurrently searching the MMIO > > >> > tree which makes the search linear instead of parallel. > > >> > > > >> > It's hard to bring 'real' numbers at this stage because the only > > >> > 'real' device we have which uses MMIO is the VESA driver, and we > > >> > can't really simulate many VCPUs writing to it :) > > >> > > >> I'd suggest keeping it simple first - rwlocks are nasty and will > > >> bounce a cacheline just as much. > > > > > > Well, this is the first case where tools/kvm can do better than qemu with > > > its global lock, so I think it's worth it. > > > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > > > Definitely rcu is a perfect patch for mmio dispatch. > > > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > > > http://lttng.org/urcu > > > > :-) > > > > I'm CC'ing Paul and Mathieu as well for urcu. > > Sounds good! > > Should be quite an addition and could be used in more places than just > the MMIO dispatcher. Hi Sasha, Feel free to let me know if you need any help, have questions, or need improvements to liburcu. I'd be interested to know about the specifics of your read vs. update workload rates. Also, if you need more thorough information, we have a paper describing all the liburcu flavors. 
It might help you choose the one better suited for your needs. (If you don't care that much about fine-tuning, my recommendation is to stick with the "urcu.h"/"liburcu" flavor). Link to the paper preprint can be found at http://www.efficios.com/publications Thanks! Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Ingo Molnar (mi...@elte.hu) wrote: > > * Pekka Enberg wrote: > > > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > > > >> > > > >> > > > > >> > I've added some rwlocks because of what Ingo said yesterday about > > > >> > adding/removing devices after the first initialization phase. > > > >> > > > > >> > Take MMIO lock for example: Since we can now run SMP guests, we may > > > >> > have multiple MMIO exits (one from each VCPU thread). Each of those > > > >> > exits leads to searching the MMIO rbtree. > > > >> > > > > >> > We can use a mutex to lock it, but it just means that those threads > > > >> > will be blocked there instead of concurrently searching the MMIO > > > >> > tree which makes the search linear instead of parallel. > > > >> > > > > >> > It's hard to bring 'real' numbers at this stage because the only > > > >> > 'real' device we have which uses MMIO is the VESA driver, and we > > > >> > can't really simulate many VCPUs writing to it :) > > > >> > > > >> I'd suggest keeping it simple first - rwlocks are nasty and will > > > >> bounce a cacheline just as much. > > > > > > > > Well, this is the first case where tools/kvm can do better than qemu > > > > with > > > > its global lock, so I think it's worth it. > > > > > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > > > > > Definitely rcu is a perfect patch for mmio dispatch. > > > > > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > > > > > http://lttng.org/urcu > > > > > > :-) > > > > > > I'm CC'ing Paul and Mathieu as well for urcu. > > I think we should rather share some of the kernel RCU code in an > intelligent way Hi Ingo, By all means feel free to redo all the work Paul has spent care and time coding and testing. > instead of bringing in yet another library which is, > IIRC, a distant copy of the kernel code to begin with. This is either a lie or immensely misinformed. You should go and look at the source before making nonsensical assumptions like this. 
What you are saying cannot be further from the truth. > > That way we could lazily benefit from all the enhancements Paul does > to the kernel RCU code! :-) Maybe there is a reason why Paul has been working with me on the userspace RCU implementation in parallel with working on the Linux kernel one? > > Note that kernel/treercu.c is pretty tied to the kernel right now, so > a first approach could be to: > > - Check Paul's excellent documentation about RCU and make sure >you don't miss Paul's excellent 3-part primer on LWN.net: > > http://lwn.net/Articles/262464/ > http://lwn.net/Articles/263130/ > http://lwn.net/Articles/264090/ > >There are also lots of very good RCU articles on the LWN Kernel >Index page: > > http://lwn.net/Kernel/Index/ Or just (see README) git clone git://git.lttng.org/urcu cd userspace-rcu ./bootstrap ./configure make make install ldconfig #include <urcu.h> gcc -lurcu ... and be done with it? > - Check kernel/tinyrcu.c to see how RCU is implemented in its >simplest form. :) ...so simplistic it only works on UP systems, which are not so common these days on the systems targeted by kvm. > > - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/ This code is very much tied to the kernel scheduler. This is actually one of the main reasons why we had to reimplement RCU for userspace rather than "simply copy the kernel one" as you recommend. > > - Massage it so far that it is suitable for tools/kvm/. We really >only need a few core RCU facilities initially: > > struct rcu_head; > > rcu_read_lock(); > rcu_read_unlock(); > > rcu_dereference() > > call_rcu(head, func); > > rcu_synchronize(); > >The rest, _bh(), etc. are all kernelisms. rcu_synchronize and the rcu read lock/unlock, in tree RCU, are tied to the scheduler to deal with preemption. User-land does not have this luxury. 
> > - Then once it's working we could look at doing the code sharing >*for real*: splitting the functionality out of the original >treercu.c code into kernel/treercu-lib.c and rcuupdate-lib.h or so >and include that one in tools/kvm/rcu/. > > - [ You might also benefit from porting rcutorture code to > user-space. It will catch RCU bugs like nothing else. ] Userspace RCU already includes the torture test suite you are referring to. > > That way the 'core RCU' logic would be contained in treercu-lib.c, > all kernel glue would be in kernel/rcutree.c. > > Some other approach might be possible as well, this was just a first > rough idea :) "wheel not invented here" syndrome? Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Pekka Enberg wrote: > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > >> > >> > > >> > I've added some rwlocks because of what Ingo said yesterday about > >> > adding/removing devices after the first initialization phase. > >> > > >> > Take MMIO lock for example: Since we can now run SMP guests, we may > >> > have multiple MMIO exits (one from each VCPU thread). Each of those > >> > exits leads to searching the MMIO rbtree. > >> > > >> > We can use a mutex to lock it, but it just means that those threads > >> > will be blocked there instead of concurrently searching the MMIO > >> > tree which makes the search linear instead of parallel. > >> > > >> > It's hard to bring 'real' numbers at this stage because the only > >> > 'real' device we have which uses MMIO is the VESA driver, and we > >> > can't really simulate many VCPUs writing to it :) > >> > >> I'd suggest keeping it simple first - rwlocks are nasty and will > >> bounce a cacheline just as much. > > > > Well, this is the first case where tools/kvm can do better than qemu with > > its global lock, so I think it's worth it. > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > Definitely rcu is a perfect patch for mmio dispatch. > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > http://lttng.org/urcu > > :-) > > I'm CC'ing Paul and Mathieu as well for urcu. I think we should rather share some of the kernel RCU code in an intelligent way instead of bringing in yet another library which is, IIRC, a distant copy of the kernel code to begin with. That way we could lazily benefit from all the enhancements Paul does to the kernel RCU code! 
:-) Note that kernel/treercu.c is pretty tied to the kernel right now, so a first approach could be to: - Check Paul's excellent documentation about RCU and make sure you don't miss Paul's excellent 3-part primer on LWN.net: http://lwn.net/Articles/262464/ http://lwn.net/Articles/263130/ http://lwn.net/Articles/264090/ There are also lots of very good RCU articles on the LWN Kernel Index page: http://lwn.net/Kernel/Index/ - Check kernel/tinyrcu.c to see how RCU is implemented in its simplest form. :) - Copy the tree-RCU code from kernel/treercu.c to tools/kvm/rcu/ - Massage it so that it is suitable for tools/kvm/. We really only need a few core RCU facilities initially: struct rcu_head; rcu_read_lock(); rcu_read_unlock(); rcu_dereference() call_rcu(head, func); rcu_synchronize(); The rest, _bh(), etc. are all kernelisms. - Then once it's working we could look at doing the code sharing *for real*: splitting the functionality out of the original treercu.c code into kernel/treercu-lib.c and rcuupdate-lib.h or so and include that one in tools/kvm/rcu/. - [ You might also benefit from porting rcutorture code to user-space. It will catch RCU bugs like nothing else. ] That way the 'core RCU' logic would be contained in treercu-lib.c, all kernel glue would be in kernel/rcutree.c. Some other approach might be possible as well, this was just a first rough idea :) Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Thu, 2011-05-26 at 21:21 +0300, Pekka Enberg wrote: > On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > > On 05/26/2011 09:05 PM, Ingo Molnar wrote: > >> > >> > > >> > I've added some rwlocks because of what Ingo said yesterday about > >> > adding/removing devices after the first initialization phase. > >> > > >> > Take MMIO lock for example: Since we can now run SMP guests, we may > >> > have multiple MMIO exits (one from each VCPU thread). Each of those > >> > exits leads to searching the MMIO rbtree. > >> > > >> > We can use a mutex to lock it, but it just means that those threads > >> > will be blocked there instead of concurrently searching the MMIO > >> > tree which makes the search linear instead of parallel. > >> > > >> > It's hard to bring 'real' numbers at this stage because the only > >> > 'real' device we have which uses MMIO is the VESA driver, and we > >> > can't really simulate many VCPUs writing to it :) > >> > >> I'd suggest keeping it simple first - rwlocks are nasty and will > >> bounce a cacheline just as much. > > > > Well, this is the first case where tools/kvm can do better than qemu with > > its global lock, so I think it's worth it. > > > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > > > Definitely rcu is a perfect patch for mmio dispatch. > > Userspace RCU code is here, Sasha, if you feel like tackling this: > > http://lttng.org/urcu > > :-) > > I'm CC'ing Paul and Mathieu as well for urcu. Sounds good! Should be quite an addition and could be used in more places than just the MMIO dispatcher. -- Sasha.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Thu, May 26, 2011 at 9:11 PM, Avi Kivity wrote: > On 05/26/2011 09:05 PM, Ingo Molnar wrote: >> >> > >> > I've added some rwlocks because of what Ingo said yesterday about >> > adding/removing devices after the first initialization phase. >> > >> > Take MMIO lock for example: Since we can now run SMP guests, we may >> > have multiple MMIO exits (one from each VCPU thread). Each of those >> > exits leads to searching the MMIO rbtree. >> > >> > We can use a mutex to lock it, but it just means that those threads >> > will be blocked there instead of concurrently searching the MMIO >> > tree which makes the search linear instead of parallel. >> > >> > It's hard to bring 'real' numbers at this stage because the only >> > 'real' device we have which uses MMIO is the VESA driver, and we >> > can't really simulate many VCPUs writing to it :) >> >> I'd suggest keeping it simple first - rwlocks are nasty and will >> bounce a cacheline just as much. > > Well, this is the first case where tools/kvm can do better than qemu with > its global lock, so I think it's worth it. > >> If lookup scalability is an issue we can extend RCU to tools/kvm/. > > Definitely rcu is a perfect patch for mmio dispatch. Userspace RCU code is here, Sasha, if you feel like tackling this: http://lttng.org/urcu :-) I'm CC'ing Paul and Mathieu as well for urcu.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On 05/26/2011 09:05 PM, Ingo Molnar wrote: > > I've added some rwlocks because of what Ingo said yesterday about > > adding/removing devices after the first initialization phase. > > > > Take MMIO lock for example: Since we can now run SMP guests, we may > > have multiple MMIO exits (one from each VCPU thread). Each of those > > exits leads to searching the MMIO rbtree. > > > > We can use a mutex to lock it, but it just means that those threads > > will be blocked there instead of concurrently searching the MMIO > > tree which makes the search linear instead of parallel. > > > > It's hard to bring 'real' numbers at this stage because the only > > 'real' device we have which uses MMIO is the VESA driver, and we > > can't really simulate many VCPUs writing to it :) > I'd suggest keeping it simple first - rwlocks are nasty and will > bounce a cacheline just as much. Well, this is the first case where tools/kvm can do better than qemu with its global lock, so I think it's worth it. > If lookup scalability is an issue we can extend RCU to tools/kvm/. Definitely rcu is a perfect match for mmio dispatch. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
* Sasha Levin wrote: > On Thu, 2011-05-26 at 19:02 +0300, Pekka Enberg wrote: > > On Thu, 26 May 2011, Sasha Levin wrote: > > > Adds a rwlock wrapper which like the mutex wrapper makes rwlock calls > > > similar to their kernel counterparts. > > > > > > Signed-off-by: Sasha Levin > > > > There's no explanation why a mutex isn't sufficient. The pthread > > locking primitives aren't all that great in practice so unless > > you have some correctness issue that requires a rwlock or some > > numbers, I'd prefer you go for a mutex. > > I've added some rwlocks because of what Ingo said yesterday about > adding/removing devices after the first initialization phase. > > Take MMIO lock for example: Since we can now run SMP guests, we may > have multiple MMIO exits (one from each VCPU thread). Each of those > exits leads to searching the MMIO rbtree. > > We can use a mutex to lock it, but it just means that those threads > will be blocked there instead of concurrently searching the MMIO > tree which makes the search linear instead of parallel. > > It's hard to bring 'real' numbers at this stage because the only > 'real' device we have which uses MMIO is the VESA driver, and we > can't really simulate many VCPUs writing to it :) I'd suggest keeping it simple first - rwlocks are nasty and will bounce a cacheline just as much. If lookup scalability is an issue we can extend RCU to tools/kvm/. Thanks, Ingo
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Thu, 2011-05-26 at 19:02 +0300, Pekka Enberg wrote: > On Thu, 26 May 2011, Sasha Levin wrote: > > Adds a rwlock wrapper which, like the mutex wrapper, makes rwlock calls > > similar to their kernel counterparts. > > > > Signed-off-by: Sasha Levin > > There's no explanation why a mutex isn't sufficient. The pthread locking > primitives aren't all that great in practice so unless you have some > correctness issue that requires a rwlock or some numbers, I'd prefer you > go for a mutex. I've added some rwlocks because of what Ingo said yesterday about adding/removing devices after the first initialization phase. Take MMIO lock for example: Since we can now run SMP guests, we may have multiple MMIO exits (one from each VCPU thread). Each of those exits leads to searching the MMIO rbtree. We can use a mutex to lock it, but it just means that those threads will be blocked there instead of concurrently searching the MMIO tree which makes the search linear instead of parallel. It's hard to bring 'real' numbers at this stage because the only 'real' device we have which uses MMIO is the VESA driver, and we can't really simulate many VCPUs writing to it :) -- Sasha.
Re: [PATCH 4/6] kvm tools: Add rwlock wrapper
On Thu, 26 May 2011, Sasha Levin wrote: > Adds a rwlock wrapper which, like the mutex wrapper, makes rwlock calls > similar to their kernel counterparts. > > Signed-off-by: Sasha Levin There's no explanation why a mutex isn't sufficient. The pthread locking primitives aren't all that great in practice so unless you have some correctness issue that requires a rwlock or some numbers, I'd prefer you go for a mutex. Pekka