Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
Hey,

On Wed, May 22, 2013 at 02:50:14PM -0400, Mike Snitzer wrote:
> Was wondering: how is percpu-refcounting coming along?  Do you have a
> pointer to the code that can be pulled in for use by Mikulas' dm-crypt
> changes?
>
> Would be nice to get this stuff sorted out for the 3.11 merge window.

Still in progress.  Waiting for Kent's next round and yeah I do hope so
too as I really wanna convert css refcnting to it regardless of the
dm-crypt work.  The thread is at

  https://patchwork.kernel.org/patch/2562111/

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Thu, Apr 18 2013 at 1:03pm -0400, Tejun Heo wrote:
> Hello, Mike.
>
> On Thu, Apr 18, 2013 at 12:47:42PM -0400, Mike Snitzer wrote:
> > I see your nack and raise you with: please reconsider in the near term.
>
> The thing is that percpu-refcnting is already in mostly ready-form, so
> unless this dm series is planned to be merged for v3.10-rc1, I don't
> see the need for a separate near term solution.  Mikulas can just do
> the normal refcnting and cgroup will most likely adopt per-cpu ref
> anyway once percpu-refcnting is in, so it should all fall in places
> pretty quickly.  For devel / testing / whatever, Mikulas can surely
> turn off cgroup, right?

Hey Tejun,

Was wondering: how is percpu-refcounting coming along?  Do you have a
pointer to the code that can be pulled in for use by Mikulas' dm-crypt
changes?

Would be nice to get this stuff sorted out for the 3.11 merge window.

Mike
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
Hello, Mike.

On Thu, Apr 18, 2013 at 12:47:42PM -0400, Mike Snitzer wrote:
> I see your nack and raise you with: please reconsider in the near term.

The thing is that percpu-refcnting is already in mostly ready-form, so
unless this dm series is planned to be merged for v3.10-rc1, I don't
see the need for a separate near term solution.  Mikulas can just do
the normal refcnting and cgroup will most likely adopt per-cpu ref
anyway once percpu-refcnting is in, so it should all fall in places
pretty quickly.  For devel / testing / whatever, Mikulas can surely
turn off cgroup, right?

Thanks.

-- 
tejun
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Tue, Apr 16 2013 at 1:24pm -0400, Tejun Heo wrote:
> Hey,
>
> On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote:
> > The patch is not bug-prone, because we already must make sure that the
> > cloned bio has a shorter lifetime than the master bio - so the patch
> > doesn't introduce any new possibilities to make bugs.
>
> The whole world isn't composed of only your code.  As I said
> repeatedly, you're introducing an API which is misleading and can
> easily cause subtle bugs which are very difficult to reproduce.
>
> Imagine it being used to tag a metadata or checksum update bio being
> sent down while processing another bio and used to "clone" the context
> of the original bio.  It'll work most of the time even if the original
> bio gets completed first but it'll break when it gets really unlucky -
> e.g. racing with other operations which can put the base css ref, and
> it'll be hellish to reproduce and everyone would have to pay for your
> silly hack.
>
> > > Do the two really look the same to you?  The page refs are much more
> > > expensive, mostly contained in and the main focus of dm.  ioc/css
> > > refs aren't that expensive to begin with, css refcnting is widely
> > > scattered
> >
> > ioc is per-task, so it is likely to be cached (but there are
> > processors that have slow atomic operations even on cached data - on
> > Pentium 4 it takes about 100 cycles). But css is shared between tasks
> > and produces the cache ping-pong effect.
>
> For $DEITY's sake, how many times do I have to tell you to use a
> per-cpu reference count?  Why do I have to repeat the same story over
> and over again?  What part of "make the reference count per-cpu" don't
> you get?  It's not a complicated message.
>
> At this point, I can't even understand why or what the hell you're
> arguing.  There's a clearly better way to do it and you're just
> repeating yourself like a broken record that your hack in itself isn't
> broken.
>
> So, if you wanna continue that way for whatever reason, you have my
> firm nack and I'm outta this thread.
>
> Bye bye.

Hey Tejun,

I see your nack and raise you with: please reconsider in the near term.
Your point about not wanting to introduce a generic block interface that
isn't "safe" for all users is noted.  But as Mikulas has repeatedly
said, DM does _not_ ever need to do the refcounting.  So it seems a bit
absurd to introduce the requirement that DM should stand up an interface
that uses percpu refcounting.  That is a fair amount of churn that DM
will never have a need to take advantage of.

So why not introduce __bio_copy_association(bio1, bio2) and add a
BUG_ON in it if bio2 isn't a clone of bio1?  When there is a need for
async IO to have more scalable refcounting, that would be the time to
introduce a bio_copy_association that uses per-cpu refcounting (and yes,
we could then even nuke __bio_copy_association).

It just seems to me a bit burdensome to ask Mikulas to add this
infrastructure when DM really doesn't need it at all.  But again, I do
understand your desire for others to be steering the kernel where it
needs to be to benefit future use-cases.  I just think in general it is
best to introduce complexity when there is an actual need.

Your insights are amazingly helpful and I think it is unfortunate that
this refcounting issue overshadowed the positive advancements of
dm-crypt scaling.  I'm just looking to see if we can carry on with a
temporary intermediate step with __bio_copy_association.

Thanks,
Mike
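[Editor's note: a minimal userspace sketch of what the helper Mike proposes above might look like. The struct fields and the clone-tracking pointer are simplified stand-ins for the kernel's struct bio (which carries bi_ioc/bi_css in the patches under discussion), and the kernel BUG_ON is modeled with assert; this is an illustration of the proposed contract, not the actual patch.]

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's io_context and cgroup
 * subsys state; only pointer identity matters for this sketch. */
struct ioc { int dummy; };
struct css { int dummy; };

struct bio {
    struct ioc *bi_ioc;   /* io_context the bio is associated with */
    struct css *bi_css;   /* cgroup subsys state */
    struct bio *bi_orig;  /* set when this bio was cloned from another */
};

/* Copy the cgroup association from @src to @dst WITHOUT taking extra
 * references.  Only legal when @dst is a clone of @src, because the
 * clone's lifetime is then bounded by the original's; the assert
 * (a BUG_ON in kernel terms) enforces that usage contract. */
static void __bio_copy_association(struct bio *dst, struct bio *src)
{
    assert(dst->bi_orig == src);
    dst->bi_ioc = src->bi_ioc;
    dst->bi_css = src->bi_css;
}
```

The point of the underscored name and the hard failure is exactly what the mail argues: the cheap pointer copy stays available to callers (like DM) that can prove the lifetime relationship, while everyone else would wait for the refcounted bio_copy_association.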
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Tue, 16 Apr 2013, Tejun Heo wrote:
> Hey,
>
> On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote:
> > The patch is not bug-prone, because we already must make sure that the
> > cloned bio has a shorter lifetime than the master bio - so the patch
> > doesn't introduce any new possibilities to make bugs.
>
> The whole world isn't composed of only your code.  As I said
> repeatedly, you're introducing an API which is misleading and can
> easily cause subtle bugs which are very difficult to reproduce.
>
> Imagine it being used to tag a metadata or checksum update bio being
> sent down while processing another bio and used to "clone" the context
> of the original bio.  It'll work most of the time even if the original
> bio gets completed first but it'll break when it gets really unlucky -
> e.g. racing with other operations which can put the base css ref, and
> it'll be hellish to reproduce and everyone would have to pay for your
> silly hack.

That's why the comment at the function says: "it is assumed that the
lifetime of the new bio is shorter than the lifetime of the original
bio.  If the new bio can outlive the old bio, the caller must increment
the reference counts." - do you think that it is so bad that someone
will use the function without reading the comment?

Anyway, the situation that you describe could only happen in the
dm-bufio or dm-kcopyd files, so it's easy to control and increment the
reference counts there.  There are no other places in device mapper
where we create bios that live longer than the original one.

Mikulas
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
Hey,

On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote:
> The patch is not bug-prone, because we already must make sure that the
> cloned bio has a shorter lifetime than the master bio - so the patch
> doesn't introduce any new possibilities to make bugs.

The whole world isn't composed of only your code.  As I said
repeatedly, you're introducing an API which is misleading and can
easily cause subtle bugs which are very difficult to reproduce.

Imagine it being used to tag a metadata or checksum update bio being
sent down while processing another bio and used to "clone" the context
of the original bio.  It'll work most of the time even if the original
bio gets completed first but it'll break when it gets really unlucky -
e.g. racing with other operations which can put the base css ref, and
it'll be hellish to reproduce and everyone would have to pay for your
silly hack.

> > Do the two really look the same to you?  The page refs are much more
> > expensive, mostly contained in and the main focus of dm.  ioc/css
> > refs aren't that expensive to begin with, css refcnting is widely
> > scattered
>
> ioc is per-task, so it is likely to be cached (but there are processors
> that have slow atomic operations even on cached data - on Pentium 4 it
> takes about 100 cycles). But css is shared between tasks and produces
> the cache ping-pong effect.

For $DEITY's sake, how many times do I have to tell you to use a
per-cpu reference count?  Why do I have to repeat the same story over
and over again?  What part of "make the reference count per-cpu" don't
you get?  It's not a complicated message.

At this point, I can't even understand why or what the hell you're
arguing.  There's a clearly better way to do it and you're just
repeating yourself like a broken record that your hack in itself isn't
broken.

So, if you wanna continue that way for whatever reason, you have my
firm nack and I'm outta this thread.

Bye bye.

-- 
tejun
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Fri, 12 Apr 2013, Tejun Heo wrote:
> On Fri, Apr 12, 2013 at 02:01:08PM -0400, Mikulas Patocka wrote:
> > So if you think that reference counts should be incremented by every
> > clone of the original bio, what kind of bug should it protect against?
> > If we don't increment reference counts for pages, why should we do it
> > for cgroup pointers?
>
> These things are called trade-offs.  You look at the overhead of the
> things and how complex / fragile things get when certain shortcuts are
> taken and how well contained and easy to verify / debug when things go
> wrong and then make your choice.

So what are we trading here, and for what?  The patch eliminates the
atomic references when passing the bio down the stack of bio drivers.

The patch adds a flag BIO_DROP_CGROUP_REFCOUNT, modifies it at two
places and tests it at two places - that is not big.  The flag
BIO_DROP_CGROUP_REFCOUNT is never supposed to be used outside the bio
cgroup functions, so it doesn't complicate the interface to other
subsystems.

The patch is not bug-prone, because we already must make sure that the
cloned bio has a shorter lifetime than the master bio - so the patch
doesn't introduce any new possibilities to make bugs.

> Do the two really look the same to you?  The page refs are much more
> expensive, mostly contained in and the main focus of dm.  ioc/css refs
> aren't that expensive to begin with, css refcnting is widely scattered

ioc is per-task, so it is likely to be cached (but there are processors
that have slow atomic operations even on cached data - on Pentium 4 it
takes about 100 cycles). But css is shared between tasks and produces
the cache ping-pong effect.

> across the kernel, the association interface is likely to be used by
> any entity issuing IOs asynchronously soonish, and there is much saner
> way to improve it - which would be beneficial not only to block / dm
> but everyone else using it.
>
> Something being done in one place doesn't automatically make it okay
> everywhere else.  We can and do use hackery but *with* discretion.
>
> If you still can't understand, I'm not sure what more I can tell you.
>
> -- 
> tejun

I don't know what's wrong with 4 lines of code to manipulate a flag.

I understand that you don't want to do something complicated and
bug-prone to improve performance.  But the patch is neither complicated
nor bug-prone.

Mikulas
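[Editor's note: a userspace model of the flag scheme Mikulas describes above, so the "4 lines of code" claim is concrete. The css refcount is taken once, by the bio originally associated with the cgroup; clones copy the pointer without the flag, so only the owner drops the reference at completion. Field and function names follow the patch description in the mails; the actual patch may differ.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy cgroup subsys state with a plain refcount (atomic in kernel). */
struct css { int refcnt; };

#define BIO_DROP_CGROUP_REFCOUNT 0x1

struct bio {
    struct css   *bi_css;
    unsigned int  bi_flags;
};

/* Associate @bio with @css, taking the one real reference and marking
 * this bio as the one responsible for dropping it. */
static void bio_associate_css(struct bio *bio, struct css *css)
{
    css->refcnt++;
    bio->bi_css = css;
    bio->bi_flags |= BIO_DROP_CGROUP_REFCOUNT;
}

/* Clone the association: pointer copy only, no extra reference.  The
 * DROP flag is deliberately NOT set - the clone must complete before
 * the master bio, so the css cannot go away under it. */
static void bio_clone_context(struct bio *dst, struct bio *src)
{
    dst->bi_css = src->bi_css;
}

/* On completion, only the owning bio drops the reference. */
static void bio_disassociate_task(struct bio *bio)
{
    if (bio->bi_flags & BIO_DROP_CGROUP_REFCOUNT)
        bio->bi_css->refcnt--;
    bio->bi_css = NULL;
}
```

The correctness of the scheme rests entirely on the lifetime invariant the thread is arguing about: if a clone could outlive its master, the pointer copy in bio_clone_context would dangle, which is exactly Tejun's objection to exposing this as a generic API.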
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Fri, Apr 12, 2013 at 02:01:08PM -0400, Mikulas Patocka wrote:
> So if you think that reference counts should be incremented by every
> clone of the original bio, what kind of bug should it protect against?
> If we don't increment reference counts for pages, why should we do it
> for cgroup pointers?

These things are called trade-offs.  You look at the overhead of the
things and how complex / fragile things get when certain shortcuts are
taken and how well contained and easy to verify / debug when things go
wrong and then make your choice.

Do the two really look the same to you?  The page refs are much more
expensive, mostly contained in and the main focus of dm.  ioc/css refs
aren't that expensive to begin with, css refcnting is widely scattered
across the kernel, the association interface is likely to be used by
any entity issuing IOs asynchronously soonish, and there is much saner
way to improve it - which would be beneficial not only to block / dm
but everyone else using it.

Something being done in one place doesn't automatically make it okay
everywhere else.  We can and do use hackery but *with* discretion.

If you still can't understand, I'm not sure what more I can tell you.

-- 
tejun
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Thu, 11 Apr 2013, Tejun Heo wrote:
> On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote:
> > All that I can tell you is that adding an empty atomic operation
> > "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);"
> > to bio_clone_context and bio_disassociate_task increases the time to
> > run a benchmark from 23 to 40 seconds.
>
> Right, linear target on ramdisk, very realistic, and you know what,
> hell with dm, let's just hand code everything into submit_bio().  I'm
> sure it will speed up your test case significantly.

The purpose of this benchmarking is not to optimize the linear target
on a ramdisk.  The purpose is to optimize the kernel for upcoming
massively parallel servers, with possibly hundreds of energy-efficient
cores or so.  On these systems every single atomic reference really
becomes a bottleneck.

And since I don't have such a massively parallel server, I am testing
it on a 12-core machine with a ramdisk - the performance drop due to
atomic accesses can be measured even on that.

I already eliminated most of the atomic operations with this patch:
http://people.redhat.com/~mpatocka/patches/kernel/dm-lock-optimization/dm-optimize.patch
And I don't see the sense in adding more for cgroup, especially if it
can be easily avoided.

Mikulas
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Thu, 11 Apr 2013, Tejun Heo wrote:
> On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote:
> > If this becomes an actual bottleneck, the right thing to do is making
> > css ref per-cpu.  Please stop messing around with refcounting.
>
> If you think this kind of hackery is acceptable, you really need to
> re-evaluate your priorities in making engineering decisions.  In
> tightly coupled code, maybe, but you're trying to introduce an utterly
> broken error-prone thing as a generic block layer API.  I mean, are
> you for real?
>
> -- 
> tejun

Please describe what is wrong with the code.  Why do you call it
hackery?

When device mapper is creating a cloned bio for the lower layer, it is
already assumed that the cloned bio has a shorter lifetime than the
original bio it was created from.

The device mapper copies a part of the bio vector from the original bio
to the cloned bio; it copies pointers to pages without increasing
reference counts on those pages.  As long as the original bio is not
returned with bio_endio, the pages must exist, so there is no need to
increase their reference counts.

Now, if copying pointers without increasing reference counts is OK for
pages, why do you think it is not OK for the cgroup context?

Why do you call this bug-prone - how do you think a bug could happen?
If someone in device mapper erroneously ends the master bio while the
cloned bio is still in progress, there is already a memory corruption
bug (the cloned bio vector points to potentially freed pages), and
safeguarding the cgroup pointers won't fix it.

So if you think that reference counts should be incremented by every
clone of the original bio, what kind of bug should it protect against?
If we don't increment reference counts for pages, why should we do it
for cgroup pointers?

Mikulas
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Fri, Apr 12, 2013 at 02:01:08PM -0400, Mikulas Patocka wrote:
> So if you think that reference counts should be incremented by every
> clone of the original bio, what kind of bug should it protect against?
> If we don't increment reference counts for pages, why should we do it
> for cgroup pointers?

These things are called trade-offs. You look at the overhead of the
things, at how complex / fragile things get when certain shortcuts are
taken, and at how well contained and easy to verify / debug things are
when they go wrong, and then you make your choice. Do the two really look
the same to you? The page refs are much more expensive, and they are
mostly contained in, and the main focus of, dm. ioc/css refs aren't that
expensive to begin with, css refcounting is widely scattered across the
kernel, the association interface is likely to be used soonish by any
entity issuing IOs asynchronously, and there is a much saner way to
improve it - one that would benefit not only block / dm but everyone else
using it.

Something being done in one place doesn't automatically make it okay
everywhere else. We can and do use hackery, but *with* discretion. If you
still can't understand, I'm not sure what more I can tell you.

--
tejun
Re: [PATCH v2] make dm and dm-crypt forward cgroup context
On 12.4.2013 2:22, Tejun Heo wrote:
> On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote:
>> All that I can tell you is that adding an empty atomic operation
>> "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);"
>> to bio_clone_context and bio_disassociate_task increases the time to run a
>> benchmark from 23 to 40 seconds.
>
> Right, linear target on ramdisk, very realistic, and you know what,
> hell with dm, let's just hand code everything into submit_bio(). I'm
> sure it will speed up your test case significantly.
>
> If this actually matters, improve it in a *sane* way. Make the refcnts
> per-cpu and don't use atomic ops. In fact, we already have a proposed
> implementation of a percpu refcnt which is being used by the aio
> restructure patches and is likely to be included in some form. It's not
> quite ready yet, so please work on something useful like that instead of
> continuing this nonsense.

Hey, what's going on here? It seems the dmcrypt problem has transformed
into a block-level refcount flame :)

Mikulas, please, can you talk to Tejun and find some better way to solve
the DM & block-level bio context forwarding here? (Ideally on some
realistic scenario - you have enough hw in Red Hat to try; some raid0
SSDs with linear on top should be a good example.) And later (when
agreed) implement it in dmcrypt?

I definitely do not want dmcrypt to become the guinea pig here; it should
remain as simple as possible and should do transparent _encryption_, not
any inline device-mapper super-optimizing games.

Thanks,
Milan
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote:
> All that I can tell you is that adding an empty atomic operation
> "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);"
> to bio_clone_context and bio_disassociate_task increases the time to run a
> benchmark from 23 to 40 seconds.

Right, linear target on ramdisk, very realistic, and you know what,
hell with dm, let's just hand code everything into submit_bio(). I'm
sure it will speed up your test case significantly.

If this actually matters, improve it in a *sane* way. Make the refcnts
per-cpu and don't use atomic ops. In fact, we already have a proposed
implementation of a percpu refcnt which is being used by the aio
restructure patches and is likely to be included in some form. It's not
quite ready yet, so please work on something useful like that instead of
continuing this nonsense.

--
tejun
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Thu, 11 Apr 2013, Tejun Heo wrote:

> On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote:
> > If this becomes an actual bottleneck, the right thing to do is making
> > css ref per-cpu. Please stop messing around with refcounting.
>
> If you think this kind of hackery is acceptable, you really need to
> re-evaluate your priorities in making engineering decisions. In
> tightly coupled code, maybe, but you're trying to introduce an utterly
> broken, error-prone thing as a generic block layer API. I mean, are
> you for real?
>
> --
> tejun

All that I can tell you is that adding an empty atomic operation
"cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);"
to bio_clone_context and bio_disassociate_task increases the time to run a
benchmark from 23 to 40 seconds. Every single atomic reference in the
block layer is measurable.

How I measured it:

(1) use the dm SRCU patches
(http://people.redhat.com/~mpatocka/patches/kernel/dm-lock-optimization/)
that replace some atomic accesses in device mapper with SRCU. The patches
will likely be included in the kernel to improve performance.

(2) use the patch v2 that I posted in this thread

(3) add bio_associate_current(bio) to _dm_request (so that each bio is
associated with a process even if it is not offloaded to a workqueue)

(4) change bio_clone_context to actually increase reference counts:

static inline void bio_clone_context(struct bio *bio, struct bio *bio_src)
{
#ifdef CONFIG_BLK_CGROUP
	BUG_ON(bio->bi_ioc != NULL);
	if (bio_src->bi_ioc) {
		get_io_context_active(bio_src->bi_ioc);
		bio->bi_ioc = bio_src->bi_ioc;
		if (bio_src->bi_css && css_tryget(bio_src->bi_css)) {
			bio->bi_css = bio_src->bi_css;
		}
		bio->bi_flags |= 1UL << BIO_DROP_CGROUP_REFCOUNT;
	}
#endif
}

(5) add "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt,
bio->bi_css->refcnt)" to bio_clone_context and bio_disassociate_task

Now, the measurement:
- create a 4GiB ramdisk and fill it with dd so that it is allocated
- create 5 nested device mapper linear targets on it
- run "time fio --rw=randrw --size=1G --bs=512
  --filename=/dev/mapper/linear5 --direct=1 --name=job1 --name=job2
  --name=job3 --name=job4 --name=job5 --name=job6 --name=job7 --name=job8
  --name=job9 --name=job10 --name=job11 --name=job12"
  (it was run on a 12-core machine, so there are 12 concurrent jobs)

With kernel (4), the benchmark takes 23 seconds. With kernel (5) it takes
40 seconds.

Mikulas
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote:
> If this becomes an actual bottleneck, the right thing to do is making
> css ref per-cpu. Please stop messing around with refcounting.

If you think this kind of hackery is acceptable, you really need to
re-evaluate your priorities in making engineering decisions. In
tightly coupled code, maybe, but you're trying to introduce an utterly
broken, error-prone thing as a generic block layer API. I mean, are
you for real?

--
tejun
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Thu, Apr 11, 2013 at 03:49:20PM -0400, Mikulas Patocka wrote:
> If the bi_css pointer points to a structure that is shared between
> processes, using an atomic instruction causes cache line bouncing - it
> doesn't cost a few instructions, it costs two to three hundred cycles.
>
> I modified the patch to use a new flag, BIO_DROP_CGROUP_REFCOUNT, to
> note that the refcount must be decremented - if the flag is set,
> refcounts must be decremented when the bio is destroyed; if it is not
> set, the references are borrowed from the upper layer bio.
>
> It is less bug-prone than the previous patch.

If this becomes an actual bottleneck, the right thing to do is making
css ref per-cpu. Please stop messing around with refcounting.

NACK.

--
tejun
Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)
On Wed, 10 Apr 2013, Tejun Heo wrote:

> On Wed, Apr 10, 2013 at 07:42:59PM -0400, Mikulas Patocka wrote:
> > /*
> > + * bio_clone_context copies cgroup context from the original bio to the
> > + * new bio. It is used by bio midlayer drivers that create a new bio based
> > + * on an original bio and forward it to the lower layer.
> > + *
> > + * No reference counts are incremented - it is assumed that the lifetime
> > + * of the new bio is shorter than the lifetime of the original bio. If the
> > + * new bio can outlive the old bio, the caller must increment the
> > + * reference counts.
> > + *
> > + * Before freeing the new bio, the caller must clear the context with the
> > + * bio_clear_context function. If bio_clear_context were not called, the
> > + * reference counts would be decremented on both the new and the original
> > + * bio, resulting in a crash due to reference count underflow.
> > + */
> > +static inline void bio_clone_context(struct bio *orig, struct bio *new)
> > +{
> > +#ifdef CONFIG_BLK_CGROUP
> > +	new->bi_ioc = orig->bi_ioc;
> > +	new->bi_css = orig->bi_css;
>
> Hmmm... Let's not do this. Sure, you'd be saving several instructions
> but the gain is unlikely to be significant given that those cachelines
> are likely to be hot anyway. Also, please name it
> bio_copy_association().
>
> Thanks.
>
> --
> tejun

If the bi_css pointer points to a structure that is shared between
processes, using an atomic instruction causes cache line bouncing - it
doesn't cost a few instructions, it costs two to three hundred cycles.

I modified the patch to use a new flag, BIO_DROP_CGROUP_REFCOUNT, to note
that the refcount must be decremented - if the flag is set, refcounts must
be decremented when the bio is destroyed; if it is not set, the references
are borrowed from the upper layer bio.

It is less bug-prone than the previous patch.

Mikulas

---

dm: retain cgroup context

This patch makes dm and the dm-crypt target retain cgroup context.

A new function bio_clone_context is introduced. It copies cgroup context
from one bio to another without incrementing the reference count. A new
bio flag BIO_DROP_CGROUP_REFCOUNT specifies that cgroup refcounts should
be decremented when the bio is freed.

bio_associate_current and bio_disassociate_task are exported to modules.

dm core is changed so that it copies the context to cloned bios. dm
associates the bio with the current process if it is going to offload the
bio to a thread.

dm-crypt associates the bio with the current process and copies the
context to outgoing bios.

Signed-off-by: Mikulas Patocka

---
 drivers/md/dm-crypt.c     |    4 ++++
 drivers/md/dm.c           |    7 +++++--
 fs/bio.c                  |   11 +++++++++--
 include/linux/bio.h       |   27 +++++++++++++++++++++++++++
 include/linux/blk_types.h |    3 +++
 5 files changed, 48 insertions(+), 4 deletions(-)

Index: linux-3.8.6-fast/drivers/md/dm.c
===================================================================
--- linux-3.8.6-fast.orig/drivers/md/dm.c	2013-04-11 17:29:09.000000000 +0200
+++ linux-3.8.6-fast/drivers/md/dm.c	2013-04-11 19:33:47.000000000 +0200
@@ -1124,6 +1124,8 @@ static struct dm_target_io *alloc_tio(st
 	clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs);
 	tio = container_of(clone, struct dm_target_io, clone);
 
+	bio_clone_context(&tio->clone, ci->bio);
+
 	tio->io = ci->io;
 	tio->ti = ti;
 	memset(&tio->info, 0, sizeof(tio->info));
@@ -1469,9 +1471,10 @@ static void _dm_request(struct request_q
 	if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
 		dm_put_live_table(md, srcu_idx);
 
-		if (bio_rw(bio) != READA)
+		if (bio_rw(bio) != READA) {
+			bio_associate_current(bio);
 			queue_io(md, bio);
-		else
+		} else
 			bio_io_error(bio);
 		return;
 	}

Index: linux-3.8.6-fast/include/linux/bio.h
===================================================================
--- linux-3.8.6-fast.orig/include/linux/bio.h	2013-04-11 17:29:07.000000000 +0200
+++ linux-3.8.6-fast/include/linux/bio.h	2013-04-11 19:34:11.000000000 +0200
@@ -291,6 +291,15 @@ extern void bvec_free_bs(struct bio_set
 extern unsigned int bvec_nr_vecs(unsigned short idx);
 
 #ifdef CONFIG_BLK_CGROUP
+/*
+ * bio_associate_current associates the bio with the current process. It
+ * should be called by any block device driver that passes the bio to a
+ * different process to be processed. It must be called in the original
+ * process. bio_associate_current does nothing if the bio is already
+ * associated.
+ *
+ * bio_disassociate_task disassociates the bio from the task. It is called
+ * automatically at bio destruction.
+ */
 int bio_associate_current(struct bio *bio);
 void bio_disassociate_task(struct bio *bio);
 #else	/* CONFIG_BLK_CGROUP */
@@ -299,6 +308,24 @@ static inline void bio_disassociate_task
 #endif	/*