Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-05-22 Thread Tejun Heo
Hey,

On Wed, May 22, 2013 at 02:50:14PM -0400, Mike Snitzer wrote:
> Was wondering: how is percpu-refcounting coming along?  Do you have a
> pointer to the code that can be pulled in for use by Mikulas' dm-crypt
> changes?
> 
> Would be nice to get this stuff sorted out for the 3.11 merge window.

Still in progress.  Waiting for Kent's next round and yeah I do hope
so too as I really wanna convert css refcnting to it regardless of
dm-crypt work.  The thread is at

 https://patchwork.kernel.org/patch/2562111/

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-05-22 Thread Mike Snitzer
On Thu, Apr 18 2013 at  1:03pm -0400,
Tejun Heo  wrote:

> Hello, Mike.
> 
> On Thu, Apr 18, 2013 at 12:47:42PM -0400, Mike Snitzer wrote:
> > I see you nack and raise you with: please reconsider in the near term.
> 
> The thing is that percpu-refcnting is already in mostly ready-form, so
> unless this dm series is planned to be merged for v3.10-rc1, I don't
> see the need for a separate near term solution.  Mikulas can just do
> the normal refcnting and cgroup will most likely adopt per-cpu ref
> anyway once percpu-refcnting is in, so it should all fall into place
> pretty quickly.  For devel / testing / whatever, Mikulas can surely
> turn off cgroup, right?

Hey Tejun,

Was wondering: how is percpu-refcounting coming along?  Do you have a
pointer to the code that can be pulled in for use by Mikulas' dm-crypt
changes?

Would be nice to get this stuff sorted out for the 3.11 merge window.

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-18 Thread Tejun Heo
Hello, Mike.

On Thu, Apr 18, 2013 at 12:47:42PM -0400, Mike Snitzer wrote:
> I see you nack and raise you with: please reconsider in the near term.

The thing is that percpu-refcnting is already in mostly ready-form, so
unless this dm series is planned to be merged for v3.10-rc1, I don't
see the need for a separate near term solution.  Mikulas can just do
the normal refcnting and cgroup will most likely adopt per-cpu ref
anyway once percpu-refcnting is in, so it should all fall into place
pretty quickly.  For devel / testing / whatever, Mikulas can surely
turn off cgroup, right?

Thanks.

-- 
tejun


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-18 Thread Mike Snitzer
On Tue, Apr 16 2013 at  1:24pm -0400,
Tejun Heo  wrote:

> Hey,
> 
> On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote:
> > The patch is not bug-prone, because we already must make sure that the 
> > cloned bio has shorter lifetime than the master bio - so the patch doesn't 
> > introduce any new possibilities to make bugs.
> 
> The whole world isn't composed of only your code.  As I said
> repeatedly, you're introducing an API which is misleading and can
> easily cause subtle bugs which are very difficult to reproduce.
> 
> Imagine it being used to tag a metadata or checksum update bio being
> sent down while processing another bio and used to "clone" the context
> of the original bio.  It'll work most of the time even if the original
> bio gets completed first but it'll break when it gets really unlucky -
> e.g. racing with other operations which can put the base css ref, and
> it'll be hellish to reproduce and everyone would have to pay for your
> silly hack.
> 
> > > Do the two really look the same to you?  The page refs are much more
> > > expensive, mostly contained in and the main focus of dm.  ioc/css refs
> > > aren't that expensive to begin with, css refcnting is widely scattered
> > 
> > ioc is per-task, so it is likely to be cached (but there are processors 
> > that have slow atomic operations even on cached data - on Pentium 4 it 
> > takes about 100 cycles). But css is shared between tasks and produces the 
> > cache ping-pong effect.
> 
> For $DEITY's sake, how many times do I have to tell you to use per-cpu
> reference count?  Why do I have to repeat the same story over and over
> again?  What part of "make the reference count per-cpu" don't you get?
> It's not a complicated message.
> 
> At this point, I can't even understand why or what the hell you're
> arguing.  There's a clearly better way to do it and you're just
> repeating yourself like a broken record that your hack in itself isn't
> broken.
> 
> So, if you wanna continue that way for whatever reason, you have my
> firm nack and I'm outta this thread.
> 
> Bye bye.

Hey Tejun,

I see you nack and raise you with: please reconsider in the near term.

Your point about not wanting to introduce a generic block interface that
isn't "safe" for all users is noted.  But as Mikulas has repeatedly said, DM
does _not_ ever need to do the refcounting.  So it seems a bit absurd to
introduce the requirement that DM should stand up an interface that uses
percpu.  That is a fair amount of churn that DM will never have a need
to take advantage of.

So why not introduce __bio_copy_association(bio1, bio2) and add a BUG_ON
in it if bio2 isn't a clone of bio1?

When there is a need for async IO to have more scalable refcounting that
would be the time to introduce bio_copy_association that uses per-cpu
refcounting (and yes we could then even nuke __bio_copy_association).

It just seems to me a bit burdensome to ask Mikulas to add this
infrastructure when DM really doesn't need it at all.  But again I do
understand your desire for others to be steering the kernel where it
needs to be to benefit future use-cases.  But I think in general it is
best to introduce complexity when there is an actual need.

Your insights are amazingly helpful and I think it is unfortunate that
this refcounting issue overshadowed the positive advancements of
dm-crypt scaling.  I'm just looking to see if we can carry on with a
temporary intermediate step with __bio_copy_association.

Thanks,
Mike


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-16 Thread Mikulas Patocka


On Tue, 16 Apr 2013, Tejun Heo wrote:

> Hey,
> 
> On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote:
> > The patch is not bug-prone, because we already must make sure that the 
> > cloned bio has shorter lifetime than the master bio - so the patch doesn't 
> > introduce any new possibilities to make bugs.
> 
> The whole world isn't composed of only your code.  As I said
> repeatedly, you're introducing an API which is misleading and can
> easily cause subtle bugs which are very difficult to reproduce.
> 
> Imagine it being used to tag a metadata or checksum update bio being
> sent down while processing another bio and used to "clone" the context
> of the original bio.  It'll work most of the time even if the original
> bio gets completed first but it'll break when it gets really unlucky -
> e.g. racing with other operations which can put the base css ref, and
> it'll be hellish to reproduce and everyone would have to pay for your
> silly hack.

That's why the comment at the function says: "it is assumed that the 
lifetime of the new bio is shorter than the lifetime of the original bio. 
If the new bio can outlive the old bio, the caller must increment the 
reference counts." - do you really think that someone will use the 
function without reading the comment?

Anyway, the situation that you describe could only happen in the dm-bufio 
or dm-kcopyd files, so it's easy to control and increment the reference 
counts there. There are no other places in device mapper where we create 
bios that live longer than the original one.

Mikulas


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-16 Thread Tejun Heo
Hey,

On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote:
> The patch is not bug-prone, because we already must make sure that the 
> cloned bio has shorter lifetime than the master bio - so the patch doesn't 
> introduce any new possibilities to make bugs.

The whole world isn't composed of only your code.  As I said
repeatedly, you're introducing an API which is misleading and can
easily cause subtle bugs which are very difficult to reproduce.

Imagine it being used to tag a metadata or checksum update bio being
sent down while processing another bio and used to "clone" the context
of the original bio.  It'll work most of the time even if the original
bio gets completed first but it'll break when it gets really unlucky -
e.g. racing with other operations which can put the base css ref, and
it'll be hellish to reproduce and everyone would have to pay for your
silly hack.

> > Do the two really look the same to you?  The page refs are much more
> > expensive, mostly contained in and the main focus of dm.  ioc/css refs
> > aren't that expensive to begin with, css refcnting is widely scattered
> 
> ioc is per-task, so it is likely to be cached (but there are processors 
> that have slow atomic operations even on cached data - on Pentium 4 it 
> takes about 100 cycles). But css is shared between tasks and produces the 
> cache ping-pong effect.

For $DEITY's sake, how many times do I have to tell you to use per-cpu
reference count?  Why do I have to repeat the same story over and over
again?  What part of "make the reference count per-cpu" don't you get?
It's not a complicated message.

At this point, I can't even understand why or what the hell you're
arguing.  There's a clearly better way to do it and you're just
repeating yourself like a broken record that your hack in itself isn't
broken.

So, if you wanna continue that way for whatever reason, you have my
firm nack and I'm outta this thread.

Bye bye.

-- 
tejun


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-15 Thread Mikulas Patocka


On Fri, 12 Apr 2013, Tejun Heo wrote:

> On Fri, Apr 12, 2013 at 02:01:08PM -0400, Mikulas Patocka wrote:
> > So if you think that reference counts should be incremented by every clone 
> > of the original bio, what kind of bug should it protect against? If we 
> > don't increment reference counts for pages, why should we do it for cgroup 
> > pointers?
> 
> These things are called trade-offs.  You look at the overhead of the
> things and how complex / fragile things get when certain shortcuts are
> taken and how well contained and easy to verify / debug when things go
> wrong and then make your choice.

So what are we trading here, and for what?


The patch eliminates the atomic references when passing the bio down the 
stack of bio drivers.


The patch adds a flag, BIO_DROP_CGROUP_REFCOUNT, modifies it at two places 
and tests it in two places - that is not big.

The flag BIO_DROP_CGROUP_REFCOUNT is never supposed to be used outside the 
bio cgroup functions, so it doesn't complicate the interface to other 
subsystems.

The patch is not bug-prone, because we already must make sure that the 
cloned bio has shorter lifetime than the master bio - so the patch doesn't 
introduce any new possibilities to make bugs.


> Do the two really look the same to you?  The page refs are much more
> expensive, mostly contained in and the main focus of dm.  ioc/css refs
> aren't that expensive to begin with, css refcnting is widely scattered

ioc is per-task, so it is likely to be cached (but there are processors 
that have slow atomic operations even on cached data - on Pentium 4 it 
takes about 100 cycles). But css is shared between tasks and produces the 
cache ping-pong effect.

> across the kernel, the association interface is likely to be used by
> any entity issuing IOs asynchronously soonish, and there is much saner
> way to improve it - which would be beneficial not only to block / dm
> but everyone else using it.
> 
> Something being done in one place doesn't automatically make it okay
> everywhere else.  We can and do use hackery but *with* discretion.
>
> If you still can't understand, I'm not sure what more I can tell you.
> 
> -- 
> tejun

I don't know what's wrong with 4 lines of code to manipulate a flag.

I understand that you don't want to do something complicated and bug-prone 
to improve performance. But the patch is neither complicated nor 
bug-prone.

Mikulas


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-12 Thread Tejun Heo
On Fri, Apr 12, 2013 at 02:01:08PM -0400, Mikulas Patocka wrote:
> So if you think that reference counts should be incremented by every clone 
> of the original bio, what kind of bug should it protect against? If we 
> don't increment reference counts for pages, why should we do it for cgroup 
> pointers?

These things are called trade-offs.  You look at the overhead of the
things and how complex / fragile things get when certain shortcuts are
taken and how well contained and easy to verify / debug when things go
wrong and then make your choice.

Do the two really look the same to you?  The page refs are much more
expensive, mostly contained in and the main focus of dm.  ioc/css refs
aren't that expensive to begin with, css refcnting is widely scattered
across the kernel, the association interface is likely to be used by
any entity issuing IOs asynchronously soonish, and there is much saner
way to improve it - which would be beneficial not only to block / dm
but everyone else using it.

Something being done in one place doesn't automatically make it okay
everywhere else.  We can and do use hackery but *with* discretion.

If you still can't understand, I'm not sure what more I can tell you.

-- 
tejun


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-12 Thread Mikulas Patocka


On Thu, 11 Apr 2013, Tejun Heo wrote:

> On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote:
> > All that I can tell you is that adding an empty atomic operation 
> > "cmpxchg(>bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);" 
> > to bio_clone_context and bio_disassociate_task increases the time to run a 
> > benchmark from 23 to 40 seconds.
> 
> Right, linear target on ramdisk, very realistic, and you know what,
> hell with dm, let's just hand code everything into submit_bio().  I'm
> sure it will speed up your test case significantly.

The purpose of this benchmarking is not to optimize linear target on 
ramdisk. The purpose is to optimize the kernel for upcoming massively 
parallel servers, with possibly hundreds of energy-efficient cores or so. 
On these systems every single atomic reference really becomes a 
bottleneck. And since I don't have such massively parallel server, I am 
testing it on 12-core machine with ramdisk - the performance drop due to 
atomic accesses can be measured even on that.

I already eliminated most of the atomic operations with this patch 
http://people.redhat.com/~mpatocka/patches/kernel/dm-lock-optimization/dm-optimize.patch
And I don't see a sense in adding more for cgroup, especially, if it can 
be easily avoided.

Mikulas


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-12 Thread Mikulas Patocka


On Thu, 11 Apr 2013, Tejun Heo wrote:

> On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote:
> > If this becomes an actual bottleneck, the right thing to do is making
> > css ref per-cpu.  Please stop messing around with refcounting.
> 
> If you think this kind of hackery is acceptable, you really need to
> re-evaluate your priorities in making engineering decisions.  In
> tightly coupled code, maybe, but you're trying to introduce utterly
> broken error-prone thing as a generic block layer API.  I mean, are
> you for real?
> 
> -- 
> tejun

Please describe what is wrong with the code. Why do you call it hackery?

When device mapper creates a cloned bio for the lower layer, it is 
already assumed that the cloned bio has a shorter lifetime than the 
original bio it was created from.

The device mapper copies a part of the bio vector from the original bio to 
the cloned bio; it copies pointers to pages without increasing the 
reference counts on those pages. As long as the original bio has not been 
completed with bio_endio, the pages must exist, so there is no need to 
increase their reference counts.

Now, if copying pointers without increasing reference counts is OK for 
pages, why do you think it is not OK for cgroup context?

Why do you call this bug-prone - how do you think a bug could happen? If 
someone in device mapper erroneously ends the master bio while the cloned 
bio is still in progress, there is already a memory corruption bug (the 
cloned bio vector points to potentially freed pages), and safeguarding the 
cgroup pointers won't fix it.


So if you think that reference counts should be incremented by every clone 
of the original bio, what kind of bug would that protect against? If we 
don't increment reference counts for pages, why should we do it for cgroup 
pointers?

Mikulas


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-12 Thread Tejun Heo
On Fri, Apr 12, 2013 at 02:01:08PM -0400, Mikulas Patocka wrote:
> So if you think that reference counts should be incremented by every clone 
> of the original bio, what kind of bug should it protect against? If we 
> don't increment reference counts for pages, why should we do it for cgroup 
> pointers?

These things are called trade-offs.  You look at the overhead of the
things, at how complex / fragile things get when certain shortcuts are
taken, and at how well contained and easy to verify / debug things are
when they go wrong, and then make your choice.

Do the two really look the same to you?  The page refs are much more
expensive, and they are mostly contained in, and the main focus of, dm.
ioc/css refs aren't that expensive to begin with, css refcnting is widely
scattered across the kernel, the association interface is likely to be
used by any entity issuing IOs asynchronously soonish, and there is a much
saner way to improve it - one which would be beneficial not only to
block / dm but to everyone else using it.

Something being done in one place doesn't automatically make it okay
everywhere else.  We can and do use hackery but *with* discretion.

If you still can't understand, I'm not sure what more I can tell you.

-- 
tejun


Re: [PATCH v2] make dm and dm-crypt forward cgroup context

2013-04-11 Thread Milan Broz
On 12.4.2013 2:22, Tejun Heo wrote:
> On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote:
>> All that I can tell you is that adding an empty atomic operation 
>> "cmpxchg(>bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);" 
>> to bio_clone_context and bio_disassociate_task increases the time to run a 
>> benchmark from 23 to 40 seconds.
> 
> Right, linear target on ramdisk, very realistic, and you know what,
> hell with dm, let's just hand code everything into submit_bio().  I'm
> sure it will speed up your test case significantly.
> 
> If this actually matters, improve it in *sane* way.  Make the refcnts
> per-cpu and not use atomic ops.  In fact, we already have proposed
> implementation of percpu refcnt which is being used by aio restructure
> patches and likely to be included in some form.  It's not quite ready
> yet, so please work on something useful like that instead of
> continuing this non-sense.

Hey, what's going on here?

Seems the dmcrypt problem has transformed into a block-level refcount flame war :)

Mikulas, please, can you talk to Tejun and find some better way to
handle the DM & block-level bio contexts here?
(Ideally on some realistic scenario - you have enough hw in Red Hat to try;
some raid0 SSDs with linear on top should be a good example)
and later (when agreed) implement it in dmcrypt?

I definitely do not want dmcrypt to become the guinea pig here; it should
remain as simple as possible and should do transparent _encryption_,
not any inline device-mapper super-optimizing games.


Thanks,
Milan


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-11 Thread Tejun Heo
On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote:
> All that I can tell you is that adding an empty atomic operation 
> "cmpxchg(>bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);" 
> to bio_clone_context and bio_disassociate_task increases the time to run a 
> benchmark from 23 to 40 seconds.

Right, linear target on ramdisk, very realistic, and you know what,
hell with dm, let's just hand code everything into submit_bio().  I'm
sure it will speed up your test case significantly.

If this actually matters, improve it in *sane* way.  Make the refcnts
per-cpu and not use atomic ops.  In fact, we already have proposed
implementation of percpu refcnt which is being used by aio restructure
patches and likely to be included in some form.  It's not quite ready
yet, so please work on something useful like that instead of
continuing this non-sense.

-- 
tejun


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-11 Thread Mikulas Patocka


On Thu, 11 Apr 2013, Tejun Heo wrote:

> On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote:
> > If this becomes an actual bottleneck, the right thing to do is making
> > css ref per-cpu.  Please stop messing around with refcounting.
> 
> If you think this kind of hackery is acceptable, you really need to
> re-evaluate your priorities in making engineering decisions.  In
> tightly coupled code, maybe, but you're trying to introduce utterly
> broken error-prone thing as a generic block layer API.  I mean, are
> you for real?
> 
> -- 
> tejun

All that I can tell you is that adding an empty atomic operation 
"cmpxchg(>bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);" 
to bio_clone_context and bio_disassociate_task increases the time to run a 
benchmark from 23 to 40 seconds.

Every single atomic reference in the block layer is measurable.



How did I measure it:

(1) use dm SRCU patches 
(http://people.redhat.com/~mpatocka/patches/kernel/dm-lock-optimization/) 
that replace some atomic accesses in device mapper with SRCU. The patches 
will likely be included in the kernel to improve performance.

(2) use the patch v2 that I posted in this thread

(3) add bio_associate_current(bio) to _dm_request (so that each bio is 
associated with a process even if it is not offloaded to a workqueue)

(4) change bio_clone_context to actually increase reference counts:
static inline void bio_clone_context(struct bio *bio, struct bio *bio_src)
{
#ifdef CONFIG_BLK_CGROUP
	BUG_ON(bio->bi_ioc != NULL);
	if (bio_src->bi_ioc) {
		get_io_context_active(bio_src->bi_ioc);
		bio->bi_ioc = bio_src->bi_ioc;
		if (bio_src->bi_css && css_tryget(bio_src->bi_css)) {
			bio->bi_css = bio_src->bi_css;
		}
		bio->bi_flags |= 1UL << BIO_DROP_CGROUP_REFCOUNT;
	}
#endif
}

(5) add "cmpxchg(>bi_css->refcnt, bio->bi_css->refcnt, 
bio->bi_css->refcnt)" to bio_clone_context and bio_disassociate_task


Now, measuring:
- create 4GiB ramdisk, fill it with dd so that it is allocated
- create 5 nested device mapper linear targets on it
- run "time fio --rw=randrw --size=1G --bs=512 
--filename=/dev/mapper/linear5 --direct=1 --name=job1 --name=job2 
--name=job3 --name=job4 --name=job5 --name=job6 --name=job7 
--name=job8 --name=job9 --name=job10 --name=job11 --name=job12"
(it was run on a 12-core machine, so there are 12 concurrent jobs)


If I measure kernel (4), the benchmark takes 23 seconds. For kernel (5) it 
takes 40 seconds.

Mikulas


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-11 Thread Tejun Heo
On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote:
> If this becomes an actual bottleneck, the right thing to do is making
> css ref per-cpu.  Please stop messing around with refcounting.

If you think this kind of hackery is acceptable, you really need to
re-evaluate your priorities in making engineering decisions.  In
tightly coupled code, maybe, but you're trying to introduce utterly
broken error-prone thing as a generic block layer API.  I mean, are
you for real?

-- 
tejun


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-11 Thread Tejun Heo
On Thu, Apr 11, 2013 at 03:49:20PM -0400, Mikulas Patocka wrote:
> If the bi_css pointer points to a structure that is shared between 
> processes, using an atomic instruction causes cache line bouncing - it 
> doesn't cost a few instructions, it costs two to three hundred cycles.
> 
> I modified the patch to use new flag BIO_DROP_CGROUP_REFCOUNT to note that 
> the refcount must be decremented - if the flag is set, refcounts must be 
> decremented when bio is destroyed, if it is not set, references are 
> borrowed from upper layer bio.
> 
> It is less bug-prone than the previous patch.

If this becomes an actual bottleneck, the right thing to do is making
css ref per-cpu.  Please stop messing around with refcounting.

NACK.

-- 
tejun


Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

2013-04-11 Thread Mikulas Patocka


On Wed, 10 Apr 2013, Tejun Heo wrote:

> On Wed, Apr 10, 2013 at 07:42:59PM -0400, Mikulas Patocka wrote:
> >  /*
> > + * bio_clone_context copies cgroup context from the original bio to the 
> > new bio.
> > + * It is used by bio midlayer drivers that create new bio based on an 
> > original
> > + * bio and forward it to the lower layer.
> > + *
> > + * No reference counts are incremented - it is assumed that the lifestime 
> > of the
> > + * new bio is shorter than the lifetime of the original bio. If the new 
> > bio can
> > + * outlive the old bio, the caller must increment the reference counts.
> > + *
> > + * Before freeing the new bio, the caller must clear the context with
> > + * bio_clear_context function. If bio_clear_context were not called, the
> > + * reference counts would be decremented on both new and original bio, 
> > resulting
> > + * in crash due to reference count underflow.
> > + */
> > +static inline void bio_clone_context(struct bio *orig, struct bio *new)
> > +{
> > +#ifdef CONFIG_BLK_CGROUP
> > +   new->bi_ioc = orig->bi_ioc;
> > +   new->bi_css = orig->bi_css;
> 
> Hmmm... Let's not do this.  Sure, you'd be saving several instructions
> but the gain is unlikely to be significant given that those cachelines
> are likely to be hot anyway.  Also, please name it
> bio_copy_association().
> 
> Thanks.
> 
> -- 
> tejun

If the bi_css pointer points to a structure that is shared between 
processes, using an atomic instruction causes cache line bouncing - it 
doesn't cost a few instructions, it costs two to three hundred cycles.

I modified the patch to use a new flag, BIO_DROP_CGROUP_REFCOUNT, to note 
that the refcount must be decremented - if the flag is set, refcounts are 
decremented when the bio is destroyed; if it is not set, references are 
borrowed from the upper-layer bio.

It is less bug-prone than the previous patch.

Mikulas

---

dm: retain cgroup context

This patch makes dm and dm-crypt target retain cgroup context.
 
New function bio_clone_context is introduced. It copies cgroup context
from one bio to another without incrementing the reference count. A new
bio flag BIO_DROP_CGROUP_REFCOUNT specifies that cgroup refcounts should
be decremented when the bio is freed.

bio_associate_current and bio_disassociate_task are exported to modules.

dm core is changed so that it copies the context to cloned bio. dm
associates the bio with current process if it is going to offload bio to a
thread.

dm-crypt associates the bio with the current process and copies the
context to outgoing bios.

Signed-off-by: Mikulas Patocka 

---
 drivers/md/dm-crypt.c     |    4 ++++
 drivers/md/dm.c           |    7 +++++--
 fs/bio.c                  |   11 +++++++++--
 include/linux/bio.h       |   27 +++++++++++++++++++++++++++
 include/linux/blk_types.h |    3 +++
 5 files changed, 48 insertions(+), 4 deletions(-)

Index: linux-3.8.6-fast/drivers/md/dm.c
===================================================================
--- linux-3.8.6-fast.orig/drivers/md/dm.c	2013-04-11 17:29:09.000000000 +0200
+++ linux-3.8.6-fast/drivers/md/dm.c	2013-04-11 19:33:47.000000000 +0200
@@ -1124,6 +1124,8 @@ static struct dm_target_io *alloc_tio(st
clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs);
tio = container_of(clone, struct dm_target_io, clone);
 
+	bio_clone_context(&tio->clone, ci->bio);
+
tio->io = ci->io;
tio->ti = ti;
memset(>info, 0, sizeof(tio->info));
@@ -1469,9 +1471,10 @@ static void _dm_request(struct request_q
	if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
dm_put_live_table(md, srcu_idx);
 
-   if (bio_rw(bio) != READA)
+   if (bio_rw(bio) != READA) {
+   bio_associate_current(bio);
queue_io(md, bio);
-   else
+   } else
bio_io_error(bio);
return;
}
Index: linux-3.8.6-fast/include/linux/bio.h
===================================================================
--- linux-3.8.6-fast.orig/include/linux/bio.h	2013-04-11 17:29:07.000000000 +0200
+++ linux-3.8.6-fast/include/linux/bio.h	2013-04-11 19:34:11.000000000 +0200
 extern void bvec_free_bs(struct bio_set *, struct bio_vec *, unsigned int);
 extern unsigned int bvec_nr_vecs(unsigned short idx);
 
 #ifdef CONFIG_BLK_CGROUP
+/*
+ * bio_associate_current associates the bio with the current process. It should
+ * be called by any block device driver that passes the bio to a different
+ * process to be processed. It must be called in the original process.
+ * bio_associate_current does nothing if the bio is already associated.
+ *
+ * bio_disassociate_task disassociates the bio from the task. It is called
+ * automatically at bio destruction.
+ */
 int bio_associate_current(struct bio *bio);
 void bio_disassociate_task(struct bio *bio);
 #else  /* CONFIG_BLK_CGROUP */
@@ -299,6 +308,24 @@ static inline void bio_disassociate_task
 #endif /* CONFIG_BLK_CGROUP */
