Re: [patch] speed up single bio_vec allocation

2006-12-14 Thread Jens Axboe
On Fri, Dec 08 2006, Chen, Kenneth W wrote:
> > Chen, Kenneth wrote on Wednesday, December 06, 2006 10:20 AM
> > > Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > > This is what I had in mind, in case it wasn't completely clear. Not
> > > tested, other than it compiles. Basically it eliminates the small
> > > bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> > > 12-bytes on 32-bit archs instead and uses the room at the end for the
> > > bio_vec structure.
> > 
> > Yeah, I had a very similar patch queued internally for the large benchmark
> > measurement.  I will post the result as soon as I get it.
> 
> 
> Jens, this improves our db transaction processing benchmark setup by 0.25%.
> The patch tested is (on top of 2.6.19):
> http://marc.theaimsgroup.com/?l=linux-kernel&m=116539972229021&w=2

Ok, well it's not much but if it's significant it's not too bad either
:-)

Some tests I ran locally showed it being _slower_, which is a little
odd. They were basically hammering requests through the block layer with
a null end.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
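The approach described in the quoted patch summary (grow struct bio by the size of one bio_vec and use the trailing room for single-segment requests) can be sketched in plain userspace C. This is a hypothetical simplification, not the actual kernel patch: the field names and `bio_alloc_single()` helper are illustrative stand-ins for the real struct bio/bio_vec layouts and allocation path.

```c
/* Hypothetical userspace sketch (NOT the actual kernel patch): grow the
 * bio by the size of one bio_vec and point bi_io_vec at that trailing
 * room, so a single-segment request needs one allocation instead of a
 * bio plus a separate biovec-1 mempool object. */
#include <stdlib.h>

struct bio_vec {
    void        *bv_page;    /* stand-in for struct page * */
    unsigned int bv_len;
    unsigned int bv_offset;  /* 16 bytes total on a 64-bit arch */
};

struct bio {
    unsigned long   bi_sector;     /* device offset */
    unsigned short  bi_vcnt;       /* segments in use */
    unsigned short  bi_max_vecs;   /* segments available */
    struct bio_vec *bi_io_vec;     /* points at the vec array */
    struct bio_vec  bi_inline_vec; /* the 16 bytes grown at the end */
};

/* Allocate a bio for single-segment I/O: the vec lives inside the bio,
 * so no second allocation is made. */
struct bio *bio_alloc_single(void)
{
    struct bio *bio = calloc(1, sizeof(*bio));

    if (!bio)
        return NULL;
    bio->bi_io_vec = &bio->bi_inline_vec;
    bio->bi_max_vecs = 1;
    return bio;
}
```

Larger requests would still take a bio_vec array from the pools; only the common single-segment case short-circuits into the inline room.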



RE: [patch] speed up single bio_vec allocation

2006-12-08 Thread Chen, Kenneth W
> Chen, Kenneth wrote on Wednesday, December 06, 2006 10:20 AM
> > Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > This is what I had in mind, in case it wasn't completely clear. Not
> > tested, other than it compiles. Basically it eliminates the small
> > bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> > 12-bytes on 32-bit archs instead and uses the room at the end for the
> > bio_vec structure.
> 
> Yeah, I had a very similar patch queued internally for the large benchmark
> measurement.  I will post the result as soon as I get it.


Jens, this improves our db transaction processing benchmark setup by 0.25%.
The patch tested is (on top of 2.6.19):
http://marc.theaimsgroup.com/?l=linux-kernel&m=116539972229021&w=2

- Ken


Re: [patch] speed up single bio_vec allocation

2006-12-08 Thread Jens Axboe
On Thu, Dec 07 2006, Nate Diller wrote:
> On 12/7/06, Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> >Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> >> the current code is straightforward and obviously correct.  you want
> >> to make the alloc/dealloc paths more complex, by special-casing for an
> >> arbitrary limit of "small" I/O, AFAICT.  of *course* you can expect
> >> less overhead when you're doing one large I/O vs. two small ones,
> >> that's the whole reason we have all this code to try to coalesce
> >> contiguous I/O, do readahead, swap page clustering, etc.  we *want*
> >> more complexity if it will get us bigger I/Os.  I don't see why we
> >> want more complexity to reduce the *inherent* penalty of doing smaller
> >> ones.
> >
> >You should check out the latest proposal from Jens Axboe, which treats
> >all biovec sizes the same and stuffs the vec along with struct bio.  I
> >think it is a better approach than my first cut of special-casing the
> >1-segment biovec.  His patch will speed up I/O of all sizes.
> 
> i rather agree with his reservations on that, since we'd be making the
> allocator's job harder by requesting order 1 pages for all allocations
> on x86_64 large I/O patterns.  but it reduces complexity instead of
> increasing it ... can you produce some benchmarks not just for your
> workload but for one that triggers the order 1 case?  biovec-(256)
> transfers are more common than you seem to think, and if the allocator
> can't do it, that forces the bio code to fall back to 2 x biovec-128,
> which, as you indicated above, would show a real penalty.

The question is whether the slab allocator is only doing 2^0 order
allocations for the 256-page bio_vec currently - at 4096 bytes, it is
potentially (I suspect) the worst size it could be.

On the 1 vs many page bio_vec patterns, I agree with Nate. I do see lots
of larger bio_vecs here. > 1 page bio_vec usage is also becoming more
prevalent, not less. So optimizing for a benchmark case that
predominantly uses 1 page bio's is indeed a silly thing.

-- 
Jens Axboe





Re: [patch] speed up single bio_vec allocation

2006-12-07 Thread Andi Kleen
On Friday 08 December 2006 05:23, Chen, Kenneth W wrote:
> Andi Kleen wrote on Thursday, December 07, 2006 6:28 PM
> > "Chen, Kenneth W" <[EMAIL PROTECTED]> writes:
> > > I tried to use cache_line_size() to find out the alignment of struct bio, 
> > > but
> > > stumbled on that it is a runtime function for x86_64.
> > 
> > It's a single global variable access:
> > 
> > #define cache_line_size() (boot_cpu_data.x86_cache_alignment)
> > 
> > Or do you mean it caused cache misses?  boot_cpu_data is cache aligned
> > and practically read only, so there shouldn't be any false sharing at least.
> 
> No, I was looking for a generic constant that describes cache line size.

The same kernel binary runs on CPUs with 
different cache line sizes. For example P4 has 128 bytes, Core2 64 bytes.

However, there is a worst-case alignment used for static alignments,
L1_CACHE_BYTES. It is normally 128 bytes on an x86 kernel, unless the
kernel is compiled specifically for a CPU with a smaller cache line
size.

-Andi


RE: [patch] speed up single bio_vec allocation

2006-12-07 Thread Chen, Kenneth W
Andi Kleen wrote on Thursday, December 07, 2006 6:28 PM
> "Chen, Kenneth W" <[EMAIL PROTECTED]> writes:
> > I tried to use cache_line_size() to find out the alignment of struct bio, 
> > but
> > stumbled on that it is a runtime function for x86_64.
> 
> It's a single global variable access:
> 
> #define cache_line_size() (boot_cpu_data.x86_cache_alignment)
> 
> Or do you mean it caused cache misses?  boot_cpu_data is cache aligned
> and practically read only, so there shouldn't be any false sharing at least.

No, I was looking for a generic constant that describes cache line size.
I needed a constant in order to avoid runtime check and to rely on compiler
to optimize a conditional check away.


Re: [patch] speed up single bio_vec allocation

2006-12-07 Thread Andi Kleen
"Chen, Kenneth W" <[EMAIL PROTECTED]> writes:
> 
> I tried to use cache_line_size() to find out the alignment of struct bio, but
> stumbled on that it is a runtime function for x86_64.

It's a single global variable access:

#define cache_line_size() (boot_cpu_data.x86_cache_alignment)

Or do you mean it caused cache misses?  boot_cpu_data is cache aligned
and practically read only, so there shouldn't be any false sharing at least.

-Andi



Re: [patch] speed up single bio_vec allocation

2006-12-07 Thread Nate Diller

On 12/7/06, Chen, Kenneth W <[EMAIL PROTECTED]> wrote:

Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> the current code is straightforward and obviously correct.  you want
> to make the alloc/dealloc paths more complex, by special-casing for an
> arbitrary limit of "small" I/O, AFAICT.  of *course* you can expect
> less overhead when you're doing one large I/O vs. two small ones,
> that's the whole reason we have all this code to try to coalesce
> contiguous I/O, do readahead, swap page clustering, etc.  we *want*
> more complexity if it will get us bigger I/Os.  I don't see why we
> want more complexity to reduce the *inherent* penalty of doing smaller
> ones.

You should check out the latest proposal from Jens Axboe, which treats
all biovec sizes the same and stuffs the vec along with struct bio.  I
think it is a better approach than my first cut of special-casing the
1-segment biovec.  His patch will speed up I/O of all sizes.


i rather agree with his reservations on that, since we'd be making the
allocator's job harder by requesting order 1 pages for all allocations
on x86_64 large I/O patterns.  but it reduces complexity instead of
increasing it ... can you produce some benchmarks not just for your
workload but for one that triggers the order 1 case?  biovec-(256)
transfers are more common than you seem to think, and if the allocator
can't do it, that forces the bio code to fall back to 2 x biovec-128,
which, as you indicated above, would show a real penalty.

NATE


RE: [patch] speed up single bio_vec allocation

2006-12-07 Thread Chen, Kenneth W
Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> the current code is straightforward and obviously correct.  you want
> to make the alloc/dealloc paths more complex, by special-casing for an
> arbitrary limit of "small" I/O, AFAICT.  of *course* you can expect
> less overhead when you're doing one large I/O vs. two small ones,
> that's the whole reason we have all this code to try to coalesce
> contiguous I/O, do readahead, swap page clustering, etc.  we *want*
> more complexity if it will get us bigger I/Os.  I don't see why we
> want more complexity to reduce the *inherent* penalty of doing smaller
> ones.

You should check out the latest proposal from Jens Axboe, which treats
all biovec sizes the same and stuffs the vec along with struct bio.  I
think it is a better approach than my first cut of special-casing the
1-segment biovec.  His patch will speed up I/O of all sizes.

- Ken


Re: [patch] speed up single bio_vec allocation

2006-12-07 Thread Nate Diller

On 12/7/06, Chen, Kenneth W <[EMAIL PROTECTED]> wrote:

Nate Diller wrote on Thursday, December 07, 2006 11:22 AM
> > > I still can't help but think we can do better than this, and that this
> > > is nothing more than optimizing for a benchmark. For high performance
> > > I/O, you will be doing > 1 page bio's anyway and this patch won't help
> > > you at all. Perhaps we can just kill bio_vec slabs completely, and
> > > create bio slabs instead with differing sizes. So instead of having 1
> > > bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> > > for the bio_vec list at the end. That would always eliminate the extra
> > > allocation, at the cost of blowing the 256-page case into a order 1 page
> > > allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> > > which is something I've always tried to avoid.
> >
> > I took a quick query of biovec-* slab stats on various production machines,
> > majority of the allocation is on 1 and 4 segments, usages falls off quickly
> > on 16 or more.  256 segment biovec allocation is really rare.  I think it
> > makes sense to heavily bias towards smaller biovec allocation and have
> > separate biovec allocation for really large ones.
>
> what file system?  have you tested with more than one?  have you
> tested with file systems that build their own bio's instead of using
> get_block() calls?  have you tested with large files or streaming
> workloads?  how about direct I/O?
>
> i think that a "heavy bias" toward small biovecs is FS and workload
> dependent, and that it's irresponsible to make such unjustified
> changes just to show improvement on your particular benchmark.

No doubt the above data is just a quick estimate from one usage
model; there are tons of other usages in the world.  After all, any
algorithm in the kernel has to be generic and self-tune to the
specific environment.

On very large I/O, the relative overhead of allocating the biovec
decreases because larger I/O needs more code to do setup, more code
to perform I/O completion, more code in the device driver, etc.  So
time spent on one mempool alloc is amortized over the size of the
I/O.  On a smaller I/O, the overhead is more visible, so it makes
sense to me to cut down that relative overhead.

In fact, large I/O already has an unfair advantage.  If you do 1MB
I/O, only one call to mempool is made to get a 256-segment bio;
however, two 512K I/Os make two calls to mempool.  So in some sense,
the current scheme is unfair to small I/O.


so that's a fancy way of saying "no, i didn't do any benchmarks", yes?

the current code is straightforward and obviously correct.  you want
to make the alloc/dealloc paths more complex, by special-casing for an
arbitrary limit of "small" I/O, AFAICT.  of *course* you can expect
less overhead when you're doing one large I/O vs. two small ones,
that's the whole reason we have all this code to try to coalesce
contiguous I/O, do readahead, swap page clustering, etc.  we *want*
more complexity if it will get us bigger I/Os.  I don't see why we
want more complexity to reduce the *inherent* penalty of doing smaller
ones.

btw, i am happy to see that you are working on performance, especially
to the degree you reduce overhead by cleaning things up and removing
cruft and unneeded complexity...

NATE


RE: [patch] speed up single bio_vec allocation

2006-12-07 Thread Chen, Kenneth W
Nate Diller wrote on Thursday, December 07, 2006 11:22 AM
> > > I still can't help but think we can do better than this, and that this
> > > is nothing more than optimizing for a benchmark. For high performance
> > > I/O, you will be doing > 1 page bio's anyway and this patch won't help
> > > you at all. Perhaps we can just kill bio_vec slabs completely, and
> > > create bio slabs instead with differing sizes. So instead of having 1
> > > bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> > > for the bio_vec list at the end. That would always eliminate the extra
> > > allocation, at the cost of blowing the 256-page case into a order 1 page
> > > allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> > > which is something I've always tried to avoid.
> >
> > I took a quick query of biovec-* slab stats on various production machines,
> > majority of the allocation is on 1 and 4 segments, usages falls off quickly
> > on 16 or more.  256 segment biovec allocation is really rare.  I think it
> > makes sense to heavily bias towards smaller biovec allocation and have
> > separate biovec allocation for really large ones.
> 
> what file system?  have you tested with more than one?  have you
> tested with file systems that build their own bio's instead of using
> get_block() calls?  have you tested with large files or streaming
> workloads?  how about direct I/O?
> 
> i think that a "heavy bias" toward small biovecs is FS and workload
> dependent, and that it's irresponsible to make such unjustified
> changes just to show improvement on your particular benchmark.

No doubt the above data is just a quick estimate from one usage
model; there are tons of other usages in the world.  After all, any
algorithm in the kernel has to be generic and self-tune to the
specific environment.

On very large I/O, the relative overhead of allocating the biovec
decreases because larger I/O needs more code to do setup, more code
to perform I/O completion, more code in the device driver, etc.  So
time spent on one mempool alloc is amortized over the size of the
I/O.  On a smaller I/O, the overhead is more visible, so it makes
sense to me to cut down that relative overhead.

In fact, large I/O already has an unfair advantage.  If you do 1MB
I/O, only one call to mempool is made to get a 256-segment bio;
however, two 512K I/Os make two calls to mempool.  So in some sense,
the current scheme is unfair to small I/O.

- Ken


Re: [patch] speed up single bio_vec allocation

2006-12-07 Thread Nate Diller

On 12/6/06, Chen, Kenneth W <[EMAIL PROTECTED]> wrote:

Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > > I will try that too.  I'm a bit touchy about sharing a cache line for
> > > different bio.  But given that there are 200,000 I/O per second we are
> > > currently pushing the kernel, the chances of two cpu working on two
> > > bio that sits in the same cache line are pretty small.
> >
> > Yep I really think so. Besides, it's not like we are repeatedly writing
> > to these objects in the first place.
>
> This is what I had in mind, in case it wasn't completely clear. Not
> tested, other than it compiles. Basically it eliminates the small
> bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> 12-bytes on 32-bit archs instead and uses the room at the end for the
> bio_vec structure.

Yeah, I had a very similar patch queued internally for the large benchmark
measurement.  I will post the result as soon as I get it.


> I still can't help but think we can do better than this, and that this
> is nothing more than optimizing for a benchmark. For high performance
> I/O, you will be doing > 1 page bio's anyway and this patch won't help
> you at all. Perhaps we can just kill bio_vec slabs completely, and
> create bio slabs instead with differing sizes. So instead of having 1
> bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> for the bio_vec list at the end. That would always eliminate the extra
> allocation, at the cost of blowing the 256-page case into a order 1 page
> allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> which is something I've always tried to avoid.

I took a quick query of biovec-* slab stats on various production machines,
majority of the allocation is on 1 and 4 segments, usages falls off quickly
on 16 or more.  256 segment biovec allocation is really rare.  I think it
makes sense to heavily bias towards smaller biovec allocation and have
separate biovec allocation for really large ones.


what file system?  have you tested with more than one?  have you
tested with file systems that build their own bio's instead of using
get_block() calls?  have you tested with large files or streaming
workloads?  how about direct I/O?

i think that a "heavy bias" toward small biovecs is FS and workload
dependent, and that it's irresponsible to make such unjustified
changes just to show improvement on your particular benchmark.

i do however agree with killing SLAB_HWCACHE_ALIGN for biovecs,
pending reasonable regression benchmarks.

NATE


Re: [patch] speed up single bio_vec allocation

2006-12-07 Thread Nate Diller

On 12/6/06, Chen, Kenneth W [EMAIL PROTECTED] wrote:

Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
   I will try that too.  I'm a bit touchy about sharing a cache line for
   different bio.  But given that there are 200,000 I/O per second we are
   currently pushing the kernel, the chances of two cpu working on two
   bio that sits in the same cache line are pretty small.
 
  Yep I really think so. Besides, it's not like we are repeatedly writing
  to these objects in the first place.

 This is what I had in mind, in case it wasn't completely clear. Not
 tested, other than it compiles. Basically it eliminates the small
 bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
 12-bytes on 32-bit archs instead and uses the room at the end for the
 bio_vec structure.

Yeah, I had a very similar patch queued internally for the large benchmark
measurement.  I will post the result as soon as I get it.


 I still can't help but think we can do better than this, and that this
 is nothing more than optimizing for a benchmark. For high performance
 I/O, you will be doing  1 page bio's anyway and this patch wont help
 you at all. Perhaps we can just kill bio_vec slabs completely, and
 create bio slabs instead with differing sizes. So instead of having 1
 bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
 for the bio_vec list at the end. That would always eliminate the extra
 allocation, at the cost of blowing the 256-page case into a order 1 page
 allocation (256*16 + sizeof(*bio)  PAGE_SIZE) for the 4kb 64-bit archs,
 which is something I've always tried to avoid.

I took a quick query of biovec-* slab stats on various production machines,
majority of the allocation is on 1 and 4 segments, usages falls off quickly
on 16 or more.  256 segment biovec allocation is really rare.  I think it
makes sense to heavily bias towards smaller biovec allocation and have
separate biovec allocation for really large ones.


what file system?  have you tested with more than one?  have you
tested with file systems that build their own bio's instead of using
get_block() calls?  have you tested with large files or streaming
workloads?  how about direct I/O?

i think that a heavy bias toward small biovecs is FS and workload
dependent, and that it's irresponsible to make such unjustified
changes just to show improvement on your particular benchmark.

i do however agree with killing SLAB_HWCACHE_ALIGN for biovecs,
pending reasonable regression benchmarks.

NATE
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [patch] speed up single bio_vec allocation

2006-12-07 Thread Chen, Kenneth W
Nate Diller wrote on Thursday, December 07, 2006 11:22 AM
   I still can't help but think we can do better than this, and that this
   is nothing more than optimizing for a benchmark. For high performance
   I/O, you will be doing  1 page bio's anyway and this patch wont help
   you at all. Perhaps we can just kill bio_vec slabs completely, and
   create bio slabs instead with differing sizes. So instead of having 1
   bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
   for the bio_vec list at the end. That would always eliminate the extra
   allocation, at the cost of blowing the 256-page case into a order 1 page
   allocation (256*16 + sizeof(*bio)  PAGE_SIZE) for the 4kb 64-bit archs,
   which is something I've always tried to avoid.
 
  I took a quick query of biovec-* slab stats on various production machines,
  majority of the allocation is on 1 and 4 segments, usages falls off quickly
  on 16 or more.  256 segment biovec allocation is really rare.  I think it
  makes sense to heavily bias towards smaller biovec allocation and have
  separate biovec allocation for really large ones.
 
 what file system?  have you tested with more than one?  have you
 tested with file systems that build their own bio's instead of using
 get_block() calls?  have you tested with large files or streaming
 workloads?  how about direct I/O?
 
 i think that a heavy bias toward small biovecs is FS and workload
 dependent, and that it's irresponsible to make such unjustified
 changes just to show improvement on your particular benchmark.

It is no doubt that the above data is just a quick estimate on one
usage model. There are tons of other usage in the world.  After all,
any algorithm in the kernel has to be generic and self tune to
specific environment.

On very large I/O, the relative overhead in allocating biovec will
decrease because larger I/O needs more code to do setup, more code
to perform I/O completion, more code in the device driver etc. So
time spent on one mempool alloc will amortize over the size of I/O.
On a smaller I/O size, the overhead is more visible and thus makes
sense to me to cut down that relative overhead.

In fact, the large I/O already have unfair advantage.  If you do 1MB
I/O, only 1 call to mempool to get a 256 segment bio.  However if you
do two 512K I/O, two calls to mempool is made.  So in some sense,
current scheme is unfair to small I/O.

- Ken
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] speed up single bio_vec allocation

2006-12-07 Thread Nate Diller

On 12/7/06, Chen, Kenneth W [EMAIL PROTECTED] wrote:

> Nate Diller wrote on Thursday, December 07, 2006 11:22 AM
> > > > I still can't help but think we can do better than this, and that this
> > > > is nothing more than optimizing for a benchmark. For high performance
> > > > I/O, you will be doing > 1 page bio's anyway and this patch wont help
> > > > you at all. Perhaps we can just kill bio_vec slabs completely, and
> > > > create bio slabs instead with differing sizes. So instead of having 1
> > > > bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> > > > for the bio_vec list at the end. That would always eliminate the extra
> > > > allocation, at the cost of blowing the 256-page case into a order 1 page
> > > > allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> > > > which is something I've always tried to avoid.
> > >
> > > I took a quick query of biovec-* slab stats on various production machines;
> > > the majority of allocations are for 1 and 4 segments, and usage falls off
> > > quickly at 16 or more.  256-segment biovec allocations are really rare.  I
> > > think it makes sense to heavily bias towards smaller biovec allocations and
> > > have a separate biovec allocation for really large ones.
> >
> > what file system?  have you tested with more than one?  have you
> > tested with file systems that build their own bio's instead of using
> > get_block() calls?  have you tested with large files or streaming
> > workloads?  how about direct I/O?
> >
> > i think that a heavy bias toward small biovecs is FS and workload
> > dependent, and that it's irresponsible to make such unjustified
> > changes just to show improvement on your particular benchmark.
>
> It is no doubt that the above data is just a quick estimate of one
> usage model.  There are tons of other usages in the world.  After all,
> any algorithm in the kernel has to be generic and self-tune to the
> specific environment.
>
> On very large I/O, the relative overhead of allocating the biovec will
> decrease, because larger I/O needs more code to do setup, more code to
> perform I/O completion, more code in the device driver, etc.  So time
> spent on one mempool alloc will amortize over the size of the I/O.
> On a smaller I/O size, the overhead is more visible, and it thus makes
> sense to me to cut down that relative overhead.
>
> In fact, large I/O already has an unfair advantage.  If you do 1MB of
> I/O, only one call to mempool is needed to get a 256-segment bio.
> However, if you do two 512K I/Os, two mempool calls are made.  So in
> some sense, the current scheme is unfair to small I/O.


so that's a fancy way of saying no, i didn't do any benchmarks, yes?

the current code is straightforward and obviously correct.  you want
to make the alloc/dealloc paths more complex, by special-casing for an
arbitrary limit of small I/O, AFAICT.  of *course* you can expect
less overhead when you're doing one large I/O vs. two small ones,
that's the whole reason we have all this code to try to coalesce
contiguous I/O, do readahead, swap page clustering, etc.  we *want*
more complexity if it will get us bigger I/Os.  I don't see why we
want more complexity to reduce the *inherent* penalty of doing smaller
ones.

btw, i am happy to see that you are working on performance, especially
to the degree you reduce overhead by cleaning things up and removing
cruft and unneeded complexity...

NATE


RE: [patch] speed up single bio_vec allocation

2006-12-07 Thread Chen, Kenneth W
Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> the current code is straightforward and obviously correct.  you want
> to make the alloc/dealloc paths more complex, by special-casing for an
> arbitrary limit of small I/O, AFAICT.  of *course* you can expect
> less overhead when you're doing one large I/O vs. two small ones,
> that's the whole reason we have all this code to try to coalesce
> contiguous I/O, do readahead, swap page clustering, etc.  we *want*
> more complexity if it will get us bigger I/Os.  I don't see why we
> want more complexity to reduce the *inherent* penalty of doing smaller
> ones.

You should check out the latest proposal from Jens Axboe, which treats
all biovec sizes the same and stuffs the biovec in along with struct bio.
I think it is a better approach than my first cut of special-casing the
1-segment biovec.  His patch will speed up I/O of all sizes.

- Ken


Re: [patch] speed up single bio_vec allocation

2006-12-07 Thread Nate Diller

On 12/7/06, Chen, Kenneth W [EMAIL PROTECTED] wrote:

> Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> > the current code is straightforward and obviously correct.  you want
> > to make the alloc/dealloc paths more complex, by special-casing for an
> > arbitrary limit of small I/O, AFAICT.  of *course* you can expect
> > less overhead when you're doing one large I/O vs. two small ones,
> > that's the whole reason we have all this code to try to coalesce
> > contiguous I/O, do readahead, swap page clustering, etc.  we *want*
> > more complexity if it will get us bigger I/Os.  I don't see why we
> > want more complexity to reduce the *inherent* penalty of doing smaller
> > ones.
>
> You should check out the latest proposal from Jens Axboe, which treats
> all biovec sizes the same and stuffs the biovec in along with struct bio.
> I think it is a better approach than my first cut of special-casing the
> 1-segment biovec.  His patch will speed up I/O of all sizes.


i rather agree with his reservations on that, since we'd be making the
allocator's job harder by requesting order 1 pages for all allocations
on x86_64 large I/O patterns.  but it reduces complexity instead of
increasing it ... can you produce some benchmarks not just for your
workload but for one that triggers the order 1 case?  biovec-(256)
transfers are more common than you seem to think, and if the allocator
can't do it, that forces the bio code to fall back to 2 x biovec-128,
which, as you indicated above, would show a real penalty.

NATE


Re: [patch] speed up single bio_vec allocation

2006-12-07 Thread Andi Kleen
Chen, Kenneth W [EMAIL PROTECTED] writes:
 
> I tried to use cache_line_size() to find out the alignment of struct bio, but
> stumbled on that it is a runtime function for x86_64.

It's a single global variable access:

#define cache_line_size() (boot_cpu_data.x86_cache_alignment)

Or do you mean it caused cache misses?  boot_cpu_data is cache aligned
and practically read only, so there shouldn't be any false sharing at least.

-Andi



RE: [patch] speed up single bio_vec allocation

2006-12-07 Thread Chen, Kenneth W
Andi Kleen wrote on Thursday, December 07, 2006 6:28 PM
> Chen, Kenneth W [EMAIL PROTECTED] writes:
> > I tried to use cache_line_size() to find out the alignment of struct bio, but
> > stumbled on that it is a runtime function for x86_64.
>
> It's a single global variable access:
>
> #define cache_line_size() (boot_cpu_data.x86_cache_alignment)
>
> Or do you mean it caused cache misses?  boot_cpu_data is cache aligned
> and practically read only, so there shouldn't be any false sharing at least.

No, I was looking for a generic constant that describes the cache line size.
I needed a constant in order to avoid a runtime check and to rely on the
compiler to optimize a conditional check away.


Re: [patch] speed up single bio_vec allocation

2006-12-07 Thread Andi Kleen
On Friday 08 December 2006 05:23, Chen, Kenneth W wrote:
> Andi Kleen wrote on Thursday, December 07, 2006 6:28 PM
> > Chen, Kenneth W [EMAIL PROTECTED] writes:
> > > I tried to use cache_line_size() to find out the alignment of struct bio, but
> > > stumbled on that it is a runtime function for x86_64.
> >
> > It's a single global variable access:
> >
> > #define cache_line_size() (boot_cpu_data.x86_cache_alignment)
> >
> > Or do you mean it caused cache misses?  boot_cpu_data is cache aligned
> > and practically read only, so there shouldn't be any false sharing at least.
>
> No, I was looking for a generic constant that describes cache line size.

The same kernel binary runs on CPUs with different cache line sizes.
For example, the P4 has 128 bytes and the Core2 64 bytes.

However, there is a worst-case alignment that is used for static
alignments: L1_CACHE_BYTES.  It would normally be 128 bytes on an x86
kernel, unless the kernel is compiled especially for a CPU with a
smaller cache line size.

-Andi
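The worst-case compile-time constant Andi describes can be sketched as follows. This is an illustration, not the kernel's actual macros: 128 is the x86 worst case he cites, and 104 bytes is the sizeof(struct bio) figure Ken quotes elsewhere in the thread. Because everything is a constant expression, the compiler can fold any conditional on the spare space away at compile time.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_L1_CACHE_BYTES 128   /* worst case: P4; Core2 would need only 64 */

/* round `size` up to the worst-case cache line, the way a slab with
 * hardware alignment would size its objects */
#define CACHE_ALIGN(size) \
    (((size) + SKETCH_L1_CACHE_BYTES - 1) & ~(size_t)(SKETCH_L1_CACHE_BYTES - 1))

/* padding left in the aligned slab object after `size` bytes */
#define SPARE_AFTER(size) (CACHE_ALIGN(size) - (size))
```

With a 104-byte bio this leaves 24 spare bytes per object, enough for one 16-byte bio_vec on 64-bit, which is exactly the room Ken's patch wants to reuse.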


RE: [patch] speed up single bio_vec allocation

2006-12-06 Thread Chen, Kenneth W
Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > > I will try that too.  I'm a bit touchy about sharing a cache line for
> > > different bio.  But given that there are 200,000 I/O per second we are
> > > currently pushing the kernel, the chances of two cpu working on two
> > > bio that sits in the same cache line are pretty small.
> > 
> > Yep I really think so. Besides, it's not like we are repeatedly writing
> > to these objects in the first place.
> 
> This is what I had in mind, in case it wasn't completely clear. Not
> tested, other than it compiles. Basically it eliminates the small
> bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> 12-bytes on 32-bit archs instead and uses the room at the end for the
> bio_vec structure.

Yeah, I had a very similar patch queued internally for the large benchmark
measurement.  I will post the result as soon as I get it.


> I still can't help but think we can do better than this, and that this
> is nothing more than optimizing for a benchmark. For high performance
> I/O, you will be doing > 1 page bio's anyway and this patch wont help
> you at all. Perhaps we can just kill bio_vec slabs completely, and
> create bio slabs instead with differing sizes. So instead of having 1
> bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> for the bio_vec list at the end. That would always eliminate the extra
> allocation, at the cost of blowing the 256-page case into a order 1 page
> allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> which is something I've always tried to avoid.

I took a quick query of biovec-* slab stats on various production machines;
the majority of allocations are for 1 and 4 segments, and usage falls off
quickly at 16 or more.  256-segment biovec allocations are really rare.  I
think it makes sense to heavily bias towards smaller biovec allocations and
have a separate biovec allocation for really large ones.

- Ken


Re: [patch] speed up single bio_vec allocation

2006-12-06 Thread Jens Axboe
On Wed, Dec 06 2006, Jens Axboe wrote:
> I still can't help but think we can do better than this, and that this
> is nothing more than optimizing for a benchmark. For high performance
> I/O, you will be doing > 1 page bio's anyway and this patch wont help
> you at all. Perhaps we can just kill bio_vec slabs completely, and
> create bio slabs instead with differing sizes. So instead of having 1
> bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> for the bio_vec list at the end. That would always eliminate the extra
> allocation, at the cost of blowing the 256-page case into a order 1 page
> allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> which is something I've always tried to avoid.

That would look something like this. You'd probably want to re-tweak the
slab sizes now, so that we get the most out of the slab page. If we
accept the 2^1 order allocation for the largest bio, we can make it
larger than 256 pages at no extra cost. Probably around 500 pages would
still fit inside the two pages.
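The arithmetic behind the order-1 worry and the "around 500 pages" estimate checks out under the sizes assumed in this thread (4KB pages, 16-byte bio_vecs on 64-bit, and the 104-byte struct bio Ken mentions; the exact bio size is an assumption here, not a measured value):

```c
#include <assert.h>

#define SK_PAGE_SIZE 4096   /* assumed 4KB pages */
#define SK_BVEC_SIZE 16     /* sizeof(struct bio_vec) on 64-bit */
#define SK_BIO_SIZE  104    /* approximate sizeof(struct bio) */

/* largest embedded biovec that still fits in an order-`order` slab object
 * together with the bio itself */
static int max_embedded_vecs(int order)
{
    return ((SK_PAGE_SIZE << order) - SK_BIO_SIZE) / SK_BVEC_SIZE;
}
```

An order-0 object holds at most 249 vecs, so a 256-segment bio with an embedded vec list must go to order 1; once there, the two pages hold roughly 500 vecs, which is the headroom Jens is pointing out.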

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 08a40f4..b4bf3b3 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -813,7 +813,7 @@ static int crypt_ctr(struct dm_target *t
goto bad4;
}
 
-   cc->bs = bioset_create(MIN_IOS, MIN_IOS, 4);
+   cc->bs = bioset_create(MIN_IOS, 4);
if (!cc->bs) {
ti->error = "Cannot allocate crypt bioset";
goto bad_bs;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index da663d2..581c24a 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -60,7 +60,7 @@ static int resize_pool(unsigned int new_
if (!_io_pool)
return -ENOMEM;
 
-   _bios = bioset_create(16, 16, 4);
+   _bios = bioset_create(16, 4);
if (!_bios) {
mempool_destroy(_io_pool);
_io_pool = NULL;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index fc4f743..98f6768 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -973,7 +973,7 @@ static struct mapped_device *alloc_dev(i
if (!md->tio_pool)
goto bad3;
 
-   md->bs = bioset_create(16, 16, 4);
+   md->bs = bioset_create(16, 4);
if (!md->bs)
goto bad_no_bioset;
 
diff --git a/fs/bio.c b/fs/bio.c
index aa4d09b..452e8f7 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -30,10 +30,6 @@
 
 #define BIO_POOL_SIZE 256
 
-static kmem_cache_t *bio_slab __read_mostly;
-
-#define BIOVEC_NR_POOLS 6
-
 /*
  * a small number of entries is fine, not going to be performance critical.
  * basically we just need to survive
@@ -41,23 +37,23 @@ static kmem_cache_t *bio_slab __read_mos
 #define BIO_SPLIT_ENTRIES 8
 mempool_t *bio_split_pool __read_mostly;
 
-struct biovec_slab {
+struct bio_slab {
int nr_vecs;
-   char *name; 
+   const char *name;
kmem_cache_t *slab;
 };
 
-/*
- * if you change this list, also change bvec_alloc or things will
- * break badly! cannot be bigger than what you can fit into an
- * unsigned short
- */
-
-#define BV(x) { .nr_vecs = x, .name = "biovec-"__stringify(x) }
-static struct biovec_slab bvec_slabs[BIOVEC_NR_POOLS] __read_mostly = {
-   BV(1), BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
+#define BIOSLAB_NR_POOLS 6
+#define BIOSLAB(x) { .nr_vecs = x, .name = "bio-"__stringify(x) }
+static struct bio_slab bio_slabs[BIOSLAB_NR_POOLS] __read_mostly = {
+   BIOSLAB(0),
+   BIOSLAB(4),
+   BIOSLAB(16),
+   BIOSLAB(64),
+   BIOSLAB(128),
+   BIOSLAB(BIO_MAX_PAGES),
 };
-#undef BV
+#undef BIOSLAB
 
 /*
  * bio_set is used to allow other portions of the IO system to
@@ -66,8 +62,7 @@ static struct biovec_slab bvec_slabs[BIO
  * and the bvec_slabs[].
  */
 struct bio_set {
-   mempool_t *bio_pool;
-   mempool_t *bvec_pools[BIOVEC_NR_POOLS];
+   mempool_t *bio_pools[BIOSLAB_NR_POOLS];
 };
 
 /*
@@ -76,45 +71,12 @@ struct bio_set {
  */
 static struct bio_set *fs_bio_set;
 
-static inline struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx, struct bio_set *bs)
-{
-   struct bio_vec *bvl;
-
-   /*
-* see comment near bvec_array define!
-*/
-   switch (nr) {
-   case   1: *idx = 0; break;
-   case   2 ...   4: *idx = 1; break;
-   case   5 ...  16: *idx = 2; break;
-   case  17 ...  64: *idx = 3; break;
-   case  65 ... 128: *idx = 4; break;
-   case 129 ... BIO_MAX_PAGES: *idx = 5; break;
-   default:
-   return NULL;
-   }
-   /*
-* idx now points to the pool we want to allocate from
-*/
-
-   bvl = mempool_alloc(bs->bvec_pools[*idx], gfp_mask);
-   if (bvl) {
-   struct biovec_slab *bp = bvec_slabs + *idx;
-
-   memset(bvl, 0, bp->nr_vecs * sizeof(struct bio_vec));
-   }

Re: [patch] speed up single bio_vec allocation

2006-12-06 Thread Jens Axboe
On Mon, Dec 04 2006, Jens Axboe wrote:
> On Mon, Dec 04 2006, Chen, Kenneth W wrote:
> > > [...]
> > > 
> > > Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless,
> > > I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a
> > > bio is allocated. It doesn't add a lot of overhead even for the case
> > > where we do > 1 page bios, and it gets rid of the dual allocation for
> > > the 1 page bio.
> > 
> > I will try that too.  I'm a bit touchy about sharing a cache line for
> > different bio.  But given that there are 200,000 I/O per second we are
> > currently pushing the kernel, the chances of two cpu working on two
> > bio that sits in the same cache line are pretty small.
> 
> Yep I really think so. Besides, it's not like we are repeatedly writing
> to these objects in the first place.

This is what I had in mind, in case it wasn't completely clear. Not
tested, other than it compiles. Basically it eliminates the small
bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
12-bytes on 32-bit archs instead and uses the room at the end for the
bio_vec structure.

I still can't help but think we can do better than this, and that this
is nothing more than optimizing for a benchmark. For high performance
I/O, you will be doing > 1 page bio's anyway and this patch wont help
you at all. Perhaps we can just kill bio_vec slabs completely, and
create bio slabs instead with differing sizes. So instead of having 1
bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
for the bio_vec list at the end. That would always eliminate the extra
allocation, at the cost of blowing the 256-page case into a order 1 page
allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
which is something I've always tried to avoid.


diff --git a/fs/bio.c b/fs/bio.c
index aa4d09b..94b5e05 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -32,7 +32,7 @@
 
 static kmem_cache_t *bio_slab __read_mostly;
 
-#define BIOVEC_NR_POOLS 6
+#define BIOVEC_NR_POOLS 5
 
 /*
  * a small number of entries is fine, not going to be performance critical.
@@ -55,7 +55,7 @@ struct biovec_slab {
 
 #define BV(x) { .nr_vecs = x, .name = "biovec-"__stringify(x) }
 static struct biovec_slab bvec_slabs[BIOVEC_NR_POOLS] __read_mostly = {
-   BV(1), BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
+   BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
 };
 #undef BV
 
@@ -76,7 +76,8 @@ struct bio_set {
  */
 static struct bio_set *fs_bio_set;
 
-static inline struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx, struct bio_set *bs)
+static struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx,
+struct bio_set *bs)
 {
struct bio_vec *bvl;
 
@@ -84,15 +85,25 @@ static inline struct bio_vec *bvec_alloc
 * see comment near bvec_array define!
 */
switch (nr) {
-   case   1: *idx = 0; break;
-   case   2 ...   4: *idx = 1; break;
-   case   5 ...  16: *idx = 2; break;
-   case  17 ...  64: *idx = 3; break;
-   case  65 ... 128: *idx = 4; break;
-   case 129 ... BIO_MAX_PAGES: *idx = 5; break;
-   default:
-   return NULL;
+   case   2 ...   4:
+   *idx = 0;
+   break;
+   case   5 ...  16:
+   *idx = 1;
+   break;
+   case  17 ...  64:
+   *idx = 2;
+   break;
+   case  65 ... 128:
+   *idx = 3;
+   break;
+   case 129 ... BIO_MAX_PAGES:
+   *idx = 4;
+   break;
+   default:
+   return NULL;
}
+
/*
 * idx now points to the pool we want to allocate from
 */
@@ -109,11 +120,13 @@ static inline struct bio_vec *bvec_alloc
 
 void bio_free(struct bio *bio, struct bio_set *bio_set)
 {
-   const int pool_idx = BIO_POOL_IDX(bio);
+   if ((bio->bi_flags & (1 << BIO_BVL_EMBED)) == 0) {
+   const int pool_idx = BIO_POOL_IDX(bio);
 
-   BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
+   BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
+   mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
+   }
 
-   mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
mempool_free(bio, bio_set->bio_pool);
 }
 
@@ -166,7 +179,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
struct bio_vec *bvl = NULL;
 
bio_init(bio);
-   if (likely(nr_iovecs)) {
+
+   if (nr_iovecs == 1) {
+   bvl = (void *) bio + sizeof(*bio);
+   memset(bvl, 0, sizeof(*bvl));
+   bio->bi_flags |= 1 << BIO_BVL_EMBED;
+   bio->bi_max_vecs = 1;
+   } else if (nr_iovecs) {
unsigned long idx = 0; /* shut up gcc */
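The embedding trick in the (truncated) patch above can be mocked up in userspace C. The struct layouts, field names, and the flag value below are stand-ins for illustration, not the real kernel definitions: the point is simply that one allocation covers the bio plus a trailing bio_vec, a flag bit records the embedded case, and freeing then skips the separate bvec pool.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct mock_bvec { void *bv_page; unsigned int bv_len, bv_offset; };

#define MOCK_BIO_BVL_EMBED 1u   /* stand-in for the BIO_BVL_EMBED flag bit */

struct mock_bio {
    unsigned int bi_flags;
    unsigned short bi_max_vecs;
    struct mock_bvec *bi_io_vec;
};

/* one allocation covers the bio plus one trailing bio_vec */
static struct mock_bio *mock_bio_alloc_one_seg(void)
{
    struct mock_bio *bio = malloc(sizeof(*bio) + sizeof(struct mock_bvec));

    if (!bio)
        return NULL;
    memset(bio, 0, sizeof(*bio));
    bio->bi_io_vec = (struct mock_bvec *)(bio + 1);   /* room at the end */
    memset(bio->bi_io_vec, 0, sizeof(struct mock_bvec));
    bio->bi_flags |= 1u << MOCK_BIO_BVL_EMBED;
    bio->bi_max_vecs = 1;
    return bio;
}

/* the embedded case needs no separate bvec free; if the embed flag were
 * clear, the vec list would go back to its own pool first */
static void mock_bio_free(struct mock_bio *bio)
{
    free(bio);
}
```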
 
 


RE: [patch] speed up single bio_vec allocation

2006-12-04 Thread Chen, Kenneth W
Jens Axboe wrote on Monday, December 04, 2006 12:07 PM
> On Mon, Dec 04 2006, Chen, Kenneth W wrote:
> > On 64-bit arch like x86_64, struct bio is 104 byte.  Since bio slab is
> > created with SLAB_HWCACHE_ALIGN flag, there are usually spare memory
> > available at the end of bio.  I think we can utilize that memory for
> > bio_vec allocation.  The purpose is not so much on saving memory consumption
> > for bio_vec, instead, I'm attempting to optimize away a call to 
> > bvec_alloc_bs.
> > 
> > So here is a patch to do just that for 1 segment bio_vec (we currently only
> > have space for 1 on 2.6.19).  And the detection whether there are spare 
> > space
> > available is dynamically calculated at compile time.  If there are no space
> > available, there will be no run time cost at all because gcc simply optimize
> > away all the code added in this patch.  If there are space available, the 
> > only
> > run time check is to see what the size of iovec is and we do appropriate
> > assignment to bio->bi_io_vec etc.  The cost is minimal and we gain a whole
> > lot back from not calling bvec_alloc_bs() function.
> > 
> > I tried to use cache_line_size() to find out the alignment of struct bio, 
> > but
> > stumbled on that it is a runtime function for x86_64. So instead I made bio
> > to hint to the slab allocator to align on 32 byte (slab will use the larger
> > value of hw cache line and caller hints of "align").  I think it is a sane
> > number for majority of the CPUs out in the world.
> 
> Any benchmarks for this one?

About 0.2% on a database transaction processing benchmark.  It was done a
while back on top of a major Linux vendor kernel. I will retest it for 2.6.19.


> [...]
> 
> Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless,
> I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a
> bio is allocated. It doesn't add a lot of overhead even for the case
> where we do > 1 page bios, and it gets rid of the dual allocation for
> the 1 page bio.

I will try that too.  I'm a bit touchy about different bios sharing a cache
line.  But given the 200,000 I/Os per second we are currently pushing through
the kernel, the chances of two CPUs working on two bios that sit in the same
cache line are pretty small.

- Ken


Re: [patch] speed up single bio_vec allocation

2006-12-04 Thread Jens Axboe
On Mon, Dec 04 2006, Chen, Kenneth W wrote:
> > [...]
> > 
> > Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless,
> > I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a
> > bio is allocated. It doesn't add a lot of overhead even for the case
> > where we do > 1 page bios, and it gets rid of the dual allocation for
> > the 1 page bio.
> 
> I will try that too.  I'm a bit touchy about different bios sharing a
> cache line.  But given the 200,000 I/Os per second we are currently
> pushing through the kernel, the chances of two CPUs working on two bios
> that sit in the same cache line are pretty small.

Yep I really think so. Besides, it's not like we are repeatedly writing
to these objects in the first place.

-- 
Jens Axboe



Re: [patch] speed up single bio_vec allocation

2006-12-04 Thread Jens Axboe
On Mon, Dec 04 2006, Chen, Kenneth W wrote:
> On 64-bit archs like x86_64, struct bio is 104 bytes.  Since the bio slab is
> created with the SLAB_HWCACHE_ALIGN flag, there is usually spare memory
> available at the end of the bio.  I think we can utilize that memory for
> bio_vec allocation.  The purpose is not so much saving memory consumption
> for bio_vec; instead, I'm attempting to optimize away a call to bvec_alloc_bs.
>
> So here is a patch to do just that for a 1 segment bio_vec (we currently only
> have space for 1 on 2.6.19).  The detection of whether there is spare space
> available is calculated at compile time.  If there is no space available,
> there will be no run time cost at all because gcc simply optimizes away all
> the code added in this patch.  If there is space available, the only run
> time check is to see what the size of the iovec is and we do the appropriate
> assignment to bio->bi_io_vec etc.  The cost is minimal and we gain a whole
> lot back from not calling the bvec_alloc_bs() function.
>
> I tried to use cache_line_size() to find out the alignment of struct bio,
> but stumbled on the fact that it is a runtime function for x86_64. So
> instead I made bio hint to the slab allocator to align on 32 bytes (slab
> will use the larger of the hw cache line size and the caller's "align"
> hint).  I think it is a sane number for the majority of CPUs out in the
> world.

Any benchmarks for this one?

> --- ./fs/bio.c.orig   2006-12-03 17:20:36.0 -0800
> +++ ./fs/bio.c2006-12-03 21:29:20.0 -0800
> @@ -29,11 +29,14 @@
>  #include <scsi/sg.h>   /* for struct sg_iovec */
>  
>  #define BIO_POOL_SIZE 256
> -
> +#define BIO_ALIGN 32 /* minimal bio structure alignment */
>  static kmem_cache_t *bio_slab __read_mostly;
>  
>  #define BIOVEC_NR_POOLS 6
>  
> +#define BIOVEC_FIT_INSIDE_BIO_CACHE_LINE \
> + (ALIGN(sizeof(struct bio), BIO_ALIGN) ==\
> +  ALIGN(sizeof(struct bio) + sizeof(struct bio_vec), BIO_ALIGN))
>  /*
>   * a small number of entries is fine, not going to be performance critical.
>   * basically we just need to survive
> @@ -113,7 +116,8 @@ void bio_free(struct bio *bio, struct bi
>  
>   BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
>  
> - mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
> + if (!BIOVEC_FIT_INSIDE_BIO_CACHE_LINE || pool_idx)
> + mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);

I think you should use a flag for this instead.

>   mempool_free(bio, bio_set->bio_pool);
>  }
>  
> @@ -166,7 +170,15 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
>   struct bio_vec *bvl = NULL;
>  
>   bio_init(bio);
> - if (likely(nr_iovecs)) {
> +
> + /*
> +  * if bio_vec can fit into remaining cache line of struct
> +  * bio, go ahead use it and skip mempool allocation.
> +  */
> + if (nr_iovecs == 1 && BIOVEC_FIT_INSIDE_BIO_CACHE_LINE) {
> + bvl = (struct bio_vec*) (bio + 1);
> + bio->bi_max_vecs = 1;
> + } else if (likely(nr_iovecs)) {
>   unsigned long idx = 0; /* shut up gcc */

Either move this logic to bvec_alloc_bs(), or remember to clear bvl as
is done there.

> @@ -1204,7 +1216,7 @@ static void __init biovec_init_slabs(voi
>   struct biovec_slab *bvs = bvec_slabs + i;
>  
>   size = bvs->nr_vecs * sizeof(struct bio_vec);
> - bvs->slab = kmem_cache_create(bvs->name, size, 0,
> + bvs->slab = kmem_cache_create(bvs->name, size, BIO_ALIGN,
>  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
>   }
>  }

Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless,
I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a
bio is allocated. It doesn't add a lot of overhead even for the case
where we do > 1 page bios, and it gets rid of the dual allocation for
the 1 page bio.
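The single-allocation idea above can be sketched in userspace C. The structures here are simplified stand-ins for the real kernel types, and bio_alloc_inline is a hypothetical name, not an existing kernel function:

```c
/* Userspace sketch of allocating a bio and its bio_vec array in one
 * go, with bi_io_vec pointing at the tail of the same allocation.
 * Mock structures only; the real struct bio has many more fields. */
#include <assert.h>
#include <stdlib.h>

struct bio_vec {
	void *bv_page;
	unsigned int bv_len, bv_offset;
};

struct bio {
	unsigned short  bi_max_vecs;
	struct bio_vec *bi_io_vec;
	/* ... */
};

static struct bio *bio_alloc_inline(unsigned int nr_iovecs)
{
	/* one allocation covers the bio and its vec array */
	struct bio *bio = malloc(sizeof(*bio) +
				 nr_iovecs * sizeof(struct bio_vec));
	if (!bio)
		return NULL;
	bio->bi_max_vecs = nr_iovecs;
	bio->bi_io_vec = (struct bio_vec *)(bio + 1);
	return bio;
}
```

This trades the second mempool trip for a slightly larger single object, which is the overhead/locality trade-off the thread is weighing.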

-- 
Jens Axboe



[patch] speed up single bio_vec allocation

2006-12-04 Thread Chen, Kenneth W
On 64-bit archs like x86_64, struct bio is 104 bytes.  Since the bio slab is
created with the SLAB_HWCACHE_ALIGN flag, there is usually spare memory
available at the end of the bio.  I think we can utilize that memory for
bio_vec allocation.  The purpose is not so much saving memory consumption
for bio_vec; instead, I'm attempting to optimize away a call to bvec_alloc_bs.

So here is a patch to do just that for a 1 segment bio_vec (we currently only
have space for 1 on 2.6.19).  The detection of whether there is spare space
available is calculated at compile time.  If there is no space available,
there will be no run time cost at all because gcc simply optimizes away all
the code added in this patch.  If there is space available, the only run
time check is to see what the size of the iovec is and we do the appropriate
assignment to bio->bi_io_vec etc.  The cost is minimal and we gain a whole
lot back from not calling the bvec_alloc_bs() function.

I tried to use cache_line_size() to find out the alignment of struct bio,
but stumbled on the fact that it is a runtime function for x86_64. So
instead I made bio hint to the slab allocator to align on 32 bytes (slab
will use the larger of the hw cache line size and the caller's "align"
hint).  I think it is a sane number for the majority of CPUs out in the
world.


Signed-off-by: Ken Chen <[EMAIL PROTECTED]>


--- ./fs/bio.c.orig 2006-12-03 17:20:36.0 -0800
+++ ./fs/bio.c  2006-12-03 21:29:20.0 -0800
@@ -29,11 +29,14 @@
 #include <scsi/sg.h>  /* for struct sg_iovec */
 
 #define BIO_POOL_SIZE 256
-
+#define BIO_ALIGN 32   /* minimal bio structure alignment */
 static kmem_cache_t *bio_slab __read_mostly;
 
 #define BIOVEC_NR_POOLS 6
 
+#define BIOVEC_FIT_INSIDE_BIO_CACHE_LINE   \
+   (ALIGN(sizeof(struct bio), BIO_ALIGN) ==\
+ALIGN(sizeof(struct bio) + sizeof(struct bio_vec), BIO_ALIGN))
 /*
  * a small number of entries is fine, not going to be performance critical.
  * basically we just need to survive
@@ -113,7 +116,8 @@ void bio_free(struct bio *bio, struct bi
 
BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
 
-   mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
+   if (!BIOVEC_FIT_INSIDE_BIO_CACHE_LINE || pool_idx)
+   mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
mempool_free(bio, bio_set->bio_pool);
 }
 
@@ -166,7 +170,15 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
struct bio_vec *bvl = NULL;
 
bio_init(bio);
-   if (likely(nr_iovecs)) {
+
+   /*
+* if bio_vec can fit into remaining cache line of struct
+* bio, go ahead use it and skip mempool allocation.
+*/
+   if (nr_iovecs == 1 && BIOVEC_FIT_INSIDE_BIO_CACHE_LINE) {
+   bvl = (struct bio_vec*) (bio + 1);
+   bio->bi_max_vecs = 1;
+   } else if (likely(nr_iovecs)) {
unsigned long idx = 0; /* shut up gcc */
 
bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, , bs);
@@ -1204,7 +1216,7 @@ static void __init biovec_init_slabs(voi
struct biovec_slab *bvs = bvec_slabs + i;
 
size = bvs->nr_vecs * sizeof(struct bio_vec);
-   bvs->slab = kmem_cache_create(bvs->name, size, 0,
+   bvs->slab = kmem_cache_create(bvs->name, size, BIO_ALIGN,
 SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
}
 }
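The patch's compile-time fit test can be illustrated with a small userspace mock. The ALIGN macro and the struct sizes below are stand-ins assuming the 104-byte bio and 32-byte alignment from the description, so the whole check reduces to a constant expression the compiler can fold away:

```c
/* Userspace mock of BIOVEC_FIT_INSIDE_BIO_CACHE_LINE: with a
 * 104-byte bio rounded up to 32-byte alignment (-> 128), there are
 * 24 spare bytes, enough for one 16-byte bio_vec, so the expression
 * is a compile-time constant 1 and the mempool path can be elided. */
#include <assert.h>
#include <stddef.h>

#define ALIGN(x, a)  (((x) + (a) - 1) & ~((size_t)(a) - 1))
#define BIO_ALIGN    32

struct mock_bio_vec { void *page; unsigned int len, offset; };
struct mock_bio { char payload[104]; }; /* stand-in for struct bio */

#define BIOVEC_FIT_INSIDE_BIO_CACHE_LINE \
	(ALIGN(sizeof(struct mock_bio), BIO_ALIGN) == \
	 ALIGN(sizeof(struct mock_bio) + sizeof(struct mock_bio_vec), \
	       BIO_ALIGN))
```

Both 104 and 104 + 16 round up to 128, so the check holds for these assumed sizes; if struct bio grew past the slack, the macro would evaluate to 0 and the added branches would vanish.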

