Re: [patch] speed up single bio_vec allocation
On Fri, Dec 08 2006, Chen, Kenneth W wrote:
> > Chen, Kenneth wrote on Wednesday, December 06, 2006 10:20 AM
> > > Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > > This is what I had in mind, in case it wasn't completely clear. Not
> > > tested, other than it compiles. Basically it eliminates the small
> > > bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> > > 12-bytes on 32-bit archs instead and uses the room at the end for the
> > > bio_vec structure.
> >
> > Yeah, I had a very similar patch queued internally for the large benchmark
> > measurement. I will post the result as soon as I get it.
>
> Jens, this improves 0.25% on our db transaction processing benchmark setup.
> The patch tested is (on top of 2.6.19):
> http://marc.theaimsgroup.com/?l=linux-kernel&m=116539972229021&w=2

Ok, well it's not much, but if it's significant it's not too bad either :-)

Some tests I ran locally showed it being _slower_, which is a little odd.
They were basically hammering requests through the block layer with a null
end.

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RE: [patch] speed up single bio_vec allocation
> Chen, Kenneth wrote on Wednesday, December 06, 2006 10:20 AM
> > Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > This is what I had in mind, in case it wasn't completely clear. Not
> > tested, other than it compiles. Basically it eliminates the small
> > bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> > 12-bytes on 32-bit archs instead and uses the room at the end for the
> > bio_vec structure.
>
> Yeah, I had a very similar patch queued internally for the large benchmark
> measurement. I will post the result as soon as I get it.

Jens, this improves 0.25% on our db transaction processing benchmark setup.
The patch tested is (on top of 2.6.19):
http://marc.theaimsgroup.com/?l=linux-kernel&m=116539972229021&w=2

- Ken
Re: [patch] speed up single bio_vec allocation
On Thu, Dec 07 2006, Nate Diller wrote:
> On 12/7/06, Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> > Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> > > the current code is straightforward and obviously correct. you want
> > > to make the alloc/dealloc paths more complex, by special-casing for an
> > > arbitrary limit of "small" I/O, AFAICT. of *course* you can expect
> > > less overhead when you're doing one large I/O vs. two small ones,
> > > that's the whole reason we have all this code to try to coalesce
> > > contiguous I/O, do readahead, swap page clustering, etc. we *want*
> > > more complexity if it will get us bigger I/Os. I don't see why we
> > > want more complexity to reduce the *inherent* penalty of doing smaller
> > > ones.
> >
> > You should check out the latest proposal from Jens Axboe which treats
> > all biovec size the same and stuff it along with struct bio. I think
> > it is a better approach than my first cut of special casing 1 segment
> > biovec. His patch will speed up all sized I/O.
>
> i rather agree with his reservations on that, since we'd be making the
> allocator's job harder by requesting order 1 pages for all allocations
> on x86_64 large I/O patterns. but it reduces complexity instead of
> increasing it ... can you produce some benchmarks not just for your
> workload but for one that triggers the order 1 case? biovec-(256)
> transfers are more common than you seem to think, and if the allocator
> can't do it, that forces the bio code to fall back to 2 x biovec-128,
> which, as you indicated above, would show a real penalty.

The question is if the slab allocator is only doing 2^0 order allocations
for the 256-page bio_vec currently - it's at 4096 bytes, so potentially (I
suspect) the worst size it could be.

On the 1 vs many page bio_vec patterns, I agree with Nate. I do see lots of
larger bio_vecs here.

> 1 page bio_vec usage is also becoming more prevalent, not less.

So optimizing for a benchmark case that predominately uses 1 page bio's is
indeed a silly thing.

--
Jens Axboe
Re: [patch] speed up single bio_vec allocation
On Friday 08 December 2006 05:23, Chen, Kenneth W wrote:
> Andi Kleen wrote on Thursday, December 07, 2006 6:28 PM
> > "Chen, Kenneth W" <[EMAIL PROTECTED]> writes:
> > > I tried to use cache_line_size() to find out the alignment of struct
> > > bio, but stumbled on that it is a runtime function for x86_64.
> >
> > It's a single global variable access:
> >
> > #define cache_line_size() (boot_cpu_data.x86_cache_alignment)
> >
> > Or do you mean it caused cache misses? boot_cpu_data is cache aligned
> > and practically read only, so there shouldn't be any false sharing at
> > least.
>
> No, I was looking for a generic constant that describes cache line size.

The same kernel binary runs on CPUs with different cache line sizes. For
example P4 has 128 bytes, Core2 64 bytes.

However there is a worst-case alignment that is used for static alignments,
which is L1_CACHE_BYTES. It would normally be 128 bytes on an x86 kernel,
unless it is especially compiled for a CPU with a smaller cache line size.

-Andi
RE: [patch] speed up single bio_vec allocation
Andi Kleen wrote on Thursday, December 07, 2006 6:28 PM
> "Chen, Kenneth W" <[EMAIL PROTECTED]> writes:
> > I tried to use cache_line_size() to find out the alignment of struct
> > bio, but stumbled on that it is a runtime function for x86_64.
>
> It's a single global variable access:
>
> #define cache_line_size() (boot_cpu_data.x86_cache_alignment)
>
> Or do you mean it caused cache misses? boot_cpu_data is cache aligned
> and practically read only, so there shouldn't be any false sharing at
> least.

No, I was looking for a generic constant that describes cache line size.
I needed a constant in order to avoid a runtime check and to rely on the
compiler to optimize a conditional check away.
Re: [patch] speed up single bio_vec allocation
"Chen, Kenneth W" <[EMAIL PROTECTED]> writes:
> I tried to use cache_line_size() to find out the alignment of struct bio,
> but stumbled on that it is a runtime function for x86_64.

It's a single global variable access:

#define cache_line_size() (boot_cpu_data.x86_cache_alignment)

Or do you mean it caused cache misses? boot_cpu_data is cache aligned and
practically read only, so there shouldn't be any false sharing at least.

-Andi
Re: [patch] speed up single bio_vec allocation
On 12/7/06, Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> > the current code is straightforward and obviously correct. you want
> > to make the alloc/dealloc paths more complex, by special-casing for an
> > arbitrary limit of "small" I/O, AFAICT. of *course* you can expect
> > less overhead when you're doing one large I/O vs. two small ones,
> > that's the whole reason we have all this code to try to coalesce
> > contiguous I/O, do readahead, swap page clustering, etc. we *want*
> > more complexity if it will get us bigger I/Os. I don't see why we
> > want more complexity to reduce the *inherent* penalty of doing smaller
> > ones.
>
> You should check out the latest proposal from Jens Axboe which treats
> all biovec size the same and stuff it along with struct bio. I think
> it is a better approach than my first cut of special casing 1 segment
> biovec. His patch will speed up all sized I/O.

i rather agree with his reservations on that, since we'd be making the
allocator's job harder by requesting order 1 pages for all allocations
on x86_64 large I/O patterns. but it reduces complexity instead of
increasing it ... can you produce some benchmarks not just for your
workload but for one that triggers the order 1 case? biovec-(256)
transfers are more common than you seem to think, and if the allocator
can't do it, that forces the bio code to fall back to 2 x biovec-128,
which, as you indicated above, would show a real penalty.

NATE
RE: [patch] speed up single bio_vec allocation
Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> the current code is straightforward and obviously correct. you want
> to make the alloc/dealloc paths more complex, by special-casing for an
> arbitrary limit of "small" I/O, AFAICT. of *course* you can expect
> less overhead when you're doing one large I/O vs. two small ones,
> that's the whole reason we have all this code to try to coalesce
> contiguous I/O, do readahead, swap page clustering, etc. we *want*
> more complexity if it will get us bigger I/Os. I don't see why we
> want more complexity to reduce the *inherent* penalty of doing smaller
> ones.

You should check out the latest proposal from Jens Axboe which treats
all biovec size the same and stuff it along with struct bio. I think
it is a better approach than my first cut of special casing 1 segment
biovec. His patch will speed up all sized I/O.

- Ken
Re: [patch] speed up single bio_vec allocation
On 12/7/06, Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> Nate Diller wrote on Thursday, December 07, 2006 11:22 AM
> > > > I still can't help but think we can do better than this, and that
> > > > this is nothing more than optimizing for a benchmark. For high
> > > > performance I/O, you will be doing > 1 page bio's anyway and this
> > > > patch wont help you at all. Perhaps we can just kill bio_vec slabs
> > > > completely, and create bio slabs instead with differing sizes. So
> > > > instead of having 1 bio slab and 5 bio_vec slabs, change that to 5
> > > > bio slabs that leave room for the bio_vec list at the end. That
> > > > would always eliminate the extra allocation, at the cost of blowing
> > > > the 256-page case into a order 1 page allocation (256*16 +
> > > > sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs, which is
> > > > something I've always tried to avoid.
> > >
> > > I took a quick query of biovec-* slab stats on various production
> > > machines, majority of the allocation is on 1 and 4 segments, usages
> > > falls off quickly on 16 or more. 256 segment biovec allocation is
> > > really rare. I think it makes sense to heavily bias towards smaller
> > > biovec allocation and have separate biovec allocation for really
> > > large ones.
> >
> > what file system? have you tested with more than one? have you
> > tested with file systems that build their own bio's instead of using
> > get_block() calls? have you tested with large files or streaming
> > workloads? how about direct I/O?
> >
> > i think that a "heavy bias" toward small biovecs is FS and workload
> > dependent, and that it's irresponsible to make such unjustified
> > changes just to show improvement on your particular benchmark.
>
> It is no doubt that the above data is just a quick estimate on one usage
> model. There are tons of other usage in the world. After all, any
> algorithm in the kernel has to be generic and self tune to specific
> environment.
>
> On very large I/O, the relative overhead in allocating biovec will
> decrease because larger I/O needs more code to do setup, more code to
> perform I/O completion, more code in the device driver etc. So time
> spent on one mempool alloc will amortize over the size of I/O. On a
> smaller I/O size, the overhead is more visible and thus makes sense to
> me to cut down that relative overhead.
>
> In fact, the large I/O already have unfair advantage. If you do 1MB
> I/O, only 1 call to mempool to get a 256 segment bio. However if you do
> two 512K I/O, two calls to mempool is made. So in some sense, current
> scheme is unfair to small I/O.

so that's a fancy way of saying "no, i didn't do any benchmarks", yes?

the current code is straightforward and obviously correct. you want
to make the alloc/dealloc paths more complex, by special-casing for an
arbitrary limit of "small" I/O, AFAICT. of *course* you can expect
less overhead when you're doing one large I/O vs. two small ones,
that's the whole reason we have all this code to try to coalesce
contiguous I/O, do readahead, swap page clustering, etc. we *want*
more complexity if it will get us bigger I/Os. I don't see why we
want more complexity to reduce the *inherent* penalty of doing smaller
ones.

btw, i am happy to see that you are working on performance, especially
to the degree you reduce overhead by cleaning things up and removing
cruft and unneeded complexity...

NATE
RE: [patch] speed up single bio_vec allocation
Nate Diller wrote on Thursday, December 07, 2006 11:22 AM
> > > I still can't help but think we can do better than this, and that
> > > this is nothing more than optimizing for a benchmark. For high
> > > performance I/O, you will be doing > 1 page bio's anyway and this
> > > patch wont help you at all. Perhaps we can just kill bio_vec slabs
> > > completely, and create bio slabs instead with differing sizes. So
> > > instead of having 1 bio slab and 5 bio_vec slabs, change that to 5
> > > bio slabs that leave room for the bio_vec list at the end. That
> > > would always eliminate the extra allocation, at the cost of blowing
> > > the 256-page case into a order 1 page allocation (256*16 +
> > > sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs, which is
> > > something I've always tried to avoid.
> >
> > I took a quick query of biovec-* slab stats on various production
> > machines, majority of the allocation is on 1 and 4 segments, usages
> > falls off quickly on 16 or more. 256 segment biovec allocation is
> > really rare. I think it makes sense to heavily bias towards smaller
> > biovec allocation and have separate biovec allocation for really
> > large ones.
>
> what file system? have you tested with more than one? have you
> tested with file systems that build their own bio's instead of using
> get_block() calls? have you tested with large files or streaming
> workloads? how about direct I/O?
>
> i think that a "heavy bias" toward small biovecs is FS and workload
> dependent, and that it's irresponsible to make such unjustified
> changes just to show improvement on your particular benchmark.

It is no doubt that the above data is just a quick estimate on one usage
model. There are tons of other usage in the world. After all, any
algorithm in the kernel has to be generic and self tune to specific
environment.

On very large I/O, the relative overhead in allocating biovec will
decrease because larger I/O needs more code to do setup, more code to
perform I/O completion, more code in the device driver etc. So time
spent on one mempool alloc will amortize over the size of I/O. On a
smaller I/O size, the overhead is more visible and thus makes sense to
me to cut down that relative overhead.

In fact, the large I/O already have unfair advantage. If you do 1MB I/O,
only 1 call to mempool to get a 256 segment bio. However if you do two
512K I/O, two calls to mempool is made. So in some sense, current scheme
is unfair to small I/O.

- Ken
Re: [patch] speed up single bio_vec allocation
On 12/6/06, Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > > I will try that too. I'm a bit touchy about sharing a cache line for
> > > different bio. But given that there are 200,000 I/O per second we are
> > > currently pushing the kernel, the chances of two cpu working on two
> > > bio that sits in the same cache line are pretty small.
> >
> > Yep I really think so. Besides, it's not like we are repeatedly writing
> > to these objects in the first place.
> >
> > This is what I had in mind, in case it wasn't completely clear. Not
> > tested, other than it compiles. Basically it eliminates the small
> > bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> > 12-bytes on 32-bit archs instead and uses the room at the end for the
> > bio_vec structure.
>
> Yeah, I had a very similar patch queued internally for the large benchmark
> measurement. I will post the result as soon as I get it.
>
> > I still can't help but think we can do better than this, and that this
> > is nothing more than optimizing for a benchmark. For high performance
> > I/O, you will be doing > 1 page bio's anyway and this patch wont help
> > you at all. Perhaps we can just kill bio_vec slabs completely, and
> > create bio slabs instead with differing sizes. So instead of having 1
> > bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> > for the bio_vec list at the end. That would always eliminate the extra
> > allocation, at the cost of blowing the 256-page case into a order 1 page
> > allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> > which is something I've always tried to avoid.
>
> I took a quick query of biovec-* slab stats on various production
> machines, majority of the allocation is on 1 and 4 segments, usages
> falls off quickly on 16 or more. 256 segment biovec allocation is
> really rare. I think it makes sense to heavily bias towards smaller
> biovec allocation and have separate biovec allocation for really large
> ones.

what file system? have you tested with more than one? have you
tested with file systems that build their own bio's instead of using
get_block() calls? have you tested with large files or streaming
workloads? how about direct I/O?

i think that a "heavy bias" toward small biovecs is FS and workload
dependent, and that it's irresponsible to make such unjustified
changes just to show improvement on your particular benchmark.

i do however agree with killing SLAB_HWCACHE_ALIGN for biovecs, pending
reasonable regression benchmarks.

NATE
RE: [patch] speed up single bio_vec allocation
Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> the current code is straightforward and obviously correct. you want to
> make the alloc/dealloc paths more complex, by special-casing for an
> arbitrary limit of small I/O, AFAICT. of *course* you can expect less
> overhead when you're doing one large I/O vs. two small ones, that's
> the whole reason we have all this code to try to coalesce contiguous
> I/O, do readahead, swap page clustering, etc. we *want* more
> complexity if it will get us bigger I/Os. I don't see why we want more
> complexity to reduce the *inherent* penalty of doing smaller ones.

You should check out the latest proposal from Jens Axboe, which treats
all biovec sizes the same and stuffs the biovec along with struct bio.
I think it is a better approach than my first cut of special-casing the
1-segment biovec. His patch will speed up I/O of all sizes.

- Ken
Re: [patch] speed up single bio_vec allocation
On 12/7/06, Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> > the current code is straightforward and obviously correct. you want
> > to make the alloc/dealloc paths more complex, by special-casing for
> > an arbitrary limit of small I/O, AFAICT. of *course* you can expect
> > less overhead when you're doing one large I/O vs. two small ones,
> > that's the whole reason we have all this code to try to coalesce
> > contiguous I/O, do readahead, swap page clustering, etc. we *want*
> > more complexity if it will get us bigger I/Os. I don't see why we
> > want more complexity to reduce the *inherent* penalty of doing
> > smaller ones.
>
> You should check out the latest proposal from Jens Axboe, which treats
> all biovec sizes the same and stuffs the biovec along with struct bio.
> I think it is a better approach than my first cut of special-casing
> the 1-segment biovec. His patch will speed up I/O of all sizes.

i rather agree with his reservations on that, since we'd be making the
allocator's job harder by requesting order 1 pages for all allocations
on x86_64 large I/O patterns. but it reduces complexity instead of
increasing it ... can you produce some benchmarks, not just for your
workload but for one that triggers the order 1 case? biovec-(256)
transfers are more common than you seem to think, and if the allocator
can't do it, that forces the bio code to fall back to 2 x biovec-128,
which, as you indicated above, would show a real penalty.

NATE
Re: [patch] speed up single bio_vec allocation
Chen, Kenneth W <[EMAIL PROTECTED]> writes:
> I tried to use cache_line_size() to find out the alignment of struct
> bio, but stumbled on that it is a runtime function for x86_64.

It's a single global variable access:

#define cache_line_size() (boot_cpu_data.x86_cache_alignment)

Or do you mean it caused cache misses? boot_cpu_data is cache aligned
and practically read only, so there shouldn't be any false sharing at
least.

-Andi
RE: [patch] speed up single bio_vec allocation
Andi Kleen wrote on Thursday, December 07, 2006 6:28 PM
> Chen, Kenneth W <[EMAIL PROTECTED]> writes:
> > I tried to use cache_line_size() to find out the alignment of struct
> > bio, but stumbled on that it is a runtime function for x86_64.
>
> It's a single global variable access:
>
> #define cache_line_size() (boot_cpu_data.x86_cache_alignment)
>
> Or do you mean it caused cache misses? boot_cpu_data is cache aligned
> and practically read only, so there shouldn't be any false sharing at
> least.

No, I was looking for a generic constant that describes the cache line
size. I needed a constant in order to avoid a runtime check and to rely
on the compiler to optimize a conditional check away.
Re: [patch] speed up single bio_vec allocation
On Friday 08 December 2006 05:23, Chen, Kenneth W wrote:
> Andi Kleen wrote on Thursday, December 07, 2006 6:28 PM
> > Chen, Kenneth W <[EMAIL PROTECTED]> writes:
> > > I tried to use cache_line_size() to find out the alignment of
> > > struct bio, but stumbled on that it is a runtime function for
> > > x86_64.
> >
> > It's a single global variable access:
> >
> > #define cache_line_size() (boot_cpu_data.x86_cache_alignment)
> >
> > Or do you mean it caused cache misses? boot_cpu_data is cache
> > aligned and practically read only, so there shouldn't be any false
> > sharing at least.
>
> No, I was looking for a generic constant that describes the cache line
> size.

The same kernel binary runs on CPUs with different cache line sizes. For
example, P4 has 128 bytes, Core2 64 bytes. However, there is a
worst-case alignment that is used for static alignments, which is
L1_CACHE_BYTES. It would normally be 128 bytes on an x86 kernel, unless
it is especially compiled for a CPU with a smaller cache line size.

-Andi
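Andi's point can be put in numbers: a generic kernel must pick one compile-time alignment (the worst case, L1_CACHE_BYTES = 128 for P4), and padding to that worst case stays correct on a CPU with smaller lines because every smaller power-of-two line size divides it. This is a back-of-envelope model, not kernel code; the 104-byte bio size is the x86_64 figure quoted in this thread.

```python
# Sketch of why a compile-time worst-case alignment is safe across CPUs
# with different runtime cache line sizes.

def ALIGN(x, a):
    """Kernel-style round-up to a power-of-two boundary."""
    return (x + a - 1) & ~(a - 1)

L1_CACHE_BYTES = 128            # compile-time worst case (P4)
runtime_line_sizes = (64, 128)  # what cache_line_size() might report at boot

sizeof_bio = 104                # x86_64 figure quoted in the thread
padded = ALIGN(sizeof_bio, L1_CACHE_BYTES)

# An object padded to the worst-case line never straddles a line boundary
# on any CPU whose real line size divides L1_CACHE_BYTES.
for line in runtime_line_sizes:
    print(line, padded % line == 0)
```

The cost of the worst case is padding: 128 - 104 = 24 bytes per bio that a 64-byte-line CPU did not strictly need.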
RE: [patch] speed up single bio_vec allocation
Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > > I will try that too. I'm a bit touchy about sharing a cache line
> > > for different bio. But given that there are 200,000 I/O per second
> > > we are currently pushing the kernel, the chances of two cpu
> > > working on two bio that sits in the same cache line are pretty
> > > small.
> >
> > Yep I really think so. Besides, it's not like we are repeatedly
> > writing to these objects in the first place.
>
> This is what I had in mind, in case it wasn't completely clear. Not
> tested, other than it compiles. Basically it eliminates the small
> bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> 12-bytes on 32-bit archs instead and uses the room at the end for the
> bio_vec structure.

Yeah, I had a very similar patch queued internally for the large
benchmark measurement. I will post the result as soon as I get it.

> I still can't help but think we can do better than this, and that this
> is nothing more than optimizing for a benchmark. For high performance
> I/O, you will be doing > 1 page bio's anyway and this patch wont help
> you at all. Perhaps we can just kill bio_vec slabs completely, and
> create bio slabs instead with differing sizes. So instead of having 1
> bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave
> room for the bio_vec list at the end. That would always eliminate the
> extra allocation, at the cost of blowing the 256-page case into a
> order 1 page allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the
> 4kb 64-bit archs, which is something I've always tried to avoid.

I took a quick query of biovec-* slab stats on various production
machines; the majority of the allocations are on 1 and 4 segments, and
usage falls off quickly at 16 or more. 256-segment biovec allocation is
really rare. I think it makes sense to heavily bias towards smaller
biovec allocations and have a separate biovec allocation for really
large ones.

- Ken
Re: [patch] speed up single bio_vec allocation
On Wed, Dec 06 2006, Jens Axboe wrote:
> I still can't help but think we can do better than this, and that this
> is nothing more than optimizing for a benchmark. For high performance
> I/O, you will be doing > 1 page bio's anyway and this patch wont help
> you at all. Perhaps we can just kill bio_vec slabs completely, and
> create bio slabs instead with differing sizes. So instead of having 1
> bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave
> room for the bio_vec list at the end. That would always eliminate the
> extra allocation, at the cost of blowing the 256-page case into a
> order 1 page allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the
> 4kb 64-bit archs, which is something I've always tried to avoid.

That would look something like this. You'd probably want to re-tweak
the slab sizes now, so that we get the most out of the slab page. If we
accept the 2^1 order allocation for the largest bio, we can make it
larger than 256 pages at no extra cost. Probably around 500 pages would
still fit inside the two pages.
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 08a40f4..b4bf3b3 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -813,7 +813,7 @@ static int crypt_ctr(struct dm_target *t
 		goto bad4;
 	}

-	cc->bs = bioset_create(MIN_IOS, MIN_IOS, 4);
+	cc->bs = bioset_create(MIN_IOS, 4);
 	if (!cc->bs) {
 		ti->error = "Cannot allocate crypt bioset";
 		goto bad_bs;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index da663d2..581c24a 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -60,7 +60,7 @@ static int resize_pool(unsigned int new_
 	if (!_io_pool)
 		return -ENOMEM;

-	_bios = bioset_create(16, 16, 4);
+	_bios = bioset_create(16, 4);
 	if (!_bios) {
 		mempool_destroy(_io_pool);
 		_io_pool = NULL;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index fc4f743..98f6768 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -973,7 +973,7 @@ static struct mapped_device *alloc_dev(i
 	if (!md->tio_pool)
 		goto bad3;

-	md->bs = bioset_create(16, 16, 4);
+	md->bs = bioset_create(16, 4);
 	if (!md->bs)
 		goto bad_no_bioset;
diff --git a/fs/bio.c b/fs/bio.c
index aa4d09b..452e8f7 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -30,10 +30,6 @@
 #define BIO_POOL_SIZE 256

-static kmem_cache_t *bio_slab __read_mostly;
-
-#define BIOVEC_NR_POOLS 6
-
 /*
  * a small number of entries is fine, not going to be performance critical.
  * basically we just need to survive
@@ -41,23 +37,23 @@ static kmem_cache_t *bio_slab __read_mos
 #define BIO_SPLIT_ENTRIES 8
 mempool_t *bio_split_pool __read_mostly;

-struct biovec_slab {
+struct bio_slab {
 	int nr_vecs;
-	char *name;
+	const char *name;
 	kmem_cache_t *slab;
 };

-/*
- * if you change this list, also change bvec_alloc or things will
- * break badly! cannot be bigger than what you can fit into an
- * unsigned short
- */
-
-#define BV(x) { .nr_vecs = x, .name = "biovec-"__stringify(x) }
-static struct biovec_slab bvec_slabs[BIOVEC_NR_POOLS] __read_mostly = {
-	BV(1), BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
+#define BIOSLAB_NR_POOLS	6
+#define BIOSLAB(x) { .nr_vecs = x, .name = "bio-"__stringify(x) }
+static struct bio_slab bio_slabs[BIOSLAB_NR_POOLS] __read_mostly = {
+	BIOSLAB(0),
+	BIOSLAB(4),
+	BIOSLAB(16),
+	BIOSLAB(64),
+	BIOSLAB(128),
+	BIOSLAB(BIO_MAX_PAGES),
 };
-#undef BV
+#undef BIOSLAB

 /*
  * bio_set is used to allow other portions of the IO system to
@@ -66,8 +62,7 @@ static struct biovec_slab bvec_slabs[BIO
  * and the bvec_slabs[].
  */
 struct bio_set {
-	mempool_t *bio_pool;
-	mempool_t *bvec_pools[BIOVEC_NR_POOLS];
+	mempool_t *bio_pools[BIOSLAB_NR_POOLS];
 };

 /*
@@ -76,45 +71,12 @@ struct bio_set {
  */
 static struct bio_set *fs_bio_set;

-static inline struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx, struct bio_set *bs)
-{
-	struct bio_vec *bvl;
-
-	/*
-	 * see comment near bvec_array define!
-	 */
-	switch (nr) {
-		case 1: *idx = 0; break;
-		case 2 ... 4: *idx = 1; break;
-		case 5 ... 16: *idx = 2; break;
-		case 17 ... 64: *idx = 3; break;
-		case 65 ... 128: *idx = 4; break;
-		case 129 ... BIO_MAX_PAGES: *idx = 5; break;
-		default:
-			return NULL;
-	}
-	/*
-	 * idx now points to the pool we want to allocate from
-	 */
-
-	bvl = mempool_alloc(bs->bvec_pools[*idx], gfp_mask);
-	if (bvl) {
-		struct biovec_slab *bp = bvec_slabs + *idx;
-
-		memset(bvl, 0, bp->nr_vecs * sizeof(struct bio_vec));
-	}
-
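Jens's arithmetic above can be checked with the sizes quoted elsewhere in this thread: struct bio is 104 bytes on x86_64, a bio_vec is 16 bytes, and PAGE_SIZE is 4096. These figures are assumptions for illustration (2.6.19-era x86_64), not measured here.

```python
# Back-of-envelope check of the allocation-order claims: the 256-segment
# case no longer fits in one page once the bio rides along, and an order-1
# (two page) slab object holds roughly 500 segments, matching the estimate.
SIZEOF_BIO = 104      # x86_64 figure quoted in this thread
SIZEOF_BIO_VEC = 16   # 64-bit bio_vec
PAGE_SIZE = 4096

def bytes_needed(nr_vecs):
    """Size of a bio with its bio_vec list placed at the end."""
    return SIZEOF_BIO + nr_vecs * SIZEOF_BIO_VEC

def max_vecs(order):
    """Largest bio_vec count whose combined object fits in 2**order pages."""
    return ((PAGE_SIZE << order) - SIZEOF_BIO) // SIZEOF_BIO_VEC

print(bytes_needed(256), ">", PAGE_SIZE, "->", bytes_needed(256) > PAGE_SIZE)
print("order 0 holds", max_vecs(0), "vecs; order 1 holds", max_vecs(1))
```

256*16 + 104 = 4200 bytes, just over one 4kB page; two pages leave room for 505 segments, which is where "around 500 pages" comes from.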
Re: [patch] speed up single bio_vec allocation
On Mon, Dec 04 2006, Jens Axboe wrote:
> On Mon, Dec 04 2006, Chen, Kenneth W wrote:
> > > [...]
> > >
> > > Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty
> > > pointless, I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in
> > > one go when a bio is allocated. It doesn't add a lot of overhead
> > > even for the case where we do > 1 page bios, and it gets rid of
> > > the dual allocation for the 1 page bio.
> >
> > I will try that too. I'm a bit touchy about sharing a cache line for
> > different bio. But given that there are 200,000 I/O per second we
> > are currently pushing the kernel, the chances of two cpu working on
> > two bio that sits in the same cache line are pretty small.
>
> Yep I really think so. Besides, it's not like we are repeatedly
> writing to these objects in the first place.

This is what I had in mind, in case it wasn't completely clear. Not
tested, other than it compiles. Basically it eliminates the small
bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
12-bytes on 32-bit archs instead and uses the room at the end for the
bio_vec structure.

I still can't help but think we can do better than this, and that this
is nothing more than optimizing for a benchmark. For high performance
I/O, you will be doing > 1 page bio's anyway and this patch wont help
you at all. Perhaps we can just kill bio_vec slabs completely, and
create bio slabs instead with differing sizes. So instead of having 1
bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
for the bio_vec list at the end. That would always eliminate the extra
allocation, at the cost of blowing the 256-page case into a order 1 page
allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
which is something I've always tried to avoid.
diff --git a/fs/bio.c b/fs/bio.c
index aa4d09b..94b5e05 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -32,7 +32,7 @@

 static kmem_cache_t *bio_slab __read_mostly;

-#define BIOVEC_NR_POOLS 6
+#define BIOVEC_NR_POOLS 5

 /*
  * a small number of entries is fine, not going to be performance critical.
@@ -55,7 +55,7 @@ struct biovec_slab {
 #define BV(x) { .nr_vecs = x, .name = "biovec-"__stringify(x) }
 static struct biovec_slab bvec_slabs[BIOVEC_NR_POOLS] __read_mostly = {
-	BV(1), BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
+	BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
 };
 #undef BV
@@ -76,7 +76,8 @@ struct bio_set {
  */
 static struct bio_set *fs_bio_set;

-static inline struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx, struct bio_set *bs)
+static struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx,
+				     struct bio_set *bs)
 {
 	struct bio_vec *bvl;

@@ -84,15 +85,25 @@ static inline struct bio_vec *bvec_alloc
 	 * see comment near bvec_array define!
 	 */
 	switch (nr) {
-		case 1: *idx = 0; break;
-		case 2 ... 4: *idx = 1; break;
-		case 5 ... 16: *idx = 2; break;
-		case 17 ... 64: *idx = 3; break;
-		case 65 ... 128: *idx = 4; break;
-		case 129 ... BIO_MAX_PAGES: *idx = 5; break;
-		default:
-			return NULL;
+		case 2 ... 4:
+			*idx = 0;
+			break;
+		case 5 ... 16:
+			*idx = 1;
+			break;
+		case 17 ... 64:
+			*idx = 2;
+			break;
+		case 65 ... 128:
+			*idx = 3;
+			break;
+		case 129 ... BIO_MAX_PAGES:
+			*idx = 4;
+			break;
+		default:
+			return NULL;
 	}
+
 	/*
 	 * idx now points to the pool we want to allocate from
 	 */
@@ -109,11 +120,13 @@ static inline struct bio_vec *bvec_alloc

 void bio_free(struct bio *bio, struct bio_set *bio_set)
 {
-	const int pool_idx = BIO_POOL_IDX(bio);
+	if ((bio->bi_flags & (1 << BIO_BVL_EMBED)) == 0) {
+		const int pool_idx = BIO_POOL_IDX(bio);

-	BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
+		BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
+		mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
+	}

-	mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
 	mempool_free(bio, bio_set->bio_pool);
 }

@@ -166,7 +179,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 	struct bio_vec *bvl = NULL;

 	bio_init(bio);
-	if (likely(nr_iovecs)) {
+
+	if (nr_iovecs == 1) {
+		bvl = (void *) bio + sizeof(*bio);
+		memset(bvl, 0, sizeof(*bvl));
+		bio->bi_flags |= 1 << BIO_BVL_EMBED;
+		bio->bi_max_vecs = 1;
+	} else if (nr_iovecs) {
 		unsigned long idx = 0;	/* shut up gcc */

 		bvl = bvec_alloc_bs(gfp_mask,
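The effect of the patch on pool selection can be modeled in userspace: a 1-segment request is served from the bio_vec embedded at the end of struct bio (the BIO_BVL_EMBED path), and everything else maps onto the five remaining mempools. This sketch assumes BIO_MAX_PAGES is 256, as on 4kB-page x86_64; it mirrors the switch in the patched bvec_alloc_bs() rather than being kernel code.

```python
# Userspace model of the patched allocation path's pool selection.
BIO_MAX_PAGES = 256
EMBEDDED = "embedded"          # stands in for the BIO_BVL_EMBED fast path

def pool_for(nr_iovecs):
    """Which allocation serves a bio with nr_iovecs segments."""
    if nr_iovecs == 0:
        return None            # mirrors `else if (nr_iovecs)`: no vec at all
    if nr_iovecs == 1:
        return EMBEDDED        # vec embedded right after struct bio
    for idx, top in enumerate((4, 16, 64, 128, BIO_MAX_PAGES)):
        if nr_iovecs <= top:
            return idx         # index into the five remaining bvec pools
    return None                # beyond BIO_MAX_PAGES: allocation fails

# The old code sent nr_iovecs == 1 to a dedicated biovec-1 mempool; with
# the patch that case never touches a mempool.
for nr in (1, 2, 4, 5, 16, 128, 129, 256, 257):
    print(nr, "->", pool_for(nr))
```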
RE: [patch] speed up single bio_vec allocation
Jens Axboe wrote on Monday, December 04, 2006 12:07 PM
> On Mon, Dec 04 2006, Chen, Kenneth W wrote:
> > On 64-bit arch like x86_64, struct bio is 104 byte. Since bio slab
> > is created with SLAB_HWCACHE_ALIGN flag, there are usually spare
> > memory available at the end of bio. I think we can utilize that
> > memory for bio_vec allocation. The purpose is not so much on saving
> > memory consumption for bio_vec, instead, I'm attempting to optimize
> > away a call to bvec_alloc_bs.
> >
> > So here is a patch to do just that for 1 segment bio_vec (we
> > currently only have space for 1 on 2.6.19). And the detection
> > whether there is spare space available is dynamically calculated at
> > compile time. If there is no space available, there will be no run
> > time cost at all because gcc simply optimizes away all the code
> > added in this patch. If there is space available, the only run time
> > check is to see what the size of the iovec is and we do the
> > appropriate assignment to bio->bi_io_vec etc. The cost is minimal
> > and we gain a whole lot back from not calling the bvec_alloc_bs()
> > function.
> >
> > I tried to use cache_line_size() to find out the alignment of struct
> > bio, but stumbled on that it is a runtime function for x86_64. So
> > instead I made bio hint to the slab allocator to align on 32 bytes
> > (slab will use the larger value of the hw cache line and the
> > caller's "align" hint). I think it is a sane number for the majority
> > of the CPUs out in the world.
>
> Any benchmarks for this one?

About 0.2% on a database transaction processing benchmark. It was done
a while back on top of a major Linux vendor kernel. I will retest it
again for 2.6.19.

> [...]
>
> Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty
> pointless, I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one
> go when a bio is allocated. It doesn't add a lot of overhead even for
> the case where we do > 1 page bios, and it gets rid of the dual
> allocation for the 1 page bio.

I will try that too. I'm a bit touchy about sharing a cache line for
different bios. But given that we are currently pushing the kernel to
200,000 I/O per second, the chances of two CPUs working on two bios
that sit in the same cache line are pretty small.

- Ken
Re: [patch] speed up single bio_vec allocation
On Mon, Dec 04 2006, Chen, Kenneth W wrote:
> > [...]
> >
> > Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty
> > pointless, I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in
> > one go when a bio is allocated. It doesn't add a lot of overhead
> > even for the case where we do > 1 page bios, and it gets rid of the
> > dual allocation for the 1 page bio.
>
> I will try that too. I'm a bit touchy about sharing a cache line for
> different bio. But given that there are 200,000 I/O per second we are
> currently pushing the kernel, the chances of two cpu working on two
> bio that sits in the same cache line are pretty small.

Yep I really think so. Besides, it's not like we are repeatedly writing
to these objects in the first place.

-- 
Jens Axboe
Re: [patch] speed up single bio_vec allocation
On Mon, Dec 04 2006, Chen, Kenneth W wrote:
> On 64-bit arch like x86_64, struct bio is 104 byte. Since bio slab is
> created with SLAB_HWCACHE_ALIGN flag, there are usually spare memory
> available at the end of bio. I think we can utilize that memory for
> bio_vec allocation. The purpose is not so much on saving memory
> consumption for bio_vec, instead, I'm attempting to optimize away a
> call to bvec_alloc_bs.
>
> So here is a patch to do just that for 1 segment bio_vec (we currently
> only have space for 1 on 2.6.19). And the detection whether there is
> spare space available is dynamically calculated at compile time. If
> there is no space available, there will be no run time cost at all
> because gcc simply optimizes away all the code added in this patch. If
> there is space available, the only run time check is to see what the
> size of the iovec is and we do the appropriate assignment to
> bio->bi_io_vec etc. The cost is minimal and we gain a whole lot back
> from not calling the bvec_alloc_bs() function.
>
> I tried to use cache_line_size() to find out the alignment of struct
> bio, but stumbled on that it is a runtime function for x86_64. So
> instead I made bio hint to the slab allocator to align on 32 bytes
> (slab will use the larger value of the hw cache line and the caller's
> "align" hint). I think it is a sane number for the majority of the
> CPUs out in the world.

Any benchmarks for this one?

> --- ./fs/bio.c.orig	2006-12-03 17:20:36.0 -0800
> +++ ./fs/bio.c	2006-12-03 21:29:20.0 -0800
> @@ -29,11 +29,14 @@
>  #include <linux/uio.h>	/* for struct sg_iovec */
>
>  #define BIO_POOL_SIZE 256
> -
> +#define BIO_ALIGN 32	/* minimal bio structure alignment */
>  static kmem_cache_t *bio_slab __read_mostly;
>
>  #define BIOVEC_NR_POOLS 6
>
> +#define BIOVEC_FIT_INSIDE_BIO_CACHE_LINE	\
> +	(ALIGN(sizeof(struct bio), BIO_ALIGN) ==	\
> +	 ALIGN(sizeof(struct bio) + sizeof(struct bio_vec), BIO_ALIGN))
>  /*
>   * a small number of entries is fine, not going to be performance critical.
>   * basically we just need to survive
> @@ -113,7 +116,8 @@ void bio_free(struct bio *bio, struct bi
>
>  	BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
>
> -	mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
> +	if (!BIOVEC_FIT_INSIDE_BIO_CACHE_LINE || pool_idx)
> +		mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);

I think you should use a flag for this instead.

>  	mempool_free(bio, bio_set->bio_pool);
>  }
>
> @@ -166,7 +170,15 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
>  	struct bio_vec *bvl = NULL;
>
>  	bio_init(bio);
> -	if (likely(nr_iovecs)) {
> +
> +	/*
> +	 * if bio_vec can fit into remaining cache line of struct
> +	 * bio, go ahead use it and skip mempool allocation.
> +	 */
> +	if (nr_iovecs == 1 && BIOVEC_FIT_INSIDE_BIO_CACHE_LINE) {
> +		bvl = (struct bio_vec *) (bio + 1);
> +		bio->bi_max_vecs = 1;
> +	} else if (likely(nr_iovecs)) {
>  		unsigned long idx = 0;	/* shut up gcc */

Either move this logic to bvec_alloc_bs(), or remember to clear bvl as
is done there.

> @@ -1204,7 +1216,7 @@ static void __init biovec_init_slabs(voi
>  		struct biovec_slab *bvs = bvec_slabs + i;
>
>  		size = bvs->nr_vecs * sizeof(struct bio_vec);
> -		bvs->slab = kmem_cache_create(bvs->name, size, 0,
> +		bvs->slab = kmem_cache_create(bvs->name, size, BIO_ALIGN,
>  					SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
>  	}
>  }

Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless,
I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a
bio is allocated. It doesn't add a lot of overhead even for the case
where we do > 1 page bios, and it gets rid of the dual allocation for
the 1 page bio.

-- 
Jens Axboe
[patch] speed up single bio_vec allocation
On a 64-bit arch like x86_64, struct bio is 104 bytes. Since the bio slab is created with the SLAB_HWCACHE_ALIGN flag, there is usually spare memory available at the end of the bio. I think we can utilize that memory for bio_vec allocation. The purpose is not so much saving memory consumption for bio_vec; instead, I'm attempting to optimize away a call to bvec_alloc_bs. So here is a patch to do just that for a 1-segment bio_vec (we currently only have space for 1 on 2.6.19). Whether there is spare space available is calculated at compile time. If there is no space available, there is no run-time cost at all because gcc simply optimizes away all the code added in this patch. If there is space available, the only run-time check is the size of the iovec, and we do the appropriate assignment to bio->bi_io_vec etc. The cost is minimal and we gain a whole lot back from not calling the bvec_alloc_bs() function. I tried to use cache_line_size() to find out the alignment of struct bio, but stumbled on the fact that it is a runtime function on x86_64. So instead I made bio hint to the slab allocator to align on 32 bytes (slab will use the larger of the hw cache line size and the caller's "align" hint). I think that is a sane number for the majority of CPUs out in the world. Signed-off-by: Ken Chen <[EMAIL PROTECTED]> --- ./fs/bio.c.orig 2006-12-03 17:20:36.0 -0800 +++ ./fs/bio.c 2006-12-03 21:29:20.0 -0800 @@ -29,11 +29,14 @@ #include <scsi/sg.h> /* for struct sg_iovec */ #define BIO_POOL_SIZE 256 - +#define BIO_ALIGN 32 /* minimal bio structure alignment */ static kmem_cache_t *bio_slab __read_mostly; #define BIOVEC_NR_POOLS 6 +#define BIOVEC_FIT_INSIDE_BIO_CACHE_LINE \ + (ALIGN(sizeof(struct bio), BIO_ALIGN) ==\ +ALIGN(sizeof(struct bio) + sizeof(struct bio_vec), BIO_ALIGN)) /* * a small number of entries is fine, not going to be performance critical. 
* basically we just need to survive @@ -113,7 +116,8 @@ void bio_free(struct bio *bio, struct bi BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS); - mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]); + if (!BIOVEC_FIT_INSIDE_BIO_CACHE_LINE || pool_idx) + mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]); mempool_free(bio, bio_set->bio_pool); } @@ -166,7 +170,15 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m struct bio_vec *bvl = NULL; bio_init(bio); - if (likely(nr_iovecs)) { + + /* +* if bio_vec can fit into remaining cache line of struct +* bio, go ahead use it and skip mempool allocation. +*/ + if (nr_iovecs == 1 && BIOVEC_FIT_INSIDE_BIO_CACHE_LINE) { + bvl = (struct bio_vec*) (bio + 1); + bio->bi_max_vecs = 1; + } else if (likely(nr_iovecs)) { unsigned long idx = 0; /* shut up gcc */ bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs); @@ -1204,7 +1216,7 @@ static void __init biovec_init_slabs(voi struct biovec_slab *bvs = bvec_slabs + i; size = bvs->nr_vecs * sizeof(struct bio_vec); - bvs->slab = kmem_cache_create(bvs->name, size, 0, + bvs->slab = kmem_cache_create(bvs->name, size, BIO_ALIGN, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); } } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] speed up single bio_vec allocation
On Mon, Dec 04 2006, Chen, Kenneth W wrote: > [...] > > Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless, > > I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a bio > > is allocated. It doesn't add a lot of overhead even for the case where we > > do > 1 page bios, and it gets rid of the dual allocation for the 1 page bio. > > I will try that too. I'm a bit touchy about sharing a cache line for > different bios. But given that there are 200,000 I/O per second we are > currently pushing through the kernel, the chances of two cpus working on > two bios that sit in the same cache line are pretty small. Yep, I really think so. Besides, it's not like we are repeatedly writing to these objects in the first place. -- Jens Axboe - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [patch] speed up single bio_vec allocation
Jens Axboe wrote on Monday, December 04, 2006 12:07 PM > On Mon, Dec 04 2006, Chen, Kenneth W wrote: > > On a 64-bit arch like x86_64, struct bio is 104 bytes. Since the bio slab is > > created with the SLAB_HWCACHE_ALIGN flag, there is usually spare memory > > available at the end of the bio. I think we can utilize that memory for > > bio_vec allocation. [...] > > Any benchmarks for this one? About 0.2% on a database transaction processing benchmark. It was done a while back on top of a major Linux vendor kernel. I will retest it again for 2.6.19. > [...] > Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless, I > bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a bio is > allocated. It doesn't add a lot of overhead even for the case where we do > > 1 page bios, and it gets rid of the dual allocation for the 1 page bio. I will try that too. I'm a bit touchy about sharing a cache line for different bios. 
But given that there are 200,000 I/O per second we are currently pushing through the kernel, the chances of two cpus working on two bios that sit in the same cache line are pretty small. - Ken - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/