Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 05/02/2014 10:43 AM, Christoph Hellwig wrote:
> On Fri, May 02, 2014 at 10:41:40AM -0600, Jens Axboe wrote:
>>> At least for SCSI devices _tag space_ is plenty, it's just that we
>>> artificially limit our tag space to the queue depth to avoid having to
>>> track that one separately. In addition we also preallocate a request
>>> for each tag, so even if we would track the queue depth separately
>>> we would waste a lot of memory.
>>
>> In practice it comes out to the same, it's not feasible to run a much
>> larger space and track on queue depth. So I don't think that changes the
>> conclusion for SCSI.
>
> Agreed. Just had to check if my smart ass hat still fits this morning..

If not, these are on sale: http://www.cafepress.com/+smartass_trucker_hat,18823771 :-)

-- 
Jens Axboe
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Fri, May 02, 2014 at 10:41:40AM -0600, Jens Axboe wrote:
>> At least for SCSI devices _tag space_ is plenty, it's just that we
>> artificially limit our tag space to the queue depth to avoid having to
>> track that one separately. In addition we also preallocate a request
>> for each tag, so even if we would track the queue depth separately
>> we would waste a lot of memory.
>
> In practice it comes out to the same, it's not feasible to run a much
> larger space and track on queue depth. So I don't think that changes the
> conclusion for SCSI.

Agreed. Just had to check if my smart ass hat still fits this morning..
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 05/01/2014 11:05 PM, Christoph Hellwig wrote:
> On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
>> I've taken the consequence of this and implemented another tagging
>> scheme that blk-mq will use if it deems that percpu_ida isn't going
>> to be effective for the device being initialized. But I really hate
>> to have both of them in there. Unfortunately I have no devices
>> available that have a tag space that will justify using percpu_ida,
>> so comparisons are a bit hard at the moment. NVMe should change
>> that, though, so the decision will have to be deferred until that is
>> tested.
>
> At least for SCSI devices _tag space_ is plenty, it's just that we
> artificially limit our tag space to the queue depth to avoid having to
> track that one separately. In addition we also preallocate a request
> for each tag, so even if we would track the queue depth separately
> we would waste a lot of memory.

In practice it comes out to the same, it's not feasible to run a much
larger space and track on queue depth. So I don't think that changes the
conclusion for SCSI.

-- 
Jens Axboe
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
> I've taken the consequence of this and implemented another tagging
> scheme that blk-mq will use if it deems that percpu_ida isn't going
> to be effective for the device being initialized. But I really hate
> to have both of them in there. Unfortunately I have no devices
> available that have a tag space that will justify using percpu_ida,
> so comparisons are a bit hard at the moment. NVMe should change
> that, though, so the decision will have to be deferred until that is
> tested.

At least for SCSI devices _tag space_ is plenty, it's just that we
artificially limit our tag space to the queue depth to avoid having to
track that one separately. In addition we also preallocate a request
for each tag, so even if we would track the queue depth separately
we would waste a lot of memory.
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 2014-05-01 20:38, Kent Overstreet wrote:
> On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
>> On 2014-05-01 16:47, Kent Overstreet wrote:
>>> On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:
>>>> On 04/29/2014 05:35 AM, Ming Lei wrote:
>>>>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe wrote:
>>>>>> On 2014-04-25 18:01, Ming Lei wrote:
>>>>>>> Hi Jens,
>>>>>>>
>>>>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe wrote:
>>>>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>>>>>>
>>>>>>>> Sorry, I did run it the other day. It has little to no effect here,
>>>>>>>> but that's mostly because there's so much other crap going on in
>>>>>>>> there. The most effective way to currently make it work better, is
>>>>>>>> just to ensure the caching pool is of a sane size.
>>>>>>>
>>>>>>> Yes, that is just what the patch is doing, :-)
>>>>>>
>>>>>> But it's not enough.
>>>>>
>>>>> Yes, the patch is only for cases of multi hw queue and having
>>>>> offline CPUs existed.
>>>>>
>>>>>> For instance, my test case, it's 255 tags and 64 CPUs.
>>>>>> We end up in cross-cpu spinlock nightmare mode.
>>>>>
>>>>> IMO, the scaling problem for the above case might be
>>>>> caused by either current percpu ida design or blk-mq's
>>>>> usage on it.
>>>>
>>>> That is pretty much my claim, yes. Basically I don't think per-cpu tag
>>>> caching is ever going to be the best solution for the combination of
>>>> modern machines and the hardware that is out there (limited tags).
>>>
>>> Sorry for not being more active in the discussion earlier, but anyways -
>>> I'm in 100% agreement with this.
>>>
>>> Percpu freelists are _fundamentally_ only _useful_ when you don't need
>>> to be using all your available tags, because percpu sharding requires
>>> wasting your tag space. I could write a mathematical proof of this if I
>>> cared enough.
>>>
>>> Otherwise what happens is on alloc failure you're touching all the
>>> other cachelines every single time and now you're bouncing _more_
>>> cachelines than if you just had a single global freelist.
>>>
>>> So yeah, for small tag spaces just use a single simple bit vector on a
>>> single cacheline.
>>
>> I've taken the consequence of this and implemented another tagging
>> scheme that blk-mq will use if it deems that percpu_ida isn't going to
>> be effective for the device being initialized. But I really hate to
>> have both of them in there. Unfortunately I have no devices available
>> that have a tag space that will justify using percpu_ida, so
>> comparisons are a bit hard at the moment. NVMe should change that,
>> though, so the decision will have to be deferred until that is tested.
>
> Yeah, I agree that is annoying. I've thought about the issue too though
> and I haven't been able to come up with any better ideas, myself.

I have failed in that area, too, and it's not for lack of thinking about
it and experimenting. So hence a new solution was thought up, based on a
lot of userland prototyping and testing. All things considered, two
solutions are better than no solution.

> A given driver probably should be able to always use one or the other
> though, so we shouldn't _need_ a runtime conditional because of this,
> though structuring the code that way might be more trouble than it's
> worth from my vague recollection of what blk-mq looks like...

It's completely runtime conditional at this point, not sure how not to
make it so. This is transparent to drivers, they should not care about
what kind of tagging scheme is used. If we present that, then we have
failed. The runtime conditional is still better than a function pointer,
though, so it'll likely stay that way for now. So the entry points now
all look like this:

	if (use_new_scheme)
		ret = new_foo();
	else
		ret = foo();

At least it should branch predict really well :-)

> (I've actually been fighting with unrelated issues at a very similar
> layer of abstraction, it's quite annoying.)
>
>>> BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that
>>> changed when steal_tags() runs was fundamentally wrong and broken in
>>> this respect, and should be reverted, whatever usage it was that was
>>> expecting to be able to allocate the entire tag space was the problem.
>>
>> It's hard to blame Shaohua, and I helped push that. It was an attempt
>> at making percpu_ida actually useful for what blk-mq needs it for, and
>> being the primary user of it, it was definitely worth doing. A tagging
>> scheme that requires the tag space to be effectively sparse/huge to be
>> fast is not a good generic tagging algorithm.
>
> Yeah it was definitely not an unreasonable attempt and it's probably my
> own fault for not speaking up louder at the time (I can't remember how
> much I commented at the time). Ah well, irritating problem space :)

Not a problem, I think the main failure here is that we have been coming
at this with clashing expectations of what the requirements are. And a
further issue is wanting to cling to the percpu_ida tags on my end,
thinking it could be made to work.

-- 
Jens Axboe
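The entry-point shape Jens describes can be sketched in isolation. This is a hypothetical illustration only: `blk_mq_get_tag`, `use_new_scheme`, and the two backend functions are made-up names standing in for whatever blk-mq actually calls them.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two tagging backends. */
static int percpu_ida_get(void) { return 1; }
static int bitmap_tag_get(void) { return 2; }

struct tag_set {
	bool use_new_scheme;	/* decided once, at device init time */
};

/* One driver-visible entry point; drivers never see which scheme runs.
 * The branch always goes the same way for a given queue, so it should
 * branch predict essentially perfectly. */
static int blk_mq_get_tag(struct tag_set *ts)
{
	if (ts->use_new_scheme)
		return bitmap_tag_get();
	return percpu_ida_get();
}
```

Because the flag is fixed per queue, this is the "better than a function pointer" case: the predictor learns the branch immediately and it costs essentially nothing per allocation.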
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
> On 2014-05-01 16:47, Kent Overstreet wrote:
>> On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:
>>> On 04/29/2014 05:35 AM, Ming Lei wrote:
>>>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe wrote:
>>>>> On 2014-04-25 18:01, Ming Lei wrote:
>>>>>> Hi Jens,
>>>>>>
>>>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe wrote:
>>>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>>>>>
>>>>>>> Sorry, I did run it the other day. It has little to no effect here,
>>>>>>> but that's mostly because there's so much other crap going on in
>>>>>>> there. The most effective way to currently make it work better, is
>>>>>>> just to ensure the caching pool is of a sane size.
>>>>>>
>>>>>> Yes, that is just what the patch is doing, :-)
>>>>>
>>>>> But it's not enough.
>>>>
>>>> Yes, the patch is only for cases of multi hw queue and having
>>>> offline CPUs existed.
>>>>
>>>>> For instance, my test case, it's 255 tags and 64 CPUs.
>>>>> We end up in cross-cpu spinlock nightmare mode.
>>>>
>>>> IMO, the scaling problem for the above case might be
>>>> caused by either current percpu ida design or blk-mq's
>>>> usage on it.
>>>
>>> That is pretty much my claim, yes. Basically I don't think per-cpu tag
>>> caching is ever going to be the best solution for the combination of
>>> modern machines and the hardware that is out there (limited tags).
>>
>> Sorry for not being more active in the discussion earlier, but anyways -
>> I'm in 100% agreement with this.
>>
>> Percpu freelists are _fundamentally_ only _useful_ when you don't need
>> to be using all your available tags, because percpu sharding requires
>> wasting your tag space. I could write a mathematical proof of this if I
>> cared enough.
>>
>> Otherwise what happens is on alloc failure you're touching all the
>> other cachelines every single time and now you're bouncing _more_
>> cachelines than if you just had a single global freelist.
>>
>> So yeah, for small tag spaces just use a single simple bit vector on a
>> single cacheline.
>
> I've taken the consequence of this and implemented another tagging
> scheme that blk-mq will use if it deems that percpu_ida isn't going to
> be effective for the device being initialized. But I really hate to
> have both of them in there. Unfortunately I have no devices available
> that have a tag space that will justify using percpu_ida, so
> comparisons are a bit hard at the moment. NVMe should change that,
> though, so the decision will have to be deferred until that is tested.

Yeah, I agree that is annoying. I've thought about the issue too though
and I haven't been able to come up with any better ideas, myself.

A given driver probably should be able to always use one or the other
though, so we shouldn't _need_ a runtime conditional because of this,
though structuring the code that way might be more trouble than it's
worth from my vague recollection of what blk-mq looks like...

(I've actually been fighting with unrelated issues at a very similar
layer of abstraction, it's quite annoying.)

>> BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that
>> changed when steal_tags() runs was fundamentally wrong and broken in
>> this respect, and should be reverted, whatever usage it was that was
>> expecting to be able to allocate the entire tag space was the problem.
>
> It's hard to blame Shaohua, and I helped push that. It was an attempt
> at making percpu_ida actually useful for what blk-mq needs it for, and
> being the primary user of it, it was definitely worth doing. A tagging
> scheme that requires the tag space to be effectively sparse/huge to be
> fast is not a good generic tagging algorithm.

Yeah it was definitely not an unreasonable attempt and it's probably my
own fault for not speaking up louder at the time (I can't remember how
much I commented at the time).

Ah well, irritating problem space :)
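Kent's "sharding wastes tag space" point is easy to quantify with a toy calculation. The helper names and the batch size of 8 below are assumptions for illustration; the 255-tag/64-CPU figures are Jens's example from earlier in the thread.

```c
/* Upper bound on free tags that can sit stranded in percpu caches:
 * each CPU's freelist may hold up to a full batch that no other CPU
 * can allocate without resorting to steal_tags(). */
static unsigned strandable_tags(unsigned nr_cpus, unsigned batch)
{
	return nr_cpus * batch;
}

/* Percpu caching only pays off when the tag space comfortably exceeds
 * what the caches can strand; otherwise allocation under load
 * degenerates into constant cross-CPU stealing. */
static int percpu_caching_viable(unsigned nr_tags, unsigned nr_cpus,
				 unsigned batch)
{
	return nr_tags > strandable_tags(nr_cpus, batch);
}
```

With Jens's numbers, 255 tags and 64 CPUs, even a modest batch of 8 lets 512 tags strand, twice the entire tag space, which is exactly the "cross-cpu spinlock nightmare mode" described above.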
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 2014-05-01 16:47, Kent Overstreet wrote:
> On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:
>> On 04/29/2014 05:35 AM, Ming Lei wrote:
>>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe wrote:
>>>> On 2014-04-25 18:01, Ming Lei wrote:
>>>>> Hi Jens,
>>>>>
>>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe wrote:
>>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>>>>
>>>>>> Sorry, I did run it the other day. It has little to no effect here,
>>>>>> but that's mostly because there's so much other crap going on in
>>>>>> there. The most effective way to currently make it work better, is
>>>>>> just to ensure the caching pool is of a sane size.
>>>>>
>>>>> Yes, that is just what the patch is doing, :-)
>>>>
>>>> But it's not enough.
>>>
>>> Yes, the patch is only for cases of multi hw queue and having
>>> offline CPUs existed.
>>>
>>>> For instance, my test case, it's 255 tags and 64 CPUs.
>>>> We end up in cross-cpu spinlock nightmare mode.
>>>
>>> IMO, the scaling problem for the above case might be
>>> caused by either current percpu ida design or blk-mq's
>>> usage on it.
>>
>> That is pretty much my claim, yes. Basically I don't think per-cpu tag
>> caching is ever going to be the best solution for the combination of
>> modern machines and the hardware that is out there (limited tags).
>
> Sorry for not being more active in the discussion earlier, but anyways -
> I'm in 100% agreement with this.
>
> Percpu freelists are _fundamentally_ only _useful_ when you don't need
> to be using all your available tags, because percpu sharding requires
> wasting your tag space. I could write a mathematical proof of this if I
> cared enough.
>
> Otherwise what happens is on alloc failure you're touching all the
> other cachelines every single time and now you're bouncing _more_
> cachelines than if you just had a single global freelist.
>
> So yeah, for small tag spaces just use a single simple bit vector on a
> single cacheline.

I've taken the consequence of this and implemented another tagging
scheme that blk-mq will use if it deems that percpu_ida isn't going to
be effective for the device being initialized. But I really hate to have
both of them in there. Unfortunately I have no devices available that
have a tag space that will justify using percpu_ida, so comparisons are
a bit hard at the moment. NVMe should change that, though, so the
decision will have to be deferred until that is tested.

> BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that
> changed when steal_tags() runs was fundamentally wrong and broken in
> this respect, and should be reverted, whatever usage it was that was
> expecting to be able to allocate the entire tag space was the problem.

It's hard to blame Shaohua, and I helped push that. It was an attempt at
making percpu_ida actually useful for what blk-mq needs it for, and
being the primary user of it, it was definitely worth doing. A tagging
scheme that requires the tag space to be effectively sparse/huge to be
fast is not a good generic tagging algorithm.

-- 
Jens Axboe
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:
> On 04/29/2014 05:35 AM, Ming Lei wrote:
>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe wrote:
>>> On 2014-04-25 18:01, Ming Lei wrote:
>>>> Hi Jens,
>>>>
>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe wrote:
>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>>>
>>>>> Sorry, I did run it the other day. It has little to no effect here,
>>>>> but that's mostly because there's so much other crap going on in
>>>>> there. The most effective way to currently make it work better, is
>>>>> just to ensure the caching pool is of a sane size.
>>>>
>>>> Yes, that is just what the patch is doing, :-)
>>>
>>> But it's not enough.
>>
>> Yes, the patch is only for cases of multi hw queue and having
>> offline CPUs existed.
>>
>>> For instance, my test case, it's 255 tags and 64 CPUs.
>>> We end up in cross-cpu spinlock nightmare mode.
>>
>> IMO, the scaling problem for the above case might be
>> caused by either current percpu ida design or blk-mq's
>> usage on it.
>
> That is pretty much my claim, yes. Basically I don't think per-cpu tag
> caching is ever going to be the best solution for the combination of
> modern machines and the hardware that is out there (limited tags).

Sorry for not being more active in the discussion earlier, but anyways -
I'm in 100% agreement with this.

Percpu freelists are _fundamentally_ only _useful_ when you don't need
to be using all your available tags, because percpu sharding requires
wasting your tag space. I could write a mathematical proof of this if I
cared enough.

Otherwise what happens is on alloc failure you're touching all the other
cachelines every single time and now you're bouncing _more_ cachelines
than if you just had a single global freelist.

So yeah, for small tag spaces just use a single simple bit vector on a
single cacheline.

BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that
changed when steal_tags() runs was fundamentally wrong and broken in
this respect, and should be reverted, whatever usage it was that was
expecting to be able to allocate the entire tag space was the problem.
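Kent's "single simple bit vector on a single cacheline" suggestion can be sketched in userspace C11. This is an illustrative toy, not blk-mq's actual implementation: it packs up to 64 tags into one atomic word, so every CPU contends on one cacheline instead of bouncing many percpu lines on each failed allocation.

```c
#include <stdatomic.h>

/* One 64-bit word = one cacheline's worth of tags (up to 64).
 * A set bit means the tag is in use. */
static _Atomic unsigned long tag_map;

static int tag_alloc(unsigned nr_tags)
{
	unsigned long full = (nr_tags == 64) ? ~0UL : (1UL << nr_tags) - 1;
	unsigned long map = atomic_load(&tag_map);

	while (map != full) {
		int tag = __builtin_ctzl(~map);	/* lowest clear bit */

		if (atomic_compare_exchange_weak(&tag_map, &map,
						 map | (1UL << tag)))
			return tag;		/* won the race */
		/* CAS failure reloaded 'map'; just retry */
	}
	return -1;	/* tag space exhausted */
}

static void tag_free(int tag)
{
	atomic_fetch_and(&tag_map, ~(1UL << tag));
}
```

Allocation is a find-first-zero plus one compare-and-swap, and freeing is a single atomic AND; there is no percpu state, so no tag can ever be stranded in a remote cache.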
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Thu, May 01, 2014 at 11:24:42PM +0200, Alexander Gordeev wrote:
> On Tue, Apr 22, 2014 at 01:56:29PM +0200, Peter Zijlstra wrote:
>> I've not had time to revisit/finish them, but you should definitely
>> clean up the percpu_ida stuff and reduce existing contention before
>> going overboard.
>
> Hi Peter,
>
> I tried to combine your series into a single patch against 3.15-rc3.
> I hope it looks like you intended and my comments in the follow-up
> message make sense.
>
> Thanks!
>
> diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
> index 93d145e..14f5151 100644
> --- a/lib/percpu_ida.c
> +++ b/lib/percpu_ida.c
> @@ -51,6 +51,15 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
>  	*dst_nr += nr;
>  }
>  
> +static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
> +{
> +	if (l1 > l2)
> +		swap(l1, l2);
> +
> +	spin_lock(l1);
> +	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
> +}
> +
>  /*
>   * Try to steal tags from a remote cpu's percpu freelist.
>   *
> @@ -84,7 +93,7 @@ static inline void steal_tags(struct percpu_ida *pool,
>  		if (remote == tags)
>  			continue;
>  
> -		spin_lock(&remote->lock);
> +		double_lock(&tags->lock, &remote->lock);
>  
>  		if (remote->nr_free) {
>  			memcpy(tags->freelist,
> @@ -96,36 +105,81 @@ static inline void steal_tags(struct percpu_ida *pool,
>  		}
>  
>  		spin_unlock(&remote->lock);
> +		spin_unlock(&tags->lock);
>  
>  		if (tags->nr_free)
>  			break;
>  	}
>  }
>  
> -/*
> - * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
> - * our percpu freelist:
> - */
> -static inline void alloc_global_tags(struct percpu_ida *pool,
> -				     struct percpu_ida_cpu *tags)
> +static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
>  {
> -	move_tags(tags->freelist, &tags->nr_free,
> -		  pool->freelist, &pool->nr_free,
> -		  min(pool->nr_free, pool->percpu_batch_size));
> +	int tag = -ENOSPC;
> +
> +	spin_lock(&tags->lock);
> +	if (tags->nr_free)
> +		tag = tags->freelist[--tags->nr_free];
> +	spin_unlock(&tags->lock);
> +
> +	return tag;
>  }
>  
> -static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
> +static inline int alloc_local_tag(struct percpu_ida *pool)
>  {
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
>  	int tag = -ENOSPC;
>  
> +	local_irq_save(flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);
>  	spin_lock(&tags->lock);
>  	if (tags->nr_free)
>  		tag = tags->freelist[--tags->nr_free];
> -	spin_unlock(&tags->lock);
> +	spin_unlock_irqrestore(&tags->lock, flags);
>  
>  	return tag;
>  }
>  
> +static inline int alloc_global_tag(struct percpu_ida *pool)
> +{
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
> +	int tag = -ENOSPC;
> +
> +	spin_lock_irqsave(&pool->lock, flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);

Would it make more sense to do it the way alloc_local_tag() does?

	local_irq_save(flags);
	tags = this_cpu_ptr(pool->tag_cpu);
	spin_lock(&pool->lock);

> +
> +	if (!tags->nr_free) {
> +		/*
> +		 * Move tags from the global-, onto our percpu-freelist.
> +		 */
> +		move_tags(tags->freelist, &tags->nr_free,
> +			  pool->freelist, &pool->nr_free,
> +			  min(pool->nr_free, pool->percpu_batch_size));

'tags' is accessed lock-less here and races with steal_tags(). Would
double_lock() help here?

> +	}
> +
> +	spin_unlock(&pool->lock);

It is quite difficult to comment here, since there is no affected code
shown, but I will try nevertheless. steal_tags() used to be serialized
with 'pool->lock', which is removed now. As a result, two (or more)
concurrent threads could find and unset the very same bit in this piece
of code:

	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
	     cpus_have_tags; cpus_have_tags--) {
		cpu = cpumask_next(cpu, &pool->cpus_have_tags);	/* *** FIND */

		if (cpu >= nr_cpu_ids) {
			cpu = cpumask_first(&pool->cpus_have_tags);
			if (cpu >= nr_cpu_ids)
				BUG();
		}

		pool->cpu_last_stolen = cpu;
		remote = per_cpu_ptr(pool->tag_cpu, cpu);

		cpumask_clear_cpu(cpu, &pool->cpus_have_tags);	/* *** UNSET */
		...
	}

The second thread's cpumask_clear_cpu() races with cpumask_set_cpu() in
percpu_ida_free():

	if (nr_free == 1) {
		cpumask_set_cpu(smp_processor_id(),
				&pool->cpus_have_tags);
		/* here the bit is UNSET by the second concurrent thread */
		wake_up(&pool->wait);
	}

As a result, the woken thread would not find the illegitimately unset
bit.
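The `double_lock()` helper in the patch above relies on a classic deadlock-avoidance idiom: when two locks of the same class must be held, always take the lower-addressed one first. Below is a self-contained userspace rendition of that idiom, using a trivial C11 test-and-set spinlock in place of the kernel's `spinlock_t` (the kernel's `spin_lock_nested()` is only a lockdep annotation and has no userspace equivalent, so a plain lock is used for the second acquisition).

```c
#include <stdatomic.h>

typedef struct {
	atomic_flag locked;
} spinlock_t;

static void spin_lock(spinlock_t *l)
{
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		;	/* spin */
}

static void spin_unlock(spinlock_t *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Take both locks in address order.  Two CPUs concurrently stealing
 * tags from each other then agree on which lock comes first, so the
 * AB/BA deadlock cannot happen. */
static void double_lock(spinlock_t *l1, spinlock_t *l2)
{
	if (l1 > l2) {
		spinlock_t *tmp = l1;
		l1 = l2;
		l2 = tmp;
	}
	spin_lock(l1);
	spin_lock(l2);
}

static void double_unlock(spinlock_t *l1, spinlock_t *l2)
{
	spin_unlock(l1);
	spin_unlock(l2);	/* release order is irrelevant */
}
```

Without the ordering, one CPU stealing from another while that CPU steals back would take the same two locks in opposite orders, a textbook deadlock; ordering by address makes every path take them the same way.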
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Tue, Apr 22, 2014 at 01:56:29PM +0200, Peter Zijlstra wrote:
> I've not had time to revisit/finish them, but you should definitely
> clean up the percpu_ida stuff and reduce existing contention before
> going overboard.

Hi Peter,

I tried to combine your series into a single patch against 3.15-rc3.
I hope it looks like you intended and my comments in the follow-up
message make sense.

Thanks!

diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 93d145e..14f5151 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -51,6 +51,15 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
 	*dst_nr += nr;
 }
 
+static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	spin_lock(l1);
+	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
 /*
  * Try to steal tags from a remote cpu's percpu freelist.
  *
@@ -84,7 +93,7 @@ static inline void steal_tags(struct percpu_ida *pool,
 		if (remote == tags)
 			continue;
 
-		spin_lock(&remote->lock);
+		double_lock(&tags->lock, &remote->lock);
 
 		if (remote->nr_free) {
 			memcpy(tags->freelist,
@@ -96,36 +105,81 @@ static inline void steal_tags(struct percpu_ida *pool,
 		}
 
 		spin_unlock(&remote->lock);
+		spin_unlock(&tags->lock);
 
 		if (tags->nr_free)
 			break;
 	}
 }
 
-/*
- * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
- * our percpu freelist:
- */
-static inline void alloc_global_tags(struct percpu_ida *pool,
-				     struct percpu_ida_cpu *tags)
+static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
 {
-	move_tags(tags->freelist, &tags->nr_free,
-		  pool->freelist, &pool->nr_free,
-		  min(pool->nr_free, pool->percpu_batch_size));
+	int tag = -ENOSPC;
+
+	spin_lock(&tags->lock);
+	if (tags->nr_free)
+		tag = tags->freelist[--tags->nr_free];
+	spin_unlock(&tags->lock);
+
+	return tag;
 }
 
-static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
+static inline int alloc_local_tag(struct percpu_ida *pool)
 {
+	struct percpu_ida_cpu *tags;
+	unsigned long flags;
 	int tag = -ENOSPC;
 
+	local_irq_save(flags);
+	tags = this_cpu_ptr(pool->tag_cpu);
 	spin_lock(&tags->lock);
 	if (tags->nr_free)
 		tag = tags->freelist[--tags->nr_free];
-	spin_unlock(&tags->lock);
+	spin_unlock_irqrestore(&tags->lock, flags);
 
 	return tag;
 }
 
+static inline int alloc_global_tag(struct percpu_ida *pool)
+{
+	struct percpu_ida_cpu *tags;
+	unsigned long flags;
+	int tag = -ENOSPC;
+
+	spin_lock_irqsave(&pool->lock, flags);
+	tags = this_cpu_ptr(pool->tag_cpu);
+
+	if (!tags->nr_free) {
+		/*
+		 * Move tags from the global-, onto our percpu-freelist.
+		 */
+		move_tags(tags->freelist, &tags->nr_free,
+			  pool->freelist, &pool->nr_free,
+			  min(pool->nr_free, pool->percpu_batch_size));
+	}
+
+	spin_unlock(&pool->lock);
+
+	if (!tags->nr_free)
+		steal_tags(pool, tags);
+
+	if (tags->nr_free) {
+		spin_lock(&tags->lock);
+		if (tags->nr_free) {
+			tag = tags->freelist[--tags->nr_free];
+			if (tags->nr_free) {
+				cpumask_set_cpu(smp_processor_id(),
+						&pool->cpus_have_tags);
+			}
+		}
+		spin_unlock(&tags->lock);
+	}
+
+	local_irq_restore(flags);
+
+	return tag;
+}
+
 /**
  * percpu_ida_alloc - allocate a tag
  * @pool: pool to allocate from
@@ -146,66 +200,26 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
  */
 int percpu_ida_alloc(struct percpu_ida *pool, int state)
 {
-	DEFINE_WAIT(wait);
-	struct percpu_ida_cpu *tags;
-	unsigned long flags;
-	int tag;
-
-	local_irq_save(flags);
-	tags = this_cpu_ptr(pool->tag_cpu);
-
-	/* Fastpath */
-	tag = alloc_local_tag(tags);
-	if (likely(tag >= 0)) {
-		local_irq_restore(flags);
-		return tag;
-	}
-
-	while (1) {
-		spin_lock(&pool->lock);
-
-		/*
-		 * prepare_to_wait() must come before steal_tags(), in case
-		 * percpu_ida_free() on another cpu flips a bit in
-		 * cpus_have_tags
-		 *
-		 * global lock held and irqs disabled, don't need percpu lock
-		 */
-		if (state != TASK_RUNNING)
-			prepare_to_wait(&pool->wait, &wait, state);
-
-		if (!tags->nr_free)
-			alloc_global_tags(pool, tags);
-		if
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Tue, Apr 22, 2014 at 01:56:29PM +0200, Peter Zijlstra wrote:
> I've not had time to revisit/finish them, but you should definitely
> clean up the percpu_ida stuff and reduce existing contention before
> going overboard.

Hi Peter,

I tried to combine your series into a single patch against 3.15-rc3.
I hope it looks like you intended and that my comments in the follow-up
message make sense. Thanks!

diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 93d145e..14f5151 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -51,6 +51,15 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
 	*dst_nr += nr;
 }
 
+static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+	if (l1 > l2)
+		swap(l1, l2);
+
+	spin_lock(l1);
+	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
 /*
  * Try to steal tags from a remote cpu's percpu freelist.
  *
@@ -84,7 +93,7 @@ static inline void steal_tags(struct percpu_ida *pool,
 		if (remote == tags)
 			continue;
 
-		spin_lock(&remote->lock);
+		double_lock(&tags->lock, &remote->lock);
 
 		if (remote->nr_free) {
 			memcpy(tags->freelist,
@@ -96,36 +105,81 @@ static inline void steal_tags(struct percpu_ida *pool,
 		}
 
 		spin_unlock(&remote->lock);
+		spin_unlock(&tags->lock);
 
 		if (tags->nr_free)
 			break;
 	}
 }
 
-/*
- * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
- * our percpu freelist:
- */
-static inline void alloc_global_tags(struct percpu_ida *pool,
-				     struct percpu_ida_cpu *tags)
+static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
 {
-	move_tags(tags->freelist, &tags->nr_free,
-		  pool->freelist, &pool->nr_free,
-		  min(pool->nr_free, pool->percpu_batch_size));
+	int tag = -ENOSPC;
+
+	spin_lock(&tags->lock);
+	if (tags->nr_free)
+		tag = tags->freelist[--tags->nr_free];
+	spin_unlock(&tags->lock);
+
+	return tag;
 }
 
-static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
+static inline int alloc_local_tag(struct percpu_ida *pool)
 {
+	struct percpu_ida_cpu *tags;
+	unsigned long flags;
 	int tag = -ENOSPC;
 
+	local_irq_save(flags);
+	tags = this_cpu_ptr(pool->tag_cpu);
 	spin_lock(&tags->lock);
 	if (tags->nr_free)
 		tag = tags->freelist[--tags->nr_free];
-	spin_unlock(&tags->lock);
+	spin_unlock_irqrestore(&tags->lock, flags);
 
 	return tag;
 }
 
+static inline int alloc_global_tag(struct percpu_ida *pool)
+{
+	struct percpu_ida_cpu *tags;
+	unsigned long flags;
+	int tag = -ENOSPC;
+
+	spin_lock_irqsave(&pool->lock, flags);
+	tags = this_cpu_ptr(pool->tag_cpu);
+
+	if (!tags->nr_free) {
+		/*
+		 * Move tags from the global-, onto our percpu-freelist.
+		 */
+		move_tags(tags->freelist, &tags->nr_free,
+			  pool->freelist, &pool->nr_free,
+			  min(pool->nr_free, pool->percpu_batch_size));
+	}
+
+	spin_unlock(&pool->lock);
+
+	if (!tags->nr_free)
+		steal_tags(pool, tags);
+
+	if (tags->nr_free) {
+		spin_lock(&tags->lock);
+		if (tags->nr_free) {
+			tag = tags->freelist[--tags->nr_free];
+			if (tags->nr_free) {
+				cpumask_set_cpu(smp_processor_id(),
+						&pool->cpus_have_tags);
+			}
+		}
+		spin_unlock(&tags->lock);
+	}
+
+	local_irq_restore(flags);
+
+	return tag;
+}
+
 /**
  * percpu_ida_alloc - allocate a tag
  * @pool: pool to allocate from
@@ -146,66 +200,26 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
  */
 int percpu_ida_alloc(struct percpu_ida *pool, int state)
 {
-	DEFINE_WAIT(wait);
-	struct percpu_ida_cpu *tags;
-	unsigned long flags;
-	int tag;
-
-	local_irq_save(flags);
-	tags = this_cpu_ptr(pool->tag_cpu);
-
-	/* Fastpath */
-	tag = alloc_local_tag(tags);
-	if (likely(tag >= 0)) {
-		local_irq_restore(flags);
-		return tag;
-	}
-
-	while (1) {
-		spin_lock(&pool->lock);
-
-		/*
-		 * prepare_to_wait() must come before steal_tags(), in case
-		 * percpu_ida_free() on another cpu flips a bit in
-		 * cpus_have_tags
-		 *
-		 * global lock held and irqs disabled, don't need percpu lock
-		 */
-		if (state != TASK_RUNNING)
-			prepare_to_wait(&pool->wait, &wait, state);
-
-		if (!tags->nr_free)
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Thu, May 01, 2014 at 11:24:42PM +0200, Alexander Gordeev wrote:
> On Tue, Apr 22, 2014 at 01:56:29PM +0200, Peter Zijlstra wrote:
> > I've not had time to revisit/finish them, but you should definitely
> > clean up the percpu_ida stuff and reduce existing contention before
> > going overboard.
>
> Hi Peter,
>
> I tried to combine your series into a single patch against 3.15-rc3.
> I hope it looks like you intended and that my comments in this follow-up
> message make sense. Thanks!
>
> diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
> index 93d145e..14f5151 100644
> --- a/lib/percpu_ida.c
> +++ b/lib/percpu_ida.c
> @@ -51,6 +51,15 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
>  	*dst_nr += nr;
>  }
>  
> +static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
> +{
> +	if (l1 > l2)
> +		swap(l1, l2);
> +
> +	spin_lock(l1);
> +	spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
> +}
> +
>  /*
>   * Try to steal tags from a remote cpu's percpu freelist.
>   *
> @@ -84,7 +93,7 @@ static inline void steal_tags(struct percpu_ida *pool,
>  		if (remote == tags)
>  			continue;
>  
> -		spin_lock(&remote->lock);
> +		double_lock(&tags->lock, &remote->lock);
>  
>  		if (remote->nr_free) {
>  			memcpy(tags->freelist,
> @@ -96,36 +105,81 @@ static inline void steal_tags(struct percpu_ida *pool,
>  		}
>  
>  		spin_unlock(&remote->lock);
> +		spin_unlock(&tags->lock);
>  
>  		if (tags->nr_free)
>  			break;
>  	}
>  }
>  
> -/*
> - * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
> - * our percpu freelist:
> - */
> -static inline void alloc_global_tags(struct percpu_ida *pool,
> -				     struct percpu_ida_cpu *tags)
> +static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
>  {
> -	move_tags(tags->freelist, &tags->nr_free,
> -		  pool->freelist, &pool->nr_free,
> -		  min(pool->nr_free, pool->percpu_batch_size));
> +	int tag = -ENOSPC;
> +
> +	spin_lock(&tags->lock);
> +	if (tags->nr_free)
> +		tag = tags->freelist[--tags->nr_free];
> +	spin_unlock(&tags->lock);
> +
> +	return tag;
>  }
>  
> -static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
> +static inline int alloc_local_tag(struct percpu_ida *pool)
>  {
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
>  	int tag = -ENOSPC;
>  
> +	local_irq_save(flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);
>  	spin_lock(&tags->lock);
>  	if (tags->nr_free)
>  		tag = tags->freelist[--tags->nr_free];
> -	spin_unlock(&tags->lock);
> +	spin_unlock_irqrestore(&tags->lock, flags);
>  
>  	return tag;
>  }
>  
> +static inline int alloc_global_tag(struct percpu_ida *pool)
> +{
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
> +	int tag = -ENOSPC;
> +
> +	spin_lock_irqsave(&pool->lock, flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);

Would it make more sense to do it the way alloc_local_tag() does?

	local_irq_save(flags);
	tags = this_cpu_ptr(pool->tag_cpu);
	spin_lock(&pool->lock);

> +
> +	if (!tags->nr_free) {
> +		/*
> +		 * Move tags from the global-, onto our percpu-freelist.
> +		 */
> +		move_tags(tags->freelist, &tags->nr_free,
> +			  pool->freelist, &pool->nr_free,
> +			  min(pool->nr_free, pool->percpu_batch_size));

'tags' is accessed locklessly here and races with steal_tags(). Would
double_lock() help here?

> +	}
> +
> +	spin_unlock(&pool->lock);

It is quite difficult to comment here, since there is no affected code
shown, but I will try nevertheless.

So steal_tags() used to be serialized with 'pool->lock', which is now
removed. As a result, two (or more) concurrent threads could find and
unset the very same bit in this piece of code:

	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
	     cpus_have_tags; cpus_have_tags--) {
		cpu = cpumask_next(cpu, &pool->cpus_have_tags);	/* *** FIND */

		if (cpu >= nr_cpu_ids) {
			cpu = cpumask_first(&pool->cpus_have_tags);
			if (cpu >= nr_cpu_ids)
				BUG();
		}

		pool->cpu_last_stolen = cpu;
		remote = per_cpu_ptr(pool->tag_cpu, cpu);

		cpumask_clear_cpu(cpu, &pool->cpus_have_tags);	/* *** UNSET */
		...
	}

The second thread's cpumask_clear_cpu() races with cpumask_set_cpu() in
percpu_ida_free():

	if (nr_free == 1) {
		cpumask_set_cpu(smp_processor_id(),
				&pool->cpus_have_tags);
		/* here the bit is UNSET by the second concurrent thread */
		wake_up(&pool->wait);
	}

As a result, the woken thread would not find the illegitimately cleared
bit while looping over the mask in steal_tags().
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:
> On 04/29/2014 05:35 AM, Ming Lei wrote:
>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <ax...@kernel.dk> wrote:
>>> On 2014-04-25 18:01, Ming Lei wrote:
>>>> Hi Jens,
>>>>
>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <ax...@kernel.dk> wrote:
>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>>>
>>>>> Sorry, I did run it the other day. It has little to no effect here,
>>>>> but that's mostly because there's so much other crap going on in
>>>>> there. The most effective way to currently make it work better, is
>>>>> just to ensure the caching pool is of a sane size.
>>>>
>>>> Yes, that is just what the patch is doing, :-)
>>>
>>> But it's not enough.
>>
>> Yes, the patch is only for cases of multi hw queue and having
>> offline CPUs existed.
>>
>>> For instance, my test case, it's 255 tags and 64 CPUs.
>>> We end up in cross-cpu spinlock nightmare mode.
>>
>> IMO, the scaling problem for the above case might be
>> caused by either current percpu ida design or blk-mq's
>> usage on it.
>
> That is pretty much my claim, yes. Basically I don't think per-cpu tag
> caching is ever going to be the best solution for the combination of
> modern machines and the hardware that is out there (limited tags).

Sorry for not being more active in the discussion earlier, but anyways -
I'm in 100% agreement with this. Percpu freelists are _fundamentally_
only _useful_ when you don't need to be using all your available tags,
because percpu sharding requires wasting your tag space. I could write a
mathematical proof of this if I cared enough.

Otherwise what happens is on alloc failure you're touching all the other
cachelines every single time, and now you're bouncing _more_ cachelines
than if you just had a single global freelist.

So yeah, for small tag spaces just use a single simple bit vector on a
single cacheline.

BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that
changed when steal_tags() runs was fundamentally wrong and broken in
this respect, and should be reverted; whatever usage it was that
expected to be able to allocate the entire tag space was the problem.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 2014-05-01 16:47, Kent Overstreet wrote:
> On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:
>> On 04/29/2014 05:35 AM, Ming Lei wrote:
>>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <ax...@kernel.dk> wrote:
>>>> On 2014-04-25 18:01, Ming Lei wrote:
>>>>> Hi Jens,
>>>>>
>>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <ax...@kernel.dk> wrote:
>>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>>>>
>>>>>> Sorry, I did run it the other day. It has little to no effect here,
>>>>>> but that's mostly because there's so much other crap going on in
>>>>>> there. The most effective way to currently make it work better, is
>>>>>> just to ensure the caching pool is of a sane size.
>>>>>
>>>>> Yes, that is just what the patch is doing, :-)
>>>>
>>>> But it's not enough.
>>>
>>> Yes, the patch is only for cases of multi hw queue and having
>>> offline CPUs existed.
>>>
>>>> For instance, my test case, it's 255 tags and 64 CPUs.
>>>> We end up in cross-cpu spinlock nightmare mode.
>>>
>>> IMO, the scaling problem for the above case might be
>>> caused by either current percpu ida design or blk-mq's
>>> usage on it.
>>
>> That is pretty much my claim, yes. Basically I don't think per-cpu tag
>> caching is ever going to be the best solution for the combination of
>> modern machines and the hardware that is out there (limited tags).
>
> Sorry for not being more active in the discussion earlier, but anyways -
> I'm in 100% agreement with this. Percpu freelists are _fundamentally_
> only _useful_ when you don't need to be using all your available tags,
> because percpu sharding requires wasting your tag space. I could write a
> mathematical proof of this if I cared enough.
>
> Otherwise what happens is on alloc failure you're touching all the other
> cachelines every single time, and now you're bouncing _more_ cachelines
> than if you just had a single global freelist.
>
> So yeah, for small tag spaces just use a single simple bit vector on a
> single cacheline.

I've taken the consequence of this and implemented another tagging
scheme that blk-mq will use if it deems that percpu_ida isn't going to
be effective for the device being initialized. But I really hate to have
both of them in there. Unfortunately I have no devices available that
have a tag space that will justify using percpu_ida, so comparisons are
a bit hard at the moment. NVMe should change that, though, so the
decision will have to be deferred until that is tested.

> BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that
> changed when steal_tags() runs was fundamentally wrong and broken in
> this respect, and should be reverted, whatever usage it was that was
> expecting to be able to allocate the entire tag space was the problem.

It's hard to blame Shaohua, and I helped push that. It was an attempt at
making percpu_ida actually useful for what blk-mq needs it for, and
being the primary user of it, it was definitely worth doing. A tagging
scheme that requires the tag space to be effectively sparse/huge to be
fast is not a good generic tagging algorithm.

--
Jens Axboe
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
> On 2014-05-01 16:47, Kent Overstreet wrote:
>> On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:
>>> On 04/29/2014 05:35 AM, Ming Lei wrote:
>>>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <ax...@kernel.dk> wrote:
>>>>> On 2014-04-25 18:01, Ming Lei wrote:
>>>>>> Hi Jens,
>>>>>>
>>>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <ax...@kernel.dk> wrote:
>>>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>>>>>
>>>>>>> Sorry, I did run it the other day. It has little to no effect here,
>>>>>>> but that's mostly because there's so much other crap going on in
>>>>>>> there. The most effective way to currently make it work better, is
>>>>>>> just to ensure the caching pool is of a sane size.
>>>>>>
>>>>>> Yes, that is just what the patch is doing, :-)
>>>>>
>>>>> But it's not enough.
>>>>
>>>> Yes, the patch is only for cases of multi hw queue and having
>>>> offline CPUs existed.
>>>>
>>>>> For instance, my test case, it's 255 tags and 64 CPUs.
>>>>> We end up in cross-cpu spinlock nightmare mode.
>>>>
>>>> IMO, the scaling problem for the above case might be
>>>> caused by either current percpu ida design or blk-mq's
>>>> usage on it.
>>>
>>> That is pretty much my claim, yes. Basically I don't think per-cpu tag
>>> caching is ever going to be the best solution for the combination of
>>> modern machines and the hardware that is out there (limited tags).
>>
>> Sorry for not being more active in the discussion earlier, but anyways -
>> I'm in 100% agreement with this. Percpu freelists are _fundamentally_
>> only _useful_ when you don't need to be using all your available tags,
>> because percpu sharding requires wasting your tag space. I could write a
>> mathematical proof of this if I cared enough.
>>
>> Otherwise what happens is on alloc failure you're touching all the other
>> cachelines every single time, and now you're bouncing _more_ cachelines
>> than if you just had a single global freelist.
>>
>> So yeah, for small tag spaces just use a single simple bit vector on a
>> single cacheline.
>
> I've taken the consequence of this and implemented another tagging
> scheme that blk-mq will use if it deems that percpu_ida isn't going to
> be effective for the device being initialized. But I really hate to have
> both of them in there. Unfortunately I have no devices available that
> have a tag space that will justify using percpu_ida, so comparisons are
> a bit hard at the moment. NVMe should change that, though, so the
> decision will have to be deferred until that is tested.

Yeah, I agree that is annoying. I've thought about the issue too, though,
and I haven't been able to come up with any better ideas myself.

A given driver probably should be able to always use one or the other
though, so we shouldn't _need_ a runtime conditional because of this,
though structuring the code that way might be more trouble than it's
worth, from my vague recollection of what blk-mq looks like... (I've
actually been fighting with unrelated issues at a very similar layer of
abstraction, it's quite annoying.)

>> BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that
>> changed when steal_tags() runs was fundamentally wrong and broken in
>> this respect, and should be reverted, whatever usage it was that was
>> expecting to be able to allocate the entire tag space was the problem.
>
> It's hard to blame Shaohua, and I helped push that. It was an attempt in
> making percpu_ida actually useful for what blk-mq needs it for, and
> being the primary user of it, it was definitely worth doing. A tagging
> scheme that requires the tag space to be effectively sparse/huge to be
> fast is not a good generic tagging algorithm.

Yeah, it was definitely not an unreasonable attempt, and it's probably my
own fault for not speaking up louder at the time (I can't remember how
much I commented at the time).
Ah well, irritating problem space :)
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 2014-05-01 20:38, Kent Overstreet wrote:
> On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
>> On 2014-05-01 16:47, Kent Overstreet wrote:
>>> On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:
>>>> On 04/29/2014 05:35 AM, Ming Lei wrote:
>>>>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <ax...@kernel.dk> wrote:
>>>>>> On 2014-04-25 18:01, Ming Lei wrote:
>>>>>>> Hi Jens,
>>>>>>>
>>>>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <ax...@kernel.dk> wrote:
>>>>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>>>>>>
>>>>>>>> Sorry, I did run it the other day. It has little to no effect
>>>>>>>> here, but that's mostly because there's so much other crap going
>>>>>>>> on in there. The most effective way to currently make it work
>>>>>>>> better, is just to ensure the caching pool is of a sane size.
>>>>>>>
>>>>>>> Yes, that is just what the patch is doing, :-)
>>>>>>
>>>>>> But it's not enough.
>>>>>
>>>>> Yes, the patch is only for cases of multi hw queue and having
>>>>> offline CPUs existed.
>>>>>
>>>>>> For instance, my test case, it's 255 tags and 64 CPUs.
>>>>>> We end up in cross-cpu spinlock nightmare mode.
>>>>>
>>>>> IMO, the scaling problem for the above case might be
>>>>> caused by either current percpu ida design or blk-mq's
>>>>> usage on it.
>>>>
>>>> That is pretty much my claim, yes. Basically I don't think per-cpu
>>>> tag caching is ever going to be the best solution for the combination
>>>> of modern machines and the hardware that is out there (limited tags).
>>>
>>> Sorry for not being more active in the discussion earlier, but anyways
>>> - I'm in 100% agreement with this. Percpu freelists are
>>> _fundamentally_ only _useful_ when you don't need to be using all your
>>> available tags, because percpu sharding requires wasting your tag
>>> space. I could write a mathematical proof of this if I cared enough.
>>>
>>> Otherwise what happens is on alloc failure you're touching all the
>>> other cachelines every single time, and now you're bouncing _more_
>>> cachelines than if you just had a single global freelist.
>>>
>>> So yeah, for small tag spaces just use a single simple bit vector on a
>>> single cacheline.
>>
>> I've taken the consequence of this and implemented another tagging
>> scheme that blk-mq will use if it deems that percpu_ida isn't going to
>> be effective for the device being initialized. But I really hate to
>> have both of them in there. Unfortunately I have no devices available
>> that have a tag space that will justify using percpu_ida, so
>> comparisons are a bit hard at the moment. NVMe should change that,
>> though, so the decision will have to be deferred until that is tested.
>
> Yeah, I agree that is annoying. I've thought about the issue too, though,
> and I haven't been able to come up with any better ideas myself.

I have failed in that area, too, and it's not for lack of thinking about
it and experimenting. So hence a new solution was thought up, based on a
lot of userland prototyping and testing. All things considered, two
solutions are better than no solution.

> A given driver probably should be able to always use one or the other
> though, so we shouldn't _need_ a runtime conditional because of this,
> though structuring the code that way might be more trouble than it's
> worth, from my vague recollection of what blk-mq looks like...

It's completely runtime conditional at this point, and I'm not sure how
not to make it so. This is transparent to drivers; they should not care
about what kind of tagging scheme is used. If we present that, then we
have failed. The runtime conditional is still better than a function
pointer, though, so it'll likely stay that way for now. So the entry
points now all look like this:

	if (use_new_scheme)
		ret = new_foo();
	else
		ret = foo();

At least it should branch predict really well :-)

> (I've actually been fighting with unrelated issues at a very similar
> layer of abstraction, it's quite annoying.)
>
>>> BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that
>>> changed when steal_tags() runs was fundamentally wrong and broken in
>>> this respect, and should be reverted, whatever usage it was that was
>>> expecting to be able to allocate the entire tag space was the problem.
>>
>> It's hard to blame Shaohua, and I helped push that. It was an attempt
>> in making percpu_ida actually useful for what blk-mq needs it for, and
>> being the primary user of it, it was definitely worth doing. A tagging
>> scheme that requires the tag space to be effectively sparse/huge to be
>> fast is not a good generic tagging algorithm.
>
> Yeah, it was definitely not an unreasonable attempt, and it's probably
> my own fault for not speaking up louder at the time (I can't remember
> how much I commented at the time).
>
> Ah well, irritating problem space :)

Not a problem. I think the main failure here is that we have been coming
at this with clashing expectations of what the requirements are. And a
further issue is wanting to cling to the percpu_ida tags on my end,
thinking it could be made to work.

--
Jens Axboe
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
> I've taken the consequence of this and implemented another tagging
> scheme that blk-mq will use if it deems that percpu_ida isn't going to
> be effective for the device being initialized. But I really hate to have
> both of them in there. Unfortunately I have no devices available that
> have a tag space that will justify using percpu_ida, so comparisons are
> a bit hard at the moment. NVMe should change that, though, so the
> decision will have to be deferred until that is tested.

At least for SCSI devices, _tag space_ is plenty; it's just that we
artificially limit our tag space to the queue depth to avoid having to
track that one separately. In addition, we also preallocate a request
for each tag, so even if we would track the queue depth separately we
would waste a lot of memory.
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Wed, Apr 30, 2014 at 5:13 AM, Jens Axboe wrote: > On 04/29/2014 05:35 AM, Ming Lei wrote: >> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe wrote: >>> On 2014-04-25 18:01, Ming Lei wrote: Hi Jens, On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe wrote: > > On 04/25/2014 03:10 AM, Ming Lei wrote: > > Sorry, I did run it the other day. It has little to no effect here, but > that's mostly because there's so much other crap going on in there. The > most effective way to currently make it work better, is just to ensure > the caching pool is of a sane size. Yes, that is just what the patch is doing, :-) >>> >>> >>> But it's not enough. >> >> Yes, the patch is only for cases of mutli hw queue and having >> offline CPUs existed. >> >>> For instance, my test case, it's 255 tags and 64 CPUs. >>> We end up in cross-cpu spinlock nightmare mode. >> >> IMO, the scaling problem for the above case might be >> caused by either current percpu ida design or blk-mq's >> usage on it. > > That is pretty much my claim, yes. Basically I don't think per-cpu tag > caching is ever going to be the best solution for the combination of > modern machines and the hardware that is out there (limited tags). > >> One of problems in blk-mq is that the 'set->queue_depth' >> parameter from driver isn't scalable, maybe it is reasonable to >> introduce 'set->min_percpu_cache', then ' tags->nr_max_cache' >> can be computed as below: >> >> max(nr_tags / hctx->nr_ctx, set->min_percpu_cache) >> >> Another problem in blk-mq is that if it can be improved by computing >> tags->nr_max_cache as 'nr_tags / hctx->nr_ctx' ? The current >> approach should be based on that there are parallel I/O >> activity on each CPU, but I am wondering if it is the common >> case in reality. Suppose there are N(N << online CPUs in >> big machine) concurrent I/O on some of CPUs, percpu cache >> can be increased a lot by (nr_tags / N). 
> That would certainly help the common case, but it'd still be slow for
> the cases where you DO have IO from lots of sources.

It may be difficult to figure out an efficient solution for the unusual
case.

> If we consider 8-16 tags the minimum for balanced performance, then that
> doesn't take a whole lot of CPUs to spread out the tag space. Just
> looking at a case today on SCSI with 62 tags. AHCI and friends have 31
> tags. Even for the "bigger" case of the Micron card, you still only have
> 255 active tags. And we probably want to split that up into groups of
> 32, making the problem even worse.

Yes, that is a contradiction between having limited hardware queue tags
and requiring more local cpu cache. But it is really a challenge to
figure out an efficient approach when lots of CPUs need to contend for
very limited resources.

Maybe blk-mq can learn from network devices: move the hw queue_depth
constraint into a lower level (such as blk_mq_run_hw_queue()), and keep
adequate tags in the percpu pool, which means nr_tags of the percpu pool
can be much bigger than queue_depth for keeping enough percpu cache.
When the hw queue is full, congestion control can be applied in
blk_mq_alloc_request() to avoid the cross-cpu spinlock nightmare in
percpu allocation/free. But if the device requires tags to be less than
queue_depth, more work is needed for this approach.

Thanks,
--
Ming Lei
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 04/29/2014 05:35 AM, Ming Lei wrote: > On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe wrote: >> On 2014-04-25 18:01, Ming Lei wrote: >>> >>> Hi Jens, >>> >>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe wrote: On 04/25/2014 03:10 AM, Ming Lei wrote: Sorry, I did run it the other day. It has little to no effect here, but that's mostly because there's so much other crap going on in there. The most effective way to currently make it work better, is just to ensure the caching pool is of a sane size. >>> >>> >>> Yes, that is just what the patch is doing, :-) >> >> >> But it's not enough. > > Yes, the patch is only for cases of mutli hw queue and having > offline CPUs existed. > >> For instance, my test case, it's 255 tags and 64 CPUs. >> We end up in cross-cpu spinlock nightmare mode. > > IMO, the scaling problem for the above case might be > caused by either current percpu ida design or blk-mq's > usage on it. That is pretty much my claim, yes. Basically I don't think per-cpu tag caching is ever going to be the best solution for the combination of modern machines and the hardware that is out there (limited tags). > One of problems in blk-mq is that the 'set->queue_depth' > parameter from driver isn't scalable, maybe it is reasonable to > introduce 'set->min_percpu_cache', then ' tags->nr_max_cache' > can be computed as below: > > max(nr_tags / hctx->nr_ctx, set->min_percpu_cache) > > Another problem in blk-mq is that if it can be improved by computing > tags->nr_max_cache as 'nr_tags / hctx->nr_ctx' ? The current > approach should be based on that there are parallel I/O > activity on each CPU, but I am wondering if it is the common > case in reality. Suppose there are N(N << online CPUs in > big machine) concurrent I/O on some of CPUs, percpu cache > can be increased a lot by (nr_tags / N). That would certainly help the common case, but it'd still be slow for the cases where you DO have IO from lots of sources. 
If we consider 8-16 tags the minimum for balanced performance, then that
doesn't take a whole lot of CPUs to spread out the tag space. Just
looking at a case today on SCSI with 62 tags. AHCI and friends have 31
tags. Even for the "bigger" case of the Micron card, you still only have
255 active tags. And we probably want to split that up into groups of
32, making the problem even worse.

>> That's what I did, essentially. Ensuring that the percpu_max_size is at
>> least 8 makes it a whole lot better here. But still slower than a
>> regular simple bitmap, which makes me sad. A fairly straightforward
>> cmpxchg based scheme I tested here is around 20% faster than the bitmap
>> approach on a basic desktop machine, and around 35% faster on a
>> 4-socket. Outside of NVMe, I can't think of cases where that approach
>> would not be faster than percpu_ida. That means all of SCSI, basically,
>> and the basic block drivers.
>
> If percpu_ida wants to beat bitmap allocation, the local cache hit
> ratio has to stay high; in my tests, that can be achieved with enough
> local cache size.

Yes, that is exactly the issue: the local cache hit rate must be high,
and you pretty much need a higher local cache count for that. And
therein lies the problem, you can't get that high local cache size for
most common cases. With enough tags we could, but that's not what most
people will run.

--
Jens Axboe
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe wrote: > On 2014-04-25 18:01, Ming Lei wrote: >> >> Hi Jens, >> >> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe wrote: >>> >>> On 04/25/2014 03:10 AM, Ming Lei wrote: >>> >>> Sorry, I did run it the other day. It has little to no effect here, but >>> that's mostly because there's so much other crap going on in there. The >>> most effective way to currently make it work better, is just to ensure >>> the caching pool is of a sane size. >> >> >> Yes, that is just what the patch is doing, :-) > > > But it's not enough. Yes, the patch is only for the multi hw queue case with offline CPUs present. > For instance, my test case, it's 255 tags and 64 CPUs. > We end up in cross-cpu spinlock nightmare mode. IMO, the scaling problem for the above case might be caused by either current percpu ida design or blk-mq's usage on it. One of the problems in blk-mq is that the 'set->queue_depth' parameter from the driver isn't scalable; maybe it is reasonable to introduce 'set->min_percpu_cache', then 'tags->nr_max_cache' can be computed as below: max(nr_tags / hctx->nr_ctx, set->min_percpu_cache) Another question is whether blk-mq can be improved by computing tags->nr_max_cache as 'nr_tags / hctx->nr_ctx'. The current approach assumes parallel I/O activity on each CPU, but I wonder if that is the common case in reality. Suppose there are N (N << number of online CPUs in a big machine) concurrent I/Os on some of the CPUs; the percpu cache can be increased a lot, to (nr_tags / N). > > >> From percpu_ida view, it is easy to observe it can improve >> allocation performance. I have several patches to export >> this information by sysfs for monitoring percpu_ida >> performance. > > > Sounds good! If we need to export percpu_ida allocation/free information via sysfs for monitoring performance, percpu ida needs a parent kobject, which means it may be better not to allocate percpu_ida tags until the sw/hw queues are initialized, like the patch does. 
> > >>> I've got an alternative tagging scheme that I think would be useful for >>> the cases where the tag space to cpu ratio isn't big enough. So I think >>> we'll retain percpu_ida for the cases where it can cache enough, and >>> punt to an alternative scheme when not. >> >> OK, care to comment on the patch or the idea of setting percpu cache >> size as (nr_tags / hctx->nr_ctx)? > > I think it's a good idea. The problem is that for percpu_ida to be > effective, you need a bigger cache than the 3 I'd get above. If that isn't > the case, it performs poorly. I'm just not convinced the design can ever > work in the realm of realistic queue depths. > > > >>> That doesn't mean we should not improve percpu_ida. There's quite a bit >>> of low hanging fruit in there. >> >> IMO percpu_max_size in percpu_ida is very important for the >> performance, and it might need to be adjusted dynamically according >> to the percpu allocation load, but that is far more complicated >> to implement. Fixing the parameter before percpu_ida_init() might >> be the simplest way. > > > That's what I did, essentially. Ensuring that the percpu_max_size is at > least 8 makes it a whole lot better here. But still slower than a regular > simple bitmap, which makes me sad. A fairly straightforward cmpxchg based > scheme I tested here is around 20% faster than the bitmap approach on a > basic desktop machine, and around 35% faster on a 4-socket. Outside of NVMe, > I can't think of cases where that approach would not be faster than > percpu_ida. That means all of SCSI, basically, and the basic block drivers. If percpu_ida wants to beat bitmap allocation, the local cache hit ratio has to stay high; in my tests, that can be achieved with a large enough local cache. 
Thanks, -- Ming Lei 
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 2014-04-25 18:01, Ming Lei wrote: Hi Jens, On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe wrote: On 04/25/2014 03:10 AM, Ming Lei wrote: Sorry, I did run it the other day. It has little to no effect here, but that's mostly because there's so much other crap going on in there. The most effective way to currently make it work better, is just to ensure the caching pool is of a sane size. Yes, that is just what the patch is doing, :-) But it's not enough. For instance, my test case, it's 255 tags and 64 CPUs. We end up in cross-cpu spinlock nightmare mode. From percpu_ida view, it is easy to observe it can improve allocation performance. I have several patches to export these information by sysfs for monitoring percpu_ida performance. Sounds good! I've got an alternative tagging scheme that I think would be useful for the cases where the tag space to cpu ratio isn't big enough. So I think we'll retain percpu_ida for the cases where it can cache enough, and punt to an alternative scheme when not. OK, care to comment on the patch or the idea of setting percpu cache size as (nr_tags / hctx->nr_ctx)? I think it's a good idea. The problem is that for percpu_ida to be effective, you need a bigger cache than the 3 I'd get above. If that isn't the case, it performs poorly. I'm just not convinced the design can ever work in the realm of realistic queue depths. That doesn't mean we should not improve percpu_ida. There's quite a bit of low hanging fruit in there. IMO percpu_max_size in percpu_ida is very important for the performance, and it might need to adjust dynamically according to the percpu allocation loading, but it is far more complicated to implement. And it might be the simplest way to fix the parameter before percpu_ida_init(). That's what I did, essentially. Ensuring that the percpu_max_size is at least 8 makes it a whole lot better here. But still slower than a regular simple bitmap, which makes me sad. 
A fairly straightforward cmpxchg based scheme I tested here is around 20% faster than the bitmap approach on a basic desktop machine, and around 35% faster on a 4-socket. Outside of NVMe, I can't think of cases where that approach would not be faster than percpu_ida. That means all of SCSI, basically, and the basic block drivers. -- Jens Axboe 
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
Hi Jens, On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe wrote: > On 04/25/2014 03:10 AM, Ming Lei wrote: > > Sorry, I did run it the other day. It has little to no effect here, but > that's mostly because there's so much other crap going on in there. The > most effective way to currently make it work better, is just to ensure > the caching pool is of a sane size. Yes, that is just what the patch is doing, :-) From percpu_ida view, it is easy to observe it can improve allocation performance. I have several patches to export this information by sysfs for monitoring percpu_ida performance. > > I've got an alternative tagging scheme that I think would be useful for > the cases where the tag space to cpu ratio isn't big enough. So I think > we'll retain percpu_ida for the cases where it can cache enough, and > punt to an alternative scheme when not. OK, care to comment on the patch or the idea of setting percpu cache size as (nr_tags / hctx->nr_ctx)? > > That doesn't mean we should not improve percpu_ida. There's quite a bit > of low hanging fruit in there. IMO percpu_max_size in percpu_ida is very important for the performance, and it might need to be adjusted dynamically according to the percpu allocation load, but that is far more complicated to implement. Fixing the parameter before percpu_ida_init() might be the simplest way. Thanks, -- Ming Lei 
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 04/25/2014 03:10 AM, Ming Lei wrote: > Hi Jens, > > On Wed, Apr 23, 2014 at 9:25 AM, Jens Axboe wrote: >> On 2014-04-22 18:53, Ming Lei wrote: >> >> >>> In my null_blk test on a quad core SMP VM: >>> >>> - 4 hw queue >>> - timer mode >>> >>> With the above approach, tag allocation from local CPU can be >>> improved from: >>> >>> 5% -> 50% for boot CPU >>> 30% -> 90% for non-boot CPU. >>> >>> If no one objects to the idea, I'd like to post a patch for review. >> >> >> Sent it out, that can't hurt. I'll take a look at it, and give it a test >> spin as well. > > I have sent out the patch at the link below, and would much appreciate > it if you could take a look: > > marc.info/?l=linux-kernel&m=139826944613225&w=2 Sorry, I did run it the other day. It has little to no effect here, but that's mostly because there's so much other crap going on in there. The most effective way to currently make it work better, is just to ensure the caching pool is of a sane size. I've got an alternative tagging scheme that I think would be useful for the cases where the tag space to cpu ratio isn't big enough. So I think we'll retain percpu_ida for the cases where it can cache enough, and punt to an alternative scheme when not. That doesn't mean we should not improve percpu_ida. There's quite a bit of low hanging fruit in there. -- Jens Axboe 
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
Hi Jens, On Wed, Apr 23, 2014 at 9:25 AM, Jens Axboe wrote: > On 2014-04-22 18:53, Ming Lei wrote: > > >> In my null_blk test on a quad core SMP VM: >> >> - 4 hw queue >> - timer mode >> >> With the above approach, tag allocation from local CPU can be >> improved from: >> >> 5% -> 50% for boot CPU >> 30% -> 90% for non-boot CPU. >> >> If no one objects to the idea, I'd like to post a patch for review. > > > Sent it out, that can't hurt. I'll take a look at it, and give it a test > spin as well. I have sent out the patch at the link below, and would much appreciate it if you could take a look: marc.info/?l=linux-kernel&m=139826944613225&w=2 Thanks, -- Ming Lei 
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 2014-04-22 18:53, Ming Lei wrote: Hi Jens, On Tue, Apr 22, 2014 at 11:57 PM, Jens Axboe wrote: On 04/22/2014 08:03 AM, Jens Axboe wrote: On 2014-04-22 01:10, Alexander Gordeev wrote: On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote: But other systems (more dense?) showed increased cache-hit rate up to 20%, i.e. this one: Hello Gentlemen, Any feedback on this? Sorry for dropping the ball on this. Improvements wrt when to steal, how much, and from whom are sorely needed in percpu_ida. I'll do a bench with this on a system that currently falls apart with it. Ran some quick numbers with three kernels: stock 3.15-rc2 limit 3.15-rc2 + steal limit patch (attached) I am thinking/working on this sort of improvement too, but my idea is to compute tags->nr_max_cache as below: nr_tags / hctx->max_nr_ctx hctx->max_nr_ctx means the max sw queues mapped to the hw queue, which needs to be introduced in the approach; actually, the value should represent the CPU topology info. It is a bit complicated to compute hctx->max_nr_ctx because we need to take into account CPU hotplug and a probable user-defined mapping callback. We can always just update the caching info, that's not a big problem. We update the mappings on those events anyway. If a user-defined mapping callback needn't be considered, hctx->max_nr_ctx can be figured out before mapping sw queues in blk_mq_init_queue() by supposing each CPU is online first; once that is done, the map for offline CPUs is cleared, then blk_mq_map_swqueue() is called. I don't see how a user defined mapping would change things a whole lot. It's just another point of updating the cache. Besides, user defined mappings will be mostly (only?) for things like multiqueue, where the caching info would likely remain static over a reconfigure. 
In my null_blk test on a quad core SMP VM: - 4 hw queue - timer mode With the above approach, tag allocation from local CPU can be improved from: 5% -> 50% for boot CPU 30% -> 90% for non-boot CPU. If no one objects to the idea, I'd like to post a patch for review. Sent it out, that can't hurt. I'll take a look at it, and give it a test spin as well. -- Jens Axboe 
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
Hi Jens, On Tue, Apr 22, 2014 at 11:57 PM, Jens Axboe wrote: > On 04/22/2014 08:03 AM, Jens Axboe wrote: >> On 2014-04-22 01:10, Alexander Gordeev wrote: >>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote: But other systems (more dense?) showed increased cache-hit rate up to 20%, i.e. this one: >>> >>> Hello Gentlemen, >>> >>> Any feedback on this? >> >> Sorry for dropping the ball on this. Improvements wrt when to steal, how >> much, and from whom are sorely needed in percpu_ida. I'll do a bench >> with this on a system that currently falls apart with it. > > Ran some quick numbers with three kernels: > > stock 3.15-rc2 > limit 3.15-rc2 + steal limit patch (attached) I am thinking/working on this sort of improvement too, but my idea is to compute tags->nr_max_cache as below: nr_tags / hctx->max_nr_ctx hctx->max_nr_ctx means the max sw queues mapped to the hw queue, which needs to be introduced in the approach; actually, the value should represent the CPU topology info. It is a bit complicated to compute hctx->max_nr_ctx because we need to take into account CPU hotplug and a probable user-defined mapping callback. If a user-defined mapping callback needn't be considered, hctx->max_nr_ctx can be figured out before mapping sw queues in blk_mq_init_queue() by supposing each CPU is online first; once that is done, the map for offline CPUs is cleared, then blk_mq_map_swqueue() is called. In my null_blk test on a quad core SMP VM: - 4 hw queue - timer mode With the above approach, tag allocation from local CPU can be improved from: 5% -> 50% for boot CPU 30% -> 90% for non-boot CPU. If no one objects to the idea, I'd like to post a patch for review. Thanks, -- Ming Lei 
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote: > Hello, > > This series is against 3.14.0-rc7. > > It is aimed to further improve 'percpu_ida' tags locality by taking > into account the system's CPU topology when stealing tags. That is, try > to steal from a CPU which is 'closest' to the stealing one. > > I would not bother to post this, since on several systems the change > did not show any improvement, e.g. this one: There is much lower level fruit to be had before doing something like this. https://lkml.org/lkml/2014/1/23/257 https://lkml.org/lkml/2014/1/23/329 https://lkml.org/lkml/2014/1/23/343 https://lkml.org/lkml/2014/1/23/354 I've not had time to revisit/finish them, but you should definitely clean up the percpu_ida stuff and reduce existing contention before going overboard. 
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 04/22/2014 08:03 AM, Jens Axboe wrote: > On 2014-04-22 01:10, Alexander Gordeev wrote: >> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote: >>> But other systems (more dense?) showed increased cache-hit rate >>> up to 20%, i.e. this one: >> >> Hello Gentlemen, >> >> Any feedback on this? > > Sorry for dropping the ball on this. Improvements wrt when to steal, how > much, and from whom are sorely needed in percpu_ida. I'll do a bench > with this on a system that currently falls apart with it. Ran some quick numbers with three kernels:

stock     3.15-rc2
limit     3.15-rc2 + steal limit patch (attached)
limit+ag  3.15-rc2 + steal limit + your topology patch

Two tests were run - the device has an effective queue depth limit of 255, so I ran one test at QD=248 (low) and one at QD=512 (high) to both exercise near-limit depth and over limit depth. 8 processes were used, split into two groups. One group would always run on the local node, the other would be run on the adjacent node (near) or on the far node (far).

Near + low
----------
          IOPS      sys time
stock     1009.5K   55.78%
limit     1084.4K   54.47%
limit+ag  1058.1K   52.42%

Near + high
-----------
          IOPS      sys time
stock     949.1K    75.12%
limit     980.7K    64.74%
limit+ag  1010.1K   70.27%

Far + low
---------
          IOPS      sys time
stock     600.0K    72.28%
limit     761.7K    71.17%
limit+ag  762.5K    74.48%

Far + high
----------
          IOPS      sys time
stock     465.9K    91.66%
limit     716.2K    88.68%
limit+ag  758.0K    91.00%

One huge issue on this box is that it's a 4 socket/node machine, with 32 cores (64 threads). Combined with a 255 queue depth limit, the percpu caching does not work well. I did not include stock+ag results, they didn't change things very much for me. We simply have to limit the stealing first, or we're still going to be hammering on percpu locks. If we compare the top profiles from stock-far-high and limit+ag-far-high, it looks pretty scary. 
Here's the stock one:

-  50,84%  fio  [kernel.kallsyms]  _raw_spin_lock
   + 89,83% percpu_ida_alloc
   +  6,03% mtip_queue_rq
   +  2,90% percpu_ida_free

so 50% of the system time spent acquiring a spinlock, with 90% of that being percpu_ida. The limit+ag variant looks like this:

-  32,93%  fio  [kernel.kallsyms]  _raw_spin_lock
   + 78,35% percpu_ida_alloc
   + 19,49% mtip_queue_rq
   +  1,21% __blk_mq_run_hw_queue

which is still pretty horrid and has plenty of room for improvement. I think we need to make better decisions on the granularity of the tag caching. If we ignore thread siblings, that'll double our effective caching. If that's still not enough, I bet per-node/socket would be a huge improvement.

--
Jens Axboe

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7a799c4..689bbaf 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -109,6 +109,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 {
 	unsigned int nr_tags, nr_cache;
 	struct blk_mq_tags *tags;
+	unsigned int num_cpus;
 	int ret;
 
 	if (total_tags > BLK_MQ_TAG_MAX) {
@@ -121,7 +122,8 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 		return NULL;
 
 	nr_tags = total_tags - reserved_tags;
-	nr_cache = nr_tags / num_possible_cpus();
+	num_cpus = min(8U, num_online_cpus());
+	nr_cache = nr_tags / num_cpus;
 
 	if (nr_cache < BLK_MQ_TAG_CACHE_MIN)
 		nr_cache = BLK_MQ_TAG_CACHE_MIN;
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 2014-04-22 01:10, Alexander Gordeev wrote:
> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
>> But other systems (more dense?) showed increased cache-hit rate
>> up to 20%, i.e. this one:
>
> Hello Gentlemen,
>
> Any feedback on this?

Sorry for dropping the ball on this. Improvements wrt when to steal, how much, and from whom are sorely needed in percpu_ida. I'll do a bench with this on a system that currently falls apart with it.

--
Jens Axboe
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
> But other systems (more dense?) showed increased cache-hit rate
> up to 20%, i.e. this one:

Hello Gentlemen,

Any feedback on this?

Thanks!

--
Regards,
Alexander Gordeev
agord...@redhat.com
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
Hi Jens,

On Tue, Apr 22, 2014 at 11:57 PM, Jens Axboe <ax...@kernel.dk> wrote:
> On 04/22/2014 08:03 AM, Jens Axboe wrote:
>> On 2014-04-22 01:10, Alexander Gordeev wrote:
>>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
>>>> But other systems (more dense?) showed increased cache-hit rate
>>>> up to 20%, i.e. this one:
>>>
>>> Hello Gentlemen,
>>>
>>> Any feedback on this?
>>
>> Sorry for dropping the ball on this. Improvements wrt when to steal,
>> how much, and from whom are sorely needed in percpu_ida. I'll do a
>> bench with this on a system that currently falls apart with it.
>
> Ran some quick numbers with three kernels:
>
> stock     3.15-rc2
> limit     3.15-rc2 + steal limit patch (attached)

I am thinking/working on this sort of improvement too, but my idea is to compute tags->nr_max_cache as:

    nr_tags / hctx->max_nr_ctx

hctx->max_nr_ctx means the max number of sw queues mapped to the hw queue, which needs to be introduced in the approach; effectively, the value should represent the CPU topology info.

It is a bit complicated to compute hctx->max_nr_ctx because we need to take into account CPU hotplug and a probable user-defined mapping callback.

If the user-defined mapping callback needn't be considered, hctx->max_nr_ctx can be figured out before mapping sw queues in blk_mq_init_queue() by first supposing each CPU is online; once that is done, the map for offline CPUs is cleared, then blk_mq_map_swqueue() is called.

In my null_blk test on a quad core SMP VM:

- 4 hw queues
- timer mode

With the above approach, tag allocation from the local CPU can be improved from:

  5% -> 50% for the boot CPU
  30% -> 90% for non-boot CPUs

If no one objects to the idea, I'd like to post a patch for review.

Thanks,
--
Ming Lei
Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
On 2014-04-22 18:53, Ming Lei wrote:
> Hi Jens,
>
> [...]
>
> I am thinking/working on this sort of improvement too, but my idea is
> to compute tags->nr_max_cache as:
>
>     nr_tags / hctx->max_nr_ctx
>
> hctx->max_nr_ctx means the max number of sw queues mapped to the hw
> queue, which needs to be introduced in the approach; effectively, the
> value should represent the CPU topology info.
>
> It is a bit complicated to compute hctx->max_nr_ctx because we need to
> take into account CPU hotplug and a probable user-defined mapping
> callback.

We can always just update the caching info, that's not a big problem. We update the mappings on those events anyway.

> If the user-defined mapping callback needn't be considered,
> hctx->max_nr_ctx can be figured out before mapping sw queues in
> blk_mq_init_queue() by first supposing each CPU is online; once that
> is done, the map for offline CPUs is cleared, then
> blk_mq_map_swqueue() is called.

I don't see how a user defined mapping would change things a whole lot. It's just another point of updating the cache. Besides, user defined mappings will be mostly (only?) for things like multiqueue, where the caching info would likely remain static over a reconfigure.
> In my null_blk test on a quad core SMP VM:
>
> - 4 hw queues
> - timer mode
>
> With the above approach, tag allocation from the local CPU can be
> improved from:
>
>   5% -> 50% for the boot CPU
>   30% -> 90% for non-boot CPUs
>
> If no one objects to the idea, I'd like to post a patch for review.

Send it out, that can't hurt. I'll take a look at it, and give it a test spin as well.

--
Jens Axboe
[PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags
Hello,

This series is against 3.14.0-rc7.

It is aimed to further improve 'percpu_ida' tags locality by taking into account the system's CPU topology when stealing tags. That is, try to steal from a CPU which is 'closest' to the stealing one.

I would not bother to post this, since on several systems the change did not show any improvement, i.e. on such one:

CPU0 attaching sched-domain:
 domain 0: span 0,8 level SIBLING
  groups: 0 (cpu_power = 588) 8 (cpu_power = 588)
  domain 1: span 0-3,8-11 level MC
   groups: 0,8 (cpu_power = 1176) 1,9 (cpu_power = 1176)
           2,10 (cpu_power = 1176) 3,11 (cpu_power = 1176)
   domain 2: span 0-15 level NUMA
    groups: 0-3,8-11 (cpu_power = 4704) 4-7,12-15 (cpu_power = 4704)

But other systems (more dense?) showed increased cache-hit rate up to 20%, i.e. this one:

CPU5 attaching sched-domain:
 domain 0: span 0-5 level MC
  groups: 5 (cpu_power = 1023) 0 (cpu_power = 1023) 1 (cpu_power = 1023)
          2 (cpu_power = 1023) 3 (cpu_power = 1023) 4 (cpu_power = 1023)
  domain 1: span 0-7 level NUMA
   groups: 0-5 (cpu_power = 6138) 6-7 (cpu_power = 2046)
CPU6 attaching sched-domain:
 domain 0: span 6-7 level MC
  groups: 6 (cpu_power = 1023) 7 (cpu_power = 1023)
  domain 1: span 0-7 level NUMA
   groups: 6-7 (cpu_power = 2046) 0-5 (cpu_power = 6138)

I tested using a 'null_blk' device with the number of threads equal to the number of CPUs, with each thread affined to one CPU and not affined, with no difference.

Suggestions are welcomed :)

Thanks!
Cc: Kent Overstreet
Cc: Jens Axboe
Cc: Shaohua Li
Cc: Nicholas Bellinger
Cc: Ingo Molnar
Cc: Peter Zijlstra

Alexander Gordeev (2):
  sched: Introduce topology level masks and for_each_tlm() macro
  percpu_ida: Use for_each_tlm() macro for CPU lookup in steal_tags()

 include/linux/percpu_ida.h |    1 -
 include/linux/sched.h      |    5 ++
 kernel/sched/core.c        |   89
 lib/percpu_ida.c           |   46 +-
 4 files changed, 113 insertions(+), 28 deletions(-)

--
1.7.7.6