Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-02 Thread Jens Axboe
On 05/02/2014 10:43 AM, Christoph Hellwig wrote:
> On Fri, May 02, 2014 at 10:41:40AM -0600, Jens Axboe wrote:
>>> At least for SCSI devices _tag space_ is plenty, it's just that we
>>> artificially limit our tag space to the queue depth to avoid having to
>>> track that one separately.  In addition we also preallocate a request
>>> for each tag, so even if we would track the queue depth separately
>>> we would waste a lot of memory.
>>
>> In practice it comes out to the same, it's not feasible to run a much
>> larger space and track on queue depth. So I don't think that changes the
>> conclusion for SCSI.
> 
> Agreed.  Just had to check if my smart ass hat still fits this morning..

If not, these are on sale

http://www.cafepress.com/+smartass_trucker_hat,18823771

:-)

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-02 Thread Christoph Hellwig
On Fri, May 02, 2014 at 10:41:40AM -0600, Jens Axboe wrote:
> > At least for SCSI devices _tag space_ is plenty, it's just that we
> > artificially limit our tag space to the queue depth to avoid having to
> > track that one separately.  In addition we also preallocate a request
> > for each tag, so even if we would track the queue depth separately
> > we would waste a lot of memory.
> 
> In practice it comes out to the same, it's not feasible to run a much
> larger space and track on queue depth. So I don't think that changes the
> conclusion for SCSI.

Agreed.  Just had to check if my smart ass hat still fits this morning..



Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-02 Thread Jens Axboe
On 05/01/2014 11:05 PM, Christoph Hellwig wrote:
> On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
>> I've taken the consequence of this and implemented another tagging
>> scheme that blk-mq will use if it deems that percpu_ida isn't going
>> to be effective for the device being initialized. But I really hate
>> to have both of them in there. Unfortunately I have no devices
>> available that have a tag space that will justify using percpu_ida,
>> so comparisons are a bit hard at the moment. NVMe should change
>> that, though, so the decision will have to be deferred until that is
>> tested.
> 
> At least for SCSI devices _tag space_ is plenty, it's just that we
> artificially limit our tag space to the queue depth to avoid having to
> track that one separately.  In addition we also preallocate a request
> for each tag, so even if we would track the queue depth separately
> we would waste a lot of memory.

In practice it comes out to the same, it's not feasible to run a much
larger space and track on queue depth. So I don't think that changes the
conclusion for SCSI.


-- 
Jens Axboe



Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Christoph Hellwig
On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
> I've taken the consequence of this and implemented another tagging
> scheme that blk-mq will use if it deems that percpu_ida isn't going
> to be effective for the device being initialized. But I really hate
> to have both of them in there. Unfortunately I have no devices
> available that have a tag space that will justify using percpu_ida,
> so comparisons are a bit hard at the moment. NVMe should change
> that, though, so the decision will have to be deferred until that is
> tested.

At least for SCSI devices _tag space_ is plenty, it's just that we
artificially limit our tag space to the queue depth to avoid having to
track that one separately.  In addition we also preallocate a request
for each tag, so even if we would track the queue depth separately
we would waste a lot of memory.



Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Jens Axboe

On 2014-05-01 20:38, Kent Overstreet wrote:

On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:

On 2014-05-01 16:47, Kent Overstreet wrote:

On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:

On 04/29/2014 05:35 AM, Ming Lei wrote:

On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe  wrote:

On 2014-04-25 18:01, Ming Lei wrote:


Hi Jens,

On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe  wrote:


On 04/25/2014 03:10 AM, Ming Lei wrote:

Sorry, I did run it the other day. It has little to no effect here, but
that's mostly because there's so much other crap going on in there. The
most effective way to currently make it work better, is just to ensure
the caching pool is of a sane size.



Yes, that is just what the patch is doing, :-)



But it's not enough.


Yes, the patch is only for cases with multiple hw queues and
offline CPUs.


For instance, my test case, it's 255 tags and 64 CPUs.
We end up in cross-cpu spinlock nightmare mode.


IMO, the scaling problem for the above case might be
caused by either current percpu ida design or blk-mq's
usage on it.


That is pretty much my claim, yes. Basically I don't think per-cpu tag
caching is ever going to be the best solution for the combination of
modern machines and the hardware that is out there (limited tags).


Sorry for not being more active in the discussion earlier, but anyways - I'm in
100% agreement with this.

Percpu freelists are _fundamentally_ only _useful_ when you don't need to be
using all your available tags, because percpu sharding requires wasting your tag
space. I could write a mathematical proof of this if I cared enough.

Otherwise what happens is on alloc failure you're touching all the other
cachelines every single time and now you're bouncing _more_ cachelines than if
you just had a single global freelist.

So yeah, for small tag spaces just use a single simple bit vector on a single
cacheline.


I've taken the consequence of this and implemented another tagging scheme
that blk-mq will use if it deems that percpu_ida isn't going to be effective
for the device being initialized. But I really hate to have both of them in
there. Unfortunately I have no devices available that have a tag space that
will justify using percpu_ida, so comparisons are a bit hard at the moment.
NVMe should change that, though, so the decision will have to be deferred until
that is tested.


Yeah, I agree that is annoying. I've thought about the issue too though and I
haven't been able to come up with any better ideas, myself.


I have failed in that area, too, and it's not for lack of thinking about
it and experimenting. Hence a new solution was thought up, based on a
lot of userland prototyping and testing. All things considered, two
solutions are better than none.



A given driver probably should be able to always use one or the other though, so
we shouldn't _need_ a runtime conditional because of this, though structuring
the code that way might be more trouble than it's worth from my vague
recollection of what blk-mq looks like...


It's completely runtime conditional at this point, not sure how not to 
make it so. This is transparent to drivers, they should not care about 
what kind of tagging scheme to use. If we present that, then we have 
failed. The runtime conditional is still better than a function pointer, 
though, so it'll likely stay that way for now.


So the entry points now all look like this:

if (use_new_scheme)
  ret = new_foo();
else
  ret = foo();

At least it should branch predict really well :-)


(I've actually been fighting with unrelated issues at a very similar layer of
abstraction, it's quite annoying.)


BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed
when steal_tags() runs was fundamentally wrong and broken in this respect, and
should be reverted, whatever usage it was that was expecting to be able to
allocate the entire tag space was the problem.


It's hard to blame Shaohua, and I helped push that. It was an attempt at
making percpu_ida actually useful for what blk-mq needs it for, and being
the primary user of it, it was definitely worth doing. A tagging scheme that
requires the tag space to be effectively sparse/huge to be fast is not a
good generic tagging algorithm.


Yeah it was definitely not an unreasonable attempt and it's probably my own
fault for not speaking up louder at the time (I can't remember how much I
commented at the time). Ah well, irritating problem space :)


Not a problem, I think the main failure here is that we have been coming 
at this with clashing expectations of what the requirements are. And a 
further issue is wanting to cling to the percpu_ida tags on my end, 
thinking it could be made to work.


--
Jens Axboe


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Kent Overstreet
On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
> On 2014-05-01 16:47, Kent Overstreet wrote:
> >On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:
> >>On 04/29/2014 05:35 AM, Ming Lei wrote:
> >>>On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe  wrote:
> On 2014-04-25 18:01, Ming Lei wrote:
> >
> >Hi Jens,
> >
> >On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe  wrote:
> >>
> >>On 04/25/2014 03:10 AM, Ming Lei wrote:
> >>
> >>Sorry, I did run it the other day. It has little to no effect here, but
> >>that's mostly because there's so much other crap going on in there. The
> >>most effective way to currently make it work better, is just to ensure
> >>the caching pool is of a sane size.
> >
> >
> >Yes, that is just what the patch is doing, :-)
> 
> 
> But it's not enough.
> >>>
> >>>Yes, the patch is only for cases with multiple hw queues and
> >>>offline CPUs.
> >>>
> For instance, my test case, it's 255 tags and 64 CPUs.
> We end up in cross-cpu spinlock nightmare mode.
> >>>
> >>>IMO, the scaling problem for the above case might be
> >>>caused by either current percpu ida design or blk-mq's
> >>>usage on it.
> >>
> >>That is pretty much my claim, yes. Basically I don't think per-cpu tag
> >>caching is ever going to be the best solution for the combination of
> >>modern machines and the hardware that is out there (limited tags).
> >
> >Sorry for not being more active in the discussion earlier, but anyways - I'm in
> >100% agreement with this.
> >
> >Percpu freelists are _fundamentally_ only _useful_ when you don't need to be
> >using all your available tags, because percpu sharding requires wasting your tag
> >space. I could write a mathematical proof of this if I cared enough.
> >
> >Otherwise what happens is on alloc failure you're touching all the other
> >cachelines every single time and now you're bouncing _more_ cachelines than if
> >you just had a single global freelist.
> >
> >So yeah, for small tag spaces just use a single simple bit vector on a single
> >cacheline.
> 
> I've taken the consequence of this and implemented another tagging scheme
> that blk-mq will use if it deems that percpu_ida isn't going to be effective
> for the device being initialized. But I really hate to have both of them in
> there. Unfortunately I have no devices available that have a tag space that
> will justify using percpu_ida, so comparisons are a bit hard at the moment.
> NVMe should change that, though, so the decision will have to be deferred until
> that is tested.

Yeah, I agree that is annoying. I've thought about the issue too though and I
haven't been able to come up with any better ideas, myself.

A given driver probably should be able to always use one or the other though, so
we shouldn't _need_ a runtime conditional because of this, though structuring
the code that way might be more trouble than it's worth from my vague
> recollection of what blk-mq looks like...

(I've actually been fighting with unrelated issues at a very similar layer of
abstraction, it's quite annoying.)

> >BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed
> >when steal_tags() runs was fundamentally wrong and broken in this respect, and
> >should be reverted, whatever usage it was that was expecting to be able to
> >allocate the entire tag space was the problem.
> 
> It's hard to blame Shaohua, and I helped push that. It was an attempt at
> making percpu_ida actually useful for what blk-mq needs it for, and being
> the primary user of it, it was definitely worth doing. A tagging scheme that
> requires the tag space to be effectively sparse/huge to be fast is not a
> good generic tagging algorithm.

Yeah it was definitely not an unreasonable attempt and it's probably my own
fault for not speaking up louder at the time (I can't remember how much I
commented at the time). Ah well, irritating problem space :)


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Jens Axboe

On 2014-05-01 16:47, Kent Overstreet wrote:

On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:

On 04/29/2014 05:35 AM, Ming Lei wrote:

On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe  wrote:

On 2014-04-25 18:01, Ming Lei wrote:


Hi Jens,

On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe  wrote:


On 04/25/2014 03:10 AM, Ming Lei wrote:

Sorry, I did run it the other day. It has little to no effect here, but
that's mostly because there's so much other crap going on in there. The
most effective way to currently make it work better, is just to ensure
the caching pool is of a sane size.



Yes, that is just what the patch is doing, :-)



But it's not enough.


Yes, the patch is only for cases with multiple hw queues and
offline CPUs.


For instance, my test case, it's 255 tags and 64 CPUs.
We end up in cross-cpu spinlock nightmare mode.


IMO, the scaling problem for the above case might be
caused by either current percpu ida design or blk-mq's
usage on it.


That is pretty much my claim, yes. Basically I don't think per-cpu tag
caching is ever going to be the best solution for the combination of
modern machines and the hardware that is out there (limited tags).


Sorry for not being more active in the discussion earlier, but anyways - I'm in
100% agreement with this.

Percpu freelists are _fundamentally_ only _useful_ when you don't need to be
using all your available tags, because percpu sharding requires wasting your tag
space. I could write a mathematical proof of this if I cared enough.

Otherwise what happens is on alloc failure you're touching all the other
cachelines every single time and now you're bouncing _more_ cachelines than if
you just had a single global freelist.

So yeah, for small tag spaces just use a single simple bit vector on a single
cacheline.


I've taken the consequence of this and implemented another tagging 
scheme that blk-mq will use if it deems that percpu_ida isn't going to 
be effective for the device being initialized. But I really hate to have 
both of them in there. Unfortunately I have no devices available that 
have a tag space that will justify using percpu_ida, so comparisons are a
bit hard at the moment. NVMe should change that, though, so the decision
will have to be deferred until that is tested.



BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed
when steal_tags() runs was fundamentally wrong and broken in this respect, and
should be reverted, whatever usage it was that was expecting to be able to
allocate the entire tag space was the problem.


It's hard to blame Shaohua, and I helped push that. It was an attempt at
making percpu_ida actually useful for what blk-mq needs it for, and 
being the primary user of it, it was definitely worth doing. A tagging 
scheme that requires the tag space to be effectively sparse/huge to be 
fast is not a good generic tagging algorithm.


--
Jens Axboe



Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Kent Overstreet
On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:
> On 04/29/2014 05:35 AM, Ming Lei wrote:
> > On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe  wrote:
> >> On 2014-04-25 18:01, Ming Lei wrote:
> >>>
> >>> Hi Jens,
> >>>
> >>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe  wrote:
> 
>  On 04/25/2014 03:10 AM, Ming Lei wrote:
> 
>  Sorry, I did run it the other day. It has little to no effect here, but
>  that's mostly because there's so much other crap going on in there. The
>  most effective way to currently make it work better, is just to ensure
>  the caching pool is of a sane size.
> >>>
> >>>
> >>> Yes, that is just what the patch is doing, :-)
> >>
> >>
> >> But it's not enough.
> > 
> > Yes, the patch is only for cases with multiple hw queues and
> > offline CPUs.
> > 
> >> For instance, my test case, it's 255 tags and 64 CPUs.
> >> We end up in cross-cpu spinlock nightmare mode.
> > 
> > IMO, the scaling problem for the above case might be
> > caused by either current percpu ida design or blk-mq's
> > usage on it.
> 
> That is pretty much my claim, yes. Basically I don't think per-cpu tag
> caching is ever going to be the best solution for the combination of
> modern machines and the hardware that is out there (limited tags).

Sorry for not being more active in the discussion earlier, but anyways - I'm in
100% agreement with this.

Percpu freelists are _fundamentally_ only _useful_ when you don't need to be
using all your available tags, because percpu sharding requires wasting your tag
space. I could write a mathematical proof of this if I cared enough.
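A back-of-the-envelope sketch of that waste claim, using the 255-tag/64-CPU numbers from earlier in the thread (the per-cpu batch size of 8 is an assumed value, purely for illustration):

```c
/* Rough worst case for tags stranded in remote percpu caches.
 * ncpus = 64 comes from the thread's example; batch = 8 is an
 * assumed per-cpu cache size, for illustration only. */
static unsigned stranded_worst_case(unsigned ncpus, unsigned batch)
{
	/* Every CPU except the allocating one may hold up to `batch`
	 * free tags that are invisible without cross-cpu stealing. */
	return (ncpus - 1) * batch;
}

/* stranded_worst_case(64, 8) == 504 -- roughly twice the 255-tag
 * space, so near saturation almost every allocation must steal. */
```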

Otherwise what happens is on alloc failure you're touching all the other
cachelines every single time and now you're bouncing _more_ cachelines than if
you just had a single global freelist.

So yeah, for small tag spaces just use a single simple bit vector on a single
cacheline.
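A minimal userspace sketch of such a single-word bit-vector allocator (an illustration of the idea only, not the code that later went into blk-mq; the names here are made up):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of a tag allocator whose whole state is one 64-bit word, so
 * every alloc/free hits the same cacheline instead of bouncing many. */
struct bitmap_tags {
	_Atomic uint64_t map;	/* bit i set => tag i in use */
	unsigned nr_tags;	/* tag space size, <= 64 */
};

/* Returns a free tag in [0, nr_tags), or -1 if the space is exhausted. */
static int bitmap_tag_alloc(struct bitmap_tags *bt)
{
	uint64_t mask = bt->nr_tags == 64 ? ~0ULL
					  : (1ULL << bt->nr_tags) - 1;
	uint64_t old = atomic_load(&bt->map);

	for (;;) {
		uint64_t avail = ~old & mask;
		if (!avail)
			return -1;			/* all tags busy */
		int tag = __builtin_ctzll(avail);	/* lowest free bit */
		if (atomic_compare_exchange_weak(&bt->map, &old,
						 old | (1ULL << tag)))
			return tag;
		/* CAS failed: `old` was reloaded, retry with fresh state */
	}
}

static void bitmap_tag_free(struct bitmap_tags *bt, int tag)
{
	atomic_fetch_and(&bt->map, ~(1ULL << tag));
}
```

All state fits in one 64-bit word, so the worst case is contention on a single cacheline rather than a bounce across every CPU's cache.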

BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed
when steal_tags() runs was fundamentally wrong and broken in this respect, and
should be reverted, whatever usage it was that was expecting to be able to
allocate the entire tag space was the problem.


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Alexander Gordeev
On Thu, May 01, 2014 at 11:24:42PM +0200, Alexander Gordeev wrote:
> On Tue, Apr 22, 2014 at 01:56:29PM +0200, Peter Zijlstra wrote:
> > I've not had time to revisit/finish them, but you should definitely
> > clean up the percpu_ida stuff and reduce existing contention before
> > going overboard.
> 
> Hi Peter,
> 
> I tried to combine your series into a single patch against 3.15-rc3.
> 
> I hope it looks like you intended, and that my comments in the
> follow-up message make sense.
> 
> Thanks!
> 
> diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
> index 93d145e..14f5151 100644
> --- a/lib/percpu_ida.c
> +++ b/lib/percpu_ida.c
> @@ -51,6 +51,15 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
>   *dst_nr += nr;
>  }
>  
> +static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
> +{
> + if (l1 > l2)
> + swap(l1, l2);
> +
> + spin_lock(l1);
> + spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
> +}
> +
>  /*
>   * Try to steal tags from a remote cpu's percpu freelist.
>   *
> @@ -84,7 +93,7 @@ static inline void steal_tags(struct percpu_ida *pool,
>   if (remote == tags)
>   continue;
>  
> - spin_lock(&remote->lock);
> + double_lock(&tags->lock, &remote->lock);
>  
>   if (remote->nr_free) {
>   memcpy(tags->freelist,
> @@ -96,36 +105,81 @@ static inline void steal_tags(struct percpu_ida *pool,
>   }
>  
>   spin_unlock(&remote->lock);
> + spin_unlock(&tags->lock);
>  
>   if (tags->nr_free)
>   break;
>   }
>  }
>  
> -/*
> - * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
> - * our percpu freelist:
> - */
> -static inline void alloc_global_tags(struct percpu_ida *pool,
> -  struct percpu_ida_cpu *tags)
> +static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
>  {
> - move_tags(tags->freelist, &tags->nr_free,
> -   pool->freelist, &pool->nr_free,
> -   min(pool->nr_free, pool->percpu_batch_size));
> + int tag = -ENOSPC;
> +
> + spin_lock(&tags->lock);
> + if (tags->nr_free)
> + tag = tags->freelist[--tags->nr_free];
> + spin_unlock(&tags->lock);
> +
> + return tag;
>  }
>  
> -static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
> +static inline int alloc_local_tag(struct percpu_ida *pool)
>  {
> + struct percpu_ida_cpu *tags;
> + unsigned long flags;
>   int tag = -ENOSPC;
>  
> + local_irq_save(flags);
> + tags = this_cpu_ptr(pool->tag_cpu);
>   spin_lock(&tags->lock);
>   if (tags->nr_free)
>   tag = tags->freelist[--tags->nr_free];
> - spin_unlock(&tags->lock);
> + spin_unlock_irqrestore(&tags->lock, flags);
>  
>   return tag;
>  }
>  
> +static inline int alloc_global_tag(struct percpu_ida *pool)
> +{
> + struct percpu_ida_cpu *tags;
> + unsigned long flags;
> + int tag = -ENOSPC;
> + 
> + spin_lock_irqsave(&pool->lock, flags);
> + tags = this_cpu_ptr(pool->tag_cpu);

Would it make more sense to do it the way alloc_local_tag() does?

local_irq_save(flags);
tags = this_cpu_ptr(pool->tag_cpu);
spin_lock(&tags->lock);

> +
> + if (!tags->nr_free) {
> + /*
> +  * Move tags from the global-, onto our percpu-freelist.
> +  */
> + move_tags(tags->freelist, &tags->nr_free,
> +   pool->freelist, &pool->nr_free,
> +   min(pool->nr_free, pool->percpu_batch_size));

'tags' is accessed locklessly here and races with steal_tags().
Would double_lock() help here?

> + }
> +
> + spin_unlock(&tags->lock);

It is quite difficult to comment here, since there is no affected code shown,
but I will try nevertheless.

So steal_tags() used to be serialized with 'pool->lock', which is now removed.
As result, two (or more) concurrent threads could find and unset the very
same bit in this piece of code:

for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
 cpus_have_tags; cpus_have_tags--) {
cpu = cpumask_next(cpu, &pool->cpus_have_tags); // *** FIND

if (cpu >= nr_cpu_ids) {
cpu = cpumask_first(&pool->cpus_have_tags);
if (cpu >= nr_cpu_ids)
BUG();
}

pool->cpu_last_stolen = cpu;
remote = per_cpu_ptr(pool->tag_cpu, cpu);

cpumask_clear_cpu(cpu, &pool->cpus_have_tags);  // *** UNSET

...
}


The second thread's cpumask_clear_cpu() races with cpumask_set_cpu() in
percpu_ida_free()

if (nr_free == 1) {
cpumask_set_cpu(smp_processor_id(),
&pool->cpus_have_tags);

// here the bit is UNSET by the second concurrent thread

wake_up(&pool->wait);
}

As a result, the woken thread would not find the illegitimately unset bit.
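As an aside, the double_lock() helper quoted above is the standard address-ordering idiom; a self-contained userspace rendering of the same trick (C11 spinlocks standing in for the kernel's spinlock_t, names invented here):

```c
#include <stdatomic.h>

/* When two contexts each need the same pair of locks, always acquiring
 * the lower-addressed one first rules out the ABBA deadlock. */
typedef struct { atomic_flag held; } spinlock;

static void spin_lock(spinlock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held,
						 memory_order_acquire))
		;	/* spin */
}

static void spin_unlock(spinlock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static void double_lock(spinlock *l1, spinlock *l2)
{
	if (l1 > l2) {			/* canonical order: by address */
		spinlock *tmp = l1;
		l1 = l2;
		l2 = tmp;
	}
	spin_lock(l1);
	spin_lock(l2);
}

static void double_unlock(spinlock *l1, spinlock *l2)
{
	spin_unlock(l1);
	spin_unlock(l2);
}
```

Because every caller takes the lower-addressed lock first regardless of argument order, the interleaving needed for deadlock cannot occur. (Comparing unrelated pointers is technically unspecified in ISO C, but this is the conventional form of the idiom.)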

Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Alexander Gordeev
On Tue, Apr 22, 2014 at 01:56:29PM +0200, Peter Zijlstra wrote:
> I've not had time to revisit/finish them, but you should definitely
> clean up the percpu_ida stuff and reduce existing contention before
> going overboard.

Hi Peter,

I tried to combine your series into a single patch against 3.15-rc3.

I hope it looks like you intended, and that my comments in the
follow-up message make sense.

Thanks!

diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 93d145e..14f5151 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -51,6 +51,15 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
*dst_nr += nr;
 }
 
+static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+   if (l1 > l2)
+   swap(l1, l2);
+
+   spin_lock(l1);
+   spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
 /*
  * Try to steal tags from a remote cpu's percpu freelist.
  *
@@ -84,7 +93,7 @@ static inline void steal_tags(struct percpu_ida *pool,
if (remote == tags)
continue;
 
-   spin_lock(&remote->lock);
+   double_lock(&tags->lock, &remote->lock);
 
if (remote->nr_free) {
memcpy(tags->freelist,
@@ -96,36 +105,81 @@ static inline void steal_tags(struct percpu_ida *pool,
}
 
spin_unlock(&remote->lock);
+   spin_unlock(&tags->lock);
 
if (tags->nr_free)
break;
}
 }
 
-/*
- * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
- * our percpu freelist:
- */
-static inline void alloc_global_tags(struct percpu_ida *pool,
-struct percpu_ida_cpu *tags)
+static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
 {
-   move_tags(tags->freelist, &tags->nr_free,
- pool->freelist, &pool->nr_free,
- min(pool->nr_free, pool->percpu_batch_size));
+   int tag = -ENOSPC;
+
+   spin_lock(&tags->lock);
+   if (tags->nr_free)
+   tag = tags->freelist[--tags->nr_free];
+   spin_unlock(&tags->lock);
+
+   return tag;
 }
 
-static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
+static inline int alloc_local_tag(struct percpu_ida *pool)
 {
+   struct percpu_ida_cpu *tags;
+   unsigned long flags;
int tag = -ENOSPC;
 
+   local_irq_save(flags);
+   tags = this_cpu_ptr(pool->tag_cpu);
spin_lock(&tags->lock);
if (tags->nr_free)
tag = tags->freelist[--tags->nr_free];
-   spin_unlock(&tags->lock);
+   spin_unlock_irqrestore(&tags->lock, flags);
 
return tag;
 }
 
+static inline int alloc_global_tag(struct percpu_ida *pool)
+{
+   struct percpu_ida_cpu *tags;
+   unsigned long flags;
+   int tag = -ENOSPC;
+ 
+   spin_lock_irqsave(&pool->lock, flags);
+   tags = this_cpu_ptr(pool->tag_cpu);
+
+   if (!tags->nr_free) {
+   /*
+* Move tags from the global-, onto our percpu-freelist.
+*/
+   move_tags(tags->freelist, &tags->nr_free,
+ pool->freelist, &pool->nr_free,
+ min(pool->nr_free, pool->percpu_batch_size));
+   }
+
+   spin_unlock(&tags->lock);
+
+   if (!tags->nr_free)
+   steal_tags(pool, tags);
+
+   if (tags->nr_free) {
+   spin_lock(&tags->lock);
+   if (tags->nr_free) {
+   tag = tags->freelist[--tags->nr_free];
+   if (tags->nr_free) {
+   cpumask_set_cpu(smp_processor_id(),
+   &pool->cpus_have_tags);
+   }
+   }
+   spin_unlock(&tags->lock);
+   }
+
+   local_irq_restore(flags);
+ 
+   return tag;
+}
+
 /**
  * percpu_ida_alloc - allocate a tag
  * @pool: pool to allocate from
@@ -146,66 +200,26 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
  */
 int percpu_ida_alloc(struct percpu_ida *pool, int state)
 {
-   DEFINE_WAIT(wait);
-   struct percpu_ida_cpu *tags;
-   unsigned long flags;
-   int tag;
-
-   local_irq_save(flags);
-   tags = this_cpu_ptr(pool->tag_cpu);
-
-   /* Fastpath */
-   tag = alloc_local_tag(tags);
-   if (likely(tag >= 0)) {
-   local_irq_restore(flags);
-   return tag;
-   }
-
-   while (1) {
-   spin_lock(&pool->lock);
-
-   /*
-* prepare_to_wait() must come before steal_tags(), in case
-* percpu_ida_free() on another cpu flips a bit in
-* cpus_have_tags
-*
-* global lock held and irqs disabled, don't need percpu lock
-*/
-   if (state != TASK_RUNNING)
-   prepare_to_wait(&pool->wait, &wait, state);
-
-   if (!tags->nr_free)
-   alloc_global_tags(pool, tags);
-   if 
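The ordering rule in the comment above, prepare_to_wait() before re-checking for tags, is the classic lost-wakeup discipline. A userspace analogue using a condition variable (illustrative only; the kernel's waitqueue API differs in detail, and these names are invented):

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace analogue of the prepare_to_wait() ordering rule: become a
 * waiter (take the lock) BEFORE re-checking the condition, so a wakeup
 * that lands between the check and the sleep cannot be lost. */
struct waitqueue {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool available;			/* stands in for "a tag is free" */
};

static void wq_wait(struct waitqueue *wq)
{
	pthread_mutex_lock(&wq->lock);		/* "prepare to wait" */
	while (!wq->available)			/* re-check only after locking */
		pthread_cond_wait(&wq->cond, &wq->lock);
	wq->available = false;			/* consume the "tag" */
	pthread_mutex_unlock(&wq->lock);
}

static void wq_wake(struct waitqueue *wq)
{
	pthread_mutex_lock(&wq->lock);
	wq->available = true;
	pthread_cond_signal(&wq->cond);
	pthread_mutex_unlock(&wq->lock);
}
```

Checking the condition before registering as a waiter would open exactly the window described above: the wakeup fires between the check and the sleep, and the sleeper never sees it.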

Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Alexander Gordeev
On Tue, Apr 22, 2014 at 01:56:29PM +0200, Peter Zijlstra wrote:
 I've not had time to revisit/finish them, but you should definitely
 clean up the percpu_ida stuff and reduce existing contention before
 going overboard.

Hi Peter,

I tried to combine your series into a single patch against 3.15-rc3.

I hope, it looks like you intended and my comments in follow-up
message make sense.

Thanks!

diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index 93d145e..14f5151 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -51,6 +51,15 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
*dst_nr += nr;
 }
 
+static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+   if (l1 > l2)
+   swap(l1, l2);
+
+   spin_lock(l1);
+   spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
 /*
  * Try to steal tags from a remote cpu's percpu freelist.
  *
@@ -84,7 +93,7 @@ static inline void steal_tags(struct percpu_ida *pool,
if (remote == tags)
continue;
 
-   spin_lock(&remote->lock);
+   double_lock(&tags->lock, &remote->lock);
 
if (remote->nr_free) {
memcpy(tags->freelist,
@@ -96,36 +105,81 @@ static inline void steal_tags(struct percpu_ida *pool,
}
 
spin_unlock(&remote->lock);
+   spin_unlock(&tags->lock);
 
if (tags->nr_free)
break;
}
 }
 
-/*
- * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
- * our percpu freelist:
- */
-static inline void alloc_global_tags(struct percpu_ida *pool,
-struct percpu_ida_cpu *tags)
+static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
 {
-   move_tags(tags->freelist, &tags->nr_free,
- pool->freelist, &pool->nr_free,
- min(pool->nr_free, pool->percpu_batch_size));
+   int tag = -ENOSPC;
+
+   spin_lock(&tags->lock);
+   if (tags->nr_free)
+   tag = tags->freelist[--tags->nr_free];
+   spin_unlock(&tags->lock);
+
+   return tag;
 }
 
-static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
+static inline int alloc_local_tag(struct percpu_ida *pool)
 {
+   struct percpu_ida_cpu *tags;
+   unsigned long flags;
int tag = -ENOSPC;
 
+   local_irq_save(flags);
+   tags = this_cpu_ptr(pool->tag_cpu);
spin_lock(&tags->lock);
if (tags->nr_free)
tag = tags->freelist[--tags->nr_free];
-   spin_unlock(&tags->lock);
+   spin_unlock_irqrestore(&tags->lock, flags);
 
return tag;
 }
 
+static inline int alloc_global_tag(struct percpu_ida *pool)
+{
+   struct percpu_ida_cpu *tags;
+   unsigned long flags;
+   int tag = -ENOSPC;
+ 
+   spin_lock_irqsave(&pool->lock, flags);
+   tags = this_cpu_ptr(pool->tag_cpu);
+
+   if (!tags->nr_free) {
+   /*
+* Move tags from the global-, onto our percpu-freelist.
+*/
+   move_tags(tags->freelist, &tags->nr_free,
+ pool->freelist, &pool->nr_free,
+ min(pool->nr_free, pool->percpu_batch_size));
+   }
+
+   spin_unlock(&pool->lock);
+
+   if (!tags->nr_free)
+   steal_tags(pool, tags);
+
+   if (tags->nr_free) {
+   spin_lock(&tags->lock);
+   if (tags->nr_free) {
+   tag = tags->freelist[--tags->nr_free];
+   if (tags->nr_free) {
+   cpumask_set_cpu(smp_processor_id(),
+   &pool->cpus_have_tags);
+   }
+   }
+   spin_unlock(&tags->lock);
+   }
+
+   local_irq_restore(flags);
+ 
+   return tag;
+}
+
 /**
  * percpu_ida_alloc - allocate a tag
  * @pool: pool to allocate from
@@ -146,66 +200,26 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
  */
 int percpu_ida_alloc(struct percpu_ida *pool, int state)
 {
-   DEFINE_WAIT(wait);
-   struct percpu_ida_cpu *tags;
-   unsigned long flags;
-   int tag;
-
-   local_irq_save(flags);
-   tags = this_cpu_ptr(pool->tag_cpu);
-
-   /* Fastpath */
-   tag = alloc_local_tag(tags);
-   if (likely(tag >= 0)) {
-   local_irq_restore(flags);
-   return tag;
-   }
-
-   while (1) {
-   spin_lock(&pool->lock);
-
-   /*
-* prepare_to_wait() must come before steal_tags(), in case
-* percpu_ida_free() on another cpu flips a bit in
-* cpus_have_tags
-*
-* global lock held and irqs disabled, don't need percpu lock
-*/
-   if (state != TASK_RUNNING)
-   prepare_to_wait(&pool->wait, &wait, state);
-
-   if (!tags->nr_free)
-   

Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Alexander Gordeev
On Thu, May 01, 2014 at 11:24:42PM +0200, Alexander Gordeev wrote:
 On Tue, Apr 22, 2014 at 01:56:29PM +0200, Peter Zijlstra wrote:
  I've not had time to revisit/finish them, but you should definitely
  clean up the percpu_ida stuff and reduce existing contention before
  going overboard.
 
 Hi Peter,
 
 I tried to combine your series into a single patch against 3.15-rc3.
 
 I hope, it looks like you intended and my comments in follow-up
 message make sense.
 
 Thanks!
 
 diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
 index 93d145e..14f5151 100644
 --- a/lib/percpu_ida.c
 +++ b/lib/percpu_ida.c
 @@ -51,6 +51,15 @@ static inline void move_tags(unsigned *dst, unsigned *dst_nr,
   *dst_nr += nr;
  }
  
 +static inline void double_lock(spinlock_t *l1, spinlock_t *l2)
 +{
 + if (l1 > l2)
 + swap(l1, l2);
 +
 + spin_lock(l1);
 + spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
 +}
 +
  /*
   * Try to steal tags from a remote cpu's percpu freelist.
   *
 @@ -84,7 +93,7 @@ static inline void steal_tags(struct percpu_ida *pool,
   if (remote == tags)
   continue;
  
 - spin_lock(&remote->lock);
 + double_lock(&tags->lock, &remote->lock);
 
  if (remote->nr_free) {
  memcpy(tags->freelist,
 @@ -96,36 +105,81 @@ static inline void steal_tags(struct percpu_ida *pool,
  }
 
  spin_unlock(&remote->lock);
 + spin_unlock(&tags->lock);
 
  if (tags->nr_free)
   break;
   }
  }
  
 -/*
 - * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
 - * our percpu freelist:
 - */
 -static inline void alloc_global_tags(struct percpu_ida *pool,
 -  struct percpu_ida_cpu *tags)
 +static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
  {
 - move_tags(tags->freelist, &tags->nr_free,
 -   pool->freelist, &pool->nr_free,
 -   min(pool->nr_free, pool->percpu_batch_size));
 + int tag = -ENOSPC;
 +
 + spin_lock(&tags->lock);
 + if (tags->nr_free)
 + tag = tags->freelist[--tags->nr_free];
 + spin_unlock(&tags->lock);
 +
 + return tag;
  }
  
 -static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
 +static inline int alloc_local_tag(struct percpu_ida *pool)
  {
 + struct percpu_ida_cpu *tags;
 + unsigned long flags;
   int tag = -ENOSPC;
  
 + local_irq_save(flags);
 + tags = this_cpu_ptr(pool->tag_cpu);
   spin_lock(&tags->lock);
   if (tags->nr_free)
   tag = tags->freelist[--tags->nr_free];
 - spin_unlock(&tags->lock);
 + spin_unlock_irqrestore(&tags->lock, flags);
  
   return tag;
  }
  
 +static inline int alloc_global_tag(struct percpu_ida *pool)
 +{
 + struct percpu_ida_cpu *tags;
 + unsigned long flags;
 + int tag = -ENOSPC;
 + 
 + spin_lock_irqsave(&pool->lock, flags);
 + tags = this_cpu_ptr(pool->tag_cpu);

Would it make more sense to do it the way alloc_local_tag() does?

local_irq_save(flags);
tags = this_cpu_ptr(pool->tag_cpu);
spin_lock(&pool->lock);

 +
 + if (!tags->nr_free) {
 + /*
 +  * Move tags from the global-, onto our percpu-freelist.
 +  */
 + move_tags(tags->freelist, &tags->nr_free,
 +   pool->freelist, &pool->nr_free,
 +   min(pool->nr_free, pool->percpu_batch_size));

'tags' is accessed locklessly here and races with steal_tags().
Would double_lock() help here?

 + }
 +
 + spin_unlock(&pool->lock);

It is quite difficult to comment here, since there is no affected code shown,
but I will try nevertheless.

So steal_tags() used to be serialized with 'pool->lock', which is now removed.
As a result, two (or more) concurrent threads could find and unset the very
same bit in this piece of code:

for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
 cpus_have_tags; cpus_have_tags--) {
cpu = cpumask_next(cpu, &pool->cpus_have_tags); // *** FIND

if (cpu >= nr_cpu_ids) {
cpu = cpumask_first(&pool->cpus_have_tags);
if (cpu >= nr_cpu_ids)
BUG();
}

pool->cpu_last_stolen = cpu;
remote = per_cpu_ptr(pool->tag_cpu, cpu);

cpumask_clear_cpu(cpu, &pool->cpus_have_tags);  // *** UNSET

...
}


The second thread's cpumask_clear_cpu() races with cpumask_set_cpu() in
percpu_ida_free():

if (nr_free == 1) {
cpumask_set_cpu(smp_processor_id(),
&pool->cpus_have_tags);

// here the bit is UNSET by the second concurrent thread

wake_up(&pool->wait);
}

As a result, the woken thread would not find the illegitimately cleared
bit while looping over the mask in steal_tags().

Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Kent Overstreet
On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:
 On 04/29/2014 05:35 AM, Ming Lei wrote:
  On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe ax...@kernel.dk wrote:
  On 2014-04-25 18:01, Ming Lei wrote:
 
  Hi Jens,
 
  On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe ax...@kernel.dk wrote:
 
  On 04/25/2014 03:10 AM, Ming Lei wrote:
 
  Sorry, I did run it the other day. It has little to no effect here, but
  that's mostly because there's so much other crap going on in there. The
  most effective way to currently make it work better, is just to ensure
  the caching pool is of a sane size.
 
 
  Yes, that is just what the patch is doing, :-)
 
 
  But it's not enough.
  
  Yes, the patch is only for cases of multi hw queue and having
  offline CPUs existed.
  
  For instance, my test case, it's 255 tags and 64 CPUs.
  We end up in cross-cpu spinlock nightmare mode.
  
  IMO, the scaling problem for the above case might be
  caused by either current percpu ida design or blk-mq's
  usage on it.
 
 That is pretty much my claim, yes. Basically I don't think per-cpu tag
 caching is ever going to be the best solution for the combination of
 modern machines and the hardware that is out there (limited tags).

Sorry for not being more active in the discussion earlier, but anyways - I'm in
100% agreement with this.

Percpu freelists are _fundamentally_ only _useful_ when you don't need to be
using all your available tags, because percpu sharding requires wasting your tag
space. I could write a mathematical proof of this if I cared enough.

Otherwise what happens is on alloc failure you're touching all the other
cachelines every single time and now you're bouncing _more_ cachelines than if
you just had a single global freelist.

So yeah, for small tag spaces just use a single simple bit vector on a single
cacheline.

BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed
when steal_tags() runs was fundamentally wrong and broken in this respect, and
should be reverted, whatever usage it was that was expecting to be able to
allocate the entire tag space was the problem.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Jens Axboe

On 2014-05-01 16:47, Kent Overstreet wrote:

On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:

On 04/29/2014 05:35 AM, Ming Lei wrote:

On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe ax...@kernel.dk wrote:

On 2014-04-25 18:01, Ming Lei wrote:


Hi Jens,

On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe ax...@kernel.dk wrote:


On 04/25/2014 03:10 AM, Ming Lei wrote:

Sorry, I did run it the other day. It has little to no effect here, but
that's mostly because there's so much other crap going on in there. The
most effective way to currently make it work better, is just to ensure
the caching pool is of a sane size.



Yes, that is just what the patch is doing, :-)



But it's not enough.


 Yes, the patch is only for cases of multi hw queue and having
offline CPUs existed.


For instance, my test case, it's 255 tags and 64 CPUs.
We end up in cross-cpu spinlock nightmare mode.


IMO, the scaling problem for the above case might be
caused by either current percpu ida design or blk-mq's
usage on it.


That is pretty much my claim, yes. Basically I don't think per-cpu tag
caching is ever going to be the best solution for the combination of
modern machines and the hardware that is out there (limited tags).


Sorry for not being more active in the discussion earlier, but anyways - I'm in
100% agreement with this.

Percpu freelists are _fundamentally_ only _useful_ when you don't need to be
using all your available tags, because percpu sharding requires wasting your tag
space. I could write a mathematical proof of this if I cared enough.

Otherwise what happens is on alloc failure you're touching all the other
cachelines every single time and now you're bouncing _more_ cachelines than if
you just had a single global freelist.

So yeah, for small tag spaces just use a single simple bit vector on a single
cacheline.


I've taken the consequence of this and implemented another tagging 
scheme that blk-mq will use if it deems that percpu_ida isn't going to 
be effective for the device being initialized. But I really hate to have 
both of them in there. Unfortunately I have no devices available that 
have a tag space that will justify using percpu_ida, so comparisons are a 
bit hard at the moment. NVMe should change that, though, so decision 
will have to be deferred until that is tested.



BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed
when steal_tags() runs was fundamentally wrong and broken in this respect, and
should be reverted, whatever usage it was that was expecting to be able to
allocate the entire tag space was the problem.


It's hard to blame Shaohua, and I helped push that. It was an attempt in 
making percpu_ida actually useful for what blk-mq needs it for, and 
being the primary user of it, it was definitely worth doing. A tagging 
scheme that requires the tag space to be effectively sparse/huge to be 
fast is not a good generic tagging algorithm.


--
Jens Axboe



Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Kent Overstreet
On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
 On 2014-05-01 16:47, Kent Overstreet wrote:
 On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:
 On 04/29/2014 05:35 AM, Ming Lei wrote:
 On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe ax...@kernel.dk wrote:
 On 2014-04-25 18:01, Ming Lei wrote:
 
 Hi Jens,
 
 On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe ax...@kernel.dk wrote:
 
 On 04/25/2014 03:10 AM, Ming Lei wrote:
 
 Sorry, I did run it the other day. It has little to no effect here, but
 that's mostly because there's so much other crap going on in there. The
 most effective way to currently make it work better, is just to ensure
 the caching pool is of a sane size.
 
 
 Yes, that is just what the patch is doing, :-)
 
 
 But it's not enough.
 
 Yes, the patch is only for cases of multi hw queue and having
 offline CPUs existed.
 
 For instance, my test case, it's 255 tags and 64 CPUs.
 We end up in cross-cpu spinlock nightmare mode.
 
 IMO, the scaling problem for the above case might be
 caused by either current percpu ida design or blk-mq's
 usage on it.
 
 That is pretty much my claim, yes. Basically I don't think per-cpu tag
 caching is ever going to be the best solution for the combination of
 modern machines and the hardware that is out there (limited tags).
 
 Sorry for not being more active in the discussion earlier, but anyways - I'm 
 in
 100% agreement with this.
 
 Percpu freelists are _fundamentally_ only _useful_ when you don't need to be
 using all your available tags, because percpu sharding requires wasting your 
 tag
 space. I could write a mathematical proof of this if I cared enough.
 
 Otherwise what happens is on alloc failure you're touching all the other
 cachelines every single time and now you're bouncing _more_ cachelines than 
 if
 you just had a single global freelist.
 
 So yeah, for small tag spaces just use a single simple bit vector on a single
 cacheline.
 
 I've taken the consequence of this and implemented another tagging scheme
 that blk-mq will use if it deems that percpu_ida isn't going to be effective
 for the device being initialized. But I really hate to have both of them in
 there. Unfortunately I have no devices available that have a tag space that
 will justify using percpu_ida, so comparisons are a bit hard at the moment.
 NVMe should change that, though, so decision will have to be deferred until
 that is tested.

Yeah, I agree that is annoying. I've thought about the issue too though and I
haven't been able to come up with any better ideas, myself.

A given driver probably should be able to always use one or the other though, so
we shouldn't _need_ a runtime conditional because of this, though structuring
the code that way might be more trouble than it's worth from my vague
recollection of what blk-mq looks like...

(I've actually been fighting with unrelated issues at a very similar layer of
abstraction, it's quite annoying.)

 BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed
 when steal_tags() runs was fundamentally wrong and broken in this respect, 
 and
 should be reverted, whatever usage it was that was expecting to be able to
 allocate the entire tag space was the problem.
 
 It's hard to blame Shaohua, and I helped push that. It was an attempt in
 making percpu_ida actually useful for what blk-mq needs it for, and being
 the primary user of it, it was definitely worth doing. A tagging scheme that
 requires the tag space to be effectively sparse/huge to be fast is not a
 good generic tagging algorithm.

Yeah it was definitely not an unreasonable attempt and it's probably my own
fault for not speaking up louder at the time (I can't remember how much I
commented at the time). Ah well, irritating problem space :)


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Jens Axboe

On 2014-05-01 20:38, Kent Overstreet wrote:

On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:

On 2014-05-01 16:47, Kent Overstreet wrote:

On Tue, Apr 29, 2014 at 03:13:38PM -0600, Jens Axboe wrote:

On 04/29/2014 05:35 AM, Ming Lei wrote:

On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe ax...@kernel.dk wrote:

On 2014-04-25 18:01, Ming Lei wrote:


Hi Jens,

On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe ax...@kernel.dk wrote:


On 04/25/2014 03:10 AM, Ming Lei wrote:

Sorry, I did run it the other day. It has little to no effect here, but
that's mostly because there's so much other crap going on in there. The
most effective way to currently make it work better, is just to ensure
the caching pool is of a sane size.



Yes, that is just what the patch is doing, :-)



But it's not enough.


Yes, the patch is only for cases of multi hw queue and having
offline CPUs existed.


For instance, my test case, it's 255 tags and 64 CPUs.
We end up in cross-cpu spinlock nightmare mode.


IMO, the scaling problem for the above case might be
caused by either current percpu ida design or blk-mq's
usage on it.


That is pretty much my claim, yes. Basically I don't think per-cpu tag
caching is ever going to be the best solution for the combination of
modern machines and the hardware that is out there (limited tags).


Sorry for not being more active in the discussion earlier, but anyways - I'm in
100% agreement with this.

Percpu freelists are _fundamentally_ only _useful_ when you don't need to be
using all your available tags, because percpu sharding requires wasting your tag
space. I could write a mathematical proof of this if I cared enough.

Otherwise what happens is on alloc failure you're touching all the other
cachelines every single time and now you're bouncing _more_ cachelines than if
you just had a single global freelist.

So yeah, for small tag spaces just use a single simple bit vector on a single
cacheline.


I've taken the consequence of this and implemented another tagging scheme
that blk-mq will use if it deems that percpu_ida isn't going to be effective
for the device being initialized. But I really hate to have both of them in
there. Unfortunately I have no devices available that have a tag space that
will justify using percpu_ida, so comparisons are a bit hard at the moment.
NVMe should change that, though, so decision will have to be deferred until
that is tested.


Yeah, I agree that is annoying. I've thought about the issue too though and I
haven't been able to come up with any better ideas, myself.


I have failed in that area, too, and it's not for lack of thinking about 
it and experimenting. So hence a new solution was thought up, based on a 
lot of userland prototyping and testing. All things considered, two 
solutions are better than no solution.



A given driver probably should be able to always use one or the other though, so
we shouldn't _need_ a runtime conditional because of this, though structuring
the code that way might be more trouble than it's worth from my vague
recollection of what blk-mq looks like...


It's completely runtime conditional at this point, not sure how not to 
make it so. This is transparent to drivers, they should not care about 
what kind of tagging scheme to use. If we present that, then we have 
failed. The runtime conditional is still better than a function pointer, 
though, so it'll likely stay that way for now.


So the entry points now all look like this:

if (use_new_scheme)
  ret = new_foo();
else
  ret = foo();

At least it should branch predict really well :-)


(I've actually been fighting with unrelated issues at a very similar layer of
abstraction, it's quite annoying.)


BTW, Shaohua Li's patch d835502f3dacad1638d516ab156d66f0ba377cf5 that changed
when steal_tags() runs was fundamentally wrong and broken in this respect, and
should be reverted, whatever usage it was that was expecting to be able to
allocate the entire tag space was the problem.


It's hard to blame Shaohua, and I helped push that. It was an attempt in
making percpu_ida actually useful for what blk-mq needs it for, and being
the primary user of it, it was definitely worth doing. A tagging scheme that
requires the tag space to be effectively sparse/huge to be fast is not a
good generic tagging algorithm.


Yeah it was definitely not an unreasonable attempt and it's probably my own
fault for not speaking up louder at the time (I can't remember how much I
commented at the time). Ah well, irritating problem space :)


Not a problem, I think the main failure here is that we have been coming 
at this with clashing expectations of what the requirements are. And a 
further issue is wanting to cling to the percpu_ida tags on my end, 
thinking it could be made to work.


--
Jens Axboe


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-05-01 Thread Christoph Hellwig
On Thu, May 01, 2014 at 08:19:39PM -0600, Jens Axboe wrote:
 I've taken the consequence of this and implemented another tagging
 scheme that blk-mq will use if it deems that percpu_ida isn't going
 to be effective for the device being initialized. But I really hate
 to have both of them in there. Unfortunately I have no devices
available that have a tag space that will justify using percpu_ida,
 so comparisons are a bit hard at the moment. NVMe should change
 that, though, so decision will have to be deferred until that is
 tested.

At least for SCSI devices _tag space_ is plenty, it's just that we
artificially limit our tag space to the queue depth to avoid having to
track that one separately.  In addition we also preallocate a request
for each tag, so even if we tracked the queue depth separately
we would waste a lot of memory.



Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-30 Thread Ming Lei
On Wed, Apr 30, 2014 at 5:13 AM, Jens Axboe  wrote:
> On 04/29/2014 05:35 AM, Ming Lei wrote:
>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe  wrote:
>>> On 2014-04-25 18:01, Ming Lei wrote:

 Hi Jens,

 On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe  wrote:
>
> On 04/25/2014 03:10 AM, Ming Lei wrote:
>
> Sorry, I did run it the other day. It has little to no effect here, but
> that's mostly because there's so much other crap going on in there. The
> most effective way to currently make it work better, is just to ensure
> the caching pool is of a sane size.


 Yes, that is just what the patch is doing, :-)
>>>
>>>
>>> But it's not enough.
>>
>> Yes, the patch is only for cases of multi hw queue and having
>> offline CPUs existed.
>>
>>> For instance, my test case, it's 255 tags and 64 CPUs.
>>> We end up in cross-cpu spinlock nightmare mode.
>>
>> IMO, the scaling problem for the above case might be
>> caused by either current percpu ida design or blk-mq's
>> usage on it.
>
> That is pretty much my claim, yes. Basically I don't think per-cpu tag
> caching is ever going to be the best solution for the combination of
> modern machines and the hardware that is out there (limited tags).
>
>> One of the problems in blk-mq is that the 'set->queue_depth'
>> parameter from driver isn't scalable, maybe it is reasonable to
>> introduce 'set->min_percpu_cache', then ' tags->nr_max_cache'
>> can be computed as below:
>>
>>  max(nr_tags / hctx->nr_ctx, set->min_percpu_cache)
>>
>> Another problem in blk-mq is whether it can be improved by computing
>> tags->nr_max_cache as 'nr_tags / hctx->nr_ctx'?  The current
>> approach should be based on that there are parallel I/O
>> activity on each CPU, but I am wondering if it is the common
>> case in reality.  Suppose there are N(N << online CPUs in
>> big machine) concurrent I/O on some of CPUs, percpu cache
>> can be increased a lot by (nr_tags / N).
>
> That would certainly help the common case, but it'd still be slow for
> the cases where you DO have IO from lots of sources. If we consider

It may be difficult to figure out an efficient solution for the unusual case.

 8-16
> tags the minimum for balanced performance, then that doesn't take a
> whole lot of CPUs to spread out the tag space. Just looking at a case
> today on SCSI with 62 tags. AHCI and friends have 31 tags. Even for the
> "bigger" case of the Micron card, you still only have 255 active tags.
> And we probably want to split that up into groups of 32, making the
> problem even worse.

Yes, there is a contradiction between having limited hardware
queue tags and requiring more local cpu cache. But it is really
a challenge to figure out an efficient approach when lots of
CPUs need to contend for very limited resources.

Maybe blk-mq can learn from network devices: move the hw
queue_depth constraint into a lower level (such as blk_mq_run_hw_queue())
and keep adequate tags in the percpu pool, which means
nr_tags of the percpu pool can be much bigger than queue_depth,
keeping enough percpu cache. When the hw queue is full,
congestion control can be applied in blk_mq_alloc_request()
to avoid the cross-cpu spinlock nightmare in percpu allocation/free.
But if the device requires tags to be less than queue_depth, more
work is needed for this approach.


Thanks,
-- 
Ming Lei



Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-29 Thread Jens Axboe
On 04/29/2014 05:35 AM, Ming Lei wrote:
> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe  wrote:
>> On 2014-04-25 18:01, Ming Lei wrote:
>>>
>>> Hi Jens,
>>>
>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe  wrote:

 On 04/25/2014 03:10 AM, Ming Lei wrote:

 Sorry, I did run it the other day. It has little to no effect here, but
 that's mostly because there's so much other crap going on in there. The
 most effective way to currently make it work better, is just to ensure
 the caching pool is of a sane size.
>>>
>>>
>>> Yes, that is just what the patch is doing, :-)
>>
>>
>> But it's not enough.
> 
> Yes, the patch is only for the multi hw queue case with
> offline CPUs present.
> 
>> For instance, my test case, it's 255 tags and 64 CPUs.
>> We end up in cross-cpu spinlock nightmare mode.
> 
> IMO, the scaling problem for the above case might be
> caused by either current percpu ida design or blk-mq's
> usage on it.

That is pretty much my claim, yes. Basically I don't think per-cpu tag
caching is ever going to be the best solution for the combination of
modern machines and the hardware that is out there (limited tags).

> One of problems in blk-mq is that the 'set->queue_depth'
> parameter from driver isn't scalable, maybe it is reasonable to
> introduce 'set->min_percpu_cache', then ' tags->nr_max_cache'
> can be computed as below:
> 
>  max(nr_tags / hctx->nr_ctx, set->min_percpu_cache)
> 
> Another problem in blk-mq is that if it can be improved by computing
> tags->nr_max_cache as 'nr_tags / hctx->nr_ctx' ?  The current
> approach should be based on that there are parallel I/O
> activity on each CPU, but I am wondering if it is the common
> case in reality.  Suppose there are N(N << online CPUs in
> big machine) concurrent I/O on some of CPUs, percpu cache
> can be increased a lot by (nr_tags / N).

That would certainly help the common case, but it'd still be slow for
the cases where you DO have IO from lots of sources. If we consider 8-16
tags the minimum for balanced performance, then that doesn't take a
whole lot of CPUs to spread out the tag space. Just looking at a case
today on SCSI with 62 tags. AHCI and friends have 31 tags. Even for the
"bigger" case of the Micron card, you still only have 255 active tags.
And we probably want to split that up into groups of 32, making the
problem even worse.

>> That's what I did, essentially. Ensuring that the percpu_max_size is at
>> least 8 makes it a whole lot better here. But still slower than a regular
>> simple bitmap, which makes me sad. A fairly straight forward cmpxchg based
>> scheme I tested here is around 20% faster than the bitmap approach on a
>> basic desktop machine, and around 35% faster on a 4-socket. Outside of NVMe,
>> I can't think of cases where that approach would not be faster than
>> percpu_ida. That means all of SCSI, basically, and the basic block drivers.
> 
> If percpu_ida wants to beat bitmap allocation,  the local cache hit
> ratio has to keep high, in my tests, it can be got with enough local
> cache size.

Yes, that is exactly the issue, local cache hit must be high, and you
pretty much need a higher local cache count for that. And therein lies
the problem, you can't get that high local cache size for most common
cases. With enough tags we could, but that's not what most people will run.
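The arithmetic behind "a bigger cache than the 3 I'd get" is easy to check; a throwaway sketch of the sizing rule Ming proposed (the function name and the `min_percpu_cache` floor are illustrative, not actual blk-mq code):

```c
#include <assert.h>

/* Per-CPU cache size under the proposed rule
 * max(nr_tags / nr_ctx, min_percpu_cache); names are illustrative. */
static int nr_max_cache(int nr_tags, int nr_ctx, int min_percpu_cache)
{
    int per_cpu = nr_tags / nr_ctx;
    return per_cpu > min_percpu_cache ? per_cpu : min_percpu_cache;
}
```

For the case discussed, 255 tags over 64 CPUs, plain division gives a cache of 3 tags per CPU; a floor of 8 keeps the cache usable, but then the notional tag space (64 * 8 = 512) exceeds the 255 tags that actually exist, which is exactly the tension in the thread.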

-- 
Jens Axboe



Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-29 Thread Ming Lei
On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe  wrote:
> On 2014-04-25 18:01, Ming Lei wrote:
>>
>> Hi Jens,
>>
>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe  wrote:
>>>
>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>
>>> Sorry, I did run it the other day. It has little to no effect here, but
>>> that's mostly because there's so much other crap going on in there. The
>>> most effective way to currently make it work better, is just to ensure
>>> the caching pool is of a sane size.
>>
>>
>> Yes, that is just what the patch is doing, :-)
>
>
> But it's not enough.

Yes, the patch is only for the multi hw queue case with
offline CPUs present.

> For instance, my test case, it's 255 tags and 64 CPUs.
> We end up in cross-cpu spinlock nightmare mode.

IMO, the scaling problem for the above case might be
caused by either current percpu ida design or blk-mq's
usage on it.

One of the problems in blk-mq is that the 'set->queue_depth'
parameter from the driver isn't scalable; maybe it is reasonable to
introduce 'set->min_percpu_cache', then 'tags->nr_max_cache'
can be computed as below:

 max(nr_tags / hctx->nr_ctx, set->min_percpu_cache)

Another question for blk-mq is whether it can be improved by computing
tags->nr_max_cache as 'nr_tags / hctx->nr_ctx'.  The current
approach assumes that there is parallel I/O
activity on each CPU, but I am wondering if that is the common
case in reality.  Suppose there are N (N << online CPUs in
a big machine) concurrent I/Os on some of the CPUs; the percpu
cache can be increased a lot, by (nr_tags / N).

>
>
>> From percpu_ida view, it is easy to observe it can improve
>> allocation performance. I have several patches to export
>> these information by sysfs for monitoring percpu_ida
>> performance.
>
>
> Sounds good!

If we need to export percpu_ida allocation/free information via
sysfs for monitoring performance, percpu_ida needs a parent
kobject, which means it may be better to defer allocating the
percpu_ida tags until the sw/hw queue is initialized, like the patch does.

>
>
>>> I've got an alternative tagging scheme that I think would be useful for
>>> the cases where the tag space to cpu ratio isn't big enough. So I think
>>> we'll retain percpu_ida for the cases where it can cache enough, and
>>> punt to an alternative scheme when not.
>>
>>
>> OK, care to comment on the patch or the idea of setting percpu cache
>> size as (nr_tags / hctx->nr_ctx)?
>
>
> I think it's a good idea. The problem is that for percpu_ida to be
> effective, you need a bigger cache than the 3 I'd get above. If that isn't
> the case, it performs poorly. I'm just not convinced the design can ever
> work in the realm of realistic queue depths.
>
>
>
>>> That doesn't mean we should not improve percpu_ida. There's quite a bit
>>> of low hanging fruit in there.
>>
>>
>> IMO percpu_max_size in percpu_ida is very important for the
>> performance, and it might need to adjust dynamically according
>> to the percpu allocation loading, but it is far more complicated
>> to implement. And it might be the simplest way to fix the parameter
>> before percpu_ida_init().
>
>
> That's what I did, essentially. Ensuring that the percpu_max_size is at
> least 8 makes it a whole lot better here. But still slower than a regular
> simple bitmap, which makes me sad. A fairly straight forward cmpxchg based
> scheme I tested here is around 20% faster than the bitmap approach on a
> basic desktop machine, and around 35% faster on a 4-socket. Outside of NVMe,
> I can't think of cases where that approach would not be faster than
> percpu_ida. That means all of SCSI, basically, and the basic block drivers.

If percpu_ida wants to beat bitmap allocation, the local cache hit
ratio has to stay high; in my tests, that can be achieved with a large
enough local cache size.


Thanks,
-- 
Ming Lei


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-25 Thread Jens Axboe

On 2014-04-25 18:01, Ming Lei wrote:
> Hi Jens,
>
> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe  wrote:
>>
>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>
>> Sorry, I did run it the other day. It has little to no effect here, but
>> that's mostly because there's so much other crap going on in there. The
>> most effective way to currently make it work better, is just to ensure
>> the caching pool is of a sane size.
>
> Yes, that is just what the patch is doing, :-)

But it's not enough. For instance, my test case, it's 255 tags and 64
CPUs. We end up in cross-cpu spinlock nightmare mode.

> From percpu_ida view, it is easy to observe it can improve
> allocation performance. I have several patches to export
> these information by sysfs for monitoring percpu_ida
> performance.

Sounds good!

>> I've got an alternative tagging scheme that I think would be useful for
>> the cases where the tag space to cpu ratio isn't big enough. So I think
>> we'll retain percpu_ida for the cases where it can cache enough, and
>> punt to an alternative scheme when not.
>
> OK, care to comment on the patch or the idea of setting percpu cache
> size as (nr_tags / hctx->nr_ctx)?

I think it's a good idea. The problem is that for percpu_ida to be
effective, you need a bigger cache than the 3 I'd get above. If that
isn't the case, it performs poorly. I'm just not convinced the design
can ever work in the realm of realistic queue depths.

>> That doesn't mean we should not improve percpu_ida. There's quite a bit
>> of low hanging fruit in there.
>
> IMO percpu_max_size in percpu_ida is very important for the
> performance, and it might need to adjust dynamically according
> to the percpu allocation loading, but it is far more complicated
> to implement. And it might be the simplest way to fix the parameter
> before percpu_ida_init().

That's what I did, essentially. Ensuring that the percpu_max_size is at
least 8 makes it a whole lot better here. But still slower than a
regular simple bitmap, which makes me sad. A fairly straight forward
cmpxchg based scheme I tested here is around 20% faster than the bitmap
approach on a basic desktop machine, and around 35% faster on a
4-socket. Outside of NVMe, I can't think of cases where that approach
would not be faster than percpu_ida. That means all of SCSI, basically,
and the basic block drivers.
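For illustration, a minimal cmpxchg-based bitmap allocator of the general kind Jens alludes to; this is not his actual patch, just the technique: scan for a clear bit, claim it with a compare-and-swap, and retry on a lost race. A 32-tag queue fits in a single word:

```c
#include <assert.h>
#include <stdint.h>
#include <stdatomic.h>

static _Atomic uint32_t tag_map;   /* bit set == tag in use */

static int alloc_tag(unsigned int nr_tags)
{
    uint32_t old = atomic_load(&tag_map);

    for (;;) {
        unsigned int bit;

        /* Find the first free tag in our snapshot of the map. */
        for (bit = 0; bit < nr_tags; bit++)
            if (!(old & (1u << bit)))
                break;
        if (bit == nr_tags)
            return -1;             /* all tags busy */
        /* Claim it; on a lost race 'old' is refreshed and we retry. */
        if (atomic_compare_exchange_weak(&tag_map, &old,
                                         old | (1u << bit)))
            return (int)bit;
    }
}

static void free_tag(int tag)
{
    atomic_fetch_and(&tag_map, ~(1u << tag));
}
```

The attraction is that both allocation and free are a single atomic on a shared word, with no per-CPU state to refill or steal, which is plausibly why it wins at the small queue depths discussed here.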


--
Jens Axboe



Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-25 Thread Ming Lei
Hi Jens,

On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe  wrote:
> On 04/25/2014 03:10 AM, Ming Lei wrote:
>
> Sorry, I did run it the other day. It has little to no effect here, but
> that's mostly because there's so much other crap going on in there. The
> most effective way to currently make it work better, is just to ensure
> the caching pool is of a sane size.

Yes, that is just what the patch is doing, :-)

From the percpu_ida view, it is easy to observe it can improve
allocation performance. I have several patches to export
these information by sysfs for monitoring percpu_ida
performance.

>
> I've got an alternative tagging scheme that I think would be useful for
> the cases where the tag space to cpu ratio isn't big enough. So I think
> we'll retain percpu_ida for the cases where it can cache enough, and
> punt to an alternative scheme when not.

OK, care to comment on the patch or the idea of setting percpu cache
size as (nr_tags / hctx->nr_ctx)?

>
> That doesn't mean we should not improve percpu_ida. There's quite a bit
> of low hanging fruit in there.

IMO percpu_max_size in percpu_ida is very important for
performance, and it might need to be adjusted dynamically according
to the percpu allocation load, but that is far more complicated
to implement. So the simplest way might be to fix the parameter
before percpu_ida_init().


Thanks,
-- 
Ming Lei


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-25 Thread Jens Axboe
On 04/25/2014 03:10 AM, Ming Lei wrote:
> Hi Jens,
> 
> On Wed, Apr 23, 2014 at 9:25 AM, Jens Axboe  wrote:
>> On 2014-04-22 18:53, Ming Lei wrote:
>>
>>
>>> In my null_blk test on a quad core SMP VM:
>>>
>>>   - 4 hw queue
>>>   - timer mode
>>>
>>> With the above approach, tag allocation from local CPU can be
>>> improved from:
>>>
>>>  5%   -> 50% for boot CPU
>>>  30% -> 90% for non-boot CPU.
>>>
>>> If no one objects the idea, I'd like to post a patch for review.
>>
>>
>> Sent it out, that can't hurt. I'll take a look at it, and give it a test
>> spin as well.
> 
> I have sent out the patch in below link, and appreciate much
> you will take a look at it.
> 
>   marc.info/?l=linux-kernel&m=139826944613225&w=2

Sorry, I did run it the other day. It has little to no effect here, but
that's mostly because there's so much other crap going on in there. The
most effective way to currently make it work better, is just to ensure
the caching pool is of a sane size.

I've got an alternative tagging scheme that I think would be useful for
the cases where the tag space to cpu ratio isn't big enough. So I think
we'll retain percpu_ida for the cases where it can cache enough, and
punt to an alternative scheme when not.

That doesn't mean we should not improve percpu_ida. There's quite a bit
of low hanging fruit in there.

-- 
Jens Axboe



Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-25 Thread Ming Lei
Hi Jens,

On Wed, Apr 23, 2014 at 9:25 AM, Jens Axboe  wrote:
> On 2014-04-22 18:53, Ming Lei wrote:
>
>
>> In my null_blk test on a quad core SMP VM:
>>
>>   - 4 hw queue
>>   - timer mode
>>
>> With the above approach, tag allocation from local CPU can be
>> improved from:
>>
>>  5%   -> 50% for boot CPU
>>  30% -> 90% for non-boot CPU.
>>
>> If no one objects the idea, I'd like to post a patch for review.
>
>
> Sent it out, that can't hurt. I'll take a look at it, and give it a test
> spin as well.

I have sent out the patch at the link below, and would much appreciate
it if you could take a look at it.

  marc.info/?l=linux-kernel&m=139826944613225&w=2

Thanks,
-- 
Ming Lei


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-22 Thread Jens Axboe

On 2014-04-22 18:53, Ming Lei wrote:
> Hi Jens,
>
> On Tue, Apr 22, 2014 at 11:57 PM, Jens Axboe  wrote:
>> On 04/22/2014 08:03 AM, Jens Axboe wrote:
>>> On 2014-04-22 01:10, Alexander Gordeev wrote:
>>>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
>>>>> But other systems (more dense?) showed increased cache-hit rate
>>>>> up to 20%, i.e. this one:
>>>>
>>>> Hello Gentlemen,
>>>>
>>>> Any feedback on this?
>>>
>>> Sorry for dropping the ball on this. Improvements wrt when to steal, how
>>> much, and from whom are sorely needed in percpu_ida. I'll do a bench
>>> with this on a system that currently falls apart with it.
>>
>> Ran some quick numbers with three kernels:
>>
>> stock   3.15-rc2
>> limit   3.15-rc2 + steal limit patch (attached)
>
> I am thinking/working on this sort of improvement too, but my
> idea is to compute tags->nr_max_cache by below:
>
>     nr_tags / hctx->max_nr_ctx
>
> hctx->max_nr_ctx means the max number of sw queues mapped to the
> hw queue, which needs to be introduced in the approach; actually,
> the value should represent the CPU topology info.
>
> It is a bit complicated to compute hctx->max_nr_ctx because
> we need to take CPU hotplug and a probable
> user-defined mapping callback into account.

We can always just update the caching info, that's not a big problem. We
update the mappings on those events anyway.

> If the user-defined mapping callback needn't be considered,
> hctx->max_nr_ctx can be figured out before mapping sw
> queues in blk_mq_init_queue() by supposing each CPU is
> online first; once that is done, the map for offline CPUs is cleared,
> then blk_mq_map_swqueue() is called.

I don't see how a user defined mapping would change things a whole lot.
It's just another point of updating the cache. Besides, user defined
mappings will be mostly (only?) for things like multiqueue, where the
caching info would likely remain static over a reconfigure.

> In my null_blk test on a quad core SMP VM:
>
>   - 4 hw queue
>   - timer mode
>
> With the above approach, tag allocation from local CPU can be
> improved from:
>
>   5%  -> 50% for boot CPU
>   30% -> 90% for non-boot CPU.
>
> If no one objects to the idea, I'd like to post a patch for review.

Sent it out, that can't hurt. I'll take a look at it, and give it a test
spin as well.
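The "steal limit" idea benchmarked above can be sketched roughly as follows. This is an illustration of the technique, not the attached patch; all names and sizes are made up:

```c
/*
 * Tag stealing with a per-steal cap: when the local cache is empty,
 * grab at most STEAL_LIMIT tags from a remote CPU instead of draining
 * it, which bounds the cross-CPU ping-pong.  Single-threaded toy model;
 * real code would hold per-CPU locks while moving tags.
 */
#include <assert.h>

#define NR_CPUS     4
#define NR_TAGS     32
#define STEAL_LIMIT 4

static int cache[NR_CPUS][NR_TAGS];  /* per-CPU free-tag stacks */
static int nr_free[NR_CPUS];

static int steal_tags(int dst)
{
    for (int src = 0; src < NR_CPUS; src++) {
        if (src == dst || nr_free[src] == 0)
            continue;
        /* Move at most STEAL_LIMIT tags, not the whole remote cache. */
        int n = nr_free[src] < STEAL_LIMIT ? nr_free[src] : STEAL_LIMIT;
        for (int i = 0; i < n; i++)
            cache[dst][nr_free[dst]++] = cache[src][--nr_free[src]];
        return n;
    }
    return 0;
}

static int alloc_tag(int cpu)
{
    if (nr_free[cpu] == 0 && steal_tags(cpu) == 0)
        return -1;                   /* no tags anywhere */
    return cache[cpu][--nr_free[cpu]];
}
```

Capping the transfer leaves the victim CPU with most of its cache, so the next allocation on the victim still hits locally instead of triggering a steal in the opposite direction.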



--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-22 Thread Ming Lei
Hi Jens,

On Tue, Apr 22, 2014 at 11:57 PM, Jens Axboe  wrote:
> On 04/22/2014 08:03 AM, Jens Axboe wrote:
>> On 2014-04-22 01:10, Alexander Gordeev wrote:
>>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
 But other systems (more dense?) showed increased cache-hit rate
 up to 20%, i.e. this one:
>>>
>>> Hello Gentlemen,
>>>
>>> Any feedback on this?
>>
>> Sorry for dropping the ball on this. Improvements wrt when to steal, how
>> much, and from whom are sorely needed in percpu_ida. I'll do a bench
>> with this on a system that currently falls apart with it.
>
> Ran some quick numbers with three kernels:
>
> stock   3.15-rc2
> limit   3.15-rc2 + steal limit patch (attached)

I am thinking/working on this sort of improvement too, but my
idea is to compute tags->nr_max_cache as:

   nr_tags / hctx->max_nr_ctx

hctx->max_nr_ctx means the max number of sw queues mapped to the
hw queue, which needs to be introduced in the approach; the
value should represent the CPU topology info.

It is a bit complicated to compute hctx->max_nr_ctx because
we need to take into account CPU hotplug and a possible
user-defined mapping callback.

If a user-defined mapping callback needn't be considered, the
hctx->max_nr_ctx can be figured out before mapping sw
queues in blk_mq_init_queue() by first supposing each CPU is
online; once that is done, the map for each offline CPU is cleared,
and then blk_mq_map_swqueue() is called.

In my null_blk test on a quad core SMP VM:

 - 4 hw queues
 - timer mode

With the above approach, tag allocation from the local CPU can be
improved from:

5%  -> 50% for the boot CPU
30% -> 90% for non-boot CPUs.

If no one objects to the idea, I'd like to post a patch for review.


Thanks,
-- 
Ming Lei


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-22 Thread Peter Zijlstra
On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
> Hello,
> 
> This series is against 3.14.0-rc7.
> 
> It is aimed at further improving 'percpu_ida' tag locality by taking
> into account the system's CPU topology when stealing tags. That is, try
> to steal from a CPU which is 'closest' to the stealing one.
> 
> I would not bother to post this, since on several systems the change
> did not show any improvement, e.g. on this one:

There's much lower-level fruit to be had before doing something like
this.

  https://lkml.org/lkml/2014/1/23/257
  https://lkml.org/lkml/2014/1/23/329
  https://lkml.org/lkml/2014/1/23/343
  https://lkml.org/lkml/2014/1/23/354

I've not had time to revisit/finish them, but you should definitely
clean up the percpu_ida stuff and reduce existing contention before
going overboard.


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-22 Thread Jens Axboe
On 04/22/2014 08:03 AM, Jens Axboe wrote:
> On 2014-04-22 01:10, Alexander Gordeev wrote:
>> On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
>>> But other systems (more dense?) showed increased cache-hit rate
>>> up to 20%, i.e. this one:
>>
>> Hello Gentlemen,
>>
>> Any feedback on this?
> 
> Sorry for dropping the ball on this. Improvements wrt when to steal, how
> much, and from whom are sorely needed in percpu_ida. I'll do a bench
> with this on a system that currently falls apart with it.

Ran some quick numbers with three kernels:

stock     3.15-rc2
limit     3.15-rc2 + steal limit patch (attached)
limit+ag  3.15-rc2 + steal limit + your topology patch

Two tests were run - the device has an effective queue depth limit of
255, so I ran one test at QD=248 (low) and one at QD=512 (high) to both
exercise near-limit depth and over limit depth. 8 processes were used,
split into two groups. One group would always run on the local node, the
other would be run on the adjacent node (near) or on the far node (far).

Near + low
----------
          IOPS     sys time
stock     1009.5K  55.78%
limit     1084.4K  54.47%
limit+ag  1058.1K  52.42%

Near + high
-----------
          IOPS     sys time
stock     949.1K   75.12%
limit     980.7K   64.74%
limit+ag  1010.1K  70.27%

Far + low
---------
          IOPS     sys time
stock     600.0K   72.28%
limit     761.7K   71.17%
limit+ag  762.5K   74.48%

Far + high
----------
          IOPS     sys time
stock     465.9K   91.66%
limit     716.2K   88.68%
limit+ag  758.0K   91.00%

One huge issue on this box is that it's a 4 socket/node machine, with 32
cores (64 threads). Combined with a 255 queue depth limit, the percpu
caching does not work well. I did not include stock+ag results, they
didn't change things very much for me. We simply have to limit the
stealing first, or we're still going to be hammering on percpu locks. If
we compare the top profiles from stock-far-high and limit+ag-far-high,
it looks pretty scary. Here's the stock one:

-  50,84%  fio  [kernel.kallsyms]
_raw_spin_lock
  + 89,83% percpu_ida_alloc
  + 6,03% mtip_queue_rq
  + 2,90% percpu_ida_free

so 50% of the system time is spent acquiring a spinlock, with 90% of that
being percpu_ida. The limit+ag variant looks like this:

-  32,93%  fio  [kernel.kallsyms]
_raw_spin_lock
  + 78,35% percpu_ida_alloc
  + 19,49% mtip_queue_rq
  + 1,21% __blk_mq_run_hw_queue

which is still pretty horrid and has plenty of room for improvement. I
think we need to make better decisions on the granularity of the tag
caching. If we ignore thread siblings, that'll double our effective
caching. If that's still not enough, I bet per-node/socket would be a
huge improvement.

-- 
Jens Axboe

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7a799c4..689bbaf 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -109,6 +109,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 {
 	unsigned int nr_tags, nr_cache;
 	struct blk_mq_tags *tags;
+	unsigned int num_cpus;
 	int ret;
 
 	if (total_tags > BLK_MQ_TAG_MAX) {
@@ -121,7 +122,8 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 		return NULL;
 
 	nr_tags = total_tags - reserved_tags;
-	nr_cache = nr_tags / num_possible_cpus();
+	num_cpus = min(8U, num_online_cpus());
+	nr_cache = nr_tags / num_cpus;
 
 	if (nr_cache < BLK_MQ_TAG_CACHE_MIN)
 		nr_cache = BLK_MQ_TAG_CACHE_MIN;


Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-22 Thread Jens Axboe

On 2014-04-22 01:10, Alexander Gordeev wrote:

On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:

But other systems (more dense?) showed increased cache-hit rate
up to 20%, i.e. this one:


Hello Gentlemen,

Any feedback on this?


Sorry for dropping the ball on this. Improvements wrt when to steal, how 
much, and from whom are sorely needed in percpu_ida. I'll do a bench 
with this on a system that currently falls apart with it.


--
Jens Axboe



Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-04-22 Thread Alexander Gordeev
On Wed, Mar 26, 2014 at 02:34:22PM +0100, Alexander Gordeev wrote:
> But other systems (more dense?) showed increased cache-hit rate
> up to 20%, i.e. this one:

Hello Gentlemen,

Any feedback on this?

Thanks!

-- 
Regards,
Alexander Gordeev
agord...@redhat.com


[PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

2014-03-26 Thread Alexander Gordeev
Hello,

This series is against 3.14.0-rc7.

It is aimed at further improving 'percpu_ida' tag locality by taking
into account the system's CPU topology when stealing tags. That is, try
to steal from a CPU which is 'closest' to the stealing one.

I would not bother to post this, since on several systems the change
did not show any improvement, e.g. on this one:

CPU0 attaching sched-domain:
 domain 0: span 0,8 level SIBLING
  groups: 0 (cpu_power = 588) 8 (cpu_power = 588)
  domain 1: span 0-3,8-11 level MC
   groups: 0,8 (cpu_power = 1176) 1,9 (cpu_power = 1176) 2,10 (cpu_power = 1176) 3,11 (cpu_power = 1176)
   domain 2: span 0-15 level NUMA
groups: 0-3,8-11 (cpu_power = 4704) 4-7,12-15 (cpu_power = 4704)


But other systems (more dense?) showed increased cache-hit rate
up to 20%, i.e. this one:

CPU5 attaching sched-domain:
 domain 0: span 0-5 level MC
  groups: 5 (cpu_power = 1023) 0 (cpu_power = 1023) 1 (cpu_power = 1023) 2 (cpu_power = 1023) 3 (cpu_power = 1023) 4 (cpu_power = 1023)
  domain 1: span 0-7 level NUMA
   groups: 0-5 (cpu_power = 6138) 6-7 (cpu_power = 2046)
CPU6 attaching sched-domain:
 domain 0: span 6-7 level MC
  groups: 6 (cpu_power = 1023) 7 (cpu_power = 1023)
  domain 1: span 0-7 level NUMA
   groups: 6-7 (cpu_power = 2046) 0-5 (cpu_power = 6138)


I tested using the 'null_blk' device with the number of threads equal
to the number of CPUs, both with each thread affined to one CPU and
with no affinity, and saw no difference.

Suggestions are welcomed :)

Thanks!

Cc: Kent Overstreet 
Cc: Jens Axboe 
Cc: Shaohua Li 
Cc: Nicholas Bellinger 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 

Alexander Gordeev (2):
  sched: Introduce topology level masks and for_each_tlm() macro
  percpu_ida: Use for_each_tlm() macro for CPU lookup in steal_tags()

 include/linux/percpu_ida.h |1 -
 include/linux/sched.h  |5 ++
 kernel/sched/core.c|   89 
 lib/percpu_ida.c   |   46 +-
 4 files changed, 113 insertions(+), 28 deletions(-)

-- 
1.7.7.6
