Re: [RFC] sched: The removal of idle_balance()

2013-02-18 Thread Rakib Mullick
On Mon, Feb 18, 2013 at 9:25 PM, Steven Rostedt  wrote:
> On Mon, 2013-02-18 at 13:43 +0530, Srikar Dronamraju wrote:
>> > The cache misses dropped by ~23% and migrations dropped by ~28%. I
>> > really believe that the idle_balance() hurts performance, and not just
>> > for something like hackbench, but the aggressive nature for migration
>> > that idle_balance() causes takes a large hit on a process' cache.
>> >
>> > Think about it some more, just because we go idle isn't enough reason to
>> > pull a runable task over. CPUs go idle all the time, and tasks are woken
>> > up all the time. There's no reason that we can't just wait for the sched
>> > tick to decide its time to do a bit of balancing. Sure, it would be nice
>> > if the idle CPU did the work. But I think that frame of mind was an
>> > incorrect notion from back in the early 2000s and does not apply to
>> > today's hardware, or perhaps it doesn't apply to the (relatively) new
>> > CFS scheduler. If you want aggressive scheduling, make the task rt, and
>> > it will do aggressive scheduling.
>> >
>>
>> How is it that the normal tick based load balancing gets it correct while
>> the idle_balance gets it wrong?  Can it be because of the different
>> cpu_idle_type?
>>
>
> Currently looks to be a fluke in my box, as this performance increase
> can't be duplicated elsewhere (yet). But from looking at my traces, it
> seems that my box does the idle balance at just the wrong time, and
> causes these issues.
>
A default hackbench run creates 400 tasks (10 * 40). On an i7 system (4
cores, HT) that is 8 CPUs for 400 tasks, so idle_balance() shouldn't come
into play at all. If idle_balance() does kick in, it means we've done
something wrong while distributing tasks among the CPUs, which would point
to a problem in fork/exec/wake balancing, wouldn't it?
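
For concreteness, using the numbers above (hackbench's usual defaults of 10
groups x 40 tasks assumed): 400 tasks spread over 8 logical CPUs is roughly 50
runnable tasks per runqueue, so a CPU should almost never sit idle long enough
for idle_balance() to matter during the run.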

Thanks,
Rakib.


Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

2012-08-27 Thread Rakib Mullick
Hello Paul,

On 8/28/12, Paul E. McKenney  wrote:
> On Mon, Aug 20, 2012 at 09:26:57AM -0700, Paul E. McKenney wrote:
>> On Mon, Aug 20, 2012 at 11:26:57AM +0200, Peter Zijlstra wrote:
>
> How about the following updated patch?
>
Actually, I was waiting for Peter's update.

>   Thanx, Paul
>
> 
>
> sched: Fix load avg vs cpu-hotplug
>
> Rabik and Paul reported two different issues related to the same few
> lines of code.
>
> Rabik's issue is that the nr_uninterruptible migration code is wrong in
> that he sees artifacts due to this (Rabik please do expand in more
> detail).
>
> Paul's issue is that this code as it stands relies on us using
> stop_machine() for unplug, we all would like to remove this assumption
> so that eventually we can remove this stop_machine() usage altogether.
>
> The only reason we'd have to migrate nr_uninterruptible is so that we
> could use for_each_online_cpu() loops in favour of
> for_each_possible_cpu() loops, however since nr_uninterruptible() is the
> only such loop and its using possible lets not bother at all.
>
> The problem Rabik sees is (probably) caused by the fact that by
> migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
> involved.
>
> So don't bother with fancy migration schemes (meaning we now have to
> keep using for_each_possible_cpu()) and instead fold any nr_active delta
> after we migrate all tasks away to make sure we don't have any skewed
> nr_active accounting.
>
> [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
> miscounting noted by Rakib. ]
>
> Reported-by: Rakib Mullick 
> Reported-by: Paul E. McKenney 
> Signed-off-by: Peter Zijlstra 
> Signed-off-by: Paul E. McKenney 
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e841dfc..a8807f2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5309,27 +5309,17 @@ void idle_task_exit(void)
>  }
>
>  /*
> - * While a dead CPU has no uninterruptible tasks queued at this point,
> - * it might still have a nonzero ->nr_uninterruptible counter, because
> - * for performance reasons the counter is not stricly tracking tasks to
> - * their home CPUs. So we just add the counter to another CPU's counter,
> - * to keep the global sum constant after CPU-down:
> - */
> -static void migrate_nr_uninterruptible(struct rq *rq_src)
> -{
> - struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
> -
> - rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
> - rq_src->nr_uninterruptible = 0;
> -}
> -
> -/*
> - * remove the tasks which were accounted by rq from calc_load_tasks.
> + * Since this CPU is going 'away' for a while, fold any nr_active delta
> + * we might have. Assumes we're called after migrate_tasks() so that the
> + * nr_active count is stable.
> + *
> + * Also see the comment "Global load-average calculations".
>   */
> -static void calc_global_load_remove(struct rq *rq)
> +static void calc_load_migrate(struct rq *rq)
>  {
> - atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
> - rq->calc_load_active = 0;
> + long delta = calc_load_fold_active(rq);
> + if (delta)
> + atomic_long_add(delta, &calc_load_tasks);
>  }
>
>  /*
> @@ -5622,9 +5612,18 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
>   migrate_tasks(cpu);
>   BUG_ON(rq->nr_running != 1); /* the migration thread */
>   raw_spin_unlock_irqrestore(&rq->lock, flags);
> + break;
>
> - migrate_nr_uninterruptible(rq);
> - calc_global_load_remove(rq);
> + case CPU_DEAD:
> + {
> + struct rq *dest_rq;
> +
> + local_irq_save(flags);
> + dest_rq = cpu_rq(smp_processor_id());

The use of smp_processor_id() as the destination CPU isn't clear to me;
this processor is about to go down, isn't it?

> + raw_spin_lock(&dest_rq->lock);
> + calc_load_migrate(rq);

Well, calc_load_migrate() has no impact here because rq->nr_running == 1
at this point. That has already been pointed out previously.

Thanks,
Rakib


Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

2012-08-28 Thread Rakib Mullick
On 8/28/12, Paul E. McKenney  wrote:
> On Tue, Aug 28, 2012 at 12:57:09PM +0600, Rakib Mullick wrote:
>> Hello Paul,
>>
>> On 8/28/12, Paul E. McKenney  wrote:
>> > On Mon, Aug 20, 2012 at 09:26:57AM -0700, Paul E. McKenney wrote:
>> >> On Mon, Aug 20, 2012 at 11:26:57AM +0200, Peter Zijlstra wrote:
>> >
>> > How about the following updated patch?
>> >
>> Actually, I was waiting for Peter's update.
>
> I was too, but chatted with Peter.
>
>> >Thanx, Paul
>> >
>> > 
>> >
>> > sched: Fix load avg vs cpu-hotplug
>> >
>> > Rabik and Paul reported two different issues related to the same few
>> > lines of code.
>> >
>> > Rabik's issue is that the nr_uninterruptible migration code is wrong in
>> > that he sees artifacts due to this (Rabik please do expand in more
>> > detail).
>> >
>> > Paul's issue is that this code as it stands relies on us using
>> > stop_machine() for unplug, we all would like to remove this assumption
>> > so that eventually we can remove this stop_machine() usage altogether.
>> >
>> > The only reason we'd have to migrate nr_uninterruptible is so that we
>> > could use for_each_online_cpu() loops in favour of
>> > for_each_possible_cpu() loops, however since nr_uninterruptible() is
>> > the
>> > only such loop and its using possible lets not bother at all.
>> >
>> > The problem Rabik sees is (probably) caused by the fact that by
>> > migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
>> > involved.
>> >
>> > So don't bother with fancy migration schemes (meaning we now have to
>> > keep using for_each_possible_cpu()) and instead fold any nr_active
>> > delta
>> > after we migrate all tasks away to make sure we don't have any skewed
>> > nr_active accounting.
>> >
>> > [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
>> > miscounting noted by Rakib. ]
>> >
>> > Reported-by: Rakib Mullick 
>> > Reported-by: Paul E. McKenney 
>> > Signed-off-by: Peter Zijlstra 
>> > Signed-off-by: Paul E. McKenney 
>> >
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index e841dfc..a8807f2 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -5309,27 +5309,17 @@ void idle_task_exit(void)
>> >  }
>> >
>> >  /*
>> > - * While a dead CPU has no uninterruptible tasks queued at this point,
>> > - * it might still have a nonzero ->nr_uninterruptible counter, because
>> > - * for performance reasons the counter is not stricly tracking tasks
>> > to
>> > - * their home CPUs. So we just add the counter to another CPU's
>> > counter,
>> > - * to keep the global sum constant after CPU-down:
>> > - */
>> > -static void migrate_nr_uninterruptible(struct rq *rq_src)
>> > -{
>> > -  struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
>> > -
>> > -  rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
>> > -  rq_src->nr_uninterruptible = 0;
>> > -}
>> > -
>> > -/*
>> > - * remove the tasks which were accounted by rq from calc_load_tasks.
>> > + * Since this CPU is going 'away' for a while, fold any nr_active
>> > delta
>> > + * we might have. Assumes we're called after migrate_tasks() so that
>> > the
>> > + * nr_active count is stable.
>> > + *
>> > + * Also see the comment "Global load-average calculations".
>> >   */
>> > -static void calc_global_load_remove(struct rq *rq)
>> > +static void calc_load_migrate(struct rq *rq)
>> >  {
>> > -  atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
>> > -  rq->calc_load_active = 0;
>> > +  long delta = calc_load_fold_active(rq);
>> > +  if (delta)
>> > +  atomic_long_add(delta, &calc_load_tasks);
>> >  }
>> >
>> >  /*
>> > @@ -5622,9 +5612,18 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
>> >migrate_tasks(cpu);
>> >BUG_ON(rq->nr_running != 1); /* the migration thread */
>> >raw_spin_unlo

Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

2012-08-28 Thread Rakib Mullick
On 8/28/12, Paul E. McKenney  wrote:
>
> OK, but I thought that Peter said that ->nr_uninterruptible was
> meaningful only when summed across all CPUs.  If that is the case,
> it shouldn't matter where the counts are moved.
>
Yes, right. But nr_uninterruptible is also used to calculate the delta;
please see calc_load_fold_active().
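
For reference, that helper looked roughly like this at the time (paraphrased
from kernel/sched/core.c of that era, not a verbatim quote); it shows how
->nr_uninterruptible feeds into the folded delta:

/* Sketch: the per-rq "active" count folded into the global load-average
 * sample includes the uninterruptible tasks, so moving them between
 * runqueues skews calc_load_active on both sides. */
static long calc_load_fold_active(struct rq *this_rq)
{
        long nr_active, delta = 0;

        nr_active = this_rq->nr_running;
        nr_active += (long) this_rq->nr_uninterruptible;

        if (nr_active != this_rq->calc_load_active) {
                delta = nr_active - this_rq->calc_load_active;
                this_rq->calc_load_active = nr_active;
        }

        return delta;
}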

Thanks,
Rakib.


Re: [RFC] sched: nohz_idle_balance

2012-09-13 Thread Rakib Mullick
On 9/13/12, Peter Zijlstra  wrote:
> On Thu, 2012-09-13 at 08:49 +0200, Mike Galbraith wrote:
>> On Thu, 2012-09-13 at 06:11 +0200, Vincent Guittot wrote:
>
> Well, updating the load statistics on the cpu you're going to balance
> seems like a good end to me.. ;-) No point updating the local statistics
> N times and leaving the ones you're going to balance stale for god knows
> how long.

Don't you think this patch has a very poor subject line?

Thanks,
Rakib


Re: [tip:sched/core] sched: Fix load avg vs cpu-hotplug

2012-09-05 Thread Rakib Mullick
On 9/6/12, Peter Zijlstra  wrote:
> On Wed, 2012-09-05 at 19:01 +0200, Peter Zijlstra wrote:
>> > Please do a delta.
>
> OK, so I suppose something like the below ought to do. Paul its slightly
> different than the one in your tree, given the changelog below, do you
> see anything wrong with it?
>
> Rakib, again, sorry for getting your name wrong, and this time for
> getting it merged :/
>
It's okay, no problem. I was just pointing out the mistakes; I didn't
take it too seriously. (Actually, my friends often call me names that are
nowhere near "Rakib" or "Rabik", and those sound worse than "Rabik" ;-).
So I've had to cope with it :-).)

Thanks,
Rakib.


Re: [RFC PATCH 0/2] sched: move content out of core files for load average

2013-04-18 Thread Rakib Mullick
On Thu, Apr 18, 2013 at 9:54 PM, Paul Gortmaker
 wrote:
> On 13-04-18 07:14 AM, Peter Zijlstra wrote:
>> On Mon, 2013-04-15 at 11:33 +0200, Ingo Molnar wrote:
>>> * Paul Gortmaker  wrote:
>>>
 Recent activity has had a focus on moving functionally related blocks of stuff
 out of sched/core.c into stand-alone files.  The code relating to load average
 calculations has grown significantly enough recently to warrant placing it in a
 separate file.

 Here we do that, and in doing so, we shed ~20k of code from sched/core.c (~10%).

 A couple small static functions in the core sched.h header were also localized
 to their singular user in sched/fair.c at the same time, with the goal to also
 reduce the amount of "broadcast" content in that sched.h file.
>>>
>>> Nice!
>>>
>>> Peter, is this (and the naming of the new file) fine with you too?
>>
>> Yes and no.. that is I do like the change, but I don't like the
>> filename. We have _wy_ too many different things we call load_avg.
>>
>> That said, I'm having a somewhat hard time coming up with a coherent
>> alternative :/
>
> Several of the relocated functions start their name with "calc_load..."
> Does "calc_load.c" sound any better?
>
How about sched_load.c ?


Re: [RFC PATCH 0/2] sched: move content out of core files for load average

2013-04-18 Thread Rakib Mullick
On Fri, Apr 19, 2013 at 5:13 AM, Paul Gortmaker
 wrote:
> [Re: [RFC PATCH 0/2] sched: move content out of core files for load average] 
> On 18/04/2013 (Thu 23:06) Rakib Mullick wrote:
>
>> On Thu, Apr 18, 2013 at 9:54 PM, Paul Gortmaker
>>  wrote:
>> > On 13-04-18 07:14 AM, Peter Zijlstra wrote:
>> >> On Mon, 2013-04-15 at 11:33 +0200, Ingo Molnar wrote:
>> >>> * Paul Gortmaker  wrote:
>> >>>
>> >>>> Recent activity has had a focus on moving functionally related blocks 
>> >>>> of stuff
>> >>>> out of sched/core.c into stand-alone files.  The code relating to load 
>> >>>> average
>> >>>> calculations has grown significantly enough recently to warrant placing 
>> >>>> it in a
>> >>>> separate file.
>> >>>>
>> >>>> Here we do that, and in doing so, we shed ~20k of code from 
>> >>>> sched/core.c (~10%).
>> >>>>
>> >>>> A couple small static functions in the core sched.h header were also 
>> >>>> localized
>> >>>> to their singular user in sched/fair.c at the same time, with the goal 
>> >>>> to also
>> >>>> reduce the amount of "broadcast" content in that sched.h file.
>> >>>
>> >>> Nice!
>> >>>
>> >>> Peter, is this (and the naming of the new file) fine with you too?
>> >>
>> >> Yes and no.. that is I do like the change, but I don't like the
>> >> filename. We have _wy_ too many different things we call load_avg.
>> >>
>> >> That said, I'm having a somewhat hard time coming up with a coherent
>> >> alternative :/
>> >
>> > Several of the relocated functions start their name with "calc_load..."
>> > Does "calc_load.c" sound any better?
>> >
>> How about sched_load.c ?
>
> No, that doesn't work since it duplicates the path info in the file
> name -- something that none of the other kernel/sched/*.c files do.
> Do a "ls -1 kernel/sched" to see what I mean if it is not clear.
>
I understand your point, so just take "sched" out of it. I was only
trying to give a hint, in case it helps.

> I honestly didn't spend a lot of time thinking about the file name.
> I chose load_avg.c since it had a parallel to the /proc/loadavg that
> linux has had since the early 1990s.  I have no real attachment
> to that name, but at the same time I'd like to avoid having name
> choice become a bikeshedding event...
>
I also didn't spend a lot of time thinking about the file name; what I
said came from intuition and from what I've seen so far. A name should be
simple, clear and suggestive of what the code deals with. But I also think
a name doesn't make things different; how it works is what's important.

Thanks,
Rakib


Re: [RFC PATCH 0/2] sched: move content out of core files for load average

2013-04-19 Thread Rakib Mullick
On Fri, Apr 19, 2013 at 2:25 PM, Ingo Molnar  wrote:
>
> * Paul Gortmaker  wrote:
>
>> On 13-04-18 07:14 AM, Peter Zijlstra wrote:
>> > On Mon, 2013-04-15 at 11:33 +0200, Ingo Molnar wrote:
>> >> * Paul Gortmaker  wrote:
>> >>
>> >>> Recent activity has had a focus on moving functionally related blocks of 
>> >>> stuff
>> >>> out of sched/core.c into stand-alone files.  The code relating to load 
>> >>> average
>> >>> calculations has grown significantly enough recently to warrant placing 
>> >>> it in a
>> >>> separate file.
>> >>>
>> >>> Here we do that, and in doing so, we shed ~20k of code from sched/core.c 
>> >>> (~10%).
>> >>>
>> >>> A couple small static functions in the core sched.h header were also 
>> >>> localized
>> >>> to their singular user in sched/fair.c at the same time, with the goal 
>> >>> to also
>> >>> reduce the amount of "broadcast" content in that sched.h file.
>> >>
>> >> Nice!
>> >>
>> >> Peter, is this (and the naming of the new file) fine with you too?
>> >
>> > Yes and no.. that is I do like the change, but I don't like the
>> > filename. We have _wy_ too many different things we call load_avg.
>> >
>> > That said, I'm having a somewhat hard time coming up with a coherent
>> > alternative :/
>>
>> Several of the relocated functions start their name with "calc_load..."
>> Does "calc_load.c" sound any better?
>
> Peter has a point about load_avg being somewhat of a misnomer: that's not your
> fault in any way, we created overlapping naming within the scheduler and are
> now hurting from it.
>
> Here are the main scheduler 'load' concepts we have right now:
>
>  - The externally visible 'average load' value extracted by tools like 'top'
>    via /proc/loadavg and handled by fs/proc/loadavg.c. Internally the naming
>    is all over the map: the fields that are updated are named 'avenrun[]',
>    most other variables and methods are named calc_load_*(), and a few
>    callbacks are named *_cpu_load_*().
>
>  - rq->cpu_load, a weighted, vectored scheduler-internal notion of task load
>    average with multiple run length averages. Only exposed by debug
>    interfaces but otherwise relied on by the scheduler for SMP load balancing.
>
>  - se->avg - per entity (per task) load average. This is integrated
>    differently from the cpu_load - but work is ongoing to possibly integrate
>    it with the rq->cpu_load metric. This metric is used for CPU internal
>    execution time allocation and timeslicing, based on nice value priorities
>    and cgroup weights and constraints.
>
> Work is ongoing to integrate rq->cpu_load and se->avg - eventually they will
> become one metric.
>
> It might eventually make sense to integrate the 'average load' calculation as
> well with all this - as they really have a similar purpose, the avenload[]
> vector of averages is conceptually similar to the rq->cpu_load[] vector of
> averages.
>
> So I'd suggest to side-step all that existing confusion and simply name the
> new file kernel/sched/proc.c - our external /proc scheduler ABI towards
> userspace.
> This is similar to the already existing kernel/irq/proc.c pattern.
>
Well, kernel/sched/stats.c also exposes scheduler ABI to userspace.
Won't that introduce confusion (stats.c and proc.c under the same sched
directory)?

Thanks,
Rakib


[PATCH, RFC] sched: update_top_cache_domain only at the times of building sched domain.

2013-07-23 Thread Rakib Mullick
Currently, update_top_cache_domain() is called whenever a sched domain is
built or destroyed. But the following callpath shows that both operations
go through the same path, so the update_top_cache_domain() call can be
skipped while destroying sched domains and done only when building them.

partition_sched_domains()
    detach_destroy_domain()
        cpu_attach_domain()
            update_top_cache_domain()
    build_sched_domains()
        cpu_attach_domain()
            update_top_cache_domain()



Signed-off-by: Rakib Mullick 
---

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7c32cb..8c6fee4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5102,8 +5102,8 @@ static void update_top_cache_domain(int cpu)
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
  */
-static void
-cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
+static void cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd,
+   int cpu, bool update_cache_domain)
 {
struct rq *rq = cpu_rq(cpu);
struct sched_domain *tmp;
@@ -5138,7 +5138,8 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
 
-   update_top_cache_domain(cpu);
+   if (update_cache_domain)
+   update_top_cache_domain(cpu);
 }
 
 /* cpus with isolated domains */
@@ -6021,7 +6022,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
rcu_read_lock();
for_each_cpu(i, cpu_map) {
sd = *per_cpu_ptr(d.sd, i);
-   cpu_attach_domain(sd, d.rd, i);
+   cpu_attach_domain(sd, d.rd, i, 1);
}
rcu_read_unlock();
 
@@ -6109,7 +6110,7 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
 
rcu_read_lock();
for_each_cpu(i, cpu_map)
-   cpu_attach_domain(NULL, &def_root_domain, i);
+   cpu_attach_domain(NULL, &def_root_domain, i, 0);
rcu_read_unlock();
 }
 




Re: [PATCH, RFC] sched: update_top_cache_domain only at the times of building sched domain.

2013-07-23 Thread Rakib Mullick
On Tue, Jul 23, 2013 at 9:47 PM, Peter Zijlstra  wrote:
> On Tue, Jul 23, 2013 at 09:44:17PM +0600, Rakib Mullick wrote:
>>  Currently, update_top_cache_domain() is called whenever schedule domain is 
>> built or destroyed. But, the following
>> callpath shows that they're at the same callpath and can be avoided 
>> update_top_cache_domain() while destroying schedule
>> domain and update only at the times of building schedule domains.
>>
>>   partition_sched_domains()
>>   detach_destroy_domain()
>>   cpu_attach_domain()
>>   update_top_cache_domain()
>>   build_sched_domains()
>>   cpu_attach_domain()
>>   update_top_cache_domain()
>>
>
> Does it really matter?

Why should we do it twice? More to the point, why do it at destroy time
when we know it will be rebuilt just a few moments later?

>
>
> This just makes the code uglier for no gain afaict.
>
> If you really need to do this, key off @sd == NULL or something.

Sorry, would you please be a bit clearer? I can't quite follow what you're suggesting.


Re: [PATCH, RFC] sched: update_top_cache_domain only at the times of building sched domain.

2013-07-23 Thread Rakib Mullick
On Tue, Jul 23, 2013 at 10:09 PM, Rakib Mullick  wrote:
> On Tue, Jul 23, 2013 at 9:47 PM, Peter Zijlstra  wrote:
>> On Tue, Jul 23, 2013 at 09:44:17PM +0600, Rakib Mullick wrote:
>>>  Currently, update_top_cache_domain() is called whenever schedule domain is 
>>> built or destroyed. But, the following
>>> callpath shows that they're at the same callpath and can be avoided 
>>> update_top_cache_domain() while destroying schedule
>>> domain and update only at the times of building schedule domains.
>>>
>>>   partition_sched_domains()
>>>   detach_destroy_domain()
>>>   cpu_attach_domain()
>>>   update_top_cache_domain()
>>>   build_sched_domains()
>>>   cpu_attach_domain()
>>>   update_top_cache_domain()
>>>
>>
>> Does it really matter?
>
> Why should we do it twice? More importantly at the time of destroying
> even though we know it'll be built again just few moment later.
>
>>
>>
>> This just makes the code uglier for no gain afaict.
>>
>> If you really need to do this, key off @sd == NULL or something.
>
> Sorry, would you please a bit more clearer? I can't pick what you're 
> suggesting.

You mean keying off sd == NULL rather than using the update_cache_domain variable?


Re: [PATCH, RFC] sched: update_top_cache_domain only at the times of building sched domain.

2013-07-23 Thread Rakib Mullick
On Tue, Jul 23, 2013 at 10:53 PM, Peter Zijlstra  wrote:
> On Tue, Jul 23, 2013 at 10:20:20PM +0600, Rakib Mullick wrote:
>
>> You mean using sd == NULL rather than using update_cache_domain variable ?
>
> Yes, note how:
>
> @@ -6109,7 +6110,7 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
>
> rcu_read_lock();
> for_each_cpu(i, cpu_map)
> -   cpu_attach_domain(NULL, &def_root_domain, i);
> rcu_read_unlock();
>  }
>
> Always has NULL for sd? Which means you can do:
>
> @@ -5138,7 +5138,8 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
> rcu_assign_pointer(rq->sd, sd);
> destroy_sched_domains(tmp, cpu);
>
> -   update_top_cache_domain(cpu);
> +   if (sd)
> +   update_top_cache_domain(cpu);
>  }
>
>  /* cpus with isolated domains */

Okay, got it. Will submit an updated patch!

Thanks
Rakib.


[PATCH v2] sched: update_top_cache_domain only at the times of building sched domain.

2013-07-23 Thread Rakib Mullick
Currently, update_top_cache_domain() is called whenever a sched domain is
built or destroyed. But the following callpath shows that both operations
go through the same path, so the update_top_cache_domain() call can be
skipped while destroying sched domains and done only when building them.

partition_sched_domains()
    detach_destroy_domain()
        cpu_attach_domain()
            update_top_cache_domain()
    build_sched_domains()
        cpu_attach_domain()
            update_top_cache_domain()

Changes since v1: use sd to determine when to skip, courtesy PeterZ

Signed-off-by: Rakib Mullick 
---

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7c32cb..387fb66 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5138,7 +5138,8 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
 
-   update_top_cache_domain(cpu);
+   if (sd)
+   update_top_cache_domain(cpu);
 }
 
 /* cpus with isolated domains */





Re: [PATCH v2] sched: update_top_cache_domain only at the times of building sched domain.

2013-07-24 Thread Rakib Mullick
On Wed, Jul 24, 2013 at 9:26 AM, Michael Wang
 wrote:
> Hi, Rakib
>
> On 07/24/2013 01:42 AM, Rakib Mullick wrote:
>> Currently, update_top_cache_domain() is called whenever schedule domain is 
>> built or destroyed. But, the following
>> callpath shows that they're at the same callpath and can be avoided 
>> update_top_cache_domain() while destroying schedule
>> domain and update only at the times of building schedule domains.
>>
>>   partition_sched_domains()
>>   detach_destroy_domain()
>>   cpu_attach_domain()
>>   update_top_cache_domain()
>
> IMHO, cpu_attach_domain() and update_top_cache_domain() should be
> paired, below patch will open a window which 'rq->sd == NULL' while
> 'sd_llc != NULL', isn't it?
>
> I don't think we have the promise that before we rebuild the stuff
> correctly, no one will utilize 'sd_llc'...
>
I never said that. My point is different. partition_sched_domains() works as:

- destroying the existing sched domain (if the previous and the new
  domain aren't the same)
- building the new partition

While doing the first step it needs to detach all the CPUs in that domain;
by detaching them, each CPU falls back to the root default domain. In that
context (which I've proposed to skip), updating the top cache domain takes
the highest-flag domain to set up sd_llc_id, or falls back to the CPU
itself.

Whatever is done there (updating the top cache domain) gets overwritten
while building the new partition. So why do it beforehand? I hope you
understand my point.
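
For reference, update_top_cache_domain() looked roughly like the sketch below
around that time (paraphrased from memory, not a verbatim quote). When it runs
during detach, sd ends up NULL and sd_llc_id simply falls back to the CPU
number, which is the update I'm proposing to skip:

/* Sketch: cache the highest domain sharing package resources (the
 * last-level-cache domain) for this CPU. */
static void update_top_cache_domain(int cpu)
{
        struct sched_domain *sd;
        int id = cpu;

        sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
        if (sd)
                id = cpumask_first(sched_domain_span(sd));

        rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
        per_cpu(sd_llc_id, cpu) = id;
}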

Thanks,
Rakib.


Re: [PATCH v2] sched: update_top_cache_domain only at the times of building sched domain.

2013-07-24 Thread Rakib Mullick
On Wed, Jul 24, 2013 at 2:34 PM, Michael Wang
 wrote:
> On 07/24/2013 04:01 PM, Rakib Mullick wrote:
>> On Wed, Jul 24, 2013 at 9:26 AM, Michael Wang
>>  wrote:
>>> Hi, Rakib
>>>
>>> On 07/24/2013 01:42 AM, Rakib Mullick wrote:
>>>> Currently, update_top_cache_domain() is called whenever schedule domain is 
>>>> built or destroyed. But, the following
>>>> callpath shows that they're at the same callpath and can be avoided 
>>>> update_top_cache_domain() while destroying schedule
>>>> domain and update only at the times of building schedule domains.
>>>>
>>>>   partition_sched_domains()
>>>>   detach_destroy_domain()
>>>>   cpu_attach_domain()
>>>>   update_top_cache_domain()
>>>
>>> IMHO, cpu_attach_domain() and update_top_cache_domain() should be
>>> paired, below patch will open a window which 'rq->sd == NULL' while
>>> 'sd_llc != NULL', isn't it?
>>>
>>> I don't think we have the promise that before we rebuild the stuff
>>> correctly, no one will utilize 'sd_llc'...
>>>
>> I never said it. My point is different.  partition_sched_domain works as -
>>
>>   - destroying existing schedule domain (if previous domain and new
>> domain aren't same)
>>   - building new partition
>>
>> while doing the first it needs to detach all the cpus on that domain.
>> By detaching what it does,
>> it fall backs to it's root default domain. In this context (which i've
>> proposed to skip), by means
>> of updating top cache domain it takes the highest flag domain to setup
>> it's sd_llc_id or cpu itself.
>>
>> Whatever done above gets overwritten (updating top cache domain),
>> while building new partition.
>> Then, why we did that before? Hope you understand my point.
>
> I think you missed this in PeterZ's suggestion:
>
> -   cpu_attach_domain(NULL, &def_root_domain, i);
>
> With this change, it will be safe since you will still get an un-freed
> sd, although it's an old one.
>
That's not what I meant, and clearly I missed it. If you remove
cpu_attach_domain(), then detach_destroy_domain() becomes meaningless,
and I have no intention of removing cpu_attach_domain() from
detach_destroy_domain() at all.

> But your patch will run the risk to get a freed sd, since you make
> 'sd_llc' wrong for a period of time (between destroy and rebuild) IMO.
>
Building 'sd_llc' depends on the sched domain. If you don't have an sd,
sd_llc will point to NULL and sd_llc_id will be the CPU itself. Since we're
reconstructing everything anyway, it doesn't matter for that brief period,
because we're building it again. Now, please note what you said in the
previous thread:

   "I don't think we have the promise that before we rebuild the stuff
correctly, no one will utilize 'sd_llc'..."

If that is the case, then we shouldn't worry about it at all. That earlier
comment and the one I'm replying to in this thread are simply
contradictory.

> I guess I get you point, you are trying to save one time update since
> you think this will be done twice, but actually the result of this two
> time update was different, it's not redo and it's in order to sync
> 'sd_llc' with 'rq->sd'.
>
Yes, you got my point now, but I don't understand yours. Anyway, I'm not
going to argue about this anymore; it isn't much of an issue, but removing
this sort of thing is typical in kernel development.

Thanks,
Rakib.


Re: old->umask copying without spin_lock, in copy_fs_struct()

2013-04-07 Thread Rakib Mullick
On Sun, Apr 7, 2013 at 1:56 PM, Al Viro  wrote:
> On Sun, Apr 07, 2013 at 11:37:27AM +0600, Rakib Mullick wrote:
>> Hello,
>>
>> In copy_fs_struct(), old->umask is assigned to fs->umask outside of
>> spin_lock(&old->lock). Shouldn't it be inside spin_lock()? Since we're
>> dealing with  fs_struct *old ? Isn't it unsafe? Following lines -
>>
>>   fs->umask = old->umask;
>>
>> spin_lock(&old->lock);
>
> What would moving it down buy us?  Root, pwd and umask are all modified
> independently; the *only* reason why we hold old->lock for root and
> pwd (and we might drop and regain it between copying those - it would
> be pointless, so we don't bother, but it wouldn't have affected correctness)
> is that we want the values of root.mnt and root.dentry taken at the same
> time and we want to grab extra references on those while they are still
> valid.  The same goes for pwd, of course.  That's what old->lock
> protects - we want the damn thing atomic wrt set_fs_root() and set_fs_pwd().
> umask is an integer; its updates are atomic anyway, so it's not as if we
> could see a half-updated value or needed to do anything with refcounts.

Thanks for your explanation! The ->umask operation is trivial, and as
you've explained (I was also looking at the code), the execution order
makes sure nothing can go wrong. I was thinking of fs_struct's data as
being protected by ->lock, and just wanted to make sure the umask copy
hadn't been left outside the lock accidentally.
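
For context, the function being discussed looks roughly like this (a
from-memory sketch of fs/fs_struct.c::copy_fs_struct() of that era, not a
verbatim quote), which makes the scope of old->lock visible:

struct fs_struct *copy_fs_struct(struct fs_struct *old)
{
        struct fs_struct *fs = kmem_cache_alloc(fs_cachep, GFP_KERNEL);

        if (fs) {
                fs->users = 1;
                fs->in_exec = 0;
                spin_lock_init(&fs->lock);
                seqcount_init(&fs->seq);
                /* umask is a plain int; copying it needs no lock */
                fs->umask = old->umask;

                /* old->lock only has to make the root and pwd snapshots
                 * (path plus extra references) atomic wrt set_fs_root()
                 * and set_fs_pwd() */
                spin_lock(&old->lock);
                fs->root = old->root;
                path_get(&fs->root);
                fs->pwd = old->pwd;
                path_get(&fs->pwd);
                spin_unlock(&old->lock);
        }
        return fs;
}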

Thanks
Rakib.


[PATCH] auditsc: Use kzalloc instead of kmalloc+memset.

2013-04-07 Thread Rakib Mullick
In audit_alloc_context(), use kzalloc() instead of kmalloc()+memset(). The
patch also renames audit_zero_context() to audit_set_context(), so the name
reflects what the function actually does.

Signed-off-by: Rakib Mullick 
---

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index a371f85..f5b6dc5 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1034,10 +1034,9 @@ static inline void audit_free_aux(struct audit_context *context)
}
 }
 
-static inline void audit_zero_context(struct audit_context *context,
+static inline void audit_set_context(struct audit_context *context,
  enum audit_state state)
 {
-   memset(context, 0, sizeof(*context));
context->state  = state;
context->prio = state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0;
 }
@@ -1046,9 +1045,10 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state)
 {
struct audit_context *context;
 
-   if (!(context = kmalloc(sizeof(*context), GFP_KERNEL)))
+   context = kzalloc(sizeof(*context), GFP_KERNEL);
+   if (!context)
return NULL;
-   audit_zero_context(context, state);
+   audit_set_context(context, state);
INIT_LIST_HEAD(&context->killed_trees);
INIT_LIST_HEAD(&context->names_list);
return context;




[QUESTION] loops_per_jiffy calculation from smp_callin().

2013-04-08 Thread Rakib Mullick
Hello Ingo and all,

In arch/x86/kernel/smpboot.c::smp_callin() we do a calibrate_delay();
the reason behind doing this is commented as follows:

 * Get our bogomips.
 * Update loops_per_jiffy in cpu_data. Previous call to
 * smp_store_cpu_info() stored a value that is close but not as
 * accurate as the value just calculated.

Now, if we look at init/calibrate.c::calibrate_delay(), a percpu variable,
cpu_loops_per_jiffy, is used to cache the loops_per_jiffy value. If
cpu_loops_per_jiffy is already set (which it should be after the first
run), it prevents a new loops_per_jiffy from being calculated, so the
accuracy expected at the callsite isn't obtained (I'm thinking of
hotplugging a CPU a few times). Isn't that something we should care about?
Is the lpj value really important enough to recalculate (when we mostly
read from the TSC)? And if it is important, how does cpu_loops_per_jiffy
help?
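
For reference, a condensed sketch of the caching logic in
init/calibrate.c::calibrate_delay() (paraphrased and simplified from memory;
the preset-lpj/TSC/direct-calibration branches are collapsed into one helper
call here, so this is not the verbatim code):

static DEFINE_PER_CPU(unsigned long, cpu_loops_per_jiffy) = { 0 };

void calibrate_delay(void)
{
        unsigned long lpj;
        int this_cpu = smp_processor_id();

        if (per_cpu(cpu_loops_per_jiffy, this_cpu)) {
                /* Later calls on this CPU (e.g. when it is hotplugged back
                 * in) reuse the cached value instead of recalibrating. */
                lpj = per_cpu(cpu_loops_per_jiffy, this_cpu);
                pr_info("Calibrating delay loop (skipped) already calibrated this CPU\n");
        } else {
                /* First call: derive lpj from a preset value, the
                 * timer/TSC, or the converging software delay loop. */
                lpj = calibrate_delay_converge();
        }
        per_cpu(cpu_loops_per_jiffy, this_cpu) = lpj;
        loops_per_jiffy = lpj;
}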


Thanks,
Rakib.


Re: [PATCH] auditsc: Use kzalloc instead of kmalloc+memset.

2013-04-08 Thread Rakib Mullick
On Tue, Apr 9, 2013 at 3:43 AM, Andrew Morton  wrote:
>
> Fair enough.  I'd go futher...
>
> From: Andrew Morton 
> Subject: auditsc-use-kzalloc-instead-of-kmallocmemset-fix
>
> remove audit_set_context() altogether - fold it into its caller
>
> Cc: Al Viro 
> Cc: Eric Paris 
> Cc: Rakib Mullick 
> Signed-off-by: Andrew Morton 
> ---
>
>  kernel/auditsc.c |   10 ++
>  1 file changed, 2 insertions(+), 8 deletions(-)
>
> diff -puN kernel/auditsc.c~auditsc-use-kzalloc-instead-of-kmallocmemset-fix kernel/auditsc.c
> --- a/kernel/auditsc.c~auditsc-use-kzalloc-instead-of-kmallocmemset-fix
> +++ a/kernel/auditsc.c
> @@ -1034,13 +1034,6 @@ static inline void audit_free_aux(struct
> }
>  }
>
> -static inline void audit_set_context(struct audit_context *context,
> - enum audit_state state)
> -{
> -   context->state  = state;
> -   context->prio = state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0;
> -}
> -
>  static inline struct audit_context *audit_alloc_context(enum audit_state state)
>  {
> struct audit_context *context;
> @@ -1048,7 +1041,8 @@ static inline struct audit_context *audi
> context = kzalloc(sizeof(*context), GFP_KERNEL);
> if (!context)
> return NULL;
> -   audit_set_context(context, state);
> +   context->state = state;
> +   context->prio = state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0;
> INIT_LIST_HEAD(&context->killed_trees);
> INIT_LIST_HEAD(&context->names_list);
> return context;
> _
>
Yes, this one is better than my patch, as the diffstat shows.

Thanks,
Rakib.


Re: [RFC PATCH 0/2] sched: move content out of core files for load average

2013-04-12 Thread Rakib Mullick
On Sat, Apr 13, 2013 at 6:04 AM, Paul Gortmaker
 wrote:
> Recent activity has had a focus on moving functionally related blocks of stuff
> out of sched/core.c into stand-alone files.  The code relating to load average
> calculations has grown significantly enough recently to warrant placing it in
> a separate file.
>
> Here we do that, and in doing so, we shed ~20k of code from sched/core.c 
> (~10%).
>
> A couple small static functions in the core sched.h header were also localized
> to their singular user in sched/fair.c at the same time, with the goal to also
> reduce the amount of "broadcast" content in that sched.h file.
>
> Paul.
> ---
>
> [ Patches sent here are tested on tip's sched/core, i.e. v3.9-rc1-38-gb329fd5
>
>   Assuming that this change is OK with folks, the timing can be whatever is most
>   convenient -- i.e. I can update/respin it close to the end of the merge window
>   for what will be v3.10-rc1, if that is what minimizes the inconvenience to folks
>   who might be changing the code that is relocated here. ]
>
> Paul Gortmaker (2):
>   sched: fork load calculation code from sched/core --> sched/load_avg
>   sched: move update_load_[add/sub/set] from sched.h to fair.c
>
>  kernel/sched/Makefile   |   2 +-
>  kernel/sched/core.c | 569 ---
>  kernel/sched/fair.c |  18 ++
>  kernel/sched/load_avg.c | 577 
> 
>  kernel/sched/sched.h|  26 +--
>  5 files changed, 604 insertions(+), 588 deletions(-)
>  create mode 100644 kernel/sched/load_avg.c
>

Is there any positive impact on vmlinux size from these changes?

Thanks,
Rakib


[PATCH] nsproxy: Fix ->nsproxy counting problem in copy_namespace.

2013-03-07 Thread Rakib Mullick
From cd41495e25cf2641ffe9e01a40d3d221a46b08be Mon Sep 17 00:00:00 2001
From: Rakib Mullick 
Date: Thu, 7 Mar 2013 14:52:20 +0600
Subject: [PATCH] nsproxy: Fix ->nsproxy counting problem in copy_namespace.

In copy_namespaces(), get_nsproxy() (which increments nsproxy->count) is called
before the namespace-related flags are checked. Therefore the task's
nsproxy->count can end up with an improper value, which can lead to calling
free_nsproxy() unnecessarily. Also, incrementing nsproxy->count is an atomic
operation (more expensive than a normal increment), so before doing it we'd
better make sure the namespace-related flags are actually set.

Signed-off-by: Rakib Mullick 
---
 kernel/nsproxy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index afc0456..de36209 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -130,12 +130,12 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
if (!old_ns)
return 0;
 
-   get_nsproxy(old_ns);
-
if (!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET)))
return 0;
 
+   get_nsproxy(old_ns);
+
if (!ns_capable(user_ns, CAP_SYS_ADMIN)) {
err = -EPERM;
goto out;
-- 
1.7.11.7





Re: [nsproxy] BUG: unable to handle kernel NULL pointer dereference at 0000000000000024

2013-03-08 Thread Rakib Mullick
On 3/7/13, Eric W. Biederman  wrote:
> Fengguang Wu  writes:
>
>> Greetings,
>>
>> I got the below oops and the first bad commit is
>
> Doh!  On a second look that change is totally wrong.  Of course we need
> to up the ref-count every time we create a new process.  Especially if
> we don't do anything with namespaces.
>
> I was looking at it from the wrong angle last night.  I should have
> known better.
>
> Patch dropped.
>

Sad to hear :(. From the debug messages it's a kmemcheck report; I can't
relate the reported problem to the patch I've proposed.

It seems that on the task exit path, in switch_task_namespaces(), after my
patch atomic_dec_and_test(&ns->count) becomes true one decrement too early
(the count is off by one), and so free_nsproxy() gets called. But
free_nsproxy() shouldn't be called there.

Am I right? Or is there something else going on?

Thanks,
Rakib


Re: [nsproxy] BUG: unable to handle kernel NULL pointer dereference at 0000000000000024

2013-03-08 Thread Rakib Mullick
On Fri, Mar 8, 2013 at 10:01 PM, Eric W. Biederman
 wrote:
>
> When a new task is created one of two things needs to happen.
> A) A reference count needs to be added to the current nsproxy.
> B) B a new nsproxy needs to be created.
>
> The way that code works today is far from a shiny example of totally
> clear code but it is not incorrect.
>
> By moving get_nsproxy down below the first return 0, you removed taking
> the reference count in the one case it is important.
>
> Arguably we should apply the patch below for clarity, and I just might
> queue it up for 3.10.
>
This one is much cleaner. One thing regarding this patch: could we check
the namespace-related flags at the copy_namespaces() call site in
copy_process(), and do get_nsproxy() there as well? I think that would
shave off some function call overhead, and as you've mentioned,
get_nsproxy() is needed at every process creation.

Thanks,
Rakib


Re: [nsproxy] BUG: unable to handle kernel NULL pointer dereference at 0000000000000024

2013-03-09 Thread Rakib Mullick
On Sat, Mar 9, 2013 at 2:33 PM, Eric W. Biederman  wrote:
> Rakib Mullick  writes:
>
>> On Fri, Mar 8, 2013 at 10:01 PM, Eric W. Biederman
>>  wrote:
>>>
>>> When a new task is created one of two things needs to happen.
>>> A) A reference count needs to be added to the current nsproxy.
>>> B) B a new nsproxy needs to be created.
>>>
>>> The way that code works today is far from a shiny example of totally
>>> clear code but it is not incorrect.
>>>
>>> By moving get_nsproxy down below the first return 0, you removed taking
>>> the reference count in the one case it is important.
>>>
>>> Arguably we should apply the patch below for clarity, and I just might
>>> queue it up for 3.10.
>>>
>> This one is much more cleaner. One thing regarding this patch, can we
>> check the namespace related flags at copy_namespace() call time at
>> copy_process(), also get_nsproxy()? I think this will reduce some
>> extra function call overhead and as you've mentioned get_nsproxy() is
>> needed at every process creation.
>
> If you can write a benchmark that can tell the difference, and the code
> continues to be readable.  It would be worth making the change.
>
> My gut says you are proposing an optimization that won't have
> a measurable impact.
>
Yes, it'll be hard to measure this sort of optimization, but I'll try and
will let you know if the impact is measurable :).

Thanks,
Rakib.


Re: [nsproxy] BUG: unable to handle kernel NULL pointer dereference at 0000000000000024

2013-03-11 Thread Rakib Mullick
On Sat, Mar 9, 2013 at 10:48 PM, Rakib Mullick  wrote:
> On Sat, Mar 9, 2013 at 2:33 PM, Eric W. Biederman  
> wrote:
>> Rakib Mullick  writes:
>>
>>> On Fri, Mar 8, 2013 at 10:01 PM, Eric W. Biederman
>>>  wrote:
>>>>
>>>> When a new task is created one of two things needs to happen.
>>>> A) A reference count needs to be added to the current nsproxy.
>>>> B) B a new nsproxy needs to be created.
>>>>
>>>> The way that code works today is far from a shiny example of totally
>>>> clear code but it is not incorrect.
>>>>
>>>> By moving get_nsproxy down below the first return 0, you removed taking
>>>> the reference count in the one case it is important.
>>>>
>>>> Arguably we should apply the patch below for clarity, and I just might
>>>> queue it up for 3.10.
>>>>
>>> This one is much more cleaner. One thing regarding this patch, can we
>>> check the namespace related flags at copy_namespace() call time at
>>> copy_process(), also get_nsproxy()? I think this will reduce some
>>> extra function call overhead and as you've mentioned get_nsproxy() is
>>> needed at every process creation.
>>
>> If you can write a benchmark that can tell the difference, and the code
>> continues to be readable.  It would be worth making the change.
>>
>> My gut says you are proposing an optimization that won't have
>> a measurable impact.
>>
> Yes, it'll be hard to measure these sorts of optimization, though I'll
> try and will tell you if the impact is measurable :).
>
Well, it's now fairly certain that the changes I've proposed don't have
much noticeable impact. From such a tiny change we can't expect much; the
most it could do is shave a little latency off the task creation path. To
find out, I used ftrace's function_graph tracer with set_graph_function
set to do_fork.

So currently copy_namespaces() gets called at every task creation, and it
costs something like the following (a few snippets from the ftrace log):

 3)   0.025 us|try_module_get();
 3) ! 887.805 us  |  } /* dup_mm */
 3)   0.278 us|  copy_namespaces(); <--
 3)   0.094 us|  copy_thread();

 3)   0.264 us|  copy_namespaces(); <-
 3)   0.091 us|  copy_thread();

 3)   0.306 us|  copy_namespaces();  <-
 3)   0.086 us|  copy_thread();

etc. copy_namespaces() just returns here.

On the other hand, with the namespace-related flag check moved to
kernel/fork.c::copy_process(), copy_namespaces() never gets called unless
one of those flags is set.

So this is what I managed to get. If you're not convinced you can drop it,
but we can't expect much more than this from such a simple change.

Thanks,
Rakib.


Re: [Memory Leak] free kprobe before optimized_kprobe free

2012-08-23 Thread Rakib Mullick
On 8/23/12, Akhilesh Kumar  wrote:
> From a77438899c7295299b59cdca8d1816ea69d6ed8e Mon Sep 17 00:00:00 2001
> From: Akhilesh Kumar 
> Date: Fri, 10 Aug 2012 14:02:07 +0530
> Subject:[Memory Leak] free kprobe before optimized_kprobe free
>
> Free *ap before *op otherwise ap pointer will be Dangling
>
> Signed-off-by: Akhilesh Kumar 

The usual way to submit a patch is to mark the subject line with a
[PATCH] tag, not [Memory Leak].

Thanks,
Rakib

> ---
>  kernel/kprobes.c |1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index c62b854..ff0a97b 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -767,6 +767,7 @@ static __kprobes void try_to_optimize_kprobe(struct kprobe *p)
>   if (!arch_prepared_optinsn(&op->optinsn)) {
>   /* If failed to setup optimizing, fallback to kprobe */
>   arch_remove_optimized_kprobe(op);
> + kfree(ap);
>   kfree(op);
>   return;
>   }
> --
> 1.7.8.4


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-15 Thread Rakib Mullick
On 8/13/12, Alex Shi  wrote:
> Since there is no power saving consideration in scheduler CFS, I has a
> very rough idea for enabling a new power saving schema in CFS.
>
> It bases on the following assumption:
> 1, If there are many task crowd in system, just let few domain cpus
> running and let other cpus idle can not save power. Let all cpu take the
> load, finish tasks early, and then get into idle. will save more power
> and have better user experience.
>
Doesn't this assumption indirectly point towards the scheme used when the
performance policy is enabled? After all, you're trying to spread the load
equally amongst all the CPUs.

> 2, schedule domain, schedule group perfect match the hardware, and
> the power consumption unit. So, pull tasks out of a domain means
> potentially this power consumption unit idle.
>
How do you plan to test this power saving scheme? Using powertop, or are
there other tools?

>
> select_task_rq_fair()
> {
>   for_each_domain(cpu, tmp) {
>   if (policy == power && tmp_has_capacity &&
>tmp->flags & sd_flag) {
>   sd = tmp;
>   //It is fine to got cpu in the domain
>   break;
>   }
>   }
>
>   while(sd) {
>   if policy == power
>   find_busiest_and_capable_group()

I'm not sure what find_busiest_and_capable_group() would really be. It
seems it would find the busiest group that still has capacity, but doesn't
that conflict with the first assumption of your proposal?

Thanks,
Rakib.


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-15 Thread Rakib Mullick
On Wed, Aug 15, 2012 at 8:55 PM, Peter Zijlstra  wrote:
> On Wed, 2012-08-15 at 20:24 +0600, Rakib Mullick wrote:
>> How do you plan to test this power saving scheme? Using powertop? Or,
>> is there any other tools?
>
> We should start out simple enough that we can validate it by looking at
> task placement by hand, eg. 4 tasks on a dual socket quad-core, should
> only keep one socket awake.
>
Yeah, "task placement" is something we can do; that's what the scheduler
is known for :).

> We can also add an power aware evaluator to Linsched (another one of
> those things that needs getting sorted).
>
> And yeah, someone running with a power meter is of course king.
>
> _BUT_ we shouldn't go off the wall with power meters as that very
> quickly gets very specific to the system being measured.
>
> We should really keep this thing as simple as possible while still
> providing some benefit for all various architectures without tons of per
> arch knobs and knowhow.

Perhaps that's because there's no well-sorted specification from the
various architectures for dealing with power saving from the scheduler's
point of view. That's actually what I wasn't sure about: whether there is
any documentation of how the scheduler should behave to save power.

Thanks,
Rakib


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-16 Thread Rakib Mullick
On 8/16/12, Alex Shi  wrote:
> On 08/15/2012 10:24 PM, Rakib Mullick wrote:
>
>> On 8/13/12, Alex Shi  wrote:
>>> Since there is no power saving consideration in scheduler CFS, I has a
>>> very rough idea for enabling a new power saving schema in CFS.
>>>
>>> It bases on the following assumption:
>>> 1, If there are many task crowd in system, just let few domain cpus
>>> running and let other cpus idle can not save power. Let all cpu take the
>>> load, finish tasks early, and then get into idle. will save more power
>>> and have better user experience.
>>>
>> This assumption indirectly point towards the scheme when performance
>> is enabled, isn't it? Cause you're trying to spread the load equally
>> amongst all the CPUs.
>
>
> It is.
>
Okay, then what would be the default mechanism? Performance or power
saving? Your proposal deals with both performance and power saving, but
there should be a default mechanism too; what would that default be?
Shouldn't performance be the default, so that the explicit check for the
performance policy can be dropped?

>>
>>>
>>> select_task_rq_fair()
>>> {
>
>   int powersaving = 0;
>
>>> for_each_domain(cpu, tmp) {
>>> if (policy == power && tmp_has_capacity &&
>>>  tmp->flags & sd_flag) {
>>> sd = tmp;
>>> //It is fine to got cpu in the domain
>
>   powersaving = 1;
>
>>> break;
>>> }
>>> }
>>>
>>> while(sd) {
>   if (policy == power && powersaving == 1)
>>> find_busiest_and_capable_group()
>>
>> I'm not sure what find_busiest_and_capable_group() would really be, it
>> seems it'll find the busiest and capable group, but isn't it a
>> conflict with the first assumption you proposed on your proposal?
>
>
> This pseudo code missed a power saving workable flag , adding it into
> above code should solved your concern.
>
I think I should take a look at this one when it's prepared as an RFC.

Thanks,
Rakib.


Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

2012-08-16 Thread Rakib Mullick
 
When a CPU is about to go down, it moves all its sleeping tasks to an
active CPU, and the nr_uninterruptible count is also moved. When moving the
nr_uninterruptible count, currently a randomly picked CPU from the active
CPU mask is chosen, to keep the global nr_uninterruptible count intact. But
it would be more precise to move the nr_uninterruptible count to the CPU
where all the sleeping tasks were moved, and it might also have a subtle
impact on the rq's load calculation. So, this patch is prepared to address
this issue.

Signed-off-by: Rakib Mullick 
---

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 82ad284..5839796 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5304,9 +5304,9 @@ void idle_task_exit(void)
  * their home CPUs. So we just add the counter to another CPU's counter,
  * to keep the global sum constant after CPU-down:
  */
-static void migrate_nr_uninterruptible(struct rq *rq_src)
+static void migrate_nr_uninterruptible(struct rq *rq_src, unsigned int dest_cpu)
 {
-   struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
+   struct rq *rq_dest = cpu_rq(dest_cpu);
 
rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
rq_src->nr_uninterruptible = 0;
@@ -5371,6 +5371,7 @@ static void migrate_tasks(unsigned int dead_cpu)
}
 
rq->stop = stop;
+   migrate_nr_uninterruptible(rq, dest_cpu);
 }
 
 #endif /* CONFIG_HOTPLUG_CPU */
@@ -5612,7 +5613,6 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
BUG_ON(rq->nr_running != 1); /* the migration thread */
raw_spin_unlock_irqrestore(&rq->lock, flags);
 
-   migrate_nr_uninterruptible(rq);
calc_global_load_remove(rq);
break;
 #endif




Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

2012-08-16 Thread Rakib Mullick
On 8/16/12, Peter Zijlstra  wrote:
> On Thu, 2012-08-16 at 19:45 +0600, Rakib Mullick wrote:
>> When a CPU is about to go down, it moves all it's sleeping task to an
>> active CPU, then nr_uninterruptible counts are
>> also moved. When moving nr_uninterruptible count, currently it chooses a
>> randomly picked CPU from the active CPU mask
>> to keep the global nr_uninterruptible count intact. But, it would be
>> precise to move nr_uninterruptible counts to the
>> CPU where all the sleeping tasks were moved and it also might have subtle
>> impact over rq's load calculation. So, this
>> patch is prepared to address this issue.
>
> The Changelog is ill-formated. Other than that, the patch doesn't appear
> to actually do what it says. The sleeping tasks can be scattered to any
> number of cpus as decided by select_fallback_rq().
>
I'm not sure which parts of the changelog don't match the patch. And this
patch assumes that sleeping tasks won't be scattered. From
select_fallback_rq(), sleeping tasks might get scattered in a few cases,
e.g. if a CPU is down or a task isn't allowed to move to a particular CPU.
Other than that, the dest cpu is supposed to be the same.

> Furthermore there should be absolutely no impact on load calculation
> what so ever. nr_uninterruptible is only ever useful as a sum over all
> cpus, this total sum doesn't change regardless of where you put the
> value.
>
> Worse, there's absolutely no relation to the tasks on the runqueue
> (sleeping or otherwise) and nr_uninterruptible, so coupling these
> actions makes no sense what so ever.
>
nr_uninterruptible is combined with the number of tasks on the runqueue to
calculate nr_active. In calc_load_fold_active(), this nr_active number is
used to calculate the delta. This is how I understand this part, and I'm
seeing some impact.
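
For reference, calc_load_fold_active() is roughly the following (paraphrased
from kernel/sched/core.c, might not be letter-exact):

static long calc_load_fold_active(struct rq *this_rq)
{
	long nr_active, delta = 0;

	nr_active = this_rq->nr_running;
	nr_active += (long)this_rq->nr_uninterruptible;

	if (nr_active != this_rq->calc_load_active) {
		delta = nr_active - this_rq->calc_load_active;
		this_rq->calc_load_active = nr_active;
	}

	return delta;
}

so both rq->nr_running and rq->nr_uninterruptible feed into the per-rq delta
that gets folded into calc_load_tasks.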

Thanks,
Rakib


Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

2012-08-16 Thread Rakib Mullick
On 8/16/12, Peter Zijlstra  wrote:
> On Thu, 2012-08-16 at 20:28 +0600, Rakib Mullick wrote:
>
>> nr_uninterruptible is coupled with tasks on the runqueue to calculate
>> nr_active numbers.
>
> It is not.. nr_uninterruptible is incremented on the cpu the task goes
> to sleep and decremented on the cpu doing the wakeup.
>
If nr_uninterruptible's life cycle is this simple, then while a CPU goes
down, the nr_uninterruptible count will be decremented as all the tasks
are moved to other CPUs, and things should be fine.

> This means that nr_uninterruptible is a complete mess and any per-cpu
> value isn't meaningful at all.
>
Well, if nr_uninterruptible is a mess, then this patch has no meaning.
And I also think migrate_nr_uninterruptible() is meaningless too.

> It is quite possible to always have the inc on cpu0 and the decrement on
> cpu1, yielding results like:
>
> {1000, -1000} for an effective nr_uninterruptible = 0. Taking either cpu
> down will then migrate whatever delta it has to another cpu, but there
> might only be a single task, yet the delta is +-1000.
>
>> In calc_load_fold_active(), this nr_active numbers are used to
>> calculate delta. This is how I understand this part and seeing some
>> impact.
>
> You understand wrong, please re-read the comment added in commit
> 5167e8d5.
>
Yes, reading.

Thanks,
Rakib.


Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

2012-08-17 Thread Rakib Mullick
On 8/16/12, Peter Zijlstra  wrote:
> On Thu, 2012-08-16 at 21:32 +0600, Rakib Mullick wrote:
>> And also I think migrate_nr_uninterruptible() is meaning less too.
>
> Hmm, I think I see a problem.. we forget to migrate the effective delta
> created by rq->calc_load_active.
>
And rq->calc_load_active needs to be migrated to the proper dest_rq,
not to a randomly picked rq as is done currently.


Re: Add rq->nr_uninterruptible count to dest cpu's rq while CPU goes down.

2012-08-20 Thread Rakib Mullick
On 8/20/12, Peter Zijlstra  wrote:
> On Fri, 2012-08-17 at 19:39 +0600, Rakib Mullick wrote:
>> On 8/16/12, Peter Zijlstra  wrote:
>> > On Thu, 2012-08-16 at 21:32 +0600, Rakib Mullick wrote:
>> >> And also I think migrate_nr_uninterruptible() is meaning less too.
>> >
>> > Hmm, I think I see a problem.. we forget to migrate the effective delta
>> > created by rq->calc_load_active.
>> >
>> And rq->calc_load_active needs to be migrated to the proper dest_rq
>> not like currently picking any random rq.
>
>
> OK, so how about something like the below, it would also solve Paul's
> issue with that code.
>
>
> Please do double check the logic, I've had all of 4 hours sleep and its
> far too warm for a brain to operate in any case.
>
> ---
> Subject: sched: Fix load avg vs cpu-hotplug
>
> Rabik and Paul reported two different issues related to the same few
> lines of code.
>
First of all, you've misspelled my name, it's Rakib not Rabik.

> Rabik's issue is that the nr_uninterruptible migration code is wrong in
> that he sees artifacts due to this (Rabik please do expand in more
> detail).
>
Okay, I was thinking about the per-rq ->nr_uninterruptible accounting due
to its use in the delta calculation in calc_load_fold_active(). So, I
proposed to migrate nr_uninterruptible to the rq where the tasks were
migrated, so that they get folded together from update_cpu_load(). Now,
note that the delta is calculated using rq->nr_running and
rq->nr_uninterruptible, so if we migrate tasks into one rq but migrate
nr_uninterruptible into another rq, it's wrong and we're screwing up the
delta calculation.

> Paul's issue is that this code as it stands relies on us using
> stop_machine() for unplug, we all would like to remove this assumption
> so that eventually we can remove this stop_machine() usage altogether.
>
> The only reason we'd have to migrate nr_uninterruptible is so that we
> could use for_each_online_cpu() loops in favour of
> for_each_possible_cpu() loops, however since nr_uninterruptible() is the
> only such loop and its using possible lets not bother at all.
>
> The problem Rabik sees is (probably) caused by the fact that by
> migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
> involved.
>
Certainly, we don't care about the rq which is going down; but we do care
about the dest_rq.

> So don't bother with fancy migration schemes (meaning we now have to
> keep using for_each_possible_cpu()) and instead fold any nr_active delta
> after we migrate all tasks away to make sure we don't have any skewed
> nr_active accounting.
>
>
> Reported-by: Rakib Mullick 
> Reported-by: Paul E. McKenney 
> Signed-off-by: Peter Zijlstra 
> ---
>  kernel/sched/core.c | 31 ++-
>  1 file changed, 10 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4376c9f..06d23c6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5338,27 +5338,17 @@ void idle_task_exit(void)
>  }
>
>  /*
> - * While a dead CPU has no uninterruptible tasks queued at this point,
> - * it might still have a nonzero ->nr_uninterruptible counter, because
> - * for performance reasons the counter is not stricly tracking tasks to
> - * their home CPUs. So we just add the counter to another CPU's counter,
> - * to keep the global sum constant after CPU-down:
> - */
> -static void migrate_nr_uninterruptible(struct rq *rq_src)
> -{
> - struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
> -
> - rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
> - rq_src->nr_uninterruptible = 0;
> -}
> -
> -/*
> - * remove the tasks which were accounted by rq from calc_load_tasks.
> + * Since this CPU is going 'away' for a while, fold any nr_active delta
> + * we might have. Assumes we're called after migrate_tasks() so that the
> + * nr_active count is stable.
> + *
But after migrate_tasks(), it's likely that rq->nr_running will be 1.
Then, nr_active will be screwed. No?

> + * Also see the comment "Global load-average calculations".
>   */
> -static void calc_global_load_remove(struct rq *rq)
> +static void calc_load_migrate(struct rq *rq)
>  {
> - atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
> - rq->calc_load_active = 0;
> + long delta = calc_load_fold_active(rq);
> + if (delta)
> + atomic_long_add(delta, &calc_load_tasks);
>  }
>
>  /*
> @@ -5652,8 +5642,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
>   BUG_ON(rq->nr_runn

[PATCH] Fix isocpus's param handling when CPUMASK_OFFSTACK=n.

2017-10-15 Thread Rakib Mullick
cpulist_parse() uses nr_cpumask_bits as the limit to parse the
passed buffer from the kernel command line. What nr_cpumask_bits
represents varies depending upon the CONFIG_CPUMASK_OFFSTACK option.
If CONFIG_CPUMASK_OFFSTACK=n, then nr_cpumask_bits is the same as
NR_CPUS, which might not represent the number of CPUs that really
exist (default 64). So, there's a chance of a gap between nr_cpu_ids
and NR_CPUS, which ultimately leads to an invalid cpulist_parse()
operation. For example, if isolcpus=9 is passed on an 8 cpu
system (CONFIG_CPUMASK_OFFSTACK=n) it doesn't show the error
that it's supposed to.

This patch fixes this issue by effectively finding the last
cpu of the passed isolcpus list and checking it against nr_cpu_ids.
It also fixes the error message, where nr_cpu_ids should be
nr_cpu_ids-1, since cpu numbering starts from 0.

Signed-off-by: Rakib Mullick 
---
 include/linux/cpumask.h | 16 
 kernel/sched/topology.c |  8 +---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 2404ad2..f9a7c9b 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -130,6 +130,11 @@ static inline unsigned int cpumask_first(const struct cpumask *srcp)
return 0;
 }
 
+static inline unsigned int cpumask_last(const struct cpumask *srcp)
+{
+   return 0;
+}
+
 /* Valid inputs for n are -1 and 0. */
 static inline unsigned int cpumask_next(int n, const struct cpumask *srcp)
 {
@@ -179,6 +184,17 @@ static inline unsigned int cpumask_first(const struct cpumask *srcp)
 }
 
 /**
+ * cpumask_last - get the last cpu in a cpumask
+ * @srcp:  - the cpumask pointer
+ *
+ * Returns >= nr_cpumask_bits if no cpus set.
+ */
+static inline unsigned int cpumask_last(const struct cpumask *srcp)
+{
+   return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
+}
+
+/**
  * cpumask_next - get the next cpu in a cpumask
  * @n: the cpu prior to the place to search (ie. return will be > @n)
  * @srcp: the cpumask pointer
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1b0b4fb..55b5e09 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -452,12 +452,14 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 /* Setup the mask of CPUs configured for isolated domains */
 static int __init isolated_cpu_setup(char *str)
 {
-   int ret;
+   int ret, lastcpu;
 
alloc_bootmem_cpumask_var(&cpu_isolated_map);
ret = cpulist_parse(str, cpu_isolated_map);
-   if (ret) {
-   pr_err("sched: Error, all isolcpus= values must be between 0 and %d\n", nr_cpu_ids);
+   lastcpu = cpumask_last(cpu_isolated_map);
+   if (ret || lastcpu >= nr_cpu_ids) {
+   pr_err("sched: Error, all isolcpus= values must be between 0 and %d\n",
+   nr_cpu_ids-1);
return 0;
}
return 1;
-- 
2.9.3



[PATCH] Documentation, fix module-signing file reference.

2017-10-31 Thread Rakib Mullick
Kconfig reference for module-signing.txt file needs to
be replaced with admin-guide/module-signing.rst.

Signed-off-by: Rakib Mullick 
---
 init/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/init/Kconfig b/init/Kconfig
index 3c1faaa..1b5e786 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1752,7 +1752,7 @@ config MODULE_SIG
help
  Check modules for valid signatures upon load: the signature
  is simply appended to the module. For more information see
- Documentation/module-signing.txt.
+ Documentation/admin-guide/module-signing.rst.
 
  Note that this option adds the OpenSSL development packages as a
  kernel build dependency so that the signing tool can use its crypto
-- 
2.9.3



[PATCH] irq/core: Fix boot crash when the irqaffinity= boot parameter is passed on CPUMASK_OFFSTACK=y kernels(v1)

2017-10-31 Thread Rakib Mullick
When the irqaffinity= kernel parameter is passed in a CPUMASK_OFFSTACK=y
kernel, it fails to boot, because zalloc_cpumask_var() cannot be used before
initializing the slab allocator to allocate a cpumask.

So, use alloc_bootmem_cpumask_var() instead.

Also do some cleanups while at it: in init_irq_default_affinity() remove
an unnecessary #ifdef.

Change since v0:
* Fix build warning.

Signed-off-by: Rakib Mullick 
Cc: Thomas Gleixner 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Link: http://lkml.kernel.org/r/20171026045800.27087-1-rakib.mull...@gmail.com
Signed-off-by: Ingo Molnar 
---
Patch created against -rc7 (commit 0b07194bb55ed836c2). I found tip had a merge
conflict, so used -rc7 instead.

 kernel/irq/irqdesc.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 82afb7e..e97bbae 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -27,7 +27,7 @@ static struct lock_class_key irq_desc_lock_class;
 #if defined(CONFIG_SMP)
 static int __init irq_affinity_setup(char *str)
 {
-   zalloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
+   alloc_bootmem_cpumask_var(&irq_default_affinity);
cpulist_parse(str, irq_default_affinity);
/*
 * Set at least the boot cpu. We don't want to end up with
@@ -40,10 +40,8 @@ __setup("irqaffinity=", irq_affinity_setup);
 
 static void __init init_irq_default_affinity(void)
 {
-#ifdef CONFIG_CPUMASK_OFFSTACK
-   if (!irq_default_affinity)
+   if (!cpumask_available(irq_default_affinity))
zalloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
-#endif
if (cpumask_empty(irq_default_affinity))
cpumask_setall(irq_default_affinity);
 }
-- 
2.9.3



Re: [PATCH] [irq] Fix boot failure when irqaffinity is passed.

2017-10-31 Thread Rakib Mullick
On Tue, Oct 31, 2017 at 5:29 PM, Ingo Molnar  wrote:
>
>
> Not applied, because this patch causes the following build warning:
>
>   kernel/irq/irqdesc.c:43:6: warning: the address of ‘irq_default_affinity’ 
> will always evaluate as ‘true’ [-Waddress]
>
Ah, sorry I didn't look into the build log. It happened due to removal
of #ifdef's. Now, it's been fixed by using cpumask_available().

> Also, please pick up the improved changelog below for the next version of the
> patch.
>
Thanks for the improved changelog, I have sent a new version here:
https://lkml.org/lkml/2017/11/1/6.

Thanks,
Rakib.


[PATCH] [irq] Fix boot failure when irqaffinity is passed.

2017-10-25 Thread Rakib Mullick
When the irqaffinity kernel param is passed in a CPUMASK_OFFSTACK=y build,
the kernel fails to boot. zalloc_cpumask_var() cannot be used to allocate a
cpumask before the mm stuff (slab allocator) is initialized. So, use
alloc_bootmem_cpumask_var(). Also, in init_irq_default_affinity() remove an
unneeded #ifdef; these ifdef conditions are handled at the definition site.

Signed-off-by: Rakib Mullick 
---
 kernel/irq/irqdesc.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 982a357..db6380d 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -27,7 +27,7 @@ static struct lock_class_key irq_desc_lock_class;
 #if defined(CONFIG_SMP)
 static int __init irq_affinity_setup(char *str)
 {
-   zalloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
+   alloc_bootmem_cpumask_var(&irq_default_affinity);
cpulist_parse(str, irq_default_affinity);
/*
 * Set at least the boot cpu. We don't want to end up with
@@ -40,10 +40,8 @@ __setup("irqaffinity=", irq_affinity_setup);
 
 static void __init init_irq_default_affinity(void)
 {
-#ifdef CONFIG_CPUMASK_OFFSTACK
if (!irq_default_affinity)
zalloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
-#endif
if (cpumask_empty(irq_default_affinity))
cpumask_setall(irq_default_affinity);
 }
-- 
2.9.3



[PATCH] lib: Avoid redundant sizeof checking in __bitmap_weight() calculation.

2017-11-13 Thread Rakib Mullick
Currently, hweight_long() is used in the __bitmap_weight() calculation.
Inside hweight_long() a check is made to figure out whether the hweight32()
or hweight64() version should be used.

However, it's unnecessary to do that inside the loop of the
__bitmap_weight() calculation. We can detect whether to use hweight32() or
hweight64() upfront and call the respective function directly. It also
reduces the vmlinux size.
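
For reference, hweight_long() is basically this (paraphrasing
include/linux/bitops.h):

static inline unsigned long hweight_long(unsigned long w)
{
	return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
}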

Before the patch:
    text     data       bss       dec     hex filename
12901332  7798930  14541816  35242078 219c05e vmlinux

After the patch:
    text     data       bss       dec     hex filename
12901331  7798930  14541816  35242077 219c05d vmlinux

Signed-off-by: Rakib Mullick 
Cc: Andrew Morton 
Cc: Rasmus Villemoes 
Cc: Matthew Wilcox 
Cc: Yury Norov 
Cc: Mauro Carvalho Chehab 
 
---
Patch was created against torvald's tree (commit 43ff2f4db9d0f764).

 lib/bitmap.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/lib/bitmap.c b/lib/bitmap.c
index d8f0c09..552096f 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -241,10 +241,15 @@ EXPORT_SYMBOL(__bitmap_subset);
 int __bitmap_weight(const unsigned long *bitmap, unsigned int bits)
 {
unsigned int k, lim = bits/BITS_PER_LONG;
-   int w = 0;
-
-   for (k = 0; k < lim; k++)
-   w += hweight_long(bitmap[k]);
+   int w = 0, is32 = sizeof(bitmap[0]) ? 1 : 0;
+
+   if (is32) {
+   for (k = 0; k < lim; k++)
+   w += hweight32(bitmap[k]);
+   } else {
+   for (k = 0; k < lim; k++)
+   w += hweight64(bitmap[k]);
+   }
 
if (bits % BITS_PER_LONG)
w += hweight_long(bitmap[k] & BITMAP_LAST_WORD_MASK(bits));
-- 
2.9.3



Re: [PATCH] lib: Avoid redundant sizeof checking in __bitmap_weight() calculation.

2017-11-14 Thread Rakib Mullick
On Tue, Nov 14, 2017 at 1:23 PM, Rasmus Villemoes
 wrote:
>
> hint: sizeof() very rarely evaluates to zero... So this is the same as
> "is32 = 1". So the patch as-is is broken (and may explain the 1-byte
> delta in vmlinux). But even if this condition is fixed, the patch
> doesn't change anything, since the sizeof() evaluation is done at
> compile-time, regardless of whether it happens inside the inlined
> hweight_long or outside. So it is certainly not worth it to duplicate
> the loop.
>
Right, no need to duplicate the loop. Checking the objdump output, it
didn't change anything at all. Fixing the condition nullifies the vmlinux
size advantage as well. Drop it.

Thanks,
Rakib.


Re: [PATCH] Documentation, fix module-signing file reference.

2017-11-05 Thread Rakib Mullick
Ping! Anyone care to take this trivial fix?

On Tue, Oct 31, 2017 at 8:39 PM, Rakib Mullick  wrote:
> Kconfig reference for module-signing.txt file needs to
> be replaced with admin-guide/module-signing.rst.
>
> Signed-off-by: Rakib Mullick 
> ---
>  init/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 3c1faaa..1b5e786 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1752,7 +1752,7 @@ config MODULE_SIG
> help
>   Check modules for valid signatures upon load: the signature
>   is simply appended to the module. For more information see
> - Documentation/module-signing.txt.
> + Documentation/admin-guide/module-signing.rst.
>
>   Note that this option adds the OpenSSL development packages as a
>   kernel build dependency so that the signing tool can use its crypto
> --
> 2.9.3
>


[PATCH] x86: Avoid showing repetitive message from intel_init_thermal().

2014-09-18 Thread Rakib Mullick
intel_init_thermal() is called a) at system initialization time and b) at
system resume time, to initialize thermal monitoring. In the case where
thermal monitoring is handled by SMI, we get to know it via printk().
Currently it prints the message in both cases, but it's okay if we get it
only once; there's no need to get the same message every time the system
resumes. So, limit this message to system boot time by not showing it at
system resume, and reduce abuse of the kernel log buffer.

Signed-off-by: Rakib Mullick 
---

diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 36a1bb6..c8a8e45 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -498,8 +498,9 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
 

if ((l & MSR_IA32_MISC_ENABLE_TM1) && (h & APIC_DM_SMI)) {
-   printk(KERN_DEBUG
-  "CPU%d: Thermal monitoring handled by SMI\n", cpu);
+   if (system_state == SYSTEM_BOOTING)
+   printk(KERN_DEBUG
+   "CPU%d: Thermal monitoring handled by SMI\n", cpu);
return;
}
 





[PATCH] cpuidle: Minor optimization in cpuidle governor finding.

2014-09-26 Thread Rakib Mullick
Currently in __cpuidle_find_governor(), strnicmp() is used to find which
governor (menu or ladder) to register, where CPUIDLE_NAME_LEN is passed as
strnicmp()'s maximum comparison length. But the strings "menu" and "ladder"
are much smaller than CPUIDLE_NAME_LEN. To avoid this, this patch
introduces a len field in the cpuidle_governor structure, which is
statically initialized to the length of the governor name. This length is
then passed as strnicmp()'s length param. Another advantage this patch
introduces: by checking the governor name length before calling
strnicmp(), we can determine whether strnicmp() needs to be called at all,
i.e. a shortcut.

Signed-off-by: Rakib Mullick 
---

diff --git a/drivers/cpuidle/governor.c b/drivers/cpuidle/governor.c
index ca89412..f601cc5 100644
--- a/drivers/cpuidle/governor.c
+++ b/drivers/cpuidle/governor.c
@@ -20,16 +20,21 @@ struct cpuidle_governor *cpuidle_curr_governor;
 /**
  * __cpuidle_find_governor - finds a governor of the specified name
  * @str: the name
+ * @size: length of str
  *
  * Must be called with cpuidle_lock acquired.
  */
-static struct cpuidle_governor * __cpuidle_find_governor(const char *str)
+static struct cpuidle_governor *__cpuidle_find_governor(const char *str,
+   size_t size)
 {
struct cpuidle_governor *gov;
 
-   list_for_each_entry(gov, &cpuidle_governors, governor_list)
-   if (!strnicmp(str, gov->name, CPUIDLE_NAME_LEN))
+   list_for_each_entry(gov, &cpuidle_governors, governor_list) {
+   if (size != gov->len)
+   continue;
+   if (!strnicmp(str, gov->name, gov->len))
return gov;
+   }
 
return NULL;
 }
@@ -85,7 +90,7 @@ int cpuidle_register_governor(struct cpuidle_governor *gov)
return -ENODEV;
 
mutex_lock(&cpuidle_lock);
-   if (__cpuidle_find_governor(gov->name) == NULL) {
+   if (__cpuidle_find_governor(gov->name, gov->len) == NULL) {
ret = 0;
list_add_tail(&gov->governor_list, &cpuidle_governors);
if (!cpuidle_curr_governor ||
diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
index 044ee0d..91aff9c 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -177,6 +177,7 @@ static void ladder_reflect(struct cpuidle_device *dev, int index)
 
 static struct cpuidle_governor ladder_governor = {
.name = "ladder",
+   .len =  6,
.rating =   10,
.enable =   ladder_enable_device,
.select =   ladder_select_state,
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 34db2fb..23de182 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -477,6 +477,7 @@ static int menu_enable_device(struct cpuidle_driver *drv,
 
 static struct cpuidle_governor menu_governor = {
.name = "menu",
+   .len =  4,
.rating =   20,
.enable =   menu_enable_device,
.select =   menu_select,
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 25e0df6..bb9b5bd 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -198,6 +198,7 @@ struct cpuidle_governor {
charname[CPUIDLE_NAME_LEN];
struct list_headgovernor_list;
unsigned intrating;
+   size_t  len;
 
int  (*enable)  (struct cpuidle_driver *drv,
struct cpuidle_device *dev);




Do we really need curr_target in signal_struct ?

2014-01-28 Thread Rakib Mullick
I was wondering what might be the possible use of curr_target in
signal_struct (at least, it's not doing what its comment says). Also, I'm
not seeing any real use of it except in kernel/signal.c::complete_signal(),
where it's used as a loop-breaking condition while traversing the thread
list. As an alternative to using curr_target, we can use the
get_nr_threads() count to get the number of threads in a group and can
remove curr_target completely. This will also help us get rid of the
overhead of maintaining ->curr_target in the fork()/exit() path. Below is a
rough patch; I can prepare a proper one if everyone agrees.

Or do we really need it?

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 485234d..1fd65b7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -554,7 +554,7 @@ struct signal_struct {
wait_queue_head_t   wait_chldexit;  /* for wait4() */
 
/* current thread group signal load-balancing target: */
-   struct task_struct  *curr_target;
+   //struct task_struct*curr_target;
 
/* shared signal handling: */
struct sigpending   shared_pending;
diff --git a/kernel/exit.c b/kernel/exit.c
index 1e77fc6..4a2cf37 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -113,8 +113,8 @@ static void __exit_signal(struct task_struct *tsk)
if (sig->notify_count > 0 && !--sig->notify_count)
wake_up_process(sig->group_exit_task);
 
-   if (tsk == sig->curr_target)
-   sig->curr_target = next_thread(tsk);
+   //if (tsk == sig->curr_target)
+   //  sig->curr_target = next_thread(tsk);
/*
 * Accumulate here the counters for all threads but the
 * group leader as they die, so they can be added into
diff --git a/kernel/fork.c b/kernel/fork.c
index 2f11bbe..8de4928 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1041,7 +1041,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
tsk->thread_node = (struct list_head)LIST_HEAD_INIT(sig->thread_head);
 
init_waitqueue_head(&sig->wait_chldexit);
-   sig->curr_target = tsk;
+   //sig->curr_target = tsk;
init_sigpending(&sig->shared_pending);
INIT_LIST_HEAD(&sig->posix_timers);
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 940b30e..1a1280a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -945,6 +945,7 @@ static void complete_signal(int sig, struct task_struct *p, int group)
 {
struct signal_struct *signal = p->signal;
struct task_struct *t;
+   int i;
 
/*
 * Now find a thread we can wake up to take the signal off the queue.
@@ -961,21 +962,16 @@ static void complete_signal(int sig, struct task_struct *p, int group)
 */
return;
else {
-   /*
-* Otherwise try to find a suitable thread.
-*/
-   t = signal->curr_target;
-   while (!wants_signal(sig, t)) {
+   i = get_nr_threads(p);
+   t = p;
+   do {
+   --i;
t = next_thread(t);
-   if (t == signal->curr_target)
-   /*
-* No thread needs to be woken.
-* Any eligible threads will see
-* the signal in the queue soon.
-*/
+   if (!i)
return;
-   }
-   signal->curr_target = t;
+   } while (!wants_signal(sig, t));
+   
+   //signal->curr_target = t;
}
 
/*




Re: Do we really need curr_target in signal_struct ?

2014-01-28 Thread Rakib Mullick
On Tue, Jan 28, 2014 at 10:43 PM, Oleg Nesterov  wrote:
> On 01/28, Rakib Mullick wrote:
>
> You could simply do while_each_thread(p, t) to find a thread which
> wants_signal(..).
>
Yes, while_each_thread() is much nicer than get_nr_threads(); thanks for
the pointer.

> But I guess ->curr_target was added exactly to avoid this loop if
> possible, assuming that wants_signal(->current_targer) should be
> likely true. Although perhaps this optimization is too simple.
>
Well, this code block will only be hit when the first check for
wants_signal() misses; that means we need to find some other thread of the
group. AFAIU, ->curr_target is only a loop breaker to avoid an infinite
loop, but by using while_each_thread() we can remove it completely, which
also helps us get rid of maintaining it.
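
Roughly, the traversal I have in mind is the usual idiom, something like
this (just a sketch to show the idea, not the final patch):

	struct task_struct *t = p;

	do {
		if (wants_signal(sig, t))
			goto found;	/* wake this one up */
	} while_each_thread(p, t);

	/* no eligible thread, nothing to wake */
	return;
found:
	/* ... continue as before with t ... */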

I'll prepare a proper patch with your suggestions for review.

Thanks,
Rakib


Re: Do we really need curr_target in signal_struct ?

2014-01-28 Thread Rakib Mullick
On Wed, Jan 29, 2014 at 10:09 AM, Rakib Mullick  wrote:
> On Tue, Jan 28, 2014 at 10:43 PM, Oleg Nesterov  wrote:
>> On 01/28, Rakib Mullick wrote:
>>
>> You could simply do while_each_thread(p, t) to find a thread which
>> wants_signal(..).
>>
> Yes, while_each_thread() is much nicer than get_nr_thread(), thanks for
> the pointer.
>
Or should we use for_each_thread()? I just noticed that you plan to
remove while_each_thread().

Thanks,
Rakib


Re: Do we really need curr_target in signal_struct ?

2014-01-29 Thread Rakib Mullick
On Wed, Jan 29, 2014 at 8:55 PM, Oleg Nesterov  wrote:
> On 01/29, Rakib Mullick wrote:
>>
>> On Tue, Jan 28, 2014 at 10:43 PM, Oleg Nesterov  wrote:
>>
>
>> AFAIU, ->current_target is only a loop breaker to avoid infinite loop,
>
> No. It caches the last result of "find a thread which can handle this
> group-wide signal".
>
The reason behind my understanding is the following comment:

/*
 * No thread needs to be woken.
 * Any eligible threads will see
 * the signal in the queue soon.
 */

What if there's no thread in the group that wants_signal()? Or can that
practically not happen?

>> but - by using while_each_thread() we can remove it completely, thus
>> helps to get rid from maintaining it too.
>
> ... and remove the optimization above.
>
>> I'll prepare a proper patch with you suggestions for reviewing.
>
> I am not sure we want this patch. Once again, I do not know how much
> ->curr_target helps, and certainaly it can't help always. But you
> should not blindly remove it just because yes, sure, it is not strictly
> needed to find a wants_signal() thread.
>
Are you thinking that, since things are not broken, we shouldn't try to
do anything?

Thanks,
Rakib


Re: Do we really need curr_target in signal_struct ?

2014-01-29 Thread Rakib Mullick
On Thu, Jan 30, 2014 at 12:32 AM, Oleg Nesterov  wrote:
> On 01/29, Rakib Mullick wrote:

>> Are you thinking that , since things are not broken, then we shouldn't
>> try to do anything?
>
> Hmm. No.
>
> I am thinking that, since you misunderstood the purpose of ->curr_target,
> I should probably try to argue with your patch which blindly removes this
> optimization ?
>
Since the optimization (the use of ->curr_target) isn't perfect, there's a
chance of it being misunderstood. This optimization is misleading too (I
think), because curr_target doesn't guarantee anything about
wants_signal(), and ->curr_target is used only for this optimization. To
get this optimization it needs to be maintained properly, and this
maintenance does have a cost; if we don't benefit much from it, then it
isn't worth it (my pov).

> I also think that this logic doesn't look perfect, but this is another
> story.

Yes, this logic seems to need improvement.

Thanks,
Rakib


Re: Do we really need curr_target in signal_struct ?

2014-01-31 Thread Rakib Mullick
On Thu, Jan 30, 2014 at 1:02 PM, Rakib Mullick  wrote:
> On Thu, Jan 30, 2014 at 12:32 AM, Oleg Nesterov  wrote:
>> On 01/29, Rakib Mullick wrote:
>
>>> Are you thinking that , since things are not broken, then we shouldn't
>>> try to do anything?
>>
>> Hmm. No.
>>
>> I am thinking that, since you misunderstood the purpose of ->curr_target,
>> I should probably try to argue with your patch which blindly removes this
>> optimization ?
>>
> Since the optimization (usages of ->curr_target) isn't perfect, so there's
> chance of being misunderstood. This optimization is misleading too (I think),
> cause curr_target don't have anything to do with wants_signal() and
>  ->curr_target is used only for this optimization and to get this optimization
> needs to maintain it properly, and this maintenance does have cost and if
> we don't get benefited too much, then it doesn't worth it (my pov).
>
>> I also think that this logic doesn't look perfect, but this is another
>> story.
>
> Yes, this logic seems need to improve.
>
As you've made a few points about the current thread-picking logic in
complete_signal(), I took a look and found that using while_each_thread()
can make things better than the current code. Although my initial intent
in this thread wasn't related to the complete_signal() logic, as you've
pointed out that things could be better, I prepared the attached patch
(just to address complete_signal()'s thread-finding logic). It doesn't use
->curr_target, but the field has been kept. What do you think?


Thanks,
Rakib
diff --git a/kernel/signal.c b/kernel/signal.c
index 52f881d..064f81f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -944,7 +944,7 @@ static inline int wants_signal(int sig, struct task_struct *p)
 static void complete_signal(int sig, struct task_struct *p, int group)
 {
 	struct signal_struct *signal = p->signal;
-	struct task_struct *t;
+	struct task_struct *t = p;
 
 	/*
 	 * Now find a thread we can wake up to take the signal off the queue.
@@ -952,33 +952,26 @@ static void complete_signal(int sig, struct task_struct *p, int group)
 	 * If the main thread wants the signal, it gets first crack.
 	 * Probably the least surprising to the average bear.
 	 */
-	if (wants_signal(sig, p))
-		t = p;
-	else if (!group || thread_group_empty(p))
-		/*
-		 * There is just one thread and it does not need to be woken.
-		 * It will dequeue unblocked signals before it runs again.
-		 */
-		return;
-	else {
-		/*
-		 * Otherwise try to find a suitable thread.
-		 */
-		t = signal->curr_target;
-		while (!wants_signal(sig, t)) {
-			t = next_thread(t);
-			if (t == signal->curr_target)
-				/*
-				 * No thread needs to be woken.
-				 * Any eligible threads will see
-				 * the signal in the queue soon.
-				 */
-				return;
+	if (!group || thread_group_empty(p)) {
+		if (wants_signal(sig, t))
+			goto found;
+	} else {
+		while_each_thread(p, t) {
+			if (wants_signal(sig, t))
+				goto found;
 		}
-		signal->curr_target = t;
 	}
 
 	/*
+	 * No thread needs to be woken.
+	 * Any eligible threads will see
+	 * the signal in the queue soon.
+	 */
+	return;
+found:
+	signal->curr_target = t;
+
+	/*
 	 * Found a killable thread.  If the signal will be fatal,
 	 * then start taking the whole group down immediately.
 	 */


Re: Do we really need curr_target in signal_struct ?

2014-02-02 Thread Rakib Mullick
On Sat, Feb 1, 2014 at 10:51 PM, Oleg Nesterov  wrote:
> On 02/01, Rakib Mullick wrote:
>>
>> On Thu, Jan 30, 2014 at 1:02 PM, Rakib Mullick  
>> wrote:
>> > On Thu, Jan 30, 2014 at 12:32 AM, Oleg Nesterov  wrote:
>> >> On 01/29, Rakib Mullick wrote:
>> >
>> >>> Are you thinking that , since things are not broken, then we shouldn't
>> >>> try to do anything?
>> >>
>> >> Hmm. No.
>> >>
>> >> I am thinking that, since you misunderstood the purpose of ->curr_target,
>> >> I should probably try to argue with your patch which blindly removes this
>> >> optimization ?
>> >>
>> > Since the optimization (usages of ->curr_target) isn't perfect, so there's
>> > chance of being misunderstood. This optimization is misleading too (I 
>> > think),
>> > cause curr_target don't have anything to do with wants_signal()
>
> can't understand... curr_target is obviously connected to wants_signal() ?
No. I meant curr_target doesn't make sure that the thread really wants the
signal; it's just a likely choice.

> As I already said it caches the last wants_signal(t) thread?
Yes, you said. But it gets messed up in the exit path, so it's not useful
every time. And if it fails, p gets checked twice.

>> > and
>> >  ->curr_target is used only for this optimization and to get this 
>> > optimization
>> > needs to maintain it properly, and this maintenance does have cost and if
>> > we don't get benefited too much, then it doesn't worth it (my pov).
>
> but you need to prove this somehow.
>
Right, I'll try. I'll do it as per my understanding, and everyone might
not agree.

>> I took a look and found that using while_each_thread()
>> can make things better than current.
>
> Why?
>
Using while_each_thread() we can start the thread traversal from p, which
is a likely pick, and it also avoids the redundant check of p.

>> What do you think?
>
> The patch is technically wrong, a group-wide signal doesn't check all
> threads after this change.
If the group is empty, we don't need to check any thread other than t.

> And I do not understand why complete_signal()
> still updates ->curr_target.
Didn't want to remove it blindly :-/

> Plus thread_group_empty() doesn't buy too
> much if we use while_each_thread(). But this doesn't really matter, easy
> to fix.
>
> Rakib. It is not that I like ->curr_target very much. But let me repeat,
> if you want to remove this optimization you need to explain why it doesn't
> make sense. You claim that this "can make things better" without any
> argument.
>
Well, I had another way of looking at it; I didn't find any real usage of
->curr_target and got confused by the code comment on curr_target.

> As for me - I simply do not know. This logic is very old, I am not even
> sure that the current usage of ->curr_signal matches the original intent.
> But it can obviously help in some cases, and you need to explain why we
> do not care.
>
Actually, this is exactly what I wanted to know. What is the purpose of
->curr_target, *do we really need it?* If I knew or had seen any reason,
I'd never have asked this question. You guys know this domain better than
me, that's why I asked. The git log didn't help much, nor did the
documentation. As a result, I asked and thought about cleaning it up (as I
rarely do, if it meets my capability of course), so I appended a sort of
rough patch.

> So I won't argue if you submit the technically correct patch, but you
> need to convince Andrew or Ingo to take it. I think the right change in
> complete_signal() is something like below. It can be even simpler and use
> the single do/while loop, but then we need to check "group" inside that
> loop. With the change below ->curr_target can be simply removed.
>
I'll try to make my points to convince Andrew or Ingo; I'll lay out my
reasoning, and the rest will be up to them.

And here's what I think about ->curr_target removal (as per my understanding):

 a) ->curr_target might only be useful in the group-wide case.
 b) Usually signals are sent to a particular thread, so ->curr_target
 doesn't make sense there.
 c) If the thread pointed to by ->curr_target is killed, curr_target points
 to the next thread, but nothing makes sure that it's the right pick.
 d) Also, in the current implementation p is checked twice.
 e) One use case of ->curr_target is that it always points to the most
 recently created thread, as that's a better candidate for wants_signal()
 because a newly created thread doesn't usually have any signal pending.
 We can achieve the same without ->curr_target by traversing the thread
 group from the most recently created thread.

So, this is what I think. Let me know if these reasons look reasonable to
you, because before Ingo or Andrew take it, it requires your ack.

Thanks,
Rakib.


Re: Do we really need curr_target in signal_struct ?

2014-02-03 Thread Rakib Mullick
On Mon, Feb 3, 2014 at 10:39 PM, Oleg Nesterov  wrote:
> On 02/02, Rakib Mullick wrote:
>
>> > As I already said it caches the last wants_signal(t) thread?
>> Yes, you said. But, gets messed up at exit path, not useful everytime.
>
> Yes.
>
>> If fails, p gets checked twice.
>
> Yes, but this is minor, I think.
>
Right.

>> >> I took a look and found that using while_each_thread()
>> >> can make things better than current.
>> >
>> > Why?
>> >
>> using while_each_thread() we can start thread traversing from p, which
>> is a likely
>> pick and also gets away from redundant checking of p.
>
> Heh. We always check "p" first. And (in general) we do not want to start
> from "p" if we want to find a wants_signal() thread, and ->curr_target can
> help to avoid this.
>
>> >> What do you think?
>> >
>> > The patch is technically wrong, a group-wide signal doesn't check all
>> > threads after this change.
>> If group is empty, we don't need to check other than t.
>
> I didn't meant the thread_group_empty() case. Please look at your code:
>
>
> if (!group || thread_group_empty(p)) {
> if (wants_signal(sig, t))
> goto found;
> } else {
> while_each_thread(p, t) {
> if (wants_signal(sig, t))
> goto found;
> }
> }
>
> Suppose that group == T, thread_group_empty(p) == F. Suppose that all
> sub-threads except "p" blocked this signal. With this change "p" (and
> thus the whole thread group) won't be notified. IOW, with your change
> we do not check "p" at all. This is wrong.
>
Oh, sorry, my bad. That was wrong.

> The only user of ->curr_target is complete_signal(), you have found it.
>
Indeed.

>
> I can only read the current code. I do not know the original intent.
>
This is where things are confusing.


> Really?
>
Sometimes, 100% correct (!group case) ;-).

>
> Yes (except a thread can't be killed), so what? Obviously, if ->curr_targer
> exits we should update this pointer. We could even nullify it.
>
That makes ->curr_target less useful, that's what I meant.

>
> Yes, "p" can be checked twice. I don't think this is that bad, and I
> do not think this particular "problem" should be fixed.
>
Yes, it's minor.

>
> I simply can't understand. Why? I do not think so.
>
Because the wants_signal() logic checks those thread attributes to find
whether a thread is eligible or not.

>> We can acheive the same without ->curr_signal
>> by traversing thread group from the lastly created thread.
>
> We certainly can't "achieve the same" this way, although I am not sure
> what this "the same" actually means.
>
>> So, this is what I think. Let me know if these reason's looks reasonable to 
>> you,
>
> No. Contrary, whatever I personally think about ->curr_signal, I feel
> that you do not understand the code you are trying to change. Sorry,
> I can be wrong. But I still do not see any argument.
>
Yes, right. I do not fully understand this code, or how exactly it impacts
the signaling subsystem. Therefore, I think I should not make any changes
to this code.

>> cause before Ingo or Andrew taking it, it requires your ack.
>
> Not really. And of course I'll review the patch correctness-wise, and
> I already sent the change in complete_signal() which looks right to me.
>
> But I am not going to ack the behaviour change, simply because I have
> no idea how this can impact the existing applications. Perhaps nobody
> will notice this change, but we can't know this.
>
Yes, I'm also not sure about the behavior change and its impact on
existing applications, so I'm skipping it.

I usually try to make small fixes and cleanups, since they are less
error-prone and require less follow-up. Since things here are becoming a
sort of "don't know" thing, I think I should stop. But thank you for
helping and replying in this thread.

Again thanks,
Rakib.


ANNOUNCE: BLD-3.16 release.

2014-08-07 Thread Rakib Mullick
Hello everyone,

It's been quite a long time since the announcement of the Barbershop Load
Distribution (BLD) algorithm. Quite a few changes have been made since
then; now it reflects more of what it really should be ;-). It's a
simplistic approach towards load balancing: typical x86 SMP boxes should
run okay (tested personally), but yes, it can break your boxes too. I'm
looking forward to getting some feedback, to keep further development up
and going.

(This patch is made for kernel version 3.16.)

Thanks,
Rakib
---

 Signed-off-by: Rakib Mullick 

diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99..847f34d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -36,6 +36,15 @@ config BROKEN_ON_SMP
depends on BROKEN || !SMP
default y
 
+config BLD
+   bool "An alternate CPU load distribution technique for task scheduler"
+   depends on SMP
+   default y
+   help
+ This is an alternate CPU load distribution technique for the task
+ scheduler, based on The Barbershop Load Distribution algorithm. Not
+ suitable for NUMA, should work well on SMP.
+
 config INIT_ENV_ARG_LIMIT
int
default 32 if !UML
diff --git a/kernel/sched/bld.h b/kernel/sched/bld.h
new file mode 100644
index 000..5a067c1
--- /dev/null
+++ b/kernel/sched/bld.h
@@ -0,0 +1,207 @@
+#ifdef CONFIG_BLD
+
+static DEFINE_RWLOCK(rt_list_lock);
+static LIST_HEAD(rt_rq_head);
+static LIST_HEAD(cfs_rq_head);
+static DEFINE_RWLOCK(cfs_list_lock);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct rq *rq_of_cfs(struct cfs_rq *cfs_rq)
+{
+   return cfs_rq->rq;
+}
+#else
+static inline struct rq *rq_of_cfs(struct cfs_rq *cfs_rq)
+{
+   return container_of(cfs_rq, struct rq, cfs);
+}
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+static inline struct rq *rq_of_rt(struct rt_rq *rt_rq)
+{
+   return rt_rq->rq;
+}
+#else
+static inline struct rq *rq_of_rt(struct rt_rq *rt_rq)
+{
+   return container_of(rt_rq, struct rq, rt);
+}
+#endif
+
+static int select_cpu_for_wakeup(int task_type, struct cpumask *mask)
+{
+   int cpu = smp_processor_id(), i;
+   unsigned long load, min_load = ULONG_MAX;
+   struct rq *rq;
+
+   if (task_type) {
+   for_each_cpu(i, mask) {
+   rq = cpu_rq(i);
+   load = rq->cfs.load.weight;
+   if (load < min_load) {
+   min_load = load;
+   cpu = i;
+   }
+   }
+   } else {
+   min_load = -1;
+
+   for_each_cpu(i, mask) {
+   rq = cpu_rq(i);
+   load = rq->rt.lowbit;
+   if (load > min_load) {
+   min_load = load;
+   cpu = i;
+   }
+   }
+   }
+
+   return cpu;
+}
+
+static int bld_pick_cpu_cfs(struct task_struct *p, int sd_flags, int wake_flags)
+{
+   struct cfs_rq *cfs;
+   unsigned long flags;
+   unsigned int cpu = smp_processor_id();
+
+   read_lock_irqsave(&cfs_list_lock, flags);
+   list_for_each_entry(cfs, &cfs_rq_head, bld_cfs_list) {
+   cpu = cpu_of(rq_of_cfs(cfs));
+   if (cpu_online(cpu))
+   break;
+   }
+   read_unlock_irqrestore(&cfs_list_lock, flags);
+   return cpu;
+}
+
+static int bld_pick_cpu_rt(struct task_struct *p, int sd_flags, int wake_flags)
+{
+   struct rt_rq *rt;
+   unsigned long flags;
+   unsigned int cpu = smp_processor_id();
+
+   read_lock_irqsave(&rt_list_lock, flags);
+   list_for_each_entry(rt, &rt_rq_head, bld_rt_list) {
+   cpu = cpu_of(rq_of_rt(rt));
+   if (cpu_online(cpu))
+   break;
+   }
+   read_unlock_irqrestore(&rt_list_lock, flags);
+   return cpu;
+}
+
+static int bld_pick_cpu_domain(struct task_struct *p, int sd_flags, int wake_flags)
+{
+   unsigned int cpu = smp_processor_id(), want_affine = 0;
+   struct cpumask *tmpmask;
+
+   if (p->nr_cpus_allowed == 1)
+   return task_cpu(p);
+
+   if (sd_flags & SD_BALANCE_WAKE) {
+   if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) {
+   want_affine = 1;
+   }
+   }
+
+   if (want_affine)
+   tmpmask = tsk_cpus_allowed(p);
+   else
+   tmpmask = sched_domain_span(cpu_rq(task_cpu(p))->sd);
+
+   if (rt_task(p))
+   cpu = select_cpu_for_wakeup(0, tmpmask);
+   else
+   cpu = select_cpu_for_wakeup(1, tmpmask);
+
+   return cpu;
+}
+
+static void track_load_rt(struct rq *rq, struct task_struct *p)
+{
+   unsigned long flag;
+   int firstbit;
+   struct rt_rq *first;
+   struct rt_prio_array *array = &rq->rt.active;
+
+   

Re: [ANNOUNCE] BLD-3.17 release.

2014-10-13 Thread Rakib Mullick
On 10/13/14, Mike Galbraith  wrote:
> On Sat, 2014-10-11 at 12:20 +0600, Rakib Mullick wrote:
>> BLD (The Barbershop Load Distribution Algorithm) patch for Linux 3.17
>
> I had a curiosity attack, played with it a little.
>
Thanks for showing your interest!

> My little Q6600 box could be describes as being "micro-numa", with two
> pathetic little "nodes" connected by the worst interconnect this side of
> tin cans and string.  Communicating tasks sorely missed sharing cache.
>
> tbench
> 3.18.0-master
> Throughput 287.411 MB/sec  1 clients  1 procs  max_latency=1.614 ms    1.000
> Throughput 568.631 MB/sec  2 clients  2 procs  max_latency=1.942 ms    1.000
> Throughput 1069.75 MB/sec  4 clients  4 procs  max_latency=18.494 ms   1.000
> Throughput 1040.99 MB/sec  8 clients  8 procs  max_latency=17.364 ms   1.000
>
> 3.18.0-master-BLD                                             vs master
> Throughput 261.986 MB/sec  1 clients  1 procs  max_latency=11.943 ms   .911
> Throughput 264.461 MB/sec  2 clients  2 procs  max_latency=11.884 ms   .465
> Throughput 476.191 MB/sec  4 clients  4 procs  max_latency=11.497 ms   .445
> Throughput 558.236 MB/sec  8 clients  8 procs  max_latency=9.008 ms    .536
>
>
> TCP_RR 4 unbound clients
> 3.18.0-master
> TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1
> (127.0.0.1) port 0 AF_INET
> Local /Remote
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
>
> 16384  87380  1        1       30.00    72436.65
> 16384  87380  1        1       30.00    72438.55
> 16384  87380  1        1       30.00    72213.18
> 16384  87380  1        1       30.00    72493.48
>                                     sum 289581.86   1.000
>
> 3.18.0-master-BLD
> TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1
> (127.0.0.1) port 0 AF_INET
> Local /Remote
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
>
> 16384  87380  1        1       30.00    29014.09
> 16384  87380  1        1       30.00    28804.53
> 16384  87380  1        1       30.00    28999.40
> 16384  87380  1        1       30.00    28901.84
>                                     sum 115719.86   .399 vs master
>
>
Okay. From the numbers above it's apparent that BLD isn't doing well, at
least for the kind of system you have been using. I haven't had a chance
to run it on any kind of NUMA system; for that reason I've marked it as
"Not suitable for NUMA" in Kconfig, yet. Part of the reason is that I
didn't manage to try it out myself, and the other reason is that it's
easy to get things wrong if the sched domains are built improperly. I'm
not sure what the sched configuration was in your case. BLD assumes (or
kind of blindly believes the system's default sched domain topology) that
tasks are cache hot on wakeup, and so it doesn't put those tasks on other
sched domains; but if that isn't the case then perhaps it misses out on
balancing opportunities, and in that case CPU utilization will be
improper.

Can you please share the perf stat of the netperf runs? So far I have
seen reduced context switch numbers with -BLD, with the drawback of a
huge increase in CPU migration numbers. But on the kinds of systems I
have run so far, it seemed that the extra CPU movement didn't cost much.
It could be wrong for NUMA systems, though.
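
Something like the following would do (just a suggestion; the exact event
list may need adjusting for your box):

  perf stat -e context-switches,cpu-migrations,cache-misses \
          netperf -H 127.0.0.1 -t TCP_RR -l 30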

Thanks,
Rakib


Re: [ANNOUNCE] BLD-3.17 release.

2014-10-14 Thread Rakib Mullick
On 10/14/14, Mike Galbraith  wrote:
> On Mon, 2014-10-13 at 21:14 +0600, Rakib Mullick wrote:
>
>> Okay. From the numbers above it's apparent that BLD isn't doing well,
>> at least for the kind of system you have been using. I haven't had a
>> chance to run it on any kind of NUMA system; for that reason I've
>> marked it as "Not suitable for NUMA" in Kconfig, for now.
>
> (yeah, a NUMA box would rip itself to shreds)
>
>> Part of the reason is that I didn't manage to try it out myself, and
>> the other reason is that it's easy to get things wrong if the sched
>> domains are built improperly. I'm not sure what the sched
>> configuration was in your case. BLD assumes (or somewhat blindly
>> trusts the system's default sched domain topology) that tasks are
>> cache hot at wakeup and so doesn't put those tasks on other sched
>> domains; if that isn't the case then it may miss balancing
>> opportunities, and CPU utilization will suffer.
>
> Even when you have only one socket with a big L3, you don't really want
> to bounce fast/light tasks around too frequently, L2 misses still hurt.
>
>> Can you please share the perf stat output of the netperf runs? So far
>> I have seen reduced context switch counts with -BLD, at the cost of a
>> huge increase in CPU migrations.
>
> No need, it's L2 misses.  Q6600 has no L3 to mitigate the miss pain.
>
I see.

>> On the kinds of systems I've run so far, all that CPU movement didn't
>> seem to cost much, but that could well be wrong for NUMA systems.
>
> You can most definitely move even very fast/light tasks too much within
> an L3, L2 misses can demolish throughput.  We had that problem.
>
So, L2 misses are the key here. So far I haven't tried to make use of
the various sched domain flags, like SD_SHARE_PKG_RESOURCES /
SD_SHARE_CPUPOWER etc. Perhaps integrating those would help? I'll take
a look at them and see what happens, although testing will be
problematic.
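
Something along these lines is what I have in mind; this is only a
rough, untested sketch (locking omitted, the helper name is made up),
assuming the default topology sets SD_SHARE_PKG_RESOURCES on the
cache-sharing levels:

/* Rough sketch, not part of the BLD patch: walk the waking CPU's
 * domains and remember the widest level that still shares package
 * resources (i.e. the LLC); wakeup placement would then be restricted
 * to that span, to avoid the cache-miss pain described above. */
static const struct cpumask *bld_llc_span(int cpu)
{
        struct sched_domain *sd;
        const struct cpumask *span = cpumask_of(cpu);

        for_each_domain(cpu, sd) {
                if (sd->flags & SD_SHARE_PKG_RESOURCES)
                        span = sched_domain_span(sd);
        }
        return span;
}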

Thanks,
Rakib.


[ANNOUNCE] BLD-3.17 release.

2014-10-10 Thread Rakib Mullick
BLD (The Barbershop Load Distribution Algorithm) patch for Linux 3.17;
this version contains a build fix for CONFIG_BLD=n. I have recently
done some netperf benchmarking (I did previously too, but never
published the numbers). Below are the stats of a default netperf run
with client and server on the same local system (core i3, 2G RAM;
higher is better).

            tcp_stream   tcp_rr      udp_stream   udp_rr

mainline    9343.54      20812.03    18231.74     24396.074
                                     18210.71

bld         14738.35     29224.54    26475.75     34910.08
                                     26462.53

These are averages of 5 runs of each test. BLD performs better,
showing roughly a 35-40% improvement. The config used for this
benchmark can be found at the following link:

https://raw.githubusercontent.com/rmullick/bld-patches/master/config.benchmark-3.17

Also, Luis Cruz recently backported BLD's previous release, BLD-3.16,
to Android and experimentally ran it on his Galaxy S III; his work can
be found at the following link:

https://github.com/SyNtheticNightmar3/bld-patches

If you are interested in running it on Android, take a look at the
above link.

Thanks,
Rakib

Signed-off-by: Rakib Mullick 
---

diff --git a/init/Kconfig b/init/Kconfig
index 80a6907..65319c6 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -36,6 +36,15 @@ config BROKEN_ON_SMP
depends on BROKEN || !SMP
default y
 
+config BLD
+   bool "An alternate CPU load distribution technique for task scheduler"
+   depends on SMP
+   default y
+   help
+ This is an alternate CPU load distribution technique for the task
+ scheduler, based on The Barbershop Load Distribution algorithm. Not
+ suitable for NUMA; should work well on SMP.
+
 config INIT_ENV_ARG_LIMIT
int
default 32 if !UML
diff --git a/kernel/sched/bld.h b/kernel/sched/bld.h
new file mode 100644
index 000..097dd23
--- /dev/null
+++ b/kernel/sched/bld.h
@@ -0,0 +1,207 @@
+#ifdef CONFIG_BLD
+
+static DEFINE_RWLOCK(rt_list_lock);
+static LIST_HEAD(rt_rq_head);
+static LIST_HEAD(cfs_rq_head);
+static DEFINE_RWLOCK(cfs_list_lock);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct rq *rq_of_cfs(struct cfs_rq *cfs_rq)
+{
+   return cfs_rq->rq;
+}
+#else
+static inline struct rq *rq_of_cfs(struct cfs_rq *cfs_rq)
+{
+   return container_of(cfs_rq, struct rq, cfs);
+}
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+static inline struct rq *rq_of_rt(struct rt_rq *rt_rq)
+{
+   return rt_rq->rq;
+}
+#else
+static inline struct rq *rq_of_rt(struct rt_rq *rt_rq)
+{
+   return container_of(rt_rq, struct rq, rt);
+}
+#endif
+
+static int select_cpu_for_wakeup(int task_type, struct cpumask *mask)
+{
+   int cpu = smp_processor_id(), i;
+   unsigned long load, min_load = ULONG_MAX;
+   struct rq *rq;
+
+   if (task_type) {
+   for_each_cpu(i, mask) {
+   rq = cpu_rq(i);
+   load = rq->cfs.load.weight;
+   if (load < min_load) {
+   min_load = load;
+   cpu = i;
+   }
+   }
+   } else {
+   min_load = -1;
+
+   for_each_cpu(i, mask) {
+   rq = cpu_rq(i);
+   load = rq->rt.lowbit;
+   if (load > min_load) {
+   min_load = load;
+   cpu = i;
+   }
+   }
+   }
+
+   return cpu;
+}
+
+static int bld_pick_cpu_cfs(struct task_struct *p, int sd_flags, int wake_flags)
+{
+   struct cfs_rq *cfs;
+   unsigned long flags;
+   unsigned int cpu = smp_processor_id();
+
+   read_lock_irqsave(&cfs_list_lock, flags);
+   list_for_each_entry(cfs, &cfs_rq_head, bld_cfs_list) {
+   cpu = cpu_of(rq_of_cfs(cfs));
+   if (cpu_online(cpu))
+   break;
+   }
+   read_unlock_irqrestore(&cfs_list_lock, flags);
+   return cpu;
+}
+
+static int bld_pick_cpu_rt(struct task_struct *p, int sd_flags, int wake_flags)
+{
+   struct rt_rq *rt;
+   unsigned long flags;
+   unsigned int cpu = smp_processor_id();
+
+   read_lock_irqsave(&rt_list_lock, flags);
+   list_for_each_entry(rt, &rt_rq_head, bld_rt_list) {
+   cpu = cpu_of(rq_of_rt(rt));
+   if (cpu_online(cpu))
+   break;
+   }
+   read_unlock_irqrestore(&rt_list_lock, flags);
+   return cpu;
+}
+
+static int bld_pick_cpu_domain(struct task_struct *p, int sd_flags, int wake_flags)
+{
+   unsigned int cpu = smp_processor_id(), want_affine = 0;
+   struct cpumask *tmpmask;
+
+   if (p->nr_cpus_allowed == 1)
+   return task_cpu(p);
+
+   if (sd_flags & SD_BALANCE_WAKE

[ANNOUNCE] BLD-3.18 release.

2014-12-16 Thread Rakib Mullick
+   unsigned int cpu = smp_processor_id(), want_affine = 0;
+   struct cpumask *tmpmask;
+
+   if (p->nr_cpus_allowed == 1)
+   return task_cpu(p);
+
+   if (sd_flags & SD_BALANCE_WAKE) {
+   if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) {
+   want_affine = 1;
+   }
+   }
+
+   if (want_affine)
+   tmpmask = tsk_cpus_allowed(p);
+   else
+   tmpmask = sched_domain_span(cpu_rq(task_cpu(p))->sd);
+
+   if (rt_task(p))
+   cpu = select_cpu_for_wakeup(0, tmpmask);
+   else
+   cpu = select_cpu_for_wakeup(1, tmpmask);
+
+   return cpu;
+}
+
+static void track_load_rt(struct rq *rq, struct task_struct *p)
+{
+   unsigned long flag;
+   int firstbit;
+   struct rt_rq *first;
+   struct rt_prio_array *array = &rq->rt.active;
+
+   first = list_entry(rt_rq_head.next, struct rt_rq, bld_rt_list);
+   firstbit = sched_find_first_bit(array->bitmap);
+
+   /* Maintaining rt.lowbit */
+   if (firstbit > 0 && firstbit <= rq->rt.lowbit)
+   rq->rt.lowbit = firstbit;
+
+   if (rq->rt.lowbit < first->lowbit) {
+   write_lock_irqsave(&rt_list_lock, flag);
+   list_del(&rq->rt.bld_rt_list);
+   list_add_tail(&rq->rt.bld_rt_list, &rt_rq_head);
+   write_unlock_irqrestore(&rt_list_lock, flag);
+   }
+}
+
+static int bld_get_cpu(struct task_struct *p, int sd_flags, int wake_flags)
+{
+   unsigned int cpu;
+
+   if (sd_flags == SD_BALANCE_WAKE || (sd_flags == SD_BALANCE_EXEC && (get_nr_threads(p) > 1)))
+   cpu = bld_pick_cpu_domain(p, sd_flags, wake_flags);
+   else {
+   if (rt_task(p))
+   cpu = bld_pick_cpu_rt(p, sd_flags, wake_flags);
+   else
+   cpu = bld_pick_cpu_cfs(p, sd_flags, wake_flags);
+   }
+
+   return cpu;
+}
+
+static void bld_track_load_activate(struct rq *rq, struct task_struct *p)
+{
+   unsigned long flag;
+   if (rt_task(p)) {
+   track_load_rt(rq, p);
+   } else {
+   if (rq->cfs.pos != 2) {
+   struct cfs_rq *last;
+   last = list_entry(cfs_rq_head.prev, struct cfs_rq, bld_cfs_list);
+   if (rq->cfs.load.weight >= last->load.weight) {
+   write_lock_irqsave(&cfs_list_lock, flag);
+   list_del(&rq->cfs.bld_cfs_list);
+   list_add_tail(&rq->cfs.bld_cfs_list, &cfs_rq_head);
+   rq->cfs.pos = 2; last->pos = 1;
+   write_unlock_irqrestore(&cfs_list_lock, flag);
+   }
+   }
+   }
+}
+
+static void bld_track_load_deactivate(struct rq *rq, struct task_struct *p)
+{
+   unsigned long flag;
+   if (rt_task(p)) {
+   track_load_rt(rq, p);
+   } else {
+   if (rq->cfs.pos != 0) {
+   struct cfs_rq *first;
+   first = list_entry(cfs_rq_head.next, struct cfs_rq, bld_cfs_list);
+   if (rq->cfs.load.weight <= first->load.weight) {
+   write_lock_irqsave(&cfs_list_lock, flag);
+   list_del(&rq->cfs.bld_cfs_list);
+   list_add(&rq->cfs.bld_cfs_list, &cfs_rq_head);
+   rq->cfs.pos = 0; first->pos = 1;
+   write_unlock_irqrestore(&cfs_list_lock, flag);
+   }
+   }
+   }
+}
+#else
+static inline void bld_track_load_activate(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void bld_track_load_deactivate(struct rq *rq, struct task_struct *p)
+{
+}
+#endif /* CONFIG_BLD */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 89e7283..bd702c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,8 @@
  *  2007-07-01  Group scheduling enhancements by Srivatsa Vaddagiri
  *  2007-11-29  RT balancing improvements by Steven Rostedt, Gregory Haskins,
  *  Thomas Gleixner, Mike Kravetz
+ *  2012-Feb   The Barbershop Load Distribution (BLD) algorithm - an alternate
+ * CPU load distribution technique for kernel scheduler by Rakib Mullick.
  */
 
 #include 
@@ -86,6 +88,7 @@
 #include "sched.h"
 #include "../workqueue_internal.h"
 #include "../smpboot.h"
+#include "bld.h"
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -840,6 +843,8 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
update_rq_clock(rq);
sched_info_queued(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
+   if (!dl_task(p))
+  

[PATCH] Remove cpuset_update_active_cpus()'s parameter.

2017-04-08 Thread Rakib Mullick
In cpuset_update_active_cpus(), cpu_online isn't used anymore. Remove
it.

Signed-off-by: Rakib Mullick
---
 include/linux/cpuset.h | 4 ++--
 kernel/cgroup/cpuset.c | 2 +-
 kernel/sched/core.c| 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 611fce5..119a3f9 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -42,7 +42,7 @@ static inline void cpuset_dec(void)
 
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
-extern void cpuset_update_active_cpus(bool cpu_online);
+extern void cpuset_update_active_cpus(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -155,7 +155,7 @@ static inline bool cpusets_enabled(void) { return false; }
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
 
-static inline void cpuset_update_active_cpus(bool cpu_online)
+static inline void cpuset_update_active_cpus(void)
 {
partition_sched_domains(1, NULL, NULL);
 }
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0f41292..d6e8ded 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2354,7 +2354,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
rebuild_sched_domains();
 }
 
-void cpuset_update_active_cpus(bool cpu_online)
+void cpuset_update_active_cpus(void)
 {
/*
 * We're inside cpu hotplug critical region which usually nests
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b31fc0..430b046 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5732,7 +5732,7 @@ static void cpuset_cpu_active(void)
 * cpuset configurations.
 */
}
-   cpuset_update_active_cpus(true);
+   cpuset_update_active_cpus();
 }
 
 static int cpuset_cpu_inactive(unsigned int cpu)
@@ -5755,7 +5755,7 @@ static int cpuset_cpu_inactive(unsigned int cpu)
 
if (overflow)
return -EBUSY;
-   cpuset_update_active_cpus(false);
+   cpuset_update_active_cpus();
} else {
num_cpus_frozen++;
partition_sched_domains(1, NULL, NULL);
-- 
2.9.3



[PATCH] RCU: Remove have_rcu_nocb_mask from tree_plugin.h

2017-11-17 Thread Rakib Mullick
Currently have_rcu_nocb_mask is used to avoid double allocation of
rcu_nocb_mask during boot up. Because cpumask_var_t is represented
differently depending on CONFIG_CPUMASK_OFFSTACK=y/n, carrying that
flag was a reasonable approach. But now we have the helper
cpumask_available(), which can be used to check whether rcu_nocb_mask
has been allocated without a separate variable.

Removing the variable also reduces vmlinux size.
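
For reference, this is roughly why the helper can replace the flag;
a simplified sketch of the idea, not the exact kernel definition:

/* With CONFIG_CPUMASK_OFFSTACK=y a cpumask_var_t is a pointer that is
 * NULL until allocated, so "is it allocated?" is just a NULL check.
 * With =n it is an on-stack array and is always available. */
#ifdef CONFIG_CPUMASK_OFFSTACK
static inline bool cpumask_available(cpumask_var_t mask)
{
        return mask != NULL;
}
#else
static inline bool cpumask_available(cpumask_var_t mask)
{
        return true;
}
#endif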

Unpatched version:
    text      data       bss       dec      hex  filename
13050393   7852470  14543408  35446271  21cddff  vmlinux

Patched version:
    text      data       bss       dec      hex  filename
13050390   7852438  14543408  35446236  21cdddc  vmlinux

Signed-off-by: Rakib Mullick 
Cc: "Paul E. McKenney" 
Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
---
Patch applied on top of linus's tree (commit cf9b0772f2e41).

 kernel/rcu/tree_plugin.h | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index db85ca3..13a8e08 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -61,7 +61,6 @@ DEFINE_PER_CPU(char, rcu_cpu_has_work);
 
 #ifdef CONFIG_RCU_NOCB_CPU
 static cpumask_var_t rcu_nocb_mask; /* CPUs to have callbacks offloaded. */
-static bool have_rcu_nocb_mask;/* Was rcu_nocb_mask allocated? */
 static bool __read_mostly rcu_nocb_poll;/* Offload kthread are to poll. */
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 
@@ -1752,7 +1751,6 @@ static void increment_cpu_stall_ticks(void)
 static int __init rcu_nocb_setup(char *str)
 {
alloc_bootmem_cpumask_var(&rcu_nocb_mask);
-   have_rcu_nocb_mask = true;
cpulist_parse(str, rcu_nocb_mask);
return 1;
 }
@@ -1801,7 +1799,7 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 /* Is the specified CPU a no-CBs CPU? */
 bool rcu_is_nocb_cpu(int cpu)
 {
-   if (have_rcu_nocb_mask)
+   if (cpumask_available(rcu_nocb_mask))
return cpumask_test_cpu(cpu, rcu_nocb_mask);
return false;
 }
@@ -2295,14 +2293,13 @@ void __init rcu_init_nohz(void)
need_rcu_nocb_mask = true;
 #endif /* #if defined(CONFIG_NO_HZ_FULL) */
 
-   if (!have_rcu_nocb_mask && need_rcu_nocb_mask) {
+   if (!cpumask_available(rcu_nocb_mask) && need_rcu_nocb_mask) {
if (!zalloc_cpumask_var(&rcu_nocb_mask, GFP_KERNEL)) {
pr_info("rcu_nocb_mask allocation failed, callback 
offloading disabled.\n");
return;
}
-   have_rcu_nocb_mask = true;
}
-   if (!have_rcu_nocb_mask)
+   if (!cpumask_available(rcu_nocb_mask))
return;
 
 #if defined(CONFIG_NO_HZ_FULL)
@@ -2428,7 +2425,7 @@ static void __init rcu_organize_nocb_kthreads(struct rcu_state *rsp)
struct rcu_data *rdp_leader = NULL;  /* Suppress misguided gcc warn. */
struct rcu_data *rdp_prev = NULL;
 
-   if (!have_rcu_nocb_mask)
+   if (!cpumask_available(rcu_nocb_mask))
return;
if (ls == -1) {
ls = int_sqrt(nr_cpu_ids);
-- 
2.9.3



[PATCH] Fix isocpus's param handling when CPUMASK_OFFSTACK=n.

2017-10-21 Thread Rakib Mullick
> On Fri, Oct 20, 2017 at 2:49 PM, Ingo Molnar  wrote:
>
>> Rakib Mullick  wrote:
>>  include/linux/cpumask.h | 16 
>>  kernel/sched/topology.c |  8 +---
>>  2 files changed, 21 insertions(+), 3 deletions(-)
>
> What kernel is this against? It does not apply to the latest kernels.

It was against 4.14-rc4, prepared before the -rc5 release. Please
consider the one below, against -rc5.

cpulist_parse() uses nr_cpumask_bits as a limit to parse the
passed buffer from the kernel command line. What nr_cpumask_bits
represents varies depending upon the CONFIG_CPUMASK_OFFSTACK option.
If CONFIG_CPUMASK_OFFSTACK=n, then nr_cpumask_bits is the same as
NR_CPUS, which might not represent the number of CPUs that really
exist (default 64). So there's a chance of a gap between nr_cpu_ids
and NR_CPUS, which ultimately leads to an invalid cpulist_parse()
operation. For example, if isolcpus=9 is passed on an 8-CPU system
(CONFIG_CPUMASK_OFFSTACK=n) it doesn't show the error that it's
supposed to.

This patch fixes the issue by finding the last CPU of the passed
isolcpus= list and checking it against nr_cpu_ids. It also fixes the
error message: it should print nr_cpu_ids-1, since CPU numbering
starts from 0.
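
To make the failure mode concrete, here is a sketch of the check with
hypothetical values (an 8-CPU box built with NR_CPUS=64 and
CONFIG_CPUMASK_OFFSTACK=n), condensed from the diff below:

/* nr_cpumask_bits == NR_CPUS == 64, but nr_cpu_ids == 8 on this box,
 * so cpulist_parse("9", ...) succeeds (bit 9 < 64) and only the
 * cpumask_last() range check catches the bogus value (9 >= 8). */
static int __init isolcpus_setup_sketch(char *str)
{
        int ret;

        alloc_bootmem_cpumask_var(&cpu_isolated_map);
        ret = cpulist_parse(str, cpu_isolated_map);
        if (ret || cpumask_last(cpu_isolated_map) >= nr_cpu_ids)
                return 0;       /* rejected; real code prints the error here */
        return 1;
}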

Signed-off-by: Rakib Mullick 
---
 include/linux/cpumask.h | 16 
 kernel/sched/topology.c |  8 +---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index cd415b7..5631725 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -130,6 +130,11 @@ static inline unsigned int cpumask_first(const struct cpumask *srcp)
return 0;
 }
 
+static inline unsigned int cpumask_last(const struct cpumask *srcp)
+{
+   return 0;
+}
+
 /* Valid inputs for n are -1 and 0. */
 static inline unsigned int cpumask_next(int n, const struct cpumask *srcp)
 {
@@ -178,6 +183,17 @@ static inline unsigned int cpumask_first(const struct cpumask *srcp)
return find_first_bit(cpumask_bits(srcp), nr_cpumask_bits);
 }
 
+/**
+ * cpumask_last - get the last cpu in a cpumask
+ * @srcp:  - the cpumask pointer
+ *
+ * Returns >= nr_cpumask_bits if no cpus set.
+ */
+static inline unsigned int cpumask_last(const struct cpumask *srcp)
+{
+   return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
+}
+
 unsigned int cpumask_next(int n, const struct cpumask *srcp);
 
 /**
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f1cf4f3..b9265c8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -466,12 +466,14 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 /* Setup the mask of CPUs configured for isolated domains */
 static int __init isolated_cpu_setup(char *str)
 {
-   int ret;
+   int ret, lastcpu;
 
alloc_bootmem_cpumask_var(&cpu_isolated_map);
ret = cpulist_parse(str, cpu_isolated_map);
-   if (ret) {
-   pr_err("sched: Error, all isolcpus= values must be between 0 
and %u\n", nr_cpu_ids);
+   lastcpu = cpumask_last(cpu_isolated_map);
+   if (ret || lastcpu >= nr_cpu_ids) {
+   pr_err("sched: Error, all isolcpus= values must be between 0 
and %u\n",
+   nr_cpu_ids-1);
return 0;
}
return 1;
-- 
2.9.3



[PATCH] Fix isocpus's param handling when CPUMASK_OFFSTACK=n.

2017-10-23 Thread Rakib Mullick

> On Mon, Oct 23, 2017 at 5:50 PM, Ingo Molnar  wrote:
>
>> * Rakib Mullick  wrote:
>> +/**
>> + * cpumask_last - get the last cpu in a cpumask
>
> Please capitalize 'CPU' properly in documentation.
>
OK.
>> + int ret, lastcpu;
>>
>>   alloc_bootmem_cpumask_var(&cpu_isolated_map);
>>   ret = cpulist_parse(str, cpu_isolated_map);
>> - if (ret) {
>> - pr_err("sched: Error, all isolcpus= values must be between 0 
>> and %u\n", nr_cpu_ids);
>> + lastcpu = cpumask_last(cpu_isolated_map);
>> + if (ret || lastcpu >= nr_cpu_ids) {
>
> Any reason why 'lastcpu' has to be introduced - why not just use
> cpumask_last() directly in the condition? It looks obvious enough of
> a pattern.
I thought it reflected what we're doing here, but yes, it's actually redundant.

>> + pr_err("sched: Error, all isolcpus= values must be between 0 
>> and %u\n",
>> + nr_cpu_ids-1);
>
> Please don't break the line mindlessly just due to checkpatch complaining - it
> makes the code less readable.
>
Yes, right, I mindlessly followed what checkpatch was complaining about.

Please see the following patch, with changes made based on your review and
patched up against -rc6.

Thanks,
Rakib
---

[PATCH](v1) Fix isocpus's param handling when CPUMASK_OFFSTACK=n.

cpulist_parse() uses nr_cpumask_bits as a limit to parse the
passed buffer from the kernel command line. What nr_cpumask_bits
represents varies depending upon the CONFIG_CPUMASK_OFFSTACK option.
If CONFIG_CPUMASK_OFFSTACK=n, then nr_cpumask_bits is the same as
NR_CPUS, which might not represent the number of CPUs that really
exist (default 64). So there's a chance of a gap between nr_cpu_ids
and NR_CPUS, which ultimately leads to an invalid cpulist_parse()
operation. For example, if isolcpus=9 is passed on an 8-CPU system
(CONFIG_CPUMASK_OFFSTACK=n) it doesn't show the error that it's
supposed to.

This patch fixes the issue by finding the last CPU of the passed
isolcpus= list and checking it against nr_cpu_ids. It also fixes the
error message: it should print nr_cpu_ids-1, since CPU numbering
starts from 0.

Changes since v0 (Ingo):
* use cpumask_last() directly in the condition.
* use CPU rather than cpu in documentation
* undo line break

Signed-off-by: Rakib Mullick 
---
 include/linux/cpumask.h | 16 
 kernel/sched/topology.c |  4 ++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index cd415b7..63661de 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -130,6 +130,11 @@ static inline unsigned int cpumask_first(const struct cpumask *srcp)
return 0;
 }
 
+static inline unsigned int cpumask_last(const struct cpumask *srcp)
+{
+   return 0;
+}
+
 /* Valid inputs for n are -1 and 0. */
 static inline unsigned int cpumask_next(int n, const struct cpumask *srcp)
 {
@@ -178,6 +183,17 @@ static inline unsigned int cpumask_first(const struct cpumask *srcp)
return find_first_bit(cpumask_bits(srcp), nr_cpumask_bits);
 }
 
+/**
+ * cpumask_last - get the last CPU in a cpumask
+ * @srcp:  - the cpumask pointer
+ *
+ * Returns >= nr_cpumask_bits if no CPUs set.
+ */
+static inline unsigned int cpumask_last(const struct cpumask *srcp)
+{
+   return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
+}
+
 unsigned int cpumask_next(int n, const struct cpumask *srcp);
 
 /**
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f1cf4f3..060cee5 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -470,8 +470,8 @@ static int __init isolated_cpu_setup(char *str)
 
alloc_bootmem_cpumask_var(&cpu_isolated_map);
ret = cpulist_parse(str, cpu_isolated_map);
-   if (ret) {
-   pr_err("sched: Error, all isolcpus= values must be between 0 
and %u\n", nr_cpu_ids);
+   if (ret || cpumask_last(cpu_isolated_map) >= nr_cpu_ids) {
+   pr_err("sched: Error, all isolcpus= values must be between 0 
and %u\n", nr_cpu_ids-1);
return 0;
}
return 1;
-- 
2.9.3



[tip:irq/core] irq/core: Fix boot crash when the irqaffinity= boot parameter is passed on CPUMASK_OFFSTACK=y kernels(v1)

2017-11-01 Thread tip-bot for Rakib Mullick
Commit-ID:  10d94ff4d558b96bfc4f55bb0051ae4d938246fe
Gitweb: https://git.kernel.org/tip/10d94ff4d558b96bfc4f55bb0051ae4d938246fe
Author: Rakib Mullick 
AuthorDate: Wed, 1 Nov 2017 10:14:51 +0600
Committer:  Ingo Molnar 
CommitDate: Wed, 1 Nov 2017 09:56:39 +0100

irq/core: Fix boot crash when the irqaffinity= boot parameter is passed on CPUMASK_OFFSTACK=y kernels(v1)

When the irqaffinity= kernel parameter is passed in a CPUMASK_OFFSTACK=y
kernel, it fails to boot, because zalloc_cpumask_var() cannot be used before
initializing the slab allocator to allocate a cpumask.

So, use alloc_bootmem_cpumask_var() instead.

Also do some cleanups while at it: in init_irq_default_affinity() remove
an #ifdef via using cpumask_available().

Signed-off-by: Rakib Mullick 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/20171026045800.27087-1-rakib.mull...@gmail.com
Link: http://lkml.kernel.org/r/20171101041451.12581-1-rakib.mull...@gmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/irq/irqdesc.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 982a357..f2edcf8 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -27,7 +27,7 @@ static struct lock_class_key irq_desc_lock_class;
 #if defined(CONFIG_SMP)
 static int __init irq_affinity_setup(char *str)
 {
-   zalloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
+   alloc_bootmem_cpumask_var(&irq_default_affinity);
cpulist_parse(str, irq_default_affinity);
/*
 * Set at least the boot cpu. We don't want to end up with
@@ -40,10 +40,8 @@ __setup("irqaffinity=", irq_affinity_setup);
 
 static void __init init_irq_default_affinity(void)
 {
-#ifdef CONFIG_CPUMASK_OFFSTACK
-   if (!irq_default_affinity)
+   if (!cpumask_available(irq_default_affinity))
zalloc_cpumask_var(&irq_default_affinity, GFP_NOWAIT);
-#endif
if (cpumask_empty(irq_default_affinity))
cpumask_setall(irq_default_affinity);
 }


[tip:sched/core] sched/isolcpus: Fix "isolcpus=" boot parameter handling when !CONFIG_CPUMASK_OFFSTACK

2017-10-24 Thread tip-bot for Rakib Mullick
Commit-ID:  e22cdc3fc5991956146b9856d36b4971fe54dcd6
Gitweb: https://git.kernel.org/tip/e22cdc3fc5991956146b9856d36b4971fe54dcd6
Author: Rakib Mullick 
AuthorDate: Mon, 23 Oct 2017 19:01:54 +0600
Committer:  Ingo Molnar 
CommitDate: Tue, 24 Oct 2017 11:47:25 +0200

sched/isolcpus: Fix "isolcpus=" boot parameter handling when !CONFIG_CPUMASK_OFFSTACK

cpulist_parse() uses nr_cpumask_bits as a limit to parse the
passed buffer from kernel commandline. What nr_cpumask_bits
represents varies depending upon the CONFIG_CPUMASK_OFFSTACK option:

 - If CONFIG_CPUMASK_OFFSTACK=n, then nr_cpumask_bits is the same as
   NR_CPUS, which might not represent the # of CPUs that really exist
   (default 64). So, there's a chance of a gap between nr_cpu_ids
   and NR_CPUS, which ultimately lead towards invalid cpulist_parse()
   operation. For example, if isolcpus=9 is passed on an 8 cpu
   system (CONFIG_CPUMASK_OFFSTACK=n) it doesn't show the error
   that it's supposed to.

This patch fixes this bug by finding the last CPU of the passed
isolcpus= list and checking it against nr_cpu_ids.

It also fixes the error message where the nr_cpu_ids should be
nr_cpu_ids-1, since CPU numbering starts from 0.

Signed-off-by: Rakib Mullick 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: adobri...@gmail.com
Cc: a...@linux-foundation.org
Cc: long...@redhat.com
Cc: m...@chromium.org
Cc: t...@kernel.org
Link: http://lkml.kernel.org/r/20171023130154.9050-1-rakib.mull...@gmail.com
[ Enhanced the changelog and the kernel message. ]
Signed-off-by: Ingo Molnar 

 include/linux/cpumask.h |   16 
 kernel/sched/topology.c |4 ++--
 2 files changed, 18 insertions(+), 2 deletions(-)
---
 include/linux/cpumask.h | 16 
 kernel/sched/topology.c |  4 ++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index cd415b7..63661de 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -130,6 +130,11 @@ static inline unsigned int cpumask_first(const struct cpumask *srcp)
return 0;
 }
 
+static inline unsigned int cpumask_last(const struct cpumask *srcp)
+{
+   return 0;
+}
+
 /* Valid inputs for n are -1 and 0. */
 static inline unsigned int cpumask_next(int n, const struct cpumask *srcp)
 {
@@ -178,6 +183,17 @@ static inline unsigned int cpumask_first(const struct cpumask *srcp)
return find_first_bit(cpumask_bits(srcp), nr_cpumask_bits);
 }
 
+/**
+ * cpumask_last - get the last CPU in a cpumask
+ * @srcp:  - the cpumask pointer
+ *
+ * Returns >= nr_cpumask_bits if no CPUs set.
+ */
+static inline unsigned int cpumask_last(const struct cpumask *srcp)
+{
+   return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
+}
+
 unsigned int cpumask_next(int n, const struct cpumask *srcp);
 
 /**
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e50450c..e3d31b0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -476,8 +476,8 @@ static int __init isolated_cpu_setup(char *str)
 
alloc_bootmem_cpumask_var(&cpu_isolated_map);
ret = cpulist_parse(str, cpu_isolated_map);
-   if (ret) {
-   pr_err("sched: Error, all isolcpus= values must be between 0 
and %u\n", nr_cpu_ids);
+   if (ret || cpumask_last(cpu_isolated_map) >= nr_cpu_ids) {
+   pr_err("sched: Error, all isolcpus= values must be between 0 
and %u - ignoring them.\n", nr_cpu_ids-1);
return 0;
}
return 1;