Re: Tiny cpusets -- cpusets for small systems?

2008-02-25 Thread Max Krasnyanskiy

Paul Jackson wrote:

So I see cpusets as a higher-level API/mechanism and cpu_isolated_map as a
lower-level mechanism that actually makes the kernel aware of what's isolated
and what's not. Kind of like the sched domain/cpuset relationship, i.e. cpusets
affect sched domains but the scheduler does not use cpusets directly.


One could use cpusets to control the setting of cpu_isolated_map,
separate from the code such as your select_irq_affinity() that
uses it.
Yes, that's what I proposed too, in one of the CPU isolation threads with
Peter. The only issue is that you need to simulate a CPU_DOWN hotplug event in
order to clean up what's already running on those CPUs.



In the foreseeable future 2-8 cores will be the most common configuration.
Do you think that cpusets are needed/useful for those machines?
The reason I'm asking is that, given the restrictions you mentioned above,
it seems that you might as well just do
	taskset -c 1,2,3 app1
	taskset -c 3,4,5 app2


People tend to manage the CPU and memory placement of the threads
and processes within a single co-operating job using taskset
(sched_setaffinity) and numactl (mbind, set_mempolicy.)

They tend to manage the placement of multiple unrelated jobs onto
a single system, whether on separate or shared CPUs and nodes,
using cpusets.


Something like cpu_isolated_map looks to me like a system-wide
mechanism, which should, like sched_domains, be managed system-wide.
Managing it with a mechanism that encourages each thread to update
it directly, as if that thread owned the system, will break down,
resulting in conflicting updates, as multiple, insufficiently
co-operating threads issue conflicting settings.
I'm not sure how to interpret that. I think you might have mixed a couple of
things I asked about in one reply ;-).
The question was: given the restrictions you talked about when you explained
the tiny-cpusets functionality, how much does one gain from using them compared
to taskset/numactl? i.e. on machines with 2-8 cores it's fairly easy to manage
CPUs with simple affinity masks.


The second part of your reply seems to imply that I somehow made you think
that cpu_isolated_map would be managed per thread. That is of course not the
case. It's definitely a system-wide mechanism, and individual threads have
nothing to do with it.
btw I just re-read my previous reply, and I definitely did not say anything
about threads managing cpu_isolated_map :).



The stuff that I'm working on these days (wireless base stations) is designed
with the following model:
	cpuN - runs soft-RT networking and management code
	cpuN+1 to cpuN+x - are used as dedicated engines
i.e. the simplest example would be
	cpu0 - runs IP, L2 and control plane
	cpu1 - runs hard-RT MAC

So if CPU isolation is implemented on top of the cpusets what kind of API do 
you envision for such an app ?


That depends on what more API is needed.  Do we need to place
irqs better ... cpusets might not be a natural for that use.
Aren't irqs directed to specific CPUs, not to hierarchically
nested subsets of CPUs?


You clipped the part where I elaborated, which was:
So if CPU isolation is implemented on top of cpusets, what kind of API do you
envision for such an app? I mean, currently cpusets seem to deal mostly with
entire processes, whereas in this case we're really dealing with threads.
i.e. different threads of the same process require different policies: some
must run on isolated cpus, some must not. I guess one could write a thread's
pid into the cpusets fs, but that's not very convenient.
pthread_setaffinity_np() is exactly what's needed.
In other words, how would an app place its individual threads into the
different cpusets?
The IRQ stuff is separate; like we said above, cpusets could simply update
cpu_isolated_map, which would take care of IRQs. I was talking specifically
about the thread management.
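
For illustration, from inside the app the per-thread placement would look
roughly like this (a sketch using the glibc affinity call; the helper name and
CPU number are made up):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one (presumably isolated) CPU. */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* e.g. the hard-RT MAC thread would call pin_self_to_cpu(1) at startup. */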



Separate question:
  Is it desired that the dedicated CPUs cpuN+1 ... cpuN+x even appear
  as general purpose systems running a Linux kernel in your systems?
  These dedicated engines seem more like intelligent devices to me,
  such as disk controllers, which the kernel controls via device
  drivers, not by loading itself on them too.
We still want to be able to run normal threads on them, which means IPIs,
memory management, etc. are still needed. So yes, they had better show up as
normal CPUs :)
Also, with dynamic isolation you can, for example, un-isolate a cpu when you're
compiling stuff on the machine and then isolate it when you're running special
app(s).


Max


Re: [PATCH sched-devel 0/7] CPU isolation extensions

2008-02-25 Thread Max Krasnyanskiy

Hi Peter,

Sorry for the delay in replying.


Please, wrap your emails at 78 - most mailers can do this.

Done.


On Fri, 2008-02-22 at 14:05 -0800, Max Krasnyanskiy wrote:

Peter Zijlstra wrote:

On Thu, 2008-02-21 at 18:38 -0800, Max Krasnyanskiy wrote:




List of commits
   cpuisol: Make cpu isolation configrable and export isolated map
 
cpu_isolated_map was a bad hack when it was introduced, I feel we should

deprecate it and fully integrate the functionality into cpusets. That would
give a much more flexible end-result.

That's not currently possible and would introduce a lot of complexity.
I'm pretty sure you missed the discussion I had with Paul (you were cc'ed on
that, btw). In fact I provided the link to that discussion in the original
email. Here it is again:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2


I read it, I just firmly disagree.

Basically the problem is very simple. CPU isolation needs a simple/efficient way to check 
if CPU N is isolated.


I'm not seeing the need for that outside of setting up the various
states. That is, once all the affinities are set up, you would hardly ever
need (or should - imho - need) to know if a particular CPU is isolated or not.
Unless I'm missing something, that's only possible for a very static system.
What I mean is that, yes, you could go and set IRQ affinity, app affinity,
workqueue thread affinity, etc. so that nothing runs on the isolated cpus. It
works _until_ something changes, at which point the system needs to know that
it's not supposed to touch CPU N. For example, a new IRQ is registered, a new
workqueue is created (an fs is mounted, a network interface is created, etc),
a new kthread is started, and so on.
Sure, we could introduce default affinity masks for irqs, workqueues, etc., but
that's essentially just duplicating cpu_isolated_map.
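
To make the analogy with cpu_online_map concrete, the kind of check that
exporting cpu_isolated_map enables is roughly the following (a sketch; the
exact macro name in the patches may differ):

#include <linux/cpumask.h>

extern cpumask_t cpu_isolated_map;

/* An isolation test in the same style as cpu_online()/cpu_possible(). */
#define cpu_isolated(cpu)	cpu_isset((cpu), cpu_isolated_map)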



cpuset/cgroup APIs are not designed for that. In order to figure out whether
CPU N is isolated, one has to iterate through all cpusets and check their cpu
maps. That requires several levels of locking (cgroup and cpuset).



The other issue is that cpusets are a bit too dynamic (see the thread above
for more details); we'd need notification mechanisms to tell subsystems when a
CPU becomes isolated. Again, more complexity. Since I integrated cpu isolation
with cpu hotplug, it's already addressed in a nice, simple way.


I guess you have another definition of nice than I do.

No, not really.
Let's talk specifics. My goal was not to introduce a bunch of new functionality
and rewrite workqueues and stuff; instead I wanted to integrate with existing
mechanisms. CPU maps are used everywhere, and exporting cpu_isolated_map was a
natural way to make other parts of the kernel aware of the isolated CPUs.



Please take a look at that discussion. I do not think it's worth the effort to
put this into cpusets. cpu_isolated_map is a very clean and simple concept and
integrates nicely with the rest of the cpu maps, i.e. it's very much the same
concept and API as cpu_online_map, etc.


I'm thinking cpu_isolated_map is a very dirty hack.

If we want to integrate this stuff with cpusets, I think the best approach
would be to have cpusets update the cpu_isolated_map just like they currently
update scheduler domains.
 

CPU-sets can already isolate cpus by either creating a cpu outside of any set,
or a set with a single cpu not shared by any other sets.
This only works for user-space. As I mentioned above, for full CPU isolation
various kernel subsystems need to be aware that CPUs are isolated in order to
avoid activity on them.


Yes, hence the proposed system flag to handle the kernel bits like
unbounded kernel threads and IRQs.
I do not see a specific proposal here. The funny part is that we're not even
disagreeing at the high level. Yes, it'd be nice to have such a flag ;-)

But how will the genirq subsystem, for example, be aware of that flag?
i.e. how would it know that by default it is not supposed to route irqs to the
CPUs in the cpusets with that flag?

As I explained above, setting affinity for existing irqs is not enough.
The same goes for workqueues or any other subsystem that wants to run per-cpu
threads and the like.
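
As an illustration of what being "aware" means for such a subsystem, a per-cpu
helper-thread setup loop would have to look roughly like this (a sketch;
start_helper_on() is a hypothetical per-subsystem hook):

/*
 * Start a helper kthread for every online CPU except the isolated ones.
 * Without a global "is this CPU isolated?" test, every loop like this
 * would need its own default affinity mask.
 */
static void start_percpu_helpers(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		if (cpu_isset(cpu, cpu_isolated_map))
			continue;	/* leave isolated CPUs alone */
		start_helper_on(cpu);	/* hypothetical subsystem hook */
	}
}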



This also allows for isolated groups, there are good reasons to isolate groups,
esp. now that we have a stronger RT balancer. SMP and hard RT are not
exclusive. A design that does not take that into account is too rigid.



You're thinking scheduling only. Paul had the same confusion ;-)


I'm not, I'm thinking it ought to allow for it.
One way I can think of to support groups and still allow for the RT balancer
is this: make the scheduler ignore cpu_isolated_map and give cpusets full
control of the scheduler domains, and use cpu_isolated_map only for hw irqs and
other kernel sub-systems. That way cpusets could mark the cpus in a group as
isolated to get rid of the kernel activity, and build a sched domain such that
tasks get balanced within it.
The thing I do not like about it is that there is no way to boot the system
with CPU N isolated

Re: [RFC] Genirq and CPU isolation

2008-02-25 Thread Max Krasnyanskiy

Randy Dunlap wrote:

On Fri, 22 Feb 2008 22:19:18 -0800 Max Krasnyansky wrote:

Hi Max,


diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 438a014..e74db94 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -488,6 +491,26 @@ void free_irq(unsigned int irq, void *dev_id)
 }
 EXPORT_SYMBOL(free_irq);
 
+#ifndef CONFIG_AUTO_IRQ_AFFINITY

+/**
+ * Generic version of the affinity autoselector.
+ * Called under desc->lock from setup_irq().
+ * btw Should we rename this to select_irq_affinity() ?
+ */


Please don't begin comment blocks with "/**" unless they are in
kernel-doc format.  (See Documentation/kernel-doc-nano-HOWTO.txt
for details of it.)

My bad. Cut & pasted another function and forgot to nuke one asterisk.
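
For reference, the corrected block simply drops the extra asterisk so it is
not picked up as kernel-doc:

/*
 * Generic version of the affinity autoselector.
 * Called under desc->lock from setup_irq().
 * btw Should we rename this to select_irq_affinity() ?
 */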

Thanx for the review Randy.
Max


Re: Module loading/unloading and "The Stop Machine"

2008-02-22 Thread Max Krasnyanskiy

Hi Andi,


Max Krasnyanskiy <[EMAIL PROTECTED]> writes:

static struct module *load_module(void __user *umod,
 unsigned long len,
 const char __user *uargs)
{
 ...

 /* Now sew it into the lists so we can get lockdep and oops
* info during argument parsing.  Noone should access us, since
* strong_try_module_get() will fail. */
   stop_machine_run(__link_module, mod, NR_CPUS);
 ...
}


Wow you found some really bad code. I bet it wouldn't be that
difficult to fix the code to allow oops safe list insertion
without using the big stop machine overkill hammer.

Let me know if you have something in mind. When I get a chance I'll stare
some more at that code and try to come up with an alternative solution.
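
One possible direction (purely a sketch, not what the kernel actually does)
would be to make the list insertion safe for lockless readers with RCU, so
that the oops/kallsyms lookup could walk the module list without stop_machine:

#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/module.h>

/*
 * Illustrative only: add the new module to the global list with an
 * RCU-safe insertion so that lockless readers can traverse the list
 * concurrently.  Writers would still serialize on module_mutex.
 */
static void link_module_rcu(struct module *mod, struct list_head *modules)
{
	list_add_rcu(&mod->list, modules);
}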

Thanx
Max


Re: [PATCH sched-devel 0/7] CPU isolation extensions

2008-02-22 Thread Max Krasnyanskiy

Peter Zijlstra wrote:

On Fri, 2008-02-22 at 08:38 -0500, Mark Hounschell wrote:


List of commits
   cpuisol: Make cpu isolation configrable and export isolated map
 
cpu_isolated_map was a bad hack when it was introduced, I feel we should

deprecate it and fully integrate the functionality into cpusets. That would
give a much more flexible end-result.

CPU-sets can already isolate cpus by either creating a cpu outside of any set,
or a set with a single cpu not shared by any other sets.


Peter, what about when I am NOT using cpusets and they are disabled in my
config, but I still want to use this?


Then you enable it?

I'm with Mark on this one. For example, if I have a two-core machine I do not
need cpusets to manage it.
Plus, like I explained in a previous email, cpusets are a higher-level API. We
can think of a way to integrate them if needed.


   cpuisol: Do not schedule workqueues on the isolated CPUs
 
(per-cpu workqueues, the single ones are treated in the previous section)


I still strongly disagree with this approach. Workqueues are passive, they
don't do anything unless work is provided to them. By blindly not starting them
you handicap the system and services that rely on them.


Have things changed since my first bad encounter with workqueues?
I am referring to this thread.

http://kerneltrap.org/mailarchive/linux-kernel/2007/5/29/97039 


Just means you get to fix those problems. By blindly not starting them
you introduce others.


Please give me an example of what you have in mind.
Also, if you look at the patch (which I've now posted properly), it's not just
about not starting them. I also redirect all future scheduled work to a
non-isolated CPU, i.e. if work is scheduled on an isolated CPU it is treated as
if the workqueue were single-threaded. As I explained before, most subsystems
do not care which CPU actually gets to execute the work. Oprofile is the only
one I know of that breaks, because it cannot collect stats from the isolated
CPUs. I'm thinking of a different solution for oprofile, maybe collecting
samples through IPIs or something.
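
A rough sketch of the redirection idea (not the actual patch code; it assumes
cpu_isolated_map is exported as in the cpuisol series and that at least one
non-isolated CPU is online):

#include <linux/cpumask.h>

/*
 * Pick the CPU a piece of work should actually run on.  If the
 * requested CPU is isolated, fall back to the first online,
 * non-isolated CPU - roughly equivalent to treating the workqueue as
 * single-threaded for isolated CPUs.
 */
static int cpu_for_work(int requested_cpu)
{
	cpumask_t usable;

	if (!cpu_isset(requested_cpu, cpu_isolated_map))
		return requested_cpu;

	cpus_andnot(usable, cpu_online_map, cpu_isolated_map);
	return first_cpu(usable);
}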

Max


Re: [PATCH sched-devel 0/7] CPU isolation extensions

2008-02-22 Thread Max Krasnyanskiy

Mark Hounschell wrote:

Peter Zijlstra wrote:

On Thu, 2008-02-21 at 18:38 -0800, Max Krasnyanskiy wrote:

As you suggested I'm sending CPU isolation patches for review/inclusion into 
sched-devel tree. They are against 2.6.25-rc2.

You can also pull them from my GIT tree at
git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git 
master
 
Post patches! I can't review a git tree..
 

Max, could you also post them for 2.6.24.2 stable please. Thanks

Will do.

Max


Re: [PATCH sched-devel 0/7] CPU isolation extensions

2008-02-22 Thread Max Krasnyanskiy

Peter Zijlstra wrote:

On Thu, 2008-02-21 at 18:38 -0800, Max Krasnyanskiy wrote:

As you suggested I'm sending CPU isolation patches for review/inclusion into 
sched-devel tree. They are against 2.6.25-rc2.

You can also pull them from my GIT tree at
git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git 
master
 
Post patches! I can't review a git tree..

I did, but it looks like I screwed up the --cc list. I just resent them.
 

List of commits
   cpuisol: Make cpu isolation configrable and export isolated map
 
cpu_isolated_map was a bad hack when it was introduced, I feel we should

deprecate it and fully integrate the functionality into cpusets. That would
give a much more flexible end-result.

That's not currently possible and would introduce a lot of complexity.
I'm pretty sure you missed the discussion I had with Paul (you were cc'ed on
that, btw). In fact I provided the link to that discussion in the original
email. Here it is again:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2

Basically the problem is very simple. CPU isolation needs a simple/efficient
way to check whether CPU N is isolated. cpuset/cgroup APIs are not designed for
that. In order to figure out whether CPU N is isolated, one has to iterate
through all cpusets and check their cpu maps. That requires several levels of
locking (cgroup and cpuset).
The other issue is that cpusets are a bit too dynamic (see the thread above for
more details); we'd need notification mechanisms to tell subsystems when a CPU
becomes isolated. Again, more complexity. Since I integrated cpu isolation with
cpu hotplug, it's already addressed in a nice, simple way.
Please take a look at that discussion. I do not think it's worth the effort to
put this into cpusets. cpu_isolated_map is a very clean and simple concept and
integrates nicely with the rest of the cpu maps, i.e. it's very much the same
concept and API as cpu_online_map, etc.


If we want to integrate this stuff with cpusets, I think the best approach
would be to have cpusets update the cpu_isolated_map just like they currently
update scheduler domains.


CPU-sets can already isolate cpus by either creating a cpu outside of any set,
or a set with a single cpu not shared by any other sets.
This only works for user-space. As I mentioned above, for full CPU isolation
various kernel subsystems need to be aware that CPUs are isolated in order to
avoid activity on them.



This also allows for isolated groups, there are good reasons to isolate groups,
esp. now that we have a stronger RT balancer. SMP and hard RT are not
exclusive. A design that does not take that into account is too rigid.

You're thinking scheduling only. Paul had the same confusion ;-)
As I explained before, I'm redefining (or proposing to redefine) CPU isolation
as follows:

1. Isolated CPU(s) must not be subject to scheduler load balancing.
   Users must explicitly bind threads in order to run on those CPU(s).

2. By default, interrupts must not be routed to the isolated CPU(s).
   Users must route interrupts (if any) to those CPUs explicitly.

3. In general, kernel subsystems must avoid activity on the isolated CPU(s) as
   much as possible.
   This includes workqueues, per-CPU threads, etc.
   This feature is configurable and is disabled by default.

Only #1 has to do with scheduling. The rest has _nothing_ to do with it.


   cpuisol: Do not route IRQs to the CPUs isolated at boot



From the diffstat you're not touching the genirq stuff, but instead hack a

single architecture to support this feature. Sounds like an ill designed hack.

Ah, good point. These patches started before genirq was merged, and I did not
realize that there is a way to set the default irq affinity with genirq.
I'll definitely take a look at that.
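
The shape of such a default would be roughly the following (an illustration
only, not the RFC patch itself; it assumes cpu_isolated_map is exported):

/*
 * Pick a default affinity for a new irq that avoids isolated CPUs,
 * falling back to all online CPUs if everything is isolated.
 */
static void default_irq_affinity(unsigned int irq)
{
	cpumask_t usable;

	cpus_andnot(usable, cpu_online_map, cpu_isolated_map);
	if (cpus_empty(usable))
		usable = cpu_online_map;

	irq_desc[irq].affinity = usable;
}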


A better approach would be to add a flag to the cpuset infrastructure that says
whether its a system set or not. A system set would be one that services the
general purpose OS and would include things like the IRQ affinity and unbound
kernel threads (including unbound workqueues - or single workqueues). This flag
would default to on, and by switching it off for the root set, and a select
subset you would push the System away from those cpus, thereby isolating them.
You're talking about a very high-level API. I'm totally for it. What these
patches deal with is the actual low-level stuff that is needed to "push the
system away from those cpus".
As I mentioned above, we could have cpusets update cpu_isolated_map, for example.


   cpuisol: Do not schedule workqueues on the isolated CPUs
 
(per-cpu workqueues, the single ones are treated in the previous section)


I still strongly disagree with this approach. Workqueues are passive, they
don't do anything unless work is provided to them. By blindly not starting them
you handicap the system and services that rely on them.

Oh boy, back to square one. I covered this already.
I even started a thread on that and explained what this is and why it's needed.

Re: [PATCH sched-devel 0/7] CPU isolation extensions

2008-02-22 Thread Max Krasnyanskiy

Dmitry Adamushko wrote:

Hi Max,


 [ ... ]
 Last patch to the stop machine is potentially unsafe and is marked as 
experimental. Unfortunately
 it's currently the only option that allows dynamic module insertion/removal 
for above scenarios.


I'm puzzled by the following part (can be a misunderstanding from my side)

+config CPUISOL_STOPMACHINE
+   bool "Do not halt isolated CPUs with Stop Machine (EXPERIMENTAL)"
+   depends on CPUISOL && STOP_MACHINE && EXPERIMENTAL
+   help
+ If this option is enabled kernel will not halt isolated CPUs
+ when Stop Machine is triggered. Stop Machine is currently only
+ used by the module insertion and removal.

this "only" part. What about e.g. a 'cpu hotplug' case (_cpu_down())?
(or we should abstract it a bit to the point that e.g. a cpu can be
considered as 'a module'? :-)


My bad, I forgot to update that text. As you and other folks pointed out,
stop_machine is used in a few other places besides module loading; we had a
discussion about this a while ago. I just forgot to update the text.
Will do.


Max


[PATCH sched-devel 0/7] CPU isolation extensions

2008-02-21 Thread Max Krasnyanskiy

Ingo,

As you suggested I'm sending CPU isolation patches for review/inclusion into 
sched-devel tree. They are against 2.6.25-rc2.

You can also pull them from my GIT tree at
git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git 
master

Diffstat:
b/Documentation/ABI/testing/sysfs-devices-system-cpu |   41 ++
b/Documentation/cpu-isolation.txt|  114 ++-
b/arch/x86/Kconfig   |1 
b/arch/x86/kernel/genapic_flat_64.c  |5 
b/drivers/base/cpu.c |   48 
b/include/linux/cpumask.h|3 
b/kernel/Kconfig.cpuisol |   15 ++
b/kernel/Makefile|4 
b/kernel/cpu.c   |   49 

b/kernel/sched.c |   37 --
b/kernel/stop_machine.c  |9 +
b/kernel/workqueue.c |   31 +++--
kernel/Kconfig.cpuisol   |   56 ++---
kernel/cpu.c |   16 +-
14 files changed, 356 insertions(+), 73 deletions(-)

List of commits
  cpuisol: Make cpu isolation configrable and export isolated map
  cpuisol: Do not route IRQs to the CPUs isolated at boot
  cpuisol: Do not schedule workqueues on the isolated CPUs
  cpuisol: Move on-stack array used for boot cmd parsing into __initdata
  cpuisol: Documentation updates
  cpuisol: Minor updates to the Kconfig options
  cpuisol: Do not halt isolated CPUs with Stop Machine

This patch series extends CPU isolation support.
The primary idea here is to be able to use some CPU cores as dedicated engines
for running user-space code with minimal kernel overhead/intervention; think of
it as an SPE in the Cell processor. I'd like to be able to run a CPU-intensive
(100%) RT task on one of the processors without adversely affecting, or being
affected by, the other system activities. System activities here include
_kernel_ activities as well.

I'm personally using this for hard realtime purposes. With CPU isolation it's
very easy to achieve single-digit usec worst case and around 200 nsec average
response times on off-the-shelf multi-processor/core systems (vanilla kernel
plus these patches) even under extreme system load. I'm working with legal
folks on releasing a hard RT user-space framework for that.
I believe with the current multi-core CPU trend we will see more and more
applications that explore this capability: RT gaming engines, simulators, hard
RT apps, etc.


Hence the proposal is to extend the current CPU isolation feature.
The new definition of CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing
 Users must explicitly bind threads in order to run on those CPU(s).

2. By default interrupts must not be routed to the isolated CPU(s)
 User must route interrupts (if any) to those CPUs explicitly.

3. In general kernel subsystems must avoid activity on the isolated CPU(s) as 
much as possible
 Includes workqueues, per CPU threads, etc.
 This feature is configurable and is disabled by default.  
---


I've been maintaining this stuff since around 2.6.18, and it's been running in
a production environment for a couple of years now. It's been tested on all
kinds of machines, from NUMA boxes like the HP xw9300/9400 to tiny uTCA boards
like the Mercury AXA110.
The messiest part used to be the SLAB garbage collector changes. With the new
SLUB all that mess goes away (i.e. no changes are necessary). Also, CFS seems
to handle CPU hotplug much better than O(1) did (i.e. domains are recomputed
dynamically), so isolation can be done at any time (via sysfs). So this seems
like a good time to merge.


We've had scheduler support for CPU isolation ever since the O(1) scheduler
went in. In other words, #1 is already supported. These patches do not
change/affect that functionality in any way.
#2 is a trivial one-liner change to the IRQ init code.
#3 is addressed by a couple of separate patches. The main problem here is that
an RT thread can prevent kernel threads from running, and the machine gets
stuck because other CPUs are waiting for those threads to run and report back.

Folks involved in scheduler/cpuset development provided a lot of feedback on
the first series of patches. I believe I managed to explain and clarify every
aspect.
Paul Jackson initially suggested implementing #2 and #3 using the cpusets
subsystem. Paul and I looked at it more closely and determined that exporting
cpu_isolated_map instead is a better option.

Details here:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2

The last patch, to the stop machine, is potentially unsafe and is marked as
experimental. Unfortunately it's currently the only option that allows dynamic
module insertion/removal for the above scenarios.

From the previous discussions it's the only

Re: Module loading/unloading and "The Stop Machine"

2008-02-21 Thread Max Krasnyanskiy

Tejun Heo wrote:

Max Krasnyanskiy wrote:

Tejun Heo wrote:

Max Krasnyanskiy wrote:

Thanks for the info. I guess I missed that from the code. In any case
that seems like a pretty heavy refcounting mechanism. In a sense that
every time something is loaded or unloaded entire machine freezes,
potentially for several milliseconds. Normally it's not a big deal. But
once you get more and more CPUs and/or start using realtime apps this
becomes a big deal.

Module loading doesn't involve stop_machine last time I checked.  It's a
big deal when unloading a module but it's actually a very good trade off
because it makes much hotter path (module_get/put) much cheaper.  If
your application can't stand stop_machine, simply don't unload a module.

static struct module *load_module(void __user *umod,
 unsigned long len,
 const char __user *uargs)
{
 ...

 /* Now sew it into the lists so we can get lockdep and oops
* info during argument parsing.  Noone should access us, since
* strong_try_module_get() will fail. */
   stop_machine_run(__link_module, mod, NR_CPUS);
 ...
}


Ah... right.  That part doesn't have anything to do with module
reference counting as the comment suggests, and can probably be removed
by updating how kallsyms synchronizes against module load/unload.


That list (updated by __link_module) is accessed in a couple of other places,
i.e. outside the symbol lookup stuff used for kallsyms.

Max


Re: Module loading/unloading and "The Stop Machine"

2008-02-21 Thread Max Krasnyanskiy

Tejun Heo wrote:

Max Krasnyanskiy wrote:

Thanks for the info. I guess I missed that from the code. In any case
that seems like a pretty heavy refcounting mechanism. In a sense that
every time something is loaded or unloaded entire machine freezes,
potentially for several milliseconds. Normally it's not a big deal. But
once you get more and more CPUs and/or start using realtime apps this
becomes a big deal.


Module loading doesn't involve stop_machine last time I checked.  It's a
big deal when unloading a module but it's actually a very good trade off
because it makes much hotter path (module_get/put) much cheaper.  If
your application can't stand stop_machine, simply don't unload a module.


static struct module *load_module(void __user *umod,
 unsigned long len,
 const char __user *uargs)
{
 ...

 /* Now sew it into the lists so we can get lockdep and oops
* info during argument parsing.  Noone should access us, since
* strong_try_module_get() will fail. */
   stop_machine_run(__link_module, mod, NR_CPUS);
 ...
}

I actually rarely unload modules. The way I noticed the problem in the first
place was that things started hanging when the tun driver was autoloaded or
when fs automounts triggered some auto-loading.

These days it's kind of hard to have a semi-general-purpose machine without
module loading :).

Max


Re: Module loading/unloading and "The Stop Machine"

2008-02-21 Thread Max Krasnyanskiy

Hi Tejun,


Max Krasnyansky wrote:

I was hoping you could answer a couple of questions about module
loading/unloading and the stop machine.
There was a recent discussion on LKML about the CPU isolation patches I'm
working on. One of the patches makes the stop machine ignore the isolated CPUs.
People of course had questions about that. So I started looking into more of
the details and got this silly, crazy idea that maybe we do not need the stop
machine any more :)

As far as I can tell the stop machine is basically a safety net in case some
locking and refcounting mechanisms aren't bullet proof. In other words, if a
subsystem can actually handle registration/unregistration in a robust way, the
module loader/unloader does not necessarily have to halt the entire machine in
order to load/unload a module that belongs to that subsystem. I may of course
be completely wrong on that.


Nope, it's an integral part of module reference counting.  When using
a refcnt for object lifetime management, the last put should be atomic
against the initial get of the object.  This is usually achieved by
acquiring the lock used for object lookup before putting, or by using
atomic_dec_and_lock().

For module reference counts, this means that try_module_get() and
try_stop_module() should be atomic.  Note that modules don't use a simple
refcnt, so the latter part isn't module_put(), but the analogy still
works.  There are two ways to synchronize try_module_get() against
try_stop_module() - the traditional one is to grab the lock in
try_module_get() and use atomic_dec_and_lock() in try_stop_module(), which
works but is bad performance-wise, because try_module_get() is used way more
than try_stop_module() is.  For example, an IO command can go through several
try_module_get()'s.

So, all the burden of synchronization is put onto try_stop_module().
Because all of the cpus on the machine are stopped and none of them has
been stopped in the middle of non-preemptible code, __try_stop_module()
is synchronized against try_module_get() even though all the
synchronization try_module_get() does is get_cpu().
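
For reference, a minimal sketch of the fast path described above (it roughly
mirrors the 2.6.24-era per-cpu reference counters; simplified, not the exact
kernel code):

/*
 * Hot path: disable preemption, check that the module is still live and
 * bump this CPU's counter.  The cold path (removal) runs under
 * stop_machine, so no CPU can be inside this section while the counters
 * are summed and the module is marked as going away.
 */
static int sketch_module_get(struct module *mod)
{
	int cpu = get_cpu();		/* disables preemption */
	int alive = module_is_live(mod);

	if (alive)
		local_inc(&mod->ref[cpu].count);
	put_cpu();
	return alive;
}
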
Thanks for the info. I guess I missed that from the code. In any case that
seems like a pretty heavy refcounting mechanism, in the sense that every time
something is loaded or unloaded the entire machine freezes, potentially for
several milliseconds. Normally it's not a big deal, but once you get more and
more CPUs and/or start using realtime apps this becomes a big deal. And it's
plain broken for the use case that I mentioned during the CPU isolation
discussions, i.e. when user-space thread(s) prevent the stopmachine kthread
from running, in which case the machine simply hangs until those user-space
threads exit.

Initially I assumed that it had to do with subsystem registration/unregistration
being potentially unsafe. If it's only there for module refcounting, there has
got to be a less expensive way. I'll think some more about it.

The problem with the stop machine is that it's a very, very big gun :), in the
sense that it totally kills latencies, since the entire machine gets halted
while a module is being (un)loaded. That is a major issue for any realtime app.
Specifically for CPU isolation, the issue is that a high-priority RT user-space
thread prevents the stop machine threads from running and the entire box just
hangs waiting for it.
I'm kind of surprised that folks who use monster boxes with over 100 CPUs have
not complained. It must be a huge hit for those machines to halt the entire
thing.

It seems that over the last few years most subsystems have gotten much better
at locking and refcounting, and I'm hoping that we can avoid halting the entire
machine these days. For CPU isolation in particular the solution is simple: we
can just ignore the isolated CPUs. What I'm trying to figure out is how safe
that is and whether we can avoid the full halt altogether.


Without the stop_machine call, there's no synchronization between
initial get and final put.  Things will break.

Got it.
Thanks again for the explanation. I'll stare at the module code some more with 
what you said
in mind.

Max


CPU isolation extensions for the stable 2.6.24.y series

2008-02-21 Thread Max Krasnyanskiy

I got quite a few positive responses from end-users about the CPU isolation
extensions that I posted a while ago.
For people who are interested in using the latest CPU isolation stuff on top of
the stable 2.6.24.y tree, I created a separate GIT tree that is available at

git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.24.y.git

It's currently based on 2.6.24.2 as released by the stable team. I'll be
pulling the latest stable updates into it as they become available.

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Module loading/unloading and The Stop Machine

2008-02-21 Thread Max Krasnyanskiy

Tejun Heo wrote:

Max Krasnyanskiy wrote:

Thanks for the info. I guess I missed that from the code. In any case
that seems like a pretty heavy refcounting mechanism, in the sense that
every time something is loaded or unloaded the entire machine freezes,
potentially for several milliseconds. Normally it's not a big deal. But
once you get more and more CPUs and/or start using realtime apps this
becomes a big deal.


Module loading doesn't involve stop_machine last time I checked.  It's a
big deal when unloading a module but it's actually a very good trade off
because it makes much hotter path (module_get/put) much cheaper.  If
your application can't stand stop_machine, simply don't unload a module.


static struct module *load_module(void __user *umod,
 unsigned long len,
 const char __user *uargs)
{
 ...

 /* Now sew it into the lists so we can get lockdep and oops
* info during argument parsing.  Noone should access us, since
* strong_try_module_get() will fail. */
   stop_machine_run(__link_module, mod, NR_CPUS);
 ...
}

I actually rarely unload modules. The way I noticed the problem in the first place is that 
things started hanging when the tun driver was autoloaded or when fs automounts triggered 
some auto-loading.

These days it's kind of hard to have a semi-general purpose machine without module 
loading :).

Max
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Module loading/unloading and The Stop Machine

2008-02-21 Thread Max Krasnyanskiy

Hi Tejun,


Max Krasnyansky wrote:

I was hoping you could answer a couple of questions about module 
loading/unloading and the stop machine.
There was a recent discussion on LKML about CPU isolation patches I'm working on.
One of the patches makes stop machine ignore the isolated CPUs. People of 
course had questions about that. So I started looking into more detail and got this silly, crazy 
idea that maybe we do not need the stop machine any more :)


As far as I can tell the stop machine is basically a safety net in case some 
locking and refcounting mechanisms aren't bulletproof. In other words if a subsystem 
can actually handle registration/unregistration in a robust way, the module loader/unloader does not 
necessarily have to halt the entire machine in order to load/unload a module that belongs 
to that subsystem. I may of course be completely wrong on that.


Nope, it's an integral part of module reference counting.  When using
a refcnt for object lifetime management, the last put should be atomic
against the initial get of the object.  This is usually achieved by
acquiring the lock used for object lookup before putting, or by using
atomic_dec_and_lock().

For module reference counts, this means that try_module_get() and
try_stop_module() should be atomic.  Note that modules don't use a simple
refcnt so the latter part isn't module_put(), but the analogy still
works.  There are two ways to synchronize try_module_get() against
try_stop_module() - the traditional one is to grab the lock in try_module_get()
and use atomic_dec_and_lock() in try_stop_module(), which works but is
bad performance-wise because try_module_get() is used way more often than
try_stop_module() is.  For example, an IO command can go through several
try_module_get()'s.

So, all the burden of synchronization is put onto try_stop_module().
Because all of the cpus on the machine are stopped and none of them has
been stopped in the middle of non-preemptible code, __try_stop_module()
is synchronized from try_module_get() even though all the
synchronization try_module_get() does is get_cpu().
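
For readers who, like me, had to dig for this: here is a toy sketch of the asymmetric scheme 
Tejun describes (illustration only, the real data structures in kernel/module.c differ).  The 
hot path just disables preemption and bumps a per-CPU counter; the rare stop path sums everything 
up while every CPU is parked outside preempt-disabled code:

/* Toy illustration of the asymmetric refcounting described above. */
struct toy_module {
        local_t ref[NR_CPUS];   /* per-CPU reference counts */
        int     going;          /* set only under stop_machine */
};

static inline int toy_module_get(struct toy_module *mod)
{
        int cpu = get_cpu();    /* only disables preemption */
        int ret = 0;

        if (likely(!mod->going)) {
                local_inc(&mod->ref[cpu]);
                ret = 1;
        }
        put_cpu();
        return ret;
}

/* Runs via stop_machine_run(): every other CPU is parked outside any
 * preempt-disabled region, so no toy_module_get() can race with us. */
static int toy_try_stop_module(struct toy_module *mod)
{
        unsigned int cpu, total = 0;

        for_each_possible_cpu(cpu)
                total += local_read(&mod->ref[cpu]);
        if (total)
                return -EBUSY;
        mod->going = 1;
        return 0;
}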
Thanks for the info. I guess I missed that from the code. In any case that seems like a 
pretty heavy refcounting mechanism, in the sense that every time something is loaded or 
unloaded the entire machine freezes, potentially for several milliseconds. Normally it's not a 
big deal. But once you get more and more CPUs and/or start using realtime apps this becomes 
a big deal. And it's plain broken for the use case that I mentioned during the CPU isolation 
discussions, ie when user-space thread(s) prevent the stop-machine kthreads from running, in 
which case the machine simply hangs until those user-space threads exit.

Initially I assumed that it had to do with subsystem registration/unregistration being 
potentially unsafe. If it's only for module refcounting there has got to be a 
less expensive way.
I'll think some more about it.

The problem with the stop machine is that it's a very, very big gun :). In the sense that 
it totally kills all the latencies since the entire machine gets halted while a 
module is being (un)loaded. Which is a major issue for any realtime apps. Specifically 
for CPU isolation the issue is that a high-priority RT user-space thread prevents the stop 
machine threads from running and the entire box just hangs waiting for it. 
I'm kind of surprised that folks who use monster boxes with over 100 CPUs have not 
complained. It must be a huge hit for those machines to halt the entire thing. 

It seems that over the last few years most subsystems got much better at locking and 
refcounting. And I'm hoping that we can avoid halting the entire machine these days.
For CPU isolation in particular the solution is simple: we can just ignore isolated CPUs. 
What I'm trying to figure out is how safe that is and whether we can avoid the full halt 
altogether.


Without the stop_machine call, there's no synchronization between
initial get and final put.  Things will break.

Got it.
Thanks again for the explanation. I'll stare at the module code some more with 
what you said
in mind.

Max
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Module loading/unloading and The Stop Machine

2008-02-21 Thread Max Krasnyanskiy

Tejun Heo wrote:

Max Krasnyanskiy wrote:

Tejun Heo wrote:

Max Krasnyanskiy wrote:

Thanks for the info. I guess I missed that from the code. In any case
that seems like a pretty heavy refcounting mechanism, in the sense that
every time something is loaded or unloaded the entire machine freezes,
potentially for several milliseconds. Normally it's not a big deal. But
once you get more and more CPUs and/or start using realtime apps this
becomes a big deal.

Module loading doesn't involve stop_machine last time I checked.  It's a
big deal when unloading a module but it's actually a very good trade off
because it makes much hotter path (module_get/put) much cheaper.  If
your application can't stand stop_machine, simply don't unload a module.

static struct module *load_module(void __user *umod,
 unsigned long len,
 const char __user *uargs)
{
 ...

 /* Now sew it into the lists so we can get lockdep and oops
* info during argument parsing.  Noone should access us, since
* strong_try_module_get() will fail. */
   stop_machine_run(__link_module, mod, NR_CPUS);
 ...
}


Ah... right.  That part doesn't have anything to do with module
reference counting as the comment suggests and can probably be removed
by updating how kallsyms synchronizes against module load/unload.


That list (updated by __link_module) is accessed in a couple of other places, ie 
outside the symbol lookup stuff used for kallsyms.
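
Just to be clear about what I mean by "a less expensive way": one possible direction (a sketch of 
an assumption on my part, not a claim about what module.c does today) would be to make the module 
list RCU-safe, so lockless readers such as the symbol/oops lookup no longer need every CPU halted:

/* Sketch only: assumes __link_module()/__unlink_module() were converted
 * to list_add_rcu()/list_del_rcu() under module_mutex.  Readers could
 * then walk the list like this instead of relying on stop_machine. */
static const char *toy_module_name_for_addr(unsigned long addr)
{
        struct module *mod;
        const char *name = NULL;

        rcu_read_lock();
        list_for_each_entry_rcu(mod, &modules, list) {
                unsigned long core = (unsigned long)mod->module_core;

                if (addr >= core && addr < core + mod->core_size) {
                        name = mod->name;
                        break;
                }
        }
        rcu_read_unlock();
        return name;
}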

Max
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH sched-devel 0/7] CPU isolation extensions

2008-02-21 Thread Max Krasnyanskiy

Ingo,

As you suggested I'm sending CPU isolation patches for review/inclusion into 
sched-devel tree. They are against 2.6.25-rc2.

You can also pull them from my GIT tree at
git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git 
master

Diffstat:
b/Documentation/ABI/testing/sysfs-devices-system-cpu |   41 ++
b/Documentation/cpu-isolation.txt|  114 ++-
b/arch/x86/Kconfig   |1 
b/arch/x86/kernel/genapic_flat_64.c  |5 
b/drivers/base/cpu.c |   48 
b/include/linux/cpumask.h|3 
b/kernel/Kconfig.cpuisol |   15 ++
b/kernel/Makefile|4 
b/kernel/cpu.c   |   49 

b/kernel/sched.c |   37 --
b/kernel/stop_machine.c  |9 +
b/kernel/workqueue.c |   31 +++--
kernel/Kconfig.cpuisol   |   56 ++---
kernel/cpu.c |   16 +-
14 files changed, 356 insertions(+), 73 deletions(-)

List of commits
  cpuisol: Make cpu isolation configurable and export isolated map
  cpuisol: Do not route IRQs to the CPUs isolated at boot
  cpuisol: Do not schedule workqueues on the isolated CPUs
  cpuisol: Move on-stack array used for boot cmd parsing into __initdata
  cpuisol: Documentation updates
  cpuisol: Minor updates to the Kconfig options
  cpuisol: Do not halt isolated CPUs with Stop Machine

This patch series extends CPU isolation support.
The primary idea here is to be able to use some CPU cores as dedicated engines for running 
user-space code with minimal kernel overhead/intervention; think of it as an SPE in the 
Cell processor. I'd like to be able to run a CPU intensive (100%) RT task on one of the 
processors without adversely affecting or being affected by the other system activities. 
System activities here include _kernel_ activities as well. 

I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to 
achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf 
multi-processor/core systems (vanilla kernel plus these patches) even under extreme system load. 
I'm working with legal folks on releasing a hard RT user-space framework for that.
I believe with the current multi-core CPU trend we will see more and more applications that 
exploit this capability: RT gaming engines, simulators, hard RT apps, etc.
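
As a usage illustration (my own example code, not part of the patches): with a CPU isolated this 
way, the application pins its engine thread onto it explicitly, e.g.:

/* Example only: CPU number and priority are illustrative. */
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>

static int bind_to_isolated_cpu(int cpu, int rt_prio)
{
        cpu_set_t set;
        struct sched_param sp = { .sched_priority = rt_prio };
        int err;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (err)
                return err;
        /* optionally make the engine thread SCHED_FIFO as well */
        return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}

int main(void)
{
        if (bind_to_isolated_cpu(2, 50))
                fprintf(stderr, "failed to bind to isolated CPU\n");
        /* ... dedicated engine loop runs here, untouched by load balancing ... */
        return 0;
}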


Hence the proposal is to extend current CPU isolation feature.
The new definition of the CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing
 Users must explicitly bind threads in order to run on those CPU(s).

2. By default interrupts must not be routed to the isolated CPU(s)
 User must route interrupts (if any) to those CPUs explicitly.

3. In general kernel subsystems must avoid activity on the isolated CPU(s) as 
much as possible
 Includes workqueues, per CPU threads, etc.
 This feature is configurable and is disabled by default.  
---


I've been maintaining this stuff since around 2.6.18 and it's been running in a 
production environment for a couple of years now. It's been tested on all kinds of 
machines, from NUMA boxes like HP xw9300/9400 to tiny uTCA boards like Mercury AXA110.
The messiest part used to be the SLAB garbage collector changes. With the new SLUB all that mess 
goes away (ie no changes necessary). Also CFS seems to handle CPU hotplug much better than O(1) 
did (ie domains are recomputed dynamically) so that isolation can be done at any time (via sysfs). 
So this seems like a good time to merge. 


We've had scheduler support for CPU isolation ever since the O(1) scheduler went 
in. In other words
#1 is already supported. These patches do not change/affect that functionality in any way. 
#2 is a trivial one-liner change to the IRQ init code. 
#3 is addressed by a couple of separate patches. The main problem here is that an RT thread can 
prevent kernel threads from running, and the machine gets stuck because other CPUs are 
waiting for those threads to run and report back.

Folks involved in the scheduler/cpuset development provided a lot of feedback 
on the first series of patches. I believe I managed to explain and clarify every aspect. 
Paul Jackson initially suggested implementing #2 and #3 using the cpusets subsystem. Paul and I looked 
at it more closely and determined that exporting cpu_isolated_map instead is a better option.

Details here
http://marc.info/?l=linux-kernel&m=120180692331461&w=2

The last patch, to the stop machine, is potentially unsafe and is marked as experimental. Unfortunately 
it's currently the only option that allows dynamic module insertion/removal for the above scenarios. 

From the previous discussions it's the only 

Re: Module loading/unloading and "The Stop Machine"

2008-02-08 Thread Max Krasnyanskiy

Max Krasnyansky wrote:

Hi Rusty,

I was hoping you could answer a couple of questions about module 
loading/unloading and the stop machine.
There was a recent discussion on LKML about CPU isolation patches I'm working on.
One of the patches makes stop machine ignore the isolated CPUs. People of 
course had questions about that. So I started looking into more detail and got this silly, crazy 
idea that maybe we do not need the stop machine any more :)


As far as I can tell the stop machine is basically a safety net in case some 
locking and refcounting mechanisms aren't bulletproof. In other words if a subsystem 
can actually handle registration/unregistration in a robust way, the module loader/unloader does not 
necessarily have to halt the entire machine in order to load/unload a module that belongs 
to that subsystem. I may of course be completely wrong on that.
 
 
The problem with the stop machine is that it's a very, very big gun :). In the sense that 
it totally kills all the latencies since the entire machine gets halted while a 
module is being (un)loaded. Which is a major issue for any realtime apps. Specifically 
for CPU isolation the issue is that a high-priority RT user-space thread prevents the stop 
machine threads from running and the entire box just hangs waiting for it. 
I'm kind of surprised that folks who use monster boxes with over 100 CPUs have not 
complained. It must be a huge hit for those machines to halt the entire thing. 

It seems that over the last few years most subsystems got much better at locking and 
refcounting. And I'm hoping that we can avoid halting the entire machine these days.
For CPU isolation in particular the solution is simple: we can just ignore isolated CPUs. 
What I'm trying to figure out is how safe that is and whether we can avoid the full halt 
altogether.


So. Here is what I tried today on my Core2 Duo laptop

--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -204,11 +204,14 @@ int stop_machine_run(int (*fn)(void *), void *data, 
unsigned int cpu)
 
/* No CPUs can come up or down during this. */

lock_cpu_hotplug();
+/*
p = __stop_machine_run(fn, data, cpu);
if (!IS_ERR(p))
ret = kthread_stop(p);
else
ret = PTR_ERR(p);
+*/
+   ret = fn(data);
unlock_cpu_hotplug();
 
return ret;


ie I completely disabled the stop machine. It just loads/unloads modules without the full 
halt.
I then ran three scripts:

while true; do
/sbin/modprobe -r uhci_hcd
/sbin/modprobe uhci_hcd
sleep 10
done

while true; do
/sbin/modprobe -r tg3
/sbin/modprobe tg3
sleep 2
done

while true; do
/usr/sbin/tcpdump -i eth0
done

The machine has a bunch of USB devices connected to it. The two most interesting 
are a Bluetooth dongle and a USB mouse. By loading/unloading UHCI driver we're touching
Sysfs, USB stack, Bluetooth stack, HID layer, Input layer. The X is running and is using 
that USB mouse. The Bluetooth services are running too.

By loading/unloading TG3 driver we're touching sysfs, network stack (a bunch of 
layers).
The machine is running NetworkManager and tcpdumping on the eth0 which is registered 
by TG3.
This is a pretty good stress test in general let alone the disabled stop machine. 

I left all that running for the whole day while doing normal day to day things. 
Compiling a bunch of things, emails, office apps, etc. That's where I'm writing this
email from :). It's still running all that :) 


So the question is: do we still need stop machine ? I must be missing something 
obvious.
But things seem to be working pretty well without it. I certainly feel much better about 
at least ignoring isolated CPUs during stop machine execution, which btw I've been doing 
for a couple of years now on a wide range of machines where people are inserting 
modules left and right. 


What do you think ?

Thanx
Max


Quick update on this.
I've also run
while true; do
sudo mount -o loop loopfs loopmnt && dd if=/dev/zero 
of=loopmnt/dummy bs=1M
sudo umount loopmnt
sleep 2
done
and
while true; do
/sbin/modprobe -r loop
/sbin/modprobe loop
sleep 1
done
in parallel on the Core2 Quad box for about 6 hours now. Same thing. No signs of problems 
whatsoever, with the "stop machine" completely disabled. Everything is working as expected.

Here we're exercising sysfs, block and fs layers.
So I'm now even more eager to see your response :).

btw Does anyone else have a module load/unload scenario that definitely 
requires stop machine ?

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: cpuisol: CPU isolation extensions (take 2)

2008-02-06 Thread Max Krasnyanskiy
CC'ing linux-rt-users because I think my explanation below may be interesting for the 
RT folks.


Mark Hounschell wrote:

Max Krasnyanskiy wrote:


With CPU isolation
it's very easy to achieve single digit usec worst case and around 200
nsec average response times on off-the-shelf
multi- processor/core systems (vanilla kernel plus these patches) even
under exteme system load. 


Hi Max, could you elaborate on what sort events your response times are
from?


Sure. As I mentioned before I'm working with our legal team on releasing a hard RT engine 
that uses isolated CPUs. You can think of that engine as a giant SW PLL. 
It requires a time source that it locks on to. For example the time source can be the 
kernel clock (gtod), some kind of memory mapped counter, or some external event. 
In my case the HW sends me an Ethernet packet every 24 milliseconds. 
Once the PLL locks onto the time source the engine executes a predefined "timeline". 
The timeline basically specifies tasks with offsets in nanoseconds from the start of 
the cycle (ie "at 100 nsec run task1", "at 15000 run task2", etc). The tasks are just 
callbacks.
The jitter in running those tasks is what I meant by "response time". Essentially it's 
a polling design where the SW knows precisely when to expect an event. It's not a general 
purpose solution but it works beautifully for things like wireless PHY/MAC layers where the 
framing structure is very deterministic and must be strictly enforced. It works for other 
applications as well once you get your head wrapped around the idea :), ie that you do 
not get an interrupt for every single event, the SW already knows when that event will come.

btw The engine also enforces the deadlines. For example it knows right away if a task is 
late and it knows exactly how late. That helps in debugging, a lot :).
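
To give a feel for the design, here is a stripped-down sketch of such a timeline executor 
(illustration only, not the actual engine; the structure and function names are made up):

/* Illustration of the "timeline" idea described above. */
#include <stdint.h>
#include <time.h>

struct timeline_task {
        uint64_t offset_ns;             /* offset from the start of the cycle */
        void (*fn)(void *arg);          /* the callback to run */
        void *arg;
};

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Run one cycle: fire each task at its offset; the difference between
 * now_ns() and the deadline at that point is the jitter/lateness. */
static void run_cycle(uint64_t cycle_start,
                      struct timeline_task *tasks, int ntasks)
{
        int i;

        for (i = 0; i < ntasks; i++) {
                uint64_t deadline = cycle_start + tasks[i].offset_ns;

                while (now_ns() < deadline)
                        ;       /* pure polling, no interrupts involved */
                tasks[i].fn(tasks[i].arg);
        }
}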

The other option is to run normal pthreads on the isolated CPUs. As long as the 
threads
are carefully designed not to do certain things you can get very decent worst case latencies 
(10-12 usec on Opterons and Core2) even with vanilla kernels (patched with the isolation 
patches of course) because all the latency sources have been removed from those CPUs.


Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RT scheduler config, suggestions and questions

2008-02-06 Thread Max Krasnyanskiy

Peter Zijlstra wrote:

On Wed, 2008-02-06 at 07:36 +0100, Peter Zijlstra wrote:


btw I can see "watchdog" being very useful to catch hard-RT tasks that exceed 
the deadline.
But it's gotta be per thread.

It is.


A single setting per user is not enough, unless a user has a single RT task.

?


Ah, its starting to make sense, you want it configurable per thread, not
process wide. Yes, I see how that is useful, just no idea how to expose
that to user-space atm.

Yes. That's what I meant. I don't think an overall per-process setting is that useful.
Per-thread, though, would be useful.

How to expose that to user-space ? The best option in my opinion is to extend 
struct sched_param. That way both sched_setparam() and pthread_attr_setschedparam() 
can be used to set the new attributes and it's backwards compatible.

Something like:

struct sched_param {
...
unsigned int sched_rt_limit;
unsigned int sched_rt_...;
};
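
Hypothetical usage, just to show the shape of the API (sched_rt_limit does not exist today; 
only sched_priority does, and the field name and value below are made up):

#include <pthread.h>
#include <sched.h>

void *rt_thread_fn(void *arg);  /* the application's RT thread body */

int start_rt_thread(pthread_t *tid)
{
        pthread_attr_t attr;
        struct sched_param sp = { .sched_priority = 50 };

        /* With the proposed extension one would also do something like:
         *   sp.sched_rt_limit = 200000;   -- per-thread budget, hypothetical
         */
        pthread_attr_init(&attr);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedparam(&attr, &sp);
        return pthread_create(tid, &attr, rt_thread_fn, NULL);
}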

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RT scheduler config, suggestions and questions

2008-02-06 Thread Max Krasnyanskiy

Peter Zijlstra wrote:

On Tue, 2008-02-05 at 15:37 -0800, Max Krasnyanskiy wrote:

Folks,

I just realized that in the latest Linus tree the following sysctls are under 
SCHED_DEBUG:
sched_rt_period
sched_rt_ratio

I do not believe that is correct. I know that we do not want to expose 
scheduler knobs in general, but these are not the heuristic kind of knobs. There is no way the scheduler 
can magically figure out what the correct setting should be here.


Yeah, since fixed.

Super. I guess it's not in mainline yet. Or it wasn't yesterday.

Also, shouldn't those new RT features that recently went in be configurable and _disabled_ 
by default ? For example "RT watchdog" and "RT throttling" actually seem very questionable. 
SCHED_FIFO is clearly defined as

"
  A SCHED_FIFO process runs until either it is blocked by an I/O request, it is preempted 
  by a higher priority process, or it calls sched_yield(2).

"


The watchdog is disabled by default, the bandwidth is .95s every 1s,
which is mainly a safe-guard against run-away real-time tasks. As long
as real-time usage stays within those limits nothing happens. If you
don't like it set sched_rt_runtime [*] to -1.

Oh. From looking at the code I assumed I needed to set sched_rt_ratio to 65536.
I guess you changed things around a bit. I'll look at the latest changes.
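
For reference, a minimal sketch of flipping that knob at run time, assuming it ends up exposed 
as /proc/sys/kernel/sched_rt_runtime_us (where -1 means "no limit" as described above):

#include <stdio.h>

static int disable_rt_throttling(void)
{
        /* the path is an assumption based on the sysctl name mentioned above */
        FILE *f = fopen("/proc/sys/kernel/sched_rt_runtime_us", "w");

        if (!f)
                return -1;
        fprintf(f, "-1\n");
        return fclose(f);
}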

Both the watchdog and the throttling are clearly breaking that rule. I think it's good to have 
those features, but not enabled by default and certainly not with the sysctls that disable them 
hidden under debugging.

How about this:
- We introduce Kconfig options for them ?


I don't see why this would be needed.

It reduces code size and RT scheduler complexity for the cases when these features 
are not required.
For example my preference would be for the RT scheduler to be as lean as possible, to avoid any kind of 
overhead and cache pollution. I know both the watchdog and the throttling end up being a single if() when 
disabled via sysctl, but it all adds up.


- Expose all rt sysctls outside of #ifdef DEBUG


Already did this somewhere along the line.


btw I can see "watchdog" being very useful to catch hard-RT tasks that exceed 
the deadline.
But it's gotta be per thread.


It is.


A single setting per user is not enough, unless a user has a single RT task.


?

Currently it's per process. Most RT apps will most likely have several RT 
threads.

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RT scheduler config, suggestions and questions

2008-02-05 Thread Max Krasnyanskiy

Folks,

I just realized that in the latest Linus tree the following sysctls are under 
SCHED_DEBUG:
sched_rt_period
sched_rt_ratio

I do not believe that is correct. I know that we do not want to expose 
scheduler knobs in general, but these are not the heuristic kind of knobs. There is no way the scheduler 
can magically figure out what the correct setting should be here.


Also, shouldn't those new RT features that recently went in be configurable and _disabled_ 
by default ? For example "RT watchdog" and "RT throttling" actually seem very questionable. 
SCHED_FIFO is clearly defined as

"
 A SCHED_FIFO process runs until either it is blocked by an I/O request, it is preempted 
 by a higher priority process, or it calls sched_yield(2).

"

Both the watchdog and the throttling are clearly breaking that rule. I think it's good to have 
those features, but not enabled by default and certainly not with the sysctls that disable them 
hidden under debugging.

How about this:
- We introduce Kconfig options for them ?
- Expose all rt sysctls outside of #ifdef DEBUG

I can cook up some patches if that sounds ok.



btw I can see "watchdog" being very useful to catch hard-RT tasks that exceed 
the deadline.
But it's gotta be per thread. A single setting per user is not enough, unless a user has a single 
RT task.


Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


cpuisol: CPU isolation extensions (take 2)

2008-02-05 Thread Max Krasnyanskiy

It seems that git-send-email for some reason did not send an introductory 
email.
So I'm sending it manually. Sorry if you get it twice.

---

The following patch series extends CPU isolation support. Yes, most people want to virtualize 
CPUs these days and I want to isolate them :).


The primary idea here is to be able to use some CPU cores as dedicated engines for running 
user-space code with minimal kernel overhead/intervention; think of it as an SPE in the 
Cell processor. I'd like to be able to run a CPU intensive (100%) RT task on one of the 
processors without adversely affecting or being affected by the other system activities. 
System activities here include _kernel_ activities as well. 

I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to 
achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf 
multi-processor/core systems (vanilla kernel plus these patches) even under extreme system load. 
I'm working with legal folks on releasing a hard RT user-space framework for that.
I believe with the current multi-core CPU trend we will see more and more applications that 
exploit this capability: RT gaming engines, simulators, hard RT apps, etc.


Hence the proposal is to extend current CPU isolation feature.
The new definition of the CPU isolation would be:
---
1. Isolated CPU(s) must not be subject to scheduler load balancing
  Users must explicitly bind threads in order to run on those CPU(s).

2. By default interrupts must not be routed to the isolated CPU(s)
  User must route interrupts (if any) to those CPUs explicitly.

3. In general kernel subsystems must avoid activity on the isolated CPU(s) as 
much as possible
  Includes workqueues, per CPU threads, etc.
  This feature is configurable and is disabled by default.  
---


I've been maintaining this stuff since around 2.6.18 and it's been running in a 
production environment for a couple of years now. It's been tested on all kinds of 
machines, from NUMA boxes like HP xw9300/9400 to tiny uTCA boards like Mercury AXA110.
The messiest part used to be the SLAB garbage collector changes. With the new SLUB all that mess 
goes away (ie no changes necessary). Also CFS seems to handle CPU hotplug much better than O(1) 
did (ie domains are recomputed dynamically) so that isolation can be done at any time (via sysfs). 
So this seems like a good time to merge. 


We've had scheduler support for CPU isolation ever since the O(1) scheduler went 
in. In other words
#1 is already supported. These patches do not change/affect that functionality in any way. 
#2 is a trivial one-liner change to the IRQ init code. #3 is addressed by a couple of separate patches.


The patchset consists of 4 patches. The first two are very simple. They simply make "CPU isolation" a 
configurable feature, export cpu_isolated_map and provide some helper functions to access it (just 
like cpu_online() and friends).

The last two patches add support for isolating CPUs from running workqueues and stop 
machine.
More details in the individual patch descriptions.

Folks involved in the scheduler/cpuset development provided a lot of feedback 
on the first series of patches. I believe I managed to explain and clarify every aspect. 
Paul Jackson initially suggested implementing #2 and #3 using the cpusets subsystem. Paul and I looked 
at it more closely and determined that exporting cpu_isolated_map instead is a better option. 

The last patch, to the stop machine, is potentially unsafe and is marked as highly experimental. Unfortunately 
it's currently the only option that allows dynamic module insertion/removal for the above scenarios. 
If people still feel that it's too ugly I can revert that change and keep it in a separate tree 
for now.


Ideally I'd like all of this to go in during this merge window. 
Linus please pull this patch set from

git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git

That tree is rebased against latest (as of today) Linus' tree.

Thanx
Max

b/arch/x86/Kconfig  |1 
b/arch/x86/kernel/genapic_flat_64.c |5 ++-

b/drivers/base/cpu.c|   48 +++
b/include/linux/cpumask.h   |3 ++
b/kernel/Kconfig.cpuisol|   15 +++
b/kernel/Makefile   |4 +-
b/kernel/cpu.c  |   49 
b/kernel/sched.c|   37 ---
b/kernel/stop_machine.c |9 +-
b/kernel/workqueue.c|   31 --
kernel/Kconfig.cpuisol  |   26 ++-
11 files changed, 176 insertions(+), 52 deletions(-)


cpuisol: Make cpu isolation configurable and export isolated map
cpuisol: Do not route IRQs to the CPUs isolated at boot
cpuisol: Do not schedule workqueues on the isolated CPUs
cpuisol: Do not halt isolated CPUs with Stop Machine





--
To unsubscribe from this 


CPU isolation and workqueues [was Re: [CPUISOL] CPU isolation extensions]

2008-02-04 Thread Max Krasnyanskiy


Peter Zijlstra wrote:

On Mon, 2008-01-28 at 14:00 -0500, Steven Rostedt wrote:

On Mon, 28 Jan 2008, Max Krasnyanskiy wrote:

  [PATCH] [CPUISOL] Support for workqueue isolation

The thing about workqueues is that they should only be woken on a CPU if
something on that CPU accessed them. IOW, the workqueue on a CPU handles
work that was called by something on that CPU. Which means that
something that high prio task did triggered a workqueue to do some work.
But this can also be triggered by interrupts, so by keeping interrupts
off the CPU no workqueue should be activated.

No no no. That's what I thought too ;-). The problem is that things like NFS and friends 
expect _all_ their workqueue threads to report back when they do certain things like 
flushing buffers and stuff. The reason I added this is because my machines were getting 
stuck because CPU0 was waiting for CPU1 to run NFS workqueue threads even though no IRQs 
or other things are running on it.

This sounds more like we should fix NFS than add this for all workqueues.
Again, we want workqueues to run on the behalf of whatever is running on
that CPU, including those tasks that are running on an isolcpu.


agreed, by looking at my top output (and not the nfs code) it looks like
it just spawns a configurable number of active kernel threads which are
not cpu bound in any way. I think just removing the isolated cpus
from their runnable mask should take care of them.


Peter, Steven,

I think I convinced you guys last time but I did not have a convincing example. 
So here is some
more info on why workqueues need to be aware of isolated cpus.

Here is how a work queue gets flushed.

static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
{
        int active;

        if (cwq->thread == current) {
                /*
                 * Probably keventd trying to flush its own queue. So simply run
                 * it by hand rather than deadlocking.
                 */
                run_workqueue(cwq);
                active = 1;
        } else {
                struct wq_barrier barr;

                active = 0;
                spin_lock_irq(&cwq->lock);
                if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
                        insert_wq_barrier(cwq, &barr, 1);
                        active = 1;
                }
                spin_unlock_irq(&cwq->lock);

                if (active)
                        wait_for_completion(&barr.done);
        }

        return active;
}

void fastcall flush_workqueue(struct workqueue_struct *wq)
{
        const cpumask_t *cpu_map = wq_cpu_map(wq);
        int cpu;

        might_sleep();
        lock_acquire(&wq->lockdep_map, 0, 0, 0, 2, _THIS_IP_);
        lock_release(&wq->lockdep_map, 1, _THIS_IP_);
        for_each_cpu_mask(cpu, *cpu_map)
                flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
}

In other words it schedules some work on each cpu and expects the workqueue thread on that 
cpu to run and trigger the completion. This is what I meant by _all_ threads being expected to report 
back even if there is nothing running on that CPU.


So my patch simply makes sure that isolated CPUs are ignored (if workqueue isolation 
is enabled), ie that workqueue threads are not started on the CPUs that are isolated.
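
Roughly speaking (a sketch of the idea, not the actual patch), the per-CPU map the workqueue code 
iterates over would simply exclude the isolated CPUs, assuming cpu_isolated_map is exported:

/* Sketch only: CPUs usable for per-CPU workqueue threads. */
static cpumask_t wq_usable_cpus(void)
{
        cpumask_t mask;

        /* online CPUs minus the isolated ones (cpu_isolated_map as proposed) */
        cpus_andnot(mask, cpu_online_map, cpu_isolated_map);
        return mask;
}
/* create_workqueue()/flush_workqueue() would then walk this mask, so no
 * worker is created on -- or waited for on -- an isolated CPU. */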

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


CPU hotplug and IRQ affinity with 2.6.24-rt1

2008-02-04 Thread Max Krasnyanskiy

This is just an FYI. As part of the "Isolated CPU extensions" thread Daniel suggested 
that I check out the latest RT kernels. So I did, or at least tried to, and immediately 
spotted a couple of issues.

The machine I'm running it on is:
HP xw9300, Dual Opteron, NUMA

It looks like with the -rt kernel IRQ affinity masks are ignored on that system. ie I write 1 to, 
let's say, /proc/irq/23/smp_affinity but the interrupts keep coming to CPU1. 
Vanilla 2.6.24 does not have that issue.


Also the first thing I tried was to bring CPU1 off-line. That's the fastest way to get irqs, 
soft-irqs, timers, etc. off of a CPU. But the box hung completely. It also managed to mess up my 
ext3 filesystem to the point where it required a manual fsck (have not seen that for a couple of 
years now). 
I tried the same thing (ie echo 0 > /sys/devices/cpu/cpu1/online) from the console. It hung 
again with a message that looked something like:

CPU1 is now off-line
Thread IRQ-23 is on CPU1 ...

IRQ 23 is NVidia SATA. So I guess it has something to do with the borked 
affinity handling.
Vanilla 2.6.24 handles this just fine.

Anyway, like I said it's just an FYI, not an urgent issue.

Max

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integrating cpusets and cpu isolation [was Re: [CPUISOL] CPU isolation extensions]

2008-02-04 Thread Max Krasnyanskiy

Paul Jackson wrote:

Max wrote:

Looks like I failed to explain what I'm trying to achieve. So let me try again.


Well done.  I read through that, expecting to disagree or at least
to not understand at some point, and got all the way through nodding
my head in agreement.  Good.

Whether the earlier confusions were lack of clarity in the presentation,
or lack of competence in my brain ... well guess I don't want to ask that
question ;).

:)


Well ... just one minor point:

Max wrote in reply to pj:

The cpu_isolated_map is a file static variable known only within
the kernel/sched.c file; this should not change.

I completely disagree. In fact I think all the cpu_xxx_map (online, present, 
isolated)
variables do not belong in the scheduler code. I'm thinking of submitting a 
patch that
factors them out into kernel/cpumask.c We already have cpumask.h.


Huh?  Why would you want to do that?

For one thing, the map being discussed here, cpu_isolated_map,
is only used in sched.c, so why publish it wider?

And for another thing, we already declare externs in cpumask.h for
the other, more widely used, cpu_*_map variables cpu_possible_map,
cpu_online_map, and cpu_present_map.

Well, to address #2 and #3 the isolated map will need to be exported as well.
Those other maps do not really have much to do with the scheduler code.
That's why I think either kernel/cpumask.c or kernel/cpu.c is a better place 
for them.


Other than that detail, we seem to be communicating and in agreement on
your first item, isolating CPU scheduler load balancing.  Good.

On your other two items, irq and workqueue isolation, which I had
suggested doing via cpuset sched_load_balance, I now agree that that
wasn't a good idea.

I am still a little surprised at using isolation extensions to
disable irqs on select CPUs; but others have thought far more about
irqs than I have, so I'll be quiet.

Please note that we're not talking about completely disabling IRQs. We're 
talking about not routing them to the isolated CPUs by default. It's still possible to 
explicitly reroute an IRQ to an isolated CPU.
Why is this needed ? It is actually very easy to explain. IRQs are the major source of latency 
and overhead. IRQ handlers themselves are mostly ok but they typically schedule soft irqs, work 
queues and timers on the same CPU where the IRQ is handled. In other words if an isolated CPU is 
receiving IRQs it's not really isolated, because it's running a whole bunch of different kernel 
code (ie we're talking latencies, cache usage, etc). 
Of course some folks may want to explicitly route certain IRQs to the isolated CPUs. For example 
if an app depends on the network stack it may make sense to route the IRQ from the NIC to the same 
CPU the app is running on.
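
Rerouting is just a matter of writing a CPU mask to /proc/irq/<N>/smp_affinity; a minimal 
example (my own helper, IRQ and CPU numbers are made up):

#include <stdio.h>

/* Route the given IRQ to a single CPU (cpu < 32 for this simple mask). */
static int route_irq_to_cpu(int irq, int cpu)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%x\n", 1u << cpu);  /* hex CPU mask */
        return fclose(f);
}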


Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


CPU hotplug and IRQ affinity with 2.6.24-rt1

2008-02-04 Thread Max Krasnyanskiy

This is just an FYI. As part of the Isolated CPU extensions thread Daniel 
suggest for me
to check out latest RT kernels. So I did or at least tried to and immediately 
spotted a couple
of issues.

The machine I'm running it on is:
HP xw9300, Dual Opteron, NUMA

It looks like with -rt kernel IRQ affinity masks are ignored on that system. ie I write 1 to 
lets say /proc/irq/23/smp_affinity but the interrupts keep coming to CPU1. 
Vanilla 2.6.24 does not have that issue.


Also the first thing I tried was to bring CPU1 off-line. Thats the fastest way to get irqs, 
soft-irqs, timers, etc of a CPU. But the box hung completely. It also managed to mess up my 
ext3 filesystem to the point where it required manual fsck (have not see that for a couple of
years now). 
I tried the same thing (ie echo 0  /sys/devices/cpu/cpu1/online) from the console. It hang 
again with the message that looked something like:

CPU1 is now off-line
Thread IRQ-23 is on CPU1 ...

IRQ 23 is NVidia SATA. So I guess it has something to do with the borked 
affinity handling.
Vanilla 2.6.24 handles this just fine.

Anyway, like I said it's just an FYI, not an urgent issue.

Max

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Integrating cpusets and cpu isolation [was Re: [CPUISOL] CPU isolation extensions]

2008-02-04 Thread Max Krasnyanskiy

Paul Jackson wrote:

Max wrote:

Looks like I failed to explain what I'm trying to achieve. So let me try again.


Well done.  I read through that, expecting to disagree or at least
to not understand at some point, and got all the way through nodding
my head in agreement.  Good.

Whether the earlier confusions were lack of clarity in the presentation,
or lack of competence in my brain ... well guess I don't want to ask that
question ;).

:)


Well ... just one minor point:

Max wrote in reply to pj:

The cpu_isolated_map is a file static variable known only within
the kernel/sched.c file; this should not change.

I completely disagree. In fact I think all the cpu_xxx_map (online, present, 
isolated)
variables do not belong in the scheduler code. I'm thinking of submitting a 
patch that
factors them out into kernel/cpumask.c We already have cpumask.h.


Huh?  Why would you want to do that?

For one thing, the map being discussed here, cpu_isolated_map,
is only used in sched.c, so why publish it wider?

And for another thing, we already declare externs in cpumask.h for
the other, more widely used, cpu_*_map variables cpu_possible_map,
cpu_online_map, and cpu_present_map.

Well, to address #2 and #3 isolated map will need to be exported as well.
Those other maps do not really have much to do with the scheduler code.
That's why I think either kernel/cpumask.c or kernel/cpu.c is a better place 
for them.


Other than that detail, we seem to be communicating and in agreement on
your first item, isolating CPU scheduler load balancing.  Good.

On your other two items, irq and workqueue isolation, which I had
suggested doing via cpuset sched_load_balance, I now agree that that
wasn't a good idea.

I am still a little surprised at using isolation extensions to
disable irqs on select CPUs; but others have thought far more about
irqs than I have, so I'll be quiet.

Please note that we're not talking about completely disabling IRQs. We're 
talking about
not routing them to the isolated CPUs by default. It's still possible to 
explicitly reroute an IRQ
to the isolated CPU.
Why is this needed? It is actually very easy to explain. IRQs are the major source of latency
and overhead. IRQ handlers themselves are mostly ok but they typically schedule soft irqs, work
queues and timers on the same CPU where an IRQ is handled. In other words if an isolated CPU is
receiving IRQs it's not really isolated, because it's running a whole bunch of different kernel
code (ie we're talking latencies, cache usage, etc).
Of course some folks may want to explicitly route certain IRQs to the isolated CPUs. For example
if an app depends on the network stack it may make sense to route an IRQ from the NIC to the same
CPU the app is running on.
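
Just to make that part concrete, here is roughly what it looks like from the app side
(illustrative sketch only; the IRQ number and CPU below are made-up examples, the real ones
come from /proc/interrupts, and of course this needs root):

/*
 * Sketch: an app that runs on an isolated CPU re-routing "its" NIC IRQ to
 * that same CPU at startup.  IRQ 23 and CPU 1 are placeholder values.
 */
#include <stdio.h>

static int route_irq_to_cpu(int irq, unsigned int cpu)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%x\n", 1u << cpu);	/* smp_affinity takes a hex cpu bitmask */
	return fclose(f) ? -1 : 0;
}

int main(void)
{
	return route_irq_to_cpu(23, 1) ? 1 : 0;
}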


Max


CPU isolation and workqueues [was Re: [CPUISOL] CPU isolation extensions]

2008-02-04 Thread Max Krasnyanskiy


Peter Zijlstra wrote:

On Mon, 2008-01-28 at 14:00 -0500, Steven Rostedt wrote:

On Mon, 28 Jan 2008, Max Krasnyanskiy wrote:

  [PATCH] [CPUISOL] Support for workqueue isolation

The thing about workqueues is that they should only be woken on a CPU if
something on that CPU accessed them. IOW, the workqueue on a CPU handles
work that was called by something on that CPU. Which means that
something that high prio task did triggered a workqueue to do some work.
But this can also be triggered by interrupts, so by keeping interrupts
off the CPU no workqueue should be activated.

No no no. That's what I thought too ;-). The problem is that things like NFS and
friends
expect _all_ their workqueue threads to report back when they do certain things 
like
flushing buffers and stuff. The reason I added this is because my machines were 
getting
stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even 
though no IRQs
or other things are running on it.

This sounds more like we should fix NFS than add this for all workqueues.
Again, we want workqueues to run on the behalf of whatever is running on
that CPU, including those tasks that are running on an isolcpu.


agreed, by looking at my top output (and not the nfs code) it looks like
it just spawns a configurable number of active kernel threads which are
not cpu bound by in any way. I think just removing the isolated cpus
from their runnable mask should take care of them.


Peter, Steven,

I think I convinced you guys last time but I did not have a convincing example. 
So here is some
more info on why workqueues need to be aware of isolated cpus.

Here is how a work queue gets flushed.

static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
{
	int active;

	if (cwq->thread == current) {
		/*
		 * Probably keventd trying to flush its own queue. So simply run
		 * it by hand rather than deadlocking.
		 */
		run_workqueue(cwq);
		active = 1;
	} else {
		struct wq_barrier barr;

		active = 0;
		spin_lock_irq(&cwq->lock);
		if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
			insert_wq_barrier(cwq, &barr, 1);
			active = 1;
		}
		spin_unlock_irq(&cwq->lock);

		if (active)
			wait_for_completion(&barr.done);
	}

	return active;
}

void fastcall flush_workqueue(struct workqueue_struct *wq)
{
	const cpumask_t *cpu_map = wq_cpu_map(wq);
	int cpu;

	might_sleep();
	lock_acquire(&wq->lockdep_map, 0, 0, 0, 2, _THIS_IP_);
	lock_release(&wq->lockdep_map, 1, _THIS_IP_);
	for_each_cpu_mask(cpu, *cpu_map)
		flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
}

In other words it schedules some work on each cpu and expects the workqueue thread on each of
them to run and trigger the completion. This is what I meant when I said that _all_ threads are
expected to report back even if there is nothing running on that CPU.


So my patch simply makes sure that, if workqueue isolation is enabled, workqueue threads are
not started on the CPUs that are isolated.
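
The gist of the change is something like this (a sketch of the idea, not the literal patch;
cpu_isolated() stands for the accessor exported by the first patches in the series):

/*
 * Sketch only.  The workqueue code treats "online but isolated" CPUs as if
 * they were not there when deciding where to create per-cpu worker threads,
 * so flush_workqueue() never waits on a thread that was never started.
 */
static inline int cpu_usable_for_workqueues(int cpu)
{
	return cpu_online(cpu) && !cpu_isolated(cpu);
}

ie the per-cpu thread creation and wq_cpu_map() logic checks this predicate instead of
just cpu_online().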

Max


Re: [CPUISOL] CPU isolation extensions

2008-01-31 Thread Max Krasnyanskiy

Hi Mark,

[EMAIL PROTECTED] wrote:
Following patch series extends CPU isolation support. Yes, most people want to virtualize
CPUs these days and I want to isolate them :).

The primary idea here is to be able to use some CPU cores as dedicated engines 
for running
user-space code with minimal kernel overhead/intervention, think of it as an SPE in the 
Cell processor.


We've had scheduler support for CPU isolation ever since the O(1) scheduler went in.
I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
In fact that's the primary distinction that I'm making between say "CPU sets" and
"CPU isolation". "CPU sets" let you manage user-space load while "CPU isolation" provides
a way to isolate a CPU as much as possible (including kernel activities).

I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to
achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf
multi-processor/core systems under extreme system load. I'm working with legal folks on releasing a
hard RT user-space framework for that.

I can also see other applications, like simulators and stuff, that can benefit
from this.

I've been maintaining this stuff since around 2.6.18 and it's been running in 
production
environment for a couple of years now. It's been tested on all kinds of 
machines, from NUMA
boxes like HP xw9300/9400 to tiny uTCA boards like Mercury AXA110.
The messiest part used to be SLAB garbage collector changes. With the new SLUB all that mess 
goes away (ie no changes necessary). Also CFS seems to handle CPU hotplug much better than O(1) 
did (ie domains are recomputed dynamically) so that isolation can be done at any time (via sysfs). 
So this seems like a good time to merge. 


Anyway. The patchset consists of 5 patches. The first three are very simple and
non-controversial.
They simply make "CPU isolation" a configurable feature, export
cpu_isolated_map and provide
some helper functions to access it (just like cpu_online() and friends).
The last two patches add support for isolating CPUs from running workqueues and stop
machine.
More details in the individual patch descriptions.

Ideally I'd like all of this to go in during this merge window. If people think it's acceptable 
Linus or Andrew (or whoever is more appropriate Ingo maybe) can pull this patch set from

git://git.kernel.org/pub/scm/linux/kernel/git/maxk/cpuisol-2.6.git



It's good to hear from someone else that thinks a multi-processor
box _should_ be able to run a CPU intensive (100%) RT app on one of the
processors without adversely affecting or being affected by the others.
I have had issues that were _traced_ back to the fact that I am doing
just that. All I got was, you can't do that or we don't support that
kind of thing in the Linux kernel.

One example: Andrew Morton's feedback to the LKML thread "floppy.c soft lockup"

Good luck with this. I hope this gets someone's attention.

Thanks for the support. I do the best I can because just like you I believe
that it's a perfectly valid workload and there are a lot of interesting applications that
will benefit
from mainline support.


BTW, I have tried your patches against a vanilla 2.6.24 kernel but am
not successful.

# echo '1' > /sys/devices/system/cpu/cpu1/isolated
bash: echo: write error: Device or resource busy

You have to bring it offline first.
In other words:
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/isolated
echo 1 > /sys/devices/system/cpu/cpu1/online


The cpuisol=1 cmdline option yields:

harley:# cat /sys/devices/system/cpu/cpu1/isolated
0

harley:# cat /proc/cmdline
root=/dev/sda3 vga=normal apm=off selinux=0 noresume splash=silent
kmalloc=192M cpuisol=1

Sorry, my bad. I had a typo in the patch description; the option is "isolcpus=N".
We've had that option for a while now. I mean it's not even part of my patch.
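
For reference, that boot option boils down to something like this in kernel/sched.c
(quoting roughly from memory, so take the details with a grain of salt):

/* Roughly how "isolcpus=1,2,..." is parsed into cpu_isolated_map at boot */
static int __init isolated_cpu_setup(char *str)
{
	int ints[NR_CPUS], i;

	str = get_options(str, ARRAY_SIZE(ints), ints);
	cpus_clear(cpu_isolated_map);
	for (i = 1; i <= ints[0]; i++)
		if (ints[i] < NR_CPUS)
			cpu_set(ints[i], cpu_isolated_map);
	return 1;
}
__setup("isolcpus=", isolated_cpu_setup);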

Thanx
Max


Integrating cpusets and cpu isolation [was Re: [CPUISOL] CPU isolation extensions]

2008-01-31 Thread Max Krasnyanskiy

Paul Jackson wrote:

Max wrote:

So far it seems that extending cpu_isolated_map
is more natural way of propagating this notion to the rest of the kernel.
Since it's very similar to the cpu_online_map concept and it's easy to 
integrated
with the code that already uses it. 


If it were just realtime support, then I suspect I'd agree that
extending cpu_isolated_map makes more sense.

But some people use realtime on systems that are also heavily
managed using cpusets.  The two have to work together.  I have
customers with systems running realtime on a few CPUs, at the
same time that they have a large batch scheduler (which is layered
on top of cpusets) managing jobs on a few hundred other CPUs.
Hence with the cpuset 'sched_load_balance' flag I think I've already
done what I think is one part of what your patches achieve by extending
the cpu_isolated_map.

This is a common situation with "resource management" mechanisms such
as cpusets (and more recently cgroups and the subsystem modules it
supports.)  They cut across existing core kernel code that manages such
key resources as CPUs and memory.  As best we can, they have to work
with each other.


Hi Paul,

I thought some more about your proposal to use the sched_load_balance flag in cpusets instead
of extending cpu_isolated_map. I looked at the cpusets, cgroups, and the latest thread started by
Peter (about sched domains and stuff) and here are my thoughts on this.

Here is the list of issues with the sched_load_balance flag from the CPU isolation
perspective:

--
(1) Boot time isolation is not possible. There is currently no way to set up a cpuset at
boot time. For example we won't be able to isolate cpus from irqs and workqueues at boot.
Not a major issue but still an inconvenience.

--
(2) There is currently no easy way to figure out which cpuset a cpu belongs to in order
to query its sched_load_balance flag. In order to do that we need a method that iterates over
all active cpusets and checks their cpus_allowed masks. This implies holding cgroup and
cpuset mutexes. It's not clear whether it's ok to do that from the contexts CPU
isolation happens in (apic, sched, workqueue). It seems that the cgroup/cpuset api is designed
for top-down access, ie adding a cpu to a set and then recomputing domains.
Which makes
perfect sense for the common cpuset usecase but is not what cpu isolation needs.
In other words I think it's much simpler and cleaner to use the
cpu_isolated_map for isolation
purposes.

--
(3) cpusets are a bit too dynamic :). What I mean by this is that the
sched_load_balance flag
can be changed at any time without bringing a CPU offline. What that means is
that we'll
need some notifier mechanism for killing and restarting workqueue threads when that flag
changes. Also we'd need some logic that makes sure that a user does not disable load balancing
on all cpus, because that would effectively kill workqueues on all the cpus.

This particular case is already handled very nicely in my patches. The isolated bit
can be set
only when a cpu is offline and it cannot be set on the first online cpu.
Workqueues and other
subsystems already handle cpu hotplug events nicely and can easily ignore
isolated cpus when
they come online.

-

#1 is probably unfixable. #2 and #3 can be fixed but at the expense of extra 
complexity across
the board. I seriously doubt that I'll be able to push that through the reviews ;-). 

Also personally I still think cpusets and cpu isolation attack two different problems. cpusets
is about partitioning cpus and memory nodes, and managing tasks. Most of the cgroups/cpuset APIs
are designed to deal with tasks. CPU isolation is much simpler and is at the lower layer. It deals
with IRQs, kernel per cpu threads, etc. The only intersection I see is that both features affect
scheduling domains (cpu isolation is again simple here: it just puts cpus into null domains, and
that's existing logic in sched.c, nothing new here).
So here are some proposals on how we can make them play nicely with each other.


--
(A) Make cpusets aware of isolated cpus.
All we have to do here is to change
	guarantee_online_cpus()
	common_cpu_mem_hotplug_unplug()
to exclude cpu_isolated_map from cpu_online_map before using it.
And we'd need to change
	update_cpumasks()
to simply ignore isolated cpus.

That way if a cpu is isolated it'll be ignored by the cpusets logic, which I
believe would be the correct behavior.
We're talking about a trivial ~5 line patch which will be a noop if cpu isolation is disabled.
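
To make (A) concrete, the mask arithmetic is basically this (sketch only, assuming
cpu_isolated_map is exported the way my patches do it):

/*
 * Sketch for (A): the mask cpusets would look at instead of cpu_online_map.
 * guarantee_online_cpus() / common_cpu_mem_hotplug_unplug() would call this
 * rather than reading cpu_online_map directly.
 */
static inline void cpuset_usable_cpus(cpumask_t *mask)
{
	cpus_andnot(*mask, cpu_online_map, cpu_isolated_map);
}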


(B) Ignore the isolated map in cpusets. That's the current state of affairs with my
patches applied.
Looks like your customers are happy with what they have now, so they will probably not enable
cpu isolation anyway :).


(C) Introduce cpu_usable_map. That map will be recomputed on hotplug events.
Essentially it'd be
cpu_online_map AND ~cpu_isolated_map. Convert things like cpusets to use that map instead of
the online map.
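
A rough sketch of (C), using the normal cpu hotplug notifier machinery (again just to
illustrate the shape of it, the names are made up):

/* Sketch for (C): a global "usable" map kept up to date on hotplug events */
cpumask_t cpu_usable_map;

static int update_cpu_usable_map(struct notifier_block *nb,
				 unsigned long action, void *hcpu)
{
	cpus_andnot(cpu_usable_map, cpu_online_map, cpu_isolated_map);
	return NOTIFY_OK;
}

static struct notifier_block cpu_usable_nb = {
	.notifier_call = update_cpu_usable_map,
};

/* registered early on with register_cpu_notifier(&cpu_usable_nb) */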


We can probably come up with other options as well.


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Daniel Walker wrote:

On Mon, 2008-01-28 at 10:32 -0800, Max Krasnyanskiy wrote:

Just these patches. RT patches cannot achieve what I needed. Even RTAI/Xenomai
can't do that.
For example I have separate tasks with hard deadlines that must be enforced in 50usec kind 
of range and basically no idle time whatsoever. Just to give more background it's a wireless

basestation with SW MAC/Scheduler. Another requirement is for the SW to know 
precise timing
because SW. For example there is no way we can do predictable 1-2 usec sleeps. 
So I wrote a user-space engine that does all this, it requires full control of the CPU ie minimal
overhead from the kernel, just IPIs for memory management and that's basically it. When my legal 
department lets me I'll do a presentation on this stuff at Linux RT conference or something. 


What kind of hardware are you doing this on? 

All kinds of HW. I mentioned it in the intro email.
Here are the highlights:
	HP XW9300 (Dual Opteron NUMA box) and XW9400 (Dual Core Opteron)
	HP DL145 G2 (Dual Opteron) and G3 (Dual Core Opteron)
	Dell Precision workstations (Core2 Duo and Quad)
	Various Core2 Duo based systems
	uTCA boards:
		Mercury AXA110 (1.5Ghz)
		Concurrent Tech AM110 (2.1Ghz)

This scheme should work on anything that lets you disable SMI on the isolated 
core(s).


Also I should note there is HRT (High resolution timers) which provided 
microsecond level
granularity ..
Not accurate enough and way too much overhead for what I need. I know at this point it probably
sounds like I'm talking BS :). I wish I'd released the engine and examples by now. Anyway let
me just say that the SW MAC has crazy tight deadlines with lots of small tasks. Using nanosleep() &
gettimeofday() is simply not practical. So it's all TSC based with clever time sync logic between
HW and SW.
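
To give a flavour of what "TSC based" means here, this is the kind of thing the engine does
(user-space sketch for x86; it assumes a constant-rate TSC, a thread pinned to a dedicated
core, and a tsc_per_usec value calibrated at startup -- the names are made up):

/* Busy-wait until an absolute TSC deadline.  Only sane on a dedicated core. */
static inline unsigned long long read_tsc(void)
{
	unsigned int lo, hi;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32) | lo;
}

static void spin_until(unsigned long long deadline)
{
	while (read_tsc() < deadline)
		__asm__ __volatile__("pause");
}

/* eg run the next MAC task 50 usec from now:
 *	spin_until(read_tsc() + 50 * tsc_per_usec);
 */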

Max


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Paul Jackson wrote:

Max wrote:

So far it seems that extending cpu_isolated_map
is more natural way of propagating this notion to the rest of the kernel.
Since it's very similar to the cpu_online_map concept and it's easy to 
integrated
with the code that already uses it. 


If it were just realtime support, then I suspect I'd agree that
extending cpu_isolated_map makes more sense.

But some people use realtime on systems that are also heavily
managed using cpusets.  The two have to work together.  I have
customers with systems running realtime on a few CPUs, at the
same time that they have a large batch scheduler (which is layered
on top of cpusets) managing jobs on a few hundred other CPUs.
Hence with the cpuset 'sched_load_balance' flag I think I've already
done what I think is one part of what your patches achieve by extending
the cpu_isolated_map.

This is a common situation with "resource management" mechanisms such
as cpusets (and more recently cgroups and the subsystem modules it
supports.)  They cut across existing core kernel code that manages such
key resources as CPUs and memory.  As best we can, they have to work
with each other.


Thanks for the info Paul. I'll definitely look into using this flag instead 
and reply with pros and cons (if any).


Max




Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Peter Zijlstra wrote:

On Mon, 2008-01-28 at 14:00 -0500, Steven Rostedt wrote:

On Mon, 28 Jan 2008, Max Krasnyanskiy wrote:

  [PATCH] [CPUISOL] Support for workqueue isolation

The thing about workqueues is that they should only be woken on a CPU if
something on that CPU accessed them. IOW, the workqueue on a CPU handles
work that was called by something on that CPU. Which means that
something that high prio task did triggered a workqueue to do some work.
But this can also be triggered by interrupts, so by keeping interrupts
off the CPU no workqueue should be activated.

No no no. That's what I thought too ;-). The problem is that things like NFS and
friends
expect _all_ their workqueue threads to report back when they do certain things 
like
flushing buffers and stuff. The reason I added this is because my machines were 
getting
stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even 
though no IRQs
or other things are running on it.

This sounds more like we should fix NFS than add this for all workqueues.
Again, we want workqueues to run on the behalf of whatever is running on
that CPU, including those tasks that are running on an isolcpu.


agreed, by looking at my top output (and not the nfs code) it looks like
it just spawns a configurable number of active kernel threads which are
not cpu bound by in any way. I think just removing the isolated cpus
from their runnable mask should take care of them.


Actually NFS was just one example. I cannot remember off the top of my head what
else was there,
but there are definitely other users of work queues that expect all the threads
to run at
some point in time.
Also, if you think about it, the patch does _exactly_ what you propose. It removes workqueue
threads from isolated CPUs. But instead of doing it just for NFS and/or other subsystems
separately it just does it in a generic way by simply not starting those threads in the first
place.



  [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"

This I find very dangerous. We are making an assumption that tasks on an
isolated CPU wont be doing things that stopmachine requires. What stops
a task on an isolated CPU from calling something into the kernel that
stop_machine requires to halt?

I agree in general. The thing is though that stop machine just kills any kind of latency
guarantees. Without the patch the machine just hangs waiting for the stop-machine to run
when a module is inserted/removed. And running without dynamic module loading is not very
practical on general purpose machines. So I'd rather have an option with a big red warning
than no option at all :).

Well, that's something one of the greater powers (Linus, Andrew, Ingo)
must decide. ;-)


I'm in favour of better engineered method, that is, we really should try
to solve these problems in a proper way. Hacks like this might be fine
for custom kernels, but I think we should have a higher standard when it
comes to upstream - we all have to live many years with whatever we put
in there, we'd better think well about it.


100% agree. That's why I mentioned that this patch set is controversial in the first place.
Right now, short of rewriting module loading to not use stop machine, there is no other
option. I'll think some more about it. If you guys have other ideas please drop me a note.


Thanx
Max


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Peter Zijlstra wrote:

On Mon, 2008-01-28 at 11:34 -0500, Steven Rostedt wrote:

On Mon, Jan 28, 2008 at 08:59:10AM -0600, Paul Jackson wrote:

Thanks for the CC, Peter.

Thanks from me too.


Max wrote:
We've had scheduler support for CPU isolation ever since the O(1) scheduler went in.
I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.

I recently added the per-cpuset flag 'sched_load_balance' for some
other realtime folks, so that they can disable the kernel scheduler
load balancing on isolated CPUs.  It essentially allows for dynamic
control of which CPUs are isolated by the scheduler, using the cpuset
hierarchy, rather than enhancing the 'isolated_cpus' mask.   That
'isolated_cpus' mask remained a minimal kernel boottime parameter.
I believe this went to Linus's tree about Oct 2007.

It looks like you have three additional tweaks for realtime in this
patch set, with your patches:

  [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot

I didn't know we still routed IRQs to isolated CPUs. I guess I need to
look deeper into the code on this one. But I agree that isolated CPUs
should not have IRQs routed to them.


While I agree with this in principle, I'm not sure flat out denying all
IRQs to these cpus is a good option. What about the case where we want
to service just this one specific IRQ on this CPU and no others?

Can't this be done by userspace irq routing as used by irqbalanced?

Peter, I think you missed the point of this patch. It's just a convenience 
feature.
It simply excludes isolated CPUs from IRQ smp affinity masks. That's all. What 
did you
mean by "flat out denying all IRQs to these cpus" ? IRQs can still be routed to them 
by writing to /proc/irq/N/smp_affinity.


Also, this happens naturally when we bring a CPU off-line and then bring it
back online.
ie when a CPU comes back online it's excluded from the IRQ smp_affinity masks
even without
my patch.


  [PATCH] [CPUISOL] Support for workqueue isolation

The thing about workqueues is that they should only be woken on a CPU if
something on that CPU accessed them. IOW, the workqueue on a CPU handles
work that was called by something on that CPU. Which means that
something that high prio task did triggered a workqueue to do some work.
But this can also be triggered by interrupts, so by keeping interrupts
off the CPU no workqueue should be activated.


Quite so, if nobody uses it, there is no harm in having them around. If
they are used, its by someone already allowed on the cpu.


No no no. I just replied to Steven about that. The problem is that things like NFS and 
friends expect _all_ their workqueue threads to report back when they do certain things 
like flushing buffers and stuff. The reason I added this is because my machines were 
getting stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even though 
no IRQs, softirqs or other things are running on it.



  [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"

This I find very dangerous. We are making an assumption that tasks on an
isolated CPU wont be doing things that stopmachine requires. What stops
a task on an isolated CPU from calling something into the kernel that
stop_machine requires to halt?


Very dangerous indeed!

Please see my reply to Steven. I agree it's somewhat dangerous. What we could do is make it
configurable with a big fat warning. In other words I'd rather have an option than just say
"do not use dynamic module loading" on those systems.

Max


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Steven Rostedt wrote:

On Mon, Jan 28, 2008 at 08:59:10AM -0600, Paul Jackson wrote:

Thanks for the CC, Peter.


Thanks from me too.


Max wrote:
We've had scheduler support for CPU isolation ever since the O(1) scheduler went in.
I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.

I recently added the per-cpuset flag 'sched_load_balance' for some
other realtime folks, so that they can disable the kernel scheduler
load balancing on isolated CPUs.  It essentially allows for dynamic
control of which CPUs are isolated by the scheduler, using the cpuset
hierarchy, rather than enhancing the 'isolated_cpus' mask.   That
'isolated_cpus' mask remained a minimal kernel boottime parameter.
I believe this went to Linus's tree about Oct 2007.

It looks like you have three additional tweaks for realtime in this
patch set, with your patches:

  [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot


I didn't know we still routed IRQs to isolated CPUs. I guess I need to
look deeper into the code on this one. But I agree that isolated CPUs
should not have IRQs routed to them.

Also note that it's just a convenience feature. In other words it's not that
with this patch
we'll never route IRQs to those CPUs. They can still be explicitly routed by writing to
/proc/irq/N/smp_affinity.



  [PATCH] [CPUISOL] Support for workqueue isolation


The thing about workqueues is that they should only be woken on a CPU if
something on that CPU accessed them. IOW, the workqueue on a CPU handles
work that was called by something on that CPU. Which means that
something that high prio task did triggered a workqueue to do some work.
But this can also be triggered by interrupts, so by keeping interrupts
off the CPU no workqueue should be activated.

No no no. That's what I thought too ;-). The problem is that things like NFS and
friends
expect _all_ their workqueue threads to report back when they do certain things 
like
flushing buffers and stuff. The reason I added this is because my machines were 
getting
stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even 
though no IRQs
or other things are running on it.


  [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"


This I find very dangerous. We are making an assumption that tasks on an
isolated CPU wont be doing things that stopmachine requires. What stops
a task on an isolated CPU from calling something into the kernel that
stop_machine requires to halt?
I agree in general. The thing is though that stop machine just kills any kind of latency
guarantees. Without the patch the machine just hangs waiting for the stop-machine to run
when a module is inserted/removed. And running without dynamic module loading is not very
practical on general purpose machines. So I'd rather have an option with a big red warning
than no option at all :).


Thanx
Max


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Paul Jackson wrote:

Thanks for the CC, Peter.

  Ingo - see question at end of message.

Max wrote:
We've had scheduler support for CPU isolation ever since the O(1) scheduler went in.
I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.


I recently added the per-cpuset flag 'sched_load_balance' for some
other realtime folks, so that they can disable the kernel scheduler
load balancing on isolated CPUs.  It essentially allows for dynamic
control of which CPUs are isolated by the scheduler, using the cpuset
hierarchy, rather than enhancing the 'isolated_cpus' mask.   That
'isolated_cpus' mask remained a minimal kernel boottime parameter.
I believe this went to Linus's tree about Oct 2007.

It looks like you have three additional tweaks for realtime in this
patch set, with your patches:

  [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot
  [PATCH] [CPUISOL] Support for workqueue isolation
  [PATCH] [CPUISOL] Isolated CPUs should be ignored by the "stop machine"

It would be interesting to see a patchset with the above three realtime
tweaks, layered on this new cpuset 'sched_load_balance' apparatus, rather
than layered on changes to make 'isolated_cpus' more dynamic.  Some of us
run realtime and cpuset-intensive loads on the same system, so like to
have those two capabilities co-operate with each other.

I'll definitely take a look. So far it seems that extending cpu_isolated_map
is a more natural way of propagating this notion to the rest of the kernel,
since it's very similar to the cpu_online_map concept and it's easy to integrate
with the code that already uses it.
Anyway, I'll take a look at the cpuset flag that you mentioned and report back.


Thanx
Max


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Hi Peter,

Peter Zijlstra wrote:

[ You really ought to CC people :-) ]

I was not sure who, though :)
Do we have a mailing list for scheduler development btw?
Or is it just the folks that you included in CC?
Some of the latest scheduler patches break things that I'm doing and I'd like
to make
them configurable (RT watchdog, etc).


On Sun, 2008-01-27 at 20:09 -0800, [EMAIL PROTECTED] wrote:
Following patch series extends CPU isolation support. Yes, most people want to virtualize
CPUs these days and I want to isolate them :).

The primary idea here is to be able to use some CPU cores as dedicated engines 
for running
user-space code with minimal kernel overhead/intervention, think of it as an SPE in the 
Cell processor.


We've had scheduler support for CPU isolation ever since the O(1) scheduler went in.
I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
In fact that's the primary distinction that I'm making between say "CPU sets" and
"CPU isolation". "CPU sets" let you manage user-space load while "CPU isolation" provides

a way to isolate a CPU as much as possible (including kernel activities).


Ok, so you're aware of CPU sets, miss a feature, but instead of
extending it to cover your needs you build something new entirely?

It's not really new. The CPU isolation bits just have not been exported before,
that's all.
Also "CPU sets" seem to mostly deal with the scheduler domains. I'll reply to Paul's
proposal to use that instead.


I'm personally using this for hard realtime purposes. With CPU isolation it's very easy to 
achieve single digit usec worst case and around 200 nsec average response times on off-the-shelf
multi-processor/core systems under extreme system load. I'm working with legal folks on releasing
hard RT user-space framework for that.

I can also see other application like simulators and stuff that can benefit 
from this.


have you been using just this, or in combination with the -rt effort?

Just these patches. RT patches cannot achieve what I needed. Even RTAI/Xenomai
can't do that.
For example I have separate tasks with hard deadlines that must be enforced in 50usec kind 
of range and basically no idle time whatsoever. Just to give more background it's a wireless

basestation with SW MAC/Scheduler. Another requirement is for the SW to know 
precise timing
because SW. For example there is no way we can do predictable 1-2 usec sleeps. 
So I wrote a user-space engine that does all this, it requires full control of the CPU ie minimal
overhead from the kernel, just IPIs for memory management and that's basically it. When my legal 
department lets me I'll do a presentation on this stuff at Linux RT conference or something. 


Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Peter Zijlstra wrote:

On Mon, 2008-01-28 at 11:34 -0500, Steven Rostedt wrote:

On Mon, Jan 28, 2008 at 08:59:10AM -0600, Paul Jackson wrote:

Thanks for the CC, Peter.

Thanks from me too.


Max wrote:
We've had scheduler support for CPU isolation ever since O(1) scheduler went it. 
I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.

I recently added the per-cpuset flag 'sched_load_balance' for some
other realtime folks, so that they can disable the kernel scheduler
load balancing on isolated CPUs.  It essentially allows for dynamic
control of which CPUs are isolated by the scheduler, using the cpuset
hierarchy, rather than enhancing the 'isolated_cpus' mask.   That
'isolated_cpus' mask remained a minimal kernel boottime parameter.
I believe this went to Linus's tree about Oct 2007.

It looks like you have three additional tweaks for realtime in this
patch set, with your patches:

  [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot

I didn't know we still routed IRQs to isolated CPUs. I guess I need to
look deeper into the code on this one. But I agree that isolated CPUs
should not have IRQs routed to them.


While I agree with this in principle, I'm not sure flat out denying all
IRQs to these cpus is a good option. What about the case where we want
to service just this one specific IRQ on this CPU and no others?

Can't this be done by userspace irq routing as used by irqbalanced?

Peter, I think you missed the point of this patch. It's just a convenience 
feature.
It simply excludes isolated CPUs from IRQ smp affinity masks. That's all. What 
did you
mean by flat out denying all IRQs to these cpus ? IRQs can still be routed to them 
by writing to /proc/irq/N/smp_affinity.


Also, this happens naturally when we bring a CPU off-line and then bring it 
back online.
ie When CPU comes back online it's excluded from the IRQ smp_affinity masks 
even without
my patch.


  [PATCH] [CPUISOL] Support for workqueue isolation

The thing about workqueues is that they should only be woken on a CPU if
something on that CPU accessed them. IOW, the workqueue on a CPU handles
work that was called by something on that CPU. Which means that
something that high prio task did triggered a workqueue to do some work.
But this can also be triggered by interrupts, so by keeping interrupts
off the CPU no workqueue should be activated.


Quite so, if nobody uses it, there is no harm in having them around. If
they are used, its by someone already allowed on the cpu.


No no no. I just replied to Steven about that. The problem is that things like NFS and 
friends expect _all_ their workqueue threads to report back when they do certain things 
like flushing buffers and stuff. The reason I added this is because my machines were 
getting stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even though 
no IRQs, softirqs or other things are running on it.



  [PATCH] [CPUISOL] Isolated CPUs should be ignored by the stop machine

This I find very dangerous. We are making an assumption that tasks on an
isolated CPU wont be doing things that stopmachine requires. What stops
a task on an isolated CPU from calling something into the kernel that
stop_machine requires to halt?


Very dangerous indeed!

Please see my reply to Steven. I agree it's somewhat dangerous. What we could 
do is make it
configurable with a big fat warning. In other words I'd rather have an option 
than just says
do not use dynamic module loading on those systems.

Max
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Peter Zijlstra wrote:

On Mon, 2008-01-28 at 14:00 -0500, Steven Rostedt wrote:

On Mon, 28 Jan 2008, Max Krasnyanskiy wrote:

  [PATCH] [CPUISOL] Support for workqueue isolation

The thing about workqueues is that they should only be woken on a CPU if
something on that CPU accessed them. IOW, the workqueue on a CPU handles
work that was called by something on that CPU. Which means that
something that high prio task did triggered a workqueue to do some work.
But this can also be triggered by interrupts, so by keeping interrupts
off the CPU no workqueue should be activated.

No no no. That's what I though too ;-). The problem is that things like NFS and 
friends
expect _all_ their workqueue threads to report back when they do certain things 
like
flushing buffers and stuff. The reason I added this is because my machines were 
getting
stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even 
though no IRQs
or other things are running on it.

This sounds more like we should fix NFS than add this for all workqueues.
Again, we want workqueues to run on the behalf of whatever is running on
that CPU, including those tasks that are running on an isolcpu.


agreed, by looking at my top output (and not the nfs code) it looks like
it just spawns a configurable number of active kernel threads which are
not cpu bound by in any way. I think just removing the isolated cpus
from their runnable mask should take care of them.


Actually NFS was just one example. I cannot remember of a top of my head what 
else was there
but there are definitely other users of work queues that expect all the threads 
to run at
some point in time.
Also if you think about it. The patch does _exactly_ what you propose. It removes workqueue 
threads from isolated CPUs. But instead of doing just for NFS and/or other subsystems 
separately it just does it in a generic way by simply not starting those threads in first 
place.



  [PATCH] [CPUISOL] Isolated CPUs should be ignored by the stop machine

This I find very dangerous. We are making an assumption that tasks on an
isolated CPU wont be doing things that stopmachine requires. What stops
a task on an isolated CPU from calling something into the kernel that
stop_machine requires to halt?

I agree in general. The thing is though that stop machine just kills any kind 
of latency
guaranties. Without the patch the machine just hangs waiting for the 
stop-machine to run
when module is inserted/removed. And running without dynamic module loading is 
not very
practical on general purpose machines. So I'd rather have an option with a big 
red warning
than no option at all :).

Well, that's something one of the greater powers (Linus, Andrew, Ingo)
must decide. ;-)


I'm in favour of better engineered method, that is, we really should try
to solve these problems in a proper way. Hacks like this might be fine
for custom kernels, but I think we should have a higher standard when it
comes to upstream - we all have to live many years with whatever we put
in there, we'd better think well about it.


100% agree. That's why I said mentioned that this patches is controversial in the first place. 
Right now those short from rewriting module loading to not use stop machine there is no other 
option. I'll think some more about it. If you guys have other ideas please drop me a note.


Thanx
Max
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Paul Jackson wrote:

Max wrote:

So far it seems that extending cpu_isolated_map
is more natural way of propagating this notion to the rest of the kernel.
Since it's very similar to the cpu_online_map concept and it's easy to 
integrated
with the code that already uses it. 


If it were just realtime support, then I suspect I'd agree that
extending cpu_isolated_map makes more sense.

But some people use realtime on systems that are also heavily
managed using cpusets.  The two have to work together.  I have
customers with systems running realtime on a few CPUs, at the
same time that they have a large batch scheduler (which is layered
on top of cpusets) managing jobs on a few hundred other CPUs.
Hence with the cpuset 'sched_load_balance' flag I think I've already
done what I think is one part of what your patches achieve by extending
the cpu_isolated_map.

This is a common situation with resource management mechanisms such
as cpusets (and more recently cgroups and the subsystem modules it
supports.)  They cut across existing core kernel code that manages such
key resources as CPUs and memory.  As best we can, they have to work
with each other.


Thanks for the info Paul. I'll definitely look into using this flag instead 
and reply with pros and cons (if any).


Max


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Daniel Walker wrote:

On Mon, 2008-01-28 at 10:32 -0800, Max Krasnyanskiy wrote:

Just this patches. RT patches cannot achieve what I needed. Even RTAI/Xenomai 
can't do that.
For example I have separate tasks with hard deadlines that must be enforced in 50usec kind 
of range and basically no idle time whatsoever. Just to give more background it's a wireless

basestation with SW MAC/Scheduler. Another requirement is for the SW to know 
precise timing
because SW. For example there is no way we can do predictable 1-2 usec sleeps. 
So I wrote a user-space engine that does all this, it requires full control of the CPU ie minimal
overhead from the kernel, just IPIs for memory management and that's basically it. When my legal 
department lets me I'll do a presentation on this stuff at Linux RT conference or something. 


What kind of hardware are you doing this on? 

All kinds of HW. I mentioned it in the intro email.
Here are the highlights
HP XW9300 (Dual Opteron NUMA box) and XW9400 (Dual Core Opteron)
HP DL145 G2 (Dual Opteron) and G3 (Dual Core Opteron)
	Dell Precision workstations (Core2 Duo and Quad) 
	Various Core2 Duo based systems uTCA boards

Mercury AXA110 (1.5Ghz)
Concurrent Tech AM110 (2.1Ghz)

This scheme should work on anything that lets you disable SMI on the isolated 
core(s).


Also I should note there is HRT (High resolution timers) which provided 
microsecond level
granularity ..
Not accurate enough and way too much overhead for what I need. I know at this point it probably 
sounds like I'm talking BS :). I wish I've released the engine and examples by now. Anyway let 
me just say that SW MAC has crazy tight deadlines with lots of small tasks. Using nanosleep()  
gettimeofday() is simply not practical. So it's all TSC based with clever time sync logic between

HW and SW.

Max
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Steven Rostedt wrote:

On Mon, Jan 28, 2008 at 08:59:10AM -0600, Paul Jackson wrote:

Thanks for the CC, Peter.


Thanks from me too.


Max wrote:
We've had scheduler support for CPU isolation ever since O(1) scheduler went it. 
I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.

I recently added the per-cpuset flag 'sched_load_balance' for some
other realtime folks, so that they can disable the kernel scheduler
load balancing on isolated CPUs.  It essentially allows for dynamic
control of which CPUs are isolated by the scheduler, using the cpuset
hierarchy, rather than enhancing the 'isolated_cpus' mask.   That
'isolated_cpus' mask remained a minimal kernel boottime parameter.
I believe this went to Linus's tree about Oct 2007.

It looks like you have three additional tweaks for realtime in this
patch set, with your patches:

  [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot


I didn't know we still routed IRQs to isolated CPUs. I guess I need to
look deeper into the code on this one. But I agree that isolated CPUs
should not have IRQs routed to them.

Also note that it's just a convenience feature. In other words it's not that 
with this patch
we'll never route IRQs to those CPUs. They can still be explicitly routed by writing to 
irq/N/smp_affitnity.



  [PATCH] [CPUISOL] Support for workqueue isolation


The thing about workqueues is that they should only be woken on a CPU if
something on that CPU accessed them. IOW, the workqueue on a CPU handles
work that was called by something on that CPU. Which means that
something that high prio task did triggered a workqueue to do some work.
But this can also be triggered by interrupts, so by keeping interrupts
off the CPU no workqueue should be activated.

No no no. That's what I though too ;-). The problem is that things like NFS and 
friends
expect _all_ their workqueue threads to report back when they do certain things 
like
flushing buffers and stuff. The reason I added this is because my machines were 
getting
stuck because CPU0 was waiting for CPU1 to run NFS work queue threads even 
though no IRQs
or other things are running on it.


  [PATCH] [CPUISOL] Isolated CPUs should be ignored by the stop machine


This I find very dangerous. We are making an assumption that tasks on an
isolated CPU won't be doing things that stop_machine requires. What stops
a task on an isolated CPU from calling something into the kernel that
stop_machine requires to halt?

I agree in general. The thing is, though, that stop_machine just kills any kind of
latency guarantees. Without the patch the machine just hangs waiting for stop_machine
to run whenever a module is inserted or removed. And running without dynamic module
loading is not very practical on general-purpose machines. So I'd rather have an
option with a big red warning than no option at all :).


Thanx
Max


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Paul Jackson wrote:

Thanks for the CC, Peter.

  Ingo - see question at end of message.

Max wrote:
We've had scheduler support for CPU isolation ever since the O(1) scheduler went in.
I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.


I recently added the per-cpuset flag 'sched_load_balance' for some
other realtime folks, so that they can disable the kernel scheduler
load balancing on isolated CPUs.  It essentially allows for dynamic
control of which CPUs are isolated by the scheduler, using the cpuset
hierarchy, rather than enhancing the 'isolated_cpus' mask.   That
'isolated_cpus' mask remained a minimal kernel boottime parameter.
I believe this went to Linus's tree about Oct 2007.

It looks like you have three additional tweaks for realtime in this
patch set, with your patches:

  [PATCH] [CPUISOL] Do not route IRQs to the CPUs isolated at boot
  [PATCH] [CPUISOL] Support for workqueue isolation
  [PATCH] [CPUISOL] Isolated CPUs should be ignored by the stop machine

It would be interesting to see a patchset with the above three realtime
tweaks, layered on this new cpuset 'sched_load_balance' apparatus, rather
than layered on changes to make 'isolated_cpus' more dynamic.  Some of us
run realtime and cpuset-intensive loads on the same system, so we'd like to
have those two capabilities co-operate with each other.

I'll definitely take a look. So far it seems that extending cpu_isolated_map
is a more natural way of propagating this notion to the rest of the kernel,
since it's very similar to the cpu_online_map concept and is easy to integrate
with the code that already uses it.
Anyway, I'll take a look at the cpuset flag you mentioned and report back.
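
For reference, here is my current understanding of how the cpuset route would
look (a hedged sketch only; it assumes the cpuset filesystem is mounted at
/dev/cpuset, that the file names match current mainline, and that the root
cpuset's sched_load_balance also has to be cleared so that the isolated CPUs
drop out of the scheduler's balancing domains):

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        /* Stop the root cpuset from balancing across all CPUs ... */
        write_str("/dev/cpuset/sched_load_balance", "0");

        /* ... and give the RT work its own unbalanced cpuset on CPUs 1-3. */
        mkdir("/dev/cpuset/rt", 0755);
        write_str("/dev/cpuset/rt/cpus", "1-3");
        write_str("/dev/cpuset/rt/mems", "0");
        write_str("/dev/cpuset/rt/sched_load_balance", "0");

        /* Tasks are then attached by writing their PIDs to
         * /dev/cpuset/rt/tasks. */
        return 0;
}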


Thanx
Max


Re: [CPUISOL] CPU isolation extensions

2008-01-28 Thread Max Krasnyanskiy

Hi Peter,

Peter Zijlstra wrote:

[ You really ought to CC people :-) ]

I was not sure who, though :)
Do we have a mailing list for scheduler development, btw? Or is it just the
folks you included in CC?
Some of the latest scheduler patches break things that I'm doing, and I'd like
to make them configurable (RT watchdog, etc.).


On Sun, 2008-01-27 at 20:09 -0800, [EMAIL PROTECTED] wrote:
The following patch series extends CPU isolation support. Yes, most people want to
virtualize CPUs these days and I want to isolate them :).

The primary idea here is to be able to use some CPU cores as dedicated engines
for running user-space code with minimal kernel overhead/intervention; think of
it as an SPE in the Cell processor.


We've had scheduler support for CPU isolation ever since the O(1) scheduler went in.
I'd like to extend it further to avoid kernel activity on those CPUs as much as possible.
In fact that's the primary distinction I'm making between, say, cpusets and CPU
isolation: cpusets let you manage user-space load, while CPU isolation provides
a way to isolate a CPU as much as possible (including kernel activities).


Ok, so you're aware of CPU sets, miss a feature, but instead of
extending it to cover your needs you build something new entirely?

It's not really new. The CPU isolation bits just have not been exported before,
that's all.
Also, cpusets seem to mostly deal with the scheduler domains. I'll reply to Paul's
proposal to use that instead.
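
The parallel with cpu_online_map that I have in mind is roughly the following
(a sketch of the direction, not the actual patch; the cpu_isolated() helper
name is just illustrative):

#include <linux/cpumask.h>

/* kernel/sched.c already has the map, populated from the isolcpus= option;
 * the patches export it instead of keeping it static: */
extern cpumask_t cpu_isolated_map;

/* Illustrative accessor, mirroring cpu_online(): */
#define cpu_isolated(cpu)       cpu_isset((cpu), cpu_isolated_map)

/* Kernel-side users could then skip isolated CPUs the same way they already
 * consult cpu_online_map, e.g.: */
static int nr_housekeeping_cpus(void)
{
        int cpu, count = 0;

        for_each_online_cpu(cpu)
                if (!cpu_isolated(cpu))
                        count++;
        return count;
}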


I'm personally using this for hard-realtime purposes. With CPU isolation it's very easy
to achieve single-digit-usec worst-case and around 200 nsec average response times on
off-the-shelf multi-processor/core systems under extreme system load. I'm working with
the legal folks on releasing a hard-RT user-space framework for that.

I can also see other applications, like simulators and such, that could benefit
from this.


have you been using just this, or in combination with the -rt effort?

Just these patches. The RT patches cannot achieve what I needed; even RTAI/Xenomai
can't do that. For example, I have separate tasks with hard deadlines that must be
enforced in the 50 usec kind of range, and basically no idle time whatsoever. Just to
give more background, it's a wireless basestation with a SW MAC/scheduler. Another
requirement is for the SW to know precise timing; for example, there is no way we can
do predictable 1-2 usec sleeps. So I wrote a user-space engine that does all this. It
requires full control of the CPU, i.e. minimal overhead from the kernel: just IPIs for
memory management and that's basically it. When my legal department lets me, I'll do a
presentation on this stuff at a Linux RT conference or something.
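
The shape of it, stripped to the bare minimum, is to pin a thread onto the
isolated core and keep the whole loop syscall-free (a simplified sketch; a real
setup would typically also set an RT scheduling policy, lock memory, and so on,
and the CPU numbers are just examples):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single (isolated) CPU. */
static int pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *mac_engine(void *arg)
{
        pin_to_cpu(1);          /* CPU1 is isolated; CPU0 keeps the control plane */
        for (;;) {
                /* hard-RT processing loop: TSC-timed, no syscalls */
        }
        return NULL;
}

The control-plane thread simply does a pthread_create() for mac_engine and then
goes about its business on the non-isolated CPUs.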


Max