Re: [RFC] Add support for semaphore-like structure with support for asynchronous I/O

2005-04-05 Thread hui

On Tue, Apr 05, 2005 at 09:20:57PM -0400, Trond Myklebust wrote:
> On Tue, 05.04.2005 at 11:46 (-0400), Benjamin LaHaise wrote:
> 
> > I can see that goal, but I don't think introducing iosems is the right 
> > way to acheive it.  Instead (and I'll start tackling this), how about 
> > factoring out the existing semaphore implementations to use a common 
> > lib/semaphore.c, much like lib/rwsem.c?  The iosems can be used as a 
> > basis for the implementation, but we can avoid having to do a giant 
> > s/semaphore/iosem/g over the kernel tree.
> 
> If you're willing to take this on then you have my full support and I'd
> be happy to lend a hand.

I would expect also that some RT subgroups would be highly interested in
getting it to respect priority for reworking parts of softirq.

bill

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Priority Lists for the RT mutex

2005-04-11 Thread hui

On Mon, Apr 11, 2005 at 10:57:37AM +0200, Ingo Molnar wrote:
> 
> * Perez-Gonzalez, Inaky <[EMAIL PROTECTED]> wrote:
> 
> > Let me re-phrase then: it is a must have only on PI, to make sure you 
> > don't have a loop when doing it. Maybe is a consequence of the 
> > algorithm I chose. -However- it should be possible to disable it in 
> > cases where you are reasonably sure it won't happen (such as kernel 
> > code). In any case, AFAIR, I still did not implement it.
> 
> are there cases where userspace wants to disable deadlock-detection for 
> its own locks?

I'd disable it for userspace locks. There might be folks that want to
implement userspace drivers, but I can't imagine it being 'ok' to have
the kernel call out to userspace and have it block correctly. I would
expect them to do something else that's less drastic.

> the deadlock detector in PREEMPT_RT is pretty much specialized for 
> debugging (it does all sorts of weird locking tricks to get the first 
> deadlock out, and to really report it on the console), but it ought to 
> be possible to make it usable for userspace-controlled locks as well.

If I understand things correctly, I'd let that be an RT app issue and
the app folks should decide what is appropriate for their setup. If
they need a deadlock detector they should decide on their own protocol.
The kernel debugging issues are completely different.

That's my two cents.

bill

Re: [PATCH] Priority Lists for the RT mutex

2005-04-11 Thread hui

On Mon, Apr 11, 2005 at 03:31:41PM -0700, Perez-Gonzalez, Inaky wrote:
> If you are exposing the kernel locks to userspace to implement
> mutexes (eg POSIX mutexes), deadlock checking is a feature you want
> to have to complain with POSIX. According to some off the record
> requirements I've been given, some applications badly need it (I have 
> a hard time believing that they are so broken, but heck...).

I'd like to read about those requirements, but, IMO, the value of the
various priority protocols varies greatly with the context and size (N
threads) of the application using them. If user/kernel space have to be
coupled via some thread of execution, (IMO) then it's better to separate
them with some event notification queues like signals (wake a thread
via a queue event) than to mix locks across the user/kernel space
boundary. There's tons of abuse that can be opened up with various
priority protocols with regard to RT apps and giving it a first class
entry way without consideration is kind of scary.

It's important to outline the requirements of the applications and then
see what you can do using minimal synchronization objects before
exploding that divide.

Also, POSIX isn't always politically neutral nor complete regarding
various things. You have to consider the context of these things.
I'll have to think about this a bit more and review your patch more
carefully.

I'm all ears if you think I'm wrong.

bill

Re: [PATCH] Priority Lists for the RT mutex

2005-04-11 Thread hui

On Mon, Apr 11, 2005 at 04:28:25PM -0700, Perez-Gonzalez, Inaky wrote:
> >From: Bill Huey (hui) [mailto:[EMAIL PROTECTED]
...
> API than once upon a time was made multithreaded by just adding
> a bunch of pthread_mutex_[un]lock() at the API entry point...
> without realizing that some of the top level API calls also 
> called other top level API calls, so they'd deadlock.

Oh crap.

> Quick fix: the usual. Enable deadlock detection and if it
> returns deadlock, assume it is locked already and proceed (or
> do a recursive mutex, or a trylock).

You have to be joking me ? geez.
... 
> It is certainly something to explore, but I'd better drive your
> way than do it. It's cleaner. Hides implementation details.
>
> I agree, but it doesn't work that well when talking about legacy 
> systems...that's the problem.

Yeah, ok, I understand what's going on now. There isn't a notion
of projecting priority across into the Unix/Linux kernel traditionally
which is why it seemed so bizarre.

> Sure--and because most was for legacy reasons that adhered to 
> POSIX strictly, it was very simple: we need POSIX this, that and
> that (PI, proper adherence to scheduler policy wake up/rt-behaviour,
> deadlock detection, etc). 

Some of this stuff sounds like recursive locking. Would this be a
better expression to solve the "top level API locking" problem
you're referring to ?

> Fortunately in those areas POSIX is not too gray; code to the book.
> Deal. 

I would think that there will have to be a graph discontinuity
between user/kernel spaces at kernel entry and exit for the deadlock
detector. Can't say about issues at fork time, but I would expect
that those objects would have to be destroyed when the process exits.

The current RT (Ingo's) lock isn't recursive, nor was the deadlock
detector the last time I looked. Do you think that this is a problem
for legacy apps if it gets overloaded to be the userspace futex
as well ? (assuming I'm understanding all of this correctly)

> Of course, selling it to the lkml is another story.

I would think that pushing as much of this into userspace would
make the kernel hooks for it more acceptable. Don't know.

/me thinks more

bill

Re: FUSYN and RT

2005-04-15 Thread hui

On Wed, Apr 13, 2005 at 11:46:40AM -0400, Steven Rostedt wrote:
> On Tue, 2005-04-12 at 17:27 -0700, Daniel Walker wrote:
> > There is a great big snag in my assumptions. It's possible for a process
> > to hold a fusyn lock, then block on an RT lock. In that situation you
> > could have a high priority user space process be scheduled then block on
> > the same fusyn lock but the PI wouldn't be fully transitive , plus there
> > will be problems when the RT mutex tries to restore the priority. 
> > 
> > We could add simple hooks to force the RT mutex to fix up it's PI, but
> > it's not a long term solution.

Ok, I've been thinking about these issues and I believe there are a number
of misunderstandings here. The user and kernel space mutexes need to be
completely different implementations. I'll have more on this later.

First of all, priority transitivity should be discontinuous at the user/kernel
space boundary, but be propagated by the scheduler, via an API or hook,
upon a general priority boost to the thread in question.

You have thread A blocked in the kernel, holding onto userspace mutex 1a
and kernel mutex 2a. Thread A is priority boosted by a higher priority
thread B trying to acquire mutex 1a. The transitivity operation propagates
through the rest of the lock graph in userspace, via depth first search,
as usual. When it hits the last userspace mutex in question, this portion
of the propagation activity stops. Next, the scheduler itself finds out
that thread A has had its priority altered because of a common priority
change API and starts another priority propagation operation in kernel
space to mutex 1b. There you have it. It's complete from user to kernel
space using a scheduler event/hook/api to propagate priority changes
into the kernel.
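
A rough user-space sketch of that two-stage propagation, just to make the
idea concrete. Everything in it is hypothetical (struct task, struct lock,
sched_set_prio(), boost_chain()); it is not the RT mutex or fusyn code, it
assumes the lock graphs are acyclic, and it only ever raises priorities.

```c
#include <assert.h>
#include <stddef.h>

enum domain { USER_SPACE, KERNEL_SPACE };

struct task;

struct lock {
    enum domain  domain;
    struct task *owner;           /* task currently holding the lock */
};

struct task {
    int          prio;            /* larger value = higher priority */
    struct lock *blocked_user;    /* userspace lock this task waits on, or NULL */
    struct lock *blocked_kernel;  /* kernel lock this task waits on, or NULL */
};

void boost_chain(struct lock *l, int prio, enum domain d);

/* Common priority-change entry point. The call at the end plays the role
 * of the scheduler hook: it is the only place where a boost crosses from
 * the userspace lock graph into the kernel lock graph. */
void sched_set_prio(struct task *t, int prio)
{
    if (prio <= t->prio)
        return;
    t->prio = prio;
    boost_chain(t->blocked_kernel, prio, KERNEL_SPACE);
}

/* Walk one domain's chain: lock -> owner -> lock the owner waits on -> ...
 * The walk never follows a pointer into the other domain. */
void boost_chain(struct lock *l, int prio, enum domain d)
{
    while (l != NULL && l->owner != NULL) {
        struct task *owner = l->owner;

        assert(l->domain == d);
        if (d == USER_SPACE) {
            sched_set_prio(owner, prio);   /* goes through the hook */
            l = owner->blocked_user;
        } else {
            if (owner->prio < prio)
                owner->prio = prio;
            l = owner->blocked_kernel;
        }
    }
}
```

With B blocking on a userspace lock owned by A, and A blocked on a kernel
lock owned by some third task, boost_chain(b.blocked_user, b.prio,
USER_SPACE) raises A through the userspace walk and the third task through
the hook, without the userspace walk ever touching a kernel lock.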

With all of that in place, you do a couple of things for the mutex
implementation. First, you convert as much of the current RT mutex
code to be type polymorphic as you can:

1) You use Daniel Walker's PI list handling for wait queue insertion for
   both mutex implementations. This is done since it's already a library
   and is already generic.

2) Then you generalize the deadlock detection code so that things like
   "what to do in a deadlock case" are determined at the instantiation of
   the code. You might have to use C preprocessor macros to do a generic
   implementation and then fill in the parametric values for creating a
   usable instance.

3) Make the grab owner code generic.

4) ...more parts of the RT mutex...
   etc...
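
For point 2, a sketch of what "decide the deadlock action at instantiation"
could look like with a C preprocessor macro. The names (struct dl_task,
struct dl_lock, DEFINE_DEADLOCK_CHECK and the two generated functions) are
made up for illustration and are not taken from the RT mutex code. The walk
assumes the existing wait graph has no cycle that excludes the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct dl_task;
struct dl_lock { struct dl_task *owner; };       /* NULL when unowned */
struct dl_task { struct dl_lock *waiting_on; };  /* NULL when runnable */

/* Generate a detector called `name`. `on_deadlock` is the expression
 * evaluated and returned when `self` would end up waiting on itself;
 * it is chosen where the macro is instantiated, not in the walk. */
#define DEFINE_DEADLOCK_CHECK(name, on_deadlock)                    \
int name(struct dl_task *self, struct dl_lock *wanted)              \
{                                                                   \
    struct dl_lock *l = wanted;                                     \
                                                                    \
    assert(self != NULL);                                           \
    while (l != NULL && l->owner != NULL) {                         \
        if (l->owner == self)                                       \
            return (on_deadlock);                                   \
        l = l->owner->waiting_on;                                   \
    }                                                               \
    return 0;                                                       \
}

/* Kernel-style instance: count a report (a stand-in for a console dump). */
int kernel_deadlock_reports;
DEFINE_DEADLOCK_CHECK(kernel_deadlock_check, (kernel_deadlock_reports++, 1))

/* Userspace-style instance: hand an error code back to the caller. */
DEFINE_DEADLOCK_CHECK(user_deadlock_check, -EDEADLK)
```

Both instances share the same graph walk; only the policy differs, which
is the kind of reuse I mean.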

> How hard would it be to use the RT mutex PI for the priority inheritance
> for fusyn?  I only work with the RT mutex now and haven't looked at the
> fusyn.  Maybe Ingo can make a separate PI system with its own API that
> both the fusyn and RT mutex can use. This way the fusyn locks can still
> be separate from the RT mutex locks but still work together. 

I'd apply these implementation ideas across both mutexes, but keep the
individual mutexes' functionality distinct. I look at this problem from
more of a reusability perspective than anything else.

> Basically can the fusyn work with the rt_mutex_waiter?  That's what I
> would pull into its own subsystem.  Have another structure that would
> reside in both the fusyn and RT mutex that would take over for the
> current rt_mutex that is used in pi_setprio and task_blocks_on_lock in
> rt.c.  So if both locks used the same PI system, then this should all be
> cleared up. 

Same thing...

There will be problems trying to implement a POSIX read/write lock using
this method, and the core RT mutex might have to be fundamentally altered
to handle recursion of some sort, decomposed into smaller bits and
recomposed into something else.

bill

Re: [RFC] RCU and CONFIG_PREEMPT_RT progress, part 3

2005-07-13 Thread hui

On Wed, Jul 13, 2005 at 11:48:01AM -0700, Paul E. McKenney wrote:
> 1.Is use of spin_trylock() and spin_unlock() in hardirq code
>   (e.g., rcu_check_callbacks() and callees) a Bad Thing?
>   Seems to result in boot-time hangs when I try it, and switching
>   to _raw_spin_trylock() and _raw_spin_unlock() seems to work
>   better.  But I don't see why the other primitives hang --
>   after all, you can call wakeup functions in irq context in
>   stock kernels...

The implementation of "printk" does funky stuff like this so I'm assuming it's
sort of acceptable.

Some of those functions bypass latency tracing and preemption violation checks.
I don't see a reason why you should be touching those functions unless you're
going to modify the implementation of spinlocks directly. Just use
spinlock_t/raw_spinlock_t to take advantage of the type parametrics in Ingo's
spinlock code to determine which lock you're using and you should be fine.
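
To illustrate what selecting the operation from the lock's type means, here
is a stand-alone sketch that uses C11 _Generic as a stand-in. The types and
functions (demo_raw_lock_t, demo_rt_lock_t, demo_lock()) are invented for
the example; this is not the macro machinery in the -RT patch, and the
"locks" are single-threaded toys.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int locked; } demo_raw_lock_t;  /* stands in for a spinning lock */
typedef struct { int locked; } demo_rt_lock_t;   /* stands in for a blocking lock */

enum demo_path { DEMO_RAW_PATH, DEMO_RT_PATH };

/* Each implementation reports which path was taken so the dispatch is visible. */
enum demo_path demo_raw_lock(demo_raw_lock_t *l)
{
    assert(l != NULL && !l->locked);
    l->locked = 1;
    return DEMO_RAW_PATH;
}

enum demo_path demo_rt_lock(demo_rt_lock_t *l)
{
    assert(l != NULL && !l->locked);
    l->locked = 1;
    return DEMO_RT_PATH;
}

/* One name at the call site; the compiler picks the implementation
 * from the static type of the argument. */
#define demo_lock(l) _Generic((l),              \
        demo_raw_lock_t *: demo_raw_lock,       \
        demo_rt_lock_t *:  demo_rt_lock)(l)
```

The caller writes demo_lock(&some_lock) either way, and changing the
declared type of some_lock is enough to switch implementations.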

bill

Re: [RFC] RCU and CONFIG_PREEMPT_RT progress, part 3

2005-07-13 Thread hui

On Wed, Jul 13, 2005 at 03:06:38PM -0400, Steven Rostedt wrote:
> > 3.  Since SPIN_LOCK_UNLOCKED now takes the lock itself as an
> > argument, what is the best way to initialize per-CPU
> > locks?  An explicit initialization function, or is there
> > some way that I am missing to make an initializer?
> 
> Ouch, I just notice that (been using an older version for some time). 
> 
> Ingo, is this to force the initialization of the lists instead of at
> runtime?

ANSI C99 is missing a concept of "self" during auto-initialization. The
explicit passing of the lvalue is needed so that it can be propagated
downward to other macros in the initialization structure. list_head
initialization is one of those things, if I remember correctly.
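
A small sketch of why the initializer needs the lock's own name. The
structure and macro names (struct demo_lock, DEMO_LOCK_UNLOCKED(),
demo_lock_init()) are hypothetical and only mimic the shape of the problem:
an empty circular list has to point at itself, so the initializer must be
told which object it is initializing.

```c
#include <assert.h>
#include <stddef.h>

struct list_node { struct list_node *next, *prev; };

struct demo_lock {
    int              locked;
    struct list_node wait_list;   /* an empty list points at itself */
};

/* The macro takes the variable's name because nothing inside a C
 * initializer can otherwise refer to the object being initialized. */
#define DEMO_LOCK_UNLOCKED(name) \
    { .locked = 0, .wait_list = { &(name).wait_list, &(name).wait_list } }

struct demo_lock global_lock = DEMO_LOCK_UNLOCKED(global_lock);

/* Run-time equivalent, for objects that have no convenient name at
 * build time (array elements, per-CPU instances, allocations). */
void demo_lock_init(struct demo_lock *l)
{
    l->locked = 0;
    l->wait_list.next = &l->wait_list;
    l->wait_list.prev = &l->wait_list;
}

int demo_list_empty(const struct list_node *n)
{
    assert(n->next != NULL);
    return n->next == n;
}
```

For per-CPU locks, looping over the instances and calling something like
demo_lock_init() on each is the straightforward route in this sketch.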

bill

Re: RT and XFS

2005-07-18 Thread hui

On Mon, Jul 18, 2005 at 02:10:31PM +0200, Esben Nielsen wrote:
> Unfortunately, one of the goals of the preempt-rt branch is to avoid
> altering too much code. Therefore the type semaphore can't be removed
> there. Therefore the name still lingers ... :-(

This is where you failed. You assumed that the person making the comment,
Christopher, didn't have his head up his ass in the first place and was
open to your end of the discussion.

bill

Re: RT and XFS

2005-07-18 Thread hui

On Fri, Jul 15, 2005 at 09:16:55AM -0700, Daniel Walker wrote:
> I don't agree with that. But of course I'm always speaking from a real
> time perspective . PI is expensive , but it won't always be. However, no
> one is forcing PI on anyone, even if I think it's good ..

It depends on what kind of PI and under what circumstances. In the general
kernel, it's really to be avoided at all costs since it's masking a general
contention problem at those places. In a formally provable worst case system
using priority ceiling emulation and stuff, PI is really valuable. How a system
like the Linux kernel fits into that is a totally different story. General
purpose kernels using general purpose facilities don't.

That's how I see it.

bill

Re: FUSYN and RT

2005-04-17 Thread hui

On Fri, Apr 15, 2005 at 04:37:05PM -0700, Inaky Perez-Gonzalez wrote:
> By following your method, the pi engine becomes unnecesarily complex;
> you have actually two engines following two different propagation
> chains (one kernel, one user). If your mutexes/locks/whatever are the
> same with a different cover, then you can simplify the whole
> implementation by leaps.

The main comment that I'm making here (so it doesn't get lost) is that,
IMO, you're going to find that there is a mismatch between the requirements
of POSIX threading and kernel uses. To drop the kernel mutex in 1:1 to
back a futex-ish entity is going to be problematic mainly because of how
kernel specific the RT mutex is (or any future kernel mutex) for debugging,
etc... and I think this is going to be clear as it gets progressively
implemented.

I think folks really need to think about this clearly before moving into
any direction prematurely. That's what I'm saying. PI is one of those
issues, but ultimately it's the fundamental differences between userspace
and kernel work.

LynxOS (similar threading system) keeps priority calculations of this kind
separate between user and kernel space. I'll have to ask one of our
engineers here why again that's the case, but I suspect it's for the
reasons I've discussed previously.

bill

Re: PREEMPT_RT and I-PIPE: the numbers, part 4

2005-07-09 Thread hui

On Sat, Jul 09, 2005 at 10:22:07AM -0700, Daniel Walker wrote:
> PREEMPT_RT is not pre-tuned for every situation , but the bests
> performance is achieved when the system is tuned. If any of these tests
> rely on a low priority thread, then we just raise the priority and you
> have better performance.

Just think about it. Throttling those threads via the scheduler throttles 
the system in super controllable ways. This is very cool stuff. :)

bill

Re: Real-Time Preemption and RCU

2005-03-18 Thread hui

On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> 5. Scalability -and- Realtime Response.
...

>   void
>   rcu_read_lock(void)
>   {
>           preempt_disable();
>           if (current->rcu_read_lock_nesting++ == 0) {
>                   current->rcu_read_lock_ptr =
>                           &__get_cpu_var(rcu_data).lock;
>                   read_lock(current->rcu_read_lock_ptr);
>           }
>           preempt_enable();
>   }

Ok, here's a rather unsure question...

Uh, is that a sleep violation if that is exclusively held since it
can block within an atomic critical section (deadlock) ?

bill


Re: Real-Time Preemption and RCU

2005-03-18 Thread hui

On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > 5. Scalability -and- Realtime Response.
> ...
> 
> >   void
> >   rcu_read_lock(void)
> >   {
> >           preempt_disable();
> >           if (current->rcu_read_lock_nesting++ == 0) {
> >                   current->rcu_read_lock_ptr =
> >                           &__get_cpu_var(rcu_data).lock;
> >                   read_lock(current->rcu_read_lock_ptr);
> >           }
> >           preempt_enable();
> >   }
> 
> Ok, here's a rather unsure question...
> 
> Uh, is that a sleep violation if that is exclusively held since it
> can block within an atomic critical section (deadlock) ?

I'd like to note another problem. Mingo's current implementation of rt_mutex
(super mutex for all blocking synchronization) is still missing reader counts
and something like that would have to be implemented if you want to do priority
inheritance over blocks.

This is going to throw a wrench into your implementation if you assume that.

bill


Re: Real-Time Preemption and RCU

2005-03-20 Thread hui

On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote:
> That was just one random example.
> Another one would be :
> 
> drivers/char/tty_io.c, __do_SAK() contains
>    read_lock(&tasklist_lock);
>    task_lock(p);
> 
> kernel/sys.c, sys_setrlimit contains
>    task_lock(current->group_leader);
>    read_lock(&tasklist_lock);
> 
> task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is 
> a normal spinlock, then this is an A/B B/A deadlock.

That code was already dubious in the first place just because it
contained that circularity. If you had a rwlock that blocks on an
upper read count maximum, a deadlock situation would trigger anyways,
say, upon a flood of threads trying to do that sequence of acquires.

I'd probably experiment with using the {spin,read,write}-trylock
logic and release all the locks contained in a sequence like that
on the failure to acquire any of the locks in the chain as an
initial fix. A longer term fix might be to break things up a bit
so that whatever ordering being done would have that circularity.
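
The trylock-and-release idea, sketched with C11 atomics rather than the
kernel's own {spin,read,write}_trylock primitives. The names (struct
try_lock, try_acquire(), try_release(), acquire_both()) are invented for
the example, and a real caller would retry, ideally after a pause.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct try_lock { atomic_flag held; };

/* Returns true if the lock was free and is now held by the caller. */
bool try_acquire(struct try_lock *l)
{
    return !atomic_flag_test_and_set(&l->held);
}

void try_release(struct try_lock *l)
{
    atomic_flag_clear(&l->held);
}

/* Take `first` then `second`, or neither. If the second lock is busy,
 * the first is dropped again, so a thread that takes the pair in the
 * opposite order is never left waiting on a lock we keep holding. */
bool acquire_both(struct try_lock *first, struct try_lock *second)
{
    assert(first != second);
    if (!try_acquire(first))
        return false;
    if (!try_acquire(second)) {
        try_release(first);
        return false;
    }
    return true;
}
```

Because nobody holds one lock while blocking on the other, the A/B B/A
sequences can no longer deadlock; the cost is retry traffic under
contention, which is why I'd call it an initial fix only.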

BTW, the runtime lock circularity detector was designed to trigger
on that situation anyways.

That's my thoughts on the matter.

bill

Re: Real-Time Preemption and RCU

2005-03-20 Thread hui

On Sun, Mar 20, 2005 at 01:38:24PM -0800, Bill Huey wrote:
> On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote:
> > That was just one random example.
> > Another one would be :
> > 
> > drivers/char/tty_io.c, __do_SAK() contains
> >    read_lock(&tasklist_lock);
> >    task_lock(p);
> > 
> > kernel/sys.c, sys_setrlimit contains
> >    task_lock(current->group_leader);
> >    read_lock(&tasklist_lock);
> > 
> > task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is 
> > a normal spinlock, then this is an A/B B/A deadlock.
> 
> That code was already dubious in the first place just because it
> contained that circularity. If you had a rwlock that block on an
> upper read count maximum[,] a deadlock situation would trigger anyways,
> say, upon a flood of threads trying to do that sequence of aquires.

The RT patch uses the lock ordering "in place" and whatever nasty
situation was going on previously will effectively be under high load,
which increases the chance of it being triggered. Removal of the read
side semantic just increases load more so that those cases can trigger.

I disagree with this approach and I have an alternate implementation
here that restores it. It's only half tested and fairly meaningless
until an extreme contention case is revealed with the current rt lock
implementation. Numbers need to be gathered to prove or disprove this
conjecture.

> I'd probably experiment with using the {spin,read,write}-trylock
> logic and release the all locks contains in a sequence like that
> on the failure to aquire any of the locks in the chain as an
> initial fix. A longer term fix might be to break things up a bit
> so that whatever ordering being done would have that circularity.

Excuse me, ...would *not* have that circularity.

bill

Re: Real-Time Preemption and RCU

2005-03-22 Thread hui

On Fri, Mar 18, 2005 at 05:55:44PM +0100, Esben Nielsen wrote:
> On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > i really have no intention to allow multiple readers for rt-mutexes. We
> > got away with that so far, and i'd like to keep it so. Imagine 100
> > threads all blocked in the same critical section (holding the read-lock)
> > when a highprio writer thread comes around: instant 100x latency to let
> > all of them roll forward. The only sane solution is to not allow
> > excessive concurrency. (That limits SMP scalability, but there's no
> > other choice i can see.)
> 
> Unless a design change is made: One could argue for a semantics where
> write-locking _isn't_ deterministic and thus do not have to boost all the

RCU isn't write deterministic like typical RT apps are, so we can... (below :-))

> readers. Readers boost the writers but not the other way around. Readers
> will be deterministic, but not writers.
> Such a semantics would probably work for a lot of RT applications
> happening not to take any write-locks - these will in fact perform better. 
> But it will give the rest a lot of problems.

Just came up with an idea after I thought about how much of a bitch it
would be to get a fast RCU multiple reader semantic (our current shared-
exclusive lock inserts owners into a sorted priority list per-thread which
makes it very expensive for a simple RCU case since they are typically very
small batches of items being altered). Basically the RCU algorithm has *no*
notion of writer priority and to propagate a PI operation down all readers
is meaningless, so why not revert back to the original rwlock-semaphore to
get the multiple reader semantics ?

A notion of priority across a quiescence operation is crazy anyways, so
it would be safe just to use the old rwlock-semaphore "in place" without
any changes or priority handling additions. The RCU algorithm is only concerned
with what is basically a coarse data guard and it isn't time or priority
critical.

What do you folks think ? That would make Paul's stuff respect multiple
readers which reduces contention and gets around the problem of possibly
overloading the current rt lock implementation that we've been bitching
about. The current RCU development track seems wrong in the first place and
this seems like it could be a better and more complete solution to the problem.

If this works, well, you heard it here first. :)

bill

Re: Real-Time Preemption and RCU

2005-03-22 Thread hui

On Tue, Mar 22, 2005 at 02:04:46AM -0800, Bill Huey wrote:
> RCU isn't write deterministic like typical RT apps are[, so] we can... (below :-))
... 
> Just came up with an idea after I thought about how much of a bitch it
> would be to get a fast RCU multiple reader semantic (our current shared-
> exclusive lock inserts owners into a sorted priority list per-thread which
> makes it very expensive for a simple RCU case[,] since they are typically very
> small batches of items being altered). Basically the RCU algorithm has *no*
> notion of writer priority and to propagate a PI operation down all reader[s]
> is meaningless, so why not revert back to the original rwlock-semaphore to
> get the multiple reader semantics ?

The original lock, for those that don't know, doesn't strictly track read owners
so reentrancy is cheap.
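
To show what "doesn't strictly track read owners" buys you, here is a toy
counter-only reader/writer lock in C11 atomics. It is not the kernel's
rwlock-semaphore; the names (struct count_rwlock and the crw_* functions)
are hypothetical, and it only offers try operations to keep it short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* count > 0: number of read holds, count == -1: one writer, 0: free.
 * No list of read owners is kept anywhere. */
struct count_rwlock { atomic_int count; };

bool crw_read_trylock(struct count_rwlock *l)
{
    int c = atomic_load(&l->count);

    while (c >= 0) {
        /* On failure, c is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&l->count, &c, c + 1))
            return true;
    }
    return false;                 /* a writer holds the lock */
}

void crw_read_unlock(struct count_rwlock *l)
{
    int old = atomic_fetch_sub(&l->count, 1);

    assert(old > 0);
    (void)old;
}

bool crw_write_trylock(struct count_rwlock *l)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&l->count, &expected, -1);
}

void crw_write_unlock(struct count_rwlock *l)
{
    int old = atomic_exchange(&l->count, 0);

    assert(old == -1);
    (void)old;
}
```

A nested read acquire by the same thread is just one more increment. The
flip side is that the lock cannot say who the readers are, so there is
nobody to boost, which is the trade-off being discussed here.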

> A notion of priority across a quiescence operation is crazy anyways[-,-] so
> it would be safe just to use the old rwlock-semaphore "in place" without
> any changes or priority handling add[i]tions. The RCU algorithm is only concerned
> with what is basically a coarse data guard and it isn't time or priority
> critical.

A little jitter in a quiescence operation isn't going to hurt things, right ?

> What do you folks think ? That would make Paul's stuff respect multiple
> readers which reduces contention and gets around the problem of possibly
> overloading the current rt lock implementation that we've been bitching
> about. The current RCU development track seems wrong in the first place and
> this seems like it could be a better and more complete solution to the problem.

Got to get rid of those typos :)

bill

Re: Real-Time Preemption and RCU

2005-03-22 Thread hui

On Tue, Mar 22, 2005 at 02:17:27AM -0800, Bill Huey wrote:
> > A notion of priority across a quiescence operation is crazy anyways[-,-] so
> > it would be safe just to use the old rwlock-semaphore "in place" without
> > any changes or priority handling add[i]tions. The RCU algorithm is only concerned
> > with what is basically a coarse data guard and it isn't time or priority
> > critical.
> 
> A little jitter in a quiescence operation isn't going to hurt things, right ?

The only thing that I can think of that can go wrong here is what kind
of effect it would have on the thread write blocking against a bunch of
RCU readers. It could introduce a chain of delays into, say, a timer event
and might cause problems/side-effects for other things being processed.
RCU processing might have to be decoupled and processed by a different
thread to avoid some of that latency weirdness.

What do you folks think ?

bill

Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)

2005-01-28 Thread hui

On Fri, Jan 28, 2005 at 08:45:46PM +0100, Ingo Molnar wrote:
> * Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > If you do have a highest interrupt case that causes all activity to
> > block, then rwsems may indeed fit the bill.
> > 
> > In the NFS client code we may use rwsems in order to protect stateful
> > operations against the (very infrequently used) server reboot recovery
> > code. The point is that when the server reboots, the server forces us
> > to block *all* requests that involve adding new state (e.g. opening an
> > NFSv4 file, or setting up a lock) while our client and others are
> > re-establishing their existing state on the server.
> 
> it seems the most scalable solution for this would be a global flag plus
> per-CPU spinlocks (or per-CPU mutexes) to make this totally scalable and
> still support the requirements of this rare event. An rwsem really
> bounces around on SMP, and it seems very unnecessary in the case you
> described.
> 
> possibly this could be formalised as an rwlock/rwlock implementation
> that scales better. brlocks were such an attempt.

From how I understand it, you'll have to have a global structure to
denote an exclusive operation and then take some additional cpumask_t
representing the set of spinlocks and use it to iterate over when doing
a PI chain operation.

Locking of each individual parametric typed spinlock might require
a raw_spinlock to manipulate list structures, which, added up, is rather
heavy weight.

Not only that, you'd have to introduce a notion of it being counted
since it could also be acquired/preempted by another higher priority
thread on that same processor.  Not having this semantic would make the
thread in that specific circumstance effectively non-preemptable (PI
scheduler indeterminacy), where the multiple readers portion of a
real read/write (shared-exclusive) lock would have permitted this.
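
A toy model of the "global flag plus per-CPU locks" scheme, to show where
the counting problem comes from. This is my own sketch in C11 atomics with
invented names (struct percpu_rwlock, DEMO_NR_CPUS, the pcpu_* functions);
it is not brlock or any proposed kernel code, it passes the CPU number in
explicitly, and it only has try operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 4

struct percpu_rwlock {
    atomic_bool exclusive;                 /* global "writer active" flag */
    atomic_flag cpu_lock[DEMO_NR_CPUS];    /* one reader-side lock per CPU */
};

void pcpu_init(struct percpu_rwlock *l)
{
    atomic_init(&l->exclusive, false);
    for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
        atomic_flag_clear(&l->cpu_lock[cpu]);
}

/* A reader only touches its own CPU's lock plus a read of the flag. */
bool pcpu_read_trylock(struct percpu_rwlock *l, int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    if (atomic_flag_test_and_set(&l->cpu_lock[cpu]))
        return false;                      /* this CPU's slot is taken */
    if (atomic_load(&l->exclusive)) {      /* a writer has announced itself */
        atomic_flag_clear(&l->cpu_lock[cpu]);
        return false;
    }
    return true;
}

void pcpu_read_unlock(struct percpu_rwlock *l, int cpu)
{
    atomic_flag_clear(&l->cpu_lock[cpu]);
}

/* A writer sets the flag, then has to collect every per-CPU lock. */
bool pcpu_write_trylock(struct percpu_rwlock *l)
{
    if (atomic_exchange(&l->exclusive, true))
        return false;
    for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
        if (atomic_flag_test_and_set(&l->cpu_lock[cpu])) {
            while (cpu-- > 0)              /* undo the locks already taken */
                atomic_flag_clear(&l->cpu_lock[cpu]);
            atomic_store(&l->exclusive, false);
            return false;
        }
    }
    return true;
}

void pcpu_write_unlock(struct percpu_rwlock *l)
{
    for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
        atomic_flag_clear(&l->cpu_lock[cpu]);
    atomic_store(&l->exclusive, false);
}
```

Note that a second pcpu_read_trylock() on a CPU whose slot is already held
fails. That is the case of a higher priority reader preempting a reader on
the same processor, and it is why the per-CPU slot would need to be counted
rather than a plain lock.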

http://people.lynuxworks.com/~bhuey/rt-share-exclusive-lock/rtsem.tgz.1208

is our attempt at getting real shared-exclusive lock semantics in a
blocking lock; it may still be incomplete and buggy. Igor is still
working on this and this is the latest that I have of his work. Getting
comments on this approach would be a good thing as I/we (me/Igor)
believed from the start that this approach is correct.

Assuming that this is possible with the current approach, optimizing
it to avoid CPU ping-ponging is an important next step.

bill

Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

2005-01-31 Thread hui

On Mon, Jan 31, 2005 at 05:29:10PM -0500, Bill Davidsen wrote:
> The problem hasn't changed in a few decades, neither has the urge of 
> developers to make their app look good at the expense of the rest of the 
> system. Been there and done that myself.
> 
> "Back when" we had no good tools except to raise priority and drop 
> timeslice if a process blocked for i/o and vice-versa if it used the 
> whole timeslice. The amzing thing is that it worked reasonably well as 
> long as no one was there who knew how to cook the books the scheduler 
> used. And the user could hold off interrupts for up to 16ms, just to 
> make it worse.

A lot of this scheduling policy work is going to have to be redone as
badly written apps start getting their crap together and as this patch
becomes more and more pervasive in the general Linux community. What's
happening now is only the beginning of things to come and it'll require
a solid sample application with even more hooks into the kernel before
we'll see the real benefits of this patch. SCHED_FIFO will have to do
until more development happens with QoS style policies.

bill

Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

2005-02-02 Thread hui

On Tue, Feb 01, 2005 at 11:10:48PM -0600, Jack O'Quin wrote:
> Ingo Molnar <[EMAIL PROTECTED]> writes:
> > (also, believe me, this is not arrogance or some kind of game on our
> > part. If there was a nice clean solution that solved your and others'
> > problems equally well then it would already be in Linus' tree. But there
> > is no such solution yet, at the moment. Moreover, the pure fact that so
> > many patch proposals exist and none looks dominantly convincing shows
> > that this is a problem area for which there are no easy solutions. We
> > hate such moments just as much as you do, but they do happen.)
> 
> The actual requirement is nowhere near as difficult as you imagine.
> You and several others continue to view realtime in a multi-user
> context.  That doesn't work.  No wonder you have no good solution.

A notion of process/thread scoping is needed from my point of view. How
to implement that is another matter and there are no clear solutions
that don't involve major changes in some way to fundamental syscalls
like fork/clone() and underlying kernel structures from what I see.
The very notion of Unix fork semantics isn't sufficient to
"contain" these semantics. It's more about controlling things with
known quantities over time, not about process creation relationships,
and therein lies the mismatch.

Also, as media apps get more sophisticated they're going to need some
kind of access to some traditional softirq facilities, possibly
migrating it into userspace safely somehow, with how it handles IO
processing such as iSCSI, FireWire, networking and all peripherals
that need some kind of prioritized IO handling. It's akin to O_DIRECT,
where folks need to determine policy over the kernel's own facilities,
IO queues, but in a broader way. This is inevitable for this
category of apps. Scary? Yes, I know.

Think XFS streaming with guaranteed rate IO, then generalize this for
all things that can be streamed in the kernel. A side note, they'll
also be pegging CPU usage and attempting to draw to the screen at the
same time. It would be nice to have slack from scheduler frames be used
for less critical things such as drawing to the screen.

The policy for scheduling these IO requests may be divorced from the
actual priority of the thread requesting it, which presents some problems
with the current Linux code as I understand it.

Whether this is suitable for mainstream inclusion is another matter. But
as a person that wants to write apps of this nature, I came into this
kernel stuff knowing that there's going to be a conflict between the
needs of media app folks and what the Linux kernel folks will
tolerate as a community.

> The humble RT-LSM was actually optimal for the multi-user scenario:
> don't load it.  Then it adds no security issues, complexity or
> scheduler pathlength.  As an added benefit, the sysadmin can easily
> verify that it's not there.
> 
> The cost/performance characteristics of commodity PC's running Linux
> are quite compelling for a wide range of practical realtime
> applications.  But, these are dedicated machines.  The whole system
> must be carefully tuned.  That is the only method that actually works.
> The scheduler is at most a peripheral concern; the best it can do is
> not screw up.

It's very compelling and very deadly to the industry if these things
become commonplace in the normal Linux kernel. It would instantly
make Linux the top platform for anything media related, graphic and
audio. (Hopefully, I can get back to kernel coding RT stuff after this
current distraction that has me reassigned onto an emergency project)

I hope I clarified some of this and didn't scare
Ingo and others too much. Just a little bit is ok. :)

bill

Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

2005-02-02 Thread hui

On Wed, Feb 02, 2005 at 10:44:22AM -0600, Jack O'Quin wrote:
> Bill Huey (hui) <[EMAIL PROTECTED]> writes:
> > Also, as media apps get more sophisticated they're going to need some
> > kind of access to the some traditional softirq facilities, possibily
> > migrating it into userspace safely somehow, with how it handles IO
> > processing such as iSCSI, FireWire, networking and all peripherals
> > that need some kind of prioritized IO handling. It's akin to O_DIRECT,
> > where folks need to determine policy over the kernel's own facilities,
> > IO queues, but in a more broad way. This is inevitable for these
> > category of apps. Scary ? yes I know.
> 
> I believe Ingo's RT patches already support this on a per-IRQ basis.
> Each IRQ handler can run in a realtime thread with priority assigned
> by the sysadmin.  Balancing the interrupt handler priorities with
> those of other realtime activities allows excellent control.  

No they don't. That's a physical mapping of these kernel entities, not a
logical organization that projects upward to things like individual sockets
or file streams. The current irq-thread patches are primarily for dealing
with the low level acks and stuff for the devices in question. They do not
deal with queuing policy or how these things are scheduled on a logical
basis, which is what softirqs do. softirqs group a number of things together
in one big uncontrollable chunk. Really, a bit of time spent in the kernel
regarding this would clarify it more in the future. Don't speculate.

This misunderstanding, often babble, from app folks is why kernel folks
partially dismiss the needs requested from this subgroup. It's important
to understand your needs before articulating them to a wider community.

The kernel community must understand the true nature of these needs and
then facilitate them. If the relationship is one where kernel folks dictate
what app folks have, you basically pervert the relationship and the
responsibilities of overall development, which fatally cripples app
and all development of this nature. It's a two way street, but kernel
folks can be more proactive about it, definitely.

Step one in this is to acknowledge that Unix scheduling semantics is
"inantiquated" with regard to media apps. Some notion of scoping needs to
be put in.

Everybody on the same page ?

> This is really only useful within the context of a dedicated realtime
> system, of course.
> 
> Stephane Letz reports a similar feature in Mac OS X.

OS X is very coarse grained (two funnels) and I would seriously doubt
that it would perform without scheduling side effects to the overall
system because of that. With a largely stalled FreeBSD SMPng project
where they hijack a good chunk of their code into an antiquated and
bloated Mach threading system, that situation isn't helping it.

What the Linux community has with the RT patches is potentially light
years ahead of OS X regarding overall system latency, since RT and
SMP performance are tightly related. It's just a matter of getting the
right folks to understand the problem space and then make changes so
that the overall picture is completed.

bill

Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

2005-02-02 Thread hui

On Wed, Feb 02, 2005 at 10:21:00PM +0100, Ingo Molnar wrote:
> yes and no. You are right in that the individual workloads (e.g.
> softirqs) are not separated and identified/credited to the thread that
> requested them. (in part due to the fact that you cannot e.g. credit a
> thread for e.g. unrequested workloads like incoming sockets, or for
> 'merged' workloads like writeout of a commonly accessed file.)

What's not being addressed here is a need for pervasive QoS across all
kernel systems. The power of this patch is multiplicative. It's not
about a specific component of the system having microsecond latencies,
it's about how all parts, softirqs, hardirqs, VM, etc... work together
so that the entire system is suitable for (near) hard real time. It's
unconstrained, unlike dual kernel RT systems, across all component
boundaries. Those constraints create large chunks of glue logic between
systems, which has exploded the complexity of things that app folks
must deal with.

This is where properly written Linux apps (none exist right now because
of kernel issues) can really overtake competing apps from other OSes
(ignoring how crappy X11 is).

> but Jack is right in practical terms: the audio folks achieved pretty
> good results with the current IRQ threading mechanism, partly due to the
> fact that the audio stack doesnt use softirqs, so all the
> latency-critical activities are in the audio IRQ thread and the
> application itself.

It's clever that they do that, but additional control is needed in the
future. jackd isn't the most sophisticated media app on this planet (not
too much of an insult :)) and the demands from this group are bound to
increase as their group and competing projects get more and more
sophisticated. When I say kernel folks need to be proactive, I really
mean it. The Linux kernel latency issues and poor driver support are
largely why media apps are way below even being second rate with regard
to other operating systems such as Apple's OS X for instance.

bill

Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

2005-02-02 Thread hui

On Wed, Feb 02, 2005 at 01:14:05PM -0800, Bill Huey wrote:
> Step one in this is to acknowlege that Unix scheduling semantics is
> "inantiquated" with regard to media apps. Some notion of scoping needs to

bah, "inadequate".

> be put in.
> 
> Everybody on the same page ?

bill

Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

2005-02-02 Thread hui

On Wed, Feb 02, 2005 at 05:59:54PM -0500, Paul Davis wrote:
> Actually, JACK probably is the most sophisticated media *framework* on
> the planet, at least inasmuch as it connects ideas drawn from the
> media world and OS research/design into a coherent package. Its not
> perfect, and we've just started adding new data types to its
> capabilities (its actually relatively easy). But it is amazingly
> powerful in comparison to anything offered to data, and is
> unencumbered by the limitations that have affected other attempts to
> do what it does.

This is a bit off topic, but I'm interested in applications that are
more driven by time and have abstractions closer to that in a pure way.
A lot of audio kits tend to be overly about DSP and not about time.
This is difficult to explain, but what I'm referring to here is ideally
the next generation of these applications and their design, not the current
lot. A lot more can be done.

> And it makes possible some of the most sophisticated *audio* apps on
> the planet, though admittedly not video and other data at this time.

Again, the notion of time based processing with broader uses and not
just DSP, which is what a lot of current graph driven audio frameworks
seem to still do at this time. Think gaming audio in 3d, etc...

I definitely have ideas on this subject and I'm going to hold my
current position on this matter in that we can collectively do much
better.

bill

Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

2005-02-02 Thread hui

On Thu, Feb 03, 2005 at 08:54:24AM +1100, Peter Williams wrote:
> As Ingo said in an earlier a post, with a little ingenuity this problem 
> can be solved in user space.  The programs in question can be setuid 
> root so that they can set RT scheduling policy BUT have their 
> permissions set so that they only executable by owner and group with the 
> group set to a group that only contains those users that have permission 
> to run this program in RT mode.  If you wish to allow other users to run 
> the program but not in RT mode then you would need two copies of the 
> program: one set up as above and the other with normal permissions.

Again, you snipped my post and either didn't read or didn't understand
what I was saying regarding QoS and the large scale issues regarding
dual/single kernel development environments. Ultimately this stuff requires
non-trivial support in kernel space, a softirq thread migration mechanism
and a frame driven scheduler to back IO submission across async boundaries.

My posts were pretty clear on this topic and a lot of this has origins
coming from SGI IRIX. Yes, SGI IRIX. One of the only systems man enough
to handle this stuff.

Ancient, antiquated Unix scheduler semantics (sort and run) and lack of
control over critical facilities like softirq processing are obstacles
to getting at this.

bill

Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

2005-02-03 Thread hui

On Thu, Feb 03, 2005 at 10:41:33PM +0100, Ingo Molnar wrote:
> * Bill Huey <[EMAIL PROTECTED]> wrote:
> > It's clever that they do that, but additional control is needed in the
> > future. jackd isn't the most sophisticate media app on this planet (not
> > too much of an insult :)) [...]
> 
> i think you are underestimating Jack - it is easily amongst the most
> sophisticated audio frameworks in existence, and it certainly has one of
> the most robust designs. Just shop around on google for Jack-based audio
> applications. What i'd love to see is more integration (and cooperation)
> between the audio frameworks of desktop projects (KDE, Gnome) and Jack.

This is a really long winded and long standing offtopic gripe I have with
general application development under Linux. The only way I'm going to
get folks to understand my position on it is if I code it up in my
implementation language of choice with my own APIs.

There's a TON more that can be done with QoS in the kernel (EDL schedulers),
DSP JIT compiler techniques and other kernel things that can support
pro-audio. I simply can't get to it yet until the RT patch has a few more
goodies and I'm permitted to do this as my next project.

I had a crazy prototype of some DSP graph system (in C++) I wrote years
ago for 3D audio where I'm drawing my knowledge from and it's getting
time to resurrect it again if I'm going to provide a proof of concept
to push an edge.

Also, think, people working with the RT patch are also ignoring frame
accurate video and many other things that just haven't been done yet
since the patch is so new and there hasn't been more interest from
folks yet regarding it. I suspect that it's because folks don't
know about it yet.

bill

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread hui

On Tue, Jul 24, 2007 at 04:39:47PM -0400, Chris Snook wrote:
> Chris Friesen wrote:
>> We currently use CKRM on an SMP machine, but the only way we can get away 
>> with it is because our main app is affined to one cpu and just about 
>> everything else is affined to the other.
>
> If you're not explicitly allocating resources, you're just low-latency, not 
> truly realtime.  Realtime requires guaranteed resources, so messing with 
> affinities is a necessary evil.

You've mentioned this twice in this thread. If you're going to talk about this
you should characterize this more specifically because resource allocation is
a rather incomplete area in Linux. Rebalancing is still an open research
problem the last time I looked.

Tong's previous trio patch is an attempt at resolving this using a generic
grouping mechanism and some constructive discussion should come of it.

bill

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-24 Thread hui

On Tue, Jul 24, 2007 at 05:22:47PM -0400, Chris Snook wrote:
> Bill Huey (hui) wrote:
> Well, you need enough CPU time to meet your deadlines.  You need 
> pre-allocated memory, or to be able to guarantee that you can allocate 
> memory fast enough to meet your deadlines.  This principle extends to any 
> other shared resource, such as disk or network.  I'm being vague because 
> it's open-ended.  If a medical device fails to meet realtime guarantees 
> because the battery fails, the patient's family isn't going to care how 
> correct the software is.  Realtime engineering is hard.
...
> Actually, it's worse than merely an open problem.  A clairvoyant fair 
> scheduler with perfect future knowledge can underperform a heuristic fair 
> scheduler, because the heuristic scheduler can guess the future incorrectly 
> resulting in unfair but higher-throughput behavior.  This is a perfect 
> example of why we only try to be as fair as is beneficial.

I'm glad we agree on the above points. :)

It might be that there needs to be another, stiffer policy than what goes
into SCHED_OTHER, in that we also need a SCHED_ISO or something that has
stricter rebalancing semantics for -rt applications, sort of a super SCHED_RR.
That's definitely needed and I don't see how the current CFS implementation
can deal with this properly even with numerical running averages, etc...
at this time.

SCHED_FIFO is another issue, but this is actually more complicated than just
per-CPU run queues in that it needs a global priority analysis. I don't see how
CFS can deal with SCHED_FIFO efficiently without moving to a single run
queue. This is kind of a complicated problem with a significant set of
trade-offs to take into account (cpu binding, etc.).

>> Tong's previous trio patch is an attempt at resolving this using a generic
>> grouping mechanism and some constructive discussion should come of it.
>
> Sure, but it seems to me to be largely orthogonal to this patch.

It's based on the same kinds of ideas that he's been experimenting with in
Trio. I can't name a single other engineer that's posted to lkml recently
that has quite the depth of experience in this area that he has. It would be
nice to facilitate/incorporate some of his ideas or get him to work on
something to this end that's suitable for inclusion in some tree somewhere.

bill

NetApp sues Sun regarding ZFS

2007-09-05 Thread hui


Folks,

The official announcement.

http://www.netapp.com/news/press/news_rel_20070905

Dave Hitz's blog about it.

http://blogs.netapp.com/dave/

bill

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-27 Thread hui

On Fri, Jul 27, 2007 at 12:03:28PM -0700, Tong Li wrote:
> Thanks for the interest. Attached is a design doc I wrote several months 
> ago (with small modifications). It talks about the two pieces of my design: 
> group scheduling and dwrr. The description was based on the original O(1) 
> scheduler, but as my CFS patch showed, the algorithm is applicable to other 
> underlying schedulers as well. It's interesting that I started working on 
> this in January for the purpose of eventually writing a paper about it. So 
> I knew reasonably well the related research work but was totally unaware 
> that people in the Linux community were also working on similar things. 
> This is good. If you are interested, I'd like to help with the algorithms 
> and theory side of the things.

Tong,

This is sufficient as an overview of the algorithm but not detailed enough
for it to be a discussable design doc I believe. You should ask Chris to see
what he means by this.

Some examples of your rebalancing scheme and how your invariant applies
across processor rounds would be helpful for me and possibly others as well.

bill

Re: [RFC] scheduler: improve SMP fairness in CFS

2007-07-27 Thread hui

On Fri, Jul 27, 2007 at 07:36:17PM -0400, Chris Snook wrote:
> I don't think that achieving a constant error bound is always a good thing. 
>  We all know that fairness has overhead.  If I have 3 threads and 2 
> processors, and I have a choice between fairly giving each thread 1.0 
> billion cycles during the next second, or unfairly giving two of them 1.1 
> billion cycles and giving the other 0.9 billion cycles, then we can have a 
> useful discussion about where we want to draw the line on the 
> fairness/performance tradeoff.  On the other hand, if we can give two of 
> them 1.1 billion cycles and still give the other one 1.0 billion cycles, 
> it's madness to waste those 0.2 billion cycles just to avoid user jealousy. 
>  The more complex the memory topology of a system, the more "free" cycles 
> you'll get by tolerating short-term unfairness.  As a crude heuristic, 
> scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but 
> eventually we should take the boot-time computed migration costs into 
> consideration.

You have to consider the target for this kind of code. There are applications
where you need something that falls within a constant error bound. According
to the numbers, the current CFS rebalancing logic doesn't achieve that to
any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more
strict than that.

Even the rt overload code (from my memory) is subject to these limitations
as well until it's moved to use a single global queue while using CPU
binding to turn off that logic. It's the price you pay for accuracy.
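
For reference, the log2(NCPUS) heuristic from the quoted text can be
written out as below. This is only a sketch: the function names, the floor
of 1x for small machines and the 5% base tolerance used in the note are
assumptions, not anything taken from CFS or from Tong's patch.

```python
import math


def unfairness_tolerance(base_tolerance, ncpus):
    """Scale a small base tolerance by log2(ncpus).

    The floor of 1.0 (so 1 and 2 CPUs get the base value) is an assumption.
    """
    return base_tolerance * max(1.0, math.log2(ncpus))


def within_tolerance(cycles, tolerance):
    """True if every thread is within `tolerance` (a fraction) of the mean."""
    mean = sum(cycles) / len(cycles)
    return all(abs(c - mean) <= tolerance * mean for c in cycles)
```

With a 5% base tolerance, the 1.1/1.1/0.9 billion cycle split from the
quoted example is rejected on 2 CPUs but accepted on 8, so the bound grows
with machine size rather than staying constant.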

> If we allow a little short-term fairness (and I think we should) we can 
> still account for this unfairness and compensate for it (again, with the 
> same tolerance) at the next rebalancing.

Again, it's a function of *when* and depends on that application.

> Adding system calls, while great for research, is not something which is 
> done lightly in the published kernel.  If we're going to implement a user 
> interface beyond simply interpreting existing priorities more precisely, it 
> would be nice if this was part of a framework with a broader vision, such 
> as a scheduler economy.

I'm not sure what you mean by scheduler economy, but CFS can and should
be extended to handle proportional scheduling which is outside of the
traditional Unix priority semantics. Having a new API to get at this is
unavoidable if you want it to eventually support -rt oriented applications
that have bandwidth semantics.

All deadline based schedulers have API mechanisms like this to support
extended semantics. This is no different.
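
As a toy illustration of what "proportional" means here, the arithmetic
below turns per-task weights into CPU shares, capping any task at one full
CPU and handing the surplus to the rest. It is only arithmetic under
assumed names (proportional_shares, weights, ncpus), not a proposal for
what the API should look like.

```python
def proportional_shares(weights, ncpus):
    """Return each task's share in units of one CPU, proportional to weight.

    A single task cannot use more than 1.0 CPU, so capped tasks are fixed
    at 1.0 and the remaining capacity is split among the others.
    """
    shares = [0.0] * len(weights)
    active = list(range(len(weights)))
    capacity = float(ncpus)
    while active:
        total = sum(weights[i] for i in active)
        capped = [i for i in active if weights[i] / total * capacity >= 1.0]
        if not capped:
            for i in active:
                shares[i] = weights[i] / total * capacity
            break
        for i in capped:
            shares[i] = 1.0
        capacity -= len(capped)
        active = [i for i in active if i not in capped]
    return shares
```

Three equal-weight tasks on 2 CPUs each get two thirds of a CPU; with
weights 4, 1, 1 the heavy task is capped at one CPU and the other two get
half a CPU each.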

> I had a feeling this patch was originally designed for the O(1) scheduler, 
> and this is why.  The old scheduler had expired arrays, so adding a 
> round-expired array wasn't a radical departure from the design.  CFS does 
> not have an expired rbtree, so adding one *is* a radical departure from the 
> design.  I think we can implement DWRR or something very similar without 
> using this implementation method.  Since we've already got a tree of queued 
> tasks, it might be easiest to basically break off one subtree (usually just 
> one task, but not necessarily) and migrate it to a less loaded tree 
> whenever we can reduce the difference between the load on the two trees by 
> at least half.  This would prevent both overcorrection and undercorrection.

> The idea of rounds was another implementation detail that bothered me.  In 
> the old scheduler, quantizing CPU time was a necessary evil.  Now that we 
> can account for CPU time with nanosecond resolution, doing things on an 
> as-needed basis seems more appropriate, and should reduce the need for 
> global synchronization.

Well, there's nanosecond resolution with no mechanism that exploits it for
rebalancing. Rebalancing in general is a pain and the code for it is
generally orthogonal to the in-core scheduler data structures that are in
use, so I don't understand the objection to this argument and the choice
of methods. If it gets the job done, then these kinds of choices don't
have that much meaning.
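
For reference, the migration rule proposed in the quoted text (move a task
only when that cuts the load difference between two queues at least in
half) can be sketched as follows. The function name and the use of plain
numeric loads are assumptions for illustration; this is not code from CFS
or from Tong's patch.

```python
def should_migrate(load_busy, load_idle, task_weight):
    """True if moving the task at least halves the load difference."""
    before = abs(load_busy - load_idle)
    after = abs((load_busy - task_weight) - (load_idle + task_weight))
    return before > 0 and after * 2 <= before
```

With loads of 10 and 2, a task of weight 4 is moved (the difference goes
from 8 to 0), a task of weight 1 is not (undercorrection, 8 to 6) and a
task of weight 7 is not either (overcorrection, 8 to 6 the other way).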

> In summary, I think the accounting is sound, but the enforcement is 
> sub-optimal for the new scheduler.  A revision of the algorithm more 
> cognizant of the capabilities and design of the current scheduler would 
> seem to be in order.

That would be nice. But the amount of error in Tong's solution is much
less than the current CFS logic as was previously tested even without
consideration to high resolution clocks.

So you have to give some kind of credit for that approach and recognize
that current methods in CFS are technically a dead end if there's a need for
strict fairness in a more rigorous run category than SCHED_OTHER.

> I've referenced many times my desire to account for CPU/memory hierarchy in 
> these patches.  At present, I'm not sure we have sufficient infrastructure 
> in the kernel to automatically optimize for system topology, but I think 
> whatever de

Re: [ck] Re: Linus 2.6.23-rc1

2007-07-28 Thread hui

On Sat, Jul 28, 2007 at 09:28:36PM +0200, jos poortvliet wrote:
> Your point here seems to be: this is how it went, and it was right. Ok, got 
> that. Yet, Con walked away (and not just over SD). Seeing Con go, I wonder 
> how many did leave without this splash. How many didn't even get involved at 
> all??? Did THAT have to happen? I don't blame you for it - the point is that 
> somewhere in the process a valuable kernel hacker went away. How and why? And 
> is it due to a deeper problem?

Absolutely, the current Linux community hasn't realized how large the
community has gotten and the internal processes for dealing with new
developers, that aren't at companies like SuSE or RedHat, haven't been
extended to deal with it yet. It comes off as elitism which it partially
is.

Nobody tries to facilitate or understand ideas in the larger community,
which locks out folks like Con that try to do provocative things outside
of the normal technical development mindset. He was punished for doing
so, and that is a huge failure in this community.

Con basically got caught in a scheduler philosophical argument of whether
to push a policy into userspace or to nice a process instead because
of how crappy X is. This is an open argument on how to solve, but it
should not have resulted in really one scheduler over the other. Both
were capable but one is locked out now because of the choices of
current high level kernel developers in Linux.

There are a lot of good kernel folks in many different communities that
look at something like this and would be turned off to participating
in Linux development. And I have a good record of doing rather
interesting stuff in kernel.

bill

Re: [ck] Re: Linus 2.6.23-rc1

2007-07-28 Thread hui

On Sat, Jul 28, 2007 at 11:06:09PM +0200, Diego Calleja wrote:
> So your argument is that SD shouldn't have been merged either, because it
> would have resulted in one scheduler over the other?

My argument is that scheduler development is open ended. Although having
a central scheduler to hack is a good thing, it shouldn't lock out or
suppress development from other groups that might be trying to solve the
problem in unique ways.

This can be accomplished in a couple of ways:

1) scheduler modularity

Clearly Con is highly qualified to experiment with scheduler code and
this should be technically facilitated by some means if not a maintainer.
He's only a part time maintainer and nobody helped him with this stuff
nor did they try to understand what his scheduler was trying to do other
than Tong Li.

2) better code modularity

Now, cleaner code would help with this a lot. If that was in place, we
might not need (1) and a pluggable scheduler. It would limit the amount
of refactoring for folks so that their code can drop in easier. There's
a significant amount of churn, and it locks out developers by default
since they have to constantly clean up the code in question while another
developer can commit without consideration to how it affects others.
That's their right as a maintainer, but also as a maintainer, they should
give a proper amount of consideration to how others might intend to extend
the code so that development remains "inclusive".

This notion of "open source, open development" is false when working
under those circumstances.

> > where capable but one is locked out now because of the choices of
> > current high level kernel developers in Linux.
> 
> Well, there are two schedulers...it's obvious that "high level kernel
> developers" needed to chose one.

I think that's kind of a bogus assumption from the very get go. Scheduling
in Linux is one of the most unevolved systems in the kernel that still
could go through a large transformation and get big gains like what
we've had over the last few months. This is evident with both schedulers;
both do well and it's a good thing overall that CFS is going in.

Now, the way it happened is completely screwed up in so many ways that I
don't see how folks can miss it. This is not just Ingo versus Con, this
is the current Linux community and how it makes decisions from the top down
and the current cultural attitude towards developers doing things that
are:

1) architecturally significant

for which they will get flamed to death by the established Linux kernel
culture before they can get any users to report bugs after their posting
on lkml.

2) conceptually different

which is subject to the reasons above, but also gets flamed to death unless
it comes from folks internal to the Linux development processes.

When groups get to a certain size like this one has, there needs to be a
revision of development processes so that they can scale and be "inclusive"
in keeping with the overall spirit of the Linux development process. When
that breaks down, we get situations like what we have with Con leaving
development. Other developers like me get turned off by the situation, feel
the same as Con and stop Linux development. That's my current status as well.

> The main problem is clearly that no scheduler was clearly better than the
> other. This remembers me of the LVM2/MD vs EVMS in the 2.5 days - both
> of them were good enought, but only one of them could be merged. The
> difference is that EVMS developers didn't get that annoyed, and not only
> they didn't quit but they continued developing their userspace tools to
> make it work with the solution included in the kernel

That's a good point to have folks not go down that particular path. But
Con was kind of put down during the arguments with Ingo about his
assumptions of the problems and then was personally crapped on by having
his own idea undergo a complete reversal in opinion by Ingo, with
Ingo then doing his own version of Con's work, displacing him.

How would you feel in that situation ? I'd be pretty damn pissed.

[For the record, Peter Zijlstra did the same thing to me, which is annoying,
but since he's my buddy, didn't get as rude as the above situation and included
me in every private mail about his work so that I don't feel like RH
is paying him to undermine my brilliance, it's ok :)]

bill


Re: [ck] Re: Linus 2.6.23-rc1

2007-07-28 Thread hui

On Sat, Jul 28, 2007 at 03:18:24PM -0700, Linus Torvalds wrote:
> I don't think anything was suppressed here.

I disagree. See below.

> You seem to say that more modular code would have helped make for a nicer 
> way to do schedulers, but if so, where were those patches to do that? 
> Con's patches didn't do that either. They just replaced the code.

They replaced code because he would have liked to have taken scheduler code
in possibly a completely different direction. This is a large conceptual
change from what is currently there. That might also mean how the notion of
bandwidth with regards to core frequency might be expressed in the system
with regards to power saving and other things. Things often get dropped
not for purely technical reasons but because of personal preference
and a lack of willingness to ask where this might take us.

The way that Con works and conceptualizes things is quite a bit different
and more comprehensive in a lot of ways compared to how the regular kernel
community operates. He's strong in this area and weak in general kernel
hackery as a function of time and experience. That doesn't mean that he,
his ideas and his code should be subject to an either/or situation with the
scheduler and other ideas that have been rejected by various folks. He
maintained the -ck branch successfully for a long time and is a very capable
developer.

I do acknowledge that having a maintainer that you can trust is more
important, but it should not be exclusionary in this way. I totally
understand his reaction.

> In fact, Ingo's patches _do_ add some modularity, and might make it easier 
> to replace the scheduler. So it would seem that you would argue for CFS, 
> not against it?

It's not the same as sched plugin. Some folks might not like to use the
rbtree that's in place and might want to express things in a completely
different manner. Take, for instance, Tong Li's stuff: with CFS there's a bit
of a conceptual mismatch with his attempt at expressing rebalancing in terms
of expiry rounds, yet it would be more seamlessly integrated with something
like either the old O(1) scheduler or Con's stuff. It's also the only method
posted to lkml that can deal with fairness across SMP situations with low
error. Yet what's happening here is that his implementation is being rejected
for its size and complexity, which stem from a conceptual mismatch in data
structures.

Because of this, his notion of trio as a general method of getting
aggressive group fairness (by far the most complete conceptually on lkml;
over-design is a different topic altogether) may never see the light of
day in Linux because of people's collective lack of foresight.

To answer the question that you posed, no. I'm not arguing against it. I'm
in favor of it going into the kernel like any deadline mechanism since
it can be generalized, but the current development processes in the Linux
kernel should not create an either/or situation with the scheduler code.
There have been multiple rejections of ideas with regards to the scheduler
code over the years that could have taken things in a very different and
possibly completely kick ass direction, and they were suppressed because of
the development attitude of various Linux kernel developers.

It's only because of Con's work that there's all of a sudden a flurry of
development in this area, now that this idea has been shown to be superior,
and even then it's conceptually incomplete and subject to a lot of arbitrary
hacking. This is very different from Con's development style and mine as well.

This is an area that could have been addressed sooner if the general
community admitted that there was a problem earlier and permitted more
conscious and open change. I've seen changes in this area from Con be
rejected time and time again, which affected the technical direction he
originally wanted to take this.

Now, Con might have a communication problem here, but nobody asked him to
clarify what he might have wanted and why, yet folks were very quick at
dismissing him and nitpicking him to death, even when he explained why he
might have wanted a particular change in the first place. This is the
"facilitation" part that's missing in the current kernel culture.

This is a very important idea as the community grows, because I see folks
that are capable of doing work get discouraged and locked out because of
code maintainability issues and an inability to get folks to move in that
direction because of a missing consensus mechanism in the community
other than sucking up to developers.

Con and folks like him should be permitted the opportunity to fail on
their own account. If Linux was truly open, it would have dealt with this
issue by now and there wouldn't be so much flamage from the general
community.

> > I think that's kind of a bogus assumption from the very get go. Scheduling
> > in Linux is one of the most unevolved systems in the kernel that still
> > could go through a large transformation and get big gains like what
> > we've had over the last few months. This evident with both schedulers,

Re: [ck] Re: Linus 2.6.23-rc1

2007-07-29 Thread hui

On Sun, Jul 29, 2007 at 10:25:42PM +0200, Mike Galbraith wrote:
> Absolutely.
> 
> Con quit for his own reasons.  Given that Con himself has said that CFS
> was _not_ why he quite, please discard this... bait.  Anyone who's name
> isn't Con Kolivas, who pretends to speak for him is at the very least
> overstepping his bounds, and that is being _very_ generous.

I know Con personally and I completely identify with his circumstance. He
quit the project precisely because of a generally perceived ignorance of and
disconnect from end users. Since you side with Ingo on many
issues politically, this response from you is no surprise.

Again, the choices that have currently been made with CFS basically lock
him out of development. If you don't understand that, then you don't
understand the technical issues he's struggled to pursue. He has a large
following, which is why this has been a repeated issue between end users
of his tree and a number of current Linux kernel developers.

bill


Re: [ck] Re: Linus 2.6.23-rc1

2007-07-31 Thread hui

On Tue, Jul 31, 2007 at 09:15:17AM +0800, Carlo Florendo wrote:
> And I think you are digressing from the main issue, which is the empirical 
> comparison of SD vs. CFS and to determine which is best.   The root of all 
> the scheduler fuss was the emotional reaction of SD's author on why his 
> scheduler began to be compared with CFS.

A legitimate emotional reaction to being locked out of the development
process. There's a very human aspect to this, yes, a negative human
aspect that pervades Linux kernel development, which is overly defensive
and protective when it comes to new ideas.

> We obviously all saw how the particular authors tried to address the 
> issues.  Ingo tried to address all concerns while Con simply ranted about 
> his scheduler being better.  If this is what you think about being a bit 
> more human, then I think that this has no place in the lkml.

That's highly inaccurate and rather disrespectful of Con's experience.
There was a policy decision made with SD that one person basically didn't
like. This person whined like a baby for a formula bottle and didn't
understand how to use "nice" to control this inherent behavior of the
scheduler.

bill


Re: Linus 2.6.23-rc1

2007-07-31 Thread hui

On Sun, Jul 29, 2007 at 04:18:18PM -0700, Linus Torvalds wrote:
> Ingo posted numbers. Look at those numbers, and then I would suggest some 
> people could seriously consider just shutting up. I've seen too many 
> idiotic people who claim that Con got treated unfairly, without those 
> people admitting that maybe I had a point when I said that we have had a 
> scheduler maintainer for years that actually knows what he's doing.

Here's the problem: *a lot* of folks can do scheduler development inside and
outside the community, so what's with the exclusive attitude towards the
scheduler ?

There's sufficient effort coming from folks working on CFS from many
sources so how's sched-plugin a *threat* to stock kernel scheduler
development if it gets to the main tree as the default compile option ??

Those are the core questions that Con brought up in the APC article. Folks
are angry because nobody central to current Linux development has addressed
them, and has instead focused on a single narrow set of technical issues
to justify a particular set of actions.

I mean, I'm not the only one that has said this so there has to be some
kind of truth behind it.

bill


Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch

2007-01-24 Thread hui

On Wed, Jan 24, 2007 at 12:31:15PM +0100, Ingo Molnar wrote:
> * Bill Huey <[EMAIL PROTECTED]> wrote:
> 
> > Patch here:
> > 
> > 
> > http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.2.lock_stat.patch
> 
> hm, most of the review feedback i gave has not been addressed in the 
> patch above. So i did it myself: find below various fixups for problems 
> i outlined plus for a few other problems as well (ontop of 
> 2.3.lock_stat).

Sorry, I've been silently keeping your suggested changes in my private
repo without announcing it to the world. I'll reply to the old email in
another message at length.

http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.4.lock_stat.patch

> While it's now markedly better it's still not quite mergeable, for 
> example the build fails quite spectacularly if LOCK_STAT is disabled in 
> the .config.

I'll look into it. I've been focused on clean up and a couple of other
things regarding the stability of this patch. Making small changes in it
tends to make the kernel crash hard and I suspect that it's an interaction
problem with lockdep and that I need to turn lockdep off when hitting
"lock stats" locks. I'm going to move to "__raw_..." locks...

Meanwhile please wait until I hand interpret and merge your changes to an
older patch into my latest stuff. If it takes too long, I suggest keeping it
out of the tree for a bit until I finish this round, unless something is
pressing for this to happen now, like a mass change to the spinlock macros
or something. I stalled a bit trying to get Peter Zijlstra an extra feature.

> Also, it would be nice to replace those #ifdef CONFIG_LOCK_STAT changes 
> in rtmutex.c with some nice inline functions that do nothing on 
> !CONFIG_LOCK_STAT.

I'll look into it. Not sure what your choice in style is here and I'm
open to suggestions. I'm also interested in a reduction of #define
identifier length if you or somebody else has some kind of good convention
to suggest.
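
For what it's worth, here is a minimal sketch of the header-stub approach as
I understand the suggestion. Only CONFIG_LOCK_STAT is real; struct
lock_stat_ref and lock_stat_note_contention() are names made up for
illustration and are not taken from the patch.

```c
/*
 * Sketch only: hypothetical names, not the actual patch contents.
 * The caller (e.g. rtmutex.c) would invoke lock_stat_note_contention()
 * unconditionally, and the #ifdef lives in this header alone.
 */
struct lock_stat_ref {
	unsigned long ncontended;	/* hypothetical contention counter */
};

#ifdef CONFIG_LOCK_STAT
static inline void lock_stat_note_contention(struct lock_stat_ref *ref)
{
	ref->ncontended++;
}
#else
/* Does nothing when the option is off, so callers need no #ifdef. */
static inline void lock_stat_note_contention(struct lock_stat_ref *ref)
{
	(void)ref;
}
#endif
```

With the option disabled the call compiles down to nothing, so the call
sites stay free of preprocessor noise.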

> but in general i'm positive about the direction this is heading, it just 
> needs more work.

Sorry for the lag. Trying to juggle this and the current demands of my
employer contributed to it, unfortunately.

bill


Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch

2007-01-24 Thread hui

On Thu, Jan 04, 2007 at 05:46:59AM +0100, Ingo Molnar wrote:
> thanks. It's looking better, but there's still quite a bit of work left:
> 
> there's considerable amount of whitespace noise in it - lots of lines 
> with space/tab at the end, lines with 8 spaces instead of tabs, etc.

These comments from me are before the hand merge I'm going to do tonight.

> comment style issues:
> 
> +/* To be use for avoiding the dynamic attachment of spinlocks at runtime
> + * by attaching it inline with the lock initialization function */
> 
> the proper multi-line style is:
> 
> /*
>  * To be used for avoiding the dynamic attachment of spinlocks at 
>  * runtime by attaching it inline with the lock initialization function:
>  */

I fixed all of those I can find.

> (note i also fixed a typo in the one above)
> 
> more unused code:
> 
> +/*
> +static DEFINE_LS_ENTRY(__pte_alloc);
> +static DEFINE_LS_ENTRY(get_empty_filp);
> +static DEFINE_LS_ENTRY(init_waitqueue_head);
> ...
> +*/

Removed. They are for annotation which isn't important right now.

> +static int lock_stat_inited = 0;
> 
> should not be initialized to 0, that is implicit for static variables.

Removed.

> weird alignment here:
> 
> +void lock_stat_init(struct lock_stat *oref)
> +{
> +   oref->function[0]   = 0;
> +   oref->file  = NULL;
> +   oref->line  = 0;
> +
> +   oref->ntracked  = 0;

I reduced that all to a single space without using huge tabs.

> funky branching:
> 
> +   spin_lock_irqsave(&free_store_lock, flags);
> +   if (!list_empty(&lock_stat_free_store)) {
> +   struct list_head *e = lock_stat_free_store.next;
> +   struct lock_stat *s;
> +
> +   s = container_of(e, struct lock_stat, list_head);
> +   list_del(e);
> +
> +   spin_unlock_irqrestore(&free_store_lock, flags);
> +
> +   return s;
> +   }
> +   spin_unlock_irqrestore(&free_store_lock, flags);
> +
> +   return NULL;
> 
> that should be s = NULL in the function scope and a plain unlock and 
> return s.

I made this change.
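
The shape of that change, as a standalone sketch rather than the patch
itself: store_lock()/store_unlock() are hypothetical stand-ins for the
spin_lock_irqsave()/spin_unlock_irqrestore() pair (they only count nesting
so the balance can be checked), and struct stat_node is a simplified
replacement for the list_head based store.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the spinlock; they only track balance. */
static int store_lock_depth;
static void store_lock(void)   { store_lock_depth++; }
static void store_unlock(void) { store_lock_depth--; }

struct stat_node {
	struct stat_node *next;
};

static struct stat_node *free_store;	/* head of the free list */

/* Pop one node, or return NULL if the store is empty. */
static struct stat_node *free_store_get(void)
{
	struct stat_node *s = NULL;

	store_lock();
	if (free_store) {
		s = free_store;
		free_store = s->next;
		s->next = NULL;
	}
	store_unlock();	/* single unlock on every path */

	return s;
}
```

There is one lock, one unlock and one return, so an early exit can't leave
the lock held.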

> assignments mixed with arithmetics:
> 
> +static
> +int lock_stat_compare_objs(struct lock_stat *x, struct lock_stat *y)
> +{
> +   int a = 0, b = 0, c = 0;
> +
> +   (a = ksym_strcmp(x->function, y->function)) ||
> +   (b = ksym_strcmp(x->file, y->file)) ||
> +   (c = (x->line - y->line));
> +
> +   return a | b | c;
> 
> the usual (and more readable) style is to separate them out explicitly:
> 
>   a = ksym_strcmp(x->function, y->function);
>   if (!a)
>   return 0;
>   b = ksym_strcmp(x->file, y->file);
>   if (!b)
>   return 0;
> 
>   return x->line == y->line;
> 
> (detail: this btw also fixes a bug in the function above AFAICS, in the 
> a && !b case.)

Not sure what you mean here but I made the key comparison so that it would
treat each struct field in most to least significant order evaluation. The
old code worked fine. What you're seeing is the newer stuff.
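
To be clear about the ordering I mean, here is a standalone sketch. It is not
the code in the patch: struct ls_key and ls_key_cmp() are names made up here,
it uses plain strcmp() where the patch uses ksym_strcmp(), and it assumes
both strings are non-NULL.

```c
#include <string.h>

/* Hypothetical key layout: function name, file name, line number. */
struct ls_key {
	const char *function;
	const char *file;
	int line;
};

/*
 * Three-way compare, most significant field first. A later field is
 * consulted only when every earlier field compares equal.
 * Returns <0, 0 or >0 like strcmp().
 */
static int ls_key_cmp(const struct ls_key *x, const struct ls_key *y)
{
	int r;

	r = strcmp(x->function, y->function);
	if (r)
		return r;

	r = strcmp(x->file, y->file);
	if (r)
		return r;

	/* Avoid subtraction so large line values can't overflow. */
	return (x->line > y->line) - (x->line < y->line);
}
```

A comparison of this kind gives a total order on the key, which is what a
tree lookup needs.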

> also, i'm not fully convinced we want that x->function as a string. That 
> makes comparisons alot slower. Why not make it a void *, and resolve to 
> the name via kallsyms only when printing it in /proc, like lockdep does 
> it?

I've made your suggested change, but I'm not done with it.

> 
> no need to put dates into comments:
> 
> +* Fri Oct 27 00:26:08 PDT 2006
> 
> then:
> 
> +   while (node)
> +   {
> 
> proper style is:
> 
> + while (node) {

Done. I misinterpreted the style guide and have made the changes to
conform to it.

> this function definition:
> 
> +static
> +void lock_stat_insert_object(struct lock_stat *o)
> 
> can be single-line. We make it multi-line only when needed.

Done for all the instances I remember off hand.

> these are only samples of the types of style problems still present in 
> the code.

I'm a bit of a space cadet so I might have missed something.

Latest patch here.


http://finfin.is-a-geek.org/~billh/contention/patch-2.6.20-rc2-rt2.4.lock_stat.patch

I'm going to review and hand merge your changes to the older patch tonight.

Thanks for the comments.

bill


[PATCH] lock stat for -rt 2.6.20-rc2-rt2 [was Re: 2.6.19-rt14 slowdown compared to 2.6.19]

2006-12-29 Thread hui

On Tue, Dec 26, 2006 at 04:51:21PM -0800, Chen, Tim C wrote:
> Ingo Molnar wrote:
> > If you'd like to profile this yourself then the lowest-cost way of
> > profiling lock contention on -rt is to use the yum kernel and run the
> > attached trace-it-lock-prof.c code on the box while your workload is
> > in 'steady state' (and is showing those extended idle times):
> > 
> >   ./trace-it-lock-prof > trace.txt
>
> Thanks for the pointer.  Will let you know of any relevant traces.

Tim,

http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.lock_stat.patch

You can also apply this patch to get more precise statistics down to
the lock. For example:

...

[50, 30, 279 :: 1, 0]               {tty_ldisc_try, -, 0}
[5, 5, 0 :: 19, 0]                  {alloc_super, fs/super.c, 76}
[5, 5, 3 :: 1, 0]                   {__free_pages_ok, -, 0}
[5728, 862, 156 :: 2, 0]            {journal_init_common, fs/jbd/journal.c, 667}
[594713, 79020, 4287 :: 60818, 0]   {inode_init_once, fs/inode.c, 193}
[602, 0, 0 :: 1, 0]                 {lru_cache_add_active, -, 0}
[63, 5, 59 :: 1, 0]                 {lookup_mnt, -, 0}
[6425, 378, 103 :: 24, 0]           {initialize_tty_struct, drivers/char/tty_io.c, 3530}
[6708, 1, 225 :: 1, 0]              {file_move, -, 0}
[67, 8, 15 :: 1, 0]                 {do_lookup, -, 0}
[69, 0, 0 :: 1, 0]                  {exit_mmap, -, 0}
[7, 0, 0 :: 1, 0]                   {uart_set_options, drivers/serial/serial_core.c, 1876}
[76, 0, 0 :: 1, 0]                  {get_zone_pcp, -, 0}
[, 5, 9 :: 1, 0]                    {as_work_handler, -, 0}
[8689, 0, 0 :: 15, 0]               {create_workqueue_thread, kernel/workqueue.c, 474}
[89, 7, 6 :: 195, 0]                {sighand_ctor, kernel/fork.c, 1474}
@contention events = 1791177
@found = 21

This is the output from /proc/lock_stat/contention. The first column is the
number of contentions that result in a full block of the task, the second is
the number of times the mutex owner is active on a per cpu run queue of the
scheduler and the third is the number of times Steve Rostedt's ownership
handoff code averted a full block. Peter Zijlstra used it initially during his
files_lock work.
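
If you want to post-process the output, a rough userspace sketch of a parser
for one record follows. It assumes each record sits on a single line in the
layout shown above; struct ls_line, its field names and ls_parse_line() are
made up for this example and are not part of the patch.

```c
#include <stdio.h>

struct ls_line {
	unsigned long blocked;		/* 1st column: full blocks */
	unsigned long owner_running;	/* 2nd: owner active on a run queue */
	unsigned long handoff;		/* 3rd: blocks averted by handoff */
	unsigned long extra[2];		/* 4th and 5th, not described above */
	char function[64];
	char file[128];
	int line;
};

/* Returns 0 on success, -1 if the line doesn't match the layout. */
static int ls_parse_line(const char *s, struct ls_line *out)
{
	int n = sscanf(s,
		       " [%lu, %lu, %lu :: %lu, %lu] {%63[^,], %127[^,], %d}",
		       &out->blocked, &out->owner_running, &out->handoff,
		       &out->extra[0], &out->extra[1],
		       out->function, out->file, &out->line);

	return n == 8 ? 0 : -1;
}
```

Summary lines such as "@found = 21" and records with a missing number are
rejected with -1 rather than guessed at.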

Overhead of the patch is very low since it is only recording stuff in the
slow path of the rt-mutex implementation.

Writing to that file clears all of the stats for a fresh run with a
benchmark. This should give a precise point at which any contention would
happen in -rt. In general, -rt should do about as well as the stock kernel
minus the overhead of interrupt threads.

Since the last release, I've added checks for whether the task is running
as "current" on a run queue to see if adaptive spins would be useful in -rt.

These new stats show that only a small percentage of events would benefit
from the use of adaptive spins in front of an rt-mutex. Any implementation
of it would have little impact on the system. It's not the mechanism but
the raw MP work itself that contributes to the good MP performance of Linux.

Apply and have fun.

bill


Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2 [was Re: 2.6.19-rt14 slowdown compared to 2.6.19]

2006-12-30 Thread hui

On Sat, Dec 30, 2006 at 06:56:08AM -0800, Daniel Walker wrote:
> On Sat, 2006-12-30 at 12:19 +0100, Ingo Molnar wrote:
> 
> > 
> >  - Documentation/CodingStyle compliance - the code is not ugly per se
> >but still looks a bit 'alien' - please try to make it look Linuxish,
> >if i apply this we'll probably stick with it forever. This is the
> >major reason i havent applied it yet.
> 
> I did some cleanup while reviewing the patch, nothing very exciting but
> it's an attempt to bring it more into the "Linuxish" scope .. I didn't
> compile it so be warned.
> 
> There lots of ifdef'd code under CONFIG_LOCK_STAT inside rtmutex.c I
> suspect it would be a benefit to move that all into a header and ifdef
> only in the header .

Ingo and Daniel,

I'll try and make it more Linuxish. It's one of the reasons why I posted
it since I knew it would need some kind of help in that arena and I've
been in need of feedback regarding it. Originally, I picked a style that
made what I was doing extremely obvious and clear to facilitate
development which is the rationale behind it.

I'll make those changes and we can progressively pass it back and forth
to see if this passes.

bill


Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2 [was Re: 2.6.19-rt14 slowdown compared to 2.6.19]

2007-01-02 Thread hui

On Tue, Jan 02, 2007 at 02:51:05PM -0800, Chen, Tim C wrote:
> Bill,
> 
> I'm having some problem getting this patch to run stablely.  I'm
> encoutering errors like that in the trace that follow:
> 
> Thanks.
> Tim
> 
> Unable to handle kernel NULL pointer dereference at 0008

Yes, those are the reason why I have some aggressive asserts in the code
to try to track down the problem. Try this:

http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.1.lock_stat.patch

It's got some cosmetic clean up in it to make it more Linux-ish instead
of me trying to reinvent some kind of Smalltalk system in the kernel. I'm
trying to address all of Ingo's complaints about the code so it's still a
work in progress, namely the style issues (I'd like help/suggestions on
that) and assert conventions.

It might be the case that the lock isn't known to the lock stats code yet.
It's got some technical overlap with lockdep in that a lock might not be
known yet and is causing a crash.

Try that patch and report back to me what happens.

bill


Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2 [was Re: 2.6.19-rt14 slowdown compared to 2.6.19]

2007-01-02 Thread hui

On Tue, Jan 02, 2007 at 03:12:34PM -0800, Bill Huey wrote:
> On Tue, Jan 02, 2007 at 02:51:05PM -0800, Chen, Tim C wrote:
> > Bill,
> > 
> > I'm having some problem getting this patch to run stablely.  I'm
> > encoutering errors like that in the trace that follow:
> 
> It might the case that the lock isn't know to the lock stats code yet.
> It's got some technical overlap with lockdep in that a lock might not be
> known yet and is causing a crashing.

The stack trace and code examination reveal that if that structure in the
timer code is used before it's initialized by the CPU bringup, it'll
cause problems like that crash. I'll look at it later on tonight.

bill


[PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch

2007-01-02 Thread hui

On Sat, Dec 30, 2006 at 12:19:40PM +0100, Ingo Molnar wrote:
> your patch looks pretty ok to me in principle. A couple of suggestions 
> to make it more mergable:
> 
>  - instead of BUG_ON()s please use DEBUG_LOCKS_WARN_ON() and make sure 
>the code is never entered again if one assertion has been triggered.
>Pass down a return result of '0' to signal failure. See
>kernel/lockdep.c about how to do this. One thing we dont need are
>bugs in instrumentation bringing down a machine.

I'm using non-fatal error checking instead of BUG_ON. BUG_ON was a more
aggressive way that I used to find problems initially.
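
Roughly the pattern being asked for, as a userspace sketch: warn once, switch
the instrumentation off, and pass 0 back to signal failure. ls_warn_on(),
ls_enabled, struct ls_counter and ls_record() are all hypothetical names for
this example; the real DEBUG_LOCKS_WARN_ON() lives in the kernel and is not
reproduced here.

```c
#include <stdio.h>

static int ls_enabled = 1;	/* cleared on the first failed check */

/*
 * Hypothetical stand-in for the DEBUG_LOCKS_WARN_ON() idea: warn once,
 * switch the instrumentation off, and evaluate to true on failure.
 */
static int ls_warn_on(int cond, const char *what)
{
	if (!cond)
		return 0;
	if (ls_enabled) {
		ls_enabled = 0;
		fprintf(stderr, "lock_stat: check failed: %s\n", what);
	}
	return 1;
}

struct ls_counter {
	unsigned long events;
};

/* Returns 1 if the event was recorded, 0 on failure or when disabled. */
static int ls_record(struct ls_counter *c)
{
	if (!ls_enabled)
		return 0;
	if (ls_warn_on(c == NULL, "counter is NULL"))
		return 0;

	c->events++;
	return 1;
}
```

After the first failed check the code path is never entered again, so a bug
in the instrumentation costs you the statistics but not the machine.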

>  - remove dead (#if 0) code

Done.

>  - Documentation/CodingStyle compliance - the code is not ugly per se
>but still looks a bit 'alien' - please try to make it look Linuxish,
>if i apply this we'll probably stick with it forever. This is the
>major reason i havent applied it yet.

I reformatted most of the patch to be 80 column limited. I simplified a
number of names, but I'm open to suggestions and patches on how to go
about this. Much of this code was a style experiment, but now I have to
make this more mergeable.

>  - the xfs/wrap_lock change looks bogus - the lock is initialized
>already. What am i missing?

Correct. This has been removed.

I've applied Daniel Walker's changes as well.

Patch here:


http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.2.lock_stat.patch

bill


Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch

2007-01-03 Thread hui

On Wed, Jan 03, 2007 at 03:59:28PM -0800, Chen, Tim C wrote:
> Bill Huey (hui) wrote:
> http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.2.lock_stat.patch
> 
> This version is much better and ran stablely.  
> 
> If I'm reading the output correctly, the locks are listed by 
> their initialization point (function, file and line # that a lock is
> initialized).  
> That's good information to identify the lock.  

Yes, that's correct.

Good to know that. What did the output reveal ?

It can be extended by pid/futex for userspace apps, but that has yet to be
done. It might require changes to glibc or some kind of dynamic tracing to
communicate to kernel space information about that lock. There are other
kernel uses as well. It's just a basic mechanism for a variety of uses.
This patch has some LTT and Dtrace-isms to it.

What's your intended use again, summarized ? futex contention ? I'll read
the first posting again.

> However, it will be more useful if there is information about where the
> locking
> was initiated from and who was trying to obtain the lock.

It would add quite a bit more overhead, but it could be done with lockdep
directly, I believe, in conjunction with this patch. However, it should be
specific enough that a kernel code examination at the key points
of all users of the lock would show where the problem places are as well
as the users.

bill


Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch

2007-01-03 Thread hui

On Wed, Jan 03, 2007 at 04:25:46PM -0800, Chen, Tim C wrote:
> Earlier I used latency_trace and figured that there was read contention
> on mm->mmap_sem during call to _rt_down_read by java threads
> when I was running volanomark.  That caused the slowdown of the rt
> kernel
> compared to non-rt kernel.  The output from lock_stat confirm
> that mm->map_sem was indeed the most heavily contended lock.

Can you sort the output ("sort -n" or whatever) and post it without the
zeroed entries ?

I'm curious about how that statistical spike compares to the rest of the
system activity. I'm sure that'll get the attention of Peter as well and
maybe he'll do something about it ? :)

bill


Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch

2007-01-03 Thread hui

On Wed, Jan 03, 2007 at 04:46:37PM -0800, Chen, Tim C wrote:
> Bill Huey (hui) wrote:
> > Can you sort the output ("sort -n" what ever..) and post it without
> > the zeroed entries ?
> > 
> > I'm curious about how that statistical spike compares to the rest of
> > the system activity. I'm sure that'll get the attention of Peter as
> > well and maybe he'll do something about it ? :)
... 
> @contention events = 247149
> @failure_events = 146
> @lookup_failed_scope = 175
> @lookup_failed_static = 43
> @static_found = 16
> [1, 113, 77 -- 32768, 0]                {tcp_init, net/ipv4/tcp.c, 2426}
> [2, 759, 182 -- 1, 0]                   {lock_kernel, -, 0}
> [13, 0, 7 -- 4, 0]                      {kmem_cache_free, -, 0}
> [25, 3564, 9278 -- 1, 0]                {lock_timer_base, -, 0}
> [56, 9528, 24552 -- 3, 0]               {init_timers_cpu, kernel/timer.c, 1842}
> [471, 52845, 17682 -- 10448, 0]         {sock_lock_init, net/core/sock.c, 817}
> [32251, 9024, 242 -- 256, 0]            {init, kernel/futex.c, 2781}
> [173724, 11899638, 9886960 -- 11194, 0] {mm_init, kernel/fork.c, 369}

Thanks, the numbers look a bit weird in that the first column should
have a bigger number of events than the second column since it is a
special case subset. Looking at the lock_stat_note() code should show
that to be the case. Did you make a change to the output ?

I can't tell which are "steal", actively running or overall contention
stats against the lock from your output.

Thanks

bill


Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch

2007-01-03 Thread hui

On Wed, Jan 03, 2007 at 05:00:49PM -0800, Bill Huey wrote:
> On Wed, Jan 03, 2007 at 04:46:37PM -0800, Chen, Tim C wrote:
> > @contention events = 247149
> > @failure_events = 146
> > @lookup_failed_scope = 175
> > @lookup_failed_static = 43
> > @static_found = 16
> > [1, 113, 77 -- 32768, 0]{tcp_init, net/ipv4/tcp.c, 2426}
> > [2, 759, 182 -- 1, 0]   {lock_kernel, -, 0}
> > [13, 0, 7 -- 4, 0]  {kmem_cache_free, -, 0}
> > [25, 3564, 9278 -- 1, 0]{lock_timer_base, -, 0}
> > [56, 9528, 24552 -- 3, 0]   {init_timers_cpu, kernel/timer.c, 1842}
> > [471, 52845, 17682 -- 10448, 0] {sock_lock_init, net/core/sock.c, 817}
> > [32251, 9024, 242 -- 256, 0]{init, kernel/futex.c, 2781}
> > [173724, 11899638, 9886960 -- 11194, 0] {mm_init, kernel/fork.c, 369}
> 
> Thanks, the numbers look a bit weird in that the first column should
> have a bigger number of events than that second column since it is a
> special case subset. Looking at the lock_stat_note() code should show
> that to be the case. Did you make a change to the output ?
> 
> I can't tell which are "steal", actively running or overall contention
> stats against the lock from your output.

Also, the output is weird in that the "contention_events" should be
a total of all of the events in the first three columns. Clearly the
number is wrong and I don't know if the text output was mangled or
if that's accurate and my code is buggy. I have yet to see it give
me data like that in my setup.

The fourth and fifth columns are the number of times this lock was
initialized inline by a function like spin_lock_init(). It might
have a correspondence to clone() calls.

bill


Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch

2007-01-03 Thread hui

On Wed, Jan 03, 2007 at 05:11:04PM -0800, Chen, Tim C wrote:
> Bill Huey (hui) wrote:
> > 
> > Thanks, the numbers look a bit weird in that the first column should
> > have a bigger number of events than that second column since it is a
> > special case subset. Looking at the lock_stat_note() code should show
> > that to be the case. Did you make a change to the output ?
> 
> No, I did not change the output.  I did reset to the contention content
> 
> by doing echo "0" > /proc/lock_stat/contention.
> 
> I noticed that the first column get reset but not the second column. So
> the reset code probably need to be checked.

This should have the fix. 

http://mmlinux.sf.net/public/patch-2.6.20-rc2-rt2.3.lock_stat.patch

If you can rerun it and post the results, it'll hopefully show the behavior 
of that lock acquisition better.

Thanks

bill


Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch

2007-01-04 Thread hui

On Wed, Jan 03, 2007 at 06:14:11PM -0800, Chen, Tim C wrote:
> Bill Huey (hui) wrote:
> http://mmlinux.sf.net/public/patch-2.6.20-rc2-rt2.3.lock_stat.patch
> > If you can rerun it and post the results, it'll hopefully show the
> > behavior of that lock acquisition better.
> 
> Here's the run with fix to produce correct statistics.
> 
> Tim
> 
> @contention events = 848858
> @failure_events = 10
> @lookup_failed_scope = 175
> @lookup_failed_static = 47
> @static_found = 17
...
> [112584, 150, 6 -- 256, 0]  {init, kernel/futex.c, 2781}
> [597012, 183895, 136277 -- 9546, 0] {mm_init, kernel/fork.c, 369}

Interesting. The second column means that those can be adaptively spun
on to prevent the blocking from happening. That's roughly 1/3rd of the
blocking events that happen (second/first). Something like that would
help out, but the real problem is the contention on that lock in the
first place.
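
For reference, the "second/first" arithmetic can be written down as a tiny
helper. This is only a sketch; the function name is made up and the inputs
are the mm_init row from the output quoted above.

```c
/*
 * Fraction of the blocking events (first column) that the second
 * column accounts for. Returns 0.0 when there were no events at all.
 */
static double spin_fraction(unsigned long first, unsigned long second)
{
	if (first == 0)
		return 0.0;
	return (double)second / (double)first;
}
```

For the mm_init row, spin_fraction(597012, 183895) comes out at a bit
under 0.31, which is the "roughly 1/3rd" figure.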

Also, Linux can do a hell of a lot of context switches per second. Is
the number of total contentions (top figure) in that run consistent with
the performance degradation? And how much would reducing those events
by 1/3rd help out with the benchmark? Those are the questions in
my mind at this moment.

bill


Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3

2007-02-26 Thread hui

On Mon, Feb 26, 2007 at 09:35:43PM +0100, Ingo Molnar wrote:
> * Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> 
> > If kernelspace rescheduling is that fast, then please explain me why 
> > userspace one always beats kernel/userspace?
> 
> because 'user space scheduling' makes no sense? I explained my thinking 
> about that in a past mail:
> 
> -->
> One often repeated (because pretty much only) performance advantage of 
> 'light threads' is context-switch performance between user-space 
> threads. But reality is, nobody /cares/ about being able to 
> context-switch between "light user-space threads"! Why? Because there 
> are only two reasons why such a high-performance context-switch would 
> occur:
... 
>  2) there has been an IO event. The thing is, for IO events we enter the
> kernel no matter what - and we'll do so for the next 10 years at
> minimum. We want to abstract away the hardware, we want to do
> reliable resource accounting, we want to share hardware resources,
> we want to rate-limit, etc., etc. While in /theory/ you could handle
> IO purely from user-space, in practice you dont want to do that. And
> if we accept the premise that we'll enter the kernel anyway, there's
> zero performance difference between scheduling right there in the
> kernel, or returning back to user-space to schedule there. (in fact
> i submit that the former is faster). Or if we accept the theoretical
> possibility of 'perfect IO hardware' that implements /all/ the
> features that the kernel wants (in a secure and generic way, and
> mind you, such IO hardware does not exist yet), then /at most/ the
> performance advantage of user-space doing the scheduling is the
> overhead of a null syscall entry. Which is a whopping 100 nsecs on
> modern CPUs! That's roughly the latency of a /single/ DRAM access!

Ingo and Evgeniy,

I was trying to avoid getting into this discussion, but whatever. M:N
threading systems also require just about all of the threading semantics
that are inside the kernel to be available in userspace. Implementations
of the userspace scheduler side of things must be able to turn off
preemption to do per-CPU local storage, report blocking/preemption (via
an upcall or a mailbox) and do other scheduler-ish things in a reliable
way. The complexity of a system like that ends up not being worth it,
and it is often monstrously large to implement and debug. That's why
Solaris 10 removed their scheduler activations framework and went with
1:1 like in Linux, since the scheduler activations model is so difficult
to control. The slowness of the futex stuff might be compounded by some
VM mapping issues that Bill Irwin and Peter Zijlstra have pointed out in
the past, if I understand correctly.

Bryan Cantrill of Solaris 10/DTrace fame can comment on that if you ask
him sometime.

For an exercise, think about all of the things you need in order to
either migrate a task or do a cross-CPU wake of one. It goes to hell in
complexity really quickly. Erlang and other language-based concurrency
systems get their regularities by indirectly oversimplifying what
threading is compared to what kernel folks are used to. Try doing a
cross-CPU wake quickly on a system like that, good luck. Now think about
how to do an IPI in userspace? Good luck.

That's all :)

bill


Re: [PATCH] remove sb->s_files and file_list_lock usage in dquot.c

2007-02-08 Thread hui

On Tue, Feb 06, 2007 at 02:23:33PM +0100, Christoph Hellwig wrote:
> Iterate over sb->s_inodes instead of sb->s_files in add_dquot_ref.
> This reduces list search and lock hold time aswell as getting rid of
> one of the few uses of file_list_lock which Ingo identified as a
> scalability problem.

Christoph,

The i_mutex lock in the inode structure is also a source of heavy
contention when running a lot of parallel "find"s. I'm sure that folks
would be open to hearing suggestions regarding how to fix that.

bill


Re: [PATCH] remove sb->s_files and file_list_lock usage in dquot.c

2007-02-08 Thread hui

On Thu, Feb 08, 2007 at 01:01:21AM -0800, Bill Huey wrote:
> Christoph,
> 
> The i_mutex lock the inode structure is also a source of contention
> heavy when running a lot of parallel "find"s. I'm sure that folks
> would be open to hearing suggestions regarding how to fix that.

Christoph,

And while you're at it, you should also know that dcache_lock is next
in line to be nixed out of existence if possible.

i_mutex is a bitch and I'm not even going to think about how to get
rid of it since it's so widely used in many places (file systems aren't
my thing either). Maybe some more precise tracking of contention paths
would be useful to see if there's a pathological case creating a
cascade of contention events so that it can be nixed, don't know.

About 1/10th of the lock stat events I've logged report that the owner
of the rtmutex is the "current" task on a runqueue somewhere. An adaptive
lock would help with those contention events (spin for it while the
owner is running, waiting for the mutex release) but really, the
contention should be avoided in the first place since it has a kind of
(I think) polynomial increase in contention time as you add more
processors to the mix.
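
The adaptive idea boils down to one decision at the contention point. The
following is a rough userspace sketch of that decision only, not code from
my tree: the demo_* types and the on_cpu flag are stand-ins for whatever
the real rtmutex and scheduler structures provide.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; on_cpu means "running right now". */
struct demo_task {
	bool on_cpu;
};

/* Hypothetical stand-in for an rtmutex with an owner pointer. */
struct demo_mutex {
	struct demo_task *owner;
};

enum wait_action {
	WAIT_NONE,	/* lock is free, just take it */
	WAIT_SPIN,	/* owner is running: release is likely soon */
	WAIT_BLOCK,	/* owner is asleep: spinning would burn CPU */
};

/* Decide what a contending task should do with this lock. */
static enum wait_action adaptive_wait_action(const struct demo_mutex *m)
{
	if (m->owner == NULL)
		return WAIT_NONE;
	return m->owner->on_cpu ? WAIT_SPIN : WAIT_BLOCK;
}
```

Only the WAIT_SPIN case is helped by an adaptive lock, which is the
roughly 1/10th of events mentioned above.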

I have an adaptive lock implementation in my tree that eliminates
the contention between what looks like the IDE layer, work queues and
the anticipatory scheduler, but that's not a real problem unlike what
I've mentioned above. I can get you and others more specifics on the
problem if folks working on the lower layers want it.

Other than that the -rt patch does quite well with instrumenting all
sort of kernel behaviors that include contention and latency issues.

So I don't need to tell you how valuable the -rt patch is for these
issues since it's obvious, and I'm sure that you'll agree, that it's
been instrumental at discovering many problems with the stock kernel.

bill


Re: [PATCH] remove sb->s_files and file_list_lock usage in dquot.c

2007-02-08 Thread hui

On Thu, Feb 08, 2007 at 11:14:04PM -0800, Bill Huey wrote:
> I have an adaptive lock implementation in my tree that eliminates
> the contention between what looks like the IDE layer, work queues and
> the anticipatory scheduler, but that's not a real problem unlike what
> I've mentioned above. I can get you and others more specifics on the
> problem if folks working on the lower layers want it.

Correction, it eliminates all of the "live" contentions where the
mutex owner isn't asleep already.

bill


Re: [PATCH 4/7] fs: break the file_list_lock for sb->s_files

2007-01-28 Thread hui

On Sun, Jan 28, 2007 at 03:30:06PM +, Christoph Hellwig wrote:
> On Sun, Jan 28, 2007 at 04:21:06PM +0100, Ingo Molnar wrote:
> > > > sb->s_files is converted to a lock_list. furthermore to prevent the 
> > > > lock_list_head of getting too contended with concurrent add 
> > > > operations the add is buffered in per cpu filevecs.
> > > 
> > > NACK.  Please don't start using lockdep internals in core code.
> > 
> > what do you mean by that?
> 
> struct lock_list is an lockdep implementation detail and should not
> leak out and be used anywhere else.   If we want something similar it
> should be given a proper name and a header of it's own, but I don't
> think it's a valueable abstraction for the core kernel.

Christoph,

"lock list" has nothing to do with lockdep. It's a relatively new
data type used to construct concurrent linked lists using a spinlock
per entry.
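
To illustrate what "a spinlock per entry" means, here is a small userspace
sketch using C11 atomics. It is not the lock_list code from the patch
series; the names are invented and the hand-over-hand traversal is just
one plausible way to use per-entry locks.

```c
#include <stdatomic.h>
#include <stddef.h>

/* A list entry that carries its own tiny spinlock. */
struct entry {
	atomic_flag lock;
	struct entry *next;
	int value;
};

static void entry_init(struct entry *e, int value)
{
	atomic_flag_clear(&e->lock);
	e->next = NULL;
	e->value = value;
}

static void entry_lock(struct entry *e)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&e->lock,
						 memory_order_acquire))
		;
}

static void entry_unlock(struct entry *e)
{
	atomic_flag_clear_explicit(&e->lock, memory_order_release);
}

/* Insert new_entry right after pos; only pos needs to be locked. */
static void list_add_after(struct entry *pos, struct entry *new_entry)
{
	entry_lock(pos);
	new_entry->next = pos->next;
	pos->next = new_entry;
	entry_unlock(pos);
}

/*
 * Count the entries after the head sentinel. The next entry is locked
 * before the current one is released (hand-over-hand), so no single
 * global lock is ever held.
 */
static size_t list_count(struct entry *head)
{
	size_t n = 0;
	struct entry *cur = head;

	entry_lock(cur);
	while (cur->next != NULL) {
		struct entry *next = cur->next;

		entry_lock(next);
		entry_unlock(cur);
		cur = next;
		n++;
	}
	entry_unlock(cur);
	return n;
}
```

Two adders working on different parts of the list only contend when they
touch the same entry, which is the point of the data type.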

bill


Re: lockmeter

2007-01-28 Thread hui

On Sun, Jan 28, 2007 at 09:38:16AM -0800, Martin J. Bligh wrote:
> Christoph Hellwig wrote:
> >On Sun, Jan 28, 2007 at 08:52:25AM -0800, Martin J. Bligh wrote:
> >>Mmm. not wholly convinced that's true. Whilst i don't have lockmeter
> >>stats to hand, the heavy time in __d_lookup seems to indicate we may
> >>still have a problem to me. I guess we could move the spinlocks out
> >>of line again to test this fairly easily (or get lockmeter upstream).
> >
> >We definitly should get lockmeter in.  Does anyone volunteer for doing
> >the cleanup and merged?
> 
> On second thoughts .. I don't think it'd actually work for this since
> the locks aren't global. Not that it shouldn't be done anyway, but ...
> 
> ISTR we still thought dcache scalability was a significant problem last
> time anyone looked at it seriously - just never got fixed. Dipankar?

My lock stat stuff shows dcache to be a problem under -rt as well. It
is keyed off the same mechanism as lockdep. It's pretty heavily hit
under even normal loads relative to other kinds of lock overhead, even
for casual file operations on a 2x system. I can't imagine how lousy
it's going to be under real load on an 8x or higher machine.

However, this patch is -rt only and spinlock times are meaningless under
it because of preemptibility.

bill


Re: lockmeter

2007-01-28 Thread hui

On Sun, Jan 28, 2007 at 10:17:05PM +0100, Ingo Molnar wrote:
> btw., while my plan is to prototype your lock-stat patch in -rt 
> initially, it should be doable to extend it to be usable with the 
> upstream kernel as well.
> 
> We can gather lock contention events when there is spinlock debugging 
> enabled, from lib/spinlock_debug.c. For example __spin_lock_debug() does 
> this:
> 
> static void __spin_lock_debug(spinlock_t *lock)
> {
> ...
> for (i = 0; i < loops; i++) {
> if (__raw_spin_trylock(&lock->raw_lock))
> return;
> __delay(1);
> }
> 
> where the __delay(1) call is done do we notice contention - and there 
> you could drive the lock-stat code. Similarly, rwlocks have such natural 
> points too where you could insert a lock-stat callback without impacting 
> performance (or the code) all that much. mutexes and rwsems have natural 
> contention points too (kernel/mutex.c:__mutex_lock_slowpath() and 
> kernel/rwsem.c:rwsem_down_failed_common()), even with mutex debugging is 
> off.
> 
> for -rt the natural point to gather contention events is in 
> kernel/rtmutex.c, as you are doing it currently.
> 
> finally, you can enable lockdep's initialization/etc. wrappers so that 
> infrastructure between lockdep and lock-stat is shared, but you dont 
> have to call into the lock-tracking code of lockdep.c if LOCK_STAT is 
> enabled and PROVE_LOCKING is disabled. That should give you the lockdep 
> infrastructure for LOCK_STAT, without the lockdep overhead.
> 
> all in one, one motivation behind my interest in your patch for -rt is 
> that i think it's useful for upstream too, and that it can merge with 
> lockdep to a fair degree.

Fantastic. I'm going to try and finish up your suggested changes tonight
and get it to work with CONFIG_LOCK_STAT off. It's been challenging to
find time to do Linux these days, so I don't mind handing it off to you
after this point so that you can tweak it to your heart's content.
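
As a rough picture of the contention hook described in the quoted text,
here is a userspace sketch of a bounded trylock loop that notes contention
once, at the point where the first trylock fails. It is an assumption-laden
illustration: the demo_* names and the "contended" counter are invented,
and it uses C11 atomics rather than the kernel's raw spinlock primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_spinlock {
	atomic_flag held;
	atomic_ulong contended;	/* hypothetical lock-stat counter */
};

static void demo_spin_init(struct demo_spinlock *l)
{
	atomic_flag_clear(&l->held);
	atomic_init(&l->contended, 0);
}

static bool demo_spin_trylock(struct demo_spinlock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

static void demo_spin_unlock(struct demo_spinlock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Try up to "loops" times. The first failed trylock is the natural
 * contention point, so that is where the statistic is recorded.
 * Returns true if the lock was acquired within the budget.
 */
static bool demo_spin_lock_bounded(struct demo_spinlock *l,
				   unsigned int loops)
{
	bool noted = false;
	unsigned int i;

	for (i = 0; i < loops; i++) {
		if (demo_spin_trylock(l))
			return true;
		if (!noted) {
			atomic_fetch_add(&l->contended, 1);
			noted = true;
		}
	}
	return false;
}
```

The uncontended path pays nothing beyond the trylock itself, which is why
hooking at this point should not hurt performance much.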

Yeah, one of the major motivations behind it was to see if Solaris-style
locks were useful and to either validate or invalidate their usefulness.
Because of this patch, we have an idea of what's going on with regard to
adaptive spinning and such. The sensible conclusion is that the key is
not the sophistication of the lock, but parallelizing the code to
prevent the contention in the first place.

I forward merged it into rc6-rt2 and you can expect a drop tonight of
what I have regardless of whether it's perfect or not.

bill


Re: lockmeter

2007-01-29 Thread hui

On Sun, Jan 28, 2007 at 09:27:45PM -0800, Bill Huey wrote:
> On Sun, Jan 28, 2007 at 10:17:05PM +0100, Ingo Molnar wrote:
> > btw., while my plan is to prototype your lock-stat patch in -rt 
> > initially, it should be doable to extend it to be usable with the 
> > upstream kernel as well.
...
> Fantastic. I'm going to try and finish up your suggested changes tonight
> and get it to work with CONFIG_LOCK_STAT off. It's been challenging to
> find time to do Linux these days, so I don't mind handing it off to you
> after this point so that you and tweek it to your heart's content.

Ingo,

Got it.

http://mmlinux.sourceforge.net/public/patch-2.6.20-rc6-rt2.1.lock_stat.patch

This compiles and runs with the CONFIG_LOCK_STAT option turned off now
and I believe it addresses all of the concerns that you mentioned. I
could have missed a detail here and there, but I think it's in pretty
good shape now.

bill


[PATCH 0/4] lock stat for 2.6.19-rt1

2006-12-03 Thread hui

The second is the number of times this lock object was initialized. The third
is the annotation scheme that directly attaches the lock object (spinlock,
etc..) in line with the function initializer to avoid the binary tree lookup.

This shows a contention issue with inode access around what looks like a lock
around a tree structure (inode_init_once @ lines 193/196) as well as lower
layers of the IO system around the anticipatory scheduler and interrupt
handlers. Keep in mind this is -rt so we're using interrupt threads here.

This result is a combination of a series of normal "find" loads, a result of
the machine being idle overnight in a 2x AMD processor setup. I'm sure that
bigger machines from the IBM folks and others should generate more
interesting loads.

Thanks to all of the folks on #offtopic2 (Zwane Mwaikambo, nikita, Rik van Riel,
the ever increasingly bitter Con Kolivas, Bill Irwin, etc...) for talking me
through early stages of this process and general support.

I'm requesting review, comments and inclusion of this into the -rt series of
patches as well as general help.

I've CCed folks that might be interested in this patch outside of the normal
core -rt that I thought might find this interesting or have been friendly with
me about -rt in the past.

bill


--- include/linux/lock_stat.h   5e0836a9785a182c8954c3bee8a92e63dd61602b
+++ include/linux/lock_stat.h   5e0836a9785a182c8954c3bee8a92e63dd61602b
@@ -0,0 +1,144 @@
+/*
+ * By Bill Huey (hui) at <[EMAIL PROTECTED]>
+ *
+ * Release under the what ever the Linux kernel chooses for a
+ * license, GNU Public License GPL v2
+ *
+ * Tue Sep  5 17:27:48 PDT 2006
+ * Created lock_stat.h
+ *
+ * Wed Sep  6 15:36:14 PDT 2006
+ * Thinking about the object lifetime of a spinlock. Please refer to
+ * comments in kernel/lock_stat.c instead.
+ *
+ */
+
+#ifndef LOCK_STAT_H
+#define LOCK_STAT_H
+
+#ifdef CONFIG_LOCK_STAT
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+typedef struct lock_stat {
+   char function[KSYM_NAME_LEN];
+   int line;
+   char *file;
+
+   atomic_t ncontended;
+   unsigned int ntracked;
+   atomic_t ninlined;
+
+   struct rb_node rb_node;
+   struct list_head list_head;
+} lock_stat_t;
+
+typedef lock_stat_t *lock_stat_ref_t;
+
+#define LOCK_STAT_INIT(field)
+#define LOCK_STAT_INITIALIZER(field) { \
+   __FILE__, __FUNCTION__, __LINE__,   \
+   ATOMIC_INIT(0), LIST_HEAD_INIT(field)}
+
+#define LOCK_STAT_NOTE __FILE__, __FUNCTION__, __LINE__
+#define LOCK_STAT_NOTE_VARS _file, _function, _line
+#define LOCK_STAT_NOTE_PARAM_DECL  const char *_file,  \
+   const char *_function,  \
+   int _line
+
+#define __COMMA_LOCK_STAT_FN_DECL  , const char *_function
+#define __COMMA_LOCK_STAT_FN_VAR   , _function
+#define __COMMA_LOCK_STAT_NOTE_FN  , __FUNCTION__
+
+#define __COMMA_LOCK_STAT_NOTE , LOCK_STAT_NOTE
+#define __COMMA_LOCK_STAT_NOTE_VARS , LOCK_STAT_NOTE_VARS
+#define __COMMA_LOCK_STAT_NOTE_PARAM_DECL , LOCK_STAT_NOTE_PARAM_DECL
+
+
+#define __COMMA_LOCK_STAT_NOTE_FLLN_DECL , const char *_file, int _line
+#define __COMMA_LOCK_STAT_NOTE_FLLN , __FILE__, __LINE__
+#define __COMMA_LOCK_STAT_NOTE_FLLN_VARS , _file, _line
+
+#define __COMMA_LOCK_STAT_INITIALIZER  , .lock_stat = NULL,
+
+#define __COMMA_LOCK_STAT_IP_DECL  , unsigned long _ip
+#define __COMMA_LOCK_STAT_IP   , _ip
+#define __COMMA_LOCK_STAT_RET_IP   , (unsigned long) __builtin_return_address(0)
+
+extern void lock_stat_init(struct lock_stat *ls);
+extern void lock_stat_sys_init(void);
+
+#define lock_stat_is_initialized(o) ((unsigned long) (*o)->file)
+
+extern void lock_stat_note_contention(lock_stat_ref_t *ls, unsigned long ip);
+extern void lock_stat_print(void);
+extern void lock_stat_scoped_attach(lock_stat_ref_t *_s, LOCK_STAT_NOTE_PARAM_DECL);
+
+#define ksym_strcmp(a, b) strncmp(a, b, KSYM_NAME_LEN)
+#define ksym_strcpy(a, b) strncpy(a, b, KSYM_NAME_LEN)
+#define ksym_strlen(a) strnlen(a, KSYM_NAME_LEN)
+
+/*
+static inline char * ksym_strdup(const char *a)
+{
+   char *s = (char *) kmalloc(ksym_strlen(a), GFP_KERNEL);
+   return strncpy(s, a, KSYM_NAME_LEN);
+}
+*/
+
+#define LS_INIT(name, h) { \
+   /*.function,*/ .file = h, .line = 1,\
+   .ntracked = 0, .ncontended = ATOMIC_INIT(0),\
+   .list_head = LIST_HEAD_INIT(name.list_head),\
+   .rb_node.rb_left = NULL, .rb_node.rb_right = NULL \
+   }
+
+#define DECLARE_LS_ENTRY(name) \
+   extern struct lock_stat _lock_stat_##name##_entry
+
+/* char _##name##_string[] = #name;\

[PATCH 1/4] lock stat for 2.6.19-rt1

2006-12-03 Thread hui


This hooks into the preexisting lock definitions in the -rt kernel and
hijacks parts of lockdep for the object hash key.
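
For anyone puzzled by the __COMMA_* macros in the diff, the trick is that
they expand to ", extra, args" when the option is on and to nothing when
it is off, so one prototype and one call site serve both configurations.
A stripped-down userspace sketch follows; DEMO_LOCK_STAT stands in for
CONFIG_LOCK_STAT and every DEMO_/demo_ name is invented for illustration.

```c
#include <stddef.h>

#define DEMO_LOCK_STAT 1	/* stand-in for CONFIG_LOCK_STAT */

struct demo_mutex {
	int locked;
#if DEMO_LOCK_STAT
	const char *init_file;	/* where the lock was initialized */
	int init_line;
#endif
};

#if DEMO_LOCK_STAT
/* Expand to extra parameters and arguments, leading comma included. */
# define DEMO_COMMA_NOTE_PARAM_DECL	, const char *_file, int _line
# define DEMO_COMMA_NOTE		, __FILE__, __LINE__
#else
/* Expand to nothing, so the extra arguments vanish entirely. */
# define DEMO_COMMA_NOTE_PARAM_DECL
# define DEMO_COMMA_NOTE
#endif

static void demo_mutex_init_impl(struct demo_mutex *m
				 DEMO_COMMA_NOTE_PARAM_DECL)
{
	m->locked = 0;
#if DEMO_LOCK_STAT
	m->init_file = _file;
	m->init_line = _line;
#endif
}

/* Callers never see the extra arguments. */
#define demo_mutex_init(m) demo_mutex_init_impl((m) DEMO_COMMA_NOTE)
```

Setting DEMO_LOCK_STAT to 0 compiles the same call sites with the plain
one-argument form, which is how the patch tries to keep the non-stat
build unchanged.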

bill


--- include/linux/mutex.h   d231debc2848a8344e1b04055ef22e489702e648
+++ include/linux/mutex.h   734c89362a3d77d460eb20eec3107e7b76fed938
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -35,7 +36,8 @@ extern void
}
 
 extern void
-_mutex_init(struct mutex *lock, char *name, struct lock_class_key *key);
+_mutex_init(struct mutex *lock, char *name, struct lock_class_key *key
+   __COMMA_LOCK_STAT_NOTE_PARAM_DECL);
 
 extern void __lockfunc _mutex_lock(struct mutex *lock);
 extern int __lockfunc _mutex_lock_interruptible(struct mutex *lock);
@@ -56,11 +58,15 @@ extern void __lockfunc _mutex_unlock(str
 # define mutex_lock_nested(l, s)   _mutex_lock(l)
 #endif
 
+#define __mutex_init(l,n)  __rt_mutex_init(&(l)->mutex,\
+   n   \
+   __COMMA_LOCK_STAT_NOTE)
+
 # define mutex_init(mutex) \
 do {   \
static struct lock_class_key __key; \
\
-   _mutex_init((mutex), #mutex, &__key);   \
+   _mutex_init((mutex), #mutex, &__key __COMMA_LOCK_STAT_NOTE);\
 } while (0)
 
 #else

--- include/linux/rt_lock.h d7515027865666075d3e285bcec8c36e9b6cfc47
+++ include/linux/rt_lock.h 297792307de5b4aef2c7e472e2a32c727e5de3f1
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_PREEMPT_RT
 /*
@@ -28,8 +29,8 @@ typedef struct {
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # define __SPIN_LOCK_UNLOCKED(name) \
-   (spinlock_t) { { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) \
-   , .save_state = 1, .file = __FILE__, .line = __LINE__ } }
+   (spinlock_t) { .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) \
+   , .save_state = 1, .file = __FILE__, .line = __LINE__ __COMMA_LOCK_STAT_INITIALIZER} }
 #else
 # define __SPIN_LOCK_UNLOCKED(name) \
(spinlock_t) { { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) } }
@@ -92,7 +93,7 @@ typedef struct {
 # ifdef CONFIG_DEBUG_RT_MUTEXES
 #  define __RW_LOCK_UNLOCKED(name) (rwlock_t) \
{ .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name), \
-.save_state = 1, .file = __FILE__, .line = __LINE__ } }
+.save_state = 1, .file = __FILE__, .line = __LINE__ __COMMA_LOCK_STAT_INITIALIZER } }
 # else
 #  define __RW_LOCK_UNLOCKED(name) (rwlock_t) \
{ .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) } }
@@ -139,14 +140,16 @@ struct semaphore name = \
  */
 #define DECLARE_MUTEX_LOCKED COMPAT_DECLARE_MUTEX_LOCKED
 
-extern void fastcall __sema_init(struct semaphore *sem, int val, char *name, char *file, int line);
+extern void fastcall __sema_init(struct semaphore *sem, int val, char *name
+   __COMMA_LOCK_STAT_FN_DECL, char *_file, int _line);
 
 #define rt_sema_init(sem, val) \
-   __sema_init(sem, val, #sem, __FILE__, __LINE__)
+   __sema_init(sem, val, #sem __COMMA_LOCK_STAT_NOTE_FN, __FILE__, __LINE__)
 
-extern void fastcall __init_MUTEX(struct semaphore *sem, char *name, char *file, int line);
+extern void fastcall __init_MUTEX(struct semaphore *sem, char *name
+   __COMMA_LOCK_STAT_FN_DECL, char *_file, int _line);
 #define rt_init_MUTEX(sem) \
-   __init_MUTEX(sem, #sem, __FILE__, __LINE__)
+   __init_MUTEX(sem, #sem __COMMA_LOCK_STAT_NOTE_FN, __FILE__, __LINE__)
 
 extern void there_is_no_init_MUTEX_LOCKED_for_RT_semaphores(void);
 
@@ -247,13 +250,14 @@ extern void fastcall __rt_rwsem_init(str
struct rw_semaphore lockname = __RWSEM_INITIALIZER(lockname)
 
 extern void fastcall __rt_rwsem_init(struct rw_semaphore *rwsem, char *name,
-struct lock_class_key *key);
+struct lock_class_key *key
+   __COMMA_LOCK_STAT_NOTE_PARAM_DECL);
 
 # define rt_init_rwsem(sem)\
 do {   \
static struct lock_class_key __key; \
\
-   __rt_rwsem_init((sem), #sem, &__key);   \
+   __rt_rwsem_init((sem), #sem, &__key __COMMA_LOCK_STAT_NOTE); \
 } while (0)
 
 extern void fastcall rt_down_write(struct rw_semaphore *rwsem);

--- include/linux/rtmutex.h e6fa10297e6c20d27edba172aeb078a60c64488e
+++ include/linux/rtmutex.h 55cd2de44a52e049fa8a0da63bde6449cefeb8fe
@@

[PATCH 2/4] lock stat (rt/rtmutex.c mods) for 2.6.19-rt1

2006-12-03 Thread hui


Mods to rt.c and rtmutex.c

bill


--- init/main.c 268ab0d5f5bdc422e2864cadf35a7bb95958de10
+++ init/main.c 9d14ac66cb0fe3b90334512c0659146aec5e241c
@@ -608,6 +608,7 @@ asmlinkage void __init start_kernel(void
 #ifdef CONFIG_PROC_FS
proc_root_init();
 #endif
+   lock_stat_sys_init(); //--billh
cpuset_init();
taskstats_init_early();
delayacct_init();

--- kernel/rt.c 5fc97ed10d5053f52488dddfefdb92e6aee2b148
+++ kernel/rt.c 3b86109e8e4163223f17c7d13a5bf53df0e04d70
@@ -66,6 +66,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "rtmutex_common.h"
 
@@ -75,6 +76,42 @@
 # include "rtmutex.h"
 #endif
 
+#ifdef CONFIG_LOCK_STAT
+#define __LOCK_STAT_RT_MUTEX_LOCK(a)   \
+   rt_mutex_lock_with_ip(a,\
+   (unsigned long) __builtin_return_address(0))
+#else
+#define __LOCK_STAT_RT_MUTEX_LOCK(a)   \
+   rt_mutex_lock(a);
+#endif
+
+#ifdef CONFIG_LOCK_STAT
+#define __LOCK_STAT_RT_MUTEX_LOCK_INTERRUPTIBLE(a, b)  \
+   rt_mutex_lock_interruptible_with_ip(a, b,   \
+   (unsigned long) __builtin_return_address(0))
+#else
+#define __LOCK_STAT_RT_MUTEX_LOCK_INTERRUPTIBLE(a, b)  \
+   rt_mutex_lock_interruptible(a, b);
+#endif
+
+#ifdef CONFIG_LOCK_STAT
+#define __LOCK_STAT_RT_MUTEX_TRYLOCK(a)\
+   rt_mutex_trylock_with_ip(a, \
+   (unsigned long) __builtin_return_address(0))
+#else
+#define __LOCK_STAT_RT_MUTEX_TRYLOCK(a)\
+   rt_mutex_trylock(a);
+#endif
+
+#ifdef CONFIG_LOCK_STAT
+#define __LOCK_STAT_RT_SPIN_LOCK(a)\
+   __rt_spin_lock_with_ip(a,   \
+   (unsigned long) __builtin_return_address(0))
+#else
+#define __LOCK_STAT_RT_SPIN_LOCK(a)\
+   __rt_spin_lock(a);
+#endif
+
 #ifdef CONFIG_PREEMPT_RT
 /*
  * Unlock these on crash:
@@ -88,7 +125,8 @@ void zap_rt_locks(void)
 /*
  * struct mutex functions
  */
-void _mutex_init(struct mutex *lock, char *name, struct lock_class_key *key)
+void _mutex_init(struct mutex *lock, char *name, struct lock_class_key *key
+   __COMMA_LOCK_STAT_NOTE_PARAM_DECL)
 {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
/*
@@ -97,14 +135,15 @@ void _mutex_init(struct mutex *lock, cha
debug_check_no_locks_freed((void *)lock, sizeof(*lock));
lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-   __rt_mutex_init(&lock->lock, name);
+   __rt_mutex_init(&lock->lock, name __COMMA_LOCK_STAT_NOTE_VARS);
 }
 EXPORT_SYMBOL(_mutex_init);
 
 void __lockfunc _mutex_lock(struct mutex *lock)
 {
mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-   rt_mutex_lock(&lock->lock);
+
+   __LOCK_STAT_RT_MUTEX_LOCK(&lock->lock);
 }
 EXPORT_SYMBOL(_mutex_lock);
 
@@ -124,14 +163,14 @@ void __lockfunc _mutex_lock_nested(struc
 void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass)
 {
mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
-   rt_mutex_lock(&lock->lock);
+   __LOCK_STAT_RT_MUTEX_LOCK(&lock->lock);
 }
 EXPORT_SYMBOL(_mutex_lock_nested);
 #endif
 
 int __lockfunc _mutex_trylock(struct mutex *lock)
 {
-   int ret = rt_mutex_trylock(&lock->lock);
+   int ret = __LOCK_STAT_RT_MUTEX_TRYLOCK(&lock->lock);
 
if (ret)
mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
@@ -152,7 +191,7 @@ int __lockfunc rt_write_trylock(rwlock_t
  */
 int __lockfunc rt_write_trylock(rwlock_t *rwlock)
 {
-   int ret = rt_mutex_trylock(&rwlock->lock);
+   int ret = __LOCK_STAT_RT_MUTEX_TRYLOCK(&rwlock->lock);
 
if (ret)
rwlock_acquire(&rwlock->dep_map, 0, 1, _RET_IP_);
@@ -179,7 +218,7 @@ int __lockfunc rt_read_trylock(rwlock_t 
}
spin_unlock_irqrestore(&lock->wait_lock, flags);
 
-   ret = rt_mutex_trylock(lock);
+   ret = __LOCK_STAT_RT_MUTEX_TRYLOCK(lock);
if (ret)
rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_);
 
@@ -190,7 +229,7 @@ void __lockfunc rt_write_lock(rwlock_t *
 void __lockfunc rt_write_lock(rwlock_t *rwlock)
 {
rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_);
-   __rt_spin_lock(&rwlock->lock);
+   __LOCK_STAT_RT_SPIN_LOCK(&rwlock->lock);
 }
 EXPORT_SYMBOL(rt_write_lock);
 
@@ -210,11 +249,44 @@ void __lockfunc rt_read_lock(rwlock_t *r
return;
}
spin_unlock_irqrestore(&lock->wait_lock, flags);
-   __rt_spin_lock(lock);
+   __LOCK_STAT_RT_SPIN_LOCK(lock);
 }
 
 EXPORT_SYMBOL(rt_read_lock);
 
+#ifdef CONFIG_LOCK_STAT
+void __lockfunc rt_write_lock_with_ip(rwlock_t *rwlock, unsigned long ip)
+{
+   rwlock_acquire(&rwlock->dep_map, 0, 0, ip);
+

[PATCH 3/4] lock stat (rt/rtmutex.c mods) for 2.6.19-rt1

2006-12-03 Thread hui


Rudimentary annotations to the lock initializers to avoid the binary
tree search before attachment. For things like inodes that are created
and destroyed constantly this might be useful to get around some
overhead.
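
To show the difference between the two attach paths, here is a userspace
sketch. It is only an approximation of the idea: a sorted array searched
with bsearch() stands in for the binary tree, and the demo_* names,
counters and entry table are all invented for the example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct demo_lock_stat {
	const char *function;	/* key: initializing function */
	unsigned long ntracked;	/* attached via lookup */
	unsigned long ninlined;	/* attached via annotation */
};

/* Must stay sorted by function name for bsearch(). */
static struct demo_lock_stat demo_entries[] = {
	{ "d_alloc", 0, 0 },
	{ "eventpoll_init_file", 0, 0 },
	{ "inode_init_once", 0, 0 },
};

static unsigned long demo_lookups;	/* how often we had to search */

struct demo_lock {
	struct demo_lock_stat *stat;
};

static int demo_cmp(const void *key, const void *elem)
{
	const struct demo_lock_stat *e = elem;

	return strcmp((const char *)key, e->function);
}

/* Plain init: pay for a search on every initialization. */
static void demo_lock_init(struct demo_lock *l, const char *function)
{
	demo_lookups++;
	l->stat = bsearch(function, demo_entries,
			  sizeof(demo_entries) / sizeof(demo_entries[0]),
			  sizeof(demo_entries[0]), demo_cmp);
	if (l->stat != NULL)
		l->stat->ntracked++;
}

/* Annotated init: the caller hands over the entry, no search needed. */
static void demo_lock_init_annotated(struct demo_lock *l,
				     struct demo_lock_stat *entry)
{
	l->stat = entry;
	entry->ninlined++;
}
```

Both paths end up attached to the same entry; the annotated one just
skips the search, which matters for objects that are created and torn
down constantly.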

Sorry about the patch numbering order. I think I screwed up on it.

bill


--- arch/xtensa/platform-iss/network.c  eee47b0ca011d1c327ce7aff0c9a7547695d3a1f
+++ arch/xtensa/platform-iss/network.c  76b16d29a46677a45d56b64983e0783959aa2160
@@ -648,6 +648,8 @@ static int iss_net_configure(int index, 
.have_mac   = 0,
});
 
+   spin_lock_init(&lp->lock);
+
/*
 * Try all transport protocols.
 * Note: more protocols can be added by adding '&& !X_init(lp, eth)'.

--- fs/dcache.c 20226054e6d6b080847e7a892d0b47a7ad042288
+++ fs/dcache.c 64d2b2b78b50dc2da7e409f2a9721b80c8fbbaf3
@@ -884,7 +884,7 @@ struct dentry *d_alloc(struct dentry * p
 
atomic_set(&dentry->d_count, 1);
dentry->d_flags = DCACHE_UNHASHED;
-   spin_lock_init(&dentry->d_lock);
+   spin_lock_init_annotated(&dentry->d_lock, &_lock_stat_d_alloc_entry);
dentry->d_inode = NULL;
dentry->d_parent = NULL;
dentry->d_sb = NULL;

--- fs/xfs/support/ktrace.c 1136cf72f9273718da47405b594caebaa59b66d3
+++ fs/xfs/support/ktrace.c 122729d6084fa84115b8f8f75cc55c585bfe3676
@@ -162,6 +162,7 @@ ktrace_enter(
 
ASSERT(ktp != NULL);
 
+   spin_lock_init(&wrap_lock); //--billh
/*
 * Grab an entry by pushing the index up to the next one.
 */

--- include/linux/eventpoll.h   bd142a622609d04952fac6215586fff353dab729
+++ include/linux/eventpoll.h   43271ded1a3b9f40beb37aaff9e02fadeecb4655
@@ -15,6 +15,7 @@
 #define _LINUX_EVENTPOLL_H
 
 #include 
+#include 
 
 
 /* Valid opcodes to issue to sys_epoll_ctl() */
@@ -55,7 +56,7 @@ static inline void eventpoll_init_file(s
 static inline void eventpoll_init_file(struct file *file)
 {
INIT_LIST_HEAD(&file->f_ep_links);
-   spin_lock_init(&file->f_ep_lock);
+   spin_lock_init_annotated(&file->f_ep_lock, &_lock_stat_eventpoll_init_file_entry);
 }
 
 

--- net/tipc/node.c d6ddb08c5332517b0eff3b72ee0adc48f47801ff
+++ net/tipc/node.c 9712633ceb8f939fc14a0a4861f7121840beff1d
@@ -77,7 +77,7 @@ struct node *tipc_node_create(u32 addr)

memset(n_ptr, 0, sizeof(*n_ptr));
n_ptr->addr = addr;
-spin_lock_init(&n_ptr->lock);
+   spin_lock_init(&n_ptr->lock);
INIT_LIST_HEAD(&n_ptr->nsub);
n_ptr->owner = c_ptr;
tipc_cltr_attach_node(c_ptr, n_ptr);

Re: [PATCH 3/4] lock stat (rt/rtmutex.c mods) for 2.6.19-rt1

2006-12-03 Thread hui

On Sun, Dec 03, 2006 at 06:00:09PM -0800, Bill Huey wrote:
> Rudimentary annotations to the lock initializers to avoid the binary
> tree search before attachment. For things like inodes that are created
> and destroyed constantly this might be useful to get around some
> overhead.
> 
> Sorry, about the patch numbering order. I think I screwed up on it.

I also screwed up on the title for the email contents. Sorry about that.

bill


Re: [PATCH 0/4] lock stat for 2.6.19-rt1

2006-12-04 Thread hui

On Mon, Dec 04, 2006 at 01:21:29PM +0100, bert hubert wrote:
> On Sun, Dec 03, 2006 at 05:53:23PM -0800, Bill Huey wrote:
> 
> > [8264, 996648, 0]   {inode_init_once, fs/inode.c, 196}
> > [8552, 996648, 0]   {inode_init_once, fs/inode.c, 193}
> 
> Impressive, Bill!
> 
> How tightly is your work bound to -rt? Iow, any chance of separating the
> two? Or should we even want to?

Right now, it's solely dependent on -rt, but the basic mechanisms of how
it works are pretty much the same as lockdep. Parts of it should be
moveable across to regular kernels. The only remaining part would be
altering the lock structure (spinlock, mutex, etc...) to have a pointer
that it can use to do the statistical tracking. If it's NULL, then it's
handled differently: the tracking object is dynamically allocated when
the lock is contended against.
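A minimal user-space sketch of the pointer-in-the-lock idea described above. All names here (demo_lock, demo_lock_stat, demo_note_contention) are hypothetical stand-ins and not the structures from the patch, and the sketch ignores the locking and allocation-context problems real kernel code would have to solve:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical statistics object; not the lock_stat struct from the patch. */
struct demo_lock_stat {
	unsigned long ncontended;	/* times the lock was contended */
	const char *file;		/* where the lock was initialized */
	int line;
};

/* Hypothetical lock carrying a pointer to its statistics object. */
struct demo_lock {
	int locked;
	struct demo_lock_stat *stat;	/* NULL until first contention */
	const char *init_file;
	int init_line;
};

static void demo_lock_init(struct demo_lock *l, const char *file, int line)
{
	l->locked = 0;
	l->stat = NULL;			/* nothing attached yet */
	l->init_file = file;
	l->init_line = line;
}

/* Called only on the contended path; attaches the object lazily. */
static int demo_note_contention(struct demo_lock *l)
{
	if (l->stat == NULL) {
		l->stat = calloc(1, sizeof(*l->stat));
		if (l->stat == NULL)
			return -1;	/* out of memory: skip the accounting */
		l->stat->file = l->init_file;
		l->stat->line = l->init_line;
	}
	l->stat->ncontended++;
	return 0;
}
```

The uncontended path never touches the statistics object, so locks that are never fought over cost one extra pointer and nothing else.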

There's other uses for it as well. Think about RCU algorithms that need
to spin-try to make sure the update of an element or the validation of
its data is safe to do. If an object was created to detect those spins
it'll track what is effectively contention as well as it is represented
in that algorithm. I've seen an RCU radix tree implementation do something
like that.

> > The first column is the number of the times that object was contented 
> > against.
> > The second is the number of times this lock object was initialized. The 
> > third
> > is the annotation scheme that directly attaches the lock object (spinlock,
> > etc..) in line with the function initializer to avoid the binary tree 
> > lookup.
> 
> I don't entirely get the third item, can you elaborate a bit?

I hate the notion of a search, so I directly set the pointer to an
object that's statically defined. It means that the object is directly
connected to what is supposed to be backing it without doing a runtime
search.  When that's done, all of the numbers from the second column
get moved to the third.
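A rough sketch of the difference between the two attachment paths, under stated assumptions: the anno_* names are hypothetical, and a linear scan over a small table stands in for the binary tree lookup the patch is described as using. The counters are named after the ntracked and ninlined fields in lock_stat.h, but their exact semantics here are an assumption:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-initializer statistics entry. */
struct anno_stat {
	const char *function;
	unsigned int ntracked;	/* locks attached through a runtime search */
	unsigned int ninlined;	/* locks attached through a direct annotation */
};

#define ANNO_NSTATS 2
static struct anno_stat anno_table[ANNO_NSTATS] = {
	{ "d_alloc", 0, 0 },
	{ "inode_init_once", 0, 0 },
};

struct anno_lock {
	struct anno_stat *stat;
};

/* Search path: look the backing object up by name at init time. */
static void anno_init_searched(struct anno_lock *l, const char *function)
{
	size_t i;

	l->stat = NULL;
	for (i = 0; i < ANNO_NSTATS; i++) {
		if (strcmp(anno_table[i].function, function) == 0) {
			l->stat = &anno_table[i];
			l->stat->ntracked++;
			return;
		}
	}
}

/* Annotated path: the caller names the static object, so no search runs. */
static void anno_init_annotated(struct anno_lock *l, struct anno_stat *s)
{
	l->stat = s;
	s->ninlined++;
}
```

Once an initializer is switched to the annotated call, its count shows up under ninlined instead of ntracked, which matches the "second column moves to the third" behaviour described above.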

> Do you have a feeling of the runtime overhead?

There's minimal runtime overhead, I believe, since it's only doing an
atomic increment of the stats during the slowpath before the thread
is actually shoved into a wait queue. That's something that happens
seldom.
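To illustrate why the cost stays off the fast path, here is a small C11 sketch, assuming a hypothetical sp_lock type built on atomic_flag. It is not the rtmutex code; it only shows that the counter is touched when acquisition fails:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock with a contention counter attached. */
struct sp_lock {
	atomic_flag held;
	atomic_ulong ncontended;
};

static void sp_lock_init(struct sp_lock *l)
{
	atomic_flag_clear(&l->held);
	atomic_init(&l->ncontended, 0);
}

/* Returns true if acquired. Only a failed attempt touches the counter. */
static bool sp_trylock_counted(struct sp_lock *l)
{
	if (!atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		return true;	/* fast path: no statistics work at all */

	/* Slow path: this is where a real lock would queue the waiter. */
	atomic_fetch_add_explicit(&l->ncontended, 1, memory_order_relaxed);
	return false;
}

static void sp_unlock(struct sp_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A relaxed increment is enough here because the counter is only a statistic and does not order any other memory access.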

bill


Re: [PATCH 0/4] lock stat for 2.6.19-rt1

2006-12-04 Thread hui

On Mon, Dec 04, 2006 at 09:08:56AM -0800, Bill Huey wrote:
> On Mon, Dec 04, 2006 at 01:21:29PM +0100, bert hubert wrote:
> > How tightly is your work bound to -rt? Iow, any chance of separating the
> > two? Or should we even want to?
> 
> There's other uses for it as well. Think about RCU algorithms that need
> to spin-try to make sure the update of an element or the validation of
> it's data is safe to do. If an object was created to detect those spins
> it'll track what is effectively contention as well as it is represented
> in that algorithm. I've seen an RCU radix tree implementation do something
> like that.

That was a horrible paragraph plus I'm bored at the moment. What I meant is
that lockless algorithms occasionally have a spin-try associated with them as
well that might validate the data that's updated against the entire
data structure for some kind of consistency or coherency, or possibly against
an individual element. That retry or spin can be considered contention as well,
and this lock-stat patch can be made aware of it just by connecting the
actual occurrence of the retry logic to a backing object.
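A sketch of what "connecting the retry logic to a backing object" could look like for a simple compare-and-swap loop. The retry_stat type and counted_add() are hypothetical illustrations, not part of the lock-stat patch:

```c
#include <stdatomic.h>

/* Hypothetical backing object that records retries as contention. */
struct retry_stat {
	atomic_ulong nretries;
};

/*
 * Lockless add: every failed compare-and-swap means another thread
 * changed the value first, so it is counted as one contention event.
 */
static long counted_add(atomic_long *value, long delta, struct retry_stat *rs)
{
	long old = atomic_load_explicit(value, memory_order_relaxed);

	/* On failure the CAS reloads 'old' with the current value. */
	while (!atomic_compare_exchange_strong_explicit(value, &old, old + delta,
							memory_order_acq_rel,
							memory_order_relaxed))
		atomic_fetch_add_explicit(&rs->nretries, 1,
					  memory_order_relaxed);

	return old + delta;
}
```

The strong compare-and-swap is used so that spurious failures are not counted; with the weak variant the counter would also include retries that no other thread caused.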

I need to be more conscious about proofreading what I write before sending
it off. Was this clear ?

bill


[PATCH 3/5] lock stat kills lock meter for -rt (.h files)

2006-12-14 Thread hui


Contains .h file changes.

bill


--- include/linux/mutex.h   d231debc2848a8344e1b04055ef22e489702e648
+++ include/linux/mutex.h   734c89362a3d77d460eb20eec3107e7b76fed938
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -35,7 +36,8 @@ extern void
}
 
 extern void
-_mutex_init(struct mutex *lock, char *name, struct lock_class_key *key);
+_mutex_init(struct mutex *lock, char *name, struct lock_class_key *key
+   __COMMA_LOCK_STAT_NOTE_PARAM_DECL);
 
 extern void __lockfunc _mutex_lock(struct mutex *lock);
 extern int __lockfunc _mutex_lock_interruptible(struct mutex *lock);
@@ -56,11 +58,15 @@ extern void __lockfunc _mutex_unlock(str
 # define mutex_lock_nested(l, s)   _mutex_lock(l)
 #endif
 
+#define __mutex_init(l,n)  __rt_mutex_init(&(l)->mutex,\
+   n   \
+   __COMMA_LOCK_STAT_NOTE)
+
 # define mutex_init(mutex) \
 do {   \
static struct lock_class_key __key; \
\
-   _mutex_init((mutex), #mutex, &__key);   \
+   _mutex_init((mutex), #mutex, &__key __COMMA_LOCK_STAT_NOTE);\
 } while (0)
 
 #else

--- include/linux/rt_lock.h d7515027865666075d3e285bcec8c36e9b6cfc47
+++ include/linux/rt_lock.h 297792307de5b4aef2c7e472e2a32c727e5de3f1
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_PREEMPT_RT
 /*
@@ -28,8 +29,8 @@ typedef struct {
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # define __SPIN_LOCK_UNLOCKED(name) \
-   (spinlock_t) { { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) \
-   , .save_state = 1, .file = __FILE__, .line = __LINE__ } }
+   (spinlock_t) { .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) \
+   , .save_state = 1, .file = __FILE__, .line = __LINE__ __COMMA_LOCK_STAT_INITIALIZER} }
 #else
 # define __SPIN_LOCK_UNLOCKED(name) \
(spinlock_t) { { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) } }
@@ -92,7 +93,7 @@ typedef struct {
 # ifdef CONFIG_DEBUG_RT_MUTEXES
 #  define __RW_LOCK_UNLOCKED(name) (rwlock_t) \
{ .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name), \
-.save_state = 1, .file = __FILE__, .line = __LINE__ } }
+.save_state = 1, .file = __FILE__, .line = __LINE__ __COMMA_LOCK_STAT_INITIALIZER } }
 # else
 #  define __RW_LOCK_UNLOCKED(name) (rwlock_t) \
{ .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) } }
@@ -139,14 +140,16 @@ struct semaphore name = \
  */
 #define DECLARE_MUTEX_LOCKED COMPAT_DECLARE_MUTEX_LOCKED
 
-extern void fastcall __sema_init(struct semaphore *sem, int val, char *name, char *file, int line);
+extern void fastcall __sema_init(struct semaphore *sem, int val, char *name
+   __COMMA_LOCK_STAT_FN_DECL, char *_file, int _line);
 
 #define rt_sema_init(sem, val) \
-   __sema_init(sem, val, #sem, __FILE__, __LINE__)
+   __sema_init(sem, val, #sem __COMMA_LOCK_STAT_NOTE_FN, __FILE__, __LINE__)
 
-extern void fastcall __init_MUTEX(struct semaphore *sem, char *name, char *file, int line);
+extern void fastcall __init_MUTEX(struct semaphore *sem, char *name
+   __COMMA_LOCK_STAT_FN_DECL, char *_file, int _line);
 #define rt_init_MUTEX(sem) \
-   __init_MUTEX(sem, #sem, __FILE__, __LINE__)
+   __init_MUTEX(sem, #sem __COMMA_LOCK_STAT_NOTE_FN, __FILE__, __LINE__)
 
 extern void there_is_no_init_MUTEX_LOCKED_for_RT_semaphores(void);
 
@@ -247,13 +250,14 @@ extern void fastcall __rt_rwsem_init(str
struct rw_semaphore lockname = __RWSEM_INITIALIZER(lockname)
 
 extern void fastcall __rt_rwsem_init(struct rw_semaphore *rwsem, char *name,
-struct lock_class_key *key);
+struct lock_class_key *key
+   __COMMA_LOCK_STAT_NOTE_PARAM_DECL);
 
 # define rt_init_rwsem(sem)\
 do {   \
static struct lock_class_key __key; \
\
-   __rt_rwsem_init((sem), #sem, &__key);   \
+   __rt_rwsem_init((sem), #sem, &__key __COMMA_LOCK_STAT_NOTE);    \
 } while (0)
 
 extern void fastcall rt_down_write(struct rw_semaphore *rwsem);

--- include/linux/rtmutex.h e6fa10297e6c20d27edba172aeb078a60c64488e
+++ include/linux/rtmutex.h 55cd2de44a52e049fa8a0da63bde6449cefeb8fe
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * The rt_mutex structure
@

[PATCH 4/5] lock stat kills lock meter for -rt (annotations)

2006-12-14 Thread hui


Rough annotations to speed up the object attachment logic.

bill


--- arch/xtensa/platform-iss/network.c  eee47b0ca011d1c327ce7aff0c9a7547695d3a1f
+++ arch/xtensa/platform-iss/network.c  76b16d29a46677a45d56b64983e0783959aa2160
@@ -648,6 +648,8 @@ static int iss_net_configure(int index, 
.have_mac   = 0,
});
 
+   spin_lock_init(&lp->lock);
+
/*
 * Try all transport protocols.
 * Note: more protocols can be added by adding '&& !X_init(lp, eth)'.

--- fs/dcache.c 20226054e6d6b080847e7a892d0b47a7ad042288
+++ fs/dcache.c 64d2b2b78b50dc2da7e409f2a9721b80c8fbbaf3
@@ -884,7 +884,7 @@ struct dentry *d_alloc(struct dentry * p
 
atomic_set(&dentry->d_count, 1);
dentry->d_flags = DCACHE_UNHASHED;
-   spin_lock_init(&dentry->d_lock);
+   spin_lock_init_annotated(&dentry->d_lock, &_lock_stat_d_alloc_entry);
dentry->d_inode = NULL;
dentry->d_parent = NULL;
dentry->d_sb = NULL;

--- fs/xfs/support/ktrace.c 1136cf72f9273718da47405b594caebaa59b66d3
+++ fs/xfs/support/ktrace.c 122729d6084fa84115b8f8f75cc55c585bfe3676
@@ -162,6 +162,7 @@ ktrace_enter(
 
ASSERT(ktp != NULL);
 
+   spin_lock_init(&wrap_lock); //--billh
/*
 * Grab an entry by pushing the index up to the next one.
 */

--- include/linux/eventpoll.h   bd142a622609d04952fac6215586fff353dab729
+++ include/linux/eventpoll.h   43271ded1a3b9f40beb37aaff9e02fadeecb4655
@@ -15,6 +15,7 @@
 #define _LINUX_EVENTPOLL_H
 
 #include 
+#include 
 
 
 /* Valid opcodes to issue to sys_epoll_ctl() */
@@ -55,7 +56,7 @@ static inline void eventpoll_init_file(s
 static inline void eventpoll_init_file(struct file *file)
 {
INIT_LIST_HEAD(&file->f_ep_links);
-   spin_lock_init(&file->f_ep_lock);
+   spin_lock_init_annotated(&file->f_ep_lock, &_lock_stat_eventpoll_init_file_entry);
 }
 
 

--- include/linux/wait.h    12da8de69f1f2660443a04c3df199e5d851ea2ca
+++ include/linux/wait.h    9b7448af82583bd11d18032aedfa8f2af44345f4
@@ -81,7 +81,7 @@ extern void init_waitqueue_head(wait_que
 
 extern void init_waitqueue_head(wait_queue_head_t *q);
 
-#ifdef CONFIG_LOCKDEP
+#if defined(CONFIG_LOCKDEP) || defined(CONFIG_LOCK_STAT)
 # define __WAIT_QUEUE_HEAD_INIT_ONSTACK(name) \
({ init_waitqueue_head(&name); name; })
 # define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \

--- init/main.c 636e95fd9af6357291dace2b9995fd72d36e945f
+++ init/main.c 2e30dc30c4aca9b1ff56064887e04d8262db30e7
@@ -608,6 +608,7 @@ asmlinkage void __init start_kernel(void
 #ifdef CONFIG_PROC_FS
proc_root_init();
 #endif
+   lock_stat_sys_init(); //--billh
cpuset_init();
taskstats_init_early();
delayacct_init();

--- net/tipc/node.c d6ddb08c5332517b0eff3b72ee0adc48f47801ff
+++ net/tipc/node.c 9712633ceb8f939fc14a0a4861f7121840beff1d
@@ -77,7 +77,7 @@ struct node *tipc_node_create(u32 addr)

memset(n_ptr, 0, sizeof(*n_ptr));
n_ptr->addr = addr;
-spin_lock_init(&n_ptr->lock);
+   spin_lock_init(&n_ptr->lock);
INIT_LIST_HEAD(&n_ptr->nsub);
n_ptr->owner = c_ptr;
tipc_cltr_attach_node(c_ptr, n_ptr);

[PATCH 5/5] lock stat kills lock meter for -rt (makefile)

2006-12-14 Thread hui


Build system changes.

bill


--- kernel/Kconfig.preempt  3148bd94270ea0a853d8e443616cd7a668dd0d3b
+++ kernel/Kconfig.preempt  d63831dbfbb9e68386bfc862fd2dd1a8f1e9779f
@@ -176,3 +176,12 @@ config RCU_TRACE
 
  Say Y here if you want to enable RCU tracing
  Say N if you are unsure.
+
+config LOCK_STAT
+   bool "Lock contention statistics tracking in /proc"
+   depends on PREEMPT_RT && !DEBUG_RT_MUTEXES
+   default y
+   help
+ General lock statistics tracking with regard to contention in
+ /proc/lock_stat/contention
+

--- kernel/Makefile 0690fbe8c605a1c7e24b7b94d05a96ea32574aab
+++ kernel/Makefile 08087775b67b7ac1682dac0310003ef7ecbd7e70
@@ -63,6 +63,7 @@ obj-$(CONFIG_TASKSTATS) += taskstats.o t
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_LOCK_STAT) += lock_stat.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <[EMAIL PROTECTED]>, the -fno-omit-frame-pointer is

[PATCH 1/5] lock stat kills lock meter for -rt (core)

2006-12-14 Thread hui


Core infrastructure files with /proc interface


--- include/linux/lock_stat.h   554e4c1a2bc399f8a4fe4a1634b29aae6f4bb4de
+++ include/linux/lock_stat.h   554e4c1a2bc399f8a4fe4a1634b29aae6f4bb4de
@@ -0,0 +1,147 @@
+/*
+ * By Bill Huey (hui) at <[EMAIL PROTECTED]>
+ *
+ * Release under the what ever the Linux kernel chooses for a
+ * license, GNU Public License GPL v2
+ *
+ * Tue Sep  5 17:27:48 PDT 2006
+ * Created lock_stat.h
+ *
+ * Wed Sep  6 15:36:14 PDT 2006
+ * Thinking about the object lifetime of a spinlock. Please refer to
+ * comments in kernel/lock_stat.c instead.
+ *
+ */
+
+#ifndef LOCK_STAT_H
+#define LOCK_STAT_H
+
+#ifdef CONFIG_LOCK_STAT
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+typedef struct lock_stat {
+   char function[KSYM_NAME_LEN];
+   int line;
+   char *file;
+
+   atomic_t ncontended;
+   unsigned int ntracked;
+   atomic_t ninlined;
+   atomic_t nspinnable;
+
+   struct rb_node rb_node;
+   struct list_head list_head;
+} lock_stat_t;
+
+typedef lock_stat_t *lock_stat_ref_t;
+
+struct task_struct;
+
+#define LOCK_STAT_INIT(field)
+#define LOCK_STAT_INITIALIZER(field) { \
+   __FILE__, __FUNCTION__, __LINE__,   \
+   ATOMIC_INIT(0), LIST_HEAD_INIT(field)}
+
+#define LOCK_STAT_NOTE __FILE__, __FUNCTION__, __LINE__
+#define LOCK_STAT_NOTE_VARS    _file, _function, _line
+#define LOCK_STAT_NOTE_PARAM_DECL  const char *_file,  \
+   const char *_function,  \
+   int _line
+
+#define __COMMA_LOCK_STAT_FN_DECL  , const char *_function
+#define __COMMA_LOCK_STAT_FN_VAR   , _function
+#define __COMMA_LOCK_STAT_NOTE_FN  , __FUNCTION__
+
+#define __COMMA_LOCK_STAT_NOTE , LOCK_STAT_NOTE
+#define __COMMA_LOCK_STAT_NOTE_VARS, LOCK_STAT_NOTE_VARS
+#define __COMMA_LOCK_STAT_NOTE_PARAM_DECL , LOCK_STAT_NOTE_PARAM_DECL
+
+
+#define __COMMA_LOCK_STAT_NOTE_FLLN_DECL , const char *_file, int _line
+#define __COMMA_LOCK_STAT_NOTE_FLLN , __FILE__, __LINE__
+#define __COMMA_LOCK_STAT_NOTE_FLLN_VARS , _file, _line
+
+#define __COMMA_LOCK_STAT_INITIALIZER  , .lock_stat = NULL,
+
+#define __COMMA_LOCK_STAT_IP_DECL  , unsigned long _ip
+#define __COMMA_LOCK_STAT_IP   , _ip
+#define __COMMA_LOCK_STAT_RET_IP   , (unsigned long) __builtin_return_address(0)
+
+extern void lock_stat_init(struct lock_stat *ls);
+extern void lock_stat_sys_init(void);
+
+#define lock_stat_is_initialized(o) ((unsigned long) (*o)->file)
+
+extern void lock_stat_note_contention(lock_stat_ref_t *ls, struct task_struct *owner, unsigned long ip);
+extern void lock_stat_print(void);
+extern void lock_stat_scoped_attach(lock_stat_ref_t *_s, LOCK_STAT_NOTE_PARAM_DECL);
+
+#define ksym_strcmp(a, b) strncmp(a, b, KSYM_NAME_LEN)
+#define ksym_strcpy(a, b) strncpy(a, b, KSYM_NAME_LEN)
+#define ksym_strlen(a) strnlen(a, KSYM_NAME_LEN)
+
+/*
+static inline char * ksym_strdup(const char *a)
+{
+   char *s = (char *) kmalloc(ksym_strlen(a), GFP_KERNEL);
+   return strncpy(s, a, KSYM_NAME_LEN);
+}
+*/
+
+#define LS_INIT(name, h) { \
+   /*.function,*/ .file = h, .line = 1,\
+   .ntracked = 0, .ncontended = ATOMIC_INIT(0),\
+   .list_head = LIST_HEAD_INIT(name.list_head),\
+   .rb_node.rb_left = NULL, .rb_node.rb_right = NULL \
+   }
+
+#define DECLARE_LS_ENTRY(name) \
+   extern struct lock_stat _lock_stat_##name##_entry
+
+/* char _##name##_string[] = #name;\
+*/
+
+#define DEFINE_LS_ENTRY(name)  \
+   struct lock_stat _lock_stat_##name##_entry = LS_INIT(_lock_stat_##name##_entry, #name "_string")
+
+DECLARE_LS_ENTRY(d_alloc);
+DECLARE_LS_ENTRY(eventpoll_init_file);
+/*
+DECLARE_LS_ENTRY(get_empty_filp);
+DECLARE_LS_ENTRY(init_once_1);
+DECLARE_LS_ENTRY(init_once_2);
+DECLARE_LS_ENTRY(inode_init_once_1);
+DECLARE_LS_ENTRY(inode_init_once_2);
+DECLARE_LS_ENTRY(inode_init_once_3);
+DECLARE_LS_ENTRY(inode_init_once_4);
+DECLARE_LS_ENTRY(inode_init_once_5);
+DECLARE_LS_ENTRY(inode_init_once_6);
+DECLARE_LS_ENTRY(inode_init_once_7);
+*/
+
+#else /* CONFIG_LOCK_STAT  */
+
+#define __COMMA_LOCK_STAT_FN_DECL
+#define __COMMA_LOCK_STAT_FN_VAR
+#define __COMMA_LOCK_STAT_NOTE_FN
+
+#define __COMMA_LOCK_STAT_NOTE_FLLN_DECL
+#define __COMMA_LOCK_STAT_NOTE_FLLN
+#define __COMMA_LOCK_STAT_NOTE_FLLN_VARS
+
+#define __COMMA_LOCK_STAT_INITIALIZER  
+
+#define __COMMA_LOCK_STAT_IP_DECL
+#define __COMMA_LOCK_STAT_IP
+#define __COMMA_LOCK_STAT_RET_IP
+
+#endif /* CONFIG_LOCK_STAT */
+
+#endif /* LOCK_STAT_H */
+

---

[PATCH 2/5] lock stat kills lock meter for -rt

2006-12-14 Thread hui


.c files that have been changed, rtmutex.c, rt.c


--- kernel/rt.c 5fc97ed10d5053f52488dddfefdb92e6aee2b148
+++ kernel/rt.c 3b86109e8e4163223f17c7d13a5bf53df0e04d70
@@ -66,6 +66,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "rtmutex_common.h"
 
@@ -75,6 +76,42 @@
 # include "rtmutex.h"
 #endif
 
+#ifdef CONFIG_LOCK_STAT
+#define __LOCK_STAT_RT_MUTEX_LOCK(a)   \
+   rt_mutex_lock_with_ip(a,\
+   (unsigned long) __builtin_return_address(0))
+#else
+#define __LOCK_STAT_RT_MUTEX_LOCK(a)   \
+   rt_mutex_lock(a);
+#endif
+
+#ifdef CONFIG_LOCK_STAT
+#define __LOCK_STAT_RT_MUTEX_LOCK_INTERRUPTIBLE(a, b)  \
+   rt_mutex_lock_interruptible_with_ip(a, b,   \
+   (unsigned long) __builtin_return_address(0))
+#else
+#define __LOCK_STAT_RT_MUTEX_LOCK_INTERRUPTIBLE(a, b)  \
+   rt_mutex_lock_interruptible(a, b);
+#endif
+
+#ifdef CONFIG_LOCK_STAT
+#define __LOCK_STAT_RT_MUTEX_TRYLOCK(a)\
+   rt_mutex_trylock_with_ip(a, \
+   (unsigned long) __builtin_return_address(0))
+#else
+#define __LOCK_STAT_RT_MUTEX_TRYLOCK(a)\
+   rt_mutex_trylock(a);
+#endif
+
+#ifdef CONFIG_LOCK_STAT
+#define __LOCK_STAT_RT_SPIN_LOCK(a)\
+   __rt_spin_lock_with_ip(a,   \
+   (unsigned long) __builtin_return_address(0))
+#else
+#define __LOCK_STAT_RT_SPIN_LOCK(a)\
+   __rt_spin_lock(a);
+#endif
+
 #ifdef CONFIG_PREEMPT_RT
 /*
  * Unlock these on crash:
@@ -88,7 +125,8 @@ void zap_rt_locks(void)
 /*
  * struct mutex functions
  */
-void _mutex_init(struct mutex *lock, char *name, struct lock_class_key *key)
+void _mutex_init(struct mutex *lock, char *name, struct lock_class_key *key
+   __COMMA_LOCK_STAT_NOTE_PARAM_DECL)
 {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
/*
@@ -97,14 +135,15 @@ void _mutex_init(struct mutex *lock, cha
debug_check_no_locks_freed((void *)lock, sizeof(*lock));
lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-   __rt_mutex_init(&lock->lock, name);
+   __rt_mutex_init(&lock->lock, name __COMMA_LOCK_STAT_NOTE_VARS);
 }
 EXPORT_SYMBOL(_mutex_init);
 
 void __lockfunc _mutex_lock(struct mutex *lock)
 {
mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-   rt_mutex_lock(&lock->lock);
+
+   __LOCK_STAT_RT_MUTEX_LOCK(&lock->lock);
 }
 EXPORT_SYMBOL(_mutex_lock);
 
@@ -124,14 +163,14 @@ void __lockfunc _mutex_lock_nested(struc
 void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass)
 {
mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
-   rt_mutex_lock(&lock->lock);
+   __LOCK_STAT_RT_MUTEX_LOCK(&lock->lock);
 }
 EXPORT_SYMBOL(_mutex_lock_nested);
 #endif
 
 int __lockfunc _mutex_trylock(struct mutex *lock)
 {
-   int ret = rt_mutex_trylock(&lock->lock);
+   int ret = __LOCK_STAT_RT_MUTEX_TRYLOCK(&lock->lock);
 
if (ret)
mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
@@ -152,7 +191,7 @@ int __lockfunc rt_write_trylock(rwlock_t
  */
 int __lockfunc rt_write_trylock(rwlock_t *rwlock)
 {
-   int ret = rt_mutex_trylock(&rwlock->lock);
+   int ret = __LOCK_STAT_RT_MUTEX_TRYLOCK(&rwlock->lock);
 
if (ret)
rwlock_acquire(&rwlock->dep_map, 0, 1, _RET_IP_);
@@ -179,7 +218,7 @@ int __lockfunc rt_read_trylock(rwlock_t 
}
spin_unlock_irqrestore(&lock->wait_lock, flags);
 
-   ret = rt_mutex_trylock(lock);
+   ret = __LOCK_STAT_RT_MUTEX_TRYLOCK(lock);
if (ret)
rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_);
 
@@ -190,7 +229,7 @@ void __lockfunc rt_write_lock(rwlock_t *
 void __lockfunc rt_write_lock(rwlock_t *rwlock)
 {
rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_);
-   __rt_spin_lock(&rwlock->lock);
+   __LOCK_STAT_RT_SPIN_LOCK(&rwlock->lock);
 }
 EXPORT_SYMBOL(rt_write_lock);
 
@@ -210,11 +249,44 @@ void __lockfunc rt_read_lock(rwlock_t *r
return;
}
spin_unlock_irqrestore(&lock->wait_lock, flags);
-   __rt_spin_lock(lock);
+   __LOCK_STAT_RT_SPIN_LOCK(lock);
 }
 
 EXPORT_SYMBOL(rt_read_lock);
 
+#ifdef CONFIG_LOCK_STAT
+void __lockfunc rt_write_lock_with_ip(rwlock_t *rwlock, unsigned long ip)
+{
+   rwlock_acquire(&rwlock->dep_map, 0, 0, ip);
+   __rt_spin_lock_with_ip(&rwlock->lock, ip);
+}
+EXPORT_SYMBOL(rt_write_lock_with_ip);
+
+void __lockfunc rt_read_lock_with_ip(rwlock_t *rwlock, unsigned long ip)
+{
+   unsigned long flags;
+   struct rt_mutex *lock = &rwlock->lock;
+
+   /*
+* NOTE: we handle it as a write-lock:
+*/
+   rwlock_acquire(&rwlock->dep_map, 0, 0, ip);
+   /*
+* Read loc

[PATCH 0/5] lock stat kills lock meter for -rt

2006-12-14 Thread hui


Hello,

I'm back with another annoying announcement and post of my "lock stat"
patches for Ingo's 2.6.19-rt14 patch. I want review, comments and
eventually inclusion into the -rt.

Changes in this release:

- forward ported to 2.6.19-rt14

- rt_mutex_slowtrylock() path now works with lock stat after an
initialization check. Apparently there's a try-lock somewhere before
my lock stat stuff is initialized and it hard crashes the machine on
boot. This is fixed now.

- Adds a new field to track adaptive spins in the rtmutex as a future
feature.

bill


Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-29 Thread hui

On Mon, May 28, 2007 at 10:09:19PM +0530, Srivatsa Vaddagiri wrote:
> On Fri, May 25, 2007 at 10:14:58AM -0700, Li, Tong N wrote:
> > is represented by a weight of 10. Inside the group, let's say the two
> > tasks, P1 and P2, have weights 1 and 2. Then the system-wide weight for
> > P1 is 10/3 and the weight for P2 is 20/3. In essence, this flattens
> > weights into one level without changing the shares they represent.
> 
> What do these task weights control? Timeslice primarily? If so, I am not
> sure how well it can co-exist with cfs then (unless you are planning to
> replace cfs with a equally good interactive/fair scheduler :)

It's called SD. From Con Kolivas that got it right the first time around :)

> I would be very interested if this weight calculation can be used for
> smpnice based load balancing purposes too ..

bill


Re: [RFC] [Patch 4/4] lock contention tracking slimmed down

2007-06-06 Thread hui

On Thu, Jun 07, 2007 at 02:17:45AM +0200, Martin Peschke wrote:
> Ingo Molnar wrote:
> >, quite some work went into it - NACK :-(
> 
> Considering the amount of code.. ;-)I am sorry.
> 
> But seriously, did you consider using some user space tool or script to
> format this stuff the way you like it - similar to the way the powertop tool
> reshuffles timer_stats data found in a proc file, for example?

When I was doing my stuff, I intended for it to be parsed by a script or
simple command line tools like sort/grep piped through less. I also thought
it might be interesting to output the text as either a python or ruby
syntax collection so that it can go through more extensive sorting using
those languages.

There are roughly about 400 locks in a normal kernel for a desktop. The
list is rather cumbersome anyways so, IMO, it really should be handled
by parsing tools, etc... There could be more properties attached to each
lock, especially if you intend to get this to work on -rt, which needs more
things reported.
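As an illustration of how little it takes to consume such output in user space, here is a sketch of a line parser. It assumes the record layout of the sample lock stat output quoted earlier in this archive, "[8264, 996648, 0]   {inode_init_once, fs/inode.c, 196}"; the ls_record type, the field names and the buffer sizes are hypothetical:

```c
#include <stdio.h>

/* Hypothetical record for one line of lock stat output. */
struct ls_record {
	unsigned long ncontended;	/* first column */
	unsigned long ntracked;		/* second column */
	unsigned long ninlined;		/* third column */
	char function[64];
	char file[128];
	int line;
};

/* Parses one line; returns 0 on success, -1 if it does not match. */
static int ls_parse_line(const char *text, struct ls_record *r)
{
	int n = sscanf(text, " [%lu, %lu, %lu] {%63[^,], %127[^,], %d}",
		       &r->ncontended, &r->ntracked, &r->ninlined,
		       r->function, r->file, &r->line);

	return n == 6 ? 0 : -1;
}
```

Records parsed this way can then be sorted by the contention column with qsort(), or the same job can be left to sort and grep as suggested above.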

bill


Re: [RFC] [Patch 4/4] lock contention tracking slimmed down

2007-06-07 Thread hui

On Thu, Jun 07, 2007 at 09:30:21AM +0200, Ingo Molnar wrote:
> * Martin Peschke <[EMAIL PROTECTED]> wrote:
> > Do mean I might submit this stuff for -rt?
> 
> Firstly, submit cleanup patches that _do not change the output_. If you 
> have any output changes, do it as a separate patch, ontop of the cleanup 
> patch. Mixing material changes and cleanups into a single patch is a 
> basic patch submission mistake that will only earn you NACKs.

Martin,

First of all, I agree with Ingo in that this needs to be separated from
the rest of the cleanups. However, I don't understand why all of this
is so heavyweight when the current measurements that Peter makes are
completely sufficient for any reasonable purpose I can think of at the
moment. What's this stuff with labels about ?

It's important to get the points of contention so that the greater
kernel group can fix these issues, and not to log statistics for the
purpose of logging them. The original purpose should not be ignored when
working on this stuff.

By the way, what's the purpose of all of this stuff ? like what do you
intend to do with it over the long haul ?

bill


Re: [patch] CFS scheduler, -v12

2007-05-17 Thread hui

On Sun, May 13, 2007 at 05:38:53PM +0200, Ingo Molnar wrote:
> Even a simple 3D app like glxgears does a sys_sched_yield() for every 
> frame it generates (!) on certain 3D cards, which in essence punishes 
> any scheduler that implements sys_sched_yield() in a sane manner. This 
> interaction of CFS's yield implementation with this user-space bug could 
> be the main reason why some testers reported SD to be handling 3D games 
> better than CFS. (SD uses a yield implementation similar to the vanilla 
> scheduler.)
> 
> So i've added a yield workaround to -v12, which makes it work similar to 
> how the vanilla scheduler and SD does it. (Xorg has been notified and 
> this bug should be fixed there too. This took some time to debug because 
> the 3D driver i'm using for testing does not use sys_sched_yield().) The 
> workaround is activated by default so -v12 should work 'out of the box'.

This is an incorrect analysis. OpenGL has the ability to "yield" after
every frame specifically for SGI IRIX (React/Pro) frame scheduler (driven
by the system vertical retrace interrupt) so that it can free up CPU
resources for other tasks to run. The problem here is that the yield
behavior is treated generally instead of specifically to a particular
proportion scheduler policy.

The correct solution is for the app to use a directed yield and a policy
that can directly support it so that OpenGL can guarantee a frame rate
governed by CPU bandwidth allocated by the scheduler.

Will is working on such a mechanism now.

bill


Re: [patch] CFS scheduler, -v12

2007-05-17 Thread hui

On Thu, May 17, 2007 at 05:18:41PM -0700, Bill Huey wrote:
> On Sun, May 13, 2007 at 05:38:53PM +0200, Ingo Molnar wrote:
> > Even a simple 3D app like glxgears does a sys_sched_yield() for every 
> > frame it generates (!) on certain 3D cards, which in essence punishes 
> > any scheduler that implements sys_sched_yield() in a sane manner. This 
> > interaction of CFS's yield implementation with this user-space bug could 
> > be the main reason why some testers reported SD to be handling 3D games 
> > better than CFS. (SD uses a yield implementation similar to the vanilla 
> > scheduler.)
> > 
> > So i've added a yield workaround to -v12, which makes it work similar to 
> > how the vanilla scheduler and SD does it. (Xorg has been notified and 
> > this bug should be fixed there too. This took some time to debug because 
> > the 3D driver i'm using for testing does not use sys_sched_yield().) The 
> > workaround is activated by default so -v12 should work 'out of the box'.
> 
> This is an incorrect analysis. OpenGL has the ability to "yield" after
> every frame specifically for SGI IRIX (React/Pro) frame scheduler (driven
> by the system vertical retrace interrupt) so that it can free up CPU
> resources for other tasks to run. The problem here is that the yield
> behavior is treated generally instead of specifically to a particular
> proportion scheduler policy.
> 
> The correct solution is for the app to use a directed yield and a policy
> that can directly support it so that OpenGL can guaratee a frame rate
> governed by CPU bandwidth allocated by the scheduler.
> 
> Will is working on such a mechanism now.

Follow up:

http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0650/bks/SGI_Developer/books/REACT_PG/sgi_html/ch04.html

bill


Re: [PATCH] lockdep: lock contention tracking

2007-05-20 Thread hui

On Sun, May 20, 2007 at 12:30:26PM +0200, Peter Zijlstra wrote:
> The 4 points are the first 4 unique callsites that cause lock contention
> for the specified lock class.
> 
> writing a 0 to /proc/lockdep_contentions clears the stats

We should talk about unifying it with my lockstat work for -rt so that
we have a comprehensive solution for the "world". But you know that
already :)

Unifying lock initializer hash key initialization functions is a key
first step to that. Keep in mind, we can do more with this mechanism
than just kernel locks and we should probably keep that open and not
code into a corner.

bill


Re: [PATCH] lockdep: lock contention tracking

2007-05-20 Thread hui

On Mon, May 21, 2007 at 08:08:28AM +0200, Ingo Molnar wrote:
> To me it appears Peter's stuff is already a pretty complete solution on 
> its own, and it's a whole lot simpler (and less duplicative) than your 
> lockstat patch. Could you list the specific items/features that you 
> think Peter's stuff doesnt have?

First of all, this isn't an either/or kind of thing nor should it be thought
of in that way.

Precise file/function/line placement for one thing. My patch is specifically
for -rt which does checks that Peter's doesn't and is needed to characterize
-rt better. My stuff is potentially more extensible since I have other ideas
for it that really are outside of the lockdep logic currently. These can be
unified, but not so that one overrides the intended features of the other. That's
why I was hesitant to completely unify with lockdep in the manner you
suggested.

bill


Re: [PATCH] lockdep: lock contention tracking

2007-05-21 Thread hui

On Mon, May 21, 2007 at 09:50:13AM +0200, Ingo Molnar wrote:
> Have you looked at the output Peter's patch produces? It prints out 
> precise symbols:
> 
>  dcache_lock: 3000 0 [618] [] _atomic_dec_and_lock+0x39/0x58
> 
> which can easily be turned into line numbers using debuginfo packages or 
> using gdb. (But normally one only needs the symbol name, and we 
> certainly do not want to burden the kernel source with tracking 
> __FILE__/__LINE__ metadata, if the same is already available via 
> CONFIG_DEBUG_INFO.)
> 
> anything else?

If his hashing scheme can produce precise locations of where locks are
initialized, both by an initializer function and as a statically allocated
object, then my code is baroque and you should use Peter's code.

I wrote lockstat without the knowledge that lockdep was replicating the
same work and I audited 1600-something lock points in the kernel to
convert the usage of C99 style initializers to something more regular.

I also did this without consideration of things like debuginfo since
I don't use those things.

> > [...] My stuff is potentially more extensible since I have other ideas 
> > for it that really are outside of the lockdep logic currently. [...]
> 
> what do you mean, specifically?

Better if I show you the patches in the future instead of saying now.

> i really need specifics. Currently i have the choice between your stuff:
> 
>17 files changed, 1425 insertions(+), 80 deletions(-)
> 
> and Peter's patch:
> 
> 6 files changed,  266 insertions(+), 18 deletions(-)
> 
> and Peter's patch (if it works out fine in testing - and it seemed fine 
> so far on my testbox), is smaller, more maintainable, better integrated 
> and thus the clear candidate for merging into -rt and merging upstream 
> as well. It's far cleaner than i hoped this whole lock-stats thing could 
> be done based on lockdep, so i'm pretty happy with Peter's current patch 
> already.

If it meets your criteria and what you mentioned above is completely
accurate, then use it instead of mine. I'll just finish up what I have
done with reader tracking in my lockstat and migrate my -rt specific
goodies to his infrastructure.

bill


Re: [PATCH] lockdep: lock contention tracking

2007-05-21 Thread hui

On Mon, May 21, 2007 at 11:36:39AM +0200, Ingo Molnar wrote:
> you got the history wrong i think: the first version of lockdep was 
> released to lkml a year ago (May 2006), while the first time you 
> mentioned your lock contention patch was November 2006 and you released 
> it to lkml in December 2006 - so it was _you_ who was "replicating the 
> same work", not lockdep :-) And this was pointed out to you very early 
> on, many months ago.

Yeah, and where do we disagree here again ? So I take it you're disagreeing
with my agreement with you that lockdep came first ? Geez, think about that
one for a bit. (chuckle) :)

I'd like to remind you that I mapped out the lock hierarchy for a fully
preemptive -rt kernel while you and *others* were wanking around with
voluntary preempt, remember ? :) Keep in mind, I'm single-mindedly obsessed with -rt.

[back to the topic]

> and regarding C99 style lock initializers: the -rt project has been 
> removing a whole heap of them in the past 2.5 years, since Oct 2004 or 
> so, and regularly cleansed the upstream kernel for old-style 
> initializers ever since then - so i'm not sure what you are referring 
> to.

Don't worry about it. I did the same work only to realize that there wasn't
much left to convert over.

> btw., you dont even need CONFIG_DEBUG_INFO to get usable symbol names, 
> CONFIG_KALLSYMS alone will do it too. (It's only if you really cannot 
> tell from the lock symbol name and the function name what the entry is 
> about - which is very rare - that you need to look at any debug-info)

I'm anal about these things. I thought that you can do more magic than that
from your previous email, but it just confirms my understanding of how
symbols work already, unless there was a meltdown of the universal physical
laws here or something. That's why I made the choices I did.

The inode initialization code is ambiguous, which is why having a specific
line number was very useful. It showed that one of the locks protecting a
tree was heavily hit. There were multiple places in which it could have
been if I hadn't had this information.
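
To make the file/function/line point concrete, here is a minimal userspace
sketch of an initializer that records its call site. It is not the lockstat
patch itself; traced_lock, lock_site and traced_lock_init_at are hypothetical
names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of where a lock was initialized. */
struct lock_site {
    const char *file;
    const char *func;
    int line;
};

/* Hypothetical lock type; a real lock would carry its own state. */
struct traced_lock {
    int locked;
    struct lock_site init_site;
};

static void traced_lock_init_at(struct traced_lock *l, const char *file,
                                const char *func, int line)
{
    assert(l != NULL);
    l->locked = 0;
    l->init_site.file = file;
    l->init_site.func = func;
    l->init_site.line = line;
}

/* The macro expands at the call site, so __FILE__, __func__ and __LINE__
 * name the caller rather than the place where the helper is defined. */
#define traced_lock_init(l) \
    traced_lock_init_at((l), __FILE__, __func__, __LINE__)
```

Two locks set up from the same generic init function but from different
callers' lines would then carry different init_site values, which is the
kind of disambiguation described above.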

Sleep time...

bill


Re: [PATCH] lockdep: lock contention tracking

2007-05-21 Thread hui

On Mon, May 21, 2007 at 10:55:47AM +0100, Christoph Hellwig wrote:
> On Mon, May 21, 2007 at 11:36:39AM +0200, Ingo Molnar wrote:
> > you got the history wrong i think: the first version of lockdep was 
> > released to lkml a year ago (May 2006), while the first time you 
> > mentioned your lock contention patch was November 2006 and you released 
> > it to lkml in December 2006 - so it was _you_ who was "replicating the 
> > same work", not lockdep :-) And this was pointed out to you very early 
> > on, many months ago.
> 
> And lockmeter, the very first patch of this sort is from the 90s, but
> got mostly ignored here on lkml, of course :)

Unfortunately, it's not nearly as cool as my patch by default because I wrote
it. :)

bill


Re: [PATCH] lockdep: lock contention tracking

2007-05-21 Thread hui

On Mon, May 21, 2007 at 03:19:46AM -0700, Bill Huey wrote:
> On Mon, May 21, 2007 at 11:36:39AM +0200, Ingo Molnar wrote:
> > you got the history wrong i think: the first version of lockdep was 
> > released to lkml a year ago (May 2006), while the first time you 
> > mentioned your lock contention patch was November 2006 and you released 
> > it to lkml in December 2006 - so it was _you_ who was "replicating the 
> > same work", not lockdep :-) And this was pointed out to you very early 
> > on, many months ago.
> 
> Yeah, and where do we disagree here again ? So I take it you're disagreeing
> with my agreement with you that lockdep came first ? Geez, think about that
> one for a bit. (chuckle) :)

Yeah, sorry about the wording reversal. It was unintentional. I tend to drop
minor words like "not" and stuff which can create confusion.

bill


Re: [PATCH] lockdep: lock contention tracking

2007-05-21 Thread hui

On Mon, May 21, 2007 at 02:46:55PM +0200, Ingo Molnar wrote:
> which combines into this statement of yours: "I audited 1600 something 
> lock points in the kernel to convert the usage of C99 style initializers 
> something more regular, only to find out that there wasn't much left to 
> convert over?", correct? Which begs the question: why did you mention 
> this then at all? I usually reply to points made by others in the 
> assumption that there's some meaning behind them ;-)

It was about how much time I wasted replicating work that I didn't know
about. Some folks have different communication styles and it isn't meant
to be anything further than that.

Me and Peter are talking about possibly merging parts of each other's
patch. I've been working on splitting up the reader/writer paths so that
each type of contention is logged as well for tracking reader contentions
against that existing rwsem problem and the like, but my stuff doesn't
boot yet (working part time drags things). I/We'll keep you updated.

bill


Re: [PATCH] lockdep: lock contention tracking

2007-05-21 Thread hui

On Mon, May 21, 2007 at 12:58:03PM +0200, Ingo Molnar wrote:
> and nobody pushed strong enough to get it included. But ... Peter's 
> patch could perhaps be extended to cover similar stats as lockmeter, 
> ontop of the existing lockdep instrumentation. Peter, can you see any 
> particular roadblocks with that?

Definitely. Lockmeter isn't terribly Linux-ish from my examination of
that patch a while back. Doing it against lockdep is definitely the
right thing to do in that it unifies lock handling through initializer
keys that lockmeter doesn't, from my memory.

The spin time tracking can be put into the slow path of the spin, like
what Peter has now, so that it has minimal impact against the uncontended
case. Updating the times would then be a trivial pointer dereference
plus add and hopefully won't have instrumentation side effects against
the rest of the locking behavior in the system.
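
As a rough illustration of keeping the timing in the slow path only, here is
a userspace sketch using C11 atomics. It is not Peter's code or the kernel
spinlock implementation; stat_lock, spin_stats and the injected clock are
assumptions made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-lock statistics; not the lockdep or lockstat layout. */
struct spin_stats {
    uint64_t contended;  /* number of slow-path entries */
    uint64_t spin_ns;    /* total time spent spinning */
};

struct stat_lock {
    atomic_flag held;
    struct spin_stats *stats;  /* one pointer dereference away */
};

/* The clock is passed in so the sketch does not depend on a time source. */
typedef uint64_t (*clock_fn)(void);

/* Deterministic stand-in clock; a real one would read a monotonic counter. */
static uint64_t tick_clock(void)
{
    static uint64_t t;
    return t += 10;
}

static void stat_lock_init(struct stat_lock *l, struct spin_stats *stats)
{
    atomic_flag_clear(&l->held);
    l->stats = stats;
}

static void stat_lock_acquire(struct stat_lock *l, clock_fn now)
{
    assert(l && l->stats && now);

    /* Fast path: an uncontended acquire touches no statistics at all. */
    if (!atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        return;

    /* Slow path: time the spin, then do one dereference plus add.
     * The update happens after the lock is taken, so the lock itself
     * serializes writers to the statistics. */
    uint64_t start = now();
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;
    l->stats->contended += 1;
    l->stats->spin_ns += now() - start;
}

static void stat_lock_release(struct stat_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The uncontended acquire returns before any clock read or counter update, so
the instrumentation cost lands only on callers that were going to spin anyway.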

bill


Re: Sched - graphic smoothness under load - cfs-v13 sd-0.48

2007-05-23 Thread hui

On Wed, May 23, 2007 at 09:58:35AM +0200, Xavier Bestel wrote:
> On Wed, 2007-05-23 at 07:23 +0200, Michael Gerdau wrote:
> > For me the huge difference you have for sd to the others increases the
> > likelyhood the glxgears benchmark does not measure scheduling of graphic
> > but something else.
> 
> I think some people forget that X11 has its own scheduler for graphics
> operations.

OpenGL is generally orthogonal to X11 or at least should be. But this could
vary with the implementation depending on how brain damaged the system is.

I'd expect the performance characteristics to be different depending on what
subsystem is being used.

bill


Re: [PATCH 0/7] lock contention tracking -v2

2007-05-23 Thread hui

On Wed, May 23, 2007 at 12:33:11PM +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[EMAIL PROTECTED]> wrote:
...
> > It also measures lock wait-time and hold-time in nanoseconds. The 
> > minimum and maximum times are tracked, as well as a total (which 
> > together with the number of event can give the avg).
> > 
> > All statistics are done per lock class, per write (exclusive state) 
> > and per read (shared state).
> > 
> > The statistics are collected per-cpu, so that the collection overhead 
> > is minimized via having no global cachemisses.
...
> really nice changes! The wait-time and hold-time changes should make it 
> as capable as lockmeter and more: lockmeter only measured spinlocks, 
> while your approach covers all lock types (spinlocks, rwlocks and 
> mutexes).
> 
> The performance enhancements in -v2 should make it much more scalable 
> than your first version was. (in fact i think it should be completely 
> scalable as the statistics counters are all per-cpu, so there should be 
> no cacheline bouncing at all from this)

Per-cpu is pretty important since you can potentially hit that logic more
often with your wait-time code. You don't want to affect the actual
measurement with the measurement code. It's that uncertainty principle thing.
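
A small sketch of the per-cpu idea, assuming a fixed NR_CPUS and a 64-byte
cache line (both illustrative; the kernel would use its own per-cpu
machinery, and none of these names come from Peter's patch). Each CPU writes
only its own slot and the slots are folded together when the stats are read:

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4  /* illustrative value only */

struct wait_stat {
    uint64_t min_ns, max_ns, total_ns, nr;
};

/* One slot per CPU, aligned to an assumed 64-byte cache line so that
 * updates from different CPUs never share a line. */
struct wait_stat_percpu {
    _Alignas(64) struct wait_stat s;
};

static struct wait_stat_percpu stats[NR_CPUS];

/* Called by the CPU that did the waiting; touches only its own slot. */
static void wait_stat_record(unsigned cpu, uint64_t ns)
{
    assert(cpu < NR_CPUS);
    struct wait_stat *s = &stats[cpu].s;
    if (s->nr == 0 || ns < s->min_ns)
        s->min_ns = ns;
    if (ns > s->max_ns)
        s->max_ns = ns;
    s->total_ns += ns;
    s->nr++;
}

/* Read side: fold all slots into one summary; avg is total_ns / nr. */
static struct wait_stat wait_stat_sum(void)
{
    struct wait_stat sum = {0, 0, 0, 0};
    for (unsigned cpu = 0; cpu < NR_CPUS; cpu++) {
        const struct wait_stat *s = &stats[cpu].s;
        if (s->nr == 0)
            continue;
        if (sum.nr == 0 || s->min_ns < sum.min_ns)
            sum.min_ns = s->min_ns;
        if (s->max_ns > sum.max_ns)
            sum.max_ns = s->max_ns;
        sum.total_ns += s->total_ns;
        sum.nr += s->nr;
    }
    return sum;
}
```

The cost moves to the reader, which is rare, while the hot recording path
stays local to one CPU.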

It is looking pretty good. :) You might like to pretty up the output even more,
but it's pretty usable as is.

bill


Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler

2007-03-08 Thread hui

On Thu, Mar 08, 2007 at 10:31:48PM -0800, Linus Torvalds wrote:
> On Thu, 8 Mar 2007, Bill Davidsen wrote:
> > Please, could you now rethink plugable scheduler as well? Even if one had to
> > be chosen at boot time and couldn't be change thereafter, it would still allow
> > a few new thoughts to be included.
> 
> No. Really.
> 
> I absolutely *detest* pluggable schedulers. They have a huge downside: 
> they allow people to think that it's ok to make special-case schedulers. 
> And I simply very fundamentally disagree.

Linus,

This is where I have to respectfully disagree. There are types of loads
that aren't covered in SCHED_OTHER. They are typically certain real time
loads and those folks (regardless of -rt patch) would benefit greatly
from having something like that in place. Those scheduler developers can
plug in (at compile time) their work without having to track and forward
port their code constantly so that non-SCHED_OTHER policies can be
experimented with easily.

This is especially so with rate monotonic influenced schedulers that are
in the works by real time folks, stock kernel or not. This is about
making Linux generally accessible to those folks and not folks doing
SCHED_OTHER work. They are orthogonal.

bill


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread hui

On Tue, Mar 13, 2007 at 08:41:05PM +1100, Con Kolivas wrote:
> On Tuesday 13 March 2007 20:29, Ingo Molnar wrote:
> > So the question is: if all tasks are on the same nice level, how does,
> > in Mike's test scenario, RSDL behave relative to the current
> > interactivity code?
... 
> The only way to get the same behaviour on RSDL without hacking an 
> interactivity estimator, priority boost cpu misproportionator onto it is to 
> either -nice X or +nice lame.

Hello Ingo,

After talking to Con over IRC (and if I can summarize it), he's wondering if
properly nicing those tasks, as previously mentioned in user emails, would solve
this potential user-reported regression or if something additional is needed. It
seems like folks are happy with the results once the nice tweaking is done.
This is a huge behavior change to the scheduler after all (just thinking out loud).

bill


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread hui

On Tue, Mar 13, 2007 at 12:58:01PM -0700, David Schwartz wrote:
> > But saying that the user needs to explicitly hold the schedulers hand
> > and nice everything to tell it how to schedule seems to be an abdication
> > of duty, an admission of failure.  We can't expect users to finesse all
> > their processes with nice, and it would be a bad user interface to ask
> > them to do so.
> 
> Then you will always get cases where the scheduler does not do what the user
> wants because the scheduler does not *know* what the user wants. You always
> have to tell a computer what you want it to do, and the best it can do is
> faithfully follow your request.
> 
> I think it's completely irrational to ask for a scheduler that automatically
> gives more CPU time to CPU hogs.

SGI machines had an interactive term in their scheduler as well as a
traditional nice priority. It might be useful for Con to possibly consider
this as an extension for problematic (badly hacked) processes like X.

Nice as a control mechanism is rather coarse, yet overly strict because of
the sophistication of his scheduler. Having an additional term (control knob)
would be nice for a scheduler that (correct me if I'm wrong, Con):

1) has rudimentary bandwidth control for a group of runnable processes
2) has a basic deadline mechanism

The "nice" term is only an indirect way of controlling his scheduler, and I
think this kind of imprecise tweaking being done with various apps is an
indicator of how lacking it is as a control term in the scheduler. It would
be good to have some kind of coherent and direct control over the knobs that
are (1) and (2).

Schedulers like this have superior control over these properties and they
should be fully exploited with terms in addition to "nice".

Item (1) is subject to a static "weight" multiplication in relation to other
runnable tasks. It also might be useful to make a part of that term a bit
dynamic to get some kind of interactivity control back. It's a matter of
testing, tweaking, etc., and that is not easy for apps that don't have a
direct thread context to control, like a thread-unaware X system.
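
For what a static "weight" multiplication can look like, here is generic
proportional-share arithmetic. It is not taken from Con's scheduler;
weighted_slice_us and its parameters are hypothetical names for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Share of a scheduling period that task `idx` receives, given the static
 * weights of all currently runnable tasks: period * w[idx] / sum(w). */
static uint64_t weighted_slice_us(uint64_t period_us, const uint32_t *weights,
                                  size_t n, size_t idx)
{
    assert(weights != NULL && idx < n);

    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += weights[i];
    assert(total > 0);

    return period_us * weights[idx] / total;
}
```

With weights {1, 1, 2} and a 100000 us period, the third task gets 50000 us
and the other two get 25000 us each. A dynamic interactivity term would
amount to adjusting an entry in the weights array at run time.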

> > And if someone/distro *does* go to all the effort of managing how to get
> > all the processes at the right nice levels, you have this big legacy
> > problem where you're now stuck keeping all those nice values meaningful
> > as you continue to develop the scheduler.  Its bad enough to make them
> > do the work in the first place, but its worse if they need to make it a
> > kernel version dependent function.
> 
> I agree. I'm not claiming to have the perfect solution. Let's not let the
> perfect be the enemy of the good though.

I hope this was useful.

bill


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-13 Thread hui

On Tue, Mar 13, 2007 at 01:10:40PM -0700, Jeremy Fitzhardinge wrote:
> David Schwartz wrote:
> Hm, well.  The general preference has been for the kernel to do a
> good-enough job on getting the common cases right without tuning, and
> then only add knobs for the really tricky cases it can't do well.  But
> the impression I'm getting here is that you often get sucky behaviours
> without tuning.

Well, you get strict behaviors as expected for this scheduler. 

> > I think it's completely irrational to ask for a scheduler that automatically
> > gives more CPU time to CPU hogs.
> >   
> 
> Well, it doesn't have to.  It could give good low latency with short
> timeslices to things which appear to be interactive.  If the interactive
> program doesn't make good use of its low latency, then it will suck. 
> But that's largely independent of how much overall CPU you give it.

This is way beyond what SCHED_OTHER should do. It can't predict the universe.
Much of the interactivity estimator borders on magic. It just happens to
also "be a good fit" for hacky apps as well almost by accident.

> > I agree. I'm not claiming to have the perfect solution. Let's not let the
> > perfect be the enemy of the good though.
> 
> For all its faults, the current scheduler mostly does a good job without
> much tuning - I normally only use "nice" to run cpu-bound things without
> jacking the cpu speed up.  Certainly in my normal interactive use of
> compiz vs make -j4 on a dual-core generally gets pretty pretty good
> results.  I plan on testing the new scheduler soon though.

We can do MUCH better in the long run with something like Con's scheduler.
His approach shouldn't be dismissed because it's running into a relatively
few minor snags that are largely the fault of scheduling opaque applications. It's
precise enough that it can also be loosened up a bit with additional
control terms (previous email).

It might be good to think about that a bit to see if a scheme like this can
be made more adaptable for the environment it serves. You'd then have both
precisely bounded control over CPU usage and enough flexibility for the bursty
needs of certain apps.

bill


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-13 Thread hui

On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
... 
>The CFS patch uses a completely different approach and implementation
>from RSDL/SD. My goal was to make CFS's interactivity quality exceed
>that of RSDL/SD, which is a high standard to meet :-) Testing
>feedback is welcome to decide this one way or another. [ and, in any
>case, all of SD's logic could be added via a kernel/sched_sd.c module
>as well, if Con is interested in such an approach. ]

Ingo,

Con has been asking for module support for years if I understand your patch
correctly. You'll also need this for -rt as well with regards to bandwidth
scheduling. Good to see that you're moving in this direction.

bill


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-13 Thread hui

On Fri, Apr 13, 2007 at 02:21:10PM -0700, William Lee Irwin III wrote:
> On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote:
> > Yeah. Note that there are some subtle but crutial differences between 
> > PlugSched (which Con used, and which i opposed in the past) and this 
> > approach.
> > PlugSched cuts the interfaces at a high level in a monolithic way and 
> > introduces kernel/scheduler.c that uses one pluggable scheduler 
> > (represented via the 'scheduler' global template) at a time.
> 
> What I originally did did so for a good reason, which was that it was
> intended to support far more radical reorganizations, for instance,
> things that changed the per-cpu runqueue affairs for gang scheduling.
> I wrote a top-level driver that did support scheduling classes in a
> similar fashion, though it didn't survive others maintaining the patches.

Also, gang scheduling is needed to solve virtualization issues regarding
spinlocks in a guest image. You could potentially be spinning on a thread
that isn't currently running which, needless to say, is very bad.

bill


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-14 Thread hui

On Sat, Apr 14, 2007 at 01:18:09AM +0200, Ingo Molnar wrote:
> very much so! Both Con and Mike has contributed regularly to upstream 
> sched.c:

The problem here is that Con can get demotivated (and rather upset) when an
idea gets proposed, like PlugSched, only to have people be hostile to it
and then suddenly turn around and adopt this idea. It gives the impression
that you, in this specific case, were more interested in controlling a
situation and the track of development instead of actually being inclusive
of the development process with discussion and serious consideration, etc...

This is how the Linux community can be perceived as elitist. The old guard
would serve the community better if people were more mindful and sensitive
to developer issues. There was a particular speech that I was turned off by
at OLS 2006 that pretty much pandered to the "old guard's" needs over
newer developers. Since I'm a somewhat established engineer in -rt (being
the only other person that mapped the lock hierarchy out for full
preemptibility), I had the confidence to pretty much ignore it, while
previously this could have really upset me and been highly discouraging to
a relatively new developer.

As Linux gets larger and larger this is going to be an increasing problem
when folks come into the community with new ideas and the community will
need to change if it intends to integrate these folks. IMO, a lot of
these flame wars wouldn't need to exist if folks listened to each other
better and permitted co-ownership of code like the scheduler, since it needs
multiple hands in it to adapt to new loads and situations, etc...

I'm saying this nicely now since I can be nasty about it.

bill


Re: ZFS with Linux: An Open Plea

2007-04-14 Thread hui

On Sat, Apr 14, 2007 at 10:04:23AM -0400, Mike Snitzer wrote:
> ZFS does have some powerful features but much of it depends on their
> broken layering of volume management.  Embedding the equivalent of LVM
> into a filesystem _feels_ quite wrong.

They have a clustering concept in their volume management that isn't
expressible with something like LVM. That justifies their approach from
what I can see.

> That aside, the native snapshot capabilities of ZFS really stand out
> for me.  The redirect on write semantics aren't exclusive to ZFS;
> NetApp's WAFL employs the same.  But with both ZFS and WAFL they were
> designed to do snapshots extremely well from the ground up.

Write allocation for these kinds of systems (especially when concerned
with mirroring) is non-trivial.

> Unfortunately in order for Linux to incorporate such a feature I'd
> imagine a new filesystem would need to be developed with redirect on
> write at its core.  Can't really see ext4 or any other existing Linux
> filesystem grafting such a feature into it.  But even though I can't
> see it; do others?

You also can't use the standard page cache to buffer all of the sophisticated
semantics of these systems and have to create your own.

> I've learned that Sun and NetApp's lawyers had it out over the
> redirect on write capability of ZFS.  When the dust settled Sun had
> enough patent protection to motivate a truce with NetApp.

I think they are still talking and it's far from over the last I heard.
The creation of a new inode and descending indirect blocks is a fundamental
concept behind WAFL. Also, ZFS tends to be a heavyweight as far as
metadata goes, and quite possibly unnecessarily so, which is likely to affect
performance for things related to keeping a relevant block allocation map in
memory. ZFS is a complete pig compared to traditional file systems.

> The interesting side-effect is now ZFS is "open" and with that comes
> redirect on write in a file system other than WAFL.  But ZFS's CDDL
> conflicts with the GPL so I'm not too sure how Linux could hit the
> ground running in this potentially patent mired area of filesystem
> development.  The validity of NetApp having patented redirect on write
> aside; does the conflict between CDDL and GPL _really_ matter?  Or did
> the CDDL release of ZFS somehow undermine NetApp's WAFL patent?

That doesn't really matter. FUSE could be extended to handle this kind
of stuff and still have it be in userspace. The BSDs get around including
Stephen Tweedie's ext2 header file by making the user manually
compile it. That's not a problem for Linux folks that can download a
patch and compile a kernel.

FreeBSD already has a port of ZFS. Just for a kick, Google for that as
a possible basis for a Linux kernel port.

bill


Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-14 Thread hui

On Sun, Apr 15, 2007 at 01:27:13PM +1000, Con Kolivas wrote:
...
> Now that you're agreeing my direction was correct you've done the usual Linux 
> kernel thing - ignore all my previous code and write your own version. Oh 
> well, that I've come to expect; at least you get a copyright notice in the 
> bootup and somewhere in the comments give me credit for proving it's 
> possible. Let's give some other credit here too. William Lee Irwin provided 
> the major architecture behind plugsched at my request and I simply finished 
> the work and got it working. He is also responsible for many IRC discussions 
> I've had about cpu scheduling fairness, designs, programming history and code 
> help. Even though he did not contribute code directly to SD, his comments 
> have been invaluable.

Hello folks,

I think the main failure I see here is that Con wasn't included in this design
or privately in the review process. There could have been better co-ownership of the
code. This could also have been done openly on lkml (since this is kind of what
this medium is about to a significant degree) so that consensus can happen (Con
can be reasoned with). It would have achieved the same thing but probably more
smoothly if folks just listened, considered an idea and then, in this case,
created something that would allow for experimentation from outsiders in a
fluid fashion.

If these issues aren't fixed, you're going to be stuck with the same kind of
creeping elitism that has gradually killed the FreeBSD project and other BSDs.
I can't comment on the code implementation. I'm focused on other things now
that I'm at NetApp and I can't help out as much as I could. Being former BSDi,
I had a firsthand view of these issues as they played out.

A development process like this is likely to discourage smart people from
wanting to contribute to Linux, and folks should be conscious of this issue.
It's basically a lot of code and concepts that at least two individuals have
worked on (wli and con) only to have it be rejected and then suddenly replaced
by code from a community gatekeeper. In this case, this results in both Con
and Bill Irwin being woefully underutilized.

If I were one of these people, I'd be mighty pissed.

bill

Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-15 Thread hui

On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote:
> [...]
> 
> Demystify what?   The casual observer need only read either your attempt

Here's the problem. You're a casual observer and obviously not paying
attention.

> at writing a scheduler, or my attempts at fixing the one we have, to see
> that it was high time for someone with the necessary skills to step in.
> Now progress can happen, which was _not_ happening before.

I think that's inaccurate and there are plenty of folks that have that
technical skill and background. The scheduler code isn't a deep mystery
and there are plenty of good kernel hackers out here across many
communities.  Ingo isn't the only person on this planet to have deep
scheduler knowledge. Priority heaps are not new and Solaris has had a
pluggable scheduler framework for years.

Con's characterization of how Linux kernel development works is something
I'm more prone to believe than your view. I think it's a great shame to
have folks like Bill Irwin and Con waste time trying to do something right
only to have their ideas attacked, then copied and held up as the solution
to this kind of technical problem, in a complete reversal of technical
opinion as it suits the moment. This is just wrong in so many ways.

It outlines the problems with Linux kernel development and the questionable
elitism regarding ownership of certain sections of the kernel code.

I call it "churn squat" and instances like this only support that view,
which I would rather be completely wrong and inaccurate.

bill

Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-15 Thread hui

On Sun, Apr 15, 2007 at 10:44:47AM +0200, Ingo Molnar wrote:
> I prefer such early releases to lkml _alot_ more than any private review 
> process. I released the CFS code about 6 hours after i thought "okay, 
> this looks pretty good" and i spent those final 6 hours on testing it 
> (making sure it doesnt blow up on your box, etc.), in the final 2 hours 
> i showed it to two folks i could reach on IRC (Arjan and Thomas) and on 
> various finishing touches. It doesnt get much faster than that and i 
> definitely didnt want to sit on it even one day longer because i very 
> much thought that Con and others should definitely see this work!
> 
> And i very much credited (and still credit) Con for the whole fairness 
> angle:
> 
> ||  i'd like to give credit to Con Kolivas for the general approach here:
> ||  he has proven via RSDL/SD that 'fair scheduling' is possible and that
> ||  it results in better desktop scheduling. Kudos Con!
> 
> the 'design consultation' phase you are talking about is _NOW_! :)
> 
> I got the v1 code out to Con, to Mike and to many others ASAP. That's 
> how you are able to comment on this thread and be part of the 
> development process to begin with, in a 'private consultation' setup 
> you'd not have had any opportunity to see _any_ of this.
> 
> In the BSD space there seem to be more 'political' mechanisms for 
> development, but Linux is truly about doing things out in the open, and 
> doing it immediately.

I can't even begin to talk about how screwed up BSD development is. Maybe
another time privately.

Ok, Linux development and inclusiveness can be improved. I'm not trying
to "call you out" (slang for accusing you with the sole intention to call
you crazy in a highly confrontational manner). This is being discussed
publicly here to bring this issue to light and to open a communication
channel as a means to resolve it.

> Okay? ;-)

It's cool. We're still getting to know each other professionally and it's
okay, to a certain degree, to have a communication disconnect, but only as
long as it clears. Your productivity is amazing BTW. But here's the
problem: there's this perception that NIH is the default mentality here
in Linux.

Con feels that this kind of action is intentional and has a malicious
quality to it as a means of "churn squatting" sections of the kernel tree.
The perception here is that sections of the Linux kernel are intentionally
"churn squatted" to prevent any ideas from creeping in other than those of
the owner of that subsystem (VM, scheduling, etc...) because of the lack of
modularity in the kernel.
This isn't an API question but possibly a question of general code quality
and how maintenance () of it can .

This was predicted by folks and then this perception was *realized* when
you wrote the equivalent kind of code that has technical overlap with SDL
(this is just one dry example). To a person who is writing new code for
Linux, having one of the old guard write code equivalent to that of a
newcomer has the effect of displacing that person with regard to both the
code and the responsibility for it. When this happens over and over again
and folks get annoyed by it, Linux development starts to seem elitist.

I know this because I heard (read) Con's IRC chats about these matters
all of the time. This is not just his view but a view of
other kernel folks that differing views as to. The closing talk at OLS
2006 was highly disturbing in many ways. It amounted to "Christoph is
right, everybody else is wrong", which sends a highly negative message to
new kernel developers that, say, don't work for RH directly or any of the
other mainstream Linux companies. After a while, it starts seeming like
this kind of behavior is completely intentional and that Linux is
full of arrogant bastards.

What I would have done here was to contact Peter Williams, Bill Irwin
and Con about what you're doing and reach a common consensus about how
to create something that would be inclusive of all of their ideas.
Discussions can get technically heated but that's ok; the discussion is
happening and it brings down the wall of this perception. Bill and
Con are on oftc.net/#offtopic2. Riel is there as well as Peter Zijlstra.
It might be very useful, it might not be. Folks are all stubborn
about their ideas and hold on to them for dear life. Effective
leaders can deconstruct this hostility and animosity. I don't claim
to be one.

Because of past hostility to something like schedplugin, the hostility
and terseness of responses can be perceived simply as "I'm right,
you're wrong", which is condescending. This affects discussion and
outright destroys a constructive process if it happens continually,
since it reinforces that view of "You're an outsider, we don't care
about you". Nobody is listening to each other at that point, and folks get
pissed. Then they think "I'm going to NIH this person with patch
X because he/she did the same here", which is dysfunctional.

Oddly

Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-15 Thread hui

On Sun, Apr 15, 2007 at 09:25:07AM -0700, Arjan van de Ven wrote:
> Now this doesn't mean that people shouldn't be nice to each other, not
> cooperate or steal credits, but I don't get the impression that that is
> happening here. Ingo is taking part in the discussion with a counter
> proposal for discussion *on the mailing list*. What more do you want??

Con should have been CCed from the first moment this was put into motion
to limit the perception of exclusion. That was mistake number one and a
big-time failure to understand this dynamic. After all, it was Con's idea.
Why the hell he was excluded from Ingo's development process is baffling to
me and (most likely) to him.

He put a lot of effort into SDL and his experiences with scheduling
should still be seriously considered in this development process even if
he doesn't write a single line of code from this moment on.

What should have happened is that our very busy associate at RH by the
name of Ingo Molnar should have leveraged more of Con's and Bill's work
and used them as a proxy for his own ideas. They would have loved to have
contributed more and our very busy Ingo Molnar would have gotten a lot
of his work and ideas implemented without even opening a single
source file for editing. They would have happily done this work for
Ingo. Ingo could have been used for something else more important like
making KVM less of a freaking ugly hack and we all would have benefited
from this.

He could have been working on SystemTap so that you stop losing accounts
to Sun and Solaris 10's DTrace. He could have been working with Riel to
fix your butt ugly page scanning problem causing horrible contention via
the Clock/Pro algorithm, etc... He could have been fixing the ugly futex
rwsem mapping problem that's killing -rt and anything that uses Posix
threads. He could have created a userspace thread control block (TCB)
with Mr. Drepper so that we can turn off preemption in userspace
(userspace per CPU local storage) and implement a very quick non-kernel
crossing implementation of priority ceilings (userspace check for priority
and flags at preempt_schedule() in the TCB) so that our -rt Posix API
doesn't suck donkey shit... Need I say more ?

As programmers like Ingo get spread more thinly, he needs super smart
folks like Bill Irwin and Con to help him out, and he needs to learn to
resist NIH-ing folks' stuff out of some weird fear. When this happens,
folks like Ingo must learn to "facilitate" development with those kinds
of folks in addition to implementing it.

It takes time and practice for him to entrust folks to do things for him.
Ingo is the best channel for getting new Linux kernel ideas communicated
to Linus. His value goes beyond just code and he is often the
biggest hammer we have in the Linux community to get stuff into the
kernel. "Facilitation" of others is something that solo programmers must
learn as groups like the Linux kernel get larger and larger every year.

Understand ? Are we in embarrassing agreement here ?

bill

Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

2007-04-17 Thread hui

On Tue, Apr 17, 2007 at 04:52:08PM -0700, Michael K. Edwards wrote:
> On 4/17/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
> >The ongoing scheduler work is on a much more basic level than these
> >affairs I'm guessing you googled. When the basics work as intended it
> >will be possible to move on to more advanced issues.
... 

Will probably shouldn't have dismissed your points, but he probably means
that we can't even get at this stuff until the fundamentals are in place.

> Clock scaling schemes that aren't integral to the scheduler design
> make a bad situation (scheduling embedded loads with shotgun
> heuristics tuned for desktop CPUs) worse, because the opaque
> heuristics are now being applied to distorted data.  Add a "smoothing"
> scheme for the distorted data, and you may find that you have
> introduced an actual control-path instability.  A small fluctuation in
> the data (say, two bursts of interrupt traffic at just the right
> interval) can result in a long-lasting oscillation in some task's
> "dynamic priority" -- and, on a fully loaded CPU, in the time that
> task actually gets.  If anything else depends on how much work this
> task gets done each time around, the oscillation can easily propagate
> throughout the system.  Thrash city.

Hyperthreading issues are quite similar to clock scaling issues.
Con's infrastructure changes to move things in that direction were
rejected, as were other infrastructure changes, further infuriating
Con to the point of dropping development on RSDL and derivatives.

bill
