Re: [RFC] Add support for semaphore-like structure with support for asynchronous I/O
On Tue, Apr 05, 2005 at 09:20:57PM -0400, Trond Myklebust wrote:
> On Tue, 05.04.2005 at 11:46 (-0400), Benjamin LaHaise wrote:
>
> > I can see that goal, but I don't think introducing iosems is the right
> > way to achieve it. Instead (and I'll start tackling this), how about
> > factoring out the existing semaphore implementations to use a common
> > lib/semaphore.c, much like lib/rwsem.c? The iosems can be used as a
> > basis for the implementation, but we can avoid having to do a giant
> > s/semaphore/iosem/g over the kernel tree.
>
> If you're willing to take this on then you have my full support and I'd
> be happy to lend a hand.

I would also expect that some RT subgroups would be highly interested in
getting it to respect priority for reworking parts of softirq.

bill

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Priority Lists for the RT mutex
On Mon, Apr 11, 2005 at 10:57:37AM +0200, Ingo Molnar wrote:
>
> * Perez-Gonzalez, Inaky <[EMAIL PROTECTED]> wrote:
>
> > Let me re-phrase then: it is a must have only on PI, to make sure you
> > don't have a loop when doing it. Maybe it is a consequence of the
> > algorithm I chose. -However- it should be possible to disable it in
> > cases where you are reasonably sure it won't happen (such as kernel
> > code). In any case, AFAIR, I still did not implement it.
>
> are there cases where userspace wants to disable deadlock-detection for
> its own locks?

I'd disable it for userspace locks. There might be folks that want to
implement userspace drivers, but I can't imagine it being 'ok' to have
the kernel call out to userspace and have it block correctly. I would
expect them to do something else that's less drastic.

> the deadlock detector in PREEMPT_RT is pretty much specialized for
> debugging (it does all sorts of weird locking tricks to get the first
> deadlock out, and to really report it on the console), but it ought to
> be possible to make it usable for userspace-controlled locks as well.

If I understand things correctly, I'd let that be an RT app issue, and
the app folks should decide what is appropriate for their setup. If they
need a deadlock detector they should decide on their own protocol. The
kernel debugging issues are completely different.

That's my two cents.

bill
Re: [PATCH] Priority Lists for the RT mutex
On Mon, Apr 11, 2005 at 03:31:41PM -0700, Perez-Gonzalez, Inaky wrote:
> If you are exposing the kernel locks to userspace to implement
> mutexes (eg POSIX mutexes), deadlock checking is a feature you want
> to have to comply with POSIX. According to some off the record
> requirements I've been given, some applications badly need it (I have
> a hard time believing that they are so broken, but heck...).

I'd like to read about those requirements, but, IMO, the value of the
various priority protocols varies greatly with the context and size
(N threads) of the application using them.

If user and kernel space have to be coupled via some thread of execution,
then (IMO) it's better to separate them with some event notification
queue, like signals (wake a thread via a queue event), than to mix locks
across the user/kernel space boundary. There's tons of abuse that can be
opened up with various priority protocols with regard to RT apps, and
giving it a first class entry way without consideration is kind of scary.

It's important to outline the requirements of the applications and then
see what you can do using minimal synchronization objects before
exploding that divide.

Also, Posix isn't always politically neutral nor complete regarding
various things. You have to consider the context of these things.

I'll have to think about this a bit more and review your patch more
carefully.

I'm all ears if you think I'm wrong.

bill
Re: [PATCH] Priority Lists for the RT mutex
On Mon, Apr 11, 2005 at 04:28:25PM -0700, Perez-Gonzalez, Inaky wrote:
> >From: Bill Huey (hui) [mailto:[EMAIL PROTECTED]
...
> API that once upon a time was made multithreaded by just adding
> a bunch of pthread_mutex_[un]lock() at the API entry point...
> without realizing that some of the top level API calls also
> called other top level API calls, so they'd deadlock. Oh crap.
> Quick fix: the usual. Enable deadlock detection and if it
> returns deadlock, assume it is locked already and proceed (or
> do a recursive mutex, or a trylock).

You have to be joking me? Geez.
...
> It is certainly something to explore, but I'd better drive your
> way than do it. It's cleaner. Hides implementation details.
>
> I agree, but it doesn't work that well when talking about legacy
> systems...that's the problem.

Yeah, ok, I understand what's going on now. Traditionally there isn't a
notion of projecting priority across into the Unix/Linux kernel, which is
why it seemed so bizarre.

> Sure--and because most was for legacy reasons that adhered to
> POSIX strictly, it was very simple: we need POSIX this, that and
> that (PI, proper adherence to scheduler policy wake up/rt-behaviour,
> deadlock detection, etc).

Some of this stuff sounds like recursive locking. Would that be a better
way to solve the "top level API locking" problem you're referring to?

> Fortunately in those areas POSIX is not too gray; code to the book.
> Deal.

I would think that there will have to be a graph discontinuity between
user and kernel space at kernel entry and exit for the deadlock detector.
Can't say about issues at fork time, but I would expect that those
objects would have to be destroyed when the process exits.

The current RT (Ingo's) lock isn't recursive, and neither was the
deadlock detector the last time I looked. Do you think that this is a
problem for legacy apps if it gets overloaded to be the userspace futex
as well? (assuming I'm understanding all of this correctly)

> Of course, selling it to the lkml is another story.

I would think that pushing as much of this as possible into userspace
would make the kernel hooks for it more acceptable. Don't know.

/me thinks more

bill
Re: FUSYN and RT
On Wed, Apr 13, 2005 at 11:46:40AM -0400, Steven Rostedt wrote:
> On Tue, 2005-04-12 at 17:27 -0700, Daniel Walker wrote:
> > There is a great big snag in my assumptions. It's possible for a process
> > to hold a fusyn lock, then block on an RT lock. In that situation you
> > could have a high priority user space process be scheduled then block on
> > the same fusyn lock but the PI wouldn't be fully transitive , plus there
> > will be problems when the RT mutex tries to restore the priority.
> >
> > We could add simple hooks to force the RT mutex to fix up its PI, but
> > it's not a long term solution.

Ok, I've been thinking about these issues and I believe there are a
number of misunderstandings here. The user and kernel space mutexes need
to be completely different implementations. I'll have more on this later.

First of all, priority transitivity should be discontinuous at the
user/kernel space boundary, but be propagated by the scheduler, via an
API or hook, upon a general priority boost to the thread in question.

You have thread A blocked in the kernel, holding on to userspace mutex 1a
and kernel mutex 2a. Thread A is priority boosted by a higher priority
thread B trying to acquire mutex 1a. The transitivity operation
propagates through the rest of the lock graph in userspace, via depth
first search, as usual. When it hits the last userspace mutex in
question, this portion of the propagation activity stops. Next, the
scheduler itself finds out that thread A has had its priority altered,
because of a common priority change API, and starts another priority
propagation operation in kernel space to mutex 1b. There you have it.
It's complete from user to kernel space, using a scheduler event/hook/api
to propagate priority changes into the kernel.

With all of that in place, you do a couple of things for the mutex
implementation.

First, you convert as much of the current RT mutex code to be type
polymorphic as you can:

1) You use Daniel Walker's PI list handling for wait queue insertion for
   both mutex implementations. This is done since it's already a library
   and is already generic.

2) Then you generalize the deadlock detection code so that things like
   "what to do in a deadlock case" are determined at the instantiation
   of the code. You might have to use C preprocessor macros to do a
   generic implementation and then fill in the parametric values for
   creating a usable instance.

3) Make the grab owner code generic.

4) ...more parts of the RT mutex... etc...

> How hard would it be to use the RT mutex PI for the priority inheritance
> for fusyn? I only work with the RT mutex now and haven't looked at the
> fusyn. Maybe Ingo can make a separate PI system with its own API that
> both the fusyn and RT mutex can use. This way the fusyn locks can still
> be separate from the RT mutex locks but still work together.

I'd apply these implementation ideas across both mutexes, but keep each
mutex's functionality distinct. I look at this problem from more of a
reusability perspective than anything else.

> Basically can the fusyn work with the rt_mutex_waiter? That's what I
> would pull into its own subsystem. Have another structure that would
> reside in both the fusyn and RT mutex that would take over for the
> current rt_mutex that is used in pi_setprio and task_blocks_on_lock in
> rt.c. So if both locks used the same PI system, then this should all be
> cleared up.

Same thing...

There will be problems trying to implement a Posix read/write lock using
this method, and the core RT mutex might have to be fundamentally altered
to handle recursion of some sort, decomposed into smaller bits and
recomposed into something else.
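
Step 2 in the list above (deadlock policy chosen when the code is instantiated) could look roughly like the following. This is only a sketch under stated assumptions: the `demo_*` types, the `DEFINE_DEADLOCK_CHECK` macro and the fixed depth limit are hypothetical names for illustration and are not taken from the RT mutex or fusyn code.

```c
#include <stddef.h>
#include <errno.h>

struct demo_task;

/* Hypothetical lock: it only knows its current owner. */
struct demo_lock {
    struct demo_task *owner;
};

/* Hypothetical task: it knows which lock it is blocked on, if any. */
struct demo_task {
    struct demo_lock *blocked_on;
    int deadlock_reports;
};

/*
 * Generate a chain walker called `name`. `on_deadlock` is an expression
 * evaluated (with `task` in scope) when following owner -> blocked_on
 * links from `want` leads back to `task` itself.
 */
#define DEFINE_DEADLOCK_CHECK(name, on_deadlock)                      \
static int name(struct demo_task *task, struct demo_lock *want)      \
{                                                                     \
    struct demo_lock *l = want;                                       \
    int depth = 0;                                                    \
    while (l && l->owner && depth++ < 64) {                           \
        if (l->owner == task)                                         \
            return (on_deadlock);                                     \
        l = l->owner->blocked_on;                                     \
    }                                                                 \
    return 0;                                                         \
}

/* Userspace-facing instance: hand an error code back to the caller. */
DEFINE_DEADLOCK_CHECK(user_deadlock_check, -EDEADLK)

/* Debugging instance: count a report on the task, then flag it. */
DEFINE_DEADLOCK_CHECK(kernel_deadlock_check, (task->deadlock_reports++, -1))
```

The walking logic is written once; only the reaction to a detected cycle differs between the two generated functions, which is the kind of "parametric value" the list item talks about.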

bill
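
The two-stage propagation described at the start of this message (a userspace walk that stops at the boundary, then a scheduler hook that restarts the walk on the kernel side) might be sketched as below. Everything here is an assumption made for illustration: the `demo_*` names, the "larger number means higher priority" convention and the single `blocked_on` pointer per domain are not from any posted patch.

```c
#include <stddef.h>

struct demo_task;

struct demo_mutex {
    struct demo_task *owner;
};

struct demo_task {
    int prio;                            /* larger value = higher priority */
    struct demo_mutex *user_blocked_on;  /* userspace lock graph edge */
    struct demo_mutex *kern_blocked_on;  /* kernel lock graph edge */
};

/* Kernel-side walk: boost owners along the chain of kernel mutexes. */
static void demo_kernel_propagate(struct demo_mutex *m, int prio)
{
    while (m && m->owner && m->owner->prio < prio) {
        m->owner->prio = prio;
        m = m->owner->kern_blocked_on;
    }
}

/*
 * Common priority-change entry point. Every boost goes through here, so
 * this is where the scheduler "finds out" and starts the kernel walk.
 */
static void demo_set_prio(struct demo_task *t, int prio)
{
    if (prio <= t->prio)
        return;
    t->prio = prio;
    demo_kernel_propagate(t->kern_blocked_on, prio);
}

/* Userspace-side walk: follows only userspace edges, then stops. */
static void demo_user_propagate(struct demo_mutex *m, int prio)
{
    int depth = 0;

    while (m && m->owner && depth++ < 64) {
        demo_set_prio(m->owner, prio);
        m = m->owner->user_blocked_on;
    }
}
```

In Daniel Walker's scenario, a task that holds a userspace lock and is blocked on a kernel lock gets boosted by `demo_user_propagate()`, and the owner of the kernel lock is boosted through the hook in `demo_set_prio()` without the userspace walk ever touching kernel lock state.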
Re: [RFC] RCU and CONFIG_PREEMPT_RT progress, part 3
On Wed, Jul 13, 2005 at 11:48:01AM -0700, Paul E. McKenney wrote:
> 1.	Is use of spin_trylock() and spin_unlock() in hardirq code
>	(e.g., rcu_check_callbacks() and callees) a Bad Thing?
>	Seems to result in boot-time hangs when I try it, and switching
>	to _raw_spin_trylock() and _raw_spin_unlock() seems to work
>	better. But I don't see why the other primitives hang --
>	after all, you can call wakeup functions in irq context in
>	stock kernels...

The implementation of "printk" does funky stuff like this, so I'm
assuming it's sort of acceptable. Some of those functions bypass latency
tracing and preemption violation checks. I don't see a reason why you
should be touching those functions unless you're going to modify the
implementation of spinlocks directly.

Just use spinlock_t/raw_spinlock_t to take advantage of the type
parametrics in Ingo's spinlock code to determine which lock you're using
and you should be fine.

bill
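
As an illustration of what "type parametrics" means here, one call name can be routed to different implementations based on the declared type of the lock. The sketch below uses C11 `_Generic`, which is a swapped-in technique chosen for brevity and not necessarily how the RT patch does it; the `demo_*` types and functions are hypothetical.

```c
/* Hypothetical stand-ins for a spinning lock and a sleeping lock. */
struct demo_raw_spinlock { int spins; };
struct demo_rt_mutex     { int sleeps; };

/* Spinning implementation: returns 0 to identify itself in this demo. */
static int demo_raw_lock(struct demo_raw_spinlock *l)
{
    l->spins++;
    return 0;
}

/* Sleeping implementation: returns 1 to identify itself in this demo. */
static int demo_rt_lock(struct demo_rt_mutex *l)
{
    l->sleeps++;
    return 1;
}

/* One name for callers; the compiler picks the function from the type. */
#define demo_spin_lock(l) _Generic((l),              \
        struct demo_raw_spinlock *: demo_raw_lock,   \
        struct demo_rt_mutex *:     demo_rt_lock)(l)
```

With this shape, changing the declared type of a lock is enough to change its behaviour; call sites stay untouched, which is the point of the advice above.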
Re: [RFC] RCU and CONFIG_PREEMPT_RT progress, part 3
On Wed, Jul 13, 2005 at 03:06:38PM -0400, Steven Rostedt wrote:
> > 3.	Since SPIN_LOCK_UNLOCKED now takes the lock itself as an
> >	argument, what is the best way to initialize per-CPU
> >	locks? An explicit initialization function, or is there
> >	some way that I am missing to make an initializer?
>
> Ouch, I just noticed that (been using an older version for some time).
>
> Ingo, is this to force the initialization of the lists instead of at
> runtime?

ANSI C99 is missing a concept of "self" during auto-initialization. The
explicit passing of the lvalue is needed so that it can be propagated
downward to other macros in the initialization structure. list_head
initialization is one of those things, if I remember correctly.

bill
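
To make the "no self in an initializer" point concrete, here is a small sketch. The `list_head` layout and `LIST_HEAD_INIT` follow the well-known kernel idiom; `struct demo_lock`, `DEMO_LOCK_UNLOCKED` and `demo_lock_init` are hypothetical names made up for this example, not the RT patch's definitions.

```c
#include <stddef.h>

/* Doubly linked list head, modeled on the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list points at itself, so the initializer must name it. */
#define LIST_HEAD_INIT(name) { &(name), &(name) }

/* Hypothetical lock with an embedded wait list (illustration only). */
struct demo_lock {
    int locked;
    struct list_head wait_list;
};

/* The lvalue is passed in so it can be forwarded to the nested macro. */
#define DEMO_LOCK_UNLOCKED(lock) \
    { .locked = 0, .wait_list = LIST_HEAD_INIT((lock).wait_list) }

/* Static initialization: the variable has to be named twice. */
static struct demo_lock demo = DEMO_LOCK_UNLOCKED(demo);

/* Runtime initialization, e.g. for each element of a per-CPU array. */
static void demo_lock_init(struct demo_lock *l)
{
    l->locked = 0;
    l->wait_list.next = &l->wait_list;
    l->wait_list.prev = &l->wait_list;
}

/* Returns 1 when the wait list is empty (head points back at itself). */
static int demo_wait_list_empty(const struct demo_lock *l)
{
    return l->wait_list.next == &l->wait_list;
}
```

For per-CPU locks, where there is no single variable name to hand to the macro, an explicit function along the lines of `demo_lock_init()` is one plausible answer to question 3; whether that is the preferred approach in the RT tree is not settled in this thread.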
Re: RT and XFS
On Mon, Jul 18, 2005 at 02:10:31PM +0200, Esben Nielsen wrote:
> Unfortunately, one of the goals of the preempt-rt branch is to avoid
> altering too much code. Therefore the type semaphore can't be removed
> there. Therefore the name still lingers ... :-(

This is where you failed. You assumed that the person making the comment
in the first place, Christopher, didn't have his head up his ass and was
open to your end of the discussion.

bill
Re: RT and XFS
On Fri, Jul 15, 2005 at 09:16:55AM -0700, Daniel Walker wrote:
> I don't agree with that. But of course I'm always speaking from a real
> time perspective . PI is expensive , but it won't always be. However, no
> one is forcing PI on anyone, even if I think it's good ..

It depends on what kind of PI and under what circumstances. In the
general kernel, it's really to be avoided at all costs, since it's
masking a general contention problem at those places. In a formally
provable worst case system using priority ceiling emulation and stuff,
PI is really valuable. How a system like the Linux kernel fits into that
is a totally different story. General purpose kernels using general
purpose facilities don't.

That's how I see it.

bill
Re: FUSYN and RT
On Fri, Apr 15, 2005 at 04:37:05PM -0700, Inaky Perez-Gonzalez wrote:
> By following your method, the pi engine becomes unnecessarily complex;
> you have actually two engines following two different propagation
> chains (one kernel, one user). If your mutexes/locks/whatever are the
> same with a different cover, then you can simplify the whole
> implementation by leaps.

The main comment that I'm making here (so it doesn't get lost) is that,
IMO, you're going to find that there is a mismatch between the
requirements of Posix threading and kernel uses. Dropping the kernel
mutex in 1:1 to back a futex-ish entity is going to be problematic,
mainly because of how kernel specific the RT mutex is (or any future
kernel mutex) for debugging, etc... and I think this is going to become
clear as it gets progressively implemented.

I think folks really need to think about this clearly before moving in
any direction prematurely. That's what I'm saying. PI is one of those
issues, but ultimately it's the fundamental differences between
userspace and kernel work.

LynxOS (similar threading system) keeps priority calculations of this
kind separate between user and kernel space. I'll have to ask one of our
engineers here again why that's the case, but I suspect it's for the
reasons I've discussed previously.

bill
Re: PREEMPT_RT and I-PIPE: the numbers, part 4
On Sat, Jul 09, 2005 at 10:22:07AM -0700, Daniel Walker wrote:
> PREEMPT_RT is not pre-tuned for every situation , but the best
> performance is achieved when the system is tuned. If any of these tests
> rely on a low priority thread, then we just raise the priority and you
> have better performance.

Just think about it. Throttling those threads via the scheduler throttles
the system in super controllable ways. This is very cool stuff. :)

bill
Re: Real-Time Preemption and RCU
On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> 5. Scalability -and- Realtime Response.
...

>	void
>	rcu_read_lock(void)
>	{
>		preempt_disable();
>		if (current->rcu_read_lock_nesting++ == 0) {
>			current->rcu_read_lock_ptr =
>				&__get_cpu_var(rcu_data).lock;
>			read_lock(current->rcu_read_lock_ptr);
>		}
>		preempt_enable();
>	}

Ok, here's a rather unsure question...

Uh, is that a sleep violation if that is exclusively held since it
can block within an atomic critical section (deadlock) ?

bill
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > 5. Scalability -and- Realtime Response.
> ...
>
> >	void
> >	rcu_read_lock(void)
> >	{
> >		preempt_disable();
> >		if (current->rcu_read_lock_nesting++ == 0) {
> >			current->rcu_read_lock_ptr =
> >				&__get_cpu_var(rcu_data).lock;
> >			read_lock(current->rcu_read_lock_ptr);
> >		}
> >		preempt_enable();
> >	}
>
> Ok, here's a rather unsure question...
>
> Uh, is that a sleep violation if that is exclusively held since it
> can block within an atomic critical section (deadlock) ?

I'd like to note another problem. Mingo's current implementation of
rt_mutex (super mutex for all blocking synchronization) is still missing
reader counts and something like that would have to be implemented if you
want to do priority inheritance over blocks.

This is going to throw a wrench into your implementation if you assume
that.

bill
Re: Real-Time Preemption and RCU
On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote:
> That was just one random example.
> Another one would be :
>
> drivers/char/tty_io.c, __do_SAK() contains
>	read_lock(&tasklist_lock);
>	task_lock(p);
>
> kernel/sys.c, sys_setrlimit contains
>	task_lock(current->group_leader);
>	read_lock(&tasklist_lock);
>
> task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is
> a normal spinlock, then this is an A/B B/A deadlock.

That code was already dubious in the first place just because it
contained that circularity. If you had a rwlock that blocks on an
upper read count maximum a deadlock situation would trigger anyways,
say, upon a flood of threads trying to do that sequence of acquires.

I'd probably experiment with using the {spin,read,write}-trylock
logic and release all the locks contained in a sequence like that
on the failure to acquire any of the locks in the chain as an
initial fix. A longer term fix might be to break things up a bit
so that whatever ordering being done would have that circularity.

BTW, the runtime lock circularity detector was designed to trigger
on that situation anyways.

That's my thoughts on the matter.

bill
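
The trylock-and-release idea suggested above as an initial fix can be sketched in portable C11. The `demo_*` names are hypothetical, and the toy lock built on `atomic_flag` is only a stand-in for a kernel spinlock or rwlock; the point is the back-off step, not the lock itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock built on a C11 atomic flag (stand-in for a kernel lock). */
struct demo_spinlock {
    atomic_flag held;
};

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static bool demo_trylock(struct demo_spinlock *l)
{
    /* test_and_set returns the previous value: false means we got it. */
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

static void demo_unlock(struct demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Try to take both locks in the caller's order. On any failure, drop
 * whatever was already taken and report failure so the caller can retry.
 * Because nobody waits while holding a lock, an A/B versus B/A ordering
 * cannot turn into a permanent deadlock.
 */
static bool demo_trylock_pair(struct demo_spinlock *first,
                              struct demo_spinlock *second)
{
    if (!demo_trylock(first))
        return false;
    if (!demo_trylock(second)) {
        demo_unlock(first);
        return false;
    }
    return true;
}
```

A caller would loop on `demo_trylock_pair()` (ideally with some back-off between attempts); this trades the deadlock for possible retries under contention, which is why it is framed above as an initial fix and not the long-term one.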
Re: Real-Time Preemption and RCU
On Sun, Mar 20, 2005 at 01:38:24PM -0800, Bill Huey wrote:
> On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote:
> > That was just one random example.
> > Another one would be :
> >
> > drivers/char/tty_io.c, __do_SAK() contains
> >	read_lock(&tasklist_lock);
> >	task_lock(p);
> >
> > kernel/sys.c, sys_setrlimit contains
> >	task_lock(current->group_leader);
> >	read_lock(&tasklist_lock);
> >
> > task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is
> > a normal spinlock, then this is an A/B B/A deadlock.
>
> That code was already dubious in the first place just because it
> contained that circularity. If you had a rwlock that block on an
> upper read count maximum[,] a deadlock situation would trigger anyways,
> say, upon a flood of threads trying to do that sequence of acquires.

The RT patch uses the lock ordering "in place", and whatever nasty
situation was going on previously will be effectively under high load,
which increases the chance of it being triggered. Removal of the read
side semantic just increases load more so that those cases can trigger.

I disagree with this approach and I have an alternate implementation
here that restores it. It's only half tested and fairly meaningless
until an extreme contention case is revealed with the current rt lock
implementation. Numbers need to be gathered to prove or disprove this
conjecture.

> I'd probably experiment with using the {spin,read,write}-trylock
> logic and release the all locks contains in a sequence like that
> on the failure to acquire any of the locks in the chain as an
> initial fix. A longer term fix might be to break things up a bit
> so that whatever ordering being done would have that circularity.

Excuse me, ...would *not* have that circularity.

bill
Re: Real-Time Preemption and RCU
On Fri, Mar 18, 2005 at 05:55:44PM +0100, Esben Nielsen wrote:
> On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > i really have no intention to allow multiple readers for rt-mutexes. We
> > got away with that so far, and i'd like to keep it so. Imagine 100
> > threads all blocked in the same critical section (holding the read-lock)
> > when a highprio writer thread comes around: instant 100x latency to let
> > all of them roll forward. The only sane solution is to not allow
> > excessive concurrency. (That limits SMP scalability, but there's no
> > other choice i can see.)
>
> Unless a design change is made: One could argue for a semantics where
> write-locking _isn't_ deterministic and thus do not have to boost all the

RCU isn't write deterministic like typical RT apps are we can... (below :-))

> readers. Readers boost the writers but not the other way around. Readers
> will be deterministic, but not writers.
> Such a semantics would probably work for a lot of RT applications
> happening not to take any write-locks - these will in fact perform better.
> But it will give the rest a lot of problems.

Just came up with an idea after I thought about how much of a bitch it
would be to get a fast RCU multiple reader semantic (our current shared-
exclusive lock inserts owners into a sorted priority list per-thread which
makes it very expensive for a simple RCU case since they are typically very
small batches of items being altered). Basically the RCU algorithm has *no*
notion of writer priority and to propagate a PI operation down all reader
is meaningless, so why not revert back to the original rwlock-semaphore to
get the multiple reader semantics ?

A notion of priority across a quiescence operation is crazy anyways, so
it would be safe just to use the old rwlock-semaphore "in place" without
any changes or priority handling addtions. The RCU algorithm is only
concerned with what is basically a coarse data guard and it isn't time or
priority critical.

What do you folks think ? That would make Paul's stuff respect multiple
readers, which reduces contention and gets around the problem of possibly
overloading the current rt lock implementation that we've been bitching
about. The current RCU development track seems wrong in the first place and
this seems like it could be a better and more complete solution to the
problem.

If this works, well, you heard it here first. :)

bill
Re: Real-Time Preemption and RCU
On Tue, Mar 22, 2005 at 02:04:46AM -0800, Bill Huey wrote:
> RCU isn't write deterministic like typical RT apps are[, so] we can... (below
> :-))
...
> Just came up with an idea after I thought about how much of a bitch it
> would be to get a fast RCU multiple reader semantic (our current shared-
> exclusive lock inserts owners into a sorted priority list per-thread which
> makes it very expensive for a simple RCU case[,] since they are typically very
> small batches of items being altered). Basically the RCU algorithm has *no*
> notion of writer priority and to propagate a PI operation down all reader[s]
> is meaningless, so why not revert back to the original rwlock-semaphore to
> get the multiple reader semantics ?

The original lock, for those that don't know, doesn't strictly track read
owners so reentrancy is cheap.

> A notion of priority across a quiescence operation is crazy anyways[-,-] so
> it would be safe just to use to the old rwlock-semaphore "in place" without
> any changes or priority handling add[i]tions. The RCU algorithm is only
> concerned with what is basically a coarse data guard and it isn't time or
> priority critical.

A little jitter in a quiescence operation isn't going to hurt things right ?.

> What do you folks think ? That would make Paul's stuff respect multiple
> readers which reduces contention and gets around the problem of possibly
> overloading the current rt lock implementation that we've been bitching
> about. The current RCU development track seem wrong in the first place and
> this seem like it could be a better and more complete solution to the problem.

Got to get rid of those typos :)

bill
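
For readers unfamiliar with the distinction, a lock that only counts readers (and does not record who they are) can be sketched as follows. This is not the kernel's rwsem code; the `cnt_rwlock` type, its functions and the try-only interface are assumptions made to keep the example small.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* count > 0: that many readers; 0: free; -1: held by one writer. */
struct cnt_rwlock {
    atomic_int count;
};

static void cnt_rwlock_init(struct cnt_rwlock *l)
{
    atomic_init(&l->count, 0);
}

/* A reader only bumps the counter: no owner list, no priority sorting. */
static bool cnt_read_trylock(struct cnt_rwlock *l)
{
    int c = atomic_load(&l->count);

    while (c >= 0) {
        /* On failure, c is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&l->count, &c, c + 1))
            return true;
    }
    return false;   /* a writer holds the lock */
}

static void cnt_read_unlock(struct cnt_rwlock *l)
{
    atomic_fetch_sub(&l->count, 1);
}

/* A writer needs the lock completely free. */
static bool cnt_write_trylock(struct cnt_rwlock *l)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&l->count, &expected, -1);
}

static void cnt_write_unlock(struct cnt_rwlock *l)
{
    atomic_store(&l->count, 0);
}
```

Since the lock never learns which threads are reading, a nested read acquisition is one more increment, which is the cheap reentrancy mentioned above; the flip side is that there is nothing to walk for priority inheritance, which is exactly the trade being proposed.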
Re: Real-Time Preemption and RCU
On Tue, Mar 22, 2005 at 02:17:27AM -0800, Bill Huey wrote:
> > A notion of priority across a quiescence operation is crazy anyways[-,-] so
> > it would be safe just to use to the old rwlock-semaphore "in place" without
> > any changes or priority handling add[i]tions. The RCU algorithm is only
> > concerned with what is basically a coarse data guard and it isn't time or
> > priority critical.
>
> A little jitter in a quiescence operation isn't going to hurt things right ?.

The only thing that I can think of that can go wrong here is the effect
it would have on a thread write blocking against a bunch of RCU readers.
It could introduce a chain of delays into, say, a timer event and might
cause problems/side-effects for other things being processed. RCU
processing might have to be decoupled and processed by a different
thread to avoid some of that latency weirdness.

What do you folks think ?

bill
Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)
On Fri, Jan 28, 2005 at 08:45:46PM +0100, Ingo Molnar wrote:
> * Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > If you do have a highest interrupt case that causes all activity to
> > block, then rwsems may indeed fit the bill.
> >
> > In the NFS client code we may use rwsems in order to protect stateful
> > operations against the (very infrequently used) server reboot recovery
> > code. The point is that when the server reboots, the server forces us
> > to block *all* requests that involve adding new state (e.g. opening an
> > NFSv4 file, or setting up a lock) while our client and others are
> > re-establishing their existing state on the server.
>
> it seems the most scalable solution for this would be a global flag plus
> per-CPU spinlocks (or per-CPU mutexes) to make this totally scalable and
> still support the requirements of this rare event. An rwsem really
> bounces around on SMP, and it seems very unnecessary in the case you
> described.
>
> possibly this could be formalised as an rwlock/rwlock implementation
> that scales better. brlocks were such an attempt.

From how I understand it, you'll have to have a global structure to
denote an exclusive operation and then take some additional cpumask_t
representing the set of spinlocks and use it to iterate over when doing
a PI chain operation.

Locking of each individual parametric typed spinlock might require a
raw_spinlock to manipulate list structures, which, added up, is rather
heavy weight.

Not only that, you'd have to introduce a notion of it being counted,
since it could also be acquired/preempted by another higher priority
thread on that same processor. Not having this semantic would make the
thread in that specific circumstance effectively non-preemptable (PI
scheduler indeterminacy), where the multiple readers portion of a real
read/write (shared-exclusive) lock would have permitted this.

http://people.lynuxworks.com/~bhuey/rt-share-exclusive-lock/rtsem.tgz.1208

is our attempt at getting real shared-exclusive lock semantics in a
blocking lock. It may still be incomplete and buggy. Igor is still
working on this and this is the latest that I have of his work. Getting
comments on this approach would be a good thing, as I/we (me/Igor)
believed from the start that this approach is correct.

Assuming that this is possible with the current approach, optimizing it
to avoid CPU ping-ponging is an important next step.

bill
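
The "global flag plus per-CPU locks" arrangement under discussion could be sketched as below. It is a toy model under stated assumptions: the `demo_brlock` name, the fixed `DEMO_NR_CPUS`, the caller-supplied CPU index and the try-only interface are all hypothetical, and none of this is the old brlock code or the rtsem code linked above.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 4   /* hypothetical fixed CPU count */

struct demo_brlock {
    atomic_flag cpu_lock[DEMO_NR_CPUS]; /* one reader slot per CPU */
    atomic_bool writer_pending;         /* the global "exclusive" flag */
};

static void demo_brlock_init(struct demo_brlock *l)
{
    for (int i = 0; i < DEMO_NR_CPUS; i++)
        atomic_flag_clear(&l->cpu_lock[i]);
    atomic_init(&l->writer_pending, false);
}

/* Reader: touches only its own CPU's slot, so no cross-CPU bouncing. */
static bool demo_read_trylock(struct demo_brlock *l, int cpu)
{
    if (atomic_load(&l->writer_pending))
        return false;
    return !atomic_flag_test_and_set(&l->cpu_lock[cpu]);
}

static void demo_read_unlock(struct demo_brlock *l, int cpu)
{
    atomic_flag_clear(&l->cpu_lock[cpu]);
}

/* Writer (the rare event): raise the flag, then take every CPU slot. */
static bool demo_write_trylock(struct demo_brlock *l)
{
    bool expected = false;

    if (!atomic_compare_exchange_strong(&l->writer_pending,
                                        &expected, true))
        return false;
    for (int i = 0; i < DEMO_NR_CPUS; i++) {
        if (atomic_flag_test_and_set(&l->cpu_lock[i])) {
            while (i-- > 0)              /* back off: release slots 0..i-1 */
                atomic_flag_clear(&l->cpu_lock[i]);
            atomic_store(&l->writer_pending, false);
            return false;
        }
    }
    return true;
}

static void demo_write_unlock(struct demo_brlock *l)
{
    for (int i = 0; i < DEMO_NR_CPUS; i++)
        atomic_flag_clear(&l->cpu_lock[i]);
    atomic_store(&l->writer_pending, false);
}
```

Note that each per-CPU slot admits a single holder here. A second reader scheduled on the same CPU while the first is preempted would fail to get the slot, which is the counting problem raised in the reply above.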
Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature
On Mon, Jan 31, 2005 at 05:29:10PM -0500, Bill Davidsen wrote:
> The problem hasn't changed in a few decades, neither has the urge of
> developers to make their app look good at the expense of the rest of the
> system. Been there and done that myself.
>
> "Back when" we had no good tools except to raise priority and drop
> timeslice if a process blocked for i/o and vice-versa if it used the
> whole timeslice. The amazing thing is that it worked reasonably well as
> long as no one was there who knew how to cook the books the scheduler
> used. And the user could hold off interrupts for up to 16ms, just to
> make it worse.

A lot of this scheduling policy work is going to have to be redone as
badly written apps start getting their crap together and as this patch
becomes more and more pervasive in the general Linux community. What's
happening now is only the beginning of things to come, and it'll require
a solid sample application with even more hooks into the kernel before
we'll see the real benefits of this patch.

SCHED_FIFO will have to do until more development happens with QoS style
policies.

bill
Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature
On Tue, Feb 01, 2005 at 11:10:48PM -0600, Jack O'Quin wrote: > Ingo Molnar <[EMAIL PROTECTED]> writes: > > (also, believe me, this is not arrogance or some kind of game on our > > part. If there was a nice clean solution that solved your and others' > > problems equally well then it would already be in Linus' tree. But there > > is no such solution yet, at the moment. Moreover, the pure fact that so > > many patch proposals exist and none looks dominantly convincing shows > > that this is a problem area for which there are no easy solutions. We > > hate such moments just as much as you do, but they do happen.) > > The actual requirement is nowhere near as difficult as you imagine. > You and several others continue to view realtime in a multi-user > context. That doesn't work. No wonder you have no good solution. A notion of process/thread scoping is needed from my point of view. How to implement that is another matter, and from what I see there are no clear solutions that don't involve major changes in some way to fundamental syscalls like fork/clone() and the underlying kernel structures. The very notion of Unix fork semantics isn't sufficient to "contain" these semantics. It's more about controlling things with known quantities over time, not about process creation relationships, and therein lies the mismatch. Also, as media apps get more sophisticated they're going to need some kind of access to some of the traditional softirq facilities, possibly by migrating them into userspace safely somehow, along with control over how IO processing is handled for iSCSI, FireWire, networking and all peripherals that need some kind of prioritized IO handling. It's akin to O_DIRECT, where folks need to determine policy over the kernel's own facilities (IO queues), but in a broader way. This is inevitable for this category of apps. Scary? Yes, I know. Think XFS streaming with guaranteed rate IO, then generalize this for all things that can be streamed in the kernel. 
As a side note, they'll also be pegging CPU usage and attempting to draw to the screen at the same time. It would be nice to have slack from scheduler frames be used for less critical things such as drawing to the screen. The policy for scheduling these IO requests may be divorced from the actual priority of the thread requesting it, which presents some problems with the current Linux code as I understand it. Whether this is suitable for mainstream inclusion is another matter. But as a person that wants to write apps of this nature, I came into this kernel stuff knowing that there's going to be a conflict between the needs of media apps folks and what the Linux kernel folks will tolerate as a community. > The humble RT-LSM was actually optimal for the multi-user scenario: > don't load it. Then it adds no security issues, complexity or > scheduler pathlength. As an added benefit, the sysadmin can easily > verify that it's not there. > > The cost/performance characteristics of commodity PC's running Linux > are quite compelling for a wide range of practical realtime > applications. But, these are dedicated machines. The whole system > must be carefully tuned. That is the only method that actually works. > The scheduler is at most a peripheral concern; the best it can do is > not screw up. It's very compelling and very deadly to the industry if these things become commonplace in the normal Linux kernel. It would instantly make Linux the top platform for anything media related, graphic and audio. (Hopefully, I can get back to kernel coding RT stuff after this current distraction that has me reassigned onto an emergency project) I hope I clarified some of this communication and did not completely scare Ingo and others too much. Just a little bit is ok. 
:) bill
Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature
On Wed, Feb 02, 2005 at 10:44:22AM -0600, Jack O'Quin wrote: > Bill Huey (hui) <[EMAIL PROTECTED]> writes: > > Also, as media apps get more sophisticated they're going to need some > > kind of access to the some traditional softirq facilities, possibily > > migrating it into userspace safely somehow, with how it handles IO > > processing such as iSCSI, FireWire, networking and all peripherals > > that need some kind of prioritized IO handling. It's akin to O_DIRECT, > > where folks need to determine policy over the kernel's own facilities, > > IO queues, but in a more broad way. This is inevitable for these > > category of apps. Scary ? yes I know. > > I believe Ingo's RT patches already support this on a per-IRQ basis. > Each IRQ handler can run in a realtime thread with priority assigned > by the sysadmin. Balancing the interrupt handler priorities with > those of other realtime activities allows excellent control. No they don't. That's a physical mapping of these kernel entities, not a logical organization that projects upward to things like individual sockets or file streams. The current irq-thread patches are primarily for dealing with the low level acks and stuff for the devices in question. They do not deal with queuing policy or how these things are scheduled on a logical basis, which is what softirqs do. softirqs group a number of things together in one big uncontrollable chunk. Really, a bit of time spent in the kernel regarding this would clarify it more in the future. Don't speculate. This misunderstanding, often babble, from app folks is why kernel folks partially dismiss the needs requested from this subgroup. It's important to understand your needs before articulating them to a wider community. The kernel community must understand the true nature of these needs and then facilitate them. 
If the relationship is one where kernel folks dictate what apps folks have, you basically pervert the relationship and the responsibilities of overall development, which fatally cripples app and all development of this nature. It's a two way street, but kernel folks can be more proactive about it, definitely. Step one in this is to acknowledge that Unix scheduling semantics is "inantiquated" with regard to media apps. Some notion of scoping needs to be put in. Everybody on the same page ? > This is really only useful within the context of a dedicated realtime > system, of course. > > Stephane Letz reports a similar feature in Mac OS X. OS X is very coarse grained (two funnels) and I would seriously doubt that it would perform without scheduling side effects to the overall system because of that. With a largely stalled FreeBSD SMPng project where they hijack a good chunk of their code into an antiquated and bloated Mach threading system, that situation isn't helping it. What the Linux community has with the RT patches is potentially light years ahead of OS X regarding overall system latency, since RT and SMP performance are tightly related. It's just a matter of getting the right folks to understand the problem space and then make changes so that the overall picture is completed. bill
Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature
On Wed, Feb 02, 2005 at 10:21:00PM +0100, Ingo Molnar wrote: > yes and no. You are right in that the individual workloads (e.g. > softirqs) are not separated and identified/credited to the thread that > requested them. (in part due to the fact that you cannot e.g. credit a > thread for e.g. unrequested workloads like incoming sockets, or for > 'merged' workloads like writeout of a commonly accessed file.) What's not being addressed here is a need for pervasive QoS across all kernel systems. The power of this patch is multiplicative. It's not about a specific component of the system having microsecond latencies, it's about how all parts (softirqs, hardirqs, VM, etc.) work together so that the entire system is suitable for (near) hard real time. It's unconstrained, unlike dual kernel RT systems, across all component boundaries. Those constraints create large chunks of glue logic between systems, which has exploded the complexity of things that app folks must deal with. This is where properly written Linux apps (none exist right now because of kernel issues) can really overtake competing apps from other OSes (ignoring how crappy X11 is). > but Jack is right in practical terms: the audio folks achieved pretty > good results with the current IRQ threading mechanism, partly due to the > fact that the audio stack doesnt use softirqs, so all the > latency-critical activities are in the audio IRQ thread and the > application itself. It's clever that they do that, but additional control is needed in the future. jackd isn't the most sophisticated media app on this planet (not too much of an insult :)) and the demands from this group are bound to increase as their group and competing projects get more and more sophisticated. When I say kernel folks need to be proactive, I really mean it. The Linux kernel latency issues and poor driver support are largely why media apps are way below even being second rate compared with those on other operating systems such as Apple's OS X. 
bill
Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature
On Wed, Feb 02, 2005 at 01:14:05PM -0800, Bill Huey wrote: > Step one in this is to acknowlege that Unix scheduling semantics is > "inantiquated" with regard to media apps. Some notion of scoping needs to bah, "inadequate". > be put in. > > Everybody on the same page ? bill
Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature
On Wed, Feb 02, 2005 at 05:59:54PM -0500, Paul Davis wrote: > Actually, JACK probably is the most sophisticated media *framework* on > the planet, at least inasmuch as it connects ideas drawn from the > media world and OS research/design into a coherent package. Its not > perfect, and we've just started adding new data types to its > capabilities (its actually relatively easy). But it is amazingly > powerful in comparison to anything offered to data, and is > unencumbered by the limitations that have affected other attempts to > do what it does. This is a bit off topic, but I'm interested in applications that are more driven by time and have abstractions closer to that in a pure way. A lot of audio kits tend to be overly about DSP and not about time. This is difficult to explain, but what I'm referring to here is ideally the next generation of these applications and their design, not the current lot. A lot more can be done. > And it makes possible some of the most sophisticated *audio* apps on > the planet, though admittedly not video and other data at this time. Again, I mean the notion of time-based processing with broader uses, and not just DSP, which is what a lot of current graph-driven audio frameworks still seem to do at this time. Think gaming audio in 3d, etc... I definitely have ideas on this subject and I'm going to hold my current position on this matter in that we can collectively do much better. bill
Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature
On Thu, Feb 03, 2005 at 08:54:24AM +1100, Peter Williams wrote: > As Ingo said in an earlier a post, with a little ingenuity this problem > can be solved in user space. The programs in question can be setuid > root so that they can set RT scheduling policy BUT have their > permissions set so that they only executable by owner and group with the > group set to a group that only contains those users that have permission > to run this program in RT mode. If you wish to allow other users to run > the program but not in RT mode then you would need two copies of the > program: one set up as above and the other with normal permissions. Again, you either didn't read or didn't understand what I was saying in the post that you snipped regarding QoS, or about the large scale issues regarding dual/single kernel development environments. Ultimately this stuff requires non-trivial support in kernel space, a softirq thread migration mechanism and a frame driven scheduler to back IO submission across async boundaries. My posts were pretty clear on this topic and a lot of this has its origins in SGI IRIX. Yes, SGI IRIX. One of the only systems man enough to handle this stuff. Ancient, antiquated Unix scheduler semantics (sort and run) and lack of control over critical facilities like softirq processing are obstacles to getting at this. bill
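For readers who have not used the mechanism that the quoted setuid proposal relies on, here is a minimal Python sketch of a process asking for SCHED_FIFO and falling back gracefully when it lacks the privilege. It assumes a Linux system and Python's standard `os` scheduling wrappers; the function name `try_realtime` and the default priority are made up for illustration and do not come from the thread. It shows only the policy request itself, not the QoS, softirq migration or frame-driven scheduling support argued for above.

```python
import os


def try_realtime(priority: int = 10) -> bool:
    """Try to switch the calling process to SCHED_FIFO; report success.

    Hypothetical helper for illustration. Returns False when the platform
    has no POSIX scheduling calls or the process lacks the privilege.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False  # e.g. not running on Linux

    # Clamp the requested priority into the range SCHED_FIFO accepts.
    lowest = os.sched_get_priority_min(os.SCHED_FIFO)
    highest = os.sched_get_priority_max(os.SCHED_FIFO)
    clamped = max(lowest, min(highest, priority))

    try:
        # pid 0 means "the calling process".
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(clamped))
    except OSError:
        # Typically EPERM: the caller is not privileged to request a
        # realtime policy, which is what the setuid arrangement works around.
        return False
    return True
```

An unprivileged run is expected to return False rather than raise, which is the case the two-copies-of-the-program arrangement in the quoted text is meant to handle.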
Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature
On Thu, Feb 03, 2005 at 10:41:33PM +0100, Ingo Molnar wrote: > * Bill Huey <[EMAIL PROTECTED]> wrote: > > It's clever that they do that, but additional control is needed in the > > future. jackd isn't the most sophisticate media app on this planet (not > > too much of an insult :)) [...] > > i think you are underestimating Jack - it is easily amongst the most > sophisticated audio frameworks in existence, and it certainly has one of > the most robust designs. Just shop around on google for Jack-based audio > applications. What i'd love to see is more integration (and cooperation) > between the audio frameworks of desktop projects (KDE, Gnome) and Jack. This is a really long winded and long standing offtopic gripe I have with general application development under Linux. The only way I'm going to get folks to understand my position on it is if I code it up in my implementation language of choice with my own APIs. There's a TON more that can be done with QoS in the kernel (EDL schedulers), DSP JIT compiler techniques and other kernel things that can support pro-audio. I simply can't get to it yet until the RT patch has a few more goodies and I'm permitted to do this as my next project. I had a crazy prototype of some DSP graph system (in C++) I wrote years ago for 3D audio, which is where I'm drawing my knowledge from, and it's getting time to resurrect it if I'm going to provide a proof of concept to push an edge. Also, think: people working with the RT patch are also ignoring frame accurate video and many other things that just haven't been done yet, since the patch is so new and there hasn't been more interest from folks yet regarding it. I suspect that it's because folks don't know about it yet. bill
Re: [RFC] scheduler: improve SMP fairness in CFS
On Tue, Jul 24, 2007 at 04:39:47PM -0400, Chris Snook wrote: > Chris Friesen wrote: >> We currently use CKRM on an SMP machine, but the only way we can get away >> with it is because our main app is affined to one cpu and just about >> everything else is affined to the other. > > If you're not explicitly allocating resources, you're just low-latency, not > truly realtime. Realtime requires guaranteed resources, so messing with > affinities is a necessary evil. You've mentioned this twice in this thread. If you're going to talk about this you should characterize it more specifically, because resource allocation is a rather incomplete area in Linux. Rebalancing was still an open research problem the last time I looked. Tong's previous trio patch is an attempt at resolving this using a generic grouping mechanism and some constructive discussion should come of it. bill
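The "affined to one cpu" arrangement in the quoted text can be illustrated with a short Python sketch. It assumes Linux and the standard `os.sched_getaffinity` / `os.sched_setaffinity` calls; the helper name `pin_to_cpu` is hypothetical. This only demonstrates setting an affinity mask for one process and says nothing about how CKRM or the kernel's rebalancing works.

```python
import os
from typing import Optional


def pin_to_cpu(cpu: Optional[int] = None) -> Optional[int]:
    """Restrict the calling process to a single CPU and return that CPU.

    Hypothetical helper for illustration. Returns None on platforms that
    do not expose the affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None

    allowed = os.sched_getaffinity(0)  # CPUs this process may run on now
    if cpu is None:
        cpu = min(allowed)  # arbitrary choice for the sketch
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")

    # pid 0 means "the calling process"; the mask is given as a set of CPUs.
    os.sched_setaffinity(0, {cpu})
    return cpu
```

In the setup described above, the main application would be pinned to one CPU this way and everything else pinned to the other, which is a manual partitioning rather than a guarantee of resources.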
Re: [RFC] scheduler: improve SMP fairness in CFS
On Tue, Jul 24, 2007 at 05:22:47PM -0400, Chris Snook wrote: > Bill Huey (hui) wrote: > Well, you need enough CPU time to meet your deadlines. You need > pre-allocated memory, or to be able to guarantee that you can allocate > memory fast enough to meet your deadlines. This principle extends to any > other shared resource, such as disk or network. I'm being vague because > it's open-ended. If a medical device fails to meet realtime guarantees > because the battery fails, the patient's family isn't going to care how > correct the software is. Realtime engineering is hard. ... > Actually, it's worse than merely an open problem. A clairvoyant fair > scheduler with perfect future knowledge can underperform a heuristic fair > scheduler, because the heuristic scheduler can guess the future incorrectly > resulting in unfair but higher-throughput behavior. This is a perfect > example of why we only try to be as fair as is beneficial. I'm glad we agree on the above points. :) It might be that there needs to be another, stiffer policy than what goes into SCHED_OTHER, in that we also need a SCHED_ISO or something that has stricter rebalancing semantics for -rt applications, sort of a super SCHED_RR. That's definitely needed and I don't see how the current CFS implementation can deal with this properly, even with numerical running averages, etc., at this time. SCHED_FIFO is another issue, but this is actually more complicated than just per-cpu run queues in that it needs a global priority analysis. I don't see how CFS can deal with SCHED_FIFO efficiently without moving to a single run queue. This is kind of a complicated problem with a significant set of trade-offs to take into account (cpu binding, etc.). >> Tong's previous trio patch is an attempt at resolving this using a generic >> grouping mechanism and some constructive discussion should come of it. > > Sure, but it seems to me to be largely orthogonal to this patch. 
It's based on the same kinds of ideas that he's been experimenting with in Trio. I can't name a single other engineer who has posted to lkml recently with quite the depth of experience in this area that he has. It would be nice to facilitate/incorporate some of his ideas or get him to work on something to this end that's suitable for inclusion in some tree somewhere. bill
NetApp sues Sun regarding ZFS
Folks, The official announcement. http://www.netapp.com/news/press/news_rel_20070905 Dave Hitz's blog about it. http://blogs.netapp.com/dave/ bill
Re: [RFC] scheduler: improve SMP fairness in CFS
On Fri, Jul 27, 2007 at 12:03:28PM -0700, Tong Li wrote: > Thanks for the interest. Attached is a design doc I wrote several months > ago (with small modifications). It talks about the two pieces of my design: > group scheduling and dwrr. The description was based on the original O(1) > scheduler, but as my CFS patch showed, the algorithm is applicable to other > underlying schedulers as well. It's interesting that I started working on > this in January for the purpose of eventually writing a paper about it. So > I knew reasonably well the related research work but was totally unaware > that people in the Linux community were also working on similar things. > This is good. If you are interested, I'd like to help with the algorithms > and theory side of the things. Tong, This is sufficient as an overview of the algorithm but, I believe, not detailed enough to be a discussable design doc. You should ask Chris what he means by this. Some examples of your rebalancing scheme and how your invariant applies across processor rounds would be helpful for me and possibly others as well. bill
Re: [RFC] scheduler: improve SMP fairness in CFS
On Fri, Jul 27, 2007 at 07:36:17PM -0400, Chris Snook wrote: > I don't think that achieving a constant error bound is always a good thing. > We all know that fairness has overhead. If I have 3 threads and 2 > processors, and I have a choice between fairly giving each thread 1.0 > billion cycles during the next second, or unfairly giving two of them 1.1 > billion cycles and giving the other 0.9 billion cycles, then we can have a > useful discussion about where we want to draw the line on the > fairness/performance tradeoff. On the other hand, if we can give two of > them 1.1 billion cycles and still give the other one 1.0 billion cycles, > it's madness to waste those 0.2 billion cycles just to avoid user jealousy. > The more complex the memory topology of a system, the more "free" cycles > you'll get by tolerating short-term unfairness. As a crude heuristic, > scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but > eventually we should take the boot-time computed migration costs into > consideration. You have to consider the target for this kind of code. There are applications where you need something that falls within a constant error bound. According to the numbers, the current CFS rebalancing logic doesn't achieve that to any degree of rigor. So CFS is ok for SCHED_OTHER, but not for anything more strict than that. Even the rt overload code (from my memory) is subject to these limitations as well until it's moved to use a single global queue while using CPU binding to turn off that logic. It's the price you pay for accuracy. > If we allow a little short-term fairness (and I think we should) we can > still account for this unfairness and compensate for it (again, with the > same tolerance) at the next rebalancing. Again, it's a function of *when* and depends on that application. > Adding system calls, while great for research, is not something which is > done lightly in the published kernel. 
If we're going to implement a user > interface beyond simply interpreting existing priorities more precisely, it > would be nice if this was part of a framework with a broader vision, such > as a scheduler economy. I'm not sure what you mean by scheduler economy, but CFS can and should be extended to handle proportional scheduling, which is outside of the traditional Unix priority semantics. Having a new API to get at this is unavoidable if you want it to eventually support -rt oriented applications that have bandwidth semantics. All deadline based schedulers have API mechanisms like this to support extended semantics. This is no different. > I had a feeling this patch was originally designed for the O(1) scheduler, > and this is why. The old scheduler had expired arrays, so adding a > round-expired array wasn't a radical departure from the design. CFS does > not have an expired rbtree, so adding one *is* a radical departure from the > design. I think we can implement DWRR or something very similar without > using this implementation method. Since we've already got a tree of queued > tasks, it might be easiest to basically break off one subtree (usually just > one task, but not necessarily) and migrate it to a less loaded tree > whenever we can reduce the difference between the load on the two trees by > at least half. This would prevent both overcorrection and undercorrection. > The idea of rounds was another implementation detail that bothered me. In > the old scheduler, quantizing CPU time was a necessary evil. Now that we > can account for CPU time with nanosecond resolution, doing things on an > as-needed basis seems more appropriate, and should reduce the need for > global synchronization. Well, there's nanosecond resolution with no mechanism that exploits it for rebalancing. 
Rebalancing in general is a pain and the code for it is generally orthogonal to the in-core scheduler data structures that are in use, so I don't understand the objection to this argument and the choice of methods. If it gets the job done, then these kinds of choices don't have that much meaning. > In summary, I think the accounting is sound, but the enforcement is > sub-optimal for the new scheduler. A revision of the algorithm more > cognizant of the capabilities and design of the current scheduler would > seem to be in order. That would be nice. But the amount of error in Tong's solution is much less than in the current CFS logic, as was previously tested, even without consideration of high resolution clocks. So you have to give some kind of credit for that approach and recognize that current methods in CFS are technically a dead end if there's a need for strict fairness in a more rigorous run category than SCHED_OTHER. > I've referenced many times my desire to account for CPU/memory hierarchy in > these patches. At present, I'm not sure we have sufficient infrastructure > in the kernel to automatically optimize for system topology, but I think > whatever de
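Two heuristics quoted from Chris Snook in this message lend themselves to a small worked sketch: scaling a low fairness tolerance by log2(NCPUS), and migrating work only when doing so reduces the load difference between two queues by at least half. The Python below is a toy model under stated assumptions. The function names, the use of plain floats for "load", and the single-task migration are all illustrative choices; it is not the CFS or DWRR implementation and ignores migration cost and topology.

```python
import math


def fairness_tolerance(base: float, ncpus: int) -> float:
    """Scale a base unfairness tolerance by log2(ncpus).

    Illustrative only: with one CPU the tolerance is zero, with two CPUs
    it equals the base, and it grows slowly as CPUs are added.
    """
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return base * math.log2(ncpus)


def should_migrate(src_load: float, dst_load: float, task_load: float) -> bool:
    """Return True if moving task_load from src to dst at least halves the gap.

    A move that is too small leaves more than half the gap (undercorrection);
    a move that is too large recreates the gap in the other direction
    (overcorrection). Both are rejected by the same comparison.
    """
    gap_before = abs(src_load - dst_load)
    gap_after = abs((src_load - task_load) - (dst_load + task_load))
    return gap_before > 0 and gap_after <= gap_before / 2


if __name__ == "__main__":
    # Source queue carries load 10, destination carries load 4 (gap of 6).
    for task in (1, 3, 6):
        print(task, should_migrate(10, 4, task))
```

With loads of 10 and 4, moving a task of load 3 closes the gap entirely and is accepted, a task of load 1 leaves a gap of 4 and is rejected, and a task of load 6 simply flips the imbalance and is rejected as well.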
Re: [ck] Re: Linus 2.6.23-rc1
On Sat, Jul 28, 2007 at 09:28:36PM +0200, jos poortvliet wrote: > Your point here seems to be: this is how it went, and it was right. Ok, got > that. Yet, Con walked away (and not just over SD). Seeing Con go, I wonder > how many did leave without this splash. How many didn't even get involved at > all??? Did THAT have to happen? I don't blame you for it - the point is that > somewhere in the process a valuable kernel hacker went away. How and why? And > is it due to a deeper problem? Absolutely. The current Linux community hasn't realized how large it has gotten, and the internal processes for dealing with new developers who aren't at companies like SuSE or RedHat haven't been extended to deal with it yet. It comes off as elitism, which it partially is. Nobody tries to facilitate or understand ideas in the larger community, which locks out folks like Con who try to do provocative things outside of the normal technical development mindset. He was punished for doing so, and that is a huge failure in this community. Con basically got caught in a philosophical scheduler argument over whether to push a policy into userspace or to nice a process instead because of how crappy X is. How to solve this is an open argument, but it should not have resulted in really one scheduler over the other. Both were capable but one is locked out now because of the choices of current high level kernel developers in Linux. There are a lot of good kernel folks in many different communities that look at something like this and would be turned off to participating in Linux development. And I have a good record of doing rather interesting stuff in kernel. bill
Re: [ck] Re: Linus 2.6.23-rc1
On Sat, Jul 28, 2007 at 11:06:09PM +0200, Diego Calleja wrote: > So your argument is that SD shouldn't have been merged either, because it > would have resulted in one scheduler over the other? My argument is that scheduler development is open ended. Although having a central scheduler to hack on is a good thing, it shouldn't lock out or suppress development from other groups that might be trying to solve the problem in unique ways. This can be accomplished in a couple of ways: 1) scheduler modularity Clearly Con is highly qualified to experiment with scheduler code and this should be technically facilitated by some means if not a maintainer. He's only a part time maintainer and nobody helped him with this stuff, nor did anybody other than Tong Li try to understand what his scheduler was trying to do. 2) better code modularity Now, cleaner code would help with this a lot. If that was in place, we might not need (1) and a pluggable scheduler. It would limit the amount of refactoring for folks so that their code can drop in more easily. There's so much churn that it locks out developers by default, since they have to constantly clean up the code in question while another developer can commit without consideration for how it affects others. That's their right as a maintainer, but as a maintainer they should also give a proper amount of consideration to how others might intend to extend the code so that development remains "inclusive". This notion of "open source, open development" is false when working under those circumstances. > > where capable but one is locked out now because of the choices of > > current high level kernel developers in Linux. > > Well, there are two schedulers...it's obvious that "high level kernel > developers" needed to chose one. I think that's kind of a bogus assumption from the very get go. 
Scheduling in Linux is one of the most unevolved systems in the kernel that still could go through a large transformation and get big gains like what we've had over the last few months. This is evident with both schedulers: both do well, and it's a good thing overall that CFS is going in. Now, the way it happened is completely screwed up in so many ways that I don't see how folks can miss it. This is not just Ingo versus Con, this is the current Linux community and how it makes decisions from the top down, and the current cultural attitude towards developers doing things that are: 1) architecturally significant, for which they will get flamed to death by the established Linux kernel culture before they can get any users to report bugs after their posting on lkml. 2) conceptually different, which is subject to the reasons above, but also gets flamed to death unless it comes from folks internal to the Linux development processes. When groups get to a certain size like this one has, there needs to be a revision of development processes so that they can scale and be "inclusive" to the overall spirit of the Linux development process. When that breaks down, we get situations like what we have with Con leaving development. Other developers like me get turned off to the situation, also feel the same as Con and stop Linux development. That's my current status as well. > The main problem is clearly that no scheduler was clearly better than the > other. This remembers me of the LVM2/MD vs EVMS in the 2.5 days - both > of them were good enought, but only one of them could be merged. The > difference is that EVMS developers didn't get that annoyed, and not only > they didn't quit but they continued developing their userspace tools to > make it work with the solution included in the kernel That's a good point to have folks not go down that particular path. 
But Con was kind of put down during the arguments with Ingo about his assumptions of the problems and then was personally crapped on by having his own idea undergo a complete reversal in opinion by Ingo, with Ingo then doing his own version of Con's work, displacing him. How would you feel in that situation? I'd be pretty damn pissed. [For the record, Peter Zijlstra did the same thing to me, which is annoying, but since he's my buddy, doesn't get as rude as the above situation, and included me in every private mail about his work so that I don't feel like RH is paying him to undermine my brilliance, it's ok :)] bill
Re: [ck] Re: Linus 2.6.23-rc1
On Sat, Jul 28, 2007 at 03:18:24PM -0700, Linus Torvalds wrote:
> I don't think anything was suppressed here.

I disagree. See below.

> You seem to say that more modular code would have helped make for a nicer
> way to do schedulers, but if so, where were those patches to do that?
> Con's patches didn't do that either. They just replaced the code.

They replaced code because he would have liked to take the scheduler code in possibly a completely different direction. This is a large conceptual change from what is currently there. That might also include how the notion of bandwidth with regard to core frequency is expressed in the system, with regard to power saving and other things.

Things often get dropped not for purely technical reasons but because of personal preference and a lack of willingness to ask where this might take us.

The way that Con works and conceptualizes things is quite a bit different, and more comprehensive in a lot of ways, compared to how the regular kernel community operates. He's strong in this area and weak in general kernel hackery as a function of time and experience. That doesn't mean that he, his ideas and his code should be subject to an either/or situation with the scheduler and other ideas that have been rejected by various folks. He maintained the -ck branch successfully for a long time and is a very capable developer.

I do acknowledge that having a maintainer that you can trust is more important, but it should not be exclusionary in this way. I totally understand his reaction.

> In fact, Ingo's patches _do_ add some modularity, and might make it easier
> to replace the scheduler. So it would seem that you would argue for CFS,
> not against it?

It's not the same as sched plugin. Some folks might not want to use the rbtree that's in place, and might express things in a completely different manner.
Take, for instance, Tong Li's stuff. With CFS there is a bit of a conceptual mismatch with his attempt at expressing rebalancing in terms of expiry rounds, yet it would be more seamlessly integrated with something like either the old O(1) scheduler or Con's stuff. It's also the only method posted to lkml that can deal with fairness across SMP situations with low error. Yet what's happening here is that his implementation is being rejected for size and complexity that stem from a data structure conceptual mismatch.

Because of this, his notion of trio as a general method of getting aggressive group fairness (by far the most complete conceptually on lkml; overdesign is a different topic altogether) may never see the light of day in Linux because of people's collective lack of foresight.

To answer the question that you posed, no. I'm not arguing against it. I'm in favor of it going into the kernel like any deadline mechanism since it can be generalized, but the current development processes in the Linux kernel should not create an either/or situation with the scheduler code.

There have been multiple rejections of ideas with regard to the scheduler code over the years, ideas that could have taken things in a very different and possibly completely kick-ass direction, and that were suppressed because of the development attitude of various Linux kernel developers. It's only now, because of Con's work, that there's all of a sudden a flurry of development in this area, when this idea is shown to be superior, and even then it's conceptually incomplete and subject to a lot of arbitrary hacking. This is very different from Con's development style and mine as well.

This is an area that could have been addressed sooner if the general community had admitted that there was a problem earlier and permitted more conscious and open change. I've seen changes in this area from Con be rejected time and time again, which affected the technical direction he originally wanted to take this.
Now, Con might have a communication problem here, but nobody asked him to clarify what he might have wanted and why, yet folks were very quick to dismiss him and nitpick him to death, even when he explained why he might have wanted a particular change in the first place. This is the "facilitation" part that's missing in the current kernel culture.

This is a very important idea as the community grows, because I see folks that are capable of doing work get discouraged and locked out because of code maintainability issues and an inability to get folks to move in that direction, because of a missing consensus mechanism in the community other than sucking up to developers.

Con and folks like him should be permitted the opportunity to fail on their own account. If Linux was truly open, it would have dealt with this issue by now and there wouldn't be so much flammage from the general community.

> > I think that's kind of a bogus assumption from the very get go. Scheduling
> > in Linux is one of the most unevolved systems in the kernel that still
> > could go through a large transformation and get big gains like what
> > we've had over the last few months. This evident with both schedulers,
Re: [ck] Re: Linus 2.6.23-rc1
On Sun, Jul 29, 2007 at 10:25:42PM +0200, Mike Galbraith wrote:
> Absolutely.
>
> Con quit for his own reasons. Given that Con himself has said that CFS
> was _not_ why he quite, please discard this... bait. Anyone who's name
> isn't Con Kolivas, who pretends to speak for him is at the very least
> overstepping his bounds, and that is being _very_ generous.

I know Con personally and I completely identify with his circumstance. This is precisely why he quit the project: a generally perceived ignorance of, and disconnect from, end users. Since you side with Ingo on many issues politically, this response from you is no surprise.

Again, the choices that have currently been made with CFS basically lock him out of development. If you don't understand that, then you don't understand the technical issues he's struggled to pursue. He has a large following, which is why this has been a repeated issue between end users of his tree and a number of current Linux kernel developers.

bill
Re: [ck] Re: Linus 2.6.23-rc1
On Tue, Jul 31, 2007 at 09:15:17AM +0800, Carlo Florendo wrote:
> And I think you are digressing from the main issue, which is the empirical
> comparison of SD vs. CFS and to determine which is best. The root of all
> the scheduler fuss was the emotional reaction of SD's author on why his
> scheduler began to be compared with CFS.

Legitimate emotional reaction to being locked out of the development process. There's a very human aspect to this, yes, a negative human aspect that pervades Linux kernel development and is overly defensive and protective when it comes to new ideas.

> We obviously all saw how the particular authors tried to address the
> issues. Ingo tried to address all concerns while Con simply ranted about
> his scheduler being better. If this is what you think about being a bit
> more human, then I think that this has no place in the lkml.

That's highly inaccurate and rather disrespectful of Con's experience. There was a policy decision made with SD that one person basically didn't like. This person whined like a baby for a formula bottle and didn't understand how to use "nice" to control this inherent behavior of the scheduler.

bill
Re: Linus 2.6.23-rc1
On Sun, Jul 29, 2007 at 04:18:18PM -0700, Linus Torvalds wrote:
> Ingo posted numbers. Look at those numbers, and then I would suggest some
> people could seriously consider just shutting up. I've seen too many
> idiotic people who claim that Con got treated unfairly, without those
> people admitting that maybe I had a point when I said that we have had a
> scheduler maintainer for years that actually knows what he's doing.

Here's the problem: *a lot* of folks can do scheduler development inside and outside the community, so what's with the exclusive-only attitude towards the scheduler?

There's sufficient effort coming from folks working on CFS from many sources, so how is sched-plugin a *threat* to stock kernel scheduler development if it gets into the main tree as the default compile option?

Those are the core questions that Con brought up in the APC article. Folks are angry because nobody central to current Linux development has addressed them, and has instead focused on a single narrow set of technical issues to justify a particular set of actions.

I mean, I'm not the only one that has said this, so there has to be some kind of truth behind it.

bill
Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch
On Wed, Jan 24, 2007 at 12:31:15PM +0100, Ingo Molnar wrote:
> * Bill Huey <[EMAIL PROTECTED]> wrote:
>
> > Patch here:
> >
> > http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.2.lock_stat.patch
>
> hm, most of the review feedback i gave has not been addressed in the
> patch above. So i did it myself: find below various fixups for problems
> i outlined plus for a few other problems as well (ontop of
> 2.3.lock_stat).

Sorry, I've been silently keeping your suggested changes in my private repo without announcing it to the world. I'll reply to the old email in another message at length.

http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.4.lock_stat.patch

> While it's now markedly better it's still not quite mergeable, for
> example the build fails quite spectacularly if LOCK_STAT is disabled in
> the .config.

I'll look into it. I've been focused on cleanup and a couple of other things regarding the stability of this patch. Making small changes in it tends to make the kernel crash hard. I suspect that it's an interaction problem with lockdep and that I need to turn lockdep off when hitting "lock stats" locks. I'm going to move to "__raw_..." locks...

Meanwhile, please wait until I hand interpret and merge your changes to an older patch into my latest stuff. If it takes too long, I suggest keeping it out of the tree for a bit until I finish this round, unless something is pressing for this to happen now, like a mass change to the spinlock macros or something. I stalled a bit trying to get Peter Zijlstra an extra feature.

> Also, it would be nice to replace those #ifdef CONFIG_LOCK_STAT changes
> in rtmutex.c with some nice inline functions that do nothing on
> !CONFIG_LOCK_STAT.

I'll look into it. I'm not sure what your choice in style is here and I'm open to suggestions. I'm also interested in a reduction of #define identifier length if you or somebody else has some kind of good convention to suggest.
> but in general i'm positive about the direction this is heading, it just
> needs more work.

Sorry for the lag. Trying to juggle this and the current demands of my employer contributed to it, unfortunately.

bill
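The request above, replacing the #ifdef CONFIG_LOCK_STAT blocks in rtmutex.c with inline functions that do nothing on !CONFIG_LOCK_STAT, can be sketched roughly as follows. This is only an illustration of the header pattern, not the patch's code: struct ls_counters, ls_note_blocked(), ls_note_stolen() and slow_path_example() are hypothetical names invented here.

```c
/*
 * Hypothetical names throughout: struct ls_counters and the ls_note_*()
 * helpers are illustrations, not the lock_stat patch's real API.
 * Build with -DCONFIG_LOCK_STAT to enable the accounting.
 */
struct ls_counters {
	unsigned long blocked;	/* contentions that fully blocked the task */
	unsigned long stolen;	/* contentions averted by a lock steal */
};

#ifdef CONFIG_LOCK_STAT
static inline void ls_note_blocked(struct ls_counters *c)
{
	c->blocked++;
}

static inline void ls_note_stolen(struct ls_counters *c)
{
	c->stolen++;
}
#else
/* Empty stubs: an optimizing compiler can drop these calls entirely. */
static inline void ls_note_blocked(struct ls_counters *c)
{
	(void)c;
}

static inline void ls_note_stolen(struct ls_counters *c)
{
	(void)c;
}
#endif

/* The caller needs no #ifdef of its own; only the header has one. */
static inline void slow_path_example(struct ls_counters *c, int stole)
{
	if (stole)
		ls_note_stolen(c);
	else
		ls_note_blocked(c);
}
```

With this layout the conditional compilation lives in one place, and a build with the option disabled still compiles every call site.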
Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch
On Thu, Jan 04, 2007 at 05:46:59AM +0100, Ingo Molnar wrote:
> thanks. It's looking better, but there's still quite a bit of work left:
>
> there's considerable amount of whitespace noise in it - lots of lines
> with space/tab at the end, lines with 8 spaces instead of tabs, etc.

These comments from me are before the hand merge I'm going to do tonight.

> comment style issues:
>
> +/* To be use for avoiding the dynamic attachment of spinlocks at runtime
> + * by attaching it inline with the lock initialization function */
>
> the proper multi-line style is:
>
> /*
>  * To be used for avoiding the dynamic attachment of spinlocks at
>  * runtime by attaching it inline with the lock initialization function:
>  */

I fixed all of those I can find.

> (note i also fixed a typo in the one above)
>
> more unused code:
>
> +/*
> +static DEFINE_LS_ENTRY(__pte_alloc);
> +static DEFINE_LS_ENTRY(get_empty_filp);
> +static DEFINE_LS_ENTRY(init_waitqueue_head);
> ...
> +*/

Removed. They are for annotation, which isn't important right now.

> +static int lock_stat_inited = 0;
>
> should not be initialized to 0, that is implicit for static variables.

Removed.

> weird alignment here:
>
> +void lock_stat_init(struct lock_stat *oref)
> +{
> +	oref->function[0] = 0;
> +	oref->file = NULL;
> +	oref->line = 0;
> +
> +	oref->ntracked = 0;

I reduced that all to a single space without using huge tabs.

> funky branching:
>
> +	spin_lock_irqsave(&free_store_lock, flags);
> +	if (!list_empty(&lock_stat_free_store)) {
> +		struct list_head *e = lock_stat_free_store.next;
> +		struct lock_stat *s;
> +
> +		s = container_of(e, struct lock_stat, list_head);
> +		list_del(e);
> +
> +		spin_unlock_irqrestore(&free_store_lock, flags);
> +
> +		return s;
> +	}
> +	spin_unlock_irqrestore(&free_store_lock, flags);
> +
> +	return NULL;
>
> that should be s = NULL in the function scope and a plain unlock and
> return s.

I made this change.
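The "s = NULL in the function scope and a plain unlock and return s" shape can be sketched in userspace as below. This is a minimal illustration under stated assumptions: struct node, struct free_store, store_lock(), store_unlock() and store_get() are hypothetical names, and the "lock" is only a depth counter standing in for spin_lock_irqsave()/spin_unlock_irqrestore() so the balance of lock and unlock can be checked.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the lock_stat patch's types. */
struct node {
	struct node *next;
};

struct free_store {
	int locked;		/* depth counter standing in for a spinlock */
	struct node *head;	/* singly linked free list */
};

static void store_lock(struct free_store *fs)
{
	fs->locked++;
}

static void store_unlock(struct free_store *fs)
{
	fs->locked--;
}

/* Pop one entry, or return NULL when the store is empty. */
static struct node *store_get(struct free_store *fs)
{
	struct node *s = NULL;	/* default result, declared at function scope */

	store_lock(fs);
	if (fs->head) {
		s = fs->head;
		fs->head = s->next;
		s->next = NULL;
	}
	store_unlock(fs);	/* single unlock on the only exit path */

	return s;
}
```

Having one unlock and one return makes it hard to add a new branch later that leaves the function with the lock still held.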
> assignments mixed with arithmetics:
>
> +static
> +int lock_stat_compare_objs(struct lock_stat *x, struct lock_stat *y)
> +{
> +	int a = 0, b = 0, c = 0;
> +
> +	(a = ksym_strcmp(x->function, y->function)) ||
> +	(b = ksym_strcmp(x->file, y->file)) ||
> +	(c = (x->line - y->line));
> +
> +	return a | b | c;
>
> the usual (and more readable) style is to separate them out explicitly:
>
> 	a = ksym_strcmp(x->function, y->function);
> 	if (!a)
> 		return 0;
> 	b = ksym_strcmp(x->file, y->file);
> 	if (!b)
> 		return 0;
>
> 	return x->line == y->line;
>
> (detail: this btw also fixes a bug in the function above AFAICS, in the
> a && !b case.)

Not sure what you mean here, but I made the key comparison so that it evaluates each struct field in most-to-least-significant order. The old code worked fine. What you're seeing is the newer stuff.

> also, i'm not fully convinced we want that x->function as a string. That
> makes comparisons alot slower. Why not make it a void *, and resolve to
> the name via kallsyms only when printing it in /proc, like lockdep does
> it?

I've made your suggested change, but I'm not done with it.

> no need to put dates into comments:
>
> +* Fri Oct 27 00:26:08 PDT 2006
>
> then:
>
> +	while (node)
> +	{
>
> proper style is:
>
> +	while (node) {

Done. I misinterpreted the style guide and have made the changes to conform to it.

> this function definition:
>
> +static
> +void lock_stat_insert_object(struct lock_stat *o)
>
> can be single-line. We make it multi-line only when needed.

Done for all the instances I remember offhand.

> these are only samples of the types of style problems still present in
> the code.

I'm a bit of a space cadet so I might have missed something. Latest patch here:

http://finfin.is-a-geek.org/~billh/contention/patch-2.6.20-rc2-rt2.4.lock_stat.patch

I'm going to review and hand merge your changes to the older patch tonight. Thanks for the comments.
bill
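The "most to least significant" key comparison described in the message above can be sketched as an explicit three-step compare. This is an assumption-laden illustration, not the patch's function: struct ls_key and ls_key_compare() are hypothetical names, plain strcmp() stands in for the patch's ksym_strcmp(), and both strings are assumed to be non-NULL.

```c
#include <string.h>

/* Hypothetical key: where a lock was initialized. */
struct ls_key {
	const char *function;	/* most significant field */
	const char *file;
	int line;		/* least significant field */
};

/*
 * Returns <0, 0 or >0, like strcmp(). Each field is only consulted
 * when all more significant fields compared equal.
 */
static int ls_key_compare(const struct ls_key *x, const struct ls_key *y)
{
	int r;

	r = strcmp(x->function, y->function);
	if (r)
		return r;

	r = strcmp(x->file, y->file);
	if (r)
		return r;

	/* Written this way to avoid the overflow that x->line - y->line allows. */
	return (x->line > y->line) - (x->line < y->line);
}
```

An ordering function of this shape gives a total order over the keys, which is what a tree lookup keyed on (function, file, line) would need.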
[PATCH] lock stat for -rt 2.6.20-rc2-rt2 [was Re: 2.6.19-rt14 slowdown compared to 2.6.19]
On Tue, Dec 26, 2006 at 04:51:21PM -0800, Chen, Tim C wrote:
> Ingo Molnar wrote:
> > If you'd like to profile this yourself then the lowest-cost way of
> > profiling lock contention on -rt is to use the yum kernel and run the
> > attached trace-it-lock-prof.c code on the box while your workload is
> > in 'steady state' (and is showing those extended idle times):
> >
> >   ./trace-it-lock-prof > trace.txt
>
> Thanks for the pointer. Will let you know of any relevant traces.

Tim,

http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.lock_stat.patch

You can also apply this patch to get more precise statistics down to the lock. For example:

...
[50, 30, 279 :: 1, 0] {tty_ldisc_try, -, 0}
[5, 5, 0 :: 19, 0] {alloc_super, fs/super.c, 76}
[5, 5, 3 :: 1, 0] {__free_pages_ok, -, 0}
[5728, 862, 156 :: 2, 0] {journal_init_common, fs/jbd/journal.c, 667}
[594713, 79020, 4287 :: 60818, 0] {inode_init_once, fs/inode.c, 193}
[602, 0, 0 :: 1, 0] {lru_cache_add_active, -, 0}
[63, 5, 59 :: 1, 0] {lookup_mnt, -, 0}
[6425, 378, 103 :: 24, 0] {initialize_tty_struct, drivers/char/tty_io.c, 3530}
[6708, 1, 225 :: 1, 0] {file_move, -, 0}
[67, 8, 15 :: 1, 0] {do_lookup, -, 0}
[69, 0, 0 :: 1, 0] {exit_mmap, -, 0}
[7, 0, 0 :: 1, 0] {uart_set_options, drivers/serial/serial_core.c, 1876}
[76, 0, 0 :: 1, 0] {get_zone_pcp, -, 0}
[, 5, 9 :: 1, 0] {as_work_handler, -, 0}
[8689, 0, 0 :: 15, 0] {create_workqueue_thread, kernel/workqueue.c, 474}
[89, 7, 6 :: 195, 0] {sighand_ctor, kernel/fork.c, 1474}
@contention events = 1791177
@found = 21

This is the output from /proc/lock_stat/contention. The first column is the number of contentions that result in a full block of the task, the second is the number of times the mutex owner was active on a per-CPU run queue in the scheduler, and the third is the number of times Steve Rostedt's ownership handoff code averted a full block. Peter Zijlstra used it initially during his files_lock work.
Overhead of the patch is very low since it only records stuff in the slow path of the rt-mutex implementation.

Writing to that file clears all of the stats for a fresh run with a benchmark. This should give a precise point at which any contention would happen in -rt. In general, -rt should do about as well as the stock kernel minus the overhead of interrupt threads.

Since the last release, I've added checks for whether the task is running as "current" on a run queue to see if adaptive spins would be useful in -rt.

These new stats show that only a small percentage of events would benefit from the use of adaptive spins in front of an rt-mutex. Any implementation of it would have little impact on the system. It's not the mechanism but the raw MP work itself that contributes to the good MP performance of Linux.

Apply and have fun.

bill
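For post-processing the rows shown in the message above, a small parser can be sketched as follows. It is a guess at the row layout based only on the sample output ("[a, b, c :: d, e] {function, file, line}"); struct ls_row, its field names and ls_parse_row() are hypothetical, and rows that use a different separator (later messages in this thread show "--") or have a missing field are simply rejected.

```c
#include <stdio.h>

/* Hypothetical in-memory form of one /proc/lock_stat/contention row. */
struct ls_row {
	unsigned long blocked;		/* column 1: full blocks of the task */
	unsigned long owner_running;	/* column 2: owner active on a run queue */
	unsigned long handoff;		/* column 3: blocks averted by handoff */
	unsigned long col4;		/* columns 4 and 5: initialization counts */
	unsigned long col5;
	char function[64];
	char file[128];			/* "-" when no file was recorded */
	int line;
};

/* Returns 0 on success, -1 if the text does not match the row layout. */
static int ls_parse_row(const char *text, struct ls_row *row)
{
	int n;

	n = sscanf(text, " [%lu, %lu, %lu :: %lu, %lu] {%63[^,], %127[^,], %d}",
		   &row->blocked, &row->owner_running, &row->handoff,
		   &row->col4, &row->col5,
		   row->function, row->file, &row->line);

	return n == 8 ? 0 : -1;
}
```

Whitespace in the format matches any run of blanks, including none, so rows with or without a space between "]" and "{" are both accepted.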
Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2 [was Re: 2.6.19-rt14 slowdown compared to 2.6.19]
On Sat, Dec 30, 2006 at 06:56:08AM -0800, Daniel Walker wrote:
> On Sat, 2006-12-30 at 12:19 +0100, Ingo Molnar wrote:
> >
> > - Documentation/CodingStyle compliance - the code is not ugly per se
> >   but still looks a bit 'alien' - please try to make it look Linuxish,
> >   if i apply this we'll probably stick with it forever. This is the
> >   major reason i havent applied it yet.
>
> I did some cleanup while reviewing the patch, nothing very exciting but
> it's an attempt to bring it more into the "Linuxish" scope .. I didn't
> compile it so be warned.
>
> There lots of ifdef'd code under CONFIG_LOCK_STAT inside rtmutex.c I
> suspect it would be a benefit to move that all into a header and ifdef
> only in the header .

Ingo and Daniel,

I'll try and make it more Linuxish. It's one of the reasons why I posted it: I knew it would need some kind of help in that arena and I've been in need of feedback regarding it. Originally, I picked a style that made what I was doing extremely obvious and clear to facilitate development, which is the rationale behind it.

I'll make those changes and we can progressively pass it back and forth to see if this passes.

bill
Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2 [was Re: 2.6.19-rt14 slowdown compared to 2.6.19]
On Tue, Jan 02, 2007 at 02:51:05PM -0800, Chen, Tim C wrote:
> Bill,
>
> I'm having some problem getting this patch to run stablely. I'm
> encoutering errors like that in the trace that follow:
>
> Thanks.
> Tim
>
> Unable to handle kernel NULL pointer dereference at 0008

Yes, those are the reason why I have some aggressive asserts in the code to try to track down the problem. Try this:

http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.1.lock_stat.patch

It's got some cosmetic cleanup in it to make it more Linux-ish instead of me trying to reinvent some kind of Smalltalk system in the kernel. I'm trying to address all of Ingo's complaints about the code, so it's still a work in progress, namely the style issues (I'd like help/suggestions on that) and assert conventions.

It might be the case that the lock isn't known to the lock stats code yet. It's got some technical overlap with lockdep in that a lock might not be known yet, and that is causing a crash.

Try that patch and report back to me what happens.

bill
Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2 [was Re: 2.6.19-rt14 slowdown compared to 2.6.19]
On Tue, Jan 02, 2007 at 03:12:34PM -0800, Bill Huey wrote:
> On Tue, Jan 02, 2007 at 02:51:05PM -0800, Chen, Tim C wrote:
> > Bill,
> >
> > I'm having some problem getting this patch to run stablely. I'm
> > encoutering errors like that in the trace that follow:
>
> It might the case that the lock isn't know to the lock stats code yet.
> It's got some technical overlap with lockdep in that a lock might not be
> known yet and is causing a crashing.

The stack trace and code examination reveal that if that structure in the timer code is used before it's initialized by the CPU bringup, it'll cause problems like that crash. I'll look at it later on tonight.

bill
[PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch
On Sat, Dec 30, 2006 at 12:19:40PM +0100, Ingo Molnar wrote:
> your patch looks pretty ok to me in principle. A couple of suggestions
> to make it more mergable:
>
> - instead of BUG_ON()s please use DEBUG_LOCKS_WARN_ON() and make sure
>   the code is never entered again if one assertion has been triggered.
>   Pass down a return result of '0' to signal failure. See
>   kernel/lockdep.c about how to do this. One thing we dont need are
>   bugs in instrumentation bringing down a machine.

I'm now using non-fatal error checking instead of BUG_ON. BUG_ON was a more aggressive method that I used to find problems initially.

> - remove dead (#if 0) code

Done.

> - Documentation/CodingStyle compliance - the code is not ugly per se
>   but still looks a bit 'alien' - please try to make it look Linuxish,
>   if i apply this we'll probably stick with it forever. This is the
>   major reason i havent applied it yet.

I reformatted most of the patch to be 80 column limited. I simplified a number of names, but I'm open to suggestions and patches on how to go about this. Much of this code was a style experiment, but now I have to make it more mergeable.

> - the xfs/wrap_lock change looks bogus - the lock is initialized
>   already. What am i missing?

Correct. This has been removed.

I've applied Daniel Walker's changes as well.

Patch here:

http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.2.lock_stat.patch

bill
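The non-fatal checking requested in the message above (warn instead of BUG_ON(), never enter the code again after a failed assertion, return 0 to signal failure) can be sketched in userspace roughly like this. It only mimics the behavior described in the quoted text; LS_WARN_ON(), ls_warn_on(), ls_enabled and ls_record() are hypothetical names, not the kernel's DEBUG_LOCKS_WARN_ON() implementation.

```c
#include <stdio.h>

/* Hypothetical global switch: cleared by the first failed check. */
static int ls_enabled = 1;

/* Returns nonzero when cond holds; warns once and disables the code. */
static int ls_warn_on(int cond, const char *expr, const char *file, int line)
{
	if (!cond)
		return 0;

	if (ls_enabled) {
		ls_enabled = 0;
		fprintf(stderr, "lock_stat: check failed: %s at %s:%d\n",
			expr, file, line);
	}
	return 1;
}

#define LS_WARN_ON(cond) ls_warn_on(!!(cond), #cond, __FILE__, __LINE__)

/* Returns 1 on success and 0 on failure, as the review asks for. */
static int ls_record(unsigned long *counter)
{
	if (!ls_enabled)
		return 0;	/* a previous check failed: stay out */

	if (LS_WARN_ON(counter == NULL))
		return 0;	/* report, disable, keep the machine up */

	(*counter)++;
	return 1;
}
```

The point of the pattern is that a bug in the instrumentation degrades to missing statistics rather than taking the machine down.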
Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch
On Wed, Jan 03, 2007 at 03:59:28PM -0800, Chen, Tim C wrote:
> Bill Huey (hui) wrote:
> http://mmlinux.sourceforge.net/public/patch-2.6.20-rc2-rt2.2.lock_stat.patch
>
> This version is much better and ran stablely.
>
> If I'm reading the output correctly, the locks are listed by
> their initialization point (function, file and line # that a lock is
> initialized).
> That's good information to identify the lock.

Yes, that's correct.

Good to know that. What did the output reveal?

It can be extended by pid/futex for userspace apps, but that has yet to be done. It might require changes to glibc or some kind of dynamic tracing to communicate information about that lock to kernel space. There are other kernel uses as well. It's just a basic mechanism for a variety of uses. This patch has some LTT and DTrace-isms to it.

What's your intended use again, summarized? Futex contention? I'll read the first posting again.

> However, it will be more useful if there is information about where the
> locking
> was initiated from and who was trying to obtain the lock.

It would add quite a bit more overhead, but I believe it could be done with lockdep directly, in conjunction with this patch. However, the output should already be specific enough that a kernel code examination at the key points of all users of the lock would show where the problem places are, as well as the users.

bill
Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch
On Wed, Jan 03, 2007 at 04:25:46PM -0800, Chen, Tim C wrote:
> Earlier I used latency_trace and figured that there was read contention
> on mm->mmap_sem during call to _rt_down_read by java threads
> when I was running volanomark. That caused the slowdown of the rt
> kernel
> compared to non-rt kernel. The output from lock_stat confirm
> that mm->map_sem was indeed the most heavily contended lock.

Can you sort the output ("sort -n" or whatever) and post it without the zeroed entries?

I'm curious about how that statistical spike compares to the rest of the system activity. I'm sure that'll get the attention of Peter as well, and maybe he'll do something about it? :)

bill
Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch
On Wed, Jan 03, 2007 at 04:46:37PM -0800, Chen, Tim C wrote:
> Bill Huey (hui) wrote:
> > Can you sort the output ("sort -n" what ever..) and post it without
> > the zeroed entries ?
> >
> > I'm curious about how that statistical spike compares to the rest of
> > the system activity. I'm sure that'll get the attention of Peter as
> > well and maybe he'll do something about it ? :)
...
> @contention events = 247149
> @failure_events = 146
> @lookup_failed_scope = 175
> @lookup_failed_static = 43
> @static_found = 16
> [1, 113, 77 -- 32768, 0]{tcp_init, net/ipv4/tcp.c, 2426}
> [2, 759, 182 -- 1, 0] {lock_kernel, -, 0}
> [13, 0, 7 -- 4, 0]{kmem_cache_free, -, 0}
> [25, 3564, 9278 -- 1, 0]{lock_timer_base, -, 0}
> [56, 9528, 24552 -- 3, 0] {init_timers_cpu, kernel/timer.c, 1842}
> [471, 52845, 17682 -- 10448, 0] {sock_lock_init, net/core/sock.c, 817}
> [32251, 9024, 242 -- 256, 0]{init, kernel/futex.c, 2781}
> [173724, 11899638, 9886960 -- 11194, 0] {mm_init, kernel/fork.c, 369}

Thanks. The numbers look a bit weird in that the first column should have a bigger number of events than the second column, since the second is a special-case subset. Looking at the lock_stat_note() code should show that to be the case. Did you make a change to the output?

I can't tell from your output which are "steal", actively running, or overall contention stats against the lock.

Thanks

bill
Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch
On Wed, Jan 03, 2007 at 05:00:49PM -0800, Bill Huey wrote:
> On Wed, Jan 03, 2007 at 04:46:37PM -0800, Chen, Tim C wrote:
> > @contention events = 247149
> > @failure_events = 146
> > @lookup_failed_scope = 175
> > @lookup_failed_static = 43
> > @static_found = 16
> > [1, 113, 77 -- 32768, 0]{tcp_init, net/ipv4/tcp.c, 2426}
> > [2, 759, 182 -- 1, 0] {lock_kernel, -, 0}
> > [13, 0, 7 -- 4, 0] {kmem_cache_free, -, 0}
> > [25, 3564, 9278 -- 1, 0]{lock_timer_base, -, 0}
> > [56, 9528, 24552 -- 3, 0] {init_timers_cpu, kernel/timer.c, 1842}
> > [471, 52845, 17682 -- 10448, 0] {sock_lock_init, net/core/sock.c, 817}
> > [32251, 9024, 242 -- 256, 0]{init, kernel/futex.c, 2781}
> > [173724, 11899638, 9886960 -- 11194, 0] {mm_init, kernel/fork.c,
> > 369}
>
> Thanks, the numbers look a bit weird in that the first column should
> have a bigger number of events than that second column since it is a
> special case subset. Looking at the lock_stat_note() code should show
> that to be the case. Did you make a change to the output ?
>
> I can't tell which are "steal", actively running or overall contention
> stats against the lock from your output.

Also, the output is weird in that "contention_events" should be the total of all of the events in the first three columns. Clearly the number is wrong, and I don't know if the text output was mangled or if that's accurate and my code is buggy. I have yet to see it give me data like that in my setup.

The fourth and fifth columns are the number of times this lock was initialized inline by a function like spin_lock_init(). It might have a correspondence to clone() calls.

bill
Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch
On Wed, Jan 03, 2007 at 05:11:04PM -0800, Chen, Tim C wrote:
> Bill Huey (hui) wrote:
> >
> > Thanks, the numbers look a bit weird in that the first column should
> > have a bigger number of events than that second column since it is a
> > special case subset. Looking at the lock_stat_note() code should show
> > that to be the case. Did you make a change to the output ?
>
> No, I did not change the output. I did reset to the contention content
>
> by doing echo "0" > /proc/lock_stat/contention.
>
> I noticed that the first column get reset but not the second column. So
> the reset code probably need to be checked.

This should have the fix:

http://mmlinux.sf.net/public/patch-2.6.20-rc2-rt2.3.lock_stat.patch

If you can rerun it and post the results, it'll hopefully show the behavior of that lock acquisition better.

Thanks

bill
Re: [PATCH] lock stat for -rt 2.6.20-rc2-rt2.2.lock_stat.patch
On Wed, Jan 03, 2007 at 06:14:11PM -0800, Chen, Tim C wrote:
> Bill Huey (hui) wrote:
> http://mmlinux.sf.net/public/patch-2.6.20-rc2-rt2.3.lock_stat.patch
> > If you can rerun it and post the results, it'll hopefully show the
> > behavior of that lock acquisition better.
>
> Here's the run with fix to produce correct statistics.
>
> Tim
>
> @contention events = 848858
> @failure_events = 10
> @lookup_failed_scope = 175
> @lookup_failed_static = 47
> @static_found = 17
...
> [112584, 150, 6 -- 256, 0] {init, kernel/futex.c, 2781}
> [597012, 183895, 136277 -- 9546, 0] {mm_init, kernel/fork.c,
> 369}

Interesting. The second column means that those events can be adaptively spun on to prevent the blocking from happening. That's roughly 1/3rd of the blocking events that happen (second/first). Something like that would help out, but the problem is the contention on that lock in the first place.

Also, Linux can do a hell of a lot of context switches per second. Is the number of total contentions (top figure) in that run consistent with the performance degradation? And how much would the reduction of those events by 1/3rd help out with the benchmark? Those are the questions in my mind at this moment.

bill
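The "roughly 1/3rd" figure in the message above is the second column divided by the first for the mm_init row, 183895 / 597012. A tiny helper makes the arithmetic explicit; ls_spin_share_permille() is a hypothetical name, and the result is expressed in tenths of a percent to stay in integer arithmetic.

```c
/*
 * Share of blocking events (column one) whose lock owner was running
 * (column two), in tenths of a percent. Hypothetical helper, not part
 * of the lock_stat patch.
 */
static unsigned long long ls_spin_share_permille(unsigned long long blocked,
						  unsigned long long owner_running)
{
	if (blocked == 0)
		return 0;	/* nothing blocked, nothing to spin on */

	return owner_running * 1000ULL / blocked;
}
```

For the mm_init row this gives 308, that is about 30.8% of the blocking events, while the futex row (150 out of 112584) comes out at 1, about 0.1%.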
Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
On Mon, Feb 26, 2007 at 09:35:43PM +0100, Ingo Molnar wrote: > * Evgeniy Polyakov <[EMAIL PROTECTED]> wrote: > > > If kernelspace rescheduling is that fast, then please explain me why > > userspace one always beats kernel/userspace? > > because 'user space scheduling' makes no sense? I explained my thinking > about that in a past mail: > > --> > One often repeated (because pretty much only) performance advantage of > 'light threads' is context-switch performance between user-space > threads. But reality is, nobody /cares/ about being able to > context-switch between "light user-space threads"! Why? Because there > are only two reasons why such a high-performance context-switch would > occur: ... > 2) there has been an IO event. The thing is, for IO events we enter the > kernel no matter what - and we'll do so for the next 10 years at > minimum. We want to abstract away the hardware, we want to do > reliable resource accounting, we want to share hardware resources, > we want to rate-limit, etc., etc. While in /theory/ you could handle > IO purely from user-space, in practice you dont want to do that. And > if we accept the premise that we'll enter the kernel anyway, there's > zero performance difference between scheduling right there in the > kernel, or returning back to user-space to schedule there. (in fact > i submit that the former is faster). Or if we accept the theoretical > possibility of 'perfect IO hardware' that implements /all/ the > features that the kernel wants (in a secure and generic way, and > mind you, such IO hardware does not exist yet), then /at most/ the > performance advantage of user-space doing the scheduling is the > overhead of a null syscall entry. Which is a whopping 100 nsecs on > modern CPUs! That's roughly the latency of a /single/ DRAM access! Ingo and Evgeniy, I was trying to avoid getting into this discussion, but whatever. 
M:N threading systems also require just about all of the threading semantics that are inside the kernel to be available in userspace. Implementations of the userspace scheduler side of things must be able to turn off preemption to do per-CPU local storage, report blocking/preempting (via upcall or a mailbox) and do other scheduler-ish things in a reliable way. The complexity of a system like that ends up not being worth it, and it is often monstrously large to implement and debug. That's why Solaris 10 removed their scheduler activations framework and went with 1:1 like in Linux, since the scheduler activations model is so difficult to control. The slowness of the futex stuff might be compounded by some VM mapping issues that Bill Irwin and Peter Zijlstra have pointed out in the past, if I understand correctly. Bryan Cantrill of Solaris 10/dtrace fame can comment on that if you ask him sometime. For an exercise, think about all of the things you need in order to either migrate a task or do a cross-CPU wake of one. It goes to hell in complexity really quick. Erlang and other language-based concurrency systems get their regularities by indirectly oversimplifying what threading is compared to what kernel folks are used to. Try doing a cross-CPU wake quickly on a system like that, good luck. Now think about how to do an IPI in userspace? Good luck. That's all :) bill
Re: [PATCH] remove sb->s_files and file_list_lock usage in dquot.c
On Tue, Feb 06, 2007 at 02:23:33PM +0100, Christoph Hellwig wrote: > Iterate over sb->s_inodes instead of sb->s_files in add_dquot_ref. > This reduces list search and lock hold time aswell as getting rid of > one of the few uses of file_list_lock which Ingo identified as a > scalability problem. Christoph, The i_mutex lock in the inode structure is also a source of heavy contention when running a lot of parallel "find"s. I'm sure that folks would be open to hearing suggestions regarding how to fix that. bill
Re: [PATCH] remove sb->s_files and file_list_lock usage in dquot.c
On Thu, Feb 08, 2007 at 01:01:21AM -0800, Bill Huey wrote: > Christoph, > > The i_mutex lock the inode structure is also a source of contention > heavy when running a lot of parallel "find"s. I'm sure that folks > would be open to hearing suggestions regarding how to fix that. Christoph, And while you're at it, you should also know that dcache_lock is next in line to be nixed out of existence if possible. i_mutex is a bitch and I'm not even going to think about how to get rid of it since it's so widely used in many places (file systems aren't my thing either). Maybe some more precise tracking of contention paths would be useful to see if there's a pathological case creating a cascade of contention events so that it can be nixed, don't know. About 1/10th of the lock stat events I've logged report that the owner of the rtmutex is the "current" on a runqueue somewhere. An adaptive lock would help with those contention events (spin while the owner is running, waiting for the mutex release) but really, the contention should be avoided in the first place since it has a kind of (I think) polynomial increase in contention time as you add more processors to the mix. I have an adaptive lock implementation in my tree that eliminates the contention between what looks like the IDE layer, work queues and the anticipatory scheduler, but that's not a real problem unlike what I've mentioned above. I can get you and others more specifics on the problem if folks working on the lower layers want it. Other than that the -rt patch does quite well with instrumenting all sorts of kernel behaviors that include contention and latency issues. So I don't need to tell you how valuable the -rt patch is for these issues since it's obvious, and I'm sure that you'll agree, that it's been instrumental at discovering many problems with the stock kernel. 
bill
Re: [PATCH] remove sb->s_files and file_list_lock usage in dquot.c
On Thu, Feb 08, 2007 at 11:14:04PM -0800, Bill Huey wrote: > I have an adaptive lock implementation in my tree that eliminates > the contention between what looks like the IDE layer, work queues and > the anticipatory scheduler, but that's not a real problem unlike what > I've mentioned above. I can get you and others more specifics on the > problem if folks working on the lower layers want it. Correction, it eliminates all of the "live" contentions where the mutex owner isn't asleep already. bill
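To make the "live" contention idea concrete, here is a rough userspace sketch in C11 of the spin-or-block decision an adaptive lock makes. This is not the implementation from my tree or from -rt; the names (struct adaptive_lock, owner_running, adaptive_wait_policy) are hypothetical, the owner_running flag is assumed to be kept up to date by something scheduler-like, and sched_yield() merely stands in for sleeping on a wait queue.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for an adaptive mutex. */
struct adaptive_lock {
	atomic_bool locked;		/* true while somebody holds the lock */
	atomic_bool owner_running;	/* assumed hint: is the holder on a CPU? */
};

enum wait_policy { WAIT_SPIN, WAIT_BLOCK };

/* One acquisition attempt; marks the new owner as running on success. */
static bool adaptive_trylock(struct adaptive_lock *l)
{
	bool expected = false;

	if (atomic_compare_exchange_strong(&l->locked, &expected, true)) {
		atomic_store(&l->owner_running, true);
		return true;
	}
	return false;
}

/* Spin only for "live" contention, i.e. while the owner is running. */
static enum wait_policy adaptive_wait_policy(struct adaptive_lock *l)
{
	return atomic_load(&l->owner_running) ? WAIT_SPIN : WAIT_BLOCK;
}

static void adaptive_lock_acquire(struct adaptive_lock *l)
{
	while (!adaptive_trylock(l)) {
		if (adaptive_wait_policy(l) == WAIT_BLOCK)
			sched_yield();	/* stand-in for blocking on a wait queue */
		/* WAIT_SPIN: just loop and try again */
	}
}

static void adaptive_lock_release(struct adaptive_lock *l)
{
	atomic_store(&l->owner_running, false);
	atomic_store(&l->locked, false);
}
```

The point of the split is that spinning only pays off when the owner is actually making progress toward the release; if the owner is asleep, spinning just burns the CPU it would need to wake up on.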
Re: [PATCH 4/7] fs: break the file_list_lock for sb->s_files
On Sun, Jan 28, 2007 at 03:30:06PM +, Christoph Hellwig wrote: > On Sun, Jan 28, 2007 at 04:21:06PM +0100, Ingo Molnar wrote: > > > > sb->s_files is converted to a lock_list. furthermore to prevent the > > > > lock_list_head of getting too contended with concurrent add > > > > operations the add is buffered in per cpu filevecs. > > > > > > NACK. Please don't start using lockdep internals in core code. > > > > what do you mean by that? > > struct lock_list is an lockdep implementation detail and should not > leak out and be used anywhere else. If we want something similar it > should be given a proper name and a header of it's own, but I don't > think it's a valueable abstraction for the core kernel. Christoph, "lock list" has nothing to do with lockdep. It's a relatively new data type used to construct concurrent linked lists using a spinlock per entry. bill
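For readers who haven't seen a list with a lock per entry, here is a generic userspace illustration in C. To be clear, this is not the lock_list code from the patch series under discussion; it is a plain hand-over-hand locking sketch, it uses pthread mutexes rather than kernel spinlocks, and all of the names (struct lnode, lnode_insert_after, lnode_sum) are invented for the example.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical list node carrying its own lock. */
struct lnode {
	pthread_mutex_t lock;
	struct lnode *next;
	int value;
};

static void lnode_init(struct lnode *n, int value)
{
	pthread_mutex_init(&n->lock, NULL);
	n->next = NULL;
	n->value = value;
}

/*
 * Insert n right after pos. Only pos's lock is taken, so inserts at
 * different positions in the list do not serialize on one global lock.
 */
static void lnode_insert_after(struct lnode *pos, struct lnode *n)
{
	pthread_mutex_lock(&pos->lock);
	n->next = pos->next;
	pos->next = n;
	pthread_mutex_unlock(&pos->lock);
}

/*
 * Walk the list hand-over-hand: take the next node's lock before
 * dropping the current one, so the link being followed cannot change
 * underneath the walker.
 */
static long lnode_sum(struct lnode *head)
{
	long sum = 0;
	struct lnode *cur = head;

	if (!cur)
		return 0;
	pthread_mutex_lock(&cur->lock);
	while (cur) {
		struct lnode *next = cur->next;

		sum += cur->value;
		if (next)
			pthread_mutex_lock(&next->lock);
		pthread_mutex_unlock(&cur->lock);
		cur = next;
	}
	return sum;
}
```

Removal is the harder case in a scheme like this, since it needs the predecessor's lock as well as the victim's; it is left out here to keep the sketch short.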
Re: lockmeter
On Sun, Jan 28, 2007 at 09:38:16AM -0800, Martin J. Bligh wrote: > Christoph Hellwig wrote: > >On Sun, Jan 28, 2007 at 08:52:25AM -0800, Martin J. Bligh wrote: > >>Mmm. not wholly convinced that's true. Whilst i don't have lockmeter > >>stats to hand, the heavy time in __d_lookup seems to indicate we may > >>still have a problem to me. I guess we could move the spinlocks out > >>of line again to test this fairly easily (or get lockmeter upstream). > > > >We definitly should get lockmeter in. Does anyone volunteer for doing > >the cleanup and merged? > > On second thoughts .. I don't think it'd actually work for this since > the locks aren't global. Not that it shouldn't be done anyway, but ... > > ISTR we still thought dcache scalability was a significant problem last > time anyone looked at it seriously - just never got fixed. Dipankar? My lock stat stuff shows dcache to be a problem under -rt as well. It is keyed off the same mechanism as lockdep. It's pretty heavily hit under even normal loads relative to other kinds of lock overhead, even for casual file operations on a 2x system. I can't imagine how lousy it's going to be under real load on an 8x or higher machine. However, this patch is -rt only and spinlock times are meaningless under it because of preemptibility. bill
Re: lockmeter
On Sun, Jan 28, 2007 at 10:17:05PM +0100, Ingo Molnar wrote: > btw., while my plan is to prototype your lock-stat patch in -rt > initially, it should be doable to extend it to be usable with the > upstream kernel as well. > > We can gather lock contention events when there is spinlock debugging > enabled, from lib/spinlock_debug.c. For example __spin_lock_debug() does > this: > > static void __spin_lock_debug(spinlock_t *lock) > { > ... > for (i = 0; i < loops; i++) { > if (__raw_spin_trylock(&lock->raw_lock)) > return; > __delay(1); > } > > where the __delay(1) call is done do we notice contention - and there > you could drive the lock-stat code. Similarly, rwlocks have such natural > points too where you could insert a lock-stat callback without impacting > performance (or the code) all that much. mutexes and rwsems have natural > contention points too (kernel/mutex.c:__mutex_lock_slowpath() and > kernel/rwsem.c:rwsem_down_failed_common()), even with mutex debugging is > off. > > for -rt the natural point to gather contention events is in > kernel/rtmutex.c, as you are doing it currently. > > finally, you can enable lockdep's initialization/etc. wrappers so that > infrastructure between lockdep and lock-stat is shared, but you dont > have to call into the lock-tracking code of lockdep.c if LOCK_STAT is > enabled and PROVE_LOCKING is disabled. That should give you the lockdep > infrastructure for LOCK_STAT, without the lockdep overhead. > > all in one, one motivation behind my interest in your patch for -rt is > that i think it's useful for upstream too, and that it can merge with > lockdep to a fair degree. Fantastic. I'm going to try and finish up your suggested changes tonight and get it to work with CONFIG_LOCK_STAT off. It's been challenging to find time to do Linux these days, so I don't mind handing it off to you after this point so that you can tweak it to your heart's content. 
Yeah, one of the major motivations behind it was to see if Solaris style locks were useful and to either validate or invalidate their usefulness. Because of this patch, we have an idea of what's going on with regard to adaptive spinning and such. The sensible conclusion is that the key isn't the sophistication of the lock, but parallelizing the code so that the contention doesn't happen in the first place. I forward merged it into rc6-rt2 and you can expect a drop tonight of what I have regardless of whether it's perfect or not. bill
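Ingo's suggestion quoted above, noting contention at the point where a trylock has just failed, can be sketched in userspace C11 roughly like this. It is only an illustration under assumed names (struct stat_spinlock, stat_spin_lock_bounded); it is not lib/spinlock_debug.c and not the lock_stat hook itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spinlock with a contention counter bolted on. */
struct stat_spinlock {
	atomic_flag held;
	atomic_ulong ncontended;
};

#define STAT_SPINLOCK_INIT { ATOMIC_FLAG_INIT, 0 }

static bool stat_spin_trylock(struct stat_spinlock *l)
{
	/* test_and_set returns the previous value: false means we got it */
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/*
 * Try to take the lock at most max_loops times. The first failed
 * trylock is the natural point to note contention; the uncontended
 * path never touches the counter. Returns true if the lock was taken.
 */
static bool stat_spin_lock_bounded(struct stat_spinlock *l, unsigned long max_loops)
{
	bool noted = false;

	for (unsigned long i = 0; i < max_loops; i++) {
		if (stat_spin_trylock(l))
			return true;
		if (!noted) {
			atomic_fetch_add_explicit(&l->ncontended, 1,
						  memory_order_relaxed);
			noted = true;
		}
	}
	return false;
}

static void stat_spin_unlock(struct stat_spinlock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The counter is bumped once per contended acquisition rather than once per loop iteration, so it counts contention events and not how long each one lasted.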
Re: lockmeter
On Sun, Jan 28, 2007 at 09:27:45PM -0800, Bill Huey wrote: > On Sun, Jan 28, 2007 at 10:17:05PM +0100, Ingo Molnar wrote: > > btw., while my plan is to prototype your lock-stat patch in -rt > > initially, it should be doable to extend it to be usable with the > > upstream kernel as well. ... > Fantastic. I'm going to try and finish up your suggested changes tonight > and get it to work with CONFIG_LOCK_STAT off. It's been challenging to > find time to do Linux these days, so I don't mind handing it off to you > after this point so that you and tweek it to your heart's content. Ingo, Got it. http://mmlinux.sourceforge.net/public/patch-2.6.20-rc6-rt2.1.lock_stat.patch This compiles and runs with the CONFIG_LOCK_STAT option turned off now and I believe it addresses all of the concerns that you mentioned. I could have missed a detail here and there, but I think it's in pretty good shape now. bill
[PATCH 0/4] lock stat for 2.6.19-rt1
The second is the number of times this lock object was initialized. The third is the annotation scheme that directly attaches the lock object (spinlock, etc..) in line with the function initializer to avoid the binary tree lookup. This shows a contention issue with inode access around what looks like a lock around a tree structure (inode_init_once @ lines 193/196) as well as lower layers of the IO system around the anticipatory scheduler and interrupt handlers. Keep in mind this is -rt so we're using interrupt threads here. This result is a combination of a series of normal "find" loads as a result of the machine being idle overnight in a 2x AMD processor setup. I'm sure that bigger machines should generate more interesting loads from the IBM folks and others. Thanks to all of the folks on #offtopic2 (Zwane Mwaikambo, nikita, Rik van Riel, the ever increasingly bitter Con Kolivas, Bill Irwin, etc...) for talking me through the early stages of this process and for general support. I'm requesting review, comments and inclusion of this into the -rt series of patches as well as general help. I've CCed folks outside of the normal core -rt group who I thought might find this interesting or who have been friendly with me about -rt in the past. bill
--- include/linux/lock_stat.h 5e0836a9785a182c8954c3bee8a92e63dd61602b
+++ include/linux/lock_stat.h 5e0836a9785a182c8954c3bee8a92e63dd61602b
@@ -0,0 +1,144 @@
+/*
+ * By Bill Huey (hui) at <[EMAIL PROTECTED]>
+ *
+ * Release under the what ever the Linux kernel chooses for a
+ * license, GNU Public License GPL v2
+ *
+ * Tue Sep 5 17:27:48 PDT 2006
+ * Created lock_stat.h
+ *
+ * Wed Sep 6 15:36:14 PDT 2006
+ * Thinking about the object lifetime of a spinlock. Please refer to
+ * comments in kernel/lock_stat.c instead.
+ *
+ */
+
+#ifndef LOCK_STAT_H
+#define LOCK_STAT_H
+
+#ifdef CONFIG_LOCK_STAT
+
+#include
+#include
+#include
+#include
+
+#include
+
+typedef struct lock_stat {
+	char function[KSYM_NAME_LEN];
+	int line;
+	char *file;
+
+	atomic_t ncontended;
+	unsigned int ntracked;
+	atomic_t ninlined;
+
+	struct rb_node rb_node;
+	struct list_head list_head;
+} lock_stat_t;
+
+typedef lock_stat_t *lock_stat_ref_t;
+
+#define LOCK_STAT_INIT(field)
+#define LOCK_STAT_INITIALIZER(field) { \
+	__FILE__, __FUNCTION__, __LINE__, \
+	ATOMIC_INIT(0), LIST_HEAD_INIT(field)}
+
+#define LOCK_STAT_NOTE __FILE__, __FUNCTION__, __LINE__
+#define LOCK_STAT_NOTE_VARS _file, _function, _line
+#define LOCK_STAT_NOTE_PARAM_DECL const char *_file, \
+	const char *_function, \
+	int _line
+
+#define __COMMA_LOCK_STAT_FN_DECL , const char *_function
+#define __COMMA_LOCK_STAT_FN_VAR , _function
+#define __COMMA_LOCK_STAT_NOTE_FN , __FUNCTION__
+
+#define __COMMA_LOCK_STAT_NOTE , LOCK_STAT_NOTE
+#define __COMMA_LOCK_STAT_NOTE_VARS , LOCK_STAT_NOTE_VARS
+#define __COMMA_LOCK_STAT_NOTE_PARAM_DECL , LOCK_STAT_NOTE_PARAM_DECL
+
+
+#define __COMMA_LOCK_STAT_NOTE_FLLN_DECL , const char *_file, int _line
+#define __COMMA_LOCK_STAT_NOTE_FLLN , __FILE__, __LINE__
+#define __COMMA_LOCK_STAT_NOTE_FLLN_VARS , _file, _line
+
+#define __COMMA_LOCK_STAT_INITIALIZER , .lock_stat = NULL,
+
+#define __COMMA_LOCK_STAT_IP_DECL , unsigned long _ip
+#define __COMMA_LOCK_STAT_IP , _ip
+#define __COMMA_LOCK_STAT_RET_IP , (unsigned long) __builtin_return_address(0)
+
+extern void lock_stat_init(struct lock_stat *ls);
+extern void lock_stat_sys_init(void);
+
+#define lock_stat_is_initialized(o) ((unsigned long) (*o)->file)
+
+extern void lock_stat_note_contention(lock_stat_ref_t *ls, unsigned long ip);
+extern void lock_stat_print(void);
+extern void lock_stat_scoped_attach(lock_stat_ref_t *_s, LOCK_STAT_NOTE_PARAM_DECL);
+
+#define ksym_strcmp(a, b) strncmp(a, b, KSYM_NAME_LEN)
+#define ksym_strcpy(a, b) strncpy(a, b, KSYM_NAME_LEN)
+#define ksym_strlen(a) strnlen(a, KSYM_NAME_LEN)
+
+/*
+static inline char * ksym_strdup(const char *a)
+{
+	char *s = (char *) kmalloc(ksym_strlen(a), GFP_KERNEL);
+	return strncpy(s, a, KSYM_NAME_LEN);
+}
+*/
+
+#define LS_INIT(name, h) { \
+	/*.function,*/ .file = h, .line = 1, \
+	.ntracked = 0, .ncontended = ATOMIC_INIT(0), \
+	.list_head = LIST_HEAD_INIT(name.list_head), \
+	.rb_node.rb_left = NULL, .rb_node.rb_right = NULL \
+	}
+
+#define DECLARE_LS_ENTRY(name) \
+	extern struct lock_stat _lock_stat_##name##_entry
+
+/* char _##name##_string[] = #name;\
[PATCH 1/4] lock stat for 2.6.19-rt1
This hooks into the preexisting lock definitions in the -rt kernel and hijacks parts of lockdep for the object hash key. bill --- include/linux/mutex.h d231debc2848a8344e1b04055ef22e489702e648 +++ include/linux/mutex.h 734c89362a3d77d460eb20eec3107e7b76fed938 @@ -15,6 +15,7 @@ #include #include #include +#include #include @@ -35,7 +36,8 @@ extern void } extern void -_mutex_init(struct mutex *lock, char *name, struct lock_class_key *key); +_mutex_init(struct mutex *lock, char *name, struct lock_class_key *key + __COMMA_LOCK_STAT_NOTE_PARAM_DECL); extern void __lockfunc _mutex_lock(struct mutex *lock); extern int __lockfunc _mutex_lock_interruptible(struct mutex *lock); @@ -56,11 +58,15 @@ extern void __lockfunc _mutex_unlock(str # define mutex_lock_nested(l, s) _mutex_lock(l) #endif +#define __mutex_init(l,n) __rt_mutex_init(&(l)->mutex,\ + n \ + __COMMA_LOCK_STAT_NOTE) + # define mutex_init(mutex) \ do { \ static struct lock_class_key __key; \ \ - _mutex_init((mutex), #mutex, &__key); \ + _mutex_init((mutex), #mutex, &__key __COMMA_LOCK_STAT_NOTE);\ } while (0) #else --- include/linux/rt_lock.h d7515027865666075d3e285bcec8c36e9b6cfc47 +++ include/linux/rt_lock.h 297792307de5b4aef2c7e472e2a32c727e5de3f1 @@ -13,6 +13,7 @@ #include #include #include +#include #ifdef CONFIG_PREEMPT_RT /* @@ -28,8 +29,8 @@ typedef struct { #ifdef CONFIG_DEBUG_RT_MUTEXES # define __SPIN_LOCK_UNLOCKED(name) \ - (spinlock_t) { { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) \ - , .save_state = 1, .file = __FILE__, .line = __LINE__ } } + (spinlock_t) { .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) \ + , .save_state = 1, .file = __FILE__, .line = __LINE__ __COMMA_LOCK_STAT_INITIALIZER} } #else # define __SPIN_LOCK_UNLOCKED(name) \ (spinlock_t) { { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) } } @@ -92,7 +93,7 @@ typedef struct { # ifdef CONFIG_DEBUG_RT_MUTEXES # define __RW_LOCK_UNLOCKED(name) (rwlock_t) \ { .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name), \ -.save_state = 1, .file = 
__FILE__, .line = __LINE__ } } +.save_state = 1, .file = __FILE__, .line = __LINE__ __COMMA_LOCK_STAT_INITIALIZER } } # else # define __RW_LOCK_UNLOCKED(name) (rwlock_t) \ { .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) } } @@ -139,14 +140,16 @@ struct semaphore name = \ */ #define DECLARE_MUTEX_LOCKED COMPAT_DECLARE_MUTEX_LOCKED -extern void fastcall __sema_init(struct semaphore *sem, int val, char *name, char *file, int line); +extern void fastcall __sema_init(struct semaphore *sem, int val, char *name + __COMMA_LOCK_STAT_FN_DECL, char *_file, int _line); #define rt_sema_init(sem, val) \ - __sema_init(sem, val, #sem, __FILE__, __LINE__) + __sema_init(sem, val, #sem __COMMA_LOCK_STAT_NOTE_FN, __FILE__, __LINE__) -extern void fastcall __init_MUTEX(struct semaphore *sem, char *name, char *file, int line); +extern void fastcall __init_MUTEX(struct semaphore *sem, char *name + __COMMA_LOCK_STAT_FN_DECL, char *_file, int _line); #define rt_init_MUTEX(sem) \ - __init_MUTEX(sem, #sem, __FILE__, __LINE__) + __init_MUTEX(sem, #sem __COMMA_LOCK_STAT_NOTE_FN, __FILE__, __LINE__) extern void there_is_no_init_MUTEX_LOCKED_for_RT_semaphores(void); @@ -247,13 +250,14 @@ extern void fastcall __rt_rwsem_init(str struct rw_semaphore lockname = __RWSEM_INITIALIZER(lockname) extern void fastcall __rt_rwsem_init(struct rw_semaphore *rwsem, char *name, -struct lock_class_key *key); +struct lock_class_key *key + __COMMA_LOCK_STAT_NOTE_PARAM_DECL); # define rt_init_rwsem(sem)\ do { \ static struct lock_class_key __key; \ \ - __rt_rwsem_init((sem), #sem, &__key); \ + __rt_rwsem_init((sem), #sem, &__key __COMMA_LOCK_STAT_NOTE); \ } while (0) extern void fastcall rt_down_write(struct rw_semaphore *rwsem); --- include/linux/rtmutex.h e6fa10297e6c20d27edba172aeb078a60c64488e +++ include/linux/rtmutex.h 55cd2de44a52e049fa8a0da63bde6449cefeb8fe @@
[PATCH 2/4] lock stat (rt/rtmutex.c mods) for 2.6.19-rt1
Mods to rt.c and rtmutex.c bill --- init/main.c 268ab0d5f5bdc422e2864cadf35a7bb95958de10 +++ init/main.c 9d14ac66cb0fe3b90334512c0659146aec5e241c @@ -608,6 +608,7 @@ asmlinkage void __init start_kernel(void #ifdef CONFIG_PROC_FS proc_root_init(); #endif + lock_stat_sys_init(); //--billh cpuset_init(); taskstats_init_early(); delayacct_init(); --- kernel/rt.c 5fc97ed10d5053f52488dddfefdb92e6aee2b148 +++ kernel/rt.c 3b86109e8e4163223f17c7d13a5bf53df0e04d70 @@ -66,6 +66,7 @@ #include #include #include +#include #include "rtmutex_common.h" @@ -75,6 +76,42 @@ # include "rtmutex.h" #endif +#ifdef CONFIG_LOCK_STAT +#define __LOCK_STAT_RT_MUTEX_LOCK(a) \ + rt_mutex_lock_with_ip(a,\ + (unsigned long) __builtin_return_address(0)) +#else +#define __LOCK_STAT_RT_MUTEX_LOCK(a) \ + rt_mutex_lock(a); +#endif + +#ifdef CONFIG_LOCK_STAT +#define __LOCK_STAT_RT_MUTEX_LOCK_INTERRUPTIBLE(a, b) \ + rt_mutex_lock_interruptible_with_ip(a, b, \ + (unsigned long) __builtin_return_address(0)) +#else +#define __LOCK_STAT_RT_MUTEX_LOCK_INTERRUPTIBLE(a) \ + rt_mutex_lock_interruptible(a, b); +#endif + +#ifdef CONFIG_LOCK_STAT +#define __LOCK_STAT_RT_MUTEX_TRYLOCK(a)\ + rt_mutex_trylock_with_ip(a, \ + (unsigned long) __builtin_return_address(0)) +#else +#define __LOCK_STAT_RT_MUTEX_TRYLOCK(a)\ + rt_mutex_trylock(a); +#endif + +#ifdef CONFIG_LOCK_STAT +#define __LOCK_STAT_RT_SPIN_LOCK(a)\ + __rt_spin_lock_with_ip(a, \ + (unsigned long) __builtin_return_address(0)) +#else +#define __LOCK_STAT_RT_SPIN_LOCK(a)\ + __rt_spin_lock(a); +#endif + #ifdef CONFIG_PREEMPT_RT /* * Unlock these on crash: @@ -88,7 +125,8 @@ void zap_rt_locks(void) /* * struct mutex functions */ -void _mutex_init(struct mutex *lock, char *name, struct lock_class_key *key) +void _mutex_init(struct mutex *lock, char *name, struct lock_class_key *key + __COMMA_LOCK_STAT_NOTE_PARAM_DECL) { #ifdef CONFIG_DEBUG_LOCK_ALLOC /* @@ -97,14 +135,15 @@ void _mutex_init(struct mutex *lock, cha debug_check_no_locks_freed((void *)lock, 
sizeof(*lock)); lockdep_init_map(&lock->dep_map, name, key, 0); #endif - __rt_mutex_init(&lock->lock, name); + __rt_mutex_init(&lock->lock, name __COMMA_LOCK_STAT_NOTE_VARS); } EXPORT_SYMBOL(_mutex_init); void __lockfunc _mutex_lock(struct mutex *lock) { mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); - rt_mutex_lock(&lock->lock); + + __LOCK_STAT_RT_MUTEX_LOCK(&lock->lock); } EXPORT_SYMBOL(_mutex_lock); @@ -124,14 +163,14 @@ void __lockfunc _mutex_lock_nested(struc void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass) { mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_); - rt_mutex_lock(&lock->lock); + __LOCK_STAT_RT_MUTEX_LOCK(&lock->lock); } EXPORT_SYMBOL(_mutex_lock_nested); #endif int __lockfunc _mutex_trylock(struct mutex *lock) { - int ret = rt_mutex_trylock(&lock->lock); + int ret = __LOCK_STAT_RT_MUTEX_TRYLOCK(&lock->lock); if (ret) mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_); @@ -152,7 +191,7 @@ int __lockfunc rt_write_trylock(rwlock_t */ int __lockfunc rt_write_trylock(rwlock_t *rwlock) { - int ret = rt_mutex_trylock(&rwlock->lock); + int ret = __LOCK_STAT_RT_MUTEX_TRYLOCK(&rwlock->lock); if (ret) rwlock_acquire(&rwlock->dep_map, 0, 1, _RET_IP_); @@ -179,7 +218,7 @@ int __lockfunc rt_read_trylock(rwlock_t } spin_unlock_irqrestore(&lock->wait_lock, flags); - ret = rt_mutex_trylock(lock); + ret = __LOCK_STAT_RT_MUTEX_TRYLOCK(lock); if (ret) rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_); @@ -190,7 +229,7 @@ void __lockfunc rt_write_lock(rwlock_t * void __lockfunc rt_write_lock(rwlock_t *rwlock) { rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_); - __rt_spin_lock(&rwlock->lock); + __LOCK_STAT_RT_SPIN_LOCK(&rwlock->lock); } EXPORT_SYMBOL(rt_write_lock); @@ -210,11 +249,44 @@ void __lockfunc rt_read_lock(rwlock_t *r return; } spin_unlock_irqrestore(&lock->wait_lock, flags); - __rt_spin_lock(lock); + __LOCK_STAT_RT_SPIN_LOCK(lock); } EXPORT_SYMBOL(rt_read_lock); +#ifdef CONFIG_LOCK_STAT +void __lockfunc rt_write_lock_with_ip(rwlock_t 
*rwlock, unsigned long ip) +{ + rwlock_acquire(&rwlock->dep_map, 0, 0, ip); +
[PATCH 3/4] lock stat (rt/rtmutex.c mods) for 2.6.19-rt1
Rudimentary annotations to the lock initializers to avoid the binary tree search before attachment. For things like inodes that are created and destroyed constantly this might be useful to get around some overhead. Sorry, about the patch numbering order. I think I screwed up on it. bill --- arch/xtensa/platform-iss/network.c eee47b0ca011d1c327ce7aff0c9a7547695d3a1f +++ arch/xtensa/platform-iss/network.c 76b16d29a46677a45d56b64983e0783959aa2160 @@ -648,6 +648,8 @@ static int iss_net_configure(int index, .have_mac = 0, }); + spin_lock_init(&lp->lock); + /* * Try all transport protocols. * Note: more protocols can be added by adding '&& !X_init(lp, eth)'. --- fs/dcache.c 20226054e6d6b080847e7a892d0b47a7ad042288 +++ fs/dcache.c 64d2b2b78b50dc2da7e409f2a9721b80c8fbbaf3 @@ -884,7 +884,7 @@ struct dentry *d_alloc(struct dentry * p atomic_set(&dentry->d_count, 1); dentry->d_flags = DCACHE_UNHASHED; - spin_lock_init(&dentry->d_lock); + spin_lock_init_annotated(&dentry->d_lock, &_lock_stat_d_alloc_entry); dentry->d_inode = NULL; dentry->d_parent = NULL; dentry->d_sb = NULL; --- fs/xfs/support/ktrace.c 1136cf72f9273718da47405b594caebaa59b66d3 +++ fs/xfs/support/ktrace.c 122729d6084fa84115b8f8f75cc55c585bfe3676 @@ -162,6 +162,7 @@ ktrace_enter( ASSERT(ktp != NULL); + spin_lock_init(&wrap_lock); //--billh /* * Grab an entry by pushing the index up to the next one. 
*/ --- include/linux/eventpoll.h bd142a622609d04952fac6215586fff353dab729 +++ include/linux/eventpoll.h 43271ded1a3b9f40beb37aaff9e02fadeecb4655 @@ -15,6 +15,7 @@ #define _LINUX_EVENTPOLL_H #include +#include /* Valid opcodes to issue to sys_epoll_ctl() */ @@ -55,7 +56,7 @@ static inline void eventpoll_init_file(s static inline void eventpoll_init_file(struct file *file) { INIT_LIST_HEAD(&file->f_ep_links); - spin_lock_init(&file->f_ep_lock); + spin_lock_init_annotated(&file->f_ep_lock, &_lock_stat_eventpoll_init_file_entry); } --- net/tipc/node.c d6ddb08c5332517b0eff3b72ee0adc48f47801ff +++ net/tipc/node.c 9712633ceb8f939fc14a0a4861f7121840beff1d @@ -77,7 +77,7 @@ struct node *tipc_node_create(u32 addr) memset(n_ptr, 0, sizeof(*n_ptr)); n_ptr->addr = addr; -spin_lock_init(&n_ptr->lock); + spin_lock_init(&n_ptr->lock); INIT_LIST_HEAD(&n_ptr->nsub); n_ptr->owner = c_ptr; tipc_cltr_attach_node(c_ptr, n_ptr);
Re: [PATCH 3/4] lock stat (rt/rtmutex.c mods) for 2.6.19-rt1
On Sun, Dec 03, 2006 at 06:00:09PM -0800, Bill Huey wrote: > Rudimentary annotations to the lock initializers to avoid the binary > tree search before attachment. For things like inodes that are created > and destroyed constantly this might be useful to get around some > overhead. > > Sorry, about the patch numbering order. I think I screwed up on it. I also screwed up on the title for the email contents. Sorry about that. bill
Re: [PATCH 0/4] lock stat for 2.6.19-rt1
On Mon, Dec 04, 2006 at 01:21:29PM +0100, bert hubert wrote: > On Sun, Dec 03, 2006 at 05:53:23PM -0800, Bill Huey wrote: > > > [8264, 996648, 0] {inode_init_once, fs/inode.c, 196} > > [8552, 996648, 0] {inode_init_once, fs/inode.c, 193} > > Impressive, Bill! > > How tightly is your work bound to -rt? Iow, any chance of separating the > two? Or should we even want to? Right now, it's solely dependent on -rt, but the basic mechanisms of how it works are pretty much the same as lockdep. Parts of it should be movable across to regular kernels. The only remaining part would be altering the lock structure (spinlock, mutex, etc...) to have a pointer that it can use to do the statistical tracking. If it's NULL then the backing object is dynamically allocated and handled differently, and is allocated when the lock is contended against. There's other uses for it as well. Think about RCU algorithms that need to spin-try to make sure the update of an element or the validation of its data is safe to do. If an object was created to detect those spins it'll track what is effectively contention as well, as it is represented in that algorithm. I've seen an RCU radix tree implementation do something like that. > > The first column is the number of the times that object was contented > > against. > > The second is the number of times this lock object was initialized. The > > third > > is the annotation scheme that directly attaches the lock object (spinlock, > > etc..) in line with the function initializer to avoid the binary tree > > lookup. > > I don't entirely get the third item, can you elaborate a bit? I hate the notion of a search, so I directly set the pointer to an object that's statically defined. It means that the lock is directly connected to what is supposed to be backing it without doing a runtime search. When that's done, all of the numbers from the second column get moved to the third. > Do you have a feeling of the runtime overhead? 
There's minimal runtime overhead I believe, since it's only doing an atomic increment of the stats during the slowpath before the thread is actually shoved into a wait queue. That happens seldom. bill
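A rough userspace sketch of the two paths described above: a lock whose stat pointer is NULL gets a backing object looked up (or created) the first time it is contended, a lock that was annotated up front skips that lookup, and either way the slowpath cost is one atomic increment. Everything here is hypothetical and simplified for illustration: the names are invented, a small linear table stands in for the tree lookup, and the lookup itself is not thread-safe.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical backing object, keyed by the lock's init site. */
struct lock_stat_entry {
	const char *file;
	int line;
	atomic_ulong ncontended;
};

#define MAX_STAT_ENTRIES 16
static struct lock_stat_entry stat_table[MAX_STAT_ENTRIES];
static size_t stat_nentries;

/* Linear search standing in for the tree lookup; creates on a miss. */
static struct lock_stat_entry *stat_lookup(const char *file, int line)
{
	for (size_t i = 0; i < stat_nentries; i++)
		if (stat_table[i].line == line &&
		    strcmp(stat_table[i].file, file) == 0)
			return &stat_table[i];
	if (stat_nentries == MAX_STAT_ENTRIES)
		return NULL;	/* table full: give up on tracking */
	stat_table[stat_nentries].file = file;
	stat_table[stat_nentries].line = line;
	atomic_init(&stat_table[stat_nentries].ncontended, 0);
	return &stat_table[stat_nentries++];
}

/* Hypothetical lock: just the bits needed for stat tracking. */
struct demo_lock {
	struct lock_stat_entry *stat;	/* NULL until attached */
	const char *file;		/* where the lock was initialized */
	int line;
};

/* Called from the slowpath only, right before the waiter would block. */
static void demo_note_contention(struct demo_lock *l)
{
	if (!l->stat)			/* not annotated: attach lazily */
		l->stat = stat_lookup(l->file, l->line);
	if (l->stat)
		atomic_fetch_add_explicit(&l->stat->ncontended, 1,
					  memory_order_relaxed);
}
```

Two locks initialized at the same site end up sharing one entry, which is how a single report row like {inode_init_once, fs/inode.c, 196} can aggregate a very large number of individual lock objects.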
Re: [PATCH 0/4] lock stat for 2.6.19-rt1
On Mon, Dec 04, 2006 at 09:08:56AM -0800, Bill Huey wrote: > On Mon, Dec 04, 2006 at 01:21:29PM +0100, bert hubert wrote: > > How tightly is your work bound to -rt? Iow, any chance of separating the > > two? Or should we even want to? > > There's other uses for it as well. Think about RCU algorithms that need > to spin-try to make sure the update of an element or the validation of > it's data is safe to do. If an object was created to detect those spins > it'll track what is effectively contention as well as it is represented > in that algorithm. I've seen an RCU radix tree implementation do something > like that. That was a horrible paragraph plus I'm bored at the moment. What I meant is that lockless algorithms occasionally have a spin-try associated with them as well, which might validate the data that's updated against the entire data structure for some kind of consistency or coherency, or possibly against an individual element. That retry or spin can be considered a contention as well, and this lock-stat patch can be made aware of it just by connecting the actual occurrence of the retry logic to a backing object. I need to be more conscious about proofreading what I write before sending it off. Was this clear? bill
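Here is what I mean, sketched in userspace C11 with a compare-and-swap loop standing in for the lockless update. The names (struct retry_stat, counted_add) are invented for the example and this is not code from the patch; the idea is only that each failed attempt is a retry, and a retry can be counted against a backing object the same way a blocked lock acquisition is.

```c
#include <stdatomic.h>

/* Hypothetical backing object for a lockless update site. */
struct retry_stat {
	atomic_ulong nretries;
};

/*
 * Lockless add: reload and retry until the compare-and-swap succeeds.
 * Every failed attempt is counted as a contention-like event. Note that
 * the weak form may also fail spuriously on some architectures, so the
 * count is an upper bound on real interference.
 */
static void counted_add(atomic_long *target, long delta, struct retry_stat *rs)
{
	long old = atomic_load_explicit(target, memory_order_relaxed);

	/* on failure, 'old' is refreshed with the current value */
	while (!atomic_compare_exchange_weak_explicit(target, &old, old + delta,
						      memory_order_acq_rel,
						      memory_order_relaxed)) {
		atomic_fetch_add_explicit(&rs->nretries, 1,
					  memory_order_relaxed);
	}
}
```

With several threads hammering the same target, nretries grows with the amount of interference, which is the same kind of signal the contention column gives for a real lock.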
[PATCH 3/5] lock stat kills lock meter for -rt (.h files)
Contains .h file changes. bill --- include/linux/mutex.h d231debc2848a8344e1b04055ef22e489702e648 +++ include/linux/mutex.h 734c89362a3d77d460eb20eec3107e7b76fed938 @@ -15,6 +15,7 @@ #include #include #include +#include #include @@ -35,7 +36,8 @@ extern void } extern void -_mutex_init(struct mutex *lock, char *name, struct lock_class_key *key); +_mutex_init(struct mutex *lock, char *name, struct lock_class_key *key + __COMMA_LOCK_STAT_NOTE_PARAM_DECL); extern void __lockfunc _mutex_lock(struct mutex *lock); extern int __lockfunc _mutex_lock_interruptible(struct mutex *lock); @@ -56,11 +58,15 @@ extern void __lockfunc _mutex_unlock(str # define mutex_lock_nested(l, s) _mutex_lock(l) #endif +#define __mutex_init(l,n) __rt_mutex_init(&(l)->mutex,\ + n \ + __COMMA_LOCK_STAT_NOTE) + # define mutex_init(mutex) \ do { \ static struct lock_class_key __key; \ \ - _mutex_init((mutex), #mutex, &__key); \ + _mutex_init((mutex), #mutex, &__key __COMMA_LOCK_STAT_NOTE);\ } while (0) #else --- include/linux/rt_lock.h d7515027865666075d3e285bcec8c36e9b6cfc47 +++ include/linux/rt_lock.h 297792307de5b4aef2c7e472e2a32c727e5de3f1 @@ -13,6 +13,7 @@ #include #include #include +#include #ifdef CONFIG_PREEMPT_RT /* @@ -28,8 +29,8 @@ typedef struct { #ifdef CONFIG_DEBUG_RT_MUTEXES # define __SPIN_LOCK_UNLOCKED(name) \ - (spinlock_t) { { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) \ - , .save_state = 1, .file = __FILE__, .line = __LINE__ } } + (spinlock_t) { .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) \ + , .save_state = 1, .file = __FILE__, .line = __LINE__ __COMMA_LOCK_STAT_INITIALIZER} } #else # define __SPIN_LOCK_UNLOCKED(name) \ (spinlock_t) { { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) } } @@ -92,7 +93,7 @@ typedef struct { # ifdef CONFIG_DEBUG_RT_MUTEXES # define __RW_LOCK_UNLOCKED(name) (rwlock_t) \ { .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name), \ -.save_state = 1, .file = __FILE__, .line = __LINE__ } } +.save_state = 1, .file = __FILE__, .line = __LINE__
__COMMA_LOCK_STAT_INITIALIZER } } # else # define __RW_LOCK_UNLOCKED(name) (rwlock_t) \ { .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) } } @@ -139,14 +140,16 @@ struct semaphore name = \ */ #define DECLARE_MUTEX_LOCKED COMPAT_DECLARE_MUTEX_LOCKED -extern void fastcall __sema_init(struct semaphore *sem, int val, char *name, char *file, int line); +extern void fastcall __sema_init(struct semaphore *sem, int val, char *name + __COMMA_LOCK_STAT_FN_DECL, char *_file, int _line); #define rt_sema_init(sem, val) \ - __sema_init(sem, val, #sem, __FILE__, __LINE__) + __sema_init(sem, val, #sem __COMMA_LOCK_STAT_NOTE_FN, __FILE__, __LINE__) -extern void fastcall __init_MUTEX(struct semaphore *sem, char *name, char *file, int line); +extern void fastcall __init_MUTEX(struct semaphore *sem, char *name + __COMMA_LOCK_STAT_FN_DECL, char *_file, int _line); #define rt_init_MUTEX(sem) \ - __init_MUTEX(sem, #sem, __FILE__, __LINE__) + __init_MUTEX(sem, #sem __COMMA_LOCK_STAT_NOTE_FN, __FILE__, __LINE__) extern void there_is_no_init_MUTEX_LOCKED_for_RT_semaphores(void); @@ -247,13 +250,14 @@ extern void fastcall __rt_rwsem_init(str struct rw_semaphore lockname = __RWSEM_INITIALIZER(lockname) extern void fastcall __rt_rwsem_init(struct rw_semaphore *rwsem, char *name, -struct lock_class_key *key); +struct lock_class_key *key + __COMMA_LOCK_STAT_NOTE_PARAM_DECL); # define rt_init_rwsem(sem)\ do { \ static struct lock_class_key __key; \ \ - __rt_rwsem_init((sem), #sem, &__key); \ + __rt_rwsem_init((sem), #sem, &__key __COMMA_LOCK_STAT_NOTE); \ } while (0) extern void fastcall rt_down_write(struct rw_semaphore *rwsem); --- include/linux/rtmutex.h e6fa10297e6c20d27edba172aeb078a60c64488e +++ include/linux/rtmutex.h 55cd2de44a52e049fa8a0da63bde6449cefeb8fe @@ -15,6 +15,7 @@ #include #include #include +#include /* * The rt_mutex structure @
[PATCH 4/5] lock stat kills lock meter for -rt (annotations)
Rough annotations to speed up the object attachment logic. bill --- arch/xtensa/platform-iss/network.c eee47b0ca011d1c327ce7aff0c9a7547695d3a1f +++ arch/xtensa/platform-iss/network.c 76b16d29a46677a45d56b64983e0783959aa2160 @@ -648,6 +648,8 @@ static int iss_net_configure(int index, .have_mac = 0, }); + spin_lock_init(&lp->lock); + /* * Try all transport protocols. * Note: more protocols can be added by adding '&& !X_init(lp, eth)'. --- fs/dcache.c 20226054e6d6b080847e7a892d0b47a7ad042288 +++ fs/dcache.c 64d2b2b78b50dc2da7e409f2a9721b80c8fbbaf3 @@ -884,7 +884,7 @@ struct dentry *d_alloc(struct dentry * p atomic_set(&dentry->d_count, 1); dentry->d_flags = DCACHE_UNHASHED; - spin_lock_init(&dentry->d_lock); + spin_lock_init_annotated(&dentry->d_lock, &_lock_stat_d_alloc_entry); dentry->d_inode = NULL; dentry->d_parent = NULL; dentry->d_sb = NULL; --- fs/xfs/support/ktrace.c 1136cf72f9273718da47405b594caebaa59b66d3 +++ fs/xfs/support/ktrace.c 122729d6084fa84115b8f8f75cc55c585bfe3676 @@ -162,6 +162,7 @@ ktrace_enter( ASSERT(ktp != NULL); + spin_lock_init(&wrap_lock); //--billh /* * Grab an entry by pushing the index up to the next one. 
*/ --- include/linux/eventpoll.h bd142a622609d04952fac6215586fff353dab729 +++ include/linux/eventpoll.h 43271ded1a3b9f40beb37aaff9e02fadeecb4655 @@ -15,6 +15,7 @@ #define _LINUX_EVENTPOLL_H #include +#include /* Valid opcodes to issue to sys_epoll_ctl() */ @@ -55,7 +56,7 @@ static inline void eventpoll_init_file(s static inline void eventpoll_init_file(struct file *file) { INIT_LIST_HEAD(&file->f_ep_links); - spin_lock_init(&file->f_ep_lock); + spin_lock_init_annotated(&file->f_ep_lock, &_lock_stat_eventpoll_init_file_entry); } --- include/linux/wait.h 12da8de69f1f2660443a04c3df199e5d851ea2ca +++ include/linux/wait.h 9b7448af82583bd11d18032aedfa8f2af44345f4 @@ -81,7 +81,7 @@ extern void init_waitqueue_head(wait_que extern void init_waitqueue_head(wait_queue_head_t *q); -#ifdef CONFIG_LOCKDEP +#if defined(CONFIG_LOCKDEP) || defined(CONFIG_LOCK_STAT) # define __WAIT_QUEUE_HEAD_INIT_ONSTACK(name) \ ({ init_waitqueue_head(&name); name; }) # define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \ --- init/main.c 636e95fd9af6357291dace2b9995fd72d36e945f +++ init/main.c 2e30dc30c4aca9b1ff56064887e04d8262db30e7 @@ -608,6 +608,7 @@ asmlinkage void __init start_kernel(void #ifdef CONFIG_PROC_FS proc_root_init(); #endif + lock_stat_sys_init(); //--billh cpuset_init(); taskstats_init_early(); delayacct_init(); --- net/tipc/node.c d6ddb08c5332517b0eff3b72ee0adc48f47801ff +++ net/tipc/node.c 9712633ceb8f939fc14a0a4861f7121840beff1d @@ -77,7 +77,7 @@ struct node *tipc_node_create(u32 addr) memset(n_ptr, 0, sizeof(*n_ptr)); n_ptr->addr = addr; -spin_lock_init(&n_ptr->lock); + spin_lock_init(&n_ptr->lock); INIT_LIST_HEAD(&n_ptr->nsub); n_ptr->owner = c_ptr; tipc_cltr_attach_node(c_ptr, n_ptr);
[PATCH 5/5] lock stat kills lock meter for -rt (makefile)
Build system changes. bill --- kernel/Kconfig.preempt 3148bd94270ea0a853d8e443616cd7a668dd0d3b +++ kernel/Kconfig.preempt d63831dbfbb9e68386bfc862fd2dd1a8f1e9779f @@ -176,3 +176,12 @@ config RCU_TRACE Say Y here if you want to enable RCU tracing Say N if you are unsure. + +config LOCK_STAT + bool "Lock contention statistics tracking in /proc" + depends on PREEMPT_RT && !DEBUG_RT_MUTEXES + default y + help + General lock statistics tracking with regard to contention in + /proc/lock_stat/contention + --- kernel/Makefile 0690fbe8c605a1c7e24b7b94d05a96ea32574aab +++ kernel/Makefile 08087775b67b7ac1682dac0310003ef7ecbd7e70 @@ -63,6 +63,7 @@ obj-$(CONFIG_TASKSTATS) += taskstats.o t obj-$(CONFIG_UTS_NS) += utsname.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o +obj-$(CONFIG_LOCK_STAT) += lock_stat.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <[EMAIL PROTECTED]>, the -fno-omit-frame-pointer is
[PATCH 1/5] lock stat kills lock meter for -rt (core)
Core infrastructure files with /proc interface --- include/linux/lock_stat.h 554e4c1a2bc399f8a4fe4a1634b29aae6f4bb4de +++ include/linux/lock_stat.h 554e4c1a2bc399f8a4fe4a1634b29aae6f4bb4de @@ -0,0 +1,147 @@ +/* + * By Bill Huey (hui) at <[EMAIL PROTECTED]> + * + * Release under the what ever the Linux kernel chooses for a + * license, GNU Public License GPL v2 + * + * Tue Sep 5 17:27:48 PDT 2006 + * Created lock_stat.h + * + * Wed Sep 6 15:36:14 PDT 2006 + * Thinking about the object lifetime of a spinlock. Please refer to + * comments in kernel/lock_stat.c instead. + * + */ + +#ifndef LOCK_STAT_H +#define LOCK_STAT_H + +#ifdef CONFIG_LOCK_STAT + +#include +#include +#include +#include + +#include + +typedef struct lock_stat { + char function[KSYM_NAME_LEN]; + int line; + char *file; + + atomic_t ncontended; + unsigned int ntracked; + atomic_t ninlined; + atomic_t nspinnable; + + struct rb_node rb_node; + struct list_head list_head; +} lock_stat_t; + +typedef lock_stat_t *lock_stat_ref_t; + +struct task_struct; + +#define LOCK_STAT_INIT(field) +#define LOCK_STAT_INITIALIZER(field) { \ + __FILE__, __FUNCTION__, __LINE__, \ + ATOMIC_INIT(0), LIST_HEAD_INIT(field)} + +#define LOCK_STAT_NOTE __FILE__, __FUNCTION__, __LINE__ +#define LOCK_STAT_NOTE_VARS _file, _function, _line +#define LOCK_STAT_NOTE_PARAM_DECL const char *_file, \ + const char *_function, \ + int _line + +#define __COMMA_LOCK_STAT_FN_DECL , const char *_function +#define __COMMA_LOCK_STAT_FN_VAR , _function +#define __COMMA_LOCK_STAT_NOTE_FN , __FUNCTION__ + +#define __COMMA_LOCK_STAT_NOTE , LOCK_STAT_NOTE +#define __COMMA_LOCK_STAT_NOTE_VARS , LOCK_STAT_NOTE_VARS +#define __COMMA_LOCK_STAT_NOTE_PARAM_DECL , LOCK_STAT_NOTE_PARAM_DECL + + +#define __COMMA_LOCK_STAT_NOTE_FLLN_DECL , const char *_file, int _line +#define __COMMA_LOCK_STAT_NOTE_FLLN , __FILE__, __LINE__ +#define __COMMA_LOCK_STAT_NOTE_FLLN_VARS , _file, _line + +#define __COMMA_LOCK_STAT_INITIALIZER , .lock_stat = NULL, + +#define
__COMMA_LOCK_STAT_IP_DECL , unsigned long _ip +#define __COMMA_LOCK_STAT_IP , _ip +#define __COMMA_LOCK_STAT_RET_IP , (unsigned long) __builtin_return_address(0) + +extern void lock_stat_init(struct lock_stat *ls); +extern void lock_stat_sys_init(void); + +#define lock_stat_is_initialized(o) ((unsigned long) (*o)->file) + +extern void lock_stat_note_contention(lock_stat_ref_t *ls, struct task_struct *owner, unsigned long ip); +extern void lock_stat_print(void); +extern void lock_stat_scoped_attach(lock_stat_ref_t *_s, LOCK_STAT_NOTE_PARAM_DECL); + +#define ksym_strcmp(a, b) strncmp(a, b, KSYM_NAME_LEN) +#define ksym_strcpy(a, b) strncpy(a, b, KSYM_NAME_LEN) +#define ksym_strlen(a) strnlen(a, KSYM_NAME_LEN) + +/* +static inline char * ksym_strdup(const char *a) +{ + char *s = (char *) kmalloc(ksym_strlen(a), GFP_KERNEL); + return strncpy(s, a, KSYM_NAME_LEN); +} +*/ + +#define LS_INIT(name, h) { \ + /*.function,*/ .file = h, .line = 1,\ + .ntracked = 0, .ncontended = ATOMIC_INIT(0),\ + .list_head = LIST_HEAD_INIT(name.list_head),\ + .rb_node.rb_left = NULL, .rb_node.rb_left = NULL \ + } + +#define DECLARE_LS_ENTRY(name) \ + extern struct lock_stat _lock_stat_##name##_entry + +/* char _##name##_string[] = #name;\ +*/ + +#define DEFINE_LS_ENTRY(name) \ + struct lock_stat _lock_stat_##name##_entry = LS_INIT(_lock_stat_##name##_entry, #name "_string") + +DECLARE_LS_ENTRY(d_alloc); +DECLARE_LS_ENTRY(eventpoll_init_file); +/* +DECLARE_LS_ENTRY(get_empty_filp); +DECLARE_LS_ENTRY(init_once_1); +DECLARE_LS_ENTRY(init_once_2); +DECLARE_LS_ENTRY(inode_init_once_1); +DECLARE_LS_ENTRY(inode_init_once_2); +DECLARE_LS_ENTRY(inode_init_once_3); +DECLARE_LS_ENTRY(inode_init_once_4); +DECLARE_LS_ENTRY(inode_init_once_5); +DECLARE_LS_ENTRY(inode_init_once_6); +DECLARE_LS_ENTRY(inode_init_once_7); +*/ + +#else /* CONFIG_LOCK_STAT */ + +#define __COMMA_LOCK_STAT_FN_DECL +#define __COMMA_LOCK_STAT_FN_VAR +#define __COMMA_LOCK_STAT_NOTE_FN + +#define __COMMA_LOCK_STAT_NOTE_FLLN_DECL 
+#define __COMMA_LOCK_STAT_NOTE_FLLN +#define __COMMA_LOCK_STAT_NOTE_FLLN_VARS + +#define __COMMA_LOCK_STAT_INITIALIZER + +#define __COMMA_LOCK_STAT_IP_DECL +#define __COMMA_LOCK_STAT_IP +#define __COMMA_LOCK_STAT_RET_IP + +#endif /* CONFIG_LOCK_STAT */ + +#endif /* LOCK_STAT_H */ + ---
[PATCH 2/5] lock stat kills lock meter for -rt
.c files that have been changed, rtmutex.c, rt.c --- kernel/rt.c 5fc97ed10d5053f52488dddfefdb92e6aee2b148 +++ kernel/rt.c 3b86109e8e4163223f17c7d13a5bf53df0e04d70 @@ -66,6 +66,7 @@ #include #include #include +#include #include "rtmutex_common.h" @@ -75,6 +76,42 @@ # include "rtmutex.h" #endif +#ifdef CONFIG_LOCK_STAT +#define __LOCK_STAT_RT_MUTEX_LOCK(a) \ + rt_mutex_lock_with_ip(a,\ + (unsigned long) __builtin_return_address(0)) +#else +#define __LOCK_STAT_RT_MUTEX_LOCK(a) \ + rt_mutex_lock(a); +#endif + +#ifdef CONFIG_LOCK_STAT +#define __LOCK_STAT_RT_MUTEX_LOCK_INTERRUPTIBLE(a, b) \ + rt_mutex_lock_interruptible_with_ip(a, b, \ + (unsigned long) __builtin_return_address(0)) +#else +#define __LOCK_STAT_RT_MUTEX_LOCK_INTERRUPTIBLE(a) \ + rt_mutex_lock_interruptible(a, b); +#endif + +#ifdef CONFIG_LOCK_STAT +#define __LOCK_STAT_RT_MUTEX_TRYLOCK(a)\ + rt_mutex_trylock_with_ip(a, \ + (unsigned long) __builtin_return_address(0)) +#else +#define __LOCK_STAT_RT_MUTEX_TRYLOCK(a)\ + rt_mutex_trylock(a); +#endif + +#ifdef CONFIG_LOCK_STAT +#define __LOCK_STAT_RT_SPIN_LOCK(a)\ + __rt_spin_lock_with_ip(a, \ + (unsigned long) __builtin_return_address(0)) +#else +#define __LOCK_STAT_RT_SPIN_LOCK(a)\ + __rt_spin_lock(a); +#endif + #ifdef CONFIG_PREEMPT_RT /* * Unlock these on crash: @@ -88,7 +125,8 @@ void zap_rt_locks(void) /* * struct mutex functions */ -void _mutex_init(struct mutex *lock, char *name, struct lock_class_key *key) +void _mutex_init(struct mutex *lock, char *name, struct lock_class_key *key + __COMMA_LOCK_STAT_NOTE_PARAM_DECL) { #ifdef CONFIG_DEBUG_LOCK_ALLOC /* @@ -97,14 +135,15 @@ void _mutex_init(struct mutex *lock, cha debug_check_no_locks_freed((void *)lock, sizeof(*lock)); lockdep_init_map(&lock->dep_map, name, key, 0); #endif - __rt_mutex_init(&lock->lock, name); + __rt_mutex_init(&lock->lock, name __COMMA_LOCK_STAT_NOTE_VARS); } EXPORT_SYMBOL(_mutex_init); void __lockfunc _mutex_lock(struct mutex *lock) { mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); - 
rt_mutex_lock(&lock->lock); + + __LOCK_STAT_RT_MUTEX_LOCK(&lock->lock); } EXPORT_SYMBOL(_mutex_lock); @@ -124,14 +163,14 @@ void __lockfunc _mutex_lock_nested(struc void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass) { mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_); - rt_mutex_lock(&lock->lock); + __LOCK_STAT_RT_MUTEX_LOCK(&lock->lock); } EXPORT_SYMBOL(_mutex_lock_nested); #endif int __lockfunc _mutex_trylock(struct mutex *lock) { - int ret = rt_mutex_trylock(&lock->lock); + int ret = __LOCK_STAT_RT_MUTEX_TRYLOCK(&lock->lock); if (ret) mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_); @@ -152,7 +191,7 @@ int __lockfunc rt_write_trylock(rwlock_t */ int __lockfunc rt_write_trylock(rwlock_t *rwlock) { - int ret = rt_mutex_trylock(&rwlock->lock); + int ret = __LOCK_STAT_RT_MUTEX_TRYLOCK(&rwlock->lock); if (ret) rwlock_acquire(&rwlock->dep_map, 0, 1, _RET_IP_); @@ -179,7 +218,7 @@ int __lockfunc rt_read_trylock(rwlock_t } spin_unlock_irqrestore(&lock->wait_lock, flags); - ret = rt_mutex_trylock(lock); + ret = __LOCK_STAT_RT_MUTEX_TRYLOCK(lock); if (ret) rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_); @@ -190,7 +229,7 @@ void __lockfunc rt_write_lock(rwlock_t * void __lockfunc rt_write_lock(rwlock_t *rwlock) { rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_); - __rt_spin_lock(&rwlock->lock); + __LOCK_STAT_RT_SPIN_LOCK(&rwlock->lock); } EXPORT_SYMBOL(rt_write_lock); @@ -210,11 +249,44 @@ void __lockfunc rt_read_lock(rwlock_t *r return; } spin_unlock_irqrestore(&lock->wait_lock, flags); - __rt_spin_lock(lock); + __LOCK_STAT_RT_SPIN_LOCK(lock); } EXPORT_SYMBOL(rt_read_lock); +#ifdef CONFIG_LOCK_STAT +void __lockfunc rt_write_lock_with_ip(rwlock_t *rwlock, unsigned long ip) +{ + rwlock_acquire(&rwlock->dep_map, 0, 0, ip); + __rt_spin_lock_with_ip(&rwlock->lock, ip); +} +EXPORT_SYMBOL(rt_write_lock_with_ip); + +void __lockfunc rt_read_lock_with_ip(rwlock_t *rwlock, unsigned long ip) +{ + unsigned long flags; + struct rt_mutex *lock = &rwlock->lock; 
+ + /* +* NOTE: we handle it as a write-lock: +*/ + rwlock_acquire(&rwlock->dep_map, 0, 0, ip); + /* +* Read loc
[PATCH 0/5] lock stat kills lock meter for -rt
Hello, I'm back with another annoying announcement and post of my "lock stat" patches for Ingo's 2.6.19-rt14 patch. I want review, comments and eventually inclusion into -rt. Changes in this release: - forward ported to 2.6.19-rt14 - rt_mutex_slowtrylock() path now works with lock stat after an initialization check. Apparently there's a try-lock somewhere before my lock stat stuff is initialized and it hard crashes the machine on boot. This is fixed now. - Adds a new field to track adaptive spins in the rtmutex as a future feature. bill
Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS
On Mon, May 28, 2007 at 10:09:19PM +0530, Srivatsa Vaddagiri wrote: > On Fri, May 25, 2007 at 10:14:58AM -0700, Li, Tong N wrote: > > is represented by a weight of 10. Inside the group, let's say the two > > tasks, P1 and P2, have weights 1 and 2. Then the system-wide weight for > > P1 is 10/3 and the weight for P2 is 20/3. In essence, this flattens > > weights into one level without changing the shares they represent. > > What do these task weights control? Timeslice primarily? If so, I am not > sure how well it can co-exist with cfs then (unless you are planning to > replace cfs with an equally good interactive/fair scheduler :) It's called SD, from Con Kolivas, who got it right the first time around :) > I would be very interested if this weight calculation can be used for > smpnice based load balancing purposes too .. bill
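The flattening arithmetic quoted above (group weight 10, task weights 1 and 2, giving 10/3 and 20/3) can be written out as a small sketch. The function name flattened_weight is hypothetical, and the formula is just the one implied by the quoted example, not code from any scheduler patch.

```c
#include <assert.h>

/*
 * System-wide weight of a task inside a group: the group's weight is
 * split among its tasks in proportion to their in-group weights.
 */
static double flattened_weight(double group_weight, double task_weight,
                               double sum_of_task_weights)
{
    assert(sum_of_task_weights > 0.0);
    return group_weight * task_weight / sum_of_task_weights;
}
```

With the numbers from the quoted message, flattened_weight(10, 1, 3) gives 10/3 for P1 and flattened_weight(10, 2, 3) gives 20/3 for P2, and the two still sum to the group's weight of 10.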
Re: [RFC] [Patch 4/4] lock contention tracking slimmed down
On Thu, Jun 07, 2007 at 02:17:45AM +0200, Martin Peschke wrote: > Ingo Molnar wrote: > >, quite some work went into it - NACK :-( > > Considering the amount of code.. ;-) I am sorry. > > But seriously, did you consider using some user space tool or script to > format this stuff the way you like it - similar to the way the powertop tool > reshuffles timer_stats data found in a proc file, for example? When I was doing my stuff, I intended for it to be parsed by a script or simple command line tools like sort/grep piped through less. I also thought it might be interesting to output the text in either a Python or Ruby collection syntax so that it can go through more extensive sorting using those languages. There are roughly 400 locks in a normal desktop kernel. The list is rather cumbersome anyway so, IMO, it really should be handled by parsing tools, etc... There could be more properties attached to each lock, especially if you intend to get this to work on -rt, which needs more things reported. bill
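As an illustration of the kind of post-processing meant here, the sketch below parses records in the "[n, n, n] {function, file, line}" shape shown earlier in this thread and sorts them by contention count. It assumes that shape is what the /proc file emits; the struct and function names (ls_record, ls_parse_line, ls_cmp_desc) are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* One parsed record; field names follow the column description in the thread. */
struct ls_record {
    unsigned long ncontended;   /* first column: times contended against */
    unsigned long ntracked;     /* second column: times initialized */
    unsigned long nannotated;   /* third column: annotated (directly attached) */
    char function[64];
    char file[128];
    int line;
};

/* Parse "[a, b, c] {function, file, line}"; returns 0 on success, -1 otherwise. */
static int ls_parse_line(const char *text, struct ls_record *out)
{
    assert(text != NULL && out != NULL);
    int n = sscanf(text, " [%lu, %lu, %lu] {%63[^,], %127[^,], %d}",
                   &out->ncontended, &out->ntracked, &out->nannotated,
                   out->function, out->file, &out->line);
    return n == 6 ? 0 : -1;
}

/* qsort comparator: most contended record first. */
static int ls_cmp_desc(const void *a, const void *b)
{
    const struct ls_record *x = a;
    const struct ls_record *y = b;
    return (x->ncontended < y->ncontended) - (x->ncontended > y->ncontended);
}
```

Feeding the parsed array to qsort(records, count, sizeof records[0], ls_cmp_desc) puts the hottest initialization sites at the top, which is the same job sort/grep would do on the raw text.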
Re: [RFC] [Patch 4/4] lock contention tracking slimmed down
On Thu, Jun 07, 2007 at 09:30:21AM +0200, Ingo Molnar wrote: > * Martin Peschke <[EMAIL PROTECTED]> wrote: > > Do mean I might submit this stuff for -rt? > > Firstly, submit cleanup patches that _do not change the output_. If you > have any output changes, do it as a separate patch, ontop of the cleanup > patch. Mixing material changes and cleanups into a single patch is a > basic patch submission mistake that will only earn you NACKs. Martin, First of all I agree with Ingo in that this needs to be separated from the rest of the cleanups. However, I don't understand why all of this is so heavyweight when the current measurements that Peter makes are completely sufficient for any reasonable purpose I can think of at the moment. What's this stuff with labels about ? It's important to get the points of contention so that the greater kernel group can fix these issues, and not to log statistics for the purpose of logging them. The original purpose should not be ignored when working on this stuff. By the way, what's the purpose of all of this stuff ? Like, what do you intend to do with it over the long haul ? bill
Re: [patch] CFS scheduler, -v12
On Sun, May 13, 2007 at 05:38:53PM +0200, Ingo Molnar wrote: > Even a simple 3D app like glxgears does a sys_sched_yield() for every > frame it generates (!) on certain 3D cards, which in essence punishes > any scheduler that implements sys_sched_yield() in a sane manner. This > interaction of CFS's yield implementation with this user-space bug could > be the main reason why some testers reported SD to be handling 3D games > better than CFS. (SD uses a yield implementation similar to the vanilla > scheduler.) > > So i've added a yield workaround to -v12, which makes it work similar to > how the vanilla scheduler and SD does it. (Xorg has been notified and > this bug should be fixed there too. This took some time to debug because > the 3D driver i'm using for testing does not use sys_sched_yield().) The > workaround is activated by default so -v12 should work 'out of the box'. This is an incorrect analysis. OpenGL has the ability to "yield" after every frame specifically for the SGI IRIX (React/Pro) frame scheduler (driven by the system vertical retrace interrupt) so that it can free up CPU resources for other tasks to run. The problem here is that the yield behavior is treated generally instead of specifically to a particular proportion scheduler policy. The correct solution is for the app to use a directed yield and a policy that can directly support it so that OpenGL can guarantee a frame rate governed by CPU bandwidth allocated by the scheduler. Will is working on such a mechanism now. bill
Re: [patch] CFS scheduler, -v12
On Thu, May 17, 2007 at 05:18:41PM -0700, Bill Huey wrote: > On Sun, May 13, 2007 at 05:38:53PM +0200, Ingo Molnar wrote: > > Even a simple 3D app like glxgears does a sys_sched_yield() for every > > frame it generates (!) on certain 3D cards, which in essence punishes > > any scheduler that implements sys_sched_yield() in a sane manner. This > > interaction of CFS's yield implementation with this user-space bug could > > be the main reason why some testers reported SD to be handling 3D games > > better than CFS. (SD uses a yield implementation similar to the vanilla > > scheduler.) > > > > So i've added a yield workaround to -v12, which makes it work similar to > > how the vanilla scheduler and SD does it. (Xorg has been notified and > > this bug should be fixed there too. This took some time to debug because > > the 3D driver i'm using for testing does not use sys_sched_yield().) The > > workaround is activated by default so -v12 should work 'out of the box'. > > This is an incorrect analysis. OpenGL has the ability to "yield" after > every frame specifically for SGI IRIX (React/Pro) frame scheduler (driven > by the system vertical retrace interrupt) so that it can free up CPU > resources for other tasks to run. The problem here is that the yield > behavior is treated generally instead of specifically to a particular > proportion scheduler policy. > > The correct solution is for the app to use a directed yield and a policy > that can directly support it so that OpenGL can guarantee a frame rate > governed by CPU bandwidth allocated by the scheduler. > > Will is working on such a mechanism now. Follow up: http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0650/bks/SGI_Developer/books/REACT_PG/sgi_html/ch04.html bill
Re: [PATCH] lockdep: lock contention tracking
On Sun, May 20, 2007 at 12:30:26PM +0200, Peter Zijlstra wrote: > The 4 points are the first 4 unique callsites that cause lock contention > for the specified lock class. > > writing a 0 to /proc/lockdep_contentions clears the stats We should talk about unifying it with my lockstat work for -rt so that we have a comprehensive solution for the "world". But you know that already :) Unifying the lock initializer hash key initialization functions is a key first step to that. Keep in mind, we can do more with this mechanism than just kernel locks and we should probably keep that open and not code ourselves into a corner. bill
Re: [PATCH] lockdep: lock contention tracking
On Mon, May 21, 2007 at 08:08:28AM +0200, Ingo Molnar wrote: > To me it appears Peter's stuff is already a pretty complete solution on > its own, and it's a whole lot simpler (and less duplicative) than your > lockstat patch. Could you list the specific items/features that you > think Peter's stuff doesnt have? First of all, this isn't an either/or kind of thing nor should it be thought of in that way. Precise file/function/line placement for one thing. My patch is specifically for -rt, which does checks that Peter's doesn't and is needed to characterize -rt better. My stuff is potentially more extensible since I have other ideas for it that really are outside of the lockdep logic currently. These can be unified, but not so that one overrides the intended features of the other. That's why I was hesitant to completely unify with lockdep in the manner you suggested. bill
Re: [PATCH] lockdep: lock contention tracking
On Mon, May 21, 2007 at 09:50:13AM +0200, Ingo Molnar wrote: > Have you looked at the output Peter's patch produces? It prints out > precise symbols: > > dcache_lock: 3000 0 [618] [] _atomic_dec_and_lock+0x39/0x58 > > which can easily be turned into line numbers using debuginfo packages or > using gdb. (But normally one only needs the symbol name, and we > certainly do not want to burden the kernel source with tracking > __FILE__/__LINE__ metadata, if the same is already available via > CONFIG_DEBUG_INFO.) > > anything else? If his hashing scheme can produce precise locations of where locks are initialized, either by an initializer function or as a statically allocated object, then my code is baroque and you should use Peter's code. I wrote lockstat without the knowledge that lockdep was replicating the same work and I audited 1600 something lock points in the kernel to convert the usage of C99 style initializers to something more regular. I also did this without consideration of things like debuginfo since I don't use those things. > > [...] My stuff is potentially more extensible since I have other ideas > > for it that really are outside of the lockdep logic currently. [...] > > what do you mean, specifically? Better if I show you the patches in the future instead of saying now. > i really need specifics. Currently i have the choice between your stuff: > >17 files changed, 1425 insertions(+), 80 deletions(-) > > and Peter's patch: > > 6 files changed, 266 insertions(+), 18 deletions(-) > > and Peter's patch (if it works out fine in testing - and it seemed fine > so far on my testbox), is smaller, more maintainable, better integrated > and thus the clear candidate for merging into -rt and merging upstream > as well. It's far cleaner than i hoped this whole lock-stats thing could > be done based on lockdep, so i'm pretty happy with Peter's current patch > already. If it meets your criteria and what you mentioned above is completely accurate, then use it instead of mine.
I'll just finish up what I have done with reader tracking in my lockstat and migrate my -rt specific goodies to his infrastructure. bill
Re: [PATCH] lockdep: lock contention tracking
On Mon, May 21, 2007 at 11:36:39AM +0200, Ingo Molnar wrote: > you got the history wrong i think: the first version of lockdep was > released to lkml a year ago (May 2006), while the first time you > mentioned your lock contention patch was November 2006 and you released > it to lkml in December 2006 - so it was _you_ who was "replicating the > same work", not lockdep :-) And this was pointed out to you very early > on, many months ago. Yeah, and where do we disagree here again ? So I take it you're disagreeing with my agreement with you that lockdep came first ? Geez, think about that one for a bit. (chuckle) :) I'd like to remind you that I mapped out the lock hierarchy for a fully preemptive -rt kernel while you and *others* were wanking around with voluntary preempt, remember ? :) Keep in mind, I'm single-mindedly obsessed with -rt. [back to the topic] > and regarding C99 style lock initializers: the -rt project has been > removing a whole heap of them in the past 2.5 years, since Oct 2004 or > so, and regularly cleansed the upstream kernel for old-style > initializers ever since then - so i'm not sure what you are referring > to. Don't worry about it. I did the same work only to realize that there wasn't much left to convert over. > btw., you dont even need CONFIG_DEBUG_INFO to get usable symbol names, > CONFIG_KALLSYMS alone will do it too. (It's only if you really cannot > tell from the lock symbol name and the function name what the entry is > about - which is very rare - that you need to look at any debug-info) I'm anal about these things. I thought that you could do more magic than that from your previous email, but it just confirms my understanding of how symbols work already, unless there was a meltdown of the universal physical laws here or something. That's why I made the choices I did. The inode initialization code is ambiguous, which is why having a specific line number was very useful. It showed that one of the locks protecting a tree was heavily hit.
There were multiple places in which it could have been if I hadn't had this information. Sleep time... bill
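The ambiguity described above can be shown with a small sketch: two locks initialized in the same function share a function name, so only the recorded line number tells them apart. The names init_site, INIT_SITE and sketch_inode_init_once are hypothetical stand-ins, not identifiers from the patch or from fs/inode.c.

```c
#include <assert.h>
#include <string.h>

/* Where a lock was initialized, captured at the call site. */
struct init_site {
    const char *function;
    const char *file;
    int line;
};

/* Expands at the point of use, so __func__ and __LINE__ name the caller. */
#define INIT_SITE() ((struct init_site){ __func__, __FILE__, __LINE__ })

/* Two sites are the same only if function, file and line all match. */
static int same_site(const struct init_site *a, const struct init_site *b)
{
    return a->line == b->line &&
           strcmp(a->file, b->file) == 0 &&
           strcmp(a->function, b->function) == 0;
}

/* One function that initializes two different locks, as an inode setup might. */
static void sketch_inode_init_once(struct init_site *tree_lock_site,
                                   struct init_site *list_lock_site)
{
    assert(tree_lock_site != NULL && list_lock_site != NULL);
    *tree_lock_site = INIT_SITE();
    *list_lock_site = INIT_SITE();
}
```

Keyed on the function name alone, both locks would collapse into one entry; with the line number included they stay separate, which is how the heavily hit tree lock could be singled out.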
Re: [PATCH] lockdep: lock contention tracking
On Mon, May 21, 2007 at 10:55:47AM +0100, Christoph Hellwig wrote: > On Mon, May 21, 2007 at 11:36:39AM +0200, Ingo Molnar wrote: > > you got the history wrong i think: the first version of lockdep was > > released to lkml a year ago (May 2006), while the first time you > > mentioned your lock contention patch was November 2006 and you released > > it to lkml in December 2006 - so it was _you_ who was "replicating the > > same work", not lockdep :-) And this was pointed out to you very early > > on, many months ago. > > And lockmeter, the very first patch of this sort is from the 90s, but > got mostly ignored here on lkml, of course :) Unfortunately, it's not nearly as cool as my patch by default because I wrote it. :) bill
Re: [PATCH] lockdep: lock contention tracking
On Mon, May 21, 2007 at 03:19:46AM -0700, Bill Huey wrote:
> On Mon, May 21, 2007 at 11:36:39AM +0200, Ingo Molnar wrote:
> > you got the history wrong i think: the first version of lockdep was
> > released to lkml a year ago (May 2006), while the first time you
> > mentioned your lock contention patch was November 2006 and you released
> > it to lkml in December 2006 - so it was _you_ who was "replicating the
> > same work", not lockdep :-) And this was pointed out to you very early
> > on, many months ago.
>
> Yeah, and where do we disagree here again ? So I take it you're disagreeing
> with my agreement with you that lockdep came first ? Geez, think about that
> one for a bit. (chuckle) :)

Yeah, sorry about the wording reversal. It was unintentional. I tend to drop minor words like "not" and stuff which can create confusion.

bill
Re: [PATCH] lockdep: lock contention tracking
On Mon, May 21, 2007 at 02:46:55PM +0200, Ingo Molnar wrote:
> which combines into this statement of yours: "I audited 1600 something
> lock points in the kernel to convert the usage of C99 style initializers
> something more regular, only to find out that there wasn't much left to
> convert over?", correct? Which begs the question: why did you mention
> this then at all? I usually reply to points made by others in the
> assumption that there's some meaning behind them ;-)

It was about how much time I wasted replicating work that I didn't know about. Some folks have different communication styles and it isn't meant to be anything further than that.

Peter and I are talking about possibly merging parts of each other's patch. I've been working on splitting up the reader/writer paths so that each type of contention is logged as well, for tracking reader contention against that existing rwsem problem and the like, but my stuff doesn't boot yet (working part time drags things out).

I/We'll keep you updated.

bill
Re: [PATCH] lockdep: lock contention tracking
On Mon, May 21, 2007 at 12:58:03PM +0200, Ingo Molnar wrote:
> and nobody pushed strong enough to get it included. But ... Peter's
> patch could perhaps be extended to cover similar stats as lockmeter,
> ontop of the existing lockdep instrumentation. Peter, can you see any
> particular roadblocks with that?

Definitely. Lockmeter isn't terribly Linux-ish from my examination of that patch a while back. Doing it against lockdep is definitely the right thing to do in that it unifies lock handling through initializer keys, which lockmeter doesn't do, from my memory.

The spin time tracking can be put into the slow path of the spin, like what Peter has now, so that it has minimal impact on the uncontended case. Updating the times would then be a trivial pointer dereference plus add, and hopefully won't have instrumentation side effects on the rest of the locking behavior in the system.

bill
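[Editorial sketch] The idea of timing only the slow path can be illustrated roughly as follows. This is a minimal userspace sketch in C11, not code from either patch discussed in this thread: `struct timed_lock`, `timed_lock_acquire()` and `timed_lock_release()` are hypothetical names, and the counters are assumed to be protected by the lock itself, so they are only updated after the lock has been taken.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical spinlock with contention statistics (illustration only). */
struct timed_lock {
    atomic_flag held;        /* set while the lock is owned */
    uint64_t contended;      /* number of slow-path acquisitions */
    uint64_t wait_ns_total;  /* total nanoseconds spent spinning */
};

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

void timed_lock_acquire(struct timed_lock *l)
{
    uint64_t start;

    /* Fast path: an uncontended acquire never reads the clock. */
    if (!atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        return;

    /* Slow path: the lock was busy, so time how long we spin. */
    start = now_ns();
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;

    /* We own the lock now; the update is a dereference plus an add. */
    l->contended++;
    l->wait_ns_total += now_ns() - start;
}

void timed_lock_release(struct timed_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A lock would be initialized as `struct timed_lock l = { ATOMIC_FLAG_INIT, 0, 0 };`. The design choice this shows is that an uncontended acquire pays only for the test-and-set, so the instrumentation cost lands on the path that is already slow.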
Re: Sched - graphic smoothness under load - cfs-v13 sd-0.48
On Wed, May 23, 2007 at 09:58:35AM +0200, Xavier Bestel wrote:
> On Wed, 2007-05-23 at 07:23 +0200, Michael Gerdau wrote:
> > For me the huge difference you have for sd to the others increases the
> > likelyhood the glxgears benchmark does not measure scheduling of graphic
> > but something else.
>
> I think some people forget that X11 has its own scheduler for graphics
> operations.

OpenGL is generally orthogonal to X11, or at least should be. But this could vary with the implementation depending on how brain damaged the system is. I'd expect the performance characteristics to be different depending on what subsystem is being used.

bill
Re: [PATCH 0/7] lock contention tracking -v2
On Wed, May 23, 2007 at 12:33:11PM +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[EMAIL PROTECTED]> wrote:
...
> > It also measures lock wait-time and hold-time in nanoseconds. The
> > minimum and maximum times are tracked, as well as a total (which
> > together with the number of event can give the avg).
> >
> > All statistics are done per lock class, per write (exclusive state)
> > and per read (shared state).
> >
> > The statistics are collected per-cpu, so that the collection overhead
> > is minimized via having no global cachemisses.
...
> really nice changes! The wait-time and hold-time changes should make it
> as capable as lockmeter and more: lockmeter only measured spinlocks,
> while your approach covers all lock types (spinlocks, rwlocks and
> mutexes).
>
> The performance enhancements in -v2 should make it much more scalable
> than your first version was. (in fact i think it should be completely
> scalable as the statistics counters are all per-cpu, so there should be
> no cacheline bouncing at all from this)

Per-cpu is pretty important since you can potentially hit that logic more often with your wait-time code. You don't want to affect the actual measurement with the measurement code. It's that uncertainty principle thing.

It is looking pretty good. :) You might like to pretty up the output even more, but it's pretty usable as is.

bill
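[Editorial sketch] The per-cpu min/max/total bookkeeping quoted above can be sketched roughly as below. This is an illustration under assumptions, not the patch's code: `NR_CPUS` is fixed at 4, the 64-byte alignment is an assumed cache-line size, and `struct lock_time`, `lock_time_record()`, `lock_time_sum()` and `lock_time_avg()` are hypothetical names. Each CPU writes only its own slot, and a reader folds the slots together when reporting.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4  /* hypothetical CPU count for the sketch */

/* One slot per CPU; aligned so that slots do not share a cache line. */
struct lock_time {
    _Alignas(64) uint64_t min;  /* shortest sample, valid when nr > 0 */
    uint64_t max;               /* longest sample */
    uint64_t total;             /* sum of all samples */
    uint64_t nr;                /* number of samples */
};

struct lock_class_stats {
    struct lock_time wait[NR_CPUS];
};

/* Record one wait time (in ns) in the slot of the CPU that measured it. */
void lock_time_record(struct lock_class_stats *s, unsigned int cpu, uint64_t ns)
{
    struct lock_time *lt;

    assert(cpu < NR_CPUS);
    lt = &s->wait[cpu];
    if (lt->nr == 0 || ns < lt->min)
        lt->min = ns;
    if (ns > lt->max)
        lt->max = ns;
    lt->total += ns;
    lt->nr++;
}

/* Fold all per-CPU slots into a single summary for reporting. */
struct lock_time lock_time_sum(const struct lock_class_stats *s)
{
    struct lock_time sum = { 0, 0, 0, 0 };
    unsigned int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        const struct lock_time *lt = &s->wait[cpu];

        if (lt->nr == 0)
            continue;  /* this CPU never recorded a sample */
        if (sum.nr == 0 || lt->min < sum.min)
            sum.min = lt->min;
        if (lt->max > sum.max)
            sum.max = lt->max;
        sum.total += lt->total;
        sum.nr += lt->nr;
    }
    return sum;
}

/* The average is derived from total and count rather than stored. */
uint64_t lock_time_avg(const struct lock_time *lt)
{
    return lt->nr ? lt->total / lt->nr : 0;
}
```

Because only the reporting side touches every slot, the recording path stays local to one CPU, which is the property the "no cacheline bouncing" remark is about.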
Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
On Thu, Mar 08, 2007 at 10:31:48PM -0800, Linus Torvalds wrote:
> On Thu, 8 Mar 2007, Bill Davidsen wrote:
> > Please, could you now rethink plugable scheduler as well? Even if one had to
> > be chosen at boot time and couldn't be change thereafter, it would still
> > allow
> > a few new thoughts to be included.
>
> No. Really.
>
> I absolutely *detest* pluggable schedulers. They have a huge downside:
> they allow people to think that it's ok to make special-case schedulers.
> And I simply very fundamentally disagree.

Linus,

This is where I have to respectfully disagree. There are types of loads that aren't covered in SCHED_OTHER. They are typically certain real time loads and those folks (regardless of -rt patch) would benefit greatly from having something like that in place. Those scheduler developers can plug in (at compile time) their work without having to track and forward port their code constantly so that non-SCHED_OTHER policies can be experimented with easily.

This is especially so with rate monotonic influenced schedulers that are in the works by real time folks, stock kernel or not. This is about making Linux generally accessible to those folks and not folks doing SCHED_OTHER work. They are orthogonal.

bill
Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2
On Tue, Mar 13, 2007 at 08:41:05PM +1100, Con Kolivas wrote:
> On Tuesday 13 March 2007 20:29, Ingo Molnar wrote:
> > So the question is: if all tasks are on the same nice level, how does,
> > in Mike's test scenario, RSDL behave relative to the current
> > interactivity code?
...
> The only way to get the same behaviour on RSDL without hacking an
> interactivity estimator, priority boost cpu misproportionator onto it is to
> either -nice X or +nice lame.

Hello Ingo,

After talking to Con over IRC (and if I can summarize it), he's wondering if properly nicing those tasks, as previously mentioned in user emails, would solve this potential user-reported regression, or whether something additional is needed. It seems like folks are happy with the results once the nice tweaking is done. This is a huge behavior change to the scheduler, after all (just thinking out loud).

bill
Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2
On Tue, Mar 13, 2007 at 12:58:01PM -0700, David Schwartz wrote:
> > But saying that the user needs to explicitly hold the schedulers hand
> > and nice everything to tell it how to schedule seems to be an abdication
> > of duty, an admission of failure. We can't expect users to finesse all
> > their processes with nice, and it would be a bad user interface to ask
> > them to do so.
>
> Then you will always get cases where the scheduler does not do what the user
> wants because the scheduler does not *know* what the user wants. You always
> have to tell a computer what you want it to do, and the best it can do is
> faithfully follow your request.
>
> I think it's completely irrational to ask for a scheduler that automatically
> gives more CPU time to CPU hogs.

SGI machines had an interactive term in their scheduler as well as a traditional nice priority. It might be useful for Con to consider this as an extension for problematic (badly hacked) processes like X. Nice as a control mechanism is rather coarse, yet overly strict because of the sophistication of his scheduler. Having an additional term (control knob) would be nice for a scheduler that (correct me if I'm wrong Con):

1) has rudimentary bandwidth control for a group of runnable processes
2) has a basic deadline mechanism

The "nice" term is only an indirect way of controlling his scheduler, and I think this kind of imprecise tweaking being done with various apps is an indicator of how lacking it is as a control term in the scheduler. It would be good to have some kind of coherent and direct control over the knobs that are (1) and (2).

Schedulers like this have superior control over these properties and they should be fully exploited with terms in addition to "nice".

Item (1) is subject to a static "weight" multiplication in relation to other runnable tasks. It also might be useful to make a part of that term a bit dynamic to get some kind of interactivity control back. It's a matter of testing, tweaking, etc... and is not easy for apps that don't have a direct thread context to control, like a thread-unaware X system.

> > And if someone/distro *does* go to all the effort of managing how to get
> > all the processes at the right nice levels, you have this big legacy
> > problem where you're now stuck keeping all those nice values meaningful
> > as you continue to develop the scheduler. Its bad enough to make them
> > do the work in the first place, but its worse if they need to make it a
> > kernel version dependent function.
>
> I agree. I'm not claiming to have the perfect solution. Let's not let the
> perfect be the enemy of the good though.

I hope this was useful.

bill
Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2
On Tue, Mar 13, 2007 at 01:10:40PM -0700, Jeremy Fitzhardinge wrote:
> David Schwartz wrote:
> Hm, well. The general preference has been for the kernel to do a
> good-enough job on getting the common cases right without tuning, and
> then only add knobs for the really tricky cases it can't do well. But
> the impression I'm getting here is that you often get sucky behaviours
> without tuning.

Well, you get strict behaviors as expected for this scheduler.

> > I think it's completely irrational to ask for a scheduler that automatically
> > gives more CPU time to CPU hogs.
> >
>
> Well, it doesn't have to. It could give good low latency with short
> timeslices to things which appear to be interactive. If the interactive
> program doesn't make good use of its low latency, then it will suck.
> But that's largely independent of how much overall CPU you give it.

This is way beyond what SCHED_OTHER should do. It can't predict the universe. Much of the interactivity estimator borders on magic. It just happens to also "be a good fit" for hacky apps as well, almost by accident.

> > I agree. I'm not claiming to have the perfect solution. Let's not let the
> > perfect be the enemy of the good though.
>
> For all its faults, the current scheduler mostly does a good job without
> much tuning - I normally only use "nice" to run cpu-bound things without
> jacking the cpu speed up. Certainly in my normal interactive use of
> compiz vs make -j4 on a dual-core generally gets pretty pretty good
> results. I plan on testing the new scheduler soon though.

We can do MUCH better in the long run with something like Con's scheduler. His approach shouldn't be dismissed because it's running into a relatively few minor snags that are largely the fault of scheduling opaque applications. It's precise enough that it can also be loosened up a bit with additional control terms (previous email).

It might be good to think about that a bit to see if a scheme like this can be made more adaptable for the environment it serves. You'd then have both precisely bounded control over CPU usage and enough flexibility for the bursty needs of certain apps.

bill
Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
...
>The CFS patch uses a completely different approach and implementation
>from RSDL/SD. My goal was to make CFS's interactivity quality exceed
>that of RSDL/SD, which is a high standard to meet :-) Testing
>feedback is welcome to decide this one way or another. [ and, in any
>case, all of SD's logic could be added via a kernel/sched_sd.c module
>as well, if Con is interested in such an approach. ]

Ingo,

Con has been asking for module support for years, if I understand your patch correctly. You'll also need this for -rt with regards to bandwidth scheduling. Good to see that you're moving in this direction.

bill
Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
On Fri, Apr 13, 2007 at 02:21:10PM -0700, William Lee Irwin III wrote:
> On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote:
> > Yeah. Note that there are some subtle but crutial differences between
> > PlugSched (which Con used, and which i opposed in the past) and this
> > approach.
> > PlugSched cuts the interfaces at a high level in a monolithic way and
> > introduces kernel/scheduler.c that uses one pluggable scheduler
> > (represented via the 'scheduler' global template) at a time.
>
> What I originally did did so for a good reason, which was that it was
> intended to support far more radical reorganizations, for instance,
> things that changed the per-cpu runqueue affairs for gang scheduling.
> I wrote a top-level driver that did support scheduling classes in a
> similar fashion, though it didn't survive others maintaining the patches.

Also, gang scheduling is needed to solve virtualization issues regarding spinlocks in a guest image. You could potentially be spinning on a lock whose holder thread isn't currently running which, needless to say, is very bad.

bill
Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
On Sat, Apr 14, 2007 at 01:18:09AM +0200, Ingo Molnar wrote:
> very much so! Both Con and Mike has contributed regularly to upstream
> sched.c:

The problem here is that Con can get demotivated (and rather upset) when an idea gets proposed, like PlugSched, only to have people be hostile to it and then suddenly turn around and adopt this idea. It gives the impression that you, in this specific case, were more interested in controlling a situation and the track of development instead of actually being inclusive of the development process with discussion and serious consideration, etc...

This is how the Linux community can be perceived as elitist. The old guard would serve the community better if people were more mindful and sensitive to developer issues. There was a particular speech that I was turned off by at OLS 2006 that pretty much pandered to the "old guard's" needs over newer developers. Since I'm a somewhat established engineer in -rt (being the only other person that mapped the lock hierarchy out for full preemptibility), I had the confidence to pretty much ignore it, while previously this could have really upset me and been highly discouraging to a relatively new developer.

As Linux gets larger and larger this is going to be an increasing problem when folks come into the community with new ideas, and the community will need to change if it intends to integrate these folks. IMO, a lot of these flame wars wouldn't need to exist if folks listened to each other better and permitted co-ownership of code like the scheduler, since it needs multiple hands in it to adapt to new loads and situations, etc...

I'm saying this nicely now since I can be nasty about it.

bill
Re: ZFS with Linux: An Open Plea
On Sat, Apr 14, 2007 at 10:04:23AM -0400, Mike Snitzer wrote:
> ZFS does have some powerful features but much of it depends on their
> broken layering of volume management. Embedding the equivalent of LVM
> into a filesystem _feels_ quite wrong.

They have a clustering concept in their volume management that isn't expressible with something like LVM. That justifies their approach from what I can see.

> That aside, the native snapshot capabilities of ZFS really stand out
> for me. The redirect on write semantics aren't exclusive to ZFS;
> NetApp's WAFL employs the same. But with both ZFS and WAFL they were
> designed to do snapshots extremely well from the ground up.

Write allocation for these kinds of systems (especially when concerned with mirroring) is non-trivial.

> Unfortunately in order for Linux to incorporate such a feature I'd
> imagine a new filesystem would need to be developed with redirect on
> write at its core. Can't really see ext4 or any other existing Linux
> filesystem grafting such a feature into it. But even though I can't
> see it; do others?

You also can't use the standard page cache to buffer all of the sophisticated semantics of these systems and have to create your own.

> I've learned that Sun and NetApp's lawyers had it out over the
> redirect on write capability of ZFS. When the dust settled Sun had
> enough patent protection to motivate a truce with NetApp.

I think they are still talking and it's far from over, the last I heard. The creation of a new inode and descending indirect blocks is a fundamental concept behind WAFL.

Also, ZFS tends to be a heavyweight as far as metadata goes, and quite possibly unnecessarily so, which is likely to affect performance for things related to keeping a relevant block allocation map in memory. ZFS is a complete pig compared to traditional file systems.

> The interesting side-effect is now ZFS is "open" and with that comes
> redirect on write in a file system other than WAFL. But ZFS's CDDL
> conflicts with the GPL so I'm not too sure how Linux could hit the
> ground running in this potentially patent mired area of filesystem
> development. The validity of NetApp having patented redirect on write
> aside; does the conflict between CDDL and GPL _really_ matter? Or did
> the CDDL release of ZFS somehow undermine NetApp's WAFL patent?

That doesn't really matter. FUSE could be extended to handle this kind of stuff and still have it be in userspace. The BSDs get around including Stephen Tweedy's (sp?) ext2 header file by making the user manually compile it. That's not a problem for Linux folks that can download a patch and compile a kernel.

FreeBSD already has a port of ZFS. Just for a kick, Google for that as a possible basis for a Linux kernel port.

bill
Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
On Sun, Apr 15, 2007 at 01:27:13PM +1000, Con Kolivas wrote:
...
> Now that you're agreeing my direction was correct you've done the usual Linux
> kernel thing - ignore all my previous code and write your own version. Oh
> well, that I've come to expect; at least you get a copyright notice in the
> bootup and somewhere in the comments give me credit for proving it's
> possible. Let's give some other credit here too. William Lee Irwin provided
> the major architecture behind plugsched at my request and I simply finished
> the work and got it working. He is also responsible for many IRC discussions
> I've had about cpu scheduling fairness, designs, programming history and code
> help. Even though he did not contribute code directly to SD, his comments
> have been invaluable.

Hello folks,

I think the main failure I see here is that Con wasn't included in this design or privately in the review process. There could have been better co-ownership of the code. This could also have been done openly on lkml (since this is kind of what this medium is about to a significant degree) so that consensus can happen (Con can be reasoned with). It would have achieved the same thing but probably more smoothly if folks just listened, considered an idea and then, in this case, created something that would allow for experimentation from outsiders in a fluid fashion.

If these issues aren't fixed, you're going to be stuck with the same kind of creeping elitism that has gradually killed the FreeBSD project and other BSDs. I can't comment on the code implementation. I'm focused on other things now that I'm at NetApp and I can't help out as much as I could. Being former BSDi, I had a first-hand account of these issues as they played out.

A development process like this is likely to exclude smart people from wanting to contribute to Linux, and folks should be conscious about these issues. It's basically a lot of code and concept that at least two individuals have worked on (wli and con), only to have it be rejected and then suddenly replaced by code from a community gatekeeper. In this case, this results in both Con and Bill Irwin being woefully underutilized.

If I were one of these people, I'd be mighty pissed.

bill
Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote:
> [...]
>
> Demystify what? The casual observer need only read either your attempt

Here's the problem. You're a casual observer and obviously not paying attention.

> at writing a scheduler, or my attempts at fixing the one we have, to see
> that it was high time for someone with the necessary skills to step in.
> Now progress can happen, which was _not_ happening before.

I think that's inaccurate, and there are plenty of folks that have that technical skill and background. The scheduler code isn't a deep mystery and there are plenty of good kernel hackers out here across many communities. Ingo isn't the only person on this planet to have deep scheduler knowledge. Priority heaps are not new and Solaris has had a pluggable scheduler framework for years.

Con's characterization of how Linux kernel development works is something that I'm more prone to believe than your view. I think it's a great shame to have folks like Bill Irwin and Con waste time trying to do something right, only to have their ideas attacked, then copied and held up as the solution for this kind of technical problem, in a complete reversal of technical opinion as it suits the moment. This is just wrong in so many ways.

It outlines the problems with Linux kernel development and questionable elitism regarding ownership of certain sections of the kernel code.

I call it "churn squat", and instances like this only support that view, which I would rather be completely wrong and inaccurate instead.

bill
Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
On Sun, Apr 15, 2007 at 10:44:47AM +0200, Ingo Molnar wrote:
> I prefer such early releases to lkml _alot_ more than any private review
> process. I released the CFS code about 6 hours after i thought "okay,
> this looks pretty good" and i spent those final 6 hours on testing it
> (making sure it doesnt blow up on your box, etc.), in the final 2 hours
> i showed it to two folks i could reach on IRC (Arjan and Thomas) and on
> various finishing touches. It doesnt get much faster than that and i
> definitely didnt want to sit on it even one day longer because i very
> much thought that Con and others should definitely see this work!
>
> And i very much credited (and still credit) Con for the whole fairness
> angle:
>
> || i'd like to give credit to Con Kolivas for the general approach here:
> || he has proven via RSDL/SD that 'fair scheduling' is possible and that
> || it results in better desktop scheduling. Kudos Con!
>
> the 'design consultation' phase you are talking about is _NOW_! :)
>
> I got the v1 code out to Con, to Mike and to many others ASAP. That's
> how you are able to comment on this thread and be part of the
> development process to begin with, in a 'private consultation' setup
> you'd not have had any opportunity to see _any_ of this.
>
> In the BSD space there seem to be more 'political' mechanisms for
> development, but Linux is truly about doing things out in the open, and
> doing it immediately.

I can't even begin to talk about how screwed up BSD development is. Maybe another time privately.

Ok, Linux development and inclusiveness can be improved. I'm not trying to "call you out" (slang for accusing you with the sole intention to call you crazy in a highly confrontational manner). This is discussed publicly here to bring this issue to light and open a communication channel as a means to resolve it.

> Okay? ;-)

It's cool. We're still getting to know each other professionally and it's okay to a certain degree to have a communication disconnect, but only as long as it clears. Your productivity is amazing BTW. But here's the problem: there's this perception that NIH is the default mentality here in Linux.

Con feels that this kind of action is intentional and has a malicious quality to it as a means of "churn squatting" sections of the kernel tree. The perception here is that there is this expectation that sections of the Linux kernel are intentionally "churn squatted" to prevent any ideas from creeping in other than those of the owner of that subsystem (VM, scheduling, etc...) because of lack of modularity in the kernel. This isn't an API question but a question possibly general code quality and how maintenance () of it can .

This was predicted by folks and then this perception was *realized* when you wrote the equivalent kind of code that has technical overlap with SDL (this is just one dry example). To a person that is writing new code for Linux, having one of the old guard write code equivalent to that of a newcomer has the effect of displacing that person, both with regards to code and the responsibility that goes with it. When this happens over and over again and folks get annoyed by it, it starts to seem that Linux development is elitist.

I know this because I heard (read) Con's IRC chats about these matters all of the time. This is not just his view but a view of other kernel folks that differing views as to. The closing talk at OLS 2006 was highly disturbing in many ways. It went "Christoph" is right, everybody else is wrong, which sends a highly negative message to new kernel developers that, say, don't work for RH directly or any of the other mainstream Linux companies. After a while, it starts seeming like this kind of behavior is completely intentional and that Linux is full of arrogant bastards.

What I would have done here was to contact Peter Williams, Bill Irwin and Con about what you're doing and reach a common consensus about how to create something that would be inclusive of all of their ideas. Discussions can get technically heated but that's ok, the discussion is happening and it brings down the wall of this perception. Bill and Con are on oftc.net/#offtopic2. Riel is there as well as Peter Zijlstra. It might be very useful, it might not be. Folks are all stubborn about their ideas and hold on to them for dear life. Effective leaders can deconstruct this hostility and animosity. I don't claim to be one.

Because of past hostility to something like PlugSched, the hostility and terseness of responses can be perceived simply as "I'm right, you're wrong", which is condescending. This affects discussion and outright destroys a constructive process if this happens continually, since it reinforces that view of "You're an outsider, we don't care about you". Nobody is listening to each other at that point, and folks get pissed. Then they think about "I'm going to NIH this person with patch X because he/she did the same here", which is dysfunctional.

Oddly
Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
On Sun, Apr 15, 2007 at 09:25:07AM -0700, Arjan van de Ven wrote:
> Now this doesn't mean that people shouldn't be nice to each other, not
> cooperate or steal credits, but I don't get the impression that that is
> happening here. Ingo is taking part in the discussion with a counter
> proposal for discussion *on the mailing list*. What more do you want??

Con should have been CCed from the first moment this was put into motion to limit the perception of exclusion. That was mistake number one and a big-time failure to understand this dynamic. After all, it was Con's idea. Why the hell he was excluded from Ingo's development process is baffling to me and to him (most likely).

He put a lot of effort into SDL and his experiences with scheduling should still be seriously considered in this development process even if he doesn't write a single line of code from this moment on.

What should have happened is that our very busy associate at RH by the name of Ingo Molnar should have leveraged more of Con's and Bill's work and used them as a proxy for his own ideas. They would have loved to have contributed more and our very busy Ingo Molnar would have gotten a lot of his work and ideas implemented without him even opening a single source file for editing. They would have happily done this work for Ingo. Ingo could have been used for something else more important, like making KVM less of a freaking ugly hack, and we all would have benefitted from this.

He could have been working on SystemTap so that you stop losing accounts to Sun and Solaris 10's Dtrace. He could have been working with Riel to fix your butt ugly page scanning problem causing horrible contention via the Clock/Pro algorithm, etc... He could have been fixing the ugly futex rwsem mapping problem that's killing -rt and anything that uses Posix threads. He could have created a userspace thread control block (TCB) with Mr. Drepper so that we can turn off preemption in userspace (userspace per CPU local storage) and implement a very quick non-kernel crossing implementation of priority ceilings (userspace check for priority and flags at preempt_schedule() in the TCB) so that our -rt Posix API doesn't suck donkey shit... Need I say more?

As programmers like Ingo get spread more thinly, he needs super smart folks like Bill Irwin and Con to help him out, and needs to learn to resist NIH-ing folks' stuff out of some weird fear. When this happens, folks like Ingo must learn to "facilitate" development in addition to implementing it with those kinds of folks.

It takes time and practice to entrust folks to do things for him. Ingo is the best means of getting new Linux kernel ideas communicated to Linus. His value goes beyond just code and he is often the biggest hammer we have in the Linux community to get stuff into the kernel. "Facilitation" of others is something that solo programmers need to learn as groups like the Linux kernel get larger and larger every year.

Understand? Are we in embarrassing agreement here?

bill
Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
On Tue, Apr 17, 2007 at 04:52:08PM -0700, Michael K. Edwards wrote: > On 4/17/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote: > >The ongoing scheduler work is on a much more basic level than these > >affairs I'm guessing you googled. When the basics work as intended it > >will be possible to move on to more advanced issues. ... Will probably shouldn't have dismissed your points, but he probably means that we can't even get at this stuff until the fundamentals are in place. > Clock scaling schemes that aren't integral to the scheduler design > make a bad situation (scheduling embedded loads with shotgun > heuristics tuned for desktop CPUs) worse, because the opaque > heuristics are now being applied to distorted data. Add a "smoothing" > scheme for the distorted data, and you may find that you have > introduced an actual control-path instability. A small fluctuation in > the data (say, two bursts of interrupt traffic at just the right > interval) can result in a long-lasting oscillation in some task's > "dynamic priority" -- and, on a fully loaded CPU, in the time that > task actually gets. If anything else depends on how much work this > task gets done each time around, the oscillation can easily propagate > throughout the system. Thrash city. Hyperthreading issues are quite similar to clock scaling issues. Con's infrastructure changes to move things in that direction were rejected, as were other infrastructure changes, further infuriating Con into dropping development on RSDL and its derivatives. bill