Re: PREEMPT_RT: 2.6.20-rt8 patch tweaked for 2.6.20.7
John,

On Fri, 2007-04-20 at 15:15 +0200, John Sigler wrote:
> I've tweaked patch-2.6.20-rt8(*) so that it applies to 2.6.20.7
>
> (*) http://rt.wiki.kernel.org/index.php/Main_Page
>
> The original patch can be found here:
> http://people.redhat.com/mingo/realtime-preempt/older/patch-2.6.20-rt8
> http://linux.kernel.free.fr/patch-2.6.20-rt8
>
> diff to the original patch to show what was tweaked:
> http://linux.kernel.free.fr/patch-2.6.20-rt8.diff
>
> New patch that applies cleanly to 2.6.20.7:
> http://linux.kernel.free.fr/patch-2.6.20.7-rt8
>
> As always, if someone spots something I've done wrong, I'd be happy to fix it in a hurry :-)
>
> Ingo, Thomas, are there any fixes that were included in the 2.6.21-rt branch only that need to be back-ported to the 2.6.20-rt branch?

I've been busy with the mainline merge of highres timers lately, so I have no good overview of the -rt state at the moment, but I will check on this later this week.

Can you create an entry in the rt-wiki, so people can find your patches ?

tglx

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: PREEMPT_RT: 2.6.20-rt8 patch tweaked for 2.6.20.7
On Mon, 2007-04-23 at 10:03 +0200, John Sigler wrote:
> > Can you create an entry in the rt-wiki, so people can find your patches ?
>
> Sure. Should I add a link to my patch on the CONFIG PREEMPT RT Patch page?
>
> http://rt.wiki.kernel.org/index.php/CONFIG_PREEMPT_RT_Patch#Download
>
> e.g. in the Download section, something along the lines of:
>
> An updated version of the CONFIG_PREEMPT_RT patch (cleanly applies to kernel 2.6.20.7) is also available.

Yep, that's fine.

> I should probably mention that it is not an officially sanctioned version, and that it has not received the same scrutiny as other patches?

:)

tglx
Re: [rfc][patch] futex: restartable futex_wait?
On Thu, 2007-03-08 at 18:29 +0100, Ingo Molnar wrote:
> * Nick Piggin [EMAIL PROTECTED] wrote:
>
> > Hi Ingo, I'm seeing an LTP test fail for ltp test sigaction_16_24. Basically, it tests whether the SA_RESTART flag works for the sem_wait operation. Not sure, whether the testcase is correct or not. See below
> >
> > I see sem_wait is implemented with futex_wait, so I wonder whether we can make it restartable? Am I going about it the right way? (Seems to fix the testcase here).
>
> i think that's quite right. I'm wondering why this never came up before? But your fix is not complete i think:
>
> > +		restart->arg2 = time;
> > +		return -ERESTART_RESTARTBLOCK;
> > +	}
>
> 'time' here is relative, so the restarted syscall will do a /full/ wait again.
>
> maybe we should rather convert futex timed-waits to hrtimers? Thomas?

The problem is that the original API is based on relative time and therefore cannot be changed.

sem_wait returns -EINTR to the application when it is interrupted, while pthread_mutex_lock does not.

http://www.opengroup.org/onlinepubs/009695399/functions/sem_wait.html
http://www.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html

We need to create a separate op for the futex, just like the pi_futex, and use absolute time there too.

tglx
Re: hardwired VMI crap
On Thu, 2007-03-08 at 15:39 -0800, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> > - /One/ _intelligent_ higher-level virtualization API/ABI. Xen's API is quite advanced on this front.
>
> At last! Some love!
>
> The Xen approach has always been to prefer high-level interfaces over lower-level ones, so that guests can meaningfully participate in their own virtualization. There are some necessarily low-level things, but conceptually simple things like "create a new vcpu" should have simple interfaces. There's no point in going to the effort of emulating a whole pile of real hardware if Xen can present an interface which is a close match to an existing high-level interface within the operating system.

Once you are there, you are near the point where you have created a virtual architecture, which could run on any real architecture that is supported by a hypervisor backend.

I'd love that :)

I know it is tricky to combine this with the upcoming hardware virtualization support. But it's at least a worthwhile thought experiment.

tglx
Re: hardwired VMI crap
On Thu, 2007-03-08 at 15:55 -0800, Zachary Amsden wrote:
> Jeremy Fitzhardinge wrote:
> > No, but I'm not prejudiced against virtual hardware. If we have a piece of code that thinks it's talking to an apic, then I think it's OK to use that code whether it's a real apic or a virtual one, _so long as it's being used in a way that's consistent with its intended interface_. I have to admit I have not looked at apics - real or virtual - in any detail, so I won't claim to really understand the details of the existing arch/i386 code or what VMI's trying to do, but it does seem to me that it could all be much cleaner.
>
> And clean is good, we all love clean - and so, agreement! For APICs, we have two operations - APICRead and APICWrite. It is nice and clean, and plugs in very easily to the APIC accessors available in Linux. Is this not clean?

No, because there is no need to use the APIC. You just pave the road for doing the same thing to the IO_APIC and whatever catches your interest next.

> We just don't drive the local timer interrupts through the APIC, we make hypercalls to schedule local timer alarms. Which is something we must do for UP kernels as well, which use the PIT / PIC. So there is a need for having clockevents code which doesn't program timers through the APIC. So we have one separate time device, independent from the traditional hardware timers, and we just program that. This design is not very complex, nor is it unclean, IMHO.

And why exactly do you need the APIC operations for the completely abstract and virtual clock event device ? To inject the interrupt, which you inject artificially into the paravirtualized kernel anyway ?

This is simply wrong and does not help anything. The 3 lines of code you share with the apic timer code are not a valid reason to hook yourself into the apic. You can use any arbitrary interrupt number to fire your VMI timer and this works on SMP as well, as we can pin interrupts to CPUs.

tglx
Re: [rfc][patch] futex: restartable futex_wait?
On Fri, 2007-03-09 at 06:10 +0100, Nick Piggin wrote:
> > i think that's quite right. I'm wondering why this never came up before? But your fix is not complete i think:
> >
> > > +		restart->arg2 = time;
> > > +		return -ERESTART_RESTARTBLOCK;
> > > +	}
> >
> > 'time' here is relative, so the restarted syscall will do a /full/ wait again.
>
> But it has been modified by schedule_timeout?

But this does not change the syscall registers, so it is restarted in the same way. We need a new futex OP for this, which takes absolute time like the PI futex op does.

tglx
Re: [rfc][patch] futex: restartable futex_wait?
On Fri, 2007-03-09 at 13:24 +0100, Nick Piggin wrote:
> > > > 'time' here is relative, so the restarted syscall will do a /full/ wait again.
> > >
> > > But it has been modified by schedule_timeout?
> >
> > But this does not change the syscall registers, so it is restarted in the same way. We need a new futex OP for this, which takes absolute time like the PI futex op does.
>
> Forgive me if I'm missing something here, but I'm using the restart block and saving the updated value of time in ->arg2, and using that as the new time parameter passed into futex_wait from futex_wait_restart.

Oops. I went into confusion mode. You are right, the restart block keeps that.

tglx
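
Editorial sketch of the point settled above: a relative timeout only survives a restart correctly if the remaining time is carried over. The code below is a plain userspace simulation, not kernel code; `struct restart_state`, `wait_once()` and `total_wait_ns()` are hypothetical names invented for illustration, and the "sleep" is just arithmetic.

```c
#include <stdint.h>

/* Hypothetical stand-in for a restart block: it only carries the
 * remaining relative timeout across an interruption. */
struct restart_state {
	int64_t remaining_ns;
};

/* Simulate one wait attempt that is interrupted after slept_ns.
 * Returns 1 if the timeout elapsed, 0 if the wait must be restarted. */
static int wait_once(struct restart_state *rs, int64_t timeout_ns,
		     int64_t slept_ns)
{
	if (slept_ns >= timeout_ns) {
		rs->remaining_ns = 0;
		return 1;
	}
	/* Save what is left, as the restart block does with ->arg2. */
	rs->remaining_ns = timeout_ns - slept_ns;
	return 0;
}

/* Total time spent waiting when a signal arrives once, after
 * interrupt_after_ns. use_saved selects between restarting with the
 * saved remainder and restarting with the original relative value. */
static int64_t total_wait_ns(int64_t timeout_ns, int64_t interrupt_after_ns,
			     int use_saved)
{
	struct restart_state rs;

	if (wait_once(&rs, timeout_ns, interrupt_after_ns))
		return timeout_ns;

	return interrupt_after_ns +
	       (use_saved ? rs.remaining_ns : timeout_ns);
}
```

With a 1000 ns timeout interrupted after 400 ns, restarting from the saved remainder waits 1000 ns in total, while restarting with the original relative value waits 1400 ns, which is the "/full/ wait again" concern raised earlier in the thread.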
Re: question about periodic clocks
On Fri, 2007-03-09 at 15:26 -0800, Jeremy Fitzhardinge wrote:
> How does the clock period get set on periodic timers? In my clock driver, I'm seeing a call to ->set_mode(CLOCK_EVT_MODE_PERIODIC, evt), but then... nothing. I was expecting a call to set_next_event to set the timer period.

Good point. I never thought about that and we set the period in the clock event device itself. You are right, the clockevents layer should hand over the period either with the set_mode call or separately. Probably with the set_mode call, as it is needed exactly there and we don't want to have an if (dev->mode == XXX) check in set_next_event().

I will look into this.

Thanks,

tglx
Re: question about periodic clocks
On Sat, 2007-03-10 at 07:50 -0800, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
> > Good point. I never thought about that and we set the period in the clock event device itself. You are right, the clockevents layer should hand over the period either with the set_mode call or separately. Probably with the set_mode call, as it is needed exactly there and we don't want to have an if (dev->mode == XXX) check in set_next_event(). I will look into this.
>
> So, in the meantime, the period is 1/HZ?

Yep.

> I also have a question about clockevent cpumasks. I was using the lapic clockevent as a model, but as I understand it there's a lapic per CPU, which explains why it registers a clockevent per cpu with that cpu alone in the cpumask.
>
> The Xen timer is a bit different; I guess more like hpet. There's a single (virtual-)machine-wide timer, which is owned by the last cpu which programmed it; ie, that cpu is the one which gets the resulting event interrupt. Does this mean I should register a single clockevent device with a cpumask of CPU_MASK_ALL? Or should I constrain it to a single cpu?

Uuurg. That's ugly. clockevents expect a per CPU timer, especially for dynamic ticks.

If you cannot provide a per cpu timer, then you probably need to use the broadcast trick. Register a primary clocksource (as PIT/HPET) and register per CPU dummy clocksources with CLOCK_EVT_FEAT_DUMMY set - we use the same trick when the lapic timer is broken. The clockevents code then uses PIT/HPET as the primary tick source and broadcasts the periodic tick to the other CPUs. In that case the dyntick / highres features are disabled.

We did some experiments to support multiple CPUs with one timer for hres/dyntick but it does not scale and it is so ugly that it is not worth the trouble. It works for the "lapic stops in C3" case, as we have a well defined point (right before going into the deep power state) where we can rearm the global clock event device. As we are idle at that point anyway there is not much penalty, but I really don't want to do that in an active system.

> There's a comment in hpet.c saying
>
>  * Start hpet with the boot cpu mask and make it
>  * global after the IO_APIC has been initialized.
>
> but I don't see any place where the hpet cpumask is updated.

I wanted to do that in the first place, but never bothered. In a UP environment it does not matter. On a sane SMP box (where we do not have the "local APIC stops in C3" problem) the HPET (analogously the PIT) is switched off for ever. In the "LAPIC stops in C3" case the HPET (PIT) is used as a broadcast fallback. That means before we go into C3 we arm the HPET/PIT for the earliest-to-expire lapic event of all CPUs. In that case it does not matter whether HPET/PIT is pinned to CPU#0 or anything else.

tglx
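
Editorial sketch of the broadcast fallback described above: before entering the deep power state, the single global device is armed for the earliest pending per-CPU event. This is a minimal userspace illustration under stated assumptions; `earliest_expiry()`, the `NO_EVENT` sentinel and the plain array of per-CPU expiry times are hypothetical and do not mirror the kernel's data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: this CPU has no pending timer event. */
#define NO_EVENT INT64_MAX

/* Return the earliest pending expiry (in ns) across all CPUs, i.e. the
 * value a single shared device would have to be armed for so that no
 * per-CPU event is missed. Returns NO_EVENT if nothing is pending. */
static int64_t earliest_expiry(const int64_t *next_event_ns, size_t ncpus)
{
	int64_t earliest = NO_EVENT;

	for (size_t i = 0; i < ncpus; i++) {
		if (next_event_ns[i] < earliest)
			earliest = next_event_ns[i];
	}
	return earliest;
}
```

Because only the minimum matters, it makes no difference which CPU the shared device's interrupt is routed to, which matches the remark that pinning HPET/PIT to CPU#0 or elsewhere is irrelevant in this mode.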
Re: Use of absolute timeouts for oneshot timers
On Sat, 2007-03-10 at 14:52 -0800, Jeremy Fitzhardinge wrote:
> When booting under Xen, you'll get this if you're using both the xen clocksource and clockevent drivers. However, it seems that during boot on a NO_HZ HIGHRES_TIMERS system, the kernel does not use the Xen clocksource until it switches to highres timer mode. This means that during boot the kernel's monotonic clock is drifting with respect to the hypervisor, and all timeouts are unreliable.

The clocksource is not used until the clocksource is installed. Also the periodic mode during boot, when the clock event device supports periodic mode, is not reading the time. It relies on the clock event device getting it straight. That's not a big deal during boot, and on a kernel with NO_HZ=n and HIGHRES=n the periodic tick only updates jiffies. If the only clocksource is jiffies, then we have to live with it and we do not switch to NO_HZ/HIGHRES, as we would lose track of time.

Once we switch to NO_HZ or HIGHRES the clock event device is directly coupled to the clock event source.

> Initially I was just computing the kernel-hypervisor offset at boot time, but then I changed it to recompute it every time the timer mode changes. However, this didn't really help, and I was still getting unpredictable timeouts during boot. I've changed it to just compute the hypervisor absolute time directly using the delta each time the oneshot timer is set, which will definitely be reliable (if the kernel and hypervisor have drifting timebases then the meaning of Xns delta will be different, but at least that's a local error rather than a long-term cumulative error).

We do not really care up to the point where the high resolution clocksource (e.g. TSC, PM-Timer or HPET on real hardware) becomes active. Early boot is fragile and we switch over to the high res clocksource and highres/nohz when things have stabilized.

> My analysis might be wrong here (I suspect the Xen periodic timer may have unexpected behaviour), but the overall conclusion still stands: using an absolute timeout only works if the kernel and hypervisor have non-drifting timebases. I think it's too fragile for a clockevent implementation to assume that a particular clocksource is in use to get reliable results.

Once we have switched over to the clocksource, everything should be in perfect sync.

> Or perhaps this is a property of the whole clock subsystem: that clockevents must be paired with clocksources. But it's not obvious to me that this is enforced, or even acknowledged.

It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute time, which is read back from the clocksource, even if we use a relative value for real hardware clock event devices to program the next event. We calculate the delta between the absolute event and now. So we never get an accumulating error.

What problem are you observing ?

tglx
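
Editorial sketch of the last point: keeping the expiry in absolute time and deriving the relative delta only when programming the device avoids cumulative error. The simulation below is a userspace illustration; `delta_to_program()`, `last_fire_time()` and the fixed per-event latency are assumptions made for the example, not kernel interfaces.

```c
#include <stdint.h>

/* Delta to program into a relative-only device for an absolute expiry.
 * min_delta_ns is a hypothetical lower bound for already-late events. */
static int64_t delta_to_program(int64_t expires_abs_ns, int64_t now_ns,
				int64_t min_delta_ns)
{
	int64_t delta = expires_abs_ns - now_ns;

	return delta < min_delta_ns ? min_delta_ns : delta;
}

/* Fire n periodic events, each delivered latency_ns late, and return
 * the time of the last delivery.
 * absolute != 0: the next expiry advances on the absolute schedule.
 * absolute == 0: the next expiry is "now + period" after each late
 *                delivery, so the lateness adds up. */
static int64_t last_fire_time(int n, int64_t period_ns, int64_t latency_ns,
			      int absolute)
{
	int64_t now = 0, expires = 0;

	for (int i = 0; i < n; i++) {
		expires = absolute ? expires + period_ns : now + period_ns;
		now += delta_to_program(expires, now, 0) + latency_ns;
	}
	return now;
}
```

With ten 1000 ns periods and 10 ns latency per event, the absolute schedule ends at 10010 ns (one latency), while the purely relative schedule ends at 10100 ns (ten latencies accumulated).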
Re: Use of absolute timeouts for oneshot timers
On Sat, 2007-03-10 at 16:42 -0800, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
> > It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute time, which is read back from the clocksource, even if we use a relative value for real hardware clock event devices to program the next event. We calculate the delta between the absolute event and now. So we never get an accumulating error.
> >
> > What problem are you observing ?
>
> Actually, two things. There were the unexpected pauses during boot, which are trivially fixable by not using the Xen periodic timer, and using the single-shot fallback.
>
> But I'm making the more general observation that if you use an absolute rather than relative time to set the single-shot timeout, then you have to deal with a long-term cumulative drift between the kernel's monotonic time and the hypervisor's monotonic time. This can happen even if your clocksource is derived directly from the hypervisor monotonic time, because running ntp will warp the kernel's time, and so it will drift with respect to the hypervisor clock. You can only avoid this by 1) not allowing adjtime, or 2) making those same adjtime warps to the hypervisor time. Neither of these is a good general solution.

Sigh, yes. Using a relative time for the next event is probably the least ugly solution.

tglx
Re: [patch 6/9] signalfd/timerfd - timerfd core ...
Davide,

On Sat, 2007-03-10 at 18:22 -0800, Davide Libenzi wrote:

Some remarks:

> +
> +asmlinkage long sys_timerfd(int ufd, int clockid, int tmrtype,
> +			    const struct timespec __user *utmr)
> +{
> +	int error;
> +	struct timerfd_ctx *ctx;
> +	struct file *file;
> +	struct inode *inode;
> +	ktime_t tval, tnow;
> +	struct timespec ktmr, tmrnow;
> +
> +	error = -EFAULT;
> +	if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
> +		goto err_exit;

Please do not use goto for a simple

	return -EFAULT;

Please validate the timespec before converting it.

	if (!timespec_valid(&ktmr))
		return -EINVAL;

> +	tval = timespec_to_ktime(ktmr);
> +	error = -EINVAL;
> +	if (clockid != CLOCK_MONOTONIC &&
> +	    clockid != CLOCK_REALTIME)
> +		goto err_exit;
> +	switch (tmrtype) {
> +	case TFD_TIMER_REL:
> +	case TFD_TIMER_SEQ:
> +		break;
> +	case TFD_TIMER_ABS:
> +		getnstimeofday(&tmrnow);
> +		tnow = timespec_to_ktime(tmrnow);

	tnow = ktime_get();

> +		if (ktime_to_ns(tval) <= ktime_to_ns(tnow))
> +			goto err_exit;
> +		tval = ktime_sub(tval, tnow);

Why do you want to do that ? hrtimers handle relative and absolute expiry times. You break down everything to relative time and lose the accuracy for absolute timers.

> +		break;
> +	default:
> +		goto err_exit;
> +	}
> +
> +	if (ufd == -1) {
> +		error = -ENOMEM;
> +		ctx = kmem_cache_alloc(timerfd_ctx_cachep, GFP_KERNEL);
> +		if (!ctx)
> +			goto err_exit;
> +
> +		init_waitqueue_head(&ctx->wqh);
> +		spin_lock_init(&ctx->lock);
> +		ctx->ticks = 0;
> +		ctx->tmrtype = tmrtype;
> +		ctx->clockid = clockid;
> +		ctx->tval = tval;
> +		hrtimer_init(&ctx->tmr, ctx->clockid, HRTIMER_REL);
> +		ctx->tmr.expires = ctx->tval;
> +		ctx->tmr.function = timerfd_tmrproc;
> +
> +		hrtimer_start(&ctx->tmr, ctx->tval, HRTIMER_REL);
> +
> +		/*
> +		 * When we call this, the initialization must be complete, since
> +		 * aino_getfd() will install the fd.
> +		 */
> +		error = aino_getfd(&ufd, &inode, &file, "[timerfd]",
> +				   &timerfd_fops, ctx);
> +		if (error)
> +			goto err_fdalloc;

Why is the timer started before we have everything in place ? Also if you turn it around then the (re)programming part of the timer can be shared.

> +	} else {
> +		error = -EBADF;
> +		file = fget(ufd);
> +		if (!file)
> +			goto err_exit;
> +		ctx = file->private_data;
> +		error = -EINVAL;
> +		if (file->f_op != &timerfd_fops) {
> +			fput(file);
> +			goto err_exit;
> +		}
> +
> +		/*
> +		 * We need to stop the exiting timer before. We call
> +		 * hrtimer_cancel() w/out holding our lock.
> +		 */
> +		spin_lock_irq(&ctx->lock);
> +		while (hrtimer_active(&ctx->tmr)) {
> +			spin_unlock_irq(&ctx->lock);
> +			hrtimer_cancel(&ctx->tmr);
> +			spin_lock_irq(&ctx->lock);
> +		}

Please use hrtimer_try_to_cancel()

retry:
	spin_lock_irq();
	if (hrtimer_try_to_cancel(&ctx->tmr) < 0) {
		spin_unlock_irq();
		cpu_relax();
		goto retry;
	}

> +
> +static unsigned int timerfd_poll(struct file *file, poll_table *wait)
> +{
> +	struct timerfd_ctx *ctx = file->private_data;
> +
> +	poll_wait(file, &ctx->wqh, wait);
> +
> +	return ctx->ticks ? POLLIN: 0;

This is racy:

	timer is set up (non periodic)
	timer expires
	poll
	now poll is stuck for ever !

tglx
Re: SwSusp to disk doesn't work - Try 2
On Sun, 2007-03-11 at 22:09 +0100, Rafael J. Wysocki wrote:
> update_sched_domains
>  detach_destroy_domains
>   [waits here] -- synchronize_sched (==synchronize_rcu)
>
> Well, I think the call to wait_for_completion() does not return, probably because the task supposed to complete the completion is frozen at this point. Can you please try to confirm that it gets stuck on wait_for_completion() in synchronize_rcu()?
>
> Yes, it's in wait_for_completion() in synchronize_rcu(). As noted in some previous mail, it will wake up after an event - key press etc.
>
> Patch in http://lkml.org/lkml/2007/3/7/255 solves a different problem. I added it to my quilt and applied it anyway - no change.
>
> Does the problem go away if NO_HZ is unset?
>
> i tried to boot with nohz=off, but the problem did persist.
>
> Hm, both variants (nohz=off or recompiled kernel without NO_HZ) work for me.
>
> Definitely something strange is going on here. I think we need advice from someone who knows the RCU internals.

RCU synchronization depends on the timer interrupt. Which kernel version are you guys talking about ?

tglx
Re: [patch] futex: PI state locking fix
On Mon, 2007-03-12 at 10:13 +0100, Ingo Molnar wrote:
> Subject: [patch] futex: PI state locking fix
> From: Ingo Molnar [EMAIL PROTECTED]
>
> testing of -rt by IBM uncovered a locking bug in wake_futex_pi(): the PI state needs to be locked before we access it.
>
> this patch has been tested in -rt. Must-have for v2.6.21.
>
> Signed-off-by: Ingo Molnar [EMAIL PROTECTED]

Acked-by: Thomas Gleixner [EMAIL PROTECTED]

> --
>  kernel/futex.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> Index: linux/kernel/futex.c
> ===================================================================
> --- linux.orig/kernel/futex.c
> +++ linux/kernel/futex.c
> @@ -566,6 +566,7 @@ static int wake_futex_pi(u32 __user *uad
>  	if (!pi_state)
>  		return -EINVAL;
>  
> +	spin_lock(&pi_state->pi_mutex.wait_lock);
>  	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
>  
>  	/*
> @@ -605,6 +606,7 @@ static int wake_futex_pi(u32 __user *uad
>  	pi_state->owner = new_owner;
>  	spin_unlock_irq(&new_owner->pi_lock);
>  
> +	spin_unlock(&pi_state->pi_mutex.wait_lock);
>  	rt_mutex_unlock(&pi_state->pi_mutex);
>  
>  	return 0;
Re: [patch 6/9] signalfd/timerfd v3 - timerfd core ...
Davide,

On Sun, 2007-03-11 at 16:04 -0700, Davide Libenzi wrote:
> +static int timerfd_setup(struct timerfd_ctx *ctx, int clockid, int tmrtype,
> +			 const struct itimerspec *ktmr)
> +{
> +	enum hrtimer_mode htmode;
> +	ktime_t texp, tintv;
> +
> +	if (clockid != CLOCK_MONOTONIC &&
> +	    clockid != CLOCK_REALTIME)
> +		return -EINVAL;

Please move the validation for clockid, tmrtype and the timerspec into sys_timerfd. Do it before anything else. Also please validate both it_value and it_interval unconditionally. Userspace should not send uninitialized stuff at all.

The TFD_TIMER_SEQ thing is quite different from all other timer interfaces which POSIX provides. Both itimers and posixtimers use the it_interval value to distinguish between one shot and periodic timers. I think we should keep this new interface analogous, so programmers don't get more confused than they already are. :) This also allows relative and absolute starting points for both one shot and sequential timers.

Please use it_value == 0 to stop the timer. This is the same as for itimers and posixtimers. Right now you have to close the fd to stop a timer, but that's not necessarily what you want.

Why do you want to store information which is only relevant for setup in ctx ? If you do the validation right in sys_timerfd and get rid of TFD_TIMER_SEQ and the various useless fields, then timerfd_setup() boils down to

	ctx->ticks = 0;
	ctx->tintv = tintv;
	hrtimer_init(&ctx->tmr, clockid, htmode);
	ctx->tmr.function = timerfd_tmrproc;
	if (texp.tv64 != 0)
		hrtimer_start(&ctx->tmr, texp, htmode);

and in the timer function you simply check for

	if (ctx->tintv.tv64 != 0)

instead of the TIMER_SEQ mode.

> +asmlinkage long sys_timerfd(int ufd, int clockid, int tmrtype,
> +			    const struct itimerspec __user *utmr)
> +{
> +	int error;
> +	struct timerfd_ctx *ctx;
> +	struct file *file;
> +	struct inode *inode;
> +	struct itimerspec ktmr;
> +
> +	if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
> +		return -EFAULT;

Do validation of clockid, tmrtype and ktmr here.

> +	if (ufd == -1) {
> +		error = -ENOMEM;
> +		ctx = kmem_cache_alloc(timerfd_ctx_cachep, GFP_KERNEL);
> +		if (!ctx)
> +			goto err_exit;

	return -ENOMEM;

> +		init_waitqueue_head(&ctx->wqh);
> +		spin_lock_init(&ctx->lock);
> +		ctx->clockid = -1;
> +
> +		error = timerfd_setup(ctx, clockid, tmrtype, &ktmr);
> +		if (error)
> +			goto err_ctxfree;
> +
> +		/*
> +		 * When we call this, the initialization must be complete, since
> +		 * aino_getfd() will install the fd.
> +		 */
> +		error = aino_getfd(&ufd, &inode, &file, "[timerfd]",
> +				   &timerfd_fops, ctx);
> +		if (error)
> +			goto err_ctxfree;
> +	} else {
> +		error = -EBADF;
> +		file = fget(ufd);
> +		if (!file)
> +			goto err_exit;

	return -EBADF;

> +		ctx = file->private_data;
> +		error = -EINVAL;
> +		if (file->f_op != &timerfd_fops) {
> +			fput(file);
> +			goto err_exit;

	return -EINVAL;

> +		}
> +		/*
> +		 * We need to stop the exiting timer before.
> +		 */

-ENOPARSE. You probably mean: We need to stop an already running timer before we do a new setup.

> +		for (;;) {
> +			spin_lock_irq(&ctx->lock);
> +			if (hrtimer_try_to_cancel(&ctx->tmr) >= 0)
> +				break;
> +			spin_unlock_irq(&ctx->lock);
> +			cpu_relax();
> +		}
> +		/*
> +		 * Re-program the timer to the new value ...
> +		 */
> +		error = timerfd_setup(ctx, clockid, tmrtype, &ktmr);
> +
> +		spin_unlock_irq(&ctx->lock);
> +		fput(file);
> +		if (error)
> +			goto err_exit;

	return error;

> +	}
> +
> +	return ufd;
> +
> +err_ctxfree:
> +	timerfd_cleanup(ctx);
> +err_exit:
> +	return error;
> +}
> +

tglx
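
Editorial sketch of the interface semantics requested in this review: validate both fields up front, treat `it_value == 0` as "stop", and let `it_interval` alone decide between one-shot and periodic. The types and names below (`struct tfd_time`, `struct tfd_spec`, `classify()`) are hypothetical stand-ins for `struct timespec` / `struct itimerspec`, used so the example compiles without any POSIX headers; this is not the timerfd code itself.

```c
/* Hypothetical stand-ins for struct timespec / struct itimerspec. */
struct tfd_time {
	long long tv_sec;
	long tv_nsec;
};

struct tfd_spec {
	struct tfd_time it_interval;	/* period; zero means one shot */
	struct tfd_time it_value;	/* first expiry; zero means stop */
};

enum timer_action { TIMER_INVALID, TIMER_STOP, TIMER_ONESHOT, TIMER_PERIODIC };

/* Same range check that timespec validation is expected to perform. */
static int time_valid(const struct tfd_time *t)
{
	return t->tv_sec >= 0 && t->tv_nsec >= 0 && t->tv_nsec < 1000000000L;
}

static int time_is_zero(const struct tfd_time *t)
{
	return t->tv_sec == 0 && t->tv_nsec == 0;
}

/* Decide what a setup request means, validating both fields
 * unconditionally before looking at either of them. */
static enum timer_action classify(const struct tfd_spec *spec)
{
	if (!time_valid(&spec->it_value) || !time_valid(&spec->it_interval))
		return TIMER_INVALID;
	if (time_is_zero(&spec->it_value))
		return TIMER_STOP;
	return time_is_zero(&spec->it_interval) ? TIMER_ONESHOT
						: TIMER_PERIODIC;
}
```

Under this scheme no separate "sequential" timer type is needed, and a relative or absolute flag can be applied to the first expiry independently of whether the timer repeats.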
Re: [patch] change futex_wait() to hrtimers
On Mon, 2007-03-12 at 12:27 +0100, Andi Kleen wrote:
> Ingo Molnar [EMAIL PROTECTED] writes:
> > the only correct approach is the use of hrtimers, and a patch exists for that - see below. This has been included in -rt for quite some time.
>
> But isn't that bad for power management? You'll likely get more idle wakeups, won't you?

Why so ? It comes more precisely, but only once.

tglx
Re: [patch] change futex_wait() to hrtimers
On Mon, 2007-03-12 at 12:02 +0100, Ingo Molnar wrote:
> > Well I did convert futex_wait to an absolute timeout based version in the subsequent incremental patch. I think that is OK?
>
> it still has the rounding artifacts: using timer_list there is no way to do a precise long sleep based on many small sleeps.
>
> even if this means more work for you (i'm sorry about that!) i'm quite sure we should take Sebastien's hrtimers based implementation of futex_wait(), and use the nanosleep method to restart it. There's no point in further tweaking the imprecise approach: whenever some timeout needs to be restarted, it's a candidate for hrtimers.
>
> until then, glibc already handles timeouts and restarts it manually.

This also allows us to add a separate absolute time based futex op, which allows us to remove the conversion of abstime to reltime in glibc.

tglx
Re: [6/6] 2.6.21-rc3: known regressions
On Tue, 2007-03-13 at 13:50 +0100, Adrian Bunk wrote:
> Subject    : hrtimer_switch_to_hres(): wrong tick_init_highres() return value handling
> References : http://lkml.org/lkml/2007/3/6/262
> Submitter  : Linus Torvalds [EMAIL PROTECTED]
> Caused-By  : Thomas Gleixner [EMAIL PROTECTED]
>              commit 54cdfdb47f73b5af3d1ebb0f1e383efbe70fde9e
> Handled-By : Thomas Gleixner [EMAIL PROTECTED]
> Status     : unknown

Linus merged the original patch, which solved the real problem. He just gave me a lesson on how to do it right next time.

tglx
Re: x86_64 system lockup from userspace using setitimer()
On Tue, 2007-03-13 at 16:02 -0400, Chuck Ebbert wrote:
> struct itimerval tim = {
>	.it_interval = { .tv_sec = 140735669863712, .tv_usec = 4199521 },
>
> Could this be fixed by:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8bfd9a7a229b5f3d3eda5d7d45c2eebec5b4ba16
>
> [PATCH] hrtimers: prevent possible itimer DoS

No. The possible DoS is only when high res timers are enabled, which is not the case in 2.6.20.

Looking at the values:

	140735669863712 = 0x7FFF 939C 0520

We convert seconds to nanoseconds:

	140735669863712 * 1e9 = 0x1DCD 4BC3 6B82 914B 4000

The seconds value is limited to LONG_MAX, but on a 64 bit machine 140735669863712 is inside LONG_MAX and we have a multiplication overflow: the product needs more than 64 bits.

I'm not sure how this results in a DoS, but I will look into this tomorrow morning, when I'm more awake.

tglx
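
Editorial sketch of the arithmetic above: a seconds value that fits in a 64 bit long can still overflow once it is multiplied by 1e9. The helpers below are illustrative userspace code; `sec_to_ns_checked()` and `wrapped_low64()` are hypothetical names, not kernel functions, and the range check is one possible way to guard the conversion.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Convert sec + nsec to nanoseconds. Returns 1 and stores the result in
 * *out when it fits into a signed 64 bit value, 0 otherwise. */
static int sec_to_ns_checked(int64_t sec, int64_t nsec, int64_t *out)
{
	if (sec < 0 || nsec < 0 || nsec >= NSEC_PER_SEC)
		return 0;
	/* sec * NSEC_PER_SEC + nsec <= INT64_MAX, tested without overflow */
	if (sec > (INT64_MAX - nsec) / NSEC_PER_SEC)
		return 0;
	*out = sec * NSEC_PER_SEC + nsec;
	return 1;
}

/* What an unchecked 64 bit multiplication keeps: only the low 64 bits
 * of the product (unsigned arithmetic, so the wrap is well defined). */
static uint64_t wrapped_low64(uint64_t sec)
{
	return sec * (uint64_t)NSEC_PER_SEC;
}
```

For the value from the report, the full product is 0x1DCD 4BC3 6B82 914B 4000, so an unchecked multiply silently drops the top 0x1DCD and leaves 0x4BC3 6B82 914B 4000, a plausible-looking but wrong interval.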
Re: [6/6] 2.6.21-rc3: known regressions
On Tue, 2007-03-13 at 13:50 +0100, Adrian Bunk wrote:
> This email lists some known regressions in Linus' tree compared to 2.6.20.
>
> If you find your name in the Cc header, you are either submitter of one of the bugs, maintainer of an affected subsystem or driver, a patch of you caused a breakage or I'm considering you in any other way possibly involved with one or more of these issues.
>
> Due to the huge amount of recipients, please trim the Cc when answering.
>
> Subject    : Clocksource tsc unstable (delta = -154983451 ns)
> References : http://lkml.org/lkml/2007/3/9/271
> Submitter  : Jiri Slaby [EMAIL PROTECTED]
> Status     : unknown

That's not a regression. That's an informational message, printed when the TSC watchdog detects that the TSC is unreliable.

tglx
[PATCH] hrtimer: prevent overrun DoS in hrtimer_forward()
hrtimer_forward() does not check for the possible overflow of timer->expires. This can happen on 64 bit machines with large interval values and currently results in an endless loop in the softirq, because the expiry value becomes negative and therefore the timer is expired all the time.

Check for this condition and set the expiry value to the max. expiry time in the future.

The fix should be applied to the stable kernel series as well.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index ec4cb9f..5e7122d 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -644,6 +644,12 @@ hrtimer_forward(struct hrtimer *timer, k
 		orun++;
 	}
 	timer->expires = ktime_add(timer->expires, interval);
+	/*
+	 * Make sure, that the result did not wrap with a very large
+	 * interval.
+	 */
+	if (timer->expires.tv64 < 0)
+		timer->expires = ktime_set(KTIME_SEC_MAX, 0);
 
 	return orun;
 }
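
Editorial sketch of the clamping idea in this patch, written as standalone userspace C. The patch adds first and then tests for a negative result; because signed overflow is undefined in portable C, the sketch checks before adding, which gives the same outcome for non-negative inputs. `forward_clamped()` and `EXPIRY_MAX_NS` are hypothetical names, and `INT64_MAX` merely stands in for the maximum expiry the patch sets via `ktime_set(KTIME_SEC_MAX, 0)`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the largest representable expiry time. */
#define EXPIRY_MAX_NS INT64_MAX

/* Advance an expiry by one interval. If the sum would not fit, pin the
 * expiry to the far future instead of letting it wrap to a negative
 * value, which would make the timer look permanently expired. */
static int64_t forward_clamped(int64_t expires_ns, int64_t interval_ns)
{
	/* The sketch only covers the non-negative case from the report. */
	assert(expires_ns >= 0 && interval_ns >= 0);

	if (interval_ns > EXPIRY_MAX_NS - expires_ns)
		return EXPIRY_MAX_NS;
	return expires_ns + interval_ns;
}
```

A timer whose expiry is pinned at the maximum never fires in practice, which is the intended behaviour for an absurdly large interval and avoids the endless softirq loop described in the changelog.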
Re: [6/6] 2.6.21-rc3: known regressions
On Wed, 2007-03-14 at 19:02 +0100, Florian Lohoff wrote:
> On Wed, Mar 14, 2007 at 12:44:17PM +0100, Adrian Bunk wrote:
> > > > Due to the huge number of recipients, please trim the Cc when answering.
> > > >
> > > > Subject    : Clocksource tsc unstable (delta = -154983451 ns)
> > > > References : http://lkml.org/lkml/2007/3/9/271
> > > > Submitter  : Jiri Slaby [EMAIL PROTECTED]
> > > > Status     : unknown
> > >
> > > That's not a regression. That's an informational message, printed when the TSC watchdog detects that the TSC is unreliable.
> >
> > Looking at [1], there's also a probably related "doesn't boot" problem.
> >
> > My first guess would be commit 6bb74df481223731af6c7e0ff3adb31f6442cfcd clocksource init adjustments (fix bug #7426).
> >
> > Jiri, is the message also present with 2.6.21-rc2 (at a different place of the dmesg) for you?
>
> With the current git of today the halt on boot is gone. I am running it now ...

I'm really curious what made it go away.

tglx
[PATCH] hrtimer: Fixup unlocked access to wall_to_monotonic
commit f4304ab21513b834c8fe3403927c60c2b81a72d7 (HZ free NTP) moved the access to wall_to_monotonic in hrtimer_get_softirq_time() out of the xtime_lock protection.

Move it back into the seq_lock section.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index ec4cb9f..e2053da 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -135,7 +135,7 @@ EXPORT_SYMBOL_GPL(ktime_get_ts);
 static void hrtimer_get_softirq_time(struct hrtimer_cpu_base *base)
 {
 	ktime_t xtim, tomono;
-	struct timespec xts;
+	struct timespec xts, tom;
 	unsigned long seq;
 
 	do {
@@ -145,10 +145,11 @@ #ifdef CONFIG_NO_HZ
 #else
 		xts = xtime;
 #endif
+		tom = wall_to_monotonic;
 	} while (read_seqretry(&xtime_lock, seq));
 
 	xtim = timespec_to_ktime(xts);
-	tomono = timespec_to_ktime(wall_to_monotonic);
+	tomono = timespec_to_ktime(tom);
 	base->clock_base[CLOCK_REALTIME].softirq_time = xtim;
 	base->clock_base[CLOCK_MONOTONIC].softirq_time =
 		ktime_add(xtim, tomono);
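The point of the fix is that every value read under a sequence lock has to be copied inside the retry loop, otherwise the two values may come from different updates. A rough user-space sketch of that pattern follows. It is not the kernel's seqlock implementation: `struct sketch_clock`, `sketch_write()` and `sketch_read_monotonic()` are hypothetical names, and the plain data accesses are only safe in this single-threaded illustration (a real concurrent version needs proper barriers or atomic accesses).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for xtime and wall_to_monotonic under xtime_lock. */
struct sketch_clock {
	atomic_uint seq;		/* even: stable, odd: writer active */
	int64_t xtime_ns;
	int64_t wall_to_mono_ns;
};

static void sketch_write(struct sketch_clock *c, int64_t xt, int64_t w2m)
{
	atomic_fetch_add(&c->seq, 1);	/* sequence becomes odd */
	c->xtime_ns = xt;
	c->wall_to_mono_ns = w2m;
	atomic_fetch_add(&c->seq, 1);	/* sequence is even again */
}

static int64_t sketch_read_monotonic(struct sketch_clock *c)
{
	unsigned int seq;
	int64_t xt, w2m;

	do {
		seq = atomic_load(&c->seq);
		xt = c->xtime_ns;
		/* Both values are copied inside the retry loop. */
		w2m = c->wall_to_mono_ns;
	} while ((seq & 1) || seq != atomic_load(&c->seq));

	/* Only the local copies are used after the loop. */
	return xt + w2m;
}
```

Reading `wall_to_mono_ns` after the loop, as the code did before the patch, would pair a consistent `xtime_ns` with a possibly newer offset.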
Re: [patch 6/13] signalfd/timerfd/asyncfd v5 - timerfd core ...
Davide,

On Wed, 2007-03-14 at 15:19 -0700, Davide Libenzi wrote:
> +static int timerfd_tmrproc(struct hrtimer *htmr)
> +{
> +	struct timerfd_ctx *ctx = container_of(htmr, struct timerfd_ctx, tmr);
> +	int rval = HRTIMER_NORESTART;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	ctx->ticks++;
> +	wake_up_locked(&ctx->wqh);
> +	if (ctx->tintv.tv64 != 0) {
> +		hrtimer_forward(htmr, htmr->base->softirq_time, ctx->tintv);

Sorry, I missed that in the first reviews. Please use hrtimer_cb_get_time(htmr) instead of htmr->base->softirq_time, so this is high res timer safe.

> +		rval = HRTIMER_RESTART;
> +	}
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +	return rval;
> +}
> +
> +
> +static int timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
> +			 const struct itimerspec *ktmr)
> +{

Make this void, it returns 0 anyway.

> +	enum hrtimer_mode htmode;
> +
> +	htmode = (flags & TFD_TIMER_ABSTIME) ? HRTIMER_ABS: HRTIMER_REL;
> +
> +	ctx->ticks = 0;
> +	ctx->clockid = clockid;
> +	ctx->flags = flags;
> +	ctx->texp = timespec_to_ktime(ktmr->it_value);

clockid is stored in the timer on setup, so no need to store it again. Expiry time and flags are not used after setup. Please remove those fields.

> +	ctx->tintv = timespec_to_ktime(ktmr->it_interval);
> +	hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
> +	ctx->tmr.expires = ctx->texp;
> +	ctx->tmr.function = timerfd_tmrproc;
> +	if (ctx->texp.tv64 != 0)
> +		hrtimer_start(&ctx->tmr, ctx->texp, htmode);
> +
> +	return 0;
> +}
> +

> +	if (ufd == -1) {
> +		ctx = kmem_cache_alloc(timerfd_ctx_cachep, GFP_KERNEL);
> +		if (!ctx)
> +			return -ENOMEM;
> +
> +		init_waitqueue_head(&ctx->wqh);
> +		spin_lock_init(&ctx->lock);
> +		ctx->clockid = -1;
> +
> +		error = timerfd_setup(ctx, clockid, flags, ktmr);
> +		if (error)
> +			goto err_ctxfree;

Timer setup can not fail.

> +		/*
> +		 * When we call this, the initialization must be complete, since
> +		 * aino_getfd() will install the fd.
> +		 */
> +		error = aino_getfd(&ufd, &inode, &file, "[timerfd]",
> +				   &timerfd_fops, ctx);
> +		if (error)
> +			goto err_ctxfree;

Again: Please turn this around. No need to start the timer before we know that everything works.

> +	} else {
> +		file = fget(ufd);
> +		if (!file)
> +			return -EBADF;
> +		ctx = file->private_data;
> +		if (file->f_op != &timerfd_fops) {
> +			fput(file);
> +			return -EINVAL;
> +		}
> +		/*
> +		 * We need to stop the existing timer before reprogramming
> +		 * it to the new values.
> +		 */
> +		for (;;) {
> +			spin_lock_irq(&ctx->lock);
> +			if (hrtimer_try_to_cancel(&ctx->tmr) >= 0)
> +				break;
> +			spin_unlock_irq(&ctx->lock);
> +			cpu_relax();
> +		}
> +		/*
> +		 * Re-program the timer to the new value ...
> +		 */
> +		error = timerfd_setup(ctx, clockid, flags, ktmr);

Timer setup can not fail.

> +		spin_unlock_irq(&ctx->lock);
> +		fput(file);
> +		if (error)
> +			return error;
> +	}
> +
> +	return ufd;
> +
> +err_ctxfree:
> +	timerfd_cleanup(ctx);
> +	return error;
> +}
> +
> +
> +static void timerfd_cleanup(struct timerfd_ctx *ctx)
> +{
> +	if (ctx->clockid >= 0)
> +		hrtimer_cancel(&ctx->tmr);

You don't have a file descriptor when the setup failed. So the timer is always initialized.

> +	kmem_cache_free(timerfd_ctx_cachep, ctx);
> +}
> +
> +
> +static int timerfd_close(struct inode *inode, struct file *file)
> +{
> +	timerfd_cleanup(file->private_data);
> +	return 0;
> +}
> +

Please move the timerfd_cleanup code into close().

tglx
Re: [patch 6/13] signal/timer/event fds v6 - timerfd core ...
On Thu, 2007-03-15 at 17:22 -0700, Davide Libenzi wrote:
> +static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
> +			  const struct itimerspec *ktmr)
> +{
> +	enum hrtimer_mode htmode;
> +
> +	htmode = (flags & TFD_TIMER_ABSTIME) ? HRTIMER_MODE_ABS: HRTIMER_MODE_REL;
> +
> +	ctx->ticks = 0;
> +	ctx->texp = timespec_to_ktime(ktmr->it_value);

I know, I'm racking your nerves. texp is only used for setup. No need to carry it in the ctx data structure. :)

> +	ctx->tintv = timespec_to_ktime(ktmr->it_interval);
> +	hrtimer_init(&ctx->tmr, clockid, htmode);
> +	ctx->tmr.expires = ctx->texp;
> +	ctx->tmr.function = timerfd_tmrproc;
> +	if (ctx->texp.tv64 != 0)
> +		hrtimer_start(&ctx->tmr, ctx->texp, htmode);
> +}

Thanks,

tglx
Re: [patch 6/13] signalfd/timerfd/asyncfd v5 - timerfd core ...
On Thu, 2007-03-15 at 16:02 -0700, Davide Libenzi wrote:
> > > +		/*
> > > +		 * When we call this, the initialization must be complete, since
> > > +		 * aino_getfd() will install the fd.
> > > +		 */
> > > +		error = aino_getfd(&ufd, &inode, &file, "[timerfd]",
> > > +				   &timerfd_fops, ctx);
> > > +		if (error)
> > > +			goto err_ctxfree;
> >
> > Again: Please turn this around. No need to start the timer before we know that everything works.
>
> The timerfd_setup() is not locked, so we need to make sure everything is set up before advertising the fd (and aino_getfd does that).

Right. Did not think about the bad boys peeking at file descriptors :)

tglx
Re: Linux 2.6.21-rc4
On Fri, 2007-03-16 at 21:34 +0100, Rafael J. Wysocki wrote:
> On Friday, 16 March 2007 17:33, Linus Torvalds wrote:
> > I pushed out the -git trees yesterday, but then got distracted, so the patches and tar-balls and the announcement got delayed until this morning. Oops. I'm a scatter-brain.
> >
> > Anyway, the good news about -rc4 is that there's just lots of random fixes. I'm hoping that we've seriously cut down on the regression list, and I'd ask everybody who is on Adrian's list to please re-verify their regression, and in case it's one of the "patches available" ones but I haven't merged (maybe because it hasn't been sent to me!), make sure I do.
>
> I'm afraid that if CONFIG_TICK_ONESHOT or CONFIG_NO_HZ is set, we still have a problem with RCU synchronization while nonboot CPUs are being enabled during a resume (http://lkml.org/lkml/2007/3/11/144, http://lkml.org/lkml/2007/3/4/88).
>
> Can someone who had this problem with -rc3 check if it's present in -rc4?

I finally found a box today which shows this problem. I'm working on a fix.

tglx
Re: [PATCH] hrtimer: prevent overrun DoS in hrtimer_forward()
On Fri, 2007-03-16 at 12:43 -0800, Andrew Morton wrote:
> On Wed, 14 Mar 2007 11:00:12 +0100 Thomas Gleixner [EMAIL PROTECTED] wrote:
> > hrtimer_forward() does not check for the possible overflow of timer->expires. This can happen on 64 bit machines with large interval values and currently results in an endless loop in the softirq, because the expiry value becomes negative and therefore the timer is expired all the time.
> >
> > Check for this condition and set the expiry value to the max. expiry time in the future.
> >
> > The fix should be applied to the stable kernel series as well.
> >
> > Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]
> >
> > diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> > index ec4cb9f..5e7122d 100644
> > --- a/kernel/hrtimer.c
> > +++ b/kernel/hrtimer.c
> > @@ -644,6 +644,12 @@ hrtimer_forward(struct hrtimer *timer, k
> >  		orun++;
> >  	}
> >  	timer->expires = ktime_add(timer->expires, interval);
> > +	/*
> > +	 * Make sure, that the result did not wrap with a very large
> > +	 * interval.
> > +	 */
> > +	if (timer->expires.tv64 < 0)
> > +		timer->expires = ktime_set(KTIME_SEC_MAX, 0);
> >
> >  	return orun;
> >  }
>
> kernel/hrtimer.c: In function 'hrtimer_forward':
> kernel/hrtimer.c:652: warning: overflow in implicit constant conversion
>
> problem is, KTIME_SEC_MAX is 9,223,372,036 and ktime_set() takes a `long'.

Stupid me :(

> This?
>
> --- a/include/linux/ktime.h~ktime_set-fix-arg-type
> +++ a/include/linux/ktime.h
> @@ -72,13 +72,13 @@ typedef union {
>   *
>   * Return the ktime_t representation of the value
>   */
> -static inline ktime_t ktime_set(const long secs, const unsigned long nsecs)
> +static inline ktime_t ktime_set(const s64 secs, const unsigned long nsecs)
>  {
>  #if (BITS_PER_LONG == 64)
>  	if (unlikely(secs >= KTIME_SEC_MAX))
>  		return (ktime_t){ .tv64 = KTIME_MAX };
>  #endif
> -	return (ktime_t) { .tv64 = (s64)secs * NSEC_PER_SEC + (s64)nsecs };
> +	return (ktime_t) { .tv64 = secs * NSEC_PER_SEC + (s64)nsecs };
>  }
>
>  /* Subtract two ktime_t variables. rem = lhs -rhs: */
> _
>
> I worry about that `secs >= KTIME_SEC_MAX' comparison in there, too. Both operands are signed.

I'd prefer this one:

The maximum seconds value we can handle on 32bit is LONG_MAX.

diff --git a/include/linux/ktime.h b/include/linux/ktime.h
index c68c7ac..248305b 100644
--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -57,7 +57,11 @@ typedef union {
 } ktime_t;
 
 #define KTIME_MAX			((s64)~((u64)1 << 63))
-#define KTIME_SEC_MAX			(KTIME_MAX / NSEC_PER_SEC)
+#if (BITS_PER_LONG == 64)
+# define KTIME_SEC_MAX			(KTIME_MAX / NSEC_PER_SEC)
+#else
+# define KTIME_SEC_MAX			LONG_MAX
+#endif
 
 /*
  * ktime_t definitions when using the 64-bit scalar representation:
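To see why defining the seconds limit per word size avoids the compiler warning, here is a small user-space sketch. It only models the idea: `sk_ktime_set()`, `SK_KTIME_MAX` and `SK_KTIME_SEC_MAX` are hypothetical names, a plain `int64_t` stands in for `ktime_t`, and `LONG_MAX` is used in place of the kernel's `BITS_PER_LONG` test.

```c
#include <limits.h>
#include <stdint.h>

#define SK_NSEC_PER_SEC	1000000000LL
#define SK_KTIME_MAX	INT64_MAX

/* The limit always fits into a long, whatever the width of long is. */
#if LONG_MAX > 0x7fffffffL
# define SK_KTIME_SEC_MAX	((long)(SK_KTIME_MAX / SK_NSEC_PER_SEC))
#else
# define SK_KTIME_SEC_MAX	LONG_MAX
#endif

static int64_t sk_ktime_set(long secs, unsigned long nsecs)
{
#if LONG_MAX > 0x7fffffffL
	/* With a 64-bit long, secs * NSEC_PER_SEC could overflow: clamp. */
	if (secs >= SK_KTIME_SEC_MAX)
		return SK_KTIME_MAX;
#endif
	/* With a 32-bit long, LONG_MAX seconds still fit into 64 bits. */
	return (int64_t)secs * SK_NSEC_PER_SEC + (int64_t)nsecs;
}
```

With this layout a call such as `sk_ktime_set(SK_KTIME_SEC_MAX, 0)` never passes a constant that is too wide for the `long` parameter.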
[PATCH] clockevents: Fix suspend/resume to disk hangs
I finally found a dual core box which survives suspend/resume without crashing in the middle of nowhere. Sigh, I never figured out from the code and the bug reports what's going on.

The observed hangs are caused by a stale state transition of the clock event devices, which keeps the RCU synchronization away from completion when the non boot CPU is brought back up.

The suspend/resume in oneshot mode needs similar care as the periodic mode during suspend to RAM. My assumption that the state transitions during the different shutdown/bringups of s2disk would go through the periodic boot phase and then switch over to highres resp. nohz mode was simply wrong.

Add the appropriate suspend / resume handling for the non periodic modes.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 5567745..eadfce2 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -307,12 +307,19 @@ int tick_resume_broadcast(void)
 	spin_lock_irqsave(&tick_broadcast_lock, flags);
 
 	bc = tick_broadcast_device.evtdev;
-	if (bc) {
-		if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC &&
-		    !cpus_empty(tick_broadcast_mask))
-			tick_broadcast_start_periodic(bc);
 
-		broadcast = cpu_isset(smp_processor_id(), tick_broadcast_mask);
+	if (bc) {
+		switch (tick_broadcast_device.mode) {
+		case TICKDEV_MODE_PERIODIC:
+			if(!cpus_empty(tick_broadcast_mask))
+				tick_broadcast_start_periodic(bc);
+			broadcast = cpu_isset(smp_processor_id(),
+					      tick_broadcast_mask);
+			break;
+		case TICKDEV_MODE_ONESHOT:
+			broadcast = tick_resume_broadcast_oneshot(bc);
+			break;
+		}
 	}
 	spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 
@@ -347,6 +354,16 @@ static int tick_broadcast_set_event(ktime_t expires, int force)
 	}
 }
 
+int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
+{
+	clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT);
+
+	if(!cpus_empty(tick_broadcast_oneshot_mask))
+		tick_broadcast_set_event(ktime_get(), 1);
+
+	return cpu_isset(smp_processor_id(), tick_broadcast_oneshot_mask);
+}
+
 /*
  * Reprogram the broadcast device:
  *
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 43ba1bd..bfda3f7 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -298,18 +298,17 @@ static void tick_shutdown(unsigned int *cpup)
 	spin_unlock_irqrestore(&tick_device_lock, flags);
 }
 
-static void tick_suspend_periodic(void)
+static void tick_suspend(void)
 {
 	struct tick_device *td = &__get_cpu_var(tick_cpu_device);
 	unsigned long flags;
 
 	spin_lock_irqsave(&tick_device_lock, flags);
-	if (td->mode == TICKDEV_MODE_PERIODIC)
-		clockevents_set_mode(td->evtdev, CLOCK_EVT_MODE_SHUTDOWN);
+	clockevents_set_mode(td->evtdev, CLOCK_EVT_MODE_SHUTDOWN);
 	spin_unlock_irqrestore(&tick_device_lock, flags);
 }
 
-static void tick_resume_periodic(void)
+static void tick_resume(void)
 {
 	struct tick_device *td = &__get_cpu_var(tick_cpu_device);
 	unsigned long flags;
@@ -317,6 +316,8 @@ static void tick_resume_periodic(void)
 	spin_lock_irqsave(&tick_device_lock, flags);
 	if (td->mode == TICKDEV_MODE_PERIODIC)
 		tick_setup_periodic(td->evtdev, 0);
+	else
+		tick_resume_oneshot();
 	spin_unlock_irqrestore(&tick_device_lock, flags);
 }
 
@@ -348,13 +349,13 @@ static int tick_notify(struct notifier_block *nb, unsigned long reason,
 		break;
 
 	case CLOCK_EVT_NOTIFY_SUSPEND:
-		tick_suspend_periodic();
+		tick_suspend();
 		tick_suspend_broadcast();
 		break;
 
 	case CLOCK_EVT_NOTIFY_RESUME:
 		if (!tick_resume_broadcast())
-			tick_resume_periodic();
+			tick_resume();
 		break;
 
 	default:
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 75890ef..c9d203b 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -19,12 +19,13 @@ extern void tick_setup_oneshot(struct clock_event_device *newdev,
 extern int tick_program_event(ktime_t expires, int force);
 extern void tick_oneshot_notify(void);
 extern int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *));
-
+extern void tick_resume_oneshot(void);
 # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
 extern void tick_broadcast_oneshot_control(unsigned long reason);
 extern void
Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
On Fri, 2007-03-16 at 12:30 +0200, Maxim Levitsky wrote:
> Mar 14 00:22:23 MAIN kernel: [2.072875] caller is check_tsc_sync_source+0x1d/0x100
> Mar 14 00:22:23 MAIN kernel: [2.072878] [show_trace_log_lvl+26/48] show_trace_log_lvl+0x1a/0x30
> Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization [CPU#0 -> CPU#1]:
> Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC warp between CPUs, turning off
>
> It looks clear that preempt is enabled all the way in second cpu initialization, ( I think that at least in check_tsc_sync_source, it should be disabled, shouldn't it ? )

This should be fixed by commit d04f41e35343f1d788551fd3f753f51794f4afcf

tglx
Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
Maxim,

On Fri, 2007-03-16 at 12:30 +0200, Maxim Levitsky wrote:
> 3) Sometimes I get this (once in three boots or so)
>
> [ 36.217405] ENABLING IO-APIC IRQs
> [ 36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 36.433917] APIC timer disabled due to verification failure.
>
> And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
> I haven't investigated that yet. It looks like another new test that my hardware fails to perform...

Yes, this is probably caused by SMM code trying to emulate a PS/2 keyboard from a (maybe connected or not) USB keyboard. Unfortunately we have no way to disable this BIOS misfeature in the early boot process. Arjan, Len ?

I built in this test to rule out bogus LAPIC timer calibration values, which are sometimes off by factor 2-10. But I also built in a calibration against the PM-Timer, which turned out to be quite reliable, and I think the additional verification step is only necessary for systems without PM-Timer. That was a bit overcautious from my side.

I'll send a patch to avoid this when PM-Timer is available in a separate mail.

tglx
[PATCH] i386: trust the PM-Timer calibration of the local APIC timer
When PM-Timer is available for local APIC timer calibration we can skip the verification of the calibrated time value. The resulting error is quite small on a bunch of evaluated platforms and is less harmful than the observed false positives.

We need to keep the verification on systems which have no PM-Timer to avoid bogus local APIC timer calibrations in the range of factor 2-10, which can be observed when switching off the PM-timer support in the kernel configuration.

The wrong calibration values are probably caused by SMM code trying to emulate a PS/2 keyboard from a (maybe connected or not) USB keyboard. This prohibits the accurate delivery of PIT interrupts, which are used to calibrate the local APIC timer. Unfortunately we have no way to disable this BIOS misfeature in the early boot process.

Also add the dropped cpu_relax() back to the wait loops.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 2383bcf..92f4210 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -338,6 +338,7 @@ void __init setup_boot_APIC_clock(void)
 	void (*real_handler)(struct clock_event_device *dev);
 	unsigned long deltaj;
 	long delta, deltapm;
+	int pm_referenced = 0;
 
 	apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n"
 		    "calibrating APIC timer ...\n");
@@ -357,7 +358,8 @@ void __init setup_boot_APIC_clock(void)
 	/* Let the interrupts run */
 	local_irq_enable();
 
-	while(lapic_cal_loops <= LAPIC_CAL_LOOPS);
+	while(lapic_cal_loops <= LAPIC_CAL_LOOPS)
+		cpu_relax();
 
 	local_irq_disable();
 
@@ -394,6 +396,7 @@ void __init setup_boot_APIC_clock(void)
 				    "%lu (%ld)\n", (unsigned long) res, delta);
 			delta = (long) res;
 		}
+		pm_referenced = 1;
 	}
 
 	/* Calculate the scaled math multiplication factor */
@@ -423,68 +426,41 @@ void __init setup_boot_APIC_clock(void)
 		    calibration_result / (1000000 / HZ),
 		    calibration_result % (1000000 / HZ));
 
-
-	apic_printk(APIC_VERBOSE, "... verify APIC timer\n");
-
-	/*
-	 * Setup the apic timer manually
-	 */
 	local_apic_timer_verify_ok = 1;
-	levt->event_handler = lapic_cal_handler;
-	lapic_timer_setup(CLOCK_EVT_MODE_PERIODIC, levt);
-	lapic_cal_loops = -1;
 
-	/* Let the interrupts run */
-	local_irq_enable();
+	/* We trust the pm timer based calibration */
+	if (!pm_referenced) {
+		apic_printk(APIC_VERBOSE, "... verify APIC timer\n");
 
-	while(lapic_cal_loops <= LAPIC_CAL_LOOPS);
+		/*
+		 * Setup the apic timer manually
+		 */
+		levt->event_handler = lapic_cal_handler;
+		lapic_timer_setup(CLOCK_EVT_MODE_PERIODIC, levt);
+		lapic_cal_loops = -1;
 
-	local_irq_disable();
+		/* Let the interrupts run */
+		local_irq_enable();
 
-	/* Stop the lapic timer */
-	lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, levt);
+		while(lapic_cal_loops <= LAPIC_CAL_LOOPS)
+			cpu_relax();
 
-	local_irq_enable();
+		local_irq_disable();
 
-	/* Jiffies delta */
-	deltaj = lapic_cal_j2 - lapic_cal_j1;
-	apic_printk(APIC_VERBOSE, "... jiffies delta = %lu\n", deltaj);
+		/* Stop the lapic timer */
+		lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, levt);
 
-	/* Check, if the PM timer is available */
-	deltapm = lapic_cal_pm2 - lapic_cal_pm1;
-	apic_printk(APIC_VERBOSE, "... PM timer delta = %ld\n", deltapm);
+		local_irq_enable();
 
-	local_apic_timer_verify_ok = 0;
+		/* Jiffies delta */
+		deltaj = lapic_cal_j2 - lapic_cal_j1;
+		apic_printk(APIC_VERBOSE, "... jiffies delta = %lu\n", deltaj);
 
-	if (deltapm) {
-		if (deltapm > (pm_100ms - pm_thresh) &&
-		    deltapm < (pm_100ms + pm_thresh)) {
-			apic_printk(APIC_VERBOSE, "... PM timer result ok\n");
-			/* Check, if the jiffies result is consistent */
-			if (deltaj < LAPIC_CAL_LOOPS-2 ||
-			    deltaj > LAPIC_CAL_LOOPS+2) {
-				/*
-				 * Not sure, what we can do about this one.
-				 * When high resultion timers are active
-				 * and the lapic timer does not stop in C3
-				 * we are fine. Otherwise more trouble might
-				 * be waiting. -- tglx
-				 */
-				printk(KERN_WARNING "Global event device %s "
-				       "has wrong
Re: + small-irq-management-simplification.patch added to -mm tree
On Wed, 2007-02-14 at 15:26 -0800, [EMAIL PROTECTED] wrote:
> Subject: small irq management simplification
> From: Jan Beulich [EMAIL PROTECTED]
>
> Use mask_ack_irq() where possible.
>
> Signed-off-by: Jan Beulich [EMAIL PROTECTED]
> Cc: Thomas Gleixner [EMAIL PROTECTED]
> Cc: Ingo Molnar [EMAIL PROTECTED]
> Signed-off-by: Andrew Morton [EMAIL PROTECTED]

Acked-by: Thomas Gleixner [EMAIL PROTECTED]
Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168
On Fri, 2007-02-16 at 21:38 +0100, Michal Piotrowski wrote:
> Hi,
>
> This looks like a tickless stuff

Yup.

> 0xc0139ea0 is in tick_nohz_stop_sched_tick (/mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168).
> 163
> 164		if (need_resched())
> 165			goto end;
> 166
> 167		cpu = smp_processor_id();
> 168		BUG_ON(local_softirq_pending());

Hmm, the BUG_ON is inside of an interrupt disabled region, so we should have bailed out early in the need_resched() check above (because we are in the idle task context according to the stack trace).

Is this reproducible ?

tglx
Re: 2.6.20-git: undefined reference to `smp_call_function_single'
On Fri, 2007-02-16 at 21:08 -0500, Len Brown wrote:
> Yes, an obscure .config, but it used to build before today:
>
> kernel/built-in.o: In function `tick_broadcast_on_off':
> (.text+0x1b6f0): undefined reference to `smp_call_function_single'

Yup, this obscure machine is missing smp_call_function_single().

James ?

tglx
Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168
On Sat, 2007-02-17 at 15:47 +0100, Alex Riesen wrote:
> > > 164		if (need_resched())
> > > 165			goto end;
> > > 166
> > > 167		cpu = smp_processor_id();
> > > 168		BUG_ON(local_softirq_pending());
> >
> > Hmm, the BUG_ON is inside of an interrupt disabled region, so we should have bailed out early in the need_resched() check above (because we are in the idle task context according to the stack trace).
> >
> > Is this reproducible ?
>
> Seen this too (Ubuntu, P4/ht-SMT, SATA, typed from screen):

Can you please apply the patch below, so we can at least see which softirq is pending. This should trigger independently of hrtimers and dynticks. You can keep it compiled in and disable it at the kernel commandline with nohz=off and / or highres=off

tglx

diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index bea304d..deeb90e 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -236,6 +236,11 @@ void cpu_idle(void)
 			 * Otherwise, idle callbacks can misfire.
 			 */
 			local_irq_disable();
+			if (local_softirq_pending() && !need_resched())
+				printk(KERN_ERR
+				       "Idle: local softirq pending: %04x",
+				       local_softirq_pending());
+
 			enter_idle();
 			idle();
 			__exit_idle();
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 95e41f7..91d459f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -165,7 +165,6 @@ void tick_nohz_stop_sched_tick(void)
 		goto end;
 
 	cpu = smp_processor_id();
-	BUG_ON(local_softirq_pending());
 	now = ktime_get();
 
 	/*
Re: Using sched_clock for mmio-trace
On Sat, 2007-02-17 at 15:56 +0100, Andi Kleen wrote:
> > This is one of the reasons why we don't just use good old do_gettimeofday(), since it takes locks and can lead to lock recursion if parts of itself are probed.
>
> do_gettimeofday doesn't take locks. Only restriction is that you can't single step it with long pauses between instructions.

Err, it uses the read side of xtime_lock, so you can not call it from a place which write-locks xtime_lock.

tglx
Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168
On Sat, 2007-02-17 at 17:46 +0100, Alex Riesen wrote:
> > Can you please apply the patch below, so we can at least see which softirq is pending. This should trigger independently of hrtimers and dynticks. You can keep it compiled in and disable it at the kernel commandline with nohz=off and / or highres=off
>
> It did, only one time:
>
> Idle: local softirq pending: 0020<6>USB Universal Host Controller Interface driver v3.0

0x20 is the TASKLET_SOFTIRQ. I have no idea yet how this can happen. Can you please check if this also happens when you add nohz=off to the kernel command line.

tglx
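For readers decoding such a pending mask by hand, the lookup can be sketched as below. This is an illustration only: `sk_first_pending()` and the table are hypothetical, and the softirq order is an assumption about kernels of this period, chosen so that bit 5 (0x20) is TASKLET_SOFTIRQ as stated in the mail.

```c
#include <stddef.h>
#include <string.h>

/* Assumed softirq numbering: the index in the table is the bit number. */
static const char *const sk_softirq_names[] = {
	"HI_SOFTIRQ",		/* bit 0, mask 0x01 */
	"TIMER_SOFTIRQ",	/* bit 1, mask 0x02 */
	"NET_TX_SOFTIRQ",	/* bit 2, mask 0x04 */
	"NET_RX_SOFTIRQ",	/* bit 3, mask 0x08 */
	"BLOCK_SOFTIRQ",	/* bit 4, mask 0x10 */
	"TASKLET_SOFTIRQ",	/* bit 5, mask 0x20 */
};

/* Return the name of the lowest set bit in mask, or NULL if none is known. */
static const char *sk_first_pending(unsigned int mask)
{
	size_t i;

	for (i = 0; i < sizeof(sk_softirq_names) / sizeof(sk_softirq_names[0]); i++)
		if (mask & (1u << i))
			return sk_softirq_names[i];

	return NULL;
}
```

The trailing "<6>" in the log line is the level prefix of the next kernel message, glued on because the debug printk has no newline, so the mask itself is 0x0020.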
Re: 2.6.20-git: undefined reference to `smp_call_function_single'
On Sat, 2007-02-17 at 12:25 -0600, James Bottomley wrote:
> > Yup, this obscure machine is missing smp_call_function_single().
> >
> > James ?
>
> Where's this coming from? smp_call_function_single() is an obscure kvm only API, I think, for x86/ia64 ... it's not supported on any other architecture. The symbol you have is blowing up in the kernel subdirectory, which suggests someone has tried to use it in generic code, which will fail to compile on a lot more than voyager and parisc ...

smp_call_function_single() was added with commit:

eaa70773e750cc09d60938bceacd028bc76b8e3a

    [PATCH] i386: add smp_call_function_single

    Continuing the series of small patches necessary for the perfmon subsystem, here is a patch that adds support for the smp_call_function_single() function for i386. It exists for almost all other architectures but i386. The perfmon subsystem needs it in one case to free some state on a designated remote CPU.

It's not an obscure kvm API :)

But the claim that it is available on almost all other architectures but i386 is wrong. Only x86_64, ia64 and i386 have it.

The function is defined in include/linux/smp.h and there is no indication that it is an architecture specific thingy.

What a steaming pile of

/me grumbles

tglx
[PATCH] tick management: make broadcast dependent on local APIC
The broadcast functionality is only necessary when a local APIC is available. Make the config switch depend on X86_LOCAL_APIC.

This resolves the mach-voyager breakage introduced by the tick management code.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig
index 1df4a1f..2f76725 100644
--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -29,6 +29,7 @@ config GENERIC_CLOCKEVENTS
 config GENERIC_CLOCKEVENTS_BROADCAST
 	bool
 	default y
+	depends on X86_LOCAL_APIC
 
 config LOCKDEP_SUPPORT
 	bool
Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168
On Sat, 2007-02-17 at 23:41 +0100, Michal Piotrowski wrote:
> sudo cat /var/log/messages | grep Idle
> Feb 17 17:35:34 bitis-gabonica kernel: Idle: local softirq pending: 0020<6>hdd: ATAPI 48X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> Feb 17 19:20:01 bitis-gabonica kernel: Idle: local softirq pending: 0020<3>Idle: local softirq pending: 0020<3>Idle: local softirq pending: 0020<7>PM: Removing info for No Bus:vcs7
>
> I can confirm that this is ICH5 SATA controller problem.

The arch/i386/kernel/process.c part of the patch should apply to 2.6.20 as well. Can you check if the problem is there too ?

tglx
Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168
On Sun, 2007-02-18 at 10:50 +0100, Alex Riesen wrote:
> > The arch/i386/kernel/process.c part of the patch should apply to 2.6.20 as well. Can you check if the problem is there too ?
>
> It does not apply and does not look trivially hackable. The code for cpu_idle was introduced in 2ff2d3d7 i386: add idle notifier.

Here you go.

tglx

Index: linux-2.6.20/arch/i386/kernel/process.c
===================================================================
--- linux-2.6.20.orig/arch/i386/kernel/process.c
+++ linux-2.6.20/arch/i386/kernel/process.c
@@ -189,6 +189,13 @@ void cpu_idle(void)
 				play_dead();
 
 			__get_cpu_var(irq_stat).idle_timestamp = jiffies;
+
+			local_irq_disable();
+			if (local_softirq_pending() && !need_resched())
+				printk(KERN_ERR
+				       "Idle: local softirq pending: %04x\n",
+				       local_softirq_pending());
+			local_irq_enable();
 			idle();
 		}
 		preempt_enable_no_resched();
Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
Len, On Fri, 2007-03-16 at 21:32 -0400, Len Brown wrote: [ 36.433917] APIC timer disabled due to verification failure. And NO_HZ is disabled due to that (I get 1000/s timer's interrupts) I haven't investigated that yet. It looks like another new test that my hardware fails to perform... Yes, this is probably caused by SMM code trying to emulate a PS/2 keyboard from a (maybe connected or not) USB keyboard. Unfortunately we have no way to disable this BIOS misfeature in the early boot process. Arjan, Len ? Nope. By definition, SMM is invisible to the OS -- we don't even get a bit that said it occurred (though we'd like one -- it would be really helpful to diagnose issues like this one) I know that it is invisible. Nevertheless I know that the BIOSes emulate PS/2 keyboards from USB via SMM during the boot process until we call the usb_handoff function. So go into BIOS SETUP and see if there is a USB Legacy Emulation feature that you can disable. Sometimes there is not, but disabling onboard USB altogether may help at least prove the issue is in that area. I have more than one box (even original Intel mainboards), where either plugging a PS/2 keyboard or switching off USB makes this problem go away. I built in this test to rule out bogus LAPIC timer calibration values which are sometimes off by factor 2-10. But I also built in a calibration against the PM-Timer, which turned out to be quite reliable and I think the additional verification step is only necessary for sytems without PM-Timer. That was a bit over cautious from my side. I send a patch to avoid this when PM-Timer is available in a separate mail. PM-Timer was invented to work-around the issue that the TSC became unreliable in the face of power management on laptops. In particular, to be able to time duration of OS idle where TSC stopped. While it is not fine grain, and it is not low-latency, is should be very reliable. 
My understanding is that it is implemented as a simple divider right off the system 14MHz clock -- the signal which most motherboard clocks are PLL multiplied up from -- including the 100MHz front-side bus which drives the LAPIC timer. But that said, I don't understand why calibrating the LAPIC timer using the PM-timer is going to be more reliable -- exactly how and why did the previous calibration scheme fail? Maybe I could follow the new logic in apic.c if I saw the apic=debug output for this box.

calibrating APIC timer ...
... lapic delta = 2426884
... PM timer delta = 833908
APIC calibration PIT not consistent with PM Timer: 232ms instead of 100ms
APIC delta adjusted to PM-Timer: 1041737 (2426884)
. delta 1041737
. mult: 44749065
. calibration result: 166677
. CPU clock speed is 4659.0624 MHz.
. host bus clock speed is 166.0677 MHz.

This box is off by factor 2.3 and using the PM-Timer instead of the PIT/jiffies values gives me a correct result. Another one:

APIC calibration not consistent with PM Timer: 2020ms instead of 100ms
APIC delta adjusted to PM-Timer: 1254436 (2534)

Off by factor 20 !! The original APIC timer calibration did:

	local_irq_disable();
	wait_until_pit_underflows();
	t1 = read_apic_counter();
	for (i = 0; i < HZ/10; i++)
		wait_until_pit_underflows();
	t2 = read_apic_counter();

and calculated the APIC timer frequency from the delta of t1 and t2 vs. the 100ms time. This had 2 problems:
1. It gave results, which are off by factor 2-10 on a couple of boxen.
2. Some systems stop there dead as the PIT readout is broken.
I changed it to do:

	local_irq_disable();
	original_pit_handler = pit->handler;
	pit->handler = lapic_calibration_handler;
	loops = 0;
	local_irq_enable();
	wait_until_handler_has_done_HZ/10_loops();

The handler does:

	if (!loops++) {
		t1_apic = read_apic_counter();
		t1_jiffies = jiffies;
		t1_pm = read_pm_timer();
	}
	if (loops == HZ/10) {
		t2_apic = read_apic_counter();
		t2_jiffies = jiffies;
		t2_pm = read_pm_timer();
		done = 1;
	}

If the pmtimer is available, then calculate the APIC timer frequency from the t1_pm/t2_pm delta, otherwise use jiffies. When pm_timer is there, we can trust the calculated value, if not we do a verify run of the periodic apic timer and the pit timer. If this fails - and it fails often due to the SMM crap - then I use the PIT and IPIs. In the first version I did a verification run even when pm_timer was there, but this produced false positives as well, because the lapic timer interrupt is in the same way delayed as the PIT interrupt. I removed this to avoid unnecessary switching to IPIs after I verified, that it always produced false positives when the calibration was done against the PM-Timer. tglx
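The PM-Timer correction described above can be sketched in a few lines of userspace C. This is only an illustration, not the code from apic.c: the helper names are made up for this example, and the only outside fact it relies on is the PM-Timer input clock of 3.579545 MHz. Fed with the deltas from the apic=debug log quoted in this thread, it gives the same "232ms" and "1041737" figures.

```c
#include <assert.h>
#include <stdint.h>

/* ACPI PM-Timer input clock in Hz. */
#define PMTMR_TICKS_PER_SEC 3579545ULL

/* Hypothetical helper: wall time that really elapsed, judged by the PM-Timer. */
static uint64_t pm_delta_to_ms(uint64_t pm_delta)
{
	return pm_delta * 1000ULL / PMTMR_TICKS_PER_SEC;
}

/*
 * Hypothetical helper: scale the measured LAPIC delta down to what it
 * would have been if the measurement window had really been 100ms.
 */
static uint64_t lapic_delta_adjust(uint64_t lapic_delta, uint64_t pm_delta)
{
	const uint64_t pm_100ms = PMTMR_TICKS_PER_SEC / 10;

	assert(pm_delta != 0);
	return lapic_delta * pm_100ms / pm_delta;
}
```

With lapic delta = 2426884 and PM timer delta = 833908, the window was about 232ms long instead of 100ms, so the LAPIC delta has to be scaled by roughly 100/232 before a frequency is derived from it.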
Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
On Sat, 2007-03-17 at 10:56 +0100, Thomas Gleixner wrote:

calibrating APIC timer ...
... lapic delta = 2426884
... PM timer delta = 833908
APIC calibration PIT not consistent with PM Timer: 232ms instead of 100ms
APIC delta adjusted to PM-Timer: 1041737 (2426884)
. delta 1041737
. mult: 44749065
. calibration result: 166677
. CPU clock speed is 4659.0624 MHz.
. host bus clock speed is 166.0677 MHz.

This box is off by factor 2.3 and using the PM-Timer instead of the PIT/jiffies values gives me a correct result. Another one:

APIC calibration not consistent with PM Timer: 2020ms instead of 100ms
APIC delta adjusted to PM-Timer: 1254436 (2534)

Off by factor 20 !! This weird behaviour also can be seen with the BogoMIPS calibration:

Calibrating delay using timer specific routine.. 6428.32 BogoMIPS (lpj=12856647)
Initializing CPU#1
Calibrating delay using timer specific routine.. 103837.25 BogoMIPS (lpj=207674508)

Note, that I never observed that on CPU#0. It always affects CPU#1. tglx
Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
On Sat, 2007-03-17 at 10:56 +0100, Thomas Gleixner wrote: Maybe I could follow the new logic in apic.c if I saw the apic=debug output for this box.

calibrating APIC timer ...
... lapic delta = 2426884
... PM timer delta = 833908
APIC calibration PIT not consistent with PM Timer: 232ms instead of 100ms
APIC delta adjusted to PM-Timer: 1041737 (2426884)
. delta 1041737
. mult: 44749065
. calibration result: 166677
. CPU clock speed is 4659.0624 MHz.
. host bus clock speed is 166.0677 MHz.

This box is off by factor 2.3 and using the PM-Timer instead of the PIT/jiffies values gives me a correct result. I instrumented the lapic calibration on this box:

I1:      999 us  total:    999 us
I2:      999 us  total:   1998 us
...
I28:     999 us  total:  27980 us
I29:  135097 us  total: 163077 us
I30:     881 us  total: 163958 us
...
I98:    1000 us  total: 231918 us
I99:     999 us  total: 232917 us

So it vanishes away for 132 ms, which is exactly the error above. This happens in random places and sometimes I'm lucky that it does not happen at all. tglx
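For anyone who wants to repeat this kind of instrumentation, here is a minimal sketch of how the lost time could be pulled out of such a list of interval lengths. The function name, the nominal period of 1000 us and the slack value are assumptions made for this example; nothing like this is in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: add up the time lost in intervals which overshoot
 * the nominal tick period by more than slack_us.
 */
static uint64_t stall_excess_us(const uint32_t *interval_us, size_t n,
				uint32_t nominal_us, uint32_t slack_us)
{
	uint64_t excess = 0;
	size_t i;

	assert(interval_us != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* Normal jitter (999, 881, 1000 us) stays below the slack. */
		if (interval_us[i] > nominal_us + slack_us)
			excess += interval_us[i] - nominal_us;
	}
	return excess;
}
```

Run over the few intervals shown in the log, only the 135097 us entry is counted, which puts the stall in the same range as the calibration error reported for this box.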
Re: [2/6] 2.6.21-rc2: known regressions
On Sat, 2007-03-17 at 23:41 +0200, Michael S. Tsirkin wrote: a quick ping: on your box that doesn't resume - if you can log in over the network after resume (or somehow run shell commands), does 'date' advance properly or not? (or do you not get that far to be able to tell?) Ingo I just retested - 'date' does not advance after resume for me. This is with NO_HZ *not* set. Sorry it took so long. Update: just re-tested with 2.6.21-rc4, same behaviour: date does not advance after resume from ram. Can you get a full dmesg from boot to resume out of the box ? tglx
Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
Maxim, On Sun, 2007-03-18 at 01:00 +0200, Maxim wrote: Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization [CPU#0 - CPU#1]: Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC warp between CPUs, turning off ^ This one I don't think is related to NO_HZ, maybe it is hardware problem, but it exist without NO_HZ The TSC is checked for synchronization between the CPUs. It's nothing to worry about. We switch off the TSC and use a different clocksource. Is this after resume ? If yes, then something (probably BIOS) is fiddling with the TSC of one CPU when the resume happens. [ 36.217405] ENABLING IO-APIC IRQs [ 36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1 [ 36.433917] APIC timer disabled due to verification failure. This one is now discussed, I will look at it and it is not related to NO_HZ I sent a patch for this yesterday: http://marc.info/?l=linux-kernelm=117408952322631w=2 And I forgot to tell about another problem with (now I know ,hi-resolution timers) That before suspend to ram APIC timer is used and HPET is not used : [EMAIL PROTECTED]:/home/maxim# cat /sys/devices/system/clockevents/clockevents0/registered lapicF:0007 M:3(periodic) C: 1 hpet F:0003 M:1(shutdown) C: 0 lapicF:0007 M:3(periodic) C: 0 [EMAIL PROTECTED]:/home/maxim# But after suspend to ram HPET is 'woken' [EMAIL PROTECTED]:/home/maxim# cat /sys/devices/system/clockevents/clockevents0/registered lapicF:0007 M:3(one shoot) C: 1 hpet F:0003 M:3(one shoot) C: 0 lapicF:0007 M:3(one shoot) C: 0 This is unrelated to suspend / resume. The local apic timers stop (hardware madness), when the CPU enters C3 power state. In this case we switch to HPET (or PIT when HPET is not available) and broadcast the events via Inter Processor Interrupts. This is nothing to worry about. I'm a bit surprised though, that your system was in periodic mode before suspend and switched to one shot mode on resume. Is this reproducible ? 
If yes, can you please provide the dmesg output from boot to resume ? Note that I added those (one shoot), (periodic) descriptions, would be nice to have them in kernel, can I send a patch ? ;-) Sure, just s/shoot/shot/ :) and I see average of 18 IRQs/sec on IRQ 0 So dynticks are working as expected. tglx - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] i386: trust the PM-Timer calibration of the local APIC timer
On Sun, 2007-03-18 at 00:12 -0800, Andrew Morton wrote: On Sat, 17 Mar 2007 01:04:56 +0100 Thomas Gleixner [EMAIL PROTECTED] wrote: When PM-Timer is available for local APIC timer calibration we can skip the verification of the calibrated time value. The resulting error is quite small on a bunch of evaluated platforms and does less harm than the observed false positives. We need to keep the verification on systems which have no PM-Timer to avoid bogus local APIC timer calibrations in the range of factor 2-10, which can be observed when switching off the PM-timer support in the kernel configuration. The wrong calibration values are probably caused by SMM code trying to emulate a PS/2 keyboard from a (maybe connected or not) USB keyboard. This prohibits the accurate delivery of PIT interrupts, which are used to calibrate the local APIC timer. Unfortunately we have no way to disable this BIOS misfeature in the early boot process. Also add the dropped cpu_relax() back to the wait loops. Is this a for-2.6.21 thing? Yes please. The false positives of the original calibration are annoying. tglx
Re: [PATCH] hrtimer: prevent overrun DoS in hrtimer_forward()
On Sun, 2007-03-18 at 17:16 -0400, Chuck Ebbert wrote: Thomas Gleixner wrote: I'd prefer this one: The maximum seconds value we can handle on 32bit is LONG_MAX.

diff --git a/include/linux/ktime.h b/include/linux/ktime.h
index c68c7ac..248305b 100644
--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -57,7 +57,11 @@ typedef union {
 } ktime_t;
 
 #define KTIME_MAX			((s64)~((u64)1 << 63))
-#define KTIME_SEC_MAX			(KTIME_MAX / NSEC_PER_SEC)
+#if (BITS_PER_LONG == 64)
+# define KTIME_SEC_MAX			(KTIME_MAX / NSEC_PER_SEC)
+#else
+# define KTIME_SEC_MAX			LONG_MAX
+#endif
 
 /*
  * ktime_t definitions when using the 64-bit scalar representation:

Just to be clear: this replaces the earlier patch, right? This replaces the fix Andrew did. http://marc.info/?l=linux-kernel&m=117407812411997&w=2 tglx
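To see why the 32bit case needs its own limit, the arithmetic can be checked in userspace. The following is a small sketch, not kernel code: the macros are re-created with stdint types, and the two helper functions are hypothetical names used only here. It shows that KTIME_MAX / NSEC_PER_SEC is about 9.2 billion seconds, which does not fit into a 32bit long, so a seconds value held in a long could never reach that limit there.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel macros, using stdint types. */
#define NSEC_PER_SEC 1000000000LL
#define KTIME_MAX    ((int64_t)~((uint64_t)1 << 63))

/* Hypothetical helper: KTIME_SEC_MAX for a given width of 'long'. */
static int64_t ktime_sec_max(int bits_per_long)
{
	assert(bits_per_long == 32 || bits_per_long == 64);
	if (bits_per_long == 64)
		return KTIME_MAX / NSEC_PER_SEC;
	return INT32_MAX;	/* what LONG_MAX is on a 32bit kernel */
}

/* Hypothetical helper: does KTIME_MAX / NSEC_PER_SEC exceed LONG_MAX? */
static int sec_max_overflows_long(int bits_per_long)
{
	int64_t long_max = (bits_per_long == 64) ? INT64_MAX : INT32_MAX;

	return KTIME_MAX / NSEC_PER_SEC > long_max;
}
```

On 64bit the quotient fits, so the old definition stays; on 32bit the limit is capped at LONG_MAX, which is what the #else branch of the patch does.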
Re: [PATCH] hrtimer: prevent overrun DoS in hrtimer_forward()
On Sun, 2007-03-18 at 17:53 -0400, Chuck Ebbert wrote: Just to be clear: this replaces the earlier patch, right? This replaces the fix Andrew did. http://marc.info/?l=linux-kernel&m=117407812411997&w=2 Right, but is the original "Prevent DOS" patch from you still needed? Or did Andrew's patch replace that one, and now this replaces his? The original patch is still needed - it handles the problem in the first place. I forgot to compile it for 32bit and Andrew did a fix, which I replaced. tglx
Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})
On Mon, 2007-03-19 at 18:10 +0100, Stefan Prechtel wrote: So I tried to boot with nolapic on battery and with this option the kernel (and system) starts as it should. If you need more information, I will send it to you. Can you please provide your .config and a bootlog from a boot with nolapic and without. Also please add apic=verbose to the commandline. Can you please use Linus' latest git snapshot http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.21-rc4-git4.bz2 or pull from Linus' git repository. Can you please open a new bug (Category: Timers, Component: Other) on http://bugzilla.kernel.org and upload the files there, so we avoid distributing them via LKML. Thanks, tglx
Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})
On Mon, 2007-03-19 at 18:36 +0100, Thomas Gleixner wrote: On Mon, 2007-03-19 at 18:10 +0100, Stefan Prechtel wrote: So I tried to boot with nolapic on battery and with this option the kernel (and system) starts as it should. If you need more information, I will send it to you. Can you please provide your .config and a bootlog from a boot with nolapic and without. Also please add apic=verbose to the commandline. Can you please use Linus' latest git snapshot http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.21-rc4-git4.bz2 or pull from Linus' git repository. Can you please open a new bug (Category: Timers, Component: Other) on http://bugzilla.kernel.org and upload the files there, so we avoid distributing them via LKML. Oh, a bootlog with AC plugged in would be great too. Also can you please enable CONFIG_SYSRQ and hit SysRq-Q once, when the slowness kicks in. Thanks, tglx
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
Matt, On Mon, 2007-03-19 at 12:08 -0500, Matt Mackall wrote: On Sun, Mar 18, 2007 at 03:31:50PM -0500, Josh Boyer wrote: On Sun, Mar 18, 2007 at 02:18:12PM -0500, Matt Mackall wrote: I'm well aware of all that. I wrote a NAND driver just last month. Let's consider this table: HARD drives MTD device Consists of sectors Consists of eraseblocks Sectors are small (512, 1024 bytes) Eraseblocks are larger (32KiB, 128KiB) read sector and write sector read, write, and erase block Bad sectors are re-mappedBad eraseblocks are not hidden HDD sectors don't wear out Eraseblocks get worn-out N/A NAND flash addressed in pages N/A NAND flash has OOB areas N/A (?) NAND flash requires ECC Disks have OOB areas with ECC, it's just nicely hidden inside the drive. They also typically have physical sectors bigger than 512 bytes, again hidden. The difference is that the harddrive has an intellegent controller, which hides all this away. NAND FLASH has not and we have to do it in software. If the end goal is to end up with something that looks like a block device (which seems to be implied by adding transparent wear leveling Nope, not the end goal. It's more about wear-leveling across the entire flash chip than it is presenting a block like device. It seems to be about spanning devices and repartitioning as well. Hence the analogy with LVM. Yes, UBI is a kind of LVM for FLASH and we did think quite a time about reusing LVM before we went the UBI way. and bad block remapping), then I don't see any reason it can't be done in device mapper. The 'smarts' of mtdblock could in fact be pulled up There is nothing smart about mtdblock. And mtdblock has nothing to do with UBI. Note the scare quotes. Device mapper runs on top of a block device. And mtdblock is currently the block interface that MTD exports. And it has 'smarts' that hide handling of sub-eraseblock I/O. I'm clearly talking about an approach that doesn't involve UBI at all. MTD block has no 'smarts' at all. 
It is a stupid and broken hack, which you can utilize to lose data and wear your FLASH out. In the end, a block device is something which does random access block-oriented I/O. Disk and NAND both fit that description. NAND very much doesn't fit the random access part of that. For writes you have to write in incrementing pages within eraseblocks. And? You can't do I/O smaller than a sector on a disk. Should we export block devices with 16/32/64/128 KiB size ? If not, we would need to put a lot of clever functionality into the mtd block device code, which we decided to put into UBI, so FLASH aware file systems can use this shared functionality too. If someone wants to implement an intellegent mtd block device, which allows to run arbitrary filesystems, then it should be done on top of UBI. It's not rocket science, but nobody bothers as we have functional FLASH filesystems which do their job better w/o any notion of a block device. A disk _IS_ fundamentally different to FLASH and all the magic which is done inside of CF-Cards and USB-Sticks is just hiding this away. Most of the controller chips in these devices are broken and I would never ever store any important data on such. The main points of UBI are: - wear levelling across the complete device - background handling of bitflips - safe updates - handling of static volumes, which are easily accessible for bootloaders Nothing of this is anyway near of LVM and disks. The only LVM alike feature is dynamic creation/deletion/resizing of volumes. tglx - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})
Stefan, On Mon, 2007-03-19 at 19:53 +0100, Stefan Prechtel wrote: You can find the files here: http://bugzilla.kernel.org/show_bug.cgi?id=8235 Thanks for providing the data. Your ACPI tables don't provide information about the power states (C-States), but your BIOS seems to switch the CPUs into deeper power states when it runs on battery. In those deeper power states the local APIC timers and the TSC are stopped. So the machine waits forever for the next timer interrupt. We have a broadcast mechanism for this, which gets activated from ACPI, but the broadcast mechanism is not activated: [3.798000] Clock Event Device: pit [3.798000] tick_broadcast_mask: Can you please boot with 2.6.20 or earlier and check the output of /proc/interrupts ? IRQ#0 and the LOC (local APIC timer) interrupts should increment at the same frequency. tglx
Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})
On Mon, 2007-03-19 at 20:49 +0100, Stefan Prechtel wrote: Can you please boot with 2.6.20 or earlier and check the output of /proc/interrupts ? IRQ#0 and the LOC (local APIC timer) Interrupts should increment in the same frequency. tglx Here is the output of /proc/interrupts on 2.6.20: CPU0 CPU1 0: 7089 0 local-APIC-edge-fasteio timer Can you provide the numbers for LOC too ? 0: 29801420 29793520IO-APIC-edge timer ... LOC: 119180305 119180039 And please do a sleep 10; between two reads, so I can see the deltas. and this on 2.6.21-rc*: CPU0 CPU1 0:255 0 local-APIC-edge-fasteoi timer on 2.6.21-rc* the number 255 doesn't change. Yes. I know. We rely on the local APIC, if the ACPI code does us not tell to use the PIT broadcast, sigh. But if it is ACPI relevant, shouldn't it boot with acpi=off? I've tried with acpi=off and noapic but only with nolapic it started. And the content of /proc/acpi/processor/C000/power shows only one c-state; shouldn't it show more C-states? (please correct me if I'm wrong) # cat /proc/acpi/processor/C000/power active state:C1 max_cstate: C8 bus master activity: maximum allowed latency: 2000 usec states: *C1: type[C1] promotion[--] demotion[--] latency[000] usage[] duration[] Yup. It should. Thanks, tglx - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})
On Mon, 2007-03-19 at 21:35 +0100, Stefan Prechtel wrote:

           CPU0       CPU1
  0:      28289          0  local-APIC-edge-fasteio  timer
...
LOC:      28237      28236

after a read: (I hope that is this what you want :-)

           CPU0       CPU1
  0:      30344          0  local-APIC-edge-fasteio  timer
...
LOC:      30292      30291

Is this with AC plugged in ? If yes, please provide the same numbers for battery mode. What's the output of cat /proc/acpi/processor/C000/power for 2.6.20 and 2.6.21-rc4-latest-git with and w/o AC ? Can you also please upload a bootlog with and without AC of 2.6.20 to bugzilla ? Thanks, tglx
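The comparison asked for in this thread (IRQ#0 and LOC should advance at the same rate on 2.6.20) boils down to comparing two deltas. A minimal sketch, assuming the counters of one CPU have already been parsed out of two reads of /proc/interrupts; the function name and the tolerance parameter are made up for this example:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 when both counters moved between the
 * two reads and their deltas differ by no more than 'tolerance'.
 */
static int timers_in_step(uint64_t irq0_before, uint64_t irq0_after,
			  uint64_t loc_before, uint64_t loc_after,
			  uint64_t tolerance)
{
	uint64_t d_irq0, d_loc, diff;

	assert(irq0_after >= irq0_before && loc_after >= loc_before);
	d_irq0 = irq0_after - irq0_before;
	d_loc = loc_after - loc_before;

	/* A counter which does not move at all is not in step with anything. */
	if (d_irq0 == 0 || d_loc == 0)
		return 0;

	diff = d_irq0 > d_loc ? d_irq0 - d_loc : d_loc - d_irq0;
	return diff <= tolerance;
}
```

With the CPU0 numbers reported here both counters advanced by 2055 between the reads, so they are in step. An IRQ#0 count which stays at 255, as reported for 2.6.21-rc*, fails the check.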
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Mon, 2007-03-19 at 15:12 -0500, Matt Mackall wrote: Should we export block devices with 16/32/64/128 KiB size ? Sure, why not? Simply because we want to have the ability to write fine grained in order to write data safely to FLASH. If we export those large sizes we lose this ability and have to write full erase blocks for a couple of bytes. This simply breaks JFFS2 and you can do the math yourself what that means for the lifetime of FLASH, when you write small data chunks in fast sequences and want to make sure that they are written to FLASH immediately. A disk _IS_ fundamentally different from FLASH and all the magic which is done inside of CF-Cards and USB-Sticks is just hiding this away. And yet they're still both block devices. That our current block layer doesn't handle one as well as the other is something we should fix instead of inventing a whole new full-feature but incompatible block layer on the side. And yet they are still broken and unreliable. And you can wear them out in no time, just because they are stupid and do full eraseblock updates when you write one sector. No thanks. A bunch of people have done experiments with those beasts and they are unusable for environments where we need to make sure that data is on FLASH. UBI is not an incompatible block layer. It allows implementing a very clever block layer on top. And you can use just one large partition and small ones for your kernel image and bootloader, which still get the benefits of data integrity (by doing background safe copies on bit flips) and the easy implementation in an IPL. tglx
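The "do the math yourself" part can be written down as a rough model. This is a sketch under simplifying assumptions, not a description of any driver: it assumes that a block-granular device rewrites (and therefore erases) one whole eraseblock per small write, and that a FLASH aware writer appends chunks page by page and needs one erase per eraseblock it fills. Function names and the example sizes (2 KiB chunks, 64 KiB eraseblocks) are chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Erases when every small write rewrites a complete eraseblock. */
static uint64_t erases_block_granular(uint64_t n_chunks)
{
	return n_chunks;
}

/* Erases when chunks are appended page by page into eraseblocks. */
static uint64_t erases_page_granular(uint64_t n_chunks, uint64_t chunk_bytes,
				     uint64_t eraseblock_bytes)
{
	uint64_t per_block;

	assert(chunk_bytes > 0 && eraseblock_bytes >= chunk_bytes);
	per_block = eraseblock_bytes / chunk_bytes;

	/* Round up: a partly filled eraseblock still costs one erase. */
	return (n_chunks + per_block - 1) / per_block;
}
```

For 32 chunks of 2 KiB on 64 KiB eraseblocks this model gives 32 erases for the block-granular case against 1 for the page-granular case, i.e. the FLASH wears out about 32 times faster.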
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Mon, 2007-03-19 at 14:54 -0500, Matt Mackall wrote: (UBI also has static volumes which LVM doesn't but that is an aside.) If a static volume is simply a non-dynamic volume, then device mapper can do that too. And countless other things. Which is not an aside. UBI growing to do all the things that device mapper does is exactly the thing we should be seeking to avoid. No it can't and device mapper sits on top of block devices. FLASH is no block device. Period. Device mapper can not provide a simple easy to decode scheme for boot loaders. We need to be able to boot out of 512 - 2048 byte of NAND FLASH and be able to find the kernel or second stage boot loader in this unordered device. And no, fixed addresses do not work. Do you want to implement device mapper into your Initialial Bootloader stage ? That's why I suggested fixing the MTD layers that present block devices first in the part of my reply that you cut off. It seems to me that you're really after getting flash to look like a block device, which would enable device mapper to be used for something similar to UBI. That's fine, but until someone does that work UBI fills a need, has users, and has an existing implementation. False starts that get mainlined delay or prevent things getting done right. The question is and remains is UBI the right way to do things? Not is UBI the easiest way to do things? or is UBI something people have already adopted? If the right way is instead to extend the block layer and device mapper to encompass the quirks of NAND in a sensible fashion, then UBI should not go in. No, block layer on top of FLASH needs 80% of the functionality of UBI in the first place. You need to implement a clever journalling block device emulator in order to keep the data alive and the FLASH not weared out within no time. You need the wear levelling, otherwise you can throw away your FLASH in no time. 
Let me draw a picture so we have something to argue about: iSCSI/nbd(6) | filesystem {swap | ext3ext3 jffs2 \ | || / / \ | dm-crypt-snapshot(5) / device mapper -|\ \ | / | partitioning / | | partitioning(4) |wear leveling(3) / | | / | block concatenation | ||| | \ bad block remapping(2) ||| | MTD raw block { raw block devices with no smarts(1) / | \ \ hardware { NANDNAND NAND NAND Notes: 1. This would provide a block device that allowed writing pages and a secondary method for erasing whole blocks as well as a method for querying/setting out of band information. Forget about OOB data. OOB data is reserved for ECC. Please read the recommendations of the NAND FLASH manufacturers. NAND gets less reliable with higher density devices and smaller processes. 2. This would hide erase blocks either by using an embedded table or out of band info. This could stack on top of block concatenation if desired. Hide erase blocks ? UBI does not hide anything. It maps logical eraseblocks, which are exposed to the clients to arbitrary physical eraseblocks on the FLASH device in order to provide across device wear levelling. This is fundamentaly different to device mapper. 3. This would provide wear leveling, and probably simultaneously provide relatively efficient and safe access to write sector and page-sized I/O. Below this level, things had better be comfortable with the limitations of NAND if they want to work well. I don't see how this provides across device wear levelling. 4. JFFS2 has its own wear-leving scheme, as do several other filesystems, so they probably want to bypass this piece of the stack. JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2s own wear levelling sucks. 5. We don't reimplement higher pieces of the stack (dm-crypt, snapshot, etc.). Why should we reimplement that ? 6. We make some things possible that simply aren't otherwise. And this picture isn't even interesting yet. 
Imagine a dm-cache layer that caches data read from disks in high-speed flash. Or using dm-mirror to mirror writes to local flash over NBD or to a USB drive. Neither of these can be done 'right' in a stack split between device mapper and UBI. Err. Implement a clever block layer on top of UBI and use all the goodies you want including device mapper. tglx - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at
Re: [PATCH] [REVIEW] Fix irqpoll on IA64 (timer interrupt != 0)
On Mon, 2007-03-19 at 19:13 +0100, Bernhard Walle wrote: That requires changes in Linux-generic files. The default of timer_irq is 0, so the patch doesn't break i386/x86_64. However, other platforms may also have a timer interrupt that is not zero, so they can use the new set_timer_interrupt() function as well. The patch is against 2.6.21-rc4. Please give me your input on how to improve it if you don't like the way I did the change. irqpoll is required to work with kdump in some situations and that's why I discovered that kdump doesn't work on that platform (HP rx2660). Signed-off-by: Bernhard Walle [EMAIL PROTECTED] Acked-by: Thomas Gleixner [EMAIL PROTECTED]
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:

If a static volume is simply a non-dynamic volume, then device mapper can do that too. And countless other things. Which is not an aside. UBI growing to do all the things that device mapper does is exactly the thing we should be seeking to avoid.

No it can't and device mapper sits on top of block devices. FLASH is no block device. Period.

Which of the following two properties does it lack?
- discrete blocks
- non-sequential access to blocks
When you do the obvious s/blocks/eraseblocks/, this appears to be true.

It appears to be, but it is not. You enforce semantics on a device, which it does not have.

Saying "but I can't do I/O smaller than the blocksize" doesn't change this any more than it would for disks.

There is a huge difference. Disk block size is 512 bytes and FLASH block size is min 16KiB and up to 256KiB. Just do the math:

Write sampling data streams in 2KiB chunks to your uber devicemapper on a 1GiB device with 64KiB erase block size: Fine grained FLASH aware writes allow 32 chunks in a block without erasing the block. Your method erases the block 32 times to write the same amount of data.

Result: You wear out the flash 32 times faster. Cool feature.

Saying "but I can do smaller I/O efficiently in some circumstances" also doesn't change it.

We can do it under _any_ circumstances and that _does_ change it. Implementing a clever block device layer on top of UBI is simple and would provide FLASH page sized I/O, i.e. 2KiB in the above example.

In historical UNIX, some tapes were block devices too. Because they supported seek().

I'm impressed. How exactly are some tapes comparable to FLASH chips ? Your next proposal is to throw away MTD-utils and use mt instead ?

Device mapper can not provide a simple easy to decode scheme for boot loaders. We need to be able to boot out of 512 - 2048 bytes of NAND FLASH and be able to find the kernel or second stage boot loader in this unordered device. And no, fixed addresses do not work. Do you want to implement device mapper into your Initial Bootloader stage ?

This is exactly the same problem as booting on a desktop PC. But somehow LILO manages. My first Linux box had a hell of a lot less disk than the platform I bootstrapped (and wrote NAND drivers for) last month had in NAND.

No, it is not. You get the absolute sector address of your second stage and this is a complete no-brainer. The translation is done in the DISK device.

You simply ignore the fact, that inside each disk, USB Stick, CF-CARD, whatever - there is a more or less intelligent controller device, which does the mapping to the physical storage location. There is _NO_ such thing on a bare FLASH chip.

It does not matter, whether your embedded device had more NAND space than my old CP/M machine's floppy. It simply matters, that even the old CP/M floppy device had some rudimentary intelligence on board.

Furthermore I want to be able to get the bitflip correction on my second stage loader / kernel in the same safe way as we do it for everything else and still be able to bootstrap that from an extremely small bootloader.

If the right way is instead to extend the block layer and device mapper to encompass the quirks of NAND in a sensible fashion, then UBI should not go in.

No, block layer on top of FLASH needs 80% of the functionality of UBI in the first place.

Incorrect. A block-based filesystem on top of flash needs this functionality. But a block device suitable to device mapper layering (which then provides the functionality) does not.

How exactly does device mapper:
A) across device wear levelling ?
B) dynamic partitioning for FLASH aware file systems ?
C) across device wear levelling for FLASH aware file systems ?
D) background bit-flip corrections (copying affected blocks and recycle the old one) ?
E) allow position independent placement of the second stage bootloader ?

You need to implement a clever journalling block device emulator in order to keep the data alive and the FLASH not worn out within no time. You need the wear levelling, otherwise you can throw away your FLASH in no time.

And that's why it's in my picture.

Yes, it is in your picture, but:
1) it excludes FLASH aware file systems and UBI does not.
2) your picture does still not explain how it does achieve the above A), B), C), D) and E)

Your extra path for partitioning(4) and JFFS2 is just a weird hack, which makes your proposal completely absurd.

Let me draw a picture so we have something to argue about:

iSCSI/nbd(6) | filesystem {swap | ext3ext3 jffs2 \ | || / / \ | dm-crypt-snapshot(5) / device mapper -|\ \ | / | partitioning / |
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Mon, 2007-03-19 at 16:36 -0500, Matt Mackall wrote:
On Mon, Mar 19, 2007 at 11:06:33PM +0200, Artem Bityutskiy wrote:
On Mon, 2007-03-19 at 14:54 -0500, Matt Mackall wrote:

The issue is 14000 lines of patch to make a parallel subsystem.

Parallel system exists since very long. One is flash->SW_or_HW_FTL->all_blkdev_stuff. The other is MTD->JFFS2. Think about _why_ there are 2 of them. Hint - reliability, performance. Your ranting basically says that only the first one makes sense. This is not true.

A better way would be for MTD to deliver a block dev with a rich enough interface for JFFS2 to use efficiently in the first place. Yes, I know that can't be done with the current block dev layer. But that's what the source is for.

Why the hell would JFFS2 need a block device interface ? What's the gain ?

We enhance the second branch, not the first, please, realize this. Both branches have their user base, and have always had.

iSCSI/nbd(6) | filesystem {swap | ext3ext3 jffs2 \ | || / / \ | dm-crypt-snapshot(5) / device mapper -|\ \ | / | partitioning / | | partitioning(4) |wear leveling(3) / | | / | block concatenation | ||| | \ bad block remapping(2) ||| | MTD raw block { raw block devices with no smarts(1) / | \ \ hardware { NANDNAND NAND NAND

Matt, as I pointed in the first mail, flash != block device.

And as I pointed out, you're wrong. It is both block oriented (eraseBLOCK??) and random access. That's what a block device is. The fact that it doesn't look like the other things that Linux currently calls a block device and supports well is another matter.

It does well matter, as it is not a block device. It is a FLASH device and you can do as many comparisons of eraseBLOCK as you want, you do not turn FLASH into a DISK. Again: Disks (including CF-Cards and USB-Sticks) have intelligent controllers, which abstract the hardware oddities away and present you a block device.

In your picture I see NAND->MTD raw block. So am I right that you assume that we already have a decent FTL? The fact is that we do not.

No. Look at the picture for more than two seconds, please. I can tell you didn't do this because you didn't manage to find (1) which explicitly says "with no smarts". And you also cut out the footnote where I explained what I meant by "with no smarts". Find the spots marked (2) and (3). These are your FTL.

And where please are (2) and (3) inside of device mapper ?

Please, bear in mind that decent FTL is difficult and an FS on top of FTL is slow, FTL hits performance considerably.

...and if you'd actually looked at the picture, you'd have seen JFFS2 bypassing it. Along with another footnote explaining it.

The (4) partitioning and JFFS2 on top is a step back from the current UBI functionality. Now we can have resizable partitioning even for JFFS2 and JFFS2 can utilize the UBI wear levelling, which is way better than the crude heuristics of JFFS2.

You want to force FLASH into device mapper for some strange and non-obvious reason. Just the coincidence of eraseBLOCK and BLOCKdevice is not really convincing.

You impose the usage of eraseblock size on FLASH, which is simply wrong: DISK has a 1:1 relationship of eraseblock and minimal I/O. FLASH has not.

I did the math in a different mail and I'm not buying your factor 32 FLASH life time reduction for the price of having a bunch of lines of code less in the kernel.

If you really consider running ext3, xfs or whatever on top of FLASH, please go and do the homework on CF-Cards and USB-Sticks. Run them into the fast wearout death. And device mapper does not help anything to avoid that.

Running ext3 on top of FLASH with a minimal I/O size of erase block size is simply braindead.

	tglx

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
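Editor's note: the "factor 32" in the mail above can be checked with a few lines of C. This is only a rough sketch under stated assumptions: the function names are made up for illustration, and the model ignores garbage collection, wear-levelling moves and bad blocks. It compares one erase per 2KiB write (erase-block-granular I/O) with one erase per filled 64KiB eraseblock (page-granular, flash-aware I/O).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: every write of `chunk` bytes forces an erase of
 * the eraseblock it lands in, because the minimal I/O unit is the
 * whole eraseblock. One erase per chunk written.
 */
uint64_t erases_block_granular(uint64_t total_bytes, uint64_t chunk)
{
	assert(chunk > 0);
	return (total_bytes + chunk - 1) / chunk;
}

/*
 * Hypothetical model: chunks are appended page by page, so a block is
 * erased once and then filled completely before the next erase.
 */
uint64_t erases_page_granular(uint64_t total_bytes, uint64_t eraseblock)
{
	assert(eraseblock > 0);
	return (total_bytes + eraseblock - 1) / eraseblock;
}

/* Ratio of the two erase counts for the same amount of data. */
uint64_t wear_factor(uint64_t total_bytes, uint64_t chunk, uint64_t eraseblock)
{
	uint64_t fine = erases_page_granular(total_bytes, eraseblock);

	assert(fine > 0);
	return erases_block_granular(total_bytes, chunk) / fine;
}
```

With the numbers from the thread (1GiB of data, 2KiB chunks, 64KiB eraseblocks), wear_factor(1ULL << 30, 2048, 65536) evaluates to 32 in this model.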
Re: [patch 1/13] signal/timer/event fds v7 - anonymous inode source ...
Davide,

On Mon, 2007-03-19 at 16:47 -0700, Davide Libenzi wrote:
> This patch add an anonymous inode source, to be used for files that need
> and inode only in order to create a file*. We do not care of having an
> inode for each file, and we do not even care of having different names in
> the associated dentries (dentry names will be same for classes of file*).
> This allow code reuse, and will be used by epoll, signalfd and timerfd
> (and whatever else there'll be).

> +int aino_getfd(int *pfd, struct inode **pinode, struct file **pfile,
> +	       char const *name, const struct file_operations *fops, void *priv)
> +{
> +	struct qstr this;
> +	struct dentry *dentry;
> +	struct inode *inode;
> +	struct file *file;
> +	int error, fd;
> +
> +	error = -ENFILE;
> +	file = get_empty_filp();
> +	if (!file)
> +		goto eexit_1;

make this "return -ENFILE;" please

> +	inode = aino_getinode();
> +	if (IS_ERR(inode)) {
> +		error = PTR_ERR(inode);
> +		goto eexit_2;

Can you please use a bit more descriptive labels ? e.g:
		goto out_filp;

> +	}
> +
> +	error = get_unused_fd();
> +	if (error < 0)
> +		goto eexit_3;

e.g:
		goto out_inode;

> +	fd = error;
> +
> +	/*
> +	 * Link the inode to a directory entry by creating a unique name
> +	 * using the inode sequence number.
> +	 */
> +	error = -ENOMEM;
> +	this.name = name;
> +	this.len = strlen(name);
> +	this.hash = 0;
> +	dentry = d_alloc(aino_mnt->mnt_sb->s_root, &this);
> +	if (!dentry)
> +		goto eexit_4;

e.g:
		goto out_fd;

> +static int ainofs_delete_dentry(struct dentry *dentry)
> +{
> +	/*
> +	 * We faked vfs to believe the dentry was hashed when we created it.
> +	 * Now we restore the flag so that dput() will work correctly.
> +	 */
> +	dentry->d_flags |= DCACHE_UNHASHED;
> +	return 1;
> +}

Please put either "struct ainofs_dentry_operations ..." below the next function or move ainofs_delete_dentry() above "struct ainofs_dentry_operations ...".

It's annoying to look up the prototypes and implementation back and forth.

> +static struct inode *aino_getinode(void)
> +{
> +	return igrab(aino_inode);
> +}

Please use igrab(aino_inode); directly in this one single place above. That saves us a prototype and a useless static function with no value.

> +/*
> + * A single inode exist for all aino files. On the contrary of pipes,
> + * aino inodes has no per-instance data associated, so we can avoid
> + * the allocation of multiple of them.
> + */
> +static struct inode *aino_mkinode(void)
> +{
> +	int error = -ENOMEM;
> +	struct inode *inode = new_inode(aino_mnt->mnt_sb);
> +
> +	if (!inode)
> +		goto eexit_1;

		return ERR_PTR(-ENOMEM);

> +	inode->i_fop = &aino_fops;

> +}
> +
> +static int ainofs_get_sb(struct file_system_type *fs_type, int flags,
> +			 const char *dev_name, void *data, struct vfsmount *mnt)
> +{
> +	return get_sb_pseudo(fs_type, "aino:", NULL, AINOFS_MAGIC, mnt);
> +}

Please put either "struct file_system_type aino_fs_typ ..." below this function or move ainofs_get_sb() above "struct file_system_type aino_fs_typ ...".

> +static int __init aino_init(void)
> +{
> +
> +	if (register_filesystem(&aino_fs_type))
> +		goto epanic;
> +
> +	aino_mnt = kern_mount(&aino_fs_type);
> +	if (IS_ERR(aino_mnt))
> +		goto epanic;
> +
> +	aino_inode = aino_mkinode();
> +	if (IS_ERR(aino_inode))
> +		goto epanic;
> +
> +	return 0;
> +
> +epanic:
> +	panic("aino_init() failed\n");

Panic ? It's not life critical - is it ? A printk(KERN_ERR...) and a return -Exx would be sufficient.

	tglx
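Editor's note: the label style requested in the review above (return directly when nothing has to be undone, otherwise jump to a label named after what gets released) can be shown in a small userspace sketch. Everything here is hypothetical: struct ctx, ctx_setup() and the fail_at switch are stand-ins for illustration, and malloc()/free() replace the kernel allocators from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Stand-ins for the three resources acquired in sequence. */
struct ctx {
	void *filp;
	void *inode;
	void *dentry;
};

/*
 * fail_at simulates an allocation failure:
 * 0 = none, 1 = filp, 2 = inode, 3 = dentry.
 */
int ctx_setup(struct ctx *c, int fail_at)
{
	int error;

	assert(c != NULL);

	c->filp = (fail_at == 1) ? NULL : malloc(16);
	if (!c->filp)
		return -ENFILE;	/* nothing to undo yet: plain return */

	c->inode = (fail_at == 2) ? NULL : malloc(16);
	if (!c->inode) {
		error = -ENOMEM;
		goto out_filp;
	}

	c->dentry = (fail_at == 3) ? NULL : malloc(16);
	if (!c->dentry) {
		error = -ENOMEM;
		goto out_inode;
	}
	return 0;

	/* Labels say what is released; they unwind in reverse order. */
out_inode:
	free(c->inode);
	c->inode = NULL;
out_filp:
	free(c->filp);
	c->filp = NULL;
	return error;
}
```

Compared with numbered labels such as eexit_2, a reader can tell from "goto out_filp" alone which cleanup runs, without scrolling to the end of the function.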
Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})
On Mon, 2007-03-19 at 22:51 +0100, Stefan Prechtel wrote:
> 2007/3/19, Thomas Gleixner [EMAIL PROTECTED]:
> > On Mon, 2007-03-19 at 21:35 +0100, Stefan Prechtel wrote:
> > >            CPU0       CPU1
> > >   0:      28289          0    local-APIC-edge-fasteio  timer
> > > ...
> > > LOC:      28237      28236
> > >
> > > after a read: (I hope that is this what you want :-)
> > >
> > >            CPU0       CPU1
> > >   0:      30344          0    local-APIC-edge-fasteio  timer
> > > ...
> > > LOC:      30292      30291
> >
> > Is this with AC plugged in ? If yes, please provide the same numbers for
> > battery mode.
>
> Yes. And here is the output for battery mode (2.6.20):
>
>            CPU0       CPU1
>   0:     292153          0    local-APIC-edge-fasteio  timer
> LOC:     292114     292113
>
>            CPU0       CPU1
>   0:     293263          0    local-APIC-edge-fasteio  timer
> LOC:     293224     293223

Hmm. Can you please apply the following patch on top of 2.6.20 and check, if the WARN_ON_ONCE triggers when you boot w/o AC plugged ?

Thanks,

	tglx

Index: linux-2.6.20/arch/i386/kernel/apic.c
===================================================================
--- linux-2.6.20.orig/arch/i386/kernel/apic.c
+++ linux-2.6.20/arch/i386/kernel/apic.c
@@ -1174,6 +1174,8 @@ void switch_APIC_timer_to_ipi(void *cpum
 	cpumask_t mask = *(cpumask_t *)cpumask;
 	int cpu = smp_processor_id();
 
+	WARN_ON_ONCE(1);
+
 	if (cpu_isset(cpu, mask) &&
 	    !cpu_isset(cpu, timer_bcast_ipi)) {
 		disable_APIC_timer();
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Mon, 2007-03-19 at 20:05 -0500, Matt Mackall wrote:
On Tue, Mar 20, 2007 at 01:42:46AM +0100, Thomas Gleixner wrote:
On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:

This is exactly the same problem as booting on a desktop PC. But somehow LILO manages. My first Linux box had a hell of a lot less disk than the platform I bootstrapped (and wrote NAND drivers for) last month had in NAND.

No, it is not. You get the absolute sector address of your second stage and this is a complete no-brainer. The translation is done in the DISK device.

LILO and friends manage to boot systems that use software RAID and LVM. There are multiple methods. Some use block lists, some use tiny boot partitions, etc. All of them are applicable to controllerless NAND.

Yes, by using fixed addresses, which is not what I want.

You simply ignore the fact, that inside each disk, USB Stick, CF-CARD, whatever - there is a more or less intelligent controller device, which does the mapping to the physical storage location. There is _NO_ such thing on a bare FLASH chip.

How many times do I have to tell you that I wrote a driver for controllerless NAND just last month?

Wow. I'm impressed because I'm pulling my opinion out of thin air.

How exactly does device mapper:
A) across device wear levelling ?

The same way UBI does, but encapsulated in a device mapper layer.

Does the device mapper do that ?

B) dynamic partitioning for FLASH aware file systems ?

See above.

Does the device mapper do that ?

C) across device wear levelling for FLASH aware file systems ?

See above.

Look at your own drawing.

D) background bit-flip corrections (copying affected blocks and recycle the old one) ?

See above.

Repeating patterns do not impress me. Your drawing tells otherwise.

E) allow position independent placement of the second stage bootloader ?

See way above to my LILO response.

Neither LILO nor GRUB have search capabilities for randomly located second stage loaders.

You need to implement a clever journalling block device emulator in order to keep the data alive and the FLASH not worn out within no time. You need the wear levelling, otherwise you can throw away your FLASH in no time.

And that's why it's in my picture.

Yes, it is in your picture, but:
1) it excludes FLASH aware file systems and UBI does not.
2) your picture does still not explain how it does achieve the above A), B), C), D) and E)

Your extra path for partitioning(4) and JFFS2 is just a weird hack, which makes your proposal completely absurd.

No, it's just there to show the flexibility of device mapper. But I have the sneaking suspicion you have no idea how device mapper works.

Sigh. Layering violation == flexibility.

In brief: device mapper takes one or more devices, applies a mapping to them, and returns a new device. For example, take various spans of /dev/hda1 and /dev/sda3 and present them as new-device1. Take new-device1 and transform it with dm-crypt to get new-device2. The kernel doesn't decide how to do this, any more than it decides where to mount your filesystems. Userspace does.

I know how it works. But your blurb does not answer any of my questions.

5. We don't reimplement higher pieces of the stack (dm-crypt, snapshot, etc.).

Why should we reimplement that ?

So that you can get encryption and snapshot, etc.?

1. On top of a clever block device.
2. UBI can do snapshots by design.

Oh, so you HAVE reimplemented it.

No, it already works.

3. Encryption should be done on the VFS layer and not below the filesystem layer. Doing it inside the block layer or the device mapper is broken by design.

That's highly debatable and not a topic for this thread.

I see, you define, what has to be discussed.

	tglx
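Editor's note: the "takes one or more devices, applies a mapping to them, and returns a new device" description quoted above can be pictured as a table lookup. The sketch below is a toy model in plain C, not the device mapper API: struct span, map_sector() and the integer device ids are assumptions made up for illustration. It only shows how sectors of a virtual device are translated to (backing device, sector) pairs, which is also why it says nothing about erase counts or bit-flips.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous span of the virtual device and where it lives. */
struct span {
	uint64_t start;		/* first virtual sector of the span */
	uint64_t len;		/* number of sectors in the span */
	int dev;		/* hypothetical backing device id */
	uint64_t offset;	/* first sector on the backing device */
};

/*
 * Translate a virtual sector. Returns 0 and fills *dev / *out on
 * success, -1 if no span covers the sector.
 */
int map_sector(const struct span *tbl, size_t n, uint64_t sector,
	       int *dev, uint64_t *out)
{
	size_t i;

	assert(dev != NULL && out != NULL);
	assert(tbl != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (sector >= tbl[i].start &&
		    sector - tbl[i].start < tbl[i].len) {
			*dev = tbl[i].dev;
			*out = tbl[i].offset + (sector - tbl[i].start);
			return 0;
		}
	}
	return -1;
}
```

A table with two entries, for example { 0, 100, 0, 50 } and { 100, 200, 1, 0 }, presents 100 sectors of device 0 followed by 200 sectors of device 1 as one 300-sector device.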
Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far
On Mon, 2007-03-19 at 21:27 -0700, Greg KH wrote:
> On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote:
> > Arjan van de Ven [EMAIL PROTECTED] writes:
> > > well we can do the handshake to take ownership like we do much later in
> > > boot, but that requires PCI to be there and fully discovered, which we
> > > don't have this early.
> >
> > That's not true - we do early pci discovery. Doing USB handsoff there
> > would be quite possible.
>
> What, we don't do USB handoff early enough in the boot process? It's
> happening at PCI quirk time now, which I think should be early enough
> for everyone (and too early for some who rely on USB keyboards and
> initramfs shells...)

It happens way after the CPUs are brought up. At this point both the delay loop calibration and the local APIC calibration are already done.

	tglx
Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})
On Tue, 2007-03-20 at 17:47 +0100, Grzegorz Chwesewicz wrote:
> I have HP nx6325. I've tried to use WARN_ON_ONCE patch, but I don't see
> nothing special in dmesg. Just in case I'm posting my
> dmesg_2.6.20_WARN_ON_ONCE_on_battery log on
> http://bugzilla.kernel.org/show_bug.cgi?id=8235 . Below I post output of
> my /proc/interrupts (10 sec. delay between reads).
>
> Other interesting thing on 2.6-git is that when I press a key on keyboard
> it doesn't repeat (on battery), but it repeats on 2.6-git on ac.

Sigh. The periodic PIT interrupt papers over the problem in 2.6.21-rc. It prevents the BIOS from switching the CPU into lower power states.

I'm working on a "detect LAPIC / BIOS madness" check.

	tglx
Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})
On Wed, 2007-03-21 at 10:46 +0100, Andi Kleen wrote:
> On Wednesday 21 March 2007 10:24, Thomas Gleixner wrote:
> > On Tue, 2007-03-20 at 17:47 +0100, Grzegorz Chwesewicz wrote:
> > > I have HP nx6325. I've tried to use WARN_ON_ONCE patch, but I don't see
> > > nothing special in dmesg. Just in case I'm posting my
> > > dmesg_2.6.20_WARN_ON_ONCE_on_battery log on
> > > http://bugzilla.kernel.org/show_bug.cgi?id=8235 . Below I post output
> > > of my /proc/interrupts (10 sec. delay between reads).
> > >
> > > Other interesting thing on 2.6-git is that when I press a key on
> > > keyboard it doesn't repeat (on battery), but it repeats on 2.6-git on ac.
> >
> > Sigh. The periodic PIT interrupt papers over the problem in 2.6.21-rc.
> > It prevents the BIOS from switching the CPU into lower power states.
>
> I think I ran into the same problem with my initial noidletick patch.
> I don't have that test machine anymore though.
>
> Normally the "use PIT when AMD Cstate >= 2" check should have caught that
> though. Why did it here?

The BIOS/ACPI is broken and only exposes C1, which should not switch off the LAPIC. The BIOS is somehow switching into deeper C-states behind the kernel's back.

	tglx
Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})
On Wed, 2007-03-21 at 11:37 +0100, Andi Kleen wrote:
> > The BIOS/ACPI is broken and does only expose C1, which should not switch
> > off LAPIC. The BIOS is switching into deeper C-States behind the kernels
> > back somehow.
>
> Hmm, perhaps we can check AMD && (cstate >= 2 || has a battery) ?
> Should be doable by looking up the battery object in ACPI

Which makes us rely on another ACPI feature. What guarantees that the ACPI tables are correct for this one ?

	tglx
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Wed, 2007-03-21 at 12:05 +0100, Jörn Engel wrote:
On Tue, 20 March 2007 01:42:46 +0100, Thomas Gleixner wrote:
On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:

4. JFFS2 has its own wear-leveling scheme, as do several other filesystems, so they probably want to bypass this piece of the stack.

JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2's own wear levelling sucks.

Ok, fine. How about LogFS, then?

LogFS can easily leverage UBI's wear algorithm.

Ok, now we have reached the absurd. UBI quite fundamentally cannot do wear leveling as good as LogFS can. Simply because UBI has zero knowledge of the _contents_ of its blocks. Knowing whether a block is 90% garbage or not makes a great difference. Also LogFS currently requires erasesizes of 2^n.

Last time I talked to you about that, you said it would be possible and fixable. We talked about several mechanisms, which would allow a filesystem or other users to hint such things to UBI.

Even if the LogFS wear levelling is so superior, it CAN'T do across device wear levelling.

	tglx
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Wed, 2007-03-21 at 12:35 +0100, Jörn Engel wrote:
> Even if such flashes still contain a bootloader and a kernel, that will
> occupy less than 1% of the device. Wear leveling across the device is
> fairly pointless here. This is what I designed LogFS for.

Still you need to have a solution for handling bitflips in those bootloader and kernel areas. I don't dispute, that on a Terabyte solid state disk which is used in a totally different way, UBI is not necessarily the right tool.

> There is some middle ground where a combination of UBI and LogFS may make
> sense. LogFS can still make sense for devices as small as 64MiB. But
> I'm not too concerned about that because flashes will continue to grow
> and the advantages of cross-device wear leveling will continue to
> diminish.

Flashes will grow, but this will not change the embedded use case with a relatively small FLASH and the bootloader / kernel / rootfs / datafs scenario, where UBI is the right tool to use.

There is no hammer for all nails and I don't see device mapper doing what UBI does right now.

	tglx
Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})
Stefan, Grzegorz

On Wed, 2007-03-21 at 12:14 +0100, Thomas Gleixner wrote:
> On Wed, 2007-03-21 at 11:37 +0100, Andi Kleen wrote:
> > > The BIOS/ACPI is broken and does only expose C1, which should not
> > > switch off LAPIC. The BIOS is switching into deeper C-States behind
> > > the kernels back somehow.
> >
> > Hmm, perhaps we can check AMD && (cstate >= 2 || has a battery) ?
> > Should be doable by looking up the battery object in ACPI
>
> Which makes us rely on another ACPI feature. What guarantees that the
> ACPI tables are correct for this one ?

Can you please apply the patch below and add "nolapic_timer" to the kernel command line ?

Please provide also the output of "# dmidecode" on your laptops.

Thanks,

	tglx

diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 5cff797..67f8d9f 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -61,6 +61,8 @@ static int enable_local_apic __initdata = 0;
 
 /* Local APIC timer verification ok */
 static int local_apic_timer_verify_ok;
+/* Disable local APIC timer from the kernel commandline */
+static int local_apic_timer_disabled;
 
 /*
  * Debug level, exported for io_apic.c
@@ -340,6 +342,13 @@ void __init setup_boot_APIC_clock(void)
 	long delta, deltapm;
 	int pm_referenced = 0;
 
+	if (local_apic_timer_disabled) {
+		/* No broadcast on UP ! */
+		if (num_possible_cpus() > 1)
+			setup_APIC_timer();
+		return;
+	}}
+
 	apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n"
 		    "calibrating APIC timer ...\n");
 
@@ -1179,6 +1188,13 @@ static int __init parse_nolapic(char *arg)
 }
 early_param("nolapic", parse_nolapic);
 
+static int __init parse_disable_lapic_timer(char *arg)
+{
+	local_apic_timer_disabled = 1;
+	return 0;
+}
+early_param("nolapic_timer", parse_disable_lapic_timer);
+
 static int __init apic_set_verbosity(char *str)
 {
 	if (strcmp("debug", str) == 0)
Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})
On Wed, 2007-03-21 at 13:15 +0100, Thomas Gleixner wrote:
> +		return;
> +	}}
> +

Ooops, sorry. Did not quilt refresh before sending it out. Correct version below.

	tglx

diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 5cff797..83cf98d 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -61,6 +61,8 @@ static int enable_local_apic __initdata = 0;
 
 /* Local APIC timer verification ok */
 static int local_apic_timer_verify_ok;
+/* Disable local APIC timer from the kernel commandline */
+static int local_apic_timer_disabled;
 
 /*
  * Debug level, exported for io_apic.c
@@ -340,6 +342,13 @@ void __init setup_boot_APIC_clock(void)
 	long delta, deltapm;
 	int pm_referenced = 0;
 
+	if (local_apic_timer_disabled) {
+		/* No broadcast on UP ! */
+		if (num_possible_cpus() > 1)
+			setup_APIC_timer();
+		return;
+	}
+
 	apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n"
 		    "calibrating APIC timer ...\n");
 
@@ -1179,6 +1188,13 @@ static int __init parse_nolapic(char *arg)
 }
 early_param("nolapic", parse_nolapic);
 
+static int __init parse_disable_lapic_timer(char *arg)
+{
+	local_apic_timer_disabled = 1;
+	return 0;
+}
+early_param("nolapic_timer", parse_disable_lapic_timer);
+
 static int __init apic_set_verbosity(char *str)
 {
 	if (strcmp("debug", str) == 0)
Re: Fwd: [PATCH 7/9] ACPI: Only use IPI on known broken machines (AMD, Dothan/BaniasPentium M)
On Tue, 2007-03-20 at 20:23 +0100, Andi Kleen wrote:
> +	else if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
> +		  boot_cpu_data.x86 == 6) &&
> +		 (boot_cpu_data.x86_model == 13 ||
> +		  boot_cpu_data.x86_model == 9))

What is with 10..12 and > 13 ?

I would just force it for all model 6s that have >= C2 and definitely for all with C3. C3 is unconditional anyway.

	tglx
Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})
On Wed, 2007-03-21 at 14:04 +0100, Stefan Prechtel wrote:
> I uploaded the output of dmesg (kernel 2.6.21-rc4-git5) (battery / ac)
> and dmidecode
>
> I can boot on battery with nolapic_timer and the second core is online,
> too. /proc/acpi/processor/C000/ shows the same as before but
> /proc/interrupts has changed:
>
> (battery)
>            CPU0       CPU1
>   0:      47131          0    local-APIC-edge-fasteoi  timer
> LOC:          0      46978
>
> (ac)
>            CPU0       CPU1
>   0:      59137          0    local-APIC-edge-fasteoi  timer
> LOC:          0      58984

That's correct. We keep the PIT alive and trigger the lapic timer interrupt via an IPI.

	tglx
Re: [PATCH] clockevents: Fix suspend/resume to disk hangs
On Tue, 2007-03-20 at 10:35 +0100, Marcus Better wrote:
> Thomas Gleixner wrote:
> > I finally found a dual core box, which survives suspend/resume without
> > crashing in the middle of nowhere. Sigh, I never figured out from the
> > code and the bug reports what's going on.
> >
> > The observed hangs are caused by a stale state transition of the clock
> > event devices, which keeps the RCU synchronization away from completion,
> > when the non boot CPU is brought back up.
>
> This didn't fix the suspend problems on my Thinkpad R60.
>
> (Sorry for nagging - please let me know if I can assist in debugging
> this...)

I did not expect that it fixes your problem. clockevents are only used in arch/i386 right now. You are running a 64 bit kernel, so a change of your problem would have been very surprising.

You said, that the breakage came between 2.6.20 and rc2. Can you bisect it ?

	tglx
[PATCH] i386: disable local apic timer via command line or dmi quirk
The local APIC timer stops working in deeper C-States. This is handled by the ACPI code and a broadcast mechanism in the clockevents / tick management code.

Some systems do not expose the deeper C-States to the kernel, but switch into deeper C-States behind the kernel's back. This delays the local apic timer interrupts forever and makes the systems unusable.

Add a command line option to disable the local apic timer and a dmi quirk for known broken systems.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 856c8b1..06377c7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1117,6 +1117,8 @@ and is between 256 and 4096 characters. It is defined in the file
 
 	nolapic		[IA-32,APIC] Do not enable or use the local APIC.
 
+	nolapic_timer	[IA-32,APIC] Do not use the local APIC timer.
+
 	noltlbs		[PPC] Do not use large page/tlb entries for kernel
 			lowmem mapping on PPC40x.
 
diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 5cff797..3682511 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -28,6 +28,7 @@
 #include <linux/clockchips.h>
 #include <linux/acpi_pmtmr.h>
 #include <linux/module.h>
+#include <linux/dmi.h>
 
 #include <asm/atomic.h>
 #include <asm/smp.h>
@@ -61,6 +62,8 @@ static int enable_local_apic __initdata = 0;
 
 /* Local APIC timer verification ok */
 static int local_apic_timer_verify_ok;
+/* Disable local APIC timer from the kernel commandline or via dmi quirk */
+static int local_apic_timer_disabled;
 
 /*
  * Debug level, exported for io_apic.c
@@ -266,6 +269,32 @@ static void __devinit setup_APIC_timer(void)
 }
 
 /*
+ * Detect systems with known broken BIOS implementations
+ */
+static int __init lapic_check_broken_bios(struct dmi_system_id *d)
+{
+	printk(KERN_NOTICE "%s detected: disabling lapic timer.\n",
+		d->ident);
+	local_apic_timer_disabled = 1;
+	return 0;
+}
+
+static struct dmi_system_id __initdata broken_bios_dmi_table[] = {
+	{
+		/*
+		 * BIOS exports only C1 state, but uses deeper power
+		 * modes behind the kernels back.
+		 */
+		.callback = lapic_check_broken_bios,
+		.ident = "HP nx6325",
+		.matches = {
+			DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq nx6325"),
+		},
+	},
+	{}
+};
+
+/*
  * In this functions we calibrate APIC bus clocks to the external timer.
  *
  * We want to do the calibration only once since we want to have local timer
@@ -340,6 +369,22 @@ void __init setup_boot_APIC_clock(void)
 	long delta, deltapm;
 	int pm_referenced = 0;
 
+	/* Detect know broken systems */
+	dmi_check_system(broken_bios_dmi_table);
+
+	/*
+	 * The local apic timer can be disabled via the kernel
+	 * commandline or from the dmi quirk above. Register the lapic
+	 * timer as a dummy clock event source on SMP systems, so the
+	 * broadcast mechanism is used. On UP systems simply ignore it.
+	 */
+	if (local_apic_timer_disabled) {
+		/* No broadcast on UP ! */
+		if (num_possible_cpus() > 1)
+			setup_APIC_timer();
+		return;
+	}
+
 	apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n"
 		    "calibrating APIC timer ...\n");
 
@@ -1179,6 +1224,13 @@ static int __init parse_nolapic(char *arg)
 }
 early_param("nolapic", parse_nolapic);
 
+static int __init parse_disable_lapic_timer(char *arg)
+{
+	local_apic_timer_disabled = 1;
+	return 0;
+}
+early_param("nolapic_timer", parse_disable_lapic_timer);
+
 static int __init apic_set_verbosity(char *str)
 {
 	if (strcmp("debug", str) == 0)
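Editor's note: the decision the patch above adds to setup_boot_APIC_clock() can be restated as a small userspace sketch. This is not kernel code: cmdline_has_param(), choose_lapic_setup() and the enum are hypothetical names, and the token matching only illustrates why "nolapic" and "nolapic_timer" have to be told apart as whole parameters. The kernel itself registers each name separately through early_param().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum lapic_setup {
	LAPIC_CALIBRATE,	/* normal path: calibrate and use the lapic timer */
	LAPIC_DUMMY_BROADCAST,	/* SMP: register dummy device, PIT broadcasts */
	LAPIC_IGNORE,		/* UP: no broadcast needed, skip the lapic timer */
};

/* True if `param` appears as a whole space-separated token in `cmdline`. */
bool cmdline_has_param(const char *cmdline, const char *param)
{
	size_t plen, tlen;

	assert(cmdline != NULL && param != NULL);
	plen = strlen(param);

	while (*cmdline) {
		while (*cmdline == ' ')
			cmdline++;
		tlen = strcspn(cmdline, " ");
		if (tlen > 0 && tlen == plen &&
		    strncmp(cmdline, param, plen) == 0)
			return true;
		cmdline += tlen;
	}
	return false;
}

/* Mirrors the branch added in front of the calibration code. */
enum lapic_setup choose_lapic_setup(bool timer_disabled,
				    unsigned int possible_cpus)
{
	if (!timer_disabled)
		return LAPIC_CALIBRATE;
	return possible_cpus > 1 ? LAPIC_DUMMY_BROADCAST : LAPIC_IGNORE;
}
```

On the dual core laptops from the bug report the second case applies: the lapic timer is registered as a dummy device and the tick arrives via broadcast from the PIT.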
Re: sysfs ugly timer interface (was Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far)
On Thu, 2007-03-22 at 08:28 -0700, Greg KH wrote:
> On Tue, Mar 20, 2007 at 11:54:03AM +0000, Pavel Machek wrote:
> > Hi!
> >
> > > [EMAIL PROTECTED]:/home/maxim# cat /sys/devices/system/clockevents/clockevents0/registered
> > > lapic F:0007 M:3(periodic) C: 1
> > > hpet  F:0003 M:1(shutdown) C: 0
> > > lapic F:0007 M:3(periodic) C: 0
> > > [EMAIL PROTECTED]:/home/maxim#
> >
> > Now... this file needs to die, before 2.6.21 is released. It tries to
> > bring /proc-like parsing nightmare to sysfs. Kill it before it becomes
> > part of stable ABI!
>
> Eeek! I agree, that needs to be fixed now. Remember, 1 value per file
> in sysfs!
>
> Shall I just submit a patch ripping it out for now?

I fix it.

	tglx
Re: [BUG] no boot with 2.6.21-rc3 and later
On Thu, 2007-03-22 at 12:25 -0700, john stultz wrote:
> On Thu, 2007-03-22 at 13:14 -0600, Bob Tracy wrote:
> > john stultz wrote:
> > > Try this patch and let me know if it does the right thing.
> >
> > Will do. I'll report back in a few hours.
> >
> > > Although I do still need to dig a bit on the PIT hang issue.
> >
> > Any chance this might be related to the APIC issues currently being
> > discussed in other threads?
>
> Hmmm. Good thought, I'll have to look into it. It could be that if the
> PIT is disabled in favor of the local apic, we'll have to make sure its
> not being used as a clocksource.

Ouch, yes. That's fatal and can happen. Not sure, what to do about that.

	tglx
[PATCH] i386: clockevents fix breakage on Geode/Cyrix PIT implementations
The PIT has no dedicated mode for shut down. The only way to disable PIT
is to put it into one shot mode.

AMD implementations of PIT on Geode (also observed on Cyrix) are confused
by an empty transition from CLOCK_EVT_MODE_UNUSED to
CLOCK_EVT_MODE_SHUTDOWN, which puts the PIT into one shot mode
momentarily.

I realized after staring helpless at the bug report
http://bugzilla.kernel.org/show_bug.cgi?id=8027 for quite a while, that
the only change, which might influence the bogomips calibration, is the
above transition during the PIT initialization.

Avoiding the unnecessary switch to oneshot and later to periodic mode
fixes the weird bogomips value and also the resulting slowness.

The fix is confirmed on OLPC and another Geode based box.

Note: this is unrelated to the Dual Core problem discussed here:
http://lkml.org/lkml/2007/3/17/48

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/i8253.c b/arch/i386/kernel/i8253.c
index 5cbb776..10cef5c 100644
--- a/arch/i386/kernel/i8253.c
+++ b/arch/i386/kernel/i8253.c
@@ -47,9 +47,17 @@ static void init_pit_timer(enum clock_event_mode mode,
 		outb(LATCH >> 8 , PIT_CH0);	/* MSB */
 		break;
 
-	case CLOCK_EVT_MODE_ONESHOT:
+	/*
+	 * Avoid unnecessary state transitions, as it confuses
+	 * Geode / Cyrix based boxen.
+	 */
 	case CLOCK_EVT_MODE_SHUTDOWN:
+		if (evt->mode == CLOCK_EVT_MODE_UNUSED)
+			break;
 	case CLOCK_EVT_MODE_UNUSED:
+		if (evt->mode == CLOCK_EVT_MODE_SHUTDOWN)
+			break;
+	case CLOCK_EVT_MODE_ONESHOT:
 		/* One shot setup */
 		outb_p(0x38, PIT_MODE);
 		udelay(10);
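The effect of the reordered case labels can be checked with a small user-space model. This is a sketch, not the kernel code: `struct pit_model`, `pit_set_mode()` and the `reprogrammed` counter are hypothetical stand-ins for `init_pit_timer()` and its port writes. Only the layout of the switch follows the patch.

```c
enum clock_event_mode {
	CLOCK_EVT_MODE_UNUSED,
	CLOCK_EVT_MODE_SHUTDOWN,
	CLOCK_EVT_MODE_PERIODIC,
	CLOCK_EVT_MODE_ONESHOT,
};

/* Hypothetical model of the PIT clock event device. */
struct pit_model {
	enum clock_event_mode mode;	/* mode before the requested change */
	int reprogrammed;		/* counts simulated writes to PIT_MODE */
};

/* Same case order as the patch: UNUSED <-> SHUTDOWN touches no hardware. */
static void pit_set_mode(struct pit_model *evt, enum clock_event_mode mode)
{
	switch (mode) {
	case CLOCK_EVT_MODE_PERIODIC:
		evt->reprogrammed++;	/* stands in for the periodic setup */
		break;
	case CLOCK_EVT_MODE_SHUTDOWN:
		if (evt->mode == CLOCK_EVT_MODE_UNUSED)
			break;
		/* fall through */
	case CLOCK_EVT_MODE_UNUSED:
		if (evt->mode == CLOCK_EVT_MODE_SHUTDOWN)
			break;
		/* fall through */
	case CLOCK_EVT_MODE_ONESHOT:
		evt->reprogrammed++;	/* stands in for the one shot setup */
		break;
	}
	/* In this model the new mode is recorded after the callback ran. */
	evt->mode = mode;
}
```

Starting from UNUSED, a request for SHUTDOWN leaves the counter at zero, which is the empty transition the changelog describes. A later PERIODIC to SHUTDOWN change still programs the one shot setup.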
Re: [PATCH] Fix irqpoll on IA64 (timer interrupt != 0)
On Thu, 2007-03-22 at 14:09 -0700, Andrew Morton wrote:
> I think the term 'timer_interrupt' is a bit generic-sounding. Would it be
> better to call it irqpoll_interrupt? After all, some architecture might
> want to use, umm, the keyboard interrupt to trigger IRQ polling ;)

Interesting thought, but in general I have to agree.

> Also, the code presently passes the magic IRQ number into the generic IRQ
> code. I wonder if we'd get a more pleasing result if we were to make the
> generic IRQ code call _out_ to the architecture:
>
> Then, ia64 can implement arch_is_irqpoll_irq() and it can do whatever it
> wants in there.
>
> The __attribute__((weak)) thing adds a little bit of overhead, but I don't
> think this is a fastpath?

Well, depends what you consider a fastpath. When noirqdebug == 0, it is
called on every interrupt.

	tglx
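The weak-symbol hook suggested in the quoted text could look roughly like the sketch below. The code Andrew posted is not part of this archive, so everything here is an assumption apart from the name `arch_is_irqpoll_irq()` and the use of `__attribute__((weak))`: the signature, the default of IRQ 0 and the caller `should_poll_spurious()` are hypothetical. It needs GCC or Clang for the attribute.

```c
#include <stdbool.h>

/*
 * Generic default: treat IRQ 0 as the interrupt that drives polling.
 * An architecture would override this with a non-weak definition of
 * the same name in its own source file.
 */
__attribute__((weak)) bool arch_is_irqpoll_irq(unsigned int irq)
{
	return irq == 0;
}

/* Hypothetical caller in the generic IRQ code. */
static bool should_poll_spurious(unsigned int irq, bool irqpoll_enabled)
{
	return irqpoll_enabled && arch_is_irqpoll_irq(irq);
}
```

The overhead mentioned in the thread comes from the hook being an out-of-line call through a symbol resolved at link time, which is why it matters that the check runs on every interrupt when noirqdebug is 0.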
Re: 2.6.21-rc[123] regression with NOAPIC
On Thu, 2007-03-22 at 14:42 +0100, Adrian Bunk wrote:
> > Starting with head as of yesterday and reverting two commits (that are
> > duplicates of each other -- the same commit came into Linus's tree via
> > two different paths) 'fixes' the problem for me. I'll let those with the
> > big brains decide just why.
> >
> > The two commits are 5c95d3f5783ab184f64b7848f0a871352c35c3cf and
> > 3434933b17fa64adddf83059603c61296f6e1ee2 . The net reverse diff of those
> > two is below.
> > ...
>
> Thanks for tracking it down.
>
> It's quite possible that these commits trigger your problem.
>
> Does it work if you do _not_ revert the commits, and instead replace in
> drivers/acpi/processor_idle.c the
> #ifdef ARCH_APICTIMER_STOPS_ON_C3
> with an
> #if 0
> ?

Then NOAPIC probably works again, but booting w/o NOAPIC fails.

	tglx
Re: 2.6.21-rc[123] regression with NOAPIC
On Thu, 2007-03-22 at 15:16 +0100, Adrian Bunk wrote:
> > > Does it work if you do _not_ revert the commits, and instead replace in
> > > drivers/acpi/processor_idle.c the
> > > #ifdef ARCH_APICTIMER_STOPS_ON_C3
> > > with an
> > > #if 0
> > > ?
> >
> > Then NOAPIC probably works again, but booting w/o NOAPIC fails.
>
> But we'll know that it's this code that has a problem with noapic in the
> CONFIG_GENERIC_CLOCKEVENTS=n case.

Nope. This code does not have a problem. It causes a problem elsewhere:

It calls switch_ipi_to_APIC_timer() or switch_APIC_timer_to_ipi(), which
sets/clears a bit in the broadcast mask and enables / disables the local
APIC timer.

I don't see right now, why this causes the box to lock up hard, but maybe
the debug printk's below give us some hint.

	tglx

diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
index 723417d..29376e2 100644
--- a/arch/x86_64/kernel/apic.c
+++ b/arch/x86_64/kernel/apic.c
@@ -886,6 +886,8 @@ void disable_APIC_timer(void)
 	if (using_apic_timer) {
 		unsigned long v;
 
+		printk("Disabling local APIC timer %d\n", apic_runs_main_timer);
+
 		v = apic_read(APIC_LVTT);
 		/*
 		 * When an illegal vector value (0-15) is written to an LVT
@@ -910,6 +912,7 @@ void enable_APIC_timer(void)
 	    !cpu_isset(cpu, timer_interrupt_broadcast_ipi_mask)) {
 		unsigned long v;
 
+		printk("Enabling local APIC timer: %d\n", apic_runs_main_timer);
 		v = apic_read(APIC_LVTT);
 		apic_write(APIC_LVTT, v & ~APIC_LVT_MASKED);
 	}
@@ -934,6 +937,7 @@ void smp_send_timer_broadcast_ipi(void)
 
 	cpus_and(mask, cpu_online_map, timer_interrupt_broadcast_ipi_mask);
 	if (!cpus_empty(mask)) {
+		printk("Send IPI\n");
 		send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
 	}
 }
Re: [1/6] 2.6.21-rc4: known regressions
On Fri, 2007-03-23 at 12:42 +0100, Ingo Molnar wrote:
> there's a new post-rc4 regression: my T60 hangs during early bootup. I
> bisected the hang down to this recent commit:
>
> | commit 25496caec111481161e7f06bbfa12a533c43cc6f
> | Author: Thomas Renninger [EMAIL PROTECTED]
> | Date: Tue Feb 27 12:13:00 2007 -0500
> |
> |    ACPI: Only use IPI on known broken machines (AMD, Dothan/BaniasPentium M)
>
> undoing this change fixes my T60 so it correctly boots again.
>
> the commit has this confidence-raising comment:
>
> | However, I am not sure about the naming of the parameter and how it
> | could/should get integrated into the dyntick part
> | (CONFIG_GENERIC_CLOCKEVENTS). There, a more fine grained check (TSC
> | still running?, ..) is needed?
>
> could we please revert this commit until it's done correctly? and did
> this end up being a 'fix'? The change weakens the scope of a hardware
> workaround, which IMO has no place so late in the cycle. At a minimum
> the clockevents maintainer (Thomas) should have been Cc:-ed on it.
>
> 	Ingo

I had seen it before, and I had no objections under the premise, that it
does not break things and especially survives on Andrews VAIO. I expected
that to come in via -mm so it gets enough testing.

We should revert that patch and add a trust_lapic_timer_in_c2 commandline
option instead. So we are on the safe side.

	tglx
[PATCH] i386: add command line option local_apic_timer_c2_ok
On Fri, 2007-03-23 at 12:56 +0100, Thomas Gleixner wrote:
> We should revert that patch and add a trust_lapic_timer_in_c2 commandline
> option instead. So we are on the safe side.

Here is a patch which applies after reverting
25496caec111481161e7f06bbfa12a533c43cc6f

It turned out that it is almost impossible to trust ACPI, BIOS & Co.
regarding the C states. This was the reason to switch the local apic timer
off in C2 state already. OTOH there are sane and well behaving systems,
which get punished by that decision.

Allow the user to confirm that the local apic timer is trustworthy in C2
state. This keeps the default behaviour on the safe side.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]
Acked-by: Ingo Molnar [EMAIL PROTECTED]

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e39ab0c..09640a8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -780,6 +780,9 @@ and is between 256 and 4096 characters. It is defined in the file
 	lapic		[IA-32,APIC] Enable the local APIC even if BIOS
 			disabled it.
 
+	lapic_timer_c2_ok	[IA-32,APIC] trust the local apic timer in
+			C2 power state.
+
 	lasi=		[HW,SCSI] PARISC LASI driver for the 53c700 chip
 			Format: addr:<io>,irq:<irq>
 
diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 244c3fe..e884152 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -64,6 +64,9 @@ static int enable_local_apic __initdata = 0;
 static int local_apic_timer_verify_ok;
 /* Disable local APIC timer from the kernel commandline or via dmi quirk */
 static int local_apic_timer_disabled;
+/* Local APIC timer works in C2 */
+int local_apic_timer_c2_ok;
+EXPORT_SYMBOL_GPL(local_apic_timer_c2_ok);
 
 /*
  * Debug level, exported for io_apic.c
@@ -1232,6 +1235,13 @@ static int __init parse_disable_lapic_timer(char *arg)
 }
 early_param("nolapic_timer", parse_disable_lapic_timer);
 
+static int __init parse_lapic_timer_c2_ok(char *arg)
+{
+	local_apic_timer_c2_ok = 1;
+	return 0;
+}
+early_param("lapic_timer_c2_ok", parse_lapic_timer_c2_ok);
+
 static int __init apic_set_verbosity(char *str)
 {
 	if (strcmp("debug", str) == 0)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 6077300..cdf7894 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -268,6 +268,7 @@ static void acpi_timer_check_state(int state, struct acpi_processor *pr,
 				   struct acpi_processor_cx *cx)
 {
 	struct acpi_processor_power *pwr = &pr->power;
+	u8 type = local_apic_timer_c2_ok ? ACPI_STATE_C3 : ACPI_STATE_C2;
 
 	/*
 	 * Check, if one of the previous states already marked the lapic
@@ -276,7 +277,7 @@ static void acpi_timer_check_state(int state, struct acpi_processor *pr,
 	if (pwr->timer_broadcast_on_state < state)
 		return;
 
-	if (cx->type >= ACPI_STATE_C2)
+	if (cx->type >= type)
 		pr->power.timer_broadcast_on_state = state;
 }
 
diff --git a/include/asm-i386/apic.h b/include/asm-i386/apic.h
index cc6b165..a19810a 100644
--- a/include/asm-i386/apic.h
+++ b/include/asm-i386/apic.h
@@ -117,6 +117,7 @@ extern void enable_NMI_through_LVT0 (void * dummy);
 #define ARCH_APICTIMER_STOPS_ON_C3	1
 
 extern int timer_over_8254;
+extern int local_apic_timer_c2_ok;
 
 #else /* !CONFIG_X86_LOCAL_APIC */
 static inline void lapic_shutdown(void) { }
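What the new option changes is the C-state type from which the lapic timer is treated as stopped, and therefore the first state in which broadcast is used. The sketch below models that decision in user space. It is an illustration under assumptions: `first_broadcast_state()` is a hypothetical helper that folds the per-state calls of `acpi_timer_check_state()` into one loop, and `INT_MAX` as the initial "never" value is assumed.

```c
#include <limits.h>

#define ACPI_STATE_C1	1
#define ACPI_STATE_C2	2
#define ACPI_STATE_C3	3

/*
 * Hypothetical helper: cx_types[i] is the ACPI type of C-state i + 1.
 * Returns the first state index that needs timer broadcast, or
 * INT_MAX if no state does.
 */
static int first_broadcast_state(const int *cx_types, int count, int c2_ok)
{
	/* Same threshold selection as the patched acpi_timer_check_state(). */
	int type = c2_ok ? ACPI_STATE_C3 : ACPI_STATE_C2;
	int broadcast_on_state = INT_MAX;
	int state;

	for (state = 1; state <= count; state++) {
		/* An earlier state already marked the lapic timer unusable. */
		if (broadcast_on_state < state)
			continue;
		if (cx_types[state - 1] >= type)
			broadcast_on_state = state;
	}
	return broadcast_on_state;
}
```

With the default (c2_ok == 0) a machine exposing C1, C2 and C3 starts broadcasting at state 2. Booting with lapic_timer_c2_ok moves that to state 3, and a machine without C3 never broadcasts at all.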
Re: [1/6] 2.6.21-rc4: known regressions
On Fri, 2007-03-23 at 11:28 -0700, Linus Torvalds wrote:
> On Fri, 23 Mar 2007, Linus Torvalds wrote:
> >
> > Thomas, please fix.
>
> Here's a possible fix. It compiles. And I still wish we had common files.

You beat me by 30 seconds.

> ia64 shouldn't be affected, because ia64 doesn't #define the
> ARCH_APICTIMER_STOPS_ON_C3 flag (and then we don't use the c2_ok thing
> either).

Right, ia64 does not see it.

> But this is still pretty damn ugly.

Yes it is.

> Maybe a field in struct acpi_processor for C2/C3 problems?

Hmm, the acpi processor stuff is modular.

	tglx
Re: [4/5] 2.6.21-rc4: known regressions (v2)
On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
> Subject: gettimeofday increments too slowly
> References : http://bugzilla.kernel.org/show_bug.cgi?id=8027
> Submitter : David L [EMAIL PROTECTED]
> Caused-By : Thomas Gleixner [EMAIL PROTECTED]
>             commit 92c7e00254b2d0efc1e36ac3e45474ce1871b6b2
> Handled-By : Thomas Gleixner [EMAIL PROTECTED]
> Status : problem is being debugged

Patch available: http://lkml.org/lkml/2007/3/22/301
commit 6b3964cde70cfe6db79d35b42137431ef7d2f7e4

	tglx
Re: [4/5] 2.6.21-rc4: known regressions (v2)
On Fri, 2007-03-23 at 20:15 +0100, Thomas Gleixner wrote:
> On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
> > Subject: gettimeofday increments too slowly
> > References : http://bugzilla.kernel.org/show_bug.cgi?id=8027
> > Submitter : David L [EMAIL PROTECTED]
> > Caused-By : Thomas Gleixner [EMAIL PROTECTED]
> >             commit 92c7e00254b2d0efc1e36ac3e45474ce1871b6b2
> > Handled-By : Thomas Gleixner [EMAIL PROTECTED]
> > Status : problem is being debugged
>
> Patch available: http://lkml.org/lkml/2007/3/22/301
> commit 6b3964cde70cfe6db79d35b42137431ef7d2f7e4

Oops. That fixed only the one half of the problem. The timeofday one
persists. John, any idea ?

	tglx
Re: [4/5] 2.6.21-rc4: known regressions (v2)
On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
> Subject: dynticks makes ksoftirqd1 use unreasonable amount of cpu time
> References : http://bugzilla.kernel.org/show_bug.cgi?id=8100
> Submitter : Emil Karlson [EMAIL PROTECTED]
> Handled-By : Thomas Gleixner [EMAIL PROTECTED]
> Status : problem is being debugged

The problem is not reproducible on any of my machines. Emil, is it still
there with Linus latest ?

	tglx
Re: [4/5] 2.6.21-rc4: known regressions (v2)
On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
> This email lists some known regressions in Linus' tree compared to 2.6.20.
>
> If you find your name in the Cc header, you are either submitter of one
> of the bugs, maintainer of an affected subsystem or driver, a patch of
> you caused a breakage or I'm considering you in any other way possibly
> involved with one or more of these issues.
>
> Due to the huge amount of recipients, please trim the Cc when answering.
>
> Subject: system doesn't come out of suspend (CONFIG_NO_HZ)
> References : http://lkml.org/lkml/2007/2/22/391
> Submitter : Michael S. Tsirkin [EMAIL PROTECTED]
>             Soeren Sonnenburg [EMAIL PROTECTED]
> Handled-By : Thomas Gleixner [EMAIL PROTECTED]
>              Ingo Molnar [EMAIL PROTECTED]
>              Tejun Heo [EMAIL PROTECTED]
>              Rafael J. Wysocki [EMAIL PROTECTED]
> Status : problem is being debugged
>
> Subject: first disk access after resume takes several minutes
>          ('date' does not advance after resume from RAM, CONFIG_NO_HZ=n)
> References : http://lkml.org/lkml/2007/3/8/117
> Submitter : Michael S. Tsirkin [EMAIL PROTECTED]
> Handled-By : Thomas Gleixner [EMAIL PROTECTED]
>              Ingo Molnar [EMAIL PROTECTED]
> Status : problem is being debugged

I lost track of Michaels various nested problems. Michael can you please
give a summary on _all_ entries in the regressions list against Linus
latest ?

Thanks,

	tglx
Re: [4/5] 2.6.21-rc4: known regressions (v2)
On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
> Subject: Dynticks and High resolution Timer hanging the system
>          workaround: clocksource=acpi_pm
> References : http://lkml.org/lkml/2007/3/7/504
> Submitter : Stephane Casset [EMAIL PROTECTED]
> Caused-By : Thomas Gleixner [EMAIL PROTECTED]
> Status : unknown

Stephane, does the problem still exist with Linus latest ?

Thanks,

	tglx
Re: [4/5] 2.6.21-rc4: known regressions (v2)
On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
> Subject: soft lockup detected on CPU#0
> References : http://lkml.org/lkml/2007/3/3/152
> Submitter : Michal Piotrowski [EMAIL PROTECTED]
> Handled-By : Thomas Gleixner [EMAIL PROTECTED]
>              Ingo Molnar [EMAIL PROTECTED]
> Status : unknown

Michal, any news on that one ? You said the same problem exists in
2.6.20.1. Has this been resolved in 2.6.20.2/3 ?

Thanks,

	tglx
Re: [2/5] 2.6.21-rc4: known regressions (v2)
On Fri, 2007-03-23 at 19:48 +0100, Adrian Bunk wrote:
> Subject: x86_64: ACPI regression with noapic (APICTIMER_STOPS_ON_C3?)
> References : http://lkml.org/lkml/2007/3/8/468
>              http://lkml.org/lkml/2007/3/22/156
> Submitter : Ray Lee [EMAIL PROTECTED]
> Handled-By : Thomas Gleixner [EMAIL PROTECTED]
> Status : problem is being debugged

Ray, can you please test the patch below ?

Thanks,

	tglx

--
Subject: [PATCH] x86_64: avoid sending LOCAL_TIMER_VECTOR IPI to itself

Ray Lee reported, that on an UP kernel with "noapic" commandline option
set, the box locks hard during boot.

Adding some debug printks revealed, that the last action on the box
before stalling was "Send IPI" - a debug printk which was put into
smp_send_timer_broadcast_ipi().

It seems that send_IPI_mask(mask, LOCAL_TIMER_VECTOR) fails when
"noapic" is set on the commandline on an UP kernel.

Aside of that it does not make much sense to trigger an interrupt
instead of calling the function directly on the CPU which gets the
PIT/HPET interrupt in case of broadcasting.

Reported-by: Ray Lee [EMAIL PROTECTED]
Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
index 723417d..83328e1 100644
--- a/arch/x86_64/kernel/apic.c
+++ b/arch/x86_64/kernel/apic.c
@@ -930,9 +930,17 @@ EXPORT_SYMBOL(switch_APIC_timer_to_ipi);
 
 void smp_send_timer_broadcast_ipi(void)
 {
+	int cpu = smp_processor_id();
 	cpumask_t mask;
 
 	cpus_and(mask, cpu_online_map, timer_interrupt_broadcast_ipi_mask);
+
+	if (cpu_isset(cpu, mask)) {
+		cpu_clear(cpu, mask);
+		add_pda(apic_timer_irqs, 1);
+		smp_local_timer_interrupt();
+	}
+
 	if (!cpus_empty(mask)) {
 		send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
 	}
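The mask handling in this patch can be modelled with a plain bitmask. The following is a sketch under assumptions, not kernel code: `cpumask_model_t`, `struct broadcast_result` and `timer_broadcast()` are hypothetical names, CPU numbers are assumed to be below 32, and the local handler call and the IPI are only recorded, not performed.

```c
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t, one bit per CPU (CPUs 0..31). */
typedef uint32_t cpumask_model_t;

struct broadcast_result {
	int handled_locally;		/* 1 if the handler is called directly */
	cpumask_model_t ipi_mask;	/* CPUs that still get an IPI */
};

/* Same steps as the patched smp_send_timer_broadcast_ipi(). */
static struct broadcast_result
timer_broadcast(int this_cpu, cpumask_model_t online, cpumask_model_t broadcast)
{
	struct broadcast_result res = { 0, online & broadcast };
	cpumask_model_t self = (cpumask_model_t)1 << this_cpu;

	/* Never send the IPI to ourselves: run the handler directly. */
	if (res.ipi_mask & self) {
		res.ipi_mask &= ~self;
		res.handled_locally = 1;
	}
	return res;
}
```

On a UP system the resulting IPI mask is always empty, so the send_IPI_mask() path that stalled Ray's box is not reached.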
Re: [4/5] 2.6.21-rc4: known regressions (v2)
On Fri, 2007-03-23 at 18:23 -0400, Chuck Ebbert wrote:
> Thomas Gleixner wrote:
> > On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
> > > Subject: gettimeofday increments too slowly
> > > References : http://bugzilla.kernel.org/show_bug.cgi?id=8027
> > > Submitter : David L [EMAIL PROTECTED]
> > > Caused-By : Thomas Gleixner [EMAIL PROTECTED]
> > >             commit 92c7e00254b2d0efc1e36ac3e45474ce1871b6b2
> > > Handled-By : Thomas Gleixner [EMAIL PROTECTED]
> > > Status : problem is being debugged
> >
> > Patch available: http://lkml.org/lkml/2007/3/22/301
> > commit 6b3964cde70cfe6db79d35b42137431ef7d2f7e4
>
> For the other issue raised there, clock running too slow, I now realize
> there is a similar report:
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=231626

That's a different one, AFAICT. Davids problem is probably caused by me
breaking the TSC watchdog.

/me orders paperbags prophylactically and goes back to look at the code

	tglx
Re: [4/5] 2.6.21-rc4: known regressions (v2)
On Fri, 2007-03-23 at 23:43 +0100, Thomas Gleixner wrote:
> On Fri, 2007-03-23 at 18:23 -0400, Chuck Ebbert wrote:
> > Thomas Gleixner wrote:
> > > On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
> > > > Subject: gettimeofday increments too slowly
> > > > References : http://bugzilla.kernel.org/show_bug.cgi?id=8027
> > > > Submitter : David L [EMAIL PROTECTED]
> > > > Caused-By : Thomas Gleixner [EMAIL PROTECTED]
> > > >             commit 92c7e00254b2d0efc1e36ac3e45474ce1871b6b2
> > > > Handled-By : Thomas Gleixner [EMAIL PROTECTED]
> > > > Status : problem is being debugged
> > >
> > > Patch available: http://lkml.org/lkml/2007/3/22/301
> > > commit 6b3964cde70cfe6db79d35b42137431ef7d2f7e4
> >
> > For the other issue raised there, clock running too slow, I now realize
> > there is a similar report:
> >
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=231626
>
> That's a different one, AFAICT. Davids problem is probably caused by me
> breaking the TSC watchdog.
>
> /me orders paperbags prophylactically and goes back to look at the code

David, can you please test the patch below ?

	tglx

-
Subject: [PATCH] clocksource: Fix thinko in watchdog selection

The watchdog implementation excludes low res / non continuous
clocksources from being selected as a watchdog reference
unintentionally.

Allow using jiffies/PIT as a watchdog reference as long as no better
clocksource is available. This is necessary to detect TSC breakage on
systems, which have no pmtimer/hpet.

The main goal of the initial patch (preventing to switch to highres/nohz
when no reliable fallback clocksource is available) is still guaranteed
by the checks in clocksource_watchdog().

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 5b0e46b..fe5c7db 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -151,7 +151,8 @@ static void clocksource_check_watchdog(struct clocksource *cs)
 			watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
 			add_timer(&watchdog_timer);
 		}
-	} else if (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) {
+	} else {
+		if (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS)
 			cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
 
 		if (!watchdog || cs->rating > watchdog->rating) {
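The selection rule after this fix can be sketched in user space. The model below is an illustration under assumptions: `struct clocksource_model` and `check_watchdog()` are hypothetical, the flag values and ratings are picked for the example, and the branch for clocksources that must be verified is reduced to an early return.

```c
#include <stddef.h>

/* Flag names follow the patch; the numeric values are illustrative. */
#define CLOCK_SOURCE_IS_CONTINUOUS	0x01
#define CLOCK_SOURCE_MUST_VERIFY	0x02
#define CLOCK_SOURCE_VALID_FOR_HRES	0x20

/* Hypothetical, reduced stand-in for struct clocksource. */
struct clocksource_model {
	const char *name;
	int rating;
	unsigned int flags;
};

static struct clocksource_model *watchdog;

/* Models the else branch of the patched clocksource_check_watchdog(). */
static void check_watchdog(struct clocksource_model *cs)
{
	/* Sources that need verification are watched, never the reference. */
	if (cs->flags & CLOCK_SOURCE_MUST_VERIFY)
		return;

	/* Only continuous sources qualify for highres/nohz ... */
	if (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS)
		cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;

	/* ... but any trusted source may serve as watchdog reference. */
	if (!watchdog || cs->rating > watchdog->rating)
		watchdog = cs;
}
```

Before the fix the whole block sat behind the IS_CONTINUOUS test, so a non continuous source such as the PIT could never become the reference, and a box without pmtimer/hpet had nothing to check the TSC against.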
Re: [2/5] 2.6.21-rc4: known regressions (v2)
Ray,

On Fri, 2007-03-23 at 17:14 -0700, Ray Lee wrote:
> (I wondered about the IPI on a UP system, seemed a bit weird :-).)
>
> Works great, booting both with NOAPIC and without. *Much* thanks for
> debugging this while you're also handling a bunch of other issues at the
> same time.

Thank you for debugging and excellent problem descriptions !

> Patch reproduced below, with an acked-by (and, uhm, a couple of spelling
> fixes in the description -- don't hate me, 'kay?).

I know that my English sucks.

Thanks,

	tglx
Re: [4/5] 2.6.21-rc4: known regressions (v2)
Emil,

On Fri, 2007-03-23 at 20:22 +0100, Thomas Gleixner wrote:
> On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
> > Subject: dynticks makes ksoftirqd1 use unreasonable amount of cpu time
> > References : http://bugzilla.kernel.org/show_bug.cgi?id=8100
> > Submitter : Emil Karlson [EMAIL PROTECTED]
> > Handled-By : Thomas Gleixner [EMAIL PROTECTED]
> > Status : problem is being debugged
>
> The problem is not reproducible on any of my machines.

I've uploaded a patch against 2.6.21-rc4 to

http://tglx.de/private/tglx/2.6.21-rc4-trace/2.6.21-rc4-trace.patch.bz2

It contains all changes in Linus tree since -rc4 plus the two pending
fixes (http://tglx.de/private/tglx/2.6.21-rc4-pending/) along with a
backport of the latency tracer from the realtime preemption patch.

Can you please apply the patch on top of -rc4 and build it with the
configuration, which exposes this strange behaviour. Please enable also
CONFIG_LATENCY_TRACE in the "Kernel hacking" menu.

When the problem is visible, then run trace-it
(http://tglx.de/private/tglx/2.6.21-rc4-trace/trace-it.c) as root:

# trace-it > trace.txt

This captures roughly one second of kernel code paths. Please stick
trace.txt into Bugzilla.

Thanks,

	tglx
Re: [4/5] 2.6.21-rc4: known regressions (v2)
On Sat, 2007-03-24 at 14:59 +0100, Michal Piotrowski wrote:
> On 23/03/07, Thomas Gleixner [EMAIL PROTECTED] wrote:
> > On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
> > > Subject: soft lockup detected on CPU#0
> > > References : http://lkml.org/lkml/2007/3/3/152
> > > Submitter : Michal Piotrowski [EMAIL PROTECTED]
> > > Handled-By : Thomas Gleixner [EMAIL PROTECTED]
> > >              Ingo Molnar [EMAIL PROTECTED]
> > > Status : unknown
> >
> > Michal, any news on that one ? You said the same problem exists in
> > 2.6.20.1. Has this been resolved in 2.6.20.2/3
>
> Yes, I tried 2.6.20.4 and it works fine.

Is it solved in Linus latest too ?

	tglx
[PATCH] i386: Prevent early access to TSC to avoid crash on TSCless systems
commit f9690982b8c2f9a2c65acdc113e758ec356676a3 removed the check for
cpu_khz from sched_clock(), which prevented early access to the TSC by
non obvious magic.

This is harmless as long as the CPU has a TSC. On TSCless systems this
results in an illegal instruction trap.

Replace tsc_disabled and tsc_unstable by tsc_enabled, which is only set
when the tsc is available and not unstable.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/tsc.c b/arch/i386/kernel/tsc.c
index 0e65f7a..6cb8f53 100644
--- a/arch/i386/kernel/tsc.c
+++ b/arch/i386/kernel/tsc.c
@@ -18,6 +18,8 @@
 
 #include "mach_timer.h"
 
+static int tsc_enabled;
+
 /*
  * On some systems the TSC frequency does not
  * change with the cpu frequency. So we need
@@ -105,7 +107,7 @@ unsigned long long sched_clock(void)
 	/*
 	 * Fall back to jiffies if there's no TSC available:
 	 */
-	if (tsc_unstable || unlikely(tsc_disable))
+	if (unlikely(!tsc_enabled))
 		/* No locking but a rare wrong value is not a big deal: */
 		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
 
@@ -283,6 +285,7 @@ void mark_tsc_unstable(void)
 {
 	if (!tsc_unstable) {
 		tsc_unstable = 1;
+		tsc_enabled = 0;
 		/* Can be called before registration */
 		if (clocksource_tsc.mult)
 			clocksource_change_rating(&clocksource_tsc, 0);
@@ -383,7 +386,9 @@ void __init tsc_init(void)
 	if (check_tsc_unstable()) {
 		clocksource_tsc.rating = 0;
 		clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
-	}
+	} else
+		tsc_enabled = 1;
+
 	clocksource_register(&clocksource_tsc);
 
 	return;
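The point of the single `tsc_enabled` flag is that it starts out as 0, so any call to sched_clock() before tsc_init() has run takes the jiffies path and never executes a TSC read. A user-space sketch of that guard follows, with assumptions labelled: `sched_clock_model()`, `fake_tsc_ns`, the HZ value of 250 and an INITIAL_JIFFIES of 0 are illustrative and not taken from the kernel.

```c
#include <stdint.h>

#define HZ		250	/* assumed tick rate for this example */
#define INITIAL_JIFFIES	0ULL	/* simplified; the kernel uses an offset */

static int tsc_enabled;		/* 0 until the TSC is known to be usable */
static uint64_t jiffies_64;	/* tick counter, advanced by the caller */
static uint64_t fake_tsc_ns;	/* hypothetical stand-in for the TSC path */

/* Returns nanoseconds, falling back to jiffies while the TSC is off. */
static unsigned long long sched_clock_model(void)
{
	if (!tsc_enabled)
		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000ULL / HZ);

	return fake_tsc_ns;
}
```

With HZ at 250 one jiffy is 4,000,000 ns, so three ticks in the fallback path read as 12,000,000 ns. Clearing the flag again, as mark_tsc_unstable() does in the patch, returns to the jiffies path.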
Re: [4/5] 2.6.21-rc4: known regressions (v2)
On Sun, 2007-03-25 at 09:11 +0200, Michael S. Tsirkin wrote:
> > I lost track of Michaels various nested problems. Michael can you
> > please give a summary on _all_ entries in the regressions list against
> > Linus latest ?
>
> I tested 2 different configurations on my T60:
>
> - With CONFIG_NO_HZ enabled.
>   I tested this on -rc1, and have not retested with CONFIG_NO_HZ since.
>   Observed behaviour: the system would not come out of suspend to RAM.
>   After I press Fn/F4 the crescent LED starts blinking so it seems Linux
>   started doing something.
>
>   This is a problem but not a regression as such, since CONFIG_NO_HZ is
>   new in 2.6.21.

It needs to be fixed before 2.6.21 final nevertheless.

> - Without CONFIG_NO_HZ
>   I last tested this with cd05a1f818073a623455a58e756c5b419fc98db9.
>   After systems comes out of suspend to ram, I observed the following
>   behaviour (I used s2ram from console):
>   1. The first disk access takes much longer than with 2.6.20
>   2. System clock does not advance (date always reports the same time)
>   3. After an attempt to switch to X, X starts drawing some windows and
>      then hangs
>
>   All 3 issues are new and did not occur under 2.6.20, so this is a
>   regression.
>
> Attached is a full dmesg from boot to resume.

There is not much interesting to see in the log.

Can you please test the following:

Add "clocksource=acpi_pm" to the kernel commandline. If this does not
change anything, then disable CONFIG_HPET and retry.

One thing in the log is indeed scary:

[2.959150] Calibrating delay using timer specific routine.. 20089.12 BogoMIPS (lpj=100445639)

This is after the reboot, but it is not related to your problem. This is
a different problem, which needs urgent attention.

Adrian, can you open a separate entry for this please ? It is not a new
thing, this can be observed with older kernels as well, but it needs to
be addressed. It probably needs a similar solution as I did for the local
apic timer calibration.

	tglx
[BUG] __copy_to_user_inatomic broken on non Pentium machines
Environment: Pre Pentium systems, (boot_cpu_data.wp_works_ok == 0)

Last known working kernel: 2.6.18 (did not try 2.6.19 yet)

Enabling CONFIG_PREEMPT on latest mainline as well as 2.6.20 triggers

[ 14.15] BUG: sleeping function called from invalid context at /home/tglx/work/kernel/vanilla/linux-2.6.20/kernel/rwsem.c:20
[ 14.16] in_atomic():1, irqs_disabled():0
[ 14.16] no locks held by init/1.
[ 14.17] [<c0103346>] show_trace_log_lvl+0x1a/0x2f
[ 14.18] [<c0103441>] show_trace+0x12/0x14
[ 14.19] [<c0103cf5>] dump_stack+0x16/0x18
[ 14.19] [<c010aa62>] __might_sleep+0xc7/0xcd
[ 14.20] [<c01213a1>] down_read+0x18/0x47
[ 14.21] [<c01a01e4>] __copy_to_user_ll+0x5e/0x1b6
[ 14.22] [<c012cf85>] file_read_actor+0x10b/0x149
[ 14.23] [<c012d7b2>] do_generic_mapping_read+0x187/0x433
[ 14.24] [<c012f64b>] generic_file_aio_read+0x191/0x1ca
[ 14.24] [<c0141657>] do_sync_read+0xc2/0xff
[ 14.25] [<c0141eb6>] vfs_read+0x90/0x145
[ 14.26] [<c014227e>] sys_read+0x3f/0x63
[ 14.27] [<c0102fb0>] syscall_call+0x7/0xb
[ 14.27] ===

and

[ 22.66] BUG: scheduling while atomic: e2fsck/0x1001/272
[ 22.67] 1 lock held by e2fsck/272:
[ 22.68] #0: (&mm->mmap_sem){----}, at: [<c01a01e4>] __copy_to_user_ll+0x5e/0x1b6
[ 22.69] [<c0103346>] show_trace_log_lvl+0x1a/0x2f
[ 22.70] [<c0103441>] show_trace+0x12/0x14
[ 22.71] [<c0103cf5>] dump_stack+0x16/0x18
[ 22.72] [<c024a189>] __sched_text_start+0x71/0x57f
[ 22.72] [<c010b49f>] __cond_resched+0x21/0x3b
[ 22.73] [<c024aca7>] cond_resched+0x26/0x31
[ 22.74] [<c0137ae5>] get_user_pages+0x1e1/0x23c
[ 22.75] [<c01a021e>] __copy_to_user_ll+0x98/0x1b6
[ 22.76] [<c012cf85>] file_read_actor+0x10b/0x149
[ 22.77] [<c012d7b2>] do_generic_mapping_read+0x187/0x433
[ 22.78] [<c012f64b>] generic_file_aio_read+0x191/0x1ca
[ 22.79] [<c0141657>] do_sync_read+0xc2/0xff
[ 22.79] [<c0141eb6>] vfs_read+0x90/0x145
[ 22.80] [<c014227e>] sys_read+0x3f/0x63
[ 22.81] [<c0102fb0>] syscall_call+0x7/0xb
[ 22.82] ===

which is not surprising.

int file_read_actor(read_descriptor_t *desc, struct page *page,
			unsigned long offset, unsigned long size)
{
	/*
	 * Faults on the destination of a read are common, so do it before
	 * taking the kmap.
	 */
	if (!fault_in_pages_writeable(desc->arg.buf, size)) {
		kaddr = kmap_atomic(page, KM_USER0);
		left = __copy_to_user_inatomic(desc->arg.buf,
						kaddr + offset, size);

__copy_to_user_inatomic() is called with preempt_count == 1, due to the
kmap_atomic() above.

Now __copy_to_user_ll() takes the (boot_cpu_data.wp_works_ok == 0) path,
which in turn calls

	down_read(&current->mm->mmap_sem) - which might sleep

and

	get_user_pages() - which has a cond_resched() inside.

Not sure how to fix that.

	tglx