Re: PREEMPT_RT: 2.6.20-rt8 patch tweaked for 2.6.20.7

2007-04-23 Thread Thomas Gleixner
John,

On Fri, 2007-04-20 at 15:15 +0200, John Sigler wrote:
 I've tweaked patch-2.6.20-rt8(*) so that it applies to 2.6.20.7
 (*) http://rt.wiki.kernel.org/index.php/Main_Page
 
 The original patch can be found here:
 http://people.redhat.com/mingo/realtime-preempt/older/patch-2.6.20-rt8
 http://linux.kernel.free.fr/patch-2.6.20-rt8
 
 diff to the original patch to show what was tweaked:
 http://linux.kernel.free.fr/patch-2.6.20-rt8.diff
 
 New patch that applies cleanly to 2.6.20.7:
 http://linux.kernel.free.fr/patch-2.6.20.7-rt8
 
 As always, if someone spots something I've done wrong,
 I'd be happy to fix it in a hurry :-)
 
 Ingo, Thomas, are there any fixes that were included in the 2.6.21-rt 
 branch only that need to be back-ported to the 2.6.20-rt branch?

I've been busy with mainline merge of highres timers lately, so I have
no good overview of the -rt state at the moment, but I will check on
this later this week.

Can you create an entry in the rt-wiki, so people can find your
patches ?

tglx


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PREEMPT_RT: 2.6.20-rt8 patch tweaked for 2.6.20.7

2007-04-23 Thread Thomas Gleixner
On Mon, 2007-04-23 at 10:03 +0200, John Sigler wrote:
  Can you create an entry in the rt-wiki, so people can find your
  patches ?
 
 Sure.
 
 Should I add a link to my patch on the CONFIG PREEMPT RT Patch page?
 http://rt.wiki.kernel.org/index.php/CONFIG_PREEMPT_RT_Patch#Download
 
 e.g. in the Download section, something along the lines of:
 
 An updated version of the CONFIG_PREEMPT_RT patch (cleanly applies to
 kernel 2.6.20.7) is also available.

Yep, that's fine.

 I should probably mention that is not an officially sanctioned version,
 and that it has not received the same scrutiny as other patches?

:)

tglx




Re: [rfc][patch] futex: restartable futex_wait?

2007-03-08 Thread Thomas Gleixner
On Thu, 2007-03-08 at 18:29 +0100, Ingo Molnar wrote:
 * Nick Piggin [EMAIL PROTECTED] wrote:
 
  Hi Ingo,
  
  I'm seeing an LTP test fail for ltp test sigaction_16_24. Basically, 
  it tests whether the SA_RESTART flag works for the sem_wait operation.

Not sure, whether the testcase is correct or not. See below

  I see sem_wait is implemented with futex_wait, so I wonder whether we 
  can make it restartable? Am I going about it the right way? (Seems to 
  fix the testcase here).
 
 i think that's quite right. I'm wondering why this never came up before? 
 But your fix is not complete i think:
 
  + restart->arg2 = time;
  + return -ERESTART_RESTARTBLOCK;
  + }
 
 'time' here is relative, so the restarted syscall will do a /full/ wait 
 again.
 
 maybe we should rather convert futex timed-waits to hrtimers? Thomas?

The problem is that the original API is based on relative time and
therefore cannot be changed.

sem_wait returns -EINTR to the application when it is interrupted, while
pthread_mutex_lock does not.

http://www.opengroup.org/onlinepubs/009695399/functions/sem_wait.html

http://www.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html

We need to create a separate op for the futex - just like the pi_futex -
and use absolute time there too.
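
To illustrate why an absolute deadline survives a restart while a relative timeout does not, here is a minimal userspace sketch. It is not futex code: all names are hypothetical, times are plain nanosecond counts on one monotonic clock, and it assumes the restarted call sees the original relative argument again.

```c
#include <stdint.h>

/* Hypothetical model: every value is nanoseconds on one monotonic clock. */

/* Restart with the original relative timeout: the wait starts over,
 * so the wake-up time moves out by however long we already waited. */
static int64_t wake_time_relative_restart(int64_t restart_at_ns,
                                          int64_t rel_timeout_ns)
{
    return restart_at_ns + rel_timeout_ns;
}

/* Restart against an absolute deadline: the wake-up time does not move.
 * If the deadline has already passed, we wake up immediately. */
static int64_t wake_time_absolute_restart(int64_t restart_at_ns,
                                          int64_t deadline_ns)
{
    return deadline_ns > restart_at_ns ? deadline_ns : restart_at_ns;
}
```

With a wait started at 0 for 100 ns and a signal at 60 ns, the relative restart wakes at 160 ns, the absolute one still at 100 ns.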

tglx




Re: hardwired VMI crap

2007-03-08 Thread Thomas Gleixner
On Thu, 2007-03-08 at 15:39 -0800, Jeremy Fitzhardinge wrote:
 Ingo Molnar wrote:
   - /One/ _intelligent_ higher-level virtualization API/ABI. Xen's API is 
 quite advanced on this front.
 
 At last!  Some love!
 
 The Xen approach has always been to prefer high-level interfaces over
 lower-level ones, so that guests can meaningfully participate in their
 own virtualization.  There are some necessarily low-level things, but
 conceptually simple things like create a new vcpu should have simple
 interfaces.  There's no point in going to the effort of emulating a
 whole pile of real hardware if Xen can present an interface which is a
 close match to an existing high-level interface within the operating system.

Once you are there, you are near the point where you created a virtual
architecture, which could run on any real architecture which gets
supported by a hypervisor backend.

I'd love that :)

I know it is tricky to combine this with the upcoming hardware
virtualization support. But it's at least a worthwhile thought
experiment.

tglx




Re: hardwired VMI crap

2007-03-08 Thread Thomas Gleixner
On Thu, 2007-03-08 at 15:55 -0800, Zachary Amsden wrote:
 Jeremy Fitzhardinge wrote:
  No, but I'm not prejudiced against virtual hardware.  If we have a piece
  of code that thinks its talking to an apic, then I think its OK to use
  that code whether its a real apic or a virtual one, _so long as its
  being used in a way that's consistent with its intended interface_.  I
  have to admit I have not looked at apics - real or virtual - in any
  detail, so I won't claim to really understand the details of the
  existing arch/i386 code or what VMI's trying to do, but it does seem to
  me that it could all be much cleaner.
 
  And clean is good, we all love clean - and so, agreement!

 For APICs, we have two operations - APICRead and APICWrite.  It is nice 
 and clean, and plugs in very easily to the APIC accessors available in 
 Linux.
 
 Is this not clean?

No, because there is no need to use the APIC. You just pave the road for
doing the same thing to the IO_APIC and whatever interests you next.

 We just don't drive the local timer interrupts through the APIC, we make 
 hypercalls to schedule local timer alarms.  Which is something we must 
 do for UP kernels as well, which use the PIT / PIC.  So there is a need 
 for having clockevents code which doesn't program timers through the APIC.

 So we have one separate time device, independent from the traditional 
 hardware timers, and we just program that.  This design is not very 
 complex, nor is it unclean, IMHO.

And why exactly do you need the APIC operations for a completely
abstract and virtual clock event device? To inject the interrupt, which
you inject artificially into the paravirtualized kernel anyway?

This is simply wrong and does not help anything. The 3 lines of code you
share with the apic timer code are not a valid reason to hook yourself
into the apic.

You can use any arbitrary interrupt number to fire your VMI timer and
this works on SMP as well, as we can pin interrupts on CPUs.

tglx




Re: [rfc][patch] futex: restartable futex_wait?

2007-03-09 Thread Thomas Gleixner
On Fri, 2007-03-09 at 06:10 +0100, Nick Piggin wrote:
  i think that's quite right. I'm wondering why this never came up before? 
  But your fix is not complete i think:
  
   + restart->arg2 = time;
   + return -ERESTART_RESTARTBLOCK;
   + }
  
  'time' here is relative, so the restarted syscall will do a /full/ wait 
  again.
 
 But it has been modified by schedule_timeout?

But this does not change the syscall registers, so it is restarted in
the same way. We need a new futex OP for this, which takes absolute time
like the PI futex op does.

tglx




Re: [rfc][patch] futex: restartable futex_wait?

2007-03-09 Thread Thomas Gleixner
On Fri, 2007-03-09 at 13:24 +0100, Nick Piggin wrote:
'time' here is relative, so the restarted syscall will do a /full/ wait 
again.
   
   But it has been modified by schedule_timeout?
  
  But this does not change the syscall registers, so it is restarted in
  the same way. We need a new futex OP for this, which takes absolute time
  like the PI futex op does.
 
 Forgive me if I'm missing something here, but I'm using the restart block
 and saving the updated value of time in ->arg2, and using that as the new
 time parameter passed into futex_wait from futex_wait_restart.

Oops. I went into confusion mode. You are right, the restart block keeps
that.

tglx




Re: question about periodic clocks

2007-03-10 Thread Thomas Gleixner
On Fri, 2007-03-09 at 15:26 -0800, Jeremy Fitzhardinge wrote:
 How does the clock period get set on periodic timers?  In my clock
 driver, I'm seeing a call to ->set_mode(CLOCK_EVT_MODE_PERIODIC, evt),
 but then... nothing.  I was expecting a call to set_next_event to set
 the timer period.

Good point. I never thought about that and we set the period in the
clock event device itself. You are right, the clockevents layer should
hand over the period either with the set_mode call or separately.
Probably with the set_mode call, as it is needed exactly there and we
don't want to have an if (dev->mode == XXX) check in set_next_event().
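
A rough sketch of what handing over the period with the mode could look like. This is not the clockevents API: the struct, the enum and the function names are made up for illustration, and the period is simply assumed to be 1/HZ expressed in nanoseconds.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-ins for the clock event mode and device. */
enum clock_mode_sketch { MODE_SKETCH_ONESHOT, MODE_SKETCH_PERIODIC };

struct clock_dev_sketch {
    enum clock_mode_sketch mode;
    int64_t period_ns;          /* 0 when the device is not periodic */
};

/* Length of one periodic tick for a given HZ, in nanoseconds. */
static int64_t tick_period_ns(int hz)
{
    return NSEC_PER_SEC / hz;
}

/* Hypothetical set_mode variant: the period arrives together with the
 * mode, so set_next_event() never has to check the mode. */
static void set_mode_sketch(struct clock_dev_sketch *dev,
                            enum clock_mode_sketch mode, int64_t period_ns)
{
    dev->mode = mode;
    dev->period_ns = (mode == MODE_SKETCH_PERIODIC) ? period_ns : 0;
}
```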

I will look into this.

Thanks,

tglx




Re: question about periodic clocks

2007-03-10 Thread Thomas Gleixner
On Sat, 2007-03-10 at 07:50 -0800, Jeremy Fitzhardinge wrote:
 Thomas Gleixner wrote:
  Good point. I never thought about that and we set the period in the
  clock event device itself. You are right, the clockevents layer should
  hand over the period either with the set_mode call or separately.
  Probably with the set_mode call, as it is needed exactly there and we
  don't want to have an if (dev->mode == XXX) check in set_next_event().
 
  I look into this.
 
 So, in the meantime, the period is 1/HZ?

Yep.

 I also have a question about clockevent cpumasks.  I was using the lapic
 clockevent as a model, but as I understand it there's a lapic per CPU,
 which explains why it registers a clockevent per cpu with that cpu alone
 in the cpumask.
 
 The Xen timer is a bit different; I guess more like hpet.  There's a
 single (virtual-)machine-wide timer, which is owned by the last cpu
 which programmed it; ie, that cpu is the one which gets the resulting
 event interrupt.  Does this mean I should register a single clockevent
 device with a cpumask of CPU_MASK_ALL?  Or should I constrain it to a
 single cpu?

Uuurg. That's ugly. clockevents expect a per CPU timer especially for
dynamic ticks. If you cannot provide a per cpu timer, then you probably
need to use the broadcast trick.

Register a primary clock event device (such as PIT/HPET) and register per CPU
dummy clock event devices with CLOCK_EVT_FEAT_DUMMY set - we use the same trick when
the lapic timer is broken. The clockevents code then uses PIT/HPET as
the primary tick source and broadcasts the periodic tick to the other
CPUs. In that case the dyntick / highres features are disabled.

We did some experiments to support multiple CPUs with one timer for
hres/dyntick but it does not scale and it is so ugly that it is not
worth the trouble. It works for the lapic stops in C3 case, as we have a
well defined point (right before going into the deep power state) where
we can rearm the global clock event device. As we are idle at that point
anyway there is not much penalty, but I really don't want to do that in
an active system.

 There's a comment in hpet.c saying
 
* Start hpet with the boot cpu mask and make it
* global after the IO_APIC has been initialized.
 
 but I don't see any place where the hpet cpumask is updated.

I wanted to do that in the first place, but never bothered. In a UP
environment it does not matter. On a sane SMP box (where we do not have
the local APIC stops in C3 problem) the HPET (or likewise the PIT) is switched
off forever. In the case of LAPIC stops in C3 the HPET (PIT) is used as
a broadcast fallback. That means before we go into C3 we arm the
HPET/PIT for the earliest to expire lapic event of all CPUs. In that
case it does not matter, whether HPET/PIT is pinned to CPU#0 or anything
else.
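
The "earliest to expire event of all CPUs" selection can be sketched in plain C as below. This is only an illustration of the idea, not the broadcast code: the array of per-CPU expiry times and the function name are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the earliest of the per-CPU next-event times (nanoseconds).
 * Returns INT64_MAX when there is no CPU, i.e. nothing to arm. */
static int64_t earliest_event_ns(const int64_t *next_event_ns, size_t ncpus)
{
    int64_t earliest = INT64_MAX;

    for (size_t i = 0; i < ncpus; i++) {
        if (next_event_ns[i] < earliest)
            earliest = next_event_ns[i];
    }
    return earliest;
}
```

The global device is armed once for that value, which is why it does not matter which CPU the HPET/PIT interrupt is pinned to.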

tglx




Re: Use of absolute timeouts for oneshot timers

2007-03-10 Thread Thomas Gleixner
On Sat, 2007-03-10 at 14:52 -0800, Jeremy Fitzhardinge wrote:
 When booting under Xen, you'll get this if you're using both the xen
 clocksource and clockevent drivers.  However, it seems that during boot
 on a NO_HZ HIGHRES_TIMERS system, the kernel does not use the Xen
 clocksource until it switches to highres timer mode.  This means that
 during boot the kernel's monotonic clock is drifting with respect to the
 hypervisor, and all timeouts are unreliable.

The clocksource is not used until the clocksource is installed. Also the
periodic mode during boot, when the clock event device supports periodic
mode, is not reading the time. It relies on the clock event device
getting it straight. That's not a big deal during boot and on a kernel
with NO_HZ=n and HIGHRES=n the periodic tick only updates jiffies. If
the only clocksource is jiffies, then we have to live with it and we do
not switch to NO_HZ/HIGHRES as we would lose track of time.

Once we switch to NO_HZ or HIGHRES the clock event device is directly
coupled to the clocksource.

 Initially I was just computing the kernel-hypervisor offset at boot
 time, but then I changed it to recompute it every time the timer mode
 changes.  However, this didn't really help, and I was still getting
 unpredictable timeouts during boot.  I've changed it to just compute the
 hypervisor absolute time directly using the delta each time the oneshot
 timer is set, which will definitely be reliable (if the kernel and
 hypervisor have drifting timebases then the meaning of Xns delta will be
 different, but at least thats a local error rather than a long-term
 cumulative error).

We do not really care up to the point where the high resolution
clocksource (e.g. TSC, PM-Timer or HPET on real hardware) becomes
active. Early boot is fragile and we switch over to the high res clocksource
and highres/nohz when things have stabilized.

 My analysis might be wrong here (I suspect the Xen periodic timer may
 have unexpected behaviour), but the overall conclusion still stands:
 using an absolute timeout only works if the kernel and hypervisor have
 non-drifting timebases.  I think its too fragile for a clockevent
 implementation to assume that a particular clocksource is in use to get
 reliable results.

Once we have switched over to the clocksource, everything should be in
perfect sync.

 Or perhaps this is a property of the whole clock subsystem: that
 clockevents must be paired with clocksources.  But its not obvious to me
 that this enforced, or even acknowledged.

It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute
time, which is read back from the clocksource, even if we use a relative
value for real hardware clock event devices to program the next event.
We calculate the delta between the absolute event and now. So we never
get an accumulating error.
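
A minimal sketch of that delta calculation, under the assumption that all values are nanoseconds read from the same clocksource. The function name and the min_delta_ns clamp are illustrative and not taken from the clockevents code.

```c
#include <stdint.h>

/* Compute the relative value to program into the hardware for an event
 * that is tracked in absolute time. Because the delta is recomputed
 * from "now" on every programming, an error in one delta does not
 * carry over into the next one. */
static int64_t next_event_delta_ns(int64_t expires_abs_ns, int64_t now_ns,
                                   int64_t min_delta_ns)
{
    int64_t delta = expires_abs_ns - now_ns;

    /* Events that are due or overdue fire as soon as the device can. */
    return delta < min_delta_ns ? min_delta_ns : delta;
}
```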

What problem are you observing ?

tglx




Re: Use of absolute timeouts for oneshot timers

2007-03-11 Thread Thomas Gleixner
On Sat, 2007-03-10 at 16:42 -0800, Jeremy Fitzhardinge wrote:
 Thomas Gleixner wrote:
  It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute
  time, which is read back from the clocksource, even if we use a relative
  value for real hardware clock event devices to program the next event.
  We calculate the delta between the absolute event and now. So we never
  get an accumulating error.
 
  What problem are you observing ?
 
 Actually, two things.  There was the unexpected pauses during boot,
 which is trivially fixable by not using the Xen periodic timer, and
 using the single-shot fallback.
 
 But I'm making the more general observation that if you use an absolute
 rather than relative time to set the single-shot timeout, then you have
 to deal with a long-term cumulative drift between the kernel's monotonic
 time and the hypervisor's monotonic time.  This can happen even if your
 clocksource is derived directly from the hypervisor monotonic time,
 because running ntp will warp the kernel's time, and so it will drift
 with respect to the hypervisor clock.  You can only avoid this by 1) not
 allowing adjtime, or 2) making those same adjtime warps to the
 hypervisor time.  Neither of these is a good general solution.

Sigh, yes. Using a relative time for the next event is probably the
least ugly solution.

tglx





Re: [patch 6/9] signalfd/timerfd - timerfd core ...

2007-03-11 Thread Thomas Gleixner
Davide,

On Sat, 2007-03-10 at 18:22 -0800, Davide Libenzi wrote:

Some remarks:

 +
 +asmlinkage long sys_timerfd(int ufd, int clockid, int tmrtype,
 + const struct timespec __user *utmr)
 +{
 + int error;
 + struct timerfd_ctx *ctx;
 + struct file *file;
 + struct inode *inode;
 + ktime_t tval, tnow;
 + struct timespec ktmr, tmrnow;
 +
 + error = -EFAULT;
 + if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
 + goto err_exit;

Please do not use goto for a simple
return -EFAULT;

Please validate the timespec before converting it.

if (!timespec_valid(&ktmr))
return -EINVAL;
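
For reference, the kind of check meant here, written as a userspace sketch. It mirrors what such a validity test usually looks at (no negative seconds, nanoseconds within one second). The function name is hypothetical and this is not the kernel helper itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

#define NSEC_PER_SEC_SKETCH 1000000000L

/* Return 1 if ts is a normalized, non-negative timespec, else 0. */
static int timespec_valid_sketch(const struct timespec *ts)
{
    if (ts->tv_sec < 0)
        return 0;
    if (ts->tv_nsec < 0 || ts->tv_nsec >= NSEC_PER_SEC_SKETCH)
        return 0;
    return 1;
}
```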


 + tval = timespec_to_ktime(ktmr);
 + error = -EINVAL;
 + if (clockid != CLOCK_MONOTONIC &&
 + clockid != CLOCK_REALTIME)
 + goto err_exit;
 + switch (tmrtype) {
 + case TFD_TIMER_REL:
 + case TFD_TIMER_SEQ:
 + break;
 + case TFD_TIMER_ABS:
 + getnstimeofday(&tmrnow);
 + tnow = timespec_to_ktime(tmrnow);

tnow = ktime_get();

 + if (ktime_to_ns(tval) <= ktime_to_ns(tnow))
 + goto err_exit;
 + tval = ktime_sub(tval, tnow);

Why do you want to do that ? hrtimers handle relative and absolute
expiry times. You break down everything to relative time and lose the
accuracy for absolute timers. 

 + break;
 + default:
 + goto err_exit;
 + }
 +
 + if (ufd == -1) {
 + error = -ENOMEM;
 + ctx = kmem_cache_alloc(timerfd_ctx_cachep, GFP_KERNEL);
 + if (!ctx)
 + goto err_exit;
 +
 + init_waitqueue_head(&ctx->wqh);
 + spin_lock_init(&ctx->lock);
 + ctx->ticks = 0;
 + ctx->tmrtype = tmrtype;
 + ctx->clockid = clockid;
 + ctx->tval = tval;
 + hrtimer_init(&ctx->tmr, ctx->clockid, HRTIMER_REL);
 + ctx->tmr.expires = ctx->tval;
 + ctx->tmr.function = timerfd_tmrproc;
 +
 + hrtimer_start(&ctx->tmr, ctx->tval, HRTIMER_REL);
 +
 + /*
 +  * When we call this, the initialization must be complete, since
 +  * aino_getfd() will install the fd.
 +  */
 + error = aino_getfd(&ufd, &inode, &file, "[timerfd]",
 +&timerfd_fops, ctx);
 + if (error)
 + goto err_fdalloc;

Why is the timer started before we have everything in place ? 

Also if you turn it around then the (re)programming part of the timer
can be shared.

 + } else {
 + error = -EBADF;
 + file = fget(ufd);
 + if (!file)
 + goto err_exit;
 + ctx = file->private_data;
 + error = -EINVAL;
 + if (file->f_op != &timerfd_fops) {
 + fput(file);
 + goto err_exit;
 + }
 +
 + /*
 +  * We need to stop the exiting timer before. We call
 +  * hrtimer_cancel() w/out holding our lock.
 +  */
 + spin_lock_irq(&ctx->lock);
 + while (hrtimer_active(&ctx->tmr)) {
 + spin_unlock_irq(&ctx->lock);
 + hrtimer_cancel(&ctx->tmr);
 + spin_lock_irq(&ctx->lock);
 + }

Please use hrtimer_try_to_cancel()

retry:
	spin_lock_irq(&ctx->lock);
	if (hrtimer_try_to_cancel(&ctx->tmr) < 0) {
		spin_unlock_irq(&ctx->lock);
		cpu_relax();
		goto retry;
	}

 +
 +static unsigned int timerfd_poll(struct file *file, poll_table *wait)
 +{
 + struct timerfd_ctx *ctx = file->private_data;
 +
 + poll_wait(file, &ctx->wqh, wait);
 +
 + return ctx->ticks ? POLLIN : 0;

This is racy:

timer is set up (non periodic)
timer expires
poll 

now poll is stuck for ever !


tglx




Re: SwSusp to disk doesn't work - Try 2

2007-03-11 Thread Thomas Gleixner
On Sun, 2007-03-11 at 22:09 +0100, Rafael J. Wysocki wrote:
   update_sched_domains
detach_destroy_domains
  [waits here] -- synchronize_sched (==synchronize_rcu)
   
   Well, I think the call to wait_for_completion() does not return, 
   probably
   because the task supposed to complete the completion is frozen at this
   point.  Can you please try to confirm that it gets stuck on
   wait_for_completion() in synchronize_rcu()?
 
   Yes, it's in wait_for_completion() in synchronize_rcu().
   As noted in some previous mail, it will wake up after
   event - key press etc.
  
   Patch in http://lkml.org/lkml/2007/3/7/255 solves different problem.
   I added it to my quilt and applied anyway - no change.
   
   Does the problem go away if NO_HZ is unset?
 
   
   i tried to boot with nohz=off, but the problem did persist.
  
  H, both variants (nohz=off or recompiled kernel without NO_HZ) works 
  for me.
 
 Definitely something strange is going on here.
 
 I think we need an advice from someone who knows the RCU internals.

RCU synchronization depends on the timer interrupt. Which kernel version
are you guys talking about ?

tglx





Re: [patch] futex: PI state locking fix

2007-03-12 Thread Thomas Gleixner
On Mon, 2007-03-12 at 10:13 +0100, Ingo Molnar wrote:
 Subject: [patch] futex: PI state locking fix
 From: Ingo Molnar [EMAIL PROTECTED]
 
 testing of -rt by IBM uncovered a locking bug in wake_futex_pi(): the PI 
 state needs to be locked before we access it.
 
 this patch has been tested in -rt. Must-have for v2.6.21.
 
 Signed-off-by: Ingo Molnar [EMAIL PROTECTED]

Acked-by: Thomas Gleixner [EMAIL PROTECTED]

 --
  kernel/futex.c |2 ++
  1 file changed, 2 insertions(+)
 
 Index: linux/kernel/futex.c
 ===
 --- linux.orig/kernel/futex.c
 +++ linux/kernel/futex.c
 @@ -566,6 +566,7 @@ static int wake_futex_pi(u32 __user *uad
   if (!pi_state)
   return -EINVAL;
  
 + spin_lock(&pi_state->pi_mutex.wait_lock);
   new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
  
   /*
 @@ -605,6 +606,7 @@ static int wake_futex_pi(u32 __user *uad
   pi_state->owner = new_owner;
   spin_unlock_irq(&new_owner->pi_lock);
  
 + spin_unlock(&pi_state->pi_mutex.wait_lock);
   rt_mutex_unlock(&pi_state->pi_mutex);
  
   return 0;



Re: [patch 6/9] signalfd/timerfd v3 - timerfd core ...

2007-03-12 Thread Thomas Gleixner
Davide,

On Sun, 2007-03-11 at 16:04 -0700, Davide Libenzi wrote:
 +static int timerfd_setup(struct timerfd_ctx *ctx, int clockid, int tmrtype,
 +  const struct itimerspec *ktmr)
 +{
 + enum hrtimer_mode htmode;
 + ktime_t texp, tintv;
 +
 + if (clockid != CLOCK_MONOTONIC &&
 + clockid != CLOCK_REALTIME)
 + return -EINVAL;

Please move the validation for clockid, tmrtype and the itimerspec into
sys_timerfd. Do it before anything else. Also please validate both
it_value and it_interval unconditionally. Userspace should not send
uninitialized stuff at all.

The TFD_TIMER_SEQ thing is quite different to all other timer interfaces
which POSIX provides. Both itimers and posixtimers use the it_interval
value to distinguish between one shot and periodic timers.

I think we should keep this new interface analogous, so programmers
don't get more confused than they already are. :)

This also allows relative and absolute starting points for both one shot
and sequential timers.

Please use it_value == 0 to stop the timer. This is the same as for
itimers and posixtimers. Right now you have to close the fd to stop a
timer, but that's not necessarily what you want. 
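
The it_value / it_interval convention described above can be sketched as follows. The enum and function names are made up for illustration; only struct itimerspec and its two fields come from POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical classification of what a caller is asking for. */
enum timer_request { TIMER_REQ_STOP, TIMER_REQ_ONESHOT, TIMER_REQ_PERIODIC };

static int timespec_is_zero(const struct timespec *ts)
{
    return ts->tv_sec == 0 && ts->tv_nsec == 0;
}

/* it_value == 0 stops the timer; otherwise it_interval == 0 means
 * one shot and a non-zero it_interval means periodic. */
static enum timer_request classify_itimerspec(const struct itimerspec *its)
{
    if (timespec_is_zero(&its->it_value))
        return TIMER_REQ_STOP;
    if (timespec_is_zero(&its->it_interval))
        return TIMER_REQ_ONESHOT;
    return TIMER_REQ_PERIODIC;
}
```

No separate timer type flag is needed to tell one shot from periodic, which is the point of dropping TFD_TIMER_SEQ.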

Why do you want to store information, which is only relevant for setup
in ctx ?

If you do the validation right in sys_timerfd and get rid of
TFD_TIMER_SEQ and the various useless fields, then timerfd_setup() boils
down to

	ctx->ticks = 0;
	ctx->tintv = tintv;
	hrtimer_init(&ctx->tmr, clockid, htmode);
	ctx->tmr.function = timerfd_tmrproc;

	if (texp.tv64 != 0)
		hrtimer_start(&ctx->tmr, texp, htmode);

and in the timer function you simply check for

	if (ctx->tintv.tv64 != 0)

instead of the TIMER_SEQ mode.

 +asmlinkage long sys_timerfd(int ufd, int clockid, int tmrtype,
 + const struct itimerspec __user *utmr)
 +{
 + int error;
 + struct timerfd_ctx *ctx;
 + struct file *file;
 + struct inode *inode;
 + struct itimerspec ktmr;
 +
 + if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
 + return -EFAULT;

Do validation of clockid, tmrtype and ktmr here.

 + if (ufd == -1) {
 + error = -ENOMEM;
 + ctx = kmem_cache_alloc(timerfd_ctx_cachep, GFP_KERNEL);
 + if (!ctx)
 + goto err_exit;

return -ENOMEM;

 + init_waitqueue_head(&ctx->wqh);
 + spin_lock_init(&ctx->lock);
 + ctx->clockid = -1;
 +
 + error = timerfd_setup(ctx, clockid, tmrtype, &ktmr);
 + if (error)
 + goto err_ctxfree;
 +
 + /*
 +  * When we call this, the initialization must be complete, since
 +  * aino_getfd() will install the fd.
 +  */
 + error = aino_getfd(&ufd, &inode, &file, "[timerfd]",
 +&timerfd_fops, ctx);
 + if (error)
 + goto err_ctxfree;
 + } else {
 + error = -EBADF;
 + file = fget(ufd);
 + if (!file)
 + goto err_exit;

return -EBADF;

 + ctx = file->private_data;
 + error = -EINVAL;
 + if (file->f_op != &timerfd_fops) {
 + fput(file);
 + goto err_exit;

return -EINVAL;

 + }
 + /*
 +  * We need to stop the exiting timer before.
 +  */

-ENOPARSE. You probably mean: We need to stop an already running timer
before we do a new setup.

 + for (;;) {
 + spin_lock_irq(&ctx->lock);
 + if (hrtimer_try_to_cancel(&ctx->tmr) >= 0)
 + break;
 + spin_unlock_irq(&ctx->lock);
 + cpu_relax();
 + }
 + /*
 +  * Re-program the timer to the new value ...
 +  */
 + error = timerfd_setup(ctx, clockid, tmrtype, &ktmr);
 +
 + spin_unlock_irq(&ctx->lock);
 + fput(file);
 + if (error)
 + goto err_exit;

return error;

 + }
 +
 + return ufd;
 +
 +err_ctxfree:
 + timerfd_cleanup(ctx);
 +err_exit:
 + return error;
 +}
 +

tglx




Re: [patch] change futex_wait() to hrtimers

2007-03-12 Thread Thomas Gleixner
On Mon, 2007-03-12 at 12:27 +0100, Andi Kleen wrote:
 Ingo Molnar [EMAIL PROTECTED] writes:
  
  the only correct approach is the use of hrtimers, and a patch exists for 
  that - see below. This has been included in -rt for quite some time.
 
 But isn't that bad for power management? You'll likely get more
 idle wakeups, won't you?

Why so? The wakeup is more precise, but it still happens only once.

tglx




Re: [patch] change futex_wait() to hrtimers

2007-03-12 Thread Thomas Gleixner
On Mon, 2007-03-12 at 12:02 +0100, Ingo Molnar wrote:
  Well I did convert futex_wait to an absolute timeout based version in 
  the subsequent incremental patch. I think that is OK?
 
 it still has the rounding artifacts: using timer_list there is no way to 
 do a precise long sleep based on many small sleeps.
 
 even if this means more work for you (i'm sorry about that!) i'm quite 
 sure we should take Sebastien's hrtimers based implementation of 
 futex_wait(), and use the nanosleep method to restart it. There's no 
 point in further tweaking the imprecise approach: whenever some timeout 
 needs to be restarted, it's a candidate for hrtimers.
 
 until then, glibc already handles timeouts and restarts it manually.

This also allows us to add a separate absolute time based futex op,
which allows us to remove the conversion of abstime to reltime in glibc.
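
For illustration, the sort of abstime to reltime conversion that becomes unnecessary with an absolute time op. This is a generic sketch and not glibc's code: the function name is made up, and both values are assumed to be normalized and taken from the same clock.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Convert an absolute deadline into a relative timeout measured from
 * "now". A deadline in the past yields a zero timeout. */
static struct timespec abs_to_rel(struct timespec abs, struct timespec now)
{
    struct timespec rel;

    rel.tv_sec = abs.tv_sec - now.tv_sec;
    rel.tv_nsec = abs.tv_nsec - now.tv_nsec;
    if (rel.tv_nsec < 0) {
        /* Borrow one second to keep tv_nsec in [0, 1e9). */
        rel.tv_sec -= 1;
        rel.tv_nsec += 1000000000L;
    }
    if (rel.tv_sec < 0) {
        rel.tv_sec = 0;
        rel.tv_nsec = 0;
    }
    return rel;
}
```

Any time that passes between reading "now" and entering the kernel is added to the wait, which is one more reason to pass the absolute value straight through.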

tglx




Re: [6/6] 2.6.21-rc3: known regressions

2007-03-13 Thread Thomas Gleixner
On Tue, 2007-03-13 at 13:50 +0100, Adrian Bunk wrote:
 Subject: hrtimer_switch_to_hres():
  wrong tick_init_highres() return value handling
 References : http://lkml.org/lkml/2007/3/6/262
 Submitter  : Linus Torvalds [EMAIL PROTECTED]
 Caused-By  : Thomas Gleixner [EMAIL PROTECTED]
  commit 54cdfdb47f73b5af3d1ebb0f1e383efbe70fde9e
 Handled-By : Thomas Gleixner [EMAIL PROTECTED]
 Status : unknown

Linus merged the original patch, which solved the real problem. 

He just gave me a lesson on how to do it right next time.

tglx




Re: x86_64 system lockup from userspace using setitimer()

2007-03-13 Thread Thomas Gleixner
On Tue, 2007-03-13 at 16:02 -0400, Chuck Ebbert wrote:
  struct itimerval tim = {
  .it_interval = {
  .tv_sec = 140735669863712,
  .tv_usec = 4199521
  },
 Could this be fixed by:
 
 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8bfd9a7a229b5f3d3eda5d7d45c2eebec5b4ba16
 
 [PATCH] hrtimers: prevent possible itimer DoS

No. The possible DoS is only when high res timers are enabled, which is
not the case in 2.6.20.

Looking at the values 

140735669863712 = 0x7FFF 939C 0520

We convert seconds to nanoseconds:

140735669863712 * 1e9 =  0x1DCD 4BC3 6B82 914B 4000

The seconds value is limited to LONG_MAX, but on a 64 bit machine
140735669863712 is within LONG_MAX, so we get a multiplication
overflow.
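
A small userspace sketch of a checked seconds to nanoseconds conversion for the value above. It is not the kernel's conversion code: the function name is hypothetical and saturating to the 64 bit limits is an assumption about what a sane fallback would be.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Convert seconds to nanoseconds in 64 bit signed arithmetic.
 * Returns 0 on success. Returns 1 if the product does not fit, in
 * which case *ns is saturated to INT64_MAX or INT64_MIN. */
static int secs_to_ns_checked(int64_t secs, int64_t *ns)
{
    if (secs > INT64_MAX / NSEC_PER_SEC) {
        *ns = INT64_MAX;
        return 1;
    }
    if (secs < INT64_MIN / NSEC_PER_SEC) {
        *ns = INT64_MIN;
        return 1;
    }
    *ns = secs * NSEC_PER_SEC;
    return 0;
}
```

INT64_MAX / NSEC_PER_SEC is 9223372036, so anything above roughly 292 years worth of seconds cannot be represented as signed 64 bit nanoseconds.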

I'm not sure how this results in a DoS, but I will look into this
tomorrow morning, when I'm more awake.

tglx




Re: [6/6] 2.6.21-rc3: known regressions

2007-03-13 Thread Thomas Gleixner
On Tue, 2007-03-13 at 13:50 +0100, Adrian Bunk wrote:
 This email lists some known regressions in Linus' tree compared to 2.6.20.
 
 If you find your name in the Cc header, you are either submitter of one
 of the bugs, maintainer of an affected subsystem or driver, a patch
 of you caused a breakage or I'm considering you in any other way
 possibly involved with one or more of these issues.
 
 Due to the huge amount of recipients, please trim the Cc when answering.
 Subject: Clocksource tsc unstable (delta = -154983451 ns)
 References : http://lkml.org/lkml/2007/3/9/271
 Submitter  : Jiri Slaby [EMAIL PROTECTED]
 Status : unknown

That's not a regression. That's an informational message, printed when
the TSC watchdog detects that the TSC is unreliable.

tglx




[PATCH] hrtimer: prevent overrun DoS in hrtimer_forward()

2007-03-14 Thread Thomas Gleixner
hrtimer_forward() does not check for a possible overflow of
timer->expires. This can happen on 64 bit machines with large interval
values and currently results in an endless loop in the softirq, because
the expiry value becomes negative and therefore the timer is expired all
the time.

Check for this condition and set the expiry value to the max. expiry
time in the future.

The fix should be applied to stable kernel series as well.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED],de

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index ec4cb9f..5e7122d 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -644,6 +644,12 @@ hrtimer_forward(struct hrtimer *timer, k
 		orun++;
 	}
 	timer->expires = ktime_add(timer->expires, interval);
+	/*
+	 * Make sure, that the result did not wrap with a very large
+	 * interval.
+	 */
+	if (timer->expires.tv64 < 0)
+		timer->expires = ktime_set(KTIME_SEC_MAX, 0);
 
 	return orun;
 }
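
The effect of the added check can be modelled in user space. The
following is a rough sketch under stated assumptions, not the kernel
implementation: forward_expiry() is a hypothetical name, ktime_t is
reduced to a plain signed 64-bit nanosecond count, both inputs are
assumed to be non-negative, and the wrapping addition is done with the
GCC/Clang builtin __builtin_add_overflow to avoid undefined behaviour.

```c
#include <stdint.h>

/* Illustration only: stand-ins for the kernel's ktime limits. */
#define NSEC_PER_SEC	1000000000LL
#define KTIME_MAX	INT64_MAX
#define KTIME_SEC_MAX	(KTIME_MAX / NSEC_PER_SEC)

/*
 * Hypothetical model of the expiry update in hrtimer_forward():
 * add the interval and, if the signed sum wrapped to a negative
 * value, clamp the expiry to the largest whole-second value.
 */
static int64_t forward_expiry(int64_t expires, int64_t interval)
{
	int64_t next;

	/* Stores the wrapped (truncated) sum in next, without signed-overflow UB. */
	(void)__builtin_add_overflow(expires, interval, &next);

	/* Same test as in the patch: a negative expiry means the sum wrapped. */
	if (next < 0)
		next = KTIME_SEC_MAX * NSEC_PER_SEC;

	return next;
}
```

Without the clamp the wrapped value stays negative, which is what makes
the timer look expired on every pass through the softirq.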




Re: [6/6] 2.6.21-rc3: known regressions

2007-03-14 Thread Thomas Gleixner
On Wed, 2007-03-14 at 19:02 +0100, Florian Lohoff wrote:
 On Wed, Mar 14, 2007 at 12:44:17PM +0100, Adrian Bunk wrote:
Due to the huge amount of recipients, please trim the Cc when answering.
Subject: Clocksource tsc unstable (delta = -154983451 ns)
References : http://lkml.org/lkml/2007/3/9/271
Submitter  : Jiri Slaby [EMAIL PROTECTED]
Status : unknown
   
   That's not a regression. That's an informational message, printed when
   the TSC watchdog detects that the TSC is unreliable.
  
  Looking at [1], there's also a probably related "doesn't boot"
  problem.
  My first guess would be commit 6bb74df481223731af6c7e0ff3adb31f6442cfcd
  "clocksource init adjustments (fix bug #7426)".
  
  Jiri, is the message also present with 2.6.21-rc2 (at a different place 
  of the dmesg) for you?
 
 With the current git of today the halt on boot is gone. I am running 
 it now ...

I'm really curious what made it go away.

tglx




[PATCH] hrtimer: Fixup unlocked access to wall_to_monotonic

2007-03-14 Thread Thomas Gleixner
commit f4304ab21513b834c8fe3403927c60c2b81a72d7 (HZ free NTP) moved the
access to wall_to_monotonic in hrtimer_get_softirq_time() out of the
xtime_lock protection.

Move it back into the seq_lock section.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index ec4cb9f..e2053da 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -135,7 +135,7 @@ EXPORT_SYMBOL_GPL(ktime_get_ts);
 static void hrtimer_get_softirq_time(struct hrtimer_cpu_base *base)
 {
 	ktime_t xtim, tomono;
-	struct timespec xts;
+	struct timespec xts, tom;
 	unsigned long seq;
 
 	do {
@@ -145,10 +145,11 @@ #ifdef CONFIG_NO_HZ
 #else
 		xts = xtime;
 #endif
+		tom = wall_to_monotonic;
 	} while (read_seqretry(&xtime_lock, seq));
 
 	xtim = timespec_to_ktime(xts);
-	tomono = timespec_to_ktime(wall_to_monotonic);
+	tomono = timespec_to_ktime(tom);
 	base->clock_base[CLOCK_REALTIME].softirq_time = xtim;
 	base->clock_base[CLOCK_MONOTONIC].softirq_time =
 		ktime_add(xtim, tomono);
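
The reason both values have to be copied inside the retry loop can be
shown with a small user-space model. This is a sketch only: it does not
use the kernel seqlock API, it assumes a single writer, and struct
clock_state, clock_update() and clock_monotonic() are hypothetical
names built on C11 atomics.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustration only: a tiny sequence counter guarding two values. */
struct clock_state {
	atomic_uint seq;		/* odd while a writer is active */
	_Atomic int64_t xtime;		/* stand-in for xtime */
	_Atomic int64_t wall_to_mono;	/* stand-in for wall_to_monotonic */
};

/* Writer side (single writer assumed): bump seq around the update. */
static void clock_update(struct clock_state *cs, int64_t xt, int64_t w2m)
{
	unsigned int s = atomic_load_explicit(&cs->seq, memory_order_relaxed);

	atomic_store_explicit(&cs->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&cs->xtime, xt, memory_order_relaxed);
	atomic_store_explicit(&cs->wall_to_mono, w2m, memory_order_relaxed);
	atomic_store_explicit(&cs->seq, s + 2, memory_order_release);
}

/*
 * Reader side: both values are copied inside the retry loop, so the
 * returned sum is always computed from one consistent snapshot.
 */
static int64_t clock_monotonic(struct clock_state *cs)
{
	unsigned int seq;
	int64_t xt, w2m;

	do {
		seq = atomic_load_explicit(&cs->seq, memory_order_acquire);
		xt = atomic_load_explicit(&cs->xtime, memory_order_relaxed);
		w2m = atomic_load_explicit(&cs->wall_to_mono,
					   memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while ((seq & 1) ||
		 seq != atomic_load_explicit(&cs->seq, memory_order_relaxed));

	return xt + w2m;
}
```

Reading wall_to_mono after the loop, as the code did before this patch,
would allow the sum to mix an old xtime with a new offset.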




Re: [patch 6/13] signalfd/timerfd/asyncfd v5 - timerfd core ...

2007-03-15 Thread Thomas Gleixner
Davide,

On Wed, 2007-03-14 at 15:19 -0700, Davide Libenzi wrote:

 +static int timerfd_tmrproc(struct hrtimer *htmr)
 +{
 +	struct timerfd_ctx *ctx = container_of(htmr, struct timerfd_ctx, tmr);
 +	int rval = HRTIMER_NORESTART;
 +	unsigned long flags;
 +
 +	spin_lock_irqsave(&ctx->lock, flags);
 +	ctx->ticks++;
 +	wake_up_locked(&ctx->wqh);
 +	if (ctx->tintv.tv64 != 0) {
 +		hrtimer_forward(htmr, htmr->base->softirq_time, ctx->tintv);

Sorry, I missed that in the first reviews. Please use
hrtimer_cb_get_time(htmr) instead of htmr->base->softirq_time, so this
is high res timer safe.

 +		rval = HRTIMER_RESTART;
 +	}
 +	spin_unlock_irqrestore(&ctx->lock, flags);
 +
 +	return rval;
 +}
 +
 +
 +static int timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
 +			 const struct itimerspec *ktmr)
 +{
 +{

Make this void, returns 0 anyway

 +	enum hrtimer_mode htmode;
 +
 +	htmode = (flags & TFD_TIMER_ABSTIME) ? HRTIMER_ABS: HRTIMER_REL;
 +
 +	ctx->ticks = 0;
 +	ctx->clockid = clockid;
 +	ctx->flags = flags;
 +	ctx->texp = timespec_to_ktime(ktmr->it_value);

clockid is stored in the timer on setup, so no need to store it again.
expiry time and flags are not used after setup.

Please remove those fields.

 +	ctx->tintv = timespec_to_ktime(ktmr->it_interval);
 +	hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
 +	ctx->tmr.expires = ctx->texp;
 +	ctx->tmr.function = timerfd_tmrproc;
 +	if (ctx->texp.tv64 != 0)
 +		hrtimer_start(&ctx->tmr, ctx->texp, htmode);
 +
 +	return 0;
 +}
 +
 +	if (ufd == -1) {
 +		ctx = kmem_cache_alloc(timerfd_ctx_cachep, GFP_KERNEL);
 +		if (!ctx)
 +			return -ENOMEM;
 +
 +		init_waitqueue_head(&ctx->wqh);
 +		spin_lock_init(&ctx->lock);
 +		ctx->clockid = -1;
 +
 +		error = timerfd_setup(ctx, clockid, flags, ktmr);
 +		if (error)
 +			goto err_ctxfree;

Timer setup can not fail

 +		/*
 +		 * When we call this, the initialization must be complete, since
 +		 * aino_getfd() will install the fd.
 +		 */
 +		error = aino_getfd(&ufd, &inode, &file, "[timerfd]",
 +				   &timerfd_fops, ctx);
 +		if (error)
 +			goto err_ctxfree;

Again: Please turn this around. No need to start the timer before we
know, that everything works. 

 +	} else {
 +		file = fget(ufd);
 +		if (!file)
 +			return -EBADF;
 +		ctx = file->private_data;
 +		if (file->f_op != &timerfd_fops) {
 +			fput(file);
 +			return -EINVAL;
 +		}
 +		/*
 +		 * We need to stop the existing timer before reprogramming
 +		 * it to the new values.
 +		 */
 +		for (;;) {
 +			spin_lock_irq(&ctx->lock);
 +			if (hrtimer_try_to_cancel(&ctx->tmr) >= 0)
 +				break;
 +			spin_unlock_irq(&ctx->lock);
 +			cpu_relax();
 +		}
 +		/*
 +		 * Re-program the timer to the new value ...
 +		 */
 +		error = timerfd_setup(ctx, clockid, flags, ktmr);

Timer setup can not fail

 +		spin_unlock_irq(&ctx->lock);
 +		fput(file);
 +		if (error)
 +			return error;
 +	}
 +
 +	return ufd;
 +
 +err_ctxfree:
 +	timerfd_cleanup(ctx);
 +	return error;
 +}
 +
 +
 +static void timerfd_cleanup(struct timerfd_ctx *ctx)
 +{
 +	if (ctx->clockid >= 0)
 +		hrtimer_cancel(&ctx->tmr);

You don't have a file descriptor, when the setup failed. So the timer is
always initialized.

 +	kmem_cache_free(timerfd_ctx_cachep, ctx);
 +}
 +
 +
 +static int timerfd_close(struct inode *inode, struct file *file)
 +{
 +	timerfd_cleanup(file->private_data);
 +	return 0;
 +}
 +

Please move the timerfd_cleanup code into close(). 

tglx




Re: [patch 6/13] signal/timer/event fds v6 - timerfd core ...

2007-03-16 Thread Thomas Gleixner
On Thu, 2007-03-15 at 17:22 -0700, Davide Libenzi wrote:
 +static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
 +			  const struct itimerspec *ktmr)
 +{
 +	enum hrtimer_mode htmode;
 +
 +	htmode = (flags & TFD_TIMER_ABSTIME) ? HRTIMER_MODE_ABS: HRTIMER_MODE_REL;
 +
 +	ctx->ticks = 0;
 +	ctx->texp = timespec_to_ktime(ktmr->it_value);

I know, I'm racking your nerves. texp is only used for setup. No need to
carry it in the ctx data structure. :)

 +	ctx->tintv = timespec_to_ktime(ktmr->it_interval);
 +	hrtimer_init(&ctx->tmr, clockid, htmode);
 +	ctx->tmr.expires = ctx->texp;
 +	ctx->tmr.function = timerfd_tmrproc;
 +	if (ctx->texp.tv64 != 0)
 +		hrtimer_start(&ctx->tmr, ctx->texp, htmode);
 +}

Thanks,

tglx




Re: [patch 6/13] signalfd/timerfd/asyncfd v5 - timerfd core ...

2007-03-16 Thread Thomas Gleixner
On Thu, 2007-03-15 at 16:02 -0700, Davide Libenzi wrote:
   +		/*
   +		 * When we call this, the initialization must be complete, since
   +		 * aino_getfd() will install the fd.
   +		 */
   +		error = aino_getfd(&ufd, &inode, &file, "[timerfd]",
   +				   &timerfd_fops, ctx);
   +		if (error)
   +			goto err_ctxfree;
  
  Again: Please turn this around. No need to start the timer before we
  know, that everything works. 
 
 The timerfd_setup() is not locked, so we need to make sure everything is 
 setup, before advertising the fd (and aino_getfd does that).

Right. Did not think about the bad boys peeking at file descriptors :)

tglx




Re: Linux 2.6.21-rc4

2007-03-16 Thread Thomas Gleixner
On Fri, 2007-03-16 at 21:34 +0100, Rafael J. Wysocki wrote:
 On Friday, 16 March 2007 17:33, Linus Torvalds wrote:
  
  I pushed out the -git trees yesterday, but then got distracted, so the 
  patches and tar-balls and the announcement got delayed until this morning. 
  Oops. I'm a scatter-brain.
  
  Anyway, the good news about -rc4 is that there's just lots of random 
  fixes. I'm hoping that we've seriously cut down on the regression list, 
  and I'd ask everybody who is on Adrian's list to please re-verify their 
  regression, and in case it's one of the patches available ones but I 
  haven't merged (maybe because it hasn't been sent to me!), make sure I do.
 
 I'm afraid that if CONFIG_TICK_ONESHOT or CONFIG_NO_HZ is set, we still have a
 problem with RCU synchronization while nonboot CPUs are being enabled during a
 resume (http://lkml.org/lkml/2007/3/11/144, http://lkml.org/lkml/2007/3/4/88).
 
 Can someone who had this problem with -rc3 check if it's present in -rc4?

I finally found a box today, which shows this problem. I'm working on a
fix.

tglx




Re: [PATCH] hrtimer: prevent overrun DoS in hrtimer_forward()

2007-03-16 Thread Thomas Gleixner
On Fri, 2007-03-16 at 12:43 -0800, Andrew Morton wrote:
 On Wed, 14 Mar 2007 11:00:12 +0100 Thomas Gleixner [EMAIL PROTECTED] wrote:
 
  hrtimer_forward() does not check for the possible overflow of
  timer->expires. This can happen on 64 bit machines with large interval
  values and results currently in an endless loop in the softirq because
  the expiry value becomes negative and therefore the timer is expired all
  the time.
  
  Check for this condition and set the expiry value to the max. expiry
  time in the future.
  
  The fix should be applied to stable kernel series as well.
  
  Signed-off-by: Thomas Gleixner [EMAIL PROTECTED],de
  
  diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
  index ec4cb9f..5e7122d 100644
  --- a/kernel/hrtimer.c
  +++ b/kernel/hrtimer.c
  @@ -644,6 +644,12 @@ hrtimer_forward(struct hrtimer *timer, k
   		orun++;
   	}
   	timer->expires = ktime_add(timer->expires, interval);
  +	/*
  +	 * Make sure, that the result did not wrap with a very large
  +	 * interval.
  +	 */
  +	if (timer->expires.tv64 < 0)
  +		timer->expires = ktime_set(KTIME_SEC_MAX, 0);
   
   	return orun;
   }
 
 kernel/hrtimer.c: In function 'hrtimer_forward':
 kernel/hrtimer.c:652: warning: overflow in implicit constant conversion
 
 problem is, KTIME_SEC_MAX is 9,223,372,036 and ktime_set() takes a `long'.

Stupid me :(

 This?
 
 --- a/include/linux/ktime.h~ktime_set-fix-arg-type
 +++ a/include/linux/ktime.h
 @@ -72,13 +72,13 @@ typedef union {
   *
   * Return the ktime_t representation of the value
   */
 -static inline ktime_t ktime_set(const long secs, const unsigned long nsecs)
 +static inline ktime_t ktime_set(const s64 secs, const unsigned long nsecs)
  {
  #if (BITS_PER_LONG == 64)
   if (unlikely(secs >= KTIME_SEC_MAX))
   return (ktime_t){ .tv64 = KTIME_MAX };
  #endif
 - return (ktime_t) { .tv64 = (s64)secs * NSEC_PER_SEC + (s64)nsecs };
 + return (ktime_t) { .tv64 = secs * NSEC_PER_SEC + (s64)nsecs };
  }
  
  /* Subtract two ktime_t variables. rem = lhs -rhs: */
 _
 
 I worry about that `secs >= KTIME_SEC_MAX' comparison in there, too.  Both
 operands are signed.

I'd prefer this one: The maximum seconds value we can handle on 32bit is
LONG_MAX.

diff --git a/include/linux/ktime.h b/include/linux/ktime.h
index c68c7ac..248305b 100644
--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -57,7 +57,11 @@ typedef union {
 } ktime_t;
 
 #define KTIME_MAX			((s64)~((u64)1 << 63))
-#define KTIME_SEC_MAX  (KTIME_MAX / NSEC_PER_SEC)
+#if (BITS_PER_LONG == 64)
+# define KTIME_SEC_MAX (KTIME_MAX / NSEC_PER_SEC)
+#else
+# define KTIME_SEC_MAX LONG_MAX
+#endif
 
 /*
  * ktime_t definitions when using the 64-bit scalar representation:
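
The numbers behind the warning can be checked in user space. This is a
sketch only: ktime_sec_max_for() and fits_long32() are hypothetical
helpers that mirror the #if above as a runtime switch, and a 32 bit
`long' is modelled with int32_t.

```c
#include <stdint.h>

#define NSEC_PER_SEC	1000000000LL
#define KTIME_MAX	((int64_t)~((uint64_t)1 << 63))

/*
 * Hypothetical helper mirroring the #if in the patch: the largest
 * seconds value for a given width of `long' (32 or 64 bits).
 */
static int64_t ktime_sec_max_for(int bits_per_long)
{
	if (bits_per_long == 64)
		return KTIME_MAX / NSEC_PER_SEC;
	return INT32_MAX;	/* LONG_MAX where long is 32 bits wide */
}

/* Returns 1 if v can be passed through a 32 bit signed long unchanged. */
static int fits_long32(int64_t v)
{
	return v >= INT32_MIN && v <= INT32_MAX;
}
```

KTIME_MAX / NSEC_PER_SEC is 9223372036, which does not fit into a 32
bit long, hence the "overflow in implicit constant conversion" warning
when it is passed to ktime_set().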




[PATCH] clockevents: Fix suspend/resume to disk hangs

2007-03-16 Thread Thomas Gleixner
I finally found a dual core box, which survives suspend/resume without
crashing in the middle of nowhere. Sigh, I never figured out from the
code and the bug reports what's going on.

The observed hangs are caused by a stale state transition of the clock
event devices, which keeps the RCU synchronization from completing
when the non boot CPU is brought back up.

Suspend/resume in oneshot mode needs similar care as the periodic mode
during suspend to RAM. My assumption that the state transitions during
the different shutdown/bringups of s2disk would go through the periodic
boot phase and then switch over to highres resp. nohz mode was simply
wrong.

Add the appropriate suspend / resume handling for the non periodic
modes.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 5567745..eadfce2 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -307,12 +307,19 @@ int tick_resume_broadcast(void)
 	spin_lock_irqsave(&tick_broadcast_lock, flags);
 
 	bc = tick_broadcast_device.evtdev;
-	if (bc) {
-		if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC &&
-		    !cpus_empty(tick_broadcast_mask))
-			tick_broadcast_start_periodic(bc);
 
-		broadcast = cpu_isset(smp_processor_id(), tick_broadcast_mask);
+	if (bc) {
+		switch (tick_broadcast_device.mode) {
+		case TICKDEV_MODE_PERIODIC:
+			if(!cpus_empty(tick_broadcast_mask))
+				tick_broadcast_start_periodic(bc);
+			broadcast = cpu_isset(smp_processor_id(),
+					      tick_broadcast_mask);
+			break;
+		case TICKDEV_MODE_ONESHOT:
+			broadcast = tick_resume_broadcast_oneshot(bc);
+			break;
+		}
 	}
 	spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 
@@ -347,6 +354,16 @@ static int tick_broadcast_set_event(ktime_t expires, int force)
 	}
 }
 
+int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
+{
+	clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT);
+
+	if(!cpus_empty(tick_broadcast_oneshot_mask))
+		tick_broadcast_set_event(ktime_get(), 1);
+
+	return cpu_isset(smp_processor_id(), tick_broadcast_oneshot_mask);
+}
+
 /*
  * Reprogram the broadcast device:
  *
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 43ba1bd..bfda3f7 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -298,18 +298,17 @@ static void tick_shutdown(unsigned int *cpup)
 	spin_unlock_irqrestore(&tick_device_lock, flags);
 }
 
-static void tick_suspend_periodic(void)
+static void tick_suspend(void)
 {
 	struct tick_device *td = &__get_cpu_var(tick_cpu_device);
 	unsigned long flags;
 
 	spin_lock_irqsave(&tick_device_lock, flags);
-	if (td->mode == TICKDEV_MODE_PERIODIC)
-		clockevents_set_mode(td->evtdev, CLOCK_EVT_MODE_SHUTDOWN);
+	clockevents_set_mode(td->evtdev, CLOCK_EVT_MODE_SHUTDOWN);
 	spin_unlock_irqrestore(&tick_device_lock, flags);
 }
 
-static void tick_resume_periodic(void)
+static void tick_resume(void)
 {
 	struct tick_device *td = &__get_cpu_var(tick_cpu_device);
 	unsigned long flags;
@@ -317,6 +316,8 @@ static void tick_resume_periodic(void)
 	spin_lock_irqsave(&tick_device_lock, flags);
 	if (td->mode == TICKDEV_MODE_PERIODIC)
 		tick_setup_periodic(td->evtdev, 0);
+	else
+		tick_resume_oneshot();
 	spin_unlock_irqrestore(&tick_device_lock, flags);
 }
 
@@ -348,13 +349,13 @@ static int tick_notify(struct notifier_block *nb, unsigned long reason,
 		break;
 
 	case CLOCK_EVT_NOTIFY_SUSPEND:
-		tick_suspend_periodic();
+		tick_suspend();
 		tick_suspend_broadcast();
 		break;
 
 	case CLOCK_EVT_NOTIFY_RESUME:
 		if (!tick_resume_broadcast())
-			tick_resume_periodic();
+			tick_resume();
 		break;
 
 	default:
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 75890ef..c9d203b 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -19,12 +19,13 @@ extern void tick_setup_oneshot(struct clock_event_device *newdev,
 extern int tick_program_event(ktime_t expires, int force);
 extern void tick_oneshot_notify(void);
 extern int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *));
-
+extern void tick_resume_oneshot(void);
 # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
 extern void tick_broadcast_oneshot_control(unsigned long reason);
 extern void

Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-16 Thread Thomas Gleixner
On Fri, 2007-03-16 at 12:30 +0200, Maxim Levitsky wrote:
 Mar 14 00:22:23 MAIN kernel: [2.072875] caller is 
 check_tsc_sync_source+0x1d/0x100
 Mar 14 00:22:23 MAIN kernel: [2.072878]  [show_trace_log_lvl+26/48] 
 show_trace_log_lvl+0x1a/0x30
 Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
 [CPU#0 -> CPU#1]:
 Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
 warp between CPUs, turning off
 
 It looks clear that preempt is enabled all the way in second cpu 
 initialization, ( I think that at least in check_tsc_sync_source, it should 
 be disabled,
 shouldn't it ? )

This should be fixed by commit d04f41e35343f1d788551fd3f753f51794f4afcf

tglx





Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-16 Thread Thomas Gleixner
Maxim,

On Fri, 2007-03-16 at 12:30 +0200, Maxim Levitsky wrote:
 3) Sometimes I get this (once in three boots or so)
 
 [   36.217405] ENABLING IO-APIC IRQs
 [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
 [   36.433917] APIC timer disabled due to verification failure.
 
 And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
 I haven't investigated that yet.
 It looks like another new test that my hardware fails to perform... 

Yes, this is probably caused by SMM code trying to emulate a PS/2
keyboard from a (maybe connected or not) USB keyboard. Unfortunately we
have no way to disable this BIOS misfeature in the early boot process. 
Arjan, Len ?

I built in this test to rule out bogus LAPIC timer calibration values
which are sometimes off by factor 2-10.

But I also built in a calibration against the PM-Timer, which turned out
to be quite reliable and I think the additional verification step is
only necessary for systems without PM-Timer.

That was a bit overcautious on my side. I'll send a patch to avoid this
when PM-Timer is available in a separate mail.

tglx




[PATCH] i386: trust the PM-Timer calibration of the local APIC timer

2007-03-16 Thread Thomas Gleixner
When PM-Timer is available for local APIC timer calibration we can skip
the verification of the calibrated time value. The resulting error is
quite small on a bunch of evaluated platforms and is less harmful than
the observed false positives.

We need to keep the verification on systems which have no PM-Timer to
avoid bogus local APIC timer calibrations in the range of factor 2-10,
which can be observed when switching off the PM-timer support in the
kernel configuration.

The wrong calibration values are probably caused by SMM code trying to
emulate a PS/2 keyboard from a (maybe connected or not) USB keyboard.
This prohibits the accurate delivery of PIT interrupts, which are used
to calibrate the local APIC timer. Unfortunately we have no way to
disable this BIOS misfeature in the early boot process.

Also add the dropped cpu_relax() back to the wait loops.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 2383bcf..92f4210 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -338,6 +338,7 @@ void __init setup_boot_APIC_clock(void)
 	void (*real_handler)(struct clock_event_device *dev);
 	unsigned long deltaj;
 	long delta, deltapm;
+	int pm_referenced = 0;
 
 	apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n"
 		    "calibrating APIC timer ...\n");
@@ -357,7 +358,8 @@ void __init setup_boot_APIC_clock(void)
 	/* Let the interrupts run */
 	local_irq_enable();
 
-	while(lapic_cal_loops <= LAPIC_CAL_LOOPS);
+	while(lapic_cal_loops <= LAPIC_CAL_LOOPS)
+		cpu_relax();
 
 	local_irq_disable();
 
@@ -394,6 +396,7 @@ void __init setup_boot_APIC_clock(void)
 				    "%lu (%ld)\n", (unsigned long) res, delta);
 			delta = (long) res;
 		}
+		pm_referenced = 1;
 	}
 
 	/* Calculate the scaled math multiplication factor */
@@ -423,68 +426,41 @@ void __init setup_boot_APIC_clock(void)
 		    calibration_result / (100 / HZ),
 		    calibration_result % (100 / HZ));
 
-
-	apic_printk(APIC_VERBOSE, "... verify APIC timer\n");
-
-	/*
-	 * Setup the apic timer manually
-	 */
 	local_apic_timer_verify_ok = 1;
-	levt->event_handler = lapic_cal_handler;
-	lapic_timer_setup(CLOCK_EVT_MODE_PERIODIC, levt);
-	lapic_cal_loops = -1;
 
-	/* Let the interrupts run */
-	local_irq_enable();
+	/* We trust the pm timer based calibration */
+	if (!pm_referenced) {
+		apic_printk(APIC_VERBOSE, "... verify APIC timer\n");
 
-	while(lapic_cal_loops <= LAPIC_CAL_LOOPS);
+		/*
+		 * Setup the apic timer manually
+		 */
+		levt->event_handler = lapic_cal_handler;
+		lapic_timer_setup(CLOCK_EVT_MODE_PERIODIC, levt);
+		lapic_cal_loops = -1;
 
-	local_irq_disable();
+		/* Let the interrupts run */
+		local_irq_enable();
 
-	/* Stop the lapic timer */
-	lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, levt);
+		while(lapic_cal_loops <= LAPIC_CAL_LOOPS)
+			cpu_relax();
 
-	local_irq_enable();
+		local_irq_disable();
 
-	/* Jiffies delta */
-	deltaj = lapic_cal_j2 - lapic_cal_j1;
-	apic_printk(APIC_VERBOSE, "... jiffies delta = %lu\n", deltaj);
+		/* Stop the lapic timer */
+		lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, levt);
 
-	/* Check, if the PM timer is available */
-	deltapm = lapic_cal_pm2 - lapic_cal_pm1;
-	apic_printk(APIC_VERBOSE, "... PM timer delta = %ld\n", deltapm);
+		local_irq_enable();
 
-	local_apic_timer_verify_ok = 0;
+		/* Jiffies delta */
+		deltaj = lapic_cal_j2 - lapic_cal_j1;
+		apic_printk(APIC_VERBOSE, "... jiffies delta = %lu\n", deltaj);
 
-	if (deltapm) {
-		if (deltapm > (pm_100ms - pm_thresh) &&
-		    deltapm < (pm_100ms + pm_thresh)) {
-			apic_printk(APIC_VERBOSE, "... PM timer result ok\n");
-			/* Check, if the jiffies result is consistent */
-			if (deltaj < LAPIC_CAL_LOOPS-2 ||
-			    deltaj > LAPIC_CAL_LOOPS+2) {
-				/*
-				 * Not sure, what we can do about this one.
-				 * When high resultion timers are active
-				 * and the lapic timer does not stop in C3
-				 * we are fine. Otherwise more trouble might
-				 * be waiting. -- tglx
-				 */
-				printk(KERN_WARNING "Global event device %s "
-				       "has wrong
Re: + small-irq-management-simplification.patch added to -mm tree

2007-02-16 Thread Thomas Gleixner
On Wed, 2007-02-14 at 15:26 -0800, [EMAIL PROTECTED] wrote:
 Subject: small irq management simplification
 From: Jan Beulich [EMAIL PROTECTED]
 
 Use mask_ack_irq() where possible.
 
 Signed-off-by: Jan Beulich [EMAIL PROTECTED]
 Cc: Thomas Gleixner [EMAIL PROTECTED]
 Cc: Ingo Molnar [EMAIL PROTECTED]
 Signed-off-by: Andrew Morton [EMAIL PROTECTED]

Acked-by: Thomas Gleixner [EMAIL PROTECTED]




Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

2007-02-16 Thread Thomas Gleixner
On Fri, 2007-02-16 at 21:38 +0100, Michal Piotrowski wrote:
 Hi,
 
 This looks like a tickless stuff

Yup.

 0xc0139ea0 is in tick_nohz_stop_sched_tick 
 (/mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168).
 163
 164 if (need_resched())
 165 goto end;
 166
 167 cpu = smp_processor_id();
 168 BUG_ON(local_softirq_pending());

Hmm, the BUG_ON is inside of an interrupt disabled region, so we should
have bailed out early in the need_resched() check above (because we are
in the idle task context according to the stack trace).

Is this reproducible ?

tglx




Re: 2.6.20-git: undefined reference to `smp_call_function_single'

2007-02-17 Thread Thomas Gleixner
On Fri, 2007-02-16 at 21:08 -0500, Len Brown wrote:
 Yes, an obscure .config, but it used to build before today:

 kernel/built-in.o: In function `tick_broadcast_on_off':
 (.text+0x1b6f0): undefined reference to `smp_call_function_single'

Yup, this obscure machine is missing smp_call_function_single().

James ?

tglx




Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

2007-02-17 Thread Thomas Gleixner
On Sat, 2007-02-17 at 15:47 +0100, Alex Riesen wrote:
   164 if (need_resched())
   165 goto end;
   166
   167 cpu = smp_processor_id();
   168 BUG_ON(local_softirq_pending());
  
  Hmm, the BUG_ON is inside of an interrupt disabled region, so we should
  have bailed out early in the need_resched() check above (because we are
  in the idle task context according to the stack trace).
  
  Is this reproducible ?

 Seen this too (Ubuntu, P4/ht-SMT, SATA, typed from screen):

Can you please apply the patch below, so we can at least see, which
softirq is pending. This should trigger independently of hrtimers and
dynticks. You can keep it compiled in and disable it at the kernel
commandline with nohz=off and / or highres=off

tglx

diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index bea304d..deeb90e 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -236,6 +236,11 @@ void cpu_idle(void)
 			 * Otherwise, idle callbacks can misfire.
 			 */
 			local_irq_disable();
+			if (local_softirq_pending() && !need_resched())
+				printk(KERN_ERR
+				       "Idle: local softirq pending: %04x",
+				       local_softirq_pending());
+
 			enter_idle();
 			idle();
 			__exit_idle();
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 95e41f7..91d459f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -165,7 +165,6 @@ void tick_nohz_stop_sched_tick(void)
 		goto end;
 
 	cpu = smp_processor_id();
-	BUG_ON(local_softirq_pending());
 
 	now = ktime_get();
 	/*




Re: Using sched_clock for mmio-trace

2007-02-17 Thread Thomas Gleixner
On Sat, 2007-02-17 at 15:56 +0100, Andi Kleen wrote:
  This is one of the reasons why we don't just use good old
  do_gettimeofday(), since it takes locks and can lead to lock recursion
  if parts of itself are probed.
 
 do_gettimeofday doesn't take locks.
 
 Only restriction is that you can't single step it with long 
 pauses between instructions.

Err, it uses read side of xtime lock, so you can not call it from a
place which write locks xtime lock.

tglx




Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

2007-02-17 Thread Thomas Gleixner
On Sat, 2007-02-17 at 17:46 +0100, Alex Riesen wrote:
  Can you please apply the patch below, so we can at least see, which
  softirq is pending. This should trigger independently of hrtimers and
  dynticks. You can keep it compiled in and disable it at the kernel
  commandline with nohz=off and / or highres=off
 
 It did, only one time:
 
 Idle: local softirq pending: 0020<6>USB Universal Host Controller Interface 
 driver v3.0

0x20 is the TASKLET_SOFTIRQ. I have no idea yet, how this can happen.
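
For reference, the pending mask can be decoded bit by bit. This is a
sketch only: decode_softirq_pending() is a hypothetical user-space
helper, and the name table assumes the softirq numbering of kernels of
this era (HI = 0 up to TASKLET = 5). Later kernels have more entries,
so check include/linux/interrupt.h of the tree in question.

```c
#include <stdio.h>

/*
 * Assumed softirq numbering for kernels of this era (index = bit
 * position in the pending mask). Later kernels add more entries.
 */
static const char *const softirq_names[] = {
	"HI_SOFTIRQ",
	"TIMER_SOFTIRQ",
	"NET_TX_SOFTIRQ",
	"NET_RX_SOFTIRQ",
	"BLOCK_SOFTIRQ",
	"TASKLET_SOFTIRQ",
};

/*
 * Hypothetical helper: write the names of all set bits in `pending'
 * into buf, separated by spaces. Unknown bits are printed as "bitN".
 * Output is truncated, but still NUL terminated, if buf is too small.
 */
static void decode_softirq_pending(unsigned int pending, char *buf, size_t len)
{
	size_t n = sizeof(softirq_names) / sizeof(softirq_names[0]);
	size_t used = 0;
	unsigned int bit;
	int w;

	if (len == 0)
		return;
	buf[0] = '\0';

	for (bit = 0; bit < 32 && used < len; bit++) {
		if (!(pending & (1u << bit)))
			continue;
		if (bit < n)
			w = snprintf(buf + used, len - used, "%s%s",
				     used ? " " : "", softirq_names[bit]);
		else
			w = snprintf(buf + used, len - used, "%sbit%u",
				     used ? " " : "", bit);
		if (w < 0)
			break;
		used += (size_t)w;
	}
}
```

Under that assumed numbering a mask of 0x20 has only bit 5 set, which
decodes to TASKLET_SOFTIRQ.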

Can you please check, if this happens when you add nohz=off to the
kernel command line.

tglx







Re: 2.6.20-git: undefined reference to `smp_call_function_single'

2007-02-17 Thread Thomas Gleixner
On Sat, 2007-02-17 at 12:25 -0600, James Bottomley wrote:
  Yup, this obscure machine is missing smp_call_function_single().
  
  James ?
 
 Where's this coming from?  smp_call_function_single() is an obscure kvm
 only API think for x86/ia64 ... it's not supported on any other
 architecture.  The symbol you have is blowing up in the kernel
 subdirectory which suggests someone has tried to use it in generic code,
 which will fail to compile on a lot more than voyager and parisc ...

smp_call_function_single() was added with commit:
eaa70773e750cc09d60938bceacd028bc76b8e3a

   [PATCH] i386: add smp_call_function_single

Continuing the series of small patches necessary for the perfmon subsystem,
here is a patch that adds support for the smp_call_function_single()
function for i386.  It exists for almost all other architectures but i386.
The perfmon subsystem needs it in one case to free some state on a
designated remote CPU.

It's not an obscure kvm API :) But the claim that it is available on
almost all other architectures but i386 is wrong. Only x86_64, ia64 and
i386 have it.

The function is defined in include/linux/smp.h and there is no
indication that it is an architecture specific thingy. What a steaming
pile of 

/me grumbles

tglx




[PATCH] tick management: make broadcast dependent on local APIC

2007-02-17 Thread Thomas Gleixner
The broadcast functionality is only necessary when a local APIC is
available. Make the config switch depend on X86_LOCAL_APIC. This
resolves the mach-voyager breakage introduced by the tick management
code.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig
index 1df4a1f..2f76725 100644
--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -29,6 +29,7 @@ config GENERIC_CLOCKEVENTS
 config GENERIC_CLOCKEVENTS_BROADCAST
 	bool
 	default y
+	depends on X86_LOCAL_APIC
 
 config LOCKDEP_SUPPORT
 	bool




Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

2007-02-18 Thread Thomas Gleixner
On Sat, 2007-02-17 at 23:41 +0100, Michal Piotrowski wrote:
 sudo cat /var/log/messages | grep Idle
 Feb 17 17:35:34 bitis-gabonica kernel: Idle: local softirq pending:
 00206hdd: ATAPI 48X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache,
 UDMA(33)
 Feb 17 19:20:01 bitis-gabonica kernel: Idle: local softirq pending:
 00203Idle: local softirq pending: 00203Idle: local softirq
 pending: 00207PM: Removing info for No Bus:vcs7

 I can confirm that this is ICH5 SATA controller problem.

The arch/i386/kernel/process.c part of the patch should apply to 2.6.20
as well. Can you check if the problem is there too ?

tglx




Re: 2.6.20-git13 kernel BUG at /mnt/md0/devel/linux-git/kernel/time/tick-sched.c:168

2007-02-18 Thread Thomas Gleixner
On Sun, 2007-02-18 at 10:50 +0100, Alex Riesen wrote:
  The arch/i386/kernel/process.c part of the patch should apply to 2.6.20
  as well. Can you check if the problem is there too ?
 
 It does not apply and does not look trivially hackable.
 The code for cpu_idle was introduced in 2ff2d3d7 i386: add idle notifier.

Here you go.

tglx

Index: linux-2.6.20/arch/i386/kernel/process.c
===
--- linux-2.6.20.orig/arch/i386/kernel/process.c
+++ linux-2.6.20/arch/i386/kernel/process.c
@@ -189,6 +189,13 @@ void cpu_idle(void)
 				play_dead();
 
 			__get_cpu_var(irq_stat).idle_timestamp = jiffies;
+
+			local_irq_disable();
+			if (local_softirq_pending() && !need_resched())
+				printk(KERN_ERR
+				       "Idle: local softirq pending: %04x\n",
+				       local_softirq_pending());
+			local_irq_enable();
 			idle();
 		}
 		preempt_enable_no_resched();






Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Thomas Gleixner
Len,

On Fri, 2007-03-16 at 21:32 -0400, Len Brown wrote:
   [   36.433917] APIC timer disabled due to verification failure.
   
   And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
   I haven't investigated that yet.
   It looks like another new test that my hardware fails to perform... 
  
  Yes, this is probably caused by SMM code trying to emulate a PS/2
  keyboard from a (maybe connected or not) USB keyboard. Unfortunately we
  have no way to disable this BIOS misfeature in the early boot process. 
  Arjan, Len ?
 
 Nope.  By definition, SMM is invisible to the OS -- we don't even
 get a bit that said it occurred (though we'd like one -- it would
 be really helpful to diagnose issues like this one)

I know that it is invisible. Nevertheless I know that the BIOSes emulate
PS/2 keyboards from USB via SMM during the boot process until we call
the usb_handoff function.

 So go into BIOS SETUP and see if there is a USB Legacy Emulation
 feature that you can disable.  Sometimes there is not, but disabling
 onboard USB altogether may help at least prove the issue is in that area.

I have more than one box (even original Intel mainboards), where either
plugging a PS/2 keyboard or switching off USB makes this problem go
away.

  I built in this test to rule out bogus LAPIC timer calibration values
  which are sometimes off by factor 2-10.
  
  But I also built in a calibration against the PM-Timer, which turned out
  to be quite reliable and I think the additional verification step is
  only necessary for systems without PM-Timer.
  
  That was a bit over cautious from my side. I send a patch to avoid this
  when PM-Timer is available in a separate mail.
 
 PM-Timer was invented to work-around the issue that the TSC became unreliable
 in the face of power management on laptops.  In particular, to be able
 to time duration of OS idle where TSC stopped.
 
 While it is not fine grain, and it is not low-latency, it should
 be very reliable.  My understanding is that it is implemented as
 a simple divider right off the system 14MHz clock -- the signal
 which most motherboard clocks are PLL multiplied up from --
 including the 100MHz front-side bus which drives the LAPIC timer.
 
 But that said, I don't understand why calibrating the LAPIC timer
 using the PM-timer is going to be more reliable -- exactly how
 and why did the previous calibration scheme fail?
 Maybe I could follow the new logic in apic.c if I saw the apic=debug
 output for this box.

calibrating APIC timer ...
... lapic delta = 2426884
... PM timer delta = 833908
APIC calibration PIT not consistent with PM Timer: 232ms instead of 100ms
APIC delta adjusted to PM-Timer: 1041737 (2426884)
. delta 1041737
. mult: 44749065
. calibration result: 166677
. CPU clock speed is 4659.0624 MHz.
. host bus clock speed is 166.0677 MHz.

This box is off by factor 2.3 and using the PM-Timer instead of the
PIT/jiffies values gives me a correct result.
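
The numbers in the first log can be reproduced with plain integer
arithmetic. The sketch below is not the code in apic.c; it only assumes the
standard ACPI PM timer frequency of 3.579545 MHz and a nominal calibration
window of 100ms. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PMTMR_TICKS_PER_SEC 3579545ULL	/* ACPI PM timer frequency in Hz */

/* Length of the calibration window in ms, as measured by the PM timer. */
static uint64_t pm_window_ms(uint64_t pm_delta)
{
	return pm_delta * 1000 / PMTMR_TICKS_PER_SEC;
}

/* Scale the observed lapic delta to what it would be in exactly 100ms. */
static uint64_t lapic_delta_per_100ms(uint64_t lapic_delta, uint64_t pm_delta)
{
	uint64_t pm_100ms = PMTMR_TICKS_PER_SEC / 10;

	assert(pm_delta != 0);
	return lapic_delta * pm_100ms / pm_delta;
}
```

Feeding in the PM timer delta of 833908 gives a window of 232ms, and scaling
the lapic delta of 2426884 gives 1041737, matching the two "APIC ..." lines
in the log above.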

Another one:
APIC calibration not consistent with PM Timer: 2020ms instead of 100ms
APIC delta adjusted to PM-Timer: 1254436 (2534)

Off by factor 20 !!

The original APIC timer calibration did:

	local_irq_disable();
	wait_until_pit_underflows();
	t1 = read_apic_counter();
	for (i = 0; i < HZ/10; i++)
		wait_until_pit_underflows();
	t2 = read_apic_counter();

and calculated the APIC timer frequency from the delta of t1 and t2 vs.
the 100ms time.

This had 2 problems:
1. It gave results, which are off by factor 2-10 on a couple of boxen.
2. Some systems stop there dead as the PIT readout is broken.

I changed it to do:

	local_irq_disable();
	original_pit_handler = pit->handler;
	pit->handler = lapic_calibration_handler;
	loops = 0;
	local_irq_enable();
	wait_until_handler_has_done_HZ/10_loops();

The handler does:

	if (!loops++) {
		t1_apic = read_apic_counter();
		t1_jiffies = jiffies;
		t1_pm = read_pm_timer();
	}

	if (loops == HZ/10) {
		t2_apic = read_apic_counter();
		t2_jiffies = jiffies;
		t2_pm = read_pm_timer();
		done = 1;
	}

If the pmtimer is available, then calculate the APIC timer frequency
from the t1_pm/t2_pm delta, otherwise use jiffies.

When pm_timer is there, we can trust the calculated value, if not we do
a verify run of the periodic apic timer and the pit timer. If this fails
- and it fails often due to the SMM crap - then I use the PIT and IPIs.

In the first version I did a verification run even when pm_timer was
there, but this produced false positives as well, because the lapic
timer interrupt is in the same way delayed as the PIT interrupt. I
removed this to avoid unnecessary switching to IPIs after I verified,
that it always produced false positives when the calibration was done
against the PM-Timer.

tglx



Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Thomas Gleixner
On Sat, 2007-03-17 at 10:56 +0100, Thomas Gleixner wrote:
 calibrating APIC timer ...
 ... lapic delta = 2426884
 ... PM timer delta = 833908
 APIC calibration PIT not consistent with PM Timer: 232ms instead of 100ms
 APIC delta adjusted to PM-Timer: 1041737 (2426884)
 . delta 1041737
 . mult: 44749065
 . calibration result: 166677
 . CPU clock speed is 4659.0624 MHz.
 . host bus clock speed is 166.0677 MHz.
 
 This box is off by factor 2.3 and using the PM-Timer instead of the
 PIT/jiffies values gives me a correct result.
 
 Another one:
 APIC calibration not consistent with PM Timer: 2020ms instead of 100ms
 APIC delta adjusted to PM-Timer: 1254436 (2534)
 
 Off by factor 20 !!

This weird behaviour also can be seen with the BogoMIPS calibration:

Calibrating delay using timer specific routine.. 6428.32 BogoMIPS 
(lpj=12856647)

Initializing CPU#1
Calibrating delay using timer specific routine.. 103837.25 BogoMIPS 
(lpj=207674508)

Note, that I never observed that on CPU#0. It always affects CPU#1.
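
For readers who want to check the two BogoMIPS lines: the printed value
follows from lpj with the usual lpj/(500000/HZ) formula. The sketch below
assumes HZ=250, which is inferred from the numbers and not stated in the
log; bogomips_from_lpj() is a hypothetical helper written for this example.

```c
/* Integer and two-digit fractional part of a BogoMIPS value. */
struct bogomips {
	unsigned long whole;
	unsigned long frac;
};

/* Convert loops-per-jiffy to BogoMIPS for a given tick rate. */
static struct bogomips bogomips_from_lpj(unsigned long lpj, unsigned long hz)
{
	struct bogomips b;

	b.whole = lpj / (500000 / hz);
	b.frac = (lpj / (5000 / hz)) % 100;
	return b;
}
```

With HZ=250, lpj=12856647 gives 6428.32 and lpj=207674508 gives 103837.25,
so the CPU#1 result is roughly 16 times the CPU#0 result.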

tglx




Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Thomas Gleixner
On Sat, 2007-03-17 at 10:56 +0100, Thomas Gleixner wrote:
  Maybe I could follow the new logic in apic.c if I saw the apic=debug
  output for this box.
 
 calibrating APIC timer ...
 ... lapic delta = 2426884
 ... PM timer delta = 833908
 APIC calibration PIT not consistent with PM Timer: 232ms instead of 100ms
 APIC delta adjusted to PM-Timer: 1041737 (2426884)
 . delta 1041737
 . mult: 44749065
 . calibration result: 166677
 . CPU clock speed is 4659.0624 MHz.
 . host bus clock speed is 166.0677 MHz.
 
 This box is off by factor 2.3 and using the PM-Timer instead of the
 PIT/jiffies values gives me a correct result.

I instrumented the lapic calibration on this box:

I1: 999 us total:999 us
I2: 999 us total:   1998 us
...
I28:999 us total:  27980 us
I29: 135097 us total: 163077 us  
I30:881 us total: 163958 us
...
I98:   1000 us total: 231918 us
I99:999 us total: 232917 us

So it vanishes away for 132 ms, which is exactly the error above. This
happens in random places and sometimes I'm lucky that it does not happen
at all.

tglx




Re: [2/6] 2.6.21-rc2: known regressions

2007-03-17 Thread Thomas Gleixner
On Sat, 2007-03-17 at 23:41 +0200, Michael S. Tsirkin wrote:
   a quick ping: on your box that doesnt resume - if you can log in over 
   the network after resume (or somehow run shell commands), does 'date' 
   advance properly or not? (or do you not get that far to be able to 
   tell?)
   
 Ingo
  
  I just retested - 'date' does not advance after resume for me.
  This is with NO_HZ *not* set.
  Sorry it took so long.
 
 Update: just re-tested with 2.6.21-rc4, same behaviour: date
 does not advance after resume from ram.

Can you get a full dmesg from boot to resume out of the box ?

tglx




Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Thomas Gleixner
Maxim,

On Sun, 2007-03-18 at 01:00 +0200, Maxim wrote:
 Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
 [CPU#0 - CPU#1]:
 Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
 warp between CPUs, turning off
 
 ^ This one I don't think is related to NO_HZ, maybe it is hardware
 problem, but it exist without NO_HZ

The TSC is checked for synchronization between the CPUs. It's nothing to
worry about. We switch off the TSC and use a different clocksource.

Is this after resume ? If yes, then something (probably BIOS) is
fiddling with the TSC of one CPU when the resume happens.

 [   36.217405] ENABLING IO-APIC IRQs
 [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
 [   36.433917] APIC timer disabled due to verification failure.
 
 This one is now discussed, I will look at it and it is not related to NO_HZ

I sent a patch for this yesterday:

http://marc.info/?l=linux-kernel&m=117408952322631&w=2

 And I forgot to tell about another problem with (now I know ,hi-resolution 
 timers)
 That before suspend to ram APIC timer is used and HPET is not used :
 
 [EMAIL PROTECTED]:/home/maxim# cat 
 /sys/devices/system/clockevents/clockevents0/registered
 lapicF:0007 M:3(periodic) C: 1
 hpet F:0003 M:1(shutdown) C: 0
 lapicF:0007 M:3(periodic) C: 0
 [EMAIL PROTECTED]:/home/maxim#   
 
 But after suspend to ram HPET is 'woken'
 
 [EMAIL PROTECTED]:/home/maxim# cat 
 /sys/devices/system/clockevents/clockevents0/registered
 lapicF:0007 M:3(one shoot) C: 1
 hpet F:0003 M:3(one shoot) C: 0
 lapicF:0007 M:3(one shoot) C: 0

This is unrelated to suspend / resume. The local apic timers stop
(hardware madness), when the CPU enters C3 power state. In this case we
switch to HPET (or PIT when HPET is not available) and broadcast the
events via Inter Processor Interrupts. This is nothing to worry about. 

I'm a bit surprised though, that your system was in periodic mode before
suspend and switched to one shot mode on resume.

Is this reproducible ? If yes, can you please provide the dmesg output
from boot to resume ?

 Note that I added those (one shoot), (periodic) descriptions, would be
 nice to have them in kernel, can I send a patch ?  ;-)

Sure, just s/shoot/shot/ :)

 and I see average of 18 IRQs/sec on IRQ 0

So dynticks are working as expected.

tglx




Re: [PATCH] i386: trust the PM-Timer calibration of the local APIC timer

2007-03-18 Thread Thomas Gleixner
On Sun, 2007-03-18 at 00:12 -0800, Andrew Morton wrote:
 On Sat, 17 Mar 2007 01:04:56 +0100 Thomas Gleixner [EMAIL PROTECTED] wrote:
 
  When PM-Timer is available for local APIC timer calibration we can skip
  the verification of the calibrated time value. The resulting error is
  quite small on a bunch of evaluated platforms and is less harming than
  the observed false positives.
  
  We need to keep the verification on systems, which have no PM-Timer to
  avoid bogus local APIC timer calibrations in the range of factor 2-10,
  which can be observed when switching off the PM-timer support in the
  kernel configuration.
  
  The wrong calibration values are probably caused by SMM code trying to
  emulate a PS/2 keyboard from a (maybe connected or not) USB keyboard.
  This prohibits the accurate delivery of PIT interrupts, which are used
  to calibrate the local APIC timer. Unfortunately we have no way to
  disable this BIOS misfeature in the early boot process.
  
  Add also the dropped cpu_relax() back to the wait loops.
 
 Is this a for-2.6.21 thing?

Yes please. The false positives of the original calibration are
annoying.

tglx




Re: [PATCH] hrtimer: prevent overrun DoS in hrtimer_forward()

2007-03-18 Thread Thomas Gleixner
On Sun, 2007-03-18 at 17:16 -0400, Chuck Ebbert wrote:
 Thomas Gleixner wrote:
  
  I'd prefer this one: The maximum seconds value we can handle on 32bit is
  LONG_MAX.
  
  diff --git a/include/linux/ktime.h b/include/linux/ktime.h
  index c68c7ac..248305b 100644
  --- a/include/linux/ktime.h
  +++ b/include/linux/ktime.h
  @@ -57,7 +57,11 @@ typedef union {
   } ktime_t;
   
   #define KTIME_MAX  ((s64)~((u64)1 << 63))
  -#define KTIME_SEC_MAX  (KTIME_MAX / NSEC_PER_SEC)
  +#if (BITS_PER_LONG == 64)
  +# define KTIME_SEC_MAX (KTIME_MAX / NSEC_PER_SEC)
  +#else
  +# define KTIME_SEC_MAX LONG_MAX
  +#endif
   
   /*
* ktime_t definitions when using the 64-bit scalar representation:
  
 
 Just to be clear: this replaces the earlier patch, right?

This replaces the fix Andrew did.

http://marc.info/?l=linux-kernel&m=117407812411997&w=2
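
To make the 32bit limit concrete, here is a userspace sketch of the two
cases in the quoted #if. It uses stdint types instead of the kernel's
s64/u64, and ktime_sec_max() and fits_in_long32() are hypothetical helpers
for illustration only.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL
#define KTIME_MAX ((int64_t)~((uint64_t)1 << 63))

/* Largest usable seconds value for a given width of 'long' in bits. */
static int64_t ktime_sec_max(int bits_per_long)
{
	if (bits_per_long == 64)
		return KTIME_MAX / NSEC_PER_SEC;
	return INT32_MAX;	/* LONG_MAX when long is 32 bits wide */
}

/* Would 'secs' survive being stored in a 32-bit signed long? */
static int fits_in_long32(int64_t secs)
{
	return secs >= INT32_MIN && secs <= INT32_MAX;
}
```

KTIME_MAX / NSEC_PER_SEC is 9223372036, which does not fit into a 32-bit
long, so on 32bit the limit has to be clamped to LONG_MAX.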

tglx




Re: [PATCH] hrtimer: prevent overrun DoS in hrtimer_forward()

2007-03-18 Thread Thomas Gleixner
On Sun, 2007-03-18 at 17:53 -0400, Chuck Ebbert wrote:
  Just to be clear: this replaces the earlier patch, right?
  
  This replaces the fix Andrew did.
  
  http://marc.info/?l=linux-kernel&m=117407812411997&w=2
  
 
 Right, but is the original Prevent DOS patch from you still needed?
 Or did Andrew's patch replace that one, and now this replaces his?

The original patch is still needed - it handles the problem in the first
place.

I forgot to compile it for 32bit, so Andrew did a fix, which I have now replaced.

tglx




Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 18:10 +0100, Stefan Prechtel wrote:
 So I tried to boot with nolapic on battery and with this option the
 kernel (and system) starts as it should.
 If you need more information, I will send it to you.

Can you please provide your .config and a bootlog from a boot with
nolapic and without. Also please add apic=verbose to the commandline.

Can you please use Linus' latest git snapshot
http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.21-rc4-git4.bz2

or pull from Linus' git repository.

Can you please open a new bug (Category: Timers, Component: Other) on
http://bugzilla.kernel.org and upload the files there, so we avoid
distributing them via LKML.

Thanks,

tglx




Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 18:36 +0100, Thomas Gleixner wrote:
 On Mon, 2007-03-19 at 18:10 +0100, Stefan Prechtel wrote:
  So I tried to boot with nolapic on battery and with this option the
  kernel (and system) starts as it should.
  If you need more information, I will send it to you.
 
 Can you please provide your .config and a bootlog from a boot with
 nolapic and without. Also please add apic=verbose to the commandline.
 
 Can you please use Linus' latest git snapshot
 http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.21-rc4-git4.bz2
 
 or pull from Linus' git repository.
 
 Can you please open a new bug (Category: Timers, Component: Other) on
 http://bugzilla.kernel.org and upload the files there, so we avoid
 distributing them via LKML.

Oh, a bootlog with ac plugged in would be great too. 

Also can you please enable CONFIG_SYSRQ and hit SysRq-Q once, when the
slowness kicks in.

Thanks,

tglx




Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-19 Thread Thomas Gleixner
Matt,

On Mon, 2007-03-19 at 12:08 -0500, Matt Mackall wrote:
 On Sun, Mar 18, 2007 at 03:31:50PM -0500, Josh Boyer wrote:
  On Sun, Mar 18, 2007 at 02:18:12PM -0500, Matt Mackall wrote:
   
   I'm well aware of all that. I wrote a NAND driver just last month.
   Let's consider this table:
   
   HARD drives                          MTD device
   Consists of sectors                  Consists of eraseblocks
   Sectors are small (512, 1024 bytes)  Eraseblocks are larger (32KiB, 128KiB)
   read sector and write sector         read, write, and erase block
   Bad sectors are re-mapped            Bad eraseblocks are not hidden
   HDD sectors don't wear out           Eraseblocks get worn-out
   N/A                                  NAND flash addressed in pages
   N/A                                  NAND flash has OOB areas
   N/A (?)                              NAND flash requires ECC
 
 Disks have OOB areas with ECC, it's just nicely hidden inside the
 drive. They also typically have physical sectors bigger than 512
 bytes, again hidden.

The difference is that the harddrive has an intelligent controller,
which hides all this away. NAND FLASH has none and we have to do it in
software.

   If the end goal is to end up with something that looks like a block
   device (which seems to be implied by adding transparent wear leveling
  
  Nope, not the end goal.  It's more about wear-leveling across the entire
  flash chip than it is presenting a block like device.
 
 It seems to be about spanning devices and repartitioning as well.
 Hence the analogy with LVM.

Yes, UBI is a kind of LVM for FLASH and we thought for quite a while about
reusing LVM before we went the UBI way.

   and bad block remapping), then I don't see any reason it can't be done
   in device mapper. The 'smarts' of mtdblock could in fact be pulled up
  
  There is nothing smart about mtdblock.  And mtdblock has nothing to do
  with UBI.
 
 Note the scare quotes. Device mapper runs on top of a block device.
 And mtdblock is currently the block interface that MTD exports. And it
 has 'smarts' that hide handling of sub-eraseblock I/O. I'm clearly
 talking about an approach that doesn't involve UBI at all.

MTD block has no 'smarts' at all. It is a stupid and broken hack, which
you can utilize to lose data and wear your FLASH out. 

   In the end, a block device is something which does random access
   block-oriented I/O. Disk and NAND both fit that description.
  
  NAND very much doesn't fit the random access part of that.  For writes
  you have to write in incrementing pages within eraseblocks.
 
 And? You can't do I/O smaller than a sector on a disk.

Should we export block devices with 16/32/64/128 KiB size ? If not, we
would need to put a lot of clever functionality into the mtd block
device code, which we decided to put into UBI, so FLASH aware file
systems can use this shared functionality too.

If someone wants to implement an intelligent mtd block device, which
allows running arbitrary filesystems, then it should be done on top of
UBI. It's not rocket science, but nobody bothers as we have functional
FLASH filesystems which do their job better w/o any notion of a block
device.

A disk _IS_ fundamentally different to FLASH and all the magic which is
done inside of CF-Cards and USB-Sticks is just hiding this away. Most of
the controller chips in these devices are broken and I would never ever
store any important data on such.

The main points of UBI are:

- wear levelling across the complete device
- background handling of bitflips
- safe updates
- handling of static volumes, which are easily accessible for
bootloaders

None of this is anywhere near LVM and disks. The only LVM-like
feature is dynamic creation/deletion/resizing of volumes.

tglx




Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-19 Thread Thomas Gleixner
Stefan,

On Mon, 2007-03-19 at 19:53 +0100, Stefan Prechtel wrote:
 You can find the files here:
 http://bugzilla.kernel.org/show_bug.cgi?id=8235

thanks for providing the data. Your ACPI tables don't provide
information about the power states (C-States), but your BIOS seems to
switch the CPUs into deeper power states, when it runs on battery. In
those deeper power states the local APIC timers and the TSC are stopped.
So the machine waits forever for the next timer interrupt.

We have a broadcast mechanism for this, which gets activated from ACPI,
but the broadcast mechanism is not activated:

[3.798000] Clock Event Device: pit

[3.798000] tick_broadcast_mask: 

Can you please boot with 2.6.20 or earlier and check the output
of /proc/interrupts ?

IRQ#0 and the LOC (local APIC timer) Interrupts should increment at the
same rate.

tglx




Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 20:49 +0100, Stefan Prechtel wrote:
  Can you please boot with 2.6.20 or earlier and check the output
  of /proc/interrupts ?
 
  IRQ#0 and the LOC (local APIC timer) Interrupts should increment in the
  same frequency.
 
  tglx
 
 Here is the output of /proc/interrupts on 2.6.20:
CPU0   CPU1
   0:   7089  0  local-APIC-edge-fasteio   timer
 

Can you provide the numbers for LOC too ?
  0:   29801420   29793520IO-APIC-edge  timer
...
LOC:  119180305  119180039

And please do a sleep 10; between two reads, so I can see the deltas.

 and this on 2.6.21-rc*:
CPU0   CPU1
   0:255  0  local-APIC-edge-fasteoi   timer
 
 
 on 2.6.21-rc* the number 255 doesn't change.

Yes. I know. We rely on the local APIC if the ACPI code does not tell
us to use the PIT broadcast, sigh.

 But if it is ACPI relevant, shouldn't it boot with acpi=off?
 I've tried with acpi=off and noapic but only with nolapic it started.
 
 And the content of /proc/acpi/processor/C000/power shows only one
 c-state; shouldn't it show more C-states? (please correct me if I'm
 wrong)
 
  # cat /proc/acpi/processor/C000/power
 active state:C1
 max_cstate:  C8
 bus master activity: 
 maximum allowed latency: 2000 usec
 states:
*C1:  type[C1] promotion[--] demotion[--]
 latency[000] usage[] duration[]

Yup. It should.

Thanks,

tglx




Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 21:35 +0100, Stefan Prechtel wrote:
CPU0   CPU1
  0:  28289  0  local-APIC-edge-fasteio   timer
 ...
 LOC:  28237  28236
 
 after a read: (I hope that is this what you want :-)
CPU0   CPU1
   0:  30344  0  local-APIC-edge-fasteio   timer
 ...
 LOC:  30292  30291

Is this with AC plugged in ? If yes, please provide the same numbers for
battery mode.
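
For comparing two such reads, a small sketch of the check: take the CPU0
counters for IRQ#0 and LOC from each read and compare the deltas. The
struct and timers_in_step() are hypothetical names for this example, and
the slack parameter is an assumption to tolerate a tick of skew between
the two counters.

```c
/* One sample of the CPU0 timer counters taken from /proc/interrupts. */
struct timer_sample {
	unsigned long irq0;	/* the "0:" line */
	unsigned long loc;	/* the "LOC:" line */
};

/* Return 1 if IRQ#0 and LOC advanced by the same amount, within slack. */
static int timers_in_step(struct timer_sample a, struct timer_sample b,
			  unsigned long slack)
{
	unsigned long d_irq0 = b.irq0 - a.irq0;
	unsigned long d_loc = b.loc - a.loc;
	unsigned long diff = d_irq0 > d_loc ? d_irq0 - d_loc : d_loc - d_irq0;

	return diff <= slack;
}
```

For the two reads quoted above, IRQ#0 goes from 28289 to 30344 and LOC from
28237 to 30292, so both advance by 2055 and the timers are in step.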

What's the output of 
cat /proc/acpi/processor/C000/power

for 2.6.20 and 2.6.21-rc4-latest-git with and w/o AC ?

Can you also please upload a bootlog with and without AC of 2.6.20 to
bugzilla ?

Thanks,

tglx




Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 15:12 -0500, Matt Mackall wrote:
  Should we export block devices with 16/32/64/128 KiB size ?
 
 Sure, why not?

Simply because we want to have the ability to write fine grained in
order to write data safely to FLASH. If we export those large sizes we
lose this ability and have to write full erase blocks for a couple of
bytes. This simply breaks JFFS2 and you can do the math yourself what
that means for the life time of FLASH, when you write small data chunks
in fast sequences and want to make sure that they are written to FLASH
immediately.
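
A rough sketch of that math, under stated assumptions: a 128 KiB eraseblock,
a 2 KiB NAND page, and small records that each fit into one page. The
sizes are illustrative, not taken from a particular chip, and both helpers
are hypothetical.

```c
#include <stdint.h>

/* Bytes of flash programmed when every update rewrites one full unit. */
static uint64_t bytes_programmed(uint64_t updates, uint64_t unit_bytes)
{
	return updates * unit_bytes;
}

/* How much more flash is programmed per update with eraseblock-sized
 * units compared to page-sized units. */
static uint64_t amplification(uint64_t eraseblock_bytes, uint64_t page_bytes)
{
	return eraseblock_bytes / page_bytes;
}
```

Under these assumptions every small update programs 64 times more flash
than a page-granular write would, and each update also costs one erase
cycle of the block it touches.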

  A disk _IS_ fundamentally different to FLASH and all the magic which is
  done inside of CF-Cards and USB-Sticks is just hiding this away.
 
 And yet they're still both block devices. That our current block layer
 doesn't handle one as well as the other is something we should fix
 instead of inventing a whole new full-feature but incompatible block
 layer on the side.

And yet they are still broken and unreliable. And you can wear them out
in no time, just because they are stupid and do full eraseblock updates
when you write one sector.

No thanks. A bunch of people have done experiments with those beasts and
they are unusable for environments, where we need to make sure, that
data is on FLASH.

UBI is not an incompatible block layer. It allows implementing a very
clever block layer on top. And you can use just one large partition and
small ones for your kernel image and bootloader, which still get the
benefits of data integrity (by doing background safe copies on bit
flips) and the easy implementation in an IPL.

tglx





Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 14:54 -0500, Matt Mackall wrote:
  (UBI also has static volumes which LVM doesn't but that is an aside.)
 
 If a static volume is simply a non-dynamic volume, then device mapper
 can do that too. And countless other things. Which is not an aside.
 UBI growing to do all the things that device mapper does is exactly
 the thing we should be seeking to avoid.

No it can't and device mapper sits on top of block devices. FLASH is no
block device. Period.
 
Device mapper cannot provide a simple, easy to decode scheme for boot
loaders. We need to be able to boot out of 512 - 2048 bytes of NAND FLASH
and be able to find the kernel or second stage boot loader in this
unordered device.

And no, fixed addresses do not work. Do you want to implement device
mapper in your initial bootloader stage ?
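
To make the boot loader argument concrete, here is a minimal sketch of what a
tiny first stage loader could do on an unordered device: walk the eraseblocks,
read a small per-block header and pick the block tagged as a given logical
block of the boot volume. The header layout, the magic value and all names
below are hypothetical and far simpler than the real UBI on-flash format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic marking a valid per-eraseblock header. */
#define TOY_HDR_MAGIC 0x55424921u

/* Hypothetical, simplified per-eraseblock header. */
struct toy_eb_hdr {
	uint32_t magic;   /* TOY_HDR_MAGIC if the block carries a header */
	uint32_t vol_id;  /* volume this block belongs to */
	uint32_t lnum;    /* logical eraseblock number inside the volume */
};

/*
 * Scan all physical eraseblock headers and return the index of the block
 * holding logical eraseblock 'lnum' of volume 'vol_id', or -1 if none.
 * No fixed address is needed: the block may sit anywhere on the chip.
 */
int toy_find_leb(const struct toy_eb_hdr *hdrs, size_t num_pebs,
		 uint32_t vol_id, uint32_t lnum)
{
	for (size_t peb = 0; peb < num_pebs; peb++) {
		if (hdrs[peb].magic != TOY_HDR_MAGIC)
			continue;	/* erased or bad block, skip it */
		if (hdrs[peb].vol_id == vol_id && hdrs[peb].lnum == lnum)
			return (int)peb;
	}
	return -1;
}
```

The point of the sketch is only that a linear scan plus a header compare fits
into a very small loader, while a full mapping layer would not.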

  That's why I suggested fixing the MTD layers that present block devices
  first in the part of my reply that you cut off.  It seems to me that
  you're really after getting flash to look like a block device, which
  would enable device mapper to be used for something similar to UBI.
  That's fine, but until someone does that work UBI fills a need, has
  users, and has an existing implementation.
 
 False starts that get mainlined delay or prevent things getting done
 right. The question is and remains is UBI the right way to do
 things? Not is UBI the easiest way to do things? or is UBI
 something people have already adopted?
 
 If the right way is instead to extend the block layer and device
 mapper to encompass the quirks of NAND in a sensible fashion, then UBI
 should not go in.

No, a block layer on top of FLASH needs 80% of the functionality of UBI in
the first place. You need to implement a clever journalling block device
emulator in order to keep the data alive and the FLASH from being worn out
in no time. You need the wear levelling, otherwise you can throw
away your FLASH in no time.

 Let me draw a picture so we have something to argue about:
 
                       iSCSI/nbd(6)
                            |
 filesystem {      swap     |    ext3      ext3     jffs2
                     \      |     |         |        /
                /     \     |  dm-crypt->snapshot(5) /
 device mapper -|      \     \    |                 /
                |       partitioning               /
                |            |      partitioning(4)
                |     wear leveling(3)      /
                |            |             /
                |      block concatenation
                |       |     |     |     |
                \     bad block remapping(2)
                        |     |     |     |
 MTD raw block {  raw block devices with no smarts(1)
                       /      |      \     \
 hardware {         NAND    NAND    NAND   NAND
 
 Notes:
 1. This would provide a block device that allowed writing pages and
a secondary method for erasing whole blocks as well as a method for
querying/setting out of band information.

Forget about OOB data. OOB data is reserved for ECC. Please read the
recommendations of the NAND FLASH manufacturers. NAND gets less reliable
with higher density devices and smaller processes.

 2. This would hide erase blocks either by using an embedded table or
out of band info. This could stack on top of block concatenation if
desired.

Hide erase blocks ? UBI does not hide anything. It maps logical
eraseblocks, which are exposed to the clients to arbitrary physical
eraseblocks on the FLASH device in order to provide across device wear
levelling.

This is fundamentally different from device mapper.
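
A rough sketch of that mapping, assuming a toy device with eight eraseblocks:
each rewrite of a logical eraseblock goes to the least-worn free physical
eraseblock, and the old physical block is erased and released. The structure
and function names are made up for illustration and are not the UBI
implementation.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PEBS 8
#define UNMAPPED (-1)

/* Toy state: one mapping table plus per-block erase counters. */
struct toy_ubi {
	int leb_to_peb[NUM_PEBS];        /* logical -> physical, or UNMAPPED */
	int peb_in_use[NUM_PEBS];        /* 1 if a logical block maps here */
	uint32_t erase_count[NUM_PEBS];  /* how often each block was erased */
};

void toy_ubi_init(struct toy_ubi *u)
{
	for (int i = 0; i < NUM_PEBS; i++) {
		u->leb_to_peb[i] = UNMAPPED;
		u->peb_in_use[i] = 0;
		u->erase_count[i] = 0;
	}
}

/* Return the free physical block with the lowest erase count, or -1. */
int toy_ubi_pick_peb(const struct toy_ubi *u)
{
	int best = -1;

	for (int i = 0; i < NUM_PEBS; i++) {
		if (u->peb_in_use[i])
			continue;
		if (best < 0 || u->erase_count[i] < u->erase_count[best])
			best = i;
	}
	return best;
}

/*
 * Rewrite a logical block: map it to a fresh physical block, then erase
 * and release the old one. Returns the new physical block or -1.
 */
int toy_ubi_remap(struct toy_ubi *u, int leb)
{
	int old, fresh;

	assert(leb >= 0 && leb < NUM_PEBS);
	old = u->leb_to_peb[leb];
	fresh = toy_ubi_pick_peb(u);
	if (fresh < 0)
		return -1;

	u->peb_in_use[fresh] = 1;
	u->leb_to_peb[leb] = fresh;
	if (old != UNMAPPED) {
		u->erase_count[old]++;
		u->peb_in_use[old] = 0;
	}
	return fresh;
}
```

The client only ever sees the logical number; rewriting the same logical block
repeatedly spreads the erases over the whole chip.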

 3. This would provide wear leveling, and probably simultaneously
provide relatively efficient and safe access to write sector 
and page-sized I/O. Below this level, things had better be
comfortable with the limitations of NAND if they want to work well.

I don't see how this provides across device wear levelling.

 4. JFFS2 has its own wear-leveling scheme, as do several other
filesystems, so they probably want to bypass this piece of the stack.

JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2's own
wear levelling sucks.

 5. We don't reimplement higher pieces of the stack (dm-crypt,
snapshot, etc.).

Why should we reimplement that ?

 6. We make some things possible that simply aren't otherwise.

 And this picture isn't even interesting yet. Imagine a dm-cache layer
 that caches data read from disks in high-speed flash. Or using
 dm-mirror to mirror writes to local flash over NBD or to a USB drive.
 Neither of these can be done 'right' in a stack split between device
 mapper and UBI.

Err. Implement a clever block layer on top of UBI and use all the
goodies you want including device mapper.

tglx



Re: [PATCH] [REVIEW] Fix irqpoll on IA64 (timer interrupt != 0)

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 19:13 +0100, Bernhard Walle wrote:
 That requires changes in Linux-generic files. The default of timer_irq is 0, 
 so
 the patch doesn't break i386/x86_64. However, other platforms also may also
 have a timer interrupt non-equal to zero, so they can also use the new
 set_timer_interrupt() function.
 
 The patch is against 2.6.21-rc4. Please give me your input how to improve
 the way it's done if you don't like the way I did the change. irqpoll is
 required to work with kdump in some situations and that's why I discovered
 that kdump doesn't work on that platform (HP rx2660).
 
 
 Signed-off-by: Bernhard Walle [EMAIL PROTECTED]

Acked-by: Thomas Gleixner [EMAIL PROTECTED]





Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:
   If a static volume is simply a non-dynamic volume, then device mapper
   can do that too. And countless other things. Which is not an aside.
   UBI growing to do all the things that device mapper does is exactly
   the thing we should be seeking to avoid.
  
  No it can't and device mapper sits on top of block devices. FLASH is no
  block device. Period.
 
 Which of the following two properties does it lack?
 
 - discrete blocks
 - non-sequential access to blocks
 
 When you do the obvious s/blocks/eraseblocks/, this appears to be
 true.

It appears to be, but it is not. You enforce semantics on a device,
which it does not have.

 Saying but I can't do I/O smaller than the blocksize doesn't change
 this any more than it would for disks.

There is a huge difference. Disk block size is 512 bytes and FLASH block
size is at least 16KiB and up to 256KiB.

Just do the math:

Write sampling data streams in 2KiB chunks to your uber devicemapper on
a 1GiB device with 64KiB erase block size:

Fine grained FLASH aware writes allow 32 chunks in a block without
erasing the block.

Your method erases the block 32 times to write the same amount of data.

Result: You wear out the flash 32 times faster. Cool feature.
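
The arithmetic can be checked with a few lines of C. This is only a sketch:
the function names are made up, and it assumes the worst case described above,
where every chunk written through an eraseblock-sized block layer costs one
full erase of the block.

```c
#include <assert.h>
#include <stddef.h>

/* Erases needed when every chunk write rewrites a whole eraseblock. */
size_t erases_block_granular(size_t chunks)
{
	return chunks;
}

/*
 * Erases needed when chunks are appended page by page and a block is
 * only erased once per fill (rounded up to whole blocks).
 */
size_t erases_page_granular(size_t chunks, size_t chunk_size,
			    size_t eraseblock_size)
{
	size_t per_block;

	assert(chunk_size > 0 && eraseblock_size >= chunk_size);
	per_block = eraseblock_size / chunk_size;
	return (chunks + per_block - 1) / per_block;
}
```

With 2KiB chunks and a 64KiB eraseblock, erases_page_granular(32, 2048, 65536)
gives 1 while erases_block_granular(32) gives 32, which is the factor of 32
quoted above.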

 Saying but I can do smaller I/O efficiently in some circumstances
 also doesn't change it.

We can do it under _any_ circumstances and that _does_ change it.
Implementing a clever block device layer on top of UBI is simple and
would provide FLASH page sized I/O, i.e. 2KiB in the above example.

 In historical UNIX, some tapes were block devices too. Because they
 supported seek().

I'm impressed. How exactly are some tapes comparable to FLASH chips ?

Your next proposal is to throw away MTD-utils and use mt instead ?

  Device mapper cannot provide a simple, easy to decode scheme for boot
  loaders. We need to be able to boot out of 512 - 2048 bytes of NAND FLASH
  and be able to find the kernel or second stage boot loader in this
  unordered device.
  
  And no, fixed addresses do not work. Do you want to implement device
  mapper in your initial bootloader stage ?
 
 This is exactly the same problem as booting on a desktop PC. But
 somehow LILO manages. My first Linux box had a hell of a lot less disk
 than the platform I bootstrapped (and wrote NAND drivers for) last
 month had in NAND.

No, it is not. You get the absolute sector address of your second stage
and this is a complete no-brainer. The translation is done in the DISK
device.

You simply ignore the fact that inside each disk, USB Stick, CF-CARD,
whatever - there is a more or less intelligent controller device, which
does the mapping to the physical storage location. There is _NO_ such
thing on a bare FLASH chip.

It does not matter whether your embedded device had more NAND space
than my old CP/M machine's floppy. It simply matters that even the old
CP/M floppy device had some rudimentary intelligence on board.

Furthermore I want to be able to get the bitflip correction on my second
stage loader / kernel in the same safe way as we do it for everything
else and still be able to bootstrap that from an extremely small
bootloader.

   If the right way is instead to extend the block layer and device
   mapper to encompass the quirks of NAND in a sensible fashion, then UBI
   should not go in.
  
  No, block layer on top of FLASH needs 80% of the functionality of UBI in
  the first place.
 
 Incorrect. A block-based filesystem on top of flash needs this
 functionality. But a block device suitable to device mapper layering
 (which then provides the functionality) does not.

How exactly does device mapper provide:

A) across device wear levelling ?
B) dynamic partitioning for FLASH aware file systems ?
C) across device wear levelling for FLASH aware file systems ?
D) background bit-flip corrections (copying affected blocks and recycling
the old ones) ?
E) position independent placement of the second stage bootloader ?

  You need to implement a clever journalling block device
  emulator in order to keep the data alive and the FLASH from being worn out
  in no time. You need the wear levelling, otherwise you can throw
  away your FLASH in no time.
 
 And that's why it's in my picture.

Yes, it is in your picture, but:

1) it excludes FLASH aware file systems and UBI does not.
2) your picture still does not explain how it achieves the above A),
B), C), D) and E)

Your extra path for partitioning(4) and JFFS2 is just a weird hack,
which makes your proposal completely absurd.

   Let me draw a picture so we have something to argue about:
   
                         iSCSI/nbd(6)
                              |
   filesystem {      swap     |    ext3      ext3     jffs2
                       \      |     |         |        /
                  /     \     |  dm-crypt->snapshot(5) /
   device mapper -|      \     \    |                 /
                  |       partitioning               /
                  |

Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 16:36 -0500, Matt Mackall wrote:
 On Mon, Mar 19, 2007 at 11:06:33PM +0200, Artem Bityutskiy wrote:
  On Mon, 2007-03-19 at 14:54 -0500, Matt Mackall wrote:
   The issue is 14000 lines of patch to make a parallel subsystem.
  
  Parallel system exists since very long. One is
  flash-SW_or_HW_FTL-all_blkdev_stuff. The other is MTD-JFFS2. Think
  about _why_ there are 2 of them. Hint - reliability, performance. Your
  ranting basically says that only the first one makes sense. This is not
  true.
 
 A better way would be for MTD to deliver a block dev with a rich
 enough interface for JFFS2 to use efficiently in the first place. Yes,
 I know that can't be done with the current block dev layer. But that's
 what the source is for.

Why the hell would JFFS2 need a block device interface ? 

What's the gain ?

  We enhance the second branch, not the first, please, realize this. Both
  branches have their user base, and have always had.
  
                        iSCSI/nbd(6)
                             |
  filesystem {      swap     |    ext3      ext3     jffs2
                      \      |     |         |        /
                 /     \     |  dm-crypt->snapshot(5) /
  device mapper -|      \     \    |                 /
                 |       partitioning               /
                 |            |      partitioning(4)
                 |     wear leveling(3)      /
                 |            |             /
                 |      block concatenation
                 |       |     |     |     |
                 \     bad block remapping(2)
                         |     |     |     |
  MTD raw block {  raw block devices with no smarts(1)
                        /      |      \     \
  hardware {         NAND    NAND    NAND   NAND
  
  Matt, as I pointed in the first mail, flash != block device. 
 
 And as I pointed out, you're wrong. It is both block oriented
 (eraseBLOCK??) and random access. That's what a block device is. The
 fact that it doesn't look like the other things that Linux currently
 calls a block device and supports well is another matter.

It does matter very much, as it is not a block device. It is a FLASH device
and you can make as many comparisons with eraseBLOCK as you want, you do not
turn FLASH into a DISK.

Again: Disks (including CF-Cards and USB-Sticks) have intelligent
controllers, which abstract the hardware oddities away and present you a
block device.

  In your picture I see NAND-MTD raw block. So am I right that you
  assume that we already have a decent FTL? The fact is that we do
  not.
 
 No. Look at the picture for more than two seconds, please. 
 
 I can tell you didn't do this because you didn't manage to find (1)
 which explicitly says with no smarts. And you also cut out the footnote
 where I explained what I meant by with no smarts.

 Find the spots marked (2) and (3). These are your FTL. 

And where please are (2) and (3) inside of device mapper ?

  Please, bear in mind that decent FTL is difficult and an FS on top of
  FTL is slow, FTL hits performance considerably.
 
 ...and if you'd actually looked at the picture, you'd have seen JFFS2
 bypassing it. Along with another footnote explaining it.

The (4) partitioning and JFFS2 on top is a step back from the current
UBI functionality. Now we can have resizable partitioning even for JFFS2
and JFFS2 can utilize the UBI wear levelling, which is way better than
the crude heuristics of JFFS2.

You want to force FLASH into device mapper for some strange and
non-obvious reason. Just the coincidence of eraseBLOCK and BLOCKdevice
is not really convincing.

You impose the usage of eraseblock size on FLASH, which is simply wrong:

DISK has a 1:1 relationship of eraseblock and minimal I/O. FLASH has
not. I did the math in a different mail and I'm not buying your factor
32 FLASH life time reduction for the price of having a bunch of lines of
code less in the kernel.

If you really consider running ext3, xfs or whatever on top of FLASH,
please go and do the homework on CF-Cards and USB-Sticks. Run them into
the fast wearout death. And device mapper does nothing to help
avoid that. Running ext3 on top of FLASH with a minimal I/O size of
erase block size is simply braindead.

tglx




Re: [patch 1/13] signal/timer/event fds v7 - anonymous inode source ...

2007-03-19 Thread Thomas Gleixner
Davide,

On Mon, 2007-03-19 at 16:47 -0700, Davide Libenzi wrote:
 This patch adds an anonymous inode source, to be used for files that need 
 an inode only in order to create a file*. We do not care about having an 
 inode for each file, and we do not even care about having different names in 
 the associated dentries (dentry names will be the same for classes of file*).
 This allows code reuse, and will be used by epoll, signalfd and timerfd 
 (and whatever else there'll be).

 +int aino_getfd(int *pfd, struct inode **pinode, struct file **pfile,
 +char const *name, const struct file_operations *fops, void *priv)
 +{
 + struct qstr this;
 + struct dentry *dentry;
 + struct inode *inode;
 + struct file *file;
 + int error, fd;
 +
 + error = -ENFILE;
 + file = get_empty_filp();
 + if (!file)
 + goto eexit_1;

make this return -ENFILE; please

 + inode = aino_getinode();
 + if (IS_ERR(inode)) {
 + error = PTR_ERR(inode);
 + goto eexit_2;

Can you please use a bit more descriptive labels ?

e.g:
goto out_filp;

 + }
 +
 + error = get_unused_fd();
 + if (error < 0)
 + goto eexit_3;

e.g:
goto out_inode;

 + fd = error;
 +
 + /*
 +  * Link the inode to a directory entry by creating a unique name
 +  * using the inode sequence number.
 +  */
 + error = -ENOMEM;
 + this.name = name;
 + this.len = strlen(name);
 + this.hash = 0;
 + dentry = d_alloc(aino_mnt->mnt_sb->s_root, &this);
 + if (!dentry)
 + goto eexit_4;

e.g:

goto out_fd;


 +static int ainofs_delete_dentry(struct dentry *dentry)
 +{
 + /*
 +  * We faked vfs to believe the dentry was hashed when we created it.
 +  * Now we restore the flag so that dput() will work correctly.
 +  */
 + dentry->d_flags |= DCACHE_UNHASHED;
 + return 1;
 +}

Please put either struct ainofs_dentry_operations ... below the next
function or move ainofs_delete_dentry() above struct
ainofs_dentry_operations ...

It's annoying to look up the prototypes and implementation back and forth.

 +static struct inode *aino_getinode(void)
 +{
 + return igrab(aino_inode);
 +}

Please use igrab(aino_inode); directly in this one single place above.
That saves us a prototype and a useless static function with no value.

 +/*
 + * A single inode exist for all aino files. On the contrary of pipes,
 + * aino inodes has no per-instance data associated, so we can avoid
 + * the allocation of multiple of them.
 + */
 +static struct inode *aino_mkinode(void)
 +{
 + int error = -ENOMEM;
 + struct inode *inode = new_inode(aino_mnt->mnt_sb);
 +
 + if (!inode)
 + goto eexit_1;

return ERR_PTR(-ENOMEM);

 + inode->i_fop = &aino_fops;
 +}
 +
 +static int ainofs_get_sb(struct file_system_type *fs_type, int flags,
 +  const char *dev_name, void *data, struct vfsmount *mnt)
 +{
 + return get_sb_pseudo(fs_type, "aino:", NULL, AINOFS_MAGIC, mnt);
 +}

Please put either struct file_system_type aino_fs_typ ... below this
function or move ainofs_get_sb() above struct file_system_type
aino_fs_typ ...

 +static int __init aino_init(void)
 +{
 +
 + if (register_filesystem(&aino_fs_type))
 + goto epanic;
 +
 + aino_mnt = kern_mount(&aino_fs_type);
 + if (IS_ERR(aino_mnt))
 + goto epanic;
 +
 + aino_inode = aino_mkinode();
 + if (IS_ERR(aino_inode))
 + goto epanic;
 +
 + return 0;
 +
 +epanic:
 + panic("aino_init() failed\n");

Panic ? It's not life critical - is it ? 

A printk(KERN_ERR...) and a return -Exx would be sufficient.
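
The label naming and the early return suggested above can be sketched in plain
user-space C. Nothing below is taken from the patch: toy_getfd(), its three
resources and the fail_at test hook are stand-ins chosen so the example
compiles outside the kernel.

```c
#include <errno.h>
#include <stdlib.h>

/* Stand-ins for the file, inode and dentry the real function sets up. */
struct toy_handle {
	void *filp;
	void *inode;
	void *dentry;
};

/*
 * Acquire three resources in order. fail_at (1..3) forces the n-th
 * acquisition to fail so the unwind paths can be exercised; 0 means
 * no forced failure. Returns 0 or a negative errno value.
 */
int toy_getfd(struct toy_handle *h, int fail_at)
{
	int error;

	h->filp = (fail_at == 1) ? NULL : malloc(16);
	if (!h->filp)
		return -ENFILE;		/* nothing to undo yet, just return */

	h->inode = (fail_at == 2) ? NULL : malloc(16);
	if (!h->inode) {
		error = -ENOMEM;
		goto out_filp;		/* label names what gets released */
	}

	h->dentry = (fail_at == 3) ? NULL : malloc(16);
	if (!h->dentry) {
		error = -ENOMEM;
		goto out_inode;
	}
	return 0;

out_inode:
	free(h->inode);
	h->inode = NULL;
out_filp:
	free(h->filp);
	h->filp = NULL;
	return error;
}
```

A label such as out_inode tells the reader what is undone from that point on,
which eexit_3 does not.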

tglx





Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 22:51 +0100, Stefan Prechtel wrote:
 2007/3/19, Thomas Gleixner [EMAIL PROTECTED]:
  On Mon, 2007-03-19 at 21:35 +0100, Stefan Prechtel wrote:
  CPU0   CPU1
0:  28289  0  local-APIC-edge-fasteio   timer
   ...
   LOC:  28237  28236
  
   after a read: (I hope that is this what you want :-)
  CPU0   CPU1
 0:  30344  0  local-APIC-edge-fasteio   timer
   ...
   LOC:  30292  30291
 
  Is this with AC plugged in ? If yes, please provide the same numbers for
  battery mode.
 
 Yes. And here is the output for battery mode (2.6.20):
CPU0   CPU1
   0: 292153  0  local-APIC-edge-fasteio   timer
 LOC: 292114 292113
 
CPU0   CPU1
   0: 293263  0  local-APIC-edge-fasteio   timer
 LOC: 293224 293223

Hmm. Can you please apply the following patch on top of 2.6.20 and
check, if the WARN_ON_ONCE triggers when you boot w/o AC plugged ?

Thanks,

tglx

Index: linux-2.6.20/arch/i386/kernel/apic.c
===================================================================
--- linux-2.6.20.orig/arch/i386/kernel/apic.c
+++ linux-2.6.20/arch/i386/kernel/apic.c
@@ -1174,6 +1174,8 @@ void switch_APIC_timer_to_ipi(void *cpum
 	cpumask_t mask = *(cpumask_t *)cpumask;
 	int cpu = smp_processor_id();
 
+	WARN_ON_ONCE(1);
+
 	if (cpu_isset(cpu, mask) &&
 	    !cpu_isset(cpu, timer_bcast_ipi)) {
 		disable_APIC_timer();




Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 20:05 -0500, Matt Mackall wrote:
 On Tue, Mar 20, 2007 at 01:42:46AM +0100, Thomas Gleixner wrote:
  On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:
   This is exactly the same problem as booting on a desktop PC. But
   somehow LILO manages. My first Linux box had a hell of a lot less disk
   than the platform I bootstrapped (and wrote NAND drivers for) last
   month had in NAND.
  
  No, it is not. You get the absolute sector address of your second stage
  and this is a complete no-brainer. The translation is done in the DISK
  device.
 
 LILO and friends manage to boot systems that use software RAID and
 LVM. There are multiple methods. Some use block lists, some use tiny
 boot partitions, etc. All of them are applicable to controllerless NAND.

Yes, by using fixed addresses, which is not what I want.

  You simply ignore the fact that inside each disk, USB Stick, CF-CARD,
  whatever - there is a more or less intelligent controller device, which
  does the mapping to the physical storage location. There is _NO_ such
  thing on a bare FLASH chip.
 
 How many times do I have to tell you that I wrote a driver for
 controllerless NAND just last month?

Wow. I'm impressed because I'm pulling my opinion out of thin air.

  How exactly does device mapper:
  
  A) across device wear levelling ?
 
 The same way UBI does, but encapsulated in a device mapper layer.

Does the device mapper do that ?

  B) dynamic partitioning for FLASH aware file systems ?

 See above.

Does the device mapper do that ?

  C) across device wear levelling for FLASH aware file systems ?
 
 See above.

Look at your own drawing. 

  D) background bit-flip corrections (copying affected blocks and recycling
  the old ones) ?
 
 See above.

Repeating patterns do not impress me. Your drawing tells otherwise.

  E) allow position independent placement of the second stage bootloader ?
 
 See way above to my LILO response.

Neither LILO nor GRUB have search capabilities for randomly located
second stage loaders.

You need to implement a clever journalling block device
emulator in order to keep the data alive and the FLASH from being worn out
in no time. You need the wear levelling, otherwise you can throw
away your FLASH in no time.
   
   And that's why it's in my picture.
  
  Yes, it is in your picture, but:
  
  1) it excludes FLASH aware file systems and UBI does not.
  2) your picture still does not explain how it achieves the above A),
  B), C), D) and E)
  
  Your extra path for partitioning(4) and JFFS2 is just a weird hack,
  which makes your proposal completely absurd.
 
 No, it's just there to show the flexibility of device mapper. But I have
 the sneaking suspicion you have no idea how device mapper works.

Sigh. Layering violation == flexibility.

 In brief: device mapper takes one or more devices, applies a mapping
 to them, and returns a new device. For example, take various spans of
 /dev/hda1 and /dev/sda3 and present them as new-device1. Take
 new-device1 and transform it with dm-crypt to get new-device2. The
 kernel doesn't decide how to do this, any more than it decides where
 to mount your filesystems. Userspace does.
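
As a rough illustration of the mapping described here, a table of spans can be
resolved to a backing device and sector with a linear walk. The structure and
function names are invented for this sketch and are not the device mapper API.

```c
#include <stddef.h>
#include <stdint.h>

/* One span of a backing device, as in the /dev/hda1 + /dev/sda3 example. */
struct toy_span {
	const char *dev;   /* backing device name */
	uint64_t start;    /* first sector used on the backing device */
	uint64_t len;      /* number of sectors in this span */
};

/*
 * Translate a sector of the mapped device into a backing device and a
 * sector on it. Returns 0 on success, -1 if the sector is out of range.
 */
int toy_map_sector(const struct toy_span *table, size_t n, uint64_t sector,
		   const char **dev, uint64_t *dev_sector)
{
	for (size_t i = 0; i < n; i++) {
		if (sector < table[i].len) {
			*dev = table[i].dev;
			*dev_sector = table[i].start + sector;
			return 0;
		}
		sector -= table[i].len;	/* skip past this span */
	}
	return -1;
}
```

Note that the table is static per mapping: it says nothing about erase counts
or bit flips, which is where the questions above come in.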

I know how it works. But your blurb does not answer any of my questions.

 5. We don't reimplement higher pieces of the stack (dm-crypt,
snapshot, etc.).

Why should we reimplement that ?
   
   So that you can get encryption and snapshot, etc.?
  
  1. On top of a clever block device.
  
  2. UBI can do snapshots by design.
 
 Oh, so you HAVE reimplemented it.

No, it already works.

  3. Encryption should be done on the VFS layer and not below the
  filesystem layer. Doing it inside the block layer or the device mapper
  is broken by design.
 
 That's highly debatable and not a topic for this thread.

I see, you define, what has to be discussed.

tglx




Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 21:27 -0700, Greg KH wrote:
 On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote:
  Arjan van de Ven [EMAIL PROTECTED] writes:
   
   well we can do the handshake to take ownership like we do much later in
   boot, but that requires PCI to be there and fully discovered, which we
   don't have this early.
  
  That's not true - we do early pci discovery. Doing USB handsoff
  there would be quite possible.
 
 What, we don't do USB handoff early enough in the boot process?  It's
 happening at PCI quirk time now, which I think should be early enough
 for everyone (and too early for some who rely on USB keyboards and
 initramfs shells...)

It happens way after the CPUs are brought up. At this point both the
delay loop calibration and the local APIC calibration are already done.

tglx




Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-21 Thread Thomas Gleixner
On Tue, 2007-03-20 at 17:47 +0100, Grzegorz Chwesewicz wrote:
 I have HP nx6325. I've tried to use WARN_ON_ONCE patch, but I don't see
 nothing special in dmesg. Just in case I'm posting my
 dmesg_2.6.20_WARN_ON_ONCE_on_battery log on
 http://bugzilla.kernel.org/show_bug.cgi?id=8235 .
 
 Below I post output of my /proc interrupts (10 sec. delay between reads).

 Other interesting thing on 2.6-git is that when I press a key on keyboard it
 doesn't repeat (on battery), but it repeats on 2.6-git on ac.

Sigh. The periodic PIT interrupt papers over the problem in 2.6.21-rc.
It prevents the BIOS from switching the CPU into lower power states.

I'm working on a detect LAPIC / BIOS madness check.

tglx




Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-21 Thread Thomas Gleixner
On Wed, 2007-03-21 at 10:46 +0100, Andi Kleen wrote:
 On Wednesday 21 March 2007 10:24, Thomas Gleixner wrote:
  On Tue, 2007-03-20 at 17:47 +0100, Grzegorz Chwesewicz wrote:
   I have HP nx6325. I've tried to use WARN_ON_ONCE patch, but I don't see
   nothing special in dmesg. Just in case I'm posting my
   dmesg_2.6.20_WARN_ON_ONCE_on_battery log on
   http://bugzilla.kernel.org/show_bug.cgi?id=8235 .
  
   Below I post output of my /proc interrupts (10 sec. delay between reads).
  
   Other interesting thing on 2.6-git is that when I press a key on keyboard
   it doesn't repeat (on battery), but it repeats on 2.6-git on ac.
 
  Sigh. The periodic PIT interrupt papers over the problem in 2.6.21-rc.
  It prevents the BIOS from switching the CPU into lower power states.
 
 I think I ran into the same problem with my initial noidletick patch.
 I don't have that test machine anymore though.
 
 Normally the use PIT when AMD && Cstate >= 2 check should
 have caught that though. Why did it here?

The BIOS/ACPI is broken and only exposes C1, which should not switch
off the LAPIC. The BIOS is somehow switching into deeper C-States behind
the kernel's back.

tglx




Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-21 Thread Thomas Gleixner
On Wed, 2007-03-21 at 11:37 +0100, Andi Kleen wrote:
  The BIOS/ACPI is broken and only exposes C1, which should not switch
  off the LAPIC. The BIOS is somehow switching into deeper C-States behind
  the kernel's back.
 
 Hmm, perhaps we can check AMD && (cstate >= 2 || has a battery) ? 
 Should be doable by looking up the battery object in ACPI

Which makes us rely on another ACPI feature. What guarantees that the
ACPI tables are correct for this one ?

tglx




Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-21 Thread Thomas Gleixner
On Wed, 2007-03-21 at 12:05 +0100, Jörn Engel wrote:
 On Tue, 20 March 2007 01:42:46 +0100, Thomas Gleixner wrote:
  On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:
  
 4. JFFS2 has its own wear-leveling scheme, as do several other
filesystems, so they probably want to bypass this piece of the stack.

JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2's own
wear levelling sucks.
   
   Ok, fine. How about LogFS, then?
  
  LogFS can easily leverage UBI's wear algorithm.
 
 Ok, now we have reached the absurd.  UBI quite fundamentally cannot do
 wear leveling as good as LogFS can.  Simply because UBI has zero
 knowledge of the _contents_ of its blocks.  Knowing whether a block is
 90% garbage or not makes a great difference.
 
 Also LogFS currently requires erasesizes of 2^n.

Last time I talked to you about that, you said it would be possible and
fixable. We talked about several mechanisms, which would allow a
filesystem or other users to hint such things to UBI.

Even if the LogFS wear levelling is so superior, it CAN'T do across
device wear levelling.

tglx




Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-21 Thread Thomas Gleixner
On Wed, 2007-03-21 at 12:35 +0100, Jörn Engel wrote:
 Even if such flashes still contain a bootloader and a kernel, that will
 occupy less than 1% of the device.  Wear leveling across the device is
 fairly pointless here.  This is what I designed LogFS for.

Still you need to have a solution for handling bitflips in those
bootloader and kernel areas.

I don't dispute that on a terabyte solid state disk, which is used in a
totally different way, UBI is not necessarily the right tool.

 There is some middle ground where a combination of UBI and LogFS may
 make sense.  LogFS can still make sense for devices as small as 64MiB.
 But I'm not too concerned about that because flashes will continue to
 grow and the advantages of cross-device wear leveling will continue to
 diminish.

Flashes will grow, but this will not change the embedded use case with a
relatively small FLASH and the bootloader / kernel / rootfs / datafs
scenario, where UBI is the right tool to use.

There is no hammer for all nails and I don't see device mapper doing
what UBI does right now.

tglx




Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-21 Thread Thomas Gleixner
Stefan, Grzegorz

On Wed, 2007-03-21 at 12:14 +0100, Thomas Gleixner wrote:
 On Wed, 2007-03-21 at 11:37 +0100, Andi Kleen wrote:
   The BIOS/ACPI is broken and only exposes C1, which should not switch
   off the LAPIC. The BIOS is somehow switching into deeper C-States behind
   the kernel's back.
  
  Hmm, perhaps we can check AMD && (cstate >= 2 || has a battery) ? 
  Should be doable by looking up the battery object in ACPI
 
 Which makes us rely on another ACPI feature. What guarantees that the
 ACPI tables are correct for this one ?

Can you please apply the patch below and add nolapic_timer to the
kernel command line ?

Please provide also the output of

# dmidecode

on your laptops.

Thanks,

tglx

diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 5cff797..67f8d9f 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -61,6 +61,8 @@ static int enable_local_apic __initdata = 0;
 
 /* Local APIC timer verification ok */
 static int local_apic_timer_verify_ok;
+/* Disable local APIC timer from the kernel commandline */
+static int local_apic_timer_disabled;
 
 /*
  * Debug level, exported for io_apic.c
@@ -340,6 +342,13 @@ void __init setup_boot_APIC_clock(void)
 	long delta, deltapm;
 	int pm_referenced = 0;
 
+	if (local_apic_timer_disabled) {
+		/* No broadcast on UP ! */
+		if (num_possible_cpus() > 1)
+			setup_APIC_timer();
+		return;
+	}}
+
 	apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n"
 		    "calibrating APIC timer ...\n");
 
@@ -1179,6 +1188,13 @@ static int __init parse_nolapic(char *arg)
 }
 early_param("nolapic", parse_nolapic);
 
+static int __init parse_disable_lapic_timer(char *arg)
+{
+	local_apic_timer_disabled = 1;
+	return 0;
+}
+early_param("nolapic_timer", parse_disable_lapic_timer);
+
 static int __init apic_set_verbosity(char *str)
 {
 	if (strcmp("debug", str) == 0)




Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-21 Thread Thomas Gleixner
On Wed, 2007-03-21 at 13:15 +0100, Thomas Gleixner wrote:
 + return;
 + }}
 +

Ooops, sorry. Did not quilt refresh before sending it out. Correct
version below.

tglx

diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 5cff797..83cf98d 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -61,6 +61,8 @@ static int enable_local_apic __initdata = 0;
 
 /* Local APIC timer verification ok */
 static int local_apic_timer_verify_ok;
+/* Disable local APIC timer from the kernel commandline */
+static int local_apic_timer_disabled;
 
 /*
  * Debug level, exported for io_apic.c
@@ -340,6 +342,13 @@ void __init setup_boot_APIC_clock(void)
 	long delta, deltapm;
 	int pm_referenced = 0;
 
+	if (local_apic_timer_disabled) {
+		/* No broadcast on UP ! */
+		if (num_possible_cpus() > 1)
+			setup_APIC_timer();
+		return;
+	}
+
 	apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n"
 		    "calibrating APIC timer ...\n");
 
@@ -1179,6 +1188,13 @@ static int __init parse_nolapic(char *arg)
 }
 early_param("nolapic", parse_nolapic);
 
+static int __init parse_disable_lapic_timer(char *arg)
+{
+	local_apic_timer_disabled = 1;
+	return 0;
+}
+early_param("nolapic_timer", parse_disable_lapic_timer);
+
 static int __init apic_set_verbosity(char *str)
 {
 	if (strcmp("debug", str) == 0)




Re: Fwd: [PATCH 7/9] ACPI: Only use IPI on known broken machines (AMD, Dothan/BaniasPentium M)

2007-03-21 Thread Thomas Gleixner
On Tue, 2007-03-20 at 20:23 +0100, Andi Kleen wrote:
  +   else if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
  + boot_cpu_data.x86 == 6) &&
  +(boot_cpu_data.x86_model == 13 ||
  + boot_cpu_data.x86_model == 9))
 
 What is with 10..12 and > 13 ? I would just force it for all model 6s that 
 have >= C2 and definitely for all with C3.

C3 is unconditional anyway.

tglx




Re: BUG lapic: Can't boot on battery (2.6.21-rc{1,2,3,4})

2007-03-21 Thread Thomas Gleixner
On Wed, 2007-03-21 at 14:04 +0100, Stefan Prechtel wrote:
 I uploaded the output of dmesg (kernel 2.6.21-rc4-git5) (battery / ac)
 and dmidecode
 I can boot on battery with nolapic_timer and the second core is online, too.
 /proc/acpi/processor/C000/ shows the same as before but
 /proc/interrupts has changed:
 
 (battery)
CPU0   CPU1
   0:  47131  0  local-APIC-edge-fasteoi   timer
 LOC:  0  46978
 
 (ac)
CPU0   CPU1
   0:  59137  0  local-APIC-edge-fasteoi   timer
 LOC:  0  58984

That's correct. We keep the PIT alive and trigger the lapic timer
interrupt via an IPI.

tglx




Re: [PATCH] clockevents: Fix suspend/resume to disk hangs

2007-03-21 Thread Thomas Gleixner
On Tue, 2007-03-20 at 10:35 +0100, Marcus Better wrote:
 Thomas Gleixner wrote:
 
  I finally found a dual core box, which survives suspend/resume without
  crashing in the middle of nowhere. Sigh, I never figured out from the
  code and the bug reports what's going on.
  
  The observed hangs are caused by a stale state transition of the clock
  event devices, which keeps the RCU synchronization away from completion,
  when the non boot CPU is brought back up.
 
 This didn't fix the suspend problems on my Thinkpad R60. (Sorry for
 nagging - please let me know if I can assist in debugging this...)

I did not expect it to fix your problem. clockevents are only used
in arch/i386 right now. You are running a 64 bit kernel, so any change in
your problem would have been very surprising.

You said, that the breakage came between 2.6.20 and rc2. Can you bisect
it ?

tglx




[PATCH] i386: disable local apic timer via command line or dmi quirk

2007-03-21 Thread Thomas Gleixner
The local APIC timer stops working in deeper C-States. This is handled
by the ACPI code and a broadcast mechanism in the clockevents / tick
management code.

Some systems do not expose the deeper C-States to the kernel, but switch
into deeper C-States behind the kernel's back. This delays the local apic
timer interrupts forever and makes the systems unusable.

Add a command line option to disable the local apic timer and a dmi
quirk for known broken systems.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 856c8b1..06377c7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1117,6 +1117,8 @@ and is between 256 and 4096 characters. It is defined in the file
 
 	nolapic		[IA-32,APIC] Do not enable or use the local APIC.
 
+	nolapic_timer	[IA-32,APIC] Do not use the local APIC timer.
+
 	noltlbs		[PPC] Do not use large page/tlb entries for kernel
 			lowmem mapping on PPC40x.
 
diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 5cff797..3682511 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -28,6 +28,7 @@
 #include <linux/clockchips.h>
 #include <linux/acpi_pmtmr.h>
 #include <linux/module.h>
+#include <linux/dmi.h>
 
 #include <asm/atomic.h>
 #include <asm/smp.h>
@@ -61,6 +62,8 @@ static int enable_local_apic __initdata = 0;
 
 /* Local APIC timer verification ok */
 static int local_apic_timer_verify_ok;
+/* Disable local APIC timer from the kernel commandline or via dmi quirk */
+static int local_apic_timer_disabled;
 
 /*
  * Debug level, exported for io_apic.c
@@ -266,6 +269,32 @@ static void __devinit setup_APIC_timer(void)
 }
 
 /*
+ * Detect systems with known broken BIOS implementations
+ */
+static int __init lapic_check_broken_bios(struct dmi_system_id *d)
+{
+	printk(KERN_NOTICE "%s detected: disabling lapic timer.\n",
+	       d->ident);
+	local_apic_timer_disabled = 1;
+	return 0;
+}
+
+static struct dmi_system_id __initdata broken_bios_dmi_table[] = {
+	{
+		/*
+		 * BIOS exports only C1 state, but uses deeper power
+		 * modes behind the kernel's back.
+		 */
+		.callback = lapic_check_broken_bios,
+		.ident = "HP nx6325",
+		.matches = {
+			DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq nx6325"),
+		},
+	},
+	{}
+};
+
+/*
  * In this functions we calibrate APIC bus clocks to the external timer.
  *
  * We want to do the calibration only once since we want to have local timer
@@ -340,6 +369,22 @@ void __init setup_boot_APIC_clock(void)
 	long delta, deltapm;
 	int pm_referenced = 0;
 
+	/* Detect known broken systems */
+	dmi_check_system(broken_bios_dmi_table);
+
+	/*
+	 * The local apic timer can be disabled via the kernel
+	 * commandline or from the dmi quirk above. Register the lapic
+	 * timer as a dummy clock event source on SMP systems, so the
+	 * broadcast mechanism is used. On UP systems simply ignore it.
+	 */
+	if (local_apic_timer_disabled) {
+		/* No broadcast on UP ! */
+		if (num_possible_cpus() > 1)
+			setup_APIC_timer();
+		return;
+	}
+
 	apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n"
 		    "calibrating APIC timer ...\n");
 
@@ -1179,6 +1224,13 @@ static int __init parse_nolapic(char *arg)
 }
 early_param("nolapic", parse_nolapic);
 
+static int __init parse_disable_lapic_timer(char *arg)
+{
+	local_apic_timer_disabled = 1;
+	return 0;
+}
+early_param("nolapic_timer", parse_disable_lapic_timer);
+
 static int __init apic_set_verbosity(char *str)
 {
 	if (strcmp("debug", str) == 0)




Re: sysfs ugly timer interface (was Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far)

2007-03-22 Thread Thomas Gleixner
On Thu, 2007-03-22 at 08:28 -0700, Greg KH wrote:
 On Tue, Mar 20, 2007 at 11:54:03AM +, Pavel Machek wrote:
  Hi!
  
   [EMAIL PROTECTED]:/home/maxim# cat 
   /sys/devices/system/clockevents/clockevents0/registered
   lapicF:0007 M:3(periodic) C: 1
   hpet F:0003 M:1(shutdown) C: 0
   lapicF:0007 M:3(periodic) C: 0
   [EMAIL PROTECTED]:/home/maxim#   
  
  Now... this file needs to die, before 2.6.21 is released. It tries to
  bring /proc-like parsing nightmare to sysfs. Kill it before it becomes
  part of stable ABI!
 
 Eeek!
 
 I agree, that needs to be fixed now.
 
 Remember, 1 value per file in sysfs!  Shall I just submit a patch
 ripping it out for now?

I'll fix it.

tglx




Re: [BUG] no boot with 2.6.21-rc3 and later

2007-03-22 Thread Thomas Gleixner
On Thu, 2007-03-22 at 12:25 -0700, john stultz wrote:
 On Thu, 2007-03-22 at 13:14 -0600, Bob Tracy wrote:
  john stultz wrote:
   Try this patch and let me know if it does the right thing.
  
  Will do.  I'll report back in a few hours.
  
   Although I do
   still need to dig a bit on the PIT hang issue.
  
  Any chance this might be related to the APIC issues currently being
  discussed in other threads?
 
 Hmmm. Good thought, I'll have to look into it. It could be that if the
 PIT is disabled in favor of the local apic, we'll have to make sure its
 not being used as a clocksource.

Ouch, yes. That's fatal and can happen. Not sure what to do about that.

tglx




[PATCH] i386: clockevents fix breakage on Geode/Cyrix PIT implementations

2007-03-22 Thread Thomas Gleixner
The PIT has no dedicated mode for shut down. The only way to disable PIT
is to put it into one shot mode. AMD implementations of PIT on Geode
(also observed on Cyrix) are confused by an empty transition from
CLOCK_EVT_MODE_UNUSED to CLOCK_EVT_MODE_SHUTDOWN, which puts the PIT
into one shot mode momentarily.

I realized after staring helplessly at the bug report
http://bugzilla.kernel.org/show_bug.cgi?id=8027 for quite a while that
the only change which might influence the bogomips calibration is the
above transition during the PIT initialization.

Avoiding the unnecessary switch to oneshot and later to periodic mode
fixes the weird bogomips value and also the resulting slowness.

The fix is confirmed on OLPC and another Geode based box.

Note: this is unrelated to the Dual Core problem discussed here:
http://lkml.org/lkml/2007/3/17/48

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/i8253.c b/arch/i386/kernel/i8253.c
index 5cbb776..10cef5c 100644
--- a/arch/i386/kernel/i8253.c
+++ b/arch/i386/kernel/i8253.c
@@ -47,9 +47,17 @@ static void init_pit_timer(enum clock_event_mode mode,
 		outb(LATCH >> 8 , PIT_CH0);	/* MSB */
 		break;
 
-	case CLOCK_EVT_MODE_ONESHOT:
+	/*
+	 * Avoid unnecessary state transitions, as it confuses
+	 * Geode / Cyrix based boxen.
+	 */
 	case CLOCK_EVT_MODE_SHUTDOWN:
+		if (evt->mode == CLOCK_EVT_MODE_UNUSED)
+			break;
 	case CLOCK_EVT_MODE_UNUSED:
+		if (evt->mode == CLOCK_EVT_MODE_SHUTDOWN)
+			break;
+	case CLOCK_EVT_MODE_ONESHOT:
 		/* One shot setup */
 		outb_p(0x38, PIT_MODE);
 		udelay(10);




Re: [PATCH] Fix irqpoll on IA64 (timer interrupt != 0)

2007-03-22 Thread Thomas Gleixner
On Thu, 2007-03-22 at 14:09 -0700, Andrew Morton wrote:
 I think the term 'timer_interrupt' is a bit generic-sounding.  Would it be
 better to call it irqpoll_interrupt?  After all, some architecture might
 want to use, umm, the keyboard interrupt to trigger IRQ polling ;)  

Interesting thought, but in general I have to agree.

 Also, the code presently passes the magic IRQ number into the generic IRQ
 code.  I wonder if we'd get a more pleasing result if we were to make the
 generic IRQ code call _out_ to the architecture:

 Then, ia64 can implement arch_is_irqpoll_irq() and it can do whatever it
 wants in there.
 
 The __attribute__((weak)) thing adds a little bit of overhead, but I don't
 think this is a fastpath?

Well, depends what you consider a fastpath. When noirqdebug == 0, it is
called on every interrupt.

tglx




Re: 2.6.21-rc[123] regression with NOAPIC

2007-03-22 Thread Thomas Gleixner
On Thu, 2007-03-22 at 14:42 +0100, Adrian Bunk wrote:
  Starting with head as of yesterday and reverting two commits (that are
  duplicates of each other -- the same commit came into Linus's tree via
  two different paths) 'fixes' the problem for me. I'll let those with the
  big brains decide just why.
  
  The two commits are 5c95d3f5783ab184f64b7848f0a871352c35c3cf and
  3434933b17fa64adddf83059603c61296f6e1ee2 . The net reverse diff of those
  two is below.
 ...
 
 Thanks for tracking it down.
 
 It's quite possible that these commits trigger your problem.
 
 Does it work if you do _not_ revert the commits, and instead replace in
 drivers/acpi/processor_idle.c the
   #ifdef ARCH_APICTIMER_STOPS_ON_C3
 with an
   #if 0
 ?

Then NOAPIC probably works again, but booting w/o NOAPIC fails.

tglx




Re: 2.6.21-rc[123] regression with NOAPIC

2007-03-22 Thread Thomas Gleixner
On Thu, 2007-03-22 at 15:16 +0100, Adrian Bunk wrote:
   Does it work if you do _not_ revert the commits, and instead replace in
   drivers/acpi/processor_idle.c the
 #ifdef ARCH_APICTIMER_STOPS_ON_C3
   with an
 #if 0
   ?
  
  Then NOAPIC probably works again, but booting w/o NOAPIC fails.
 
 But we'll know that it's this code that has a problem with noapic
 in the CONFIG_GENERIC_CLOCKEVENTS=n case.

Nope. This code does not have a problem. It causes a problem elsewhere:

It calls switch_ipi_to_APIC_timer() or switch_APIC_timer_to_ipi(), which
sets/clears a bit in the broadcast mask and enables / disables the local
APIC timer.

I don't see right now why this causes the box to lock up hard, but
maybe the debug printk's below give us some hint.

tglx

diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
index 723417d..29376e2 100644
--- a/arch/x86_64/kernel/apic.c
+++ b/arch/x86_64/kernel/apic.c
@@ -886,6 +886,8 @@ void disable_APIC_timer(void)
 	if (using_apic_timer) {
 		unsigned long v;
 
+		printk("Disabling local APIC timer %d\n", apic_runs_main_timer);
+
 		v = apic_read(APIC_LVTT);
 		/*
 		 * When an illegal vector value (0-15) is written to an LVT
@@ -910,6 +912,7 @@ void enable_APIC_timer(void)
 	    !cpu_isset(cpu, timer_interrupt_broadcast_ipi_mask)) {
 		unsigned long v;
 
+		printk("Enabling local APIC timer: %d\n", apic_runs_main_timer);
 		v = apic_read(APIC_LVTT);
 		apic_write(APIC_LVTT, v & ~APIC_LVT_MASKED);
 	}
@@ -934,6 +937,7 @@ void smp_send_timer_broadcast_ipi(void)
 
 	cpus_and(mask, cpu_online_map, timer_interrupt_broadcast_ipi_mask);
 	if (!cpus_empty(mask)) {
+		printk("Send IPI\n");
 		send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
 	}
 }




Re: [1/6] 2.6.21-rc4: known regressions

2007-03-23 Thread Thomas Gleixner
On Fri, 2007-03-23 at 12:42 +0100, Ingo Molnar wrote:
 there's a new post-rc4 regression: my T60 hangs during early bootup. I 
 bisected the hang down to this recent commit:
 
 | commit 25496caec111481161e7f06bbfa12a533c43cc6f
 | Author: Thomas Renninger [EMAIL PROTECTED]
 | Date:   Tue Feb 27 12:13:00 2007 -0500
 |
 |ACPI: Only use IPI on known broken machines (AMD, Dothan/BaniasPentium M)
 
 undoing this change fixes my T60 so it correctly boots again.
 
 the commit has this confidence-raising comment:
 
 |   However, I am not sure about the naming of the parameter and how it 
 |   could/should get integrated into the dyntick part 
 |   (CONFIG_GENERIC_CLOCKEVENTS). There, a more fine grained check (TSC 
 |   still running?, ..) is needed?
 
 could we please revert this commit until it's done correctly?
 
 and did this end up being a 'fix'? The change weakens the scope of a 
 hardware workaround, which IMO has no place so late in the cycle. At a 
 minimum the clockevents maintainer (Thomas) should have been Cc:-ed on 
 it.

Ingo, 

I had seen it before and had no objections, on the premise that it
does not break things and especially survives on Andrew's VAIO. I
expected that to come in via -mm so it gets enough testing.

We should revert that patch and add a trust_lapic_timer_in_c2
commandline option instead. So we are on the safe side.

tglx




[PATCH] i386: add command line option local_apic_timer_c2_ok

2007-03-23 Thread Thomas Gleixner
On Fri, 2007-03-23 at 12:56 +0100, Thomas Gleixner wrote:
 We should revert that patch and add a trust_lapic_timer_in_c2
 commandline option instead. So we are on the safe side.

Here is a patch which applies after reverting 
25496caec111481161e7f06bbfa12a533c43cc6f

It turned out that it is almost impossible to trust ACPI, BIOS & Co.
regarding the C states. This was the reason to switch the local apic
timer off in C2 state already. OTOH there are sane and well behaving
systems, which get punished by that decision.

Allow the user to confirm that the local apic timer is trustworthy in C2
state. This keeps the default behaviour on the safe side.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]
Acked-by: Ingo Molnar [EMAIL PROTECTED]

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e39ab0c..09640a8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -780,6 +780,9 @@ and is between 256 and 4096 characters. It is defined in the file
 	lapic		[IA-32,APIC] Enable the local APIC even if BIOS
 			disabled it.
 
+	lapic_timer_c2_ok	[IA-32,APIC] trust the local apic timer in
+			C2 power state.
+
 	lasi=		[HW,SCSI] PARISC LASI driver for the 53c700 chip
 			Format: addr:<io>,irq:<irq>
 
diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 244c3fe..e884152 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -64,6 +64,9 @@ static int enable_local_apic __initdata = 0;
 static int local_apic_timer_verify_ok;
 /* Disable local APIC timer from the kernel commandline or via dmi quirk */
 static int local_apic_timer_disabled;
+/* Local APIC timer works in C2 */
+int local_apic_timer_c2_ok;
+EXPORT_SYMBOL_GPL(local_apic_timer_c2_ok);
 
 /*
  * Debug level, exported for io_apic.c
@@ -1232,6 +1235,13 @@ static int __init parse_disable_lapic_timer(char *arg)
 }
 early_param("nolapic_timer", parse_disable_lapic_timer);
 
+static int __init parse_lapic_timer_c2_ok(char *arg)
+{
+	local_apic_timer_c2_ok = 1;
+	return 0;
+}
+early_param("lapic_timer_c2_ok", parse_lapic_timer_c2_ok);
+
 static int __init apic_set_verbosity(char *str)
 {
 	if (strcmp("debug", str) == 0)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 6077300..cdf7894 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -268,6 +268,7 @@ static void acpi_timer_check_state(int state, struct acpi_processor *pr,
 				   struct acpi_processor_cx *cx)
 {
 	struct acpi_processor_power *pwr = &pr->power;
+	u8 type = local_apic_timer_c2_ok ? ACPI_STATE_C3 : ACPI_STATE_C2;
 
 	/*
 	 * Check, if one of the previous states already marked the lapic
@@ -276,7 +277,7 @@ static void acpi_timer_check_state(int state, struct acpi_processor *pr,
 	if (pwr->timer_broadcast_on_state < state)
 		return;
 
-	if (cx->type >= ACPI_STATE_C2)
+	if (cx->type >= type)
 		pr->power.timer_broadcast_on_state = state;
 }
 
diff --git a/include/asm-i386/apic.h b/include/asm-i386/apic.h
index cc6b165..a19810a 100644
--- a/include/asm-i386/apic.h
+++ b/include/asm-i386/apic.h
@@ -117,6 +117,7 @@ extern void enable_NMI_through_LVT0 (void * dummy);
 #define ARCH_APICTIMER_STOPS_ON_C3 1
 
 extern int timer_over_8254;
+extern int local_apic_timer_c2_ok;
 
 #else /* !CONFIG_X86_LOCAL_APIC */
 static inline void lapic_shutdown(void) { }




Re: [1/6] 2.6.21-rc4: known regressions

2007-03-23 Thread Thomas Gleixner
On Fri, 2007-03-23 at 11:28 -0700, Linus Torvalds wrote:
 
 On Fri, 23 Mar 2007, Linus Torvalds wrote:
  
  Thomas, please fix.
 
 Here's a possible fix. It compiles. And I still wish we had common files.

You beat me by 30 seconds.

 ia64 shouldn't be affected, because ia64 doesn't #define the 
 ARCH_APICTIMER_STOPS_ON_C3 flag (and then we don't use the c2_ok thing 
 either. 

Right, ia64 does not see it.

 But this is still pretty damn ugly.

Yes it is.

 Maybe a field in struct acpi_processor for C2/C3 problems?

Hmm, the acpi processor stuff is modular.

tglx




Re: [4/5] 2.6.21-rc4: known regressions (v2)

2007-03-23 Thread Thomas Gleixner
On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
 Subject: gettimeofday increments too slowly
 References : http://bugzilla.kernel.org/show_bug.cgi?id=8027
 Submitter  : David L [EMAIL PROTECTED]
 Caused-By  : Thomas Gleixner [EMAIL PROTECTED]
  commit 92c7e00254b2d0efc1e36ac3e45474ce1871b6b2
 Handled-By : Thomas Gleixner [EMAIL PROTECTED]
 Status : problem is being debugged

Patch available: http://lkml.org/lkml/2007/3/22/301

commit 6b3964cde70cfe6db79d35b42137431ef7d2f7e4

tglx




Re: [4/5] 2.6.21-rc4: known regressions (v2)

2007-03-23 Thread Thomas Gleixner
On Fri, 2007-03-23 at 20:15 +0100, Thomas Gleixner wrote:
 On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
  Subject: gettimeofday increments too slowly
  References : http://bugzilla.kernel.org/show_bug.cgi?id=8027
  Submitter  : David L [EMAIL PROTECTED]
  Caused-By  : Thomas Gleixner [EMAIL PROTECTED]
   commit 92c7e00254b2d0efc1e36ac3e45474ce1871b6b2
  Handled-By : Thomas Gleixner [EMAIL PROTECTED]
  Status : problem is being debugged
 
 Patch available: http://lkml.org/lkml/2007/3/22/301
 
 commit 6b3964cde70cfe6db79d35b42137431ef7d2f7e4

Oops. That fixed only one half of the problem. The timeofday one
persists.

John, any idea ?

tglx






Re: [4/5] 2.6.21-rc4: known regressions (v2)

2007-03-23 Thread Thomas Gleixner
On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
 Subject: dynticks makes ksoftirqd1 use unreasonable amount of cpu time
 References : http://bugzilla.kernel.org/show_bug.cgi?id=8100
 Submitter  : Emil Karlson [EMAIL PROTECTED]
 Handled-By : Thomas Gleixner [EMAIL PROTECTED]
 Status : problem is being debugged

The problem is not reproducible on any of my machines.

Emil, is it still there with Linus' latest ?

tglx




Re: [4/5] 2.6.21-rc4: known regressions (v2)

2007-03-23 Thread Thomas Gleixner
On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
 This email lists some known regressions in Linus' tree compared to 2.6.20.
 
 If you find your name in the Cc header, you are either submitter of one
 of the bugs, maintainer of an affected subsystem or driver, a patch
 of you caused a breakage or I'm considering you in any other way
 possibly involved with one or more of these issues.
 
 Due to the huge amount of recipients, please trim the Cc when answering.
 
 
 Subject: system doesn't come out of suspend  (CONFIG_NO_HZ)
 References : http://lkml.org/lkml/2007/2/22/391
 Submitter  : Michael S. Tsirkin [EMAIL PROTECTED]
  Soeren Sonnenburg [EMAIL PROTECTED]
 Handled-By : Thomas Gleixner [EMAIL PROTECTED]
  Ingo Molnar [EMAIL PROTECTED]
  Tejun Heo [EMAIL PROTECTED]
  Rafael J. Wysocki [EMAIL PROTECTED]
 Status : problem is being debugged
 
 
 Subject: first disk access after resume takes several minutes
  ('date' does not advance after resume from RAM, CONFIG_NO_HZ=n)
 References : http://lkml.org/lkml/2007/3/8/117
 Submitter  : Michael S. Tsirkin [EMAIL PROTECTED]
 Handled-By : Thomas Gleixner [EMAIL PROTECTED]
  Ingo Molnar [EMAIL PROTECTED]
 Status : problem is being debugged

I lost track of Michael's various nested problems.

Michael, can you please give a summary on _all_ entries in the
regressions list against Linus' latest ?

Thanks,

tglx




Re: [4/5] 2.6.21-rc4: known regressions (v2)

2007-03-23 Thread Thomas Gleixner
On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
 Subject: Dynticks and High resolution Timer hanging the system
  workaround: clocksource=acpi_pm
 References : http://lkml.org/lkml/2007/3/7/504
 Submitter  : Stephane Casset [EMAIL PROTECTED]
 Caused-By  : Thomas Gleixner [EMAIL PROTECTED]
 Status : unknown

Stephane, does the problem still exist with Linus' latest ?

Thanks,

tglx




Re: [4/5] 2.6.21-rc4: known regressions (v2)

2007-03-23 Thread Thomas Gleixner
On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
 Subject: soft lockup detected on CPU#0
 References : http://lkml.org/lkml/2007/3/3/152
 Submitter  : Michal Piotrowski [EMAIL PROTECTED]
 Handled-By : Thomas Gleixner [EMAIL PROTECTED]
  Ingo Molnar [EMAIL PROTECTED]
 Status : unknown

Michal,

any news on that one ? 

You said the same problem exists in 2.6.20.1. Has this been resolved in
2.6.20.2/3 ?

Thanks,

tglx




Re: [2/5] 2.6.21-rc4: known regressions (v2)

2007-03-23 Thread Thomas Gleixner
On Fri, 2007-03-23 at 19:48 +0100, Adrian Bunk wrote:
 Subject: x86_64: ACPI regression with noapic  (APICTIMER_STOPS_ON_C3?)
 References : http://lkml.org/lkml/2007/3/8/468
  http://lkml.org/lkml/2007/3/22/156
 Submitter  : Ray Lee [EMAIL PROTECTED]
 Handled-By : Thomas Gleixner [EMAIL PROTECTED]
 Status : problem is being debugged

Ray,

can you please test the patch below ?

Thanks,

tglx

--
Subject: [PATCH] x86_64: avoid sending LOCAL_TIMER_VECTOR IPI to itself

Ray Lee reported that on a UP kernel with the noapic commandline option
set, the box locks hard during boot.

Adding some debug printks revealed that the last action on the box
before stalling was "Send IPI", a debug printk which was put into
smp_send_timer_broadcast_ipi().

It seems that send_IPI_mask(mask, LOCAL_TIMER_VECTOR) fails when
noapic is set on the commandline on a UP kernel.

Aside from that, it does not make much sense to trigger an interrupt
instead of calling the function directly on the CPU which gets the
PIT/HPET interrupt in case of broadcasting.

Reported-by: Ray Lee [EMAIL PROTECTED]
Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
index 723417d..83328e1 100644
--- a/arch/x86_64/kernel/apic.c
+++ b/arch/x86_64/kernel/apic.c
@@ -930,9 +930,17 @@ EXPORT_SYMBOL(switch_APIC_timer_to_ipi);
 
 void smp_send_timer_broadcast_ipi(void)
 {
+	int cpu = smp_processor_id();
 	cpumask_t mask;
 
 	cpus_and(mask, cpu_online_map, timer_interrupt_broadcast_ipi_mask);
+
+	if (cpu_isset(cpu, mask)) {
+		cpu_clear(cpu, mask);
+		add_pda(apic_timer_irqs, 1);
+		smp_local_timer_interrupt();
+	}
+
 	if (!cpus_empty(mask)) {
 		send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
 	}




Re: [4/5] 2.6.21-rc4: known regressions (v2)

2007-03-23 Thread Thomas Gleixner
On Fri, 2007-03-23 at 18:23 -0400, Chuck Ebbert wrote:
 Thomas Gleixner wrote:
  On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
  Subject: gettimeofday increments too slowly
  References : http://bugzilla.kernel.org/show_bug.cgi?id=8027
  Submitter  : David L [EMAIL PROTECTED]
  Caused-By  : Thomas Gleixner [EMAIL PROTECTED]
   commit 92c7e00254b2d0efc1e36ac3e45474ce1871b6b2
  Handled-By : Thomas Gleixner [EMAIL PROTECTED]
  Status : problem is being debugged
  
  Patch available: http://lkml.org/lkml/2007/3/22/301
  
  commit 6b3964cde70cfe6db79d35b42137431ef7d2f7e4
  
 
 For the other issue raised there, clock running too slow, I now
 realize there is a similar report:
 
 https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=231626

That's a different one, AFAICT. David's problem is probably caused by me
breaking the TSC watchdog.

/me orders paperbags prophylactically and goes back to look at the code

tglx




Re: [4/5] 2.6.21-rc4: known regressions (v2)

2007-03-23 Thread Thomas Gleixner
On Fri, 2007-03-23 at 23:43 +0100, Thomas Gleixner wrote:
 On Fri, 2007-03-23 at 18:23 -0400, Chuck Ebbert wrote:
  Thomas Gleixner wrote:
   On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
   Subject: gettimeofday increments too slowly
   References : http://bugzilla.kernel.org/show_bug.cgi?id=8027
   Submitter  : David L [EMAIL PROTECTED]
   Caused-By  : Thomas Gleixner [EMAIL PROTECTED]
    commit 92c7e00254b2d0efc1e36ac3e45474ce1871b6b2
   Handled-By : Thomas Gleixner [EMAIL PROTECTED]
   Status : problem is being debugged
   
   Patch available: http://lkml.org/lkml/2007/3/22/301
   
   commit 6b3964cde70cfe6db79d35b42137431ef7d2f7e4
   
  
  For the other issue raised there, clock running too slow, I now
  realize there is a similar report:
  
  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=231626
 
 That's a different one, AFAICT. Davids problem is probably caused by me
 breaking the TSC watchdog. 
 
 /me orders paperbags prophylactically and goes back to look at the code

David,

can you please test the patch below ?

tglx

-
Subject: [PATCH] clocksource: Fix thinko in watchdog selection

The watchdog implementation unintentionally excludes low-res /
non-continuous clocksources from being selected as a watchdog reference.

Allow using jiffies/PIT as a watchdog reference as long as no better
clocksource is available. This is necessary to detect TSC breakage on
systems which have no pmtimer/hpet.

The main goal of the initial patch (preventing a switch to highres/nohz
when no reliable fallback clocksource is available) is still guaranteed
by the checks in clocksource_watchdog().

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 5b0e46b..fe5c7db 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -151,7 +151,8 @@ static void clocksource_check_watchdog(struct clocksource *cs)
 			watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
 			add_timer(&watchdog_timer);
 		}
-	} else if (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) {
+	} else {
+		if (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS)
 			cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
 
 		if (!watchdog || cs->rating > watchdog->rating) {
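
The selection rule after this change can be sketched in plain C as follows.
This is only an illustration of the logic described above, not the kernel
code: struct clock_src, the SRC_* flag values and pick_watchdog() are made
up for the example, and the timer handling is left out.

```c
#include <stddef.h>

/* Illustrative flag values, not copied from the kernel headers. */
#define SRC_IS_CONTINUOUS   0x01
#define SRC_MUST_VERIFY     0x02
#define SRC_VALID_FOR_HRES  0x04

struct clock_src {
	const char *name;
	int rating;
	unsigned int flags;
};

/*
 * Return the watchdog reference after considering cs.
 * A source which must be verified itself never becomes the reference.
 * Any other source may become the reference, continuous or not, but
 * only continuous sources are marked as valid for highres mode.
 */
static struct clock_src *pick_watchdog(struct clock_src *watchdog,
				       struct clock_src *cs)
{
	if (cs->flags & SRC_MUST_VERIFY)
		return watchdog;

	if (cs->flags & SRC_IS_CONTINUOUS)
		cs->flags |= SRC_VALID_FOR_HRES;

	if (!watchdog || cs->rating > watchdog->rating)
		return cs;

	return watchdog;
}
```

With this rule a non-continuous source such as the PIT can act as the
reference when nothing better is registered, and a better rated source
replaces it later.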




Re: [2/5] 2.6.21-rc4: known regressions (v2)

2007-03-23 Thread Thomas Gleixner
Ray,

On Fri, 2007-03-23 at 17:14 -0700, Ray Lee wrote:
 (I wondered about the IPI on a UP system, seemed a bit weird :-).)
 
 Works great, booting both with NOAPIC and without. *Much* thanks for
 debugging this while you're also handling a bunch of other issues at
 the same time.

Thank you for the debugging and the excellent problem descriptions!

 Patch reproduced below, with an acked-by (and, uhm, a couple of spelling
 fixes in the description -- don't hate me, 'kay?).

I know that my English sucks.

Thanks,

tglx




Re: [4/5] 2.6.21-rc4: known regressions (v2)

2007-03-24 Thread Thomas Gleixner
Emil,

On Fri, 2007-03-23 at 20:22 +0100, Thomas Gleixner wrote:
 On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
  Subject: dynticks makes ksoftirqd1 use unreasonable amount of cpu time
  References : http://bugzilla.kernel.org/show_bug.cgi?id=8100
  Submitter  : Emil Karlson [EMAIL PROTECTED]
  Handled-By : Thomas Gleixner [EMAIL PROTECTED]
  Status : problem is being debugged
 
 The problem is not reproducible on any of my machines.
 

I've uploaded a patch against 2.6.21-rc4 to

http://tglx.de/private/tglx/2.6.21-rc4-trace/2.6.21-rc4-trace.patch.bz2

It contains all changes in Linus' tree since -rc4 plus the two pending
fixes (http://tglx.de/private/tglx/2.6.21-rc4-pending/) along with a
backport of the latency tracer from the realtime preemption patch.

Can you please apply the patch on top of -rc4 and build it with the
configuration which exposes this strange behaviour. Please also enable
CONFIG_LATENCY_TRACE in the Kernel hacking menu.

When the problem is visible, run trace-it
(http://tglx.de/private/tglx/2.6.21-rc4-trace/trace-it.c) as root:

# trace-it > trace.txt

This captures roughly one second of kernel code paths. Please stick
trace.txt into Bugzilla.

Thanks,

tglx




Re: [4/5] 2.6.21-rc4: known regressions (v2)

2007-03-24 Thread Thomas Gleixner
On Sat, 2007-03-24 at 14:59 +0100, Michal Piotrowski wrote:
 On 23/03/07, Thomas Gleixner [EMAIL PROTECTED] wrote:
  On Fri, 2007-03-23 at 19:50 +0100, Adrian Bunk wrote:
   Subject: soft lockup detected on CPU#0
   References : http://lkml.org/lkml/2007/3/3/152
   Submitter  : Michal Piotrowski [EMAIL PROTECTED]
   Handled-By : Thomas Gleixner [EMAIL PROTECTED]
                Ingo Molnar [EMAIL PROTECTED]
   Status : unknown
 
  Michal,
 
  any news on that one ?
 
  You said the same problem exists in 2.6.20.1. Has this been resolved in
  2.6.20.2/3
 
 Yes, I tried 2.6.20.4 and it works fine.

Is it solved in Linus' latest too?

tglx




[PATCH] i386: Prevent early access to TSC to avoid crash on TSCless systems

2007-03-24 Thread Thomas Gleixner
commit f9690982b8c2f9a2c65acdc113e758ec356676a3 removed the check for
cpu_khz from sched_clock(), which prevented early access to the TSC by
non obvious magic.

This is harmless as long as the CPU has a TSC. On TSCless systems this
results in an illegal instruction trap.

Replace the tsc_disable and tsc_unstable checks by tsc_enabled, which is
only set when the TSC is available and not unstable.

Signed-off-by: Thomas Gleixner [EMAIL PROTECTED]

diff --git a/arch/i386/kernel/tsc.c b/arch/i386/kernel/tsc.c
index 0e65f7a..6cb8f53 100644
--- a/arch/i386/kernel/tsc.c
+++ b/arch/i386/kernel/tsc.c
@@ -18,6 +18,8 @@
 
 #include "mach_timer.h"
 
+static int tsc_enabled;
+
 /*
  * On some systems the TSC frequency does not
  * change with the cpu frequency. So we need
@@ -105,7 +107,7 @@ unsigned long long sched_clock(void)
 	/*
 	 * Fall back to jiffies if there's no TSC available:
 	 */
-	if (tsc_unstable || unlikely(tsc_disable))
+	if (unlikely(!tsc_enabled))
 		/* No locking but a rare wrong value is not a big deal: */
 		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
 
@@ -283,6 +285,7 @@ void mark_tsc_unstable(void)
 {
 	if (!tsc_unstable) {
 		tsc_unstable = 1;
+		tsc_enabled = 0;
 		/* Can be called before registration */
 		if (clocksource_tsc.mult)
 			clocksource_change_rating(&clocksource_tsc, 0);
@@ -383,7 +386,9 @@ void __init tsc_init(void)
 	if (check_tsc_unstable()) {
 		clocksource_tsc.rating = 0;
 		clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
-	}
+	} else
+		tsc_enabled = 1;
+
 	clocksource_register(&clocksource_tsc);
 
 	return;
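
The effect of the single tsc_enabled flag can be illustrated with a small
userspace sketch. It is not the kernel implementation: struct clock_sketch,
SKETCH_HZ and sched_clock_sketch() are hypothetical names, the tick rate of
250 is an assumption, and the TSC value is passed in as plain data instead
of being read with rdtsc.

```c
#include <stdint.h>

#define SKETCH_HZ 250ULL /* assumed tick rate, not taken from the mail */

struct clock_sketch {
	int tsc_enabled;     /* set only when a TSC exists and is stable */
	uint64_t jiffies;    /* ticks since boot */
	uint64_t tsc_cycles; /* value a real kernel would read from the TSC */
	uint64_t tsc_khz;    /* TSC frequency in kHz, nonzero when enabled */
};

/*
 * Return nanoseconds since boot. The TSC fields are never touched
 * unless tsc_enabled is set, which is the point of the patch: a
 * TSC-less CPU must not execute rdtsc.
 */
static uint64_t sched_clock_sketch(const struct clock_sketch *c)
{
	if (!c->tsc_enabled)
		return c->jiffies * (1000000000ULL / SKETCH_HZ);

	/* cycles * 10^6 / kHz = nanoseconds (may overflow for huge counts) */
	return c->tsc_cycles * 1000000ULL / c->tsc_khz;
}
```

The flag starts out as 0, so any call before the TSC setup has finished
takes the jiffies path by construction.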






Re: [4/5] 2.6.21-rc4: known regressions (v2)

2007-03-25 Thread Thomas Gleixner
On Sun, 2007-03-25 at 09:11 +0200, Michael S. Tsirkin wrote:
  I lost track of Michaels various nested problems.
  
  Michael can you please give a summary on _all_ entries in the
  regressions list against Linus latest ?
 
 I tested 2 different configurations on my T60:
 - With CONFIG_NO_HZ enabled.
   I tested this on -rc1, and have not retested with CONFIG_NO_HZ since.
   Observed behaviour: the system would not come out of suspend to RAM.
   After I press Fn/F4 the crescent LED starts blinking so it seems Linux started
   doing something.
   This is a problem but not a regression as such, since CONFIG_NO_HZ is new
   in 2.6.21.

It needs to be fixed before 2.6.21 final nevertheless.

 - Without CONFIG_NO_HZ
   I last tested this with cd05a1f818073a623455a58e756c5b419fc98db9.
   After systems comes out of suspend to ram, I observed the following
   behaviour (I used s2ram from console):
   1. The first disk access takes much longer than with 2.6.20
   2. System clock does not advance (date always reports the same time)
   3. After an attempt to switch to X, X starts drawing some windows and then hangs
 
   All 3 issues are new and did not occur under 2.6.20, so this is a regression.
   Attached is a full dmesg from boot to resume.

There is not much interesting to see in the log.

Can you please test the following:

Add clocksource=acpi_pm to the kernel commandline.

If this does not change anything, then disable CONFIG_HPET and retry.


One thing in the log is indeed scary:

[2.959150] Calibrating delay using timer specific routine.. 20089.12 BogoMIPS (lpj=100445639)

This is after the reboot, but it is not related to your problem. This is
a different problem, which needs urgent attention.
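
For reference, the two numbers in that log line are tied together by simple
integer arithmetic. The sketch below reproduces the printed value from lpj
under the assumption that the kernel was built with HZ=100; that tick rate
is inferred from the numbers and is not stated in the log. The helper name
format_bogomips() is made up for this example.

```c
#include <stdio.h>

/*
 * Format loops_per_jiffy as "<integer>.<two digits>" BogoMIPS.
 * hz is the tick rate and must be between 1 and 5000 so that the
 * integer divisors below stay nonzero.
 * Returns the snprintf() result, or -1 for an unsupported hz.
 */
static int format_bogomips(char *buf, size_t len,
			   unsigned long lpj, unsigned long hz)
{
	if (hz == 0 || hz > 5000)
		return -1;

	return snprintf(buf, len, "%lu.%02lu",
			lpj / (500000 / hz),
			(lpj / (5000 / hz)) % 100);
}
```

Feeding lpj=100445639 and hz=100 into this gives "20089.12", which matches
the log. That corresponds to roughly ten billion delay loops per second,
presumably the reason the value looks wrong.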

Adrian, can you open a separate entry for this please? It is not a new
thing, it can be observed with older kernels as well, but it needs to
be addressed. It probably needs a solution similar to the one I did for
the local APIC timer calibration.

tglx




[BUG] __copy_to_user_inatomic broken on non Pentium machines

2007-03-25 Thread Thomas Gleixner
Environment: Pre-Pentium systems (boot_cpu_data.wp_works_ok == 0)
Last known working kernel: 2.6.18 (did not try 2.6.19 yet)

Enabling CONFIG_PREEMPT on latest mainline as well as 2.6.20 triggers

[   14.15] BUG: sleeping function called from invalid context at /home/tglx/work/kernel/vanilla/linux-2.6.20/kernel/rwsem.c:20
[   14.16] in_atomic():1, irqs_disabled():0
[   14.16] no locks held by init/1.
[   14.17]  [<c0103346>] show_trace_log_lvl+0x1a/0x2f
[   14.18]  [<c0103441>] show_trace+0x12/0x14
[   14.19]  [<c0103cf5>] dump_stack+0x16/0x18
[   14.19]  [<c010aa62>] __might_sleep+0xc7/0xcd
[   14.20]  [<c01213a1>] down_read+0x18/0x47
[   14.21]  [<c01a01e4>] __copy_to_user_ll+0x5e/0x1b6
[   14.22]  [<c012cf85>] file_read_actor+0x10b/0x149
[   14.23]  [<c012d7b2>] do_generic_mapping_read+0x187/0x433
[   14.24]  [<c012f64b>] generic_file_aio_read+0x191/0x1ca
[   14.24]  [<c0141657>] do_sync_read+0xc2/0xff
[   14.25]  [<c0141eb6>] vfs_read+0x90/0x145
[   14.26]  [<c014227e>] sys_read+0x3f/0x63
[   14.27]  [<c0102fb0>] syscall_call+0x7/0xb
[   14.27]  ===

and 

[   22.66] BUG: scheduling while atomic: e2fsck/0x1001/272
[   22.67] 1 lock held by e2fsck/272:
[   22.68]  #0:  (&mm->mmap_sem){}, at: [<c01a01e4>] __copy_to_user_ll+0x5e/0x1b6
[   22.69]  [<c0103346>] show_trace_log_lvl+0x1a/0x2f
[   22.70]  [<c0103441>] show_trace+0x12/0x14
[   22.71]  [<c0103cf5>] dump_stack+0x16/0x18
[   22.72]  [<c024a189>] __sched_text_start+0x71/0x57f
[   22.72]  [<c010b49f>] __cond_resched+0x21/0x3b
[   22.73]  [<c024aca7>] cond_resched+0x26/0x31
[   22.74]  [<c0137ae5>] get_user_pages+0x1e1/0x23c
[   22.75]  [<c01a021e>] __copy_to_user_ll+0x98/0x1b6
[   22.76]  [<c012cf85>] file_read_actor+0x10b/0x149
[   22.77]  [<c012d7b2>] do_generic_mapping_read+0x187/0x433
[   22.78]  [<c012f64b>] generic_file_aio_read+0x191/0x1ca
[   22.79]  [<c0141657>] do_sync_read+0xc2/0xff
[   22.79]  [<c0141eb6>] vfs_read+0x90/0x145
[   22.80]  [<c014227e>] sys_read+0x3f/0x63
[   22.81]  [<c0102fb0>] syscall_call+0x7/0xb
[   22.82]  ===

which is not surprising. 

int file_read_actor(read_descriptor_t *desc, struct page *page,
			unsigned long offset, unsigned long size)
{

	/*
	 * Faults on the destination of a read are common, so do it before
	 * taking the kmap.
	 */
	if (!fault_in_pages_writeable(desc->arg.buf, size)) {
		kaddr = kmap_atomic(page, KM_USER0);
		left = __copy_to_user_inatomic(desc->arg.buf,
						kaddr + offset, size);

is called with preempt_count == 1, due to the kmap_atomic() above.

Now __copy_to_user_ll() takes the (boot_cpu_data.wp_works_ok == 0) path,
which in turn calls 

down_read(&current->mm->mmap_sem) - which might sleep

and

get_user_pages() - which has a cond_resched() inside.
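
The call chain can be modelled with a toy userspace sketch which counts how
often a "might sleep" point is reached while the simulated preempt count is
nonzero. All names ending in _sim are hypothetical and only mimic the shape
of the kernel functions named above; nothing here is kernel code.

```c
/* Simulated preempt count and number of detected violations. */
static int preempt_count_sim;
static int violations_sim;

/* kmap_atomic()/kunmap_atomic() raise and drop the preempt count. */
static void kmap_atomic_sim(void)   { preempt_count_sim++; }
static void kunmap_atomic_sim(void) { preempt_count_sim--; }

/* Stand-in for a might_sleep() style check at a possible sleep point. */
static void might_sleep_sim(void)
{
	if (preempt_count_sim > 0)
		violations_sim++;
}

/* Slow copy path taken when wp_works_ok == 0. */
static void copy_to_user_slow_sim(void)
{
	might_sleep_sim(); /* models down_read() on mmap_sem */
	might_sleep_sim(); /* models cond_resched() in get_user_pages() */
}

/* Returns the number of sleep points hit in atomic context. */
static int file_read_actor_sim(int wp_works_ok)
{
	violations_sim = 0;
	kmap_atomic_sim();
	if (!wp_works_ok)
		copy_to_user_slow_sim();
	kunmap_atomic_sim();
	return violations_sim;
}
```

With wp_works_ok set the sketch reports no violation; with wp_works_ok
cleared it reports two, one per BUG message shown above.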

Not sure how to fix that.

tglx

