Re: [PATCH] Repair misuse of sv_lock in 5.10.16-rt30.

2021-02-26 Thread Joe Korty
On Fri, Feb 26, 2021 at 03:15:46PM +, Chuck Lever wrote:
> 
> 
> > On Feb 26, 2021, at 10:00 AM, J. Bruce Fields  wrote:
> > 
> > Adding Chuck, linux-nfs.
> > 
> > Makes sense to me. --b.
> 
> Joe, I can add this to nfsd-5.12-rc. Would it be appropriate to add:
> 
> Fixes: 719f8bcc883e ("svcrpc: fix xpt_list traversal locking on shutdown")

Sure.
And thanks, everybody, for the quick response.
Joe


[PATCH] Repair misuse of sv_lock in 5.10.16-rt30.

2021-02-26 Thread Joe Korty
Repair misuse of sv_lock in 5.10.16-rt30.

[ This problem is in mainline, but only rt has the chops to be
able to detect it. ]

Lockdep reports a circular lock dependency between serv->sv_lock and
softirq_ctrl.lock on system shutdown, when using a kernel built with
CONFIG_PREEMPT_RT=y and an NFS mount exists.

This is due to the definition of spin_lock_bh on rt:

local_bh_disable();
rt_spin_lock(lock);

which forces a softirq_ctrl.lock -> serv->sv_lock dependency.  This is
not a problem as long as _every_ acquisition of serv->sv_lock is a:

spin_lock_bh(&serv->sv_lock);

but there is one of the form:

spin_lock(&serv->sv_lock);

This is what is causing the circular dependency splat.  The spin_lock()
grabs the lock without first grabbing softirq_ctrl.lock via local_bh_disable.
If later on in the critical region someone does a local_bh_disable, we
get a serv->sv_lock -> softirq_ctrl.lock dependency established.  Deadlock.

The fix is to take serv->sv_lock with spin_lock_bh everywhere, no
exceptions.
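
For illustration, here is the shape of the fix at the call site the
splat below points at.  This is a sketch only, paraphrasing 5.10's
svc_close_list(); it is not the actual patch text:

	/* Take sv_lock with BHs disabled, matching every other locker,
	 * so only the softirq_ctrl.lock -> sv_lock order is ever seen.
	 */
	static void svc_close_list(struct svc_serv *serv,
				   struct list_head *xprt_list, struct net *net)
	{
		struct svc_xprt *xprt;

		spin_lock_bh(&serv->sv_lock);		/* was: spin_lock() */
		list_for_each_entry(xprt, xprt_list, xpt_list) {
			if (xprt->xpt_net != net)
				continue;
			set_bit(XPT_CLOSE, &xprt->xpt_flags);
			svc_xprt_enqueue(xprt);
		}
		spin_unlock_bh(&serv->sv_lock);		/* was: spin_unlock() */
	}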

Signed-off-by: Joe Korty 




[  OK  ] Stopped target NFS client services.
 Stopping Logout off all iSCSI sessions on shutdown...
 Stopping NFS server and services...
[  109.442380] 
[  109.442385] ==
[  109.442386] WARNING: possible circular locking dependency detected
[  109.442387] 5.10.16-rt30 #1 Not tainted
[  109.442389] --
[  109.442390] nfsd/1032 is trying to acquire lock:
[  109.442392] 994237617f60 ((softirq_ctrl.lock).lock){+.+.}-{2:2}, at: 
__local_bh_disable_ip+0xd9/0x270
[  109.442405] 
[  109.442405] but task is already holding lock:
[  109.442406] 994245cb00b0 (&serv->sv_lock){+.+.}-{0:0}, at: 
svc_close_list+0x1f/0x90
[  109.442415] 
[  109.442415] which lock already depends on the new lock.
[  109.442415] 
[  109.442416] 
[  109.442416] the existing dependency chain (in reverse order) is:
[  109.442417] 
[  109.442417] -> #1 (&serv->sv_lock){+.+.}-{0:0}:
[  109.442421]rt_spin_lock+0x2b/0xc0
[  109.442428]svc_add_new_perm_xprt+0x42/0xa0
[  109.442430]svc_addsock+0x135/0x220
[  109.442434]write_ports+0x4b3/0x620
[  109.442438]nfsctl_transaction_write+0x45/0x80
[  109.442440]vfs_write+0xff/0x420
[  109.442444]ksys_write+0x4f/0xc0
[  109.442446]do_syscall_64+0x33/0x40
[  109.442450]entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  109.442454] 
[  109.442454] -> #0 ((softirq_ctrl.lock).lock){+.+.}-{2:2}:
[  109.442457]__lock_acquire+0x1264/0x20b0
[  109.442463]lock_acquire+0xc2/0x400
[  109.442466]rt_spin_lock+0x2b/0xc0
[  109.442469]__local_bh_disable_ip+0xd9/0x270
[  109.442471]svc_xprt_do_enqueue+0xc0/0x4d0
[  109.442474]svc_close_list+0x60/0x90
[  109.442476]svc_close_net+0x49/0x1a0
[  109.442478]svc_shutdown_net+0x12/0x40
[  109.442480]nfsd_destroy+0xc5/0x180
[  109.442482]nfsd+0x1bc/0x270
[  109.442483]kthread+0x194/0x1b0
[  109.442487]ret_from_fork+0x22/0x30
[  109.442492] 
[  109.442492] other info that might help us debug this:
[  109.442492] 
[  109.442493]  Possible unsafe locking scenario:
[  109.442493] 
[  109.442493]CPU0CPU1
[  109.442494]
[  109.442495]   lock(&serv->sv_lock);
[  109.442496]lock((softirq_ctrl.lock).lock);
[  109.442498]lock(&serv->sv_lock);
[  109.442499]   lock((softirq_ctrl.lock).lock);
[  109.442501] 
[  109.442501]  *** DEADLOCK ***
[  109.442501] 
[  109.442501] 3 locks held by nfsd/1032:
[  109.442503]  #0: 93b49258 (nfsd_mutex){+.+.}-{3:3}, at: 
nfsd+0x19a/0x270
[  109.442508]  #1: 994245cb00b0 (&serv->sv_lock){+.+.}-{0:0}, at: 
svc_close_list+0x1f/0x90
[  109.442512]  #2: 93a81b20 (rcu_read_lock){}-{1:2}, at: 
rt_spin_lock+0x5/0xc0
[  109.442518] 
[  109.442518] stack backtrace:
[  109.442519] CPU: 0 PID: 1032 Comm: nfsd Not tainted 5.10.16-rt30 #1
[  109.442522] Hardware name: Supermicro X9DRL-3F/iF/X9DRL-3F/iF, BIOS 3.2 
09/22/2015
[  109.442524] Call Trace:
[  109.442527]  dump_stack+0x77/0x97
[  109.442533]  check_noncircular+0xdc/0xf0
[  109.442546]  __lock_acquire+0x1264/0x20b0
[  109.442553]  lock_acquire+0xc2/0x400
[  109.442564]  rt_spin_lock+0x2b/0xc0
[  109.442570]  __local_bh_disable_ip+0xd9/0x270
[  109.442573]  svc_xprt_do_enqueue+0xc0/0x4d0
[  109.442577]  svc_close_list+0x60/0x90
[  109.442581]  svc_close_net+0x49/0x1a0
[  109.442585]  svc_shutdown_net+0x12/0x40
[  109.442588]  nfsd_destroy+0xc5/0x180
[  109.442590]  nfsd+0x1bc/0x270
[  109.442595]  kthread+0x194/0x1b0
[  109.442600]  ret_from_fork+0x22/0x30
[  109.518225] nfsd: last server has exited, flushing export cache
[  OK  ] Stopped NFSv4 ID-name mapping service.
[  OK  ] Stopped GSSAPI Proxy Daemon.
[  OK  ] Stopp

Re: [ANNOUNCE] v4.4.231-rt202

2020-08-08 Thread Joe Korty
Ping?

On Mon, Jul 27, 2020 at 03:10:33PM -0400, Steven Rostedt wrote:
> On Sun, 26 Jul 2020 13:55:12 +0200
> Daniel Wagner  wrote:
> 
> > Hi,
> > 
> > On 24.07.20 15:41, Daniel Wagner wrote:
> > > Known issues:
> > > 
> > > sigwaittest with hackbench as workload is able to trigger a crash on 
> > > x86_64,
> > > the same as reported for the v4.4.220-rt196 release. As it turns
> > > out it was not triggered by BPF.
> > > https://paste.opensuse.org/view/raw/58939248  
> > 
> > Joe pointed out [1] that v4.4-rt is missing 9567db2ebe56 ("signal: 
> > Prevent double-free of user struct") from devel-rt. With this
> > patch all my tests pass.
> > 
> > @stable-rt team: Can you please add it to the missing trees?
> 
> Good catch,
> 
> I'll pull this in on Friday.
> 
> -- Steve
> 
> > 
> > Thanks,
> > Daniel
> > 
> > [1] 
> > https://lore.kernel.org/linux-rt-users/20200626130544.ga37...@zipoli.concurrent-rt.com/


Re: [BUG 4.4.178] x86_64 compat mode futexes broken

2019-06-06 Thread Joe Korty
On Thu, Jun 06, 2019 at 04:11:30PM -0700, Nathan Chancellor wrote:
> On Thu, Jun 06, 2019 at 09:11:43PM +0000, Joe Korty wrote:
> > Starting with 4.4.178, the LTP test
> > 
> >   pthread_cond_wait/2-3
> > 
> > when compiled on x86_64 with 'gcc -m32', started failing.  It generates 
> > this log output:
> > 
> >   [16:18:38]Implementation supports the MONOTONIC CLOCK but option is 
> > disabled in test.   
> >   [16:18:38]Test starting
> >   [16:18:38] Process-shared primitive will be tested
> >   [16:18:38] Alternative clock for cond will be tested
> >   [16:18:38]Test 2-3.c FAILED: The child did not own the mutex inside the 
> > cleanup handler
> > 
> 
> What is the exact build command + test case command? I'd like to
> reproduce this myself.
> 
> > A git bisection between 4.4.177..178 shows that this commit is the culprit:
> > 
> >   Git-Commit: 79739ad2d0ac5787a15a1acf7caaf34cd95bbf3c
> >   Author: Alistair Strachan 
> >   Subject: [PATCH] x86: vdso: Use $LD instead of $CC to link
> > 
> 
> Have you tested 4.4.180? There were two subsequent fixes to this patch
> in 4.4:

Hi Nathan,
I started with 4.4.179-rt181 and worked backwards from there.  Per your
suggestion, I tried 4.4.180 and it does work properly.

Thanks,
Joe




> 485d15db01ca ("kbuild: simplify ld-option implementation")
> 07d35512e494 ("x86/vdso: Pass --eh-frame-hdr to the linker")
> 
> > And, indeed, when I back this patch out of 4.4.178 proper, the above test
> > passes again.
> > 
> > Please consider backing this patch out of linux-4.4.y, and from master, and 
> > from
> > any other linux branch it has been backported to.
> > 
> 
> So this is broken in mainline too?
> 
> > PS: In backing it out of 4.4.178, I first backed out
> > 
> >7c45b45fd6e928c9ce275c32f6fa98d317e6f5ee
> >
> > This is a follow-on vdso patch which collides with the
> > patch we are interested in removing.  As it claims to be
> > only removing redundant code, it probably should never
> > have been backported in the first place.
> 
> While it is redundant for ld.bfd, it causes a build failure with the
> release version of ld.lld:
> 
> https://github.com/ClangBuiltLinux/linux/issues/31
> 
> Cheers,
> Nathan


[BUG 4.4.178] x86_64 compat mode futexes broken

2019-06-06 Thread Joe Korty
Starting with 4.4.178, the LTP test

  pthread_cond_wait/2-3

when compiled on x86_64 with 'gcc -m32', started failing.  It generates this 
log output:

  [16:18:38]Implementation supports the MONOTONIC CLOCK but option is disabled 
in test.   
  [16:18:38]Test starting
  [16:18:38] Process-shared primitive will be tested
  [16:18:38] Alternative clock for cond will be tested
  [16:18:38]Test 2-3.c FAILED: The child did not own the mutex inside the 
cleanup handler

A git bisection between 4.4.177..178 shows that this commit is the culprit:

  Git-Commit: 79739ad2d0ac5787a15a1acf7caaf34cd95bbf3c
  Author: Alistair Strachan 
  Subject: [PATCH] x86: vdso: Use $LD instead of $CC to link

And, indeed, when I back this patch out of 4.4.178 proper, the above test
passes again.

Please consider backing this patch out of linux-4.4.y, and from master, and from
any other linux branch it has been backported to.

PS: In backing it out of 4.4.178, I first backed out

   7c45b45fd6e928c9ce275c32f6fa98d317e6f5ee
   
This is a follow-on vdso patch which collides with the
patch we are interested in removing.  As it claims to be
only removing redundant code, it probably should never
have been backported in the first place.

Signed-off-by: Joe Korty 



Re: [ptrace, rt] erratic behaviour in PTRACE_SINGLESTEP on 4.13-rt and later.

2018-11-27 Thread Joe Korty
On Tue, Nov 27, 2018 at 08:58:19AM -0600, Clark Williams wrote:
> Joe,
> 
> This looks interesting. Do you have a git repo where I can pull the
> source?
> 
> Clark


Hi Clark,
No I don't, sorry.  I am attaching the LAG version; it is a few
dozen lines shorter than the version I first sent out to the
mailing list.

Joe

PS: Oh, I forgot to do....

Signed-off-by: Joe Korty 




> On Tue, 20 Nov 2018 12:29:00 -0500
> Steven Rostedt  wrote:
> 
> > [ Adding Clark and John who manage the rt-tests repo ]
> > 
> > -- Steve
> > 
> > On Mon, 19 Nov 2018 19:46:23 +
> > Joe Korty  wrote:
> > 
> > > Hi Julia & the RT team,
> > > 
> > > The following program might make a good addition to the rt
> > > test suite.  It tests the reliability of PTRACE_SINGLESTEP.
> > > It does by default 10,000 ssteps against a simple,
> > > spinner tracee.  Also by default, it spins off ten of these
> > > tracer/tracee pairs, all of which are to run concurrently.
> > > 
> > > Starting with 4.13-rt, this test occasionally encounters a
> > > sstep whose waitpid returns a WIFSIGNALED (signal SIGTRAP)
> > > rather than a WIFSTOPPED.  This usually happens after
> > > thousands of ssteps have executed.  Having multiple
> > > tracer/tracee pairs running dramatically increases the
> > > chances of failure.
> > > 
> > > This is what the test output looks like for a good run:
> > > 
> > >   #forks: 10
> > >   #steps: 10000
> > >   
> > >   forktest#0/22872: STARTING
> > >   forktest#7/22879: STARTING
> > >   forktest#8/22880: STARTING
> > >   forktest#6/22878: STARTING
> > >   forktest#5/22877: STARTING
> > >   forktest#3/22875: STARTING
> > >   forktest#4/22876: STARTING
> > >   forktest#9/22882: STARTING
> > >   forktest#2/22874: STARTING
> > >   forktest#1/22873: STARTING
> > >   forktest#0/22872: EXITING, no error
> > >   forktest#8/22880: EXITING, no error
> > >   forktest#3/22875: EXITING, no error
> > >   forktest#7/22879: EXITING, no error
> > >   forktest#6/22878: EXITING, no error
> > >   forktest#5/22877: EXITING, no error
> > >   forktest#2/22874: EXITING, no error
> > >   forktest#4/22876: EXITING, no error
> > >   forktest#9/22882: EXITING, no error
> > >   forktest#1/22873: EXITING, no error
> > >   All tests PASSED.
> > > 
> > > This is what the test output looks like for a failing run:
> > > 
> > >   #forks: 10
> > >   #steps: 10000
> > >   
> > >   forktest#0/22906: STARTING
> > >   forktest#1/22907: STARTING
> > >   forktest#2/22909: STARTING
> > >   forktest#3/22911: STARTING
> > >   forktest#4/22912: STARTING
> > >   forktest#5/22915: STARTING
> > >   forktest#6/22916: STARTING
> > >   forktest#7/22919: STARTING
> > >   forktest#8/22920: STARTING
> > >   forktest#9/22923: STARTING
> > >   forktest#2/22909: EXITING, ERROR: wait on PTRACE_SINGLESTEP #9: wanted 
> > > STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
> > >   forktest#5/22915: EXITING, no error
> > >   forktest#3/22911: EXITING, ERROR: wait on PTRACE_SINGLESTEP #7953: 
> > > wanted STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
> > >   forktest#0/22906: EXITING, ERROR: wait on PTRACE_SINGLESTEP #5072: 
> > > wanted STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
> > >   forktest#9/22923: EXITING, ERROR: wait on PTRACE_SINGLESTEP #7992: 
> > > wanted STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
> > >   forktest#4/22912: EXITING, ERROR: wait on PTRACE_SINGLESTEP #9923: 
> > > wanted STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
> > >   forktest#1/22907: EXITING, ERROR: wait on PTRACE_SINGLESTEP #7723: 
> > > wanted STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
> > >   forktest#7/22919: EXITING, ERROR: wait on PTRACE_SINGLESTEP #5036: 
> > > wanted STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
> > >   forktest#8/22920: EXITING, ERROR: wait on PTRACE_SINGLESTEP #4943: 
> > > wanted STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
> > >   forktest#6/22916: EXITING, no error
> > >   One or more tests FAILED.
> > > 
> > > Here are the observations from my testing so far:
> > > 
> > >   - It has never failed when confined to a single cpu.
> > >  

[ptrace, rt] erratic behaviour in PTRACE_SINGLESTEP on 4.13-rt and later.

2018-11-19 Thread Joe Korty
Hi Julia & the RT team,

The following program might make a good addition to the rt
test suite.  It tests the reliability of PTRACE_SINGLESTEP.
It does by default 10,000 ssteps against a simple,
spinner tracee.  Also by default, it spins off ten of these
tracer/tracee pairs, all of which are to run concurrently.

Starting with 4.13-rt, this test occasionally encounters a
sstep whose waitpid returns a WIFSIGNALED (signal SIGTRAP)
rather than a WIFSTOPPED.  This usually happens after
thousands of ssteps have executed.  Having multiple
tracer/tracee pairs running dramatically increases the
chances of failure.

This is what the test output looks like for a good run:

  #forks: 10
  #steps: 10000
  
  forktest#0/22872: STARTING
  forktest#7/22879: STARTING
  forktest#8/22880: STARTING
  forktest#6/22878: STARTING
  forktest#5/22877: STARTING
  forktest#3/22875: STARTING
  forktest#4/22876: STARTING
  forktest#9/22882: STARTING
  forktest#2/22874: STARTING
  forktest#1/22873: STARTING
  forktest#0/22872: EXITING, no error
  forktest#8/22880: EXITING, no error
  forktest#3/22875: EXITING, no error
  forktest#7/22879: EXITING, no error
  forktest#6/22878: EXITING, no error
  forktest#5/22877: EXITING, no error
  forktest#2/22874: EXITING, no error
  forktest#4/22876: EXITING, no error
  forktest#9/22882: EXITING, no error
  forktest#1/22873: EXITING, no error
  All tests PASSED.

This is what the test output looks like for a failing run:

  #forks: 10
  #steps: 10000
  
  forktest#0/22906: STARTING
  forktest#1/22907: STARTING
  forktest#2/22909: STARTING
  forktest#3/22911: STARTING
  forktest#4/22912: STARTING
  forktest#5/22915: STARTING
  forktest#6/22916: STARTING
  forktest#7/22919: STARTING
  forktest#8/22920: STARTING
  forktest#9/22923: STARTING
  forktest#2/22909: EXITING, ERROR: wait on PTRACE_SINGLESTEP #9: wanted 
STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
  forktest#5/22915: EXITING, no error
  forktest#3/22911: EXITING, ERROR: wait on PTRACE_SINGLESTEP #7953: wanted 
STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
  forktest#0/22906: EXITING, ERROR: wait on PTRACE_SINGLESTEP #5072: wanted 
STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
  forktest#9/22923: EXITING, ERROR: wait on PTRACE_SINGLESTEP #7992: wanted 
STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
  forktest#4/22912: EXITING, ERROR: wait on PTRACE_SINGLESTEP #9923: wanted 
STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
  forktest#1/22907: EXITING, ERROR: wait on PTRACE_SINGLESTEP #7723: wanted 
STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
  forktest#7/22919: EXITING, ERROR: wait on PTRACE_SINGLESTEP #5036: wanted 
STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
  forktest#8/22920: EXITING, ERROR: wait on PTRACE_SINGLESTEP #4943: wanted 
STATE_STOPPED, saw STATE_SIGNALED instead (and saw signo 5 too)
  forktest#6/22916: EXITING, no error
  One or more tests FAILED.

Here are the observations from my testing so far:

  - It has never failed when confined to a single cpu.
  - It has never failed on !rt kernels.
  - It has never failed on any kernel prior to 4.13.
  - More failures than what chance would dictate happen
near the end of a test run (ie, if a test of 10,000 
steps is run, the failure will be at the 9,xxx'th step,
if 100,000 steps are run, the failure will be at
the 9x,xxx'th step).

The above results are from kernels 4.{4,9,11,13,14,19}-rt
and some !rt's of these as well.

I have yet to see or hear of this bug, if it is a bug,
giving anyone a problem in a debug session.  It might not
even be a bug but merely expected behaviour. And of course
there is the possibility of a misuse of ptrace/waitpid in
my test program. Its API, after all, is rather convoluted.

Regards,
Joe




/*
 * Have a tracer do a bunch of PTRACE_SINGLESTEPs against
 * a tracee as fast as possible.  Create several of these
 * tracer/tracee pairs and see if they can be made to
 * interfere with each other.
 *
 * Usage:
 *   ssdd nforks niters
 * Where:
 *   nforks - number of tracer/tracee pairs to fork off.
 *default 10.
 *   niters - number of PTRACE_SINGLESTEP iterations to
 *do before declaring success, for each tracer/
 *tracee pair set up.  Default 10,000.
 *
 * The tracer waits on each PTRACE_SINGLESTEP with a waitpid(2)
 * and checks waitpid's return values for correctness.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <signal.h>
#include <sched.h>

#include <sys/types.h>
#include <sys/wait.h>
#include <sys/ptrace.h>

/* do_wait return values */
#define STATE_EXITED		1
#define STATE_STOPPED		2
#define STATE_SIGNALED		3
#define STATE_UNKNOWN		4
#define STATE_ECHILD		5
#define STATE_EXITED_TSIG	6	/* exited with termination signal */
#define STATE_EXITED_ERRSTAT	7	/* exited with non-zero status */

char *state_name[] = {
[STATE_EXITED] = "STATE_EXITED",
[STATE_STOPPED] = 
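
[ The archive truncates the program here.  For reference, a sketch of
  the tracer's core step-and-check loop, reconstructed from the STATE_*
  codes and the error messages above; do_step() is an illustrative
  name, not necessarily the original's: ]

static int do_step(pid_t pid)
{
	int status;

	if (ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL) < 0)
		return STATE_UNKNOWN;
	if (waitpid(pid, &status, 0) < 0)
		return errno == ECHILD ? STATE_ECHILD : STATE_UNKNOWN;
	if (WIFSTOPPED(status))
		return STATE_STOPPED;	/* expected: tracee stopped by SIGTRAP */
	if (WIFSIGNALED(status))
		return STATE_SIGNALED;	/* the erratic result reported here */
	if (WIFEXITED(status))
		return WEXITSTATUS(status) ? STATE_EXITED_ERRSTAT : STATE_EXITED;
	return STATE_UNKNOWN;
}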

Re: [PATCH RT v2] sched/migrate_disable: fallback to preempt_disable() instead barrier()

2018-07-06 Thread joe . korty
On Fri, Jul 06, 2018 at 12:58:57PM +0200, Sebastian Andrzej Siewior wrote:
> On SMP + !RT migrate_disable() is still around. It is not part of spin_lock()
> anymore so it has almost no users. However the futex code has a workaround for
> the !in_atomic() part of migrate disable which fails because the matching
> migrate_disable() is no longer part of spin_lock().
> 
> On !SMP + !RT migrate_disable() is reduced to barrier(). This is not optimal
> because we have a few spots where a "preempt_disable()" statement was
> replaced with "migrate_disable()".
> 
> We also used the migration_disable counter to figure out if a sleeping lock is
> acquired so RCU does not complain about schedule() during rcu_read_lock() 
> while
> a sleeping lock is held. This changed, we no longer use it, we have now a
> sleeping_lock counter for the RCU purpose.
> 
> This means we can now:
> - for SMP + RT_BASE
>   full migration program, nothing changes here
> 
> - for !SMP + RT_BASE
>   the migration counting is no longer required. It used to ensure that the 
> task
>   is not migrated to another CPU and that this CPU remains online. !SMP 
> ensures
>   that already.
>   Move it to CONFIG_SCHED_DEBUG so the counting is done for debugging purpose
>   only.
> 
> - for all other cases including !RT
>   fallback to preempt_disable(). The only remaining users of migrate_disable()
>   are those which were converted from preempt_disable() and the futex
>   workaround which is already in the preempt_disable() section due to the
>   spin_lock that is held.
> 
> Cc: stable...@vger.kernel.org
> Reported-by: joe.ko...@concurrent-rt.com
> Signed-off-by: Sebastian Andrzej Siewior 
> ---
> v1->v2: limit migrate_disable to RT only. Use preempt_disable() for !RT
>if migrate_disable() is used.
> 
>  include/linux/preempt.h |6 +++---
>  include/linux/sched.h   |4 ++--
>  kernel/sched/core.c |   23 +++
>  kernel/sched/debug.c|2 +-
>  4 files changed, 17 insertions(+), 18 deletions(-)


Hi Sebastian,
v2 works for me.

I compiled and booted both smp+preempt+!rt and
smp+preempt+rt kernels, no splats on boot for either.

I ran the futex selftests on both kernels, both passed.

I ran a selection of posix tests from an old version of
the Linux Test Project, both kernels passed all tests.

Regards, and thanks,
Joe


Re: [PATCH RT] sched/migrate_disable: fallback to preempt_disable() instead barrier()

2018-07-05 Thread joe . korty
On Thu, Jul 05, 2018 at 05:50:34PM +0200, Sebastian Andrzej Siewior wrote:
> migrate_disable() does nothing on !SMP && !RT. This is bad for two reasons:
> - The futex code relies on the fact migrate_disable() is part of spin_lock().
>   There is a workaround for the !in_atomic() case in migrate_disable() which
>   work-arounds the different ordering (non-atomic lock and atomic unlock).
> 
> - we have a few instances where preempt_disable() is replaced with
>   migrate_disable().
> 
> For both cases it is bad if migrate_disable() ends up as barrier() instead of
> preempt_disable(). Let migrate_disable() fallback to preempt_disable().
> 
> Cc: stable...@vger.kernel.org
> Reported-by: joe.ko...@concurrent-rt.com
> Signed-off-by: Sebastian Andrzej Siewior 
> ---
>  include/linux/preempt.h | 4 ++--
>  kernel/sched/core.c | 2 ++
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> index 043e431a7e8e..d46688d521e6 100644
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -241,8 +241,8 @@ static inline int __migrate_disabled(struct task_struct 
> *p)
>  }
>  
>  #else
> -#define migrate_disable()	barrier()
> -#define migrate_enable()	barrier()
> +#define migrate_disable()	preempt_disable()
> +#define migrate_enable()	preempt_enable()
>  static inline int __migrate_disabled(struct task_struct *p)
>  {
>   return 0;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ac3fb8495bd5..626a62218518 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7326,6 +7326,7 @@ void migrate_disable(void)
>  #endif
>  
>   p->migrate_disable++;
> + preempt_disable();
>  }
>  EXPORT_SYMBOL(migrate_disable);
>  
> @@ -7349,6 +7350,7 @@ void migrate_enable(void)
>  
>   WARN_ON_ONCE(p->migrate_disable <= 0);
>   p->migrate_disable--;
> + preempt_enable();
>  }
>  EXPORT_SYMBOL(migrate_enable);
>  #endif
> -- 
> 2.18.0



Hi Sebastian,
I just verified that this fix does not work for my mix of
config options (smp && preempt && !rt).

Regards,
Joe



[PATCH RT] sample fix for splat in futex_[un]lock_pi for !rt

2018-07-04 Thread joe . korty
Balance atomic/!atomic migrate_enable calls in futex_[un]lock_pi.

The clever use of migrate_disable/enable in rt patch

  "futex: workaround migrate_disable/enable in different"

has balanced atomic/!atomic context only for the rt kernel.
This workaround makes it balanced for both rt and !rt.

The 'solution' presented here is for reference only.
A better solution might be for !rt to go back to using
migrate_enable/disable == preempt_enable/disable.
This patch passes the futex selftests for rt and !rt.

Sample kernel splat, edited for brevity.  This happens
near the end of boot on a CentOS 7 installation.

   WARNING: CPU: 1 PID: 5966 at kernel/sched/core.c:6994 
migrate_enable+0x24e/0x2f0
   CPU: 1 PID: 5966 Comm: threaded-ml Not tainted 4.14.40-rt31 #1
   Hardware name: Supermicro X9DRL-3F/iF/X9DRL-3F/iF, BIOS 3.2 09/22/2015
   task: 88046b67a6c0 task.stack: c900053a
   RIP: 0010:migrate_enable+0x24e/0x2f0
   RSP: 0018:c900053a3df8 EFLAGS: 00010246

   Call Trace:
futex_unlock_pi+0x134/0x210
do_futex+0x13f/0x190
SyS_futex+0x6e/0x150
do_syscall_64+0x6f/0x190
entry_SYSCALL_64_after_hwframe+0x42/0xb7


   WARNING: CPU: 1 PID: 5966 at kernel/sched/core.c:6998 
migrate_enable+0x75/0x2f0
   CPU: 1 PID: 5966 Comm: threaded-ml Tainted: GW   4.14.40-rt31 #1
   Hardware name: Supermicro X9DRL-3F/iF/X9DRL-3F/iF, BIOS 3.2 09/22/2015
   task: 88046b67a6c0 task.stack: c900053a
   RIP: 0010:migrate_enable+0x75/0x2f0
   RSP: 0018:c900053a3df8 EFLAGS: 00010246

   Call Trace:
futex_unlock_pi+0x134/0x210
do_futex+0x13f/0x190
SyS_futex+0x6e/0x150
do_syscall_64+0x6f/0x190
entry_SYSCALL_64_after_hwframe+0x42/0xb7

This patch was developed against 4.14.40-rt31.  It should be
applicable to all rt releases in which migrate_enable !=
preempt_enable for !rt kernels.

Signed-off-by: Joe Korty 

Index: b/kernel/futex.c
===
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2838,7 +2838,14 @@ retry_private:
spin_unlock(q.lock_ptr);
ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, 
current);
raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
+#ifdef CONFIG_PREEMPT_RT_FULL
migrate_enable();
+#else
+   /* !rt has to force balanced atomic/!atomic migrate_enable/disable uses 
*/
+   preempt_disable();
+   migrate_enable();
+   preempt_enable();
+#endif
 
if (ret) {
if (ret == 1)
@@ -2998,7 +3005,14 @@ retry:
/* drops pi_state->pi_mutex.wait_lock */
ret = wake_futex_pi(uaddr, uval, pi_state);
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+   migrate_enable();
+#else
+   /* !rt has to force balanced atomic/!atomic uses */
+   preempt_disable();
migrate_enable();
+   preempt_enable();
+#endif
 
put_pi_state(pi_state);
 



Re: [PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING

2018-03-26 Thread joe . korty
Oh well.  Makes me wonder why might_sleep is testing for
!TASK_RUNNING though.

Thanks for the correction,
Joe


On Mon, Mar 26, 2018 at 11:35:15AM -0400, Steven Rostedt wrote:
> On Fri, 23 Mar 2018 13:21:31 -0400
> joe.ko...@concurrent-rt.com wrote:
> 
> > My understanding is, in standard Linux and in rt, setting
> > task state to anything other than TASK_RUNNING in of itself
> > blocks preemption.
> 
> That is clearly false. The only thing that blocks preemption with a
> CONFIG_PREEMPT kernel is preempt_disable() and local_irq*() disabling.
> 
> (Note spin_locks call preempt_disable in non RT).
> 
> Otherwise, nothing will stop preemption.
> 
> >  A preemption is not really needed here
> > as it is expected that there is a schedule() written in that
> > will shortly be executed.  And if a 'involuntary schedule'
> > (ie, preemption) were allowed to occur between the task
> > state set and the schedule(), that would change the task
> > state back to TASK_RUNNING, which would cause the schedule
> > to NOP.  Thus we risk not having paused long enough here
> > for the condition we were waiting for to become true.
> 
> That is also incorrect. As Julia mentioned, a preemption keeps the
> state of the task.


Re: [PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING

2018-03-23 Thread joe . korty
Hi Julia,
Thanks for the quick response!

On Fri, Mar 23, 2018 at 11:59:21AM -0500, Julia Cartwright wrote:
> Hey Joe-
> 
> Thanks for the writeup.
> 
> On Fri, Mar 23, 2018 at 11:09:59AM -0400, joe.ko...@concurrent-rt.com wrote:
> > I see the below kernel splat in 4.9-rt when I run a test program that
> > continually changes the affinity of some set of running pids:
> > 
> >do not call blocking ops when !TASK_RUNNING; state=2 set at ...
> >   ...
> >   stop_one_cpu+0x60/0x80
> >   migrate_enable+0x21f/0x3e0
> >   rt_spin_unlock+0x2f/0x40
> >   prepare_to_wait+0x5c/0x80
> >   ...
> 
> This is clearly a problem.
> 
> > The reason is that spin_unlock, write_unlock, and read_unlock call
> > migrate_enable, and since 4.4-rt, migrate_enable will sleep if it discovers
> > that a migration is in order.  But sleeping in the unlock services is not
> > expected by most kernel developers,
> 
> I don't buy this, see below:
> 
> > and where that counts most is in code sequences like the following:
> >
> >   set_current_state(TASK_UNINTERRUPTIBLE);
> >   spin_unlock();
> >   schedule();
> 
> The analog in mainline is CONFIG_PREEMPT and the implicit
> preempt_enable() in spin_unlock().  In this configuration, a kernel
> developer should _absolutely_ expect their task to be suspended (and
> potentially migrated), _regardless of the task state_ if there is a
> preemption event on the CPU on which this task is executing.
> 
> Similarly, on RT, there is nothing _conceptually_ wrong on RT with
> migrating on migrate_enable(), regardless of task state, if there is a
> pending migration event.

My understanding is, in standard Linux and in rt, setting
task state to anything other than TASK_RUNNING in and of itself
blocks preemption.  A preemption is not really needed here
as it is expected that there is a schedule() written in that
will shortly be executed.  And if an 'involuntary schedule'
(ie, preemption) were allowed to occur between the task
state set and the schedule(), that would change the task
state back to TASK_RUNNING, which would cause the schedule
to NOP.  Thus we risk not having paused long enough here
for the condition we were waiting for to become true.
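
For reference, the canonical pattern I have in mind (a sketch; 'lock'
and 'condition' are placeholders, not code from any one driver):

	set_current_state(TASK_UNINTERRUPTIBLE);
	if (!condition) {
		spin_unlock(&lock);	/* on rt this may call migrate_enable() */
		schedule();		/* returns quickly if state was reset
					 * to TASK_RUNNING by a preemption */
		spin_lock(&lock);
	}
	__set_current_state(TASK_RUNNING);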

> 
> It's clear, however, that the mechanism used here is broken ...
> 
>Julia

Thanks,
Joe


[PATCH RT] Defer migrate_enable migration while task state != TASK_RUNNING

2018-03-23 Thread joe . korty
p->comm);
+   }
+
	rq = task_rq_lock(p, &rf);
	update_rq_clock(rq);

@@ -3499,6 +3505,15 @@ void migrate_enable(void)
			tlb_migrate_finish(p->mm);
			return;
		}
+	} else if (p->migrate_disable_update && p->state != TASK_RUNNING) {
+		if (p->migrate_enable_deferred)
+			pr_info("%d(%s): migrate_enable() deferred (again).\n",
+				p->pid, p->comm);
+		else {
+			pr_info("%d(%s): migrate_enable() deferred.\n",
+				p->pid, p->comm);
+			p->migrate_enable_deferred = 1;
+		}
	}

	unpin_current_cpu();
EOF

The rt patch sched-migrate-disable-handle-updated-task-mask-mg-di.patch
appears to have introduced this issue, around the 4.4-rt timeframe.

Signed-off-by: Joe Korty <joe.ko...@concurrent-rt.com>

Index: b/kernel/sched/core.c
===
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3457,7 +3457,14 @@ void migrate_enable(void)
 */
p->migrate_disable = 0;
 
-   if (p->migrate_disable_update) {
+   /*
+* Do not apply affinity update on this migrate_enable if task
+* is preparing to go to sleep for some other reason (eg, task
+* state == TASK_INTERRUPTIBLE).  Instead defer update to a
+* future migrate_enable that is called when task state is again
+* == TASK_RUNNING.
+*/
+   if (p->migrate_disable_update && p->state == TASK_RUNNING) {
struct rq *rq;
struct rq_flags rf;
 


Re: [PATCH] 4.4.86-rt99: fix sync breakage between nr_cpus_allowed and cpus_allowed

2017-11-29 Thread joe . korty
On Tue, Nov 28, 2017 at 07:22:34PM -0500, Steven Rostedt wrote:
> On Tue, 21 Nov 2017 10:33:17 -0500
> joe.ko...@concurrent-rt.com wrote:
> 
> > On Tue, Nov 21, 2017 at 09:33:52AM -0500, joe.ko...@concurrent-rt.com wrote:
> > > On Mon, Nov 20, 2017 at 11:57:51PM -0500, Steven Rostedt wrote:  
> > > > On Mon, 20 Nov 2017 23:02:07 -0500
> > > > Steven Rostedt  wrote:
> > > > 
> > > >   
> > > > > Ideally, I would like to stay close to what upstream -rt does. Would
> > > > > you be able to backport the 4.11-rt patch?
> > > > > 
> > > > > I'm currently working on releasing 4.9-rt and 4.4-rt with the latest
> > > > > backports. I could easily add this one too.  
> > > > 
> > > > Speaking of which. I just backported this patch to 4.4-rt. Is this what
> > > > you are talking about?  
> > > 
> > > Yes it is.
> > > Thanks for finding that!
> > > Joe  
> > 
> > I spoke too fast.  You will need a variant of my one-liner fix
> > when you backport the 4.11.12-rt16 patch:
> > 
> > rt-Increase-decrease-the-nr-of-migratory-tasks-when-.patch
> > 
> > to 4.9-rt and 4.4-rt.  The fix of interest is the introduction of
> > 
> > p->nr_cpus_allowed = cpumask_weight(&p->cpus_mask);
> > 
> > to migrate_enable_update_cpus_allowed().
> 
> You totally confused me here.
> 
> Hmm, that patch isn't marked for stable. I'm guessing that it should be
> backported.
> 
> Now are you saying your patch still needs to be applied if we backport
> this patch? Or does your patch need to be applied to what I have
> already done?
> 
> I want to release 4.4-rt (and 4.9-rt) this week so let me know.



Hi Steve,
Just porting that other patch should do the trick.  Or you can just apply
my patch; I know that one works, as it has actually been tested.

Joe



Re: [PATCH] 4.4.86-rt99: fix sync breakage between nr_cpus_allowed and cpus_allowed

2017-11-21 Thread joe . korty
On Tue, Nov 21, 2017 at 09:33:52AM -0500, joe.ko...@concurrent-rt.com wrote:
> On Mon, Nov 20, 2017 at 11:57:51PM -0500, Steven Rostedt wrote:
> > On Mon, 20 Nov 2017 23:02:07 -0500
> > Steven Rostedt  wrote:
> > 
> > 
> > > Ideally, I would like to stay close to what upstream -rt does. Would
> > > you be able to backport the 4.11-rt patch?
> > > 
> > > I'm currently working on releasing 4.9-rt and 4.4-rt with the latest
> > > backports. I could easily add this one too.
> > 
> > Speaking of which. I just backported this patch to 4.4-rt. Is this what
> > you are talking about?
> 
> Yes it is.
> Thanks for finding that!
> Joe

I spoke too fast.  You will a variant of my one-liner fix
when you backport the 4.11.12-r16 patch:

rt-Increase-decrease-the-nr-of-migratory-tasks-when-.patch

to 4.9-rt and 4.4-rt.  The fix of interest is the introduction of

p->nr_cpus_allowed = cpumask_weight(>cpus_mask);

to migrate_enable_update_cpus_allowed().

Regards,
Joe

> 
> > >From 1dc89be37874bfc7bb4a0ea7c45492d7db39f62b Mon Sep 17 00:00:00 2001
> > From: Sebastian Andrzej Siewior 
> > Date: Mon, 19 Jun 2017 09:55:47 +0200
> > Subject: [PATCH] sched/migrate disable: handle updated task-mask mg-dis
> >  section
> > 
> > If task's cpumask changes while in the task is in a migrate_disable()
> > section then we don't react on it after a migrate_enable(). It matters
> > however if current CPU is no longer part of the cpumask. We also miss
> > the ->set_cpus_allowed() callback.
> > This patch fixes it by setting task->migrate_disable_update once we this
> > "delayed" hook.
> > This bug was introduced while fixing unrelated issue in
> > migrate_disable() in v4.4-rt3 (update_migrate_disable() got removed
> > during that).
> > 
> > Cc: stable...@vger.kernel.org
> > Signed-off-by: Sebastian Andrzej Siewior 
> > Signed-off-by: Steven Rostedt (VMware) 
> > ---
> >  include/linux/sched.h |1 
> >  kernel/sched/core.c   |   59 
> > --
> >  2 files changed, 54 insertions(+), 6 deletions(-)
> > 
> > Index: stable-rt.git/include/linux/sched.h
> > ===
> > --- stable-rt.git.orig/include/linux/sched.h2017-11-20 
> > 23:43:24.214077537 -0500
> > +++ stable-rt.git/include/linux/sched.h 2017-11-20 23:43:24.154079278 
> > -0500
> > @@ -1438,6 +1438,7 @@ struct task_struct {
> > unsigned int policy;
> >  #ifdef CONFIG_PREEMPT_RT_FULL
> > int migrate_disable;
> > +   int migrate_disable_update;
> >  # ifdef CONFIG_SCHED_DEBUG
> > int migrate_disable_atomic;
> >  # endif
> > Index: stable-rt.git/kernel/sched/core.c
> > ===
> > --- stable-rt.git.orig/kernel/sched/core.c  2017-11-20 23:43:24.214077537 
> > -0500
> > +++ stable-rt.git/kernel/sched/core.c   2017-11-20 23:56:05.071687323 
> > -0500
> > @@ -1212,18 +1212,14 @@ void set_cpus_allowed_common(struct task
> > p->nr_cpus_allowed = cpumask_weight(new_mask);
> >  }
> >  
> > -void do_set_cpus_allowed(struct task_struct *p, const struct cpumask 
> > *new_mask)
> > +static void __do_set_cpus_allowed_tail(struct task_struct *p,
> > +  const struct cpumask *new_mask)
> >  {
> > struct rq *rq = task_rq(p);
> > bool queued, running;
> >  
> > lockdep_assert_held(>pi_lock);
> >  
> > -   if (__migrate_disabled(p)) {
> > -   cpumask_copy(>cpus_allowed, new_mask);
> > -   return;
> > -   }
> > -
> > queued = task_on_rq_queued(p);
> > running = task_current(rq, p);
> >  
> > @@ -1246,6 +1242,20 @@ void do_set_cpus_allowed(struct task_str
> > enqueue_task(rq, p, ENQUEUE_RESTORE);
> >  }
> >  
> > +void do_set_cpus_allowed(struct task_struct *p, const struct cpumask 
> > *new_mask)
> > +{
> > +   if (__migrate_disabled(p)) {
> > +   lockdep_assert_held(>pi_lock);
> > +
> > +   cpumask_copy(>cpus_allowed, new_mask);
> > +#if defined(CONFIG_PREEMPT_RT_FULL) && defined(CONFIG_SMP)
> > +   p->migrate_disable_update = 1;
> > +#endif
> > +   return;
> > +   }
> > +   __do_set_cpus_allowed_tail(p, new_mask);
> > +}
> > +
> >  static DEFINE_PER_CPU(struct cpumask, sched_cpumasks);
> >  static DEFINE_MUTEX(sched_down_mutex);
> >  static cpumask_t sched_down_cpumask;
> > @@ -3231,6 +3241,43 @@ void migrate_enable(void)
> >  */
> > p->migrate_disable = 0;
> >  
> > +   if (p->migrate_disable_update) {
> > +   unsigned long flags;
> > +   struct rq *rq;
> > +
> > +   rq = task_rq_lock(p, );
> > +   update_rq_clock(rq);
> > +
> > +   __do_set_cpus_allowed_tail(p, >cpus_allowed);
> > +   task_rq_unlock(rq, p, );
> > +
> > +   p->migrate_disable_update = 0;
> > +
> > +   WARN_ON(smp_processor_id() != task_cpu(p));
> > +   if 

Re: [PATCH] 4.4.86-rt99: fix sync breakage between nr_cpus_allowed and cpus_allowed

2017-11-21 Thread joe . korty
On Tue, Nov 21, 2017 at 09:33:52AM -0500, joe.ko...@concurrent-rt.com wrote:
> On Mon, Nov 20, 2017 at 11:57:51PM -0500, Steven Rostedt wrote:
> > On Mon, 20 Nov 2017 23:02:07 -0500
> > Steven Rostedt  wrote:
> > 
> > 
> > > Ideally, I would like to stay close to what upstream -rt does. Would
> > > you be able to backport the 4.11-rt patch?
> > > 
> > > I'm currently working on releasing 4.9-rt and 4.4-rt with the latest
> > > backports. I could easily add this one too.
> > 
> > Speaking of which. I just backported this patch to 4.4-rt. Is this what
> > you are talking about?
> 
> Yes it is.
> Thanks for finding that!
> Joe

> > I spoke too fast.  You will need a variant of my one-liner fix
> > when you backport the 4.11.12-rt16 patch:

rt-Increase-decrease-the-nr-of-migratory-tasks-when-.patch

to 4.9-rt and 4.4-rt.  The fix of interest is the introduction of

> > p->nr_cpus_allowed = cpumask_weight(&p->cpus_mask);

to migrate_enable_update_cpus_allowed().

Regards,
Joe

> 
> > From 1dc89be37874bfc7bb4a0ea7c45492d7db39f62b Mon Sep 17 00:00:00 2001
> > From: Sebastian Andrzej Siewior 
> > Date: Mon, 19 Jun 2017 09:55:47 +0200
> > Subject: [PATCH] sched/migrate disable: handle updated task-mask mg-dis
> >  section
> > 
> > If a task's cpumask changes while the task is in a migrate_disable()
> > section then we don't react on it after a migrate_enable(). It matters
> > however if current CPU is no longer part of the cpumask. We also miss
> > the ->set_cpus_allowed() callback.
> > This patch fixes it by setting task->migrate_disable_update once we hit this
> > "delayed" hook.
> > This bug was introduced while fixing an unrelated issue in
> > migrate_disable() in v4.4-rt3 (update_migrate_disable() got removed
> > during that).
> > 
> > Cc: stable...@vger.kernel.org
> > Signed-off-by: Sebastian Andrzej Siewior 
> > Signed-off-by: Steven Rostedt (VMware) 
> > ---
> >  include/linux/sched.h |1 
> >  kernel/sched/core.c   |   59 
> > --
> >  2 files changed, 54 insertions(+), 6 deletions(-)
> > 
> > Index: stable-rt.git/include/linux/sched.h
> > ===
> > --- stable-rt.git.orig/include/linux/sched.h2017-11-20 
> > 23:43:24.214077537 -0500
> > +++ stable-rt.git/include/linux/sched.h 2017-11-20 23:43:24.154079278 
> > -0500
> > @@ -1438,6 +1438,7 @@ struct task_struct {
> > unsigned int policy;
> >  #ifdef CONFIG_PREEMPT_RT_FULL
> > int migrate_disable;
> > +   int migrate_disable_update;
> >  # ifdef CONFIG_SCHED_DEBUG
> > int migrate_disable_atomic;
> >  # endif
> > Index: stable-rt.git/kernel/sched/core.c
> > ===
> > --- stable-rt.git.orig/kernel/sched/core.c  2017-11-20 23:43:24.214077537 
> > -0500
> > +++ stable-rt.git/kernel/sched/core.c   2017-11-20 23:56:05.071687323 
> > -0500
> > @@ -1212,18 +1212,14 @@ void set_cpus_allowed_common(struct task
> > p->nr_cpus_allowed = cpumask_weight(new_mask);
> >  }
> >  
> > -void do_set_cpus_allowed(struct task_struct *p, const struct cpumask 
> > *new_mask)
> > +static void __do_set_cpus_allowed_tail(struct task_struct *p,
> > +  const struct cpumask *new_mask)
> >  {
> > struct rq *rq = task_rq(p);
> > bool queued, running;
> >  
> > lockdep_assert_held(&p->pi_lock);
> >  
> > -   if (__migrate_disabled(p)) {
> > -   cpumask_copy(&p->cpus_allowed, new_mask);
> > -   return;
> > -   }
> > -
> > queued = task_on_rq_queued(p);
> > running = task_current(rq, p);
> >  
> > @@ -1246,6 +1242,20 @@ void do_set_cpus_allowed(struct task_str
> > enqueue_task(rq, p, ENQUEUE_RESTORE);
> >  }
> >  
> > +void do_set_cpus_allowed(struct task_struct *p, const struct cpumask 
> > *new_mask)
> > +{
> > +   if (__migrate_disabled(p)) {
> > +   lockdep_assert_held(&p->pi_lock);
> > +
> > +   cpumask_copy(&p->cpus_allowed, new_mask);
> > +#if defined(CONFIG_PREEMPT_RT_FULL) && defined(CONFIG_SMP)
> > +   p->migrate_disable_update = 1;
> > +#endif
> > +   return;
> > +   }
> > +   __do_set_cpus_allowed_tail(p, new_mask);
> > +}
> > +
> >  static DEFINE_PER_CPU(struct cpumask, sched_cpumasks);
> >  static DEFINE_MUTEX(sched_down_mutex);
> >  static cpumask_t sched_down_cpumask;
> > @@ -3231,6 +3241,43 @@ void migrate_enable(void)
> >  */
> > p->migrate_disable = 0;
> >  
> > +   if (p->migrate_disable_update) {
> > +   unsigned long flags;
> > +   struct rq *rq;
> > +
> > +   rq = task_rq_lock(p, &flags);
> > +   update_rq_clock(rq);
> > +
> > +   __do_set_cpus_allowed_tail(p, &p->cpus_allowed);
> > +   task_rq_unlock(rq, p, &flags);
> > +
> > +   p->migrate_disable_update = 0;
> > +
> > +   WARN_ON(smp_processor_id() != task_cpu(p));
> > +   if (!cpumask_test_cpu(task_cpu(p), &p->cpus_allowed)) {
> > +   const struct cpumask 


Re: [PATCH] 4.4.86-rt99: fix sync breakage between nr_cpus_allowed and cpus_allowed

2017-11-21 Thread joe . korty
On Mon, Nov 20, 2017 at 11:57:51PM -0500, Steven Rostedt wrote:
> On Mon, 20 Nov 2017 23:02:07 -0500
> Steven Rostedt  wrote:
> 
> 
> > Ideally, I would like to stay close to what upstream -rt does. Would
> > you be able to backport the 4.11-rt patch?
> > 
> > I'm currently working on releasing 4.9-rt and 4.4-rt with the latest
> > backports. I could easily add this one too.
> 
> Speaking of which. I just backported this patch to 4.4-rt. Is this what
> you are talking about?

Yes it is.
Thanks for finding that!
Joe

> From 1dc89be37874bfc7bb4a0ea7c45492d7db39f62b Mon Sep 17 00:00:00 2001
> From: Sebastian Andrzej Siewior 
> Date: Mon, 19 Jun 2017 09:55:47 +0200
> Subject: [PATCH] sched/migrate disable: handle updated task-mask mg-dis
>  section
> 
> If a task's cpumask changes while the task is in a migrate_disable()
> section then we don't react on it after a migrate_enable(). It matters
> however if current CPU is no longer part of the cpumask. We also miss
> the ->set_cpus_allowed() callback.
> This patch fixes it by setting task->migrate_disable_update once we hit this
> "delayed" hook.
> This bug was introduced while fixing an unrelated issue in
> migrate_disable() in v4.4-rt3 (update_migrate_disable() got removed
> during that).
> 
> Cc: stable...@vger.kernel.org
> Signed-off-by: Sebastian Andrzej Siewior 
> Signed-off-by: Steven Rostedt (VMware) 
> ---
>  include/linux/sched.h |1 
>  kernel/sched/core.c   |   59 
> --
>  2 files changed, 54 insertions(+), 6 deletions(-)
> 
> Index: stable-rt.git/include/linux/sched.h
> ===
> --- stable-rt.git.orig/include/linux/sched.h  2017-11-20 23:43:24.214077537 
> -0500
> +++ stable-rt.git/include/linux/sched.h   2017-11-20 23:43:24.154079278 
> -0500
> @@ -1438,6 +1438,7 @@ struct task_struct {
>   unsigned int policy;
>  #ifdef CONFIG_PREEMPT_RT_FULL
>   int migrate_disable;
> + int migrate_disable_update;
>  # ifdef CONFIG_SCHED_DEBUG
>   int migrate_disable_atomic;
>  # endif
> Index: stable-rt.git/kernel/sched/core.c
> ===
> --- stable-rt.git.orig/kernel/sched/core.c2017-11-20 23:43:24.214077537 
> -0500
> +++ stable-rt.git/kernel/sched/core.c 2017-11-20 23:56:05.071687323 -0500
> @@ -1212,18 +1212,14 @@ void set_cpus_allowed_common(struct task
>   p->nr_cpus_allowed = cpumask_weight(new_mask);
>  }
>  
> -void do_set_cpus_allowed(struct task_struct *p, const struct cpumask 
> *new_mask)
> +static void __do_set_cpus_allowed_tail(struct task_struct *p,
> +const struct cpumask *new_mask)
>  {
>   struct rq *rq = task_rq(p);
>   bool queued, running;
>  
>   lockdep_assert_held(&p->pi_lock);
>  
> - if (__migrate_disabled(p)) {
> - cpumask_copy(&p->cpus_allowed, new_mask);
> - return;
> - }
> -
>   queued = task_on_rq_queued(p);
>   running = task_current(rq, p);
>  
> @@ -1246,6 +1242,20 @@ void do_set_cpus_allowed(struct task_str
>   enqueue_task(rq, p, ENQUEUE_RESTORE);
>  }
>  
> +void do_set_cpus_allowed(struct task_struct *p, const struct cpumask 
> *new_mask)
> +{
> + if (__migrate_disabled(p)) {
> + lockdep_assert_held(&p->pi_lock);
> +
> + cpumask_copy(&p->cpus_allowed, new_mask);
> +#if defined(CONFIG_PREEMPT_RT_FULL) && defined(CONFIG_SMP)
> + p->migrate_disable_update = 1;
> +#endif
> + return;
> + }
> + __do_set_cpus_allowed_tail(p, new_mask);
> +}
> +
>  static DEFINE_PER_CPU(struct cpumask, sched_cpumasks);
>  static DEFINE_MUTEX(sched_down_mutex);
>  static cpumask_t sched_down_cpumask;
> @@ -3231,6 +3241,43 @@ void migrate_enable(void)
>*/
>   p->migrate_disable = 0;
>  
> + if (p->migrate_disable_update) {
> + unsigned long flags;
> + struct rq *rq;
> +
> + rq = task_rq_lock(p, &flags);
> + update_rq_clock(rq);
> +
> + __do_set_cpus_allowed_tail(p, &p->cpus_allowed);
> + task_rq_unlock(rq, p, &flags);
> +
> + p->migrate_disable_update = 0;
> +
> + WARN_ON(smp_processor_id() != task_cpu(p));
> + if (!cpumask_test_cpu(task_cpu(p), &p->cpus_allowed)) {
> + const struct cpumask *cpu_valid_mask = cpu_active_mask;
> + struct migration_arg arg;
> + unsigned int dest_cpu;
> +
> + if (p->flags & PF_KTHREAD) {
> + /*
> +  * Kernel threads are allowed on online && 
> !active CPUs
> +  */
> + cpu_valid_mask = cpu_online_mask;
> + }
> + dest_cpu = cpumask_any_and(cpu_valid_mask, 
> &p->cpus_allowed);
> + arg.task = p;
> + arg.dest_cpu = 

Re: [PATCH] 4.4.86-rt99: fix sync breakage between nr_cpus_allowed and cpus_allowed

2017-11-20 Thread joe . korty
Hi Steve,
A quick perusal of 4.11.12-rt16 shows that it has an
entirely new version of migrate_disable which to me appears
correct.

In that new implementation, migrate_enable() recalculates
p->nr_cpus_allowed when it switches the task back to
using p->cpus_mask.  This brings the two back into sync
if anything had happened to get them out of sync while
migration was disabled (as would happen on an affinity
change during that disable period).

4.9.47-rt37 has the old implementation and it appears to
have the same bug as 4.4-rt, though I have yet to test 4.9-rt.

The fix in these older versions could take one of two
forms: either we recalculate p->nr_cpus_allowed when
migrate_enable goes back to using p->cpus_allowed,
as the 4.11-rt version does, or we fix the one place
where p->nr_cpus_allowed is allowed to diverge from
p->cpus_allowed.  The patch I submitted earlier takes
this second approach.
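
For illustration, the first approach amounts to something like this at
the end of migrate_enable(), once the task is switched back to using
p->cpus_allowed (a minimal sketch only, not the actual 4.11-rt hunk):

	/* leaving the migrate-disabled section: resync the cached
	 * weight with the affinity mask being restored */
	p->nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);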

Regards,
Joe



On Fri, Nov 17, 2017 at 05:48:51PM -0500, Steven Rostedt wrote:
> On Wed, 15 Nov 2017 14:25:29 -0500
> joe.ko...@concurrent-rt.com wrote:
> 
> > 4.4.86-rt99's patch
> > 
> >   0037-Intrduce-migrate_disable-cpu_light.patch
> > 
> > introduces a place where a task's cpus_allowed mask is
> > updated without a corresponding update to nr_cpus_allowed.
> > 
> > This path is executed when task affinity is changed while
> > migrate_disabled() is true.  As there is no code present
> > to set nr_cpus_allowed when the migrate_disable state is
> > dropped, the scheduler at that point on may make incorrect
> > scheduling decisions for this task.
> > 
> > My testing consists of temporarily adding a
> > 
> >  if (tsk_nr_cpus_allowed(p) != cpumask_weight(tsk_cpus_allowed(p)))
> > printk_ratelimited(...)
> 
> Have you tested v4.9-rt or 4.13-rt if it has the same bug? If it is a
> bug in 4.13-rt then it needs to go there first, and then backported to
> the stable releases (which I'm actually working on now).
> 
> -- Steve
> 
> > 
> > stmt to schedule() and running a simple affinity rotation
> > program I wrote, one that rotates the threads of stress(1).
> > While rotating, I got the expected kernel error messages.
> > With this patch applied the messages disappeared.
> > 
> > Signed-off-by: Joe Korty <joe.ko...@concurrent-rt.com>
> > 
> > Index: b/kernel/sched/core.c
> > ===
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1220,6 +1220,7 @@ void do_set_cpus_allowed(struct task_str
> > lockdep_assert_held(&p->pi_lock);
> >  
> > if (__migrate_disabled(p)) {
> > +   p->nr_cpus_allowed = cpumask_weight(new_mask);
> > cpumask_copy(&p->cpus_allowed, new_mask);
> > return;
> > }


[PATCH] 4.4.86-rt99: fix sync breakage between nr_cpus_allowed and cpus_allowed

2017-11-15 Thread joe . korty
4.4.86-rt99's patch

  0037-Intrduce-migrate_disable-cpu_light.patch

introduces a place where a task's cpus_allowed mask is
updated without a corresponding update to nr_cpus_allowed.

This path is executed when task affinity is changed while
migrate_disabled() is true.  As there is no code present
to set nr_cpus_allowed when the migrate_disable state is
dropped, the scheduler at that point on may make incorrect
scheduling decisions for this task.

My testing consists of temporarily adding a

 if (tsk_nr_cpus_allowed(p) != cpumask_weight(tsk_cpus_allowed(p)))
printk_ratelimited(...)

stmt to schedule() and running a simple affinity rotation
program I wrote, one that rotates the threads of stress(1).
While rotating, I got the expected kernel error messages.
With this patch applied the messages disappeared.
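
The rotation program itself is trivial.  A minimal reconstruction in
userspace C (illustrative only, not the original test; the pid argument
and the one-second interval are arbitrary choices):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	/* Rotate one task's affinity across all online CPUs, one step per second. */
	int main(int argc, char **argv)
	{
		int ncpus = sysconf(_SC_NPROCESSORS_ONLN);
		cpu_set_t set;
		pid_t pid;
		int cpu;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <pid>\n", argv[0]);
			return 1;
		}
		pid = atoi(argv[1]);	/* e.g. one thread of stress(1) */
		if (ncpus < 1)
			ncpus = 1;

		for (cpu = 0; ; cpu = (cpu + 1) % ncpus) {
			CPU_ZERO(&set);
			CPU_SET(cpu, &set);
			if (sched_setaffinity(pid, sizeof(set), &set))
				perror("sched_setaffinity");
			sleep(1);
		}
	}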

Signed-off-by: Joe Korty <joe.ko...@concurrent-rt.com>

Index: b/kernel/sched/core.c
===
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1220,6 +1220,7 @@ void do_set_cpus_allowed(struct task_str
lockdep_assert_held(&p->pi_lock);
 
if (__migrate_disabled(p)) {
+   p->nr_cpus_allowed = cpumask_weight(new_mask);
cpumask_copy(&p->cpus_allowed, new_mask);
return;
}


Re: tunnels: Don't apply GRO to multiple layers of encapsulation.

2017-09-05 Thread joe . korty
Hi Sasha,
The backport of

   fac8e0f579695a3ecbc4d3cac369139d7f819971
   tunnels: Don't apply GRO to multiple layers of encapsulation

into 4.1 missed a hunk.  The same backport into 3.18 was done
correctly.  This patch introduces the missing hunk into 4.1.
Excerpts from some emails:

Joe Korty wrote:
> I am not experiencing any bad symptoms.  I simply noticed
> that the patch introduced a new function, sit_gro_receive,
> without introducing any users, and that same patch in
> linux-4.4.y does have a user.

Jesse Gross wrote:
> Thanks for pointing that out. The line you mentioned
> should indeed be there and seems to have been missed in
> the backport.
> 
> The backport was actually done by Sasha, not by me -
> would you mind sending a patch to him or working with him
> to fix it?

Could you review this and run it through your tests and
send it along to Greg if appropriate?

Thanks,
Joe

Signed-off-by: Joe Korty <joe.ko...@concurrent-rt.com>

Index: b/net/ipv6/ip6_offload.c
===
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -300,7 +300,7 @@ static struct packet_offload ipv6_packet
.type = cpu_to_be16(ETH_P_IPV6),
.callbacks = {
.gso_segment = ipv6_gso_segment,
-   .gro_receive = ipv6_gro_receive,
+   .gro_receive = sit_gro_receive,
.gro_complete = ipv6_gro_complete,
},
 };


Re: tunnels: Don't apply GRO to multiple layers of encapsulation.

2017-08-31 Thread joe . korty
[ resend due to mail problems at my end ]

Hi Jesse,

The backport of fac8e0f579695a3ecbc4d3cac369139d7f819971,
"tunnels: Don't apply GRO to multiple layers of encapsulation",
to linux-4.1.y seems to have missed a line.

The 4.1 commit is 066b300e5be43cb61697539e2a3a9aac5afb422f.

The potentially missing line is:

-   .gro_receive= ipv6_gro_receive,
+   .gro_receive= sit_gro_receive,


I am not experiencing any bad symptoms.  I simply noticed
that the patch introduced a new function, sit_gro_receive,
without introducing any users, and that same patch in  
linux-4.4.y does have a user.

Regards,
Joe



[PATCH] Fix kfree bug in sendmsg and recvmsg

2016-02-17 Thread Joe Korty
Fix kfree bug in recvmsg and sendmsg.

We cannot kfree(iov) when iov points to an array on the
stack, as that has the potential of corrupting memory.

So re-introduce the if-stmt that used to protect kfree
from this condition, code that was removed as part of
a larger set of changes made by git commit da184284.
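
The hazard, reduced to plain userspace C (a sketch of the pattern, not
the net/socket.c code itself):

	#include <stdlib.h>

	void demo(size_t need)
	{
		char stackbuf[64];	/* fast path: small requests use the stack */
		char *buf = stackbuf;

		if (need > sizeof(stackbuf)) {
			buf = malloc(need);	/* large requests go to the heap */
			if (!buf)
				return;
		}

		/* ... use buf ... */

		if (buf != stackbuf)	/* free(stackbuf) would corrupt memory */
			free(buf);
	}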

Signed-off-by: Joe Korty <joe.ko...@ccur.com>

Index: b/net/socket.c
===
--- a/net/socket.c
+++ b/net/socket.c
@@ -1960,7 +1960,8 @@ out_freectl:
if (ctl_buf != ctl)
sock_kfree_s(sock->sk, ctl_buf, ctl_len);
 out_freeiov:
-   kfree(iov);
+   if (iov != iovstack)
+   kfree(iov);
return err;
 }
 
@@ -2125,7 +2126,8 @@ static int ___sys_recvmsg(struct socket 
err = len;
 
 out_freeiov:
-   kfree(iov);
+   if (iov != iovstack)
+   kfree(iov);
return err;
 }
 


[no subject]

2016-02-09 Thread Joe Korty
subscribe


Re: [ANNOUNCE] 3.12.6-rt9

2014-01-21 Thread Joe Korty
On Tue, Jan 21, 2014 at 01:39:10AM -0500, Muli Baron wrote:
> On 21/1/2014 04:17, Steven Rostedt wrote:
> > On Sat, 18 Jan 2014 04:15:29 +0100
> > Mike Galbraith <bitbuc...@online.de> wrote:
> >
> >
> >>> So you also have the timers-do-not-raise-softirq-unconditionally.patch?
> >>
> >
> > People have been complaining that the latest 3.12-rt does not boot on
> > intel i7 boxes. And by reverting this patch, it boots fine.
> >
> > I happen to have a i7 box to test on, and sure enough, the latest
> > 3.12-rt locks up on boot and reverting the
> > timers-do-not-raise-softirq-unconditionally.patch, it boots fine.
> >
> > Looking into it, I made this small update, and the box boots. Seems
> > checking "active_timers" is not enough to skip raising softirqs. I
> > haven't looked at why yet, but I would like others to test this patch
> > too.
> >
> > I'll leave why this lets i7 boxes boot as an exercise for Thomas ;-)
> >
> > -- Steve
> >
> > Signed-off-by: Steven Rostedt <rost...@goodmis.org>
> >
> > diff --git a/kernel/timer.c b/kernel/timer.c
> > index 46467be..8212c10 100644
> > --- a/kernel/timer.c
> > +++ b/kernel/timer.c
> > @@ -1464,13 +1464,11 @@ void run_local_timers(void)
> > raise_softirq(TIMER_SOFTIRQ);
> > return;
> > }
> > -   if (!base->active_timers)
> > -   goto out;
> >
> > /* Check whether the next pending timer has expired */
> > if (time_before_eq(base->next_timer, jiffies))
> > raise_softirq(TIMER_SOFTIRQ);
> > -out:
> > +
> > rt_spin_unlock_after_trylock_in_irq(&base->lock);
> >
> >   }
> >
> 
> While this might fix booting on i7 machines it kind of defeats the 
> original purpose of this patch, which was to let NO_HZ_FULL work 
> properly with threaded interrupts. With the active_timers check removed 
> the timer interrupt keeps firing even though there is only one task 
> running on a specific processor, since it can't shut down the tick 
> because the ksoftirqd thread keeps getting scheduled (see the previous 
> thread "CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo" for the full 
> discussion).
> 
> -- Muli


Would something like this work?  This would get us past boot, which has
always been this strange, half-initialized thing one has to tiptoe around.

-   if (!base->active_timers)
+   if (!base->active_timers && system_state == SYSTEM_RUNNING)
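
Applied to the code quoted above, the tail of run_local_timers() would
then read (reconstructed from the diff, untested):

	if (!base->active_timers && system_state == SYSTEM_RUNNING)
		goto out;

	/* Check whether the next pending timer has expired */
	if (time_before_eq(base->next_timer, jiffies))
		raise_softirq(TIMER_SOFTIRQ);
out:
	rt_spin_unlock_after_trylock_in_irq(&base->lock);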

Joe



[PATCH] jrcu-3.6-1

2012-10-22 Thread Joe Korty
Hi Stas,
Here is the forward port to 3.6 of the LAG jRCU I promised.
I've only compiled and used it on a single x86_64 machine.
The 3.5 version though is getting heavy use on various
x86_64 and i386 machines in the lab.

Regards,
Joe

Joe's RCU for Linux-3.6, first cut.

jRCU is a tiny RCU best suited for small-SMP systems.
See Documentation/RCU/jrcu.txt for details.

Recent revision history:

   3.6-1: basic port from 3.5-2, no new functionality added.

   3.5-2: replaced the original lockless implementation with
   one based on locks. This makes the algorithm simpler to
   describe, as well as expanding its uses beyond its original
   parameters (small SMP, large frame).  Rewrite based on
   comments from Andi Kleen on the 3.4-1 version last May.

   3.5-1: basic port from 3.4-1, no new functionality added.

Signed-off-by: Joe Korty <joe.ko...@ccur.com>

Index: b/kernel/jrcu.c
===
--- /dev/null
+++ b/kernel/jrcu.c
@@ -0,0 +1,781 @@
+/*
+ * Joe's tiny RCU, for small SMP systems.
+ *
+ * See Documentation/RCU/jrcu.txt for theory of operation and design details.
+ *
+ * Author: Joe Korty <joe.ko...@ccur.com>
+ *
+ * Acknowledgements: Paul E. McKenney's 'TinyRCU for uniprocessors' inspired
+ * the thought that there could be something similarly simple for SMP.
+ * The rcu_list chain operators are from Jim Houston's Alternative RCU.
+ *
+ * Copyright Concurrent Computer Corporation, 2011-2012.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or (at your
+ * option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ * or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/bug.h>
+#include <linux/smp.h>
+#include <linux/slab.h>
+#include <linux/ctype.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/stddef.h>
+#include <linux/string.h>
+#include <linux/preempt.h>
+#include <linux/uaccess.h>
+#include <linux/compiler.h>
+#include <linux/irqflags.h>
+#include <linux/rcupdate.h>
+
+/*
+ * Define an rcu list type and operators.  This differs from linux/list.h
+ * in that an rcu list has only ->next pointers for the chain nodes; the
+ * list head however is special and has pointers to both the first and
+ * last nodes of the chain.  Tweaked so that null head, tail pointers can
+ * be used to signify an empty list.
+ */
+struct rcu_list {
+   struct rcu_head *head;
+   struct rcu_head **tail;
+   int count;  /* stats-n-debug */
+};
+
+static inline void rcu_list_init(struct rcu_list *l)
+{
+   l->head = NULL;
+   l->tail = NULL;
+   l->count = 0;
+}
+
+/*
+ * Add an element to the tail of an rcu list
+ */
+static inline void rcu_list_add(struct rcu_list *l, struct rcu_head *h)
+{
+   if (unlikely(l->tail == NULL))
+   l->tail = &l->head;
+   *l->tail = h;
+   l->tail = &h->next;
+   l->count++;
+   h->next = NULL;
+}
+
+/*
+ * Append the contents of one rcu list to another.  The 'from' list is left
+ * corrupted on exit; the caller must re-initialize it before it can be used
+ * again.
+ */
+static inline void rcu_list_join(struct rcu_list *to, struct rcu_list *from)
+{
+   if (from->head) {
+   if (unlikely(to->tail == NULL)) {
+   to->tail = &to->head;
+   to->count = 0;
+   }
+   *to->tail = from->head;
+   to->tail = from->tail;
+   to->count += from->count;
+   }
+}
+
+/* End of generic rcu list definitions, start of specific JRCU stuff */
+
+struct rcu_data {
+   u16 wait;   /* goes false when this cpu consents to
+* the retirement of the current batch */
+   struct rcu_list clist;  /* current callback list */
+   struct rcu_list plist;  /* previous callback list */
+   raw_spinlock_t lock;/* protects the above callback lists */
+   s64 nqueued;/* #callbacks queued (stats-n-debug) */
+} cacheline_aligned_in_smp;
+
+static struct rcu_data rcu_data[NR_CPUS];
+
+/* debug & statistics stuff */
+static struct rcu_stats {
+   unsigned npasses;   /* #passes made */
+   unsigned nlast; /* #passes since last end-of-batch */
+   unsigned nbatches;  /* #end-of-batches (eobs) seen */
+   atomic_t nbarriers; /* #rcu barriers processed */
+   atomic_t nsyncs;/* #rcu syncs processed */
+   s64 ninvoked;   /* #invoked (ie, finished) callbacks */
+   unsigned nforced;   /* #forced eobs (shoul
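
A minimal usage sketch of the rcu_list helpers defined above
(hypothetical caller code, not part of the patch):

	static struct rcu_list pending, batch;

	static void queue_cb(struct rcu_head *h)
	{
		rcu_list_add(&pending, h);	/* append one callback */
	}

	static void start_batch(void)
	{
		/* move everything queued so far into the working batch */
		rcu_list_join(&batch, &pending);
		/* the 'from' list is left corrupted; re-initialize it */
		rcu_list_init(&pending);
	}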

Re: possible corrections in the docs (Re: [PATCH] [7/50] x86: expand /proc/interrupts to include missing vectors, v2)

2007-09-21 Thread Joe Korty
Looks good to me.
Joe

Acked-by: Joe Korty <[EMAIL PROTECTED]>


Re: [PATCH] Fix section mismatch in the Adaptec DPT SCSI Raid driver

2007-08-17 Thread Joe Korty
On Fri, Aug 17, 2007 at 02:18:56PM -0700, Andrew Morton wrote:
> Please always provide at least a copy of the error message when providing
> patches which fix warnings, or build errors, or section mismatches.
> 
> For section mismatches, an analysis of what caused the problem would help,
> too.  It saves others from having to do the same thing.
> 
> In this case, I'd need to see what error is being fixed so that I can judge
> the seriousness of the problem.  In this case I don't _think_ it'll be
> terribly serious because iirc most architectures don't free exitcall memory.




Fix section mismatch in the Adaptec DPT SCSI Raid driver.

WARNING: vmlinux.o(.init.text+0x1fcd2): Section mismatch:
reference to .exit.text:adpt_exit (between 'adpt_init' and 
'ahc_linux_init')

This warning is due to Adaptec device detection calling the exit routine
on failure to properly register the Adaptec device.
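
The shape of the problem, in miniature (illustrative only;
adpt_detect() stands in for the real detection path):

	static void __exit adpt_exit(void);	/* placed in .exit.text */

	static int __init adpt_init(void)	/* placed in .init.text */
	{
		int count = adpt_detect();

		if (count <= 0)
			adpt_exit();	/* .init.text referencing .exit.text */
		return count > 0 ? 0 : -ENODEV;
	}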

The exit routine + call was added on July 30 by
  Commit: 55d9fcf57ba5ec427544fca7abc335cf3da78160
  Author: Matthew Wilcox
  Subject: [SCSI] dpt_i2o: convert to SCSI hotplug model.

Matthew: isn't a module exit routine a little too strong to be calling
on the failure of a single device?  Module exit implies that other,
non-failing Adaptec RAID devices will also get shut down.

Signed-off-by: Joe Korty <[EMAIL PROTECTED]>

Index: 2.6.23-rc3-git1/drivers/scsi/dpt_i2o.c
===
--- 2.6.23-rc3-git1.orig/drivers/scsi/dpt_i2o.c 2007-08-17 16:36:05.0 
-0400
+++ 2.6.23-rc3-git1/drivers/scsi/dpt_i2o.c  2007-08-17 16:50:13.0 
-0400
@@ -3351,7 +3351,7 @@
return count > 0 ? 0 : -ENODEV;
 }
 
-static void __exit adpt_exit(void)
+static void adpt_exit(void)
 {
while (hba_chain)
adpt_release(hba_chain);


[PATCH] Fix section mismatch in the Adaptec DPT SCSI Raid driver

2007-08-17 Thread Joe Korty
Fix section mismatch in the Adaptec DPT SCSI Raid driver.

Signed-off-by: Joe Korty <[EMAIL PROTECTED]>

Index: 2.6.23-rc3-git1/drivers/scsi/dpt_i2o.c
===
--- 2.6.23-rc3-git1.orig/drivers/scsi/dpt_i2o.c 2007-08-17 16:36:05.0 
-0400
+++ 2.6.23-rc3-git1/drivers/scsi/dpt_i2o.c  2007-08-17 16:50:13.0 
-0400
@@ -3351,7 +3351,7 @@
return count > 0 ? 0 : -ENODEV;
 }
 
-static void __exit adpt_exit(void)
+static void adpt_exit(void)
 {
while (hba_chain)
adpt_release(hba_chain);




[PATCH] hres_timers_resume must block interrupts

2007-08-13 Thread Joe Korty
Retrigger_next_event() must be called with interrupts disabled.

All internal (to hrtimer.c) uses of retrigger_next_event() are correct.
But the version exported to other files, hres_timers_resume(), does not
do the IRQ blocking, nor does the (single) external caller of it.

Rather than require that users of hres_timers_resume() do the IRQ blocking,
this patch makes the blocking part of the hres_timers_resume() functionality.
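
That is, the patch below folds into the function what every caller
would otherwise have to write:

	unsigned long flags;

	local_irq_save(flags);
	hres_timers_resume();
	local_irq_restore(flags);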

(Also remove the meaningless WARN_ON_ONCE() call in hres_timers_resume)

Signed-off-by: Joe Korty ([EMAIL PROTECTED])

Index: 2.6.23-rc3/kernel/hrtimer.c
===
--- 2.6.23-rc3.orig/kernel/hrtimer.c2007-08-13 18:30:09.0 -0400
+++ 2.6.23-rc3/kernel/hrtimer.c 2007-08-13 18:38:48.0 -0400
@@ -463,10 +463,11 @@
  */
 void hres_timers_resume(void)
 {
-   WARN_ON_ONCE(num_online_cpus() > 1);
+   unsigned long flags;
 
-   /* Retrigger the CPU local events: */
+   local_irq_save(flags);
retrigger_next_event(NULL);
+   local_irq_restore(flags);
 }
 
 /*



Re: WARN_ON() which sometimes sucks

2007-08-01 Thread Joe Korty
On Wed, Aug 01, 2007 at 02:20:48PM +1000, Paul Mackerras wrote:
> Linus Torvalds writes:
> 
> > Umm. The WARN_ON() might actually get a "long long" value for all we know. 
> > Ie it's perfectly possible that the WARN_ON might look like
> > 
> > /* Must not have high bits on */
> > WARN_ON(offset & 0xffffffff00000000ULL);
> > 
> > which on a 32-bit ppc would apparently do the wrong thing entirely as it 
> > stands now. No?
> 
> Actually, because of the typeof in the powerpc WARN_ON, I think it
> would fail to build since we'd be passing a long long value to an
> inline asm, or at least I hope it would fail to build. :)


Turning the condition into an integer should work ...

#define NEW_WARN_ON(x) OLD_WARN_ON(!!(x))
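
A small userspace demonstration of the difference (assumed example,
mirroring the offset case quoted above):

	#include <stdio.h>

	int main(void)
	{
		long long cond = 0x100000000LL;	/* nonzero, but low 32 bits are 0 */

		printf("truncated to int: %d\n", (int)cond);	/* prints 0 */
		printf("normalized by !!: %d\n", !!cond);	/* prints 1 */
		return 0;
	}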

Regards,
Joe



[PATCH] expand /proc/interrupts to include missing vectors, v4

2007-07-31 Thread Joe Korty
[ v3->v4 changelog:
s/irq_spur_counts/irq_spurious_counts/g (Andrew Morton)
tweaked documentation (Andi Kleen)
moved increments before irq_exit as appropriate (Andi Kleen)
whitespace cleanup (Andi Kleen)
]

Add missing IRQs and IRQ descriptions to /proc/interrupts,
version 4.

/proc/interrupts is most useful when it displays every
IRQ vector in use by the system, not just those somebody
thought would be interesting.

This patch inserts the following vector displays to the
i386 and x86_64 platforms.

rescheduling interrupts
TLB flush interrupts
function call interrupts
thermal event interrupts
threshold interrupts
spurious interrupts

A threshold interrupt occurs when ECC memory correction
is occurring at too high a frequency.  Thresholds are used
by the ECC hardware as occasional ECC corrections are part
of normal operation (alpha particles), but long sequences
of ECC corrections usually indicate a memory chip that
is about to fail.  Note that not every system has ECC
threshold logic, and those that do can require it to
be specifically enabled.

Thermal event interrupts occur when a temperature threshold
has been exceeded for some CPU chip.  I am not sure,
but I think a thermal interrupt is also generated when
the temperature drops back to a normal level.

A spurious interrupt is an interrupt that was raised then
lowered by the device before it could be fully processed
by the APIC.  Hence the apic sees the interrupt but does
not know what device it came from.  For this case the APIC
hardware will assume a vector of 0xff.

Rescheduling, call, and TLB flush interrupts are sent from
one CPU to another per the needs of the OS.  Typically,
their statistics would be used to discover interrupt
flooding.
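
As an aside, not part of the patch: once every vector has its own
row, the new statistics can be totalled with ordinary text parsing.
A user-space sketch, assuming the "XXX: count count ... description"
row layout shown above:

#include <stdio.h>
#include <string.h>

/* sum the per-cpu columns of one /proc/interrupts row, e.g. "RES:" */
static long long row_total(const char *label)
{
	FILE *f = fopen("/proc/interrupts", "r");
	char line[4096];
	long long total = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, label, strlen(label)) != 0)
			continue;
		const char *p = line + strlen(label);
		long long v;
		int n;

		total = 0;
		/* the numbers are per-cpu counts; the description ends them */
		while (sscanf(p, "%lld%n", &v, &n) == 1) {
			total += v;
			p += n;
		}
		break;
	}
	fclose(f);
	return total;
}

int main(void)
{
	printf("rescheduling interrupts, all cpus: %lld\n", row_total("RES:"));
	return 0;
}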

Signed-off-by: Joe Korty <[EMAIL PROTECTED]>

Index: 2.6.23-rc1-git7/arch/i386/kernel/apic.c
===
--- 2.6.23-rc1-git7.orig/arch/i386/kernel/apic.c 2007-07-31 16:31:09.0 -0400
+++ 2.6.23-rc1-git7/arch/i386/kernel/apic.c 2007-07-31 16:45:30.0 -0400
@@ -1279,6 +1279,7 @@
/* see sw-dev-man vol 3, chapter 7.4.13.5 */
printk(KERN_INFO "spurious APIC interrupt on CPU#%d, "
   "should never happen.\n", smp_processor_id());
+   __get_cpu_var(irq_stat).irq_spurious_counts++;
irq_exit();
 }
 
Index: 2.6.23-rc1-git7/arch/i386/kernel/cpu/mcheck/p4.c
===
--- 2.6.23-rc1-git7.orig/arch/i386/kernel/cpu/mcheck/p4.c 2007-07-31 16:31:09.0 -0400
+++ 2.6.23-rc1-git7/arch/i386/kernel/cpu/mcheck/p4.c 2007-07-31 16:45:30.0 -0400
@@ -61,6 +61,7 @@
 {
irq_enter();
vendor_thermal_interrupt(regs);
+   __get_cpu_var(irq_stat).irq_thermal_counts++;
irq_exit();
 }
 
Index: 2.6.23-rc1-git7/arch/i386/kernel/irq.c
===
--- 2.6.23-rc1-git7.orig/arch/i386/kernel/irq.c 2007-07-31 16:31:09.0 -0400
+++ 2.6.23-rc1-git7/arch/i386/kernel/irq.c 2007-07-31 16:45:30.0 -0400
@@ -284,14 +284,41 @@
seq_printf(p, "NMI: ");
for_each_online_cpu(j)
seq_printf(p, "%10u ", nmi_count(j));
-   seq_putc(p, '\n');
+   seq_printf(p, "  Non-maskable interrupts\n");
 #ifdef CONFIG_X86_LOCAL_APIC
seq_printf(p, "LOC: ");
for_each_online_cpu(j)
seq_printf(p, "%10u ",
per_cpu(irq_stat,j).apic_timer_irqs);
-   seq_putc(p, '\n');
+   seq_printf(p, "  Local interrupts\n");
 #endif
+#ifdef CONFIG_SMP
+   seq_printf(p, "RES: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_resched_counts);
+   seq_printf(p, "  Rescheduling interrupts\n");
+   seq_printf(p, "CAL: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_call_counts);
+   seq_printf(p, "  function call interrupts\n");
+   seq_printf(p, "TLB: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_tlb_counts);
+   seq_printf(p, "  TLB shootdowns\n");
+#endif
+   seq_printf(p, "TRM: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_thermal_counts);
+   seq_printf(p, "  Thermal event interrupts\n");
+   seq_printf(p, "SPU: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_spurious_counts);
+   seq_printf(p, "  Spurious interrupts\n");

Re: [PATCH] expand /proc/interrupts to include missing vectors, v3

2007-07-31 Thread Joe Korty
On Tue, Jul 31, 2007 at 07:02:01PM +0200, Andi Kleen wrote:

Hi Andi,
Thanks for the review.  I implemented many of your suggestions; for
the rest, I explain below why not, in case you want to respond further.

Regards,
Joe

> Joe Korty <[EMAIL PROTECTED]> writes:
> > A threshold interrupt occurs when ECC memory correction
> > is occurring at too high a frequency. 
> 
> It's configurable and the default is off. Also 
> it's only on AMD hardware.

v4 now has a comment to the Documentation section noting
this.


> > Thresholds are used
> > by the ECC hardware as occasional ECC failures are part
> > of normal operation,

Occasional ECC _corrections_ are normal (due to stray alpha particles)
but ECC _failures_ are not.  Document corrected.

> > irq_exit();
> > +   __get_cpu_var(irq_stat).irq_spur_counts++;
> 
> Wouldn't it be safer on preemptible kernels to have that inside
> the irq_exit? 

Although irq_exit() releases the preemption block, it doesn't seem to
release the APIC interrupt block, at least for i386.  And as an interrupt
block also blocks preemption and process migration, it seems that it would
be safe to do the increments after the irq_exit().  But I've moved them
all inside in v4, just in case I am wrong, or this changes in the future
(eg, PREEMPT_RT).

> > +   seq_printf(p, "RES: ");
> 
> I think it would be better to use 5-6 char identifiers
> even when it whacks the columns a bit; otherwise nobody
> will know what it means. e.g. SCHED here.

v3 addresses this.  The normally empty 'description' column at the end of
each line now holds a description of each vector.  The three-character
line-prefix names are there only to make the new lines match the syntax
and format of the other lines in /proc/interrupts.

> Also there you should update proc(5) and send a patch
> to the manpage maintainer.

Will do.

Thanks,
Joe

PS: also fixed up the whitespace.


Re: [PATCH] expand /proc/interrupts to include missing vectors, v3

2007-07-31 Thread Joe Korty
[ changes from v2:
added documentation
merged some #ifdef CONFIG_SMP's
]

Add missing IRQs and IRQ descriptions to /proc/interrupts.

/proc/interrupts is most useful when it displays every
IRQ vector in use by the system, not just those somebody
thought would be interesting.

This patch inserts the following vector displays to the
i386 and x86_64 platforms, as appropriate:

rescheduling interrupts
TLB flush interrupts
function call interrupts
thermal event interrupts
threshold interrupts
spurious interrupts

A threshold interrupt occurs when ECC memory correction
is occurring at too high a frequency.  Thresholds are used
by the ECC hardware as occasional ECC failures are part
of normal operation, but long sequences of ECC failures
usually indicate a memory chip that is about to fail.

Thermal event interrupts occur when a temperature threshold
has been exceeded for some CPU chip.  IIRC, thermal interrupts
can also be generated when the temperature drops back to
the normal range.

A spurious interrupt is an interrupt that was raised then
lowered by the device before it could be fully processed
by the APIC.  Hence the apic sees the interrupt but does
not know what device it came from.  For this case the APIC
hardware will assume a vector of 0xff.

Rescheduling, call, and TLB flush interrupts are sent from
one CPU to another per the needs of the OS.  Typically,
their statistics would be used to discover interrupt
flooding.

Signed-off-by: Joe Korty <[EMAIL PROTECTED]>

Index: 2.6.23-rc1-git7/arch/i386/kernel/apic.c
===
--- 2.6.23-rc1-git7.orig/arch/i386/kernel/apic.c 2007-07-30 19:08:05.0 -0400
+++ 2.6.23-rc1-git7/arch/i386/kernel/apic.c 2007-07-30 19:08:07.0 -0400
@@ -1280,6 +1280,7 @@
printk(KERN_INFO "spurious APIC interrupt on CPU#%d, "
   "should never happen.\n", smp_processor_id());
irq_exit();
+   __get_cpu_var(irq_stat).irq_spur_counts++;
 }
 
 /*
Index: 2.6.23-rc1-git7/arch/i386/kernel/cpu/mcheck/p4.c
===
--- 2.6.23-rc1-git7.orig/arch/i386/kernel/cpu/mcheck/p4.c 2007-07-30 19:08:05.0 -0400
+++ 2.6.23-rc1-git7/arch/i386/kernel/cpu/mcheck/p4.c 2007-07-30 19:08:07.0 -0400
@@ -62,6 +62,7 @@
irq_enter();
vendor_thermal_interrupt(regs);
irq_exit();
+   __get_cpu_var(irq_stat).irq_thermal_counts++;
 }
 
 /* P4/Xeon Thermal regulation detect and init */
Index: 2.6.23-rc1-git7/arch/i386/kernel/irq.c
===
--- 2.6.23-rc1-git7.orig/arch/i386/kernel/irq.c 2007-07-30 19:08:05.0 -0400
+++ 2.6.23-rc1-git7/arch/i386/kernel/irq.c 2007-07-31 09:40:58.0 -0400
@@ -284,14 +284,41 @@
seq_printf(p, "NMI: ");
for_each_online_cpu(j)
seq_printf(p, "%10u ", nmi_count(j));
-   seq_putc(p, '\n');
+   seq_printf(p, "  Non-maskable interrupts\n");
 #ifdef CONFIG_X86_LOCAL_APIC
seq_printf(p, "LOC: ");
for_each_online_cpu(j)
seq_printf(p, "%10u ",
per_cpu(irq_stat,j).apic_timer_irqs);
-   seq_putc(p, '\n');
+   seq_printf(p, "  Local interrupts\n");
 #endif
+#ifdef CONFIG_SMP
+   seq_printf(p, "RES: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_resched_counts);
+   seq_printf(p, "  Rescheduling interrupts\n");
+   seq_printf(p, "CAL: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_call_counts);
+   seq_printf(p, "  function call interrupts\n");
+   seq_printf(p, "TLB: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_tlb_counts);
+   seq_printf(p, "  TLB shootdowns\n");
+#endif
+   seq_printf(p, "TRM: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_thermal_counts);
+   seq_printf(p, "  Thermal event interrupts\n");
+   seq_printf(p, "SPU: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_spur_counts);
+   seq_printf(p, "  Spurious interrupts\n");
seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count));
 #if defined(CONFIG_X86_IO_APIC)
seq_printf(p, "MIS: %10u\n", atomic_read(&irq_mis_count));
Index: 2.6.23-rc1-git7/arch/i386/kernel/smp.c


[PATCH] expand /proc/interrupts to include missing vectors, v2

2007-07-30 Thread Joe Korty
Add missing IRQs and IRQ descriptions to /proc/interrupts.

/proc/interrupts is most useful when it displays every
IRQ vector in use by the system, not just those somebody
thought would be interesting.

This patch inserts the following vector displays to the
i386 and x86_64 platforms, as appropriate:

rescheduling interrupts
TLB flush interrupts
function call interrupts
thermal event interrupts
threshold interrupts
spurious interrupts

A threshold interrupt occurs when ECC memory correction
is occurring at too high a frequency.  Thresholds are used
by the ECC hardware as occasional ECC failures are part
of normal operation, but long sequences of ECC failures
usually indicate a memory chip that is about to fail.

Thermal event interrupts occur when a temperature threshold
has been exceeded for some CPU chip.  IIRC, a thermal
interrupt is also generated when the temperature drops
back to a normal level.

A spurious interrupt is an interrupt that was raised then
lowered by the device before it could be fully processed
by the APIC.  Hence the apic sees the interrupt but does
not know what device it came from.  For this case the APIC
hardware will assume a vector of 0xff.

Rescheduling, call, and TLB flush interrupts are sent from
one CPU to another per the needs of the OS.  Typically,
their statistics would be used to discover if an
interrupt flood of the given type has been occurring.

Signed-off-by: Joe Korty <[EMAIL PROTECTED]>

Index: 2.6.23-rc1-git7/arch/i386/kernel/apic.c
===
--- 2.6.23-rc1-git7.orig/arch/i386/kernel/apic.c 2007-07-30 19:08:05.0 -0400
+++ 2.6.23-rc1-git7/arch/i386/kernel/apic.c 2007-07-30 19:08:07.0 -0400
@@ -1280,6 +1280,7 @@
printk(KERN_INFO "spurious APIC interrupt on CPU#%d, "
   "should never happen.\n", smp_processor_id());
irq_exit();
+   __get_cpu_var(irq_stat).irq_spur_counts++;
 }
 
 /*
Index: 2.6.23-rc1-git7/arch/i386/kernel/cpu/mcheck/p4.c
===
--- 2.6.23-rc1-git7.orig/arch/i386/kernel/cpu/mcheck/p4.c 2007-07-30 19:08:05.0 -0400
+++ 2.6.23-rc1-git7/arch/i386/kernel/cpu/mcheck/p4.c 2007-07-30 19:08:07.0 -0400
@@ -62,6 +62,7 @@
irq_enter();
vendor_thermal_interrupt(regs);
irq_exit();
+   __get_cpu_var(irq_stat).irq_thermal_counts++;
 }
 
 /* P4/Xeon Thermal regulation detect and init */
Index: 2.6.23-rc1-git7/arch/i386/kernel/irq.c
===
--- 2.6.23-rc1-git7.orig/arch/i386/kernel/irq.c 2007-07-30 19:08:05.0 -0400
+++ 2.6.23-rc1-git7/arch/i386/kernel/irq.c 2007-07-30 19:08:07.0 -0400
@@ -284,14 +284,45 @@
seq_printf(p, "NMI: ");
for_each_online_cpu(j)
seq_printf(p, "%10u ", nmi_count(j));
-   seq_putc(p, '\n');
+   seq_printf(p, "  Non-maskable interrupts\n");
 #ifdef CONFIG_X86_LOCAL_APIC
seq_printf(p, "LOC: ");
for_each_online_cpu(j)
seq_printf(p, "%10u ",
per_cpu(irq_stat,j).apic_timer_irqs);
-   seq_putc(p, '\n');
+   seq_printf(p, "  Local interrupts\n");
 #endif
+#ifdef CONFIG_SMP
+   seq_printf(p, "RES: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_resched_counts);
+   seq_printf(p, "  Rescheduling interrupts\n");
+#endif
+#ifdef CONFIG_SMP
+   seq_printf(p, "CAL: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_call_counts);
+   seq_printf(p, "  function call interrupts\n");
+#endif
+#ifdef CONFIG_SMP
+   seq_printf(p, "TLB: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_tlb_counts);
+   seq_printf(p, "  TLB shootdowns\n");
+#endif
+   seq_printf(p, "TRM: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_thermal_counts);
+   seq_printf(p, "  Thermal event interrupts\n");
+   seq_printf(p, "SPU: ");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ",
+   per_cpu(irq_stat,j).irq_spur_counts);
+   seq_printf(p, "  Spurious interrupts\n");
seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count));
 #if defined(CONFIG_X86_IO_APIC)
seq_printf(p, "MIS: %10u\n", atomic_read(&irq_mis_count));
Index: 2.6.23-rc1-git7/arch/i386/kernel/smp.c

Re: [PATCH] create /proc/all-interrupts

2007-07-30 Thread Joe Korty
On Mon, Jul 30, 2007 at 12:32:06PM -0700, Andrew Morton wrote:
> On Mon, 30 Jul 2007 10:33:17 -0700
> Sven-Thorsten Dietrich <[EMAIL PROTECTED]> wrote:
> > On Thu, 2007-07-26 at 11:56 -0700, H. Peter Anvin wrote: 
> > > Joe Korty wrote:
> > > > Create /proc/all-interrupts for some architectures.

> > Would it make sense to drop this patch into -mm for feedback?
> > 
> 
> It's a lot of code for something which might be useful to someone sometime.
> 
> It's a bit of a crappy changelog too.  I'd at least like to see a list of
> all the new fields.
> 
> It should be OK to add new lines to /proc/interrupts?  That file varies a
> lot between machines and between architectures - as long as the new lines
> have similar layout it is unlikely that anything will break.
> 
> + atomic_inc(&__get_cpu_var(irq_thermal_counts));
> 
> The patch does atomic ops on cpu-local variables.  This isn't needed, and
> is expensive.
> 
> If the field is only ever modified from hard interrupt context then you can
> make the field unsigned long and use plain old `foo++'.
> 
> If the field is modified from both hard-IRQ and from non-IRQ then use a
> local_t and local_inc.
> 
> Or even, given that this is just a statistic and great precision is not
> needed, use unsigned long and f++ even if that _is_ racy.  Because the
> consequences of a race will just be a single lost count, which we don't
> care about enough to add the additional overhead of an atomic op.

Hi Andrew,
Thanks for the comments.  I'll, at least, make the changes you suggested.

(the /proc/interrupts version has the benefit of being smaller too).

Joe
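
As a user-space analogue of the counting question (a pthreads sketch,
not kernel code): the plain increment can come up short under
contention, the atomic one cannot, and for a statistic the shortfall
is harmless:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define LOOPS 1000000

static unsigned long racy;	/* plain ++: may lose counts */
static atomic_ulong exact;	/* atomic: never loses counts */

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < LOOPS; i++) {
		racy++;				/* load + add + store: racy */
		atomic_fetch_add(&exact, 1);	/* one indivisible op */
	}
	return NULL;
}

int main(void)
{
	pthread_t t[4];

	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	printf("racy=%lu exact=%lu (expected %d)\n",
	       racy, (unsigned long)atomic_load(&exact), 4 * LOOPS);
	return 0;
}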


[PATCH] create /proc/all-interrupts

2007-07-26 Thread Joe Korty
Create /proc/all-interrupts for some architectures.

Create a version of /proc/interrupts that displays _every_
IRQ vector, not just those that someone thought might be
interesting, and add an entry in the commentary column
for those vectors which lacked such a comment.

Rationale: /proc/interrupts is not truly useful unless it
displays every IRQ vector, not just those somebody thought
would be interesting.  For example, since /proc/interrupts
does not display the rescheduling interrupt, the occurrence
of rescheduling interrupt floods ends up affecting
latencies, yet without an entry in /proc/interrupts, it
is difficult to discern why latencies are being affected.

Rather than modify /proc/interrupts, this patch creates
a new version of /proc/interrupts, on the off-chance
that adding new lines to /proc/interrupts, and appending
new fields to the end of old lines, might break some
longstanding script.  However, these kinds of changes
traditionally do not affect scripts, so it might be
reasonable to fold /proc/all-interrupts back into
/proc/interrupts.

Signed-off-by: Joe Korty <[EMAIL PROTECTED]>

Index: 2.6.22.1-rt8/arch/i386/kernel/apic.c
===
--- 2.6.22.1-rt8.orig/arch/i386/kernel/apic.c 2007-07-26 11:57:13.0 -0400
+++ 2.6.22.1-rt8/arch/i386/kernel/apic.c 2007-07-26 11:57:14.0 -0400
@@ -1268,6 +1268,8 @@
 {
unsigned long v;
 
+   atomic_inc(&__get_cpu_var(irq_spur_counts));
+
irq_enter();
/*
 * Check if this really is a spurious interrupt and ACK it
@@ -1297,7 +1299,7 @@
apic_write(APIC_ESR, 0);
v1 = apic_read(APIC_ESR);
ack_APIC_irq();
-   atomic_inc(&irq_err_count);
+   atomic_inc(&__get_cpu_var(irq_err_counts));
 
/* Here is what the APIC error bits mean:
   0: Send CS error
Index: 2.6.22.1-rt8/arch/i386/kernel/cpu/mcheck/p4.c
===
--- 2.6.22.1-rt8.orig/arch/i386/kernel/cpu/mcheck/p4.c 2007-07-26 11:57:13.0 -0400
+++ 2.6.22.1-rt8/arch/i386/kernel/cpu/mcheck/p4.c 2007-07-26 11:57:14.0 -0400
@@ -60,6 +60,7 @@
 fastcall void smp_thermal_interrupt(struct pt_regs *regs)
 {
irq_enter();
+   atomic_inc(&__get_cpu_var(irq_thermal_counts));
vendor_thermal_interrupt(regs);
irq_exit();
 }
Index: 2.6.22.1-rt8/arch/i386/kernel/i8259.c
===
--- 2.6.22.1-rt8.orig/arch/i386/kernel/i8259.c 2007-07-26 11:57:13.0 -0400
+++ 2.6.22.1-rt8/arch/i386/kernel/i8259.c 2007-07-26 11:57:14.0 -0400
@@ -209,7 +209,7 @@
printk(KERN_DEBUG "spurious 8259A interrupt: IRQ%d.\n", 
irq);
spurious_irq_mask |= irqmask;
}
-   atomic_inc(&irq_err_count);
+   atomic_inc(&__get_cpu_var(irq_err_counts));
/*
 * Theoretically we do not have to handle this IRQ,
 * but in Linux this does not cause problems and is
Index: 2.6.22.1-rt8/arch/i386/kernel/io_apic.c
===
--- 2.6.22.1-rt8.orig/arch/i386/kernel/io_apic.c 2007-07-26 11:57:13.0 -0400
+++ 2.6.22.1-rt8/arch/i386/kernel/io_apic.c 2007-07-26 11:57:14.0 -0400
@@ -51,7 +51,6 @@
 #include "io_ports.h"
 
 int (*ioapic_renumber_irq)(int ioapic, int irq);
-atomic_t irq_mis_count;
 
 /* Where if anywhere is the i8259 connect in external int mode */
 static struct { int pin, apic; } ioapic_i8259 = { -1, -1 };
@@ -2031,7 +2030,7 @@
ack_APIC_irq();
 
if (!(v & (1 << (i & 0x1f)))) {
-   atomic_inc(&irq_mis_count);
+   atomic_inc(&__get_cpu_var(irq_mis_counts));
spin_lock(&ioapic_lock);
/* mask = 1, trigger = 0 */
__modify_IO_APIC_irq(irq, 0x0001, 0x8000);
Index: 2.6.22.1-rt8/arch/i386/kernel/irq.c
===
--- 2.6.22.1-rt8.orig/arch/i386/kernel/irq.c 2007-07-26 11:57:13.0 -0400
+++ 2.6.22.1-rt8/arch/i386/kernel/irq.c 2007-07-26 13:13:22.0 -0400
@@ -12,6 +12,8 @@
 
 #include <linux/module.h>
 #include <linux/seq_file.h>
+#include <linux/fs.h>
+#include <linux/proc_fs.h>
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
 #include <linux/notifier.h>
@@ -252,15 +254,22 @@
  * Interrupt statistics:
  */
 
-atomic_t irq_err_count;
+DEFINE_PER_CPU(atomic_t, irq_resched_counts);
+DEFINE_PER_CPU(atomic_t, irq_call_counts);
+DEFINE_PER_CPU(atomic_t, irq_spur_counts);
+DEFINE_PER_CPU(atomic_t, irq_tlb_counts);
+DEFINE_PER_CPU(atomic_t, irq_thermal_counts);
+DEFINE_PER_CPU(atomic_t, irq_err_counts);
+DEFINE_PER_CPU(atomic_t, irq_mis_counts);
 
 /*
- * /proc/interrupts printing:
+ * /proc/interrupts and /proc/all-interrupts printing.  Done this
+ * way to preserve the original /proc/interrupts layout.
 

Re: [PATCH 1/2] PTRACE_PEEKDATA consolidation

2007-06-11 Thread Joe Korty
On Tue, Jun 12, 2007 at 12:52:25AM +0400, Alexey Dobriyan wrote:
> On Mon, Jun 11, 2007 at 09:35:17PM +0100, Christoph Hellwig wrote:
> > On Tue, Jun 12, 2007 at 12:40:06AM +0400, Alexey Dobriyan wrote:
> > > Identical implementations of PTRACE_PEEKDATA go into
> > > simple_ptrace_peekdata() function.
> > >
> > > compile-tested on ~half of archs, playing with gdb on x86_64.
> >
> > Looks good.  Why don't you call it generic_ptrace_peekdata instead of
> > simple_ptrace_peekdata, though?
> 
> Because they're simple :) I was probably spoiled by libfs.c .

The problem with names like 'simple_*' is that, as the years go by,
simple code grows to become complex code, and then the prefix 'simple'
doesn't apply anymore.

Regards,
Joe


add_timer_on and CONFIG_NO_HZ

2007-05-30 Thread Joe Korty
Hi Thomas,
 It seems that when add_timer_on() is used to put a timer
on another cpu's timer queue, and CONFIG_NO_HZ=y and the
new timer lands at the front of the timer wheel, that remote
cpu should be made to reprogram its hardware APIC timer.

Regards,
Joe
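
One possible shape for such a fix, sketched only (hypothetical and
untested; idle_cpu() and smp_send_reschedule() are used purely as
illustration):

	/* in add_timer_on(), after the timer is enqueued on cpu's base */
	if (idle_cpu(cpu))
		smp_send_reschedule(cpu);	/* poke the remote cpu so its
						   NO_HZ code re-evaluates and
						   reprograms the next event */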



Re: FW: [RFC] A more general timeout specification

2005-09-01 Thread Joe Korty
On Thu, Sep 01, 2005 at 04:32:49PM +0200, Roman Zippel wrote:
> On Thu, 1 Sep 2005, Joe Korty wrote:

> > Kernel time sucks.  It is just a single clock; it may not have
> > the attributes of the clock that the user really wished to use.
> 
> Wrong. The kernel time is simple and effective for almost all users.
> We are talking about _timeouts_ here, what fancy "attributes" does that 
> need that are just not overkill?

The name should be changed from 'struct timeout' to something like
'struct timeevent'.

Joe


Re: FW: [RFC] A more general timeout specification

2005-09-01 Thread Joe Korty
On Thu, Sep 01, 2005 at 01:50:33AM +0200, Roman Zippel wrote:
> When you convert a user time to kernel time you can
> automatically validate

Kernel time sucks.  It is just a single clock; it may not have
the attributes of the clock that the user really wished to use.

Joe


Re: FW: [RFC] A more general timeout specification

2005-09-01 Thread Joe Korty
On Thu, Sep 01, 2005 at 11:22:32AM +0200, Roman Zippel wrote:
> For a timeout? Please get real.
> If you need more precision, use a dedicated timer API, but don't make the 
> general case more complex for the 99.99% of other users.

Struct timeout is just a struct timespec + a bit for absolute/relative +
a field for clock specification.  What's so complex about that?  It captures
everything needed to specify time, from here to the end of time.

Joe


Re: FW: [RFC] A more general timeout specification

2005-09-01 Thread Joe Korty
On Thu, Sep 01, 2005 at 11:19:51AM +0200, Roman Zippel wrote:

> You still didn't explain what's the point in choosing
> different clock sources for a _timeout_.

Well, if CLOCK_REALTIME is set forward by a minute,
timers & timeout specified against that clock will expire
a minute earlier than expected.  That doesn't happen with
CLOCK_MONOTONIC.  Applications should have the ability
to select what they want to happen in this case (ie,
whether the timeout/timer has to happen at a particular
wall-clock time, say 2pm, or if the interval aspects of
the timer/timeout are more important).  Applications
get this if they have the ability to specify the clock
their timer or timeout is specified against.

Also ... (I am going off the deep end here) ...

The purpose of CLOCK_REALTIME is to track wall clock time.
That means it can be sped up, slowed down, or even be
force-fed a new time to make it match.

The purpose of CLOCK_MONOTONIC is to provide an even,
unchanging progression of advancing time. That is, any two
intervals on this time-line of the same measured length
actually represent, as close as possible, the same length
of time.

CLOCK_MONOTONIC should get adjustments only to bring its
frequency back into line (but currently gets more than this
in Linux).  CLOCK_REALTIME should and does get adjustments
for frequency and then gets further, temporary speedups
or slowdowns to bring its absolute value back into line.

Note that there is no need for the two clocks to track each
other in any way, as Linux currently goes to lengths to do.

I know Linux does not implement the above definition
of CLOCK_MONOTONIC; however, I would like an interface
where, if the day comes that time is properly handled,
applications can take advantage of it.

Joe
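
A minimal user-space demo of that choice (sketch only): the same
absolute five-second sleep is immune to wall-clock steps on
CLOCK_MONOTONIC but not on CLOCK_REALTIME:

#include <stdio.h>
#include <time.h>

/* Sleep to an absolute deadline 5 s out on the given clock.  Step
 * CLOCK_REALTIME forward a minute meanwhile and its sleep ends at
 * once; the CLOCK_MONOTONIC sleep still lasts the full 5 s.
 */
static void sleep_abs(clockid_t clk, const char *name)
{
	struct timespec deadline;

	clock_gettime(clk, &deadline);
	deadline.tv_sec += 5;
	clock_nanosleep(clk, TIMER_ABSTIME, &deadline, NULL);
	printf("%s: done\n", name);
}

int main(void)
{
	sleep_abs(CLOCK_MONOTONIC, "CLOCK_MONOTONIC");
	sleep_abs(CLOCK_REALTIME, "CLOCK_REALTIME");
	return 0;
}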


Re: FW: [RFC] A more general timeout specification

2005-08-31 Thread Joe Korty
On Wed, Aug 31, 2005 at 03:20:03PM -0600, Christopher Friesen wrote:
> Perez-Gonzalez, Inaky wrote:
> >In this structure,
> >the user specifies:
> >whether the time is absolute, or relative to 'now'.
> 
> 
> >Timeout_sleep has a return argument, endtime, which is also in
> >'struct timeout' format.  If the input time was relative, then
> >it is converted to absolute and returned through this argument.
> 
> Wouldn't it make more sense for the endtime to be returned in the same 
> format (relative/absolute) as the original timer was specified?  That 
> way an application can set a new timer for "timeout + SLEEPTIME" and on 
> average it will be reasonably accurate.
> 
> In the proposed method, for endtime to be useful the app needs to check 
> the current time, compare with the endtime, and figure out the delta. 
> If you're going to force the app to do all that work anyway, the app may 
> as well use absolute times.
> 
> Chris

The returned timeout struct has a bit used to mark the value as absolute.  Thus
the caller treats the returned timeout as an opaque cookie that can be
reapplied to the next (or, more likely, the to-be-restarted) timeout.

A general principle is, once a time has been converted to absolute, it
should never be converted back to relative time.  To do so means the
end-time starts to drift from the original end-time.

Regards,
Joe
--
"Money can buy bandwidth, but latency is forever" -- John Mashey
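
In user space the principle looks like this (sketch only): convert to
absolute once, and restart any interrupted sleep against that same
end time:

#include <errno.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec end;

	clock_gettime(CLOCK_MONOTONIC, &end);	/* relative -> absolute, once */
	end.tv_sec += 10;

	/* every restart reuses the same absolute end time, so early
	 * wakeups cannot stretch or shrink the total wait */
	while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &end, NULL)
	       == EINTR)
		;
	puts("deadline reached, no drift");
	return 0;
}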


Re: FW: [RFC] A more general timeout specification

2005-08-31 Thread Joe Korty
On Wed, Aug 31, 2005 at 01:55:54PM -0700, Perez-Gonzalez, Inaky wrote:
> Hi Andrew
> 
> This was developed by Joe Korty <[EMAIL PROTECTED]>, greatly 
> enhancing something I had done before, so I am signing it out 
> (although Joe should too, Joe?).


The fusyn (robust mutexes) project proposes the creation
of a more general data structure, 'struct timeout', for the
specification of timeouts in new services.  In this structure,
the user specifies:

a time, in timespec format.
the clock the time is specified against (eg, CLOCK_MONOTONIC).
whether the time is absolute, or relative to 'now'.

That is, all combinations of useful timeout attributes become
possible.

Also proposed are two new kernel routines for the manipulation
of timeouts:

timeout_validate()
timeout_sleep()

timeout_validate() error-checks the syntax of a timeout
argument and returns either zero or -EINVAL.  By breaking
timeout_validate() out from timeout_sleep(), it becomes possible
to error check the timeout 'far away' from the places in the
code where we would actually do the timeout, as well as being
able to perform such checks only at those places we know the
timeout specification is coming from an unsafe source.

timeout_sleep() puts the caller to sleep until the
specified end time is in the past, as measured against
the given clock, or until the caller is awakened by other
means (such as wake_up_process()).  Like schedule_timeout(),
TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE must be set ahead
of time; if TASK_INTERRUPTIBLE is set then signals will also
break the caller out of the sleep.

timeout_sleep() returns either 0 (returned early) or -ETIMEDOUT
(returned due to timeout).  It is up to the caller to resolve,
in the "returned early" case, why it returned early.

Timeout_sleep has a return argument, endtime, which is also in
'struct timeout' format.  If the input time was relative, then
it is converted to absolute and returned through this argument.
This can be used when an early-terminated service must be
restarted and side effects of the early termination-n-restart
(such as end time drift) are to be avoided.

Signed-off-by: Inaky Perez-Gonzalez <[EMAIL PROTECTED]>
Signed-off-by: Joe Korty <[EMAIL PROTECTED]>
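
A user-space analogue of the normalization step (sketch only; the
field names are illustrative, not the patch's actual layout):

#include <stdio.h>
#include <time.h>

#define TIMEOUT_RELATIVE 0x1000		/* flag bit, as in the proposal */

struct timeout {
	int clock_id;			/* clock, OR'd with flags */
	struct timespec ts;
};

/* if the timeout is relative, convert it to absolute against its clock */
static void timeout_normalize(struct timeout *to)
{
	struct timespec now;
	clockid_t clk = to->clock_id & ~TIMEOUT_RELATIVE;

	if (!(to->clock_id & TIMEOUT_RELATIVE))
		return;			/* already absolute */
	clock_gettime(clk, &now);
	to->ts.tv_sec += now.tv_sec;
	to->ts.tv_nsec += now.tv_nsec;
	if (to->ts.tv_nsec >= 1000000000L) {	/* carry nanoseconds */
		to->ts.tv_nsec -= 1000000000L;
		to->ts.tv_sec++;
	}
	to->clock_id = clk;		/* record that it is now absolute */
}

int main(void)
{
	struct timeout to = { CLOCK_MONOTONIC | TIMEOUT_RELATIVE, { 2, 0 } };

	timeout_normalize(&to);		/* such an endtime never drifts */
	printf("absolute end: %ld.%09ld\n", (long)to.ts.tv_sec, to.ts.tv_nsec);
	return 0;
}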




 2.6.12-rc4-jak/include/linux/time.h|6 +
 2.6.12-rc4-jak/include/linux/timeout.h |   48 
 2.6.12-rc4-jak/kernel/posix-timers.c   |7 +
 2.6.12-rc4-jak/kernel/timer.c  |  184 +
 4 files changed, 245 insertions(+)

diff -puNa include/linux/time.h~a.more.flexible.timeout.approach include/linux/time.h
--- 2.6.12-rc4/include/linux/time.h~a.more.flexible.timeout.approach 2005-05-18 13:53:14.204417169 -0400
+++ 2.6.12-rc4-jak/include/linux/time.h 2005-05-18 13:53:14.212416002 -0400
@@ -25,6 +25,8 @@ struct timezone {
int tz_dsttime; /* type of dst correction */
 };
 
+#include 
+
 #ifdef __KERNEL__
 
 /* Parameters used to convert the timespec values */
@@ -103,6 +105,10 @@ struct itimerval;
 extern int do_setitimer(int which, struct itimerval *value, struct
itimerval *ovalue);
 extern int do_getitimer(int which, struct itimerval *value);
 extern void getnstimeofday (struct timespec *tv);
+extern long clock_gettime(int which, struct timespec *tp);
+
+extern int FASTCALL(abs_timespec_to_abs_jiffies (clockid_t clock, const
struct timespec *tp, unsigned long *jp));
+extern int FASTCALL(rel_to_abs_timespec(clockid_t clock, const struct
timespec *tsrel, struct timespec *tsabs));
 
 extern struct timespec timespec_trunc(struct timespec t, unsigned
gran);
 
diff -puNa /dev/null include/linux/timeout.h
--- /dev/null   2004-06-24 14:04:38.0 -0400
+++ 2.6.12-rc4-jak/include/linux/timeout.h  2005-05-18
13:53:14.212416002 -0400
@@ -0,0 +1,48 @@
+/*
+ * Extended timeout specification
+ *
+ * (C) 2002-2005 Intel Corp
+ * Inaky Perez-Gonzalez <[EMAIL PROTECTED]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Generic extended timeout specification.  Broken out by Joe Korty
+ * <[EMAIL PROTECTED]> from linux/time.h so that it can be included
+ * by userspace applications in conjunction with #include "time.h".
+ */
+
+#ifndef _LINUX_TIMEOUT_H
+#define _LINUX_TIMEOUT_H
+
+/* 'struct timeout' flag values.  OR these into clock_id along with
+ * a clock specification such as CLOCK_REALTIME or CLOCK_MONOTONIC.
+ */
+enum {
+   TIMEOUT_RELATIVE   = 0x1000,/* relative timeout */
+
+   TIMEOUT_FLAGS_MASK = 0xf000,/* flags mask for
clock_id */
+   TIMEOUT_CLOCK_MASK = 0x0fff,/* clock mask for
clock_id */
+};
+
+/* Magic values a 'struct timeout' pointer can have */
+
+#define TIMEOUT_MAX((struct timeout *) ~0UL) /* never time out */
+#define TIMEOUT_NONE   ((struct timeout *) 0UL)  /* time out
immediately */
+
+/**
+ * struct timeout - general timeout specification
+ *
+ * @clock_id: which clock sour


[PATCH] add EOWNERDEAD and ENOTRECOVERABLE version 2

2005-04-13 Thread Joe Korty

Add EOWNERDEAD and ENOTRECOVERABLE to all architectures.
This is to support the upcoming patches for robust mutexes.

We normally don't reserve parts of the name/number space
for external patches, but robust mutexes are sufficiently
popular and important to justify it in this case.

Signed-off-by: Joe Korty <[EMAIL PROTECTED]>


 2.6.12-rc2-jak/include/asm-alpha/errno.h   |4 
 2.6.12-rc2-jak/include/asm-generic/errno.h |4 
 2.6.12-rc2-jak/include/asm-mips/errno.h|4 
 2.6.12-rc2-jak/include/asm-parisc/errno.h  |4 
 2.6.12-rc2-jak/include/asm-sparc/errno.h   |4 
 2.6.12-rc2-jak/include/asm-sparc64/errno.h |4 
 6 files changed, 24 insertions(+)

diff -puNa include/asm-generic/errno.h~owner.notrecoverable.errnos include/asm-generic/errno.h
--- 2.6.12-rc2/include/asm-generic/errno.h~owner.notrecoverable.errnos  2005-04-12 09:54:38.0 -0400
+++ 2.6.12-rc2-jak/include/asm-generic/errno.h  2005-04-13 09:58:26.0 -0400
@@ -102,4 +102,8 @@
 #define EKEYREVOKED     128     /* Key has been revoked */
 #define EKEYREJECTED    129     /* Key was rejected by service */
 
+/* for robust mutexes */
+#define EOWNERDEAD      130     /* Owner died */
+#define ENOTRECOVERABLE 131     /* State not recoverable */
+
 #endif
diff -puNa include/asm-alpha/errno.h~owner.notrecoverable.errnos include/asm-alpha/errno.h
--- 2.6.12-rc2/include/asm-alpha/errno.h~owner.notrecoverable.errnos  2005-04-12 10:04:36.0 -0400
+++ 2.6.12-rc2-jak/include/asm-alpha/errno.h  2005-04-13 09:58:41.0 -0400
@@ -116,4 +116,8 @@
 #define EKEYREVOKED     134     /* Key has been revoked */
 #define EKEYREJECTED    135     /* Key was rejected by service */
 
+/* for robust mutexes */
+#define EOWNERDEAD      136     /* Owner died */
+#define ENOTRECOVERABLE 137     /* State not recoverable */
+
 #endif
diff -puNa include/asm-mips/errno.h~owner.notrecoverable.errnos include/asm-mips/errno.h
--- 2.6.12-rc2/include/asm-mips/errno.h~owner.notrecoverable.errnos  2005-04-12 10:04:36.0 -0400
+++ 2.6.12-rc2-jak/include/asm-mips/errno.h  2005-04-13 09:59:17.0 -0400
@@ -115,6 +115,10 @@
 #define EKEYREVOKED     163     /* Key has been revoked */
 #define EKEYREJECTED    164     /* Key was rejected by service */
 
+/* for robust mutexes */
+#define EOWNERDEAD      165     /* Owner died */
+#define ENOTRECOVERABLE 166     /* State not recoverable */
+
 #define EDQUOT          1133    /* Quota exceeded */
 
 #ifdef __KERNEL__
diff -puNa include/asm-parisc/errno.h~owner.notrecoverable.errnos include/asm-parisc/errno.h
--- 2.6.12-rc2/include/asm-parisc/errno.h~owner.notrecoverable.errnos  2005-04-12 10:04:36.0 -0400
+++ 2.6.12-rc2-jak/include/asm-parisc/errno.h  2005-04-13 09:59:24.0 -0400
@@ -115,5 +115,9 @@
 #define ENOTSUP         252     /* Function not implemented (POSIX.4 / HPUX) */
 #define ECANCELLED      253     /* aio request was canceled before complete (POSIX.4 / HPUX) */
 
+/* for robust mutexes */
+#define EOWNERDEAD      254     /* Owner died */
+#define ENOTRECOVERABLE 255     /* State not recoverable */
+
 
 #endif
diff -puNa include/asm-sparc/errno.h~owner.notrecoverable.errnos include/asm-sparc/errno.h
--- 2.6.12-rc2/include/asm-sparc/errno.h~owner.notrecoverable.errnos  2005-04-12 10:04:36.0 -0400
+++ 2.6.12-rc2-jak/include/asm-sparc/errno.h  2005-04-13 09:59:28.0 -0400
@@ -107,4 +107,8 @@
 #define EKEYREVOKED     130     /* Key has been revoked */
 #define EKEYREJECTED    131     /* Key was rejected by service */
 
+/* for robust mutexes */
+#define EOWNERDEAD      132     /* Owner died */
+#define ENOTRECOVERABLE 133     /* State not recoverable */
+
 #endif
diff -puNa include/asm-sparc64/errno.h~owner.notrecoverable.errnos include/asm-sparc64/errno.h
--- 2.6.12-rc2/include/asm-sparc64/errno.h~owner.notrecoverable.errnos  2005-04-12 10:04:36.0 -0400
+++ 2.6.12-rc2-jak/include/asm-sparc64/errno.h  2005-04-13 09:59:33.0 -0400
@@ -107,4 +107,8 @@
 #define EKEYREVOKED     130     /* Key has been revoked */
 #define EKEYREJECTED    131     /* Key was rejected by service */
 
+/* for robust mutexes */
+#define EOWNERDEAD      132     /* Owner died */
+#define ENOTRECOVERABLE 133     /* State not recoverable */
+
 #endif /* !(_SPARC64_ERRNO_H) */


Re: FUSYN and RT

2005-04-12 Thread Joe Korty
On Tue, Apr 12, 2005 at 11:15:02AM -0700, Daniel Walker wrote:

> It seems like these two locks are going to interact on a very limited
> basis. Fusyn will be the user space mutex, and the RT mutex is only in
> the kernel. You can't lock an RT mutex and hold it, then lock a Fusyn
> mutex (anyone disagree?). That is assuming Fusyn stays in user space.

Well yeah, but you could lock a fusyn, then invoke a system call which
locks a kernel semaphore.
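
A two-line illustration of that nesting (names hypothetical; any syscall
that takes a kernel lock internally would do):

    pthread_mutex_lock(&fusyn_lock);    /* user-space fusyn mutex held...     */
    write(fd, buf, len);                /* ...while write() takes i_sem, a    */
    pthread_mutex_unlock(&fusyn_lock);  /* kernel semaphore, under the covers */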

Regards,
Joe
--
"Money can buy bandwidth, but latency is forever" -- John Mashey


[PATCH] add EOWNERDEAD and ENOTRECOVERABLE

2005-04-12 Thread Joe Korty

Hi Andrew,
 This patch adds EOWNERDEAD and ENOTRECOVERABLE to all
architectures.  Though there is nothing in the kernel
that uses them yet, I know of two patches in development,
one by Intel and the other by Bull, that add robust mutex
support to pthread_mutex*.

Robust mutexes, by de-facto industry convention, return
EOWNERDEAD when the owner of a mutex dies, and
ENOTRECOVERABLE if the new owner decides that it is not
able to recover from the dead state and so wants to mark
the mutex unrecoverable.
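
These are exactly the semantics POSIX later standardized.  A hedged sketch
in terms of that eventual glibc API -- not the Intel/Bull patches under
discussion here -- with repair_shared_state() a hypothetical application hook:

    #include <pthread.h>
    #include <errno.h>

    extern int repair_shared_state(void);       /* hypothetical recovery hook */

    int lock_robust(pthread_mutex_t *m)
    {
            int err = pthread_mutex_lock(m);
            if (err == EOWNERDEAD) {            /* owner died; we now hold the lock */
                    if (repair_shared_state() == 0)
                            return pthread_mutex_consistent(m); /* usable again */
                    pthread_mutex_unlock(m);    /* never marked consistent: later
                                                 * lockers get ENOTRECOVERABLE */
                    return ENOTRECOVERABLE;
            }
            return err;
    }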

There is interest in robust mutexes in Linux, as they are
a well established tool when writing high availability
applications on non-Linux platforms.

Even though there are kernel components to the robust mutex
patches, the exact kernel ABI can be easily changed or
even completely replaced, without affecting applications,
while the patches are still in development.  To achieve
this immunity, the applications need only to access robust
services through the pthreads library and they must link
only with the dynamic version of the library.  This works
only because pthread robust mutexes are a de-facto
standard, enforced by standard usage by long established
applications, which any implementation of robust mutexes
is required to match if it is to be accepted.

However, one piece of the ABI that does leak through is
the value of EOWNERDEAD and ENOTRECOVERABLE.  If these
values could be fixed then application writers would feel
more comfortable using these patches while they are still
in development.  In addition, if the patches are never
accepted into the standard kernel, but live forever in
various high availability vendor kernels as a specialty
item, then it gives their users an unchanging ABI that they
can live with -- even when they migrate their application
binaries to a competing high availability Linux kernel.

I know that it is rare for an unused patch to be accepted;
however, it has happened at least once before when there
was need -- e.g., the security hooks patch -- so this patch
request may not be completely out of line.

Regards,
Joe

i386 compile tested.

Signed-off-by: Joe Korty <[EMAIL PROTECTED]>


 2.6.12-rc2-jak/include/asm-alpha/errno.h   |3 +++
 2.6.12-rc2-jak/include/asm-generic/errno.h |3 +++
 2.6.12-rc2-jak/include/asm-mips/errno.h|3 +++
 2.6.12-rc2-jak/include/asm-parisc/errno.h  |3 +++
 2.6.12-rc2-jak/include/asm-sparc/errno.h   |3 +++
 2.6.12-rc2-jak/include/asm-sparc64/errno.h |3 +++
 6 files changed, 18 insertions(+)

diff -puNa include/asm-generic/errno.h~owner.notrecoverable.errnos include/asm-generic/errno.h
--- 2.6.12-rc2/include/asm-generic/errno.h~owner.notrecoverable.errnos  2005-04-12 09:54:38.0 -0400
+++ 2.6.12-rc2-jak/include/asm-generic/errno.h  2005-04-12 11:16:50.681480153 -0400
@@ -102,4 +102,7 @@
 #define EKEYREVOKED     128     /* Key has been revoked */
 #define EKEYREJECTED    129     /* Key was rejected by service */
 
+#define EOWNERDEAD      130     /* Owner died */
+#define ENOTRECOVERABLE 131     /* State not recoverable */
+
 #endif
diff -puNa include/asm-alpha/errno.h~owner.notrecoverable.errnos include/asm-alpha/errno.h
--- 2.6.12-rc2/include/asm-alpha/errno.h~owner.notrecoverable.errnos  2005-04-12 10:04:36.0 -0400
+++ 2.6.12-rc2-jak/include/asm-alpha/errno.h  2005-04-12 11:16:17.548396780 -0400
@@ -116,4 +116,7 @@
 #define EKEYREVOKED     134     /* Key has been revoked */
 #define EKEYREJECTED    135     /* Key was rejected by service */
 
+#define EOWNERDEAD      136     /* Owner died */
+#define ENOTRECOVERABLE 137     /* State not recoverable */
+
 #endif
diff -puNa include/asm-mips/errno.h~owner.notrecoverable.errnos include/asm-mips/errno.h
--- 2.6.12-rc2/include/asm-mips/errno.h~owner.notrecoverable.errnos  2005-04-12 10:04:36.0 -0400
+++ 2.6.12-rc2-jak/include/asm-mips/errno.h  2005-04-12 11:16:29.262658422 -0400
@@ -115,6 +115,9 @@
 #define EKEYREVOKED     163     /* Key has been revoked */
 #define EKEYREJECTED    164     /* Key was rejected by service */
 
+#define EOWNERDEAD      165     /* Owner died */
+#define ENOTRECOVERABLE 166     /* State not recoverable */
+
 #define EDQUOT          1133    /* Quota exceeded */
 
 #ifdef __KERNEL__
diff -puNa include/asm-parisc/errno.h~owner.notrecoverable.errnos include/asm-parisc/errno.h
--- 2.6.12-rc2/include/asm-parisc/errno.h~owner.notrecoverable.errnos  2005-04-12 10:04:36.0 -0400
+++ 2.6.12-rc2-jak/include/asm-parisc/errno.h  2005-04-12 11:14:19.353941346 -0400
@@ -115,5 +115,8 @@
 #define ENOTSUP         252     /* Function not implemented (POSIX.4 / HPUX) */
 #define ECANCELLED      253     /* aio request was canceled before complete (POSIX.4 / HPUX) */
 
+#define EOWNERDEAD      254     /* Owner died */
+#define ENOTRECOVERABLE 255     /* State not recoverable */
+
 
 #endif
diff -puNa include/asm-sparc/errno.h


Re: x86 TSC time warp puzzle

2005-04-04 Thread Joe Korty
On Mon, Apr 04, 2005 at 09:59:22AM +0100, [EMAIL PROTECTED] wrote:
> Jonathan Lundell wrote:
> >Well, not actually a time warp, though it feels like one.
> >
> >I'm doing some real-time bit-twiddling in a driver, using the TSC to 
> >measure out delays on the order of hundreds of nanoseconds. Because I 
> >want an upper limit on the delay, I disable interrupts around it.
> >
> >The logic is something like:
> >
> >local_irq_save
> >out(set a bit)
> >t0 = TSC
> >wait while (t = (TSC - t0)) < delay_time
> >out(clear the bit)
> >local_irq_restore
> >
> > From time to time, when I exit the delay, t is *much* bigger than 
> >delay_time. If delay_time is, say, 300ns, t is usually no more than 
> >325ns. But every so often, t can be 2000, or 1, or even much higher.
> >
> >The value of t seems to depend on the CPU involved, The worst case is 
> >with an Intel 915GV chipset, where t approaches 500 microseconds (!).


Add nmi_watchdog=0 to your boot command line.
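
The reason this helps, sketched in kernel style (set_device_bit() and
clear_device_bit() are stand-ins for the driver's outs; rdtsc() as in
current arch/x86): local_irq_save() masks ordinary interrupts, but NMIs
are non-maskable, and the NMI watchdog fires periodically on every CPU,
so its handler's cycles land inside the timed region:

    unsigned long flags;
    u64 t0, t;

    local_irq_save(flags);      /* masks IRQs, but not NMIs */
    set_device_bit();
    t0 = rdtsc();
    do {
            t = rdtsc() - t0;   /* an NMI here inflates t by the
                                 * watchdog handler's runtime */
    } while (t < delay_cycles);
    clear_device_bit();
    local_irq_restore(flags);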


Re: x86 TSC time warp puzzle

2005-04-04 Thread Joe Korty
On Mon, Apr 04, 2005 at 09:59:22AM +0100, [EMAIL PROTECTED] wrote:
 Jonathan Lundell wrote:
 Well, not actually a time warp, though it feels like one.
 
 I'm doing some real-time bit-twiddling in a driver, using the TSC to 
 measure out delays on the order of hundreds of nanoseconds. Because I 
 want an upper limit on the delay, I disable interrupts around it.
 
 The logic is something like:
 
 local_irq_save
 out(set a bit)
 t0 = TSC
 wait while (t = (TSC - t0))  delay_time
 out(clear the bit)
 local_irq_restore
 
  From time to time, when I exit the delay, t is *much* bigger than 
 delay_time. If delay_time is, say, 300ns, t is usually no more than 
 325ns. But every so often, t can be 2000, or 1, or even much higher.
 
 The value of t seems to depend on the CPU involved, The worst case is 
 with an Intel 915GV chipset, where t approaches 500 microseconds (!).


Add nmi_watchdog=0 to your boot command line.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Joe Korty
On Tue, Feb 22, 2005 at 01:30:27PM -0800, Linus Torvalds wrote:
> 
> We really have this already, and it's called "current->preempt". It 
> handles any lock at all, and doesn't add yet another special case to all 
> the architectures.
> 
> Just do
> 
>   repeat:
>   down_read(&current->mm->mmap_sem);
>   get_futex_key(...) etc.
>   queue_me(...) etc.
>   inc_preempt_count();
>   ret = get_user(...);
>   dec_preempt_count();

Perhaps this should be preempt_disable & preempt_enable.

Otherwise, a preempt attempt in get_user would not be seen
until some future preempt_enable was executed.
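
A sketch of the suggestion applied to the sequence above (kernel context,
futex internals elided; the point is that preempt_enable() ends with a
resched check, which a bare dec_preempt_count() lacks):

    repeat:
            down_read(&current->mm->mmap_sem);
            /* get_futex_key(...), queue_me(...) etc. */
            preempt_disable();              /* inc_preempt_count() plus barrier */
            ret = get_user(val, uaddr);     /* fault path sees the atomic region
                                             * and bails out instead of sleeping */
            preempt_enable();               /* dec, then act on any preemption
                                             * request that arrived meanwhile */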

Regards,
Joe
--
"Money can buy bandwidth, but latency is forever" -- John Mashey



memset argument order misuses

2005-02-12 Thread Joe Korty
Hi Andrew,
A simple 'grep memset.*\<0);' shows argument order errors in several uses
of memset.
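
For reference, the bug class the grep catches: memset() takes (pointer,
fill byte, count), so swapping the last two arguments asks for a fill
value of sizeof(...) and a count of zero -- the call silently writes
nothing:

    struct blkpg_partition bpart;

    memset(&bpart, sizeof(bpart), 0);   /* wrong: count is 0, writes nothing */
    memset(&bpart, 0, sizeof(bpart));   /* right: zeroes the whole struct    */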

This grep was inspired by Al Viro's recent patch, megaraid_mbox fix,
which fixed this problem in the megaraid driver.

Completely untested.

Regards,
Joe
--
"Money can buy bandwidth, but latency is forever" -- John Mashey



diff -Nura base/drivers/s390/block/dasd_genhd.c new/drivers/s390/block/dasd_genhd.c
--- base/drivers/s390/block/dasd_genhd.c  2004-12-24 16:35:24.0 -0500
+++ new/drivers/s390/block/dasd_genhd.c  2005-02-12 21:55:48.546192009 -0500
@@ -149,8 +149,8 @@
 	 * Can't call delete_partitions directly. Use ioctl.
 	 * The ioctl also does locking and invalidation.
 	 */
-	memset(&bpart, sizeof(struct blkpg_partition), 0);
-	memset(&barg, sizeof(struct blkpg_ioctl_arg), 0);
+	memset(&bpart, 0, sizeof(struct blkpg_partition));
+	memset(&barg, 0, sizeof(struct blkpg_ioctl_arg));
 	barg.data = &bpart;
 	barg.op = BLKPG_DEL_PARTITION;
 	for (bpart.pno = device->gdp->minors - 1; bpart.pno > 0; bpart.pno--)
diff -Nura base/drivers/s390/cio/cmf.c new/drivers/s390/cio/cmf.c
--- base/drivers/s390/cio/cmf.c  2004-12-24 16:33:48.0 -0500
+++ new/drivers/s390/cio/cmf.c  2005-02-12 21:56:08.430256458 -0500
@@ -526,7 +526,7 @@
 	time = get_clock() - cdev->private->cmb_start_time;
 	spin_unlock_irqrestore(cdev->ccwlock, flags);
 
-	memset(data, sizeof(struct cmbdata), 0);
+	memset(data, 0, sizeof(struct cmbdata));
 
 	/* we only know values before device_busy_time */
 	data->size = offsetof(struct cmbdata, device_busy_time);
@@ -736,7 +736,7 @@
 	time = get_clock() - cdev->private->cmb_start_time;
 	spin_unlock_irqrestore(cdev->ccwlock, flags);
 
-	memset (data, sizeof(struct cmbdata), 0);
+	memset (data, 0, sizeof(struct cmbdata));
 
 	/* we only know values before device_busy_time */
 	data->size = offsetof(struct cmbdata, device_busy_time);
diff -Nura base/drivers/s390/cio/css.c new/drivers/s390/cio/css.c
--- base/drivers/s390/cio/css.c  2005-02-12 21:51:28.0 -0500
+++ new/drivers/s390/cio/css.c  2005-02-12 21:56:20.066538550 -0500
@@ -527,7 +527,7 @@
 	new_slow_sch = kmalloc(sizeof(struct slow_subchannel), GFP_ATOMIC);
 	if (!new_slow_sch)
 		return -ENOMEM;
-	memset(new_slow_sch, sizeof(struct slow_subchannel), 0);
+	memset(new_slow_sch, 0, sizeof(struct slow_subchannel));
 	new_slow_sch->schid = schid;
 	spin_lock_irqsave(&slow_subchannel_lock, flags);
 	list_add_tail(&new_slow_sch->slow_list, &slow_subchannels_head);

