Re: 4BSD Scheduler Problem on 5.3

2005-04-01 Thread John Baldwin
On Thursday 31 March 2005 08:03 pm, Robert Watson wrote:
 On Thu, 31 Mar 2005, John Baldwin wrote:
  On Thursday 31 March 2005 03:38 pm, William Michael Grim wrote:
   Hello.
  
   I keep having kernel panics every couple weeks on my system.  It occurs
   in the sched_switch() function.  There are several other statements in
   the backtrace involving ??; what are those?
  
   I have attached the dump output and system info to this email.  Any
   feedback would be helpful.
  
   Thanks so much for your help.
 
  The real trace ends with Xint0x80_syscall().  The rest after that is
  garbage memory.  Your real problem is in exit1() or ttywakeup().  Since
  ttywakeup() doesn't call exit1() (AFAIK), the exit1() frame is probably
  bogus (gdb doesn't grok trapframes maybe?) and the real bug is a NULL
  pointer deref in ttywakeup().  Perhaps it's a bug in the ptc driver?
  (ptcopen is in the trace).  What is the ptc driver anyway?

 I think we have a race in -STABLE relating to tty wakeups and
 open/close/device teardown.  I've seen a panic relating to sio during a
 tty close on RELENG_5 about 5-6 months ago, but was unable to get a dump.
 Scott has since fixed dumps with twe, but I've not yet been able to get
 the bug to recur.  I'll give it another try.

Sounds very plausible.  Does Poul-Henning have any ideas?

-- 
John Baldwin [EMAIL PROTECTED]  http://www.FreeBSD.org/~jhb/
Power Users Use the Power to Serve  =  http://www.FreeBSD.org
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: 4BSD Scheduler Problem on 5.3

2005-04-01 Thread Robert Watson

On Fri, 1 Apr 2005, Poul-Henning Kamp wrote:

 In message [EMAIL PROTECTED], John Baldwin writes:
 
  I think we have a race in -STABLE relating to tty wakeups and
  open/close/device teardown.  I've seen a panic relating to sio during a
  tty close on RELENG_5 about 5-6 months ago, but was unable to get a dump.
  Scott has since fixed dumps with twe, but I've not yet been able to get
  the bug to recur.  I'll give it another try.
 
 Sounds very plausible.  Does Poul-Henning have any ideas?
 
 Is this before or after my tty changes?
 
 There is a general nastiness about ttys/sessions/exit which I have never
 really felt comfortable about.  My hope is that I have solved it by
 refcounting the tty structure.
 
 So if this is before my changes:  Yeah, known (but rare) issue.
 
 If after my changes: D**N! 
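
For illustration, the refcounting idea mentioned above can be sketched
roughly as follows.  This is a single-threaded sketch with hypothetical
names (demo_tty, demo_tty_ref, demo_tty_rel), not the actual FreeBSD tty
code; a kernel version would also need the count protected by a lock or
atomic operations.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a tty; NOT the FreeBSD struct tty. */
struct demo_tty {
	int	 refcnt;	/* holders, e.g. device node, session, an open */
	int	*freed;		/* set to 1 on destruction so callers can see it */
};

/* Allocate an object that starts with one reference (the creator's). */
static struct demo_tty *
demo_tty_alloc(int *freed)
{
	struct demo_tty *tp = malloc(sizeof(*tp));

	assert(tp != NULL);
	tp->refcnt = 1;
	tp->freed = freed;
	*freed = 0;
	return (tp);
}

/* Take an extra reference before handing the pointer to another holder. */
static void
demo_tty_ref(struct demo_tty *tp)
{
	assert(tp->refcnt > 0);
	tp->refcnt++;
}

/* Drop one reference; returns 1 if this call destroyed the object. */
static int
demo_tty_rel(struct demo_tty *tp)
{
	assert(tp->refcnt > 0);
	if (--tp->refcnt > 0)
		return (0);
	*tp->freed = 1;
	free(tp);
	return (1);
}
```

The point of the pattern is that a close path dropping its reference
cannot free the structure while another path, such as an open in
progress, still holds one.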

The instance of the panic I saw was in RELENG_5 in January when using a
serial console.  Here's a copy of the e-mail I sent to stable@ when it
occurred.  It's a little weak on the debugging side because I couldn't get
a dump and didn't have a kernel with symbols easily on hand, but my
reading was that there was a race relating to open/close and device
events.

Robert N M Watson

Date: Sun, 23 Jan 2005 16:51:14 +0000 (GMT)
From: Robert Watson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: NULL pointer deref in sioopen() suggests a close/open race on sio device?


Ran into the following panic on a 5-STABLE box this morning, which
occurred after hitting Ctrl-D to close a login session on a serial console
(ttyd0 at 9600 bps):

login: Jan 23 10:43:27 fledge login: 2 LOGIN FAILURES ON ttyd0


Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x1c
fault code              = supervisor write, page not present
instruction pointer     = 0x8:0xc051537b
stack pointer           = 0x10:0xe7345988
frame pointer           = 0x10:0xe7345994
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 45092 (getty)
[thread pid 45092 tid 100201 ]
Stopped at      knote+0x27:     cmpxchgl        %ecx,0x1c(%edx)
db> show pcpu
cpuid        = 0
curthread    = 0xc290d190: pid 45092 getty
curpcb       = 0xe7345da0
fpcurthread  = 0xc290d190: pid 45092 getty
idlethread   = 0xc22644b0: pid 11 idle
APIC ID      = 0
currentldt   = 0x30
db> trace
Tracing pid 45092 tid 100201 td 0xc290d190
knote(c264e098,0,0,c290d190,e73459c4) at knote+0x27
ttwwakeup(c264e000) at ttwwakeup+0xc8
comstart(c264e000) at comstart+0x385
comparam(c264e000,c264e0a4,c264e000,3,0) at comparam+0x253
sioopen(c079f060,3,2000,c290d190,c078e6a0) at sioopen+0x1df
spec_open(e7345a84,e7345b40,c058d585,e7345a84,180) at spec_open+0x2b6
spec_vnoperate(e7345a84) at spec_vnoperate+0x13
vn_open_cred(e7345be4,e7345ce4,c08,c2261d80,0) at vn_open_cred+0x419
vn_open(e7345be4,e7345ce4,c08,0,c4289b58) at vn_open+0x1e
kern_open(c290d190,804f8e0,0,3,bfbfee18) at kern_open+0xe3
open(c290d190,e7345d14,3,0,292) at open+0x18
syscall(2f,2f,2f,804f8e0,0) at syscall+0x27b
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (5, FreeBSD ELF32, open), eip = 0x280d155b, esp = 0xbfbfedec,
ebp = 0xbfbfee18 ---

The ps list is a bit boring, but the primary interesting thing is that it
looks like the close was going on in one thread just about when the sio
swi was scheduled to run also:

db> ps
  pid   proc     uarea    uid  ppid  pgrp  flag     stat     wmesg  wchan  cmd
45092 c6762388 e7387000     0     1     1  0004000  [CPU 0]                getty ...
  132 c235954c e4fbf000     0     0     0  20c      [RUNQ]                 swi5: clock sio

I didn't have a kernel with debugging symbols on-hand, but the above
address in knote() is a cmpxchg early in the function, which means it's
likely the conditional call to mtx_lock() hitting a NULL mutex pointer for
kl_lock.  This in turn suggests that something has called ttyrel/tty_close
on the TTY in a race with the open, or otherwise NULL'd that pointer via
knlist_destroy().  Anyone have any pointers on this one?  The TTY code is
not my forte... 
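
As an aside on reading the fault address, here is a minimal sketch,
assuming a hypothetical structure (demo_lock, which is not the real
struct mtx) whose lock word sits 0x1c bytes into the object.  Accessing
that member through a NULL base pointer touches address 0x1c, which is
consistent with "fault virtual address = 0x1c" and the 0x1c(%edx)
operand in the trap above.  The names and the layout are illustrative
assumptions only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a lock structure; NOT the real struct mtx.
 * Fixed-width fields keep the layout the same on 32- and 64-bit hosts.
 */
struct demo_lock {
	uint32_t header[7];	/* 7 * 4 = 28 (0x1c) bytes before the lock word */
	uint32_t lock_word;	/* the word an atomic cmpxchg would operate on */
};

/* Compile-time check that the lock word really is at offset 0x1c. */
static_assert(offsetof(struct demo_lock, lock_word) == 0x1c,
    "lock_word is expected at offset 0x1c");

/*
 * Address the CPU would touch when accessing lock_word through `base`.
 * Done with integer arithmetic so nothing is actually dereferenced.
 */
static uintptr_t
demo_lock_word_addr(uintptr_t base)
{
	return (base + offsetof(struct demo_lock, lock_word));
}
```

With a base of 0 (a NULL pointer) demo_lock_word_addr() yields 0x1c, so
a small non-zero fault address like this one usually points at a member
access through a NULL structure pointer rather than a wild pointer.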

Robert N M Watson



Re: 4BSD Scheduler Problem on 5.3

2005-04-01 Thread William Michael Grim
I'm not sure if it's before or after your changes, Poul-Henning.  If there is
a newer -RELEASE I can upgrade to, I will do that.  I don't really want to
upgrade to -STABLE, but I will also do that to relieve the issue if necessary.
Just give me a recommendation on whether to update RELEASE beyond -p1 or to
go ahead and update to -STABLE.

I appreciate all the help you have given.

Thanks much.

On Fri, Apr 01, 2005 at 08:52:38PM +0200, Poul-Henning Kamp wrote:
 In message [EMAIL PROTECTED], John Baldwin writes:
 
  I think we have a race in -STABLE relating to tty wakeups and
  open/close/device teardown.  I've seen a panic relating to sio during a
  tty close on RELENG_5 about 5-6 months ago, but was unable to get a dump.
  Scott has since fixed dumps with twe, but I've not yet been able to get
  the bug to recur.  I'll give it another try.
 
 Sounds very plausible.  Does Poul-Henning have any ideas?
 
 Is this before or after my tty changes?
 
 There is a general nastiness about ttys/sessions/exit which I have
 never really felt comfortable about.  My hope is that I have solved
 it by refcounting the tty structure.
 
 So if this is before my changes:  Yeah, known (but rare) issue.
 
 If after my changes: D**N!
 
 -- 
 Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
 [EMAIL PROTECTED] | TCP/IP since RFC 956
 FreeBSD committer   | BSD since 4.3-tahoe
 Never attribute to malice what can adequately be explained by incompetence.

-- 
William Michael Grim
Student, Southern Illinois University at Edwardsville
Unix Network Administrator, SIUE, Computer Science dept.
Phone: (217) 341-6552
Email: [EMAIL PROTECTED]


Re: 4BSD Scheduler Problem on 5.3

2005-03-31 Thread John Baldwin
On Thursday 31 March 2005 03:38 pm, William Michael Grim wrote:
 Hello.

 I keep having kernel panics every couple weeks on my system.  It occurs in
 the sched_switch() function.  There are several other statements in the
 backtrace involving ??; what are those?

 I have attached the dump output and system info to this email.  Any
 feedback would be helpful.

 Thanks so much for your help.

The real trace ends with Xint0x80_syscall().  The rest after that is garbage 
memory.  Your real problem is in exit1() or ttywakeup().  Since ttywakeup() 
doesn't call exit1() (AFAIK), the exit1() frame is probably bogus (gdb 
doesn't grok trapframes maybe?) and the real bug is a NULL pointer deref in 
ttywakeup().  Perhaps it's a bug in the ptc driver?  (ptcopen is in the 
trace).  What is the ptc driver anyway?
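
A rough sketch of the "stop reading at the syscall entry" rule described
above, assuming the backtrace is available as an array of text lines
with the innermost frame first.  The helper name trusted_frames() and
the way frames are represented are assumptions made for illustration,
not part of gdb or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return how many leading frames to trust: everything up to and
 * including the first frame whose text mentions `entry` (for example
 * "Xint0x80_syscall").  If no frame mentions it, all n frames are kept.
 */
static size_t
trusted_frames(const char *const frames[], size_t n, const char *entry)
{
	assert(frames != NULL && entry != NULL);
	for (size_t i = 0; i < n; i++) {
		if (strstr(frames[i], entry) != NULL)
			return (i + 1);
	}
	return (n);
}
```

Frames past the returned count correspond to whatever the unwinder found
beyond the kernel entry point and should not be used for diagnosis.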

-- 
John Baldwin [EMAIL PROTECTED]  http://www.FreeBSD.org/~jhb/
Power Users Use the Power to Serve  =  http://www.FreeBSD.org


Re: 4BSD Scheduler Problem on 5.3

2005-03-31 Thread Robert Watson

On Thu, 31 Mar 2005, John Baldwin wrote:

 On Thursday 31 March 2005 03:38 pm, William Michael Grim wrote:
  Hello.
 
  I keep having kernel panics every couple weeks on my system.  It occurs in
  the sched_switch() function.  There are several other statements in the
  backtrace involving ??; what are those?
 
  I have attached the dump output and system info to this email.  Any
  feedback would be helpful.
 
  Thanks so much for your help.
 
 The real trace ends with Xint0x80_syscall().  The rest after that is
 garbage memory.  Your real problem is in exit1() or ttywakeup().  Since
 ttywakeup() doesn't call exit1() (AFAIK), the exit1() frame is probably
 bogus (gdb doesn't grok trapframes maybe?) and the real bug is a NULL
 pointer deref in ttywakeup().  Perhaps it's a bug in the ptc driver? 
 (ptcopen is in the trace).  What is the ptc driver anyway? 

I think we have a race in -STABLE relating to tty wakeups and
open/close/device teardown.  I've seen a panic relating to sio during a
tty close on RELENG_5 about 5-6 months ago, but was unable to get a dump. 
Scott has since fixed dumps with twe, but I've not yet been able to get
the bug to recur.  I'll give it another try. 

Robert N M Watson
