Once it's in that state, can you run the following please:
# truss -Tgetpid -p <PID>
# pstack <PID>
# pfiles <PID>
# prun <PID>
The truss command will stop the process on a getpid call (confirm the S
column has a 'T' in it with "/usr/bin/ps -opid,s,comm -p <PID>"), and
then the pstack and pfiles will snapshot a point we're interested in.
The prun will put the process back in the state it was in so you can
kill / svcadm restart it.
I found bug 5060500, which describes a possible cause of this. However,
the fix for 6537549 added a signal(SIGPIPE, SIG_IGN) call to nscd, which
explains why the process no longer dies.
Both bugs can be seen here:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=5060500
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6537549
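
To illustrate the effect of ignoring SIGPIPE, here is a minimal sketch in Python. It is not nscd's code, and the function name write_to_closed_pipe is made up for the example. With SIGPIPE ignored, a write to a pipe with no reader fails with EPIPE instead of killing the process, which matches the "Err#32 EPIPE" followed by "SIGPIPE [ignored]" in the truss output. (CPython already ignores SIGPIPE at startup; the explicit call just mirrors what the fix added to nscd.)

```python
import errno
import os
import signal


def write_to_closed_pipe():
    """Return the errno seen when writing to a pipe that has no reader."""
    # Ignore SIGPIPE, analogous to the signal(SIGPIPE, SIG_IGN) call in nscd.
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)

    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # close the read end so the pipe has no reader

    try:
        os.write(write_fd, b"x")
    except OSError as exc:
        # The process survives; the failure is reported through errno.
        return exc.errno
    finally:
        os.close(write_fd)
    return 0


if __name__ == "__main__":
    print(errno.errorcode[write_to_closed_pipe()])  # prints "EPIPE"
```

The point is that once the signal is ignored, it is up to the caller to notice the EPIPE return and stop using that descriptor.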
Regards,
Brian
John Ryan wrote:
Hi,
On at least 2 machines where I'm running OpenSolaris 2009.06, nscd runs okay for
about one hour, then starts using 100% of a CPU. After some further indeterminate
time, it starts using 100% of another CPU.
They are running LDAP, ZFS sharenfs, and NIS.
I ran a truss on the nscd process, and just before it starts consuming CPU, it
gets and ignores a SIGPIPE, then continually calls getpid().
In the following truss output, the pid is 6300. Here's part of truss where it
gets the signal:
/5: time() = 1250088300
/5: xstat(2, "/etc/nsswitch.conf", 0xFE080BC8) = 0
/5: xstat(2, "/etc/resolv.conf", 0xFE080C58) = 0
/5: time() = 1250088300
/5: getpid() = 6300 [1]
/5: write(2, " W e d A u g 1 2 1".., 86) = 86
/5: xstat(2, "/etc/passwd", 0xFE07CA48) = 0
/5: getpid() = 6300 [1]
/5: write(2, " W e d A u g 1 2 1".., 86) = 86
/5: getpid() = 6300 [1]
/5: write(2, " W e d A u g 1 2 1".., 102) = 102
/5: stat64("/etc/passwd", 0xFE07C868) = 0
/5: getpid() = 6300 [1]
/5: fxstat(2, 6, 0xFE07C768) = 0
/5: fxstat(2, 6, 0xFE07C568) = 0
/5: putmsg(6, 0xFE07C688, 0xFE07C760, 0) = 0
/5: pollsys(0x0816407C, 1, 0xFE07C6B0, 0x00000000) = 1
/5: fxstat(2, 6, 0xFE07C558) = 0
/5: getmsg(6, 0xFE07C688, 0x080A4EF8, 0xFE07C680) = 0
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: time() = 1250088300
/5: write(7, " 081A2020104 c819C0412 d".., 165) Err#32 EPIPE
/5: Received signal #13, SIGPIPE [ignored]
/5: time() = 1250088300
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
/5: getpid() = 6300 [1]
........... millions of them !!
Anyone seeing this?
Regards
John
--
Brian Ruthven
Solaris Revenue Product Engineering
Sun Microsystems UK
Sparc House, Guillemont Park, Camberley, GU17 9QG
_______________________________________________
networking-discuss mailing list