Once it's in that state, can you run the following please:

# truss -Tgetpid -p <PID>
# pstack <PID>
# pfiles <PID>
# prun <PID>

The truss command will stop the process on a getpid call (confirm the S column has a 'T' in it with "/usr/bin/ps -opid,s,comm -p <PID>"), and then the pstack and pfiles will snapshot a point we're interested in. The prun will put the process back in the state it was in so you can kill / svcadm restart it.


I found bug 5060500, which describes a possible cause of this. However, the fix for 6537549 added signal(SIGPIPE, SIG_IGN) to nscd, which explains why the process no longer dies.
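
To illustrate the effect of ignoring SIGPIPE, here is a minimal sketch in Python. It is not nscd's code: the function names and the retry policy are hypothetical, made up for this example. It only shows that once SIGPIPE is ignored, a write to a pipe with no reader no longer kills the process but fails with EPIPE, so the caller has to check for that error itself.

```python
import errno
import os
import signal


def write_to_closed_pipe():
    """Write to a pipe whose read end is already closed; return the errno seen.

    Hypothetical helper for illustration only.
    """
    # Same disposition as signal(SIGPIPE, SIG_IGN) in C: the signal is
    # discarded, so the failed write is reported through errno instead.
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)

    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # no reader left, so any write must fail
    try:
        os.write(write_fd, b"reply")
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(write_fd)


def send_with_bounded_retry(write_once, max_attempts=3):
    """Call write_once() until it succeeds, but stop at once on EPIPE.

    Assumed policy for this sketch: EPIPE means the peer has gone away,
    so retrying cannot help and the caller should drop the connection.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        err = write_once()
        if err == 0:
            return ("ok", attempts)
        if err == errno.EPIPE:
            return ("peer gone", attempts)
    return ("gave up", attempts)


if __name__ == "__main__":
    print(send_with_bounded_retry(write_to_closed_pipe))
```

A caller that ignores SIGPIPE but never checks for EPIPE could keep working on a dead connection, which is one way a process might end up spinning rather than exiting. Whether that is what happens inside nscd here is exactly what the pstack and pfiles output should help confirm.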

Both bugs can be seen here:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=5060500
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6537549


Regards,
Brian


John Ryan wrote:
Hi,

On at least 2 machines where I'm running OpenSolaris 2009.06, nscd runs okay for
about one hour, then starts using 100% of a CPU. After some further indeterminate
time, it starts using 100% of another CPU.
They are running LDAP, ZFS sharenfs and NIS.
I ran truss on the nscd process, and just before it starts consuming CPU, it
gets and ignores a SIGPIPE, then continually calls getpid().
In the following truss output, the pid is 6300. Here's the part of the output where it
gets the signal:
/5:     time()                                          = 1250088300
/5:     xstat(2, "/etc/nsswitch.conf", 0xFE080BC8)      = 0
/5:     xstat(2, "/etc/resolv.conf", 0xFE080C58)        = 0
/5:     time()                                          = 1250088300
/5:     getpid()                                        = 6300 [1]
/5:     write(2, " W e d   A u g   1 2   1".., 86)      = 86
/5:     xstat(2, "/etc/passwd", 0xFE07CA48)             = 0
/5:     getpid()                                        = 6300 [1]
/5:     write(2, " W e d   A u g   1 2   1".., 86)      = 86
/5:     getpid()                                        = 6300 [1]
/5:     write(2, " W e d   A u g   1 2   1".., 102)     = 102
/5:     stat64("/etc/passwd", 0xFE07C868)               = 0
/5:     getpid()                                        = 6300 [1]
/5:     fxstat(2, 6, 0xFE07C768)                        = 0
/5:     fxstat(2, 6, 0xFE07C568)                        = 0
/5:     putmsg(6, 0xFE07C688, 0xFE07C760, 0)            = 0
/5:     pollsys(0x0816407C, 1, 0xFE07C6B0, 0x00000000)  = 1
/5:     fxstat(2, 6, 0xFE07C558)                        = 0
/5:     getmsg(6, 0xFE07C688, 0x080A4EF8, 0xFE07C680)   = 0
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     time()                                          = 1250088300
/5:     write(7, " 081A2020104 c819C0412 d".., 165)     Err#32 EPIPE
/5:         Received signal #13, SIGPIPE [ignored]
/5:     time()                                          = 1250088300
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
/5:     getpid()                                        = 6300 [1]
........... millions of them !!

Anyone seeing this?

Regards
John

--
Brian Ruthven
Solaris Revenue Product Engineering
Sun Microsystems UK
Sparc House, Guillemont Park, Camberley, GU17 9QG

_______________________________________________
networking-discuss mailing list
[email protected]
