James and Mike,

Thanks for the hints.  The process is not a zombie:

from ps -ef:
# ps -ef |grep ifind |grep -v grep
    root  9973     1   0   Mar 21 ?           0:34 /opt/galaxy/iDataAgent/ifind -j 28995 -a 2:52 -t 2 -d backupmedia_server
    root  9141     1   0 00:00:27 ?           0:04 /opt/galaxy/iDataAgent/ifind -j 29234 -a 2:52 -t 1 -d backupmedia_server
    root 22521    1   0 12:50:00 ?           0:06 /opt/galaxy/iDataAgent/ifind -j 29234 -a 2:52 -t 1 -d backupmedia_server
#  truss -fp 9973
truss: no such process: 9973

These processes are part of our backup software (Commvault), which scans
filesystems to populate the server's indexes.

I am also pretty sure it's not dumping core, since at least one of them
(PID 9973, started Mar 21) has been around for a while.

From the thread stack trace:

> ::ps !grep ifind
R  22521  22484  22484  22484      0 0x4a004000 000003002248b160 ifind
R   9973      1  12809  12809      0 0x4a304902 0000030000c65b40 ifind
R   9141      1  12809  12809      0 0x4a304902 0000030016cd18d0 ifind
> 0000030000c65b40::print proc_t p_tlist | ::findstack -v
stack pointer for thread 300117d32a0: 2a1039ec841
[ 000002a1039ec841 cv_wait+0x38() ]
  000002a1039ec8f1 top_begin_async+0x90(600274d3180, 3, 100, 0, 1, 60023ac2620)
  000002a1039ec9a1 ufs_syncip+0x298(60045392cf0, 400, 0, 15, 1000, 0)
  000002a1039eca51 ufs_idle_free+0x68(60045392cf0, 300117d32a4, 0, 60045390140, 300117d32a4, 0)
  000002a1039ecb01 ufs_idle_some+0x180(2, 60045392cf0, c0, 0, 18db010, 18d90f8)
  000002a1039ecbb1 ufs_lookup+0x240(30017142d40, 2a1039ed680, 2a1039ed678, 60029b33d00, fe2f, fd82)
  000002a1039ecc91 fop_lookup+0x28(30017142d40, 2a1039ed680, 2a1039ed678, 129c858, 0, 30000d94f00)
  000002a1039ecd51 lookuppnvp+0x344(2a1039ed940, 0, 2f, 2a1039ed678, 2a1039ed680, 6002184da40)
  000002a1039ecf91 lookuppnat+0x120(30017142d40, 0, 0, 0, 2a1039edad8, 0)
  000002a1039ed051 lookupnameat+0x5c(0, 0, 0, 0, 2a1039edad8, 0)
  000002a1039ed161 cstatat_getvp+0x198(ffd19400, 456878, 1, 0, 2a1039edad8, 0)
  000002a1039ed221 cstatat64_32+0x40(ffffffffffd19553, 456878, 1000, ffbea6e0, 1000, 0)
  000002a1039ed2e1 syscall_trap32+0xcc(456878, ffbea6e0, 0, 0, 0, 0)
>
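
To make the stack above easier to read, here is a minimal sketch in Python
(not part of the mdb session) that pulls the function names out of the
::findstack -v frames and prints the call chain from the system call down to
the frame the thread is sleeping in. The names FINDSTACK_OUTPUT, FRAME_RE and
call_chain are my own and purely illustrative, and the regular expression
only assumes the frame layout shown in the output above.

```python
import re

# Frames copied from the ::findstack -v output above, one frame per line.
FINDSTACK_OUTPUT = """\
[ 000002a1039ec841 cv_wait+0x38() ]
  000002a1039ec8f1 top_begin_async+0x90(600274d3180, 3, 100, 0, 1, 60023ac2620)
  000002a1039ec9a1 ufs_syncip+0x298(60045392cf0, 400, 0, 15, 1000, 0)
  000002a1039eca51 ufs_idle_free+0x68(60045392cf0, 300117d32a4, 0, 60045390140, 300117d32a4, 0)
  000002a1039ecb01 ufs_idle_some+0x180(2, 60045392cf0, c0, 0, 18db010, 18d90f8)
  000002a1039ecbb1 ufs_lookup+0x240(30017142d40, 2a1039ed680, 2a1039ed678, 60029b33d00, fe2f, fd82)
  000002a1039ecc91 fop_lookup+0x28(30017142d40, 2a1039ed680, 2a1039ed678, 129c858, 0, 30000d94f00)
  000002a1039ecd51 lookuppnvp+0x344(2a1039ed940, 0, 2f, 2a1039ed678, 2a1039ed680, 6002184da40)
  000002a1039ecf91 lookuppnat+0x120(30017142d40, 0, 0, 0, 2a1039edad8, 0)
  000002a1039ed051 lookupnameat+0x5c(0, 0, 0, 0, 2a1039edad8, 0)
  000002a1039ed161 cstatat_getvp+0x198(ffd19400, 456878, 1, 0, 2a1039edad8, 0)
  000002a1039ed221 cstatat64_32+0x40(ffffffffffd19553, 456878, 1000, ffbea6e0, 1000, 0)
  000002a1039ed2e1 syscall_trap32+0xcc(456878, ffbea6e0, 0, 0, 0, 0)
"""

# Assumed frame layout: optional "[", a 16-digit hex frame pointer,
# then "function+0xoffset(" as printed by ::findstack -v.
FRAME_RE = re.compile(r"^\[?\s*[0-9a-f]{16}\s+(\w+)\+0x[0-9a-f]+\(")


def call_chain(text):
    """Return function names ordered from outermost caller to innermost frame."""
    names = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    # ::findstack prints the innermost frame first, so reverse the list.
    return list(reversed(names))


chain = call_chain(FINDSTACK_OUTPUT)
print(" -> ".join(chain))
```

Read that way, the thread entered through a 32-bit stat call, went through
the pathname lookup into ufs_lookup, and is now sleeping in cv_wait under
top_begin_async, which I take to be part of the UFS logging code.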

For what it's worth, the server was just patched with a January patch
cluster plus 139483-05.

Thanks again,
--Brett

On 3/22/09, James C. McPherson <James.McPherson at sun.com> wrote:
> On Sun, 22 Mar 2009 16:21:15 -0700
> Michael Schuster <Michael.Schuster at Sun.COM> wrote:
>
> > Brett Monroe wrote:
> > > Hey all,
> > >
> > > I am seeing an issue on one of our Solaris 10 servers and I would like
> > > to get more insight into what is going on.  I suspect it is a kernel
> > > bug and I think mdb is the only way I can look into the kernel to see
> > > what's going on (with respect to this issue).  My mdb skills are close
> > > to non-existent so please bear with me. :)  Anyway, here is what I am
> > > seeing:
> > >
> > > The Server is running Solaris 10 Kernel 138888-02.  I have some
> > > processes that appear in the process table and in /proc but they won't
> > > die if killed and can't be trussed and p* commands fail with the error
> > > "no such process."
> >
> > do they appear in a 'ps -ef' listing, perhaps as "defunct"? in that case,
> > you have so-called zombies, which are processes that have exited but whose
> > exit code still needs to be reaped.
>
>
> Hi Brett,
> a mate just pointed out that you might have come across the
> situation where if the proc is large enough and it recently
> received a signal, it could be in the process of dumping core.
>
> iirc you'd want to check the p_siginfo part of the proc structure
> to make sure of that.
>
>
> cheers,
> James
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp       http://www.jmcp.homeunix.com/blog
>