Hi Sam,

Thanks for the reply. I think we found a workaround.  Our code is a
true hybrid, MPI plus OpenMP: MPI across nodes and OpenMP inside
each node.  Each node has several I/O threads writing to PVFS.  We
usually send a kill signal to the MPI rank that started the OpenMP
threads on the node.  Sometimes that does not kill all the I/O
threads; some are left hanging, and pvfs2-client-core dies.

If we use pkill instead of plain kill, all the I/O threads are killed
every time and things work as expected.

At the moment I can't recompile with debugging on, but I will try that
later.

Thanks
Rene

> 
> > Hi,
> >
> > We are getting some strange behavior out of pvfs-2.8.1 clients running
> > on some SLES 10 SP1 nodes.
> >
> > The pvfs2 clients can mount the pvfs2 file system with no problems.  We
> > then start an MPI job that runs on a small number of nodes.  The problem
> > happens when we try to kill the MPI job.  As soon as we send the kill
> > signal to the MPI job, several of our pvfs2 client nodes have their
> > pvfs2-client-core daemon die with this message:
> >
> > hpcp6671:~ # ps -ef |grep pvfs
> > root     25767     1  0 12:21 ?        00:00:00 /bphpc5/vol0/salmr0/opt/pvfs-2.8.1/x86_64/sles10sp1/sbin/pvfs2-client -p /bphpc5/vol0/salmr0/opt/pvfs-2.8.1/x86_64/sles10sp1/sbin/pvfs2-client-core
> > root     16117 25767  0 15:02 ?        00:00:00 [pvfs2-client-co]
> >
> >
> >
> > hpcp6671:~ # cat /tmp/pvfs2-client.log
> > [E 12:21:35.567169] PVFS Client Daemon Started.  Version 2.8.1
> > [D 12:21:35.567434] [INFO]: Mapping pointer 0x2acdf7aa3000 for I/O.
> > [D 12:21:35.579256] [INFO]: Mapping pointer 0x2acdf8ea5000 for I/O.
> > [E 15:02:54.988860] PVFS2 client: signal 11, faulty address is 0x41d5, from 0x408d81
> > [E 15:02:54.989282] [bt] pvfs2-client-core [0x408d81]
> > [E 15:02:54.989294] [bt] pvfs2-client-core [0x408d81]
> > [E 15:02:54.989302] [bt] pvfs2-client-core(main+0xbc3) [0x40a173]
> > [E 15:02:54.989309] [bt] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2acdf788b154]
> > [E 15:02:54.989315] [bt] pvfs2-client-core [0x403519]
> > [E 15:02:54.991351] Child process with pid 25768 was killed by an uncaught signal 6
> 
> Hi Rene,
> 
> This is a segfault in the client process.  The daemon is restarting 
> itself, which may be what the error below is from.  I'll have to 
> figure out what that 0x408d81 pointer maps to.  Might not be all that 
> useful though.  Would you be willing to recompile with debugging 
> enabled (rerun configure with CFLAGS=-g, and then rebuild)?  That 
> would at least give us line numbers to look at.
> 
> >
> > [E 15:02:54.993980] PVFS Client Daemon Started.  Version 2.8.1
> > [D 15:02:54.994242] [INFO]: Mapping pointer 0x2b94619a2000 for I/O.
> > [D 15:02:55.008318] [INFO]: Mapping pointer 0x2b9462da4000 for I/O.
> > [E 15:02:55.312456] Got an unrecognized/unimplemented vfs operation of type ff000000.
> > [E 15:02:55.312497] Post of op: PVFS_VFS_OP_INVALID failed!
> 
> I would try to fix the above before worrying about this one.  It could
> be just fallout from the first failure.
> 
> -sam
> >
> >
> >
> > Any ideas?
> >
> > thanks
> > Rene
> >
> > _______________________________________________
> > Pvfs2-users mailing list
> > [email protected]
> > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
> 
> 
> 