Jim,

We'll definitely try to help you resolve the problem you're seeing. That said, I responded to a similar query of yours back in April. See:

http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-April/002765.html

It would be great if you could answer the questions I asked in that email.

Also, it's been hinted by yourself and others that this may not be PVFS related, as other users aren't experiencing the same problem. I encourage you to rule out memory problems on your system. You could run memtester (http://pyropus.ca/software/memtester/) on both servers and clients to verify that faulty memory isn't the cause.

I've created a trac ticket for the problem you're seeing, so that we can keep track of it that way. See:

https://trac.mcs.anl.gov/projects/pvfs/ticket/113

-sam

On Jul 28, 2009, at 10:58 AM, Jim Kusznir wrote:

Hi all:

More or less since I installed pvfs2, I've had recurring stability
issues.  Presently, my cluster headnode has 3 processes, each using
100% of a core, that are "hung" on I/O (all of that processor usage is
in "system", not "user"), but the processes are not in "D" state (they
are moving between S and R).  Each process should have completed in an
hour or less; they have now been running for over 18 hours.  They also
are not responding to kills (including kill -9).  From the sounds of the
users' message, any additional processes started in the same working
directory will hang in the same way.
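
For anyone trying to confirm the same symptoms (process state flipping
between S and R rather than sitting in D, with CPU time accumulating in
"system" rather than "user"), the numbers can be read from
/proc/<pid>/stat. What follows is only a minimal sketch, assuming a Linux
/proc filesystem; the function names parse_stat and describe are
illustrative and are not part of PVFS or any of its tools.

```python
import os


def parse_stat(line):
    """Parse one /proc/<pid>/stat line into (state, user_ticks, system_ticks)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces, so split on the last ')' instead of on whitespace alone.
    _, _, rest = line.rpartition(")")
    fields = rest.split()
    state = fields[0]        # field 3 of the stat line: R, S, D, Z, ...
    utime = int(fields[11])  # field 14: clock ticks spent in user mode
    stime = int(fields[12])  # field 15: clock ticks spent in kernel mode
    return state, utime, stime


def describe(pid):
    """Return a one-line summary of state and CPU time for a live process."""
    with open(f"/proc/{pid}/stat") as f:
        state, utime, stime = parse_stat(f.read())
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second on this system
    return f"pid {pid}: state={state} user={utime / hz:.1f}s system={stime / hz:.1f}s"
```

Calling describe() on a hung process a few times, several seconds apart,
would show whether the system time keeps growing while the user time
stays flat, and whether the state ever becomes D.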

This happens a lot.  Presently, the 3 hung processes are a binary
specific to the research (x2) and gzip; often, the hung processes are
ls and ssh (for scp), etc.  When this happens, all other physical
systems are still fully functional.  This has happened repeatedly
(although not repeatable on demand) on versions 1.5 through 1.8.1.
The only recovery option I have found to date is to reboot the system.
This normally only happens on the head node, but the head node is
also where a lot of the user I/O takes place (especially a lot of
small I/O accesses such as a few scp sessions, some gzips, and 5-10
users doing ls, mv, and cp operations).

Given what I understand about pvfs2's current user base, I'd think it
must be stable; a large cluster could never run pvfs2 and still be
useful to users with the types of instability I keep experiencing.  As
such, I suspect the problem is somewhere with my system/setup, but to
date pcarns and others on #pvfs2 have not been able to identify what
it is.  These stability issues are significantly affecting the
usability of the cluster and, of course, beginning to deter users from
it and to undermine confidence in my competency in administering it.
Yet from what I can tell, I'm experiencing some bug in the pvfs kernel
module.  I'd really like to get this problem fixed, and I'm at a loss as
to how, other than replacing pvfs2 with some other filesystem, which I'd
rather not do.

How do I fix this problem without replacing pvfs2?

--Jim
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
