Hi all:

More or less since I installed pvfs2, I've had recurring stability
issues.  Presently, my cluster head node has 3 processes, each using
100% of a core, that are "hung" on I/O (all of that processor usage is
in "system", not "user"), but the processes are not in "D" state
(they move between S and R).  They should have completed in an hour
or less; they have now been running for over 18 hours.  They also do
not respond to kills (including kill -9).  From the sound of the
users' messages, any additional processes started in the same working
directory will hang in the same way.
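
For anyone wanting to confirm the same symptoms on their own node, the
state and the user/system CPU split described above can be read from
/proc.  The following is only a minimal sketch, assuming a Linux /proc
filesystem; the function names (parse_stat, read_stat, cpu_split) are
made up for illustration and are not part of pvfs2.

```python
import time


def parse_stat(line):
    """Parse one /proc/<pid>/stat line.

    Returns (comm, state, utime_ticks, stime_ticks).  The command name
    sits in parentheses and may itself contain spaces or parentheses,
    so split on the LAST ')' rather than on whitespace.
    """
    lpar = line.index("(")
    rpar = line.rindex(")")
    comm = line[lpar + 1:rpar]
    rest = line[rpar + 1:].split()
    # rest[0] is field 3 (state); fields 14 and 15 (utime, stime)
    # are therefore rest[11] and rest[12].
    return comm, rest[0], int(rest[11]), int(rest[12])


def read_stat(pid):
    """Read and parse /proc/<pid>/stat for a live process."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat(f.read())


def cpu_split(pid, interval=1.0):
    """Sample a process twice and return (comm, state, user, system).

    user and system are the clock ticks consumed during the interval.
    A process spinning in the kernel should show a large system delta
    and a user delta near zero.
    """
    _, _, u0, s0 = read_stat(pid)
    time.sleep(interval)
    comm, state, u1, s1 = read_stat(pid)
    return comm, state, u1 - u0, s1 - s0
```

Calling cpu_split() on the PID of one of the hung processes a few
times in a row would show whether the state really alternates between
S and R and whether essentially all of the CPU time is system time.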

This happens a lot.  Presently, the 3 hung processes are two instances
of a research-specific binary and one gzip; often, the hung processes
are ls and ssh (for scp), etc.  When this happens, all other physical
systems are still fully functional.  This has happened repeatedly
(although it is not reproducible on demand) on versions 1.5 through
1.8.1.  The only recovery option I have found to date is to reboot the
system.  This normally only happens on the head node, but the head
node is also where a lot of the user I/O takes place (especially a lot
of small I/O accesses such as a few scp sessions, some gzips, and 5-10
users doing ls, mv, and cp operations).

Given what I understand about pvfs2's current user base, I'd think it
must be stable; a large cluster could never run pvfs2 and still be
useful to users with the types of instability I keep experiencing.  As
such, I suspect the problem is somewhere with my system/setup, but to
date pcarns and others on #pvfs2 have not been able to identify what
it is.  These stability issues are significantly affecting the
usability of the cluster and, of course, are beginning to drive users
away from it and/or to make them doubt my competence in administering
it.  Yet from what I can tell, I'm experiencing some bug in the pvfs
kernel module.  I'd really like to get this problem fixed, and I'm at
a loss as to how, other than replacing pvfs2 with some other
filesystem, which I'd rather not do.

How do I fix this problem without replacing pvfs2?

--Jim