Hi Jim,
Kyle is right, errors of this magnitude in PVFS should show up in the
log file. Before we get into enabling debugging output for you
though, it would be nice to know if there are error messages being
thrown from PVFS.
The PVFS client has two components: a kernel module that integrates
with the kernel VFS layer, and a userspace daemon that runs as root.
The default log locations for these two components are different.
The kernel module writes error messages to syslog, while the PVFS
daemon writes error messages to the log file at
/tmp/pvfs2-client.log. When you start to see the instabilities you
mentioned, do you see anything from running dmesg, or does anything
show up in the /tmp/pvfs2-client.log file?
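For reference, a quick way to scan a log for likely trouble might look
like the sketch below. The keyword pattern and the sample log lines
are invented placeholders (real PVFS output will differ); on the live
system you would point the grep at /tmp/pvfs2-client.log and at dmesg
output rather than at a temp file.

```shell
# Sketch: count lines in a log that look like errors. The sample file
# below uses made-up placeholder lines, not real PVFS output.
log=$(mktemp)
cat > "$log" <<'EOF'
[D 10:01:02] startup complete
[E 10:05:09] Error: connection timeout to server
[D 10:05:10] retrying operation
EOF

# -i: case-insensitive, -c: count matching lines, -E: extended regex
hits=$(grep -icE 'error|timeout|refused' "$log")
echo "suspicious lines: $hits"
rm -f "$log"
```

On the real head node the equivalent would be something like
`grep -icE 'error|timeout|refused' /tmp/pvfs2-client.log`.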
Also, if you monitor the pvfs2-client-core process (using ps or top),
does the memory of that process grow over time?
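One low-tech way to watch for growth is to sample the resident set
size periodically. This is just a sketch: the process name comes from
the paragraph above, the sampling interval is arbitrary, and the
helper function name is my own invention.

```shell
# Print the total resident set size (RSS, in KB) of all processes
# matching a name; prints 0 if nothing matches.
sample_rss() {
    ps -C "$1" -o rss= | awk '{ total += $1 } END { print total + 0 }'
}

sample_rss pvfs2-client-core

# To watch for a leak over time, leave something like this running
# and see whether the numbers climb steadily under load:
#   while true; do date; sample_rss pvfs2-client-core; sleep 60; done
```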
Thanks,
-sam
On Apr 1, 2009, at 8:09 PM, Kyle Schochenmaier wrote:
Jim -
Anytime a server process hangs, or has any kind of connection issue,
it will cause client-side connectivity issues.
Is it possible that your head node rebooted itself because of the
high load?
If one of your server processes had a timeout, it should be reflected
in the log file.
Let's turn on debugging for the client and see what happens.
Also, it might be beneficial to restart all of the pvfs2-server
processes if you have a minute or two of downtime. If there was an
issue with IP conflicts, I have no idea how that may affect the
filesystem as a whole; I'll leave that to everyone else to comment
on -- this may be an interesting conversation, if you'd like to
describe your experiences with it.
So let's get client-side logging going, ideally at the network level:
if there are timeouts (which there should be if there are
connectivity issues), they should be reflected in the client log at
/tmp/pvfs2-client.log.
An additional option may be to think about upgrading to 2.8.1.
Hopefully we can establish a cause before having to do that. But it
may be worthwhile at some point.
~Kyle
Kyle Schochenmaier
On Wed, Apr 1, 2009 at 1:56 PM, Jim Kusznir <[email protected]>
wrote:
1. It appears that I didn't look at the right place this time... I've
had that problem in the past, but didn't thoroughly investigate it,
and when one takes buffers/cache into account, I have over 6GB
available.
This was reported for the cluster headnode only, which is a pvfs2
client (no pvfs2-server-related processes).
2. Workload on the head node is user access to pvfs2 data. Sometimes
it's users scp'ing data into or out of pvfs2; sometimes it's them
working with it directly (viewing, tar/untar, etc.). None of this
should be "high performance", and there should not be any direct
application utilization of data here. Cluster nodes are a different
story....
3. I don't have any cron jobs of my own. I doubt any users would,
but if so, their jobs would just be transferring data anyway. The
high load average is actually from processes that are hung trying to
do I/O on a file in pvfs2. The CPU is basically idle. At times, I've
seen the load average as high as 40+.
4. My pvfs2 server logs have not been modified since March 2nd.
I do know that at the beginning of the week, one of the pvfs2-server
nodes was having intermittent connectivity (IP conflict) issues.
However, these problems started after that was corrected, and the
week prior, the head node rebooted spontaneously at least 2 times; I
think it was more like 4 (I was on vacation).
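The buffers/cache adjustment in point 1 can be computed directly from
`free -m` output. The sketch below uses made-up sample numbers in the
traditional procps `free` layout (total, used, free, shared, buffers,
cached), chosen only so the example runs anywhere.

```shell
# On older procps, "free -m" counts buffers and page cache as "used".
# Memory actually available is roughly free + buffers + cached.
# The line below stands in for real "free -m" output (sample numbers):
#               total       used       free     shared    buffers     cached
free_output='Mem:          7982       7552        430         0       1200       5100'

echo "$free_output" | awk '{ printf "available ~= %d MB\n", $4 + $6 + $7 }'
```

With these sample numbers the result is a bit over 6GB available even
though "used" reads 7552MB, which matches the effect described above.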
On Wed, Apr 1, 2009 at 11:38 AM, Kyle Schochenmaier
<[email protected]> wrote:
Jim -
A couple things to start with.
1. wrt 'climbing memory usage' - is this on the client (head)
node,
or the pvfs2-servers (data servers)?
2. Is your workload basically some users scp'ing data to a
pvfs2-mounted location on the client/head node?
3. Is it possible that you have some errant cron jobs or 'bad'
scripts running, unrelated to pvfs, that are eating up the CPU?
4. Are there any timeouts on the pvfs2-server logs?
It would make sense that users' files would be inaccessible if one or
more of the data servers is having connectivity or other issues.
Can you send your pvfs2 config file, as well as the log files for the
client process and the servers (if there is anything there)?
~Kyle
Kyle Schochenmaier
On Wed, Apr 1, 2009 at 1:27 PM, Jim Kusznir <[email protected]>
wrote:
Hi all:
I'm (once again) experiencing system instability that appears to be
traceable to pvfs2. Symptoms usually show up when one or more users
start long SCP sessions transferring 5+GB of data over several
hours. I believe they usually have 1-3 sessions running in parallel.
Symptoms include:
* High load averages (and climbing slowly with additional use)
without
supporting CPU load. The ONLY way to recover from this is
reboot. My
load average is currently 7.10 with 99.8% idle CPU.
* hung SCP and other I/O processes
* large amounts of RAM "missing" (currently free -m reports 7552MB
in use; adding up usage from all processes comes to about 1GB).
* Often (always?) some users' files become inaccessible (although
users have stopped reporting those problems, as it's happened so
frequently).
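As a side note on the load-without-CPU symptom: Linux counts
processes stuck in uninterruptible I/O sleep (state "D") toward the
load average even though they use no CPU, so a handful of hung pvfs2
I/O processes can push the number up by themselves. A sketch for
listing them (shown here on invented sample `ps`-style output so it
runs anywhere):

```shell
# On the live head node you would run something like:
#   ps -eo state,pid,comm | awk '$1 ~ /^D/'
# The sample below is made-up output for illustration only.
sample='S   101 sshd
D   202 scp
D   203 tar
R   304 awk'

echo "$sample" | awk '$1 ~ /^D/ { count++; print $2, $3 }
                      END { print "D-state processes:", count + 0 }'
```

If the D-state count roughly matches the load average while the CPU
sits idle, that points at blocked I/O rather than runaway jobs.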
If I let this go a bit longer, there's a reasonable chance that the
machine will just spontaneously reboot. There's nothing logged as to
the cause. No OOM or other errors... Just one minute everything's
fine; the next it's booting up.
Sometimes it will take a long time for these problems to build up
(for
example, right now the system load and memory issues are here
with a
couple days of "building"); sometimes the system will spontaneously
reboot several times in one day (with no notice of climbing loads
or
the like).
These problems so far have only happened on the head node (pvfs
client); our compute nodes have not shown this problem.
System configuration:
Rocks 5.1 with manual pvfs setup (NOT using rocks-supplied PVFS
binaries or configurations)
pvfs 2.7.1 + patches from pcarns
3 CentOS 5 dedicated PVFS servers (each with ~10TB storage, Dell
PERC
6/e + MD1000's)
PVFS servers are running over bonded dual-gig connections using
linux
kernel ethernet bonding driver
Clients are single-gig connected.
no off-site pvfs2 access (scp/ssh/sftp access only, via the head
node)
Any suggestions?
I'm getting fairly desperate for help, as pvfs2 has been the main
destabilizing factor for the cluster since it went online, and
causing
spontaneous reboots is not a good thing....
Thanks!
--Jim
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users