Jim -

Anytime a server process hangs or has any kind of connection issue, it will
cause client-side connectivity issues. Is it possible that your head node
rebooted itself because of the high load?
If one of your server processes had a timeout, it should be reflected in the
log file. Let's turn on debugging for the client and see what happens.

Also, it might be beneficial to just restart all of the pvfs2-server
processes if you have a minute or two of downtime. If there was an issue
with IP conflicts, I have no idea how that may affect the filesystem as a
whole; I'll leave that to everyone else to comment on. This may be an
interesting conversation, if you'd like to describe your experiences with it.

So let's get client-side logging (network level? If there are timeouts,
which should occur if there are connectivity issues, they should be
reflected in the logs): /tmp/pvfs2-client.log

An additional option may be to think about upgrading to 2.8.1. Hopefully we
can establish a cause before having to do that, but it may be worthwhile at
some point.

~Kyle

Kyle Schochenmaier

On Wed, Apr 1, 2009 at 1:56 PM, Jim Kusznir <[email protected]> wrote:
> 1. It appears that I didn't look at the right place this time...I've
> had that problem in the past, but didn't thoroughly investigate it, and
> when one takes buffers/cache into account, I have over 6GB available.
> This was reported for the cluster head node only, which is a pvfs2
> client (no pvfs2-server-related processes).
>
> 2. Workload on the head node is user access to pvfs2 data. Sometimes
> it's users scp'ing data into or out of pvfs2; sometimes it's them
> working with it directly (viewing, tar/untar, etc.). None of this
> should be "high performance", and there should not be any direct
> application utilization of data here. Cluster nodes are a different
> story....
>
> 3. I don't have any cron jobs of my own. I doubt any users would, but
> if so, their jobs would just be transferring data anyway. The high
> load average is actually from processes that are hung trying to do I/O
> on a file in pvfs2. The CPU is basically idle. At times, I've seen
> the load average as high as 40+.
> 4. My pvfs2 server logs have not been modified since March 2nd.
>
> I do know that at the beginning of the week, one of the pvfs2-server
> nodes was having intermittent connectivity (IP conflict) issues.
> However, these problems started after that was corrected, and the
> week prior, the head node rebooted spontaneously at least 2 times; I
> think it was more like 4 (I was on vacation).
>
> On Wed, Apr 1, 2009 at 11:38 AM, Kyle Schochenmaier <[email protected]>
> wrote:
>> Jim -
>>
>> A couple of things to start with.
>>
>> 1. wrt 'climbing memory usage' - is this on the client (head) node,
>> or the pvfs2-servers (data servers)?
>> 2. Is your workload basically some users scp'ing data to a
>> pvfs2-mounted location on the client/head node?
>> 3. Is it possible that you have some errant cron jobs or 'bad'
>> scripts running that are eating up the CPU and are not related to
>> pvfs?
>> 4. Are there any timeouts in the pvfs2-server logs?
>>
>> It would make sense that users' files would be inaccessible if one or
>> more of the data servers is having connectivity issues or other
>> issues.
>>
>> Can you send your pvfs2 config file as well as the log files for the
>> client process and the servers (if there is anything there)?
>>
>> ~Kyle
>>
>> Kyle Schochenmaier
>>
>> On Wed, Apr 1, 2009 at 1:27 PM, Jim Kusznir <[email protected]> wrote:
>>> Hi all:
>>>
>>> I'm (once again) experiencing system instability that appears to be
>>> traceable to pvfs2. Symptoms usually show up when one or more users
>>> start long SCP sessions transferring 5+ GB of data and lasting several
>>> hours. I believe they usually have 1-3 sessions running in parallel.
>>> Symptoms include:
>>>
>>> * High load averages (climbing slowly with additional use) without
>>> supporting CPU load. The ONLY way to recover from this is a reboot.
>>> My load average is currently 7.10 with 99.8% idle CPU.
>>> * Hung SCP and other I/O processes
>>> * Large amounts of RAM "missing" (currently free -m reports 7552 MB
>>> in use; adding up usage from all processes comes to about 1 GB)
>>> * Often (always?) some users' files become inaccessible (although
>>> users have stopped reporting these problems, as it's happened so
>>> frequently)
>>>
>>> If I let this go a bit longer, there's a reasonable chance that the
>>> machine will just spontaneously reboot. There's nothing logged as to
>>> the cause. No OOM or other errors...just one minute everything's
>>> fine; the next it's booting up.
>>>
>>> Sometimes it will take a long time for these problems to build up
>>> (for example, right now the system load and memory issues are here
>>> after a couple days of "building"); sometimes the system will
>>> spontaneously reboot several times in one day (with no notice of
>>> climbing loads or the like).
>>>
>>> These problems so far have only happened on the head node (pvfs
>>> client); our compute nodes have not shown this problem.
>>>
>>> System configuration:
>>> Rocks 5.1 with manual pvfs setup (NOT using Rocks-supplied PVFS
>>> binaries or configurations)
>>> pvfs 2.7.1 + patches from pcarns
>>> 3 CentOS 5 dedicated PVFS servers (each with ~10 TB storage, Dell
>>> PERC 6/E + MD1000s)
>>> PVFS servers are running over bonded dual-gig connections using the
>>> Linux kernel ethernet bonding driver
>>> Clients are single-gig connected.
>>> No off-site pvfs2 access (scp/ssh/sftp access only, via the head node)
>>>
>>> Any suggestions?
>>> I'm getting fairly desperate for help, as pvfs2 has been the main
>>> destabilizing factor for the cluster since it went online, and
>>> causing spontaneous reboots is not a good thing....
>>>
>>> Thanks!
>>> --Jim
>>> _______________________________________________
>>> Pvfs2-users mailing list
>>> [email protected]
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
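[A sketch of Kyle's suggestion above: once client-side logging is going to
/tmp/pvfs2-client.log, timeouts and retries can be spotted with standard
tools. The log path is the one named in the thread; the sample entries
below are invented for illustration, not real pvfs2 log output.]

```shell
# Demo log with made-up entries standing in for /tmp/pvfs2-client.log
LOG=/tmp/pvfs2-client-demo.log
printf '%s\n' \
  '[E 04/01 13:50] network op: timeout waiting for response' \
  '[D 04/01 13:50] metadata lookup ok' \
  '[E 04/01 13:51] retrying operation (attempt 2)' > "$LOG"

# Count timeout/retry lines, case-insensitively; a steadily growing count
# during an SCP session would point at network-level connectivity trouble.
grep -Eci 'timeout|retry' "$LOG"   # -> 2
```

Swapping in the real log path and running this under `watch` (or tailing
with `tail -f "$LOG" | grep -Ei 'timeout|retry'`) shows whether errors line
up in time with the hung SCP sessions.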
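[On the "missing RAM" symptom above: when free -m reports several GB in use
but per-process totals only account for ~1 GB, the difference is normally
kernel-side memory (page cache, buffers, or slab allocations, e.g. from a
filesystem client module). A minimal check, assuming a Linux /proc:]

```shell
# Compare kernel-side memory against what processes account for.
grep -E '^(MemTotal|MemFree|Buffers|Cached|Slab):' /proc/meminfo

# A Slab figure that grows into the gigabytes and never shrinks would
# suggest a kernel-module leak; `slabtop -o | head` (from procps) shows
# which slab caches are the biggest consumers.
```

If Buffers + Cached explain the gap, the memory is reclaimable and nothing
is wrong; a large, growing Slab figure is the more suspicious case.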
