On Jul 29, 2009, at 3:32 PM, Jim Kusznir wrote:

Here's the last portion of the pvfs2-client log.  My head node
spontaneously rebooted again this afternoon.  There's never anything
in the system logs when this happens.  I don't know where the reboot
falls within the pvfs2-client log, as the log doesn't appear to
rotate on its own.

----------------
[E 09:45:59.292032] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:01.300109] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:03.308181] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:05.330273] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:07.350688] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:09.371106] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:09.371125] *** msgpairarray_completion_fn: msgpair to server
tcp://pvfs2-io-0-0:3334 failed: Connection refused
[E 09:46:09.371137] *** Out of retries.
[E 09:46:09.381108] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:11.391534] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:13.411966] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:15.432403] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:17.442847] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:19.463301] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:19.463322] *** msgpairarray_completion_fn: msgpair to server
tcp://pvfs2-io-0-0:3334 failed: Connection refused
[E 09:46:19.463334] *** Out of retries.
[E 09:53:48.857615] job_time_mgr_expire: job time out: cancelling bmi
operation, job_id: 2194138.
[E 09:53:48.857710] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Operation cancelled (possibly due
to timeout)
[E 09:53:48.858193] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Operation cancelled (possibly due
to timeout)
[E 09:53:48.858226] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Operation cancelled (possibly due
to timeout)
[E 08:45:16.995718] PVFS Client Daemon Started.  Version 2.8.1
[D 08:45:17.007853] [INFO]: Mapping pointer 0x2b8c00d67000 for I/O.
[D 08:45:17.022636] [INFO]: Mapping pointer 0x15c4a000 for I/O.
[E 10:33:45.216228] Child process with pid 9281 was killed by an
uncaught signal 6
[E 10:33:45.220192] PVFS Client Daemon Started.  Version 2.8.1
[D 10:33:45.220444] [INFO]: Mapping pointer 0x2aed09e06000 for I/O.
[D 10:33:45.235840] [INFO]: Mapping pointer 0x64ec000 for I/O.
[E 10:59:40.888591] PVFS Client Daemon Started.  Version 2.8.1
[D 10:59:40.896507] [INFO]: Mapping pointer 0x2abaaf20f000 for I/O.
[D 10:59:40.911142] [INFO]: Mapping pointer 0x10a5a000 for I/O.
----------------
I know that at some point yesterday, and a few days prior, the I/O
servers went down (originally for the migration to 2.8.1, which took
longer than I expected due to the conversion), and then yesterday at
some point at least one of the server processes died.  I rebooted the
server this morning around 8:30 (to clear up the hung processes that
resulted yesterday from pvfs2 I/O operations that did not complete).
It appears that around noon it rebooted again of its own accord.
There's nothing in the logs to say why, but it only happens during
periods of heavy user pvfs2 I/O.

Hi Jim,

Unfortunately, that log probably isn't going to be useful. All it shows is that a client couldn't contact a server (pvfs2-io-0-0) and then, on a different day, that the client was restarted a number of times. The client log doesn't rotate; it just keeps growing. If possible, could you send the pvfs2-client.log files from all the clients in your cluster, as well as the pvfs2-server.log files from each of the three servers? Also, could you clarify a few points? I've put some questions below.
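
If it's easier, a loop along these lines can collect everything in
one pass. The hostnames and log paths here are placeholders (I don't
know where your init scripts send the logs), so adjust them to match
your site:

----------------
#!/bin/sh
# Gather PVFS2 logs from clients and servers into ./pvfs2-logs/.
# Hostnames and log paths are placeholders -- substitute your real
# node names and whatever paths your init scripts actually use.
mkdir -p pvfs2-logs
for client in headnode compute-0-0 compute-0-1; do
    scp "$client:/var/log/pvfs2-client.log" "pvfs2-logs/$client-client.log"
done
for server in pvfs2-io-0-0 pvfs2-io-0-1 pvfs2-meta-0-0; do
    scp "$server:/var/log/pvfs2-server.log" "pvfs2-logs/$server-server.log"
done
tar czf pvfs2-logs.tar.gz pvfs2-logs
----------------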

You migrated to PVFS 2.8.1 and the I/O servers were "down" during that time. Did you keep the PVFS volume mounted on the clients, with the kernel module loaded and the client daemon running during that time?

When the servers went down yesterday and the other days, did just the pvfs2-server process die? Or did the entire server spontaneously reboot?

Did you reboot the _server_ to clear up hung processes? Hung processes are going to be on the client nodes, unless PVFS is also mounted on the same node the servers are running on, but that's not true for your config IIRC.

When it rebooted around noon, which node rebooted? A server node or a client? Which node in particular? You've seen a number of spontaneous reboots. Are they limited to client nodes or server nodes (or maybe just a few client nodes)? Are there no kernel panics in syslog when those reboots occur?
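
One caveat on syslog: if a node panics hard, the backtrace often
never makes it to disk, so an empty /var/log/messages doesn't rule
out a panic. If you get the chance, you could point netconsole at
another machine so the panic output survives the reboot. A rough
sketch, with made-up addresses, port, and MAC that you'd replace for
your network:

----------------
# On the node that keeps rebooting (IPs, port, and MAC below are
# placeholders).  This streams kernel messages over UDP to 192.168.1.1.
modprobe netconsole netconsole=@/eth0,[email protected]/00:11:22:33:44:55

# On the receiving machine, capture whatever arrives on that port
# (option spelling varies between netcat flavors):
nc -u -l -p 6665 | tee netconsole.log
----------------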


If it's helpful, our approach to debugging these sorts of problems usually involves:

* Isolating the problem. If we know that process hangs or node reboots always occur on node X, or always happen because the client daemon dies, or because one of the servers falls over, etc., that gives us an area to focus on. PVFS (or any distributed file system) has a number of different components that interact in different ways, and knowing which bits of code are the source of the problem makes it a lot easier for us to figure out. Basically, we're not very good at whack-a-mole.

* Reproducing the problem. If you can reproduce what you're seeing by running some application with a specific set of parameters, or by using a synthetic test, we can probably reproduce it on our own systems as well. That makes debugging the problem a _ton_ easier, and a fix usually follows in short order. Obviously, it's not always possible to reproduce a problem, especially one that's system specific, but even a rough approximation helps (see the sketch after this list).

* Lots of logging/debugging. Problems in PVFS get written to the logs as error messages most of the time. It's probably not 100%, but we've gotten pretty good about writing errors to the logs when something goes wrong. Kernel panics usually show up in syslog on modern-day kernels when a node falls over, with information about what bit of code caused the panic. Giving us as much information as you can throw at us is better than trying to filter out the things that seem important. We're used to handling lots of data. :-)
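
As an example of the second point: from your description below (scp,
gzip, and several users doing ls/mv/cp on the head node), even a
crude imitation like the following might tickle the same bug. The
mount point is an assumption; point it at wherever PVFS is actually
mounted:

----------------
#!/bin/sh
# Crude imitation of the described workload: a handful of concurrent
# writers doing small-file I/O plus metadata traffic on the PVFS2 mount.
# /mnt/pvfs2 is an assumption -- substitute your real mount point.
MNT=/mnt/pvfs2/stress
mkdir -p "$MNT"
for i in 1 2 3 4 5; do
    (
        while :; do
            dd if=/dev/urandom of="$MNT/file.$i" bs=64k count=160 2>/dev/null
            gzip -f "$MNT/file.$i"
            ls -l "$MNT" >/dev/null
            cp "$MNT/file.$i.gz" "$MNT/copy.$i.gz"
            rm -f "$MNT/file.$i.gz" "$MNT/copy.$i.gz"
        done
    ) &
done
wait
----------------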

We also have quite a few debugging options that can be enabled and written to logs. That's the direction we'll probably have to go if we can't solve your problem otherwise.
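
From memory, the two main knobs look roughly like this; double-check
the keyword and mask names against the docs for your release, since
I'm writing these down without a config file in front of me:

----------------
# Server side: in the Defaults section of the fs.conf on each server,
# turn up the event mask (mask names here are from memory), then
# restart pvfs2-server:
#     EventLogging network,server,flow

# Client side: raise the kernel module's debug mask through the mount
# point (/mnt/pvfs2 is an assumption):
pvfs2-set-debugmask -m /mnt/pvfs2 "file_io,super"
----------------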

Thanks,
-sam


--Jim

On Tue, Jul 28, 2009 at 9:35 AM, Sam Lang <[email protected]> wrote:

Jim,

We'll definitely try to help you resolve the problem you're seeing. That
said, I responded to a similar query of yours back in April.  See:

http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-April/002765.html

It would be great if you could answer the questions I asked in that email.

Also, it's been hinted by yourself and others that this may not be PVFS related, since other users aren't experiencing the same problem. I encourage you to eliminate the possibility of memory problems on your system. You could run memtester (http://pyropus.ca/software/memtester/) on both the servers and the clients to verify that memory on your system isn't the problem.
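
For example, as root, something like the following tests a 2GB region
for five passes; size it below free memory so the node stays usable
(without a suffix, the amount is in megabytes):

----------------
# Test 2048MB of RAM for 5 loops.  Run as root so memtester can
# mlock() the region, and pick a size smaller than free memory.
memtester 2048 5
----------------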

I've created a trac ticket for the problem you're seeing, so that we can
keep track of it that way.  See:

https://trac.mcs.anl.gov/projects/pvfs/ticket/113

-sam

On Jul 28, 2009, at 10:58 AM, Jim Kusznir wrote:

Hi all:

More or less since I installed pvfs2, I've had recurring stability
issues.  Presently, my cluster head node has 3 processes, each using
100% of a core, that are "hung" on I/O (all of that processor usage
is in "system", not "user"), yet the processes are not in the "D"
state (they move between S and R).  They should have completed in an
hour or less; they've now been running for over 18 hours.  They also
don't respond to kills (including kill -9).  From the sound of the
users' messages, any additional processes started in the same working
directory hang the same way.

This happens a lot.  Presently, the 3 hung processes are a binary
specific to the research (x2) and gzip; often the hung processes are
ls or ssh (for scp), etc.  When this happens, all other physical
systems are still fully functional.  This has happened repeatedly
(although it isn't reproducible on demand) on versions 1.5 through
2.8.1.  The only recovery option I have found to date is to reboot
the system.  It normally only happens on the head node, but the head
node is also where a lot of the user I/O takes place (especially lots
of small I/O accesses: a few scp sessions, some gzips, and 5-10 users
doing ls, mv, and cp operations).
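
Next time this happens, before rebooting, I can try to capture kernel
stack traces for the hung processes.  As I understand it, something
like this (assuming sysrq is enabled) dumps every task's state and
stack into the kernel log:

----------------
# As root on the head node while the processes are hung:
echo 1 > /proc/sys/kernel/sysrq     # make sure sysrq is enabled
echo t > /proc/sysrq-trigger        # dump all task states/stacks to dmesg
dmesg > sysrq-tasks.txt             # save the output for the list
----------------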

Given what I understand about pvfs2's current user base, I'd think it
must be stable; a large cluster could never run pvfs2 and still be
useful to its users with the kind of instability I keep experiencing.
As such, I suspect the problem is somewhere in my system/setup, but to
date pcarns and others on #pvfs2 have not been able to identify what
it is.  These stability issues are significantly affecting the
usability of the cluster and, of course, beginning to deter users from
it and to cast doubt on my competency in administering it.  Yet from
what I can tell, I'm hitting some bug in the pvfs kernel module.  I'd
really like to get this problem fixed, and I'm at a loss as to how,
other than replacing pvfs2 with some other filesystem, which I'd
rather not do.

How do I fix this problem without replacing pvfs2?

--Jim
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users


