On Jul 29, 2009, at 3:32 PM, Jim Kusznir wrote:

Here's the last portion of the pvfs2-client log.  My head node
spontaneously rebooted again this afternoon.  There's never anything
in the system logs when this happens.  I don't know where the reboot
falls within the pvfs2-client log, as the log doesn't appear to
rotate on its own.

----------------
[E 09:45:59.292032] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:01.300109] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:03.308181] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:05.330273] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:07.350688] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:09.371106] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:09.371125] *** msgpairarray_completion_fn: msgpair to server
tcp://pvfs2-io-0-0:3334 failed: Connection refused
[E 09:46:09.371137] *** Out of retries.
[E 09:46:09.381108] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:11.391534] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:13.411966] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:15.432403] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:17.442847] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:19.463301] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Connection refused
[E 09:46:19.463322] *** msgpairarray_completion_fn: msgpair to server
tcp://pvfs2-io-0-0:3334 failed: Connection refused
[E 09:46:19.463334] *** Out of retries.
[E 09:53:48.857615] job_time_mgr_expire: job time out: cancelling bmi
operation, job_id: 2194138.
[E 09:53:48.857710] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Operation cancelled (possibly due
to timeout)
[E 09:53:48.858193] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Operation cancelled (possibly due
to timeout)
[E 09:53:48.858226] Warning: msgpair failed to
tcp://pvfs2-io-0-0:3334, will retry: Operation cancelled (possibly due
to timeout)
[E 08:45:16.995718] PVFS Client Daemon Started.  Version 2.8.1
[D 08:45:17.007853] [INFO]: Mapping pointer 0x2b8c00d67000 for I/O.
[D 08:45:17.022636] [INFO]: Mapping pointer 0x15c4a000 for I/O.
[E 10:33:45.216228] Child process with pid 9281 was killed by an
uncaught signal 6
[E 10:33:45.220192] PVFS Client Daemon Started.  Version 2.8.1
[D 10:33:45.220444] [INFO]: Mapping pointer 0x2aed09e06000 for I/O.
[D 10:33:45.235840] [INFO]: Mapping pointer 0x64ec000 for I/O.
[E 10:59:40.888591] PVFS Client Daemon Started.  Version 2.8.1
[D 10:59:40.896507] [INFO]: Mapping pointer 0x2abaaf20f000 for I/O.
[D 10:59:40.911142] [INFO]: Mapping pointer 0x10a5a000 for I/O.
----------------
I know that at some point yesterday, and a few days prior, the I/O
servers went down (originally for the migration to 2.8.1, which took
longer than I expected due to the conversion), and then yesterday at
some point at least one of the server processes died.  I rebooted the
server this morning around 8:30 (to clear up the hung processes that
resulted yesterday from pvfs2 I/O operations that did not complete).
It appears that around noon it rebooted again of its own accord.
There's nothing in the logs to say why, but it only happens during
periods of heavy user pvfs2 I/O.

Hi Jim,

Unfortunately, that log probably isn't going to be useful. All it shows is that a client couldn't contact a server (pvfs2-io-0-0) and then, on a different day, that the client was restarted a number of times. The client log doesn't rotate; it just keeps growing. If possible, could you send the pvfs2-client.log files from all the clients in your cluster, as well as the pvfs2-server.log files from each of the three servers? Also, could you clarify a few points? I've put some questions below.
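
If it's easier, a loop along these lines can collect everything in
one pass. The hostnames and log paths here are placeholders (I don't
know where your init scripts send the logs), so adjust them to match
your site:

----------------
#!/bin/sh
# Gather PVFS2 logs from clients and servers into ./pvfs2-logs/.
# Hostnames and log paths are placeholders -- substitute your real
# node names and whatever paths your init scripts actually use.
mkdir -p pvfs2-logs
for client in headnode compute-0-0 compute-0-1; do
    scp "$client:/var/log/pvfs2-client.log" "pvfs2-logs/$client-client.log"
done
for server in pvfs2-io-0-0 pvfs2-io-0-1 pvfs2-meta-0-0; do
    scp "$server:/var/log/pvfs2-server.log" "pvfs2-logs/$server-server.log"
done
tar czf pvfs2-logs.tar.gz pvfs2-logs
----------------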

You migrated to PVFS 2.8.1 and the I/O servers were "down" during that time. Did you keep the PVFS volume mounted on the clients, with the kernel module loaded and the client daemon running during that time?

When the servers went down yesterday and the other days, did just the pvfs2-server process die? Or did the entire server spontaneously reboot?

Did you reboot the _server_ to clear up hung processes? Hung processes are going to be on the client nodes, unless PVFS is also mounted on the same node the servers are running on, but that's not true for your config IIRC.

When it rebooted around noon, which node rebooted? A server node or a client? Which node in particular? You've seen a number of spontaneous reboots. Are they limited to client nodes or server nodes (or maybe just a few client nodes)? Are there no kernel panics in syslog when those reboots occur?
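
One caveat on syslog: if a node panics hard, the backtrace often
never makes it to disk, so an empty /var/log/messages doesn't rule
out a panic. If you get the chance, you could point netconsole at
another machine so the panic output survives the reboot. A rough
sketch, with made-up addresses, port, and MAC that you'd replace for
your network:

----------------
# On the node that keeps rebooting (IPs, port, and MAC below are
# placeholders).  This streams kernel messages over UDP to 192.168.1.1.
modprobe netconsole netconsole=@/eth0,[email protected]/00:11:22:33:44:55

# On the receiving machine, capture whatever arrives on that port
# (option spelling varies between netcat flavors):
nc -u -l -p 6665 | tee netconsole.log
----------------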


If it's helpful, our approach to debugging these sorts of problems usually involves:

* Isolating the problem. If we know that process hangs or node reboots always occur on node X, or always happen because the client daemon dies, or because one of the servers falls over, etc., that gives us an area to focus on. PVFS (or any distributed file system) has a number of different components that interact in different ways, and knowing which bits of code are the source of the problem makes it a lot easier for us to figure out. Basically, we're not very good at whack-a-mole.

* Reproducing the problem. If you can reproduce what you're seeing by running some application with a specific set of parameters, or by using a synthetic test, we can probably reproduce it on our own systems as well. That makes debugging the problem a _ton_ easier, and a fix usually follows in short order. Obviously, it's not always possible to reproduce a problem, especially one that's system specific, but even a rough approximation helps (see the sketch after this list).

* Lots of logging/debugging. Problems in PVFS get written to the logs as error messages most of the time. It's probably not 100%, but we've gotten pretty good about writing errors to the logs when something goes wrong. Kernel panics usually show up in syslog on modern-day kernels when a node falls over, with information about what bit of code caused the panic. Giving us as much information as you can throw at us is better than trying to filter out the things that seem important. We're used to handling lots of data. :-)
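
As an example of the second point: from your description below (scp,
gzip, and several users doing ls/mv/cp on the head node), even a
crude imitation like the following might tickle the same bug. The
mount point is an assumption; point it at wherever PVFS is actually
mounted:

----------------
#!/bin/sh
# Crude imitation of the described workload: a handful of concurrent
# writers doing small-file I/O plus metadata traffic on the PVFS2 mount.
# /mnt/pvfs2 is an assumption -- substitute your real mount point.
MNT=/mnt/pvfs2/stress
mkdir -p "$MNT"
for i in 1 2 3 4 5; do
    (
        while :; do
            dd if=/dev/urandom of="$MNT/file.$i" bs=64k count=160 2>/dev/null
            gzip -f "$MNT/file.$i"
            ls -l "$MNT" >/dev/null
            cp "$MNT/file.$i.gz" "$MNT/copy.$i.gz"
            rm -f "$MNT/file.$i.gz" "$MNT/copy.$i.gz"
        done
    ) &
done
wait
----------------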

We also have quite a few debugging options that can be enabled and written to logs. That's the direction we'll probably have to go if we can't solve your problem otherwise.
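
From memory, the two main knobs look roughly like this; double-check
the keyword and mask names against the docs for your release, since
I'm writing these down without a config file in front of me:

----------------
# Server side: in the Defaults section of the fs.conf on each server,
# turn up the event mask (mask names here are from memory), then
# restart pvfs2-server:
#     EventLogging network,server,flow

# Client side: raise the kernel module's debug mask through the mount
# point (/mnt/pvfs2 is an assumption):
pvfs2-set-debugmask -m /mnt/pvfs2 "file_io,super"
----------------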

Thanks,
-sam


--Jim

On Tue, Jul 28, 2009 at 9:35 AM, Sam Lang <[email protected]> wrote:

Jim,

We'll definitely try to help you resolve the problem you're seeing. That
said, I responded to a similar query of yours back in April.  See:

http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-April/002765.html

It would be great if you could answer the questions I asked in that email.

Also, it's been hinted by yourself and others that this may not be PVFS related, since other users aren't experiencing the same problem. I encourage you to eliminate the possibility of memory problems on your system. You could run memtester (http://pyropus.ca/software/memtester/) on both the servers and the clients to verify that memory on your system isn't the problem.
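
For example, as root, something like the following tests a 2GB region
for five passes; size it below free memory so the node stays usable
(without a suffix, the amount is in megabytes):

----------------
# Test 2048MB of RAM for 5 loops.  Run as root so memtester can
# mlock() the region, and pick a size smaller than free memory.
memtester 2048 5
----------------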

I've created a trac ticket for the problem you're seeing, so that we can
keep track of it that way.  See:

https://trac.mcs.anl.gov/projects/pvfs/ticket/113

-sam

On Jul 28, 2009, at 10:58 AM, Jim Kusznir wrote:

Hi all:

More or less since I installed pvfs2, I've had recurring stability
issues.  Presently, my cluster head node has 3 processes, each using
100% of a core, that are "hung" on I/O (all of that processor usage
is in "system", not "user"), yet the processes are not in the "D"
state (they move between S and R).  They should have completed in an
hour or less; they've now been running for over 18 hours.  They also
don't respond to kills (including kill -9).  From the sound of the
users' messages, any additional processes started in the same working
directory hang the same way.

This happens a lot.  Presently, the 3 hung processes are a binary
specific to the research (x2) and gzip; often the hung processes are
ls or ssh (for scp), etc.  When this happens, all other physical
systems are still fully functional.  This has happened repeatedly
(although it isn't reproducible on demand) on versions 1.5 through
2.8.1.  The only recovery option I have found to date is to reboot
the system.  It normally only happens on the head node, but the head
node is also where a lot of the user I/O takes place (especially lots
of small I/O accesses: a few scp sessions, some gzips, and 5-10 users
doing ls, mv, and cp operations).
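
Next time this happens, before rebooting, I can try to capture kernel
stack traces for the hung processes.  As I understand it, something
like this (assuming sysrq is enabled) dumps every task's state and
stack into the kernel log:

----------------
# As root on the head node while the processes are hung:
echo 1 > /proc/sys/kernel/sysrq     # make sure sysrq is enabled
echo t > /proc/sysrq-trigger        # dump all task states/stacks to dmesg
dmesg > sysrq-tasks.txt             # save the output for the list
----------------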

Given what I understand about pvfs2's current user base, I'd think it
must be stable; a large cluster could never run pvfs2 and still be
useful to its users with the kind of instability I keep experiencing.
As such, I suspect the problem is somewhere in my system/setup, but to
date pcarns and others on #pvfs2 have not been able to identify what
it is.  These stability issues are significantly affecting the
usability of the cluster and, of course, beginning to deter users from
it and to cast doubt on my competency in administering it.  Yet from
what I can tell, I'm hitting some bug in the pvfs kernel module.  I'd
really like to get this problem fixed, and I'm at a loss as to how,
other than replacing pvfs2 with some other filesystem, which I'd
rather not do.

How do I fix this problem without replacing pvfs2?

--Jim
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users


