Well, the pvfs2 server crashed overnight... Until I upgraded to 2.8.1, I had never had the server process crash. Each time I recover from a crash, all the servers use 100% CPU for a while before they start responding to requests again. Here's the output from pvfs2-server.log:
[D 07/22 13:37] PVFS2 Server version 2.8.1 starting.
[E 07/28 09:41] Error: poorly formatted protocol message received.
[E 07/28 09:41] Protocol version mismatch: received major version 5 when expecting 6.
[E 07/28 09:41] Please verify your PVFS2 installation
[E 07/28 09:41] and make sure that the version is consistent.
[E 07/28 09:41] PVFS2 server: signal 11, faulty address is 0x10, from 0x43f114
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server [0x43f114]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server [0x43f114]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_invoke+0xcf) [0x44f61f]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server [0x440608]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_invoke+0xcf) [0x44f61f]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_next+0xbc) [0x44f92c]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_continue+0x1e) [0x44f4ae]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server(main+0xa7e) [0x413b5e]
[E 07/28 09:41] [bt] /lib64/libc.so.6(__libc_start_main+0xf4) [0x33d181d8a4]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server [0x410b49]
[D 07/28 09:46] PVFS2 Server version 2.8.1 starting.
[E 07/29 13:33] Error: poorly formatted protocol message received.
[E 07/29 13:33] Protocol version mismatch: received major version 5 when expecting 6.
[E 07/29 13:33] Please verify your PVFS2 installation
[E 07/29 13:33] and make sure that the version is consistent.
[E 07/29 13:33] PVFS2 server: signal 11, faulty address is 0x10, from 0x43f114
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server [0x43f114]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server [0x43f114]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_invoke+0xcf) [0x44f61f]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server [0x440608]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_invoke+0xcf) [0x44f61f]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_next+0xbc) [0x44f92c]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_continue+0x1e) [0x44f4ae]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server(main+0xa7e) [0x413b5e]
[E 07/29 13:33] [bt] /lib64/libc.so.6(__libc_start_main+0xf4) [0x33d181d8a4]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server [0x410b49]
[D 07/30 08:24] PVFS2 Server version 2.8.1 starting.

--------- and here's from dmesg:

pvfs2-server[8993]: segfault at 0000000000000010 rip 000000000043f114 rsp 00007fff9406f760 error 4
pvfs2-server[14168]: segfault at 0000000000000010 rip 000000000043f114 rsp 00007fff0b3aa260 error 4
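
In case it helps with the backtraces above: my understanding is that the faulting addresses can be mapped back to a function/source line with addr2line, as long as the pvfs2-server binary on the servers hasn't been stripped of its symbols. Just a sketch, I haven't run this here yet (the path is simply where the RPM installed the binary):

    addr2line -f -e /usr/sbin/pvfs2-server 0x43f114
    addr2line -f -e /usr/sbin/pvfs2-server 0x44f61f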

--Jim

On Wed, Jul 29, 2009 at 3:40 PM, Jim Kusznir<[email protected]> wrote:
> Hello:
>
> Thanks for your reply; answers inline below.
>
> On Wed, Jul 29, 2009 at 2:30 PM, Sam Lang<[email protected]> wrote:
>>
>> Hi Jim,
>>
>> Unfortunately that log probably isn't going to be useful. All it shows is that a client wasn't able to contact a server (pvfs2-io-0-0) and then (on a different day) that client was restarted a number of times. The client log doesn't rotate, it just keeps growing. If possible, could you send the pvfs2-client.log files from all the clients in your cluster, as well as the pvfs2-server.log files from each of the three servers? Also, can you clarify some of the above? I've asked a few clarifying questions below.
>
> I can, but they never show anything. The pvfs server logs are always perfectly clear, and as the nodes are currently mostly idle, there's nothing there either. It's only on the headnode of the cluster (which is just a client to pvfs) that I've been experiencing the issues, as the demands on pvfs2 are much greater on the headnode presently.
>
>> You migrated to PVFS 2.8.1 and the I/O servers were "down" during that time. Did you keep the PVFS volume mounted on the clients, with the kernel module loaded and the client daemon running during that time?
>
> I had shut down all the nodes and stopped pvfs on the headnode, then logged in to my 3 pvfs servers, stopped the service, installed the new RPM that I had built earlier that day, and started the pvfs service. Not knowing that an upgrade process was taking place, I then went and started my headnode. In addition, I kicked off a rebuild of my compute nodes, which is how I do software installation on them (installing the new pvfs2 rpm). It turned out that while pvfs was running on my pvfs server nodes, it was not answering requests because of the update, but I only found this out when my clients failed to mount or talk to the nodes.
>
> Later that day, one of my users managed to fill up the pvfs volume, which caused hung processes. I was able to promptly release a few gig, and sent an e-mail out to all my users asking them to clean out their space. Freeing the space did not allow the hung processes to resume or be killed (kill and kill -9 still had no effect on them). My users promptly began to log in and clean their space, and between the du's and rm's, at least one of my pvfs servers crashed. It was a few minutes before I figured out that this had happened; when I did, I restarted pvfs2 on all 3 nodes. Then again I had issues where my clients still were not able to access them, and when I looked on the server nodes again, the pvfs process was using 100% of one core, just like during the upgrade. After about 10 minutes it recovered, and things started working again.
>
>> When the servers went down yesterday and the other days, did just the pvfs2-server process die? Or did the entire server spontaneously reboot?
>
> My pvfs servers had never died or rebooted prior to yesterday, when at least one of the pvfs server processes crashed. I've never had an unexpected reboot on my pvfs servers. The servers always seem to be functioning well; it's always been the clients that have the problems.
>>
>> Did you reboot the _server_ to clear up hung processes? Hung processes are going to be on the client nodes, unless PVFS is also mounted on the same node the servers are running on, but that's not true for your config IIRC.
>
> I rebooted my cluster headnode to clear the processes, as they were user processes trying to do I/O on the pvfs volume through the kernel-module-mounted pvfs data (/mnt/pvfs2). As far as pvfs is concerned, my head node is only a client.
>
>> When it rebooted around noon, which node rebooted? A server node or a client? Which node in particular? You've seen a number of spontaneous reboots. Are they limited to client nodes or server nodes (or maybe just a few client nodes)? Are there no kernel panics in syslog when those reboots occur?
>
> Spontaneous reboots have always happened on client nodes. Normally it's my cluster headnode, as that gets the heaviest pvfs usage, especially "parallel usage" (multiple users performing unrelated I/O on the pvfs volume, such as tar/untar, mv, cp, scp, and some minor preprocessing of data).
> I have occasionally seen a client node reboot for no apparent reason (but the job that was running on it was always an I/O-intensive job). In that case, though, I don't get any logs or the like, since when a compute node reboots in rocks, it is reformatted and rebuilt on its way back up.
>
> I have searched and scoured all the logs (except the pvfs2-client log, which I keep forgetting about since it doesn't go in /var/log like all my other logs), and the only thing that shows up is the system booting back up. There's no sign of a problem or of a cause for going down.
>
>> If it's helpful, our approach to debugging these sorts of problems usually involves:
>>
>> * Isolating the problem. If we know that process hangs or node reboots always occur at node X, or always occur because the client daemon dies, or one of the servers falls over, etc., it gives us an area to focus on. PVFS (or any distributed file system) has a number of different components that interact in different ways. Knowing which bits of code are the source of the problem makes it a lot easier for us to figure out. Basically, we're not very good at whack-a-mole.
>
> Yep. This makes sense. While I'm not an expert with pvfs, my best estimate is that the problem is in the pvfs2 kernel module and/or the code responsible for mounting the filesystem. Like I've said, I've never had issues with the server. Normally when a process hangs (like ls), any other I/O in that directory or below it will also hang. However, other servers will continue to work just fine, as will pvfs2-ls on the affected system. To me, this says the servers are likely working correctly, and the pvfs2-* commands have not broken either. I've never done any ROMIO jobs, so my "client access methods" are limited.
>
>> * Reproducing the problem. If you can reproduce what you're seeing by running some application with a specific set of parameters, or using a synthetic test, we can probably reproduce it on our own systems as well. That makes debugging the problem a _ton_ easier, and a fix is usually in short order. Obviously, it's not always possible to reproduce a problem, especially for those problems that are more system-specific.
>
> This is very hard. I know what kinds of conditions the problems occur in, but only from watching and asking users what they were doing over the course of a few dozen crashes. I've tried to stage crashes, but have not succeeded. Repeating the same actions to the best of our ability does not reproduce the crashes.
>
> Note that there are two different failure modes I'm experiencing. The more common (at least prior to 2.8.1; I don't have enough experience with 2.8.1 yet to speak to it) is that a directory and its subdirectories will "freeze" and all processes that attempt I/O in those directories will hang in an unkillable state. They will each add 1.0 to the system load average, but normally add nothing to the actual %CPU in use. The only method I have of clearing these processes and restoring access to that directory and its subordinates is to reboot the CLIENT system responsible (which is usually my headnode).
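>
> If it would help, the next time a directory freezes I can try to capture the kernel-side state of the stuck processes on the client before I reboot it. This is just a sketch of what I have in mind (assuming magic sysrq is enabled on the headnode, with <pid> standing in for one of the hung processes):
>
>     ps -o pid,stat,wchan:30,args -p <pid>   # process state and what it's waiting on in the kernel
>     cat /proc/<pid>/wchan                   # same info, straight from /proc
>     echo 1 > /proc/sys/kernel/sysrq         # make sure magic sysrq is enabled
>     echo t > /proc/sysrq-trigger            # dump all task states/stacks to the kernel log
>     dmesg > /tmp/hung-task-dump.txt         # save the dump before rebooting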
>
> The other failure mode is a spontaneous reboot. This shows up less frequently, but it has happened at least a dozen times, and at least once (possibly twice) since upgrading to 2.8.1. There are no logs of anything going wrong, just the machine rebooting. It always happens when some users are doing lots of I/O. It's more likely to happen when I have a user running 4-6 scp processes and some other users working on their data, but it's not guaranteed.
>
> In short, I cannot reproduce the problem on demand, but the problem does reproduce itself often enough.
>
>> * Lots of logging/debugging. Problems in PVFS get written to logs as error messages most of the time. It's probably not 100%, but we've gotten pretty good about writing errors to logs when something goes wrong. Kernel panics usually show up in syslog on modern-day kernels when the node falls over, with information about what bit of code caused the panic. Giving us as much information as you can throw at us is better than trying to filter out the things that seem important. We're used to handling lots of data. :-)
>
> Unfortunately, I normally cannot find any trace of a problem. I've checked the syslog on both the client and all 3 servers, and I've checked the pvfs logs on the 3 servers. I have not previously checked the client logs, as I usually forget they exist. However, for today's spontaneous reboot there was nothing, as you can see above. In the event that a cluster node spontaneously reboots, there are no logs to recover at all. I do not have any known/confirmed cases of a client having "frozen directory" issues since much older versions of pvfs2 (1.6.x? 1.7.0 maybe).
>
>> We also have quite a few debugging options that can be enabled and written to logs. That's the direction we'll probably have to go if we can't solve your problem otherwise.
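>>
>> As a rough example of what I mean (treat the exact mask names as placeholders; they vary a bit between versions), server-side verbosity is controlled by the EventLogging line in the server config file:
>>
>>     # in the <Defaults> section of fs.conf; restart pvfs2-server to pick it up
>>     EventLogging server,network,io
>>
>> There is also a pvfs2-set-debugmask utility for turning up the kernel client's logging at runtime.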
>>
>> Thanks,
>> -sam
>>
>>>
>>> --Jim
>>>
>>> On Tue, Jul 28, 2009 at 9:35 AM, Sam Lang<[email protected]> wrote:
>>>>
>>>> Jim,
>>>>
>>>> We'll definitely try to help you resolve the problem you're seeing. That said, I responded to a similar query of yours back in April. See:
>>>>
>>>> http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-April/002765.html
>>>>
>>>> It would be great if you could answer the questions I asked in that email.
>>>>
>>>> Also, it's been hinted by yourself and others that this may not be PVFS related, as other users aren't experiencing the same problem. I encourage you to eliminate the possibility of memory problems on your system. You could try to run memtester (http://pyropus.ca/software/memtester/) on both servers and clients to verify that memory on your system isn't the problem.
>>>>
>>>> I've created a trac ticket for the problem you're seeing, so that we can keep track of it that way. See:
>>>>
>>>> https://trac.mcs.anl.gov/projects/pvfs/ticket/113
>>>>
>>>> -sam
>>>>
>>>> On Jul 28, 2009, at 10:58 AM, Jim Kusznir wrote:
>>>>
>>>>> Hi all:
>>>>>
>>>>> More or less since I installed pvfs2, I've had recurring stability issues. Presently, my cluster headnode has 3 processes, each using 100% of a core, that are "hung" on I/O (all of that processor usage is in "system", not "user"), but the processes are not in "D" state (they move between S and R). The process should have completed in an hour or less; it's now been running for over 18 hours. It also is not responding to kills (including kill -9). From the sound of the users' messages, any additional processes started in the same working directory will hang in the same way.
>>>>>
>>>>> This happens a lot. Presently, the 3 hung processes are a binary specific to the research (x2) and gzip; often, the hung processes are ls and ssh (for scp), etc. When this happens, all other physical systems are still fully functional. This has happened repeatedly (although it is not repeatable on demand) on versions 1.5 through 2.8.1. The only recovery option I have found to date is to reboot the system. This normally only happens on the head node, but the head node is also where a lot of the user I/O takes place (especially a lot of small I/O accesses such as a few scp sessions, some gzips, and 5-10 users doing ls, mv, and cp operations).
>>>>>
>>>>> Given what I understand about pvfs2's current user base, I'd think it must be stable; a large cluster could never run pvfs2 and still be useful to its users with the kind of instability I keep experiencing. As such, I suspect the problem is somewhere in my system/setup, but to date pcarns and others on #pvfs2 have not been able to identify what it is. These stability issues are significantly affecting the usability of the cluster and, of course, beginning to deter users from it and to cast doubt on my competency in administering it. Yet from what I can tell, I'm hitting some bug in the pvfs kernel module. I'd really like to get this problem fixed, and I'm at a loss as to how, other than replacing pvfs2 with some other filesystem, which I'd rather not do.
>>>>>
>>>>> How do I fix this problem without replacing pvfs2?
>>>>>
>>>>> --Jim
>>>>
>>
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
