Well, the pvfs2 server crashed overnight... Until I upgraded to 2.8.1, I had never had the server process crash. Each time I recover from a crash, all the servers use 100% CPU for a while before they start responding to requests again. Here's the output from pvfs2-server.log:
[D 07/22 13:37] PVFS2 Server version 2.8.1 starting.
[E 07/28 09:41] Error: poorly formatted protocol message received.
[E 07/28 09:41] Protocol version mismatch: received major version 5 when expecting 6.
[E 07/28 09:41] Please verify your PVFS2 installation
[E 07/28 09:41] and make sure that the version is consistent.
[E 07/28 09:41] PVFS2 server: signal 11, faulty address is 0x10, from 0x43f114
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server [0x43f114]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server [0x43f114]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_invoke+0xcf) [0x44f61f]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server [0x440608]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_invoke+0xcf) [0x44f61f]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_next+0xbc) [0x44f92c]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_continue+0x1e) [0x44f4ae]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server(main+0xa7e) [0x413b5e]
[E 07/28 09:41] [bt] /lib64/libc.so.6(__libc_start_main+0xf4) [0x33d181d8a4]
[E 07/28 09:41] [bt] /usr/sbin/pvfs2-server [0x410b49]
[D 07/28 09:46] PVFS2 Server version 2.8.1 starting.
[E 07/29 13:33] Error: poorly formatted protocol message received.
[E 07/29 13:33] Protocol version mismatch: received major version 5 when expecting 6.
[E 07/29 13:33] Please verify your PVFS2 installation
[E 07/29 13:33] and make sure that the version is consistent.
[E 07/29 13:33] PVFS2 server: signal 11, faulty address is 0x10, from 0x43f114
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server [0x43f114]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server [0x43f114]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_invoke+0xcf) [0x44f61f]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server [0x440608]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_invoke+0xcf) [0x44f61f]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_next+0xbc) [0x44f92c]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server(PINT_state_machine_continue+0x1e) [0x44f4ae]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server(main+0xa7e) [0x413b5e]
[E 07/29 13:33] [bt] /lib64/libc.so.6(__libc_start_main+0xf4) [0x33d181d8a4]
[E 07/29 13:33] [bt] /usr/sbin/pvfs2-server [0x410b49]
[D 07/30 08:24] PVFS2 Server version 2.8.1 starting.

--------- and here's from dmesg:

pvfs2-server[8993]: segfault at 0000000000000010 rip 000000000043f114 rsp 00007fff9406f760 error 4
pvfs2-server[14168]: segfault at 0000000000000010 rip 000000000043f114 rsp 00007fff0b3aa260 error 4
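
In case it helps with the backtraces above: my understanding is that the faulting addresses can be mapped back to a function/source line with addr2line, as long as the pvfs2-server binary on the servers hasn't been stripped of its symbols. Just a sketch, I haven't run this here yet (the path is simply where the RPM installed the binary):

    addr2line -f -e /usr/sbin/pvfs2-server 0x43f114
    addr2line -f -e /usr/sbin/pvfs2-server 0x44f61f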

--Jim

On Wed, Jul 29, 2009 at 3:40 PM, Jim Kusznir<[email protected]> wrote:
> Hello:
>
> Thanks for your reply; answers inline below.
>
> On Wed, Jul 29, 2009 at 2:30 PM, Sam Lang<[email protected]> wrote:
>>
>> Hi Jim,
>>
>> Unfortunately that log probably isn't going to be useful. All it shows is that a client wasn't able to contact a server (pvfs2-io-0-0) and then (on a different day) that client was restarted a number of times. The client log doesn't rotate, it just keeps growing. If possible, could you send the pvfs2-client.log files from all the clients in your cluster, as well as the pvfs2-server.log files from each of the three servers? Also, can you clarify some of the above? I've asked a few clarifying questions below.
>
> I can, but they never show anything. The pvfs server logs are always perfectly clear, and as the nodes are currently mostly idle, there's nothing there either. It's only on the headnode of the cluster (which is just a client to pvfs) that I've been experiencing the issues, as the demands on pvfs2 are much greater on the headnode presently.
>
>> You migrated to PVFS 2.8.1 and the I/O servers were "down" during that time. Did you keep the PVFS volume mounted on the clients, with the kernel module loaded and the client daemon running during that time?
>
> I had shut down all the nodes and stopped pvfs on the headnode, then logged in to my 3 pvfs servers, stopped the service, installed the new RPM that I had built earlier that day, and started the pvfs service. Not knowing that an upgrade process was taking place, I then went and started my headnode. In addition, I kicked off a rebuild of my compute nodes, which is how I do software installation on them (installing the new pvfs2 rpm). It turned out that while pvfs was running on my pvfs server nodes, it was not answering requests because of the update, but I only found this out when my clients failed to mount or talk to the nodes.
>
> Later that day, one of my users managed to fill up the pvfs volume, which caused hung processes. I was able to promptly release a few gig, and sent an e-mail out to all my users asking them to clean out their space. Freeing the space did not allow the hung processes to resume or be killed (kill and kill -9 still had no effect on them). My users promptly began to log in and clean their space, and between the du's and rm's, at least one of my pvfs servers crashed. It was a few minutes before I figured out that this had happened; when I did, I restarted pvfs2 on all 3 nodes. Then again I had issues where my clients still were not able to access them, and when I looked on the server nodes again, the pvfs process was using 100% of one core, just like during the upgrade. After about 10 minutes it recovered, and things started working again.
>
>> When the servers went down yesterday and the other days, did just the pvfs2-server process die? Or did the entire server spontaneously reboot?
>
> My pvfs servers had never died or rebooted prior to yesterday, when at least one of the pvfs server processes crashed. I've never had an unexpected reboot on my pvfs servers. The servers always seem to be functioning well; it's always been the clients that have the problems.
>>
>> Did you reboot the _server_ to clear up hung processes? Hung processes are going to be on the client nodes, unless PVFS is also mounted on the same node the servers are running on, but that's not true for your config IIRC.
>
> I rebooted my cluster headnode to clear the processes, as they were user processes trying to do I/O on the pvfs volume through the kernel-module-mounted pvfs data (/mnt/pvfs2). As far as pvfs is concerned, my head node is only a client.
>
>> When it rebooted around noon, which node rebooted? A server node or a client? Which node in particular? You've seen a number of spontaneous reboots. Are they limited to client nodes or server nodes (or maybe just a few client nodes)? Are there no kernel panics in syslog when those reboots occur?
>
> Spontaneous reboots have always happened on client nodes. Normally it's my cluster headnode, as that gets the heaviest pvfs usage, especially "parallel usage" (multiple users performing unrelated I/O on the pvfs volume, such as tar/untar, mv, cp, scp, and some minor preprocessing of data).
> I have occasionally seen a client node reboot for no apparent reason (but the job that was running on it was always an I/O-intensive job). In that case, though, I don't get any logs or the like, since when a compute node reboots in rocks, it is reformatted and rebuilt on its way back up.
>
> I have searched and scoured all the logs (except the pvfs2-client log, which I keep forgetting about since it doesn't go in /var/log like all my other logs), and the only thing that shows up is the system booting back up. There's no sign of a problem or of a cause for going down.
>
>> If it's helpful, our approach to debugging these sorts of problems usually involves:
>>
>> * Isolating the problem. If we know that process hangs or node reboots always occur at node X, or always occur because the client daemon dies, or one of the servers falls over, etc., it gives us an area to focus on. PVFS (or any distributed file system) has a number of different components that interact in different ways. Knowing which bits of code are the source of the problem makes it a lot easier for us to figure out. Basically, we're not very good at whack-a-mole.
>
> Yep. This makes sense. While I'm not an expert with pvfs, my best estimate is that the problem is in the pvfs2 kernel module and/or the code responsible for mounting the filesystem. Like I've said, I've never had issues with the server. Normally when a process hangs (like ls), any other I/O in that directory or below it will also hang. However, other servers will continue to work just fine, as will pvfs2-ls on the affected system. To me, this says the servers are likely working correctly, and the pvfs2-* commands have not broken either. I've never done any ROMIO jobs, so my "client access methods" are limited.
>
>> * Reproducing the problem. If you can reproduce what you're seeing by running some application with a specific set of parameters, or using a synthetic test, we can probably reproduce it on our own systems as well. That makes debugging the problem a _ton_ easier, and a fix is usually in short order. Obviously, it's not always possible to reproduce a problem, especially for those problems that are more system-specific.
>
> This is very hard. I know what kinds of conditions the problems occur in, but only from watching and asking users what they were doing over the course of a few dozen crashes. I've tried to stage crashes, but have not succeeded. Repeating the same actions to the best of our ability does not reproduce the crashes.
>
> Note that there are two different failure modes I'm experiencing. The more common (at least prior to 2.8.1; I don't have enough experience with 2.8.1 yet to speak to it) is that a directory and its subdirectories will "freeze" and all processes that attempt I/O in those directories will hang in an unkillable state. They will each add 1.0 to the system load average, but normally add nothing to the actual %CPU in use. The only method I have of clearing these processes and restoring access to that directory and its subordinates is to reboot the CLIENT system responsible (which is usually my headnode).
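>
> If it would help, the next time a directory freezes I can try to capture the kernel-side state of the stuck processes on the client before I reboot it. This is just a sketch of what I have in mind (assuming magic sysrq is enabled on the headnode, with <pid> standing in for one of the hung processes):
>
>     ps -o pid,stat,wchan:30,args -p <pid>   # process state and what it's waiting on in the kernel
>     cat /proc/<pid>/wchan                   # same info, straight from /proc
>     echo 1 > /proc/sys/kernel/sysrq         # make sure magic sysrq is enabled
>     echo t > /proc/sysrq-trigger            # dump all task states/stacks to the kernel log
>     dmesg > /tmp/hung-task-dump.txt         # save the dump before rebooting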
>
> The other failure mode is a spontaneous reboot. This shows up less frequently, but it has happened at least a dozen times, and at least once (possibly twice) since upgrading to 2.8.1. There are no logs of anything going wrong, just the machine rebooting. It always happens when some users are doing lots of I/O. It's more likely to happen when I have a user running 4-6 scp processes and some other users working on their data, but it's not guaranteed.
>
> In short, I cannot reproduce the problem on demand, but the problem does reproduce itself often enough.
>
>> * Lots of logging/debugging. Problems in PVFS get written to logs as error messages most of the time. It's probably not 100%, but we've gotten pretty good about writing errors to logs when something goes wrong. Kernel panics usually show up in syslog on modern-day kernels when the node falls over, with information about what bit of code caused the panic. Giving us as much information as you can throw at us is better than trying to filter out the things that seem important. We're used to handling lots of data. :-)
>
> Unfortunately, I normally cannot find any trace of a problem. I've checked the syslog on both the client and all 3 servers, and I've checked the pvfs logs on the 3 servers. I have not previously checked the client logs, as I usually forget they exist. However, for today's spontaneous reboot there was nothing, as you can see above. In the event that a cluster node spontaneously reboots, there are no logs to recover at all. I do not have any known/confirmed cases of a client having "frozen directory" issues since much older versions of pvfs2 (1.6.x? 1.7.0 maybe).
>
>> We also have quite a few debugging options that can be enabled and written to logs. That's the direction we'll probably have to go if we can't solve your problem otherwise.
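>>
>> As a rough example of what I mean (treat the exact mask names as placeholders; they vary a bit between versions), server-side verbosity is controlled by the EventLogging line in the server config file:
>>
>>     # in the <Defaults> section of fs.conf; restart pvfs2-server to pick it up
>>     EventLogging server,network,io
>>
>> There is also a pvfs2-set-debugmask utility for turning up the kernel client's logging at runtime.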
>>
>> Thanks,
>> -sam
>>
>>>
>>> --Jim
>>>
>>> On Tue, Jul 28, 2009 at 9:35 AM, Sam Lang<[email protected]> wrote:
>>>>
>>>> Jim,
>>>>
>>>> We'll definitely try to help you resolve the problem you're seeing. That said, I responded to a similar query of yours back in April. See:
>>>>
>>>> http://www.beowulf-underground.org/pipermail/pvfs2-users/2009-April/002765.html
>>>>
>>>> It would be great if you could answer the questions I asked in that email.
>>>>
>>>> Also, it's been hinted by yourself and others that this may not be PVFS related, as other users aren't experiencing the same problem. I encourage you to eliminate the possibility of memory problems on your system. You could try to run memtester (http://pyropus.ca/software/memtester/) on both servers and clients to verify that memory on your system isn't the problem.
>>>>
>>>> I've created a trac ticket for the problem you're seeing, so that we can keep track of it that way. See:
>>>>
>>>> https://trac.mcs.anl.gov/projects/pvfs/ticket/113
>>>>
>>>> -sam
>>>>
>>>> On Jul 28, 2009, at 10:58 AM, Jim Kusznir wrote:
>>>>
>>>>> Hi all:
>>>>>
>>>>> More or less since I installed pvfs2, I've had recurring stability issues. Presently, my cluster headnode has 3 processes, each using 100% of a core, that are "hung" on I/O (all of that processor usage is in "system", not "user"), but the processes are not in "D" state (they move between S and R). The process should have completed in an hour or less; it's now been running for over 18 hours. It also is not responding to kills (including kill -9). From the sound of the users' messages, any additional processes started in the same working directory will hang in the same way.
>>>>>
>>>>> This happens a lot. Presently, the 3 hung processes are a binary specific to the research (x2) and gzip; often, the hung processes are ls and ssh (for scp), etc. When this happens, all other physical systems are still fully functional. This has happened repeatedly (although it is not repeatable on demand) on versions 1.5 through 2.8.1. The only recovery option I have found to date is to reboot the system. This normally only happens on the head node, but the head node is also where a lot of the user I/O takes place (especially a lot of small I/O accesses such as a few scp sessions, some gzips, and 5-10 users doing ls, mv, and cp operations).
>>>>>
>>>>> Given what I understand about pvfs2's current user base, I'd think it must be stable; a large cluster could never run pvfs2 and still be useful to its users with the kind of instability I keep experiencing. As such, I suspect the problem is somewhere in my system/setup, but to date pcarns and others on #pvfs2 have not been able to identify what it is. These stability issues are significantly affecting the usability of the cluster and, of course, beginning to deter users from it and to cast doubt on my competency in administering it. Yet from what I can tell, I'm hitting some bug in the pvfs kernel module. I'd really like to get this problem fixed, and I'm at a loss as to how, other than replacing pvfs2 with some other filesystem, which I'd rather not do.
>>>>>
>>>>> How do I fix this problem without replacing pvfs2?
>>>>>
>>>>> --Jim
>>>>
>>
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
