Hi. In the past week we have had two frustrating periods of significant performance problems in our AFS cell. The first one lasted for maybe two hours, at which point it seemed the culprit was something odd-looking on two of our remote-access Linux servers. I rebooted those servers, and the performance problems disappeared. That sounds good, but I was so busy investigating various red herrings that the problems might have stopped 15-20 minutes earlier without my noticing until after the reboot. This incident, by itself, is not too worrisome.
On Wednesday the significant (but intermittent) performance problems returned, and there was nothing particularly odd-looking on any machines I could see. Based on some Google searches, we zeroed in on the fact that one of our file servers was reporting rather high values for 'calls waiting for a thread' in the output of 'rxdebug $fileserver -rxstats'. The other file servers almost always reported zero calls waiting, but on this one file server the value tended to range between 5 and 50. Occasionally it got over 100. And the higher the value, the more likely we would see performance problems on a wide variety of AFS clients. Googling some more showed that many people had reported that this value was indeed a good indicator of performance problems.

Looking in log files on the file servers, we saw a few (but not many) messages which pointed us to problems in our network. Most of those looked like minor problems; one or two were more significant and were magnified by some heavy network traffic which happened to be going on at the time. We fixed all of those, and actually shut down the process which was (legitimately) doing a lot of network I/O. These were all good things to do, and none of them made a bit of difference to the values we saw for 'calls waiting' on that file server, or to the very frustrating hangs we were seeing on AFS clients.

And then at 7:07am this morning, the problem disappeared. Completely. The 'calls waiting' value on that server has not gone above zero for the entire rest of the day. So, the immediate crisis is over. Everything is working fine.

But my question is: If this returns, how can I track down what is *causing* the calls-waiting value to climb? We had over 100 workstations using AFS at the time, scattered all around campus. I did a variety of things to try to pinpoint the culprit, but didn't have much luck.
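For reference, watching that value over time can be sketched roughly as below. This is only a minimal sketch, not a diagnosis tool: it assumes rxdebug prints a line of the form "N calls waiting for a thread" (the regular expression may need adjusting for your version), and the function names, polling interval, and threshold are all made up for illustration. It does not identify a client or volume; it only records timestamps of the spikes so they can be lined up against other logs.

```python
import re
import subprocess
import time

# Assumed wording of the rxdebug line; adjust if your version prints it differently.
CALLS_WAITING_RE = re.compile(r"(\d+)\s+calls waiting for a thread")


def parse_calls_waiting(rxdebug_output):
    """Return the calls-waiting count found in rxdebug output, or None if absent."""
    match = CALLS_WAITING_RE.search(rxdebug_output)
    return int(match.group(1)) if match else None


def sample_calls_waiting(server):
    """Run 'rxdebug <server> -rxstats' once and return the count (or None)."""
    result = subprocess.run(
        ["rxdebug", server, "-rxstats"],
        capture_output=True, text=True, check=False,
    )
    return parse_calls_waiting(result.stdout)


def watch(server, interval_seconds=10, threshold=5):
    """Print a timestamped line whenever the count reaches the threshold."""
    while True:
        count = sample_calls_waiting(server)
        if count is not None and count >= threshold:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), server, count, flush=True)
        time.sleep(interval_seconds)
```

Calling watch() with the name of the affected file server and redirecting the output to a file would give a record of exactly when each streak of high values started and stopped.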
So, given a streak of high values for 'calls waiting', how can I track that down to a specific client (or clients), or maybe a specific AFS volume?

--
Garance Alistair Drosehn
Senior Systems Programmer
RPI; Troy NY
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
