Re: [OpenAFS] Re: Tuning the -daemons.
Jan Johansson j...@it.su.se wrote:
> I will try my best to post what we did in the end.

After another hang I was able to get a thread dump and it matched the dynamic vcache problem so we added -disable-dynamic-vcaches to the cache manager and it has been trouble free since.

Thank you for the invaluable help provided by this list.

___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Re: Tuning the -daemons.
Andrew Deason adea...@sinenomine.net wrote:
> It suggests that it could be the problem, but technically really anything holding xvcache could cause that (or anything else causing the callback thread to hang). But certainly the issue in this thread is the most likely cause.
>
> If you want to really be sure that that's it, you could 'echo t > /proc/sysrq-trigger' and look in syslog. If you see a process inside afs_FlushVCBs and RXAFS_GiveUpCallBacks, that would pretty much prove that this is the specific issue.

Ok. Thank you. Now we have enough information to discuss solutions. I will try my best to post what we did in the end.
[OpenAFS] Re: Tuning the -daemons.
On Thu, 28 Apr 2011 10:46:25 +0200 Jan Johansson j...@it.su.se wrote:
> So when reading the thread more closely I found a command that I had missed.
>
> cmdebug client
>
> So this time around I tried it when the IMAP server broke and got no response (it timed out). Would it be correct to assume that this is evidence that I am seeing the mentioned problem?

It suggests that it could be the problem, but technically really anything holding xvcache could cause that (or anything else causing the callback thread to hang). But certainly the issue in this thread is the most likely cause.

If you want to really be sure that that's it, you could 'echo t > /proc/sysrq-trigger' and look in syslog. If you see a process inside afs_FlushVCBs and RXAFS_GiveUpCallBacks, that would pretty much prove that this is the specific issue.

-- 
Andrew Deason adea...@sinenomine.net
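A minimal sketch of checking a saved copy of the task dump for that signature. The two function names come from the message above; the helper names are hypothetical and nothing is assumed about the dump format beyond the names appearing as plain text. The check is coarse: it does not confirm that both frames belong to the same process, so the matching lines still need to be read by hand.

```python
# Function names from the thread; everything else here is illustrative.
SIGNATURE = ("afs_FlushVCBs", "RXAFS_GiveUpCallBacks")


def find_signature_lines(dump_text):
    """Return (line number, stripped line) for lines naming either function."""
    hits = []
    for lineno, line in enumerate(dump_text.splitlines(), start=1):
        if any(name in line for name in SIGNATURE):
            hits.append((lineno, line.strip()))
    return hits


def mentions_both(dump_text):
    """Coarse check: True only if both function names occur somewhere."""
    return all(name in dump_text for name in SIGNATURE)
```

Read the syslog (or wherever the dump ended up) into a string, call mentions_both() first, and use find_signature_lines() to jump to the relevant stack traces.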
Re: [OpenAFS] Re: Tuning the -daemons.
> We believe that this behaviour is fixed in 1.6.0pre4.

Do you have any idea when it was introduced?

Harald.
[OpenAFS] Re: Tuning the -daemons.
On Tue, 19 Apr 2011 14:54:38 +0200 (CEST) Harald Barth h...@kth.se wrote:
> > We believe that this behaviour is fixed in 1.6.0pre4.
>
> Do you have any idea when it was introduced?

The underlying issue I think has always existed: xvcache must be write-locked for vcache traversal, and we traverse vcaches looking for something to flush, and a flush may hit a fileserver for a GiveUpCallBacks call when we flush VCBs when we run out of CBRs. I think all of that has always been the case, from looking at git history. (Always meaning back to OpenAFS 1.0.)

Maybe dynamic vcaches made this more likely to be hit, though (which would be 1.4.10, Linux-only). Before/without those, I think you have to run out of free vcache entries before you hit the relevant code path, which I expect happens less often than we ShakeLooseVCaches these days.

-- 
Andrew Deason adea...@sinenomine.net
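The locking pattern described above can be illustrated with a toy model. This is not OpenAFS code: the lock, events and function names are stand-ins invented for the illustration. One thread holds the lock across a simulated blocking RPC, and a second thread that needs the same lock (as the callback handler would) cannot proceed until the RPC returns.

```python
import threading


def simulate(timeout=0.2):
    """Return (callback ran during RPC?, callback ran after RPC?)."""
    xvcache = threading.Lock()    # stand-in for the afs_xvcache lock
    holding = threading.Event()   # set once the flusher owns the lock
    rpc_done = threading.Event()  # set when the simulated RPC returns

    def flusher():
        # Walks the vcaches under the lock and blocks on a simulated
        # GiveUpCallBacks RPC without releasing it.
        with xvcache:
            holding.set()
            rpc_done.wait()

    def callback_can_run():
        # The callback handler needs the same lock before it can answer.
        got = xvcache.acquire(timeout=timeout)
        if got:
            xvcache.release()
        return got

    worker = threading.Thread(target=flusher)
    worker.start()
    holding.wait()                   # make sure the lock is really held
    during_rpc = callback_can_run()  # times out while the RPC is pending
    rpc_done.set()                   # the "fileserver" finally answers
    worker.join()
    after_rpc = callback_can_run()   # lock is free again
    return during_rpc, after_rpc
```

simulate() returns (False, True): while the RPC is outstanding, the handler cannot take the lock. In the real client that would look to the fileserver like a client that has stopped responding.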
Re: [OpenAFS] Re: Tuning the -daemons.
> Maybe dynamic vcaches made this more likely to be hit, though (which would be 1.4.10, Linux-only).

That makes sense as I think we were running something that was 1.4.9-ish a long time without seeing any such issues.

Harald.
Re: [OpenAFS] Re: Tuning the -daemons.
On 18 Apr 2011, at 12:33, Jan Johansson wrote:
> Some time ago (in thread https://lists.openafs.org/pipermail/openafs-info/2011-February/035407.html) I asked about the client -daemons flag.

Reviewing your original post, it has occurred to me that your problem could be a symptom of an issue a number of sites are seeing with callback breaks. Essentially, it is possible for the thread in the client that handles incoming network traffic to hang whilst handling a callback break. If this happens, it appears to the fileserver as if the client is no longer handling data, and you will see the errors that you have been seeing.

We believe that this behaviour is fixed in 1.6.0pre4. If you still have your test environment, it would be very interesting to know whether you still see these problems.

Cheers,
Simon.
[OpenAFS] Re: Tuning the -daemons.
On Mon, 7 Feb 2011 20:55:11 +0100 Jan Johansson j...@it.su.se wrote:
> We had this kind of problems before. In the first round the client made the server crash. An upgrade of the client from Ubuntu Karmic to Ubuntu Lucid solved that.

If the client made the server crash, there was a bug in the server. Clients should not be able to make the server crash, no matter what they do. Upgrading the client may have worked around the problem, but it did not solve it.

> This time around we are rebuilding the IMAP servers for mail clients and since we have a little time before the users arrive with the pitch forks I am trying to understand what the right settings should be.

Well, the right settings would arguably be "don't deliver mail into AFS" ;) But we can try what we can...

> To the best of my knowledge there never was a problem running rxdebug client 7001. I know for a fact the rxdebug server 700X works without problem during the hangs.

To be clear, I mean 'rxdebug client 7001' executed from the server that was emitting this message:

fileserver[1139]: BreakDelayedCallbacks FAILED for host AAA.BBB.CCC.186:7001 which IS UP. Connection from AAA.BBB.CCC.186:7001. Possible network or routing failure.

I would try executing that while the hang is happening, to make sure that the server can initiate connections to the client. If it seems okay, it may help to run 'cmdebug client', and see if you see any messages like

Lock afs_xvcache status: stuff

-- 
Andrew Deason adea...@sinenomine.net
[OpenAFS] Re: Tuning the -daemons.
On Mon, 07 Feb 2011 18:02:23 +0100 (CET) Harald Barth h...@kth.se wrote:
> > Long version: We have a pretty busy IMAP server with Maildir's in AFS (yeah its probably crazy but we have been doing it for a number of years).
>
> Longer answer: You want to tune your servers to -daemons 128 which is

I think you mean -p 128. I believe Jan is asking about the client background I/O daemons, not the number of server threads/processes.

-- 
Andrew Deason adea...@sinenomine.net
Re: [OpenAFS] Re: Tuning the -daemons.
Thank you for your interest in helping out here. I will start with the easy questions and try to get into the kernel later.

Based on the history I believe that the problem is the client/cache manager.

We have had this kind of problem before. In the first round the client made the server crash. An upgrade of the client from Ubuntu Karmic to Ubuntu Lucid solved that. Next the server got overloaded, so we upgraded from Ubuntu Hardy to Ubuntu Lucid, threw out some more of the old FreeBSD and stopped running virtual servers in ESX. Some time passed and we got blocking fileservers; tuning of the fileservers solved some of the problems. We also threw some random options at the client and redesigned the webmail to make users stick to a single IMAP backend.

This time around we are rebuilding the IMAP servers for mail clients, and since we have a little time before the users arrive with the pitchforks I am trying to understand what the right settings should be.

In the earlier cases the server would stop serving any clients, so unrelated services (like webservers) would stop and the users would complain about not being able to save their files. This time it is only the single client/cache manager that is affected.

The server is running Ubuntu Lucid Lynx with the included OpenAFS 1.4.12+dfsg-3 package. The fileserver is started with

-L -abortthreshold 1024 -syslog

The client is running Ubuntu Lucid Lynx with the included OpenAFS 1.4.12+dfsg-3 package. The random options on the webmail backends are

-stat 15000 -dcache 6000 -daemons 6 -volumes 256 -rxpck 2000 -files 5 -afsdb -dynroot -fakestat

The ones I am testing now are

-daemons 6 -afsdb -dynroot -fakestat

To the best of my knowledge there never was a problem running rxdebug client 7001. I know for a fact that rxdebug server 700X works without problem during the hangs.

Jan J