Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On Feb 1, 2011, at 3:58 PM, Andrew Deason wrote:

> On Tue, 01 Feb 2011 12:04:08 -0800 Patricia O'Reilly wrote:
>
>> From what you have described it sounds to me like you need the patch
>> that Andrew referenced earlier that allows you to configure an
>> -offline-timeout and -offline-shutdown-timeout option on your
>> fileservers. We have had similar problems at our site and will be
>> releasing that patch into production shortly.
>
> Maybe, maybe not. I think the most common cause of this is just having
> more volumes than can be shut down in 30 minutes. Determining this is
> easy; if it happens every single time you shut down the fileserver,
> that's probably it. (But obviously that's not fun to do.)
>
> But it could also be the 1.4.11 host package bugs; I don't know, and I
> just noted that cause to illustrate that there are several possible
> reasons.

As noted earlier, we saw this at least as far back as our use of 1.4.8. Prior to that we'd been doing rolling restarts - i.e., moving all the volumes off a server before restarting it. So it may have been present earlier, but we simply didn't hit it.

>> Jeff Blaine wrote:
>>> Thanks for the replies.
>>>
>>> I can't at all fathom that our issue is one of existing
>>> client connections and callback break completion (timing out).
>
> I'd only say that if you have pretty good control over all of your
> clients. It's possible to see some really bizarre behavior (from the
> fileserver's point of view) from old clients or clients on
> oddly-behaving networks or NATs.

Seconded. A number of our more savvy users (or users who have savvy IT admins) run AFS at home, another large batch of folks are behind NATs/firewalls, and a third small group are alumni or ex-staff who use their AFS space from all over the world. As a proportion of overall users that's fairly small, but as a proportion of folks whose hosts time out during shutdown it's pretty large.

>>> Let's assume this issue is what caused our problem. I'm sort of at
>>> a loss as to how to approach OpenAFS versions. On one hand,
>>> expectations of more effort to make it clear in the release notes
>>> what items could cause something like unclean server shutdowns (kind
>>> of a big deal, IMO) are not really justifiable.
>
> This wasn't an issue causing fileserver shutdowns to hang and get
> killed, it was a general fileserver stability issue; that hang (or
> crash, or however it manifested; I've seen both) could happen at any
> time.

There were two things that seemed to make the problem more likely: having the server up for a long time, and having lots of different hosts using volumes from that server. We did find a log entry that was usually a symptom of the problem about to occur, but once that entry appeared it was too late to fix it - either the server would crash or would get into an infinite loop within the next few minutes to hours. Attempting to restart the server once we'd seen it always tickled the bug; attaching to the process with gdb and forcing a core dump was how we finally diagnosed the bloody thing.

>>> It's open source, etc. On the other hand, it's not acceptable to
>>> blindly upgrade to the latest stable release every time it comes
>>> out. I understand that the most obvious take-away is just, "You got
>>> bit. Move on.", but if anything can improve on our end, I'd like to
>>> do that.
>
> Perhaps not right when it comes out, but it can be a good idea to move
> towards them, depending on how you do your risk/change management.
> Waiting a bit after each stable release for production machines makes
> sense, to see if unknown issues crop up, but if there are significant
> issues, you will hear about it if you are paying attention (probably in
> the form of a new release, fixing the issue).
>
> 1.4.12 was released almost a year ago, and I don't think there are any
> significant problems besides the issues that caused 1.4.14 to be
> released. There are some smaller issues here and there that sometimes
> get hit, but there's no fix on the 1.4.x branch for 1.4.12 that would
> cause me to recommend rolling back to pre-1.4.12 if you had upgraded a
> machine to 1.4.12.

1.4.12 has been bery bery good to me; there's no fix in .13/.14 that seems to affect us. Right now we're gearing up to build a test host for the latest 1.6 release candidate. Barring some disastrous newfound issue with 1.4.12, 1.6 makes more sense. As noted earlier in this discussion, dynamic attach looks like a fix for shutdown/restart timing issues.

Steve

___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
[OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On Tue, 01 Feb 2011 12:04:08 -0800 Patricia O'Reilly wrote:

> From what you have described it sounds to me like you need the patch
> that Andrew referenced earlier that allows you to configure an
> -offline-timeout and -offline-shutdown-timeout option on your
> fileservers. We have had similar problems at our site and will be
> releasing that patch into production shortly.

Maybe, maybe not. I think the most common cause of this is just having more volumes than can be shut down in 30 minutes. Determining this is easy; if it happens every single time you shut down the fileserver, that's probably it. (But obviously that's not fun to do.)

But it could also be the 1.4.11 host package bugs; I don't know, and I just noted that cause to illustrate that there are several possible reasons.

> Jeff Blaine wrote:
>> Thanks for the replies.
>>
>> I can't at all fathom that our issue is one of existing
>> client connections and callback break completion (timing out).

I'd only say that if you have pretty good control over all of your clients. It's possible to see some really bizarre behavior (from the fileserver's point of view) from old clients or clients on oddly-behaving networks or NATs.

>>> Also, in this specific case, it may not be just that shutting down
>>> volumes took too long. 1.4.11 has known problems that can cause this
>>> (e.g. the host list gets a loop in it, and something spins forever
>>> trying to traverse the whole list).
>>
>> That's this, I think?:
>>
>> - Fixes to avoid issues cleaning up deleted hosts in
>>   the fileserver (126454)

There were a few issues; all of the ones known to cause problems are included in 1.4.12. I don't have references for all of them off the top of my head, but I can get them for you if you want.

>> Let's assume this issue is what caused our problem. I'm sort of at
>> a loss as to how to approach OpenAFS versions. On one hand,
>> expectations of more effort to make it clear in the release notes
>> what items could cause something like unclean server shutdowns (kind
>> of a big deal, IMO) are not really justifiable.

This wasn't an issue causing fileserver shutdowns to hang and get killed; it was a general fileserver stability issue, and that hang (or crash, or however it manifested; I've seen both) could happen at any time.

And doing something like that actually isn't that difficult for at least most of the issues I am involved with. I already generally know which versions are affected for the bigger issues, so just writing that down would not be that hard. (But going back through all of the changes between 1.4.Z and 1.4 head would be a lot of work at this point.) But that's not true for all changes, and I think it may be prohibitively difficult if we had to include information like that with every single change to the stable branch.

I'm not sure how useful it is, though. In the specific case of the host list issues, the only meaningful thing I can say is that "sometimes the fileserver crashes". It's not really possible for you to know how susceptible you are to it (unless you get hit by it), because the circumstances required to trigger the crash are rather complex, and they involve access patterns of clients that you generally cannot control or even detect.

>> It's open source, etc. On the other hand, it's not acceptable to
>> blindly upgrade to the latest stable release every time it comes
>> out. I understand that the most obvious take-away is just, "You got
>> bit. Move on.", but if anything can improve on our end, I'd like to
>> do that.

Perhaps not right when it comes out, but it can be a good idea to move towards them, depending on how you do your risk/change management. Waiting a bit after each stable release for production machines makes sense, to see if unknown issues crop up, but if there are significant issues, you will hear about it if you are paying attention (probably in the form of a new release, fixing the issue).

1.4.12 was released almost a year ago, and I don't think there are any significant problems besides the issues that caused 1.4.14 to be released. There are some smaller issues here and there that sometimes get hit, but there's no fix on the 1.4.x branch for 1.4.12 that would cause me to recommend rolling back to pre-1.4.12 if you had upgraded a machine to 1.4.12.

--
Andrew Deason
adea...@sinenomine.net
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
From what you have described it sounds to me like you need the patch that Andrew referenced earlier that allows you to configure an -offline-timeout and -offline-shutdown-timeout option on your fileservers. We have had similar problems at our site and will be releasing that patch into production shortly.

--patty

Jeff Blaine wrote:
> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within 1800 seconds
> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
>
> Thanks for the replies.
>
> I can't at all fathom that our issue is one of existing
> client connections and callback break completion (timing out).
>
>> Also, in this specific case, it may not be just that shutting down
>> volumes took too long. 1.4.11 has known problems that can cause this
>> (e.g. the host list gets a loop in it, and something spins forever
>> trying to traverse the whole list).
>
> That's this, I think?:
>
> - Fixes to avoid issues cleaning up deleted hosts in
>   the fileserver (126454)
>
> Let's assume this issue is what caused our problem. I'm sort
> of at a loss as to how to approach OpenAFS versions. On one
> hand, expectations of more effort to make it clear in the
> release notes what items could cause something like unclean
> server shutdowns (kind of a big deal, IMO) are not really
> justifiable. It's open source, etc. On the other hand,
> it's not acceptable to blindly upgrade to the latest stable
> release every time it comes out. I understand that the most
> obvious take-away is just, "You got bit. Move on.", but
> if anything can improve on our end, I'd like to do that.
>
> I welcome any suggestions for how others are approaching this.
>
> Jeff Blaine
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within 1800 seconds
Wed Jan 26 12:58:37 2011: fs:file exited on signal 9

Thanks for the replies.

I can't at all fathom that our issue is one of existing client connections and callback break completion (timing out).

> Also, in this specific case, it may not be just that shutting down
> volumes took too long. 1.4.11 has known problems that can cause this
> (e.g. the host list gets a loop in it, and something spins forever
> trying to traverse the whole list).

That's this, I think?:

- Fixes to avoid issues cleaning up deleted hosts in
  the fileserver (126454)

Let's assume this issue is what caused our problem. I'm sort of at a loss as to how to approach OpenAFS versions. On one hand, expectations of more effort to make it clear in the release notes what items could cause something like unclean server shutdowns (kind of a big deal, IMO) are not really justifiable. It's open source, etc. On the other hand, it's not acceptable to blindly upgrade to the latest stable release every time it comes out. I understand that the most obvious take-away is just, "You got bit. Move on.", but if anything can improve on our end, I'd like to do that.

I welcome any suggestions for how others are approaching this.

Jeff Blaine
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On 1/31/2011 12:17 PM, Stephen Joyce wrote:

> On Mon, 31 Jan 2011, Steve Simmons wrote:
>> We have about 235,000 volumes spread across 40 vice partitions. Our
>> 'fix' is a combination of lengthening that timeout to 3600 seconds
>> and keeping our vice partitions no larger than 2TB. Active partitions
>> are spread roughly equally across those 40 partitions. But that's just
>> a stopgap; the longer a server stays up, the more likely it
>> accumulates dead callbacks.
>
> Assuming this is true, isn't this a good argument to keep the weekly
> server process restarts?

But it's not true. As Andrew has already pointed out, callbacks are not broken on server shutdown. In any case, callbacks have a life span of minutes to hours. Even if a callback was recorded that was a week old, it would not trigger an RPC to the client when it came time to break callbacks on the associated object.

Jeffrey Altman
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On Jan 31, 2011, at 12:36 PM, Andrew Deason wrote:

> On Mon, 31 Jan 2011 11:54:24 -0500 Steve Simmons wrote:
>
>>> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
>>> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
>>> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
>>> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown
>>> within 1800 seconds
>>> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
>>
>> We have seen similar issues. It occurs when there is a given vice
>> partition where lots of clients have registered callbacks but those
>> clients are no longer accessible. Not all the clients have responded
>> when the 1800 second timer goes off, and the fileserver goes down
>> uncleanly.
>
> Also, in this specific case, it may not be just that shutting down
> volumes took too long. 1.4.11 has known problems that can cause this
> (e.g. the host list gets a loop in it, and something spins forever
> trying to traverse the whole list).

Yeah, we got seriously bit by that bug. But not just on shutdowns; eventually the list would be so corrupt the processes would actually crash. Dan Hyde spent a lot of time on that; it's why we're currently running 1.4.12 with a couple of patches.

'Fixing' that bug by regular server restarts is an argument for those restarts. But we were seeing the 1800 second timeout on shutdown at least back to 1.4.8. Based on our experience with earlier versions, the host list corruption issue didn't surface until post-1.4.8. Or at least, not as badly.

Steve
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On Jan 31, 2011, at 12:17 PM, Stephen Joyce wrote:

> On Mon, 31 Jan 2011, Steve Simmons wrote:
>
>> We have seen similar issues. It occurs when there is a given vice
>> partition where lots of clients have registered callbacks but those
>> clients are no longer accessible. Not all the clients have responded
>> when the 1800 second timer goes off, and the fileserver goes down
>> uncleanly.
>>
>> We have about 235,000 volumes spread across 40 vice partitions. Our
>> 'fix' is a combination of lengthening that timeout to 3600 seconds
>> and keeping our vice partitions no larger than 2TB. Active partitions
>> are spread roughly equally across those 40 partitions. But that's just
>> a stopgap; the longer a server stays up, the more likely it
>> accumulates dead callbacks.
>
> Assuming this is true, isn't this a good argument to keep the weekly
> server process restarts?

Weekly outages, even if only for a few minutes each, are not acceptable here. Doing them less frequently starts to put us into the range of the timeout problems above. At the moment most of our AFS service processes have run happily for 237 days. That alone is a strong argument for not needing weekly restarts; if there are memory leaks, etc., they largely aren't affecting us.

We mostly do restarts when we need to do software upgrades of one sort or another. They are typically done in a rolling fashion - upgrade the hot spare(s), vos move volumes to the hot spare(s), take down the vacated servers and upgrade, lather, rinse, repeat. At one point we went two years without a general AFS shutdown. We only got away from that due to bugs that required us to do OS upgrades more frequently or to upgrade the entire cell at once. Life seems generally better with respect to those issues, and campus' opinion of the service is better when there are no perceived outages.

For the curious, we're running 1.4.12 with a couple of fixes we pulled forward from the 1.4.13 development stream. Barring new developments, the next one we'll give serious consideration to is 1.6.X.
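[Editorial note: the rolling-restart procedure described above can be sketched roughly as follows. This is only an illustration under assumed names - the server, partition, and volume names are hypothetical, and a real evacuation script would enumerate volumes with vos listvol or vos listvldb rather than a hand-written list.]

```shell
# One round of a rolling upgrade: evacuate a fileserver onto a hot spare,
# then take the vacated server down for its upgrade.
# All names here are hypothetical examples.
SRC=fs3.example.com
SPARE=spare1.example.com

for vol in user.jdoe proj.src sw.gcc; do
    vos move -id "$vol" \
        -fromserver "$SRC" -frompartition a \
        -toserver "$SPARE" -topartition a
done

# With the server vacated, shut it down and upgrade it:
bos shutdown "$SRC" -wait
# ...upgrade, restart, and move on to the next server.
```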
[OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On Mon, 31 Jan 2011 11:54:24 -0500 Steve Simmons wrote:

>> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
>> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
>> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
>> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown
>> within 1800 seconds
>> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
>
> We have seen similar issues. It occurs when there is a given vice
> partition where lots of clients have registered callbacks but those
> clients are no longer accessible. Not all the clients have responded
> when the 1800 second timer goes off, and the fileserver goes down
> uncleanly.

Also, in this specific case, it may not be just that shutting down volumes took too long. 1.4.11 has known problems that can cause this (e.g. the host list gets a loop in it, and something spins forever trying to traverse the whole list).

--
Andrew Deason
adea...@sinenomine.net
[OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On Mon, 31 Jan 2011 11:54:24 -0500 Steve Simmons wrote:

> I haven't read the code, but by observing the logfiles during a
> shutdown it appears that fs shutdown breaks callbacks in a
> single-threaded manner per partition. This could probably be
> parallelized; simple thought experiments say X parallel callback
> breaks would result in run time T reduced to T/X.

I have said this before and I will continue to say it: we do not break callbacks on volume shutdown. We reset the client callback state on the next client access after the server comes back up (for non-DAFS). What we _do_ do is wait for existing client connections and callback breaks to complete before we can shut down. There are several causes of callback breaks being initiated, but a fileserver restart/shutdown is not one of them.

If you want to improve shutdown time, DAFS will help just for the portion where disk is the bottleneck. If you want to "kick off" clients during shutdown, so clients holding open a connection don't block a shutdown, take a look at the code that adds the -offline-shutdown-timeout parameter (which is on master, gerrit 2984). That functionality is not implemented for callback-related calls, but it could be with more work.

--
Andrew Deason
adea...@sinenomine.net
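[Editorial note: for anyone who wants to experiment with the option Andrew mentions, the following is a sketch of what the fs bnode definition might look like with it enabled. It assumes a fileserver built from a tree containing the gerrit 2984 change; the paths, hostname, cell name, and the 300-second value are hypothetical examples, not recommendations.]

```shell
# Hypothetical fs instance definition with -offline-shutdown-timeout set.
# Assumes a fileserver binary that includes the gerrit 2984 change;
# paths, hostname, cell, and the 300s value are examples only.
bos create fs1.example.com fs fs \
    -cmd "/usr/afs/bin/fileserver -offline-shutdown-timeout 300" \
         "/usr/afs/bin/volserver" \
         "/usr/afs/bin/salvager" \
    -cell example.com -localauth
```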
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On Mon, 31 Jan 2011, Steve Simmons wrote:

> We have seen similar issues. It occurs when there is a given vice
> partition where lots of clients have registered callbacks but those
> clients are no longer accessible. Not all the clients have responded
> when the 1800 second timer goes off, and the fileserver goes down
> uncleanly.
>
> We have about 235,000 volumes spread across 40 vice partitions. Our
> 'fix' is a combination of lengthening that timeout to 3600 seconds and
> keeping our vice partitions no larger than 2TB. Active partitions are
> spread roughly equally across those 40 partitions. But that's just a
> stopgap; the longer a server stays up, the more likely it accumulates
> dead callbacks.

Assuming this is true, isn't this a good argument to keep the weekly server process restarts?

Cheers, Stephen
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On Jan 28, 2011, at 1:58 PM, Jeff Blaine wrote:

> On 1/28/2011 1:52 PM, Derrick Brashear wrote:
>> did shutdown perchance take 30min?
>
> Yes. I found this in BosLog.old just now:
>
> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within 1800 seconds
> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9

We have seen similar issues. It occurs when there is a given vice partition where lots of clients have registered callbacks but those clients are no longer accessible. Not all the clients have responded when the 1800 second timer goes off, and the fileserver goes down uncleanly.

We have about 235,000 volumes spread across 40 vice partitions. Our 'fix' is a combination of lengthening that timeout to 3600 seconds and keeping our vice partitions no larger than 2TB. Active partitions are spread roughly equally across those 40 partitions. But that's just a stopgap; the longer a server stays up, the more likely it accumulates dead callbacks.

Two things I suspect but don't know for certain:

Dynamic attach may help this a bit, simply because there will be fewer volumes attached and therefore fewer to detach. I plan on trying this out soon. :-)

I haven't read the code, but by observing the logfiles during a shutdown it appears that fs shutdown breaks callbacks in a single-threaded manner per partition. This could probably be parallelized; simple thought experiments say X parallel callback breaks would result in run time T reduced to T/X.
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On Fri, Jan 28, 2011 at 1:58 PM, Jeff Blaine wrote:

> On 1/28/2011 1:52 PM, Derrick Brashear wrote:
>> did shutdown perchance take 30min?
>
> Yes. I found this in BosLog.old just now:
>
> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within 1800 seconds
> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9

Sadly, not enough info to ascertain why, but an unclean shutdown is unclean. And you know which volumes were not offline: they're the ones you needed to salvage.
[OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On Fri, 28 Jan 2011 13:52:02 -0500 Derrick Brashear wrote:

> did shutdown perchance take 30min?

BosLog would still indicate a force kill after 30 mins. What are all of the BosLog entries mentioning the fileserver? (assuming bosserver hasn't been restarted enough times to rotate that away)

--
Andrew Deason
adea...@sinenomine.net
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On 1/28/2011 1:52 PM, Derrick Brashear wrote:

> did shutdown perchance take 30min?

Yes. I found this in BosLog.old just now:

Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within 1800 seconds
Wed Jan 26 12:58:37 2011: fs:file exited on signal 9

> Derrick
>
> On Jan 28, 2011, at 1:50 PM, Jeff Blaine wrote:
>>> Do you have the FileLog from that shutdown?
>>
>> No, it was cycled out by me salvaging :|
>>
>>> And there isn't anything in play that would cause an old version of the
>>> vice partition or something weird like that, is there? (ZFS snapshots,
>>> liveupgrade misconfiguration, etc)
>>
>> No.
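[Editorial note: a quick way to check after the fact whether a past shutdown was forced is to look for the two telltale BosLog lines quoted above. A minimal sketch follows; for demonstration it scans a temporary file containing the thread's own log lines, but in practice you would point BOSLOG at your real BosLog or BosLog.old, traditionally under /usr/afs/logs.]

```shell
# Scan a BosLog for evidence the fileserver was force-killed at shutdown.
# Demo input below reuses the exact lines from this thread; in real use,
# set BOSLOG=/usr/afs/logs/BosLog.old (the path varies by install).
BOSLOG=$(mktemp)
cat > "$BOSLOG" <<'EOF'
Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within 1800 seconds
Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
EOF

# A forced kill shows up as the shutdown timeout followed by SIGKILL (9):
if grep -qE 'failed to shutdown within|fs:file exited on signal 9' "$BOSLOG"; then
    echo "unclean shutdown: expect salvages on restart"
else
    echo "no forced kill recorded"
fi
```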
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
did shutdown perchance take 30min?

Derrick

On Jan 28, 2011, at 1:50 PM, Jeff Blaine wrote:

>> Do you have the FileLog from that shutdown?
>
> No, it was cycled out by me salvaging :|
>
>> And there isn't anything in play that would cause an old version of the
>> vice partition or something weird like that, is there? (ZFS snapshots,
>> liveupgrade misconfiguration, etc)
>
> No.
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
> Do you have the FileLog from that shutdown?

No, it was cycled out by me salvaging :|

> And there isn't anything in play that would cause an old version of the
> vice partition or something weird like that, is there? (ZFS snapshots,
> liveupgrade misconfiguration, etc)

No.
[OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On Fri, 28 Jan 2011 13:17:31 -0500 Jeff Blaine wrote:

> Examples from FileLog.old:
>
> Fri Jan 28 10:02:48 2011 VAttachVolume: volume /vicepf/V2023864046.vol
> needs to be salvaged; not attached.

This just says that the fileserver didn't clear the "I'm using this volume" flag in the header; we salvage because it implies the fileserver was killed, possibly in the middle of some I/O. You're sure everything shut down cleanly before? Do you have the FileLog from that shutdown? It's possible for the fileserver to exit "cleanly" even if for some reason it couldn't offline every single volume (but it will log that it couldn't do so).

And there isn't anything in play that would cause an old version of the vice partition or something weird like that, is there? (ZFS snapshots, liveupgrade misconfiguration, etc)

> Fri Jan 28 10:02:49 2011 VAttachVolume: volume salvage flag is ON for
> /vicepa//V2023886583.vol; volume needs salvage

This is the explicit "something is wrong with this volume" flag being set. This can happen as a result of many different things, but I think all of them are logged in FileLog when they happen. If you have the old FileLog, it might say why. Of course, one of the things that triggers this is the "needs to be salvaged" message above. So, if a previous startup logged those same messages, that would cause this.

--
Andrew Deason
adea...@sinenomine.net
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
Examples from FileLog.old:

Fri Jan 28 10:02:48 2011 VAttachVolume: volume /vicepf/V2023864046.vol needs to be salvaged; not attached.
Fri Jan 28 10:02:49 2011 VAttachVolume: volume salvage flag is ON for /vicepa//V2023886583.vol; volume needs salvage

Examples from the old SalvageLog pretty much run the gamut (it's a 4MB file...):

01/28/2011 10:30:50 Found 13 orphaned files and directories (approx. 26 KB)
01/28/2011 10:30:52 Volume uniquifier is too low; fixed
01/28/2011 10:31:11 Vnode 34: version < inode version; fixed (old status)
01/28/2011 12:54:15 Volume 536872710 (src.local) mount point ./flex/011 to '#src.flex.011#' invalid, converted to symbolic link
01/28/2011 12:27:30 dir vnode 15: special old unlink-while-referenced file .__afs9803 is deleted (vnode 2248)
01/28/2011 12:28:22 dir vnode 1075: ./.gconfd/lock/ior (vnode 4272): unique changed from 54370 to 57920
01/28/2011 12:28:22 dir vnode 1077: ./.gconf/%gconf-xml-backend.lock/ior already claimed by directory vnode 1 (vnode 4278, unique 54373) -- deleted
01/28/2011 12:28:28 dir vnode 607: invalid entry: ./.gconfd/lock/ior (vnode 1114, unique 132811)
01/28/2011 12:37:28 dir vnode 1: invalid entry deleted: ./.ab_library.lock (vnode 50816, unique 25535)

On 1/28/2011 12:33 PM, Andrew Deason wrote:

> On Fri, 28 Jan 2011 12:10:38 -0500 Jeff Blaine wrote:
>
>> The last time we brought our fileservers down (cleanly, according to
>> "shutdown" info via bos status), it struck me as odd that salvages
>> were needed once it came up. I sort of brushed it off.
>
> As in, it salvaged everything automatically when it came back up, or
> volumes were not attached when it came back up, and you needed to
> salvage to bring them online?
>
>> We've done it again, and the same situation is presenting itself, and
>> I'm really confused as to how that is and what is happening
>> incorrectly. One of the three cleanly shutdown fileservers came up
>> with hundreds of unattachable volumes, and is salvaging now by our
>> hand.
>
> Well, why are they not attaching?
> FileLog should tell you. And the salvage logs should say what they
> fixed, if anything, to bring them back online.
>
> Also, salvaging an entire partition at once may be quite a bit faster
> than salvaging volumes individually, depending on how many volumes you
> have. The fileserver needs to be shut down for that to happen, though.
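[Editorial note: picking out which volumes FileLog says need salvaging is a simple text scan. A minimal sketch follows; for demonstration it runs against a temporary file containing the two FileLog lines quoted above, but in practice you would point FILELOG at FileLog.old, traditionally under /usr/afs/logs.]

```shell
# Extract the numeric volume IDs that FileLog flagged for salvage.
# Demo input reuses the two FileLog lines quoted in this thread; in real
# use, set FILELOG=/usr/afs/logs/FileLog.old (the path varies by install).
FILELOG=$(mktemp)
cat > "$FILELOG" <<'EOF'
Fri Jan 28 10:02:48 2011 VAttachVolume: volume /vicepf/V2023864046.vol needs to be salvaged; not attached.
Fri Jan 28 10:02:49 2011 VAttachVolume: volume salvage flag is ON for /vicepa//V2023886583.vol; volume needs salvage
EOF

# Match both "needs to be salvaged" and "needs salvage" messages,
# then strip each V<id>.vol header name down to the bare volume ID:
grep -E 'needs to be salvaged|needs salvage' "$FILELOG" \
    | grep -oE 'V[0-9]+\.vol' \
    | sed -e 's/^V//' -e 's/\.vol$//' \
    | sort -u
```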
Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On 1/28/2011 12:33 PM, Andrew Deason wrote:

> On Fri, 28 Jan 2011 12:10:38 -0500 Jeff Blaine wrote:
>
>> The last time we brought our fileservers down (cleanly, according to
>> "shutdown" info via bos status), it struck me as odd that salvages
>> were needed once it came up. I sort of brushed it off.
>
> As in, it salvaged everything automatically when it came back up, or
> volumes were not attached when it came back up, and you needed to
> salvage to bring them online?

The latter.

>> We've done it again, and the same situation is presenting itself, and
>> I'm really confused as to how that is and what is happening
>> incorrectly. One of the three cleanly shutdown fileservers came up
>> with hundreds of unattachable volumes, and is salvaging now by our
>> hand.
>
> Well, why are they not attaching? FileLog should tell you. And the
> salvage logs should say what they fixed, if anything, to bring them
> back online.

Yes, I am waiting on that to all finish before I examine and reply.

> Also, salvaging an entire partition at once may be quite a bit faster
> than salvaging volumes individually, depending on how many volumes you
> have. The fileserver needs to be shut down for that to happen, though.

I didn't trust it at all and forced a salvage of the whole server. There were many unattachable volumes on every partition.
[OpenAFS] Re: Need volume state / fileserver / salvage knowledge
On Fri, 28 Jan 2011 12:10:38 -0500 Jeff Blaine wrote:

> The last time we brought our fileservers down (cleanly, according to
> "shutdown" info via bos status), it struck me as odd that salvages
> were needed once it came up. I sort of brushed it off.

As in, it salvaged everything automatically when it came back up, or volumes were not attached when it came back up, and you needed to salvage to bring them online?

> We've done it again, and the same situation is presenting itself,
> and I'm really confused as to how that is and what is happening
> incorrectly. One of the three cleanly shutdown fileservers came
> up with hundreds of unattachable volumes, and is salvaging now
> by our hand.

Well, why are they not attaching? FileLog should tell you. And the salvage logs should say what they fixed, if anything, to bring them back online.

Also, salvaging an entire partition at once may be quite a bit faster than salvaging volumes individually, depending on how many volumes you have. The fileserver needs to be shut down for that to happen, though.

--
Andrew Deason
adea...@sinenomine.net
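[Editorial note: whole-partition and whole-server salvages like Andrew describes are driven through bos, which stops the fs instance for the duration. A sketch with hypothetical server and partition names follows; check `bos help salvage` for the options your build supports.]

```shell
# Salvage everything on one partition (hypothetical names). bos stops the
# fs instance, runs the salvager over /vicepf, then restarts the instance.
bos salvage -server fs1.example.com -partition /vicepf -localauth

# Or salvage every partition on the server in one pass:
bos salvage -server fs1.example.com -all -localauth
```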