Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-02-07 Thread Steve Simmons

On Feb 1, 2011, at 3:58 PM, Andrew Deason wrote:

> On Tue, 01 Feb 2011 12:04:08 -0800
> Patricia O'Reilly  wrote:
> 
>> From what you have described it sounds to me like you need the patch
>> that Andrew referenced earlier that allows you to configure an
>> -offline-timeout and -offline-shutdown-timeout option on your
>> fileservers. We have had similar problems at our site and will be
>> releasing that patch into production shortly.
> 
> Maybe, maybe not. I think the most common cause of this is just having
> too many volumes that can be shut down in 30 minutes. Determining this
> is easy; if it happens every single time you shut down the fileserver,
> that's probably it. (But obviously that's not fun to do.)
> 
> But it could also be the 1.4.11 host package bugs; I don't know, and I
> just noted that cause to illustrate that there are several possible
> reasons.

As noted earlier, we saw this at least as far back as our use of 1.4.8. Prior to that 
we'd been doing rolling restarts - i.e., moving all the volumes off a server 
before restarting it. So it may have been present earlier, but we simply didn't 
hit it.

> 
>> Jeff Blaine wrote:
>>> 
>>> Thanks for the replies.
>>> 
>>> I can't at all fathom that our issue is one of existing
>>> client connections and callback break completion (timing out).
> 
> I'd only say that if you have pretty good control over all of your
> clients. It's possible to see some really bizarre behavior (from the
> fileserver's point of view) from old clients or clients on
> oddly-behaving networks or NATs.

Seconded. A number of our more savvy users (or users who have savvy IT admins) 
run AFS at home, another large batch of folks are behind NATs/firewalls, and a 
third small group are alumni or ex-staff who use their AFS space from all over 
the world. As a proportion of overall users that's fairly small, but as a 
proportion of folks whose hosts time out during shutdown it's pretty large.

>>> Let's assume this issue is what caused our problem.  I'm sort of at
>>> a loss as to how to approach OpenAFS versions.  On one hand,
>>> expectations of more effort to make it clear in the release notes
>>> what items could cause something like unclean server shutdowns (kind
>>> of a big deal, IMO) are not really justifiable.
> 
> This wasn't an issue causing fileserver shutdowns to hang and get
> killed, it was a general fileserver stability issue; that hang (or
> crash, or however it manifested; I've seen both) could happen at any
> time.

There were two things that seemed to make the problem more likely - having the 
server up for a long time, and having lots of different hosts using volumes 
from that server. We did find a log entry that was usually a symptom of the 
problem about to occur, but once that entry appeared it was too late to fix it 
- either the server would crash or would get into an infinite loop in the next 
few minutes to hours. Attempting to restart the server once we'd seen that entry 
always tickled the bug; attaching to the process with gdb and forcing a core dump 
was how we finally diagnosed the bloody thing.
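
Roughly, the core-dump step looks like this (a sketch; the process name, paths,
and the use of gcore are illustrative rather than exactly what we ran):

  # snapshot the running fileserver in place, without killing it
  pid=$(pgrep -x fileserver)
  gdb -p "$pid" -batch \
      -ex "gcore /var/tmp/fileserver-$pid.core" \
      -ex "thread apply all bt" \
      -ex detach > /var/tmp/fileserver-$pid.bt 2>&1

The saved core and per-thread backtraces can then be picked over offline.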

>>> It's open source, etc.  On the other hand, it's not acceptable to
>>> blindly upgrade to the latest stable release every time it comes
>>> out. I understand that the most obvious take-away is just, "You got
>>> bit. Move on.", but if anything can improve on our end, I'd like to
>>> do that.
> 
> Perhaps not right when it comes out, but it can be a good idea to move
> towards them, depending on how you do your risk/change management.
> Waiting a bit after each stable release for production machines makes
> sense, to see if unknown issues crop up, but if there are significant
> issues, you will hear about it if you are paying attention (probably in
> the form of a new release, fixing the issue).
> 
> 1.4.12 was released almost a year ago, and I don't think there are any
> significant problems besides the issues that caused 1.4.14 to be
> released. There are some smaller issues here and there that sometimes
> get hit, but there's no fix on the 1.4.x branch for 1.4.12 that would
> cause me to recommend rolling back to pre-1.4.12 if you had upgraded a
> machine to 1.4.12.

1.4.12 has been very, very good to me; there's no fix in .13/.14 that seems to 
affect us. Right now we're gearing up to build a test host for the latest 1.6 
release candidate. Barring some disastrous newfound issue with 1.4.12, 1.6 
makes more sense. As noted earlier in this discussion, dynamic attach looks 
like a fix for shutdown/restart timing issues.
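
(Illustration only: on 1.6, "dynamic attach" means running the demand-attach
fileserver, which is a different bnode type, so the test host ends up being
converted roughly like this - binary paths and any extra options are
site-specific:

  bos stop <server> fs -localauth
  bos delete <server> fs -localauth
  bos create <server> dafs dafs \
      "/usr/afs/bin/dafileserver" \
      "/usr/afs/bin/davolserver" \
      "/usr/afs/bin/salvageserver" \
      "/usr/afs/bin/dasalvager" -localauth

Volumes then attach on demand instead of all at startup, which is where the
restart-time win comes from.)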

Steve


[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-02-01 Thread Andrew Deason
On Tue, 01 Feb 2011 12:04:08 -0800
Patricia O'Reilly  wrote:

> From what you have described it sounds to me like you need the patch
> that Andrew referenced earlier that allows you to configure an
> -offline-timeout and -offline-shutdown-timeout option on your
> fileservers. We have had similar problems at our site and will be
> releasing that patch into production shortly.

Maybe, maybe not. I think the most common cause of this is just having
too many volumes that can be shut down in 30 minutes. Determining this
is easy; if it happens every single time you shut down the fileserver,
that's probably it. (But obviously that's not fun to do.)
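
(A quick way to check, if old logs are still around; the path assumes a
Transarc-style /usr/afs layout, so adjust for your install:

  grep -E 'failed to shutdown|exited on signal 9' /usr/afs/logs/BosLog*

If every shutdown shows the "failed to shutdown within 1800 seconds" line
followed by a signal 9, it's the too-many-volumes case rather than a one-off.)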

But it could also be the 1.4.11 host package bugs; I don't know, and I
just noted that cause to illustrate that there are several possible
reasons.

> Jeff Blaine wrote:
> > 
> > Thanks for the replies.
> > 
> > I can't at all fathom that our issue is one of existing
> > client connections and callback break completion (timing out).

I'd only say that if you have pretty good control over all of your
clients. It's possible to see some really bizarre behavior (from the
fileserver's point of view) from old clients or clients on
oddly-behaving networks or NATs.

> >> Also, in this specific case, it may not be just that shutting down
> >> volumes took too long. 1.4.11 has known problems that can cause this
> >> (e.g. the host list gets a loop in it, and something spins forever
> >> trying to traverse the whole list).
> > 
> > That's this, I think?:
> > 
> > - Fixes to avoid issues cleaning up deleted hosts in
> >   the fileserver (126454)

There were a few issues; fixes for all of the ones known to cause problems are
included in 1.4.12. I don't have references for all of them off the top
of my head, but I can get them for you if you want.

> > Let's assume this issue is what caused our problem.  I'm sort of at
> > a loss as to how to approach OpenAFS versions.  On one hand,
> > expectations of more effort to make it clear in the release notes
> > what items could cause something like unclean server shutdowns (kind
> > of a big deal, IMO) are not really justifiable.

This wasn't an issue causing fileserver shutdowns to hang and get
killed, it was a general fileserver stability issue; that hang (or
crash, or however it manifested; I've seen both) could happen at any
time.

And doing something like that actually isn't that difficult for at least
most of the issues I am involved with. I already generally know which
versions are affected for the bigger issues, so just writing that down
would not be that hard. (But going back through all of the changes
between 1.4.Z and 1.4 head would be a lot of work at this point.) But
that's not true for all changes, and I think it may be prohibitively
difficult if we had to include information like that with every single
change to the stable branch.

I'm not sure how useful it is, though. In the specific case of the host
list issues, the only meaningful thing I can say is that "sometimes the
fileserver crashes". It's not really possible for you to know how
susceptible you are to it (unless you get hit by it), because the
circumstances required to trigger the crash are rather complex, and they
involve access patterns of clients that you generally cannot control or
even detect.

> > It's open source, etc.  On the other hand, it's not acceptable to
> > blindly upgrade to the latest stable release every time it comes
> > out. I understand that the most obvious take-away is just, "You got
> > bit. Move on.", but if anything can improve on our end, I'd like to
> > do that.

Perhaps not right when it comes out, but it can be a good idea to move
towards them, depending on how you do your risk/change management.
Waiting a bit after each stable release for production machines makes
sense, to see if unknown issues crop up, but if there are significant
issues, you will hear about it if you are paying attention (probably in
the form of a new release, fixing the issue).

1.4.12 was released almost a year ago, and I don't think there are any
significant problems besides the issues that caused 1.4.14 to be
released. There are some smaller issues here and there that sometimes
get hit, but there's no fix on the 1.4.x branch for 1.4.12 that would
cause me to recommend rolling back to pre-1.4.12 if you had upgraded a
machine to 1.4.12.

-- 
Andrew Deason
adea...@sinenomine.net



Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-02-01 Thread Patricia O'Reilly
From what you have described it sounds to me like you need the patch that 
Andrew referenced earlier that allows you to configure an -offline-timeout and 
-offline-shutdown-timeout option on your fileservers. We have had similar 
problems at our site and will be releasing that patch into production shortly.

--patty

Jeff Blaine wrote:
 Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
 Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
 Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
 Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to
 shutdown within 1800 seconds
 Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
> 
> Thanks for the replies.
> 
> I can't at all fathom that our issue is one of existing
> client connections and callback break completion (timing out).
> 
>> Also, in this specific case, it may not be just that shutting down
>> volumes took too long. 1.4.11 has known problems that can cause this
>> (e.g. the host list gets a loop in it, and something spins forever
>> trying to traverse the whole list).
> 
> That's this, I think?:
> 
> - Fixes to avoid issues cleaning up deleted hosts in
>   the fileserver (126454)
> 
> Let's assume this issue is what caused our problem.  I'm sort
> of at a loss as to how to approach OpenAFS versions.  On one
> hand, expectations of more effort to make it clear in the
> release notes what items could cause something like unclean
> server shutdowns (kind of a big deal, IMO) are not really
> justifiable.  It's open source, etc.  On the other hand,
> it's not acceptable to blindly upgrade to the latest stable
> release every time it comes out.  I understand that the most
> obvious take-away is just, "You got bit.  Move on.", but
> if anything can improve on our end, I'd like to do that.
> 
> I welcome any suggestions for how others are approaching this.
> 
> Jeff Blaine


Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-02-01 Thread Jeff Blaine

Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within 
1800 seconds
Wed Jan 26 12:58:37 2011: fs:file exited on signal 9


Thanks for the replies.

I can't at all fathom that our issue is one of existing
client connections and callback break completion (timing out).

> Also, in this specific case, it may not be just that shutting down
> volumes took too long. 1.4.11 has known problems that can cause this
> (e.g. the host list gets a loop in it, and something spins forever
> trying to traverse the whole list).

That's this, I think?:

- Fixes to avoid issues cleaning up deleted hosts in
  the fileserver (126454)

Let's assume this issue is what caused our problem.  I'm sort
of at a loss as to how to approach OpenAFS versions.  On one
hand, expectations of more effort to make it clear in the
release notes what items could cause something like unclean
server shutdowns (kind of a big deal, IMO) are not really
justifiable.  It's open source, etc.  On the other hand,
it's not acceptable to blindly upgrade to the latest stable
release every time it comes out.  I understand that the most
obvious take-away is just, "You got bit.  Move on.", but
if anything can improve on our end, I'd like to do that.

I welcome any suggestions for how others are approaching this.

Jeff Blaine


Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-31 Thread Jeffrey Altman
On 1/31/2011 12:17 PM, Stephen Joyce wrote:
> On Mon, 31 Jan 2011, Steve Simmons wrote:
>> We have about 235,000 volumes spread across 40 vice partitions. Our
>> 'fix' is a combination of lengthening that timeout to 3600 seconds
>> and keeping our vice partitions no larger than 2TB. Active partitions
>> are spread roughly equally across those 40 partitions. But that's just
>> a stopgap; the longer a server stays up, the more likely it
>> accumulates dead callbacks.
> 
> Assuming this is true, isn't this a good argument to keep the weekly
> server process restarts?

But it's not true.  As Andrew has already pointed out, callbacks are not
broken on server shutdown.  In any case, callbacks have a life span of
minutes to hours.  Even if a callback was recorded that was a week old,
it would not trigger an RPC to the client when it came time to break
callbacks on the associated object.

Jeffrey Altman





Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-31 Thread Steve Simmons

On Jan 31, 2011, at 12:36 PM, Andrew Deason wrote:

> On Mon, 31 Jan 2011 11:54:24 -0500
> Steve Simmons  wrote:
> 
>>> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
>>> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
>>> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
>>> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown 
>>> within 1800 seconds
>>> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
>> 
>> We have seen similar issues. It occurs when there is a given vice
>> partition where lots of clients have registered callbacks but those
>> clients are no longer accessible. Not all the clients have responded
>> when the 1800 second timer goes off, and the fileserver goes down
>> uncleanly.
> 
> Also, in this specific case, it may not be just that shutting down
> volumes took too long. 1.4.11 has known problems that can cause this
> (e.g. the host list gets a loop in it, and something spins forever
> trying to traverse the whole list).

Yeah, we got seriously bit by that bug. But not just on shutdowns; eventually 
the list would be so corrupt the processes would actually crash. Dan Hyde spent 
a lot of time on that; it's why we're running 1.4.12 with a couple of patches 
currently. 'Fixing' that bug by regular server restarts is an argument for 
those restarts. But we were seeing the 1800 second timeout on shutdown at least 
back to 1.4.8. Based on our experience with earlier versions, the host list 
corruption issue didn't surface until post-1.4.8. Or at least, not as badly.

Steve


Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-31 Thread Steve Simmons

On Jan 31, 2011, at 12:17 PM, Stephen Joyce wrote:

> On Mon, 31 Jan 2011, Steve Simmons wrote:
> 
>> We have seen similar issues. It occurs when there is a given vice partition 
>> where lots of clients have registered callbacks but those clients are no 
>> longer accessible. Not all the clients have responded when the 1800 second 
>> timer goes off, and the fileserver goes down uncleanly.
>> 
>> We have about 235,000 volumes spread across 40 vice partitions. Our 'fix' is 
>> a combination of lengthening that timeout to 3600 seconds and keeping our 
>> vice partitions no larger than 2TB. Active partitions are spread roughly 
>> equally across those 40 partitions. But that's just a stopgap; the longer a 
>> server stays up, the more likely it accumulates dead callbacks.
> 
> Assuming this is true, isn't this a good argument to keep the weekly server 
> process restarts?

Weekly outages, even if only for a few minutes each, are not acceptable here. 
Doing them less frequently starts to put us into the range of the timeout 
problems above.

At the moment most of our AFS service processes have run happily for 237 days. 
That alone is a strong argument for not needing weekly restarts. If there are 
memory leaks, etc., they largely aren't affecting us in any way we can see.

We mostly do restarts when we need to do software upgrades of one sort or 
another. They are typically done in a rolling fashion - upgrade the hot 
spare(s), vos move volumes to the hot spare(s), take down the vacated servers 
and upgrade, lather, rinse, repeat. At one point we went two years without a 
general AFS shutdown. We only got away from that due to bugs that required us 
to do OS upgrades more frequently or for the entire cell at once. Life seems 
generally better with respect to those issues, and campus's opinion of the 
service is better when there are no perceived outages.
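
The mechanical part is nothing exotic; per vacated server it amounts to roughly
the following (server names, the partition, and the volume selection are
illustrative, and a real run wants error checking and throttling):

  # evacuate the RW volumes from one partition to the spare, one at a time
  vos listvol fs-old.example.edu a -quiet | awk '$3 == "RW" {print $1}' |
  while read vol; do
      vos move "$vol" fs-old.example.edu a fs-spare.example.edu a -localauth
  done

after which the vacated server can be taken down, upgraded, and become the next
spare.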

For the curious, we're running 1.4.12 with a couple of fixes we pulled forward 
from the 1.4.13 development stream. Barring new developments, the next one 
we'll give serious consideration to is 
1.6.X.


[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-31 Thread Andrew Deason
On Mon, 31 Jan 2011 11:54:24 -0500
Steve Simmons  wrote:

> > Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
> > Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
> > Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
> > Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown 
> > within 1800 seconds
> > Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
> 
> We have seen similar issues. It occurs when there is a given vice
> partition where lots of clients have registered callbacks but those
> clients are no longer accessible. Not all the clients have responded
> when the 1800 second timer goes off, and the fileserver goes down
> uncleanly.

Also, in this specific case, it may not be just that shutting down
volumes took too long. 1.4.11 has known problems that can cause this
(e.g. the host list gets a loop in it, and something spins forever
trying to traverse the whole list).

-- 
Andrew Deason
adea...@sinenomine.net



[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-31 Thread Andrew Deason
On Mon, 31 Jan 2011 11:54:24 -0500
Steve Simmons  wrote:

> I haven't read the code, but by observing the logfiles during a
> shutdown time it appears that fs shutdown breaks callbacks in a
> single-threaded manner per partition. This could probably be
> parallelized; simple thought experiments say X parallel callback
> breaks would result in run time T reduced to T/X.

I have said this before and I will continue to say it: we do not break
callbacks on volume shutdown. We reset the client callback state on the
next client access after the server comes back up (for non-DAFS).

What we _do_ do is wait for existing client connections and callback
breaks to complete before we can shut down. There are several causes of
callback breaks to be initiated, but a fileserver restart/shutdown is
not one of them.

If you want to improve shutdown time, DAFS will help just for the
portion where disk is the bottleneck. If you want to "kick off" clients
during shutdown, so clients holding open a connection don't block a
shutdown, take a look at the code that adds the
-offline-shutdown-timeout parameter (which is on master, gerrit 2984).
That functionality is not implemented for callback-related calls, but it
could be with more work.
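
For anyone who does pick up that change, it ends up as just one more fileserver
argument in the fs bnode of BosConfig; a sketch (the other flags and the value
shown are placeholders, and BosConfig is normally edited with the bosserver
down, or the bnode recreated with bos create, before restarting the instance):

  bnode fs fs 1
  parm /usr/afs/bin/fileserver -p 128 -busyat 600 -offline-shutdown-timeout 300
  parm /usr/afs/bin/volserver
  parm /usr/afs/bin/salvager
  end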

-- 
Andrew Deason
adea...@sinenomine.net



Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-31 Thread Stephen Joyce

On Mon, 31 Jan 2011, Steve Simmons wrote:

> We have seen similar issues. It occurs when there is a given vice 
> partition where lots of clients have registered callbacks but those 
> clients are no longer accessible. Not all the clients have responded when 
> the 1800 second timer goes off, and the fileserver goes down uncleanly.
> 
> We have about 235,000 volumes spread across 40 vice partitions. Our 'fix' 
> is a combination of lengthening that timeout to 3600 seconds and 
> keeping our vice partitions no larger than 2TB. Active partitions are 
> spread roughly equally across those 40 partitions. But that's just a 
> stopgap; the longer a server stays up, the more likely it accumulates 
> dead callbacks.


Assuming this is true, isn't this a good argument to keep the weekly server 
process restarts?


Cheers,
Stephen


Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-31 Thread Steve Simmons

On Jan 28, 2011, at 1:58 PM, Jeff Blaine wrote:

> On 1/28/2011 1:52 PM, Derrick Brashear wrote:
>> did shutdown perchance take 30min?
> 
> Yes.  I found this in BosLog.old just now:
> 
> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within 
> 1800 seconds
> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9

We have seen similar issues. It occurs when there is a given vice partition 
where lots of clients have registered callbacks but those clients are no longer 
accessible. Not all the clients have responded when the 1800 second timer goes 
off, and the fileserver goes down uncleanly.

We have about 235,000 volumes spread across 40 vice partitions. Our 'fix' is a 
combination of lengthening that timeout to 3600 seconds and keeping our vice 
partitions no larger than 2TB. Active partitions are spread roughly equally 
across those 40 partitions. But that's just a stopgap; the longer a server 
stays up, the more likely it accumulates dead callbacks.

Two things I suspect but don't know for certain:

Dynamic attach may help this a bit, simply because there will be fewer volumes 
attached and therefore fewer to detach. I plan on trying this out soon. :-)

I haven't read the code, but by observing the logfiles during a shutdown time 
it appears that fs shutdown breaks callbacks in a single-threaded manner per 
partition. This could probably be parallelized; simple thought experiments say 
X parallel callback breaks would result in run time T reduced to T/X.




Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-28 Thread Derrick Brashear
On Fri, Jan 28, 2011 at 1:58 PM, Jeff Blaine  wrote:
> On 1/28/2011 1:52 PM, Derrick Brashear wrote:
>>
>> did shutdown perchance take 30min?
>
> Yes.  I found this in BosLog.old just now:
>
> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within
> 1800 seconds
> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9

Sadly, not enough info to ascertain why, but an unclean shutdown is
unclean. And you know which volumes were not offline: they're the ones
you needed to salvage.


[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-28 Thread Andrew Deason
On Fri, 28 Jan 2011 13:52:02 -0500
Derrick Brashear  wrote:

> did shutdown perchance take 30min?

BosLog would still indicate a force kill after 30 mins. What are all of
the BosLog entries mentioning the fileserver? (assuming bosserver hasn't
been restarted enough times to rotate that away)

-- 
Andrew Deason
adea...@sinenomine.net




Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-28 Thread Jeff Blaine

On 1/28/2011 1:52 PM, Derrick Brashear wrote:

> did shutdown perchance take 30min?


Yes.  I found this in BosLog.old just now:

Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown 
within 1800 seconds

Wed Jan 26 12:58:37 2011: fs:file exited on signal 9



> Derrick
>
> On Jan 28, 2011, at 1:50 PM, Jeff Blaine  wrote:
>
>>> Do you have the FileLog from that shutdown?
>>
>> No, it was cycled out by me salvaging :|
>>
>>> And there isn't anything in play that would cause an old version of the
>>> vice partition or something weird like that, is there? (ZFS snapshots,
>>> liveupgrade misconfiguration, etc)
>>
>> No.





Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-28 Thread Derrick Brashear
did shutdown perchance take 30min?

Derrick


On Jan 28, 2011, at 1:50 PM, Jeff Blaine  wrote:

>> Do you have the FileLog from that shutdown?
> 
> No, it was cycled out by me salvaging :|
> 
>> And there isn't anything in play that would cause an old version of the
>> vice partition or something weird like that, is there? (ZFS snapshots,
>> liveupgrade misconfiguration, etc)
> 
> No.


Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-28 Thread Jeff Blaine

> Do you have the FileLog from that shutdown?


No, it was cycled out by me salvaging :|


> And there isn't anything in play that would cause an old version of the
> vice partition or something weird like that, is there? (ZFS snapshots,
> liveupgrade misconfiguration, etc)


No.


[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-28 Thread Andrew Deason
On Fri, 28 Jan 2011 13:17:31 -0500
Jeff Blaine  wrote:

> Examples from FileLog.old:
> 
> Fri Jan 28 10:02:48 2011 VAttachVolume: volume /vicepf/V2023864046.vol 
> needs to be salvaged; not attached.

This just says that the fileserver didn't clear the "I'm using this
volume" flag in the header; we salvage because it implies the fileserver
was killed possibly in the middle of some I/O. You're sure everything
shut down cleanly before? Do you have the FileLog from that shutdown?
It's possible for the fileserver to exit "cleanly" even if for some
reason it couldn't offline every single volume (but it will log that it
couldn't do so).

And there isn't anything in play that would cause an old version of the
vice partition or something weird like that, is there? (ZFS snapshots,
liveupgrade misconfiguration, etc)

> Fri Jan 28 10:02:49 2011 VAttachVolume: volume salvage flag is ON for 
> /vicepa//V2023886583.vol; volume needs salvage

This is the explicit "something is wrong with this volume" flag being
set. This can happen as a result of many different things, but I think
all of them are logged in FileLog when they happen. If you have the old
FileLog, it might say why. Of course, one of those things that triggers
this is that "needs to be salvaged" message above. So, if a previous
startup logged those same messages, that would cause this.
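
If the partition is still in that state, the on-disk volume header that both of
these flags live in can be dumped on the fileserver itself with volinfo; a
sketch, using the partition and volume id from the first log line above:

  volinfo -part /vicepf -volumeid 2023864046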

-- 
Andrew Deason
adea...@sinenomine.net



Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-28 Thread Jeff Blaine

Examples from FileLog.old:

Fri Jan 28 10:02:48 2011 VAttachVolume: volume /vicepf/V2023864046.vol 
needs to be salvaged; not attached.


Fri Jan 28 10:02:49 2011 VAttachVolume: volume salvage flag is ON for 
/vicepa//V2023886583.vol; volume needs salvage


Examples from SalvageLog.old pretty much run the gamut (it's
a 4MB file...).

01/28/2011 10:30:50 Found 13 orphaned files and directories (approx. 26 KB)

01/28/2011 10:30:52 Volume uniquifier is too low; fixed

01/28/2011 10:31:11 Vnode 34: version < inode version; fixed (old status)

01/28/2011 12:54:15 Volume 536872710 (src.local) mount point ./flex/011 
to '#src.flex.011#' invalid, converted to symbolic link


01/28/2011 12:27:30 dir vnode 15: special old unlink-while-referenced 
file .__afs9803 is deleted (vnode 2248)


01/28/2011 12:28:22 dir vnode 1075: ./.gconfd/lock/ior (vnode 4272): 
unique changed from 54370 to 57920


01/28/2011 12:28:22 dir vnode 1077: ./.gconf/%gconf-xml-backend.lock/ior 
already claimed by directory vnode 1 (vnode 4278, unique 54373) -- deleted


01/28/2011 12:28:28 dir vnode 607: invalid entry: ./.gconfd/lock/ior 
(vnode 1114, unique 132811)


01/28/2011 12:37:28 dir vnode 1: invalid entry deleted: 
./.ab_library.lock (vnode 50816, unique 25535)


On 1/28/2011 12:33 PM, Andrew Deason wrote:

> On Fri, 28 Jan 2011 12:10:38 -0500
> Jeff Blaine  wrote:
>
>> The last time we brought our fileservers down (cleanly, according to
>> "shutdown" info via bos status), it struck me as odd that salvages
>> were needed once it came up.  I sort of brushed it off.
>
> As in, it salvaged everything automatically when it came back up, or
> volumes were not attached when it came back up, and you needed to
> salvage to bring them online?
>
>> We've done it again, and the same situation is presenting itself,
>> and I'm really confused as to how that is and what is happening
>> incorrectly.  One of the three cleanly shutdown fileservers came
>> up with hundreds of unattachable volumes, and is salvaging now
>> by our hand.
>
> Well, why are they not attaching? FileLog should tell you. And the
> salvage logs should say what they fixed, if anything, to bring them back
> online.
>
> Also, salvaging an entire partition at once may be quite a bit faster
> than salvaging volumes individually, depending on how many volumes you
> have. The fileserver needs to be shutdown for that to happen, though.




Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-28 Thread Jeff Blaine

On 1/28/2011 12:33 PM, Andrew Deason wrote:

> On Fri, 28 Jan 2011 12:10:38 -0500
> Jeff Blaine  wrote:
>
>> The last time we brought our fileservers down (cleanly, according to
>> "shutdown" info via bos status), it struck me as odd that salvages
>> were needed once it came up.  I sort of brushed it off.
>
> As in, it salvaged everything automatically when it came back up, or
> volumes were not attached when it came back up, and you needed to
> salvage to bring them online?


The latter.


>> We've done it again, and the same situation is presenting itself,
>> and I'm really confused as to how that is and what is happening
>> incorrectly.  One of the three cleanly shutdown fileservers came
>> up with hundreds of unattachable volumes, and is salvaging now
>> by our hand.
>
> Well, why are they not attaching? FileLog should tell you. And the
> salvage logs should say what they fixed, if anything, to bring them back
> online.


Yes, I am waiting on that to all finish before I examine and reply.


> Also, salvaging an entire partition at once may be quite a bit faster
> than salvaging volumes individually, depending on how many volumes you
> have. The fileserver needs to be shutdown for that to happen, though.


I didn't trust it at all and forced a salvage of the whole server.
There were many unattachable volumes on every partition.


[OpenAFS] Re: Need volume state / fileserver / salvage knowledge

2011-01-28 Thread Andrew Deason
On Fri, 28 Jan 2011 12:10:38 -0500
Jeff Blaine  wrote:

> The last time we brought our fileservers down (cleanly, according to
> "shutdown" info via bos status), it struck me as odd that salvages
> were needed once it came up.  I sort of brushed it off.

As in, it salvaged everything automatically when it came back up, or
volumes were not attached when it came back up, and you needed to
salvage to bring them online?

> We've done it again, and the same situation is presenting itself,
> and I'm really confused as to how that is and what is happening
> incorrectly.  One of the three cleanly shutdown fileservers came
> up with hundreds of unattachable volumes, and is salvaging now
> by our hand.

Well, why are they not attaching? FileLog should tell you. And the
salvage logs should say what they fixed, if anything, to bring them back
online.

Also, salvaging an entire partition at once may be quite a bit faster
than salvaging volumes individually, depending on how many volumes you
have. The fileserver needs to be shutdown for that to happen, though.
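
For reference, the whole-partition form is just bos salvage pointed at a
partition (or at -all); bos stops the fs instance first and restarts it
afterwards. Hostname and partition below are placeholders:

  bos salvage fs1.example.edu /vicepa -localauth
  # or every partition on that server:
  bos salvage fs1.example.edu -all -localauth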

-- 
Andrew Deason
adea...@sinenomine.net
