Re: [OpenAFS] DAFS Salvager failure

2012-10-25 Thread Pavel Semerad
> Folks,
> 
> One of our AFS file servers crashed this afternoon.  OpenAFS 1.6.1 on
> RHEL 6 with kernel 2.6.32-279.9.1.el6.x86_64.  It looks like the
> salvager hung and eventually the dafileserver stopped responding to
> clients.
> 
  I had a similar problem on Monday and Tuesday this week. dafileserver
crashed and was restarted by bosserver, but after some time the salvager
stopped salvaging (the configured number of salvage processes was there,
but they were only sleeping and not repairing data), and there were some
FSSYNC error messages in the log. I then restarted the fileserver process
manually and it worked for some time, salvaging volumes, but only until
the next dafileserver crash. I saw this several times, also with the older
binaries from openafs-1.6.1 (the current ones were openafs-1.6.1a).

  After recompiling openafs with debug info, the next crash showed that it
segfaulted in FD_ISSET in the function CallHandler in
src/vol/fssync-server.c.
  I saw that the code can use the poll() interface instead of select(), so
I forced it to use the poll() code (#define HAVE_POLL), and it has been
running without a crash from Tuesday until now.
  I don't know whether this has any issues of its own. I didn't find a test
for poll() in the configure script, so this poll() code doesn't seem to be
used normally.
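
For readers wondering how FD_ISSET can fault at all: select() works on an
fd_set, a fixed-size bitmap with room for FD_SETSIZE descriptors (1024 with
glibc on Linux), and the FD_SET/FD_ISSET macros do no range checking,
whereas poll() takes an explicit array of descriptors with no such ceiling.
The sketch below is only a Python model of that difference, not OpenAFS
code. The class and function names are made up for illustration, and
nothing in this thread establishes that an out-of-range descriptor is what
actually caused the crash.

```python
# Illustrative model only, not OpenAFS code. All names are hypothetical.
FD_SETSIZE = 1024  # assumption: the usual glibc value on Linux


class FdSetModel:
    """Models a C fd_set: a fixed bitmap of FD_SETSIZE bits."""

    def __init__(self):
        self.bits = bytearray(FD_SETSIZE // 8)

    def fd_set(self, fd):
        # C's FD_SET does the same indexing without any range check.
        self.bits[fd // 8] |= 1 << (fd % 8)

    def fd_isset(self, fd):
        # In C, fd >= FD_SETSIZE reads past the bitmap (undefined behaviour).
        # Python raises IndexError instead, which makes the problem visible.
        return bool(self.bits[fd // 8] & (1 << (fd % 8)))


def poll_style_ready(pollfds, ready):
    """Models poll(): the caller passes a list of fds of any value."""
    return [fd for fd in pollfds if fd in ready]


if __name__ == "__main__":
    s = FdSetModel()
    s.fd_set(5)
    print(s.fd_isset(5))            # in range: fine
    try:
        s.fd_isset(FD_SETSIZE)      # first descriptor that does not fit
    except IndexError:
        print("fd", FD_SETSIZE, "does not fit in the bitmap")
    print(poll_style_ready([5, 4096], ready={4096}))
```

The point of the model is the asymmetry: the bitmap has a hard upper bound
that the C macros never check, while a poll()-style list does not care how
large the descriptor number is.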

Pavel Semerad

___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] DAFS Salvager failure

2012-10-25 Thread Jack Neely
Thanks Jeffrey!

I've created [rt.central.org #131372] with a follow-up.

At this point this one server is running the traditional fileserver.
There were 3 volumes that would not come online -- that even caused the
traditional salvager to crash.  We restored those from tape and,
finally, the server is up and running.

The FSSYNC errors were the only thing in the log messages that seemed to
correlate with the dasalvager getting stuck.  Well, and the core files.
The backtraces indicate the dafileserver called osi_Panic from the
FSSYNC-related functions.

Jack

On Fri, Oct 19, 2012 at 12:03:51PM -0400, Jeffrey Altman wrote:
> If you have core files from dasalvager and dafileserver then the
> processes have terminated abnormally.   If you have an OpenAFS support
> provider I suggest you contact them with a support request.
> 
> Note that this mailing list is likely to be very quiet over the next
> 24 to 48 hours as the core developers are in transit due to the end
> of the European AFS and Kerberos Conference.
> 
> If you do not have a support provider, please open a ticket in OpenAFS
> RT by sending mail to openafs-b...@openafs.org  Please include in the
> report stack traces obtained from the core files.  They will provide the
> first clue as to what is failing since nothing is evident in the log 
> files.
> Be sure to also look at the *.old log files.
> 
> Jeffrey Altman
> 
> 



-- 
Jack Neely 
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University

Re: [OpenAFS] DAFS Salvager failure

2012-10-19 Thread Mark Vitale
I have been working on what seems to be a similar report to yours:
OpenAFS 1.6.1, RHEL6, crashed fileserver with SEGV, FSYNC seems implicated.
However, you are DAFS and my report is non-DAFS.

Would you be willing to send me your logs and cores?

Thanks,
--
Mark Vitale
mvit...@sinenomine.net

On Oct 18, 2012, at 10:40 PM, Jack Neely wrote:

> Folks,
> 
> One of our AFS file servers crashed this afternoon.  OpenAFS 1.6.1 on
> RHEL 6 with kernel 2.6.32-279.9.1.el6.x86_64.  It looks like the
> salvager hung and eventually the dafileserver stopped responding to
> clients.
> 



Re: [OpenAFS] DAFS Salvager failure

2012-10-19 Thread Jeffrey Altman
If you have core files from dasalvager and dafileserver then the
processes have terminated abnormally.   If you have an OpenAFS support
provider I suggest you contact them with a support request.

Note that this mailing list is likely to be very quiet over the next
24 to 48 hours as the core developers are in transit due to the end
of the European AFS and Kerberos Conference.

If you do not have a support provider, please open a ticket in OpenAFS
RT by sending mail to openafs-b...@openafs.org  Please include in the
report stack traces obtained from the core files.  They will provide the
first clue as to what is failing since nothing is evident in the log 
files.
Be sure to also look at the *.old log files.

Jeffrey Altman







[OpenAFS] DAFS Salvager failure

2012-10-18 Thread Jack Neely
Folks,

One of our AFS file servers crashed this afternoon.  OpenAFS 1.6.1 on
RHEL 6 with kernel 2.6.32-279.9.1.el6.x86_64.  It looks like the
salvager hung and eventually the dafileserver stopped responding to
clients.

We've rebooted, fsck'd the ext4 partitions, and finally ran
dasalvager -force by hand to attempt to correctly salvage the server.
In all cases, once the dafs instance starts up it serves requests and
dispatches a volume salvage or four, then all the salvager processes get
stuck and we start all over again.  We've salvaged the server multiple
times at this point -- our next hope is that we can restart the file
server with the traditional file server process.  (BTW, 2 and 3 GiB cores
from dafileserver and dasalvager abound.)

SalsrvLog messages are usually along the following lines:

10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
'FSSYNC'; attempting reconnect to server
10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
'FSSYNC'; attempting reconnect to server
10/18/2012 17:55:11 SYNC_ask: too many / too latent fatal protocol
errors on circuit 'FSSYNC'; giving up (tries 1 timeout 1350597266)
10/18/2012 17:55:11 FSYNC_askfs: internal FSSYNC protocol error 2
10/18/2012 17:55:11 AskOffline:  request for fileserver to take volume
offline failed; trying again...
10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 17:55:08 SYNC_ask: protocol communications failure on circuit
'FSSYNC'; attempting reconnect to server
10/18/2012 17:55:11 SYNC_ask: too many / too latent fatal protocol
errors on circuit 'FSSYNC'; giving up (tries 1 timeout 1350597265)
10/18/2012 17:55:11 FSYNC_askfs: internal FSSYNC protocol error 2
10/18/2012 17:55:11 AskOffline:  request for fileserver to take volume
offline failed; trying again...
10/18/2012 17:55:08 SYNC_ask: No response on circuit 'FSSYNC'

or

10/18/2012 22:20:49 dispatching child to salvage volume 540007729...
10/18/2012 22:19:33 SYNC_ask: No response on circuit 'FSSYNC'
10/18/2012 22:19:33 SYNC_ask: protocol communications failure on circuit
'FSSYNC'; attempting reconnect to server
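
A side note on the "timeout" number in the "giving up" lines above: it has
the magnitude of a Unix epoch time. The sketch below simply converts it.
Treating the field as seconds since 1970 is an assumption drawn from the
size of the value, not something the log or this thread states, and the
-4 hour offset assumes the server logs in US Eastern daylight time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 'timeout' field is seconds since the Unix epoch.
TIMEOUTS = [1350597266, 1350597265]  # values copied from the log lines

# Assumption: server local time was UTC-4 (US Eastern daylight time).
LOCAL = timezone(timedelta(hours=-4))


def decode(ts):
    """Return (UTC, local) ISO 8601 strings for an epoch timestamp."""
    utc = datetime.fromtimestamp(ts, tz=timezone.utc)
    return utc.isoformat(), utc.astimezone(LOCAL).isoformat()


if __name__ == "__main__":
    for ts in TIMEOUTS:
        print(ts, *decode(ts))
```

Under those assumptions 1350597266 is 17:54:26 local time on 2012-10-18,
which is within about a minute of the 17:55 log lines that report it.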

and from FileLog (this looks like I'm restoring from backups)

Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
(2574739029)
Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
(3774863615)
Thu Oct 18 22:25:30 2012 FSYNC_com:  invalid protocol version
(944130375)
Thu Oct 18 22:25:30 2012 Volume 539458481 now offline, must be salvaged.
Thu Oct 18 22:25:30 2012 Scheduling salvage for volume 539458481 on part
/vicepb over SALVSYNC
Thu Oct 18 22:25:31 2012 nUsers == 0, but header not on LRU
Thu Oct 18 22:25:31 2012 SYNC_getCom:  error receiving command
Thu Oct 18 22:25:31 2012 Scheduling salvage for volume 539894230 on part
/vicepb over SALVSYNC
Thu Oct 18 22:25:31 2012 FSYNC_com:  read failed; dropping connection
(cnt=103291)
Thu Oct 18 22:25:37 2012 FSYNC_com:  invalid protocol version
(2023862981)
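
The "invalid protocol version" numbers above are far too large to be real
version numbers, so it can help to look at their raw bytes. The sketch
below assumes each value was read as a 32-bit unsigned integer in host byte
order on little-endian x86_64 (an assumption; the thread does not say how
the field is read) and prints the four bytes in the order they would have
arrived.

```python
import struct

# Values copied from the FSYNC_com lines in the FileLog excerpt.
VALUES = [2574739029, 3774863615, 944130375, 2023862981]


def raw_bytes(value):
    """Bytes of a 32-bit unsigned int as laid out on a little-endian host."""
    return struct.pack("<I", value)


if __name__ == "__main__":
    for v in VALUES:
        b = raw_bytes(v)
        hexed = " ".join(f"{x:02x}" for x in b)
        print(f"{v:>10}  0x{v:08x}  {hexed}  {b!r}")
```

Under that assumption, 944130375 comes out as the ASCII bytes "GIF8" and
3774863615 as ff d8 ff e0, which are the usual opening bytes of GIF and
JPEG/JFIF files. That would be consistent with file contents arriving where
an FSSYNC command was expected, but it is only an observation from the
numbers, not a diagnosis anyone in this thread made.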

I've checked, all my binaries are from my 1.6.1 build.  What's going on?

Jack Neely

-- 
Jack Neely 
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4  EA6B 213B 765F 3B6A 5B89