Re: [OpenAFS] Non-functional fileserver

2024-07-11 Thread MS Vitale
Dr. Wonczak,

Thank you for your report.  Please see my interleaved replies below:

> On Jul 11, 2024, at 9:50 AM, Stephan Wonczak  wrote:
> 
>  Today we had a strange problem with two of our test-AFS-Servers. Apart from 
> our normal cell we created two additional cells, each one consisting of a 
> single server that serves as both DB-Server and Fileserver. These servers 
> were created about two years back, and were working fine then. Yesterday we 
> had need to test something new and we revisited the servers.
>  "bos status" came back fine with "all servers running".

'bos status  -long' is useful in this situation, and may report that a 
core file is present.

>  However, "vos listvol -server xxx" resulted in "possible communication 
> failure". Digging a bit, we had numerous log entries in VolSerLog 
> "SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)". This 
> pointed to the fact that the fssync.sock socket file was missing. Indeed, 
> /var/log/messages showed that the fileserver-process had dumped core during 
> startup. Interestingly, though, a fileserver process -was- running, just not 
> really functioning.
>  Several unsuccessful hours of debugging, tracing and googling later, I was 
> ready to give up and trash the test cell and create a new one from scratch. 
> During the process of purging the files I thought "OK, 
> /usr/afs/etc/CellServDB for this cell stays the same, so I can keep that." On 
> a hunch, I actually looked what was inside: Lo and behold! The configured 
> DB-server address for the cell had the wrong IP.
>  This is when I remembered that both problematic machines were moved to a 
> different network segment. We had corrected the -client- CellServDB during 
> that move, but forgot about the server CellServDB.
>  Now, the whole point of this story:
>  The logs were spectacularly unhelpful in pinpointing this misconfiguration. 
> Indeed, I would not have expected the fileserver to dump core instead of 
> refusing to run at all. At the very least there should be a log entry that no 
> DB-Server could be reached (and CellServDB should be checked).
>  Recreating this behaviour is easy:
>  Take a working single-server cell, and change the IP in 
> /usr/afs/etc/CellServDB. Restart the fileserver and watch things go south.

I tried this (running master) and was able to reproduce some of your symptoms, 
as expected - but not all of them.

In this case, when the server CellServDB (CSDB) has the wrong IP address,
the fileserver will never be fully functional even though it is "running".

When a fileserver is in this state, the fileserver FSSYNC channel is indeed
blocked until the fileserver is able to complete registration with the
vlserver.  As you observed, this in turn affects any volserver operation
that requires the FSSYNC channel.

The fileserver will also be unable to obtain required authorization information 
from the ptserver.

However, I did NOT experience a fileserver crash.
And I also see these expected messages in FileLog:
  ...
  Thu Jul 11 11:34:57 2024 VL_RegisterAddrs rpc failed; will retry periodically (code=-1, err=0)
  Thu Jul 11 11:36:07 2024 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
  Thu Jul 11 11:37:12 2024 Couldn't get CPS for AnyUser, will try again in 30 seconds; code=-1.
  ...

Admittedly, these messages are not as helpful as they could be; they should
mention which IP addresses the fileserver is trying to reach.
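
Until the messages improve, a monitoring script can at least flag this state.
What follows is only a rough sketch, not part of OpenAFS: it counts FileLog
lines containing the two message fragments shown above. The function name
scan_filelog is made up for illustration, and other OpenAFS versions may word
these messages differently.

```python
import re

# Message fragments copied from the FileLog excerpt in this thread.
# Assumption: other OpenAFS versions may phrase them differently.
PATTERNS = {
    "vlserver registration": re.compile(r"VL_RegisterAddrs rpc failed"),
    "ptserver CPS lookup": re.compile(r"Couldn't get CPS for AnyUser"),
}


def scan_filelog(lines):
    """Count the lines that match each pattern; return {label: count}."""
    counts = {label: 0 for label in PATTERNS}
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Example use (the FileLog location depends on the installation):
#   with open(path_to_filelog, errors="replace") as fh:
#       print(scan_filelog(fh))
```

Nonzero counts for both labels on a server that "bos status" reports as
running would suggest that the fileserver cannot reach its DB servers.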

What version of OpenAFS are you running?

Regards,
--
Mark Vitale
Sine Nomine Associates

___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


[OpenAFS] Non-functional fileserver

2024-07-11 Thread Stephan Wonczak

  Dear all,
  Sorry for the longish post, but I wanted to provide a bit of 
background.
  Today we had a strange problem with two of our test-AFS-Servers. Apart 
from our normal cell we created two additional cells, each one consisting 
of a single server that serves as both DB-Server and Fileserver. These 
servers were created about two years back, and were working fine then. 
Yesterday we had need to test something new and we revisited the servers.

  "bos status" came back fine with "all servers running".
  However, "vos listvol -server xxx" resulted in "possible communication 
failure". Digging a bit, we had numerous log entries in VolSerLog 
"SYNC_connect: temporary failure on circuit 'FSSYNC' (will retry)". This 
pointed to the fact that the fssync.sock socket file was missing. Indeed, 
/var/log/messages showed that the fileserver-process had dumped core 
during startup. Interestingly, though, a fileserver process -was- running, 
just not really functioning.
  Several unsuccessful hours of debugging, tracing and googling later, I 
was ready to give up and trash the test cell and create a new one from 
scratch. During the process of purging the files I thought "OK, 
/usr/afs/etc/CellServDB for this cell stays the same, so I can keep that." 
On a hunch, I actually looked what was inside: Lo and behold! The 
configured DB-server address for the cell had the wrong IP.
  This is when I remembered that both problematic machines were moved to a 
different network segment. We had corrected the -client- CellServDB during 
that move, but forgot about the server CellServDB.

  Now, the whole point of this story:
  The logs were spectacularly unhelpful in pinpointing this 
misconfiguration. Indeed, I would not have expected the fileserver to dump 
core instead of refusing to run at all. At the very least there should be 
a log entry that no DB-Server could be reached (and CellServDB should be 
checked).
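
  Such a CellServDB check could be sketched roughly as follows. This is only
an illustration, not anything OpenAFS ships. It assumes the usual CellServDB
layout (a ">cellname #comment" header line followed by "IP #hostname" lines).
The function names are made up, and the host's real addresses must be supplied
by the caller.

```python
def parse_cellservdb(text):
    """Return {cell_name: [ip, ...]} from CellServDB-format text."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else None  # ignore a bare '>' line
            if current is not None:
                cells.setdefault(current, [])
        elif current is not None:
            cells[current].append(line.split()[0])
    return cells


def stale_entries(text, cell, actual_ips):
    """List the IPs given for `cell` that are not among `actual_ips`."""
    actual = set(actual_ips)
    listed = parse_cellservdb(text).get(cell, [])
    return [ip for ip in listed if ip not in actual]
```

  For a single-server cell, any address returned by stale_entries() when
given the contents of /usr/afs/etc/CellServDB and the machine's current
addresses would have pointed straight at our problem.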

  Recreating this behaviour is easy:
  Take a working single-server cell, and change the IP in 
/usr/afs/etc/CellServDB. Restart the fileserver and watch things go south.


  Thanks for reading my long ramble :-)

Dipl. Chem. Dr. Stephan Wonczak

Regionales Rechenzentrum der Universitaet zu Koeln (RRZK)
Universitaet zu Koeln, Weyertal 121, 50931 Koeln
Tel: +49/(0)221/470-89583, Fax: +49/(0)221/470-89625