On 7/26/2024 4:44 AM, Stephan Wonczak wrote:
The vlserver initializes ubik with two pieces of information:
1. The primary IP address of the machine it is running on, which is determined by obtaining the host name and resolving the associated IPv4 address.
2. The list of IP addresses for peers, obtained from the server CellServDB file.
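
To illustrate how those two inputs combine, here is a minimal Python sketch. It is a simplified model and not the actual ubik code: the function names `primary_ipv4`, `parse_cellservdb` and `ubik_server_count` are hypothetical, the cell name `example.org` is a placeholder, and the sample file contents are only an assumption about what a stale server CellServDB might look like.

```python
import ipaddress
import socket


def primary_ipv4():
    """Resolve the local host name to an IPv4 address (defined here but not called)."""
    return socket.gethostbyname(socket.gethostname())


def parse_cellservdb(text):
    """Return the IPv4 addresses listed in CellServDB-style text.

    Lines starting with '>' name a cell; other non-blank lines are
    'address  #hostname' entries.
    """
    addrs = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if not entry or entry.startswith(">"):
            continue
        # Raises ValueError if the entry is not a plain IPv4 address.
        addrs.append(str(ipaddress.IPv4Address(entry)))
    return addrs


def ubik_server_count(local_addr, cellservdb_addrs):
    """Simplified model: the local host plus every listed address that differs from it."""
    peers = [a for a in cellservdb_addrs if a != local_addr]
    return 1 + len(peers), peers


# Assumed stale file: it still lists the old address of the machine.
stale = ">example.org        #placeholder cell name\n134.95.110.160      #afstest\n"
print(ubik_server_count("134.95.13.39", parse_cellservdb(stale)))
```

With the stale address in the file, this model counts two servers: the local address plus one peer that does not exist. When the file lists the current address, the count drops to one.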
[root@afstest/usr/afs]$ udebug afstest.rrz.uni-koeln.de vl -long
Host's addresses are: 134.95.13.39

This list only contains 134.95.13.39.  This means that the IP address passed into ubik by the vlserver process must have been this address.  Otherwise, there would be more than one address reported as the local address.

Host's 134.95.13.39 time is Fri Jul 26 10:26:26 2024
Local time is Fri Jul 26 10:26:26 2024 (time differential 0 secs)
Last yes vote for 134.95.13.39 was 13 secs ago (sync site);
Last vote started 13 secs ago (at Fri Jul 26 10:26:13 2024)
Local db version is 1610030433.14
I am sync site until 47 secs from now (at Fri Jul 26 10:27:13 2024) (2 servers)
This states that ubik believes there are two servers consisting of itself and all of its peers which were obtained from the server CellServDB file.
Recovery state 1
Recovery state 1 means that this server is the coordinator (aka sync site) and it has been unable to find the best database version within the quorum.
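
The recovery state is a hexadecimal bit mask. The sketch below decodes it in Python; the flag values are my recollection of the UBIK_REC* constants in the OpenAFS ubik headers and should be checked against the source tree, and `decode_recovery_state` is a hypothetical helper name.

```python
# Flag values as recalled from the OpenAFS ubik headers (UBIK_REC* constants).
# Treat them as an assumption and verify against your source tree.
RECOVERY_FLAGS = [
    (0x01, "sync site"),
    (0x02, "found best db version in quorum"),
    (0x04, "have best db locally"),
    (0x08, "db relabelled"),
    (0x10, "db sent to all servers"),
]


def decode_recovery_state(hex_state):
    """Return the names of the flags set in a udebug recovery state value."""
    value = int(hex_state, 16)
    return [name for bit, name in RECOVERY_FLAGS if value & bit]


for state in ("1", "1f"):
    print(state, decode_recovery_state(state))
```

Under these assumptions, state 1 has only the sync site flag set, while 1f (the value a healthy sync site reports) has all five set.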
The last trans I handled was 1720702725.17056
The ubik epoch began at Thu, 11 Jul 2024 12:58:45 GMT.   This will be the time the vlserver process was started.
Sync site's db version is 1610030433.14
The last time the database was modified, the ubik epoch was Thu, 07 Jan 2021 14:40:33 GMT, which is 15:40:33 local time (UTC+1 in January).
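
The first component of a ubik version is a Unix timestamp, so the dates quoted here can be reproduced with a few lines of Python. This is only a sketch; `ubik_version_time` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def ubik_version_time(version):
    """Split 'epoch.counter' and return (epoch as a UTC datetime, counter)."""
    epoch, counter = version.split(".")
    return datetime.fromtimestamp(int(epoch), tz=timezone.utc), int(counter)


for label, version in [("db version", "1610030433.14"),
                       ("last trans", "1720702725.17056")]:
    when, counter = ubik_version_time(version)
    print(label, when.strftime("%a, %d %b %Y %H:%M:%S GMT"), "counter", counter)
```

The two timestamps printed should match the GMT dates given for the database version and for the last transaction.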
0 locked pages, 0 of them for write
Last time a new db version was labelled was:
     1279661 secs ago (at Thu Jul 11 14:58:45 2024)

Server (134.95.110.160): (db 0.0)
    last vote never rcvd
    last beacon never sent
    dbcurrent=0, up=0 beaconSince=0

  Where does this IP 134.95.110.160 come from?

This came from the server CellServDB file.   The vlserver does not communicate with the cache manager or any other service to obtain the list of ubik peers.

Well, actually, this was the -old- IP of this machine before it was moved into another network. But where did this come from? Hmmm...

  I got it.
  After correcting the server CellServDB, I did not reboot the machine. I just stopped and then restarted both openafs-server and openafs-client.

Is it possible that only the fileserver was restarted, rather than openafs-server as a whole?

I do not believe the 134.95.110.160 ubik peer address could have come from anywhere other than the server CellServDB, unless the contents of the server ThisCell file were wrong and the specified cell name has an AFSDB or SRV record that specifies 134.95.110.160 as a location server.

Obviously, the wrong IP remained in some kernel resident lists. I tried fixing the issue with "fs newcell", but no luck there.

"fs newcell" is used to push the vlserver IP addresses into the cache manager kernel module.   This list is not used by the vlserver.

One reboot later, however, things are looking fine now:

 udebug afstest.rrz.uni-koeln.de vl -long
Host's addresses are: 134.95.13.39
Host's 134.95.13.39 time is Fri Jul 26 10:39:31 2024
Local time is Fri Jul 26 10:39:31 2024 (time differential 0 secs)
Last yes vote for 134.95.13.39 was 0 secs ago (sync site);
Last vote started 0 secs ago (at Fri Jul 26 10:39:31 2024)
Local db version is 1610030433.14
I am sync site forever (1 server)
Recovery state 1f
The last trans I handled was 1721983108.0

vlserver was restarted at Fri, 26 Jul 2024 08:38:28 GMT

Sync site's db version is 1610030433.14
0 locked pages, 0 of them for write
Last time a new db version was labelled was:
     63 secs ago (at Fri Jul 26 10:38:28 2024)

  Thanks, Jeffrey, for pointing me in the right direction!
  (and hopefully someone can learn from my bumbling here :-) )

I'm glad you have been able to get OpenAFS working.

Now that we understand the root cause, I was curious to see how AuriStorFS would respond if the cell's ubik quorum configuration did not include any of the local machine's interface addresses, or if the fileserver couldn't communicate with any of the addresses defined in the cell service database configuration.   I configured the cell with a single server hosting both database and fileserver on address 10.0.1.99, but the actual host address is 10.0.1.180.

The ubik database services refuse to start and logs:

  Fri Jul 26 12:47:02.006820 2024 [1] ubik: Unable to find local server in ubik configuration
  Fri Jul 26 12:47:02.006826 2024 [1] vlserver: Ubik init failed: Invalid argument

The fileserver refuses to start and logs:

  Fri Jul 26 12:47:23.007151 2024 [1] Fileserver will register the following addresses: \
     10.0.1.180 2001:db8::1:a4e5 2001:db8::20c:29ff:fe0b:ca78
  Fri Jul 26 12:47:29.147497 2024 [1] Could not fetch protection server capabilities: \
       Error 301063: rx unreachable peer at Fri Jul 26 12:47:29.147459 2024 \
          [10.0.1.99]:7002 returned rx unreachable peer
  Fri Jul 26 12:47:29.147519 2024 [1] Fatal error in host initialization, exiting!!

"bos status -long" reports:

  Instance ptserver, (type is simple) temporarily disabled, stopped for too many errors, currently starting up.
      Process last started at Fri Jul 26 12:58:16 2024 (160 proc starts)
      Last exit at Fri Jul 26 12:58:16 2024
      Last error exit at Fri Jul 26 12:58:16 2024, by exiting with code 2
      Command 1 is '/usr/lib/yfs/ptserver'

  Instance vlserver, (type is simple) temporarily disabled, stopped for too many errors, currently starting up.
      Process last started at Fri Jul 26 12:58:16 2024 (160 proc starts)
      Last exit at Fri Jul 26 12:58:16 2024
      Last error exit at Fri Jul 26 12:58:16 2024, by exiting with code 2
      Command 1 is '/usr/lib/yfs/vlserver'

  Instance dafs, (type is dafs) temporarily disabled, stopped for too many errors, currently shutdown.
      Auxiliary status is: file server shut down.
      Process last started at Fri Jul 26 12:47:23 2024 (15 proc starts)
      Last exit at Fri Jul 26 12:48:47 2024
      Last error exit at Fri Jul 26 12:47:29 2024, by file, by exiting with code 1
      Command 1 is '/usr/lib/yfs/fileserver'
      Command 2 is '/usr/lib/yfs/volserver'
      Command 3 is '/usr/lib/yfs/salvageserver'
      Command 4 is '/usr/lib/yfs/salvager'
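
For monitoring, output like this can be scanned for instances that the bosserver has given up on. The following Python sketch is an illustration only: `disabled_instances` is a hypothetical helper, and the pattern is based solely on the wording of the output shown above.

```python
import re

# Matches lines such as:
#   Instance vlserver, (type is simple) temporarily disabled, stopped for too many errors, ...
_PATTERN = re.compile(
    r"^\s*Instance ([^,\s]+), .*stopped for too many errors", re.MULTILINE
)


def disabled_instances(bos_output):
    """Return the names of instances reported as stopped for too many errors."""
    return _PATTERN.findall(bos_output)


sample = """\
  Instance ptserver, (type is simple) temporarily disabled, stopped for too many errors, currently starting up.
      Process last started at Fri Jul 26 12:58:16 2024 (160 proc starts)
  Instance dafs, (type is dafs) temporarily disabled, stopped for too many errors, currently shutdown.
"""
print(disabled_instances(sample))
```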

If a configuration parsing error had prevented the servers from starting, that would be visible from "systemctl status auristorfs-server":

  root@bookworm:/home/jaltman/src# systemctl status auristorfs-server
  × auristorfs-server.service - AuriStorFS Server Service
       Loaded: loaded (/lib/systemd/system/auristorfs-server.service; enabled; preset: enabled)
       Active: failed (Result: exit-code) since Fri 2024-07-26 14:27:00 EDT; 6s ago
     Duration: 4ms
      Process: 885317 ExecStart=/usr/sbin/bosserver -nofork -pidfiles=/var/run/yfs (code=exited, status=1/FAILURE)
     Main PID: 885317 (code=exited, status=1/FAILURE)
          CPU: 3ms

  Jul 26 14:27:00 bookworm systemd[1]: Started auristorfs-server.service - AuriStorFS Server Service.
  Jul 26 14:27:00 bookworm bosserver[885317]: bosserver: /etc/yfs/server/yfs-server.conf.d/bad.conf:2: unclosed {

Once again, I'm glad to hear your test server cell is functioning normally.

Jeffrey Altman

