Hello,
I want to share a success story and document the steps I took
(in essence, I followed Jan's recommendations and read some man pages).
I am running a two-server setup with all volumes replicated.
One of the servers got a corrupted volume and refused to start,
complaining about an assertion error when running the salvager on one of
the volumes.
[think of an irreparable fsck problem on a central NFS server? ;-]
My other server stayed online, so the system kept running.
The steps to revive the crashing server:
First, get the info about the volume:
[any of the servers]# grep <volname> /vice/db/VRList
you get
<volname> <groupid> <replica_num> <volid_serv1> <volid_serv2> 0 0 0 0 0 0 <VSG>
now you have to find which <volid> you are interested in:
[any of the servers]# grep <host_with_dead_server> /vice/db/servers
you get
<hostname> <serverid>
the <serverid> is a small number and it matches the beginning of <volid>,
e.g. my dead server has number 2 and the matching volid was 200004d
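
The lookup above can be sketched as a small shell helper. This is only an illustration: volid_for_server is a hypothetical name, the volume name and ids in the usage line are made up, and it assumes that "matches the beginning" means the top byte of the hex <volid> equals the decimal <serverid> (which fits my example, 2 and 200004d).

```shell
# Hypothetical helper: print the replica id that lives on a given server.
# $1 = one line from /vice/db/VRList, $2 = <serverid> from /vice/db/servers.
# Assumption: the top byte of the hex volume id encodes the server id.
volid_for_server() {
    vrlist_line=$1
    server_id=$2
    # Split the VRList line into fields (this replaces $1, $2, ...).
    set -- $vrlist_line
    replica_count=$3
    shift 3                      # $1 is now <volid_serv1>
    i=0
    while [ "$i" -lt "$replica_count" ]; do
        id=${1#0x}               # tolerate an optional 0x prefix
        if [ $(( 0x$id >> 24 )) -eq "$server_id" ]; then
            echo "$id"
            return 0
        fi
        shift
        i=$((i + 1))
    done
    return 1                     # no replica on that server
}

# Usage with made-up values:
#   volid_for_server "myvol 7f000010 2 100004d 200004d 0 0 0 0 0 0 E0000100" 2
#   prints 200004d
```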
Let us get more information about the volume (and destroy it, as it is
broken)
[the host with the dead server]# grep rvm_ <where-you-have-it>/server.conf
rvm_log="<LOG>"
rvm_data="<DATA>"
rvm_data_length="<LENGTH>"
[the host with the dead server]# norton -mapprivate <LOG> <DATA> <LENGTH>
(norton runs a lot faster with -mapprivate than without it)
norton> show volume <volid>
Id: 0x<volid> Name: <name>.<digit> Parent: 0x200004d
GoupId: 0x<groupid> Partition: <partition>
Since I had to remove the broken volume:
norton> delete volume <volid>
norton> Ctrl/D
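
To avoid copying the three rvm_ values by hand, the norton command line used above could be assembled from server.conf roughly like this. It is a sketch, not something I ran during the recovery: conf_value and norton_cmdline are hypothetical names, and it assumes each setting sits on its own line in the key="value" form shown by the grep.

```shell
# Hypothetical helper: print the value of key $1 from config file $2,
# with the surrounding double quotes removed.
conf_value() {
    grep "^$1=" "$2" | head -n 1 | cut -d= -f2- | tr -d '"'
}

# Print (do not run) the norton invocation for a given server.conf.
norton_cmdline() {
    conf=$1
    log=$(conf_value rvm_log "$conf")
    data=$(conf_value rvm_data "$conf")
    length=$(conf_value rvm_data_length "$conf")
    echo "norton -mapprivate $log $data $length"
}

# Usage:
#   norton_cmdline <where-you-have-it>/server.conf
```

It only prints the command, so you can check the values before running anything against RVM.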
Now start the server and watch "tail -f /vice/srv/SrvLog" to see the volume
being destroyed instead of the server crashing.
When the server is up, the moment of truth has come.
Run (substituting the values obtained above):
[the host with the now-alive server, missing one volume]# \
volutil create_rep <partition> <name>.<digit> 0x<groupid> 0x<volid>
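
Since it is easy to end up with a doubled or missing 0x prefix when pasting the ids from norton, here is a small sketch that only prints the command for review. create_rep_cmdline is a hypothetical name and the values in the usage line are made up.

```shell
# Hypothetical helper: print the create_rep command for the values noted
# from "show volume". Ids are accepted with or without a leading 0x.
create_rep_cmdline() {
    partition=$1
    name=$2
    groupid=${3#0x}
    volid=${4#0x}
    echo "volutil create_rep $partition $name 0x$groupid 0x$volid"
}

# Usage with made-up values:
#   create_rep_cmdline /vicepa myvol.0 0x7f000010 200004d
#   prints: volutil create_rep /vicepa myvol.0 0x7f000010 0x200004d
```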
Now go to a well-connected client and do "ls -alR" on the corresponding
mountpoint.
You can watch the resolution happen by looking at the output of
volutil -h <hostname-of-serverN> info <volid_servN> | grep diskused
for each of the servers concerned (take the values from the output of
"grep <volname> /vice/db/VRList" above).
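
The same check can be wrapped so that all replicas are reported in one pass. This is a minimal sketch: report_diskused is a hypothetical name, and it relies only on the volutil invocation shown above and on the info output containing a line with "diskused".

```shell
# Hypothetical helper: for each <hostname> <volid> pair given as arguments,
# print the diskused line reported by that server.
report_diskused() {
    while [ "$#" -ge 2 ]; do
        host=$1
        volid=$2
        shift 2
        line=$(volutil -h "$host" info "$volid" | grep diskused)
        echo "$host $volid: $line"
    done
}

# Usage: repeat every 10 seconds until the numbers on both servers agree,
# then stop with Ctrl-C.
#   while :; do
#       report_diskused <hostname-of-server1> <volid_serv1> \
#                       <hostname-of-server2> <volid_serv2>
#       sleep 10
#   done
```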
Enjoy Coda!
--
Ivan