Re: [OpenAFS] Backup/Determining Ubik Coordinator/Verifying Databases
"Joseph D. Kulisics" <[EMAIL PROTECTED]> writes:
...
> backup> deletedump -to 03/21/2008
> The following dumps were deleted:
> backup: RPC interface mismatch (-452) ; Error while deleting dumps from 0 to 1206082859

-452 = RXGEN_SS_MARSHAL, "server marshall failed". This is usually generated at runtime by code on the server after an operation has completed, while the results are being packed onto the wire to be returned to the client. In your case, "while deleting dumps", it can only happen as a result of calling bcdb_listDumps. That maps into a call to ubik_Call(BUDB_ListDumps, ...) on the server, which returns a list of dumps bounded by BUDB_MAX_RETURN_LIST (1000). Perhaps you tried to delete more than 1000 dumps? Break your deletes down to no more than 1000 per run.

...
> 1. How can I determine which server has been elected to be the Ubik write coordinator?

I doubt this is a ubik error (see above). However, "udebug <server> 7021" will tell you what ubik thinks is happening with budb, including who the sync site is. On the sync site, udebug will print additional information, including "Recovery state"; you want to see "1f". Replace 7021 with 7002 or 7003 to find the same information for pt and vl, although those are most likely not your worry here.

> 2. Is there a way to check the consistency of the various databases across all of the database servers?
...

I think you could build and run "ol_verify", but this is probably not your best strategy. If you really think your backup database is corrupted, it's probably simpler to rebuild it from scratch. You'll have to scan each dump, but if you've got them online this may be pretty cheap. Transarc used to recommend keeping frequent backups of the backup database, and they probably also recommended purging it on a regular basis of backups that were no longer of interest. Before doing either of these, you should certainly save your current backup database on each machine.
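The "no more than 1000 per run" advice can be scripted as a loop over cutoff dates. A minimal sketch, assuming the cutoff dates below are placeholders you would replace with dates from your own dump history; the "echo" keeps it a dry run, so it prints the commands instead of deleting anything:

```shell
#!/bin/sh
# Delete old dumps in slices, so no single deletedump run has to return
# more than BUDB_MAX_RETURN_LIST (1000) entries.  The dates below are
# hypothetical; remove the "echo" once the generated commands look right.
for cutoff in 01/01/2007 07/01/2007 01/01/2008 03/21/2008; do
    echo backup deletedump -to "$cutoff"
done
```

Each pass deletes only the dumps older than its cutoff that the previous pass left behind, so choosing the cutoffs close enough together keeps every run under the limit.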
-Marcus Watts

___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
[OpenAFS] Backup/Determining Ubik Coordinator/Verifying Databases
Hi,

I'm having a problem with a new AFS database server, and I wanted to try to determine a few things. First, the problem is that I retired an old machine running OpenAFS version 1.4.2. It was the first database server in my cell, and it had the lowest IP address. I retired the old machine and brought up a new machine in its place. The new machine runs OpenAFS version 1.4.6. I configured the machine to be a database server according to the instructions in the Quick Start Guide.

The old server managed backups from AFS, and when I raised the new server, I copied the backups, which I had kept online as files, onto the new server. When I run the backup command to try to remove old backups, I get the following error:

backup> deletedump -to 03/21/2008
The following dumps were deleted:
backup: RPC interface mismatch (-452) ; Error while deleting dumps from 0 to 1206082859

The error made me concerned that the new database server might not be communicating correctly with the old server, and I was trying to go through log files to determine the state of the database servers. I had two questions:

1. How can I determine which server has been elected to be the Ubik write coordinator?

2. Is there a way to check the consistency of the various databases across all of the database servers?

These questions mattered mostly for peace of mind and for studying the problem with the backup system, which doesn't seem to be able to dump a volume anymore. Does anyone recognize the error above? I apologize if these are silly questions. I couldn't find the answers in my admittedly blind search of the documentation. Please, let me know if there is more information that I can provide.

Thank you,
Joseph Kulisics

_
For all claims from an equal, urged upon a neighbor as commands, before any attempt at legal settlement, be they great or be they small, have only one meaning, and that is slavery.
Pericles, as quoted by Thucydides in Book I of his history of the Peloponnesian War

HOME PAGE URL: http://copper.chem.ucla.edu/~kulisics/
Joseph D. Kulisics --- 居候犬 Јосиф Кулишић
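Question 1 above can be answered mechanically by asking each database server what ubik itself thinks, via udebug against the budb port. A minimal sketch, assuming the host names below are placeholders for your own database servers; the "echo" keeps it a dry run, so remove it to actually issue the queries:

```shell
#!/bin/sh
# Poll each database server's backup database (port 7021) and let ubik
# report who the sync site (write coordinator) is.  Port 7002 is the
# protection database and 7003 the VLDB.  Host names are placeholders.
for host in db1.example.org db2.example.org db3.example.org; do
    echo udebug "$host" 7021
done
```

On the elected sync site, udebug prints extra lines, including the "Recovery state" field, where "1f" indicates a fully recovered database.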
RE: [OpenAFS] Duplicate VLDB entries
jax > vos listvol afsfs10 vicepa
Total number of volumes on server afsfs10 partition /vicepa: 0
Total volumes onLine 0 ; Total volumes offLine 0 ; Total busy 0

jax > vos syncvldb -server afsfs10 -part vicepa
VLDB synchronized with state of server afsfs10 partition /vicepa

jax > vos syncserv -server afsfs10 -part vicepa
Server afsfs10 partition /vicepa synchronized with VLDB

jax > vos listvldb -server afsfs10 -partition vicepa
...
Total entries: 1988

Seems like those commands aren't doing much to help. Any other ideas?

Thanks again,
-Jeff

-----Original Message-----
From: Stephen Joyce [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 09, 2008 3:11 PM
To: Jeff Quinn
Subject: Re: [OpenAFS] Duplicate VLDB entries

Try

vos syncvldb -server <server> -part <partition>
vos syncserv -server <server> -part <partition>

If any of the volumes are replicated, you'd want to sync all the servers/parts which hold replicas too. Be sure to perform all the syncvldb commands before the syncserv commands. In the steady (non-problem) state, these commands should be non-destructive and do nothing; otherwise, they will change info/volumes as problems are corrected.

Cheers,
Stephen

--
Stephen Joyce
Systems Administrator                           P A N I C
Physics & Astronomy Department                  Physics & Astronomy
University of North Carolina at Chapel Hill     Network Infrastructure
voice: (919) 962-7214                           and Computing
fax: (919) 962-0480                             http://www.panic.unc.edu

"Lazy Programmers know that if a thing is worth doing, it's worth doing well -- unless doing it well takes so long that it isn't worth doing any more. Then you just do it 'good enough'" --- Programming Perl, p 282.

On Wed, 9 Apr 2008, Jeff Quinn wrote:

> We recently lost a partition on one of our servers and restored from tape.
> The volumes were all recovered, but when we did a vos exam on any of the
> restored volumes, it would have 2 identical entries in the vldb:
>
> number of sites -> 2
>    server afsfs10.cl.msu.edu partition /vicepa RW Site
>    server afsfs10.cl.msu.edu partition /vicepa RW Site
>
> After moving the volume:
>
> number of sites -> 2
>    server afsfs5.cl.msu.edu partition /vicepa RW Site
>    server afsfs10.cl.msu.edu partition /vicepa RW Site
>
> We moved all of the volumes off of that partition, so now they all look
> like above. A vos listvol of afsfs10 vicepa returns 0 volumes, but a vos
> listvldb returns all 1989 of them. We then tried to do a vos syncvldb
> with a volume. It reports that it has been synced, but there are still 2
> entries in the vldb. We have also tried vos syncserv with the whole
> partition, with the same result.
>
> We have also tried to manually remove an entry from the vldb using vos
> delentry, but you cannot specify a server/partition AND a volume. A vos
> delentry on the volume removed both entries from the vldb.
>
> All volumes are accessible via their mount points.
>
> Any ideas on how to remedy the vldb discrepancy?
>
> Thanks,
> -Jeff Quinn
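A quick way to see exactly which VLDB entries are stale is to diff the names the VLDB reports for the server/partition against the names the fileserver itself reports. A minimal sketch, assuming you first capture the two listings to files; the sample data and the "first word of the line is the volume name" parsing are illustrative assumptions about the usual vos output layout:

```shell
#!/bin/sh
# Compare VLDB claims against the fileserver's own inventory.
# In real use, capture the listings first, e.g.:
#   vos listvldb -server afsfs10 -partition vicepa > vldb.txt
#   vos listvol  afsfs10 vicepa                    > vol.txt
# Sample data stands in for the captured output here.
cat > vldb.txt <<'EOF'
user.alpha
user.beta
EOF
cat > vol.txt <<'EOF'
user.alpha 536870912 RW 2048 K On-line
EOF
awk 'NF {print $1}' vldb.txt | sort > vldb.names
awk 'NF {print $1}' vol.txt  | sort > vol.names
# Volumes the VLDB lists that the fileserver does not have:
comm -23 vldb.names vol.names
```

With the sample data, this prints user.beta: an entry present only in the VLDB, i.e. a candidate for cleanup once the duplicate-site problem is resolved.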
Re: [OpenAFS] 1.4.7pre3 client success on EL4&5 - and a question
On Wed, Apr 09, 2008 at 05:45:59PM +0200, Stephan Wiesand wrote:
> 1.4.7pre3 builds, and the client works, for us on SL4 and SL5, i386 and
> x86_64.
>
> And here's the question:
>
> We're trying to do something that doesn't work in AFS space under certain
> circumstances. We don't know yet what makes it fail or work, but it
> consistently either fails or works on any client, and all clients have a
> very similar setup.
>
> All clients are SL4, amd64, latest kernel (2.6.9-67.0.7.ELsmp).
>
> The failing procedure is a bit convoluted, and I don't know in every detail
> what it's doing. But the part that fails on some clients is that RPMs get
> installed, with the RPMDB in AFS, and if it fails we get three messages
> "afs: failed to store file (13)" and a wedged RPMDB. And, with 1.4.7pre3 but
> not 1.4.6, we see two more messages:
>
> WARNING: afs_ufswr vcp=10396e494c0, exOrW=0
> WARNING: afs_ufswr vcp=10396e49140, exOrW=0
>
> Any hint what these are about would probably be very helpful.
>
> Thanks,
> Stephan
>
> PS On SL3, inserting the module from 1.4.7pre3 fails with the message that
> hlist_unhashed is GPLONLY. I'll file a bug in RT.
>
> --
> Stephan Wiesand
>    DESY - DV -
>    Platanenallee 6
>    15738 Zeuthen, Germany

Well, the RPM db is a sleepycat database, which is not known to work well in AFS space.

Jack Neely

--
Jack Neely <[EMAIL PROTECTED]>
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4 EA6B 213B 765F 3B6A 5B89
[OpenAFS] 1.4.7pre3 client success on EL4&5 - and a question
1.4.7pre3 builds, and the client works, for us on SL4 and SL5, i386 and x86_64.

And here's the question:

We're trying to do something that doesn't work in AFS space under certain circumstances. We don't know yet what makes it fail or work, but it consistently either fails or works on any client, and all clients have a very similar setup.

All clients are SL4, amd64, latest kernel (2.6.9-67.0.7.ELsmp).

The failing procedure is a bit convoluted, and I don't know in every detail what it's doing. But the part that fails on some clients is that RPMs get installed, with the RPMDB in AFS, and if it fails we get three messages "afs: failed to store file (13)" and a wedged RPMDB. And, with 1.4.7pre3 but not 1.4.6, we see two more messages:

WARNING: afs_ufswr vcp=10396e494c0, exOrW=0
WARNING: afs_ufswr vcp=10396e49140, exOrW=0

Any hint what these are about would probably be very helpful.

Thanks,
Stephan

PS On SL3, inserting the module from 1.4.7pre3 fails with the message that hlist_unhashed is GPLONLY. I'll file a bug in RT.

--
Stephan Wiesand
   DESY - DV -
   Platanenallee 6
   15738 Zeuthen, Germany
[OpenAFS] Duplicate VLDB entries
We recently lost a partition on one of our servers and restored from tape. The volumes were all recovered, but when we did a vos exam on any of the restored volumes, it would have 2 identical entries in the vldb:

number of sites -> 2
   server afsfs10.cl.msu.edu partition /vicepa RW Site
   server afsfs10.cl.msu.edu partition /vicepa RW Site

After moving the volume:

number of sites -> 2
   server afsfs5.cl.msu.edu partition /vicepa RW Site
   server afsfs10.cl.msu.edu partition /vicepa RW Site

We moved all of the volumes off of that partition, so now they all look like above. A vos listvol of afsfs10 vicepa returns 0 volumes, but a vos listvldb returns all 1989 of them. We then tried to do a vos syncvldb with a volume. It reports that it has been synced, but there are still 2 entries in the vldb. We have also tried vos syncserv with the whole partition, with the same result.

We have also tried to manually remove an entry from the vldb using vos delentry, but you cannot specify a server/partition AND a volume. A vos delentry on the volume removed both entries from the vldb.

All volumes are accessible via their mount points.

Any ideas on how to remedy the vldb discrepancy?

Thanks,
-Jeff Quinn