Re: [OpenAFS] Backup/Determining Ubik Coordinator/Verifying Databases

2008-04-09 Thread Marcus Watts
"Joseph D. Kulisics" <[EMAIL PROTECTED]> writes:
...
> backup> deletedump -to 03/21/2008
> The following dumps were deleted:
> backup: RPC interface mismatch (-452) ; Error while deleting dumps from 0 to 1206082859

-452 = RXGEN_SS_MARSHAL "server marshall failed".

This is usually generated at runtime by code on the server after
an operation has completed, while the results are being packed onto the
wire to be returned to the client.

In your case, "while deleting dumps" can only happen as
a result of calling bcdb_listDumps.  This maps into a call to
ubik_Call(BUDB_ListDumps, ...) on the client, which invokes the
BUDB_ListDumps RPC on the server.  This
returns a list of dumps bounded by BUDB_MAX_RETURN_LIST (1000).
Perhaps you tried to delete more than 1000 dumps?
Break your deletes down to no more than 1000 per run.
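
For example (the dates here are only illustrative), deletedump takes
-from as well as -to, so the same span can be covered in several
smaller passes:

   backup> deletedump -from 01/01/2007 -to 06/30/2007
   backup> deletedump -from 07/01/2007 -to 12/31/2007
   backup> deletedump -from 01/01/2008 -to 03/21/2008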

...
> 
> 1. How can I determine which server has been elected to be the Ubik write coordinator?

I doubt this is a ubik error (see above).  However,
udebug <server> 7021
will tell you what ubik thinks is happening with budb, including who is
the sync site.  On the sync site, udebug will print additional
information, including "Recovery state".  You want to see "1f".

Replace 7021 with 7002 or 7003 to find information on pt and vl,
although those are most likely not your worry here.
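
A quick way to check all three services on every database server is a
small shell loop (untested sketch; the hostnames are placeholders for
your own db servers).  The lines to look for in the output are the
sync-site status and, on the sync site, "Recovery state 1f":

   for host in db1.example.com db2.example.com db3.example.com; do
     for port in 7021 7002 7003; do
       echo "== $host port $port"
       udebug $host $port | egrep -i 'sync site|recovery state'
     done
   done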

> 
> 2. Is there a way to check the consistency of the various databases across all of the database servers?
...

I think you could build and run "ol_verify".  This is probably
not your best strategy.

If you really think your backup database is corrupted, it's probably
simpler to build it from scratch.  You'll have to scan each dump,
but if you've got them online this may be pretty cheap.  Transarc used
to recommend keeping frequent backups of the backup database, and
they probably also recommended purging it on a regular basis of backups
that were no longer of interest.

Before doing either of these, you should certainly save your current
backup database, on each machine.
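
For example, on each database server (the paths below assume the
traditional Transarc layout under /usr/afs/db; packaged installs may
keep the databases elsewhere, e.g. /var/lib/openafs/db):

   bos shutdown <dbserver> buserver -localauth -wait
   cp -p /usr/afs/db/bdb.DB0    /usr/afs/db/bdb.DB0.save
   cp -p /usr/afs/db/bdb.DBSYS1 /usr/afs/db/bdb.DBSYS1.save
   bos startup <dbserver> buserver -localauth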

-Marcus Watts


[OpenAFS] Backup/Determining Ubik Coordinator/Verifying Databases

2008-04-09 Thread Joseph D. Kulisics
Hi,

I'm having a problem with a new AFS database server, and I wanted to try to 
determine a few things. First, the problem is that I retired an old machine 
running OpenAFS version 1.4.2. It was the first database server in my cell, and 
it had the lowest IP address. I retired the old machine and brought up a new 
machine in its place. The new machine runs OpenAFS version 1.4.6. I configured 
the machine to be a database server according to the instructions in the Quick 
Start Guide.

The old server managed backups from AFS, and when I raised the new server, I 
copied the backups, which I had kept online as files, onto the new server. When 
I run the backup command to try to remove old backups, I get the following 
error:

backup> deletedump -to 03/21/2008
The following dumps were deleted:
backup: RPC interface mismatch (-452) ; Error while deleting dumps from 0 to 
1206082859

The error made me concerned that the new database server might not be 
communicating correctly with the old server, and I was trying to go through log 
files to determine the state of the database servers. I had two questions:

1. How can I determine which server has been elected to be the Ubik write 
coordinator?

2. Is there a way to check the consistency of the various databases across all 
of the database servers?

These questions mattered mostly for peace of mind and studying the problem with 
the backup system, which doesn't seem to be able to dump a volume anymore. Does 
anyone recognize the error above?

I apologize if these are silly questions. I couldn't find the answers in my 
admittedly blind search of documentation. Please, let me know if there is more 
information that I can provide. Thank you,

  Joseph Kulisics

_

For all claims from an equal, urged upon a neighbor as commands, before any 
attempt at legal settlement, be they great or be they small, have only one 
meaning, and that is slavery.

Pericles as quoted by Thucydides,
Book I of his history of the Peloponnesian War


HOME PAGE URL:  http://copper.chem.ucla.edu/~kulisics/
Joseph D. Kulisics --- 居候犬
Јосиф Кулишић


RE: [OpenAFS] Duplicate VLDB entries

2008-04-09 Thread Jeff Quinn
jax > vos listvol afsfs10 vicepa
Total number of volumes on server afsfs10 partition /vicepa: 0 

Total volumes onLine 0 ; Total volumes offLine 0 ; Total busy 0

jax > vos syncvldb -server afsfs10 -part vicepa
VLDB synchronized with state of server afsfs10 partition /vicepa

jax > vos syncserv -server afsfs10 -part vicepa
Server afsfs10 partition /vicepa synchronized with VLDB

jax > vos listvldb -server afsfs10 -partition vicepa
...
Total entries: 1988

Seems like those commands aren't doing much to help. Any other ideas?

Thanks again,
-Jeff

-Original Message-
From: Stephen Joyce [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 09, 2008 3:11 PM
To: Jeff Quinn
Subject: Re: [OpenAFS] Duplicate VLDB entries

Try

vos syncvldb -server <server> -part <partition>
vos syncserv -server <server> -part <partition>

If any of the volumes are replicated, you'd want to sync all the 
servers/parts which hold replicas too. Be sure to perform all the syncvldb 
commands before the syncserv commands.
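
For example, with replicas spread over two servers (the names below
just reuse the ones from this thread), the ordering would be:

   for s in afsfs5 afsfs10; do vos syncvldb -server $s -part vicepa; done
   for s in afsfs5 afsfs10; do vos syncserv -server $s -part vicepa; done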

In the steady (non-problem) state, these commands should be 
non-destructive and do nothing; otherwise, they will change info/volumes as 
problems are corrected.

Cheers, Stephen
--
Stephen Joyce
Systems Administrator                              P A N I C
Physics & Astronomy Department                     Physics & Astronomy
University of North Carolina at Chapel Hill        Network Infrastructure
voice: (919) 962-7214                              and Computing
fax: (919) 962-0480                                http://www.panic.unc.edu

   "Lazy Programmers know that if a thing is worth doing, it's worth
   doing well -- unless doing it well takes so long that it isn't worth
   doing any more. Then you just do it 'good enough'"
  --- Programming Perl, p 282.

On Wed, 9 Apr 2008, Jeff Quinn wrote:

> We recently lost a partition on one of our servers and restored from tape.
> The volumes were all recovered, but when we did a vos exam on any of the
> restored volumes, it would have 2 identical entries in the vldb:
>
>
>
>number of sites -> 2
>
>   server afsfs10.cl.msu.edu partition /vicepa RW Site
>
>   server afsfs10.cl.msu.edu partition /vicepa RW Site
>
>
>
> After moving the volume:
>
>
>
>number of sites -> 2
>
>   server afsfs5.cl.msu.edu partition /vicepa RW Site
>
>   server afsfs10.cl.msu.edu partition /vicepa RW Site
>
>
>
> We moved all of the volumes off of that partition, so now they all look like
> above.  A vos listvol of afsfs10 vicepa returns 0 volumes, but a vos
> listvldb returns all 1989 of them.  We then tried to do a vos syncvldb
> with a volume.  It reports that it has been synced, but there are still 2
> entries in the vldb.  We have also tried vos syncserv with the whole
> partition with the same result.
>
>
>
> We have also tried to manually remove an entry from the vldb using vos
> delentry, but you cannot specify a server/partition AND a volume.  A vos
> delentry on the volume removed both entries from the vldb.
>
>
>
> All volumes are accessible via their mount points.
>
>
>
> Any ideas on how to remedy the vldb discrepancy?
>
>
>
> Thanks,
>
> -Jeff Quinn
>
>




Re: [OpenAFS] 1.4.7pre3 client success on EL4&5 - and a question

2008-04-09 Thread Jack Neely
On Wed, Apr 09, 2008 at 05:45:59PM +0200, Stephan Wiesand wrote:
>  1.4.7pre3 builds, and the client works, for us on SL4 and SL5, i386 and 
>  x86_64.
> 
>  And here's the question:
> 
>  We're trying to do something that doesn't work in AFS space under certain 
>  circumstances. We don't know yet what makes it fail or work, but it 
>  consistently either fails or works on any client, and all clients have a 
>  very similar setup.
> 
>  All clients are SL4, amd64, latest kernel (2.6.9-67.0.7.ELsmp).
> 
>  The failing procedure is a bit convoluted, and I don't know in every detail 
>  what it's doing. But the part that fails on some clients is that RPMs get 
>  installed, with the RPMDB in AFS, and if it fails we get three messages 
>  "afs: failed to store file (13)" and a wedged RPMDB. And, with 1.4.7pre3 but 
>  not 1.4.6, we see two more messages:
> 
>  WARNING: afs_ufswr vcp=10396e494c0, exOrW=0
>  WARNING: afs_ufswr vcp=10396e49140, exOrW=0
> 
>  Any hint what these are about would probably be very helpful.
> 
>  Thanks,
>   Stephan
> 
>  PS On SL3, inserting the module from 1.4.7pre3 fails with the message that 
>  hlist_unhashed is GPLONLY. I'll file a bug in RT.
> 
>  -- 
>  Stephan Wiesand
>DESY - DV -
>Platanenallee 6
>15738 Zeuthen, Germany

Well, the RPM db is a Sleepycat (Berkeley DB) database, which is not
known to work well in AFS space.
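
One workaround (sketched here with an example path only) is to keep the
RPM database on local disk and point rpm at it explicitly:

   rpm --dbpath /var/lib/rpm-local -Uvh some-package.rpm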

Jack Neely
-- 
Jack Neely <[EMAIL PROTECTED]>
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4  EA6B 213B 765F 3B6A 5B89


[OpenAFS] 1.4.7pre3 client success on EL4&5 - and a question

2008-04-09 Thread Stephan Wiesand
1.4.7pre3 builds, and the client works, for us on SL4 and SL5, i386 and 
x86_64.


And here's the question:

We're trying to do something that doesn't work in AFS space under certain 
circumstances. We don't know yet what makes it fail or work, but it 
consistently either fails or works on any client, and all clients have a 
very similar setup.


All clients are SL4, amd64, latest kernel (2.6.9-67.0.7.ELsmp).

The failing procedure is a bit convoluted, and I don't know in every 
detail what it's doing. But the part that fails on some clients is that 
RPMs get installed, with the RPMDB in AFS, and if it fails we get three 
messages "afs: failed to store file (13)" and a wedged RPMDB. And, with 
1.4.7pre3 but not 1.4.6, we see two more messages:


WARNING: afs_ufswr vcp=10396e494c0, exOrW=0
WARNING: afs_ufswr vcp=10396e49140, exOrW=0

Any hint what these are about would probably be very helpful.

Thanks,
Stephan

PS On SL3, inserting the module from 1.4.7pre3 fails with the message that 
hlist_unhashed is GPLONLY. I'll file a bug in RT.


--
Stephan Wiesand
  DESY - DV -
  Platanenallee 6
  15738 Zeuthen, Germany


[OpenAFS] Duplicate VLDB entries

2008-04-09 Thread Jeff Quinn
We recently lost a partition on one of our servers and restored from tape.
The volumes were all recovered, but when we did a vos exam on any of the
restored volumes, it would have 2 identical entries in the vldb:

 

number of sites -> 2

   server afsfs10.cl.msu.edu partition /vicepa RW Site

   server afsfs10.cl.msu.edu partition /vicepa RW Site

 

After moving the volume:

 

number of sites -> 2

   server afsfs5.cl.msu.edu partition /vicepa RW Site 

   server afsfs10.cl.msu.edu partition /vicepa RW Site

 

We moved all of the volumes off of that partition, so now they all look like
above.   A vos listvol of afsfs10 vicepa returns 0 volumes, but a vos
listvldb returns all 1989 of them.  We then tried to do a vos syncvldb
with a volume.  It reports that it has been synced, but there are still 2
entries in the vldb.  We have also tried vos syncserv with the whole
partition with the same result.  

 

We have also tried to manually remove an entry from the vldb using vos
delentry, but you cannot specify a server/partition AND a volume.  A vos
delentry on the volume removed both entries from the vldb.

 

All volumes are accessible via their mount points.

 

Any ideas on how to remedy the vldb discrepancy?   

 

Thanks,

-Jeff Quinn