[OpenAFS] Unable to 'move' volume....volume ID too large / cloned volume not findable?
Hi Everyone,

I am having a problem trying to 'vos move' volumes after losing/restoring an AFS file server. The server that was lost has been restored on new hardware. The old RW volumes were moved to other servers (convertROtoRW) and now I want to use the 'vos move' command to move them back.

Here is what happens (I have tokens as 'admin'. Linat07 is the current RW home for OSGWN and Linat08 is the new server):

    vos move OSGWN linat07 /vicepf linat08 /vicepg -verbose
    Starting transaction on source volume 536874901 ... done
    Allocating new volume id for clone of volume 536874901 ... done
    Cloning source volume 536874901 ... done
    Ending the transaction on the source volume 536874901 ... done
    Starting transaction on the cloned volume 2681864210 ...
    Failed to start a transaction on the cloned volume2681864210
    Volume not attached, does not exist, or not on line
    vos move: operation interrupted, cleanup in progress...
    clear transaction contexts
    Recovery: Releasing VLDB lock on volume 536874901 ... done
    Recovery: Accessing VLDB.
    move incomplete - attempt cleanup of target partition - no guarantee
    Recovery: Creating transaction for destination volume 536874901 ...
    Recovery: Unable to start transaction on destination volume 536874901.
    Recovery: Creating transaction on source volume 536874901 ... done
    Recovery: Setting flags on source volume 536874901 ... done
    Recovery: Ending transaction on source volume 536874901 ... done
    Recovery: Creating transaction on clone volume 2681864210 ...
    Recovery: Unable to start transaction on source volume 536874901.
    Recovery: Releasing lock on VLDB entry for volume 536874901 ... done
    cleanup complete - user verify desired result

    [linat08:local]# vos examine 2681864210
    Could not fetch the entry for volume number 18446744072096448530 from VLDB

I am assuming the large cloned volume ID is causing the problem as opposed to an inability to create a cloned volume. I can make replicas on linat08 for existing volumes without a problem.

NOTE: The hex representations of the cloned volume from the move attempt above and the 'vos examine':

    [linat08:local]# 2681864210 = 0x 9FDA0012
    [linat08:local]# 18446744072096448530 = 0x 9FDA0012

Any suggestions? This seems like a 64 vs 32 bit issue.

Here is the information on servers and versions:

We have 3 AFS DB servers:

    Linat02 - RHEL5/x86_64 - OpenAFS 1.4.7
    Linat03 - RHEL4/i686 - OpenAFS 1.4.6
    Linat04 - RHEL5/x86_64 - OpenAFS 1.4.7

We have 3 AFS file servers:

    Linat06 - RHEL4/x86_64 - OpenAFS 1.4.6
    Linat07 - RHEL4/x86_64 - OpenAFS 1.4.6
    Linat08 - RHEL5/x86_64 - OpenAFS 1.4.8

Info on OSGWN volume:

    [linat08:~]# vos examine OSGWN
    OSGWN 536874901 RW 505153 K On-line
        linat07.grid.umich.edu /vicepf
        RWrite 536874901 ROnly 18446744072096448530 Backup 0
        MaxQuota 200 K
        Creation Tue Mar 3 03:43:06 2009
        Copy Mon Dec 3 16:39:21 2007
        Backup Never
        Last Update Sat Feb 21 15:18:05 2009
        0 accesses in the past day (i.e., vnode references)

        RWrite: 536874901 ROnly: 536874902
        number of sites - 2
           server linat07.grid.umich.edu partition /vicepf RW Site
           server linat06.grid.umich.edu partition /vicepe RO Site

Let me know if there is other info required to help resolve this.

Thanks,

Shawn McKee
University of Michigan/ATLAS Group
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Unable to 'move' volume....volume ID too large / cloned volume not findable?
What does the VolserLog on the source server say?

-Hartmut

--
Hartmut Reuter
e-mail reu...@rzg.mpg.de
phone +49-89-3299-1328
fax +49-89-3299-1301
RZG (Rechenzentrum Garching)
web http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the Institut fuer Plasmaphysik (IPP)
Re: [OpenAFS] Unable to 'move' volume....volume ID too large / cloned volume not findable?
McKee, Shawn wrote:
> Could not fetch the entry for volume number 18446744072096448530 from VLDB
>
> I am assuming the large cloned volume ID is causing the problem as opposed to an inability to create a cloned volume. I can make replicas on linat08 for existing volumes without a problem.
>
> NOTE: The hex representations of the cloned volume from the move attempt above and the 'vos examine':
>
> [linat08:local]# 2681864210 = 0x 9FDA0012
> [linat08:local]# 18446744072096448530 = 0x 9FDA0012
>
> Any suggestions? This seems like a 64 vs 32 bit issue.

It's a signed vs unsigned int problem.

afsint.h and afscbint.h include

    typedef afs_uint32 VolumeId;

which is used throughout the src/vol package for volume IDs. In volser/vos.c and volser/vsprocs.c the volume ID variables are defined as signed 32-bit (afs_int32). There are also some signed vs unsigned issues in some of the protocol structures. I believe the type VolumeId should be used consistently to define the types of volume ID variables.

In vldbint.xg:

struct [nu]vldbentry uses afs_uint32 for the volumeId array but afs_int32 for the cloneId. Same for VldbUpdateEntry.

VldbListByAttributes uses afs_int32 for the volumeid.

This is not a full examination but I believe it shows where the problem lies.

Jeffrey Altman
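The arithmetic behind the two numbers in this thread can be checked with a short Python sketch. This is not OpenAFS code; the helper names `as_signed32` and `as_unsigned64` are made up for illustration. It only shows that storing the clone ID in a signed 32-bit variable and then widening it to 64 bits before printing it as unsigned would produce exactly the value reported by 'vos examine'.

```python
import struct

def as_signed32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]

def as_unsigned64(value: int) -> int:
    """Reinterpret a possibly negative integer as an unsigned 64-bit value,
    which is what sign extension followed by an unsigned print yields."""
    return value & 0xFFFFFFFFFFFFFFFF

clone_id = 2681864210             # clone volume ID reported by 'vos move'
signed = as_signed32(clone_id)    # what a signed 32-bit variable would hold
widened = as_unsigned64(signed)   # sign-extended to 64 bits, read as unsigned

print(hex(clone_id))   # 0x9fda0012
print(signed)          # -1613103086
print(widened)         # 18446744072096448530
print(hex(widened))    # 0xffffffff9fda0012
```

The low 32 bits are identical in both values (0x9FDA0012); only the upper 32 bits, filled with ones by sign extension, differ. Any volume ID at or above 2^31 (2147483648) would be affected in the same way, while smaller IDs such as 536874901 pass through unchanged.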
RE: [OpenAFS] Unable to 'move' volume....volume ID too large / cloned volume not findable?
Thanks for the quick reply Hartmut. Here are the relevant lines from the attempt:

    Wed Mar 18 10:50:14 2009 1 Volser: Clone: Cloning volume 536874901 to new volume 2681864210
    Wed Mar 18 10:50:15 2009 VAttachVolume: Failed to open /vicepf/V184467440720964485 (errno 2)
    Wed Mar 18 10:50:15 2009 VAttachVolume: Failed to open /vicepf/V184467440720964485 (errno 2)

Shawn

-----Original Message-----
From: Hartmut Reuter [mailto:reu...@rzg.mpg.de]
Sent: Thursday, March 19, 2009 9:01 AM
To: McKee, Shawn
Cc: openafs-info@openafs.org
Subject: Re: [OpenAFS] Unable to 'move' volume....volume ID too large / cloned volume not findable?

What does the VolserLog on the source server say?

-Hartmut
RE: [OpenAFS] Unable to 'move' volume....volume ID too large / cloned volume not findable?
Thanks Jeffrey,

Yes, I agree...you have found the issue. I am surprised no one else has hit this before (maybe they have but I didn't find this problem in my searching).

I guess this will take a patch to the code for existing versions to get this resolved.

Thanks again,

Shawn

-----Original Message-----
From: Jeffrey Altman [mailto:jalt...@secure-endpoints.com]
Sent: Thursday, March 19, 2009 9:19 AM
To: McKee, Shawn
Cc: openafs-info@openafs.org
Subject: Re: [OpenAFS] Unable to 'move' volume....volume ID too large / cloned volume not findable?

It's a signed vs unsigned int problem.
Re: [OpenAFS] Unable to 'move' volume....volume ID too large / cloned volume not findable?
McKee, Shawn wrote:
> Thanks Jeffrey,
>
> Yes, I agree...you have found the issue. I am surprised no one else has hit this before (maybe they have but I didn't find this problem in my searching).
>
> I guess this will take a patch to the code for existing versions to get this resolved.
>
> Thanks again,
>
> Shawn

I have opened ticket 124510 for this issue.
Re: [OpenAFS] AFS lag
There is a lot of misinformation about Ubik out there; the voting protocol is actually not complicated, it's just not documented well.

It's actually well-documented, if you find Kazar's paper on Quorum Completion.

You know, we should try to find a copy of that and put it somewhere useful. From what I remember (I think I saw a copy once), the paper gets you about 80% of the way there; the source code gets you the rest of the way.

Actually, I now realize that I _do_ have a copy of it. Can we put it on the OpenAFS web site? I just have the PostScript; it's easy enough to convert that to PDF.

If your database servers are accessible via the Internet, we could take a look at them via udebug. Really, there are only a few things that can go wrong; of all of the pieces of AFS, I think Ubik is one of the most bulletproof.

There are a couple (unlikely) open issues; see RT.

Didn't know about those. Still, I think we need more information to diagnose the original problem.

--Ken
RE: [OpenAFS] Openafs 1.4.8 on OSX very slow
This problem was solved after my ISP fixed some very slow DNS servers, which were even slower in conjunction with a DNS relay on my router. It looks like the Finder was waiting for that every time. It's running smoothly now, as it used to. Thanks everyone!

-----Original Message-----
From: openafs-info-ad...@openafs.org [mailto:openafs-info-ad...@openafs.org] On Behalf Of Hans Melgers
Sent: Monday, 9 March 2009 22:13
To: jalt...@secure-endpoints.com
Cc: Felix Frank; Joel; OpenAFS-info
Subject: Re: [OpenAFS] Openafs 1.4.8 on OSX very slow

If the Finder stats all of them then yes, this behaviour seems normal. There are a couple of gigs in that tree. However, I did use the client before and in my memory it wasn't as slow as it is now. I tried to make a symlink from my local Documents folder directly to my AFS homedir but Finder doesn't like that either: again very slow, now even when opening the Documents folder.

On 9 Mar 2009, at 21:33, Jeffrey Altman wrote:
> The next question then becomes how many files are in all of those subdirectories? Finder is going to stat them all. Since the MacOS X OpenAFS client does not have a working bulkstat mechanism, one FetchStatus RPC will be issued per directory object. More if those objects are symlinks.
>
> You can use tcpdump and wireshark to capture and examine the RPCs that are being sent from your machine. The more that are sent, the longer it will take.
>
> Jeffrey Altman
>
> Hans Melgers wrote:
>> I log in as admin, so I have all permissions.
>>
>> On 9 Mar 2009, at 20:55, Jeffrey Altman wrote:
>>> Hans Melgers wrote:
>>>> I did, no change. Just tried it again, double-clicking on the user dir, it's got 9 subdirs (all 9 different volumes), took 3 minutes before they showed up.
>>>>
>>>> Maybe a clue: opening a subfolder with no subdirs opens up immediately (all files shown), opening one with subfolders takes forever.
>>>
>>> Do you have 'l'ist but not 'r'ead permission on those subfolders?

Hans Melgers
ENEM B.V.
www.enem.nl
h...@enem.nl
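For anyone wanting a rough feel for the "one FetchStatus RPC per directory object" point quoted above, here is a minimal Python sketch that counts the objects below a directory and multiplies by an assumed round-trip time. The function names and the 50 ms figure are illustrative assumptions, not measurements from this thread, and the estimate ignores caching and the extra RPCs symlinks may cause. Note that running it against an AFS path will itself trigger those status fetches.

```python
import os
import tempfile

def count_objects(root: str) -> dict:
    """Count directories, files, and symlinks below root without following symlinks."""
    counts = {"dirs": 0, "files": 0, "symlinks": 0}
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                counts["symlinks"] += 1
            elif os.path.isdir(path):
                counts["dirs"] += 1
            else:
                counts["files"] += 1
    return counts

def estimate_seconds(counts: dict, rtt_seconds: float) -> float:
    """Rough lower bound: one status RPC per object, each costing one round trip."""
    return sum(counts.values()) * rtt_seconds

# Small self-contained demo on a temporary tree (2 directories, 2 files).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for name in ("x.txt", "y.txt"):
        open(os.path.join(root, "a", name), "w").close()
    counts = count_objects(root)
    print(counts)  # {'dirs': 2, 'files': 2, 'symlinks': 0}
    # 0.05 s per round trip is an assumed value, chosen only for illustration.
    print(estimate_seconds(counts, rtt_seconds=0.05))
```

With a few thousand objects and a slow round trip the total quickly reaches minutes, which fits the behaviour described, although in this case the delay turned out to come from DNS.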