On 2011-03-01 at 20:27, Andrew Deason ( adea...@sinenomine.net ) said:
> On Tue, 1 Mar 2011 18:57:50 -0500 (EST)
> Andy Cobaugh <phale...@gmail.com> wrote:

>> This has happened at least once at work on Solaris 10 x86 with a
>> .backup volume, as seen above, and at least once on one of my home
>> machines on 64bit linux with an RO clone.

> The volume was probably deleted during the salvage (it was already gone
> by the time of the 'zap -force'), but the fileserver still has the
> volume in an 'error' state.

> Could you
>
> volinfo /vicepa 536871061
> fssync-debug query 536871061

# volinfo /vicepcb 536871061
Inode 2305843649298038783: Good magic 78a1b2c5 and version 1
Inode 2305843649365147647: Good magic 99776655 and version 1
Inode 2305843649432256511: Good magic 88664433 and version 1
Inode 2305843641043648511: Good magic 99877712 and version 1
Volume header for volume 536871061 (home.gsong.backup)
stamp.magic = 78a1b2c5, stamp.version = 1
inUse = 0, inService = 0, blessed = 1, needsSalvaged = 0, dontSalvage = 0
type = 2 (backup), uniquifier = 6743359, needsCallback = 0, destroyMe = d3
id = 536871061, parentId = 536871059, cloneId = 0, backupId = 536871061, 
restoredFromId = 0
maxquota = 134217728, minquota = 0, maxfiles = 0, filecount = 221896, diskused = 56684296
creationDate = 1299021740 (2011/03/01.18:22:20), copyDate = 1299021740 (2011/03/01.18:22:20)
backupDate = 1299021740 (2011/03/01.18:22:20), expirationDate = 0 (1969/12/31.19:00:00)
accessDate = 1299021734 (2011/03/01.18:22:14), updateDate = 1299021636 (2011/03/01.18:20:36)
owner = 1045, accountNumber = 0
dayUse = 0; week = (0, 0, 0, 0, 0, 0, 0), dayUseDate = 0 (1969/12/31.19:00:00)
volUpdateCounter = 135816
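Worth noting in that header: destroyMe = d3 is the DESTROY_ME marker (0xD3 in the OpenAFS sources), so together with inUse = 0 this header was already slated for deletion. If it's useful, a rough scan of a saved header dump for that combination (check_header is just an illustrative name, not an OpenAFS tool):

```shell
# Rough check over saved volinfo output: is the header marked DESTROY_ME
# (destroyMe = d3) and not in use? "check_header" is a made-up helper.
check_header() {
  if grep -q 'destroyMe = d3' "$1" && grep -q 'inUse = 0' "$1"; then
    echo "header marked for destruction"
  else
    echo "header looks normal"
  fi
}
# e.g.: volinfo /vicepcb 536871061 > hdr.txt && check_header hdr.txt
```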


(and I think you meant dafssync-debug. I may not have mentioned that.)

# dafssync-debug query 536871061
calling FSYNC_VolOp with command code 65543 (FSYNC_VOL_QUERY)
FSSYNC service returned 0 (SYNC_OK)
protocol header response code was 0 (SYNC_OK)
protocol reason code was 0 (SYNC_REASON_NONE)
volume = {
        hashid          = 536871061
        header          = 0
        device          = 79
        partition       = 102a75a8
        linkHandle      = 0
        nextVnodeUnique = 0
        diskDataHandle  = 0
        vnodeHashOffset = 79
        shuttingDown    = 0
        goingOffline    = 0
        cacheCheck      = 0
        nUsers          = 0
        needsPutBack    = 0
        specialStatus   = 0
        updateTime      = 0
        vnodeIndex[vSmall] = {
                handle       = 0
                bitmap       = 0
                bitmapSize   = 0
                bitmapOffset = 0
        }
        vnodeIndex[vLarge] = {
                handle       = 0
                bitmap       = 0
                bitmapSize   = 0
                bitmapOffset = 0
        }
        updateTime      = 0
        attach_state    = VOL_STATE_ERROR
        attach_flags    = VOL_IN_HASH | VOL_ON_VBYP_LIST
        nWaiters        = 0
        chainCacheCheck = 3
        salvage = {
                prio      = 0
                reason    = 0
                requested = 0
                scheduled = 0
        }
        stats = {
                hash_lookups = {
                        hi = 0
                        lo = 155
                }
                hash_short_circuits = {
                        hi = 0
                        lo = 0
                }
                hdr_loads = {
                        hi = 0
                        lo = 0
                }
                hdr_gets = {
                        hi = 0
                        lo = 0
                }
                attaches         = 0
                soft_detaches    = 0
                salvages         = 16
                vol_ops          = 1
                last_attach      = 0
                last_get         = 0
                last_promote     = 0
                last_hdr_get     = 0
                last_hdr_load    = 0
                last_salvage     = 1299019004
                last_salvage_req = 1299018855
                last_vol_op      = 1299018890
        }
        vlru = {
                idx = 5 (VLRU_QUEUE_INVALID)
        }
        pending_vol_op  = 0
}
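The combination that stands out above is attach_state = VOL_STATE_ERROR with salvage.requested = 0. For repeated checks, those two fields can be pulled out of a saved query dump with a small awk scan (vol_state is just an illustrative name; the field names are taken verbatim from the dump above):

```shell
# Extract attach_state and salvage.requested from saved
# "dafssync-debug query" output. "vol_state" is a made-up helper.
vol_state() {
  awk -F' = ' '/attach_state/ {state = $2}
               / requested/   {req = $2}
               END {print state, req}' "$1"
}
# e.g.: dafssync-debug query 536871061 > q.txt && vol_state q.txt
```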


Do you want the .vol file for this volume on the fileserver?

> I have an idea on why you can't get the volume usable again, but I have
> no clue as to what the original inconsistency was that caused the first
> salvage.

My suspicion is that a previous 'vos backup' left it in this state. The volume group hasn't been touched other than for backups in many months. I've never had a problem like this with that fileserver or volume until I upgraded from 1.4.11 to 1.6.0pre2.


>> I would have included more snippets from FileLog as well, but I have
>> the debug level turned up to try to track down a possible

> Then you should have some logs mentioning 'FSYNC_com' around 'Tue Mar  1
> 00:02:25 2011' explaining why we refused to give out the volume.  (You
> don't perhaps have that volid header also existing on another partition,
> do you?) Not that that should explain the rest of it.

Only one partition with this volume header on it. Hasn't been moved since last March.

The first time I see any FSYNC messages is this evening, when I tried to fix things. I see this line repeated 39 times:

Tue Mar  1 17:34:18 2011 FSYNC_backgroundSalvage: unable to request salvage for volume 536871061
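For the record, counting those repeats in a saved FileLog is just a grep (count_salvage_failures is an illustrative name, not an OpenAFS tool):

```shell
# Count repeats of the background-salvage failure for one volume id in a
# saved FileLog. "count_salvage_failures" is a made-up helper name.
count_salvage_failures() {
  grep -c "unable to request salvage for volume $2" "$1"
}
# e.g.: count_salvage_failures FileLog 536871061   # prints 39 here
```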

If I look in FileLog.old (I restarted at some point to up the debug level), I see these lines:

Tue Mar  1 16:11:34 2011 FSYNC_com:  read failed; dropping connection (cnt=94804)
Tue Mar  1 16:11:34 2011 FSYNC_com:  read failed; dropping connection (cnt=94805)

I would also like to note that the vos backup occurring on Sunday failed with a timeout, then succeeded Monday, then failed today.

Command output from vos backup on Sunday:

Failed to end the transaction on the rw volume 536871059
____: server not responding promptly
Error in vos backup command.
____: server not responding promptly

Corresponding entries in VolserLog, then the successful vos backup on Monday:

Sun Feb 27 00:06:38 2011 trans 22689 on volume 536871059 has been idle for more than 300 seconds
Sun Feb 27 00:07:08 2011 trans 22689 on volume 536871059 has been idle for more than 330 seconds
Sun Feb 27 00:07:38 2011 trans 22689 on volume 536871059 has been idle for more than 360 seconds
Sun Feb 27 00:08:08 2011 trans 22689 on volume 536871059 has been idle for more than 390 seconds
Sun Feb 27 00:08:38 2011 trans 22689 on volume 536871059 has been idle for more than 420 seconds
Sun Feb 27 00:09:08 2011 trans 22689 on volume 536871059 has been idle for more than 450 seconds
Sun Feb 27 00:09:38 2011 trans 22689 on volume 536871059 has been idle for more than 480 seconds
Sun Feb 27 00:10:08 2011 trans 22689 on volume 536871059 has been idle for more than 510 seconds
Sun Feb 27 00:10:38 2011 trans 22689 on volume 536871059 has been idle for more than 540 seconds
Sun Feb 27 00:11:08 2011 trans 22689 on volume 536871059 has been idle for more than 570 seconds
Sun Feb 27 00:11:38 2011 trans 22689 on volume 536871059 has been idle for more than 600 seconds
Sun Feb 27 00:11:38 2011 trans 22689 on volume 536871059 has timed out
...
Mon Feb 28 00:01:38 2011 1 Volser: Clone: Recloning volume 536871059 to volume 536871061
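That cadence matches what the log itself shows of the volserver's idle-transaction handling: warnings start at 300 seconds idle, repeat every 30 seconds, and the transaction is killed at 600. A quick scan that lists the idle-seconds values from a saved VolserLog (idle_steps is an illustrative name):

```shell
# List the "idle for more than N seconds" values from a saved VolserLog
# so the 300..600s progression is easy to eyeball. Made-up helper name.
idle_steps() {
  awk '/has been idle for more than/ {
         for (i = 1; i <= NF; i++)
           if ($i == "than") { print $(i + 1); break }
       }' "$1"
}
# e.g.: idle_steps VolserLog
```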

I noticed the error from Sunday in the remctl logs and ran vos backup manually just to make sure things were OK:

Mon Feb 28 09:24:03 2011 1 Volser: Clone: Recloning volume 536871059 to volume 536871061

--andy
_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info