I thought I would mention that I can also confirm that this bug is not
fixed.

Unfortunately, I ran into a situation where I couldn't unmount the
file-system and had to do a reboot -f, and on reboot an fsck was
required because the journal could not be replayed (logredo failed,
rc=-231).

On 06/06/2010 01:12 PM, Tim Nufire wrote:
> It was late when I sent the email below.... Here's more detail on my
> test so others can reproduce...
>
> 1) Create a RAID6 array with 15 1.5 TB volumes
>
> */sbin/mdadm --create /dev/md11 --bitmap=internal --auto=yes -l 6 -n
> 15 /dev/sd{b,e,h,k,n,q,t,w,z,ac,af,ai,al,ao,ar}1  *
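
(Small aside for anyone reproducing this: instead of watching
/proc/mdstat for the initial sync in step 2, mdadm can simply block
until the resync finishes -- at least with the mdadm versions I've
used:)

*mdadm --wait /dev/md11*
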
>
> 2) Let the array fully sync and then create a JFS file system on it
>
> */sbin/jfs_mkfs -q /dev/md11*
> /sbin/jfs_mkfs version 1.1.14, 06-Apr-2009
>
> Format completed successfully.
>
> 19027657792 kilobytes total disk space.
>
> 3) Make sure the volume looks good
>
> *jfs_tune -l /dev/md11*
> jfs_tune version 1.1.14, 06-Apr-2009
>
> JFS filesystem superblock:
>
> JFS magic number: 'JFS1'
> JFS version: 1
> JFS state: clean
> JFS flags: JFS_LINUX  JFS_COMMIT  JFS_GROUPCOMMIT  JFS_INLINELOG  
> Aggregate block size: 4096 bytes
> Aggregate size: 38053891680 blocks
> Physical block size: 512 bytes
> Allocation group size: 67108864 aggregate blocks
> Log device number: 0x0
> Filesystem creation: Sun Jun  6 12:59:04 2010
> Volume label: ''
>
> *fsck.jfs -n -v /dev/md11*
> fsck.jfs version 1.1.14, 06-Apr-2009
> processing started: 6/6/2010 13.0.6
> The current device is:  /dev/md11
> Open(...READONLY...) returned rc = 0
> Primary superblock is valid.
> The type of file system for the device is JFS.
> Block size in bytes:  4096
> Filesystem size in blocks:  4756914448
> **Phase 1 - Check Blocks, Files/Directories, and  Directory Entries
> **Phase 2 - Count links
> **Phase 3 - Duplicate Block Rescan and Directory Connectedness
> **Phase 4 - Report Problems
> **Phase 5 - Check Connectivity
> **Phase 6 - Perform Approved Corrections
> **Phase 7 - Verify File/Directory Allocation Maps
> **Phase 8 - Verify Disk Allocation Maps
> Filesystem Summary:
> Blocks in use for inodes:  8
> Inode count:  64
> File count:  0
> Directory count:  1
> Block count:  4756914448
> Free block count:  4756155203
> 19027657792 kilobytes total disk space.
>         0 kilobytes in 1 directories.
>         0 kilobytes in 0 user files.
>         0 kilobytes in extended attributes
>         0 kilobytes in access control lists
>   3036980 kilobytes reserved for system use.
> 19024620812 kilobytes are available for use.
> File system checked READ ONLY.
> Filesystem is clean.
> processing terminated:  6/6/2010 13:00:35  with return code: 0  exit
> code: 0.
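
(Quick sanity check on the numbers above -- this is just my own shell
arithmetic, not fsck output. The block count and the reported size
agree, and the volume is comfortably past the 12TB point where Tim
reports fsck trouble further down the thread:)

*echo $(( 4756914448 * 4 ))*
19027657792        <- KB, matches the "total disk space" line

*echo $(( 19027657792 / 1024 / 1024 / 1024 ))*
17                 <- i.e. roughly 17.7 TiB (about 19.0 TB)
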
>
> 4) Mount the new file system
>
> *mount /dev/md11 /mnt*
>
> 5) Fail 3 drives in the array so that the array is no longer viable
> (RAID6 can only handle 2 failures)
>
> *mdadm /dev/md11 --fail /dev/sdb1 --remove /dev/sdb1*
> mdadm: set /dev/sdb1 faulty in /dev/md11
> mdadm: hot removed /dev/sdb1
>
> *mdadm /dev/md11 --fail /dev/sde1 --remove /dev/sde1*
> mdadm: set /dev/sde1 faulty in /dev/md11
> mdadm: hot removed /dev/sde1
>
> *mdadm /dev/md11 --fail /dev/sdh1 --remove /dev/sdh1*
> mdadm: set /dev/sdh1 faulty in /dev/md11
> mdadm: hot removed /dev/sdh1
>
> *cat /proc/mdstat *
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
> [raid4] [raid10] 
> md11 : active raid6 sdar1[14] sdao1[13] sdal1[12] sdai1[11] sdaf1[10]
> sdac1[9] sdz1[8] sdw1[7] sdt1[6] sdq1[5] sdn1[4] sdk1[3]
>       19027657792 blocks level 6, 64k chunk, algorithm 2 [15/12]
> [___UUUUUUUUUUUU]
>       bitmap: 25/175 pages [100KB], 4096KB chunk
>
> unused devices: <none>
>
> 6) Un-mount the file system and re-assemble the RAID6 array manually.
> This works because the data on the drives is still good.
>
> NOTE: sdb1 and sde1 are marked as 'missing' below because they were
> removed while the array was still viable. Removing sdh1 killed the array.
>
> *umount /dev/md11*
>
> *mdadm --stop /dev/md11*
> mdadm: stopped /dev/md11
>
> *mdadm --create /dev/md11 --assume-clean --bitmap=internal --auto=yes
> -l 6 -n 15 missing missing /dev/sd{h,k,n,q,t,w,z,ac,af,ai,al,ao,ar}1*
> mdadm: /dev/sdh1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> mdadm: /dev/sdk1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> mdadm: /dev/sdn1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> mdadm: /dev/sdq1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> mdadm: /dev/sdt1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> mdadm: /dev/sdw1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> mdadm: /dev/sdz1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> mdadm: /dev/sdac1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> mdadm: /dev/sdaf1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> mdadm: /dev/sdai1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> mdadm: /dev/sdal1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> mdadm: /dev/sdao1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> mdadm: /dev/sdar1 appears to be part of a raid array:
>     level=raid6 devices=15 ctime=Sun Jun  6 12:45:57 2010
> Continue creating array? yes
> mdadm: array /dev/md11 started.
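
(One note for anyone following along: when you recreate an array with
--create like this, the device order has to match the original array or
the data won't line up. Before stopping the old array I like to confirm
each member's slot with mdadm --examine -- exactly which line shows the
slot depends on the metadata version, but it is in there:)

*mdadm --examine /dev/sdh1*
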
>
> 7) Check the file system... Read-only at first and then read-write
>
> *fsck.jfs -n -v /dev/md11*
> fsck.jfs version 1.1.14, 06-Apr-2009
> processing started: 6/6/2010 13.7.30
> The current device is:  /dev/md11
> Open(...READONLY...) returned rc = 0
> Primary superblock is valid.
> The type of file system for the device is JFS.
> Block size in bytes:  4096
> Filesystem size in blocks:  4756914448
> **Phase 1 - Check Blocks, Files/Directories, and  Directory Entries
> **Phase 2 - Count links
> **Phase 3 - Duplicate Block Rescan and Directory Connectedness
> **Phase 4 - Report Problems
> **Phase 5 - Check Connectivity
> **Phase 6 - Perform Approved Corrections
> **Phase 7 - Verify File/Directory Allocation Maps
> **Phase 8 - Verify Disk Allocation Maps
> Filesystem Summary:
> Blocks in use for inodes:  8
> Inode count:  64
> File count:  0
> Directory count:  1
> Block count:  4756914448
> Free block count:  4756155203
> 19027657792 kilobytes total disk space.
>         0 kilobytes in 1 directories.
>         0 kilobytes in 0 user files.
>         0 kilobytes in extended attributes
>         0 kilobytes in access control lists
>   3036980 kilobytes reserved for system use.
> 19024620812 kilobytes are available for use.
> File system checked READ ONLY.
> Filesystem is clean.
> processing terminated:  6/6/2010 13:07:58  with return code: 0  exit
> code: 0.
>
> *fsck.jfs -v /dev/md11*
> fsck.jfs version 1.1.14, 06-Apr-2009
> processing started: 6/6/2010 13.8.21
> Using default parameter: -p
> The current device is:  /dev/md11
> Open(...READ/WRITE EXCLUSIVE...) returned rc = 0
> Primary superblock is valid.
> The type of file system for the device is JFS.
> Block size in bytes:  4096
> Filesystem size in blocks:  4756914448
> **Phase 0 - Replay Journal Log
> LOGREDO:  Log record for Sync Point at:    0x02050
> LOGREDO:  Beginning to update the Inode Allocation Map.
> LOGREDO:  Done updating the Inode Allocation Map.
> LOGREDO:  Beginning to update the Block Map.
> LOGREDO:  Incorrect leaf index detected (k=(d) 0, j=(d) 0, idx=(d) 0)
> while writing Block Map.
> LOGREDO:  Write Block Map control page failed in UpdateMaps().
> LOGREDO:  Unable to update map(s).
> *logredo failed (rc=-231).  fsck continuing.*
> **Phase 1 - Check Blocks, Files/Directories, and  Directory Entries
> **Phase 2 - Count links
> **Phase 3 - Duplicate Block Rescan and Directory Connectedness
> **Phase 4 - Report Problems
> **Phase 5 - Check Connectivity
> **Phase 6 - Perform Approved Corrections
> **Phase 7 - Rebuild File/Directory Allocation Maps
> **Phase 8 - Rebuild Disk Allocation Maps
> **Phase 9 - Reformat File System Log
> logformat returned rc = 0
> Filesystem Summary:
> Blocks in use for inodes:  8
> Inode count:  64
> File count:  0
> Directory count:  1
> Block count:  4756914448
> Free block count:  4756155203
> 19027657792 kilobytes total disk space.
>         0 kilobytes in 1 directories.
>         0 kilobytes in 0 user files.
>         0 kilobytes in extended attributes
>         0 kilobytes in access control lists
>   3036980 kilobytes reserved for system use.
> 19024620812 kilobytes are available for use.
> Filesystem is clean.
> All observed inconsistencies have been repaired.
> Filesystem has been marked clean.
> **** Filesystem was modified. ****
> processing terminated:  6/6/2010 13:09:15  with return code: 0  exit
> code: 0.
>
> Note "logredo failed (rc=-231)" above.... 
>
> This test should be equivalent to pulling the plug on a system and
> then checking the file system. Also, unlike the test below, the above
> was done with no load on the system so there should not have been
> anything in the journal to replay. 
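
If it would help confirm that, jfsutils also ships a jfs_logdump tool
that dumps the journal. Running it on the re-assembled array *before*
the read-write fsck (Phase 9 reformats the log) should show whether the
journal really was empty. I haven't tried it against this exact state,
but the invocation is just:

*jfs_logdump /dev/md11*

(I believe it writes the dump to a file in the current directory rather
than to stdout.)
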
>
> Hope this helps.
>
> Tim 
>
>
> On Jun 6, 2010, at 3:28 AM, Tim Nufire wrote:
>
>> I tested the patch tonight with mixed results...
>>
>> The good news is that it fixed the disk allocation maps bug in
>> jfs_mkfs, and fsck.jfs now runs cleanly on a newly formatted volume.
>>
>> Unfortunately, I'm still seeing the logredo failure... To test this
>> I simulated a RAID failure while my server was under heavy load and
>> then re-assembled the array. As expected, the JFS file system on this
>> array was not closed cleanly... This should leave the filesystem in
>> the same state as a power failure, but it was easier to do remotely :-)
>>
>> Here's the output of fsck showing the rc=-231 error during the
>> logredo....
>>
>> *fsck.jfs -v /dev/md15*
>> fsck.jfs version 1.1.14, 06-Apr-2009
>> processing started: 6/6/2010 3.6.44
>> Using default parameter: -p
>> The current device is:  /dev/md15
>> Open(...READ/WRITE EXCLUSIVE...) returned rc = 0
>> Primary superblock is valid.
>> The type of file system for the device is JFS.
>> Block size in bytes:  4096
>> Filesystem size in blocks:  4756914448
>> **Phase 0 - Replay Journal Log
>> LOGREDO:  Log record for Sync Point at:    0x088276c
>> LOGREDO:  Beginning to update the Inode Allocation Map.
>> LOGREDO:  Done updating the Inode Allocation Map.
>> LOGREDO:  Beginning to update the Block Map.
>> LOGREDO:  Incorrect leaf index detected (k=(d) 0, j=(d) 0, idx=(d) 0)
>> while writing Block Map.
>> LOGREDO:  Write Block Map control page failed in UpdateMaps().
>> LOGREDO:  Unable to update map(s).
>> *logredo failed (rc=-231).  fsck continuing.*
>> **Phase 1 - Check Blocks, Files/Directories, and  Directory Entries
>> .
>> .
>> .
>>
>> Here's the volume information:
>>
>> *jfs_tune -l /dev/md15*
>> jfs_tune version 1.1.14, 06-Apr-2009
>>
>> JFS filesystem superblock:
>>
>> JFS magic number: 'JFS1'
>> JFS version: 1
>> JFS state: mounted
>> JFS flags: JFS_LINUX  JFS_COMMIT  JFS_GROUPCOMMIT  JFS_INLINELOG  
>> Aggregate block size: 4096 bytes
>> Aggregate size: 38053891680 blocks
>> Physical block size: 512 bytes
>> Allocation group size: 67108864 aggregate blocks
>> Log device number: 0x90b
>> Filesystem creation: Sun Jun  6 01:35:43 2010
>> Volume label: 'id-0850422'
>>
>> During this test I also hit a kernel-level JFS bug which I assume is
>> unrelated to the jfsutils patch, but I'm including it anyway just in
>> case the changes to jfs_mkfs somehow caused it...
>>
>> Jun  6 01:50:10 ul085 kernel: [124387.345574] ERROR: (device md11):
>> txAbort
>> Jun  6 01:50:10 ul085 kernel: [124387.394483] BUG at
>> fs/jfs/jfs_txnmgr.c:939 assert(mp->nohomeok > 0)
>> Jun  6 01:50:10 ul085 kernel: [124387.394533] ------------[ cut here
>> ]------------
>> Jun  6 01:50:10 ul085 kernel: [124387.394555] kernel BUG at
>> fs/jfs/jfs_txnmgr.c:939!
>> Jun  6 01:50:10 ul085 kernel: [124387.394576] invalid opcode: 0000
>> [1] SMP 
>> Jun  6 01:50:10 ul085 kernel: [124387.394597] CPU 0 
>> Jun  6 01:50:10 ul085 kernel: [124387.396184] Modules linked in:
>> xt_multiport ipt_LOG xt_tcpudp xt_state iptable_filter ip_tables
>> nf_conntrack_pptp nf_conntrack_proto_dccp nf_conntrack_ftp
>> nf_conntrack_proto_udplite nf_conntrack_netlink nfnetlink nf_nat
>> nf_conntrack_tftp nf_conntrack_irc nf_conntrack_sip xt_conntrack
>> x_tables nf_conntrack_h323 nf_conntrack_proto_sctp ts_kmp
>> nf_conntrack_amanda nf_conntrack_netbios_ns nf_conntrack_proto_gre
>> nf_conntrack_sane nf_conntrack_ipv6 nf_conntrack_ipv4 nf_conntrack
>> ipv6 jfs nls_base fuse coretemp w83627ehf hwmon_vid sbp2 loop
>> serio_raw snd_hda_intel pcspkr i2c_i801 psmouse snd_pcm i2c_core
>> snd_timer snd soundcore snd_page_alloc intel_agp button evdev ext3
>> jbd mbcache raid10 raid456 async_xor async_memcpy async_tx xor raid1
>> raid0 multipath linear md_mod ide_disk ata_generic sd_mod jmicron
>> ata_piix ohci1394 ieee1394 ide_pci_generic ide_core sata_sil24
>> ehci_hcd libata scsi_mod dock e1000e uhci_hcd thermal processor fan
>> thermal_sys [last unloaded: scsi_wait_scan]
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] Pid: 3845, comm:
>> jfsCommit Tainted: G        W 2.6.26-2-amd64 #1
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] RIP:
>> 0010:[<ffffffffa02c498f>]  [<ffffffffa02c498f>] :jfs:txUnlock+0xb2/0x211
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] RSP:
>> 0018:ffff8100c4765e90  EFLAGS: 00010286
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] RAX: 000000000000004b
>> RBX: ffff810017c8dce0 RCX: 000000000117419b
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] RDX: ffff810080a46000
>> RSI: 0000000000000046 RDI: 0000000000000286
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] RBP: ffffc20009901258
>> R08: ffff8100c719ae00 R09: ffff8100c4765a00
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] R10: 0000000000000000
>> R11: 0000010000dd80d0 R12: ffffc200097c23c0
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] R13: ffff8100c719ae00
>> R14: 0000000000b300b3 R15: 0000000000000004
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] FS: 
>> 0000000000000000(0000) GS:ffffffff8053d000(0000) knlGS:0000000000000000
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] CS:  0010 DS: 0018 ES:
>> 0018 CR0: 000000008005003b
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] CR2: 00007fb67ce53000
>> CR3: 00000000c3ccb000 CR4: 00000000000006e0
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] DR0: 0000000000000000
>> DR1: 0000000000000000 DR2: 0000000000000000
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] DR3: 0000000000000000
>> DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] Process jfsCommit (pid:
>> 3845, threadinfo ffff8100c4764000, task ffff8100ca30ab50)
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] Stack: 
>> 0000ffff805a64c0 ffff8100c719ae00 ffffc200097c23c0 0000000000000286
>> Jun  6 01:50:10 ul085 kernel: [124387.396500]  ffff8100cadd1580
>> ffffffff805a64c0 0000000000000004 ffffffffa02c73db
>> Jun  6 01:50:10 ul085 kernel: [124387.396500]  0000000000000000
>> ffff8100ca30ab50 ffffffff8022c108 0000000000100100
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] Call Trace:
>> Jun  6 01:50:10 ul085 kernel: [124387.396500]  [<ffffffffa02c73db>] ?
>> :jfs:jfs_lazycommit+0xfb/0x22e
>> Jun  6 01:50:10 ul085 kernel: [124387.396500]  [<ffffffff8022c108>] ?
>> default_wake_function+0x0/0xe
>> Jun  6 01:50:10 ul085 kernel: [124387.396500]  [<ffffffffa02c72e0>] ?
>> :jfs:jfs_lazycommit+0x0/0x22e
>> Jun  6 01:50:10 ul085 kernel: [124387.396500]  [<ffffffff80245feb>] ?
>> kthread+0x47/0x74
>> Jun  6 01:50:10 ul085 kernel: [124387.396500]  [<ffffffff8023008d>] ?
>> schedule_tail+0x27/0x5c
>> Jun  6 01:50:10 ul085 kernel: [124387.396500]  [<ffffffff8020cf38>] ?
>> child_rip+0xa/0x12
>> Jun  6 01:50:10 ul085 kernel: [124387.396500]  [<ffffffff80245fa4>] ?
>> kthread+0x0/0x74
>> Jun  6 01:50:10 ul085 kernel: [124387.396500]  [<ffffffff8020cf2e>] ?
>> child_rip+0x0/0x12
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] 
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] 
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] Code: d2 ff ff 8b 43 68
>> 85 c0 7f 25 48 c7 c1 03 c0 2c a0 ba ab 03 00 00 48 c7 c6 d3 bf 2c a0
>> 48 c7 c7 e7 bf 2c a0 31 c0 e8 8e 09 f7 df <0f> 0b eb fe ff c8 85 c0
>> 89 43 68 75 09 48 8b 7b 58 e8 00 4a fb
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] RIP 
>> [<ffffffffa02c498f>] :jfs:txUnlock+0xb2/0x211
>> Jun  6 01:50:10 ul085 kernel: [124387.396500]  RSP <ffff8100c4765e90>
>> Jun  6 01:50:10 ul085 kernel: [124387.396500] ---[ end trace
>> 4eaa2a86a8e2da22 ]---
>>
>> Let me know if there is anything else I can provide to help debug
>> this issue.
>>
>> Thanks!
>>
>> Tim
>>
>> On Jun 4, 2010, at 12:06 AM, Sandon Van Ness wrote:
>>
>>> Dave Kleikamp wrote:
>>>> On Tue, 2010-05-25 at 15:15 -0700, Tim Nufire wrote:
>>>>   
>>>>> Dave,
>>>>>
>>>>> Any update on this issue? Every time I run fsck on a volume greater
>>>>> than 12TB I have this problem. 
>>>>>     
>>>>
>>>> I haven't seen this exact failure, but the patch I sent you on the other
>>>> problem may be enough to fix this.  Could you let me know if you're
>>>> still having any problems with the latest code?
>>>>
>>>> Thanks,
>>>> Shaggy
>>>>   
>>>
>>> Tim may not have a system where he can easily test this on demand,
>>> but as soon as my newly set-up RAID array finishes initializing and I
>>> copy all my data to it (about 2 days), I can test this and see
>>> whether logredo fails like it used to on my system.
>>>
>>> This was the only other problem I had with JFS, and it wasn't that
>>> big of a deal for me since I was only forced to do an fsck 5 or 6
>>> times, and fscks only took around 10-11 minutes on my system. I will
>>> reply to this thread with the results (if Tim doesn't before me).
>>
>

