Re: [Ocfs2-users] fsck.ocfs2 not fixing as it outputs errors when checking w/ no flag (-fn) but is clean with yes flag (-fy)
On 3/31/2016 10:37 PM, Junxiao Bi wrote:
> On 04/01/2016 11:20 AM, Jay Vasa wrote:
>> On 3/31/2016 6:36 PM, Herbert van den Bergh wrote:
>>> It seems to me that the reason fsck -fn is reporting errors is that
>>> it isn't replaying the journal:
>>>
>>> ** Skipping journal replay because -n was given. There may be spurious
>>> errors that journal replay would fix. **
>>> ** Skipping slot recovery because -n was given. **
>>>
>>> So there are outstanding changes in the journal that need to be made
>>> to the fs, but fsck -fn skips them. Later it runs into the
>>> inconsistencies that replaying the journal would have cleared.
>>>
>>> fsck -fy does replay the journal, so it doesn't see the
>>> inconsistencies that the replay fixed.
>>>
>>> When you run fsck -fn AFTER fsck -fy, does it still say that it is
>>> skipping journal replay? If so, I wonder why. If not, does it still
>>> report the exact same inode / cluster numbers as the previous time
>>> you ran it? If fsck -fy had to make any changes (including replaying
>>> the journal), run it again, and repeat until it doesn't make any
>>> changes to the filesystem. This is just to make sure it isn't leaving
>>> some inconsistency unfixed. So please do:
>>>
>>> umount (on ALL nodes)
>>> fsck -fy
>>> fsck -fy (if the previous fsck made ANY changes, including replaying
>>> the journal)
>>> fsck -fn (check whether it mentions skipping the journal replay)
>>>
>>> If you still see any errors reported by fsck -fn, are they exactly
>>> the same ones you sent earlier?
>>>
>> This is exactly what I did the first time I ran it. I really don't
>> want to have another downtime doing exactly this again.
> So the "corrupted" ocfs2 volume is online now, does it work well? If
> ocfs2 is really corrupted, I think it will soon fall into a read-only
> fs or panic. If it works well, then maybe fsck.ocfs2 -fn reports the
> corruption wrongly.
>
> Thanks,
> Junxiao.

Yes, the "corrupted" ocfs2 volume is working just fine. It has not fallen
into read-only mode and has not panicked. I am worried, though, that it
will go read-only at some point in the future. Lately I have been
minimizing the load on it, because I am worried about this happening and
there seems to be no way to fix it.

Thanks,
Jay

>> To be precise, I ran exactly this:
>> % umount /dev/drbd2 -- the umount stalled, so I rebooted the node
>> % fsck -fy /dev/drbd2
>>   -- this replayed the journal
>> % fsck -fy /dev/drbd2
>>   -- this did nothing
>> % fsck -fn /dev/drbd2
>>   -- this showed the errors all over again. Yes, exactly the same errors.
>>
>> Look at the bottom of this message, as that is exactly what I ran, and
>> yes, everything was unmounted. This is the only reason I brought up
>> this issue.
>>
>> If you really want me to do this again, I can, but I don't like
>> bringing the filesystem down for another 6 hours for this. I have
>> already tried fsck about 20 times.
>>
>> Thanks,
>> Jay
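Herbert's procedure is mechanical enough to script. A minimal sketch,
assuming fsck.ocfs2 follows the usual fsck exit-status convention (0 =
clean, 1 = errors were corrected; that convention is an assumption here,
not something stated in the thread), using the /dev/drbd2 device from
this thread:

    #!/bin/sh
    # Repeat a repairing fsck until it stops making changes, then verify
    # with a read-only pass. The volume must be unmounted on ALL nodes.
    DEV=/dev/drbd2

    while :; do
        fsck.ocfs2 -fy "$DEV"
        rc=$?
        [ "$rc" -eq 0 ] && break        # nothing left to fix
        [ "$rc" -ne 1 ] && exit "$rc"   # unexpected failure: stop and inspect
    done                                # rc==1: changes were made, run again

    # Read-only verification; it should no longer warn about skipping
    # journal replay, and should report no errors.
    fsck.ocfs2 -fn "$DEV"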
Re: [Ocfs2-users] fsck.ocfs2 not fixing as it outputs errors when checking w/ no flag (-fn) but is clean with yes flag (-fy)
On 3/31/2016 6:28 PM, Junxiao Bi wrote:
> On 04/01/2016 09:21 AM, Jay Vasa wrote:
>> I never ran fsck -fn with the volume mounted. I understand that would
>> cause errors.
>> It has never been mounted whenever I ran any fsck, either -fn or -fy.
>> I was trying to say that it is painful to have to stop production for
>> an fsck which does not fix these errors.
>> Again, it has never been mounted. I am sorry if I wasn't clear in
>> communicating that.
> Interesting, then how about "fsck.ocfs2 -f"? Try it and see whether it
> reports any corruption.

I have already tried this -- I was curious about it too. Running with
just -f gave the same results as -fy. I was hoping to be prompted for
user input so I could type "y", but no.

Thanks,
Jay

> Thanks,
> Junxiao.
>> Jay
Re: [Ocfs2-users] fsck.ocfs2 not fixing as it outputs errors when checking w/ no flag (-fn) but is clean with yes flag (-fy)
Hi,

>> So, we have 2 problems now.
>> What's the matter with fsck?
>> It's very weird :-/
> Yes, very weird. The main issue is that I need to fsck this filesystem.
> I hope Junxiao can help.
>> How did this error happen in the kernel?
>> If there's no solution available right now, none of them is easy if we
>> cannot reproduce it.
>>
>> So, you use nfs on top of ocfs2; here is the related commit:
>> git log -p 6ca497a83
>>
>> And please provide the initial and complete error messages, from as
>> early a time as possible.
>
> I tried this command, but I don't have this repository.
> # git log -p 6ca497a83
> fatal: Not a git repository (or any of the parent directories): .git

Sorry, I meant the kernel source commit. I saw that an Oracle guy has
given a very kind reply. Thanks to him!

Eric

> Can you tell me exactly the commands I need to run to help you out with
> this output? I am using the official RPMs, so I don't have the source
> code, as I believe only Oracle has it.
> Do I need to check out ocfs2-tools from git, and from where?
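To answer the question about commands: the commit id 6ca497a83 refers to
the mainline kernel tree, so inspecting it only needs a kernel clone, not
the ocfs2-tools sources. A sketch (the kernel.org URL below is the usual
mainline mirror, an assumption on my part, not something given in this
thread):

    # Clone the mainline kernel source (a large download) and show the
    # commit with its full patch and message.
    git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    cd linux
    git log -p 6ca497a83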
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi Michael,

Yes, currently the best way is to copy out as much data as possible,
recreate the ocfs2 volume, and then restore the data. I haven't
encountered this issue before and don't know which case can lead to it,
so I'm sorry I can't give you advice on how to avoid it. But I suggest
you keep following the patches in the latest kernel, and apply the
read-only related ones (both ocfs2 and jbd2). We have indeed submitted
several patches to fix read-only issues.

Thanks,
Joseph

On 2016/3/26 0:41, Michael Ulbrich wrote:
> Joseph,
>
> thanks again for your help!
>
> Currently I'm dumping out 4 TB of data from the broken ocfs2 device to
> an external disk. I have shut down the cluster and have the fs mounted
> read-only on a single node. It seems that the data structures are still
> intact and that the file system problems are bound to internal data
> areas (DLM?) which are not in use in the single-node r/o mount use case.
>
> Will create a new ocfs2 device and restore the data later.
>
> Besides taking metadata backups with o2image, is there any advice you
> would give to avoid similar situations in the future?
>
> All the best ... Michael
>
> On 03/25/2016 01:36 AM, Joseph Qi wrote:
>> Hi Michael,
>>
>> On 2016/3/24 21:47, Michael Ulbrich wrote:
>>> Hi Joseph,
>>>
>>> thanks for this information although this does not sound too
>>> optimistic ...
>>>
>>> So, if I understand you correctly, if we had a metadata backup from
>>> o2image _before_ the crash, we could have looked up the missing info
>>> to remove the loop from group chain 73, right?
>> If we have a metadata backup, we can use o2image to restore it back,
>> but this may lose some data.
>>
>>> But how could the loop issue be fixed and, at the same time, the
>>> damage to the data be minimized? There is a recent file-level backup
>>> from which damaged or missing files could be restored later.
>>>
>>>  ##  Block#      Total  Used   Free   Contig  Size
>>> 151  4054438912  15872   2152  13720  10606   1984
>>> 152  4094595072  15872  10753   5119   5119   1984
>>> 153  4090944512  15872   1818  14054   9646   1984  <--
>>> 154  4083643392  15872    571  15301   4914   1984
>>> 155  4510758912  15872   4834  11038   6601   1984
>>> 156  4492506112  15872   6532   9340   5119   1984
>>>
>>> Could you describe a "brute force" way to dd out and edit record #153
>>> to remove the loop and minimize potential loss of data at the same
>>> time? So that fsck would have a chance to complete and fix the
>>> remaining issues?
>> This is dangerous unless we know exactly what info the block should
>> store.
>>
>> My idea is to find the actual block of record #154 and make block
>> 4090944512 of record #153 point to it. This is a bit complicated and
>> should only be done with a deep understanding of the disk layout.
>>
>> I have gone through the fsck.ocfs2 patches and found the following may
>> help:
>> commit efca4b0f2241 (Break a chain loop in group desc)
>> But as you said, you have already upgraded to version 1.8.4. So I'm
>> sorry, currently I don't have a better idea.
>>
>> Thanks,
>> Joseph
>>>
>>> Thanks a lot for your help ... Michael
>>>
>>> On 03/24/2016 02:10 PM, Joseph Qi wrote:
>>>> Hi Michael,
>>>> So I think the block of record #153 goes wrong, which points next to
>>>> block 4083643392 of record #19.
>>>> But the problem is we don't know the right info for the block of
>>>> record #153, otherwise we could dd it out, edit it and then dd it
>>>> back in to fix it.
>>>>
>>>> Thanks,
>>>> Joseph
>>>>
>>>> On 2016/3/24 18:38, Michael Ulbrich wrote:
>>>>> Hi Joseph,
>>>>>
>>>>> ok, got it! Here's the loop in chain 73:
>>>>>
>>>>> Group Chain: 73   Parent Inode: 13   Generation: 1172963971
>>>>> CRC32: ECC:
>>>>>  ##  Block#      Total  Used   Free   Contig  Size
>>>>>   0  4280773632  15872  11487   4385   1774   1984
>>>>>   1  2583263232  15872   5341  10531   5153   1984
>>>>>   2  4543613952  15872   5329  10543   5119   1984
>>>>>   3  4532662272  15872  10753   5119   5119   1984
>>>>>   4  4539963392  15872   3223  12649   7530   1984
>>>>>   5  4536312832  15872   5219  10653   5534   1984
>>>>>   6  4529011712  15872   6047   9825   3359   1984
>>>>>   7  4525361152  15872   4475  11397   5809   1984
>>>>>   8  4521710592  15872   3182  12690   5844   1984
>>>>>   9  4518060032  15872   5881   9991   5131   1984
>>>>>  10  4236966912  15872  10753   5119   5119   1984
>>>>>  11  4098245632  15872  10756   5116   3388   1984
>>>>>  12  4514409472  15872   8826   7046   5119   1984
>>>>>  13  3441144832  15872     15  15857   9680   1984
>>>>>  14  4404892672  15872   7563   8309   5119   1984
>>>>>  15  4233316352  15872   9398   6474   5114   1984
>>>>>  16      448882  15872   6358   9514   5119   1984
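Taking the metadata backup Joseph keeps recommending is cheap compared to
a 4 TB data copy; a minimal sketch, with the device and file names as
placeholders:

    # Dump only the filesystem metadata (no file data) to an image file,
    # ideally with the volume unmounted, then compress it for safekeeping.
    # o2image can later restore the metadata from such an image.
    o2image /dev/drbd1 /backup/drbd1.o2i
    bzip2 /backup/drbd1.o2i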
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Joseph,

thanks again for your help!

Currently I'm dumping out 4 TB of data from the broken ocfs2 device to an
external disk. I have shut down the cluster and have the fs mounted
read-only on a single node. It seems that the data structures are still
intact and that the file system problems are bound to internal data areas
(DLM?) which are not in use in the single-node r/o mount use case.

Will create a new ocfs2 device and restore the data later.

Besides taking metadata backups with o2image, is there any advice you
would give to avoid similar situations in the future?

All the best ... Michael

On 03/25/2016 01:36 AM, Joseph Qi wrote:
> Hi Michael,
>
> On 2016/3/24 21:47, Michael Ulbrich wrote:
>> Hi Joseph,
>>
>> thanks for this information although this does not sound too
>> optimistic ...
>>
>> So, if I understand you correctly, if we had a metadata backup from
>> o2image _before_ the crash, we could have looked up the missing info
>> to remove the loop from group chain 73, right?
> If we have a metadata backup, we can use o2image to restore it back,
> but this may lose some data.
>
>> But how could the loop issue be fixed and, at the same time, the
>> damage to the data be minimized? There is a recent file-level backup
>> from which damaged or missing files could be restored later.
>>
>>  ##  Block#      Total  Used   Free   Contig  Size
>> 151  4054438912  15872   2152  13720  10606   1984
>> 152  4094595072  15872  10753   5119   5119   1984
>> 153  4090944512  15872   1818  14054   9646   1984  <--
>> 154  4083643392  15872    571  15301   4914   1984
>> 155  4510758912  15872   4834  11038   6601   1984
>> 156  4492506112  15872   6532   9340   5119   1984
>>
>> Could you describe a "brute force" way to dd out and edit record #153
>> to remove the loop and minimize potential loss of data at the same
>> time? So that fsck would have a chance to complete and fix the
>> remaining issues?
> This is dangerous unless we know exactly what info the block should
> store.
>
> My idea is to find the actual block of record #154 and make block
> 4090944512 of record #153 point to it. This is a bit complicated and
> should only be done with a deep understanding of the disk layout.
>
> I have gone through the fsck.ocfs2 patches and found the following may
> help:
> commit efca4b0f2241 (Break a chain loop in group desc)
> But as you said, you have already upgraded to version 1.8.4. So I'm
> sorry, currently I don't have a better idea.
>
> Thanks,
> Joseph
>>
>> Thanks a lot for your help ... Michael
>>
>> On 03/24/2016 02:10 PM, Joseph Qi wrote:
>>> Hi Michael,
>>> So I think the block of record #153 goes wrong, which points next to
>>> block 4083643392 of record #19.
>>> But the problem is we don't know the right info for the block of
>>> record #153, otherwise we could dd it out, edit it and then dd it
>>> back in to fix it.
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On 2016/3/24 18:38, Michael Ulbrich wrote:
>>>> Hi Joseph,
>>>>
>>>> ok, got it! Here's the loop in chain 73:
>>>>
>>>> Group Chain: 73   Parent Inode: 13   Generation: 1172963971
>>>> CRC32: ECC:
>>>>  ##  Block#      Total  Used   Free   Contig  Size
>>>>   0  4280773632  15872  11487   4385   1774   1984
>>>>   1  2583263232  15872   5341  10531   5153   1984
>>>>   2  4543613952  15872   5329  10543   5119   1984
>>>>   3  4532662272  15872  10753   5119   5119   1984
>>>>   4  4539963392  15872   3223  12649   7530   1984
>>>>   5  4536312832  15872   5219  10653   5534   1984
>>>>   6  4529011712  15872   6047   9825   3359   1984
>>>>   7  4525361152  15872   4475  11397   5809   1984
>>>>   8  4521710592  15872   3182  12690   5844   1984
>>>>   9  4518060032  15872   5881   9991   5131   1984
>>>>  10  4236966912  15872  10753   5119   5119   1984
>>>>  11  4098245632  15872  10756   5116   3388   1984
>>>>  12  4514409472  15872   8826   7046   5119   1984
>>>>  13  3441144832  15872     15  15857   9680   1984
>>>>  14  4404892672  15872   7563   8309   5119   1984
>>>>  15  4233316352  15872   9398   6474   5114   1984
>>>>  16      448882  15872   6358   9514   5119   1984
>>>>  17  3901115392  15872   9932   5940   3757   1984
>>>>  18  4507108352  15872   6557   9315   6166   1984
>>>>  19  4083643392  15872    571  15301   4914   1984  <--
>>>>  20  4510758912  15872   4834  11038   6601   1984
>>>>  21  4492506112  15872   6532   9340   5119   1984
>>>>  22  4496156672  15872  10753   5119   5119   1984
>>>>  23  4503457792  15872  10718   5154   5119   1984
>>>>  ...
>>>> 154  4083643392  15872    571  15301   4914   1984  <--
>>>> 155  4510758912  15872   4834  11038   6601   1984
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi Michael,

On 2016/3/24 21:47, Michael Ulbrich wrote:
> Hi Joseph,
>
> thanks for this information although this does not sound too
> optimistic ...
>
> So, if I understand you correctly, if we had a metadata backup from
> o2image _before_ the crash, we could have looked up the missing info to
> remove the loop from group chain 73, right?
If we have a metadata backup, we can use o2image to restore it back, but
this may lose some data.

> But how could the loop issue be fixed and, at the same time, the damage
> to the data be minimized? There is a recent file-level backup from
> which damaged or missing files could be restored later.
>
>  ##  Block#      Total  Used   Free   Contig  Size
> 151  4054438912  15872   2152  13720  10606   1984
> 152  4094595072  15872  10753   5119   5119   1984
> 153  4090944512  15872   1818  14054   9646   1984  <--
> 154  4083643392  15872    571  15301   4914   1984
> 155  4510758912  15872   4834  11038   6601   1984
> 156  4492506112  15872   6532   9340   5119   1984
>
> Could you describe a "brute force" way to dd out and edit record #153
> to remove the loop and minimize potential loss of data at the same
> time? So that fsck would have a chance to complete and fix the
> remaining issues?
This is dangerous unless we know exactly what info the block should
store.

My idea is to find the actual block of record #154 and make block
4090944512 of record #153 point to it. This is a bit complicated and
should only be done with a deep understanding of the disk layout.

I have gone through the fsck.ocfs2 patches and found the following may
help:
commit efca4b0f2241 (Break a chain loop in group desc)
But as you said, you have already upgraded to version 1.8.4. So I'm
sorry, currently I don't have a better idea.

Thanks,
Joseph

> Thanks a lot for your help ... Michael
>
> On 03/24/2016 02:10 PM, Joseph Qi wrote:
>> Hi Michael,
>> So I think the block of record #153 goes wrong, which points next to
>> block 4083643392 of record #19.
>> But the problem is we don't know the right info for the block of
>> record #153, otherwise we could dd it out, edit it and then dd it back
>> in to fix it.
>>
>> Thanks,
>> Joseph
>>
>> On 2016/3/24 18:38, Michael Ulbrich wrote:
>>> Hi Joseph,
>>>
>>> ok, got it! Here's the loop in chain 73:
>>>
>>> Group Chain: 73   Parent Inode: 13   Generation: 1172963971
>>> CRC32: ECC:
>>>  ##  Block#      Total  Used   Free   Contig  Size
>>>   0  4280773632  15872  11487   4385   1774   1984
>>>   1  2583263232  15872   5341  10531   5153   1984
>>>   2  4543613952  15872   5329  10543   5119   1984
>>>   3  4532662272  15872  10753   5119   5119   1984
>>>   4  4539963392  15872   3223  12649   7530   1984
>>>   5  4536312832  15872   5219  10653   5534   1984
>>>   6  4529011712  15872   6047   9825   3359   1984
>>>   7  4525361152  15872   4475  11397   5809   1984
>>>   8  4521710592  15872   3182  12690   5844   1984
>>>   9  4518060032  15872   5881   9991   5131   1984
>>>  10  4236966912  15872  10753   5119   5119   1984
>>>  11  4098245632  15872  10756   5116   3388   1984
>>>  12  4514409472  15872   8826   7046   5119   1984
>>>  13  3441144832  15872     15  15857   9680   1984
>>>  14  4404892672  15872   7563   8309   5119   1984
>>>  15  4233316352  15872   9398   6474   5114   1984
>>>  16      448882  15872   6358   9514   5119   1984
>>>  17  3901115392  15872   9932   5940   3757   1984
>>>  18  4507108352  15872   6557   9315   6166   1984
>>>  19  4083643392  15872    571  15301   4914   1984  <--
>>>  20  4510758912  15872   4834  11038   6601   1984
>>>  21  4492506112  15872   6532   9340   5119   1984
>>>  22  4496156672  15872  10753   5119   5119   1984
>>>  23  4503457792  15872  10718   5154   5119   1984
>>>  ...
>>> 154  4083643392  15872    571  15301   4914   1984  <--
>>> 155  4510758912  15872   4834  11038   6601   1984
>>> 156  4492506112  15872   6532   9340   5119   1984
>>> 157  4496156672  15872  10753   5119   5119   1984
>>> 158  4503457792  15872  10718   5154   5119   1984
>>>  ...
>>> 289  4083643392  15872    571  15301   4914   1984  <--
>>> 290  4510758912  15872   4834  11038   6601   1984
>>> 291  4492506112  15872   6532   9340   5119   1984
>>> 292  4496156672  15872  10753   5119   5119   1984
>>> 293  4503457792  15872  10718   5154   5119   1984
>>>
>>> etc.
>>>
>>> So the loop begins at record #154 and spans 135 records, right?
>>>
>>> Will backup fs metadata as soon as I have some external storage at
>>> hand.
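For the "dd out" half of that idea, reading a single group descriptor for
offline inspection is safe and simple given the 2K block size of this
filesystem; a read-only sketch using record #153's block number from the
table above:

    # Block size is 2048 bytes on this fs; extract one group descriptor
    # block for offline inspection. This only reads the device; writing
    # an edited block back with dd should be a last resort, and only on
    # a volume whose metadata has been backed up.
    dd if=/dev/drbd1 of=gd_record153.bin bs=2048 skip=4090944512 count=1
    hexdump -C gd_record153.bin | head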
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi Joseph,

thanks for this information although this does not sound too
optimistic ...

So, if I understand you correctly, if we had a metadata backup from
o2image _before_ the crash, we could have looked up the missing info to
remove the loop from group chain 73, right?

But how could the loop issue be fixed and, at the same time, the damage
to the data be minimized? There is a recent file-level backup from which
damaged or missing files could be restored later.

 ##  Block#      Total  Used   Free   Contig  Size
151  4054438912  15872   2152  13720  10606   1984
152  4094595072  15872  10753   5119   5119   1984
153  4090944512  15872   1818  14054   9646   1984  <--
154  4083643392  15872    571  15301   4914   1984
155  4510758912  15872   4834  11038   6601   1984
156  4492506112  15872   6532   9340   5119   1984

Could you describe a "brute force" way to dd out and edit record #153 to
remove the loop and minimize potential loss of data at the same time? So
that fsck would have a chance to complete and fix the remaining issues?

Thanks a lot for your help ... Michael

On 03/24/2016 02:10 PM, Joseph Qi wrote:
> Hi Michael,
> So I think the block of record #153 goes wrong, which points next to
> block 4083643392 of record #19.
> But the problem is we don't know the right info for the block of record
> #153, otherwise we could dd it out, edit it and then dd it back in to
> fix it.
>
> Thanks,
> Joseph
>
> On 2016/3/24 18:38, Michael Ulbrich wrote:
>> Hi Joseph,
>>
>> ok, got it! Here's the loop in chain 73:
>>
>> Group Chain: 73   Parent Inode: 13   Generation: 1172963971
>> CRC32: ECC:
>>  ##  Block#      Total  Used   Free   Contig  Size
>>   0  4280773632  15872  11487   4385   1774   1984
>>   1  2583263232  15872   5341  10531   5153   1984
>>   2  4543613952  15872   5329  10543   5119   1984
>>   3  4532662272  15872  10753   5119   5119   1984
>>   4  4539963392  15872   3223  12649   7530   1984
>>   5  4536312832  15872   5219  10653   5534   1984
>>   6  4529011712  15872   6047   9825   3359   1984
>>   7  4525361152  15872   4475  11397   5809   1984
>>   8  4521710592  15872   3182  12690   5844   1984
>>   9  4518060032  15872   5881   9991   5131   1984
>>  10  4236966912  15872  10753   5119   5119   1984
>>  11  4098245632  15872  10756   5116   3388   1984
>>  12  4514409472  15872   8826   7046   5119   1984
>>  13  3441144832  15872     15  15857   9680   1984
>>  14  4404892672  15872   7563   8309   5119   1984
>>  15  4233316352  15872   9398   6474   5114   1984
>>  16      448882  15872   6358   9514   5119   1984
>>  17  3901115392  15872   9932   5940   3757   1984
>>  18  4507108352  15872   6557   9315   6166   1984
>>  19  4083643392  15872    571  15301   4914   1984  <--
>>  20  4510758912  15872   4834  11038   6601   1984
>>  21  4492506112  15872   6532   9340   5119   1984
>>  22  4496156672  15872  10753   5119   5119   1984
>>  23  4503457792  15872  10718   5154   5119   1984
>>  ...
>> 154  4083643392  15872    571  15301   4914   1984  <--
>> 155  4510758912  15872   4834  11038   6601   1984
>> 156  4492506112  15872   6532   9340   5119   1984
>> 157  4496156672  15872  10753   5119   5119   1984
>> 158  4503457792  15872  10718   5154   5119   1984
>>  ...
>> 289  4083643392  15872    571  15301   4914   1984  <--
>> 290  4510758912  15872   4834  11038   6601   1984
>> 291  4492506112  15872   6532   9340   5119   1984
>> 292  4496156672  15872  10753   5119   5119   1984
>> 293  4503457792  15872  10718   5154   5119   1984
>>
>> etc.
>>
>> So the loop begins at record #154 and spans 135 records, right?
>>
>> Will backup fs metadata as soon as I have some external storage at
>> hand.
>>
>> Thanks a lot so far ... Michael
>>
>> On 03/24/2016 10:41 AM, Joseph Qi wrote:
>>> Hi Michael,
>>> It seems that the dead loop happens in chain 73. You have formatted
>>> using a 2K block and 4K cluster, so each chain should have 1522 or
>>> 1521 records. But at first glance, I cannot figure out which block
>>> goes wrong, because the output you pasted indicates all blocks are
>>> different. So I suggest you investigate all the blocks which belong
>>> to chain 73 and try to find out if there is a loop there.
>>> BTW, have you backed up the metadata using o2image?
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On 2016/3/24 16:40, Michael Ulbrich wrote:
>>>> Hi Joseph,
>>>>
>>>> thanks a lot for your help. It is very much appreciated!
>>>>
>>>> I ran debugfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file
>>>> system:
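Scanning a 1500-record chain for a repeat by eye is error-prone. A hedged
sketch that automates it: pull the record lines for chain 73 out of the
debugfs output and report the first Block# seen twice (the awk assumes
the column layout shown above, with the record number in field 1 and
Block# in field 2):

    # Detect the first repeated group-descriptor block in chain 73.
    debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 2>/dev/null |
    awk '/^Group Chain: 73 / { in73 = 1; next }
         /^Group Chain: /    { in73 = 0 }
         in73 && $1 ~ /^[0-9]+$/ && NF >= 7 {
             if ($2 in seen) {
                 printf "record %d revisits block %s (first seen at record %d)\n",
                        $1, $2, seen[$2]
                 exit
             }
             seen[$2] = $1
         }'

On the output above, this would report record 154 revisiting block
4083643392, first seen at record 19 -- a loop with a period of 135
records.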
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi Michael,
So I think the block of record #153 goes wrong, which points next to
block 4083643392 of record #19.
But the problem is we don't know the right info for the block of record
#153, otherwise we could dd it out, edit it and then dd it back in to
fix it.

Thanks,
Joseph

On 2016/3/24 18:38, Michael Ulbrich wrote:
> Hi Joseph,
>
> ok, got it! Here's the loop in chain 73:
>
> Group Chain: 73   Parent Inode: 13   Generation: 1172963971
> CRC32: ECC:
>  ##  Block#      Total  Used   Free   Contig  Size
>   0  4280773632  15872  11487   4385   1774   1984
>   1  2583263232  15872   5341  10531   5153   1984
>   2  4543613952  15872   5329  10543   5119   1984
>   3  4532662272  15872  10753   5119   5119   1984
>   4  4539963392  15872   3223  12649   7530   1984
>   5  4536312832  15872   5219  10653   5534   1984
>   6  4529011712  15872   6047   9825   3359   1984
>   7  4525361152  15872   4475  11397   5809   1984
>   8  4521710592  15872   3182  12690   5844   1984
>   9  4518060032  15872   5881   9991   5131   1984
>  10  4236966912  15872  10753   5119   5119   1984
>  11  4098245632  15872  10756   5116   3388   1984
>  12  4514409472  15872   8826   7046   5119   1984
>  13  3441144832  15872     15  15857   9680   1984
>  14  4404892672  15872   7563   8309   5119   1984
>  15  4233316352  15872   9398   6474   5114   1984
>  16      448882  15872   6358   9514   5119   1984
>  17  3901115392  15872   9932   5940   3757   1984
>  18  4507108352  15872   6557   9315   6166   1984
>  19  4083643392  15872    571  15301   4914   1984  <--
>  20  4510758912  15872   4834  11038   6601   1984
>  21  4492506112  15872   6532   9340   5119   1984
>  22  4496156672  15872  10753   5119   5119   1984
>  23  4503457792  15872  10718   5154   5119   1984
>  ...
> 154  4083643392  15872    571  15301   4914   1984  <--
> 155  4510758912  15872   4834  11038   6601   1984
> 156  4492506112  15872   6532   9340   5119   1984
> 157  4496156672  15872  10753   5119   5119   1984
> 158  4503457792  15872  10718   5154   5119   1984
>  ...
> 289  4083643392  15872    571  15301   4914   1984  <--
> 290  4510758912  15872   4834  11038   6601   1984
> 291  4492506112  15872   6532   9340   5119   1984
> 292  4496156672  15872  10753   5119   5119   1984
> 293  4503457792  15872  10718   5154   5119   1984
>
> etc.
>
> So the loop begins at record #154 and spans 135 records, right?
>
> Will backup fs metadata as soon as I have some external storage at
> hand.
>
> Thanks a lot so far ... Michael
>
> On 03/24/2016 10:41 AM, Joseph Qi wrote:
>> Hi Michael,
>> It seems that the dead loop happens in chain 73. You have formatted
>> using a 2K block and 4K cluster, so each chain should have 1522 or
>> 1521 records. But at first glance, I cannot figure out which block
>> goes wrong, because the output you pasted indicates all blocks are
>> different. So I suggest you investigate all the blocks which belong to
>> chain 73 and try to find out if there is a loop there.
>> BTW, have you backed up the metadata using o2image?
>>
>> Thanks,
>> Joseph
>>
>> On 2016/3/24 16:40, Michael Ulbrich wrote:
>>> Hi Joseph,
>>>
>>> thanks a lot for your help. It is very much appreciated!
>>>
>>> I ran debugfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file system:
>>>
>>> root@s1a:~# debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 > debugfs_drbd1.log 2>&1
>>>
>>> Inode: 13   Mode: 0644   Generation: 1172963971 (0x45ea0283)
>>> FS Generation: 1172963971 (0x45ea0283)
>>> CRC32: ECC:
>>> Type: Regular   Attr: 0x0   Flags: Valid System Allocbitmap Chain
>>> Dynamic Features: (0x0)
>>> User: 0 (root)   Group: 0 (root)   Size: 11381315956736
>>> Links: 1   Clusters: 2778641591
>>> ctime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>>> atime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>>> mtime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>>> dtime: 0x0 -- Thu Jan 1 01:00:00 1970
>>> ctime_nsec: 0x -- 0
>>> atime_nsec: 0x -- 0
>>> mtime_nsec: 0x -- 0
>>> Refcount Block: 0
>>> Last Extblk: 0   Orphan Slot: 0
>>> Sub Alloc Slot: Global   Sub Alloc Bit: 7
>>> Bitmap Total: 2778641591   Used: 1083108631   Free: 1695532960
>>> Clusters per Group: 15872   Bits per Cluster: 1
>>> Count: 115   Next Free Rec: 115
>>>
>>>  ##  Total     Used     Free      Block#
>>>   0  24173056  9429318  14743738  4533995520
>>>   1  24173056  9421663  14751393  4548629504
>>>   2  24173056  9432421  14740635  4588817408
>>>   3  24173056  9427533  14745523  4548692992
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi Joseph,

ok, got it! Here's the loop in chain 73:

Group Chain: 73   Parent Inode: 13   Generation: 1172963971
CRC32: ECC:
 ##  Block#      Total  Used   Free   Contig  Size
  0  4280773632  15872  11487   4385   1774   1984
  1  2583263232  15872   5341  10531   5153   1984
  2  4543613952  15872   5329  10543   5119   1984
  3  4532662272  15872  10753   5119   5119   1984
  4  4539963392  15872   3223  12649   7530   1984
  5  4536312832  15872   5219  10653   5534   1984
  6  4529011712  15872   6047   9825   3359   1984
  7  4525361152  15872   4475  11397   5809   1984
  8  4521710592  15872   3182  12690   5844   1984
  9  4518060032  15872   5881   9991   5131   1984
 10  4236966912  15872  10753   5119   5119   1984
 11  4098245632  15872  10756   5116   3388   1984
 12  4514409472  15872   8826   7046   5119   1984
 13  3441144832  15872     15  15857   9680   1984
 14  4404892672  15872   7563   8309   5119   1984
 15  4233316352  15872   9398   6474   5114   1984
 16      448882  15872   6358   9514   5119   1984
 17  3901115392  15872   9932   5940   3757   1984
 18  4507108352  15872   6557   9315   6166   1984
 19  4083643392  15872    571  15301   4914   1984  <--
 20  4510758912  15872   4834  11038   6601   1984
 21  4492506112  15872   6532   9340   5119   1984
 22  4496156672  15872  10753   5119   5119   1984
 23  4503457792  15872  10718   5154   5119   1984
 ...
154  4083643392  15872    571  15301   4914   1984  <--
155  4510758912  15872   4834  11038   6601   1984
156  4492506112  15872   6532   9340   5119   1984
157  4496156672  15872  10753   5119   5119   1984
158  4503457792  15872  10718   5154   5119   1984
 ...
289  4083643392  15872    571  15301   4914   1984  <--
290  4510758912  15872   4834  11038   6601   1984
291  4492506112  15872   6532   9340   5119   1984
292  4496156672  15872  10753   5119   5119   1984
293  4503457792  15872  10718   5154   5119   1984

etc.

So the loop begins at record #154 and spans 135 records, right?

Will backup fs metadata as soon as I have some external storage at hand.

Thanks a lot so far ... Michael

On 03/24/2016 10:41 AM, Joseph Qi wrote:
> Hi Michael,
> It seems that the dead loop happens in chain 73. You have formatted
> using a 2K block and 4K cluster, so each chain should have 1522 or 1521
> records. But at first glance, I cannot figure out which block goes
> wrong, because the output you pasted indicates all blocks are
> different. So I suggest you investigate all the blocks which belong to
> chain 73 and try to find out if there is a loop there.
> BTW, have you backed up the metadata using o2image?
>
> Thanks,
> Joseph
>
> On 2016/3/24 16:40, Michael Ulbrich wrote:
>> Hi Joseph,
>>
>> thanks a lot for your help. It is very much appreciated!
>>
>> I ran debugfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file system:
>>
>> root@s1a:~# debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 > debugfs_drbd1.log 2>&1
>>
>> Inode: 13   Mode: 0644   Generation: 1172963971 (0x45ea0283)
>> FS Generation: 1172963971 (0x45ea0283)
>> CRC32: ECC:
>> Type: Regular   Attr: 0x0   Flags: Valid System Allocbitmap Chain
>> Dynamic Features: (0x0)
>> User: 0 (root)   Group: 0 (root)   Size: 11381315956736
>> Links: 1   Clusters: 2778641591
>> ctime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>> atime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>> mtime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>> dtime: 0x0 -- Thu Jan 1 01:00:00 1970
>> ctime_nsec: 0x -- 0
>> atime_nsec: 0x -- 0
>> mtime_nsec: 0x -- 0
>> Refcount Block: 0
>> Last Extblk: 0   Orphan Slot: 0
>> Sub Alloc Slot: Global   Sub Alloc Bit: 7
>> Bitmap Total: 2778641591   Used: 1083108631   Free: 1695532960
>> Clusters per Group: 15872   Bits per Cluster: 1
>> Count: 115   Next Free Rec: 115
>>
>>  ##  Total     Used     Free      Block#
>>   0  24173056  9429318  14743738  4533995520
>>   1  24173056  9421663  14751393  4548629504
>>   2  24173056  9432421  14740635  4588817408
>>   3  24173056  9427533  14745523  4548692992
>>   4  24173056  9433978  14739078  4508568576
>>   5  24173056  9436974  14736082  4636369920
>>   6  24173056  9428411  14744645  4563390464
>>   7  24173056  9426950  14746106  4479459328
>>   8  24173056  9428099  14744957  4548851712
>>   9  24173056  9431794  14741262  4585389056
>>  ...
>> 105  24157184  9414241  14742943  4690652160
>> 106  24157184  9419715  14737469  4467999744
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi Michael,
It seems that the dead loop happens in chain 73. You have formatted using
a 2K block and 4K cluster, so each chain should have 1522 or 1521
records. But at first glance, I cannot figure out which block goes wrong,
because the output you pasted indicates all blocks are different. So I
suggest you investigate all the blocks which belong to chain 73 and try
to find out if there is a loop there.
BTW, have you backed up the metadata using o2image?

Thanks,
Joseph

On 2016/3/24 16:40, Michael Ulbrich wrote:
> Hi Joseph,
>
> thanks a lot for your help. It is very much appreciated!
>
> I ran debugfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file system:
>
> root@s1a:~# debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 > debugfs_drbd1.log 2>&1
>
> Inode: 13   Mode: 0644   Generation: 1172963971 (0x45ea0283)
> FS Generation: 1172963971 (0x45ea0283)
> CRC32: ECC:
> Type: Regular   Attr: 0x0   Flags: Valid System Allocbitmap Chain
> Dynamic Features: (0x0)
> User: 0 (root)   Group: 0 (root)   Size: 11381315956736
> Links: 1   Clusters: 2778641591
> ctime: 0x54010183 -- Sat Aug 30 00:41:07 2014
> atime: 0x54010183 -- Sat Aug 30 00:41:07 2014
> mtime: 0x54010183 -- Sat Aug 30 00:41:07 2014
> dtime: 0x0 -- Thu Jan 1 01:00:00 1970
> ctime_nsec: 0x -- 0
> atime_nsec: 0x -- 0
> mtime_nsec: 0x -- 0
> Refcount Block: 0
> Last Extblk: 0   Orphan Slot: 0
> Sub Alloc Slot: Global   Sub Alloc Bit: 7
> Bitmap Total: 2778641591   Used: 1083108631   Free: 1695532960
> Clusters per Group: 15872   Bits per Cluster: 1
> Count: 115   Next Free Rec: 115
>
>  ##  Total     Used     Free      Block#
>   0  24173056  9429318  14743738  4533995520
>   1  24173056  9421663  14751393  4548629504
>   2  24173056  9432421  14740635  4588817408
>   3  24173056  9427533  14745523  4548692992
>   4  24173056  9433978  14739078  4508568576
>   5  24173056  9436974  14736082  4636369920
>   6  24173056  9428411  14744645  4563390464
>   7  24173056  9426950  14746106  4479459328
>   8  24173056  9428099  14744957  4548851712
>   9  24173056  9431794  14741262  4585389056
>  ...
> 105  24157184  9414241  14742943  4690652160
> 106  24157184  9419715  14737469  4467999744
> 107  24157184  9411479  14745705  4431525888
> 108  24157184  9413235  14743949  4559327232
> 109  24157184  9417948  14739236  4500950016
> 110  24157184  9411013  14746171  4566691840
> 111  24157184  9421252  14735932  4522916864
> 112  24157184  9416726  14740458  4537550848
> 113  24157184  9415358  14741826  4676303872
> 114  24157184  9420448  14736736  4526662656
>
> Group Chain: 0   Parent Inode: 13   Generation: 1172963971
> CRC32: ECC:
>   ##  Block#      Total  Used   Free   Contig  Size
>    0  4533995520  15872   6339   9533   3987   1984
>    1  4530344960  15872  10755   5117   5117   1984
>    2  2997109760  15872  10753   5119   5119   1984
>    3  4526694400  15872  10753   5119   5119   1984
>    4  3022663680  15872  10753   5119   5119   1984
>    5  4512092160  15872   9043   6829   2742   1984
>    6  4523043840  15872   4948  10924   9612   1984
>    7  4519393280  15872   6150   9722   5595   1984
>    8  4515742720  15872   4323  11549   6603   1984
>    9  3771028480  15872  10753   5119   5119   1984
>  ...
> 1513  5523297280  15872      1  15871  15871   1984
> 1514  5526947840  15872      1  15871  15871   1984
> 1515  5530598400  15872      1  15871  15871   1984
> 1516  5534248960  15872      1  15871  15871   1984
> 1517  5537899520  15872      1  15871  15871   1984
> 1518  5541550080  15872      1  15871  15871   1984
> 1519  5545200640  15872      1  15871  15871   1984
> 1520  5548851200  15872      1  15871  15871   1984
> 1521  5552501760  15872      1  15871  15871   1984
> 1522  5556152320  15872      1  15871  15871   1984
>
> Group Chain: 1   Parent Inode: 13   Generation: 1172963971
> CRC32: ECC:
>   ##  Block#      Total  Used   Free   Contig  Size
>    0  4548629504  15872  10755   5117   2496   1984
>    1  2993490944  15872     59  15813  14451   1984
>    2  2489713664  15872  10758   5114   3726   1984
>    3  3117609984  15872   3958  11914   6165   1984
>    4  2544472064  15872  10753   5119   5119   1984
>    5  3040948224  15872  10753   5119   5119   1984
>    6  2971587584  15872  10753   5119   5119   1984
>    7  4493871104  15872   8664   7208   3705   1984
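A quick sanity check on the expected chain lengths, using only numbers
already in the stat output quoted above: each record covers one block
group of "Clusters per Group" (15872) clusters, so a chain's Total
divided by 15872 gives its record count:

    # Records per chain = chain Total / clusters per group.
    echo $(( 24173056 / 15872 ))    # chain 0 (Total 24173056):        1523 records
    echo $(( 24157184 / 15872 ))    # chains 105-114 (Total 24157184): 1522 records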
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi Joseph,

thanks a lot for your help. It is very much appreciated!

I ran debugfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file system:

root@s1a:~# debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 > debugfs_drbd1.log 2>&1

Inode: 13   Mode: 0644   Generation: 1172963971 (0x45ea0283)
FS Generation: 1172963971 (0x45ea0283)
CRC32: ECC:
Type: Regular   Attr: 0x0   Flags: Valid System Allocbitmap Chain
Dynamic Features: (0x0)
User: 0 (root)   Group: 0 (root)   Size: 11381315956736
Links: 1   Clusters: 2778641591
ctime: 0x54010183 -- Sat Aug 30 00:41:07 2014
atime: 0x54010183 -- Sat Aug 30 00:41:07 2014
mtime: 0x54010183 -- Sat Aug 30 00:41:07 2014
dtime: 0x0 -- Thu Jan 1 01:00:00 1970
ctime_nsec: 0x -- 0
atime_nsec: 0x -- 0
mtime_nsec: 0x -- 0
Refcount Block: 0
Last Extblk: 0   Orphan Slot: 0
Sub Alloc Slot: Global   Sub Alloc Bit: 7
Bitmap Total: 2778641591   Used: 1083108631   Free: 1695532960
Clusters per Group: 15872   Bits per Cluster: 1
Count: 115   Next Free Rec: 115

 ##  Total     Used     Free      Block#
  0  24173056  9429318  14743738  4533995520
  1  24173056  9421663  14751393  4548629504
  2  24173056  9432421  14740635  4588817408
  3  24173056  9427533  14745523  4548692992
  4  24173056  9433978  14739078  4508568576
  5  24173056  9436974  14736082  4636369920
  6  24173056  9428411  14744645  4563390464
  7  24173056  9426950  14746106  4479459328
  8  24173056  9428099  14744957  4548851712
  9  24173056  9431794  14741262  4585389056
 ...
105  24157184  9414241  14742943  4690652160
106  24157184  9419715  14737469  4467999744
107  24157184  9411479  14745705  4431525888
108  24157184  9413235  14743949  4559327232
109  24157184  9417948  14739236  4500950016
110  24157184  9411013  14746171  4566691840
111  24157184  9421252  14735932  4522916864
112  24157184  9416726  14740458  4537550848
113  24157184  9415358  14741826  4676303872
114  24157184  9420448  14736736  4526662656

Group Chain: 0   Parent Inode: 13   Generation: 1172963971
CRC32: ECC:
  ##  Block#      Total  Used   Free   Contig  Size
   0  4533995520  15872   6339   9533   3987   1984
   1  4530344960  15872  10755   5117   5117   1984
   2  2997109760  15872  10753   5119   5119   1984
   3  4526694400  15872  10753   5119   5119   1984
   4  3022663680  15872  10753   5119   5119   1984
   5  4512092160  15872   9043   6829   2742   1984
   6  4523043840  15872   4948  10924   9612   1984
   7  4519393280  15872   6150   9722   5595   1984
   8  4515742720  15872   4323  11549   6603   1984
   9  3771028480  15872  10753   5119   5119   1984
 ...
1513  5523297280  15872      1  15871  15871   1984
1514  5526947840  15872      1  15871  15871   1984
1515  5530598400  15872      1  15871  15871   1984
1516  5534248960  15872      1  15871  15871   1984
1517  5537899520  15872      1  15871  15871   1984
1518  5541550080  15872      1  15871  15871   1984
1519  5545200640  15872      1  15871  15871   1984
1520  5548851200  15872      1  15871  15871   1984
1521  5552501760  15872      1  15871  15871   1984
1522  5556152320  15872      1  15871  15871   1984

Group Chain: 1   Parent Inode: 13   Generation: 1172963971
CRC32: ECC:
  ##  Block#      Total  Used   Free   Contig  Size
   0  4548629504  15872  10755   5117   2496   1984
   1  2993490944  15872     59  15813  14451   1984
   2  2489713664  15872  10758   5114   3726   1984
   3  3117609984  15872   3958  11914   6165   1984
   4  2544472064  15872  10753   5119   5119   1984
   5  3040948224  15872  10753   5119   5119   1984
   6  2971587584  15872  10753   5119   5119   1984
   7  4493871104  15872   8664   7208   3705   1984
   8  4544978944  15872   8711   7161   2919   1984
   9  4417209344  15872   3253  12619   6447   1984
 ...
1513  5523329024  15872      1  15871  15871   1984
1514  5526979584  15872      1  15871  15871   1984
1515  5530630144  15872      1  15871  15871   1984
1516  5534280704  15872      1  15871  15871   1984
1517  5537931264  15872      1  15871  15871   1984
1518  5541581824  15872      1  15871  15871   1984
1519  5545232384  15872      1  15871  15871   1984
1520  5548882944  15872      1  15871  15871   1984
1521  5552533504
Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi Michael,
Could you please use debugfs to check the output?
# debugfs.ocfs2 -R 'stat //global_bitmap'

Thanks,
Joseph

On 2016/3/24 6:38, Michael Ulbrich wrote:
> Hi ocfs2-users,
>
> my first post to this list from yesterday probably didn't get through.
>
> Anyway, I've made some progress in the meantime and can now ask more
> specific questions ...
>
> I'm having issues with an 11 TB ocfs2 shared filesystem on Debian
> Wheezy:
>
> Linux s1a 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux
>
> The kernel modules are:
>
> modinfo ocfs2 -> version: 1.5.0
>
> using the stock ocfs2-tools 1.6.4-1+deb7u1 from the distribution.
>
> As an alternative, I cloned and built the latest ocfs2-tools from
> markfasheh's ocfs2-tools on github, which should be version 1.8.4.
>
> The filesystem runs on top of drbd, is used to roughly 40 %, and has
> suffered from read-only remounts and hanging clients since the last
> reboot. This may be a DLM problem, but I suspect it stems from some
> corrupt disk structures. Before that, it all ran stably for months.
>
> This situation made me want to run fsck.ocfs2, and now I wonder how to
> do that. The filesystem is not mounted.
>
> With the stock ocfs2-tools 1.6.4:
>
> root@s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1
> fsck.ocfs2 1.6.4
> Checking OCFS2 filesystem in /dev/drbd1:
>   Label:              ocfs2_ASSET
>   UUID:               6A1A0189A3F94E32B6B9A526DF9060F3
>   Number of blocks:   5557283182
>   Block size:         2048
>   Number of clusters: 2778641591
>   Cluster size:       4096
>   Number of slots:    16
>
> I check fsck_drbd1.log and find that it is making progress in
>
> Pass 0a: Checking cluster allocation chains
>
> until it reaches "chain 73" and goes into an infinite loop, filling the
> logfile with breathtaking speed.
>
> With the newly built ocfs2-tools 1.8.4 I get:
>
> root@s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1
> fsck.ocfs2 1.8.4
> Checking OCFS2 filesystem in /dev/drbd1:
>   Label:              ocfs2_ASSET
>   UUID:               6A1A0189A3F94E32B6B9A526DF9060F3
>   Number of blocks:   5557283182
>   Block size:         2048
>   Number of clusters: 2778641591
>   Cluster size:       4096
>   Number of slots:    16
>
> Again watching the verbose output in fsck_drbd1.log, I find that this
> time it proceeds up to
>
> Pass 0a: Checking cluster allocation chains
> o2fsck_pass0:1360 | found inode alloc 13 at block 13
>
> and stays there without any further progress. I terminated this process
> after waiting for more than an hour.
>
> Now I'm lost somehow ... and would very much appreciate it if anybody
> on this list would share his knowledge and give me a hint what to do
> next.
>
> What could be done to get this file system checked and repaired? Am I
> missing something important, or do I just have to wait a little longer?
> Is there a version of ocfs2-tools / fsck.ocfs2 which will perform as
> expected?
>
> I'm prepared to upgrade the kernel to 3.16.0-0.bpo.4-amd64 but shy away
> from taking that risk without any clue whether that might solve my
> problem ...
>
> Thanks in advance ... Michael Ulbrich
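For long runs like this, it helps to watch the verbose log from a second
terminal while fsck works; a trivial sketch reusing the log file name
from Michael's commands:

    # Terminal 1: forced, verbose check, everything captured to a log.
    fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1

    # Terminal 2: follow the log. A pass that stops advancing, or that
    # repeats the same block numbers at high speed, points at a loop
    # like the one later found in chain 73.
    tail -f fsck_drbd1.log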
Re: [Ocfs2-users] fsck.ocfs2
I don't know if it is possible, but the kernel panic error is not in
/var/log/kern.log.

2011/5/13 Sunil Mushran sunil.mush...@oracle.com
> Please do not remove the cc-s.
>
> Hard for me to comment without knowing anything about the panic.
> However, I assume that the panic message indicated that the volume
> needs to be fsck-ed. In that case, the best course is to umount the
> volume on all nodes and run fsck on one node.
>
> On 05/13/2011 12:33 PM, Xavier Diumé wrote:
>> But initially the system had the devices in /etc/fstab with the
>> _netdev option. When the system starts mounting, a kernel panic
>> appears, sometimes after a few minutes. The only way I could start the
>> system was by mounting all devices one by one, with a previous fsck. I
>> don't know if it is the better way, but it is the only one that I've
>> used successfully.
>>
>> 2011/5/13 Sunil Mushran sunil.mush...@oracle.com
>>> On 05/13/2011 11:44 AM, Xavier Diumé wrote:
>>>> Hello,
>>>> Is it possible to fsck a mounted filesystem? When one of the cluster
>>>> nodes reboots because of a kernel panic, the device requires
>>>> fsck.ocfs2, because the rebooted node is shown in mounted.ocfs2 -f.
>>> If mounted.ocfs2 -f shows the rebooted node, that means the slotmap
>>> has not been cleaned up yet. That cleanup happens during node
>>> recovery. If the volume is still mounted on another node, it will get
>>> cleaned up momentarily. If however it does not get cleaned up, that
>>> means the volume is not mounted on any node. In that case, the next
>>> mount will clean up the slotmap. Either way, one does not need to
>>> fsck just to clean up the slotmap.

--
Xavier Diumé
http://socaqui.cat
Re: [Ocfs2-users] fsck.ocfs2
On 05/13/2011 11:44 AM, Xavier Diumé wrote:
> Hello,
> Is it possible to fsck a mounted filesystem? When one of the cluster
> nodes reboots because of a kernel panic, the device requires
> fsck.ocfs2, because the rebooted node is shown in mounted.ocfs2 -f.

If mounted.ocfs2 -f shows the rebooted node, that means the slotmap has
not been cleaned up yet. That cleanup happens during node recovery. If
the volume is still mounted on another node, it will get cleaned up
momentarily. If however it does not get cleaned up, that means the
volume is not mounted on any node. In that case, the next mount will
clean up the slotmap. Either way, one does not need to fsck just to
clean up the slotmap.
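A hedged sketch of the check Sunil describes: look at the slotmap before
reaching for fsck (the exact output layout of mounted.ocfs2 can vary
between tools versions, and /dev/sdX is a placeholder):

    # Full-detect mode scans the device and lists the nodes recorded in
    # the slotmap. A crashed node still listed here only means its slot
    # has not been recovered yet; it does not mean the fs needs fsck.
    mounted.ocfs2 -f /dev/sdX

    # If no node actually has the volume mounted, a single mount/umount
    # cycle triggers slot recovery and clears the stale entry.
    mount -t ocfs2 /dev/sdX /mnt && umount /mnt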
Re: [Ocfs2-users] fsck.ocfs2
But initially the system had the devices in /etc/fstab with the _netdev
option. When the system starts mounting, a kernel panic appears,
sometimes after a few minutes. The only way I could start the system was
by mounting all devices one by one, with a previous fsck. I don't know
if it is the better way, but it is the only one that I've used
successfully.

2011/5/13 Sunil Mushran sunil.mush...@oracle.com
> On 05/13/2011 11:44 AM, Xavier Diumé wrote:
>> Hello,
>> Is it possible to fsck a mounted filesystem? When one of the cluster
>> nodes reboots because of a kernel panic, the device requires
>> fsck.ocfs2, because the rebooted node is shown in mounted.ocfs2 -f.
>
> If mounted.ocfs2 -f shows the rebooted node, that means the slotmap has
> not been cleaned up yet. That cleanup happens during node recovery. If
> the volume is still mounted on another node, it will get cleaned up
> momentarily. If however it does not get cleaned up, that means the
> volume is not mounted on any node. In that case, the next mount will
> clean up the slotmap. Either way, one does not need to fsck just to
> clean up the slotmap.

--
Xavier Diumé
http://socaqui.cat
Re: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?
Correction: the kernel modules are 1.4.4; the tools and console are 1.4.3.

-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
Sent: Thursday, May 20, 2010 6:00 PM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?

We are setting up 2 new EL5 U4 machines to replace our current database
servers running our demo environment. We use 3Par SANs and their snap
clone options. The current production system we snap clone from is EL4
U5 with ocfs2 1.2.9; the new servers have ocfs2 1.4.3 installed. Part of
the refresh process is to run fsck.ocfs2 on the volume to recover, but
right now, as I am trying to run it on our 700GB volume, it shows a
virtual memory size of 21.9GB, resident of 10GB, and it is killing the
machine with swapping (24GB physical memory). Can anyone enlighten me as
to what is going on?

Ulf.
Re: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?
And upgrading to kernel modules 1.4.7 / tools 1.4.4 didn't change the
memory part:

  PID USER  PR  NI  VIRT  RES  SHR  S  %CPU  %MEM    TIME+  COMMAND
29532 root  18   0 21.9g  10g    4  D  21.1  45.0  0:15.24  fsck.ocfs2

-----Original Message-----
From: Ulf Zimmermann
Sent: Thursday, May 20, 2010 6:06 PM
To: Ulf Zimmermann; ocfs2-users@oss.oracle.com
Subject: RE: fsck.ocfs2 using huge amount of memory?

Correction: the kernel modules are 1.4.4; the tools and console are 1.4.3.

-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
Sent: Thursday, May 20, 2010 6:00 PM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?

We are setting up 2 new EL5 U4 machines to replace our current database
servers running our demo environment. We use 3Par SANs and their snap
clone options. The current production system we snap clone from is EL4
U5 with ocfs2 1.2.9; the new servers have ocfs2 1.4.3 installed. Part of
the refresh process is to run fsck.ocfs2 on the volume to recover, but
right now, as I am trying to run it on our 700GB volume, it shows a
virtual memory size of 21.9GB, resident of 10GB, and it is killing the
machine with swapping (24GB physical memory). Can anyone enlighten me as
to what is going on?

Ulf.
Re: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?
On Thu, May 20, 2010 at 06:00:19PM -0700, Ulf Zimmermann wrote:
> We are setting up 2 new EL5 U4 machines to replace our current database
> servers running our demo environment. We use 3Par SANs and their snap
> clone options. The current production system we snap clone from is EL4
> U5 with ocfs2 1.2.9; the new servers have ocfs2 1.4.3 installed. Part
> of the refresh process is to run fsck.ocfs2 on the volume to recover,
> but right now, as I am trying to run it on our 700GB volume, it shows a
> virtual memory size of 21.9GB, resident of 10GB, and it is killing the
> machine with swapping (24GB physical memory). Can anyone enlighten me
> as to what is going on?

How big are your filesystems? Can we get the output of
debugfs.ocfs2 -R 'stats' /dev/xxx?

Recent fsck.ocfs2 knows how to build its own I/O cache for significant
speed improvements. It only tries to get as much cache as the filesystem
actually needs, and no more than half of system memory. That's why I'm
asking for your filesystem size -- I'm guessing you have more than 12GB
of used space on the filesystem, so fsck.ocfs2 is trying to grab that
much cache.

Joel

--
"In the room the women come and go
 Talking of Michaelangelo."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127
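The sizing rule Joel describes can be written down directly. A sketch of
the heuristic as described, not of the actual tools code; the used-space
figure is illustrative (standing in for this thread's 700GB volume):

    # Expected fsck.ocfs2 I/O cache ~= min(used space on the fs, RAM / 2),
    # per the description above.
    used_bytes=$(( 700 * 1024 * 1024 * 1024 ))   # illustrative: assume ~700GB used
    mem_kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo)
    half_ram=$(( mem_kb * 1024 / 2 ))

    cache=$(( used_bytes < half_ram ? used_bytes : half_ram ))
    echo "expected cache: $(( cache / 1024 / 1024 / 1024 )) GB"
    # With 24GB of physical memory the cap works out to 12GB.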
Re: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?
http://oss.oracle.com/projects/ocfs2-tools/news/article_8.html

We did make a related change in fsck in that release. Do you mind
creating a bugzilla for this? Do mention the arch. I can then send you a
debug version of the tool that will tell us why it is behaving like that
on your machine.

On 05/20/2010 06:12 PM, Ulf Zimmermann wrote:
> And upgrading to kernel modules 1.4.7 / tools 1.4.4 didn't change the
> memory part:
>
>   PID USER  PR  NI  VIRT  RES  SHR  S  %CPU  %MEM    TIME+  COMMAND
> 29532 root  18   0 21.9g  10g    4  D  21.1  45.0  0:15.24  fsck.ocfs2
>
> -----Original Message-----
> From: Ulf Zimmermann
> Sent: Thursday, May 20, 2010 6:06 PM
> To: Ulf Zimmermann; ocfs2-users@oss.oracle.com
> Subject: RE: fsck.ocfs2 using huge amount of memory?
>
> Correction: the kernel modules are 1.4.4; the tools and console are
> 1.4.3.
>
> -----Original Message-----
> From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
> boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
> Sent: Thursday, May 20, 2010 6:00 PM
> To: ocfs2-users@oss.oracle.com
> Subject: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?
>
> We are setting up 2 new EL5 U4 machines to replace our current database
> servers running our demo environment. We use 3Par SANs and their snap
> clone options. The current production system we snap clone from is EL4
> U5 with ocfs2 1.2.9; the new servers have ocfs2 1.4.3 installed. Part
> of the refresh process is to run fsck.ocfs2 on the volume to recover,
> but right now, as I am trying to run it on our 700GB volume, it shows a
> virtual memory size of 21.9GB, resident of 10GB, and it is killing the
> machine with swapping (24GB physical memory). Can anyone enlighten me
> as to what is going on?
>
> Ulf.
Re: [Ocfs2-users] fsck.ocfs2 can't fix an orphaned inode
Sunil,

Bug 1236. Thanks very much.

-- Carl Benson, PHS Linux SysAdmin (206-667-4862, cben...@fhcrc.org)

On 03/18/2010 11:32 AM, Sunil Mushran wrote:
> One option is to provide me with the o2image of the volume.
>
> # o2image -r /dev/sda1 - | bzip2 > sda1.out.bz2
>
> File a bugzilla and add the link to that image. (The bz cannot handle
> large files.)
>
> The other option is to file a bz and attach the stat_sysdir output.
> http://oss.oracle.com/~smushran/.debug/scripts/stat_sysdir.sh
>
> Carl J. Benson wrote:
>> Hello!
>>
>> I searched through the mailing list back to 07/2008 and didn't see
>> this question answered before.
>>
>> I have 7 systems that use an ocfs2 filesystem. After many months of
>> solid, reliable use, they all crashed yesterday.
>>
>> 6 systems run openSUSE 11.1, kernel 2.6.27.29-0.1-default, with these
>> RPMs:
>> ocfs2-tools-1.4.1-6.9
>> ocfs2console-1.4.1-6.9
>>
>> 1 system has for a week been running openSUSE 11.2, kernel
>> 2.6.31.12-0.1-default, with these RPMs:
>> ocfs2console-1.4.1-25.6.x86_64
>> ocfs2-tools-1.4.1-25.6.x86_64
>> ocfs2-tools-o2cb-1.4.1-25.6.x86_64
>>
>> I still haven't figured out where the corruption started, but the
>> problem at the moment is this: after repeated runs of fsck.ocfs2 (with
>> the filesystem unmounted, of course!), it has fallen into a pattern.
>> Here is the output of fsck.ocfs2:
>>
>> /root # fsck.ocfs2 /dev/sdc1
>> Checking OCFS2 filesystem in /dev/sdc1:
>>   label:              iscsi_ocfs2_cluster
>>   uuid:               23 48 29 28 4d 71 44 e6 b4 1d 88 75 c9 69 46 d3
>>   number of blocks:   268438109
>>   bytes per block:    4096
>>   number of clusters: 268438109
>>   bytes per cluster:  4096
>>   max slots:          20
>>
>> pass4: Invalid block number while truncating orphan inode 104935559
>> fsck.ocfs2: Invalid block number while trying to replay the orphan
>> directory
>> fsck encountered errors while recovering slot information, check
>> forced.
>> /dev/sdc1 was run with -f, check forced.
>> Pass 0a: Checking cluster allocation chains
>> Pass 0b: Checking inode allocation chains
>> Pass 0c: Checking extent block allocation chains
>> Pass 1: Checking inodes and blocks.
>> Pass 2: Checking directory entries.
>> Pass 3: Checking directory connectivity.
>> Pass 4a: checking for orphaned inodes
>> [INODE_ORPHANED] Inode 104935559 was found in the orphan directory.
>> Delete its contents and unlink it? y
>> pass4: Invalid block number while truncating orphan inode 104935559
>> [INODE_ORPHANED] Inode 106959312 was found in the orphan directory.
>> Delete its contents and unlink it? y
>> pass4: Invalid block number while truncating orphan inode 106959312
>> Pass 4b: Checking inodes link counts.
>> All passes succeeded.
>>
>> At this point I reboot the server (named merlot1), run fsck.ocfs2, and
>> get exactly the same result.
>>
>> What can I do now? I looked at the man page for debugfs.ocfs2, but it
>> doesn't look like it's going to help me. Any suggestions, please?
>>
>> -- Carl Benson | cben...@fhcrc.org
>> Linux System Administrator | Telephone: (206) 667-4862
>> Fred Hutchinson Cancer Research Center
Re: [Ocfs2-users] fsck.ocfs2 can't fix an orphaned inode
One option is to provide me with the o2image of the volume.

# o2image -r /dev/sda1 - | bzip2 > sda1.out.bz2

File a bugzilla and add the link to that image. (The bz cannot handle
large files.)

The other option is to file a bz and attach the stat_sysdir output.
http://oss.oracle.com/~smushran/.debug/scripts/stat_sysdir.sh

Carl J. Benson wrote:
> Hello!
>
> I searched through the mailing list back to 07/2008 and didn't see this
> question answered before.
>
> I have 7 systems that use an ocfs2 filesystem. After many months of
> solid, reliable use, they all crashed yesterday.
>
> 6 systems run openSUSE 11.1, kernel 2.6.27.29-0.1-default, with these
> RPMs:
> ocfs2-tools-1.4.1-6.9
> ocfs2console-1.4.1-6.9
>
> 1 system has for a week been running openSUSE 11.2, kernel
> 2.6.31.12-0.1-default, with these RPMs:
> ocfs2console-1.4.1-25.6.x86_64
> ocfs2-tools-1.4.1-25.6.x86_64
> ocfs2-tools-o2cb-1.4.1-25.6.x86_64
>
> I still haven't figured out where the corruption started, but the
> problem at the moment is this: after repeated runs of fsck.ocfs2 (with
> the filesystem unmounted, of course!), it has fallen into a pattern.
> Here is the output of fsck.ocfs2:
>
> /root # fsck.ocfs2 /dev/sdc1
> Checking OCFS2 filesystem in /dev/sdc1:
>   label:              iscsi_ocfs2_cluster
>   uuid:               23 48 29 28 4d 71 44 e6 b4 1d 88 75 c9 69 46 d3
>   number of blocks:   268438109
>   bytes per block:    4096
>   number of clusters: 268438109
>   bytes per cluster:  4096
>   max slots:          20
>
> pass4: Invalid block number while truncating orphan inode 104935559
> fsck.ocfs2: Invalid block number while trying to replay the orphan
> directory
> fsck encountered errors while recovering slot information, check
> forced.
> /dev/sdc1 was run with -f, check forced.
> Pass 0a: Checking cluster allocation chains
> Pass 0b: Checking inode allocation chains
> Pass 0c: Checking extent block allocation chains
> Pass 1: Checking inodes and blocks.
> Pass 2: Checking directory entries.
> Pass 3: Checking directory connectivity.
> Pass 4a: checking for orphaned inodes
> [INODE_ORPHANED] Inode 104935559 was found in the orphan directory.
> Delete its contents and unlink it? y
> pass4: Invalid block number while truncating orphan inode 104935559
> [INODE_ORPHANED] Inode 106959312 was found in the orphan directory.
> Delete its contents and unlink it? y
> pass4: Invalid block number while truncating orphan inode 106959312
> Pass 4b: Checking inodes link counts.
> All passes succeeded.
>
> At this point I reboot the server (named merlot1), run fsck.ocfs2, and
> get exactly the same result.
>
> What can I do now? I looked at the man page for debugfs.ocfs2, but it
> doesn't look like it's going to help me. Any suggestions, please?
>
> -- Carl Benson | cben...@fhcrc.org
> Linux System Administrator | Telephone: (206) 667-4862
> Fred Hutchinson Cancer Research Center