Re: [Ocfs2-users] fsck.ocfs2 not fixing as it outputs errors when checking w/ no flag (-fn) but is clean with yes flag (-fy)

2016-04-01 Thread Jay V
On 3/31/2016 10:37 PM, Junxiao Bi wrote:
> On 04/01/2016 11:20 AM, Jay Vasa wrote:
>> On 3/31/2016 6:36 PM, Herbert van den Bergh wrote:
>>> It seems to me that the reason fsck -fn is reporting errors is because
>>> it isn't replaying the journal:
>>>
>>> ** Skipping journal replay because -n was given. There may be spurious
>>> errors that journal replay would fix. **
>>> ** Skipping slot recovery because -n was given. **
>>>
>>> So there are outstanding changes in the journal that need to be made
>>> to the fs, but fsck -fn skips them.  Then later it runs into the
>>> inconsistencies that would have been cleared if the journal was replayed.
>>>
>>> fsck -fy does replay the journal, so it doesn't see the
>>> inconsistencies that were fixed by it.
>>>
>>> When you do the fsck -fn AFTER fsck -fy, does it still say now that it
>>> is skipping journal replay?  If so, I wonder why.  If not, does it
>>> still report the exact same inode / cluster numbers as the previous
>>> time you ran it?  If fsck -fy had to make any changes (including
>>> replaying the journal), run it again, and repeat until it doesn't make
>>> any changes to the filesystem.  This is just to make sure it isn't
>>> leaving some inconsistency unfixed.  So please do:
>>>
>>> umount (on ALL nodes)
>>> fsck -fy
>>> fsck -fy (if the previous fsck made ANY changes including replaying
>>> the journal)
>>> fsck -fn (check if it mentions skipping the journal replay)
>>>
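>>> Concretely, that would look like this (a sketch only, using the device
>>> name from the commands quoted further down; substitute yours):
>>>
>>> umount /dev/drbd2            # on ALL nodes
>>> fsck.ocfs2 -fy /dev/drbd2    # on one node; the first run replays the journal
>>> fsck.ocfs2 -fy /dev/drbd2    # repeat until it makes no more changes
>>> fsck.ocfs2 -fn /dev/drbd2    # read-only re-check
>>>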
>>> If you still see any errors reported by fsck -fn, are they exactly the
>>> same ones as you've sent earlier?
>>>
>> This is exactly what I did the first time I ran it. I really don't
>> want to have another downtime doing exactly this again.
> So the "corrupted" ocfs2 volume is online now, does it work well? If
> ocfs2 is really corrupted, I think it will soon fall into a read-only fs
> or panic. If it works well, then maybe fsck.ocfs2 -fn is reporting the
> corruption wrongly.
>
> Thanks,
> Junxiao.
Yes, the "corrupted" ocfs2 volume is working just fine. It has not fallen
to read-only and has not had a panic. I am worried, though, that it will
go read-only at some point in the future. For that reason I have lately
been minimizing the load on it, since there seems to be no way to fix it.

Thanks,
Jay

>> As you can see below, I ran exactly this:
>> % umount /dev/drbd2 -- the umount stalled so I rebooted it
>> % fsck -fy /dev/drbd2
>> -- this replayed the journal
>> % fsck -fy /dev/drbd2
>> -- this did nothing
>> % fsck -fn /dev/drbd2
>> -- this showed the errors all over again. Yes exactly the same errors.
>>
>> Look at the bottom of this message as that is exactly what I ran, and
>> yes everything was unmounted. This is the only reason why I brought up
>> this issue.
>>
>> If you really want me to do this again, I can, but I don't like bringing
>> down the filesystem for another 6 hours for this. I have already run
>> fsck about 20 times.
>>
>> Thanks,
>> Jay


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] fsck.ocfs2 not fixing as it outputs errors when checking w/ no flag (-fn) but is clean with yes flag (-fy)

2016-03-31 Thread Jay Vasa


On 3/31/2016 6:28 PM, Junxiao Bi wrote:
> On 04/01/2016 09:21 AM, Jay Vasa wrote:
>> I never ran an fsck -fn while it was mounted. I understand that would
>> cause errors.
>> It has never been mounted during any fsck, either -fn or -fy. I was
>> trying to say that it sucks that I have to stop production for an
>> fsck which is not fixing these errors.
>> Again, it has never been mounted. I am sorry if I wasn't clear in
>> communicating that.
> Interesting, then how about "fsck.ocfs2 -f"? Try it and see whether it
> reports any corruption.
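> For example:
>
> fsck.ocfs2 -f /dev/drbd2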

I have already tried this -- I was curious about it as well. I also ran
it with just -f, and it gave the same results as -fy. I was hoping it
would prompt for input so I could type "y", but no.

Thanks,
Jay

>
> Thanks,
> Junxiao.
>> Jay


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] fsck.ocfs2 not fixing as it outputs errors when checking w/ no flag (-fn) but is clean with yes flag (-fy)

2016-03-29 Thread Eric Ren
Hi,

>> So, we have 2 problems now.
>> What's the matter with fsck?
>> It's very weird :-/
> Yes very weird. The main issue is that I need to fsck this filesystem. I
> hope Junxiao can help.
>> How did this error happen in the kernel?
>> If there's no solution available right now -- neither problem is easy
>> to solve if we cannot reproduce it.
>>
>> So, you use nfs on top of ocfs2; here is the relevant commit:
>> git log -p 6ca497a83
>>
>> And please provide the initial and complete error messages, from as
>> early as possible.
>
> I tried this command, but I don't have this repository.
> # git log -p 6ca497a83
> fatal: Not a git repository (or any of the parent directories): .git

Sorry, I meant the kernel source commit.
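
For example (a sketch; any clone of the mainline kernel tree will do):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git log -p 6ca497a83    # the nfs-on-ocfs2 commit mentioned above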

I saw an Oracle engineer has given a very kind reply. Thanks to him!

Eric

>
> Can you tell me exactly the commands I need to run to help you out with
> this output? I am using the official RPMs, so I don't have the source
> code; I believe only Oracle has it.
> Do I need to check out ocfs2-tools from git, and from where?


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check

2016-03-25 Thread Joseph Qi
Hi Michael,
Yes, currently the best way is to copy out as much data as possible,
recreate the ocfs2 volume, and then restore the data.
I haven't encountered this issue before and don't know which case can
lead to it, so I'm sorry I can't give you advice on how to avoid it.
But I suggest you keep following the patches in the latest kernel, and
apply the read-only-related ones (both ocfs2 and jbd2). We have indeed
submitted several patches to fix read-only issues.
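
A rough sketch of that path (the mount points are placeholders; the mkfs
parameters are taken from the fsck output elsewhere in this thread):

mount -o ro /dev/drbd1 /mnt/old       # single-node read-only mount
rsync -aHAX /mnt/old/ /backup/        # copy out what is readable
umount /mnt/old
mkfs.ocfs2 -b 2K -C 4K -N 16 -L ocfs2_ASSET /dev/drbd1   # recreate
mount /dev/drbd1 /mnt/new
rsync -aHAX /backup/ /mnt/new/        # restore the data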

Thanks,
Joseph

On 2016/3/26 0:41, Michael Ulbrich wrote:
> Joseph,
> 
> thanks again for your help!
> 
> Currently I'm dumping out 4 TB of data from the broken ocfs2 device to
> an external disk. I have shut down the cluster and have the fs mounted
> read-only on a single node. It seems that the data structures are still
> intact and that the file system problems are bound to internal data
> areas (DLM?) which are not in use in the single node r/o mount use case.
> 
> Will create a new ocfs2 device and restore the data later.
> 
> Besides taking metadata backups with o2image, is there any advice you
> would give to avoid similar situations in the future?
> 
> All the best ... Michael
> 
> On 03/25/2016 01:36 AM, Joseph Qi wrote:
>> Hi Michael,
>>
>> On 2016/3/24 21:47, Michael Ulbrich wrote:
>>> Hi Joseph,
>>>
>>> thanks for this information although this does not sound too optimistic ...
>>>
>>> So, if I understand you correctly, if we had a metadata backup from
>>> o2image _before_ the crash we could have looked up the missing info to
>>> remove the loop from group chain 73, right?
>> If we have a metadata backup, we can use o2image to restore it, but
>> this may lose some data.
>>
>>>
>>> But how could the loop issue be fixed and at the same time the damage to
>>> the data be minimized? There is a recent file level backup from which
>>> damaged or missing files could be restored later.
>>>
>>> 151   4054438912   15872   2152    13720   10606   1984
>>> 152   4094595072   15872   10753   5119    5119    1984
>>> 153   4090944512   15872   1818    14054   9646    1984 <--
>>> 154   4083643392   15872   571     15301   4914    1984
>>> 155   4510758912   15872   4834    11038   6601    1984
>>> 156   4492506112   15872   6532    9340    5119    1984
>>>
>>> Could you describe a "brute force" way how to dd out and edit record
>>> #153 to remove the loop and minimize potential loss of data at the same
>>> time? So that fsck would have a chance to complete and fix the remaining
>>> issues?
>> This is dangerous unless we know exactly what info the block should
>> store.
>>
>> My idea is to find out the actual block of record #154 and make block
>> 4090944512 of record #153 point to it. This is a bit complicated and
>> should only be done with a deep understanding of the disk layout.
>>
>> I have gone through the fsck.ocfs2 patches and found that the following
>> may help: commit efca4b0f2241 (Break a chain loop in group desc).
>> But as you said, you have already upgraded to version 1.8.4, so I'm
>> sorry, currently I don't have a better idea.
>>
>> Thanks,
>> Joseph
>>>
>>> Thanks a lot for your help ... Michael
>>>
>>> On 03/24/2016 02:10 PM, Joseph Qi wrote:
 Hi Michael,
 So I think the block of record #153 goes wrong: its next pointer points
 to block 4083643392, the block of record #19.
 But the problem is we don't know the right contents of the block of
 record #153; otherwise we could dd it out, edit it, and dd it back in to
 fix it.

 Thanks,
 Joseph

 On 2016/3/24 18:38, Michael Ulbrich wrote:
> Hi Joseph,
>
> ok, got it! Here's the loop in chain 73:
>
> Group Chain: 73   Parent Inode: 13  Generation: 1172963971
> CRC32:    ECC: 
> ##    Block#       Total   Used    Free    Contig   Size
> 0     4280773632   15872   11487   4385    1774     1984
> 1     2583263232   15872   5341    10531   5153     1984
> 2     4543613952   15872   5329    10543   5119     1984
> 3     4532662272   15872   10753   5119    5119     1984
> 4     4539963392   15872   3223    12649   7530     1984
> 5     4536312832   15872   5219    10653   5534     1984
> 6     4529011712   15872   6047    9825    3359     1984
> 7     4525361152   15872   4475    11397   5809     1984
> 8     4521710592   15872   3182    12690   5844     1984
> 9     4518060032   15872   5881    9991    5131     1984
> 10    4236966912   15872   10753   5119    5119     1984
> 11    4098245632   15872   10756   5116    3388     1984
> 12    4514409472   15872   8826    7046    5119     1984
> 13    3441144832   15872   15      15857   9680     1984
> 14    4404892672   15872   7563    8309    5119     1984
> 15    4233316352   15872   9398    6474    5114     1984
> 16    448882       15872

Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check

2016-03-25 Thread Michael Ulbrich
Joseph,

thanks again for your help!

Currently I'm dumping out 4 TB of data from the broken ocfs2 device to
an external disk. I have shut down the cluster and have the fs mounted
read-only on a single node. It seems that the data structures are still
intact and that the file system problems are bound to internal data
areas (DLM?) which are not in use in the single node r/o mount use case.

Will create a new ocfs2 device and restore the data later.

Besides taking metadata backups with o2image, is there any advice you
would give to avoid similar situations in the future?

All the best ... Michael

On 03/25/2016 01:36 AM, Joseph Qi wrote:
> Hi Michael,
> 
> On 2016/3/24 21:47, Michael Ulbrich wrote:
>> Hi Joseph,
>>
>> thanks for this information although this does not sound too optimistic ...
>>
>> So, if I understand you correctly, if we had a metadata backup from
>> o2image _before_ the crash we could have looked up the missing info to
>> remove the loop from group chain 73, right?
> If we have a metadata backup, we can use o2image to restore it, but
> this may lose some data.
> 
>>
>> But how could the loop issue be fixed and at the same time the damage to
>> the data be minimized? There is a recent file level backup from which
>> damaged or missing files could be restored later.
>>
>> 151   4054438912   15872   2152    13720   10606   1984
>> 152   4094595072   15872   10753   5119    5119    1984
>> 153   4090944512   15872   1818    14054   9646    1984 <--
>> 154   4083643392   15872   571     15301   4914    1984
>> 155   4510758912   15872   4834    11038   6601    1984
>> 156   4492506112   15872   6532    9340    5119    1984
>>
>> Could you describe a "brute force" way how to dd out and edit record
>> #153 to remove the loop and minimize potential loss of data at the same
>> time? So that fsck would have a chance to complete and fix the remaining
>> issues?
> This is dangerous unless we know exactly what info the block should
> store.
> 
> My idea is to find out the actual block of record #154 and make block
> 4090944512 of record #153 point to it. This is a bit complicated and
> should only be done with a deep understanding of the disk layout.
> 
> I have gone through the fsck.ocfs2 patches and found that the following
> may help: commit efca4b0f2241 (Break a chain loop in group desc).
> But as you said, you have already upgraded to version 1.8.4, so I'm
> sorry, currently I don't have a better idea.
> 
> Thanks,
> Joseph
>>
>> Thanks a lot for your help ... Michael
>>
>> On 03/24/2016 02:10 PM, Joseph Qi wrote:
>>> Hi Michael,
>>> So I think the block of record #153 goes wrong: its next pointer points
>>> to block 4083643392, the block of record #19.
>>> But the problem is we don't know the right contents of the block of
>>> record #153; otherwise we could dd it out, edit it, and dd it back in
>>> to fix it.
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On 2016/3/24 18:38, Michael Ulbrich wrote:
 Hi Joseph,

 ok, got it! Here's the loop in chain 73:

 Group Chain: 73   Parent Inode: 13  Generation: 1172963971
 CRC32:    ECC: 
 ##    Block#       Total   Used    Free    Contig   Size
 0     4280773632   15872   11487   4385    1774     1984
 1     2583263232   15872   5341    10531   5153     1984
 2     4543613952   15872   5329    10543   5119     1984
 3     4532662272   15872   10753   5119    5119     1984
 4     4539963392   15872   3223    12649   7530     1984
 5     4536312832   15872   5219    10653   5534     1984
 6     4529011712   15872   6047    9825    3359     1984
 7     4525361152   15872   4475    11397   5809     1984
 8     4521710592   15872   3182    12690   5844     1984
 9     4518060032   15872   5881    9991    5131     1984
 10    4236966912   15872   10753   5119    5119     1984
 11    4098245632   15872   10756   5116    3388     1984
 12    4514409472   15872   8826    7046    5119     1984
 13    3441144832   15872   15      15857   9680     1984
 14    4404892672   15872   7563    8309    5119     1984
 15    4233316352   15872   9398    6474    5114     1984
 16    448882       15872   6358    9514    5119     1984
 17    3901115392   15872   9932    5940    3757     1984
 18    4507108352   15872   6557    9315    6166     1984
 19    4083643392   15872   571     15301   4914     1984 <--
 20    4510758912   15872   4834    11038   6601     1984
 21    4492506112   15872   6532    9340    5119     1984
 22    4496156672   15872   10753   5119    5119     1984
 23    4503457792   15872   10718   5154    5119     1984
 ...
 154   4083643392   15872   571     15301   4914     1984 <--
 155   4510758912   15872   4834

Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check

2016-03-24 Thread Joseph Qi
Hi Michael,

On 2016/3/24 21:47, Michael Ulbrich wrote:
> Hi Joseph,
> 
> thanks for this information although this does not sound too optimistic ...
> 
> So, if I understand you correctly, if we had a metadata backup from
> o2image _before_ the crash we could have looked up the missing info to
> remove the loop from group chain 73, right?
If we have a metadata backup, we can use o2image to restore it, but
this may lose some data.

> 
> But how could the loop issue be fixed and at the same time the damage to
> the data be minimized? There is a recent file level backup from which
> damaged or missing files could be restored later.
> 
> 151   4054438912   15872   2152    13720   10606   1984
> 152   4094595072   15872   10753   5119    5119    1984
> 153   4090944512   15872   1818    14054   9646    1984 <--
> 154   4083643392   15872   571     15301   4914    1984
> 155   4510758912   15872   4834    11038   6601    1984
> 156   4492506112   15872   6532    9340    5119    1984
> 
> Could you describe a "brute force" way how to dd out and edit record
> #153 to remove the loop and minimize potential loss of data at the same
> time? So that fsck would have a chance to complete and fix the remaining
> issues?
This is dangerous unless we know exactly what info the block should
store.

My idea is to find out the actual block of record #154 and make block
4090944512 of record #153 point to it. This is a bit complicated and
should only be done with a deep understanding of the disk layout.

I have gone through the fsck.ocfs2 patches and found that the following
may help: commit efca4b0f2241 (Break a chain loop in group desc).
But as you said, you have already upgraded to version 1.8.4, so I'm
sorry, currently I don't have a better idea.

Thanks,
Joseph
> 
> Thanks a lot for your help ... Michael
> 
> On 03/24/2016 02:10 PM, Joseph Qi wrote:
>> Hi Michael,
>> So I think the block of record #153 goes wrong: its next pointer points
>> to block 4083643392, the block of record #19.
>> But the problem is we don't know the right contents of the block of
>> record #153; otherwise we could dd it out, edit it, and dd it back in
>> to fix it.
>>
>> Thanks,
>> Joseph
>>
>> On 2016/3/24 18:38, Michael Ulbrich wrote:
>>> Hi Joseph,
>>>
>>> ok, got it! Here's the loop in chain 73:
>>>
>>> Group Chain: 73   Parent Inode: 13  Generation: 1172963971
>>> CRC32:    ECC: 
>>> ##    Block#       Total   Used    Free    Contig   Size
>>> 0     4280773632   15872   11487   4385    1774     1984
>>> 1     2583263232   15872   5341    10531   5153     1984
>>> 2     4543613952   15872   5329    10543   5119     1984
>>> 3     4532662272   15872   10753   5119    5119     1984
>>> 4     4539963392   15872   3223    12649   7530     1984
>>> 5     4536312832   15872   5219    10653   5534     1984
>>> 6     4529011712   15872   6047    9825    3359     1984
>>> 7     4525361152   15872   4475    11397   5809     1984
>>> 8     4521710592   15872   3182    12690   5844     1984
>>> 9     4518060032   15872   5881    9991    5131     1984
>>> 10    4236966912   15872   10753   5119    5119     1984
>>> 11    4098245632   15872   10756   5116    3388     1984
>>> 12    4514409472   15872   8826    7046    5119     1984
>>> 13    3441144832   15872   15      15857   9680     1984
>>> 14    4404892672   15872   7563    8309    5119     1984
>>> 15    4233316352   15872   9398    6474    5114     1984
>>> 16    448882       15872   6358    9514    5119     1984
>>> 17    3901115392   15872   9932    5940    3757     1984
>>> 18    4507108352   15872   6557    9315    6166     1984
>>> 19    4083643392   15872   571     15301   4914     1984 <--
>>> 20    4510758912   15872   4834    11038   6601     1984
>>> 21    4492506112   15872   6532    9340    5119     1984
>>> 22    4496156672   15872   10753   5119    5119     1984
>>> 23    4503457792   15872   10718   5154    5119     1984
>>> ...
>>> 154   4083643392   15872   571     15301   4914     1984 <--
>>> 155   4510758912   15872   4834    11038   6601     1984
>>> 156   4492506112   15872   6532    9340    5119     1984
>>> 157   4496156672   15872   10753   5119    5119     1984
>>> 158   4503457792   15872   10718   5154    5119     1984
>>> ...
>>> 289   4083643392   15872   571     15301   4914     1984 <--
>>> 290   4510758912   15872   4834    11038   6601     1984
>>> 291   4492506112   15872   6532    9340    5119     1984
>>> 292   4496156672   15872   10753   5119    5119     1984
>>> 293   4503457792   15872   10718   5154    5119     1984
>>>
>>> etc.
>>>
>>> So the loop begins at record #154 and spans 135 records, right?
>>>
>>> Will back up fs metadata as soon as I have some external storage at hand.

Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check

2016-03-24 Thread Michael Ulbrich
Hi Joseph,

thanks for this information although this does not sound too optimistic ...

So, if I understand you correctly, if we had a metadata backup from
o2image _before_ the crash we could have looked up the missing info to
remove the loop from group chain 73, right?

But how could the loop issue be fixed and at the same time the damage to
the data be minimized? There is a recent file level backup from which
damaged or missing files could be restored later.

151   4054438912   15872   2152    13720   10606   1984
152   4094595072   15872   10753   5119    5119    1984
153   4090944512   15872   1818    14054   9646    1984 <--
154   4083643392   15872   571     15301   4914    1984
155   4510758912   15872   4834    11038   6601    1984
156   4492506112   15872   6532    9340    5119    1984

Could you describe a "brute force" way how to dd out and edit record
#153 to remove the loop and minimize potential loss of data at the same
time? So that fsck would have a chance to complete and fix the remaining
issues?

Thanks a lot for your help ... Michael

On 03/24/2016 02:10 PM, Joseph Qi wrote:
> Hi Michael,
> So I think the block of record #153 goes wrong: its next pointer points
> to block 4083643392, the block of record #19.
> But the problem is we don't know the right contents of the block of
> record #153; otherwise we could dd it out, edit it, and dd it back in
> to fix it.
> 
> Thanks,
> Joseph
> 
> On 2016/3/24 18:38, Michael Ulbrich wrote:
>> Hi Joseph,
>>
>> ok, got it! Here's the loop in chain 73:
>>
>> Group Chain: 73   Parent Inode: 13  Generation: 1172963971
>> CRC32:    ECC: 
>> ##    Block#       Total   Used    Free    Contig   Size
>> 0     4280773632   15872   11487   4385    1774     1984
>> 1     2583263232   15872   5341    10531   5153     1984
>> 2     4543613952   15872   5329    10543   5119     1984
>> 3     4532662272   15872   10753   5119    5119     1984
>> 4     4539963392   15872   3223    12649   7530     1984
>> 5     4536312832   15872   5219    10653   5534     1984
>> 6     4529011712   15872   6047    9825    3359     1984
>> 7     4525361152   15872   4475    11397   5809     1984
>> 8     4521710592   15872   3182    12690   5844     1984
>> 9     4518060032   15872   5881    9991    5131     1984
>> 10    4236966912   15872   10753   5119    5119     1984
>> 11    4098245632   15872   10756   5116    3388     1984
>> 12    4514409472   15872   8826    7046    5119     1984
>> 13    3441144832   15872   15      15857   9680     1984
>> 14    4404892672   15872   7563    8309    5119     1984
>> 15    4233316352   15872   9398    6474    5114     1984
>> 16    448882       15872   6358    9514    5119     1984
>> 17    3901115392   15872   9932    5940    3757     1984
>> 18    4507108352   15872   6557    9315    6166     1984
>> 19    4083643392   15872   571     15301   4914     1984 <--
>> 20    4510758912   15872   4834    11038   6601     1984
>> 21    4492506112   15872   6532    9340    5119     1984
>> 22    4496156672   15872   10753   5119    5119     1984
>> 23    4503457792   15872   10718   5154    5119     1984
>> ...
>> 154   4083643392   15872   571     15301   4914     1984 <--
>> 155   4510758912   15872   4834    11038   6601     1984
>> 156   4492506112   15872   6532    9340    5119     1984
>> 157   4496156672   15872   10753   5119    5119     1984
>> 158   4503457792   15872   10718   5154    5119     1984
>> ...
>> 289   4083643392   15872   571     15301   4914     1984 <--
>> 290   4510758912   15872   4834    11038   6601     1984
>> 291   4492506112   15872   6532    9340    5119     1984
>> 292   4496156672   15872   10753   5119    5119     1984
>> 293   4503457792   15872   10718   5154    5119     1984
>>
>> etc.
>>
>> So the loop begins at record #154 and spans 135 records, right?
>>
>> Will back up fs metadata as soon as I have some external storage at hand.
>>
>> Thanks a lot so far ... Michael
>>
>> On 03/24/2016 10:41 AM, Joseph Qi wrote:
>>> Hi Michael,
>>> It seems that a dead loop happens in chain 73. You have formatted with 2K
>>> blocks and 4K clusters, so each chain should have 1522 or 1521 records.
>>> But at first glance, I cannot figure out which block goes wrong, because
>>> the output you pasted indicates all blocks are different. So I suggest
>>> you investigate all the blocks which belong to chain 73 and try to find
>>> out if there is a loop there.
>>> BTW, have you backed up the metadata using o2image?
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On 2016/3/24 16:40, Michael Ulbrich wrote:
 Hi Joseph,

 thanks a lot for your help. It is very much appreciated!

 I ran debugfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file

Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check

2016-03-24 Thread Joseph Qi
Hi Michael,
So I think the block of record #153 goes wrong: its next pointer points
to block 4083643392, the block of record #19.
But the problem is we don't know the right contents of the block of
record #153; otherwise we could dd it out, edit it, and dd it back in to
fix it.
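
Roughly, given the 2K block size, the dd part would look like this (a
sketch only; the field to change is the group descriptor's next pointer,
bg_next_group if I read the layout right):

dd if=/dev/drbd1 of=gd153.bin bs=2048 skip=4090944512 count=1   # dd out
# edit gd153.bin with a hex editor to point at record #154's block
dd if=gd153.bin of=/dev/drbd1 bs=2048 seek=4090944512 count=1   # dd in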

Thanks,
Joseph

On 2016/3/24 18:38, Michael Ulbrich wrote:
> Hi Joseph,
> 
> ok, got it! Here's the loop in chain 73:
> 
> Group Chain: 73   Parent Inode: 13  Generation: 1172963971
> CRC32:    ECC: 
> ##    Block#       Total   Used    Free    Contig   Size
> 0     4280773632   15872   11487   4385    1774     1984
> 1     2583263232   15872   5341    10531   5153     1984
> 2     4543613952   15872   5329    10543   5119     1984
> 3     4532662272   15872   10753   5119    5119     1984
> 4     4539963392   15872   3223    12649   7530     1984
> 5     4536312832   15872   5219    10653   5534     1984
> 6     4529011712   15872   6047    9825    3359     1984
> 7     4525361152   15872   4475    11397   5809     1984
> 8     4521710592   15872   3182    12690   5844     1984
> 9     4518060032   15872   5881    9991    5131     1984
> 10    4236966912   15872   10753   5119    5119     1984
> 11    4098245632   15872   10756   5116    3388     1984
> 12    4514409472   15872   8826    7046    5119     1984
> 13    3441144832   15872   15      15857   9680     1984
> 14    4404892672   15872   7563    8309    5119     1984
> 15    4233316352   15872   9398    6474    5114     1984
> 16    448882       15872   6358    9514    5119     1984
> 17    3901115392   15872   9932    5940    3757     1984
> 18    4507108352   15872   6557    9315    6166     1984
> 19    4083643392   15872   571     15301   4914     1984 <--
> 20    4510758912   15872   4834    11038   6601     1984
> 21    4492506112   15872   6532    9340    5119     1984
> 22    4496156672   15872   10753   5119    5119     1984
> 23    4503457792   15872   10718   5154    5119     1984
> ...
> 154   4083643392   15872   571     15301   4914     1984 <--
> 155   4510758912   15872   4834    11038   6601     1984
> 156   4492506112   15872   6532    9340    5119     1984
> 157   4496156672   15872   10753   5119    5119     1984
> 158   4503457792   15872   10718   5154    5119     1984
> ...
> 289   4083643392   15872   571     15301   4914     1984 <--
> 290   4510758912   15872   4834    11038   6601     1984
> 291   4492506112   15872   6532    9340    5119     1984
> 292   4496156672   15872   10753   5119    5119     1984
> 293   4503457792   15872   10718   5154    5119     1984
> 
> etc.
> 
> So the loop begins at record #154 and spans 135 records, right?
> 
> Will back up fs metadata as soon as I have some external storage at hand.
> 
> Thanks a lot so far ... Michael
> 
> On 03/24/2016 10:41 AM, Joseph Qi wrote:
>> Hi Michael,
>> It seems that a dead loop happens in chain 73. You have formatted with 2K
>> blocks and 4K clusters, so each chain should have 1522 or 1521 records.
>> But at first glance, I cannot figure out which block goes wrong, because
>> the output you pasted indicates all blocks are different. So I suggest
>> you investigate all the blocks which belong to chain 73 and try to find
>> out if there is a loop there.
>> BTW, have you backed up the metadata using o2image?
>>
>> Thanks,
>> Joseph
>>
>> On 2016/3/24 16:40, Michael Ulbrich wrote:
>>> Hi Joseph,
>>>
>>> thanks a lot for your help. It is very much appreciated!
>>>
>>> I ran debugfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file system:
>>>
>>> root@s1a:~# debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 >
>>> debugfs_drbd1.log 2>&1
>>>
>>> Inode: 13   Mode: 0644   Generation: 1172963971 (0x45ea0283)
>>> FS Generation: 1172963971 (0x45ea0283)
>>> CRC32:    ECC: 
>>> Type: Regular   Attr: 0x0   Flags: Valid System Allocbitmap Chain
>>> Dynamic Features: (0x0)
>>> User: 0 (root)   Group: 0 (root)   Size: 11381315956736
>>> Links: 1   Clusters: 2778641591
>>> ctime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>>> atime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>>> mtime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>>> dtime: 0x0 -- Thu Jan  1 01:00:00 1970
>>> ctime_nsec: 0x -- 0
>>> atime_nsec: 0x -- 0
>>> mtime_nsec: 0x -- 0
>>> Refcount Block: 0
>>> Last Extblk: 0   Orphan Slot: 0
>>> Sub Alloc Slot: Global   Sub Alloc Bit: 7
>>> Bitmap Total: 2778641591   Used: 1083108631   Free: 1695532960
>>> Clusters per Group: 15872   Bits per Cluster: 1
>>> Count: 115   Next Free Rec: 115
>>> ##   Total      Used     Free       Block#
>>> 0    24173056   9429318  14743738   4533995520
>>> 1    24173056   9421663  14751393   4548629504
>>> 2    24173056   9432421  14740635   4588817408
>>> 3    24173056

Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check

2016-03-24 Thread Michael Ulbrich
Hi Joseph,

ok, got it! Here's the loop in chain 73:

Group Chain: 73   Parent Inode: 13  Generation: 1172963971
CRC32:    ECC: 
##    Block#       Total   Used    Free    Contig   Size
0     4280773632   15872   11487   4385    1774     1984
1     2583263232   15872   5341    10531   5153     1984
2     4543613952   15872   5329    10543   5119     1984
3     4532662272   15872   10753   5119    5119     1984
4     4539963392   15872   3223    12649   7530     1984
5     4536312832   15872   5219    10653   5534     1984
6     4529011712   15872   6047    9825    3359     1984
7     4525361152   15872   4475    11397   5809     1984
8     4521710592   15872   3182    12690   5844     1984
9     4518060032   15872   5881    9991    5131     1984
10    4236966912   15872   10753   5119    5119     1984
11    4098245632   15872   10756   5116    3388     1984
12    4514409472   15872   8826    7046    5119     1984
13    3441144832   15872   15      15857   9680     1984
14    4404892672   15872   7563    8309    5119     1984
15    4233316352   15872   9398    6474    5114     1984
16    448882       15872   6358    9514    5119     1984
17    3901115392   15872   9932    5940    3757     1984
18    4507108352   15872   6557    9315    6166     1984
19    4083643392   15872   571     15301   4914     1984 <--
20    4510758912   15872   4834    11038   6601     1984
21    4492506112   15872   6532    9340    5119     1984
22    4496156672   15872   10753   5119    5119     1984
23    4503457792   15872   10718   5154    5119     1984
...
154   4083643392   15872   571     15301   4914     1984 <--
155   4510758912   15872   4834    11038   6601     1984
156   4492506112   15872   6532    9340    5119     1984
157   4496156672   15872   10753   5119    5119     1984
158   4503457792   15872   10718   5154    5119     1984
...
289   4083643392   15872   571     15301   4914     1984 <--
290   4510758912   15872   4834    11038   6601     1984
291   4492506112   15872   6532    9340    5119     1984
292   4496156672   15872   10753   5119    5119     1984
293   4503457792   15872   10718   5154    5119     1984

etc.

So the loop begins at record #154 and spans 135 records, right?

Will back up fs metadata as soon as I have some external storage at hand.

Thanks a lot so far ... Michael

On 03/24/2016 10:41 AM, Joseph Qi wrote:
> Hi Michael,
> It seems that a dead loop happens in chain 73. You have formatted with 2K
> blocks and 4K clusters, so each chain should have 1522 or 1521 records.
> But at first glance, I cannot figure out which block goes wrong, because
> the output you pasted indicates all blocks are different. So I suggest
> you investigate all the blocks which belong to chain 73 and try to find
> out if there is a loop there.
> BTW, have you backed up the metadata using o2image?
> 
> Thanks,
> Joseph
> 
> On 2016/3/24 16:40, Michael Ulbrich wrote:
>> Hi Joseph,
>>
>> thanks a lot for your help. It is very much appreciated!
>>
>> I ran debugfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file system:
>>
>> root@s1a:~# debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 >
>> debugfs_drbd1.log 2>&1
>>
>> Inode: 13   Mode: 0644   Generation: 1172963971 (0x45ea0283)
>> FS Generation: 1172963971 (0x45ea0283)
>> CRC32:    ECC: 
>> Type: Regular   Attr: 0x0   Flags: Valid System Allocbitmap Chain
>> Dynamic Features: (0x0)
>> User: 0 (root)   Group: 0 (root)   Size: 11381315956736
>> Links: 1   Clusters: 2778641591
>> ctime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>> atime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>> mtime: 0x54010183 -- Sat Aug 30 00:41:07 2014
>> dtime: 0x0 -- Thu Jan  1 01:00:00 1970
>> ctime_nsec: 0x -- 0
>> atime_nsec: 0x -- 0
>> mtime_nsec: 0x -- 0
>> Refcount Block: 0
>> Last Extblk: 0   Orphan Slot: 0
>> Sub Alloc Slot: Global   Sub Alloc Bit: 7
>> Bitmap Total: 2778641591   Used: 1083108631   Free: 1695532960
>> Clusters per Group: 15872   Bits per Cluster: 1
>> Count: 115   Next Free Rec: 115
>> ##    Total      Used     Free       Block#
>> 0     24173056   9429318  14743738   4533995520
>> 1     24173056   9421663  14751393   4548629504
>> 2     24173056   9432421  14740635   4588817408
>> 3     24173056   9427533  14745523   4548692992
>> 4     24173056   9433978  14739078   4508568576
>> 5     24173056   9436974  14736082   4636369920
>> 6     24173056   9428411  14744645   4563390464
>> 7     24173056   9426950  14746106   4479459328
>> 8     24173056   9428099  14744957   4548851712
>> 9     24173056   9431794  14741262   4585389056
>> ...
>> 105   24157184   9414241  14742943   4690652160
>> 106

Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check

2016-03-24 Thread Joseph Qi
Hi Michael,
It seems that a dead loop happens in chain 73. You have formatted with 2K
blocks and 4K clusters, so each chain should have 1522 or 1521 records.
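(For reference, the arithmetic: 2778641591 clusters / 15872 clusters per
group ~= 175066 groups, and 175066 groups / 115 chains ~= 1522.3, hence
1522 or 1521 records per chain.)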
But at first glance, I cannot figure out which block goes wrong, because
the output you pasted indicates all blocks are different. So I suggest
you investigate all the blocks which belong to chain 73 and try to find
out if there is a loop there.
BTW, have you backed up the metadata using o2image?
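If not, something like this would do it (a sketch; the image path is just
an example, and -I is the restore direction, if I remember the tool
correctly):

o2image /dev/drbd1 /backup/drbd1.o2i      # save a metadata image
o2image -I /dev/drbd1 /backup/drbd1.o2i   # write it back later if needed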

Thanks,
Joseph

On 2016/3/24 16:40, Michael Ulbrich wrote:
> Hi Joseph,
> 
> thanks a lot for your help. It is very much appreciated!
> 
> I ran debugfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file system:
> 
> root@s1a:~# debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 >
> debugfs_drbd1.log 2>&1
> 
> Inode: 13   Mode: 0644   Generation: 1172963971 (0x45ea0283)
> FS Generation: 1172963971 (0x45ea0283)
> CRC32:    ECC: 
> Type: Regular   Attr: 0x0   Flags: Valid System Allocbitmap Chain
> Dynamic Features: (0x0)
> User: 0 (root)   Group: 0 (root)   Size: 11381315956736
> Links: 1   Clusters: 2778641591
> ctime: 0x54010183 -- Sat Aug 30 00:41:07 2014
> atime: 0x54010183 -- Sat Aug 30 00:41:07 2014
> mtime: 0x54010183 -- Sat Aug 30 00:41:07 2014
> dtime: 0x0 -- Thu Jan  1 01:00:00 1970
> ctime_nsec: 0x -- 0
> atime_nsec: 0x -- 0
> mtime_nsec: 0x -- 0
> Refcount Block: 0
> Last Extblk: 0   Orphan Slot: 0
> Sub Alloc Slot: Global   Sub Alloc Bit: 7
> Bitmap Total: 2778641591   Used: 1083108631   Free: 1695532960
> Clusters per Group: 15872   Bits per Cluster: 1
> Count: 115   Next Free Rec: 115
> ##    Total      Used     Free       Block#
> 0     24173056   9429318  14743738   4533995520
> 1     24173056   9421663  14751393   4548629504
> 2     24173056   9432421  14740635   4588817408
> 3     24173056   9427533  14745523   4548692992
> 4     24173056   9433978  14739078   4508568576
> 5     24173056   9436974  14736082   4636369920
> 6     24173056   9428411  14744645   4563390464
> 7     24173056   9426950  14746106   4479459328
> 8     24173056   9428099  14744957   4548851712
> 9     24173056   9431794  14741262   4585389056
> ...
> 105   24157184   9414241  14742943   4690652160
> 106   24157184   9419715  14737469   4467999744
> 107   24157184   9411479  14745705   4431525888
> 108   24157184   9413235  14743949   4559327232
> 109   24157184   9417948  14739236   4500950016
> 110   24157184   9411013  14746171   4566691840
> 111   24157184   9421252  14735932   4522916864
> 112   24157184   9416726  14740458   4537550848
> 113   24157184   9415358  14741826   4676303872
> 114   24157184   9420448  14736736   4526662656
> 
> Group Chain: 0   Parent Inode: 13  Generation: 1172963971
> CRC32:    ECC: 
> ##     Block#       Total   Used    Free    Contig   Size
> 0      4533995520   15872   6339    9533    3987     1984
> 1      4530344960   15872   10755   5117    5117     1984
> 2      2997109760   15872   10753   5119    5119     1984
> 3      4526694400   15872   10753   5119    5119     1984
> 4      3022663680   15872   10753   5119    5119     1984
> 5      4512092160   15872   9043    6829    2742     1984
> 6      4523043840   15872   4948    10924   9612     1984
> 7      4519393280   15872   6150    9722    5595     1984
> 8      4515742720   15872   4323    11549   6603     1984
> 9      3771028480   15872   10753   5119    5119     1984
> ...
> 1513   5523297280   15872   1       15871   15871    1984
> 1514   5526947840   15872   1       15871   15871    1984
> 1515   5530598400   15872   1       15871   15871    1984
> 1516   5534248960   15872   1       15871   15871    1984
> 1517   5537899520   15872   1       15871   15871    1984
> 1518   5541550080   15872   1       15871   15871    1984
> 1519   5545200640   15872   1       15871   15871    1984
> 1520   5548851200   15872   1       15871   15871    1984
> 1521   5552501760   15872   1       15871   15871    1984
> 1522   5556152320   15872   1       15871   15871    1984
> 
> Group Chain: 1   Parent Inode: 13  Generation: 1172963971
> CRC32:    ECC: 
> ##     Block#       Total   Used    Free    Contig   Size
> 0      4548629504   15872   10755   5117    2496     1984
> 1      2993490944   15872   59      15813   14451    1984
> 2      2489713664   15872   10758   5114    3726     1984
> 3      3117609984   15872   3958    11914   6165     1984
> 4      2544472064   15872   10753   5119    5119     1984
> 5      3040948224   15872   10753   5119    5119     1984
> 6      2971587584   15872   10753   5119    5119     1984
> 7      4493871104   15872   8664    7208    3705

Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check

2016-03-24 Thread Michael Ulbrich
Hi Joseph,

thanks a lot for your help. It is very much appreciated!

I ran debugfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file system:

root@s1a:~# debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 >
debugfs_drbd1.log 2>&1

Inode: 13   Mode: 0644   Generation: 1172963971 (0x45ea0283)
FS Generation: 1172963971 (0x45ea0283)
CRC32:    ECC: 
Type: Regular   Attr: 0x0   Flags: Valid System Allocbitmap Chain
Dynamic Features: (0x0)
User: 0 (root)   Group: 0 (root)   Size: 11381315956736
Links: 1   Clusters: 2778641591
ctime: 0x54010183 -- Sat Aug 30 00:41:07 2014
atime: 0x54010183 -- Sat Aug 30 00:41:07 2014
mtime: 0x54010183 -- Sat Aug 30 00:41:07 2014
dtime: 0x0 -- Thu Jan  1 01:00:00 1970
ctime_nsec: 0x -- 0
atime_nsec: 0x -- 0
mtime_nsec: 0x -- 0
Refcount Block: 0
Last Extblk: 0   Orphan Slot: 0
Sub Alloc Slot: Global   Sub Alloc Bit: 7
Bitmap Total: 2778641591   Used: 1083108631   Free: 1695532960
Clusters per Group: 15872   Bits per Cluster: 1
Count: 115   Next Free Rec: 115
##    Total      Used     Free       Block#
0     24173056   9429318  14743738   4533995520
1     24173056   9421663  14751393   4548629504
2     24173056   9432421  14740635   4588817408
3     24173056   9427533  14745523   4548692992
4     24173056   9433978  14739078   4508568576
5     24173056   9436974  14736082   4636369920
6     24173056   9428411  14744645   4563390464
7     24173056   9426950  14746106   4479459328
8     24173056   9428099  14744957   4548851712
9     24173056   9431794  14741262   4585389056
...
105   24157184   9414241  14742943   4690652160
106   24157184   9419715  14737469   4467999744
107   24157184   9411479  14745705   4431525888
108   24157184   9413235  14743949   4559327232
109   24157184   9417948  14739236   4500950016
110   24157184   9411013  14746171   4566691840
111   24157184   9421252  14735932   4522916864
112   24157184   9416726  14740458   4537550848
113   24157184   9415358  14741826   4676303872
114   24157184   9420448  14736736   4526662656

Group Chain: 0   Parent Inode: 13  Generation: 1172963971
CRC32:    ECC: 
##     Block#       Total   Used    Free    Contig   Size
0      4533995520   15872   6339    9533    3987     1984
1      4530344960   15872   10755   5117    5117     1984
2      2997109760   15872   10753   5119    5119     1984
3      4526694400   15872   10753   5119    5119     1984
4      3022663680   15872   10753   5119    5119     1984
5      4512092160   15872   9043    6829    2742     1984
6      4523043840   15872   4948    10924   9612     1984
7      4519393280   15872   6150    9722    5595     1984
8      4515742720   15872   4323    11549   6603     1984
9      3771028480   15872   10753   5119    5119     1984
...
1513   5523297280   15872   1       15871   15871    1984
1514   5526947840   15872   1       15871   15871    1984
1515   5530598400   15872   1       15871   15871    1984
1516   5534248960   15872   1       15871   15871    1984
1517   5537899520   15872   1       15871   15871    1984
1518   5541550080   15872   1       15871   15871    1984
1519   5545200640   15872   1       15871   15871    1984
1520   5548851200   15872   1       15871   15871    1984
1521   5552501760   15872   1       15871   15871    1984
1522   5556152320   15872   1       15871   15871    1984

Group Chain: 1   Parent Inode: 13  Generation: 1172963971
CRC32:    ECC: 
##     Block#       Total   Used    Free    Contig   Size
0      4548629504   15872   10755   5117    2496     1984
1      2993490944   15872   59      15813   14451    1984
2      2489713664   15872   10758   5114    3726     1984
3      3117609984   15872   3958    11914   6165     1984
4      2544472064   15872   10753   5119    5119     1984
5      3040948224   15872   10753   5119    5119     1984
6      2971587584   15872   10753   5119    5119     1984
7      4493871104   15872   8664    7208    3705     1984
8      4544978944   15872   8711    7161    2919     1984
9      4417209344   15872   3253    12619   6447     1984
...
1513   5523329024   15872   1       15871   15871    1984
1514   5526979584   15872   1       15871   15871    1984
1515   5530630144   15872   1       15871   15871    1984
1516   5534280704   15872   1       15871   15871    1984
1517   5537931264   15872   1       15871   15871    1984
1518   5541581824   15872   1       15871   15871    1984
1519   5545232384   15872   1       15871   15871    1984
1520   5548882944   15872   1       15871   15871    1984
1521   5552533504

Re: [Ocfs2-users] fsck.ocfs2 loops + hangs but does not check

2016-03-23 Thread Joseph Qi
Hi Michael,
Could you please use debugfs to check the output?
# debugfs.ocfs2 -R 'stat //global_bitmap' <device>

Thanks,
Joseph

On 2016/3/24 6:38, Michael Ulbrich wrote:
> Hi ocfs2-users,
> 
> my first post to this list from yesterday probably didn't get through.
> 
> Anyway, I've made some progress in the meantime and may now ask more
> specific questions ...
> 
> I'm having issues with an 11 TB ocfs2 shared filesystem on Debian Wheezy:
> 
> Linux s1a 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux
> 
> the kernel modules are:
> 
> modinfo ocfs2 -> version: 1.5.0
> 
> using stock ocfs2-tools 1.6.4-1+deb7u1 from the distribution.
> 
> As an alternative I cloned and built the latest ocfs2-tools from
> markfasheh's ocfs2-tools repository on GitHub, which should be version 1.8.4.
> 
> The filesystem runs on top of drbd, is roughly 40 % used, and has suffered
> from read-only remounts and hanging clients since the last reboot. These
> may be DLM problems, but I suspect they stem from some corrupt disk
> structures. Before that it all ran stable for months.
> 
> This situation made me want to run fsck.ocfs2 and now I wonder how to do
> that. The filesystem is not mounted.
> 
> With the stock ocfs2-tools 1.6.4:
> 
> root@s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1
> fsck.ocfs2 1.6.4
> Checking OCFS2 filesystem in /dev/drbd1:
>   Label:  ocfs2_ASSET
>   UUID:   6A1A0189A3F94E32B6B9A526DF9060F3
>   Number of blocks:   5557283182
>   Block size: 2048
>   Number of clusters: 2778641591
>   Cluster size:   4096
>   Number of slots:16
> 
> I'm checking fsck_drbd1.log and find that it is making progress in
> 
> Pass 0a: Checking cluster allocation chains
> 
> until it reaches "chain 73" and goes into an infinite loop filling the
> logfile with breathtaking speed.
> 
> With the newly built ocfs2-tools 1.8.4 I get:
> 
> root@s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1
> fsck.ocfs2 1.8.4
> Checking OCFS2 filesystem in /dev/drbd1:
>   Label:  ocfs2_ASSET
>   UUID:   6A1A0189A3F94E32B6B9A526DF9060F3
>   Number of blocks:   5557283182
>   Block size: 2048
>   Number of clusters: 2778641591
>   Cluster size:   4096
>   Number of slots:16
> 
> Again watching the verbose output in fsck_drbd1.log I find that this
> time it proceeds up to
> 
> Pass 0a: Checking cluster allocation chains
> o2fsck_pass0:1360 | found inode alloc 13 at block 13
> 
> and stays there without any further progress. I've terminated this
> process after waiting for more than an hour.
> 
> Now I'm somewhat lost ... and would very much appreciate it if anybody on
> this list would share their knowledge and give me a hint on what to do next.
> 
> What could be done to get this file system checked and repaired? Am I
> missing something important or do I just have to wait a little bit
> longer? Is there a version of ocfs2-tools / fsck.ocfs2 which will
> perform as expected?
> 
> I'm prepared to upgrade the kernel to 3.16.0-0.bpo.4-amd64 but shy away
> from taking that risk without any clue of whether that might solve my
> problem ...
> 
> Thanks in advance ... Michael Ulbrich
> 
> ___
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-users
> 
> 



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] fsck.ocfs2

2011-05-16 Thread Xavier Diumé
I don't know if it is possible, but the kernel panic error is not in
/var/log/kern.log.

2011/5/13 Sunil Mushran sunil.mush...@oracle.com

  Please do not remove the cc-s.

 Hard for me to comment without knowing anything about the panic.

 However, I am assuming that the panic message indicated that the volume
 needs to be fsck-ed. In that case, the best course is to umount the
 volume on all nodes and run fsck on one node.


 On 05/13/2011 12:33 PM, Xavier Diumé wrote:

 But initially the system had the devices in /etc/fstab with the _netdev
 option. When the system starts mounting, a kernel panic appears, sometimes
 after a few minutes.
 The only way that I could start the system was to mount all devices one by
 one, each with a previous fsck.
 I don't know if it is the best way, but it is the only one that I've used
 successfully.

 2011/5/13 Sunil Mushran sunil.mush...@oracle.com

 On 05/13/2011 11:44 AM, Xavier Diumé wrote:

 Hello,
 Is it possible to fsck a mounted filesystem? When one of the cluster
 nodes reboots because of a kernel panic, the device seems to require
 fsck.ocfs2, because the rebooted node is shown in mounted.ocfs2 -f.


  If mounted.ocfs2 -f shows the rebooted node, that means the slotmap
 has not been cleaned up as yet. That cleanup happens during node
 recovery. If the volume is still mounted on another node, it will get
 cleaned up momentarily.

 If however it does not get cleaned up, that means that the volume is
 not mounted on any node. In that case, the next mount will clean
 up slotmap.

 Either way one does not need to fsck just to cleanup the slotmap.




 --
 Xavier Diumé
 http://socaqui.cat





-- 
Xavier Diumé
http://socaqui.cat
___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] fsck.ocfs2

2011-05-13 Thread Sunil Mushran
On 05/13/2011 11:44 AM, Xavier Diumé wrote:
 Hello,
 Is it possible to fsck a mounted filesystem? When one of the cluster nodes
 reboots because of a kernel panic, the device seems to require fsck.ocfs2,
 because the rebooted node is shown in mounted.ocfs2 -f.

If mounted.ocfs2 -f shows the rebooted node, that means the slotmap
has not been cleaned up as yet. That cleanup happens during node
recovery. If the volume is still mounted on another node, it will get
cleaned up momentarily.

If however it does not get cleaned up, that means that the volume is
not mounted on any node. In that case, the next mount will clean
up slotmap.

Either way one does not need to fsck just to cleanup the slotmap.
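
In other words, roughly (a sketch; substitute your device):

mounted.ocfs2 -f /dev/sdX            # is the rebooted node still listed?
mount /dev/sdX /mnt && umount /mnt   # a mount/umount cycle cleans the slotmap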

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] fsck.ocfs2

2011-05-13 Thread Xavier Diumé
But initially the system had the devices in /etc/fstab with the _netdev
option. When the system starts mounting, a kernel panic appears, sometimes
after a few minutes.
The only way that I could start the system was to mount all devices one by
one, each with a previous fsck.
I don't know if it is the best way, but it is the only one that I've used
successfully.

2011/5/13 Sunil Mushran sunil.mush...@oracle.com

 On 05/13/2011 11:44 AM, Xavier Diumé wrote:

 Hello,
 Is it possible to fsck a mounted filesystem? When one of the cluster nodes
 reboots because of a kernel panic, the device seems to require fsck.ocfs2,
 because the rebooted node is shown in mounted.ocfs2 -f.


 If mounted.ocfs2 -f shows the rebooted node, that means the slotmap
 has not been cleaned up as yet. That cleanup happens during node
 recovery. If the volume is still mounted on another node, it will get
 cleaned up momentarily.

 If however it does not get cleaned up, that means that the volume is
 not mounted on any node. In that case, the next mount will clean
 up slotmap.

 Either way one does not need to fsck just to cleanup the slotmap.




-- 
Xavier Diumé
http://socaqui.cat
___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?

2010-05-20 Thread Ulf Zimmermann
Correction, kernel modules are 1.4.4; the tools and console are 1.4.3.


 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
 Sent: Thursday, May 20, 2010 6:00 PM
 To: ocfs2-users@oss.oracle.com
 Subject: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?
 
 We are setting up 2 new EL5 U4 machines to replace our current database
 servers running our demo environment. We use 3Par SANs and their snap &
 clone options. The current production system we snap & clone from is EL4
 U5 with ocfs2 1.2.9; the new servers have ocfs2 1.4.3 installed. Part
 of the refresh process is to run fsck.ocfs2 on the volume to recover,
 but right now as I am trying to run it on our 700GB volume it shows a
 virtual memory size of 21.9GB and a resident size of 10GB, and it is
 killing the machine with swapping (24GB physical memory).
 
 Can anyone enlighten what is going on?
 
 Ulf.
 
 
 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?

2010-05-20 Thread Ulf Zimmermann
And upgrading to kernel modules 1.4.7, tools 1.4.4 didn't change the memory 
part:

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
29532 root  18   0 21.9g  10g    4 D 21.1 45.0  0:15.24  fsck.ocfs2
 


 -Original Message-
 From: Ulf Zimmermann
 Sent: Thursday, May 20, 2010 6:06 PM
 To: Ulf Zimmermann; ocfs2-users@oss.oracle.com
 Subject: RE: fsck.ocfs2 using huge amount of memory?
 
 Correction, kernel modules are 1.4.4; the tools and console are 1.4.3.
 
 
  -Original Message-
  From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
  boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
  Sent: Thursday, May 20, 2010 6:00 PM
  To: ocfs2-users@oss.oracle.com
  Subject: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?
 
  We are setting up 2 new EL5 U4 machines to replace our current
 database
  servers running our demo environment. We use 3Par SANs and their snap &
  clone options. The current production system we snap & clone from is
 EL4
  U5 with ocfs2 1.2.9, the new servers have ocfs2 1.4.3 installed. Part
  of the refresh process is to run fsck.ocfs2 on the volume to recover,
  but right now as I am trying to run it on our 700GB volume it shows a
  virtual memory size of 21.9GB, resident of 10GB and it is killing the
  machine with swapping (24GB physical memory).
 
  Can anyone enlighten what is going on?
 
  Ulf.
 
 
  ___
  Ocfs2-users mailing list
  Ocfs2-users@oss.oracle.com
  http://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?

2010-05-20 Thread Joel Becker
On Thu, May 20, 2010 at 06:00:19PM -0700, Ulf Zimmermann wrote:
 We are setting up 2 new EL5 U4 machines to replace our current database 
 servers running our demo environment. We use 3Par SANs and their snap & clone
 options. The current production system we snap & clone from is EL4 U5 with
 ocfs2 1.2.9, the new servers have ocfs2 1.4.3 installed. Part of the refresh 
 process is to run fsck.ocfs2 on the volume to recover, but right now as I am 
 trying to run it on our 700GB volume it shows a virtual memory size of 
 21.9GB, resident of 10GB and it is killing the machine with swapping (24GB 
 physical memory).
 
 Can anyone enlighten what is going on?

How big are your filesystems?  Can we get the output of
debugfs.ocfs2 -R 'stats' /dev/xxx?
Recent fsck.ocfs2 knows how to build its own I/O cache for
significant speed improvements.  It only tries to get as much cache as
the filesystem actually needs, and no more than half of system memory.
That's why I'm asking for your filesystem size - I'm guessing you have
more than 12GB of used space on the filesystem, so fsck.ocfs2 is trying
to grab that much cache. 

Joel

-- 

In the room the women come and go
 Talking of Michaelangelo.

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.bec...@oracle.com
Phone: (650) 506-8127

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?

2010-05-20 Thread Sunil Mushran
http://oss.oracle.com/projects/ocfs2-tools/news/article_8.html

We did make a related change in fsck in that release. Do you mind
creating a bugzilla for this? Do mention the arch. I can then send you
a debug version of the tool that'll tell us why it is behaving like that
on your machine.

On 05/20/2010 06:12 PM, Ulf Zimmermann wrote:
 And upgrading to kernel modules 1.4.7, tools 1.4.4 didn't change the memory 
 part:

    PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  29532 root  18   0 21.9g  10g    4 D 21.1 45.0  0:15.24  fsck.ocfs2



 -Original Message-
 From: Ulf Zimmermann
 Sent: Thursday, May 20, 2010 6:06 PM
 To: Ulf Zimmermann; ocfs2-users@oss.oracle.com
 Subject: RE: fsck.ocfs2 using huge amount of memory?

  Correction, kernel modules are 1.4.4; the tools and console are 1.4.3.


  
 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
 Sent: Thursday, May 20, 2010 6:00 PM
 To: ocfs2-users@oss.oracle.com
 Subject: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?

 We are setting up 2 new EL5 U4 machines to replace our current

 database
  
  servers running our demo environment. We use 3Par SANs and their snap &
  clone options. The current production system we snap & clone from is

 EL4
  
 U5 with ocfs2 1.2.9, the new servers have ocfs2 1.4.3 installed. Part
 of the refresh process is to run fsck.ocfs2 on the volume to recover,
 but right now as I am trying to run it on our 700GB volume it shows a
 virtual memory size of 21.9GB, resident of 10GB and it is killing the
 machine with swapping (24GB physical memory).

 Can anyone enlighten what is going on?

 Ulf.


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] fsck.ocfs2 can't fix an orphaned inode

2010-03-19 Thread Carl J. Benson
Sunil,

Bug 1236. Thanks very much.

-- 
Carl Benson, PHS Linux SysAdmin  (206-667-4862, cben...@fhcrc.org)

On 03/18/2010 11:32 AM, Sunil Mushran wrote:
 One option is to provide me with the o2image of the volume.
 # o2image -r /dev/sda1 - | bzip2 > sda1.out.bz2
 
 File a bugzilla and add the link to that image. (The bz cannot
 handle large files.)
 
 The other option is to file a bz and attach the stat_sysdir output.
 http://oss.oracle.com/~smushran/.debug/scripts/stat_sysdir.sh
 
 Carl J. Benson wrote:
 Hello!

 I searched through the mailing list back to 07/2008, and didn't see
 this question answered before.

 I have 7 systems that use an ocfs2 filesystem. After many months of
 solid reliable use, they all crashed yesterday.

 6 systems run openSUSE 11.1, kernel 2.6.27.29-0.1-default, with these
 RPMs:

 ocfs2-tools-1.4.1-6.9
 ocfs2console-1.4.1-6.9

 1 system has for a week been running openSUSE 11.2, kernel
 2.6.31.12-0.1-default, with these RPMs:

 ocfs2console-1.4.1-25.6.x86_64
 ocfs2-tools-1.4.1-25.6.x86_64
 ocfs2-tools-o2cb-1.4.1-25.6.x86_64

 I still haven't figured out where the corruption started, but the
 problem at the moment is this: After repeated runs of fsck.ocfs2
 (with the filesystem unmounted, of course!), it's fallen into a
 pattern. Here is the output of fsck.ocfs2:

 /root # fsck.ocfs2 /dev/sdc1
 Checking OCFS2 filesystem in /dev/sdc1:
   label:  iscsi_ocfs2_cluster
   uuid:   23 48 29 28 4d 71 44 e6 b4 1d 88 75 c9 69 46 d3
   number of blocks:   268438109
   bytes per block:4096
   number of clusters: 268438109
   bytes per cluster:  4096
   max slots:  20

 pass4: Invalid block number while truncating orphan inode 104935559
 fsck.ocfs2: Invalid block number while trying to replay the orphan
 directory
 fsck encountered errors while recovering slot information, check forced.
 /dev/sdc1 was run with -f, check forced.
 Pass 0a: Checking cluster allocation chains
 Pass 0b: Checking inode allocation chains
 Pass 0c: Checking extent block allocation chains
 Pass 1: Checking inodes and blocks.
 Pass 2: Checking directory entries.
 Pass 3: Checking directory connectivity.
 Pass 4a: checking for orphaned inodes
 [INODE_ORPHANED] Inode 104935559 was found in the orphan directory.
 Delete its contents and unlink it? <y> y
 pass4: Invalid block number while truncating orphan inode 104935559

 [INODE_ORPHANED] Inode 106959312 was found in the orphan directory.
 Delete its contents and unlink it? <y> y
 pass4: Invalid block number while truncating orphan inode 106959312

 Pass 4b: Checking inodes link counts.

 All passes succeeded.

 At this point I reboot the server (named merlot1), run fsck.ocfs2,
 and get exactly the same result.

 What can I do now? I looked at the man page for debugfs.ocfs2,
 but it doesn't look like that's going to help me.

 Any suggestions, please?

 -- 
 Carl Benson  |  cben...@fhcrc.org
 Linux System Administrator   |  Telephone: (206) 667-4862
 Fred Hutchinson Cancer Research Center

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users
   
 

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] fsck.ocfs2 can't fix an orphaned inode

2010-03-18 Thread Sunil Mushran
One option is to provide me with the o2image of the volume.
# o2image -r /dev/sda1 - | bzip2 > sda1.out.bz2

File a bugzilla and add the link to that image. (The bz cannot
handle large files.)

The other option is to file a bz and attach the stat_sysdir output.
http://oss.oracle.com/~smushran/.debug/scripts/stat_sysdir.sh

Carl J. Benson wrote:
 Hello!

 I searched through the mailing list back to 07/2008, and didn't see
 this question answered before.

 I have 7 systems that use an ocfs2 filesystem. After many months of
 solid reliable use, they all crashed yesterday.

 6 systems run openSUSE 11.1, kernel 2.6.27.29-0.1-default, with these
 RPMs:

 ocfs2-tools-1.4.1-6.9
 ocfs2console-1.4.1-6.9

 1 system has for a week been running openSUSE 11.2, kernel
 2.6.31.12-0.1-default, with these RPMs:

 ocfs2console-1.4.1-25.6.x86_64
 ocfs2-tools-1.4.1-25.6.x86_64
 ocfs2-tools-o2cb-1.4.1-25.6.x86_64

 I still haven't figured out where the corruption started, but the
 problem at the moment is this: After repeated runs of fsck.ocfs2
 (with the filesystem unmounted, of course!), it's fallen into a
 pattern. Here is the output of fsck.ocfs2:

 /root # fsck.ocfs2 /dev/sdc1
 Checking OCFS2 filesystem in /dev/sdc1:
   label:  iscsi_ocfs2_cluster
   uuid:   23 48 29 28 4d 71 44 e6 b4 1d 88 75 c9 69 46 d3
   number of blocks:   268438109
   bytes per block:4096
   number of clusters: 268438109
   bytes per cluster:  4096
   max slots:  20

 pass4: Invalid block number while truncating orphan inode 104935559
 fsck.ocfs2: Invalid block number while trying to replay the orphan directory
 fsck encountered errors while recovering slot information, check forced.
 /dev/sdc1 was run with -f, check forced.
 Pass 0a: Checking cluster allocation chains
 Pass 0b: Checking inode allocation chains
 Pass 0c: Checking extent block allocation chains
 Pass 1: Checking inodes and blocks.
 Pass 2: Checking directory entries.
 Pass 3: Checking directory connectivity.
 Pass 4a: checking for orphaned inodes
 [INODE_ORPHANED] Inode 104935559 was found in the orphan directory.
  Delete its contents and unlink it? <y> y
 pass4: Invalid block number while truncating orphan inode 104935559

 [INODE_ORPHANED] Inode 106959312 was found in the orphan directory.
  Delete its contents and unlink it? <y> y
 pass4: Invalid block number while truncating orphan inode 106959312

 Pass 4b: Checking inodes link counts.

 All passes succeeded.

 At this point I reboot the server (named merlot1), run fsck.ocfs2,
 and get exactly the same result.

 What can I do now? I looked at the man page for debugfs.ocfs2,
 but it doesn't look like that's going to help me.

 Any suggestions, please?

 --
 Carl Benson  |  cben...@fhcrc.org
 Linux System Administrator   |  Telephone: (206) 667-4862
 Fred Hutchinson Cancer Research Center

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users
   


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users