Re: [Ocfs2-users] [Ocfs2-devel] size increase

2015-03-17 Thread Sunil Mushran
This is because you are specifying a 128k cluster size. Refer to man
mkfs.ocfs2 for more details.
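To illustrate why 'du' reports a larger total on a large-cluster volume, every
file is rounded up to a whole number of clusters on disk. The file size below
is hypothetical; the two cluster sizes mirror a 4k ext4 block vs the 128k
OCFS2 cluster mentioned here:

```shell
# Each file occupies a whole number of clusters, so a small file on a
# 128 KB-cluster OCFS2 volume still allocates a full 128 KB cluster,
# while the same file on a 4 KB-cluster volume allocates far less.
FILE_SIZE=1024          # a 1 KB file (hypothetical)
for CLUSTER in 4096 131072; do
    ALLOCATED=$(( ( (FILE_SIZE + CLUSTER - 1) / CLUSTER ) * CLUSTER ))
    echo "cluster=${CLUSTER} allocated=${ALLOCATED}"
done
```

'du -hs' sums the allocated sizes, which is why the same tree looks much
bigger with a 128k cluster size.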
On Mar 17, 2015 8:04 PM, "Umarzuki Mochlis"  wrote:

> Hi,
>
> What I meant by total size is the output of 'du -hs'.
>
> I can see that the fdisk output on mpath1 of the OCFS2 LUN is similar to
> that of the logical volume of the ext4 partition (255 heads & 63 sectors).
>
> It is a two-node OCFS2 cluster.
>
> 2015-03-18 10:50 GMT+08:00 Xue jiufei :
> > Hi Umarzuki,
> > What is the meaning of total size, file size or disk usage?
> > If you mean the disk usage, I think maybe the difference of
> > cluster size (the minimum allocation unit) is the cause.
> > Have you noticed the cluster size or block size of your ocfs2
> > and ext4 filesystems?
> >
> > Thanks,
> > Xuejiufei
> >
>
___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] OCFS2 “Heartbeat generation mismatch on device” error when mounting iscsi target

2015-02-09 Thread Sunil Mushran
If 'ps aux | grep o2hb' does not return anything, it means you are using
local heartbeat.

That suggests you have a mismatched ocfs2.conf file, and I suspect the node
where this is failing is the one that has the bad copy. Compare the config
files from all the nodes and ensure they are identical. Or you could simply
replace the one on the failing node with a copy from another node. The file
should be the same everywhere. Remember to restart the cluster on that node.
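One quick way to do that comparison is to checksum the config collected from
each node. The directory and file names below are made up for the sketch; in
practice you would first scp the cluster config from every node into one
local directory:

```shell
# Sketch: one distinct checksum means the configs match everywhere;
# more than one means at least one node has a different copy.
CONF_DIR=$(mktemp -d)
printf 'cluster config A\n' > "$CONF_DIR/node1.conf"
printf 'cluster config A\n' > "$CONF_DIR/node2.conf"
printf 'cluster config B\n' > "$CONF_DIR/node3.conf"   # the odd one out
DISTINCT=$(md5sum "$CONF_DIR"/*.conf | awk '{print $1}' | sort -u | wc -l)
echo "distinct configs: ${DISTINCT}"
```

Any node whose checksum differs from the rest is the one whose config should
be replaced before restarting its cluster stack.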

On Mon, Feb 9, 2015 at 2:27 PM, Danijel Krmar <
danijel.kr...@activecollab.com> wrote:

> No, nothing there:
> $ ps aux | grep o2hb
> root  5724  0.0  0.0   8320   888 pts/0    S+   22:30   0:00 grep --color o2hb
>
> Still the same error if I try to mount the iSCSI disk:
> o2hb_check_own_slot:590 ERROR: Heartbeat generation mismatch on device
> (sdb): expected(2:0x2f32486d4c54730a, 0x54d926d7),
> ondisk(2:0xb016e6a72676a791, 0x54d926d7)
>
> As I said, there are no such problems on other machines, just this one. I
> can't get my head around this "Heartbeat generation mismatch" error
> message.
>
> --
> Danijel Krmar
> A51 D.O.O.
> Novi Sad
> https://www.activecollab.com/
>
> On February 9, 2015 at 8:09:06 PM, Sunil Mushran (sunil.mush...@gmail.com)
> wrote:
>
> On node 2, do:
> ps aux | grep o2hb
>
> I suspect you have multiple o2hb threads running. If so, restart the o2cb
> cluster on that node.
>
> On Mon, Feb 9, 2015 at 10:08 AM, Danijel Krmar <
> danijel.kr...@activecollab.com> wrote:
>
>>   As said in the title, when I want to mount an iSCSI target on one
>> machine I get the following error:
>>
>> (o2hb-3F92114867,7826,3):o2hb_check_own_slot:590 ERROR: Heartbeat generation 
>> mismatch on device (sdb): expected(2:0xa0cf28215b4b1ed3, 0x54d8a036), 
>> ondisk(2:0xb016e6a72676a791, 0x54d8a037)
>>
>>  The same iSCSI target is working on other machines.
>>
>> Any idea what this error means?
>>
>>  --
>> Danijel Krmar
>>  A51 D.O.O.
>>  Novi Sad
>>  https://www.activecollab.com/
>>
>
>

Re: [Ocfs2-users] OCFS2 “Heartbeat generation mismatch on device” error when mounting iscsi target

2015-02-09 Thread Sunil Mushran
On node 2, do:
ps aux | grep o2hb

I suspect you have multiple o2hb threads running. If so, restart the o2cb
cluster on that node.
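A quick way to check for duplicate heartbeat threads is the bracket trick,
which keeps grep from matching itself (on a box with no OCFS2 volume mounted
this prints 0):

```shell
# Count o2hb heartbeat threads; with local heartbeat there should be one
# per mounted OCFS2 device, so a higher count suggests duplicate threads
# and an o2cb cluster restart is in order.
COUNT=$(ps aux | grep -c '[o]2hb')
echo "o2hb threads: ${COUNT}"
```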

On Mon, Feb 9, 2015 at 10:08 AM, Danijel Krmar <
danijel.kr...@activecollab.com> wrote:

> As said in the title, when I want to mount an iSCSI target on one machine I
> get the following error:
>
> (o2hb-3F92114867,7826,3):o2hb_check_own_slot:590 ERROR: Heartbeat generation 
> mismatch on device (sdb): expected(2:0xa0cf28215b4b1ed3, 0x54d8a036), 
> ondisk(2:0xb016e6a72676a791, 0x54d8a037)
>
> The same iSCSI target is working on other machines.
>
> Any idea what this error means?
>
> --
> Danijel Krmar
> A51 D.O.O.
> Novi Sad
> https://www.activecollab.com/
>

Re: [Ocfs2-users] How to unlock a blocked resource? Thanks

2014-09-10 Thread Sunil Mushran
What is the output of the commands? The protocol is supposed to do the
unlocking on its own. See what it is blocked on. It could be that the node
that has the lock cannot unlock it because it cannot flush the journal to
disk.
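A small sketch of chaining the two debugfs.ocfs2 commands quoted below: parse
the busy-lock listing for lock IDs, then feed each ID to 'dlm_locks'. The
'Lockres:' field name and the sample line are assumptions about the fs_locks
output format; check what your debugfs.ocfs2 version actually prints:

```shell
# Extract lock IDs from 'fs_locks -B' style output so each can be passed
# to "dlm_locks <id>". The sample line stands in for a live volume here.
busy_lock_ids() { awk '$1 == "Lockres:" { print $2 }'; }
printf 'Lockres: M00000000000000000004d2  Mode: Exclusive\n' | busy_lock_ids
# On a live system (device path hypothetical):
#   debugfs.ocfs2 -R "fs_locks -B" /dev/dm-0 | busy_lock_ids |
#       while read -r id; do debugfs.ocfs2 -R "dlm_locks $id" /dev/dm-0; done
```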

On Tue, Sep 9, 2014 at 7:55 PM, Guozhonghua  wrote:

>  Hi All:
>
>
>
> We are testing with two nodes in one OCFS2 cluster.
>
> The cluster hangs, possibly because of a deadlock.
>
> Using the debugfs.ocfs2 tool, we found that one resource has been held by
> one node for a long time while the other node is still waiting for it.
>
> So the cluster hangs.
>
>
>
> debugfs.ocfs2 -R "fs_locks -B" /dev/dm-0
>
> debugfs.ocfs2 -R "dlm_locks LOCKID_XXX" /dev/dm-0
>
>
>
> How can we unlock the lock held by that node? Are there commands to
> unlock the resource?
>
>
>
> Thanks.
>
>

Re: [Ocfs2-users] OCFS2 slow when using 'find' and 'du' commands

2014-05-22 Thread Sunil Mushran
Is this slow the second time you run the command or only the first? How
much memory do you have?

-mmin needs the inode. And reading inodes from disk is expensive. One
reason could be that the system does not have enough memory to cache the
inodes and thus is triggering lots of disk reads.
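One way to test the cache theory is to time the same scan twice: a large
cold-vs-warm gap points at inode reads from disk rather than the command
itself. The directory below is a temporary stand-in for the real OCFS2 mount
point:

```shell
# Run the same scan twice: the first (cold) run must read inodes from
# disk, the second (warm) run can be served from the inode cache if
# there is enough memory.
DIR=$(mktemp -d)              # stand-in; point this at your OCFS2 directory
for run in cold warm; do
    START=$(date +%s)
    find "$DIR" -type f -mmin -60 > /dev/null
    echo "$run run: $(( $(date +%s) - START ))s"
done
```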


On Thu, May 22, 2014 at 4:23 PM, Robert Abbate wrote:

> We've noticed that once we have grown to thousands (21,839) of
> sub-directories, running Linux commands such as 'find' and 'du' is very
> slow. We compared them by running the same commands directly on a hard
> disk with the same file and directory contents:
>
> find ./directory/* -type f -mmin -60
>
> (21,839 directories)
> ocfs2 = 10 minutes
> direct  = 30 seconds
>
> Are there any tweaks we need to make to help improve performance of these
> commands when many sub-directories exist?
>
> ocfs2 nodes = 3
>
> debugfs.ocfs2 1.6.3
> Feature Compat: 3 backup-super strict-journal-super
> Feature Incompat: 9808 sparse inline-data xattr indexed-dirs
> discontig-bg
> Feature RO compat: 1 unwritten
> Dynamic Features: (0x0)
>

Re: [Ocfs2-users] OCFS2 and PHP is it related to ocfs2 ?

2014-05-02 Thread Sunil Mushran
So these are all spinning to get the lock. You need to find the lock
holder. Dump the stacks of all processes using the fs. Most of the stacks
should be similar to the one above. The useful ones will be the stacks
that are not similar to the above.
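A loop like the following dumps every matching stack in one pass so the
outlier stands out. The process name php-fpm is taken from the ps listing in
this thread; reading /proc/PID/stack generally requires root:

```shell
# Dump the kernel stack of every php-fpm worker; the one stack that does
# not look like the spinning readers is the likely lock holder.
for pid in $(pgrep php-fpm); do
    echo "=== pid $pid ==="
    cat "/proc/$pid/stack" 2>/dev/null
done
```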


On Fri, May 2, 2014 at 8:48 AM,  wrote:

>  Thanks for your help! attached i execute the command for two PHP-FPM
> pids, which are in state 'r' = Running and consume 30% cpu time
>
> root@ispconfig:~# cat /proc/5236/stack
> [] ocfs2_inode_cache_io_lock+0x12/0x20 [ocfs2]
> [] ocfs2_metadata_cache_io_lock+0x19/0x20 [ocfs2]
> [] ocfs2_read_blocks+0xe9/0x6d0 [ocfs2]
> [] ocfs2_read_inode_block_full+0x3b/0x60 [ocfs2]
> [] ocfs2_read_inode_block+0x10/0x20 [ocfs2]
> [] ocfs2_iop_get_acl+0x3d/0x80 [ocfs2]
> [] check_acl+0xb9/0x133
> [] generic_permission+0x107/0x120
> [] ocfs2_permission+0x9b/0x100 [ocfs2]
> [] __inode_permission+0x76/0xd0
> [] inode_permission+0x18/0x50
> [] link_path_walk+0x24a/0x8b0
> [] path_lookupat+0x58/0x780
> [] filename_lookup+0x34/0xc0
> [] user_path_at_empty+0x59/0xa0
> [] user_path_at+0x11/0x20
> [] vfs_fstatat+0x51/0xb0
> [] vfs_stat+0x1b/0x20
> [] SYSC_newstat+0x15/0x30
> [] SyS_newstat+0xe/0x10
> [] system_call_fastpath+0x1a/0x1f
> [] 0x
>
> Second PID
>  root@ispconfig:~# cat /proc/5329/stack
> [] __cond_resched+0x2a/0x40
> [] ocfs2_inode_cache_io_lock+0x12/0x20 [ocfs2]
> [] ocfs2_metadata_cache_io_lock+0x19/0x20 [ocfs2]
> [] ocfs2_read_blocks+0xe9/0x6d0 [ocfs2]
> [] ocfs2_read_inode_block_full+0x3b/0x60 [ocfs2]
> [] ocfs2_read_inode_block+0x10/0x20 [ocfs2]
> [] ocfs2_iop_get_acl+0x3d/0x80 [ocfs2]
> [] check_acl+0xb9/0x133
> [] generic_permission+0x107/0x120
> [] ocfs2_permission+0x9b/0x100 [ocfs2]
> [] __inode_permission+0x76/0xd0
> [] inode_permission+0x18/0x50
> [] link_path_walk+0x24a/0x8b0
> [] link_path_walk+0x49c/0x8b0
> [] path_lookupat+0x58/0x780
> [] filename_lookup+0x34/0xc0
> [] user_path_at_empty+0x59/0xa0
> [] user_path_at+0x11/0x20
> [] SyS_faccessat+0x9c/0x220
> [] SyS_access+0x18/0x20
> [] system_call_fastpath+0x1a/0x1f
> [] 0x
>
>
> *Gesendet:* Freitag, 02. Mai 2014 um 17:16 Uhr
> *Von:* "Sunil Mushran" 
> *An:* molo@web.de
> *Cc:* Ocfs2-users@oss.oracle.com
> *Betreff:* Re: Aw: Re: [Ocfs2-users] OCFS2 and PHP is it related to ocfs2
> ?
>
> Dump some kernel/user stacks to see if we can narrow down the loop it is
> spinning in.
>
cat /proc/PID/stack will show the kernel stack.
pstack should show the user stack.
> On May 2, 2014 8:12 AM,  wrote:
>>
>>   its PHP-FPM
>>
>> root  1951  1.5  0.1 362344  7704 ?Ss   17:09   0:01 php-fpm:
>> master process (/etc/php5/fpm/php-fpm.conf)
>> www-data  1953  0.0  0.1 360340  4868 ?S17:09   0:00
>> php-fpm: pool www
>> www-data  1954  0.0  0.1 360340  4868 ?S17:09   0:00
>> php-fpm: pool www
>> www-data  1955  0.0  0.1 360340  4868 ?S17:09   0:00
>> php-fpm: pool www
>> www-data  1956  0.0  0.1 360340  4868 ?S17:09   0:00
>> php-fpm: pool www
>> web1  2439 26.5  0.7 376132 30328 ?R17:10   0:17 php-fpm:
>> pool web1
>> web1  2442 18.3  0.6 375592 25524 ?R17:10   0:11 php-fpm:
>> pool web1
>> web1  2451 21.3  0.5 374200 21792 ?R17:10   0:08 php-fpm:
>> pool web1
>> web1  2453 27.0  0.3 366288 13960 ?R17:10   0:11 php-fpm:
>> pool web1
>> web1  2454 25.5  0.2 364068  9244 ?R17:10   0:10 php-fpm:
>> pool web1
>> web1  2455 31.6  0.2 364072  9716 ?R17:10   0:12 php-fpm:
>> pool web1
>> web1  2456 19.4  0.2 364312  9048 ?R17:10   0:07 php-fpm:
>> pool web1
>> web1  2458 22.9  0.2 364068  9108 ?R17:10   0:08 php-fpm:
>> pool web1
>> web1  2459 26.7  0.2 364068  9152 ?R17:10   0:10 php-fpm:
>> pool web1
>> web1  2460 19.3  0.2 364020  9136 ?R17:10   0:07 php-fpm:
>> pool web1
>> web1  2461 23.4  0.2 364068  9092 ?R17:10   0:09 php-fpm:
>> pool web1
>> web1  2462 19.6  0.2 364068  8948 ?R17:10   0:07 php-fpm:
>> pool web1
>> web1  2463 23.7  0.2 364072  8988 ?R17:10   0:09 php-fpm:
>> pool web1
>> web1  2466 27.2  0.2 364068  9072 ?R17:10   0:10 php-fpm:
>> pool web1
>> web1  2471 24.2  0.2 364040  9160 ?R17:10   0:08 php-fpm:
>> pool web1
>> web1  2472 20.7  0.2 364068  8948 ?R17:10   0:07 php-fpm:
>> pool web1
>> web1  2473 21.2  0.2 364068  8912 ?   

Re: [Ocfs2-users] OCFS2 and PHP is it related to ocfs2 ?

2014-05-02 Thread Sunil Mushran
Dump some kernel/user stacks to see if we can narrow down the loop it is
spinning in.

cat /proc/PID/stack will show the kernel stack.
pstack should show the user stack.
On May 2, 2014 8:12 AM,  wrote:

> its PHP-FPM
>
> root  1951  1.5  0.1 362344  7704 ?Ss   17:09   0:01 php-fpm:
> master process (/etc/php5/fpm/php-fpm.conf)
> www-data  1953  0.0  0.1 360340  4868 ?S17:09   0:00 php-fpm:
> pool www
> www-data  1954  0.0  0.1 360340  4868 ?S17:09   0:00 php-fpm:
> pool www
> www-data  1955  0.0  0.1 360340  4868 ?S17:09   0:00 php-fpm:
> pool www
> www-data  1956  0.0  0.1 360340  4868 ?S17:09   0:00 php-fpm:
> pool www
> web1  2439 26.5  0.7 376132 30328 ?R17:10   0:17 php-fpm:
> pool web1
> web1  2442 18.3  0.6 375592 25524 ?R17:10   0:11 php-fpm:
> pool web1
> web1  2451 21.3  0.5 374200 21792 ?R17:10   0:08 php-fpm:
> pool web1
> web1  2453 27.0  0.3 366288 13960 ?R17:10   0:11 php-fpm:
> pool web1
> web1  2454 25.5  0.2 364068  9244 ?R17:10   0:10 php-fpm:
> pool web1
> web1  2455 31.6  0.2 364072  9716 ?R17:10   0:12 php-fpm:
> pool web1
> web1  2456 19.4  0.2 364312  9048 ?R17:10   0:07 php-fpm:
> pool web1
> web1  2458 22.9  0.2 364068  9108 ?R17:10   0:08 php-fpm:
> pool web1
> web1  2459 26.7  0.2 364068  9152 ?R17:10   0:10 php-fpm:
> pool web1
> web1  2460 19.3  0.2 364020  9136 ?R17:10   0:07 php-fpm:
> pool web1
> web1  2461 23.4  0.2 364068  9092 ?R17:10   0:09 php-fpm:
> pool web1
> web1  2462 19.6  0.2 364068  8948 ?R17:10   0:07 php-fpm:
> pool web1
> web1  2463 23.7  0.2 364072  8988 ?R17:10   0:09 php-fpm:
> pool web1
> web1  2466 27.2  0.2 364068  9072 ?R17:10   0:10 php-fpm:
> pool web1
> web1  2471 24.2  0.2 364040  9160 ?R17:10   0:08 php-fpm:
> pool web1
> web1  2472 20.7  0.2 364068  8948 ?R17:10   0:07 php-fpm:
> pool web1
> web1  2473 21.2  0.2 364068  8912 ?R17:10   0:07 php-fpm:
> pool web1
> web1  2482 19.8  0.2 364068  8924 ?R17:10   0:07 php-fpm:
> pool web1
> web1  2483 22.0  0.2 364068  8964 ?R17:10   0:07 php-fpm:
> pool web1
> web1  2484 22.4  0.2 364068  8984 ?R17:10   0:08 php-fpm:
> pool web1
> web1  2485 22.2  0.2 364068  8904 ?R17:10   0:07 php-fpm:
> pool web1
> web1  2486 16.5  0.2 364068  8852 ?R17:10   0:05 php-fpm:
> pool web1
> web1  2487 22.4  0.2 364076  8864 ?R17:10   0:07 php-fpm:
> pool web1
> web1  2488 24.3  0.2 364068  9020 ?R17:10   0:08 php-fpm:
> pool web1
> web1  2499 29.7  0.2 364068  9080 ?R17:10   0:10 php-fpm:
> pool web1
> web1  2500 19.8  0.2 364068  8996 ?R17:10   0:06 php-fpm:
> pool web1
> web1  2502 31.3  0.2 364068  9168 ?R17:10   0:10 php-fpm:
> pool web1
> web1  2503 20.9  0.2 364020  8984 ?R17:10   0:07 php-fpm:
> pool web1
>  root  2277  1.3  0.4 438232 16256 ?Ss   17:09   0:01
> /usr/sbin/apache2 -k start
> www-data  2283  0.3  0.0 134164  3928 ?S17:09   0:00
> /usr/sbin/apache2 -k start
> www-data  2293  2.2  0.2 439204  9904 ?S17:09   0:02
> /usr/sbin/apache2 -k start
> www-data  2294  1.4  0.2 439572 10616 ?S17:09   0:01
> /usr/sbin/apache2 -k start
> www-data  2295  2.7  0.2 439288 10016 ?S17:09   0:02
> /usr/sbin/apache2 -k start
> www-data  2296  2.9  0.2 439632 10364 ?S17:09   0:03
> /usr/sbin/apache2 -k start
> www-data  2297  2.9  0.2 439204 10052 ?S17:09   0:03
> /usr/sbin/apache2 -k start
> www-data  2440  3.5  0.2 439196 10016 ?S17:10   0:03
> /usr/sbin/apache2 -k start
> www-data  2443  1.5  0.2 439300 10160 ?S17:10   0:01
> /usr/sbin/apache2 -k start
> www-data  2444  0.2  0.2 439188  9908 ?S17:10   0:00
> /usr/sbin/apache2 -k start
> www-data  2445  0.5  0.2 439188  9908 ?S17:10   0:00
> /usr/sbin/apache2 -k start
> www-data  2446  0.4  0.2 439188  9908 ?S17:10   0:00
> /usr/sbin/apache2 -k start
> www-data  2447  0.4  0.2 439188  9908 ?S17:10   0:00
> /usr/sbin/apache2 -k start
> www-data  2448  0.4  0.2 439188  9908 ?S17:10   0:00
> /usr/sbin/apache2 -k start
> www-data  2449  0.3  0.2 439188  9908 ?S17:10   0:00
> /usr/sbin/apache2 -k start
> www-data  2450  0.5  0.2 439188  9908 ?S17:10   0:00
> /usr/sbin/apache2 -k start
> www-data  2457  0.9  0.2 439188  9908 ?S17:10   0:00
> /usr/sbin/apache2 -k start
> www-data  2464  0.2  0.2 439188  9908 ?S17:10   0:00
> /usr/sbin/apache2 -k start
> www-data  2465  0.3  0.2 439188  9908 ?S17:10   0:00
> /usr/sbin/apache2 -k start
> www-data  2467  0.5  

Re: [Ocfs2-users] OCFS2 and PHP is it related to ocfs2 ?

2014-05-02 Thread Sunil Mushran
Which process is pegging the CPU?
On May 2, 2014 6:12 AM,  wrote:

> We have two nodes which are serving PHP webpages with PHP5-FPM. Both nodes
> are configured with DRBD in dual-primary mode.
> In our tests, if one of these two nodes gets 10-20 page refreshes at the
> same time, the CPU is at 100%.
> For these 10-20 page requests the server is busy for 1.5 minutes.
>
> Without OCFS2 (hard drive formatted with ext4), the same 10-20 requests
> take only 0.1 seconds to complete.
> We only see a long-running PHP process which sometimes times out with
> '30 seconds exceeded for including a file'. We tried mounting the ocfs2
> device with 'noatime, nodiratime, data=writeback', but it didn't help.
>
> Does someone have a tip for us?
> (If we use OCFS2 with iSCSI, there is no change.)
>
> thanks a lot!
>

Re: [Ocfs2-users] FSCK may be failing and corrupting my disk???

2014-03-24 Thread Sunil Mushran
fsck cannot determine which of the two inodes is incorrect. In such cases,
fsck makes a copy of one of the inodes (with data) and asks the user to
delete the bad file after mounting.


On Sun, Mar 23, 2014 at 7:18 AM, Eric Raskin  wrote:

>  I did some more research by running an fsck -fn.  Basically it is one
> inode that is wrong and needs to be cleared.  Is there a way to do that via
> debugfs?  If I can delete that one inode, then all the doubly-linked
> clusters will not be doubly linked any more and all of the errors will go
> away.
>
> Isn't that quicker than cloning a bad inode?
>
>
> On 03/22/2014 09:40 PM, Sunil Mushran wrote:
>
> Cloning the inode means inode + data. Let it finish.
>
>
> On Sat, Mar 22, 2014 at 3:44 PM, Eric Raskin  wrote:
>
>>  Hi:
>>
>> I am running a two-node Oracle VM Server 2.2.2 installation.   We were
>> having some strange problems creating new virtual machines, so I shut down
>> the systems and unmounted the OVS Repository (ocfs2 file system on
>> Equallogic equipment).
>>
>> I ran a fsck -y first, which replayed the logs and said all was clean.
>> But, I am pretty sure there are other issues, so I started an fsck -fy
>>
>> One of the messages I got was:
>>
>> Cluster 161213953 is claimed by the following inodes:
>>   <76289548>
>>   /running_pool/450_gebidb/System.img
>> [DUP_CLUSTERS_CLONE] Inode "(null)" may be cloned or deleted to break the
>> claim it has on its clusters. Clone inode "(null)" to break claims on
>> clusters it shares with other inodes? y
>>
>> I then watched with an strace -p  to see what was
>> happening, since it was taking a long time with no messages.  I see:
>>
>> pwrite64(3,
>> "INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0"..., 4096,
>> 90112) = 4096
>> pwrite64(3,
>> "EXBLK01\0\0\0\0\0\0\0\0\0\0\0+\3H\26O}\306\374&\0\0\0\0\0"..., 4096,
>> 10465599488) = 4096
>> pwrite64(3,
>> "GROUP01\0\300\17\0\4P\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096,
>> 10462699520) = 4096
>> pwrite64(3,
>> "INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0"..., 4096,
>> 90112) = 4096
>> pwrite64(3,
>> "EXBLK01\0\0\0\0\0\0\0\0\0\0\0/\3H\26O}\302\374&\0\0\0\0\0"..., 4096,
>> 10465583104) = 4096
>> pwrite64(3,
>> "GROUP01\0\300\17\0\4Q\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096,
>> 10462699520) = 4096
>> pwrite64(3,
>> "INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0"..., 4096,
>> 90112) = 4096
>> pwrite64(3,
>> "EXBLK01\0\0\0\0\0\0\0\0\0\0\0003\3H\26O}\274\374&\0\0\0\0\0"..., 4096,
>> 10465558528) = 4096
>> pwrite64(3,
>> "INODE01\0H\26O}\0\0L\0\0\0\0\0\24\346\17\0\0\0\0\0\0\0\0\0"..., 4096,
>> 2686701568) = 4096
>> pwrite64(3, "GROUP01\0\300\17\0~\3\0#\0H\26O}\0\0\0\0\0n\0\1\0\0\0\0"...,
>> 4096, 100940120064) = 4096
>> pwrite64(3,
>> "INODE01\0H\26O}\377\377\7\0\0\0\0\0\0\6\0\30\0\0\0\0\0\0\0\0"..., 4096,
>> 45056) = 4096
>> pwrite64(3,
>> "GROUP01\0\300\17\0\4P\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096,
>> 10462699520) = 4096
>> pwrite64(3,
>> "INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0"..., 4096,
>> 90112) = 4096
>> pwrite64(3,
>> "EXBLK01\0\0\0\0\0\0\0\0\0\0\0\272\2H\26O}\274\374&\0\0\0\0\0"..., 4096,
>> 10465558528) = 4096
>> pwrite64(3,
>> "EXBLK01\0\0\0\0\0\0\0\0\0\0\0003\3H\26O}\274\374&\0\0\0\0\0"..., 4096,
>> 10465558528) = 4096
>> pwrite64(3,
>> "GROUP01\0\300\17\0\4O\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096,
>> 10462699520) = 4096
>> pwrite64(3,
>> "INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0"..., 4096,
>> 90112) = 4096
>>
>> This is going on and on.  It looks like it is writing lots of entries to
>> fix one duplicate inode???
>>
>> At this point, I have aborted the fsck, as I am worried that it is
>> completely trashing our OVS repository disk.
>>
>> Can anybody shed some light on this before I restart the fsck?  We need
>> to be back up and running ASAP!
>>
>> Thanks in advance!
>> --
>>
>> ---
>>   Eric H. Raskin 914-765-0500 x120  Professional
>> Advertising Systems Inc. 914-765-0503 fax  200 Business Park Dr Suite 304
>> eras...@paslists.com  Armonk, NY 10504 http://www.paslists.com
>>
>
>
> --
>
> ---
>   Eric H. Raskin 914-765-0500 x120  Professional Advertising Systems Inc.
> 914-765-0503 fax  200 Business Park Dr Suite 304 eras...@paslists.com  Armonk,
> NY 10504 http://www.paslists.com
>

Re: [Ocfs2-users] FSCK may be failing and corrupting my disk???

2014-03-22 Thread Sunil Mushran
Cloning the inode means inode + data. Let it finish.


On Sat, Mar 22, 2014 at 3:44 PM, Eric Raskin  wrote:

>  Hi:
>
> I am running a two-node Oracle VM Server 2.2.2 installation.   We were
> having some strange problems creating new virtual machines, so I shut down
> the systems and unmounted the OVS Repository (ocfs2 file system on
> Equallogic equipment).
>
> I ran a fsck -y first, which replayed the logs and said all was clean.
> But, I am pretty sure there are other issues, so I started an fsck -fy
>
> One of the messages I got was:
>
> Cluster 161213953 is claimed by the following inodes:
>   <76289548>
>   /running_pool/450_gebidb/System.img
> [DUP_CLUSTERS_CLONE] Inode "(null)" may be cloned or deleted to break the
> claim it has on its clusters. Clone inode "(null)" to break claims on
> clusters it shares with other inodes? y
>
> I then watched with an strace -p  to see what was happening,
> since it was taking a long time with no messages.  I see:
>
> pwrite64(3,
> "INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0"..., 4096,
> 90112) = 4096
> pwrite64(3,
> "EXBLK01\0\0\0\0\0\0\0\0\0\0\0+\3H\26O}\306\374&\0\0\0\0\0"..., 4096,
> 10465599488) = 4096
> pwrite64(3,
> "GROUP01\0\300\17\0\4P\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096,
> 10462699520) = 4096
> pwrite64(3,
> "INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0"..., 4096,
> 90112) = 4096
> pwrite64(3,
> "EXBLK01\0\0\0\0\0\0\0\0\0\0\0/\3H\26O}\302\374&\0\0\0\0\0"..., 4096,
> 10465583104) = 4096
> pwrite64(3,
> "GROUP01\0\300\17\0\4Q\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096,
> 10462699520) = 4096
> pwrite64(3,
> "INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0"..., 4096,
> 90112) = 4096
> pwrite64(3,
> "EXBLK01\0\0\0\0\0\0\0\0\0\0\0003\3H\26O}\274\374&\0\0\0\0\0"..., 4096,
> 10465558528) = 4096
> pwrite64(3,
> "INODE01\0H\26O}\0\0L\0\0\0\0\0\24\346\17\0\0\0\0\0\0\0\0\0"..., 4096,
> 2686701568) = 4096
> pwrite64(3, "GROUP01\0\300\17\0~\3\0#\0H\26O}\0\0\0\0\0n\0\1\0\0\0\0"...,
> 4096, 100940120064) = 4096
> pwrite64(3,
> "INODE01\0H\26O}\377\377\7\0\0\0\0\0\0\6\0\30\0\0\0\0\0\0\0\0"..., 4096,
> 45056) = 4096
> pwrite64(3,
> "GROUP01\0\300\17\0\4P\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096,
> 10462699520) = 4096
> pwrite64(3,
> "INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0"..., 4096,
> 90112) = 4096
> pwrite64(3,
> "EXBLK01\0\0\0\0\0\0\0\0\0\0\0\272\2H\26O}\274\374&\0\0\0\0\0"..., 4096,
> 10465558528) = 4096
> pwrite64(3,
> "EXBLK01\0\0\0\0\0\0\0\0\0\0\0003\3H\26O}\274\374&\0\0\0\0\0"..., 4096,
> 10465558528) = 4096
> pwrite64(3,
> "GROUP01\0\300\17\0\4O\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096,
> 10462699520) = 4096
> pwrite64(3,
> "INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0"..., 4096,
> 90112) = 4096
>
> This is going on and on.  It looks like it is writing lots of entries to
> fix one duplicate inode???
>
> At this point, I have aborted the fsck, as I am worried that it is
> completely trashing our OVS repository disk.
>
> Can anybody shed some light on this before I restart the fsck?  We need to
> be back up and running ASAP!
>
> Thanks in advance!
> --
>
> ---
>   Eric H. Raskin 914-765-0500 x120  Professional Advertising Systems Inc.
> 914-765-0503 fax  200 Business Park Dr Suite 304 eras...@paslists.com  Armonk,
> NY 10504 http://www.paslists.com
>

Re: [Ocfs2-users] How do I check fragmentation amount?

2013-11-01 Thread Sunil Mushran
debugfs.ocfs2 -R "frag filespec" DEVICE will show you the fragmentation
level on an inode basis. You could run that for all inodes and figure out
the value for the entire volume.
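A sketch of aggregating per-inode results into a volume-wide figure. The
'extents:' field name and the sample lines below are assumptions about the
frag output, not its documented format; adapt the awk to what your
debugfs.ocfs2 version actually prints:

```shell
# Average an assumed per-inode "extents: N" field across many lines of
# 'frag' output; the sample input stands in for a real volume scan.
avg_extents() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "extents:") { sum += $(i+1); n++ } }
         END { if (n) printf "%.1f\n", sum / n }'
}
printf 'Inode: 12 clusters: 4 extents: 2\nInode: 13 clusters: 8 extents: 6\n' | avg_extents
```

The closer the average extents-per-inode figure is to 1, the less fragmented
the volume.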


On Fri, Nov 1, 2013 at 3:00 PM, Andy  wrote:

> How can I check the amount on fragmentation on an OCFS2 volume?
>
> Thanks,
>
> Andy
>

Re: [Ocfs2-users] How to break out the unstop loop in the recovery thread? Thanks a lot.

2013-11-01 Thread Sunil Mushran
It is encountering SCSI errors reading the device. Fixing that will fix
the issue.

If you want to stop the logging, I don't believe there is a method right
now, but one could be trivially added:
allow the user to disable mlog(ML_ERROR) logging.



On Thu, Oct 31, 2013 at 7:38 PM, Guozhonghua  wrote:

>  Hi everyone,
>
>
>
> I have one OCFS2 issue.
>
> The OS is Ubuntu, with Linux kernel 3.2.50.
>
> There are three nodes in the OCFS2 cluster, and all the nodes use an
> HP 4330 iSCSI SAN as the storage.
>
> When the storage restarted, two of the nodes were fenced and restarted
> because they could not write heartbeats to the storage.
>
> But the last one did not restart, and it keeps writing error messages
> into syslog as below:
>
>
>
> Oct 30 02:01:01 server177 kernel: [25786.227598]
> (ocfs2rec,14787,13):ocfs2_read_journal_inode:1463 ERROR: status = -5
>
> Oct 30 02:01:01 server177 kernel: [25786.227615]
> (ocfs2rec,14787,13):ocfs2_replay_journal:1496 ERROR: status = -5
>
> Oct 30 02:01:01 server177 kernel: [25786.227631]
> (ocfs2rec,14787,13):ocfs2_recover_node:1652 ERROR: status = -5
>
> Oct 30 02:01:01 server177 kernel: [25786.227648]
> (ocfs2rec,14787,13):__ocfs2_recovery_thread:1358 ERROR: Error -5 recovering
> node 2 on device (8,32)!
>
> Oct 30 02:01:01 server177 kernel: [25786.227670]
> (ocfs2rec,14787,13):__ocfs2_recovery_thread:1359 ERROR: Volume requires
> unmount.
>
> Oct 30 02:01:01 server177 kernel: [25786.227696] sd 4:0:0:0: [sdc]
> Unhandled error code
>
> Oct 30 02:01:01 server177 kernel: [25786.227707] sd 4:0:0:0: [sdc]
> Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
>
> Oct 30 02:01:01 server177 kernel: [25786.227726] sd 4:0:0:0: [sdc] CDB:
> Read(10): 28 00 00 00 13 40 00 00 08 00
>
> Oct 30 02:01:01 server177 kernel: [25786.227792] end_request: recoverable
> transport error, dev sdc, sector 4928
>
> Oct 30 02:01:01 server177 kernel: [25786.227812]
> (ocfs2rec,14787,13):ocfs2_read_journal_inode:1463 ERROR: status = -5
>
> Oct 30 02:01:01 server177 kernel: [25786.227830]
> (ocfs2rec,14787,13):ocfs2_replay_journal:1496 ERROR: status = -5
>
> Oct 30 02:01:01 server177 kernel: [25786.227848]
> (ocfs2rec,14787,13):ocfs2_recover_node:1652 ERROR: status = -5
>
>
> ...
>
> Oct 30 06:48:41 server177 kernel: [43009.457816] sd 4:0:0:0: [sdc]
> Unhandled error code
>
> Oct 30 06:48:41 server177 kernel: [43009.457826] sd 4:0:0:0: [sdc]
> Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
>
> Oct 30 06:48:41 server177 kernel: [43009.457843] sd 4:0:0:0: [sdc] CDB:
> Read(10): 28 00 00 00 13 40 00 00 08 00
>
> Oct 30 06:48:41 server177 kernel: [43009.457911] end_request: recoverable
> transport error, dev sdc, sector 4928
>
> Oct 30 06:48:41 server177 kernel: [43009.457930]
> (ocfs2rec,14787,9):ocfs2_read_journal_inode:1463 ERROR: status = -5
>
> Oct 30 06:48:41 server177 kernel: [43009.457946]
> (ocfs2rec,14787,9):ocfs2_replay_journal:1496 ERROR: status = -5
>
> Oct 30 06:48:41 server177 kernel: [43009.457960]
> (ocfs2rec,14787,9):ocfs2_recover_node:1652 ERROR: status = -5
>
> Oct 30 06:48:41 server177 kernel: [43009.457975]
> (ocfs2rec,14787,9):__ocfs2_recovery_thread:1358 ERROR: Error -5 recovering
> node 2 on device (8,32)!
>
> Oct 30 06:48:41 server177 kernel: [43009.457996]
> (ocfs2rec,14787,9):__ocfs2_recovery_thread:1359 ERROR: Volume requires
> unmount.
>
> Oct 30 06:48:41 server177 kernel: [43009.458021] sd 4:0:0:0: [sdc]
> Unhandled error code
>
> Oct 30 06:48:41 server177 kernel: [43009.458031] sd 4:0:0:0: [sdc]
> Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
>
> Oct 30 06:48:41 server177 kernel: [43009.458049] sd 4:0:0:0: [sdc] CDB:
> Read(10): 28 00 00 00 13 40 00 00 08 00
>
> Oct 30 06:48:41 server177 kernel: [43009.458117] end_request: recoverable
> transport error, dev sdc, sector 4928
>
> Oct 30 06:48:41 server177 kernel: [43009.458137]
> (ocfs2rec,14787,9):ocfs2_read_journal_inode:1463 ERROR: status = -5
>
> Oct 30 06:48:41 server177 kernel: [43009.458153]
> (ocfs2rec,14787,9):ocfs2_replay_journal:1496 ERROR: status = -5
>
> Oct 30 06:48:41 server177 kernel: [43009.458168]
> (ocfs2rec,14787,9):ocfs2_recover_node:1652 ERROR: status = -5
>
>
> .
>
> .. The same log messages repeat as before, and the syslog grows very
> large; it can occupy all the remaining capacity on the disk...
>
>
>
> So the syslog file grows quickly until it occupies all the remaining
> capacity of the system root (/) directory.
>
> So the host is blocked and does not respond.
>
>
>
> According to the log above, in the function __ocfs2_recovery_thread
> there may be an endless loop, which results in the huge syslog file.
>
> __ocfs2_recovery_thread
>
> {
>
> 
>
> while (rm->rm_used) {
>
>………
>
> 

Re: [Ocfs2-users] OCFS2 tuning, fragmentation and localalloc option. Cluster hanging during mix read+write workloads

2013-08-06 Thread Sunil Mushran
If the storage connectivity is not stable, then DLM issues are to be
expected. In this case, the processes are all trying to take the read
lock. One possible scenario is that the node holding the write lock is
not able to relinquish it because it cannot flush the updated inodes to
disk. I would suggest you look into load balancing and how it affects
the iSCSI connectivity from the hosts.


On Tue, Aug 6, 2013 at 2:51 PM, Gavin Jones  wrote:

> Hello Goldwyn,
>
> Thanks for taking a look at this.  So, then, it does seem to be DLM
> related.  We were running fine for a few weeks and then it came up
> again this morning and has been going on throughout the day.
>
> Regarding the DLM debugging, I allowed debugging for DLM_GLUE,
> DLM_THREAD, DLM_MASTER and DLM_RECOVERY.  However, I don't see any DLM
> logging output in dmesg or syslog --is there perhaps another way to
> get at the actual DLM log?  I've searched around a bit but didn't find
> anything that made it clear.
>
> As for OCFS2 and iSCSI communications, they use the same physical
> network interface but different VLANs on that interface.  The
> "connectionX:0" errors, then, seem to indicate an issue with the ISCSI
> connection.  The system logs and monitoring software don't show any
> warnings or errors about the interface going down, so the only thing I
> can think of is the connection load balancing on the SAN, though
> that's merely a hunch.  Maybe I should mail the list and see if anyone
> has a similar setup.
>
> If you could please point me in the right direction to make use of the
> DLM debugging via debugfs.ocfs2, I would appreciate it.
>
> Thanks again,
>
> Gavin W. Jones
> Where 2 Get It, Inc.
>
> On Tue, Aug 6, 2013 at 4:16 PM, Goldwyn Rodrigues 
> wrote:
> > Hi Gavin,
> >
> >
> > On 08/06/2013 01:59 PM, Gavin Jones wrote:
> >>
> >> Hi Goldwyn,
> >>
> >> Apologies for the delayed reply.
> >>
> >> The hung Apache process / OCFS issue cropped up again, so I thought
> >> I'd pass along the contents of /proc//stack of a few affected
> >> processes:
> >>
> >> gjones@slipapp02:~> sudo cat /proc/27521/stack
> >> gjones's password:
> >> [] poll_schedule_timeout+0x44/0x60
> >> [] do_select+0x5a6/0x670
> >> [] core_sys_select+0x19e/0x2d0
> >> [] sys_select+0xb5/0x110
> >> [] system_call_fastpath+0x1a/0x1f
> >> [<7f394bdd5f23>] 0x7f394bdd5f23
> >> [] 0x
> >> gjones@slipapp02:~> sudo cat /proc/27530/stack
> >> [] sys_semtimedop+0x5a1/0x8b0
> >> [] system_call_fastpath+0x1a/0x1f
> >> [<7f394bdddb77>] 0x7f394bdddb77
> >> [] 0x
> >> gjones@slipapp02:~> sudo cat /proc/27462/stack
> >> [] sys_semtimedop+0x5a1/0x8b0
> >> [] system_call_fastpath+0x1a/0x1f
> >> [<7f394bdddb77>] 0x7f394bdddb77
> >> [] 0x
> >> gjones@slipapp02:~> sudo cat /proc/27526/stack
> >> [] sys_semtimedop+0x5a1/0x8b0
> >> [] system_call_fastpath+0x1a/0x1f
> >> [<7f394bdddb77>] 0x7f394bdddb77
> >> [] 0x
> >>
> >>
> >> Additionally, in dmesg I see, for example,
> >>
> >> [774981.361149] (/usr/sbin/httpd,8266,3):ocfs2_unlink:951 ERROR: status
> =
> >> -2
> >> [775896.135467]
> >> (/usr/sbin/httpd,8435,3):ocfs2_check_dir_for_entry:2119 ERROR: status
> >> = -17
> >> [775896.135474] (/usr/sbin/httpd,8435,3):ocfs2_mknod:459 ERROR: status =
> >> -17
> >> [775896.135477] (/usr/sbin/httpd,8435,3):ocfs2_create:629 ERROR: status
> =
> >> -17
> >> [788406.624126] connection1:0: ping timeout of 5 secs expired, recv
> >> timeout 5, last rx 4491991450, last ping 4491992701, now 4491993952
> >> [788406.624138] connection1:0: detected conn error (1011)
> >> [788406.640132] connection2:0: ping timeout of 5 secs expired, recv
> >> timeout 5, last rx 4491991451, last ping 4491992702, now 4491993956
> >> [788406.640142] connection2:0: detected conn error (1011)
> >> [788406.928134] connection4:0: ping timeout of 5 secs expired, recv
> >> timeout 5, last rx 4491991524, last ping 4491992775, now 4491994028
> >> [788406.928150] connection4:0: detected conn error (1011)
> >> [788406.944147] connection5:0: ping timeout of 5 secs expired, recv
> >> timeout 5, last rx 4491991528, last ping 4491992779, now 4491994032
> >> [788406.944165] connection5:0: detected conn error (1011)
> >> [788408.640123] connection3:0: ping timeout of 5 secs expired, recv
> >> timeout 5, last rx 4491991954, last ping 4491993205, now 4491994456
> >> [788408.640134] connection3:0: detected conn error (1011)
> >> [788409.907968] connection1:0: detected conn error (1020)
> >> [788409.908280] connection2:0: detected conn error (1020)
> >> [788409.912683] connection4:0: detected conn error (1020)
> >> [788409.913152] connection5:0: detected conn error (1020)
> >> [788411.491818] connection3:0: detected conn error (1020)
> >>
> >>
> >> that repeats for a bit and then I see
> >>
> >> [1952161.012214] INFO: task /usr/sbin/httpd:27491 blocked for more
> >> than 480 seconds.
> >> [1952161.012219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >> disables this mes

Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

2013-07-09 Thread Sunil Mushran
The error does not make sense. Also I don't know what 1.8.0 tools means. I
cannot see that label in the src tree.
https://oss.oracle.com/git/?p=ocfs2-tools.git;a=summary

One option is to build the tools from the head.


On Tue, Jul 9, 2013 at 2:25 PM, Ulf Zimmermann  wrote:

>  Sunil, any suggestions on this?
>
>
> From: ocfs2-users-boun...@oss.oracle.com [mailto:
> ocfs2-users-boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
> Sent: Saturday, June 22, 2013 15:20
> To: Sunil Mushran
>
> Cc: ocfs2-users@oss.oracle.com
> Subject: Re: [Ocfs2-users] Problems with volumes coming from RHEL5
> going to OEL6
>
>
> [root@co-db03 ulf]# debugfs.ocfs2 -R "stats" /dev/mapper/aucp_data_bk_2_x
> 
>
> Revision: 0.90
>
> Mount Count: 0   Max Mount Count: 20
>
> State: 0   Errors: 0
>
> Check Interval: 0   Last Check: Sun Sep 25 05:32:29 2011
>
> Creator OS: 0
>
> Feature Compat: 0 
>
> Feature Incompat: 0 
>
> Tunefs Incomplete: 0 
>
> Feature RO compat: 0 
>
> Root Blknum: 513   System Dir Blknum: 514
>
> First Cluster Group Blknum: 256
>
> Block Size Bits: 12   Cluster Size Bits: 20
>
> Max Node Slots: 10
>
> Extended Attributes Inline Size: 0
>
> Label: /export/backuprecovery.AUCP
>
> UUID: 5F9C2727159743529200CE9C5E155562
>
> Hash: 0 (0x0)
>
> DX Seeds: 0 0 0 (0x 0x 0x)
>
> Cluster stack: classic o2cb
>
> Cluster flags: 0 
>
> Inode: 2   Mode: 00   Generation: 3147295185 (0xbb97e9d1)
>
> FS Generation: 3147295185 (0xbb97e9d1)
>
> CRC32:    ECC: 
>
> Type: Unknown   Attr: 0x0   Flags: Valid System Superblock 
>
> Dynamic Features: (0x0) 
>
> User: 0 (root)   Group: 0 (root)   Size: 0
>
> Links: 0   Clusters: 1572864
>
> ctime: 0x4e7f1f5d 0x0 -- Sun Sep 25 05:32:29.0 2011
>
> atime: 0x0 0x0 -- Wed Dec 31 16:00:00.0 1969
>
> mtime: 0x4e7f1f5d 0x0 -- Sun Sep 25 05:32:29.0 2011
>
> dtime: 0x0 -- Wed Dec 31 16:00:00 1969
>
> Refcount Block: 0
>
> Last Extblk: 0   Orphan Slot: 0
>
> Sub Alloc Slot: Global   Sub Alloc Bit: 65535
>
>
> From: Sunil Mushran [mailto:sunil.mush...@gmail.com]
> Sent: Friday, June 21, 2013 11:11
> To: Ulf Zimmermann
> Cc: ocfs2-users@oss.oracle.com
> Subject: Re: [Ocfs2-users] Problems with volumes coming from RHEL5
> going to OEL6
>
>
> Can you dump the following using the 1.8 binary.
> debugfs.ocfs2 -R "stats" /dev/mapper/.
>
>
> On Fri, Jun 21, 2013 at 6:17 AM, Ulf Zimmermann  wrote:
>
> We have a production cluster of 6 nodes, which are currently running RHEL
> 5.8 with OCFS2 1.4.10. We snapclone these volumes to multiple destinations,
> one of them is a RHEL4 machine with OCFS2 1.2.9. Because of that the
> volumes are set so that we can read them there.
>
>  
>
> We are now trying to bring up a new server, this one has OEL 6.3 on it and
> it comes with OCFS2 1.8.0 and tools 1.8.0-10. I can use tunefs.ocfs2
> --cloned-volume to reset the UUID, but when I try to change the label I get:
> 
>
>  
>
> [root@co-db03 ulf]# tunefs.ocfs2 -L /export/backuprecovery.AUCP
> /dev/mapper/aucp_data_bk_2_x
>
> tunefs.ocfs2: Invalid name for a cluster while opening device
> "/dev/mapper/aucp_data_bk_2_x"
>
>  
>
> fsck.ocfs2 core dumps with the following, I also filed a bug on Bugzilla
> for that:
>
>  
>
> [root@co-db03 ulf]# fsck.ocfs2 /dev/mapper/aucp_data_bk_2_x 
>
> fsck.ocfs2 1.8.0
>
> *** glibc detected *** fsck.ocfs2: double free or corruption (fasttop):
> 0x0197f320 ***
>
> === Backtrace: =
>
> /lib64/libc.so.6[0x3656475366]
>
> fsck.ocfs2[0x434c31]
>
> fsck.ocfs2[0x403bc2]
>
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x365641ecdd]
>
> fsck.ocfs2[0x402879]
>
> === Memory map: 
>
> 0040-0045 r-xp  fc:00 12489
> /sbin/fsck.ocfs2
>
> 0064f000-00651000 rw-p 0004f000 fc:00 12489
> /sbin/fsck.ocfs2
>
> 00651000-00652000 rw-p  00:00 0 
>
> 0085-00851000 rw-p 0005 fc:00 12489
> /s

Re: [Ocfs2-users] High inodes usage

2013-07-03 Thread Sunil Mushran
That number is typically calculated. So it just could be bad arithmetic.
But that should not affect the other ops.
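Since ocfs2 allocates inodes dynamically out of clusters, the totals behind df -i are derived from other counters rather than read from a fixed inode table. A sketch of why the derived percentage can look alarming without being a hard limit (the arithmetic below is illustrative; it is not the actual kernel calculation):

```python
def df_inode_percent(total_inodes, free_inodes):
    """The same arithmetic df -i performs on the statfs results."""
    used = total_inodes - free_inodes
    return round(100.0 * used / total_inodes)

# If "free inodes" is reported as a function of free clusters (new
# inodes are carved out of clusters on demand), a volume nearly full
# of large files reports nearly zero free inodes -- yet a touch still
# succeeds, exactly as observed above.
df_inode_percent(26_000_000, 260_000)   # ~99
```

In other words, the 98-100% figure reflects allocator bookkeeping on an almost-full volume, not an exhausted inode pool.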


On Wed, Jul 3, 2013 at 12:40 PM, Nicolas Michel  wrote:

> I don't know if it's the root cause of my problems or if it causes any
> problem at all. But I have some stability issues on the cluster so I'm
> investigating anything that could be suspect. My question is : is it a
> normal behavior to get inode usage with df -i showing high percentage like
> 98, 99 or 100%? (a touch on the filesystem with 100% inode usage still
> create a file so I suppose it is not causing any problem but I found it
> weird).
>
>
> 2013/7/3 Sunil Mushran 
>
>> That is old. It just could be a minor bug in that release. Is it causing
>> you any problems?
>>
>>
>> On Wed, Jul 3, 2013 at 12:31 PM, Nicolas Michel <
>> be.nicolas.mic...@gmail.com> wrote:
>>
>>> Hello Sunil,
>>>
>>> I checked the inode usage with df -i
>>> I can't check the kernel version running on the system now because I'm
>>> not at work but it's a SLES 10 SP2, so a pretty old kernel I suppose.
>>>
>>> Nicolas
>>>
>>>
>>> 2013/7/3 Sunil Mushran 
>>>
>>>> How did you figure this out? Also, which version of the kernel are you
>>>> using?
>>>>
>>>>
>>>> On Wed, Jul 3, 2013 at 1:05 AM, Nicolas Michel <
>>>> be.nicolas.mic...@gmail.com> wrote:
>>>>
>>>>> Hello guys,
>>>>>
>>>>> I'm using OCFS2 for shared storage (on a SAN). I just saw that the
>>>>> inode usage is really high although these filesystems are used for Oracle
>>>>> DATA storage. So there are really just a few big files.
>>>>>
>>>>> I don't understand why the inode usage is so high with so few big
>>>>> files (as an example: one of the filesystems has 16 files and directories,
>>>>> but almost all of the ~26 million inodes are used!)
>>>>>
>>>>> My questions:
>>>>> - can the inode usage be a problem in such a situation?
>>>>> - if it is: how can I reduce the number used, or increase the pool
>>>>> of available inodes?
>>>>> - why are so many inodes used with so few files? I was sure that
>>>>> traditionally one inode was used per file or directory.
>>>>>
>>>>> --
>>>>> Nicolas MICHEL
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nicolas MICHEL
>>
>>
>>
>
>
> --
> Nicolas MICHEL

Re: [Ocfs2-users] High inodes usage

2013-07-03 Thread Sunil Mushran
That is old. It just could be a minor bug in that release. Is it causing
you any problems?


On Wed, Jul 3, 2013 at 12:31 PM, Nicolas Michel  wrote:

> Hello Sunil,
>
> I checked the inode usage with df -i
> I can't check the kernel version running on the system now because I'm not
> at work but it's a SLES 10 SP2, so a pretty old kernel I suppose.
>
> Nicolas
>
>
> 2013/7/3 Sunil Mushran 
>
>> How did you figure this out? Also, which version of the kernel are you
>> using?
>>
>>
>> On Wed, Jul 3, 2013 at 1:05 AM, Nicolas Michel <
>> be.nicolas.mic...@gmail.com> wrote:
>>
>>> Hello guys,
>>>
>>> I'm using OCFS2 for shared storage (on a SAN). I just saw that the inode
>>> usage is really high although these filesystems are used for Oracle DATA
>>> storage. So there are really just a few big files.
>>>
>>> I don't understand why the inode usage is so high with so few big
>>> files (as an example: one of the filesystems has 16 files and directories,
>>> but almost all of the ~26 million inodes are used!)
>>>
>>> My questions:
>>> - can the inode usage be a problem in such a situation?
>>> - if it is: how can I reduce the number used, or increase the pool of
>>> available inodes?
>>> - why are so many inodes used with so few files? I was sure that
>>> traditionally one inode was used per file or directory.
>>>
>>> --
>>> Nicolas MICHEL
>>>
>>>
>>
>>
>
>
> --
> Nicolas MICHEL

Re: [Ocfs2-users] High inodes usage

2013-07-03 Thread Sunil Mushran
How did you figure this out? Also, which version of the kernel are you
using?


On Wed, Jul 3, 2013 at 1:05 AM, Nicolas Michel
wrote:

> Hello guys,
>
> I'm using OCFS2 for shared storage (on a SAN). I just saw that the inode
> usage is really high although these filesystems are used for Oracle DATA
> storage. So there are really just a few big files.
>
> I don't understand why the inode usage is so high with so few big files
> (as an example: one of the filesystems has 16 files and directories, but
> almost all of the ~26 million inodes are used!)
>
> My questions:
> - can the inode usage be a problem in such a situation?
> - if it is: how can I reduce the number used, or increase the pool of
> available inodes?
> - why are so many inodes used with so few files? I was sure that
> traditionally one inode was used per file or directory.
>
> --
> Nicolas MICHEL
>
>

Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

2013-06-21 Thread Sunil Mushran
Can you dump the following using the 1.8 binary.
debugfs.ocfs2 -R "stats" /dev/mapper/.


On Fri, Jun 21, 2013 at 6:17 AM, Ulf Zimmermann  wrote:

>  We have a production cluster of 6 nodes, which are currently running
> RHEL 5.8 with OCFS2 1.4.10. We snapclone these volumes to multiple
> destinations, one of them is a RHEL4 machine with OCFS2 1.2.9. Because of
> that the volumes are set so that we can read them there.
>
>
> We are now trying to bring up a new server, this one has OEL 6.3 on it and
> it comes with OCFS2 1.8.0 and tools 1.8.0-10. I can use tunefs.ocfs2
> --cloned-volume to reset the UUID, but when I try to change the label I get:
> 
>
>
> [root@co-db03 ulf]# tunefs.ocfs2 -L /export/backuprecovery.AUCP
> /dev/mapper/aucp_data_bk_2_x
>
> tunefs.ocfs2: Invalid name for a cluster while opening device
> "/dev/mapper/aucp_data_bk_2_x"
>
>
> fsck.ocfs2 core dumps with the following, I also filed a bug on Bugzilla
> for that:
>
>
> [root@co-db03 ulf]# fsck.ocfs2 /dev/mapper/aucp_data_bk_2_x
>
> fsck.ocfs2 1.8.0
>
> *** glibc detected *** fsck.ocfs2: double free or corruption (fasttop):
> 0x0197f320 ***
>
> === Backtrace: =
>
> /lib64/libc.so.6[0x3656475366]
>
> fsck.ocfs2[0x434c31]
>
> fsck.ocfs2[0x403bc2]
>
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x365641ecdd]
>
> fsck.ocfs2[0x402879]
>
> === Memory map: 
>
> 0040-0045 r-xp  fc:00 12489
> /sbin/fsck.ocfs2
>
> 0064f000-00651000 rw-p 0004f000 fc:00 12489
> /sbin/fsck.ocfs2
>
> 00651000-00652000 rw-p  00:00 0 
>
> 0085-00851000 rw-p 0005 fc:00 12489
> /sbin/fsck.ocfs2
>
> 0197e000-0199f000 rw-p  00:00 0
> [heap]
>
> 3655c0-3655c2 r-xp  fc:00 8797
> /lib64/ld-2.12.so
>
> 3655e1f000-3655e2 r--p 0001f000 fc:00 8797
> /lib64/ld-2.12.so
>
> 3655e2-3655e21000 rw-p 0002 fc:00 8797
>   /lib64/ld-2.12.so
>
> 3655e21000-3655e22000 rw-p  00:00 0 
>
> 365640-3656589000 r-xp  fc:00 8798
> /lib64/libc-2.12.so
>
> 3656589000-3656788000 ---p 00189000 fc:00 8798
> /lib64/libc-2.12.so
>
> 3656788000-365678c000 r--p 00188000 fc:00 8798
> /lib64/libc-2.12.so
>
> 365678c000-365678d000 rw-p 0018c000 fc:00 8798
> /lib64/libc-2.12.so
>
> 365678d000-3656792000 rw-p  00:00 0 
>
> 3659c0-3659c16000 r-xp  fc:00 8802
> /lib64/libgcc_s-4.4.6-20120305.so.1
>
> 3659c16000-3659e15000 ---p 00016000 fc:00 8802
> /lib64/libgcc_s-4.4.6-20120305.so.1
>
> 3659e15000-3659e16000 rw-p 00015000 fc:00 8802
> /lib64/libgcc_s-4.4.6-20120305.so.1
>
> 3d3e80-3d3e817000 r-xp  fc:00 12028
> /lib64/libpthread-2.12.so
>
> 3d3e817000-3d3ea17000 ---p 00017000 fc:00 12028
>  /lib64/libpthread-2.12.so
>
> 3d3ea17000-3d3ea18000 r--p 00017000 fc:00 12028
> /lib64/libpthread-2.12.so
>
> 3d3ea18000-3d3ea19000 rw-p 00018000 fc:00 12028
> /lib64/libpthread-2.12.so
>
> 3d3ea19000-3d3ea1d000 rw-p  00:00 0 
>
> 3e2660-3e26603000 r-xp  fc:00 426
> /lib64/libcom_err.so.2.1
>
> 3e26603000-3e26802000 ---p 3000 fc:00 426
> /lib64/libcom_err.so.2.1
>
> 3e26802000-3e26803000 r--p 2000 fc:00 426
> /lib64/libcom_err.so.2.1
>
> 3e26803000-3e26804000 rw-p 3000 fc:00 426
> /lib64/libcom_err.so.2.1
>
> 7fb063711000-7fb063714000 rw-p  00:00 0 
>
> 7fb06371d000-7fb06372 rw-p  00:00 0 
>
> 7fffd5b95000-7fffd5bb6000 rw-p  00:00 0
> [stack]
>
> 7fffd5bc5000-7fffd5bc6000 r-xp  00:00 0
> [vdso]
>
> ff60-ff601000 r-xp  00:00 0
> [vsyscall]
>
> Abort (core dumped)
>
>
> I think one of the main questions is what “Invalid name for a cluster
> while trying to join the group” or “Invalid name for a cluster while
> opening device” means. I am pretty sure that /etc/sysconfig/o2cb and
> /etc/ocfs2/cluster.conf are correct.
>
>
> Ulf.
>
>
>

Re: [Ocfs2-users] Unable to set the o2cb heartbeat to global

2013-06-04 Thread Sunil Mushran
Support for global heartbeat was added in ocfs2-tools-1.8.
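For reference, once the 1.8 tools are in place, global heartbeat also expects heartbeat regions to be declared alongside the cluster stanza. A sketch of the relevant /etc/ocfs2/cluster.conf pieces (the region UUID below is a placeholder for the value mkfs.ocfs2/tunefs.ocfs2 reports for the heartbeat device; check the o2cb and ocfs2.cluster.conf man pages for the exact syntax on your version):

```
cluster:
	name = ocfs2
	heartbeat_mode = global
	node_count = 2

heartbeat:
	cluster = ocfs2
	region = 5F9C2727159743529200CE9C5E155562
```

With the 1.6.4 tools installed above, the heartbeat_mode line is simply ignored, which is consistent with the heartbeat=local shown in the mount output below.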


On Tue, Jun 4, 2013 at 8:31 AM, Vineeth Thampi wrote:

> Hi,
>
> I have added heartbeat mode as global, but when I do a mkfs and mount, and
> then check the mount, it says I am in local mode. Even
> /sys/kernel/config/cluster/ocfs2/heartbeat/mode says local. I am running
> CentOS with 3.x kernel, with ocfs2-tools-1.6.4-1118.
>
> mkfs -t ocfs2 -b 4K -C 1M -N 16 --cluster-stack=o2cb  /dev/sdb
> mount -t ocfs2 /dev/sdb /mnt -o
> noatime,data=writeback,nointr,commit=60,coherency=buffered
>
> ==
> node:
> ip_port = 
> ip_address = 10.81.2.108
> number = 1
> name = cam-st08
> cluster = ocfs2
>
> cluster:
> node_count = 2
> heartbeat_mode = global
> name = ocfs2
> ==
>
> root@cam-st07 log # mount | grep sdb
> /dev/sdb on /mnt type ocfs2
> (rw,_netdev,noatime,data=writeback,nointr,commit=60,coherency=buffered,heartbeat=local)
>
> Any help would be much appreciated.
>
> Thanks,
>
> Vineeth
>
>

Re: [Ocfs2-users] What is the overhead/disk loss of formatting an ocfs2 filesystem?

2013-04-15 Thread Sunil Mushran
-N 16 means 16 journals. I think it defaults to 256M journals. So that's
4G. Do you plan to mount it on 16 nodes? If not, reduce that. Another
option is a smaller journal. But you have to be careful, as a small
journal could limit your write throughput.
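The space estimate above can be written down explicitly. This is a rough lower bound: real formatting overhead also includes system files, allocator bitmaps, and rounding to the cluster size, and the 256 MiB default is an assumption (check the mkfs.ocfs2 output for the journal size actually chosen):

```python
def journal_overhead_gib(node_slots, journal_mib=256):
    """Space consumed by the per-slot journals alone.

    node_slots  -- the value given to mkfs.ocfs2 -N
    journal_mib -- per-slot journal size, assumed 256 MiB here
    """
    return node_slots * journal_mib / 1024

# 16 slots x 256 MiB = 4 GiB, which accounts for most of the
# ~4.2-4.3 GB "used" shown by df on the empty volumes above.
journal_overhead_gib(16)   # 4.0
```

Reducing -N to the number of nodes that will actually mount the volume scales this overhead down linearly.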


On Mon, Apr 15, 2013 at 1:37 PM, Jerry Smith  wrote:

> Good afternoon,
>
> I have an OEL 6.3 box with a few ocfs2 mounts mounted locally, and was
> wondering what I should expect to lose via formatting etc from a disk
> usage standpoint.
>
> -bash-4.1$ df -h | grep ocfs2
> /dev/dm-15 12G  1.3G   11G  11% /ocfs2/redo0
> /dev/dm-13120G  4.2G  116G   4% /ocfs2/software-master
> /dev/dm-10 48G  4.1G   44G   9% /ocfs2/arch0
> /dev/dm-142.5T  6.7G  2.5T   1% /ocfs2/ora01
> /dev/dm-111.5T  5.7G  1.5T   1% /ocfs2/ora02
> /dev/dm-17100G  4.2G   96G   5% /ocfs2/ora03
> /dev/dm-12200G  4.3G  196G   3% /ocfs2/ora04
> /dev/dm-163.0T  7.3G  3.0T   1% /ocfs2/orabak01
> -bash-4.1$
>
>
> For example ora04 is 196GB total, but with zero usage it shows 4.3GB used:
>
> [root@oeldb10 ~]#df -h /ocfs2/ora04
> FilesystemSize  Used Avail Use% Mounted on
> /dev/dm-12200G  4.3G  196G   3% /ocfs2/ora04
> [root@oeldb10 ~]#find /ocfs2/ora04/ | wc -l
> 3
> [root@oeldb10 ~]#find /ocfs2/ora04/ -exec du -sh {} \;
> 0/ocfs2/ora04/
> 0/ocfs2/ora04/lost+found
> 0/ocfs2/ora04/db66snlux
>
>
> Filesystems formatted via
>
> mkfs -t ocfs2 -N 16 --fs-features=xattr,local -L ${device} ${device}
>
> Mount options
>
> [root@oeldb10 ~]#mount |grep ora04
> /dev/dm-12 on /ocfs2/ora04 type ocfs2
> (rw,_netdev,nointr,user_xattr,heartbeat=none)
>
> Thanks,
>
> --Jerry
>
>
>
>

Re: [Ocfs2-users] Significant Slowdown when writing and deleting files at the same time

2013-03-29 Thread Sunil Mushran
Are you mounting -o writeback?


On Fri, Mar 29, 2013 at 12:28 PM, Andy  wrote:

> I have been having performance issues from time to time on our
> production ocfs2 volumes, so I set up a test system to try to reproduce
> what I was seeing on the production systems.  This is what I found out:
>
> I have a 2-node test system sharing a 2TB volume with a journal size of
> 256MB.  I can easily trigger the slowdown by starting two processes to
> write a 10GB file each, then deleting a different large file (7GB+)
> while the other processes are writing.  The slowdown is significant and
> very disruptive.  Not only did it take over 3 minutes to delete the
> file, everything else paused when entering that directory too.  A du
> command will stall, and NFS clients of that file system will think the
> server is not responding.  Under heavier amounts of writes, I have had a
> delete take 13 minutes for an 8GB file, and NFS mounts return I/O errors.
> We often deal with large files, so the situation above is fairly common.
>
> I would like any ideas that would provide smoother performance of the
> OCFS2 volume and somehow eliminate the long pauses during deletes.
>
> Thanks,
>
> Andy
>
>

Re: [Ocfs2-users] [OCFS2] Crash at o2net_shutdown_sc()

2013-03-01 Thread Sunil Mushran
 [ 1481.620253] o2hb: Unable to stabilize heartbeart on region
1352E2692E704EEB8040E5B8FF560997 (vdb)

What this means is that the device is suspect. o2hb writes are not hitting
the disk. vdb is accepting and
acknowledging the write but spitting out something else during the next
read. Heartbeat detects this and
aborts, as it should.
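The stabilization check described here amounts to a read-after-write verification loop. A minimal sketch of that idea (illustrative only; o2hb's real protocol writes timestamped heartbeat blocks per node slot):

```python
def stabilizes(device, slot, generation, attempts=5):
    """Write our generation into our slot, then confirm the device
    returns it on read-back. A device that acknowledges writes but
    serves stale data never stabilizes, so heartbeat must abort."""
    for _ in range(attempts):
        device.write(slot, generation)
        if device.read(slot) == generation:
            return True
    return False

class GoodDevice:
    def __init__(self):
        self.blocks = {}
    def write(self, slot, value):
        self.blocks[slot] = value
    def read(self, slot):
        return self.blocks.get(slot)

class StaleDevice:
    """Acks the write, then returns old contents -- the suspected
    behavior of vdb above."""
    def write(self, slot, value):
        pass
    def read(self, slot):
        return None

stabilizes(GoodDevice(), 1, 0xBEEF)    # True: write is visible
stabilizes(StaleDevice(), 1, 0xBEEF)   # False: region never stabilizes
```

A device that fails this check cannot safely carry cluster heartbeat, no matter how healthy it looks to ordinary writes.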

Then we hit a race during socket close that triggers the oops. Yes, that
needs to be fixed. But you also
need to "fix" vdb... what appears to be a virtual device.


On Fri, Mar 1, 2013 at 1:25 PM, richard -rw- weinberger <
richard.weinber...@gmail.com> wrote:

> Hi!
>
> Using 3.8.1 OCFS2 crashes while joining nodes to the cluster.
> The cluster consists of 10 nodes, while node3 joins the kernel on node3
> crashes.
> (Somtimes later...)
> See dmesg below.
> Is this a known issue? I didn't test older kernels so far.
>
> node1:
> [ 1471.881922] o2dlm: Joining domain 1352E2692E704EEB8040E5B8FF560997
> ( 0 ) 1 nodes
> [ 1471.919522] JBD2: Ignoring recovery information on journal
> [ 1471.947027] ocfs2: Mounting device (253,16) on (node 0, slot 0)
> with ordered data mode.
> [ 1475.802497] o2net: Accepted connection from node node2 (num 1) at
> 192.168.66.2:
> [ 1481.814048] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 8
> [ 1481.814955] o2net: No longer connected to node node2 (num 1) at
> 192.168.66.2:
> [ 1482.468827] o2net: Accepted connection from node node3 (num 2) at
> 192.168.66.3:
> [ 1511.904100] o2net: No connection established with node 1 after 30.0
> seconds, giving up.
> [ 1514.472995] o2net: Connection to node node3 (num 2) at
> 192.168.66.3: shutdown, state 8
> [ 1514.473960] o2net: No longer connected to node node3 (num 2) at
> 192.168.66.3:
> [ 1516.076044] o2net: Accepted connection from node node2 (num 1) at
> 192.168.66.2:
> [ 1520.181430] o2dlm: Node 1 joins domain
> 1352E2692E704EEB8040E5B8FF560997 ( 0 1 ) 2 nodes
> [ 1544.544030] o2net: No connection established with node 2 after 30.0
> seconds, giving up.
> [ 1574.624029] o2net: No connection established with node 2 after 30.0
> seconds, giving up.
>
> node2:
> [ 1475.613170] o2net: Connected to node node1 (num 0) at 192.168.66.1:
> [ 1481.620253] o2hb: Unable to stabilize heartbeart on region
> 1352E2692E704EEB8040E5B8FF560997 (vdb)
> [ 1481.622489] o2net: No longer connected to node node1 (num 0) at
> 192.168.66.1:
> [ 1515.886605] o2net: Connected to node node1 (num 0) at 192.168.66.1:
> [ 1519.992766] o2dlm: Joining domain 1352E2692E704EEB8040E5B8FF560997
> ( 0 1 ) 2 nodes
> [ 1520.017054] JBD2: Ignoring recovery information on journal
> [ 1520.07] ocfs2: Mounting device (253,16) on (node 1, slot 1)
> with ordered data mode.
> [ 1520.159590] mount.ocfs2 (2186) used greatest stack depth: 2568 bytes
> left
>
> node3:
> [ 1482.836865] o2net: Connected to node node1 (num 0) at 192.168.66.1:
> [ 1482.837542] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1484.840952] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1486.844994] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1488.848952] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1490.853052] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1492.857046] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1494.861042] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1496.865024] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1498.869021] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1500.873016] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1502.877056] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1504.881042] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1506.885040] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1508.888991] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1510.893077] o2net: Connection to node node2 (num 1) at
> 192.168.66.2: shutdown, state 7
> [ 1512.843172] (mount.ocfs2,2179,0):dlm_request_join:1477 ERROR: Error
> -107 when sending message 510 (key 0x666c6172) to node 1
> [ 1512.845580] (mount.ocfs2,2179,0):dlm_try_to_join_domain:1653 ERROR:
> status = -107
> [ 1512.847778] (mount.ocfs2,2179,0):dlm_join_domain:1955 ERROR: status =
> -107
> [ 1512.849334] (mount.ocfs2,2179,0):dlm_register_domain:2214 ERROR:
> status = -107
> [ 1512.850921] (mount.ocfs2,2179,0):o2cb_cluster_connect:368 ERROR:
> status = -107
> [ 1512.852511] (mount.ocfs2,2179,0):ocfs2_dlm_init:3004 ERROR: status =
> -107
> [ 1512.854090] (mount.ocfs2,2179,0):ocfs2_mount_volu

Re: [Ocfs2-users] OCFS ..Inode contains a hole at offset...

2013-02-20 Thread Sunil Mushran
This is probably a directory. debugfs.ocfs2 -R 'stat <52663>' /dev/ will
dump the inode.

Are you sure fsck is fixing it? Does the output show this block getting
fixed? If not, you may want to run fsck.ocfs2 v1.8. I think a fix for
this was added there.


On Wed, Feb 20, 2013 at 1:01 AM, Fiorenza Meini  wrote:

> Hi there,
> I have a partition formatted with ocfs2 (1.6.3) on a 2.6.37 Linux Kernel
> system. This partition is managed by a cluster (corosync/pacemaker).
> The backend of this ocfs2 partition is drbd on Lvm.
>
> I see this line in the messages log file:
> ocfs2_read_virt_blocks:871 ERROR: Inode #52663 contains a hole at offset
> 69632
>
> The error is reported more than once and the offset is the same..
>
> When I do a check on this partition, errors are found and resolved, but
> in a short time the problems appears again.
> I can't understand at what level is the problem:
> * kernel ?
> * hardware ?
> * lvm + drbd ?
>
> There are tools that can be used to understand ?
> Any suggestion?
>
> Thanks and regards.
>
> Fiorenza
> --
>
> Fiorenza Meini
> Spazio Web S.r.l.
>
> V. Dante Alighieri, 10 - 13900 Biella
> Tel.: 015.2431982 - 015.9526066
> Fax: 015.2522600
> Reg. Imprese, CF e P.I.: 02414430021
> Iscr. REA: BI - 188936
> Iscr. CCIAA: Biella - 188936
> Cap. Soc.: 30.000,00 Euro i.v.
>
>

Re: [Ocfs2-users] ocfs cluster node keeps rebooting

2013-01-14 Thread Sunil Mushran
1.2.5 is a 6+ year old release. You may want to use something more current.


On Mon, Jan 14, 2013 at 12:06 PM, Bill Zha  wrote:

> Hi Sunil and All,
>
> We have a 10-node RedHat 4.2 OCFS cluster running on version 1.2.5-6.  One
> of the nodes has started rebooting almost every day since last week.  The
> entire cluster had been stable for the past year or so.  I captured the
> following console output; can you, or someone who has had a similar issue,
> let me know the possible cause of these reboots?
>
> (25271,4):o2net_idle_timer:1426 here are some times that might help debug
> the situation: (tmr 1358156758.101016 now 1358156788.97593 dr
> 1358156758.101008 adv 1358156758.101022:1358156758.101024 func
> (5d21e188:507) 1357953447.247097:1357953447.247100)
> (25267,4):o2net_idle_timer:1426 here are some times that might help debug
> the situation: (tmr 1358156758.666788 now 1358156788.663604 dr
> 1358156760.666794 adv 1358156758.666793:1358156758.666795 func
> (5d21e188:505) 1357953453.107343:1357953453.107349)
> (25267,4):o2net_idle_timer:1426 here are some times that might help debug
> the situation: (tmr 1358156758.848933 now 1358156788.953367 dr
> 1358156760.847939 adv 1358156758.848939:1358156758.848941 func
> (0e6eb1eb:505) 1357965605.352156:1357965605.352162)
> (25267,4):o2net_idle_timer:1426 here are some times that might help debug
> the situation: (tmr 1358156759.108373 now 1358156789.243003 dr
> 1358156761.108392 adv 1358156759.108376:1358156759.108378 func
> (af22ae1f:502) 1357914301.741127:1357914301.741130)
> (25275,4):o2net_idle_timer:1426 here are some times that might help debug
> the situation: (tmr 1358156759.626366 now 1358156789.623629 dr
> 1358156789.622319 adv 1358156759.626369:1358156759.626371 func
> (abd851aa:505) 1357965605.363679:1357965605.363685)
> (25275,4):o2net_idle_timer:1426 here are some times that might help debug
> the situation: (tmr 1358156759.656350 now 1358156789.913330 dr
> 1358156761.656039 adv 1358156759.656354:1358156759.656355 func
> (0e6eb1eb:502) 1357907401.318584:1357907401.318587)
> (25275,4):o2net_idle_timer:1426 here are some times that might help debug
> the situation: (tmr 1358156759.663467 now 1358156790.203323 dr
> 1358156761.662745 adv 1358156759.663470:1358156759.663472 func
> (7dcded64:502) 1357875986.764566:1357875986.764568)
> (25275,4):o2net_idle_timer:1426 here are some times that might help debug
> the situation: (tmr 1358156759.987324 now 1358156790.493342 dr
> 1358156761.987117 adv 1358156759.987327:1358156759.987329 func
> (6bcd2bc6:502) 1357875995.47:1357875995.55)
> (25,7):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device
> dm-14 after 18 milliseconds
> Heartbeat thread (25) printing last 24 blocking operations (cur = 11):
> Heartbeat thread stuck at msleep, stuffing current time into that blocker
> (index 11)
> Index 12: took 0 ms to do allocating bios for read
> Index 13: took 0 ms to do bio alloc read
> Index 14: took 0 ms to do bio add page read
> Index 15: took 0 ms to do bio add page read
> Index 16: took 0 ms to do submit_bio for read
> Index 17: took 0 ms to do waiting for read completion
> Index 18: took 0 ms to do bio alloc write
> Index 19: took 0 ms to do bio add page write
> Index 20: took 0 ms to do submit_bio for write
> Index 21: took 0 ms to do checking slots
> Index 22: took 0 ms to do waiting for write completion
> Index 23: took 100897 ms to do msleep
> Index 0: took 0 ms to do allocating bios for read
> Index 1: took 0 ms to do bio alloc read
> Index 2: took 0 ms to do bio add page read
> Index 3: took 0 ms to do bio add page read
> Index 4: took 0 ms to do submit_bio for read
> Index 5: took 0 ms to do waiting for read completion
> Index 6: took 0 ms to do bio alloc write
> Index 7: took 0 ms to do bio add page write
> Index 8: took 0 ms to do submit_bio for write
> Index 9: took 0 ms to do checking slots
> Index 10: took 0 ms to do waiting for write completion
> Index 11: took 313 ms to do msleep
> *** ocfs2 is very sorry to be fencing this system by restarting ***
>
>
> Thank you so much for your help!
>
>
> Bill
>
___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] asynchronous hwclocks

2013-01-03 Thread Sunil Mushran
The fs does not care about time. It should have no effect on the cluster.
However, the apps may care and may behave erratically.

On Jan 3, 2013, at 3:13 PM, "Medienpark, Jakob Rößler" 
 wrote:

> Hello list,
> 
> today I noticed huge differences between the hardware clocks in our cluster.
> Some details:
> 
> root@www01:~# hwclock;date
> Do 03 Jan 2013 09:32:09 CET  -0.626096 seconds
> Do 3. Jan 09:34:54 CET 2013
> 
> root@www02:~# hwclock;date
> Do 03 Jan 2013 09:32:09 CET  -0.626091 seconds
> Do 3. Jan 09:34:54 CET 2013
> 
> root@www03:~# hwclock;date
> Do 03 Jan 2013 09:34:54 CET  -0.625820 seconds
> Do 3. Jan 09:34:54 CET 2013
> 
> root@storage:~# hwclock;date
> Do 03 Jan 2013 08:34:54 CET  -0.641532 seconds
> Do 3. Jan 09:34:54 CET 2013
> 
> The server 'storage' is the server which provides the iscsi device to 
> www01-03.
> Because the cluster was very unstable during load peaks, I want to ask 
> you what kind of effects it will have to ocfs2 if the hwclocks are 
> asynchronous like shown above.
> 
> Thanks in advance
> 
> Jakob
> 
> 

Re: [Ocfs2-users] Is this a valid configuration?

2012-12-05 Thread Sunil Mushran
This is normal. My only concern is the use of very old kernel/fs versions.


On Wed, Dec 5, 2012 at 3:08 AM, Neil  wrote:

> Anyone?
>
> 
>
> On 2012-11-28 00:47:56 + neil campbell 
> wrote:
>
> >
> >
> > Hi list,
> >
> > I am running OCFS2 1.2.9-9.bug13439173 on RHEL 4 Kernel 2.6.9-89
> >
> > # modinfo ocfs2
> >
> > filename:   /lib/modules/2.6.9-89.0.26.ELsmp/kernel/fs/ocfs2/ocfs2.ko
> > license:GPL
> > author: Oracle
> > version:1.2.9 CF6A7A44EA2581415F3D612
> > description:OCFS2 1.2.9 Mon Dec  5 14:27:38 EST 2011 (build
> > e5c3135c8cbf75f2620ff4c782d634f1)
> > depends:ocfs2_nodemanager,ocfs2_dlm,jbd,debugfs
> > vermagic:   2.6.9-89.0.26.ELsmp SMP gcc-3.4
> >
> > #
> >
> > I just have some reservations about whether the following configuration,
> > where I have mount points of different file system types over an initial
> > mount point (/d0) would cause any issues?
> >
> > LUN1LUN2LUN3  LUN4
> > ||   | |
> > ||   | |
> > /d0 (ext3)   /d0/app (ext3)  /d0/ocfs (ocfs2)  /d0/app/html (ocfs2)
> >
> >
> > Many thanks,
> > Neil
> >
> >

Re: [Ocfs2-users] "ls" taking ages on a directory containing 900000 files

2012-12-04 Thread Sunil Mushran
1.5 ms per inode. Times 900K files equals 22 mins.

Large dirs are a problem in all file systems. The degree of the problem
depends on the overhead. An easy workaround is to shard the
files into multilevel dirs, like a 2-level structure of 1000 files in each of
1000 dirs, or a 3-level structure with even fewer files per dir.
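The multilevel layout described above can be sketched as follows. This is a hypothetical Python helper, not part of any OCFS2 tooling; the use of an MD5 digest and two-hex-digit directory names is an assumption, any even hash works:

```python
import hashlib
import os

def shard_path(root, name, levels=2, width=2):
    # Hypothetical helper, not part of any OCFS2 tooling: hash the file
    # name and use hex digits of the digest as intermediate directory
    # names, so entries spread evenly (256 subdirs per level at width=2).
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, name)

# shard_path("/data", "file-000001.txt") yields something like
# "/data/ab/cd/file-000001.txt" -- two hashed levels, then the file name.
```

With width=2 and levels=2 that caps each directory at 256 entries per level, so a million files land in directories of a few thousand entries at most.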

Or you could use the other approach suggested: avoid stat()
by disabling color-ls, or just use plain find.
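To illustrate why a names-only listing (plain find, or an unaliased ls) is so much cheaper: it reads the directory entries in one pass and never stat()s the individual files. A minimal Python sketch of the cheap variant:

```python
import os
import tempfile

def names_only(path):
    # One pass over the directory entries (getdents under the hood);
    # no per-file stat(), so nothing forces a per-inode disk read or,
    # on ocfs2, a per-inode cluster lock.
    with os.scandir(path) as it:
        return sorted(e.name for e in it)

# Tiny demo against a throwaway directory.
demo = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(demo, "f%d.txt" % i), "w").close()
print(names_only(demo))  # prints ['f0.txt', 'f1.txt', 'f2.txt']
```

A color or long listing instead issues one lstat() per entry, which is exactly the 1.5 ms-per-inode cost seen in the strace output above.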


On Tue, Dec 4, 2012 at 3:16 PM, Erik Schwartz wrote:

> Amaury, you can see in strace output that it's performing a stat on
> every file.
>
> Try simply:
>
>   $ /bin/ls
>
> My guess is you're using a system where "ls" is aliased to use options
> that are more expensive.
>
> Best regards -
>
> Erik
>
>
> On 12/4/12 5:12 PM, Amaury Francois wrote:
> > The strace looks like this (on all files) :
> >
> >
> >
> > 1354662591.755319
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P069_F01589.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.001389>
> >
> > 1354662591.756775
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P035_F01592.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.001532>
> >
> > 1354662591.758376
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P085_F01559.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.001429>
> >
> > 1354662591.759873
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P027_F01569.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.001377>
> >
> > 1354662591.761317
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P002_F01581.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.001420>
> >
> > 1354662591.762804
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P050_F01568.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.001345>
> >
> > 1354662591.764216
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P089_F01567.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.001541>
> >
> > 1354662591.765828
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P010_F01594.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.001358>
> >
> > 1354662591.767252
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P045_F01569.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.001396>
> >
> > 1354662591.768715
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P036_F01592.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.002072>
> >
> > 1354662591.770854
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P089_F01568.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.001722>
> >
> > 1354662591.772643
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P009_F01600.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.001281>
> >
> > 1354662591.773992
> > lstat64("TEW_STRESS_TEST_VM.1K_100P_1F.P022_F01583.txt",
> > {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 <0.001413>
> >
> >
> >
> > We are using a 32 bits architecture, can it be the cause of the kernel
> > not having enough memory ? Any possibility to change this behavior ?
> >
> >
> >
> >
> > Amaury FRANCOIS  •  Ingénieur
> >
> > Mobile +33 (0)6 88 12 62 54
> >
> > amaury.franc...@digora.com
> >
> > Siège Social – 66 rue du Marché Gare – 67200 STRASBOURG
> >
> > Tél : 0 820 200 217 - +33 (0)3 88 10 49 20
> >
> >
> >
> > *De :*Sunil Mushran [mailto:sunil.mush...@gmail.com]
> > *Envoyé :* mardi 4 décembre 2012 18:29
> > *À :* Amaury Francois
> > *Cc :* ocfs2-users@oss.oracle.com
> > *Objet :* Re: [Ocfs2-users] "ls" taking ages on a directory containing
> > 900000 files
> >
> >
> >
> > strace -p PID -ttt -T
> >
> >
> >
> > Attach and get some timings. The simplest guess is that the system lacks
> > memory to cache all the inodes
> >
> > and thus has to hit disk (and more importantly take cluster locks) for
> > the same inode repeatedly. The user
> >
> > guide has a section in NOTES explaining this.
> >
> 

Re: [Ocfs2-users] "ls" taking ages on a directory containing 900000 files

2012-12-04 Thread Sunil Mushran
strace -p PID -ttt -T

Attach and get some timings. The simplest guess is that the system lacks
memory to cache all the inodes
and thus has to hit disk (and more importantly take cluster locks) for the
same inode repeatedly. The user
guide has a section in NOTES explaining this.



On Tue, Dec 4, 2012 at 8:54 AM, Amaury Francois
wrote:

>  Hello,
>
> We are running OCFS2 1.8 and on a kernel UEK2. An ls on a directory
> containing approx. 1 million of files  is very long (1H). The features we
> have activated on the filesystem are the following : 
>
> [root@pa-oca-app10 ~]# debugfs.ocfs2 -R "stats" /dev/sdb1
>
> Revision: 0.90
>
> Mount Count: 0   Max Mount Count: 20
>
> State: 0   Errors: 0
>
> Check Interval: 0   Last Check: Fri Nov 30 19:30:17 2012
>
> Creator OS: 0
>
> Feature Compat: 3 backup-super strict-journal-super
>
> Feature Incompat: 32592 sparse extended-slotmap inline-data
> metaecc xattr indexed-dirs refcount discontig-bg clusterinfo
>
> Tunefs Incomplete: 0
>
> Feature RO compat: 1 unwritten
>
> Root Blknum: 5   System Dir Blknum: 6
>
> First Cluster Group Blknum: 3
>
> Block Size Bits: 12   Cluster Size Bits: 12
>
> Max Node Slots: 8
>
> Extended Attributes Inline Size: 256
>
> Label: exchange2
>
> UUID: 2375EAF4E4954C4ABB984BDE27AC93D5
>
> Hash: 2880301520 (0xabade9d0)
>
> DX Seeds: 1678175851 1096448356 79406012 (0x6406ee6b 0x415a7964
> 0x04bba3bc)
>
> Cluster stack: o2cb
>
> Cluster name: appcluster
>
> Cluster flags: 1 Globalheartbeat
>
> Inode: 2   Mode: 00   Generation: 3567595533 (0xd4a5300d)
>
> FS Generation: 3567595533 (0xd4a5300d)
>
> CRC32: 0c996202   ECC: 0819
>
> Type: Unknown   Attr: 0x0   Flags: Valid System Superblock
>
> Dynamic Features: (0x0)
>
> User: 0 (root)   Group: 0 (root)   Size: 0
>
> Links: 0   Clusters: 5242635
>
> ctime: 0x508eac6b 0x0 -- Mon Oct 29 17:18:51.0 2012
>
> atime: 0x0 0x0 -- Thu Jan  1 01:00:00.0 1970
>
> mtime: 0x508eac6b 0x0 -- Mon Oct 29 17:18:51.0 2012
>
> dtime: 0x0 -- Thu Jan  1 01:00:00 1970
>
> Refcount Block: 0
>
> Last Extblk: 0   Orphan Slot: 0
>
> Sub Alloc Slot: Global   Sub Alloc Bit: 65535
>
> May inline-data or xattr be the source of the problem ?
>
> Thank you. 
>
>
> Amaury FRANCOIS  •  Ingénieur
>
> Mobile +33 (0)6 88 12 62 54
>
> amaury.franc...@digora.com
>
> Siège Social – 66 rue du Marché Gare – 67200 STRASBOURG
>
> Tél : 0 820 200 217 - +33 (0)3 88 10 49 20
>
>

Re: [Ocfs2-users] Huge Problem ocfs2

2012-11-09 Thread Sunil Mushran
Yes, that should be enough for that. But that won't help if the real problem
is device related.

What does debugfs.ocfs2 -R "ls -l /" return? If that errors, it means the root
dir is gone. Maybe it is
best to look into your backups.


On Fri, Nov 9, 2012 at 6:01 PM, Marian Serban  wrote:

>  Nope, rdump doesn't work either.
>
> debugfs: rdump -v / /tmp
> Copying to /tmp/
> rdump: Bad magic number in inode while reading inode 129
> rdump: Bad magic number in inode while recursively dumping inode 129
>
>
> Could you please confirm that it's enough to just force the return value
> of 0 at "ocfs2_validate_meta_ecc" in order to bypass the ECC checks?
>
>
>
>
> On 10.11.2012 03:55, Sunil Mushran wrote:
>
> If global bitmap is gone. then the fs is unusable. But you can extract
> data using
> the rdump command in debugfs.ocfs. The success depends on how much of the
> device is still usable.
>
>
> On Fri, Nov 9, 2012 at 5:50 PM, Marian Serban  wrote:
>
>>  I tried hacking the fsck.ocfs2 source code by not considering metaecc
>> flag. Then I ran into
>>
>> journal recovery: Bad magic number in inode while looking up the journal
>> inode for slot 0
>>
>> fsck encountered unrecoverable errors while replaying the journals and
>> will not continue
>>
>>  After bypassing journal replay function, I got
>>
>> Pass 0a: Checking cluster allocation chains
>> pass0: Bad magic number in inode while looking up the global bitmap inode
>> fsck.ocfs2: Bad magic number in inode while performing pass 0
>>
>>
>> Does it mean the filesystem is destroyed completely?
>>
>>
>>
>>
>> On 10.11.2012 02:54, Marian Serban wrote:
>>
>> That's the kernel:
>>
>> Linux ro02xsrv003.bv.easic.ro 2.6.39.4 #6 SMP Mon Dec 12 12:09:49 EET
>> 2011 x86_64 x86_64 x86_64 GNU/Linux
>>
>> Anyway, I tried disabling the metaecc feature, no luck.
>>
>> [root@ro02xsrv003 ~]# tunefs.ocfs2 --fs-features=nometaecc
>> /dev/mapper/volgr1-lvol0
>> tunefs.ocfs2: I/O error on channel while opening device
>> "/dev/mapper/volgr1-lvol0"
>>
>> These are the last lines of strace corresponding to the tunefs.ocfs
>> command:
>>
>>
>>
>> open("/sys/fs/ocfs2/cluster_stack", O_RDONLY) = 4
>> fstat(4, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
>> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
>> = 0x7f54aad05000
>> read(4, "o2cb\n", 4096) = 5
>> close(4)= 0
>> munmap(0x7f54aad05000, 4096)= 0
>> open("/sys/fs/o2cb/interface_revision", O_RDONLY) = 4
>> read(4, "5\n", 15)  = 2
>> read(4, "", 13) = 0
>> close(4)= 0
>> stat("/sys/kernel/config", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
>> statfs("/sys/kernel/config", {f_type=0x62656570, f_bsize=4096,
>> f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0},
>> f_namelen=255, f_frsize=4096}) = 0
>> open("/dev/mapper/volgr1-lvol0", O_RDONLY) = 4
>> ioctl(4, BLKSSZGET, 0x7fffce711454) = 0
>> close(4)= 0
>> pread(3, 
>> "\0\0\v\25\37\1\200\200\202@\21\2\30\26\0\0\0,\17\272\241\4\340\210\311\377\17\300\327\332\373\17"...,
>> 4096, 532480) = 4096
>> close(3)= 0
>> write(2, "tunefs.ocfs2", 12tunefs.ocfs2)= 12
>> write(2, ": ", 2: )   = 2
>> write(2, "I/O error on channel", 20I/O error on channel)= 20
>> write(2, " ", 1 )= 1
>> write(2, "while opening device \"/dev/mappe"..., 47while opening device
>> "/dev/mapper/volgr1-lvol0") = 47
>> write(2, "\r\n", 2
>>
>>
>>
>>
>>
>> On 10.11.2012 02:06, Sunil Mushran wrote:
>>
>> It's either that or a check sum problem. Disable metaecc. Not sure which
>> kernel you are running.
>> We had fixed few problems few years ago around this. If your kernel is
>> older, then it could be
>> a known issue.
>>
>>
>> On Fri, Nov 9, 2012 at 12:50 PM, Marian Serban  wrote:
>>
>>> Hi Sunil,
>>>
>>> Thank you for answering. Unfortunately, it doesn't seem like it's a
>>> hardware problem. There's no way a cable can be loose because it's iSCSI
>>> over 1G Ethernet (coppe

Re: [Ocfs2-users] Huge Problem ocfs2

2012-11-09 Thread Sunil Mushran
If the global bitmap is gone, then the fs is unusable. But you can extract data
using
the rdump command in debugfs.ocfs2. The success depends on how much of the
device is still usable.


On Fri, Nov 9, 2012 at 5:50 PM, Marian Serban  wrote:

>  I tried hacking the fsck.ocfs2 source code by not considering metaecc
> flag. Then I ran into
>
> journal recovery: Bad magic number in inode while looking up the journal
> inode for slot 0
>
> fsck encountered unrecoverable errors while replaying the journals and
> will not continue
>
> After bypassing journal replay function, I got
>
> Pass 0a: Checking cluster allocation chains
> pass0: Bad magic number in inode while looking up the global bitmap inode
> fsck.ocfs2: Bad magic number in inode while performing pass 0
>
>
> Does it mean the filesystem is destroyed completely?
>
>
>
>
> On 10.11.2012 02:54, Marian Serban wrote:
>
> That's the kernel:
>
> Linux ro02xsrv003.bv.easic.ro 2.6.39.4 #6 SMP Mon Dec 12 12:09:49 EET
> 2011 x86_64 x86_64 x86_64 GNU/Linux
>
> Anyway, I tried disabling the metaecc feature, no luck.
>
> [root@ro02xsrv003 ~]# tunefs.ocfs2 --fs-features=nometaecc
> /dev/mapper/volgr1-lvol0
> tunefs.ocfs2: I/O error on channel while opening device
> "/dev/mapper/volgr1-lvol0"
>
> These are the last lines of strace corresponding to the tunefs.ocfs
> command:
>
>
>
> open("/sys/fs/ocfs2/cluster_stack", O_RDONLY) = 4
> fstat(4, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
> 0x7f54aad05000
> read(4, "o2cb\n", 4096) = 5
> close(4)= 0
> munmap(0x7f54aad05000, 4096)= 0
> open("/sys/fs/o2cb/interface_revision", O_RDONLY) = 4
> read(4, "5\n", 15)  = 2
> read(4, "", 13) = 0
> close(4)= 0
> stat("/sys/kernel/config", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
> statfs("/sys/kernel/config", {f_type=0x62656570, f_bsize=4096, f_blocks=0,
> f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255,
> f_frsize=4096}) = 0
> open("/dev/mapper/volgr1-lvol0", O_RDONLY) = 4
> ioctl(4, BLKSSZGET, 0x7fffce711454) = 0
> close(4)= 0
> pread(3, 
> "\0\0\v\25\37\1\200\200\202@\21\2\30\26\0\0\0,\17\272\241\4\340\210\311\377\17\300\327\332\373\17"...,
> 4096, 532480) = 4096
> close(3)= 0
> write(2, "tunefs.ocfs2", 12tunefs.ocfs2)= 12
> write(2, ": ", 2: )   = 2
> write(2, "I/O error on channel", 20I/O error on channel)= 20
> write(2, " ", 1 )= 1
> write(2, "while opening device \"/dev/mappe"..., 47while opening device
> "/dev/mapper/volgr1-lvol0") = 47
> write(2, "\r\n", 2
>
>
>
>
>
> On 10.11.2012 02:06, Sunil Mushran wrote:
>
> It's either that or a check sum problem. Disable metaecc. Not sure which
> kernel you are running.
> We had fixed few problems few years ago around this. If your kernel is
> older, then it could be
> a known issue.
>
>
> On Fri, Nov 9, 2012 at 12:50 PM, Marian Serban  wrote:
>
>> Hi Sunil,
>>
>> Thank you for answering. Unfortunately, it doesn't seem like it's a
>> hardware problem. There's no way a cable can be loose because it's iSCSI
>> over 1G Ethernet (copper wires) environment. Also I performed "dd
>> if=/dev/ of=/dev/null" and first 16GB or so are fine. "Dmesg" shows no
>> errors.
>>
>>
>> Also tried with debugfs.ocfs2:
>>
>>
>> [root@ro02xsrv003 ~]# debugfs.ocfs2  /dev/mapper/volgr1-lvol0
>> debugfs.ocfs2 1.6.3
>> debugfs: ls
>> ls: Bad magic number in inode '.'
>> debugfs: slotmap
>> slotmap: Bad magic number in inode while reading slotmap system file
>> debugfs: stats
>> Revision: 0.90
>> Mount Count: 0   Max Mount Count: 20
>> State: 0   Errors: 0
>> Check Interval: 0   Last Check: Fri Nov  9 14:35:53 2012
>> Creator OS: 0
>> Feature Compat: 3 backup-super strict-journal-super
>> Feature Incompat: 16208 sparse extended-slotmap inline-data
>> metaecc xattr indexed-dirs refcount discontig-bg
>> Tunefs Incomplete: 0
>> Feature RO compat: 7 unwritten usrquota grpquota
>> Root Blknum: 129   System Dir Blknum: 130
>> First Cluster Group Blknum: 6

Re: [Ocfs2-users] Huge Problem ocfs2

2012-11-09 Thread Sunil Mushran
It's either that or a check sum problem. Disable metaecc. Not sure which
kernel you are running.
We had fixed few problems few years ago around this. If your kernel is
older, then it could be
a known issue.


On Fri, Nov 9, 2012 at 12:50 PM, Marian Serban  wrote:

> Hi Sunil,
>
> Thank you for answering. Unfortunately, it doesn't seem like it's a
> hardware problem. There's no way a cable can be loose because it's iSCSI
> over 1G Ethernet (copper wires) environment. Also I performed "dd
> if=/dev/ of=/dev/null" and first 16GB or so are fine. "Dmesg" shows no
> errors.
>
>
> Also tried with debugfs.ocfs2:
>
>
> [root@ro02xsrv003 ~]# debugfs.ocfs2  /dev/mapper/volgr1-lvol0
> debugfs.ocfs2 1.6.3
> debugfs: ls
> ls: Bad magic number in inode '.'
> debugfs: slotmap
> slotmap: Bad magic number in inode while reading slotmap system file
> debugfs: stats
> Revision: 0.90
> Mount Count: 0   Max Mount Count: 20
> State: 0   Errors: 0
> Check Interval: 0   Last Check: Fri Nov  9 14:35:53 2012
> Creator OS: 0
> Feature Compat: 3 backup-super strict-journal-super
> Feature Incompat: 16208 sparse extended-slotmap inline-data
> metaecc xattr indexed-dirs refcount discontig-bg
> Tunefs Incomplete: 0
> Feature RO compat: 7 unwritten usrquota grpquota
> Root Blknum: 129   System Dir Blknum: 130
> First Cluster Group Blknum: 64
> Block Size Bits: 12   Cluster Size Bits: 18
> Max Node Slots: 10
> Extended Attributes Inline Size: 256
> Label: SAN
> UUID: B4CF8D4667AF43118F3324567B90A987
> Hash: 3698209293 (0xdc6e320d)
> DX Seed[0]: 0x9f4a2bb7
> DX Seed[1]: 0x501ddac0
> DX Seed[2]: 0x6034bfe8
> Cluster stack: classic o2cb
> Inode: 2   Mode: 00   Generation: 1093568923 (0x412e899b)
> FS Generation: 1093568923 (0x412e899b)
> CRC32: 46f2d360   ECC: 04d4
> Type: Unknown   Attr: 0x0   Flags: Valid System Superblock
> Dynamic Features: (0x0)
> User: 0 (root)   Group: 0 (root)   Size: 0
> Links: 0   Clusters: 45340448
> ctime: 0x4ee67f67 -- Tue Dec 13 00:25:43 2011
> atime: 0x0 -- Thu Jan  1 02:00:00 1970
> mtime: 0x4ee67f67 -- Tue Dec 13 00:25:43 2011
> dtime: 0x0 -- Thu Jan  1 02:00:00 1970
> ctime_nsec: 0x -- 0
> atime_nsec: 0x -- 0
> mtime_nsec: 0x -- 0
> Refcount Block: 0
> Last Extblk: 0   Orphan Slot: 0
> Sub Alloc Slot: Global   Sub Alloc Bit: 65535
>
>
>
>
> Marian
>
>

Re: [Ocfs2-users] Huge Problem ocfs2

2012-11-09 Thread Sunil Mushran
IO error on channel means the system cannot talk to the block device. The
problem
is in the block layer. Maybe a loose cable or a setup problem.
dmesg should show errors.


On Fri, Nov 9, 2012 at 10:46 AM, Laurentiu Gosu  wrote:

>  Hi,
> I'm using ocfs2 cluster in a production environment since almost 1 year.
> During this time i had to run a fsck.ocfs2 few months ago due to some
> errors but they were fixed.
> Now i have a big problem: I'm not able to mount the volume on any of the
> nodes. I stopped all nodes except one. Some output bellow:
> mount /mnt/ocfs2
> mount.ocfs2: I/O error on channel while trying to determine heartbeat
> information
>
> fsck.ocfs2 /dev/mapper/volgr1-lvol0
> fsck.ocfs2 1.6.3
> fsck.ocfs2: I/O error on channel while initializing the DLM
>
> fsck.ocfs2 -n /dev/mapper/volgr1-lvol0
> fsck.ocfs2 1.6.3
> Checking OCFS2 filesystem in /dev/mapper/volgr1-lvol0:
>   Label:  SAN
>   UUID:   B4CF8D4667AF43118F3324567B90A987
>   Number of blocks:   2901788672
>   Block size: 4096
>   Number of clusters: 45340448
>   Cluster size:   262144
>   Number of slots: 10
>
> journal recovery: I/O error on channel while looking up the journal
> inode for slot 0
> fsck encountered unrecoverable errors while replaying the journals and
> will not continue
>
>
> Can you give me some hints on how to debug the problem?
>
> Thank you,
> Laurentiu.
>

Re: [Ocfs2-users] HA-OCFS2?

2012-09-13 Thread Sunil Mushran
cfs != storage

You need highly available storage that is concurrently accessible
from multiple nodes.

ocfs2 will allow multiple nodes to concurrently access the same storage.
With posix semantics.
If a node dies, the remaining nodes will pause to recover and then continue
functioning. The
dead node can then restart and rejoin the cluster.

On Thu, Sep 13, 2012 at 5:02 PM, Eric  wrote:

> Is it possible to create a highly-available OCFS2 cluster (i.e., A storage
> cluster that mitigates the single point of failure [SPoF] created by
> storing an OCFS2 volume on a single LUN)?
>
> The OCFS2 Project Page makes this claim...
>
> > OCFS2 is a general-purpose shared-disk cluster file system for Linux
> capable of providing both *high performance* and *high availability*.
>
> ...but without backing-up the claim of high availability storage (at
> either the HDD- or the node-level).
>
> I've found a couple of articles hinting at using Linux Multipathing or
> DRBD but very little detailed information about either.
>
> TIA,
> Eric Pretorious
> Truckee, CA
>

Re: [Ocfs2-users] Ocfs2-users Digest, Vol 105, Issue 4

2012-09-12 Thread Sunil Mushran
On Wed, Sep 12, 2012 at 9:45 AM, Asanka Gunasekera <
asanka_gunasek...@yahoo.co.uk> wrote:

> Load O2CB driver on boot (y/n) [y]:
> Cluster stack backing O2CB [o2cb]:
> Cluster to start on boot (Enter "none" to clear) [ocfs2]:
> Specify heartbeat dead threshold (>=7) [31]:
> Specify network idle timeout in ms (>=5000) [3]:
> Specify network keepalive delay in ms (>=1000) [2000]:
> Specify network reconnect delay in ms (>=2000) [2000]:
> Writing O2CB configuration: OK
> Loading filesystem "configfs": OK
> Mounting configfs filesystem at /sys/kernel/config: OK
> Loading filesystem "ocfs2_dlmfs": OK
> Mounting ocfs2_dlmfs filesystem at /dlm: OK
> Starting O2CB cluster ocfs2: Failed
> Cluster ocfs2 created
> Node ocfsn1 added
> o2cb_ctl: Internal logic failure while adding node ocfsn2
>
> Stopping O2CB cluster ocfs2: OK
>


Something wrong with your cluster.conf. Overlapping node numbers, maybe.
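For reference, a minimal classic-o2cb /etc/ocfs2/cluster.conf for the two nodes named in the output above, with distinct, non-overlapping node numbers, might look like this (the IP addresses and port are placeholders; they must match your actual interconnect, and the file must be identical on every node):

```
node:
        ip_port = 7777
        ip_address = 192.168.1.101
        number = 0
        name = ocfsn1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.1.102
        number = 1
        name = ocfsn2
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2
```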



> abd in the messages I time to time get below and I saw in a post that I
> can ignore this.
>
> modprobe: FATAL: Module ocfs2_stackglue not found.
>


Yes, this is harmless.

Re: [Ocfs2-users] test inode bit failed -5

2012-08-31 Thread Sunil Mushran
nfsd encountered an error reading the device. So something in the io path
below the
fs encountered a problem. If it just happened once, then you can ignore it.

On Fri, Aug 31, 2012 at 2:23 AM, Hideyasu Kojima wrote:

> Hi
> I using ocfs2 cluster as NFS Server.
>
> Only once,I got a bellow error,and write error from NFS Client.
> What happend?
>
> kernel: (nfsd,12870,0):ocfs2_get_suballoc_slot_bit:2096 ERROR: read
> block 24993224 failed -5
> kernel: (nfsd,12870,0):ocfs2_test_inode_bit:2207 ERROR: get alloc slot
> and bit failed -5
> kernel: (nfsd,12870,0):ocfs2_get_dentry:96 ERROR: test inode bit failed -5
>
> I currently use kernel 2.6.18-164.el5
> OCFS2 : 1.4.7
> ocfs2-tool: 1.4.4
>
> Thanks.
> --
>
>

Re: [Ocfs2-users] Issue with OCFS2 mount

2012-08-29 Thread Sunil Mushran
Forgot to add that this issue is limited to metaecc. So you could avoid the
issue in the same setup by not enabling metaecc on the volume. And last I
checked, mkfs did not
enable it by default.

On Mon, Aug 27, 2012 at 10:35 AM, Sunil Mushran wrote:

> So you are running into a bug that has been fixed in 2.6.36. Upgrade to
> that version,
> if not something more current.
>
> $ git describe --tags 13ceef09
> v2.6.35-rc3-14-g13ceef0
>
> commit 13ceef099edd2b70c5a6f3a9ef5d6d97cda2e096
> Author: Jan Kara 
> Date:   Wed Jul 14 07:56:33 2010 +0200
>
> jbd2/ocfs2: Fix block checksumming when a buffer is used in several
> transactions
>
> OCFS2 uses t_commit trigger to compute and store checksum of the just
> committed blocks. When a buffer has b_frozen_data, checksum is computed
> for it instead of b_data but this can result in an old checksum being
> written to the filesystem in the following scenario:
>
> 1) transaction1 is opened
> 2) handle1 is opened
> 3) journal_access(handle1, bh)
> - This sets jh->b_transaction to transaction1
> 4) modify(bh)
> 5) journal_dirty(handle1, bh)
> 6) handle1 is closed
> 7) start committing transaction1, opening transaction2
> 8) handle2 is opened
> 9) journal_access(handle2, bh)
> - This copies off b_frozen_data to make it safe for transaction1
> to commit.
>   jh->b_next_transaction is set to transaction2.
> 10) jbd2_journal_write_metadata() checksums b_frozen_data
> 11) the journal correctly writes b_frozen_data to the disk journal
> 12) handle2 is closed
> - There was no dirty call for the bh on handle2, so it is never
> queued for
>   any more journal operation
> 13) Checkpointing finally happens, and it just spools the bh via
> normal buffer
> writeback.  This will write b_data, which was never triggered on and
> thus
> contains a wrong (old) checksum.
>
> This patch fixes the problem by calling the trigger at the moment data
> is
> frozen for journal commit - i.e., either when b_frozen_data is created
> by
> do_get_write_access or just before we write a buffer to the log if
> b_frozen_data does not exist. We also rename the trigger to t_frozen as
> that better describes when it is called.
>
> Signed-off-by: Jan Kara 
> Signed-off-by: Mark Fasheh 
> Signed-off-by: Joel Becker 
>
>
> On Mon, Aug 27, 2012 at 5:10 AM, Rory Kilkenny 
> wrote:
>
>>  # uname -a
>> Linux FILEt1 2.6.34.7-0.7-desktop #1 SMP PREEMPT 2010-12-13 11:13:53
>> +0100 x86_64 x86_64 x86_64 GNU/Linux
>>
>> # modinfo ocfs2
>> filename:   /lib/modules/2.6.34.7-0.7-desktop/kernel/fs/ocfs2/ocfs2.ko
>> license:GPL
>> author: Oracle
>> version:1.5.0
>> description:OCFS2 1.5.0
>> srcversion: B13569B35F99D43FA80D129
>> depends:jbd2,ocfs2_stackglue,quota_tree,ocfs2_nodemanager
>> vermagic:   2.6.34.7-0.7-desktop SMP preempt mod_unload modversions
>>
>> # mkfs.ocfs2 --version
>> mkfs.ocfs2 1.4.3
>>
>>
>>
>>
>> On 12-08-24 5:44 PM, "Sunil Mushran"  wrote:
>>
>> What is the version of the kernel, ocfs2 and ocfs2 tools?
>>
>> uname -a
>> modinfo ocfs2
>> mkfs.ocfs2 --version
>>
>> On Fri, Aug 24, 2012 at 1:09 PM, Rory Kilkenny 
>> wrote:
>>
>> We have an HP P2000 G3 Storage array, fiber connected.  The storage array
>> has a RAID5 array broken into 2 physical OCFS2 volumes (A & B).
>>
>> A & B are both mounted and formatted as NTFS.
>>
>> One of the volumes is NFS mounted.
>>
>> Every couple of months or so we start getting tons of errors on the NFS
>> mounted volume:
>>
>>
>> Aug 24 09:48:13 FILEt2 kernel: [2234285.848940]
>> (ocfs2_wq,13844,7):ocfs2_block_check_validate:443 ERROR: CRC32 failed:
>> stored: 0, computed 1467126086.  Applying ECC.
>> Aug 24 09:48:13 FILEt2 kernel: [2234285.849252]
>> (ocfs2_wq,13844,7):ocfs2_block_check_validate:457 ERROR: Fixed CRC32
>> failed: stored: 0, computed 3828104806
>> Aug 24 09:48:13 FILEt2 kernel: [2234285.849256]
>> (ocfs2_wq,13844,7):ocfs2_validate_extent_block:903 ERROR: Checksum failed
>> for extent block 1169089
>> Aug 24 09:48:13 FILEt2 kernel: [2234285.849261]
>> (ocfs2_wq,13844,7):__ocfs2_find_path:1861 ERROR: status = -5
>> Aug 24 09:48:13 FILEt2 kernel: [2234285.849264]
>> (ocfs2_wq,13844,7):ocfs2_find_leaf:1958 ERROR: status = -5
>> Aug 24 09:48:13 FILEt2 kernel: [2234285.849267]
>> (ocfs2_wq,13844,7):ocfs2_find_new_last_ext_

Re: [Ocfs2-users] Issue with files and folder ownership

2012-08-29 Thread Sunil Mushran
I would recommend pacemaker if the distribution you are using has all the
bits.
Manual building gets messy. Suse based distros have all the bits required
for ocfs2+pacemaker.

On Tue, Aug 28, 2012 at 10:40 PM, Emilien Macchi <
emilien.mac...@stackops.com> wrote:

> Hi,
>
> On Wed, Aug 29, 2012 at 7:25 AM, Sunil Mushran wrote:
>
>> Isn't the mount point is local to the machine?
>
>
> I use iSCSI for the Block device and I mount the device (/dev/sdc1) at
> /var/lib/nova/instances.
>
> I've formatted /dev/sdc1 in OCFS2 FS.
>
> Should I use Pacemaker to manage OCFS2 ?
>
> Thanks,
>
> -Emilien
>
>
>>
>> On Tue, Aug 28, 2012 at 10:14 PM, Emilien Macchi <
>> emilien.mac...@stackops.com> wrote:
>>
>>> Hi,
>>>
>>> On Wed, Aug 29, 2012 at 12:36 AM, Sunil Mushran wrote:
>>>
>>>> Permissions on the mount point should be local to a machine.
>>>
>>>
>>> That's unthinkable if you consider that's a cluster FS which respects
>>> POSIX rules.
>>>
>>>
>>> -Emilien
>>>
>>>
>>>
>>>> AFAIK.
>>>>
>>>> On Mon, Aug 27, 2012 at 3:08 AM, Emilien Macchi <
>>>> emilien.mac...@stackops.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> I'm working on a two nodes cluster with the goal to store virtual
>>>>> machines managed by OpenStack services and KVM Hypervisor. I also use 
>>>>> iSCSI
>>>>> Multi-Pathing for the block device.
>>>>>
>>>>> My cluster is running and I can mount the device (/dev/sdd1).
>>>>>
>>>>> I'm having some problems with POSIX rights :
>>>>>
>>>>>- *chmod* on a file or folder is working.
>>>>>- *chown* on a file or folder is not working as I want : I'm
>>>>>trying to change the ownership of */var/lib/nova/instances* which
>>>>>is my mount point, but when I do that, the ownership setting is not 
>>>>> applied
>>>>>on the second node.
>>>>>
>>>>> I can't use yet OpenStack + KVM because the mount point should have
>>>>> the "nova" user as POSIX owner.
>>>>>
>>>>> Here is my *cluster.conf* :
>>>>> http://paste.openstack.org/show/oPQR5pjZETz7xSAR04so/
>>>>> And my mount point :
>>>>> */dev/sdd1 on /var/lib/nova/instances type ocfs2
>>>>> (rw,_netdev,heartbeat=local)*
>>>>>
>>>>>
>>>>> In advance thank you for your help.
>>>>>
>>>>>
>>>>> Best regards
>>>>>
>>>>> --
>>>>> Emilien Macchi
>>>>> *System Engineer*
>>>>> *www.stackops.com* | *emilien.mac...@stackops.com* | *skype:emilien.macchi*
>>>>>
>>>>>  LEGAL NOTICE 
>>>>> We hereby inform you, as addressee of this message, that e-mail and
>>>>> communications via the Internet do not guarantee the confidentiality,
>>>>> nor the completeness or proper reception of the messages sent, and
>>>>> thus STACKOPS TECHNOLOGIES S.L. does not assume any liability for
>>>>> those circumstances. Should you not agree to the use of e-mail or to
>>>>> communications via the Internet, you are kindly requested to notify us
>>>>> immediately. This message is intended exclusively for the person to
>>>>> whom it is addressed and contains privileged and confidential
>>>>> information protected from disclosure by law. If you have received
>>>>> this message in error, please notify us immediately by reply e-mail
>>>>> and delete it along with any attachments. You are also hereby notified
>>>>> that any dissemination, distribution, copying or use of this message,
>>>>> or of any attachments to it

Re: [Ocfs2-users] Issue with files and folder ownership

2012-08-28 Thread Sunil Mushran
Permissions on the mount point should be local to a machine. AFAIK.

On Mon, Aug 27, 2012 at 3:08 AM, Emilien Macchi  wrote:

> Hi,
>
>
> I'm working on a two nodes cluster with the goal to store virtual machines
> managed by OpenStack services and KVM Hypervisor. I also use iSCSI
> Multi-Pathing for the block device.
>
> My cluster is running and I can mount the device (/dev/sdd1).
>
> I'm having some problems with POSIX rights :
>
>- *chmod* on a file or folder is working.
>- *chown* on a file or folder is not working as I want : I'm trying to
>change the ownership of */var/lib/nova/instances* which is my mount
>point, but when I do that, the ownership setting is not applied on the
>second node.
>
> I can't use yet OpenStack + KVM because the mount point should have the
> "nova" user as POSIX owner.
>
> Here is my *cluster.conf* :
> http://paste.openstack.org/show/oPQR5pjZETz7xSAR04so/
> And my mount point :
> */dev/sdd1 on /var/lib/nova/instances type ocfs2
> (rw,_netdev,heartbeat=local)*
>
>
> In advance thank you for your help.
>
>
> Best regards
>
> --
> Emilien Macchi
> *System Engineer*
> *www.stackops.com* | *emilien.mac...@stackops.com* | *skype:emilien.macchi*
>
>
> * PRIVILEGED AND CONFIDENTIAL 
> We hereby inform you, as addressee of this message, that e-mail and
> Internet do not guarantee the confidentiality, nor the completeness or
> proper reception of the messages sent and, thus, STACKOPS TECHNOLOGIES S.L.
> does not assume any liability for those circumstances. Should you not agree
> to the use of e-mail or to communications via Internet, you are kindly
> requested to notify us immediately. This message is intended exclusively
> for the person to whom it is addressed and contains privileged and
> confidential information protected from disclosure by law. If you are not
> the addressee indicated in this message, you should immediately delete it
> and any attachments and notify the sender by reply e-mail. In such case,
> you are hereby notified that any dissemination, distribution, copying or
> use of this message or any attachments, for any purpose, is strictly
> prohibited by law.
>
>

Re: [Ocfs2-users] Issue with OCFS2 mount

2012-08-27 Thread Sunil Mushran
So you are running into a bug that has been fixed in 2.6.36. Upgrade to
that version, if not something more current.

$ git describe --tags 13ceef09
v2.6.35-rc3-14-g13ceef0

commit 13ceef099edd2b70c5a6f3a9ef5d6d97cda2e096
Author: Jan Kara 
Date:   Wed Jul 14 07:56:33 2010 +0200

jbd2/ocfs2: Fix block checksumming when a buffer is used in several
transactions

OCFS2 uses t_commit trigger to compute and store checksum of the just
committed blocks. When a buffer has b_frozen_data, checksum is computed
for it instead of b_data but this can result in an old checksum being
written to the filesystem in the following scenario:

1) transaction1 is opened
2) handle1 is opened
3) journal_access(handle1, bh)
   - This sets jh->b_transaction to transaction1
4) modify(bh)
5) journal_dirty(handle1, bh)
6) handle1 is closed
7) start committing transaction1, opening transaction2
8) handle2 is opened
9) journal_access(handle2, bh)
   - This copies off b_frozen_data to make it safe for transaction1 to
     commit. jh->b_next_transaction is set to transaction2.
10) jbd2_journal_write_metadata() checksums b_frozen_data
11) the journal correctly writes b_frozen_data to the disk journal
12) handle2 is closed
    - There was no dirty call for the bh on handle2, so it is never
      queued for any more journal operation
13) Checkpointing finally happens, and it just spools the bh via normal
    buffer writeback. This will write b_data, which was never triggered
    on and thus contains a wrong (old) checksum.

This patch fixes the problem by calling the trigger at the moment data
is frozen for journal commit - i.e., either when b_frozen_data is
created by do_get_write_access or just before we write a buffer to the
log if b_frozen_data does not exist. We also rename the trigger to
t_frozen as that better describes when it is called.

Signed-off-by: Jan Kara 
Signed-off-by: Mark Fasheh 
Signed-off-by: Joel Becker 


On Mon, Aug 27, 2012 at 5:10 AM, Rory Kilkenny wrote:

>  # uname -a
> Linux FILEt1 2.6.34.7-0.7-desktop #1 SMP PREEMPT 2010-12-13 11:13:53 +0100
> x86_64 x86_64 x86_64 GNU/Linux
>
> # modinfo ocfs2
> filename:   /lib/modules/2.6.34.7-0.7-desktop/kernel/fs/ocfs2/ocfs2.ko
> license:GPL
> author: Oracle
> version:1.5.0
> description:OCFS2 1.5.0
> srcversion: B13569B35F99D43FA80D129
> depends:jbd2,ocfs2_stackglue,quota_tree,ocfs2_nodemanager
> vermagic:   2.6.34.7-0.7-desktop SMP preempt mod_unload modversions
>
> # mkfs.ocfs2 --version
> mkfs.ocfs2 1.4.3
>
>
>
>
> On 12-08-24 5:44 PM, "Sunil Mushran"  wrote:
>
> What is the version of the kernel, ocfs2 and ocfs2 tools?
>
> uname -a
> modinfo ocfs2
> mkfs.ocfs2 --version
>
> On Fri, Aug 24, 2012 at 1:09 PM, Rory Kilkenny 
> wrote:
>
> We have an HP P2000 G3 Storage array, fiber connected.  The storage array
> has a RAID5 array broken into 2 physical OCFS2 volumes (A & B).
>
> A & B are both mounted and formatted as NTFS.
>
> One of the volumes is NFS mounted.
>
> Every couple of months or so we start getting tons of errors on the NFS
> mounted volume:
>
>
> Aug 24 09:48:13 FILEt2 kernel: [2234285.848940]
> (ocfs2_wq,13844,7):ocfs2_block_check_validate:443 ERROR: CRC32 failed:
> stored: 0, computed 1467126086.  Applying ECC.
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849252]
> (ocfs2_wq,13844,7):ocfs2_block_check_validate:457 ERROR: Fixed CRC32
> failed: stored: 0, computed 3828104806
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849256]
> (ocfs2_wq,13844,7):ocfs2_validate_extent_block:903 ERROR: Checksum failed
> for extent block 1169089
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849261]
> (ocfs2_wq,13844,7):__ocfs2_find_path:1861 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849264]
> (ocfs2_wq,13844,7):ocfs2_find_leaf:1958 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849267]
> (ocfs2_wq,13844,7):ocfs2_find_new_last_ext_blk:6655 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849270]
> (ocfs2_wq,13844,7):ocfs2_do_truncate:6900 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849274]
> (ocfs2_wq,13844,7):ocfs2_commit_truncate:7556 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849280]
> (ocfs2_wq,13844,7):ocfs2_truncate_for_delete:593 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849284]
> (ocfs2_wq,13844,7):ocfs2_wipe_inode:769 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849287]
> (ocfs2_wq,13844,7):ocfs2_delete_inode:1067 ERROR: status = -5
>
>
> If we pull all the data off, destroy the volume, rebuilt it, and copy our
> data back, all works fine; for a while.
>

Re: [Ocfs2-users] Issue with OCFS2 mount

2012-08-24 Thread Sunil Mushran
What is the version of the kernel, ocfs2 and ocfs2 tools?

uname -a
modinfo ocfs2
mkfs.ocfs2 --version

On Fri, Aug 24, 2012 at 1:09 PM, Rory Kilkenny wrote:

>  We have an HP P2000 G3 Storage array, fiber connected.  The storage
> array has a RAID5 array broken into 2 physical OCFS2 volumes (A & B).
>
> A & B are both mounted and formatted as NTFS.
>
> One of the volumes is NFS mounted.
>
> Every couple of months or so we start getting tons of errors on the NFS
> mounted volume:
>
>
> Aug 24 09:48:13 FILEt2 kernel: [2234285.848940]
> (ocfs2_wq,13844,7):ocfs2_block_check_validate:443 ERROR: CRC32 failed:
> stored: 0, computed 1467126086.  Applying ECC.
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849252]
> (ocfs2_wq,13844,7):ocfs2_block_check_validate:457 ERROR: Fixed CRC32
> failed: stored: 0, computed 3828104806
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849256]
> (ocfs2_wq,13844,7):ocfs2_validate_extent_block:903 ERROR: Checksum failed
> for extent block 1169089
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849261]
> (ocfs2_wq,13844,7):__ocfs2_find_path:1861 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849264]
> (ocfs2_wq,13844,7):ocfs2_find_leaf:1958 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849267]
> (ocfs2_wq,13844,7):ocfs2_find_new_last_ext_blk:6655 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849270]
> (ocfs2_wq,13844,7):ocfs2_do_truncate:6900 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849274]
> (ocfs2_wq,13844,7):ocfs2_commit_truncate:7556 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849280]
> (ocfs2_wq,13844,7):ocfs2_truncate_for_delete:593 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849284]
> (ocfs2_wq,13844,7):ocfs2_wipe_inode:769 ERROR: status = -5
> Aug 24 09:48:13 FILEt2 kernel: [2234285.849287]
> (ocfs2_wq,13844,7):ocfs2_delete_inode:1067 ERROR: status = -5
>
>
> If we pull all the data off, destroy the volume, rebuilt it, and copy our
> data back, all works fine; for a while.
>
> This issue does not happen on the non NFS mounted volume. I am currently
> assuming the issue is with NFS and how we have it configured (which to the
> best of my knowledge is default).
>
> Has anyone had a similar experience and be able to share some insight and
> knowledge on any tricks with NFS and OCFS2 volumes?
>
> Thanks in advance.
>
>
>

Re: [Ocfs2-users] OCFS2 and util_file

2012-08-23 Thread Sunil Mushran
On Thu, Aug 23, 2012 at 10:58 AM, Maki, Nancy  wrote:

> By default we mount all our OCFS2 volumes with datavolume.  To be more
> specific, the volume that we are having the issue with is not a database
> volume but a shared drive for developers to read and write other types of
> files.  Would it be appropriate to remove the datavolume mount option from
> this particular volume only and leave it on our database volumes?
>
>
>
Yes. datavolume was only meant for db volumes. Other volumes have never
needed it.

Re: [Ocfs2-users] OCFS2 and util_file

2012-08-23 Thread Sunil Mushran
You are probably mounting the volume with the datavolume option. Instead,
use the init.ora parameter filesystemio_options to force odirect, and mount
the volume without the datavolume option. This is documented in the user's
guide.
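In practice that means two changes; the device, mount point, and option strings below are illustrative, not taken from this thread:

```shell
# 1) /etc/fstab: mount the shared volume WITHOUT the datavolume option, e.g.
#      /dev/sdb1  /u02  ocfs2  _netdev,nointr  0 0
#    then remount:
mount -o remount /u02

# 2) init.ora / spfile: have the database open its files with O_DIRECT
#      filesystemio_options = directIO    # or setall (direct + async I/O)
```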

On Thu, Aug 23, 2012 at 8:14 AM, Maki, Nancy  wrote:

> We are getting an error ORA-29284 when using utl_file.get_line to read an
> OCFS2 file of larger than 3896 characters. Has anyone encountered this
> before?  We are at OCFS2 2.6 running on OEL 5.6.
>
>
> Thanks,
>
> Nancy
>
>
>
> *Nancy Maki*
> *Manager of Database Services*
>
> Office of Information & Technology
> The State University of New York
> State University Plaza - Albany, New York 12246
> Tel: 518.320.1213   Fax: 518.320.1550
>
> eMail:  nancy.m...@suny.edu
> *Be a part of Generation SUNY: Facebook - Twitter - YouTube*
> 
>
>
>

Re: [Ocfs2-users] null pointer dereference

2012-08-21 Thread Sunil Mushran
You may want to run a full fsck on the fs.

fsck.ocfs2 -fy /dev/

On Tue, Aug 21, 2012 at 12:49 AM, Pawel  wrote:

> Hi,
> After upgrading ocfs2 my cluster is instable.
>
> At least ones per week I can see:
> kernel panic: Null pointer dereference  at 00048
> o2dlm_blocking_ast_wrapper + 0x8/0x20 [ocfs2_stack_o2cb]
> stack:
> dlm_do_local_bast [ocfs2_dlm]
> dlm_lookup_lockers [ocfs2_dlm]
> dlm_proxy_ast_handler
> add_timer
> ..
>
> After that sometimes deadlock happens on another nodes. Entire cluster
> restart solve the issue.
> I see in log:
> (dlm_thread,7227,3):dlm_send_proxy_ast_msg:484 ERROR:
> ECB9442E19A94EAC896641BFADD55E4B: res M0001f411c9,
> error -107 send AST to node 4
> (dlm_thread,7227,3):dlm_flush_asts:605 ERROR: status = -107
> o2net: No connection established with node 4 after 10.0 seconds, giving up.
> o2net: No connection established with node 4 after 10.0 seconds, giving up.
> o2net: No connection established with node 4 after 10.0 seconds, giving up.
> (dlm_thread,7227,4):dlm_send_proxy_ast_msg:484 ERROR:
> ECB9442E19A94EAC896641BFADD55E4B: res M0001f411c9,
> error -107 send AST to node 4
> (dlm_thread,7227,4):dlm_flush_asts:605 ERROR: status = -107
> o2cb: o2dlm has evicted node 4 from domain ECB9442E19A94EAC896641BFADD55E4B
> o2cb: o2dlm has evicted node 4 from domain ECB9442E19A94EAC896641BFADD55E4B
> o2dlm: Begin recovery on domain ECB9442E19A94EAC896641BFADD55E4B for node 4
> o2dlm: Node 5 (he) is the Recovery Master for the dead node 4 in domain
> ECB9442E19A94EAC896641BFADD55E4B
> o2dlm: End recovery on domain ECB9442E19A94EAC896641BFADD55E4B
>
>
> Additionaly ~4 times per day I see:
>
> ocfs2_check_dir_for_entry:2119 ERROR: status = -17
> ocfs2_mknod:459 ERROR: status = -17
> ocfs2_create:629 ERROR: status = -17
>
>
> I currently use kernel 3.4.2
> my filesystem has been created with:
> -N 8 -b 4096 -C 32768 --fs-features
>
> backup-super,strict-journal-super,sparse,extended-slotmap,inline-data,metaecc,xattr,indexed-dirs,refcount,discontig-bg,unwritten,usrquota,grpquota
>
> Could you tell me what could make my system instable? Which feature ?
>
> Thanks for any  help
>
> Pawel
>
>

Re: [Ocfs2-users] ocfs2 problem journal size

2012-08-02 Thread Sunil Mushran
oh crap. The dlm needs to lock the journals. So you need to recreate the
journal inodes with i_size 0.

dd out a good journal inode and edit it using a binary editor. Change the
inode number to the block number, zero out the i_size and next_free_extent.
Repeat for the 4 inodes.

Hopefully someone on the list has the time to help you further.
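A rough sketch of that dd step, using inode numbers from the listing later in this thread purely as examples (in ocfs2 an inode number is its block number) — verify every number with debugfs.ocfs2 before writing anything back:

```shell
# 1) copy out a known-good journal inode block (4 KB block size assumed)
dd if=/dev/sdc1 of=good_inode.bin bs=4096 skip=55 count=1
# 2) in a hex editor: set the inode's block number field to the target
#    block (e.g. 112), zero i_size and next_free_extent
# 3) write the edited block over the zeroed journal inode
dd if=good_inode.bin of=/dev/sdc1 bs=4096 seek=112 count=1
```

Keep a full-device backup (or at least copies of every block you touch) before attempting this; a wrong seek offset will corrupt unrelated metadata.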

On Thu, Aug 2, 2012 at 10:50 AM, Christophe BOUDER <
christophe.bou...@lip6.fr> wrote:

> hello,
>
> > The 4 journal inodes got zeroed out. Do you know how/why?
>
> raid6 with 2 bad disk
> and a third who got problem
> reinsert it in the device it appears good
> but it also crash the device not recognize by the system.
>
> >
> > Have you tried running fsck with -fy (enable writes).
>
> yes but without success
> #fsck.ocfs2 -fy /dev/sdc1
> fsck.ocfs2 1.6.3
> fsck.ocfs2: Internal logic failure while initializing the DLM
>
> > Try with -fy. If that does not work, we'll have to reconstruct empty
> > inodes
> > as
> > placeholders to allow fsck to complete journal recovery followed by
> > journal
> > recreation.
>
> ok how can i do that ?
>
>
> --
> Christophe Bouder,
>
>

Re: [Ocfs2-users] ocfs2 problem journal size

2012-08-02 Thread Sunil Mushran
The 4 journal inodes got zeroed out. Do you know how/why?

Have you tried running fsck with -fy (enable writes).

fsck.ocfs2 does have a check for bad journals that it will regenerate.

JOURNAL_FILE_INVALID
OCFS2 uses JDB for journalling and some journal files exist in the system
directory. Fsck has found some journal files that are invalid.
Answering yes to this question will regenerate the invalid journal files.

But that may still not work as fsck is currently bailing out during journal
recovery
that happens much earlier on.

Try with -fy. If that does not work, we'll have to reconstruct empty inodes
as
placeholders to allow fsck to complete journal recovery followed by journal
recreation.

On Wed, Aug 1, 2012 at 6:41 PM, Christophe BOUDER  wrote:

> Hello,
>
> i use ocfs2 1.6.3 kernel 3.4.4 on debian testing
> i had problem on my infortrend device
> media error on a disk
> the result i can't mount my ocfs2 file but
> i can read the files with debugfs.ocfs2
>
> and my question is
> can i recover or recreate the journal size for node 8 9 10 11 ?
>
> thank for your help
> here's some log :
>
> # mount /data
> mount.ocfs2: Internal logic failure while trying to join the group
>
>
> # fsck.ocfs2 -n /dev/sdc1
> fsck.ocfs2 1.6.3
> Checking OCFS2 filesystem in /dev/sdc1:
>   Label:  data
>   UUID:   9B655B51E6874480BBC1309DCA048A39
>   Number of blocks:   4027690992
>   Block size: 4096
>   Number of clusters: 251730687
>   Cluster size:   65536
>   Number of slots:32
>
> journal recovery: I/O error on channel while reading cached inode 112 for
> slot 8's journal
> fsck encountered unrecoverable errors while replaying the journals and
> will not continue
>
> # echo "ls -l //" | debugfs.ocfs2 /dev/sdc1 |grep journal
> debugfs.ocfs2 1.6.3
> 55   -rw-r--r--  1  0  0  268435456  23-Jun-2007 21:30  journal:
> 56   -rw-r--r--  1  0  0  268435456  23-Jun-2007 21:30  journal:0001
> 57   -rw-r--r--  1  0  0  268435456  23-Jun-2007 21:30  journal:0002
> 58   -rw-r--r--  1  0  0  268435456  23-Jun-2007 21:30  journal:0003
> 59   -rw-r--r--  1  0  0  268435456  23-Jun-2007 21:31  journal:0004
> 79   -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:45  journal:0005
> 80   -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:45  journal:0006
> 81   -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:45  journal:0007
> 112  --          0  0  0  0           1-Jan-1970 01:00  journal:0008
> 113  --          0  0  0  0           1-Jan-1970 01:00  journal:0009
> 114  --          0  0  0  0           1-Jan-1970 01:00  journal:0010
> 115  --          0  0  0  0           1-Jan-1970 01:00  journal:0011
> 116  -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:46  journal:0012
> 117  -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:47  journal:0013
> 118  -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:47  journal:0014
> 119  -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:47  journal:0015
> 142  -rw-r--r--  1  0  0  268435456  29-May-2009 22:53  journal:0016
> 143  -rw-r--r--  1  0  0  268435456  29-May-2009 22:54  journal:0017
> 166  -rw-r--r--  1  0  0  268435456  31-Jan-2010 15:36  journal:0018
> 167  -rw-r--r--  1  0  0  268435456  31-Jan-2010 15:36  journal:0019
> 168  -rw-r--r--  1  0  0  268435456  31-Jan-2010 15:37  journal:0020
> 169  -rw-r--r--  1  0  0  268435456  31-Jan-2010 15:37  journal:0021
> 170  -rw-r--r--  1  0  0  268435456  31-Jan-2010 15:38  journal:0022
> 171  -rw-r--r--  1  0  0  268435456  31-Jan-2010 15:38  journal:0023
> 208  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:35  journal:0024
> 209  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:35  journal:0025
> 210  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:36  journal:0026
> 211  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:36  journal:0027
> 212  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:36  journal:0028
> 213  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:36  journal:0029
> 214  -rw-r--r--  1  0  0  268435456  21-No

Re: [Ocfs2-users] ocfs2-tools git: broken after commit deb5ade9145f8809f1fde19cf53bdfdf1fb7963e

2012-07-26 Thread Sunil Mushran
On Thu, Jul 26, 2012 at 6:37 AM, Dzianis Kahanovich
wrote:

> ocfs2-tools git wrong commit: deb5ade9145f8809f1fde19cf53bdfdf1fb7963e.
>
> After "cleanup unused variable":
> -else
> -tmp = g_list_append(elem, cfs);
>
> o2cb_ctl starts to ignore >1 node. Good commit must be:
> else
> -tmp = g_list_append(elem, cfs);
> +g_list_append(elem, cfs);
>
> Attached patch.
>
>
Thanks.

Acked-by: Sunil Mushran 

Re: [Ocfs2-users] Removing a node from cluster.conf (on a specific node)

2012-04-29 Thread Sunil Mushran
Online add/remove of nodes and of global heartbeat devices has been in mainline 
for over a year. I think 2.6.38+ and tools 1.8. The ocfs2-tools tree hosted on 
oss.oracle.com/git has a 1.8.2 tag that can be used safely. It has been fully 
tested. The user's guide has been moved to man pages bundled with the tools. Do 
man ocfs2 after building and installing the tools.
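With the 1.8 tools, an online node add looks roughly like this (syntax per the o2cb(8) man page; cluster name, node name, IP, port, and node number here are examples, run on every node so cluster.conf stays identical everywhere):

```shell
# Add a node to the running cluster and to /etc/ocfs2/cluster.conf.
o2cb add-node --ip 10.111.10.111 --port 7777 --number 5 ocfs2 xen-blade11
```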

On Apr 29, 2012, at 1:21 PM, Sébastien Riccio  wrote:

> Hi dear list,
> 
> I think the subjet might already been discussed, but I can only found 
> old threads about removing a node from the cluster.
> 
> I was hoping that in 2012 it would be possible to dynamically add/remove 
> nodes from a shared filesystem but this evening I had this problem:
> 
> I wanted to add a node to our ocfs2 cluster, node named xen-blade11 with 
> ip 10.111.10.111
> 
> So on every other node I ran this command:
> 
> o2cb_ctl -C -i -n xen-blade11 -t node -a number=5 -a 
> ip_address=10.111.10.111 -a ip_port= -a cluster=ocfs2
> 
> Which successfully added the node to every cluster node, except on 
> xen-server16
> 
> On every node the original cluster.conf was:
> 
> node:
> ip_port = 
> ip_address = 10.111.10.116
> number = 0
> name = xen-blade16
> cluster = ocfs2
> 
> node:
> ip_port = 
> ip_address = 10.111.10.115
> number = 1
> name = xen-blade15
> cluster = ocfs2
> 
> node:
> ip_port = 
> ip_address = 10.111.10.114
> number = 2
> name = xen-blade14
> cluster = ocfs2
> 
> node:
> ip_port = 
> ip_address = 10.111.10.113
> number = 3
> name = xen-blade13
> cluster = ocfs2
> 
> node:
> ip_port = 
> ip_address = 10.111.10.112
> number = 4
> name = xen-blade12
> cluster = ocfs2
> 
> cluster:
> node_count = 5
> name = ocfs2
> 
> 
> After adding the node, on every cluster.conf I can see that this was added:
> 
> node:
> ip_port = 
> ip_address = 10.111.10.111
> number = 5
> name = xen-blade11
> cluster = ocfs2
> 
> cluster:
> node_count = 6
> name = ocfs2
> 
> EXCEPT on xen-blade16
> 
> It added like this:
> 
> node:
> ip_port = 
> ip_address = 10.111.10.111
> number = 6
> name = xen-blade11
> cluster = ocfs2
> 
> cluster:
> node_count = 6
> name = ocfs2
> 
> (Notice the number = 6 instead of number = 5)
> 
> So now when i'm trying to connect the xen-blade11 every host accept the 
> connection except the xen-blade16, and the cluster joining is being 
> rejected.
> 
> as we can see in the kernel messages on xen-blade11
> 
> [ 1852.729539] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1852.729892] o2net: Connected to node xen-blade12 (num 4) at 
> 10.111.10.112:
> [ 1852.737122] o2net: Connected to node xen-blade14 (num 2) at 
> 10.111.10.114:
> [ 1852.741408] o2net: Connected to node xen-blade15 (num 1) at 
> 10.111.10.115:
> [ 1854.733759] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1856.737129] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1856.764520] OCFS2 1.5.0
> [ 1858.740877] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1860.744847] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1862.748919] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1864.752929] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1866.756825] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1868.760809] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1870.764937] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1872.768905] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1874.772947] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1876.776928] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1878.780828] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1880.784974] o2net: Connection to node xen-blade16 (num 0) at 
> 10.111.10.116: shutdown, state 7
> [ 1882.784529] o2net: No connection established with node 0 after 30.0 
> seconds, giving up.
> [ 1912.864531] o2net: No connection established with node 0 after 30.0 
> seconds, giving up.
> [ 1917.028531] o2cb: This node could not connect to nodes: 0.
> [ 1917.028684] o2cb: Cluster check failed. Fix errors before retrying.
> [ 1917.028758] (mount.ocf

Re: [Ocfs2-users] Permission denied on ocfs2 cluster

2012-03-16 Thread Sunil Mushran
Could be selinux related. I mean it is a permission issue. So you have to look 
at all the security regimes. rwx, posix acl, selinux, etc. 
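A quick checklist for ruling each regime in or out (the path is the mount point from the thread; ausearch is only available where auditd runs):

```shell
getenforce                    # SELinux mode: Enforcing / Permissive / Disabled
ls -ldZ /ocfs                 # SELinux context and mode bits on the mount point
getfacl /ocfs                 # POSIX ACLs beyond the plain rwx bits
ausearch -m avc -ts recent    # recent SELinux denials, if auditd is running
```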

On Mar 16, 2012, at 8:00 AM, зоррыч  wrote:

> Any idea?
> 
> 
> 
> -Original Message-
> From: ocfs2-users-boun...@oss.oracle.com
> [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of зоррыч
> Sent: Thursday, March 15, 2012 11:26 PM
> To: 'Sunil Mushran'
> Cc: ocfs2-users@oss.oracle.com
> Subject: Re: [Ocfs2-users] Permission denied on ocfs2 cluster
> 
> [root@noc-1-synt /]# ls -lh | grep ocfs
> drwxr-xr-x.   3 root root 3.9K Mar 15 02:20 ocfs
> [root@noc-1-synt /]# chmod -R gou+rwx ./ocfs/
> [root@noc-1-synt /]# ls -lh | grep ocfs
> drwxrwxrwx.   3 root root 3.9K Mar 15 02:20 ocfs
> [root@noc-1-synt /]# cd ./ocfs/
> [root@noc-1-synt ocfs]# mkdir 1233
> mkdir: cannot create directory `1233': Permission denied
> [root@noc-1-synt ocfs]#
> Strace:
> [root@noc-1-synt ocfs]# strace mkdir 1233
> execve("/bin/mkdir", ["mkdir", "1233"], [/* 28 vars */]) = 0
> brk(0)  = 0x2132000
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
> 0x7fbd67514000
> access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or
> directory)
> open("/etc/ld.so.cache", O_RDONLY)  = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=45938, ...}) = 0 mmap(NULL, 45938,
> PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fbd67508000
> close(3)= 0
> open("/lib64/libselinux.so.1", O_RDONLY) = 3 read(3,
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0PX\0D2\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=124624, ...}) = 0 mmap(0x324400,
> 2221912, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) =
> 0x324400 mprotect(0x324401d000, 2093056, PROT_NONE) = 0
> mmap(0x324421c000, 8192, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c000) = 0x324421c000
> mmap(0x324421e000, 1880, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x324421e000
> close(3)= 0
> open("/lib64/libc.so.6", O_RDONLY)  = 3
> read(3,
> "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\355\201B2\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=1979000, ...}) = 0
> mmap(0x324280, 3803304, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE,
> 3, 0) = 0x324280 mprotect(0x3242997000, 2097152, PROT_NONE) = 0
> mmap(0x3242b97000, 20480, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x197000) = 0x3242b97000
> mmap(0x3242b9c000, 18600, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3242b9c000
> close(3)= 0
> open("/lib64/libdl.so.2", O_RDONLY) = 3
> read(3,
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\r\300B2\0\0\0"..., 832)
> = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=22536, ...}) = 0 mmap(NULL,
> 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
> 0x7fbd67507000 mmap(0x3242c0, 2109696, PROT_READ|PROT_EXEC,
> MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3242c0 mprotect(0x3242c02000,
> 2097152, PROT_NONE) = 0 mmap(0x3242e02000, 8192, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x3242e02000
> close(3)= 0
> mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
> 0x7fbd67505000 arch_prctl(ARCH_SET_FS, 0x7fbd675057a0) = 0
> mprotect(0x324421c000, 4096, PROT_READ) = 0 mprotect(0x3242b97000, 16384,
> PROT_READ) = 0 mprotect(0x3242e02000, 4096, PROT_READ) = 0
> mprotect(0x324261f000, 4096, PROT_READ) = 0
> munmap(0x7fbd67508000, 45938)   = 0
> statfs("/selinux", {f_type=0xf97cff8c, f_bsize=4096, f_blocks=0, f_bfree=0,
> f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255,
> f_frsize=4096}) = 0
> brk(0)  = 0x2132000
> brk(0x2153000)  = 0x2153000
> open("/usr/lib/locale/locale-archive", O_RDONLY) = 3 fstat(3,
> {st_mode=S_IFREG|0644, st_size=99158704, ...}) = 0 mmap(NULL, 99158704,
> PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fbd61674000
> close(3)= 0
> mkdir("1233", 0777) = -1 EACCES (Permission denied)
> open("/usr/share/locale/locale.alias", O_RDONLY) = 3 fstat(3,
> {st_mode=S_IFREG|0644, st_size=2512, ...}) = 0 mmap(NULL, 4096,
> PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbd67513000
> read(3, "# Locale name alias data base.\n#"..., 4096) = 2512
> read(3, "", 4096)  = 0
> open("...s.mo", O_RDONLY) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=435, ...}) = 0
> mmap(NULL, 435, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe958f4a000
> close(3)= 0
> write(2, "mkdir: ", 7mkdir: )  = 7
> write(2, "cannot create directory `12'", 28cannot create directory `12') =
> 28
> open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1
> ENOENT (No such file or directory)
> open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1
> ENOENT (No such file or directory)
> open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT
> (No such file or directory)
> open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT
> (No such file or directory)
> open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT
> (No such file or directory)
> open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No
> such file or directory)
> write(2, ": Permission denied", 19: Permission denied) = 19
> write(2, "\n", 1
> )   = 1
> close(1)= 0
> close(2)= 0
> exit_group(1)   = ?
> [root@noc-1-synt ocfs]#
>
>
>
>
> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
> Sent: Thursday, March 15, 2012 8:20 PM
> To: зоррыч
> Cc: ocfs2-users@oss.oracle.com
> Subject: Re: [Ocfs2-users] Permission denied on ocfs2 cluster
>
> strace may show more. I would first confirm that my perms are correct.
>
> On 03/15/2012 07:58 AM, зоррыч wrote:
>> I am testing the scheme of drbd and ocfs2
>>
>> If you attempt to write to the cluster error:
>>
>> [root@noc-1-m77 share]# mkdir 12
>>
>> mkdir: cannot create directory `12': Permission denied
>>
>> [root@noc-1-m77 share]#
>>
>> Config:
>>
>> [root@noc-1-m77 /]# cat /etc/ocfs2/cluster.conf
>>
>> cluster:
>>
>> node_count = 2
>>
>> name = cluster-ocfs2
>>
>> node:
>>
>> ip_port = 
>>
>> ip_address = 10.1.20.10
>>
>> number = 0
>>
>> name = noc-1-synt.rutube.ru
>>
>> cluster = cluster-ocfs2
>>
>> node:
>>
>> ip_port = 
>>
>> ip_address = 10.2.20.9
>>
>> number = 1
>>
>> name = noc-1-m77.rutube.ru
>>
>> cluster = cluster-ocfs2
>>
>> logs:
>>
>> Mar 15 05:42:04 noc-1-synt kernel: OCFS2 1.5.0
>>
>> Mar 15 05:42:04 noc-1-synt kernel: o2dlm: Nodes in domain
>> 5426CCF9AC414CD59E78F3AE48B9DE2C: 1
>>
>> Mar 15 05:42:04 noc-1-synt kernel: ocfs2: Mounting device (147,0) on
>> (node 1, slot 0) with ordered data mode.
>>
>> Mar 15 05:42:07 noc-1-synt kernel: o2net: accepted connection from
>> node noc-1-m77.rutube.ru (num 2) at 10.2.20.9:
>>
>> Mar 15 05:42:11 noc-1-synt kernel: o2dlm: Node 2 joins domain
>> 5426CCF9AC414CD59E78F3AE48B9DE2C
>>
>> Mar 15 05:42:11 noc-1-synt kernel: o2dlm: Nodes in domain
>> 5426CCF9AC414CD59E78F3AE48B9DE2C: 1 2
>>
>> Mar 15 05:50:54 noc-1-synt kernel: o2dlm: Node 2 leaves domain
>> 5426CCF9AC414CD59E78F3AE48B9DE2C
>>
>> Mar 15 05:50:54 noc-1-synt kernel: o2dlm: Nodes in domain
>> 5426CCF9AC414CD59E78F3AE48B9DE2C: 1
>>
>> Mar 15 05:50:56 noc-1-synt kernel: o2net: connection to node
>> noc-1-m77.rutube.ru (num 2) at 10.2.20.9: shutdown, state 8
>>
>> Mar 15 05:50:56 noc-1-synt kernel: o2net: no longer connected to node
>> noc-1-m77.rutube.ru (num 2) at 10.2.20.9:
>>
>> Mar 15 05:51:12 noc-1-synt kernel: ocfs2: Unmounting device (147,0) on
>> (node 1)
>>
>> Mar 15 05:51:45 noc-1-synt kernel: o2net: accepted connection from
>> node noc-1-m77.rutube.ru (num 2) at 10.2.20.9:
>>
>> Mar 15 05:51:47 noc-1-synt kernel: o2dlm: Nodes in domain
>> 5426CCF9AC414CD59E78F3AE48B9DE2C: 1 2
>>
>> Mar 15 05:51:47 noc-1-synt kernel: ocfs2: Mounting device (147,0) on
>> (node 1, slot 1) with ordered data mode.
>>
>> How do I fix this?
>>
>>
>>
>> ___
>> Ocfs2-users mailing list
>> Ocfs2-users@oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Permission denied on ocfs2 cluster

2012-03-15 Thread Sunil Mushran
strace may show more. I would first confirm that my perms are correct.

On 03/15/2012 07:58 AM, зоррыч wrote:
> I am testing the scheme of drbd and ocfs2
>
> If you attempt to write to the cluster error:
>
> [root@noc-1-m77 share]# mkdir 12
>
> mkdir: cannot create directory `12': Permission denied
>
> [root@noc-1-m77 share]#
>
> Config:
>
> [root@noc-1-m77 /]# cat /etc/ocfs2/cluster.conf
>
> cluster:
>
> node_count = 2
>
> name = cluster-ocfs2
>
> node:
>
> ip_port = 
>
> ip_address = 10.1.20.10
>
> number = 0
>
> name = noc-1-synt.rutube.ru
>
> cluster = cluster-ocfs2
>
> node:
>
> ip_port = 
>
> ip_address = 10.2.20.9
>
> number = 1
>
> name = noc-1-m77.rutube.ru
>
> cluster = cluster-ocfs2
>
> logs:
>
> Mar 15 05:42:04 noc-1-synt kernel: OCFS2 1.5.0
>
> Mar 15 05:42:04 noc-1-synt kernel: o2dlm: Nodes in domain
> 5426CCF9AC414CD59E78F3AE48B9DE2C: 1
>
> Mar 15 05:42:04 noc-1-synt kernel: ocfs2: Mounting device (147,0) on
> (node 1, slot 0) with ordered data mode.
>
> Mar 15 05:42:07 noc-1-synt kernel: o2net: accepted connection from node
> noc-1-m77.rutube.ru (num 2) at 10.2.20.9:
>
> Mar 15 05:42:11 noc-1-synt kernel: o2dlm: Node 2 joins domain
> 5426CCF9AC414CD59E78F3AE48B9DE2C
>
> Mar 15 05:42:11 noc-1-synt kernel: o2dlm: Nodes in domain
> 5426CCF9AC414CD59E78F3AE48B9DE2C: 1 2
>
> Mar 15 05:50:54 noc-1-synt kernel: o2dlm: Node 2 leaves domain
> 5426CCF9AC414CD59E78F3AE48B9DE2C
>
> Mar 15 05:50:54 noc-1-synt kernel: o2dlm: Nodes in domain
> 5426CCF9AC414CD59E78F3AE48B9DE2C: 1
>
> Mar 15 05:50:56 noc-1-synt kernel: o2net: connection to node
> noc-1-m77.rutube.ru (num 2) at 10.2.20.9: shutdown, state 8
>
> Mar 15 05:50:56 noc-1-synt kernel: o2net: no longer connected to node
> noc-1-m77.rutube.ru (num 2) at 10.2.20.9:
>
> Mar 15 05:51:12 noc-1-synt kernel: ocfs2: Unmounting device (147,0) on
> (node 1)
>
> Mar 15 05:51:45 noc-1-synt kernel: o2net: accepted connection from node
> noc-1-m77.rutube.ru (num 2) at 10.2.20.9:
>
> Mar 15 05:51:47 noc-1-synt kernel: o2dlm: Nodes in domain
> 5426CCF9AC414CD59E78F3AE48B9DE2C: 1 2
>
> Mar 15 05:51:47 noc-1-synt kernel: ocfs2: Mounting device (147,0) on
> (node 1, slot 1) with ordered data mode.
>
> How do I fix this?
>
>
>
> ___
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users



Re: [Ocfs2-users] ocfs2-1.4.7 is not binding in scientific linux 6.2

2012-03-12 Thread Sunil Mushran
ocfs2 1.4 will not build with 2.6.32. A better solution is to
just enable ocfs2 in the 2.6.32 kernel src tree and build.
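A rough sketch of that approach, assuming a full, configured 2.6.32 source tree (the path and config step are illustrative; a RHEL-style kernel-devel headers tree is not enough):

```shell
# Sketch: build only the in-tree ocfs2 modules from a 2.6.32 source tree.
KSRC=${KSRC:-/usr/src/linux-2.6.32}
[ -d "$KSRC" ] || { echo "no kernel tree at $KSRC; set KSRC"; exit 0; }

cd "$KSRC"
# First enable CONFIG_OCFS2_FS (File systems -> OCFS2 file system support),
# e.g. via: make menuconfig
make M=fs/ocfs2 modules           # build just the ocfs2 modules
make M=fs/ocfs2 modules_install   # install them for the running kernel
```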

On 03/11/2012 07:37 AM, зоррыч wrote:
> Hi.
>
> I use scientific linux 6.2:
>
> [root@noc-1-m77 ocfs2-1.4.7]# cat /etc/redhat-release
>
> Scientific Linux release 6.2 (Carbon)
>
> [root@noc-1-m77 ocfs2-1.4.7]# uname -r
>
> 2.6.32-220.4.1.el6.x86_64
>
> Does not compile:
>
> [root@noc-1-m77 ocfs2-1.4.7]# ./configure
> --with-kernel=/usr/src/kernels/2.6.32-220.7.1.el6.x86_64
>
> checking build system type... x86_64-unknown-linux-gnu
>
> checking host system type... x86_64-unknown-linux-gnu
>
> checking for gcc... gcc
>
> checking for C compiler default output file name... a.out
>
> checking whether the C compiler works... yes
>
> checking whether we are cross compiling... no
>
> checking for suffix of executables...
>
> checking for suffix of object files... o
>
> checking whether we are using the GNU C compiler... yes
>
> checking whether gcc accepts -g... yes
>
> checking for gcc option to accept ANSI C... none needed
>
> checking how to run the C preprocessor... gcc -E
>
> checking for a BSD-compatible install... /usr/bin/install -c
>
> checking whether ln -s works... yes
>
> checking for egrep... grep -E
>
> checking for ANSI C header files... yes
>
> checking for an ANSI C-conforming const... yes
>
> checking for vendor... not found
>
> checking for vendor kernel... not supported
>
> checking for debugging... no
>
> checking for directory with kernel build tree...
> /usr/src/kernels/2.6.32-220.7.1.el6.x86_64
>
> checking for kernel version... 2.6.32-220.7.1.el6.x86_64
>
> checking for directory with kernel sources...
> /usr/src/kernels/2.6.32-220.7.1.el6.x86_64
>
> checking for kernel source version... 2.6.32-220.7.1.el6.x86_64
>
> checking for struct delayed_work in workqueue.h... yes
>
> checking for uninitialized_var() in compiler-gcc4.h... yes
>
> checking for zero_user_page() in highmem.h... no
>
> checking for do_sync_mapping_range() in fs.h... yes
>
> checking for fault() in struct vm_operations_struct in mm.h... yes
>
> checking for f_path in fs.h... yes
>
> checking for enum umh_wait in kmod.h... yes
>
> checking for inc_nlink() in fs.h... yes
>
> checking for drop_nlink() in fs.h... yes
>
> checking for kmem_cache_create() with dtor arg in slab.h... no
>
> checking for kmem_cache_zalloc in slab.h... yes
>
> checking for flag FS_RENAME_DOES_D_MOVE in fs.h... yes
>
> checking for enum FS_OCFS2 in sysctl.h... yes
>
> checking for configfs_depend_item() in configfs.h... yes
>
> checking for register_sysctl() with two args in sysctl.h... no
>
> checking for su_mutex in struct configfs_subsystem in configfs.h... yes
>
> checking for struct subsystem in kobject.h... no
>
> checking for is_owner_or_cap() in fs.h... yes
>
> checking for fallocate() in fs.h... yes
>
> checking for struct splice_desc in splice.h... yes
>
> checking for MNT_RELATIME in mount.h... yes
>
> checking for should_remove_suid() in fs.h... no
>
> checking for generic_segment_checks() in fs.h... no
>
> checking for s_op declared as const in struct super_block in fs.h... yes
>
> checking for i_op declared as const in struct inode in fs.h... yes
>
> checking for f_op declared as const in struct file in fs.h... yes
>
> checking for a_ops declared as const in struct address_space in fs.h... yes
>
> checking for aio_read() in struct file_operations using iovec in fs.h... yes
>
> checking for __splice_from_pipe() in splice.h... yes
>
> checking for old bio_end_io_t in bio.h... no
>
> checking for b_size is u32 struct buffer_head in buffer_head.h... no
>
> checking for exportfs.h... yes
>
> checking for linux/lockdep.h... yes
>
> checking for mandatory_lock() in fs.h... yes
>
> checking for range prefix in struct writeback_control... yes
>
> checking for SYNC_FILE_RANGE flags... yes
>
> checking for blkcnt_t in types.h... yes
>
> checking for i_private in struct inode... yes
>
> checking for page_mkwrite in struct vm_operations_struct... no
>
> checking for get_sb_bdev() with 5 arguments in fs.h... no
>
> checking for read_mapping_page in pagemap.h... yes
>
> checking for ino_t in filldir_t in fs.h... no
>
> checking for invalidatepage returning int in fs.h... no
>
> checking for get_blocks_t type... no
>
> checking for linux/uaccess.h... yes
>
> checking for system_utsname in utsname.h... no
>
> checking for MS_LOOP_NO_AOPS flag defined... no
>
> checking for fops->sendfile() in fs.h... no
>
> checking for task_pid_nr in sched.h... yes
>
> checking for confirm() in struct pipe_buf_operations in pipe_fs_i.h... yes
>
> checking for mutex_lock_nested() in mutex.h... yes
>
> checking for inode_double_lock) in fs.h... no
>
> checking for splice_read() in fs.h... yes
>
> checking for sops->statfs takes struct super_block * in fs.h... no
>
> checking for le16_add_cpu() in byteorder/generic.h... yes
>
> checking for le32_add_cpu() in byteorder/generic.h... yes
>
> checking for le64_add_cpu() in byteorder/generic.h... yes
>
>

Re: [Ocfs2-users] ocfs2console hangs on startup

2012-03-10 Thread Sunil Mushran
ocfs2console has been obsoleted. Just use the utilities directly.
To detect ocfs2 volumes, use blkid. You can use it to restrict
the lookup paths. Refer to its man page.
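The blkid suggestion above can be sketched as follows; the `TYPE=` filter is from blkid(8), and the multipath device glob is illustrative:

```shell
# Sketch: detect ocfs2 volumes without scanning every /dev/sd* device.
command -v blkid >/dev/null 2>&1 || { echo "blkid not installed"; exit 0; }

# List all detected ocfs2 volumes (blkid exits 2 when none are found).
blkid -t TYPE=ocfs2 || [ $? -eq 2 ]

# To restrict probing to specific devices, pass them explicitly, e.g.:
#   blkid -t TYPE=ocfs2 /dev/mapper/mpath*
```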

On 03/09/2012 06:15 PM, John Major wrote:
> Hi,
>
> Hope this is the right place to ask this.
>
> I have set up 2 ubuntu lts machines with an IBM iscsi san. I have set up
> multipathd and ocfs2 and it seems to be working.
>
> The problem is that when I run up ocfs2console it hangs (the console
> app, not the system). Using strace, I can see that it is running through
> all the /dev/sdx devices and loops trying to access the first one in
> 'ghost' state per 'multipath -ll'.
>
> Is there a way to restrict which devices the app looks at as it starts
> to say  /dev/mapper/mpath* since I don't actually want it to access any
> of the /dev/sd.. devices directly?



Re: [Ocfs2-users] Ocfs2-users Digest, Vol 98, Issue 9

2012-03-02 Thread Sunil Mushran
On 02/29/2012 04:10 PM, David Johle wrote:
> I too have seen some serious performance issues under 1.4, especially
> with writes.  I'll share some info I've gathered on this topic, take
> it however you wish...
>
> In the past I never really thought about running benchmarks against
> the shared block device as a baseline to compare with the
> filesystem.  So today I did run several dd tests of my own (both read
> and write) against a shared block device (different LUN, but using
> the exact same storage hardware including specific disks as the one
> with OCFS2).
>
> My tests were not in line with those of Erik Schwartz, as I
> determined the performance degradations to be OCFS2 related.
>
> I have a a fs shared by 2 nodes, both are dual quad core xeon systems
> with 2 dedicated storage NICs per box.
> Storage is a Dell/EqualLogic iSCSI SAN with 3 gigE NICs, dedicated
> gigE switches, using jumbo frames.
> I'm using dm-multipath as well.
>
> RHEL5 (2.6.18-194.3.1.el5 kernel)
> ocfs2-2.6.18-194.11.4.el5-1.4.7-1.el5
> ocfs2-tools-1.4.4-1.el5
>
> Using the individual /dev/sdX vs. the /dev/mapper/mpathX devices
> indicates that multipath is working properly as the numbers are close
> to double what the separates each give.
>
> Given the hardware, I'd consider 200MB/s a limit for a single box and
> 300MB/s the limit for the SAN.
>
> Block device:
> Sequential reads tend to be in the 180-190MB/s range with just one
> node reading.
> Both nodes simultaneously reading gives about 260-270MB/s total throughput.
> Sequential writes tend to be in the 115-140MB/s range with just one
> node writing.
> Both nodes simultaneously writing gives about 200-230MB/s total throughput.
>
> OCFS2:
> Sequential reads tend to be in the 80-95MB/s range with just one node reading.
> Both nodes simultaneously reading gives about 125-135MB/s total throughput.
> Sequential writes tend to be in the 5-20MB/s range with just one node writing.
> Both nodes simultaneously writing (different files) gives unbearably
> slow performance of less than 1MB/s total throughput.
>
> Now one thing I will say is that I was testing on a "mature"
> filesystem that has been in use for quite some time.  Tons of file&
> directory creation, reading, updating, deleting, over the course of a
> couple years.
>
> So to see how that might affect things, I then created a new
> filesystem on that same block device I used above (with same options
> as the "mature" one) and ran the set of dd-based fs tests on that.
>
> Create params: -b 4K -C 4K
> --fs-features=backup-super,sparse,unwritten,inline-data
>Mount params: -o noatime,data=writeback
>
> Fresh OCFS2:
> Sequential reads tend to be in the 100-125MB/s range with just one
> node reading.
> Both nodes simultaneously reading gives about 165-180MB/s total throughput.
> Sequential writes tend to be in the 120-140MB/s range with just one
> node writing.
> Both nodes simultaneously writing (different files) gives reasonable
> performance of around 100MB/s total throughput.
>
>
> Wow, what a difference!  I will say that, for the "mature" filesystem
> above that is performing poorly, it has definitely gotten worse over
> time.  It seems to me that the filesystem itself has some time or
> usage based performance degradation issues.
>
> I'm actually thinking it would be to the benefit of my cluster to
> create a new volume, shut down all applications, copy the contents
> over, shuffle mount points, and start it all back up.  The only
> problem is that this will make for some highly unappreciated
> downtime!  Also, I'm concerned that all that copying and loading it
> up with contents may just result in the same performance losses,
> making the whole process just wasted effort.


We have worked on reducing fragmentation in later releases. One specific
feature added was allocation reservation (in 2.6.35). It is available
in prod releases starting 1.6.



Re: [Ocfs2-users] OCFS2 1.2/1.6

2012-03-02 Thread Sunil Mushran
The file system on-disk image has not changed. So the 1.6 file system
software can mount the volume created with 1.2 mkfs. What you cannot do
is concurrently mount the same volume with nodes running 1.2 and 1.6 
versions of the file system software.

It is not mixed mode. The 1.6 fs software will read the on-disk features
on the 1.2 volume and limit the functioning on that volume to just that.
Perfectly normal.

Yes, you can add the tablespace on the 1.2 volume.

For the 1.2 volume to be able to use 1.6 features, the said features
will have to be enabled. Once you do enable those features, the volume
will not be mountable on the older RHEL4 boxes unless those features
are disabled. There is a whole section in the users' guide that explains
this in more detail.
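A sketch of that procedure (commands assumed from debugfs.ocfs2(8) and tunefs.ocfs2(8); the device path is an example, and the volume must be unmounted cluster-wide first):

```shell
# Sketch: inspect and enable newer on-disk features on a 1.2-era volume.
# After this, the RHEL4/1.2 nodes cannot mount it until the features are
# disabled again.
VOL=${VOL:-/dev/mapper/racvol1}
command -v tunefs.ocfs2 >/dev/null 2>&1 || { echo "ocfs2-tools not installed"; exit 0; }
[ -b "$VOL" ] || { echo "set VOL to the shared volume"; exit 0; }

debugfs.ocfs2 -R "stats" "$VOL" | grep -i feature          # current feature flags
tunefs.ocfs2 --fs-features=sparse,unwritten,inline-data "$VOL"
```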

On 03/02/2012 08:09 AM, Maki, Nancy wrote:
> We are in the process of migrating to new database servers. Our current
> RAC clusters are running OCFS2 1.2.9 on RHEL 4. Our new servers are
> running OCFS2 1.6 OEL5. If possible, we would like to minimize the
> amount of data that needs to move as we migrate to the new servers. We
> have the following questions:
>
> 1.Can we mount an existing OCFS2 1.2 volume on a servers running OCFS2 1.6?
>
> 2.Are there any negative implications of being in a mixed mode?
>
> 3.If we need to add a OCFS2 1.6 volume to increase a tablespace size,
> can we have one datafile be OCFS2 1.2 and another
>
> be OCFS2 1.6 for the same tablespace?
>
> 4.Can we use OCFS2 1.6 features against an OCFS2 1.2 volume mounted on
> OCFS2 1.6?
>
> Thank you,
>
> Nancy
>
> Nancy Maki
> Manager of Database Services
>
> Office of Information & Technology
> The State University of New York
> State University Plaza - Albany, New York 12246
> Tel: 518.320.1213 Fax: 518.320.1550
>
> eMail: nancy.m...@suny.edu
>
>
>
> ___
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users



Re: [Ocfs2-users] Concurrent write performance issues with OCFS2

2012-02-28 Thread Sunil Mushran
In 1.4, the local allocator window is small. 8MB. Meaning the node
has to hit the global bitmap after every 8MB. In later releases, the
window is much larger.

Second, a single node is not a good baseline. A better baseline is
multiple nodes writing concurrently to the block device. Not fs.
Use dd. Set different write offsets. This should help figure out how
the shared device works with multiple nodes.
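A minimal version of that baseline is below; the device path (from this thread), sizes, and offsets are illustrative. Each node writes at a distinct seek offset so the streams do not overlap:

```shell
# WARNING: this writes raw data to the device and destroys anything on it.
# Only run against a scratch LUN.
DEV=${DEV:-/dev/mapper/bams01p1}
[ -b "$DEV" ] || { echo "set DEV to a scratch shared LUN"; exit 0; }

# node A: 1GB sequential write at offset 0
dd if=/dev/zero of="$DEV" bs=1M count=1024 seek=0 oflag=direct
# node B (run concurrently there): 1GB starting 8GB in
dd if=/dev/zero of="$DEV" bs=1M count=1024 seek=8192 oflag=direct
```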

On 2/28/2012 9:24 AM, Erik Schwartz wrote:
> I have a two-node RHEL5 cluster that runs the following Linux kernel and
> accompanying OCFS2 module packages:
>
>* kernel-2.6.18-274.17.1.el5
>* ocfs2-2.6.18-274.17.1.el5-1.4.7-1.el5
>
> A 2.5TB LUN is presented to both nodes via DM-Multipath. I have carved
> out a single partition (using the entire LUN), and formatted it with OCFS2:
>
># mkfs.ocfs2 -N 2 -L 'foofs' -T datafiles /dev/mapper/bams01p1
>
> Finally, the filesystem is mounted to both nodes with the following options:
>
># mount | grep bams01
> /dev/mapper/bams01p1 on /foofs type ocfs2
> (rw,_netdev,noatime,data=writeback,heartbeat=local)
>
> --
>
> When a single node is writing arbitrary data (i.e. dd(1) with /dev/zero
> as input) to a large (say, 10 GB) file in /foofs, I see the expected
> performance of ~850 MB/sec.
>
> If both nodes are concurrently writing large files full of zeros to
> /foofs, performance drops way down to ~45 MB/s. I experimented with each
> node writing to /foofs/test01/ and /foofs/test02/ subdirectories,
> respectively, and found that performance increased slightly to a - still
> poor - 65 MB/s.
>
> --
>
> I understand from searching past mailing list threads that the culprit
> is likely related to the negotiation of file locks, and waiting for data
> to be flushed to journal / disk.
>
> My two questions are:
>
> 1. Does this dramatic write performance slowdown sound reasonable and
> expected?
>
> 2. Are there any OCFS2-level steps I can take to improve this situation?
>
>
> Thanks -
>




Re: [Ocfs2-users] Bad magic number in inode

2012-02-01 Thread Sunil Mushran
inode#11 is in the system directory. fsck cannot fix this automatically.
If the corruption is limited, there is a chance the inodes could be
recreated manually. But do look at backups to restore.

On 02/01/2012 10:20 AM, Werner Flamme wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> Hi,
>
> when I try to mount an OCFS2 volume, I get
>
> - ---snip---
> [12212.195823] OCFS2: ERROR (device sde1): ocfs2_validate_inode_block:
> Invalid dinode #11: signature =
> [12212.195825]
> [12212.195827] File system is now read-only due to the potential of
> on-disk corruption. Please run fsck.ocfs2 once the file system is
> unmounted.
> [12212.195832] (mount.ocfs2,9772,0):ocfs2_read_locked_inode:499 ERROR:
> status = -22
> [12212.195842] (mount.ocfs2,9772,0):_ocfs2_get_system_file_inode:158
> ERROR: status = -116
> [12212.195853]
> (mount.ocfs2,9772,0):ocfs2_init_global_system_inodes:475 ERROR: status
> = -22
> [12212.195860]
> (mount.ocfs2,9772,0):ocfs2_init_global_system_inodes:478 ERROR: Unable
> to load system inode 4, possibly corrupt fs?
> [12212.195862] (mount.ocfs2,9772,0):ocfs2_initialize_super:2379 ERROR:
> status = -22
> [12212.195864] (mount.ocfs2,9772,0):ocfs2_fill_super:1064 ERROR:
> status = -22
> [12212.195869] ocfs2: Unmounting device (8,65) on (node 0)
> - ---pins---
>
> And doing an fsck, it looks like this:
> - ---snip---
> # fsck.ocfs2 -f  /dev/disk/by-label/ERSATZ
> fsck.ocfs2 1.8.0
> Checking OCFS2 filesystem in /dev/disk/by-label/ERSATZ:
>Label:  ERSATZ
>UUID:   AEB995484F2D4D19835AA380CAE0683A
>Number of blocks:   268434093
>Block size: 4096
>Number of clusters: 268434093
>Cluster size:   4096
>Number of slots:40
>
> /dev/disk/by-label/ERSATZ was run with -f, check forced.
> Pass 0a: Checking cluster allocation chains
> pass0: Bad magic number in inode reading inode alloc inode 11 for
> verification
> fsck.ocfs2: Bad magic number in inode while performing pass 0
> - ---pins---
>
> Any chance to access the filesystem other that reformatting it?
>
> The node ist the only node that can access this volume. I plan to
> share it via iSCSI, but first it must be mountable... There are 3
> other volumes in this cluster, mounted by about a dozen nodes.
>
> Regards,
> Werner



Re: [Ocfs2-users] A Billion Files on OCFS2 -- Best Practices?

2012-02-01 Thread Sunil Mushran
On 02/01/2012 10:24 AM, Mark Hampton wrote:
> Here's what I got from debugfs.ocfs2 -R "stats".  I have to type it out
> manually, so I'm only including the "features" lines:
>
> Feature Compat: 3 backup-super strict-journal-super
> Feature Incompat: 16208 sparse extended-slotmap inline-data metaecc
> xattr indexed-dirs refcount discontig-bg
> Feature RO compat: 7 unwritten usrquota grpquota
>
>
> Some other info that may be interesting:
>
> Links: 0   Clusters: 52428544


I would disable quotas. That line suggests the volume is 200G in size.



Re: [Ocfs2-users] A Billion Files on OCFS2 -- Best Practices?

2012-02-01 Thread Sunil Mushran
debugfs.ocfs2 -R "stats" /dev/mapper/...
I want to see the features enabled.

The main issue with large metadata is the fsck timing. The recently
tagged 1.8 release of the tools has much better fsck performance.

On 02/01/2012 05:25 AM, Mark Hampton wrote:
> We have an application that has many processing threads writing more
> than a billion files ranging from 2KB – 50KB, with 50% under
> 8KB (currently there are 700 million files).  The files are never
> deleted or modified – they are written once, and read infrequently.  The
> files are hashed so that they are evenly distributed across ~1,000,000
> subdirectories up to 3 levels deep, with up to 1000 files per
> directory.  The directories are structured like this:
>
> 0/00/00
>
> 0/00/01
>
> …
>
> F/FF/FE
>
> F/FF/FF
>
> The files need to be readable and writable across a number of
> servers. The NetApp filer we purchased for this project has both NFS and
> iSCSI capabilities.
>
> We first tried doing this via NFS.  After writing 700 million files (12
> TB) into a single NetApp volume, file-write performance became abysmally
> slow.  We can't create more than 200 files per second on the NetApp
> volume, which is about 20% of our required performance target of 1000
> files per second.  It appears that most of the file-write time is going
> towards stat and inode-create operations.
>
> So I now I’m trying the same thing with OCFS2 over iSCSI.  I created 16
> luns on the NetApp.  The 16 luns became 16 OCFS2 filesystems with 16
> different mount points on our servers.
>
> With this configuration I was initially able to write ~1800 files per
> second.  Now that I have completed 100 million files, performance has
> dropped to ~1500 files per second.
>
> I’m using OEL 6.1 (2.6.32-100 kernel) with OCFS2 version 1.6.  The
> application servers have 128GB of memory.  I created my OCFS2
> filesystems as follows:
>
> mkfs.ocfs2 -T mail -b 4k -C 4k -L  --fs-features=indexed-dirs
> --fs-feature-level=max-features /dev/mapper/
>
> And I mount them with these options:
>
> _netdev,commit=30,noatime,localflocks,localalloc=32
>
> So my questions are these:
>
>
> 1) Given a billion files sized 2KB – 50KB, with 50% under 8KB, do I have
> the optimal OCFS2 filesystem and mount-point configurations?
>
>
> 2) Should I split the files across even more filesystems?  Currently I
> have them split across 16 OCFS2 filesystems.
>
> Thanks a billion!
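The hashed directory layout described above (1 hex digit, then 2, then 2: 16 × 256 × 256 ≈ one million leaf directories) can be sketched in Python. The hash choice and helper name are illustrative, not from the original setup:

```python
import hashlib

def shard_path(name: str) -> str:
    """Map a file name onto the 0/00/00 .. F/FF/FF directory layout."""
    h = hashlib.md5(name.encode()).hexdigest().upper()
    # 16 * 256 * 256 = 1,048,576 evenly loaded directories
    return "{}/{}/{}/{}".format(h[0], h[1:3], h[3:5], name)

print(shard_path("object-000123.dat"))
```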



Re: [Ocfs2-users] Extend space on ocfs mount point

2012-02-01 Thread Sunil Mushran
I am not aware of any downsides to resizing.

On 02/01/2012 09:57 AM, Kalra, Pratima wrote:
> We have a ucm installation on ocfs mount point and we need to increase
> the space on that mount point from 20gb to 30 gb. Is this possible
> without resulting in any after effects?
> Pratima.



Re: [Ocfs2-users] A Billion Files on OCFS2 -- Best Practices?

2012-02-01 Thread Sunil Mushran
On 02/01/2012 07:02 AM, Mark wrote:
> One more thing.  When I straced one of the application processes (these are 
> the
> processes that create the files) I saw this:
>
> % time  seconds   usecs/call   calls     errors  syscall
> ------  --------  ----------  --------  ------  -------
> 68.94   3.002017              11127154          open
> 18.93   0.929679           2    418108          read
> 12.40   0.543714           2    257548          write
>
> So it seems that inode creation is the biggest time consumer by far.

Yes. open() triggers cluster lock creation which cannot be skipped. 
Reads and writes could skip cluster activity if the node already has the 
appropriate lock level.



Re: [Ocfs2-users] Help ! OCFS2 unstable on Disparate Hardware

2012-01-27 Thread Sunil Mushran
Symmetric clustering works best when the nodes are comparable because 
all nodes have to work in sync. NFS may be more suitable for your needs.

On 01/26/2012 05:51 PM, Jorge Adrian Salaices wrote:
> I have been working on trying to convince Mgmt at work that we want to
> go to OCFS2 away from NFS for the sharing of the Application Layer of
> our Oracle EBS (Enterprise Business Suite), and for just general "Backup
> Share", but general instability in my setup has dissuaded me to
> recommend it.
>
> I have a mixture of 1.4.7 (EL 5.3) and 1.6.3 (EL 5.7 + UEK) and
> something as simple as an umount has triggered random Node reboots, even
> on nodes that have Other OCFS2 mounts not shared by the rebooting nodes.
> You see the problem I have is that I have disparate hardware and some of
> these servers are even VM's.
>
> Several documents state that nodes have to be somewhat equal of power
> and specs and in my case that will never be.
> Unfortunately for me, I have had several other events of random Fencing
> that have been unexplained by common checks.
> i.e. My Network has never been the problem yet one server may see
> another one go away when all of the other services on that node may be
> running perfectly fine. I can only surmise that the reason why that may
> have been is because of an elevated load on the server that starved the
> Heartbeat process preventing it from sending Network packets to other
> nodes.
>
> My config has about 40 Nodes on it, I have 4 or 5 different shared LUNs
> out of our SAN and not all servers share all Mounts.
> meaning only 10 or 12 share one LUN, 8 or 9 share another and 2 or 3
> share a third, unfortunately the complexity is such that a server may
> intersect with some of the servers but not all.
> perhaps a change in my config to create separate clusters may be the
> solution but only if a node can be part of multiple clusters:
>
> /node:
> ip_port = 
> ip_address = 172.20.16.151
> number = 1
> name = txri-oprdracdb-1.tomkinsbp.com
> cluster = ocfs2-back
>
> node:
> ip_port = 
> ip_address = 172.20.16.152
> number = 2
> name = txri-oprdracdb-2.tomkinsbp.com
> cluster = ocfs2-back
>
> node:
> ip_port = 
> ip_address = 10.30.12.172
> number = 4
> name = txri-util01.tomkinsbp.com
> cluster = ocfs2-util, ocfs2-back
> node:
> ip_port = 
> ip_address = 10.30.12.94
> number = 5
> name = txri-util02.tomkinsbp.com
> cluster = ocfs2-util, ocfs2-back
>
> cluster:
> node_count = 2
> name = ocfs2-back
>
> cluster:
> node_count = 2
> name = ocfs2-util
> /
> Is this even legal, or can it be done some other way? Or is this done
> based on the different domains that are created once a mount is done?
>
>
> How can I make the cluster more stable? And why does a node fence
> itself even if it does not have any locks on the shared LUN? It seems
> that a node may be "fenceable" simply by having the OCFS2 services
> turned on, without a mount. Is this correct?
>
> Another question I have been having as well: can the fencing method
> be something other than panic or restart? Can a third-party or userland
> event be triggered to recover from what the "heartbeat" or "network
> tests" construe as a downed node?
>
> Thanks for any of the help you can give me.
>
>
> --
> Jorge Adrian Salaices
> Sr. Linux Engineer
> Tomkins Building Products
>
>
>
> ___
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] One node, two clusters?

2011-12-22 Thread Sunil Mushran
On 12/22/2011 10:39 AM, Kushnir, Michael (NIH/NLM/LHC) [C] wrote:
> Is there a separate DLM instance for each ocfs2 volume?
>
> I have two "sub-clusters" in the same cluster... a 10-node Hadoop
> cluster sharing a SATA RAID10, and a two-node web server cluster
> sharing an SSD RAID0. One server mounts both volumes to move data
> between them as necessary.
>
> This morning I got the following error (see end of message), and all nodes 
> lost access to all storage. I'm trying to mitigate risk of this happening 
> again.
>
> My hadoop nodes are used to generate search engine indexes, so they can go 
> down. But my web servers provide the search engine service so I need them to 
> not be tied to my hadoop nodes. I just feel safer that way. At the same time, 
>  I need a "bridge" node to move data between the two. I can do it via NFS or 
> SCP, but I figured it'd be worthwhile to ask if one node can be in two 
> different clusters.
>
> Dec 22 09:15:42 lhce-imed-web1 kernel: 
> (updatedb,1832,1):dlm_get_lock_resource:898 
> 042F68B6AF134E5C9A9EDF4D7BD7BE99:O0013d2ef94: at least 
> one node (11) to recover before lock mastery can begin
>

You should add ocfs2 to PRUNEFS in /etc/updatedb.conf. updatedb generates
a lot of I/O and network traffic, and it will happen at around the same
time on all nodes.
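That change can be scripted as below. This is a sketch only: it operates on a sample file under /tmp, and the PRUNEFS contents shown are made-up stand-ins; on a real system you would edit /etc/updatedb.conf as root, whose existing PRUNEFS list varies by distribution.

```shell
#!/bin/sh
# Sketch: append ocfs2 to the PRUNEFS list so updatedb skips OCFS2
# mounts. Works on a sample copy; point CONF at /etc/updatedb.conf on a
# real system (and run as root).
CONF=${CONF:-/tmp/updatedb.conf.sample}

# Sample config standing in for a distro-provided /etc/updatedb.conf.
cat > "$CONF" <<'EOF'
PRUNE_BIND_MOUNTS="yes"
PRUNEFS="nfs NFS proc smbfs"
PRUNEPATHS="/tmp /var/spool /media"
EOF

# Add ocfs2 to PRUNEFS unless it is already listed.
grep -q 'PRUNEFS=.*ocfs2' "$CONF" || \
    sed -i 's/^PRUNEFS="\(.*\)"/PRUNEFS="\1 ocfs2"/' "$CONF"

grep '^PRUNEFS=' "$CONF"   # → PRUNEFS="nfs NFS proc smbfs ocfs2"
```

After this, updatedb on each node leaves the shared OCFS2 volumes alone, which avoids the synchronized I/O and DLM traffic described above.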

Yes, each volume has a different dlm domain (instance).



Re: [Ocfs2-users] One node, two clusters?

2011-12-22 Thread Sunil Mushran
You don't need to have two clusters for this. This can be accomplished
with one cluster with the default local heartbeat.

Create one cluster.conf with all the nodes. All nodes, except the one
machine, will mount from just one san. The common node will mount from
both sans.

If you look at the cluster membership, other than the common node,
all nodes will be interacting (network connection, etc.) with nodes that
they can see on the san.
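For illustration, a single cluster.conf covering both sub-clusters might look like the fragment below. The node names, IPs, port, and cluster name are hypothetical; the real file is tab-indented and is normally generated with o2cb_ctl or ocfs2console rather than written by hand.

```
node:
        ip_port = 7777
        ip_address = 10.0.0.11
        number = 0
        name = hadoop1
        cluster = mycluster

node:
        ip_port = 7777
        ip_address = 10.0.0.21
        number = 1
        name = web1
        cluster = mycluster

node:
        ip_port = 7777
        ip_address = 10.0.0.31
        number = 2
        name = bridge1
        cluster = mycluster

cluster:
        node_count = 3
        name = mycluster
```

Every node lists every other node, but which volumes a given node actually mounts determines which DLM domains it joins.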

On 12/22/2011 09:40 AM, Werner Flamme wrote:
>
> Kushnir, Michael (NIH/NLM/LHC) [C] [22.12.2011 18:20]:
>> Is it possible to have one machine be part of two different ocfs2
>> clusters with two different sans? Kind of to serve as a bridge for
>> moving data between two clusters but without actually fully
>> combining the two clusters?
>>
>> Thanks, Michael
> Michael,
>
> I asked this two years ago and the answer was no.
>
> When I look at /etc/ocfs2/cluster.conf, I do not see a possibility to
> configure a second cluster. Though the nodes must be assigned to a
> cluster (and exactly one cluster, that is), there is only one entry
> "cluster:" in the file, and so there is no way to define a second one.
>
> We synced via rsync :-(
>
> HTH
> Werner
>
>
> ___
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users




Re: [Ocfs2-users] reflink status

2011-12-17 Thread Sunil Mushran
On 12/17/2011 12:05 PM, richard -rw- weinberger wrote:
>> The reflink utility should work. So what if it is based on an older
>> coreutils. It is derived from the hard link (ln) utility.
> So, building it from http://oss.oracle.com/git/?p=jlbec/reflink.git;a=shortlog
> via reflink.spec is the way to go?
>

For now, yes.



Re: [Ocfs2-users] reflink status

2011-12-17 Thread Sunil Mushran
First we have to get the new syscall added to the kernel.
The first attempt failed because people overloaded the call with
extraneous stuff. Recently there is another attempt to go back
to the original proposal. Hopefully, next kernel release.

The reflink utility should work. So what if it is based on an older
coreutils. It is derived from the hard link (ln) utility.

On 12/17/2011 4:15 AM, richard -rw- weinberger wrote:
> Hi!
>
> What do I need to use reflinks on OCFS2 1.6?
>
> coreutils 8.4's cp --reflink=always doesn't seem to work.
>
> I found
> http://oss.oracle.com/git/?p=jlbec/reflink.git;a=shortlog
> and
> http://oss.oracle.com/~smushran/reflink-tools/
>
> Both contain a patched and outdated coreutils package.
>
> Are there any plans to merge it upstream?
>




Re: [Ocfs2-users] OCFS2 cluster won't come up and stay up

2011-12-01 Thread Sunil Mushran
To analyze this, one needs the logs. And a bugzilla is a good placeholder
for the logs.

On Dec 1, 2011, at 6:05 PM, Tony Rios  wrote:

> Sunil,
> Is submitting a bug report the only answer?
> I'm happy to send in this information, but can I take the cluster down 
> entirely and sort of reset it so we can get these servers back online and 
> talking again in the meantime?
> Tony
> 
> On Dec 1, 2011, at 5:05 PM, Sunil Mushran wrote:
> 
>> Node 3 is joining the domain. It is having problems getting the superblock 
>> cluster lock.
>> Create a bugzilla on oss.oracle.com and attach the /var/log/messages from 
>> all nodes.
>> If you have netconsole setup, attach those logs too.
>> 
>> On 12/01/2011 04:55 PM, Tony Rios wrote:
>>> I'm having an issue today where I just can't seem to keep all the servers 
>>> in the cluster online.
>>> They aren't losing network connectivity and I can ping the iSCSI host just 
>>> fine and the host is logged in.
>>> 
>>> These are the errors from the dmesg when I try to mount the filesystem:
>>> 
>>> root@pedge36:~# dmesg
>>> [0.00] Initializing cgroup subsys cpuset
>>> [0.00] Initializing cgroup subsys cpu
>>> [0.00] Linux version 2.6.38-10-generic (buildd@yellow) (gcc version 
>>> 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4) ) #46-Ubuntu SMP Tue Jun 28 15:07:17 
>>> UTC 2011 (Ubuntu 2.6.38-10.46-generic 2.6.38.7)
>>> [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-2.6.38-10-generic 
>>> root=UUID=3cd859b8-2605-4a38-8767-a6d1f99d53bd ro debug ignore_loglevel
>>> [0.00] BIOS-provided physical RAM map:
>>> [0.00]  BIOS-e820:  - 000a (usable)
>>> [0.00]  BIOS-e820: 0010 - effc (usable)
>>> [0.00]  BIOS-e820: effc - effcfc00 (ACPI data)
>>> [0.00]  BIOS-e820: effcfc00 - e000 (reserved)
>>> [0.00]  BIOS-e820: f000 - f400 (reserved)
>>> [0.00]  BIOS-e820: fec0 - fed00400 (reserved)
>>> [0.00]  BIOS-e820: fed13000 - feda (reserved)
>>> [0.00]  BIOS-e820: fee0 - fee1 (reserved)
>>> [0.00]  BIOS-e820: ffb0 - 0001 (reserved)
>>> [0.00]  BIOS-e820: 0001 - 0001e000 (usable)
>>> [0.00]  BIOS-e820: 0001e000 - 0002 (reserved)
>>> [0.00]  BIOS-e820: 0002 - 00021000 (usable)
>>> [0.00] debug: ignoring loglevel setting.
>>> [0.00] NX (Execute Disable) protection: active
>>> [0.00] DMI 2.3 present.
>>> [0.00] DMI: Dell Computer Corporation PowerEdge 850/0Y8628, BIOS 
>>> A04 08/22/2006
>>> [0.00] e820 update range:  - 0001 
>>> (usable) ==>  (reserved)
>>> [0.00] e820 remove range: 000a - 0010 
>>> (usable)
>>> [0.00] No AGP bridge found
>>> [0.00] last_pfn = 0x21 max_arch_pfn = 0x4
>>> [0.00] MTRR default type: uncachable
>>> [0.00] MTRR fixed ranges enabled:
>>> [0.00]   0-9 write-back
>>> [0.00]   A-B uncachable
>>> [0.00]   C-CBFFF write-protect
>>> [0.00]   CC000-EBFFF uncachable
>>> [0.00]   EC000-F write-protect
>>> [0.00] MTRR variable ranges enabled:
>>> [0.00]   0 base 0 mask E write-back
>>> [0.00]   1 base 2 mask FF000 write-back
>>> [0.00]   2 base 0F000 mask FF000 uncachable
>>> [0.00]   3 disabled
>>> [0.00]   4 disabled
>>> [0.00]   5 disabled
>>> [0.00]   6 disabled
>>> [0.00]   7 disabled
>>> [0.00] x86 PAT enabled: cpu 0, old 0x7040600070406, new 
>>> 0x7010600070106
>>> [0.00] e820 update range: f000 - 0001 
>>> (usable) ==>  (reserved)
>>> [0.00] last_pfn = 0xeffc0 max_arch_pfn = 0x4
>>> [0.00] found SMP MP-table at [880fe710] fe710
>>> [0.00] initial memory mapped : 0 - 2000
>>> [0.00] init_memory_mapping: -effc
>>> [0.00]  00 - 00efe0 page 2M
>>> [0.00]  00ef

Re: [Ocfs2-users] Monitoring progress of fsck.ocfs2

2011-11-18 Thread Sunil Mushran
Do:
cat /proc/PID/stack

It is probably stuck in the block layer.

On 11/18/2011 08:33 AM, Nick Khamis wrote:
> Hello Everyone,
>
> I just ran fsck.ocfs2 on /dev/drbd0, which is a one-gig partition on a
> vm with limited resources (100 MB of RAM).
> I am worried that the process has crashed, because it has not responded
> in the past hour or so.
>
> fsck.ocfs2 /dev/drbd0
> fsck.ocfs2 1.6.4
> [RECOVER_CLUSTER_INFO] The running cluster is using the cman stack
> with the cluster name ASTCluster, but the filesystem is configured for
> the classic o2cb stack.  Thus, fsck.ocfs2 cannot determine whether the
> filesystem is in use.  fsck.ocfs2 can reconfigure the filesystem to
> use the currently running cluster configuration.  DANGER: YOU MUST BE
> ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS FILESYSTEM BEFORE
> MODIFYING ITS CLUSTER CONFIGURATION.  Recover cluster configuration
> information the running cluster?  y
>
>
> ps -uroot
> 8040 pts/000:00:00 fsck.ocfs2
>
>
> I want to mention that I did issue a ctrl+c and ctrl+x when I panicked.
> But I do not think anything happened.
>
> Nick
>
> ___
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users




Re: [Ocfs2-users] Number of Nodes defined

2011-11-17 Thread Sunil Mushran
It must be the same fragmentation issue that we've addressed in 1.6 and later.
Is this 1.4?

On 11/17/2011 08:45 AM, David wrote:
> Sunil, et al,
>
> The reason I needed to make this changed was because the ocfs2 partition, 
> which is 101G in size with 41G currently in use ran out of disk space even 
> though the OS was reporting 60G available.
>
> I had this issue once before and found that the node slots of that
> cluster were set to 4 even though there were only 2 nodes in the
> cluster. When I reduced the node slots to 2, disk space was freed up.
>
> I made these changes to this cluster; reduced the node slots to 2 and 
> everything worked until this morning when the same error returned "No space 
> left on device".
>
> The OS is still showing available disk space but as the error suggests i 
> can't write to the partition.
>
> Any idea what could be happening?
>
> On 11/16/2011 05:45 PM, Sunil Mushran wrote:
>> Reducing node-slots frees up the journal and distributes the metadata
>> that that slot was tracking to the remaining slots. I am not aware of
>> any reason why there should be an impact.
>>
>> On 11/16/2011 03:07 PM, David wrote:
>>> I did read the man page for tunefs.ocfs2 but I didn't see anything 
>>> indicating what the impact to the fs would be when making a change to an 
>>> existing fs such as reducing the node slots.
>>>
>>> Anyway, thank you for the feedback, I was able to make the changes with no 
>>> impact to the fs.
>>>
>>> David
>>>
>>> On 11/16/2011 12:12 PM, Sunil Mushran wrote:
>>>> man tunefs.ocfs2
>>>>
>>>> It cannot be done in an active cluster. But it can be done without having 
>>>> to
>>>> reformat the volume.
>>>>
>>>> On 11/16/2011 10:08 AM, David wrote:
>>>>> I wasn't able to find any documentation that answers whether or not the
>>>>> number of nodes defined for a cluster,  can be reduced on an active
>>>>> cluster as seen via:
>>>>>
>>>>> tunefs.ocfs2 -Q "%B %T %N\n"
>>>>>
>>>>> Does anyone know if this can be done, or do I have to copy the data off
>>>>> of the fs, make the changes, reformat the fs and copy the data back?
>>>>>
>>>>> ___
>>>>> Ocfs2-users mailing list
>>>>> Ocfs2-users@oss.oracle.com
>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>
>>




Re: [Ocfs2-users] Number of Nodes defined

2011-11-16 Thread Sunil Mushran
Reducing node-slots frees up the journal and distributes the metadata
that that slot was tracking to the remaining slots. I am not aware of
any reason why there should be an impact.

On 11/16/2011 03:07 PM, David wrote:
> I did read the man page for tunefs.ocfs2 but I didn't see anything indicating 
> what the impact to the fs would be when making a change to an existing fs 
> such as reducing the node slots.
>
> Anyway, thank you for the feedback, I was able to make the changes with no 
> impact to the fs.
>
> David
>
> On 11/16/2011 12:12 PM, Sunil Mushran wrote:
>> man tunefs.ocfs2
>>
>> It cannot be done in an active cluster. But it can be done without having to
>> reformat the volume.
>>
>> On 11/16/2011 10:08 AM, David wrote:
>>> I wasn't able to find any documentation that answers whether or not the
>>> number of nodes defined for a cluster,  can be reduced on an active
>>> cluster as seen via:
>>>
>>> tunefs.ocfs2 -Q "%B %T %N\n"
>>>
>>> Does anyone know if this can be done, or do I have to copy the data off
>>> of the fs, make the changes, reformat the fs and copy the data back?
>>>
>>> ___
>>> Ocfs2-users mailing list
>>> Ocfs2-users@oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>




Re: [Ocfs2-users] Number of Nodes defined

2011-11-16 Thread Sunil Mushran
man tunefs.ocfs2

It cannot be done in an active cluster. But it can be done without having to
reformat the volume.
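For reference, the two relevant invocations look roughly like this. This is a hedged sketch: /dev/sdX is a placeholder device, and the volume must be unmounted on all nodes before the slot count is changed.

```
# Query the current block size, cluster size, and number of node slots
tunefs.ocfs2 -Q "%B %T %N\n" /dev/sdX

# Reduce the number of node slots to 2 (offline, cluster-wide)
tunefs.ocfs2 -N 2 /dev/sdX
```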

On 11/16/2011 10:08 AM, David wrote:
> I wasn't able to find any documentation that answers whether or not the
> number of nodes defined for a cluster,  can be reduced on an active
> cluster as seen via:
>
> tunefs.ocfs2 -Q "%B %T %N\n"
>
> Does anyone know if this can be done, or do I have to copy the data off
> of the fs, make the changes, reformat the fs and copy the data back?
>
> ___
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users




Re: [Ocfs2-users] [Ocfs2-devel] vmstore option - mkfs

2011-11-16 Thread Sunil Mushran
Yes. But this is just the features. It also selects the appropriate
cluster size, block size, journal size, etc. All the params selected are
printed by mkfs. You also have the option of running with --dry-run to
see the params.
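As a sketch (the device name is a placeholder), the dry run looks like:

```
# Print the block size, cluster size, journal size, and features that
# mkfs.ocfs2 would choose for a VM-image store, without formatting
mkfs.ocfs2 --dry-run -T vmstore /dev/sdX
```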

On 11/16/2011 09:41 AM, Artur Baruchi wrote:
> I just found this:
>
> + {OCFS2_FEATURE_COMPAT_BACKUP_SB | OCFS2_FEATURE_COMPAT_JBD2_SB,
> +  OCFS2_FEATURE_INCOMPAT_SPARSE_ALLOC |
> +  OCFS2_FEATURE_INCOMPAT_INLINE_DATA |
> +  OCFS2_FEATURE_INCOMPAT_XATTR |
> +  OCFS2_FEATURE_INCOMPAT_REFCOUNT_TREE,
> +  OCFS2_FEATURE_RO_COMPAT_UNWRITTEN},  /* FS_VMSTORE */
>
> These options are the ones that, when choosing for vmstore, are
> enabled by default. Is this correct?
>
> Thanks.
>
> Att.
> Artur Baruchi
>
>
>
> On Wed, Nov 16, 2011 at 3:26 PM, Sunil Mushran  
> wrote:
>> fstype is a handy way to format the volume with parameters that are thought
>> to be useful for that use-case. The result of this is printed during format
>> by
>> way of the parameters selected. man mkfs.ocfs2 has a blurb about the
>> features
>> it enables by default.
>>
>> On 11/16/2011 08:45 AM, Artur Baruchi wrote:
>>> Hi.
>>>
>>> I tried to find some information about the option vmstore when
>>> formating a device, but didnt found anything about it (no
>>> documentation, I did some greps inside the source code, but nothing
>>> returned). My doubts about this:
>>>
>>> - What kind of optimization this option creates in my file system to
>>> store vm images? I mean.. what does exactly this option do?
>>> - Where, in source code, I can find the part that makes this optimization?
>>>
>>> Thanks in advance.
>>>
>>> Att.
>>> Artur Baruchi
>>>
>>




Re: [Ocfs2-users] [Ocfs2-devel] vmstore option - mkfs

2011-11-16 Thread Sunil Mushran
fstype is a handy way to format the volume with parameters that are thought
to be useful for that use-case. The result of this is printed during format by
way of the parameters selected. man mkfs.ocfs2 has a blurb about the features
it enables by default.

On 11/16/2011 08:45 AM, Artur Baruchi wrote:
> Hi.
>
> I tried to find some information about the option vmstore when
> formating a device, but didnt found anything about it (no
> documentation, I did some greps inside the source code, but nothing
> returned). My doubts about this:
>
> - What kind of optimization this option creates in my file system to
> store vm images? I mean.. what does exactly this option do?
> - Where, in source code, I can find the part that makes this optimization?
>
> Thanks in advance.
>
> Att.
> Artur Baruchi
>




Re: [Ocfs2-users] OCFS2 and db_block_size

2011-11-14 Thread Sunil Mushran
We talk about this in the user's guide.
1. Always use 4K blocksize.
2. Never set the cluster size less than the database block size.

Having a smaller cluster size could mean that a db block may not be
contiguous, and you don't want that for performance and other reasons.
Having a still larger cluster size is an easy way to ensure the files
are contiguous. Contiguity can only help performance.
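Putting the two rules together, a format command for an 8K db_block_size might look like this sketch (the device name and label are placeholders):

```
# 4K file system block size, 8K cluster size (>= the 8K database block
# size), using the datafiles preset for Oracle data files
mkfs.ocfs2 -b 4K -C 8K -T datafiles -L oradata /dev/sdX
```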

On 11/14/2011 03:35 PM, Pravin K Patil wrote:
> Hi All,
> Is there a benchmark study done different block sizes of ocfs2 and 
> corrosponding db_block_size and its impact on read / write?
> Similar way s there any study done for cluster size of ocfs2 and 
> corrosponding db_block_size and its impact on read / write?
> For example if the db_block_size is 8K and if we have ocfs2 cluster size as 
> 4K will it have any performance impact or in other words, if we make cluster 
> size of file systems on which data files are located as 8K will it improve 
> performance? if so is it for read or write?
> Looking for actual expereince on the settings of ocfs2 block size, cluster 
> size and db_block_size corelation.
>
> Regards,
> Pravin
>




Re: [Ocfs2-users] dlm locking

2011-11-14 Thread Sunil Mushran
o2image is only useful for debugging. It allows us to get a copy of the
file system metadata on which we can test fsck in-house. The files in
lost+found have to be resolved manually. If they are junk, delete them.
If useful, move them to another directory.

On 11/11/2011 05:36 PM, Nick Khamis wrote:
> All Fixed!
>
> Just a few questions. Is there any documentation on how to diagnose an
> ocfs2 filesystem:
> * How to transfer an image file for testing onto a different machine.
> As you did with "o2image.out"
> * Does "fsck.ocfs2 -fy /dev/loop0" pretty much fix all the common problems
> * What can I do with the files in lost+found
>
> Thanks Again,
>
> Nick.
>
> On Fri, Nov 11, 2011 at 8:02 PM, Sunil Mushran  
> wrote:
>> So it detected one cluster that was doubly allocated. It fixed it.
>> Details below. The other fixes could be because the o2image was
>> taken on a live volume.
>>
>> As to how this could happen... I would look at the storage.
>>
>>
>> # fsck.ocfs2 -fy /dev/loop0
>> fsck.ocfs2 1.6.3
>> Checking OCFS2 filesystem in /dev/loop0:
>>   Label:  AsteriskServer
>>   UUID:   3A791AB36DED41008E58CEF52EBEEFD3
>>   Number of blocks:   592384
>>   Block size: 4096
>>   Number of clusters: 592384
>>   Cluster size:   4096
>>   Number of slots:2
>>
>> /dev/loop0 was run with -f, check forced.
>> Pass 0a: Checking cluster allocation chains
>> Pass 0b: Checking inode allocation chains
>> Pass 0c: Checking extent block allocation chains
>> Pass 1: Checking inodes and blocks.
>> Duplicate clusters detected.  Pass 1b will be run
>> Running additional passes to resolve clusters claimed by more than one
>> inode...
>> Pass 1b: Determining ownership of multiply-claimed clusters
>> Pass 1c: Determining the names of inodes owning multiply-claimed clusters
>> Pass 1d: Reconciling multiply-claimed clusters
>> Cluster 161335 is claimed by the following inodes:
>>   /asterisk/extensions.conf
>>   /moh/macroform-cold_day.wav
>> [DUP_CLUSTERS_CLONE] Inode "/asterisk/extensions.conf" may be cloned or
>> deleted to break the claim it has on its clusters. Clone inode
>> "/asterisk/extensions.conf" to break claims on clusters it shares with other
>> inodes? y
>> [DUP_CLUSTERS_CLONE] Inode "/moh/macroform-cold_day.wav" may be cloned or
>> deleted to break the claim it has on its clusters. Clone inode
>> "/moh/macroform-cold_day.wav" to break claims on clusters it shares with
>> other inodes? y
>> Pass 2: Checking directory entries.
>> [DIRENT_INODE_FREE] Directory entry 'musiconhold.conf' refers to inode
>> number 35348 which isn't allocated, clear the entry? y
>> Pass 3: Checking directory connectivity.
>> [LOSTFOUND_MISSING] /lost+found does not exist.  Create it so that we can
>> possibly fill it with orphaned inodes? y
>> Pass 4a: checking for orphaned inodes
>> Pass 4b: Checking inodes link counts.
>> [INODE_COUNT] Inode 96783 has a link count of 1 on disk but directory entry
>> references come to 2. Update the count on disk to match? y
>> [INODE_NOT_CONNECTED] Inode 96784 isn't referenced by any directory entries.
>>   Move it to lost+found? y
>> [INODE_NOT_CONNECTED] Inode 96785 isn't referenced by any directory entries.
>>   Move it to lost+found? y
>> [INODE_NOT_CONNECTED] Inode 96794 isn't referenced by any directory entries.
>>   Move it to lost+found? y
>> [INODE_NOT_CONNECTED] Inode 96796 isn't referenced by any directory entries.
>>   Move it to lost+found? y
>> All passes succeeded.
>> Slot 0's journal dirty flag removed
>> Slot 1's journal dirty flag removed
>>
>>
>> [root@ca-test92 ocfs2]# fsck.ocfs2 -fy /dev/loop0
>> fsck.ocfs2 1.6.3
>> Checking OCFS2 filesystem in /dev/loop0:
>>   Label:  AsteriskServer
>>   UUID:   3A791AB36DED41008E58CEF52EBEEFD3
>>   Number of blocks:   592384
>>   Block size: 4096
>>   Number of clusters: 592384
>>   Cluster size:   4096
>>   Number of slots:2
>>
>> /dev/loop0 was run with -f, check forced.
>> Pass 0a: Checking cluster allocation chains
>> Pass 0b: Checking inode allocation chains
>> Pass 0c: Checking extent block allocation chains
>> Pass 1: Checking inodes and blocks.
>> Pass 2: Checking directory entries.
>> Pass 3: Checking directory connectivity.
>> Pass 4a: checking for orphaned inodes
>> Pass 4b: Checking inodes link counts.
>> All passes succeeded.




Re: [Ocfs2-users] dlm locking

2011-11-10 Thread Sunil Mushran
The ro issue was different. It appears the volume has more problems.
If you want me to look at the issue, I'll need an image of the volume.
# o2image /dev/device  /tmp/o2image.out

On 11/10/2011 01:55 PM, Nick Khamis wrote:
> Hello Sunil,
>
> Thank you so much for your time, and I do not want to take any more
> of it. I ran fsck with -f and have the following:
>
> fsck.ocfs2 -f /dev/drbd0
> fsck.ocfs2 1.6.4
> Checking OCFS2 filesystem in /dev/drbd0:
>Label:  ASTServer
>UUID:   3A791AB36DED41008E58CEF52EBEEFD3
>Number of blocks:   592384
>Block size: 4096
>Number of clusters: 592384
>Cluster size:   4096
>Number of slots:2
>
> /dev/drbd0 was run with -f, check forced.
> Pass 0a: Checking cluster allocation chains
> Pass 0b: Checking inode allocation chains
> Pass 0c: Checking extent block allocation chains
> Pass 1: Checking inodes and blocks.
> Duplicate clusters detected.  Pass 1b will be run
> Running additional passes to resolve clusters claimed by more than one 
> inode...
> Pass 1b: Determining ownership of multiply-claimed clusters
> pass1b: Inode type does not contain extents while processing inode 5
> fsck.ocfs2: Inode type does not contain extents while performing pass 1
>
> Not sure if the read-only is due to the detected duplicate?
>
> Thanks in Advance,
>
> Nick.




Re: [Ocfs2-users] dlm locking

2011-11-10 Thread Sunil Mushran
Do:

fsck.ocfs2 -f /dev/...

Without -f, it only replays the journal.

On 11/09/2011 05:49 PM, Nick Khamis wrote:
> Hello Sunil,
>
> This is only on the prototype so it's not crucial; however, it would be
> nice to figure out why, for future reference:
>
> fsck.ocfs2 /dev/drbd0
> fsck.ocfs2 1.6.4
> Checking OCFS2 filesystem in /dev/drbd0:
>   Label:  AsteriskServer
>   UUID:   3A791AB36DED41008E58CEF52EBEEFD3
>   Number of blocks:   592384
>   Block size: 4096
>   Number of clusters: 592384
>   Cluster size:   4096
>   Number of slots:2
>
> /dev/drbd0 is clean.  It will be checked after 20 additional mounts.
>
> I can mount it and write to it just fine (read and write). It's just
> when I start the application that reads from the filesystem
> (I don't think there is any writing going on) that it goes into
> read-only mode... It used to work; other than the update to 1.6.4, I am
> not sure what I have changed.
>
> Not quite sure what kind of information you would need to help figure
> out the problem?
>
> Cheers,
>
> Nick.
>
> ___
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users




Re: [Ocfs2-users] dlm locking

2011-11-09 Thread Sunil Mushran
This has nothing to do with the dlm. The error states that the fs
encountered a bad inode on disk. Possible disk corruption. On
encountering it, the fs goes read-only and asks the user to run fsck.

On 11/09/2011 11:51 AM, Nick Khamis wrote:
> Hello Everyone,
>
> For the first time I experienced a dlm lock:
>
> [ 9721.831813] OCFS2 DLM 1.5.0
> [ 9721.917032] ocfs2: Registered cluster interface o2cb
> [ 9722.170848] OCFS2 DLMFS 1.5.0
> [ 9722.179018] OCFS2 User DLM kernel interface loaded
> [ 9755.743195] ocfs2_dlm: Nodes in domain
> ("3A791AB36DED41008E58CEF52EBEEFD3"): 1
> [ 9755.852798] ocfs2: Mounting device (147,0) on (node 1, slot 0) with
> ordered data mode.
> [ 9783.240424] block drbd0: Handshake successful: Agreed network
> protocol version 91
> [ 9783.242922] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
> [ 9783.243074] block drbd0: conn( WFConnection ->  WFReportParams )
> [ 9783.243205] block drbd0: Starting asender thread (from drbd0_receiver 
> [4390])
> [ 9783.271014] block drbd0: data-integrity-alg:
> [ 9783.271298] block drbd0: drbd_sync_handshake:
> [ 9783.271318] block drbd0: self
> 964FFEDA732A512B:0ABD16D2597E52D9:54E3AEC293CEDC7E:120384BD0E3A5705
> bits:3 flags:0
> [ 9783.271342] block drbd0: peer
> B4C81B0FD76EFAC2:0ABD16D2597E52D9:54E3AEC293CEDC7F:120384BD0E3A5705
> bits:0 flags:0
> [ 9783.271364] block drbd0: uuid_compare()=100 by rule 90
> [ 9783.271380] block drbd0: Split-Brain detected, 1 primaries,
> automatically solved. Sync from this node
> [ 9783.271417] block drbd0: peer( Unknown ->  Secondary ) conn(
> WFReportParams ->  WFBitMapS )
> [ 9783.399967] block drbd0: peer( Secondary ->  Primary )
> [ 9783.515979] block drbd0: conn( WFBitMapS ->  SyncSource ) pdsk(
> Outdated ->  Inconsistent )
> [ 9783.522521] block drbd0: Began resync as SyncSource (will sync 12
> KB [3 bits set]).
> [ 9783.629758] block drbd0: Implicitly set pdsk Inconsistent!
> [ 9783.799387] block drbd0: Resync done (total 1 sec; paused 0 sec; 12 K/sec)
> [ 9783.799956] block drbd0: conn( SyncSource ->  Connected ) pdsk(
> Inconsistent ->  UpToDate )
> [ 9795.430801] o2net: accepted connection from node astdrbd2 (num 2)
> at 192.168.2.111:
> [ 9800.231650] ocfs2_dlm: Node 2 joins domain 3A791AB36DED41008E58CEF52EBEEFD3
> [ 9800.231668] ocfs2_dlm: Nodes in domain
> ("3A791AB36DED41008E58CEF52EBEEFD3"): 1 2
> [ 9861.922744] OCFS2: ERROR (device drbd0):
> ocfs2_validate_inode_block: Invalid dinode #35348: OCFS2_VALID_FL not
> set
> [ 9861.922767]
> [ 9861.927278] File system is now read-only due to the potential of
> on-disk corruption. Please run fsck.ocfs2 once the file system is
> unmounted.
> [ 9861.928231] (8009,0):ocfs2_read_locked_inode:496 ERROR: status = -22
>
> Not sure where to start, but with your appreciated help I am sure we
> can get it resolved.
>
> Thanks in Advance,
>
> Nick.
>
> ___
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users




Re: [Ocfs2-users] mixing ocfs2 versions in a cluster

2011-11-09 Thread Sunil Mushran
I would recommend upgrading all the nodes to 1.2.9 as it contains fixes
to known bugs in the versions you are running. Mixing versions is never
recommended mainly because it is hard to test all possible combinations.
It is alright to do so on an interim basis. But never recommended as a
stable setup.

On 11/09/2011 10:53 AM, Shashank wrote:
> Can you mix ocfs2 versions in a cluster?
>
> Eg. I have 4 nodes in a cluster. two nodes with version 1.2.7.-1el4
> and the other two with 1.2.5-6.
>
> Thanks,
> Vik
>




Re: [Ocfs2-users] mount.ocfs2: Device name specified was not found while opening device

2011-11-03 Thread Sunil Mushran
The device is missing. IOW, "ls /dev/Data-1/sto2data-1" is failing.
You need to fix that.
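The /dev/Data-1/sto2data-1 style paths look like LVM volume-group/logical-volume names, so one common cause is that the new volume groups were never scanned or activated on the compute nodes. A sketch of the usual check, assuming LVM (run as root):

```
# Rescan for new volume groups, then activate all logical volumes
vgscan
vgchange -ay

# The device node should now exist
ls -l /dev/Data-1/sto2data-1
```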

On 11/03/2011 06:15 AM, Anderson J. Dominitini wrote:
> Hi guys
>
>   I added a new storage array to my cluster with five new partitions.
> On my head node everything is OK; all partitions were mounted. But on
> the compute nodes, I have a problem getting them to find these
> partitions.
>
> mount.ocfs2: Device name specified was not found while opening device
> /dev/Data-1/sto2data-1
> mount.ocfs2: Device name specified was not found while opening device
> /dev/Data-2/sto2data-2
> mount.ocfs2: Device name specified was not found while opening device
> /dev/Data-3/sto2data-3
> mount.ocfs2: Device name specified was not found while opening device
> /dev/Data-4/sto2data-4
> mount.ocfs2: Device name specified was not found while opening device
> /dev/Data-5/sto2data-5
>
> The kernel + OCFS2 were updated.
>
> The nodes can mount the other partitions, but not the new ones
> described above. Can someone help me?
>
> regards
> Anderson
>




Re: [Ocfs2-users] Error building ocfs2-tools

2011-10-28 Thread Sunil Mushran
On 10/27/2011 07:10 PM, Tim Serong wrote:
> Damn.  It was in Pacemaker's include/crm/ais.h, back before June 27 last 
> year(!), when it was moved to Pacemaker's configure.ac:
>
> https://github.com/ClusterLabs/pacemaker/commit/8e939b0ad779c65d445e2fa150df1cc046428a93#include/crm/ais.h
>
> This means it probably no longer appears in any of Pacemaker's public (devel 
> package) header files, which explains the compile error.
>
> I did some more digging, and we (SUSE) presumably never had this problem 
> because we've been carrying the attached patch for rather a long time. It 
> replaces CRM_SERVICE (a relatively uninteresting number) with a somewhat more 
> useful string literal...
>
>>
>>> I thought the O2CB OCF RA was always provided by either pacemaker (or,
>>> on SUSE at least, in ocfs2-tools), but was never included in the
>>> upstream ocfs2-tools source tree?
>>
>>
>> I thought we had checked-in all the pacemaker related patches. Are we
>> missing something?
>
> The O2CB OCF RA is this thing:
>
> https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/o2cb
>
> It's the (better/stronger/faster :)) equivalent of the o2cb init script, 
> which you use when OCFS2 is under Pacemaker's control.
>
> There's (IMO) a good argument for having OCF RAs included with the project 
> they're intended for use with (all code pertaining to the operation of some 
> program lives in one place).
>
> OTOH, there's another argument for having them included in the generic 
> resource-agents or pacemaker package (Pacemaker and RHCS probably being the 
> only things that actually use OCF RAs).
>
> I suspect the RA was either never submitted to ocfs2-tools, or was never 
> accepted (don't know which, I wasn't involved when it was originally written).

So I am checking in the patch with your sign-off. I hope that is ok with you.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Error building ocfs2-tools

2011-10-27 Thread Sunil Mushran
On 10/27/2011 05:26 PM, Tim Serong wrote:
> That ought to work...  But where did PCMK_SERVICE_ID come from in that 
> context?  AFAICT it's always been CRM_SERVICE there.  See current head:
>
> http://oss.oracle.com/git/?p=ocfs2-tools.git;a=blob;f=ocfs2_controld/pacemaker.c;hb=HEAD#l158
>
> CRM_SERVICE is then mapped back to PCMK_SERVICE_ID in pacemaker's 
> include/crm/ais.h:
>
> https://github.com/ClusterLabs/pacemaker/blob/master/include/crm/ais.h#L54


Where is PCMK_SERVICE_ID defined? This question has come up more than once.


> I thought the O2CB OCF RA was always provided by either pacemaker (or, on 
> SUSE at least, in ocfs2-tools), but was never included in the upstream 
> ocfs2-tools source tree?


I thought we had checked-in all the pacemaker related patches. Are we missing 
something?

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Error building ocfs2-tools

2011-10-27 Thread Sunil Mushran
I don't remember that resource. If it did exist, it would have
existed in pacemaker. ocfs2-tools does not carry any pacemaker
bits. It carries bits that allow it to work with pacemaker & cman.

On 10/27/2011 02:27 PM, Nick Khamis wrote:
> Hello Sunil,
>
> Thank you so much for your response. I just downloaded 1.6. And had to
> add the following to pacemaker.c:
>
> #define PCMK_SERVICE_ID 9
> line 158: log_error("Connection to our AIS plugin (%d) failed",
> PCMK_SERVICE_ID);
>
> to avoid.
>
> pacemaker.c: In function setup_stack:
> pacemaker.c:158: error: PCMK_SERVICE_ID undeclared (first use in this 
> function)
> pacemaker.c:158: error: (Each undeclared identifier is reported only once
> pacemaker.c:158: error: for each function it appears in.)
> make[1]: *** [pacemaker.o] Error 1
> make[1]: Leaving directory `/usr/local/src/ocfs2-tools-1.6.4/ocfs2_controld'
> make: *** [ocfs2_controld] Error 2
>
> Not sure if that was the right thing to do?
>
> On a slightly unrelated note: there used to be a pacemaker OCF resource
> agent script included for o2cb, "o2cb.ocf".
> I take it this is now only provided by pacemaker?
>
> Cheers,
>
> Nick.


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
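To tell in advance whether the `#define` workaround from this thread is needed, one can probe the installed Pacemaker headers before building. A sketch (the `/usr/include/pacemaker` path is the conventional location and an assumption here, as is the `have_pcmk_service_id` helper name):

```shell
# Sketch: report whether the Pacemaker development headers still export
# PCMK_SERVICE_ID. Newer Pacemaker moved the definition out of the
# public headers, which is what breaks the ocfs2_controld build.
have_pcmk_service_id() {
    # $1 = header directory to search
    grep -rqs 'PCMK_SERVICE_ID' "$1"
}

if have_pcmk_service_id /usr/include/pacemaker; then
    echo "headers define PCMK_SERVICE_ID; no workaround needed"
else
    echo "not found; add a local #define (see thread) before building"
fi
```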


Re: [Ocfs2-users] Error building ocfs2-tools

2011-10-27 Thread Sunil Mushran
ocfs2-tools-1.4.4 is too old. Build 1.6.4. The source tarball is on 
oss.oracle.com.

On 10/27/2011 12:45 PM, Nick Khamis wrote:
> Hello Everyone,
>
> I am building ocfs2-tools from source. Modified
> /ocfs2_controld/Makefile to point to the correct pacemaker 1.1.6
> headers:
>
> PCMK_INCLUDES = -I/usr/include/pacemaker -I/usr/include/heartbeat
> -I/usr/include/libxml2 $(GLIB_CFLAGS)
>
> However, for some reason I am getting:
>
> setup_stack:
> pacemaker.c:158: error: PCMK_SERVICE_ID undeclared (first use in this 
> function)
> pacemaker.c:158: error: (Each undeclared identifier is reported only once
> pacemaker.c:158: error: for each function it appears in.)
> make[1]: *** [pacemaker.o] Error 1
> make[1]: Leaving directory `/usr/local/src/ocfs2-tools-1.4.4/ocfs2_controld'
>
> The config I am using:
>
> ./configure --sbindir=/sbin --bin=/bin --libdir=/usr/lib
> --sysconfdir=/etc --datadir=/etc/ocfs2 --sharedstatedir=/var/ocfs2
> --libexecdir=/usr/libexec --localstatedir=/var --mandir=/usr/man
> --enable-dynamic-fsck --enable-dynamic-ctl
>
>
> Thanks in Advance,
>
> Nick.
>
> ___
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-23 Thread Sunil Mushran

Are you sure you have ocfs2-tools-1.6.3? I remember we had an
issue with this with an earlier release... 1.6.1/.2.

On 10/23/2011 10:43 AM, Laurentiu Gosu wrote:

hmm..
#ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
*BUT:*
#ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
I can still kill the ref using device name (-d).

On 10/23/2011 17:57, Sunil Mushran wrote:

I think it stops by uuid. So try doing this the next time.
You are encountering some issue that we have not seen before.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2

On 10/23/2011 05:32 AM, Laurentiu Gosu wrote:

Hi Sunil,
Sorry for my late reply, i just had time today to start from scratch 
and test.
I rebuilt my environment(2 nodes connected to a SAN via 
iSCSI+multipath). I still have the issue that the heartbeat is 
active after I umount my ocfs2 volume.

/etc/init.d/o2cb stop
Stopping O2CB cluster CLUST: Failed
Unable to stop cluster as heartbeat region still active

ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs

After I manually kill the ref (ocfs2_hb_ctl -K -d 
/dev/mapper/volgr1-lvol0 ocfs2) I can successfully stop o2cb. I can 
live with that, but why doesn't it stop automatically? As I understand 
it, heartbeat should be started and stopped when the volume gets 
mounted/unmounted.


br,
Laurentiu.

On 10/19/2011 02:28, Sunil Mushran wrote:

Manual delete will only work if there are no references. In your case
there are references.

You may want to start both nodes from scratch. Do not start/stop
heartbeat manually. Also, do not force-format.

On 10/18/2011 03:54 PM, Laurentiu Gosu wrote:
OK, I rebooted one of the nodes (both had similar issues). But 
something is still fishy.

- i mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
- i unmount it: umount /mnt/tmp/
- tried to stop o2cb:  /etc/init.d/o2cb stop
Stopping O2CB cluster CLUSTER: Failed
Unable to stop cluster as heartbeat region still active
- ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
-  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
- ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
/sys/kernel/config/cluster/CLUSTER/heartbeat/:
total 0
drwxr-xr-x 2 root root    0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D

-rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold

/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
-rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
-rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
-r--r--r-- 1 root root 4096 Oct 19 01:50 pid
-rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block

- i cannot manually delete 
/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/


PS: i'm going to sleep now, i have to be up in a few hours. We can 
continue tomorrow if it's ok with you.

Thank you for your help.

Laurentiu.

On 10/19/2011 01:33, Sunil Mushran wrote:
One way this can happen is if one starts the hb manually and then 
force
formats on that volume. The format will generate a new uuid. Once 
that
happens, the hb tool cannot map the region to the device and thus fails
to stop it. Right now the easiest option on this box is resetting it.

On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:
Yes, i did reformat it(even more than once i think, last week). 
This is a pre-production system and i'm trying various options 
before moving into real life.



On 10/19/2011 01:19, Sunil Mushran wrote:

Did you reformat the volume recently? or, when did you format last?

On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:

well..this is weird
ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
*918673F06F8F4ED188DDCE14F39945F6*  dead_threshold

looks like we have different UUIDs. Where is this coming from??

ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
918673F06F8F4ED188DDCE14F39945F6: 1 refs


On 10/19/2011 01:04, Sunil Mushran wrote:

Let's do it by hand.
rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D


On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:

 ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping 
heartbeat


No improvement :(


On 10/19/2011 00:50, Sunil Mushran wrote:

See if this cleans it up.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:

ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs


On 10/19/2011 00:43, Sunil Mushran wrote:

ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:

mounted.ocfs2 -d
Device                    FS     Stack  UUID                              Label
/dev/m

Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-23 Thread Sunil Mushran

I think it stops by uuid. So try doing this the next time.
You are encountering some issue that we have not seen before.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2

On 10/23/2011 05:32 AM, Laurentiu Gosu wrote:

Hi Sunil,
Sorry for my late reply, i just had time today to start from scratch 
and test.
I rebuilt my environment(2 nodes connected to a SAN via 
iSCSI+multipath). I still have the issue that the heartbeat is active 
after I umount my ocfs2 volume.

/etc/init.d/o2cb stop
Stopping O2CB cluster CLUST: Failed
Unable to stop cluster as heartbeat region still active

ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs

After I manually kill the ref (ocfs2_hb_ctl -K -d 
/dev/mapper/volgr1-lvol0 ocfs2) I can successfully stop o2cb. I can 
live with that, but why doesn't it stop automatically? As I understand 
it, heartbeat should be started and stopped when the volume gets 
mounted/unmounted.


br,
Laurentiu.

On 10/19/2011 02:28, Sunil Mushran wrote:

Manual delete will only work if there are no references. In your case
there are references.

You may want to start both nodes from scratch. Do not start/stop
heartbeat manually. Also, do not force-format.

On 10/18/2011 03:54 PM, Laurentiu Gosu wrote:
OK, I rebooted one of the nodes (both had similar issues). But 
something is still fishy.

- i mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
- i unmount it: umount /mnt/tmp/
- tried to stop o2cb:  /etc/init.d/o2cb stop
Stopping O2CB cluster CLUSTER: Failed
Unable to stop cluster as heartbeat region still active
- ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
-  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
- ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
/sys/kernel/config/cluster/CLUSTER/heartbeat/:
total 0
drwxr-xr-x 2 root root    0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D

-rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold

/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
-rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
-rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
-r--r--r-- 1 root root 4096 Oct 19 01:50 pid
-rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block

- i cannot manually delete 
/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/


PS: i'm going to sleep now, i have to be up in a few hours. We can 
continue tomorrow if it's ok with you.

Thank you for your help.

Laurentiu.

On 10/19/2011 01:33, Sunil Mushran wrote:

One way this can happen is if one starts the hb manually and then force
formats on that volume. The format will generate a new uuid. Once that
happens, the hb tool cannot map the region to the device and thus fails
to stop it. Right now the easiest option on this box is resetting it.

On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:
Yes, i did reformat it(even more than once i think, last week). 
This is a pre-production system and i'm trying various options 
before moving into real life.



On 10/19/2011 01:19, Sunil Mushran wrote:

Did you reformat the volume recently? or, when did you format last?

On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:

well..this is weird
ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
*918673F06F8F4ED188DDCE14F39945F6*  dead_threshold

looks like we have different UUIDs. Where is this coming from??

ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
918673F06F8F4ED188DDCE14F39945F6: 1 refs


On 10/19/2011 01:04, Sunil Mushran wrote:

Let's do it by hand.
rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D


On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:

 ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping 
heartbeat


No improvement :(


On 10/19/2011 00:50, Sunil Mushran wrote:

See if this cleans it up.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:

ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs


On 10/19/2011 00:43, Sunil Mushran wrote:

ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:

mounted.ocfs2 -d
Device                    FS     Stack  UUID                              Label
/dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2


mounted.ocfs2 -f
Device                    FS     Nodes
/dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001

ro02xsrv001 = the other node in the cluster.

By the way, there is no /dev/dm-2
 ls /dev/dm-*
/dev/dm-0  /dev/dm-1


On 10/19/2011 00:37, Sunil Mushran wrote:

So it is not mounted. But we still have a hb thread because
hb could not be stopped during umount. The reason for that
could be the same that causes ocfs2_hb_ctl to fail.
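The long debugging exchange in this thread boils down to one comparison: does a region under configfs still carry a UUID that no currently mounted device reports? A minimal sketch (the default o2cb configfs layout and the `list_hb_regions` helper are assumptions; on a live node, compare the output against `mounted.ocfs2 -d`):

```shell
# Sketch: list the heartbeat region UUIDs registered in configfs for a
# given cluster directory. A region whose UUID matches no device in
# `mounted.ocfs2 -d` is a stale leftover (e.g. from a re-format) and is
# what makes "o2cb stop" fail with "heartbeat region still active".
list_hb_regions() {
    # $1 = configfs cluster dir, e.g. /sys/kernel/config/cluster/CLUSTER
    for d in "$1"/heartbeat/*/; do
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
}

list_hb_regions /sys/kernel/config/cluster/CLUSTER
```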

Re: [Ocfs2-users] OCFS2 slow with multiple writes

2011-10-21 Thread Sunil Mushran
Because in this case the cluster lock may be waiting for the journal
commit to complete. It depends on where the file is being created,
what internal metadata blocks need to be locked, etc. Your dd is not
a simple write. It is a create + allocation + write. If the file already
exists, then the data extents will first be truncated too.

On 10/21/2011 03:27 AM, Prakash Velayutham wrote:
> Hi Sunil,
>
> Thanks for the response. Do you mean OCFS2 is blocking writes from multiple 
> clients? Is that how OCFS2 works? I can understand that writing the (2) 20G 
> files might take longer with "ordered" option as data needs to be flushed to 
> the FS before journal commit, but why is that blocking a new separate file 
> from being written to the file system?
>
> Regards,
> Prakash
>
> On Oct 20, 2011, at 6:25 PM, Sunil Mushran wrote:
>
>> Use writeback. Ordered data requires the data to be flushed
>> before journal commit. And flushing 40G takes time.
>>
>> mount -t ocfs2 -o data=writeback DEVICE PATH
>>
>> On 10/20/2011 03:05 PM, Prakash Velayutham wrote:
>>> Hi,
>>>
>>> OS - SLES 11.1 with HAE
>>> OCFS2 - 1.4.3-0.16.7
>>> Cluster stack - Pacemaker
>>>
>>> I have Heartbeat Filesystem monitor that monitors the OCFS2 file system for 
>>> availability. This monitor kicks in every minute and tries to write a file 
>>> using dd as below.
>>>
>>> dd of=/var/lib/mysql/data1/.Filesystem_status/default_bmimysqlp3 
>>> oflag=direct,sync bs=512 conv=fsync,sync
>>>
>>> If the OCFS2 file system is busy, like when I try to create 2 large files 
>>> (20GB each) in the OCFS2 directory, I see that the above monitor process 
>>> hangs until the 2 files are created. But this causes Pacemaker to fence the 
>>> node as the RA is configured for a timeout of 45secs and the 2 file 
>>> creations do take more than that. The OCFS2 file system is mounted as below.
>>>
>>> /dev/mapper/bmimysqlp3_p4_vol1 on /var/lib/mysql/data1 type ocfs2 
>>> (rw,_netdev,nointr,data=ordered,cluster_stack=pcmk)
>>>
>>> Is there something wrong with the file system itself that a small file 
>>> creation hangs like that? Please let me know if you need any more 
>>> information.
>>>
>>> Thanks,
>>> Prakash


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
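The writeback suggestion above could be applied along these lines. This is a sketch only: the `build_writeback_mount` helper is hypothetical and merely prints the command so it can be reviewed before running; the device and mount point are the ones from Prakash's mail.

```shell
# Sketch: compose the remount command for switching an ocfs2 volume
# from ordered to writeback journaling. Printing instead of executing
# keeps this safe to run anywhere; run the output by hand once reviewed.
build_writeback_mount() {
    # $1 = device, $2 = mount point
    printf 'mount -t ocfs2 -o _netdev,nointr,data=writeback %s %s\n' "$1" "$2"
}

build_writeback_mount /dev/mapper/bmimysqlp3_p4_vol1 /var/lib/mysql/data1
```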


Re: [Ocfs2-users] OCFS2 slow with multiple writes

2011-10-20 Thread Sunil Mushran
Use writeback. Ordered data requires the data to be flushed
before journal commit. And flushing 40G takes time.

mount -t ocfs2 -o data=writeback DEVICE PATH

On 10/20/2011 03:05 PM, Prakash Velayutham wrote:
> Hi,
>
> OS - SLES 11.1 with HAE
> OCFS2 - 1.4.3-0.16.7
> Cluster stack - Pacemaker
>
> I have Heartbeat Filesystem monitor that monitors the OCFS2 file system for 
> availability. This monitor kicks in every minute and tries to write a file 
> using dd as below.
>
> dd of=/var/lib/mysql/data1/.Filesystem_status/default_bmimysqlp3 
> oflag=direct,sync bs=512 conv=fsync,sync
>
> If the OCFS2 file system is busy, like when I try to create 2 large files 
> (20GB each) in the OCFS2 directory, I see that the above monitor process 
> hangs until the 2 files are created. But this causes Pacemaker to fence the 
> node as the RA is configured for a timeout of 45secs and the 2 file creations 
> do take more than that. The OCFS2 file system is mounted as below.
>
> /dev/mapper/bmimysqlp3_p4_vol1 on /var/lib/mysql/data1 type ocfs2 
> (rw,_netdev,nointr,data=ordered,cluster_stack=pcmk)
>
> Is there something wrong with the file system itself that a small file 
> creation hangs like that? Please let me know if you need any more information.
>
> Thanks,
> Prakash
> ___
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran

Manual delete will only work if there are no references. In your case
there are references.

You may want to start both nodes from scratch. Do not start/stop
heartbeat manually. Also, do not force-format.

On 10/18/2011 03:54 PM, Laurentiu Gosu wrote:

OK, I rebooted one of the nodes (both had similar issues). But something is 
still fishy.
- i mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
- i unmount it: umount /mnt/tmp/
- tried to stop o2cb:  /etc/init.d/o2cb stop
Stopping O2CB cluster CLUSTER: Failed
Unable to stop cluster as heartbeat region still active
- ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
-  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
- ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
/sys/kernel/config/cluster/CLUSTER/heartbeat/:
total 0
drwxr-xr-x 2 root root    0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D
-rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold

/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
-rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
-rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
-r--r--r-- 1 root root 4096 Oct 19 01:50 pid
-rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block

- i cannot manually delete 
/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/

PS: i'm going to sleep now, i have to be up in a few hours. We can continue 
tomorrow if it's ok with you.
Thank you for your help.

Laurentiu.

On 10/19/2011 01:33, Sunil Mushran wrote:

One way this can happen is if one starts the hb manually and then force
formats on that volume. The format will generate a new uuid. Once that
happens, the hb tool cannot map the region to the device and thus fails
to stop it. Right now the easiest option on this box is resetting it.

On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:

Yes, i did reformat it(even more than once i think, last week). This is a 
pre-production system and i'm trying various options before moving into real 
life.


On 10/19/2011 01:19, Sunil Mushran wrote:

Did you reformat the volume recently? or, when did you format last?

On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:

well..this is weird
ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
*918673F06F8F4ED188DDCE14F39945F6*  dead_threshold

looks like we have different UUIDs. Where is this coming from??

ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
918673F06F8F4ED188DDCE14F39945F6: 1 refs


On 10/19/2011 01:04, Sunil Mushran wrote:

Let's do it by hand.
rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:

 ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

No improvement :(


On 10/19/2011 00:50, Sunil Mushran wrote:

See if this cleans it up.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:

ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs


On 10/19/2011 00:43, Sunil Mushran wrote:

ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:

mounted.ocfs2 -d
Device                    FS     Stack  UUID                              Label
/dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2

mounted.ocfs2 -f
Device                    FS     Nodes
/dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001

ro02xsrv001 = the other node in the cluster.

By the way, there is no /dev/dm-2
 ls /dev/dm-*
/dev/dm-0  /dev/dm-1


On 10/19/2011 00:37, Sunil Mushran wrote:

So it is not mounted. But we still have a hb thread because
hb could not be stopped during umount. The reason for that
could be the same that causes ocfs2_hb_ctl to fail.

Do:
mounted.ocfs2 -d

On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:

ls -lR /sys/kernel/debug/ocfs2
/sys/kernel/debug/ocfs2:
total 0

ls -lR /sys/kernel/debug/o2dlm
/sys/kernel/debug/o2dlm:
total 0

ocfs2_hb_ctl -I -d /dev/dm-2
ocfs2_hb_ctl: Device name specified was not found while reading uuid

There is no /dev/dm-2 mounted.


On 10/19/2011 00:27, Sunil Mushran wrote:

mount -t debugfs debugfs /sys/kernel/debug

Then list that dir.

Also, do:
ocfs2_hb_ctl -l -d /dev/dm-2

Be careful before killing. We want to be sure that dev is not mounted.

On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:

Again, the outputs:
 cat 
/sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
dm-2
--->here should be volgr1-lvol0 i guess?

ls -lR /sys/kernel/debug/ocfs2
ls: /sys/kernel/debug/ocfs2: No such file or directory

ls -lR /sys/kernel/debug/o2dlm
ls: /sys/kernel/debug/o2dlm: No such file or directory

I think i have to enable debug first somehow..?

Laurentiu.

On 10/19/2011 00:17, Sunil Mu

Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran

One way this can happen is if one starts the hb manually and then force
formats on that volume. The format will generate a new uuid. Once that
happens, the hb tool cannot map the region to the device and thus fails
to stop it. Right now the easiest option on this box is resetting it.

On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:

Yes, i did reformat it(even more than once i think, last week). This is a 
pre-production system and i'm trying various options before moving into real 
life.


On 10/19/2011 01:19, Sunil Mushran wrote:

Did you reformat the volume recently? or, when did you format last?

On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:

well..this is weird
ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
*918673F06F8F4ED188DDCE14F39945F6*  dead_threshold

looks like we have different UUIDs. Where is this coming from??

ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
918673F06F8F4ED188DDCE14F39945F6: 1 refs


On 10/19/2011 01:04, Sunil Mushran wrote:

Let's do it by hand.
rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:

 ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

No improvement :(


On 10/19/2011 00:50, Sunil Mushran wrote:

See if this cleans it up.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:

ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs


On 10/19/2011 00:43, Sunil Mushran wrote:

ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:

mounted.ocfs2 -d
Device                    FS     Stack  UUID                              Label
/dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2

mounted.ocfs2 -f
Device                    FS     Nodes
/dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001

ro02xsrv001 = the other node in the cluster.

By the way, there is no /dev/dm-2
 ls /dev/dm-*
/dev/dm-0  /dev/dm-1


On 10/19/2011 00:37, Sunil Mushran wrote:

So it is not mounted. But we still have a hb thread because
hb could not be stopped during umount. The reason for that
could be the same that causes ocfs2_hb_ctl to fail.

Do:
mounted.ocfs2 -d

On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:

ls -lR /sys/kernel/debug/ocfs2
/sys/kernel/debug/ocfs2:
total 0

ls -lR /sys/kernel/debug/o2dlm
/sys/kernel/debug/o2dlm:
total 0

ocfs2_hb_ctl -I -d /dev/dm-2
ocfs2_hb_ctl: Device name specified was not found while reading uuid

There is no /dev/dm-2 mounted.


On 10/19/2011 00:27, Sunil Mushran wrote:

mount -t debugfs debugfs /sys/kernel/debug

Then list that dir.

Also, do:
ocfs2_hb_ctl -l -d /dev/dm-2

Be careful before killing. We want to be sure that dev is not mounted.

On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:

Again, the outputs:
 cat 
/sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
dm-2
--->here should be volgr1-lvol0 i guess?

ls -lR /sys/kernel/debug/ocfs2
ls: /sys/kernel/debug/ocfs2: No such file or directory

ls -lR /sys/kernel/debug/o2dlm
ls: /sys/kernel/debug/o2dlm: No such file or directory

I think i have to enable debug first somehow..?

Laurentiu.

On 10/19/2011 00:17, Sunil Mushran wrote:

What does this return?
cat 
/sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev

Also, do:
ls -lR /sys/kernel/debug/ocfs2
ls -lR /sys/kernel/debug/o2dlm

On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:

Here is the output:

ls -lR /sys/kernel/config/cluster
/sys/kernel/config/cluster:
total 0
drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER

/sys/kernel/config/cluster/CLUSTER:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
drwxr-xr-x 3 root root    0 Oct 19 00:12 heartbeat
-rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
-rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
drwxr-xr-x 4 root root    0 Oct 11 20:23 node
-rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms

/sys/kernel/config/cluster/CLUSTER/heartbeat:
total 0
drwxr-xr-x 2 root root    0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
-rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold

/sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
-rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
-rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
-r--r--r-- 1 root root 4096 Oct 19 00:12 pid
-rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block

/sys/kernel/config/cluster/CLUSTER/node:
total 0
drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002

/sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
-rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
-rw-r--r-- 1 root root 4096 Oct 19 00:12 local
-rw-r--r-- 1 root root 4096 Oct 19 00:12 num

Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran

Did you reformat the volume recently? or, when did you format last?

On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:

well..this is weird
ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
*918673F06F8F4ED188DDCE14F39945F6*  dead_threshold

looks like we have different UUIDs. Where is this coming from??

ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
918673F06F8F4ED188DDCE14F39945F6: 1 refs


On 10/19/2011 01:04, Sunil Mushran wrote:

Let's do it by hand.
rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:

 ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

No improvement :(


On 10/19/2011 00:50, Sunil Mushran wrote:

See if this cleans it up.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:

ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs


On 10/19/2011 00:43, Sunil Mushran wrote:

ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:

mounted.ocfs2 -d
Device                    FS     Stack  UUID                              Label
/dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2

mounted.ocfs2 -f
Device                    FS     Nodes
/dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001

ro02xsrv001 = the other node in the cluster.

By the way, there is no /dev/dm-2
 ls /dev/dm-*
/dev/dm-0  /dev/dm-1


On 10/19/2011 00:37, Sunil Mushran wrote:

So it is not mounted. But we still have a hb thread because
hb could not be stopped during umount. The reason for that
could be the same that causes ocfs2_hb_ctl to fail.

Do:
mounted.ocfs2 -d

On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:

ls -lR /sys/kernel/debug/ocfs2
/sys/kernel/debug/ocfs2:
total 0

ls -lR /sys/kernel/debug/o2dlm
/sys/kernel/debug/o2dlm:
total 0

ocfs2_hb_ctl -I -d /dev/dm-2
ocfs2_hb_ctl: Device name specified was not found while reading uuid

There is no /dev/dm-2 mounted.


On 10/19/2011 00:27, Sunil Mushran wrote:

mount -t debugfs debugfs /sys/kernel/debug

Then list that dir.

Also, do:
ocfs2_hb_ctl -l -d /dev/dm-2

Be careful before killing. We want to be sure that dev is not mounted.

On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:

Again, the outputs:
cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
dm-2
---> here it should be volgr1-lvol0, I guess?

ls -lR /sys/kernel/debug/ocfs2
ls: /sys/kernel/debug/ocfs2: No such file or directory

ls -lR /sys/kernel/debug/o2dlm
ls: /sys/kernel/debug/o2dlm: No such file or directory

I think I have to enable debug first somehow?

Laurentiu.

On 10/19/2011 00:17, Sunil Mushran wrote:

What does this return?
cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev

Also, do:
ls -lR /sys/kernel/debug/ocfs2
ls -lR /sys/kernel/debug/o2dlm

On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:

Here is the output:

ls -lR /sys/kernel/config/cluster
/sys/kernel/config/cluster:
total 0
drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER

/sys/kernel/config/cluster/CLUSTER:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
drwxr-xr-x 3 root root    0 Oct 19 00:12 heartbeat
-rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
-rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
drwxr-xr-x 4 root root    0 Oct 11 20:23 node
-rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms

/sys/kernel/config/cluster/CLUSTER/heartbeat:
total 0
drwxr-xr-x 2 root root    0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
-rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold

/sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
-rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
-rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
-r--r--r-- 1 root root 4096 Oct 19 00:12 pid
-rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block

/sys/kernel/config/cluster/CLUSTER/node:
total 0
drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002

/sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
-rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
-rw-r--r-- 1 root root 4096 Oct 19 00:12 local
-rw-r--r-- 1 root root 4096 Oct 19 00:12 num

/sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
-rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
-rw-r--r-- 1 root root 4096 Oct 19 00:12 local
-rw-r--r-- 1 root root 4096 Oct 19 00:12 num




On 10/19/2011 00:12, Sunil Mushran wrote:

ls -lR /sys/kernel/config/cluster

What does this return?

On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:

Hi,
I have a two-node ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
My problem is that every time I try to run /etc/init.d/o2cb stop
it fails with this error:
  Stopping O2CB cluster CLUSTER: Failed
  Unable to stop cluster as heartbeat region still active
There is no active mount point. I tried to manually stop the heartbeat
with "ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2" (after finding
the refs number with "ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0").
But even if the refs number is set to zero, the "heartbeat region still
active" error occurs.
How can I fix this?

Thank you in advance.
Laurentiu.

Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
Let's do it by hand.
rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
See if this cleans it up.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
So it is not mounted. But we still have a hb thread because
hb could not be stopped during umount. The reason for that
could be the same that causes ocfs2_hb_ctl to fail.

Do:
mounted.ocfs2 -d



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
mount -t debugfs debugfs /sys/kernel/debug

Then list that dir.

Also, do:
ocfs2_hb_ctl -l -d /dev/dm-2

Be careful before killing. We want to be sure that dev is not mounted.





Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
What does this return?
cat 
/sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev

Also, do:
ls -lR /sys/kernel/debug/ocfs2
ls -lR /sys/kernel/debug/o2dlm
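Those listings only exist once debugfs is mounted, which is exactly what the follow-up in this thread does. A hedged sketch of the precondition check, written against /proc/mounts-style text on stdin so it can be exercised without root (`debugfs_missing` is a hypothetical helper):

```shell
# True (exit 0) when no filesystem is mounted at /sys/kernel/debug,
# judging by /proc/mounts-style lines read from stdin.
debugfs_missing() {
    ! grep -q ' /sys/kernel/debug ' -
}

# On a live node (needs root):
#   debugfs_missing < /proc/mounts && mount -t debugfs debugfs /sys/kernel/debug
```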





Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
ls -lR /sys/kernel/config/cluster

What does this return?

On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
> Hi,
> I have a two-node ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
> ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
> My problem is that every time I try to run /etc/init.d/o2cb stop
> it fails with this error:
>   Stopping O2CB cluster CLUSTER: Failed
>   Unable to stop cluster as heartbeat region still active
> There is no active mount point. I tried to manually stop the heartbeat
> with "ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2" (after finding
> the refs number with "ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0").
> But even if the refs number is set to zero, the "heartbeat region still
> active" error occurs.
> How can I fix this?
>
> Thank you in advance.
> Laurentiu.
>
>
> ___
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Partition table crash, where can I find debug message?

2011-10-12 Thread Sunil Mushran

extent of the corruption... (not crash)

On 10/12/2011 10:51 AM, Sunil Mushran wrote:

Hard to say. You'll need to investigate the extent of the crash.

On 10/12/2011 10:49 AM, Frank Zhang wrote:


Sorry, it's not power outage, it's just a normal reboot.

Is this serious to corrupt the super block?

*From:*Frank Zhang
*Sent:* Wednesday, October 12, 2011 10:37 AM
*To:* 'Sunil Mushran'
*Cc:* 'ocfs2-users@oss.oracle.com'
*Subject:* RE: [Ocfs2-users] Partition table crash, where can I find debug 
message?

Thanks Sunil. Yes, the terminology should be superblock corruption.

I checked with my colleagues; they said the iSCSI server suffered a power outage
yesterday, so they rebooted it.

Given it was under heavy usage because of the many VMs running on it, I guess this
may be the cause. Now I am trying to recover it.

*From:*Sunil Mushran [mailto:sunil.mush...@oracle.com]
*Sent:* Wednesday, October 12, 2011 10:08 AM
*To:* Frank Zhang
*Cc:* 'ocfs2-users@oss.oracle.com'
*Subject:* Re: [Ocfs2-users] Partition table crash, where can I find debug 
message?

Not sure what you mean by a partition table crash. Is it that someone
overwrote the partition table on the iSCSI server? That's what it looks
like. If mount cannot detect the fs type, then it means at least superblock
corruption. And such corruptions are typically caused by external entities.
A stray dd, perhaps.

Did you try recovering the superblock using one of the backups?
fsck.ocfs2 -r [1-6] /dev/sdX ?

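As background on that -r option: ocfs2 keeps up to six backup superblocks at fixed offsets (1, 4, 16, 64, 256 and 1024 GB, each four times the previous), so only the backups that fit on the device actually exist. A small sketch of that progression (`backup_offset_gb` is a hypothetical helper; check the mkfs.ocfs2 and fsck.ocfs2 man pages for the authoritative list):

```shell
# Offset (in GB) of backup superblock N (1..6): 1, 4, 16, 64, 256, 1024.
backup_offset_gb() {
    n=$1 off=1
    while [ "$n" -gt 1 ]; do
        off=$((off * 4))
        n=$((n - 1))
    done
    echo "$off"
}

# Which offset each fsck.ocfs2 -r N choice reads:
for n in 1 2 3 4 5 6; do
    echo "backup $n: $(backup_offset_gb "$n") GB"
done
```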
On 10/11/2011 07:04 PM, Frank Zhang wrote:

Hi experts, recently I observed a partition table crash that really scared
me.

I have two OVM servers sharing OCFS2 over iSCSI. After running a bunch of VMs
for a while, all VMs were gone and I saw the mount points of OCFS2 gone on
both hosts.

Then I tried to mount it again, and the mount failed, saying "please specify
filesystem type". I checked dmesg but there is nothing useful except

"SCSI device sdc: drive cache: write back

sdc: unknown partition table

sd 2:0:0:1: Attached scsi disk sdc

sd 2:0:0:1: Attached scsi generic sg3 type 0

OCFS2 Node Manager 1.4.4

OCFS2 DLM 1.4.4

OCFS2 DLMFS 1.4.4

OCFS2 User DLM kernel interface loaded

connection1:0: detected conn error (1011)"

Basically, after logging into the iSCSI device on both hosts, I created soft links
of /dev/ovm_iscsi1 pointing to the device node under
/dev/disk/by-path/real_isci_device, then I formatted /dev/ovm_iscsi1 to OCFS2
and mounted them somewhere (of course I configured /etc/ocfs2/cluster.conf
and made o2cb start correctly).

Could somebody tell me where to get more debug info to trace the problem? This
is really scary considering I may lose all my VMs because of the silent crash.

And is there any way to recover the partition table? Thanks

___

Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users




___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users



Re: [Ocfs2-users] Partition table crash, where can I find debug message?

2011-10-12 Thread Sunil Mushran

Hard to say. You'll need to investigate the extent of the crash.

On 10/12/2011 10:49 AM, Frank Zhang wrote:


Sorry, it's not power outage, it's just a normal reboot.

Is this serious to corrupt the super block?

*From:*Frank Zhang
*Sent:* Wednesday, October 12, 2011 10:37 AM
*To:* 'Sunil Mushran'
*Cc:* 'ocfs2-users@oss.oracle.com'
*Subject:* RE: [Ocfs2-users] Partition table crash, where can I find debug 
message?

Thanks Suni. Yes the terminology should be super block corruption.

I checked with my colleague they said  the ISCSI server suffered a power outage 
yesterday so they rebooted it.

Given it was under heavy usage because of many VM running on, I guess this may 
be the cause. now I am trying to recover it

*From:*Sunil Mushran [mailto:sunil.mush...@oracle.com] 
<mailto:[mailto:sunil.mush...@oracle.com]>
*Sent:* Wednesday, October 12, 2011 10:08 AM
*To:* Frank Zhang
*Cc:* 'ocfs2-users@oss.oracle.com'
*Subject:* Re: [Ocfs2-users] Partition table crash, where can I find debug 
message?

Not sure what you mean by a partition table crash. Is it that someone
overwrote the partition table on the iscsi server? That's what it looks
like. If mount cannot detect the fs type, then it means atleast superblock
corruption. And such corruptions typically caused by external entities.
Stray dd perhaps.

Did you try recovering the superblock using one of the the backups?
fsck.ocfs2 -r [1-6] /dev/sdX ?

On 10/11/2011 07:04 PM, Frank Zhang wrote:

Hi Experts, recently I observed a partition table crash that made me really 
scared.

I have two OVM servers sharing OCFS2 over iscsi, after running  a bunch of VMs 
for a while,  all VMs were gone and I saw the mount points of OCFS2 gone on 
both hosts.

Then I tried to mount it again, the iscsi device crashed by saying "please specify 
filesystem type". I checked dmesg but there is nothing useful except

"SCSI device sdc: drive cache: write back

sdc: unknown partition table

sd 2:0:0:1: Attached scsi disk sdc

sd 2:0:0:1: Attached scsi generic sg3 type 0

OCFS2 Node Manager 1.4.4

OCFS2 DLM 1.4.4

OCFS2 DLMFS 1.4.4

OCFS2 User DLM kernel interface loaded

connection1:0: detected conn error (1011)"

Basically, after logging into the iSCSI device on both hosts, I created soft 
links of /dev/ovm_iscsi1 pointing to the device node under 
/dev/disk/by-path/real_isci_device, then I formatted /dev/ovm_iscsi1 as OCFS2 
and mounted it somewhere (of course I configured /etc/ocfs2/cluster.conf 
and made o2cb start correctly).
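For reference, a two-node setup along those lines might look like the sketch below. All node names, IPs, the cluster name, and the volume label are made up, and the config is written to a scratch path for illustration; the real file lives at /etc/ocfs2/cluster.conf and must be byte-identical on both nodes:

```shell
# Illustrative two-node cluster.conf; every name/IP here is a placeholder.
CONF=/tmp/cluster.conf.example   # real path: /etc/ocfs2/cluster.conf
cat > "$CONF" <<'EOF'
cluster:
        node_count = 2
        name = ovmcluster

node:
        ip_port = 7777
        ip_address = 192.168.1.11
        number = 0
        name = ovm1
        cluster = ovmcluster

node:
        ip_port = 7777
        ip_address = 192.168.1.12
        number = 1
        name = ovm2
        cluster = ovmcluster
EOF

# Once the config matches on every node:
#   service o2cb restart                     # (re)load the cluster stack
#   mkfs.ocfs2 -L ovmvol /dev/ovm_iscsi1     # format ONCE, from one node only
#   mount -t ocfs2 /dev/ovm_iscsi1 /mnt/ovm  # then mount on each node
```

The one rule that bites people: the file must match on all nodes, and the cluster stack must be restarted after any change, which is exactly the mismatch scenario described elsewhere in this thread.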

Could somebody tell me where to get more debug info to trace the problem? This 
is really scary considering I may lose all my VMs because of the silent crash.

And is there any way to recover the partition table? Thanks
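On the recovery question: before running any repair tool, it is worth imaging the start of the disk so nothing further is lost, and a quick signature check can tell whether the superblock region still holds OCFS2 data at all. The device and file names below are placeholders, and the dd line is left commented so nothing runs by accident; the check assumes default block sizes, where the superblock (signature "OCFSV2", at block 2) falls within the first 64K:

```shell
# Hypothetical device name; substitute the real iSCSI LUN.
DEV=/dev/sdX

# First, preserve the start of the disk before touching anything:
#   dd if="$DEV" of=/tmp/disk-head.img bs=1M count=1 conv=noerror

# Check whether an OCFS2 superblock signature survives. The on-disk
# signature is the string "OCFSV2"; for default block sizes the
# superblock sits within the first 64K of the volume.
has_ocfs2_magic() {
    head -c 65536 "$1" | grep -aq 'OCFSV2'
}

# Usage:
#   has_ocfs2_magic "$DEV" && echo "signature present" || echo "signature gone"
```

If the signature is gone from the primary location, the backup-superblock recovery Sunil mentions (fsck.ocfs2 -r) is the next thing to try; if it is present, the damage may be limited to the partition table rather than the filesystem itself.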

  
  
___

Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

