Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?

2008-04-21 Thread Tao Ma
Hi Mike,
Are you sure it is caused by the ocfs2-tools update?
AFAIK, ocfs2-tools only includes tools such as mkfs, fsck, and tunefs. 
They are all user-space tools, so unless you changed the disk with the 
new tools, they shouldn't cause a kernel panic.
That leaves only one possibility. Have you modified 
/etc/sysconfig/o2cb (this is the location on RHEL; I'm not sure where it 
is on Ubuntu)? I have checked the RPM package for RHEL: it updates 
/etc/sysconfig/o2cb, and this file defines some timeouts.
Do you have a backup of this file? If so, please restore it and see 
whether that helps (I can't say for sure).
If not, do you remember the old timeout values you set for ocfs2? If so, 
you can use o2cb configure to set them yourself.

Good Luck.

Regards,
Tao

mike wrote:
 Hi, I'm running into a big issue. I believe it is OCFS2, I can get my
 machines to kernel panic consistently.
 
 Before I was running Ubuntu Gutsy (7.10) ocfs2-tools 1.2.4.
 
 Now I am running Ubuntu Hardy (8.04) ocfs2-tools 1.3.9.
 
 I am even running the same kernel (2.6.22-14), but the behavior has
 changed with my OCFS2 mounts it seems. At first I thought it was due
 to the newer kernel (2.6.24-16) but it isn't the case. Now it is
 happening no matter which kernel I use. I even compiled my own vanilla
 2.6.25, and it still has this issue.
 
 I have 6 total clients mounting the ocfs2 partition:
 - 2 batch servers which only access it every 5 or 10 minutes to load
 up a PHP script to process
 - 1 server I am trying to rsync from local RAID disk - ocfs2  - I am
 limiting this to 250kb/sec
 - 3 webservers loading normal stuff - PHP scripts, graphics, media
 files - maybe 2MB/sec combined total
 
 That's not even 3MB/sec - yet when I start the rsync, pretty quickly
 the server doing the rsync kernel panics and reboots. The 3 webservers
 all have issues with reading from the OCFS2 mounted partition. The
 %util all drops to 0, it's like it bottlenecks and suspends all disk
 I/O on the webservers for a few seconds. Then things go back to normal
 for a while.
 
 Is there any additional info that could be useful? I am desperately in
 need of help. I have hosting customers and somehow this upgrade has
 pretty much crippled me...
 
 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users



Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?

2008-04-21 Thread Tao Ma
mike wrote:
 I have changed my kernel back to 2.6.22-14-server, and now I don't get
 the kernel panics. It seems like an issue with 2.6.24-16 and some i/o
 made it crash...
   
OK, so it seems to be a bug in the ocfs2 kernel code, not in ocfs2-tools. :)
Could you please describe in more detail how the kernel panic happens?
 However I am still getting file access timeouts once in a while. I am
 nervous about putting more load on the setup.
   
Please also provide more details about these timeouts.


 [EMAIL PROTECTED] .batch]# cat /etc/default/o2cb

 # O2CB_ENABLED: 'true' means to load the driver on boot.
 O2CB_ENABLED=true

 # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
 O2CB_BOOTCLUSTER=mycluster

 # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
 O2CB_HEARTBEAT_THRESHOLD=7
   
This value is rather small. How is your shared disk set up (iSCSI or 
...)? The most common value I have heard of is 61, which is about 
120 secs. I don't know the reason for that value; maybe Sunil can tell you. ;)
You can also refer to 
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT.
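
To make the arithmetic concrete: as I read the FAQ linked above, the relation is O2CB_HEARTBEAT_THRESHOLD = (timeout in seconds / 2) + 1, i.e. one disk heartbeat iteration every 2 seconds. Below is a minimal Python sketch under that assumption. The function names are made up for illustration and are not part of ocfs2-tools.

```python
# Sketch only: assumes the FAQ relation threshold = (timeout_secs / 2) + 1,
# i.e. one disk heartbeat iteration every 2 seconds.
HEARTBEAT_INTERVAL_SECS = 2  # assumed fixed interval between heartbeats


def threshold_to_timeout_secs(threshold: int) -> int:
    """Seconds of missed heartbeats before a node is considered dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def timeout_secs_to_threshold(timeout_secs: int) -> int:
    """O2CB_HEARTBEAT_THRESHOLD value for a desired timeout in seconds."""
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


if __name__ == "__main__":
    for value in (7, 31, 61):
        print(value, "->", threshold_to_timeout_secs(value), "seconds")
```

Under these assumptions a threshold of 7 corresponds to 12 seconds and 61 to 120 seconds, which is why 7 looks small for shared storage that can stall briefly.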
 # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is
 considered dead.
 O2CB_IDLE_TIMEOUT_MS=1

 # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
 O2CB_KEEPALIVE_DELAY_MS=5000

 # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
 O2CB_RECONNECT_DELAY_MS=2000






Re: [Ocfs2-users] Unable to start cluster with one node

2008-05-12 Thread Tao Ma
Hi David,

David Coulson wrote:
 This is probably a stupid question, but here we go.
 
 I have two boxes running RHEL4U6 with DRBD mirroring disk between them. 
 DRBD is set up in active/active mode, and seems to be working nicely. I 
 have OCFS2 filesystems built on the DRBD devices, and normally I am able 
 to mount them on both nodes and life is good.
 
 Now, I have one node down. Not good, but that is why we have two... DRBD 
 is fine, but OCFS2 won't start up correctly.
 
 [EMAIL PROTECTED] network-scripts]# !/etc/init.d/o2cb status
 Module configfs: Loaded
 Filesystem configfs: Mounted
 Module ocfs2_nodemanager: Loaded
 Module ocfs2_dlm: Loaded
 Module ocfs2_dlmfs: Loaded
 Filesystem ocfs2_dlmfs: Mounted
 Checking O2CB cluster ocfs2: Online
   Heartbeat dead threshold: 31
   Network idle timeout: 3
   Network keepalive delay: 2000
   Network reconnect delay: 2000
 Checking O2CB heartbeat: Not active
There is no problem with the status. "Not active" just means that no 
device is heartbeating. If you haven't mounted an ocfs2 volume, it will 
simply show "Not active".
 
 I take it this has something to do with establishing a quorum, which 
 probably isn't happy with a single node. Is there a configuration change 
 or workaround that will allow a single OCFS2 node to mount a filesystem?
A single OCFS2 node can mount a file system without any change to the 
configuration, so please try to mount it. If there is any problem, 
please paste the error message here. Thanks.

Regards,
Tao



Re: [Ocfs2-users] Unable to start cluster with one node

2008-05-12 Thread Tao Ma
Hi David,

David Coulson wrote:
 Hi Tao,
 
 a single OCFS2 node can mount a file system without any change in the 
 configuration. So you may try to mount it. If there is any problem, 
 please paste the error message here. Thanks.
 
 I tried to create a filesystem on an unused DRBD block device... 
 mkfs.ocfs2 seemed to work okay, but it won't mount.
 
 # mount /mnt/mirror2
 ocfs2_hb_ctl: I/O error on channel while starting heartbeat
 mount.ocfs2: Error when attempting to run /sbin/ocfs2_hb_ctl: Operation 
 not permitted
Are you sure you are using the right device in your fstab? If so, could 
you please strace the mount process to see which arguments mount.ocfs2 
passes to ocfs2_hb_ctl and why it fails?
 
 Another ocfs2 filesystem which was built earlier seems happy, however 
 the two nodes were working together then.
So you mean that you can mount an old ocfs2 filesystem on your single 
node, right? If so, have you updated ocfs2-tools recently? (If you use a 
new mkfs.ocfs2 (like 1.3.9) with an old ocfs2_hb_ctl (like 1.2.x), it 
will fail.)

BTW, are there any error messages in dmesg?

Regards,
Tao



Re: [Ocfs2-users] Problems building ocfs2 rpm on Fedora 9

2008-06-29 Thread Tao Ma
Hi Tina,
If I am correct, datavolume is only used by ocfs2-1.2.* and, in the 
future, ocfs2-1.4.*. It is Oracle-specific, so the mainline kernel 
doesn't have this mount option.
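
As an illustration, here is a small Python sketch that drops options the mainline module rejects from a mount option string before it is used. The helper name and the ORACLE_ONLY_OPTIONS set are assumptions for illustration; this thread only confirms that datavolume is missing from the mainline kernel.

```python
# Sketch only: ORACLE_ONLY_OPTIONS is an assumed list; the thread only
# confirms that "datavolume" is rejected by the mainline ocfs2 module.
ORACLE_ONLY_OPTIONS = {"datavolume"}


def strip_unsupported(options: str, unsupported=ORACLE_ONLY_OPTIONS) -> str:
    """Return the comma-separated option string without unsupported entries."""
    kept = [opt for opt in options.split(",") if opt and opt not in unsupported]
    # Fall back to "defaults" so the result is still a valid -o argument.
    return ",".join(kept) if kept else "defaults"


if __name__ == "__main__":
    print(strip_unsupported("datavolume,nointr"))
```

For the option string in the failing mount command, this leaves only nointr.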

Regards,
Tao

Tina Soles wrote:
 Thanks!
 
 I installed the tools rpm and the console as well.  I've successfully
 formatted the ocfs2 system, but when trying to mount it, I'm getting the
 following:
 
 #mount -t ocfs2 -o datavolume,nointr /dev/sdb1 /ocfs2/oradata/orcl
 mount.ocfs2: Invalid argument while mounting /dev/sdb1 on
 /ocfs2/oradata/orcl
 
 Here is dmesg output:
 
 OCFS2 Node Manager 1.5.0
 OCFS2 DLM 1.5.0
 OCFS2 DLMFS 1.5.0
 OCFS2 User DLM kernel interface loaded
 OCFS2 Node Manager 1.5.0
 OCFS2 DLM 1.5.0
 OCFS2 DLMFS 1.5.0
 OCFS2 User DLM kernel interface loaded
 OCFS2 1.5.0
 (19226,0):ocfs2_parse_options:870 ERROR: Unrecognized mount option
 datavolume or missing value
 
 I googled this, and after researching it, I'm still not quite sure what
 to do. Is it ok to use the mount command above without the
 datavolume,nointr parameters? If so, then do I have to include something
 in init.ora? I'm a bit confused. Any help you can give me would be
 appreciated, as I am brand new to all of this.
 
 My versions are:
 
 # rpm -qa | grep ocfs2
 ocfs2console-1.3.9-8.20080221git.fc10.i386
 ocfs2-tools-1.3.9-8.20080221git.fc10.i386
 
 # find /lib/modules/`uname -r` -name \*ocfs2\*
 /lib/modules/2.6.25-14.fc9.i686/kernel/fs/ocfs2
 /lib/modules/2.6.25-14.fc9.i686/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
 /lib/modules/2.6.25-14.fc9.i686/kernel/fs/ocfs2/ocfs2.ko
 /lib/modules/2.6.25-14.fc9.i686/kernel/fs/ocfs2/dlm/ocfs2_dlmfs.ko
 /lib/modules/2.6.25-14.fc9.i686/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
 
 #./o2cb status
 Module configfs: Loaded
 Filesystem configfs: Mounted
 Module ocfs2_nodemanager: Loaded
 Module ocfs2_dlm: Loaded
 Module ocfs2_dlmfs: Loaded
 Filesystem ocfs2_dlmfs: Mounted
 Checking O2CB cluster ocfs2: Online
 Heartbeat dead threshold = 31
   Network idle timeout: 3
   Network keepalive delay: 2000
   Network reconnect delay: 2000
 
 Thanks in advance!
 
 -Original Message-
 From: Sunil Mushran [mailto:[EMAIL PROTECTED] 
 Sent: Friday, June 27, 2008 4:57 PM
 To: Tina Soles
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Problems building ocfs2 rpm on Fedora 9
 
 ocfs2 fs module is shipped with the kernel. So no need to worry there.
 
 $ find /lib/modules/`uname -r` -name \*ocfs2\*
 Should return you 4 modules... fs, dlm, nm and userdlm.
 
 The ocfs2-tools rpm should be available tomorrow for download.
 Fingers crossed.
 
 Tina Soles wrote:
 Thanks for the reply. Can you be more specific and give me the exact 
 name of the native Fedora 9 rpm(s) that I need for ocfs2 and
 ocfs2-tools?
 Thanks.

 
 *From:* Sunil Mushran [mailto:[EMAIL PROTECTED]
 *Sent:* Fri 6/27/2008 1:16 PM
 *To:* Tina Soles
 *Cc:* ocfs2-users@oss.oracle.com
 *Subject:* Re: [Ocfs2-users] Problems building ocfs2 rpm on Fedora 9

 Fedora ships ocfs2 fs modules natively. You don't have to do all this.
 What is missing is the tools rpm.

 But the good news is that that should be available any day now
 literally speaking.

 Tina Soles wrote:
 Hello,

 I'm brand new to RAC and ocfs2. I need to install ocfs2, but there is
 no distribution for Fedora 9, for kernel 2.6.25-14.fc9.i686.

 I have obtained the ocfs2-1.3.9-0.1.tar.gz source file, unzipped it,
 and am following the instructions found here:

 http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#BUILD

 My .rpmmacros file looks like this: (the /usr/src/redhat and all
 subdirectories exist)

 %_topdir /usr/src/redhat

 %_tmppath /usr/src/redhat/tmp

 %_sourcedir /usr/src/redhat/SOURCES

 %_specdir /usr/src/redhat/SPECS

 %_srcrpmdir /usr/src/redhat/SRPMS

 %_rpmdir /usr/src/redhat/RPMS

 %_builddir /usr/src/redhat/BUILD

 I'm logged in as root.

 Here's a list of my kernel* packages installed:

 kernel.i686 2.6.25-14.fc9 installed

 kernel-PAE-devel.i686 2.6.25.6-55.fc9 installed

 kernel-devel.i686 2.6.25-14.fc9 installed

 kernel-devel.i586 2.6.25.6-55.fc9 installed

 kernel-headers.i386 2.6.25.6-55.fc9 installed

 kernel-xen-devel.i686 2.6.25.3-2.fc9 installed

 kerneloops.i386 0.10-11.fc9 installed

 Here's the contents of /usr/src/kernels:

 drwxr-xr-x 19 root root 4096 2008-06-26 13:11 2.6.25-14.fc9.i686

 drwxr-xr-x 19 root root 4096 2008-06-26 14:34 2.6.25.3-2.fc9.i686.xen

 drwxr-xr-x 19 root root 4096 2008-06-26 14:37 2.6.25.6-55.fc9.i586

 drwxr-xr-x 19 root root 4096 2008-06-26 14:33 2.6.25.6-55.fc9.i686.PAE

 When I attempt to issue ./vendor/rhel5/kernel.guess targets, it 
 returns me back to the # prompt with no output. I think this is 
 indicative of my problem, but I'm not sure what to do!

 I can run ./configure with no problem (see below), but when I attempt
 to execute the make statement, I get make: *** No rule to make target
 'rhel5_2.6.25-14.fc9_rpm'. Stop. 

Re: [Ocfs2-users] ocfs2 kernel BUG

2008-08-01 Thread Tao Ma
Hi,
Please provide detailed ocfs2 version information; it may be helpful 
for diagnosis.

Peter Selzner wrote:
 Hi,
 
 we had these entries in /var/log/messages a few days ago:
 
 Jul 28 23:30:47 xxx kernel: (12268,2):ocfs2_extend_file:790 ERROR: bug 
 expression: i_size_read(inode) != (le64_to_cpu(fe->i_size) - *bytes_extended)
 Jul 28 23:30:47 xxx kernel: (12268,2):ocfs2_extend_file:790 ERROR: Inode 
 8323098 i_size = 1572864, dinode i_size = 1568768, bytes_extended = 0, 
 new_i_size = 1576960 
 Jul 28 23:30:47 xxx kernel: klogd 1.4.1, -- state change -- 
 Jul 28 23:30:47 xxx kernel: [ cut here ]
 Jul 28 23:30:47 xxx kernel: kernel BUG at fs/ocfs2/file.c:790!
 Jul 28 23:30:47 xxx kernel: invalid opcode:  [#1]
 Jul 28 23:30:47 xxx kernel: SMP 
 Jul 28 23:30:47 xxx kernel: last sysfs file: /class/infiniband/mthca1/board_id
 Jul 28 23:30:47 xxx kernel: Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_dlm 
 ocfs2_nodemanager configfs cpqci mptctl mptbase ipmi_si ipmi_devintf 
 ipmi_msghandler rdma_ucm rds ib_ucm ib_sdp rdma_cm iw_cm
 ib_addr ib_local_sa ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad bonding 
 ib_mthca ib_mad ib_core button battery ac raw loop dm_round_robin 
 dm_multipath dm_mod usbhid hw_random ide_cd uhci_hcd e1000
 cdrom ehci_hcd bnx2 usbcore ext3 jbd ata_piix ahci libata edd fan thermal 
 processor cciss sg qla2400 qla2300 qla2xxx firmware_class qla2xxx_conf 
 intermodule piix sd_mod scsi_mod ide_disk ide_core
 Jul 28 23:30:47 xxx kernel: CPU:2   
 Jul 28 23:30:47 xxx kernel: EIP:0060:[f9de8173]Tainted: P U VLI 
 Jul 28 23:30:47 xxx kernel: EFLAGS: 00210292   (2.6.16.46-0.12-bigsmp #1) 
 Jul 28 23:30:47 xxx kernel: EIP is at ocfs2_extend_file+0x3cd/0xf9b [ocfs2]
 Jul 28 23:30:47 xxx kernel: eax: 008c   ebx:    ecx: ff00   
 edx: 00200286
 Jul 28 23:30:47 xxx kernel: esi:    edi:    ebp: df05f000   
 esp: e398de70
 Jul 28 23:30:47 xxx kernel: ds: 007b   es: 007b   ss: 0068
 Jul 28 23:30:47 xxx kernel: Process mv (pid: 12268, threadinfo=e398c000 
 task=f7f80660)
 Jul 28 23:30:47 xxx kernel: Stack: 0 dd4f9d88 ce48c000  
  0001 cf253280 dd4f9b80 
 Jul 28 23:30:47 xxx kernel:dd4f9ee4 0017f000   
 f9ddf432 e398dea8 dd4f9b80  
 Jul 28 23:30:47 xxx kernel:0001 e398deb4 e398deb4 ce48c000 
   ece0bc00  
 Jul 28 23:30:47 xxx kernel: Call Trace:
 Jul 28 23:30:47 xxx kernel:  [f9ddf432] ocfs2_status_completion_cb+0x0/0xa 
 [ocfs2]
 Jul 28 23:30:47 xxx kernel:  [f9df72f2] 
 ocfs2_write_lock_maybe_extend+0xb2f/0xde3 [ocfs2]
 Jul 28 23:30:47 xxx kernel:  [f9dea85d] ocfs2_file_write+0x125/0x24d [ocfs2]
 Jul 28 23:30:47 xxx kernel:  [f9dea738] ocfs2_file_write+0x0/0x24d [ocfs2]
 Jul 28 23:30:47 xxx kernel:  [c0164714] vfs_write+0xaa/0x152
 Jul 28 23:30:47 xxx kernel:  [c0164d1f] sys_write+0x3c/0x63
 Jul 28 23:30:47 xxx kernel:  [c0103cab] sysenter_past_esp+0x54/0x79
 Jul 28 23:30:47 xxx kernel: Code: 8b 4c 24 3c ff 71 04 ff 31 68 16 03 00 00 
 68 2b b5 e0 f9 ff 70 10 8b 00 ff b0 c0 00 00 00 68 b1 fd e0 f9 e8 ca a8 33 c6 
 83 c4 3c 0f 0b 16 03 db fb e0 f9 8b 5c 24 20
 8b 03 0f ae e8 89 f6 8b 74 
 
 It was impossible to do ls -al in a certain directory (each process that
 touched files in this directory ended up in DEAD state (uninterruptible
 sleep)). Any suggestions? Thanks.
How did this happen? Could you please explain it in more detail? E.g., 
how many nodes are in your cluster? One node hung; what about the other 
nodes, and what were you doing on them?

Regards,
Tao



Re: [Ocfs2-users] ocfs2 node reboot method

2008-08-05 Thread Tao Ma
Hi,

Masanari Iida wrote:
 Hello list,
 
 There is a 14-node OCFS2 cluster.
 When I reboot all 14 nodes at once, some nodes fail to
 mount the ocfs2 filesystem while rebooting.
 The mount is supposed to be done by /etc/fstab.
 The symptom happens on random nodes.
 I would like to know if there is a rule that OCFS2 nodes
 need to be rebooted one by one, instead of all at once.
There is no such rule for rebooting ocfs2 nodes.
 
 An error message I saw was
 mount.ocfs2: Transport endpoint is not connected while mounting
 /dev/EXTDISK/OCFS2
Interesting. Have you updated ocfs2 on some of the nodes? Normally this 
happens when there is a protocol mismatch among nodes.
Is there any helpful information in dmesg?

Please also provide the ocfs2 version information.

Regards,
Tao



Re: [Ocfs2-users] ocfs2 node reboot method

2008-08-05 Thread Tao Ma
Masanari Iida wrote:
  On Tue, Aug 5, 2008 at 5:43 PM, Tao Ma [EMAIL PROTECTED] wrote:

   
 An error message I saw was
 mount.ocfs2: Transport endpoint is not connected while mounting
 /dev/EXTDISK/OCFS2
   
 Interesting. Have you updated ocfs2 on some of the nodes? Normally this happens
 when there is a protocol mismatch among nodes.
 Is there any helpful information in dmesg?

 
  The boxes are all SLES10 +  ocfs2-tools-1.2.5-SLES-r2997
  I found no other messages at the time of failure.
   
Which version of ocfs2?
  I would like to catch some useful information for troubleshooting.
  Do you think tcpdump may catch something?
  Or do I need to use debug.ocfs2?
  If the latter, what is the right option?
   
Run
debugfs.ocfs2 -l CONN DLM_DOMAIN TCP allow
then mount and check dmesg.

Regards,
Tao




Re: [Ocfs2-users] Enable mlog() messages

2008-08-05 Thread Tao Ma
Masanari Iida wrote:
 Hello again,

 I looked into the source and found that the error message
 "Transport endpoint is not connected" could have come
 from ENOTCONN in tcp.c.
 There are multiple ENOTCONN sites, so I would like to know
 which one produces my message.  I want to enable mlog().

 My understanding is that debug.ocfs2 enables mlog().
 http://oss.oracle.com/osswiki/OCFS2/Debugging

 I think my problem happens while the system boots up
 and mounts the volume from /etc/fstab.
 In that case, how can I enable debugging?
   
As I said in another mail, you can run
debugfs.ocfs2 -l CONN DLM_DOMAIN SUPER allow
and then mount the volume.
dmesg should give us enough information about what happens.

Regards,
Tao



Re: [Ocfs2-users] ocfs2 node reboot method

2008-08-06 Thread Tao Ma
Hi,

Masanari Iida wrote:
 Hello Tao and Sunil,
 In my case, the symptom (ocfs2 failed to mount a volume using
 /etc/fstab) happened when I rebooted the system.
 Even if it fails to mount (by /etc/fstab), I can mount it later
 after I log in to the system.  So it could be some kind of timing issue.
 
 Your advice ("mount and check dmesg") seems to be a manual procedure.
 I would like to know how and where I can set debugfs.ocfs2,
 and make it run just before the ocfs2 mount.
Are you sure your network device has been started before ocfs2? Have 
you added _netdev to your fstab? See question 41 in
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html
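
A quick way to check the _netdev question is to scan fstab for ocfs2 entries whose options lack it. The Python sketch below does that on a text string; the function name and the sample mount point /mnt/ocfs2 are hypothetical, chosen only for illustration.

```python
def ocfs2_entries_missing_netdev(fstab_text: str) -> list:
    """Return mount points of ocfs2 entries whose options lack _netdev."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # fstab fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] != "ocfs2":
            continue
        if "_netdev" not in fields[3].split(","):
            missing.append(fields[1])
    return missing


if __name__ == "__main__":
    # Hypothetical sample entry; read your real /etc/fstab instead.
    sample = "/dev/EXTDISK/OCFS2 /mnt/ocfs2 ocfs2 defaults 0 0\n"
    for mount_point in ocfs2_entries_missing_netdev(sample):
        print("ocfs2 entry without _netdev:", mount_point)
```

Any mount point it reports is a candidate for being mounted before the network is up at boot.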
 
 Regards,
 
 Masanari Iida



Re: [Ocfs2-users] New node..new problems

2008-10-09 Thread Tao Ma

Hi,
Dante Garro wrote:
 Sunil, I have now noticed that the messages are related to node 0, but the
 new node is node 1, and no matter what value I set, it always says 14000 ms.
 Does this change your diagnosis?
Node1 starts the connection with node0, so you see the messages related 
to node0 on node1. It looks like your configuration on node1 is wrong.
Please make sure that the value of O2CB_HEARTBEAT_THRESHOLD in 
/etc/sysconfig/o2cb on node1 is the same as on node0.
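
One way to do this comparison is to copy the o2cb file from each node and diff the settings. The following Python sketch parses KEY=value lines and reports every setting that differs; both function names are made up for illustration, and the file format is assumed to be the simple shell-style KEY=value layout shown earlier in this archive.

```python
def parse_o2cb_config(text: str) -> dict:
    """Parse KEY=value lines from an o2cb sysconfig-style file."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def config_differences(node_a: dict, node_b: dict) -> dict:
    """Return {key: (value_a, value_b)} for every setting that differs."""
    keys = sorted(set(node_a) | set(node_b))
    return {k: (node_a.get(k), node_b.get(k))
            for k in keys if node_a.get(k) != node_b.get(k)}
```

An empty result means the two files define the same values; a setting present on only one node shows up with None on the other side.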

Regards,
Tao

 
 
 -Original Message-
 From: Sunil Mushran [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, October 9, 2008 06:02 p.m.
 To: Dante Garro
 CC: 'ocfs2-users@oss.oracle.com'
 Subject: Re: [Ocfs2-users] New node..new problems
 
 Yeah the cluster timeouts are not consistent. Update and restart the cluster
 on the new node (or all nodes as the case might be).
 
 Hint: cat /sys/kernel/config/cluster/clustername/idle_timeout_ms
 to see the active heartbeat threshold.
 
 Dante Garro wrote:
 Hi all, because of problems with the ocfs2 release in the Debian 
 distribution, I decided to rebuild my cluster, replacing it with a 
 CentOS-based installation. I started by replacing one of the nodes while 
 keeping the other working.

 On this newly created node the following errors appear:

 drbd0: Writing meta data super block now.
 (2558,1):o2hb_check_slot:881 ERROR: Node 0 on device drbd0 has a dead 
 count of 14000 ms, but our count is 13000 ms.
 Please double check your configuration values for
 'O2CB_HEARTBEAT_THRESHOLD'
 OCFS2 1.2.9 Wed Sep 24 19:26:41 PDT 2008 (build
 a693806cb619dd7f225004092b675ede)
 (2520,1):o2net_connect_expired:1585 ERROR: no connection established 
 with node 0 after 30.0 seconds, giving up and returning errors.
 (2556,1):dlm_request_join:901 ERROR: status = -107
 (2556,1):dlm_try_to_join_domain:1049 ERROR: status = -107
 (2556,1):dlm_join_domain:1321 ERROR: status = -107
 (2556,1):dlm_register_domain:1514 ERROR: status = -107
 (2556,1):ocfs2_dlm_init:2024 ERROR: status = -107
 (2556,1):ocfs2_mount_volume:1133 ERROR: status = -107
 ocfs2: Unmounting device (147,0) on (node 1)
 (2591,1):o2hb_check_slot:881 ERROR: Node 0 on device drbd0 has a dead 
 count of 14000 ms, but our count is 13000 ms.
 Please double check your configuration values for
 'O2CB_HEARTBEAT_THRESHOLD'
 (2520,1):o2net_connect_expired:1585 ERROR: no connection established 
 with node 0 after 30.0 seconds, giving up and returning errors.
 (2589,1):dlm_request_join:901 ERROR: status = -107
 (2589,1):dlm_try_to_join_domain:1049 ERROR: status = -107
 (2589,1):dlm_join_domain:1321 ERROR: status = -107
 (2589,1):dlm_register_domain:1514 ERROR: status = -107
 (2589,1):ocfs2_dlm_init:2024 ERROR: status = -107
 (2589,1):ocfs2_mount_volume:1133 ERROR: status = -107
 ocfs2: Unmounting device (147,0) on (node 1)

 I've changed the parameter O2CB_HEARTBEAT_THRESHOLD as O2CB 
 advised me, but it doesn't resolve the issue.

 I hope someone could give me a clue.

 Thanks in advance.

 Dante




Re: [Ocfs2-users] New node..new problems

2008-10-10 Thread Tao Ma


Dante Garro wrote:
 Thanks Tao, I've set up the same value on both nodes and the cluster comes
 online.
 Now, when I try to mount, the following errors appear on node 1 (new
 CentOS):
 (2512,1):o2net_connect_expired:1585 ERROR: no connection established with
 node 0 after 30.0 seconds, giving up and returning errors.
 (3022,1):dlm_request_join:901 ERROR: status = -107
 (3022,1):dlm_try_to_join_domain:1049 ERROR: status = -107
 (3022,1):dlm_join_domain:1321 ERROR: status = -107
 (3022,1):dlm_register_domain:1514 ERROR: status = -107
 (3022,1):ocfs2_dlm_init:2024 ERROR: status = -107
 (3022,1):ocfs2_mount_volume:1133 ERROR: status = -107
 ocfs2: Unmounting device (147,0) on (node 1)
 
 And the following on node 0 (old Debian)
 
  (2228,0):o2net_check_handshake:1093 node nodo2 (num 1) at
 192.168.168.2: advertised net protocol version 103 but 2 is required,
 disconnecting
 
 I believe the Debian message is clear, protocol version incompatibility.
Yes, you are right. The ocfs2 versions on node1 and node0 are not the 
same, so they can't coexist (mount the same volume simultaneously). I 
would suggest that you either change node0 from Debian to CentOS or 
install the same ocfs2 version on both nodes. From the message, it looks 
like your CentOS has ocfs2-1.2.* while your Debian is a bit newer (I 
guess ocfs2 is shipped with the mainline kernel it uses). Currently for 
CentOS 5 we only have 1.2.* and 1.4.*, so I don't think you can find an 
ocfs2 version that is compatible with your Debian.
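
If you need to spot this mismatch across many nodes, the handshake message can be picked apart mechanically. The Python sketch below extracts the advertised and required protocol versions; the regular expression is written against the single message quoted in this thread, so treat the pattern and the function name as assumptions rather than a stable log format.

```python
import re

# Pattern derived from the one o2net_check_handshake message in this thread.
HANDSHAKE_RE = re.compile(
    r"node (?P<name>\S+) \(num (?P<num>\d+)\) at (?P<addr>\S+?):\s+"
    r"advertised net protocol version (?P<advertised>\d+) "
    r"but (?P<required>\d+) is required"
)


def parse_handshake_error(message: str):
    """Return the fields of a handshake error, or None if it does not match."""
    # Collapse line wraps so a message split across lines still matches.
    flat = " ".join(message.split())
    match = HANDSHAKE_RE.search(flat)
    if match is None:
        return None
    return {
        "node": match.group("name"),
        "num": int(match.group("num")),
        "address": match.group("addr"),
        "advertised": int(match.group("advertised")),
        "required": int(match.group("required")),
    }
```

A result whose advertised and required values differ points at nodes running incompatible ocfs2 versions.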

Regards,
Tao
 
 Is there a way to resolve it?
 
 Thanks
 
 Dante



Re: [Ocfs2-users] OCFS2: ERROR (device sdh1): ocfs2_direct_IO_get_blocks

2009-03-01 Thread Tao Ma
Hi Daniel,

Daniel Keisling wrote:
 Patch was here:
 http://oss.oracle.com/pipermail/ocfs2-devel/2008-September/002787.html
Yes, that patch has been merged into ocfs2-1.4 and should be in the 
next release. Also, as Joel said, if you have the appropriate 
support, you should call support and file it that way.

Here is the workaround from the mailing list; I don't know whether it 
is suitable in your case.

I would guess that you are upgrading from ocfs2-1.2 to ocfs2-1.4. If 
that is the case, please find out which file causes this bug. Use
debugfs.ocfs2 -R findpath 23693699 /dev/sdh1 to see what the file is.
If that file isn't a datafile (I think it shouldn't be, since the file is 
only 5120 bytes) and your volume is used for other files (e.g. Oracle 
Home), then please remove datavolume from the mount options and instead 
set the init.ora parameter filesystemio_options=directio (or is it 
odirect). This should limit Oracle to using the O_DIRECT flag only for 
the files that need it. datavolume is legacy; please refer to the OCFS2 
1.4 User's Guide for further information about the datavolume option.
Hope it helps.

Regards,
Tao
 
 [r...@wilracdbdr01 /]# debugfs.ocfs2 -R 'stat 23693699' /dev/sdh1
 Inode: 23693699   Mode: 0660   Generation: 2707416418
 (0xa15fe562)
 FS Generation: 236416663 (0xe176e97)
 Type: Regular   Attr: 0x0   Flags: Valid
 User: 503 (oracle)   Group: 505 (dba)   Size: 5120
 Links: 1   Clusters: 2
 ctime: 0x49a88566 -- Fri Feb 27 19:29:26 2009
 atime: 0x49a88566 -- Fri Feb 27 19:29:26 2009
 mtime: 0x49a88566 -- Fri Feb 27 19:29:26 2009
 dtime: 0x0 -- Wed Dec 31 19:00:00 1969
 ctime_nsec: 0x222d308b -- 573386891
 atime_nsec: 0x21390e2d -- 557387309
 mtime_nsec: 0x222d308b -- 573386891
 Last Extblk: 0
 Sub Alloc Slot: 0   Sub Alloc Bit: 19
 Tree Depth: 0   Count: 243   Next Free Rec: 1
 ##   Offset   Clusters   Block#     Flags
 0    0        2          34610869   0x0
 
  [r...@wilracdbdr01 /]# debugfs.ocfs2 -R stats /dev/sdh1
 Revision: 0.90
 Mount Count: 0   Max Mount Count: 20
 State: 0   Errors: 0
 Check Interval: 0   Last Check: Fri Feb 27 19:47:46 2009
 Creator OS: 0
 Feature Compat: 1 BackupSuper
 Feature Incompat: 0 None
 Tunefs Incomplete: 0 None
 Feature RO compat: 0 None
 Root Blknum: 5   System Dir Blknum: 6
 First Cluster Group Blknum: 3
 Block Size Bits: 12   Cluster Size Bits: 12
 Max Node Slots: 4
 Label: ph1p_arch
 UUID: 839B5D0925C74CD4920F4E8CC065D180
 Cluster stack: classic o2cb
 Inode: 2   Mode: 00   Generation: 236416663 (0xe176e97)
 FS Generation: 236416663 (0xe176e97)
 Type: Unknown   Attr: 0x0   Flags: Valid System Superblock
 User: 0 (root)   Group: 0 (root)   Size: 0
 Links: 0   Clusters: 39321087
 ctime: 0x48515d46 -- Thu Jun 12 13:30:46 2008
 atime: 0x0 -- Wed Dec 31 19:00:00 1969
 mtime: 0x48515d46 -- Thu Jun 12 13:30:46 2008
 dtime: 0x0 -- Wed Dec 31 19:00:00 1969
 ctime_nsec: 0x -- 0
 atime_nsec: 0x -- 0
 mtime_nsec: 0x -- 0
 Last Extblk: 0
 Sub Alloc Slot: Global   Sub Alloc Bit: 65535
 
 -Original Message-
 From: Joel Becker [mailto:joel.bec...@oracle.com] 
 Sent: Friday, February 27, 2009 6:50 PM
 To: Daniel Keisling
 Cc: ocfs2-users@oss.oracle.com; Sunil Mushran
 Subject: Re: OCFS2: ERROR (device sdh1): ocfs2_direct_IO_get_blocks

 On Fri, Feb 27, 2009 at 06:40:38PM -0600, Daniel Keisling wrote:
 I am getting the following error when writing to an OCFS2 
 filesystem:
  
  
 Feb 27 19:06:37 wilracdbdr01 kernel: OCFS2: ERROR (device sdh1):
 ocfs2_direct_IO_get_blocks: Inode 23693699 has a hole at block 6
 Feb 27 19:06:37 wilracdbdr01 kernel: File system is now read-only due to
 the potential of on-disk corruption. Please run fsck.ocfs2 once the file
 system is unmounted.
  This basically says that your filesystem does not support sparse
 files, but it does have a hole in an inode - which shouldn't happen if
 sparse isn't supported.
  Can you send the output of debugfs.ocfs2 -R 'stat 23693699'
 /dev/sdh1 and debugfs.ocfs2 -R stats /dev/sdh1?

 I saw a patch that was released in September 2008.  How do I get this?

  What patch?  Do you have a link?  Without knowing the patch I
 can't tell you whether that patch affects you.

 This is a production system and we are currently unable to start the DB.

  If you have the appropriate support, you should call support and
 file that way.  Support will also want the information I requested
 above.

 Joel
 -- 

 Three o'clock is always too late or too early for anything you
  want to do.
 - Jean-Paul Sartre

 Joel Becker
 Principal Software Developer
 Oracle
 E-mail: 

Re: [Ocfs2-users] OCFS2 fencing

2009-03-12 Thread Tao Ma
Hi ramya,

ramya tn wrote:
 Hi All,
  
 One of our systems fenced itself a few days back, and this has been 
 happening very frequently for many days.
 Unfortunately, we are not able to stop the fencing, as we are 
 not sure what is causing it.
  
 The error I found in the log file is:
 .
 ..
 .
 .
 .
 Feb 20 23:36:41 ImageInt1 kernel: SCSI error : 1 0 2 1 return code = 
 0x2
 Feb 20 23:36:41 ImageInt1 kernel: end_request: I/O error, dev sdc, 
 sector 656216192
 Feb 20 23:36:41 ImageInt1 kernel: SCSI error : 1 0 2 1 return code = 
 0x2
 Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, 
 sector 657248384
 Feb 20 23:36:42 ImageInt1 kernel: SCSI error : 1 0 2 1 return code = 
 0x2
 Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, 
 sector 667312256
 Feb 20 23:36:42 ImageInt1 kernel: SCSI error : 1 0 2 1 return code = 
 0x2
 Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, 
 sector 670408832
 Feb 20 23:36:42 ImageInt1 kernel: SCSI error : 1 0 2 1 return code = 
 0x2
 Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, 
 sector 670666880
 .
 .
 .
 .
 .
 Feb 20 23:53:21 ImageInt1 kernel: Index 13: took 0 ms to do submit_bio 
 for write
 Feb 20 23:53:21 ImageInt1 kernel: Index 14: took 0 ms to do checking slots
 Feb 20 23:53:21 ImageInt1 kernel: Index 15: took 50 ms to do waiting for 
 write completion
 Feb 20 23:53:21 ImageInt1 kernel: Index 16: took 1904 ms to do msleep
 Feb 20 23:53:21 ImageInt1 kernel: Index 17: took 0 ms to do allocating 
 bios for read
 Feb 20 23:53:21 ImageInt1 kernel: Index 18: took 0 ms to do bio alloc read
 Feb 20 23:53:21 ImageInt1 kernel: Index 19: took 0 ms to do bio add page 
 read
 Feb 20 23:53:21 ImageInt1 kernel: Index 20: took 0 ms to do submit_bio 
 for read
 Feb 20 23:53:21 ImageInt1 kernel: Index 21: took 44652 ms to do waiting 
 for read completion
 Feb 20 23:53:21 ImageInt1 kernel: Index 22: took 0 ms to do bio alloc write
 Feb 20 23:53:21 ImageInt1 kernel: Index 23: took 0 ms to do bio add page 
 write
 Feb 20 23:53:21 ImageInt1 kernel: Index 0: took 0 ms to do submit_bio 
 for write
 Feb 20 23:53:21 ImageInt1 kernel: Index 1: took 0 ms to do checking slots
 Feb 20 23:53:21 ImageInt1 kernel: Index 2: took 9307 ms to do waiting 
 for write completion
 Feb 20 23:53:21 ImageInt1 kernel: Index 3: took 0 ms to do allocating 
 bios for read
 Feb 20 23:53:21 ImageInt1 kernel: Index 4: took 0 ms to do bio alloc read
 Feb 20 23:53:21 ImageInt1 kernel: Index 5: took 0 ms to do bio add page read
 Feb 20 23:53:21 ImageInt1 kernel: Index 6: took 0 ms to do submit_bio 
 for read
 Feb 20 23:53:22 ImageInt1 kernel: Index 7: took 35756 ms to do waiting 
 for read completion
 Feb 20 23:53:22 ImageInt1 kernel: Index 8: took 0 ms to do bio alloc write
 Feb 20 23:53:22 ImageInt1 kernel: Index 9: took 0 ms to do bio add page 
 write
 Feb 20 23:53:22 ImageInt1 kernel: Index 10: took 0 ms to do submit_bio 
 for write
 Feb 20 23:53:22 ImageInt1 kernel: Index 11: took 0 ms to do checking slots
 Feb 20 23:53:22 ImageInt1 kernel: Index 12: took 84549 ms to do waiting 
 for write completion
 Feb 20 23:53:22 ImageInt1 kernel: *** ocfs2 is very sorry to be fencing 
 this system by restarting ***
 I found the same SCSI errors each time it fences. Can anyone suggest 
 what could be the reason for these SCSI errors, and whether those SCSI 
 errors are what is causing the fencing?
I don't know the reason for the SCSI errors, so I will just answer your 
second question. Yes, SCSI errors will cause ocfs2 fencing. OCFS2 needs to 
heartbeat on the disk, so if it tries many times and still fails to write 
to the disk because of the SCSI errors, it will fence itself.
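
To see which heartbeat step was stuck, the "Index N: took X ms" lines 
from the log can be sorted by duration. This is only a minimal sketch; 
the function name and the 1000 ms cut-off are my own assumptions, and the 
sample lines are copied from the log above with the mail wrapping undone.

```python
import re

# Matches the o2hb timing lines, e.g.
# "... kernel: Index 21: took 44652 ms to do waiting for read completion"
TIMING_RE = re.compile(r"Index (\d+): took (\d+) ms to do (.+)")

def slow_steps(lines, threshold_ms=1000):
    """Return (index, ms, step) for steps at or above threshold_ms, slowest first."""
    found = []
    for line in lines:
        m = TIMING_RE.search(line)
        if m and int(m.group(2)) >= threshold_ms:
            found.append((int(m.group(1)), int(m.group(2)), m.group(3).strip()))
    return sorted(found, key=lambda item: item[1], reverse=True)

# Lines taken from the log quoted above (unwrapped).
SAMPLE = [
    "Feb 20 23:53:21 ImageInt1 kernel: Index 15: took 50 ms to do waiting for write completion",
    "Feb 20 23:53:21 ImageInt1 kernel: Index 16: took 1904 ms to do msleep",
    "Feb 20 23:53:21 ImageInt1 kernel: Index 21: took 44652 ms to do waiting for read completion",
    "Feb 20 23:53:22 ImageInt1 kernel: Index 12: took 84549 ms to do waiting for write completion",
]

for index, ms, step in slow_steps(SAMPLE):
    print(f"Index {index}: {ms} ms in '{step}'")
```

In this log the slow steps are the waits for read and write completion, 
which fits the I/O errors reported on sdc.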

Regards,
Tao

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] problem stopping o2cb service on one of nodes

2009-04-02 Thread Tao Ma
Hi Nikola,

Nikola Ciprich wrote:
 Hi,
 I'm trying ocfs2 RHEL5 distro, 2.6.29 kernel, ocfstools-1.4.1. I'm using DRBD 
 in primary/primary mode
 as shared storage...
 
 I've configured the service according to quickstart document, and everything 
 works,
 but when I umount fs on both nodes, stopping o2cb service on one of the nodes 
 always
 fails with:
 
 [r...@vbox4 sysconfig]# /etc/rc.d/init.d/o2cb stop
 Stopping O2CB cluster vb34: Failed
 Unable to stop cluster as heartbeat region still active
It looks like your disk heartbeat is still active. I don't know the 
specific reason; maybe 
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1053 ?

But you can stop it manually.
1.  ocfs2_hb_ctl -I -d device
    or  ocfs2_hb_ctl -I -u uuid
This will tell you the reference count for the heartbeat.
2.  ocfs2_hb_ctl -K -d device service
    or  ocfs2_hb_ctl -K -u uuid service
This will kill the heartbeat manually.
"service" is the cluster stack you use; it should be o2cb in your case.

Btw, you can try ocfs2_hb_ctl -K -u uuid service to see whether it is 
the same problem as bug 1053.
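
If you want to script the two steps, something like the following could 
build the command lines. It is only a sketch: the helper name is made up, 
/dev/sdX is a placeholder, and nothing is executed, so you can review the 
commands before running them as root.

```python
import shlex

def hb_ctl_commands(target, by_uuid=False, service="o2cb"):
    """Build the two ocfs2_hb_ctl invocations from the steps above.

    target is a device path, or a volume UUID when by_uuid is True.
    Nothing is run here; the caller decides whether to execute them.
    """
    selector = "-u" if by_uuid else "-d"
    return [
        ["ocfs2_hb_ctl", "-I", selector, target],           # step 1: show references
        ["ocfs2_hb_ctl", "-K", selector, target, service],  # step 2: kill the heartbeat
    ]

# /dev/sdX is a placeholder device name, not a real path.
for argv in hb_ctl_commands("/dev/sdX"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```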

Regards,
Tao



Re: [Ocfs2-users] ocfs2 vs ext3?

2009-04-29 Thread Tao Ma


Andrew (Anything) wrote:
 Hi Andrew,
  I just checked max-features, it doesn't include local which means 
 that you still need to create dlm lock in your local node which will 
 cost some delay. You can check whether your volume enable local by 
 command

 echo 'stats'|debugfs.ocfs2 /dev/sdX|grep Incompat

 If the volume was created by mkfs for use as a local disk, you should see local.
 Otherwise you need to add --fs-features=local to your mkfs.

 Regards,
 Tao
 
 Thanks for your replies Tao and Thomas
 
 I still intend on using the filesystem in a 3 node cluster, but when I found
 it to be 25x slower than a single ext3 part I thought id start
 troubleshooting at the start.
 
 I just did a format of a small partition to give it a quick test anyway, and
 found that it performs just as well as ext3, in some test even better.
 
 Except if I understand correctly, local disables the ability for clustering
 completely?
yes, local means you will use it locally, so other nodes can't use it.
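
As a rough sketch, the check can also be done on the captured output of 
the debugfs.ocfs2 command quoted above. The line format assumed here 
(label, a number, then feature names) and both sample lines are my 
assumptions, not output taken from a real volume.

```python
def incompat_features(stats_output):
    """Extract feature names from the 'Feature Incompat' line of
    debugfs.ocfs2 'stats' output (assumed format: label, number, names)."""
    for line in stats_output.splitlines():
        if "Incompat" in line:
            _, _, rest = line.partition(":")
            # Drop the numeric bitmask and keep only the feature names.
            return [word.lower() for word in rest.split() if not word.isdigit()]
    return []

def is_local_volume(stats_output):
    """True when the volume was formatted with the local feature."""
    return "local" in incompat_features(stats_output)

# Hypothetical sample lines, not captured from a real volume.
SAMPLE_LOCAL = "\tFeature Incompat: 8 local"
SAMPLE_CLUSTER = "\tFeature Incompat: 16 sparse"

print(is_local_volume(SAMPLE_LOCAL))    # True
print(is_local_volume(SAMPLE_CLUSTER))  # False
```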
 
 Dear Andrew,

 I think the result should be normal because ocfs2 is a cluster fs ?

 
 Hi Thomas,
 
 
 on a single node system (with no network latency, and only itself to talk
 to) I had expected better results.
 is a 10x reduction in file creation/modification an expected result?
Could you please tell us a little more about the test case?

Just FYI, I have just committed a new series of patches to 2.6.30 which 
will improve inode allocation a lot. See
http://oss.oracle.com/pipermail/ocfs2-devel/2009-January/003799.html

It will save you a lot of time when you create a large number of files, 
delete them and then recreate them. The link above shows the test results.

Regards,
Tao



Re: [Ocfs2-users] another node is heartbeating in our slot

2009-05-05 Thread Tao Ma
Hi sundar,

sundar mahadevan wrote:
 Hi members,
 Newbie. Help pls.
 
 My setup:
 system 1: opensuse 11.1 with iscsitarget (secondary hard drive with
 logical volume) + ocfs2
 system 2: opensuse 11.1 with open-iscsi (detects the logical volume on
 system 1) + ocfs2
 
 1) mount -t ocfs /dev/sdb /u01/oradata/orcl  (on system 2) i have this
 entry made to /etc/fstab as well as /etc/init.d/boot.local but the
 detection of logical volume /dev/sdb during bootup takes place after
 the mount commands have surpassed and hence the mount fails. Are there
 any other options other than for manual mount after logon ?
 
 2) once i manually mount /dev/sdb on system 2 with mount -t ocfs
 /dev/sdb /u01/oradata/orcl i receive the following error on system 1:
 another node is heartbeating in our slot error.
I would guess that the volume was already mounted on system 1, right? So
what does your cluster.conf look like? Do the 2 nodes know about each other?

Regards,
Tao




Re: [Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

2009-06-08 Thread Tao Ma
Hi Florian,

florian.engelm...@bt.com wrote:
 Hi Tao,
 
 Hi florian,

 florian.engelm...@bt.com wrote:
 Florian,
 the problem here seems to be with network. The nodes are running
 into
 network heartbeat timeout and hence second node is getting fenced.
 Do
 you see o2net thread consuming 100% cpu on any node? if not then
 probably check your network
 thanks,
 --Srini
 I forgot to post my /etc/ocfs2/cluster.conf
 node:
 ip_port = 
 ip_address = 192.168.0.101
 number = 0
 name = defr1elcbtd01
 cluster = ocfs2

 node:
 ip_port = 
 ip_address = 192.168.0.102
 number = 1
 name = defr1elcbtd02
 cluster = ocfs2

 cluster:
 node_count = 2
 name = ocfs2


 192.168.0.10x is eth3 on both nodes and connected with a cross over
 cable. No active network component is involved here.

 defr1elcbtd02:~# traceroute 192.168.0.101
 traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte
 packets
  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
 defr1elcbtd02:~#

 The error message looks like a network problem but why should there
 be a
 network problem if I shutdown a FC port?! I testet it about 20 times
 and
 got about 16 kernel panics starting with the same error message:

 kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at
 192.168.0.101:
 It isn't an error message, just a status report that we can't connect
 to
 that node now. That node may be rebooted or something else, but this
 node don't know, and it only knows the connection is down.
 
 But node defr1elcbtd01 was never down and also the network link (eth3)
 wasn't down. I was able to ping from each node to the other.
 Node 1 is hosting all services and never was faulted while I was
 testing.
 
 All I have to do to panic node 2 is to disable one of two fibre channel
 ports or pull one fibre channel cable or delete node 2 from the cisco
 SAN zoning.
 If I apply one of those 3 errors I get the message about o2net is no
 longer connected to node 1 and 60 seconds later the 2nd node panics
 because of ocfs2 fencing (but this happens only in about 80% of cases -
 in the other 20% of cases o2net does not disconnect and there are no
 messages about ocfs2 at all - like it should be...).
 Everything else is working fine in these 60 seconds. The filesystem is
 still writable from both nodes and both nodes can ping each other (via
 the cluster interconnect).
I just checked your log. Node 2 gets this message because node 1 noticed 
that node 2 had stopped its disk heartbeat for quite a long time, so 
node 1 closed the connection intentionally.

See the log in node 1:
Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node 1,
1 total
Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382
node 1 sc: 81007ddf4400 - , valid 1 - 0, err 0 -
-107
Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 node
1, 1 total
Jun  8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to node
defr1elcbtd02 (num 1) at 192.168.0.102:

And I guess the reason you see this log only sometimes (80%) is timing. 
The ocfs2 disk heartbeat runs every 2 secs, so sometimes node 2 panics 
before node 1 calls o2quo_hb_down, and sometimes node 2 panics after 
node 1 calls o2quo_hb_down (which puts something like "no longer..." in 
node 2's log).

So would you please post your timeout configuration (o2cb)?

Regards,
Tao



 
 Here are the logs with debug logging:
 
 Node 2:
 
 Jun  8 09:46:11 defr1elcbtd02 kernel: qla2xxx :04:00.0: LOOP DOWN
 detected (2).
 Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
 81007c2f0800 refs 3 sock 8100694138c0 node 0 page
 81007fafbb00 pg_off 0] put
 Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
 81007c2f0800 refs 2 sock 8100694138c0 node 0 page
 81007fafbb00 pg_off 0] data_ready hit
 Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
 81007c2f0800 refs 2 sock 8100694138c0 node 0 page
 81007fafbb00 pg_off 0] get
 Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc
 81007c2f0800 refs 3 sock 8100694138c0 node 0 page
 81007fafbb00 pg_off 0] receiving
 Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170
 [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key  num 0] at page_off
 24
 Jun  8 09:46:11 defr1elcbtd02 kernel:
 (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0
 sys_stat 0 key  num 0] processing message
 Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
 81007c2f0800 refs 3 sock 8100694138c0 node 0 page
 81007fafbb00 pg_off 24] get
 Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 [sc
 81007c2f0800 refs 4 sock 8100694138c0 node 0 page
 81007fafbb00 pg_off 0] ret = 1
 Jun  8 09:46:11 defr1elcbtd02 kernel: 

Re: [Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

2009-06-08 Thread Tao Ma
Hi Florian,

florian.engelm...@bt.com wrote:
 Hi Tao,
 
 Hi Florian,

 florian.engelm...@bt.com wrote:
 Hi Tao,

 Hi florian,

 florian.engelm...@bt.com wrote:
 Florian,
 the problem here seems to be with network. The nodes are running
 into
 network heartbeat timeout and hence second node is getting
 fenced.
 Do
 you see o2net thread consuming 100% cpu on any node? if not then
 probably check your network
 thanks,
 --Srini
 I forgot to post my /etc/ocfs2/cluster.conf
 node:
 ip_port = 
 ip_address = 192.168.0.101
 number = 0
 name = defr1elcbtd01
 cluster = ocfs2

 node:
 ip_port = 
 ip_address = 192.168.0.102
 number = 1
 name = defr1elcbtd02
 cluster = ocfs2

 cluster:
 node_count = 2
 name = ocfs2


 192.168.0.10x is eth3 on both nodes and connected with a cross
 over
 cable. No active network component is involved here.

 defr1elcbtd02:~# traceroute 192.168.0.101
 traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte
 packets
  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
 defr1elcbtd02:~#

 The error message looks like a network problem but why should
 there
 be a
 network problem if I shutdown a FC port?! I testet it about 20
 times
 and
 got about 16 kernel panics starting with the same error message:

 kernel: o2net: no longer connected to node defr1elcbtd01 (num 0)
 at
 192.168.0.101:
 It isn't an error message, just a status report that we can't
 connect
 to
 that node now. That node may be rebooted or something else, but
 this
 node don't know, and it only knows the connection is down.
 But node defr1elcbtd01 was never down and also the network link
 (eth3)
 wasn't down. I was able to ping from each node to the other.
 Node 1 is hosting all services and never was faulted while I was
 testing.

 All I have to do to panic node 2 is to disable one of two fibre
 channel
 ports or pull one fibre channel cable or delete node 2 from the
 cisco
 SAN zoning.
 If I apply one of those 3 errors I get the message about o2net is
 no
 longer connected to node 1 and 60 seconds later the 2nd node panics
 because of ocfs2 fencing (but this happens only in about 80% of
 cases -
 in the other 20% of cases o2net does not disconnect and there are no
 messages about ocfs2 at all - like it should be...).
 Everything else is working fine in these 60 seconds. The filesystem
 is
 still writable from both nodes and both nodes can ping each other
 (via
 the cluster interconnect).
 I just checked your log. The error why node 2 get the message is that
 node 1 get the message that node 2 stopped disk heartbeat for quite a
 long time so it stop the connection intentionally. So node 2 get this
 message.

 See the log in node 1:
 Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node
 1,
 1 total
 Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382
 node 1 sc: 81007ddf4400 - , valid 1 - 0, err 0
 -
 -107
 Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 node
 1, 1 total
 Jun  8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to
 node
 defr1elcbtd02 (num 1) at 192.168.0.102:

 And I guess the reason why you see this log sometimes(80%) is that the
 time interval. You know ocfs2 disk heartbeat try every 2 secs so
 sometimes node 2 panic before node 1 call o2quo_hb_down and sometimes
 node2 panic after node 1 call o2quo_hb_down(which will put something
 like no longer... in node 2's log).

 So would you please give your timeout configuration(o2cb)?
 
 These are my setting on node 2:
 
 O2CB_HEARTBEAT_THRESHOLD=61
 O2CB_IDLE_TIMEOUT_MS=6
 O2CB_KEEPALIVE_DELAY_MS=4000
 O2CB_RECONNECT_DELAY_MS=4000
ocfs2 doesn't allow 2 nodes to have different timeouts. So if nodes 1 and 2 
don't have the same configuration, node 2 won't be allowed to join the 
domain and mount the same volume.
Yes, these parameters look much better.
See
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#TIMEOUT
for details.
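
To make the numbers concrete, here is a small sketch of the FAQ rule 
(timeout in seconds = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2) and of a check 
that two nodes use the same values. The helper names are my own, and 
O2CB_IDLE_TIMEOUT_MS is left out of the samples because I am not sure of 
the exact value posted for node 2.

```python
def heartbeat_timeout_secs(threshold, interval_secs=2):
    """Disk heartbeat timeout implied by O2CB_HEARTBEAT_THRESHOLD,
    following the FAQ rule: (threshold - 1) * 2 seconds."""
    return (threshold - 1) * interval_secs

def parse_o2cb(text):
    """Parse KEY=VALUE lines into a dict of ints, skipping anything else."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and value.strip().isdigit():
            settings[key.strip()] = int(value)
    return settings

def differing_keys(a, b):
    """Names of the settings whose values are not identical on both nodes."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))

# Values as posted in this thread (node 1 values are Florian's guess).
NODE1 = """O2CB_HEARTBEAT_THRESHOLD=7
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000
"""
NODE2 = """O2CB_HEARTBEAT_THRESHOLD=61
O2CB_KEEPALIVE_DELAY_MS=4000
O2CB_RECONNECT_DELAY_MS=4000
"""

print(heartbeat_timeout_secs(61))  # 120
print(heartbeat_timeout_secs(7))   # 12
print(differing_keys(parse_o2cb(NODE1), parse_o2cb(NODE2)))
```

Any key printed by the last line has to be made equal on both nodes 
before they can mount the same volume.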

I just went through the whole thread. The panic caused by the storage 
failure is deliberate behavior: with no disk access, there is no point 
in the node surviving. See
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#QUORUM

Regards,
Tao
 
 On node 1 I still got the old setting because there was no downtime to
 reboot this system till today. Is there any way to change the values
 without a reboot? The system is mission critical and I can only reboot
 on Sundays.
 
 Settings on node 1 are the default setting that came with the Debian
 package. I changed them to fit the settings of node 2 after the next
 reboot - so I can only guess they were:
 
 O2CB_HEARTBEAT_THRESHOLD=7
 O2CB_IDLE_TIMEOUT_MS=12000
 O2CB_KEEPALIVE_DELAY_MS=2000
 O2CB_RECONNECT_DELAY_MS=2000


 
 Regards,
 Florian
 
 
 
 Regards,
 Tao



 Here are the logs with debug logging:

 Node 2:

 Jun  8 09:46:11 defr1elcbtd02 kernel: 

Re: [Ocfs2-users] enable acl option for ocfs2

2009-06-18 Thread Tao Ma
Hi Marco,

Marco Huang wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 Hi,
 
 I am setting up two nodes ocfs2 on debian lenny (2.6.26-1-amd64), but
 ocfs2 doesn't come with POSIX ACLs by default. Can anyone
 provide a patch to enable ACLs?
ACL support was added to ocfs2 in 2.6.29. So could you please update your 
kernel version?
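
A quick way to check a `uname -r` string against 2.6.29 is sketched 
below. The function name is made up, and it only compares mainline 
version numbers, so it cannot know about features a distribution may have 
backported.

```python
import re

def kernel_at_least(release, minimum=(2, 6, 29)):
    """Compare the leading x.y.z of a `uname -r` string with minimum."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # A missing third component is treated as 0.
    version = tuple(int(group or 0) for group in m.groups())
    return version >= minimum

print(kernel_at_least("2.6.26-1-amd64"))  # False: older than 2.6.29
print(kernel_at_least("2.6.30"))          # True
```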

Regards,
Tao




Re: [Ocfs2-users] umount hang + high CPU

2009-07-05 Thread Tao Ma
Hi,
Is there something in your system log?
I would guess there should be some info there.

Regards,
Tao

syla...@aim.com wrote:
   Hi,
 
  I had a problem where I got a kernel bug in the logs in ocfs2. That 
 happened when I unmounted the volume after a day or two that it was 
 mounted, so I thought I needed to upgrade the kernel (maybe the next 
 version will be bug free), so I did to 2.6.30, and now I tried mounting 
 and unmounting the volume right away... and it hanged, and the CPU got 
 high with that umount process.
 
 Please advise.
 
 PS. tools and console packages are version 1.4.2.
 



Re: [Ocfs2-users] ocfs2 acl issue

2009-07-15 Thread Tao Ma
Hi Marco,
From the stack, it doesn't look ACL related. So could you please 
file a bug at http://oss.oracle.com/bugzilla/ (just for this; ACL is 
a separate issue) with all of this information? Thanks.

And could you please also run:
objdump -DSl /lib/modules/`uname -r`/kernel/fs/ocfs2/ocfs2.ko > /tmp/ocfs2.txt
and attach the generated file to the bug as well.

Great thanks.

Regards,
Tao

Marco Huang wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 Hi Tiger,
 
 I am also exporting the ocfs2 file system via nfs (with acl) to other
 servers. I am getting the following kernel panic about 2 times in 3
 days. Hope you can look into that, maybe it's related.
 
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz [127658.819038]
 - [ cut here ]
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz [127658.819074] kernel
 BUG at fs/ocfs2/inode.c:244!
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz [127658.819100] invalid
 opcode:  [#1]
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz SMP
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz [127658.819129] last
 sysfs file: /sys/class/fc_transport/target5:0:3/node_name
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz [127658.819158] CPU 2
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz [127658.819180] Modules
 linked in:
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  netconsole
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  nfsd
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  exportfs
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  nfs
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  lockd
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  nfs_acl
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  auth_rpcgss
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  sunrpc
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ocfs2
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  jbd2
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  quota_tree
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ocfs2_dlmfs
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ocfs2_stack_o2cb
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ocfs2_dlm
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ocfs2_nodemanager
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ocfs2_stackglue
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  configfs
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ipv6
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  aoe
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ext2
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  dm_round_robin
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  dm_multipath
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  loop
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  snd_pcm
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  snd_timer
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  snd
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  soundcore
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  snd_page_alloc
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  shpchp
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  pcspkr
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  pci_hotplug
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  serio_raw
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  iTCO_wdt
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  button
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  evdev
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ext3
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  jbd
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  mbcache
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  dm_mirror
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  dm_region_hash
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  dm_log
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  dm_snapshot
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  dm_mod
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ata_generic
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  sd_mod
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  sg
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  sr_mod
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  cdrom
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  usb_storage
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  qla2xxx
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ata_piix
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  scsi_transport_fc
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  libata
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  scsi_tgt
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  tg3
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  mptsas
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ide_pci_generic
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  ide_core
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  libphy
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  mptscsih
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  mptbase
 Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz  

Re: [Ocfs2-users] git checkout on an ocfs2 filesystem

2009-08-31 Thread Tao Ma
Hi Joel,

Joel Becker wrote:
 On Mon, Aug 31, 2009 at 12:16:36PM -0700, Joel Becker wrote:
 On Sun, Aug 30, 2009 at 08:19:08PM -0500, Nathaniel Griswold wrote:
 Has anyone here had problems with git checkouts on ocfs2?
  Oh, boy, this is wacky.
 
   No, it's extra wacky:
 
 5441  lstat64(t/t6015-rev-list-show-all-parents.sh, 0xffc2e318) = -1 ENOENT 
 (No such file or directory)
 5441  lstat64(t, {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
 5441  open(t/t6015-rev-list-show-all-parents.sh, 
 O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0777) = ? ERESTARTSYS (To be restarted)
 5441  --- SIGALRM (Alarm clock) @ 0 (0) ---
 5441  sigreturn()   = ? (mask now [])
 5441  open(t/t6015-rev-list-show-all-parents.sh, 
 O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0777) = -1 EEXIST (File exists)
 5441  write(2, error: git-checkout-index: unabl..., 100) = 100
 
 How on earth is userspace seeing ERESTARTSYS?  Did someone forget to
 -ive it?
ERESTARTSYS can be returned when we are interrupted in ocfs2_cluster_lock.
I hit it when I ran rm -rf on a very large directory and pressed Ctrl+C to 
stop it while testing bug 1162.

Regards,
Tao



Re: [Ocfs2-users] kernel panic - bug in dlmglue.c ?

2009-09-11 Thread Tao Ma
Hi John,
John McNulty wrote:
 Hi,

 I had a system crash last night.  Netconsole caught the following
 trace dump.  Has this one been seen before?
   
This bug is fixed in mainline and should show up in the next ocfs2 release.
See http://oss.oracle.com/bugzilla/show_bug.cgi?id=1162

Regards,
Tao
 Rgds,

 John


 Sep 10 18:26:04 sdorac2b Kernel BUG at
 ...ld/src/redhat/BUILD/ocfs2-1.4.2/fs/ocfs2/dlmglue.c:2308
 Sep 10 18:26:04 sdorac2b invalid opcode:  [1]
 Sep 10 18:26:04 sdorac2b SMP
 Sep 10 18:26:04 sdorac2b
 Sep 10 18:26:04 sdorac2b last sysfs file: /block/sr0/dev
 Sep 10 18:26:04 sdorac2b CPU 4
 Sep 10 18:26:04 sdorac2b
 Sep 10 18:26:04 sdorac2b Modules linked in:
 Sep 10 18:26:04 sdorac2b  hangcheck_timer
 Sep 10 18:26:04 sdorac2b  joydev
 Sep 10 18:26:04 sdorac2b  mptctl
 Sep 10 18:26:04 sdorac2b  ipmi_si(U)
 Sep 10 18:26:04 sdorac2b  ipmi_devintf(U)
 Sep 10 18:26:04 sdorac2b  ipmi_msghandler(U)
 Sep 10 18:26:04 sdorac2b  nfsd
 Sep 10 18:26:04 sdorac2b  exportfs
 Sep 10 18:26:04 sdorac2b  auth_rpcgss
 Sep 10 18:26:04 sdorac2b  ipv6
 Sep 10 18:26:04 sdorac2b  xfrm_nalgo
 Sep 10 18:26:04 sdorac2b  crypto_api
 Sep 10 18:26:04 sdorac2b  netconsole
 Sep 10 18:26:04 sdorac2b  oracleasm(U)
 Sep 10 18:26:04 sdorac2b  autofs4
 Sep 10 18:26:04 sdorac2b  hidp
 Sep 10 18:26:04 sdorac2b  ocfs2(U)
 Sep 10 18:26:04 sdorac2b  nfs
 Sep 10 18:26:04 sdorac2b  lockd
 Sep 10 18:26:04 sdorac2b  fscache
 Sep 10 18:26:04 sdorac2b  nfs_acl
 Sep 10 18:26:04 sdorac2b  rfcomm
 Sep 10 18:26:05 sdorac2b  l2cap
 Sep 10 18:26:05 sdorac2b  bluetooth
 Sep 10 18:26:05 sdorac2b  ocfs2_dlmfs(U)
 Sep 10 18:26:05 sdorac2b  ocfs2_dlm(U)
 Sep 10 18:26:05 sdorac2b  ocfs2_nodemanager(U)
 Sep 10 18:26:05 sdorac2b  configfs
 Sep 10 18:26:05 sdorac2b  sunrpc
 Sep 10 18:26:05 sdorac2b  bonding
 Sep 10 18:26:05 sdorac2b  dm_round_robin
 Sep 10 18:26:05 sdorac2b  dm_multipath
 Sep 10 18:26:05 sdorac2b  scsi_dh
 Sep 10 18:26:05 sdorac2b  video
 Sep 10 18:26:05 sdorac2b  hwmon
 Sep 10 18:26:05 sdorac2b  backlight
 Sep 10 18:26:05 sdorac2b  sbs
 Sep 10 18:26:05 sdorac2b  i2c_ec
 Sep 10 18:26:05 sdorac2b  i2c_core
 Sep 10 18:26:05 sdorac2b  button
 Sep 10 18:26:05 sdorac2b  battery
 Sep 10 18:26:05 sdorac2b  asus_acpi
 Sep 10 18:26:05 sdorac2b  acpi_memhotplug
 Sep 10 18:26:05 sdorac2b  ac
 Sep 10 18:26:05 sdorac2b  parport_pc
 Sep 10 18:26:05 sdorac2b  lp
 Sep 10 18:26:05 sdorac2b  parport
 Sep 10 18:26:05 sdorac2b  st
 Sep 10 18:26:05 sdorac2b  sr_mod
 Sep 10 18:26:05 sdorac2b  cdrom
 Sep 10 18:26:05 sdorac2b  e1000e(U)
 Sep 10 18:26:05 sdorac2b  shpchp
 Sep 10 18:26:06 sdorac2b  hpilo
 Sep 10 18:26:06 sdorac2b  sg
 Sep 10 18:26:06 sdorac2b  serio_raw
 Sep 10 18:26:06 sdorac2b  bnx2(U)
 Sep 10 18:26:06 sdorac2b  pcspkr
 Sep 10 18:26:06 sdorac2b  dm_raid45
 Sep 10 18:26:06 sdorac2b  dm_message
 Sep 10 18:26:06 sdorac2b  dm_region_hash
 Sep 10 18:26:06 sdorac2b  dm_mem_cache
 Sep 10 18:26:06 sdorac2b  dm_snapshot
 Sep 10 18:26:06 sdorac2b  dm_zero
 Sep 10 18:26:06 sdorac2b  dm_mirror
 Sep 10 18:26:06 sdorac2b  dm_log
 Sep 10 18:26:06 sdorac2b  dm_mod
 Sep 10 18:26:06 sdorac2b  usb_storage
 Sep 10 18:26:06 sdorac2b  qla2xxx
 Sep 10 18:26:06 sdorac2b  scsi_transport_fc
 Sep 10 18:26:06 sdorac2b  ata_piix
 Sep 10 18:26:06 sdorac2b  libata
 Sep 10 18:26:06 sdorac2b  cciss
 Sep 10 18:26:06 sdorac2b  mptspi
 Sep 10 18:26:06 sdorac2b  mptscsih
 Sep 10 18:26:06 sdorac2b  scsi_transport_spi
 Sep 10 18:26:06 sdorac2b  sd_mod
 Sep 10 18:26:06 sdorac2b  scsi_mod
 Sep 10 18:26:06 sdorac2b  mptbase
 Sep 10 18:26:06 sdorac2b  raid1
 Sep 10 18:26:06 sdorac2b  ext3
 Sep 10 18:26:06 sdorac2b  jbd
 Sep 10 18:26:06 sdorac2b  uhci_hcd
 Sep 10 18:26:06 sdorac2b  ohci_hcd
 Sep 10 18:26:07 sdorac2b  ehci_hcd
 Sep 10 18:26:07 sdorac2b
 Sep 10 18:26:07 sdorac2b Pid: 965, comm: nxnode Tainted: G
 2.6.18-128.1.14.el5 #1
 Sep 10 18:26:07 sdorac2b RIP: 0010:[886cade7]
 Sep 10 18:26:07 sdorac2b  [886cade7]
 :ocfs2:ocfs2_dentry_lock+0x2f/0xfd
 Sep 10 18:26:07 sdorac2b RSP: 0018:8102e7557dc8  EFLAGS: 00010246
 Sep 10 18:26:07 sdorac2b RAX: 81042fcc0c00 RBX: 8104216cb800
 RCX: 
 Sep 10 18:26:07 sdorac2b RDX: ff00 RSI: 0001
 RDI: 8103f8af5660
 Sep 10 18:26:07 sdorac2b RBP: 0005 R08: 81025ce58970
 R09: 002ebc0c
 Sep 10 18:26:07 sdorac2b R10: 2b19b616b000 R11: 4aa93717
 R12: 81029775dd20
 Sep 10 18:26:07 sdorac2b R13:  R14: 2b92d121d5e0
 R15: 8104216cb800
 Sep 10 18:26:07 sdorac2b FS:  2b92d162e410()
 GS:81042ff77cc0() knlGS:
 Sep 10 18:26:07 sdorac2b CS:  0010 DS:  ES:  CR0: 8005003b
 Sep 10 18:26:07 sdorac2b CR2: 2b19b5fb3cd8 CR3: 000308119000
 CR4: 06e0
 Sep 10 18:26:07 sdorac2b Process nxnode (pid: 965, threadinfo
 8102e7556000, task 81023f8eb040)
 Sep 10 18:26:07 sdorac2b Stack:
 Sep 10 18:26:07 sdorac2b  8102e7557e58
 Sep 10 18:26:07 sdorac2b  
 Sep 10 

Re: [Ocfs2-users] core dump

2010-02-24 Thread Tao Ma
Hi Charlie,

Charlie Sharkey wrote:
 Hi,
 
  
 
 We got this core dump while running the dd command. I haven't matched
 the time of the dump with the /var/log/messages file, but I believe it was
 in response to a cable pull.
You are right. I don't have the ocfs2 source for SLES, but I guess it 
should be the same as the ocfs2 repository.

The panic happens because jbd returned an error to ocfs2 in journal_start 
and set the journal's aborted flag. ocfs2 considers that it can't work 
without jbd, so it panics.

Regards,
Tao
 
  
 
 Any ideas ?
 
  
 
 Thanks in advance,
 
  
 
 charlie
 
  
 
  
 
 Sles10 SP2   x86_64
 
 ocfs2-tools-1.4.0-0.5
 
 ocfs2console-1.4.0-0.5
 
 Linux sr2600-1 2.6.16.60-0.34-smp #1 SMP Fri Jan 16 14:59:01 UTC 2009 
 x86_64 x86_64 x86_64 GNU/Linux
 
  
 
 OCFS2 Node Manager 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build 
 f922955d99ef972235bd0c1fc236c5ddbb368611)
 
 o2cb heartbeat: registered disk 
 mode  
 
 
 OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build 
 f922955d99ef972235bd0c1fc236c5ddbb368611) 
 
 OCFS2 DLMFS 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build 
 f922955d99ef972235bd0c1fc236c5ddbb368611)   
  
 
  
 
 PID: 22456  TASK: 81012ac900c0  CPU: 5   COMMAND: dd
 
  #0 [8101013ff7e0] machine_kexec at 8011c036
  #1 [8101013ff8b0] crash_kexec at 80153ea9
  #2 [8101013ff970] panic at 80133f01
  #3 [8101013ffa60] ocfs2_handle_error at 8859643e
  #4 [8101013ffa80] __ocfs2_abort at 88596569
  #5 [8101013ffb70] ocfs2_start_trans at 885847f1
  #6 [8101013ffba0] ocfs2_write_begin_nolock at 8856563d
  #7 [8101013ffc50] ocfs2_write_begin at 88567267
  #8 [8101013ffcb0] ocfs2_file_aio_write at 88576bc3
  #9 [8101013ffe00] do_sync_write at 80185d25
 #10 [8101013fff10] vfs_write at 80186662
 #11 [8101013fff40] sys_write at 80186c2b
 #12 [8101013fff80] system_call at 8010adba
 
 RIP: 2b107dfe1190  RSP: 7fff2cd9dcc8  RFLAGS: 0246
 
 RAX: 0001  RBX: 8010adba  RCX: 2b107dfe1110
 
 RDX: 0010  RSI: 2b107e28a000  RDI: 0001
 
 RBP: 0001   R8: 2b107e28a011   R9: 
 
 R10:   R11: 0246  R12: 2b107e28a000
 
 R13:   R14: 0010  R15: 8010ad56
 
 ORIG_RAX: 0001  CS: 0033  SS: 002b
 
  
 
 
 
 



Re: [Ocfs2-users] renaming a OCFS2 cluster

2010-02-24 Thread Tao Ma
Hi Werner,

Werner Flamme wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 Hi everyone,
 
 another problem I did not find a solution for...
 
 I ran o2cb configure and configured a cluster named ocfs2. Lazy me,
 I did not invent a new name.
 
 Now this cluster must be renamed to avoid confusion, the name has to
 change to sapcluster.
 
 I changed /etc/ocfs2/cluster.conf to the new name and ran o2cb
 configure again, but to no avail:
 - ---snip---
 # rco2cb configure
 
 
 Configuring the O2CB driver.
 
 
 
 This will configure the on-boot properties of the O2CB driver.
 The following questions will determine whether the driver is loaded on
 boot.  The current values will be shown in brackets ('[]').  Hitting
 ENTER without typing an answer will keep that current value.  Ctrl-C
 will abort.
 
 Load O2CB driver on boot (y/n) [y]:
 Cluster to start on boot (Enter none to clear) [sapcluster]:
 Writing O2CB configuration: OK
 Mounting configfs filesystem at /config: OK
 Loading module ocfs2_dlm: OK
 Loading module ocfs2_dlmfs: OK
 Mounting ocfs2_dlmfs filesystem at /dlm: OK
 Starting cluster sapcluster: Failed
 o2cb_ctl: Memory allocation failed while setting cluster name
 Stopping cluster sapcluster: OK
 ---pins---
 
 I can mount the volume, but in /config/cluster, I still see only the old
 cluster ocfs2, though /etc/ocfs2/cluster.conf and
 /etc/sysconfig/cluster both contain the cluster name sapcluster (on
 all nodes, but it is currently mounted on only one node).
You missed one important step: before changing a cluster name, you need 
to take the cluster offline. So the right steps are:
1. /etc/init.d/o2cb offline
2. Change the name in /etc/ocfs2/cluster.conf.
3. Run /etc/init.d/o2cb configure and change the name there as well.

Now, since you have already changed the configuration (steps 2 and 3), 
there are two ways to get the right cluster up. The simple one is to 
umount the ocfs2 volume and reboot the box.
The other is to rename the cluster (in cluster.conf and in o2cb 
configure) back to the name shown in your /config/cluster and then 
follow the steps above (1, 2, 3).
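
For the cluster.conf edit, a minimal sketch of the rename, assuming the
usual stanza layout (node: stanzas carrying "cluster = NAME" and one
cluster: stanza carrying "name = NAME"). rename_cluster is a hypothetical
helper, not part of ocfs2-tools; it only filters text from stdin to stdout.

```shell
# Rename an O2CB cluster in cluster.conf text read from stdin.
# Usage: rename_cluster OLD NEW < cluster.conf > cluster.conf.new
# Node names are left alone, even if a node happens to be called OLD.
rename_cluster() {
    awk -v old="$1" -v new="$2" '
        # Remember which stanza ("node:" or "cluster:") we are inside.
        /^[a-z]+:[ \t]*$/ { stanza = $1 }
        # Rewrite only the value that holds the cluster name.
        $2 == "=" && $3 == old &&
        (($1 == "cluster" && stanza == "node:") ||
         ($1 == "name" && stanza == "cluster:")) {
            match($0, /^[ \t]*/)                      # keep the indentation
            $0 = substr($0, 1, RLENGTH) $1 " = " new
        }
        { print }
    '
}
```

The output would be written to a new file, checked by eye, and copied to
every node while the cluster is offline, since the file has to be
identical on all nodes.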

Good luck.

Regards,
Tao
 
 BTW, I cannot modify cluster.conf by ocfs2console though I start it as
 root, but I have no problems modifying this file via any editor. I'm
 running SLES9-SP4 on the boxes, btw., and have the following packages
 installed:
 ocfs2-tools-1.2.1-4.2
 ocfs2console-1.2.1-4.2
 
 Will it help to download and install the newest OCFS2 packages from the
 website?
 
 Regards,
 Werner
 


Re: [Ocfs2-users] No Space left on the device.

2010-03-04 Thread Tao Ma
Hi Aravind,

Aravind Divakaran wrote:
 Hi
 
 My ocfs filesystem has 270gb free space.
 
 Filesystem         Size  Used Avail Use% Mounted on
 /dev/mapper/store  501G  232G  270G  47% /data
 
 INode details for ocfs filesystem is
 
 Filesystem           Inodes    IUsed    IFree IUse% Mounted on
 /dev/mapper/store 131225616 60588258 70637358   47% /data
 
 I have run the stat_sysdir.sh on /dev/mapper/store. The output is on the
 below link.
 
 http://pastebin.com/4VLLCpRa
I just went through the output and confirmed that the culprit is free 
space fragmentation. So please refer to
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189.

Regards,
Tao
 
 Please help me to resolve this issue.
 
 Rgds,
 Aravind M D
 
 
 
 Hi Aravind,

 Sounds like you might have hit the free space fragmentation issue:
 http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189

 I'm sure that if you post output of stat_sysdir.sh
 (http://oss.oracle.com/~seeda/misc/stat_sysdir.sh) one of the ocfs2 devs
 will be able to confirm this.

 *If* it is this problem, removing some node slots will help. That is of
 course if you have more node slots than you need. I think 8 are created by
 default.

 Cheers,

 Brad


 On Thu, 4 Mar 2010 10:28:49 +0530 (IST)
 Aravind Divakaran aravind.divaka...@yukthi.com wrote:

 Hi All,

 For my mail server I am using an ocfs2 filesystem configured on a SAN. Now my
 mail delivery application sometimes complains "No space left on the
 device", even though there is enough space and there are enough inodes. Can
 anyone help me solve this issue?


 Rgds,
 Aravind M D





Re: [Ocfs2-users] No Space left on the device.

2010-03-04 Thread Tao Ma
Hi Aravind,

Aravind Divakaran wrote:
 Hi,
 
 I am facing a problem due to free space fragmentation.
 
 http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189
 
 The above link mentions that reducing the number of slots can solve the
 issue. Right now I have 4 slots. My ocfs device is configured for failover
 between 2 mail server systems, so if I reduce it to 2 slots, will that
 solve this issue? Please let me know if this is the command I have to
 give to reduce the slots to 2.
Decreasing the slot count only releases the space we reserved for the 
journals, so it doesn't actually help with free space fragmentation. I 
looked at your system_dir output: the journal size is only 256M, so in 
total you will have 512M more free space. I don't know whether that is 
big enough for you.
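
The 512M figure is plain slot arithmetic: going from 4 slots to 2 drops
two journals of 256M each. A small sketch, assuming one journal per slot
and the same journal size for every slot (freed_journal_mb is a
hypothetical helper, not an ocfs2-tools command):

```shell
# Space (in MB) released by removing node slots.
# Assumes one journal per slot and an identical journal size per slot.
freed_journal_mb() {
    old_slots=$1
    new_slots=$2
    journal_mb=$3
    echo $(( (old_slots - new_slots) * journal_mb ))
}

freed_journal_mb 4 2 256   # 4 slots -> 2 slots, 256M journals: prints 512
```

On a 501G volume that is a tiny gain, which is why reducing slots alone
may not be enough here.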
 
 tunefs.ocfs2 -N 2 /dev/mapper/store
This is an offline operation, so you need to umount the volume on all 
the nodes before running it.

Another way is to cp the file to another volume, remove it, and then cp 
it back. The new copy should be contiguous enough.

Regards,
Tao
 
 Rgds,
 Aravind M D
 



Re: [Ocfs2-users] No Space left on the device.

2010-03-04 Thread Tao Ma
Hi Brad,

Brad Plant wrote:
 Hi Tao,
 
 On Fri, 05 Mar 2010 14:33:36 +0800
 Tao Ma tao...@oracle.com wrote:
 
 Another way is that you can cp the file to another volume, remove it and 
 then cp back. It should be contiguous enough.
 
 Assuming we *can* still write to the FS (i.e. as more of a preventative 
 action), would the following do the same?
 
 cp -a a b
 mv b a
 
 Can the above work as a *hack* online defrag? Will this reduce the amount of 
 free space fragmentation and therefore stop (or reduce the chance of) this 
 problem from occurring? Would we need to do it to the entire file system or 
 just a few larger files?
Yeah, that should work.
You can use debugfs.ocfs2 to check how fragmented a file is.
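
The cp/mv pair can be wrapped in a small helper. This is a sketch only:
rewrite_file is a hypothetical name, it assumes nothing else has the file
open while it runs and that the volume can hold a second copy, and a
fresh copy is not guaranteed to come out contiguous.

```shell
# Rewrite a file through a fresh copy so that its blocks are newly allocated.
# Usage: rewrite_file PATH
rewrite_file() {
    src=$1
    tmp="$src.rewrite.$$"          # temporary name next to the original
    # cp -a keeps mode, ownership and timestamps; clean up if the copy fails.
    cp -a -- "$src" "$tmp" || { rm -f -- "$tmp"; return 1; }
    # rename() within one filesystem replaces the original in a single step.
    mv -f -- "$tmp" "$src"
}
```

The file gets a new inode this way, so hard links to the old one are not
followed. Comparing the extent list in debugfs.ocfs2 before and after
shows whether the rewrite actually helped.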

Regards,
Tao
 
 Cheers,
 
 Brad



Re: [Ocfs2-users] No Space left on the device.

2010-03-04 Thread Tao Ma


Brad Plant wrote:
 Hi Tao,
 
 On Fri, 05 Mar 2010 15:03:50 +0800
 Tao Ma tao...@oracle.com wrote:
 
 Assuming we *can* still write to the FS (i.e. as more of a preventative 
 action), would the following do the same?

 cp -a a b
 mv b a

 Can the above work as a *hack* online defrag? Will this reduce the amount 
 of free space fragmentation and therefore stop (or reduce the chance of) 
 this problem from occurring? Would we need to do it to the entire file 
 system or just a few larger files?
 yeah, that should work.
 you can use debugfs.ocfs2 to see whether a file is more fragmented or not.
 
 Ok, so it will defrag files, but will it help with the inability to write to 
 the file system issue reported in 
 http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189 ?
Yes, copying with cp does help to defragment files.
But the root cause of bug 1189 is that the old mechanism somehow makes 
the volume more fragmented. So the real solution has to be a bug fix in 
the kernel, which, as Sunil said, is ongoing.

Regards,
Tao



Re: [Ocfs2-users] No space left on the device

2010-03-17 Thread Tao Ma
Hi Aravind,

Aravind Divakaran wrote:
 Hi All,
 
 I have already sent one mail regarding the space issue I am facing with my
 ocfs filesystem. As mentioned in the link below, it is an issue related to
 free space fragmentation.
 
 http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189
 
 I saw that a patch for stealing extent allocation is in the 2.6.34-rc1
 kernel, so I compiled the new kernel and installed it on my system.
 
 Below are the ocfs details on my system:
 
 # modinfo ocfs2
 
 filename:       /lib/modules/2.6.34-rc1/kernel/fs/ocfs2/ocfs2.ko
 license:        GPL
 author:         Oracle
 version:        1.5.0
 description:    OCFS2 1.5.0
 srcversion:     A8B69947E8FF56D74858993
 depends:        jbd2,ocfs2_stackglue,quota_tree,ocfs2_nodemanager
 vermagic:       2.6.34-rc1 SMP mod_unload modversions
 
 This is my stat_sysdir.sh output:
 
 http://pastebin.com/RZH9DkTk
 
 Can anyone please help me resolve this? The problem occurs on a
 production mail server with 3000 email IDs.
I just checked your stat_sysdir output. It actually isn't caused by 
extent block allocation, so the patch doesn't help in your case. Yes, 
the problem you are hitting is a fragmentation issue, but the root cause 
is that inode_alloc can't allocate any more inodes (a little different 
from 1189).

I am now working on discontiguous block groups. I think that will 
resolve your issue. I hope it can get into mainline in 2.6.35.

Regards,
Tao



Re: [Ocfs2-users] No space left on the device

2010-03-17 Thread Tao Ma
Hi Aravind,

Aravind Divakaran wrote:
 Hi Tao,
 
 In reply to my previous mail, you wrote:
 
 Another way is that you can cp the file to another volume, remove it and
 then cp back. It should be contiguous enough.
 
 And bug 1189 says:
 
 However, reducing the slot count by 1 (to 4) may not be enough as it does
 not have much contiguous space. It may work. But reducing it by 2 will
 definitely work.
 
 Umount the volume on all nodes and run:
 # tunefs.ocfs2 -N 3 /dev/sda1
 
 Run fsck.ocfs2 for sanity checking.
 
 Will either of the above solutions temporarily solve my problem?
Yes, it works. I just replied to you in another e-mail.

Regards,
Tao



Re: [Ocfs2-users] No space left on the device

2010-03-18 Thread Tao Ma
Hi Aravind,

Aravind Divakaran wrote:
snip
 
 After running the tunefs.ocfs2 command i am getting the following error on
 my console
 
 node01#tunefs.ocfs2 -N 2 /dev/mapper/store
 tunefs.ocfs2 1.4.1
 Segmentation fault
 node01:~#
 Message from sysl...@node01 at Mar 19 05:25:15 ...
  kernel:[  709.834536] [ cut here ]
 
 Message from sysl...@node01 at Mar 19 05:25:15 ...
  kernel:[  709.834678] invalid opcode:  [#1] SMP
 
 Message from sysl...@node01 at Mar 19 05:25:15 ...
  kernel:[  709.834820] last sysfs file: /sys/fs/o2cb/interface_revision
 
 Message from sysl...@node01 at Mar 19 05:25:15 ...
  kernel:[  709.838490] Stack:
 
 Message from sysl...@node01 at Mar 19 05:25:15 ...
  kernel:[  709.838490] Call Trace:
 
 Message from sysl...@node01 at Mar 19 05:25:15 ...
  kernel:[  709.838490] Code: 00 00 80 00 f7 c7 00 00 04 00 74 0b 81 e7 ff
 ff fb ff 0d 00 00 02 00 f7 c7 0
 0f 0b eb fe c3 48 8b 47 58 48 8b 40 48 4c 8b 58 08 41 ff e3 48
 
 And my /var/log/messages shows this error
 
 Mar 19 05:25:15 cmnode01 kernel: [  709.837837]
 Mar 19 05:25:15 cmnode01 kernel: [  709.837896] Pid: 9051, comm:
 tunefs.ocfs2 Not tainted 2.6.34-rc1 #1 S
 Mar 19 05:25:15 cmnode01 kernel: [  709.837984] RIP:
 0010:[a029708b]  [a029708b] flag
 Mar 19 05:25:15 cmnode01 kernel: [  709.838115] RSP: 0018:8802aadc7bc0
  EFLAGS: 00010206
 Mar 19 05:25:15 cmnode01 kernel: [  709.838179] RAX: 0100 RBX:
 001f RCX: 
 Mar 19 05:25:15 cmnode01 kernel: [  709.838246] RDX: 8802a90d6700 RSI:
 0005 RDI: 
 Mar 19 05:25:15 cmnode01 kernel: [  709.838313] RBP: 8802a90d6700 R08:
 8802a90d66d0 R09: 
 Mar 19 05:25:15 cmnode01 kernel: [  709.838381] R10: dead00100100 R11:
 a0297143 R12: 8802
 Mar 19 05:25:15 cmnode01 kernel: [  709.838448] R13: 0005 R14:
 8802a90d66d0 R15: 8802
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490] FS: 
 7fd97b54b760() GS:88000184() knl
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490] CS:  0010 DS:  ES:
  CR0: 80050033
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490] CR2: 01c54048 CR3:
 00028798c000 CR4: 
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490] DR0:  DR1:
  DR2: 
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490] DR3:  DR6:
 0ff0 DR7: 
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490] Process tunefs.ocfs2 (pid:
 9051, threadinfo 8802aadc6
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  a029716e
 0001 0286 8
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490] 0 8802ae72f9e8
 8802a90d66c8 0005 00
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490] 0 8802aadc7c78
 8802aadc7c90 a029e274 00
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [a029716e] ?
 o2cb_dlm_lock+0x2b/0x78 [ocfs2_st
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [a029e274] ?
 user_dlm_cluster_lock+0x2f7/0x44d
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [810f2f43] ?
 __blockdev_direct_IO+0x93e/0x996
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [a029eef7] ?
 dlmfs_file_open+0x0/0x17d [ocfs2_
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [a029f038] ?
 dlmfs_file_open+0x141/0x17d [ocfs
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [810f5b15] ?
 inotify_d_instantiate+0x12/0x38
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [a029eef7] ?
 dlmfs_file_open+0x0/0x17d [ocfs2_
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [810cbcbc] ?
 __dentry_open+0x17f/0x2a1
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [810d6633] ?
 do_last+0x3a8/0x644
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [810d86d6] ?
 do_filp_open+0x1ed/0x5f2
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [810affac] ?
 handle_mm_fault+0x3ee/0x876
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [810de67f] ?
 touch_atime+0x7c/0x127
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [810cba6b] ?
 do_sys_open+0x55/0xfc
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  [810028ab] ?
 system_call_fastpath+0x16/0x1b
 Mar 19 05:25:15 cmnode01 kernel: [  709.838490]  RSP 8802aadc7bc0
 
 Can you please help me understand why I am getting this error?
It looks like you hit a kernel panic. Strange. Did you umount the volume 
before running tunefs.ocfs2? We have never seen this before. Could you 
please file a bug at http://oss.oracle.com/bugzilla/ about this? Thanks.
The good thing is that the crash is in dlmfs, so your data should be ok.

Regards,
Tao


Re: [Ocfs2-users] No space left on the device

2010-03-18 Thread Tao Ma
Hi Aravind,

Aravind Divakaran wrote:
 I am running tunefs.ocfs2 on my 500 GB hard disk, which contains 215 GB of
 data, in order to reduce the slots. I used the command below.
 
 tunefs.ocfs2  -N 3 /dev/mapper/store
 
 Almost 7 hours have passed and it still has not finished. Below is
 the output I am getting.
 
 node01:~# tunefs.ocfs2 -N 3 /dev/mapper/store
 tunefs.ocfs2 1.4.1
 
 How long will it take to reduce the slots? Will it finish within
 10 hours? Can anyone help me?
It shouldn't take that long. I guess it is blocked on something. Is 
this volume umounted on all the nodes? If yes, could you please 
strace it to see what's wrong?

Regards,
Tao



Re: [Ocfs2-users] No space left on the device

2010-03-18 Thread Tao Ma
Hi Aravind,

Aravind Divakaran wrote:
 Hi Tao,
 We have ocfs running on a two-node cluster. Before running tunefs.ocfs2 I
 stopped the ocfs2 service on both systems, which unmounts my ocfs
 filesystem. I cross-checked with mounted.ocfs2, which shows it unmounted
 everywhere. I also tried strace, but it gives no output for that
 process ID.
Oh, so is the program still running? If yes, 'top' should show its state.
If it is in 'D', please run
ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN to see what's wrong.
If it is in 'R', I guess there is an infinite loop or something similar.
Anyway, could you please take an o2image of your disk, compress it, and 
send it to us? We can test it locally.
See man o2image.
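
To pick the 'D' state tasks out of that ps listing, a minimal sketch
(blocked_tasks is a hypothetical helper; it only filters whatever ps
output it is given on stdin):

```shell
# Keep the header line plus every process whose STAT column starts with D
# (uninterruptible sleep). Expects "ps -e -o pid,stat,comm,wchan=..." output.
blocked_tasks() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# On a live system:
#   ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | blocked_tasks
```

The last column of each remaining line names the kernel function the
task is sleeping in, which is the part worth posting to the list.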

Regards,
Tao



Re: [Ocfs2-users] Compile error on RedHat EL5

2010-05-05 Thread Tao Ma
Hi Kristiansen,

On 05/05/2010 04:34 PM, Kristiansen Morten wrote:
 Hi,

 I'm trying to compile ocfs2-tools 1.2.7 on a RedHat EL5 kernel 
 2.6.18-194.el5, but it fails during the make command:

 make[1]: Entering directory 
 `/home/oracle/software/ocfs2/build/build_129/ocfs2-tools-1.2.7/debugfs.ocfs2'
 gcc -Wall -O2  -o debugfs.ocfs2 main.o commands.o dump.o utils.o journal.o 
 find_block_inode.o find_inode_paths.o dump_fs_locks.o -L/usr/local/lib 
 -lglib-2.0   -L../libocfs2 -locfs2 -L../libo2cb -lo2cb -lcom_err   -lreadline 
 -lncurses
 main.o: In function `main':
 main.c:(.text+0x4b6): undefined reference to `ocfs2_get_lock_type_string'
 utils.o: In function `read_whole_file':
 utils.c:(.text+0x4a8): undefined reference to `ocfs2_extent_map_init'
 utils.o: In function `dump_file':
 utils.c:(.text+0x689): undefined reference to `ocfs2_extent_map_init'
 journal.o: In function `read_journal':
 journal.c:(.text+0x10a): undefined reference to `ocfs2_extent_map_init'
 collect2: ld returned 1 exit status
 make[1]: *** [debugfs.ocfs2] Error 1
 make[1]: Leaving directory 
 `/home/oracle/software/ocfs2/build/build_129/ocfs2-tools-1.2.7/debugfs.ocfs2'
 make: *** [debugfs.ocfs2] Error 2

 Can anybody tell me what's wrong?
Where did you get the source?

 The reason to install 1.2.9, is that we are currently running RHEL4 and 
 ocfs2-1.2.4. Now we want to reinstall the servers with RHEL5 and ocfs2-1.2.9, 
 install Oracle with clusterware and then upgrade ocfs2 from 1.2.9 to 1.4.7.
I don't see the point of installing 1.2.9. 1.4.7 does work with an 
old-format volume; there is no problem with that.

Regards,
Tao



Re: [Ocfs2-users] Compile error on RedHat EL5

2010-05-05 Thread Tao Ma
Hi Kristiansen,

On 05/05/2010 06:05 PM, Kristiansen Morten wrote:
 The source is downloaded from 
 http://oss.oracle.com/projects/ocfs2-tools/files/source/v1.2/

 Forgot to tell you that we are reinstalling one node at a time, while the 
 other nodes are still in the Oracle cluster. So you won't be able to mount the 
 ocfs2 disks on the newly reinstalled node while the two other nodes have 
 mounted the disks with ocfs2-1.2.4. Therefore:
 on all nodes
 Reinstall with RHEL5
 Install ocfs2-1.2.9
 Reinstall oracle with clusterware and submit the cluster
 /on all nodes

 on all nodes
 Upgrade ocfs2 from 1.2.9 to 1.4.7
 /on all nodes
Yes, this is the way to upgrade the ocfs2 kernel module.

But ocfs2-tools 1.4.7 can work with ocfs2-1.2.9, IIRC. Did you run into 
any problem?

Regards,
Tao

 It's much faster to upgrade only ocfs2, than reinstall all.

 Morten K

 -Opprinnelig melding-
 Fra: ocfs2-users-boun...@oss.oracle.com 
 [mailto:ocfs2-users-boun...@oss.oracle.com] På vegne av Tao Ma
 Sendt: 5. mai 2010 11:18
 Til: Kristiansen Morten
 Kopi: ocfs2-users@oss.oracle.com
 Emne: Re: [Ocfs2-users] Compile error on RedHat EL5

 Hi Kristiansen,

 On 05/05/2010 04:34 PM, Kristiansen Morten wrote:
 Hi,

 I'm trying to compile ocfs2-tools 1.2.7 on a RedHat EL5 kernel 
 2.6.18-194.el5, but it fails during the make command:

 make[1]: Entering directory 
 `/home/oracle/software/ocfs2/build/build_129/ocfs2-tools-1.2.7/debugfs.ocfs2'
 gcc -Wall -O2  -o debugfs.ocfs2 main.o commands.o dump.o utils.o journal.o 
 find_block_inode.o find_inode_paths.o dump_fs_locks.o -L/usr/local/lib 
 -lglib-2.0   -L../libocfs2 -locfs2 -L../libo2cb -lo2cb -lcom_err   
 -lreadline -lncurses
 main.o: In function `main':
 main.c:(.text+0x4b6): undefined reference to `ocfs2_get_lock_type_string'
 utils.o: In function `read_whole_file':
 utils.c:(.text+0x4a8): undefined reference to `ocfs2_extent_map_init'
 utils.o: In function `dump_file':
 utils.c:(.text+0x689): undefined reference to `ocfs2_extent_map_init'
 journal.o: In function `read_journal':
 journal.c:(.text+0x10a): undefined reference to `ocfs2_extent_map_init'
 collect2: ld returned 1 exit status
 make[1]: *** [debugfs.ocfs2] Error 1
 make[1]: Leaving directory 
 `/home/oracle/software/ocfs2/build/build_129/ocfs2-tools-1.2.7/debugfs.ocfs2'
 make: *** [debugfs.ocfs2] Error 2

 Can anybody tell me what's wrong?
 Where did you get the source?

 The reason to install 1.2.9, is that we are currently running RHEL4 and 
 ocfs2-1.2.4. Now we want to reinstall the servers with RHEL5 and 
 ocfs2-1.2.9, install Oracle with clusterware and then upgrade ocfs2 from 
 1.2.9 to 1.4.7.
 I don't see the point of installing 1.2.9. 1.4.7 does work with an old
 format volume; there is no problem with that.

 Regards,
 Tao

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] List of issues resolved by ocfs2 patch

2010-05-14 Thread Tao Ma
Hi Hogas,

Hogas Ciprian wrote:
 Hello guys

 Where can I see a list of issues resolved by a patch on ocfs2?
 For example, I want to see which problems are solved by patch OCFS2 version 
 1.4.1-1.
 Thanks a lot.
You can check the release notes:
http://oss.oracle.com/projects/ocfs2/news/
They have that information.
As for 1.4.1-1, it is the first release of 1.4, so you can check
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.4/ocfs2-1_4-usersguide.pdf
for details.

Regards,
Tao



Re: [Ocfs2-users] ocfs2 debug tools

2010-05-30 Thread Tao Ma
Hi Nicola,

On 05/30/2010 10:23 PM, Mailing List SVR wrote:
 Hi Sunil,

 even with the latest ocfs2 release (1.4.7 on rhel5) I'm having several issues,
 my systems hang practically every two days when a lot of small files are 
 deleted
 (about 200.000 files, 50-120KB each), can you please describe some
 procedures to collect information about what operations ocfs2 is doing, so that
 I can understand the problem?
Could you please describe the system hang in a little more detail? Is the 
system unresponsive? How do you recover from it? Is there anything in the 
system log? Have you enabled the magic SysRq key? If so, use sysrq+t to dump 
the information for the tasks.
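
For reference, a minimal sketch of the sysrq+t step done from a script instead of the keyboard. It assumes a Linux node where /proc/sys/kernel/sysrq and /proc/sysrq-trigger exist and that it runs as root; the function name dump_tasks is made up for illustration.

```python
from pathlib import Path

SYSRQ_ENABLE = Path("/proc/sys/kernel/sysrq")
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def dump_tasks(enable=SYSRQ_ENABLE, trigger=SYSRQ_TRIGGER):
    """Enable the magic SysRq key and request a task dump (same as sysrq+t)."""
    # Writing "1" enables all SysRq functions for the keyboard combination.
    enable.write_text("1\n")
    # Writing "t" asks the kernel to dump the task states into the kernel log.
    trigger.write_text("t\n")
```

Calling dump_tasks() with the default paths should leave the task list in the kernel log (dmesg or the system log), which is the output worth attaching to a report.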

Regards,
Tao



Re: [Ocfs2-users] O2CB_HEARTBEAT_THRESHOLD won't take changes

2010-05-31 Thread Tao Ma
Hi Elliott,
Elliott Perrin wrote:

 Hello All,

  

 I have multiple OCFS2 clusters on SLES10 SP2 running Xen. We needed to 
 increase the O2CB_HEARTBEAT_THRESHOLD from 31 up to 61 and did so 
 successfully on 2 of our 3 clusters.

  

 However on one of the three clusters we are not able to change the 
 value. The /etc/sysconfig/o2cb file contains 61 as the threshold after 
 reconfiguring via /etc/init.d/o2cb configure, we reconfigure all 3 
 nodes at the same time after unmounting the volume and stopping ocfs2 
 and o2cb on all nodes in the cluster but on restart the value is reset 
 to 31. We have even gone so far as to change the default value in the 
 /etc/init.d/o2cb script from 31 to 61 thinking that must be where the 
 setting of 31 is coming from and even then, the value is set back to 
 31. Oh, and yes, we have rebooted with the new values in both the 
 /etc/sysconfig/o2cb config and with our change of the default value in 
 /etc/init.d/o2cb with no luck.

I take it you mean that
/sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold can't be set 
to 61, right?
You said that even after a restart the heartbeat threshold is still 31? That 
would be very strange.
So could you please do the following steps:
1. /etc/init.d/o2cb offline
   This takes the cluster offline. Please make sure you don't see 
   anything in /sys/kernel/config/cluster/ afterwards.
2. grep HEARTBEAT /etc/sysconfig/o2cb and paste the output here so that we 
   know the configuration file has been changed.
3. strace /etc/init.d/o2cb online and attach the output here. Let's see 
   what's going on.
4. cat /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold to see 
   the new value.
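
Steps 2 and 4 boil down to comparing two numbers, which can be sketched as below. This is only an illustration: the function names are hypothetical, the cluster directory under /sys/kernel/config/cluster/ depends on your cluster name, and the timeout formula assumes the usual 2-second heartbeat interval, i.e. timeout = (threshold - 1) * 2 seconds.

```python
import re


def configured_threshold(sysconfig_text):
    """Return O2CB_HEARTBEAT_THRESHOLD from /etc/sysconfig/o2cb text, or None."""
    for line in sysconfig_text.splitlines():
        # Comment lines start with '#', so they never match here.
        m = re.match(r'\s*O2CB_HEARTBEAT_THRESHOLD\s*=\s*"?(\d+)"?', line)
        if m:
            return int(m.group(1))
    return None


def timeout_seconds(threshold, interval=2):
    """Heartbeat timeout implied by a threshold (assumes a 2 s interval)."""
    return (threshold - 1) * interval


def check(sysconfig_text, live_text):
    """Compare the configured value with the live dead_threshold content."""
    wanted = configured_threshold(sysconfig_text)
    live = int(live_text.strip())
    return wanted, live, wanted == live
```

Feed check() the contents of /etc/sysconfig/o2cb and of the dead_threshold file; a False in the third field means the running cluster did not pick up the configured value.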

Regards,
Tao




Re: [Ocfs2-users] O2CB_HEARTBEAT_THRESHOLD won't take changes

2010-06-01 Thread Tao Ma
Elliott Perrin wrote:
 Hello Tao,

   
 Hi Elliott,
 Elliott Perrin wrote:

 Hello All,

 I have multiple OCFS2 clusters on SLES10 SP2 running Xen. We needed to
 increase the O2CB_HEARTBEAT_THRESHOLD from 31 up to 61 and did so
 successfully on 2 of our 3 clusters.

 However on one of the three clusters we are not able to change the
 value. The /etc/sysconfig/o2cb file contains 61 as the threshold after
 reconfiguring via /etc/init.d/o2cb configure, we reconfigure all 3
 nodes at the same time after unmounting the volume and stopping ocfs2
 and o2cb on all nodes in the cluster but on restart the value is reset
 to 31. We have even gone so far as to change the default value in the
 /etc/init.d/o2cb script from 31 to 61 thinking that must be where the
 setting of 31 is coming from and even then, the value is set back to
 31. Oh, and yes, we have rebooted with the new values in both the
 /etc/sysconfig/o2cb config and with our change of the default value in
 /etc/init.d/o2cb with no luck.

 I would deem that you said that
 /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold can't be set
 to 61, right?
 

 Yes that is correct. 

   
 You said even with restart, the heartbeat threshold is still 31? That
 would be too strange.
 

 I thought the same as well. 

   
 So could you please do the following steps:
 1. /etc/init.d/o2cb offline
 this will make the cluster offline. Please make sure you didn't see
 anything in /sys/kernel/config/cluster/.
 

 That folder did disappear during restart but I am not sure if the admin I was 
 working with did an offline or a stop. We will do an offline during scheduled 
 maintenance tomorrow. 
   
IIRC, stop only works if o2cb is enabled (you set this when you run 
/etc/init.d/o2cb configure, in the option "Load O2CB driver on boot 
(y/n)").
So if you don't have it enabled, stop does nothing.
   
 2.  grep HEARTBEAT /etc/sysconfig/o2cb and paste here so that we know
 the configuration file has been changed.
 

 elliott.per...@:~ grep HEART /etc/sysconfig/o2cb
 # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
 O2CB_HEARTBEAT_THRESHOLD=61
 # O2CB_HEARTBEAT_MODE: Whether to use the native kernel or the user
 O2CB_HEARTBEAT_MODE=kernel
   
yeah, so the configuration is right.
   
 3. strace /etc/init.d/o2cb online and attach the output here. let's see
 what's going on.
 4. cat /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold to see
 the new value.
 

 elliott.per...@xxx:~ cat 
 /sys/kernel/config/cluster/XXX/heartbeat/dead_threshold
 31
   
oh...
 Thank you Tao, we will be performing maintenance on this cluster in the next 
 couple days and will try these steps. 
   
np. Let us wait and see what's wrong.

Regards,
Tao



Re: [Ocfs2-users] OCFS2 performance - disk random access time problem

2010-06-02 Thread Tao Ma
Hi Proskurin,

On 06/02/2010 05:23 PM, Proskurin Kirill wrote:
 On 01/06/10 22:34, Sunil Mushran wrote:
 The kernel is old. We fixed this issue in 2.6.30. We have also backported
 it to the 1.4 production tree.

 The problem was that the inodes being created did not have locality
 leading to a directory having inodes that were spaced far apart from
 each other. The one place where it really affected performance was rm.

 Thank you for reply!
 Is it enough to update the kernel, or do the tools need to be updated too?
If you only want to use the old-format volume, updating the kernel is 
enough. But if you want to use some of the new features we added, you had 
better update ocfs2-tools as well.

Regards,
Tao



Re: [Ocfs2-users] OCFS2 performance - disk random access time problem

2010-06-02 Thread Tao Ma
Proskurin Kirill wrote:
 On 02/06/10 13:26, Tao Ma wrote:
   
 Thank you for reply!
 It is enough to update kernel or tools need to be updated too?
   
 If you only want to use the old formatted volume, updating kernel is
 enough. But if you want to use some new features we added, better update
 ocfs2-tools also.
 

 If I understand right, you bind the tools release to the kernel release.
 For a 2.6.32 kernel, which tools are preferred?

No, we don't actually bind them. We do have 1.2 vs. 1.4 for the enterprise 
kernel, but for the mainline kernel we don't have this limitation.
So using the latest ocfs2-tools source code is OK, or, if you want 
to use a prebuilt rpm, use 1.4.* since it is almost the latest.

Regards,
Tao




Re: [Ocfs2-users] OCFS2 performance - disk random access time problem

2010-06-02 Thread Tao Ma
Adding Mark Fasheh mfas...@suse.com and Coly Li coly...@suse.de to Cc 
since they know which ocfs2 kernel version SUSE uses.
Angelo McComis wrote:

  On 01/06/10 22:34, Sunil Mushran wrote:
  The kernel is old. We fixed this issue in 2.6.30. We have also backported
  it to the 1.4 production tree.

  The problem was that the inodes being created did not have locality
  leading to a directory having inodes that were spaced far apart from
  each other. The one place where it really affected performance was rm.

  Thank you for reply!

  
 I've seen this rm problem in the production version of SLES 10.3 +updates.
  
 Is this fix available in the SLES Enterprise kernels yet? 10.3.x or 
 11.1.x (11.1 officially releases today, btw).
  
 Thanks,
 Angelo
 



Re: [Ocfs2-users] 'No space left on device' error with plenty of space.

2010-06-09 Thread Tao Ma
Hi Jason,

On 06/09/2010 11:34 PM, Jason Price wrote:
 And now it's starting to fail again.
What is the situation now?
I checked your stat_sysdir output, and it looks like you have space for 
inode alloc, extent alloc and local alloc (but maybe the kernel hasn't flushed 
the metadata to disk yet, while stat_sysdir only reads the disk). So 
why do you hit ENOSPC? Can you describe it in more detail? Do you hit 
it when touching a new file, or when cat'ing some bytes to a file, or ...?
If you find the failing scenario, please enable the debugfs tracing options 
so that we can find out the real cause:
debugfs.ocfs2 -l INODE allow
debugfs.ocfs2 -l DISK_ALLOC allow
(run your test case here)
debugfs.ocfs2 -l INODE off
debugfs.ocfs2 -l DISK_ALLOC off
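
The allow / test / off sequence can be wrapped so that tracing is always switched off again, even if the test case fails. A minimal sketch follows; run_traced and trace_commands are hypothetical helper names, and the only commands it issues are the debugfs.ocfs2 -l invocations listed above.

```python
import subprocess

TRACE_BITS = ("INODE", "DISK_ALLOC")


def trace_commands(state, bits=TRACE_BITS):
    """Build one 'debugfs.ocfs2 -l <BIT> <state>' argv per trace bit."""
    return [["debugfs.ocfs2", "-l", bit, state] for bit in bits]


def run_traced(test_case, runner=subprocess.check_call):
    """Turn tracing on, run the test case, and always turn tracing off again."""
    for argv in trace_commands("allow"):
        runner(argv)
    try:
        test_case()
    finally:
        for argv in trace_commands("off"):
            runner(argv)
```

Pass the failing operation as test_case, for example a function that writes the 10 MB file; the trace messages then end up in the kernel log for that window only.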

Regards,
Tao


 --Jason

 On Wed, Jun 9, 2010 at 9:51 AM, Jason Price japr...@gmail.com wrote:

 I've got a busy FTP/Web cluster running OCFS2 v1.4.4.

 I've started getting No space on device errors when users attempt
 to write to the file system.  Disk utilization is about 76% with
 more than 100gb free.  Inode utilization is also at 76%.

 I thought this was a manifestation of bug # 1189, so I decreased the
 number of nodes via tunefs.ocfs2 from 8 (the default) down to 3
 (there are only 2 nodes in the cluster, with no growth anticipated).

 That got me out of the woods on Monday, but this morning the problem
 manifested again.

 I've opened bug # 1263 about this issue. (link:
 http://oss.oracle.com/bugzilla/show_bug.cgi?id=1263 )

 Does anyone have other ideas?

 I'm more than happy to supply other information.

 What seems to happen is that small writes are allowed, but bigger
 writes failed.  On Monday, I could write multiple 325kb files, and I
 could cat them together to make one file of ~2 mb, but when I tried
 to make a 10ish mb file, it failed.

 --Jason






Re: [Ocfs2-users] OCFS2 error.

2010-06-22 Thread Tao Ma
Hi Veeraa,

On 06/23/2010 10:46 AM, veeraa bose wrote:
 Hi Team,

 we are getting the error below on a shared disk in a VMware guest operating system.

 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:3:0: reservation conflict
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:3:0: SCSI error: return
 code = 0x0018
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: end_request: I/O error, dev sdg,
 sector 2367
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: (swapper,0,0):o2hb_bio_end_io:
 237 ERROR: IO Error -5
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel:
 (o2hb-DEEDA3062A,4504,0):o2hb_do_disk_heartbeat:768 ERROR: status = -5
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:3:0: reservation conflict
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:3:0: SCSI error: return
 code = 0x0018
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: end_request: I/O error, dev sdg,
 sector 2367
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel:
 (syslogd,4298,0):o2hb_bio_end_io:237 ERROR: IO Error -5
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel:
 (o2hb-DEEDA3062A,4504,0):o2hb_do_disk_heartbeat:768 ERROR: status = -5
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:3:0: reservation conflict
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:3:0: SCSI error: return
 code = 0x0018
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: end_request: I/O error, dev sdg,
 sector 8259921
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:2:0: reservation conflict
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:2:0: SCSI error: return
 code = 0x0018
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: end_request: I/O error, dev sdf,
 sector 8268113
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:2:0: reservation conflict
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:2:0: SCSI error: return
 code = 0x0018
 Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: end_request: I/O error, dev sdf,
 sector 8309329

 We are getting Input/output errors on files on the ocfs2 FS when we try to
 copy them. Please let us know what the problem could be.
Yes, you are getting an I/O error (-5 is EIO), so it isn't related to ocfs2; 
there must be some problem with the block device that VMware provides. 
Could you please check whether you can write to the device 
successfully with 'dd'?
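
To see which devices are affected before testing them, the kernel messages can be tallied per device. This is a small sketch, assuming log lines in the "end_request: I/O error, dev sdX" form shown above; io_errors_by_device is a made-up helper name.

```python
import re
from collections import Counter

IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+)")


def io_errors_by_device(log_text):
    """Count 'end_request: I/O error' kernel messages per block device."""
    return Counter(IO_ERROR.findall(log_text))
```

Running it over the relevant part of /var/log/messages gives one count per device, which shows whether the errors are confined to the shared disks.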

Regards,
Tao



Re: [Ocfs2-users] df showing wrong size

2010-06-28 Thread Tao Ma
Hi Garcia,

On 06/28/2010 02:17 PM, Garcia, Raymundo wrote:


 Hello… it was brought to my attention that a partition we have in one of
 our production systems was displaying the wrong size with the df command…. 123
 GB… but in fact the size of all the files is a mere 15 GB…. What is
 going on? Shall we use ocfs.fsck to fix that? It is strange…
Is 123GB the total used size?
df uses statfs(2) to get the size, so in general you get the used 
size for the whole volume (including files that have been deleted but not 
yet freed and still sit in the orphan dir).

15GB is what you get from du, I guess?
du walks the directory tree and adds up what it finds. So it only counts the 
size of every visible file, not orphan files.
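
The difference between the two views can be sketched as follows. This is an illustration only: used_bytes mirrors what df derives from statvfs, while visible_bytes is a simplified du that sums apparent file sizes and ignores block rounding and hard links; both function names are hypothetical.

```python
import os


def used_bytes(mount_point):
    """Used space as df sees it: statvfs totals for the whole volume."""
    st = os.statvfs(mount_point)
    return (st.f_blocks - st.f_bfree) * st.f_frsize


def visible_bytes(root):
    """Simplified du: sum the apparent size of every file reachable from root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.lstat(os.path.join(dirpath, name)).st_size
    return total
```

A large gap between used_bytes(mount) and visible_bytes(mount) points at space that is still allocated but no longer reachable by name, such as the orphan files mentioned above.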

Regards,
Tao


 Thanks for any comment….

 Raymundo Garcia







Re: [Ocfs2-users] Too much journaling or not ?

2010-07-30 Thread Tao Ma
Hi Somsak,

On 07/30/2010 12:54 AM, Somsak Sriprayoonsakul wrote:
 Hi,

 (I am in the same team with Mr. Wanchat)

 Just want to note that we already formatted OCFS2 with the -T mail option. As
 noted below, data=writeback and noatime are set, and the commit interval has
 been increased already.

 The weird thing about this problem is that, OCFS2 will work for like
 about 30 days, after that the I/O delay time will increase greatly, to
 the point that the system is unusable at all.


 This problem is quite serious for us. It disrupts the services every
 month. Right now the only cure we have found is formatting another SAN and
 copying everything there (we did it with rsync).

 Right now the old SAN which contains the file system with trouble is
 still left intact. We can play around with it since it is not in
 production anymore. We can do some debugging on this. Just let us know
 what we should do.
I would guess that it is because of file system fragmentation.
So do you have the stat_sysdir output?

Also, if you don't mind, can you give us an o2image of the disk? 
With it we can try to reproduce the problem locally. For more information 
about o2image, see its man page.
Also, please tell us how to reproduce the case in your environment.

Regards,
Tao



 Somsak

 soms...@gmail.com



 On Wed, Jul 28, 2010 at 12:33 PM, wanchat padungrat
 wancha...@pantip.com wrote:

 Dear Sunil,
 Thank you for your prompt reply.
 Yes, we do mount with data=writeback.
 Our current mount options are: type ocfs2
 (rw,_netdev,noatime,commit=15,data=writeback,heartbeat=local)

 Wanchat P.
 2010/7/28 Sunil Mushran sunil.mush...@oracle.com

 Have you tried mounting with data=writeback ?


 On Jul 27, 2010, at 9:31 PM, wanchat padungrat
 wancha...@pantip.com wrote:

 Dear all,

 Not really sure whether this is a bug or not, but we found that
 sometimes OCFS2 on our system does a lot of journaling.

 (Please see the screen shot below)

 As you can see, the I/O jumped from 111 w/s to 11,960 w/s,
 I/O utilization jumped from 1.5% to 97%, and %iowait jumped from 0.25%
 to 10.94%.

 It would not be a concern if this happened for a short time
 (e.g. 1-2 sec.), but sometimes we found that it persists
 for half an hour or even more.

 Is this normal for OCFS2 or not? How could I tune OCFS2 to
 avoid the above situation?

 We also suspect that this symptom is related to another
 problem similar to
 http://oss.oracle.com/pipermail/ocfs2-users/2009-January/003250.html
 , because we found that the OCFS2 file system pauses for long periods
 of time too (5-20 sec. on open for write). Those pauses are scattered
 throughout the day and are more frequent during high-load times.
 (Screen shot below)

 Our configuration is 3 web servers connected to HP SAN storage
 using OCFS2 1.4.7. Among those 3 servers, 2 are for
 reading data and 1 is for writing data (traffic is separated at the proxy
 layer). The symptom above happens on the data-writing server.

 Thank you.

 Wanchat Padungrat

 PS. We keep our service running by reinstalling the whole data set every
 month. Without doing the whole data copy (which takes 12 hours), the
 symptoms get worse and worse, e.g. journaling persists longer and
 file system pauses get longer (sometimes more than a minute!),
 which affects the whole system: too many waiting
 apache processes, load average skyrocketing (sometimes 400+) and
 unable to recover by itself, etc.

  

 Screen shot

 Here is iostat+top while the journaling process is running.

 avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
            1.88   0.00     5.38    10.94    0.00  81.81

 Device:  rrqm/s   wrqm/s   r/s      w/s  rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
 sda        0.00     0.00  0.00     0.00   0.00      0.00      0.00      0.00   0.00   0.00   0.00
 sda1       0.00     0.00  0.00     0.00   0.00      0.00      0.00      0.00   0.00   0.00   0.00
 sda2       0.00     0.00  0.00     0.00   0.00      0.00      0.00      0.00   0.00   0.00   0.00
 sdb        0.00     0.00  0.00     0.00   0.00      0.00      0.00      0.00   0.00   0.00   0.00
 sdb1       0.00     0.00  0.00     0.00   0.00      0.00      0.00      0.00   0.00   0.00   0.00
 sdc        0.00  5156.00  5.00  6804.50  19.00  47892.25     14.07      9.76   1.43   0.14  97.55
 sdc1       0.00  5156.00  5.00  6804.50  19.00  47892.25     14.07      9.76   1.43   0.14  97.60
 sdd        0.00     0.00  0.00     0.00   0.00      0.00      0.00      0.00   0.00   0.00   0.00
 sdd1       0.00     0.00  0.00     0.00   0.00      0.00      0.00      0.00   0.00   0.00   0.00
 sde        0.00

Re: [Ocfs2-users] No space left on device

2010-09-08 Thread Tao Ma
Hi all

On 09/08/2010 04:11 PM, Alexander Barton wrote:
 Hi Sunil!

 Are there special steps one has to follow to recover such a filesystem that 
 has been used with a buggy kernel?

 We had this problem with a Debian 2.6.27 kernel and updated to a recent 
 „mainline“ kernel 2.6.33.x – but are still seeing the same problem: „no space 
 left“ is reported when trying to write files to the filesystem (a few 
 megabytes are often „too much“) although df(1) shows a lot (gigabytes) of 
 free space.

   - Kernel 2.6.33-2-amd64 (2.6.33-1~experimental.4)
   - OCFS2 Tools 1.4.3-1
Discontiguous block group support was added in the last development cycle, 
which means 2.6.35, IIRC.

As for the tools part, I have sent the patches to the mailing list and 
there is still one patch waiting for review.

 Any hints?
 Do you need more information?
Yes, we need your stat_sysdir output to make sure that you are hitting the 
discontig block group problem.

Regards,
Tao

 Regards
 Alex

 On 08.09.2010 at 04:12, Sunil Mushran wrote:

 Which kernel are you using?

 We have fixed this issue in mainline. We will soon have the same
 fix for production kernels.

 On 09/07/2010 02:06 PM, Todd Freeman wrote:

 From reading the archives I can see this issue has been hit before but
 I haven't found a resolution.

 I have a 50gb partition...  I have formatted it at 10gb. I have it set
 for 4 cluster members and am using 3 of those slots.

 I fill the partition to 66% and voila... no space left on device.  I
 have tried it with big files and lots of small files and both ways I hit
 this error at 66% usage.

 I am using ubuntu-server with ocfs2-tools 1.4.2-1


 If anyone has ideas/solutions I would be most grateful...  this FS is
 awesome :P




Re: [Ocfs2-users] No space left on device

2010-09-08 Thread Tao Ma
Hi,
Alexander Barton wrote:
 Hi Tao!

 On 08.09.2010 at 10:53, Tao Ma wrote:

   
 Hi all

 On 09/08/2010 04:11 PM, Alexander Barton wrote:

 
 Hi Sunil!

 Are there special steps one has to follow to recover such a filesystem that 
 has been used with a buggy kernel?

 We had this problem with a Debian 2.6.27 kernel and updated to a recent 
 „mainline“ kernel 2.6.33.x – but are still seeing the same problem: „no 
 space left“ is reported when trying to write files to the filesystem (a few 
 megabytes are often „too much“) although df(1) shows a lot (gigabytes) of 
 free space.

  - Kernel 2.6.33-2-amd64 (2.6.33-1~experimental.4)
  - OCFS2 Tools 1.4.3-1
   
 Discontig block group is added in the last development cycle which mean 
 2.6.35 IIRC.

 As for the tools part, I have added the patches to the mail list and there 
 is still one patch waiting for review.

 
 Any hints?
 Do you need more information?
   
 yes, we have to make sure you meet with discontig block group problem if we 
 have your stat_sysdir output.
 

 It is about 275 KBytes, so I uploaded it here:
 http://gist.github.com/569970
   
Does it still happen now? Your volume seems to have a similar problem: a 
large number of groups have only small
contiguous free spaces that can't be used to allocate an inode.
So please wait for the ocfs2-tools patch to be included and for
the mainline 2.6.35 kernel.

Regards,
Tao



Re: [Ocfs2-users] No space left on device

2010-09-08 Thread Tao Ma
Alexander Barton wrote:
 Hi Tao!

 On 08.09.2010 at 16:22, Tao Ma wrote:

   
 Hi,

 Alexander Barton wrote:

 
 Hi Tao!

 On 08.09.2010 at 10:53, Tao Ma wrote:

   
 Hi all

 On 09/08/2010 04:11 PM, Alexander Barton wrote:

 
 Hi Sunil!

 Are there special steps one has to follow to recover such a filesystem 
 that has been used with a buggy kernel?

 We had this problem with a Debian 2.6.27 kernel and updated to a recent 
 „mainline“ kernel 2.6.33.x – but are still seeing the same problem: „no 
 space left“ is reported when trying to write files to the filesystem (a 
 few megabytes are often „too much“) although df(1) shows a lot 
 (gigabytes) of free space.

 - Kernel 2.6.33-2-amd64 (2.6.33-1~experimental.4)
 - OCFS2 Tools 1.4.3-1
   
 Discontig block group is added in the last development cycle which mean 
 2.6.35 IIRC.

 As for the tools part, I have added the patches to the mail list and there 
 is still one patch waiting for review.

 
 Any hints?
 Do you need more information?
   
 yes, we have to make sure you meet with discontig block group problem if 
 we have your stat_sysdir output.
 
 It is about 275 KBytes, so I uploaded it here:
 http://gist.github.com/569970

   
  
 Does it happen now?
 

 No. We deleted „lots“ of small cache files and are currently able to write to 
 the filesystem.
   
Yeah, that is the reason I said "your volume seems to..." in my 
previous e-mail. ;)
If deleting some small files helped, it looks like that is indeed the problem.

Btw, I may commit the ocfs2-tools patches soon, and you can then try them 
with 2.6.35.

Regards,
Tao
   
 Your volume seems to have the similar problem, a large number of groups has 
 small
 contiguous spaces that can't be used to allocate an inode.
 So do please wait for the patch for the ocfs2-tools to be included and the 
 mainline kernel of 2.6.35.
 

 Ok, thanks a lot for your answer!

 Regards
 Alex


   




Re: [Ocfs2-users] No space left on device

2010-09-21 Thread Tao Ma


On 09/21/2010 04:52 PM, Alexander Barton wrote:
 Hi Tao!

 On 09.09.2010 at 02:29, Tao Ma wrote:

 btw, I may commit the ocfs2-tools patches recently, and you can try it with 
 2.6.35.

 Ok, now we are seeing the problem again and want to try a new kernel and the 
 new OCFS2 tools — there aren’t that much files left we could delete instead …

 So kernel 2.6.35.4 would be ok?
It should work.
 And OCFS2 tools from the GIT master branch? Or a special tag? There is no 
 archive or release, right?
I have already committed the patches to ocfs2-tools.
So you can get them with
git clone git://oss.oracle.com/git/ocfs2-tools.git
I guess there is no release containing them yet.

 So are there any special steps we have to follow? Do some magic fsck.ocfs2 
 runs? Splashing chicken blood on moonlight? …? Or is simply installing the 
 new kernel and the new tools enough and „everything“ should be working fine 
 again afterwards?
No need for fsck.ocfs2.
Just run tunefs.ocfs2 --fs-features=discontig-bg /dev/sdx (which means you 
have to umount the volume from all the nodes first), then mount it again 
and it should work. If there is any problem, please let us know. Thanks.

Make sure you see 'discontig-bg' in the output of the following command:
echo 'stats'|debugfs.ocfs2 /dev/sdx|grep Incompat
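
If you want to script that last check, a rough sketch is below. It assumes the stats output contains a "Feature Incompat:" line with a numeric mask followed by feature names; that layout and the helper names are assumptions, so compare against what your debugfs.ocfs2 actually prints.

```python
def incompat_features(stats_text):
    """Return the feature names listed on the 'Feature Incompat' line."""
    for line in stats_text.splitlines():
        if "Feature Incompat" in line:
            # Everything after the colon: a numeric mask, then feature names.
            fields = line.partition(":")[2].split()
            return [f for f in fields if not f.isdigit()]
    return []


def has_discontig_bg(stats_text):
    """True if discontig-bg shows up among the incompat features."""
    return "discontig-bg" in incompat_features(stats_text)
```

Feed it the captured output of the debugfs.ocfs2 stats command; has_discontig_bg should return True only after the tunefs.ocfs2 step has taken effect.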


 Thanks for your support!
no problem.

Regards,
Tao



Re: [Ocfs2-users] No space left on device

2010-09-29 Thread Tao Ma


On 09/29/2010 05:13 PM, Alexander Barton wrote:
 Hello again!

 On 21.09.2010 at 11:04, Tao Ma wrote:

 On 09/21/2010 04:52 PM, Alexander Barton wrote:

 So kernel 2.6.35.4 would be ok?

 It should work.

 And OCFS2 tools from the GIT master branch? Or a special tag? There is no 
 archive or release, right?

 I have already committed the patches to ocfs2-tools.
 So you can get from
 git clone git://oss.oracle.com/git/ocfs2-tools.git

 We upgraded both of our cluster nodes last friday to

- Debian Linux Kernel „2.6.35-trunk-amd64“
  (linux-image-2.6.35-trunk-amd64_2.6.35-1~experimental.3_amd64.deb)
  which is 2.6.35.4 plus Debian patches

- OCFS2 tools 1.6.3 from GIT

 Since then, our cluster is VERY unstable, we get lots of „general protection 
 faults“ and hard lockups. „Lots“ as in „often more than 2 times a day“.
Sorry for the trouble.

 Our scenario is OCFS2 on top of DRBD. It looks like the „crash pattern“ is 
 the following:

 On Node 2:

 cl1-n2 kernel: [ 4006.829327] general protection fault:  [#21] SMP
 cl1-n2 kernel: [ 4006.829487] last sysfs file: 
 /sys/devices/platform/coretemp.7/temp1_label
 cl1-n2 kernel: [ 4006.829558] CPU 1
 cl1-n2 kernel: [ 4006.829611] Modules linked in: ocfs2 jbd2 quota_tree tun 
 xt_tcpudp iptable_filter hmac sha1_generic ocfs2_dlmfs ocfs2_stack_o2cb 
 ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue iptable_nat nf_nat configfs 
 nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables nfsd 
 exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext2 coretemp drbd 
 lru_cache cn loop hed tpm_tis snd_pcm snd_timer snd soundcore psmouse 
 snd_page_alloc processor tpm pcspkr evdev joydev tpm_bios dcdbas serio_raw 
 i5k_amb button rng_core shpchp pci_hotplug i5000_edac edac_core ext3 jbd 
 mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod usbhid hid sg 
 sr_mod cdrom ata_generic sd_mod ses crc_t10dif enclosure ata_piix ehci_hcd 
 uhci_hcd usbcore bnx2 libata nls_base megaraid_sas scsi_mod e1000e thermal 
 fan thermal_sys [last unloaded: scsi_wait_scan]
 cl1-n2 kernel: [ 4006.833215]
 cl1-n2 kernel: [ 4006.833215] Pid: 7699, comm: apache2 Tainted: G  D 
 2.6.35-trunk-amd64 #1 0H603H/PowerEdge 2950
 cl1-n2 kernel: [ 4006.833215] RIP: 0010:[810e1886]  
 [810e1886] __kmalloc+0xd3/0x136
 cl1-n2 kernel: [ 4006.833215] RSP: 0018:88012e277cd8  EFLAGS: 00010006
 cl1-n2 kernel: [ 4006.833215] RAX:  RBX: 6f635f6465727265 
 RCX: a0686032
 cl1-n2 kernel: [ 4006.833215] RDX:  RSI: 88012e277da8 
 RDI: 0004
 cl1-n2 kernel: [ 4006.833215] RBP: 81625520 R08: 880001a52510 
 R09: 0003
 cl1-n2 kernel: [ 4006.833215] R10: 88009a561b40 R11: 88022d62f400 
 R12: 000b
 cl1-n2 kernel: [ 4006.833215] R13: 8050 R14: 8050 
 R15: 0246
 cl1-n2 kernel: [ 4006.833215] FS:  7f9199715740() 
 GS:880001a4() knlGS:
 cl1-n2 kernel: [ 4006.833215] CS:  0010 DS:  ES:  CR0: 
 8005003b
 cl1-n2 kernel: [ 4006.833215] CR2: 402de9d0 CR3: 0001372a1000 
 CR4: 000406e0
 cl1-n2 kernel: [ 4006.833215] DR0:  DR1:  
 DR2: 
 cl1-n2 kernel: [ 4006.833215] DR3:  DR6: 0ff0 
 DR7: 0400
 cl1-n2 kernel: [ 4006.833215] Process apache2 (pid: 7699, threadinfo 
 88012e276000, task 88009a561b40)
 cl1-n2 kernel: [ 4006.833215] Stack:
 cl1-n2 kernel: [ 4006.833215]  8801b1af9c20 a0686032 
 8801b1a2da20 88018f5f30c0
 cl1-n2 kernel: [ 4006.833215]0  88012e277e88 000a 
 88018d105300 88009a561b40
 cl1-n2 kernel: [ 4006.833215]0  88012e277da8 a0686032 
 88012e277e88 88012e277da8
 cl1-n2 kernel: [ 4006.833215] Call Trace:
 cl1-n2 kernel: [ 4006.833215]  [a0686032] ? 
 ocfs2_fast_follow_link+0x166/0x284 [ocfs2]
 cl1-n2 kernel: [ 4006.833215]  [810f29fa] ? 
 do_follow_link+0xdb/0x24c
 cl1-n2 kernel: [ 4006.833215]  [810f2d55] ? 
 link_path_walk+0x1ea/0x482
 cl1-n2 kernel: [ 4006.833215]  [810f311f] ? path_walk+0x63/0xd6
 cl1-n2 kernel: [ 4006.833215]  [810f27ba] ? path_init+0x46/0x1ab
 cl1-n2 kernel: [ 4006.833215]  [810f3288] ? do_path_lookup+0x20/0x85
 cl1-n2 kernel: [ 4006.833215]  [810f3cd9] ? user_path_at+0x46/0x78
 cl1-n2 kernel: [ 4006.833215]  [81038bac] ? 
 pick_next_task_fair+0xe6/0xf6
 cl1-n2 kernel: [ 4006.833215]  [81305101] ? schedule+0x4d4/0x530
 cl1-n2 kernel: [ 4006.833215]  [81060526] ? prepare_creds+0x87/0x9c
 cl1-n2 kernel: [ 4006.833215]  [810e8649] ? sys_faccessat+0x96/0x15b
 cl1-n2 kernel: [ 4006.833215]  [810089c2] ? 
 system_call_fastpath+0x16/0x1b
 cl1-n2 kernel: [ 4006.833215] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 
 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48

Re: [Ocfs2-users] Journal replay after crash, kernel BUG at fs/ocfs2/journal.c:1700!, 2.6.36

2010-10-29 Thread Tao Ma
Hi Ronald,

On 10/29/2010 05:12 PM, Ronald Moesbergen wrote:
 Hello,

 I was testing kernel 2.6.36 (vanilla mainline) and encountered the
 following BUG():

 [157756.266000] o2net: no longer connected to node app01 (num 0) at
 10.2.25.13:
 [157756.266077] (o2hb-5FA56B1D0A,2908,0):o2dlm_eviction_cb:267 o2dlm
 has evicted node 0 from group 5FA56B1D0A9249099CE58C82CFEC873A
 [157756.274443] (ocfs2rec,14060,0):dlm_get_lock_resource:836
 5FA56B1D0A9249099CE58C82CFEC873A:M186ba2b09b: at
 least one node (0) to recover before lock mastery can begin
 [157757.275776] (ocfs2rec,14060,0):dlm_get_lock_resource:890
 5FA56B1D0A9249099CE58C82CFEC873A:M186ba2b09b: at
 least one node (0) to recover before lock mastery can begin
 [157760.774045] (dlm_reco_thread,2920,2):dlm_get_lock_resource:836
 5FA56B1D0A9249099CE58C82CFEC873A:$RECOVERY: at least one node (0) to
 recover before lock mastery can begin
 [157760.774124] (dlm_reco_thread,2920,2):dlm_get_lock_resource:870
 5FA56B1D0A9249099CE58C82CFEC873A: recovery map is not empty, but must
 master $RECOVERY lock now
 [157760.774205] (dlm_reco_thread,2920,2):dlm_do_recovery:523 (2920)
 Node 1 is the Recovery Master for the Dead Node 0 for Domain
 5FA56B1D0A9249099CE58C82CFEC873A
 [157768.261818] (ocfs2rec,14060,0):ocfs2_replay_journal:1605
 Recovering node 0 from slot 0 on device (8,32)
 [157772.850182] [ cut here ]
 [157772.850211] kernel BUG at fs/ocfs2/journal.c:1700!
Strange. The bug line is
BUG_ON(osb->node_num == node_num);
so it detected the same node number twice in the cluster: the node 
being recovered has the same node number as the local node.

So could you please grab the mount info from the system log of both 
nodes? The message looks like:

Oct 27 16:24:21 ocfs2-test2 kernel: ocfs2: Mounting device (8,8) on 
(node 2, slot 0) with ordered data mode.

It tells us which node number and slot the volume is using.
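
If the logs are long, a small script can pull those lines out. This is only a minimal sketch: the regular expression is written against the sample message above and nothing else, and the function name find_mounts is made up for illustration. Other kernel versions may word the message differently.

```python
import re

# Pattern written against the sample line quoted above (assumption:
# the wording is the same in the kernel you are running).
MOUNT_RE = re.compile(
    r"ocfs2: Mounting device \((\d+),(\d+)\) on \(node (\d+), slot (\d+)\)"
)

def find_mounts(log_text):
    """Return one dict (device, node, slot) per ocfs2 mount message found."""
    # Mail clients and syslog viewers may wrap long lines, so collapse
    # all whitespace runs into single spaces before matching.
    flat = " ".join(log_text.split())
    return [
        {"device": (int(major), int(minor)), "node": int(node), "slot": int(slot)}
        for major, minor, node, slot in MOUNT_RE.findall(flat)
    ]

sample = (
    "Oct 27 16:24:21 ocfs2-test2 kernel: ocfs2: Mounting device (8,8) on\n"
    "(node 2, slot 0) with ordered data mode.\n"
)
print(find_mounts(sample))
```

Running it over the saved log of each node and comparing the node numbers should show quickly whether two nodes claim the same number.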

Regards,
Tao
 [157772.850238] invalid opcode:  [#1] SMP
 [157772.850270] last sysfs file:
 /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
 [157772.850314] CPU 0
 [157772.850320] Modules linked in: ip_vs_wrr ip_vs nf_conntrack ocfs2
 jbd2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
 ocfs2_nodemanager ocfs2_stackglue configfs sd_mod crc32c ib_iser
 rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp
 libiscsi_tcp libiscsi scsi_transport_iscsi bonding ipv6 ipmi_devintf
 cpufreq_ondemand acpi_cpufreq freq_table mperf loop ipmi_si
 ipmi_msghandler hpilo hpwdt container snd_pcm serio_raw psmouse
 snd_timer snd soundcore tpm_tis tpm tpm_bios pcspkr iTCO_wdt
 snd_page_alloc button processor evdev ext3 jbd mbcache dm_mirror
 dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod cdrom usbhid hid
 ata_piix ata_generic cciss libata scsi_mod ide_pci_generic ide_core
 ehci_hcd bnx2 e1000e uhci_hcd thermal fan thermal_sys
 [157772.850758]
 [157772.850779] Pid: 14060, comm: ocfs2rec Not tainted 2.6.36 #2
 /ProLiant DL360 G6
 [157772.850823] RIP: 0010:[a03da8c3]  [a03da8c3]
 __ocfs2_recovery_thread+0x474/0x137f [ocfs2]
 [157772.850916] RSP: 0018:880084f49e00  EFLAGS: 00010246
 [157772.850943] RAX: 0001 RBX: 88011dd07108 RCX:
 88011d3fe344
 [157772.850986] RDX: 88011d3fe340 RSI: 0001 RDI:
 88011dd07108
 [157772.851029] RBP: 880118479000 R08:  R09:
 
 [157772.851073] R10:  R11: 0400 R12:
 88011faff800
 [157772.851116] R13: 0001 R14: 88011dd07000 R15:
 
 [157772.851159] FS:  () GS:88000160()
 knlGS:
 [157772.851205] CS:  0010 DS:  ES:  CR0: 8005003b
 [157772.851232] CR2: 01e88b58 CR3: 00011dd26000 CR4:
 06f0
 [157772.851275] DR0:  DR1:  DR2:
 
 [157772.851318] DR3:  DR6: 0ff0 DR7:
 0400
 [157772.851362] Process ocfs2rec (pid: 14060, threadinfo
 880084f48000, task 88009bd9e9c0)
 [157772.851407] Stack:
 [157772.851427]  8800  88010008
 0020
 [157772.851462]0  88009bd9ece8 88009bd9e9c0 88009bd9ece8
 88009bd9e9c0
 [157772.851515]0  88009bd9ece8 88009bd9e9c0 88009bd9ece8
 88009bd9e9c0
 [157772.851584] Call Trace:
 [157772.851611]  [a03da44f] ?
 __ocfs2_recovery_thread+0x0/0x137f [ocfs2]
 [157772.851657]  [81044aed] ? kthread+0x7e/0x86
 [157772.851684]  [81002b94] ? kernel_thread_helper+0x4/0x10
 [157772.851713]  [81044a6f] ? kthread+0x0/0x86
 [157772.851739]  [81002b90] ? kernel_thread_helper+0x0/0x10
 [157772.851766] Code: 89 1c 24 41 b9 a0 06 00 00 49 c7 c0 50 01 42 a0
 48 c7 c7 a9 9f 42 a0 31 c0 e8 1d 0c e7 e0 8b 74 24 74 41 39 b6 38 01
 00 00 75 040f  0b eb fe 48 c7 84 24 a0 00 00 00 00 00 00 00 48 c7 84
 24 98
 [157772.851973] RIP  [a03da8c3]
 __ocfs2_recovery_thread+0x474/0x137f [ocfs2]
 [157772.852024]  

Re: [Ocfs2-users] Journal replay after crash, kernel BUG at fs/ocfs2/journal.c:1700!, 2.6.36

2010-10-29 Thread Tao Ma
Ronald Moesbergen wrote:
 2010/10/29 Ronald Moesbergen intercom...@gmail.com:
   
 2010/10/29 Tao Ma tao...@oracle.com:
 
 Hi Ronald,
   
 Hi Tao,

 Thanks for looking into this.

 
 On 10/29/2010 05:12 PM, Ronald Moesbergen wrote:
   
 Hello,

 I was testing kernel 2.6.36 (vanilla mainline) and encountered the
 following BUG():

 [157756.266000] o2net: no longer connected to node app01 (num 0) at
 10.2.25.13:
 [157756.266077] (o2hb-5FA56B1D0A,2908,0):o2dlm_eviction_cb:267 o2dlm
 has evicted node 0 from group 5FA56B1D0A9249099CE58C82CFEC873A
 [157756.274443] (ocfs2rec,14060,0):dlm_get_lock_resource:836
 5FA56B1D0A9249099CE58C82CFEC873A:M186ba2b09b: at
 least one node (0) to recover before lock mastery can begin
 [157757.275776] (ocfs2rec,14060,0):dlm_get_lock_resource:890
 5FA56B1D0A9249099CE58C82CFEC873A:M186ba2b09b: at
 least one node (0) to recover before lock mastery can begin
 [157760.774045] (dlm_reco_thread,2920,2):dlm_get_lock_resource:836
 5FA56B1D0A9249099CE58C82CFEC873A:$RECOVERY: at least one node (0) to
 recover before lock mastery can begin
 [157760.774124] (dlm_reco_thread,2920,2):dlm_get_lock_resource:870
 5FA56B1D0A9249099CE58C82CFEC873A: recovery map is not empty, but must
 master $RECOVERY lock now
 [157760.774205] (dlm_reco_thread,2920,2):dlm_do_recovery:523 (2920)
 Node 1 is the Recovery Master for the Dead Node 0 for Domain
 5FA56B1D0A9249099CE58C82CFEC873A
 [157768.261818] (ocfs2rec,14060,0):ocfs2_replay_journal:1605
 Recovering node 0 from slot 0 on device (8,32)
 [157772.850182] [ cut here ]
 [157772.850211] kernel BUG at fs/ocfs2/journal.c:1700!
 
 Strange. The bug line is
 BUG_ON(osb->node_num == node_num);
 and it detects the same node number in the cluster.
   

 I just tried to reproduce it and succeeded. Here's what I did:
 - unmount the filesystem on node app02
 - shutdown the o2cb services on app02
 - Do a halt -f on app01, which still has the OCFS2 volume mounted.
 - Start o2cb services on app02
 - Mount the OCFS2 filesystem - BUG
   
Thanks for the test. I will look into it.

Regards,
Tao

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Journal replay after crash, kernel BUG at fs/ocfs2/journal.c:1700!, 2.6.36

2010-11-01 Thread Tao Ma
Hi Ronald,

On 10/29/2010 06:03 PM, Ronald Moesbergen wrote:
 2010/10/29 Ronald Moesbergen intercom...@gmail.com:
 2010/10/29 Tao Ma tao...@oracle.com:
 Hi Ronald,

 Hi Tao,

 Thanks for looking into this.

 On 10/29/2010 05:12 PM, Ronald Moesbergen wrote:

 Hello,

 I was testing kernel 2.6.36 (vanilla mainline) and encountered the
 following BUG():

 [157756.266000] o2net: no longer connected to node app01 (num 0) at
 10.2.25.13:
 [157756.266077] (o2hb-5FA56B1D0A,2908,0):o2dlm_eviction_cb:267 o2dlm
 has evicted node 0 from group 5FA56B1D0A9249099CE58C82CFEC873A
 [157756.274443] (ocfs2rec,14060,0):dlm_get_lock_resource:836
 5FA56B1D0A9249099CE58C82CFEC873A:M186ba2b09b: at
 least one node (0) to recover before lock mastery can begin
 [157757.275776] (ocfs2rec,14060,0):dlm_get_lock_resource:890
 5FA56B1D0A9249099CE58C82CFEC873A:M186ba2b09b: at
 least one node (0) to recover before lock mastery can begin
 [157760.774045] (dlm_reco_thread,2920,2):dlm_get_lock_resource:836
 5FA56B1D0A9249099CE58C82CFEC873A:$RECOVERY: at least one node (0) to
 recover before lock mastery can begin
 [157760.774124] (dlm_reco_thread,2920,2):dlm_get_lock_resource:870
 5FA56B1D0A9249099CE58C82CFEC873A: recovery map is not empty, but must
 master $RECOVERY lock now
 [157760.774205] (dlm_reco_thread,2920,2):dlm_do_recovery:523 (2920)
 Node 1 is the Recovery Master for the Dead Node 0 for Domain
 5FA56B1D0A9249099CE58C82CFEC873A
 [157768.261818] (ocfs2rec,14060,0):ocfs2_replay_journal:1605
 Recovering node 0 from slot 0 on device (8,32)
 [157772.850182] [ cut here ]
 [157772.850211] kernel BUG at fs/ocfs2/journal.c:1700!

 Strange. The bug line is
 BUG_ON(osb->node_num == node_num);
 and it detects the same node number in the cluster.

 I just tried to reproduce it and succeeded. Here's what I did:
 - unmount the filesystem on node app02
 - shutdown the o2cb services on app02
 - Do a halt -f on app01, which still has the OCFS2 volume mounted.
 - Start o2cb services on app02
 - Mount the OCFS2 filesystem -  BUG

 Works every time. So one of the 2 variables checked in that BUG_ON
 statement must not be set correctly somewhere.
I have tried several times in my local test environment, but with no 
luck so far. From reading the code it also looks quite strange to me.

So could you please file a bug at oss.oracle.com/bugzilla so that it is 
easier to track and discuss? Many thanks.

Regards,
Tao



Re: [Ocfs2-users] Pb with ocfs2 dlm on Fedora 13

2010-11-08 Thread Tao Ma
Hi Alain,

On 11/08/2010 11:08 PM, Alain.Moulle wrote:
   Hi,

 I have a problem on Fedora13 with releases :
 ocfs2  1.4.3-5.fc13.x86_64
 dlm_tool 3.0.17

 With a 3-node ocfs2 cluster, I can't mount the FS on all three nodes at
 the same time, only on two of them, whichever two nodes they are.

 The errors are :
 (1475,0):o2net_connect_expired:1656 ERROR: no connection established
 with node 2 after 30.0 seconds, giving up and returning errors.
 (2175,0):dlm_request_join:1035 ERROR: status = -107
 (2175,0):dlm_try_to_join_domain:1209 ERROR: status = -107
 (2175,0):dlm_join_domain:1487 ERROR: status = -107
 (2175,0):dlm_register_domain:1753 ERROR: status = -107
 (2175,0):o2cb_cluster_connect:313 ERROR: status = -107
 (2175,0):ocfs2_dlm_init:2995 ERROR: status = -107
 (2175,0):ocfs2_mount_volume:1789 ERROR: status = -107
 ocfs2: Unmounting device (8,16) on (node 0)
 o2net: no longer connected to node selfxl-4 (num 0) at
 10.197.189.204:
 o2net: connected to node selfxl-4 (num 0) at 10.197.189.204:

 It seems to be a lock management problem
 Is it an already known issue ?
 Is there an available patch ?
It doesn't look like a dlm problem, but a network problem. ;)
Your first error is o2net_connect_expired, so it seems that the 3rd 
node can't connect to node 2.
Could you please check the error messages on node 2?

BTW, I assume that cluster.conf is the same on all 3 nodes, and that 
from node 3 you can connect to the address of node 2 (the one used by 
ocfs2).
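
For the connectivity check, note that status = -107 is ENOTCONN on Linux, and that a successful ping does not prove the TCP port is reachable. A minimal sketch of a TCP-level check follows, assuming you take the IP address and port of node 2 from your own cluster.conf. The function name can_connect and the script name in the comment are made up for illustration.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example call, with placeholders you must fill in from cluster.conf:
#   can_connect("<ip_address of node 2>", <ip_port of node 2>)
```

Run it on node 3 against node 2, and also in the other direction. If it returns False while ping works, a firewall rule on the o2net port would be my first guess.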

Regards,
Tao



Re: [Ocfs2-users] Pb with ocfs2 dlm on Fedora 13

2010-11-09 Thread Tao Ma
Hi Alain,

On 11/09/2010 04:49 PM, Alain.Moulle wrote:
   Hi,
 The three cluster.conf are exactly the same on the 3 nodes.
 The errors messages are :

 -nodes1:
   o2net: accepted connection from node selfxl-5 (num 1) at
 10.197.189.218:
 o2net: no longer connected to node selfxl-5 (num 1) at
 10.197.189.218:

 -nodes2:
   (1457,1):o2net_connect_expired:1656 ERROR: no connection established
 with node 1 after 30.0 seconds, giving up and returning errors.

 Note that once a mount is refused for example on node3, if
 I umount the FS on node1 for example, then I can mount it
 on node3.
Oh, so do you have enough slots for all 3 nodes to mount?

What's the output of the command below?
echo 'stats'|debugfs.ocfs2 /dev/sdx|grep Slots
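
If you want to check this on several volumes, the comparison can be scripted. The sketch below assumes the stats output contains a line of the form "Max Node Slots: N" (that is what I expect the grep to match, but please verify against your own output). The function names and the sample value are made up for illustration.

```python
import re

def max_node_slots(stats_output):
    """Return N from a 'Max Node Slots: N' line, or None if no such line."""
    match = re.search(r"Max Node Slots:\s*(\d+)", stats_output)
    return int(match.group(1)) if match else None

def enough_slots(stats_output, nodes_wanted):
    """True if the volume has at least as many slots as nodes that will mount."""
    slots = max_node_slots(stats_output)
    return slots is not None and slots >= nodes_wanted

# Made-up sample: a volume with 2 slots cannot be mounted by 3 nodes at once.
sample = "\tMax Node Slots: 2\n"
print(max_node_slots(sample), enough_slots(sample, 3))
```

If the slot count turns out to be lower than the number of nodes, tunefs.ocfs2 -N should be able to raise it.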

Regards,
Tao
 Note also that when the mount is refused, for example on node3,
 I've checked that node3 successfully pings both other
 nodes on the IP addresses given in cluster.conf.

 Alain



