Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
Hi Mike,

Are you sure it is caused by the update of ocfs2-tools? AFAIK, ocfs2-tools only includes user-space tools such as mkfs, fsck, and tunefs, so unless you changed the disk with the new tools, they should not be able to cause a kernel panic. That leaves only one likely candidate: have you modified /etc/sysconfig/o2cb? (That is the location on RHEL; I am not sure where Ubuntu keeps it.) I checked the rpm package for RHEL, and it updates /etc/sysconfig/o2cb, which defines several timeouts. Do you have a backup of this file? If so, please restore it and see whether that helps (I can't say for sure that it will). If not, do you remember the old timeout values you set for ocfs2? If so, you can use "o2cb configure" to set them yourself.

Good luck.

Regards,
Tao

mike wrote:
> Hi, I'm running into a big issue. I believe it is OCFS2: I can get my machines to kernel panic consistently. Before, I was running Ubuntu Gutsy (7.10) with ocfs2-tools 1.2.4. Now I am running Ubuntu Hardy (8.04) with ocfs2-tools 1.3.9. I am even running the same kernel (2.6.22-14), but the behavior of my OCFS2 mounts seems to have changed. At first I thought it was due to the newer kernel (2.6.24-16), but that isn't the case: now it happens no matter which kernel I use. I even compiled my own vanilla 2.6.25, and it still has this issue.
> I have 6 total clients mounting the ocfs2 partition:
> - 2 batch servers which only access it every 5 or 10 minutes to load up a PHP script to process
> - 1 server I am trying to rsync from local RAID disk to ocfs2 - I am limiting this to 250kb/sec
> - 3 webservers loading normal stuff - PHP scripts, graphics, media files - maybe 2MB/sec combined total
> That's not even 3MB/sec - yet when I start the rsync, pretty quickly the server doing the rsync kernel panics and reboots, and the 3 webservers all have issues reading from the OCFS2-mounted partition.
> The %util drops to 0; it's as if everything bottlenecks and all disk I/O on the webservers is suspended for a few seconds. Then things go back to normal for a while. Is there any additional info that could be useful? I am desperately in need of help. I have hosting customers and somehow this upgrade has pretty much crippled me...

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] Did anything substantial change between 1.2.4 and 1.3.9?
mike wrote:
> I have changed my kernel back to 2.6.22-14-server, and now I don't get the kernel panics. It seems like an issue with 2.6.24-16 and some i/o made it crash...

OK, so it seems to be a bug in the ocfs2 kernel module, not in ocfs2-tools. :) Could you please describe in more detail how the kernel panic happens?

> However I am still getting file access timeouts once in a while. I am nervous about putting more load on the setup.

Please provide more details about that as well.

> [EMAIL PROTECTED] .batch]# cat /etc/default/o2cb
> # O2CB_ENABLED: 'true' means to load the driver on boot.
> O2CB_ENABLED=true
> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
> O2CB_BOOTCLUSTER=mycluster
> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
> O2CB_HEARTBEAT_THRESHOLD=7

This value is a little small. How did you build your shared disk (iSCSI or ...)? The most common value I have heard of is 61, which is about 120 secs. I don't know the reason and maybe Sunil can tell you. ;) You can also refer to http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT.

> # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
> O2CB_IDLE_TIMEOUT_MS=1
> # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
> O2CB_KEEPALIVE_DELAY_MS=5000
> # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
> O2CB_RECONNECT_DELAY_MS=2000

On 4/21/08, Tao Ma [EMAIL PROTECTED] wrote:
> Hi Mike, Are you sure it is caused by the update of ocfs2-tools? AFAIK, ocfs2-tools only includes user-space tools such as mkfs, fsck, and tunefs, so unless you changed the disk with the new tools, they should not be able to cause a kernel panic. That leaves only one likely candidate: have you modified /etc/sysconfig/o2cb? (That is the location on RHEL; I am not sure where Ubuntu keeps it.)
> I checked the rpm package for RHEL, and it updates /etc/sysconfig/o2cb, which defines several timeouts. Do you have a backup of this file? If so, please restore it and see whether that helps (I can't say for sure that it will). If not, do you remember the old timeout values you set for ocfs2? If so, you can use "o2cb configure" to set them yourself.
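If a backup of the old o2cb configuration survives, diffing just the timeout knobs shows exactly what the package upgrade changed. A minimal sketch of that check; the two sample files and their values below are made up for illustration (on RHEL the live file is /etc/sysconfig/o2cb, on Debian/Ubuntu /etc/default/o2cb):

```shell
#!/bin/sh
# Stand-ins for the backed-up and upgraded o2cb config files
# (hypothetical values, for illustration only).
cat > /tmp/o2cb.backup <<'EOF'
O2CB_ENABLED=true
O2CB_HEARTBEAT_THRESHOLD=61
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
EOF
cat > /tmp/o2cb.current <<'EOF'
O2CB_ENABLED=true
O2CB_HEARTBEAT_THRESHOLD=31
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
EOF

# Compare only the timeout-related settings; a non-empty diff means the
# upgrade changed cluster timing, and the old values may need restoring
# (via "o2cb configure" or by copying the backup back into place).
grep -E 'THRESHOLD|TIMEOUT|DELAY' /tmp/o2cb.backup  > /tmp/old.timeouts
grep -E 'THRESHOLD|TIMEOUT|DELAY' /tmp/o2cb.current > /tmp/new.timeouts
diff /tmp/old.timeouts /tmp/new.timeouts || echo "timeout settings changed"
# prints: timeout settings changed
```

All nodes in the cluster must agree on these values, which is why a silent change by a package update can take the whole cluster down.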
Re: [Ocfs2-users] Unable to start cluster with one node
Hi David,

David Coulson wrote:
> This is probably a stupid question, but here we go. I have two boxes running RHEL4U6 with DRBD mirroring disk between them. DRBD is set up in active/active mode, and seems to be working nicely. I have OCFS2 filesystems built on the DRBD devices, and normally I am able to mount them on both nodes and life is good. Now, I have one node down. Not good, but that is why we have two... DRBD is fine, but OCFS2 won't start up correctly.
> [EMAIL PROTECTED] network-scripts]# /etc/init.d/o2cb status
> Module configfs: Loaded
> Filesystem configfs: Mounted
> Module ocfs2_nodemanager: Loaded
> Module ocfs2_dlm: Loaded
> Module ocfs2_dlmfs: Loaded
> Filesystem ocfs2_dlmfs: Mounted
> Checking O2CB cluster ocfs2: Online
> Heartbeat dead threshold: 31
> Network idle timeout: 3
> Network keepalive delay: 2000
> Network reconnect delay: 2000
> Checking O2CB heartbeat: Not active

There is no problem with this status. "Not active" just means that no device is heartbeating: if you haven't mounted an ocfs2 volume, it will simply show "Not active".

> I take it this has something to do with establishing a quorum, which probably isn't happy with a single node. Is there a configuration change or workaround that will allow a single OCFS2 node to mount a filesystem?

A single OCFS2 node can mount a filesystem without any change to the configuration, so you may just try to mount it. If there is any problem, please paste the error message here. Thanks.

Regards,
Tao
Re: [Ocfs2-users] Unable to start cluster with one node
Hi David,

David Coulson wrote:
> Hi Tao,
>> A single OCFS2 node can mount a file system without any change in the configuration. So you may try to mount it. If there is any problem, please paste the error message here. Thanks.
> I tried to create a filesystem on an unused DRBD block device... mkfs.ocfs2 seemed to work okay, but it won't mount.
> # mount /mnt/mirror2
> ocfs2_hb_ctl: I/O error on channel while starting heartbeat
> mount.ocfs2: Error when attempting to run /sbin/ocfs2_hb_ctl: Operation not permitted

Are you sure you used the right device in your fstab? If yes, could you please strace the mount process to see which arguments mount.ocfs2 passes to ocfs2_hb_ctl and why it fails?

> Another ocfs2 filesystem which was built earlier seems happy, however the two nodes were working together then.

So here you mean you can mount an old ocfs2 filesystem on your single node, right? If yes, have you updated ocfs2-tools recently? (If you use a new mkfs.ocfs2, like 1.3.9, together with an old ocfs2_hb_ctl, like 1.2.x, it will fail.) By the way, is there any error message in dmesg?

Regards,
Tao
Re: [Ocfs2-users] Problems building ocfs2 rpm on Fedora 9
Hi Tina,

datavolume is only used in ocfs2-1.2.* and, in the future, ocfs2-1.4.*, if I remember correctly. It is Oracle-specific, so the mainline kernel does not have this mount option.

Regards,
Tao

Tina Soles wrote:
> Thanks! I installed the tools rpm and the console as well. I've successfully formatted the ocfs2 system, but when trying to mount it, I'm getting the following:
> # mount -t ocfs2 -o datavolume,nointr /dev/sdb1 /ocfs2/oradata/orcl
> mount.ocfs2: Invalid argument while mounting /dev/sdb1 on /ocfs2/oradata/orcl
> Here is the dmesg output:
> OCFS2 Node Manager 1.5.0
> OCFS2 DLM 1.5.0
> OCFS2 DLMFS 1.5.0
> OCFS2 User DLM kernel interface loaded
> OCFS2 Node Manager 1.5.0
> OCFS2 DLM 1.5.0
> OCFS2 DLMFS 1.5.0
> OCFS2 User DLM kernel interface loaded
> OCFS2 1.5.0
> (19226,0):ocfs2_parse_options:870 ERROR: Unrecognized mount option datavolume or missing value
> I googled this, and after researching it, I'm still not quite sure what to do. Is it ok to use the mount command above without the datavolume,nointr parameters? If so, then do I have to include something in init.ora? I'm a bit confused. Any help you can give me would be appreciated, as I am brand new to all of this. My versions are:
> # rpm -qa | grep ocfs2
> ocfs2console-1.3.9-8.20080221git.fc10.i386
> ocfs2-tools-1.3.9-8.20080221git.fc10.i386
> # find /lib/modules/`uname -r` -name \*ocfs2\*
> /lib/modules/2.6.25-14.fc9.i686/kernel/fs/ocfs2
> /lib/modules/2.6.25-14.fc9.i686/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
> /lib/modules/2.6.25-14.fc9.i686/kernel/fs/ocfs2/ocfs2.ko
> /lib/modules/2.6.25-14.fc9.i686/kernel/fs/ocfs2/dlm/ocfs2_dlmfs.ko
> /lib/modules/2.6.25-14.fc9.i686/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
> # ./o2cb status
> Module configfs: Loaded
> Filesystem configfs: Mounted
> Module ocfs2_nodemanager: Loaded
> Module ocfs2_dlm: Loaded
> Module ocfs2_dlmfs: Loaded
> Filesystem ocfs2_dlmfs: Mounted
> Checking O2CB cluster ocfs2: Online
> Heartbeat dead threshold = 31
> Network idle timeout: 3
> Network keepalive delay: 2000
> Network reconnect delay: 2000
> Thanks in advance!
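The ocfs2_parse_options error in dmesg already names the exact option the kernel rejected, which is the quickest way to tell a mount-option problem from anything else. A small sketch that pulls the option name out of such a line (the sample log line is copied from the report above):

```shell
#!/bin/sh
# Sample ocfs2 mount-failure line as it appears in dmesg above.
line='(19226,0):ocfs2_parse_options:870 ERROR: Unrecognized mount option datavolume or missing value'

# Extract the rejected option name: it sits between "mount option"
# and "or missing value" in the message.
opt=$(printf '%s\n' "$line" | sed 's/.*Unrecognized mount option \([^ ]*\) or missing value.*/\1/')
echo "$opt"   # prints: datavolume
```

In this case the extracted option (datavolume) is one the mainline ocfs2 module simply does not implement, which matches the explanation above.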
-----Original Message-----
From: Sunil Mushran [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 27, 2008 4:57 PM
To: Tina Soles
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problems building ocfs2 rpm on Fedora 9

The ocfs2 fs module is shipped with the kernel, so no need to worry there.

$ find /lib/modules/`uname -r` -name \*ocfs2\*

should return 4 modules: fs, dlm, nm and userdlm. The ocfs2-tools rpm should be available tomorrow for download. Fingers crossed.

Tina Soles wrote:
> Thanks for the reply. Can you be more specific and give me the exact name of the native Fedora 9 rpm(s) that I need for ocfs2 and ocfs2-tools? Thanks.

*From:* Sunil Mushran [mailto:[EMAIL PROTECTED]]
*Sent:* Fri 6/27/2008 1:16 PM
*To:* Tina Soles
*Cc:* ocfs2-users@oss.oracle.com
*Subject:* Re: [Ocfs2-users] Problems building ocfs2 rpm on Fedora 9

Fedora ships the ocfs2 fs modules natively. You don't have to do all this. What is missing is the tools rpm. But the good news is that it should be available any day now, literally.

Tina Soles wrote:
> Hello, I'm brand new to RAC and ocfs2. I need to install ocfs2, but there is no distribution for Fedora 9, for kernel 2.6.25-14.fc9.i686. I have obtained the ocfs2-1.3.9-0.1.tar.gz source file, unzipped it, and am following the instructions found here: http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#BUILD
> My .rpmmacros file looks like this (the /usr/src/redhat and all subdirectories exist):
> %_topdir /usr/src/redhat
> %_tmppath /usr/src/redhat/tmp
> %_sourcedir /usr/src/redhat/SOURCES
> %_specdir /usr/src/redhat/SPECS
> %_srcrpmdir /usr/src/redhat/SRPMS
> %_rpmdir /usr/src/redhat/RPMS
> %_builddir /usr/src/redhat/BUILD
> I'm logged in as root.
> Here's a list of my kernel* packages installed:
> kernel.i686 2.6.25-14.fc9 installed
> kernel-PAE-devel.i686 2.6.25.6-55.fc9 installed
> kernel-devel.i686 2.6.25-14.fc9 installed
> kernel-devel.i586 2.6.25.6-55.fc9 installed
> kernel-headers.i386 2.6.25.6-55.fc9 installed
> kernel-xen-devel.i686 2.6.25.3-2.fc9 installed
> kerneloops.i386 0.10-11.fc9 installed
> Here's the contents of /usr/src/kernels:
> drwxr-xr-x 19 root root 4096 2008-06-26 13:11 2.6.25-14.fc9.i686
> drwxr-xr-x 19 root root 4096 2008-06-26 14:34 2.6.25.3-2.fc9.i686.xen
> drwxr-xr-x 19 root root 4096 2008-06-26 14:37 2.6.25.6-55.fc9.i586
> drwxr-xr-x 19 root root 4096 2008-06-26 14:33 2.6.25.6-55.fc9.i686.PAE
> When I attempt to issue ./vendor/rhel5/kernel.guess targets, it returns me back to the # prompt with no output. I think this is indicative of my problem, but I'm not sure what to do! I can run ./configure with no problem (see below), but when I attempt to execute the make statement, I get:
> make: *** No rule to make target 'rhel5_2.6.25-14.fc9_rpm'. Stop.
Re: [Ocfs2-users] ocfs2 kernel BUG
Hi,

Please provide detailed ocfs2 version info, which may be helpful for diagnosis.

Peter Selzner wrote:
> Hi, we had these entries in /var/log/messages a few days ago:
> Jul 28 23:30:47 xxx kernel: (12268,2):ocfs2_extend_file:790 ERROR: bug expression: i_size_read(inode) != (le64_to_cpu(fe->i_size) - *bytes_extended)
> Jul 28 23:30:47 xxx kernel: (12268,2):ocfs2_extend_file:790 ERROR: Inode 8323098 i_size = 1572864, dinode i_size = 1568768, bytes_extended = 0, new_i_size = 1576960
> Jul 28 23:30:47 xxx kernel: klogd 1.4.1, -- state change --
> Jul 28 23:30:47 xxx kernel: [ cut here ]
> Jul 28 23:30:47 xxx kernel: kernel BUG at fs/ocfs2/file.c:790!
> Jul 28 23:30:47 xxx kernel: invalid opcode: [#1]
> Jul 28 23:30:47 xxx kernel: SMP
> Jul 28 23:30:47 xxx kernel: last sysfs file: /class/infiniband/mthca1/board_id
> Jul 28 23:30:47 xxx kernel: Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs cpqci mptctl mptbase ipmi_si ipmi_devintf ipmi_msghandler rdma_ucm rds ib_ucm ib_sdp rdma_cm iw_cm ib_addr ib_local_sa ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad bonding ib_mthca ib_mad ib_core button battery ac raw loop dm_round_robin dm_multipath dm_mod usbhid hw_random ide_cd uhci_hcd e1000 cdrom ehci_hcd bnx2 usbcore ext3 jbd ata_piix ahci libata edd fan thermal processor cciss sg qla2400 qla2300 qla2xxx firmware_class qla2xxx_conf intermodule piix sd_mod scsi_mod ide_disk ide_core
> Jul 28 23:30:47 xxx kernel: CPU: 2
> Jul 28 23:30:47 xxx kernel: EIP: 0060:[f9de8173] Tainted: P U VLI
> Jul 28 23:30:47 xxx kernel: EFLAGS: 00210292 (2.6.16.46-0.12-bigsmp #1)
> Jul 28 23:30:47 xxx kernel: EIP is at ocfs2_extend_file+0x3cd/0xf9b [ocfs2]
> Jul 28 23:30:47 xxx kernel: eax: 008c ebx: ecx: ff00 edx: 00200286
> Jul 28 23:30:47 xxx kernel: esi: edi: ebp: df05f000 esp: e398de70
> Jul 28 23:30:47 xxx kernel: ds: 007b es: 007b ss: 0068
> Jul 28 23:30:47 xxx kernel: Process mv (pid: 12268, threadinfo=e398c000 task=f7f80660)
> Jul 28 23:30:47 xxx kernel: Stack: 0 dd4f9d88 ce48c000 0001 cf253280 dd4f9b80
> Jul 28 23:30:47 xxx kernel: dd4f9ee4 0017f000 f9ddf432 e398dea8 dd4f9b80
> Jul 28 23:30:47 xxx kernel: 0001 e398deb4 e398deb4 ce48c000 ece0bc00
> Jul 28 23:30:47 xxx kernel: Call Trace:
> Jul 28 23:30:47 xxx kernel: [f9ddf432] ocfs2_status_completion_cb+0x0/0xa [ocfs2]
> Jul 28 23:30:47 xxx kernel: [f9df72f2] ocfs2_write_lock_maybe_extend+0xb2f/0xde3 [ocfs2]
> Jul 28 23:30:47 xxx kernel: [f9dea85d] ocfs2_file_write+0x125/0x24d [ocfs2]
> Jul 28 23:30:47 xxx kernel: [f9dea738] ocfs2_file_write+0x0/0x24d [ocfs2]
> Jul 28 23:30:47 xxx kernel: [c0164714] vfs_write+0xaa/0x152
> Jul 28 23:30:47 xxx kernel: [c0164d1f] sys_write+0x3c/0x63
> Jul 28 23:30:47 xxx kernel: [c0103cab] sysenter_past_esp+0x54/0x79
> Jul 28 23:30:47 xxx kernel: Code: 8b 4c 24 3c ff 71 04 ff 31 68 16 03 00 00 68 2b b5 e0 f9 ff 70 10 8b 00 ff b0 c0 00 00 00 68 b1 fd e0 f9 e8 ca a8 33 c6 83 c4 3c 0f 0b 16 03 db fb e0 f9 8b 5c 24 20 8b 03 0f ae e8 89 f6 8b 74
> It was impossible to do ls -al in a certain directory (each process that touched files in this directory ended up in the D state, i.e. uninterruptible sleep). Any suggestions? Thanks.

How did this happen? Could you please explain it in more detail? E.g., how many nodes are in your cluster? You hang on one node - what about the other nodes, and what were you doing on them?

Regards,
Tao
Re: [Ocfs2-users] ocfs2 node reboot method
Hi,

Masanari Iida wrote:
> Hello list, There is a 14-node OCFS2 cluster. When I reboot all 14 nodes at once, some nodes fail to mount the ocfs2 filesystem while rebooting. The mount is supposed to be done by /etc/fstab. The symptom happens on random nodes. I would like to know if there is a rule that OCFS2 nodes need to be rebooted one by one instead of all at once.

There is no such rule for rebooting ocfs2.

> An error message I saw was
> mount.ocfs2: Transport endpoint is not connected while mounting /dev/EXTDISK/OCFS2

Interesting. Have you updated ocfs2 on some of the nodes? Normally this happens when there is a protocol collision among nodes. Is there any helpful information in dmesg? Please also provide the version info of ocfs2.

Regards,
Tao
Re: [Ocfs2-users] ocfs2 node reboot method
Masanari Iida wrote:
> On Tue, Aug 5, 2008 at 5:43 PM, Tao Ma [EMAIL PROTECTED] wrote:
>>> An error message I saw was
>>> mount.ocfs2: Transport endpoint is not connected while mounting /dev/EXTDISK/OCFS2
>> Interesting. Have you updated ocfs2 on some nodes? Normally it happens when there is a protocol collision among nodes. Any helpful information in dmesg?
> The boxes are all SLES10 + ocfs2-tools-1.2.5-SLES-r2997. I found no other messages at the time of failure.

Which version of ocfs2 itself?

> I would like to catch some useful information for troubleshooting. Do you think tcpdump may catch something? Or do I need to use debugfs.ocfs2? If the latter, what is the right option?

debugfs.ocfs2 -l CONN DLM_DOMAIN TCP allow

then mount and check dmesg.

Regards,
Tao
Re: [Ocfs2-users] Enable mlog() messages
Masanari Iida wrote:
> Hello again, I looked into the source and found that the error message "Transport endpoint is not connected" could come from ENOTCONN in tcp.c. There are multiple ENOTCONN sites, so I would like to know which one produced my message. I want to enable mlog(). My understanding is that debugfs.ocfs2 enables mlog():
> http://oss.oracle.com/osswiki/OCFS2/Debugging
> I think my problem happens while the system boots up and mounts the volume from /etc/fstab. In that case, how can I enable debugging?

As I said in another mail, you can use

debugfs.ocfs2 -l CONN DLM_DOMAIN SUPER allow

before mounting the volume. dmesg should then give us enough information about what happens.

Regards,
Tao
Re: [Ocfs2-users] ocfs2 node reboot method
Hi,

Masanari Iida wrote:
> Hello Tao and Sunil, In my case, the symptom (ocfs2 failing to mount a volume via /etc/fstab) happened when I rebooted the system. Even when it fails to mount from /etc/fstab, I can mount it later after I log in to the system, so it could be some kind of timing issue. Your advice ("mount and check dmesg") seems to be a manual procedure. I would like to know how and where I can set up debugfs.ocfs2 so that it runs just before the ocfs2 mount.

Are you sure your network device has been started before ocfs2? Have you added _netdev to your fstab entry? See http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html, question 41.

Regards,
Masanari Iida
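For reference, the _netdev option mentioned above goes in the fstab entry itself. A sketch of such an entry, assuming the device path from the earlier messages in this thread (the mount point /data is illustrative):

```
# /etc/fstab - _netdev delays the mount until networking is up,
# which an o2cb-backed ocfs2 mount needs to reach the other nodes.
/dev/EXTDISK/OCFS2  /data  ocfs2  _netdev,defaults  0 0
```

Without _netdev, the boot-time mount can race the network startup, which matches the "works when mounted manually after login" symptom described above.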
Re: [Ocfs2-users] New node..new problems
Hi,

Dante Garro wrote:
> Sunil, now I realize the messages are about node 0, but the new node is node 1, and no matter what value I set, it always says 14000 ms. Does this change your diagnosis?

Node 1 starts the connection with node 0, so you see the messages related to node 0 on node 1. It looks like the configuration on node 1 is wrong. Please make sure that the value of O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb on node 1 is the same as that on node 0.

Regards,
Tao

-----Original Message-----
From: Sunil Mushran [mailto:[EMAIL PROTECTED]]
Sent: Thursday, October 9, 2008 6:02 PM
To: Dante Garro
Cc: 'ocfs2-users@oss.oracle.com'
Subject: Re: [Ocfs2-users] New node..new problems

Yeah, the cluster timeouts are not consistent. Update and restart the cluster on the new node (or all nodes, as the case may be). Hint:

cat /sys/kernel/config/cluster/clustername/idle_timeout_ms

to see the active heartbeat threshold.

Dante Garro wrote:
> Hi all, because of problems with the ocfs2 release in the Debian distribution, I decided to remake my cluster, replacing it with a CentOS-based installation. I started by replacing one of the nodes while keeping the other working. On this recently created node the following errors appear:
> drbd0: Writing meta data super block now.
> (2558,1):o2hb_check_slot:881 ERROR: Node 0 on device drbd0 has a dead count of 14000 ms, but our count is 13000 ms. Please double check your configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
> OCFS2 1.2.9 Wed Sep 24 19:26:41 PDT 2008 (build a693806cb619dd7f225004092b675ede)
> (2520,1):o2net_connect_expired:1585 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
> (2556,1):dlm_request_join:901 ERROR: status = -107
> (2556,1):dlm_try_to_join_domain:1049 ERROR: status = -107
> (2556,1):dlm_join_domain:1321 ERROR: status = -107
> (2556,1):dlm_register_domain:1514 ERROR: status = -107
> (2556,1):ocfs2_dlm_init:2024 ERROR: status = -107
> (2556,1):ocfs2_mount_volume:1133 ERROR: status = -107
> ocfs2: Unmounting device (147,0) on (node 1)
> (2591,1):o2hb_check_slot:881 ERROR: Node 0 on device drbd0 has a dead count of 14000 ms, but our count is 13000 ms. Please double check your configuration values for 'O2CB_HEARTBEAT_THRESHOLD'
> (2520,1):o2net_connect_expired:1585 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
> (2589,1):dlm_request_join:901 ERROR: status = -107
> (2589,1):dlm_try_to_join_domain:1049 ERROR: status = -107
> (2589,1):dlm_join_domain:1321 ERROR: status = -107
> (2589,1):dlm_register_domain:1514 ERROR: status = -107
> (2589,1):ocfs2_dlm_init:2024 ERROR: status = -107
> (2589,1):ocfs2_mount_volume:1133 ERROR: status = -107
> ocfs2: Unmounting device (147,0) on (node 1)
> I've changed the O2CB_HEARTBEAT_THRESHOLD parameter as O2CB advised me, but it didn't resolve the issue. I hope someone can give me a clue. Thanks in advance.
> Dante
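The dead counts in these messages map directly to O2CB_HEARTBEAT_THRESHOLD. Per the OCFS2 FAQ, o2hb writes a heartbeat every 2 seconds and the disk-heartbeat dead time is (threshold - 1) * 2 seconds, so a threshold of 61 gives the "about 120 secs" mentioned earlier in this archive. A small sketch of that relationship (mapping the 14000/13000 ms counts onto this formula is my reading, not something stated in the thread):

```python
def dead_time_secs(threshold: int) -> int:
    """Disk heartbeat dead time for a given O2CB_HEARTBEAT_THRESHOLD.

    o2hb writes a heartbeat every 2 seconds; a node is declared dead
    after (threshold - 1) missed iterations (per the OCFS2 FAQ).
    """
    return (threshold - 1) * 2

# Threshold values that appear in this archive:
for threshold in (7, 31, 61):
    print(threshold, "->", dead_time_secs(threshold), "s")
# 7 -> 12 s, 31 -> 60 s, 61 -> 120 s
```

Because every node compares its own count against the one already written in the heartbeat slots, a node configured with a different threshold refuses to join, which is exactly the o2hb_check_slot error above.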
Re: [Ocfs2-users] New node..new problems
Dante Garro wrote:
> Thanks Tao, I've set up the same value on both nodes and the cluster comes online. Now, when I try to mount, the following errors appear on node 1 (the new CentOS):
> (2512,1):o2net_connect_expired:1585 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
> (3022,1):dlm_request_join:901 ERROR: status = -107
> (3022,1):dlm_try_to_join_domain:1049 ERROR: status = -107
> (3022,1):dlm_join_domain:1321 ERROR: status = -107
> (3022,1):dlm_register_domain:1514 ERROR: status = -107
> (3022,1):ocfs2_dlm_init:2024 ERROR: status = -107
> (3022,1):ocfs2_mount_volume:1133 ERROR: status = -107
> ocfs2: Unmounting device (147,0) on (node 1)
> And the following on node 0 (the old Debian):
> (2228,0):o2net_check_handshake:1093 node nodo2 (num 1) at 192.168.168.2: advertised net protocol version 103 but 2 is required, disconnecting
> I believe the Debian message is clear: protocol version incompatibility.

Yes, you are right. The ocfs2 versions on node 1 and node 0 aren't the same, so they can't coexist (i.e., mount the same volume simultaneously). I would suggest you either change node 0 from Debian to CentOS or run the same ocfs2 version on both nodes. From the message, it looks like your CentOS has ocfs2-1.2.* while your Debian is a bit newer (I guess it uses the ocfs2 shipped with its mainline kernel). Currently for CentOS 5 we only have 1.2.* and 1.4.*, so I don't think you can find an ocfs2 version compatible with your Debian.

Regards,
Tao

> Is there a way to resolve it? Thanks
> Dante
Re: [Ocfs2-users] OCFS2: ERROR (device sdh1): ocfs2_direct_IO_get_blocks
Hi Daniel,

Daniel Keisling wrote:
> The patch was here: http://oss.oracle.com/pipermail/ocfs2-devel/2008-September/002787.html

Yes, that patch has been merged into ocfs2-1.4 and should be in the next release. Also, as Joel said, if you have the appropriate support contract, you should call support and file it that way. Here is the workaround from the mailing list; I don't know whether it is suitable in your case. I would guess that you are upgrading from ocfs2-1.2 to ocfs2-1.4. If so, please first find out which file triggers this bug. Use

debugfs.ocfs2 -R "findpath 23693699" /dev/sdh1

to see what the file is. If that file isn't a datafile (I think it shouldn't be, since the file is only 5120 bytes) and your volume is also used for other files (e.g., the Oracle Home), then please remove datavolume from the mount options and instead set the init.ora parameter filesystemio_options=directio (or is it odirect?). This should limit Oracle to using the O_DIRECT flag only for the files that need it. datavolume is legacy; please refer to the OCFS2 1.4 User's Guide for further information about the datavolume option. I hope this helps.
Regards,
Tao

> [r...@wilracdbdr01 /]# debugfs.ocfs2 -R 'stat 23693699' /dev/sdh1
> Inode: 23693699   Mode: 0660   Generation: 2707416418 (0xa15fe562)
> FS Generation: 236416663 (0xe176e97)
> Type: Regular   Attr: 0x0   Flags: Valid
> User: 503 (oracle)   Group: 505 (dba)   Size: 5120
> Links: 1   Clusters: 2
> ctime: 0x49a88566 -- Fri Feb 27 19:29:26 2009
> atime: 0x49a88566 -- Fri Feb 27 19:29:26 2009
> mtime: 0x49a88566 -- Fri Feb 27 19:29:26 2009
> dtime: 0x0 -- Wed Dec 31 19:00:00 1969
> ctime_nsec: 0x222d308b -- 573386891
> atime_nsec: 0x21390e2d -- 557387309
> mtime_nsec: 0x222d308b -- 573386891
> Last Extblk: 0
> Sub Alloc Slot: 0   Sub Alloc Bit: 19
> Tree Depth: 0   Count: 243   Next Free Rec: 1
> ## Offset   Clusters   Block#     Flags
> 0  0        2          34610869   0x0
> [r...@wilracdbdr01 /]# debugfs.ocfs2 -R stats /dev/sdh1
> Revision: 0.90
> Mount Count: 0   Max Mount Count: 20
> State: 0   Errors: 0
> Check Interval: 0   Last Check: Fri Feb 27 19:47:46 2009
> Creator OS: 0
> Feature Compat: 1 BackupSuper
> Feature Incompat: 0 None
> Tunefs Incomplete: 0 None
> Feature RO compat: 0 None
> Root Blknum: 5   System Dir Blknum: 6
> First Cluster Group Blknum: 3
> Block Size Bits: 12   Cluster Size Bits: 12
> Max Node Slots: 4
> Label: ph1p_arch
> UUID: 839B5D0925C74CD4920F4E8CC065D180
> Cluster stack: classic o2cb
> Inode: 2   Mode: 00   Generation: 236416663 (0xe176e97)
> FS Generation: 236416663 (0xe176e97)
> Type: Unknown   Attr: 0x0   Flags: Valid System Superblock
> User: 0 (root)   Group: 0 (root)   Size: 0
> Links: 0   Clusters: 39321087
> ctime: 0x48515d46 -- Thu Jun 12 13:30:46 2008
> atime: 0x0 -- Wed Dec 31 19:00:00 1969
> mtime: 0x48515d46 -- Thu Jun 12 13:30:46 2008
> dtime: 0x0 -- Wed Dec 31 19:00:00 1969
> ctime_nsec: 0x -- 0
> atime_nsec: 0x -- 0
> mtime_nsec: 0x -- 0
> Last Extblk: 0
> Sub Alloc Slot: Global   Sub Alloc Bit: 65535

-----Original Message-----
From: Joel Becker [mailto:joel.bec...@oracle.com]
Sent: Friday, February 27, 2009 6:50 PM
To: Daniel Keisling
Cc: ocfs2-users@oss.oracle.com; Sunil Mushran
Subject: Re: OCFS2: ERROR (device sdh1): ocfs2_direct_IO_get_blocks

On Fri, Feb 27, 2009 at 06:40:38PM
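The hex timestamps in the debugfs output above are plain Unix epoch seconds, so they can be cross-checked with date(1). A quick sketch using the ctime from the stat dump (GNU date assumed; shown in UTC so the output is reproducible, whereas the list posting printed local EST time):

```shell
#!/bin/sh
# ctime as printed by debugfs.ocfs2 for inode 23693699 above.
ctime_hex=0x49a88566

# Convert hex -> decimal epoch seconds via shell arithmetic, then format.
date -u -d "@$((ctime_hex))" '+%Y-%m-%d %H:%M:%S UTC'
# prints: 2009-02-28 00:29:26 UTC  (i.e. Fri Feb 27 19:29:26 2009 EST)
```

This kind of decoding is handy when comparing inode times across nodes whose clocks or time zones differ.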
-0600, Daniel Keisling wrote:
> I am getting the following error when writing to an OCFS2 filesystem:
> Feb 27 19:06:37 wilracdbdr01 kernel: OCFS2: ERROR (device sdh1): ocfs2_direct_IO_get_blocks: Inode 23693699 has a hole at block 6
> Feb 27 19:06:37 wilracdbdr01 kernel: File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.

This basically says that your filesystem does not support sparse files, but it does have a hole in an inode - which shouldn't happen if sparse isn't supported. Can you send the output of debugfs.ocfs2 -R 'stat 23693699' /dev/sdh1 and debugfs.ocfs2 -R stats /dev/sdh1?

> I saw a patch that was released in September 2008. How do I get this?

What patch? Do you have a link? Without knowing the patch I can't tell you whether it affects you.

> This is a production system and we are currently unable to start the DB.

If you have the appropriate support, you should call support and file it that way. Support will also want the information I requested above.

Joel

--
"Three o'clock is always too late or too early for anything you want to do." - Jean-Paul Sartre

Joel Becker
Principal Software Developer
Oracle
E-mail:
Re: [Ocfs2-users] OCFS2 fencing
Hi ramya, ramya tn wrote: Hi All, One of our system fenced by itself few days back and this has been happening very frequently from many days. But unfortunately, we aree not able to stop the system fencing as we are not sure what is causing this. The error i found out in log file is: . .. . . . Feb 20 23:36:41 ImageInt1 kernel: SCSI error : 1 0 2 1 return code = 0x2 Feb 20 23:36:41 ImageInt1 kernel: end_request: I/O error, dev sdc, sector 656216192 Feb 20 23:36:41 ImageInt1 kernel: SCSI error : 1 0 2 1 return code = 0x2 Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, sector 657248384 Feb 20 23:36:42 ImageInt1 kernel: SCSI error : 1 0 2 1 return code = 0x2 Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, sector 667312256 Feb 20 23:36:42 ImageInt1 kernel: SCSI error : 1 0 2 1 return code = 0x2 Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, sector 670408832 Feb 20 23:36:42 ImageInt1 kernel: SCSI error : 1 0 2 1 return code = 0x2 Feb 20 23:36:42 ImageInt1 kernel: end_request: I/O error, dev sdc, sector 670666880 . . . . . 
> Feb 20 23:53:21 ImageInt1 kernel: Index 13: took 0 ms to do submit_bio for write
> Feb 20 23:53:21 ImageInt1 kernel: Index 14: took 0 ms to do checking slots
> Feb 20 23:53:21 ImageInt1 kernel: Index 15: took 50 ms to do waiting for write completion
> Feb 20 23:53:21 ImageInt1 kernel: Index 16: took 1904 ms to do msleep
> Feb 20 23:53:21 ImageInt1 kernel: Index 17: took 0 ms to do allocating bios for read
> Feb 20 23:53:21 ImageInt1 kernel: Index 18: took 0 ms to do bio alloc read
> Feb 20 23:53:21 ImageInt1 kernel: Index 19: took 0 ms to do bio add page read
> Feb 20 23:53:21 ImageInt1 kernel: Index 20: took 0 ms to do submit_bio for read
> Feb 20 23:53:21 ImageInt1 kernel: Index 21: took 44652 ms to do waiting for read completion
> Feb 20 23:53:21 ImageInt1 kernel: Index 22: took 0 ms to do bio alloc write
> Feb 20 23:53:21 ImageInt1 kernel: Index 23: took 0 ms to do bio add page write
> Feb 20 23:53:21 ImageInt1 kernel: Index 0: took 0 ms to do submit_bio for write
> Feb 20 23:53:21 ImageInt1 kernel: Index 1: took 0 ms to do checking slots
> Feb 20 23:53:21 ImageInt1 kernel: Index 2: took 9307 ms to do waiting for write completion
> Feb 20 23:53:21 ImageInt1 kernel: Index 3: took 0 ms to do allocating bios for read
> Feb 20 23:53:21 ImageInt1 kernel: Index 4: took 0 ms to do bio alloc read
> Feb 20 23:53:21 ImageInt1 kernel: Index 5: took 0 ms to do bio add page read
> Feb 20 23:53:21 ImageInt1 kernel: Index 6: took 0 ms to do submit_bio for read
> Feb 20 23:53:22 ImageInt1 kernel: Index 7: took 35756 ms to do waiting for read completion
> Feb 20 23:53:22 ImageInt1 kernel: Index 8: took 0 ms to do bio alloc write
> Feb 20 23:53:22 ImageInt1 kernel: Index 9: took 0 ms to do bio add page write
> Feb 20 23:53:22 ImageInt1 kernel: Index 10: took 0 ms to do submit_bio for write
> Feb 20 23:53:22 ImageInt1 kernel: Index 11: took 0 ms to do checking slots
> Feb 20 23:53:22 ImageInt1 kernel: Index 12: took 84549 ms to do waiting for write completion
> Feb 20 23:53:22 ImageInt1 kernel: *** ocfs2 is very sorry to be
> fencing this system by restarting ***
> I found the same SCSI errors each time it fences. Can anyone suggest what could be the reason for these SCSI errors, and is it these SCSI errors that are causing the fencing?

I don't know the reason for the SCSI errors, so I will just answer your second question. Yes, SCSI errors will cause ocfs2 fencing. OCFS2 needs to heartbeat on the disk, so if it tries many times and still fails to write to the disk because of the SCSI errors, it will fence itself.

Regards,
Tao
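The log above shows the shape of this decision: o2hb times each heartbeat write, and once the writes stall past the dead threshold, the node restarts itself rather than risk corrupting shared data. The sketch below is a simplified illustrative model of that behavior, not the actual o2hb kernel code; the threshold and interval are the common defaults discussed elsewhere in this archive:

```python
# Simplified model of o2hb disk-heartbeat fencing (illustrative only,
# not the real kernel logic). Each slot is one 2-second heartbeat.
HEARTBEAT_INTERVAL_MS = 2000
DEAD_THRESHOLD = 31  # iterations; dead time = (31 - 1) * 2 s = 60 s

def should_fence(write_latencies_ms):
    """Return True once heartbeat writes have stalled past the dead time."""
    budget_ms = (DEAD_THRESHOLD - 1) * HEARTBEAT_INTERVAL_MS
    stalled_ms = 0
    for latency in write_latencies_ms:
        if latency > HEARTBEAT_INTERVAL_MS:
            # The write took longer than one heartbeat slot: we are falling
            # behind, like the "took 84549 ms to do waiting for write
            # completion" lines in the log above.
            stalled_ms += latency
            if stalled_ms > budget_ms:
                return True
        else:
            stalled_ms = 0  # a timely write resets the stall
    return False

print(should_fence([50, 9307, 35756, 84549]))  # True: stalls exceed 60 s
print(should_fence([50, 40, 60]))              # False: healthy I/O
```

With the latencies reported in this log (9307, 35756, 84549 ms), any reasonable threshold is exceeded, so the fencing is the expected response to the underlying SCSI errors.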
Re: [Ocfs2-users] problem stopping o2cb service on one of nodes
Hi Nikola, Nikola Ciprich wrote: Hi, I'm trying ocfs2 on a RHEL5 distro, 2.6.29 kernel, ocfs2-tools 1.4.1. I'm using DRBD in primary/primary mode as shared storage... I've configured the service according to the quickstart document, and everything works, but when I umount the fs on both nodes, stopping the o2cb service on one of the nodes always fails with: [r...@vbox4 sysconfig]# /etc/rc.d/init.d/o2cb stop Stopping O2CB cluster vb34: Failed Unable to stop cluster as heartbeat region still active It looks like your disk heartbeat is still active. I don't know the specific reason, maybe http://oss.oracle.com/bugzilla/show_bug.cgi?id=1053 ? But you can stop it manually. 1. ocfs2_hb_ctl -I -d device or ocfs2_hb_ctl -I -u uuid — this will tell you the reference count for the heartbeat. 2. ocfs2_hb_ctl -K -d device service or ocfs2_hb_ctl -K -u uuid service — this will kill the heartbeat manually. service is the stack you use, and it should be o2cb in your case. Btw, you can try ocfs2_hb_ctl -K -u uuid service to see whether it is the same problem as bug 1053. Regards, Tao
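The two-step manual teardown Tao describes can be sketched as a shell session. The device path is illustrative (the thread uses DRBD, so something like /dev/drbd0); on a real system substitute your actual device or the volume UUID:

```shell
# 1. Query the heartbeat reference count for the device (or use -u <uuid>):
ocfs2_hb_ctl -I -d /dev/drbd0
# 2. Kill the stuck heartbeat region; "o2cb" is the cluster stack in use:
ocfs2_hb_ctl -K -d /dev/drbd0 o2cb
# 3. Stopping the cluster stack should now succeed:
/etc/rc.d/init.d/o2cb stop
```

This is a sketch of the sequence quoted above, not a general-purpose recipe: only kill a heartbeat region by hand after the filesystem is unmounted on that node.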
Re: [Ocfs2-users] ocfs2 vs ext3?
Andrew (Anything) wrote: Hi Andrew, I just checked max-features; it doesn't include local, which means that you still need to create DLM locks even on your local node, which costs some delay. You can check whether your volume has local enabled with the command echo 'stats'|debugfs.ocfs2 /dev/sdX|grep Incompat If the volume was mkfsed to be used as a local disk, you should see local. Otherwise you need to add --fs-features=local to your mkfs. Regards, Tao Thanks for your replies Tao and Thomas. I still intend on using the filesystem in a 3 node cluster, but when I found it to be 25x slower than a single ext3 partition I thought I'd start troubleshooting at the start. I just did a format of a small partition to give it a quick test anyway, and found that it performs just as well as ext3, in some tests even better. Except, if I understand correctly, local disables the ability for clustering completely? Yes, local means you will use it locally, so other nodes can't use it. Dear Andrew, I think the result should be normal because ocfs2 is a cluster fs? Hi Thomas, on a single node system (with no network latency, and only itself to talk to) I had expected better results. Is a 10x reduction in file creation/modification an expected result? Could you please talk a little more about the test case? Just FYI, I have just committed a new series of patches to 2.6.30 which will improve inode allocation a lot. See http://oss.oracle.com/pipermail/ocfs2-devel/2009-January/003799.html It will save you a lot of time when you create a large number of files, delete them and then recreate them. The above link shows the test results. Regards, Tao
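The feature check Tao suggests boils down to grepping the Incompat line of `debugfs.ocfs2` stats output. A minimal sketch of the decision (the sample output line below is hypothetical, not from a real volume):

```shell
# On a real system you would run (device name illustrative):
#   echo 'stats' | debugfs.ocfs2 /dev/sdX | grep Incompat
# Here we simulate a stats output line to show what to look for.
sample='Feature Incompat: 8 local'   # hypothetical debugfs.ocfs2 output
if echo "$sample" | grep -q 'local'; then
    echo "formatted with --fs-features=local: single-node use, no cluster locking"
else
    echo "clustered volume: DLM locks are taken even on a single node"
fi
```

If the word `local` is absent, every create/unlink pays DLM overhead even with one node mounted, which is consistent with the slowdown Andrew measured.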
Re: [Ocfs2-users] another node is heartbeating in our slot
Hi sundar, sundar mahadevan wrote: Hi members, Newbie. Help pls. My setup: system 1: opensuse 11.1 with iscsitarget (secondary hard drive with logical volume) + ocfs2 system 2: opensuse 11.1 with open-iscsi (detects the logical volume on system 1) + ocfs2 1) mount -t ocfs /dev/sdb /u01/oradata/orcl (on system 2) I have this entry in /etc/fstab as well as /etc/init.d/boot.local, but the detection of logical volume /dev/sdb during bootup takes place after the mount commands have already run, and hence the mount fails. Are there any options other than a manual mount after logon? 2) Once I manually mount /dev/sdb on system 2 with mount -t ocfs /dev/sdb /u01/oradata/orcl I receive the following error on system 1: another node is heartbeating in our slot. I would guess that the volume was already mounted on system 1, right? So what does your cluster.conf look like? Do the two nodes know about each other? Regards, Tao
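For the boot-ordering problem in question 1 (the iSCSI device appears only after local fstab mounts have run), the usual approach is the `_netdev` mount option, which tells the init scripts to defer the mount until networking, and thus open-iscsi, is up. A hypothetical fstab line based on the paths in this mail (note the filesystem type for OCFS2 is normally `ocfs2`, not `ocfs`):

```
# /etc/fstab — defer mounting until the network (and iSCSI) is available
/dev/sdb  /u01/oradata/orcl  ocfs2  _netdev,defaults  0 0
```

This is a sketch under the assumption that the distribution's boot scripts honor `_netdev` for network block devices; the o2cb and ocfs2 init services must also be enabled so the cluster stack is online before the mount is attempted.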
Re: [Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA
Hi Florian, florian.engelm...@bt.com wrote: Hi Tao, Hi florian, florian.engelm...@bt.com wrote: Florian, the problem here seems to be with the network. The nodes are running into the network heartbeat timeout and hence the second node is getting fenced. Do you see the o2net thread consuming 100% cpu on any node? If not then probably check your network. thanks, --Srini I forgot to post my /etc/ocfs2/cluster.conf:

node:
        ip_port =
        ip_address = 192.168.0.101
        number = 0
        name = defr1elcbtd01
        cluster = ocfs2

node:
        ip_port =
        ip_address = 192.168.0.102
        number = 1
        name = defr1elcbtd02
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2

192.168.0.10x is eth3 on both nodes, connected with a crossover cable. No active network component is involved here.

defr1elcbtd02:~# traceroute 192.168.0.101
traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte packets
 1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
defr1elcbtd02:~#

The error message looks like a network problem, but why should there be a network problem if I shut down a FC port?! I tested it about 20 times and got about 16 kernel panics starting with the same error message: kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at 192.168.0.101: It isn't an error message, just a status report that we can't connect to that node now. That node may have rebooted or something else, but this node doesn't know; it only knows the connection is down. But node defr1elcbtd01 was never down, and the network link (eth3) wasn't down either. I was able to ping from each node to the other. Node 1 is hosting all services and never faulted while I was testing. All I have to do to panic node 2 is to disable one of the two fibre channel ports, or pull one fibre channel cable, or delete node 2 from the Cisco SAN zoning.
If I apply one of those 3 faults, I get the message about o2net no longer being connected to node 1, and 60 seconds later the 2nd node panics because of ocfs2 fencing (but this happens only in about 80% of cases - in the other 20% of cases o2net does not disconnect and there are no messages about ocfs2 at all - like it should be...). Everything else is working fine in these 60 seconds. The filesystem is still writable from both nodes and both nodes can ping each other (via the cluster interconnect). I just checked your log. The reason node 2 gets the message is that node 1 noticed that node 2 had stopped its disk heartbeat for quite a long time, so node 1 dropped the connection intentionally; that is why node 2 sees this message. See the log on node 1:

Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node 1, 1 total
Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382 node 1 sc: 81007ddf4400 - , valid 1 - 0, err 0 - -107
Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 node 1, 1 total
Jun 8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to node defr1elcbtd02 (num 1) at 192.168.0.102:

And I guess the reason why you see this log only sometimes (80%) is the time interval. The ocfs2 disk heartbeat tries every 2 seconds, so sometimes node 2 panics before node 1 calls o2quo_hb_down, and sometimes node 2 panics after node 1 calls o2quo_hb_down (which will put something like "no longer connected..." in node 2's log). So would you please give your timeout configuration (o2cb)? Regards, Tao Here are the logs with debug logging: Node 2: Jun 8 09:46:11 defr1elcbtd02 kernel: qla2xxx :04:00.0: LOOP DOWN detected (2).
Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc 81007c2f0800 refs 3 sock 8100694138c0 node 0 page 81007fafbb00 pg_off 0] put
Jun 8 09:46:11 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc 81007c2f0800 refs 2 sock 8100694138c0 node 0 page 81007fafbb00 pg_off 0] data_ready hit
Jun 8 09:46:11 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc 81007c2f0800 refs 2 sock 8100694138c0 node 0 page 81007fafbb00 pg_off 0] get
Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc 81007c2f0800 refs 3 sock 8100694138c0 node 0 page 81007fafbb00 pg_off 0] receiving
Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170 [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key num 0] at page_off 24
Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key num 0] processing message
Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc 81007c2f0800 refs 3 sock 8100694138c0 node 0 page 81007fafbb00 pg_off 24] get
Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 [sc 81007c2f0800 refs 4 sock 8100694138c0 node 0 page 81007fafbb00 pg_off 0] ret = 1
Jun 8 09:46:11 defr1elcbtd02 kernel:
Re: [Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA
Hi Florian, florian.engelm...@bt.com wrote: Hi Tao, Hi Florian, florian.engelm...@bt.com wrote: Hi Tao, Hi florian, florian.engelm...@bt.com wrote: Florian, the problem here seems to be with the network. The nodes are running into the network heartbeat timeout and hence the second node is getting fenced. Do you see the o2net thread consuming 100% cpu on any node? If not then probably check your network. thanks, --Srini I forgot to post my /etc/ocfs2/cluster.conf:

node:
        ip_port =
        ip_address = 192.168.0.101
        number = 0
        name = defr1elcbtd01
        cluster = ocfs2

node:
        ip_port =
        ip_address = 192.168.0.102
        number = 1
        name = defr1elcbtd02
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2

192.168.0.10x is eth3 on both nodes, connected with a crossover cable. No active network component is involved here.

defr1elcbtd02:~# traceroute 192.168.0.101
traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte packets
 1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
defr1elcbtd02:~#

The error message looks like a network problem, but why should there be a network problem if I shut down a FC port?! I tested it about 20 times and got about 16 kernel panics starting with the same error message: kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at 192.168.0.101: It isn't an error message, just a status report that we can't connect to that node now. That node may have rebooted or something else, but this node doesn't know; it only knows the connection is down. But node defr1elcbtd01 was never down, and the network link (eth3) wasn't down either. I was able to ping from each node to the other. Node 1 is hosting all services and never faulted while I was testing. All I have to do to panic node 2 is to disable one of the two fibre channel ports, or pull one fibre channel cable, or delete node 2 from the Cisco SAN zoning.
If I apply one of those 3 faults, I get the message about o2net no longer being connected to node 1, and 60 seconds later the 2nd node panics because of ocfs2 fencing (but this happens only in about 80% of cases - in the other 20% of cases o2net does not disconnect and there are no messages about ocfs2 at all - like it should be...). Everything else is working fine in these 60 seconds. The filesystem is still writable from both nodes and both nodes can ping each other (via the cluster interconnect). I just checked your log. The reason node 2 gets the message is that node 1 noticed that node 2 had stopped its disk heartbeat for quite a long time, so node 1 dropped the connection intentionally; that is why node 2 sees this message. See the log on node 1:

Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node 1, 1 total
Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382 node 1 sc: 81007ddf4400 - , valid 1 - 0, err 0 - -107
Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 node 1, 1 total
Jun 8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to node defr1elcbtd02 (num 1) at 192.168.0.102:

And I guess the reason why you see this log only sometimes (80%) is the time interval. The ocfs2 disk heartbeat tries every 2 seconds, so sometimes node 2 panics before node 1 calls o2quo_hb_down, and sometimes node 2 panics after node 1 calls o2quo_hb_down (which will put something like "no longer connected..." in node 2's log). So would you please give your timeout configuration (o2cb)? These are my settings on node 2: O2CB_HEARTBEAT_THRESHOLD=61 O2CB_IDLE_TIMEOUT_MS=6 O2CB_KEEPALIVE_DELAY_MS=4000 O2CB_RECONNECT_DELAY_MS=4000 ocfs2 doesn't allow 2 nodes to have different timeouts. So if nodes 1 and 2 don't have the same configuration, node 2 won't be allowed to join the domain and mount the same volume. Yeah, these parameters look much better. See http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#TIMEOUT for details.
I just went through the whole thread, and the panic because of the storage failure is deliberate behavior: with no disk access, there is no need to survive. See http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#QUORUM Regards, Tao On node 1 I still have the old settings because there was no downtime to reboot this system until today. Is there any way to change the values without a reboot? The system is mission critical and I can only reboot on Sundays. Settings on node 1 are the default settings that came with the Debian package. I changed them to fit the settings of node 2 after the next reboot - so I can only guess they were: O2CB_HEARTBEAT_THRESHOLD=7 O2CB_IDLE_TIMEOUT_MS=12000 O2CB_KEEPALIVE_DELAY_MS=2000 O2CB_RECONNECT_DELAY_MS=2000 Regards, Florian Regards, Tao Here are the logs with debug logging: Node 2: Jun 8 09:46:11 defr1elcbtd02 kernel:
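The timeouts under discussion live in /etc/sysconfig/o2cb (written by `o2cb configure`) and, as Tao notes, must be identical on every node before a node may join the domain. A fragment matching node 2's values quoted above; note that O2CB_IDLE_TIMEOUT_MS appears truncated to "6" in the mail, so the value below is an assumed placeholder, not the poster's actual setting:

```
# /etc/sysconfig/o2cb — must match on all cluster nodes
O2CB_HEARTBEAT_THRESHOLD=61     # disk heartbeat: roughly (61-1) * 2s = 120s
O2CB_IDLE_TIMEOUT_MS=60000      # network idle timeout (assumed; the mail shows "6")
O2CB_KEEPALIVE_DELAY_MS=4000
O2CB_RECONNECT_DELAY_MS=4000
```

Changing these takes effect when the cluster stack is restarted (all ocfs2 volumes unmounted and o2cb taken offline and back online), which answers the "without a reboot" question for planned maintenance windows.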
Re: [Ocfs2-users] enable acl option for ocfs2
Hi Marco, Marco Huang wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi, I am setting up a two-node ocfs2 cluster on debian lenny (2.6.26-1-amd64), but ocfs2 doesn't come with POSIX ACL support by default. Can anyone provide a patch to enable ACLs? ACL support was added to ocfs2 in 2.6.29, so could you please update your kernel version? Regards, Tao
Re: [Ocfs2-users] umount hang + high CPU
Hi, Is there something in your system log? I would guess there should be some info there. Regards, Tao syla...@aim.com wrote: Hi, I had a problem where I got a kernel bug from ocfs2 in the logs. That happened when I unmounted the volume after it had been mounted for a day or two, so I thought I needed to upgrade the kernel (maybe the next version would be bug free). I upgraded to 2.6.30, and now I tried mounting and unmounting the volume right away... and it hung, and the CPU usage got high with that umount process. Please advise. PS. tools and console packages are version 1.4.2.
Re: [Ocfs2-users] ocfs2 acl issue
Hi Marco, From the stack it looks like it isn't ACL related. So could you please file a bug at http://oss.oracle.com/bugzilla/ (just for this; acl is another issue) with all this information? Thanks. And could you please also do: objdump -DSl /lib/modules/`uname -r`/kernel/fs/ocfs2/ocfs2.ko > /tmp/ocfs2.txt and attach the generated file to the bug also. Great thanks. Regards, Tao Marco Huang wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi Tiger, I am also exporting the ocfs2 file system via nfs (with acl) to other servers. I am getting the following kernel panic about 2 times in 3 days. Hope you can look into that; maybe it's related.

Jul 16 12:51:40 webcluster2.sit.auckland.ac.nz [127658.819038] [ cut here ]
[127658.819074] kernel BUG at fs/ocfs2/inode.c:244!
[127658.819100] invalid opcode: [#1] SMP
[127658.819129] last sysfs file: /sys/class/fc_transport/target5:0:3/node_name
[127658.819158] CPU 2
[127658.819180] Modules linked in: netconsole nfsd exportfs nfs lockd nfs_acl auth_rpcgss sunrpc ocfs2 jbd2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ipv6 aoe ext2 dm_round_robin dm_multipath loop snd_pcm snd_timer snd soundcore snd_page_alloc shpchp pcspkr pci_hotplug serio_raw iTCO_wdt button evdev ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod ata_generic sd_mod sg sr_mod cdrom usb_storage qla2xxx ata_piix scsi_transport_fc libata scsi_tgt tg3 mptsas ide_pci_generic ide_core libphy mptscsih mptbase
Re: [Ocfs2-users] git checkout on an ocfs2 filesystem
Hi Joel, Joel Becker wrote: On Mon, Aug 31, 2009 at 12:16:36PM -0700, Joel Becker wrote: On Sun, Aug 30, 2009 at 08:19:08PM -0500, Nathaniel Griswold wrote: Has anyone here had problems with git checkouts on ocfs2? Oh, boy, this is wacky. No, it's extra wacky:

5441 lstat64("t/t6015-rev-list-show-all-parents.sh", 0xffc2e318) = -1 ENOENT (No such file or directory)
5441 lstat64("t", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0
5441 open("t/t6015-rev-list-show-all-parents.sh", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0777) = ? ERESTARTSYS (To be restarted)
5441 --- SIGALRM (Alarm clock) @ 0 (0) ---
5441 sigreturn() = ? (mask now [])
5441 open("t/t6015-rev-list-show-all-parents.sh", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0777) = -1 EEXIST (File exists)
5441 write(2, "error: git-checkout-index: unabl"..., 100) = 100

How on earth is userspace seeing ERESTARTSYS? Did someone forget to -ive it? The ERESTARTSYS may happen when we get interrupted in ocfs2_cluster_lock. I ran into it when I did rm -rf on a very large dir and used Ctrl+C to stop it while testing bug 1162. Regards, Tao
Re: [Ocfs2-users] kernel panic - bug in dlmglue.c ?
Hi John, John McNulty wrote: Hi, I had a system crash last night. Netconsole caught the following trace dump. Has this one been seen before? This bug is fixed in mainline and should show up in the next ocfs2 release. See http://oss.oracle.com/bugzilla/show_bug.cgi?id=1162 Regards, Tao Rgds, John

Sep 10 18:26:04 sdorac2b Kernel BUG at ...ld/src/redhat/BUILD/ocfs2-1.4.2/fs/ocfs2/dlmglue.c:2308
invalid opcode: [1] SMP
last sysfs file: /block/sr0/dev
CPU 4
Modules linked in: hangcheck_timer joydev mptctl ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) nfsd exportfs auth_rpcgss ipv6 xfrm_nalgo crypto_api netconsole oracleasm(U) autofs4 hidp ocfs2(U) nfs lockd fscache nfs_acl rfcomm l2cap bluetooth ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs sunrpc bonding dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st sr_mod cdrom e1000e(U) shpchp hpilo sg serio_raw bnx2(U) pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage qla2xxx scsi_transport_fc ata_piix libata cciss mptspi mptscsih scsi_transport_spi sd_mod scsi_mod mptbase raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 965, comm: nxnode Tainted: G 2.6.18-128.1.14.el5 #1
RIP: 0010:[886cade7] [886cade7] :ocfs2:ocfs2_dentry_lock+0x2f/0xfd
RSP: 0018:8102e7557dc8 EFLAGS: 00010246
RAX: 81042fcc0c00 RBX: 8104216cb800 RCX:
RDX: ff00 RSI: 0001 RDI: 8103f8af5660
RBP: 0005 R08: 81025ce58970 R09: 002ebc0c
R10: 2b19b616b000 R11: 4aa93717 R12: 81029775dd20
R13: R14: 2b92d121d5e0 R15: 8104216cb800
FS: 2b92d162e410() GS:81042ff77cc0() knlGS:
CS: 0010 DS: ES: CR0: 8005003b
CR2: 2b19b5fb3cd8 CR3: 000308119000 CR4: 06e0
Process nxnode (pid: 965, threadinfo 8102e7556000, task 81023f8eb040)
Stack: 8102e7557e58
Re: [Ocfs2-users] core dump
Hi Charlie, Charlie Sharkey wrote: Hi, We got this core dump while running the dd command. I haven't matched the time of the dump with the /var/log/messages file, but I believe it was in response to a cable pull. You are right. I don't have an ocfs2 version for sles, but I guess it should be the same as the ocfs2 repository. The panic is because jbd gave ocfs2 an error in journal_start and set the flag to aborted; ocfs2 considers itself unable to work without jbd, so it panics. Regards, Tao Any ideas? Thanks in advance, charlie Sles10 SP2 x86_64 ocfs2-tools-1.4.0-0.5 ocfs2console-1.4.0-0.5 Linux sr2600-1 2.6.16.60-0.34-smp #1 SMP Fri Jan 16 14:59:01 UTC 2009 x86_64 x86_64 x86_64 GNU/Linux OCFS2 Node Manager 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611) o2cb heartbeat: registered disk mode OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611) OCFS2 DLMFS 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611)

PID: 22456 TASK: 81012ac900c0 CPU: 5 COMMAND: dd
#0 [8101013ff7e0] machine_kexec at 8011c036
#1 [8101013ff8b0] crash_kexec at 80153ea9
#2 [8101013ff970] panic at 80133f01
#3 [8101013ffa60] ocfs2_handle_error at 8859643e
#4 [8101013ffa80] __ocfs2_abort at 88596569
#5 [8101013ffb70] ocfs2_start_trans at 885847f1
#6 [8101013ffba0] ocfs2_write_begin_nolock at 8856563d
#7 [8101013ffc50] ocfs2_write_begin at 88567267
#8 [8101013ffcb0] ocfs2_file_aio_write at 88576bc3
#9 [8101013ffe00] do_sync_write at 80185d25
#10 [8101013fff10] vfs_write at 80186662
#11 [8101013fff40] sys_write at 80186c2b
#12 [8101013fff80] system_call at 8010adba
RIP: 2b107dfe1190 RSP: 7fff2cd9dcc8 RFLAGS: 0246
RAX: 0001 RBX: 8010adba RCX: 2b107dfe1110
RDX: 0010 RSI: 2b107e28a000 RDI: 0001
RBP: 0001 R8: 2b107e28a011 R9:
R10: R11: 0246 R12: 2b107e28a000
R13: R14: 0010 R15: 8010ad56
ORIG_RAX: 0001 CS: 0033 SS: 002b
Re: [Ocfs2-users] renaming a OCFS2 cluster
Hi Werner, Werner Flamme wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi everyone, another problem I did not find a solution for... I ran o2cb configure and configured a cluster named ocfs2. Lazy me, I did not invent a new name. Now this cluster must be renamed to avoid confusion; the name has to change to sapcluster. I changed /etc/ocfs2/cluster.conf to the new name and ran o2cb configure again, but to no avail:

---snip---
# rco2cb configure
Configuring the O2CB driver.
This will configure the on-boot properties of the O2CB driver. The following questions will determine whether the driver is loaded on boot. The current values will be shown in brackets ('[]'). Hitting ENTER without typing an answer will keep that current value. Ctrl-C will abort.
Load O2CB driver on boot (y/n) [y]:
Cluster to start on boot (Enter "none" to clear) [sapcluster]:
Writing O2CB configuration: OK
Mounting configfs filesystem at /config: OK
Loading module ocfs2_dlm: OK
Loading module ocfs2_dlmfs: OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting cluster sapcluster: Failed
o2cb_ctl: Memory allocation failed while setting cluster name
Stopping cluster sapcluster: OK
---pins---

I can mount the volume, but in /config/cluster I still see only the old cluster ocfs2, though /etc/ocfs2/cluster.conf and /etc/sysconfig/cluster both contain the cluster name sapcluster (on all nodes, but it is currently mounted on only one node). Before changing a cluster name, you missed one important step: you need to take the cluster offline first. So the right steps are: 1. /etc/init.d/o2cb offline 2. change /etc/ocfs2/cluster.conf 3. run /etc/init.d/o2cb configure and change the name. Now, since you have already changed the configuration files (steps 2 and 3), there are 2 different ways to get the right cluster up. One is to umount the ocfs2 volume and reboot the box; this is simple.
The other is that you can rename the cluster (in cluster.conf and o2cb configure) back to the name in your /config/cluster and then do the steps I described above (1, 2, 3). Good luck. Regards, Tao BTW, I cannot modify cluster.conf with ocfs2console even though I start it as root, but I have no problems modifying this file via any editor. I'm running SLES9-SP4 on the boxes, btw, and have the following packages installed: ocfs2-tools-1.2.1-4.2 ocfs2console-1.2.1-4.2 Will it help to download and install the newest OCFS2 packages from the website? Regards, Werner -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.12 (GNU/Linux) Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org/ iEYEARECAAYFAkuE8eMACgkQk33Krq8b42MefACeOWDceDV6sJSnSe1qeq/1mvDx axMAn2JIWWMMpjv2oTLBPhcfgXlq/RIv =pzmM -END PGP SIGNATURE-
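Tao's three-step rename can be sketched as a shell session using the cluster names from this thread (mount point illustrative; run the umount on every node before taking the stack offline):

```shell
umount /mnt/ocfs2-volume          # on every node first (path illustrative)
/etc/init.d/o2cb offline ocfs2    # 1. take the OLD cluster offline
vi /etc/ocfs2/cluster.conf        # 2. change cluster/node entries to "sapcluster"
/etc/init.d/o2cb configure        # 3. answer "sapcluster" for the cluster to start on boot
/etc/init.d/o2cb online sapcluster
```

The "Memory allocation failed while setting cluster name" error above is what you get when step 1 is skipped: o2cb_ctl tries to rename a cluster that is still registered in configfs.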
Re: [Ocfs2-users] No Space left on the device.
Hi Aravind, Aravind Divakaran wrote: Hi, My ocfs filesystem has 270gb free space.

Filesystem         Size  Used Avail Use% Mounted on
/dev/mapper/store  501G  232G  270G  47% /data

Inode details for the ocfs filesystem:

Filesystem         Inodes     IUsed     IFree     IUse% Mounted on
/dev/mapper/store  131225616  60588258  70637358    47% /data

I have run stat_sysdir.sh on /dev/mapper/store. The output is at the link below. http://pastebin.com/4VLLCpRa I just went through the output and confirmed that the culprit is free space fragmentation. So please refer to http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189. Regards, Tao Please help me to resolve this issue. Rgds, Aravind M D Hi Aravind, Sounds like you might have hit the free space fragmentation issue: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189 I'm sure that if you post the output of stat_sysdir.sh (http://oss.oracle.com/~seeda/misc/stat_sysdir.sh) one of the ocfs2 devs will be able to confirm this. *If* it is this problem, removing some node slots will help. That is of course if you have more node slots than you need. I think 8 are created by default. Cheers, Brad On Thu, 4 Mar 2010 10:28:49 +0530 (IST) Aravind Divakaran aravind.divaka...@yukthi.com wrote: Hi All, For my mailserver I am using the ocfs2 filesystem configured on a SAN. Now my mail delivery application sometimes complains "No space left on the device", even though there is enough space and there are enough inodes. Can anyone help me to solve this issue. Rgds, Aravind M D
Re: [Ocfs2-users] No Space left on the device.
Hi Aravind,

Aravind Divakaran wrote: Hi, I am facing a problem due to free space fragmentation: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189. In that link it is mentioned that reducing the slots can solve the issue. Right now I have 4 slots. As my ocfs device is configured as a failover for the mail servers of 2 systems, will reducing it to 2 slots solve this issue? Please let me know if this is the command I have to give to reduce the slots to 2.

Decreasing slots just releases the space reserved for the journals, so it doesn't actually help with free space fragmentation as such. I have looked into your system_dir output; the journal size is only 256M, so in total you will get 512M more free space. I don't know whether that is big enough for you.

tunefs.ocfs2 -N 2 /dev/mapper/store

This is an offline action, so you need to umount the volume on all the nodes before going ahead. Another way is to cp the files to another volume, remove them, and then cp them back; they should be contiguous enough afterwards. Regards, Tao

Rgds, Aravind M D

[snip: earlier messages in this thread quoted in full]
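Tao's space estimate above can be checked directly: dropping from 4 slots to 2 removes two per-slot journals of 256 MB each (the sizes are from the thread):

```shell
# Each node slot reserves one journal; removing slots frees that space.
slots_now=4
slots_new=2
journal_mb=256            # journal size reported by stat_sysdir.sh
reclaimed_mb=$(( (slots_now - slots_new) * journal_mb ))
echo "reclaimed: ${reclaimed_mb} MB"   # prints "reclaimed: 512 MB"
```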
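Put together, the offline slot-reduction procedure discussed in this thread looks roughly like this. It is a sketch using the device and mount point from the thread; fsck.ocfs2's -f flag forces a full check, and mounted.ocfs2 -f does a full detection of which nodes have the volume mounted:

```shell
# On EVERY node in the cluster: unmount the volume first.
umount /data

# From any node: verify nothing still has it mounted.
mounted.ocfs2 -f /dev/mapper/store

# On ONE node only: reduce the slot count, then sanity-check.
tunefs.ocfs2 -N 2 /dev/mapper/store
fsck.ocfs2 -f /dev/mapper/store

# Remount on all nodes afterwards.
mount /data
```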
Re: [Ocfs2-users] No Space left on the device.
Hi Brad,

Brad Plant wrote: Hi Tao, On Fri, 05 Mar 2010 14:33:36 +0800 Tao Ma tao...@oracle.com wrote: Another way is that you can cp the file to another volume, remove it and then cp back. It should be contiguous enough.

Assuming we *can* still write to the FS (i.e. as more of a preventative action), would the following do the same?

cp -a a b
mv b a

Can the above work as a *hack* online defrag? Will this reduce the amount of free space fragmentation and therefore stop (or reduce the chance of) this problem occurring? Would we need to do it to the entire file system or just a few larger files?

Yeah, that should work. You can use debugfs.ocfs2 to see whether a file is fragmented or not. Regards, Tao

Cheers, Brad

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] No Space left on the device.
Brad Plant wrote: Hi Tao, On Fri, 05 Mar 2010 15:03:50 +0800 Tao Ma tao...@oracle.com wrote: [snip: the cp/mv "hack online defrag" question and answer quoted from the previous message]

Ok, so it will defrag files, but will it help with the inability to write to the file system reported in http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189 ?

Yes, it helps to defrag files by cp. But the root cause of bug 1189 is that the old allocation mechanism somehow makes the volume more fragmented over time. So the real solution has to be a bug fix in the kernel, which, as Sunil said, is ongoing. Regards, Tao

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
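The cp-then-mv trick and the debugfs.ocfs2 check Tao mentions can be sketched as follows. The file path is purely illustrative; note that the copy must land on the same OCFS2 volume for this to act as a defrag, and that debugfs.ocfs2 paths are relative to the filesystem root, not the mount point:

```shell
# Inspect the extent list of the original file; a long extent list
# means the file is fragmented.
debugfs.ocfs2 -R "stat /mail/bigfile" /dev/mapper/store

# "Hack" online defrag: rewrite the file contiguously, then swap it in.
cp -a /data/mail/bigfile /data/mail/bigfile.defrag
mv /data/mail/bigfile.defrag /data/mail/bigfile

# Re-check: the new copy should have far fewer extents.
debugfs.ocfs2 -R "stat /mail/bigfile" /dev/mapper/store
```

Doing this only for the few largest, most-fragmented files is usually enough; there is no need to rewrite the whole filesystem.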
Re: [Ocfs2-users] No space left on the device
Hi Aravind,

Aravind Divakaran wrote: Hi All, I have already sent one mail regarding the space issue I am facing with my ocfs filesystem. As mentioned in the link below, it is an issue related to free space fragmentation: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189

I have seen a patch for stealing extent allocation which is in the 2.6.34-rc1 kernel, so I compiled a new kernel and installed it on my system. Below are the ocfs2 details:

#modinfo ocfs2
filename:    /lib/modules/2.6.34-rc1/kernel/fs/ocfs2/ocfs2.ko
license:     GPL
author:      Oracle
version:     1.5.0
description: OCFS2 1.5.0
srcversion:  A8B69947E8FF56D74858993
depends:     jbd2,ocfs2_stackglue,quota_tree,ocfs2_nodemanager
vermagic:    2.6.34-rc1 SMP mod_unload modversions

This is my stat_sysdir.sh output: http://pastebin.com/RZH9DkTk

Can anyone help me resolve this? The problem occurs on a production mail server with 3000 email IDs.

I just checked your stat_sysdir output. It isn't caused by extent block allocation, actually, so that patch doesn't help you. Yes, the problem you've met is a fragmentation issue, but the root cause is that inode_alloc can't allocate any more inodes (a little different from 1189). I am now working on discontiguous block groups; I think that will resolve your issue. I hope it can get into mainline in 2.6.35. Regards, Tao

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] No space left on the device
Hi Aravind,

Aravind Divakaran wrote: Hi Tao, [snip: previous message quoted in full]

For my previous mail I got this reply from you: "Another way is that you can cp the file to another volume, remove it and then cp back. It should be contiguous enough."

And as mentioned in 1189: "However, reducing the slot count by 1 (to 4) may not be enough as it does not have much contiguous space. It may work. But reducing it by 2 will definitely work. Umount the volume on all nodes and run: # tunefs.ocfs2 -N 3 /dev/sda1, then run fsck.ocfs2 for sanity checking."

Will either of the above solutions temporarily solve my problem?

Yes, it works. I just replied to you in another e-mail. Regards, Tao

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] No space left on the device
Hi Aravind,

Aravind Divakaran wrote: Hi Tao, [snip: previous messages quoted in full]

After running the tunefs.ocfs2 command I am getting the following error on my console:

node01# tunefs.ocfs2 -N 2 /dev/mapper/store
tunefs.ocfs2 1.4.1
Segmentation fault
node01:~#

Message from sysl...@node01 at Mar 19 05:25:15 ...
kernel:[ 709.834536] [ cut here ]
kernel:[ 709.834678] invalid opcode: [#1] SMP
kernel:[ 709.834820] last sysfs file: /sys/fs/o2cb/interface_revision
kernel:[ 709.838490] Stack:
kernel:[ 709.838490] Call Trace:
kernel:[ 709.838490] Code: 00 00 80 00 f7 c7 00 00 04 00 74 0b 81 e7 ff ff fb ff 0d 00 00 02 00 f7 c7 0 0f 0b eb fe c3 48 8b 47 58 48 8b 40 48 4c 8b 58 08 41 ff e3 48

And my /var/log/messages shows this error:

Mar 19 05:25:15 cmnode01 kernel: [ 709.837896] Pid: 9051, comm: tunefs.ocfs2 Not tainted 2.6.34-rc1 #1 S
Mar 19 05:25:15 cmnode01 kernel: [ 709.837984] RIP: 0010:[a029708b] [a029708b] flag
[snip: register and stack dump; the hex values were garbled in the archive]
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] Process tunefs.ocfs2 (pid: 9051, threadinfo 8802aadc6
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [a029716e] ? o2cb_dlm_lock+0x2b/0x78 [ocfs2_st
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [a029e274] ? user_dlm_cluster_lock+0x2f7/0x44d
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [810f2f43] ? __blockdev_direct_IO+0x93e/0x996
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [a029eef7] ? dlmfs_file_open+0x0/0x17d [ocfs2_
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [a029f038] ? dlmfs_file_open+0x141/0x17d [ocfs
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [810f5b15] ? inotify_d_instantiate+0x12/0x38
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [a029eef7] ? dlmfs_file_open+0x0/0x17d [ocfs2_
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [810cbcbc] ? __dentry_open+0x17f/0x2a1
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [810d6633] ? do_last+0x3a8/0x644
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [810d86d6] ? do_filp_open+0x1ed/0x5f2
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [810affac] ? handle_mm_fault+0x3ee/0x876
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [810de67f] ? touch_atime+0x7c/0x127
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [810cba6b] ? do_sys_open+0x55/0xfc
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] [810028ab] ? system_call_fastpath+0x16/0x1b
Mar 19 05:25:15 cmnode01 kernel: [ 709.838490] RSP 8802aadc7bc0

Can you please help me understand why I am getting this error?

It looks like you hit a kernel panic. Strange. Did you umount the volume before running tunefs.ocfs2? We have never met this before. Could you please file a bug at http://oss.oracle.com/bugzilla/ about it? Thanks. The good thing is that the crash is in dlmfs, so your data should be OK. Regards, Tao

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] No space left on the device
Hi Aravind,

Aravind Divakaran wrote: [snip: previous messages quoted in full]

I am running tunefs.ocfs2 on my 500 GB disk, which contains 215 GB of data, in order to reduce the slots. I used the command below:

tunefs.ocfs2 -N 3 /dev/mapper/store

Now almost 7 hours are over and it still hasn't finished. Below is the output I am getting:

node01:~# tunefs.ocfs2 -N 3 /dev/mapper/store
tunefs.ocfs2 1.4.1

How much time will it take to reduce the slots? Will it finish within 10 hours? Can anyone help me?

It shouldn't take so much time; I guess it got blocked somewhere. So is this volume umounted on all the nodes? If yes, could you please strace it to see what's wrong? Regards, Tao

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
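Tao's suggestion to strace the stuck tunefs.ocfs2 can be done against the already-running process; a sketch (strace -p attaches to a PID, -f follows any children):

```shell
# Find the stuck process and attach to it. If it is blocked in the
# kernel you will typically see it parked in a single syscall.
pid=$(pidof tunefs.ocfs2)
strace -f -p "$pid"
```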
Re: [Ocfs2-users] No space left on the device
Hi Aravind,

Aravind Divakaran wrote: Hi Tao, [snip: previous messages quoted in full]

We have ocfs2 running on a two-node cluster. Before running tunefs.ocfs2 I stopped the ocfs2 service on both systems, which unmounts the ocfs filesystem, and I cross-verified with mounted.ocfs2, which shows the volume unmounted everywhere. I tried to strace it, but strace gives no output for that process id.

Oh, so is the program still running? If yes, 'top' should show its state. If it is in 'D', please run

ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

to see what's wrong. If it is in 'R', I guess there is a dead loop of some kind. Anyway, could you please o2image/compress your disk image and then send it to us? We can test it locally. See man o2image. Regards, Tao

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
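The two diagnostics Tao asks for can be sketched like this (the image path is illustrative; o2image copies filesystem metadata only, not file data, so the result is small enough to compress and mail):

```shell
# Show each process's state and the kernel function it is sleeping in.
# A 'D' (uninterruptible sleep) state plus the WCHAN column usually
# identifies where tunefs.ocfs2 is blocked.
ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | grep tunefs

# Capture a metadata-only image of the volume for the developers,
# then compress it before sending.
o2image /dev/mapper/store /tmp/store.o2i
bzip2 /tmp/store.o2i
```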
Re: [Ocfs2-users] Compile error on RedHat EL5
Hi Kristiansen,

On 05/05/2010 04:34 PM, Kristiansen Morten wrote: Hi, I'm trying to compile ocfs2-tools 1.2.7 on a RedHat EL5 kernel 2.6.18-194.el5, but it fails during the make command:

make[1]: Entering directory `/home/oracle/software/ocfs2/build/build_129/ocfs2-tools-1.2.7/debugfs.ocfs2'
gcc -Wall -O2 -o debugfs.ocfs2 main.o commands.o dump.o utils.o journal.o find_block_inode.o find_inode_paths.o dump_fs_locks.o -L/usr/local/lib -lglib-2.0 -L../libocfs2 -locfs2 -L../libo2cb -lo2cb -lcom_err -lreadline -lncurses
main.o: In function `main': main.c:(.text+0x4b6): undefined reference to `ocfs2_get_lock_type_string'
utils.o: In function `read_whole_file': utils.c:(.text+0x4a8): undefined reference to `ocfs2_extent_map_init'
utils.o: In function `dump_file': utils.c:(.text+0x689): undefined reference to `ocfs2_extent_map_init'
journal.o: In function `read_journal': journal.c:(.text+0x10a): undefined reference to `ocfs2_extent_map_init'
collect2: ld returned 1 exit status
make[1]: *** [debugfs.ocfs2] Error 1
make[1]: Leaving directory `/home/oracle/software/ocfs2/build/build_129/ocfs2-tools-1.2.7/debugfs.ocfs2'
make: *** [debugfs.ocfs2] Error 2

Can anybody tell me what's wrong?

Where did you get the src?

The reason for installing 1.2.9 is that we are currently running RHEL4 and ocfs2-1.2.4. Now we want to reinstall the servers with RHEL5 and ocfs2-1.2.9, install Oracle with clusterware, and then upgrade ocfs2 from 1.2.9 to 1.4.7.

I don't get your point in installing 1.2.9. 1.4.7 does work with an old-format volume; there is no problem with that. Regards, Tao

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
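One plausible cause of undefined references like these (an assumption on my part, not confirmed anywhere in the thread) is that -L/usr/local/lib appears before -L../libocfs2 on the link line, so a stale, older libocfs2 installed under /usr/local/lib could be picked up instead of the freshly built in-tree one. A quick check:

```shell
# Look for a previously installed libocfs2 that could shadow the
# in-tree library during linking.
ls -l /usr/local/lib/libocfs2*

# If one exists, see whether it lacks the symbol the link step wants.
nm --defined-only /usr/local/lib/libocfs2.a | grep ocfs2_extent_map_init
```

If a stale copy is found, removing it (or rebuilding from a clean tree) and re-running make would tell you whether this guess is right.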
Re: [Ocfs2-users] Compile error on RedHat EL5
Hi Kristiansen,

On 05/05/2010 06:05 PM, Kristiansen Morten wrote: The source is downloaded from http://oss.oracle.com/projects/ocfs2-tools/files/source/v1.2/

Forgot to tell you that we are reinstalling one node at a time, while the other nodes are still in the Oracle cluster. So you won't be able to mount the ocfs2 disks on the newly reinstalled node while the two other nodes have the disks mounted with ocfs2-1.2.4. Therefore, on all nodes: reinstall with RHEL5, install ocfs2-1.2.9, reinstall Oracle with clusterware and submit the cluster. Then, on all nodes: upgrade ocfs2 from 1.2.9 to 1.4.7.

Yes, this is the way to upgrade the ocfs2 kernel module. But ocfs2-tools 1.4.7 can work with ocfs2-1.2.9, IIRC. Did you run into any problem? Regards, Tao

It's much faster to upgrade only ocfs2 than to reinstall everything. Morten K

-----Original message----- From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On behalf of Tao Ma Sent: 5 May 2010 11:18 To: Kristiansen Morten Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] Compile error on RedHat EL5

[snip: previous message quoted in full]

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] List of issues resolved by ocfs2 patch
Hi Hogas,

Hogas Ciprian wrote: Hello guys, where can I see a list of the issues resolved by an ocfs2 patch release? For example, I want to see what problems OCFS2 version 1.4.1-1 solves. Thanks a lot.

You can check the release notes: http://oss.oracle.com/projects/ocfs2/news/ has the information. As for 1.4.1-1, it is the first release of 1.4, so you can check http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.4/ocfs2-1_4-usersguide.pdf for details. Regards, Tao

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] ocfs2 debug tools
Hi Nicola,

On 05/30/2010 10:23 PM, Mailing List SVR wrote: Hi Sunil, even with the latest ocfs2 release (1.4.7 on rhel5) I'm having several issues. My systems hang practically every two days when a lot of small files are deleted (about 200,000 files, 50-120 KB each). Can you please describe some procedures to collect info about what operations ocfs2 is doing, so we can understand the problem?

Could you please explain in a little more detail what your system hang looks like? Is the system unresponsive? And what is your workaround? Any system log? Do you have the magic SysRq key enabled? Use sysrq+t to dump the state of all tasks. Regards, Tao

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
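Tao's SysRq suggestion can be sketched like this (requires root; on many distributions the magic SysRq key must first be enabled):

```shell
# Enable the magic SysRq key if it isn't already.
echo 1 > /proc/sys/kernel/sysrq

# Dump the state and stack of every task to the kernel log, then read
# it back; look for ocfs2/dlm threads stuck in 'D' state.
echo t > /proc/sysrq-trigger
dmesg | less
```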
Re: [Ocfs2-users] O2CB_HEARTBEAT_THRESHOLD won't take changes
Hi Elliott,

Elliott Perrin wrote: Hello All, I have multiple OCFS2 clusters on SLES10 SP2 running Xen. We needed to increase O2CB_HEARTBEAT_THRESHOLD from 31 to 61 and did so successfully on 2 of our 3 clusters. However, on one of the three clusters we are not able to change the value. The /etc/sysconfig/o2cb file contains 61 as the threshold after reconfiguring via /etc/init.d/o2cb configure. We reconfigure all 3 nodes at the same time after unmounting the volume and stopping ocfs2 and o2cb on all nodes in the cluster, but on restart the value is reset to 31. We have even gone so far as to change the default value in the /etc/init.d/o2cb script from 31 to 61, thinking that must be where the setting of 31 was coming from, and even then the value is set back to 31. Oh, and yes, we have rebooted with the new values in both the /etc/sysconfig/o2cb config and with our change of the default value in /etc/init.d/o2cb, with no luck.

If I understand correctly, you are saying that /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold can't be set to 61, right? And even after a restart, the heartbeat threshold is still 31? That would be very strange. So could you please do the following steps:

1. /etc/init.d/o2cb offline: this takes the cluster offline. Please make sure you don't see anything in /sys/kernel/config/cluster/ afterwards.
2. grep HEARTBEAT /etc/sysconfig/o2cb and paste the output here, so that we know the configuration file has been changed.
3. strace /etc/init.d/o2cb online and attach the output here; let's see what's going on.
4. cat /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold to see the new value.

Regards, Tao

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
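The four diagnostic steps above map onto these commands (a sketch; the cluster name ocfs2 is the default and may differ on your systems):

```shell
# 1. Take the cluster offline; the configfs directory should then be empty.
/etc/init.d/o2cb offline
ls /sys/kernel/config/cluster/

# 2. Show the configured threshold.
grep HEARTBEAT /etc/sysconfig/o2cb

# 3. Bring the cluster back online under strace, capturing the trace to
#    a file so it can be attached to a list reply.
strace -f -o /tmp/o2cb-online.trace /etc/init.d/o2cb online

# 4. Read back the value the kernel actually took.
cat /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold
```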
Re: [Ocfs2-users] O2CB_HEARTBEAT_THRESHOLD won't take changes
Elliott Perrin wrote: Hello Tao, Hi Elliott, Elliott Perrin wrote: Hello All, I have multiple OCFS2 clusters on SLES10 SP2 running Xen. We needed to increase the O2CB_HEARTBEAT_THRESHOLD from 31 up to 61 and did so successfully on 2 of our 3 clusters. However on one of the three clusters we are not able to change the value. The /etc/sysconfig/o2cb file contains 61 as the threshold after reconfiguring via /etc/init.d/o2cb configure, we reconfigure all 3 nodes at the same time after unmounting the volume and stopping ocfs2 and o2cb on all nodes in the cluster but on restart the value is reset to 31. We have even gone so far as to change the default value in the /etc/init.d/o2cb script from 31 to 61 thinking that must be where the setting of 31 is coming from and even then, the value is set back to 31. Oh, and yes, we have rebooted with the new values in both the /etc/sysconfig/o2cb config and with our change of the default value in /etc/init.d/o2cb with no luck. I would deem that you said that /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold can't be set to 61, right? Yes that is correct. You said even with restart, the heartbeat threadshold is still 31? That would be too strange. I thought the same as well. So could you please do the following steps: 1. /etc/init.d/o2cb offline this will make the cluster offline. Please make sure you didn't see anything in /sys/kernel/config/cluster/. That folder did disappear during restart but I am not sure if the admin I was working with did an offline or a stop. We will do an offline during scheduled maintenance tomorrow. IIRC, stop only works if you set up o2cb enabled(You set this when do /etc/init.d/o2cb configure and in the optionLoad O2CB driver on boot (y/n) .) So if you don't have it enabled, stop does nothing. 2. grep HEARTBEAT /etc/sysconfig/o2cb and paste here so that we know the configuration file has been changed. 
> elliott.per...@:~ grep HEART /etc/sysconfig/o2cb
> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
> O2CB_HEARTBEAT_THRESHOLD=61
> # O2CB_HEARTBEAT_MODE: Whether to use the native kernel or the user
> O2CB_HEARTBEAT_MODE=kernel

Yeah, so the configuration file is right.

>> 3. strace /etc/init.d/o2cb online and attach the output here; let's see what's going on.
>> 4. cat /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold to see the new value.

> elliott.per...@xxx:~ cat /sys/kernel/config/cluster/XXX/heartbeat/dead_threshold
> 31

Oh...

> Thank you Tao, we will be performing maintenance on this cluster in the next couple of days and will try these steps.

No problem. Let us wait and see what's wrong.

Regards,
Tao
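One extra cross-check — my suggestion, not something from the thread: while the cluster is online, write the threshold straight into configfs and watch whether it sticks. If it reads back 61 but later reverts to 31, some other script or agent is rewriting the value behind your back.

```shell
# Hypothetical cross-check (replace XXX with your actual cluster name):
# set dead_threshold directly in configfs, then re-read it.
echo 61 > /sys/kernel/config/cluster/XXX/heartbeat/dead_threshold
cat /sys/kernel/config/cluster/XXX/heartbeat/dead_threshold

# If this prints 61 now but 31 again after the next o2cb restart, the
# init path is overriding it; if the echo itself fails, the kernel is
# rejecting the write and dmesg may say why.
```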
Re: [Ocfs2-users] OCFS2 performance - disk random access time problem
Hi Proskurin,

On 06/02/2010 05:23 PM, Proskurin Kirill wrote:
> On 01/06/10 22:34, Sunil Mushran wrote:
>> The kernel is old. We fixed this issue in 2.6.30. We have also backported it to the 1.4 production tree. The problem was that the inodes being created did not have locality, leading to a directory having inodes that were spaced far apart from each other. The one place where it really affected performance was rm.
> Thank you for the reply! Is it enough to update the kernel, or do the tools need to be updated too?

If you only want to keep using the already-formatted volume, updating the kernel is enough. But if you want to use some of the new features we have added, you should update ocfs2-tools as well.

Regards,
Tao
Re: [Ocfs2-users] OCFS2 performance - disk random access time problem
Proskurin Kirill wrote:
> On 02/06/10 13:26, Tao Ma wrote:
>>> Is it enough to update the kernel, or do the tools need to be updated too?
>> If you only want to keep using the already-formatted volume, updating the kernel is enough. But if you want to use some of the new features we have added, you should update ocfs2-tools as well.
> If I understand right, you tie a tools release to a kernel release. For the 2.6.32 kernel, which tools are preferred?

No, we don't tie them, actually. We do have 1.2 vs. 1.4 for the enterprise kernels, but for the mainline kernel we don't have this limitation. So using the latest source code from ocfs2-tools is OK; or, if you want to use a built rpm, use 1.4.* since it is almost the latest.

Regards,
Tao
Re: [Ocfs2-users] OCFS2 performance - disk random access time problem
Adding Mark Fasheh (mfas...@suse.com) and Coly Li (coly...@suse.de) to cc, since they know which ocfs2 kernel version SUSE uses.

Angelo McComis wrote:
> On 01/06/10 22:34, Sunil Mushran wrote:
>> The kernel is old. We fixed this issue in 2.6.30. We have also backported it to the 1.4 production tree. The problem was that the inodes being created did not have locality, leading to a directory having inodes that were spaced far apart from each other. The one place where it really affected performance was rm.

I've seen this rm problem in the production version of SLES 10.3 + updates. Is this fix available in the SLES Enterprise kernels yet? 10.3.x or 11.1.x (11.1 officially releases today, btw).

Thanks,
Angelo
Re: [Ocfs2-users] 'No space left on device' error with plenty of space.
Hi Jason,

On 06/09/2010 11:34 PM, Jason Price wrote:
> And now it's starting to fail again.

What is the situation now? I checked your stat_sysfs output, and it looks as though you have space left in the inode, extent, and local allocators (although the kernel may not have flushed the metadata to disk yet, since stat_sysfs only reads the disk). So why are you hitting ENOSPC? Can you describe it in more detail? Do you hit it when touching a new file, when cat'ing some bytes into a file, or ...? Once you find the failing scenario, please enable the debugfs logging so that we can find the real cause:

debugfs.ocfs2 -l INODE allow
debugfs.ocfs2 -l DISK_ALLOC allow
(run your test case here)
debugfs.ocfs2 -l INODE off
debugfs.ocfs2 -l DISK_ALLOC off

Regards,
Tao

> --Jason
> On Wed, Jun 9, 2010 at 9:51 AM, Jason Price japr...@gmail.com wrote: I've got a busy FTP/Web cluster running OCFS2 v1.4.4. I've started getting "No space left on device" errors when users attempt to write to the file system. Disk utilization is about 76%, with more than 100 GB free. Inode utilization is also at 76%. I thought this was a manifestation of bug #1189, so I decreased the number of node slots via tunefs.ocfs2 from 8 (the default) down to 3 (there are only 2 nodes in the cluster, with no growth anticipated). That got me out of the woods on Monday, but this morning the problem manifested again. I've opened bug #1263 about this issue (link: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1263 ). Does anyone have other ideas? I'm more than happy to supply other information. What seems to happen is that small writes are allowed, but bigger writes fail. On Monday, I could write multiple 325 KB files, and I could cat them together to make one file of ~2 MB, but when I tried to make a 10ish MB file, it failed.
> --Jason
Re: [Ocfs2-users] OCFS2 error.
Hi Veeraa,

On 06/23/2010 10:46 AM, veeraa bose wrote: Hi Team, we are getting the below errors on a shared disk on a VMware guest operating system:

Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:3:0: reservation conflict
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:3:0: SCSI error: return code = 0x0018
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: end_request: I/O error, dev sdg, sector 2367
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: (swapper,0,0):o2hb_bio_end_io:237 ERROR: IO Error -5
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: (o2hb-DEEDA3062A,4504,0):o2hb_do_disk_heartbeat:768 ERROR: status = -5
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:3:0: reservation conflict
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:3:0: SCSI error: return code = 0x0018
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: end_request: I/O error, dev sdg, sector 2367
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: (syslogd,4298,0):o2hb_bio_end_io:237 ERROR: IO Error -5
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: (o2hb-DEEDA3062A,4504,0):o2hb_do_disk_heartbeat:768 ERROR: status = -5
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:3:0: reservation conflict
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:3:0: SCSI error: return code = 0x0018
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: end_request: I/O error, dev sdg, sector 8259921
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:2:0: reservation conflict
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:2:0: SCSI error: return code = 0x0018
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: end_request: I/O error, dev sdf, sector 8268113
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:2:0: reservation conflict
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: sd 1:0:2:0: SCSI error: return code = 0x0018
Jun 23 01:46:12 SCRBXLPDEFRM635 kernel: end_request: I/O error, dev sdf, sector 8309329

We are getting Input/output errors on files on the OCFS2 FS when we try to copy them. Please let us know what the problem could be.
Yes, you are getting I/O errors (-5 is EIO), so this isn't related to ocfs2; there is likely a problem with the block device that VMware provides (note the "reservation conflict" messages coming from the SCSI layer, below the filesystem). So could you please check whether you can write to the device successfully with 'dd'?

Regards,
Tao
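A sketch of such a dd check follows. Writing to the real shared device is destructive, so the runnable part exercises a scratch file instead; the commented line shows the (destructive) form against the device named in the logs.

```shell
# Safe, runnable sketch: verify that a synchronous write succeeds.
scratch=$(mktemp)
if dd if=/dev/zero of="$scratch" bs=1M count=4 conv=fsync 2>/dev/null; then
    echo "write OK"
else
    echo "write FAILED"
fi
rm -f "$scratch"

# Against the real (shared, DESTRUCTIVE!) device it would be something like:
#   dd if=/dev/zero of=/dev/sdg bs=4k count=1 oflag=direct
# A "reservation conflict" from another host would make this fail with EIO
# just like the heartbeat writes in the log above.
```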
Re: [Ocfs2-users] df showing wrong size
Hi Garcia,

On 06/28/2010 02:17 PM, Garcia, Raymundo wrote: Hello… it was brought to my attention that a partition on one of our production systems was displaying the wrong size with the df command… 123 GB… but in fact the size of all the files is a mere 15 GB… What is going on? Shall we use fsck.ocfs2 to fix it? It is strange…

Is the 123 GB the used size df reports for the volume? df uses statfs(2) to check the size, so in general you get the used size for the whole volume, including files that haven't been freed yet and are sitting in the orphan dir. The 15 GB is what you get from du, I guess? du iterates over the directories and checks each file, so it only counts every visible file, not orphan files.

Regards,
Tao

> Thanks for any comment…. Raymundo Garcia
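The df/du discrepancy Tao describes can be demonstrated with a short sketch: a file that is unlinked but still held open keeps consuming blocks (so statfs(2), and therefore df, still counts them), yet a du-style directory walk no longer sees it. Files in the OCFS2 orphan dir — deleted while still in use on some node — behave the same way.

```python
import os
import tempfile

# Create a directory with one 1 MiB file, then unlink the file while
# keeping it open -- mimicking an "orphan" file.
d = tempfile.mkdtemp()
path = os.path.join(d, "orphan")
f = open(path, "wb")
f.write(b"x" * (1 << 20))
f.flush()
os.fsync(f.fileno())
os.unlink(path)  # gone from the namespace, but blocks stay allocated

# du-style accounting: walk the tree and sum the visible file sizes.
visible = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, files in os.walk(d)
    for name in files
)
print(visible)  # 0 -- the open-but-unlinked file is invisible to du

f.close()  # only now is the space really released (cf. the orphan dir)
os.rmdir(d)
```

df (via os.statvfs on the mount point) would still have reported the megabyte as used until the close; that gap is exactly the 123 GB vs. 15 GB difference, at a larger scale.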
Re: [Ocfs2-users] Too much journaling or not ?
Hi Somsak,

On 07/30/2010 12:54 AM, Somsak Sriprayoonsakul wrote: Hi, (I am on the same team as Mr. Wanchat.) Just to note that we already format OCFS2 with the -T mail option. As noted below, data=writeback, noatime, and an increased commit interval are already in place. The weird thing about this problem is that OCFS2 works for about 30 days; after that the I/O delay increases greatly, to the point that the system is not usable at all. This problem is quite serious for us — it disrupts our services every month. Right now the only cure we have found is to format another SAN and copy everything there (we did it with rsync). The old SAN, containing the troubled file system, is still intact. We can play around with it since it is not in production anymore, so we can do some debugging on it. Just let us know what we should do.

I would guess this is caused by file system fragmentation. So do you have stat_sysdir output? Also, if you don't mind, can you give us an o2image of the disk? With it we can try to reproduce the problem locally. For more information about o2image, see its man page. Please also tell us how to reproduce the case in your environment.

Regards,
Tao

Somsak soms...@gmail.com

On Wed, Jul 28, 2010 at 12:33 PM, wanchat padungrat wancha...@pantip.com wrote: Dear Sunil, thank you for your prompt reply. Yes, we do mount with data=writeback. The following are our current mount options: type ocfs2 (rw,_netdev,noatime,commit=15,data=writeback,heartbeat=local) Wanchat P.

2010/7/28 Sunil Mushran sunil.mush...@oracle.com: Have you tried mounting with data=writeback?

On Jul 27, 2010, at 9:31 PM, wanchat padungrat wancha...@pantip.com wrote: Dear all, not really sure whether this is a bug or not, but we found that sometimes OCFS2 on our system does a lot of journaling.
(Please see the screen shot below.) As you can see, the I/O jumped from 111 w/s to 11,960 w/s, I/O utilization jumped from 1.5% to 97%, and %iowait jumped from 0.25% to 10.94%. This would not be a problem if it happened only for a short period (e.g. 1-2 sec.), but sometimes we find it persists for half an hour or even more. Is this normal for OCFS2 or not? How could I tune OCFS2 to avoid the above situation? We also suspect a relation between this symptom and another problem similar to http://oss.oracle.com/pipermail/ocfs2-users/2009-January/003250.html , because we find the OCFS2 file system pauses for long periods of time too (5-20 sec. on open for write). These pauses are scattered throughout the day, more frequently at high-load times. (Screen shot below.) Our configuration is 3 web servers connected to HP SAN storage using OCFS2 1.4.7. Among those 3 servers, 2 are for reading data and 1 is for writing data (traffic is separated at the proxy layer). The symptoms above happen on the data-writing server. Thank you. Wanchat Padungrat

PS. We keep our service alive by reinstalling the whole data set every month; without doing a whole data copy (which consumes 12 hours), the symptoms get worse and worse, e.g. journaling persists longer and file system pauses get longer (sometimes more than a minute!!), which affects the whole system: too many waiting Apache processes, load average skyrocketing (sometimes 400+) and not recovering on its own, etc.

Screen shot - here is iostat + top output while the journaling burst is running.
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1.88   0.00     5.38    10.94    0.00  81.81

Device:  rrqm/s   wrqm/s    r/s      w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00     0.00   0.00     0.00     0.00      0.00      0.00      0.00   0.00   0.00   0.00
sda1       0.00     0.00   0.00     0.00     0.00      0.00      0.00      0.00   0.00   0.00   0.00
sda2       0.00     0.00   0.00     0.00     0.00      0.00      0.00      0.00   0.00   0.00   0.00
sdb        0.00     0.00   0.00     0.00     0.00      0.00      0.00      0.00   0.00   0.00   0.00
sdb1       0.00     0.00   0.00     0.00     0.00      0.00      0.00      0.00   0.00   0.00   0.00
sdc        0.00  5156.00   5.00  6804.50    19.00  47892.25     14.07      9.76   1.43   0.14  97.55
sdc1       0.00  5156.00   5.00  6804.50    19.00  47892.25     14.07      9.76   1.43   0.14  97.60
sdd        0.00     0.00   0.00     0.00     0.00      0.00      0.00      0.00   0.00   0.00   0.00
sdd1       0.00     0.00   0.00     0.00     0.00      0.00      0.00      0.00   0.00   0.00   0.00
sde        0.00
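For reference, capturing the image Tao asks for might look like the sketch below. The device name is a placeholder; o2image copies filesystem metadata only (no file contents), so the resulting image is comparatively small and contains no user data.

```shell
# Sketch: dump the metadata of the troubled, now out-of-production
# volume into an image file (/dev/sdX is a placeholder).
o2image /dev/sdX /tmp/troubled-volume.o2i

# The developers can later lay that metadata back onto a scratch device
# with the install option to reproduce the on-disk layout locally:
#   o2image -I /dev/scratch /tmp/troubled-volume.o2i
```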
Re: [Ocfs2-users] No space left on device
Hi all,

On 09/08/2010 04:11 PM, Alexander Barton wrote:
> Hi Sunil! Are there special steps one has to follow to recover a filesystem that has been used with a buggy kernel? We had this problem with a Debian 2.6.27 kernel and updated to a recent "mainline" kernel 2.6.33.x — but we are still seeing the same problem: "no space left" is reported when trying to write files to the filesystem (a few megabytes are often "too much"), although df(1) shows a lot (gigabytes) of free space.
> - Kernel 2.6.33-2-amd64 (2.6.33-1~experimental.4)
> - OCFS2 Tools 1.4.3-1

Discontig block group was added in the latest development cycle, which means 2.6.35, IIRC. As for the tools part, I have sent the patches to the mailing list, and there is still one patch waiting for review.

> Any hints? Do you need more information?

Yes: to make sure you are really hitting the discontig block group problem, we need your stat_sysdir output.

Regards,
Tao

> Regards, Alex
> On 08.09.2010 at 04:12, Sunil Mushran wrote: Which kernel are you using? We have fixed this issue in mainline. We will soon have the same fix for production kernels.
> On 09/07/2010 02:06 PM, Todd Freeman wrote: From reading the archives I can see this issue has been hit before, but I haven't found a resolution. I have a 50 GB partition... I have formatted it at 10 GB. I have it set up for 4 cluster members and am using 3 of those slots. I fill the partition to 66% and voilà... no space left on device. I have tried it with big files and with lots of small files, and both ways I hit this error at 66% usage. I am using ubuntu-server with ocfs2-tools 1.4.2-1. If anyone has ideas/solutions I would be most grateful... this FS is awesome :P
Re: [Ocfs2-users] No space left on device
Hi,

Alexander Barton wrote:
> Hi Tao!
> On 08.09.2010 at 10:53, Tao Ma wrote:
>> Yes: to make sure you are really hitting the discontig block group problem, we need your stat_sysdir output.
> It is about 275 KBytes, so I uploaded it here: http://gist.github.com/569970

Is it still happening now? Your volume does seem to have the same kind of problem: a large number of block groups have only small contiguous free spaces, which can't be used to allocate an inode. So please do wait for the ocfs2-tools patches to be included and for the mainline kernel 2.6.35.

Regards,
Tao
Re: [Ocfs2-users] No space left on device
Alexander Barton wrote:
> Hi Tao!
> On 08.09.2010 at 16:22, Tao Ma wrote:
>> Is it still happening now?
> No. We deleted "lots" of small cache files and are currently able to write to the filesystem.

Yeah, that is why I said your volume "seems to..." in my previous e-mail. ;) Since deleting some small files helped, it does look like the fragmentation problem. btw, I expect to commit the ocfs2-tools patches soon, and then you can try them with 2.6.35.

Regards,
Tao

>> Your volume does seem to have the same kind of problem: a large number of block groups have only small contiguous free spaces, which can't be used to allocate an inode. So please do wait for the ocfs2-tools patches to be included and for the mainline kernel 2.6.35.
> Ok, thanks a lot for your answer! Regards, Alex
Re: [Ocfs2-users] No space left on device
On 09/21/2010 04:52 PM, Alexander Barton wrote:
> Hi Tao!
> On 09.09.2010 at 02:29, Tao Ma wrote:
>> btw, I expect to commit the ocfs2-tools patches soon, and then you can try them with 2.6.35.
> Ok, now we are seeing the problem again and want to try a new kernel and the new OCFS2 tools — there aren't that many files left we could delete instead … So kernel 2.6.35.4 would be ok?

It should work.

> And OCFS2 tools from the GIT master branch? Or a special tag? There is no archive or release, right?

I have already committed the patches to ocfs2-tools, so you can get them with: git clone git://oss.oracle.com/git/ocfs2-tools.git
I guess there is no release containing them yet.

> So are there any special steps we have to follow? Some magic fsck.ocfs2 runs? Splashing chicken blood in the moonlight? …? Or is simply installing the new kernel and the new tools enough, and "everything" should be working fine again afterwards?

No need for fsck.ocfs2. Just run: tunefs.ocfs2 --fs-features=discontig-bg /dev/sdx
(which means you have to umount the volume from all the nodes first), then mount it again and it should work. If you hit any problem, please let us know. Thanks.

Make sure you see 'discontig-bg' in the output of the following command: echo 'stats' | debugfs.ocfs2 /dev/sdx | grep Incompat

> Thanks for your support!

No problem.

Regards,
Tao
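Collected into one sequence, the procedure Tao describes is roughly the following (the device name /dev/sdx is a placeholder from the thread; the volume really must be unmounted on every node before tunefs.ocfs2 runs):

```shell
# Sketch of the upgrade steps above; /dev/sdx is a placeholder.
# 0. umount the OCFS2 volume on ALL cluster nodes first.

# 1. Enable the discontiguous-block-group feature:
tunefs.ocfs2 --fs-features=discontig-bg /dev/sdx

# 2. Verify the incompat feature bit is now set -- 'discontig-bg'
#    should appear in the output:
echo 'stats' | debugfs.ocfs2 /dev/sdx | grep Incompat

# 3. Remount the volume on all nodes (kernel must be >= 2.6.35).
```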
Re: [Ocfs2-users] No space left on device
On 09/29/2010 05:13 PM, Alexander Barton wrote: Hello again!

On 21.09.2010 at 11:04, Tao Ma wrote: I have already committed the patches to ocfs2-tools, so you can get them with: git clone git://oss.oracle.com/git/ocfs2-tools.git

We upgraded both of our cluster nodes last Friday to:
- Debian Linux kernel "2.6.35-trunk-amd64" (linux-image-2.6.35-trunk-amd64_2.6.35-1~experimental.3_amd64.deb), which is 2.6.35.4 plus Debian patches
- OCFS2 tools 1.6.3 from GIT

Since then, our cluster is VERY unstable: we get lots of "general protection faults" and hard lockups. "Lots" as in "often more than 2 times a day".

Sorry for the trouble.

Our scenario is OCFS2 on top of DRBD. It looks like the "crash pattern" is the following. On node 2:

cl1-n2 kernel: [ 4006.829327] general protection fault: [#21] SMP
cl1-n2 kernel: [ 4006.829487] last sysfs file: /sys/devices/platform/coretemp.7/temp1_label
cl1-n2 kernel: [ 4006.829558] CPU 1
cl1-n2 kernel: [ 4006.829611] Modules linked in: ocfs2 jbd2 quota_tree tun xt_tcpudp iptable_filter hmac sha1_generic ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue iptable_nat nf_nat configfs nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext2 coretemp drbd lru_cache cn loop hed tpm_tis snd_pcm snd_timer snd soundcore psmouse snd_page_alloc processor tpm pcspkr evdev joydev tpm_bios dcdbas serio_raw i5k_amb button rng_core shpchp pci_hotplug i5000_edac edac_core ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod usbhid hid sg sr_mod cdrom ata_generic sd_mod ses crc_t10dif enclosure ata_piix ehci_hcd uhci_hcd usbcore bnx2 libata nls_base megaraid_sas scsi_mod e1000e thermal fan thermal_sys [last unloaded: scsi_wait_scan]
cl1-n2 kernel: [ 4006.833215]
cl1-n2 kernel: [ 4006.833215] Pid: 7699, comm: apache2 Tainted: G D 2.6.35-trunk-amd64 #1 0H603H/PowerEdge 2950 cl1-n2 kernel: [ 4006.833215] RIP: 0010:[810e1886] [810e1886] __kmalloc+0xd3/0x136 cl1-n2 kernel: [ 4006.833215] RSP: 0018:88012e277cd8 EFLAGS: 00010006 cl1-n2 kernel: [ 4006.833215] RAX: RBX: 6f635f6465727265 RCX: a0686032 cl1-n2 kernel: [ 4006.833215] RDX: RSI: 88012e277da8 RDI: 0004 cl1-n2 kernel: [ 4006.833215] RBP: 81625520 R08: 880001a52510 R09: 0003 cl1-n2 kernel: [ 4006.833215] R10: 88009a561b40 R11: 88022d62f400 R12: 000b cl1-n2 kernel: [ 4006.833215] R13: 8050 R14: 8050 R15: 0246 cl1-n2 kernel: [ 4006.833215] FS: 7f9199715740() GS:880001a4() knlGS: cl1-n2 kernel: [ 4006.833215] CS: 0010 DS: ES: CR0: 8005003b cl1-n2 kernel: [ 4006.833215] CR2: 402de9d0 CR3: 0001372a1000 CR4: 000406e0 cl1-n2 kernel: [ 4006.833215] DR0: DR1: DR2: cl1-n2 kernel: [ 4006.833215] DR3: DR6: 0ff0 DR7: 0400 cl1-n2 kernel: [ 4006.833215] Process apache2 (pid: 7699, threadinfo 88012e276000, task 88009a561b40) cl1-n2 kernel: [ 4006.833215] Stack: cl1-n2 kernel: [ 4006.833215] 8801b1af9c20 a0686032 8801b1a2da20 88018f5f30c0 cl1-n2 kernel: [ 4006.833215]0 88012e277e88 000a 88018d105300 88009a561b40 cl1-n2 kernel: [ 4006.833215]0 88012e277da8 a0686032 88012e277e88 88012e277da8 cl1-n2 kernel: [ 4006.833215] Call Trace: cl1-n2 kernel: [ 4006.833215] [a0686032] ? ocfs2_fast_follow_link+0x166/0x284 [ocfs2] cl1-n2 kernel: [ 4006.833215] [810f29fa] ? do_follow_link+0xdb/0x24c cl1-n2 kernel: [ 4006.833215] [810f2d55] ? link_path_walk+0x1ea/0x482 cl1-n2 kernel: [ 4006.833215] [810f311f] ? path_walk+0x63/0xd6 cl1-n2 kernel: [ 4006.833215] [810f27ba] ? path_init+0x46/0x1ab cl1-n2 kernel: [ 4006.833215] [810f3288] ? do_path_lookup+0x20/0x85 cl1-n2 kernel: [ 4006.833215] [810f3cd9] ? user_path_at+0x46/0x78 cl1-n2 kernel: [ 4006.833215] [81038bac] ? pick_next_task_fair+0xe6/0xf6 cl1-n2 kernel: [ 4006.833215] [81305101] ? schedule+0x4d4/0x530 cl1-n2 kernel: [ 4006.833215] [81060526] ? 
prepare_creds+0x87/0x9c cl1-n2 kernel: [ 4006.833215] [810e8649] ? sys_faccessat+0x96/0x15b cl1-n2 kernel: [ 4006.833215] [810089c2] ? system_call_fastpath+0x16/0x1b cl1-n2 kernel: [ 4006.833215] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48
Re: [Ocfs2-users] Journal replay after crash, kernel BUG at fs/ocfs2/journal.c:1700!, 2.6.36
Hi Ronald,

On 10/29/2010 05:12 PM, Ronald Moesbergen wrote: Hello, I was testing kernel 2.6.36 (vanilla mainline) and encountered the following BUG():

[157756.266000] o2net: no longer connected to node app01 (num 0) at 10.2.25.13:
[157756.266077] (o2hb-5FA56B1D0A,2908,0):o2dlm_eviction_cb:267 o2dlm has evicted node 0 from group 5FA56B1D0A9249099CE58C82CFEC873A
[157756.274443] (ocfs2rec,14060,0):dlm_get_lock_resource:836 5FA56B1D0A9249099CE58C82CFEC873A:M186ba2b09b: at least one node (0) to recover before lock mastery can begin
[157757.275776] (ocfs2rec,14060,0):dlm_get_lock_resource:890 5FA56B1D0A9249099CE58C82CFEC873A:M186ba2b09b: at least one node (0) to recover before lock mastery can begin
[157760.774045] (dlm_reco_thread,2920,2):dlm_get_lock_resource:836 5FA56B1D0A9249099CE58C82CFEC873A:$RECOVERY: at least one node (0) to recover before lock mastery can begin
[157760.774124] (dlm_reco_thread,2920,2):dlm_get_lock_resource:870 5FA56B1D0A9249099CE58C82CFEC873A: recovery map is not empty, but must master $RECOVERY lock now
[157760.774205] (dlm_reco_thread,2920,2):dlm_do_recovery:523 (2920) Node 1 is the Recovery Master for the Dead Node 0 for Domain 5FA56B1D0A9249099CE58C82CFEC873A
[157768.261818] (ocfs2rec,14060,0):ocfs2_replay_journal:1605 Recovering node 0 from slot 0 on device (8,32)
[157772.850182] [ cut here ]
[157772.850211] kernel BUG at fs/ocfs2/journal.c:1700!

Strange. The BUG line is BUG_ON(osb->node_num == node_num); it fires when the node doing recovery finds its own node number equal to that of the dead node it is recovering. So could you please grab the mount info from the system logs of the 2 nodes? The message looks like:

Oct 27 16:24:21 ocfs2-test2 kernel: ocfs2: Mounting device (8,8) on (node 2, slot 0) with ordered data mode.

It tells us which node and slot the volume used.
Regards, Tao [157772.850238] invalid opcode: [#1] SMP [157772.850270] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map [157772.850314] CPU 0 [157772.850320] Modules linked in: ip_vs_wrr ip_vs nf_conntrack ocfs2 jbd2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding ipv6 ipmi_devintf cpufreq_ondemand acpi_cpufreq freq_table mperf loop ipmi_si ipmi_msghandler hpilo hpwdt container snd_pcm serio_raw psmouse snd_timer snd soundcore tpm_tis tpm tpm_bios pcspkr iTCO_wdt snd_page_alloc button processor evdev ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod cdrom usbhid hid ata_piix ata_generic cciss libata scsi_mod ide_pci_generic ide_core ehci_hcd bnx2 e1000e uhci_hcd thermal fan thermal_sys [157772.850758] [157772.850779] Pid: 14060, comm: ocfs2rec Not tainted 2.6.36 #2 /ProLiant DL360 G6 [157772.850823] RIP: 0010:[a03da8c3] [a03da8c3] __ocfs2_recovery_thread+0x474/0x137f [ocfs2] [157772.850916] RSP: 0018:880084f49e00 EFLAGS: 00010246 [157772.850943] RAX: 0001 RBX: 88011dd07108 RCX: 88011d3fe344 [157772.850986] RDX: 88011d3fe340 RSI: 0001 RDI: 88011dd07108 [157772.851029] RBP: 880118479000 R08: R09: [157772.851073] R10: R11: 0400 R12: 88011faff800 [157772.851116] R13: 0001 R14: 88011dd07000 R15: [157772.851159] FS: () GS:88000160() knlGS: [157772.851205] CS: 0010 DS: ES: CR0: 8005003b [157772.851232] CR2: 01e88b58 CR3: 00011dd26000 CR4: 06f0 [157772.851275] DR0: DR1: DR2: [157772.851318] DR3: DR6: 0ff0 DR7: 0400 [157772.851362] Process ocfs2rec (pid: 14060, threadinfo 880084f48000, task 88009bd9e9c0) [157772.851407] Stack: [157772.851427] 8800 88010008 0020 [157772.851462]0 88009bd9ece8 88009bd9e9c0 88009bd9ece8 88009bd9e9c0 [157772.851515]0 88009bd9ece8 88009bd9e9c0 88009bd9ece8 88009bd9e9c0 [157772.851584] Call Trace: [157772.851611] 
[a03da44f] ? __ocfs2_recovery_thread+0x0/0x137f [ocfs2] [157772.851657] [81044aed] ? kthread+0x7e/0x86 [157772.851684] [81002b94] ? kernel_thread_helper+0x4/0x10 [157772.851713] [81044a6f] ? kthread+0x0/0x86 [157772.851739] [81002b90] ? kernel_thread_helper+0x0/0x10 [157772.851766] Code: 89 1c 24 41 b9 a0 06 00 00 49 c7 c0 50 01 42 a0 48 c7 c7 a9 9f 42 a0 31 c0 e8 1d 0c e7 e0 8b 74 24 74 41 39 b6 38 01 00 00 75 040f 0b eb fe 48 c7 84 24 a0 00 00 00 00 00 00 00 48 c7 84 24 98 [157772.851973] RIP [a03da8c3] __ocfs2_recovery_thread+0x474/0x137f [ocfs2] [157772.852024]
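The mount message Tao asks for can be fished out of the logs on each node with something like the sketch below (the log file locations are an assumption on my part — they vary by distribution):

```shell
# Sketch: find the OCFS2 mount line (it names the node number and slot)
# on each of the two nodes. Adjust the log path for your distro.
grep "ocfs2: Mounting device" /var/log/messages /var/log/kern.log 2>/dev/null

# Or, for the current boot only:
dmesg | grep "ocfs2: Mounting device"
```

If both nodes report the same node number, that directly explains the BUG_ON(osb->node_num == node_num) firing during journal replay.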
Re: [Ocfs2-users] Journal replay after crash, kernel BUG at fs/ocfs2/journal.c:1700!, 2.6.36
Ronald Moesbergen wrote:
> 2010/10/29 Tao Ma tao...@oracle.com:
>> Strange. The BUG line is BUG_ON(osb->node_num == node_num); it detects the same node number in the cluster.
> Hi Tao, thanks for looking into this.
> [...]
> I just tried to reproduce it and succeeded. Here's what I did:
> - unmount the filesystem on node app02
> - shutdown the o2cb services on app02
> - do a halt -f on app01, which still has the OCFS2 volume mounted
> - start the o2cb services on app02
> - mount the OCFS2 filesystem
> - BUG

Thanks for the test. I will look at it.
Regards,
Tao
Re: [Ocfs2-users] Journal replay after crash, kernel BUG at fs/ocfs2/journal.c:1700!, 2.6.36
Hi Ronald, On 10/29/2010 06:03 PM, Ronald Moesbergen wrote: 2010/10/29 Ronald Moesbergenintercom...@gmail.com: 2010/10/29 Tao Matao...@oracle.com: Hi Ronald, Hi Tao, Thanks for looking into this. On 10/29/2010 05:12 PM, Ronald Moesbergen wrote: Hello, I was testing kernel 2.6.36 (vanilla mainline) and encountered the following BUG(): [157756.266000] o2net: no longer connected to node app01 (num 0) at 10.2.25.13: [157756.266077] (o2hb-5FA56B1D0A,2908,0):o2dlm_eviction_cb:267 o2dlm has evicted node 0 from group 5FA56B1D0A9249099CE58C82CFEC873A [157756.274443] (ocfs2rec,14060,0):dlm_get_lock_resource:836 5FA56B1D0A9249099CE58C82CFEC873A:M186ba2b09b: at least one node (0) to recover before lock mastery can begin [157757.275776] (ocfs2rec,14060,0):dlm_get_lock_resource:890 5FA56B1D0A9249099CE58C82CFEC873A:M186ba2b09b: at least one node (0) to recover before lock mastery can begin [157760.774045] (dlm_reco_thread,2920,2):dlm_get_lock_resource:836 5FA56B1D0A9249099CE58C82CFEC873A:$RECOVERY: at least one node (0) to recover before lock mastery can begin [157760.774124] (dlm_reco_thread,2920,2):dlm_get_lock_resource:870 5FA56B1D0A9249099CE58C82CFEC873A: recovery map is not empty, but must master $RECOVERY lock now [157760.774205] (dlm_reco_thread,2920,2):dlm_do_recovery:523 (2920) Node 1 is the Recovery Master for the Dead Node 0 for Domain 5FA56B1D0A9249099CE58C82CFEC873A [157768.261818] (ocfs2rec,14060,0):ocfs2_replay_journal:1605 Recovering node 0 from slot 0 on device (8,32) [157772.850182] [ cut here ] [157772.850211] kernel BUG at fs/ocfs2/journal.c:1700! Strange. the bug line is BUG_ON(osb-node_num == node_num); and it detects the same node number in the cluster. I just tried to reproduce it and succeeded. Here's what I did: - unmount the filesystem on node app02 - shutdown the o2cb services on app02 - Do a halt -f on app01, which still has the OCFS2 volume mounted. - Start o2cb services on app02 - Mount the OCFS2 filesystem - BUG Works everytime. 
So one of the two variables checked in that BUG_ON statement must not be set correctly somewhere. I have tried several times in my local test environment, but with no luck so far, and it seems quite strange to me at least from the code. So could you please file a bug at oss.oracle.com/bugzilla so that it is easier to track and discuss? Many thanks.

Regards, Tao

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] Pb with ocfs2 dlm on Fedora 13
Hi Alain,

On 11/08/2010 11:08 PM, Alain.Moulle wrote:
Hi, I have a problem on Fedora 13 with these releases:
ocfs2 1.4.3-5.fc13.x86_64
dlm_tool 3.0.17
With a 3-node ocfs2 cluster, I can't mount the FS on all three nodes at the same time, only on two of the three nodes, whichever two they are. The errors are:
(1475,0):o2net_connect_expired:1656 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
(2175,0):dlm_request_join:1035 ERROR: status = -107
(2175,0):dlm_try_to_join_domain:1209 ERROR: status = -107
(2175,0):dlm_join_domain:1487 ERROR: status = -107
(2175,0):dlm_register_domain:1753 ERROR: status = -107
(2175,0):o2cb_cluster_connect:313 ERROR: status = -107
(2175,0):ocfs2_dlm_init:2995 ERROR: status = -107
(2175,0):ocfs2_mount_volume:1789 ERROR: status = -107
ocfs2: Unmounting device (8,16) on (node 0)
o2net: no longer connected to node selfxl-4 (num 0) at 10.197.189.204:
o2net: connected to node selfxl-4 (num 0) at 10.197.189.204:
It seems to be a lock management problem. Is it an already known issue? Is there an available patch?

It doesn't look like a dlm problem, but a network problem. ;) Your first error is o2net_connect_expired, so it seems that the 3rd node can't connect to node 2. Could you please check the error messages on node 2? Btw, I would assume that cluster.conf is the same on all 3 nodes, and that you can connect to the port (which is used by ocfs2) of node 2 from node 3.

Regards, Tao
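Tao's connectivity check can be sketched as a small helper. The o2net port is usually 7777, but the authoritative value is whatever the port line in /etc/ocfs2/cluster.conf says; the function name is made up and the IP is taken from the log above. This version uses bash's /dev/tcp so it needs no extra tools:

```shell
# Hypothetical helper: test TCP reachability of a node's o2net listener.
# Default port 7777 is the usual o2net port; confirm against the port
# value in /etc/ocfs2/cluster.conf.
o2net_reachable() {
  local host="$1" port="${2:-7777}"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example, run from node 3 (IP from the log above):
#   o2net_reachable 10.197.189.204 && echo "o2net port open on selfxl-4"
```

A successful TCP connect here only proves the port is open; o2net can still drop the connection later (for example on a cluster.conf mismatch), so the logs on the accepting node matter too.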
Re: [Ocfs2-users] Pb with ocfs2 dlm on Fedora 13
Hi Alain,

On 11/09/2010 04:49 PM, Alain.Moulle wrote:
Hi, the three cluster.conf files are exactly the same on the 3 nodes. The error messages are:
- node1:
o2net: accepted connection from node selfxl-5 (num 1) at 10.197.189.218:
o2net: no longer connected to node selfxl-5 (num 1) at 10.197.189.218:
- node2:
(1457,1):o2net_connect_expired:1656 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
Note that once a mount is refused, for example on node3, if I umount the FS on node1, then I can mount it on node3.

Oh, so do you have enough slots for all these 3 nodes to mount? What's the output of the command below?
echo 'stats'|debugfs.ocfs2 /dev/sdx|grep Slots

Regards, Tao

Note also that when the mount is refused, for example on node3, I've checked that node3 successfully pings both other nodes on the IP addresses given in cluster.conf. Alain
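Tao's slot question can be turned into a small check. The debugfs.ocfs2 invocation is the one from the thread; the helper function, the parsed numbers, and the device name are illustrative:

```shell
# Illustrative helper: does the volume have enough node slots for the
# cluster? The slot count would come from the command Tao gives, e.g.:
#   echo 'stats' | debugfs.ocfs2 /dev/sdx | grep Slots
slots_ok() {
  local slots="$1" nodes="$2"
  [ "$slots" -ge "$nodes" ]
}

# A 2-slot volume in a 3-node cluster reproduces the symptom in the
# thread: the third concurrent mount is refused until one node unmounts.
if ! slots_ok 2 3; then
  echo "too few slots: mount on the 3rd node will be refused"
  # With the volume unmounted on every node, slots can be raised with
  # tunefs.ocfs2, e.g.:
  #   tunefs.ocfs2 -N 3 /dev/sdx
fi
```

The "umount on node1, then node3 can mount" behavior Alain describes is exactly what a too-small slot count looks like, which is why checking it is a natural next step.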