From: Kushnir, Michael (NIH/NLM/LHC) [C] [mailto:michael.kush...@nih.gov]
Sent: Wednesday, April 25, 2012 12:52 PM
To: Nathan Patwardhan; ocfs2-users@oss.oracle.com
Subject: RE: RHEL 5.8, ocfs2 v1.4.4, stability issues


> Hi Nathan,
>
> I've no particular insight on splunk...
>
> We use OCFS2 1.4 with RHEL 5.8 VMs on ESX 4.1 and ESXi 5 as storage for a
> Hadoop cluster, as well as an index and document store for application
> servers running a search engine application (Java, Tomcat, Apache). We also
> export the same volumes from a few cluster nodes over NFS for developer
> access. We've had no stability issues or corruption.

I think I've identified the issue.

http://communities.vmware.com/thread/164967

In short, we had been allocated a shared vmdk on a shared SCSI bus rather than 
as an RDM.  After some digging through the logs, I found a bunch of SCSI 
reservation conflicts.  I'll get set up with an RDM within the next week or so 
and will update with results as I have them.
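
For anyone else chasing this, here's roughly how I spotted the conflicts 
(assuming RHEL's default syslog location; adjust paths for your environment):

# reservation conflicts land in the kernel ring buffer and in syslog
dmesg | grep -i "reservation conflict"
grep -i "reservation conflict" /var/log/messages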


From: Nathan Patwardhan [mailto:npatward...@llbean.com]
Sent: Tuesday, April 24, 2012 9:44 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] RHEL 5.8, ocfs2 v1.4.4, stability issues

Hi everyone,

We're running a two-node cluster with ocfs2 v1.4.4 under RHEL 5.8 and we're 
having major stability issues:

[root@splwww02 ~]# lsb_release -a ; uname -a ; rpm -qa |grep -i ocfs
LSB Version:    
:core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 5.8 (Tikanga)
Release:        5.8
Codename:       Tikanga

Linux splwww02.llbean.com 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 
2012 x86_64 x86_64 x86_64 GNU/Linux

ocfs2-2.6.18-308.1.1.el5-1.4.7-1.el5
ocfs2-tools-1.4.4-1.el5
ocfs2console-1.4.4-1.el5

These RPMs were downloaded and installed from oracle.com.

The nodes are ESX VMs and the storage is shared to both VMs through vSphere.

We're using ocfs2 because we needed shared back-end storage for a splunk 
implementation: splunk web search heads can use pooled storage, and we didn't 
want to just use NFS for the obvious reasons, so shared storage seemed to be a 
better solution all the way around.

/etc/ocfs2/cluster.conf contains the following on each node:

cluster:
        node_count = 2
        name = splwww

node:
        ip_port = 7777
        ip_address = 10.130.245.40
        number = 0
        name = splwww01
        cluster = splwww

node:
        ip_port = 7777
        ip_address = 10.130.245.41
        number = 1
        name = splwww02
        cluster = splwww

Note that each node uses eth0 for clustering.  We do NOT currently have a 
private network set up in ESX for this purpose, but CAN add one if people here 
recommend it.
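
If we did move the heartbeat onto a private network, my understanding is the 
only change would be the ip_address lines in cluster.conf on both nodes 
(the addresses below are hypothetical):

node:
        ip_port = 7777
        ip_address = 192.168.100.40
        number = 0
        name = splwww01
        cluster = splwww

node:
        ip_port = 7777
        ip_address = 192.168.100.41
        number = 1
        name = splwww02
        cluster = splwww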

o2cb status on both nodes:

Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster splwww: Online
Heartbeat dead threshold = 3
  Network idle timeout: 90000
  Network keepalive delay: 5000
  Network reconnect delay: 5000
Checking O2CB heartbeat: Active

/etc/sysconfig/o2cb:

O2CB_ENABLED=true
O2CB_STACK=o2cb
O2CB_BOOTCLUSTER=splwww
O2CB_HEARTBEAT_THRESHOLD=3
O2CB_IDLE_TIMEOUT_MS=90000
O2CB_KEEPALIVE_DELAY_MS=5000
O2CB_RECONNECT_DELAY_MS=5000
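
For reference, changes to this file get applied roughly like so (a sketch 
based on my understanding of the stock o2cb init script on EL5):

# prompts for cluster name, heartbeat threshold, timeouts, etc., and
# rewrites /etc/sysconfig/o2cb
service o2cb configure
# restart to pick up new values; the ocfs2 volume must be unmounted first
umount /opt/splunk/pooled
service o2cb restart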

Each host mounts the (ocfs2) file system, then turns around and exports it 
over NFS and mounts the export from localhost, as shown below.  (Yes, I know 
this is awful, but splunk does not support ocfs2 and as such is unable to lock 
files properly no matter what options I give when mounting an ocfs2 file 
system.)

# From /etc/fstab
/dev/mapper/splwwwshared-splwwwshared--lv /opt/splunk/pooled ocfs2 
_netdev,datavolume,errors=remount-ro,rw,noatime 0 0

# From /etc/exports
/opt/splunk/pooled 127.0.0.1(rw,no_root_squash)

# From /bin/mount
/dev/mapper/splwwwshared-splwwwshared--lv on /opt/splunk/pooled type ocfs2 
(rw,_netdev,noatime,datavolume,errors=remount-ro,heartbeat=local)
localhost:/opt/splunk/pooled on /opt/splunk/mnt type nfs 
(rw,nordirplus,addr=127.0.0.1)
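
Spelled out as the sequence each host runs (a sketch of what the configs above 
amount to, not a verbatim boot script):

# 1. mount the shared ocfs2 volume per /etc/fstab
mount /opt/splunk/pooled
# 2. export it over NFS per /etc/exports
exportfs -ra
# 3. loop-mount the export so splunk sees ordinary NFS locking
mount -t nfs -o nordirplus localhost:/opt/splunk/pooled /opt/splunk/mnt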

The above should, in theory, let us enjoy the benefits of an ocfs2 
shared/clustered file system while functionally supporting what splunk needs.  
Unfortunately, we have experienced a number of stability problems, summarized 
here:


1. Cluster timeouts, node(s) leaving the cluster:

Apr 21 21:03:50 splwww01 kernel: 
(o2hb-DA9199FC9F,3090,0):o2hb_do_disk_heartbeat:768 ERROR: status = -52

Apr 21 21:03:52 splwww01 kernel: 
(o2hb-DA9199FC9F,3090,0):o2hb_do_disk_heartbeat:777 ERROR: Device "dm-5": 
another node is heartbeating in our slot!



2. ocfs2 crashes: the NFS mount becomes unavailable, /opt/splunk/pooled 
becomes inaccessible, and splunk crashes:

Apr 24 06:53:43 splwww01 kernel: 
(o2hb-DA9199FC9F,3089,0):ocfs2_dlm_eviction_cb:98 device (253,5): dlm has 
evicted node 1


3. Inode irregularities on one or both nodes:

Apr 24 06:40:47 splwww01 kernel: (nfsd,3330,1):ocfs2_read_locked_inode:472 
ERROR: status = -5



Apr 21 10:09:31 splwww01 kernel: 
(ocfs2rec,7676,0):ocfs2_clear_journal_error:690 ERROR: File system error -5 
recorded in journal 0.

Apr 21 10:09:31 splwww01 kernel: 
(ocfs2rec,7676,0):ocfs2_clear_journal_error:692 ERROR: File system on device 
dm-5 needs checking.



4. Full system crashes on both nodes.

Ideally, I'd love to continue using ocfs2 because it solves an age-old problem 
for us, but since we can't keep either system stable for more than an hour or 
so, I'm looking for more insight into what we're seeing and how we might 
resolve some of these issues.

Note also that I'm using monit to watch a .state file I created under 
/opt/splunk/pool and then re-mount all the file systems, restart splunk, etc., 
but this only takes us so far once the system(s) crash and reboot.
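
For context, the monit check looks roughly like this (a sketch; the recovery 
script path is ours and hypothetical):

check file pooled_state with path /opt/splunk/pool/.state
  # when the ocfs2/NFS stack wedges, the file stops being accessible
  if does not exist then exec "/usr/local/bin/remount_splunk_pool.sh"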

--
Nathan Patwardhan, Sr. System Engineer
npatward...@llbean.com
x26662

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
