So, forget the test environment in this case. Here is the normal environment, which is not fully productive yet, so I can run tests on it. Fencing (SCSI-3 persistent reservation) is configured and tested, and it works. With the cluster configured to use it, the LVs are still down and the cluster is not able to mount the filesystem. However, I can mount it manually, and the clustered LV activation flags also look as expected: -a- on one node and --- on the other. Here are the logs, the outputs, and the config:
[root@linuxsap2 cluster]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="14" name="linuxsap-c">
    <clusternodes>
        <clusternode name="linuxsap1-priv" nodeid="1">
            <fence>
                <method name="Method">
                    <device name="fence_dev"/>
                </method>
            </fence>
            <unfence>
                <device action="on" name="fence_dev"/>
            </unfence>
        </clusternode>
        <clusternode name="linuxsap2-priv" nodeid="2">
            <fence>
                <method name="Method">
                    <device name="fence_dev"/>
                </method>
            </fence>
            <unfence>
                <device action="on" name="fence_dev"/>
            </unfence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="3"/>
    <quorumd label="qdisk_dev"/>
    <rm>
        <failoverdomains>
            <failoverdomain name="FOD-Teszt" nofailback="1" ordered="1" restricted="0">
                <failoverdomainnode name="linuxsap1-priv" priority="1"/>
                <failoverdomainnode name="linuxsap2-priv" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <lvm name="vg_PRD_oracle" vg_name="vg_PRD_oracle"/>
            <fs device="/dev/vg_PRD_oracle/lv_PRD_orabin" fsid="32283" fstype="ext4" mountpoint="/oracle/PRD" name="PRD_orabin"/>
        </resources>
        <service autostart="0" domain="FOD-Teszt" name="FS_teszt" recovery="disable">
            <lvm ref="vg_PRD_oracle"/>
            <fs ref="PRD_oralog1"/>
        </service>
    </rm>
    <fencedevices>
        <fencedevice agent="fence_scsi" name="fence_dev"/>
    </fencedevices>
</cluster>
[root@linuxsap2 cluster]#

Aug 10 20:10:07 linuxsap1 rgmanager[9680]: Service service:FS_teszt is recovering
Aug 10 20:10:07 linuxsap1 rgmanager[9680]: #71: Relocating failed service service:FS_teszt
Aug 10 20:10:08 linuxsap1 rgmanager[9680]: Service service:FS_teszt is stopped
Aug 10 20:11:21 linuxsap1 rgmanager[9680]: Starting stopped service service:FS_teszt
Aug 10 20:11:21 linuxsap1 rgmanager[10777]: [lvm] Starting volume group, vg_PRD_oracle
Aug 10 20:11:21 linuxsap1 rgmanager[10801]: [lvm] Failed to activate volume group, vg_PRD_oracle
Aug 10 20:11:21 linuxsap1 rgmanager[10823]: [lvm] Attempting cleanup of vg_PRD_oracle
Aug 10 20:11:22 linuxsap1 rgmanager[10849]: [lvm] Failed second attempt to activate vg_PRD_oracle
Aug 10 20:11:22 linuxsap1 rgmanager[9680]: start on lvm "vg_PRD_oracle" returned 1 (generic error)
Aug 10 20:11:22 linuxsap1 rgmanager[9680]: #68: Failed to start service:FS_teszt; return value: 1
Aug 10 20:11:22 linuxsap1 rgmanager[9680]: Stopping service service:FS_teszt

[root@linuxsap1 cluster]# lvs | grep PRD
  lv_PRD_oraarch   vg_PRD_oracle  -wi-a---  30.00g
  lv_PRD_orabin    vg_PRD_oracle  -wi-a---  10.00g
  lv_PRD_oralog1   vg_PRD_oracle  -wi-a---   1.00g
  lv_PRD_oralog2   vg_PRD_oracle  -wi-a---   1.00g
  lv_PRD_sapdata1  vg_PRD_oracle  -wi-a--- 408.00g
  lv_PRD_sapmnt    vg_PRD_sapmnt  -wi-a---  10.00g
  lv_PRD_trans     vg_PRD_trans   -wi-a---  40.00g
  lv_PRD_usrsap    vg_PRD_usrsap  -wi-a---   9.00g

[root@linuxsap2 cluster]# lvs | grep PRD
  lv_PRD_oraarch   vg_PRD_oracle  -wi-----  30.00g
  lv_PRD_orabin    vg_PRD_oracle  -wi-----  10.00g
  lv_PRD_oralog1   vg_PRD_oracle  -wi-----   1.00g
  lv_PRD_oralog2   vg_PRD_oracle  -wi-----   1.00g
  lv_PRD_sapdata1  vg_PRD_oracle  -wi----- 408.00g
  lv_PRD_sapmnt    vg_PRD_sapmnt  -wi-a---  10.00g
  lv_PRD_trans     vg_PRD_trans   -wi-a---  40.00g
  lv_PRD_usrsap    vg_PRD_usrsap  -wi-a---   9.00g

[root@linuxsap1 cluster]# mount /dev/vg_PRD_oracle/lv_PRD_orabin /oracle/PRD
[root@linuxsap1 cluster]# df -k /oracle/PRD/
Filesystem                               1K-blocks    Used    Available Use% Mounted on
/dev/mapper/vg_PRD_oracle-lv_PRD_orabin  10321208    4753336  5043584   49%  /oracle/PRD
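For completeness, these are roughly the checks I can run on both nodes to compare the LVM view (a rough sketch; vg_PRD_oracle is the VG from the config above, and which settings matter depends on whether clvmd or tag-based HA-LVM is the intended mode):

  # Is the VG flagged as clustered? ('c' in the 6th character of vg_attr)
  vgs -o vg_name,vg_attr,vg_tags vg_PRD_oracle

  # locking_type should be 3 when clvmd is in use; volume_list only matters for tag-based HA-LVM
  grep -E '^[[:space:]]*(locking_type|volume_list)' /etc/lvm/lvm.conf

  # Try an exclusive activation by hand to see the raw error LVM/clvmd reports
  vgchange -aey vg_PRD_oracle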
On 08/10/2012 06:46 PM, Digimer wrote:
> Not sure if it relates, but I can say that without fencing, things will
> break in strange ways. The reason is that if anything triggers a fault,
> the cluster blocks by design and stays blocked until a fence call
> succeeds (which is impossible without fencing configured in the first
> place).
>
> Can you please setup fencing, test to make sure it works (using
> 'fence_node rhel2.local' from rhel1.local, then in reverse)? Once this
> is done, test again for your problem. If it still exists, please paste
> the updated cluster.conf then. Also please include syslog from both
> nodes around the time of your LVM tests.
>
> digimer
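(For reference, this is roughly how the fence test suggested above can be run against the current pair, plus a quick look at the SCSI-3 reservations that fence_scsi manages; /dev/sdX is a placeholder for one of the shared LUNs behind vg_PRD_oracle:

  # fence one node from the other, then repeat in the opposite direction
  fence_node linuxsap2-priv        # run on linuxsap1
  fence_node linuxsap1-priv        # run on linuxsap2, once the first node has rejoined

  # list the registered keys and the active reservation on a shared LUN
  sg_persist -n -i -k -d /dev/sdX
  sg_persist -n -i -r -d /dev/sdX

While both nodes are cluster members, a key should be registered for each of them.)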
>
> On 08/10/2012 12:38 PM, Poós Krisztián wrote:
>> This is the cluster conf, Which is a clone of the problematic system on
>> a test environment (without the ORacle and SAP instances, only focusing
>> on this LVM issue, with an LVM resource)
>>
>> [root@rhel2 ~]# cat /etc/cluster/cluster.conf
>> <?xml version="1.0"?>
>> <cluster config_version="7" name="teszt">
>>     <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>>     <clusternodes>
>>         <clusternode name="rhel1.local" nodeid="1" votes="1">
>>             <fence/>
>>         </clusternode>
>>         <clusternode name="rhel2.local" nodeid="2" votes="1">
>>             <fence/>
>>         </clusternode>
>>     </clusternodes>
>>     <cman expected_votes="3"/>
>>     <fencedevices/>
>>     <rm>
>>         <failoverdomains>
>>             <failoverdomain name="all" nofailback="1" ordered="1" restricted="0">
>>                 <failoverdomainnode name="rhel1.local" priority="1"/>
>>                 <failoverdomainnode name="rhel2.local" priority="2"/>
>>             </failoverdomain>
>>         </failoverdomains>
>>         <resources>
>>             <lvm lv_name="teszt-lv" name="teszt-lv" vg_name="teszt"/>
>>             <fs device="/dev/teszt/teszt-lv" fsid="43679" fstype="ext4" mountpoint="/lvm" name="teszt-fs"/>
>>         </resources>
>>         <service autostart="1" domain="all" exclusive="0" name="teszt" recovery="disable">
>>             <lvm ref="teszt-lv"/>
>>             <fs ref="teszt-fs"/>
>>         </service>
>>     </rm>
>>     <quorumd label="qdisk"/>
>> </cluster>
>>
>> Here are the log parts:
>> Aug 10 17:21:21 rgmanager I am node #2
>> Aug 10 17:21:22 rgmanager Resource Group Manager Starting
>> Aug 10 17:21:22 rgmanager Loading Service Data
>> Aug 10 17:21:29 rgmanager Initializing Services
>> Aug 10 17:21:31 rgmanager /dev/dm-2 is not mounted
>> Aug 10 17:21:31 rgmanager Services Initialized
>> Aug 10 17:21:31 rgmanager State change: Local UP
>> Aug 10 17:21:31 rgmanager State change: rhel1.local UP
>> Aug 10 17:23:23 rgmanager Starting stopped service service:teszt
>> Aug 10 17:23:25 rgmanager Failed to activate logical volume, teszt/teszt-lv
>> Aug 10 17:23:25 rgmanager Attempting cleanup of teszt/teszt-lv
>> Aug 10 17:23:29 rgmanager Failed second attempt to activate teszt/teszt-lv
>> Aug 10 17:23:29 rgmanager start on lvm "teszt-lv" returned 1 (generic error)
>> Aug 10 17:23:29 rgmanager #68: Failed to start service:teszt; return value: 1
>> Aug 10 17:23:29 rgmanager Stopping service service:teszt
>> Aug 10 17:23:30 rgmanager stop: Could not match /dev/teszt/teszt-lv with a real device
>> Aug 10 17:23:30 rgmanager stop on fs "teszt-fs" returned 2 (invalid argument(s))
>> Aug 10 17:23:31 rgmanager #12: RG service:teszt failed to stop; intervention required
>> Aug 10 17:23:31 rgmanager Service service:teszt is failed
>> Aug 10 17:24:09 rgmanager #43: Service service:teszt has failed; can not start.
>> Aug 10 17:24:09 rgmanager #13: Service service:teszt failed to stop cleanly
>> Aug 10 17:25:12 rgmanager Starting stopped service service:teszt
>> Aug 10 17:25:14 rgmanager Failed to activate logical volume, teszt/teszt-lv
>> Aug 10 17:25:15 rgmanager Attempting cleanup of teszt/teszt-lv
>> Aug 10 17:25:17 rgmanager Failed second attempt to activate teszt/teszt-lv
>> Aug 10 17:25:18 rgmanager start on lvm "teszt-lv" returned 1 (generic error)
>> Aug 10 17:25:18 rgmanager #68: Failed to start service:teszt; return value: 1
>> Aug 10 17:25:18 rgmanager Stopping service service:teszt
>> Aug 10 17:25:19 rgmanager stop: Could not match /dev/teszt/teszt-lv with a real device
>> Aug 10 17:25:19 rgmanager stop on fs "teszt-fs" returned 2 (invalid argument(s))
>>
>> After I manually started the lvm on node1 and tried to switch it on
>> node2 it's not able to start it.
>>
>> Regards,
>> Krisztian
>>
>> On 08/10/2012 05:15 PM, Digimer wrote:
>>> On 08/10/2012 11:07 AM, Poós Krisztián wrote:
>>>> Dear all,
>>>>
>>>> I hope that anyone run into this problem in the past, so maybe can help
>>>> me resolving this issue.
>>>>
>>>> There is a 2 node rhel cluster with quorum also.
>>>> There are clustered lvms, where the -c- flag is on.
>>>> If I start clvmd all the clustered lvms became online.
>>>>
>>>> After this if I start rgmanager, it deactivates all the volumes, and not
>>>> able to activate them anymore as there are no such devices anymore
>>>> during the startup of the service, so after this, the service fails.
>>>> All lvs remain without the active flag.
>>>>
>>>> I can manually bring it up, but only if after clvmd is started, I set
>>>> the lvms manually offline by the lvchange -an <lv>
>>>> After this, when I start rgmanager, it can take it online without
>>>> problems. However I think, this action should be done by the rgmanager
>>>> itself. All the logs is full with the next:
>>>> rgmanager Making resilient: lvchange -an ....
>>>> rgmanager lv_exec_resilient failed
>>>> rgmanager lv_activate_resilient stop failed on ....
>>>>
>>>> As well, sometimes the lvs/clvmd commands are also hanging. I have to
>>>> restart clvmd to make it work again. (sometimes killing it)
>>>>
>>>> Anyone has any idea, what to check?
>>>>
>>>> Thanks and regards,
>>>> Krisztian
>>>
>>> Please paste your cluster.conf file with minimal edits.
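For the record, the manual workaround described in the quoted thread boils down to the following sequence (a sketch using the names from the current config; it obviously should not be needed once the lvm agent activates the VG itself):

  # clvmd activates all clustered LVs as it starts
  service clvmd start

  # deactivate the LVs that the lvm resource is supposed to own (as in the quoted workaround)
  lvchange -an vg_PRD_oracle/lv_PRD_orabin
  lvchange -an vg_PRD_oracle/lv_PRD_oralog1      # ...and so on for the remaining LVs in the VG

  # only then start rgmanager and enable the service
  service rgmanager start
  clusvcadm -e FS_teszt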