Re: [Linux-HA] NFS cluster after node crash
On 04.04.2011 10:32, Andrew Beekhof wrote:
> On Thu, Mar 24, 2011 at 9:58 PM, Christoph Bartoschek wrote:
>> It seems as if the g_nfs service is stopped on the surviving node when
>> the other one comes up again.
>
> To me it looks like the service gets stopped after it fails:
>
> p_exportfs_root:0_monitor_3 (node=laplace, call=12, rc=7,
> status=complete): not running

Hi,

thanks for the answer. In the meantime I think the problem is the result of
two separate issues:

1. It is wrong to have ocf::heartbeat:exportfs running separately from the
group g_nfs. It has to start after the filesystem comes up and before the
IP is assigned. Otherwise it fails during a restart in some situations.

2. The cluster tries to restart the service on the surviving node after the
crashed one comes back up.

Christoph
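In crm shell syntax, point 1 would look roughly like the sketch below: the
exportfs primitive moves into the group between the filesystem and the IP.
This assumes the standalone clone cl_exportfs_root and its colocation/order
constraints are dropped; it is untested.

# untested sketch: exportfs inside the group, after the filesystem
# and before the IP address
group g_nfs p_lvm_nfs p_fs_afs p_exportfs_root p_ip_nfs \
        meta target-role="Started"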
Re: [Linux-HA] NFS cluster after node crash
On Thu, Mar 24, 2011 at 9:58 PM, Christoph Bartoschek wrote:
> It seems as if the g_nfs service is stopped on the surviving node when
> the other one comes up again.

To me it looks like the service gets stopped after it fails:

p_exportfs_root:0_monitor_3 (node=laplace, call=12, rc=7,
status=complete): not running

> Does anyone see a reason why the service does not continue to run?
>
> Christoph

[quoted status output and configuration snipped; see the original post below]
Re: [Linux-HA] NFS cluster after node crash
I see the same behaviour if I simply put one node into standby and then
bring it online again. The services are briefly stopped and then restarted.
In my opinion this behaviour contradicts the HA purpose of the system.

Christoph

On 24.03.2011 21:58, Christoph Bartoschek wrote:
> It seems as if the g_nfs service is stopped on the surviving node when
> the other one comes up again.
>
> Does anyone see a reason why the service does not continue to run?
>
> Christoph

[quoted status output and configuration snipped; see the original post below]
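The standby round trip described above can be reproduced from the crm
shell; a minimal sketch, using a node name from the status output:

# put one node into standby, then bring it back online
crm node standby laplace
crm node online laplace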
Re: [Linux-HA] NFS cluster after node crash
It seems as if the g_nfs service is stopped on the surviving node when the
other one comes up again.

Does anyone see a reason why the service does not continue to run?

Christoph

On 22.03.2011 22:37, Christoph Bartoschek wrote:
> Hi,
>
> I've created an NFS cluster following the Linbit tutorial "Highly
> available NFS storage with DRBD and Pacemaker". Generally it seems to
> work fine. Today I simulated a node crash by just turning a machine off.
> Failover went fine. After 17 seconds the second node was able to serve
> the clients.
>
> But when I started the crashed node again the service went down. I wonder
> why the cluster did not just restart the services on the new node?
> Instead it tried to change status on the surviving node. What is going
> wrong?

[quoted status output and configuration snipped; see the original post below]
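Whether the service stays put is a question of allocation scores
(resource-stickiness is 200 in the configuration below). A sketch of how
the scores can be inspected, assuming the Pacemaker 1.1 tooling:

# show the allocation scores computed from the live CIB
crm_simulate -sL

(On older builds the equivalent invocation is ptest -sL.)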
[Linux-HA] NFS cluster after node crash
Hi,

I've created an NFS cluster following the Linbit tutorial "Highly available
NFS storage with DRBD and Pacemaker". Generally it seems to work fine.
Today I simulated a node crash by just turning a machine off. Failover went
fine. After 17 seconds the second node was able to serve the clients.

But when I started the crashed node again the service went down. I wonder
why the cluster did not just restart the services on the new node? Instead
it tried to change status on the surviving node. What is going wrong?

The resulting status is:

Online: [ ries laplace ]

 Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
     Masters: [ ries ]
     Slaves: [ laplace ]
 Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
     Started: [ ries laplace ]
 Resource Group: g_nfs
     p_lvm_nfs  (ocf::heartbeat:LVM):         Started ries
     p_fs_afs   (ocf::heartbeat:Filesystem):  Started ries (unmanaged) FAILED
     p_ip_nfs   (ocf::heartbeat:IPaddr2):     Stopped
 Clone Set: cl_exportfs_root [p_exportfs_root]
     p_exportfs_root:0  (ocf::heartbeat:exportfs):  Started laplace FAILED
     Started: [ ries ]

Failed actions:
    p_exportfs_root:0_monitor_3 (node=laplace, call=12, rc=7,
        status=complete): not running
    p_fs_afs_stop_0 (node=ries, call=37, rc=-2, status=Timed Out):
        unknown exec error

My configuration is:

node laplace \
        attributes standby="off"
node ries \
        attributes standby="off"
primitive p_drbd_nfs ocf:linbit:drbd \
        params drbd_resource="afs" \
        op monitor interval="15" role="Master" \
        op monitor interval="30" role="Slave"
primitive p_exportfs_root ocf:heartbeat:exportfs \
        params fsid="0" directory="/srv/nfs" options="rw,no_root_squash,crossmnt" clientspec="192.168.1.0/255.255.255.0" wait_for_leasetime_on_stop="1" \
        op monitor interval="30s" \
        op stop interval="0" timeout="100s"
primitive p_fs_afs ocf:heartbeat:Filesystem \
        params device="/dev/afs/afs" directory="/srv/nfs/afs" fstype="ext4" \
        op monitor interval="10s"
primitive p_ip_nfs ocf:heartbeat:IPaddr2 \
        params ip="192.168.1.100" cidr_netmask="24" \
        op monitor interval="30s" \
        meta target-role="Started"
primitive p_lsb_nfsserver lsb:nfsserver \
        op monitor interval="30s"
primitive p_lvm_nfs ocf:heartbeat:LVM \
        params volgrpname="afs" \
        op monitor interval="30s"
group g_nfs p_lvm_nfs p_fs_afs p_ip_nfs \
        meta target-role="Started"
ms ms_drbd_nfs p_drbd_nfs \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
clone cl_exportfs_root p_exportfs_root \
        meta target-role="Started"
clone cl_lsb_nfsserver p_lsb_nfsserver \
        meta target-role="Started"
colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
colocation c_nfs_on_root inf: g_nfs cl_exportfs_root
order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start
order o_nfs_server_before_exportfs inf: cl_lsb_nfsserver cl_exportfs_root:start
order o_root_before_nfs inf: cl_exportfs_root g_nfs:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1300828539"
rsc_defaults $id="rsc-options" \
        resource-stickiness="200"

Christoph
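One detail worth noting in the failed actions above: p_fs_afs_stop_0 timed
out, and p_fs_afs carries no explicit stop timeout, so the cluster-wide
default presumably applies. A sketch of the same primitive with an explicit
stop timeout added; the 120s value is an assumption, not a tested
recommendation:

# same Filesystem primitive, with an explicit stop timeout (value assumed)
primitive p_fs_afs ocf:heartbeat:Filesystem \
        params device="/dev/afs/afs" directory="/srv/nfs/afs" fstype="ext4" \
        op monitor interval="10s" \
        op stop interval="0" timeout="120s"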