[Linux-HA] Problem promoting Slave to Master
Hi all,

I have a problem after I removed a node with the force command from my crm config. Originally I had 2 nodes running an HA cluster (corosync 1.4.1-7.el6, pacemaker 1.1.7-6.el6). Then I wanted to add a third node acting as a quorum node, but was not able to get it to work (probably because I don't understand how to set it up). So I removed the 3rd node, but had to use the force command because crm complained when I tried to remove it.

Now when I start up Pacemaker, the resources don't look like they come up correctly:

Online: [ testclu01 testclu02 ]

 Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
     Masters: [ testclu01 ]
     Slaves: [ testclu02 ]
 Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
     Started: [ tdtestclu01 tdtestclu02 ]
 Resource Group: g_nfs
     p_lvm_nfs     (ocf::heartbeat:LVM):          Started testclu01
     p_fs_shared   (ocf::heartbeat:Filesystem):   Started testclu01
     p_fs_shared2  (ocf::heartbeat:Filesystem):   Started testclu01
     p_ip_nfs      (ocf::heartbeat:IPaddr2):      Started testclu01
 Clone Set: cl_exportfs_root [p_exportfs_root]
     Started: [ testclu01 testclu02 ]

Failed actions:
    p_exportfs_root:0_monitor_3 (node=testclu01, call=12, rc=7, status=complete): not running
    p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7, status=complete): not running

The filesystems mount correctly on the master at this stage and can be written to. When I stop the services on the master node so that it fails over, it doesn't work: it loses cluster-IP connectivity.

Corosync.log from the master after I stopped pacemaker on the master node: see attached file.

Additional files (attached):
crm configure show
corosync.conf
global_common.conf

I'm not sure how to proceed to get it back into a fair state now, so if anyone could help me it would be much appreciated.

Kind regards
/Fredrik Hudner

crm configure show (attached):

node tdtestclu01
node tdtestclu02
primitive p_drbd_nfs ocf:linbit:drbd \
        params drbd_resource=nfs \
        op monitor interval=15 role=Master \
        op monitor interval=30 role=Slave
primitive p_exportfs_root ocf:heartbeat:exportfs \
        params fsid=0 directory=/export options=rw,crossmnt clientspec=10.240.0.0/255.255.0.0 \
        op monitor interval=30s
primitive p_fs_shared ocf:heartbeat:Filesystem \
        params device=/dev/vg_nfs/lv_shared directory=/export/shared fstype=ext4 \
        op monitor interval=10s
primitive p_fs_shared2 ocf:heartbeat:Filesystem \
        params device=/dev/vg_nfs/lv_shared2 directory=/export/shared2 fstype=ext4 \
        op monitor interval=10s
primitive p_ip_nfs ocf:heartbeat:IPaddr2 \
        params ip=10.240.64.20 cidr_netmask=24 \
        op monitor interval=30s
primitive p_lsb_nfsserver lsb:nfs \
        op monitor interval=30s
primitive p_lvm_nfs ocf:heartbeat:LVM \
        params volgrpname=vg_nfs \
        op monitor interval=30s
group g_nfs p_lvm_nfs p_fs_shared p_fs_shared2 p_ip_nfs
ms ms_drbd_nfs p_drbd_nfs \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
clone cl_exportfs_root p_exportfs_root
clone cl_lsb_nfsserver p_lsb_nfsserver
location drbd-fence-by-handler-nfs-ms_drbd_nfs ms_drbd_nfs \
        rule $id=drbd-fence-by-handler-nfs-rule-ms_drbd_nfs $role=Master -inf: #uname ne tdtestclu01
colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
colocation c_nfs_on_root inf: g_nfs cl_exportfs_root
order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start
order o_root_before_nfs inf: cl_exportfs_root g_nfs:start
property $id=cib-bootstrap-options \
        dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \
        cluster-infrastructure=openais \
        expected-quorum-votes=2 \
        last-lrm-refresh=1363170760 \
        stonith-enabled=false \
        no-quorum-policy=freeze \
        maintenance-mode=false
rsc_defaults $id=rsc-options \
        resource-stickiness=200
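A possible cleanup path, sketched here only as a hint and assuming the crm shell from the versions above (this is not an answer given in the thread): the failed monitor entries can be cleared with a resource cleanup, and the drbd-fence-by-handler-* location constraint visible in the attached configuration pins the Master role to tdtestclu01, which would block promotion on the peer until it is removed. DRBD's unfence handler normally removes it once the peer is back in sync.

  # sketch only -- verify DRBD is in sync before touching the constraint
  cat /proc/drbd                                               # both nodes should report UpToDate/UpToDate
  crm resource cleanup cl_exportfs_root                        # clear the failed exportfs monitor actions
  crm configure delete drbd-fence-by-handler-nfs-ms_drbd_nfs   # drop the stale fencing constraint
  crm_mon -1                                                   # re-check cluster status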
Re: [Linux-HA] Multiple instances of heartbeat
On Thu, 2013-03-14 at 16:26 +0100, Lars Marowsky-Bree wrote:
> On 2013-03-14T09:44:11, GGS (linux ha) support-linu...@ggsys.net wrote:
>
> > > That's fine. But the cluster software really assumes that only one
> > > instance of it is running per server - said instance can then manage
> > > multiple software stacks, though.
> > Got it. That's what I was asking.
>
> No. Pacemaker allows resources and groups (probably the equivalent of
> your stacks) to be individually managed. If you want to bring down
> pacemaker itself for maintenance, you'd detach via maintenance mode,
> stop, update, restart, reattach.

I'll have to dig in deeper; it may be a possibility. We really would like to move away from the in-house built solution.

> But there is a point where this matters, namely IO fencing/STONITH. In
> case of a real server failure, you don't want 200+ independent fencing
> processes to trigger.

Believe it or not, I would actually rather have the 200+ fencing processes trigger. But that is not a requirement. I just need to ensure failover completes within the allowed time.

> Yes. That's called multitasking/virtualization/cloud. We get that. ;-)

Multitasking yes, virtualization no, that's another discussion :-)

> But just like you only have one kernel per physical server, you also
> only have one cluster stack that then manages multiple stacks. We even
> got ACLs so that you can grant people access to only the bits they're
> allowed to manage, etc.
>
> What you plan - running multiple heartbeat v1 setups on one node - will
> not work reliably. Running multiple pacemaker instances per node/OS
> image will not work either.

That's what I thought. The emails from 2009 seemed to indicate that it was possible to run multiple instances. I asked because I suspected that it really wasn't the case. Thanks for confirming it.

I'll dig deeper into pacemaker and see how I can make it work for our use case. One quick question on pacemaker: if I add a new stack, do I need to bring the old ones down (or fail them) to add it to pacemaker? From your comment above it seems that I wouldn't, but I just want to make sure.

Thanks,
Alberto

> Regards,
>     Lars

-- 
Alberto Alonso                          Global Gate Systems LLC.
(512) 351-7233                          http://www.ggsys.net
Monitoring the metrics that are important to you in real time
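A minimal sketch of the detach/stop/update/restart/reattach cycle Lars mentions, assuming the crm shell and the sysvinit service scripts used on the EL6 systems discussed in this thread; the exact service commands depend on the distribution:

  crm configure property maintenance-mode=true    # detach: resources keep running, cluster stops managing them
  service pacemaker stop                          # stop the cluster stack on this node
  # ... update the cluster packages ...
  service pacemaker start
  crm configure property maintenance-mode=false   # reattach: cluster re-probes state and resumes management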
Re: [Linux-HA] Multiple instances of heartbeat
hi,

On Thu, Mar 14, 2013 at 11:15:29AM -0500, Alberto Alonso wrote:
> On Thu, 2013-03-14 at 16:26 +0100, Lars Marowsky-Bree wrote:
> > On 2013-03-14T09:44:11, GGS (linux ha) support-linu...@ggsys.net wrote:
> >
> > > > That's fine. But the cluster software really assumes that only one
> > > > instance of it is running per server - said instance can then manage
> > > > multiple software stacks, though.
> > > Got it. That's what I was asking.
> >
> > No. Pacemaker allows resources and groups (probably the equivalent of
> > your stacks) to be individually managed. If you want to bring down
> > pacemaker itself for maintenance, you'd detach via maintenance mode,
> > stop, update, restart, reattach.
>
> I'll have to dig in deeper; it may be a possibility. We really would like
> to move away from the in-house built solution.
>
> > But there is a point where this matters, namely IO fencing/STONITH. In
> > case of a real server failure, you don't want 200+ independent fencing
> > processes to trigger.
>
> Believe it or not, I would actually rather have the 200+ fencing processes
> trigger. But that is not a requirement. I just need to ensure failover
> completes within the allowed time.
>
> > Yes. That's called multitasking/virtualization/cloud. We get that. ;-)
>
> Multitasking yes, virtualization no, that's another discussion :-)
>
> > But just like you only have one kernel per physical server, you also
> > only have one cluster stack that then manages multiple stacks. We even
> > got ACLs so that you can grant people access to only the bits they're
> > allowed to manage, etc.
> >
> > What you plan - running multiple heartbeat v1 setups on one node - will
> > not work reliably. Running multiple pacemaker instances per node/OS
> > image will not work either.
>
> That's what I thought. The emails from 2009 seemed to indicate that it was
> possible to run multiple instances. I asked because I suspected that it
> really wasn't the case. Thanks for confirming it.
>
> I'll dig deeper into pacemaker and see how I can make it work for our use
> case. One quick question on pacemaker: if I add a new stack, do I need to
> bring the old ones down (or fail them) to add it to pacemaker?

No.

Thanks,
Dejan

> From your comment above it seems that I wouldn't, but I just want to make
> sure.
>
> Thanks,
> Alberto
>
> > Regards,
> >     Lars
>
> --
> Alberto Alonso                          Global Gate Systems LLC.
> (512) 351-7233                          http://www.ggsys.net
> Monitoring the metrics that are important to you in real time
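To illustrate that "No." in practice - a sketch with made-up resource names, assuming the crm shell: new primitives and groups are simply added to the live configuration, and the resources that are already running are left alone.

  # add a new "stack" while the cluster keeps running; nothing existing is restarted
  crm configure primitive p_newapp_ip ocf:heartbeat:IPaddr2 \
          params ip=192.168.1.50 cidr_netmask=24 \
          op monitor interval=30s
  crm configure primitive p_newapp lsb:newapp \
          op monitor interval=30s
  crm configure group g_newapp p_newapp_ip p_newapp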
Re: [Linux-HA] RA heartbeat/exportfs hangs sporadically
On Mon, 2013-03-11 at 16:28 +0100, Dejan Muhamedagic wrote:
> Hi,
>
> On Mon, Mar 11, 2013 at 10:53:55AM +0100, Roman Haefeli wrote:
> On Fri, 2013-03-08 at 14:15 +0100, Dejan Muhamedagic wrote:
> Hi,
> On Fri, Mar 08, 2013 at 01:39:27PM +0100, Roman Haefeli wrote:
> On Fri, 2013-03-08 at 13:28 +0100, Roman Haefeli wrote:
> On Fri, 2013-03-08 at 12:02 +0100, Lars Marowsky-Bree wrote:
> On 2013-03-08T11:56:12, Roman Haefeli reduz...@gmail.com wrote:
>
> Googling TrackedProcTimeoutFunction exportfs didn't reveal any results,
> which makes me think we are alone with this specific problem.
>
> Is it the RA that hangs or the command 'exportfs' which is executed by
> this RA?
>
> It is most probably the exportfs program. Unless you hit the rmtab
> growing indefinitely issue.
>
> No, this is with a later version of the RA. From the log:
>
> Mar 8 03:10:54 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:stop
> process (PID 5528) timed out (try 2). Killing with signal SIGKILL (9)
>
> This means that the process didn't leave after being sent the TERM
> signal. I think that KILL takes place five seconds later. Was this with
> the rmtab problem?
>
> I still don't fully understand. Is this lrmd trying to kill the RA or
> the process 'exportfs' with given PID?
>
> The former. I thought I already answered that.

Yeah, sorry you did. Just for clarification: You say it's most likely that the 'exportfs' process hangs and thus lrmd tries to kill the RA, which will not exit until exportfs exits, is that correct?

> For me valuable to know is what is lrmd trying to kill here: the process
> 'exportfs' or the process of the resource agent?
>
> The resource agent instance.
>
> I mean, is 'exportfs' broken on said machine? Name resolution taking
> long perhaps?
>
> We use IP addresses everywhere, so I assume it's not related to name
> resolution. What can I do about a broken 'exportfs'? It happens so
> seldom that I don't have a chance to deeply investigate the problem to
> write a proper bug report.
>
> Do you run the latest resource-agents (3.9.5)? Then you can trace the
> resource agent, like this:
>
> primitive r ocf:heartbeat:exportfs \
>         params ... \
>         op stop trace_ra=1
>
> The trace files will be generated per call in
> $HA_VARLIB/trace_ra/type/id.action.timestamp
> HA_VARLIB is usually, I think, /var/lib/heartbeat.

Thanks, that is valuable information. Is it safe to only upgrade the resource-agents while keeping corosync (1.4.2) and pacemaker (1.1.7) at their current version?

Thanks,
Roman
Re: [Linux-HA] Problem promoting Slave to Master
Hello Fredrik

Why do you have a clone of cl_exportfs_root when you have an ext4 filesystem? And I think this ordering is not correct:

order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start
order o_root_before_nfs inf: cl_exportfs_root g_nfs:start

I think that way you try to start g_nfs twice.

2013/3/14 Fredrik Hudner fredrik.hud...@evry.com

> Hi all,
> I have a problem after I removed a node with the force command from my
> crm config. Originally I had 2 nodes running an HA cluster (corosync
> 1.4.1-7.el6, pacemaker 1.1.7-6.el6). Then I wanted to add a third node
> acting as a quorum node, but was not able to get it to work (probably
> because I don't understand how to set it up). So I removed the 3rd node,
> but had to use the force command as crm complained when I tried to
> remove it.
>
> Now when I start up Pacemaker the resources don't look like they come up
> correctly:
>
> Online: [ testclu01 testclu02 ]
>
>  Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
>      Masters: [ testclu01 ]
>      Slaves: [ testclu02 ]
>  Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
>      Started: [ tdtestclu01 tdtestclu02 ]
>  Resource Group: g_nfs
>      p_lvm_nfs     (ocf::heartbeat:LVM):          Started testclu01
>      p_fs_shared   (ocf::heartbeat:Filesystem):   Started testclu01
>      p_fs_shared2  (ocf::heartbeat:Filesystem):   Started testclu01
>      p_ip_nfs      (ocf::heartbeat:IPaddr2):      Started testclu01
>  Clone Set: cl_exportfs_root [p_exportfs_root]
>      Started: [ testclu01 testclu02 ]
>
> Failed actions:
>     p_exportfs_root:0_monitor_3 (node=testclu01, call=12, rc=7,
>     status=complete): not running
>     p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7,
>     status=complete): not running
>
> The filesystems mount correctly on the master at this stage and can be
> written to. When I stop the services on the master node for it to fail
> over, it doesn't work: it loses cluster-IP connectivity.
>
> Corosync.log from the master after I stopped pacemaker on the master
> node: see attached file.
>
> Additional files (attached):
> crm configure show
> corosync.conf
> global_common.conf
>
> I'm not sure how to proceed to get it back into a fair state now, so if
> anyone could help me it would be much appreciated.
>
> Kind regards
> /Fredrik Hudner

-- 
esta es mi vida e me la vivo hasta que dios quiera
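For comparison, a single ordering chain along the lines of that suggestion could look roughly like this in the crm shell - an untested sketch against the configuration quoted above, not a change anyone in the thread verified:

  # one ordering constraint instead of two, so g_nfs is referenced only once:
  # promote DRBD, then start the exportfs root clone, then the NFS group
  order o_drbd_root_before_nfs inf: ms_drbd_nfs:promote cl_exportfs_root:start g_nfs:start

  # crm_simulate can show what the policy engine would actually schedule
  crm_simulate -L -s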
Re: [Linux-HA] RA heartbeat/exportfs hangs sporadically
On Fri, Mar 15, 2013 at 10:44:37AM +0100, Roman Haefeli wrote:
> On Mon, 2013-03-11 at 16:28 +0100, Dejan Muhamedagic wrote:
> Hi,
> On Mon, Mar 11, 2013 at 10:53:55AM +0100, Roman Haefeli wrote:
> On Fri, 2013-03-08 at 14:15 +0100, Dejan Muhamedagic wrote:
> Hi,
> On Fri, Mar 08, 2013 at 01:39:27PM +0100, Roman Haefeli wrote:
> On Fri, 2013-03-08 at 13:28 +0100, Roman Haefeli wrote:
> On Fri, 2013-03-08 at 12:02 +0100, Lars Marowsky-Bree wrote:
> On 2013-03-08T11:56:12, Roman Haefeli reduz...@gmail.com wrote:
>
> Googling TrackedProcTimeoutFunction exportfs didn't reveal any results,
> which makes me think we are alone with this specific problem.
>
> Is it the RA that hangs or the command 'exportfs' which is executed by
> this RA?
>
> It is most probably the exportfs program. Unless you hit the rmtab
> growing indefinitely issue.
>
> No, this is with a later version of the RA. From the log:
>
> Mar 8 03:10:54 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:stop
> process (PID 5528) timed out (try 2). Killing with signal SIGKILL (9)
>
> This means that the process didn't leave after being sent the TERM
> signal. I think that KILL takes place five seconds later. Was this with
> the rmtab problem?
>
> I still don't fully understand. Is this lrmd trying to kill the RA or
> the process 'exportfs' with given PID?
>
> The former. I thought I already answered that.
>
> Yeah, sorry you did. Just for clarification: You say it's most likely
> that the 'exportfs' process hangs and thus lrmd tries to kill the RA,
> which will not exit until exportfs exits, is that correct?

Right.

> For me valuable to know is what is lrmd trying to kill here: the process
> 'exportfs' or the process of the resource agent?
>
> The resource agent instance.
>
> I mean, is 'exportfs' broken on said machine? Name resolution taking
> long perhaps?
>
> We use IP addresses everywhere, so I assume it's not related to name
> resolution. What can I do about a broken 'exportfs'? It happens so
> seldom that I don't have a chance to deeply investigate the problem to
> write a proper bug report.
>
> Do you run the latest resource-agents (3.9.5)? Then you can trace the
> resource agent, like this:
>
> primitive r ocf:heartbeat:exportfs \
>         params ... \
>         op stop trace_ra=1
>
> The trace files will be generated per call in
> $HA_VARLIB/trace_ra/type/id.action.timestamp
> HA_VARLIB is usually, I think, /var/lib/heartbeat.
>
> Thanks, that is valuable information. Is it safe to only upgrade the
> resource-agents while keeping corosync (1.4.2) and pacemaker (1.1.7) at
> their current version?

Yes, you can update them independently.

Thanks,
Dejan

> Thanks,
> Roman
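As a usage illustration of the tracing Dejan describes - a sketch only; the resource name p_exportfs_virtual comes from the log line quoted earlier, and the exact trace file names will differ:

  # add trace_ra=1 to the stop operation of the affected resource
  crm configure edit p_exportfs_virtual       # append: op stop trace_ra=1 (keep any existing timeout)

  # after the next stop action, one trace file (shell 'set -x' output) is written per call
  ls /var/lib/heartbeat/trace_ra/exportfs/
  #   p_exportfs_virtual.stop.<timestamp>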
Re: [Linux-HA] Multiple instances of heartbeat
On 3/14/2013 11:15 AM, Alberto Alonso wrote:
> That's what I thought. The emails from 2009 seemed to indicate that it
> was possible to run multiple instances.

I've always had difficulties with the concept: the way I see it, if your hardware fails you want *all* your 200+ services moved. If you want them independently moved to different places, you're likely better off with a full cloud solution. If you want them moved while the hardware's still up, you're probably looking for load balancing, not HA.

I'm sure you can patch heartbeat to replace all the hardcoded stuff with config file settings. Or use pacemaker's ability to manage service groups more or less independently. I'm not sure why you'd want to use either that way.

Dima
Re: [Linux-HA] Multiple instances of heartbeat
On 2013-03-15T09:54:22, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:
> I've always had difficulties with the concept: the way I see it, if your
> hardware fails you want *all* your 200+ services moved. If you want them
> independently moved to different places, you're likely better off with a
> full cloud solution. If you want them moved while the hardware's still
> up, you're probably looking for load balancing, not HA.
>
> I'm sure you can patch heartbeat to replace all the hardcoded stuff with
> config file settings. Or use pacemaker's ability to manage service groups
> more or less independently. I'm not sure why you'd want to use either
> that way.

You're contradicting yourself ;-) Pacemaker in fact gives you the management you suggest for the cloud use case - whether the services are handled natively or encapsulated into a VM.

And the concept of HA clusters predates the cloud slightly.

Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde
Re: [Linux-HA] Multiple instances of heartbeat
On 03/15/2013 10:08 AM, Lars Marowsky-Bree wrote:
> You're contradicting yourself ;-) Pacemaker in fact gives you the
> management you suggest for the cloud use case - whether the services are
> handled natively or encapsulated into a VM.

Yeah, I suppose. I meant going Open/CloudStack. (We get to write buzzword-compliant funding proposals, or I don't get to eat. So my perspective is skewed towards the hottest shiny du jour...)

> And the concept of HA clusters predates the cloud slightly.

Relevant if you're looking at maintenance/upgrade on an existing cluster. Patching heartbeat to manage 200 services independently sounds like a new project.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [Linux-HA] Multiple instances of heartbeat
On 2013-03-15T11:43:56, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:
> Yeah, I suppose. I meant going Open/CloudStack. (We get to write
> buzzword-compliant funding proposals, or I don't get to eat. So my
> perspective is skewed towards the hottest shiny du jour...)

Yeah, I'd agree that today there are scenarios where a cloud makes more sense than a traditional HA environment. OpenStack et al still have to up their HA game a bit, though.

> > And the concept of HA clusters predates the cloud slightly.
>
> Relevant if you're looking at maintenance/upgrade on an existing cluster.
> Patching heartbeat to manage 200 services independently sounds like a new
> project.

Right. Thankfully, we already have that; it's called pacemaker ;-)

Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde
Re: [Linux-HA] Multiple instances of heartbeat
On 03/15/2013 11:55 AM, Lars Marowsky-Bree wrote:
...
> Right. Thankfully, we already have that; it's called pacemaker ;-)

Which brings me back to my original problem with the concept: I can think of only one reason to fail over services (as opposed to hardware), and that is that your daemons are crashing all the time during normal operation. If I needed a solution for that, HA would be fairly low on my list of things to look at.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [Linux-HA] Multiple instances of heartbeat
On Fri, 2013-03-15 at 11:43 -0500, Dimitri Maziuk wrote:
> On 03/15/2013 10:08 AM, Lars Marowsky-Bree wrote:
> > You're contradicting yourself ;-) Pacemaker in fact gives you the
> > management you suggest for the cloud use case - whether the services
> > are handled natively or encapsulated into a VM.
>
> Yeah, I suppose. I meant going Open/CloudStack. (We get to write
> buzzword-compliant funding proposals, or I don't get to eat. So my
> perspective is skewed towards the hottest shiny du jour...)

These projects do not relate well to full VMs, so it is actually not a good direction. But yes, we do use the load balancer and VMs approach for other things, so I am familiar with that type of architecture.

> > And the concept of HA clusters predates the cloud slightly.
>
> Relevant if you're looking at maintenance/upgrade on an existing cluster.
> Patching heartbeat to manage 200 services independently sounds like a new
> project.

The current solution was written in-house. We are looking to replace it. Based on the info from the list, heartbeat is out, so I'll look more into pacemaker.

I know what you mean about the buzzwords. I'm trying to avoid them. :-)

Alberto
Re: [Linux-HA] Multiple instances of heartbeat
On Fri, 2013-03-15 at 17:55 +0100, Lars Marowsky-Bree wrote:
> Yeah, I'd agree that today there are scenarios where a cloud makes more
> sense than a traditional HA environment. OpenStack et al still have to
> up their HA game a bit, though.

You are being way too kind; a lot of improvement is needed.

> > > And the concept of HA clusters predates the cloud slightly.
> >
> > Relevant if you're looking at maintenance/upgrade on an existing
> > cluster. Patching heartbeat to manage 200 services independently
> > sounds like a new project.
>
> Right. Thankfully, we already have that; it's called pacemaker ;-)

And that's where I'm looking next. I hope it is easy.

> Regards,
>     Lars

Alberto
Re: [Linux-HA] Multiple instances of heartbeat
On Fri, 2013-03-15 at 12:32 -0500, Dimitri Maziuk wrote:
> On 03/15/2013 11:55 AM, Lars Marowsky-Bree wrote:
> ...
> > Right. Thankfully, we already have that; it's called pacemaker ;-)
>
> Which brings me back to my original problem with the concept: I can think
> of only one reason to fail over services (as opposed to hardware), and
> that is that your daemons are crashing all the time during normal
> operation. If I needed a solution for that, HA would be fairly low on my
> list of things to look at.

You need to look back at the original description I gave. These are not your typical web stack or back office apps. We do have them running for years without failures/crashes, but in the 24x7 environment they are in we try to minimize the risk of downtime regardless.

Unfortunately I'm not at liberty to discuss the full architecture or what they are doing without written permission, which would make it clear why we are going down the path we are.

Alberto
Re: [Linux-HA] Multiple instances of heartbeat
On 03/15/2013 12:59 PM, GGS (linux ha) wrote:
> Unfortunately I'm not at liberty to discuss the full architecture or what
> they are doing without written permission, which would make it clear why
> we are going down the path we are.

Yeah, I suspected something like that. Hopefully I won't ever need to know. ;-)

(I'd still argue that a full VM solution should have less maintenance overhead in the long run -- or at least it looks that way now.)

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [Linux-HA] Multiple instances of heartbeat
On Fri, 2013-03-15 at 13:08 -0500, Dimitri Maziuk wrote:
> On 03/15/2013 12:59 PM, GGS (linux ha) wrote:
> > Unfortunately I'm not at liberty to discuss the full architecture or
> > what they are doing without written permission, which would make it
> > clear why we are going down the path we are.
>
> Yeah, I suspected something like that. Hopefully I won't ever need to
> know. ;-)

Actually it is not that serious; it's more of a legal, standard corporate structure. It would be better if they made it available for public discussion so that ideas could fly more freely. You guys could probably see something simple that we are missing. Oh well... that's a whole other tangent.

> (I'd still argue that a full VM solution should have less maintenance
> overhead in the long run -- or at least it looks that way now.)

Virtualization has a huge penalty on performance, especially at the IO level. At another place we do Xen and KVM with up to 40 VMs per server, and when there is any kind of IO (disk especially) going on, things slow down to a crawl.

Once I learn pacemaker I think things will be much better :-)

Alberto
Re: [Linux-HA] Multiple instances of heartbeat
On 03/15/2013 01:20 PM, GGS (linux ha) wrote:
> Virtualization has a huge penalty on performance, especially at the IO
> level. At another place we do Xen and KVM with up to 40 VMs per server,
> and when there is any kind of IO (disk especially) going on, things slow
> down to a crawl.

I'm yet to find anything that can deal with I/O. I recently spent a couple of weeks poking at ceph, and it doesn't live up to the sales brochure either... I expect that if you can roll out a dedicated 10GbE network for your iSCSI targets you might get usable I/O speeds. :(

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu