[Linux-HA] Problem promoting Slave to Master

2013-03-15 Thread Fredrik Hudner
Hi all,
I have a problem after I removed a node with the force command from my crm 
config.
Originally I had 2 nodes running an HA cluster (corosync 1.4.1-7.el6, pacemaker 
1.1.7-6.el6).

Then I wanted to add a third node acting as quorum node, but was not able to 
get it to work (probably because I don't understand how to set it up).
So I removed the 3rd node, but had to use the force command as crm complained 
when I tried to remove it.
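
(For what it's worth, a rough sketch of how a force-removed node is usually purged
afterwards on pacemaker 1.1 / corosync 1.4; the third node's hostname isn't shown
here, so tdtestclu03 below is only a placeholder, and the exact commands vary a bit
between versions:

    # on one of the remaining nodes, after the removed node is shut down
    crm node delete tdtestclu03              # drop any leftover node entry from the CIB
    crm_node --force -R tdtestclu03          # purge it from pacemaker's membership caches
    crm resource cleanup cl_exportfs_root    # afterwards, clear stale failed actions
)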

Now when I start up Pacemaker, the resources don't look like they come up 
correctly:

Online: [ testclu01 testclu02 ]

Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
 Masters: [ testclu01 ]
 Slaves: [ testclu02 ]
Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
 Started: [ tdtestclu01 tdtestclu02 ]
Resource Group: g_nfs
 p_lvm_nfs  (ocf::heartbeat:LVM):   Started testclu01
 p_fs_shared(ocf::heartbeat:Filesystem):Started testclu01
 p_fs_shared2   (ocf::heartbeat:Filesystem):Started testclu01
 p_ip_nfs   (ocf::heartbeat:IPaddr2):   Started testclu01
Clone Set: cl_exportfs_root [p_exportfs_root]
 Started: [ testclu01 testclu02 ]

Failed actions:
p_exportfs_root:0_monitor_3 (node=testclu01, call=12, rc=7, 
status=complete): not running
p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7, 
status=complete): not running

The filesystems mount correctly on the master at this stage and can be written 
to.
But when I stop the services on the master node so they fail over, it doesn't 
work: the cluster IP loses connectivity.

corosync.log from the master after I stopped Pacemaker on the master node: see 
attached file.

Additional files (attached): crm configure show, corosync.conf, global_common.conf


I'm not sure how to proceed to get it back into a fair state now,
so if anyone could help me it would be much appreciated.

Kind regards
/Fredrik Hudner


corosync.log (attached)

corosync.conf (attached)

crm configure show:
node tdtestclu01
node tdtestclu02
primitive p_drbd_nfs ocf:linbit:drbd \
params drbd_resource=nfs \
op monitor interval=15 role=Master \
op monitor interval=30 role=Slave
primitive p_exportfs_root ocf:heartbeat:exportfs \
params fsid=0 directory=/export options=rw,crossmnt 
clientspec=10.240.0.0/255.255.0.0 \
op monitor interval=30s
primitive p_fs_shared ocf:heartbeat:Filesystem \
params device=/dev/vg_nfs/lv_shared directory=/export/shared 
fstype=ext4 \
op monitor interval=10s
primitive p_fs_shared2 ocf:heartbeat:Filesystem \
params device=/dev/vg_nfs/lv_shared2 directory=/export/shared2 
fstype=ext4 \
op monitor interval=10s
primitive p_ip_nfs ocf:heartbeat:IPaddr2 \
params ip=10.240.64.20 cidr_netmask=24 \
op monitor interval=30s
primitive p_lsb_nfsserver lsb:nfs \
op monitor interval=30s
primitive p_lvm_nfs ocf:heartbeat:LVM \
params volgrpname=vg_nfs \
op monitor interval=30s
group g_nfs p_lvm_nfs p_fs_shared p_fs_shared2 p_ip_nfs
ms ms_drbd_nfs p_drbd_nfs \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true target-role=Started
clone cl_exportfs_root p_exportfs_root
clone cl_lsb_nfsserver p_lsb_nfsserver
location drbd-fence-by-handler-nfs-ms_drbd_nfs ms_drbd_nfs \
rule $id=drbd-fence-by-handler-nfs-rule-ms_drbd_nfs $role=Master 
-inf: #uname ne tdtestclu01
colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
colocation c_nfs_on_root inf: g_nfs cl_exportfs_root
order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start
order o_root_before_nfs inf: cl_exportfs_root g_nfs:start
property $id=cib-bootstrap-options \
dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \
cluster-infrastructure=openais \
expected-quorum-votes=2 \
last-lrm-refresh=1363170760 \
stonith-enabled=false \
no-quorum-policy=freeze \
maintenance-mode=false
rsc_defaults $id=rsc-options \
resource-stickiness=200
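
(A side note, not from this thread: with only two nodes and no quorum node,
no-quorum-policy=freeze means the surviving node loses quorum when its peer
disappears and will not promote or move anything. A two-node setup like this is
usually run with something along the lines of:

    crm configure property no-quorum-policy=ignore

which may be related to the failover not happening, but that is only a guess.)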


global_common.conf (attached)

Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread Alberto Alonso
On Thu, 2013-03-14 at 16:26 +0100, Lars Marowsky-Bree wrote:
 On 2013-03-14T09:44:11, GGS (linux ha) support-linu...@ggsys.net wrote:
 
 That's fine. But the cluster software really assumes that only one
 instance of it is running per server - said instance can then manage
 multiple software stacks, though.

Got it. That's what I was asking.
 
 No. Pacemaker allows resources and groups (probably the equivalent of
 your stacks) to be individually managed.
 
 If you want to bring down pacemaker itself for maintenance, you'd detach
 via maintenance mode, stop, update, restart, reattach.
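
(Roughly, that cycle looks something like the following with the crm shell; the
service names here are an assumption and depend on how the stack is started on
your distribution:

    crm configure property maintenance-mode=true    # detach: resources keep running, unmanaged
    service pacemaker stop && service corosync stop # stop the cluster stack
    # ... update packages ...
    service corosync start && service pacemaker start
    crm configure property maintenance-mode=false   # reattach
)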

I'll have to dig in deeper, it may be a possibility. We really
would like to move away from the in-house built solution.
 
 But there is a point where this matters, namely IO fencing/STONITH. In
 case of a real server failure, you don't want 200+ independent fencing
 processes to trigger.

Believe it or not, I would actually rather have the 200+
fencing processes to trigger. But that is not a
requirement. I just need to ensure failover completes
within the allowed time.

 
 Yes. That's called multitasking/virtualization/cloud. We get that. ;-)

Multitasking yes, virtualization no, that's another discussion :-)
 
 But just like you only have one kernel per physical server, you also
 only have one cluster stack that then manages multiple stacks. We even
 got ACLs so that you can grant people access to only the bits they're
 allowed to manage, etc.
 
 What you plan - running multiple heartbeat v1 setups on one node - will
 not work reliably. Running multiple pacemaker instances per node/OS
 image will not work either.

That's what I thought. The emails from 2009 seemed to indicate
that it was possible to run multiple instances. 

I asked because I suspected that it really wasn't the case. Thanks
for confirming it. I'll dig deeper into pacemaker and see how I
can make it work for our use case.

One quick question on pacemaker. If I add a new stack, do I need
to bring the old ones down (or fail them) to add it to pacemaker? 
From your comment above it seems that I wouldn't, but I just want 
to make sure.

Thanks,

Alberto


 
 
 Regards,
 Lars
 
-- 
Alberto AlonsoGlobal Gate Systems LLC.
(512) 351-7233http://www.ggsys.net
Monitoring the metrics that are important to you in real time


Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread Dejan Muhamedagic
hi,

On Thu, Mar 14, 2013 at 11:15:29AM -0500, Alberto Alonso wrote:
 On Thu, 2013-03-14 at 16:26 +0100, Lars Marowsky-Bree wrote:
  On 2013-03-14T09:44:11, GGS (linux ha) support-linu...@ggsys.net wrote:
  
  That's fine. But the cluster software really assumes that only one
  instance of it is running per server - said instance can then manage
  multiple software stacks, though.
 
 Got it. That's what I was asking.
  
  No. Pacemaker allows resources and groups (probably the equivalent of
  your stacks) to be individually managed.
  
  If you want to bring down pacemaker itself for maintenance, you'd detach
  via maintenance mode, stop, update, restart, reattach.
 
 I'll have to dig in deeper, it may be a possibility. We really
 would like to move away from the in-house built solution.
  
  But there is a point where this matters, namely IO fencing/STONITH. In
  case of a real server failure, you don't want 200+ independent fencing
  processes to trigger.
 
 Believe it or not, I would actually rather have the 200+
 fencing processes to trigger. But that is not a
 requirement. I just need to ensure failover completes
 within the allowed time.
 
  
  Yes. That's called multitasking/virtualization/cloud. We get that. ;-)
 
 Multitasking yes, virtualization no, that's another discussion :-)
  
  But just like you only have one kernel per physical server, you also
  only have one cluster stack that then manages multiple stacks. We even
  got ACLs so that you can grant people access to only the bits they're
  allowed to manage, etc.
  
  What you plan - running multiple heartbeat v1 setups on one node - will
  not work reliably. Running multiple pacemaker instances per node/OS
  image will not work either.
 
 That's what I thought. The emails from 2009 seemed to indicate
 that it was possible to run multiple instances. 
 
 I asked because I suspected that it really wasn't the case. Thanks
 for confirming it. I'll dig deeper into pacemaker and see how I
 can make it work for our use case.
 
 One quick question on pacemaker. If I add a new stack, do I need
 to bring the old ones down (or fail them) to add it to pacemaker? 

No.
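
(For illustration only, not something from this thread: a new stack can be added
to the live configuration while everything else keeps running, e.g. with
hypothetical resource names:

    crm configure primitive p_app2_ip ocf:heartbeat:IPaddr2 \
            params ip=192.0.2.10 cidr_netmask=24 op monitor interval=30s
    crm configure primitive p_app2 lsb:app2 op monitor interval=30s
    crm configure group g_app2 p_app2_ip p_app2
)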

Thanks,

Dejan

 From your comment above it seems that I wouldn't, but I just want 
 to make sure.
 
 Thanks,
 
 Alberto
 
 
  
  
  Regards,
  Lars
  
 -- 
 Alberto AlonsoGlobal Gate Systems LLC.
 (512) 351-7233http://www.ggsys.net
 Monitoring the metrics that are important to you in real time
 

Re: [Linux-HA] RA heartbeat/exportfs hangs sporadically

2013-03-15 Thread Roman Haefeli
On Mon, 2013-03-11 at 16:28 +0100, Dejan Muhamedagic wrote:
 Hi,
 
 On Mon, Mar 11, 2013 at 10:53:55AM +0100, Roman Haefeli wrote:
  On Fri, 2013-03-08 at 14:15 +0100, Dejan Muhamedagic wrote:
   Hi,
   
   On Fri, Mar 08, 2013 at 01:39:27PM +0100, Roman Haefeli wrote:
On Fri, 2013-03-08 at 13:28 +0100, Roman Haefeli wrote:
 On Fri, 2013-03-08 at 12:02 +0100, Lars Marowsky-Bree wrote:
  On 2013-03-08T11:56:12, Roman Haefeli reduz...@gmail.com wrote:
  
   Googling TrackedProcTimeoutFunction exportfs didn't reveal any
   results, which makes me think we are alone with this specific 
   problem.
   Is it the RA that hangs or the command 'exportfs' which is 
   executed by
   this RA? 
   
   It is most probably the exportfs program. Unless you hit the
   rmtab growing indefinitely issue.
  
  No, this is with a later version of the RA.
  
From the log:
Mar  8 03:10:54 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:stop
process (PID 5528) timed out (try 2).  Killing with signal SIGKILL (9)
   
   This means that the process didn't leave after being sent the
   TERM signal. I think that KILL takes place five seconds later.
   Was this with the rmtab problem?
  
  I still don't fully understand. Is this  lrmd trying to kill the RA or
  the process 'exportfs' with given PID?
 
 The former. I thought I already answered that.

Yeah, sorry you did. Just for clarification: You say it's most likely
that the 'exportfs' process hangs and thus lrmd tries to kill the RA,
which will not exit until exportfs exits, is that correct?

For me valuable to know is what is lrmd trying to kill here: the process
'exportfs' or the process of the resource agent?
   
   The resource agent instance.
   
I mean, is 'exportfs' broken on said machine?
   
   Name resolution taking long perhaps?
  
  We use IP addresses everywhere, so I assume it's not related to name
  resolution. 
  
  What can I do about a broken 'exportfs'? It happens so seldom that I
  don't have a chance to deeply investigate the problem to write a proper
  bug report.
 
 Do you run the latest resource-agents (3.9.5)? Then you can
 trace the resource agent, like this:
 
 primitive r ocf:heartbeat:exportfs \
   params ... \
   op stop trace_ra=1
 
 The trace files will be generated per call in
 $HA_VARLIB/trace_ra/type/id.action.timestamp
 
 HA_VARLIB is usually, I think, /var/lib/heartbeat.
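
(So, assuming that default, the stop traces for this resource would land
somewhere like:

    ls /var/lib/heartbeat/trace_ra/exportfs/
    # one file per call, named <id>.<action>.<timestamp> as described above
)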

Thanks, that is valuable information. Is it safe to only upgrade the
resource-agents while keeping corosync (1.4.2) and pacemaker (1.1.7) at
their current version?

Thanks,
Roman




Re: [Linux-HA] Problem promoting Slave to Master

2013-03-15 Thread emmanuel segura
Hello Fredrik,

Why do you have a clone of cl_exportfs_root when you have an ext4 filesystem?
And I think this ordering is not correct:

order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start
order o_root_before_nfs inf: cl_exportfs_root g_nfs:start

I think that way you try to start g_nfs twice.
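
(If a single chain is what is wanted, the two constraints could also be written
as one ordered set, along these lines, though whether that is the right fix here
is another question:

    order o_stack inf: ms_drbd_nfs:promote cl_exportfs_root:start g_nfs:start
)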


2013/3/14 Fredrik Hudner fredrik.hud...@evry.com

 Hi all,
 I have a problem after I removed a node with the force command from my crm
 config.
 Originally I had 2 nodes running HA cluster (corosync 1.4.1-7.el6,
 pacemaker 1.1.7-6.el6)

 Then I wanted to add a third node acting as quorum node, but was not able
 to get it to work (probably because I don't understand how to set it up).
 So I removed the 3rd node, but had to use the force command as crm
 complained when I tried to remove it.

 Now when I start up Pacemaker the resources doesn't look like they come up
 correctly

 Online: [ testclu01 testclu02 ]

 Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
  Masters: [ testclu01 ]
  Slaves: [ testclu02 ]
 Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
  Started: [ tdtestclu01 tdtestclu02 ]
 Resource Group: g_nfs
  p_lvm_nfs  (ocf::heartbeat:LVM):   Started testclu01
  p_fs_shared(ocf::heartbeat:Filesystem):Started testclu01
  p_fs_shared2   (ocf::heartbeat:Filesystem):Started testclu01
  p_ip_nfs   (ocf::heartbeat:IPaddr2):   Started testclu01
 Clone Set: cl_exportfs_root [p_exportfs_root]
  Started: [ testclu01 testclu02 ]

 Failed actions:
 p_exportfs_root:0_monitor_3 (node=testclu01, call=12, rc=7,
 status=complete): not running
 p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7,
 status=complete): not running

 The filesystems mount correctly on the master at this stage and can be
 written to.
 When I stop the services on the master node for it to failover, it doesn't
 work.. Looses cluster-ip connectivity

 Corosync.log from master after I stopped pacemaker on master node :  see
 attached file

 Additional files (attached): crm-configure show
   Corosync.conf

 Global_common.conf


 I'm not sure how to proceed to get it up in a fair state now
 So if anyone could help me it would be much appreciated

 Kind regards
 /Fredrik Hudner





-- 
this is my life and I live it for as long as God wills


Re: [Linux-HA] RA heartbeat/exportfs hangs sporadically

2013-03-15 Thread Dejan Muhamedagic
On Fri, Mar 15, 2013 at 10:44:37AM +0100, Roman Haefeli wrote:
 On Mon, 2013-03-11 at 16:28 +0100, Dejan Muhamedagic wrote:
  Hi,
  
  On Mon, Mar 11, 2013 at 10:53:55AM +0100, Roman Haefeli wrote:
   On Fri, 2013-03-08 at 14:15 +0100, Dejan Muhamedagic wrote:
Hi,

On Fri, Mar 08, 2013 at 01:39:27PM +0100, Roman Haefeli wrote:
 On Fri, 2013-03-08 at 13:28 +0100, Roman Haefeli wrote:
  On Fri, 2013-03-08 at 12:02 +0100, Lars Marowsky-Bree wrote:
   On 2013-03-08T11:56:12, Roman Haefeli reduz...@gmail.com wrote:
   
Googling TrackedProcTimeoutFunction exportfs didn't reveal any
results, which makes me think we are alone with this specific 
problem.
Is it the RA that hangs or the command 'exportfs' which is 
executed by
this RA? 

It is most probably the exportfs program. Unless you hit the
rmtab growing indefinitely issue.
   
   No, this is with a later version of the RA.
   
 From the log:
 Mar  8 03:10:54 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:stop
 process (PID 5528) timed out (try 2).  Killing with signal SIGKILL (9)

This means that the process didn't leave after being sent the
TERM signal. I think that KILL takes place five seconds later.
Was this with the rmtab problem?
   
   I still don't fully understand. Is this  lrmd trying to kill the RA or
   the process 'exportfs' with given PID?
  
  The former. I thought I already answered that.
 
 Yeah, sorry you did. Just for clarification: You say it's most likely
 that the 'exportfs' process hangs and thus lrmd tries to kill the RA,
 which will not exit until exportfs exits, is that correct?

Right.

 For me valuable to know is what is lrmd trying to kill here: the 
 process
 'exportfs' or the process of the resource agent?

The resource agent instance.

 I mean, is 'exportfs' broken on said machine?

Name resolution taking long perhaps?
   
   We use IP addresses everywhere, so I assume it's not related to name
   resolution. 
   
   What can I do about a broken 'exportfs'? It happens so seldom that I
   don't have a chance to deeply investigate the problem to write a proper
   bug report.
  
  Do you run the latest resource-agents (3.9.5)? Then you can
  trace the resource agent, like this:
  
  primitive r ocf:heartbeat:exportfs \
  params ... \
  op stop trace_ra=1
  
  The trace files will be generated per call in
  $HA_VARLIB/trace_ra/type/id.action.timestamp
  
  HA_VARLIB is usually, I think, /var/lib/heartbeat.
 
 Thanks, that is valuable information. Is it safe to only upgrade the
 resource-agents while keeping corosync (1.4.2) and pacemaker (1.1.7) at
 their current version?

Yes, you can update them independently.

Thanks,

Dejan

 Thanks,
 Roman
 
 


Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread Dimitri Maziuk
On 3/14/2013 11:15 AM, Alberto Alonso wrote:

 That's what I thought. The emails from 2009 seemed to indicate
 that it was possible to run multiple instances.

I've always had difficulties with the concept: the way I see it, if your 
hardware fails you want *all* your 200+ services moved. If you want them 
independently moved to different places, you're likely better off with a 
full cloud solution. If you want them moved while the hardware's still up, 
you're probably looking for load balancing, not HA.

I'm sure you can patch heartbeat to replace all hardcoded stuff with 
config file settings. Or use pacemaker's ability to manage service 
groups more or less independently. I'm not sure why you'd want to use 
either that way.

Dima



Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread Lars Marowsky-Bree
On 2013-03-15T09:54:22, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:

 I've always had difficulties with the concept: the way I see it if your 
 hardware fails you want *all* your 200+ services moved. If you want them 
 independently moved to different places, you're likely better off with a 
 full cloud solution. If you want them moved while hardware's still up 
 you're probably looking for load balancing, not HA.
 
 I'm sure you can patch heartbeat to replace all hardcoded stuff with 
 config file settings. Or use pacemaker's ability to manage service 
 groups more or less independently. I'm not sure why you'd want to use 
 either that way.

You're contradicting yourself ;-) Pacemaker in fact gives you the
management you suggest for the cloud use case - whether the services
are handled natively or encapsulated into a VM.

And the concept of HA clusters predates the cloud slightly.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde



Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread Dimitri Maziuk
On 03/15/2013 10:08 AM, Lars Marowsky-Bree wrote:

 You're contradicting yourself ;-) Pacemaker in fact gives you the
 management you suggest for the cloud use case - whether the services
 are handled natively or encapsulated into a VM.

Yeah, I suppose. I meant going Open/CloudStack.
(We get to write buzzword-compliant funding proposals, or I don't get to
eat. So my perspective is skewed towards the hottest shiny du jour...)

 And the concept of HA clusters predates the cloud slightly.

Relevant if you're looking at maintenance/upgrade on an existing
cluster. Patching heartbeat to manage 200 services independently sounds
like a new project.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu




Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread Lars Marowsky-Bree
On 2013-03-15T11:43:56, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:

 Yeah, I suppose. I meant going Open/CloudStack.
 (We get to write buzzword-compliant funding proposals, or I don't get to
 eat. So my perspective is skewed towards the hottest shiny du jour...)

Yeah, I'd agree that today there are scenarios where a cloud makes
more sense than a traditional HA environment. OpenStack et al still have
to up their HA game a bit, though.

  And the concept of HA clusters predates the cloud slightly.
 Relevant if you're looking at maintenance/upgrade on an existing
 cluster. Patching heartbeat to manage 200 services independently sounds
 like a new project.

Right. Thankfully, we already have that, it's called pacemaker ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde



Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread Dimitri Maziuk
On 03/15/2013 11:55 AM, Lars Marowsky-Bree wrote:
...
 Right. Thankfully, we already have that, it's called pacemaker ;-)

Which brings me back to my original problem with the concept: I can
think of only one reason to fail over services (as opposed to
hardware), and that is that your daemons are crashing all the time during
normal operation. If I needed a solution for that, HA would be fairly
low on my list of things to look at.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu




Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread GGS (linux ha)
On Fri, 2013-03-15 at 11:43 -0500, Dimitri Maziuk wrote:
 On 03/15/2013 10:08 AM, Lars Marowsky-Bree wrote:
 
  You're contradicting yourself ;-) Pacemaker in fact gives you the
  management you suggest for the cloud use case - whether the
 services
  are handled natively or encapsulated into a VM.
 
 Yeah, I suppose. I meant going Open/CloudStack.
 (We get to write buzzword-compliant funding proposals, or I don't get
 to
 eat. So my perspective is skewed towards the hottest shiny du jour...)

These projects do not relate well to full VMs, so it is
actually not a good direction. But yes, we do use the
load balancer and VMs approach for other things, so I
am familiar with that type of architecture.
 
  And the concept of HA clusters predates the cloud slightly.
 
 Relevant if you're looking at maintenance/upgrade on an existing
 cluster. Patching heartbeat to manage 200 services independently
 sounds
 like a new project.

The current solution was written in-house. We are
looking to replace it. Based on the info from the list
heartbeat is out, so I'll look more into pacemaker.

I know what you mean about the buzzwords. I'm trying to
avoid them. :-)

Alberto



Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread GGS (linux ha)
On Fri, 2013-03-15 at 17:55 +0100, Lars Marowsky-Bree wrote:
 Yeah, I'd agree that today there are scenarios where a cloud makes
 more sense than a traditional HA environment. OpenStack et al still have
 to up their HA game a bit, though.

You are being way too kind; a lot of improvement is
needed.
 
   And the concept of HA clusters predates the cloud slightly.
  Relevant if you're looking at maintenance/upgrade on an existing
  cluster. Patching heartbeat to manage 200 services independently sounds
  like a new project.
 
 Right. Thankfully, we already have that, it's called pacemaker ;-)

And that's where I'm looking next; I hope it is easy.

 
 Regards,
 Lars

Alberto



Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread GGS (linux ha)
On Fri, 2013-03-15 at 12:32 -0500, Dimitri Maziuk wrote:
 On 03/15/2013 11:55 AM, Lars Marowsky-Bree wrote:
 ...
  Right. Thankfully, we already have that, it's called pacemaker ;-)
 
 Which brings me back to my original problem with the concept: I can
 think of only one reason to failover services (as opposed to
 hardware), and that is your daemons are crashing all the time during
 normal operation. If I needed a solution for that, HA would be fairly
 low on my list of things to look at.

You need to look back at the original description I gave.
These are not your typical web stack or back office apps.

We have had them running for years without failures/crashes, 
but in the 24x7 environment they are in, we try to minimize
the risk of downtime regardless. 

Unfortunately I'm not at liberty to discuss the full architecture 
or what they are doing without written permission, which would
make it clear why we are going the path we are.

Alberto



Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread Dimitri Maziuk
On 03/15/2013 12:59 PM, GGS (linux ha) wrote:

 Unfortunately I'm not at liberty to discuss the full architecture 
 or what they are doing without written permission, which would
 make it clear why we are going the path we are.

Yeah, I suspected something like that. Hopefully I won't ever need to
know. ;-)

(I'd still argue that a full vm solution should have less maintenance
overhead in the long run -- or at least it looks that way now.)

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu




Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread GGS (linux ha)
On Fri, 2013-03-15 at 13:08 -0500, Dimitri Maziuk wrote:
 On 03/15/2013 12:59 PM, GGS (linux ha) wrote:
 
  Unfortunately I'm not at liberty to discuss the full architecture 
  or what they are doing without written permission, which would
  make it clear why we are going the path we are.
 
 Yeah, I suspected something like that. Hopefully I won't ever need to
 know. ;-)

Actually it's not serious like that; it's more of a legal,
standard corporate structure. It would be better if they
made it available for public discussion so that ideas could
fly more freely. You guys could probably see something simple
that we are missing. Oh well... that's a whole other tangent.
 
 (I'd still argue that a full vm solution should have less maintenance
 overhead in the long run -- or at least it looks that way now.)

Virtualization has a huge penalty on performance, especially
at the I/O level. At another place we do Xen and KVM with up to
40 VMs per server, and when there is any kind of I/O (disk especially) going
on, things slow down to a crawl.

Once I learn pacemaker I think things will be much better :-)

Alberto




Re: [Linux-HA] Multiple instances of heartbeat

2013-03-15 Thread Dimitri Maziuk
On 03/15/2013 01:20 PM, GGS (linux ha) wrote:

 Virtualization has a huge penalty on performance, specially
 at the IO level. At another place we do Xen and KVM with up to
 40 VMs/server and when there is any kind of IO (disk specially) going
 on things slow down to a crawl.

I've yet to find anything that can deal with I/O. I recently spent a
couple of weeks poking at Ceph; it doesn't live up to the sales brochure
either... I expect if you can roll out a dedicated 10GbE network for
your iSCSI traffic you might get usable I/O speeds. :(

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


