[Linux-HA] strange heartbeat shutdown

2009-01-27 Thread Dimitri Maziuk
) standalone mode STARTUP
>Jan 27 06:07:12 swordfish ResourceManager[5458]: info: Running 
>/etc/init.d/httpd  start
>Jan 27 06:07:13 swordfish ResourceManager[5458]: info: Running 
>/etc/ha.d/resource.d/mon  start
>Jan 27 06:07:13 swordfish mach_down[5432]: info: 
>/usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
>Jan 27 06:07:13 swordfish mach_down[5432]: info: mach_down takeover complete 
>for node moray.bmrb.wisc.edu.
...
>Jan 27 06:09:14 swordfish heartbeat: [2778]: info: Heartbeat shutdown in 
>progress. (2778)
>Jan 27 06:09:14 swordfish heartbeat: [5843]: info: Giving up all HA resources.
>Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Releasing resource 
>group: moray.bmrb.wisc.edu 144.92.217.20 proftpd httpd mon
>Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running 
>/etc/ha.d/resource.d/mon  stop
>Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running 
>/etc/init.d/httpd  stop
>Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running 
>/etc/init.d/proftpd  stop
>Jan 27 06:09:14 swordfish proftpd[5705]: swordfish.bmrb.wisc.edu - ProFTPD 
>killed (signal 15)
>Jan 27 06:09:14 swordfish proftpd[5705]: swordfish.bmrb.wisc.edu - ProFTPD 
>1.3.1 standalone mode SHUTDOWN
>Jan 27 06:09:14 swordfish ResourceManager[5856]: info: Running 
>/etc/ha.d/resource.d/IPaddr 144.92.217.20 stop
>Jan 27 06:09:14 swordfish IPaddr[5981]: INFO: ifconfig eth0:0 down
>Jan 27 06:09:14 swordfish IPaddr[5964]: INFO:  Success
>Jan 27 06:09:14 swordfish heartbeat: [5843]: info: All HA resources 
>relinquished.
>Jan 27 06:09:16 swordfish ntpd[2383]: Deleting interface #9 eth0:0, 
>144.92.217.20#123, interface stats: received=0, sent=0, dropped=0, 
>active_time=163 secs
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: killing HBREAD process 2805 
>with signal 15
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: killing HBFIFO process 2803 
>with signal 15
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: killing HBWRITE process 
>2804 with signal 15
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: Core process 2803 exited. 3 
>remaining
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: Core process 2805 exited. 2 
>remaining
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: Core process 2804 exited. 1 
>remaining
>Jan 27 06:09:16 swordfish heartbeat: [2778]: info: swordfish.bmrb.wisc.edu 
>Heartbeat shutdown complete.
---
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] strange heartbeat shutdown

2009-01-30 Thread Dimitri Maziuk
On Friday 30 January 2009 04:22:27 Dejan Muhamedagic wrote:
>
> Looked through your logs: no idea what happened. It's as if
> somebody typed rcheartbeat stop.

Ah, that's the kick my brane needed -- mon would do that if it thought 
httpd/ftpd weren't running. Now to figure out why it'd think that.

> What about the configuration? 

Pretty basic: bcast eth1, two nodes, auto-failback is on, and "host IP/26 
proftpd httpd mon" in haresources. 

It's gotta be mon. Thanks for the push
Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Re: [Linux-HA] strange heartbeat shutdown - SOLUTION

2009-01-30 Thread Dimitri Maziuk

(Or at least I'm pretty sure that's what happened)

If you tell mon to monitor, say, httpd on a hostname, make sure hostname 
resolves. 

In my case, I gave it the hostname of cluster ip and it wasn't in /etc/hosts. 
On top of that, servers are in a dmz subnet that has no local dns server. So 
when the gateway went down mon couldn't resolve the hostname and assumed 
httpd was dead. 
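
A quick way to check this before pointing mon at a name is to ask the system resolver directly. The sketch below is only an illustration: cluster.example.org is a placeholder for whatever name the cluster ip has, and it assumes a glibc system where getent is available.

```shell
#!/bin/sh
# Return 0 if the name resolves through the system resolver
# (/etc/hosts first, then DNS, depending on nsswitch.conf).
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Usage (cluster.example.org is a placeholder, not a name from this thread):
#   resolves cluster.example.org || echo "add it to /etc/hosts" >&2
#
# A matching /etc/hosts line would look like:
#   144.92.217.20   cluster.example.org
```

If the name only resolves while the gateway is up, the check passes today and mon still trips later, so the /etc/hosts entry is the part that matters.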

(The other mistake was a combination of brain fart and insufficient RTFM -- of 
course I wanted mon to run hb_standby, not shut heartbeat down. D'oh.)

Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Re: [Linux-HA] Outgoing IP address

2009-02-04 Thread Dimitri Maziuk
On Wednesday 04 February 2009 15:35:21 Lee Hobart wrote:
> It's Unidata's LDM. As far as I can tell I can specify an interface within
> LDM.

Then tell it to use eth0:0 or whatever it is on your system. Typically if 
there's more than one interface associated with the route to destination, the 
client will pick the first one -- for some value of "first". Most of the 
time "eth0" would come before "eth0:0" in the list, so that's the one it'll 
pick.

It's probably possible to mark the packets with iptables and route them to 
eth0:0 with iproute (google for iptables mark and traffic shaping/policy 
routing), but option 1 is way simpler.

Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Re: [Linux-HA] Outgoing IP address

2009-02-05 Thread Dimitri Maziuk

James R. Leu wrote:
> For locally originated connections that do not bind to an interface you
> can use the SNAT target of iptables.
>
> iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to-source 192.168.1.3


There's another problem with using the cluster ip as the outgoing
address: if it fails over in the middle of a connection, the other node
will start getting packets for a connection that doesn't exist on that
node.

First, the default iptables rules ("accept established") won't let them
through, since the connection isn't "established" on the new node. Even
if you get around that, the client software either isn't running there
or isn't in a state to process those packets.


Dima


Re: [Linux-HA] How to setup a 2 node active/passive apache2 cluster for Proof of Concept

2009-05-29 Thread Dimitri Maziuk
Bernie Wu wrote:
> The company I work for wants us to start investigating HA.
> My first POC setup was a 2 node cluster with a floating IP and that worked 
> out quite well.
> Now the second POC was to work with an application, in this case, apache2 in 
> a active/passive configuration.
> My question is this.  Do I need to setup a third node to serve as the quorum 
> node or can I work with 2 nodes.

If your setup is v1-style active/passive, all you need to do is add 
httpd to haresources line (and make sure it's not started by init). 
That's for proof of concept. IRL you may want to throw in at least mon, 
possibly stonith -- although I haven't seen a split brain problem in my 
setup (but then again, mine usually fail over when I upgrade the kernel).
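
For illustration, the haresources change might look like the fragment below. node1, 192.168.1.100/24 and the service name httpd are placeholders, not values from this thread: the node name has to match `uname -n` on the preferred node, and the service name has to match a script in /etc/init.d or /etc/ha.d/resource.d (on some distributions that script is called apache2).

```
# /etc/ha.d/haresources (identical on both nodes)
# <preferred node>  <cluster ip>[/prefix]  <resources, started left to right>
node1 192.168.1.100/24 httpd
```

Resources are stopped in the reverse order on release. Taking httpd out of init is distribution-specific; on Red Hat-style systems something like `chkconfig httpd off` should do it.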

Dima



Re: [Linux-HA] How to setup a 2 node active/passive apache2 cluster for Proof of Concept

2009-05-29 Thread Dimitri Maziuk
On Friday 29 May 2009 14:44:55 Bernie Wu wrote:
> Hi Les,
> Currently, there is no database or filesystem on the active system that
> needs to be sync'ed.  Apache currently just serves up static pages.  It's
> too early in our POC for something fancy.

Syncing only really works if people upload files and expect them to persist 
past their current "session". With throw-away uploads (i.e. they click on 
upload and get back some output) it's not going to do anything, and with 
fancier stuff like data entry via multiple CGI forms you'll want DRBD. And if 
you have stateful stuff like applet-servlet apps, those will break during 
failover.

Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Re: [Linux-HA] questions about using HA

2009-06-02 Thread Dimitri Maziuk
blue_hmq wrote:
> hi, i am sorry to disturb you because i have some questions about using  HA
> 
> i had configured the HA on two computer(HA01 as master,HA02),using apache2 
> service. 
> if i stop apache2 service on HA01, but the Ethernet and heartbeat are
> OK, will ha02 take over the apache2 service? If so, how does it do
> that (how does ha02 know the apache service on HA01 was stopped)?
> 

It doesn't, you'll need mon.

Dima



Re: [Linux-HA] Stonith with APCSmart UPS1000 +Network ManagementCard

2009-07-11 Thread Dimitri Maziuk
Ehlers, Kolja wrote:

> Could somebody explain how the APC Smart shutdown command works? Does it 
> actually allow to take away the power from only one of the
> connected servers or does it just take away the power from everything that is 
> connected? 

IIRC we have the previous model of that network management card. It can 
shut down the whole UPS, or it can send the powerfail signal to connected 
machines. I'm pretty sure ours can't shut off an individual outlet.

You'll probably need their net-managed power strip for what you're 
trying to do.

Dima


Re: [Linux-HA] Stonith with APCSmart UPS1000 +Network ManagementCard

2009-07-13 Thread Dimitri Maziuk
Ehlers, Kolja wrote:
> Hello Dima,
> 
> can you explain how you get that card to shutdown the whole UPS or how to 
> send the powerfail command to connected machines (are you
> talking about the PowerChute agent?). Is any of that possible through a 
> stonith plugin?

I never got a 'round tuit: we have pretty reliable power and all 
important stuff is on journaled raids.

As I recall you can send powerfail either to a monitoring daemon 
(apcupsd?) or as snmp trap; with recent firmware you can also group 
upses to send the signal only when both of them lose power (with 2-psu 
machines connected to 2 upses on independent circuits). However, this is 
the opposite of what you want.

Their networked power strip lets you turn individual outlets on and off 
-- dunno if you can do it via snmp, though. I'm sure you can wrap 
telnet/expect or http/curl script into a plugin if it comes to that.

Dima


Re: [Linux-HA] Raid and drdb

2009-07-27 Thread Dimitri Maziuk
On Monday 27 July 2009 12:14:57 Claus Denk wrote:
> Hi all,
>
> we have now almost configured our hardware for a HA cluster. We will use
> two IBM 3550 M2 servers, and I would like to ask you if you could share
> some experience about different disk configurations.
>
> 1) Using drdb, is having some hardware RAID on both machines good
> practice (for example 3 disks with RAID 5) or is it not necessary to do
> so as drdb is our network raid? I guess in case of failure of the drdb
> communication it would be good to have some additional raid security on
> each server. Does it affect the write performance a lot?


Actually, I'd like to know: what happens when a disk fails? Would that 
propagate to DRBD and trigger a failover?

> 2) This machine has a maximum of 4 SAS slots, what would be good
> practice to install OS and data (2+2 disks with raid 1, or everything on
> 3 disks raid 5, or 4 disks raid 10?) Of course disk price is an issue, we
> would like to have 500 GB of data, which requires 3 300 Gb disks for raid
> 5.

With SAS drives you should be OK with raid 5 (SATA is where you'll want raid 
1/10, apparently). Personally, I wouldn't worry too much about raid'ing the 
root drives: HA cluster should give you enough time to reinstall the OS if 
one fails. I would keep a spare disk handy.

Dimitri
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Re: [Linux-HA] Redundant MailMan

2009-09-11 Thread Dimitri Maziuk
On Friday 11 September 2009 06:34:39 maike wrote:
> Hello people, I wonder if anyone has implemented a redundant Mailman
> list with Heartbeat. I've been searching the list but have not found
> anything. My network:
>
> correioa [postfix-mailman] -> smtpa
> correiob [postfix-mailman] -> smtp
>
> if correioa or b stay out the mailman service is not compromised. If
> someone can help me I am grateful.

You should put /var/lib/mailman on drbd so that archives are shared between 
the 2 nodes. If you're going to let people create mailing lists (as opposed 
to creating a couple at install time and not adding any more ever), you 
should also put /etc/mail on drbd to share mailman aliases.

I assume you'll be running httpd as well; make ServerName in the apache 
config the same on both nodes: that of your cluster ip.

Other than that I can't think of any gotchas.

Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Re: [Linux-HA] postfix recover failed.

2009-11-09 Thread Dimitri Maziuk
On Monday 09 November 2009 07:00:42 Dejan Muhamedagic wrote:

> It would, but the stop action failed:
>
> Nov  9 11:07:12 Server1 lrmd: [4682]: info: RA output:
> (Postfix:stop:stderr) 2009/11/09_11:07:12 ERROR: Postfix returned an error
> while stopping. 1
>
> This error comes from /usr/sbin/postfix. Not an expert on
> postfix, so CC the author of the RA, perhaps he can take a look.

Actually, this

> > Failed actions:
> > Postfix_monitor_5 (node=Server1, call=16, rc=7,
> > status=complete): not running
> > Postfix_stop_0 (node=Server1, call=17, rc=1, status=complete):
> > unknown error

looks like it's using a "restart" action, usually coded as "stop" followed 
by "start".

If the daemon isn't running to begin with, "stop" will fail.
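
The usual way around that is to make "stop" treat "not running" as success, which is also what OCF resource agents are expected to do. A minimal sketch of the idea, assuming a daemon that writes a pidfile; this is not the Postfix RA's code, and the pidfile path is whatever the caller passes in:

```shell
#!/bin/sh
# Stop the daemon named by a pidfile. Returns 0 if the daemon was not
# running to begin with, or if the TERM signal was delivered.
stop_idempotent() {
    pidfile="$1"

    # No pidfile, or a pid that is no longer alive: nothing to stop,
    # and that counts as success rather than an error.
    if [ ! -f "$pidfile" ] || ! kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        echo "already stopped"
        return 0
    fi

    # Otherwise ask the daemon to exit and report how that went.
    kill -TERM "$(cat "$pidfile")"
}
```

With that, a "restart" coded as stop-then-start no longer fails on the stop half when the daemon has already died.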

Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Re: [Linux-HA] postfix recover failed.

2009-11-10 Thread Dimitri Maziuk
On Tuesday 10 November 2009 03:57:25 Dejan Muhamedagic wrote:
> Hi,
>
> The patch looks fine to me. Looks like /usr/sbin/postfix can't
> handle this itself. Raoul: Is that actually expected?

Why not: if "postfix stop" fails to stop postfix for whatever reason, it 
should return a non-zero value.

Note that "stop" will wait for the daemons to shut down gracefully. Which 
means it could potentially block the failover for a while (unlikely worst 
case:  deadlock).
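
One way to keep a slow "stop" from holding up the failover indefinitely is to bound the wait and then force the issue. The sketch below shows the idea only; it is not what the postfix script or the RA does, the pid and the time limit are supplied by the caller, and it assumes it runs as root so that a failing kill means the process is gone.

```shell
#!/bin/sh
# Ask a process to exit, wait up to $2 seconds, then send KILL.
# Returns 0 if the process is gone on return, 1 if it survived even KILL.
stop_with_timeout() {
    pid="$1"
    limit="$2"

    # If TERM cannot be delivered the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            # Graceful shutdown took too long: force it and re-check once.
            kill -KILL "$pid" 2>/dev/null
            sleep 1
            kill -0 "$pid" 2>/dev/null && return 1
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Whether killing a mail daemon mid-delivery is acceptable is a separate question; the point is only that the failover gets a known upper bound.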

Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu