Re: [Linux-HA] fsck filesystem?

2011-02-21 Thread Bernd Schubert
Hello Dejan,


On 02/21/2011 05:43 PM, Dejan Muhamedagic wrote:
> 
> No. ext3 is a filesystem with a journal, so it is considered
> that it can recover without fsck. Otherwise, there's a parameter
> called run_fsck, check the meta data: crm ra info Filesystem.
> 

No, not if it writes "Warning: mounting a filesystem with errors". In
that case extX has recorded an error either in its superblock or in the
journal. We had a long discussion about that on the ext4 list back in
October, and in the end upstream e2fsprogs accepted a patch for e2fsck
that allows playing back the journal only. After journal playback, any
error is always recorded in the superblock, and from there a script can
read it using dumpe2fs. The Filesystem agent should be rewritten to
refuse to mount if the superblock has an error. Using the new e2fsck
option "-E journal_only" is a bit more tricky, as only the most recent
e2fsprogs/e2fsck version has it.

http://kerneltrap.org/mailarchive/linux-ext4/2010/10/22/6885813

http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=commit;h=71873b17307993c08b38b97c9551bed231e6048c

Below is what I added to the DDN lustre_server agent:

> # check if the superblock knows about filesystem errors
> # return 0 if not, 1 if errors have been recorded
> check_sb_fs_errors()
> {
> with_error=`dumpe2fs -h $DEVICE 2>/dev/null | grep "Filesystem state:" | grep "error"`
> if [ -n "$with_error" ]; then
> ocf_log err "$DEVICE : $with_error (run e2fsck)"
> return 1
> fi
> return 0
> }

(As I left DDN at the end of November, and the "e2fsck -E journal_only"
option had not yet been accepted upstream at that time, that part is not
implemented in that RA.)
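
For illustration only, here is a minimal sketch of how the missing journal-only step might be combined with the superblock check. It assumes an e2fsprogs recent enough to support "-E journal_only"; the function names are hypothetical and this is not code from the lustre_server or Filesystem agents. The use of -p for a non-interactive run is also an assumption.

```shell
#!/bin/sh
# Illustrative sketch only; function names are hypothetical and this is
# not code from the lustre_server or Filesystem resource agents.

# Reads "dumpe2fs -h" output on stdin; returns 0 if the
# "Filesystem state:" line mentions an error, 1 otherwise.
sb_state_has_error() {
    grep 'Filesystem state:' | grep -q 'error'
}

# Replays the journal only, then checks the superblock state.
# Returns 0 if the device looks safe to mount, 1 otherwise.
replay_journal_and_check() {
    dev="$1"
    # Assumption: e2fsck supports "-E journal_only"; -p keeps the run
    # non-interactive. Exit codes of 4 and above mean uncorrected
    # errors or an operational failure (see e2fsck(8)).
    e2fsck -p -E journal_only "$dev"
    rc=$?
    if [ "$rc" -ge 4 ]; then
        echo "$dev: e2fsck journal replay failed (rc=$rc)" >&2
        return 1
    fi
    if dumpe2fs -h "$dev" 2>/dev/null | sb_state_has_error; then
        echo "$dev: superblock records errors (run e2fsck)" >&2
        return 1
    fi
    return 0
}
```

An agent's start action could call replay_journal_and_check "$DEVICE" before mounting and refuse to mount when it returns non-zero.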


> BTW, it is very unusual (and suspicious) that the filesystem
> starts having errors just like that, while the system's running.
> You should find what caused the corruption.

Well, extX even records an error in the journal, and subsequently in the
superblock, if an I/O error comes up. Unfortunately, there does not seem
to be a single expensive RAID unit out there that does not bring up
errors, although I have to admit that FC and IB HBAs and the fabric also
play their part in that issue.
And of course, no filesystem is free of bugs, which is why extX still
suggests frequent fscks.

Cheers,
Bernd
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] monitor interval

2011-02-21 Thread Hugo, Marcio
 Hi...

If I fail to specify the "s" after 30, what is the monitor interval?

Is op monitor interval="30s" not equal to op monitor interval="30"?


Att, Marcio Hugo
Technology Consultant, EB HPS Technology Services
Hewlett-Packard Company
+55 11 7148 2184 / Mobile 
marcio.h...@hp.com / Email 
Av. Tamboré, 74/200
Barueri, SP 06400-000 - Brazil
 



Re: [Linux-HA] HA and Bonding device

2011-02-21 Thread Bart Coninckx
Not a lot of time to go deeply into this, but as far as I know:

- for round robin bonding you need two unconnected switches
- for using the same switch you need 802.3ad

If you decide to mix and match, you need to take this into consideration.
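
Since the question uses a mode number ("mode 1") while the points above use mode names, the following small sketch shows how the Linux bonding driver's mode names map to numbers. The function name bond_mode_number is made up for this example. The mapping follows the kernel bonding documentation.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric bonding mode for a mode name,
# as used by the Linux bonding driver's "mode=" option.
bond_mode_number() {
    case "$1" in
        balance-rr)    echo 0 ;;  # round robin
        active-backup) echo 1 ;;  # one active slave, failover on link loss
        balance-xor)   echo 2 ;;
        broadcast)     echo 3 ;;
        802.3ad)       echo 4 ;;  # LACP, needs switch support
        balance-tlb)   echo 5 ;;
        balance-alb)   echo 6 ;;
        *)             return 1 ;;  # unknown mode name
    esac
}
```

So "mode 1" in the question is active-backup, which is neither round robin (mode 0) nor 802.3ad (mode 4).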

B.


On 02/21/11 12:12, Claudio Prono wrote:
> Hello all,
> 
> I wrote to the list because of a little of confusion about bonding
> devices in HA environment.
> 
> My network architecture is the following:
> 
> - 2 redundant firewalls with heartbeat, 4 network nics
> - 2 redundant switches
> - 1 system behind the switch with 2 nic
> 
> The system behind the switch have the 2 nic connected to the 2 switches.
> The two switches are connected each other
> From each switch i have 2 cable connecting to the two firewalls
> 
> Now, my question is: i have to set up a bonding device on each firewall,
> in mode 1 (like the example i have found here:
> http://linux-ip.net/html/ether-bonding.html#ex-ether-bonding-aggregation),
> but is needed some particular configuration of the switches to make this
> work correctly? Or is simply based on the interface link status?
> 
> I have searched some help in the previous mailing list posts, but is not
> clear to me
> Any hint is well accepted.
> 
> Cordially,
> 
> Claudio Prono.
> 
> 
> 


[Linux-HA] monitor interval

2011-02-21 Thread Hugo, Marcio
Hi...

If I fail to specify the "s" after 30, what is the monitor interval?

Is op monitor interval="30s" not equal to op monitor interval="30"?

TKS!

Att, Marcio Hugo
Technology Consultant, EB HPS Technology Services
Hewlett-Packard Company
+55 11 7148 2184 / Mobile
marcio.h...@hp.com / Email
Av. Tamboré, 74/200
Barueri, SP 06400-000 - Brazil

[Linux-HA] Logs in the solution of HA

2011-02-21 Thread Hugo, Marcio
Hi...

Is it possible to increase the log level of the HA solution
(OpenAIS/Pacemaker)?

TKS!!


Att, Marcio Hugo
Technology Consultant, EB HPS Technology Services
Hewlett-Packard Company
+55 11 7148 2184 / Mobile
marcio.h...@hp.com / Email
Av. Tamboré, 74/200
Barueri, SP 06400-000 - Brazil

Re: [Linux-HA] fsck filesystem?

2011-02-21 Thread Dejan Muhamedagic
Hi,

On Fri, Feb 18, 2011 at 11:56:49AM -0500, Tony Nelson wrote:
> Hi All,
> 
> I have a small cluster configured like this:
> 
> [-- config -]
> root@ihdb2:~# crm configure show
> node $id="3888bf0f-3e06-4ad8-a2c2-297451128d3d" ihdb1
> node $id="a1f70384-6684-47e6-ba00-ed082dee7a56" ihdb2
> primitive bacula-fd lsb:bacula-fd.local \
>   meta target-role="Started"
> primitive dbip ocf:heartbeat:IPaddr2 \
>   params ip="192.168.44.22" nic="eth0" \
>   op start interval="0" timeout="120s" \
>   op monitor interval="30s" timeout="20s"
> primitive fs0 ocf:heartbeat:Filesystem \
>   params fstype="ext3" directory="/var/lib/postgresql" device="/dev/vg01/postgresql" options="noatime" \
>   op start interval="0" timeout="60s" \
>   op stop interval="0" timeout="60s" \
>   meta target-role="Started"
> primitive iscsi ocf:heartbeat:iscsi \
>   params portal="192.168.43.28" target="iqn.2001-05.com.equallogic:0-8a0906-a6bb3d802-25aca117e304cae3-ihdb" \
>   op start interval="0" timeout="120s" \
>   op monitor interval="30s" timeout="30s" \
>   op stop interval="0" timeout="120s" \
>   meta target-role="Started"
> primitive psql lsb:postgresql-8.4 \
>   meta target-role="Started"
> group psql-group iscsi fs0 dbip bacula-fd psql \
>   meta target-role="Started"
> property $id="cib-bootstrap-options" \
>   dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
>   cluster-infrastructure="Heartbeat" \
>   stonith-enabled="false" \
>   last-lrm-refresh="1291165836" \
>   no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
>   resource-stickiness="100"
> [ -- end config --]
> 
> This morning the postgres server started logging errors because of corrupted 
> data files.
> 
> I stopped all of the services except for the iscsi one and manually mounted 
> the filesystem.  The system said something like "Warning: mounting a 
> filesystem with errors".  Sorry I don't have the exact messages.
> 
> I unmounted the filesystem, did a fsck manually then restarted the services.  
> 
> Is there any way to have heartbeat fsck the filesystem like a normal mount 
> from fstab would?  Did I miss a step?

No. ext3 is a filesystem with a journal, so it is considered
that it can recover without fsck. Otherwise, there's a parameter
called run_fsck, check the meta data: crm ra info Filesystem.
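
As a hedged illustration of the run_fsck parameter: assuming the installed Filesystem agent accepts the values auto, force and no (confirm with crm ra info Filesystem), a small wrapper around the crm shell could set it on an existing resource. The function name set_run_fsck is made up for this example.

```shell
#!/bin/sh
# Hypothetical wrapper: set run_fsck on a Filesystem resource.
# Assumption: the agent accepts "auto", "force" and "no"; check
# "crm ra info Filesystem" on the installed version.
set_run_fsck() {
    rsc="$1"
    mode="$2"
    case "$mode" in
        auto|force|no) ;;
        *)
            echo "unsupported run_fsck value: $mode" >&2
            return 2
            ;;
    esac
    # Updates the instance parameter in the CIB via the crm shell.
    crm resource param "$rsc" set run_fsck "$mode"
}
```

For the configuration quoted above, that would be called as set_run_fsck fs0 force.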

BTW, it is very unusual (and suspicious) that the filesystem
starts having errors just like that, while the system's running.
You should find what caused the corruption.

Thanks,

Dejan

> Thank you in advance for any help.
> Tony Nelson


Re: [Linux-HA] DRBD BrokenPipe

2011-02-21 Thread Bart Coninckx
Boris,

what does your network connection in between DRBD nodes look like? Which
NICs?

B.

On 02/21/11 12:45, Boris Virc wrote:
> Hello,
> 
> I have installed SLES with kernel version 2.6.32.19-0.3 and DRBD 8.3.8.1 
> (using two nodes - primary-slave).
> 
> I noticed that there is a lot of BrokenPipe errors in log files:
> 
> Feb 11 12:59:40 sles1 crm-fence-peer.sh[64879]: invoked for r0
> Feb 11 12:59:41 sles1 crm-fence-peer.sh[64879]: INFO peer is reachable, my 
> disk is UpToDate: placed constraint 'drbd-fence-by-handler-ms_drbd'
> Feb 11 12:59:41 sles1 kernel: [6022113.566198] block drbd0: helper command: 
> /sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
> Feb 11 12:59:41 sles1 kernel: [6022113.566206] block drbd0: fence-peer helper 
> returned 4 (peer was fenced)
> Feb 11 12:59:41 sles1 kernel: [6022113.566228] block drbd0: pdsk( DUnknown -> 
> Outdated )
> Feb 11 12:59:41 sles1 kernel: [6022113.566400] block drbd0: conn( BrokenPipe 
> -> Unconnected )
> Feb 11 12:59:41 sles1 kernel: [6022113.566418] block drbd0: receiver 
> terminated
> Feb 11 12:59:41 sles1 kernel: [6022113.566422] block drbd0: Restarting 
> receiver thread
> Feb 11 12:59:41 sles1 kernel: [6022113.566426] block drbd0: receiver 
> (re)started
> Feb 11 12:59:41 sles1 kernel: [6022113.566441] block drbd0: conn( Unconnected 
> -> WFConnection )
> Feb 11 12:59:41 sles1 pengine: [30521]: notice: unpack_config: On loss of CCM 
> Quorum: Ignore
> 
> The system works, but within 2 monts, there was already two unpredictable 
> error (we had to restart secondary server so that primary started to work 
> again).
> 
> Is there anything that we can do to avoid those errors ?
> 
> Regards,
> Boris


[Linux-HA] DRBD BrokenPipe

2011-02-21 Thread Boris Virc
Hello,

I have installed SLES with kernel version 2.6.32.19-0.3 and DRBD 8.3.8.1 (using 
two nodes - primary-slave).

I noticed that there are a lot of BrokenPipe errors in the log files:

Feb 11 12:59:40 sles1 crm-fence-peer.sh[64879]: invoked for r0
Feb 11 12:59:41 sles1 crm-fence-peer.sh[64879]: INFO peer is reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-ms_drbd'
Feb 11 12:59:41 sles1 kernel: [6022113.566198] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
Feb 11 12:59:41 sles1 kernel: [6022113.566206] block drbd0: fence-peer helper returned 4 (peer was fenced)
Feb 11 12:59:41 sles1 kernel: [6022113.566228] block drbd0: pdsk( DUnknown -> Outdated )
Feb 11 12:59:41 sles1 kernel: [6022113.566400] block drbd0: conn( BrokenPipe -> Unconnected )
Feb 11 12:59:41 sles1 kernel: [6022113.566418] block drbd0: receiver terminated
Feb 11 12:59:41 sles1 kernel: [6022113.566422] block drbd0: Restarting receiver thread
Feb 11 12:59:41 sles1 kernel: [6022113.566426] block drbd0: receiver (re)started
Feb 11 12:59:41 sles1 kernel: [6022113.566441] block drbd0: conn( Unconnected -> WFConnection )
Feb 11 12:59:41 sles1 pengine: [30521]: notice: unpack_config: On loss of CCM Quorum: Ignore

The system works, but within 2 months there have already been two
unpredictable errors (we had to restart the secondary server so that the
primary started to work again).

Is there anything that we can do to avoid those errors?

Regards,
Boris


[Linux-HA] HA and Bonding device

2011-02-21 Thread Claudio Prono
Hello all,

I am writing to the list because of some confusion about bonding
devices in an HA environment.

My network architecture is the following:

- 2 redundant firewalls with heartbeat, 4 network NICs
- 2 redundant switches
- 1 system behind the switches with 2 NICs

The system behind the switches has its 2 NICs connected to the 2 switches.
The two switches are connected to each other.
From each switch I have 2 cables connecting to the two firewalls.

Now, my question is: I have to set up a bonding device on each firewall,
in mode 1 (like the example I found here:
http://linux-ip.net/html/ether-bonding.html#ex-ether-bonding-aggregation),
but is some particular configuration of the switches needed to make this
work correctly? Or is it simply based on the interface link status?

I have searched for help in the previous mailing list posts, but it is
still not clear to me.
Any hint is welcome.

Cordially,

Claudio Prono.



-- 

Claudio Prono OPST
System Developer   
  Gsm: +39-349-54.33.258
@PSS Srl  Tel: +39-011-32.72.100
Via San Bernardino, 17Fax: +39-011-32.46.497
10141 Torino - ITALY  http://atpss.net/disclaimer

PGP Key - http://keys.atpss.net/c_prono.asc






Re: [Linux-HA] CLVM & cmirror using Pacemaker DLM integration on rhel 6

2011-02-21 Thread Andrew Beekhof
On Thu, Feb 17, 2011 at 3:34 PM, Pieter Baele  wrote:
> Hi,
>
> With our last cluster experiments we try to set up Pacemaker with CLVM 
> mirroring
> on RHEL 6.0
>
> I added a DLM resource, but when I try to add clvm in crm, I get the
> following error:
> crm(live)configure# primitive clvm ocf:lvm2:clvmd params
> daemon_timeout="30" op monitor interval="60" timeout="60"
> ERROR: ocf:lvm2:clvmd: could not parse meta-data:
> ERROR: ocf:lvm2:clvmd: no such resource agent
>
> Any idea what's missing?

The ocf:lvm2:clvmd resource agent perhaps?

>
> Is there a short guide/howto somewhere how to set this up?

I don't know of one personally

>
> Last updated: Thu Feb 17 15:32:12 2011
> Stack: openais
> Current DC: xxx - partition with quorum
> Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> 
>
> Online: [ xyz xyz ]
>
> ClusterIP       (ocf::heartbeat:IPaddr2):       Started x
> dlm     (ocf::pacemaker:controld):      Started x