Re: [Linux-HA] One-Node-Cluster

2011-02-14 Thread Andrew Beekhof
On Tue, Feb 15, 2011 at 6:08 AM, Alan Robertson  wrote:
> On 02/14/2011 04:45 AM, Andrew Beekhof wrote:
>> On Mon, Feb 14, 2011 at 12:40 PM, Ulrich Windl
>>   wrote:
>> Andrew Beekhof  schrieb am 14.02.2011 um 10:08 in 
>> Nachricht
>>> :
>>> [...]
>>>> > The log just keeps on saying:
>>>> > Feb  8 16:01:03 dmcs2 pengine: [1480]: WARN: cluster_status: We do not
>>>> > have
>>>> > quorum - fencing and resource management disabled
>>>> Exactly.
>>>> Read that line again a couple of times, then read "clusters from scratch".
>>> [...]
>>>
>>> Which makes me wonder: Can a one-node-cluster ever have a quorum?
>> Not really, which is why we have no-quorum-policy.
>>
>>> I think a one-node-cluster is a completely valid construct. Also with 
>>> Linux-HA?
>> Yep.
>
> If you're using the Heartbeat membership stack, then it is perfectly
> happy to give you quorum in a one-node cluster.

Or a two node cluster.  Which is not exactly ideal.

> In fact, at one time I wrote a script to create a cluster configuration
> from your /etc/init.d/ scripts - so that Pacemaker could be effectively
> a nice replacement for init - with a respawn that really works ;-)
>
>
> --
>     Alan Robertson
>
> "Openness is the foundation and preservative of friendship...  Let me claim 
> from you at all times your undisguised opinions." - William Wilberforce
>
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] One-Node-Cluster

2011-02-14 Thread Alan Robertson
On 02/14/2011 04:45 AM, Andrew Beekhof wrote:
> On Mon, Feb 14, 2011 at 12:40 PM, Ulrich Windl
>   wrote:
> Andrew Beekhof  schrieb am 14.02.2011 um 10:08 in 
> Nachricht
>> :
>> [...]
>>> > The log just keeps on saying:
>>> > Feb  8 16:01:03 dmcs2 pengine: [1480]: WARN: cluster_status: We do not have
>>> > quorum - fencing and resource management disabled
>>> Exactly.
>>> Read that line again a couple of times, then read "clusters from scratch".
>> [...]
>>
>> Which makes me wonder: Can a one-node-cluster ever have a quorum?
> Not really, which is why we have no-quorum-policy.
>
>> I think a one-node-cluster is a completely valid construct. Also with 
>> Linux-HA?
> Yep.

If you're using the Heartbeat membership stack, then it is perfectly 
happy to give you quorum in a one-node cluster.

In fact, at one time I wrote a script to create a cluster configuration 
from your /etc/init.d/ scripts - so that Pacemaker could be effectively 
a nice replacement for init - with a respawn that really works ;-)
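
For anyone curious, wrapping an existing init script that way might look
roughly like this with the crm shell -- a sketch only; "foo" and the timings
are placeholders, not anything from Alan's script:

  # Manage /etc/init.d/foo through the LSB resource class; if the
  # monitor operation fails, Pacemaker restarts the service.
  crm configure primitive foo lsb:foo \
        op monitor interval="30s" timeout="30s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"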


-- 
 Alan Robertson

"Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions." - William Wilberforce

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [Pacemaker] Solved: SLES 11 HAE SP1 Signon to CIB Failed

2011-02-14 Thread Tim Serong
On 2/9/2011 at 09:49 PM,  wrote: 
> > So I compared the /etc/ais/openais.conf in non-sp1 with  
> > /etc/corosync/corosync.conf from sp1 and found this bit missing which  
> > could be quite useful... 
> >   
> > service {  
> > # Load the Pacemaker Cluster Resource Manager  
> > ver:   0  
> > name:  pacemaker  
> > use_mgmtd: yes  
> > use_logd:  yes 
> > } 
> >   
> > Added it and it works. Doh.  
> >   
> > It seems the example corosync.conf that is shipped won't start
> > Pacemaker. I'm not sure if that's on purpose or not, but I found it a
> > bit confusing after being used to it 'just working' previously.
>  
> Ah.  Understandably confusing.  That got fixed post-SP1, in a 
> maintenance update that went out in September or thereabouts. 
>  
> Regards, 
>  
> Tim 
>  
>  
> -- 
> Tim Serong  
> Senior Clustering Engineer, OPS Engineering, Novell Inc. 
>  
>  
> --- 
>  
> Thanks Tim. 
>  
> The media that can be downloaded *now* from Novell downloads still has
> this issue, though, so any new clusters will fall foul of it. Generally,
> with a test build you won't perform updates, as that burns a licence you
> would need for the production system. Should the downloadable media have
> the issue fixed?

With the disclaimer that I haven't tried this myself lately... :)
On this page:

  http://www.novell.com/products/highavailability/eval.html

It says:

  Please note: Once you login, your evaluation software will
  automatically be registered to you. You will be able to
  immediately access free maintenance patches and updates online
  for a 60-day period following your registration date.

So apparently new users should be able to get the latest maintenance
updates.

Regards,

Tim


-- 
Tim Serong 
Senior Clustering Engineer, OPS Engineering, Novell Inc.
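
For reference, a quick way to confirm that the service stanza quoted above
took effect is to restart corosync and check that the Pacemaker daemons were
spawned and that the CIB answers -- a sketch only; exact daemon names vary a
little between versions:

  # With ver: 0 the Pacemaker daemons are started by corosync itself.
  ps -ef | egrep 'cib|crmd|pengine|attrd|lrmd|stonith'
  crm_mon -1    # should now connect instead of failing to sign on to the CIB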



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [Linux-ha-dev] resource agents 1.0.4-rc announcement

2011-02-14 Thread Brett Delle Grazie
Hi,

On 14 February 2011 19:09, Florian Haas  wrote:
> On 02/14/2011 07:54 PM, Raoul Bhatia [IPAX] wrote:
>> hi,
>>
>> On 14.02.2011 17:56, Dejan Muhamedagic wrote:
>>> Hello,
>>>
>>> The current repository of Resource Agents has been tagged to
>>> agents-1.0.4-rc on Friday evening.
>>>
>>> Some major additions and improvements:
>>>
>>> - conntrackd, exportfs, nginx, fio: new agents
>>> - mysql: master-slave functionality and replication monitoring
>>
>> I have some serious issues with mysql master-slave and rapid failover.
>> IMHO, and I might be wrong, this functionality "is not quite there yet".
>>
>> The basic problem: a master failover from node1 to node2 and back again
>> makes node2 try to parse the binlog from the very start.
>>
>> "CHANGE MASTER TO" does not honor the slave's last position for a given
>> master upon failover, and/or the binlogs on the master are never reset,
>> thus leading to duplicate parsing of the very same binlog.
>>
>> I'll put together a more detailed report, but I'm kind of swamped with
>> work right now.
>
> Okay, thanks for the feedback. We'll see what we can do about this. If
> you can spare any more time looking into this issue, it would be much
> appreciated.
>
> Cheers,
> Florian
>
>

There's also one possible issue in the tomcat stop operation, due to an
apparent bug in Pacemaker where the resource agent is passed the wrong
OCF_RESKEY_CRM_meta_timeout. I'm going to lodge a bug report tomorrow.


-- 
Best Regards,

Brett Delle Grazie
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [Linux-ha-dev] resource agents 1.0.4-rc announcement

2011-02-14 Thread Florian Haas
On 02/14/2011 07:54 PM, Raoul Bhatia [IPAX] wrote:
> hi,
> 
> On 14.02.2011 17:56, Dejan Muhamedagic wrote:
>> Hello,
>>
>> The current repository of Resource Agents has been tagged to
>> agents-1.0.4-rc on Friday evening.
>>
>> Some major additions and improvements:
>>
>> - conntrackd, exportfs, nginx, fio: new agents
>> - mysql: master-slave functionality and replication monitoring
> 
> I have some serious issues with mysql master-slave and rapid failover.
> IMHO, and I might be wrong, this functionality "is not quite there yet".
> 
> The basic problem: a master failover from node1 to node2 and back again
> makes node2 try to parse the binlog from the very start.
> 
> "CHANGE MASTER TO" does not honor the slave's last position for a given
> master upon failover, and/or the binlogs on the master are never reset,
> thus leading to duplicate parsing of the very same binlog.
> 
> I'll put together a more detailed report, but I'm kind of swamped with
> work right now.

Okay, thanks for the feedback. We'll see what we can do about this. If
you can spare any more time looking into this issue, it would be much
appreciated.

Cheers,
Florian
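
For context, the position-aware form of the statement Raoul refers to looks
roughly like this in MySQL -- host, binlog file and offset below are
placeholders, not values taken from the agent:

  -- Resume replication from the slave's last applied position instead of
  -- replaying the master's binlog from the beginning.
  CHANGE MASTER TO
      MASTER_HOST='node1',
      MASTER_LOG_FILE='mysql-bin.000042',
      MASTER_LOG_POS=107;
  START SLAVE;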



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

[Linux-HA] resource agents 1.0.4-rc announcement

2011-02-14 Thread Dejan Muhamedagic
Hello,

The current repository of Resource Agents has been tagged to
agents-1.0.4-rc on Friday evening.

Some major additions and improvements:

- conntrackd, exportfs, nginx, fio: new agents
- mysql: master-slave functionality and replication monitoring
- db2: support for db2 v9 and multiple partitions
- SAPDatabase,SAPInstance: many fixes and improvements
- tomcat: support for multiple instances and improved stop operation

Version 1.0.4 will be released on Wednesday with a more detailed
announcement. We don't expect any significant changes in the
meantime.

Some big contributions didn't make it into this release: the db2
patch is too intrusive, slapd is not quite there yet, and there
was simply no time to review dovecot. The next release won't be
too far off, though.

Many thanks to all contributors. You know who you are :-)

Cheers,

Florian, Lars, Dejan
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] One-Node-Cluster

2011-02-14 Thread Andrew Beekhof
On Mon, Feb 14, 2011 at 12:40 PM, Ulrich Windl
 wrote:
 Andrew Beekhof  schrieb am 14.02.2011 um 10:08 in 
 Nachricht
> :
> [...]
>> > The log just keeps on saying:
>> > Feb  8 16:01:03 dmcs2 pengine: [1480]: WARN: cluster_status: We do not have
>> > quorum - fencing and resource management disabled
>>
>> Exactly.
>> Read that line again a couple of times, then read "clusters from scratch".
> [...]
>
> Which makes me wonder: Can a one-node-cluster ever have a quorum?

Not really, which is why we have no-quorum-policy.

> I think a one-node-cluster is a completely valid construct. Also with 
> Linux-HA?

Yep.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] One-Node-Cluster

2011-02-14 Thread Ulrich Windl
>>> Andrew Beekhof  schrieb am 14.02.2011 um 10:08 in 
>>> Nachricht
:
[...]
> > The log just keeps on saying:
> > Feb  8 16:01:03 dmcs2 pengine: [1480]: WARN: cluster_status: We do not have
> > quorum - fencing and resource management disabled
> 
> Exactly.
> Read that line again a couple of times, then read "clusters from scratch".
[...]

Which makes me wonder: Can a one-node-cluster ever have a quorum? I think a 
one-node-cluster is a completely valid construct. Also with Linux-HA?

Regards,
Ulrich


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] A bunch of thoughts/questions about heartbeat network(s)

2011-02-14 Thread Andrew Beekhof
On Tue, Jan 25, 2011 at 8:15 AM, Alain.Moulle  wrote:
> Hi
>
> A bunch of thoughts/questions about heartbeat network(s) :
>
> In the following, when I talk about "two heartbeat networks", I'm
> talking about two physically different networks configured in corosync.conf
> as two different ring numbers (with rrp_mode set to "active").
>
> 1/ With a 2-node HA cluster, it is recommended to have two heartbeat
>    networks, to avoid the race for fencing, or even dual fencing, in case
>    of a problem on the heartbeat network.
>
>    But with a more-than-2-node HA cluster, is it always worthwhile to
>    have two heartbeat networks?  My understanding is that if one node
>    loses contact with the other nodes in the cluster due to a heartbeat
>    network problem, then, being "isolated", it does not have quorum and
>    so is not authorized to fence any other node, whereas the other nodes
>    do have quorum and so will decide to fence the problem node.
>    Right?

Right, but wouldn't it be better to have no need to shoot anyone?
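
For reference, the redundant-ring setup Alain describes at the top is usually
expressed in corosync.conf along these lines -- a sketch only, with placeholder
addresses and ports:

  totem {
      version: 2
      # Use both rings actively rather than failing over between them
      rrp_mode: active
      interface {
          ringnumber: 0
          bindnetaddr: 192.168.1.0
          mcastaddr: 226.94.1.1
          mcastport: 5405
      }
      interface {
          ringnumber: 1
          bindnetaddr: 10.0.0.0
          mcastaddr: 226.94.2.1
          mcastport: 5405
      }
  }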

>
>    So is there any other advantage to having more than two heartbeat
>    networks in a more-than-2-node HA cluster?
>
> 2/ If the future of the HA stack for Pacemaker is option 4 (corosync +
>    cpg + cman + mcp),

Option 4 does not involve cman

>    meaning that cluster manager configuration parameters will all be in
>    cluster.conf and nothing more in corosync.conf (again, that's my
>    understanding...),

Other way around, cluster.conf is going away (like cman) not corosync.conf

>    from memory there is no possibility to set two heartbeat networks in
>    cluster.conf (Cluster Suite from RH worked with only one heartbeat
>    network, and if one wanted to use two heartbeat networks one had to
>    configure a bonding solution).
>
>    Am I right when I write "no possibility of 2 hb networks with stack
>    option 4"?

No

>
> Thanks a lot for your responses, and tell me if some of my understanding
> is not right ...
>
> Alain
>
>
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] slave not switching as primary after primary standby

2011-02-14 Thread Andrew Beekhof
On Tue, Feb 8, 2011 at 9:07 AM, Linux Cook  wrote:
> My slave node is not taking over as primary when I put my primary node
> into standby. Here is my configuration:
>
> node dmcs1 \
>        attributes standby="on"
> node dmcs2
> primitive postgres_IP ocf:heartbeat:IPaddr2 \
>        params ip="10.110.10.3" cidr_netmask="32" nic="eth1" \
>        op monitor interval="30s"
> primitive postgres_db ocf:heartbeat:pgsql \
>        op monitor interval="30s" timeout="30s"
> primitive postgres_drbd ocf:linbit:drbd \
>        params drbd_resource="postgres" \
>        op monitor interval="15s" \
>        op stop interval="0" timeout="300s" \
>        op start interval="0" timeout="300s"
> primitive postgres_fs ocf:heartbeat:Filesystem \
>        params device="/dev/drbd0" directory="/usr/local/pgsql"
> fstype="ext4"
> ms ms_postgres_drbd postgres_drbd \
>        meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> location cli-prefer-postgres_db postgres_db 50: dmcs1
> colocation db-with-fs inf: postgres_db postgres_fs
> colocation db-with-ip inf: postgres_db postgres_IP
> colocation fs_on_drbd inf: postgres_fs ms_postgres_drbd:Master
> order db-after-fs inf: postgres_fs postgres_db
> order db-after-ip inf: postgres_IP postgres_db
> order fs-after-drbd inf: ms_postgres_drbd:promote postgres_fs:start
> property $id="cib-bootstrap-options" \
>        dc-version="1.1.4-ac608e3491c7dfc3b3e3c36d966ae9b016f77065" \
>        cluster-infrastructure="openais" \
>        expected-quorum-votes="2" \
>        stonith-enabled="false" \
>        last-lrm-refresh="1297138923"
> rsc_defaults $id="rsc-options" \
>        resource-stickiness="100"
>
>
> The log just keeps on saying:
> Feb  8 16:01:03 dmcs2 pengine: [1480]: WARN: cluster_status: We do not have
> quorum - fencing and resource management disabled

Exactly.
Read that line again a couple of times, then read "clusters from scratch".
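
With expected-quorum-votes="2" as in the configuration above, a lone node does
not have quorum, so the cluster stops managing resources unless told otherwise.
The usual remedy for two-node clusters (covered in "Clusters from Scratch") is
roughly the following -- a sketch; ignoring quorum is only reasonable when
fencing actually works:

  # Keep managing resources even when quorum is lost (2-node clusters).
  crm configure property no-quorum-policy="ignore"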





> Feb  8 16:01:03 dmcs2 pengine: [1480]: notice: unpack_rsc_op: Operation
> postgres_drbd:1_monitor_0 found resource postgres_drbd:1 active on dmcs2
> Feb  8 16:01:03 dmcs2 pengine: [1480]: notice: clone_print:  Master/Slave
> Set: ms_postgres_drbd [postgres_drbd]
> Feb  8 16:01:03 dmcs2 pengine: [1480]: notice: short_print:      Slaves: [
> dmcs2 ]
> Feb  8 16:01:03 dmcs2 pengine: [1480]: notice: short_print:      Stopped: [
> postgres_drbd:0 ]
> Feb  8 16:01:03 dmcs2 pengine: [1480]: notice: native_print:
> postgres_fs#011(ocf::heartbeat:Filesystem):#011Stopped
> Feb  8 16:01:03 dmcs2 pengine: [1480]: notice: native_print:
> postgres_IP#011(ocf::heartbeat:IPaddr2):#011Started dmcs2
> Feb  8 16:01:03 dmcs2 pengine: [1480]: notice: native_print:
> postgres_db#011(ocf::heartbeat:pgsql):#011Stopped
> Feb  8 16:01:03 dmcs2 pengine: [1480]: notice: LogActions: Leave resource
> postgres_drbd:0#011(Stopped)
> Feb  8 16:01:03 dmcs2 pengine: [1480]: notice: LogActions: Stop resource
> postgres_drbd:1#011(dmcs2)
> Feb  8 16:01:03 dmcs2 pengine: [1480]: notice: LogActions: Leave resource
> postgres_fs#011(Stopped)
> Feb  8 16:01:03 dmcs2 pengine: [1480]: notice: LogActions: Stop resource
> postgres_IP#011(dmcs2)
> Feb  8 16:01:03 dmcs2 pengine: [1480]: notice: LogActions: Leave resource
> postgres_db#011(Stopped)
> Feb  8 16:01:03 dmcs2 crmd: [1481]: info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> Feb  8 16:01:03 dmcs2 crmd: [1481]: info: unpack_graph: Unpacked transition
> 0: 10 actions in 10 synapses
> Feb  8 16:01:03 dmcs2 crmd: [1481]: info: do_te_invoke: Processing graph 0
> (ref=pe_calc-dc-1297152063-26) derived from
> /var/lib/pengine/pe-input-471.bz2
> Feb  8 16:01:03 dmcs2 crmd: [1481]: info: te_pseudo_action: Pseudo action 17
> fired and confirmed
> Feb  8 16:01:03 dmcs2 crmd: [1481]: info: te_rsc_command: Initiating action
> 33: stop postgres_IP_stop_0 on dmcs2 (local)
> Feb  8 16:01:03 dmcs2 lrmd: [1478]: info: cancel_op: operation monitor[8] on
> ocf::IPaddr2::postgres_IP for client 1481, its parameters:
> CRM_meta_name=[monitor] cidr_netmask=[32] crm_feature_set=[3.0.5]
> CRM_meta_timeout=[2] CRM_meta_interval=[3] nic=[eth1]
> ip=[10.110.10.3]  cancelled
> Feb  8 16:01:03 dmcs2 crmd: [1481]: info: do_lrm_rsc_op: Performing
> key=33:0:0:854850f1-11ef-4777-970b-7d27ebf5e174 op=postgres_IP_stop_0 )
>
> Feb  8 16:07:06 dmcs2 crm_attribute: [5980]: info: Invoked: crm_attribute -N
> dmcs2 -n master-postgres_drbd:1 -l reboot -D
> Feb  8 16:07:21 dmcs2 crm_attribute: [6038]: info: Invoked: crm_attribute -N
> dmcs2 -n master-postgres_drbd:1 -l reboot -D
> Feb  8 16:07:36 dmcs2 crm_attribute: [6066]: info: Invoked: crm_attribute -N
> dmcs2 -n master-postgres_drbd:1 -l reboot -D
> Feb  8 16:07:37 dmcs2 cib: [1477]: info: cib_stats: Processed 417 operations
> (647.00us average, 0% utilization) in the last 10min
> Feb  8 16:07:51 dmcs2 crm_attribute: [6124]: info: Invoked: crm_attribute -N
> dmcs2 -

Re: [Linux-HA] OCF_RESKEY_CRM_meta_timeout not matching monitor timeout meta-data

2011-02-14 Thread Andrew Beekhof
On Fri, Feb 4, 2011 at 11:23 AM, Brett Delle Grazie
 wrote:
> Hi,
>
> Apologies for cross-posting but I'm not sure where this problem resides.
>
> I'm running:
> corosync-1.2.7-1.1.el5.x86_64
> corosynclib-1.2.7-1.1.el5.x86_64
> cluster-glue-1.0.6-1.6.el5.x86_64
> cluster-glue-libs-1.0.6-1.6.el5.x86_64
> pacemaker-1.0.10-1.4.el5.x86_64
> pacemaker-libs-1.0.10-1.4.el5.x86_64
> resource-agents-1.0.3-2.6.el5.x86_64
>
> on RHEL5.
>
> In one of my resource agents (tomcat) I'm directly outputting the result of:
> $((OCF_RESKEY_CRM_meta_timeout/1000))
> to an external file, and it's coming up with a value of '100'.
>
> Whereas the resource definition in Pacemaker specifies a timeout of '30',
> specifically:
>
> primitive tomcat_tc1 ocf:intact:tomcat \
>        params tomcat_user="tomcat" catalina_home="/opt/tomcat6"
> catalina_pid="/home/tomcat/tc1/temp/tomcat.pid"
> catalina_rotate_log="NO" script_log="/home/tomcat/tc1/logs/tc1.log"
> statusurl="http://127.0.0.1/version/" java_home="/usr/lib/jvm/java" \
>        op start interval="0" timeout="70" \
>        op stop interval="0" timeout="20" \
>        op monitor interval="60" timeout="30" start-delay="70"
>
> Is this a known bug?

No.  Could you file a bug please?

> Does it affect all operation timeouts?

Unknown
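
Until the root cause is known, a resource agent can at least guard against a
missing or surprising value when deriving its wait time from the CRM-supplied
timeout -- a sketch in shell, with an arbitrary 20-second fallback:

  # OCF_RESKEY_CRM_meta_timeout is the operation timeout in milliseconds;
  # fall back to 20000 ms if it is unset or empty.
  timeout_ms="${OCF_RESKEY_CRM_meta_timeout:-20000}"
  timeout_s=$(( timeout_ms / 1000 ))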

>
> Thanks,
>
> --
> Best Regards,
>
> Brett Delle Grazie
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems