Re: [ClusterLabs] Why my node1 couldn't back to the clustering chain?

2021-04-09 Thread Jason Long
Thanks.
A cheat sheet is a PDF file in which all the useful commands and parameters are listed.






On Friday, April 9, 2021, 01:12:16 PM GMT+4:30, Antony Stone wrote: 





On Friday 09 April 2021 at 10:34:33, Jason Long wrote:

> Thanks.
> I meant was a Cheat sheet.

I don't understand that sentence.

> Yes, something like rendering a 3D movie or... . Are Corosync and Pacemaker
> not OK for that? What kind of clustering is used for rendering? A Beowulf
> cluster?

Corosync and pacemaker are for High Availability, which generally means that 
you have more computing resources than you need at any given time, in order 
that a failed machine can be efficiently replaced by a working one.  If all 
your machines are busy, and one fails, you have no spare computing resources 
to take over from the failed one.

The setup you were asking about is High Performance computing, where you are 
trying to use the resources you have as efficiently and continuously as 
possible, therefore you don't have any spare capacity (since 'spare' means 
'wasted' in this regard).

A Beowulf Cluster is one example of the sort of thing you're asking about; for 
others, see the "Implementations" section of the URL I previously provided.


Antony.

-- 
https://tools.ietf.org/html/rfc6890 - providing 16 million IPv4 addresses for 
talking to yourself.


                                                  Please reply to the list;
                                                        please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread Klaus Wenninger

On 4/9/21 4:04 PM, Klaus Wenninger wrote:

On 4/9/21 3:45 PM, Klaus Wenninger wrote:

On 4/9/21 3:36 PM, Klaus Wenninger wrote:

On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:

Hi Klaus,

Thanks for your comment.


Hmm ... is that with selinux enabled?
Respectively do you see any related avc messages?


Selinux is not enabled.
Isn't this caused by crm_mon not returning a response when pacemakerd 
prepares to stop?

yep ... that doesn't look good.
While in pcmk_shutdown_worker ipc isn't handled.

Stop ... that should actually work as pcmk_shutdown_worker
should exit quite quickly and proceed after mainloop
dispatching when called again.
Don't see anything atm that might be blocking for longer ...
but let me dig into it further ...

What happens is clear (thanks Ken for the hint ;-) ).
When pacemakerd is shutting down - already when it
shuts down the resources and not just when it starts to
reap the subdaemons - crm_mon reads that state and
doesn't try to connect to the cib anymore.

The question is why that didn't create issues earlier.
Probably I didn't test with resources that had crm_mon in
their stop/monitor-actions but sbd should have run into
issues.

Klaus

But when shutting down a node the resources should be
shutdown before pacemakerd goes down.
But let me have a look if it can happen that pacemakerd
doesn't react to the ipc-pings before. That btw. might be
lethal for sbd-scenarios (if the phase is too long and it
might actually not be defined).

My idea with selinux would have been that it might block
the ipc if crm_mon is issued by execd. But well forget
about it as it is not enabled ;-)


Klaus


pgsql needs the result of crm_mon in demote processing and stop 
processing.
crm_mon should return a response even after pacemakerd goes into a 
stop operation.


Best Regards,
Hideo Yamauchi.


- Original Message -

From: Klaus Wenninger 
To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related 
to open-source clustering welcomed 

Cc:
Date: 2021/4/9, Fri 21:12
Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql 
resource control fails.


On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:

  Hi Ken,
  Hi All,

  In the pgsql resource, crm_mon is executed in the process of 
demote and

stop, and the result is processed.
  However, pacemaker included in RHEL8.4beta fails to execute 
this crm_mon.

    - The problem also occurs on github

master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).

  The problem can be easily reproduced in the following ways.

  Step1. Modify to execute crm_mon in the stop process of the 
Dummy resource.

  

  dummy_stop() {
       mon=$(crm_mon -1)
       ret=$?
       ocf_log info "### YAMAUCHI  crm_mon[${ret}] : ${mon}"
       dummy_monitor
       if [ $? =  $OCF_SUCCESS ]; then
           rm ${OCF_RESKEY_state}
       fi
       return $OCF_SUCCESS
  }
  

  Step2. Configure a cluster with two nodes.
  

  [root@rh84-beta01 ~]# crm_mon -rfA1
  Cluster Summary:
     * Stack: corosync
     * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - 
partition

with quorum

     * Last updated: Thu Apr  8 18:00:52 2021
     * Last change:  Thu Apr  8 18:00:38 2021 by root via 
cibadmin on

rh84-beta01

     * 2 nodes configured
     * 1 resource instance configured

  Node List:
     * Online: [ rh84-beta01 rh84-beta02 ]

  Full List of Resources:
     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01

  Migration Summary:
  

  Step3. Stop the node where the Dummy resource is running. The 
resource will

fail over.

  
  [root@rh84-beta02 ~]# crm_mon -rfA1
  Cluster Summary:
     * Stack: corosync
     * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - 
partition

with quorum

     * Last updated: Thu Apr  8 18:08:56 2021
     * Last change:  Thu Apr  8 18:05:08 2021 by root via 
cibadmin on

rh84-beta01

     * 2 nodes configured
     * 1 resource instance configured

  Node List:
     * Online: [ rh84-beta02 ]
     * OFFLINE: [ rh84-beta01 ]

  Full List of Resources:
     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02
  

  However, if you look at the log, you can see that the execution 
of crm_mon

in the stop processing of the Dummy resource has failed.

  
  Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI 

crm_mon[102] : Pacemaker daemons shutting down ...
  Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] 
(log_op_output)
notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: 
cluster is not

available on this node ]
Hmm ... is that with selinux enabled?
Respectively do you see any related avc messages?

Klaus

  

  Similarly, pgsql also executes crm_mon with demote or stop, so 
control

fails.

  The problem seems to be related to the next fix.
    * Report pacemakerd in state waiting for sbd
     - https://github.com/ClusterLabs/pacemaker/pull/2278

  The problem does not occur with the release version of 
Pacemaker 2.0.5 or

the Pacemaker included wit

Re: [ClusterLabs] Fwd: Issue with resource-agents ocf:heartbeat:mariadb

2021-04-09 Thread Ken Gaillot
Hi,

I'm not very familiar with the mariadb agent, but one thing to check is
whether the output of "uname -n" can be used in the CHANGE MASTER command.
If not, you need to set node attributes for the right names to use.

I believe you have to configure and start replication manually once
before the cluster can manage it automatically.
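
For example, a quick way to compare the kernel's node name with the names the
cluster uses, and (if they differ) to map them with a node attribute, might
look like the sketch below; the attribute name is only illustrative, so check
the agent's metadata for the attribute it actually reads.

 # Name the resource agent will see vs. the names Pacemaker uses:
 uname -n
 crm_node -l

 # If they differ, set a permanent node attribute to map them
 # (the attribute name "mariadb-node-name" is hypothetical):
 crm_attribute --type nodes --node node2 \
     --name mariadb-node-name --update "$(uname -n)"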

On Fri, 2021-04-09 at 10:04 +0200, Olivier POUILLY wrote:
> Hi team,
> Thanks for the great job on these libraries.
> I would like to know if it is possible to get some help on the
> mariadb resource.
> After the configuration of my cluster pcs command shows me:
> root@node1:~# pcs status
> Cluster name: clusterserver
> Stack: corosync
> Current DC: node1 (version 2.0.1-9e909a5bdd) - partition with quorum
> Last updated: Thu Apr  8 15:45:35 2021
> Last change: Thu Apr  8 15:45:25 2021 by root via cibadmin on node1
> 
> 2 nodes configured
> 2 resources configured
> 
> Online: [ node1 node2 ]
> 
> Full list of resources:
> 
>  Clone Set: mariadb_server-clone [mariadb_server] (promotable)
>  Masters: [ node1 ]
>  Slaves: [ node2 ]
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> 
> But when I go to mysql on server2 I see my slave status is off:
> MariaDB [(none)]> SHOW SLAVE STATUS\G
> *** 1. row ***
> Slave_IO_State: 
>Master_Host: node1
>Master_User: replication
>Master_Port: 3306
>  Connect_Retry: 60
>Master_Log_File: master-bin.01
>Read_Master_Log_Pos: 463
> Relay_Log_File: master-relay-bin.02
>  Relay_Log_Pos: 672
>  Relay_Master_Log_File: master-bin.01
>   Slave_IO_Running: No
>  Slave_SQL_Running: No
>Replicate_Do_DB: 
>Replicate_Ignore_DB: 
> Replicate_Do_Table: 
> Replicate_Ignore_Table: 
>Replicate_Wild_Do_Table: 
>Replicate_Wild_Ignore_Table: 
> Last_Errno: 0
> Last_Error: 
>   Skip_Counter: 0
>Exec_Master_Log_Pos: 463
>Relay_Log_Space: 2935
>Until_Condition: None
> Until_Log_File: 
>  Until_Log_Pos: 0
> Master_SSL_Allowed: No
> Master_SSL_CA_File: 
> Master_SSL_CA_Path: 
>Master_SSL_Cert: 
>  Master_SSL_Cipher: 
> Master_SSL_Key: 
>  Seconds_Behind_Master: NULL
>  Master_SSL_Verify_Server_Cert: No
>  Last_IO_Errno: 0
>  Last_IO_Error: 
> Last_SQL_Errno: 0
> Last_SQL_Error: 
>Replicate_Ignore_Server_Ids: 
>   Master_Server_Id: 0
> Master_SSL_Crl: 
> Master_SSL_Crlpath: 
> Using_Gtid: Current_Pos
>Gtid_IO_Pos: 
>Replicate_Do_Domain_Ids: 
>Replicate_Ignore_Domain_Ids: 
>  Parallel_Mode: conservative
>  SQL_Delay: 0
>SQL_Remaining_Delay: NULL
>Slave_SQL_Running_State: 
>   Slave_DDL_Groups: 0
> Slave_Non_Transactional_Groups: 0
> Slave_Transactional_Groups: 0
> 
> On pacemaker log I got the following message:
> Apr 08 19:26:18 node2 pacemaker-execd [6899] (operation_finished)
> notice: mariadb_server_start_0:7072:stderr [ Error performing
> operation: No such device or address ]
> 
> Here is the detailed of my configuration:
> - pcs : 0.10.1
> - Pacemaker 2.0.1
> - Corosync Cluster Engine, version '3.0.1'
> - mariadb  Ver 15.1 Distrib 10.3.27-MariaDB
> - Debian 10.8
> Mysql configuration:
> [server]
> [mysqld]
> user= mysql
> pid-file= /run/mysqld/mysqld.pid
> socket  = /run/mysqld/mysqld.sock
> basedir = /usr
> datadir = /var/lib/mysql
> tmpdir  = /tmp
> lc-messages-dir = /usr/share/mysql
> bind-address= 0.0.0.0
> query_cache_size= 16M
> log_error = /var/log/mysql/error.log
> server-id=2
> expire_logs_days= 10
> character-set-server  = utf8mb4
> collation-server  = utf8mb4_general_ci
> [embedded]
> [mariadb]
> log-bin
> server-id=2
> log-basename=master
> [mariadb-10.3]
> 
> Corosync configuration:
>  num_updates="0" admin_epoch="0" cib-last-written="Thu Apr  8 19:26:13
> 2021" update-origin="node1" update-client="cibadmin" update-
> user="root" have-quorum="1" dc-uuid="1">
>   
> 
>   
>  name="stonith-enabled" value="false"/>
> 
> 
> 
>  name="cluster-infrastructure" value="corosync"/>
>  name="cluster-name" value="clusterserver"/>
>   
>   
>  name="mariadb_server_REPL_INFO" value="node1"/>
>   
> 
> 
>   
>   
> 
> 
>   

Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread Klaus Wenninger

On 4/9/21 3:45 PM, Klaus Wenninger wrote:

On 4/9/21 3:36 PM, Klaus Wenninger wrote:

On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:

Hi Klaus,

Thanks for your comment.


Hmm ... is that with selinux enabled?
Respectively do you see any related avc messages?


Selinux is not enabled.
Isn't this caused by crm_mon not returning a response when pacemakerd 
prepares to stop?

yep ... that doesn't look good.
While in pcmk_shutdown_worker ipc isn't handled.

Stop ... that should actually work as pcmk_shutdown_worker
should exit quite quickly and proceed after mainloop
dispatching when called again.
Don't see anything atm that might be blocking for longer ...
but let me dig into it further ...

The question is why that didn't create issues earlier.
Probably I didn't test with resources that had crm_mon in
their stop/monitor-actions but sbd should have run into
issues.

Klaus

But when shutting down a node the resources should be
shutdown before pacemakerd goes down.
But let me have a look if it can happen that pacemakerd
doesn't react to the ipc-pings before. That btw. might be
lethal for sbd-scenarios (if the phase is too long and it
might actually not be defined).

My idea with selinux would have been that it might block
the ipc if crm_mon is issued by execd. But well forget
about it as it is not enabled ;-)


Klaus


pgsql needs the result of crm_mon in demote processing and stop 
processing.
crm_mon should return a response even after pacemakerd goes into a 
stop operation.


Best Regards,
Hideo Yamauchi.


- Original Message -

From: Klaus Wenninger 
To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related 
to open-source clustering welcomed 

Cc:
Date: 2021/4/9, Fri 21:12
Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource 
control fails.


On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:

  Hi Ken,
  Hi All,

  In the pgsql resource, crm_mon is executed in the process of 
demote and

stop, and the result is processed.
  However, pacemaker included in RHEL8.4beta fails to execute this 
crm_mon.

    - The problem also occurs on github

master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).

  The problem can be easily reproduced in the following ways.

  Step1. Modify to execute crm_mon in the stop process of the 
Dummy resource.

  

  dummy_stop() {
       mon=$(crm_mon -1)
       ret=$?
       ocf_log info "### YAMAUCHI  crm_mon[${ret}] : ${mon}"
       dummy_monitor
       if [ $? =  $OCF_SUCCESS ]; then
           rm ${OCF_RESKEY_state}
       fi
       return $OCF_SUCCESS
  }
  

  Step2. Configure a cluster with two nodes.
  

  [root@rh84-beta01 ~]# crm_mon -rfA1
  Cluster Summary:
     * Stack: corosync
     * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - 
partition

with quorum

     * Last updated: Thu Apr  8 18:00:52 2021
     * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on

rh84-beta01

     * 2 nodes configured
     * 1 resource instance configured

  Node List:
     * Online: [ rh84-beta01 rh84-beta02 ]

  Full List of Resources:
     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01

  Migration Summary:
  

  Step3. Stop the node where the Dummy resource is running. The 
resource will

fail over.

  
  [root@rh84-beta02 ~]# crm_mon -rfA1
  Cluster Summary:
     * Stack: corosync
     * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - 
partition

with quorum

     * Last updated: Thu Apr  8 18:08:56 2021
     * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on

rh84-beta01

     * 2 nodes configured
     * 1 resource instance configured

  Node List:
     * Online: [ rh84-beta02 ]
     * OFFLINE: [ rh84-beta01 ]

  Full List of Resources:
     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02
  

  However, if you look at the log, you can see that the execution 
of crm_mon

in the stop processing of the Dummy resource has failed.

  
  Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI 

crm_mon[102] : Pacemaker daemons shutting down ...
  Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] 
(log_op_output)
notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster 
is not

available on this node ]
Hmm ... is that with selinux enabled?
Respectively do you see any related avc messages?

Klaus

  

  Similarly, pgsql also executes crm_mon with demote or stop, so 
control

fails.

  The problem seems to be related to the next fix.
    * Report pacemakerd in state waiting for sbd
     - https://github.com/ClusterLabs/pacemaker/pull/2278

  The problem does not occur with the release version of Pacemaker 
2.0.5 or

the Pacemaker included with RHEL8.3.

  This issue has a huge impact on the user.

  Perhaps it also affects the control of other resources that utilize

crm_mon.
  Please improve the release version of RHEL8.4 so that it 
includes Pacemaker

which does not cause this problem.
    * Distributions other than RHEL may also be a

Re: [ClusterLabs] how to setup single node cluster

2021-04-09 Thread Ken Gaillot
On Fri, 2021-04-09 at 08:20 +0300, Andrei Borzenkov wrote:
> On 08.04.2021 09:26, d tbsky wrote:
> > Reid Wahl 
> > > I don't think we do require fencing for single-node clusters.
> > > (Anyone at Red Hat, feel free to comment.) I vaguely recall an
> > > internal mailing list or IRC conversation where we discussed this
> > > months ago, but I can't find it now. I've also checked our
> > > support policies documentation, and it's not mentioned in the
> > > "cluster size" doc or the "fencing" doc.
> > 
> >since the cluster is 100% alive or 100% dead with single node, I
> > think fencing/quorum is not required. I am just curious what is the
> > usage case. since RedHat supports it, it must be useful in real
> > scenario.
> 
> 
> I do not know what "disaster recovery" configuration you have in
> mind,
> but if you intend to use geo clustering fencing can speed up fail-
> over
> so it is at least useful.
> 
> Even in single node cluster if resource failed to stop you are stuck
> -
> you cannot actually do anything from that point without manual
> intervention. Depending on configuration and requirements rebooting
> node
> may be considered as an attempt to automatically "reset" cluster
> state.

The use case for a single-node disaster recovery cluster is to have the
main cluster be a full, multi-node cluster with fencing, with a single-
node cluster at a remote site for disaster recovery when the main
cluster is down (possibly for just the most essential resources).

Fencing isn't critical for the DR site because if the DR site is being
used, the main site is already down.

The DR site could be activated automatically with booth (if a third
arbitrator site is available), or manually by an administrator (for
example by changing the target-role resource default, or manually
assigning tickets).
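
As a rough sketch, manual activation could look like one of the following;
the ticket name is only an example, and pcs syntax varies a little between
releases:

 # Grant a ticket by hand on the DR cluster ("dr-ticket" is a made-up name):
 crm_ticket --ticket dr-ticket --grant

 # Or change the target-role resource default so resources may start:
 pcs resource defaults target-role=Started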

The advantages of using a cluster at all at a manual DR site are that
administrators can use the same cluster management commands they're
familiar with, and certain resources can always run at the DR site to
keep it ready (e.g. shared storage or a database replicant).

There are some ideas about making such a setup easier to manage, such
as being able to coordinate configuration changes (each site has a
separate cluster configuration), and maybe having "storage agents" to
manage shared storage across clusters.
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread Klaus Wenninger

On 4/9/21 3:36 PM, Klaus Wenninger wrote:

On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:

Hi Klaus,

Thanks for your comment.


Hmm ... is that with selinux enabled?
Respectively do you see any related avc messages?


Selinux is not enabled.
Isn't this caused by crm_mon not returning a response when pacemakerd 
prepares to stop?

yep ... that doesn't look good.
While in pcmk_shutdown_worker ipc isn't handled.
The question is why that didn't create issues earlier.
Probably I didn't test with resources that had crm_mon in
their stop/monitor-actions but sbd should have run into
issues.

Klaus

But when shutting down a node the resources should be
shutdown before pacemakerd goes down.
But let me have a look if it can happen that pacemakerd
doesn't react to the ipc-pings before. That btw. might be
lethal for sbd-scenarios (if the phase is too long and it
might actually not be defined).

My idea with selinux would have been that it might block
the ipc if crm_mon is issued by execd. But well forget
about it as it is not enabled ;-)


Klaus


pgsql needs the result of crm_mon in demote processing and stop 
processing.
crm_mon should return a response even after pacemakerd goes into a 
stop operation.


Best Regards,
Hideo Yamauchi.


- Original Message -

From: Klaus Wenninger 
To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
open-source clustering welcomed 

Cc:
Date: 2021/4/9, Fri 21:12
Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource 
control fails.


On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:

  Hi Ken,
  Hi All,

  In the pgsql resource, crm_mon is executed in the process of 
demote and

stop, and the result is processed.
  However, pacemaker included in RHEL8.4beta fails to execute this 
crm_mon.

    - The problem also occurs on github

master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).

  The problem can be easily reproduced in the following ways.

  Step1. Modify to execute crm_mon in the stop process of the Dummy 
resource.

  

  dummy_stop() {
       mon=$(crm_mon -1)
       ret=$?
       ocf_log info "### YAMAUCHI  crm_mon[${ret}] : ${mon}"
       dummy_monitor
       if [ $? =  $OCF_SUCCESS ]; then
           rm ${OCF_RESKEY_state}
       fi
       return $OCF_SUCCESS
  }
  

  Step2. Configure a cluster with two nodes.
  

  [root@rh84-beta01 ~]# crm_mon -rfA1
  Cluster Summary:
     * Stack: corosync
     * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - 
partition

with quorum

     * Last updated: Thu Apr  8 18:00:52 2021
     * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on

rh84-beta01

     * 2 nodes configured
     * 1 resource instance configured

  Node List:
     * Online: [ rh84-beta01 rh84-beta02 ]

  Full List of Resources:
     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01

  Migration Summary:
  

  Step3. Stop the node where the Dummy resource is running. The 
resource will

fail over.

  
  [root@rh84-beta02 ~]# crm_mon -rfA1
  Cluster Summary:
     * Stack: corosync
     * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - 
partition

with quorum

     * Last updated: Thu Apr  8 18:08:56 2021
     * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on

rh84-beta01

     * 2 nodes configured
     * 1 resource instance configured

  Node List:
     * Online: [ rh84-beta02 ]
     * OFFLINE: [ rh84-beta01 ]

  Full List of Resources:
     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02
  

  However, if you look at the log, you can see that the execution 
of crm_mon

in the stop processing of the Dummy resource has failed.

  
  Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI 

crm_mon[102] : Pacemaker daemons shutting down ...
  Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] 
(log_op_output)
notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster 
is not

available on this node ]
Hmm ... is that with selinux enabled?
Respectively do you see any related avc messages?

Klaus

  

  Similarly, pgsql also executes crm_mon with demote or stop, so 
control

fails.

  The problem seems to be related to the next fix.
    * Report pacemakerd in state waiting for sbd
     - https://github.com/ClusterLabs/pacemaker/pull/2278

  The problem does not occur with the release version of Pacemaker 
2.0.5 or

the Pacemaker included with RHEL8.3.

  This issue has a huge impact on the user.

  Perhaps it also affects the control of other resources that utilize

crm_mon.
  Please improve the release version of RHEL8.4 so that it includes 
Pacemaker

which does not cause this problem.
    * Distributions other than RHEL may also be affected in future 
releases.


  
  This content is the same as the following Bugzilla.
    - https://bugs.clusterlabs.org/show_bug.cgi?id=5471
  

  Best Regards,
  Hideo Yamauchi.

  ___
  Manage your subscription:
  https://lists

Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread Klaus Wenninger

On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:

Hi Klaus,

Thanks for your comment.


Hmm ... is that with selinux enabled?
Respectively do you see any related avc messages?


Selinux is not enabled.
Isn't this caused by crm_mon not returning a response when pacemakerd 
prepares to stop?

But when shutting down a node the resources should be
shutdown before pacemakerd goes down.
But let me have a look if it can happen that pacemakerd
doesn't react to the ipc-pings before. That btw. might be
lethal for sbd-scenarios (if the phase is too long and it
might actually not be defined).

My idea with selinux would have been that it might block
the ipc if crm_mon is issued by execd. But well forget
about it as it is not enabled ;-)


Klaus


pgsql needs the result of crm_mon in demote processing and stop processing.
crm_mon should return a response even after pacemakerd goes into a stop 
operation.

Best Regards,
Hideo Yamauchi.


- Original Message -

From: Klaus Wenninger 
To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to open-source 
clustering welcomed 
Cc:
Date: 2021/4/9, Fri 21:12
Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
fails.

On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:

  Hi Ken,
  Hi All,

  In the pgsql resource, crm_mon is executed in the process of demote and

stop, and the result is processed.

  However, pacemaker included in RHEL8.4beta fails to execute this crm_mon.
    - The problem also occurs on github

master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).

  The problem can be easily reproduced in the following ways.

  Step1. Modify to execute crm_mon in the stop process of the Dummy resource.
  

  dummy_stop() {
       mon=$(crm_mon -1)
       ret=$?
       ocf_log info "### YAMAUCHI  crm_mon[${ret}] : ${mon}"
       dummy_monitor
       if [ $? =  $OCF_SUCCESS ]; then
           rm ${OCF_RESKEY_state}
       fi
       return $OCF_SUCCESS
  }
  

  Step2. Configure a cluster with two nodes.
  

  [root@rh84-beta01 ~]# crm_mon -rfA1
  Cluster Summary:
     * Stack: corosync
     * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition

with quorum

     * Last updated: Thu Apr  8 18:00:52 2021
     * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on

rh84-beta01

     * 2 nodes configured
     * 1 resource instance configured

  Node List:
     * Online: [ rh84-beta01 rh84-beta02 ]

  Full List of Resources:
     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01

  Migration Summary:
  

  Step3. Stop the node where the Dummy resource is running. The resource will

fail over.

  
  [root@rh84-beta02 ~]# crm_mon -rfA1
  Cluster Summary:
     * Stack: corosync
     * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition

with quorum

     * Last updated: Thu Apr  8 18:08:56 2021
     * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on

rh84-beta01

     * 2 nodes configured
     * 1 resource instance configured

  Node List:
     * Online: [ rh84-beta02 ]
     * OFFLINE: [ rh84-beta01 ]

  Full List of Resources:
     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02
  

  However, if you look at the log, you can see that the execution of crm_mon

in the stop processing of the Dummy resource has failed.

  
  Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI 

crm_mon[102] : Pacemaker daemons shutting down ...

  Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output)

notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not
available on this node ]
Hmm ... is that with selinux enabled?
Respectively do you see any related avc messages?

Klaus

  

  Similarly, pgsql also executes crm_mon with demote or stop, so control

fails.

  The problem seems to be related to the next fix.
    * Report pacemakerd in state waiting for sbd
     - https://github.com/ClusterLabs/pacemaker/pull/2278

  The problem does not occur with the release version of Pacemaker 2.0.5 or

the Pacemaker included with RHEL8.3.

  This issue has a huge impact on the user.

  Perhaps it also affects the control of other resources that utilize

crm_mon.

  Please improve the release version of RHEL8.4 so that it includes Pacemaker

which does not cause this problem.

    * Distributions other than RHEL may also be affected in future releases.

  
  This content is the same as the following Bugzilla.
    - https://bugs.clusterlabs.org/show_bug.cgi?id=5471
  

  Best Regards,
  Hideo Yamauchi.

  ___
  Manage your subscription:
  https://lists.clusterlabs.org/mailman/listinfo/users

  ClusterLabs home: https://www.clusterlabs.org/



--
Klaus Wenninger

Senior Software Engineer, EMEA ENG Base Operating Systems

Red Hat

kwenn...@redhat.com

Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 1532

Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread renayama19661014
Hi Klaus,

Thanks for your comment.

> Hmm ... is that with selinux enabled?

> Respectively do you see any related avc messages?


Selinux is not enabled.
Isn't this caused by crm_mon not returning a response when pacemakerd 
prepares to stop?

pgsql needs the result of crm_mon in demote processing and stop processing.
crm_mon should return a response even after pacemakerd goes into a stop 
operation.

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: Klaus Wenninger 
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
> Cc: 
> Date: 2021/4/9, Fri 21:12
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:
>>  Hi Ken,
>>  Hi All,
>> 
>>  In the pgsql resource, crm_mon is executed in the process of demote and 
> stop, and the result is processed.
>> 
>>  However, pacemaker included in RHEL8.4beta fails to execute this crm_mon.
>>    - The problem also occurs on github 
> master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>> 
>>  The problem can be easily reproduced in the following ways.
>> 
>>  Step1. Modify to execute crm_mon in the stop process of the Dummy resource.
>>  
>> 
>>  dummy_stop() {
>>       mon=$(crm_mon -1)
>>       ret=$?
>>       ocf_log info "### YAMAUCHI  crm_mon[${ret}] : ${mon}"
>>       dummy_monitor
>>       if [ $? =  $OCF_SUCCESS ]; then
>>           rm ${OCF_RESKEY_state}
>>       fi
>>       return $OCF_SUCCESS
>>  }
>>  
>> 
>>  Step2. Configure a cluster with two nodes.
>>  
>> 
>>  [root@rh84-beta01 ~]# crm_mon -rfA1
>>  Cluster Summary:
>>     * Stack: corosync
>>     * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition 
> with quorum
>>     * Last updated: Thu Apr  8 18:00:52 2021
>>     * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on 
> rh84-beta01
>>     * 2 nodes configured
>>     * 1 resource instance configured
>> 
>>  Node List:
>>     * Online: [ rh84-beta01 rh84-beta02 ]
>> 
>>  Full List of Resources:
>>     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01
>> 
>>  Migration Summary:
>>  
>> 
>>  Step3. Stop the node where the Dummy resource is running. The resource will 
> fail over.
>>  
>>  [root@rh84-beta02 ~]# crm_mon -rfA1
>>  Cluster Summary:
>>     * Stack: corosync
>>     * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition 
> with quorum
>>     * Last updated: Thu Apr  8 18:08:56 2021
>>     * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on 
> rh84-beta01
>>     * 2 nodes configured
>>     * 1 resource instance configured
>> 
>>  Node List:
>>     * Online: [ rh84-beta02 ]
>>     * OFFLINE: [ rh84-beta01 ]
>> 
>>  Full List of Resources:
>>     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02
>>  
>> 
>>  However, if you look at the log, you can see that the execution of crm_mon 
> in the stop processing of the Dummy resource has failed.
>> 
>>  
>>  Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI  
> crm_mon[102] : Pacemaker daemons shutting down ...
>>  Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output)  
> notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not 
> available on this node ]
> Hmm ... is that with selinux enabled?
> Respectively do you see any related avc messages?
> 
> Klaus
>>  
>> 
>>  Similarly, pgsql also executes crm_mon with demote or stop, so control 
> fails.
>> 
>>  The problem seems to be related to the next fix.
>>    * Report pacemakerd in state waiting for sbd
>>     - https://github.com/ClusterLabs/pacemaker/pull/2278 
>> 
>>  The problem does not occur with the release version of Pacemaker 2.0.5 or 
> the Pacemaker included with RHEL8.3.
>> 
>>  This issue has a huge impact on the user.
>> 
>>  Perhaps it also affects the control of other resources that utilize 
> crm_mon.
>> 
>>  Please improve the release version of RHEL8.4 so that it includes Pacemaker 
> which does not cause this problem.
>>    * Distributions other than RHEL may also be affected in future releases.
>> 
>>  
>>  This content is the same as the following Bugzilla.
>>    - https://bugs.clusterlabs.org/show_bug.cgi?id=5471 
>>  
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>>  ___
>>  Manage your subscription:
>>  https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>>  ClusterLabs home: https://www.clusterlabs.org/ 
> 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why my node1 couldn't back to the clustering chain?

2021-04-09 Thread Jason Long
Thanks.
I meant was a Cheat sheet.
Yes, something like rendering a 3D movie or... . Are Corosync and Pacemaker 
not OK for that? What kind of clustering is used for rendering? A Beowulf cluster?






On Friday, April 9, 2021, 12:55:27 PM GMT+4:30, Antony Stone wrote: 





On Friday 09 April 2021 at 08:58:39, Jason Long wrote:

> Thank you so much for your great answers.
> As the final questions:

Really :) ?

> 1- Which commands are useful to monitoring and managing my pacemaker
> cluster?

Some people prefer https://crmsh.github.io/documentation/ and some people 
prefer https://github.com/ClusterLabs/pcs

> 2- I don't know if this is a right question or not. Consider 100 PCs that
> each of them have an Intel Core 2 Duo Processor (2 cores) with 4GB of RAM.
> How can I merge these PCs together so that I have a system with 200 CPUs
> and 400GB of RAM?

The answer to that depends on what you want to do with them.

As a general-purpose computing resource, you can't.  The CPU on machine A has 
no (reasonable) access to the RAM on machine B, so no part of the system can 
actually work with 400GBytes RAM.

For specialist purposes (generally speaking, performing the same tasks on 
small pieces of data all at the same time and then putting the results 
together at the end), you can create a very different type of "cluster" than 
the ones we talk about here with corosync and pacemaker.

https://en.wikipedia.org/wiki/Computer_cluster

A common usage for such a setup is frame rendering of computer generated films; 
give each of your 100 PCs one frame to render, put all the frames together in 
the right order at the end, and you've created your film in just over 1% of the 
time it would have taken on one computer (of the same type).


Regards,


Antony.

-- 
Most people have more than the average number of legs.


                                                  Please reply to the list;
                                                        please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread Klaus Wenninger

On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:

Hi Ken,
Hi All,

In the pgsql resource, crm_mon is executed in the process of demote and stop, 
and the result is processed.

However, pacemaker included in RHEL8.4beta fails to execute this crm_mon.
  - The problem also occurs on github 
master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).

The problem can be easily reproduced in the following ways.

Step1. Modify to execute crm_mon in the stop process of the Dummy resource.


dummy_stop() {
     mon=$(crm_mon -1)
     ret=$?
     ocf_log info "### YAMAUCHI  crm_mon[${ret}] : ${mon}"
     dummy_monitor
     if [ $? =  $OCF_SUCCESS ]; then
         rm ${OCF_RESKEY_state}
     fi
     return $OCF_SUCCESS
}


Step2. Configure a cluster with two nodes.


[root@rh84-beta01 ~]# crm_mon -rfA1
Cluster Summary:
   * Stack: corosync
   * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with 
quorum
   * Last updated: Thu Apr  8 18:00:52 2021
   * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on rh84-beta01
   * 2 nodes configured
   * 1 resource instance configured

Node List:
   * Online: [ rh84-beta01 rh84-beta02 ]

Full List of Resources:
   * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01

Migration Summary:


Step3. Stop the node where the Dummy resource is running. The resource will 
fail over.

[root@rh84-beta02 ~]# crm_mon -rfA1
Cluster Summary:
   * Stack: corosync
   * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with 
quorum
   * Last updated: Thu Apr  8 18:08:56 2021
   * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on rh84-beta01
   * 2 nodes configured
   * 1 resource instance configured

Node List:
   * Online: [ rh84-beta02 ]
   * OFFLINE: [ rh84-beta01 ]

Full List of Resources:
   * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02


However, if you look at the log, you can see that the execution of crm_mon in 
the stop processing of the Dummy resource has failed.


Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI  crm_mon[102] 
: Pacemaker daemons shutting down ...
Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output)  notice: 
dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on 
this node ]

Hmm ... is that with selinux enabled?
Respectively do you see any related avc messages?
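
For reference, one way to check for related AVC denials (assuming auditd is
running) is:

 ausearch -m avc -ts recent
 # or, without ausearch:
 grep 'avc: ' /var/log/audit/audit.log | tail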

Klaus



Similarly, pgsql also executes crm_mon with demote or stop, so control fails.

The problem seems to be related to the next fix.
  * Report pacemakerd in state waiting for sbd
   - https://github.com/ClusterLabs/pacemaker/pull/2278

The problem does not occur with the release version of Pacemaker 2.0.5 or the 
Pacemaker included with RHEL8.3.

This issue has a huge impact on the user.

Perhaps it also affects the control of other resources that utilize crm_mon.

Please improve the release version of RHEL8.4 so that it includes Pacemaker 
which does not cause this problem.
  * Distributions other than RHEL may also be affected in future releases.


This content is the same as the following Bugzilla.
  - https://bugs.clusterlabs.org/show_bug.cgi?id=5471


Best Regards,
Hideo Yamauchi.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why my node1 couldn't back to the clustering chain?

2021-04-09 Thread Jason Long
Thank you so much for your great answers.
As the final questions:
1- Which commands are useful to monitoring and managing my pacemaker cluster?

2- I don't know if this is a right question or not. Consider 100 PCs that each 
of them have an Intel Core 2 Duo Processor (2 cores) with 4GB of RAM. How can I 
merge these PCs together so that I have a system with 200 CPUs and 400GB of RAM?






On Friday, April 9, 2021, 12:13:45 AM GMT+4:30, Antony Stone wrote: 





On Thursday 08 April 2021 at 21:33:48, Jason Long wrote:

> Yes, I just wanted to know. In clustering, when a node is down and
> goes online again, then the cluster will not use it again until another node
> fails. Am I right?

Think of it like this:

You can have as many nodes in your cluster as you think you need, and I'm 
going to assume that you only need the resources running on one node at any 
given time.

Cluster management (eg: corosync / pacemaker) will ensure that the resources 
are running on *a* node.

The resources will be moved *away* from that node if they can't run there any 
more, for some reason (the node going down is a good reason).

However, there is almost never any concept of the resources being moved *to* a 
(specific) node.  If they get moved away from one node, then obviously they 
need to be moved to another one, but the move happens because the resources 
have to be moved *away* from the first node, not because the cluster thinks 
they need to be moved *to* the second node.

So, if a node is running its resources quite happily, it doesn't matter what 
happens to all the other nodes (provided quorum remains); the resources will 
stay running on that same node all the time.


Antony.

-- 
Was ist braun, liegt ins Gras, und raucht?
Ein Kaminchen...


                                                  Please reply to the list;
                                                        please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] Re: Why my node1 couldn't back to the clustering chain?

2021-04-09 Thread Antony Stone
On Friday 09 April 2021 at 11:06:14, Ulrich Windl wrote:

> # lscpu
> CPU(s):  144

> # free -h
> Mem:  754Gi

Nice :)

No doubt Jason would like to connect 8 of these together in a cluster...


Antony.

-- 
Numerous psychological studies over the years have demonstrated that the 
majority of people genuinely believe they are not like the majority of people.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Re: Why my node1 couldn't back to the clustering chain?

2021-04-09 Thread Ulrich Windl
>>> Jason Long wrote on 09.04.2021 at 08:58 in message
<2055279672.56029.1617951519...@mail.yahoo.com>:
> Thank you so much for your great answers.
> As the final questions:
> 1- Which commands are useful to monitoring and managing my pacemaker 
> cluster?

My favorite is "crm_mon -1Arfj".

> 
> 2- I don't know if this is a right question or not. Consider 100 PCs that 
> each of them have an Intel Core 2 Duo Processor (2 cores) with 4GB of RAM. 
> How can I merge these PCs together so that I have a system with 200 CPUs and 
> 400GB of RAM?

If you don't just want to recycle old hardware, you could consider buying _one_ 
recent machine that has almost that many cores and that much RAM in a single 
box, probably saving a lot of power and space, too.

Like here:
 # grep MHz /proc/cpuinfo | wc -l
144
 # lscpu
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
Address sizes:   46 bits physical, 48 bits virtual
CPU(s):  144
On-line CPU(s) list: 0-143
Thread(s) per core:  2
Core(s) per socket:  18
Socket(s):   4
NUMA node(s):4
Vendor ID:   GenuineIntel
CPU family:  6
Model:   85
Model name:  Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Stepping:7
CPU MHz: 1001.007
...

# free -h
  totalusedfree  shared  buff/cache   available
Mem:  754Gi   1.7Gi   744Gi75Mi   8.1Gi   748Gi

Regards,
Ulrich


> 
> 
> 
> 
> 
> 
> On Friday, April 9, 2021, 12:13:45 AM GMT+4:30, Antony Stone wrote: 
> 
> 
> 
> 
> 
> On Thursday 08 April 2021 at 21:33:48, Jason Long wrote:
> 
>> Yes, I just wanted to know. In clustering, when a node is down and
>> goes online again, then the cluster will not use it again until another node
>> fails. Am I right?
> 
> Think of it like this:
> 
> You can have as many nodes in your cluster as you think you need, and I'm 
> going to assume that you only need the resources running on one node at any 
> given time.
> 
> Cluster management (eg: corosync / pacemaker) will ensure that the resources 
> 
> are running on *a* node.
> 
> The resources will be moved *away* from that node if they can't run there 
> any 
> more, for some reason (the node going down is a good reason).
> 
> However, there is almost never any concept of the resources being moved *to* 
> a 
> (specific) node.  If they get moved away from one node, then obviously they 
> need to be moved to another one, but the move happens because the resources 
> have to be moved *away* from the first node, not because the cluster thinks 
> they need to be moved *to* the second node.
> 
> So, if a node is running its resources quite happily, it doesn't matter what 
> 
> happens to all the other nodes (provided quorum remains); the resources will 
> 
> stay running on that same node all the time.
> 
> 
> Antony.
> 
> -- 
> Was ist braun, liegt ins Gras, und raucht?
> Ein Kaminchen...
> 
> 
>   Please reply to the list;
> please *don't* CC 
> me.
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Antw: [EXT] Re: Why my node1 couldn't back to the clustering chain?

2021-04-09 Thread Ulrich Windl
>>> Jason Long wrote on 08.04.2021 at 21:33 in message
<1151501391.584136.1617910428...@mail.yahoo.com>:
> Yes, I just wanted to know. In clustering, when a node is down and goes online 
> again, then the cluster will not use it again until another node fails. Am I 
> right?

Hi!

Read about "stickiness", maybe setting it to zero, and see if that makes you 
happier.
If not, you learned what stickiness is for.
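
For illustration, assuming either crmsh or pcs is in use, stickiness can be
lowered as a cluster-wide resource default or per resource:

 # crmsh: cluster-wide default
 crm configure rsc_defaults resource-stickiness=0

 # pcs: cluster-wide default (older pcs syntax shown)
 pcs resource defaults resource-stickiness=0

 # pcs: per resource, e.g. for a resource named dummy-1
 pcs resource meta dummy-1 resource-stickiness=0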

Regards,
Ulrich


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why my node1 couldn't back to the clustering chain?

2021-04-09 Thread Antony Stone
On Friday 09 April 2021 at 10:34:33, Jason Long wrote:

> Thanks.
> I meant was a Cheat sheet.

I don't understand that sentence.

> Yes, something like rendering a 3D movie or... . Are Corosync and Pacemaker
> not OK for that? What kind of clustering is used for rendering? A Beowulf
> cluster?

Corosync and pacemaker are for High Availability, which generally means that 
you have more computing resources than you need at any given time, in order 
that a failed machine can be efficiently replaced by a working one.  If all 
your machines are busy, and one fails, you have no spare computing resources 
to take over from the failed one.

The setup you were asking about is High Performance computing, where you are 
trying to use the resources you have as efficiently and continuously as 
possible, therefore you don't have any spare capacity (since 'spare' means 
'wasted' in this regard).

A Beowulf Cluster is one example of the sort of thing you're asking about; for 
others, see the "Implementations" section of the URL I previously provided.


Antony.

-- 
https://tools.ietf.org/html/rfc6890 - providing 16 million IPv4 addresses for 
talking to yourself.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Why my node1 couldn't back to the clustering chain?

2021-04-09 Thread Antony Stone
On Friday 09 April 2021 at 08:58:39, Jason Long wrote:

> Thank you so much for your great answers.
> As the final questions:

Really :) ?

> 1- Which commands are useful to monitoring and managing my pacemaker
> cluster?

Some people prefer https://crmsh.github.io/documentation/ and some people 
prefer https://github.com/ClusterLabs/pcs
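
A few commonly used commands from each camp, as a starting point (resource and 
node names below are just placeholders):

 # Monitoring
 crm_mon -1Arf          # one-shot status with node attributes and fail counts
 pcs status --full      # pcs view of the cluster
 crm status             # crmsh view of the cluster

 # Everyday management
 pcs resource cleanup my-resource    # clear failures for a resource
 crm resource cleanup my-resource
 pcs node standby node1              # take a node out of service
 crm node standby node1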

> 2- I don't know if this is a right question or not. Consider 100 PCs that
> each of them have an Intel Core 2 Duo Processor (2 cores) with 4GB of RAM.
> How can I merge these PCs together so that I have a system with 200 CPUs
> and 400GB of RAM?

The answer to that depends on what you want to do with them.

As a general-purpose computing resource, you can't.  The CPU on machine A has 
no (reasonable) access to the RAM on machine B, so no part of the system can 
actually work with 400GBytes RAM.

For specialist purposes (generally speaking, performing the same tasks on 
small pieces of data all at the same time and then putting the results 
together at the end), you can create a very different type of "cluster" than 
the ones we talk about here with corosync and pacemaker.

https://en.wikipedia.org/wiki/Computer_cluster

A common usage for such a setup is frame rendering of computer generated films; 
give each of your 100 PCs one frame to render, put all the frames together in 
the right order at the end, and you've created your film in just over 1% of the 
time it would have taken on one computer (of the same type).


Regards,


Antony.

-- 
Most people have more than the average number of legs.

   Please reply to the list;
 please *don't* CC me.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Fwd: Issue with resource-agents ocf:heartbeat:mariadb

2021-04-09 Thread Olivier POUILLY

Hi team,

Thanks for the great job on these libraries.

I would like to know if it is possible to get some help on the mariadb 
resource.


After the configuration of my cluster pcs command shows me:

root@node1:~# pcs status
Cluster name: clusterserver
Stack: corosync
Current DC: node1 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Thu Apr  8 15:45:35 2021
Last change: Thu Apr  8 15:45:25 2021 by root via cibadmin on node1

2 nodes configured
2 resources configured

Online: [ node1 node2 ]

Full list of resources:

 Clone Set: mariadb_server-clone [mariadb_server] (promotable)
 Masters: [ node1 ]
 Slaves: [ node2 ]

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled


But when I go to mysql on server2 I see my slave status is off:

MariaDB [(none)]> SHOW SLAVE STATUS\G
*** 1. row ***
    Slave_IO_State:
   Master_Host: node1
   Master_User: replication
   Master_Port: 3306
 Connect_Retry: 60
   Master_Log_File: master-bin.01
   Read_Master_Log_Pos: 463
    Relay_Log_File: master-relay-bin.02
 Relay_Log_Pos: 672
 Relay_Master_Log_File: master-bin.01
  Slave_IO_Running: No
 Slave_SQL_Running: No
   Replicate_Do_DB:
   Replicate_Ignore_DB:
    Replicate_Do_Table:
    Replicate_Ignore_Table:
   Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
    Last_Errno: 0
    Last_Error:
  Skip_Counter: 0
   Exec_Master_Log_Pos: 463
   Relay_Log_Space: 2935
   Until_Condition: None
    Until_Log_File:
 Until_Log_Pos: 0
    Master_SSL_Allowed: No
    Master_SSL_CA_File:
    Master_SSL_CA_Path:
   Master_SSL_Cert:
 Master_SSL_Cipher:
    Master_SSL_Key:
 Seconds_Behind_Master: NULL
 Master_SSL_Verify_Server_Cert: No
 Last_IO_Errno: 0
 Last_IO_Error:
    Last_SQL_Errno: 0
    Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
  Master_Server_Id: 0
    Master_SSL_Crl:
    Master_SSL_Crlpath:
    Using_Gtid: Current_Pos
   Gtid_IO_Pos:
   Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
 Parallel_Mode: conservative
 SQL_Delay: 0
   SQL_Remaining_Delay: NULL
   Slave_SQL_Running_State:
  Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 0


On pacemaker log I got the following message:

Apr 08 19:26:18 node2 pacemaker-execd [6899] (operation_finished)     
notice: mariadb_server_start_0:7072:stderr [ Error performing operation: 
No such device or address ]



Here is the detailed of my configuration:

- pcs : 0.10.1

- Pacemaker 2.0.1

- Corosync Cluster Engine, version '3.0.1'

- mariadb  Ver 15.1 Distrib 10.3.27-MariaDB

- Debian 10.8

Mysql configuration:

[server]

[mysqld]
user    = mysql
pid-file    = /run/mysqld/mysqld.pid
socket  = /run/mysqld/mysqld.sock
basedir = /usr
datadir = /var/lib/mysql
tmpdir  = /tmp
lc-messages-dir = /usr/share/mysql
bind-address    = 0.0.0.0
query_cache_size    = 16M
log_error = /var/log/mysql/error.log
server-id=2
expire_logs_days    = 10
character-set-server  = utf8mb4
collation-server  = utf8mb4_general_ci
[embedded]
[mariadb]
log-bin
server-id=2
log-basename=master
[mariadb-10.3]

Corosync configuration:

num_updates="0" admin_epoch="0" cib-last-written="Thu Apr  8 19:26:13 
2021" update-origin="node1" update-client="cibadmin" update-user="root" 
have-quorum="1" dc-uuid="1">

  
    
  
    name="stonith-enabled" value="false"/>
    name="no-quorum-policy" value="ignore"/>
    name="have-watchdog" value="false"/>
    value="2.0.1-9e909a5bdd"/>
    name="cluster-infrastructure" value="corosync"/>
    name="cluster-name" value="clusterserver"/>

  
  
    name="mariadb_server_REPL_INFO" value="node1"/>

  
    
    
  
  
    
    
  
    type="mariadb">

  
    name="binary" value="/usr/sbin/mysqld"/>
    name="config" value="/etc/mysql/my.cnf"/>
    name="datadir" value="/var/lib/mysql"/>
    name="node_list" value="node1 node2"/>
    name="pid" value="/var/run/mysqld/mysqld.pid"/>
    id="mariadb_server-instance_attributes-replication_passwd" 
name="replication_passwd" value="similarly-secure-password"/>
    id="mariadb_server-instance_attributes-replication_user" 
name="replication_user" value="replication"/>
    name="sock