Re: [ClusterLabs] Antw: Corosync ring marked as FAULTY

2017-02-22 Thread bliu

Hi, Denis

Could you run tcpdump "udp port 5505" on the private network to see 
whether any packets arrive?



On 02/22/2017 03:47 PM, Denis Gribkov wrote:


In our case it does not cause problems, since all nodes are located in 
a few networks which are served by a single router.


Also, no errors are detected on public ring 1, unlike private 
ring 0.


I suspect that this error could be related to the private VLAN 
settings, but unfortunately I have no good idea how to find the issue.


On 22/02/17 09:37, Ulrich Windl wrote:

Is "ttl 1" a good idea for a public network?


Denis Gribkov wrote on 21.02.2017 at 18:26 in message

<4f5543c4-b80c-659d-ed5e-7a99e1482...@itsts.net>:

Hi Everyone.

I have a 16-node asynchronous cluster configured with the Corosync redundant
ring feature.

Each node has 2 similarly connected/configured NICs. One NIC is connected
to the public network, the other one to our private VLAN. When I checked
the operability of the Corosync rings, I found:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
  id  = 192.168.1.54
  status  = Marking ringid 0 interface 192.168.1.54 FAULTY
RING ID 1
  id  = 111.11.11.1
  status  = ring 1 active with no faults
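
For checking this from a script, here is a minimal sketch that picks the
faulty rings out of the status output above. It assumes the output keeps
the format shown (a "RING ID n" line followed by a status line); the
function name faulty_rings is made up for illustration.

```shell
# Print the ID of every ring whose status line contains FAULTY.
# Reads `corosync-cfgtool -s` output on stdin.
faulty_rings() {
    awk '
        /^RING ID/           { ring = $3 }    # remember which ring is being described
        /status/ && /FAULTY/ { print ring }   # report it if its status says FAULTY
    '
}

# Usage on a cluster node:
#   corosync-cfgtool -s | faulty_rings
```

Fed the output above, this prints 0.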

After some digging, I found that if I re-enable the
failed ring with the command:

# corosync-cfgtool -r

RING ID 0 is marked as "active" for a few minutes, but after that it is
marked as faulty permanently.

The log has no useful info, just a single message:

corosync[21740]:   [TOTEM ] Marking ringid 0 interface 192.168.1.54 FAULTY

And there is no message like:

[TOTEM ] Automatically recovered ring 1


My corosync.conf looks like:

compatibility: whitetank

totem {
  version: 2
  secauth: on
  threads: 4
  rrp_mode: passive

  interface {
    member {
      memberaddr: PRIVATE_IP_1
    }

    ...

    member {
      memberaddr: PRIVATE_IP_16
    }

    ringnumber: 0
    bindnetaddr: PRIVATE_NET_ADDR
    mcastaddr: 226.0.0.1
    mcastport: 5505
    ttl: 1
  }

  interface {
    member {
      memberaddr: PUBLIC_IP_1
    }

    ...

    member {
      memberaddr: PUBLIC_IP_16
    }

    ringnumber: 1
    bindnetaddr: PUBLIC_NET_ADDR
    mcastaddr: 224.0.0.1
    mcastport: 5405
    ttl: 1
  }

  transport: udpu
}

logging {
  to_stderr: no
  to_logfile: yes
  logfile: /var/log/cluster/corosync.log
  logfile_priority: info
  to_syslog: yes
  syslog_priority: warning
  debug: on
  timestamp: on
}

I tried changing rrp_mode and mcastaddr/mcastport for ringnumber: 0,
but the result was the same.

I checked multicast/unicast operability using the omping utility and didn't
find any issues.

Also, no errors were found on the network equipment for our private VLAN.

Why did Corosync decide to permanently disable the second ring? How can I
debug the issue?

Other properties:

Corosync Cluster Engine, version '1.4.7'

Pacemaker properties:
   cluster-infrastructure: cman
   cluster-recheck-interval: 5min
   dc-version: 1.1.14-8.el6-70404b0
   expected-quorum-votes: 3
   have-watchdog: false
   last-lrm-refresh: 1484068350
   maintenance-mode: false
   no-quorum-policy: ignore
   pe-error-series-max: 1000
   pe-input-series-max: 1000
   pe-warn-series-max: 1000
   stonith-action: reboot
   stonith-enabled: false
   symmetric-cluster: false

Thank you.

--
Regards Denis Gribkov




___
Users mailing list:Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home:http://www.clusterlabs.org
Getting started:http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs:http://bugs.clusterlabs.org



Re: [ClusterLabs] Antw: Corosync ring marked as FAULTY

2017-02-22 Thread Denis Gribkov

Hi,

I just tried it: no packets were captured, even after I re-enabled ring 0
on all nodes.


On the other hand, if I create a listening UDP socket on the same port using
the nc utility:


# nc -u -l 5505

and then send some messages from another node, I can capture and see
these packets.




Re: [ClusterLabs] Antw: Corosync ring marked as FAULTY

2017-02-22 Thread bliu

Hi


On 02/22/2017 04:24 PM, Denis Gribkov wrote:


Hi,

Just tried - no packets were captured, even after I re-enabled ring 0 on all
nodes.



Re-enabling ring 0 just resets the fault counter to 0; it does nothing more.

Did you specify the interface with "-i " when you were using tcpdump? If
you did, corosync is not talking to the multicast address, and you need to
check whether your private network supports multicast.





Re: [ClusterLabs] Antw: Corosync ring marked as FAULTY

2017-02-22 Thread Denis Gribkov

Hi,


On 22/02/17 10:35, bliu wrote:
Did you specify the interface with "-i " when you were using tcpdump? If
you did, corosync is not talking to the multicast address, and you need to
check whether your private network supports multicast.


Yes, I used the command:

# tcpdump -i em2 udp port 5505 -vv -X

Thanks for your advice, I'll ask our network engineers about the issue.
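
A hedged sketch for widening the capture: according to the corosync.conf
man page, corosync uses two UDP ports per ring, mcastport and
mcastport - 1 (5504 here), so it may be worth watching both. The helper
name capture_filter is made up for illustration; em2 and 5505 come from
the command above.

```shell
# Build a tcpdump filter matching both UDP ports corosync uses for a ring:
# mcastport and mcastport - 1.
capture_filter() {
    port=$1
    echo "udp and (port $port or port $((port - 1)))"
}

# Usage on a node (needs root):
#   tcpdump -n -i em2 "$(capture_filter 5505)"
```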

--
Regards Denis Gribkov





Re: [ClusterLabs] Antw: Corosync ring marked as FAULTY

2017-02-22 Thread Vladislav Bogdanov



That could be an IGMP querier issue. Corosync does not follow the "common"
model of multicast usage, with one sender/router in a segment and many
receivers. Instead, all corosync nodes are multicast senders and receivers.
For that to work reliably, IGMP snooping must be enabled and actually
*work* in the segment, and an IGMP querier must exist there (in the absence
of a multicast router). Also, if the switches differ between the two
segments, the issue could be IGMPv2 vs. IGMPv3 snooping support in the
ring 0 segment. Not all switches support IGMPv3 (the Linux default)
snooping, so to be on the safe side you may need to downgrade the IGMP
version used by Linux to v2 (/proc/sys/net/ipv4/conf//force_igmp_version).
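
A minimal sketch of that last step, assuming the ring 0 interface is em2
(the name used in the tcpdump command earlier in the thread). The helper
igmp_version_path is hypothetical and only builds the /proc path; the
write itself needs root and does not survive a reboot unless it is also
added to the sysctl configuration.

```shell
# Path of the per-interface IGMP version override in /proc.
igmp_version_path() {
    echo "/proc/sys/net/ipv4/conf/$1/force_igmp_version"
}

# Show the current setting (0 means no version is forced), then force IGMPv2:
#   cat "$(igmp_version_path em2)"
#   echo 2 > "$(igmp_version_path em2)"
```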







Re: [ClusterLabs] Antw: Oracle Stopping

2017-02-22 Thread emmanuel segura
The first place you need to look is the Oracle log.

2017-02-22 8:43 GMT+01:00 Ulrich Windl:
> Chad Cravens wrote on 22.02.2017 at 02:44:
>> Hello fellow Cluster Geeks!
>>
>> I'm having an issue with the standard Oracle RA script, and I can't
>> understand why it is happening.
>>
>> I create a Resource using the standard ocf:heartbeat:oracle script included
>> with a Pacemaker install. I do this through the web interface, so setting
>> the standard parameters for the Oracle install such as sid, home, user
>> etc...
>>
>> The databases appear to start just fine, however at what appear to be
>> random intervals (every 5 mins to 8 hours) the Resource is just stopped and
>> started. There is no failover event, and the only constraints are a
>> preferred location constraint and an order constraint with a filesystem
>> resource to make sure that disks are mounted before the database starts.
>> That's it.
>>
>> From /var/log/messages all I get are the ocf_log messages when the stop()
>> method is invoked. There's nothing about a failed monitor() action. Just a
>> stop() is invoked on the resource which crashes our apps.
>>
>> Any insight into what could be causing this sudden stopping of the
>> resource? Thanks for any help!
>
> Hi!
>
> Did you try to watch the output of "crm_mon -1Arfj" from time to time? Do you 
> have a separate pacemaker log? What about the files in 
> /var/lib/pacemaker/pengine?
>
> Regards,
> Ulrich
>
>>
>> --
>> Kindest Regards,
>> Chad Cravens
>> (843) 291-8340
>>
>> chad.crav...@ossys.com
>> http://www.ossys.com



-- 
  .~.
  /V\
 //  \\
/(   )\
^`~'^



[ClusterLabs] Fence agent for VirtualBox

2017-02-22 Thread Marek Grac
Hi,

I have updated the fence agent for VirtualBox (upstream git). The main
benefit is a new option, --host-os (host_os on stdin), that supports
linux|macos. So if your host is Linux or macOS, all you need to set is this
option (and SSH access to the machine). I would love to add support for
Windows as well, but I'm not able to run vboxmanage.exe over OpenSSH. It
works perfectly from the command prompt under the same user, so there are
some privilege issues; if you know how to fix this, please let me know.

m,

[ClusterLabs] Adding a node to the cluster deployed with "Unicast configuration"

2017-02-22 Thread Alejandro Comisario
Hi everyone, I have a problem scaling a corosync/pacemaker
cluster deployed using unicast.

E.g., in corosync.conf:

nodelist {
    node {
        ring0_addr: 10.10.0.10
        nodeid: 1
    }
    node {
        ring0_addr: 10.10.0.11
        nodeid: 2
    }
}

I tried adding the new node with the new config (that is, with the new
node listed) while leaving the other two with the old config, and started
the services on the third node. It doesn't seem to work until I update the
config on servers #1 and #2 and restart corosync/pacemaker, which
of course brings every resource down.

There should be a way to "hot add" a new node to the cluster, but I
can't seem to find one.
What is the best way to add a node without disturbing the rest? Or,
better said, what is the right way to do it?

PS: I can't implement multicast on my network.

-- 
Alejandro Comisario



Re: [ClusterLabs] Adding a node to the cluster deployed with "Unicast configuration"

2017-02-22 Thread Ken Gaillot

AFAIK, in this situation, there's no way to avoid restarting corosync
(and thus pacemaker, too). But, you can put your cluster into
maintenance mode beforehand, so pacemaker will leave all your services
running, and re-detect them once it starts again. Once the cluster
status looks good again, you can leave maintenance mode.
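
The maintenance-mode approach could look roughly like this with
Pacemaker's crm_attribute tool. This is a sketch, not a tested procedure:
the run and set_maintenance helpers are hypothetical, and with DRY_RUN=1
the commands are only printed.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Set the cluster-wide maintenance-mode property to true or false.
set_maintenance() {
    run crm_attribute --type crm_config --name maintenance-mode --update "$1"
}

# Usage:
#   set_maintenance true    # before restarting corosync/pacemaker
#   crm_mon -1              # check the cluster status after the restart
#   set_maintenance false   # once everything looks good again
```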




Re: [ClusterLabs] Adding a node to the cluster deployed with "Unicast configuration"

2017-02-22 Thread Jan Friesse


Adding a node is possible with the following process:
- Add the node to the config file on both existing nodes
- Exec corosync-cfgtool -R (it's enough to exec it on only one of the nodes)
- Make sure the 3rd (new) node has the same config as the two existing nodes
- Start corosync/pcmk on the new node

This is (more or less) how pcs works. No stopping of corosync is needed.
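
The steps above can be sketched as a small script. Everything beyond
corosync-cfgtool -R is an assumption made for illustration: root SSH
access between nodes, the config living in /etc/corosync/corosync.conf,
systemctl as the way to start the services, and 10.10.0.12 as the new
node's address. With DRY_RUN=1 the commands are only printed.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# add_node NEW_NODE EXISTING_NODE...
# Assumes the local corosync.conf already lists the new node.
add_node() {
    new=$1; shift
    conf=/etc/corosync/corosync.conf
    # 1. Put the updated config on every existing node.
    for node in "$@"; do
        run scp "$conf" "root@$node:$conf"
    done
    # 2. Ask corosync to reload its config; one node is enough.
    run ssh "root@$1" corosync-cfgtool -R
    # 3. Give the new node the same config, then start the stack there.
    run scp "$conf" "root@$new:$conf"
    run ssh "root@$new" systemctl start corosync pacemaker
}

# Usage:
#   DRY_RUN=1; add_node 10.10.0.12 10.10.0.10 10.10.0.11
```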

Regards,
  Honza





Re: [ClusterLabs] Adding a node to the cluster deployed with "Unicast configuration"

2017-02-22 Thread Ken Gaillot
On 02/22/2017 09:55 AM, Jan Friesse wrote:
> Adding a node is possible with the following process:
> - Add the node to the config file on both existing nodes
> - Exec corosync-cfgtool -R (it's enough to exec it on only one of the nodes)
> - Make sure the 3rd (new) node has the same config as the two existing nodes
> - Start corosync/pcmk on the new node
> 
> This is (more or less) how pcs works. No stopping of corosync is needed.
> 
> Regards,
>   Honza

Ah, noted ... I need to update some upstream docs :-)



Re: [ClusterLabs] Fence agent for VirtualBox

2017-02-22 Thread Marek Grac
Hi,

we have added support for Windows hosts, but it is not trivial to
set up because of various contexts/privileges.

Install OpenSSH on Windows (a tutorial can be found at
http://linuxbsdos.com/2015/07/30/how-to-install-openssh-on-windows-10/)

There is a major issue with the current setup on Windows: you have to start
the virtual machines from an OpenSSH connection if you wish to manage them
from an OpenSSH connection.

So, you have to connect from Windows to the very same Windows machine using
ssh and then run

"/Program Files/Oracle/VirtualBox/VBoxManage.exe" startvm NAME_OF_VM

Be prepared that you will not see that your VM is running in the VirtualBox
management UI.

Afterwards it is enough to add the parameter --host-os windows (or
host_os=windows when stdin/pcs is used).
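
For the stdin form, a rough sketch of what the options might look like.
Only host_os comes from this announcement; the other keys (ipaddr, login,
port, action) are assumed to follow the usual fence-agent naming, and the
helper name, host and user below are hypothetical.

```shell
# Print fence agent options in the key=value form read from stdin.
# Arguments: host address, SSH user, VM name, host OS, action.
vbox_fence_options() {
    printf 'ipaddr=%s\nlogin=%s\nport=%s\nhost_os=%s\naction=%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Usage (hypothetical host and user):
#   vbox_fence_options winhost admin NAME_OF_VM windows status | fence_vbox
```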

m,
