Re: [ClusterLabs] Mysql upgrade in DRBD setup

2017-10-12 Thread Kristián Feldsam
Hello, you should put the cluster into maintenance mode.
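For example (a minimal sketch; whether you use pcs or crmsh depends on your
distribution, and the procedure should be tested before relying on it):

    # stop Pacemaker from managing/monitoring resources while you upgrade
    pcs property set maintenance-mode=true
    # or with crmsh:
    # crm configure property maintenance-mode=true

    # ... run the mysql package upgrade on the active node ...

    # re-enable cluster management afterwards
    pcs property set maintenance-mode=false
    # or: crm configure property maintenance-mode=false

In maintenance mode the cluster will not react to the service restart done by
the package scripts, but it also will not recover anything, so keep the window
short.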



Sent from my MI 5

On Oct 12, 2017 6:55 PM, Attila Megyeri wrote:
> Hi all,
> 
> What is the recommended mysql server upgrade methodology in case of an
> active/passive DRBD storage? (Ubuntu is the platform)
> 
> 1) On the passive node the mysql data directory is not mounted, so the
> backup fails (some postinstall jobs will attempt to perform manipulations
> on certain files in the data directory).
> 2) If the upgrade is done on the active node, it will restart the service
> (with a plain service restart, not in a crm-managed fashion…), which is not
> a very good option (downtime in an HA solution). Not to mention that it
> will update some files in the mysql data directory, which can cause strange
> issues if the A/P pair is changed – since on the other node the program
> code will still be the old one, while the data dir is already upgraded.
> 
> Any hints are welcome!
> 
> Thanks,
> Attila
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Is there a way to ignore a single monitoring timeout

2017-09-01 Thread Kristián Feldsam


S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 1 Sep 2017, at 13:15, Klechomir <kle...@gmail.com> wrote:
> 
> What I observe is that a single monitoring request of different resources
> with different resource agents is timing out.
> 
> For example the LVM resource (the LVM RA) does this sometimes.
> Setting ridiculously high timeouts (5 minutes and more) didn't solve the
> problem, so I think I'm out of options there.
> Same for other I/O related resources/RAs.
> 

Hmm, so there is probably something bad in the clvm configuration? I use clvm in a 
three-node cluster without issues. Which version of CentOS do you use? I experienced 
clvm problems only on pre-7.3 versions, due to a bug in libqb.

> Regards,
> Klecho
> 
> One of the typical cases is LVM (LVM RA) monitoring.
> 
> On 1.09.2017 11:07, Jehan-Guillaume de Rorthais wrote:
>> On Fri, 01 Sep 2017 09:07:16 +0200
>> "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> wrote:
>> 
>>>>>> Klechomir <kle...@gmail.com> schrieb am 01.09.2017 um 08:48 in Nachricht
>>> <9f043557-233d-6c1c-b46d-63f8c2ee5...@gmail.com>:
>>>> Hi Ulrich,
>>>> Have to disagree here.
>>>> 
>>>> I have cases, when for an unknown reason a single monitoring request
>>>> never returns result.
>>>> So having bigger timeouts doesn't resolve this problem.
>>> But if your monitor hangs instead of giving a result, you also cannot ignore
>>> the result that isn't there! OTOH: Isn't the operation timeout for monitors
>>> that hang? If the monitor is killed, it returns an implicit status (it
>>> failed).
>> I agree. It seems to me the problem comes from either the resource agent or
>> the resource itself. Presently, this issue bothers the cluster stack, but
>> sooner or later it will break something else. Track down where the issue
>> comes from, and fix it.
>> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] big trouble with a DRBD resource

2017-08-06 Thread Kristián Feldsam
here is nice guide how to configure drbd, clvm...

http://marcitland.blogspot.cz/2013/04/building-using-highly-available-esos.html

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 6 Aug 2017, at 12:05, Kristoffer Grönlund <kgronl...@suse.com> wrote:
> 
> "Lentes, Bernd" <bernd.len...@helmholtz-muenchen.de> writes:
> 
>> Hi,
>> 
>> first: is there a tutorial or s.th. else which helps in understanding what 
>> pacemaker logs in syslog and /var/log/cluster/corosync.log ?
>> I try hard to find out what's going wrong, but they are difficult to 
>> understand, also because of the amount of information.
>> Or should i deal more with "crm history" or hb_report ?
> 
> I like to use crm history log to get the logs from all the nodes in a
> single flow, but it depends quite a bit on configuration what gets
> logged where..
> 
>> 
>> What happened:
>> I tried to configure a simple drbd resource following 
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html#idm140457860751296
>> I used this simple snip from the doc:
>> configure primitive WebData ocf:linbit:drbd params drbd_resource=wwwdata \
>>op monitor interval=60s
> 
> I'll try to sum up the issues I see, from a glance:
> 
> * The drbd resource is a multi-state / master-slave resource, which is
>  technically a variant of a clone resource where different clones can
>  either be in a primary or secondary state. To configure it correctly,
>  you'll need to create a master resource as well. Doing this with a
>  single command is unfortunately a bit painful. Either use crm
>  configure edit, or the interactive crm mode (with a verify / commit
>  after creating both the primitive and the master resources).
> 
> * You'll need to create monitor operations for both the master and slave
>  roles, as you note below, and set explicit timeouts for all
>  operations.
> 
> * Make sure the wwwdata DRBD resource exists, is accessible from both
>  nodes, and is in a good state to begin with (that is, not
>  split-brained).
> 
> I would recommend following one of the tutorials provided by Linbit
> themselves which show how to set this stuff up correctly, since it is
> quite a bit involved.
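Putting those points together, a rough crmsh sketch of such a master/slave DRBD
configuration might look like this (untested and purely illustrative; apart from
the wwwdata resource name, the names, timeouts and meta attributes here are
assumptions to be adapted):

    # inside an interactive "crm configure" session:
    primitive p_drbd_wwwdata ocf:linbit:drbd \
        params drbd_resource=wwwdata \
        op monitor interval=29s role=Master timeout=30s \
        op monitor interval=31s role=Slave timeout=30s \
        op start timeout=240s \
        op stop timeout=100s
    ms ms_drbd_wwwdata p_drbd_wwwdata \
        meta master-max=1 master-node-max=1 \
             clone-max=2 clone-node-max=1 notify=true
    verify
    commit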
> 
>> Btw: is there a history like in the bash where i see which crm command i 
>> entered at which time ? I know that crm history is mighty, but didn't find 
>> that.
> 
> We don't have that yet :/ If you're not in interactive mode, your bash
> history should have the commands though.
> 
>> no backup - no mercy
> 
> lol ;)
> 
> Cheers,
> Kristoffer
> 
> -- 
> // Kristoffer Grönlund
> // kgronl...@suse.com
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Multi cluster

2017-08-05 Thread Kristián Feldsam
Hello, what about data synchronization? I suppose that users also write to the ERP 
and do not only read, right?

In my opinion, more complexity = more fragile system = lower availability than a 
simple system.

So you could set up a local load balancer in each site, which by default connects 
to the DC instance and, in case of failure (DC, Internet connection, etc.), 
switches to the local instance.

If you also need writes, then you also need to solve data synchronization.
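As a rough illustration of that idea (only a sketch; HAProxy is one possible
choice, and the backend names, addresses, ports and health-check URL below are
made up):

    # hypothetical haproxy.cfg fragment on each site's local balancer
    backend erp
        option httpchk GET /health
        # prefer the central DC instance while it is reachable
        server dc_erp    dc.example.com:8080  check
        # use the site-local ERP node only when the DC backend is down
        server local_erp 127.0.0.1:8080       check backup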

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 5 Aug 2017, at 13:11, Jan Pokorný <jpoko...@redhat.com> wrote:
> 
> On 05/08/17 00:10 +0200, Jan Pokorný wrote:
>> [addendum inline]
> 
> And some more...
> 
>> On 04/08/17 18:35 +0200, Jan Pokorný wrote:
>>> On 03/08/17 20:37 +0530, sharafraz khan wrote:
>>>> I am new to clustering so please ignore if my Question sounds silly, i have
>>>> a requirement were in i need to create cluster for ERP application with
>>>> apache, VIP component,below is the scenario
>>>> 
>>>> We have 5 Sites,
>>>> 1. DC
>>>> 2. Site A
>>>> 3. Site B
>>>> 4. Site C
>>>> 5. Site D
>>>> 
>>>> Over here we need to configure HA as such that DC would be the primary Node
>>>> hosting application & be accessed from by all the users in each sites, in
>>>> case of Failure of DC Node, Site users should automatically be switched to
>>>> there local ERP server, and not to the Nodes at other sites, so
>>>> communication would be as below
>>>> 
>>>> DC < -- > Site A
>>>> DC < -- > Site B
>>>> DC < -- > Site C
>>>> DC < -- > Site D
> 
> Note that if you wanted to imply you generally rely on/are limited with
> star-like network topology with a central machine doubling as a relay,
> you distort our implicit notion (perhaps we should make it explicit)
> of cluster forming a complete graph (directly or indirectly through
> multicast) amongst the nodes of the healthy partition (corosync is not
> as advanced to support grid/mesh/star topologies, but it's a non-goal
> for a direct peer messaging layer to start with).  Sure, you can
> workaround this with tunnelling, at the cost of compromising
> reliability (and efficiency) and hence high availability :)
> 
> Regarding the site <-> DC communication _after failure_, do you
> mean checking if the DC is OK again or something else?
> 
>>>> Now the challenge is
>>>> 
>>>> 1. If i create a cluster between say DC < -- > Site A it won't allow me to
>>>> create another cluster on DC with other sites
>>>> 
>>>> 2. if i setup all the nodes in single cluster how can i ensure that in case
>>>> of Node Failure or loss of connectivity to DC node from any site, users
>>>> from that sites should be switched to Local ERP node and not to nodes on
>>>> other site.
>>>> 
>>>> a urgent response and help would be quite helpful
>>> 
>>> From your description, I suppose you are limited to just a single
>>> machine per site/DC (making the overall picture prone to double
>>> fault, first DC goes down, then any of the sites goes down, then
>>> at least the clients of that very site encounter the downtime).
>>> Otherwise I'd suggest looking at booth project that facilitates
>>> inter-cluster (back to your "multi cluster") decisions, extending
>>> upon pacemaker performing the intra-cluster ones.
>>> 
>>> Using a single cluster approach, you should certainly be able to
>>> model your fallback scenario, something like:
>>> 
>>> - define a group A (VIP, apache, app), infinity-located with DC
>>> - define a different group B with the same content, set up as clone
>>>  B_clone being (-infinity)-located with DC
>>> - set up ordering "B_clone starts when A stops", of "Mandatory" kind
>>> 
>>> Further tweaks may be needed.
>> 
>> Hmm, actually VIP would not help much here, even if "ip" adapted per
>> host ("#uname") as there're two conflicting principles ("globality"
>> of the network for serving from DC vs

Re: [ClusterLabs] Antw: LVM resource and DAS - would two resources off one DAS...

2017-07-27 Thread Kristián Feldsam
In this case I prefer to use 
http://www.storagereview.com/lsi_syncro_cs_ha_das_storage_overview 
<http://www.storagereview.com/lsi_syncro_cs_ha_das_storage_overview> which you 
can buy on ebay for about 1000USD

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 27 Jul 2017, at 15:20, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> 
> wrote:
> 
> Hi!
> 
> I think it will work, because the cluster does not monitor the PVs or 
> prtition or LUNs. It just checks whether you can activate the LVs (i.e.: the 
> VG). That's what I know...
> 
> Regards,
> Ulrich
> 
>>>> lejeczek <pelj...@yahoo.co.uk> schrieb am 27.07.2017 um 15:05 in Nachricht
> <636398a2-e8ea-644b-046b-ff12358de...@yahoo.co.uk>:
>> hi fellas
>> 
>> I realise this might be quite specialized topic, as this 
>> regards hardware DAS(sas2) and LVM and cluster itself but 
>> I'm hoping with some luck an expert peeps over here and I'll 
>> get some or all the answers then.
>> 
>> question:
>> Can cluster manage two(or more) LVM resources which would be 
>> on/in same single DAS storage and have these resources(eg. 
>> one LVM runs on 1&2 the other LVM runs on 3&4) run on 
>> different nodes(which naturally all connect to that single DAS)?
>> 
>> Now, I guess this might be something many do already and 
>> many will say: trivial. In which case a few firm "yes" 
>> confirmations will mean - typical, just do it.
>> Or could it be something unusual and untested but 
>> might/should work when done with care and special "preparation"?
>> 
>> I understand that lots depends on what/how harwdare+kernel 
>> do things, but if possible(?) I'd leave it out for now and 
>> ask only the cluster itself - do you do it?
>> 
>> many thanks.
>> L.
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> http://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org 
> 
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] why resources are restarted when a node rejoins a cluster?

2017-07-25 Thread Kristián Feldsam
It looks like stickiness is not configured.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/ch05s03s02.html
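For example (assuming pcs; the value 100 is arbitrary, and crmsh has an
equivalent rsc_defaults command):

    # make resources prefer to stay where they are after a failover
    pcs resource defaults resource-stickiness=100
    # or with crmsh:
    # crm configure rsc_defaults resource-stickiness=100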

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 25 Jul 2017, at 16:06, Ken Gaillot <kgail...@redhat.com> wrote:
> 
> On Mon, 2017-07-24 at 23:07 -0400, Digimer wrote:
>> On 2017-07-24 11:04 PM, ztj wrote:
>>> Hi all,
>>> I have 2 Centos nodes with heartbeat and pacemaker-1.1.13 installed,
>>> and almost everything is working fine, I have only apache configured
>>> for testing, when a node goes down the failover is done correctly,
>>> but there's a problem when a node failbacks.
>>> 
>>> For example, let's say that Node1 has the lead on apache resource,
>>> then I reboot Node1, so Pacemaker detect it goes down, then apache
>>> is promoted to the Node2 and it keeps there running fine, that's
>>> fine, but when Node1 recovers and joins the cluster again, apache is
>>> restarted on Node2 again.
>>> 
>>> Anyone knows, why resources are restarted when a node rejoins a
>>> cluster? thanks
> 
> That's not the default behavior, so something else is going on. Show
> your configuration (with any sensitive information removed) for more
> help.
> 
>> You sent this to the moderators, not the list.
>> 
>> Please don't use heartbeat, it is extremely deprecated. Please switch
>> to corosync.
> 
> Since it's CentOS, it has to be corosync, unless heartbeat was compiled
> locally.
> 
>> 
>> To offer any other advice, you need to share your config and the logs
>> from both nodes. Please respond to the list, not
>> developers-ow...@clusterlabs.org.
>> 
>> digimer
>> -- 
>> Digimer
>> Papers and Projects: https://alteeve.com/w/
>> "I am, somehow, less interested in the weight and convolutions of Einstein’s 
>> brain than in the near certainty that people of equal talent have lived and 
>> died in cotton fields and sweatshops." - Stephen Jay Gould
>> ___
>> Users mailing list: Users@clusterlabs.org
>> http://lists.clusterlabs.org/mailman/listinfo/users
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> -- 
> Ken Gaillot <kgail...@redhat.com>
> 
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Kristián Feldsam
Yes, I just had an idea: he probably has a managed switch or fabric...

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 22:18, Klaus Wenninger <kwenn...@redhat.com> wrote:
> 
> On 07/24/2017 09:46 PM, Kristián Feldsam wrote:
>> So why not use some other fencing method, like disabling the port on the switch, 
>> so nobody can access the faulty node and write data to it? It is common practice 
>> too.
> 
> Well don't get me wrong here. I don't want to hard-sell sbd.
> Just thought that very likely the requirements that prevent usage
> of a remote-controlled power-switch will make access
> to a switch to disable the ports unusable as well.
> And if a working qdevice setup is there already the gap between
> what he thought he would get from qdevice and what he actually
> had just matches exactly quorum-based-watchdog-fencing.
> 
> But you are of course right.
> I don't really know the scenario.
> Maybe fabric fencing is the perfect match - good to mention it
> here as a possibility.
> 
> Regards,
> Klaus
>   
>> 
>> S pozdravem Kristián Feldsam
>> Tel.: +420 773 303 353, +421 944 137 535
>> E-mail.: supp...@feldhost.cz <mailto:supp...@feldhost.cz>
>> 
>> www.feldhost.cz <http://www.feldhost.cz/> - FeldHost™ – profesionální 
>> hostingové a serverové služby za adekvátní ceny.
>> 
>> FELDSAM s.r.o.
>> V rohu 434/3
>> Praha 4 – Libuš, PSČ 142 00
>> IČ: 290 60 958, DIČ: CZ290 60 958
>> C 200350 vedená u Městského soudu v Praze
>> 
>> Banka: Fio banka a.s.
>> Číslo účtu: 2400330446/2010
>> BIC: FIOBCZPPXX
>> IBAN: CZ82 2010  0024 0033 0446
>> 
>>> On 24 Jul 2017, at 21:16, Klaus Wenninger <kwenn...@redhat.com 
>>> <mailto:kwenn...@redhat.com>> wrote:
>>> 
>>> On 07/24/2017 08:27 PM, Prasad, Shashank wrote:
>>>> My understanding is that  SBD will need a shared storage between clustered 
>>>> nodes.
>>>> And that, SBD will need at least 3 nodes in a cluster, if using w/o shared 
>>>> storage.
>>> 
>>> Haven't tried to be honest but reason for 3 nodes is that without
>>> shared disk you need a real quorum-source and not something
>>> 'faked' as with 2-node-feature in corosync.
>>> But I don't see anything speaking against getting the proper
>>> quorum via qdevice instead with a third full cluster-node.
>>> 
>>>>  
>>>> Therefore, for systems which do NOT use shared storage between 1+1 HA 
>>>> clustered nodes, SBD may NOT be an option.
>>>> Correct me, if I am wrong.
>>>>  
>>>> For cluster systems using the likes of iDRAC/IMM2 fencing agents, which 
>>>> have redundant but shared power supply units with the nodes, the normal 
>>>> fencing mechanisms should work for all resiliency scenarios, except when 
>>>> IMM2/iDRAC is NOT reachable for whatsoever reasons. And, to bail 
>>>> out of those situations in the absence of SBD, I believe using 
>>>> user-defined failover hooks (via scripts) into Pacemaker Alerts, with sudo 
>>>> permissions for ‘hacluster’, should help.
>>> 
>>> If you don't see your fencing device, assuming after some time
>>> that the corresponding node will probably be down is quite risky
>>> in my opinion.
>>> But why not assure it to be down using a watchdog?
>>> 
>>>>  
>>>> Thanx.
>>>>  
>>>>  
>>>> From: Klaus Wenninger [mailto:kwenn...@redhat.com 
>>>> <mailto:kwenn...@redhat.com>] 
>>>> Sent: Monday, July 24, 2017 11:31 PM
>>>> To: Cluster Labs - All topics related to open-source clustering welcomed; 
>>>> Prasad, Shashank
>>>> Subject: Re: [ClusterLabs] Two nodes cluster issue
>>>>  
>>>> On 07/24/2017 07:32 PM, Prasad, Shashank wrote:
>>>> Sometimes IPMI fence devices use shared power of the node, and it cannot 
>>>> be avoided.
>>>> In such scenarios the HA cluster is NOT able to handle the power failure 
>>>> of a node, since the power is shared with its own fence device.
>>>> The failure of IPMI based fencing can also exist due to other reasons also.

Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Kristián Feldsam
So why not use some other fencing method, like disabling the port on the switch, so 
nobody can access the faulty node and write data to it? It is common practice too.

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 21:16, Klaus Wenninger <kwenn...@redhat.com> wrote:
> 
> On 07/24/2017 08:27 PM, Prasad, Shashank wrote:
>> My understanding is that  SBD will need a shared storage between clustered 
>> nodes.
>> And that, SBD will need at least 3 nodes in a cluster, if using w/o shared 
>> storage.
> 
> Haven't tried to be honest but reason for 3 nodes is that without
> shared disk you need a real quorum-source and not something
> 'faked' as with 2-node-feature in corosync.
> But I don't see anything speaking against getting the proper
> quorum via qdevice instead with a third full cluster-node.
> 
>>  
>> Therefore, for systems which do NOT use shared storage between 1+1 HA 
>> clustered nodes, SBD may NOT be an option.
>> Correct me, if I am wrong.
>>  
>> For cluster systems using the likes of iDRAC/IMM2 fencing agents, which have 
>> redundant but shared power supply units with the nodes, the normal fencing 
>> mechanisms should work for all resiliency scenarios, except when IMM2/iDRAC is 
>> NOT reachable for whatsoever reasons. And, to bail out of those 
>> situations in the absence of SBD, I believe using user-defined failover 
>> hooks (via scripts) into Pacemaker Alerts, with sudo permissions for 
>> ‘hacluster’, should help.
> 
> If you don't see your fencing device, assuming after some time
> that the corresponding node will probably be down is quite risky
> in my opinion.
> But why not assure it to be down using a watchdog?
> 
>>  
>> Thanx.
>>  
>>  
>> From: Klaus Wenninger [mailto:kwenn...@redhat.com 
>> <mailto:kwenn...@redhat.com>] 
>> Sent: Monday, July 24, 2017 11:31 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed; 
>> Prasad, Shashank
>> Subject: Re: [ClusterLabs] Two nodes cluster issue
>>  
>> On 07/24/2017 07:32 PM, Prasad, Shashank wrote:
>> Sometimes IPMI fence devices use shared power of the node, and it cannot be 
>> avoided.
>> In such scenarios the HA cluster is NOT able to handle the power failure of 
>> a node, since the power is shared with its own fence device.
>> The failure of IPMI based fencing can also exist due to other reasons also.
>>  
>> A failure to fence the failed node will cause cluster to be marked UNCLEAN.
>> To get over it, the following command needs to be invoked on the surviving 
>> node.
>>  
>> pcs stonith confirm <node> --force
>>  
>> This can be automated by hooking in a recovery script when the Stonith 
>> resource ‘Timed Out’ event occurs.
>> To be more specific, the Pacemaker Alerts can be used to watch for Stonith 
>> timeouts and failures.
>> In that script, all that’s essentially to be executed is the aforementioned 
>> command.
>> 
>> If I get you right here you can disable fencing then in the first place.
>> Actually quorum-based-watchdog-fencing is the way to do this in a
>> safe manner. This of course assumes you have a proper source for
>> quorum in your 2-node-setup with e.g. qdevice or using a shared
>> disk with sbd (not directly pacemaker quorum here but similar thing
>> handled inside sbd).
>> 
>> 
>> Since the alerts are issued from ‘hacluster’ login, sudo permissions for 
>> ‘hacluster’ needs to be configured.
>>  
>> Thanx.
>>  
>>  
>> From: Klaus Wenninger [mailto:kwenn...@redhat.com 
>> <mailto:kwenn...@redhat.com>] 
>> Sent: Monday, July 24, 2017 9:24 PM
>> To: Kristián Feldsam; Cluster Labs - All topics related to open-source 
>> clustering welcomed
>> Subject: Re: [ClusterLabs] Two nodes cluster issue
>>  
>> On 07/24/2017 05:37 PM, Kristián Feldsam wrote:
>> I personally think that powering off the node via a switched PDU is safer, or not?
>> 
>> True if that is working in your environment. If you can't do a physical setup
>> where you aren't simultaneously losing connection to both your node and
>> the switch-device (or you just want to cover cases where that happens)
>> you have to come up with something else.
>

Re: [ClusterLabs] resources do not migrate although node is going to standby

2017-07-24 Thread Kristián Feldsam
Hmm, I think that it is just a preferred location; if it is not available, the 
resource should start on the other node. You can of course migrate manually with 
"crm resource move resource_name node_name", which in effect changes that location 
preference.
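For example, using the names from this thread (note that newer crmsh versions
call "unmove" "clear"):

    # drop the old preference pinning the VM to ha-idg-2
    crm configure delete cli-prefer-prim_vm_servers_alive
    # or move it explicitly and later remove the temporary constraint again
    crm resource move prim_vm_servers_alive ha-idg-1
    crm resource unmove prim_vm_servers_alive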

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 20:52, Lentes, Bernd <bernd.len...@helmholtz-muenchen.de> 
> wrote:
> 
> Hi,
> 
> just to be sure:
> i have a VirtualDomain resource (called prim_vm_servers_alive) running on one 
> node (ha-idg-2). From reasons i don't remember i have a location constraint:
> location cli-prefer-prim_vm_servers_alive prim_vm_servers_alive role=Started 
> inf: ha-idg-2
> 
> Now i try to set this node into standby, because i need it to reboot.
> From what i think now the resource can't migrate to node ha-idg-1 because of 
> this constraint. Right ?
> 
> That's what the log says:
> Jul 21 18:03:50 ha-idg-2 VirtualDomain(prim_vm_servers_alive)[28565]: ERROR: 
> Server_Monitoring: live migration to qemu+ssh://ha-idg-1/system  failed: 1
> Jul 21 18:03:50 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
> prim_vm_servers_alive_migrate_to_0:28565:stderr [ error: Requested operation 
> is not valid: domain 'Server_Monitoring' is already active ]
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: process_lrm_event: Operation 
> prim_vm_servers_alive_migrate_to_0: unknown error (node=ha-idg-2, call=114, 
> rc=1, cib-update=572, confirmed=true)
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: process_lrm_event: 
> ha-idg-2-prim_vm_servers_alive_migrate_to_0:114 [ error: Requested operation 
> is not valid: domain 'Server_Monitoring' is already active\n ]
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 64 
> (prim_vm_servers_alive_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 
> 1): Error
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:   notice: abort_transition_graph: 
> Transition aborted by prim_vm_servers_alive_migrate_to_0 'modify' on 
> ha-idg-2: Event failed 
> (magic=0:1;64:417:0:656ecd4a-f8e8-46c9-b4e6-194616237988, cib=0.879.5, sou
> rce=match_graph_event:350, 0)
> Jul 21 18:03:50 ha-idg-2 crmd[8576]:  warning: status_from_rc: Action 64 
> (prim_vm_servers_alive_migrate_to_0) on ha-idg-2 failed (target: 0 vs. rc: 
> 1): Error
> Jul 21 18:03:53 ha-idg-2 VirtualDomain(prim_vm_mausdb)[28564]: ERROR: 
> mausdb_vm: live migration to qemu+ssh://ha-idg-1/system  failed: 1
> 
> That is the way i understand "Requested operation is not valid". It's not 
> possible because of the constraint.
> I just wanted to be sure. And because the resource can't be migrated but the 
> host is going to standby the resource is stopped. Right ?
> 
> Strange is that a second resource also running on node ha-idg-2 called 
> prim_vm_mausdb also didn't migrate to the other node. And that's something i 
> don't understand completely.
> The resource didn't have any location constraint.
> Both VirtualDomains have a vnc server configured (that i can monitor the boot 
> procedure if i have starting problems). The vnc port for prim_vm_mausdb is 
> 5900 in the configuration file.
> The port is set to auto for prim_vm_servers_alive because i forgot to 
> configure it fix. So it must be s.th like 5900+ because both resources were 
> running concurrently on the same node.
> But prim_vm_mausdb can't migrate because the port is occupied on the other 
> node ha-idg-1:
> 
> Jul 21 18:03:53 ha-idg-2 VirtualDomain(prim_vm_mausdb)[28564]: ERROR: 
> mausdb_vm: live migration to qemu+ssh://ha-idg-1/system  failed: 1
> Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
> prim_vm_mausdb_migrate_to_0:28564:stderr [ error: internal error: early end 
> of file from monitor: possible problem: ]
> Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
> prim_vm_mausdb_migrate_to_0:28564:stderr [ Failed to start VNC server on 
> `127.0.0.1:0,share=allow-exclusive': Failed to bind socket: Address already 
> in use ]
> Jul 21 18:03:53 ha-idg-2 lrmd[8573]:   notice: operation_finished: 
> prim_vm_mausdb_migrate_to_0:28564:stderr [  ]
> Jul 21 18:03:53 ha-idg-2 crmd[8576]:   notice: process_lrm_event: Operation 
> prim_vm_mausdb_migrate_to_0: unknown error (node=ha-idg-2, call=110, rc=1, 
> cib-update=573, confirmed=true)
> Jul 21 18:03:53 ha-idg-2 crmd[8576]:   notice: process_lrm_event: 
> ha-idg-2-prim_vm_mausdb_migrate_to_

Re: [ClusterLabs] timeout for stop VirtualDomain running Windows 7

2017-07-24 Thread Kristián Feldsam
Hmm, is it possible to disable installing updates on shutdown, and do regular 
maintenance to install the updates manually?

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 19:30, Lentes, Bernd <bernd.len...@helmholtz-muenchen.de> 
> wrote:
> 
> Hi,
> 
> i have a VirtualDomian resource running a Windows 7 client. This is the 
> respective configuration:
> 
> primitive prim_vm_servers_alive VirtualDomain \
>params config="/var/lib/libvirt/images/xml/Server_Monitoring.xml" \
>params hypervisor="qemu:///system" \
>params migration_transport=ssh \
>params autoset_utilization_cpu=false \
>params autoset_utilization_hv_memory=false \
>op start interval=0 timeout=120 \
>op stop interval=0 timeout=130 \
>op monitor interval=30 timeout=30 \
>op migrate_from interval=0 timeout=180 \
>op migrate_to interval=0 timeout=190 \
>meta allow-migrate=true target-role=Started is-managed=true
> 
> The timeout for the stop operation is 130 seconds. But our windows 7 clients, 
> as most do, install updates from time to time .
> And then a shutdown can take 10 or 20 minutes or even longer.
> If the timeout isn't as long as the installation of the updates takes then 
> the vm is forced off. With all possible negative consequences.
> But on the other hand i don't like to set a timeout of eg. 20 minutes, which 
> may still not be enough in some circumstances, but is much too long
> if the guest doesn't install updates.
> 
> Any ideas ?
> 
> Thanks.
> 
> 
> Bernd
> 
> -- 
> Bernd Lentes 
> 
> Systemadministration 
> institute of developmental genetics 
> Gebäude 35.34 - Raum 208 
> HelmholtzZentrum München 
> bernd.len...@helmholtz-muenchen.de 
> phone: +49 (0)89 3187 1241 
> fax: +49 (0)89 3187 2294 
> 
> no backup - no mercy
> 
> 
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons 
> Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Kristián Feldsam
Is the NFS server/share also managed by Pacemaker, and is the ordering set right?
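For example, such constraints might look roughly like this with pcs (the
nfs-server name is a placeholder; drbd_filesystem is the name from the logs
below):

    # run the NFS server only where the filesystem is mounted,
    # and stop it before the filesystem is unmounted
    pcs constraint colocation add nfs-server with drbd_filesystem INFINITY
    pcs constraint order drbd_filesystem then nfs-server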

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 18:01, Dimitri Maziuk <dmaz...@bmrb.wisc.edu> wrote:
> 
> On 07/24/2017 10:38 AM, Ken Gaillot wrote:
> 
>> A restart shouldn't lead to fencing in any case where something's not
>> going seriously wrong. I'm not familiar with the "kernel is using it"
>> message, I haven't run into that before.
> 
> I posted it at least once before.
> 
>> 
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Running 
>> stop for /dev/drbd0 on /raid
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: Trying to 
>> unmount /raid
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with TERM
>> Jul 22 14:03:48 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with TERM
>> Jul 22 14:03:49 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:50 zebrafish ntpd[596]: Deleting interface #8 enp2s0f0, 
>> 144.92.167.221#123, interface stats: received=0, sent=0, dropped=0, 
>> active_time=260 secs
>> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with TERM
>> Jul 22 14:03:50 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with KILL
>> Jul 22 14:03:51 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:52 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with KILL
>> Jul 22 14:03:53 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid; trying cleanup with KILL
>> Jul 22 14:03:54 zebrafish Filesystem(drbd_filesystem)[6886]: INFO: No 
>> processes on /raid were signalled. force_unmount is set to 'yes'
>> Jul 22 14:03:55 zebrafish Filesystem(drbd_filesystem)[6886]: ERROR: Couldn't 
>> unmount /raid, giving up!
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info 
>> about processes that use ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
>> or fuser(1)) ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
>> trying cleanup with TERM ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info 
>> about processes that use ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
>> or fuser(1)) ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ ocf-exit-reason:Couldn't unmount /raid; 
>> trying cleanup with TERM ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ umount: /raid: target is busy. ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [ (In some cases useful info 
>> about processes that use ]
>> Jul 22 14:03:55 zebrafish lrmd[1075]:  notice: 
>> drbd_filesystem_stop_0:6886:stderr [  the device is found by lsof(8) 
>> or fuse

Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Kristián Feldsam
I personally think that powering off the node via a switched PDU is safer, or not?

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 17:27, Klaus Wenninger <kwenn...@redhat.com> wrote:
> 
> On 07/24/2017 05:15 PM, Tomer Azran wrote:
>> I still don't understand why the qdevice concept doesn't help on this 
>> situation. Since the master node is down, I would expect the quorum to 
>> declare it as dead.
>> Why doesn't it happens?
> 
> That is not how quorum works. It just limits the decision-making to the 
> quorate subset of the cluster.
> Still the unknown nodes are not sure to be down.
> That is why I suggested to have quorum-based watchdog-fencing with sbd.
> That would assure that within a certain time all nodes of the non-quorate part
> of the cluster are down.
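A rough sketch of what quorum-based watchdog fencing with diskless sbd involves
(paths, timeouts and the exact procedure are distro-dependent and only
illustrative here):

    # /etc/sysconfig/sbd (no SBD_DEVICE configured => watchdog-only mode)
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5

    # enable sbd and tell Pacemaker to rely on the watchdog,
    # then restart the cluster stack so sbd is picked up
    systemctl enable sbd
    pcs property set stonith-watchdog-timeout=10s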
> 
>> 
>> 
>> 
>> On Mon, Jul 24, 2017 at 4:15 PM +0300, "Dmitri Maziuk" 
>> <dmitri.maz...@gmail.com <mailto:dmitri.maz...@gmail.com>> wrote:
>> 
>> On 2017-07-24 07:51, Tomer Azran wrote:
>> > We don't have the ability to use it.
>> > Is that the only solution?
>> 
>> No, but I'd recommend thinking about it first. Are you sure you will 
>> care about your cluster working when your server room is on fire? 'Cause 
>> unless you have halon suppression, your server room is a complete 
>> write-off anyway. (Think water from sprinklers hitting rich chunky volts 
>> in the servers.)
>> 
>> Dima
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org <mailto:Users@clusterlabs.org>
>> http://lists.clusterlabs.org/mailman/listinfo/users 
>> <http://lists.clusterlabs.org/mailman/listinfo/users>
>> 
>> Project Home: http://www.clusterlabs.org <http://www.clusterlabs.org/>
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
>> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>
>> 
>> 
>> ___
>> Users mailing list: Users@clusterlabs.org <mailto:Users@clusterlabs.org>
>> http://lists.clusterlabs.org/mailman/listinfo/users 
>> <http://lists.clusterlabs.org/mailman/listinfo/users>
>> 
>> Project Home: http://www.clusterlabs.org <http://www.clusterlabs.org/>
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
>> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>
> 
> -- 
> Klaus Wenninger
> 
> Senior Software Engineer, EMEA ENG Openstack Infrastructure
> 
> Red Hat
> 
> kwenn...@redhat.com <mailto:kwenn...@redhat.com>   
> ___
> Users mailing list: Users@clusterlabs.org <mailto:Users@clusterlabs.org>
> http://lists.clusterlabs.org/mailman/listinfo/users 
> <http://lists.clusterlabs.org/mailman/listinfo/users>
> 
> Project Home: http://www.clusterlabs.org <http://www.clusterlabs.org/>
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-24 Thread Kristián Feldsam
Hmm, so when you know that it happens also when putting the node in standby, then 
why do you run yum update on a live cluster? It must be clear that the node will be 
fenced.

Would you post your pacemaker config? + some logs?

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 17:04, Dimitri Maziuk <dmaz...@bmrb.wisc.edu> wrote:
> 
> On 07/24/2017 09:40 AM, Jan Pokorný wrote:
> 
>> Would there be an interest, though?  And would that be meaningful?
> 
> IMO the only reason to put a node in standby is if you want to reboot
> the active node with no service interruption. For anything else,
> including a reboot with service interruption (during maintenance
> window), it's a no.
> 
> This is akin to "your mouse has moved, windows needs to be restarted".
> Except the mouse thing is a joke whereas those "standby" clowns appear
> to be serious.
> 
> With this particular failure, something in the Redhat patched kernel
> (NFS?) does not release the DRBD filesystem. It happens when I put the
> node in standby as well, the only difference is not messing up the RPM
> database which isn't that hard to fix. Since I have several centos 6 +
> DRBD + NFS + heartbeat R1 pairs running happily for years, I have to
> conclude that centos 7 is simply the wrong tool for this particular job.
> 
> -- 
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Kristián Feldsam
An APC AP7921 goes for just about 200€ on eBay.

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 15:12, Dmitri Maziuk <dmitri.maz...@gmail.com> wrote:
> 
> On 2017-07-24 07:51, Tomer Azran wrote:
>> We don't have the ability to use it.
>> Is that the only solution?
> 
> No, but I'd recommend thinking about it first. Are you sure you will care 
> about your cluster working when your server room is on fire? 'Cause unless 
> you have halon suppression, your server room is a complete write-off anyway. 
> (Think water from sprinklers hitting rich chunky volts in the servers.)
> 
> Dima
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Two nodes cluster issue

2017-07-24 Thread Kristián Feldsam
Hello, you have to use a second fencing device, for example an APC Switched PDU.

https://wiki.clusterlabs.org/wiki/Configure_Multiple_Fencing_Devices_Using_pcs
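Roughly, following that guide, adding a PDU as a second fencing level could look
like this with pcs (IP, credentials, port and the fence_node1_ipmi device name
are placeholders for your existing setup):

    pcs stonith create fence_node1_pdu fence_apc_snmp \
        ipaddr=pdu.example.com login=apc passwd=secret \
        port=1 pcmk_host_list=node1
    pcs stonith level add 1 node1 fence_node1_ipmi
    pcs stonith level add 2 node1 fence_node1_pdu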

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 24 Jul 2017, at 13:51, Tomer Azran <tomer.az...@edp.co.il> wrote:
> 
> Hello,
>  
> We built a pacemaker cluster with 2 physical servers.
> We configured DRBD in Master\Slave setup, a floating IP and file system mount 
> in Active\Passive mode.
> We configured two STONITH devices (fence_ipmilan), one for each server.
>  
> We are trying to simulate a situation when the Master server crushes with no 
> power.
> We pulled both of the PSU cables and the server becomes offline (UNCLEAN).
> The resources that the Master use to hold are now in Started (UNCLEAN) state.
> The state is unclean since the STONITH failed (the STONITH device is located 
> on the server (Intel RMM4 - IPMI) – which uses the same power supply).
>  
> The problem is that now, the cluster does not releasing the resources that 
> the Master holds, and the service goes down.
>  
> Is there any way to overcome this situation?
> We tried to add a qdevice but got the same results.
>  
> We are using pacemaker 1.1.15 on CentOS 7.3
>  
> Thanks,
> Tomer.
> ___
> Users mailing list: Users@clusterlabs.org <mailto:Users@clusterlabs.org>
> http://lists.clusterlabs.org/mailman/listinfo/users 
> <http://lists.clusterlabs.org/mailman/listinfo/users>
> 
> Project Home: http://www.clusterlabs.org <http://www.clusterlabs.org/>
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
> Bugs: http://bugs.clusterlabs.org <http://bugs.clusterlabs.org/>
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] epic fail

2017-07-23 Thread Kristián Feldsam
You cannot update a running cluster! First you need to put the node in standby, 
check that all resources have stopped, and then do what you need. This was 
unfortunately your mistake :(
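A rough outline of the safer procedure (node name is a placeholder; newer pcs
versions use "pcs node standby" instead of "pcs cluster standby"):

    pcs cluster standby node1     # move/stop all resources on node1
    pcs status                    # wait until nothing is running on node1
    yum update
    reboot                        # if a new kernel or similar requires it
    pcs cluster unstandby node1   # let resources come back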

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – profesionální hostingové a serverové služby za 
adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 23 Jul 2017, at 14:27, Dmitri Maziuk <dmitri.maz...@gmail.com> wrote:
> 
> So yesterday I ran yum update that pulled in the new pacemaker and tried to 
> restart it. The node went into its usual "can't unmount drbd because kernel 
> is using it" and got stonith'ed in the middle of yum transaction. The end 
> result: DRBD reports split brain, HA daemons don't start on boot, RPM 
> database is FUBAR. I've had enough. I'm rebuilding this cluster as centos 6 + 
> heartbeat R1.
> 
> Centos 7 + DRBD 8.4 + pacemaker + NFS server: FAIL. You have been warned.
> 
> Dima
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] GFS2 Errors

2017-07-19 Thread Kristián Feldsam
Hello, today I see GFS2 errors in the log and there is nothing about them on the 
net, so I am writing to this mailing list.

node2   19.07.2017 01:11:55 kernel  kernerr vmscan: shrink_slab: 
gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete 
nr=-4549568322848002755
node2   19.07.2017 01:10:56 kernel  kernerr vmscan: shrink_slab: 
gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete 
nr=-8191295421473926116
node2   19.07.2017 01:10:48 kernel  kernerr vmscan: shrink_slab: 
gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete 
nr=-8225402411152149004
node2   19.07.2017 01:10:47 kernel  kernerr vmscan: shrink_slab: 
gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete 
nr=-8230186816585019317
node2   19.07.2017 01:10:45 kernel  kernerr vmscan: shrink_slab: 
gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete 
nr=-8242007238441787628
node2   19.07.2017 01:10:39 kernel  kernerr vmscan: shrink_slab: 
gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete 
nr=-8250926852732428536
node3   19.07.2017 00:16:02 kernel  kernerr vmscan: shrink_slab: 
gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete 
nr=-5150933278940354602
node3   19.07.2017 00:16:02 kernel  kernerr vmscan: shrink_slab: 
gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-64
node3   19.07.2017 00:16:02 kernel  kernerr vmscan: shrink_slab: 
gfs2_glock_shrink_scan+0x0/0x2f0 [gfs2] negative objects to delete nr=-64
Would somebody explain these errors? The cluster looks like it is working normally. 
I enabled vm.zone_reclaim_mode = 1 on the nodes...
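For reference, persisting that sysctl might look roughly like this (the file
name is arbitrary):

    echo 'vm.zone_reclaim_mode = 1' > /etc/sysctl.d/90-zone-reclaim.conf
    sysctl -p /etc/sysctl.d/90-zone-reclaim.conf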

Thank you!

S pozdravem Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail.: supp...@feldhost.cz <mailto:supp...@feldhost.cz>

www.feldhost.cz <http://www.feldhost.cz/> - FeldHost™ – profesionální 
hostingové a serverové služby za adekvátní ceny.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 – Libuš, PSČ 142 00
IČ: 290 60 958, DIČ: CZ290 60 958
C 200350 vedená u Městského soudu v Praze

Banka: Fio banka a.s.
Číslo účtu: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org