Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-07-02 Thread Jan Friesse

Hi Thomas,

Hi,

Am 04/25/2018 um 09:57 AM schrieb Jan Friesse:

Thomas Lamprecht napsal(a):

On 4/24/18 6:38 PM, Jan Friesse wrote:

On 4/6/18 10:59 AM, Jan Friesse wrote:

Thomas Lamprecht napsal(a):

Am 03/09/2018 um 05:26 PM schrieb Jan Friesse:
I've tested it too and yes, you are 100% right. The bug is there and it's
pretty easy to reproduce when the node with the lowest nodeid is paused. It's
slightly harder when the node with a higher nodeid is paused.



Were you able to make some progress on this issue?


Ya, kind of. Sadly I had to work on a different problem, but I'm 
expecting to send a patch next week.




I guess the different problems were the ones related to the issued 
CVEs :)


Yep.

Also I've spent quite a lot of time thinking about the best possible 
solution. CPG is quite old, it was full of weird bugs, and the risk of 
breakage is very high.


Anyway, I've decided not to try to hack what is apparently broken 
and just go for the risky but proper solution (= needs a LOT more 
testing, but so far looks good).




I did not look deeply into how your revert plays out with the
mentioned commits of the heuristics approach, but this fix would
mean bringing corosync back to a state it already had, and thus
one that was already battle-tested?


Yep, but not fully. The important change was to use joinlists as the 
authoritative source of information about other nodes' clients, so I 
believe that solves the problems which should have been "solved" by the 
downlist heuristics.





The patch and approach seem good to me, with my limited knowledge,
when looking at the various "bandaid" fix commits you mentioned.


Patch is in PR (needle): https://github.com/corosync/corosync/pull/347



Much thanks! First tests work well here.
I could not yet reproduce the problem with the patch applied, in both
testcpg and our cluster configuration file system.


That's good to hear :)



I'll let it run


Perfect.




Just wanted to give some quick feedback.
We deployed this to our community repository about a week ago (after
another week of successful testing); we have had no negative feedback or
issues reported or seen yet, with (as a strong lower bound) > 10k systems
running the fix by now.


Thanks, that's exciting news.



I saw just now that you merged it into needle and master, so, while a 
bit late, this just backs up the confidence in the fix.


Definitely not late until it's released :)



Much thanks for your, and the reviewers', work!


Yep, you are welcome.

Honza



cheers,
Thomas



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-05-07 Thread Thomas Lamprecht

Hi,

Am 04/25/2018 um 09:57 AM schrieb Jan Friesse:

Thomas Lamprecht napsal(a):

On 4/24/18 6:38 PM, Jan Friesse wrote:

On 4/6/18 10:59 AM, Jan Friesse wrote:

Thomas Lamprecht napsal(a):

Am 03/09/2018 um 05:26 PM schrieb Jan Friesse:
I've tested it too and yes, you are 100% right. The bug is there and it's
pretty easy to reproduce when the node with the lowest nodeid is paused. It's
slightly harder when the node with a higher nodeid is paused.



Were you able to make some progress on this issue?


Ya, kind of. Sadly I had to work on a different problem, but I'm 
expecting to send a patch next week.




I guess the different problems were the ones related to the issued 
CVEs :)


Yep.

Also I've spent quite a lot of time thinking about the best possible 
solution. CPG is quite old, it was full of weird bugs, and the risk of 
breakage is very high.


Anyway, I've decided not to try to hack what is apparently broken and 
just go for the risky but proper solution (= needs a LOT more testing, 
but so far looks good).




I did not look deeply into how your revert plays out with the
mentioned commits of the heuristics approach, but this fix would
mean bringing corosync back to a state it already had, and thus
one that was already battle-tested?


Yep, but not fully. The important change was to use joinlists as the 
authoritative source of information about other nodes' clients, so I 
believe that solves the problems which should have been "solved" by the 
downlist heuristics.





The patch and approach seem good to me, with my limited knowledge,
when looking at the various "bandaid" fix commits you mentioned.


Patch is in PR (needle): https://github.com/corosync/corosync/pull/347



Much thanks! First tests work well here.
I could not yet reproduce the problem with the patch applied, in both
testcpg and our cluster configuration file system.


That's good to hear :)



I'll let it run


Perfect.




Just wanted to give some quick feedback.
We deployed this to our community repository about a week ago (after
another week of successful testing); we have had no negative feedback or
issues reported or seen yet, with (as a strong lower bound) > 10k systems
running the fix by now.

I saw just now that you merged it into needle and master, so, while a 
bit late, this just backs up the confidence in the fix.


Much thanks for your, and the reviewers', work!

cheers,
Thomas



Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-04-25 Thread Jan Friesse

Thomas Lamprecht napsal(a):

Honza,

On 4/24/18 6:38 PM, Jan Friesse wrote:

On 4/6/18 10:59 AM, Jan Friesse wrote:

Thomas Lamprecht napsal(a):

Am 03/09/2018 um 05:26 PM schrieb Jan Friesse:

I've tested it too and yes, you are 100% right. The bug is there and it's
pretty easy to reproduce when the node with the lowest nodeid is paused. It's
slightly harder when the node with a higher nodeid is paused.



Were you able to make some progress on this issue?


Ya, kind of. Sadly I had to work on a different problem, but I'm expecting to 
send a patch next week.



I guess the different problems were the ones related to the issued CVEs :)


Yep.

Also I've spent quite a lot of time thinking about the best possible solution. 
CPG is quite old, it was full of weird bugs, and the risk of breakage is very high.

Anyway, I've decided not to try to hack what is apparently broken and just go 
for the risky but proper solution (= needs a LOT more testing, but so far looks 
good).



I did not look deeply into how your revert plays out with the
mentioned commits of the heuristics approach, but this fix would
mean bringing corosync back to a state it already had, and thus
one that was already battle-tested?


Yep, but not fully. The important change was to use joinlists as the 
authoritative source of information about other nodes' clients, so I 
believe that solves the problems which should have been "solved" by the 
downlist heuristics.





The patch and approach seem good to me, with my limited knowledge,
when looking at the various "bandaid" fix commits you mentioned.


Patch is in PR (needle): https://github.com/corosync/corosync/pull/347



Much thanks! First tests work well here.
I could not yet reproduce the problem with the patch applied, in both
testcpg and our cluster configuration file system.


That's good to hear :)



I'll let it run


Perfect.

Regards,
  Honza



cheers,
Thomas





Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-04-25 Thread Thomas Lamprecht
Honza,

On 4/24/18 6:38 PM, Jan Friesse wrote:
>> On 4/6/18 10:59 AM, Jan Friesse wrote:
>>> Thomas Lamprecht napsal(a):
 Am 03/09/2018 um 05:26 PM schrieb Jan Friesse:
> I've tested it too and yes, you are 100% right. The bug is there and it's
> pretty easy to reproduce when the node with the lowest nodeid is paused. It's
> slightly harder when the node with a higher nodeid is paused.
>

 Were you able to make some progress on this issue?
>>>
>>> Ya, kind of. Sadly I had to work on a different problem, but I'm expecting to 
>>> send a patch next week.
>>>
>>
>> I guess the different problems were the ones related to the issued CVEs :)
> 
> Yep.
> 
> Also I've spent quite a lot of time thinking about the best possible 
> solution. CPG is quite old, it was full of weird bugs, and the risk of breakage is 
> very high.
> 
> Anyway, I've decided not to try to hack what is apparently broken and just go 
> for the risky but proper solution (= needs a LOT more testing, but so far looks 
> good).
> 

I did not look deeply into how your revert plays out with the
mentioned commits of the heuristics approach, but this fix would
mean bringing corosync back to a state it already had, and thus
one that was already battle-tested?

The patch and approach seem good to me, with my limited knowledge,
when looking at the various "bandaid" fix commits you mentioned.

> Patch is in PR (needle): https://github.com/corosync/corosync/pull/347
> 

Much thanks! First tests work well here.
I could not yet reproduce the problem with the patch applied, in both
testcpg and our cluster configuration file system.

I'll let it run 

cheers,
Thomas



Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-04-24 Thread Jan Friesse

Thomas,



Hi Honza

On 4/6/18 10:59 AM, Jan Friesse wrote:

Thomas Lamprecht napsal(a):

Am 03/09/2018 um 05:26 PM schrieb Jan Friesse:

I've tested it too and yes, you are 100% right. The bug is there and it's
pretty easy to reproduce when the node with the lowest nodeid is paused. It's
slightly harder when the node with a higher nodeid is paused.



Were you able to make some progress on this issue?


Ya, kind of. Sadly I had to work on a different problem, but I'm expecting to 
send a patch next week.



I guess the different problems were the ones related to the issued CVEs :)


Yep.

Also I've spent quite a lot of time thinking about the best possible 
solution. CPG is quite old, it was full of weird bugs, and the risk of 
breakage is very high.


Anyway, I've decided not to try to hack what is apparently broken and 
just go for the risky but proper solution (= needs a LOT more testing, but 
so far looks good).


Patch is in PR (needle): https://github.com/corosync/corosync/pull/347

Regards,
  Honza




We'd really like a fix for this, so if there's anything I can do to help
just hit me up. :)


Testing would be welcomed.



Do you have anything we could test already?
Our freeze for our next release is in sight, so it would be really great
if we had an upstream-accepted resolution for this issue by then.

cheers,
Thomas





Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-04-24 Thread Thomas Lamprecht
Hi Honza

On 4/6/18 10:59 AM, Jan Friesse wrote:
> Thomas Lamprecht napsal(a):
>> Am 03/09/2018 um 05:26 PM schrieb Jan Friesse:
>>> I've tested it too and yes, you are 100% right. The bug is there and it's
>>> pretty easy to reproduce when the node with the lowest nodeid is paused. It's
>>> slightly harder when the node with a higher nodeid is paused.
>>>
>>
>> Were you able to make some progress on this issue?
> 
> Ya, kind of. Sadly I had to work on a different problem, but I'm expecting to 
> send a patch next week.
> 

I guess the different problems were the ones related to the issued CVEs :)

>> We'd really like a fix for this, so if there's anything I can do to help
>> just hit me up. :)
> 
> Testing would be welcomed.
> 

Do you have anything we could test already?
Our freeze for our next release is in sight, so it would be really great
if we had an upstream-accepted resolution for this issue by then.

cheers,
Thomas



Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-04-06 Thread Jan Friesse

Hi Thomas,

Thomas Lamprecht napsal(a):

Hi Honza,

Am 03/09/2018 um 05:26 PM schrieb Jan Friesse:

Thomas,

TotemConfchgCallback: ringid (1.1436)
active processors 3: 1 2 3
EXIT
Finalize  result is 1 (should be 1)


Hope I did both tests right, but as it reproduces multiple times
with testcpg and our cpg usage in our filesystem, this seems
properly tested, not just a single occurrence.


I've tested it too and yes, you are 100% right. The bug is there and it's 
pretty easy to reproduce when the node with the lowest nodeid is paused. It's 
slightly harder when the node with a higher nodeid is paused.



Were you able to make some progress on this issue?


Ya, kind of. Sadly I had to work on a different problem, but I'm expecting 
to send a patch next week.



We'd really like a fix for this, so if there's anything I can do to help
just hit me up. :)


Testing would be welcomed.

Honza



Otherwise, I have a (slightly hacky) workaround here (cpg client side); if you
think the issue isn't too easy to address anytime soon, I'd polish that patch
up and we could use it while waiting for the real fix.

cheers,
Thomas






Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-04-05 Thread Thomas Lamprecht

Hi Honza,

Am 03/09/2018 um 05:26 PM schrieb Jan Friesse:

Thomas,

TotemConfchgCallback: ringid (1.1436)
active processors 3: 1 2 3
EXIT
Finalize  result is 1 (should be 1)


Hope I did both tests right, but as it reproduces multiple times
with testcpg and our cpg usage in our filesystem, this seems
properly tested, not just a single occurrence.


I've tested it too and yes, you are 100% right. The bug is there and it's 
pretty easy to reproduce when the node with the lowest nodeid is paused. It's 
slightly harder when the node with a higher nodeid is paused.




Were you able to make some progress on this issue?
We'd really like a fix for this, so if there's anything I can do to help 
just hit me up. :)


Otherwise, I have a (slightly hacky) workaround here (cpg client side); if you
think the issue isn't too easy to address anytime soon, I'd polish that patch
up and we could use it while waiting for the real fix.

cheers,
Thomas




Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-03-14 Thread Ken Gaillot
On Fri, 2018-03-09 at 17:26 +0100, Jan Friesse wrote:
> Thomas,
> 
> > Hi,
> > 
> > On 3/7/18 1:41 PM, Jan Friesse wrote:
> > > Thomas,
> > > 
> > > > First thanks for your answer!
> > > > 
> > > > On 3/7/18 11:16 AM, Jan Friesse wrote:
> 
> ...
> 
> > TotemConfchgCallback: ringid (1.1436)
> > active processors 3: 1 2 3
> > EXIT
> > Finalize  result is 1 (should be 1)
> > 
> > 
> > Hope I did both tests right, but as it reproduces multiple times
> > with testcpg and our cpg usage in our filesystem, this seems
> > properly tested, not just a single occurrence.
> 
> I've tested it too and yes, you are 100% right. The bug is there and it's
> pretty easy to reproduce when the node with the lowest nodeid is paused.
> It's slightly harder when the node with a higher nodeid is paused.
> 
> Most of the clusters are using power fencing, so they simply never see
> this problem. That may also be the reason why it wasn't reported a long
> time ago (this bug has existed virtually at least since OpenAIS Whitetank).
> So really nice work finding this bug.
> 
> What I'm not entirely sure about is the best way to solve this 
> problem. What I am sure about is that it's going to be "fun" :(
> 
> Let's start with a very high-level view of possible solutions:
> - "Ignore the problem". CPG behaves more or less correctly. The "current"
> membership really didn't change, so it doesn't make too much sense to
> inform about a change. It's possible to use cpg_totem_confchg_fn_t to find
> out when the ringid changes. I'm adding this solution just for completeness,
> because I don't prefer it at all.
> - cpg_confchg_fn_t adds all nodes that left and joined back into the left/join lists
> - cpg will send an extra cpg_confchg_fn_t call about the left and joined
> nodes. I would prefer this solution simply because it makes cpg behavior
> equal in all situations.
> 
> Which of the options would you prefer? Same question also for @Ken (->

Pacemaker should react essentially the same whichever of the last two
options is used. There could be differences due to timing (the second
solution might allow some work to be done between when the left and
join messages are received), but I think it should behave reasonably
with either approach.

Interestingly, there is some old code in Pacemaker for handling when a
node left and rejoined but "the cluster layer didn't notice", that may
have been a workaround for this case.

> > 
> what would you prefer for PCMK) and @Chrissie.
> 
> Regards,
>    Honza
> 
> 
> > 
> > cheers,
> > Thomas
> > 
> > > > 
> > > > > Now it's really the cpg application's problem to synchronize its
> > > > > data. Many applications (usually FSes) are using quorum
> > > > > together with fencing to find out which cluster partition is
> > > > > quorate and to clean up the inquorate one.
> > > > > 
> > > > > Hopefully my explanation helps you, and feel free to ask more
> > > > > questions!
> > > > > 
> > > > 
> > > > They help, but I'm still a bit unsure about why the CB could
> > > > not happen here; I may need to dive a bit deeper into corosync :)
> > > > 
> > > > > Regards,
> > > > >    Honza
> > > > > 
> > > > > > 
> > > > > > help would be appreciated, much thanks!
> > > > > > 
> > > > > > cheers,
> > > > > > Thomas
> > > > > > 
> > > > > > [1]: https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD
> > > > > > [2]: https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096
> > > > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> > 
> 
> 
-- 
Ken Gaillot 


Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-03-12 Thread Thomas Lamprecht
Hi,

On 3/9/18 5:26 PM, Jan Friesse wrote:
> ...
> 
>> TotemConfchgCallback: ringid (1.1436)
>> active processors 3: 1 2 3
>> EXIT
>> Finalize  result is 1 (should be 1)
>>
>>
>> Hope I did both tests right, but as it reproduces multiple times
>> with testcpg and our cpg usage in our filesystem, this seems
>> properly tested, not just a single occurrence.
> 
> I've tested it too and yes, you are 100% right. The bug is there and it's pretty 
> easy to reproduce when the node with the lowest nodeid is paused. It's slightly 
> harder when the node with a higher nodeid is paused.

Good, so we're not crazy :)

> 
> Most of the clusters are using power fencing, so they simply never see this 
> problem. That may also be the reason why it wasn't reported a long time ago 
> (this bug has existed virtually at least since OpenAIS Whitetank). So really nice 
> work finding this bug.
> 

Hmm, but even short pauses (1 to 2 seconds) cause this, so fencing shouldn't
get active there yet.
We had a theory that environment changes let this bug trigger more often,
e.g. scheduler or IO subsystem changes in the kernel, as we saw a significant
rise in reports in recent years.
(We have grown in users too, but the increase feels like more than just correlation.)

> What I'm not entirely sure about is the best way to solve this problem. 
> What I am sure about is that it's going to be "fun" :(
> 
> Lets start with very high level of possible solutions:
> - "Ignore the problem". CPG behaves more or less correctly. "Current" 
> membership really didn't changed so it doesn't make too much sense to inform 
> about change. It's possible to use cpg_totem_confchg_fn_t to find out when 
> ringid changes. I'm adding this solution just for completeness, because I 
> don't prefer it at all.

Same here; I mean, we could work around this, but it does not really
feel right.
And our code is designed with the assumption that we get a membership
callback; changing that assumption seems like a bit of a headache, as
we need to verify that no side effects get introduced by the workaround
and that everything can cope with it. Doable, but also not too much fun :)

> - cpg_confchg_fn_t adds all nodes that left and joined back into the left/join lists

would work for us.

> - cpg will send an extra cpg_confchg_fn_t call about the left and joined nodes. I 
> would prefer this solution simply because it makes cpg behavior equal in all 
> situations.
> 

So the behaviour you assumed it should have? Getting two callbacks,
one saying that all others left and then one where all others joined
in the new membership?
This sounds like the best approach to me, as it really tells
the CPG application what happened in the way all other members
see it. But I'm not a corosync guru :)
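
(Just to illustrate what this means on the client side: with the second
option, a confchg handler could detect its own blackout with a rough check
like the sketch below; with the third option the same handler would simply
see an ordinary "all peers left" callback followed by an "all peers joined"
one. This is only an illustration with placeholder names, not our actual
dfsm.c code.)

#include <corosync/cpg.h>

/* Rough sketch: return 1 if every node/pid in the left list also appears in
 * the joined list of the same confchg event, i.e. all our peers "left and
 * rejoined" at once - meaning we were the node that blacked out. */
static int peers_left_and_rejoined(
    const struct cpg_address *left, size_t n_left,
    const struct cpg_address *joined, size_t n_joined)
{
    size_t i, j;

    if (n_left == 0 || n_left != n_joined)
        return 0;

    for (i = 0; i < n_left; i++) {
        int found = 0;
        for (j = 0; j < n_joined; j++) {
            if (left[i].nodeid == joined[j].nodeid &&
                left[i].pid == joined[j].pid) {
                found = 1;
                break;
            }
        }
        if (!found)
            return 0;
    }
    return 1;
}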

> Which of the options would you prefer? Same question also for @Ken (-> what 
> would you prefer for PCMK) and @Chrissie.
> 

The last approach.

cheers and much thanks for your help!
Thomas





Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-03-09 Thread Jan Friesse

Thomas,


Hi,

On 3/7/18 1:41 PM, Jan Friesse wrote:

Thomas,


First thanks for your answer!

On 3/7/18 11:16 AM, Jan Friesse wrote:


...


TotemConfchgCallback: ringid (1.1436)
active processors 3: 1 2 3
EXIT
Finalize  result is 1 (should be 1)


Hope I did both tests right, but as it reproduces multiple times
with testcpg and our cpg usage in our filesystem, this seems
properly tested, not just a single occurrence.


I've tested it too and yes, you are 100% right. The bug is there and it's 
pretty easy to reproduce when the node with the lowest nodeid is paused. It's 
slightly harder when the node with a higher nodeid is paused.


Most of the clusters are using power fencing, so they simply never see 
this problem. That may also be the reason why it wasn't reported a long 
time ago (this bug has existed virtually at least since OpenAIS Whitetank). 
So really nice work finding this bug.


What I'm not entirely sure about is the best way to solve this 
problem. What I am sure about is that it's going to be "fun" :(


Let's start with a very high-level view of possible solutions:
- "Ignore the problem". CPG behaves more or less correctly. The "current" 
membership really didn't change, so it doesn't make too much sense to 
inform about a change. It's possible to use cpg_totem_confchg_fn_t to find 
out when the ringid changes. I'm adding this solution just for completeness, 
because I don't prefer it at all.

- cpg_confchg_fn_t adds all nodes that left and joined back into the left/join lists
- cpg will send an extra cpg_confchg_fn_t call about the left and joined 
nodes. I would prefer this solution simply because it makes cpg behavior 
equal in all situations.


Which of the options would you prefer? Same question also for @Ken (-> 
what would you prefer for PCMK) and @Chrissie.


Regards,
  Honza




cheers,
Thomas




Now it's really the cpg application's problem to synchronize its data. Many 
applications (usually FSes) are using quorum together with fencing to find out 
which cluster partition is quorate and to clean up the inquorate one.

Hopefully my explanation helps you, and feel free to ask more questions!



They help, but I'm still a bit unsure about why the CB could not happen here;
I may need to dive a bit deeper into corosync :)


Regards,
   Honza



help would be appreciated, much thanks!

cheers,
Thomas

[1]: 
https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD
[2]: 
https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096
















Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-03-07 Thread Jan Friesse

Thomas,


First thanks for your answer!

On 3/7/18 11:16 AM, Jan Friesse wrote:

Thomas,



Hi,

first some background info for my questions I'm going to ask:
We use corosync as a basis for our distributed realtime configuration
file system (pmxcfs)[1].


nice



We got some reports of a completely hanging FS, with the only
correlations being high load, often IO, and most times a message that
corosync did not get scheduled for longer than the token timeout.

See this example from a three node cluster, first:


Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [MAIN  ] Corosync main 
process was not scheduled for 3767.3159 ms (threshold is 1320. ms). 
Consider token timeout increase.


then we get a few logs that JOIN or LEAVE messages were thrown away
(understandable for this event):

Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Corosync main 
process was not scheduled for 3767.3159 ms (threshold is 1320. ms). 
Consider token timeout increase.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] A new 
membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] Failed to 
receive the leave message. failed: 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] A new membership 
(192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] Failed to receive 
the leave message. failed: 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [QUORUM] Members[3]: 1 
2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [QUORUM] Members[3]: 1 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Completed service 
synchronization, ready to provide service.

Until recently we were really in the dark and had everything from
kernel bugs to our filesystem logic in mind as a possible cause...  But
then we had the luck to trigger this in our test systems and went to
town with gdb on the core dump, finding that we can trigger this by
pausing the leader (from our FS's POV) for a short moment (which may be
shorter than the token timeout), so that a new leader gets elected, and
then resuming our leader node VM again.

The problem I saw was that while the leader had a log entry which
proved that it noticed its blackout:

[TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 3 
left: 2 3


I know it looks weird but it's perfectly fine and expected.



It seemed OK from this node's POV; just the missing config change CB
was a bit odd to us.



our FS cpg_confchg_fn callback[2] was never called, thus it thought it


That shouldn't happen



So we really should get a config change CB on the paused node after
unpausing, with all other (online) nodes in both the leave and join member
lists?


Nope, cpg will take care to send two messages (one about the left node 
and a second about the joined node).



Just asking again to confirm my thinking and that I did not misunderstand
you. :)


was still in sync and nothing ever happened, until another member
triggered this callback, by either leaving or (re-)joining.

Looking in the cpg.c code I saw that there's another callback, namely
cpg_totem_confchg_fn, which seemed a bit odd as we did not set that


This callback is not necessary to have as long as the information about the cpg group 
is enough. cpg_totem_confchg_fn contains information about all processors 
(nodes).



OK, makes sense.


one... (I'm not the original author of the FS and it dates back at least
to 2010, so maybe cpg_initialize was not yet deprecated then, and
thus model_initialize was not used)




I switched over to using cpg_model_initialize and set the totem_confchg
callback, but for the "blacked out" node it gets called twice after the
event and always shows all members...

So to finally get to my questions:

* Why doesn't the cpg_confchg_fn CB get called when a node has a short
  blackout (i.e., corosync not being scheduled for a bit of time),
  having all other nodes in its leave and join lists, as the log
  would suggest ("Members joined: 2 3 left: 2 3")?


Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-03-07 Thread Thomas Lamprecht
First thanks for your answer!

On 3/7/18 11:16 AM, Jan Friesse wrote:
> Thomas,
> 
> 
>> Hi,
>>
>> first some background info for my questions I'm going to ask:
>> We use corosync as a basis for our distributed realtime configuration
>> file system (pmxcfs)[1].
> 
> nice
> 
>>
>> We got some reports of a completely hanging FS, with the only
>> correlations being high load, often IO, and most times a message that
>> corosync did not get scheduled for longer than the token timeout.
>>
>> See this example from a three node cluster, first:
>>
>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [MAIN  ] Corosync 
>>> main process was not scheduled for 3767.3159 ms (threshold is 1320. 
>>> ms). Consider token timeout increase.
>>
>> then we get a few logs that JOIN or LEAVE messages were thrown away
>> (understandable for this event):
>>
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [TOTEM ] JOIN or 
>> LEAVE message was thrown away during flush operation.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Corosync main 
>> process was not scheduled for 3767.3159 ms (threshold is 1320. ms). 
>> Consider token timeout increase.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
>> message was thrown away during flush operation.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
>> message was thrown away during flush operation.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
>> message was thrown away during flush operation.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
>> message was thrown away during flush operation.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
>> message was thrown away during flush operation.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] A new 
>> membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] Failed to 
>> receive the leave message. failed: 2 3
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] A new membership 
>> (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] Failed to receive 
>> the leave message. failed: 2 3
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [QUORUM] 
>> Members[3]: 1 2 3
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [MAIN  ] Completed 
>> service synchronization, ready to provide service.
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [QUORUM] Members[3]: 1 2 3
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Completed service 
>> synchronization, ready to provide service.
>>
>> Until recently we were really in the dark and had everything from
>> kernel bugs to our filesystem logic in mind as a possible cause...  But
>> then we had the luck to trigger this in our test systems and went to
>> town with gdb on the core dump, finding that we can trigger this by
>> pausing the leader (from our FS's POV) for a short moment (which may be
>> shorter than the token timeout), so that a new leader gets elected, and
>> then resuming our leader node VM again.
>>
>> The problem I saw was that while the leader had a log entry which
>> proved that it noticed its blackout:
>>> [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 
>>> 2 3 left: 2 3
> 
> I know it looks weird but it's perfectly fine and expected.
> 

It seemed OK from this node's POV; just the missing config change CB
was a bit odd to us.

>>
>> our FS cpg_confchg_fn callback[2] was never called, thus it thought it
> 
> That shouldn't happen
> 

So we really should get a config change CB on the paused node after
unpausing, with all other (online) nodes in both the leave and join member
lists?
Just asking again to confirm my thinking and that I did not misunderstand
you. :)

>> was still in sync and nothing ever happened, until another member
>> triggered this callback, by either leaving or (re-)joining.
>>
>> Looking in the cpg.c code I saw that there's another callback, namely
>> cpg_totem_confchg_fn, which seemed a bit odd as we did not set that
> 
> This callback is not necessary to have as long as the information about the cpg group 
> is enough. cpg_totem_confchg_fn contains information about all processors 
> (nodes).
> 

OK, makes sense.

>> one... (I'm not the original author of the FS and it dates back at least
>> to 2010, so maybe cpg_initialize was not yet deprecated then, and
>> thus model_initialize was not used)
> 
>>
>> I switched over to using cpg_model_initialize and set the totem_confchg
>> callback, but for the "blacked out" node it gets called twice after the
>> event and always shows all members...
>>
>> So to finally get to my questions:
>>
>> * Why doesn't the cpg_confchg_fn CB get called when a node has a short
>>   blackout (i.e., corosync not being scheduled 

Re: [ClusterLabs] corosync 2.4 CPG config change callback

2018-03-07 Thread Jan Friesse

Thomas,



Hi,

first some background info for my questions I'm going to ask:
We use corosync as a basis for our distributed realtime configuration
file system (pmxcfs)[1].


nice



We got some reports of a completely hanging FS, with the only
correlations being high load, often IO, and most times a message that
corosync did not get scheduled for longer than the token timeout.

See this example from a three node cluster, first:


Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [MAIN  ] Corosync main 
process was not scheduled for 3767.3159 ms (threshold is 1320. ms). 
Consider token timeout increase.


then we get a few logs that JOIN or LEAVE messages were thrown away
(understandable for this event):

Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Corosync main 
process was not scheduled for 3767.3159 ms (threshold is 1320. ms). 
Consider token timeout increase.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] A new 
membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] Failed to 
receive the leave message. failed: 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] A new membership 
(192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] Failed to receive 
the leave message. failed: 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [QUORUM] Members[3]: 1 
2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [QUORUM] Members[3]: 1 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Completed service 
synchronization, ready to provide service.

Until recently we were really in the dark and had everything from
kernel bugs to our filesystem logic in mind as a possible cause...  But
then we had the luck to trigger this in our test systems and went to
town with gdb on the core dump, finding that we can trigger this by
pausing the leader (from our FS's POV) for a short moment (which may be
shorter than the token timeout), so that a new leader gets elected, and
then resuming our leader node VM again.

The problem I saw was that while the leader had a log entry which
proved that it noticed its blackout:

[TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 3 
left: 2 3


I know it looks weird but it's perfectly fine and expected.



our FS cpg_confchg_fn callback[2] was never called, thus it thought it


That shouldn't happen


was still in sync and nothing ever happened, until another member
triggered this callback, by either leaving or (re-)joining.

Looking in the cpg.c code I saw that there's another callback, namely
cpg_totem_confchg_fn, which seemed a bit odd as we did not set that


This callback is not necessary to have as long as the information about the cpg 
group is enough. cpg_totem_confchg_fn contains information about all 
processors (nodes).
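
(For reference, the totem confchg callback is declared roughly like this in
corosync 2.x cpg.h - it carries the ring id plus the list of all processors:)

typedef void (*cpg_totem_confchg_fn_t) (
        cpg_handle_t handle,
        struct cpg_ring_id ring_id,        /* ring_id.nodeid and ring_id.seq */
        uint32_t member_list_entries,
        const uint32_t *member_list);      /* nodeids of all processors */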



one... (I'm not the original author of the FS and it dates back at least
to 2010, so maybe cpg_initialize was not yet deprecated then, and
thus model_initialize was not used)




I switched over to using cpg_model_initialize and set the totem_confchg
callback, but for the "blacked out" node it gets called twice after the
event and always shows all members...

So to finally get to my questions:

* Why doesn't the cpg_confchg_fn CB get called when a node has a short
  blackout (i.e., corosync not being scheduled for a bit of time),
  having all other nodes in its leave and join lists, as the log
  would suggest ("Members joined: 2 3 left: 2 3")?


I believe it was called but not when corosync was paused.



* If that doesn't seem like a good idea, what can we use to really
  detect such a node blackout?


It's not possible to detect this from the affected node; it must be 
detected from the other nodes.




As a workaround I added logic for when, through a config change, a node
with a lower ID joined. The node which was leader until then triggers
a CPG leave, enforcing a cluster-wide config change event to happen,
which this time the blacked-out node also gets and syncs 

[ClusterLabs] corosync 2.4 CPG config change callback

2018-03-07 Thread Thomas Lamprecht
Hi,

first some background info for my questions I'm going to ask:
We use corosync as a basis for our distributed realtime configuration
file system (pmxcfs)[1].

We got some reports of a completely hanging FS, with the only
correlations being high load, often IO, and most times a message that
corosync did not get scheduled for longer than the token timeout.

See this example from a three node cluster, first:

> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [MAIN  ] Corosync 
> main process was not scheduled for 3767.3159 ms (threshold is 1320. ms). 
> Consider token timeout increase.

then we get a few logs that JOIN or LEAVE messages were thrown away
(understandable for this event):

Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Corosync main 
process was not scheduled for 3767.3159 ms (threshold is 1320. ms). 
Consider token timeout increase.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE 
message was thrown away during flush operation.
Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] A new 
membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] Failed to 
receive the leave message. failed: 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] A new membership 
(192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] Failed to receive 
the leave message. failed: 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [QUORUM] Members[3]: 1 
2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [MAIN  ] Completed 
service synchronization, ready to provide service.
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [QUORUM] Members[3]: 1 2 3
Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Completed service 
synchronization, ready to provide service.

Until recently we were really in the dark and had everything from
kernel bugs to our filesystem logic in mind as a possible cause...  But
then we had the luck to trigger this in our test systems and went to
town with gdb on the core dump, finding that we can trigger this by
pausing the leader (from our FS's POV) for a short moment (which may be
shorter than the token timeout), so that a new leader gets elected, and
then resuming our leader node VM again.

The problem I saw was that while the leader had a log entry which
proved that it noticed its blackout:
> [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 
> 3 left: 2 3

our FS cpg_confchg_fn callback[2] was never called, thus it thought it
was still in sync and nothing ever happened, until another member
triggered this callback, by either leaving or (re-)joining.

Looking in the cpg.c code I saw that there's another callback, namely
cpg_totem_confchg_fn, which seemed a bit odd as we did not set that
one... (I'm not the original author of the FS and it dates back at least
to 2010, so maybe cpg_initialize was not yet deprecated then, and
thus model_initialize was not used)

I switched over to using cpg_model_initialize and set the totem_confchg
callback, but for the "blacked out" node it gets called twice after the
event and always shows all members...
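
(A minimal sketch of that registration, assuming the stock corosync 2.x CPG
API; the handler names below are placeholders for illustration, not our
actual pmxcfs code:)

#include <stdio.h>
#include <corosync/cpg.h>

static void deliver_cb(cpg_handle_t h, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid, void *msg, size_t len)
{
    /* application messages - not relevant for the membership question */
}

static void confchg_cb(cpg_handle_t h, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
    /* cpg group membership change: per-group left/joined lists */
    printf("confchg: members=%zu left=%zu joined=%zu\n",
           n_members, n_left, n_joined);
}

static void totem_confchg_cb(cpg_handle_t h, struct cpg_ring_id ring_id,
                             uint32_t n_members, const uint32_t *members)
{
    /* totem ring change: fires for every new ring, lists all processors */
    printf("totem confchg: ringid (%u.%llu) members=%u\n",
           ring_id.nodeid, (unsigned long long)ring_id.seq, n_members);
}

int main(void)
{
    cpg_handle_t handle;
    cpg_model_v1_data_t model = {
        .cpg_deliver_fn       = deliver_cb,
        .cpg_confchg_fn       = confchg_cb,
        .cpg_totem_confchg_fn = totem_confchg_cb,
    };
    struct cpg_name group = { .length = 4, .value = "test" };

    if (cpg_model_initialize(&handle, CPG_MODEL_V1,
                             (cpg_model_data_t *)&model, NULL) != CS_OK)
        return 1;
    if (cpg_join(handle, &group) != CS_OK)
        return 1;

    cpg_dispatch(handle, CS_DISPATCH_BLOCKING);   /* run callbacks */
    cpg_finalize(handle);
    return 0;
}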

So to finally get to my questions:

* Why doesn't the cpg_confchg_fn CB get called when a node has a short
  blackout (i.e., corosync not being scheduled for a bit of time),
  having all other nodes in its leave and join lists, as the log
  would suggest ("Members joined: 2 3 left: 2 3")?

* If that doesn't seem like a good idea, what can we use to really
  detect such a node blackout?

As a workaround I added logic for when, through a config change, a node
with a lower ID joined. The node which was leader until then triggers
a CPG leave, enforcing a cluster-wide config change event to happen,
which this time the blacked-out node also gets and then syncs up on again.
This works, but it does not feel really nice, IMO...
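
(Roughly, that workaround lives in our confchg callback and looks like the
sketch below; the leader/nodeid bookkeeping here is a simplified placeholder,
not the actual dfsm.c code:)

#include <corosync/cpg.h>

static uint32_t our_nodeid;     /* filled via cpg_local_get() at startup      */
static int we_are_leader;       /* leader == lowest nodeid in the member list */

static void confchg_cb(cpg_handle_t h, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
    size_t i;

    for (i = 0; i < n_joined; i++) {
        /* A node with a lower nodeid (re)appeared while we still think we are
         * the leader: we most likely had a blackout. Leave the group to force
         * a cluster-wide config change, and fully resync on rejoin. */
        if (we_are_leader && joined[i].nodeid < our_nodeid) {
            cpg_leave(h, group);
            break;
        }
    }

    /* ... normal membership bookkeeping and leader election ... */
}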

help would be appreciated, much thanks!

cheers,
Thomas

[1]: 
https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD
[2]: 
https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096
