Re: [ClusterLabs] corosync 2.4 CPG config change callback

Jan Friesse Fri, 09 Mar 2018 08:26:35 -0800

Thomas,

Hi,


On 3/7/18 1:41 PM, Jan Friesse wrote:

Thomas,

First thanks for your answer!

On 3/7/18 11:16 AM, Jan Friesse wrote:

...

TotemConfchgCallback: ringid (1.1436)
active processors 3: 1 2 3
EXIT
Finalize  result is 1 (should be 1)


I hope I ran both tests correctly, but since it reproduces multiple times
with testcpg and with our cpg usage in our filesystem, this seems
validly tested and not just a single occurrence.

I've tested it too and yes, you are 100% right. The bug is there and it's pretty easy to reproduce when the node with the lowest nodeid is paused. It's slightly harder when a node with a higher nodeid is paused.

Most clusters use power fencing, so they simply never see this problem. That may also be the reason why it wasn't reported long ago (this bug has existed virtually at least since OpenAIS Whitetank). So really nice work finding this bug.

What I'm not entirely sure about is the best way to solve this problem. What I am sure of is that it's going to be "fun" :(


Let's start with a very high-level list of possible solutions:

- "Ignore the problem". CPG behaves more or less correctly. The "current" membership really didn't change, so it doesn't make much sense to inform about a change. It's possible to use cpg_totem_confchg_fn_t to find out when the ringid changes. I'm adding this solution just for completeness, because I don't prefer it at all.

- cpg_confchg_fn_t adds all nodes that left and then joined back to its left/joined lists.

- cpg sends an extra cpg_confchg_fn_t call about the left and joined nodes. I would prefer this solution simply because it makes cpg behavior the same in all situations.
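
To make the first option more concrete, here is a minimal sketch of what an application could do with the ring id delivered through cpg_totem_confchg_fn_t. It is a pure-Python model and not the corosync API: RingWatcher and on_totem_confchg are hypothetical names, and the ring id is treated as an opaque (nodeid, seq) pair. It only shows the idea of flagging a ring change whose member list looks identical to the previous one, which is the case where no cpg_confchg_fn_t call arrives.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Tuple

# Opaque ring identifier modelled as (nodeid, seq); an assumption for this sketch.
RingId = Tuple[int, int]


@dataclass
class RingWatcher:
    """Hypothetical helper that remembers the last ring id and member list."""

    last_ring: Optional[RingId] = None
    last_members: FrozenSet[int] = frozenset()

    def on_totem_confchg(self, ring_id: RingId, members: Iterable[int]) -> bool:
        """Record the new ring and return True for a 'silent' re-formation.

        A silent re-formation is a ring id change where the member list is
        the same as before, so the application may want to resynchronize
        its state even though the membership looks unchanged.
        """
        current = frozenset(members)
        silent = (
            self.last_ring is not None
            and ring_id != self.last_ring
            and current == self.last_members
        )
        self.last_ring = ring_id
        self.last_members = current
        return silent


watcher = RingWatcher()
first = watcher.on_totem_confchg((1, 1436), [1, 2, 3])   # initial ring
second = watcher.on_totem_confchg((1, 1440), [1, 2, 3])  # new ring, same members
third = watcher.on_totem_confchg((1, 1444), [1, 2])      # real membership change
```

In this model only the second call is flagged; the third is an ordinary membership change that the normal confchg callback would already report.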

Which of the options would you prefer? The same question goes to @Ken (-> what would you prefer for PCMK) and @Chrissie.


Regards,
  Honza


cheers,
Thomas

Now it's really the cpg application's problem to synchronize its data. Many
applications (usually filesystems) use quorum together with fencing to find out
which cluster partition is quorate and to clean the inquorate one.

Hopefully my explanation helps, and feel free to ask more questions!


They help, but I'm still a bit unsure about why the callback could not happen here;
I may need to dive a bit deeper into corosync :)

Regards,
   Honza


Help would be appreciated, many thanks!

cheers,
Thomas

[1]: https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD
[2]: https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096


_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org