Thomas,

Hi,

On 3/7/18 1:41 PM, Jan Friesse wrote:
Thomas,

First, thanks for your answer!

On 3/7/18 11:16 AM, Jan Friesse wrote:

...

TotemConfchgCallback: ringid (1.1436)
active processors 3: 1 2 3
EXIT
Finalize  result is 1 (should be 1)


I hope I ran both tests correctly, but since it reproduces multiple times
with testcpg and with the cpg usage in our filesystem, this looks like a
valid result, not just a single occurrence.

I've tested it too and yes, you are 100% right. The bug is there, and it's pretty easy to reproduce when the node with the lowest nodeid is paused. It's slightly harder when a node with a higher nodeid is paused.

Most clusters use power fencing, so they simply never see this problem. That may also be the reason why it wasn't reported long ago (this bug has existed since at least OpenAIS Whitetank). So really nice work finding this bug.

What I'm not entirely sure about is the best way to solve this problem. What I am sure of is that it's going to be "fun" :(

Let's start with a very high-level list of possible solutions:
- "Ignore the problem". CPG behaves more or less correctly. The "current" membership really didn't change, so it doesn't make much sense to report a change. It's possible to use cpg_totem_confchg_fn_t to find out when the ringid changes. I'm adding this solution just for completeness, because I don't prefer it at all.
- cpg_confchg_fn_t adds all nodes that left and rejoined to the left/joined lists.
- cpg sends an extra cpg_confchg_fn_t call about the left and joined nodes. I would prefer this solution simply because it makes cpg behavior the same in all situations.
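
To make the first option a bit more concrete, here is a minimal sketch in Python rather than against the real libcpg API. It models an application that remembers the ring id and member list from each totem configuration change (the information cpg_totem_confchg_fn_t delivers) and flags the case from this thread: the ring id changed, but the member list looks identical, so no left/joined nodes would be reported. The RingWatcher class and its method name are hypothetical, and the ring id is modelled as a (nodeid, seq) tuple purely for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

# Hypothetical model of a ring id such as "ringid (1.1436)": (nodeid, seq).
RingId = Tuple[int, int]


@dataclass
class RingWatcher:
    """Hypothetical helper that tracks ring id and members across callbacks."""

    ring_id: Optional[RingId] = None
    members: FrozenSet[int] = frozenset()

    def on_totem_confchg(
        self, ring_id: RingId, member_list: List[int]
    ) -> Tuple[List[int], List[int], bool]:
        """Return (left, joined, silent_change) for one ring change."""
        new_members = frozenset(member_list)
        first_call = self.ring_id is None
        left = self.members - new_members
        joined = new_members - self.members
        # Ring id changed while the member list stayed the same: the
        # application saw no left/joined nodes and may need to resync.
        silent_change = (
            not first_call
            and ring_id != self.ring_id
            and not left
            and not joined
        )
        self.ring_id, self.members = ring_id, new_members
        return sorted(left), sorted(joined), silent_change
```

An application using something like this would treat silent_change as "assume every other node left and rejoined" and rerun its own state synchronization, which is roughly what the second and third options would have cpg report by itself.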

Which of the options would you prefer? Same question for @Ken (-> what would you prefer for PCMK) and @Chrissie.

Regards,
  Honza



cheers,
Thomas


Now it's really the cpg application's problem to synchronize its data. Many
applications (usually FS) use quorum together with fencing to find out
which cluster partition is quorate and to clean up the inquorate one.
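
As a rough illustration of the quorum part only, a partition is usually considered quorate when it holds a strict majority of the expected votes. The sketch below assumes plain majority voting with no special tie-breaking options; the function name is_quorate and its parameters are hypothetical and not part of any corosync API.

```python
def is_quorate(partition_votes: int, expected_votes: int) -> bool:
    """Hypothetical check: True if the partition holds a strict majority."""
    # Strict majority: an even split (e.g. 2 of 4) is not quorate.
    return 2 * partition_votes > expected_votes
```

In a three-node cluster with one vote per node, the two-node side of a split is quorate and the single node is not; fencing is then what actually cleans up the inquorate side.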

Hopefully my explanation helps, and feel free to ask more questions!


They help, but I'm still a bit unsure about why the CB could not happen here;
I may need to dive a bit deeper into corosync :)

Regards,
   Honza


Help would be appreciated, many thanks!

cheers,
Thomas

[1]: https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD
[2]: https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096










_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
