Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Alexandre DERUMIER Thu, 17 Sep 2020 03:03:27 -0700

If needed, here is my test script to reproduce it.

node1 (restart corosync until node2 doesn't send the timestamp anymore)
-----

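#!/bin/bash
# Restart corosync repeatedly; stop once node2's timestamp goes stale.

for i in $(seq 10000); do
    now=$(date +"%T")
    echo "restart corosync : $now"
    systemctl restart corosync
    # Watch the timestamp for about a minute before the next restart.
    for j in {1..59}; do
        last=$(cat /tmp/timestamp)
        curr=$(date '+%s')
        diff=$((curr - last))
        # No update for more than 20 seconds: node2 has presumably blocked.
        if [ "$diff" -gt 20 ]; then
            echo "too old"
            exit 0
        fi
        sleep 1
    done
done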
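
The staleness test in node1's loop can also be pulled into a small function, which makes it easier to try out without restarting corosync. This is only a sketch: `is_stale` and its arguments are hypothetical names that are not part of the original script, and the default limit of 20 seconds is simply the threshold used above.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original scripts.
# is_stale LAST NOW [LIMIT]
# Succeeds if LAST is more than LIMIT seconds (default 20) before NOW.
is_stale() {
    last=$1
    now=$2
    limit=${3:-20}
    [ $((now - last)) -gt "$limit" ]
}

# Example with fixed epoch values, so it does not depend on the clock.
if is_stale 1600336000 1600336025; then
    echo "too old"
fi
```

In node1's loop it could be called as `is_stale "$(cat /tmp/timestamp)" "$(date '+%s')"`.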

node2 (write to /etc/pve/test each second, then send the last timestamp to 
node1)
-----
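#!/bin/bash
# Write to /etc/pve once per second and report the time of each pass to node1.
for i in {1..10000}; do
    now=$(date +"%T")
    echo "Current time : $now"
    curr=$(date '+%s')
    ssh root@node1 "echo $curr > /tmp/timestamp"
    # If this write blocks, the loop stalls and the timestamp stops updating.
    echo "test" > /etc/pve/test
    sleep 1
done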

----- Original message -----
From: "aderumier" <[email protected]>
To: "Proxmox VE development discussion" <[email protected]>
Cc: "Thomas Lamprecht" <[email protected]>
Sent: Thursday, 17 September 2020 11:59:32
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Thanks for the update. 

>> if we can't reproduce it, we'll have to send you patches/patched debs with 
>> increased logging to narrow down what is going on. if we can, then we 
>> can hopefully find and fix the issue fast. 

No problem, I can install the patched deb if needed. 

----- Original message -----
From: "Fabian Grünbichler" <[email protected]>
To: "Proxmox VE development discussion" <[email protected]>, "Thomas 
Lamprecht" <[email protected]>
Sent: Thursday, 17 September 2020 11:21:45
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On September 16, 2020 5:17 pm, Alexandre DERUMIER wrote: 
> I have reproduced it again, with the coredump this time 
> 
> 
> restart corosync : 17:05:27 
> 
> http://odisoweb1.odiso.net/pmxcfs-corosync2.log 
> 
> 
> bt full 
> 
> https://gist.github.com/aderumier/466dcc4aedb795aaf0f308de0d1c652b 
> 
> 
> coredump 
> 
> 
> http://odisoweb1.odiso.net/core.7761.gz 

just a short update on this: 

dcdb is stuck in START_SYNC mode, but nodeid 13 hasn't sent a STATE msg 
(yet). this looks like either the START_SYNC message to node 13 or the 
STATE response from it got lost or was processed incorrectly. until the 
mode switches to SYNCED (after all states have been received and the state 
update went through), regular/normal messages can be sent, but the 
incoming normal messages are queued and not processed. this is why the 
fuse access blocks: it sends the request out, but the response ends up 
in the queue. 
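
to make the queueing behaviour described above concrete, here is a toy model in shell. it is only a sketch and not pmxcfs code: the names `deliver` and `enter_synced` and the counters are hypothetical, and the real dfsm is written in C and has more to it than the two modes shown here.

```shell
#!/bin/bash
# Toy model of the behaviour described above; not pmxcfs code.
# All names (mode, queued, processed, deliver, enter_synced) are hypothetical.

mode="START_SYNC"   # a sync round has started but not finished
queued=0            # incoming normal messages held back so far
processed=0         # incoming normal messages actually handled

# Handle one incoming normal message.
deliver() {
    if [ "$mode" = "SYNCED" ]; then
        processed=$((processed + 1))
    else
        queued=$((queued + 1))   # parked until the sync completes
    fi
}

# Stands in for the point where all STATE messages have been received
# and the state update went through.
enter_synced() {
    mode="SYNCED"
    processed=$((processed + queued))   # drain the backlog
    queued=0
}

deliver   # e.g. the response to a fuse request arrives during the sync
echo "mode=$mode queued=$queued processed=$processed"
# prints: mode=START_SYNC queued=1 processed=0
```

in this model the message is only handled once `enter_synced` runs. if one node's STATE message never arrives, that call never happens and the response stays queued, which matches the blocked fuse access seen here.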

status (the other thing running on top of dfsm) got correctly synced up 
at the same time, so it's either a dcdb specific bug, or just bad luck 
that one was affected and the other wasn't. 

unfortunately even with debug enabled the logs don't contain much 
information that would help (e.g., we don't log sending/receiving STATE 
messages except when they look 'wrong'), so Thomas is trying to 
reproduce this using your scenario here to improve turnaround time. if 
we can't reproduce it, we'll have to send you patches/patched debs with 
increased logging to narrow down what is going on. if we can, then we 
can hopefully find and fix the issue fast. 

_______________________________________________ 
pve-devel mailing list 
[email protected] 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 
