Re: [DRBD-user] Split brain problem.
On 12/04/2011 04:15 PM, Ivan Pavlenko wrote:
> handlers {
>     pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>     pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>     local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
> }

You need to configure DRBD to use fencing. The best way to do this when using a Red Hat cluster is via Lon's obliterate-peer.sh script. You can download a copy this way:

  wget -c https://alteeve.com/files/an-cluster/sbin/obliterate-peer.sh -O /sbin/obliterate-peer.sh
  chmod a+x /sbin/obliterate-peer.sh

Then add this:

  handlers {
      fence-peer "/sbin/obliterate-peer.sh";
  }

> Here are my answers to your questions:
>
> 1) It is definitely split brain, not a network problem. As I showed in my
> previous message, I can ping the cluster members and the firewall is open
> between them. When I watch with telnet and a sniffer, I can see the nodes
> trying to establish a connection, but they only send reject packets.

Indeed:

  Dec 2 10:04:00 infplsm018 kern.alert kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!

You will need to manually recover from this split brain. See:

  http://www.drbd.org/users-guide/s-resolve-split-brain.html

> 3) And here is my /etc/cluster/cluster.conf file:
>
>   <fencedevice agent="fence_null" name="nullfence"/>
>   <fencedevice agent="fence_manual" name="manfence"/>

These are neither effective nor supported. You need to use real fence devices, all the more so when the cluster uses shared storage. Without proper fencing, working out what caused this particular split-brain is largely meaningless. Once you have fencing set up, tested and working, then the next time DRBD would have split-brained, it will fence instead. At that point you can sort out what is breaking your cluster, but that is another thread.
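For the archives, the manual recovery described in that guide boils down to picking one node as the split-brain victim and throwing away its changes. A rough sketch, assuming the resource is r0 and nothing is holding the device open on the victim:

```
# On the node whose changes will be DISCARDED (the split-brain victim).
# Anything using the device (a filesystem mount, clvmd, rgmanager
# services) must be stopped first, or the demotion fails with
# "Device is held open by someone".
drbdadm disconnect r0
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

# On the surviving node (its data is kept); only needed if it shows
# StandAlone rather than WFConnection in /proc/drbd:
drbdadm connect r0
```

After reconnecting, the victim resyncs from the survivor, so any writes it accepted during the split brain are lost.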
--
Digimer
E-Mail:              digi...@alteeve.com
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"omg my singularity battery is dead again. stupid hawking radiation." - epitron

___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
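As a point of comparison with the fence_null/fence_manual entries criticized above, a real fence device stanza in cluster.conf might look like the following IPMI example. The names, addresses, and credentials here are invented for illustration:

```
<fencedevices>
        <fencedevice agent="fence_ipmilan" name="ipmi_017" ipaddr="10.10.24.201" login="admin" passwd="secret"/>
        <fencedevice agent="fence_ipmilan" name="ipmi_018" ipaddr="10.10.24.202" login="admin" passwd="secret"/>
</fencedevices>
```

Each clusternode then references its device in a fence method block, so the surviving node can power-cycle its peer out-of-band instead of merely hoping it is gone.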
Re: [DRBD-user] Split brain problem.
Hi all,

Digimer, thank you again for your answer, I really appreciate it! Unfortunately, I have tried to fix the split brain manually several times, and it doesn't work:

  # drbdadm disconnect r0
  [root@infplsm017 ~]# drbdadm secondary r0
  1: State change failed: (-12) Device is held open by someone
  Command 'drbdsetup 1 secondary' terminated with exit code 11
  # drbdadm -- --discard-my-data connect r0
  1: Failure: (123) --discard-my-data not allowed when primary.
  Command 'drbdsetup 1 net 10.10.24.10:7789 10.10.24.11:7789 C --set-defaults --create-device --ping-timeout=20 --after-sb-2pri=disconnect --after-sb-1pri=discard-secondary --after-sb-0pri=discard-zero-changes --allow-two-primaries --discard-my-data' terminated with exit code 10
  #

I guess I need to stop the cluster daemons, don't I?

Thank you again,
Ivan

On 12/05/2011 12:21 PM, Digimer wrote:
> [earlier message quoted in full; trimmed]
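To Ivan's closing question: yes, the layers above DRBD have to come down first. A typical teardown order on a RHEL cluster, sketched from memory (service names vary by release, and the mount point is hypothetical):

```
# Stop whatever sits on top of the DRBD device, top down.
service rgmanager stop        # managed services and their mounts
umount /mnt/gfs2              # any remaining mounts on /dev/drbd1 (example path)
service clvmd stop            # clustered LVM, if LVs live on the DRBD device
drbdadm secondary r0          # the demotion should now succeed
```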
Re: [DRBD-user] Split brain problem.
On 12/04/2011 09:25 PM, Ivan Pavlenko wrote:
> [quoted transcript trimmed]
>
> I guess I need to stop the cluster daemons, don't I?

Something is, as the error indicates, still holding the DRBD resource open. Find it, stop it, and then you can demote the resource. Look at the 'lsof' command; it will probably help you find the program still using the device.
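Concretely, assuming the resource is minor 1 (/dev/drbd1), any of these should point at the holder:

```
fuser -vm /dev/drbd1          # processes using the device or a mount on it
lsof /dev/drbd1               # userspace open file handles on the device node
ls /sys/block/drbd1/holders/  # kernel-side stack, e.g. a device-mapper/LVM device on top
```

Note that lsof only sees userspace opens; if the holders/ directory is non-empty, the device is claimed by another kernel layer that lsof will never show.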
Re: [DRBD-user] Split brain problem.
It's not that easy:

  # lsof | grep drbd
  drbd1_wor  3414 root cwd  DIR     253,0 4096 2 /
  drbd1_wor  3414 root rtd  DIR     253,0 4096 2 /
  drbd1_wor  3414 root txt  unknown             /proc/3414/exe
  # ps aux | grep 3414
  root  3414  0.0  0.0     0    0 ?      S   Dec02 0:00 [drbd1_worker]
  root  4690  0.0  0.0 61232  744 pts/1  R+  14:00 0:00 grep 3414
  # lsof -p 3414
  COMMAND    PID USER FD   TYPE    DEVICE SIZE NODE NAME
  drbd1_wor 3414 root cwd  DIR      253,0 4096    2 /
  drbd1_wor 3414 root rtd  DIR      253,0 4096    2 /
  drbd1_wor 3414 root txt  unknown              /proc/3414/exe

'kill -9 3414' doesn't do anything. I even tried restarting both nodes simultaneously - no luck.

Ivan.

On 12/05/2011 01:50 PM, Digimer wrote:
> [earlier message quoted in full; trimmed]
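That lsof output is actually the clue: drbd1_worker is a DRBD kernel thread, not a userspace process, which is why /proc/3414/exe shows as "unknown" and kill -9 is silently ignored - signals delivered to kernel threads are discarded. Kernel threads are easy to spot in ps because their command field is printed in square brackets; a tiny illustrative check (the helper function name is made up):

```shell
# Kernel threads have no userspace binary behind them; ps prints their
# name in square brackets, e.g. [drbd1_worker]. This helper just tests
# whether a ps command field looks like a kernel thread.
is_kernel_thread() {
    case "$1" in
        \[*\]) return 0 ;;   # bracketed command field: kernel thread
        *)     return 1 ;;   # anything else: ordinary process
    esac
}

is_kernel_thread "[drbd1_worker]" && echo "kernel thread - kill -9 will not work"
is_kernel_thread "/usr/sbin/sshd" || echo "userspace process"
```

The worker thread itself is not what holds the device open, either; the open reference comes from whatever is stacked on /dev/drbd1 (a mount, device-mapper, or the resource still being primary), so the fix is to tear that down rather than try to kill the thread.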