Re: [DRBD-user] Split brain problem.

2011-12-04 Thread Digimer
On 12/04/2011 04:15 PM, Ivan Pavlenko wrote:
 handlers {
 pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
 /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
 reboot -f";
 pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
 /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
 reboot -f";
 local-io-error "/usr/lib/drbd/notify-io-error.sh;
 /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger
 ; halt -f";
 }

You need to configure DRBD to use fencing. The best way to do this when
using a Red Hat cluster is via Lon's obliterate-peer.sh script. You
can download a copy this way;

wget -c https://alteeve.com/files/an-cluster/sbin/obliterate-peer.sh -O
/sbin/obliterate-peer.sh
chmod a+x /sbin/obliterate-peer.sh

Then add this;

handlers {
fence-peer "/sbin/obliterate-peer.sh";
}
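Note that the fence-peer handler only fires if DRBD's fencing policy is
enabled as well. A minimal sketch (assuming your resource is named r0, as
it appears later in this thread):

```
resource r0 {
    disk {
        # Tell DRBD to call the fence-peer handler (and block I/O)
        # before taking over from a dead peer.
        fencing resource-and-stonith;
    }
    handlers {
        fence-peer "/sbin/obliterate-peer.sh";
    }
}
```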

 Here are my answers to your questions:
 
 1) It is definitely a split brain, not a network problem. As I showed in
 my previous message, I can ping the cluster members and their firewalls
 are open. When I use telnet and a sniffer I see the nodes try to
 establish a connection, but they only send reject packets.

Indeed.

 Dec  2 10:04:00 infplsm018 kern.alert kernel: block drbd1: Split-Brain
 detected but unresolved, dropping connection!

You will need to manually recover from this split brain. See;

http://www.drbd.org/users-guide/s-resolve-split-brain.html
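In short, the procedure on that page boils down to this (assuming
resource r0, and that anything holding the device open, like a mounted
filesystem or a cluster service, is stopped first so the node can be
demoted);

```shell
# On the split-brain victim (the node whose changes you will discard):
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

# On the survivor, if it has dropped to StandAlone:
drbdadm connect r0
```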

 3) And here is my /etc/cluster/cluster.conf file:
 
 <fencedevice agent="fence_null" name="nullfence"/>
 <fencedevice agent="fence_manual" name="manfence"/>

These are not effective or supported. You need to use real fence
devices, especially so when using shared storage in a cluster. Without
proper fencing in place, asking what caused your split-brain in this
case is largely meaningless.
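For example, an IPMI-based fence device would look something like this
(the address and credentials here are hypothetical, adjust for your
hardware);

```
<fencedevice agent="fence_ipmilan" name="ipmi1" ipaddr="10.20.0.1"
 login="admin" passwd="secret"/>
```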

Once you have this set up, tested and working, then the next time DRBD
would have split-brained, it will fence instead. At that point you can
sort out what is breaking your cluster, but that is another thread.

-- 
Digimer
E-Mail:  digi...@alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin:   http://nodeassassin.org
omg my singularity battery is dead again.
stupid hawking radiation. - epitron
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Split brain problem.

2011-12-04 Thread Ivan Pavlenko

Hi ALL,

Digimer, thank you again for your answer, I really appreciate it!
Unfortunately, I've tried to fix the split brain manually several times.
It doesn't work.


# drbdadm disconnect r0
[root@infplsm017 ~]# drbdadm secondary r0
1: State change failed: (-12) Device is held open by someone
Command 'drbdsetup 1 secondary' terminated with exit code 11
# drbdadm -- --discard-my-data connect r0
1: Failure: (123) --discard-my-data not allowed when primary.
Command 'drbdsetup 1 net 10.10.24.10:7789 10.10.24.11:7789 C 
--set-defaults --create-device --ping-timeout=20 
--after-sb-2pri=disconnect --after-sb-1pri=discard-secondary 
--after-sb-0pri=discard-zero-changes --allow-two-primaries 
--discard-my-data' terminated with exit code 10

#

I guess I need to stop cluster daemons, don't I?

Thank you again,
Ivan




Re: [DRBD-user] Split brain problem.

2011-12-04 Thread Digimer
On 12/04/2011 09:25 PM, Ivan Pavlenko wrote:
 [...]
 # drbdadm secondary r0
 1: State change failed: (-12) Device is held open by someone
 [...]

Something is, as the error indicates, still trying to use the DRBD
resource. Find it, stop it, and then you can demote the resource. Look
at the 'lsof' command, that will probably help you find the program
still using it.
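For example (using /dev/drbd1, the device from your logs);

```shell
# See what has the DRBD device (or a filesystem on it) open
lsof /dev/drbd1
fuser -vm /dev/drbd1
```

If it is mounted, unmount it; if a cluster service still holds it, stop
that service first.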



Re: [DRBD-user] Split brain problem.

2011-12-04 Thread Ivan Pavlenko

It's not that easy.

# lsof | grep drbd
drbd1_wor  3414  root  cwd  DIR      253,0  4096  2  /
drbd1_wor  3414  root  rtd  DIR      253,0  4096  2  /
drbd1_wor  3414  root  txt  unknown                  /proc/3414/exe


# ps aux | grep 3414
root      3414  0.0  0.0      0     0 ?      S    Dec02  0:00 [drbd1_worker]
root      4690  0.0  0.0  61232   744 pts/1  R+   14:00  0:00 grep 3414
# lsof -p 3414
COMMANDPID USER   FD  TYPE DEVICE SIZE NODE NAME
drbd1_wor 3414 root  cwd   DIR  253,0 40962 /
drbd1_wor 3414 root  rtd   DIR  253,0 40962 /
drbd1_wor 3414 root  txt   unknown  /proc/3414/exe

kill -9 3414 doesn't do anything.  I even tried to restart two nodes 
simultaneously - no luck.
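As far as I can tell, kill -9 failing here is expected: drbd1_worker is
a kernel thread (ps shows it in [brackets], and /proc/3414/exe has no
target), and kernel threads ignore signals. A quick check, using the PID
from the lsof output above:

```shell
# Kernel threads have no userspace executable, so /proc/PID/exe does
# not resolve; kill -9 has no effect on them. They go away only when
# the DRBD module releases the device (drbdadm down), not when killed.
pid=3414
if readlink "/proc/$pid/exe" >/dev/null 2>&1; then
    echo "userspace process"
else
    echo "kernel thread (or no such PID)"
fi
```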


Ivan.
