Re: [Veritas-ha] IPMultiNICB, mpathd and network outages
On Mon, Oct 20, 2008 at 10:37:08AM -0400, Paul Robertson wrote:
> We recently experienced a Cisco network issue which prevented all
> nodes in that subnet from accessing the default gateway for about a
> minute.
>
> The Solaris nodes which run probe-based IPMP reported that all
> interfaces had failed because they were unable to ping the default
> gateway; however, they came back within seconds once the network
> issue was resolved. Fine.
>
> Unfortunately, our VCS nodes initiated an offline of the service
> group after the IPMultiNICB resources detected the IPMP fault. Since
> the service group offline/online takes several minutes, the outage
> on these nodes was more painful. Furthermore, since the peer cluster
> nodes in the same subnet were also experiencing the same mpathd
> fault, there would have been little advantage to failing over the
> service group to another node.
>
> We would like to find a way to configure VCS so that the service
> group does not offline (and any dependent resources within the
> service group are not offlined) in the event of an mpathd (i.e.
> IPMultiNICB) failure. In looking through the documentation, it seems
> that the closest we can come is to increase the IPMultiNICB
> ToleranceLimit from "1" to a huge value.

I've been bitten by this before and found the problem was caused by
spanning tree recalculations. The way I got around it was to disable
probe-based fault detection and use link-based detection. Whilst
probe-based detection monitors both L2 and L3 connectivity, we found
it to be too fragile and were willing to accept the risk of monitoring
only L2. Solaris 10 IPMP natively supports link-based detection, but
unfortunately with Solaris 8 & 9 you have to disable IPMP altogether
and rely on the MultiNICB agent.
Solaris 8 & 9:

# main.cf:
MultiNICB multinicb (
    UseMpathd = 0
    LinkTestRatio = 0
    IgnoreLinkStatus = 0
    Device @server1 = { ce0 = 0, ce4 = 1 }
    Device @server2 = { ce0 = 0, ce4 = 1 }
)

# /etc/hostname.ce0:
server1-ce0 netmask + broadcast + deprecated -failover up \
addif server1 netmask + broadcast + failover up

# /etc/hostname.ce4:
server1-ce4 netmask + broadcast + deprecated -failover standby up

Solaris 10:

# main.cf:
MultiNICB multinicb (
    UseMpathd = 1
    MpathdCommand = "/usr/lib/inet/in.mpathd -a"
    ConfigCheck = 0
    GroupName = ipmp0
    Device @server1 = { nxge0 = 0, nxge4 = 1 }
    Device @server2 = { nxge0 = 0, nxge4 = 1 }
)

# /etc/hostname.nxge0:
server1 netmask + broadcast + group ipmp0 up

# /etc/hostname.nxge4:
group ipmp0 standby up

--
Jason Fortezzo
[EMAIL PROTECTED]

___
Veritas-ha maillist - Veritas-ha@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-ha
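[Editorial note] For sites that stay on probe-based detection rather
than switching to link-based as above, in.mpathd's probe behaviour can
also be tuned in /etc/default/mpathd. The keys below are the standard
Solaris tunables, shown with their documented default values; adjust
to taste rather than copying verbatim:

```shell
# /etc/default/mpathd -- in.mpathd tunables (defaults shown)

# Time in milliseconds taken to declare an interface failed;
# repair detection takes roughly twice this long.
FAILURE_DETECTION_TIME=10000

# Automatically fail addresses back to a repaired interface.
FAILBACK=yes

# Only monitor interfaces that are configured into an IPMP group.
TRACK_INTERFACES_ONLY_WITH_GROUPS=yes
```

Raising FAILURE_DETECTION_TIME above the duration of a typical
spanning-tree recalculation is one way to ride out short gateway
outages without abandoning L3 probing.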
Re: [Veritas-ha] IPMultiNICB, mpathd and network outages
Yes, always have two interconnect switches, for sure. Also, if you
switch the interconnect to UDP for Oracle, it will still work with a
few modifications using two switches. For me, having two switches for
the public network solved most networking blunders. It is "OK",
cluster-wise, if the DG (default gateway) goes away. Traffic just
stops; big deal. That is the network team's issue. Have lived through
it many times.

C

On Mon, Oct 20, 2008 at 9:52 AM, Jim Senicka <[EMAIL PROTECTED]> wrote:
> I would be more concerned about future failures being handled
> properly. If you were able to take out all networks from all nodes
> at the same time, you have a SPOF. If this was a one-time
> maintenance upgrade to your network gear and not a normal event,
> setting VCS to not respond to network events means that future cable
> or port issues will not be handled.
>
> If it is a common occurrence for all networks to be lost, perhaps
> you need to address the network issues :-)
--
"You can't wait for inspiration. You have to go after it with a stick"
- Jack London
Re: [Veritas-ha] IPMultiNICB, mpathd and network outages
If you only have one router in your subnet (which is very common),
the router can be a SPOF for IPMP. Here's how you can set up
additional targets for IPMP to probe so that the router is not a SPOF.

How to Manually Specify Target Systems for Probe-Based Failure
Detection:

1. Log in with your user account to the system where you are
   configuring probe-based failure detection.

2. Add a route to a particular host to be used as a target in
   probe-based failure detection:

   $ route add -host destination-IP gateway-IP -static

   Replace destination-IP and gateway-IP with the IPv4 address of the
   host to be used as a target. For example, you would type the
   following to specify the target system 192.168.85.137, which is on
   the same subnet as the interfaces in IPMP group testgroup1:

   $ route add -host 192.168.85.137 192.168.85.137 -static

3. Add routes to additional hosts on the network to be used as target
   systems.

Taken from:
http://docs.sun.com/app/docs/doc/816-4554/etmkd?l=en&a=view

This should probably be a best practice. However, one can argue that
if the router is down then the cluster is useless anyway.

Sandeep

-Original Message-
From: Jim Senicka
Sent: Monday, October 20, 2008 9:53 AM
To: DeMontier, Frank; Paul Robertson; veritas-ha@mailman.eng.auburn.edu
Subject: Re: [Veritas-ha] IPMultiNICB, mpathd and network outages

I would be more concerned about future failures being handled
properly. If you were able to take out all networks from all nodes at
the same time, you have a SPOF. If this was a one-time maintenance
upgrade to your network gear and not a normal event, setting VCS to
not respond to network events means that future cable or port issues
will not be handled.
If it is a common occurrence for all networks to be lost, perhaps you
need to address the network issues :-)
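[Editorial note] Sandeep's probe-target procedure can be scripted for
several targets at once. The addresses below are placeholders, and the
-p flag (which records the route in /etc/inet/static_routes so it
survives a reboot) assumes Solaris 10's route(1M); on Solaris 8/9,
drop -p and add the routes from an init script instead:

```shell
#!/bin/sh
# Add static host routes so in.mpathd has probe targets besides the
# default router.  The target addresses are hypothetical -- pick a
# few always-up hosts on the local subnet.
for target in 192.168.9.10 192.168.9.11 192.168.9.12; do
    route -p add -host "$target" "$target" -static
done
```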
Re: [Veritas-ha] IPMultiNICB, mpathd and network outages
Buddy,

Thanks for the info. The "FaultPropagation=0" idea is a good one, but
it applies at the service group level and cannot be set only for
specified resources or resource types. It's probably better than my
ToleranceLimit approach, since it results in a resfault and will
therefore run our resfault trigger to send SMTP notification. If there
were an equivalent parameter that applied at the resource level, I'd
have exactly what I was looking for.

Cheers,

Paul

On Mon, Oct 20, 2008 at 11:09 AM, DeMontier, Frank <[EMAIL PROTECTED]> wrote:
> FaultPropagation=0 should do it.
>
> Buddy DeMontier
> State Street Global Advisors
> Infrastructure Technical Services
> Boston, MA 02111
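[Editorial note] Since FaultPropagation is a service-group attribute,
the change discussed here is applied per group rather than per
resource or resource type. A sketch, using the app_grp group name from
the main.cf posted in this thread:

```shell
# Make the cluster configuration writable, disable fault propagation
# for the group, and write the configuration back out read-only.
# With FaultPropagation = 0, VCS still marks the failing resource
# faulted (so the resfault trigger and notifications fire) but does
# not offline the group or attempt a failover.
haconf -makerw
hagrp -modify app_grp FaultPropagation 0
haconf -dump -makero
```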
Re: [Veritas-ha] IPMultiNICB, mpathd and network outages
Some notes inline.

On Mon, Oct 20, 2008 at 7:37 AM, Paul Robertson <[EMAIL PROTECTED]> wrote:

> We recently experienced a Cisco network issue which prevented all
> nodes in that subnet from accessing the default gateway for about a
> minute.
>
> The Solaris nodes which run probe-based IPMP reported that all
> interfaces had failed because they were unable to ping the default
> gateway; however, they came back within seconds once the network
> issue was resolved. Fine.

It would say "FAILED" in the ifconfig output, but the interface will
still try to work as long as the link exists.

> Unfortunately, our VCS nodes initiated an offline of the service
> group after the IPMultiNICB resources detected the IPMP fault. [...]
>
> We would like to find a way to configure VCS so that the service
> group does not offline (and any dependent resources within the
> service group are not offlined) in the event of an mpathd (i.e.
> IPMultiNICB) failure. In looking through the documentation, it seems
> that the closest we can come is to increase the IPMultiNICB
> ToleranceLimit from "1" to a huge value.

The best way to get around this is to have two different switches:
always two for the interconnect, and two for the public network. Also,
IMHO IPMP sucks; Linux NIC bonding works WAY better. If you ever make
that platform switch, life will be much better.
--
"You can't wait for inspiration. You have to go after it with a stick"
- Jack London
Re: [Veritas-ha] IPMultiNICB, mpathd and network outages
I would be more concerned about future failures being handled
properly. If you were able to take out all networks from all nodes at
the same time, you have a SPOF. If this was a one-time maintenance
upgrade to your network gear and not a normal event, setting VCS to
not respond to network events means that future cable or port issues
will not be handled.

If it is a common occurrence for all networks to be lost, perhaps you
need to address the network issues :-)

-Original Message-
From: DeMontier, Frank
Sent: Monday, October 20, 2008 11:10 AM
To: Paul Robertson; veritas-ha@mailman.eng.auburn.edu
Subject: Re: [Veritas-ha] IPMultiNICB, mpathd and network outages

FaultPropagation=0 should do it.

Buddy DeMontier
State Street Global Advisors
Infrastructure Technical Services
Boston, MA 02111
Re: [Veritas-ha] IPMultiNICB, mpathd and network outages
FaultPropagation=0 should do it.

Buddy DeMontier
State Street Global Advisors
Infrastructure Technical Services
Boston, MA 02111
[Veritas-ha] IPMultiNICB, mpathd and network outages
We recently experienced a Cisco network issue which prevented all
nodes in that subnet from accessing the default gateway for about a
minute.

The Solaris nodes which run probe-based IPMP reported that all
interfaces had failed because they were unable to ping the default
gateway; however, they came back within seconds once the network issue
was resolved. Fine.

Unfortunately, our VCS nodes initiated an offline of the service group
after the IPMultiNICB resources detected the IPMP fault. Since the
service group offline/online takes several minutes, the outage on
these nodes was more painful. Furthermore, since the peer cluster
nodes in the same subnet were also experiencing the same mpathd fault,
there would have been little advantage to failing over the service
group to another node.

We would like to find a way to configure VCS so that the service group
does not offline (and any dependent resources within the service group
are not offlined) in the event of an mpathd (i.e. IPMultiNICB)
failure. In looking through the documentation, it seems that the
closest we can come is to increase the IPMultiNICB ToleranceLimit from
"1" to a huge value:

# hatype -modify IPMultiNICB ToleranceLimit

This should achieve our desired goal, but I can't help thinking that
it's an ugly hack, and that there must be a better way. Any
suggestions are appreciated.

Cheers,

Paul

P.S.
A snippet of the main.cf file is listed below (note: the original post
was missing the closing parenthesis on the app_ip resource; restored
here):

group multinicbsg (
    SystemList = { app04 = 1, app05 = 2 }
    Parallel = 1
)

MultiNICB multinicb (
    UseMpathd = 1
    MpathdCommand = "/usr/lib/inet/in.mpathd -a"
    Device = { ce0 = 0, ce4 = 2 }
    DefaultRouter = "192.168.9.1"
)

Phantom phantomb (
)

phantomb requires multinicb

group app_grp (
    SystemList = { app04 = 0, app05 = 0 }
)

IPMultiNICB app_ip (
    BaseResName = multinicb
    Address = "192.168.9.34"
    NetMask = "255.255.255.0"
)

Proxy appmnic_proxy (
    TargetResName = multinicb
)

(various other resources, including some that depend on app_ip,
excluded for brevity)

app_ip requires appmnic_proxy
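[Editorial note] As a rough sizing aid for the ToleranceLimit
workaround discussed in this thread: VCS re-runs a resource's monitor
every MonitorInterval seconds and faults it once the monitor has
returned offline ToleranceLimit consecutive times, so the outage a
resource can ride out is roughly ToleranceLimit x MonitorInterval. A
sketch of the arithmetic; the 60-second MonitorInterval is the usual
VCS default, but check your own type definition (e.g. with
`hatype -display IPMultiNICB`) before relying on it:

```python
import math

def tolerance_limit_for(outage_seconds, monitor_interval=60):
    """Smallest ToleranceLimit that rides out an outage of the given
    length, assuming one offline monitor result per interval."""
    return math.ceil(outage_seconds / monitor_interval)

# A one-minute spanning-tree recalculation, default 60s interval:
print(tolerance_limit_for(60))    # -> 1 (no margin; round up in practice)

# Riding out a five-minute outage:
print(tolerance_limit_for(300))   # -> 5
```

This also shows why a "huge value" is an ugly hack: it multiplies
directly into how long a genuinely dead interface goes undetected.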