Re: [Veritas-ha] IPMultiNICB, mpathd and network outages

2008-10-20 Thread Jim Senicka
I would be more concerned about future failures being handled properly.
If you were able to take out all networks from all nodes at the same
time, you have a SPOF. If this was a one-time maintenance upgrade to
your network gear and not a normal event, setting VCS to not respond to
network events means that future cable or port issues will not be
handled.
If it is a common occurrence for all networks to be lost, perhaps you
need to address the network issues :-)



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
DeMontier, Frank
Sent: Monday, October 20, 2008 11:10 AM
To: Paul Robertson; veritas-ha@mailman.eng.auburn.edu
Subject: Re: [Veritas-ha] IPMultiNICB, mpathd and network outages

FaultPropagation=0 should do it.
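Concretely, something like this (group name taken from your main.cf snippet; from memory, so double-check the syntax on your release):

```shell
# Make the cluster config writable, turn off fault propagation
# for the service group, then save and close the config.
haconf -makerw
hagrp -modify app_grp FaultPropagation 0
haconf -dump -makero
```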

Buddy DeMontier
State Street Global Advisors
Infrastructure Technical Services
Boston Ma 02111

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Paul
Robertson
Sent: Monday, October 20, 2008 10:37 AM
To: veritas-ha@mailman.eng.auburn.edu
Subject: [Veritas-ha] IPMultiNICB, mpathd and network outages

We recently experienced a Cisco network issue which prevented all
nodes in that subnet from accessing the default gateway for about a
minute.

The Solaris nodes which run probe-based IPMP reported that all
interfaces had failed because they were unable to ping the default
gateway; however, they came back within seconds once the network issue
was resolved. Fine.

Unfortunately, our VCS nodes initiated an offline of the service group
after the IPMultiNICB resources detected the IPMP fault. Since the
service group offline/online takes several minutes, the outage on
these nodes was more painful. Furthermore, since the peer cluster
nodes in the same subnet were also experiencing the same mpathd fault,
there would have been little advantage to failing over the service
group to another node.

We would like to find a way to configure VCS so that the service group
does not offline (and any dependent resources within the service group
are not offlined) in the event of an mpathd (i.e. IPMultiNICB)
failure. In looking through the documentation, it seems that the
closest we can come is to increase the IPMultiNICB ToleranceLimit from
1 to a huge value:

 # hatype -modify IPMultiNICB ToleranceLimit 

This should achieve our desired goal, but I can't help thinking that
it's an ugly hack, and that there must be a better way. Any
suggestions are appreciated.
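For reference, what I'm considering would look something like this (the value 100 is an arbitrary placeholder, not a recommendation):

```shell
# Make the config writable, raise the type-level ToleranceLimit
# so monitor failures are tolerated for many cycles before the
# resource faults, then save and close the config.
haconf -makerw
hatype -modify IPMultiNICB ToleranceLimit 100
haconf -dump -makero
```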

Cheers,

Paul

P.S. A snippet of the main.cf file is listed below:


 group multinicbsg (
   SystemList = { app04 = 1, app05 = 2 }
   Parallel = 1
   )

   MultiNICB multinicb (
   UseMpathd = 1
   MpathdCommand = /usr/lib/inet/in.mpathd -a
   Device = { ce0 = 0, ce4 = 2 }
   DefaultRouter = 192.168.9.1
   )

   Phantom phantomb (
   )

   phantomb requires multinicb

 group app_grp (
   SystemList = { app04 = 0, app05 = 0 }
   )

   IPMultiNICB app_ip (
   BaseResName = multinicb
   Address = 192.168.9.34
   NetMask = 255.255.255.0
   )

   Proxy appmnic_proxy (
   TargetResName = multinicb
   )

   (various other resources, including some that depend on app_ip
   excluded for brevity)

   app_ip requires appmnic_proxy
___
Veritas-ha maillist  -  Veritas-ha@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-ha



Re: [Veritas-ha] IPMultiNICB, mpathd and network outages

2008-10-20 Thread Craig Simpson
Some notes inline.

On Mon, Oct 20, 2008 at 7:37 AM, Paul Robertson 
[EMAIL PROTECTED] wrote:

 We recently experienced a Cisco network issue which prevented all
 nodes in that subnet from accessing the default gateway for about a
 minute.

 The Solaris nodes which run probe-based IPMP reported that all
 interfaces had failed because they were unable to ping the default
 gateway; however, they came back within seconds once the network issue
 was resolved. Fine.


It would say FAILED in the ifconfig output, but the interface will still
try to work as long as the link exists.




 Unfortunately, our VCS nodes initiated an offline of the service group
 after the IPMultiNICB resources detected the IPMP fault. Since the
 service group offline/online takes several minutes, the outage on
 these nodes was more painful. Furthermore, since the peer cluster
 nodes in the same subnet were also experiencing the same mpathd fault,
 there would have been little advantage to failing over the service
 group to another node.

 We would like to find a way to configure VCS so that the service group
 does not offline (and any dependent resources within the service group
 are not offlined) in the event of an mpathd (i.e. IPMultiNICB)
 failure. In looking through the documentation, it seems that the
 closest we can come is to increase the IPMultiNICB ToleranceLimit from
 1 to a huge value:


The best way to get around this is to have 2 different switches: always
2 for interconnect, and 2 for public.

Also, IMHO IPMP sucks. Linux NIC bonding works WAY better. If you ever
make that platform switch, life will be much better.







-- 
You can't wait for inspiration. You have to go after it with a stick -
Jack London


Re: [Veritas-ha] IPMultiNICB, mpathd and network outages

2008-10-20 Thread Paul Robertson
Buddy,

Thanks for the info.

The FaultPropagation=0 idea is a good one, but it applies at the
service group level and cannot be set only for specified resources or
resource types. It's probably better than my ToleranceLimit approach,
since it results in a resfault and will therefore run our resfault
trigger to send SMTP notification.

If there were an equivalent parameter that applied at the resource
level, then I'd have exactly what I was looking for.
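One thing I may experiment with: I believe recent VCS releases let you override a type-level static attribute on a single resource with hares -override, which would at least confine the ugly ToleranceLimit hack to the one resource. Syntax from memory, and 100 is just a placeholder value, so verify against your release:

```shell
haconf -makerw
# Override the static ToleranceLimit for just this resource,
# then set a large per-resource value.
hares -override app_ip ToleranceLimit
hares -modify app_ip ToleranceLimit 100
haconf -dump -makero
```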

Cheers,

Paul

On Mon, Oct 20, 2008 at 11:09 AM, DeMontier, Frank
[EMAIL PROTECTED] wrote:
 FaultPropagation=0 should do it.

 Buddy DeMontier
 State Street Global Advisors
 Infrastructure Technical Services
 Boston Ma 02111



Re: [Veritas-ha] IPMultiNICB, mpathd and network outages

2008-10-20 Thread Sandeep Agarwal (MTV)
If you only have 1 router in your subnet (which is very common), then
the router can be a SPOF for IPMP. Here's how you can set up more
targets for IPMP to probe so that the router is not a SPOF.

How to Manually Specify Target Systems for Probe-Based Failure Detection

1. Log in with your user account to the system where you are
   configuring probe-based failure detection.

2. Add a route to a particular host to be used as a target in
   probe-based failure detection:

      $ route add -host destination-IP gateway-IP -static

   Replace destination-IP and gateway-IP with the IPv4 address of the
   host to be used as a target. For example, you would type the
   following to specify the target system 192.168.85.137, which is on
   the same subnet as the interfaces in IPMP group testgroup1:

      $ route add -host 192.168.85.137 192.168.85.137 -static

3. Add routes to additional hosts on the network to be used as target
   systems.

Taken from:
http://docs.sun.com/app/docs/doc/816-4554/etmkd?l=en&a=view

This should probably be a best practice. However, one can argue that if
the router is down then the cluster is useless anyway.
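For example, applied to the 192.168.9.0/24 subnet in Paul's snippet, you could pin a couple of always-reachable hosts as probe targets (the two addresses below are made up for illustration):

```shell
# Static host routes cause in.mpathd to use these hosts as
# probe targets in addition to the default router.
route add -host 192.168.9.10 192.168.9.10 -static
route add -host 192.168.9.11 192.168.9.11 -static
```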

Sandeep


Re: [Veritas-ha] IPMultiNICB, mpathd and network outages

2008-10-20 Thread Craig Simpson
Yes, always have 2 interconnect switches for sure. Also, if you switch
the interconnect to UDP for Oracle, it will still work with a few
modifications using 2 switches.

For me, having 2 switches for public solved most networking blunders.
It is OK cluster-wise if the default gateway goes away: traffic just
stops, big deal. That is the network team's issue. I have lived through
it many times.

C





-- 
You can't wait for inspiration. You have to go after it with a stick -
Jack London