Re: Facility wide DR/Continuity

2009-06-03 Thread Bill Woodcock
  On Thu, 4 Jun 2009, Roland Dobbins wrote:
> With all due respect, both of these posited choices are quite ugly and
> tend to lead to huge operational difficulties, susceptibility to DDoS,
> etc.  Definitely not recommended except as a last resort in a difficult
> situation, IMHO.

I wouldn't go quite so far as to say that they have security implications, 
but I definitely agree that these are solutions of last resort, and that 
any live load-balanced solution is infinitely preferable to a stand-by 
solution, which, IMHO, is unlikely ever to work as hoped.  I was just 
answering the question at hand, rather than the meta-question of whether 
the question being asked was the right question.  :-)

-Bill




Re: Facility wide DR/Continuity

2009-06-03 Thread Roland Dobbins


On Jun 4, 2009, at 12:53 AM, Brandon Galbraith wrote:

> Or you use RFC1918 address space at each location, and NAT each side
> between public anycasted space and your private IP space. Prevents
> internal IP conflicts, having to deal with site to site NAT, etc.


With all due respect, both of these posited choices are quite ugly and  
tend to lead to huge operational difficulties, susceptibility to DDoS,  
etc.  Definitely not recommended except as a last resort in a  
difficult situation, IMHO.


---
Roland Dobbins  // 

Unfortunately, inefficiency scales really well.

   -- Kevin Lawton




Re: Facility wide DR/Continuity

2009-06-03 Thread Brandon Galbraith
On Wed, Jun 3, 2009 at 12:47 PM, Bill Woodcock  wrote:

>  On Wed, 3 Jun 2009, Drew Weaver wrote:
>> Should the additional sites be connected to the primary site
>> (and/or the Internet directly)?
>
> Yes, because any out-of-band synchronization method between the servers at
> the production site and the servers at the DR site is likely to be more
> difficult to manage.  You could do UUCP over a serial line, but...
>
>> What is the best way to handle the routing? Obviously two devices
>> cannot occupy the same IP address at the same time, so how do you
>> provide that instant 'cut-over'?
>
> This is one of the only instances in which I like NATs.  Set up a NAT
> between the two sites to do static 1-to-1 mapping of each site into a
> different range for the other, so that the DR servers have the same IP
> addresses as their production masters, but have a different IP address to
> synchronize with.
>

Or you use RFC1918 address space at each location, and NAT each side between
public anycasted space and your private IP space. Prevents internal IP
conflicts, having to deal with site to site NAT, etc.

-brandon



-- 
Brandon Galbraith
Mobile: 630.400.6992
FNAL: 630.840.2141


Re: Facility wide DR/Continuity

2009-06-03 Thread Bill Woodcock
  On Wed, 3 Jun 2009, Drew Weaver wrote:
> Should the additional sites be connected to the primary site 
> (and/or the Internet directly)?

Yes, because any out-of-band synchronization method between the servers at 
the production site and the servers at the DR site is likely to be more 
difficult to manage.  You could do UUCP over a serial line, but...

> What is the best way to handle the routing? Obviously two devices 
> cannot occupy the same IP address at the same time, so how do you 
> provide that instant 'cut-over'?

This is one of the only instances in which I like NATs.  Set up a NAT 
between the two sites to do static 1-to-1 mapping of each site into a 
different range for the other, so that the DR servers have the same IP 
addresses as their production masters, but have a different IP address to 
synchronize with.  
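For illustration only, a minimal Python sketch of that static 1:1 mapping 
(the ranges here are hypothetical documentation prefixes, not anyone's 
real addressing):

    # Map an address 1:1 from one site's range into the alias range
    # the other site uses to reach it (the static NAT scheme above).
    import ipaddress

    PROD_NET = ipaddress.ip_network("192.0.2.0/24")             # production range
    PROD_ALIAS_AT_DR = ipaddress.ip_network("198.51.100.0/24")  # as seen from DR

    def translate(addr, src_net, dst_net):
        """Return the address at the same offset within dst_net."""
        offset = int(ipaddress.ip_address(addr)) - int(src_net.network_address)
        return ipaddress.ip_address(int(dst_net.network_address) + offset)

    # A DR server syncing with its production master 192.0.2.17
    # connects to the alias 198.51.100.17 instead:
    print(translate("192.0.2.17", PROD_NET, PROD_ALIAS_AT_DR))

The NAT box performs the equivalent rewrite in both directions, so each DR 
server keeps its production IP while still having a distinct address to 
synchronize with.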

-Bill




Re: Facility wide DR/Continuity

2009-06-03 Thread Roland Dobbins


On Jun 3, 2009, at 11:15 PM, gb10hkzo-na...@yahoo.co.uk wrote:

> For example, consider the licensing and hardware costs involved in
> running something like Oracle Database in active/active mode (in a
> topology that is supported by Oracle Tech Support).


In my experience, it's no more expensive in terms of hardware/software  
licensing costs to run active/active, and actually less in terms of  
opex costs due to issues raised previously in this thread, as well as  
a host of others.


Note that running active/active doesn't necessarily mean running a  
clustered database back-end utilizing vendor-specific HA solutions.  It  
can be done via a combination of caching, sharding, distributed  
indexing, et al. - i.e., via application structuring and logic.
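A toy example of what application-level distribution can look like - this 
is just stable hash-based sharding in Python, not Roland's specific 
design, and the site names are made up:

    # Route each key to one of the active sites with a stable hash,
    # so both sites serve live traffic without a clustered back-end.
    import hashlib

    SITES = ["site-a", "site-b"]   # hypothetical active/active locations

    def site_for(key):
        digest = hashlib.sha256(key.encode()).digest()
        return SITES[int.from_bytes(digest[:8], "big") % len(SITES)]

    print(site_for("customer:12345"))  # same answer at either site

Caching and replicated indexes can then cover reads for the keys a given 
site doesn't own.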


---
Roland Dobbins  // 

Unfortunately, inefficiency scales really well.

   -- Kevin Lawton




Re: Facility wide DR/Continuity

2009-06-03 Thread gb10hkzo-nanog

>> Some things just don't active/active nicely on a budget.
>
> Sure, because of inefficient legacy design choices.


Roland, 

I'm not sure I understand your argument here.

Budget is very much an issue when choosing between active/active and 
active/passive.  Nothing to do with "inefficient legacy design".

For example, consider the licensing and hardware costs involved in running 
something like Oracle Database in active/active mode (in a topology that 
is supported by Oracle Tech Support).







Re: Facility wide DR/Continuity

2009-06-03 Thread Roland Dobbins


On Jun 3, 2009, at 10:38 PM, Seth Mattinen wrote:


> Some things just don't active/active nicely on a budget.



Sure, because of inefficient legacy design choices.

Distribution and scale is ultimately an application architecture  
issue, with networking and ancillary technologies playing an important  
supporting role.


---
Roland Dobbins  // 

Unfortunately, inefficiency scales really well.

   -- Kevin Lawton




Re: Facility wide DR/Continuity

2009-06-03 Thread Roland Dobbins


On Jun 3, 2009, at 10:36 PM, William Herrin wrote:


> Sometimes you're limited by the need to use applications which aren't
> capable of running on more than one server at a time.


All understood - which is why it's important that app devs/database  
folks/sysadmins are all part of the virtual team working to uplift  
legacy siloed OS/app stacks into more modern and flexible architectures.


;>

---
Roland Dobbins  // 

Unfortunately, inefficiency scales really well.

   -- Kevin Lawton




Re: Facility wide DR/Continuity

2009-06-03 Thread Seth Mattinen
Roland Dobbins wrote:
> 
> On Jun 3, 2009, at 10:05 PM, William Herrin wrote:
> 
>> You rarely need to fail over to the passive system.
> 
> 
> And management will never, ever let you do a full-up test, nor will they
> allow you to spend the money to build a scaled-up system which can
> handle the full load, because they can't stand the thought of hardware
> sitting there gathering dust.
> 
> Concur 100%.
> 
> Active/passive is an obsolete 35-year-old mainframe paradigm, and it
> deserves to die the death.  With modern technology, there's just really
> no excuse not to go active/active, IMHO.
> 

There's always one good reason: money. Some things just don't
active/active nicely on a budget. Then you're trying to explain why you
want to spend money on a SAN when they really want to spend the money on
new "green" refrigerators. (That's not a joke, it really happened.)

~Seth



Re: Facility wide DR/Continuity

2009-06-03 Thread William Herrin
On Wed, Jun 3, 2009 at 11:15 AM, Roland Dobbins wrote:
> Active/passive is an obsolete 35-year-old mainframe paradigm, and it
> deserves to die the death.  With modern technology, there's just really no
> excuse not to go active/active, IMHO.

Roland,

Sometimes you're limited by the need to use applications which aren't
capable of running on more than one server at a time.  In other cases,
it's obscenely expensive to run an application on more than one server
at a time. Nor is the split-brain problem in active/active systems a
trivial one.

There are still reasons for using active/passive configurations, but
be advised that active/active solutions have a noticeably better
success rate than active/passive ones.
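For what it's worth, the standard guard against split-brain is a strict
majority quorum - a rough Python sketch, with hypothetical node names:

    # A node only stays active if it can see a strict majority of the
    # cluster; an isolated minority stops accepting writes.
    CLUSTER = {"node-a", "node-b", "node-c"}   # odd-sized on purpose

    def may_accept_writes(me, reachable_peers):
        visible = len(reachable_peers & CLUSTER) + 1   # peers I see, plus me
        return visible > len(CLUSTER) // 2             # strict majority

    print(may_accept_writes("node-a", {"node-b"}))   # True  (2 of 3)
    print(may_accept_writes("node-a", set()))        # False (1 of 3)

With only two sites there's no majority to be had during a partition,
which is part of why two-site active/active is genuinely hard.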

Regards,
Bill Herrin


-- 
William D. Herrin  her...@dirtside.com  b...@herrin.us
3005 Crane Dr. .. Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004



Re: Facility wide DR/Continuity

2009-06-03 Thread gb10hkzo-nanog

(By the way, before accusations of spamming start flying: no, I don't 
work for Zeus or have any incentive to mention their name... I just 
happen to know their product.)







Re: Facility wide DR/Continuity

2009-06-03 Thread gb10hkzo-nanog

to whoever said ...

>F5's if you like commercial solutions

F5s if you like expensive commercial solutions.

Those with a less bulging wallet may wish to speak to the guys at Zeus 
Technology (http://www.zeus.com/) who have a lot of experience in the area and 
a more reasonable price tag.

Contact me off-list and I can give you some names of senior techies there to 
speak to. 

(Usual disclaimer that you should research your products... it may be
that for your particular application/environment/business model/whatever,
F5 may be your tool of choice.)






Re: Facility wide DR/Continuity

2009-06-03 Thread Roland Dobbins


On Jun 3, 2009, at 10:05 PM, William Herrin wrote:


> You rarely need to fail over to the passive system.



And management will never, ever let you do a full-up test, nor will  
they allow you to spend the money to build a scaled-up system which  
can handle the full load, because they can't stand the thought of  
hardware sitting there gathering dust.


Concur 100%.

Active/passive is an obsolete 35-year-old mainframe paradigm, and it  
deserves to die the death.  With modern technology, there's just  
really no excuse not to go active/active, IMHO.


---
Roland Dobbins  // 

Unfortunately, inefficiency scales really well.

   -- Kevin Lawton




Re: Facility wide DR/Continuity

2009-06-03 Thread gb10hkzo-nanog

Tell me about it... "failover test, what failover test?"  ;-)



- Original Message 
From: William Herrin 
To: gb10hkzo-na...@yahoo.co.uk
Cc: nanog@nanog.org
Sent: Wednesday, 3 June, 2009 16:05:15
Subject: Re: Facility wide DR/Continuity

On Wed, Jun 3, 2009 at 10:53 AM,  wrote:
> - whether you are after an active/active or active/passive solution

In practice, active/passive DR solutions often fail. You rarely need
to fail over to the passive system. When you finally do need to fail
over, there are a dozen configuration changes that didn't make it from
the active system, so the passive system isn't in a runnable state.

Regards,
Bill Herrin



-- 
William D. Herrin  her...@dirtside.com  b...@herrin.us
3005 Crane Dr. .. Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004







Re: Facility wide DR/Continuity

2009-06-03 Thread William Herrin
On Wed, Jun 3, 2009 at 10:53 AM,  wrote:
> - whether you are after an active/active or active/passive solution

In practice, active/passive DR solutions often fail. You rarely need
to fail over to the passive system. When you finally do need to fail
over, there are a dozen configuration changes that didn't make it from
the active system, so the passive system isn't in a runnable state.
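One cheap mitigation is to diff the two systems continuously so drift
shows up before the failover does - a rough Python sketch, with
hypothetical paths (a real setup would pull the files over ssh/rsync
into local snapshot directories first):

    # Hash a watched set of config files from snapshots of the active
    # and standby systems, and report any that differ.
    import hashlib, pathlib

    WATCHED = ["/etc/haproxy/haproxy.cfg", "/etc/hosts"]   # example files

    def fingerprint(root):
        out = {}
        for rel in WATCHED:
            p = pathlib.Path(root + rel)
            out[rel] = hashlib.sha256(p.read_bytes()).hexdigest() if p.exists() else None
        return out

    active = fingerprint("/srv/snap/active")
    standby = fingerprint("/srv/snap/standby")
    drift = [f for f in WATCHED if active[f] != standby[f]]
    print("files differing between active and standby:", drift or "none")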

Regards,
Bill Herrin



-- 
William D. Herrin  her...@dirtside.com  b...@herrin.us
3005 Crane Dr. .. Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004



Re: Facility wide DR/Continuity

2009-06-03 Thread Roland Dobbins


On Jun 3, 2009, at 9:37 PM, William Herrin wrote:


> If you can afford it, stretch the LAN across the facilities via fiber
> and rebuild the critical services as a load balanced active-active
> cluster.


I would advise strongly against stretching a layer-2 topology across  
sites, if at all possible - far better to go for layer-3 separation,  
work with the app/database/sysadmin folks to avoid dependence on  
direct adjacencies, and gain the topological freedom of routing.


---
Roland Dobbins  // 

Unfortunately, inefficiency scales really well.

   -- Kevin Lawton




Re: Facility wide DR/Continuity

2009-06-03 Thread Stefan
On Wed, Jun 3, 2009 at 7:09 AM, Drew Weaver  wrote:

> Hi All,
>
> I'm attempting to devise a method which will provide continuous operation
> of certain resources in the event of a disaster at a single facility.
>
> The types of resources that need to be available in the event of a disaster
> are ecommerce applications and other business critical resources.
>
> Some of the questions I keep running into are:
>
>Should the additional sites be connected to the primary site
> (and/or the Internet directly)?
>What is the best way to handle the routing? Obviously two
> devices cannot occupy the same IP address at the same time, so how do you
> provide that instant 'cut-over'? I could see using application balancers to
> do this but then what if the application balancers fail, etc?
>
> Any advice from folks on list or off who have done similar work is greatly
> appreciated.
>
> Thanks,
> -Drew
>
>
>

In an environment where a DR site is deemed critical, it is my experience
that critical business applications also have a test or development
environment associated with the production one. If you look at the problem
this way, then a DR site equipped with the test/devel systems, with one
"instance" of production always available, would only be challenging in
terms of data sync. Various SAN solutions would resolve that (SAN sync-ing
over WAN/MAN/etc.). Virtualization of critical systems may also add some
benefits here: clone the critical VMs at the DR site, and in conjunction
with the storage being available, you'll be able to bring these machines
up in no time - just make sure you have some sort of L2 connectivity
available (maybe EoS, or tunneling over L3); there's tons of info out
there if you search for virtual machine mobility and inter-site
connectivity.

Voice has to be considered, too: for PSTN, make arrangements with your
provider to re-route toll-free (8xx) numbers in case of disaster. VoIP may
add some extra reachability over the Internet in case your DR site cannot
accommodate everyone - customer-service people, for example, who are
critical for interfacing with customers in case of disaster (no
information means bigger losses and perception issues), have to be able to
connect even from home.

As far as the "immediate" switch from one site to another - DNS is the
primary concern (unless some wise people have hardcoded IPs all over),
but there are other issues people tend to forget at the core of some
clients - take the Oracle "fat" client and its TNS names: I've seen those
associated with IPs instead of host names ... etc.

Disclaimer: the above = one of many aspects. Have seen DNS comments already,
so I won't repeat those aspects.

HTH,
-- 
***Stefan
http://twitter.com/netfortius


Re: Facility wide DR/Continuity

2009-06-03 Thread gb10hkzo-nanog

As with all things, there's no "right answer"... a lot of it depends on 
three things:

- what you are hoping to achieve
- what your budget is
- what you have at your disposal in terms of numbers of qualified staff 
available to both implement and support the chosen solution

Those are the main business-level factors.  From a technical level, two 
key factors (although, of course, there are many others to consider) are:

- whether you are after an active/active or active/passive solution
- what the underlying application(s) are (e.g. you might have other options 
such as anycast with DNS)


Anyway, there's a lot to consider.  And despite all the expertise on 
NANOG, I would still suggest the original poster do their fair share of 
their own homework. :)






- Original Message 
From: Jim Wise 
To: gb10hkzo-na...@yahoo.co.uk
Cc: nanog@nanog.org
Sent: Wednesday, 3 June, 2009 15:42:24
Subject: Re: Facility wide DR/Continuity

gb10hkzo-na...@yahoo.co.uk writes:

> On the subject of DNS GSLB, there's a fairly well known article on the
> subject that anyone considering implementing it should read at least
> once :)
>
> http://www.tenereillo.com/GSLBPageOfShame.htm
> and part 2
> http://www.tenereillo.com/GSLBPageOfShameII.htm
>
> Yes it was written in 2004.  But all the "food for thought" that it
> provides is still very much applicable today.

One thing I've noticed about this paper in the past that kind of bugs me
is that in arguing that multiple A records are a better solution than a
single GSLB-managed A record, the paper assumes that browsers and other
common internet clients will actually cache multiple A records, and fail
between them if the earlier A records fail.  The (first) of the two
pages explicitly touts this as a high availability solution.

However, I haven't observed this behavior from browsers, media players,
and similar programs `in the wild' -- as far as I've been able to tell,
most client software picks an A record from those returned (possibly,
but not usually skipping those found to be unreachable), and then holds
onto that choice of IP address until the record times out of cache, and
a new request is made.

Have I been unlucky in my observations?  Are there client programs which
do failover between multiple A records returned for a single name --
presumably sticking with one IP for session-affinity purposes until a
failure is detected?

If clients do not behave this way, then the paper's observations about
GSLB for HA purposes don't seem to hold -- though in my limited
experience the paper's other point (that geographic dispatch is Hard)
seems much more accurate (making GSLB a better HA solution than it is a
load-sharing solution, again, at least in my experience).

Or am I missing something?

-- 
Jim Wise
jw...@draga.com



 



Re: Facility wide DR/Continuity

2009-06-03 Thread Brandon Galbraith
On Wed, Jun 3, 2009 at 9:37 AM, William Herrin wrote:

> On Wed, Jun 3, 2009 at 8:09 AM, Drew Weaver wrote:
> [...]
>
> If you can't afford the fiber or need to put the DR site too far away
> for fiber to be practical, you can still build a network which
> virtualizes your LAN. However, you then have to worry about issues
> with the broadcast domain and traffic demand between the clustered
> servers over the slower WAN.
>
> It's doable. I've done it with VPNs over Internet T1's. But you better
> have your developers on board early and provide them with a
> simulated environment so that they can get used to the idea of having
> little bandwidth between the clustered servers.
>
>
In most cases, the fiber is affordable (a certain bandwidth provider out
there offers Layer 2 point to point anywhere on their network for very low
four digit prices). We recently put into place an active/active environment
with one end point in the US and the other end point in Amsterdam, and both
sides see the other as if they were on the same physical LAN segment. I've
found that, like you said, you *must* have the application developers
onboard early, as you can only do so much at the network level without the
app being aware.

-brandon

--
Brandon Galbraith
Mobile: 630.400.6992
FNAL: 630.840.2141


Re: Facility wide DR/Continuity

2009-06-03 Thread Jim Wise
gb10hkzo-na...@yahoo.co.uk writes:

> On the subject of DNS GSLB, there's a fairly well known article on the
> subject that anyone considering implementing it should read at least
> once :)
>
> http://www.tenereillo.com/GSLBPageOfShame.htm
> and part 2
> http://www.tenereillo.com/GSLBPageOfShameII.htm
>
> Yes it was written in 2004.  But all the "food for thought" that it
> provides is still very much applicable today.

One thing I've noticed about this paper in the past that kind of bugs me
is that in arguing that multiple A records are a better solution than a
single GSLB-managed A record, the paper assumes that browsers and other
common internet clients will actually cache multiple A records, and fail
between them if the earlier A records fail.  The (first) of the two
pages explicitly touts this as a high availability solution.

However, I haven't observed this behavior from browsers, media players,
and similar programs `in the wild' -- as far as I've been able to tell,
most client software picks an A record from those returned (possibly,
but not usually skipping those found to be unreachable), and then holds
onto that choice of IP address until the record times out of cache, and
a new request is made.

Have I been unlucky in my observations?  Are there client programs which
do failover between multiple A records returned for a single name --
presumably sticking with one IP for session-affinity purposes until a
failure is detected?

If clients do not behave this way, then the paper's observations about
GSLB for HA purposes don't seem to hold -- though in my limited
experience the paper's other point (that geographic dispatch is Hard)
seems much more accurate (making GSLB a better HA solution than it is a
load-sharing solution, again, at least in my experience).

Or am I missing something?
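For comparison, here's roughly what a client would have to do to get the
failover behavior the paper assumes - fetch all the A records and try
each in turn (an illustrative Python sketch, not any real client's code):

    import socket

    def connect_any(host, port, timeout=3):
        """Try every address returned for host, not just the first."""
        last_err = None
        for *_, sockaddr in socket.getaddrinfo(host, port,
                                               type=socket.SOCK_STREAM):
            try:
                return socket.create_connection(sockaddr[:2], timeout=timeout)
            except OSError as err:
                last_err = err    # dead address: fall through to the next
        raise last_err or OSError("no addresses for " + host)

    sock = connect_any("www.example.com", 80)

Whether mainstream clients actually behave this way is exactly the
question raised above.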

-- 
Jim Wise
jw...@draga.com




Re: Facility wide DR/Continuity

2009-06-03 Thread William Herrin
On Wed, Jun 3, 2009 at 8:09 AM, Drew Weaver wrote:
> I'm attempting to devise a method which will provide continuous
>operation of certain resources in the event of a disaster at a single facility.

Drew,

If you can afford it, stretch the LAN across the facilities via fiber
and rebuild the critical services as a load balanced active-active
cluster. Then a facility failure and a routine server failure are
identical and are handled by the load balancer. F5's if you like
commercial solutions, Linux LVS if you're partial to open source as I
am. Then make sure you have a Internet entry into each location with
BGP.

BTW, this tends to make maintenance easier too. Just remove servers
from the cluster when you need to work on them and add them back in
when you're done. Really reduces the off-hours maintenance windows.

This is how I did it when I worked at the DNC and it worked flawlessly.
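The health-check logic at the heart of any such balancer (LVS, F5, or
otherwise) is simple enough to sketch in a few lines of Python - the
addresses here are hypothetical:

    # Probe each real server's service port; a dead server (or a dead
    # site) simply drops out of the pool.
    import socket

    REAL_SERVERS = [("10.0.1.10", 443), ("10.0.2.10", 443)]  # one per site

    def alive(addr, timeout=2):
        try:
            with socket.create_connection(addr, timeout=timeout):
                return True
        except OSError:
            return False

    pool = [s for s in REAL_SERVERS if alive(s)]
    print("in service:", pool)

That's also why maintenance gets easier: pulling a server for work is the
same operation as a failure.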

If you can't afford the fiber or need to put the DR site too far away
for fiber to be practical, you can still build a network which
virtualizes your LAN. However, you then have to worry about issues
with the broadcast domain and traffic demand between the clustered
servers over the slower WAN.

It's doable. I've done it with VPNs over Internet T1's. But you better
have your developers on board early and provide them with a
simulated environment so that they can get used to the idea of having
little bandwidth between the clustered servers.


On Wed, Jun 3, 2009 at 9:25 AM, Ricky Duman wrote:
> - Failover to backup servers using DNS (but may not be instant)

If your budget is more than a shoestring, save yourself some grief and
don't go down this road. Even with the TTLs set to 5 minutes, it takes
hours to get to two-nines recovery from a DNS change and months to get
to five-nines. The DNS protocol is designed to be able to recover
quickly but the applications which use it aren't. Like web browsers.
Google "DNS Pinning."
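The pinning behavior in a nutshell - this illustrative Python mimics what
many browsers effectively did, not any specific implementation:

    import socket

    PINNED = {}   # hostname -> IP, never expired

    def pinned_lookup(host):
        if host not in PINNED:
            PINNED[host] = socket.gethostbyname(host)  # first and only lookup
        return PINNED[host]   # a 5-minute TTL buys you nothing here

Until the process restarts, the old address keeps getting used no matter
what the zone says.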

Regards,
Bill Herrin


-- 
William D. Herrin  her...@dirtside.com  b...@herrin.us
3005 Crane Dr. .. Web: <http://bill.herrin.us/>
Falls Church, VA 22042-3004



RE: Facility wide DR/Continuity

2009-06-03 Thread gb10hkzo-nanog

On the subject of DNS GSLB, there's a fairly well known article on the subject 
that anyone considering implementing it should read at least once :)

http://www.tenereillo.com/GSLBPageOfShame.htm
and part 2
http://www.tenereillo.com/GSLBPageOfShameII.htm

Yes it was written in 2004.  But all the "food for thought" that it provides is 
still very much applicable today.







RE: Facility wide DR/Continuity

2009-06-03 Thread Ricky Duman
Drew,

IMO, as your %availability goes up (99%, 99.9%, 99.99%, 100%), your
price to implement will go up exponentially.  That being said, much of
this will depend on what your budget is to achieve your desired
availability.
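For a feel of what those nines mean in practice, the downtime budgets are
straightforward arithmetic (Python):

    # Allowed downtime per year at each availability level.
    MIN_PER_YEAR = 365.25 * 24 * 60
    for nines in (99.0, 99.9, 99.99, 99.999):
        down = MIN_PER_YEAR * (1 - nines / 100)
        print(f"{nines:g}%: {down:,.1f} minutes/year")
    # 99%    -> ~5,259.6 min (~3.7 days)
    # 99.9%  ->   ~526.0 min (~8.8 hours)
    # 99.99% ->    ~52.6 min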

On the low end, you can:

- Fail over to backup servers using DNS (but this may not be instant).
  You should also consider a replication solution if your resources
  need fresh data.

On the high end, you can:

- Run a secondary mirror location, replicating data.
- Set up BGP, announcing your IP block out of both locations.
- Run a private link interconnecting your sites for your iBGP session.



-Original Message-
From: Drew Weaver [mailto:drew.wea...@thenap.com] 
Sent: Wednesday, June 03, 2009 8:10 AM
To: 'nanog@nanog.org'
Subject: Facility wide DR/Continuity

Hi All,

I'm attempting to devise a method which will provide continuous
operation of certain resources in the event of a disaster at a single
facility.

The types of resources that need to be available in the event of a
disaster are ecommerce applications and other business critical
resources.

Some of the questions I keep running into are:

        Should the additional sites be connected to the primary site
(and/or the Internet directly)?
        What is the best way to handle the routing? Obviously two devices
cannot occupy the same IP address at the same time, so how do you provide
that instant 'cut-over'? I could see using application balancers to do
this but then what if the application balancers fail, etc?

Any advice from folks on list or off who have done similar work is
greatly appreciated.

Thanks,
-Drew





Re: Facility wide DR/Continuity

2009-06-03 Thread Roland Dobbins


On Jun 3, 2009, at 7:09 PM, Drew Weaver wrote:

> What is the best way to handle the routing? Obviously two devices
> cannot occupy the same IP address at the same time, so how do you
> provide that instant 'cut-over'?


Avoid 'cut-over' entirely - go active/active/etc., and use DNS-based  
GSLB for the various system elements.
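At its simplest, the GSLB answer logic is just health-aware record
selection - a Python sketch with made-up site names and addresses:

    # Hand out A records only for sites that currently look healthy.
    SITE_IPS = {"us-east": "192.0.2.10", "eu-west": "198.51.100.10"}

    def answer(health):
        up = [ip for site, ip in SITE_IPS.items() if health.get(site)]
        return up or list(SITE_IPS.values())   # all down? answer anyway

    print(answer({"us-east": True, "eu-west": False}))  # ['192.0.2.10']

A real deployment wraps this in an authoritative DNS server with short
TTLs and continuous health probes.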


---
Roland Dobbins  // 

Unfortunately, inefficiency scales really well.

   -- Kevin Lawton




Facility wide DR/Continuity

2009-06-03 Thread Drew Weaver
Hi All,

I'm attempting to devise a method which will provide continuous operation of 
certain resources in the event of a disaster at a single facility.

The types of resources that need to be available in the event of a disaster are 
ecommerce applications and other business critical resources.

Some of the questions I keep running into are:

Should the additional sites be connected to the primary site 
(and/or the Internet directly)?
What is the best way to handle the routing? Obviously two 
devices cannot occupy the same IP address at the same time, so how do you 
provide that instant 'cut-over'? I could see using application balancers to do 
this but then what if the application balancers fail, etc?

Any advice from folks on list or off who have done similar work is greatly 
appreciated.

Thanks,
-Drew