Re: Data Center testing

2009-08-25 Thread eric clark
Most provider-type datacenters I've worked with get a lot of flak from
customers when they announce network failover testing, because there is
always at least some chance of disruption. I think it's the exception to
find a provider that does it (or maybe just one that admits it when
they're doing it). Power tests are a different thing.
As for testing your own equipment, there are a couple of ways to do that:
regular failover tests (quarterly, or more likely at 6 month intervals),
and/or routing traffic so that you have some of your traffic on all paths
(i.e. internal traffic on one path, external traffic on another). The latter
doesn't necessarily tell you that your failover will work perfectly, only
that all your gear in the 2nd path is functioning. I prefer doing both.

When doing the failover tests, no matter how good your setup is, there's
always a chance of taking a hit, so I always do this kind of work during a
maintenance window, not too close to quarter end, etc.
If your equipment is set up correctly, of course, it goes like butter
and is a total non-event.

For test procedure, I usually pull cables. I'll go all the way to line cards
or power cables if I really want to test, though that can be hard on
equipment.

E



On Mon, Aug 24, 2009 at 10:45 AM, Jack Bates  wrote:

> Dan Snyder wrote:
>
>> We have done power tests before and had no problem.  I guess I am looking
>> for someone who does testing of the network equipment outside of just
>> power tests.  We had an outage due to a configuration mistake that became
>> apparent when a switch failed.  It didn't cause a problem however when we
>> did a power test for the whole data center.
>>
>>
> The plus side of failure testing is that it can be controlled. The downside
> to failure testing is that you can induce a failure. Maintenance windows are
> cool, but some people really dislike failures of any type which limits how
> often you can test. I personally try for once a year. However, a lot can go
> wrong in a year.
>
> Jack
>
>


Re: Data Center testing

2009-08-25 Thread James Hess
On Tue, Aug 25, 2009 at 7:53 AM, Jeff Aitken wrote:
>[..] Periodically inducing failures to catch [...] them is sorta like using 
>your smoke detector as an oven timer.
>[..]
> machine-parsable format, but the benefit is that you know in pseudo-realtime
> when something is wrong, as opposed to finding out the next time a device
> fails.

Config checking can't say much about silent hardware failures.
Unanticipated problems are likely to arise in failover systems,
especially complicated ones.  A failover system that has not been
periodically verified may not work as designed.

Simulations, config review, and change controls are not substitutes
for testing; they address overlapping but different problems.
Testing detects unanticipated errors; config review is a preventive
measure that helps avoid and correct apparent configuration issues.

Config checking (of both software and hardware choices) also helps to
keep out unnecessary complexity.

A human still has to write the script and review its output. Sooner or
later an operator will accidentally omit something from both the
current state and the "desired" state, so there is a chance that an
erroneous entry escapes detection.

There can be other types of errors.
Possibly there is a damaged patch cable, dying port, failing power
supply, or other hardware on the warm spare that has silently degraded,
and its poor condition won't be detected (until it actually tries
to take a heavy workload, blows a fuse, eats a transceiver, and
everything just falls apart).


Perhaps you upgraded a hardware module or software image X months ago,
to fix bug Y on the secondary unit, and the upgrade caused completely
unanticipated side effect Z.


Config checking can't say much about silent hardware problems.

--
-Mysid



RE: Data Center testing

2009-08-25 Thread Frank Bulk - iName.com
There's more to data integrity in a data center (well, anything powered,
that is) than network configurations.  There's the loading of individual
power outlets, UPS loading, UPS battery replacement cycles, loading of
circuits, backup lighting, etc.  And the only way to know if something is
really working like it's designed is to test it.  That's why we have
financial auditors, military exercises, fire drills, etc.

So while your analogy emphasizes the importance of having good processes in
place to catch the problems up front, it doesn't eliminate the need to throw
the switch.

Frank

-----Original Message-----
From: Jeff Aitken [mailto:jait...@aitken.com] 
Sent: Tuesday, August 25, 2009 7:53 AM
To: Dan Snyder
Cc: NANOG list
Subject: Re: Data Center testing

On Mon, Aug 24, 2009 at 09:38:38AM -0400, Dan Snyder wrote:
> We have done power tests before and had no problem.  I guess I am looking
> for someone who does testing of the network equipment outside of just power
> tests.  We had an outage due to a configuration mistake that became apparent
> when a switch failed.  It didn't cause a problem however when we did a power
> test for the whole data center.

Dan,

With all due respect, if there are config changes being made to your 
devices that aren't authorized or in accordance with your standards (you
*do* have config standards, right?) then you don't have a testing problem,
you have a data integrity problem.  Periodically inducing failures to catch
them is sorta like using your smoke detector as an oven timer.

There are several tools that can help in this area; a good free one is
rancid [1], which logs in to your routers and collects copies of configs
and other info, all of which gets stored in a central repository.  By
default, you will be notified via email of any changes.  An even better
approach than scanning the hourly config diff emails is to develop scripts
that compare the *actual* state of the network with the *desired* state and
alert you if the two are not in sync.  Obviously this is more work because
you have to have some way of describing the desired state of the network in
machine-parsable format, but the benefit is that you know in pseudo-realtime
when something is wrong, as opposed to finding out the next time a device
fails.  Rancid diffs + tacacs logs will tell you who made the changes, and
with that info you can get at the root of the problem.

Having said that, every planned maintenance activity is an opportunity to
run through at least some failure cases.  If one of your providers is going
to take down a longhaul circuit, you can observe how traffic re-routes and
verify that your metrics and/or TE are doing what you expect.  Any time you
need to load new code on a device you can test that things fail over
appropriately.  Of course, you have to be willing to just shut the device
down without draining it first, but that's between you and your customers.
Link and/or device failures will generate routing events that could be used
to test convergence times across your network, etc.

The key is to be prepared.  The more instrumentation you have in place
prior to the test, the better you will be able to analyze the impact of the
failure.  An experienced operator can often tell right away when looking at
a bunch of MRTG graphs that "something doesn't look right", but that doesn't
tell you *what* is wrong.  There are tools (free and commercial) that can
help here, too.  Have a central syslog server and some kind of log reduction
tool in place.  Have beacons/probes deployed, in both the control and data
planes.  If you want to record, analyze, and even replay routing system
events, you might want to take a look at the Route Explorer product from
Packet Design [2].

You said "switch failure" above, so I'm guessing that this doesn't apply
to you, but there are also good network simulation packages out there.
Cariden [3] and WANDL [4] can build models of your network based on actual
router configs and let you simulate the impact of various scenarios,
including device/link failures.  However, these tools are more appropriate
for design and planning than for catching configuration mistakes, so
they may not be what you're looking for in this case.


--Jeff


[1] http://www.shrubbery.net/rancid/
[2] http://www.packetdesign.com/products/rex.htm
[3] http://www.cariden.com/
[4] http://www.wandl.com/html/index.php






Re: FCCs RFC for the Definition of Broadband

2009-08-25 Thread Fred Baker


On Aug 24, 2009, at 9:17 AM, Luke Marrott wrote:

> What are your thoughts on what the definition of Broadband should be going
> forward? I would assume this will be the standard definition for a number of
> years to come.



Historically, narrowband was circuit switched (ISDN etc) and broadband
was packet switched. Narrowband was therefore tied to the digital
signaling hierarchy and was in some way a multiple of 64 kbps. As the
term was used then, broadband delivery options of course included
virtual circuits bearing packets, like Frame Relay and ATM.


The new services I am hearing about include streamed video to multiple
HD TVs in the home. I think I would encourage the FCC, when it discusses
"broadband", to step away from the technology and look at the bandwidth
usably delivered (as in "I don't care what the bit rate of the
connection at the curb is if the back end is clogged; how much can a
commodity TCP session move through the network"). http://tinyurl.com/pgxqzb
suggests that the average broadband service worldwide delivers a
download rate of 1.5 Mbps; having the FCC assert that the new
definition of broadband is that it delivers a usable data rate in
excess of 1 Mbps, while narrowband delivers less, seems reasonable. That
said, the US is ~15th worldwide in broadband speed; Belgium, Ireland,
South Korea, Taiwan, and the UK seem to think that FTTH that can serve
multiple HDTVs simultaneously is normal.




Re: FCCs RFC for the Definition of Broadband

2009-08-25 Thread Bill Stewart
It's not a technical question, it's a political one, so feel free to
squelch this for off-topicness if you want.

Technically, broadband is "faster than narrowband", and beyond that
it's "fast enough for what you're trying to sell"; tell me what you're
trying to sell and I'll tell you how fast a connection you need.

If you're trying to sell email, VOIP, and lightly-graphical
web browsing, 64kbps is enough, and 128 is better.
If you're trying to sell wireless data excluding laptop tethering,
that's also fast enough for anything except maybe uploading hi-res
camera video.
If you're trying to sell talking-heads video conferencing, 128's
enough but 384's better.
If you're trying to sell internet radio, somewhere around 300 is
probably enough.
If you're trying to sell online gaming, you'll need to find a WoW
addict; I gather latency's a bit more of an issue than bandwidth for
most people.
If you're trying to sell home web servers - oh wait, they're not! -
100-300k's usually enough, unless you get slashdotted, in which case
you need 50-100Mbps for a couple of hours.
If you're trying to sell Youtube-quality video, 1 Mbps is enough,
3 Mbps is better.
If you're trying to sell television replacement, 10M's about enough
for one HD channel, 20's better, but the real question is what kind of
multicast upstream infrastructure you're using to manage the number of
channels you're selling, and whether you're price-competitive with
cable, satellite, or radio broadcast, and how well you get along with
your city and state regulators who'd like a piece of the action.

If what you're trying to sell is "the relevance of the FCC to the
Democratic political machines", the answer is measured in TV-hours,
newspaper-inches, and letters to Congresscritters, which isn't my
problem.
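
The "tell me what you're trying to sell" rule of thumb above can be written
down as a small lookup table. This is only a sketch: the figures are the
rough ones quoted in the message, the 1 Mbps = 1000 kbps conversion and the
names NEEDS_KBPS and services_supported are assumptions made for
illustration, and latency-bound services such as gaming are left out.

```python
# Rough per-service bandwidth figures taken from the message above.
# Each entry is (enough_kbps, better_kbps); None means no "better"
# figure was given. Names and structure are hypothetical.
NEEDS_KBPS = {
    "email, VoIP, light web": (64, 128),
    "talking-heads video conferencing": (128, 384),
    "internet radio": (300, None),
    "YouTube-quality video": (1000, 3000),
    "one HD television channel": (10000, 20000),
}


def services_supported(link_kbps):
    """Return the services whose 'enough' figure fits within link_kbps."""
    return [
        name
        for name, (enough, _better) in NEEDS_KBPS.items()
        if link_kbps >= enough
    ]


if __name__ == "__main__":
    for name in services_supported(1500):
        print(name)
```

A usably delivered 1500 kbps clears everything in the table except the HD
television entry, which is the point of the message: the threshold depends
on the product being sold.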



Re: Alternatives to storm-control on Cat 6509.

2009-08-25 Thread Mike Bartz
On Mon, Aug 24, 2009 at 4:59 PM, Nick Hilliard  wrote:

> On 24/08/2009 19:03, Holmes,David A wrote:
>
>> Additionally, and perhaps most significantly for deterministic network
>> design, the copper cards share input hardware buffers for every 8 ports.
>> Running one port of the 8 at wire speed will cause input drops on the
>> other 7 ports. Also, the cards connect to the older 32 Gbps shared bus.
>>
>
> IMO, a more serious problem with the 6148tx and 6548tx cards is the
> internal architecture, which is effectively six internal managed gigabit
> ethernet hubs (i.e. shared bus) with a 1M buffer per hub, and each hub
> connected with a single 1G uplink to a 32 gig backplane.  Ref:
>
>
>> http://www.cisco.com/en/US/products/hw/switches/ps700/products_tech_note09186a00801751d7.shtml#ASIC
>>
>
> In Cisco's own words: "These line cards are oversubscription cards that are
> designed to extend gigabit to the desktop and might not be ideal for server
> farm connectivity".   In other words, these cards are fine in their place,
> but they are not designed or suitable for data centre usage.
>
> I don't want to sound like I'm damning this card beyond redemption - it has
> a useful place in this world - but at the expense of reliability,
> manageability and configuration control, you will get useful features
> (including broadcast/unicast flood control) and in many situations very
> significantly better performance from a recent SRW 48-port linksys gig
> switch than from one of these cards.
>
> Nick
>
>
We experienced the joy of using the X6148 cards with a SAN/ESX cluster.
Lots of performance issues!  A fairly inexpensive solution was to switch to
the X6148A card instead, which does not suffer from the 8:1
oversubscription.  It also supports MTUs larger than 1500, which was
another shortcoming of the older card.
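
For reference, the 8:1 figure follows from the architecture described in the
quoted text: six internal hubs, each with a single 1G uplink, serving what is
assumed here to be 48 gigabit front-panel ports. A minimal sketch of the
arithmetic (the function name is hypothetical, and buffer effects are
ignored):

```python
def oversubscription(ports, port_gbps, uplinks, uplink_gbps):
    """Ratio of total front-panel capacity to total uplink capacity."""
    return (ports * port_gbps) / (uplinks * uplink_gbps)


# Assumed figures: 48 x 1G ports behind 6 x 1G internal uplinks.
ratio = oversubscription(ports=48, port_gbps=1, uplinks=6, uplink_gbps=1)

# Fair share per port if all 8 ports behind one uplink are busy at once.
fair_share_mbps = 1000 / ratio

if __name__ == "__main__":
    print(f"oversubscription {ratio:.0f}:1, "
          f"about {fair_share_mbps:.0f} Mbps per busy port")
```

In practice the shared input buffers mentioned earlier in the thread can make
things worse than this even split suggests.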

Mike



-- 
Mike Bartz
m...@bartzfamily.net


Re: SORBS?

2009-08-25 Thread Graeme Fowler
On Tue, 2009-08-25 at 09:35 -0500, Marc Powell wrote:
> I don't think they watch here; at least I've never seen Michelle post  
> here.

I've had confirmation from Michelle personally this morning (following a
similar question elsewhere) that the SORBS systems are indeed
relocating. From a previous message to SPAM-L (reproduced with
permission):

Michelle Sullivan wrote:
> SORBS is not closing.  SORBS has received 3 credible offers for the
> purchase of SORBS, one of which was not interested in continuing SORBS
> but obtaining the IP and spamtraps.  SORBS will not be accepting the
> latter offer.
> 
> Currently the two offers being considered are with anti-spam vendors
> and one of the two have indicated that they will not commercialise
> SORBS, but keep it as a community project.  The other anti-spam vendor
> have indicated they would pursue a split commercial model, where there
> would be a free service as well as a 'premium' service (how this would
> work I do not know).
> 
> An announcement about which company is successful will be forthcoming
> when necessary paperwork has been signed.
> 
> Small outages will occur in the central database when the servers are
> moved, this will NOT affect SORBS services globally, only updates
> (listing and delisting) and local (Au) services during the outages.

As inconvenient as this outage may be, its cause is one from which a
large proportion of this list probably bears scars: physical relocation.

On a related note, no I don't have any information as to who it is that
has taken SORBS on.

Regards,

Graeme




Re: SORBS?

2009-08-25 Thread trainier
Thanks for the replies.  I will use the mailing list if my issue doesn't 
get resolved.


Regards,

Tim R. Rainier
Systems Administrator II
Kalsec Inc.
www.kalsec.com

Marc Powell wrote on 08/25/2009 10:35:43 AM:

> From: Marc Powell
> To: NANOG list
> Date: 08/25/2009 10:36 AM
> Subject: Re: SORBS?
> 
> On Aug 25, 2009, at 8:40 AM, train...@kalsec.com wrote:
> 
> > I need a SORBS maintainer to contact me.
> 
> I don't think they watch here; at least I've never seen Michelle post
> here. Try dnsbl-users, the SORBS mailing list. From the google cache
> of the Mailing Lists page --
> 
> "This list is an open list where the SORBS DNSbl may be discussed. If 
> it is about the SORBS DNSbl it is on topic (including questions on how 
> to configure mailers to use SORBS). Currently this list is quiet, un- 
> moderated, and anyone is free to join. Non-members of the list are not 
> permitted to send mail to the list.
> 
> For people who don't know the meaning of "confirmed opt-in" ("double 
> opt-in" as most spammers call it), subscribe to this list and you will 
> see how it works.
> 
> Subscription is performed by sending a message to:
> majord...@dnsbl.sorbs.net with a message body of:
> subscribe dnsbl-users
> end
> "
> 
> 
> 




Re: SORBS?

2009-08-25 Thread Marc Powell


On Aug 25, 2009, at 8:40 AM, train...@kalsec.com wrote:


> I need a SORBS maintainer to contact me.


I don't think they watch here; at least I've never seen Michelle post
here. Try dnsbl-users, the SORBS mailing list. From the google cache
of the Mailing Lists page --


"This list is an open list where the SORBS DNSbl may be discussed. If  
it is about the SORBS DNSbl it is on topic (including questions on how  
to configure mailers to use SORBS). Currently this list is quiet, un- 
moderated, and anyone is free to join. Non-members of the list are not  
permitted to send mail to the list.


For people who don't know the meaning of "confirmed opt-in" ("double  
opt-in" as most spammers call it), subscribe to this list and you will  
see how it works.


Subscription is performed by sending a message to:
majord...@dnsbl.sorbs.net with a message body of:

subscribe dnsbl-users
end
"





Re: SORBS?

2009-08-25 Thread Jon Lewis

On Tue, 25 Aug 2009 train...@kalsec.com wrote:


> I need a SORBS maintainer to contact me.
>
> The SORBS site reports the site and databases are in maintenance mode for
> the second day in a row.  One of my domains was legitimately listed, but
> now that I've resolved the problem, I'm unable to request removal.


Based on info previously posted to the SORBS web site, I suspect SORBS may 
be in the middle of relocating their servers (changing hosting providers). 
If that's the case, I don't think you're going to have any luck getting 
changes made to the SORBS database until the move has been completed.


--
 Jon Lewis               |  I route
 Senior Network Engineer |  therefore you are
 Atlantic Net            |
_ http://www.lewis.org/~jlewis/pgp for PGP public key_



SORBS?

2009-08-25 Thread trainier
I need a SORBS maintainer to contact me.

The SORBS site reports the site and databases are in maintenance mode for 
the second day in a row.  One of my domains was legitimately listed, but 
now that I've resolved the problem, I'm unable to request removal.


Regards,

Tim R. Rainier
Systems Administrator II
Kalsec Inc.
www.kalsec.com



Re: Data Center testing

2009-08-25 Thread Jeff Aitken
On Mon, Aug 24, 2009 at 09:38:38AM -0400, Dan Snyder wrote:
> We have done power tests before and had no problem.  I guess I am looking
> for someone who does testing of the network equipment outside of just power
> tests.  We had an outage due to a configuration mistake that became apparent
> when a switch failed.  It didn't cause a problem however when we did a power
> test for the whole data center.

Dan,

With all due respect, if there are config changes being made to your 
devices that aren't authorized or in accordance with your standards (you
*do* have config standards, right?) then you don't have a testing problem,
you have a data integrity problem.  Periodically inducing failures to catch
them is sorta like using your smoke detector as an oven timer.

There are several tools that can help in this area; a good free one is
rancid [1], which logs in to your routers and collects copies of configs
and other info, all of which gets stored in a central repository.  By
default, you will be notified via email of any changes.  An even better
approach than scanning the hourly config diff emails is to develop scripts
that compare the *actual* state of the network with the *desired* state and
alert you if the two are not in sync.  Obviously this is more work because
you have to have some way of describing the desired state of the network in
machine-parsable format, but the benefit is that you know in pseudo-realtime
when something is wrong, as opposed to finding out the next time a device
fails.  Rancid diffs + tacacs logs will tell you who made the changes, and
with that info you can get at the root of the problem.
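
A comparison of *actual* against *desired* state, as described above, might
look roughly like the following. It is a minimal sketch under stated
assumptions: both states are taken to be already collected into
{device: {setting: value}} dictionaries, and the device names, settings, and
the diff_state function are hypothetical. It does not reflect how rancid
itself stores or reports anything.

```python
def diff_state(desired, actual):
    """Return a list of mismatches between two {device: {setting: value}}
    dictionaries, in sorted device/setting order."""
    problems = []
    for device in sorted(set(desired) | set(actual)):
        want = desired.get(device)
        have = actual.get(device)
        if want is None:
            problems.append(f"{device}: on the network but not in desired state")
            continue
        if have is None:
            problems.append(f"{device}: missing from collected state")
            continue
        for key in sorted(set(want) | set(have)):
            if want.get(key) != have.get(key):
                problems.append(
                    f"{device}: {key} is {have.get(key)!r}, "
                    f"expected {want.get(key)!r}"
                )
    return problems


# Hypothetical example data: sw2 has drifted from the standard.
desired = {
    "sw1": {"hsrp_priority": 110, "uplink_vlan": 10},
    "sw2": {"hsrp_priority": 100, "uplink_vlan": 10},
}
actual = {
    "sw1": {"hsrp_priority": 110, "uplink_vlan": 10},
    "sw2": {"hsrp_priority": 100, "uplink_vlan": 20},
}

if __name__ == "__main__":
    for line in diff_state(desired, actual):
        print(line)
```

Run from cron and wired to email or a pager, something like this gives the
pseudo-realtime alert; the hard part remains writing down the desired state.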

Having said that, every planned maintenance activity is an opportunity to
run through at least some failure cases.  If one of your providers is going
to take down a longhaul circuit, you can observe how traffic re-routes and
verify that your metrics and/or TE are doing what you expect.  Any time you
need to load new code on a device you can test that things fail over
appropriately.  Of course, you have to be willing to just shut the device
down without draining it first, but that's between you and your customers.
Link and/or device failures will generate routing events that could be used
to test convergence times across your network, etc.

The key is to be prepared.  The more instrumentation you have in place
prior to the test, the better you will be able to analyze the impact of the
failure.  An experienced operator can often tell right away when looking at
a bunch of MRTG graphs that "something doesn't look right", but that doesn't
tell you *what* is wrong.  There are tools (free and commercial) that can
help here, too.  Have a central syslog server and some kind of log reduction
tool in place.  Have beacons/probes deployed, in both the control and data
planes.  If you want to record, analyze, and even replay routing system
events, you might want to take a look at the Route Explorer product from
Packet Design [2].
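
As one illustration of what data-plane probes buy you during a test, the
sketch below turns a time-ordered list of probe results into outage windows,
which gives a rough convergence time for an induced failure. The sample
format (timestamp in milliseconds, success flag) and the outage_windows
function are assumptions for the example, not the output of any particular
tool, and the resolution is limited by the probe interval.

```python
def outage_windows(samples):
    """samples: time-ordered list of (timestamp_ms, ok) probe results.

    Returns a list of (start_ms, end_ms, duration_ms) tuples, one per run
    of failed probes. end_ms and duration_ms are None if the outage was
    still in progress at the last sample.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # first failed probe of a run
        elif ok and start is not None:
            windows.append((start, ts, ts - start))
            start = None                    # traffic is flowing again
    if start is not None:
        windows.append((start, None, None))
    return windows


# Hypothetical probe log: one probe every 500 ms, link pulled at ~1000 ms.
samples = [
    (0, True), (500, True),
    (1000, False), (1500, False), (2000, False),
    (2500, True), (3000, True),
]

if __name__ == "__main__":
    for start, end, duration in outage_windows(samples):
        print(f"outage from {start} ms to {end} ms ({duration} ms)")
```

Comparing these windows with syslog and routing events from the same period
helps show *what* was slow to converge, not just that something was.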

You said "switch failure" above, so I'm guessing that this doesn't apply
to you, but there are also good network simulation packages out there.
Cariden [3] and WANDL [4] can build models of your network based on actual
router configs and let you simulate the impact of various scenarios,
including device/link failures.  However, these tools are more appropriate
for design and planning than for catching configuration mistakes, so
they may not be what you're looking for in this case.


--Jeff


[1] http://www.shrubbery.net/rancid/
[2] http://www.packetdesign.com/products/rex.htm
[3] http://www.cariden.com/
[4] http://www.wandl.com/html/index.php