Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-28 Thread David Hiers
In general, it seems that a field has to be aware that it can kill (or
has killed) an embarrassing number of people before its members accept
the need for controls such as processes and checklists.

Here's a couple of incidents in which gruesome, public loss of life
was necessary for thought to triumph over ego:

Doctors took forever to get over their bad selves and adopt the
process of handwashing:
http://en.wikipedia.org/wiki/Ignaz_Semmelweis

Pilots discover humility and the value of checklists in managing complexity:
http://www.atchistory.org/History/checklst.htm

Reactor-rats, wing-wipers, barber-surgeons, and rocket-jockeys now
recognize that the best and brightest among us, polished with state of
the art education and training, ruthlessly drilled in the
fundamentals, and armed with the best processes and checklists, are
just barely good enough to have even-money odds when dealing with
everything the world can throw at them.

I suppose that once us packet-pushers kill enough people, the
economics of lost market share, falling stock prices, and embarrassed
CxOs on CNN will push us in that direction.   Until then, however,
Anarchy and Heroics (http://www.cert.org/archive/pdf/csi0711.pdf) sing
their siren song.



David



On Sat, Dec 26, 2009 at 4:24 PM, Robert Boyle rob...@tellurian.com wrote:
 At 02:08 AM 12/25/2009, Scott Howard wrote:

 On Thu, Dec 24, 2009 at 6:27 PM, George Bonser gbon...@seven.com wrote:

  So you can put a lot of process around changes in advance but there
  isn't quite as much to manage incidents that strike out of the clear
  blue.  Too much process at that point could impede progress in clearing
  the issue.  Capt. Sullenberger did not need to fill out an incident
  report, bring up a conference bridge, and give a detailed description of
  what was happening with his plane, the status of all subsystems, and his
  proposed plan of action (subject to consensus of those on the conference
  bridge) and get approval for deviation from his initial flight plan
  before he took the required actions to land the plane as best as he
  could under the circumstances.

 *mayday mayday mayday. **Cactus fifteen thirty nine hit birds, we've lost
 thrust (in/on) both engines we're turning back towards LaGuardia* - Capt.
 Sullenberger

 Not exactly detailed, but he definitely initiated an incident report
 (the mayday), gave a description of what was happening with his plane,
 the status of [the relevant] subsystems, and his proposed plan of action -
 even in the order you've asked for!

 His actions were then subject to the consensus of those on the conference
 bridge (ie, ATC) who could have denied his actions if they believed they
 would have made the situation worse (ie, if what they were proposing would
 have had them on a collision course with another plane). In this case, the
 conference bridge gave approval for his course of action (*ok uh, you
 need to return to LaGuardia? turn left heading of uh two two zero.* - ATC)

 Once he declared an emergency, he had the right of way over all other
 traffic. ATC would move anyone in his way out of the way.
 Under U.S. FAA FAR 91.3, Responsibility and authority of the pilot in
 command, the FAA declares:
   * (a) The pilot in command of an aircraft is directly responsible for, and
 is the final authority as to, the operation of that aircraft.
   * (b) In an in-flight emergency requiring immediate action, the pilot in
 command may deviate from any rule of this part to the extent required to
 meet that emergency.
   * (c) Each pilot in command who deviates from a rule under paragraph (b)
 of this section shall, upon the request of the Administrator, send a written
 report of that deviation to the Administrator.
 Just because we have checklists doesn't mean we can't think on our feet and
 handle situations not contemplated in checklists, but checklists and
 procedures exist to ensure we don't forget something we need to remember.
 They aren't a substitute for creativity and logical thought. They are an aid
 to it to ensure a minimum of creative thinking is needed to solve problems
 which shouldn't exist in the first place.

 -Robert
 SELMEL+I



 Well done is better than well said. - Benjamin Franklin






Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-28 Thread Bill Woodcock

The connection may not be immediately apparent, but I think Philip 
Greenspun's article critiquing Malcolm Gladwell's musings on cranial 
metrics etc. has some bearing:

   http://philip.greenspun.com/flying/foreign-airline-safety

...or is at least an interesting read.  In observing network operations 
screw-ups, I've seen a lot that were either caused by, or prolonged by, a 
culture-of-emergency.  Young guys drinking way too much coffee, working a 
service window at two in the morning, believing they've seen something 
that needs to be fixed, and winging it.  In building networks, I've tried 
very hard to engineer things such that the operating procedure for dealing 
with an emergency is to note its existence and place it in a work queue 
to be dealt with by people who are on a day shift, have just come in 
from a full night's sleep, and are working in a team with senior people 
who can assist with anything tricky, and make sure that junior folks are 
following procedures that have been worked out in advance by people who 
had plenty of time in a lab, and plenty of time to choose the best of many 
alternative procedures.

In my experience, reducing the frequency of emergencies is most beneficial 
in reducing the frequency of outages.  :-)

-Bill




Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-28 Thread Michael Sinatra

On 12/25/09 7:57 AM, Anton Kapela wrote:


What I'm getting at is that after following this thread for a while,
I'm not convinced any amount of process-borrowing is going to solve
problems better, faster, or even avoid them in the first place. At
best, our craft is 1/3rd as old (if that's somehow a measure of
maturity) as flight and nobody is being sued to settle 200+ accidental
deaths because of our mistakes.


So, we're supposed to make the mistakes of aviation, nuclear power, the 
chemical industry (i.e. Bhopal), oil production & refining, etc., all 
over again?


Checklists and MOPs are but one of the things we ignore from other 
industries.  Some others:


o Increasing complexity and tight coupling lead to systemic failures. 
Simply grafting redundancy onto complex systems can make them less, not 
more, reliable.  Yet this is the trend in networking.  Want bells and 
whistles, firewalls, load-balancers, rate-limiters in your network?  You 
can have 'em without sacrificing reliability if you just buy two of 'em!


o The gradual acceptance of components or procedures that have adequate 
reliability for a certain task (say, research) that are not reliable 
enough for another task (e.g. being a critical part of a 1,000 megawatt 
nuclear power plant) without understanding the implications.  Do we know 
how our technology is being used and will be used?  Will the people 
adopting IP for everything (the smart grid, VoIP, life-supporting 
functions) fail to see these implications just as the people who shoved 
a fissile core into a pressure vessel did?


This last point directly contradicts the theme of your message.  The 
notion that what we do is not (yet) a matter of life-or-death has bitten 
other industries in the past and it provides a nice illustration of why 
we should *not* be ignoring their lessons.


michael



Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-28 Thread Owen DeLong

On Dec 24, 2009, at 11:08 PM, Scott Howard wrote:

 On Thu, Dec 24, 2009 at 6:27 PM, George Bonser gbon...@seven.com wrote:
 
 So you can put a lot of process around changes in advance but there
 isn't quite as much to manage incidents that strike out of the clear
 blue.  Too much process at that point could impede progress in clearing
 the issue.  Capt. Sullenberger did not need to fill out an incident
 report, bring up a conference bridge, and give a detailed description of
 what was happening with his plane, the status of all subsystems, and his
 proposed plan of action (subject to consensus of those on the conference
 bridge) and get approval for deviation from his initial flight plan
 before he took the required actions to land the plane as best as he
 could under the circumstances.
 
 
 
 *mayday mayday mayday. **Cactus fifteen thirty nine hit birds, we've lost
 thrust (in/on) both engines we're turning back towards LaGuardia* - Capt.
 Sullenberger
 
 Not exactly detailed, but he definitely initiated an incident report
 (the mayday), gave a description of what was happening with his plane, the
 status of [the relevant] subsystems, and his proposed plan of action -
 even in the order you've asked for!
 
Exactly.

 His actions were then subject to the consensus of those on the conference
 bridge (ie, ATC) who could have denied his actions if they believed they
 would have made the situation worse (ie, if what they were proposing would
 have had them on a collision course with another plane). In this case, the
 conference bridge gave approval for his course of action (*ok uh, you need
 to return to LaGuardia? turn left heading of uh two two zero.* - ATC)
 
Not exactly.  If the others on the bridge don't consent, FAR 91.3 gives him
full and absolute authority to tell them to screw themselves and do what he
feels is best.

FAR 91.3 reads:

Responsibility and authority of the pilot in command.

(a) The pilot in command of an aircraft is directly responsible for, and is
the final authority as to, the operation of that aircraft.

(b) In an in-flight emergency requiring immediate action, the pilot in command
may deviate from any rule of this part to the extent required to meet that
emergency.

(c) Each pilot in command who deviates from a rule under paragraph (b) of this
section shall, upon the request of the Administrator, send a written report of
that deviation to the Administrator.

As near as I can tell, that regulation was last modified in 1963.

 5 seconds before they made the above call they were reaching for the QRH
 (Quick Reference Handbook), which contains checklists of the steps to take
 in such a situation - including what to do in the event of loss of both
 engines due to multiple birdstrikes.  They had no need to confer with others
 as to what actions to take to try and recover from the problem, or what
 order to take them in, because that pre-work had already been carried out
 when the check-lists were written.
 
Yep.

 Of course, at the end of the day, training, skill and experience played a
 very large part in what transpired - but so did the actions of the people on
 the conference bridge (You can't get much more of a conference bridge
 than open radio frequencies), and the checklists they have for almost every
 conceivable situation.
 

And in case there are any misconceptions here on the list, I know that in the
public eye, there is often a lot of distrust and/or perceived animosity between
controllers and pilots.  Frankly, this is a misconception for the most part.
Sure, there are incidents where pilots and controllers don't get along, each
blaming the other.  However, by and large, both groups are consummate
professionals doing their best to make sure flights end well.  In my years as
a pilot, I have
had more than one occasion to be very thankful for ATC and the services they
provide. Generally, they are a very helpful and hardworking group.  I respect
them greatly and appreciate the tough job they do.

Owen
(Commercial Pilot, ASEL, Instrument Airplane)
 


Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-28 Thread Owen DeLong

On Dec 25, 2009, at 7:57 AM, Anton Kapela wrote:

 On Fri, Dec 25, 2009 at 5:44 AM, Vadim Antonov a...@kotovnik.com wrote:
 
 The ISP industry has a long way to go until it reaches the same level of
 sophistication in handling problems as aviation has.
 
 It seems that there's a logical fallacy floating around somewhere
 (networks have parts and are complicated, airplanes and flight involve
 lots of parts and are also complicated, therefore aircraft are like
 networks). I assert that comparing 'packet switching' to an industry
 that has its roots in the late 1800's and had its first hello world
 moment in 1903 isn't terribly fruitful.
 
As someone with a fair amount of experience with both, I have to
disagree with you.  Yes, there are differences, and, yes you have
to keep comparisons and the like in perspective, but, there are
definitely areas where networking could learn from aviation, and,
to some extent, vice versa.

 Further, aircraft are the asymptotic limit of 'singly homed transit.'
 Because of this, I think one could argue that pilots and ATC must be
 held to a different professional standard due to the nature of public
 trust at risk.  At the other end of our strawman spectrum, we have end
 users who must accept the risk that their provider will be unable to
 connect them to lolcats.com on occasion, perhaps as often as 0.01% per
 year, and most are happy to accept this. Four nines survivability on
 flights, clearly, won't work.
 
Correct... As I stated in my earliest posts on this subject, while there
is value to be obtained in looking at how aviation has improved its
safety/reliability record over the years, there is also value in recognizing
the cost/benefit ratio of some of those improvements.

If you draw a graph with one curve arcing from bottom left towards
upper right, steepening as it goes to the right, that line can be thought
of as the amount of cost of achieving additional reliability.

A second curve sloping from top left to bottom right, flattening out
as it goes to the right can be thought of as the gains achieved from
those additional 9s of reliability.

Finally, the point where those two curves intersect is defined by
the cost of outages and/or downtime.

Interestingly, this same diagram will be familiar to most pilots,
but, the two arcs will be induced drag (drag from producing lift)
and parasite drag (drag from friction with the air). The point where
they meet is called L/D Max and is the airspeed at which the
given aircraft will achieve its best glide ratio.
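
The cost/benefit crossover described above can be sketched numerically. This
is only an illustration under stated assumptions: the yearly downtime for a
given number of nines is plain arithmetic, but the cost curve (base_cost and
growth) and the outage cost per hour are hypothetical values picked to show
the shape of the argument, not figures from this thread.

```python
HOURS_PER_YEAR = 8760.0


def downtime_hours(nines: int) -> float:
    """Expected downtime per year at the given number of nines of availability."""
    return HOURS_PER_YEAR * 10.0 ** (-nines)


def marginal_gain(nines: int, outage_cost_per_hour: float) -> float:
    """Yearly outage cost avoided by going from (nines - 1) to nines.

    Each extra nine saves a tenth of what the previous one saved, so this
    curve falls and flattens.
    """
    saved_hours = downtime_hours(nines - 1) - downtime_hours(nines)
    return saved_hours * outage_cost_per_hour


def marginal_cost(nines: int, base_cost: float = 5_000.0, growth: float = 4.0) -> float:
    """Assumed yearly cost of adding the nth nine; steepens as nines grow."""
    return base_cost * growth ** nines


def worthwhile_nines(outage_cost_per_hour: float, max_nines: int = 9) -> int:
    """Largest number of nines for which each added nine still pays for itself."""
    best = 0
    for n in range(1, max_nines + 1):
        if marginal_gain(n, outage_cost_per_hour) < marginal_cost(n):
            break
        best = n
    return best


if __name__ == "__main__":
    for cost_per_hour in (1_000.0, 100_000.0):
        print(cost_per_hour, worthwhile_nines(cost_per_hour))
```

With these made-up curves, raising the cost of an outage moves the crossover
point to the right, which is the sense in which the intersection is defined by
the cost of downtime.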

 What I'm getting at is that after following this thread for a while,
 I'm not convinced any amount of process-borrowing is going to solve
 problems better, faster, or even avoid them in the first place. At
 best, our craft is 1/3rd as old (if that's somehow a measure of
 maturity) as flight and nobody is being sued to settle 200+ accidental
 deaths because of our mistakes.
 
There are lessons to be learned that are valuable.  Both from
things aviation has done well that we could emulate, and, from
things aviation has done poorly that we should avoid.  There
are also additional lessons to be learned about the differences
in cost/benefit analysis between the two disciplines.

Owen




Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-28 Thread Robert Boyle

At 03:38 PM 12/28/2009, Owen DeLong wrote:

There are lessons to be learned that are valuable.  Both from
things aviation has done well that we could emulate, and, from
things aviation has done poorly that we should avoid.  There
are also additional lessons to be learned about the differences
in cost/benefit analysis between the two disciplines.


Agreed. You have to learn from the mistakes of others because you 
won't live long enough to make them all yourself. -Admiral Rickover


-Robert



Well done is better than well said. - Benjamin Franklin




Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-26 Thread Robert Boyle

At 02:08 AM 12/25/2009, Scott Howard wrote:

On Thu, Dec 24, 2009 at 6:27 PM, George Bonser gbon...@seven.com wrote:

 So you can put a lot of process around changes in advance but there
 isn't quite as much to manage incidents that strike out of the clear
 blue.  Too much process at that point could impede progress in clearing
 the issue.  Capt. Sullenberger did not need to fill out an incident
 report, bring up a conference bridge, and give a detailed description of
 what was happening with his plane, the status of all subsystems, and his
 proposed plan of action (subject to consensus of those on the conference
 bridge) and get approval for deviation from his initial flight plan
 before he took the required actions to land the plane as best as he
 could under the circumstances.

*mayday mayday mayday. **Cactus fifteen thirty nine hit birds, we've lost
thrust (in/on) both engines we're turning back towards LaGuardia* - Capt.
Sullenberger

Not exactly detailed, but he definitely initiated an incident report
(the mayday), gave a description of what was happening with his plane, the
status of [the relevant] subsystems, and his proposed plan of action -
even in the order you've asked for!

His actions were then subject to the consensus of those on the conference
bridge (ie, ATC) who could have denied his actions if they believed they
would have made the situation worse (ie, if what they were proposing would
have had them on a collision course with another plane). In this case, the
conference bridge gave approval for his course of action (*ok uh, you need
to return to LaGuardia? turn left heading of uh two two zero.* - ATC)


Once he declared an emergency, he had the right of way over all other 
traffic. ATC would move anyone in his way out of the way.
Under U.S. FAA FAR 91.3, Responsibility and authority of the pilot in 
command, the FAA declares:
   * (a) The pilot in command of an aircraft is directly responsible 
for, and is the final authority as to, the operation of that aircraft.
   * (b) In an in-flight emergency requiring immediate action, the 
pilot in command may deviate from any rule of this part to the extent 
required to meet that emergency.
   * (c) Each pilot in command who deviates from a rule under 
paragraph (b) of this section shall, upon the request of the 
Administrator, send a written report of that deviation to the Administrator.
Just because we have checklists doesn't mean we can't think on our 
feet and handle situations not contemplated in checklists, but 
checklists and procedures exist to ensure we don't forget something 
we need to remember. They aren't a substitute for creativity and 
logical thought. They are an aid to it to ensure a minimum of 
creative thinking is needed to solve problems which shouldn't exist 
in the first place.


-Robert
SELMEL+I



Well done is better than well said. - Benjamin Franklin




RE: Revisiting the Aviation Safety vs. Networking discussion

2009-12-25 Thread Vadim Antonov

Just clearing a small point about pilots (I'm a pilot) - the
pilot-in-command has ultimate responsibility for his a/c and can ignore
whatever ATC tells him to do if he considers that to be contrary to the
safety of his flight (he may be asked to explain his actions later,
though). Now, usually ignoring ATC or keeping it in the dark about one's
intentions is not very clever - but dispatchers are not in the cockpit and 
may misunderstand the situation or be simply mistaken about something (so 
a pilot is encouraged to decline ATC instructions he considers to be in 
error - informing ATC about it, of course).

But one of the first things a pilot does in an emergency is pull out
the appropriate emergency checklist.  It is kind of hard to keep from 
forgetting to check obvious things when things get hectic (one of the 
distressingly common causes of accidents is simply running out of fuel - 
either because the pilot didn't do his homework on the ground (checking the
actual fuel level in the tanks, etc.) or because when the engine suddenly got
quiet he forgot to switch to another, non-empty, tank).

The mantra about priorities in both normal and emergency situations is 
Aviate-Navigate-Communicate meaning that maintaining control of a/c 
always comes first, no matter what. Knowing where you are and where you 
are going (and other pertinent situational awareness such as condition of 
the a/c and current plan of actions) come second.  Talking is lowest 
priority.

The pre-planned emergency checklists may be a good idea for network
operators.  Try obvious (when you're calm, that is) actions first, if
they fail to help, try to limit damage.  Only then go file the ticket and
talk to people who can investigate situation in depth and can develop a 
fix.
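
The ordering suggested above (obvious actions first, then damage limitation,
and only then the ticket) can be sketched as a small runner. This is a
hypothetical illustration, not an existing tool: the Step type, the
run_emergency_checklist function, and the file_ticket callback are all names
invented here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # returns True if the step had the intended effect


def run_emergency_checklist(
    obvious_fixes: List[Step],
    damage_limiters: List[Step],
    file_ticket: Callable[[str, List[str]], None],
) -> str:
    """Run a pre-planned checklist and return 'resolved', 'contained' or 'uncontained'."""
    log: List[str] = []

    # Phase 1: the obvious actions, in the order worked out while calm.
    for step in obvious_fixes:
        worked = step.action()
        log.append(f"fix: {step.description} -> {'worked' if worked else 'no effect'}")
        if worked:
            file_ticket("resolved", log)
            return "resolved"

    # Phase 2: nothing fixed it, so try to limit the damage.
    contained = False
    for step in damage_limiters:
        worked = step.action()
        log.append(f"limit: {step.description} -> {'worked' if worked else 'no effect'}")
        contained = contained or worked

    # Phase 3: only now hand over to the people who can investigate in depth,
    # along with a record of what was already tried.
    outcome = "contained" if contained else "uncontained"
    file_ticket(outcome, log)
    return outcome
```

The log handed to file_ticket doubles as raw material for the debrief, since
it records which steps were tried and in what order.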

The way the aviation industry comes up with these checklists is, basically,
experience - it pays to debrief after recovery from every problem not
adequately fixed by existing procedures, find the common ones, and develop
a diagnostic procedure one can follow step-by-step in these situations. 
(The non-punitive error or incident reporting which actually shields 
pilots from FAA enforcement actions in most cases also helps to collect
real-world information on where and how pilots get into trouble).

The all-too-common multistep ticket escalation chains (which merely work
as delay lines in a significant portion of cases) are something to be
avoided.

Even better is to provide some drilling in diagnosis of and recovery from
common problems to the front-line personnel - starting from following the 
checklist on a simulated outage in the lab, and then getting it down to
what pilots call the "flow" - a habitual memorized procedure, which is 
performed first and then checked against the checklist.

Note that the use of checklists, drilling, and flows does not turn pilots 
into robots - they still have to make decisions, and recognize and deal 
with situations not covered in the standard procedures; what it does is 
speed up dealing with common tasks, reduce mistakes, and free up 
mental processing for thinking ahead.

The ISP industry has a long way to go until it reaches the same level of 
sophistication in handling problems as aviation has.

--vadim

On Fri, 25 Dec 2009, George Bonser wrote:

 I think any network engineer who sees a major problem is going to have a
 "Houston, we have a problem" moment.  And actually, he was telling the
 ATC what he was going to need to do, he wasn't getting permission so
 much as telling them what he was doing so traffic could be cleared out
 of his way. First he told them he was returning to the airport, then he
 inquired about Teterboro, the ATC called Teterboro to get a runway
 and inform them of an inbound emergency, then the Captain told the ATC
 they were going to be in the Hudson.  And "I hit birds, have lost both
 engines, and am turning back" results in a whole different chain of
 events these days than "I have two guys banging on the cockpit door and
 am returning" or simply turning back toward the airport with no
 communication.  And any network engineer is going to say something if he
 sees CPU or bandwidth utilization hit the rail in either direction.
 Saying something like "we just got flooded with thousands of /24 and
 smaller wildly flapping routes from peer X and I am shutting off the BGP
 session until they get their stuff straight" is different than "we just
 got flooded with thousands of routes and it is blowing up the router and
 all the other routers talking to it.  Can I do something about it?"
 
  
 
 And that illustrates a point that is key.  In that case the ATC was
 asking what the pilot needed and was prepared to clear traffic, get
 emergency equipment prepared, whatever it took to get that person
 dealing with the problem whatever they needed to get it resolved in the
 best way forward.  The ATC isn't asking him if he was sure he set the
 flaps at the right angle and "did you try to restart the engine" sorts
 of things.
 
  
 
 What I 

RE: Revisiting the Aviation Safety vs. Networking discussion

2009-12-25 Thread Mikael Abrahamsson

On Fri, 25 Dec 2009, Vadim Antonov wrote:

The ISP industry has a long way to go until it reaches the same level of 
sophistication in handling problems as aviation has.


Well, to counter this one might talk about the medical business (doctors) 
which hasn't been able to embrace the checklists at all (apart from in a 
few places), and they still consider their profession to be a craft, just 
like most network engineers do.


It's the classical good/fast/cheap, please pick two. Aviation is 
slow/careful to bring in new tech, same with the health care side, they're 
both very conservative. We in the network business are still immature but 
quick and flexible, but as time goes on, our services are more and more 
important, and thus things settle in and slow down, but become more 
reliable. This is an evolution that'll take quite some time, but it's 
already changed a lot in the past 10 years.


There was quite a buzz regarding doctor checklists a few years back, I 
read several articles about it, but now I can't find the one I want to 
find, but http://www.healthbeatblog.org/2007/12/pilots-use-chec.html 
talks a bit about the topic.


--
Mikael Abrahamsson    email: swm...@swm.pp.se



Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-25 Thread Anton Kapela
On Fri, Dec 25, 2009 at 5:44 AM, Vadim Antonov a...@kotovnik.com wrote:

 The ISP industry has a long way to go until it reaches the same level of
 sophistication in handling problems as aviation has.

It seems that there's a logical fallacy floating around somewhere
(networks have parts and are complicated, airplanes and flight involve
lots of parts and are also complicated, therefore aircraft are like
networks). I assert that comparing 'packet switching' to an industry
that has its roots in the late 1800's and had its first hello world
moment in 1903 isn't terribly fruitful.

Further, aircraft are the asymptotic limit of 'singly homed transit.'
Because of this, I think one could argue that pilots and ATC must be
held to a different professional standard due to the nature of public
trust at risk.  At the other end of our strawman spectrum, we have end
users who must accept the risk that their provider will be unable to
connect them to lolcats.com on occasion, perhaps as often as 0.01% per
year, and most are happy to accept this. Four nines survivability on
flights, clearly, won't work.
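
To put rough numbers on the comparison above: 0.01% unavailability is "four
nines". The short sketch below converts a number of nines into allowed
downtime per year and into expected failures over a number of independent
events. The figure of 10,000 flights is a hypothetical round number chosen
for illustration only, not a traffic statistic.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes per year a service may be down at the given number of nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


def expected_failures(events: int, nines: int) -> float:
    """Expected failures if each event independently succeeds with that many nines."""
    return events * 10.0 ** (-nines)


if __name__ == "__main__":
    # Four nines for a web site: a bit under an hour of downtime per year.
    print(f"{allowed_downtime_minutes(4):.2f} minutes/year")
    # Four nines per flight over a hypothetical 10,000 flights.
    print(f"{expected_failures(10_000, 4):.2f} expected losses")
```

Under that assumption, four nines costs a web user under an hour a year, but
costs an airline about one aircraft per 10,000 flights.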

What I'm getting at is that after following this thread for a while,
I'm not convinced any amount of process-borrowing is going to solve
problems better, faster, or even avoid them in the first place. At
best, our craft is 1/3rd as old (if that's somehow a measure of
maturity) as flight and nobody is being sued to settle 200+ accidental
deaths because of our mistakes.

-Tk



Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-25 Thread Joe Provo
On Thu, Dec 24, 2009 at 01:09:26PM -0500, Randy Bush wrote:
  I _do_ create action plans and _do_ quarterback each step and _do_
  slap down any attempt to deviate.
 
 imagine a network engineering culture where the concept of 'attempt to
 deviate' just does not occur.

Whimsical deviations don't belong in the maint execution, they belong 
in the brainstorming and design.  Gather more points of view during 
the peer review of the specification of work.  In my experience, good 
engineering makes for bad drama (and conversely if it is a dramatic 
save then you have a bad engineer and likely a cowboy).  Have a plan 
that executes in stages, tests at checkpoints where partial completion
is possible, and a fallback for each step.  A great way to train up 
junior people, document as you go, expose flaws and lines of future 
investigation, and if things go south you escalate to those who can 
judge *reasonable* new directions.
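
A plan of the shape described above (stages, a test at each checkpoint, and a
fallback per step) might be sketched as follows. The Stage type and
execute_plan function are hypothetical names used for illustration; a real
maintenance would put device commands and monitoring checks behind the
callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    name: str
    apply: Callable[[], None]     # make the change for this stage
    check: Callable[[], bool]     # checkpoint test: did the stage work?
    fallback: Callable[[], None]  # undo this stage only


def execute_plan(stages: List[Stage]) -> Tuple[List[str], Optional[str]]:
    """Run stages in order; stop at the first failed checkpoint.

    Returns the names of the stages left in place and the name of the stage
    that failed (None if every checkpoint passed). Earlier stages are kept,
    so the work rests at the last good checkpoint while someone senior
    decides what to do next.
    """
    completed: List[str] = []
    for stage in stages:
        stage.apply()
        if not stage.check():
            stage.fallback()
            return completed, stage.name
        completed.append(stage.name)
    return completed, None
```

A non-None second element is the cue to escalate rather than improvise.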

To me, that kind of change management for non-automatable work is a 
descendant of reasonable group work.  If you have project-oriented 
autonomous teams that stick to the guideposts of your standards and
minimal disruptions/maximal uptime then good work will emerge.  As 
for automation, that enables your expensive humans to do more smart 
things so should always be incorporated in processes and be something
people move toward, IMO.

Cheers,

Joe

-- 
 RSUC / GweepNet / Spunk / FnB / Usenix / SAGE



Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-25 Thread bross

On Thu, 24 Dec 2009, Scott Howard wrote:


His actions were then subject to the consensus of those on the conference
bridge (ie, ATC) who could have denied his actions if they believed they
would have made the situation worse (ie, if what they were proposing would
have had them on a collision course with another plane).


This has been mentioned by others in this thread, but not to the level of 
importance I think it represents.  I, too, am a pilot.  The pilot in 
command of an aircraft always has the final say on the safety of the 
flight, not the controller, and not the design engineers.  If the pilot in 
command violates the rules and the result is negative (crash, loss of 
separation, etc.) you better believe there will be questions to be 
answered and a possible loss of the pilot's license (or life!) may result. 
On the other hand if the pilot's decision to violate the rules results in 
a positive outcome (see Sullenberger or any other number of emergencies 
that happen every day that you never hear about) there will often never 
even be a single question about why the rules were violated.


This can be applied directly to network engineering work.  If I assign an 
engineer to do a network change, yes, they better have a 
plan/checklist/etc. before they start and they better follow it.  When 
things go wrong, I expect that engineer to make the right decisions to 
minimize the damage.  Sometimes that means following the rules to the 
letter.  Sometimes that means breaking the rules.  If the rules are 
broken, there darn better be a good reason for it, but frankly, a good 
engineer will always have a good explanation, just like the good pilot.


Rigid procedures are no better than the lack of procedures.  Process is 
very important, don't get me wrong, but so is the knowledge and experience 
to know when you should throw them out the door.  Any organization that 
doesn't recognize that is doomed to inefficiency at best, and failure at 
worst.


--
Brandon Ross                              AIM:  BrandonNRoss
Director of Network Engineering           ICQ:  2269442
Xiocom Wireless    Skype:  brandonross  Yahoo:  BrandonNRoss



RE: Revisiting the Aviation Safety vs. Networking discussion

2009-12-25 Thread George Bonser
 What I'm getting at is that after following this thread for a while,
 I'm not convinced any amount of process-borrowing is going to solve
 problems better, faster, or even avoid them in the first place. At
 best, our craft is 1/3rd as old (if that's somehow a measure of
 maturity) as flight and nobody is being sued to settle 200+ accidental
 deaths because of our mistakes.
 
 -Tk

Not now, that is true, but when you look at things that are on the
drawing board such as systems designed to manage automobile traffic
flows, networks that are used to fly UAVs, networks that keep track of
friendly units in combat where the technology might someday migrate to
civilian law enforcement and/or emergency services (keeping track of
where firefighters are in a building or at a wildfire, for example), I
can see situations in the future where people's lives could be dependent
on networks working properly, or at least endangered if a network fails.


But my original intent was to point out that there are two kinds of
process for two different kinds of circumstances and the sort of process
surrounding routine changes might not be the best process for handling
emergency changes. I have seen examples of places that want to handle
emergency changes with the same sort of process they use for routine
changes and those places can be frustrating to work with when stuff is
broken. My goal was to give managers of networks who might read this the
idea that when the fan is in an unsavory condition, more can get done by
shifting from a mode of questioning, analyzing and second-guessing
everything the engineer is doing to a mode where the organization is
responding to immediate needs, clearing obstacles out of the way, and
documenting as best they can what is done and when, to make the
debriefing afterwards easier. AFTER the incident is the time to go over
what was done, think about how it was dealt with, consider any changes
in emergency process that might have shortened the duration, etc.

In fact the "What could we have done differently that would have
shortened the duration of the outage?" question is pretty important.  The
answer might be "nothing", and that is ok, too, but the question should
be asked.




RE: Revisiting the Aviation Safety vs. Networking discussion

2009-12-25 Thread Frank Bulk
Shops where engineering and operations function separately can suffer from
reduced efficiencies.  A recent example comes to mind.  Vendor X was onsite
turning up some equipment, including a small VPN concentrator for remote
access.  It was a new model of VPN concentrator that the installers hadn't
worked with before.  They used scripts, a set of CLI commands with
field-replaceable variables for site specific parameters, to configure the
device.  But connections to the VPN were failing.  After trying different
versions of the scripts (for similar models) they broke down and called
their internal tech support department for help.  Total turn-up time for the
concentrator: 8+ hours.  There wasn't that much wrong with the script that
kept it from working, but the ops folks lacked the training to understand
the problems and fix them.  On the other hand, the engineering folks should
probably have produced a more robust set of scripts.
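
A minimal sketch of what "more robust" could look like for a script with
site-specific variables, assuming the template is rendered by a small
program rather than edited by hand.  The CLI lines, variable names, and
addresses below are hypothetical placeholders, not the syntax of any real
VPN concentrator; the point is only that a missing or blank variable
fails loudly before anything reaches the device.

```python
from string import Template

# Hypothetical turn-up template; these lines are illustrative only and
# do not match any particular vendor's CLI.
TEMPLATE = Template(
    "hostname $hostname\n"
    "interface outside address $outside_ip\n"
    "vpn pool $pool_start $pool_end\n"
)

# Every site-specific variable the template expects (assumed names).
REQUIRED = ("hostname", "outside_ip", "pool_start", "pool_end")


def render_config(site):
    """Render the script, refusing to proceed with missing or blank variables."""
    missing = [name for name in REQUIRED if not str(site.get(name, "")).strip()]
    if missing:
        raise ValueError("missing site variables: " + ", ".join(missing))
    # substitute() raises KeyError on any unfilled placeholder, unlike
    # safe_substitute(), which would silently leave "$name" in the output.
    return TEMPLATE.substitute(site)


if __name__ == "__main__":
    print(render_config({
        "hostname": "vpn-site42",
        "outside_ip": "192.0.2.10",
        "pool_start": "10.42.0.10",
        "pool_end": "10.42.0.50",
    }))
```

An installer who forgets a value gets an error naming the variable
instead of a half-configured box and an 8-hour troubleshooting session.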

While having no experience myself, it would seem a good practice that every
project, including the actual turn-up, include representation from
engineering.  This automatically creates a liaison between the two groups
and keeps the engineer abreast of real world issues.  

Frank

-Original Message-
From: Michael Dillon [mailto:wavetos...@googlemail.com] 
Sent: Thursday, December 24, 2009 6:02 PM
To: NANOG list
Subject: Re: Revisiting the Aviation Safety vs. Networking discussion

 imagine a network engineering culture where the concept of 'attempt to
 deviate' just does not occur.

 Are you trying to suggest that this is something horrible, or that it's
the future of network engineering? :)

The model of network engineering that grew up during the 1990s is
forever gone unless you work
in a smaller organization where people have to wear many hats. In the
big ISPs, now identical to
the big telcos, operations and engineering design duties are
separated. The operations folks
do not deviate from the written plans that they work with. If the
slightest thing happens that is not
in the plan, they roll back the changes as specified in the plan. They
don't fix anything unless it
is officially broken with trouble tickets filed and escalations up to
senior management. That is
about the only time that operations people can get away with taking
shortcuts and creative solutions.

On the other hand, the engineering design folks should spend a good
part of their day trying out
things, thinking up new ideas, poking around equipment and software to
see how far it can be pushed.
Then, when they have learned something and are ready to implement it
in the network, they write
a detailed plan for operations. Then some other engineering folks test
the heck out of that design
to try and find fault with it. After all the faults are fixed, it goes
to operations and the engineering
design folks move on to something else unless serious problems occur
and operations needs
a design engineer to approve some sensible action to be taken. The
operations folk can't take
the sensible action because that would deviate from their plans, but
getting engineering design
folks involved gives them an out for real emergencies.

So the term "network engineering" is ambiguous because a lot of people
use it to mean the 90's
style job where engineering design activity and operational activity
were all jumbled together.

In some companies, taking the engineering design track not only means
that you lose enable
on the routers, but you lose all TACACS access and have to get
authorisation from a VP just
to ask for a copy of the running config on a production router. Some
people like ops because
they see a lot of stuff go by and learn from it, get their CCIE and
move into design engineering.
Others like ops because they are scared of the responsibility for
thinking up what to do next,
and making a mistake.

As far as I can see, the only way to get a job that mixes ops and
design is to be in 3rd or 4th
level support which is the top of the technical escalation chain where
a few excellent design
engineers do have enable on the routers because they fix important
problems in near realtime.
I suspect that it would be advantageous to have a career in which you
worked for a while in
ops before moving into design engineering if you want to get into
top-level support.

Take all this with a grain of salt. Every company does things a bit
different, and the terminology
that is used is ambiguous. It would be interesting to see what others
have to say about this
answer.

--Michael Dillon





RE: Revisiting the Aviation Safety vs. Networking discussion

2009-12-25 Thread Vadim Antonov

 I can see situations in the future where people's lives could be
 dependent on networks working properly, or at least endangered if a
 network fails.

Actually it's not the future. Since the 70s, my father's design bureau was
making hardware (including network stuff) for running industrial
processes of a kind where a software crash or a network malfunction was
usually associated with casualties.  Gas pipelines, power plants, electric
grids, stuff like that.

That's a completely different class of hardware, more of a kind you'd find
in avionics - modules in triplicate, voting, pervasive error correction,
etc.  Software was also designed differently, with a lot more review
processes, and with data structures designed for integrity checking (I
still use this trick in my work, which saves me a lot of grief during
debugging) and recovery from memory corruption and such.
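
To make the two ideas above concrete, here is a rough sketch (mine, not
the design bureau's actual code) of a record that carries its own
checksum and of a two-out-of-three vote across redundant modules.  The
function names and the choice of CRC32 are assumptions for illustration;
real life-critical systems use stronger codes and do this in hardware.

```python
import zlib
from collections import Counter


def seal(payload: bytes) -> bytes:
    """Append a CRC32 so later corruption of the record can be detected."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unseal(record: bytes) -> bytes:
    """Return the payload, or raise if the stored checksum no longer matches."""
    payload, stored = record[:-4], record[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != stored:
        raise ValueError("integrity check failed")
    return payload


def vote(a, b, c):
    """Majority vote across the outputs of three redundant modules."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among modules")
    return value
```

Checking the record on every read turns silent memory corruption into an
immediate, debuggable error, and the vote masks a single faulty module.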

I'd be seriously loath to put any of the current crop of COTS network
boxes into a life-critical network.

--vadim




Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread Eddy Martinez
On Dec 24, 2009, at 9:51 AM, Randy Bush wrote:

 I'm more persistent than smart, and I tell ya, if you prep well
 enough, you can hand your checklist to a stoned intern and you'll
 have no worries at all.
 
 this works in a tech culture where folk follow mops obsessively.  my
 experience is that most north american engineers are too smart to do
 that, and take shortcuts.
 
 randy
 

Being a North American Engineer, I resent that remark.  =]

I _do_ create action plans and _do_ quarterback each step and _do_ slap down 
any attempt to deviate. 


Eddy





Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread Randy Bush
 I _do_ create action plans and _do_ quarterback each step and _do_
 slap down any attempt to deviate.

imagine a network engineering culture where the concept of 'attempt to
deviate' just does not occur.

randy



Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread Eddy Martinez
On Dec 24, 2009, at 10:09 AM, Randy Bush wrote:

 I _do_ create action plans and _do_ quarterback each step and _do_
 slap down any attempt to deviate.
 
 imagine a network engineering culture where the concept of 'attempt to
 deviate' just does not occur.
 
 randy


=]

The networking group is under control. 

It's the software engineers that start making edits to configs and code on the
fly, improvisation at its finest. I guess my scope of interaction is greater
than just networking. The hard part is that it's a peer situation: how do you
elevate the members of another team who have a lesser standard of operation?
Also, they feel it's fine to act like a cowboy and tackle problems on the fly,
as long as the product is live before the window closes. Then there is the
almighty "We can't back out, we already made too many changes" that makes me
want to grab rope and attach it to the ceiling.

Have a Merry Christmas, 
Eddy 






Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread Jim Shankland

Eddy Martinez wrote:

On Dec 24, 2009, at 10:09 AM, Randy Bush wrote:


I _do_ create action plans and _do_ quarterback each step and _do_
slap down any attempt to deviate.

imagine a network engineering culture where the concept of 'attempt to
deviate' just does not occur.


I find the thought of *any* culture in which attempts to deviate
just do not occur a little unnerving.

Jim Shankland

http://blog.oliver-gassner.de/archives/225-Guenter-Eich,-Traeume.html



Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread David Andersen
On Dec 24, 2009, at 1:09 PM, Randy Bush wrote:

 I _do_ create action plans and _do_ quarterback each step and _do_
 slap down any attempt to deviate.
 
 imagine a network engineering culture where the concept of 'attempt to
 deviate' just does not occur.

Are you trying to suggest that this is something horrible, or that it's the 
future of network engineering? :)

I'm actually serious in asking the question, despite the grin.

  -Dave


Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread Randy Bush
 imagine a network engineering culture where the concept of 'attempt to
 deviate' just does not occur.
 
 Are you trying to suggest that this is something horrible, or that
 it's the future of network engineering? :)

neither.  it is one [type of] ops engineering culture, and a very
successful one.  it seems, from this gaijin's naive point of view, to be
the common one in japan.

when i try to 'sell' configuration automation, they are confused by how
important it is to me.  they have a hard time seeing the need because
mops just work.  my read is that this is because people do not have the
arrogance to take shortcuts.  

when one is raised knowing that one's responsibility to the group is
more important than how smart one may think that one is, mops work.

randy



Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread Dave Israel



I _do_ create action plans and _do_ quarterback each step and _do_
slap down any attempt to deviate.
  

imagine a network engineering culture where the concept of 'attempt to
deviate' just does not occur.



Are you trying to suggest that this is something horrible, or that it's the 
future of network engineering? :)

I'm actually serious in asking the question, despite the grin.
  


Possibly, he is trying to hint at a connection with Nazis, so somebody 
will mention it, invoking Godwin's Law, and bringing a fruitless 
religious thread to a close.


There's a full range of methods, with "just do it" on one side,
"deviation is terms for dismissal" on the other, and plenty of shades of
gray in between.  I've seen both extremes result in excessive downtime.
(How impromptu engineering can go wrong shouldn't take much imagination;
the "no deviation" rule is especially hysterical when the backout plan
doesn't work, but even without that, the "one thing didn't work exactly
right, back it out and try again in two weeks" effect is destructive to
both progress and morale.)  Working with the dynamic and quality of the
team is more important than any change management paradigm.


-Dave


Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread Scott Weeks

: this works in a tech culture where folk follow mops obsessively.  my
: experience is that most north american engineers are too smart to do
: that, and take shortcuts

 and _do_ slap down any attempt to deviate

: imagine a network engineering culture where the concept of 'attempt to
: deviate' just does not occur

 the network group is under control


Hopefully, at least some of that was tongue-in-cheek.

For managers: saved LOTS of dollars when deviating from MoPs by fixing AFU 
things not thought of in the MoP.

For fellow netgeeks:  no one woke you up because the AFU things were fixed 
while you slept.

scott



Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread Scott Weeks


flameproof panties == ON  :-)

:mops work.

It depends on who wrote it and how much experience that person has on the
particular network.

scott



Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread Michael Dillon
 imagine a network engineering culture where the concept of 'attempt to
 deviate' just does not occur.

 Are you trying to suggest that this is something horrible, or that it's the 
 future of network engineering? :)

The model of network engineering that grew up during the 1990s is
forever gone unless you work
in a smaller organization where people have to wear many hats. In the
big ISPs, now identical to
the big telcos, operations and engineering design duties are
separated. The operations folks
do not deviate from the written plans that they work with. If the
slightest thing happens that is not
in the plan, they roll back the changes as specified in the plan. They
don't fix anything unless it
is officially broken with trouble tickets filed and escalations up to
senior management. That is
about the only time that operations people can get away with taking
shortcuts and creative solutions.

On the other hand, the engineering design folks should spend a good
part of their day trying out
things, thinking up new ideas, poking around equipment and software to
see how far it can be pushed.
Then, when they have learned something and are ready to implement it
in the network, they write
a detailed plan for operations. Then some other engineering folks test
the heck out of that design
to try and find fault with it. After all the faults are fixed, it goes
to operations and the engineering
design folks move on to something else unless serious problems occur
and operations needs
a design engineer to approve some sensible action to be taken. The
operations folk can't take
the sensible action because that would deviate from their plans, but
getting engineering design
folks involved gives them an out for real emergencies.

So the term "network engineering" is ambiguous because a lot of people
use it to mean the 90's
style job where engineering design activity and operational activity
were all jumbled together.

In some companies, taking the engineering design track not only means
that you lose enable
on the routers, but you lose all TACACS access and have to get
authorisation from a VP just
to ask for a copy of the running config on a production router. Some
people like ops because
they see a lot of stuff go by and learn from it, get their CCIE and
move into design engineering.
Others like ops because they are scared of the responsibility for
thinking up what to do next,
and making a mistake.

As far as I can see, the only way to get a job that mixes ops and
design is to be in 3rd or 4th
level support which is the top of the technical escalation chain where
a few excellent design
engineers do have enable on the routers because they fix important
problems in near realtime.
I suspect that it would be advantageous to have a career in which you
worked for a while in
ops before moving into design engineering if you want to get into
top-level support.

Take all this with a grain of salt. Every company does things a bit
different, and the terminology
that is used is ambiguous. It would be interesting to see what others
have to say about this
answer.

--Michael Dillon



Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread Dobbins, Roland

On Dec 25, 2009, at 7:01 AM, Michael Dillon wrote:

 It would be interesting to see what others have to say about this answer.

I think it's a pretty accurate summation of how these things work in a lot of 
big organizations, all over the world.

There's a detrimental side to it, in that in the engineering org, the 
near-complete siloing away from ops can lead to an ivory-tower/King Canute type 
of mentality; in the ops org, this phenomenon in turn can lead to increasing 
frustration and lowered morale, which in turn leads to apathy and poor customer 
service. 

All too often, one ends up with mutually-hostile engineering and ops teams who 
waste time and energy actively working to frustrate one another's ambitions, 
rather than combining their efforts to design, build, and operate the best 
network possible.  Which in turn leads to many of the frustrations experienced 
every day by the end-customer.

---
Roland Dobbins rdobb...@arbor.net // http://www.arbornetworks.com

Injustice is relatively easy to bear; what stings is justice.

-- H.L. Mencken






RE: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread George Bonser


 -Original Message-
 From: Dobbins, Roland

 On Dec 25, 2009, at 7:01 AM, Michael Dillon wrote:
 
  It would be interesting to see what others have to say about this
 answer.
 
 I think it's a pretty accurate summation of how these things work in a
 lot of big organizations, all over the world.


I think that one must keep in mind that there are two kinds of
check-lists.  There is a takeoff list where you can always choose to go
back to the ramp and fly another day if something doesn't check out but
there is a different priority when someone is already in the air and
something goes wrong.  You can't decide to land a different day.  In
that case you must rely on experience and knowledge to handle the
situation as it presents itself.  Sure, you can have some basic checks
for things even in an emergency but you can't know how the problem is
going to present itself ahead of time.  In cases like that you have a set
of general parameters but the person at the controls needs to have
leeway to both clearly identify the nature of the problem and mitigate
the same if possible and that might include calling in some extra eyes
in order to identify things that might be going on with applications or
other devices that aren't specifically network gear.

So you can put a lot of process around changes in advance but there
isn't quite as much to manage incidents that strike out of the clear
blue.  Too much process at that point could impede progress in clearing
the issue.  Capt. Sullenberger did not need to fill out an incident
report, bring up a conference bridge, and give a detailed description of
what was happening with his plane, the status of all subsystems, and his
proposed plan of action (subject to consensus of those on the conference
bridge) and get approval for deviation from his initial flight plan
before he took the required actions to land the plane as best as he
could under the circumstances.  And while that is a bit extreme in the
sense of most networks in that lives are not often at stake, some
concepts are the same (and there might be networks supporting various
occupations on this planet where lives might actually be at stake in the
case of a network failure during some sort of activity).

One of the most efficient shops I worked in was when the production
internet operation was owned by the engineering department.  Corporate
operations owned the internal corporate IT, but engineering owned the
internet production data centers and network operations.  If engineering
released a code revision that blew up the network, the VP of Engineering
was responsible for the entire picture, not just the software piece.
Same is true where a networking change blew up the application.  Having
the responsibility for the entire system (software, hardware
platforms, and networking) under the same organization resulted in a lot
smoother operation without backbiting and greater access to and sharing
of resources between the application engineers, the systems
administrators, and the network engineers.




Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread Dobbins, Roland

On Dec 25, 2009, at 9:27 AM, George Bonser wrote:

 Capt. Sullenberger did not need to fill out an incident
 report, bring up a conference bridge, and give a detailed description of
 what was happening with his plane, the status of all subsystems, and his
 proposed plan of action (subject to consensus of those on the conference
 bridge) and get approval for deviation from his initial flight plan
 before he took the required actions to land the plane as best as he
 could under the circumstances.

Conversely, the ever-increasing outright hostility and contempt evinced towards 
their customers by airlines worldwide -  especially US-based airlines - over 
the last decade or so, all in the name of 'regulations', offers a useful 
counterexample.

When it comes to larger organizations, this latter scenario is more the norm 
than what you describe, in my experience.  Critical problems are left 
unresolved for days/weeks/months; if one attempts to report an issue which is 
causing problems for many of an organization's customers worldwide, but one 
isn't oneself a direct customer of said organization, one is often as not 
ignored and shunted aside.

This isn't specific to the SP realm; it's simply a function of increased size, 
which leads to increased bureaucratization, which leads to dehumanization and 
the subordination of the organization's ostensible goals to internal politics, 
one-upsmanship, and blame-laying, no matter the industry in question.  The 
folks with a can-do attitude who're willing to buck the system in order to do 
the right thing for the customer stand out in stark contrast to their peers, 
and in many cases end up paying a price in terms of career advancement because 
of their willingness to Do The Right Thing.

'Process' is all too often merely a ruse designed to avoid responsibility, 
shift blame/liability, justify hiring lower-cost/unqualified employees whilst 
shedding expensive/competent employees, and indulge in empire-building.  We've 
seen this throughout corporate America with the 'permanent Y2K' of SoX and 
HIPAA, and the increasing involvement of government in terms of 
telecommunications-related rule-making which ends up directly affecting SPs.

I'm a big advocate of standards and change-control, and not an advocate of 
seat-of-the-pants, midnight engineering - except when the latter is necessary, 
as in the examples you give.  

Unfortunately, many folks who work in larger organizations are actively 
prohibited from indulging in fluid, situationally-appropriate problem 
resolution; and because of the aforementioned siloing of ops and engineering, 
their valuable first-hand experiences and the lessons learned thereby aren't 
taken into account during the design and rulemaking processes.

---
Roland Dobbins rdobb...@arbor.net // http://www.arbornetworks.com

Injustice is relatively easy to bear; what stings is justice.

-- H.L. Mencken






Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-24 Thread Scott Howard
On Thu, Dec 24, 2009 at 6:27 PM, George Bonser gbon...@seven.com wrote:

 So you can put a lot of process around changes in advance but there
 isn't quite as much to manage incidents that strike out of the clear
 blue.  Too much process at that point could impede progress in clearing
 the issue.  Capt. Sullenberger did not need to fill out an incident
 report, bring up a conference bridge, and give a detailed description of
 what was happening with his plane, the status of all subsystems, and his
 proposed plan of action (subject to consensus of those on the conference
 bridge) and get approval for deviation from his initial flight plan
 before he took the required actions to land the plane as best as he
 could under the circumstances.



*mayday mayday mayday. Cactus fifteen thirty nine hit birds, we've lost
thrust (in/on) both engines we're turning back towards LaGuardia* - Capt.
Sullenberger

Not exactly detailed, but he definitely initiated an incident report
(the mayday), gave a description of what was happening with his plane, the
status of [the relevant] subsystems, and his proposed plan of action -
even in the order you've asked for!

His actions were then subject to the consensus of those on the conference
bridge (ie, ATC) who could have denied his actions if they believed they
would have made the situation worse (ie, if what they were proposing would
have had them on a collision course with another plane). In this case, the
conference bridge gave approval for his course of action (*ok uh, you need
to return to LaGuardia? turn left heading of uh two two zero.* - ATC)

5 seconds before they made the above call they were reaching for the QRH
(Quick Reference Handbook), which contains checklists of the steps to take
in such a situation - including what to do in the event of loss of both
engines due to multiple birdstrikes.  They had no need to confer with others
as to what actions to take to try and recover from the problem, or what
order to take them in, because that pre-work had already been carried out
when the check-lists were written.

Of course, at the end of the day, training, skill and experience played a
very large part in what transpired - but so did the actions of the people on
the conference bridge (You can't get much more of a conference bridge
than open radio frequencies), and the checklists they have for almost every
conceivable situation.

  Scott.


Re: Revisiting the Aviation Safety vs. Networking discussion

2009-12-23 Thread David Hiers
1.  I grew up at the local airport watching my CFII pop train an
endless stream of pilots.

2.  The checklist for my last production gear swap had over 400 steps
and 4 time/task gates (each with a rollback plan).  As I did each
sequence of steps, I called it out, and someone read their copy of the
checklist and checked it off.  An entire peanut gallery of rogues
watched the whole thing on livemeeting, waiting to pounce on the first
misstep or shortcut.

3.  We migrated an entire nationwide phone system in 6 hours and
nobody noticed anything.

4.  We met afterward in an after-action review meeting, a practice I
picked up in the Army.
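
For anyone who wants to see the shape of steps 2 through 4 in code, here
is a minimal sketch, not the tooling actually used for that swap.  It
assumes each step has a "do" and an "undo" action, that a second person
confirms each step (the `confirm` callable), and that missing a gate's
deadline rolls back everything done since the previous gate.  All names,
including `gate_deadline`, are hypothetical.

```python
def run_checklist(steps, confirm, elapsed):
    """Run steps in order; roll back to the last passed gate on any failure.

    steps:   list of dicts with "name", "do", "undo", and an optional
             "gate_deadline" (minutes since start) marking a time/task gate.
    confirm: callable(name) -> bool, standing in for the second person
             who reads their copy of the checklist and checks the step off.
    elapsed: callable() -> minutes since the change window opened.
    """
    completed = []  # steps done since the last gate that was passed
    for step in steps:
        step["do"]()
        completed.append(step)
        confirmed = confirm(step["name"])
        deadline = step.get("gate_deadline")
        late = deadline is not None and elapsed() > deadline
        if not confirmed or late:
            # Rollback plan: undo in reverse order, back to the last gate.
            for prior in reversed(completed):
                prior["undo"]()
            return False
        if deadline is not None:
            completed.clear()  # gate passed; later rollbacks stop here
    return True
```

The boolean result, plus whatever log the do/undo actions keep, is the raw
material for the after-action review.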

I'm more persistent than smart, and I tell ya, if you prep well
enough, you can hand your checklist to a stoned intern and you'll have
no worries at all.


David




On Wed, Dec 23, 2009 at 12:48 PM, Owen DeLong o...@delong.com wrote:
 Those that remember the discussion may find this article interesting:

 http://abcnews.go.com/Health/wireStory?id=9394406

 Owen