Re: Revisiting the Aviation Safety vs. Networking discussion
In general, it seems that a field has to be aware that it can kill (or has killed) an embarrassing number of people before its members accept the need for controls such as processes and checklists. Here are a couple of incidents in which gruesome, public loss of life was necessary for thought to triumph over ego:

Doctors took forever to get over their bad selves and adopt the process of handwashing: http://en.wikipedia.org/wiki/Ignaz_Semmelweis

Pilots discover humility and the value of checklists in managing complexity: http://www.atchistory.org/History/checklst.htm

Reactor-rats, wing-wipers, barber-surgeons, and rocket-jockeys now recognize that the best and brightest among us, polished with state of the art education and training, ruthlessly drilled in the fundamentals, and armed with the best processes and checklists, are just barely good enough to have even-money odds when dealing with everything the world can throw at them.

I suppose that once us packet-pushers kill enough people, the economics of lost market share, falling stock prices, and embarrassed CxOs on CNN will push us in that direction. Until then, however, Anarchy and Heroics (http://www.cert.org/archive/pdf/csi0711.pdf) sing their siren song.

David

On Sat, Dec 26, 2009 at 4:24 PM, Robert Boyle rob...@tellurian.com wrote:

At 02:08 AM 12/25/2009, Scott Howard wrote:

On Thu, Dec 24, 2009 at 6:27 PM, George Bonser gbon...@seven.com wrote:

So you can put a lot of process around changes in advance but there isn't quite as much to manage incidents that strike out of the clear blue. Too much process at that point could impede progress in clearing the issue. Capt.
Sullenberger did not need to fill out an incident report, bring up a conference bridge, and give a detailed description of what was happening with his plane, the status of all subsystems, and his proposed plan of action (subject to consensus of those on the conference bridge) and get approval for deviation from his initial flight plan before he took the required actions to land the plane as best as he could under the circumstances.

*mayday mayday mayday. Cactus fifteen thirty nine hit birds, we've lost thrust (in/on) both engines we're turning back towards LaGuardia* - Capt. Sullenberger

Not exactly detailed, but he definitely initiated an incident report (the mayday), gave a description of what was happening with his plane, the status of [the relevant] subsystems, and his proposed plan of action - even in the order you've asked for!

His actions were then subject to the consensus of those on the conference bridge (ie, ATC) who could have denied his actions if they believed they would have made the situation worse (ie, if what they were proposing would have had them on a collision course with another plane). In this case, the conference bridge gave approval for his course of action (*ok uh, you need to return to LaGuardia? turn left heading of uh two two zero.* - ATC)

Once he declared an emergency, he had the right of way over all other traffic. ATC would move anyone in his way out of the way. Under U.S. FAA FAR 91.3, Responsibility and authority of the pilot in command, the FAA declares:

* (a) The pilot in command of an aircraft is directly responsible for, and is the final authority as to, the operation of that aircraft.
* (b) In an in-flight emergency requiring immediate action, the pilot in command may deviate from any rule of this part to the extent required to meet that emergency.
* (c) Each pilot in command who deviates from a rule under paragraph (b) of this section shall, upon the request of the Administrator, send a written report of that deviation to the Administrator.

Just because we have checklists doesn't mean we can't think on our feet and handle situations not contemplated in checklists, but checklists and procedures exist to ensure we don't forget something we need to remember. They aren't a substitute for creativity and logical thought. They are an aid to it to ensure a minimum of creative thinking is needed to solve problems which shouldn't exist in the first place.

-Robert SELMEL+I

Well done is better than well said. - Benjamin Franklin
Re: Revisiting the Aviation Safety vs. Networking discussion
The connection may not be immediately apparent, but I think Philip Greenspun's article critiquing Malcolm Gladwell's musings on cranial metrics etc. has some bearing: http://philip.greenspun.com/flying/foreign-airline-safety ...or is at least an interesting read.

In observing network operations screw-ups, I've seen a lot that were either caused by, or prolonged by, a culture-of-emergency. Young guys drinking way too much coffee, working a service window at two in the morning, believing they've seen something that needs to be fixed, and winging it.

In building networks, I've tried very hard to engineer things such that the operating procedure for dealing with an emergency is to note its existence and place it in a work queue to be dealt with by people who are on a day shift, have just come in from a full night's sleep, and are working in a team with senior people who can assist with anything tricky, and make sure that junior folks are following procedures that have been worked out in advance by people who had plenty of time in a lab, and plenty of time to choose the best of many alternative procedures.

In my experience, reducing the frequency of emergencies is most beneficial in reducing the frequency of outages. :-)

-Bill
Re: Revisiting the Aviation Safety vs. Networking discussion
On 12/25/09 7:57 AM, Anton Kapela wrote:

What I'm getting at is that after following this thread for a while, I'm not convinced any amount of process-borrowing is going to solve problems better, faster, or even avoid them in the first place. At best, our craft is 1/3rd as old (if that's somehow a measure of maturity) as flight and nobody is being sued to settle 200+ accidental deaths because of our mistakes.

So, we're supposed to make the mistakes of aviation, nuclear power, the chemical industry (e.g. Bhopal), oil production and refining, etc., all over again? Checklists and MOPs are but one of the things we ignore from other industries. Some others:

o Increasing complexity and tight coupling lead to systemic failures. Simply grafting redundancy onto complex systems can make them less, not more, reliable. Yet this is the trend in networking. Want bells and whistles, firewalls, load-balancers, rate-limiters in your network? You can have 'em without sacrificing reliability if you just buy two of 'em!

o The gradual acceptance of components or procedures that have adequate reliability for a certain task (say, research) that are not reliable enough for another task (e.g. being a critical part of a 1,000 megawatt nuclear power plant) without understanding the implications. Do we know how our technology is being used and will be used? Will the people adopting IP for everything (the smart grid, VoIP, life-supporting functions) fail to see these implications just as the people who shoved a fissile core into a pressure vessel did?

This last point directly contradicts the theme of your message. The notion that what we do is not (yet) a matter of life-or-death has bitten other industries in the past and it provides a nice illustration of why we should *not* be ignoring their lessons.

michael
Re: Revisiting the Aviation Safety vs. Networking discussion
On Dec 24, 2009, at 11:08 PM, Scott Howard wrote: On Thu, Dec 24, 2009 at 6:27 PM, George Bonser gbon...@seven.com wrote: So you can put a lot of process around changes in advance but there isn't quite as much to manage incidents that strike out of the clear blue. Too much process at that point could impede progress in clearing the issue. Capt. Sullenberger did not need to fill out an incident report, bring up a conference bridge, and give a detailed description of what was happening with his plane, the status of all subsystems, and his proposed plan of action (subject to consensus of those on the conference bridge) and get approval for deviation from his initial flight plan before he took the required actions to land the plane as best as he could under the circumstances. *mayday mayday mayday. **Cactus fifteen thirty nine hit birds, we've lost thrust (in/on) both engines we're turning back towards LaGuardia* - Capt. Sullenberger Not exactly detailed, but he definitely initiated an incident report (the mayday), gave a description of what was happening with his plane, the status of [the relevant] subsystems, and his proposed plan of action - even in the order you've asked for! Exactly. His actions were then subject to the consensus of those on the conference bridge (ie, ATC) who could have denied his actions if they believed they would have made the situation worse (ie, if what they were proposing would have had them on a collision course with another plane). In this case, the conference bridge gave approval for his course of action (*ok uh, you need to return to LaGuardia? turn left heading of uh two two zero.* - ATC) Not exactly. If the others on the bridge don't consent, FAR 91.3 gives him full and absolute authority to tell them to screw themselves and do what he feels is best. FAR 91.3 reads: Responsibility and authority of the pilot in command. 
(a) The pilot in command of an aircraft is directly responsible for, and is the final authority as to, the operation of that aircraft. (b) In an in-flight emergency requiring immediate action, the pilot in command may deviate from any rule of this part to the extent required to meet that emergency. (c) Each pilot in command who deviates from a rule under paragraph (b) of this section shall, upon the request of the Administrator, send a written report of that deviation to the Administrator. As near as I can tell, that regulation was last modified in 1963. 5 seconds before they made the above call they were reaching for the QRH (Quick Reference Handbook), which contains checklists of the steps to take in such a situation - including what to do in the event of loss of both engines due to multiple birdstrikes. They had no need to confer with others as to what actions to take to try and recover from the problem, or what order to take them in, because that pre-work had already been carried out when the check-lists were written. Yep. Of course, at the end of the day, training, skill and experience played a very large part in what transpired - but so did the actions of the people on the conference bridge (You can't get much more of a conference bridge than open radio frequencies), and the checklists they have for almost every conceivable situation. And in case there are any misconceptions here on the list, I know that in the public eye, there is often a lot of distrust and/or perceived animosity between controllers and pilots. Frankly, this is a misconception for the most part. Sure, there are incidents where pilots and controllers don't get along, each blaming the other. However, by and large, both groups are consummate professionals doing their best to make sure flights end well. In my years as a pilot, I have had more than one occasion to be very thankful for ATC and the services they provide. Generally, they are a very helpful and hardworking group. 
I respect them greatly and appreciate the tough job they do. Owen (Commercial Pilot, ASEL, Instrument Airplane)
Re: Revisiting the Aviation Safety vs. Networking discussion
On Dec 25, 2009, at 7:57 AM, Anton Kapela wrote: On Fri, Dec 25, 2009 at 5:44 AM, Vadim Antonov a...@kotovnik.com wrote: The ISP industry has a long way to go until it reaches the same level of sophistication in handling problems as aviation has. It seems that there's a logical fallacy floating around somewhere (networks have parts and are complicated, airplanes and flight involve lots of parts and are also complicated, therefore aircraft are like networks). I assert that comparing 'packet switching' to an industry that has its roots in the late 1800's and had its first hello world moment in 1903 isn't terribly fruitful. As someone with a fair amount of experience with both, I have to disagree with you. Yes, there are differences, and, yes you have to keep comparisons and the like in perspective, but, there are definitely areas where networking could learn from aviation, and, to some extent, vice versa. Further, aircraft are the asymptotic limit of 'singly homed transit.' Because of this, I think one could argue that pilots and ATC must be held to a different professional standard due to the nature of public trust at risk. At the other end of our strawman spectrum, we have end users who must accept the risk that their provider will be unable to connect them to lolcats.com on occasion, perhaps as often as 0.01% per year, and most are happy to accept this. Four nines survivability on flights, clearly, won't work. Correct... As I stated in my earliest posts on this subject, while there is value to be obtained in looking at how aviation has improved its safety/reliability record over the years, there is also value in recognizing the cost/benefit ratio of some of those improvements. If you draw a graph with one curve arcing from bottom left towards upper right, steepening as it goes to the right, that line can be thought of as the amount of cost of achieving additional reliability. 
A second curve sloping from top left to bottom right, flattening out as it goes to the right can be thought of as the gains achieved from those additional 9s of reliability. Finally, the point where those two curves intersect is defined by the cost of outages and/or downtime. Interestingly, this same diagram will be familiar to most pilots, but, the two arcs will be induced drag (drag from producing lift) and parasite drag (drag from friction with the air). The point where they meet is called L/D Max and is the airspeed at which the given aircraft will achieve its best glide ratio.

What I'm getting at is that after following this thread for a while, I'm not convinced any amount of process-borrowing is going to solve problems better, faster, or even avoid them in the first place. At best, our craft is 1/3rd as old (if that's somehow a measure of maturity) as flight and nobody is being sued to settle 200+ accidental deaths because of our mistakes.

There are lessons to be learned that are valuable. Both from things aviation has done well that we could emulate, and, from things aviation has done poorly that we should avoid. There are also additional lessons to be learned about the differences in cost/benefit analysis between the two disciplines.

Owen
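[Editor's sketch] The two-curve picture in this message can be turned into a small calculation. This is only an illustration, not anything from the thread: the curve shapes are assumptions (the yearly cost of each extra nine is taken to grow geometrically, and the gain is taken to be the outage cost avoided), and the names `marginal_benefit`, `marginal_cost`, `worthwhile_nines` and every dollar figure are hypothetical. It works in whole nines rather than finding a continuous intersection.

```python
HOURS_PER_YEAR = 365 * 24  # leap years ignored


def marginal_benefit(n: int, outage_cost_per_hour: float) -> float:
    """Yearly outage cost avoided by going from n nines to n + 1 nines."""
    downtime_before = 10.0 ** -n * HOURS_PER_YEAR
    downtime_after = 10.0 ** -(n + 1) * HOURS_PER_YEAR
    return (downtime_before - downtime_after) * outage_cost_per_hour


def marginal_cost(n: int, base_cost: float, growth: float) -> float:
    """ASSUMPTION: the yearly cost of the (n + 1)th nine grows geometrically."""
    return base_cost * growth ** n


def worthwhile_nines(outage_cost_per_hour: float, base_cost: float,
                     growth: float, max_nines: int = 9) -> int:
    """Count nines for which the next nine still saves at least what it costs."""
    n = 0
    while (n < max_nines and
           marginal_benefit(n, outage_cost_per_hour) >= marginal_cost(n, base_cost, growth)):
        n += 1
    return n


# Hypothetical figures: the same cost curve, two very different outage costs.
print(worthwhile_nines(outage_cost_per_hour=10_000, base_cost=5_000, growth=4))
print(worthwhile_nines(outage_cost_per_hour=1_000_000, base_cost=5_000, growth=4))
```

Under these made-up numbers the crossover point moves to the right as the cost of an outage goes up, which is the point of the message: the same reliability techniques pay off at different levels in different businesses.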
Re: Revisiting the Aviation Safety vs. Networking discussion
At 03:38 PM 12/28/2009, Owen DeLong wrote: There are lessons to be learned that are valuable. Both from things aviation has done well that we could emulate, and, from things aviation has done poorly that we should avoid. There are also additional lessons to be learned about the differences in cost/benefit analysis between the two disciplines. Agreed. You have to learn from the mistakes of others because you won't live long enough to make them all yourself. -Admiral Rickover -Robert Well done is better than well said. - Benjamin Franklin
Re: Revisiting the Aviation Safety vs. Networking discussion
At 02:08 AM 12/25/2009, Scott Howard wrote:

On Thu, Dec 24, 2009 at 6:27 PM, George Bonser gbon...@seven.com wrote:

So you can put a lot of process around changes in advance but there isn't quite as much to manage incidents that strike out of the clear blue. Too much process at that point could impede progress in clearing the issue. Capt. Sullenberger did not need to fill out an incident report, bring up a conference bridge, and give a detailed description of what was happening with his plane, the status of all subsystems, and his proposed plan of action (subject to consensus of those on the conference bridge) and get approval for deviation from his initial flight plan before he took the required actions to land the plane as best as he could under the circumstances.

*mayday mayday mayday. Cactus fifteen thirty nine hit birds, we've lost thrust (in/on) both engines we're turning back towards LaGuardia* - Capt. Sullenberger

Not exactly detailed, but he definitely initiated an incident report (the mayday), gave a description of what was happening with his plane, the status of [the relevant] subsystems, and his proposed plan of action - even in the order you've asked for!

His actions were then subject to the consensus of those on the conference bridge (ie, ATC) who could have denied his actions if they believed they would have made the situation worse (ie, if what they were proposing would have had them on a collision course with another plane). In this case, the conference bridge gave approval for his course of action (*ok uh, you need to return to LaGuardia? turn left heading of uh two two zero.* - ATC)

Once he declared an emergency, he had the right of way over all other traffic. ATC would move anyone in his way out of the way. Under U.S. FAA FAR 91.3, Responsibility and authority of the pilot in command, the FAA declares:

* (a) The pilot in command of an aircraft is directly responsible for, and is the final authority as to, the operation of that aircraft.
* (b) In an in-flight emergency requiring immediate action, the pilot in command may deviate from any rule of this part to the extent required to meet that emergency.
* (c) Each pilot in command who deviates from a rule under paragraph (b) of this section shall, upon the request of the Administrator, send a written report of that deviation to the Administrator.

Just because we have checklists doesn't mean we can't think on our feet and handle situations not contemplated in checklists, but checklists and procedures exist to ensure we don't forget something we need to remember. They aren't a substitute for creativity and logical thought. They are an aid to it to ensure a minimum of creative thinking is needed to solve problems which shouldn't exist in the first place.

-Robert SELMEL+I

Well done is better than well said. - Benjamin Franklin
RE: Revisiting the Aviation Safety vs. Networking discussion
Just clearing a small point about pilots (I'm a pilot) - the pilot-in-command has ultimate responsibility for his a/c and can ignore whatever ATC tells him to do if he considers that to be contrary to the safety of his flight (he may be asked to explain his actions later, though). Now, usually ignoring ATC or keeping it in the dark about one's intentions is not very clever - but dispatchers are not in the cockpit and may misunderstand the situation or be simply mistaken about something (so a pilot is encouraged to decline ATC instructions he considers to be in error - informing ATC about it, of course). But one of the first things a pilot does in an emergency is pulling out the appropriate emergency checklist. It is kind of hard to keep from forgetting to check obvious things when things get hectic (one of the distressingly common causes of accidents is trivial running out of fuel - either because the pilot didn't do homework on the ground (checking actual fuel level in tanks, etc) or because when the engine got suddenly quiet he forgot to switch to another, non-empty, tank). The mantra about priorities in both normal and emergency situations is Aviate-Navigate-Communicate meaning that maintaining control of a/c always comes first, no matter what. Knowing where you are and where you are going (and other pertinent situational awareness such as condition of the a/c and current plan of actions) come second. Talking is lowest priority. The pre-planned emergency checklists may be a good idea for network operators. Try obvious (when you're calm, that's it) actions first, if they fail to help, try to limit damage. Only then go file the ticket and talk to people who can investigate situation in depth and can develop a fix. 
The way the aviation industry came up with these checklists is, basically, experience - it pays to debrief after recovery from every problem not adequately fixed by existing procedures, find common ones, and develop a diagnostic procedure one could follow step-by-step for these situations. (The non-punitive error or incident reporting which actually shields pilots from FAA enforcement actions in most cases also helps to collect real-world information on where and how pilots get into trouble). The all-too-common multistep ticket escalation chains (which merely work as delay lines in a significant portion of cases) are something to be avoided.

Even better is to provide some drilling in diagnosis of and recovery from common problems to the front-line personnel - starting from following the checklist on a simulated outage in the lab, and then getting it down to what pilots call "the flow" - a habitual memorized procedure, which is performed first and then checked against the checklist.

Note that use of checklists, drilling, and flows does not turn pilots into robots - they still have to make decisions, recognize and deal with situations not covered in the standard procedures; what it does is speed up dealing with common tasks, reduce mistakes, and free up mental processing for thinking ahead.

The ISP industry has a long way to go until it reaches the same level of sophistication in handling problems as aviation has.

--vadim

On Fri, 25 Dec 2009, George Bonser wrote:

I think any network engineer who sees a major problem is going to have a "Houston, we have a problem" moment. And actually, he was telling the ATC what he was going to need to do, he wasn't getting permission so much as telling them what he was doing so traffic could be cleared out of his way.
First he told them he was returning to the airport, then he inquired about Teterboro, the ATC called Teterboro to get a runway and inform them of an inbound emergency, then the Captain told the ATC they were going to be in the Hudson. And "I hit birds, have lost both engines, and am turning back" results in a whole different chain of events these days than "I have two guys banging on the cockpit door and am returning" or simply turning back toward the airport with no communication.

And any network engineer is going to say something if he sees CPU or bandwidth utilization hit the rail in either direction. Saying something like "we just got flooded with thousands of /24 and smaller wildly flapping routes from peer X and I am shutting off the BGP session until they get their stuff straight" is different than "we just got flooded with thousands of routes and it is blowing up the router and all the other routers talking to it. Can I do something about it?"

And that illustrates a point that is key. In that case the ATC was asking what the pilot needed and was prepared to clear traffic, get emergency equipment prepared, whatever it took to get that person dealing with the problem whatever they needed to get it resolved in the best way forward. The ATC isn't asking him if he was sure he set the flaps at the right angle and "did you try to restart the engine" sorts of things. What I
RE: Revisiting the Aviation Safety vs. Networking discussion
On Fri, 25 Dec 2009, Vadim Antonov wrote:

The ISP industry has a long way to go until it reaches the same level of sophistication in handling problems as aviation has.

Well, to counter this one might talk about the medical business (doctors) which hasn't been able to embrace the checklists at all (apart from in a few places), and they still consider their profession to be a craft, just like most network engineers do.

It's the classical good/fast/cheap, please pick two. Aviation is slow/careful to bring in new tech, same with the health care side, they're both very conservative. We in the network business are still immature but quick and flexible, but as time goes on, our services are more and more important, and thus things settle in and slow down, but become more reliable. This is an evolution that'll take quite some time, but it's already changed a lot in the past 10 years.

There was quite a buzz regarding doctor checklists a few years back. I read several articles about it, but now I can't find the one I want to find; http://www.healthbeatblog.org/2007/12/pilots-use-chec.html talks a bit about the topic.

-- Mikael Abrahamsson email: swm...@swm.pp.se
Re: Revisiting the Aviation Safety vs. Networking discussion
On Fri, Dec 25, 2009 at 5:44 AM, Vadim Antonov a...@kotovnik.com wrote:

The ISP industry has a long way to go until it reaches the same level of sophistication in handling problems as aviation has.

It seems that there's a logical fallacy floating around somewhere (networks have parts and are complicated, airplanes and flight involve lots of parts and are also complicated, therefore aircraft are like networks). I assert that comparing 'packet switching' to an industry that has its roots in the late 1800's and had its first "hello world" moment in 1903 isn't terribly fruitful.

Further, aircraft are the asymptotic limit of 'singly homed transit.' Because of this, I think one could argue that pilots and ATC must be held to a different professional standard due to the nature of public trust at risk. At the other end of our strawman spectrum, we have end users who must accept the risk that their provider will be unable to connect them to lolcats.com on occasion, perhaps as often as 0.01% per year, and most are happy to accept this. Four nines survivability on flights, clearly, won't work.

What I'm getting at is that after following this thread for a while, I'm not convinced any amount of process-borrowing is going to solve problems better, faster, or even avoid them in the first place. At best, our craft is 1/3rd as old (if that's somehow a measure of maturity) as flight and nobody is being sued to settle 200+ accidental deaths because of our mistakes.

-Tk
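[Editor's sketch] For scale, the "0.01% per year" and "four nines" figures in this message convert to downtime and to expected losses as below. This is plain arithmetic in Python; the function names are made up for illustration, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_minutes_per_year(availability: float) -> float:
    """Minutes per year a service is unavailable at the given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def expected_failures(events: int, reliability: float) -> float:
    """Expected failed events (e.g. flights) out of `events` at a per-event reliability."""
    return events * (1.0 - reliability)


for label, availability in [("two nines", 0.99), ("three nines", 0.999),
                            ("four nines", 0.9999), ("five nines", 0.99999)]:
    minutes = downtime_minutes_per_year(availability)
    print(f"{label}: about {minutes:.1f} minutes of downtime per year")

# Four nines applied per flight means about one loss per ten thousand flights.
print(expected_failures(10_000, 0.9999))
```

Four nines of availability is a little under an hour of downtime a year, which most end users tolerate; the same ratio applied per flight is roughly one lost flight in every ten thousand, which is the contrast the message draws.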
Re: Revisiting the Aviation Safety vs. Networking discussion
On Thu, Dec 24, 2009 at 01:09:26PM -0500, Randy Bush wrote:

I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate.

imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur.

Whimsical deviations don't belong in the maintenance execution, they belong in the brainstorming and design. Gather more points of view during the peer review of the specification of work. In my experience, good engineering makes for bad drama (and conversely if it is a dramatic save then you have a bad engineer and likely a cowboy).

Have a plan that executes in stages, tests at checkpoints where partial completion is possible, and a fallback for each step. A great way to train up junior people, document as you go, expose flaws and lines of future investigation, and if things go south you escalate to those who can judge *reasonable* new directions.

To me, that kind of change management for non-automatable work is a descendant of reasonable group work. If you have project-oriented autonomous teams that stick to the guideposts of your standards and minimal disruptions/maximal uptime then good work will emerge.

As for automation, that enables your expensive humans to do more smart things so should always be incorporated in processes and be something people move toward, IMO.

Cheers,

Joe

-- RSUC / GweepNet / Spunk / FnB / Usenix / SAGE
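[Editor's sketch] The staged plan described in this message (stages, a checkpoint test after each, a fallback per step, escalation when things go south) might look roughly like this in Python. Everything here is hypothetical: `Step`, `PlanResult` and `run_plan` are illustrative names, the "device" is a plain dictionary, and stopping at the first failed checkpoint while leaving earlier verified stages in place is one possible reading of "partial completion", not the author's stated procedure.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One stage of a maintenance plan (all names here are hypothetical)."""
    name: str
    apply: Callable[[], None]     # make the change for this stage
    check: Callable[[], bool]     # checkpoint test; True means the stage is good
    fallback: Callable[[], None]  # undo only this stage


@dataclass
class PlanResult:
    completed: List[str] = field(default_factory=list)  # stages that passed their checkpoint
    failed: Optional[str] = None                        # stage that was backed out, if any


def run_plan(steps: List[Step]) -> PlanResult:
    """Run stages in order; on a failed checkpoint, undo that stage and stop."""
    result = PlanResult()
    for step in steps:
        try:
            step.apply()
            ok = step.check()
        except Exception:
            ok = False  # an error while applying or checking counts as a failed checkpoint
        if not ok:
            step.fallback()
            result.failed = step.name  # hand this to a human for escalation
            break
        result.completed.append(step.name)
    return result


# Toy "device" state standing in for real equipment.
state = {"new_link_up": False, "traffic_moved": False, "old_link_removed": False}

plan = [
    Step("bring up new link",
         apply=lambda: state.update(new_link_up=True),
         check=lambda: state["new_link_up"],
         fallback=lambda: state.update(new_link_up=False)),
    Step("move traffic",
         apply=lambda: state.update(traffic_moved=True),
         check=lambda: False,  # simulate a checkpoint that fails
         fallback=lambda: state.update(traffic_moved=False)),
    Step("remove old link",
         apply=lambda: state.update(old_link_removed=True),
         check=lambda: state["old_link_removed"],
         fallback=lambda: state.update(old_link_removed=False)),
]

result = run_plan(plan)
print("completed:", result.completed)
print("failed:", result.failed)
print("state:", state)
```

The runner makes no attempt to improvise past a failed checkpoint. It backs out the failing stage, reports which one it was, and leaves the decision about new directions to whoever the plan says to escalate to.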
Re: Revisiting the Aviation Safety vs. Networking discussion
On Thu, 24 Dec 2009, Scott Howard wrote:

His actions were then subject to the consensus of those on the conference bridge (ie, ATC) who could have denied his actions if they believed they would have made the situation worse (ie, if what they were proposing would have had them on a collision course with another plane).

This has been mentioned by others in this thread, but not to the level of importance I think it represents. I, too, am a pilot. The pilot in command of an aircraft always has the final say on the safety of the flight, not the controller, and not the design engineers.

If the pilot in command violates the rules and the result is negative (crash, loss of separation, etc.) you better believe there will be questions to be answered and a possible loss of the pilot's license (or life!) may result. On the other hand if the pilot's decision to violate the rules results in a positive outcome (see Sullenberger or any other number of emergencies that happen every day that you never hear about) there will often never even be a single question about why the rules were violated.

This can be applied directly to network engineering work. If I assign an engineer to do a network change, yes, they better have a plan/checklist/etc. before they start and they better follow it. When things go wrong, I expect that engineer to make the right decisions to minimize the damage. Sometimes that means following the rules to the letter. Sometimes that means breaking the rules. If the rules are broken, there darn better be a good reason for it, but frankly, a good engineer will always have a good explanation, just like the good pilot.

Rigid procedures are no better than the lack of procedures. Process is very important, don't get me wrong, but so is the knowledge and experience to know when you should throw it out the door. Any organization that doesn't recognize that is doomed to inefficiency at best, and failure at worst.
--
Brandon Ross, Director of Network Engineering, Xiocom Wireless
AIM: BrandonNRoss  ICQ: 2269442  Skype: brandonross  Yahoo: BrandonNRoss
RE: Revisiting the Aviation Safety vs. Networking discussion
What I'm getting at is that after following this thread for a while, I'm not convinced any amount of process-borrowing is going to solve problems better, faster, or even avoid them in the first place. At best, our craft is 1/3rd as old (if that's somehow a measure of maturity) as flight and nobody is being sued to settle 200+ accidental deaths because of our mistakes. -Tk

Not now, that is true, but when you look at things that are on the drawing board such as systems designed to manage automobile traffic flows, networks that are used to fly UAVs, networks that keep track of friendly units in combat where the technology might someday migrate to civilian law enforcement and/or emergency services (keeping track of where firefighters are in a building or at a wildfire, for example), I can see situations in the future where people's lives could be dependent on networks working properly, or at least endangered if a network fails.

But my original intent was to point out that there are two kinds of process for two different kinds of circumstances and the sort of process surrounding routine changes might not be the best process for handling emergency changes. I have seen examples of places that want to handle emergency changes with the same sort of process they use for routine changes and those places can be frustrating to work with when stuff is broken.

My goal was to give managers of networks who might read this the idea that when the fan is in an unsavory condition, more can get done by shifting from a mode of questioning, analyzing and second-guessing everything the engineer is doing to a mode where the organization is responding to immediate needs, clearing obstacles out of the way, and documenting as best they can what is done and when, to make the debriefing afterwards easier.

AFTER the incident is the time to go over what was done, think about how it was dealt with, consider any changes in emergency process that might have shortened the duration, etc. In fact the "What could we have done differently that would have shortened the duration of the outage?" question is pretty important. The answer might be "nothing", and that is ok, too, but the question should be asked.
RE: Revisiting the Aviation Safety vs. Networking discussion
Shops where engineering and operations function separately can suffer from reduced efficiencies. A recent example comes to mind. Vendor X was onsite turning up some equipment, including a small VPN concentrator for remote access. It was a new model of VPN concentrator that the installers hadn't worked with before. They used scripts, a set of CLI commands with field-replaceable variables for site specific parameters, to configure the device. But connections to the VPN were failing. After trying different versions of the scripts (for similar models) they broke down and called their internal tech support department for help. Total turn-up time for the concentrator: 8+ hours.

There wasn't that much wrong with the script that kept it from working, but the ops folks lacked the training to understand the problems and fix them. On the other hand, the engineering folks should probably have produced a more robust set of scripts.

While having no experience myself, it would seem a good practice that every project, including the actual turn-up, include representation from engineering. This automatically creates a liaison between the two groups and keeps the engineer abreast of real world issues.

Frank

-----Original Message-----
From: Michael Dillon [mailto:wavetos...@googlemail.com]
Sent: Thursday, December 24, 2009 6:02 PM
To: NANOG list
Subject: Re: Revisiting the Aviation Safety vs. Networking discussion

imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur.

Are you trying to suggest that this is something horrible, or that it's the future of network engineering? :)

The model of network engineering that grew up during the 1990s is forever gone unless you work in a smaller organization where people have to wear many hats. In the big ISPs, now identical to the big telcos, operations and engineering design duties are separated. The operations folks do not deviate from the written plans that they work with.
If the slightest thing happens that is not in the plan, they rollback the changes as specified in the plan. They don't fix anything unless it is officially broken with trouble tickets filed and escalations up to senior management. That is about the only time that operations people can get away with taking shortcuts and creative solutions. On the other hand, the engineering design folks should spend a good part of their day trying out things, thinking up new ideas, poking around equipment and software to see how far it can be pushed. Then, when they have learned something and are ready to implement it in the network, they write a detailed plan for operations. Then some other engineering folks test the heck out of that design to try and find fault with it. After all the faults are fixed, it goes to operations and the engineering design folks move on to something else unless serious problems occur and operations needs a design engineer to approve some sensible action to be taken. The operations folk can't take the sensible action because that would deviate from their plans, but getting engineering design folks involved, gives them an out for real emergencies. So the term network engineering is ambiguous because a lot of people use it to mean the 90's style job where engineering design activity and operational activity were all jumbled together. In some companies, taking the engineering design track not only means that you lose enable on the routers, but you lose all TACACS access and have to get authorisation from a VP just to ask for a copy of the running config on a production router. Some people like ops because they see a lot of stuff go by and learn from it, get their CCIE and move into design engineering. Others like ops because they are scared of the responsibility for thinking up what to do next, and making a mistake. 
As far as I can see, the only way to get a job that mixes ops and design is to be in 3rd or 4th level support which is the top of the technical escalation chain where a few excellent design engineers do have enable on the routers because they fix important problems in near realtime. I suspect that it would be advantageous to have a career in which you worked for a while in ops before moving into design engineering if you want to get into top-level support. Take all this with a grain of salt. Every company does things a bit different, and the terminology that is used is ambiguous. It would be interesting to see what others have to say about this answer. --Michael Dillon
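Frank's turn-up story above (scripts with field-replaceable variables that failed in ways the installers could not diagnose) suggests one small thing a "more robust set of scripts" could do: refuse to produce a config when a site parameter is missing or blank. What follows is only a sketch under assumptions. The template text, the parameter names, and `render_config` are hypothetical and are not any vendor's real CLI or tooling.

```python
from string import Template

# Hypothetical template: placeholder commands, not a real device syntax.
VPN_TEMPLATE = Template(
    "hostname $hostname\n"
    "set vpn pool $pool_start $pool_end\n"
    "set vpn gateway $gateway_ip\n"
)

# Site-specific parameters the installer must supply (assumed names).
REQUIRED = {"hostname", "pool_start", "pool_end", "gateway_ip"}


def render_config(params):
    """Render the template, refusing to emit a partial config."""
    missing = REQUIRED - set(params)
    empty = {k for k in REQUIRED & set(params) if not str(params[k]).strip()}
    if missing or empty:
        raise ValueError(
            f"missing or empty site parameters: {sorted(missing | empty)}"
        )
    # substitute(), unlike safe_substitute(), raises KeyError if any
    # placeholder in the template is left unfilled.
    return VPN_TEMPLATE.substitute(params)
```

The point of failing early is that the person at the console gets a named parameter to chase rather than a device that silently accepts a half-filled script.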
RE: Revisiting the Aviation Safety vs. Networking discussion
I can see situations in the future where people's lives could be dependent on networks working properly, or at least endangered if a network fails. Actually it's not the future. My father's design bureau was making hardware since the 70s (including network stuff) for running industrial processes of a kind where a software crash or a network malfunction was usually associated with casualties. Gas pipelines, power plants, electric grids, stuff like that. That's a completely different class of hardware, more of a kind you'd find in avionics - modules in triplicate, voting, pervasive error correction, etc. Software was also designed differently, with a lot more review processes, and with data structures designed for integrity checking (I still use this trick in my work, which saves me a lot of grief during debugging) and recovery from memory corruption and such. I'd be seriously loath to put any of the current crop of COTS network boxes into a life-critical network. --vadim
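Vadim does not say how those integrity-checked data structures or the triplicate voting were built, so the following is an illustration only, not his bureau's design. It assumes a CRC32 appended to each record and a majority vote over redundant copies; the names `seal`, `unseal`, and `vote` are made up for this sketch.

```python
import zlib
from collections import Counter


def seal(payload: bytes) -> bytes:
    """Append a CRC32 so later corruption of the record is detectable."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unseal(record: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum mismatches."""
    if len(record) < 4:
        raise ValueError("record too short")
    payload, stored = record[:-4], int.from_bytes(record[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise ValueError("integrity check failed")
    return payload


def vote(copies):
    """Majority vote over redundant copies; corrupt copies get no vote."""
    valid = []
    for copy in copies:
        try:
            valid.append(unseal(copy))
        except ValueError:
            continue  # a copy that fails its own check is simply ignored
    if not valid:
        raise ValueError("no valid copy")
    winner, count = Counter(valid).most_common(1)[0]
    # Require a strict majority of all copies, not just of the valid ones.
    if count * 2 <= len(copies):
        raise ValueError("no majority among copies")
    return winner
```

The debugging benefit he mentions falls out of `unseal`: corruption is reported at the point where the record is read, not several steps later when the bad value causes a crash.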
Re: Revisiting the Aviation Safety vs. Networking discussion
On Dec 24, 2009, at 9:51 AM, Randy Bush wrote: I'm more persistent than smart, and I tell ya, if you prep well enough, you can hand your checklist to a stoned intern and you'll have no worries at all. this works in a tech culture where folk follow mops obsessively. my experience is that most north american engineers are too smart to do that, and take shortcuts. randy Being a North American Engineer, I resent that remark. =] I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate. Eddy
Re: Revisiting the Aviation Safety vs. Networking discussion
I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate. imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur. randy
Re: Revisiting the Aviation Safety vs. Networking discussion
On Dec 24, 2009, at 10:09 AM, Randy Bush wrote: I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate. imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur. randy =] The networking group is under control. It's the software engineers that start making edits to configs and code on the fly, improvisation at its finest. I guess my scope of interaction is greater than just networking. The hard part is that it's a peer situation: how do you elevate the members of another team who have a lesser standard of operation? Also, they feel it's fine to act like a cowboy and tackle problems on the fly, as long as the product is live before the window closes. Then there is the almighty "We can't back out, we already made too many changes" that makes me want to grab rope and attach it to the ceiling. Have a Merry Christmas, Eddy
Re: Revisiting the Aviation Safety vs. Networking discussion
Eddy Martinez wrote: On Dec 24, 2009, at 10:09 AM, Randy Bush wrote: I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate. imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur. I find the thought of *any* culture in which attempts to deviate just do not occur a little unnerving. Jim Shankland http://blog.oliver-gassner.de/archives/225-Guenter-Eich,-Traeume.html
Re: Revisiting the Aviation Safety vs. Networking discussion
On Dec 24, 2009, at 1:09 PM, Randy Bush wrote: I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate. imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur. Are you trying to suggest that this is something horrible, or that it's the future of network engineering? :) I'm actually serious in asking the question, despite the grin. -Dave
Re: Revisiting the Aviation Safety vs. Networking discussion
imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur. Are you trying to suggest that this is something horrible, or that it's the future of network engineering? :) neither. it is one [type of] ops engineering culture, and a very successful one. it seems, from this gaijin's naive point of view, to be the common one in japan. when i try to 'sell' configuration automation, they are confused by how important it is to me. they have a hard time seeing the need because mops just work. my read is that this is because people do not have the arrogance to take shortcuts. when one is raised knowing that one's responsibility to the group is more important than how smart one may think that one is, mops work. randy
Re: Revisiting the Aviation Safety vs. Networking discussion
I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate. imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur. Are you trying to suggest that this is something horrible, or that it's the future of network engineering? :) I'm actually serious in asking the question, despite the grin. Possibly, he is trying to hint at a connection with Nazis, so somebody will mention it, invoking Godwin's Law, and bringing a fruitless religious thread to a close. There's a full range of methods, with "just do it" on one side, "deviation is grounds for dismissal" on the other, and plenty of shades of gray in between. I've seen both extremes result in excessive downtime. (How impromptu engineering can go wrong shouldn't take much imagination; the no deviation rule is especially hysterical when the backout plan doesn't work, but even without that, the "one thing didn't work exactly right, back it out and try again in two weeks" effect is destructive to both progress and morale.) Working with the dynamic and quality of the team is more important than any change management paradigm. -Dave
Re: Revisiting the Aviation Safety vs. Networking discussion
: this works in a tech culture where folk follow mops obsessively. my : experience is that most north american engineers are too smart to do : that, and take shortcuts and _do_ slap down any attempt to deviate : imagine a network engineering culture where the concept of 'attempt to : deviate' just does not occur the network group is under control Hopefully, at least some of that was tongue-in-cheek. For managers: saved LOTS of dollars when deviating from MoPs by fixing AFU things not thought of in the MoP. For fellow netgeeks: no one woke you up because the AFU things were fixed while you slept. scott
Re: Revisiting the Aviation Safety vs. Networking discussion
flameproof panties == ON :-) :mops work. It depends on who wrote it and on the experience (on the particular network) of the person who generated it. scott
Re: Revisiting the Aviation Safety vs. Networking discussion
imagine a network engineering culture where the concept of 'attempt to deviate' just does not occur. Are you trying to suggest that this is something horrible, or that it's the future of network engineering? :) The model of network engineering that grew up during the 1990s is forever gone unless you work in a smaller organization where people have to wear many hats. In the big ISPs, now identical to the big telcos, operations and engineering design duties are separated. The operations folks do not deviate from the written plans that they work with. If the slightest thing happens that is not in the plan, they rollback the changes as specified in the plan. They don't fix anything unless it is officially broken with trouble tickets filed and escalations up to senior management. That is about the only time that operations people can get away with taking shortcuts and creative solutions. On the other hand, the engineering design folks should spend a good part of their day trying out things, thinking up new ideas, poking around equipment and software to see how far it can be pushed. Then, when they have learned something and are ready to implement it in the network, they write a detailed plan for operations. Then some other engineering folks test the heck out of that design to try and find fault with it. After all the faults are fixed, it goes to operations and the engineering design folks move on to something else unless serious problems occur and operations needs a design engineer to approve some sensible action to be taken. The operations folk can't take the sensible action because that would deviate from their plans, but getting engineering design folks involved, gives them an out for real emergencies. So the term network engineering is ambiguous because a lot of people use it to mean the 90's style job where engineering design activity and operational activity were all jumbled together. 
In some companies, taking the engineering design track not only means that you lose enable on the routers, but you lose all TACACS access and have to get authorisation from a VP just to ask for a copy of the running config on a production router. Some people like ops because they see a lot of stuff go by and learn from it, get their CCIE and move into design engineering. Others like ops because they are scared of the responsibility for thinking up what to do next, and making a mistake. As far as I can see, the only way to get a job that mixes ops and design is to be in 3rd or 4th level support which is the top of the technical escalation chain where a few excellent design engineers do have enable on the routers because they fix important problems in near realtime. I suspect that it would be advantageous to have a career in which you worked for a while in ops before moving into design engineering if you want to get into top-level support. Take all this with a grain of salt. Every company does things a bit different, and the terminology that is used is ambiguous. It would be interesting to see what others have to say about this answer. --Michael Dillon
Re: Revisiting the Aviation Safety vs. Networking discussion
On Dec 25, 2009, at 7:01 AM, Michael Dillon wrote: It would be interesting to see what others have to say about this answer. I think it's a pretty accurate summation of how these things work in a lot of big organizations, all over the world. There's a detrimental side to it, in that in the engineering org, the near-complete siloing away from ops can lead to an ivory-tower/King Canute type of mentality; in the ops org, this phenomenon in turn can lead to increasing frustration and lowered morale, which in turn leads to apathy and poor customer service. All too often, one ends up with mutually-hostile engineering and ops teams who waste time and energy actively working to frustrate one another's ambitions, rather than combining their efforts to design, build, and operate the best network possible. Which in turn leads to many of the frustrations experienced every day by the end-customer. --- Roland Dobbins rdobb...@arbor.net // http://www.arbornetworks.com Injustice is relatively easy to bear; what stings is justice. -- H.L. Mencken
RE: Revisiting the Aviation Safety vs. Networking discussion
-Original Message- From: Dobbins, Roland On Dec 25, 2009, at 7:01 AM, Michael Dillon wrote: It would be interesting to see what others have to say about this answer. I think it's a pretty accurate summation of how these things work in a lot of big organizations, all over the world. I think that one must keep in mind that there are two kinds of check-lists. There is a takeoff list where you can always choose to go back to the ramp and fly another day if something doesn't check out but there is a different priority when someone is already in the air and something goes wrong. You can't decide to land a different day. In that case you must rely on experience and knowledge to handle the situation as it presents itself. Sure, you can have some basic checks for things even in an emergency but you can't know how the problem is going to present itself ahead of time. In cases like that you have a set of general parameters but the person at the controls needs to have leeway to both clearly identify the nature of the problem and mitigate the same if possible and that might include calling in some extra eyes in order to identify things that might be going on with applications or other devices that aren't specifically network gear. So you can put a lot of process around changes in advance but there isn't quite as much to manage incidents that strike out of the clear blue. Too much process at that point could impede progress in clearing the issue. Capt. Sullenberger did not need to fill out an incident report, bring up a conference bridge, and give a detailed description of what was happening with his plane, the status of all subsystems, and his proposed plan of action (subject to consensus of those on the conference bridge) and get approval for deviation from his initial flight plan before he took the required actions to land the plane as best as he could under the circumstances. 
And while that is a bit extreme in the sense of most networks in that lives are not often at stake, some concepts are the same (and there might be networks supporting various occupations on this planet where lives might actually be at stake in the case of a network failure during some sort of activity). One of the most efficient shops I worked in was when the production internet operation was owned by the engineering department. Corporate operations owned the internal corporate IT, but engineering owned the internet production data centers and network operations. If engineering released a code revision that blew up the network, the VP of Engineering was responsible for the entire picture, not just the software piece. Same is true where a networking change blew up the application. Having the responsibility for the entire system (software, hardware platforms, and networking) under the same organization resulted in a lot smoother operation without backbiting and greater access to and sharing of resources between the application engineers, the systems administrators, and the network engineers.
Re: Revisiting the Aviation Safety vs. Networking discussion
On Dec 25, 2009, at 9:27 AM, George Bonser wrote: Capt. Sullenberger did not need to fill out an incident report, bring up a conference bridge, and give a detailed description of what was happening with his plane, the status of all subsystems, and his proposed plan of action (subject to consensus of those on the conference bridge) and get approval for deviation from his initial flight plan before he took the required actions to land the plane as best as he could under the circumstances. Conversely, the ever-increasing outright hostility and contempt evinced towards their customers by airlines worldwide - especially US-based airlines - over the last decade or so, all in the name of 'regulations', offers a useful counterexample. When it comes to larger organizations, this latter scenario is more the norm than what you describe, in my experience. Critical problems are left unresolved for days/weeks/months; if one attempts to report an issue which is causing problems for many of an organization's customers worldwide, but one isn't oneself a direct customer of said organization, one is as often as not ignored and shunted aside. This isn't specific to the SP realm; it's simply a function of increased size, which leads to increased bureaucratization, which leads to dehumanization and the subordination of the organization's ostensible goals to internal politics, one-upsmanship, and blame-laying, no matter the industry in question. The folks with a can-do attitude who're willing to buck the system in order to do the right thing for the customer stand out in stark contrast to their peers, and in many cases end up paying a price in terms of career advancement because of their willingness to Do The Right Thing. 'Process' is all too often merely a ruse designed to avoid responsibility, shift blame/liability, justify hiring lower-cost/unqualified employees whilst shedding expensive/competent employees, and indulge in empire-building. 
We've seen this throughout corporate America with the 'permanent Y2K' of SoX and HIPAA, and the increasing involvement of government in terms of telecommunications-related rule-making which ends up directly affecting SPs. I'm a big advocate of standards and change-control, and not an advocate of seat-of-the-pants, midnight engineering - except when the latter is necessary, as in the examples you give. Unfortunately, many folks who work in larger organizations are actively prohibited from indulging in fluid, situationally-appropriate problem resolution; and because of the aforementioned siloing of ops and engineering, their valuable first-hand experiences and the lessons learned thereby aren't taken into account during the design and rulemaking processes. --- Roland Dobbins rdobb...@arbor.net // http://www.arbornetworks.com Injustice is relatively easy to bear; what stings is justice. -- H.L. Mencken
Re: Revisiting the Aviation Safety vs. Networking discussion
On Thu, Dec 24, 2009 at 6:27 PM, George Bonser gbon...@seven.com wrote: So you can put a lot of process around changes in advance but there isn't quite as much to manage incidents that strike out of the clear blue. Too much process at that point could impede progress in clearing the issue. Capt. Sullenberger did not need to fill out an incident report, bring up a conference bridge, and give a detailed description of what was happening with his plane, the status of all subsystems, and his proposed plan of action (subject to consensus of those on the conference bridge) and get approval for deviation from his initial flight plan before he took the required actions to land the plane as best as he could under the circumstances. *mayday mayday mayday. **Cactus fifteen thirty nine hit birds, we've lost thrust (in/on) both engines we're turning back towards LaGuardia* - Capt. Sullenberger Not exactly detailed, but he definitely initiated an incident report (the mayday), gave a description of what was happening with his plane, the status of [the relevant] subsystems, and his proposed plan of action - even in the order you've asked for! His actions were then subject to the consensus of those on the conference bridge (ie, ATC) who could have denied his actions if they believed they would have made the situation worse (ie, if what they were proposing would have had them on a collision course with another plane). In this case, the conference bridge gave approval for his course of action (*ok uh, you need to return to LaGuardia? turn left heading of uh two two zero.* - ATC) 5 seconds before they made the above call they were reaching for the QRH (Quick Reference Handbook), which contains checklists of the steps to take in such a situation - including what to do in the event of loss of both engines due to multiple birdstrikes. 
They had no need to confer with others as to what actions to take to try and recover from the problem, or what order to take them in, because that pre-work had already been carried out when the check-lists were written. Of course, at the end of the day, training, skill and experience played a very large part in what transpired - but so did the actions of the people on the conference bridge (You can't get much more of a conference bridge than open radio frequencies), and the checklists they have for almost every conceivable situation. Scott.
Re: Revisiting the Aviation Safety vs. Networking discussion
1. I grew up at the local airport watching my CFII pop train an endless stream of pilots. 2. The checklist for my last production gear swap had over 400 steps and 4 time/task gates (each with a rollback plan). As I did each sequence of steps, I called it out, and someone read their copy of the checklist and checked it off. An entire peanut gallery of rogues watched the whole thing on livemeeting, waiting to pounce on the first misstep or shortcut. 3. We migrated an entire nationwide phone system in 6 hours and nobody noticed anything. 4. We met afterward in an after-action review meeting, a practice that I picked up in the Army. I'm more persistent than smart, and I tell ya, if you prep well enough, you can hand your checklist to a stoned intern and you'll have no worries at all. David On Wed, Dec 23, 2009 at 12:48 PM, Owen DeLong o...@delong.com wrote: Those that remember the discussion may find this article interesting: http://abcnews.go.com/Health/wireStory?id=9394406 Owen
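David's gated checklist (steps called out and checked off, with a rollback plan per gate) can be sketched roughly as below. This is a guess at the shape, not his actual procedure. `Step`, `Gate`, and `run_checklist` are hypothetical names, and the rollback policy is an assumption: on the first failed step, undo the current gate's completed steps in reverse order and stop, leaving earlier gates in place.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], bool]   # returns True if the step succeeded
    undo: Callable[[], None]     # reverses the step during rollback


@dataclass
class Gate:
    name: str
    steps: List[Step] = field(default_factory=list)


def run_checklist(gates, log):
    """Run gates in order, logging a call-out and a check-off per step.

    On the first failed step, undo the current gate's completed steps
    in reverse order and return False (assumed rollback policy).
    """
    for gate in gates:
        done = []
        for step in gate.steps:
            log.append(f"call: {gate.name}/{step.name}")
            if step.action():
                log.append(f"check: {gate.name}/{step.name}")
                done.append(step)
            else:
                log.append(f"FAIL: {gate.name}/{step.name}")
                for prev in reversed(done):
                    prev.undo()
                    log.append(f"undo: {gate.name}/{prev.name}")
                return False
        log.append(f"gate passed: {gate.name}")
    return True
```

The log plays the role of the second person with their own copy of the checklist: every step leaves a call-out and a check-off, so a skipped or failed step is visible afterward in the review.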