> > Using another load-balancing box (F5 or whatever) only moves the problem
> > to that box. Duplicating it, moves the problem to another box, until
> > the costs exponentially grow beyond the initial intended value of the
> > solution. The weak points become lots of other boxes and infrastructure,
> > suggesting that asterisk really isn't "the" weakest point (regardless of
> > what its built on).
>
> Rich is hitting the main point in designing anything for high
> reliability. So lets enumerate failures and then what if anything can be
> done to eliminate them.
>
> 1. Line failures. <snip>
> 2. Hardware failure. <snip>
> 3. Software failure.
>    This could be any number of bugs not yet found or that will be
>    introduced later. <snip>
> 4. Phones.
The primary points the questions were attempting to uncover are more related to basic layer-2 and layer-3 issues (across all necessary components in an end-to-end telephony implementation), and not just basic hardware configurations. Having spent a fair number of years working with corporations that have attempted to build high-availability solutions, I've found the typical engineering approach is almost always oriented toward throwing more hardware at the problem rather than thinking about the basic layer-2/3/4 issues. (I don't have an answer I'm sponsoring either; I'm just looking for comments from those that intimately know the "end-to-end" impact of doing things like hot-sparing or clustering.)

I'm sure it's fairly clear to most that adding redundant supplies, UPSes, RAID, etc., will improve the uptime of the * box. However, once past throwing hardware at "the" server, where are the pitfalls associated with hot-sparing or clustering * servers?

Several well-known companies have attempted products that swap MAC addresses between machines (layer 2), hide servers behind a virtual IP (layer 3), hide a cluster behind some form of load-balancing hardware (generally layers 2 and 3), etc. Most of those solutions end up creating yet another problem that was not considered in the original thought process; in other words, not well thought out. (Even Cisco, with a building full of engineers, didn't initially consider the impact of flip-flopping between boxes when HSRP was first implemented, and there still are issues with that approach that many companies have witnessed first-hand.) Load balancers have some added value, but those that have had to deal with a single system within the cluster being up yet not processing data would probably dispute their actual value. (A health-check sketch for that failure mode follows below.)

So, if one were to attempt either hot-sparing or clustering, are there issues associated with SIP, RTP, IAX, NAT, and/or other Asterisk protocols that would impact the high-availability design? One issue that would _seem_ to be a problem is installations that have to use canreinvite=no (meaning even in a clustered environment those RTP sessions are going to be dropped by a server failure; maybe it's okay to simply note such exceptions in a proposed high-availability design. See the snippet below.) And if any proposed design actually involved a different MAC address, obviously all local SIP phones would die, since the ARP cache timeout within the phones would preclude a failover. (Not cool. A sketch of one common workaround follows below as well.)

IBM (with their stacks of AIX machines) and Tandem (with their NonStop architecture) didn't throw clustered database servers at the problem. Both had them, but not as a means of increasing the availability of the base systems.

Technology now supports 100-meg layer-2 pipes throughout a city at a reasonable cost. If a cluster were split across multiple buildings within a city, it would certainly be of interest to those responsible for business continuity planning. Are there limitations?

Someone mentioned that the only data that needed to be shared between clustered systems was phone registration info (and then quickly jumped to engineering a solution for that). Is that the only data needed, or might someone need a ton of other stuff? (Are CDR, IAX, dialplans, AGI, voicemail, and/or other dynamic data issues that need to be considered in a reasonable high-availability design?)
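As a straw man for the registration question: where the Asterisk Realtime architecture is available (newer versions only), the SIP user/peer data can be moved out of flat files into a shared database, so a surviving box can see where the phones last registered. A minimal extconfig.conf mapping might look like the following (the "asterisk" DSN name is purely illustrative):

    [settings]
    ; look up SIP users/peers in a shared ODBC database instead of sip.conf
    sipusers => odbc,asterisk
    sippeers => odbc,asterisk

Of course that only answers the registration half of the question; CDR could arguably go to the same replicated database (cdr_odbc and friends) instead of a local CSV file, and voicemail, dialplans, and AGI state would each need their own answer.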
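For anyone who hasn't run into the canreinvite issue: the sip.conf setting below keeps Asterisk bridging the RTP itself, which NAT frequently forces. With re-invites allowed, the media flows phone-to-phone and a mid-call server failure may leave the audio up; with the media nailed through the server, every active call's RTP dies with the box:

    [general]
    ; force RTP to stay bridged through the Asterisk server
    ; (frequently required when phones sit behind NAT)
    canreinvite=no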
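On the ARP-cache problem: one dodge people use is VRRP-style failover (keepalived on Linux is one implementation; this is a sketch, not a recommendation). The phones point at a virtual IP that is answered with a virtual MAC, so when the backup box takes over, the MAC the phones have cached is still valid, and gratuitous ARPs get sent on the transition anyway. The interface name and addresses below are made up for illustration:

    vrrp_instance ASTERISK_VIP {
        state MASTER           # the peer box runs BACKUP at a lower priority
        interface eth0
        virtual_router_id 51
        priority 100
        advert_int 1           # failover after a few missed 1-second adverts
        virtual_ipaddress {
            10.0.0.10          # the address the phones actually register to
        }
    }

Note that this just brings the HSRP flip-flop problem mentioned above into your own closet, and it does nothing for in-flight RTP or registration state.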
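And on the up-but-not-processing failure mode: a load balancer that only checks whether the box answers pings, or whether port 5060 is open, will happily keep sending calls to a wedged server. A check that actually exercises the SIP stack, along the lines of this rough sketch (untested; the host, port, and timeout are placeholders), makes a better pull-from-the-pool test:

    #!/usr/bin/env python3
    # Minimal SIP OPTIONS "ping": exits 0 if the target answers with any
    # SIP response, 1 otherwise. A sketch only; a real health check would
    # also want retries and a sanity check on the response code.
    import socket
    import sys

    def sip_options_ok(host, port=5060, timeout=2.0):
        msg = ("OPTIONS sip:%s SIP/2.0\r\n"
               "Via: SIP/2.0/UDP 0.0.0.0:5060;branch=z9hG4bK-hc1\r\n"
               "From: <sip:healthcheck@invalid>;tag=hc1\r\n"
               "To: <sip:%s>\r\n"
               "Call-ID: healthcheck-1@invalid\r\n"
               "CSeq: 1 OPTIONS\r\n"
               "Max-Forwards: 70\r\n"
               "Content-Length: 0\r\n\r\n" % (host, host))
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.settimeout(timeout)
        try:
            s.sendto(msg.encode("ascii"), (host, port))
            data, _ = s.recvfrom(4096)  # any reply means the SIP stack is alive
            return data.startswith(b"SIP/2.0")
        except socket.timeout:
            return False
        finally:
            s.close()

    if __name__ == "__main__":
        host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
        sys.exit(0 if sip_options_ok(host) else 1)

Even that only proves the SIP stack answers; "processing calls end-to-end" is a higher bar still.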
Whether the objective is 2, 3, 4, or 5 nines is somewhat irrelevant. If one had to stand in front of the President or Board and represent/sell availability, they are going to assume end-to-end and not just "the" server. Later, they are not going to talk kindly about the phone system when your single F5 box died; or (not all that unusual) when you say Asterisk was up the entire time, it's your stupid phones that couldn't find it!! (Or, you lost five hours of CDR data because of why???)

I'd have to guess there are probably hundreds on this list that can engineer RAID drives, UPSes for ethernet closet switches, protected Cat 5 cabling, and switch boxes that can move physical interfaces between servers. But I'd also guess there are far fewer that can identify many of the SIP, RTP, IAX, NAT, CDR, etc., issues. What are some of those issues? (Maybe there aren't any?)

Rich