See comments [Jeff] below:

-----Original Message-----
From: George Sexton [mailto:geor...@mhsoftware.com]
Sent: Wednesday, August 26, 2009 2:39 PM
To: 'Tomcat Users List'
Subject: RE: Multiple data centers and redundency?

> -----Original Message-----
> From: Jeffrey Janner [mailto:jeffrey.jan...@polydyne.com]
> Sent: Wednesday, August 26, 2009 12:53 PM
> To: Tomcat Users List
> Subject: RE: Multiple data centers and redundency?
>
> George -
> This is why I hate statistics. You can make them say anything.
> Wouldn't the better calculation be based on the average number of
> currently active sessions at one data center, since when it goes down,
> that is the number of users which will be affected?

You're talking about the actual number of people affected. I'm talking about the probability of any one session being affected. I mean really, if once every 3 years or so, 500 people have to re-login to do their transaction, is it that big a deal?

[Jeff] I don't think so, but that's a question for Andre-John's management, thus our discussion. I was merely pointing out that limiting it to one session was a bit of overkill, probably resulting in a probability much lower than the actual one. If it affects one session, it will affect the 500, so you only really need to calculate the probability of the event itself. Then determine whether that probability, multiplied by the percentage of the 500 customers that would get severely pissed off as a result, would be worth the $$$$ to find a solution.

I'm basically disagreeing with your 2/24 portion of the calculation as immaterial to the problem. Length of session doesn't really matter; it's criticality of session state that matters. So, determine the risk of the outage event (period) and apply it to a formula to determine the $ value of that event in lost business (actual recovery costs will be incurred either way). Also, you divided by 2 for both data centers. Shouldn't that be left out, as the event should only happen at the one? [/Jeff]

> The calculation should also include probability and length of outage

I talked about probability. 1 instance per 3 years was the baseline. Length of the outage is not a factor.
They have a 2nd data center that things will transparently fail over to, so the length of the outage is immaterial. Length of outage would only count if you were putting everything in one data center and wanted to calculate the risk/loss for an outage.

[Jeff] You're right, length of outage doesn't apply in this case where failover is available. I had a brain-fart. [/Jeff]

> and then weighed against the downside of the end user getting an error
> message and having to login again and start over.
> I'd really only see it as being a problem if there were extremely
> long-running transactions that would have to be restarted in the event
> of an outage (or a really poorly designed app).

There's a lot of math you can only do if you know the app. How long are sessions really? If they're 20 minutes long, then I'm overstating it by a factor of 6. Are session counts evenly distributed through the day? If not, probability is going to vary based on the time of day of the outage. So, if your session is 2 hours long, and all your traffic is during the day, then that part becomes 2/8, not 2/24. Are outages evenly distributed throughout the day? If so, then there's about a 2/3 chance the outage will come at an off-peak time.

OTOH, the really big data centers don't have anything like an outage per 3 years. Just doesn't happen.

[Jeff] Yep, redundant links, redundant power & air, redundant servers & DBs - shouldn't be a problem. I could only see this as an issue if you are looking to hot-site your Disaster-Recovery planning, and normally you are expecting some downtime for that scenario. I personally don't see the need for this much engineering except for an extremely critical life & death processing operation. [/Jeff]

> Jeff
> p.s. I'm with you on this probably being a minor concern causing a
> larger headache, but we should get the scope of the problem correct to
> begin with.
> (said by one who both supports and uses webapps that
> support large numbers of users)

Not very many people take the time to understand their actual risk, so they end up over-engineering/over-spending on solutions. The problem with over-engineered solutions can be that they're under-tested and they break when you need them.

[Jeff] Speaking of engineering the solution, they will also need to guarantee that all background info (DB updates, etc.) has replicated to the other datacenter before switching the session over. Chances are probably better that some transactions didn't make it before the outage occurred. Beginning to sound to me like it's better to let the users login again at the new site. [/Jeff]

>
> -----Original Message-----
> From: George Sexton [mailto:geor...@mhsoftware.com]
> Sent: Wednesday, August 26, 2009 9:52 AM
> To: 'Tomcat Users List'
> Subject: RE: Multiple data centers and redundency?
>
> > -----Original Message-----
> > From: Andre-John Mas [mailto:aj...@sympatico.ca]
> > Sent: Tuesday, August 25, 2009 6:30 PM
> > To: Tomcat Users List
> > Subject: Multiple data centers and redundency?
> >
> > Hi,
> >
> > I have been asked to look into a solution that would involve a few
> > different data centres each with their own set of load balanced Tomcat
> > servers. The requirement is for the users not to lose their session if
> > one data center goes down. I have never had to work on something this
> > large and have no idea to what extent this can be achieved with Tomcat.
> >
> > My initial thoughts would be for each data center to have a session
> > pool, which is synced with each other, so if ever a Tomcat server or
> > data center goes down they can check in the pool to see if it exists
> > and then reuse that. It would mean extra communication behind the
> > scene, but I see no other way to go about it.
> >
> > Any help would be appreciated.
> >
> > André-John
>
> Has anyone really done any math to determine the risk?
>
> Here's an example of what I mean.
>
> Say you are in a high quality co-location. The probability of an outage is
> maybe once in 3 years. That's overstating the probability in my mind, but
> we'll use it. Let's also say that you have a high quality clustering
> solution in place in each data center that handles failover of any equipment
> WITHIN the data center.
>
> Say the average length of a user/customer session is 2 hours, and your
> failover system will route any new users to a remaining data center. I think
> 2 hours is kind of a long session but we'll use it. Say you have two data
> centers.
>
> So, the probability of an average customer being affected by a data center
> outage is:
>
> (2 hours / 24 hours per day) * (1 / (3 * 365 days)) / 2 data centers = 1/26280
>
> The probability of an average customer being affected by an outage is
> conservatively 1 in 26280. Expressed as a percentage, the probability of any
> individual session being affected is 0.0038%.
>
> Is your application really so big and critical that you have to address this
> very small percentage chance of a session being interrupted?
>

George Sexton
MH Software, Inc.
http://www.mhsoftware.com/
Voice: 303 438 9585

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
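
For anyone who wants to plug in their own numbers, George's per-session estimate can be sketched in a few lines of Python. This is only an illustration of the arithmetic discussed in the thread: the function name and parameter names are made up here, and the model (one outage per 3 years, spread evenly over time and across the data centers) is the rough one from the thread, not a rigorous availability model.

```python
from fractions import Fraction


def session_outage_probability(session_hours,
                               active_hours_per_day=24,
                               days_between_outages=3 * 365,
                               data_centers=2):
    """Rough chance that a single session is hit by a data center outage.

    Hypothetical helper: the names and defaults are assumptions taken
    from the numbers quoted in the thread.
    """
    # Share of the active day covered by one session (the 2/24 term).
    session_share = Fraction(session_hours) / active_hours_per_day
    # One outage every `days_between_outages` days (once per 3 years -> 1/1095).
    daily_outage = Fraction(1, days_between_outages)
    # The outage happens at only one of the data centers.
    return session_share * daily_outage / data_centers


baseline = session_outage_probability(2)
print(baseline, f"{float(baseline):.4%}")  # 1/26280 0.0038%

# 20-minute sessions instead of 2-hour sessions.
short_sessions = session_outage_probability(Fraction(1, 3))
print(baseline / short_sessions)  # 6

# All traffic packed into an 8-hour day: the 2/24 term becomes 2/8.
print(session_outage_probability(2, active_hours_per_day=8))  # 1/8760
```

Using `Fraction` keeps the "1 in N" form exact, which makes it easy to compare against the figures quoted above. Jeff's objections (drop the 2/24 term, drop the division by 2) amount to changing `session_hours` and `data_centers`, so both views can be tried with the same function.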