Good afternoon Howard,

Across the courtyard from you, they have a fairly robust production AR System implementation consisting of the following:

- 6 AR System application servers running Red Hat Enterprise Linux
- 4 AR System Mid-Tier application servers running Red Hat Enterprise Linux
- a 3-node Oracle 11g RAC cluster (with hot backup)

Each of the AR System application servers is a member of the same AR System server group. We logically segmented access to the servers into three categories:
- user-facing
- application/integration facing
- back-office

The "user-facing" farm handled all traffic initiated directly by human beings using either the AR System servers or the AR System Mid-Tier application servers. People accessed this environment by going to yamato.wildstartech.com (a made-up server name). The "application/integration facing" farm handled all traffic initiated by applications integrated with the platform. Applications accessed this environment by going to yamatoapp.wildstartech.com (also made up). The "back-office" farm was responsible for tasks such as escalations, DSO, and notifications. There were two servers in each segment.

On the Mid-Tier side of things, we had a user-facing segment (two servers) and an application/integration facing segment (also two servers). Users accessed these environments using yamato.wildstartech.com and yamatoapp.wildstartech.com.

We used hardware load balancers to manage access to the servers. The load balancers were configured such that traffic coming into yamato.wildstartech.com on either TCP port 80 or 443 was routed to the user-facing segment of AR System Mid-Tier application servers, while traffic coming in on TCP/UDP port 111 (for the UNIX portmapper) or TCP port 3111 was routed to the user-facing segment of AR System application servers. We also configured the load balancer so that the different segments backed each other up. Say, for example, one of the nodes in the user-facing farm became unavailable: that traffic would be routed to the back-office farm. If the back-office farm was not available for some reason, then user-facing traffic would be routed to the application/integration facing farm.
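The port-based routing and segment fallback described above can be sketched as a toy model in Python. To be clear, we used hardware load balancers, not code like this; the segment names below are my own made-up labels, and the hostnames mirror the made-up examples in the text:

```python
# Toy model of the load-balancer routing described above.
# Segment names are hypothetical labels, not real configuration.

# Port-based routing: 80/443 -> user-facing Mid-Tier segment;
# 111 (UNIX portmapper) / 3111 -> user-facing AR System segment.
PORT_ROUTES = {
    80:   "midtier-user-facing",
    443:  "midtier-user-facing",
    111:  "arsystem-user-facing",
    3111: "arsystem-user-facing",
}

# Backup ordering for user-facing AR System traffic:
# first the back-office farm, then the application/integration farm.
FALLBACK = {
    "arsystem-user-facing": ["arsystem-back-office", "arsystem-app-facing"],
}

def route(port, available_segments):
    """Return the segment that should receive traffic arriving on `port`."""
    segment = PORT_ROUTES[port]
    if segment in available_segments:
        return segment
    for backup in FALLBACK.get(segment, []):
        if backup in available_segments:
            return backup
    # If no backup farm is available either, the remaining node(s)
    # in the original segment simply carry the load.
    return segment
```

A usage example: `route(3111, {"arsystem-back-office"})` returns `"arsystem-back-office"`, reflecting the first-choice backup farm for user-facing AR System traffic.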
If neither the back-office nor the application/integration facing farm was available, the load would simply be handled by the remaining node in the user-facing farm.

In the AR System Mid-Tier environment, a user's request was balanced at the Mid-Tier level, but traffic between a Mid-Tier application server and an AR System application server was NOT load-balanced. So in our two-node configuration, traffic routed to AR System Mid-Tier application server A was ALWAYS routed to AR System application server A, and traffic routed to Mid-Tier application server B was ALWAYS routed to AR System application server B. The only exception was that if either AR System application server was NOT available, traffic would be routed to the remaining application server.

We did not do anything special with regard to individual node failures in the Oracle RAC cluster. That was handled by the OCI client based upon how the AR System application servers were opening connections.

We did NOT take servers out of the farm when we were deploying AR System application server workflow. We simply deployed the code to the server identified as the Administrator server in the server group. If we had to bring down an AR System application server, the process was relatively straightforward:

1. Request that our NOC remove the server from the load-balanced farm.
2. Bring that server down and perform maintenance.
3. Bring the server back up and validate that the maintenance was completed properly.
4. Request that our NOC add the server back into the load-balanced farm.

All AR System server patches were completed in this way, thereby allowing us to classify the maintenance activity as a degradation, because at no point in time did the AR System applications become unavailable to a user.
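The rolling pattern in those four steps can be sketched in a few lines of Python. In reality the NOC steps were manual requests, not API calls; the function below just records the ordering so the one-node-at-a-time property is easy to see:

```python
# Sketch of the rolling-maintenance procedure described above.
# Server names are made up; the "NOC" entries stand in for what were
# manual requests to our network operations center.

def rolling_maintenance(servers):
    """Patch one server at a time so the platform never goes fully down."""
    log = []
    for server in servers:
        log.append(f"NOC: remove {server} from farm")  # 1. out of rotation
        log.append(f"maintain {server}")               # 2. down, patch, back up
        log.append(f"validate {server}")               # 3. confirm the patch took
        log.append(f"NOC: re-add {server} to farm")    # 4. back into rotation
    return log

log = rolling_maintenance(["ars-a", "ars-b"])
```

The point of the ordering is that each server is fully back in the farm before the next one comes out, which is what lets the activity be classified as a degradation rather than an outage.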
If we were to really toe the line, we would have allowed the connections to "drain off" of the server on which we were going to be performing maintenance once the NOC removed it from the load-balanced farm. Because users were accessing the AR System application servers through a load-balanced farm, this transition was relatively invisible to them. At most, they'd have to log back in, and then they could continue doing whatever it was they were working on.

The only times we really had the entire platform down were when:

- major structural database maintenance needed to be performed
- a database outage occurred (sometimes even the hot backup wasn't enough)
- a major AR System server upgrade (not Mid-Tier) needed to be performed

If a patch needed to be applied at the database level, the patch was typically applied to the cluster NOT in use. During a scheduled maintenance activity, we would fail our AR System application servers over to that cluster, thereby allowing the un-patched database to be updated. As I understand it, there are changes coming in a future release of the AR System server platform which will be more conducive to zero-downtime upgrades (for all things in the AR System world). Oracle RAC clustering is pretty good; however, it wasn't as seamless as I would have liked.

From a usage perspective, we typically had between 3,000 and 5,000 concurrent users on our application servers during the day. We only did planned maintenance between 2:00 AM and 7:00 AM EST. During that time, there were significantly fewer users online - maybe only a few hundred. In five years, we ran 1 billion ticket and ticket-supporting (e.g., attachments, journals, auditing, notifications) records through the platform. Most of our volume occurred between 7:00 AM and 10:00 PM EST. Averaged over a twenty-four hour clock, throughput was fairly consistent at about 8 records being created per second. We were a 24x7x365 shop, but we did have planned maintenance.
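Those throughput numbers hang together, by the way; a quick back-of-the-envelope check:

```python
# Sanity check: ~8 records/second averaged over a 24-hour clock,
# sustained for five years, lands right around the quoted 1 billion.
records_per_second = 8
seconds_per_year = 365 * 24 * 60 * 60   # 31,536,000
total = records_per_second * seconds_per_year * 5
print(total)  # 1261440000 -- about 1.26 billion records
```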
I only mention this to highlight that it was (and is) a busy platform. For us, this configuration was a good balance between availability and cost. There are more sophisticated things we could have done, and more money we could have spent, to get even more availability out of our platform, but what we had worked for us.

Hope this helps.

Derek

> On Sep 10, 2014, at 8:05 AM, Richter, Howard (CEI - Atlanta)
> <howard.rich...@coxinc.com> wrote:
>
> All,
>
> After needing 3 long-duration outages to install the pieces needed for a BMC
> add-on product (which will remain nameless), I wanted to see if there was
> some method for a high-availability ITSM (7.6.4 or greater) system.
>
> Our current system is 3 arservers (1 app, 2 user-facing), three mid-tiers, and
> one database.
>
> So I am looking at future architectural ideas (when we move to 8.x) to put us
> in a position whereby we can give our customers the high availability we
> need and yet install some of these products that require restarts and other
> items that add to unavailability.
>
> So looking for ideas.
>
> Thanks,
>
> hbr
>
> Howard Richter, Remedy Administrator
> Email = howard.rich...@coxinc.com