Good afternoon Howard,

Across the courtyard from you, they have a fairly robust production AR System 
implementation consisting of the following:

6 AR System application servers running Red Hat Enterprise Linux
4 AR System Mid-Tier application servers running Red Hat Enterprise Linux
a 3-node Oracle 11g RAC cluster (with hot backup)

Each of the AR System application servers is a member of the same AR System 
Server Group.  We logically segmented access to the servers into three 
categories:

user-facing
application/integration facing
back-office

The “user-facing” farm was responsible for handling all traffic directly 
initiated by human beings using either the AR System server or the AR System 
Mid-Tier application servers.  People accessed this environment by going to 
yamato.wildstartech.com (made up server name).

The “application/integration facing” farm was responsible for handling all 
traffic initiated by applications that are integrated with the platform.  
Applications accessed this environment by going to yamatoapp.wildstartech.com 
(also made up).

The “back-office” farm was responsible for tasks such as escalations, DSO, and 
notifications.

There were two servers in each segment.

On the Mid-Tier side of things, we had a user-facing segment (consisting of two 
servers) and an application/integration facing segment (also consisting of two 
servers).  Users accessed these environments using yamato.wildstartech.com and 
yamatoapp.wildstartech.com, respectively.

We used hardware load balancers to manage access to the servers.  The load 
balancers were configured such that if traffic came into 
yamato.wildstartech.com on either TCP port 80 or 443, the request was routed 
to the user-facing segment of AR System Mid-Tier application servers.  Traffic 
coming in on TCP/UDP port 111 (for the UNIX portmapper) or TCP port 3111 was 
routed to the user-facing segment of AR System application servers.
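
To make that concrete, here is a minimal sketch of the routing rules in Python.
The pool names and the select_pool() helper are made up for illustration; the
real rules lived in the hardware load balancer's configuration, not in code.

    # Hypothetical model of the load balancer's port-based routing rules.
    # Pool names are invented for illustration.
    ROUTING_RULES = {
        ("yamato.wildstartech.com", "tcp", 80):   "midtier-user-facing",
        ("yamato.wildstartech.com", "tcp", 443):  "midtier-user-facing",
        ("yamato.wildstartech.com", "tcp", 111):  "arserver-user-facing",
        ("yamato.wildstartech.com", "udp", 111):  "arserver-user-facing",
        ("yamato.wildstartech.com", "tcp", 3111): "arserver-user-facing",
    }

    def select_pool(host, protocol, port):
        """Return the name of the server pool a request should be sent to."""
        return ROUTING_RULES.get((host, protocol, port))

    # select_pool("yamato.wildstartech.com", "tcp", 443) -> "midtier-user-facing"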

We also configured the load balancer in such a way that the different segments 
backed each other up.  Say, for example, one of the nodes in the user-facing 
farm became unavailable.  That traffic would be routed to the back-office farm. 
 If the back-office farm was not available for some reason, then user-facing 
traffic would be routed to the application/integration facing farm.  If neither 
the back-office nor the application/integration facing farm was available, the 
load would simply be handled by the remaining node in the user-facing farm.
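
As a sketch of that spill-over ordering (again with invented farm names and an
invented health map; the real behavior was a priority setting on the hardware
load balancer):

    # Hypothetical model of the spill-over ordering described above.  When a
    # user-facing node fails, its share of traffic goes to the back-office farm
    # first, then the application/integration farm, and only as a last resort
    # piles onto the surviving user-facing node.
    SPILLOVER_ORDER = ["back-office", "application-integration", "user-facing"]

    def reroute_failed_node_traffic(healthy_nodes_by_farm):
        """healthy_nodes_by_farm maps farm name -> number of healthy nodes."""
        for farm in SPILLOVER_ORDER:
            if healthy_nodes_by_farm.get(farm, 0) > 0:
                return farm
        raise RuntimeError("no healthy nodes available in any farm")

    # Example: user-facing node A fails while everything else stays healthy.
    # reroute_failed_node_traffic({"user-facing": 1,
    #                              "back-office": 2,
    #                              "application-integration": 2})
    #   -> "back-office"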

In the AR System Mid-Tier environment, a user’s request was balanced at the 
Mid-Tier level, but traffic between a Mid-Tier application server and an AR 
System application server was NOT load-balanced.  So in our two-node 
configuration, traffic routed to AR System Mid-Tier application server A was 
ALWAYS routed to AR System application server A.  Similarly, traffic routed to 
AR System Mid-Tier application server B was ALWAYS routed to AR System 
application server B.  The only exception was when one of the AR System 
application servers was NOT available; in that case, traffic was routed to the 
remaining application server.
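
A rough sketch of that pairing (server names invented; this models the
behavior, not how it was actually configured):

    # Hypothetical Mid-Tier -> AR System server pinning with a single fallback.
    PINNED_AR_SERVER = {
        "midtier-a": "arserver-a",
        "midtier-b": "arserver-b",
    }

    def ar_server_for(midtier, healthy_ar_servers):
        """Return the AR System server a given Mid-Tier instance should use."""
        preferred = PINNED_AR_SERVER[midtier]
        if preferred in healthy_ar_servers:
            return preferred
        if healthy_ar_servers:
            # Preferred server is down: fall back to any remaining healthy one.
            return sorted(healthy_ar_servers)[0]
        raise RuntimeError("no AR System application servers available")

    # ar_server_for("midtier-a", {"arserver-a", "arserver-b"}) -> "arserver-a"
    # ar_server_for("midtier-a", {"arserver-b"})               -> "arserver-b"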

We did not do anything special with regard to individual node failures in the 
Oracle RAC cluster.  That was handled by the OCI client, based upon how the AR 
System application servers opened their connections.
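
For illustration only, a connect descriptor along these lines (with Transparent
Application Failover enabled) is the sort of thing that lets the OCI client
ride out a single RAC node failure.  The host names, service name, and
credentials below are invented, and I'm not claiming this is exactly what we
used:

    # Illustrative only: an Oracle TNS connect descriptor with TAF, shown here
    # through the cx_Oracle client.  Hosts, service, and credentials are made up.
    import cx_Oracle

    DSN = """(DESCRIPTION=
                (ADDRESS_LIST=
                  (LOAD_BALANCE=ON)(FAILOVER=ON)
                  (ADDRESS=(PROTOCOL=TCP)(HOST=rac-node1)(PORT=1521))
                  (ADDRESS=(PROTOCOL=TCP)(HOST=rac-node2)(PORT=1521))
                  (ADDRESS=(PROTOCOL=TCP)(HOST=rac-node3)(PORT=1521)))
                (CONNECT_DATA=
                  (SERVICE_NAME=arsystem)
                  (FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC))))"""

    connection = cx_Oracle.connect("aradmin", "secret", DSN)
    print(connection.version)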

We did NOT take servers out of the farm when we were deploying AR System 
application server workflow.  We simply deployed the code to the server that 
was identified as the Administrator server in the server group.

If we had to bring down an AR System application server, the process was 
relatively straightforward (a rough scripted sketch of these steps follows the 
list):

Request that our NOC remove the server from the load-balanced farm.
Bring that server down and perform maintenance.
Bring the server back up and validate the maintenance was completed properly.
Request that our NOC add the server back into the load-balanced farm.
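
If you wanted to script that rotation, the skeleton might look something like
this; the helper below is a stand-in for "ask the NOC to do it" (in our shop
these were requests to the NOC, not API calls):

    # Entirely hypothetical sketch of the rolling-maintenance steps above.
    import time

    def ask_noc(action, node):
        # Placeholder for the actual request to the NOC / load balancer.
        print(f"NOC: please {action} {node}")

    def patch_ar_server(node, drain_seconds=300):
        ask_noc("remove from the load-balanced farm", node)    # step 1
        time.sleep(drain_seconds)                               # let sessions drain off
        ask_noc("shut down and patch", node)                    # step 2
        ask_noc("start up and validate", node)                  # step 3
        ask_noc("add back into the load-balanced farm", node)   # step 4

    for node in ["arserver-a", "arserver-b"]:                   # one node at a time
        patch_ar_server(node)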

All AR System server patches were completed in this way, thereby allowing us to 
classify the maintenance activity as a degradation, because at no point in time 
did the AR System applications become unavailable to users.  If we were to 
really toe the line, we would have allowed the connections to the load-balanced 
server to “drain off” of the server on which we were going to be performing 
maintenance once the NOC removed the server from the load-balanced farm.  
Because users were accessing the AR System application servers through a 
load-balanced farm, this transition was relatively invisible to them.  At most, 
they’d have to log back in and could continue doing whatever they were working 
on.

The only time we really would have the entire platform down was if:

major structural database maintenance needed to be performed
a database outage occurred (sometimes even the hot backup wasn’t enough)
a major AR System server upgrade (not Mid-Tier) needed to be performed

If a patch needed to be applied at the database level, the patch was typically 
applied to the cluster NOT in use.  During a scheduled maintenance activity, we 
would fail our AR System application servers over to that (already patched) 
cluster, thereby allowing the previously active, un-patched database to be 
updated.

As I understand it, there are changes coming in a future release of the AR 
System server platform which will be more conducive to zero downtime for 
upgrades (for all things in the AR System world).  Oracle RAC clustering is 
pretty good; however, it wasn’t as seamless as I would have liked.

From a usage perspective, we typically had between 3,000 and 5,000 concurrent 
users on our application servers during the day.  We only did planned 
maintenance between 2:00 AM and 7:00 AM EST.  During that time, there were 
significantly fewer users online - maybe only a few hundred.  In five years, we 
ran 1 billion ticket and ticket-supporting (e.g., attachments, journals, 
auditing, notifications) records through the platform.  Most of our volume 
occurred between 7:00 AM and 10:00 PM EST.  Throughput, averaged over a 
twenty-four-hour clock, was fairly consistent at 8 records created per second.  
We were a 24x7x365 shop, but we did have planned maintenance.  I only mention 
this to highlight that it was (and is) a busy platform.
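
As a quick back-of-the-envelope check on those figures:

    # Pure arithmetic on the numbers quoted above.
    per_second = 8                      # average records created per second
    per_day = per_second * 86400        # ~691,200 records per day
    five_years = per_day * 365 * 5      # ~1.26 billion records
    print(per_day, five_years)
    # In the same ballpark as the ~1 billion ticket and ticket-supporting
    # records mentioned above.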

For us, this configuration was a good balance between availability and cost.  
There are more sophisticated things that we could have done and there was more 
money we could have spent to get even more availability out of our platform, 
but what we had worked for us.

Hope this helps.

Derek


> On Sep 10, 2014, at 8:05 AM, Richter, Howard (CEI - Atlanta) 
> <howard.rich...@coxinc.com> wrote:
> 
> All,
>  
> After needing 3 long duration outages to install the pieces needed for a BMC 
> add on product (which will remain nameless), I wanted to see if there was 
> some method for a High Availability ITSM (7.6.4 or greater) system. 
>  
> Our current system is 3 arservers (1 app, 2 user facing), three mid-tiers and 
> one database.
>  
> So I am looking at future architectural ideas (when we move to 8.x) to put us 
> in a position, whereby we can give our customers the high availability we 
> need and yet install some of these products, that require restarts and other 
> items that add to unavailability.
>  
> So  looking for ideas.
>  
> Thanks,
>  
> hbr
>  
> Howard Richter, Remedy Administrator
> Email = howard.rich...@coxinc.com <mailto:howard.rich...@coxinc.com>
>  
> _ARSlist: "Where the Answers Are" and have been for 20 years_

