On Tue, 18 Mar 2003, Joseph Temple wrote:

> I would point out that clustering makes hardware more available, not more
> reliable.

If the application stays up, it's more reliable.

> The things actually fail more often because there is  more to
> fail,

I'm sure that's actually true in IBM mainframes too.

I read recently about a new "disk" drive from IBM, I guess in many
respects a successor to RAID.

A disk failed? Leave it there, swap in a spare.

Zero maintenance because failed components are swapped out of service,
spares swapped in.

Are the individual drives especially reliable? No. Is the storage device
especially reliable? Yes. Does anyone care about the fine distinction?
No.
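The distinction can be made concrete with a little arithmetic. Here's a rough sketch of why a device built from unreliable drives plus hot spares is itself reliable; the drive counts and failure probabilities are made-up illustrative numbers, not specs for any IBM product, and rebuild windows are ignored:

```python
from math import comb

def system_survival(n_drives, n_spares, p_fail):
    """Probability the storage device survives a service interval:
    it stays up as long as no more than n_spares drives fail,
    assuming independent failures (a simplification)."""
    n = n_drives + n_spares
    return sum(
        comb(n, k) * p_fail**k * (1 - p_fail)**(n - k)
        for k in range(n_spares + 1)
    )

# A single, not especially reliable drive: 5% chance of failing
# in the interval, i.e. 95% chance of surviving it.
print(system_survival(1, 0, 0.05))

# Eight data drives plus four hot spares built from the same
# unreliable parts: the *device* survives unless five drives fail,
# which pushes survival well past 99.9%.
print(system_survival(8, 4, 0.05))
```

The individual parts are no better; only the pool of spares makes the whole device dependable, which is exactly the fine distinction nobody cares about.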


> but the user sees the cluster as available during the failures.
> There  are some problems with availability clustering as it is usually
> done, causing the cluster to have lower (often significantly lower)
> availability than it is designed to have.  The first problem is that as
> the utilization rises on a cluster the redundancy in the cluster drops,
> unfortunately so does the reliability of the components.   Systems whose
> load is growing start losing availability from day one as the workload
> grows.   Most folks add hardware when they need it for capacity, not when
> they need it for availability.  Next when running with one or more
> redundant servers down, the probability of failure of the remaining servers
> increases due to stress brought on by higher utilization.   Because of this
> n+1 availability is not usually a good enough design point.  The second
> problem is that utilization must be kept quite low to maintain
> redundancy unless throughput grows linearly with load.  If the
> throughput "tails off" or "saturates" as utilization goes up, the utilization
> required to maintain redundancy is lower than intuitively expected.   Most
> people don't have a clue about how their workload saturates on a cluster,
> let alone at what utilization they lose the redundancy required to get the
> availability they desire.  Furthermore, n+2 availability is met at lower
> utilizations than n+1 availability.   The third problem is that  failover
> time is often long enough to count as a measurable outage, particularly
> when a data base or shared state is involved.  As far as I know the "IBM
> Parallel Sysplex" with data sharing and redundant coupling facilities is
> the only system that can avoid a measurable outage on a failover.  In
> today's multiple tiered systems the availability of the tiers is
> multiplied, so that the availability of the whole solution is somewhat less
> than the availability of the weakest tier.
>
> Finally, the Linux on z solution has an advantage on patching in that
> multiple virtual machines can share the unpatched and patched versions.
> You only have to update the shared image once and then roll the boot/ipl of
> the VMs to point to the new version.  In addition the virtual machines'
> redundant capacity can be handled by letting the remaining machines have
> the resources of the VM that was rolled out, which are then reclaimed on
> restart.  The hardware utilization stays relatively constant because
> workloads saturate zSeries machines less than other machines.  This is
> because saturation comes from non processor bottlenecks and the zSeries
> machines are more robust in supply of  other resources per CPU configured.
> As a result  a virtual cluster will see higher redundancy at any
> utilization and therefore will be more available than the equivalent
> cluster, EVEN IF THE CLUSTER HARDWARE were AS RELIABLE AS zSERIES, which it
> is not.
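Two of the effects Joseph describes are easy to put numbers on. A quick sketch, with made-up illustrative figures (it treats server failures as independent and ignores failover time and the stress effect he mentions, so it understates the problem):

```python
from math import comb

def tiered_availability(tiers):
    """Availability of a serial multi-tier solution: per-tier
    availabilities multiply, so the whole sits below the weakest tier."""
    a = 1.0
    for t in tiers:
        a *= t
    return a

def cluster_availability(n_needed, n_total, a_server):
    """Availability of an n+k cluster: the service is up as long as
    at least n_needed of n_total servers are up (failover assumed
    instantaneous, failures assumed independent)."""
    p = 1 - a_server  # per-server probability of being down
    return sum(
        comb(n_total, k) * p**k * (1 - p)**(n_total - k)
        for k in range(n_total - n_needed + 1)
    )

# Three tiers at 99.9%, 99.5% and 99.9%: the whole solution is
# worse than its weakest tier.
print(tiered_availability([0.999, 0.995, 0.999]))

# With 99% servers and four needed to carry the load, compare
# n+1 against n+2 redundancy.
print(cluster_availability(4, 5, 0.99))  # n+1
print(cluster_availability(4, 6, 0.99))  # n+2
```

Even this optimistic model shows n+2 buying a large step over n+1, and the multiplication across tiers dragging the whole below the weakest tier.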

There are many possible solutions and compromises, depending on need,
budget and the cost of failure.

At one extreme you might want failover to another site, far distant, to
cover against disasters such as floods, earthquakes, fires or even a
lightning strike, and you want no visible outage.

Linux can do that, using software RAID (mirroring) across a network.
Whether you want to do it with a monster Z or something more modest, say
a low-end IA32 server, depends on budget and the cost of downtime.

I'm sure IBM has some tricks that make the Z do it better, at a price of
course.

We've discussed Google here before: would anyone notice if a few Google
servers went missing for a while? Seems to me, probably not, and
according to some in that discussion, Google runs on low-cost hardware.

--


Cheers
John.

Join the "Linux Support by Small Businesses" list at
http://mail.computerdatasafe.com.au/mailman/listinfo/lssb
