On Thu, 14 Jan 1999, brian beuning wrote:
> At the risk of being off topic...
>
> It seems relatively easy to make a highly available cluster
> at least as far as staying available when a node fails. What
> I can not figure out is how to make a cluster highly available
> if part of the network fails.
Well, this is an interesting topic, and one that will become more
and more important as the number of nodes (failure points) increases.
Suppose you start a job on 128 CPUs that takes 2 weeks to complete.
What happens after 5 days if a power supply goes down? Right
now you lose the job. It is, however, possible to build
some level of fault tolerance into software, but it is often at the
expense of efficiency.
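The usual way to trade that efficiency for fault tolerance is
checkpoint/restart: periodically save the job state to stable storage,
and on restart resume from the last checkpoint instead of from zero.
A minimal sketch (Python for brevity; the file name, checkpoint
interval, and the stand-in computation are all illustrative -- a real
128-CPU job would checkpoint its MPI/PVM state, not a single loop):

```python
import os
import pickle

def run_job(total_steps, ckpt_path="job.ckpt", every=100):
    """Resumable loop: checkpoints every `every` steps."""
    step, state = 0, 0.0
    # Resume from the last checkpoint if one exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            step, state = pickle.load(f)
    while step < total_steps:
        state += step          # stand-in for the real computation
        step += 1
        if step % every == 0:  # the efficiency cost: periodic I/O
            with open(ckpt_path, "wb") as f:
                pickle.dump((step, state), f)
    return state
```

If the node dies at day 5, the restarted job re-reads the checkpoint
and loses at most `every` steps of work, rather than the whole run.
The checkpoint interval is the knob: shorter intervals lose less work
on a failure but spend more time doing I/O.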
My view of the problem is one where you want to distribute
a computational task on machines that have changing computational
speeds that can be zero (a failed node) or quite high
(an unloaded CPU). In other words, a completely heterogeneous environment
where you may have different machines or the same machines with
various loads (a machine that is down has an infinite load), but
the loads are dynamic.
I believe it is possible to develop algorithms that can
handle this type of environment, but they would not be
as efficient as a closed solution where you can accurately
schedule jobs based on guaranteed resources (the non-fault-tolerant
situation).
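One family of such algorithms is dynamic self-scheduling: instead of
statically assigning work, each node pulls the next task from a shared
pool whenever it is free, so an unloaded node naturally does more and a
failed node simply stops pulling (its share is absorbed by the rest).
A toy sketch, with threads standing in for cluster nodes and the task
itself purely illustrative:

```python
import queue
import threading

def self_schedule(tasks, n_workers=4):
    """Workers pull tasks until the pool is empty."""
    work = queue.Queue()
    for t in tasks:
        work.put(t)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                t = work.get_nowait()   # pull the next unit of work
            except queue.Empty:
                return                  # pool drained, worker retires
            r = t * t                   # stand-in for the real computation
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sorted(results)
```

The inefficiency shows up as contention on the pool and the loss of
static-schedule optimizations -- exactly the cost mentioned above of not
having guaranteed resources.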
When I have more time....
Doug
-------------------------------------------------------------------
Paralogic, Inc. | PEAK | Voice:+610.861.6960
115 Research Drive | PARALLEL | Fax:+610.861.8247
Bethlehem, PA 18017 USA | PERFORMANCE | http://www.plogic.com
-------------------------------------------------------------------
-
Linux SMP list: FIRST see FAQ at http://www.irisa.fr/prive/mentre/smp-faq/
To Unsubscribe: send "unsubscribe linux-smp" to [EMAIL PROTECTED]