Re: Developing Mars lander software

Xinok Wed, 19 Feb 2014 15:41:31 -0800

On Wednesday, 19 February 2014 at 05:53:55 UTC, Tolga Cakiroglu wrote:

On Wednesday, 19 February 2014 at 01:09:43 UTC, Xinok wrote:
On Wednesday, 19 February 2014 at 00:16:03 UTC, Tolga Cakiroglu wrote:
TL;DR the link though, how are they detecting that a CPU fails? Some information must be passed outside of the CPU to do this. The only solution that comes to my mind is that the main CPU changes a variable in an external memory at every step, and the backup CPU checks it continuously to catch a failure immediately. But this would require about 50% of the CPU's power already.
While thinking about these kinds of backup systems, knowing and reading that some people are really doing this is really great.
I'm assuming this has something to do with it:
https://en.wikipedia.org/wiki/Heartbeat_%28computing%29
In clustered servers, the active node sends a continuous signal indicating it's still alive. This signal is referred to as a heartbeat. There's a standby node waiting to take over should it stop receiving this signal.
I think only knowing that it has failed is not enough, because the process is landing and the other CPU should know where the process left off. With that heartbeat signal, the only option is that all sensor information must be sent to both CPUs continuously, and the sensor values should be enough to determine what the next step should be. Then I think it can continue the process flawlessly.

I don't have experience with, or much knowledge of, these kinds of systems; I'm merely aware of the concepts. The process of one system taking over when another system fails is called failover [1]. Depending on the requirements, the system could be designed so the standby node continues from the last successful state of the failed node [2].

To quote the page on Wikipedia [2], "Most importantly, the application must store as much of its state on non-volatile shared storage as possible. Equally important is the ability to restart on another node at the last state before failure using the saved state from the shared storage."
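
To make the quoted requirement a little more concrete, here is a minimal Python sketch of checkpointing state to shared storage and restoring it on another node. It only illustrates the general idea; it is not how any real lander does it. The function names (save_checkpoint, load_checkpoint), the JSON file format, and the state fields are all assumptions made up for this example, and a local file stands in for the shared storage.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically (hypothetical helper).

    The data goes to a temporary file first and is then renamed over the
    target, so a reader sees either the old checkpoint or the new one,
    never a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to the storage device
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path, default):
    """Return the last saved state, or default if none was ever saved."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

The standby node would call load_checkpoint when it takes over and resume from whatever step the primary last recorded.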

I would consider it likely that both systems run in conjunction, but the primary system is in control and the backup system merely "observes", ready to take over in an instant as soon as it no longer detects a heartbeat.
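
As a rough sketch of that observe-then-take-over arrangement, the standby side could look something like the Python below. This is an assumption-laden illustration, not a description of any actual flight software: the HeartbeatMonitor class, its timeout parameter, and the injectable clock are all hypothetical names introduced here.

```python
import time


class HeartbeatMonitor:
    """Standby-side watchdog (hypothetical): takes over when beats stop."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout        # seconds of silence tolerated
        self.clock = clock            # injectable so it can be tested
        self.last_beat = clock()
        self.active = False           # False: still only observing

    def beat(self):
        """Called whenever a heartbeat from the primary arrives."""
        self.last_beat = self.clock()

    def check(self):
        """Return True once the standby has taken control.

        Taking over is one-way: after the primary is presumed dead,
        a late heartbeat does not hand control back.
        """
        if not self.active and self.clock() - self.last_beat > self.timeout:
            self.active = True
        return self.active
```

Note that the check is only a timestamp comparison, so the standby does not need to spend much of its time on it; how often it polls is a trade-off between detection latency and overhead.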


[1] https://en.wikipedia.org/wiki/Failover

[2] https://en.wikipedia.org/wiki/High-availability_cluster#Application_design_requirements
