Hi Andrew,

Here’s what is already implemented: if the Manager detects that an Executor’s
heartbeat has stopped (or rather, has been delayed for some time), it signals
the Scheduler to put the executing thread(s) back into the pool of threads
available to schedule. Another available Executor will then receive the thread.
In this design the Scheduler is implemented as an interface, so it can easily
be replaced with other, smarter versions.

Here are a few scenarios we thought about:

- Executor fails “hard” due to a hardware failure, an OS reboot or the process
being killed. In this case we use the heartbeat technique to re-schedule
the thread to another Executor. I think this is an acceptable solution.

- Executor fails “soft” due to the user logging off or shutting down the
Executor. In this case the Manager is informed that the Executor is going
offline and the thread is re-scheduled (somebody should test this to be sure,
but that’s the theory). I think this is an acceptable solution. If you
are designing a checkpointing technique then you could probably improve this
scenario. I’m interested to hear your ideas.

- Executing thread “hangs” on the Executor. This means that the executing
thread is not responding but the heartbeat thread keeps working. Currently the
Executor remains in the hung state until it is restarted. This is not an
acceptable solution. We have some ideas if you are interested in exploring this
area.

In case of a Manager failure there is no backup at this point. An immediate
problem here is that once a Manager goes offline, all Executors that are
running threads will probably fail as well. It would be nice to have the
Executor store the thread’s results and wait until the Manager comes back
online. Other ideas are welcome.

To get started with Alchemi:

- Get the latest source code.
- Compile it.
- Run the samples from the SDK.
- Ask questions if something puzzles you.
- File bug reports on SourceForge for whatever problems you find.
- If you are still learning your way around the code and .NET, then a good way
to learn is to try to fix reported bugs or to implement smaller features. There
are several issues reported on SourceForge in the bugs or feature requests
sections.

If you have questions, please use the SourceForge forums or the mailing lists
so others can benefit from the shared knowledge. At the moment the mailing
lists are being drowned in spam, so the SourceForge forums are a better medium
to communicate.

Regards,
Tibor

From: andrew hudson [mailto:[EMAIL PROTECTED]
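For illustration only, the heartbeat-based re-scheduling described in the message above could be sketched roughly as follows. This is not Alchemi code: Alchemi is written for .NET, and every name here (Scheduler, FifoScheduler, Manager, requeue, check, timeout_s) is a hypothetical stand-in chosen for the sketch. It assumes the Manager records the time of each Executor's last heartbeat and that the Scheduler is a replaceable interface, as the message says.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Scheduler(ABC):
    """Hypothetical stand-in for the replaceable Scheduler interface."""

    @abstractmethod
    def requeue(self, thread_ids: List[str]) -> None:
        """Put threads back into the pool of threads available to schedule."""


class FifoScheduler(Scheduler):
    """Simplest possible implementation: a first-in, first-out pool."""

    def __init__(self) -> None:
        self.pool: List[str] = []

    def requeue(self, thread_ids: List[str]) -> None:
        self.pool.extend(thread_ids)


class Manager:
    """Tracks Executor heartbeats and hands overdue Executors' threads back."""

    def __init__(self, scheduler: Scheduler, timeout_s: float) -> None:
        self.scheduler = scheduler
        self.timeout_s = timeout_s  # assumed allowed delay between heartbeats
        self.last_heartbeat: Dict[str, float] = {}
        self.running: Dict[str, List[str]] = {}

    def heartbeat(self, executor_id: str, now: float) -> None:
        # Record when this Executor was last heard from.
        self.last_heartbeat[executor_id] = now
        self.running.setdefault(executor_id, [])

    def assign(self, executor_id: str, thread_id: str) -> None:
        # Remember which threads each Executor is currently running.
        self.running.setdefault(executor_id, []).append(thread_id)

    def check(self, now: float) -> List[str]:
        """Requeue threads of Executors whose heartbeat is overdue."""
        overdue = [e for e, t in self.last_heartbeat.items()
                   if now - t > self.timeout_s]
        for executor_id in overdue:
            self.scheduler.requeue(self.running.pop(executor_id, []))
            del self.last_heartbeat[executor_id]
        return overdue
```

In this sketch a periodic call to check() plays the role of the Manager noticing a delayed heartbeat. A smarter Scheduler would only need to provide its own requeue(). Note that this covers the "hard" failure case only: a hung executing thread whose heartbeat thread keeps working would not be detected.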
- [Alchemi-developers] Fault tolerance in Alchemi andrew hudson
- RE: RE: [Alchemi-developers] Fault tolerance in Alchemi Tibor Biro
- [Alchemi-developers] Fault Tolerance in Alchemi andrew hudson
- Re: RE: [Alchemi-developers] Fault Tolerance in Alchemi andrew hudson
