Hi Andrew,

 

Here’s what is already implemented: if the Manager detects that an Executor’s heartbeat has stopped (or rather, has been delayed for some time), it signals the Scheduler to put that Executor’s executing thread(s) back into the pool of threads available for scheduling. Another available Executor will then receive the thread. In this design the Scheduler is implemented as an interface, so it can easily be replaced with other, smarter versions.
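
To make the heartbeat flow above concrete, here is a minimal sketch in Python rather than Alchemi's own .NET code. All names in it (Scheduler, FifoScheduler, Manager, heartbeat_timeout, check_executors) are made up for illustration and do not mirror the actual Alchemi classes; time is passed in explicitly so the logic is easy to follow.

```python
from abc import ABC, abstractmethod
from collections import deque


class Scheduler(ABC):
    """Hypothetical scheduler interface; any policy can be plugged in."""

    @abstractmethod
    def requeue(self, thread_id):
        """Put a thread back into the pool of schedulable threads."""

    @abstractmethod
    def next_thread(self):
        """Return the next thread to run, or None if the pool is empty."""


class FifoScheduler(Scheduler):
    """Simplest possible policy: first in, first out."""

    def __init__(self):
        self._pool = deque()

    def requeue(self, thread_id):
        self._pool.append(thread_id)

    def next_thread(self):
        return self._pool.popleft() if self._pool else None


class Manager:
    def __init__(self, scheduler, heartbeat_timeout):
        self.scheduler = scheduler
        self.heartbeat_timeout = heartbeat_timeout  # seconds; assumed setting
        self.last_heartbeat = {}  # executor_id -> time of last heartbeat
        self.running = {}         # executor_id -> set of thread ids

    def heartbeat(self, executor_id, now):
        self.last_heartbeat[executor_id] = now

    def dispatch(self, executor_id):
        """Hand the next schedulable thread to the given Executor."""
        thread_id = self.scheduler.next_thread()
        if thread_id is not None:
            self.running.setdefault(executor_id, set()).add(thread_id)
        return thread_id

    def check_executors(self, now):
        """Requeue the threads of every Executor whose heartbeat is overdue."""
        for executor_id, seen in list(self.last_heartbeat.items()):
            if now - seen > self.heartbeat_timeout:
                for thread_id in sorted(self.running.pop(executor_id, set())):
                    self.scheduler.requeue(thread_id)
                del self.last_heartbeat[executor_id]
```

Because the Manager only talks to the Scheduler interface, a smarter policy could be swapped in for FifoScheduler without touching the heartbeat check.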

 

Here are a few scenarios we thought about:

-          Executor fails “hard” due to a hardware failure, an OS reboot or the process being killed. In this case we are using the heartbeat technique to re-schedule the thread to another Executor. I think this is an acceptable solution.

-          Executor fails “soft” due to the user logging off or shutting down the Executor. In this case the Manager is informed that the Executor is going offline and the thread is re-scheduled (somebody should test this to be sure, but that’s the theory). I think this is an acceptable solution. If you are designing a checkpointing technique then you could probably improve this scenario. I’m interested to hear your ideas.

-          Executing thread “hangs” on the Executor. This means that the executing thread is not responding but the heartbeat thread keeps working. Currently the Executor remains in the hung state until it is restarted. This is not an acceptable solution. We have some ideas if you are interested in exploring this area.
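
For the hung-thread scenario, one possible direction (only an illustration, not necessarily one of the ideas mentioned above) is a progress watchdog on the Executor: the executing thread reports progress from time to time, and anything silent for too long is flagged. The sketch below is in Python; ProgressWatchdog, max_silence and the method names are hypothetical, and it assumes grid threads can be made to report progress, which is not something Alchemi does today as far as this message states.

```python
class ProgressWatchdog:
    """Hypothetical helper: flags threads with no recent progress report."""

    def __init__(self, max_silence):
        self.max_silence = max_silence  # seconds without progress; assumed limit
        self.last_progress = {}         # thread_id -> time of last progress report

    def start(self, thread_id, now):
        self.last_progress[thread_id] = now

    def report_progress(self, thread_id, now):
        # Ignore reports from threads that are not being tracked.
        if thread_id in self.last_progress:
            self.last_progress[thread_id] = now

    def finish(self, thread_id):
        self.last_progress.pop(thread_id, None)

    def hung_threads(self, now):
        """Return ids of threads silent for longer than max_silence."""
        return sorted(
            thread_id
            for thread_id, seen in self.last_progress.items()
            if now - seen > self.max_silence
        )
```

The heartbeat thread could send the result of hung_threads() to the Manager so those threads can be re-scheduled. The obvious weakness is that a long computation that never reports progress would be flagged as hung too.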

 

In case of a Manager failure there is no backup at this point. An immediate problem here is that once a Manager goes offline all Executors that are running threads will probably fail as well. It would be nice to have the Executor store the thread’s results and wait until the Manager comes back online. Other ideas are welcome.
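
The "store the results and wait" idea could look roughly like the Python sketch below. ResultBuffer and its send callback are hypothetical names, and the sketch assumes that a failed delivery surfaces as a ConnectionError; the real Executor-to-Manager call in Alchemi may behave differently.

```python
class ResultBuffer:
    """Hypothetical Executor-side store for results the Manager has not accepted yet."""

    def __init__(self, send):
        # send(thread_id, result) is assumed to raise ConnectionError
        # while the Manager is offline.
        self._send = send
        self._pending = []

    def submit(self, thread_id, result):
        """Queue a finished thread's result and try to deliver everything."""
        self._pending.append((thread_id, result))
        return self.flush()

    def flush(self):
        """Deliver pending results in order; stop at the first failure."""
        delivered = 0
        while self._pending:
            thread_id, result = self._pending[0]
            try:
                self._send(thread_id, result)
            except ConnectionError:
                break  # Manager still offline; keep the rest for later
            self._pending.pop(0)
            delivered += 1
        return delivered

    def pending_count(self):
        return len(self._pending)
```

The Executor would call flush() periodically (for example from the heartbeat loop) until the Manager is reachable again. This version keeps results in memory only, so it would need to write them to disk to survive an Executor restart.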

 

To get started with Alchemi:

-          Get the latest source code.

-          Compile it.

-          Run the samples from the SDK.

-          Ask questions if something puzzles you.

-          File bug reports on SourceForge for whatever problems you find.

-          If you are still learning your way around the code and .NET then a good way to learn is to try to fix reported bugs or to implement smaller features. There are several issues reported on SourceForge in the bugs or feature requests sections.

 

If you have questions please use the SourceForge forums or the mailing lists so others can benefit from the shared knowledge. At the moment the mailing lists are being drowned in spam so the SourceForge forums are a better medium to communicate.

 

Regards,

Tibor

 

 

 


From: andrew hudson [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 15, 2006 11:18 AM
To: Luis Cota
Cc: Tibor Biro
Subject: Re: RE: [Alchemi-developers] Fault tolerance in Alchemi

 

 
Hello sir,
I am planning to develop an application that will extend the Alchemi Manager's functionality for fault tolerance. As far as I know, Alchemi uses the heartbeat technique to check the Executor's status. But if an Executor stops partway through the execution of a job assigned to it, is there any provision for the thread it was running? Is the thread assigned to another Executor and started from scratch, or is there some checkpointing technique so that the thread can continue from the last checkpointed state?
If this functionality is not there, I have an idea: we can take the data from the heartbeat to detect that a node has failed, and then check the job execution log to find out which thread that node was executing. We can restart that thread by assigning it to some other node. But since I am not very comfortable with .NET, I am having some difficulty understanding Alchemi's large code base.
Can you suggest how I can do the implementation part?

Another thing I want to ask is whether we can have a backup Manager, to avoid the risk of a grid crash if the Manager node fails.
Eagerly waiting for your reply.

with regards

On Mon, 13 Mar 2006 Luis Cota wrote :
>Hi Andrew,
>
>
>
>It's fantastic that you're interested in working on enhancing Alchemi's
>fault-tolerance functionality.  Do you have some kind of Messaging
>program?  Several ideas have been discussed regarding fault-tolerance -
>it'd be good to have a chat regarding some of the ideas and, perhaps
>more importantly, implementation details.
>
>
>
>- Luis
>
>
>
>________________________________
>
> From: [EMAIL PROTECTED]
>[mailto:[EMAIL PROTECTED] On Behalf Of
>andrew hudson
>Sent: Saturday, March 11, 2006 12:05 AM
>To: [email protected]
>Subject: [Alchemi-developers] Fault tolerance in Alchemi
>
>
>
>
>Hello,
>
>I would like to know which techniques Alchemi uses for fault
>tolerance. By fault tolerance I mean things like the heartbeat technique,
>checkpointing in case a job fails, etc.
>Basically, I want to work on Alchemi to build a fault tolerance tool
>that handles faults efficiently. Can anybody help
>me with where to start developing this?
>Also, the projects section of alchemi.net mentions a
>project on fault tolerance done by the Kerala institute. Can I get some
>information about this project, so that I can do something new that
>is not already present in Alchemi?
>
>with regards
>andrew
>
>
>
>
>


