Re: RE: [Alchemi-developers] Fault Tolerance in Alchemi

andrew hudson Mon, 20 Mar 2006 01:13:04 -0800

Hello sir,
I have spent a lot of time looking into both of the fault tolerance issues in Alchemi. The first problem seems interesting, but it is somewhat complex and I do not yet have any idea where to start on it.
The second problem, "Manager node failure", would be a new addition if we are able to solve it. The Alchemi code seems complex for a novice user like me.
I have studied various class files such as Gmanager, IManager, GExecutors, GThread... but I still cannot tell which file contains the right material to use for the second problem.
I have a concept in mind for providing Alchemi with a backup server, but I am stuck at the implementation part. My idea is:

1. Just as Executors send heartbeats to the Manager to report their status, we can add a backup node that the Manager keeps informed of its existence.

2. The backup Manager node will have all the information that the actual Manager has.

3. In case of Manager failure, the backup Manager will come into play. What we have to do in this case is inform all the Executor nodes that this is their new Manager.

4. With this we will be able to run our grid even in case of Manager node failure.
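
To make the idea concrete, here is a minimal sketch of steps 1 to 4. It is written in Python rather than Alchemi's own C#, and every name in it (Executor, BackupManager, on_heartbeat, check, manager_id) is hypothetical and not part of the Alchemi API. Time is passed in explicitly so the logic can be followed without real networking.

```python
from dataclasses import dataclass, field


@dataclass
class Executor:
    """Hypothetical executor that only remembers which manager it reports to."""
    name: str
    manager_id: str


@dataclass
class BackupManager:
    """Hypothetical backup node; not an Alchemi class."""
    node_id: str
    timeout: float                               # seconds of silence before takeover
    last_heartbeat: float = 0.0                  # time of the last heartbeat received
    state: dict = field(default_factory=dict)    # mirrored copy of the manager state
    active: bool = False

    def on_heartbeat(self, now: float, manager_state: dict) -> None:
        # Steps 1 and 2: each heartbeat shows the manager is alive and
        # carries a snapshot of its state for the backup to mirror.
        self.last_heartbeat = now
        self.state = dict(manager_state)

    def check(self, now: float, executors: list) -> bool:
        # Step 3: if the manager has been silent for too long, take over
        # and tell every executor that this node is its new manager.
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.active = True
            for ex in executors:
                ex.manager_id = self.node_id
        return self.active


executors = [Executor("e1", "primary"), Executor("e2", "primary")]
backup = BackupManager(node_id="backup", timeout=10.0)
backup.on_heartbeat(now=0.0, manager_state={"pending_threads": [1, 2, 3]})

took_over_early = backup.check(now=5.0, executors=executors)    # within the timeout
took_over_late = backup.check(now=20.0, executors=executors)    # silent for 20 s
```

In this sketch the backup takes over only on the second check, after which both executors point at "backup" (step 4). A real implementation would also have to handle the case where the Manager is alive but unreachable from the backup, which this sketch ignores.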

For the first problem, I am looking for more ideas. Sir, please help me get my ideas implemented. Where should I start? I got confused after seeing so much code at once.

I also need help with one more thing: how to implement a multicluster grid using Alchemi. In the source code of alchemi1.0.3 I found that multicluster support was removed from Alchemi after version 1.0. Is that true? I still want to implement a multicluster approach.

Eagerly waiting for your reply.

On Thu, 16 Mar 2006 Tibor Biro wrote:
>
>"Executing thread "hangs" on the Executor. This means that the executing
>thread is not responding but the heartbeat thread keeps working. Currently
>the Executor remains in the hung state until it is restarted. This is not an
>acceptable solution. We have some ideas if you are interested in exploring
>this area."
>
>[Tibor Biro] One idea is to monitor the executing thread and terminate it if
>it exceeds a configurable amount of time. This value should be configurable
>from the application so the user can set it but an override at the Executor
>level is probably desirable as well. One problem here is that some machines
>take longer to execute something than others so maybe if the time it waits
>is a factor of the computer's speed it might be useful.
>
>Another idea I've been toying with is to require long running threads to
>raise status events containing a "percent done" value and maybe some other
>custom stuff. The monitoring thread would then have data to see if the
>thread is dead or just taking longer to complete but still alive. The events
>could have enough information to be used as checkpoints but this would be
>up to the implementation of each application.
>
>Both approaches could be implemented in some mix. I wouldn't mind exploring
>other ideas as well.
>
>
>"In case of a Manager failure there is no backup at this point. An immediate
>problem here is that once a Manager goes offline all Executors that are
>running threads will probably fail as well. It would be nice to have the
>Executor store the thread's results and wait until the Manager comes back
>online. Other ideas are welcome."
>
>[Tibor Biro] The Executor should persist the executed thread in case the
>Manager is not available and send it back once a connection is made. You
>should investigate the points of failure in this scenario and address them
>as they are discovered.
>
>The areas you mentioned above seem to be good areas of interest. I
>want to explore both of them in more detail and then try to fix the
>problems described above. I need your help to know which part of the
>code I should understand before starting work on them, as
>understanding the whole code is not easy. I want to fix these
>problems as soon as possible. I am devoting all my time to Alchemi.
>
>
>
>[Tibor Biro] Once you decide which area to work on let's start a thread on
>the SourceForge forums so we can iron out the details and document what
>other ideas were considered.
>
>
>Good Luck!
>
>Tibor
>
>
>
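
For the hung-thread problem, the two ideas quoted above (a configurable time limit and "percent done" status events) could be mixed roughly as follows. This is only a sketch in Python, not Alchemi code: ThreadMonitor, time_limit, stall_limit and slowdown are hypothetical names, and slowdown is my assumption for how the limit might scale on slower machines.

```python
from dataclasses import dataclass


@dataclass
class ThreadMonitor:
    """Hypothetical monitor for one executing grid thread; not an Alchemi class."""
    time_limit: float              # limit on total run time, in seconds
    stall_limit: float             # longest allowed gap between progress events
    slowdown: float = 1.0          # assumed multiplier for slower machines
    started_at: float = 0.0
    last_progress_at: float = 0.0
    percent_done: int = 0

    def on_progress(self, now: float, percent_done: int) -> None:
        # The executing thread raises a status event with a "percent done" value.
        self.last_progress_at = now
        self.percent_done = percent_done

    def should_terminate(self, now: float) -> bool:
        # Terminate if the thread ran past its (scaled) time limit or has
        # reported no progress for longer than the stall limit.
        too_long = now - self.started_at > self.time_limit * self.slowdown
        stalled = now - self.last_progress_at > self.stall_limit
        return too_long or stalled


monitor = ThreadMonitor(time_limit=100.0, stall_limit=15.0)
monitor.on_progress(now=10.0, percent_done=20)

alive_check = monitor.should_terminate(now=20.0)     # 10 s since last event
stalled_check = monitor.should_terminate(now=40.0)   # 30 s since last event
```

Here the first check leaves the thread running and the second one flags it as hung, because no progress event arrived within the stall limit. A slow but healthy thread keeps sending events and is only stopped by the overall time limit.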

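For the suggestion that the Executor should persist the executed thread and send it back once a connection is made, a rough sketch could look like this. Again it is Python and not Alchemi code; ResultSpool, save, flush and the send callback are hypothetical names, and JSON files on disk are just one assumed way of storing the results.

```python
import json
import os
import tempfile


class ResultSpool:
    """Hypothetical on-disk spool for finished thread results; not an Alchemi class."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def save(self, thread_id: int, result: dict) -> None:
        # Write to a temporary file first, then rename it, so a crash in the
        # middle of a write never leaves a half-written result behind.
        path = os.path.join(self.directory, f"{thread_id}.json")
        tmp = path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(result, fh)
        os.replace(tmp, path)

    def flush(self, send) -> int:
        # Try to deliver every stored result; keep the file if sending fails.
        delivered = 0
        for name in sorted(os.listdir(self.directory)):
            if not name.endswith(".json"):
                continue
            path = os.path.join(self.directory, name)
            with open(path, encoding="utf-8") as fh:
                result = json.load(fh)
            try:
                send(int(name[:-5]), result)
            except ConnectionError:
                break              # manager still offline; retry later
            os.remove(path)
            delivered += 1
        return delivered


def offline(thread_id, result):
    # Stands in for a manager that cannot be reached.
    raise ConnectionError("manager unreachable")


received = []
spool = ResultSpool(tempfile.mkdtemp())
spool.save(7, {"value": 42})

first_try = spool.flush(offline)
second_try = spool.flush(lambda tid, res: received.append((tid, res)))
```

The first flush delivers nothing and keeps the file, and the second one delivers the stored result once the connection works. Points of failure such as a full disk or the Manager receiving the same result twice are not handled here.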