Well, if reviving means making device failover work, then yes, in a way we
revived it ;)
For now we are mostly running experiments to figure out how to get device
failover working. No big fixes yet, which is why we are posting here
before going further.
From what I understand, Rolf's work seems very close to what we want to do
and we'd better work with him on making ob1 able to do device failover
rather than trying to work on dr.
This sounds good to me: there is no reason why ob1 couldn't invalidate a
device (e.g. if we send a signal). However, replaying lost sends still
seems to be needed if we want to be able to handle a network failure.
Clearly, ob1 doesn't support this yet.
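To illustrate what "replaying lost sends" could mean, here is a rough sketch
in C. It is only an assumption about the general shape of the problem, not
ob1 or dr code: pending_send_t and replay_lost_sends are hypothetical names,
and a real pml would have to re-post each fragment rather than just
re-tagging it.

```c
#include <stddef.h>

/* Hypothetical record of a fragment that has been handed to a device. */
typedef struct {
    int frag_id;  /* identifier of the fragment */
    int device;   /* index of the device it was sent on */
    int acked;    /* non-zero once the peer has acknowledged it */
} pending_send_t;

/*
 * Re-route every unacknowledged fragment that was sent on the failed
 * device to a backup device. Returns the number of fragments replayed.
 * Acknowledged fragments and fragments on other devices are untouched.
 */
static int replay_lost_sends(pending_send_t *sends, size_t n,
                             int failed, int backup)
{
    int replayed = 0;
    for (size_t i = 0; i < n; i++) {
        if (sends[i].device == failed && !sends[i].acked) {
            sends[i].device = backup;  /* a real pml would re-post here */
            replayed++;
        }
    }
    return replayed;
}
```

The point of the sketch is that the sender has to keep track of
unacknowledged fragments at all, which is the bookkeeping ob1 does not do
today.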
Thanks a lot for your advice; we will continue to think about it and come
back to you.
Sylvain
On Wed, 15 Apr 2009, Ralph Castain wrote:
Last anyone knew, the dr pml was dead - way out of date and unmaintained. I
gather that you folks have revived it and sync'd it back up to the current
ob1 module?
I don't think anyone really cares what is done with the dr module itself.
There are others working on failover modules, and there is a new separate
checksum module that just aborts if it detects an error.
So I would guess you are welcome to do whatever you want to it. I suspect the
others working on failover may speak up here too.
On Apr 15, 2009, at 6:47 AM, Mouhamed Gueye wrote:
Hi all,
We are currently working on the dr pml component and specifically on device
failover. The failover mechanism seems to work fine on different components,
but if we want to do it on different modules of the same component - say 2
Infiniband rails - the code seems to be broken.
Actually, when the first openib module fails, the progress function of the
openib component is deregistered and progress is no longer made on any
openib module. We managed to circumvent this by keeping the progress
function as long as an openib module might be using it and it seems to work
fine.
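The workaround can be sketched roughly as follows, in C. This is a
simplified illustration with hypothetical names (module_add, module_remove,
registered_fn), not the actual openib code or the Open MPI progress API: the
component keeps a count of live modules and only drops its progress function
when the last one goes away.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a progress callback type. */
typedef int (*progress_fn_t)(void);

static progress_fn_t registered_fn = NULL; /* callback currently polled */
static int active_modules = 0;             /* modules still relying on it */

/* Placeholder for the component-wide progress function. */
static int component_progress(void)
{
    return 0;
}

/* Called when a module (e.g. one Infiniband rail) comes up. */
static void module_add(void)
{
    if (active_modules++ == 0) {
        registered_fn = component_progress; /* first user registers */
    }
}

/* Called when a module fails or is finalized. */
static void module_remove(void)
{
    assert(active_modules > 0);
    if (--active_modules == 0) {
        registered_fn = NULL;               /* last user deregisters */
    }
}

/* Non-zero while at least one module still needs progress. */
static int progress_is_registered(void)
{
    return registered_fn != NULL;
}
```

With two rails, the failure of the first one leaves the callback in place,
and it is only removed once the second one is gone as well.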
So I have a few questions:
1. Is there already work in progress to support multi-module failover on the
dr pml?
2. Do you think this is the correct way to handle multi-module failover?
Also, the fact that the "dr" component includes many things like checksumming
bothers us a bit (we'd like to keep performance overhead as low as possible
when including device failover). So,
3. Do you plan to fork this component into a "df" (device failover only) one?
(we would like to, but maybe this is not the right way to go)
That's all for now,
Mouhamed
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel