Hi,
We are also looking to get device failover working. However, for the
reasons cited by Ralph, we are using the OB1 PML as the starting point.
Also, similar to you, we do not need the checksumming feature or the
timed out retransmission that the dr PML provides.
Rolf
Ralph Castain wrote:
Last anyone knew, the dr pml was dead - way out of date and
unmaintained. I gather that you folks have revived it and sync'd it
back up to the current ob1 module?
I don't think anyone really cares what is done with the dr module
itself. There are others working on failover modules, and there is a
new separate checksum module that just aborts if it detects an error.
So I would guess you are welcome to do whatever you want to it. I
suspect the others working on failover may speak up here too.
On Apr 15, 2009, at 6:47 AM, Mouhamed Gueye wrote:
Hi all,
We are currently working on the dr pml component and specifically on
device failover. The failover mechanism seems to work fine across
different components, but if we want to do it on different modules of
the same component - say 2 Infiniband rails - the code seems to be
broken.
Actually, when the first openib module fails, the progress function
of the openib component is deregistered and progress is no longer
made on any openib module. We managed to circumvent this by keeping
the progress function as long as an openib module might be using it
and it seems to work fine.
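
A minimal sketch of that kind of workaround, assuming a simple reference
count of live modules per component. Every name here (progress_fn_t,
module_acquire_progress, module_release_progress, and so on) is
hypothetical and only stands in for the real registration calls; it is
not the actual openib or dr code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the Open MPI API. */
typedef int (*progress_fn_t)(void);

static progress_fn_t registered_fn = NULL; /* callback the progress engine would poll */
static int active_modules = 0;             /* modules still relying on that callback */

/* Placeholder for the component-wide progress function. */
static int component_progress(void)
{
    return 0;
}

/* Called when a module (e.g. one rail) is brought up. */
static void module_acquire_progress(progress_fn_t fn)
{
    if (active_modules++ == 0) {
        registered_fn = fn;   /* first user registers the callback */
    }
}

/* Called when a module fails or is shut down. */
static void module_release_progress(void)
{
    assert(active_modules > 0);
    if (--active_modules == 0) {
        registered_fn = NULL; /* only the last user deregisters it */
    }
}

/* Nonzero while at least one module still needs progress. */
static int progress_is_registered(void)
{
    return registered_fn != NULL;
}
```

With two rails, the first failure only drops the count to one, so the
callback stays registered and the surviving module keeps making progress.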
So I have a few questions:
1. Is there already work in progress to support multi-module failover
in the dr pml?
2. Do you think this is the correct way to handle multi-module
failover?
Also, the fact that the "dr" component includes many other things, such
as checksumming, bothers us a bit (we'd like to keep the performance
overhead as low as possible when including device failover). So,
3. Do you plan to fork this component into a "df" (device failover
only) one? (We would like to, but maybe this is not the right way to
go.)
That's all for now,
Mouhamed
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
=========================
rolf.vandeva...@sun.com
781-442-3043
=========================