I have not, but there should be no difference. The failover code only
gets triggered when an error happens. Otherwise, there are no
differences in the code paths while everything is functioning normally.
Rolf
On 08/03/09 11:14, Pavel Shamis (Pasha) wrote:
Rolf,
Did you compare latency/bandwidth for the failover-enabled code vs. trunk?
Pasha.
Rolf Vandevaart wrote:
Hi folks:
As some of you know, I have also been looking into implementing
failover. I took a different approach, as I am solving the problem
within the openib BTL itself. This of course means that it only works
for failing over from one openib BTL to another, but that was our area
of interest. It also means that we do not need to keep track of
fragments, as we get them back from the completion queue upon failure.
We then extract the relevant information and repost on the other
working endpoint.
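
To make the idea above concrete, here is a minimal toy sketch of
"get the fragment back from the completion queue and repost it
elsewhere". It is not code from the openib BTL: the Endpoint class,
post_send, and poll_and_failover are hypothetical names invented for
illustration, and the completion queue is modelled as a plain deque.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Toy stand-in for one openib endpoint (hypothetical, not OMPI code)."""
    name: str
    healthy: bool = True
    delivered: list = field(default_factory=list)
    completions: deque = field(default_factory=deque)

    def post_send(self, frag):
        # A post on a failed port is still accepted; the failure only
        # shows up later as an error completion carrying the fragment.
        if self.healthy:
            self.delivered.append(frag)
            self.completions.append(("ok", frag))
        else:
            self.completions.append(("error", frag))


def poll_and_failover(source, backup):
    """Drain source's completion queue; repost errored fragments on backup."""
    reposted = 0
    while source.completions:
        status, frag = source.completions.popleft()
        if status == "error":
            # The fragment came back with the error completion, so no
            # separate bookkeeping of in-flight fragments is needed.
            backup.post_send(frag)
            reposted += 1
    return reposted


ib0 = Endpoint("openib0")
ib1 = Endpoint("openib1")
ib0.post_send("frag-0")      # delivered normally
ib0.healthy = False          # simulate a port failure
ib0.post_send("frag-1")
ib0.post_send("frag-2")
reposted = poll_and_failover(ib0, ib1)
```

In this toy run, frag-1 and frag-2 are reposted on the second endpoint
in their original order, while frag-0 is left alone because it
completed successfully.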
My work has been progressing at http://bitbucket.org/rolfv/ompi-failover.
This currently works only for send semantics, so you have to run with
-mca btl_openib_flags 1.
Rolf
On 07/31/09 05:49, Mouhamed Gueye wrote:
Hi list,
Here is an update on our work concerning device failover.
As many of you suggested, we refocused our work on ob1 rather than
dr, and we now have a working prototype on top of ob1. The approach is
to store btl descriptors sent to peers and delete them when we
receive proof of delivery. So far, we rely on completion callback
functions, assuming that the message is delivered when the completion
function is called, which is the case for openib. When a btl module
fails, it is removed from the endpoint's btl list and the next one is
used to retransmit the stored descriptors. No extra message is
transmitted; the protocol change consists only of additions to the
header. It has been tested mainly with two IB modules, in both
multi-rail (two separate networks) and multi-path (a single large
network) configurations.
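
The store-and-retransmit scheme described above can be sketched
roughly as follows. This is only a toy model under stated assumptions,
not the patch itself: Btl, FailoverEndpoint, and their methods are
hypothetical names, and descriptors are reduced to (id, payload)
tuples.

```python
class Btl:
    """Toy BTL module (hypothetical stand-in, not the Open MPI structure)."""

    def __init__(self, name):
        self.name = name
        self.posted = []  # descriptors handed to this module, in order

    def send(self, desc):
        self.posted.append(desc)


class FailoverEndpoint:
    """Keeps every sent descriptor until its completion callback fires."""

    def __init__(self, btls):
        self.btls = list(btls)  # ordered btl list for one peer
        self.pending = {}       # descriptor id -> payload awaiting completion
        self.next_id = 0

    def send(self, payload):
        desc_id = self.next_id
        self.next_id += 1
        self.pending[desc_id] = payload  # store before sending
        self.btls[0].send((desc_id, payload))
        return desc_id

    def on_completion(self, desc_id):
        # The completion callback is treated as proof of delivery.
        self.pending.pop(desc_id, None)

    def on_btl_failure(self, btl):
        # Drop the failed module, then retransmit what is still stored
        # through the next module in the list.
        self.btls.remove(btl)
        if not self.btls:
            raise RuntimeError("no btl module left for this peer")
        for desc_id in sorted(self.pending):
            self.btls[0].send((desc_id, self.pending[desc_id]))


ib0, ib1 = Btl("openib0"), Btl("openib1")
ep = FailoverEndpoint([ib0, ib1])
ids = [ep.send(p) for p in ("a", "b", "c")]
ep.on_completion(ids[0])  # "a" is confirmed delivered
ep.on_btl_failure(ib0)    # "b" and "c" are retransmitted on ib1
```

Only descriptors without a completion are resent, so in this run the
second module receives "b" and "c" but not "a". They stay in the
pending table until their own completions arrive.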
You can grab and test the patch here (it applies on top of the trunk):
http://bitbucket.org/gueyem/ob1-failover/
To compile with failover support, pass --enable-device-failover to
configure. You can then run a benchmark, disconnect a port, and watch
the failover operate.
The failover layer induces a small latency increase (~2%) when no
failover occurs. To accelerate the failover process on openib, you can
try lowering the btl_openib_ib_timeout parameter, for example to 15
instead of 20 (the default value).
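
As a rough worked example of why lowering the timeout helps: the
InfiniBand local ACK timeout is 4.096 microseconds * 2^value, so each
step down halves the wait per attempt. The helper name below is made
up for illustration. The adapter also retries before reporting an
error, so the real detection delay is presumably a multiple of these
figures.

```python
def ib_ack_timeout_seconds(value):
    """Per-attempt InfiniBand local ACK timeout: 4.096 us * 2**value."""
    return 4.096e-6 * (2 ** value)


t_default = ib_ack_timeout_seconds(20)  # about 4.29 s per attempt
t_lowered = ib_ack_timeout_seconds(15)  # about 0.134 s per attempt
speedup = t_default / t_lowered         # 2**5 = 32 times shorter
```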
Mouhamed