Re: [OMPI devel] Device failover on ob1

2009-08-06 Thread Jeff Squyres
Is it time to "svn rm ompi/mca/pml/dr"? On Aug 4, 2009, at 6:50 AM, Ralph Castain wrote: Rolf/Mouhamed Could you get together off-list to discuss the different approaches and see if/where there is common ground. It would be nice to see an integrated solution - personally, I would rather not s

Re: [OMPI devel] Device failover on ob1

2009-08-04 Thread Graham, Richard L.
From my perspective, the assumption that the low-level is reliable is completely consistent with the assumptions that went into the ob1 design, so I don't see changes you may propose as a problem in principle. Thanks a lot for the clarification, Rich On 8/3/09 9:39 AM, "Mouhamed Gueye" wr

Re: [OMPI devel] Device failover on ob1

2009-08-04 Thread Ralph Castain
Rolf/Mouhamed Could you get together off-list to discuss the different approaches and see if/where there is common ground. It would be nice to see an integrated solution - personally, I would rather not see two orthogonal approaches unless they can be cleanly separated. Much better if the

Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Pavel Shamis (Pasha)
I have not, but there should be no difference. The failover code only gets triggered when an error happens. Otherwise, there are no differences in the code paths while everything is functioning normally. Sounds good. I still did not have time to review the code. I will try to do it during t

Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Rolf Vandevaart
I have not, but there should be no difference. The failover code only gets triggered when an error happens. Otherwise, there are no differences in the code paths while everything is functioning normally. Rolf On 08/03/09 11:14, Pavel Shamis (Pasha) wrote: Rolf, Did you compare latency/bw fo

Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Brian W. Barrett
On Sun, 2 Aug 2009, Ralph Castain wrote: Perhaps a bigger question needs to be addressed - namely, does the ob1 code need to be refactored? Having been involved a little in the early discussion with bull when we debated over where to put this, I know the primary concern was that the code not

Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Pavel Shamis (Pasha)
Rolf, Did you compare latency/bw for failover-enabled code VS trunk ? Pasha. Rolf Vandevaart wrote: Hi folks: As some of you know, I have also been looking into implementing failover as well. I took a different approach as I am solving the problem within the openib BTL itself. This of cour

Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Rolf Vandevaart
Hi folks: As some of you know, I have also been looking into implementing failover as well. I took a different approach as I am solving the problem within the openib BTL itself. This of course means that this only works for failing from one openib BTL to another but that was our area of int

Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Mouhamed Gueye
Hi list, I'll try to answer the main concerns so far. We chose to work on ob1 for two main reasons: - we focused first on fixing dr but were quite disappointed by its performance in comparison with ob1. Then, we oriented our work on ob1 to provide failover while keeping good performance.

Re: [OMPI devel] Device failover on ob1

2009-08-02 Thread Ralph Castain
Okay - here's a thought. Why not do what the original message asked? Checkout their changes and look at what they did. Then we can have the discussion about how intrusive it is. Otherwise, all we're doing is debating what they -might- have done, or what someone thinks they -should- have don

Re: [OMPI devel] Device failover on ob1

2009-08-02 Thread Graham, Richard L.
The point here is very different, and is not being made because of objections to fail-over support. Previous work took precisely this sort of approach, and in that particular case the desire to support reliability, but be able to compile out this support still had a negative performance imp

Re: [OMPI devel] Device failover on ob1

2009-08-02 Thread Ralph Castain
The objections being cited are somewhat unfair - perhaps people do not understand the proposal being made? The developers have gone out of their way to ensure that all changes are configured out unless you specifically select to use that functionality. This has been our policy from day one

Re: [OMPI devel] Device failover on ob1

2009-08-02 Thread Graham, Richard L.
On 8/2/09 12:55 AM, "Brian Barrett" wrote: While I agree that performance impact (latency in this case) is important, I disagree that this necessarily belongs somewhere other than ob1. For example, a zero-performance impact solution would be to provide two versions of all the interface functi

Re: [OMPI devel] Device failover on ob1

2009-08-02 Thread Brian Barrett
While I agree that performance impact (latency in this case) is important, I disagree that this necessarily belongs somewhere other than ob1. For example, a zero-performance impact solution would be to provide two versions of all the interface functions, one with failover turned on and one
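
As a rough illustration of the "two versions of the interface functions" idea, here is a minimal sketch in C. It is not Open MPI code: the names `send_fn_t`, `pml_ops_t`, `pml_select`, `send_fast`, and `send_with_failover` are hypothetical, and the bookkeeping is reduced to a counter. The point is only that the choice is made once, at selection time, so the non-failover path carries no extra branches.

```c
#include <stddef.h>

/* Hypothetical signature for one interface function (a send). */
typedef int (*send_fn_t)(const void *buf, size_t len);

/* Stand-in for whatever extra work failover would do per send. */
int failover_bookkeeping_calls = 0;

/* Version 1: no failover logic at all on the critical path. */
int send_fast(const void *buf, size_t len)
{
    (void)buf;
    return (int)len;
}

/* Version 2: same interface, plus the (assumed) failover bookkeeping. */
int send_with_failover(const void *buf, size_t len)
{
    (void)buf;
    failover_bookkeeping_calls++;
    return (int)len;
}

/* Hypothetical table of interface functions. */
typedef struct {
    send_fn_t send;
} pml_ops_t;

/* Pick one version up front; callers then just invoke ops.send(). */
pml_ops_t pml_select(int failover_enabled)
{
    pml_ops_t ops;
    ops.send = failover_enabled ? send_with_failover : send_fast;
    return ops;
}
```

Callers would hold the returned table and call `ops.send(buf, len)` without knowing which version was selected.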

Re: [OMPI devel] Device failover on ob1

2009-08-01 Thread Graham, Richard L.
What is the impact on sm, which is by far the most sensitive to latency? This really belongs in a place other than ob1. Ob1 is supposed to provide the lowest latency possible, and other pml's are supposed to be used for heavier weight protocols. On the technical side, how do you distinguish be

[OMPI devel] Device failover on ob1

2009-07-31 Thread Mouhamed Gueye
Hi list, Here is an update on our work concerning device failover. As many of you suggested, we reoriented our work on ob1 rather than dr and we now have a working prototype on top of ob1. The approach is to store btl descriptors sent to peers and delete them when we receive proof of delivery
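
The store-and-delete approach described above can be sketched roughly as follows. This is a simplified illustration, not the prototype's code: `pending_send_t`, `pending_list_t`, and the three functions are hypothetical names, the descriptor is an opaque pointer, and the resend-on-failure step is an assumption about how the stored descriptors would be used.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record for one in-flight send; not an Open MPI type. */
typedef struct pending_send {
    unsigned long seq;            /* identifies the send */
    void *descriptor;             /* opaque handle to the stored descriptor */
    struct pending_send *next;
} pending_send_t;

typedef struct {
    pending_send_t *head;
    size_t count;
} pending_list_t;

/* Keep a descriptor until delivery is confirmed. Returns 0 on success. */
int pending_store(pending_list_t *list, unsigned long seq, void *descriptor)
{
    pending_send_t *p = malloc(sizeof(*p));
    if (p == NULL) {
        return -1;
    }
    p->seq = seq;
    p->descriptor = descriptor;
    p->next = list->head;
    list->head = p;
    list->count++;
    return 0;
}

/* Delete the entry for seq once proof of delivery arrives.
 * Returns 1 if an entry was removed, 0 if seq was not pending. */
int pending_ack(pending_list_t *list, unsigned long seq)
{
    pending_send_t **pp = &list->head;
    while (*pp != NULL) {
        if ((*pp)->seq == seq) {
            pending_send_t *dead = *pp;
            *pp = dead->next;
            free(dead);
            list->count--;
            return 1;
        }
        pp = &(*pp)->next;
    }
    return 0;
}

/* Assumed failure path: pass every unacknowledged descriptor to a
 * caller-supplied resend callback. Returns the number handed over. */
size_t pending_resend_all(const pending_list_t *list,
                          void (*resend)(void *descriptor))
{
    size_t n = 0;
    for (const pending_send_t *p = list->head; p != NULL; p = p->next) {
        resend(p->descriptor);
        n++;
    }
    return n;
}
```

A linked list keeps the sketch short; a real implementation would more likely index pending sends per peer so that acknowledgements can be matched without a linear scan.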