On Apr 19, 2013, at 17:00 , John Chludzinski <john.chludzin...@gmail.com> wrote:

> So the apparent conclusion to this thread is that an (Open)MPI-based RTI is 
> very doable, if we allow for the future development of dynamic joining and 
> leaving of the MPI collective?

John,

What do you mean by dynamically joining and leaving of the MPI collective? 

There are quite a few functions in MPI to dynamically join and disconnect 
processes (MPI_Comm_spawn, MPI_Comm_connect/MPI_Comm_accept, MPI_Comm_join). 
So if your processes __always__ leave cleanly (using the defined MPI pattern 
of comm_disconnect + comm_free), you might be lucky enough to have this 
working today. If you want to support processes leaving for reasons outside 
of your control (such as a crash), standard MPI gives you no option today; 
you need to use an extension (such as ULFM).

  George.



>  
> ---John
> 
> 
> On Wed, Apr 17, 2013 at 12:45 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Thanks for the clarification - very interesting indeed! I'll look at it more 
> closely.
> 
> 
> On Apr 17, 2013, at 9:20 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
>> On Apr 16, 2013, at 15:51 , Ralph Castain <r...@open-mpi.org> wrote:
>> 
>>> Just curious: I thought ULFM dealt with recovering an MPI job where one or 
>>> more processes fail. Is this correct?
>> 
>> It depends on which definition of "recovering" you take. ULFM is about 
>> leaving the processes that remain (after a fault or a disconnect) in a 
>> state that allows them to continue to make progress. It is not about 
>> recovering processes or user data, but it does provide the minimal set 
>> of functionality that allows an application to do this, if needed (revoke, 
>> agreement and shrink).
>> 
>>> HLA/RTI consists of processes that start at random times, run to 
>>> completion, and then exit normally. While a failure could occur, most 
>>> process terminations are normal and there is no need/intent to revive them.
>> 
>> As I said above, there is no revival of processes in ULFM, and it was never 
>> our intent to have such a feature. The dynamic world is to be constructed 
>> using MPI-2 constructs (MPI_Comm_spawn, MPI_Comm_connect/MPI_Comm_accept, 
>> or even MPI_Comm_join).
>> 
>>> So it's mostly a case of massively exercising MPI's dynamic 
>>> connect/accept/disconnect functions.
>>> 
>>> Do ULFM's structures have some utility for that purpose?
>> 
>> Absolutely. If a process that leaves calls exit() instead of MPI_Finalize, 
>> the ULFM version of the runtime will interpret this as an event that 
>> triggers a report. All the ensuing mechanisms are then activated, and the 
>> application can react to this event in whatever way it finds most 
>> meaningful.
>> 
>>   George.
>> 
>>> 
>>> 
>>> On Apr 16, 2013, at 3:20 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>> 
>>>> There is an ongoing effort, called ULFM, to address the potential 
>>>> volatility of processes in MPI. A working version is available at 
>>>> http://fault-tolerance.org. It (mostly) supports TCP, sm and IB. There you 
>>>> will find some examples and the document explaining the additional 
>>>> constructs needed in MPI to achieve this.
>>>> 
>>>>   George.
>>>> 
>>>> On Apr 15, 2013, at 17:29 , John Chludzinski <john.chludzin...@gmail.com> 
>>>> wrote:
>>>> 
>>>>> That would seem to preclude its use for an RTI.  Unless you have a card 
>>>>> up your sleeve?
>>>>>  
>>>>> ---John
>>>>> 
>>>>> 
>>>>> On Mon, Apr 15, 2013 at 11:23 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> It isn't the fact that there are multiple programs being used - we 
>>>>> support that just fine. The problem with HLA/RTI is that it allows 
>>>>> programs to come/go at will - i.e., not every program has to start at the 
>>>>> same time, nor complete at the same time. MPI requires that all programs 
>>>>> be executing at the beginning, and that all call finalize prior to anyone 
>>>>> exiting.
>>>>> 
>>>>> 
>>>>> On Apr 15, 2013, at 8:14 AM, John Chludzinski 
>>>>> <john.chludzin...@gmail.com> wrote:
>>>>> 
>>>>>> I just received an e-mail notifying me that MPI-2 supports MPMD.  This 
>>>>>> would seem to be just what the doctor ordered?
>>>>>>  
>>>>>> ---John
>>>>>> 
>>>>>> 
>>>>>> On Mon, Apr 15, 2013 at 11:10 AM, Ralph Castain <r...@open-mpi.org> 
>>>>>> wrote:
>>>>>> FWIW: some of us are working on a variant of MPI that would indeed 
>>>>>> support what you describe - it would support send/recv (i.e., MPI-1), 
>>>>>> but not collectives, and so would allow communication between arbitrary 
>>>>>> programs.
>>>>>> 
>>>>>> Not specifically targeting HLA/RTI, though I suppose a wrapper that 
>>>>>> conformed to that standard could be created.
>>>>>> 
>>>>>> On Apr 15, 2013, at 7:50 AM, John Chludzinski 
>>>>>> <john.chludzin...@gmail.com> wrote:
>>>>>> 
>>>>>> > This would be a departure from the SPMD paradigm that seems central to
>>>>>> > MPI's design. Each process would be a completely different program
>>>>>> > (piece of code), and I'm not sure how well that would work using
>>>>>> > MPI.
>>>>>> >
>>>>>> > BTW, MPI is commonly used in the parallel discrete event world for
>>>>>> > communication between LPs (logical processes; federates in HLA). But these LPs are
>>>>>> > usually the same program.
>>>>>> >
>>>>>> > ---John
>>>>>> >
>>>>>> > On Mon, Apr 15, 2013 at 10:22 AM, John Chludzinski
>>>>>> > <john.chludzin...@gmail.com> wrote:
>>>>>> >> Is anyone aware of an MPI based HLA/RTI (DoD High Level Architecture
>>>>>> >> (HLA) / Runtime Infrastructure)?
>>>>>> >>
>>>>>> >> ---John
>>>>>> > _______________________________________________
>>>>>> > users mailing list
>>>>>> > us...@open-mpi.org
>>>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users