Re: [OMPI devel] Fake Modex

2011-06-16 Thread Hugo Meyer
Hello. Thanks for your answers. It's as you said, Josh: I'm trying to do something uncoordinated and on demand. What I'm doing now is putting some code in btl_tcp_endpoint.c and other files that lets me change the communication attempts on the sockets when a failure occurs. At the mom

Re: [OMPI devel] Fake Modex

2011-06-13 Thread Josh Hursey
I don't think this will help much, but I can tell you how we handled this for the coordinated C/R functionality. When we added automatic recovery and process migration using coordinated checkpoints to the Open MPI trunk (spring/summer 2010) we were able to take advantage of the coordinated nature

Re: [OMPI devel] Fake Modex

2011-06-04 Thread Ralph Castain
On Jun 4, 2011, at 5:21 AM, Hugo Meyer wrote:
> Thanks for your replies.
>
> >After doing that, the MPI_Init procedure calls grpcomm.modex to distribute
> >the data across all procs in the job. Unfortunately, being a collective, all
> >procs must participate. In your case, you'll have to find

Re: [OMPI devel] Fake Modex

2011-06-04 Thread Hugo Meyer
Thanks for your replies.
>After doing that, the MPI_Init procedure calls grpcomm.modex to distribute the data across all procs in the job. Unfortunately, being a collective, all procs must participate. In your case, you'll have to find a different way to do it. Upon receipt, each proc updates its

Re: [OMPI devel] Fake Modex

2011-06-03 Thread Jeff Squyres
On Jun 3, 2011, at 10:12 AM, Ralph Castain wrote:
> When an MPI proc calls MPI_Init, each btl pushes its contact info into the
> modex database - one example is the btl.tcp.1.7 info you found there. That
> entry is for the TCP btl, which is probably what you are looking for. There
> is no way f

Re: [OMPI devel] Fake Modex

2011-06-03 Thread Ralph Castain
On Jun 3, 2011, at 8:03 AM, Hugo Meyer wrote:
> Hello Ralph.
>
> Are you talking about an MPI communication? If so, then you need to update
> every proc's modex info for the proc that moved - this is something stored
> in each MPI proc's memory, so it isn't something that you can just get fro

Re: [OMPI devel] Fake Modex

2011-06-03 Thread Hugo Meyer
Hello Ralph. Are you talking about an MPI communication? If so, then you need to update every proc's modex info for the proc that moved - this is something stored in each MPI proc's memory, so it isn't something that you can just get from the daemon on-demand. You'll have to provide the update to

Re: [OMPI devel] Fake Modex

2011-06-03 Thread Ralph Castain
Are you talking about an MPI communication? If so, then you need to update every proc's modex info for the proc that moved - this is something stored in each MPI proc's memory, so it isn't something that you can just get from the daemon on-demand. You'll have to provide the update to every sing

Re: [OMPI devel] Fake Modex

2011-06-02 Thread Hugo Meyer
Hello again. My actual problem is that I don't know where the struct is that holds the information used to send messages to the procs. Something like:

Rank  URI
0     21222:tcp:192.168.1.1:1250
1     21223:tcp:192.168.1.2:1250
.
.

Because what i need
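For illustration only, the rank-to-URI table described in this message could be modelled roughly as below. This is a minimal sketch and not Open MPI's actual data structure: the type `contact_table_t`, the functions `contact_table_init`, `contact_table_set` and `contact_table_get`, and the limits `MAX_PROCS` and `URI_LEN` are all hypothetical names invented for this example. It only shows the idea of overwriting one rank's contact entry after that process has been restarted elsewhere.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PROCS 64   /* hypothetical upper bound on the number of ranks */
#define URI_LEN   64   /* hypothetical maximum URI length, including NUL  */

/* Hypothetical per-process contact table, indexed by rank.
 * NOT an Open MPI structure; it only mirrors the Rank/URI table above. */
typedef struct {
    char uri[MAX_PROCS][URI_LEN];
    int  nprocs;
} contact_table_t;

/* Initialise an empty table for nprocs ranks. Returns 0 on success, -1 on bad input. */
int contact_table_init(contact_table_t *t, int nprocs)
{
    if (t == NULL || nprocs < 0 || nprocs > MAX_PROCS) {
        return -1;
    }
    memset(t, 0, sizeof(*t));
    t->nprocs = nprocs;
    return 0;
}

/* Store or overwrite the URI for a rank, e.g. after the process moved.
 * Returns 0 on success, -1 if the rank is out of range or the URI is too long. */
int contact_table_set(contact_table_t *t, int rank, const char *uri)
{
    if (t == NULL || uri == NULL || rank < 0 || rank >= t->nprocs) {
        return -1;
    }
    if (strlen(uri) >= URI_LEN) {
        return -1;
    }
    strcpy(t->uri[rank], uri);
    return 0;
}

/* Look up the URI for a rank. Returns NULL if the rank is unknown or unset. */
const char *contact_table_get(const contact_table_t *t, int rank)
{
    if (t == NULL || rank < 0 || rank >= t->nprocs) {
        return NULL;
    }
    if (t->uri[rank][0] == '\0') {
        return NULL;
    }
    return t->uri[rank];
}
```

In this sketch a migration amounts to one `contact_table_set` call with the new URI. As the replies in this thread point out, the real contact information lives in each MPI proc's own memory, so an update like this would have to reach every proc rather than a single central table.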

[OMPI devel] Fake Modex

2011-05-31 Thread Hugo Meyer
Hello @ll. I need some help restarting communication with a process that I restore on a different node. My situation is as follows: the process fails and is restored successfully on another node from a previous checkpoint that I sent there. Now, when a process tries to send a message to t