Re: [OMPI devel] Fake Modex

2011-06-16 Thread Hugo Meyer
t more. For their uncoordinated C/R approach they > would have had to deal with this when restarting processes mid-run > without halting other processes. So maybe you can use a similar > approach. > > -- Josh > > > On Sat, Jun 4, 2011 at 10:55 AM, Ralph Castain wrote: > >

Re: [OMPI devel] Fake Modex

2011-06-04 Thread Hugo Meyer
Thanks for your replies. >After doing that, the MPI_Init procedure calls grpcomm.modex to distribute the data across all procs in the job. Unfortunately, being a collective, all procs must participate. In your case, you'll have to find a different way to do it. Upon receipt, each proc updates its

Re: [OMPI devel] Fake Modex

2011-06-03 Thread Hugo Meyer
comm/base/grpcomm_base_modex.c. You'll have to create new code > to send/recv an update message, but the code to update the database entry > exists. > > > On Jun 2, 2011, at 7:52 AM, Hugo Meyer wrote: > > Hello again. > > My actual problem is that i don

Re: [OMPI devel] Fake Modex

2011-06-02 Thread Hugo Meyer
hat i need is to update it when i move a process from its original site, is there something like this?? Thanks a lot. Hugo 2011/5/31 Hugo Meyer > Hello @ll. > > I'm needing some help to restart the communication with a process that i > restore in a different node. My situation

[OMPI devel] Fake Modex

2011-05-31 Thread Hugo Meyer
Hello @ll. I need some help restarting communication with a process that I restore on a different node. My situation is as follows: the process fails and is successfully restored on another node from a previous checkpoint that I sent there. Now, when a process tries to send a message to t

Re: [OMPI devel] Paffinity Error.

2011-05-15 Thread Hugo Meyer
figured the system to no-build them. > > > On May 12, 2011, at 11:31 AM, Hugo Meyer wrote: > > Hello. > > I'm getting an error when i try to use the paffinity option: > > Open MPI tried to bind a new process, but something went wrong. The > process was killed witho

[OMPI devel] Paffinity Error.

2011-05-12 Thread Hugo Meyer
narios/bin/mpirun -v -n 8 \ -tag-output \ --hostfile ../hostfile \ --slot-list 1:1 \ --bynode \ ./mm-static 1000 100 Am I doing something wrong? Thanks for the help. Hugo Meyer

Re: [OMPI devel] Add child to another parent.

2011-04-13 Thread Hugo Meyer
When the proc restarts, it calls orte_routed.init_routes. If you look in routed cm, you should see a call to "register_sync" - this is where the proc sends a message to the local daemon, allowing it to "learn" the port/address where the proc resides. I've done this. I had a problem because when i

Re: [OMPI devel] Add child to another parent.

2011-04-08 Thread Hugo Meyer
.c - > should be something in there that updates the lifeline during restart of a > checkpoint. > > > On Apr 6, 2011, at 7:50 AM, Hugo Meyer wrote: > > Hi all. > > > I corrected the error with the port. The mistake was because he tried to > start the process back and

Re: [OMPI devel] Add child to another parent.

2011-04-06 Thread Hugo Meyer
],1] state COMMUNICATION FAILURE exit_code 1 [1,1]:[clus5:19425] [[65478,1],1] routed:cm: Connection to lifeline [[65478,0],1] lost [1,1]:[[65478,1],1] assigned port 31256 Any help on how to solve this error, or how to interpret it will be greatly appreciated. Best regards. Hugo 2011/4/5 Hugo Meyer >

Re: [OMPI devel] Add child to another parent.

2011-04-05 Thread Hugo Meyer
s will find out about this? Is this a good choice? Best regards. Hugo Meyer 2011/3/31 Hugo Meyer > Ok Ralph. > Thanks a lot, i will resend this message with a new subject. > > Best Regards. > > Hugo > > > 2011/3/31 Ralph Castain > >> Sorry - should have included

[OMPI devel] Setting Checkpoint path and executables

2011-03-31 Thread Hugo Meyer
alues of where the checkpoints are stored and their exec names, taking into account my situation? Best Regards. Hugo Meyer

Re: [OMPI devel] Add child to another parent.

2011-03-31 Thread Hugo Meyer
osh created a man page > to explain how sstore works. It's in section 7, looks like "man orte_sstore" > should get it. > > > On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote: > > Hello again. > > I'm working in the launch code to handle my checkpoints, but

Re: [OMPI devel] Add child to another parent.

2011-03-30 Thread Hugo Meyer
the functions to pass the details of the checkpoint and the PID. Best Regards. Hugo Meyer 2011/3/30 Hugo Meyer > Thanks Ralph. > I have finished point (a), and now it's working; now I have to work to > relaunch from my checkpoint as you said. > > Best regards. > > Hugo Meyer

Re: [OMPI devel] Add child to another parent.

2011-03-30 Thread Hugo Meyer
Thanks Ralph. I have finished point (a), and now it's working; now I have to work to relaunch from my checkpoint as you said. Best regards. Hugo Meyer 2011/3/29 Ralph Castain > The resilient mapper -only- works on procs being restarted - it cannot map > a job for its initial launc

Re: [OMPI devel] Add child to another parent.

2011-03-29 Thread Hugo Meyer
e a flag that i'm not turning on? or a component that i should have selected? Thanks again. Hugo Meyer 2011/3/26 Hugo Meyer > Ok Ralph. > > Thanks a lot for your help, i will do as you said and then let you know how > it goes. > > Best Regards. > > Hugo Meyer >

Re: [OMPI devel] Add child to another parent.

2011-03-26 Thread Hugo Meyer
Ok Ralph. Thanks a lot for your help, i will do as you said and then let you know how it goes. Best Regards. Hugo Meyer 2011/3/25 Ralph Castain > > On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote: > > From what you've described before, I suspect all you'll need to do

Re: [OMPI devel] Add child to another parent.

2011-03-25 Thread Hugo Meyer
rom rsh) in the PLM framework and then use the orted_comm to command a remote_spawn in the protector, but i don't know here how to update the info so everyone knows about the change or how this is managed. I might be very wrong in what I said, my apologies if so. Thanks a lot for all the help. Best regards. Hugo Meyer

Re: [OMPI devel] Add child to another parent.

2011-03-24 Thread Hugo Meyer
oning and try to use it. > > At the least, the cited code should provide guidance on how to correctly > restart procs if you need your own errmgr module for other reasons. > Again thanks Ralph, you have been very helpful. Best regards. Hugo Meyer

[OMPI devel] Add child to another parent.

2011-03-24 Thread Hugo Meyer
jobdat = (orte_odls_job_t*)item; if (jobdat->jobid == child->name->jobid) { break; } } app = jobdat->apps[child->app_idx]; In order to do this, I need to have the child in the jobdat. If no such thing is implemented, could someone give me advice on how to do this? Best Regards. Hugo Meyer

Re: [OMPI devel] Setting data into the orte_proc_t

2011-03-23 Thread Hugo Meyer
Thanks again Ralph. I've solved it, thanks to you. My first mistake was what you told me, and then I realized that I have to communicate with the HNP when the vprotocol initiates so it can set that data in the orte_proc_t. Again thanks. Hugo Meyer 2011/3/23 Ralph Castain > > On Mar 2

[OMPI devel] Setting data into the orte_proc_t

2011-03-23 Thread Hugo Meyer
getting now my default initial value. Thanks in advance. Best Regards. Hugo Meyer

Re: [OMPI devel] JDATA access problem.

2011-03-22 Thread Hugo Meyer
Yes. That was the problem Ralph. Again, thanks a lot for your help, it was a silly mistake of mine :). Best regards. Hugo Meyer 2011/3/22 Ralph Castain > The problem is here: > > /* Pack the faulty vpid */ >

Re: [OMPI devel] JDATA access problem.

2011-03-22 Thread Hugo Meyer
have made some changes after my first email, but what I'm trying to do is basically the same. On line 23 of the orted_comm.c that I'm sending, I'm always getting NULL as a result, so I can't obtain the jdata. Thanks a lot again for your help. Best Regards. Hugo Meyer ca

Re: [OMPI devel] JDATA access problem.

2011-03-21 Thread Hugo Meyer
the jdata objects are not > populated. The daemons work exclusively from the orte_local_jobdata and > orte_local_children lists, so you would have to find your process there. > That's why I'm asking the HNP about the jdata using ORTE_DAEMON_REPORT_JOB_INFO_CMD; I assume that it has the information about the dead process. Any ideas? Best regards. Hugo Meyer

[OMPI devel] JDATA access problem.

2011-03-21 Thread Hugo Meyer
ORTE_ERROR_LOG(rc); goto CLEANUP; } if (NULL == procs[proc->vpid] || NULL == procs[proc->vpid]->node) { OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output, "PROBLEM: procs[proc.vpid]==null")); } Thanks a lot. Hugo Meyer

Re: [OMPI devel] Communication Failure with orted_comm.c

2011-03-09 Thread Hugo Meyer
essage that > was interpreted as a ORTE_DAEMON_IOF_COMPLETE (21). Nothing more to get out > from your output unfortunately. > > george. > > On Mar 8, 2011, at 08:15 , Hugo Meyer wrote: > > > Hello @ll. > > > > I've got a problem in a communication bet

Re: [OMPI devel] Communication Failure with orted_comm.c

2011-03-08 Thread Hugo Meyer
ertainly be done - there are other sections of that code > that also send messages. I can't see the end of your new code section, but I > assume you ended it properly with a "break"? Otherwise, you'll execute > whatever lies below it as well. > > > On Mar 8, 2011

Re: [OMPI devel] Communication Failure with orted_comm.c

2011-03-08 Thread Hugo Meyer
Yes, i set the value 31 and it is not duplicated. 2011/3/8 Ralph Castain > What value did you set for this new command? Did you look at the cmds in > orte/mca/odls/odls_types.h to ensure you weren't using a duplicate value? > > > On Mar 8, 2011, at 6:15 AM, Hugo Meyer

[OMPI devel] Communication Failure with orted_comm.c

2011-03-08 Thread Hugo Meyer
CLEANUP;* } OBJ_RELEASE(answer); From testing, I assume that the error is in the bolded section, maybe because I am missing some statement when I try to communicate, or maybe this communication cannot be done. Any help will be appreciated. Thanks a lot. Hugo Meyer

Re: [OMPI devel] OMPI-MIGRATE error

2011-02-01 Thread Hugo Meyer
Hi Josh. Thanks for the reply, i've fixed the stuff with the passwd. But i'm still getting the segmentation fault. I'm sending you the output. I think that is almost the same output that i sent you yesterday. Best Regards. Hugo Meyer 2011/1/31 Joshua Hursey > That helped. T

Re: [OMPI devel] OMPI-MIGRATE error

2011-01-31 Thread Hugo Meyer
Best Regards Hugo Meyer 2011/1/31 Joshua Hursey > So I was not able to reproduce this issue. > > A couple notes: > - You can see the node-to-process-rank mapping using the '-display-map' > command line option to mpirun. This will give you the node names that Open > MP

Re: [OMPI devel] OMPI-MIGRATE error

2011-01-31 Thread Hugo Meyer
mentation fault Am I using the ompi-migrate command the right way, or am I missing something? I ask because the first attempt didn't find any process. Best Regards. Hugo Meyer 2011/1/28 Hugo Meyer > Thanks to you Joshua. > > I will try the procedure with this modification

Re: [OMPI devel] OMPI-MIGRATE error

2011-01-28 Thread Hugo Meyer
Thanks to you Joshua. I will try the procedure with these modifications and I will let you know how it goes. Best Regards. Hugo Meyer 2011/1/27 Joshua Hursey > I believe that this is now fixed on the trunk. All the details are in the > commit message: > https://svn.open-mpi.org/

Re: [OMPI devel] OMPI-MIGRATE error

2011-01-27 Thread Hugo Meyer
_recover_proc [clus9:18082] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages I assume that orte_get_job_data_object is the problem, because it is not obtaining the proper value. If you need more data, just let me know. Best Regards. Hugo Meyer 2011/1/26 Joshua H

Re: [OMPI devel] OMPI-MIGRATE error

2011-01-26 Thread Hugo Meyer
Josh. ompi-checkpoint and its restart are now working great, but the same error persists with ompi-migrate. I've also tried using "-r", but I get the same error. Best regards. Hugo Meyer 2011/1/26 Hugo Meyer > Thanks Josh. > > I've already checked the prelin

Re: [OMPI devel] OMPI-MIGRATE error

2011-01-26 Thread Hugo Meyer
Thanks Josh. I've already checked the prelink and it is set to "no". I'm going to try with the trunk head, and then I'll let you know how it goes. Best regards. Hugo Meyer 2011/1/25 Joshua Hursey > Can you try with the current trunk head (r24296)? > I j

[OMPI devel] OMPI-MIGRATE error

2011-01-24 Thread Hugo Meyer
restore an application that has more than one process, it is restored and runs up to the last line before MPI_FINALIZE(), but the processes never finalize; I assume that they never call MPI_FINALIZE(). With a single process, ompi-checkpoint and ompi-restart work great. Best regards. Hugo Meyer

Re: [OMPI devel] Change in communication between process (RMAPS)

2011-01-07 Thread Hugo Meyer
a look at the code of the components that you mention, and I will let you know how things are going. Thanks a lot. Hugo Meyer 2011/1/6 Joshua Hursey > So I can point you to some of the work that I did while at Indiana > University to support process migration in Open MPI in a coordinated manner.

Re: [OMPI devel] Change in communication between process (RMAPS)

2011-01-06 Thread Hugo Meyer
ut without making a coordinated checkpoint. I just need to checkpoint processes in an uncoordinated way, and move them. Where can i see something about process migration in the code? or something that could guide me. Greetings. Hugo Meyer 2011/1/6 Jeff Squyres > Sorry for the delay; you wrote

[OMPI devel] Change in communication between process (RMAPS)

2010-12-28 Thread Hugo Meyer
odify tables such as orte_job_map_t and orte_proc_t, but I wanted to know if someone already has experience doing something similar and can at least guide me. The communication between processes, in principle, would be irrelevant, so I will not need to use checkpoints / restarts for now. Greetings Hugo Meyer