t more. For their uncoordinated C/R approach they
> would have had to deal with this when restarting processes mid-run
> without halting other processes. So maybe you can use a similar
> approach.
>
> -- Josh
>
>
> On Sat, Jun 4, 2011 at 10:55 AM, Ralph Castain wrote:
> >
Thanks for your replies.
>After doing that, the MPI_Init procedure calls grpcomm.modex to distribute
the data across all procs in the job. Unfortunately, being a collective, all
procs must participate. In your case, you'll have to find a different way to
do it. Upon receipt, each proc updates its
comm/base/grpcomm_base_modex.c. You'll have to create new code
> to send/recv an update message, but the code to update the database entry
> exists.
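A minimal sketch of the point-to-point alternative described above, assuming a simple per-process table of contact strings. Every name here (demo_update_msg_t, demo_peer_table_t, demo_apply_update) is hypothetical and is not part of ORTE; the real update code lives in the grpcomm base, and the send/recv transport is left out on purpose.

```c
#include <stddef.h>
#include <stdio.h>

#define DEMO_MAX_PROCS 8   /* hypothetical job size */
#define DEMO_URI_LEN   64  /* hypothetical contact-string length */

/* Hypothetical update message: which rank moved, and its new contact info. */
typedef struct {
    unsigned vpid;
    char uri[DEMO_URI_LEN];
} demo_update_msg_t;

/* Hypothetical local database: one contact string per rank. */
typedef struct {
    char uri[DEMO_MAX_PROCS][DEMO_URI_LEN];
} demo_peer_table_t;

/* Apply one received update to the local table.
 * Returns 0 on success, -1 on a bad argument or out-of-range vpid. */
int demo_apply_update(demo_peer_table_t *table, const demo_update_msg_t *msg)
{
    if (table == NULL || msg == NULL || msg->vpid >= DEMO_MAX_PROCS) {
        return -1;
    }
    /* snprintf always NUL-terminates and truncates overlong input. */
    snprintf(table->uri[msg->vpid], DEMO_URI_LEN, "%s", msg->uri);
    return 0;
}
```

Unlike the collective modex, each process could apply such an update whenever the message arrives, so no process has to wait for the others.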
>
>
> On Jun 2, 2011, at 7:52 AM, Hugo Meyer wrote:
>
> Hello again.
>
> My actual problem is that i don
hat i need is to update it when i move a process from its original
site, is there something like this??
Thanks a lot.
Hugo
2011/5/31 Hugo Meyer
> Hello @ll.
>
> I need some help restarting communication with a process that I
> restored on a different node. My situation
Hello @ll.
I need some help restarting communication with a process that I
restored on a different node. My situation is as follows:
The process fails and is successfully restored on another node from a
previous checkpoint that I sent there. Now, when a process tries to send a
message to t
figured the system to no-build them.
>
>
> On May 12, 2011, at 11:31 AM, Hugo Meyer wrote:
>
> Hello.
>
> I'm getting an error when i try to use the paffinity option:
>
> Open MPI tried to bind a new process, but something went wrong. The
> process was killed witho
narios/bin/mpirun -v -n 8 \
-tag-output \
--hostfile ../hostfile \
--slot-list 1:1 \
--bynode \
./mm-static 1000 100
Am I doing something wrong?
Thanks for the help.
Hugo Meyer
When the proc restarts, it calls orte_routed.init_routes. If you look in
routed cm, you should see a call to "register_sync" - this is where the proc
sends a message to the local daemon, allowing it to "learn" the port/address
where the proc resides.
I've done this. I had a problem because when i
.c -
> should be something in there that updates the lifeline during restart of a
> checkpoint.
>
>
> On Apr 6, 2011, at 7:50 AM, Hugo Meyer wrote:
>
> Hi all.
>
>
> I corrected the error with the port. The mistake was because it tried to
> start the process back and
],1] state COMMUNICATION
FAILURE exit_code 1
[1,1]:[clus5:19425] [[65478,1],1] routed:cm: Connection to lifeline
[[65478,0],1] lost
[1,1]:[[65478,1],1] assigned port 31256
Any help on how to solve this error, or how to interpret it, will be greatly
appreciated.
Best regards.
Hugo
2011/4/5 Hugo Meyer
>
s
will find out about this? Is this a good choice?
Best regards.
Hugo Meyer
2011/3/31 Hugo Meyer
> Ok Ralph.
> Thanks a lot, I will resend this message with a new subject.
>
> Best Regards.
>
> Hugo
>
>
> 2011/3/31 Ralph Castain
>
>> Sorry - should have included
alues of where the checkpoint are stored and
his exec names taking into account my situation?
Best Regards.
Hugo Meyer
osh created a man page
> to explain how sstore works. It's in section 7, looks like "man orte_sstore"
> should get it.
>
>
> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>
> Hello again.
>
> I'm working on the launch code to handle my checkpoints, but
the functions
to pass the details of the checkpoint and the PID.
Best Regards.
Hugo Meyer
2011/3/30 Hugo Meyer
> Thanks Ralph.
> I have finished point (a), and it is working now. Next I have to work on
> relaunching from my checkpoint, as you said.
>
> Best regards.
>
> Hugo Meyer
Thanks Ralph.
I have finished point (a), and it is working now. Next I have to work on
relaunching from my checkpoint, as you said.
Best regards.
Hugo Meyer
2011/3/29 Ralph Castain
> The resilient mapper -only- works on procs being restarted - it cannot map
> a job for its initial launc
e a flag that I'm not turning on, or a component that I should have
selected?
Thanks again.
Hugo Meyer
2011/3/26 Hugo Meyer
> Ok Ralph.
>
> Thanks a lot for your help, I will do as you said and then let you know how
> it goes.
>
> Best Regards.
>
> Hugo Meyer
>
Ok Ralph.
Thanks a lot for your help, I will do as you said and then let you know how
it goes.
Best Regards.
Hugo Meyer
2011/3/25 Ralph Castain
>
> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>
> From what you've described before, I suspect all you'll need to do
rom rsh) in
the PLM framework and then use the orted_comm to command a remote_spawn in
the protector, but I don't know how to update the info here so that everyone
knows about the change, or how this is managed.
I might be very wrong in what I said, my apologies if so.
Thanks a lot for all the help.
Best regards.
Hugo Meyer
oning and try to use it.
>
> At the least, the cited code should provide guidance on how to correctly
> restart procs if you need your own errmgr module for other reasons.
>
Again thanks Ralph, you have been very helpful.
Best regards.
Hugo Meyer
    jobdat = (orte_odls_job_t*)item;
    if (jobdat->jobid == child->name->jobid) {
        break;
    }
}
app = jobdat->apps[child->app_idx];
In order to do this, I need to have the child in the jobdat. If no such
thing is implemented, could someone give me some advice on how to do this?
Best Regards.
Hugo Meyer
Thanks again Ralph.
I've solved it, thanks to you. My first mistake was what you told me, and then I
realized that I have to communicate with the HNP when the vprotocol starts
so that it can set that data in the orte_proc_t.
Again thanks.
Hugo Meyer
2011/3/23 Ralph Castain
>
> On Mar 2
getting now
my default initial value.
Thanks in advance.
Best Regards.
Hugo Meyer
Yes.
That was the problem Ralph. Again, thanks a lot for your help, it was a
silly mistake of mine :).
Best regards.
Hugo Meyer
2011/3/22 Ralph Castain
> The problem is here:
>
> /* Pack the faulty vpid */
>
have
made some changes after my first email, but what I'm trying to do is
basically the same. In line 23 of the orted_comm.c that I'm sending,
I always get NULL as a result, so I can't obtain the jdata.
Thanks a lot again for your help.
Best Regards.
Hugo Meyer
ca
the jdata objects are not
> populated. The daemons work exclusively from the orte_local_jobdata and
> orte_local_children lists, so you would have to find your process there.
>
That's why I'm asking the HNP about the jdata using
ORTE_DAEMON_REPORT_JOB_INFO_CMD; I assume that it has the information about
the dead process.
Any idea?
Best regards.
Hugo Meyer
    ORTE_ERROR_LOG(rc);
    goto CLEANUP;
}

if (NULL == procs[proc->vpid] ||
    NULL == procs[proc->vpid]->node) {
    OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output,
                         "PROBLEM: procs[proc.vpid]==null"));
}
Thanks a lot.
Hugo Meyer
essage that
> was interpreted as a ORTE_DAEMON_IOF_COMPLETE (21). Nothing more to get out
> from your output unfortunately.
>
> george.
>
> On Mar 8, 2011, at 08:15 , Hugo Meyer wrote:
>
> > Hello @ll.
> >
> > I've got a problem in a communication bet
ertainly be done - there are other sections of that code
> that also send messages. I can't see the end of your new code section, but I
> assume you ended it properly with a "break"? Otherwise, you'll execute
> whatever lies below it as well.
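To illustrate the "break" point above, here is a small self-contained sketch of C switch fall-through. The command names and values are hypothetical, chosen only for the demonstration; they are not the constants from the ORTE sources.

```c
/* Hypothetical command codes, for illustration only. */
enum demo_cmd { DEMO_CMD_REPORT = 1, DEMO_CMD_NEW = 2, DEMO_CMD_EXIT = 3 };

/* Each handler sets one bit, so the result shows which handlers ran. */
#define DEMO_RAN_REPORT (1u << 0)
#define DEMO_RAN_NEW    (1u << 1)
#define DEMO_RAN_EXIT   (1u << 2)

/* The new case is missing its break, so control falls into the next case. */
unsigned demo_without_break(enum demo_cmd cmd)
{
    unsigned ran = 0;
    switch (cmd) {
    case DEMO_CMD_REPORT:
        ran |= DEMO_RAN_REPORT;
        break;
    case DEMO_CMD_NEW:
        ran |= DEMO_RAN_NEW;
        /* no break here: execution continues into DEMO_CMD_EXIT */
    case DEMO_CMD_EXIT:
        ran |= DEMO_RAN_EXIT;
        break;
    }
    return ran;
}

/* Same switch with the break in place: only the matching handler runs. */
unsigned demo_with_break(enum demo_cmd cmd)
{
    unsigned ran = 0;
    switch (cmd) {
    case DEMO_CMD_REPORT:
        ran |= DEMO_RAN_REPORT;
        break;
    case DEMO_CMD_NEW:
        ran |= DEMO_RAN_NEW;
        break;
    case DEMO_CMD_EXIT:
        ran |= DEMO_RAN_EXIT;
        break;
    }
    return ran;
}
```

Calling demo_without_break(DEMO_CMD_NEW) runs both the NEW and the EXIT handlers, while demo_with_break(DEMO_CMD_NEW) runs only the NEW handler.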
>
>
> On Mar 8, 2011
Yes, I set the value to 31 and it is not duplicated.
2011/3/8 Ralph Castain
> What value did you set for this new command? Did you look at the cmds in
> orte/mca/odls/odls_types.h to ensure you weren't using a duplicate value?
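One way to check a set of command values for duplicates is a small helper like the sketch below. The function name is hypothetical, and the values passed to it would have to be copied by hand from the header; nothing here reads the ORTE sources.

```c
#include <stddef.h>

/* Returns the index of the first value that repeats an earlier entry,
 * or -1 if all n values are distinct (or the array is empty). */
int demo_first_duplicate(const int *values, size_t n)
{
    if (values == NULL) {
        return -1;
    }
    for (size_t i = 1; i < n; i++) {
        for (size_t j = 0; j < i; j++) {
            if (values[i] == values[j]) {
                return (int)i;
            }
        }
    }
    return -1;
}
```

The quadratic scan is deliberate: a list of command codes is short, so simplicity matters more than speed here.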
>
>
> On Mar 8, 2011, at 6:15 AM, Hugo Meyer
CLEANUP;*
}
OBJ_RELEASE(answer);
From testing, I assume that the error is in the bolded section, maybe because
I'm missing some statement when I try to communicate, or maybe this
communication cannot be done. Any help will be appreciated.
Thanks a lot.
Hugo Meyer
Hi Josh.
Thanks for the reply. I've fixed the issue with the passwd, but I'm still
getting the segmentation fault. I'm sending you the output. I think it is
almost the same output that I sent you yesterday.
Best Regards.
Hugo Meyer
2011/1/31 Joshua Hursey
> That helped. T
Best Regards
Hugo Meyer
2011/1/31 Joshua Hursey
> So I was not able to reproduce this issue.
>
> A couple notes:
> - You can see the node-to-process-rank mapping using the '-display-map'
> command line option to mpirun. This will give you the node names that Open
> MP
mentation fault
Am I using the ompi-migrate command in the right way, or am I missing
something? I ask because the first attempt didn't find any process.
Best Regards.
Hugo Meyer
2011/1/28 Hugo Meyer
> Thanks to you Joshua.
>
> I will try the procedure with this modification
Thanks to you Joshua.
I will try the procedure with these modifications and I will let you know how
it goes.
Best Regards.
Hugo Meyer
2011/1/27 Joshua Hursey
> I believe that this is now fixed on the trunk. All the details are in the
> commit message:
> https://svn.open-mpi.org/
_recover_proc
[clus9:18082] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages
I assume that orte_get_job_data_object is the problem, because it is not
obtaining the proper value.
If you need more data, just let me know.
Best Regards.
Hugo Meyer
2011/1/26 Joshua H
Josh.
ompi-checkpoint and its restart are now working great, but the same
error persists with ompi-migrate. I've also tried using "-r", but I get the
same error.
Best regards.
Hugo Meyer
2011/1/26 Hugo Meyer
> Thanks Josh.
>
> I've already checked the prelin
Thanks Josh.
I've already checked the prelink and it is set to "no".
I'm going to try with the trunk head, and then I'll let you know how it
goes.
Best regards.
Hugo Meyer
2011/1/25 Joshua Hursey
> Can you try with the current trunk head (r24296)?
> I j
restore an application
that has more than one process, it is restored and executed up to the
last line before MPI_FINALIZE(), but the processes never finalize. I assume
that they never call MPI_FINALIZE(). With one process,
ompi-checkpoint and ompi-restart work great.
Best regards.
Hugo Meyer
a look at the code of the components that you mention, and
I will let you know how things are going.
Thanks a lot.
Hugo Meyer
2011/1/6 Joshua Hursey
> So I can point you to some of the work that I did while at Indiana
> University to support process migration in Open MPI in a coordinated manner.
ut without making a coordinated
checkpoint. I just need to checkpoint processes in an uncoordinated way and
move them.
Where can I find something about process migration in the code, or something
else that could guide me?
Greetings.
Hugo Meyer
2011/1/6 Jeff Squyres
> Sorry for the delay; you wrote
odify tables such as orte_job_map_t and orte_proc_t, but I
wanted to know if someone already has experience doing something similar
and can at least guide me.
The communication between processes would, in principle, be irrelevant, so I
will not need to use checkpoints / restarts for now.
Greetings
Hugo Meyer