Yes, that was the problem, Ralph. Thanks again for your help; it was a silly mistake of mine :).
Best regards.

Hugo Meyer

2011/3/22 Ralph Castain <r...@open-mpi.org>

> The problem is here:
>
>     /* Pack the faulty vpid */
>     if (ORTE_SUCCESS != (rc =
>             opal_dss.pack(buffer, &proc, 1, ORTE_NAME))) {
>         ORTE_ERROR_LOG(rc);
>         goto CLEANUP;
>     }
>
> The variable proc is apparently a pointer to orte_process_name_t. You
> therefore should have packed it like this:
>
>     /* Pack the faulty vpid */
>     if (ORTE_SUCCESS != (rc =
>             opal_dss.pack(buffer, proc, 1, ORTE_NAME))) {
>         ORTE_ERROR_LOG(rc);
>         goto CLEANUP;
>     }
>
> i.e., without the & in front. Accordingly, the problem was that the HNP
> was getting garbage for the process name, and thus finding NULL at the
> specified locations.
>
> Just for testing, you might want to print out the received process name to
> ensure your communication is correct :-)
>
>
> On Mar 22, 2011, at 5:58 AM, Hugo Meyer wrote:
>
> Thanks again Ralph for your reply.
>
>
>> There's your problem - that module is run in the daemon, where the
>> orte_job_data pointer array isn't used. You have to use the
>> orte_local_jobdata and orte_local_children lists instead. So once the HNP
>> replies with the jobid, you look up the orte_odls_job_t for that job from
>> the orte_local_jobdata list.
>>
>
> I'm now sending you all of the code involved. At the beginning I'm
> doing something along the lines of what you describe. Then, having the
> child info, I ask the HNP for the jobdata of the child, but I'm still
> getting no data about the child (that is, the dead process). I'm trying
> to get this info so I can send it to another orted to restart the failed
> process.
>
>
>> I'm not sure what you are trying to accomplish, so I can't give further
>> advice. Note that daemons have limited knowledge of application processes
>> that are not their own immediate children. What little they know regarding
>> processes other than their own is stored in the nidmap/pidmap arrays -
>> limited to location, local rank, and node rank. They have no storage
>> currently allocated for things like the state of a non-local process.
>>
>
> I want to restart the process on another node; that's why I need the
> jobdata. So, the HNP cannot do something like:
>
>     jdata = orte_get_job_data_object(proc.jobid)
>
> when the proc doesn't belong to it?
> So where can I obtain this information? I'm assuming that I cannot
> ask the dead process's daemon about it (because I assume that the daemon
> is also dead, but that's not true). I was supposing that in the HNP I could
> execute the statement above.
>
> I'm attaching all the code involved in the described situation. I have
> made some changes since my first email, but what I'm trying to do is
> basically the same. At line 23 of the orted_comm.c that I'm sending,
> I'm always getting NULL as a result, so I can't obtain the jdata.
>
> Thanks a lot again for your help.
>
> Best Regards.
>
> Hugo Meyer
>
> <orted_comm.c><errmgr_orted.c>
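
For anyone who finds this thread later, the & mistake discussed above can be reproduced outside of ORTE. What follows is only a minimal sketch under stated assumptions: name_t, pack_names, packed_correctly and packed_pointer_bytes are hypothetical stand-ins invented for illustration, not the real orte_process_name_t or opal_dss.pack. It shows that a void*-based pack routine copies whatever bytes it is pointed at, so passing &proc sends the bytes of the pointer variable instead of the name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a process name; NOT the real ORTE type.
 * The fields are kept small so the struct is no larger than a pointer. */
typedef struct {
    uint16_t jobid;
    uint16_t vpid;
} name_t;

/* Assumption that keeps the buggy call below within defined behavior. */
_Static_assert(sizeof(name_t) <= sizeof(name_t *),
               "demo assumes a name fits inside a pointer");

/* Hypothetical pack routine: copies count names from src into out.
 * Because src is a void pointer, the compiler cannot tell whether the
 * caller passed a name_t* or, by mistake, a name_t**. */
static void pack_names(void *out, const void *src, size_t count)
{
    assert(out != NULL && src != NULL);
    memcpy(out, src, count * sizeof(name_t));
}

/* Correct call: pass the pointer itself. Returns 1 if the name arrives intact. */
int packed_correctly(void)
{
    name_t name = { 7, 3 };
    name_t *proc = &name;
    name_t received;

    pack_names(&received, proc, 1);
    return received.jobid == 7 && received.vpid == 3;
}

/* Buggy call: pass the address of the pointer. Returns 1 if what arrives
 * is the leading bytes of the pointer variable, i.e. garbage as a name. */
int packed_pointer_bytes(void)
{
    name_t name = { 7, 3 };
    name_t *proc = &name;
    name_t received;

    pack_names(&received, &proc, 1);
    return memcmp(&received, &proc, sizeof received) == 0;
}
```

Both functions can be called from a small main to see the difference. Printing the received name on the receiving side, as suggested in the quoted reply, would expose the same garbage in the real code.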