Yes.

That was the problem, Ralph. Again, thanks a lot for your help; it was a
silly mistake of mine :).

Best regards.

Hugo Meyer

2011/3/22 Ralph Castain <r...@open-mpi.org>

> The problem is here:
>
>     /* Pack the faulty vpid */
>     if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &proc, 1, ORTE_NAME))) {
>         ORTE_ERROR_LOG(rc);
>         goto CLEANUP;
>     }
>
> The variable proc is apparently a pointer to orte_process_name_t. You
> therefore should have packed it like this:
>
>     /* Pack the faulty vpid */
>     if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, proc, 1, ORTE_NAME))) {
>         ORTE_ERROR_LOG(rc);
>         goto CLEANUP;
>     }
>
> i.e., without the & in front. Accordingly, the problem was that the HNP
> was getting garbage for the process name, and thus finding NULL at the
> specified locations.
>
> Just for testing, you might want to print out the received process name to
> ensure your communication is correct :-)
>
>
> On Mar 22, 2011, at 5:58 AM, Hugo Meyer wrote:
>
> Thanks again Ralph for your reply.
>
>
>> There's your problem - that module is run in the daemon, where the
>> orte_job_data pointer array isn't used. You have to use the
>> orte_local_jobdata and orte_local_children lists instead. So once the HNP
>> replies with the jobid, you look up the orte_odls_job_t for that job from
>> the orte_local_jobdata list.
>>
>
> I'm now sending you all of the code involved. At the beginning I do
> something along the lines of what you describe. Then, having the child
> info, I ask the HNP for the jobdata of the child, but I still get no data
> about the child (that is, the dead process). I need this info so that I
> can ask another orted to restart the failed process.
>
>
>> I'm not sure what you are trying to accomplish, so I can't give further
>> advice. Note that daemons have limited knowledge of application processes
>> that are not their own immediate children. What little they know regarding
>> processes other than their own is stored in the nidmap/pidmap arrays -
>> limited to location, local rank, and node rank. They have no storage
>> currently allocated for things like the state of a non-local process.
>>
>
> I want to restart the process on another node; that's why I need the
> jobdata. So, the HNP cannot do something like:
> jdata = orte_get_job_data_object(proc.jobid)
>
> when the proc doesn't belong to it?
> So where can I obtain this information? I was assuming that I cannot ask
> the dead process's daemon about it (because I assumed that the daemon is
> also dead, but that's not true). I was supposing that in the HNP I could
> execute the statement above.
>
> I'm attaching all the code involved in the situation described. I have
> made some changes since my first email, but what I'm trying to do is
> basically the same. On line 23 of the orted_comm.c that I'm sending,
> I always get NULL as a result, so I can't obtain the jdata.
>
> Thanks a lot again for your help.
>
> Best Regards.
>
> Hugo Meyer
>
> <orted_comm.c><errmgr_orted.c>
>
>
>
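
For readers following the thread, the pointer mistake Ralph describes (passing &proc when proc is already a pointer) can be reproduced in isolation. The sketch below does not use the real Open MPI API: demo_name_t, demo_pack and demo_roundtrip_ok are hypothetical stand-ins, and the static assertion records the assumption that a pointer is at least as large as the demo struct (true on typical 64-bit platforms). It only illustrates why the receiver ends up with garbage in the buggy case.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for orte_process_name_t (not the real definition). */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} demo_name_t;

/* The buggy call reads sizeof(demo_name_t) bytes starting at a pointer
 * variable; this assumption keeps that read inside the pointer's storage. */
_Static_assert(sizeof(demo_name_t) <= sizeof(demo_name_t *),
               "demo assumes a pointer is at least as large as demo_name_t");

/* Hypothetical stand-in for a pack routine: copies one name from src. */
static void demo_pack(demo_name_t *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof(*dst));
}

/* Returns 1 if the "received" name matches the original, 0 otherwise. */
static int demo_roundtrip_ok(int pass_address_of_pointer)
{
    demo_name_t name = { 42u, 7u };
    demo_name_t *proc = &name;          /* proc is already a pointer */
    demo_name_t received;

    if (pass_address_of_pointer) {
        demo_pack(&received, &proc);    /* bug: packs the pointer's own bytes */
    } else {
        demo_pack(&received, proc);     /* fix: packs the name it points to */
    }

    /* Printing the received name is the sanity check suggested in the thread. */
    printf("received [%u,%u]\n",
           (unsigned)received.jobid, (unsigned)received.vpid);

    return received.jobid == name.jobid && received.vpid == name.vpid;
}
```

Calling demo_roundtrip_ok(0) from a small main exercises the corrected call and demo_roundtrip_ok(1) the buggy one; in the buggy case the printed values are fragments of a stack address rather than the job and vpid that were meant to be sent.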
