[OMPI devel] One more week of different US / European time

2011-03-21 Thread Jeff Squyres
Remember: we have the OMPI teleconf tomorrow at 11am US Eastern, which is still 
a "different" time in Europe this week.  Next Tuesday, we should be back to the 
"normal" European time.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] CMR floodgates open for 1.5.4

2011-03-21 Thread Jeff Squyres
There's a bunch of pending CMRs that don't have reviews.  I pinged each ticket 
to tell its owner that it needs a reviewer before it can be approved.

Please check your tickets; thanks.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] JDATA access problem.

2011-03-21 Thread Hugo Meyer
Hello @ll.

I'm having a problem when I try to access data->procs->addr[vpid] when the
vpid belongs to a recently killed process. I'm including a piece of my
code below. The problem is that execution always enters the last if
clause, maybe because the information for the dead process is no longer
available, or maybe I'm doing something wrong when accessing it.

Any help will be appreciated.

command = ORTE_DAEMON_REPORT_JOB_INFO_CMD;
buffer = OBJ_NEW(opal_buffer_t);
if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD))) {
    ORTE_ERROR_LOG(rc);
    OBJ_RELEASE(buffer);
    return rc;
}
if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &proc->jobid, 1, ORTE_JOBID))) {
    ORTE_ERROR_LOG(rc);
    OBJ_RELEASE(buffer);
    return rc;
}
/* do the send */
if (0 > (rc = orte_rml.send_buffer(ORTE_PROC_MY_HNP, buffer, ORTE_RML_TAG_DAEMON, 0))) {
    ORTE_ERROR_LOG(rc);
    OBJ_RELEASE(buffer);
    return rc;
}
OBJ_RELEASE(buffer);
buffer = OBJ_NEW(opal_buffer_t);

/* wait for the HNP's reply, then unpack the response flag */
orte_rml.recv_buffer(ORTE_NAME_WILDCARD, buffer, ORTE_RML_TAG_TOOL, 0);
opal_dss.unpack(buffer, &response, &n, OPAL_INT32);

if (response == 0) {
    OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output, "did not write to the HNP\n"));
} else {
    opal_dss.unpack(buffer, &jdata, &n, ORTE_JOB);
}

procs = (orte_proc_t**)jdata->procs->addr;
if (procs == NULL) {
    OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output, "serious: procs == null"));
}

command = ORTE_DAEMON_UPDATE_STATE_CMD;

OBJ_RELEASE(buffer);
buffer = OBJ_NEW(opal_buffer_t);

if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD))) {
    ORTE_ERROR_LOG(rc);
    OBJ_RELEASE(buffer);
    goto CLEANUP;
}

orte_proc_state_t state = ORTE_PROC_STATE_FAULT;
/* Pack the faulty vpid */
if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &proc, 1, ORTE_NAME))) {
    ORTE_ERROR_LOG(rc);
    goto CLEANUP;
}

/* Pack the state */
if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &state, 1, OPAL_UINT16))) {
    ORTE_ERROR_LOG(rc);
    goto CLEANUP;
}

/* this is the check that always fires */
if (NULL == procs[proc->vpid] || NULL == procs[proc->vpid]->node) {
    OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output, "PROBLEM: procs[proc->vpid] == null"));
}
Thanks a lot.

Hugo Meyer


Re: [OMPI devel] JDATA access problem.

2011-03-21 Thread Ralph Castain
You should never access a pointer array's data area that way (i.e., by index 
against the raw data). You really should do:

if (NULL == (proc = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, vpid))) {
    /* error report */
}

to protect against changes. The errmgr generally doesn't remove a process 
object upon failure - it just sets its state to some appropriate value. 
However, depending upon where you are trying to do this, and the history that 
got you down this code path, it is possible that the object was removed.
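
In other words, if the object is still there, you'll typically get it back with 
its state changed rather than get a NULL. A sketch (the exact state value 
depends on how the proc died, and "pobj" is just an illustrative name):

if (NULL != (pobj = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, vpid))) {
    /* the object survived the failure; inspect its recorded state */
    if (ORTE_PROC_STATE_ABORTED == pobj->state ||
        ORTE_PROC_STATE_TERMINATED == pobj->state) {
        /* handle the dead proc here */
    }
}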

Also, remember that if you are in a daemon, then the jdata objects are not 
populated. The daemons work exclusively from the orte_local_jobdata and 
orte_local_children lists, so you would have to find your process there.

We might change that someday, but my first attempt at doing so ran into a 
snarled mess.

On Mar 21, 2011, at 12:40 PM, Hugo Meyer wrote:

> Hello @ll.
> 
> I'm having a problem when I try to access data->procs->addr[vpid] when the
> vpid belongs to a recently killed process. I'm including a piece of my
> code below. The problem is that execution always enters the last if
> clause, maybe because the information for the dead process is no longer
> available, or maybe I'm doing something wrong when accessing it.
> 
> Any help will be appreciated.
> 
> [code snipped]

[OMPI devel] Return status of MPI_Probe()

2011-03-21 Thread Joshua Hursey
If MPI_Probe() encounters an error causing it to exit with the 
'status.MPI_ERROR' set, say:
  ret = MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

Should it return an error? So should it return:
 - ret = status.MPI_ERROR
 - ret = MPI_ERR_IN_STATUS
 - ret = MPI_SUCCESS
Additionally, should it trigger the error handler on the communicator?

In Open MPI, it will always return MPI_SUCCESS (pml_ob1_iprobe.c:74), but it 
feels like this is wrong. I looked to the MPI standard for some insight, but 
could not find where it addresses the return code of MPI_Probe.

Can anyone shed some light on this topic for me?

Thanks,
Josh



Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey




Re: [OMPI devel] JDATA access problem.

2011-03-21 Thread Hugo Meyer
Thanks Ralph for your reply.

2011/3/21 Ralph Castain 

> You should never access a pointer array's data area that way (i.e., by
> index against the raw data). You really should do:
>
> if (NULL == (proc = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs,
> vpid))) {
>   /* error report */
> }
>
>
I've changed this in my code, but I'm getting the same result: NULL when
asking about a dead process.


> The errmgr generally doesn't remove a process object upon failure - it just
> sets its state to some appropriate value. However, depending upon where you
> are trying to do this, and the history that got you down this code path, it
> is possible.
>

I'm writing this code in errmgr_orted.c, and it is executed when a
process fails.


>
> Also, remember that if you are in a daemon, then the jdata objects are not
> populated. The daemons work exclusively from the orte_local_jobdata and
> orte_local_children lists, so you would have to find your process there.
>

That's why I'm asking the HNP about the jdata using
ORTE_DAEMON_REPORT_JOB_INFO_CMD; I assume that it has the information about
the dead process.

Any idea?

Best regards.

Hugo Meyer


Re: [OMPI devel] JDATA access problem.

2011-03-21 Thread Ralph Castain

On Mar 21, 2011, at 2:51 PM, Hugo Meyer wrote:

> Thanks Ralph for your reply.
> 
> 2011/3/21 Ralph Castain 
> You should never access a pointer array's data area that way (i.e., by index 
> against the raw data). You really should do:
> 
> if (NULL == (proc = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, 
> vpid))) {
>   /* error report */
> }
> 
> 
> I've changed this in my code, but I'm getting the same result: NULL when 
> asking about a dead process.
>  
> The errmgr generally doesn't remove a process object upon failure - it just 
> sets its state to some appropriate value. However, depending upon where you 
> are trying to do this, and the history that got you down this code path, it 
> is possible.
> 
> I'm writing this code in errmgr_orted.c, and it is executed when a 
> process fails. 
>  

There's your problem - that module is run in the daemon, where the 
orte_job_data pointer array isn't used. You have to use the orte_local_jobdata 
and orte_local_children lists instead. So once the HNP replies with the jobid, 
you look up the orte_odls_job_t for that job from the orte_local_jobdata list.

If you want to find a particular proc, though, you would look under 
orte_local_children - search the list for a child whose jobid and vpid both 
match.
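
Something like this (untested, written from memory of the odls types, so 
double-check the field names):

orte_odls_job_t *jobdat = NULL;
orte_odls_child_t *child = NULL;
opal_list_item_t *item;

/* find the job object for the jobid the HNP returned */
for (item = opal_list_get_first(&orte_local_jobdata);
     item != opal_list_get_end(&orte_local_jobdata);
     item = opal_list_get_next(item)) {
    orte_odls_job_t *jdat = (orte_odls_job_t*)item;
    if (jdat->jobid == proc->jobid) {
        jobdat = jdat;
        break;
    }
}

/* find the local child whose jobid and vpid both match */
for (item = opal_list_get_first(&orte_local_children);
     item != opal_list_get_end(&orte_local_children);
     item = opal_list_get_next(item)) {
    orte_odls_child_t *c = (orte_odls_child_t*)item;
    if (c->name->jobid == proc->jobid && c->name->vpid == proc->vpid) {
        child = c;
        break;
    }
}

if (NULL == child) {
    /* the proc is not a local child of this daemon */
}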

Note that you will not find that child process -unless- the child is under that 
daemon.

I'm not sure what you are trying to accomplish, so I can't give further advice. 
Note that daemons have limited knowledge of application processes that are not 
their own immediate children. What little they know regarding processes other 
than their own is stored in the nidmap/pidmap arrays - limited to location, 
local rank, and node rank. They have no storage currently allocated for things 
like the state of a non-local process.


> 
> Also, remember that if you are in a daemon, then the jdata objects are not 
> populated. The daemons work exclusively from the orte_local_jobdata and 
> orte_local_children lists, so you would have to find your process there.
> 
> That's why I'm asking the HNP about the jdata using 
> ORTE_DAEMON_REPORT_JOB_INFO_CMD; I assume that it has the information about 
> the dead process.

Only after the daemon reports it.

> 
> Any idea?
> 
> Best regards.
> 
> Hugo Meyer



Re: [OMPI devel] Return status of MPI_Probe()

2011-03-21 Thread George Bosilca
Josh,

If we don't take resilience into account, I would not expect MPI_Probe to have 
many opportunities to return errors. However, in order to keep the 
implementation consistent with the other MPI functions, I would abide by the 
following.

MPI_ERR_IN_STATUS is only for calls taking multiple requests as input, so I 
don't think it applies to MPI_Probe. I would expect the return value to be 
equal to status.MPI_ERROR (similar to the other functions that operate on a 
single request, such as MPI_Test).

It had better trigger the error handler attached to the communicator, as 
explicitly required by the MPI standard (Section 8.3):
> A user can associate error handlers to three types of objects: communicators, 
> windows, and files. The specified error handling routine will be used for any 
> MPI exception that occurs during a call to MPI for the respective object.
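
As a quick illustration of the behavior I would expect (a minimal sketch; note 
that the probe blocks until a matching message arrives, so a real test needs a 
matching send):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Status status;
    int rc;

    MPI_Init(&argc, &argv);

    /* return errors to the caller instead of aborting */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* blocks until a message is available on MPI_COMM_WORLD */
    rc = MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    if (MPI_SUCCESS != rc) {
        /* single-request call: rc should equal status.MPI_ERROR,
           not MPI_ERR_IN_STATUS */
        fprintf(stderr, "probe failed: rc=%d, status.MPI_ERROR=%d\n",
                rc, status.MPI_ERROR);
    }

    MPI_Finalize();
    return 0;
}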

  george.

On Mar 21, 2011, at 16:50 , Joshua Hursey wrote:

> If MPI_Probe() encounters an error causing it to exit with the 
> 'status.MPI_ERROR' set, say:
>  ret = MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
> 
> Should it return an error? So should it return:
> - ret = status.MPI_ERROR
> - ret = MPI_ERR_IN_STATUS
> - ret = MPI_SUCCESS
> Additionally, should it trigger the error handler on the communicator?
> 
> In Open MPI, it will always return MPI_SUCCESS (pml_ob1_iprobe.c:74), but it 
> feels like this is wrong. I looked to the MPI standard for some insight, but 
> could not find where it addresses the return code of MPI_Probe.
> 
> Can anyone shed some light on this topic for me?
> 
> Thanks,
> Josh
> 
> 
> 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey

"To preserve the freedom of the human mind then and freedom of the press, every 
spirit should be ready to devote itself to martyrdom; for as long as we may 
think as we will, and speak as we think, the condition of man will proceed in 
improvement."
  -- Thomas Jefferson, 1799