[OMPI devel] One more week of different US / European time
Remember: we have the OMPI teleconf tomorrow at 11am US Eastern, which is still a "different" time in Europe this week. Next Tuesday, we should be back to the "normal" European time.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] CMR floodgates open for 1.5.4
There's a bunch of pending CMRs that don't have reviews. I pinged each ticket to tell its owner that it needs a reviewer before it can be approved. Please check your tickets; thanks.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] JDATA access problem.
Hello @ll.

I'm having a problem when I try to access jdata->procs->addr[vpid] when the vpid belongs to a recently killed process. I'm sending a piece of my code here. The problem is that execution always enters the last if clause, maybe because the information about the dead process is no longer available, or maybe I'm doing something wrong when accessing it.

Any help will be appreciated.

    command = ORTE_DAEMON_REPORT_JOB_INFO_CMD;
    buffer = OBJ_NEW(opal_buffer_t);
    if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(buffer);
        return rc;
    }
    if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &proc->jobid, 1, ORTE_JOBID))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(buffer);
        return rc;
    }
    /* do the send */
    if (0 > (rc = orte_rml.send_buffer(ORTE_PROC_MY_HNP, buffer, ORTE_RML_TAG_DAEMON, 0))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(buffer);
        return rc;
    }
    OBJ_RELEASE(buffer);
    buffer = OBJ_NEW(opal_buffer_t);

    orte_rml.recv_buffer(ORTE_NAME_WILDCARD, buffer, ORTE_RML_TAG_TOOL, 0);

    opal_dss.unpack(buffer, &response, &n, OPAL_INT32);

    if (response == 0) {
        OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output, "NO ESCRIBÍ AL HNP\n "));
    } else {
        opal_dss.unpack(buffer, &jdata, &n, ORTE_JOB);
    }

    procs = (orte_proc_t**)jdata->procs->addr;
    if (procs == NULL) {
        OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output, "grave: procs==null"));
    }

    command = ORTE_DAEMON_UPDATE_STATE_CMD;

    OBJ_RELEASE(buffer);
    buffer = OBJ_NEW(opal_buffer_t);

    if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(buffer);
        goto CLEANUP;
    }

    orte_proc_state_t state = ORTE_PROC_STATE_FAULT;
    /* Pack the faulty vpid */
    if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &proc, 1, ORTE_NAME))) {
        ORTE_ERROR_LOG(rc);
        goto CLEANUP;
    }

    /* Pack the state */
    if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &state, 1, OPAL_UINT16))) {
        ORTE_ERROR_LOG(rc);
        goto CLEANUP;
    }

    if (NULL == procs[proc->vpid] || NULL == procs[proc->vpid]->node) {
        OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output, "PROBLEM: procs[proc.vpid]==null"));
    }

Thanks a lot.

Hugo Meyer
Re: [OMPI devel] JDATA access problem.
You should never access a pointer array's data area that way (i.e., by index against the raw data). You really should do:

    if (NULL == (proc = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, vpid))) {
        /* error report */
    }

to protect against changes.

The errmgr generally doesn't remove a process object upon failure - it just sets its state to some appropriate value. However, depending upon where you are trying to do this, and the history that got you down this code path, it is possible.

Also, remember that if you are in a daemon, then the jdata objects are not populated. The daemons work exclusively from the orte_local_jobdata and orte_local_children lists, so you would have to find your process there. We might change that someday, but my first attempt at doing so ran into a snarled mess.

On Mar 21, 2011, at 12:40 PM, Hugo Meyer wrote:

> Hello @ll.
>
> I'm having a problem when I try to access jdata->procs->addr[vpid] when the
> vpid belongs to a recently killed process. I'm sending a piece of my code
> here. The problem is that execution always enters the last if clause, maybe
> because the information about the dead process is no longer available, or
> maybe I'm doing something wrong when accessing it.
>
> Any help will be appreciated.
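For readers following the thread, below is a minimal sketch of the accessor-based lookup on the HNP side, where jdata is populated. The include paths are from memory of the 1.5-era tree, the helper name is made up for illustration, and the comparison against ORTE_PROC_STATE_FAULT (the value packed in the code quoted above) only illustrates checking the retained object's state rather than its presence; check the names against your actual branch.

    #include "orte/constants.h"
    #include "opal/class/opal_pointer_array.h"
    #include "orte/runtime/orte_globals.h"

    /* Sketch: fetch the process object through the accessor instead of
     * indexing the raw ->addr storage, then inspect its recorded state. */
    static int check_proc_state(orte_job_t *jdata, orte_vpid_t vpid)
    {
        orte_proc_t *proc;

        /* the accessor returns NULL for an empty or out-of-range slot */
        if (NULL == (proc = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, vpid))) {
            return ORTE_ERR_NOT_FOUND;   /* the object really is gone */
        }

        /* object is still there; its state field records what happened */
        if (ORTE_PROC_STATE_FAULT == proc->state) {
            /* the process died, but its bookkeeping survives */
        }
        return ORTE_SUCCESS;
    }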
[OMPI devel] Return status of MPI_Probe()
If MPI_Probe() encounters an error causing it to exit with 'status.MPI_ERROR' set, say:

    ret = MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

should it return an error? That is, should it return:
 - ret = status.MPI_ERROR
 - ret = MPI_ERROR_IN_STATUS
 - ret = MPI_SUCCESS

Additionally, should it trigger the error handler on the communicator?

In Open MPI, it will always return MPI_SUCCESS (pml_ob1_iprobe.c:74), but it feels like this is wrong. I looked at the MPI standard for some insight, but could not find where it addresses the return code of MPI_Probe.

Can anyone shed some light on this topic for me?

Thanks,
Josh

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
Re: [OMPI devel] JDATA access problem.
Thanks Ralph for your reply.

2011/3/21 Ralph Castain

> You should never access a pointer array's data area that way (i.e., by
> index against the raw data). You really should do:
>
> if (NULL == (proc = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, vpid))) {
>     /* error report */
> }

About this: I've changed it in my code, but I'm getting the same result - NULL when asking about a dead process.

> The errmgr generally doesn't remove a process object upon failure - it just
> sets its state to some appropriate value. However, depending upon where you
> are trying to do this, and the history that got you down this code path, it
> is possible.

I'm writing this code in errmgr_orted.c, and it is executed when a process fails.

> Also, remember that if you are in a daemon, then the jdata objects are not
> populated. The daemons work exclusively from the orte_local_jobdata and
> orte_local_children lists, so you would have to find your process there.

That's why I'm asking the HNP about the jdata using ORTE_DAEMON_REPORT_JOB_INFO_CMD; I assume it has the information about the dead process.

Any idea?

Best regards.

Hugo Meyer
Re: [OMPI devel] JDATA access problem.
On Mar 21, 2011, at 2:51 PM, Hugo Meyer wrote:

> Thanks Ralph for your reply.
>
> About this: I've changed it in my code, but I'm getting the same result -
> NULL when asking about a dead process.
>
> I'm writing this code in errmgr_orted.c, and it is executed when a process
> fails.

There's your problem - that module is run in the daemon, where the orte_job_data pointer array isn't used. You have to use the orte_local_jobdata and orte_local_children lists instead.

So once the HNP replies with the jobid, you look up the orte_odls_job_t for that job from the orte_local_jobdata list. If you want to find a particular proc, though, you would look under orte_local_children - search the list for a child whose jobid and vpid both match. Note that you will not find that child process -unless- the child is under that daemon.

I'm not sure what you are trying to accomplish, so I can't give further advice. Note that daemons have limited knowledge of application processes that are not their own immediate children. What little they know regarding processes other than their own is stored in the nidmap/pidmap arrays - limited to location, local rank, and node rank. They have no storage currently allocated for things like the state of a non-local process.

> That's why I'm asking the HNP about the jdata using
> ORTE_DAEMON_REPORT_JOB_INFO_CMD; I assume it has the information about
> the dead process.

Only after the daemon reports it.

> Any idea?
>
> Best regards.
>
> Hugo Meyer
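To make the daemon-side lookup concrete, here is a minimal sketch assuming the 1.5-era ODLS types (orte_odls_child_t with a name pointer and a state field, iterated with the standard opal_list macros); the helper name and include paths are illustrative and should be checked against the actual tree.

    #include "opal/class/opal_list.h"
    #include "orte/mca/odls/odls_types.h"
    #include "orte/runtime/orte_globals.h"

    /* Sketch: find the local bookkeeping for a given process from inside a
     * daemon, using the lists the orted actually maintains.  Returns NULL
     * if the process is not a local child of this daemon. */
    static orte_odls_child_t* find_local_child(orte_process_name_t *proc)
    {
        opal_list_item_t *item;
        orte_odls_child_t *child;

        for (item  = opal_list_get_first(&orte_local_children);
             item != opal_list_get_end(&orte_local_children);
             item  = opal_list_get_next(item)) {
            child = (orte_odls_child_t*)item;
            /* both jobid and vpid must match */
            if (child->name->jobid == proc->jobid &&
                child->name->vpid  == proc->vpid) {
                return child;   /* child->state holds its last known state */
            }
        }
        return NULL;  /* not one of this daemon's children */
    }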
Re: [OMPI devel] Return status of MPI_Probe()
Josh,

If we don't take resilience into account, I would not expect MPI_Probe to have that many opportunities to return errors. However, in order to keep the implementation consistent with the other MPI functions, I would abide by the following.

MPI_ERROR_IN_STATUS is only for calls taking multiple requests as input, so I don't think it should be applied to MPI_Probe. I would expect the return value to be equal to status.MPI_ERROR (similar to the other functions that work on a single request, such as MPI_Test). It had better trigger the error handler attached to the communicator, as explicitly requested by the MPI standard (section 8.3):

> A user can associate error handlers to three types of objects: communicators,
> windows, and files. The specified error handling routine will be used for any
> MPI exception that occurs during a call to MPI for the respective object.

  george.

On Mar 21, 2011, at 16:50, Joshua Hursey wrote:

> If MPI_Probe() encounters an error causing it to exit with 'status.MPI_ERROR' set, say:
>
>     ret = MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
>
> should it return an error? That is, should it return:
>  - ret = status.MPI_ERROR
>  - ret = MPI_ERROR_IN_STATUS
>  - ret = MPI_SUCCESS
>
> Additionally, should it trigger the error handler on the communicator?
>
> In Open MPI, it will always return MPI_SUCCESS (pml_ob1_iprobe.c:74), but it
> feels like this is wrong. I looked at the MPI standard for some insight, but
> could not find where it addresses the return code of MPI_Probe.
>
> Can anyone shed some light on this topic for me?
>
> Thanks,
> Josh

"To preserve the freedom of the human mind then and freedom of the press, every spirit should be ready to devote itself to martyrdom; for as long as we may think as we will, and speak as we think, the condition of man will proceed in improvement."
  -- Thomas Jefferson, 1799
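To illustrate, from the application side, the semantics being discussed, here is a small self-contained example; the self-send, tag, and variable names are arbitrary choices for the sketch. With the error handler switched to MPI_ERRORS_RETURN a caller would see a failure both in the return code and in status.MPI_ERROR, while under the default MPI_ERRORS_ARE_FATAL the handler attached to the communicator would fire instead.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int ret, sendval = 42, recvval = 0, rank;
        MPI_Request req;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Ask for errors to be returned instead of aborting, so the return
         * code and status.MPI_ERROR can actually be inspected. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        /* Post a self-send so the probe below has something to match. */
        MPI_Isend(&sendval, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &req);

        /* Under the convention discussed above, a failed probe would return
         * the same error code it records in status.MPI_ERROR. */
        ret = MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (MPI_SUCCESS != ret) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(ret, msg, &len);
            fprintf(stderr, "MPI_Probe failed: %s (status.MPI_ERROR = %d)\n",
                    msg, status.MPI_ERROR);
        } else {
            /* probe succeeded: receive the pending message it described */
            MPI_Recv(&recvval, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Finalize();
        return 0;
    }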