[OMPI devel] [PATCH] iof/hnp: daemon part of the sink structure is not initialized when forwarding stdin to all ranks
Hi,

When forwarding stdin to all ranks in the job (mpirun --stdin all), the following error message is output:

--
[berlin73:02223] [[56600,0],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 316
[berlin73:02223] [[56600,0],0] unable to find address for [[INVALID],INVALID]
[berlin73:02223] [[56600,0],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../orte/mca/iof/hnp/iof_hnp_send.c at line 116
--

This is due to the daemon part of the sink structure not being initialized in hnp_push() when the destination vpid is ORTE_VPID_WILDCARD.
Then, when orte_iof_hnp_read_local_handler() is called, it calls orte_iof_hnp_send_data_to_endpoint() with a sink->daemon that is not set.
orte_iof_hnp_send_data_to_endpoint() in turn doesn't call orte_grpcomm.xcast() but orte_rml.send_buffer_nb() with an invalid host.

The attached patch applied on the trunk solves the issue. This patch is trivial, but since it's the first time I have to look at iof code, I'm not sure of all its impacts...

Regards,
Nadia

daemon part of the sink structure is not initialized when forwarding stdin to all ranks

diff -r 490e6afa37fe orte/mca/iof/hnp/iof_hnp.c
--- a/orte/mca/iof/hnp/iof_hnp.c Tue Mar 06 11:56:15 2012 +0100
+++ b/orte/mca/iof/hnp/iof_hnp.c Tue Mar 06 12:43:44 2012 +0100
@@ -263,6 +263,8 @@ static int hnp_push(const orte_process_n
         ORTE_IOF_SINK_DEFINE(&sink, dst_name, -1, ORTE_IOF_STDIN,
                              stdin_write_handler,
                              &mca_iof_hnp_component.sinks);
+        sink->daemon.jobid = ORTE_PROC_MY_NAME->jobid;
+        sink->daemon.vpid = ORTE_VPID_WILDCARD;
     } else {
         /* no - lookup the proc's daemon and set that into sink */
         if (NULL == (jdata = orte_get_job_data_object(dst_name->jobid))) {
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
Mike --

I would make this a bit better of an error. I.e., use orte_show_help(), so you can explain the issue more, and also remove all duplicates (i.e., if it fails to register multiple times).

On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:

> Author: miked
> Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> New Revision: 26106
> URL: https://svn.open-mpi.org/trac/ompi/changeset/26106
>
> Log:
> print error which is ignored on upper layer
>
> Text files modified:
>    trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++
>    1 files changed, 2 insertions(+), 0 deletions(-)
>
> Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
> ==
> --- trunk/ompi/mca/btl/openib/btl_openib_component.c (original)
> +++ trunk/ompi/mca/btl/openib/btl_openib_component.c 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> @@ -569,6 +569,8 @@
>      openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);
>
>      if (NULL == openib_reg->mr) {
> +        BTL_ERROR(("%s: error pinning openib memory errno says %s",
> +                   __func__, strerror(errno)));
>          return OMPI_ERR_OUT_OF_RESOURCE;
>      }

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
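The duplicate-removal idea can be sketched independently of Open MPI. The helper below is a hypothetical stand-alone illustration, not orte_show_help() itself: report_once(), times_reported(), and the fixed-size topic table are all assumptions made for the example. It prints the first occurrence of a topic and only counts later ones.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 16

/* One entry per distinct topic seen so far. The topic pointer is owned
 * by the caller and must outlive the table (e.g. a string literal). */
typedef struct {
    const char *topic;
    unsigned    count;
} topic_entry_t;

static topic_entry_t seen[MAX_TOPICS];
static size_t n_seen = 0;

/* Print the message the first time a topic is reported and only count
 * later occurrences. Returns 1 if printed, 0 if suppressed. If the table
 * is full, an untracked topic is printed every time. */
static int report_once(const char *topic, const char *detail)
{
    assert(topic != NULL && detail != NULL);
    for (size_t i = 0; i < n_seen; ++i) {
        if (0 == strcmp(seen[i].topic, topic)) {
            seen[i].count++;
            return 0;
        }
    }
    if (n_seen < MAX_TOPICS) {
        seen[n_seen].topic = topic;
        seen[n_seen].count = 1;
        n_seen++;
    }
    fprintf(stderr, "[%s] %s\n", topic, detail);
    return 1;
}

/* Number of times a topic was reported, printed or suppressed. */
static unsigned times_reported(const char *topic)
{
    assert(topic != NULL);
    for (size_t i = 0; i < n_seen; ++i) {
        if (0 == strcmp(seen[i].topic, topic)) {
            return seen[i].count;
        }
    }
    return 0;
}
```

A caller in the registration failure path would invoke report_once() with a fixed topic string, so repeated failures produce a single explanatory message instead of one line per failed registration.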
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
I didn't check the code thoroughly, but OMPI_ERR_OUT_OF_RESOURCE is not an error. If the registration returns out of resources, the BTL will return OUT_OF_RESOURCE (for example via mca_btl_openib_prepare_src). At the upper level, the PML (in the mca_pml_ob1_send_request_start function) intercepts it and inserts the request into a pending list. Later on, this pending list will be examined and the request for resources re-issued.

Why do we need to trigger a BTL_ERROR for OUT_OF_RESOURCE?

george.

On Mar 6, 2012, at 09:48, Jeffrey Squyres wrote:

> Mike --
>
> I would make this a bit better of an error. I.e., use orte_show_help(), so
> you can explain the issue more, and also remove all duplicates (i.e., if it
> fails to register multiple times).
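The queue-and-retry behaviour described in this message can be sketched as follows. This is a minimal illustration under stated assumptions, not the ob1 PML code: the request type, the error values, lower_layer_start(), start_request(), and progress_pending() are all hypothetical names, and a simple counter of free slots stands in for registration resources.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative return codes, not the Open MPI definitions. */
#define OK                  0
#define ERR_OUT_OF_RESOURCE (-2)
#define MAX_PENDING         8

typedef struct { int id; } request_t;

/* Stand-in for the lower layer: each started request consumes one slot. */
static int free_slots = 0;

static int lower_layer_start(request_t *req)
{
    (void)req;
    if (free_slots <= 0) {
        return ERR_OUT_OF_RESOURCE;
    }
    free_slots--;
    return OK;
}

/* FIFO of requests waiting for resources. */
static request_t *pending[MAX_PENDING];
static size_t n_pending = 0;

/* Upper layer: an out-of-resource result is not treated as a failure.
 * The request is parked on the pending list and OK is returned. Any
 * other result (or a full list) is passed back to the caller. */
static int start_request(request_t *req)
{
    assert(req != NULL);
    int rc = lower_layer_start(req);
    if (rc == ERR_OUT_OF_RESOURCE && n_pending < MAX_PENDING) {
        pending[n_pending++] = req;
        return OK;
    }
    return rc;
}

/* Re-issue parked requests in order, stopping at the first one that is
 * still short of resources. Returns how many were started. */
static size_t progress_pending(void)
{
    size_t started = 0;
    while (started < n_pending && lower_layer_start(pending[started]) == OK) {
        started++;
    }
    for (size_t i = started; i < n_pending; ++i) {
        pending[i - started] = pending[i];
    }
    n_pending -= started;
    return started;
}
```

In this model the out-of-resource return is a normal flow-control signal, which is the reason given above for questioning an unconditional error message at the point where it is returned.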
Re: [OMPI devel] [PATCH] iof/hnp: daemon part of the sink structure is not initialized when forwarding stdin to all ranks
You are quite right - good catch! Fixed in trunk with r26107 - will file patch for 1.5.

Ralph

On Tue, Mar 6, 2012 at 4:18 AM, nadia.derbey wrote:

> Hi,
>
> When forwarding stdin to all ranks in the job (mpirun --stdin all), the
> following error message is output:
>
> --
> [berlin73:02223] [[56600,0],0] ORTE_ERROR_LOG: A message is attempting
> to be sent to a process whose contact information is unknown in
> file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 316
> [berlin73:02223] [[56600,0],0] unable to find address for
> [[INVALID],INVALID]
> [berlin73:02223] [[56600,0],0] ORTE_ERROR_LOG: A message is attempting
> to be sent to a process whose contact information is unknown in
> file ../../../../../orte/mca/iof/hnp/iof_hnp_send.c at line 116
> --
>
> This is due to the daemon part of the sink structure not being
> initialized in hnp_push() when the destination vpid is
> ORTE_VPID_WILDCARD.
> Then, when orte_iof_hnp_read_local_handler() is called, it calls
> orte_iof_hnp_send_data_to_endpoint() with a sink->daemon that is not
> set.
> orte_iof_hnp_send_data_to_endpoint() in turn doesn't call
> orte_grpcomm.xcast() but orte_rml.send_buffer_nb() with an invalid host.
>
> The attached patch applied on the trunk solves the issue. This patch is
> trivial, but since it's the first time I have to look at iof code, I'm
> not sure of all its impacts...
>
> Regards,
> Nadia