[OMPI devel] [PATCH] iof/hnp: daemon part of the sink structure is not initialized when forwarding stdin to all ranks

2012-03-06 Thread nadia.derbey
Hi,

When forwarding stdin to all ranks in the job (mpirun --stdin all), the
following error message is output:

--
[berlin73:02223] [[56600,0],0] ORTE_ERROR_LOG: A message is attempting
to be sent to a process whose contact information is unknown in
file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 316
[berlin73:02223] [[56600,0],0] unable to find address for
[[INVALID],INVALID]
[berlin73:02223] [[56600,0],0] ORTE_ERROR_LOG: A message is attempting
to be sent to a process whose contact information is unknown in
file ../../../../../orte/mca/iof/hnp/iof_hnp_send.c at line 116
--

This happens because the daemon part of the sink structure is not
initialized in hnp_push() when the destination vpid is
ORTE_VPID_WILDCARD.
Then, when orte_iof_hnp_read_local_handler() is called, it calls
orte_iof_hnp_send_data_to_endpoint() with a sink->daemon that is not
set.
As a result, orte_iof_hnp_send_data_to_endpoint() does not call
orte_grpcomm.xcast() but instead calls orte_rml.send_buffer_nb() with
an invalid host.
The attached patch, applied on the trunk, solves the issue. The patch is
trivial, but since this is the first time I have looked at the iof code,
I'm not sure of all its side effects...

Regards,
Nadia
daemon part of the sink structure is not initialized when forwarding stdin to all ranks

diff -r 490e6afa37fe orte/mca/iof/hnp/iof_hnp.c
--- a/orte/mca/iof/hnp/iof_hnp.c	Tue Mar 06 11:56:15 2012 +0100
+++ b/orte/mca/iof/hnp/iof_hnp.c	Tue Mar 06 12:43:44 2012 +0100
@@ -263,6 +263,8 @@ static int hnp_push(const orte_process_n
             ORTE_IOF_SINK_DEFINE(&sink, dst_name, -1, ORTE_IOF_STDIN,
                                  stdin_write_handler,
                                  &mca_iof_hnp_component.sinks);
+            sink->daemon.jobid = ORTE_PROC_MY_NAME->jobid;
+            sink->daemon.vpid = ORTE_VPID_WILDCARD;
         } else {
             /* no - lookup the proc's daemon and set that into sink */
             if (NULL == (jdata = orte_get_job_data_object(dst_name->jobid))) {


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-06 Thread Jeffrey Squyres
Mike --

I would make this a more informative error: use orte_show_help(), so you 
can explain the issue in more detail and also remove all duplicates (e.g., if 
it fails to register multiple times).
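
To illustrate the duplicate-removal idea only, here is a minimal sketch of a
"report once per topic" helper. It is not how orte_show_help() is
implemented; the sketch_* names, the fixed-size table and the topic strings
are all assumptions made for this example.

```c
#include <string.h>

/* Hypothetical helper for illustration; NOT the orte_show_help() code. */
#define SKETCH_MAX_TOPICS 16

typedef struct {
    const char *topic;   /* caller must keep this string alive (e.g. a literal) */
    unsigned    count;   /* how many times this topic has been reported */
} sketch_seen_t;

static sketch_seen_t sketch_seen[SKETCH_MAX_TOPICS];
static int sketch_num_seen = 0;

/* Returns 1 if the message for this topic should be printed (first
 * occurrence, or the table is full), 0 if it is a known duplicate. */
static int sketch_report_once(const char *topic)
{
    for (int i = 0; i < sketch_num_seen; ++i) {
        if (0 == strcmp(sketch_seen[i].topic, topic)) {
            sketch_seen[i].count++;
            return 0;
        }
    }
    if (sketch_num_seen < SKETCH_MAX_TOPICS) {
        sketch_seen[sketch_num_seen].topic = topic;
        sketch_seen[sketch_num_seen].count = 1;
        sketch_num_seen++;
    }
    return 1;
}

/* Returns how many times a topic was reported, 0 if never seen. */
static unsigned sketch_repeat_count(const char *topic)
{
    for (int i = 0; i < sketch_num_seen; ++i) {
        if (0 == strcmp(sketch_seen[i].topic, topic)) {
            return sketch_seen[i].count;
        }
    }
    return 0;
}
```

A caller would guard its error output with sketch_report_once("some-topic"),
so a registration failure that repeats many times prints a single message
while the count stays available for a summary.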


On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:

> Author: miked
> Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> New Revision: 26106
> URL: https://svn.open-mpi.org/trac/ompi/changeset/26106
> 
> Log:
> print error which is ignored on upper layer
> Text files modified: 
>   trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++ 
>  
>   1 files changed, 2 insertions(+), 0 deletions(-)
> 
> Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
> ==
> --- trunk/ompi/mca/btl/openib/btl_openib_component.c  (original)
> +++ trunk/ompi/mca/btl/openib/btl_openib_component.c  2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> @@ -569,6 +569,8 @@
>      openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);
> 
>      if (NULL == openib_reg->mr) {
> +        BTL_ERROR(("%s: error pinning openib memory errno says %s",
> +                   __func__, strerror(errno)));
>          return OMPI_ERR_OUT_OF_RESOURCE;
>      }
> 


-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-06 Thread George Bosilca
I didn't check the code thoroughly, but OMPI_ERR_OUT_OF_RESOURCE is not an 
error. If the registration runs out of resources, the BTL will return 
OMPI_ERR_OUT_OF_RESOURCE (for example, via mca_btl_openib_prepare_src). At 
the upper level, the PML (in the mca_pml_ob1_send_request_start function) 
intercepts it and inserts the request into a pending list. Later on, this 
pending list will be examined and the request for resources re-issued.
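
The defer-and-retry pattern described above can be sketched as follows. This
is a simplified illustration, not the ob1 PML or openib BTL code: the
sketch_* names, the integer request ids, the error value and the fixed-size
pending array are all assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real OMPI code. */
#define SKETCH_SUCCESS              0
#define SKETCH_ERR_OUT_OF_RESOURCE  (-2)
#define SKETCH_MAX_PENDING          8

static int    sketch_free_slots = 0;               /* stand-in for registerable memory */
static int    sketch_pending[SKETCH_MAX_PENDING];  /* deferred request ids */
static size_t sketch_num_pending = 0;

/* Lower layer: fails softly when no resource is available. */
static int sketch_lower_send(int req)
{
    (void)req;
    if (0 == sketch_free_slots) {
        return SKETCH_ERR_OUT_OF_RESOURCE;
    }
    sketch_free_slots--;
    return SKETCH_SUCCESS;
}

/* Upper layer: out-of-resource is intercepted and the request is
 * queued for later instead of being reported as a failure. */
static int sketch_start(int req)
{
    int rc = sketch_lower_send(req);
    if (SKETCH_ERR_OUT_OF_RESOURCE == rc &&
        sketch_num_pending < SKETCH_MAX_PENDING) {
        sketch_pending[sketch_num_pending++] = req;
        return SKETCH_SUCCESS;   /* deferred, not failed */
    }
    return rc;
}

/* Re-issue pending requests in order until one cannot be served.
 * Returns the number of requests that were successfully re-issued. */
static size_t sketch_progress(void)
{
    size_t done = 0;
    while (done < sketch_num_pending &&
           SKETCH_SUCCESS == sketch_lower_send(sketch_pending[done])) {
        done++;
    }
    /* Shift the requests that are still waiting to the front. */
    for (size_t i = done; i < sketch_num_pending; ++i) {
        sketch_pending[i - done] = sketch_pending[i];
    }
    sketch_num_pending -= done;
    return done;
}
```

In this model the caller of sketch_start() never sees the out-of-resource
code as long as the pending list has room, which is why logging it as an
error at the lower layer would flag a condition the upper layer already
handles.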

Why do we need to trigger a BTL_ERROR for OMPI_ERR_OUT_OF_RESOURCE?

  george.





Re: [OMPI devel] [PATCH] iof/hnp: daemon part of the sink structure is not initialized when forwarding stdin to all ranks

2012-03-06 Thread Ralph Castain
You are quite right - good catch! Fixed in trunk with r26107 - will file
patch for 1.5.
Ralph

