Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-20 Thread George Bosilca
If we don't want to lose the usefulness of the error messages (and don't care 
that much about the memory requirements), we can initialize this value with the 
string of the rank of the process in MPI_COMM_WORLD (instead of NULL). We will 
at least get an idea where to start looking in case of troubles …
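
A minimal sketch of that idea, assuming a simplified stand-in for the real proc_t. The struct `demo_proc`, its fields, and `demo_proc_set_placeholder` are hypothetical names used for illustration only; they are not Open MPI symbols.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the hostname field carried in a proc_t. */
struct demo_proc {
    int   world_rank;   /* rank in MPI_COMM_WORLD */
    char *hostname;     /* NULL until a real hostname is received */
};

/* Fill hostname with a rank-based placeholder instead of leaving it NULL.
 * Returns 0 on success, -1 on formatting or allocation failure. */
int demo_proc_set_placeholder(struct demo_proc *proc)
{
    /* snprintf(NULL, 0, ...) reports the length needed, excluding the NUL. */
    int len = snprintf(NULL, 0, "rank %d", proc->world_rank);
    if (len < 0) {
        return -1;
    }
    proc->hostname = malloc((size_t)len + 1);
    if (NULL == proc->hostname) {
        return -1;
    }
    snprintf(proc->hostname, (size_t)len + 1, "rank %d", proc->world_rank);
    return 0;
}
```

The cost is one small allocation per known peer, which is the memory trade-off mentioned above.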

  George.

On Aug 20, 2013, at 04:20 , Ralph Castain  wrote:

> 
> On Aug 19, 2013, at 6:07 PM, "Jeff Squyres (jsquyres)"  
> wrote:
> 
>> On Aug 19, 2013, at 8:02 PM, Ralph Castain  wrote:
>> 
>>> That's how it works now. My concern is with the error message scenario. 
>>> IIRC, Jeff's issue was that the error message only contains the hostname of 
>>> the proc that generates it - it doesn't tell you the hostname of the remote 
>>> proc. Hence, we included that info in the proc_t.
>> 
>> This is quite important for getting useful error messages.
>> 
>>> However, IIRC we also provided an option to *not* send that info due to 
>>> scaling concerns way back when. I wonder if we can resolve this simply by 
>>> having Nathan set that option in his platform .conf files, and then 
>>> removing ompi_proc_get_hostname completely. Since the IP-based comm 
>>> channels will call modex_recv anyway, we'll get the hostname at that time. 
>>> Otherwise, the errors print "NULL" for proc->hostname.
>>> 
>>> Yes, that means that users of direct-launched apps on Nathan's systems will 
>>> get less informative error messages - but they can always override Nathan's 
>>> default param if they want better info. After all, the vast majority of 
>>> users aren't running such big jobs as to care about this optimization.
>> 
>> I'm good with it.  It could also be (might already be) a run-time MCA 
>> param...?
> 
> I think it is - I'll check tonight
> 
>> 
>> We could also default the value to -1 (vs. 0 or 1), meaning: with np <= N 
>> procs, send the hostname around, otherwise, don't send it (we can argue over 
>> the value of N -- e.g., 1024 or 2048).
> 
> That makes the most sense to me - for small jobs, the time difference is too 
> tiny to measure.
> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-20 Thread Steve Wise


> Thanks for finding r27212.  It was about a year ago, and had clearly fallen 
> out of my cache (I have very little to do with the openib BTL these days).
> 
> Your solution isn't correct, because HAVE_IBV_LINK_LAYER_ETHERNET is defined 
> (or not) via this m4 macro in config/ompi_check_openfabrics.m4:
> 
>AC_CHECK_DECLS([IBV_LINK_LAYER_ETHERNET],
>   [$1_have_rdmaoe=1], [],
>   [#include ])
> 
> This m4 macro will #define HAVE_IBV_LINK_LAYER_ETHERNET if it exists, or 
> #undef that name if it
> doesn't.

I checked in the correct fix. Just below the code snippet you cited in 
ompi_check_openfabrics.m4, we see this snippet, which is incorrect:

   AC_DEFINE_UNQUOTED([OMPI_HAVE_RDMAOE], [$$1_have_rdmaoe], [Enable RDMAoE support])

It should be adding HAVE_IBV_LINK_LAYER_ETHERNET, not OMPI_HAVE_RDMAOE.

STevo



Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-20 Thread Jeff Squyres (jsquyres)
On Aug 20, 2013, at 9:51 AM, Steve Wise  wrote:

> I checked in the correct fix,

Er, no.  Please re-read my email -- your fix was incorrect (you're overriding 
the output of an AC macro).  :-)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-20 Thread Steve Wise


> -Original Message-
> From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> Sent: Tuesday, August 20, 2013 8:59 AM
> To: Steve Wise
> Cc: Open MPI Developers; Indranil Choudhury
> Subject: Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC
> 
> On Aug 20, 2013, at 9:51 AM, Steve Wise  wrote:
> 
> > I checked in the correct fix,
> 
> Er, no.  Please re-read my email -- your fix was incorrect (you're overriding 
> the output of an AC macro).
> :-)
> 

What is the correct fix then?  I've never worked with any of this AC stuff...

With the existing code (prior to my broken fix), HAVE_IBV_LINK_LAYER_ETHERNET 
does not get defined.
Yet the enum and the link_type field are in verbs.h...

Thanks.




Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-20 Thread Jeff Squyres (jsquyres)
On Aug 20, 2013, at 10:06 AM, Steve Wise  wrote:

> What is the correct fix then?  I've never worked with any of this AC stuff...
> 
> With the existing code (prior to my broken fix), HAVE_IBV_LINK_LAYER_ETHERNET 
> does not get defined.
> Yet the enum and the link_type field are in verbs.h...


What's the result of the IBV_LINK_LAYER_ETHERNET test in your configure?  Is it 
failing for some reason?  Look in config.log to see exactly what that test 
tried and what its result was.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-20 Thread Steve Wise

Ah:

Here's the config.log:

configure:133950: checking whether IBV_LINK_LAYER_ETHERNET is declared
configure:133950: gcc -std=gnu99 -c -g -Wall -Wundef -Wno-long-long 
-Wsign-compare
-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic 
-Werror-implicit-function-declaration
-finline-functions -fno-strict-aliasing -pthread
-I/usr/local/src/ompi-trunk/opal/mca/hwloc/hwloc152/hwloc/include
-I/usr/local/src/ompi-trunk/opal/mca/event/libevent2021/libevent
-I/usr/local/src/ompi-trunk/opal/mca/event/libevent2021/libevent/include  
conftest.c >&5
conftest.c:611: warning: function declaration isn't a prototype
configure:133950: $? = 0
configure:133950: result: yes

And I see it in opal_config.h:

/* Define to 1 if you have the declaration of `IBV_LINK_LAYER_ETHERNET', and
   to 0 if you don't. */
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1

Note the #define is HAVE_DECL_IBV_LINK_LAYER_ETHERNET but the code is checking 
for HAVE_IBV_LINK_LAYER_ETHERNET!

No _DECL_...
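
The mismatch can be reproduced in isolation. This is a sketch, not code from the tree: the `#define` stands in for the line shown from opal_config.h, and the two function names are hypothetical.

```c
/* Stand-in for the line quoted from opal_config.h above. */
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1

/* Guard spelled the way the C code spells it (no _DECL_).
 * Nothing defines that name, so the #else branch is compiled. */
int guard_without_decl(void)
{
#if defined(HAVE_IBV_LINK_LAYER_ETHERNET)
    return 1;
#else
    return 0;
#endif
}

/* Guard spelled the way configure actually emits the macro. */
int guard_with_decl(void)
{
#if HAVE_DECL_IBV_LINK_LAYER_ETHERNET
    return 1;
#else
    return 0;
#endif
}
```

Here `guard_without_decl()` returns 0 and `guard_with_decl()` returns 1. The compile line above includes -Wundef, but as far as I know that warning only covers an undefined identifier evaluated directly in `#if`, not one tested through `defined()`, which would explain why the misspelled guard never produced a warning.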



> -Original Message-
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Steve Wise
> Sent: Tuesday, August 20, 2013 9:07 AM
> To: 'Jeff Squyres (jsquyres)'
> Cc: 'Open MPI Developers'; 'Indranil Choudhury'
> Subject: Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC
> 
> 
> 
> > -Original Message-
> > From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> > Sent: Tuesday, August 20, 2013 8:59 AM
> > To: Steve Wise
> > Cc: Open MPI Developers; Indranil Choudhury
> > Subject: Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC
> >
> > On Aug 20, 2013, at 9:51 AM, Steve Wise  wrote:
> >
> > > I checked in the correct fix,
> >
> > Er, no.  Please re-read my email -- your fix was incorrect (you're 
> > overriding the output of an AC macro).
> > :-)
> >
> 
> What is the correct fix then?  I've never worked with any of this AC stuff...
> 
> With the existing code (prior to my broken fix), HAVE_IBV_LINK_LAYER_ETHERNET 
> does not get defined.
> Yet the enum and the link_type field are in verbs.h...
> 
> Thanks.
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-20 Thread Ralph Castain
The error messages already output the name of the other proc, so that should be 
available. Besides, I just spent all yesterday afternoon auditing our MPI 
layer's memory usage byte-by-byte and getting my ears burned about the need to 
reduce that footprint - not really thrilled about adding to it.

I think the key here is to only do this reduction when directed to do so. It 
only benefits really big scale, which is the exception and not the rule. And if 
someone in that scenario wants the error output, they can just ask for it 
(assuming their sys admin defaulted it to not include the hostname).
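
To make the "only when directed" policy concrete, here is a rough sketch that combines it with the tri-state default floated in the quoted text below (-1 meaning "decide by job size"). The function name, the meaning assigned to 0 and 1, and the cutoff value are assumptions for illustration; the thread only says N might be 1024 or 2048.

```c
#include <stdbool.h>

/* Hypothetical cutoff N; the thread suggests 1024 or 2048. */
#define DEMO_HOSTNAME_CUTOFF 1024

/* Decide whether to send hostnames around.
 * param (assumed encoding):
 *    1  = always send
 *    0  = never send
 *   -1  = send only when the job has at most DEMO_HOSTNAME_CUTOFF procs */
bool demo_send_hostname(int param, int num_procs)
{
    if (param >= 0) {
        return param != 0;      /* explicit user or sys-admin choice wins */
    }
    return num_procs <= DEMO_HOSTNAME_CUTOFF;
}
```

With this shape, a site default of -1 keeps full error messages for small jobs, and a user on a large system can still pass 1 to get the hostnames back.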


On Aug 20, 2013, at 3:18 AM, George Bosilca  wrote:

> If we don't want to lose the usefulness of the error messages (and don't care 
> that much about the memory requirements), we can initialize this value with 
> the string of the rank of the process in MPI_COMM_WORLD (instead of NULL). We 
> will at least get an idea where to start looking in case of troubles …
> 
>  George.
> 
> On Aug 20, 2013, at 04:20 , Ralph Castain  wrote:
> 
>> 
>> On Aug 19, 2013, at 6:07 PM, "Jeff Squyres (jsquyres)"  
>> wrote:
>> 
>>> On Aug 19, 2013, at 8:02 PM, Ralph Castain  wrote:
>>> 
>>>> That's how it works now. My concern is with the error message scenario. 
>>>> IIRC, Jeff's issue was that the error message only contains the hostname 
>>>> of the proc that generates it - it doesn't tell you the hostname of the 
>>>> remote proc. Hence, we included that info in the proc_t.
>>> 
>>> This is quite important for getting useful error messages.
>>> 
>>>> However, IIRC we also provided an option to *not* send that info due to 
>>>> scaling concerns way back when. I wonder if we can resolve this simply by 
>>>> having Nathan set that option in his platform .conf files, and then 
>>>> removing ompi_proc_get_hostname completely. Since the IP-based comm 
>>>> channels will call modex_recv anyway, we'll get the hostname at that time. 
>>>> Otherwise, the errors print "NULL" for proc->hostname.
>>>> 
>>>> Yes, that means that users of direct-launched apps on Nathan's systems 
>>>> will get less informative error messages - but they can always override 
>>>> Nathan's default param if they want better info. After all, the vast 
>>>> majority of users aren't running such big jobs as to care about this 
>>>> optimization.
>>> 
>>> I'm good with it.  It could also be (might already be) a run-time MCA 
>>> param...?
>> 
>> I think it is - I'll check tonight
>> 
>>> 
>>> We could also default the value to -1 (vs. 0 or 1), meaning: with np <= N 
>>> procs, send the hostname around, otherwise, don't send it (we can argue 
>>> over the value of N -- e.g., 1024 or 2048).
>> 
>> That makes the most sense to me - for small jobs, the time difference is too 
>> tiny to measure.
>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-20 Thread Steve Wise
So is this the correct fix?

[root@r9 ompi-trunk]# svn diff
Index: ompi/mca/btl/openib/btl_openib_component.c
===
--- ompi/mca/btl/openib/btl_openib_component.c  (revision 29050)
+++ ompi/mca/btl/openib/btl_openib_component.c  (working copy)
@@ -716,7 +716,7 @@
 return OMPI_ERR_NOT_FOUND;
 }

-#if defined(HAVE_IBV_LINK_LAYER_ETHERNET)
+#if defined(HAVE_DECL_IBV_LINK_LAYER_ETHERNET)
 if (IBV_LINK_LAYER_ETHERNET == ib_port_attr->link_layer) {
 subnet_id = mca_btl_openib_get_ip_subnet_id(device->ib_dev,
port_num);
Index: ompi/mca/btl/openib/btl_openib.c
===
--- ompi/mca/btl/openib/btl_openib.c(revision 29050)
+++ ompi/mca/btl/openib/btl_openib.c(working copy)
@@ -444,7 +444,7 @@
 #ifdef HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE
 switch(openib_btl->device->ib_dev->transport_type) {
 case IBV_TRANSPORT_IB:
-#if defined(HAVE_IBV_LINK_LAYER_ETHERNET)
+#if defined(HAVE_DECL_IBV_LINK_LAYER_ETHERNET)
 switch(openib_btl->ib_port_attr.link_layer) {
 case IBV_LINK_LAYER_ETHERNET:
 return MCA_BTL_OPENIB_TRANSPORT_RDMAOE;
Index: ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
===
--- ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c   (revision 29050)
+++ ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c   (working copy)
@@ -389,7 +389,7 @@
/* If we do not have struct ibv_device.transport_device, then
   we're in an old version of OFED that is IB only (i.e., no
   iWarp), so we can safely assume that we can use this CPC. */
-#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE) && defined(HAVE_IBV_LINK_LAYER_ETHERNET)
+#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE) && defined(HAVE_DECL_IBV_LINK_LAYER_ETHERNET)
 if (BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl)) {
 BTL_VERBOSE(("UD CPC only supported on InfiniBand; skipped on %s:%d",
 ibv_get_device_name(btl->device->ib_dev),
Index: ompi/mca/btl/openib/connect/btl_openib_connect_oob.c
===
--- ompi/mca/btl/openib/connect/btl_openib_connect_oob.c(revision 29050)
+++ ompi/mca/btl/openib/connect/btl_openib_connect_oob.c(working copy)
@@ -127,7 +127,7 @@
IB (this CPC will not work with iWarp).  If we do not have the
transport_type member, then we must be < OFED v1.2, and
therefore we must be IB. */
-#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE) && defined(HAVE_IBV_LINK_LAYER_ETHERNET)
+#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE) && defined(HAVE_DECL_IBV_LINK_LAYER_ETHERNET)
 if (BTL_OPENIB_CONNECT_BASE_CHECK_IF_NOT_IB(btl)) {
 opal_output_verbose(5, ompi_btl_base_framework.framework_output,
 "openib BTL: oob CPC only supported on InfiniBand; skipped on  %s:%d",
Index: ompi/mca/common/verbs/common_verbs_find_ports.c
===
--- ompi/mca/common/verbs/common_verbs_find_ports.c (revision 29050)
+++ ompi/mca/common/verbs/common_verbs_find_ports.c (working copy)
@@ -170,7 +170,7 @@
 }
 }

-#if defined(HAVE_IBV_LINK_LAYER_ETHERNET)
+#if defined(HAVE_DECL_IBV_LINK_LAYER_ETHERNET)
 static const char *link_layer_to_str(int link_type)
 {
 switch(link_type) {
@@ -417,7 +417,7 @@
 /* If they specified neither link layer, then we want this port */
 want = true;
 }
-#if defined(HAVE_IBV_LINK_LAYER_ETHERNET)
+#if defined(HAVE_DECL_IBV_LINK_LAYER_ETHERNET)
 else if (flags & OMPI_COMMON_VERBS_FLAGS_LINK_LAYER_IB) {
 if (IBV_LINK_LAYER_INFINIBAND == port_attr.link_layer) {
 want = true;
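
One caveat on the `#if defined(...)` form used in this patch, offered as a sketch rather than a verdict: the opal_config.h comment quoted earlier in the thread says the macro is defined to 1 when the declaration exists and to 0 when it does not. If that holds, `defined(HAVE_DECL_...)` would be true in both cases, and testing the value would be the safer guard. The macro `HAVE_DECL_DEMO_SYMBOL` and both functions below are hypothetical.

```c
/* Hypothetical: what config.h would contain on a system whose header
 * lacks the declaration, per the "to 0 if you don't" comment. */
#define HAVE_DECL_DEMO_SYMBOL 0

/* Tests only whether the macro exists; a 0 value still counts. */
int guard_using_defined(void)
{
#if defined(HAVE_DECL_DEMO_SYMBOL)
    return 1;
#else
    return 0;
#endif
}

/* Tests the macro's value; a 0 value disables the guarded code. */
int guard_using_value(void)
{
#if HAVE_DECL_DEMO_SYMBOL
    return 1;
#else
    return 0;
#endif
}
```

Here `guard_using_defined()` returns 1 while `guard_using_value()` returns 0, so only the value test would keep the RDMAoE code out of a build against an older verbs.h.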




Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-20 Thread Jeff Squyres (jsquyres)
I think you hit the nail on the head -- we typo'ed the macro name in the C 
code.  Doh!

If you can confirm that this fixes the issue for you, please commit and CMR.

Thank you for tracking this down!



Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-20 Thread Steve Wise


> -Original Message-
> From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> Sent: Tuesday, August 20, 2013 11:07 AM
> To: Steve Wise
> Cc: Open MPI Developers; Indranil Choudhury
> Subject: Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC
> 
> I think you hit the nail on the head -- we typo'ed the macro name in the C 
> code.  Doh!
> 
> If you can confirm that this fixes the issue for you, please commit and CMR.
> 

Will do!


> Thank you for tracking this down!
>

U R welcome. :)


Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-20 Thread Jeff Squyres (jsquyres)
On Aug 20, 2013, at 12:08 PM, Steve Wise  wrote:

>> Thank you for tracking this down!
> 
> U R welcome. :)


Don't forget that Chelsio is still on the hook for adding iWARP support into 
ompi/mca/common/ofacm, however.  :-)

Specifically: At some point iWARP support will break because we'll be removing 
ompi/mca/btl/openib/cpc and exclusively using ompi/mca/common/ofacm.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-20 Thread Steve Wise
> 
> Don't forget that Chelsio is still on the hook for adding iWARP support into
> ompi/mca/common/ofacm, however.  :-)
>

You won't let me forget. ;)  I will do it.

> Specifically: At some point iWARP support will break because we'll be removing
> ompi/mca/btl/openib/cpc and exclusively using ompi/mca/common/ofacm.
> 

When is this going to happen?

I can probably get to this project around the end of Sep (vacation is pending).

Steve




Re: [OMPI devel] openmpi-1.7.2 fails to use the RDMACM CPC

2013-08-20 Thread Jeff Squyres (jsquyres)
On Aug 20, 2013, at 12:57 PM, Steve Wise  wrote:

> You won't let me forget. ;)  I will do it.

Awesome, thanks.

>> Specifically: At some point iWARP support will break because we'll be 
>> removing
>> ompi/mca/btl/openib/cpc and exclusively using ompi/mca/common/ofacm.
> 
> When is this going to happen?

Don't know yet.  It's been "pending / real soon now..." for a little while, but 
other higher-priority things have crept in.

> I can probably get to this project around the end of Sep (vacation is 
> pending).


K.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] OpenSHMEM round 2

2013-08-20 Thread Joshua Ladd
All,

As requested, I have placed a simple shmem test code, 'hello_shmem.c' in the 
'examples' directory of my Bitbucket repo. It actually does a bit more than 
just "hello world"; this code will "gently" exercise 
start_pes/put/get/barrier/shmem_finalize.

Josh


From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf 
Of Joshua Ladd
Sent: Tuesday, August 06, 2013 12:32 PM
To: Open MPI Developers (de...@open-mpi.org)
Subject: [OMPI devel] OpenSHMEM round 2

Dear OMPI Community,

Please find on Bitbucket the latest round of OSHMEM changes based on community 
feedback. Please git and test at your leisure.

https://bitbucket.org/jladd_math/mlnx-oshmem.git

Best regards,

Josh

Joshua S. Ladd, PhD
HPC Algorithms Engineer
Mellanox Technologies

Email: josh...@mellanox.com
Cell: +1 (865) 258 - 8898




Re: [OMPI devel] OpenSHMEM round 2

2013-08-20 Thread Ralph Castain
Thanks!

On Aug 20, 2013, at 11:25 AM, Joshua Ladd  wrote:

> All,
>  
> As requested, I have placed a simple shmem test code, ‘hello_shmem.c’ in the 
> ‘examples’ directory of my Bitbucket repo. It actually does a bit more than 
> just “hello world”; this code will “gently” exercise 
> start_pes/put/get/barrier/shmem_finalize.
>  
> Josh
>   
>  
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On 
> Behalf Of Joshua Ladd
> Sent: Tuesday, August 06, 2013 12:32 PM
> To: Open MPI Developers (de...@open-mpi.org)
> Subject: [OMPI devel] OpenSHMEM round 2
>  
> Dear OMPI Community,
>  
> Please find on Bitbucket the latest round of OSHMEM changes based on 
> community feedback. Please git and test at your leisure.
>  
> https://bitbucket.org/jladd_math/mlnx-oshmem.git
>  
> Best regards,
>  
> Josh
>  
> Joshua S. Ladd, PhD
> HPC Algorithms Engineer
> Mellanox Technologies
>  
> Email: josh...@mellanox.com
> Cell: +1 (865) 258 - 8898
>  
>  
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni

2013-08-20 Thread Ralph Castain
Okay, please see r29052 - I believe this will address everyone's concerns. 
Please give it a test so we can verify it is clean - it worked for me, but I 
can't test all environments


On Aug 20, 2013, at 7:35 AM, Ralph Castain  wrote:

> The error messages already output the name of the other proc, so that should 
> be available. Besides, I just spent all yesterday afternoon auditing our MPI 
> layers memory usage byte-by-byte and getting my ears burned about the need to 
> reduce that footprint - not really thrilled about adding to it.
> 
> I think the key here is to only do this reduction when directed to do so. It 
> only benefits really big scale, which is the exception and not the rule. And 
> if someone in that scenario wants the error output, they can just ask for it 
> (assuming their sys admin defaulted it to not include the hostname).
> 
> 
> On Aug 20, 2013, at 3:18 AM, George Bosilca  wrote:
> 
>> If we don't want to lose the usefulness of the error messages (and don't 
>> care that much about the memory requirements), we can initialize this value 
>> with the string of the rank of the process in MPI_COMM_WORLD (instead of 
>> NULL). We will at least get an idea where to start looking in case of 
>> troubles …
>> 
>> George.
>> 
>> On Aug 20, 2013, at 04:20 , Ralph Castain  wrote:
>> 
>>> 
>>> On Aug 19, 2013, at 6:07 PM, "Jeff Squyres (jsquyres)"  
>>> wrote:
>>> 
>>>> On Aug 19, 2013, at 8:02 PM, Ralph Castain  wrote:
>>>> 
>>>>> That's how it works now. My concern is with the error message scenario. 
>>>>> IIRC, Jeff's issue was that the error message only contains the hostname 
>>>>> of the proc that generates it - it doesn't tell you the hostname of the 
>>>>> remote proc. Hence, we included that info in the proc_t.
>>>> 
>>>> This is quite important for getting useful error messages.
>>>> 
>>>>> However, IIRC we also provided an option to *not* send that info due to 
>>>>> scaling concerns way back when. I wonder if we can resolve this simply by 
>>>>> having Nathan set that option in his platform .conf files, and then 
>>>>> removing ompi_proc_get_hostname completely. Since the IP-based comm 
>>>>> channels will call modex_recv anyway, we'll get the hostname at that 
>>>>> time. Otherwise, the errors print "NULL" for proc->hostname.
>>>>> 
>>>>> Yes, that means that users of direct-launched apps on Nathan's systems 
>>>>> will get less informative error messages - but they can always override 
>>>>> Nathan's default param if they want better info. After all, the vast 
>>>>> majority of users aren't running such big jobs as to care about this 
>>>>> optimization.
>>>> 
>>>> I'm good with it.  It could also be (might already be) a run-time MCA 
>>>> param...?
>>> 
>>> I think it is - I'll check tonight
>>> 
>>>> 
>>>> We could also default the value to -1 (vs. 0 or 1), meaning: with np <= N 
>>>> procs, send the hostname around, otherwise, don't send it (we can argue 
>>>> over the value of N -- e.g., 1024 or 2048).
>>> 
>>> That makes the most sense to me - for small jobs, the time difference is 
>>> too tiny to measure.
>>> 
>>>> 
>>>> -- 
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to: 
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>> 
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>