Re: [OMPI devel] 1.8.5rc2 released

2015-04-22 Thread Paul Hargrove
Unless I am mistaken, the text quoted below from README no longer reflects
the current behavior.
The text appears to be the same in master and v1.8.

-Paul

--with-libltdl(=value)
  This option specifies where to find the GNU Libtool libltdl support
  library.  The following values are permitted:

  internal:    Use Open MPI's internal copy of libltdl.
  external:    Use an external libltdl installation (rely on default
               compiler and linker paths to find it)
  <no value>:  Same as "internal".
  <directory>: Specify the location of a specific libltdl
               installation to use

  By default (or if --with-libltdl is specified with no VALUE), Open
  MPI will build and use the copy of libltdl that it has in its source
  tree.  However, if the VALUE is "external", Open MPI will look for
  the relevant libltdl header file and library in default compiler /
  linker locations.  Or, VALUE can be a directory tree where the
  libltdl header file and library can be found.  This option allows
  operating systems to include Open MPI and use their default libltdl
  installation instead of Open MPI's bundled libltdl.
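For concreteness, the accepted forms correspond to configure invocations like the following (the /opt/libtool prefix is just an illustrative path, not a recommendation):

```shell
./configure --with-libltdl                  # no VALUE: same as "internal"
./configure --with-libltdl=internal         # build and use the bundled libltdl
./configure --with-libltdl=external         # find header/library via default compiler and linker paths
./configure --with-libltdl=/opt/libtool     # look under /opt/libtool/include and /opt/libtool/lib
```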


On Tue, Apr 21, 2015 at 3:43 PM, Jeff Squyres (jsquyres)  wrote:

> In the usual location:
>
> http://www.open-mpi.org/software/ompi/v1.8/
>
> The NEWS changed completely between rc1 and rc2, so I don't know easily
> exactly what is different between rc1 and rc2.  Here's the full 1.8.5 NEWS:
>
> - Fixed configure problems in some cases when using an external hwloc
>   installation.  Thanks to Erick Schnetter for reporting the error and
>   helping track down the source of the problem.
> - Fixed linker error on OS X when using the clang compiler.  Thanks to
>   Erick Schnetter for reporting the error and helping track down the
>   source of the problem.
> - Fixed MPI_THREAD_MULTIPLE deadlock error in the vader BTL.  Thanks
>   to Thomas Klimpel for reporting the issue.
> - Fixed several Valgrind warnings.  Thanks to Lisandro Dalcin for
>   contributing a patch fixing some one-sided code paths.
> - Fixed version compatibility test in OOB that broke ABI within the
>   1.8 series. NOTE: this will not resolve the problem between pre-1.8.5
>   versions, but will fix it going forward.
> - Fix some issues related to running on Intel Xeon Phi coprocessors.
> - Opportunistically switch away from using GNU Libtool's libltdl
>   library when possible (by default).
> - Fix some VampirTrace errors.  Thanks to Paul Hargrove for reporting
>   the issues.
> - Correct default binding patterns when --use-hwthread-cpus was
>   specified and nprocs <= 2.
> - Fix warnings about -finline-functions when compiling with clang.
> - Updated the embedded hwloc with several bug fixes, including the
>   "duplicate Lhwloc1 symbol" that multiple users reported on some
>   platforms.
> - Do not error when mpirun is invoked with default bindings
>   (i.e., no binding was specified), and one or more nodes do not
>   support bindings.  Thanks to Annu Desari for pointing out the
>   problem.
> - Let root invoke "mpirun --version" to check the version without
>   printing the "Don't run as root!" warnings.  Thanks to Robert McLay
>   for the suggestion.
> - Fixed several bugs in OpenSHMEM support.
> - Extended vader shared memory support to 32-bit architectures.
> - Fix handling of very large datatypes.  Thanks to Bogdan Sataric for
>   the bug report.
> - Fixed a bug in handling subarray MPI datatypes, and a bug when using
>   MPI_LB and MPI_UB.  Thanks to Gus Correa for pointing out the issue.
> - Restore user-settable bandwidth and latency PML MCA variables.
> - Multiple bug fixes for cleanup during MPI_FINALIZE in unusual
>   situations.
> - Added support for TCP keepalive signals to ensure timely termination
>   when sockets between daemons cannot be created (e.g., due to a
>   firewall).
> - Added MCA parameter to allow full use of a SLURM allocation when
>   started from a tool (supports LLNL debugger).
> - Fixed several bugs in the configure logic for PMI and hwloc.
> - Fixed incorrect interface index in TCP communications setup.  Thanks
>   to Mark Kettenis for spotting the problem and providing a patch.
> - Fixed MPI_IREDUCE_SCATTER with single-process communicators when
>   MPI_IN_PLACE was not used.
> - Added XRC support for OFED v3.12 and higher.
> - Various updates and bug fixes to the Mellanox hcoll collective
>   support.
> - Fix problems with Fortran compilers that did not support
>   REAL*16/COMPLEX*32 types.  Thanks to Orion Poplawski for identifying
>   the issue.
> - Fixed problem with rpath/runpath support in pkg-config files.
>   Thanks to Christoph Junghans for notifying us of the issue.
> - Man page fixes:
>   - Removed erroneous "color" discussion from MPI_COMM_SPLIT_TYPE.
> Thanks to Erick Schnetter for spotting the outdated text.
>   - Fixed prototypes for MPI_IBARRIER.  Thanks to Maximilian for
> finding the issue.
>   - Updated docs about buffer usage in non-blocking communications.

Re: [OMPI devel] 1.8.5rc2 released

2015-04-22 Thread Marco Atzeri

On 4/22/2015 12:43 AM, Jeff Squyres (jsquyres) wrote:

In the usual location:

 http://www.open-mpi.org/software/ompi/v1.8/




Making all in mpi/fortran/use-mpi-f08
make[2]: Entering directory 
'/cygdrive/e/cyg_pub/devel/openmpi/openmpi-1.8.5rc2-1.x86_64/build/ompi/mpi/fortran/use-mpi-f08'

  FCLD libmpi_usempif08.la
.libs/abort_f08.o: In function `mpi_abort_f08_':
/usr/src/debug/openmpi-1.8.5rc2-1/ompi/mpi/fortran/use-mpi-f08/abort_f08.F90:17: 
undefined reference to `ompi_abort_f'
/usr/src/debug/openmpi-1.8.5rc2-1/ompi/mpi/fortran/use-mpi-f08/abort_f08.F90:17:(.text+0xe): 
relocation truncated to fit: R_X86_64_PC32 against undefined symbol 
`ompi_abort_f'

.libs/accumulate_f08.o: In function `mpi_accumulate_f08_':
/usr/src/debug/openmpi-1.8.5rc2-1/ompi/mpi/fortran/use-mpi-f08/accumulate_f08.F90:28: 
undefined reference to `ompi_accumulate_f'


Patch attached.

Question:
what is the scope of the two new shared libs

 usr/bin/cygmpi_usempi_ignore_tkr-0.dll
 usr/bin/cygmpi_usempif08-0.dll

in comparison to the previous

 usr/bin/cygmpi_mpifh-2.dll
 usr/bin/cygmpi_usempi-1.dll

already present in 1.8.4?

Regards
Marco
--- origsrc/openmpi-1.8.5rc2/ompi/mpi/fortran/use-mpi-f08/Makefile.am   
2015-04-05 20:40:24.0 +0200
+++ src/openmpi-1.8.5rc2/ompi/mpi/fortran/use-mpi-f08/Makefile.am   
2015-04-22 15:39:46.739793600 +0200
@@ -805,6 +805,7 @@ endif
 libmpi_usempif08_la_LIBADD = \
 $(module_sentinel_file) \
 $(OMPI_MPIEXT_USEMPIF08_LIBS) \
+$(top_builddir)/ompi/mpi/fortran/mpif-h/libmpi_mpifh.la \
 $(top_builddir)/ompi/libmpi.la
 libmpi_usempif08_la_DEPENDENCIES = $(module_sentinel_file)
 libmpi_usempif08_la_LDFLAGS = -version-info $(libmpi_usempif08_so_version)


[OMPI devel] Fwd: OpenIB module initialisation causes segmentation fault when locked memory limit too low

2015-04-22 Thread Raphaël Fouassier
We are experiencing a bug in OpenMPI in 1.8.4 which happens also on
master: if locked memory limits are too low, a segfault happens
in openib/udcm because some memory is not correctly deallocated.

To reproduce it, modify /etc/security/limits.conf with:
* soft memlock 64
* hard memlock 64
and launch with mpirun (not in a slurm allocation).


I propose 2 patches for 1.8.4 and master (because of the btl move to
opal) which:
- free all allocated resources
- print the limits error

diff --git a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c 
b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
index 19753a9..b74 100644
--- a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
+++ b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
@@ -5,6 +5,7 @@
  * Copyright (c) 2009  IBM Corporation.  All rights reserved.
  * Copyright (c) 2011-2014 Los Alamos National Security, LLC.  All rights
  * reserved.
+ * Copyright (c) 2015  Bull SAS. All rights reserved.
  *
  * $COPYRIGHT$
  * 
@@ -460,6 +461,8 @@ static int udcm_component_query(mca_btl_openib_module_t 
*btl,
 
 rc = udcm_module_init (m, btl);
 if (OMPI_SUCCESS != rc) {
+free(m);
+m = NULL;
 break;
 }
 
@@ -536,9 +539,19 @@ static int udcm_endpoint_finalize(struct 
mca_btl_base_endpoint_t *lcl_ep)
 return OMPI_SUCCESS;
 }
 
+static void *udcm_unmonitor(int fd, int flags, void *context)
+{
+volatile int *barrier = (volatile int *)context;
+
+*barrier = 1;
+
+return NULL;
+}
+
 static int udcm_module_init (udcm_module_t *m, mca_btl_openib_module_t *btl)
 {
 int rc = OMPI_ERR_NOT_SUPPORTED;
+volatile int barrier = 0;
 
 BTL_VERBOSE(("created cpc module %p for btl %p",
  (void*)m, (void*)btl));
@@ -549,7 +562,7 @@ static int udcm_module_init (udcm_module_t *m, 
mca_btl_openib_module_t *btl)
 m->cm_channel = ibv_create_comp_channel (btl->device->ib_dev_context);
 if (NULL == m->cm_channel) {
 BTL_VERBOSE(("error creating ud completion channel"));
-return OMPI_ERR_NOT_SUPPORTED;
+goto out;
 }
 
 /* Create completion queues */
@@ -558,29 +571,33 @@ static int udcm_module_init (udcm_module_t *m, 
mca_btl_openib_module_t *btl)
m->cm_channel, 0);
 if (NULL == m->cm_recv_cq) {
 BTL_VERBOSE(("error creating ud recv completion queue"));
-return OMPI_ERR_NOT_SUPPORTED;
+mca_btl_openib_show_init_error(__FILE__, __LINE__, "ibv_create_cq",
+   
ibv_get_device_name(btl->device->ib_dev));
+goto out1;
 }
 
 m->cm_send_cq = ibv_create_cq (btl->device->ib_dev_context,
UDCM_SEND_CQ_SIZE, NULL, NULL, 0);
 if (NULL == m->cm_send_cq) {
 BTL_VERBOSE(("error creating ud send completion queue"));
-return OMPI_ERR_NOT_SUPPORTED;
+mca_btl_openib_show_init_error(__FILE__, __LINE__, "ibv_create_cq",
+   
ibv_get_device_name(btl->device->ib_dev));
+goto out2;
 }
 
 if (0 != (rc = udcm_module_allocate_buffers (m))) {
 BTL_VERBOSE(("error allocating cm buffers"));
-return rc;
+goto out3;
 }
 
 if (0 != (rc = udcm_module_create_listen_qp (m))) {
 BTL_VERBOSE(("error creating UD QP"));
-return rc;
+goto out4;
 }
 
 if (0 != (rc = udcm_module_post_all_recvs (m))) {
 BTL_VERBOSE(("error posting receives"));
-return rc;
+goto out5;
 }
 
 /* UD CM initialized properly.  So fill in the rest of the CPC
@@ -633,12 +650,41 @@ static int udcm_module_init (udcm_module_t *m, 
mca_btl_openib_module_t *btl)
 /* Finally, request CQ notification */
 if (0 != ibv_req_notify_cq (m->cm_recv_cq, 0)) {
 BTL_VERBOSE(("error requesting recv completions"));
-return OMPI_ERROR;
+rc = OMPI_ERROR;
+goto out6;
 }
 
 /* Ready to use */
 
 return OMPI_SUCCESS;
+
+out6:
+OBJ_DESTRUCT(&m->cm_timeout_lock);
+OBJ_DESTRUCT(&m->flying_messages);
+OBJ_DESTRUCT(&m->cm_recv_msg_queue_lock);
+OBJ_DESTRUCT(&m->cm_recv_msg_queue);
+OBJ_DESTRUCT(&m->cm_send_lock);
+OBJ_DESTRUCT(&m->cm_lock);
+
+m->channel_monitored = false;
+
+ompi_btl_openib_fd_unmonitor(m->cm_channel->fd,
+ udcm_unmonitor, (void *)&barrier);
+while (0 == barrier) {
+sched_yield();
+}
+out5:
+udcm_module_destroy_listen_qp (m);
+out4:
+udcm_module_destroy_buffers (m);
+out3:
+ibv_destroy_cq (m->cm_send_cq);
+out2:
+ibv_destroy_cq (m->cm_recv_cq);
+out1:
+ibv_destroy_comp_channel (m->cm_channel);
+out:
+return rc;
 }
 
 static int
@@ -691,15 +737,6 @@ 
udcm_module_start_connect(ompi_btl_openib_connect_base_modu

Re: [OMPI devel] Fwd: OpenIB module initialisation causes segmentation fault when locked memory limit too low

2015-04-22 Thread Howard Pritchard
Hi Raphael,

Thanks very much for the patches.

Would one of the developers on the list have a system where they
can make these kernel limit changes and which have HCAs installed?

I don't have access to any system where I have such permissions.

Howard


2015-04-22 8:55 GMT-06:00 Raphaël Fouassier :

> We are experiencing a bug in OpenMPI in 1.8.4 which happens also on
> master: if locked memory limits are too low, a segfault happens
> in openib/udcm because some memory is not correctly deallocated.
>
> To reproduce it, modify /etc/security/limits.conf with:
> * soft memlock 64
> * hard memlock 64
> and launch with mpirun (not in a slurm allocation).
>
>
> I propose 2 patches for 1.8.4 and master (because of the btl move to
> opal) which:
> - free all allocated resources
> - print the limits error
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17305.php
>


Re: [OMPI devel] Fwd: OpenIB module initialisation causes segmentation fault when locked memory limit too low

2015-04-22 Thread Paul Hargrove
Howard,

Unless there is some reason the settings must be global, you should be able
to set the limits w/o root privs:

Bourne shells:
$ ulimit -l 64
C shells:
% limit -h memorylocked 64

I would have thought these lines might need to go in a .profile or .cshrc
to affect the application processes, but perhaps mpirun propagates the
rlimits.
So, on NERSC's Carver I can reproduce the problem (in my build of 1.8.5rc2)
quite easily (below).
I have configured with --enable-debug, which probably explains why I see an
assertion failure rather than the reported SEGV.

-Paul

{hargrove@c1436 BLD}$ ulimit -l 64
{hargrove@c1436 BLD}$ mpirun -np 2 examples/ring_c
ring_c:
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743:
udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) ==
((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[c1436:05774] *** Process received signal ***
[c1436:05774] Signal: Aborted (6)
[c1436:05774] Signal code:  (-6)
[c1436:05774] [ 0] /lib64/libpthread.so.0[0x2b4107aabb10]
[c1436:05774] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b4107ce9265]
[c1436:05774] [ 2] /lib64/libc.so.6(abort+0x110)[0x2b4107cead10]
[c1436:05774] [ 3] /lib64/libc.so.6(__assert_fail+0xf6)[0x2b4107ce26e6]
[c1436:05774] [ 4] examples/ring_c[0x44e8b3]
[c1436:05774] [ 5] examples/ring_c[0x44d7e0]
[c1436:05774] [ 6] examples/ring_c[0x4464c5]
[c1436:05774] [ 7] examples/ring_c[0x442c5c]
[c1436:05774] [ 8] examples/ring_c[0x433328]
[c1436:05774] [ 9] examples/ring_c[0x43266c]
[c1436:05774] [10] examples/ring_c[0x50e460]
[c1436:05774] [11] examples/ring_c[0x4d9e09]
[c1436:05774] [12] examples/ring_c[0x4d90b0]
[c1436:05774] [13] examples/ring_c[0x423e2d]
[c1436:05774] [14] examples/ring_c[0x42dceb]
[c1436:05774] [15] examples/ring_c[0x407285]
[c1436:05774] [16] /lib64/libc.so.6(__libc_start_main+0xf4)[0x2b4107cd6994]
[c1436:05774] [17] examples/ring_c[0x407129]
[c1436:05774] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 5774 on node c1436 exited on
signal 6 (Aborted).
--



On Wed, Apr 22, 2015 at 9:16 AM, Howard Pritchard 
wrote:

> Hi Raphael,
>
> Thanks very much for the patches.
>
> Would one of the developers on the list have a system where they
> can make these kernel limit changes and which have HCAs installed?
>
> I don't have access to any system where I have such permissions.
>
> Howard
>
>
> 2015-04-22 8:55 GMT-06:00 Raphaël Fouassier :
>
>> We are experiencing a bug in OpenMPI in 1.8.4 which happens also on
>> master: if locked memory limits are too low, a segfault happens
>> in openib/udcm because some memory is not correctly deallocated.
>>
>> To reproduce it, modify /etc/security/limits.conf with:
>> * soft memlock 64
>> * hard memlock 64
>> and launch with mpirun (not in a slurm allocation).
>>
>>
>> I propose 2 patches for 1.8.4 and master (because of the btl move to
>> opal) which:
>> - free all allocated resources
>> - print the limits error
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/04/17305.php
>>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17306.php
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] Fwd: OpenIB module initialisation causes segmentation fault when locked memory limit too low

2015-04-22 Thread Paul Hargrove
And here is the backtrace I probably should have provided in the previous
email.
-Paul

#0  0x2b4107ce9265 in raise () from /lib64/libc.so.6
#1  0x2b4107ceaeb8 in abort () from /lib64/libc.so.6
#2  0x2b4107ce26e6 in __assert_fail () from /lib64/libc.so.6
#3  0x0044e8b3 in udcm_module_finalize (btl=0x1cf2ae0,
cpc=0x1ce6f80)
at
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743
#4  0x0044d7e0 in udcm_component_query (btl=0x1cf2ae0,
cpc=0x1cec948)
at
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:485
#5  0x004464c5 in
ompi_btl_openib_connect_base_select_for_local_port (btl=0x1cf2ae0)
at
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x00442c5c in btl_openib_component_init
(num_btl_modules=0x7fff6e9b5a10,
enable_progress_threads=false, enable_mpi_threads=false)
at
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/btl_openib_component.c:2837
#7  0x00433328 in mca_btl_base_select
(enable_progress_threads=false, enable_mpi_threads=false)
at
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/base/btl_base_select.c:111
#8  0x0043266c in mca_bml_r2_component_init
(priority=0x7fff6e9b5ac4, enable_progress_threads=false,
enable_mpi_threads=false)
at
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/bml/r2/bml_r2_component.c:88
#9  0x0050e460 in mca_bml_base_init (enable_progress_threads=false,
enable_mpi_threads=false)
at
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/bml/base/bml_base_init.c:69
#10 0x004d9e09 in mca_pml_ob1_component_init
(priority=0x7fff6e9b5bcc,
enable_progress_threads=false, enable_mpi_threads=false)
at
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/pml/ob1/pml_ob1_component.c:271
#11 0x004d90b0 in mca_pml_base_select
(enable_progress_threads=false, enable_mpi_threads=false)
at
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/pml/base/pml_base_select.c:128
#12 0x00423e2d in ompi_mpi_init (argc=1, argv=0x7fff6e9b5ea8,
requested=0, provided=0x7fff6e9b5d5c)
at
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/runtime/ompi_mpi_init.c:614
#13 0x0042dceb in PMPI_Init (argc=0x7fff6e9b5d9c,
argv=0x7fff6e9b5d90) at pinit.c:84
#14 0x00407285 in main (argc=1, argv=0x7fff6e9b5ea8) at ring_c.c:19


On Wed, Apr 22, 2015 at 9:41 AM, Paul Hargrove  wrote:

> Howard,
>
> Unless there is some reason the settings must be global, you should be
> able to set the limits w/o root privs:
>
> Bourne shells:
> $ ulimit -l 64
> C shells:
> % limit -h memorylocked 64
>
> I would have thought these lines might need to go in a .profile or .cshrc
> to affect the application processes, but perhaps mpirun propagates the
> rlimits.
> So, on NERSC's Carver I can reproduce the problem (in my build of
> 1.8.5rc2) quite easily (below).
> I have configured with --enable-debug, which probably explains why I see
> an assertion failure rather than the reported SEGV.
>
> -Paul
>
> {hargrove@c1436 BLD}$ ulimit -l 64
> {hargrove@c1436 BLD}$ mpirun -np 2 examples/ring_c
> ring_c:
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743:
> udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) ==
> ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
> [c1436:05774] *** Process received signal ***
> [c1436:05774] Signal: Aborted (6)
> [c1436:05774] Signal code:  (-6)
> [c1436:05774] [ 0] /lib64/libpthread.so.0[0x2b4107aabb10]
> [c1436:05774] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b4107ce9265]
> [c1436:05774] [ 2] /lib64/libc.so.6(abort+0x110)[0x2b4107cead10]
> [c1436:05774] [ 3] /lib64/libc.so.6(__assert_fail+0xf6)[0x2b4107ce26e6]
> [c1436:05774] [ 4] examples/ring_c[0x44e8b3]
> [c1436:05774] [ 5] examples/ring_c[0x44d7e0]
> [c1436:05774] [ 6] examples/ring_c[0x4464c5]
> [c1436:05774] [ 7] examples/ring_c[0x442c5c]
> [c1436:05774] [ 8] examples/ring_c[0x433328]
> [c1436:05774] [ 9] examples/ring_c[0x43266c]
> [c1436:05774] [10] examples/ring_c[0x50e460]
> [c1436:05774] [11] examples/ring_c[0x4d9e09]
> [c1436:05774] [12] examples/ring_c[0x4d90b0]
> [c1436:05774] [13] examples/ring_c[0x423e2d]
> [c1436:05774] [14] examples/ring_c[0x42dceb]
> [c1436:05774] [15] examples/ring_c[0x4

Re: [OMPI devel] Fwd: OpenIB module initialisation causes segmentation fault when locked memory limit too low

2015-04-22 Thread Howard Pritchard
Hi Paul,

silly me.  forgot this was a ulimit thing.  I'll test on carver.

Howard


2015-04-22 10:45 GMT-06:00 Paul Hargrove :

> And here is the backtrace I probably should have provided in the previous
> email.
> -Paul
>
> #0  0x2b4107ce9265 in raise () from /lib64/libc.so.6
> #1  0x2b4107ceaeb8 in abort () from /lib64/libc.so.6
> #2  0x2b4107ce26e6 in __assert_fail () from /lib64/libc.so.6
> #3  0x0044e8b3 in udcm_module_finalize (btl=0x1cf2ae0,
> cpc=0x1ce6f80)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743
> #4  0x0044d7e0 in udcm_component_query (btl=0x1cf2ae0,
> cpc=0x1cec948)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:485
> #5  0x004464c5 in
> ompi_btl_openib_connect_base_select_for_local_port (btl=0x1cf2ae0)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
> #6  0x00442c5c in btl_openib_component_init
> (num_btl_modules=0x7fff6e9b5a10,
> enable_progress_threads=false, enable_mpi_threads=false)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/btl_openib_component.c:2837
> #7  0x00433328 in mca_btl_base_select
> (enable_progress_threads=false, enable_mpi_threads=false)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/base/btl_base_select.c:111
> #8  0x0043266c in mca_bml_r2_component_init
> (priority=0x7fff6e9b5ac4, enable_progress_threads=false,
> enable_mpi_threads=false)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/bml/r2/bml_r2_component.c:88
> #9  0x0050e460 in mca_bml_base_init
> (enable_progress_threads=false, enable_mpi_threads=false)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/bml/base/bml_base_init.c:69
> #10 0x004d9e09 in mca_pml_ob1_component_init
> (priority=0x7fff6e9b5bcc,
> enable_progress_threads=false, enable_mpi_threads=false)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/pml/ob1/pml_ob1_component.c:271
> #11 0x004d90b0 in mca_pml_base_select
> (enable_progress_threads=false, enable_mpi_threads=false)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/pml/base/pml_base_select.c:128
> #12 0x00423e2d in ompi_mpi_init (argc=1, argv=0x7fff6e9b5ea8,
> requested=0, provided=0x7fff6e9b5d5c)
> at
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/runtime/ompi_mpi_init.c:614
> #13 0x0042dceb in PMPI_Init (argc=0x7fff6e9b5d9c,
> argv=0x7fff6e9b5d90) at pinit.c:84
> #14 0x00407285 in main (argc=1, argv=0x7fff6e9b5ea8) at ring_c.c:19
>
>
> On Wed, Apr 22, 2015 at 9:41 AM, Paul Hargrove  wrote:
>
>> Howard,
>>
>> Unless there is some reason the settings must be global, you should be
>> able to set the limits w/o root privs:
>>
>> Bourne shells:
>> $ ulimit -l 64
>> C shells:
>> % limit -h memorylocked 64
>>
>> I would have thought these lines might need to go in a .profile or .cshrc
>> to affect the application processes, but perhaps mpirun propagates the
>> rlimits.
>> So, on NERSC's Carver I can reproduce the problem (in my build of
>> 1.8.5rc2) quite easily (below).
>> I have configured with --enable-debug, which probably explains why I see
>> an assertion failure rather than the reported SEGV.
>>
>> -Paul
>>
>> {hargrove@c1436 BLD}$ ulimit -l 64
>> {hargrove@c1436 BLD}$ mpirun -np 2 examples/ring_c
>> ring_c:
>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.5rc2-linux-x86_64-static/openmpi-1.8.5rc2/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:743:
>> udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) ==
>> ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
>> [c1436:05774] *** Process received signal ***
>> [c1436:05774] Signal: Aborted (6)
>> [c1436:05774] Signal code:  (-6)
>> [c1436:05774] [ 0] /lib64/libpthread.so.0[0x2b4107aabb10]
>> [c1436:05774] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b4107ce9265]
>> [c1436:05774] [ 2] /lib64/libc.so.6(abort+0x110)[0x2b4107cead10]
>> [c1436:05774] [ 3] /lib64/libc.so.6(__assert_fail+0xf6)[0x2b4107ce26e6]
>> [c1436:05774] [ 4] examples/ring_c[0x44e8b3]
>> [c1436:05774] [ 5] examples/ring_c[0x44d7e0]
>> [c1436:05774] [ 6] examples/ring_c[0x4464c5]
>> [c1436:05774] [ 7] examples/ring_c[0x442c5c]
>> [c1436:05774] [ 8] examples/ring_c[0x433328]
>> [c1436:05774] [ 9] examples/ring_c[0x43266

[OMPI devel] Broken flex-required error message

2015-04-22 Thread Paul Hargrove
When building from a git clone of master I encountered the following:

checking for flex... no
checking for lex... no
configure: WARNING: *** Could not find GNU Flex on your system.
configure: WARNING: *** GNU Flex required for developer builds of Open MPI.
configure: WARNING: *** Other versions of Lex are not supported.
configure: WARNING: *** YOU DO NOT NEED FLEX FOR DISTRIBUTION TARBALLS!
configure: WARNING: *** If you absolutely cannot install GNU Flex on this
system
configure: WARNING: *** consider using a distribution tarball, or generate
the
configure: WARNING: *** following files on another system (using Flex) and
configure: WARNING: *** copy them here:
configure: error: Cannot continue


I do not disagree with the requirement for flex.
However, there are multiple problems with that output.

1) The "following files" list is EMPTY.
If there is no way to output the list, then the last 4 WARNING lines are
pointless.

2) A minor grammar point: "Flex required" should be "Flex is required"

3) It is NOT known as "GNU Flex".
Quoting from https://www.gnu.org/software/flex/flex.html
   Flex is a free (but non-GNU) implementation of the original Unix *lex* program.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] Fwd: OpenIB module initialisation causes segmentation fault when locked memory limit too low

2015-04-22 Thread Nathan Hjelm

Umm, why are you cleaning up this way? The allocated resources *should*
be freed by the udcm_module_finalize call. If there is a bug in that
path it should be fixed there NOT by adding a bunch of gotos (ick).

I will take a look now and apply the appropriate fix.

-Nathan

On Wed, Apr 22, 2015 at 04:55:57PM +0200, Raphaël Fouassier wrote:
> We are experiencing a bug in OpenMPI in 1.8.4 which happens also on
> master: if locked memory limits are too low, a segfault happens
> in openib/udcm because some memory is not correctly deallocated.
> 
> To reproduce it, modify /etc/security/limits.conf with:
> * soft memlock 64
> * hard memlock 64
> and launch with mpirun (not in a slurm allocation).
> 
> 
> I propose 2 patches for 1.8.4 and master (because of the btl move to
> opal) which:
> - free all allocated resources
> - print the limits error
> 

> diff --git a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c 
> b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> index 19753a9..b74 100644
> --- a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> +++ b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> @@ -5,6 +5,7 @@
>   * Copyright (c) 2009  IBM Corporation.  All rights reserved.
>   * Copyright (c) 2011-2014 Los Alamos National Security, LLC.  All rights
>   * reserved.
> + * Copyright (c) 2015  Bull SAS. All rights reserved.
>   *
>   * $COPYRIGHT$
>   * 
> @@ -460,6 +461,8 @@ static int udcm_component_query(mca_btl_openib_module_t 
> *btl,
>  
>  rc = udcm_module_init (m, btl);
>  if (OMPI_SUCCESS != rc) {
> +free(m);
> +m = NULL;
>  break;
>  }
>  
> @@ -536,9 +539,19 @@ static int udcm_endpoint_finalize(struct 
> mca_btl_base_endpoint_t *lcl_ep)
>  return OMPI_SUCCESS;
>  }
>  
> +static void *udcm_unmonitor(int fd, int flags, void *context)
> +{
> +volatile int *barrier = (volatile int *)context;
> +
> +*barrier = 1;
> +
> +return NULL;
> +}
> +
>  static int udcm_module_init (udcm_module_t *m, mca_btl_openib_module_t *btl)
>  {
>  int rc = OMPI_ERR_NOT_SUPPORTED;
> +volatile int barrier = 0;
>  
>  BTL_VERBOSE(("created cpc module %p for btl %p",
>   (void*)m, (void*)btl));
> @@ -549,7 +562,7 @@ static int udcm_module_init (udcm_module_t *m, 
> mca_btl_openib_module_t *btl)
>  m->cm_channel = ibv_create_comp_channel (btl->device->ib_dev_context);
[OMPI devel] Incorrect timer frequency [w/ patch]

2015-04-22 Thread Paul Hargrove
I had reason to look at the linux timer code today and noticed what I
believe to be a subtle error.
This is in both 'master' and v1.8.5rc2

Since casts bind tighter than multiplication in C, I believe that the
1-line patch below is required to produce the desired result of conversion
to an integer *after* the multiplication.
While I don't see how to exercise the code, I can see that cpu_f=2692.841
will yield opal_timer_linux_freq=2,692,000,000 rather than 2,692,841,000.

-Paul


diff --git a/opal/mca/timer/linux/timer_linux_component.c
b/opal/mca/timer/linux/timer_linux_component.c
index b130826..7abe578 100644
--- a/opal/mca/timer/linux/timer_linux_component.c
+++ b/opal/mca/timer/linux/timer_linux_component.c
@@ -128,7 +128,7 @@ static int opal_timer_linux_find_freq(void)
 ret = sscanf(loc, "%f", &cpu_f);
 if (1 == ret) {
 /* numer is in MHz - convert to Hz and make an integer */
-opal_timer_linux_freq = (opal_timer_t) cpu_f * 1000000;
+opal_timer_linux_freq = (opal_timer_t) (cpu_f * 1000000);
 }
 }
 }

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] Fwd: OpenIB module initialisation causes segmentation fault when locked memory limit too low

2015-04-22 Thread Nathan Hjelm

I will commit the log messages with my fix. Will PR the fix for 1.8.5.

-Nathan

On Wed, Apr 22, 2015 at 12:43:56PM -0600, Nathan Hjelm wrote:
> 
> Umm, why are you cleaning up this way. The allocated resources *should*
> be freed by the udcm_module_finalize call. If there is a bug in that
> path it should be fixed there NOT by adding a bunch of gotos (ick).
> 
> I will take a look now and apply the appropriate fix.
> 
> -Nathan
> 
> On Wed, Apr 22, 2015 at 04:55:57PM +0200, Raphaël Fouassier wrote:
> > We are experiencing a bug in OpenMPI in 1.8.4 which happens also on
> > master: if locked memory limits are too low, a segfault happens
> > in openib/udcm because some memory is not correctly deallocated.
> > 
> > To reproduce it, modify /etc/security/limits.conf with:
> > * soft memlock 64
> > * hard memlock 64
> > and launch with mpirun (not in a slurm allocation).
> > 
> > 
> > I propose 2 patches for 1.8.4 and master (because of the btl move to
> > opal) which:
> > - free all allocated resources
> > - print the limits error
> > 
> 
> > diff --git a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c 
> > b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> > index 19753a9..b74 100644
> > --- a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> > +++ b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> > @@ -5,6 +5,7 @@
> >   * Copyright (c) 2009  IBM Corporation.  All rights reserved.
> >   * Copyright (c) 2011-2014 Los Alamos National Security, LLC.  All rights
> >   * reserved.
> > + * Copyright (c) 2015  Bull SAS. All rights reserved.
> >   *
> >   * $COPYRIGHT$
> >   * 
> > @@ -460,6 +461,8 @@ static int udcm_component_query(mca_btl_openib_module_t 
> > *btl,
> >  
> >  rc = udcm_module_init (m, btl);
> >  if (OMPI_SUCCESS != rc) {
> > +free(m);
> > +m = NULL;
> >  break;
> >  }
> >  
> > @@ -536,9 +539,19 @@ static int udcm_endpoint_finalize(struct 
> > mca_btl_base_endpoint_t *lcl_ep)
> >  return OMPI_SUCCESS;
> >  }
> >  
> > +static void *udcm_unmonitor(int fd, int flags, void *context)
> > +{
> > +volatile int *barrier = (volatile int *)context;
> > +
> > +*barrier = 1;
> > +
> > +return NULL;
> > +}
> > +
> >  static int udcm_module_init (udcm_module_t *m, mca_btl_openib_module_t 
> > *btl)
> >  {
> >  int rc = OMPI_ERR_NOT_SUPPORTED;
> > +volatile int barrier = 0;
> >  
> >  BTL_VERBOSE(("created cpc module %p for btl %p",
> >   (void*)m, (void*)btl));
> > @@ -549,7 +562,7 @@ static int udcm_module_init (udcm_module_t *m, 
> > mca_btl_openib_module_t *btl)
> >  m->cm_channel = ibv_create_comp_channel (btl->device->ib_dev_context);
> >  if (NULL == m->cm_channel) {
> >  BTL_VERBOSE(("error creating ud completion channel"));
> > -return OMPI_ERR_NOT_SUPPORTED;
> > +goto out;
> >  }
> >  
> >  /* Create completion queues */
> > @@ -558,29 +571,33 @@ static int udcm_module_init (udcm_module_t *m, 
> > mca_btl_openib_module_t *btl)
> > m->cm_channel, 0);
> >  if (NULL == m->cm_recv_cq) {
> >  BTL_VERBOSE(("error creating ud recv completion queue"));
> > -return OMPI_ERR_NOT_SUPPORTED;
> > +mca_btl_openib_show_init_error(__FILE__, __LINE__, "ibv_create_cq",
> > +   
> > ibv_get_device_name(btl->device->ib_dev));
> > +goto out1;
> >  }
> >  
> >  m->cm_send_cq = ibv_create_cq (btl->device->ib_dev_context,
> > UDCM_SEND_CQ_SIZE, NULL, NULL, 0);
> >  if (NULL == m->cm_send_cq) {
> >  BTL_VERBOSE(("error creating ud send completion queue"));
> > -return OMPI_ERR_NOT_SUPPORTED;
> > +mca_btl_openib_show_init_error(__FILE__, __LINE__, "ibv_create_cq",
> > +   
> > ibv_get_device_name(btl->device->ib_dev));
> > +goto out2;
> >  }
> >  
> >  if (0 != (rc = udcm_module_allocate_buffers (m))) {
> >  BTL_VERBOSE(("error allocating cm buffers"));
> > -return rc;
> > +goto out3;
> >  }
> >  
> >  if (0 != (rc = udcm_module_create_listen_qp (m))) {
> >  BTL_VERBOSE(("error creating UD QP"));
> > -return rc;
> > +goto out4;
> >  }
> >  
> >  if (0 != (rc = udcm_module_post_all_recvs (m))) {
> >  BTL_VERBOSE(("error posting receives"));
> > -return rc;
> > +goto out5;
> >  }
> >  
> >  /* UD CM initialized properly.  So fill in the rest of the CPC
> > @@ -633,12 +650,41 @@ static int udcm_module_init (udcm_module_t *m, 
> > mca_btl_openib_module_t *btl)
> >  /* Finally, request CQ notification */
> >  if (0 != ibv_req_notify_cq (m->cm_recv_cq, 0)) {
> >  BTL_VERBOSE(("error requesting recv completions"));
> > -return OMPI_ERROR;
> > +rc = OMPI_ERROR;
> > +goto out6;

Re: [OMPI devel] Fwd: OpenIB module initialisation causes segmentation fault when locked memory limit too low

2015-04-22 Thread Howard Pritchard
Hi Rafael,

I give you an A+ for effort.   We always appreciate patches.

Howard


2015-04-22 12:43 GMT-06:00 Nathan Hjelm :

>
> Umm, why are you cleaning up this way. The allocated resources *should*
> be freed by the udcm_module_finalize call. If there is a bug in that
> path it should be fixed there NOT by adding a bunch of gotos (ick).
>
> I will take a look now and apply the appropriate fix.
>
> -Nathan
>
> On Wed, Apr 22, 2015 at 04:55:57PM +0200, Raphaël Fouassier wrote:
> > We are experiencing a bug in OpenMPI in 1.8.4 which happens also on
> > master: if locked memory limits are too low, a segfault happens
> > in openib/udcm because some memory is not correctly deallocated.
> >
> > To reproduce it, modify /etc/security/limits.conf with:
> > * soft memlock 64
> > * hard memlock 64
> > and launch with mpirun (not in a slurm allocation).
> >
> >
> > I propose 2 patches for 1.8.4 and master (because of the btl move to
> > opal) which:
> > - free all allocated resources
> > - print the limits error
> >
>
> > diff --git a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> > index 19753a9..b74 100644
> > --- a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> > +++ b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> > @@ -5,6 +5,7 @@
> >   * Copyright (c) 2009  IBM Corporation.  All rights reserved.
> >   * Copyright (c) 2011-2014 Los Alamos National Security, LLC.  All
> rights
> >   * reserved.
> > + * Copyright (c) 2015  Bull SAS. All rights reserved.
> >   *
> >   * $COPYRIGHT$
> >   *
> > @@ -460,6 +461,8 @@ static int
> udcm_component_query(mca_btl_openib_module_t *btl,
> >
> >  rc = udcm_module_init (m, btl);
> >  if (OMPI_SUCCESS != rc) {
> > +free(m);
> > +m = NULL;
> >  break;
> >  }
> >
> > @@ -536,9 +539,19 @@ static int udcm_endpoint_finalize(struct
> mca_btl_base_endpoint_t *lcl_ep)
> >  return OMPI_SUCCESS;
> >  }
> >
> > +static void *udcm_unmonitor(int fd, int flags, void *context)
> > +{
> > +volatile int *barrier = (volatile int *)context;
> > +
> > +*barrier = 1;
> > +
> > +return NULL;
> > +}
> > +
> >  static int udcm_module_init (udcm_module_t *m, mca_btl_openib_module_t
> *btl)
> >  {
> >  int rc = OMPI_ERR_NOT_SUPPORTED;
> > +volatile int barrier = 0;
> >
> >  BTL_VERBOSE(("created cpc module %p for btl %p",
> >   (void*)m, (void*)btl));
> > @@ -549,7 +562,7 @@ static int udcm_module_init (udcm_module_t *m,
> mca_btl_openib_module_t *btl)
> >  m->cm_channel = ibv_create_comp_channel
> (btl->device->ib_dev_context);
> >  if (NULL == m->cm_channel) {
> >  BTL_VERBOSE(("error creating ud completion channel"));
> > -return OMPI_ERR_NOT_SUPPORTED;
> > +goto out;
> >  }
> >
> >  /* Create completion queues */
> > @@ -558,29 +571,33 @@ static int udcm_module_init (udcm_module_t *m,
> mca_btl_openib_module_t *btl)
> > m->cm_channel, 0);
> >  if (NULL == m->cm_recv_cq) {
> >  BTL_VERBOSE(("error creating ud recv completion queue"));
> > -return OMPI_ERR_NOT_SUPPORTED;
> > +mca_btl_openib_show_init_error(__FILE__, __LINE__,
> "ibv_create_cq",
> > +
>  ibv_get_device_name(btl->device->ib_dev));
> > +goto out1;
> >  }
> >
> >  m->cm_send_cq = ibv_create_cq (btl->device->ib_dev_context,
> > UDCM_SEND_CQ_SIZE, NULL, NULL, 0);
> >  if (NULL == m->cm_send_cq) {
> >  BTL_VERBOSE(("error creating ud send completion queue"));
> > -return OMPI_ERR_NOT_SUPPORTED;
> > +mca_btl_openib_show_init_error(__FILE__, __LINE__,
> "ibv_create_cq",
> > +
>  ibv_get_device_name(btl->device->ib_dev));
> > +goto out2;
> >  }
> >
> >  if (0 != (rc = udcm_module_allocate_buffers (m))) {
> >  BTL_VERBOSE(("error allocating cm buffers"));
> > -return rc;
> > +goto out3;
> >  }
> >
> >  if (0 != (rc = udcm_module_create_listen_qp (m))) {
> >  BTL_VERBOSE(("error creating UD QP"));
> > -return rc;
> > +goto out4;
> >  }
> >
> >  if (0 != (rc = udcm_module_post_all_recvs (m))) {
> >  BTL_VERBOSE(("error posting receives"));
> > -return rc;
> > +goto out5;
> >  }
> >
> >  /* UD CM initialized properly.  So fill in the rest of the CPC
> > @@ -633,12 +650,41 @@ static int udcm_module_init (udcm_module_t *m,
> mca_btl_openib_module_t *btl)
> >  /* Finally, request CQ notification */
> >  if (0 != ibv_req_notify_cq (m->cm_recv_cq, 0)) {
> >  BTL_VERBOSE(("error requesting recv completions"));
> > -return OMPI_ERROR;
> > +rc = OMPI_ERROR;
> > +goto out6;
> >  }
> >
> >  /* Ready to use */
> >
> >  return OMPI_SUCCESS;
> > +
> > +out6:
> > +OBJ_DESTRUCT(&m->cm_timeout_lock);

Re: [OMPI devel] Fwd: OpenIB module initialisation causes segmentation fault when locked memory limit too low

2015-04-22 Thread Nathan Hjelm

I see the problem. I thought I fixed this a while ago but apparently
not. The various OBJ_CONSTRUCT lines should be at the top of the
udcm_module_init to ensure that they are always called. Fixing.

-Nathan

On Wed, Apr 22, 2015 at 01:13:08PM -0600, Nathan Hjelm wrote:
> 
> I will commit the log messages with my fix. Will PR the fix for 1.8.5.
> 
> -Nathan
> 
> On Wed, Apr 22, 2015 at 12:43:56PM -0600, Nathan Hjelm wrote:
> > 
> > Umm, why are you cleaning up this way. The allocated resources *should*
> > be freed by the udcm_module_finalize call. If there is a bug in that
> > path it should be fixed there NOT by adding a bunch of gotos (ick).
> > 
> > I will take a look now and apply the appropriate fix.
> > 
> > -Nathan
> > 
> > On Wed, Apr 22, 2015 at 04:55:57PM +0200, Raphaël Fouassier wrote:
> > > We are experiencing a bug in OpenMPI in 1.8.4 which happens also on
> > > master: if locked memory limits are too low, a segfault happens
> > > in openib/udcm because some memory is not correctly deallocated.
> > > 
> > > To reproduce it, modify /etc/security/limits.conf with:
> > > * soft memlock 64
> > > * hard memlock 64
> > > and launch with mpirun (not in a slurm allocation).
> > > 
> > > 
> > > I propose 2 patches for 1.8.4 and master (because of the btl move to
> > > opal) which:
> > > - free all allocated resources
> > > - print the limits error
> > > 
> > 
> > > diff --git a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c 
> > > b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> > > index 19753a9..b74 100644
> > > --- a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> > > +++ b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> > > @@ -5,6 +5,7 @@
> > >   * Copyright (c) 2009  IBM Corporation.  All rights reserved.
> > >   * Copyright (c) 2011-2014 Los Alamos National Security, LLC.  All rights
> > >   * reserved.
> > > + * Copyright (c) 2015  Bull SAS. All rights reserved.
> > >   *
> > >   * $COPYRIGHT$
> > >   * 
> > > @@ -460,6 +461,8 @@ static int 
> > > udcm_component_query(mca_btl_openib_module_t *btl,
> > >  
> > >  rc = udcm_module_init (m, btl);
> > >  if (OMPI_SUCCESS != rc) {
> > > +free(m);
> > > +m = NULL;
> > >  break;
> > >  }
> > >  
> > > @@ -536,9 +539,19 @@ static int udcm_endpoint_finalize(struct 
> > > mca_btl_base_endpoint_t *lcl_ep)
> > >  return OMPI_SUCCESS;
> > >  }
> > >  
> > > +static void *udcm_unmonitor(int fd, int flags, void *context)
> > > +{
> > > +volatile int *barrier = (volatile int *)context;
> > > +
> > > +*barrier = 1;
> > > +
> > > +return NULL;
> > > +}
> > > +
> > >  static int udcm_module_init (udcm_module_t *m, mca_btl_openib_module_t 
> > > *btl)
> > >  {
> > >  int rc = OMPI_ERR_NOT_SUPPORTED;
> > > +volatile int barrier = 0;
> > >  
> > >  BTL_VERBOSE(("created cpc module %p for btl %p",
> > >   (void*)m, (void*)btl));
> > > @@ -549,7 +562,7 @@ static int udcm_module_init (udcm_module_t *m, 
> > > mca_btl_openib_module_t *btl)
> > >  m->cm_channel = ibv_create_comp_channel 
> > > (btl->device->ib_dev_context);
> > >  if (NULL == m->cm_channel) {
> > >  BTL_VERBOSE(("error creating ud completion channel"));
> > > -return OMPI_ERR_NOT_SUPPORTED;
> > > +goto out;
> > >  }
> > >  
> > >  /* Create completion queues */
> > > @@ -558,29 +571,33 @@ static int udcm_module_init (udcm_module_t *m, 
> > > mca_btl_openib_module_t *btl)
> > > m->cm_channel, 0);
> > >  if (NULL == m->cm_recv_cq) {
> > >  BTL_VERBOSE(("error creating ud recv completion queue"));
> > > -return OMPI_ERR_NOT_SUPPORTED;
> > > +mca_btl_openib_show_init_error(__FILE__, __LINE__, 
> > > "ibv_create_cq",
> > > +   
> > > ibv_get_device_name(btl->device->ib_dev));
> > > +goto out1;
> > >  }
> > >  
> > >  m->cm_send_cq = ibv_create_cq (btl->device->ib_dev_context,
> > > UDCM_SEND_CQ_SIZE, NULL, NULL, 0);
> > >  if (NULL == m->cm_send_cq) {
> > >  BTL_VERBOSE(("error creating ud send completion queue"));
> > > -return OMPI_ERR_NOT_SUPPORTED;
> > > +mca_btl_openib_show_init_error(__FILE__, __LINE__, 
> > > "ibv_create_cq",
> > > +   
> > > ibv_get_device_name(btl->device->ib_dev));
> > > +goto out2;
> > >  }
> > >  
> > >  if (0 != (rc = udcm_module_allocate_buffers (m))) {
> > >  BTL_VERBOSE(("error allocating cm buffers"));
> > > -return rc;
> > > +goto out3;
> > >  }
> > >  
> > >  if (0 != (rc = udcm_module_create_listen_qp (m))) {
> > >  BTL_VERBOSE(("error creating UD QP"));
> > > -return rc;
> > > +goto out4;
> > >  }
> > >  
> > >  if (0 != (rc = udcm_module_post_all_recvs (m))) {
> > >  

Re: [OMPI devel] Fwd: OpenIB module initialisation causes segmentation fault when locked memory limit too low

2015-04-22 Thread Nathan Hjelm

Agreed. goto's just make me grumpy.

-Nathan

On Wed, Apr 22, 2015 at 01:17:11PM -0600, Howard Pritchard wrote:
>Hi Rafael,
>I give you an A+ for effort.   We always appreciate patches.
>Howard
>2015-04-22 12:43 GMT-06:00 Nathan Hjelm :
> 
>  Umm, why are you cleaning up this way. The allocated resources *should*
>  be freed by the udcm_module_finalize call. If there is a bug in that
>  path it should be fixed there NOT by adding a bunch of gotos (ick).
> 
>  I will take a look now and apply the appropriate fix.
> 
>  -Nathan
> 
>  On Wed, Apr 22, 2015 at 04:55:57PM +0200, Raphael Fouassier wrote:
>  > We are experiencing a bug in OpenMPI in 1.8.4 which happens also on
>  > master: if locked memory limits are too low, a segfault happens
>  > in openib/udcm because some memory is not correctly deallocated.
>  >
>  > To reproduce it, modify /etc/security/limits.conf with:
>  > * soft memlock 64
>  > * hard memlock 64
>  > and launch with mpirun (not in a slurm allocation).
>  >
>  >
>  > I propose 2 patches for 1.8.4 and master (because of the btl move to
>  > opal) which:
>  > - free all allocated resources
>  > - print the limits error
>  >
> 
>  > diff --git a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
>  b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
>  > index 19753a9..b74 100644
>  > --- a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
>  > +++ b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
>  > @@ -5,6 +5,7 @@
>  >   * Copyright (c) 2009  IBM Corporation.  All rights reserved.
>  >   * Copyright (c) 2011-2014 Los Alamos National Security, LLC.  All
>  rights
>  >   * reserved.
>  > + * Copyright (c) 2015  Bull SAS. All rights reserved.
>  >   *
>  >   * $COPYRIGHT$
>  >   *
>  > @@ -460,6 +461,8 @@ static int
>  udcm_component_query(mca_btl_openib_module_t *btl,
>  >
>  >  rc = udcm_module_init (m, btl);
>  >  if (OMPI_SUCCESS != rc) {
>  > +free(m);
>  > +m = NULL;
>  >  break;
>  >  }
>  >
>  > @@ -536,9 +539,19 @@ static int udcm_endpoint_finalize(struct
>  mca_btl_base_endpoint_t *lcl_ep)
>  >  return OMPI_SUCCESS;
>  >  }
>  >
>  > +static void *udcm_unmonitor(int fd, int flags, void *context)
>  > +{
>  > +volatile int *barrier = (volatile int *)context;
>  > +
>  > +*barrier = 1;
>  > +
>  > +return NULL;
>  > +}
>  > +
>  >  static int udcm_module_init (udcm_module_t *m,
>  mca_btl_openib_module_t *btl)
>  >  {
>  >  int rc = OMPI_ERR_NOT_SUPPORTED;
>  > +volatile int barrier = 0;
>  >
>  >  BTL_VERBOSE(("created cpc module %p for btl %p",
>  >   (void*)m, (void*)btl));
>  > @@ -549,7 +562,7 @@ static int udcm_module_init (udcm_module_t *m,
>  mca_btl_openib_module_t *btl)
>  >  m->cm_channel = ibv_create_comp_channel
>  (btl->device->ib_dev_context);
>  >  if (NULL == m->cm_channel) {
>  >  BTL_VERBOSE(("error creating ud completion channel"));
>  > -return OMPI_ERR_NOT_SUPPORTED;
>  > +goto out;
>  >  }
>  >
>  >  /* Create completion queues */
>  > @@ -558,29 +571,33 @@ static int udcm_module_init (udcm_module_t *m,
>  mca_btl_openib_module_t *btl)
>  > m->cm_channel, 0);
>  >  if (NULL == m->cm_recv_cq) {
>  >  BTL_VERBOSE(("error creating ud recv completion queue"));
>  > -return OMPI_ERR_NOT_SUPPORTED;
>  > +mca_btl_openib_show_init_error(__FILE__, __LINE__,
>  "ibv_create_cq",
>  > + 
>   ibv_get_device_name(btl->device->ib_dev));
>  > +goto out1;
>  >  }
>  >
>  >  m->cm_send_cq = ibv_create_cq (btl->device->ib_dev_context,
>  > UDCM_SEND_CQ_SIZE, NULL, NULL, 0);
>  >  if (NULL == m->cm_send_cq) {
>  >  BTL_VERBOSE(("error creating ud send completion queue"));
>  > -return OMPI_ERR_NOT_SUPPORTED;
>  > +mca_btl_openib_show_init_error(__FILE__, __LINE__,
>  "ibv_create_cq",
>  > + 
>   ibv_get_device_name(btl->device->ib_dev));
>  > +goto out2;
>  >  }
>  >
>  >  if (0 != (rc = udcm_module_allocate_buffers (m))) {
>  >  BTL_VERBOSE(("error allocating cm buffers"));
>  > -return rc;
>  > +goto out3;
>  >  }
>  >
>  >  if (0 != (rc = udcm_module_create_listen_qp (m))) {
>  >  BTL_VERBOSE(("error creating UD QP"));
>  > -return rc;
>  > + 

Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4

2015-04-22 Thread Tom Wurgler
Well, still not working.  I compiled on the AMD and for 1.6.4 I get:

(note: ideally, at this point, we really want 1.6.4 ,not 1.8.4 (yet)).


1.6.4 using --bind-to-socket --bind-to-core


--
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--
--
mpirun was unable to start the specified application as it encountered an error:

Error name: Input/output error
Node: rdsargo36

when attempting to start process rank 0.
--
Error: Previous command failed (exitcode=1)
=

Now using 1.8.4 and --map-by socket --bind-to core, the job runs and is on the 
cores I want, namely 0,1,2,3,4,5,6,7
BUT when a job lands on a node with other jobs, it oversubscribes the cores.  I 
tried adding --nooversubscribe with no effect.

The already running jobs were submitted with openmpi 1.6.4 using --mca 
mpi-paffinity-alone 1
and none of the bind-core etc args.

We have nodes with 4 sockets with 12 cores each, so we really want to make use 
of as many cores as we can, in the most optimized way.  So if we had a 24-way 
job and then a second 24-way job, we want them packed on the same node with 
processor and memory affinity.
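
For that packing goal on the 1.8 series, the usual combination is mapping by socket, binding to cores, and verifying placement with --report-bindings. A hedged example (flag names from the 1.8 mpirun options; the executable name is hypothetical):

```shell
# Pack 24 ranks, bind each rank to one core, and print the actual bindings
# so any oversubscription on a shared node is immediately visible.
mpirun -np 24 --map-by socket --bind-to core --report-bindings ./eagle
```

With two such jobs on a 48-core node, the --report-bindings output shows whether the second job's ranks landed on free cores or doubled up on cores already bound by the first job.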

Any guidance appreciated.
thanks
tom


From: devel  on behalf of Ralph Castain 

Sent: Friday, April 17, 2015 7:36 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4

Hi Tom

Glad you are making some progress! Note that the 1.8 series uses hwloc for its 
affinity operations, while the 1.4 and 1.6 series used the old plpa code. 
Hence, you will not find the “affinity” components in the 1.8 ompi_info output.

Is there some reason you didn’t compile OMPI on the AMD machine? I ask because 
there are some config switches in various areas that differ between AMD and 
Intel architectures.


On Apr 17, 2015, at 11:16 AM, Tom Wurgler 
mailto:twu...@goodyear.com>> wrote:

Note where I said "1 hour 14 minutes" it should have read "1 hour 24 minutes"...




From: Tom Wurgler
Sent: Friday, April 17, 2015 2:14 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4

Ok, seems like I am making some progress here.  Thanks for the help.
I turned HT off.
Now I can run v 1.4.2, 1.6.4 and 1.8.4 all compiled the same compiler and run 
on the same machine
1.4.2 runs this job in 59 minutes.   1.6.4 and 1.8.4 run the job in 1hr 24 
minutes.
1.4.2 uses just --mca paffinity-alone 1 and the processes are bound
  PID COMMAND CPUMASK  TOTAL   [ N0      N1      N2  N3  N4  N5 ]
22232 prog1         0  469.9M  [ 469.9M  0       0   0   0   0  ]
22233 prog1         1  479.0M  [ 4.0M    475.0M  0   0   0   0  ]
22234 prog1         2  516.7M  [ 516.7M  0       0   0   0   0  ]
22235 prog1         3  485.4M  [ 8.0M    477.4M  0   0   0   0  ]
22236 prog1         4  482.6M  [ 482.6M  0       0   0   0   0  ]
22237 prog1         5  486.6M  [ 6.0M    480.6M  0   0   0   0  ]
22238 prog1         6  481.3M  [ 481.3M  0       0   0   0   0  ]
22239 prog1         7  419.4M  [ 8.0M    411.4M  0   0   0   0  ]

If I use 1.6.4 and 1.8.4 with --mca paffinity-alone 1, the run time is now 1hr 
14 minutes.  The process map now looks like:
bash-4.3# numa-maps -n eagle
  PID COMMAND CPUMASK  TOTAL   [ N0      N1     N2  N3  N4  N5 ]
12248 eagle         0  163.3M  [ 155.3M  8.0M   0   0   0   0  ]
12249 eagle         2  161.6M  [ 159.6M  2.0M   0   0   0   0  ]
12250 eagle         4  164.3M  [ 160.3M  4.0M   0   0   0   0  ]
12251 eagle         6  160.4M  [ 156.4M  4.0M   0   0   0   0  ]
12252 eagle         8  160.6M  [ 154.6M  6.0M   0   0   0   0  ]
12253 eagle        10  159.8M  [ 151.8M  8.0M   0   0   0   0  ]
12254 eagle        12  160.9M  [ 152.9M  8.0M   0   0   0   0  ]
12255 eagle        14  159.8M  [ 157.8M  2.0M   0   0   0   0  ]

If I take off the --mca paffinity-alone 1, and instead use --bysocket 
--bind-to-core (1.6.4) or --map-by socket --bind-to core (1.8.4), the job runs 
in 59 minutes and the process map looks like the 1.4.2 one above...looks super!

Now the issue:

If I move the same openmpi ins

Re: [OMPI devel] Fwd: OpenIB module initialisation causes segmentation fault when locked memory limit too low

2015-04-22 Thread Nathan Hjelm

PR https://github.com/open-mpi/ompi-release/pull/250. Raphaël, can you
please confirm this fixes your issue?

-Nathan

On Wed, Apr 22, 2015 at 04:55:57PM +0200, Raphaël Fouassier wrote:
> We are experiencing a bug in OpenMPI in 1.8.4 which happens also on
> master: if locked memory limits are too low, a segfault happens
> in openib/udcm because some memory is not correctly deallocated.
> 
> To reproduce it, modify /etc/security/limits.conf with:
> * soft memlock 64
> * hard memlock 64
> and launch with mpirun (not in a slurm allocation).
> 
> 
> I propose 2 patches for 1.8.4 and master (because of the btl move to
> opal) which:
> - free all allocated resources
> - print the limits error
> 

> diff --git a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c 
> b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> index 19753a9..b74 100644
> --- a/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> +++ b/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c
> @@ -5,6 +5,7 @@
>   * Copyright (c) 2009  IBM Corporation.  All rights reserved.
>   * Copyright (c) 2011-2014 Los Alamos National Security, LLC.  All rights
>   * reserved.
> + * Copyright (c) 2015  Bull SAS. All rights reserved.
>   *
>   * $COPYRIGHT$
>   * 
> @@ -460,6 +461,8 @@ static int udcm_component_query(mca_btl_openib_module_t 
> *btl,
>  
>  rc = udcm_module_init (m, btl);
>  if (OMPI_SUCCESS != rc) {
> +free(m);
> +m = NULL;
>  break;
>  }
>  
> @@ -536,9 +539,19 @@ static int udcm_endpoint_finalize(struct 
> mca_btl_base_endpoint_t *lcl_ep)
>  return OMPI_SUCCESS;
>  }
>  
> +static void *udcm_unmonitor(int fd, int flags, void *context)
> +{
> +volatile int *barrier = (volatile int *)context;
> +
> +*barrier = 1;
> +
> +return NULL;
> +}
> +
>  static int udcm_module_init (udcm_module_t *m, mca_btl_openib_module_t *btl)
>  {
>  int rc = OMPI_ERR_NOT_SUPPORTED;
> +volatile int barrier = 0;
>  
>  BTL_VERBOSE(("created cpc module %p for btl %p",
>   (void*)m, (void*)btl));
> @@ -549,7 +562,7 @@ static int udcm_module_init (udcm_module_t *m, 
> mca_btl_openib_module_t *btl)
>  m->cm_channel = ibv_create_comp_channel (btl->device->ib_dev_context);
>  if (NULL == m->cm_channel) {
>  BTL_VERBOSE(("error creating ud completion channel"));
> -return OMPI_ERR_NOT_SUPPORTED;
> +goto out;
>  }
>  
>  /* Create completion queues */
> @@ -558,29 +571,33 @@ static int udcm_module_init (udcm_module_t *m, 
> mca_btl_openib_module_t *btl)
> m->cm_channel, 0);
>  if (NULL == m->cm_recv_cq) {
>  BTL_VERBOSE(("error creating ud recv completion queue"));
> -return OMPI_ERR_NOT_SUPPORTED;
> +mca_btl_openib_show_init_error(__FILE__, __LINE__, "ibv_create_cq",
> +   
> ibv_get_device_name(btl->device->ib_dev));
> +goto out1;
>  }
>  
>  m->cm_send_cq = ibv_create_cq (btl->device->ib_dev_context,
> UDCM_SEND_CQ_SIZE, NULL, NULL, 0);
>  if (NULL == m->cm_send_cq) {
>  BTL_VERBOSE(("error creating ud send completion queue"));
> -return OMPI_ERR_NOT_SUPPORTED;
> +mca_btl_openib_show_init_error(__FILE__, __LINE__, "ibv_create_cq",
> +   
> ibv_get_device_name(btl->device->ib_dev));
> +goto out2;
>  }
>  
>  if (0 != (rc = udcm_module_allocate_buffers (m))) {
>  BTL_VERBOSE(("error allocating cm buffers"));
> -return rc;
> +goto out3;
>  }
>  
>  if (0 != (rc = udcm_module_create_listen_qp (m))) {
>  BTL_VERBOSE(("error creating UD QP"));
> -return rc;
> +goto out4;
>  }
>  
>  if (0 != (rc = udcm_module_post_all_recvs (m))) {
>  BTL_VERBOSE(("error posting receives"));
> -return rc;
> +goto out5;
>  }
>  
>  /* UD CM initialized properly.  So fill in the rest of the CPC
> @@ -633,12 +650,41 @@ static int udcm_module_init (udcm_module_t *m, 
> mca_btl_openib_module_t *btl)
>  /* Finally, request CQ notification */
>  if (0 != ibv_req_notify_cq (m->cm_recv_cq, 0)) {
>  BTL_VERBOSE(("error requesting recv completions"));
> -return OMPI_ERROR;
> +rc = OMPI_ERROR;
> +goto out6;
>  }
>  
>  /* Ready to use */
>  
>  return OMPI_SUCCESS;
> +
> +out6:
> +OBJ_DESTRUCT(&m->cm_timeout_lock);
> +OBJ_DESTRUCT(&m->flying_messages);
> +OBJ_DESTRUCT(&m->cm_recv_msg_queue_lock);
> +OBJ_DESTRUCT(&m->cm_recv_msg_queue);
> +OBJ_DESTRUCT(&m->cm_send_lock);
> +OBJ_DESTRUCT(&m->cm_lock);
> +
> +m->channel_monitored = false;
> +
> +ompi_btl_openib_fd_unmonitor(m->cm_channel->fd,
> + udcm_unmonitor, (void *)&barrier);
> +

Re: [OMPI devel] Incorrect timer frequency [w/ patch]

2015-04-22 Thread Jeff Squyres (jsquyres)
Fixed in 
https://github.com/open-mpi/ompi/commit/46aa20a9191db2f5cc1850c0f4f881ac51653cb4.

Thanks!

> On Apr 22, 2015, at 3:01 PM, Paul Hargrove  wrote:
> 
> I had reason to look at the linux timer code today and noticed what I believe 
> to be a subtle error.
> This is in both 'master' and v1.8.5rc2
> 
> Since casts bind tighter than multiplication in C, I believe that the 1-line 
> patch below is required to produce the desired result of conversion to an 
> integer *after* the multiplication.
> While I don't see how to exercise the code, I can see that cpu_f=2692.841 
> will yield opal_timer_linux_freq=2,692,000,000 rather than 2,692,841,000.
> 
> -Paul
> 
> 
> diff --git a/opal/mca/timer/linux/timer_linux_component.c 
> b/opal/mca/timer/linux/timer_linux_component.c
> index b130826..7abe578 100644
> --- a/opal/mca/timer/linux/timer_linux_component.c
> +++ b/opal/mca/timer/linux/timer_linux_component.c
> @@ -128,7 +128,7 @@ static int opal_timer_linux_find_freq(void)
>  ret = sscanf(loc, "%f", &cpu_f);
>  if (1 == ret) {
>  /* numer is in MHz - convert to Hz and make an integer */
> -opal_timer_linux_freq = (opal_timer_t) cpu_f * 1000000;
> +opal_timer_linux_freq = (opal_timer_t) (cpu_f * 1000000);
>  }
>  }
>  }
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17312.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Broken flex-required error message

2015-04-22 Thread Jeff Squyres (jsquyres)
Fixed -- thanks: 
https://github.com/open-mpi/ompi/commit/4b8fa246824418f8bd46419286bb1bcb8ce6e941

FWIW, it *did* print a list of files for me on my Mac when I faked it out and 
forced it to *not* find flex.

Shrug.

So I just took that part of the error message out -- let devs always install 
flex.  :-)


> On Apr 22, 2015, at 2:15 PM, Paul Hargrove  wrote:
> 
> When building from a git clone of master I encountered the following:
> 
> checking for flex... no
> checking for lex... no
> configure: WARNING: *** Could not find GNU Flex on your system.
> configure: WARNING: *** GNU Flex required for developer builds of Open MPI.
> configure: WARNING: *** Other versions of Lex are not supported.
> configure: WARNING: *** YOU DO NOT NEED FLEX FOR DISTRIBUTION TARBALLS!
> configure: WARNING: *** If you absolutely cannot install GNU Flex on this 
> system
> configure: WARNING: *** consider using a distribution tarball, or generate the
> configure: WARNING: *** following files on another system (using Flex) and
> configure: WARNING: *** copy them here:
> configure: error: Cannot continue
> 
> 
> I do not disagree with the requirement for flex.
> However, there are multiple problems with that output.
> 
> 1) The "following files" list is EMPTY.
> If there is no way to output the list, then the last 4 WARNING lines are 
> pointless.
> 
> 2) A minor grammar point: "Flex required" should be "Flex is required"
> 
> 3) It is NOT known as "GNU Flex".
> Quoting from https://www.gnu.org/software/flex/flex.html
>Flex is a free (but non-GNU) implementation of the original Unix lex 
> program.
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17310.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Broken flex-required error message

2015-04-22 Thread Paul Hargrove
On Wed, Apr 22, 2015 at 2:02 PM, Jeff Squyres (jsquyres)  wrote:

> FWIW, it *did* print a list of files for me on my Mac when I faked it out
> and forced it to *not* find flex.
>


A quick look at the commit shows
for lfile in `find . -name \*.l -print`; do

Notice the find is rooted at ".".
I pretty much always use VPATH builds, and I am guessing you did not.
That would explain the difference in output.

If you should happen to want to restore the logic to print the files, then
"find ." should probably be "find $(top_srcdir)".
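The difference is easy to reproduce outside of configure. A minimal sketch (using throwaway temp directories, not OMPI's actual build tree):

```shell
# Demonstrate why `find .` misses *.l files in a VPATH build: the build
# tree and the source tree are different directories.
set -e
srcdir=$(mktemp -d)
builddir=$(mktemp -d)
touch "$srcdir/scanner.l"

cd "$builddir"
# Rooted at the build dir (what the commit does): finds nothing.
in_build=$(find . -name '*.l' -print)
# Rooted at the source dir (the suggested $(top_srcdir)): finds the file.
in_src=$(find "$srcdir" -name '*.l' -print)

echo "build-rooted list: [$in_build]"
echo "src-rooted list:   [$in_src]"
```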

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] Broken flex-required error message

2015-04-22 Thread Jeff Squyres (jsquyres)
Good catch.

But I'm ok forcing devs to install flex.  Or be creative -- without any help -- 
to generate .c files from .l files themselves.  :-)


> On Apr 22, 2015, at 5:08 PM, Paul Hargrove  wrote:
> 
> 
> On Wed, Apr 22, 2015 at 2:02 PM, Jeff Squyres (jsquyres)  
> wrote:
> FWIW, it *did* print a list of files for me on my Mac when I faked it out and 
> forced it to *not* find flex.
> 
> 
> A quick look at the commit shows
> for lfile in `find . -name \*.l -print`; do
> 
> Notice the find is rooted at ".".
> I pretty much always use VPATH builds, and I am guessing you did not.
> That would explain the difference in output.
> 
> If you should happen to want to restore the logic to print the files, then 
> "find ." should probably be "find $(top_srcdir)"
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17321.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] powerpc64le support [1-line patch]

2015-04-22 Thread Paul Hargrove
I had an opportunity to try the 1.8.5rc2 tarball on a little-endian POWER8
(aka ppc64el or powerpc64le).
The good news is that things "just worked" as they did when I tried ARMv8
(aka aarch64).

However, I see a little room for improvement with almost no work at all.

I noticed:

checking for __sync builtin atomics... yes
checking for assembly architecture... UNSUPPORTED
checking for builtin atomics... BUILTIN_SYNC
checking for atomic assembly filename... none

and I confirmed the same behavior on master.

The existing powerpc64 inline asm should work.
So, I made a one-line change (at bottom of this email) to recognize the
architecture and now get "native" atomics.

checking if PowerPC registers have r prefix... no
checking if powerpc64le-linux-gnu-gcc-4.9 -std=gnu99 supports GCC inline
assembly... yes
checking if powerpc64le-linux-gnu-gcc-4.9 -std=gnu99 supports DEC inline
assembly... no
checking if powerpc64le-linux-gnu-gcc-4.9 -std=gnu99 supports XLC inline
assembly... no
checking for assembly format... default-.text-.globl-:--.L-@-1-1-0-1-1
checking for assembly architecture... POWERPC64
checking for builtin atomics... BUILTIN_NO
checking for perl... (cached) perl
checking for pre-built assembly file... no (not in asm-data)
checking whether possible to generate assembly file... yes
checking for atomic assembly filename... atomic-local.s


and

phargrov@ppc64el:~/OMPI/ompi-master/BLD$ make -C test/asm check
[...]

Testsuite summary for Open MPI gitclone

# TOTAL: 8
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0




In addition to the one-line patch below, I needed to run autogen.pl with a
new enough config/config.{guess,sub}.
Along the way I noticed
opal/mca/common/libfabric/libfabric/config/config.guess
opal/mca/common/libfabric/libfabric/config/config.sub
opal/mca/hwloc/hwloc191/hwloc/config/config.guess
opal/mca/hwloc/hwloc191/hwloc/config/config.sub
which appear to be too old to recognize powerpc64le and are *not* updated
when autogen.pl is run.
I manually updated them, but somebody may want to either commit newer
versions to git or teach autogen.pl to update them.
FWIW, it appears that tarballs *are* generated with up-to-date versions.
Go figure.
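For what it's worth, a bundled config.sub can be checked directly: a new-enough copy echoes the canonicalized triplet, while a too-old copy exits with an "Invalid configuration" error. A sketch (path taken from the list above; adjust to your tree):

```shell
# Ask the bundled config.sub to canonicalize the little-endian triplet.
# An up-to-date copy should print something like
# "powerpc64le-unknown-linux-gnu"; an old one fails instead.
sh opal/mca/hwloc/hwloc191/hwloc/config/config.sub powerpc64le-linux-gnu
```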


-Paul

diff --git a/config/opal_config_asm.m4 b/config/opal_config_asm.m4
index 7ceadfc..c5f72c5 100644
--- a/config/opal_config_asm.m4
+++ b/config/opal_config_asm.m4
@@ -988,7 +988,7 @@ AC_DEFUN([OPAL_CONFIG_ASM],[
 OPAL_GCC_INLINE_ASSIGN='"or %0,[$]0,[$]0" : "=&r"(ret)'
 ;;

-powerpc-*|powerpc64-*|rs6000-*|ppc-*)
+powerpc-*|powerpc64-*|powerpc64le-*|rs6000-*|ppc-*)
 OPAL_CHECK_POWERPC_REG
 if test "$ac_cv_sizeof_long" = "4" ; then
 opal_cv_asm_arch="POWERPC32"

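The effect of the one-line change can be sketched with a plain shell `case` mirroring the configure pattern (a standalone illustration, not the actual configure code):

```shell
# The configure script selects the assembly architecture by matching the
# host triplet in a case statement.  Before the patch, powerpc64le-*
# triplets fell through to the default (UNSUPPORTED) branch.
guess_asm_arch() {
    case "$1" in
        powerpc-*|powerpc64-*|powerpc64le-*|rs6000-*|ppc-*)
            echo POWERPC ;;
        *)
            echo UNSUPPORTED ;;
    esac
}

guess_asm_arch powerpc64-linux-gnu     # matched before and after the patch
guess_asm_arch powerpc64le-linux-gnu   # matched only once powerpc64le-* is added
```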

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.8.5rc2 released

2015-04-22 Thread Jeff Squyres (jsquyres)
On Apr 22, 2015, at 10:05 AM, Marco Atzeri  wrote:
> 
> Making all in mpi/fortran/use-mpi-f08
> make[2]: Entering directory 
> '/cygdrive/e/cyg_pub/devel/openmpi/openmpi-1.8.5rc2-1.x86_64/build/ompi/mpi/fortran/use-mpi-f08'
>  FCLD libmpi_usempif08.la
> .libs/abort_f08.o: In function `mpi_abort_f08_':
> /usr/src/debug/openmpi-1.8.5rc2-1/ompi/mpi/fortran/use-mpi-f08/abort_f08.F90:17:
>  undefined reference to `ompi_abort_f'
> /usr/src/debug/openmpi-1.8.5rc2-1/ompi/mpi/fortran/use-mpi-f08/abort_f08.F90:17:(.text+0xe):
>  relocation truncated to fit: R_X86_64_PC32 against undefined symbol 
> `ompi_abort_f'
> .libs/accumulate_f08.o: In function `mpi_accumulate_f08_':
> /usr/src/debug/openmpi-1.8.5rc2-1/ompi/mpi/fortran/use-mpi-f08/accumulate_f08.F90:28:
>  undefined reference to `ompi_accumulate_f'

Mmm.  That's a minor bummer -- we usually only link libompi_mpifh in the 
wrapper compiler, not here.  But I suspect it is not harmful to also link it 
here -- thanks for the patch.

> Question:
> what is the scope of the new two shared libs
> 
> usr/bin/cygmpi_usempi_ignore_tkr-0.dll
> usr/bin/cygmpi_usempif08-0.dll
> 
> in comparison to previous
> 
> usr/bin/cygmpi_mpifh-2.dll
> usr/bin/cygmpi_usempi-1.dll
> 
> already present in 1.8.4 ?

All 4 were present in 1.8.4, too -- but which of the Fortran libraries get 
built depends on your compiler.

I'm guessing you upgraded your fortran compiler?

With an "old" fortran compiler, we build the "old" Open MPI "use mpi" Fortran 
bindings -- cygmpi_usempi-1.dll (which is basically some script-generated 
code).  

With a "new" fortran compiler, we build the "new" Open MPI "use mpi" Fortran 
bindings -- cygmpi_usempi_ignore_tkr-0.dll.  This is the same Fortran bindings 
interface as the usempi library, but it uses a compiler extension (that was 
found by configure) that is effectively a (void*) equivalent in Fortran (the 
extension is called "Ignore TKR").  The code that is compiled into the 
usempi_ignore_tkr library is quite a bit simpler, cleaner, and more inclusive 
than the generated code.

The usempif08 library is the "use mpi_f08" bindings; it will only be built if 
you have a "new" Fortran compiler.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] Compile "remark" for Openmpi 1.8.4

2015-04-22 Thread Tom Wurgler

Compilation of OpenMPI 1.8.4 using Intel compiler version 14.0.4.211 results in 
usable code but has the following "remarks":
thanks
tom


make[2]: Entering directory 
`/home02/tom/src/openmpi-1.8.4_intel_1404211/ompi/mpi/fortran/use-mpi-f08'
  PPFC mpi-f08-types.lo
  GENERATE sizeof_f08.h
  CC   constants.lo
  GENERATE sizeof_f08.f90
  GENERATE profile/psizeof_f08.f90
  FC   sizeof_f08.lo
  FC   profile/psizeof_f08.lo
  PPFC mpi-f08-interfaces-callbacks.lo
  PPFC mpi-f08-interfaces.lo
  PPFC pmpi-f08-interfaces.lo
pmpi-f08-interfaces.F90(28): remark #5140: Unrecognized directive
!DIR$ IGNORE_TKR buf
^
pmpi-f08-interfaces.F90(45): remark #5140: Unrecognized directive
!DIR$ IGNORE_TKR buf
^
pmpi-f08-interfaces.F90(62): remark #5140: Unrecognized directive
!DIR$ IGNORE_TKR buffer
---^
pmpi-f08-interfaces.F90(76): remark #5140: Unrecognized directive
!DIR$ IGNORE_TKR buffer_addr
^
pmpi-f08-interfaces.F90(111): remark #5140: Unrecognized directive
!DIR$ IGNORE_TKR buf
^
[lots more of same, so truncated here]

Re: [OMPI devel] powerpc64le support [1-line patch]

2015-04-22 Thread Jeff Squyres (jsquyres)
On Apr 22, 2015, at 5:19 PM, Paul Hargrove  wrote:
> 
> I had an opportunity to try the 1.8.5rc2 tarball on a little-endian POWER8 
> (aka ppc64el or powerpc64le).
> The existing powerpc64 inline asm should work.

Sweet -- I put your patch in here:

   https://github.com/open-mpi/ompi/pull/550

Just to run it by the assembly / IBM/POWER gatekeepers first.  :-)

> In addition to the one-line patch below, I needed to run autogen.pl with a 
> new enough config/config.{guess,sub}.
> Along the way I noticed
> opal/mca/common/libfabric/libfabric/config/config.guess
> opal/mca/common/libfabric/libfabric/config/config.sub
> opal/mca/hwloc/hwloc191/hwloc/config/config.guess
> opal/mca/hwloc/hwloc191/hwloc/config/config.sub
> which appear to be too old to recognize powerpc64le and are *not* updated 
> when autogen.pl is run.

It's ok -- we don't run those scripts during OMPI's top-level configure (that's 
why they're not re-generated during autogen.pl).

> I manually updated them, but somebody may want to either commit newer 
> versions to git or teach autogen.pl to update them.
> FWIW, it appears that tarballs *are* generated with up-to-date versions.  Go 
> figure.

I believe the script we use to make official tarballs removes/replaces *all* 
config.sub|guess, just to be completely thorough.  But it's overkill / just 
being defensive; it isn't technically necessary.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Compile "remark" for Openmpi 1.8.4

2015-04-22 Thread Jeff Squyres (jsquyres)
This is actually expected.

We use compiler pragmas in the Fortran code that are not recognized by all 
compilers.  But they're safely ignored (even though they're noisy).  :-\
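If the noise is bothersome, Intel's compilers can suppress a specific remark number; a hedged sketch (assuming your ifort version supports `-diag-disable`):

```shell
# Disable remark #5140 ("Unrecognized directive") for the whole build.
./configure FC=ifort FCFLAGS="-diag-disable 5140"
```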


> On Apr 22, 2015, at 5:22 PM, Tom Wurgler  wrote:
> 
> 
> Compilation of OpenMPI 1.8.4 using Intel compiler version 14.0.4.211 results 
> in usable code but has the following "remarks":
> thanks
> tom
> 
> 
> make[2]: Entering directory 
> `/home02/tom/src/openmpi-1.8.4_intel_1404211/ompi/mpi/fortran/use-mpi-f08'
>  PPFC mpi-f08-types.lo
>  GENERATE sizeof_f08.h
>  CC   constants.lo
>  GENERATE sizeof_f08.f90
>  GENERATE profile/psizeof_f08.f90
>  FC   sizeof_f08.lo
>  FC   profile/psizeof_f08.lo
>  PPFC mpi-f08-interfaces-callbacks.lo
>  PPFC mpi-f08-interfaces.lo
>  PPFC pmpi-f08-interfaces.lo
> pmpi-f08-interfaces.F90(28): remark #5140: Unrecognized directive
> !DIR$ IGNORE_TKR buf
> ^
> pmpi-f08-interfaces.F90(45): remark #5140: Unrecognized directive
> !DIR$ IGNORE_TKR buf
> ^
> pmpi-f08-interfaces.F90(62): remark #5140: Unrecognized directive
> !DIR$ IGNORE_TKR buffer
> ---^
> pmpi-f08-interfaces.F90(76): remark #5140: Unrecognized directive
> !DIR$ IGNORE_TKR buffer_addr
> ^
> pmpi-f08-interfaces.F90(111): remark #5140: Unrecognized directive
> !DIR$ IGNORE_TKR buf
> ^
> [lots more of same, so truncated here]
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17325.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.8.5rc2 released

2015-04-22 Thread Jeff Squyres (jsquyres)
Oops -- missed this when I reviewed / updated README for v1.8.  Will fix -- 
thanks!

> On Apr 22, 2015, at 12:02 AM, Paul Hargrove  wrote:
> 
> Unless I am mistaken, the text quoted below from README no longer reflects 
> the current behavior.
> The text appears to be the same in master and v1.8.
> 
> -Paul
> 
> --with-libltdl(=value)
>   This option specifies where to find the GNU Libtool libltdl support
>   library.  The following values are permitted:
> 
> internal:Use Open MPI's internal copy of libltdl.
> external:Use an external libltdl installation (rely on default
>  compiler and linker paths to find it)
> :  Same as "internal".
> : Specify the location of a specific libltdl
>  installation to use
> 
>   By default (or if --with-libltdl is specified with no VALUE), Open
>   MPI will build and use the copy of libltdl that it has in its source
>   tree.  However, if the VALUE is "external", Open MPI will look for
>   the relevant libltdl header file and library in default compiler /
>   linker locations.  Or, VALUE can be a directory tree where the
>   libltdl header file and library can be found.  This option allows
>   operating systems to include Open MPI and use their default libltdl
>   installation instead of Open MPI's bundled libltdl.
> 
> 
> On Tue, Apr 21, 2015 at 3:43 PM, Jeff Squyres (jsquyres)  
> wrote:
> In the usual location:
> 
> http://www.open-mpi.org/software/ompi/v1.8/
> 
> The NEWS changed completely between rc1 and rc2, so I can't easily tell 
> exactly what is different between rc1 and rc2.  Here's the full 1.8.5 NEWS:
> 
> - Fixed configure problems in some cases when using an external hwloc
>   installation.  Thanks to Erick Schnetter for reporting the error and
>   helping track down the source of the problem.
> - Fixed linker error on OS X when using the clang compiler.  Thanks to
>   Erick Schnetter for reporting the error and helping track down the
>   source of the problem.
> - Fixed MPI_THREAD_MULTIPLE deadlock error in the vader BTL.  Thanks
>   to Thomas Klimpel for reporting the issue.
> - Fixed several Valgrind warnings.  Thanks to Lisandro Dalcin for
>   contributing a patch fixing some one-sided code paths.
> - Fixed version compatibility test in OOB that broke ABI within the
>   1.8 series. NOTE: this will not resolve the problem between pre-1.8.5
>   versions, but will fix it going forward.
> - Fix some issues related to running on Intel Xeon Phi coprocessors.
> - Opportunistically switch away from using GNU Libtool's libltdl
>   library when possible (by default).
> - Fix some VampirTrace errors.  Thanks to Paul Hargrove for reporting
>   the issues.
> - Correct default binding patterns when --use-hwthread-cpus was
>   specified and nprocs <= 2.
> - Fix warnings about -finline-functions when compiling with clang.
> - Updated the embedded hwloc with several bug fixes, including the
>   "duplicate Lhwloc1 symbol" that multiple users reported on some
>   platforms.
> - Do not error when mpirun is invoked with default bindings
>   (i.e., no binding was specified), and one or more nodes do not
>   support bindings.  Thanks to Annu Desari for pointing out the
>   problem.
> - Let root invoke "mpirun --version" to check the version without
>   printing the "Don't run as root!" warnings.  Thanks to Robert McLay
>   for the suggestion.
> - Fixed several bugs in OpenSHMEM support.
> - Extended vader shared memory support to 32-bit architectures.
> - Fix handling of very large datatypes.  Thanks to Bogdan Sataric for
>   the bug report.
> - Fixed a bug in handling subarray MPI datatypes, and a bug when using
>   MPI_LB and MPI_UB.  Thanks to Gus Correa for pointing out the issue.
> - Restore user-settable bandwidth and latency PML MCA variables.
> - Multiple bug fixes for cleanup during MPI_FINALIZE in unusual
>   situations.
> - Added support for TCP keepalive signals to ensure timely termination
>   when sockets between daemons cannot be created (e.g., due to a
>   firewall).
> - Added MCA parameter to allow full use of a SLURM allocation when
>   started from a tool (supports LLNL debugger).
> - Fixed several bugs in the configure logic for PMI and hwloc.
> - Fixed incorrect interface index in TCP communications setup.  Thanks
>   to Mark Kettenis for spotting the problem and providing a patch.
> - Fixed MPI_IREDUCE_SCATTER with single-process communicators when
>   MPI_IN_PLACE was not used.
> - Added XRC support for OFED v3.12 and higher.
> - Various updates and bug fixes to the Mellanox hcoll collective
>   support.
> - Fix problems with Fortran compilers that did not support
>   REAL*16/COMPLEX*32 types.  Thanks to Orion Poplawski for identifying
>   the issue.
> - Fixed problem with rpath/runpath support in pkg-config files.
>   Thanks to Christoph Junghans for notifying us of the issue.
> - Man page fixes:
>   - Removed erroneous "color" discussion from MPI_COMM_SPLIT_TYPE.
> Thanks to Erick S

[OMPI devel] 1.8.5rc2 testing report

2015-04-22 Thread Paul Hargrove
Well, I tried rc2 on just about everything except my phone and my linksys.

For me the configure failure (dlopen() not found) on {Free,Net,Open}BSD is
the only problem.
Since it works on 'master' I am confident Jeff will sort this out.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.8.5rc2 released

2015-04-22 Thread Marco Atzeri

On 4/22/2015 11:19 PM, Jeff Squyres (jsquyres) wrote:




Question:
what is the scope of the new two shared libs

usr/bin/cygmpi_usempi_ignore_tkr-0.dll
usr/bin/cygmpi_usempif08-0.dll

in comparison to previous

usr/bin/cygmpi_mpifh-2.dll
usr/bin/cygmpi_usempi-1.dll

already present in 1.8.4 ?


All 4 were present in 1.8.4, too -- but which of the Fortran libraries get 
built depends on your compiler.

I'm guessing you upgraded your fortran compiler?


eventually just from 4.8.x to 4.9.x



With an "old" fortran compiler, we build the "old" Open MPI "use mpi" Fortran 
bindings -- cygmpi_usempi-1.dll (which is basically some script-generated code).

With a "new" fortran compiler, we build the "new" Open MPI "use mpi" Fortran bindings -- 
cygmpi_usempi_ignore_tkr-0.dll.  This is the same Fortran bindings interface as the usempi library, but it uses a 
compiler extension (that was found by configure) that is effectively a (void*) equivalent in Fortran (the extension is 
called "Ignore TKR").  The code that is compiled into the usempi_ignore_tkr library is quite a bit simpler, 
cleaner, and more inclusive than the generated code.

The usempif08 library is the "use mpi_f08" bindings; it will only be built if you have a 
"new" Fortran compiler.


It seems I will need to add two Fortran sub-packages ...



Re: [OMPI devel] 1.8.5rc2 released

2015-04-22 Thread Jeff Squyres (jsquyres)
On Apr 22, 2015, at 6:18 PM, Marco Atzeri  wrote:
> 
>> I'm guessing you upgraded your fortran compiler?
> 
> eventually just from 4.8.x to 4.9x

Yep -- that would do it.  gfortran 4.8.x is "old enough" Fortran, gfortran 
4.9.x is "new enough" Fortran.

>> The usempif08 library is the "use mpi_f08" bindings; it will only be built 
>> if you have a "new" Fortran compiler.
> 
> It seems I will need to add two Fortran sub-packages ...

It's a bit tricky.  It depends on what version of the fortran compiler the user 
wants to use.  You will basically need to match it (binary-wise).  :-\

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.8.5rc2 released

2015-04-22 Thread Jeff Squyres (jsquyres)
I think we missed 2 commits on v1.8.  Filed PR 
https://github.com/open-mpi/ompi-release/pull/254 to fix the problem.

bot:hargrove -- can you test?


> On Apr 21, 2015, at 8:40 PM, Paul Hargrove  wrote:
> 
> 
> 
> On Tue, Apr 21, 2015 at 5:33 PM, Jeff Squyres (jsquyres)  
> wrote:
> What happens with master tar balls?
> 
> Master is fine building dl:dlopen:
> 
> --- MCA component dl:dlopen (m4 configuration macro, priority 80)
> checking for MCA component dl:dlopen compile mode... static
> checking dlfcn.h usability... yes
> checking dlfcn.h presence... yes
> checking for dlfcn.h... yes
> looking for library without search path
> checking for library containing dlopen... none required
> checking if MCA component dl:dlopen can compile... yes
> 
> -Paul
> 
> 
> 
>  
> 
> Sent from my phone. No type good. 
> 
> On Apr 21, 2015, at 7:38 PM, Paul Hargrove  wrote:
> 
>> Sorry the output in the previous email left out some relevant detail.
>> See here that BOTH dl components were unable to compile with the 1.8.5rc2 
>> tarball:
>> 
>> +++ Configuring MCA framework dl
>> checking for no configure components in framework dl...
>> checking for m4 configure components in framework dl... libltdl, dlopen
>> 
>> --- MCA component dl:dlopen (m4 configuration macro, priority 80)
>> checking for MCA component dl:dlopen compile mode... static
>> checking dlfcn.h usability... yes
>> checking dlfcn.h presence... yes
>> checking for dlfcn.h... yes
>> looking for library without search path
>> checking for dlopen in -ldl... no
>> checking if MCA component dl:dlopen can compile... no   
>> 
>> --- MCA component dl:libltdl (m4 configuration macro, priority 50)
>> checking for MCA component dl:libltdl compile mode... static
>> checking --with-libltdl value... simple ok (unspecified)
>> checking --with-libltdl-libdir value... simple ok (unspecified)
>> checking for libltdl dir... compiler default
>> checking for libltdl library dir... linker default
>> checking ltdl.h usability... no
>> checking ltdl.h presence... no
>> checking for ltdl.h... no
>> checking if MCA component dl:libltdl can compile... no
>> configure: WARNING: Did not find a suitable static opal dl component
>> configure: WARNING: You might need to install libltld (and its headers) or
>> configure: WARNING: specify --disable-dlopen to configure.
>> configure: error: Cannot continue
>> 
>> I am getting this on ALL of my {Free,Net,Open}BSD platforms.
>> However, they all built the dl:dlopen component fine when testing Jeff's 
>> tarballs from PR410:
>> 
>> --- MCA component dl:dlopen (m4 configuration macro, priority 80)
>> checking for MCA component dl:dlopen compile mode... static
>> checking dlfcn.h usability... yes
>> checking dlfcn.h presence... yes
>> checking for dlfcn.h... yes
>> looking for library without search path
>> checking for library containing dlopen... none required
>> checking if MCA component dl:dlopen can compile... yes
>> 
>> The key difference I see is that dlopen() is available in libc, not in (the 
>> non-existent libdl).
>> So it looks likely that something wasn't brought over correctly/completely 
>> from master to v1.8.
>> 
>> -Paul [a.k.a. bot:hargrove]
>> 
>> 
>> 
>> On Tue, Apr 21, 2015 at 4:22 PM, Paul Hargrove  wrote:
>> Is the following configure-fails-by-default behavior really the desired one 
>> in 1.8.5?
>> I thought this was more of a 1.9 change than a mid-series change.
>> 
>> -Paul
>> 
>> --- MCA component dl:libltdl (m4 configuration macro, priority 50)
>> checking for MCA component dl:libltdl compile mode... static
>> checking --with-libltdl value... simple ok (unspecified)
>> checking --with-libltdl-libdir value... simple ok (unspecified)
>> checking for libltdl dir... compiler default
>> checking for libltdl library dir... linker default
>> checking ltdl.h usability... no
>> checking ltdl.h presence... no
>> checking for ltdl.h... no
>> checking if MCA component dl:libltdl can compile... no
>> configure: WARNING: Did not find a suitable static opal dl component
>> configure: WARNING: You might need to install libltld (and its headers) or
>> configure: WARNING: specify --disable-dlopen to configure.
>> configure: error: Cannot continue
>> 
>> On Tue, Apr 21, 2015 at 3:43 PM, Jeff Squyres (jsquyres) 
>>  wrote:
>> In the usual location:
>> 
>> http://www.open-mpi.org/software/ompi/v1.8/
>> 
>> The NEWS changed completely between rc1 and rc2, so I can't easily tell 
>> exactly what is different between rc1 and rc2.  Here's the full 1.8.5 NEWS:
>> 
>> - Fixed configure problems in some cases when using an external hwloc
>>   installation.  Thanks to Erick Schnetter for reporting the error and
>>   helping track down the source of the problem.
>> - Fixed linker error on OS X when using the clang compiler.  Thanks to
>>   Erick Schnetter for reporting the error and helping track down the
>>   source of the problem.
>> - Fixed MPI_THREAD_MULTIPLE deadlock error in the vader BTL.  Thanks
>>   to Thomas Klimpel for reporting th

Re: [OMPI devel] powerpc64le support [1-line patch]

2015-04-22 Thread Paul Hargrove
On Wed, Apr 22, 2015 at 2:43 PM, Jeff Squyres (jsquyres)  wrote:

> > In addition to the one-line patch below, I needed to run autogen.pl
> with a new enough config/config.{guess,sub}.
> > Along the way I noticed
> > opal/mca/common/libfabric/libfabric/config/config.guess
> > opal/mca/common/libfabric/libfabric/config/config.sub
> > opal/mca/hwloc/hwloc191/hwloc/config/config.guess
> > opal/mca/hwloc/hwloc191/hwloc/config/config.sub
> > which appear to be too old to recognize powerpc64le and are *not*
> updated when autogen.pl is run.
>
> It's ok -- we don't run those scripts during OMPI's top-level configure
> (that's why they're not re-generated during autogen.pl).


OK, makes sense.

> I believe the script we use to make official tarballs removes/replaces
> *all* config.sub|guess, just to be completely thorough.
> But it's overkill / just being defensive; it isn't technically necessary.


So the script author(s) suffered from the same paranoia I was applying to
the situation.  :-)


-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.8.5rc2 released

2015-04-22 Thread Paul Hargrove
On Wed, Apr 22, 2015 at 4:20 PM, Jeff Squyres (jsquyres)  wrote:

> I think we missed 2 commits on v1.8.  Filed PR
> https://github.com/open-mpi/ompi-release/pull/254 to fix the problem.
>
> bot:hargrove -- can you test?
>

Initial testing failed at autogen.pl (as did Jenkins).
I am past that point and making updates to the PR instead of this email
thread.

-Paul


>
>
> > On Apr 21, 2015, at 8:40 PM, Paul Hargrove  wrote:
> >
> >
> >
> > On Tue, Apr 21, 2015 at 5:33 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > What happens with master tar balls?
> >
> > Master is fine building dl:dlopen:
> >
> > --- MCA component dl:dlopen (m4 configuration macro, priority 80)
> > checking for MCA component dl:dlopen compile mode... static
> > checking dlfcn.h usability... yes
> > checking dlfcn.h presence... yes
> > checking for dlfcn.h... yes
> > looking for library without search path
> > checking for library containing dlopen... none required
> > checking if MCA component dl:dlopen can compile... yes
> >
> > -Paul
> >
> >
> >
> >
> >
> > Sent from my phone. No type good.
> >
> > On Apr 21, 2015, at 7:38 PM, Paul Hargrove  wrote:
> >
> >> Sorry the output in the previous email left out some relevant detail.
> >> See here that BOTH dl components were unable to compile with the
> 1.8.5rc2 tarball:
> >>
> >> +++ Configuring MCA framework dl
> >> checking for no configure components in framework dl...
> >> checking for m4 configure components in framework dl... libltdl, dlopen
> >>
> >> --- MCA component dl:dlopen (m4 configuration macro, priority 80)
> >> checking for MCA component dl:dlopen compile mode... static
> >> checking dlfcn.h usability... yes
> >> checking dlfcn.h presence... yes
> >> checking for dlfcn.h... yes
> >> looking for library without search path
> >> checking for dlopen in -ldl... no
> >> checking if MCA component dl:dlopen can compile... no
> >>
> >> --- MCA component dl:libltdl (m4 configuration macro, priority 50)
> >> checking for MCA component dl:libltdl compile mode... static
> >> checking --with-libltdl value... simple ok (unspecified)
> >> checking --with-libltdl-libdir value... simple ok (unspecified)
> >> checking for libltdl dir... compiler default
> >> checking for libltdl library dir... linker default
> >> checking ltdl.h usability... no
> >> checking ltdl.h presence... no
> >> checking for ltdl.h... no
> >> checking if MCA component dl:libltdl can compile... no
> >> configure: WARNING: Did not find a suitable static opal dl component
> >> configure: WARNING: You might need to install libltld (and its headers)
> or
> >> configure: WARNING: specify --disable-dlopen to configure.
> >> configure: error: Cannot continue
> >>
> >> I am getting this on ALL of my {Free,Net,Open}BSD platforms.
> >> However, they all built the dl:dlopen component fine when testing
> Jeff's tarballs from PR410:
> >>
> >> --- MCA component dl:dlopen (m4 configuration macro, priority 80)
> >> checking for MCA component dl:dlopen compile mode... static
> >> checking dlfcn.h usability... yes
> >> checking dlfcn.h presence... yes
> >> checking for dlfcn.h... yes
> >> looking for library without search path
> >> checking for library containing dlopen... none required
> >> checking if MCA component dl:dlopen can compile... yes
> >>
> >> The key difference I see is that dlopen() is available in libc, not in
> (the non-existent libdl).
> >> So it looks likely that something wasn't brought over
> correctly/completely from master to v1.8.
> >>
> >> -Paul [a.k.a. bot:hargrove]
> >>
> >>
> >>
> >> On Tue, Apr 21, 2015 at 4:22 PM, Paul Hargrove 
> wrote:
> >> Is the following configure-fails-by-default behavior really the desired
> one in 1.8.5?
> >> I thought this was more of a 1.9 change than a mid-series change.
> >>
> >> -Paul
> >>
> >> --- MCA component dl:libltdl (m4 configuration macro, priority 50)
> >> checking for MCA component dl:libltdl compile mode... static
> >> checking --with-libltdl value... simple ok (unspecified)
> >> checking --with-libltdl-libdir value... simple ok (unspecified)
> >> checking for libltdl dir... compiler default
> >> checking for libltdl library dir... linker default
> >> checking ltdl.h usability... no
> >> checking ltdl.h presence... no
> >> checking for ltdl.h... no
> >> checking if MCA component dl:libltdl can compile... no
> >> configure: WARNING: Did not find a suitable static opal dl component
> >> configure: WARNING: You might need to install libltld (and its headers)
> or
> >> configure: WARNING: specify --disable-dlopen to configure.
> >> configure: error: Cannot continue
> >>
> >> On Tue, Apr 21, 2015 at 3:43 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> >> In the usual location:
> >>
> >> http://www.open-mpi.org/software/ompi/v1.8/
> >>
> >> The NEWS changed completely between rc1 and rc2, so I can't easily tell
> exactly what is different between rc1 and rc2.  Here's the full 1.8.5 NEWS:
> >>
> >> - Fixed configure problems in some cases when usi

Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4

2015-04-22 Thread Ralph Castain

> On Apr 22, 2015, at 12:27 PM, Tom Wurgler  wrote:
> 
> Well, still not working.  I compiled on the AMD and for 1.6.4 I get:
> (note: ideally, at this point, we really want 1.6.4 ,not 1.8.4 (yet)).
> 
> 1.6.4 using --bind-to-socket --bind-to-core
> 
> --
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI developers.
> --
> --
> mpirun was unable to start the specified application as it encountered an 
> error:
> 
> Error name: Input/output error
> Node: rdsargo36
> 
> when attempting to start process rank 0.
> --
> Error: Previous command failed (exitcode=1)
> =

Sounds like there is something in the old 1.6.4 release that just doesn’t grok 
your system. That release is getting pretty long in the tooth, so it is quite 
likely that it just isn’t happy. Sadly, we don’t maintain that series as it is 
just too old at this point.

> 
> Now using 1.8.4 and --map-by socket --bind-to core, the job runs and is on 
> the cores I want, namely 0,1,2,3,4,5,6,7

Hooray!

> BUT when a job lands on a node with other jobs, it oversubscribes the cores.  
> I tried adding --nooversubscribe with no effect.
> 
> The already running jobs were submitted with openmpi 1.6.4 using --mca 
> mpi-paffinity-alone 1
> and none of the bind-core etc args. 
> 
> We have nodes with 4 sockets with 12 cores each, so we really want to make 
> use of as many cores as we can, in the most optimized way.  So if we had a 24 
> way job and then a second  24 way job we want them packed on the same node 
> with processor and memory affinity.

The problem is that the two mpirun’s have no way of knowing about each other’s 
existence. Thus, they each think that they are sole owners of the node and 
begin mapping with core0.

You can resolve that by passing each mpirun a cpu-set, thus isolating them from 
each other. For the 1.8 series, just add "--cpu-set 0-23" for one job, and 
"--cpu-set 24-47" for the other. We'll still map and bind as directed, but each 
mpirun will stay within the defined envelope.

Some resource managers will do this for you, so you might want to explore that 
option as well. If the RM externally applies a cpu-set envelope to mpirun, 
mpirun will pick that up and treat it as a defined cpu-set.
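As a sketch of the arithmetic behind that suggestion (my own illustration; `cpu_set_envelopes` is a hypothetical helper, not part of Open MPI), splitting a 48-core node into non-overlapping envelopes for concurrent mpiruns looks like:

```python
def cpu_set_envelopes(total_cores, jobs):
    """Split cores 0..total_cores-1 into contiguous, non-overlapping
    ranges, one per job, in the "lo-hi" form mpirun's --cpu-set expects."""
    per_job = total_cores // jobs
    return [f"{j * per_job}-{j * per_job + per_job - 1}" for j in range(jobs)]

# Two 24-way jobs on a 4-socket x 12-core (48-core) node:
print(cpu_set_envelopes(48, 2))  # ['0-23', '24-47']
```

With those envelopes, the first job would run as `mpirun --cpu-set 0-23 ...` and the second as `mpirun --cpu-set 24-47 ...`, so each mpirun maps and binds only within its own slice of the node.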


> 
> Any guidance appreciated.
> thanks
> tom
> 
> From: devel  on behalf of Ralph Castain 
> 
> Sent: Friday, April 17, 2015 7:36 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4
>  
> Hi Tom
> 
> Glad you are making some progress! Note that the 1.8 series uses hwloc for 
> its affinity operations, while the 1.4 and 1.6 series used the old plpa code. 
> Hence, you will not find the “affinity” components in the 1.8 ompi_info 
> output.
> 
> Is there some reason you didn’t compile OMPI on the AMD machine? I ask 
> because there are some config switches in various areas that differ between 
> AMD and Intel architectures.
> 
> 
> >> On Apr 17, 2015, at 11:16 AM, Tom Wurgler wrote:
>> 
>> Note where I said "1 hour 14 minutes" it should have read "1 hour 24 
>> minutes"...
>> 
>> 
>> 
>> From: Tom Wurgler
>> Sent: Friday, April 17, 2015 2:14 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] Assigning processes to cores 1.4.2, 1.6.4 and 1.8.4
>>  
>> Ok, seems like I am making some progress here.  Thanks for the help.
>> I turned HT off.
>> Now I can run v 1.4.2, 1.6.4 and 1.8.4, all compiled with the same compiler 
>> and run on the same machine.
>> 1.4.2 runs this job in 59 minutes.   1.6.4 and 1.8.4 run the job in 1hr 24 
>> minutes.
>> 1.4.2 uses just --mca paffinity-alone 1 and the processes are bound
>>   PID COMMAND  CPUMASK   TOTAL  [     N0     N1  N2  N3  N4  N5 ]
>> 22232 prog1          0  469.9M  [ 469.9M      0   0   0   0   0 ]
>> 22233 prog1          1  479.0M  [   4.0M 475.0M   0   0   0   0 ]
>> 22234 prog1          2  516.7M  [ 516.7M      0   0   0   0   0 ]
>> 22235 prog1          3  485.4M  [   8.0M 477.4M   0   0   0   0 ]
>> 22236 prog1          4  482.6M  [ 482.6M      0   0   0   0   0 ]
>> 22237 prog1          5  486.6M  [   6.0M 480.6M   0   0   0   0 ]
>> 22238 prog1          6  481.3M  [ 481.3M      0   0   0   0   0 ]
>> 22239 prog1          7  419.4M  [   8.0M 411.4M   0   0   0   0 ]
>> 
>> If I use 1.6.4 and 1.8.

Re: [OMPI devel] binding output error

2015-04-22 Thread Ralph Castain
Here is what I see on my machine:

07:59:55  (v1.8) /home/common/openmpi/ompi-release$ mpirun -np 8 
--display-devel-map --report-bindings --map-by core -host bend001 --bind-to 
core hostname
 Data for JOB [45531,1] offset 0

 Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYCORE  
Ranking policy: CORE
 Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
Num new daemons: 0  New daemon starting vpid INVALID
Num nodes: 1

 Data for node: bend001 Launch id: -1   State: 2
Daemon: [[45531,0],0]   Daemon launched: True
Num slots: 12   Slots in use: 8 Oversubscribed: FALSE
Num slots allocated: 12 Max slots: 0
Username on node: NULL
Num procs: 8Next node_rank: 8
Data for proc: [[45531,1],0]
Pid: 0  Local rank: 0   Node rank: 0App rank: 0
State: INITIALIZED  Restarts: 0 App_context: 0  Locale: 
0,12Bind location: 0,12 Binding: 0,12
Data for proc: [[45531,1],1]
Pid: 0  Local rank: 1   Node rank: 1App rank: 1
State: INITIALIZED  Restarts: 0 App_context: 0  Locale: 
2,14Bind location: 2,14 Binding: 2,14
Data for proc: [[45531,1],2]
Pid: 0  Local rank: 2   Node rank: 2App rank: 2
State: INITIALIZED  Restarts: 0 App_context: 0  Locale: 
4,16Bind location: 4,16 Binding: 4,16
Data for proc: [[45531,1],3]
Pid: 0  Local rank: 3   Node rank: 3App rank: 3
State: INITIALIZED  Restarts: 0 App_context: 0  Locale: 
6,18Bind location: 6,18 Binding: 6,18
Data for proc: [[45531,1],4]
Pid: 0  Local rank: 4   Node rank: 4App rank: 4
State: INITIALIZED  Restarts: 0 App_context: 0  Locale: 
8,20Bind location: 8,20 Binding: 8,20
Data for proc: [[45531,1],5]
Pid: 0  Local rank: 5   Node rank: 5App rank: 5
State: INITIALIZED  Restarts: 0 App_context: 0  Locale: 
10,22   Bind location: 10,22Binding: 10,22
Data for proc: [[45531,1],6]
Pid: 0  Local rank: 6   Node rank: 6App rank: 6
State: INITIALIZED  Restarts: 0 App_context: 0  Locale: 
1,13Bind location: 1,13 Binding: 1,13
Data for proc: [[45531,1],7]
Pid: 0  Local rank: 7   Node rank: 7App rank: 7
State: INITIALIZED  Restarts: 0 App_context: 0  Locale: 
3,15Bind location: 3,15 Binding: 3,15
[bend001:15493] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../..][../../../../../..]
[bend001:15493] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: 
[../BB/../../../..][../../../../../..]
[bend001:15493] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: 
[../../BB/../../..][../../../../../..]
[bend001:15493] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: 
[../../../BB/../..][../../../../../..]
[bend001:15493] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: 
[../../../../BB/..][../../../../../..]
[bend001:15493] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: 
[../../../../../BB][../../../../../..]
[bend001:15493] MCW rank 6 bound to socket 1[core 6[hwt 0-1]]: 
[../../../../../..][BB/../../../../..]
[bend001:15493] MCW rank 7 bound to socket 1[core 7[hwt 0-1]]: 
[../../../../../..][../BB/../../../..]
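The bracket notation in the report-bindings lines above can be reproduced with a toy renderer (my own sketch for illustration, not Open MPI code): one bracket per socket, with `BB` marking both hardware threads of the bound core.

```python
def render_binding(sockets, cores_per_socket, bound_core, hwt_per_core=2):
    """Render a report-bindings style map for a proc bound to one core.

    bound_core is the global core index (socket-major).  Each socket is
    one bracket; a bound core shows one 'B' per hardware thread, an
    unbound core shows '..'.
    """
    brackets = []
    for s in range(sockets):
        cells = []
        for c in range(cores_per_socket):
            g = s * cores_per_socket + c
            cells.append("B" * hwt_per_core if g == bound_core else "..")
        brackets.append("[" + "/".join(cells) + "]")
    return "".join(brackets)

# Rank 0 on a 2-socket, 6-core/socket, HT-enabled node:
print(render_binding(2, 6, 0))  # [BB/../../../../..][../../../../../..]
```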


I have HT enabled on my box, so the devel-map is showing Locale, Bind location, 
and Binding as the logical HT numbers (i.e., the PUs) for that proc. As you can 
see in the report-bindings output, things are indeed going where they should go.

The numbering in the devel-map always looks a little funny because it depends 
on how the bios numbered the cpus. Contrary to what you might expect, they do tend to bounce 
around. In my case, for example, the bios has assigned the HTs and cores in the 
first socket with all the even numbered PUs, and the second socket got all the 
odd numbers. In other words, it assigned PUs round-robin by socket instead of 
sequentially across each socket.

Every bios does it differently, so there is no way to provide a 
standardized output. This is why we have report-bindings to tell the user where 
they actually wound up.
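To make the round-robin numbering concrete (my own model of the scheme described above, not Open MPI code): on this 2-socket, 6-core-per-socket, 2-HT box, core c of socket s owns PUs s + 2c and s + 2c + 12, which reproduces the Locale pairs in the devel-map output above.

```python
SOCKETS, CORES_PER_SOCKET = 2, 6

def pus_for_core(socket, core):
    """Logical PU numbers for a core when the BIOS numbers PUs
    round-robin across sockets (even PUs on socket 0, odd on socket 1),
    with the sibling hardware thread offset by the total core count."""
    total_cores = SOCKETS * CORES_PER_SOCKET
    first = socket + SOCKETS * core   # first hardware thread
    return (first, first + total_cores)

# Locale pairs from the devel-map:
print(pus_for_core(0, 0))  # (0, 12) -> rank 0
print(pus_for_core(0, 1))  # (2, 14) -> rank 1
print(pus_for_core(1, 0))  # (1, 13) -> rank 6
```

Under sequential numbering the same core would instead own PUs (s * 6 + c) and (s * 6 + c + 12), which is why the devel-map numbers look "funny" on this particular bios.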

HTH
Ralph


> On Apr 21, 2015, at 7:54 AM, Devendar Bureddy  wrote:
> 
> I agree.   
> 
> -Original Message-
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
> (jsquyres)
> Sent: Tuesday, April 21, 2015 7:17 AM
> To: Open MPI Developers List
> Subject: Re: [OMPI devel] binding output error
> 
> +1
> 
> Devendar, you seem to be reporting a different issue than Elena...?  FWIW: 
> Open MPI has always used logical CPU numbering.  As far as I can tell from 
> your output, it looks like Open MPI did the Right Thing with your examples.
> 
> Elena's example seemed to show conflicting cpu numbering -- where OMPI said 
> it would bind a process and then where it ac