Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1

2013-08-30 Thread Rolf vandeVaart
George, I appreciate the thought you have put into these comments.  Perhaps I
can try to make some of the changes you suggest, but, as you noted, sometime
in the future.  In theory I agree with them, but in practice they are not so
easy to implement.  I have added a few more comments below.  Also note that I
am hoping to move this change into Open MPI 1.7.4 and to have you as a reviewer
for the ticket.

https://svn.open-mpi.org/trac/ompi/ticket/3732

  
>-Original Message-
>From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of George
>Bosilca
>Sent: Wednesday, August 28, 2013 5:12 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29055 - in
>trunk/ompi/mca: btl btl/smcuda common/cuda pml/ob1
>
>Rolf,
>
>I'm not arguing against the choices you made. I just want to pinpoint that
>there are other ways of achieving a similar goal with less impact on the
>components outside of the SMCUDA BTL, an approach which might be
>interesting in the long run. Thus my answer below is just so there is a trace
>of a possible starting point toward a less intrusive approach, in case someone
>is interested in implementing it in the future (an intern or some PhD student).
>
>On Aug 23, 2013, at 21:53 , Rolf vandeVaart  wrote:
>
>> Yes, I agree that the CUDA support is more intrusive and ends up in
>> different areas.  The problem is that the changes could not be simply
>> isolated in a BTL.
>>
>> 1. To support the direct movement of GPU buffers, we often utilize copying
>> into host memory and then out of host memory.  These copies have to be
>> done using the cuMemcpy() functions rather than the memcpy() function.
>> This is why some changes ended up in the opal datatype area.  Most of that
>> copying is driven by the convertor.
>
>Each proc has a master_convertor, which is a specialized class of convertor
>used for the target processor(s), and this specialized convertor is cloned for
>every new request. Some of these convertors support heterogeneous
>operations, some of them checksums, and some of them just memcpy. The
>CUDA support could have become just another class of convertor, with its
>definition visible only in the SMCUDA component.
This would be nice to do.  I just do not know how to implement it.  Note that
this is not used only within the smcuda BTL.  It is also used in the openib BTL
and in any other BTL that ends up supporting GPU buffers.
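
For reference, here is a minimal sketch of the convertor specialization George
describes: a copy hook that is memcpy for host buffers and a cuMemcpy-based
routine for GPU buffers. The structure and names are hypothetical and far
simpler than the real opal_convertor_t interface:

#include <cuda.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical copy hook; the real convertor buries an equivalent choice
 * deep inside its pack/unpack machinery. */
typedef void *(*copy_fn_t)(void *dst, const void *src, size_t len);

struct example_convertor {
    copy_fn_t copy;               /* memcpy or a cuMemcpy-based routine */
};

/* Plain host convertor "class". */
static void *host_copy(void *dst, const void *src, size_t len)
{
    return memcpy(dst, src, len);
}

/* CUDA convertor "class": with unified addressing, cuMemcpy() accepts
 * host and device pointers alike. */
static void *cuda_copy(void *dst, const void *src, size_t len)
{
    (void) cuMemcpy((CUdeviceptr)(uintptr_t) dst,
                    (CUdeviceptr)(uintptr_t) src, len);
    return dst;
}

/* A BTL that recognizes a GPU buffer would clone the master convertor and
 * install the CUDA copy routine, keeping the choice local to that BTL. */
static void select_copy(struct example_convertor *cv, int is_gpu_buffer)
{
    cv->copy = is_gpu_buffer ? cuda_copy : host_copy;
}

With that kind of hook, BTLs other than smcuda could pick up GPU support simply
by cloning the CUDA convertor class rather than carrying cuMemcpy logic of
their own.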

>> 2. I added support for doing an asynchronous copy into and out of host
>> buffers.  This ended up touching the datatype, PML, and BTL layers.
>
>I'm not sure I understand the issue here, but what you do in the BTL is none of
>the PML's business as long as you respect the rules. For message logging we
>have to copy the data locally in addition to sending it over the network, and
>we do not sequentialize these two steps; they can progress simultaneously
>(supported by extra threads eventually).
This touched the datatype layer because I added a field (a stream) to the
convertor.  I would have to think more about this one.
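
To illustrate the asynchronous staging Rolf describes, a device-to-host copy
can be issued on a per-convertor stream and polled for completion from the
progress loop. This is only a sketch against the CUDA driver API; the
surrounding structure is made up and is not the actual OMPI code:

#include <cuda.h>

/* Hypothetical per-convertor state: the "stream field" mentioned above.
 * stream/done would be created once (cuStreamCreate/cuEventCreate) when
 * the convertor is set up for a GPU buffer. */
struct staged_copy {
    CUstream stream;
    CUevent  done;
};

/* Kick off an asynchronous device-to-host copy into a pinned staging
 * buffer and record an event behind it. */
static CUresult stage_dtoh_start(struct staged_copy *sc, void *host_staging,
                                 CUdeviceptr gpu_src, size_t len)
{
    CUresult rc = cuMemcpyDtoHAsync(host_staging, gpu_src, len, sc->stream);
    if (CUDA_SUCCESS != rc) {
        return rc;
    }
    return cuEventRecord(sc->done, sc->stream);
}

/* Polled from the progress loop: 1 when the copy has finished, 0 while it
 * is still in flight, so nothing in the PML/BTL ever blocks on the copy. */
static int stage_dtoh_test(struct staged_copy *sc)
{
    return (CUDA_SUCCESS == cuEventQuery(sc->done)) ? 1 : 0;
}
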
>
>> 3. A GPU buffer may utilize a different protocol than a HOST buffer within a
>> BTL.  This required me to find different ways to direct which PML protocol to
>> use.  In addition, it is assumed that a BTL either supports RDMA or not.
>> There is no notion of support based on the type of buffer being sent.
>
>The change where the PML uses the flags on the endpoint instead of the
>ones on the BTL is a good change, allowing far more flexibility in the handling
>of the transfer protocols.
>
>Based on my understanding of your code, you changed the AM tag used by
>the PML to send the message, something that could have been done easily in
>the BTL, for example by providing different paths for the PUT/GET operations
>based on the memory ownership. Basically, I could not figure out why the
>decision to use specialized RDMA (IPC based) should be reported upstream
>to the PML instead of being a totally local decision in the BTL, exposed to
>the PML as a single PUT/GET interface.
I am using a new active message tag, local to the smcuda BTL, for the smcuda
BTLs to decide whether they can support RDMA of GPU memory.  When they agree
that they can, I piggyback on the error handler callback support from the PML.
I created a new flag so that when we call the error handler with that flag, the
PML realizes it should adjust to add RDMA of GPU buffers, and it adjusts the
flag on the endpoint.  For this case, this can happen sometime after
MPI_Init(), which is why we call into the PML layer sometime later.  For all
other cases, the decision to support large-message RDMA is determined at
MPI_Init() time.  And we need to report it upstream because that is how the PML
determines what protocol it will instruct the BTL to use.
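
A rough sketch of that control flow is below: the endpoint's flags, not the
BTL's, drive the protocol choice, and a later callback upgrades an endpoint
once the smcuda peers have completed their IPC handshake. All names are
illustrative only, not the real btl.h or PML symbols:

/* Illustrative flags only; the real BTL flag space lives elsewhere. */
#define EX_FLAG_RDMA      0x01   /* large-message RDMA for host buffers     */
#define EX_FLAG_CUDA_RDMA 0x02   /* RDMA (CUDA IPC) allowed for GPU buffers */

struct example_endpoint {
    unsigned flags;
};

/* Protocol selection keys off the endpoint, not the BTL, so one peer can
 * support GPU RDMA while another does not. */
static int use_rdma(const struct example_endpoint *ep, int is_gpu_buffer)
{
    unsigned needed = is_gpu_buffer ? EX_FLAG_CUDA_RDMA : EX_FLAG_RDMA;
    return 0 != (ep->flags & needed);
}

/* Host-buffer RDMA is decided at MPI_Init() time; the GPU flag is added
 * later, once the smcuda peers have finished their IPC handshake over the
 * BTL-local active-message tag and reported it upstream via a callback. */
static void enable_gpu_rdma(struct example_endpoint *ep)
{
    ep->flags |= EX_FLAG_CUDA_RDMA;
}
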
>
>  George.
>
>> Therefore, to leverage much of the existing datatype and PML support, I
>> had to make changes in there.  Overall, I agree it is not ideal, but it is
>> the best I could come up with.

Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-08-30 Thread Ralph Castain
Hi Chris et al

We did some work on the OMPI side and removed the O(N) calls to "get", so it 
should behave better now. If you get the chance, please try the 1.7.3 nightly 
tarball. We hope to officially release it soon.

Ralph


On Aug 14, 2013, at 10:27 AM, Joshua Ladd  wrote:

> Hi, Chris
> 
> Sorry for the delayed response. After much effort, the Open MPI 1.7 branch 
> now supports PMI2 (in general, not just for ALPS) and has been tested and 
> evaluated at small-ish scale (up to 512 ranks) with SLURM 2.6. We need to 
> test this at larger scale and plan to do so in the coming weeks, but what we 
> have observed thus far is the following:
> 
> 1. The KVS Fence operation appears to scale worse than linearly. This issue
> resides solely on the SLURM side. Perhaps a better algorithm could be
> implemented - we have discussed recursive doubling and Bruck's algorithm as
> alternatives.
> 
> 2. There are still O(N) calls to PMI2_get at the OMPI/ORTE level that don't
> appear to scale particularly well. Circumventing this remains an open
> challenge, though proposals have been tossed around, such as having a single
> node leader get all the data from the KVS space and put it into a shared
> segment from which the other ranks on the host can read. Unfortunately, this
> is still O(N), just with a reduced coefficient.
> 
> 3. We observed that launch times are longer with SLURM 2.6 than they were
> with the 2.5.X series. However, anecdotally, scaling appears to be improved.
> From our (Mellanox's) point of view, getting something that doesn't "blow up"
> quadratically as N goes to 4K ranks and beyond is more important than the
> absolute performance of launching any one job size.
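
The node-leader proposal in point 2 above could look roughly like the sketch
below: one process per host performs the O(N) KVS lookups and publishes them
in a shared segment that its local peers read. The helper names and layout are
hypothetical, not the actual ORTE/PMI2 code:

#include <stdio.h>
#include <string.h>

#define MAX_PROCS 1024
#define KEY_LEN   64
#define VAL_LEN   256

/* Hypothetical wrapper around PMI2_KVS_Get; not the real prototype. */
extern int pmi_get(const char *key, char *value, int maxlen);

/* Hypothetical segment mapped by every rank on the host (shm_open/mmap);
 * the node leader fills it, the other local ranks only read it. */
struct kvs_cache {
    int  nkeys;
    char key[MAX_PROCS][KEY_LEN];
    char val[MAX_PROCS][VAL_LEN];
};

/* Leader: still O(N) lookups, but only one process per host pays them. */
static void leader_fill(struct kvs_cache *cache, int nprocs)
{
    int rank;
    for (rank = 0; rank < nprocs; rank++) {
        snprintf(cache->key[rank], KEY_LEN, "btl-contact-%d", rank);
        pmi_get(cache->key[rank], cache->val[rank], VAL_LEN);
    }
    cache->nkeys = nprocs;
}

/* Everyone else: a purely local lookup, no PMI traffic at all. */
static const char *local_lookup(const struct kvs_cache *cache, const char *key)
{
    int i;
    for (i = 0; i < cache->nkeys; i++) {
        if (0 == strcmp(cache->key[i], key)) {
            return cache->val[i];
        }
    }
    return NULL;
}
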
> 
> From the data that I have seen, it appears that simply switching to SLURM 2.6
> (along with the latest OMPI 1.7) will most likely not provide performance
> comparable to launching with mpirun. I'll be sure to keep you and the
> community apprised of the situation as more data on larger systems becomes
> available in the coming weeks.
> 
> 
> Best regards,
> 
> Josh
> 
> 
> Joshua S. Ladd, PhD
> HPC Algorithms Engineer
> Mellanox Technologies 
> 
> Email: josh...@mellanox.com
> Cell: +1 (865) 258 - 8898
> 
> 
> 
> 
> -Original Message-
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Christopher 
> Samuel
> Sent: Thursday, August 08, 2013 12:26 AM
> To: de...@open-mpi.org
> Subject: Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% 
> slower than with mpirun
> 
> 
> Hi Joshua,
> 
> On 23/07/13 19:34, Joshua Ladd wrote:
> 
>> The proposed solution that "we" (OMPI + SLURM) have come up with is to 
>> modify OMPI to support PMI2 and to use SLURM 2.6 which has support for 
>> PMI2 and is (allegedly) much more scalable than PMI1.
>> Several folks in the combined communities are working hard, as we 
>> speak, trying to get this functional to see if it indeed makes a 
>> difference. Stay tuned, Chris. Hopefully we will have some data by the 
>> end of the week.
> 
> Is there any news on this?
> 
> We'd love to be able to test this out if we can as I currently see a 60% 
> penalty with srun with my test NAMD job from our tame MM person.
> 
> thanks!
> Chris
> -- 
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 



Re: [OMPI devel] [OMPI svn] svn:open-mpi r29079 - in trunk: opal/mca/hwloc/base orte/mca/rmaps/mindist

2013-08-30 Thread Ralph Castain
You know, I've been looking around the code base, and I cannot find this MCA 
param registered anywhere, and neither does ompi_info show it.

> rmaps_base_dist_hca


Is some code missing?? According to your code, you use a device name that is 
obtained from the standard "--map-by dist:device" option. So did you mean to 
add another variable and then realized one wasn't required??



On Aug 28, 2013, at 9:23 AM, svn-commit-mai...@open-mpi.org wrote:

> Author: jladd (Joshua Ladd)
> Date: 2013-08-28 12:23:33 EDT (Wed, 28 Aug 2013)
> New Revision: 29079
> URL: https://svn.open-mpi.org/trac/ompi/changeset/29079
> 
> Log:
> Add support for autodetecting a MLNX HCA in the rmaps min distance feature.
> In this way, .ini files distributed with software stacks need not specify a
> particular HCA but instead may select the keyword auto, which will
> automatically select the discovered device. To use this feature, simply pass
> the keyword auto instead of a specific device name: --mca rmaps_base_dist_hca
> auto. If more than one card is installed, the mapper will inform the user of
> this, and the user will then need to specify which card via the normal route,
> e.g. --mca rmaps_base_dist_hca . This should be
> added to \ncmr=v1.7.4:reviewer=rhc:subject=Autodetect logic for min dist
> mapping
> 
> Text files modified:
>    trunk/opal/mca/hwloc/base/base.h                    |  4 ++--
>    trunk/opal/mca/hwloc/base/hwloc_base_util.c         | 40
>    trunk/orte/mca/rmaps/mindist/help-orte-rmaps-md.txt |  8
>    trunk/orte/mca/rmaps/mindist/rmaps_mindist_module.c | 11 +--
>    4 files changed, 55 insertions(+), 8 deletions(-)
> 
> Modified: trunk/opal/mca/hwloc/base/base.h
> ==============================================================================
> --- trunk/opal/mca/hwloc/base/base.h  Wed Aug 28 12:03:23 2013   (r29078)
> +++ trunk/opal/mca/hwloc/base/base.h  2013-08-28 12:23:33 EDT (Wed, 28 Aug 2013)   (r29079)
> @@ -169,8 +169,8 @@
>                                                     hwloc_obj_t obj,
>                                                     opal_hwloc_resource_type_t rtype);
> 
> -OPAL_DECLSPEC void opal_hwloc_get_sorted_numa_list(hwloc_topology_t topo,
> -                                                   const char* device_name,
> +OPAL_DECLSPEC int opal_hwloc_get_sorted_numa_list(hwloc_topology_t topo,
> +                                                  char* device_name,
>                                                    opal_list_t *sorted_list);
> 
> /**
> 
> Modified: trunk/opal/mca/hwloc/base/hwloc_base_util.c
> ==============================================================================
> --- trunk/opal/mca/hwloc/base/hwloc_base_util.c   Wed Aug 28 12:03:23 2013   (r29078)
> +++ trunk/opal/mca/hwloc/base/hwloc_base_util.c   2013-08-28 12:23:33 EDT (Wed, 28 Aug 2013)   (r29079)
> @@ -1729,7 +1729,7 @@
>     }
> }
> 
> -static void sort_by_dist(hwloc_topology_t topo, const char* device_name, opal_list_t *sorted_list)
> +static void sort_by_dist(hwloc_topology_t topo, char* device_name, opal_list_t *sorted_list)
> {
>     hwloc_obj_t device_obj = NULL;
>     hwloc_obj_t obj = NULL, root = NULL;
> @@ -1751,6 +1751,9 @@
>             obj = obj->parent;
>         }
>         if (obj == NULL) {
> +            opal_output_verbose(5, opal_hwloc_base_framework.framework_output,
> +                                "hwloc:base:get_sorted_numa_list: NUMA node closest to %s wasn't found.",
> +                                device_name);
>             return;
>         } else {
>             close_node_index = obj->logical_index;
> @@ -1762,6 +1765,8 @@
>             /* we can try to find distances under group object. This info can be there. */
>             depth = hwloc_get_type_depth(topo, HWLOC_OBJ_NODE);
>             if (depth < 0) {
> +                opal_output_verbose(5, opal_hwloc_base_framework.framework_output,
> +                                    "hwloc:base:get_sorted_numa_list: There is no information about distances on the node.");
>                 return;
>             }
>             root = hwloc_get_root_obj(topo);
> @@ -1779,6 +1784,8 @@
>         }
>         /* find all distances for our close node with logical index = close_node_index as close_node_index + nbobjs*j */
>         if ((NULL == distances) || (0 == distances->nbobjs)) {
> +            opal_output_verbose(5, opal_hwloc_base_framework.framework_output,
> +                                "hwloc:base:get_sorted_numa_list: There is no information about distances on the node.");
>             return;
>         }
>         /* fill list of numa nodes */
> @@ -1797,13 +18
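
For context, autodetecting an OpenFabrics HCA along the lines of this commit
log can be illustrated with hwloc's I/O-device iteration. This is a hedged
sketch against the hwloc 1.x API (HWLOC_TOPOLOGY_FLAG_IO_DEVICES,
hwloc_get_next_osdev), not the code committed in r29079:

#include <hwloc.h>
#include <stdio.h>

/* Return the name of the first OpenFabrics OS device (e.g. "mlx4_0"), or
 * NULL if none or more than one is found; in the latter case the mapper
 * would ask the user to name the card explicitly. */
static const char *find_single_hca(hwloc_topology_t topo)
{
    hwloc_obj_t osdev = NULL, found = NULL;

    while (NULL != (osdev = hwloc_get_next_osdev(topo, osdev))) {
        if (HWLOC_OBJ_OSDEV_OPENFABRICS == osdev->attr->osdev.type) {
            if (NULL != found) {
                return NULL;          /* more than one HCA installed */
            }
            found = osdev;
        }
    }
    return (NULL != found) ? found->name : NULL;
}

int main(void)
{
    hwloc_topology_t topo;
    const char *hca;

    hwloc_topology_init(&topo);
    /* I/O devices are not discovered unless this flag is set (hwloc 1.x). */
    hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
    hwloc_topology_load(topo);

    hca = find_single_hca(topo);
    printf("auto-detected HCA: %s\n", hca ? hca : "(none or ambiguous)");

    hwloc_topology_destroy(topo);
    return 0;
}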

Re: [OMPI devel] NO LT_DLADVISE - CANNOT LOAD LIBOMPI JAVA BINDINGS

2013-08-30 Thread Bibrak Qamar
So it happens that installing the lt_dladvise package (libltdl) using yum
is quite an easy task, but I don't have sudo powers.

I found the following alternative solution, i.e. to distribute libltdl with the
package (here the package will be openmpi, which means that I have to change
the aclocal.m4 of openmpi):

http://www.gnu.org/software/libtool/manual/html_node/Distributing-libltdl.html

Should I do that, or is there another way?

Thanks
Bibrak


On Thu, Aug 29, 2013 at 9:30 AM, Ralph Castain  wrote:

> you need to install the lt_dladvise package as well
>
> On Aug 29, 2013, at 6:18 AM, Bibrak Qamar  wrote:
>
> Hi all,
>
> I have the following runtime error while running Java MPI jobs. I have
> checked the previous answers to the mailing list regarding this issue.
>
> The solutions were to install libtool and configure-compile-and-install
> openmpi again, this time with the latest versions of
>
> m4
> autoconf
> automake
> libtool
> and flex
>
> I did all that, but again the same issue: it can't load the libraries.
> Any remedies?
>
>
>
> -bash-3.2$ mpirun -np 2 java Hello
> [compute-0-21.local:14205] NO LT_DLADVISE - CANNOT LOAD LIBOMPI
> JAVA BINDINGS FAILED TO LOAD REQUIRED LIBRARIES
> [compute-0-21.local:14204] NO LT_DLADVISE - CANNOT LOAD LIBOMPI
> JAVA BINDINGS FAILED TO LOAD REQUIRED LIBRARIES
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> --
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
>   Process name: [[48748,1],1]
>   Exit code:1
> --
>
>
> -Bibrak
>


Bibrak Qamar




Re: [OMPI devel] GNU Automake 1.14 released

2013-08-30 Thread Bert Wesarg
Hi,

On Fri, Jun 21, 2013 at 2:01 PM, Stefano Lattarini
 wrote:
> We are pleased to announce the GNU Automake 1.14 minor release.
>
>
>   - The next major Automake version (2.0) will unconditionally activate
> the 'subdir-objects' option.  In order to smooth out the transition,
> we now give a warning (in the category 'unsupported') whenever a
> source file is present in a subdirectory but the 'subdir-objects' option
> is not enabled.  For example, the following usage will trigger such a
> warning:
>
> bin_PROGRAMS = sub/foo
> sub_foo_SOURCES = sub/main.c sub/bar.c
>

We don't understand how this warning is supposed to 'smooth' the transition to
post-1.14 in our project.

Here is our situation:

We have a source file which needs to be compiled twice, but with different
compilers.  Thus we can't use per-target flags, and we use two separate
Makefile.am files instead.  Because the compilation rules are nearly identical,
we use a Makefile.common.inc.am file which is included by both Makefile.am's.
Here is the directory layout (the complete reduced testcase is attached):

src/foo.c
src/Makefile.am
src/Makefile.common.inc.am
src/second/Makefile.am

The src/Makefile.am looks like:

 8< src/Makefile.am 8< ---
SUBDIRS = second

MY_SRCDIR=.
include Makefile.common.inc.am

bin_PROGRAMS=foo
foo_SOURCES=$(FOO_COMMONSOURCES)
 >8 src/Makefile.am >8 ---

 8< src/second/Makefile.am 8< ---
CC=$(top_srcdir)/bin/wrapper

MY_SRCDIR=..
include ../Makefile.common.inc.am

bin_PROGRAMS=foo-wrapped
foo_wrapped_SOURCES=$(FOO_COMMONSOURCES)
 >8 src/second/Makefile.am >8 ---

 8< src/Makefile.common.inc.am 8< ---
FOO_COMMONSOURCES = $(MY_SRCDIR)/foo.c
 >8 src/Makefile.common.inc.am >8 ---

This works with automake 1.13.4 as expected. Now, with automake 1.14
we get the newly introduced warning mentioned above in the release
statements. Now enabling subdir-objects is not yet an option for us,
because we use variables in the _SOURCES list and bug 13928 [1] hits
us.

So what would be the best transition path in this situation?  We don't want
to remove the Makefile.common.inc.am, to avoid the resulting redundancy
in the two Makefile.am files.  We also can't use the newly introduced
%reldir%, because it also triggers the warning, and we want to maintain
compatibility with pre-1.14 automake.

Any guidance is more than welcomed.

Kind Regards,
Matthias Jurenz & Bert Wesarg

[1] http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13928


foo-subdir-objects-warnings.tar.gz
Description: GNU Zip compressed data


Re: [OMPI devel] Possible OMPI 1.6.5 bug? SEGV in malloc.c

2013-08-30 Thread Christopher Samuel

Hiya Jeff,

On 30/08/13 11:13, Jeff Squyres (jsquyres) wrote:

> FWIW, the stack traces you sent are not during MPI_INIT.

I did say it was a suspicion. ;-)

> What happens with OMPI's memory manager is that it inserts itself
> as *the* memory allocator for the entire process before main()
> even starts.  We have to do this as part of the horribleness that
> is OpenFabrics/verbs and how it just doesn't match the MPI
> programming model at all.  :-(  (I think I wrote some blog entries
> about this a while ago...  Ah, here's a few:

Thanks!  I'll take a look next week (just got out of a 5.5 hour
meeting and have to head home now).

> Therefore, (in C) if you call malloc() before MPI_Init(), it'll be 
> calling OMPI's ptmalloc.  The stack traces you sent imply that
> it's just when your app is calling the Fortran allocate -- which is
> after MPI_Init().

OK, that makes sense.

> FWIW, you can build OMPI with --without-memory-manager, or you can 
> setenv OMPI_MCA_memory_linux_disable to 1 (note: this is NOT a 
> regular MCA parameter -- it *must* be set in the environment
> before the MPI app starts).  If this env variable is set, OMPI will
> *not* interpose its own memory manager in the pre-main hook.  That
> should be a quick/easy way to try with and without the memory
> manager and see what happens.
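
As a toy illustration of the interposition Jeff describes (not OMPI's actual
ptmalloc2 code), defining malloc in the executable or in a preloaded library
overrides the libc symbol, so even allocations made before main() resolve to
it:

#include <stddef.h>

/* glibc's underlying allocator; declared by hand, there is no public
 * header for it. */
extern void *__libc_malloc(size_t size);

/* The executable's symbol wins over libc's, so every malloc() in the
 * process, including ones issued by library constructors before main(),
 * lands here.  OMPI's memory manager relies on the same interposition
 * idea to track memory registered for the OpenFabrics stack. */
void *malloc(size_t size)
{
    return __libc_malloc(size);
}

int main(void)
{
    void *p = malloc(64);   /* goes through the interposed allocator */
    return (NULL == p);
}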

Well, with OMPI_MCA_memory_linux_disable=1 I don't get the crash at all,
nor the spin with the Intel compiler build.  Nice!

Thanks for this, I'll take a further look next week.

Very much obliged,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci
