Re: [OMPI users] Problem building OpenMPI 1.4.4 with PGI 11.7 compilers

2011-11-08 Thread Samuel K. Gutierrez
Hi,

I think I've seen this before.

I can't speak to the details surrounding this issue, but when I upgraded to the 
newest version of libtool, the problem went away.

Take a look at "Use of GNU m4, Autoconf, Automake, and Libtool" in our HACKING 
file.  libtool-2.4.2.tar.gz **should** work, if that's the problem that you are 
experiencing.
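
For reference, a rough sketch of the rebuild (the exact steps, and the required 
versions of GNU m4, Autoconf, and Automake, are spelled out in HACKING; the 
paths here are just examples):

  shell$ wget http://ftp.gnu.org/gnu/libtool/libtool-2.4.2.tar.gz
  shell$ tar zxf libtool-2.4.2.tar.gz && cd libtool-2.4.2
  shell$ ./configure --prefix=$HOME/local && make install
  shell$ export PATH=$HOME/local/bin:$PATH
  shell$ cd /path/to/openmpi-1.4.4
  shell$ ./autogen.sh && ./configure ... && make

Note that autogen.sh will only succeed if all of the autotools meet the 
versions that HACKING asks for.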

I would suggest starting with a fresh source tree before you try again.
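
If regenerating the build system isn't practical, one untested workaround might 
be to pass the target as a single token, which may (or may not) survive 
libtool's link-mode filtering; pgcc accepts the '-tp=<target>' spelling, as its 
usage message below shows:

  export CFLAGS='-tp=shanghai-64 -fast -Mfprelaxed'
  export CXXFLAGS=${CFLAGS}
  export FFLAGS=${CFLAGS}
  export FCFLAGS=${FFLAGS}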

Hope that helps,

Samuel K. Gutierrez
Los Alamos National Laboratory

On Nov 8, 2011, at 2:06 PM, Gustavo Correa wrote:

> Dear OpenMPI pros
> 
> When I try to build OpenMPI 1.4.4 with the PGI 11.7 compilers [pgcc, pgcpp, 
> pgfortran]
> I get the awkward error message at the bottom of this email.
> 
> I say awkward because I assigned the value 'shanghai-64' to the '-tp' flag,
> as you can see from the successful 'libtool:compile' command in the error 
> message.
> However, the subsequent 'libtool:link' command has '-tp' without a value.
> Note that the remaining flags '-fast -Mfprelaxed' were also dropped in the 
> libtool:link command. 
> The 'partial' flag '-tp' is worse than no flag at all, and the pgcc compiler 
> fails.
> 
> By contrast, OpenMPI 1.4.3 builds just fine with  the same compilers and 
> the same compiler flags.
> 
> Is this the revival of an old idiosyncrasy between libtool and PGI?
> Could the OMPI 1.4.4 configure script perhaps have stripped off my compiler 
> flags after '-tp'
> when passing them to libtool in link mode? [Somehow it works in 1.4.3.]
> Is there any workaround or patch?
> 
> 
> Many thanks,
> Gus Correa
> 
> **
> 
> More details:
> CentOS Linux 5.2 x86_64, libtool 1.5.22, PGI 11.7.
> 
> Configure parameters:
> export CC=pgcc
> export CXX=pgcpp
> export F77='pgfortran'
> export FC=${F77}
> 
> export CFLAGS='-tp shanghai-64 -fast -Mfprelaxed'
> export CXXFLAGS=${CFLAGS}
> export FFLAGS=${CFLAGS}
> export FCFLAGS=${FFLAGS}
> 
> ../configure \
> --prefix=${MYINSTALLDIR} \
> --with-libnuma=/usr \
> --with-tm=/opt/torque/2.4.11/gnu-4.1.2 \
> --with-openib=/usr \
> --enable-static \
> 2>&1 | tee configure_${build_id}.log
> 
> 
> 
>  ERROR MESSAGE ###
> 
> libtool: compile:  pgcc -DHAVE_CONFIG_H -I. 
> -I../../../../../opal/mca/memory/ptmalloc2 -I../../../../opal/include 
> -I../../../../orte/include -I../../../../ompi/include 
> -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -DMALLOC_DEBUG=0 
> -D_GNU_SOURCE=1 -DUSE_TSD_DATA_HACK=1 -DMALLOC_HOOKS=1 
> -I../../../../../opal/mca/memory/ptmalloc2/sysdeps/pthread 
> -I../../../../../opal/mca/memory/ptmalloc2/sysdeps/generic -I../../../../.. 
> -I../../../.. -I../../../../../opal/include -I../../../../../orte/include 
> -I../../../../../ompi/include -D_REENTRANT -DNDEBUG -tp shanghai-64 -fast 
> -Mfprelaxed -c ../../../../../opal/mca/memory/ptmalloc2/dummy.c -o dummy.o 
> >/dev/null 2>&1
> /bin/sh ../../../../libtool --tag=CC   --mode=link pgcc  -DNDEBUG -tp 
> shanghai-64 -fast -Mfprelaxed   -export-dynamic   -o libopenmpi_malloc.la 
> -rpath /home/sw/openmpi/1.4.4/pgi-11.7/lib dummy.lo  -lnsl -lutil  
> libtool: link: pgcc -shared  -fpic -DPIC  .libs/dummy.o   -lnsl -lutil -lc  
> -tp   -Wl,-soname -Wl,libopenmpi_malloc.so.0 -o 
> .libs/libopenmpi_malloc.so.0.0.0
> pgcc-Fatal-Switch -tp must have a value
> -tp=amd64|amd64e|athlon|athlonxp|barcelona|barcelona-32|barcelona-64|core2|core2-32|core2-64|istanbul|istanbul-32|istanbul-64|k7|k8|k8-32|k8-64|k8-64e|nehalem|nehalem-32|nehalem-64|p5|p6|p7|p7-32|p7-64|penryn|penryn-32|penryn-64|piii|piv|px|px-32|px-64|sandybridge|sandybridge-32|sandybridge-64|shanghai|shanghai-32|shanghai-64|x64
>Choose target processor type
>amd64   Same as -tp k8-64
>amd64e  Same as -tp k8-64e
>athlon  AMD 32-bit Athlon Processor
>athlonxpAMD 32-bit Athlon XP Processor
>barcelona   AMD Barcelona processor
>barcelona-32AMD Barcelona processor, 32-bit mode
>barcelona-64AMD Barcelona processor, 64-bit mode
>core2   Intel Core-2 Architecture
>core2-32Intel Core-2 Architecture, 32-bit mode
>core2-64Intel Core-2 Architecture, 64-bit mode
>istanbulAMD Istanbul processor
>istanbul-32 AMD Istanbul processor, 32-bit mode
>istanbul-64 AMD Istanbul processor, 64-bit mode
>k7  AMD Athlon Processor
>k8  AMD64 Processor
>k8-32   AMD64 Processor 32-bit mode
>k8-64   AMD64 Processor 64-bit mode
>k8-64e  AMD64 Processor rev E or later, 64-bit mode
>nehalem Intel Nehalem processor
>nehalem-32  Intel Neha

Re: [OMPI users] How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Samuel K. Gutierrez
Hi,

Maybe you can leverage some of the techniques outlined in:

Robert W. Robey, Jonathan M. Robey, and Rob Aulwes. 2011. In search of 
numerical consistency in parallel programming. Parallel Comput. 37, 4-5 (April 
2011), 217-229. DOI=10.1016/j.parco.2011.02.009 
http://dx.doi.org/10.1016/j.parco.2011.02.009

Hope that helps,

Samuel K. Gutierrez
Los Alamos National Laboratory

On Sep 20, 2011, at 6:25 AM, Reuti wrote:

> Am 20.09.2011 um 13:52 schrieb Tim Prince:
> 
>> On 9/20/2011 7:25 AM, Reuti wrote:
>>> Hi,
>>> 
>>> Am 20.09.2011 um 00:41 schrieb Blosch, Edwin L:
>>> 
>>>> I am observing differences in floating-point results from an application 
>>>> program that appear to be related to whether I link with OpenMPI 1.4.3 or 
>>>> MVAPICH 1.2.0.  Both packages were built with the same installation of 
>>>> Intel 11.1, as well as the application program; identical flags passed to 
>>>> the compiler in each case.
>>>> 
>>>> I’ve tracked down some differences in a compute-only routine where I’ve 
>>>> printed out the inputs to the routine (to 18 digits); the inputs are 
>>>> identical.  The output numbers are different in the 16th place (perhaps a 
>>>> few in the 15th place).  These differences only show up for optimized 
>>>> code, not for –O0.
>>>> 
>>>> My assumption is that some optimized math intrinsic is being replaced 
>>>> dynamically, but I do not know how to confirm this.  Anyone have guidance 
>>>> to offer? Or similar experience?
>>> 
>>> yes, I see it often, but always at a magnitude where it's not of any 
>>> concern (and not related to any MPI). Due to the limited precision in 
>>> computers, a simple reordering of operations (although equivalent in a 
>>> mathematical sense) can lead to different results. Removing the anomalies 
>>> with -O0 could prove that.
>>> 
>>> The other point, especially for the x86 instruction set, is that the 
>>> internal FPU still has 80 bits, while the representation in memory is only 
>>> 64 bits. Hence, when everything can be done in the registers, the result 
>>> can be different compared to the case when some interim results need to be 
>>> stored to RAM. For the Portland compiler there are the switches -Kieee 
>>> -pc64 to force it to always stay in 64 bits, and the similar Intel options 
>>> are -mp (now -fltconsistency) and -mp1.
>>> 
>> Diagnostics below indicate that ifort 11.1 64-bit is in use.  The options 
>> aren't the same as Reuti's "now" version (a 32-bit compiler which hasn't 
>> been supported for 3 years or more?).
> 
> In the 11.1 documentation they are also still listed:
> 
> http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/index.htm
> 
> I read this as saying that -mp is deprecated syntax (and is therefore listed 
> under "Alternate Options"), but that -fltconsistency is still a valid and 
> supported option.
> 
> -- Reuti
> 
> 
>> With ifort 10.1 and more recent, you would set at least
>> -assume protect_parens -prec-div -prec-sqrt
>> if you are interested in numerical consistency.  If you don't want 
>> auto-vectorization of sum reductions, you would use instead
>> -fp-model source -ftz
>> (ftz sets underflow mode back to abrupt, while "source" sets gradual).
>> It may be possible to expose 80-bit x87 by setting the ancient -mp option, 
>> but such a course can't be recommended without additional cautions.
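>> 
>> For example, with the options above (the source file name is hypothetical):
>> 
>>   ifort -assume protect_parens -prec-div -prec-sqrt -c solver.f90
>> 
>> or, if you also want to avoid auto-vectorized sum reductions:
>> 
>>   ifort -fp-model source -ftz -c solver.f90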
>> 
>> The quoted comment from the OP seems to show a somewhat different question: Does 
>> OpenMPI implement any operations in a different way from MVAPICH?  I would 
>> think it probable that the answer could be affirmative for operations such 
>> as allreduce, but this leads well outside my expertise with respect to 
>> specific MPI implementations.  It isn't out of the question to suspect that 
>> such differences might be aggravated when using excessively aggressive ifort 
>> options such as -fast.
>> 
>> 
>>>>libifport.so.5 =>  
>>>> /opt/intel/Compiler/11.1/072/lib/intel64/libifport.so.5 
>>>> (0x2b6e7e081000)
>>>>libifcoremt.so.5 =>  
>>>> /opt/intel/Compiler/11.1/072/lib/intel64/libifcoremt.so.5 
>>>> (0x2b6e7e1ba000)
>>>>libimf.so =>  /opt/intel/Compiler/11.1/072/lib/intel64/libimf.so 
>>>> (0x2b6e7e45f000)
>>>>libsvml.so =>  /opt/intel/Compiler/11.1/072/lib/intel64/libsvml.so 
>>>> (0x2b6e7e7f4000)
>>>>libintlc.so.5 =>  
>>>> /opt/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x2b6e7ea0a000)
>>>> 
>> 
>> -- 
>> Tim Prince
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users






Re: [OMPI users] EXTERNAL: Re: qp memory allocation problem

2011-09-12 Thread Samuel K. Gutierrez

On Sep 12, 2011, at 10:16 AM, Blosch, Edwin L wrote:

> Samuel,
>  
> This worked. 

Great!

> Did this magic line disable the use of per-peer queue pairs? 

Yes, it sure did.

> I have seen a previous post by Jeff that explains what this line does 
> generally, but I didn’t study the post in detail, so if you could provide a 
> little explanation I would appreciate it.


The btl_openib_receive_queues MCA parameter takes a colon-delimited list of 
queue specifications, each of which is a comma-delimited list of parameters.  
The most general form is:

Queue pair type identifier,Buffer size (Required),Number of buffers 
(Required):Queue pair type identifier,Buffer size (Required),Number of buffers 
(Required):...:Queue pair type identifier,Buffer size (Required),Number of 
buffers (Required)

NOTE: Optional QP parameters omitted - see the OMPI FAQ regarding OpenFabrics 
tuning for more details.

The 'S' queue pair type identifier corresponds to "Shared queues."
The 'P' queue pair type identifier corresponds to "Per-peer queues."
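
So the value suggested earlier, for example, breaks down as follows:

  S,4096,128   ->  shared receive queue, 4096-byte buffers, 128 buffers
  S,12288,128  ->  shared receive queue, 12288-byte buffers, 128 buffers
  S,65536,12   ->  shared receive queue, 65536-byte buffers, 12 buffers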

Hope that helps,

Sam

>  
> Ed
>  
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Samuel K. Gutierrez
> Sent: Monday, September 12, 2011 10:49 AM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] qp memory allocation problem
>  
> Hi,
>  
> This problem can be caused by a variety of things, but I suspect our default 
> queue pair (QP) parameters aren't helping the situation :-).
>  
> What happens when you add the following to your mpirun command?
>  
> -mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,12
>  
> OMPI Developers:
>  
> Maybe we should consider disabling the use of per-peer queue pairs by 
> default.  Do they buy us anything?  For what it is worth, we have stopped 
> using them on all of our large systems here at LANL.
>  
> Thanks,
>  
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>  
> On Sep 12, 2011, at 9:23 AM, Blosch, Edwin L wrote:
> 
> 
> I am getting this error message below and I don’t know what it means or how 
> to fix it.   It only happens when I run on a large number of processes, e.g. 
> 960.  Things work fine on 480, and I don’t think the application has a bug.  
> Any help is appreciated…
>  
> [c1n01][[30697,1],3][connect/btl_openib_connect_oob.c:464:qp_create_one] 
> error creating qp errno says Cannot allocate memory
> [c1n01][[30697,1],4][connect/btl_openib_connect_oob.c:464:qp_create_one] 
> error creating qp errno says Cannot allocate memory
>  
> Here’s the mpirun command I used:
> mpirun --prefix /usr/mpi/intel/openmpi-1.4.3 --machinefile  -np 
> 960 --mca btl ^tcp --mca mpool_base_use_mem_hooks 1 --mca mpi_leave_pinned 1 
> -x LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 
>  
> Here’s the applicable hardware from the 
> /usr/mpi/intel/openmpi-1.4.3/share/openmpi/mca-btl-openib-device-params.ini 
> file:
> # A.k.a. ConnectX
> [Mellanox Hermon]
> vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
> vendor_part_id = 
> 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
> use_eager_rdma = 1
> mtu = 2048
> max_inline_data = 128
>  
> And this is the output of ompi_info --param btl openib:
>  MCA btl: parameter "btl_base_verbose" (current value: "0", 
> data source: default value)
>   Verbosity level of the BTL framework
>  MCA btl: parameter "btl" (current value: , data 
> source: default value)
>   Default selection set of components for the btl 
> framework ( means use all components that can be found)
>  MCA btl: parameter "btl_openib_verbose" (current value: "0", 
> data source: default value)
>   Output some verbose OpenIB BTL information (0 = no 
> output, nonzero = output)
>  MCA btl: parameter "btl_openib_warn_no_device_params_found" 
> (current value: "1", data source: default value, synonyms:
>   btl_openib_warn_no_hca_params_found)
>   Warn when no device-specific parameters are found 
> in the INI file specified by the btl_openib_device_param_files MCA
>   parameter (0 = do not warn; any other value = warn)
>  MCA btl: parameter "btl_openib_warn_no_hca_params_found" 
> (current value: "1", data source: default value, deprecated, synonym
>   of: btl_openib_warn_no_device_params_found)
>   Warn when no device-specific parameters are found 
> in the INI file specified by the btl_openib_device_param_files MCA
>   parameter (0 = do not warn;

Re: [OMPI users] qp memory allocation problem

2011-09-12 Thread Samuel K. Gutierrez
Hi,

This problem can be caused by a variety of things, but I suspect our default 
queue pair (QP) parameters aren't helping the situation :-).

What happens when you add the following to your mpirun command?

-mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,12
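
For example, appended to the mpirun command you posted (the host file and 
executable names below are placeholders):

mpirun --prefix /usr/mpi/intel/openmpi-1.4.3 --machinefile hosts -np 960 \
    --mca btl ^tcp --mca mpool_base_use_mem_hooks 1 --mca mpi_leave_pinned 1 \
    --mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,12 \
    -x LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 ./your_app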

OMPI Developers:

Maybe we should consider disabling the use of per-peer queue pairs by default.  
Do they buy us anything?  For what it is worth, we have stopped using them on 
all of our large systems here at LANL.

Thanks,
 
Samuel K. Gutierrez
Los Alamos National Laboratory

On Sep 12, 2011, at 9:23 AM, Blosch, Edwin L wrote:

> I am getting this error message below and I don’t know what it means or how 
> to fix it.   It only happens when I run on a large number of processes, e.g. 
> 960.  Things work fine on 480, and I don’t think the application has a bug.  
> Any help is appreciated…
>  
> [c1n01][[30697,1],3][connect/btl_openib_connect_oob.c:464:qp_create_one] 
> error creating qp errno says Cannot allocate memory
> [c1n01][[30697,1],4][connect/btl_openib_connect_oob.c:464:qp_create_one] 
> error creating qp errno says Cannot allocate memory
>  
> Here’s the mpirun command I used:
> mpirun --prefix /usr/mpi/intel/openmpi-1.4.3 --machinefile  -np 
> 960 --mca btl ^tcp --mca mpool_base_use_mem_hooks 1 --mca mpi_leave_pinned 1 
> -x LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 
>  
> Here’s the applicable hardware from the 
> /usr/mpi/intel/openmpi-1.4.3/share/openmpi/mca-btl-openib-device-params.ini 
> file:
> # A.k.a. ConnectX
> [Mellanox Hermon]
> vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
> vendor_part_id = 
> 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
> use_eager_rdma = 1
> mtu = 2048
> max_inline_data = 128
>  
> And this is the output of ompi_info --param btl openib:
>  MCA btl: parameter "btl_base_verbose" (current value: "0", 
> data source: default value)
>   Verbosity level of the BTL framework
>  MCA btl: parameter "btl" (current value: , data 
> source: default value)
>   Default selection set of components for the btl 
> framework ( means use all components that can be found)
>  MCA btl: parameter "btl_openib_verbose" (current value: "0", 
> data source: default value)
>   Output some verbose OpenIB BTL information (0 = no 
> output, nonzero = output)
>  MCA btl: parameter "btl_openib_warn_no_device_params_found" 
> (current value: "1", data source: default value, synonyms:
>   btl_openib_warn_no_hca_params_found)
>   Warn when no device-specific parameters are found 
> in the INI file specified by the btl_openib_device_param_files MCA
>   parameter (0 = do not warn; any other value = warn)
>  MCA btl: parameter "btl_openib_warn_no_hca_params_found" 
> (current value: "1", data source: default value, deprecated, synonym
>   of: btl_openib_warn_no_device_params_found)
>   Warn when no device-specific parameters are found 
> in the INI file specified by the btl_openib_device_param_files MCA
>   parameter (0 = do not warn; any other value = warn)
>  MCA btl: parameter "btl_openib_warn_default_gid_prefix" 
> (current value: "1", data source: default value)
>   Warn when there is more than one active ports and 
> at least one of them connected to the network with only default GID
>   prefix configured (0 = do not warn; any other value 
> = warn)
>  MCA btl: parameter "btl_openib_warn_nonexistent_if" (current 
> value: "1", data source: default value)
>   Warn if non-existent devices and/or ports are 
> specified in the btl_openib_if_[in|ex]clude MCA parameters (0 = do not
>   warn; any other value = warn)
>  MCA btl: parameter "btl_openib_want_fork_support" (current 
> value: "-1", data source: default value)
>   Whether fork support is desired or not (negative = 
> try to enable fork support, but continue even if it is not
>   available, 0 = do not enable fork support, positive 
> = try to enable fork support and fail if it is not available)
>  MCA btl: parameter "btl_openib_device_param_files" (current 
> value:
>   
> "/usr/mpi/intel/openmpi-1.4.3/share/openmpi/mca-btl-openib-device-para

Re: [OMPI users] [ompi-1.4.2] Infiniband issue on smoky @ ornl

2011-06-23 Thread Samuel K. Gutierrez
Hi,

QP = Queue Pair

Here are a couple of nice FAQ entries that I find useful.

http://www.open-mpi.org/faq/?category=openfabrics

And videos:

http://www.open-mpi.org/video/?category=openfabrics


--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Jun 23, 2011, at 8:22 AM, Mathieu Gontier wrote:

> Hi,
> 
> Thanks for your answer. It makes sense. 
> Sorry if my question seems silly, but what does QP mean? It is difficult to 
> read the FAQ without knowing that!
> 
> Thanks. 
> 
> On 06/23/2011 04:00 PM, Ralph Castain wrote:
>> 
>> One possibility: if you increase the number of processes in the job, and 
>> they all interconnect, then the IB interface can (I believe) run out of 
>> memory at some point. IIRC, the answer was to reduce the size of the QPs so 
>> that you could support a larger number of them.
>> 
>> You should find info about controlling QP size in the IB FAQ area on the 
>> OMPI web site, I believe.
>> 
>> On Jun 23, 2011, at 7:56 AM, Mathieu Gontier wrote:
>> 
>>> Hello, 
>>> 
>>> Thanks for the answer.
>>> I am testing with OpenMPI-1.4.3: my computation is queuing. But I did not 
>>> read anything obvious related to my issue. Have you read something which 
>>> could solve it? 
>>> I am going to submit my computation with --mca mpi_leave_pinned 0, but do 
>>> you have any idea how it affects the performance? Compared to using 
>>> Ethernet? 
>>> 
>>> Many thanks for your support. 
>>> 
>>> On 06/23/2011 03:01 PM, Josh Hursey wrote:
>>>> 
>>>> I wonder if this is related to memory pinning. Can you try turning off
>>>> the leave pinned, and see if the problem persists (this may affect
>>>> performance, but should avoid the crash):
>>>>   mpirun ... --mca mpi_leave_pinned 0 ...
>>>> 
>>>> Also it looks like Smoky has a slightly newer version of the 1.4
>>>> branch that you should try to switch to if you can. The following
>>>> command will show you all of the available installs on that machine:
>>>>   shell$ module avail ompi
>>>> 
>>>> For a list of supported compilers for that version try the 'show' option:
>>>> shell$ module show ompi/1.4.3
>>>> ---
>>>> /sw/smoky/modulefiles-centos/ompi/1.4.3:
>>>> 
>>>> module-whatis   This module configures your environment to make Open
>>>> MPI 1.4.3 available.
>>>> Supported Compilers:
>>>>  pathscale/3.2.99
>>>>  pathscale/3.2
>>>>  pgi/10.9
>>>>  pgi/10.4
>>>>  intel/11.1.072
>>>>  gcc/4.4.4
>>>>  gcc/4.4.3
>>>> ---
>>>> 
>>>> Let me know if that helps.
>>>> 
>>>> Josh
>>>> 
>>>> 
>>>> On Wed, Jun 22, 2011 at 4:16 AM, Mathieu Gontier
>>>> <mathieu.gont...@gmail.com> wrote:
>>>>> Dear all,
>>>>> 
>>>>> First of all, all my apologies because I post this message to both the bug
>>>>> and user mailing list. But for the moment, I do not know if it is a bug!
>>>>> 
>>>>> I am running a CFD structured flow solver at ORNL, and I have access 
>>>>> to a
>>>>> small cluster (Smoky) using OpenMPI-1.4.2 with Infiniband by default.
>>>>> Recently we increased the size of our models, and since that time we have
>>>>> run into many infiniband related problems.  The most serious problem is a
>>>>> hard crash with the following error message:
>>>>> 
>>>>> [smoky45][[60998,1],32][/sw/sources/ompi/1.4.2/ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
>>>>> error creating qp errno says Cannot allocate memory
>>>>> 
>>>>> If we force the solver to use ethernet (mpirun -mca btl ^openib) the
>>>>> computation works correctly, although very slowly (a single iteration takes
>>>>> ages). Do you have any idea what could be causing these problems?
>>>>> 
>>>>> If it is due to a bug or a limitation into OpenMPI, do you think the 
>>>>> version
>>>>> 1.4.3, the coming 1.4.4 or any 1.5 version could solve the problem? I read
>>>>> the release notes, but I did not read any obvious patch which could fix my
>>>>> problem. The system administrator is ready to compile a new package for 
>>>>> us,
>>>>> but I do not want to ask them to install too many of them.
>>>>> 
>>>>> Thanks.
>>>>> --
>>>>> 
>>>>> Mathieu Gontier
>>>>> skype: mathieu_gontier
>>>>> ___
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> 
>>>> 
>>> 
>>> -- 
>>> 
>>> Mathieu Gontier 
>>> skype: mathieu_gontier
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> -- 
> 
> Mathieu Gontier 
> skype: mathieu_gontier
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users








Re: [OMPI users] [ompi-1.4.2] Infiniband issue on smoky @ ornl

2011-06-23 Thread Samuel K. Gutierrez
Hi,

What happens when you don't run with per-peer queue pairs?  Try:

-mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,128
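
For example (the process count and executable name are placeholders):

mpirun -np 256 \
    -mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,128 \
    ./your_solver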

--
Samuel K. Gutierrez
Los Alamos National Laboratory


On Jun 23, 2011, at 7:56 AM, Mathieu Gontier wrote:

> Hello, 
> 
> Thanks for the answer.
> I am testing with OpenMPI-1.4.3: my computation is queuing. But I did not 
> read anything obvious related to my issue. Have you read something which 
> could solve it? 
> I am going to submit my computation with --mca mpi_leave_pinned 0, but do you 
> have any idea how it affects the performance? Compared to using Ethernet? 
> 
> Many thanks for your support. 
> 
> On 06/23/2011 03:01 PM, Josh Hursey wrote:
>> 
>> I wonder if this is related to memory pinning. Can you try turning off
>> the leave pinned, and see if the problem persists (this may affect
>> performance, but should avoid the crash):
>>   mpirun ... --mca mpi_leave_pinned 0 ...
>> 
>> Also it looks like Smoky has a slightly newer version of the 1.4
>> branch that you should try to switch to if you can. The following
>> command will show you all of the available installs on that machine:
>>   shell$ module avail ompi
>> 
>> For a list of supported compilers for that version try the 'show' option:
>> shell$ module show ompi/1.4.3
>> ---
>> /sw/smoky/modulefiles-centos/ompi/1.4.3:
>> 
>> module-whatis This module configures your environment to make Open
>> MPI 1.4.3 available.
>> Supported Compilers:
>>  pathscale/3.2.99
>>  pathscale/3.2
>>  pgi/10.9
>>  pgi/10.4
>>  intel/11.1.072
>>  gcc/4.4.4
>>  gcc/4.4.3
>> ---
>> 
>> Let me know if that helps.
>> 
>> Josh
>> 
>> 
>> On Wed, Jun 22, 2011 at 4:16 AM, Mathieu Gontier
>> <mathieu.gont...@gmail.com> wrote:
>>> Dear all,
>>> 
>>> First of all, all my apologies because I post this message to both the bug
>>> and user mailing list. But for the moment, I do not know if it is a bug!
>>> 
>>> I am running a CFD structured flow solver at ORNL, and I have access to a
>>> small cluster (Smoky) using OpenMPI-1.4.2 with Infiniband by default.
>>> Recently we increased the size of our models, and since that time we have
>>> run into many infiniband related problems.  The most serious problem is a
>>> hard crash with the following error message:
>>> 
>>> [smoky45][[60998,1],32][/sw/sources/ompi/1.4.2/ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
>>> error creating qp errno says Cannot allocate memory
>>> 
>>> If we force the solver to use ethernet (mpirun -mca btl ^openib) the
>>> computation works correctly, although very slowly (a single iteration takes
>>> ages). Do you have any idea what could be causing these problems?
>>> 
>>> If it is due to a bug or a limitation into OpenMPI, do you think the version
>>> 1.4.3, the coming 1.4.4 or any 1.5 version could solve the problem? I read
>>> the release notes, but I did not read any obvious patch which could fix my
>>> problem. The system administrator is ready to compile a new package for us,
>>> but I do not want to ask them to install too many of them.
>>> 
>>> Thanks.
>>> --
>>> 
>>> Mathieu Gontier
>>> skype: mathieu_gontier
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>> 
>> 
> 
> -- 
> 
> Mathieu Gontier 
> skype: mathieu_gontier
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users





Re: [OMPI users] Openib with > 32 cores per node

2011-05-19 Thread Samuel K. Gutierrez
Hi,

On May 19, 2011, at 9:37 AM, Robert Horton wrote:

> On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
>> Hi,
>> 
>> Try the following QP parameters that only use shared receive queues.
>> 
>> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
>> 
> 
> Thanks for that. If I run the job over 2 x 48 cores it now works and the
> performance seems reasonable (I need to do some more tuning) but when I
> go up to 4 x 48 cores I'm getting the same problem:
> 
> [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
>  error creating qp errno says Cannot allocate memory
> [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now 
> abort)
> 
> Any thoughts?

How much memory does each node have?  Does this happen at startup?

Try adding:

-mca btl_openib_cpc_include rdmacm

Also, if your version of OFED supports it, using XRC may help.  I **think** 
other tweaks are needed to get that going, but I'm not familiar with the 
details.
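
For example, combined with the shared-receive-queue setting from before (using 
the 4 x 48 core count you mentioned; the executable name is a placeholder):

mpirun -np 192 \
    -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 \
    -mca btl_openib_cpc_include rdmacm \
    ./hpcc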

Hope that helps,

Samuel K. Gutierrez
Los Alamos National Laboratory


> 
> Thanks,
> Rob
> -- 
> Robert Horton
> System Administrator (Research Support) - School of Mathematical Sciences
> Queen Mary, University of London
> r.hor...@qmul.ac.uk  -  +44 (0) 20 7882 7345
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users






Re: [OMPI users] Openib with > 32 cores per node

2011-05-19 Thread Samuel K. Gutierrez
Hi,

Try the following QP parameters that only use shared receive queues.

-mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
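
As a quick gloss on the syntax: each colon-separated entry describes one queue, 
where 'S' selects a shared receive queue and the first two numbers (buffer size 
and number of buffers) are required; the remaining numbers are optional tuning 
parameters described in the OpenFabrics section of the OMPI FAQ.  For example:

  S,12288,128,64,32  ->  shared queue, 12288-byte buffers, 128 buffers,
                         plus two optional tuning values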

Samuel K. Gutierrez
Los Alamos National Laboratory

On May 19, 2011, at 5:28 AM, Robert Horton wrote:

> Hi,
> 
> I'm having problems getting the MPIRandomAccess part of the HPCC
> benchmark to run with more than 32 processes on each node (each node has
> 4 x AMD 6172 so 48 cores total). Once I go past 32 processes I get an
> error like:
> 
> [compute-1-13.local][[5637,1],18][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
>  error creating qp errno says Cannot allocate memory
> [compute-1-13.local][[5637,1],18][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:815:rml_recv_cb]
>  error in endpoint reply start connect
> [compute-1-13.local:06117] [[5637,0],0]-[[5637,1],18] mca_oob_tcp_msg_recv: 
> readv failed: Connection reset by peer (104)
> [compute-1-13.local:6137] *** An error occurred in MPI_Isend
> [compute-1-13.local:6137] *** on communicator MPI_COMM_WORLD
> [compute-1-13.local:6137] *** MPI_ERR_OTHER: known error not in list
> [compute-1-13.local:6137] *** MPI_ERRORS_ARE_FATAL (your MPI job will now 
> abort)
> [compute-1-13.local][[5637,1],26][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
>  error creating qp errno says Cannot allocate memory
> [[5637,1],66][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3227:handle_wc]
>  from compute-1-13.local to: compute-1-13 error polling LP CQ with status 
> RETRY EXCEEDED ERROR status number 12 for wr_id 278870912 opcode 
> 
> I've tried changing btl_openib_receive_queues from
> P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> to
> P,128,512,256,512:S,2048,512,256,32:S,12288,512,256,32:S,65536,512,256,32
> 
> doing this lets the code run without the error, but it does so extremely
> slowly - I'm also seeing errors in dmesg such as:
> 
> CPU 12:
> Modules linked in: nfs fscache nfs_acl blcr(U) blcr_imports(U) autofs4 
> ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc ip_conntrack_netbios_ns 
> ipt_REJECT xt_state
> ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp 
> ip6table_filter ip6_tables x_tables cpufreq_ondemand powernow_k8 freq_table 
> rdma_ucm(U) ib_sd
> p(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) 
> ib_sa(U) ipv6 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U) iw_nes(U) 
> iw_cxgb3(U) cxgb3(U) 
> mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) dm_mirror dm_multipath scsi_dh 
> video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac 
> parport_p
> c lp parport joydev shpchp sg i2c_piix4 i2c_core ib_qib(U) dca ib_mad(U) 
> ib_core(U) igb 8021q serio_raw pcspkr dm_raid45 dm_message dm_region_hash 
> dm_log dm_mod dm_
> mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
> Pid: 3980, comm: qib/12 Tainted: G  2.6.18-164.6.1.el5 #1
> RIP: 0010:[]  [] tasklet_action+0x90/0xfd
> RSP: 0018:810c2f1bff40  EFLAGS: 0246
> RAX:  RBX: 0001 RCX: 810c2f1bff30
> RDX:  RSI: 81042f063400 RDI: 8030d180
> RBP: 810c2f1bfec0 R08: 0001 R09: 8104aec2d000
> R10: 810c2f1bff00 R11: 810c2f1bff00 R12: 8005dc8e
> R13: 81042f063480 R14: 80077874 R15: 810c2f1bfec0
> FS:  2b20829592e0() GS:81042f186bc0() knlGS:
> CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
> CR2: 2b2080b70720 CR3: 00201000 CR4: 06e0
> 
> Call Trace:
>   [] __do_softirq+0x89/0x133
> [] call_softirq+0x1c/0x28
> [] do_softirq+0x2c/0x85
> [] apic_timer_interrupt+0x66/0x6c
>   [] __kmalloc+0x97/0x9f
> [] :ib_qib:qib_verbs_send+0xdb3/0x104a
> [] _spin_unlock_irqrestore+0x8/0x9
> [] :ib_qib:qib_make_rc_req+0xbb1/0xbbf
> [] :ib_qib:qib_make_rc_req+0x0/0xbbf
> [] :ib_qib:qib_do_send+0x0/0x950
> [] :ib_qib:qib_do_send+0x91a/0x950
> [] __wake_up+0x38/0x4f
> [] :ib_qib:qib_do_send+0x0/0x950
> [] run_workqueue+0x94/0xe4
> [] worker_thread+0x0/0x122
> [] keventd_create_kthread+0x0/0xc4
> [] worker_thread+0xf0/0x122
> [] default_wake_function+0x0/0xe
> [] keventd_create_kthread+0x0/0xc4
> [] kthread+0xfe/0x132
> [] child_rip+0xa/0x11
> [] keventd_create_kthread+0x0/0xc4
> [] kthread+0x0/0x132
> [] child_rip+0x0/0x11
> 
> Any thoughts on how to proceed?
> 
> I'm running OpenMPI 1.4.3 compiled with gcc 4.1.2 and OFED 1.5.3.1
> 
> Thanks,
> Rob
> -- 
> Robert Horton
> System Administrator (Research Support) - School of Mathematical Sciences
> Queen Mary, University of London
> r.hor...@qmul.ac.uk  -  +44 (0) 20 7882 7345
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread Samuel K. Gutierrez
Hi,

Just out of curiosity - what happens when you add the following MCA option to 
your openib runs?

-mca btl_openib_flags 305

Thanks,

Samuel Gutierrez
Los Alamos National Laboratory

On May 13, 2011, at 2:38 PM, Brock Palen wrote:

> On May 13, 2011, at 4:09 PM, Dave Love wrote:
> 
>> Jeff Squyres  writes:
>> 
>>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>>> 
 We can reproduce it with IMB.  We could provide access, but we'd have to
 negotiate with the owners of the relevant nodes to give you interactive
 access to them.  Maybe Brock's would be more accessible?  (If you
 contact me, I may not be able to respond for a few days.)
>>> 
>>> Brock has replied off-list that he, too, is able to reliably reproduce the 
>>> issue with IMB, and is working to get access for us.  Many thanks for your 
>>> offer; let's see where Brock's access takes us.
>> 
>> Good.  Let me know if we could be useful
>> 
> -- we have not closed this issue,
 
 Which issue?   I couldn't find a relevant-looking one.
>>> 
>>> https://svn.open-mpi.org/trac/ompi/ticket/2714
>> 
>> Thanks.  In csse it's useful info, it hangs for me with 1.5.3 & np=32 on
>> connectx with more than one collective I can't recall.
> 
> Extra data point, that ticket said it ran with mpi_preconnect_mpi 1,  well 
> that doesn't help here, both my production code (crash) and IMB still hang.
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
>> 
>> -- 
>> Excuse the typping -- I have a broken wrist
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] mpi problems,

2011-04-04 Thread Samuel K. Gutierrez

What does 'ldd ring2' show?  How was it compiled?
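
For example, something like this, run on one of the nodes where it fails, 
should show exactly which libraries cannot be resolved:

  shell$ ldd ring2 | grep 'not found'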

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Apr 4, 2011, at 1:58 PM, Nehemiah Dacres wrote:


[jian@therock ~]$ echo $LD_LIBRARY_PATH
/opt/sun/sunstudio12.1/lib:/opt/vtk/lib:/opt/gridengine/lib/lx26-amd64:/opt/gridengine/lib/lx26-amd64:/home/jian/.crlibs:/home/jian/.crlibs32
[jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 4 -hostfile list ring2
ring2: error while loading shared libraries: libfui.so.1: cannot open shared object file: No such file or directory
ring2: error while loading shared libraries: libfui.so.1: cannot open shared object file: No such file or directory
ring2: error while loading shared libraries: libfui.so.1: cannot open shared object file: No such file or directory

mpirun: killing job...

--
mpirun noticed that process rank 1 with PID 31763 on node compute-0-1 exited on signal 0 (Unknown signal 0).

--
mpirun: clean termination accomplished

I really don't know what's wrong here. I was sure that would work.

On Mon, Apr 4, 2011 at 2:43 PM, Samuel K. Gutierrez  
<sam...@lanl.gov> wrote:

Hi,

Try prepending the path to your compiler libraries.

Example (bash-like):

export LD_LIBRARY_PATH=/compiler/prefix/lib:/ompi/prefix/lib:$LD_LIBRARY_PATH


--
Samuel K. Gutierrez
Los Alamos National Laboratory


On Apr 4, 2011, at 1:33 PM, Nehemiah Dacres wrote:

Altering LD_LIBRARY_PATH alters the process's path to MPI's  
libraries; how do I alter its path to compiler libs like libfui.so.1?  
It needs to find them because it was compiled by a Sun compiler.


On Mon, Apr 4, 2011 at 10:06 AM, Nehemiah Dacres <dacre...@slu.edu>  
wrote:


As Ralph indicated, he'll add the hostname to the error message  
(but that might be tricky; that error message is coming from rsh/ 
ssh...).


In the meantime, you might try (csh style):

foreach host (`cat list`)
   echo $host
   ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
end


that's what the tentakel line was referring to, or ...



On Apr 4, 2011, at 10:24 AM, Nehemiah Dacres wrote:

> I have installed it via a symlink on all of the nodes. I can go  
'tentakel which mpirun' and it finds it. I'll check the library  
paths, but isn't there a way to find out which nodes are returning  
the error?


I found it mislinked on a couple of nodes. Thank you.

--
Nehemiah I. Dacres
System Administrator
Advanced Technology Group Saint Louis University




--
Nehemiah I. Dacres
System Administrator
Advanced Technology Group Saint Louis University

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Nehemiah I. Dacres
System Administrator
Advanced Technology Group Saint Louis University





Re: [OMPI users] mpi problems,

2011-04-04 Thread Samuel K. Gutierrez

Hi,

Try prepending the path to your compiler libraries.

Example (bash-like):

export LD_LIBRARY_PATH=/compiler/prefix/lib:/ompi/prefix/lib:$LD_LIBRARY_PATH


--
Samuel K. Gutierrez
Los Alamos National Laboratory


On Apr 4, 2011, at 1:33 PM, Nehemiah Dacres wrote:

Altering LD_LIBRARY_PATH alters the process's path to MPI's  
libraries; how do I alter its path to compiler libs like libfui.so.1?  
It needs to find them because it was compiled by a Sun compiler.


On Mon, Apr 4, 2011 at 10:06 AM, Nehemiah Dacres <dacre...@slu.edu>  
wrote:


As Ralph indicated, he'll add the hostname to the error message (but  
that might be tricky; that error message is coming from rsh/ssh...).


In the meantime, you might try (csh style):

foreach host (`cat list`)
   echo $host
   ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
end


that's what the tentakel line was referring to, or ...



On Apr 4, 2011, at 10:24 AM, Nehemiah Dacres wrote:

> I have installed it via a symlink on all of the nodes. I can go  
'tentakel which mpirun' and it finds it. I'll check the library  
paths, but isn't there a way to find out which nodes are returning  
the error?


I found it mislinked on a couple of nodes. Thank you.

--
Nehemiah I. Dacres
System Administrator
Advanced Technology Group Saint Louis University




--
Nehemiah I. Dacres
System Administrator
Advanced Technology Group Saint Louis University

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OMPI seg fault by a class with weird address.

2011-03-15 Thread Samuel K. Gutierrez
I -think- setting OMPI_MCA_memory_ptmalloc2_disable to 1 will turn off  
OMPI's memory wrappers without having to rebuild.  Someone please  
correct me if I'm wrong :-).


For example (bash-like shell):

export OMPI_MCA_memory_ptmalloc2_disable=1
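
If the job spans several nodes, here is a sketch of also pushing that setting  
out to the remote processes (the executable name is a placeholder):

export OMPI_MCA_memory_ptmalloc2_disable=1
mpirun -x OMPI_MCA_memory_ptmalloc2_disable -np 4 ./your_app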

Hope that helps,

--
Samuel K. Gutierrez
Los Alamos National Laboratory


On Mar 15, 2011, at 9:19 AM, Jack Bryan wrote:


Thanks,

I do not have system administrator authorization.
I am afraid that I cannot rebuild OpenMPI --without-memory-manager.

Are there other ways to get around it ?

For example, use other things to replace "ptmalloc" ?

Any help is really appreciated.

thanks

From: belaid_...@hotmail.com
To: dtustud...@hotmail.com; us...@open-mpi.org
Subject: RE: [OMPI users] OMPI seg fault by a class with weird  
address.

Date: Tue, 15 Mar 2011 08:00:56 +

Hi Jack,
  I may need to see the whole code to decide, but my quick look  
suggests that ptmalloc is causing a problem with STL vector  
allocation. ptmalloc is the Open MPI internal malloc library. Could  
you try to build Open MPI without memory management (using  
--without-memory-manager) and let us know the outcome?  ptmalloc is  
not needed if you are not using an RDMA interconnect.
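
A minimal sketch of such a build (the installation prefix is only an  
example):

./configure --prefix=$HOME/openmpi-1.3.4-nomm --without-memory-manager
make all install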


  With best regards,
-Belaid.

From: dtustud...@hotmail.com
To: belaid_...@hotmail.com; us...@open-mpi.org
Subject: RE: [OMPI users] OMPI seg fault by a class with weird  
address.

Date: Tue, 15 Mar 2011 00:30:19 -0600

Hi,

Because the code is very long, I just  show the calling relationship  
of functions.


main()
{
scheduler();

}
scheduler()
{
 ImportIndices();
}

ImportIndices()
{
Index IdxNode ;
IdxNode = ReadFile("fileName");
}

Index ReadFile(const char* fileinput)
{
Index TempIndex;
.

}

vector<int> Index::GetPosition() const { return Position; }
vector<int> Index::GetColumn() const { return Column; }
vector<int> Index::GetYear() const { return Year; }
vector<string> Index::GetName() const { return Name; }
int Index::GetPosition(const int idx) const { return Position[idx]; }
int Index::GetColumn(const int idx) const { return Column[idx]; }
int Index::GetYear(const int idx) const { return Year[idx]; }
string Index::GetName(const int idx) const { return Name[idx]; }
int Index::GetSize() const { return Position.size(); }

The sequential code works well, and there is no  scheduler().

The parallel code output from gdb:
--
Breakpoint 1, myNeplanTaskScheduler(CNSGA2 *, int, int, int, ._85 *,  
char, int, message_para_to_workers_VecT &, MPI_Datatype, int &, int  
&, std::vector<std::vector<double, std::allocator >,  
std::allocator<std::vector<double, std::allocator > > > &,  
std::vector<std::vector<double, std::allocator >,  
std::allocator<std::vector<double, std::allocator > > > &,  
std::vector<double, std::allocator > &, int,  
std::vector<std::vector<double, std::allocator >,  
std::allocator<std::vector<double, std::allocator > > > &,  
MPI_Datatype, int, MPI_Datatype, int) (nsga2=0x118c490,

popSize=, nodeSize=,
myRank=, myChildpop=0x1208d80,  
genCandTag=65 'A',
generationNum=1, myPopParaVec=std::vector of length 4, capacity  
4 = {...},

message_to_master_type=0x7fffd540, myT1Flag=@0x7fffd68c,
myT2Flag=@0x7fffd688,
resultTaskPackageT1=std::vector of length 4, capacity 4 = {...},
resultTaskPackageT2Pr=std::vector of length 4, capacity 4 = {...},
xdataV=std::vector of length 4, capacity 4 = {...}, objSize=7,
resultTaskPackageT12=std::vector of length 4, capacity 4 = {...},
xdata_to_workers_type=0x121c410, myGenerationNum=1,
Mpara_to_workers_type=0x121b9b0, nconNum=0)
at src/nsga2/myNetplanScheduler.cpp:109
109 ImportIndices();
(gdb) c
Continuing.

Breakpoint 2, ImportIndices () at src/index.cpp:120
120 IdxNode = ReadFile("prepdata/idx_node.csv");
(gdb) c
Continuing.

Breakpoint 4, ReadFile (fileinput=0xd8663d "prepdata/idx_node.csv")
at src/index.cpp:86
86  Index TempIndex;
(gdb) c
Continuing.

Breakpoint 5, Index::Index (this=0x7fffcb80) at src/index.cpp:20
20  Name(0) {}
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x2b3b0b81 in opal_memory_ptmalloc2_int_malloc ()
   from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0

---
the backtrace output from the above parallel OpenMPI code:

(gdb) bt
#0  0x2b3b0b81 in opal_memory_ptmalloc2_int_malloc ()
   from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0
#1  0x2b3b2bd3 in opal_memory_ptmalloc2_malloc ()
   from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0
#2  0x003f7c8bd1dd in operator new(unsigned long) ()
   from /usr/lib64/libstdc++.so.6
#3  0x004646a7 in __gnu_cxx::new_allocator::allocate (
this=0x7fffcb80, __n=0)
  

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-09 Thread Samuel K. Gutierrez

On Feb 8, 2011, at 8:21 PM, Ralph Castain wrote:

I would personally suggest not reconfiguring your system simply to  
support a particular version of OMPI. The only difference between  
the 1.4 and 1.5 series wrt slurm is that we changed a few things to  
support a more recent version of slurm. It is relatively easy to  
backport that code to the 1.4 series, and it should be (mostly)  
backward compatible.


OMPI is agnostic wrt resource managers. We try to support all  
platforms, with our effort reflective of the needs of our developers  
and their organizations, and our perception of the relative size of  
the user community for a particular platform. Slurm is a fairly  
small community, mostly centered in the three DOE weapons labs, so  
our support for that platform tends to focus on their usage.


So, with that understanding...

Sam: can you confirm that 1.5.1 works on your TLCC machines?


Open MPI 1.5.1 works as expected on our TLCC machines.  Open MPI 1.4.3  
with your SLURM update was also tested.




I have created a ticket to upgrade the 1.4.4 release (due out any  
time now) with the 1.5.1 slurm support. Any interested parties can  
follow it here:


Thanks Ralph!

Sam



https://svn.open-mpi.org/trac/ompi/ticket/2717

Ralph


On Feb 8, 2011, at 6:23 PM, Michael Curtis wrote:



On 09/02/2011, at 9:16 AM, Ralph Castain wrote:


See below


On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:



On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:


Hi Michael,

You may have tried to send some debug information to the list,  
but it appears to have been blocked.  Compressed text output of  
the backtrace text is sufficient.



Odd, I thought I sent it to you directly.  In any case, here is  
the backtrace and some information from gdb:


$ salloc -n16 gdb -args mpirun mpi
(gdb) run
Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/ 
michael/home/ServerAdmin/mpi

[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x77b76869 in process_orted_launch_report (fd=-1,  
opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
342	pdatorted[mev->sender.vpid]->state =  
ORTE_PROC_STATE_RUNNING;

(gdb) bt
#0  0x77b76869 in process_orted_launch_report (fd=-1,  
opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
#1  0x778a7338 in event_process_active (base=0x615240) at  
event.c:651
#2  0x778a797e in opal_event_base_loop (base=0x615240,  
flags=1) at event.c:823

#3  0x778a756f in opal_event_loop (flags=1) at event.c:730
#4  0x7789b916 in opal_progress () at runtime/ 
opal_progress.c:189
#5  0x77b76e20 in orte_plm_base_daemon_callback  
(num_daemons=2) at base/plm_base_launch_support.c:459
#6  0x77b7bed7 in plm_slurm_launch_job (jdata=0x610560)  
at plm_slurm_module.c:360
#7  0x00403f46 in orterun (argc=2, argv=0x7fffe7d8)  
at orterun.c:754
#8  0x00402fb4 in main (argc=2, argv=0x7fffe7d8) at  
main.c:13

(gdb) print pdatorted
$1 = (orte_proc_t **) 0x67c610
(gdb) print mev
$2 = (orte_message_event_t *) 0x681550
(gdb) print mev->sender.vpid
$3 = 4294967295
(gdb) print mev->sender
$4 = {jobid = 1721696256, vpid = 4294967295}
(gdb) print *mev
$5 = {super = {obj_magic_id = 16046253926196952813, obj_class =  
0x77dd4f40, obj_reference_count = 1, cls_init_file_name =  
0x77bb9a78 "base/plm_base_launch_support.c",
cls_init_lineno = 423}, ev = 0x680850, sender = {jobid =  
1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file  
= 0x680640 "rml_oob_component.c", line = 279}


The jobid and vpid look like the defined INVALID values,  
indicating that something is quite wrong. This would quite likely  
lead to the segfault.


From this, it would indeed appear that you are getting some kind  
of library confusion - the most likely cause of such an error is  
a daemon from a different version trying to respond, and so the  
returned message isn't correct.


Not sure why else it would be happening...you could try setting - 
mca plm_base_verbose 5 to get more debug output displayed on your  
screen, assuming you built OMPI with --enable-debug.




Found the problem.  It is a site configuration issue, which I'll  
need to find a workaround for.


[bio-ipc.{FQDN}:27523] mca:base:select:(  plm) Query of component  
[slurm] set priority to 75
[bio-ipc.{FQDN}:27523] mca:base:select:(  plm) Selected component  
[slurm]

[bio-ipc.{FQDN}:27523] mca: base: close: component rsh closed
[bio-ipc.{FQDN}:27523] mca: base: close: unloading component rsh
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: initial bias 27523  
nodename hash 1936089714

[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: final jobfam 31383
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:receive start comm
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching job  
[31383,1]
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:setup_job for job  

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Samuel K. Gutierrez

Hi Michael,

You may have tried to send some debug information to the list, but it  
appears to have been blocked.  Compressed text output of the backtrace  
text is sufficient.


Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Feb 7, 2011, at 8:38 AM, Samuel K. Gutierrez wrote:


Hi,

A detailed backtrace from a core dump may help us debug this.  Would  
you be willing to provide that information for us?


Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:



On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:

Hi,

I just tried to reproduce the problem that you are experiencing  
and was unable to.


SLURM 2.1.15
Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/ 
lanl/tlcc/debug-nopanasas


I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the  
same platform file (the only change was to re-enable btl-tcp).


Unfortunately, the result is the same:
salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
salloc: Granted job allocation 145

   JOB MAP   

Data for node: Name: eng-ipc4.{FQDN}Num procs: 8
Process OMPI jobid: [6932,1] Process rank: 0
Process OMPI jobid: [6932,1] Process rank: 1
Process OMPI jobid: [6932,1] Process rank: 2
Process OMPI jobid: [6932,1] Process rank: 3
Process OMPI jobid: [6932,1] Process rank: 4
Process OMPI jobid: [6932,1] Process rank: 5
Process OMPI jobid: [6932,1] Process rank: 6
Process OMPI jobid: [6932,1] Process rank: 7

Data for node: Name: ipc3   Num procs: 8
Process OMPI jobid: [6932,1] Process rank: 8
Process OMPI jobid: [6932,1] Process rank: 9
Process OMPI jobid: [6932,1] Process rank: 10
Process OMPI jobid: [6932,1] Process rank: 11
Process OMPI jobid: [6932,1] Process rank: 12
Process OMPI jobid: [6932,1] Process rank: 13
Process OMPI jobid: [6932,1] Process rank: 14
Process OMPI jobid: [6932,1] Process rank: 15

=
[eng-ipc4:31754] *** Process received signal ***
[eng-ipc4:31754] Signal: Segmentation fault (11)
[eng-ipc4:31754] Signal code: Address not mapped (1)
[eng-ipc4:31754] Failing at address: 0x8012eb748
[eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
[eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869)  
[0x7f81cf262869]
[eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338)  
[0x7f81cef93338]
[eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e)  
[0x7f81cef9397e]
[eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so. 
0(opal_event_loop+0x1f) [0x7f81cef9356f]
[eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so. 
0(opal_progress+0x89) [0x7f81cef87916]
[eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so. 
0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
[eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7)  
[0x7f81cf267ed7]

[eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
[eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
[eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd)  
[0x7f81ce14bc4d]

[eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
[eng-ipc4:31754] *** End of error message ***
salloc: Relinquishing job allocation 145
salloc: Job allocation 145 has been revoked.
zsh: exit 1 salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ 
ServerAdmin/mpi


I've anonymised the paths and domain, otherwise pasted verbatim.   
The only odd thing I notice is that the launching machine uses its  
full domain name, whereas the other machine is referred to by the  
short name.  Despite the FQDN, the domain does not exist in the DNS  
(for historical reasons), but does exist in the /etc/hosts file.


Any further clues would be appreciated.  In case it may be  
relevant, core system versions are: glibc 2.11, gcc 4.4.3, kernel  
2.6.32.  One other point of difference may be that our environment  
is tcp (ethernet) based whereas the LANL test environment is not?


Michael


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-07 Thread Samuel K. Gutierrez

Hi,

A detailed backtrace from a core dump may help us debug this.  Would  
you be willing to provide that information for us?


Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:



On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:

Hi,

I just tried to reproduce the problem that you are experiencing and  
was unable to.


SLURM 2.1.15
Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/ 
lanl/tlcc/debug-nopanasas


I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same  
platform file (the only change was to re-enable btl-tcp).


Unfortunately, the result is the same:
salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
salloc: Granted job allocation 145

   JOB MAP   

Data for node: Name: eng-ipc4.{FQDN}Num procs: 8
Process OMPI jobid: [6932,1] Process rank: 0
Process OMPI jobid: [6932,1] Process rank: 1
Process OMPI jobid: [6932,1] Process rank: 2
Process OMPI jobid: [6932,1] Process rank: 3
Process OMPI jobid: [6932,1] Process rank: 4
Process OMPI jobid: [6932,1] Process rank: 5
Process OMPI jobid: [6932,1] Process rank: 6
Process OMPI jobid: [6932,1] Process rank: 7

Data for node: Name: ipc3   Num procs: 8
Process OMPI jobid: [6932,1] Process rank: 8
Process OMPI jobid: [6932,1] Process rank: 9
Process OMPI jobid: [6932,1] Process rank: 10
Process OMPI jobid: [6932,1] Process rank: 11
Process OMPI jobid: [6932,1] Process rank: 12
Process OMPI jobid: [6932,1] Process rank: 13
Process OMPI jobid: [6932,1] Process rank: 14
Process OMPI jobid: [6932,1] Process rank: 15

=
[eng-ipc4:31754] *** Process received signal ***
[eng-ipc4:31754] Signal: Segmentation fault (11)
[eng-ipc4:31754] Signal code: Address not mapped (1)
[eng-ipc4:31754] Failing at address: 0x8012eb748
[eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
[eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869)  
[0x7f81cf262869]
[eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338)  
[0x7f81cef93338]
[eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e)  
[0x7f81cef9397e]
[eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so. 
0(opal_event_loop+0x1f) [0x7f81cef9356f]
[eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress 
+0x89) [0x7f81cef87916]
[eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so. 
0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
[eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7)  
[0x7f81cf267ed7]

[eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
[eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
[eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd)  
[0x7f81ce14bc4d]

[eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
[eng-ipc4:31754] *** End of error message ***
salloc: Relinquishing job allocation 145
salloc: Job allocation 145 has been revoked.
zsh: exit 1 salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ 
ServerAdmin/mpi


I've anonymised the paths and domain, otherwise pasted verbatim.   
The only odd thing I notice is that the launching machine uses its  
full domain name, whereas the other machine is referred to by the  
short name.  Despite the FQDN, the domain does not exist in the DNS  
(for historical reasons), but does exist in the /etc/hosts file.


Any further clues would be appreciated.  In case it may be relevant,  
core system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32.  One  
other point of difference may be that our environment is tcp  
(ethernet) based whereas the LANL test environment is not?


Michael


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-03 Thread Samuel K. Gutierrez

Hi,

I just tried to reproduce the problem that you are experiencing and  
was unable to.


[samuel@lo1-fe ~]$ salloc -n32 mpirun --display-map ./mpi_app
salloc: Job is in held state, pending scheduler release
salloc: Pending job allocation 138319
salloc: job 138319 queued and waiting for resources
salloc: job 138319 has been allocated resources
salloc: Granted job allocation 138319

    JOB MAP   

 Data for node: Name: lob083   Num procs: 16
Process OMPI jobid: [26464,1] Process rank: 0
Process OMPI jobid: [26464,1] Process rank: 1
Process OMPI jobid: [26464,1] Process rank: 2
Process OMPI jobid: [26464,1] Process rank: 3
Process OMPI jobid: [26464,1] Process rank: 4
Process OMPI jobid: [26464,1] Process rank: 5
Process OMPI jobid: [26464,1] Process rank: 6
Process OMPI jobid: [26464,1] Process rank: 7
Process OMPI jobid: [26464,1] Process rank: 8
Process OMPI jobid: [26464,1] Process rank: 9
Process OMPI jobid: [26464,1] Process rank: 10
Process OMPI jobid: [26464,1] Process rank: 11
Process OMPI jobid: [26464,1] Process rank: 12
Process OMPI jobid: [26464,1] Process rank: 13
Process OMPI jobid: [26464,1] Process rank: 14
Process OMPI jobid: [26464,1] Process rank: 15

 Data for node: Name: lob084   Num procs: 16
Process OMPI jobid: [26464,1] Process rank: 16
Process OMPI jobid: [26464,1] Process rank: 17
Process OMPI jobid: [26464,1] Process rank: 18
Process OMPI jobid: [26464,1] Process rank: 19
Process OMPI jobid: [26464,1] Process rank: 20
Process OMPI jobid: [26464,1] Process rank: 21
Process OMPI jobid: [26464,1] Process rank: 22
Process OMPI jobid: [26464,1] Process rank: 23
Process OMPI jobid: [26464,1] Process rank: 24
Process OMPI jobid: [26464,1] Process rank: 25
Process OMPI jobid: [26464,1] Process rank: 26
Process OMPI jobid: [26464,1] Process rank: 27
Process OMPI jobid: [26464,1] Process rank: 28
Process OMPI jobid: [26464,1] Process rank: 29
Process OMPI jobid: [26464,1] Process rank: 30
Process OMPI jobid: [26464,1] Process rank: 31


SLURM 2.1.15
Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas


I'll dig a bit further.

Sam

On Feb 2, 2011, at 9:53 AM, Samuel K. Gutierrez wrote:


Hi,

We'll try to reproduce the problem.

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory


On Feb 2, 2011, at 2:55 AM, Michael Curtis wrote:



On 28/01/2011, at 8:16 PM, Michael Curtis wrote:



On 27/01/2011, at 4:51 PM, Michael Curtis wrote:

Some more debugging information:
Is anyone able to help with this problem?  As far as I can tell
it's a stock-standard, recently installed SLURM setup.


I can try 1.5.1, but I am hesitant to deploy it, as that would require
recompiling some rather large pieces of software.  Should I re-post
to the -devel lists?


Regards,






Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-02 Thread Samuel K. Gutierrez

Hi,

We'll try to reproduce the problem.

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory


On Feb 2, 2011, at 2:55 AM, Michael Curtis wrote:



On 28/01/2011, at 8:16 PM, Michael Curtis wrote:



On 27/01/2011, at 4:51 PM, Michael Curtis wrote:

Some more debugging information:
Is anyone able to help with this problem?  As far as I can tell it's
a stock-standard, recently installed SLURM setup.


I can try 1.5.1, but I am hesitant to deploy it, as that would require
recompiling some rather large pieces of software.  Should I re-post
to the -devel lists?


Regards,






Re: [OMPI users] Deprecated parameter: plm_rsh_agent

2010-11-05 Thread Samuel K. Gutierrez

Hi Josh,

I -think- the new name is orte_rsh_agent.  At least according to  
ompi_info.


$ ompi_info -a --parsable | grep orte_rsh_agent
mca:orte:base:param:orte_rsh_agent:value:ssh : rsh
mca:orte:base:param:orte_rsh_agent:data_source:default value
mca:orte:base:param:orte_rsh_agent:status:writable
mca:orte:base:param:orte_rsh_agent:help:The command used to launch  
executables on remote nodes (typically either "ssh" or "rsh")

mca:orte:base:param:orte_rsh_agent:deprecated:no
mca:orte:base:param:orte_rsh_agent:synonym:name:pls_rsh_agent
mca:orte:base:param:orte_rsh_agent:synonym:name:plm_rsh_agent
mca:plm:base:param:plm_rsh_agent:synonym_of:name:orte_rsh_agent
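
If you need to set it explicitly under the new name, here is a minimal
sketch using the usual MCA mechanisms (bash-like shell; the application
name and process count are just placeholders):

export OMPI_MCA_orte_rsh_agent=ssh
# or, equivalently, per run on the command line:
mpirun --mca orte_rsh_agent ssh -np 4 ./a.out

Either form should avoid the deprecation warning that the old
plm_rsh_agent name triggers when it is set in an MCA parameter file.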

--
Samuel K. Gutierrez
Los Alamos National Laboratory


On Nov 5, 2010, at 12:41 PM, Joshua Bernstein wrote:


Hello All,

When building the examples included with OpenMPI version 1.5 I see a  
message printed as follows:


--
A deprecated MCA parameter value was specified in an MCA parameter
file.  Deprecated MCA parameters should be avoided; they may disappear
in future releases.

 Deprecated parameter: plm_rsh_agent
--

While I know that in pre 1.3.x releases the variable was  
pls_rsh_agent, plm_rsh_agent worked all the way through at least  
1.4.3. What is the new keyword name? I can't seem to find it in the  
FAQ located here:


http://www.open-mpi.org/faq/?category=rsh

-Josh




Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Samuel K. Gutierrez

Hi Gus,

Doh!  I didn't see the kernel-related messages after the segfault  
message.  Definitely some weirdness here that is beyond your  
control... Sorry about that.


--
Samuel K. Gutierrez
Los Alamos National Laboratory

On May 6, 2010, at 3:28 PM, Gus Correa wrote:


Hi Samuel

Samuel K. Gutierrez wrote:

Hi Gus,
This may not help, but it's worth a try.  If it's not too much  
trouble, can you please reconfigure your Open MPI installation with  
--enable-debug and then rebuild?  After that, may we see the stack  
trace from a core file that is produced after the segmentation fault?

Thanks,
--
Samuel K. Gutierrez
Los Alamos National Laboratory


Thank you for the suggestion.

I am a bit reluctant to try this because when it fails,
it *really* fails.
Most of the time the machine doesn't even return the prompt,
and in all cases it freezes and requires a hard reboot.
It is not a segfault that the OS can catch, I guess.
I wonder whether enabling debug mode would do much for us:
would it get to the point of dumping a core, or just die before that?

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-


On May 6, 2010, at 12:01 PM, Gus Correa wrote:

Hi Eugene

Thanks for the detailed answer.

*

1) Now I can see and use the btl_sm_num_fifos component:

I had committed already "btl = ^sm" to the openmpi-mca-params.conf
file.  This apparently hides the btl_sm_num_fifos from ompi_info.

After I switched to no options in openmpi-mca-params.conf,
then ompi_info showed the btl_sm_num_fifos component.

ompi_info --all | grep btl_sm_num_fifos
   MCA btl: parameter "btl_sm_num_fifos" (current  
value: "1", data source: default value)


A side comment:
This means that the system administrator can
hide some Open MPI options from the users, depending on what
he puts in the openmpi-mca-params.conf file, right?
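
For reference, a minimal sketch of what such a site-wide default looks
like in the file (the exact location is an assumption; it normally sits
under the install prefix's etc/ directory):

# <openmpi-prefix>/etc/openmpi-mca-params.conf   (path assumed)
btl = ^sm

A value set there only changes the default: a user can still override
it per run with --mca on the command line or an OMPI_MCA_* environment
variable.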

*

2) However, running with "sm" still breaks, unfortunately:

Boomer!
I get the same errors that I reported in my very
first email, if I increase the number of processes to 16,
to explore the hyperthreading range.

This is using "sm" (i.e. not excluded in the mca config file),
and btl_sm_num_fifos (mpiexec command line)

The machine hangs, requires a hard reboot, etc., as reported earlier.
See below, please.

So, I guess the conclusion is that I can use sm,
but I have to remain within the range of physical cores (8),
not oversubscribe, not try to explore the HT range.
Should I expect it to work also for np>number of physical cores?

I wonder if this would still work with np<=8, but with heavier code.
(I only used hello_c.c so far.)
Not sure I'll be able to test this; the user wants to use the machine.



$mpiexec -mca btl_sm_num_fifos 4 -np 4 a.out
Hello, world, I am 0 of 4
Hello, world, I am 1 of 4
Hello, world, I am 2 of 4
Hello, world, I am 3 of 4

$ mpiexec -mca btl_sm_num_fifos 8 -np 8 a.out
Hello, world, I am 0 of 8
Hello, world, I am 1 of 8
Hello, world, I am 2 of 8
Hello, world, I am 3 of 8
Hello, world, I am 4 of 8
Hello, world, I am 5 of 8
Hello, world, I am 6 of 8
Hello, world, I am 7 of 8

$ mpiexec -mca btl_sm_num_fifos 16 -np 16 a.out
--
mpiexec noticed that process rank 8 with PID 3659 on node  
spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).

--
$

Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:[ cut here ]

Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:invalid opcode:  [#1] SMP

Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:last sysfs file: /sys/devices/system/cpu/cpu15/topology/physical_package_id


Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:Stack:

Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:Call Trace:

Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00  
00 4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39  
d0 73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94  
24 00 01


*

Many thanks,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-


Eugene Loh wrote:

Gus Correa wrote:

Hi Eugene

Thank you for answering one of my original questions.

However, there seems to be a problem with the syntax.
Is it really "-mca btl btl_sm_num_fifos=some_number"?

No.  Try "--mca btl_sm_num_fifos 4".  Or,
% setenv OMPI_MCA

Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?

2010-05-06 Thread Samuel K. Gutierrez

Hi Gus,

This may not help, but it's worth a try.  If it's not too much  
trouble, can you please reconfigure your Open MPI installation with -- 
enable-debug and then rebuild?  After that, may we see the stack trace  
from a core file that is produced after the segmentation fault?


Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On May 6, 2010, at 12:01 PM, Gus Correa wrote:


Hi Eugene

Thanks for the detailed answer.

*

1) Now I can see and use the btl_sm_num_fifos component:

I had committed already "btl = ^sm" to the openmpi-mca-params.conf
file.  This apparently hides the btl_sm_num_fifos from ompi_info.

After I switched to no options in openmpi-mca-params.conf,
then ompi_info showed the btl_sm_num_fifos component.

ompi_info --all | grep btl_sm_num_fifos
MCA btl: parameter "btl_sm_num_fifos" (current  
value: "1", data source: default value)


A side comment:
This means that the system administrator can
hide some Open MPI options from the users, depending on what
he puts in the openmpi-mca-params.conf file, right?

*

2) However, running with "sm" still breaks, unfortunately:

Boomer!
I get the same errors that I reported in my very
first email, if I increase the number of processes to 16,
to explore the hyperthreading range.

This is using "sm" (i.e. not excluded in the mca config file),
and btl_sm_num_fifos (mpiexec command line)

The machine hangs, requires a hard reboot, etc., as reported earlier.
See below, please.

So, I guess the conclusion is that I can use sm,
but I have to remain within the range of physical cores (8),
not oversubscribe, not try to explore the HT range.
Should I expect it to work also for np>number of physical cores?

I wonder if this would still work with np<=8, but with heavier code.
(I only used hello_c.c so far.)
Not sure I'll be able to test this; the user wants to use the machine.


$mpiexec -mca btl_sm_num_fifos 4 -np 4 a.out
Hello, world, I am 0 of 4
Hello, world, I am 1 of 4
Hello, world, I am 2 of 4
Hello, world, I am 3 of 4

$ mpiexec -mca btl_sm_num_fifos 8 -np 8 a.out
Hello, world, I am 0 of 8
Hello, world, I am 1 of 8
Hello, world, I am 2 of 8
Hello, world, I am 3 of 8
Hello, world, I am 4 of 8
Hello, world, I am 5 of 8
Hello, world, I am 6 of 8
Hello, world, I am 7 of 8

$ mpiexec -mca btl_sm_num_fifos 16 -np 16 a.out
--
mpiexec noticed that process rank 8 with PID 3659 on node  
spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).

--
$

Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:[ cut here ]

Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:invalid opcode:  [#1] SMP

Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:last sysfs file: /sys/devices/system/cpu/cpu15/topology/physical_package_id


Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:Stack:

Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:Call Trace:

Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00  
00 4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0  
73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01


*

Many thanks,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-


Eugene Loh wrote:

Gus Correa wrote:

Hi Eugene

Thank you for answering one of my original questions.

However, there seems to be a problem with the syntax.
Is it really "-mca btl btl_sm_num_fifos=some_number"?

No.  Try "--mca btl_sm_num_fifos 4".  Or,
% setenv OMPI_MCA_btl_sm_num_fifos 4
% ompi_info -a | grep btl_sm_num_fifos # check that things were  
set correctly

% mpirun -n 4 a.out

When I grep any component starting with btl_sm I get nothing:

ompi_info --all | grep btl_sm
(No output)
I'm no guru, but I think the reason has something to do with  
dynamically loaded somethings.  E.g.,

% /home/eugene/ompi/bin/ompi_info --all | grep btl_sm_num_fifos
(no output)
% setenv OPAL_PREFIX /home/eugene/ompi
% set path = ( $OPAL_PREFIX/bin $path )
% ompi_info --all | grep btl_sm_num_fifos
   MCA btl: parameter "btl_sm_num_fifos" (current  
value: "1", data source: default value)
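
For a bash-like shell, the equivalent would be (a sketch, assuming the
same example prefix):

export OPAL_PREFIX=/home/eugene/ompi
export PATH=$OPAL_PREFIX/bin:$PATH
ompi_info --all | grep btl_sm_num_fifos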





Re: [OMPI users] Problem in using openmpi

2010-03-12 Thread Samuel K. Gutierrez

Hi,

If lib64 isn't there, try lib.  That is,

export LD_LIBRARY_PATH=/home/jess/local/ompi/lib

Referencing the example that I provided earlier.

--
Samuel K. Gutierrez
Los Alamos National Laboratory





On Mar 12, 2010, at 3:31 PM, vaibhav dutt wrote:


Hi,

I used the export command, but it does not seem to work.  It's still
giving the same error, as the lib64 directory does not exist in the
ompi folder.

Any suggestions?

Thank You,
Vaibhav



On Fri, Mar 12, 2010 at 3:05 PM, Fernando Lemos  
<fernando...@gmail.com> wrote:
On Fri, Mar 12, 2010 at 6:02 PM, Samuel K. Gutierrez  
<sam...@lanl.gov> wrote:

> One more thing.  The line should have been:
>
> export LD_LIBRARY_PATH=/home/jess/local/ompi/lib64
>
> The space in the previous email will make bash unhappy 8-|.
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
> On Mar 12, 2010, at 1:56 PM, Samuel K. Gutierrez wrote:
>
>> Hi,
>>
>> It sounds like you may need to set your LD_LIBRARY_PATH environment
>> variable correctly.  There are several ways that you can tell the
>> dynamic linker where the required libraries are located, but the
>> following may be sufficient for your needs.
>>
>> Let's say, for example, that your Open MPI installation is rooted at
>> /home/jess/local/ompi and the libraries are located in
>> /home/jess/local/ompi/lib64, try (bash-like shell):
>>
>> export LD_LIBRARY_PATH= /home/jess/local/ompi/lib64
>>
>> Hope this helps,
>>
>> --
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>>
>> On Mar 12, 2010, at 1:32 PM, vaibhav dutt wrote:
>>
>>> Hi,
>>>
>>> I have installed openmpi on Kubuntu, with a dual-core AMD Athlon (Linux).

>>> When trying to compile a simple program, I am getting an error.
>>>
>>> mpicc: error while loading shared libraries: libopen-pal.so.0: cannot
>>> open shared object file: No such file or dir
>>>
>>> I read somewhere that this error is because of some Intel compiler
>>> not being installed on the proper node, which I don't understand as I
>>> am using AMD.
>>>
>>> Kindly give your suggestions
>>>
>>> Thank You

It's probably a packaging error, if he used the distribution's
packages. In that case, he should report the bug to downstream.

If he installed from source, then it's most likely installed somewhere
outside the library path, and the LD_LIBRARY_PATH trick might work (if
it doesn't, make sure there are no leftovers, recompile, reinstall and
it should work fine).
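
As a quick sanity check after setting the variable, a sketch (bash-like
shell, using the install prefix mentioned earlier in this thread):

export LD_LIBRARY_PATH=/home/jess/local/ompi/lib
ldd $(which mpicc) | grep libopen-pal
# libopen-pal.so.0 should now resolve to a path inside the ompi install tree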


Regards,





Re: [OMPI users] Problem in using openmpi

2010-03-12 Thread Samuel K. Gutierrez

One more thing.  The line should have been:

export LD_LIBRARY_PATH=/home/jess/local/ompi/lib64

The space in the previous email will make bash unhappy 8-|.

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Mar 12, 2010, at 1:56 PM, Samuel K. Gutierrez wrote:


Hi,

It sounds like you may need to set your LD_LIBRARY_PATH environment  
variable correctly.  There are several ways that you can tell the  
dynamic linker where the required libraries are located, but the  
following may be sufficient for your needs.


Let's say, for example, that your Open MPI installation is rooted at
/home/jess/local/ompi and the libraries are located in
/home/jess/local/ompi/lib64, try (bash-like shell):


export LD_LIBRARY_PATH= /home/jess/local/ompi/lib64

Hope this helps,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Mar 12, 2010, at 1:32 PM, vaibhav dutt wrote:


Hi,

I have installed openmpi on Kubuntu, with a dual-core AMD Athlon (Linux).

When trying to compile a simple program, I am getting an error.

mpicc: error while loading shared libraries: libopen-pal.so.0: cannot
open shared object file: No such file or dir

I read somewhere that this error is because of some Intel compiler
not being installed on the proper node, which I don't understand as I
am using AMD.

Kindly give your suggestions

Thank You




Re: [OMPI users] Problem in using openmpi

2010-03-12 Thread Samuel K. Gutierrez

Hi,

It sounds like you may need to set your LD_LIBRARY_PATH environment  
variable correctly.  There are several ways that you can tell the  
dynamic linker where the required libraries are located, but the  
following may be sufficient for your needs.


Let's say, for example, that your Open MPI installation is rooted at
/home/jess/local/ompi and the libraries are located in
/home/jess/local/ompi/lib64, try (bash-like shell):


export LD_LIBRARY_PATH= /home/jess/local/ompi/lib64

Hope this helps,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Mar 12, 2010, at 1:32 PM, vaibhav dutt wrote:


Hi,

I have installed openmpi on Kubuntu, with a dual-core AMD Athlon (Linux).

When trying to compile a simple program, I am getting an error.

mpicc: error while loading shared libraries: libopen-pal.so.0: cannot
open shared object file: No such file or dir

I read somewhere that this error is because of some Intel compiler
not being installed on the proper node, which I don't understand as I
am using AMD.

Kindly give your suggestions

Thank You




Re: [OMPI users] memchecker overhead?

2009-10-26 Thread Samuel K. Gutierrez

Hi Jed,

I'm not sure if this will help, but it's worth a try.  Turn off OMPI's  
memory wrapper and see what happens.


c-like shell
setenv OMPI_MCA_memory_ptmalloc2_disable 1

bash-like shell
export OMPI_MCA_memory_ptmalloc2_disable=1

Also add the following MCA parameter to your run command.

--mca mpi_leave_pinned 0
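
Putting those together, a minimal sketch of a run command (bash-like
shell; the application name, process count, and the valgrind wrapper
are placeholders for Jed's setup):

export OMPI_MCA_memory_ptmalloc2_disable=1
mpirun --mca mpi_leave_pinned 0 -np 4 valgrind ./a.out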

--
Samuel K. Gutierrez
Los Alamos National Laboratory


On Oct 26, 2009, at 1:41 PM, Jed Brown wrote:


Jeff Squyres wrote:

Using --enable-debug adds in a whole pile of developer-level run-time
checking and whatnot.  You probably don't want that on production  
runs.


I have found that --enable-debug --enable-memchecker actually produces
more valgrind noise than leaving them off.  Are there options to make
Open MPI strict about initializing and freeing memory?  At one point I
tried to write policy files, but even with judicious globbing, I kept
getting different warnings when run on a different program.  (All  
these

codes were squeaky-clean under MPICH2.)

Jed





Re: [OMPI users] memalign usage in OpenMPI and its consequences for TotalView

2009-10-01 Thread Samuel K. Gutierrez

Ticket created (#2040).  I hope it's okay ;-).

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Oct 1, 2009, at 11:58 AM, Jeff Squyres wrote:


Did that make it over to the v1.3 branch?


On Oct 1, 2009, at 1:39 PM, Samuel K. Gutierrez wrote:


Hi,

I think Jeff has already addressed this problem.

https://svn.open-mpi.org/trac/ompi/changeset/21744

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Oct 1, 2009, at 11:25 AM, Peter Thompson wrote:

> We had a question from a user who had turned on memory debugging in
> TotalView and experienced a memory event error, "Invalid memory
> alignment request".  Having a 1.3.3 build of OpenMPI handy, I tested
> it and sure enough, saw the error.  I traced it down to, surprise, a
> call to memalign.  I find there are a few places where memalign is
> called, but the one I think I was dealing with was from malloc.c in
> ompi/mca//io/romio/romio/adio/common in the following lines:
>
>
> #ifdef ROMIO_XFS
>new = (void *) memalign(XFS_MEMALIGN, size);
> #else
>new = (void *) malloc(size);
> #endif
>
> I searched, but couldn't find a value for XFS_MEMALIGN, so maybe it
> was from opal_pt_malloc2_component.c instead, where the call is
>
>p = memalign(1, 1024 * 1024);
>
> There are only 10 to 12 references to memalign in the code that I
> can see, so it shouldn't be too hard to find.  What I can tell you
> is that the value that TotalView saw for alignment, the first arg,
> was 1, and the second, the size, was 0x100000, which is probably
> right for 1024 squared.
>
> The man page for memalign says that the first argument is the
> alignment that the allocated memory use, and it must be a power of
> two.  The second is the length you want allocated.  One could argue
> that 1 is a power of two, but it seems a bit specious to me, and
> TotalView's memory debugger certainly objects to it. Can anyone  
tell

> me what the intent here is, and whether the memalign alignment
> argument is thought to be valid?  Or is this a bug (that might not
> affect anyone other than TotalView memory debug users?)
>
> Thanks,
> Peter Thompson




--
Jeff Squyres
jsquy...@cisco.com





Re: [OMPI users] profile the performance of a MPI code: how much traffic is being generated?

2009-09-29 Thread Samuel K. Gutierrez

Hi,

I'm writing a simple post-mortem profiling tool that provides some of
the information that you are looking for.  That said, the tool, Loba,
isn't publicly available just yet.  In the meantime, take a look at
mpiP (http://mpip.sourceforge.net/).
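
If you go the mpiP route, a minimal sketch of one way to use it
(assuming a dynamically linked application and an mpiP build that
produced a shared library; the install path below is hypothetical):

export LD_PRELOAD=/opt/mpiP/lib/libmpiP.so   # hypothetical install path
mpirun -np 4 ./a.out   # mpiP writes a *.mpiP report when the run finishes

The report includes per-callsite MPI time and aggregate message sizes,
which should answer the "how much traffic" question post mortem.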


--
Samuel K. Gutierrez
Los Alamos National Laboratory




On Tue, Sep 29, 2009 at 10:40 AM, Eugene Loh <eugene@sun.com>  
wrote:
to know.  It sounds like you want to be able to watch some % utilization
of a hardware interface as the program is running.  I *think* these tools
(the ones on the FAQ, including MPE, Vampir, and Sun Studio) are not of
that class.


You are correct.  A real-time tool that sniffs the MPI traffic would be
best.  Post-mortem profilers would be the next best option, I assume.

I was trying to compile MPE but gave up. Too many errors. Trying to
decide if I should prod on or look at another tool.

--
Rahul
