Re: [OMPI users] Problem building OpenMPI 1.4.4 with PGI 11.7 compilers
Hi, I think I've seen this before. I can't speak to the details surrounding this issue, but when I upgraded to the newest version of libtool, the problem went away. Take a look at "Use of GNU m4, Autoconf, Automake, and Libtool" in our HACKING file. libtool-2.4.2.tar.gz **should** work, if that's the problem that you are experiencing. I would suggest starting with a fresh source tree before you try again. Hope that helps, Samuel K. Gutierrez Los Alamos National Laboratory On Nov 8, 2011, at 2:06 PM, Gustavo Correa wrote: > Dear OpenMPI pros > > When I try to build OpenMPI 1.4.4 with PGI compilers 11.7 [pgcc, pgcpp, > pgfortran] > I get the awkward error message on the bottom of this email. > > I say awkward because I assigned the value 'shanghai-64' to the '-tp' flag, > as you can see from the successful 'libtool:compile' command in the error > message. > However, the subsequent 'libtool:link' command has '-tp' without a value. > Note that the remaining flags '-fast -Mfprelaxed' were also dropped in the > libtool:link command. > The 'partial' flag '-tp' is worse than no flag at all, and the pgcc compiler > fails. > > By contrast, OpenMPI 1.4.3 builds just fine with the same compilers and > the same compiler flags. > > Is this the revival of an old idiosyncrasy between libtool and PGI? > Could perhaps the OMPI 1.4.4 configure script have stripped off my compiler > flags after '-tp', > when passing it to libtool in link mode? [Somehow it works in 1.4.3.] > Is there any workaround or patch? > > > Many thanks, > Gus Correa > > ** > > More details: > CentOS Linux 5.2 x86_64, libtool 1.5.22, PGI 11.7.
> > Configure parameters: > export CC=pgcc > export CXX=pgcpp > export F77='pgfortran' > export FC=${F77} > > export CFLAGS='-tp shanghai-64 -fast -Mfprelaxed' > export CXXFLAGS=${CFLAGS} > export FFLAGS=${CFLAGS} > export FCFLAGS=${FFLAGS} > > ../configure \ > --prefix=${MYINSTALLDIR} \ > --with-libnuma=/usr \ > --with-tm=/opt/torque/2.4.11/gnu-4.1.2 \ > --with-openib=/usr \ > --enable-static \ > 2>&1 | tee configure_${build_id}.log > > > > ERROR MESSAGE ### > > libtool: compile: pgcc -DHAVE_CONFIG_H -I. > -I../../../../../opal/mca/memory/ptmalloc2 -I../../../../opal/include > -I../../../../orte/include -I../../../../ompi/include > -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -DMALLOC_DEBUG=0 > -D_GNU_SOURCE=1 -DUSE_TSD_DATA_HACK=1 -DMALLOC_HOOKS=1 > -I../../../../../opal/mca/memory/ptmalloc2/sysdeps/pthread > -I../../../../../opal/mca/memory/ptmalloc2/sysdeps/generic -I../../../../.. > -I../../../.. -I../../../../../opal/include -I../../../../../orte/include > -I../../../../../ompi/include -D_REENTRANT -DNDEBUG -tp shanghai-64 -fast > -Mfprelaxed -c ../../../../../opal/mca/memory/ptmalloc2/dummy.c -o dummy.o > >/dev/null 2>&1 > /bin/sh ../../../../libtool --tag=CC --mode=link pgcc -DNDEBUG -tp > shanghai-64 -fast -Mfprelaxed -export-dynamic -o libopenmpi_malloc.la > -rpath /home/sw/openmpi/1.4.4/pgi-11.7/lib dummy.lo -lnsl -lutil > libtool: link: pgcc -shared -fpic -DPIC .libs/dummy.o -lnsl -lutil -lc > -tp -Wl,-soname -Wl,libopenmpi_malloc.so.0 -o > .libs/libopenmpi_malloc.so.0.0.0 > pgcc-Fatal-Switch -tp must have a value > -tp=amd64|amd64e|athlon|athlonxp|barcelona|barcelona-32|barcelona-64|core2|core2-32|core2-64|istanbul|istanbul-32|istanbul-64|k7|k8|k8-32|k8-64|k8-64e|nehalem|nehalem-32|nehalem-64|p5|p6|p7|p7-32|p7-64|penryn|penryn-32|penryn-64|piii|piv|px|px-32|px-64|sandybridge|sandybridge-32|sandybridge-64|shanghai|shanghai-32|shanghai-64|x64 >Choose target processor type >amd64 Same as -tp k8-64 >amd64e Same as -tp k8-64e >athlon AMD 32-bit 
Athlon Processor >athlonxp AMD 32-bit Athlon XP Processor >barcelona AMD Barcelona processor >barcelona-32 AMD Barcelona processor, 32-bit mode >barcelona-64 AMD Barcelona processor, 64-bit mode >core2 Intel Core-2 Architecture >core2-32 Intel Core-2 Architecture, 32-bit mode >core2-64 Intel Core-2 Architecture, 64-bit mode >istanbul AMD Istanbul processor >istanbul-32 AMD Istanbul processor, 32-bit mode >istanbul-64 AMD Istanbul processor, 64-bit mode >k7 AMD Athlon Processor >k8 AMD64 Processor >k8-32 AMD64 Processor 32-bit mode >k8-64 AMD64 Processor 64-bit mode >k8-64e AMD64 Processor rev E or later, 64-bit mode >nehalem Intel Nehalem processor >nehalem-32 Intel Neha
Re: [OMPI users] How could OpenMPI (or MVAPICH) affect floating-point results?
Hi, Maybe you can leverage some of the techniques outlined in: Robert W. Robey, Jonathan M. Robey, and Rob Aulwes. 2011. In search of numerical consistency in parallel programming. Parallel Comput. 37, 4-5 (April 2011), 217-229. DOI=10.1016/j.parco.2011.02.009 http://dx.doi.org/10.1016/j.parco.2011.02.009 Hope that helps, Samuel K. Gutierrez Los Alamos National Laboratory On Sep 20, 2011, at 6:25 AM, Reuti wrote: > On 20.09.2011 at 13:52, Tim Prince wrote: > >> On 9/20/2011 7:25 AM, Reuti wrote: >>> Hi, >>> >>> On 20.09.2011 at 00:41, Blosch, Edwin L wrote: >>> >>>> I am observing differences in floating-point results from an application >>>> program that appear to be related to whether I link with OpenMPI 1.4.3 or >>>> MVAPICH 1.2.0. Both packages were built with the same installation of >>>> Intel 11.1, as well as the application program; identical flags passed to >>>> the compiler in each case. >>>> >>>> I've tracked down some differences in a compute-only routine where I've >>>> printed out the inputs to the routine (to 18 digits); the inputs are >>>> identical. The output numbers are different in the 16th place (perhaps a >>>> few in the 15th place). These differences only show up for optimized >>>> code, not for -O0. >>>> >>>> My assumption is that some optimized math intrinsic is being replaced >>>> dynamically, but I do not know how to confirm this. Anyone have guidance >>>> to offer? Or similar experience? >>> >>> Yes, I face it often, but always at a magnitude where it's not of any >>> concern (and not related to any MPI). Due to the limited precision in >>> computers, a simple reordering of operations (although equivalent in a >>> mathematical sense) can lead to different results. Removing the anomalies >>> with -O0 could prove that. >>> >>> The other point I have heard, especially for the x86 instruction set, is that the >>> internal FPU still uses 80 bits, while the representation in memory is only 64 >>> bits.
Hence when all can be done in the registers, the result can be >>> different compared to the case when some interim results need to be stored >>> to RAM. For the Portland compiler there is a switch -Kieee -pc64 to force >>> it to always stay in 64 bits, and a similar one for Intel is -mp (now >>> -fltconsistency) and -mp1. >>> >> Diagnostics below indicate that ifort 11.1 64-bit is in use. The options >> aren't the same as Reuti's "now" version (a 32-bit compiler which hasn't >> been supported for 3 years or more?). > > In the 11.1 documentation they are also still listed: > > http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/index.htm > > I read it as saying that -mp is deprecated syntax (therefore listed under > "Alternate Options"), but -fltconsistency is still a valid and supported > option. > > -- Reuti > > >> With ifort 10.1 and more recent, you would set at least >> -assume protect_parens -prec-div -prec-sqrt >> if you are interested in numerical consistency. If you don't want >> auto-vectorization of sum reductions, you would use instead >> -fp-model source -ftz >> (ftz sets underflow mode back to abrupt, while "source" sets gradual). >> It may be possible to expose 80-bit x87 by setting the ancient -mp option, >> but such a course can't be recommended without additional cautions. >> >> The quoted comment from the OP seems to show a somewhat different question: Does >> OpenMPI implement any operations in a different way from MVAPICH? I would >> think it probable that the answer could be affirmative for operations such >> as allreduce, but this leads well outside my expertise with respect to >> specific MPI implementations. It isn't out of the question to suspect that >> such differences might be aggravated when using excessively aggressive ifort >> options such as -fast.
>> >> >>>>libifport.so.5 => >>>> /opt/intel/Compiler/11.1/072/lib/intel64/libifport.so.5 >>>> (0x2b6e7e081000) >>>>libifcoremt.so.5 => >>>> /opt/intel/Compiler/11.1/072/lib/intel64/libifcoremt.so.5 >>>> (0x2b6e7e1ba000) >>>>libimf.so => /opt/intel/Compiler/11.1/072/lib/intel64/libimf.so >>>> (0x2b6e7e45f000) >>>>libsvml.so => /opt/intel/Compiler/11.1/072/lib/intel64/libsvml.so >>>> (0x2b6e7e7f4000) >>>>libintlc.so.5 => >>>> /opt/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x2b6e7ea0a000) >>>> >> >> -- >> Tim Prince >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] EXTERNAL: Re: qp memory allocation problem
On Sep 12, 2011, at 10:16 AM, Blosch, Edwin L wrote: > Samuel, > > This worked. Great! > Did this magic line disable the use of per-peer queue pairs? Yes, it sure did. > I have seen a previous post by Jeff that explains what this line does > generally, but I didn't study the post in detail, so if you could provide a > little explanation I would appreciate it. The btl_openib_receive_queues MCA parameter takes a comma- and colon-delimited list of parameters. The most general form is a colon-separated list of queue specifications, where each specification is: queue pair type,buffer size (required),number of buffers (required) - for example, type,size,count:type,size,count:...:type,size,count. NOTE: Optional QP parameters omitted - see the OMPI FAQ regarding OpenFabrics tuning for more details. The 'S' queue pair type identifier corresponds to "Shared queues." The 'P' queue pair type identifier corresponds to "Per-peer queues." Hope that helps, Sam > > Ed > > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Samuel K. Gutierrez > Sent: Monday, September 12, 2011 10:49 AM > To: Open MPI Users > Subject: EXTERNAL: Re: [OMPI users] qp memory allocation problem > > Hi, > > This problem can be caused by a variety of things, but I suspect our default > queue pair (QP) parameters aren't helping the situation :-). > > What happens when you add the following to your mpirun command? > > -mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,12 > > OMPI Developers: > > Maybe we should consider disabling the use of per-peer queue pairs by > default. Do they buy us anything? For what it is worth, we have stopped > using them on all of our large systems here at LANL. > > Thanks, > > Samuel K. Gutierrez > Los Alamos National Laboratory > > On Sep 12, 2011, at 9:23 AM, Blosch, Edwin L wrote: > > > I am getting this error message below and I don't know what it means or how > to fix it.
It only happens when I run on a large number of processes, e.g. > 960. Things work fine on 480, and I don’t think the application has a bug. > Any help is appreciated… > > [c1n01][[30697,1],3][connect/btl_openib_connect_oob.c:464:qp_create_one] > error creating qp errno says Cannot allocate memory > [c1n01][[30697,1],4][connect/btl_openib_connect_oob.c:464:qp_create_one] > error creating qp errno says Cannot allocate memory > > Here’s the mpirun command I used: > mpirun --prefix /usr/mpi/intel/openmpi-1.4.3 --machinefile -np > 960 --mca btl ^tcp --mca mpool_base_use_mem_hooks 1 --mca mpi_leave_pinned 1 > -x LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 > > Here’s the applicable hardware from the > /usr/mpi/intel/openmpi-1.4.3/share/openmpi/mca-btl-openib-device-params.ini > file: > # A.k.a. ConnectX > [Mellanox Hermon] > vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f > vendor_part_id = > 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488 > use_eager_rdma = 1 > mtu = 2048 > max_inline_data = 128 > > And this is the output of ompi_info –param btl openib: > MCA btl: parameter "btl_base_verbose" (current value: "0", > data source: default value) > Verbosity level of the BTL framework > MCA btl: parameter "btl" (current value: , data > source: default value) > Default selection set of components for the btl > framework ( means use all components that can be found) > MCA btl: parameter "btl_openib_verbose" (current value: "0", > data source: default value) > Output some verbose OpenIB BTL information (0 = no > output, nonzero = output) > MCA btl: parameter "btl_openib_warn_no_device_params_found" > (current value: "1", data source: default value, synonyms: > btl_openib_warn_no_hca_params_found) > Warn when no device-specific parameters are found > in the INI file specified by the btl_openib_device_param_files MCA > parameter (0 = do not warn; any other value = warn) > MCA btl: parameter "btl_openib_warn_no_hca_params_found" > (current value: "1", 
data source: default value, deprecated, synonym > of: btl_openib_warn_no_device_params_found) > Warn when no device-specific parameters are found > in the INI file specified by the btl_openib_device_param_files MCA > parameter (0 = do not warn;
Re: [OMPI users] qp memory allocation problem
Hi, This problem can be caused by a variety of things, but I suspect our default queue pair parameters (QP) aren't helping the situation :-). What happens when you add the following to your mpirun command? -mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,12 OMPI Developers: Maybe we should consider disabling the use of per-peer queue pairs by default. Do they buy us anything? For what it is worth, we have stopped using them on all of our large systems here at LANL. Thanks, Samuel K. Gutierrez Los Alamos National Laboratory On Sep 12, 2011, at 9:23 AM, Blosch, Edwin L wrote: > I am getting this error message below and I don’t know what it means or how > to fix it. It only happens when I run on a large number of processes, e.g. > 960. Things work fine on 480, and I don’t think the application has a bug. > Any help is appreciated… > > [c1n01][[30697,1],3][connect/btl_openib_connect_oob.c:464:qp_create_one] > error creating qp errno says Cannot allocate memory > [c1n01][[30697,1],4][connect/btl_openib_connect_oob.c:464:qp_create_one] > error creating qp errno says Cannot allocate memory > > Here’s the mpirun command I used: > mpirun --prefix /usr/mpi/intel/openmpi-1.4.3 --machinefile -np > 960 --mca btl ^tcp --mca mpool_base_use_mem_hooks 1 --mca mpi_leave_pinned 1 > -x LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 > > Here’s the applicable hardware from the > /usr/mpi/intel/openmpi-1.4.3/share/openmpi/mca-btl-openib-device-params.ini > file: > # A.k.a. 
ConnectX > [Mellanox Hermon] > vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f > vendor_part_id = > 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488 > use_eager_rdma = 1 > mtu = 2048 > max_inline_data = 128 > > And this is the output of ompi_info –param btl openib: > MCA btl: parameter "btl_base_verbose" (current value: "0", > data source: default value) > Verbosity level of the BTL framework > MCA btl: parameter "btl" (current value: , data > source: default value) > Default selection set of components for the btl > framework ( means use all components that can be found) > MCA btl: parameter "btl_openib_verbose" (current value: "0", > data source: default value) > Output some verbose OpenIB BTL information (0 = no > output, nonzero = output) > MCA btl: parameter "btl_openib_warn_no_device_params_found" > (current value: "1", data source: default value, synonyms: > btl_openib_warn_no_hca_params_found) > Warn when no device-specific parameters are found > in the INI file specified by the btl_openib_device_param_files MCA > parameter (0 = do not warn; any other value = warn) > MCA btl: parameter "btl_openib_warn_no_hca_params_found" > (current value: "1", data source: default value, deprecated, synonym > of: btl_openib_warn_no_device_params_found) > Warn when no device-specific parameters are found > in the INI file specified by the btl_openib_device_param_files MCA > parameter (0 = do not warn; any other value = warn) > MCA btl: parameter "btl_openib_warn_default_gid_prefix" > (current value: "1", data source: default value) > Warn when there is more than one active ports and > at least one of them connected to the network with only default GID > prefix configured (0 = do not warn; any other value > = warn) > MCA btl: parameter "btl_openib_warn_nonexistent_if" (current > value: "1", data source: default value) > Warn if non-existent devices and/or ports are > specified in the btl_openib_if_[in|ex]clude MCA parameters (0 = do not > 
warn; any other value = warn) > MCA btl: parameter "btl_openib_want_fork_support" (current > value: "-1", data source: default value) > Whether fork support is desired or not (negative = > try to enable fork support, but continue even if it is not > available, 0 = do not enable fork support, positive > = try to enable fork support and fail if it is not available) > MCA btl: parameter "btl_openib_device_param_files" (current > value: > > "/usr/mpi/intel/openmpi-1.4.3/share/openmpi/mca-btl-openib-device-para
Re: [OMPI users] [ompi-1.4.2] Infiniband issue on smoky @ ornl
Hi, QP = Queue Pair Here are a couple of nice FAQ entries that I find useful. http://www.open-mpi.org/faq/?category=openfabrics And videos: http://www.open-mpi.org/video/?category=openfabrics -- Samuel K. Gutierrez Los Alamos National Laboratory On Jun 23, 2011, at 8:22 AM, Mathieu Gontier wrote: > Hi, > > Thanks for your answer. It makes sense. > Sorry if my question seems silly, but what does QP mean? It is difficult to > read the FAQ without knowing that! > > Thanks. > > On 06/23/2011 04:00 PM, Ralph Castain wrote: >> >> One possibility: if you increase the number of processes in the job, and >> they all interconnect, then the IB interface can (I believe) run out of >> memory at some point. IIRC, the answer was to reduce the size of the QPs so >> that you could support a larger number of them. >> >> You should find info about controlling QP size in the IB FAQ area on the >> OMPI web site, I believe. >> >> On Jun 23, 2011, at 7:56 AM, Mathieu Gontier wrote: >> >>> Hello, >>> >>> Thanks for the answer. >>> I am testing with OpenMPI-1.4.3: my computation is queuing. But I did not >>> read anything obvious related to my issue. Have you read anything which >>> could solve it? >>> I am going to submit my computation with --mca mpi_leave_pinned 0, but do >>> you have any idea how it affects the performance? Compared to using >>> Ethernet? >>> >>> Many thanks for your support. >>> >>> On 06/23/2011 03:01 PM, Josh Hursey wrote: >>>> >>>> I wonder if this is related to memory pinning. Can you try turning off >>>> the leave pinned, and see if the problem persists (this may affect >>>> performance, but should avoid the crash): >>>> mpirun ... --mca mpi_leave_pinned 0 ... >>>> >>>> Also it looks like Smoky has a slightly newer version of the 1.4 >>>> branch that you should try to switch to if you can.
The following >>>> command will show you all of the available installs on that machine: >>>> shell$ module avail ompi >>>> >>>> For a list of supported compilers for that version try the 'show' option: >>>> shell$ module show ompi/1.4.3 >>>> --- >>>> /sw/smoky/modulefiles-centos/ompi/1.4.3: >>>> >>>> module-whatis This module configures your environment to make Open >>>> MPI 1.4.3 available. >>>> Supported Compilers: >>>> pathscale/3.2.99 >>>> pathscale/3.2 >>>> pgi/10.9 >>>> pgi/10.4 >>>> intel/11.1.072 >>>> gcc/4.4.4 >>>> gcc/4.4.3 >>>> --- >>>> >>>> Let me know if that helps. >>>> >>>> Josh >>>> >>>> >>>> On Wed, Jun 22, 2011 at 4:16 AM, Mathieu Gontier >>>> <mathieu.gont...@gmail.com> wrote: >>>>> Dear all, >>>>> >>>>> First of all, all my apologies because I post this message to both the bug >>>>> and user mailing list. But for the moment, I do not know if it is a bug! >>>>> >>>>> I am running a CFD structured flow solver at ORNL, and I have access >>>>> to a >>>>> small cluster (Smoky) using OpenMPI-1.4.2 with Infiniband by default. >>>>> Recently we increased the size of our models, and since that time we have >>>>> run into many infiniband related problems. The most serious problem is a >>>>> hard crash with the following error message: >>>>> >>>>> [smoky45][[60998,1],32][/sw/sources/ompi/1.4.2/ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] >>>>> error creating qp errno says Cannot allocate memory >>>>> >>>>> If we force the solver to use ethernet (mpirun -mca btl ^openib) the >>>>> computations work correctly, although very slowly (a single iteration >>>>> takes >>>>> ages). Do you have any idea what could be causing these problems? >>>>> >>>>> If it is due to a bug or a limitation in OpenMPI, do you think the >>>>> version >>>>> 1.4.3, the coming 1.4.4 or any 1.5 version could solve the problem? I read >>>>> the release notes, but I did not see any obvious patch which could fix my >>>>> problem.
The system administrator is ready to compile a new package for >>>>> us, >>>>> but I do not want to ask them to install too many of them. >>>>> >>>>> Thanks. >>>>> -- >>>>> >>>>> Mathieu Gontier >>>>> skype: mathieu_gontier >>>>> ___ >>>>> users mailing list >>>>> us...@open-mpi.org >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>> >>>> >>> >>> -- >>> >>> Mathieu Gontier >>> skype: mathieu_gontier >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > -- > > Mathieu Gontier > skype: mathieu_gontier > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] [ompi-1.4.2] Infiniband issue on smoky @ ornl
Hi, What happens when you don't run with per-peer queue pairs? Try: -mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,128 -- Samuel K. Gutierrez Los Alamos National Laboratory On Jun 23, 2011, at 7:56 AM, Mathieu Gontier wrote: > Hello, > > Thanks for the answer. > I am testing with OpenMPI-1.4.3: my computation is queuing. But I did not > read anything obvious related to my issue. Have you read anything which > could solve it? > I am going to submit my computation with --mca mpi_leave_pinned 0, but do you > have any idea how it affects the performance? Compared to using Ethernet? > > Many thanks for your support. > > On 06/23/2011 03:01 PM, Josh Hursey wrote: >> >> I wonder if this is related to memory pinning. Can you try turning off >> the leave pinned, and see if the problem persists (this may affect >> performance, but should avoid the crash): >> mpirun ... --mca mpi_leave_pinned 0 ... >> >> Also it looks like Smoky has a slightly newer version of the 1.4 >> branch that you should try to switch to if you can. The following >> command will show you all of the available installs on that machine: >> shell$ module avail ompi >> >> For a list of supported compilers for that version try the 'show' option: >> shell$ module show ompi/1.4.3 >> --- >> /sw/smoky/modulefiles-centos/ompi/1.4.3: >> >> module-whatis This module configures your environment to make Open >> MPI 1.4.3 available. >> Supported Compilers: >> pathscale/3.2.99 >> pathscale/3.2 >> pgi/10.9 >> pgi/10.4 >> intel/11.1.072 >> gcc/4.4.4 >> gcc/4.4.3 >> --- >> >> Let me know if that helps. >> >> Josh >> >> >> On Wed, Jun 22, 2011 at 4:16 AM, Mathieu Gontier >> <mathieu.gont...@gmail.com> wrote: >>> Dear all, >>> >>> First of all, all my apologies because I post this message to both the bug >>> and user mailing list. But for the moment, I do not know if it is a bug!
>>> >>> I am running a CFD structured flow solver at ORNL, and I have access to a >>> small cluster (Smoky) using OpenMPI-1.4.2 with Infiniband by default. >>> Recently we increased the size of our models, and since that time we have >>> run into many infiniband related problems. The most serious problem is a >>> hard crash with the following error message: >>> >>> [smoky45][[60998,1],32][/sw/sources/ompi/1.4.2/ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] >>> error creating qp errno says Cannot allocate memory >>> >>> If we force the solver to use ethernet (mpirun -mca btl ^openib) the >>> computations work correctly, although very slowly (a single iteration takes >>> ages). Do you have any idea what could be causing these problems? >>> >>> If it is due to a bug or a limitation in OpenMPI, do you think the version >>> 1.4.3, the coming 1.4.4 or any 1.5 version could solve the problem? I read >>> the release notes, but I did not see any obvious patch which could fix my >>> problem. The system administrator is ready to compile a new package for us, >>> but I do not want to ask them to install too many of them. >>> >>> Thanks. >>> -- >>> >>> Mathieu Gontier >>> skype: mathieu_gontier >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> >> > > -- > > Mathieu Gontier > skype: mathieu_gontier > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Openib with > 32 cores per node
Hi, On May 19, 2011, at 9:37 AM, Robert Horton wrote: > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote: >> Hi, >> >> Try the following QP parameters that only use shared receive queues. >> >> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 >> > > Thanks for that. If I run the job over 2 x 48 cores it now works and the > performance seems reasonable (I need to do some more tuning) but when I > go up to 4 x 48 cores I'm getting the same problem: > > [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] > error creating qp errno says Cannot allocate memory > [compute-1-7.local:18106] *** An error occurred in MPI_Isend > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now > abort) > > Any thoughts? How much memory does each node have? Does this happen at startup? Try adding: -mca btl_openib_cpc_include rdmacm I'm not sure whether your version of OFED supports this feature, but using XRC may help. I **think** other tweaks are needed to get this going, but I'm not familiar with the details. Hope that helps, Samuel K. Gutierrez Los Alamos National Laboratory > > Thanks, > Rob > -- > Robert Horton > System Administrator (Research Support) - School of Mathematical Sciences > Queen Mary, University of London > r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345 > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Openib with > 32 cores per node
Hi, Try the following QP parameters that only use shared receive queues. -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 Samuel K. Gutierrez Los Alamos National Laboratory On May 19, 2011, at 5:28 AM, Robert Horton wrote: > Hi, > > I'm having problems getting the MPIRandomAccess part of the HPCC > benchmark to run with more than 32 processes on each node (each node has > 4 x AMD 6172 so 48 cores total). Once I go past 32 processes I get an > error like: > > [compute-1-13.local][[5637,1],18][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] > error creating qp errno says Cannot allocate memory > [compute-1-13.local][[5637,1],18][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:815:rml_recv_cb] > error in endpoint reply start connect > [compute-1-13.local:06117] [[5637,0],0]-[[5637,1],18] mca_oob_tcp_msg_recv: > readv failed: Connection reset by peer (104) > [compute-1-13.local:6137] *** An error occurred in MPI_Isend > [compute-1-13.local:6137] *** on communicator MPI_COMM_WORLD > [compute-1-13.local:6137] *** MPI_ERR_OTHER: known error not in list > [compute-1-13.local:6137] *** MPI_ERRORS_ARE_FATAL (your MPI job will now > abort) > [compute-1-13.local][[5637,1],26][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] > error creating qp errno says Cannot allocate memory > [[5637,1],66][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3227:handle_wc] > from compute-1-13.local to: compute-1-13 error polling LP CQ with status > RETRY EXCEEDED ERROR status number 12 for wr_id 278870912 opcode > > I've tried changing btl_openib_receive_queues from > P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 > to > P,128,512,256,512:S,2048,512,256,32:S,12288,512,256,32:S,65536,512,256,32 > > doing this lets the code run without the error, but it does so extremely > slowly - I'm also seeing errors in dmesg such as: > > CPU 12: > Modules 
linked in: nfs fscache nfs_acl blcr(U) blcr_imports(U) autofs4 > ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc ip_conntrack_netbios_ns > ipt_REJECT xt_state > ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp > ip6table_filter ip6_tables x_tables cpufreq_ondemand powernow_k8 freq_table > rdma_ucm(U) ib_sd > p(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) > ib_sa(U) ipv6 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U) iw_nes(U) > iw_cxgb3(U) cxgb3(U) > mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) dm_mirror dm_multipath scsi_dh > video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac > parport_p > c lp parport joydev shpchp sg i2c_piix4 i2c_core ib_qib(U) dca ib_mad(U) > ib_core(U) igb 8021q serio_raw pcspkr dm_raid45 dm_message dm_region_hash > dm_log dm_mod dm_ > mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd > Pid: 3980, comm: qib/12 Tainted: G 2.6.18-164.6.1.el5 #1 > RIP: 0010:[] [] tasklet_action+0x90/0xfd > RSP: 0018:810c2f1bff40 EFLAGS: 0246 > RAX: RBX: 0001 RCX: 810c2f1bff30 > RDX: RSI: 81042f063400 RDI: 8030d180 > RBP: 810c2f1bfec0 R08: 0001 R09: 8104aec2d000 > R10: 810c2f1bff00 R11: 810c2f1bff00 R12: 8005dc8e > R13: 81042f063480 R14: 80077874 R15: 810c2f1bfec0 > FS: 2b20829592e0() GS:81042f186bc0() knlGS: > CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b > CR2: 2b2080b70720 CR3: 00201000 CR4: 06e0 > > Call Trace: > [] __do_softirq+0x89/0x133 > [] call_softirq+0x1c/0x28 > [] do_softirq+0x2c/0x85 > [] apic_timer_interrupt+0x66/0x6c > [] __kmalloc+0x97/0x9f > [] :ib_qib:qib_verbs_send+0xdb3/0x104a > [] _spin_unlock_irqrestore+0x8/0x9 > [] :ib_qib:qib_make_rc_req+0xbb1/0xbbf > [] :ib_qib:qib_make_rc_req+0x0/0xbbf > [] :ib_qib:qib_do_send+0x0/0x950 > [] :ib_qib:qib_do_send+0x91a/0x950 > [] __wake_up+0x38/0x4f > [] :ib_qib:qib_do_send+0x0/0x950 > [] run_workqueue+0x94/0xe4 > [] worker_thread+0x0/0x122 > [] keventd_create_kthread+0x0/0xc4 > [] worker_thread+0xf0/0x122 > [] 
default_wake_function+0x0/0xe > [] keventd_create_kthread+0x0/0xc4 > [] kthread+0xfe/0x132 > [] child_rip+0xa/0x11 > [] keventd_create_kthread+0x0/0xc4 > [] kthread+0x0/0x132 > [] child_rip+0x0/0x11 > > Any thoughts on how to proceed? > > I'm running OpenMPI 1.4.3 compiled with gcc 4.1.2 and OFED 1.5.3.1 > > Thanks, > Rob > -- > Robert Horton > System Administrator (Research Support) - School of Mathematical Sciences > Queen Mary, University of London > r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345 > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] btl_openib_cpc_include rdmacm questions
Hi, Just out of curiosity - what happens when you add the following MCA option to your openib runs? -mca btl_openib_flags 305 Thanks, Samuel Gutierrez Los Alamos National Laboratory On May 13, 2011, at 2:38 PM, Brock Palen wrote: > On May 13, 2011, at 4:09 PM, Dave Love wrote: > >> Jeff Squyres writes: >> >>> On May 11, 2011, at 3:21 PM, Dave Love wrote: >>> We can reproduce it with IMB. We could provide access, but we'd have to negotiate with the owners of the relevant nodes to give you interactive access to them. Maybe Brock's would be more accessible? (If you contact me, I may not be able to respond for a few days.) >>> >>> Brock has replied off-list that he, too, is able to reliably reproduce the >>> issue with IMB, and is working to get access for us. Many thanks for your >>> offer; let's see where Brock's access takes us. >> >> Good. Let me know if we could be useful >> > -- we have not closed this issue, Which issue? I couldn't find a relevant-looking one. >>> >>> https://svn.open-mpi.org/trac/ompi/ticket/2714 >> >> Thanks. In case it's useful info, it hangs for me with 1.5.3 & np=32 on >> connectx with more than one collective I can't recall. > > Extra data point: that ticket said it ran with mpi_preconnect_mpi 1, well > that doesn't help here; both my production code (crash) and IMB still hang. > > > Brock Palen > www.umich.edu/~brockp > Center for Advanced Computing > bro...@umich.edu > (734)936-1985 > >> >> -- >> Excuse the typing -- I have a broken wrist >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] mpi problems,
What does 'ldd ring2' show? How was it compiled? -- Samuel K. Gutierrez Los Alamos National Laboratory On Apr 4, 2011, at 1:58 PM, Nehemiah Dacres wrote: [jian@therock ~]$ echo $LD_LIBRARY_PATH /opt/sun/sunstudio12.1/lib:/opt/vtk/lib:/opt/gridengine/lib/lx26-amd64:/opt/gridengine/lib/lx26-amd64:/home/jian/.crlibs:/home/jian/.crlibs32 [jian@therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 4 -hostfile list ring2 ring2: error while loading shared libraries: libfui.so.1: cannot open shared object file: No such file or directory ring2: error while loading shared libraries: libfui.so.1: cannot open shared object file: No such file or directory ring2: error while loading shared libraries: libfui.so.1: cannot open shared object file: No such file or directory mpirun: killing job... -- mpirun noticed that process rank 1 with PID 31763 on node compute-0-1 exited on signal 0 (Unknown signal 0). -- mpirun: clean termination accomplished I really don't know what's wrong here. I was sure that would work. On Mon, Apr 4, 2011 at 2:43 PM, Samuel K. Gutierrez <sam...@lanl.gov> wrote: Hi, Try prepending the path to your compiler libraries. Example (bash-like): export LD_LIBRARY_PATH=/compiler/prefix/lib:/ompi/prefix/lib:$LD_LIBRARY_PATH -- Samuel K. Gutierrez Los Alamos National Laboratory On Apr 4, 2011, at 1:33 PM, Nehemiah Dacres wrote: altering LD_LIBRARY_PATH alters the process's path to MPI's libraries; how do I alter its path to compiler libs like libfui.so.1? It needs to find them because it was compiled by a Sun compiler. On Mon, Apr 4, 2011 at 10:06 AM, Nehemiah Dacres <dacre...@slu.edu> wrote: As Ralph indicated, he'll add the hostname to the error message (but that might be tricky; that error message is coming from rsh/ssh...). In the meantime, you might try (csh style): foreach host (`cat list`) echo $host ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted end that's what the tentakel line was referring to, or ...
On Apr 4, 2011, at 10:24 AM, Nehemiah Dacres wrote: > I have installed it via a symlink on all of the nodes; I can go 'tentakel which mpirun' and it finds it. I'll check the library paths, but isn't there a way to find out which nodes are returning the error? I found it mislinked on a couple of nodes. Thank you -- Nehemiah I. Dacres System Administrator Advanced Technology Group Saint Louis University ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
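[Editor's note] The csh loop quoted in this thread can be written as a bash function for nodes without tentakel. A sketch: the orted path and the hostfile name "list" come from the thread, while the ssh options are assumptions.

```shell
# Check one node for a working orted install at the Sun HPC
# ClusterTools path from the thread. BatchMode makes ssh fail fast
# instead of hanging on a password prompt when a node is broken.
check_orted() {
    ssh -o BatchMode=yes "$1" "ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted"
}
# Typical use (needs the hostfile and passwordless ssh):
#   while read -r host; do echo "== $host =="; check_orted "$host"; done < list
```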
Re: [OMPI users] mpi problems,
Hi, Try prepending the path to your compiler libraries. Example (bash-like): export LD_LIBRARY_PATH=/compiler/prefix/lib:/ompi/prefix/lib:$LD_LIBRARY_PATH -- Samuel K. Gutierrez Los Alamos National Laboratory On Apr 4, 2011, at 1:33 PM, Nehemiah Dacres wrote: altering LD_LIBRARY_PATH alters the process's path to MPI's libraries; how do I alter its path to compiler libs like libfui.so.1? It needs to find them because it was compiled by a Sun compiler. On Mon, Apr 4, 2011 at 10:06 AM, Nehemiah Dacres <dacre...@slu.edu> wrote: As Ralph indicated, he'll add the hostname to the error message (but that might be tricky; that error message is coming from rsh/ssh...). In the meantime, you might try (csh style): foreach host (`cat list`) echo $host ls -l /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted end that's what the tentakel line was referring to, or ... On Apr 4, 2011, at 10:24 AM, Nehemiah Dacres wrote: > I have installed it via a symlink on all of the nodes; I can go 'tentakel which mpirun' and it finds it. I'll check the library paths, but isn't there a way to find out which nodes are returning the error? I found it mislinked on a couple of nodes. Thank you -- Nehemiah I. Dacres System Administrator Advanced Technology Group Saint Louis University ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
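[Editor's note] Samuel's suggestion, fleshed out with the Sun Studio path that appears in the $LD_LIBRARY_PATH shown earlier in this thread; the ldd verification step is an addition, not something from the posts.

```shell
# Prepend the compiler runtime directory so the dynamic loader can
# resolve libfui.so.1 (path taken from the thread).
export LD_LIBRARY_PATH=/opt/sun/sunstudio12.1/lib:$LD_LIBRARY_PATH
# Count unresolved shared libraries for a binary; 0 means the program
# (e.g. ring2) should at least start.
unresolved() { ldd "$1" 2>/dev/null | grep -c 'not found'; }
# e.g.  unresolved ./ring2
```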
Re: [OMPI users] OMPI seg fault by a class with weird address.
I -think- setting OMPI_MCA_memory_ptmalloc2_disable to 1 will turn off OMPI's memory wrappers without having to rebuild. Someone please correct me if I'm wrong :-). For example (bash-like shell): export OMPI_MCA_memory_ptmalloc2_disable=1 Hope that helps, -- Samuel K. Gutierrez Los Alamos National Laboratory On Mar 15, 2011, at 9:19 AM, Jack Bryan wrote: Thanks, I do not have system administrator authorization. I am afraid that I cannot rebuild OpenMPI --without-memory-manager. Are there other ways to get around it ? For example, use other things to replace "ptmalloc" ? Any help is really appreciated. thanks From: belaid_...@hotmail.com To: dtustud...@hotmail.com; us...@open-mpi.org Subject: RE: [OMPI users] OMPI seg fault by a class with weird address. Date: Tue, 15 Mar 2011 08:00:56 + Hi Jack, I may need to see the whole code to decide but my quick look suggest that ptmalloc is causing a problem with STL-vector allocation. ptmalloc is the openMPI internal malloc library. Could you try to build openMPI without memory management (using --without- memory-manager) and let us know the outcome. ptmalloc is not needed if you are not using an RDMA interconnect. With best regards, -Belaid. From: dtustud...@hotmail.com To: belaid_...@hotmail.com; us...@open-mpi.org Subject: RE: [OMPI users] OMPI seg fault by a class with weird address. Date: Tue, 15 Mar 2011 00:30:19 -0600 Hi, Because the code is very long, I just show the calling relationship of functions. main() { scheduler(); } scheduler() { ImportIndices(); } ImportIndices() { Index IdxNode ; IdxNode = ReadFile("fileName"); } Index ReadFile(const char* fileinput) { Index TempIndex; . 
} vector Index::GetPosition() const { return Position; } vector Index::GetColumn() const { return Column; } vector Index::GetYear() const { return Year; } vector Index::GetName() const { return Name; } int Index::GetPosition(const int idx) const { return Position[idx]; } int Index::GetColumn(const int idx) const { return Column[idx]; } int Index::GetYear(const int idx) const { return Year[idx]; } string Index::GetName(const int idx) const { return Name[idx]; } int Index::GetSize() const { return Position.size(); } The sequential code works well, and there is no scheduler(). The parallel code output from gdb: -- Breakpoint 1, myNeplanTaskScheduler(CNSGA2 *, int, int, int, ._85 *, char, int, message_para_to_workers_VecT &, MPI_Datatype, int &, int &, std::vector<std::vector<double, std::allocator >, std::allocator<std::vector<double, std::allocator > > > &, std::vector<std::vector<double, std::allocator >, std::allocator<std::vector<double, std::allocator > > > &, std::vector<double, std::allocator > &, int, std::vector<std::vector<double, std::allocator >, std::allocator<std::vector<double, std::allocator > > > &, MPI_Datatype, int, MPI_Datatype, int) (nsga2=0x118c490, popSize=, nodeSize=, myRank=, myChildpop=0x1208d80, genCandTag=65 'A', generationNum=1, myPopParaVec=std::vector of length 4, capacity 4 = {...}, message_to_master_type=0x7fffd540, myT1Flag=@0x7fffd68c, myT2Flag=@0x7fffd688, resultTaskPackageT1=std::vector of length 4, capacity 4 = {...}, resultTaskPackageT2Pr=std::vector of length 4, capacity 4 = {...}, xdataV=std::vector of length 4, capacity 4 = {...}, objSize=7, resultTaskPackageT12=std::vector of length 4, capacity 4 = {...}, xdata_to_workers_type=0x121c410, myGenerationNum=1, Mpara_to_workers_type=0x121b9b0, nconNum=0) at src/nsga2/myNetplanScheduler.cpp:109 109 ImportIndices(); (gdb) c Continuing. Breakpoint 2, ImportIndices () at src/index.cpp:120 120 IdxNode = ReadFile("prepdata/idx_node.csv"); (gdb) c Continuing. 
Breakpoint 4, ReadFile (fileinput=0xd8663d "prepdata/idx_node.csv") at src/index.cpp:86 86 Index TempIndex; (gdb) c Continuing. Breakpoint 5, Index::Index (this=0x7fffcb80) at src/index.cpp:20 20 Name(0) {} (gdb) c Continuing. Program received signal SIGSEGV, Segmentation fault. 0x2b3b0b81 in opal_memory_ptmalloc2_int_malloc () from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0 --- the backtrace output from the above parallel OpenMPI code: (gdb) bt #0 0x2b3b0b81 in opal_memory_ptmalloc2_int_malloc () from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0 #1 0x2b3b2bd3 in opal_memory_ptmalloc2_malloc () from /opt/openmpi-1.3.4-gnu/lib/libopen-pal.so.0 #2 0x003f7c8bd1dd in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6 #3 0x004646a7 in __gnu_cxx::new_allocator::allocate ( this=0x7fffcb80, __n=0)
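[Editor's note] For reference, the environment-variable workaround mentioned at the top of this thread, as it would be used from a bash shell. Whether it fully disables the ptmalloc2 wrappers without a rebuild depends on how the installation was built, as Samuel himself hedges; the application name below is a placeholder.

```shell
# Disable Open MPI's ptmalloc2 memory wrappers for everything launched
# from this shell...
export OMPI_MCA_memory_ptmalloc2_disable=1
# ...or scope it to a single run instead:
#   OMPI_MCA_memory_ptmalloc2_disable=1 mpirun -np 4 ./scheduler
echo "OMPI_MCA_memory_ptmalloc2_disable=$OMPI_MCA_memory_ptmalloc2_disable"
```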
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On Feb 8, 2011, at 8:21 PM, Ralph Castain wrote: I would personally suggest not reconfiguring your system simply to support a particular version of OMPI. The only difference between the 1.4 and 1.5 series wrt slurm is that we changed a few things to support a more recent version of slurm. It is relatively easy to backport that code to the 1.4 series, and it should be (mostly) backward compatible. OMPI is agnostic wrt resource managers. We try to support all platforms, with our effort reflective of the needs of our developers and their organizations, and our perception of the relative size of the user community for a particular platform. Slurm is a fairly small community, mostly centered in the three DOE weapons labs, so our support for that platform tends to focus on their usage. So, with that understanding... Sam: can you confirm that 1.5.1 works on your TLCC machines? Open MPI 1.5.1 works as expected on our TLCC machines. Open MPI 1.4.3 with your SLURM update also tested. I have created a ticket to upgrade the 1.4.4 release (due out any time now) with the 1.5.1 slurm support. Any interested parties can follow it here: Thanks Ralph! Sam https://svn.open-mpi.org/trac/ompi/ticket/2717 Ralph On Feb 8, 2011, at 6:23 PM, Michael Curtis wrote: On 09/02/2011, at 9:16 AM, Ralph Castain wrote: See below On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote: On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote: Hi Michael, You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient. Odd, I thought I sent it to you directly. In any case, here is the backtrace and some information from gdb: $ salloc -n16 gdb -args mpirun mpi (gdb) run Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/ michael/home/ServerAdmin/mpi [Thread debugging using libthread_db enabled] Program received signal SIGSEGV, Segmentation fault. 
0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342 342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING; (gdb) bt #0 0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342 #1 0x778a7338 in event_process_active (base=0x615240) at event.c:651 #2 0x778a797e in opal_event_base_loop (base=0x615240, flags=1) at event.c:823 #3 0x778a756f in opal_event_loop (flags=1) at event.c:730 #4 0x7789b916 in opal_progress () at runtime/ opal_progress.c:189 #5 0x77b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:459 #6 0x77b7bed7 in plm_slurm_launch_job (jdata=0x610560) at plm_slurm_module.c:360 #7 0x00403f46 in orterun (argc=2, argv=0x7fffe7d8) at orterun.c:754 #8 0x00402fb4 in main (argc=2, argv=0x7fffe7d8) at main.c:13 (gdb) print pdatorted $1 = (orte_proc_t **) 0x67c610 (gdb) print mev $2 = (orte_message_event_t *) 0x681550 (gdb) print mev->sender.vpid $3 = 4294967295 (gdb) print mev->sender $4 = {jobid = 1721696256, vpid = 4294967295} (gdb) print *mev $5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x77dd4f40, obj_reference_count = 1, cls_init_file_name = 0x77bb9a78 "base/plm_base_launch_support.c", cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279} The jobid and vpid look like the defined INVALID values, indicating that something is quite wrong. This would quite likely lead to the segfault. From this, it would indeed appear that you are getting some kind of library confusion - the most likely cause of such an error is a daemon from a different version trying to respond, and so the returned message isn't correct. 
Not sure why else it would be happening...you could try setting -mca plm_base_verbose 5 to get more debug output displayed on your screen, assuming you built OMPI with --enable-debug. Found the problem. It is a site configuration issue, which I'll need to find a workaround for. [bio-ipc.{FQDN}:27523] mca:base:select:( plm) Query of component [slurm] set priority to 75 [bio-ipc.{FQDN}:27523] mca:base:select:( plm) Selected component [slurm] [bio-ipc.{FQDN}:27523] mca: base: close: component rsh closed [bio-ipc.{FQDN}:27523] mca: base: close: unloading component rsh [bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: initial bias 27523 nodename hash 1936089714 [bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: final jobfam 31383 [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:receive start comm [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching job [31383,1] [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:setup_job for job
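[Editor's note] Ralph's debugging recipe collected in one place: it requires an --enable-debug build of Open MPI, and the application name here is a placeholder. PLM verbosity shows which launcher component gets selected and how daemons are launched, which is exactly the log excerpt shown above.

```shell
# PLM (process lifecycle management) verbosity level 5, as suggested in
# the thread; run the echoed command under the SLURM allocation itself.
verbose_opt="-mca plm_base_verbose 5"
echo "mpirun $verbose_opt -np 16 ./mpi_app"
```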
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
Hi Michael, You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient. Thanks, -- Samuel K. Gutierrez Los Alamos National Laboratory On Feb 7, 2011, at 8:38 AM, Samuel K. Gutierrez wrote: Hi, A detailed backtrace from a core dump may help us debug this. Would you be willing to provide that information for us? Thanks, -- Samuel K. Gutierrez Los Alamos National Laboratory On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote: On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote: Hi, I just tried to reproduce the problem that you are experiencing and was unable to. SLURM 2.1.15 Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/ lanl/tlcc/debug-nopanasas I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform file (the only change was to re-enable btl-tcp). Unfortunately, the result is the same: salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi salloc: Granted job allocation 145 JOB MAP Data for node: Name: eng-ipc4.{FQDN}Num procs: 8 Process OMPI jobid: [6932,1] Process rank: 0 Process OMPI jobid: [6932,1] Process rank: 1 Process OMPI jobid: [6932,1] Process rank: 2 Process OMPI jobid: [6932,1] Process rank: 3 Process OMPI jobid: [6932,1] Process rank: 4 Process OMPI jobid: [6932,1] Process rank: 5 Process OMPI jobid: [6932,1] Process rank: 6 Process OMPI jobid: [6932,1] Process rank: 7 Data for node: Name: ipc3 Num procs: 8 Process OMPI jobid: [6932,1] Process rank: 8 Process OMPI jobid: [6932,1] Process rank: 9 Process OMPI jobid: [6932,1] Process rank: 10 Process OMPI jobid: [6932,1] Process rank: 11 Process OMPI jobid: [6932,1] Process rank: 12 Process OMPI jobid: [6932,1] Process rank: 13 Process OMPI jobid: [6932,1] Process rank: 14 Process OMPI jobid: [6932,1] Process rank: 15 = [eng-ipc4:31754] *** Process received signal *** [eng-ipc4:31754] Signal: Segmentation fault (11) [eng-ipc4:31754] Signal code: 
Address not mapped (1) [eng-ipc4:31754] Failing at address: 0x8012eb748 [eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0] [eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) [0x7f81cf262869] [eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) [0x7f81cef93338] [eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) [0x7f81cef9397e] [eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so. 0(opal_event_loop+0x1f) [0x7f81cef9356f] [eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so. 0(opal_progress+0x89) [0x7f81cef87916] [eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so. 0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20] [eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) [0x7f81cf267ed7] [eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46] [eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4] [eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d] [eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9] [eng-ipc4:31754] *** End of error message *** salloc: Relinquishing job allocation 145 salloc: Job allocation 145 has been revoked. zsh: exit 1 salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ ServerAdmin/mpi I've anonymised the paths and domain, otherwise pasted verbatim. The only odd thing I notice is that the launching machine uses its full domain name, whereas the other machine is referred to by the short name. Despite the FQDN, the domain does not exist in the DNS (for historical reasons), but does exist in the /etc/hosts file. Any further clues would be appreciated. In case it may be relevant, core system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32. One other point of difference may be that our environment is tcp (ethernet) based whereas the LANL test environment is not? 
Michael ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
Hi, A detailed backtrace from a core dump may help us debug this. Would you be willing to provide that information for us? Thanks, -- Samuel K. Gutierrez Los Alamos National Laboratory On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote: On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote: Hi, I just tried to reproduce the problem that you are experiencing and was unable to. SLURM 2.1.15 Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/ lanl/tlcc/debug-nopanasas I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform file (the only change was to re-enable btl-tcp). Unfortunately, the result is the same: salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi salloc: Granted job allocation 145 JOB MAP Data for node: Name: eng-ipc4.{FQDN}Num procs: 8 Process OMPI jobid: [6932,1] Process rank: 0 Process OMPI jobid: [6932,1] Process rank: 1 Process OMPI jobid: [6932,1] Process rank: 2 Process OMPI jobid: [6932,1] Process rank: 3 Process OMPI jobid: [6932,1] Process rank: 4 Process OMPI jobid: [6932,1] Process rank: 5 Process OMPI jobid: [6932,1] Process rank: 6 Process OMPI jobid: [6932,1] Process rank: 7 Data for node: Name: ipc3 Num procs: 8 Process OMPI jobid: [6932,1] Process rank: 8 Process OMPI jobid: [6932,1] Process rank: 9 Process OMPI jobid: [6932,1] Process rank: 10 Process OMPI jobid: [6932,1] Process rank: 11 Process OMPI jobid: [6932,1] Process rank: 12 Process OMPI jobid: [6932,1] Process rank: 13 Process OMPI jobid: [6932,1] Process rank: 14 Process OMPI jobid: [6932,1] Process rank: 15 = [eng-ipc4:31754] *** Process received signal *** [eng-ipc4:31754] Signal: Segmentation fault (11) [eng-ipc4:31754] Signal code: Address not mapped (1) [eng-ipc4:31754] Failing at address: 0x8012eb748 [eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0] [eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) [0x7f81cf262869] [eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) 
[0x7f81cef93338] [eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) [0x7f81cef9397e] [eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so. 0(opal_event_loop+0x1f) [0x7f81cef9356f] [eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress +0x89) [0x7f81cef87916] [eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so. 0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20] [eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) [0x7f81cf267ed7] [eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46] [eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4] [eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d] [eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9] [eng-ipc4:31754] *** End of error message *** salloc: Relinquishing job allocation 145 salloc: Job allocation 145 has been revoked. zsh: exit 1 salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ ServerAdmin/mpi I've anonymised the paths and domain, otherwise pasted verbatim. The only odd thing I notice is that the launching machine uses its full domain name, whereas the other machine is referred to by the short name. Despite the FQDN, the domain does not exist in the DNS (for historical reasons), but does exist in the /etc/hosts file. Any further clues would be appreciated. In case it may be relevant, core system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32. One other point of difference may be that our environment is tcp (ethernet) based whereas the LANL test environment is not? Michael ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
Hi, I just tried to reproduce the problem that you are experiencing and was unable to. [samuel@lo1-fe ~]$ salloc -n32 mpirun --display-map ./mpi_app salloc: Job is in held state, pending scheduler release salloc: Pending job allocation 138319 salloc: job 138319 queued and waiting for resources salloc: job 138319 has been allocated resources salloc: Granted job allocation 138319 JOB MAP Data for node: Name: lob083Num procs: 16 Process OMPI jobid: [26464,1] Process rank: 0 Process OMPI jobid: [26464,1] Process rank: 1 Process OMPI jobid: [26464,1] Process rank: 2 Process OMPI jobid: [26464,1] Process rank: 3 Process OMPI jobid: [26464,1] Process rank: 4 Process OMPI jobid: [26464,1] Process rank: 5 Process OMPI jobid: [26464,1] Process rank: 6 Process OMPI jobid: [26464,1] Process rank: 7 Process OMPI jobid: [26464,1] Process rank: 8 Process OMPI jobid: [26464,1] Process rank: 9 Process OMPI jobid: [26464,1] Process rank: 10 Process OMPI jobid: [26464,1] Process rank: 11 Process OMPI jobid: [26464,1] Process rank: 12 Process OMPI jobid: [26464,1] Process rank: 13 Process OMPI jobid: [26464,1] Process rank: 14 Process OMPI jobid: [26464,1] Process rank: 15 Data for node: Name: lob084Num procs: 16 Process OMPI jobid: [26464,1] Process rank: 16 Process OMPI jobid: [26464,1] Process rank: 17 Process OMPI jobid: [26464,1] Process rank: 18 Process OMPI jobid: [26464,1] Process rank: 19 Process OMPI jobid: [26464,1] Process rank: 20 Process OMPI jobid: [26464,1] Process rank: 21 Process OMPI jobid: [26464,1] Process rank: 22 Process OMPI jobid: [26464,1] Process rank: 23 Process OMPI jobid: [26464,1] Process rank: 24 Process OMPI jobid: [26464,1] Process rank: 25 Process OMPI jobid: [26464,1] Process rank: 26 Process OMPI jobid: [26464,1] Process rank: 27 Process OMPI jobid: [26464,1] Process rank: 28 Process OMPI jobid: [26464,1] Process rank: 29 Process OMPI jobid: [26464,1] Process rank: 30 Process OMPI jobid: [26464,1] Process rank: 31 SLURM 2.1.15 Open MPI 1.4.3 
configured with: --with-platform=./contrib/platform/ lanl/tlcc/debug-nopanasas I'll dig a bit further. Sam On Feb 2, 2011, at 9:53 AM, Samuel K. Gutierrez wrote: Hi, We'll try to reproduce the problem. Thanks, -- Samuel K. Gutierrez Los Alamos National Laboratory On Feb 2, 2011, at 2:55 AM, Michael Curtis wrote: On 28/01/2011, at 8:16 PM, Michael Curtis wrote: On 27/01/2011, at 4:51 PM, Michael Curtis wrote: Some more debugging information: Is anyone able to help with this problem? As far as I can tell it's a stock-standard recently installed SLURM installation. I can try 1.5.1 but hesitant to deploy this as it would require a recompile of some rather large pieces of software. Should I re- post to the -devel lists? Regards, ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
Hi, We'll try to reproduce the problem. Thanks, -- Samuel K. Gutierrez Los Alamos National Laboratory On Feb 2, 2011, at 2:55 AM, Michael Curtis wrote: On 28/01/2011, at 8:16 PM, Michael Curtis wrote: On 27/01/2011, at 4:51 PM, Michael Curtis wrote: Some more debugging information: Is anyone able to help with this problem? As far as I can tell it's a stock-standard recently installed SLURM installation. I can try 1.5.1 but hesitant to deploy this as it would require a recompile of some rather large pieces of software. Should I re-post to the -devel lists? Regards, ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Deprecated parameter: plm_rsh_agent
Hi Josh, I -think- the new name is orte_rsh_agent. At least according to ompi_info. $ ompi_info -a --parsable | grep orte_rsh_agent mca:orte:base:param:orte_rsh_agent:value:ssh : rsh mca:orte:base:param:orte_rsh_agent:data_source:default value mca:orte:base:param:orte_rsh_agent:status:writable mca:orte:base:param:orte_rsh_agent:help:The command used to launch executables on remote nodes (typically either "ssh" or "rsh") mca:orte:base:param:orte_rsh_agent:deprecated:no mca:orte:base:param:orte_rsh_agent:synonym:name:pls_rsh_agent mca:orte:base:param:orte_rsh_agent:synonym:name:plm_rsh_agent mca:plm:base:param:plm_rsh_agent:synonym_of:name:orte_rsh_agent -- Samuel K. Gutierrez Los Alamos National Laboratory On Nov 5, 2010, at 12:41 PM, Joshua Bernstein wrote: Hello All, When building the examples included with OpenMPI version 1.5 I see a message printed as follows: -- A deprecated MCA parameter value was specified in an MCA parameter file. Deprecated MCA parameters should be avoided; they may disappear in future releases. Deprecated parameter: plm_rsh_agent -- While I know that in pre 1.3.x releases the variable was pls_rsh_agent, plm_rsh_agent worked all the way through at least 1.4.3. What is the new keyword name? I can't seem to find it in the FAQ located here: http://www.open-mpi.org/faq/?category=rsh -Josh ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
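[Editor's note] A quick way to confirm the rename on a given installation, following the ompi_info invocation Samuel shows above. The grep pattern is an assumption about the parsable output; the snippet is guarded so it is harmless on machines without Open MPI installed.

```shell
# List the rsh agent parameter and its deprecated synonyms, if
# ompi_info is available on this machine.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info -a --parsable | grep 'rsh_agent' || true
fi
# Set it under the non-deprecated name (env-var form; ssh is the
# default anyway, shown here only as an example value):
export OMPI_MCA_orte_rsh_agent=ssh
```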
Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
Hi Gus, Doh! I didn't see the kernel-related messages after the segfault message. Definitely some weirdness here that is beyond your control... Sorry about that. -- Samuel K. Gutierrez Los Alamos National Laboratory On May 6, 2010, at 3:28 PM, Gus Correa wrote: Hi Samuel Samuel K. Gutierrez wrote: Hi Gus, This may not help, but it's worth a try. If it's not too much trouble, can you please reconfigure your Open MPI installation with --enable-debug and then rebuild? After that, may we see the stack trace from a core file that is produced after the segmentation fault? Thanks, -- Samuel K. Gutierrez Los Alamos National Laboratory Thank you for the suggestion. I am a bit reluctant to try this because when it fails, it *really* fails. Most of the times the machine doesn't even return the prompt, and in all cases it freezes and requires a hard reboot. It is not a segfault that the OS can catch, I guess. I wonder if enabling debug mode would do much for us, and get to the point of dumping a core, or just die before that. Gus Correa - Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA - On May 6, 2010, at 12:01 PM, Gus Correa wrote: Hi Eugene Thanks for the detailed answer. * 1) Now I can see and use the btl_sm_num_fifos component: I had committed already "btl = ^sm" to the openmpi-mca-params.conf file. This apparently hides the btl_sm_num_fifos from ompi_info. After I switched to no options in openmpi-mca-params.conf, then ompi_info showed the btl_sm_num_fifos component. ompi_info --all | grep btl_sm_num_fifos MCA btl: parameter "btl_sm_num_fifos" (current value: "1", data source: default value) A side comment: This means that the system administrator can hide some Open MPI options from the users, depending on what he puts in the openmpi-mca-params.conf file, right? * 2) However, running with "sm" still breaks, unfortunately: Boomer! 
I get the same errors that I reported in my very first email, if I increase the number of processes to 16, to explore the hyperthreading range. This is using "sm" (i.e. not excluded in the mca config file), and btl_sm_num_fifos (mpiexec command line) The machine hangs, requires a hard reboot, etc, etc, as reported earlier. See the below, please. So, I guess the conclusion is that I can use sm, but I have to remain within the range of physical cores (8), not oversubscribe, not try to explore the HT range. Should I expect it to work also for np>number of physical cores? I wonder if this would still work with np<=8, but with heavier code. (I only used hello_c.c so far.) Not sure I'll be able to test this, the user wants to use the machine. $mpiexec -mca btl_sm_num_fifos 4 -np 4 a.out Hello, world, I am 0 of 4 Hello, world, I am 1 of 4 Hello, world, I am 2 of 4 Hello, world, I am 3 of 4 $ mpiexec -mca btl_sm_num_fifos 8 -np 8 a.out Hello, world, I am 0 of 8 Hello, world, I am 1 of 8 Hello, world, I am 2 of 8 Hello, world, I am 3 of 8 Hello, world, I am 4 of 8 Hello, world, I am 5 of 8 Hello, world, I am 6 of 8 Hello, world, I am 7 of 8 $ mpiexec -mca btl_sm_num_fifos 16 -np 16 a.out -- mpiexec noticed that process rank 8 with PID 3659 on node spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault). -- $ Message from syslogd@spinoza at May 6 13:38:13 ... kernel:[ cut here ] Message from syslogd@spinoza at May 6 13:38:13 ... kernel:invalid opcode: [#1] SMP Message from syslogd@spinoza at May 6 13:38:13 ... kernel:last sysfs file: /sys/devices/system/cpu/cpu15/topology/ physical_package_id Message from syslogd@spinoza at May 6 13:38:13 ... kernel:Stack: Message from syslogd@spinoza at May 6 13:38:13 ... kernel:Call Trace: Message from syslogd@spinoza at May 6 13:38:13 ... 
kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00 00 4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01 * Many thanks, Gus Correa - Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA - Eugene Loh wrote: Gus Correa wrote: Hi Eugene Thank you for answering one of my original questions. However, there seems to be a problem with the syntax. Is it really "-mca btl btl_sm_num_fifos=some_number"? No. Try "--mca btl_sm_num_fifos 4". Or, % setenv OMPI_MCA
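[Editor's note] Gus's crashes begin once -np exceeds the machine's 8 physical cores (hyperthreading exposes 16 logical CPUs). A sketch for checking both counts before choosing -np; the lscpu field names assume util-linux on Linux.

```shell
# Logical CPUs as the scheduler sees them (includes hyperthreads).
logical=$(nproc)
# Physical cores = sockets x cores-per-socket, parsed from lscpu.
sockets=$(lscpu | awk -F: '/^Socket\(s\)/ {gsub(/ /, "", $2); print $2}')
per_socket=$(lscpu | awk -F: '/^Core\(s\) per socket/ {gsub(/ /, "", $2); print $2}')
physical=$((sockets * per_socket))
echo "physical=$physical logical=$logical"
# Staying at -np <= physical cores avoids the hyperthread range that
# triggered the hang reported above.
```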
Re: [OMPI users] How do I run OpenMPI safely on a Nehalem standalone machine?
Hi Gus, This may not help, but it's worth a try. If it's not too much trouble, can you please reconfigure your Open MPI installation with -- enable-debug and then rebuild? After that, may we see the stack trace from a core file that is produced after the segmentation fault? Thanks, -- Samuel K. Gutierrez Los Alamos National Laboratory On May 6, 2010, at 12:01 PM, Gus Correa wrote: Hi Eugene Thanks for the detailed answer. * 1) Now I can see and use the btl_sm_num_fifos component: I had committed already "btl = ^sm" to the openmpi-mca-params.conf file. This apparently hides the btl_sm_num_fifos from ompi_info. After I switched to no options in openmpi-mca-params.conf, then ompi_info showed the btl_sm_num_fifos component. ompi_info --all | grep btl_sm_num_fifos MCA btl: parameter "btl_sm_num_fifos" (current value: "1", data source: default value) A side comment: This means that the system administrator can hide some Open MPI options from the users, depending on what he puts in the openmpi-mca-params.conf file, right? * 2) However, running with "sm" still breaks, unfortunately: Boomer! I get the same errors that I reported in my very first email, if I increase the number of processes to 16, to explore the hyperthreading range. This is using "sm" (i.e. not excluded in the mca config file), and btl_sm_num_fifos (mpiexec command line) The machine hangs, requires a hard reboot, etc, etc, as reported earlier. See the below, please. So, I guess the conclusion is that I can use sm, but I have to remain within the range of physical cores (8), not oversubscribe, not try to explore the HT range. Should I expect it to work also for np>number of physical cores? I wonder if this would still work with np<=8, but with heavier code. (I only used hello_c.c so far.) Not sure I'll be able to test this, the user wants to use the machine. 
$ mpiexec -mca btl_sm_num_fifos 4 -np 4 a.out
Hello, world, I am 0 of 4
Hello, world, I am 1 of 4
Hello, world, I am 2 of 4
Hello, world, I am 3 of 4
$ mpiexec -mca btl_sm_num_fifos 8 -np 8 a.out
Hello, world, I am 0 of 8
Hello, world, I am 1 of 8
Hello, world, I am 2 of 8
Hello, world, I am 3 of 8
Hello, world, I am 4 of 8
Hello, world, I am 5 of 8
Hello, world, I am 6 of 8
Hello, world, I am 7 of 8
$ mpiexec -mca btl_sm_num_fifos 16 -np 16 a.out
--
mpiexec noticed that process rank 8 with PID 3659 on node spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
--
$

Message from syslogd@spinoza at May 6 13:38:13 ... kernel:[ cut here ]
Message from syslogd@spinoza at May 6 13:38:13 ... kernel:invalid opcode: [#1] SMP
Message from syslogd@spinoza at May 6 13:38:13 ... kernel:last sysfs file: /sys/devices/system/cpu/cpu15/topology/physical_package_id
Message from syslogd@spinoza at May 6 13:38:13 ... kernel:Stack:
Message from syslogd@spinoza at May 6 13:38:13 ... kernel:Call Trace:
Message from syslogd@spinoza at May 6 13:38:13 ... kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00 00 4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01

Many thanks,
Gus Correa

Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA

Eugene Loh wrote:

Gus Correa wrote: Hi Eugene, Thank you for answering one of my original questions. However, there seems to be a problem with the syntax. Is it really "-mca btl btl_sm_num_fifos=some_number"?

No. Try "--mca btl_sm_num_fifos 4". Or,

% setenv OMPI_MCA_btl_sm_num_fifos 4
% ompi_info -a | grep btl_sm_num_fifos # check that things were set correctly
% mpirun -n 4 a.out

When I grep any component starting with btl_sm I get nothing: ompi_info --all | grep btl_sm (No output)

I'm no guru, but I think the reason has something to do with dynamically loaded somethings.
E.g.,

% /home/eugene/ompi/bin/ompi_info --all | grep btl_sm_num_fifos
(no output)
% setenv OPAL_PREFIX /home/eugene/ompi
% set path = ( $OPAL_PREFIX/bin $path )
% ompi_info --all | grep btl_sm_num_fifos
MCA btl: parameter "btl_sm_num_fifos" (current value: "1", data source: default value)

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
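Eugene's csh-style commands above translate to a bash-like shell as follows; /home/eugene/ompi is the example installation prefix from the message, not a real system:

```shell
# bash-like equivalent of the csh commands quoted above.
export OPAL_PREFIX=/home/eugene/ompi
export PATH="$OPAL_PREFIX/bin:$PATH"
# Setting the environment variable is equivalent to passing
# "--mca btl_sm_num_fifos 4" on the mpirun/mpiexec command line:
export OMPI_MCA_btl_sm_num_fifos=4
echo "$OMPI_MCA_btl_sm_num_fifos"
```

With the variables set, `ompi_info --all | grep btl_sm_num_fifos` should then report the value from the environment rather than the default.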
Re: [OMPI users] Problem in using openmpi
Hi,

If lib64 isn't there, try lib. That is,

export LD_LIBRARY_PATH=/home/jess/local/ompi/lib

Referencing the example that I provided earlier.

-- Samuel K. Gutierrez Los Alamos National Laboratory

On Mar 12, 2010, at 3:31 PM, vaibhav dutt wrote:

Hi, I used the export command, but it does not seem to work. It's still giving the same error, as the lib64 directory does not exist in the ompi folder. Any suggestions?

Thank You, Vaibhav

On Fri, Mar 12, 2010 at 3:05 PM, Fernando Lemos <fernando...@gmail.com> wrote: On Fri, Mar 12, 2010 at 6:02 PM, Samuel K. Gutierrez <sam...@lanl.gov> wrote:
> One more thing. The line should have been:
>
> export LD_LIBRARY_PATH=/home/jess/local/ompi/lib64
>
> The space in the previous email will make bash unhappy 8-|.
>
> -- Samuel K. Gutierrez
> Los Alamos National Laboratory
>
> On Mar 12, 2010, at 1:56 PM, Samuel K. Gutierrez wrote:
>
>> Hi,
>>
>> It sounds like you may need to set your LD_LIBRARY_PATH environment variable correctly. There are several ways that you can tell the dynamic linker where the required libraries are located, but the following may be sufficient for your needs.
>>
>> Let's say, for example, that your Open MPI installation is rooted at /home/jess/local/ompi and the libraries are located in /home/jess/local/ompi/lib64, try (bash-like shell):
>>
>> export LD_LIBRARY_PATH= /home/jess/local/ompi/lib64
>>
>> Hope this helps,
>>
>> -- Samuel K. Gutierrez
>> Los Alamos National Laboratory
>>
>> On Mar 12, 2010, at 1:32 PM, vaibhav dutt wrote:
>>
>>> Hi,
>>>
>>> I have installed openmpi on a Kubuntu machine with a dual-core AMD Athlon. When trying to compile a simple program, I am getting an error:
>>>
>>> mpicc: error while loading shared libraries: libopen-pal.so.0: cannot open shared object file: No such file or dir
>>>
>>> I read somewhere that this error is because of some intel compiler being not installed on the proper node, which I don't understand as I am using AMD.
>>> Kindly give your suggestions
>>>
>>> Thank You

It's probably a packaging error, if he used the distribution's packages. In that case, he should report the bug to downstream. If he installed from source, then it's most likely installed somewhere outside the library path, and the LD_LIBRARY_PATH trick might work (if it doesn't, make sure there are no leftovers, recompile, reinstall and it should work fine).

Regards,

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
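The stray-space pitfall called out in this thread is easy to reproduce directly; a minimal bash sketch using the thread's example path:

```shell
# Correct form: no whitespace around '='.
export LD_LIBRARY_PATH=/home/jess/local/ompi/lib64
echo "$LD_LIBRARY_PATH"

# Broken form: "export LD_LIBRARY_PATH= /home/jess/local/ompi/lib64"
# With the space, bash exports LD_LIBRARY_PATH as an empty string and
# then treats the path as a second name to export, failing with
# "not a valid identifier".
```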
Re: [OMPI users] Problem in using openmpi
One more thing. The line should have been:

export LD_LIBRARY_PATH=/home/jess/local/ompi/lib64

The space in the previous email will make bash unhappy 8-|.

-- Samuel K. Gutierrez Los Alamos National Laboratory

On Mar 12, 2010, at 1:56 PM, Samuel K. Gutierrez wrote:

Hi, It sounds like you may need to set your LD_LIBRARY_PATH environment variable correctly. There are several ways that you can tell the dynamic linker where the required libraries are located, but the following may be sufficient for your needs.

Let's say, for example, that your Open MPI installation is rooted at /home/jess/local/ompi and the libraries are located in /home/jess/local/ompi/lib64, try (bash-like shell):

export LD_LIBRARY_PATH= /home/jess/local/ompi/lib64

Hope this helps,

-- Samuel K. Gutierrez Los Alamos National Laboratory

On Mar 12, 2010, at 1:32 PM, vaibhav dutt wrote:

Hi, I have installed openmpi on a Kubuntu machine with a dual-core AMD Athlon. When trying to compile a simple program, I am getting an error:

mpicc: error while loading shared libraries: libopen-pal.so.0: cannot open shared object file: No such file or dir

I read somewhere that this error is because of some intel compiler being not installed on the proper node, which I don't understand as I am using AMD.

Kindly give your suggestions

Thank You

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Problem in using openmpi
Hi,

It sounds like you may need to set your LD_LIBRARY_PATH environment variable correctly. There are several ways that you can tell the dynamic linker where the required libraries are located, but the following may be sufficient for your needs.

Let's say, for example, that your Open MPI installation is rooted at /home/jess/local/ompi and the libraries are located in /home/jess/local/ompi/lib64, try (bash-like shell):

export LD_LIBRARY_PATH= /home/jess/local/ompi/lib64

Hope this helps,

-- Samuel K. Gutierrez Los Alamos National Laboratory

On Mar 12, 2010, at 1:32 PM, vaibhav dutt wrote:

Hi, I have installed openmpi on a Kubuntu machine with a dual-core AMD Athlon. When trying to compile a simple program, I am getting an error:

mpicc: error while loading shared libraries: libopen-pal.so.0: cannot open shared object file: No such file or dir

I read somewhere that this error is because of some intel compiler being not installed on the proper node, which I don't understand as I am using AMD.

Kindly give your suggestions

Thank You

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] memchecker overhead?
Hi Jed,

I'm not sure if this will help, but it's worth a try. Turn off OMPI's memory wrapper and see what happens.

csh-like shell: setenv OMPI_MCA_memory_ptmalloc2_disable 1
bash-like shell: export OMPI_MCA_memory_ptmalloc2_disable=1

Also add the following MCA parameter to your run command: --mca mpi_leave_pinned 0

-- Samuel K. Gutierrez Los Alamos National Laboratory

On Oct 26, 2009, at 1:41 PM, Jed Brown wrote:

Jeff Squyres wrote: Using --enable-debug adds in a whole pile of developer-level run-time checking and whatnot. You probably don't want that on production runs.

I have found that --enable-debug --enable-memchecker actually produces more valgrind noise than leaving them off. Are there options to make Open MPI strict about initializing and freeing memory? At one point I tried to write policy files, but even with judicious globbing, I kept getting different warnings when run on a different program. (All these codes were squeaky-clean under MPICH2.)

Jed

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
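Samuel's two workarounds can be combined in a bash-like shell; a sketch (whether they actually reduce the valgrind noise will depend on the installation):

```shell
# Disable Open MPI's ptmalloc2 memory wrapper via the environment:
export OMPI_MCA_memory_ptmalloc2_disable=1
# Environment form of passing "--mca mpi_leave_pinned 0" to mpirun:
export OMPI_MCA_mpi_leave_pinned=0
echo "$OMPI_MCA_memory_ptmalloc2_disable $OMPI_MCA_mpi_leave_pinned"
```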
Re: [OMPI users] memalign usage in OpenMPI and its consequences for TotalView
Ticket created (#2040). I hope it's okay ;-).

-- Samuel K. Gutierrez Los Alamos National Laboratory

On Oct 1, 2009, at 11:58 AM, Jeff Squyres wrote:

Did that make it over to the v1.3 branch?

On Oct 1, 2009, at 1:39 PM, Samuel K. Gutierrez wrote:

Hi, I think Jeff has already addressed this problem. https://svn.open-mpi.org/trac/ompi/changeset/21744

-- Samuel K. Gutierrez Los Alamos National Laboratory

On Oct 1, 2009, at 11:25 AM, Peter Thompson wrote:

> We had a question from a user who had turned on memory debugging in TotalView and experienced a memory event error, "Invalid memory alignment request". Having a 1.3.3 build of OpenMPI handy, I tested it and sure enough, saw the error. I traced it down to, surprise, a call to memalign. I find there are a few places where memalign is called, but the one I think I was dealing with was from malloc.c in ompi/mca/io/romio/romio/adio/common in the following lines:
>
> #ifdef ROMIO_XFS
>     new = (void *) memalign(XFS_MEMALIGN, size);
> #else
>     new = (void *) malloc(size);
> #endif
>
> I searched, but couldn't find a value for XFS_MEMALIGN, so maybe it was from opal_pt_malloc2_component.c instead, where the call is
>
>     p = memalign(1, 1024 * 1024);
>
> There are only 10 to 12 references to memalign in the code that I can see, so it shouldn't be too hard to find. What I can tell you is that the value that TotalView saw for alignment, the first arg, was 1, and the second, the size, was 0x10, which is probably right for 1024 squared.
>
> The man page for memalign says that the first argument is the alignment that the allocated memory uses, and it must be a power of two. The second is the length you want allocated. One could argue that 1 is a power of two, but it seems a bit specious to me, and TotalView's memory debugger certainly objects to it. Can anyone tell me what the intent here is, and whether the memalign alignment argument is thought to be valid?
Or is this a bug (that might not > affect anyone other than TotalView memory debug users?) > > Thanks, > Peter Thompson > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] profile the performance of an MPI code: how much traffic is being generated?
Hi,

I'm writing a simple post-mortem profiling tool that provides some of the information that you are looking for. That being said, the tool, Loba, isn't publicly available just yet. In the meantime, take a look at mpiP (http://mpip.sourceforge.net/).

-- Samuel K. Gutierrez Los Alamos National Laboratory

On Tue, Sep 29, 2009 at 10:40 AM, Eugene Loh <eugene@sun.com> wrote: to know. It sounds like you want to be able to watch some % utilization of a hardware interface as the program is running. I *think* these tools (the ones on the FAQ, including MPE, Vampir, and Sun Studio) are not of that class.

You are correct. A real-time tool that sniffs at the MPI traffic would be best. Post-mortem profilers would be the next best option, I assume. I was trying to compile MPE but gave up. Too many errors. Trying to decide if I should prod on or look at another tool.

-- Rahul

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users