Re: [OMPI users] warnings and anachronisms in openmpi-1.7.4
Hi Oscar,

I always favor no warnings, but it is not for me to make this decision.

Kind regards, and thank you very much for your replies

Siegmar


> Quoting Siegmar Gross :
>
> > Hi Oscar,
> >
> >> The warnings of type "cast to pointer from integer of different size"
> >> are provoked when a jlong (64-bit handle in Java) is copied to a C
> >> pointer (32 bit) or vice versa.
> >>
> >> These warnings could be avoided with methods like these:
> >>
> >> void* ompi_java_cHandle(jlong handle)
> >> {
> >>     union { jlong j; void* c; } u;
> >>     u.j = handle;
> >>     return u.c;
> >> }
> >>
> >> jlong ompi_java_jHandle(void *handle)
> >> {
> >>     union { jlong j; void* c; } u;
> >>     u.c = handle;
> >>     return u.j;
> >> }
> >>
> >> We should change all the code in this manner:
> >>
> >> JNIEXPORT jlong JNICALL Java_mpi_Win_free(
> >>         JNIEnv *env, jobject jthis, jlong handle)
> >> {
> >>     MPI_Win win = ompi_java_cHandle(handle);
> >>     int rc = MPI_Win_free(&win);
> >>     ompi_java_exceptionCheck(env, rc);
> >>     return ompi_java_jHandle(win);
> >> }
> >>
> >> I don't know if it is worth it.
> >
> > I don't know either, but you will possibly get an error if you store
> > a 64-bit value into a 32-bit pointer. If the Java interface should be
> > available on 32-bit systems as well, it would be necessary (at least
> > in my opinion).
>
> There is no loss of information, because the 64-bit values (Java long)
> come from 32-bit values (C pointers). It works OK.
>
> The question is whether we want to avoid these warnings.
>
> > Kind regards
> >
> > Siegmar
> >
> >> Regards,
> >> Oscar
> >>
> >> Quoting Siegmar Gross :
> >>
> >> > Hi,
> >> >
> >> > yesterday I compiled 32- and 64-bit versions of openmpi-1.7.4 for
> >> > my platforms (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE
> >> > Linux 12.1 x86_64) with Sun C 5.12 and gcc-4.8.0. I could build
> >> > a 64-bit version for Linux with gcc without warnings. Everything
> >> > else showed warnings. I received many warnings for my 32-bit
> >> > versions (mainly for the Java interface with gcc). I have combined
> >> > all warnings for my platforms so that it is easier to fix them, if
> >> > somebody wants to fix them. The attached files contain the warnings
> >> > from each compiler. I can also provide specific files like
> >> > Solaris.x86_64.32_cc.uniq or even my log files (e.g.,
> >> > log.make.SunOS.x86_64.32_cc).
> >> >
> >> > Kind regards
> >> >
> >> > Siegmar
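For readers following along, here is a minimal stand-alone sketch of the warning being discussed. It is not Open MPI code: the jlong typedef and the test harness are invented for illustration, and only the two conversion styles mirror the snippets quoted above. On a 32-bit build, casting a 64-bit jlong straight to a pointer makes gcc emit "cast to pointer from integer of different size", while the union-based helper compiles without a warning.

#include <stdio.h>
#include <stdint.h>

typedef long long jlong;              /* stand-in for JNI's 64-bit jlong */

/* Direct cast: on a 32-bit target gcc emits
 * "cast to pointer from integer of different size". */
static void *direct_cast(jlong handle)
{
    return (void *)handle;
}

/* Union-based conversion in the style quoted above: no pointer/integer
 * cast, so no warning.  It assumes the jlong was originally produced
 * from a native pointer on the same platform. */
static void *union_cast(jlong handle)
{
    union { jlong j; void *c; } u;
    u.j = handle;
    return u.c;
}

int main(void)
{
    int dummy = 42;
    jlong handle = (jlong)(intptr_t)&dummy;   /* widen a real pointer losslessly */
    printf("direct: %p  union: %p\n", direct_cast(handle), union_cast(handle));
    return 0;
}

Whether the union trick round-trips the value correctly on every 32-bit target is a separate question from the warning itself: the pointer member overlaps the first bytes of the 64-bit member, so byte order matters on big-endian machines.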
[OMPI users] valgrind invalid reads for large self-sends using thread_multiple
Hello,

I have used Open MPI in conjunction with Valgrind for a long time now, and have developed a list of suppressions for known false positives over time. I am now developing a library for inter-thread communication that is based on using Open MPI with MPI_THREAD_MULTIPLE support. I have noticed that sending large messages from one thread to another in the same process causes Valgrind to complain about invalid reads. I have narrowed it down to one function being executed on four threads in one process.

Attached is a tarball containing the error-reproducing program, the Valgrind suppression file, and the Valgrind output. The strange thing is that the Valgrind error message fits neither the read-after-free nor the read-past-the-end pattern.

I'd like to know the following:

1) Should I even worry? The code doesn't crash; only Valgrind complains. Is it a harmless false positive?
2) If it is an issue, am I using MPI correctly?
3) If I'm using it correctly, then what causes this? Some kind of internal buffering issue? Note that I use MPI_Issend, so nothing should be freed until it has been completely read (in theory).

Thank you,

--
Dan Ibanez

thread_test.tar
Description: Unix tar archive
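The actual reproducer is in the attached thread_test.tar. For readers without the attachment, the following is only an illustrative sketch of the general pattern being described, a large MPI_Issend from one thread to a matching receive in another thread of the same rank under MPI_THREAD_MULTIPLE; the message size, thread layout, and all names here are invented for illustration.

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT (1 << 22)                    /* ~16 MB of ints: a "large" message */

static int rank;

/* Sending thread: synchronous nonblocking send to our own rank. */
static void *sender(void *arg)
{
    int *buf = malloc(COUNT * sizeof(int));
    for (int i = 0; i < COUNT; ++i)
        buf[i] = i;

    MPI_Request req;
    MPI_Issend(buf, COUNT, MPI_INT, rank, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);     /* completes only after the receive */
    free(buf);                             /* freed strictly after completion */
    return arg;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not provided\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pthread_t t;
    pthread_create(&t, NULL, sender, NULL);

    /* The main thread posts the matching self-receive. */
    int *rbuf = malloc(COUNT * sizeof(int));
    MPI_Recv(rbuf, COUNT, MPI_INT, rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    pthread_join(t, NULL);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

Built with something like mpicc -std=c99 -pthread and run as a single rank under Valgrind (e.g. mpirun -np 1 valgrind ./a.out), this is the kind of self-send the report is about; whether it reproduces the same invalid-read complaints will depend on the Open MPI build.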
Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources
After a bunch of off-list communication, it turns out that the OMPI warning cited in the first mail of this chain was indeed the culprit. OMPI was warning that it could only register about half the memory in the machine due to limitations in the OFED driver. Once those limits were raised to include the entire memory in the machine, the problems went away.

That was the main/root issue; there were a small number of other side issues that crept in during debugging and troubleshooting. We fixed all those configuration issues along the way (e.g., did a clean, new install into a fresh installation tree to avoid any stale cruft from prior builds, etc.).

On Jan 24, 2014, at 6:24 PM, Jeff Squyres (jsquyres) wrote:

> Greg and I are chatting off list; there's something definitely weird going on
> in his setup.
>
> We'll report back to the list when we figure it out.
>
>
> On Jan 24, 2014, at 1:26 PM, Gus Correa wrote:
>
>> On 01/24/2014 12:50 PM, Fischer, Greg A. wrote:
>>> Yep. That was the problem. It works beautifully now.
>>>
>>> Thanks for prodding me to take another look.
>>>
>>> With regards to openmpi-1.6.5, the system that I'm compiling and running
>>> on, SLES10, contains some pretty dated software (e.g. Linux 2.6.x,
>>> python 2.4, gcc 4.1.2). Is it possible there's simply an incompatibility
>>> lurking in there somewhere that would trip openmpi-1.6.5 but not
>>> openmpi-1.4.3?
>>>
>>> Greg
>>>
>>
>> Hi Greg
>>
>> FWIW, we have OpenMPI 1.6.5 installed
>> (and we have used OMPI 1.4.5, 1.4.4, 1.4.3, ..., 1.2.8, before)
>> in our older cluster that has CentOS 5.2, Linux kernel 2.6.18,
>> gcc 4.1.2, Python 2.4.3, etc.
>> Parallel programs compile and run with OMPI 1.6.5 without problems.
>>
>> I hope this helps,
>> Gus Correa

-----Original Message-----
From: Fischer, Greg A.
Sent: Friday, January 24, 2014 11:41 AM
To: 'Open MPI Users'
Cc: Fischer, Greg A.
Subject: RE: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

Hmm... It looks like CMake was somehow finding openmpi-1.6.5 instead of openmpi-1.4.3, despite the environment variables being set otherwise. This is likely the explanation. I'll try to chase that down.

> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> Squyres (jsquyres)
> Sent: Friday, January 24, 2014 11:39 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
> consumes all system resources
>
> Ok. I only mention this because the "mca_paffinity_linux.so: undefined
> symbol: mca_base_param_reg_int" type of message is almost always an
> indicator of two different versions being installed into the same tree.
>
>
> On Jan 24, 2014, at 11:26 AM, "Fischer, Greg A." wrote:
>
>> Version 1.4.3 and 1.6.5 were and are installed in separate trees:
>>
>> 1003 fischega@lxlogin2[~]> ls
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.*
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3:
>> bin  etc  include  lib  share
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5:
>> bin  etc  include  lib  share
>>
>> I'm fairly sure I was careful to check that the LD_LIBRARY_PATH was set
>> correctly, but I'll check again.
>>
>>> -----Original Message-----
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>> Squyres (jsquyres)
>>> Sent: Friday, January 24, 2014 11:07 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize
>>> and consumes all system resources
>>>
>>> On Jan 22, 2014, at 10:21 AM, "Fischer, Greg A." wrote:
>>>
>>>> The reason for deleting the openmpi-1.6.5 installation was that I went
>>>> back and installed openmpi-1.4.3 and the problem (mostly) went away.
>>>> Openmpi-1.4.3 can run the simple tests without issue, but on my "real"
>>>> program, I'm getting symbol lookup errors:
>>>>
>>>> mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int
>>>
>>> This sounds like you are mixing 1.6.x and 1.4.x in the same installation
>>> tree. This can definitely lead to sadness.
>>>
>>> More specifically: installing 1.6 over an existing 1.4 installation (and
>>> vice versa) is definitely NOT supported. The set of plugins that the two
>>> install are different, and can lead to all manner of weird/undefined
>>> behavior.
>>>
>>> FWIW: I typically install Open MPI into a tree by itself. And if I later
>>> want to remove that installation, I just "rm -rf" that tree. Then I can
>>> install a different version of OMPI into that same tree (because the
>>> prior tree is completely
Re: [OMPI users] rankfiles in openmpi-1.7.4
Hmmm... afraid there isn't much I can offer here, Siegmar. For whatever reason, hwloc is indicating it cannot bind processes on that architecture.

On Feb 9, 2014, at 12:08 PM, Siegmar Gross wrote:

> Hi Ralph,
>
> thank you very much for your reply. I have changed my rankfile.
>
> rank 0=rs0 slot=0:0-1
> rank 1=rs0 slot=1
> rank 2=rs1 slot=0
> rank 3=rs1 slot=1
>
> Now I get the following output.
>
> rs0 openmpi_1.7.x_or_newer 108 mpiexec --report-bindings \
>   --use-hwthread-cpus -np 4 -rf rf_rs0_rs1 hostname
> --
> Open MPI tried to bind a new process, but something went wrong. The
> process was killed without launching the target application. Your job
> will now abort.
>
>   Local host:        rs0
>   Application name:  /usr/local/bin/hostname
>   Error message:     hwloc indicates cpu binding cannot be enforced
>   Location:          ../../../../../openmpi-1.7.4/orte/mca/odls/default/odls_default_module.c:499
> --
> rs0 openmpi_1.7.x_or_newer 109
>
>
> Kind regards
>
> Siegmar
>
>
>>> today I tested rankfiles once more. The good news first: openmpi-1.7.4
>>> now supports my Sun M4000 server with Sparc VII processors on the
>>> command line.
>>>
>>> rs0 openmpi_1.7.x_or_newer 104 mpiexec --report-bindings -np 4 \
>>>   --bind-to hwthread hostname
>>> [rs0.informatik.hs-fulda.de:06051] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [../B./../..][../../../..]
>>> [rs0.informatik.hs-fulda.de:06051] MCW rank 2 bound to socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
>>> [rs0.informatik.hs-fulda.de:06051] MCW rank 3 bound to socket 1[core 5[hwt 0]]: [../../../..][../B./../..]
>>> [rs0.informatik.hs-fulda.de:06051] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> rs0 openmpi_1.7.x_or_newer 105
>>>
>>> Thank you very much for solving this problem. Unfortunately I still
>>> have a problem with a rankfile. Contents of my rankfile:
>>>
>>> rank 0=rs0 slot=0:0-7
>>> rank 1=rs0 slot=1
>>> rank 2=rs1 slot=0
>>> rank 3=rs1 slot=1
>>>
>>
>> Here's your problem - you told us socket 0, cores 0-7. However, if
>> you look at your topology, you only have *4* cores in socket 0.
>>
>>
>>> rs0 openmpi_1.7.x_or_newer 105 mpiexec --report-bindings \
>>>   --use-hwthread-cpus -np 4 -rf rf_rs0_rs1 hostname
>>> [rs0.informatik.hs-fulda.de:06060] [[7659,0],0] ORTE_ERROR_LOG: Not found in file .../openmpi-1.7.4/orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 283
>>> [rs0.informatik.hs-fulda.de:06060] [[7659,0],0] ORTE_ERROR_LOG: Not found in file .../openmpi-1.7.4/orte/mca/rmaps/base/rmaps_base_map_job.c at line 284
>>> rs0 openmpi_1.7.x_or_newer 106
>>>
>>>
>>> rs0 openmpi_1.7.x_or_newer 110 mpiexec --report-bindings \
>>>   --display-allocation --mca rmaps_base_verbose_100 \
>>>   --use-hwthread-cpus -np 4 -rf rf_rs0_rs1 hostname
>>>
>>> ==   ALLOCATED NODES   ==
>>> rs0: slots=2 max_slots=0 slots_inuse=0
>>> rs1: slots=2 max_slots=0 slots_inuse=0
>>> =
>>> [rs0.informatik.hs-fulda.de:06074] [[7677,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-1.7.4/orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 283
>>> [rs0.informatik.hs-fulda.de:06074] [[7677,0],0] ORTE_ERROR_LOG: Not found in file ../../../../openmpi-1.7.4/orte/mca/rmaps/base/rmaps_base_map_job.c at line 284
>>> rs0 openmpi_1.7.x_or_newer 111
>>>
>>>
>>> rs0 openmpi_1.7.x_or_newer 111 mpiexec --report-bindings --display-allocation --mca ess_base_verbose 5 --use-hwthread-cpus -np 4 -rf rf_rs0_rs1 hostname
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Querying component [env]
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Skipping component [env]. Query failed to return a module
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Querying component [hnp]
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Query of component [hnp] set priority to 100
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Querying component [singleton]
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Skipping component [singleton]. Query failed to return a module
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Querying component [tool]
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Skipping component [tool]. Query failed to return a module
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Selected component [hnp]
>>> [rs0.informatik.hs-fulda.de:06078] [[INVALID],INVALID] Topology Info:
>>> [rs0.informatik.hs-f
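Since the error above comes down to hwloc reporting that CPU binding cannot be enforced on that machine, one way to check this independently of Open MPI is to query hwloc's support flags directly. The following is only an illustrative sketch, not something from the thread; the file name and build line (e.g. gcc check_bind.c -lhwloc) are assumed.

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* hwloc reports, per platform, which binding operations it can perform. */
    const struct hwloc_topology_support *sup = hwloc_topology_get_support(topo);
    printf("set_thisproc_cpubind:   %d\n", (int)sup->cpubind->set_thisproc_cpubind);
    printf("set_proc_cpubind:       %d\n", (int)sup->cpubind->set_proc_cpubind);
    printf("set_thisthread_cpubind: %d\n", (int)sup->cpubind->set_thisthread_cpubind);

    hwloc_topology_destroy(topo);
    return 0;
}

If those flags come back 0 on the Solaris/SPARC machine, then the mpiexec failure above reflects hwloc (or the underlying OS interface) genuinely declining to bind there, rather than a problem with the rankfile itself.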