Re: [OMPI users] warnings and anachronisms in openmpi-1.7.4
Hi Oscar,

I always favor no warnings, but it is not for me to make this decision.

Kind regards, and thank you very much for your replies

Siegmar


> Quoting Siegmar Gross :
>
> > Hi Oscar,
> >
> >> The warnings of type "cast to pointer from integer of different size"
> >> are provoked when a jlong (64-bit handle in Java) is copied to a C
> >> pointer (32 bit) or vice versa.
> >>
> >> These warnings could be avoided with methods like these:
> >>
> >> void* ompi_java_cHandle(jlong handle)
> >> {
> >>     union { jlong j; void* c; } u;
> >>     u.j = handle;
> >>     return u.c;
> >> }
> >>
> >> jlong ompi_java_jHandle(void *handle)
> >> {
> >>     union { jlong j; void* c; } u;
> >>     u.c = handle;
> >>     return u.j;
> >> }
> >>
> >> We should change all the code in this manner:
> >>
> >> JNIEXPORT jlong JNICALL Java_mpi_Win_free(
> >>         JNIEnv *env, jobject jthis, jlong handle)
> >> {
> >>     MPI_Win win = ompi_java_cHandle(handle);
> >>     int rc = MPI_Win_free(&win);
> >>     ompi_java_exceptionCheck(env, rc);
> >>     return ompi_java_jHandle(win);
> >> }
> >>
> >> I don't know if it is worth it.
> >
> > I don't know either, but you will possibly get an error if you store
> > a 64-bit value into a 32-bit pointer. If the Java interface should be
> > available on 32-bit systems as well, it would be necessary (at least
> > in my opinion).
>
> There is no loss of information, because the 64-bit values (Java long)
> come from 32-bit values (C pointers). It works OK.
>
> The question is whether we want to avoid these warnings.
>
> > Kind regards
> >
> > Siegmar
> >
> >> Regards,
> >> Oscar
> >>
> >> Quoting Siegmar Gross :
> >>
> >> > Hi,
> >> >
> >> > yesterday I compiled 32- and 64-bit versions of openmpi-1.7.4 for
> >> > my platforms (Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE
> >> > Linux 12.1 x86_64) with Sun C 5.12 and gcc-4.8.0. I could build
> >> > a 64-bit version for Linux with gcc without warnings. Everything
> >> > else showed warnings. I received many warnings for my 32-bit
> >> > versions (mainly for the Java interface with gcc). I have combined
> >> > all warnings for my platforms so that it is easier to fix them, if
> >> > somebody wants to fix them. The attached files contain the warnings
> >> > from each compiler. I can also provide specific files like
> >> > Solaris.x86_64.32_cc.uniq or even my log files (e.g.,
> >> > log.make.SunOS.x86_64.32_cc).
> >> >
> >> > Kind regards
> >> >
> >> > Siegmar
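For readers following along, here is a minimal stand-alone sketch of the warning being discussed. It is not Open MPI code: the jlong typedef and the test harness are invented for illustration, and only the two conversion styles mirror the snippets quoted above. On a 32-bit build, casting a 64-bit jlong straight to a pointer makes gcc emit "cast to pointer from integer of different size", while the union-based helper compiles without a warning.

#include <stdio.h>
#include <stdint.h>

typedef long long jlong;              /* stand-in for JNI's 64-bit jlong */

/* Direct cast: on a 32-bit target gcc emits
 * "cast to pointer from integer of different size". */
static void *direct_cast(jlong handle)
{
    return (void *)handle;
}

/* Union-based conversion in the style quoted above: no pointer/integer
 * cast, so no warning.  It assumes the jlong was originally produced
 * from a native pointer on the same platform. */
static void *union_cast(jlong handle)
{
    union { jlong j; void *c; } u;
    u.j = handle;
    return u.c;
}

int main(void)
{
    int dummy = 42;
    jlong handle = (jlong)(intptr_t)&dummy;   /* widen a real pointer losslessly */
    printf("direct: %p  union: %p\n", direct_cast(handle), union_cast(handle));
    return 0;
}

Whether the union trick round-trips the value correctly on every 32-bit target is a separate question from the warning itself: the pointer member overlaps the first bytes of the 64-bit member, so byte order matters on big-endian machines.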
[OMPI users] valgrind invalid reads for large self-sends using thread_multiple
Hello,

I have used Open MPI in conjunction with Valgrind for a long time now, and have developed a list of suppressions for known false positives over time. I am now developing a library for inter-thread communication that is based on using Open MPI with MPI_THREAD_MULTIPLE support. I have noticed that sending large messages from one thread to another in the same process causes Valgrind to complain about invalid reads. I have narrowed it down to one function being executed on four threads in one process.

Attached is a tarball containing the error-reproducing program, the Valgrind suppression file, and the Valgrind output. The strange thing is that the Valgrind error message fits neither the read-after-free nor the read-past-the-end pattern.

I'd like to know the following:

1) Should I even worry? The code doesn't crash; only Valgrind complains. Is it a harmless false positive?
2) If it is an issue, am I using MPI correctly?
3) If I'm using it correctly, then what causes this? Some kind of internal buffering issue? Note that I use MPI_Issend, so nothing should be freed until it has been completely read (in theory).

Thank you,

--
Dan Ibanez

thread_test.tar
Description: Unix tar archive
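The actual reproducer is in the attached thread_test.tar. For readers without the attachment, the following is only an illustrative sketch of the general pattern being described, a large MPI_Issend from one thread to a matching receive in another thread of the same rank under MPI_THREAD_MULTIPLE; the message size, thread layout, and all names here are invented for illustration.

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT (1 << 22)                    /* ~16 MB of ints: a "large" message */

static int rank;

/* Sending thread: synchronous nonblocking send to our own rank. */
static void *sender(void *arg)
{
    int *buf = malloc(COUNT * sizeof(int));
    for (int i = 0; i < COUNT; ++i)
        buf[i] = i;

    MPI_Request req;
    MPI_Issend(buf, COUNT, MPI_INT, rank, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);     /* completes only after the receive */
    free(buf);                             /* freed strictly after completion */
    return arg;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not provided\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pthread_t t;
    pthread_create(&t, NULL, sender, NULL);

    /* The main thread posts the matching self-receive. */
    int *rbuf = malloc(COUNT * sizeof(int));
    MPI_Recv(rbuf, COUNT, MPI_INT, rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    pthread_join(t, NULL);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

Built with something like mpicc -std=c99 -pthread and run as a single rank under Valgrind (e.g. mpirun -np 1 valgrind ./a.out), this is the kind of self-send the report is about; whether it reproduces the same invalid-read complaints will depend on the Open MPI build.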
Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources
After a bunch of off-list communication, it turns out that the OMPI warning cited in the first mail of this chain was indeed the culprit. OMPI was warning that it could only register about half the memory in the machine due to limitations in the OFED driver. Once those limits were raised to include the entire memory in the machine, the problems went away.

That was the main/root issue; there were a small number of other side issues that crept in during debugging and troubleshooting. We fixed all those configuration issues along the way (e.g., did a clean, new install into a fresh installation tree to avoid any stale cruft from prior builds, etc.).

On Jan 24, 2014, at 6:24 PM, Jeff Squyres (jsquyres) wrote:

> Greg and I are chatting off list; there's something definitely weird going on
> in his setup.
>
> We'll report back to the list when we figure it out.
>
>
> On Jan 24, 2014, at 1:26 PM, Gus Correa wrote:
>
>> On 01/24/2014 12:50 PM, Fischer, Greg A. wrote:
>>> Yep. That was the problem. It works beautifully now.
>>>
>>> Thanks for prodding me to take another look.
>>>
>>> With regards to openmpi-1.6.5, the system that I'm compiling and running
>>> on, SLES10, contains some pretty dated software (e.g. Linux 2.6.x,
>>> python 2.4, gcc 4.1.2). Is it possible there's simply an incompatibility
>>> lurking in there somewhere that would trip openmpi-1.6.5 but not
>>> openmpi-1.4.3?
>>>
>>> Greg
>>>
>>
>> Hi Greg
>>
>> FWIW, we have OpenMPI 1.6.5 installed
>> (and we have used OMPI 1.4.5, 1.4.4, 1.4.3, ..., 1.2.8, before)
>> in our older cluster that has CentOS 5.2, Linux kernel 2.6.18,
>> gcc 4.1.2, Python 2.4.3, etc.
>> Parallel programs compile and run with OMPI 1.6.5 without problems.
>>
>> I hope this helps,
>> Gus Correa

-----Original Message-----
From: Fischer, Greg A.
Sent: Friday, January 24, 2014 11:41 AM
To: 'Open MPI Users'
Cc: Fischer, Greg A.
Subject: RE: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

Hmm... It looks like CMake was somehow finding openmpi-1.6.5 instead of openmpi-1.4.3, despite the environment variables being set otherwise. This is likely the explanation. I'll try to chase that down.

> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> Squyres (jsquyres)
> Sent: Friday, January 24, 2014 11:39 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
> consumes all system resources
>
> Ok. I only mention this because the "mca_paffinity_linux.so: undefined
> symbol: mca_base_param_reg_int" type of message is almost always an
> indicator of two different versions being installed into the same tree.
>
>
> On Jan 24, 2014, at 11:26 AM, "Fischer, Greg A." wrote:
>
>> Version 1.4.3 and 1.6.5 were and are installed in separate trees:
>>
>> 1003 fischega@lxlogin2[~]> ls
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.*
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3:
>> bin  etc  include  lib  share
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5:
>> bin  etc  include  lib  share
>>
>> I'm fairly sure I was careful to check that the LD_LIBRARY_PATH was set
>> correctly, but I'll check again.
>>
>>> -----Original Message-----
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>> Squyres (jsquyres)
>>> Sent: Friday, January 24, 2014 11:07 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize
>>> and consumes all system resources
>>>
>>> On Jan 22, 2014, at 10:21 AM, "Fischer, Greg A." wrote:
>>>
>>>> The reason for deleting the openmpi-1.6.5 installation was that I went
>>>> back and installed openmpi-1.4.3 and the problem (mostly) went away.
>>>> Openmpi-1.4.3 can run the simple tests without issue, but on my "real"
>>>> program, I'm getting symbol lookup errors:
>>>>
>>>> mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int
>>>
>>> This sounds like you are mixing 1.6.x and 1.4.x in the same installation
>>> tree. This can definitely lead to sadness.
>>>
>>> More specifically: installing 1.6 over an existing 1.4 installation (and
>>> vice versa) is definitely NOT supported. The set of plugins that the two
>>> install are different, and can lead to all manner of weird/undefined
>>> behavior.
>>>
>>> FWIW: I typically install Open MPI into a tree by itself. And if I later
>>> want to remove that installation, I just "rm -rf" that tree. Then I can
>>> install a different version of OMPI into that same tree (because the
>>> prior tree is completely
Re: [OMPI users] rankfiles in openmpi-1.7.4
Hmmm... afraid there isn't much I can offer here, Siegmar. For whatever reason, hwloc is indicating it cannot bind processes on that architecture.

On Feb 9, 2014, at 12:08 PM, Siegmar Gross wrote:

> Hi Ralph,
>
> thank you very much for your reply. I have changed my rankfile.
>
> rank 0=rs0 slot=0:0-1
> rank 1=rs0 slot=1
> rank 2=rs1 slot=0
> rank 3=rs1 slot=1
>
> Now I get the following output.
>
> rs0 openmpi_1.7.x_or_newer 108 mpiexec --report-bindings \
>   --use-hwthread-cpus -np 4 -rf rf_rs0_rs1 hostname
> --
> Open MPI tried to bind a new process, but something went wrong. The
> process was killed without launching the target application. Your job
> will now abort.
>
>   Local host:        rs0
>   Application name:  /usr/local/bin/hostname
>   Error message:     hwloc indicates cpu binding cannot be enforced
>   Location:          ../../../../../openmpi-1.7.4/orte/mca/odls/default/odls_default_module.c:499
> --
> rs0 openmpi_1.7.x_or_newer 109
>
>
> Kind regards
>
> Siegmar
>
>
>>> today I tested rankfiles once more. The good news first: openmpi-1.7.4
>>> now supports my Sun M4000 server with Sparc VII processors on the
>>> command line.
>>>
>>> rs0 openmpi_1.7.x_or_newer 104 mpiexec --report-bindings -np 4 \
>>>   --bind-to hwthread hostname
>>> [rs0.informatik.hs-fulda.de:06051] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [../B./../..][../../../..]
>>> [rs0.informatik.hs-fulda.de:06051] MCW rank 2 bound to socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
>>> [rs0.informatik.hs-fulda.de:06051] MCW rank 3 bound to socket 1[core 5[hwt 0]]: [../../../..][../B./../..]
>>> [rs0.informatik.hs-fulda.de:06051] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> rs0.informatik.hs-fulda.de
>>> rs0 openmpi_1.7.x_or_newer 105
>>>
>>> Thank you very much for solving this problem. Unfortunately I still
>>> have a problem with a rankfile. Contents of my rankfile:
>>>
>>> rank 0=rs0 slot=0:0-7
>>> rank 1=rs0 slot=1
>>> rank 2=rs1 slot=0
>>> rank 3=rs1 slot=1
>>>
>>
>> Here's your problem - you told us socket 0, cores 0-7. However, if
>> you look at your topology, you only have *4* cores in socket 0.
>>
>>
>>> rs0 openmpi_1.7.x_or_newer 105 mpiexec --report-bindings \
>>>   --use-hwthread-cpus -np 4 -rf rf_rs0_rs1 hostname
>>> [rs0.informatik.hs-fulda.de:06060] [[7659,0],0] ORTE_ERROR_LOG: Not found in file .../openmpi-1.7.4/orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 283
>>> [rs0.informatik.hs-fulda.de:06060] [[7659,0],0] ORTE_ERROR_LOG: Not found in file .../openmpi-1.7.4/orte/mca/rmaps/base/rmaps_base_map_job.c at line 284
>>> rs0 openmpi_1.7.x_or_newer 106
>>>
>>>
>>> rs0 openmpi_1.7.x_or_newer 110 mpiexec --report-bindings \
>>>   --display-allocation --mca rmaps_base_verbose_100 \
>>>   --use-hwthread-cpus -np 4 -rf rf_rs0_rs1 hostname
>>>
>>> ==   ALLOCATED NODES   ==
>>> rs0: slots=2 max_slots=0 slots_inuse=0
>>> rs1: slots=2 max_slots=0 slots_inuse=0
>>> =
>>> [rs0.informatik.hs-fulda.de:06074] [[7677,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-1.7.4/orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 283
>>> [rs0.informatik.hs-fulda.de:06074] [[7677,0],0] ORTE_ERROR_LOG: Not found in file ../../../../openmpi-1.7.4/orte/mca/rmaps/base/rmaps_base_map_job.c at line 284
>>> rs0 openmpi_1.7.x_or_newer 111
>>>
>>>
>>> rs0 openmpi_1.7.x_or_newer 111 mpiexec --report-bindings --display-allocation --mca ess_base_verbose 5 --use-hwthread-cpus -np 4 -rf rf_rs0_rs1 hostname
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Querying component [env]
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Skipping component [env]. Query failed to return a module
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Querying component [hnp]
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Query of component [hnp] set priority to 100
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Querying component [singleton]
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Skipping component [singleton]. Query failed to return a module
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Querying component [tool]
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Skipping component [tool]. Query failed to return a module
>>> [rs0.informatik.hs-fulda.de:06078] mca:base:select:( ess) Selected component [hnp]
>>> [rs0.informatik.hs-fulda.de:06078] [[INVALID],INVALID] Topology Info:
>>> [rs0.informatik.hs-f
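Since the error above comes down to hwloc reporting that CPU binding cannot be enforced on that machine, one way to check this independently of Open MPI is to query hwloc's support flags directly. The following is only an illustrative sketch, not something from the thread; the file name and build line (e.g. gcc check_bind.c -lhwloc) are assumed.

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* hwloc reports, per platform, which binding operations it can perform. */
    const struct hwloc_topology_support *sup = hwloc_topology_get_support(topo);
    printf("set_thisproc_cpubind:   %d\n", (int)sup->cpubind->set_thisproc_cpubind);
    printf("set_proc_cpubind:       %d\n", (int)sup->cpubind->set_proc_cpubind);
    printf("set_thisthread_cpubind: %d\n", (int)sup->cpubind->set_thisthread_cpubind);

    hwloc_topology_destroy(topo);
    return 0;
}

If those flags come back 0 on the Solaris/SPARC machine, then the mpiexec failure above reflects hwloc (or the underlying OS interface) genuinely declining to bind there, rather than a problem with the rankfile itself.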