Re: [OMPI users] Large TCP cluster timeout issue

2011-09-20 Thread Ralph Castain
Truly am sorry about that - we were just talking today about the need to update 
and improve our FAQ on running on large clusters. Did you by any chance look at 
it? Would appreciate any thoughts on how it should be improved from a user's 
perspective.



On Sep 20, 2011, at 3:28 PM, Henderson, Brent wrote:

> Nope, but if I had, that would have saved me about an hour of coding time!
>  
> I’m still curious if it would be beneficial to inject some barriers at 
> certain locations so that if you had a slow node, not everyone would end up 
> connecting to it all at once.  Anyway, if I get access to another large TCP 
> cluster, I’ll give it a try.
>  
> Thanks,
>  
> brent
>  
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Ralph Castain
> Sent: Tuesday, September 20, 2011 4:15 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Large TCP cluster timeout issue
>  
> Hmmm... perhaps you didn't notice the mpi_preconnect_all option? It does 
> precisely what you described - it pushes zero-byte messages around a ring to 
> force all the connections open at MPI_Init.
>  
>  
> On Sep 20, 2011, at 3:06 PM, Henderson, Brent wrote:
> 
> 
> I recently had access to a 200+ node Magny Cours (24 ranks/host) 10G Linux 
> cluster.  I was able to use OpenMPI v1.5.4 with hello world, IMB and HPCC, 
> but there were a couple of issues along the way.  After setting some system 
> tunables up a little bit on all of the nodes a hello_world program worked 
> just fine – it appears that the TCP connections between most or all of the 
> ranks are deferred until they are actually used so the easy test ran 
> reasonably quickly.  I then moved to IMB. 
>  
> I typically don't care about the small rank counts, so I add the -npmin 9 
> option to just run the ‘big’ number of ranks.  This ended with an abort after 
> MPI_Init(), but before running any tests.  Lots (possibly all) of ranks 
> emitted messages that looked like:
>  
> 
> ‘[n112][[13200,1],1858][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 172.23.4.1 failed: Connection timed out (110)’
>  
> Where n112 is one of the nodes in the job, and 172.23.4.1 is the first node 
> in the job.  One of the first things that IMB does before running a test is 
> create a communicator for each specific rank count it is testing.  Apparently 
> this collective operation causes a large number of connections to be made.  
> The abort messages (one example shown above) all show the connect failure to 
> a single node, so it would appear that a very large number of nodes attempt 
> to connect to that one at the same time and overwhelmed it.  (Or it was slow 
> and everyone ganged up on it as they worked their way around the ring.  :)  Is 
> there a supported/suggested way to work around this?  It was very repeatable.
>  
> I was able to work around this by providing my own definitions of MPI_Init()
> and MPI_Init_thread() that call the 'P' (profiling) version of the routine,
> and then having each rank send its rank number to the rank one to the right,
> then two to the right, and so on around the ring.  I added an
> MPI_Barrier(MPI_COMM_WORLD) call every N messages to keep things at a
> controlled pace.  N was 64 by default, but settable via an environment
> variable in case that number didn't work well for some reason.  This fully
> connected the mesh (110k socket connections per host!) and allowed the tests
> to run.  Not a great solution, I know, but I'll throw it out there until I
> know the right way.
>  
> Once I had this in place, I used the workaround with HPCC as well.  Without 
> it, it would not get very far at all.  With it, I was able to make it through 
> the entire test.
>  
> Looking forward to getting the experts' thoughts about the best way to handle 
> big TCP clusters - thanks!
>  
> Brent
>  
> P.S.  v1.5.4 worked *much* better than v1.4.3 on this cluster - not sure why, 
> but kudos to those working on changes since then!
>  
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>  
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Large TCP cluster timeout issue

2011-09-20 Thread Henderson, Brent
Nope, but if I had, that would have saved me about an hour of coding time!

I'm still curious if it would be beneficial to inject some barriers at certain 
locations so that if you had a slow node, not everyone would end up connecting 
to it all at once.  Anyway, if I get access to another large TCP cluster, I'll 
give it a try.

Thanks,

brent

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Tuesday, September 20, 2011 4:15 PM
To: Open MPI Users
Subject: Re: [OMPI users] Large TCP cluster timeout issue

Hmmm... perhaps you didn't notice the mpi_preconnect_all option? It does 
precisely what you described - it pushes zero-byte messages around a ring to 
force all the connections open at MPI_Init.


On Sep 20, 2011, at 3:06 PM, Henderson, Brent wrote:


I recently had access to a 200+ node Magny Cours (24 ranks/host) 10G Linux 
cluster.  I was able to use OpenMPI v1.5.4 with hello world, IMB and HPCC, but 
there were a couple of issues along the way.  After setting some system 
tunables up a little bit on all of the nodes a hello_world program worked just 
fine - it appears that the TCP connections between most or all of the ranks are 
deferred until they are actually used so the easy test ran reasonably quickly.  
I then moved to IMB.

I typically don't care about the small rank counts, so I add the -npmin 9 
option to just run the 'big' number of ranks.  This ended with an abort after 
MPI_Init(), but before running any tests.  Lots (possibly all) of ranks emitted 
messages that looked like:


'[n112][[13200,1],1858][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
 connect() to 172.23.4.1 failed: Connection timed out (110)'

Where n112 is one of the nodes in the job, and 172.23.4.1 is the first node in 
the job.  One of the first things that IMB does before running a test is create 
a communicator for each specific rank count it is testing.  Apparently this 
collective operation causes a large number of connections to be made.  The 
abort messages (one example shown above) all show the connect failure to a 
single node, so it would appear that a very large number of nodes attempt to 
connect to that one at the same time and overwhelmed it.  (Or it was slow and 
everyone ganged up on it as they worked their way around the ring.  :)  Is 
there a supported/suggested way to work around this?  It was very repeatable.

I was able to work around this by providing my own definitions of MPI_Init()
and MPI_Init_thread() that call the 'P' (profiling) version of the routine, and
then having each rank send its rank number to the rank one to the right, then
two to the right, and so on around the ring.  I added an
MPI_Barrier(MPI_COMM_WORLD) call every N messages to keep things at a
controlled pace.  N was 64 by default, but settable via an environment variable
in case that number didn't work well for some reason.  This fully connected the
mesh (110k socket connections per host!) and allowed the tests to run.  Not a
great solution, I know, but I'll throw it out there until I know the right way.

Once I had this in place, I used the workaround with HPCC as well.  Without it, 
it would not get very far at all.  With it, I was able to make it through the 
entire test.

Looking forward to getting the experts' thoughts about the best way to handle 
big TCP clusters - thanks!

Brent

P.S.  v1.5.4 worked *much* better than v1.4.3 on this cluster - not sure why, 
but kudos to those working on changes since then!

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Large TCP cluster timeout issue

2011-09-20 Thread Ralph Castain
Hmmm... perhaps you didn't notice the mpi_preconnect_all option? It does 
precisely what you described - it pushes zero-byte messages around a ring to 
force all the connections open at MPI_Init.
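
For reference, setting it on the mpirun command line looks roughly like this
(the parameter name follows the wording above; the rank count, host file and
benchmark binary are only illustrative):

    mpirun --mca mpi_preconnect_all 1 -np 4800 --hostfile hostfile ./IMB-MPI1 -npmin 9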


On Sep 20, 2011, at 3:06 PM, Henderson, Brent wrote:

> I recently had access to a 200+ node Magny Cours (24 ranks/host) 10G Linux 
> cluster.  I was able to use OpenMPI v1.5.4 with hello world, IMB and HPCC, 
> but there were a couple of issues along the way.  After setting some system 
> tunables up a little bit on all of the nodes a hello_world program worked 
> just fine – it appears that the TCP connections between most or all of the 
> ranks are deferred until they are actually used so the easy test ran 
> reasonably quickly.  I then moved to IMB. 
>  
> I typically don't care about the small rank counts, so I add the -npmin 9 
> option to just run the ‘big’ number of ranks.  This ended with an abort after 
> MPI_Init(), but before running any tests.  Lots (possibly all) of ranks 
> emitted messages that looked like:
>  
> 
> ‘[n112][[13200,1],1858][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 172.23.4.1 failed: Connection timed out (110)’
>  
> Where n112 is one of the nodes in the job, and 172.23.4.1 is the first node 
> in the job.  One of the first things that IMB does before running a test is 
> create a communicator for each specific rank count it is testing.  Apparently 
> this collective operation causes a large number of connections to be made.  
> The abort messages (one example shown above) all show the connect failure to 
> a single node, so it would appear that a very large number of nodes attempt 
> to connect to that one at the same time and overwhelmed it.  (Or it was slow 
> and everyone ganged up on it as they worked their way around the ring.  :)  Is 
> there a supported/suggested way to work around this?  It was very repeatable.
>  
> I was able to work around this by providing my own definitions of MPI_Init()
> and MPI_Init_thread() that call the 'P' (profiling) version of the routine,
> and then having each rank send its rank number to the rank one to the right,
> then two to the right, and so on around the ring.  I added an
> MPI_Barrier(MPI_COMM_WORLD) call every N messages to keep things at a
> controlled pace.  N was 64 by default, but settable via an environment
> variable in case that number didn't work well for some reason.  This fully
> connected the mesh (110k socket connections per host!) and allowed the tests
> to run.  Not a great solution, I know, but I'll throw it out there until I
> know the right way.
>  
> Once I had this in place, I used the workaround with HPCC as well.  Without 
> it, it would not get very far at all.  With it, I was able to make it through 
> the entire test.
>  
> Looking forward to getting the experts' thoughts about the best way to handle 
> big TCP clusters - thanks!
>  
> Brent
>  
> P.S.  v1.5.4 worked *much* better than v1.4.3 on this cluster - not sure why, 
> but kudos to those working on changes since then!
>  
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] Large TCP cluster timeout issue

2011-09-20 Thread Henderson, Brent
I recently had access to a 200+ node Magny Cours (24 ranks/host) 10G Linux 
cluster.  I was able to use OpenMPI v1.5.4 with hello world, IMB and HPCC, but 
there were a couple of issues along the way.  After setting some system 
tunables up a little bit on all of the nodes a hello_world program worked just 
fine - it appears that the TCP connections between most or all of the ranks are 
deferred until they are actually used so the easy test ran reasonably quickly.  
I then moved to IMB.

I typically don't care about the small rank counts, so I add the -npmin 9 
option to just run the 'big' number of ranks.  This ended with an abort after 
MPI_Init(), but before running any tests.  Lots (possibly all) of ranks emitted 
messages that looked like:


'[n112][[13200,1],1858][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
 connect() to 172.23.4.1 failed: Connection timed out (110)'

Where n112 is one of the nodes in the job, and 172.23.4.1 is the first node in 
the job.  One of the first things that IMB does before running a test is create 
a communicator for each specific rank count it is testing.  Apparently this 
collective operation causes a large number of connections to be made.  The 
abort messages (one example shown above) all show the connect failure to a 
single node, so it would appear that a very large number of nodes attempt to 
connect to that one at the same time and overwhelmed it.  (Or it was slow and 
everyone ganged up on it as they worked their way around the ring.  :)  Is 
there a supported/suggested way to work around this?  It was very repeatable.

I was able to work around this by providing my own definitions of MPI_Init()
and MPI_Init_thread() that call the 'P' (profiling) version of the routine, and
then having each rank send its rank number to the rank one to the right, then
two to the right, and so on around the ring.  I added an
MPI_Barrier(MPI_COMM_WORLD) call every N messages to keep things at a
controlled pace.  N was 64 by default, but settable via an environment variable
in case that number didn't work well for some reason.  This fully connected the
mesh (110k socket connections per host!) and allowed the tests to run.  Not a
great solution, I know, but I'll throw it out there until I know the right way.
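
A rough sketch of the kind of wrapper described above (only a sketch: the
pacing environment variable name, the message payload, and the defaults shown
here are illustrative, not the exact code used):

    /* Sketch: override MPI_Init()/MPI_Init_thread() via the MPI profiling
     * interface, then walk a ring to open every connection at a paced rate. */
    #include <mpi.h>
    #include <stdlib.h>

    static void preconnect_ring(void)
    {
        int rank, size, i, pace = 64;                 /* barrier every N messages */
        const char *env = getenv("PRECONNECT_PACE");  /* hypothetical variable name */

        if (env != NULL && atoi(env) > 0)
            pace = atoi(env);

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* send my rank to the rank 1 to the right, then 2 to the right, ... */
        for (i = 1; i < size; i++) {
            int to = (rank + i) % size;
            int from = (rank - i + size) % size;
            int sendval = rank, recvval;

            MPI_Sendrecv(&sendval, 1, MPI_INT, to, 0,
                         &recvval, 1, MPI_INT, from, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            if (i % pace == 0)
                MPI_Barrier(MPI_COMM_WORLD);          /* keep everyone in step */
        }
    }

    int MPI_Init(int *argc, char ***argv)
    {
        int rc = PMPI_Init(argc, argv);               /* the real ('P') routine */
        if (rc == MPI_SUCCESS)
            preconnect_ring();
        return rc;
    }

    int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
    {
        int rc = PMPI_Init_thread(argc, argv, required, provided);
        if (rc == MPI_SUCCESS)
            preconnect_ring();
        return rc;
    }

Because the wrapper is linked in ahead of the MPI library, the application's
normal MPI_Init() call lands here first and the preconnect happens transparently.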

Once I had this in place, I used the workaround with HPCC as well.  Without it, 
it would not get very far at all.  With it, I was able to make it through the 
entire test.

Looking forward to getting the experts' thoughts about the best way to handle 
big TCP clusters - thanks!

Brent

P.S.  v1.5.4 worked *much* better than v1.4.3 on this cluster - not sure why, 
but kudos to those working on changes since then!



Re: [OMPI users] Trouble compiling 1.4.3 with PGI 10.9 compilers

2011-09-20 Thread Blosch, Edwin L
Follow-up #1:  I tried using the autogen.sh script referenced here
 https://svn.open-mpi.org/trac/ompi/changeset/22274
but that did not resolve the build problem.

Follow-up #2:  configuring with --disable-mpi-cxx does allow the compilation to 
succeed.  Perhaps that's obvious, but I had to check.



-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Blosch, Edwin L
Sent: Tuesday, September 20, 2011 12:17 PM
To: Open MPI Users
Subject: EXTERNAL: [OMPI users] Trouble compiling 1.4.3 with PGI 10.9 compilers

I'm having trouble building 1.4.3 using PGI 10.9.  I searched the list archives 
briefly but I didn't stumble across anything that looked like the same problem, 
so I thought I'd ask if an expert might recognize the nature of the problem 
here.

The configure command:

./configure --prefix=/release/openmpi-pgi --without-tm --without-sge 
--enable-mpirun-prefix-by-default --enable-contrib-no-build=vt 
--enable-mca-no-build=maffinity --disable-per-user-config-files 
--disable-io-romio --with-mpi-f90-size=small --enable-static --disable-shared 
--with-wrapper-cflags=-Msignextend --with-wrapper-cxxflags=-Msignextend 
CXX=/appserv/pgi/linux86-64/10.9/bin/pgCC 
CC=/appserv/pgi/linux86-64/10.9/bin/pgcc 'CFLAGS=  -O2 -Mcache_align -Minfo 
-Msignextend -Msignextend' 'CXXFLAGS=  -O2 -Mcache_align -Minfo -Msignextend 
-Msignextend' F77=/appserv/pgi/linux86-64/10.9/bin/pgf95 'FFLAGS=-D_GNU_SOURCE  
-O2 -Mcache_align -Minfo -Munixlogical' 
FC=/appserv/pgi/linux86-64/10.9/bin/pgf95 'FCFLAGS=-D_GNU_SOURCE  -O2 
-Mcache_align -Minfo -Munixlogical' 'LDFLAGS= -Bstatic_pgi'

The place where the build eventually dies:

/bin/sh ../../../libtool --tag=CXX   --mode=link 
/appserv/pgi/linux86-64/10.9/bin/pgCC  -DNDEBUG   -O2 -Mcache_align -Minfo 
-Msignextend -Msignextend  -version-info 0:1:0 -export-dynamic  -Bstatic_pgi  
-o libmpi_cxx.la -rpath /release/cfd/openmpi-pgi/lib mpicxx.lo intercepts.lo 
comm.lo datatype.lo win.lo file.lo ../../../ompi/libmpi.la -lnsl -lutil  
-lpthread
libtool: link: tpldir=Template.dir
libtool: link:  rm -rf Template.dir
libtool: link:  /appserv/pgi/linux86-64/10.9/bin/pgCC --prelink_objects 
--instantiation_dir Template.dir   mpicxx.o intercepts.o comm.o datatype.o 
win.o file.o
pgCC-Warning-prelink_objects switch is deprecated
pgCC-Warning-instantiation_dir switch is deprecated
/usr/lib64/crt1.o: In function `_start':
/usr/src/packages/BUILD/glibc-2.9/csu/../sysdeps/x86_64/elf/start.S:109: 
undefined reference to `main'
mpicxx.o: In function `__sti___9_mpicxx_cc_a6befbec':
(.text+0x49): undefined reference to `ompi_mpi_errors_are_fatal'
mpicxx.o: In function `__sti___9_mpicxx_cc_a6befbec':
(.text+0x62): undefined reference to `ompi_mpi_errors_return'
mpicxx.o: In function `__sti___9_mpicxx_cc_a6befbec':
(.text+0x7b): undefined reference to `ompi_mpi_errors_throw_exceptions'
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


[OMPI users] Trouble compiling 1.4.3 with PGI 10.9 compilers

2011-09-20 Thread Blosch, Edwin L
I'm having trouble building 1.4.3 using PGI 10.9.  I searched the list archives 
briefly but I didn't stumble across anything that looked like the same problem, 
so I thought I'd ask if an expert might recognize the nature of the problem 
here.

The configure command:

./configure --prefix=/release/openmpi-pgi --without-tm --without-sge 
--enable-mpirun-prefix-by-default --enable-contrib-no-build=vt 
--enable-mca-no-build=maffinity --disable-per-user-config-files 
--disable-io-romio --with-mpi-f90-size=small --enable-static --disable-shared 
--with-wrapper-cflags=-Msignextend --with-wrapper-cxxflags=-Msignextend 
CXX=/appserv/pgi/linux86-64/10.9/bin/pgCC 
CC=/appserv/pgi/linux86-64/10.9/bin/pgcc 'CFLAGS=  -O2 -Mcache_align -Minfo 
-Msignextend -Msignextend' 'CXXFLAGS=  -O2 -Mcache_align -Minfo -Msignextend 
-Msignextend' F77=/appserv/pgi/linux86-64/10.9/bin/pgf95 'FFLAGS=-D_GNU_SOURCE  
-O2 -Mcache_align -Minfo -Munixlogical' 
FC=/appserv/pgi/linux86-64/10.9/bin/pgf95 'FCFLAGS=-D_GNU_SOURCE  -O2 
-Mcache_align -Minfo -Munixlogical' 'LDFLAGS= -Bstatic_pgi'

The place where the build eventually dies:

/bin/sh ../../../libtool --tag=CXX   --mode=link 
/appserv/pgi/linux86-64/10.9/bin/pgCC  -DNDEBUG   -O2 -Mcache_align -Minfo 
-Msignextend -Msignextend  -version-info 0:1:0 -export-dynamic  -Bstatic_pgi  
-o libmpi_cxx.la -rpath /release/cfd/openmpi-pgi/lib mpicxx.lo intercepts.lo 
comm.lo datatype.lo win.lo file.lo ../../../ompi/libmpi.la -lnsl -lutil  
-lpthread
libtool: link: tpldir=Template.dir
libtool: link:  rm -rf Template.dir
libtool: link:  /appserv/pgi/linux86-64/10.9/bin/pgCC --prelink_objects 
--instantiation_dir Template.dir   mpicxx.o intercepts.o comm.o datatype.o 
win.o file.o
pgCC-Warning-prelink_objects switch is deprecated
pgCC-Warning-instantiation_dir switch is deprecated
/usr/lib64/crt1.o: In function `_start':
/usr/src/packages/BUILD/glibc-2.9/csu/../sysdeps/x86_64/elf/start.S:109: 
undefined reference to `main'
mpicxx.o: In function `__sti___9_mpicxx_cc_a6befbec':
(.text+0x49): undefined reference to `ompi_mpi_errors_are_fatal'
mpicxx.o: In function `__sti___9_mpicxx_cc_a6befbec':
(.text+0x62): undefined reference to `ompi_mpi_errors_return'
mpicxx.o: In function `__sti___9_mpicxx_cc_a6befbec':
(.text+0x7b): undefined reference to `ompi_mpi_errors_throw_exceptions'


Re: [OMPI users] EXTERNAL: Re: How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Blosch, Edwin L
Here is a diff -y output of the compilation of one of the program's files.  The 
one on the left is OpenMPI mpif90, the one on the right is MVAPICH mpif90.

Does that suggest perhaps I should try adding -fPIC to the OpenMPI-linked 
compilation?


(flags identical in both compiles are shown once; '>' marks MVAPICH-only,
'<' marks OpenMPI-only, '|' marks lines that differ)

/appserv/intel/Compiler/11.1/072/bin/intel64/fortcom
-D__INTEL_COMPILER=1110
-D_MT
-D__ELF__
-D__INTEL_COMPILER_BUILD_DATE=20100414
                                              >  -D__PIC__
                                              >  -D__pic__
-D__unix__
-D__unix
-D__linux__
-D__linux
-D__gnu_linux__
-Dunix
-Dlinux
-D__x86_64
-D__x86_64__
-mGLOB_pack_sort_init_list
-I../../../code/src/main
-I.
-I.
-I/usr/mpi/intel/openmpi-1.4.3/include        |  -I/usr/mpi/intel/mvapich-1.2.0/include
-I/usr/mpi/intel/openmpi-1.4.3/include        |  -I/usr/mpi/intel/mvapich-1.2.0/include/f90base
-I/usr/mpi/intel/openmpi-1.4.3/lib64          <
-I/appserv/intel/Compiler/11.1/072/include/intel64
-I/appserv/intel/Compiler/11.1/072/include/intel64
-I/appserv/intel/Compiler/11.1/072/include
-I/usr/local/include
-I/usr/include
-I/usr/lib64/gcc/x86_64-suse-linux/4.3/include
"-align all"
"-align records"
                                              >  -D__INTEL_COMPILER
-D_GNU_SOURCE
-fpconstant
-O2
"-reentrancy threaded"                        <
-traceback
-mP1OPT_version=11.1-intel64
-mGLOB_diag_file=main.diag
-mGLOB_source_language=GLOB_SOURCE_LANGUAGE_F90
-mGLOB_tune_for_fort
-mGLOB_use_fort_dope_vector

Re: [OMPI users] EXTERNAL: Re: How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Blosch, Edwin L
Thank you for this explanation.  I will assume that my problem here is some 
kind of memory corruption.


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Tim Prince
Sent: Tuesday, September 20, 2011 10:36 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How could OpenMPI (or MVAPICH) affect 
floating-point results?

On 9/20/2011 10:50 AM, Blosch, Edwin L wrote:

> It appears to be a side effect of linkage that is able to change a 
> compute-only routine's answers.
>
> I have assumed that max/sqrt/tiny/abs might be replaced, but some other kind 
> of corruption may be going on.
>

Those intrinsics have direct instruction set translations which 
shouldn't vary from -O1 on up nor with linkage options nor be affected 
by MPI or insertion of WRITEs.

-- 
Tim Prince
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] EXTERNAL: Re: How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Tim Prince

On 9/20/2011 10:50 AM, Blosch, Edwin L wrote:


It appears to be a side effect of linkage that is able to change a compute-only 
routine's answers.

I have assumed that max/sqrt/tiny/abs might be replaced, but some other kind of 
corruption may be going on.



Those intrinsics have direct instruction set translations which 
shouldn't vary from -O1 on up nor with linkage options nor be affected 
by MPI or insertion of WRITEs.


--
Tim Prince


Re: [OMPI users] EXTERNAL: Re: How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Reuti
On 20.09.2011 at 16:50, Blosch, Edwin L wrote:

> Thank you all for the replies.
> 
> Certainly optimization flags can be useful to address differences between 
> compilers, etc. And differences in MPI_ALLREDUCE are appreciated as possible. 
>  But I don't think either is quite relevant because:
> 
> - It was exact same compiler, with identical compilation flags.  So whatever 
> optimizations are applied, we should have the same instructions; 

I'm not sure about this. When you compile a program with mpicc, mpif77, ... you 
automatically include the header files of the MPI version in question. Hence 
you get a different set of variables being stored (even though you are not 
accessing them directly), as the internal representation is unique to each MPI 
implementation. If you compare the mpi.h files of the two, they look far from 
similar. As a result, different operations might be used to transfer your 
application data into the internal representation inside the MPI implementation.


> - I'm looking at inputs and outputs to a compute-only routine - there are no 
> MPI calls within the routine

So this is a serial part in your application?

You can compile with the option -S to get the assembler output.
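
For example, something along these lines (the file name is just a placeholder),
and then diff the generated .s files from the two builds:

    mpif90 -S -O2 compute_routine.f90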

-- Reuti


> Again, most numbers going into the routine were checked, and there were no 
> differences in the numbers out to 18 digits (i.e. beyond the precision of the 
> FP representation).  Yet, coming out of the routine, results differ.  I am 
> quite sure that no MPI routines were actually involved in calculations, and 
> that the compiler options given, were also the same.
> 
> It appears to be a side effect of linkage that is able to change a 
> compute-only routine's answers.
> 
> I have assumed that max/sqrt/tiny/abs might be replaced, but some other kind 
> of corruption may be going on.
> 
> I also could be mistaken about the inputs to the routine, i.e. they are not 
> truly identical as I have presumed and (partially) checked.
> 
> It is interesting that the whole of the calculation runs fine with MVAPICH 
> and blows up with OpenMPI.
> 
> Another diagnostic step I am taking: see if observation can be repeated with 
> a newer version of OpenMPI (currently using 1.4.3)
> 
> Ed
> 
>   
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Reuti
> Sent: Tuesday, September 20, 2011 7:25 AM
> To: tpri...@computer.org; Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] How could OpenMPI (or MVAPICH) affect 
> floating-point results?
> 
> On 20.09.2011 at 13:52, Tim Prince wrote:
> 
>> On 9/20/2011 7:25 AM, Reuti wrote:
>>> Hi,
>>> 
>>> On 20.09.2011 at 00:41, Blosch, Edwin L wrote:
>>> 
 I am observing differences in floating-point results from an application 
 program that appear to be related to whether I link with OpenMPI 1.4.3 or 
 MVAPICH 1.2.0.  Both packages were built with the same installation of 
 Intel 11.1, as well as the application program; identical flags passed to 
 the compiler in each case.
 
 I've tracked down some differences in a compute-only routine where I've 
 printed out the inputs to the routine (to 18 digits) ; the inputs are 
 identical.  The output numbers are different in the 16th place (perhaps a 
 few in the 15th place).  These differences only show up for optimized 
 code, not for -O0.
 
 My assumption is that some optimized math intrinsic is being replaced 
 dynamically, but I do not know how to confirm this.  Anyone have guidance 
 to offer? Or similar experience?
>>> 
>>> yes, I face it often but always at a magnitude where it's not of any 
>>> concern (and not related to any MPI). Due to the limited precision in 
>>> computers, a simple reordering of operation (although being equivalent in a 
>>> mathematical sense) can lead to different results. Removing the anomalies 
>>> with -O0 could proof that.
>>> 
>>> The other point I heard especially for the x86 instruction set is, that the 
>>> internal FPU has still 80 bits, while the presentation in memory is only 64 
>>> bit. Hence when all can be done in the registers, the result can be 
>>> different compared to the case when some interim results need to be stored 
>>> to RAM. For the Portland compiler there is a switch -Kieee -pc64 to force 
>>> it to stay always in 64 bit, and a similar one for Intel is -mp (now 
>>> -fltconsistency) and -mp1.
>>> 
>> Diagnostics below indicate that ifort 11.1 64-bit is in use.  The options 
>> aren't the same as Reuti's "now" version (a 32-bit compiler which hasn't 
>> been supported for 3 years or more?).
> 
> In the 11.1 documentation they are also still listed:
> 
> http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/index.htm
> 
> I read it in the way, that -mp is deprecated syntax (therefore listed under 
> "Alternate Options"), but -fltconsistency is still a valid and supported 
> option.
> 
-- Reuti

Re: [OMPI users] MPI hangs on multiple nodes

2011-09-20 Thread Gus Correa

Ole Nielsen wrote:
Thanks for your suggestion Gus, we need a way of debugging what is going 
on. I am pretty sure the problem lies with our cluster configuration. I 
know MPI simply relies on the underlying network. However, we can ping 
and ssh to all nodes (and between any pair as well), so it is 
currently a mystery why MPI doesn't communicate across nodes on our cluster.

Two further questions for the group

   1. I would love to run the test program connectivity.c, but cannot
  find it anywhere. Can anyone help please?


If you downloaded the OpenMPI tarball, it is in examples/connectivity.c
wherever you untarred it [not where you installed it].
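
For example (the rank count and host file are just illustrative):

   cd <where-you-untarred-openmpi>/examples
   mpicc connectivity.c -o connectivity
   mpirun -np 16 --hostfile hostfile ./connectivity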



   2. After having left the job hanging over night we got the message
  
[node5][[9454,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
  mca_btl_tcp_frag_recv: readv failed: Connection timed out (110).
  Does anyone know what this means?


Cheers and thanks
Ole
PS - I don't see how separate buffers would help. Recall that the test 
program I use works fine on other installations and indeed when run on 
the cores of one node.




It probably won't help, as Eugene explained.
Your program works here, worked also for Davendra Rai.
If you were using MPI_Isend [non-blocking],
then you would need separate buffers.

For large amounts of data and many processes,
I would rather use non-blocking communication [and separate
buffers], especially if you do work between send and recv.
But that's not what hangs your program.

Gus Correa






Message: 11
Date: Mon, 19 Sep 2011 10:37:02 -0400
From: Gus Correa <g...@ldeo.columbia.edu>
Subject: Re: [OMPI users] RE: MPI hangs on multiple nodes
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <4e77538e.3070...@ldeo.columbia.edu>

Content-Type: text/plain; charset=iso-8859-1; format=flowed

Hi Ole

You could try the examples/connectivity.c program in the
OpenMPI source tree, to test if everything is alright.
It also hints how to solve the buffer re-use issue
that Sebastien [rightfully] pointed out [i.e., declare separate
buffers for MPI_Send and MPI_Recv].

Gus Correa




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] EXTERNAL: Re: How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Eugene Loh
I've not been following closely.  How do you know you're using the 
identical compilation flags?  Are you saying you specify the same flags 
to "mpicc" (or whatever) or are you confirming that the back-end 
compiler is seeing the same flags?  The MPI compiler wrapper (mpicc, et 
al.) can add flags.  E.g., as I remember it, "mpicc" with no flags means 
no optimization with OMPI but with optimization for MVAPICH.


On 9/20/2011 7:50 AM, Blosch, Edwin L wrote:

- It was exact same compiler, with identical compilation flags.


Re: [OMPI users] EXTERNAL: Re: How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Blosch, Edwin L
Thank you all for the replies.

Certainly optimization flags can be useful to address differences between 
compilers, etc. And differences in MPI_ALLREDUCE are appreciated as possible.  
But I don't think either is quite relevant because:

- It was exact same compiler, with identical compilation flags.  So whatever 
optimizations are applied, we should have the same instructions; 
- I'm looking at inputs and outputs to a compute-only routine - there are no 
MPI calls within the routine

Again, most numbers going into the routine were checked, and there were no 
differences in the numbers out to 18 digits (i.e. beyond the precision of the 
FP representation).  Yet, coming out of the routine, results differ.  I am 
quite sure that no MPI routines were actually involved in calculations, and 
that the compiler options given, were also the same.

It appears to be a side effect of linkage that is able to change a compute-only 
routine's answers.

I have assumed that max/sqrt/tiny/abs might be replaced, but some other kind of 
corruption may be going on.

I also could be mistaken about the inputs to the routine, i.e. they are not 
truly identical as I have presumed and (partially) checked.

It is interesting that the whole of the calculation runs fine with MVAPICH and 
blows up with OpenMPI.

Another diagnostic step I am taking: see if observation can be repeated with a 
newer version of OpenMPI (currently using 1.4.3)

Ed


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Reuti
Sent: Tuesday, September 20, 2011 7:25 AM
To: tpri...@computer.org; Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] How could OpenMPI (or MVAPICH) affect 
floating-point results?

On 20.09.2011 at 13:52, Tim Prince wrote:

> On 9/20/2011 7:25 AM, Reuti wrote:
>> Hi,
>> 
>> On 20.09.2011 at 00:41, Blosch, Edwin L wrote:
>> 
>>> I am observing differences in floating-point results from an application 
>>> program that appear to be related to whether I link with OpenMPI 1.4.3 or 
>>> MVAPICH 1.2.0.  Both packages were built with the same installation of 
>>> Intel 11.1, as well as the application program; identical flags passed to 
>>> the compiler in each case.
>>> 
>>> I've tracked down some differences in a compute-only routine where I've 
>>> printed out the inputs to the routine (to 18 digits) ; the inputs are 
>>> identical.  The output numbers are different in the 16th place (perhaps a 
>>> few in the 15th place).  These differences only show up for optimized code, 
>>> not for -O0.
>>> 
>>> My assumption is that some optimized math intrinsic is being replaced 
>>> dynamically, but I do not know how to confirm this.  Anyone have guidance 
>>> to offer? Or similar experience?
>> 
>> yes, I face it often but always at a magnitude where it's not of any concern 
>> (and not related to any MPI). Due to the limited precision in computers, a 
>> simple reordering of operation (although being equivalent in a mathematical 
>> sense) can lead to different results. Removing the anomalies with -O0 could 
>> proof that.
>> 
>> The other point I heard especially for the x86 instruction set is, that the 
>> internal FPU has still 80 bits, while the presentation in memory is only 64 
>> bit. Hence when all can be done in the registers, the result can be 
>> different compared to the case when some interim results need to be stored 
>> to RAM. For the Portland compiler there is a switch -Kieee -pc64 to force it 
>> to stay always in 64 bit, and a similar one for Intel is -mp (now 
>> -fltconsistency) and -mp1.
>> 
> Diagnostics below indicate that ifort 11.1 64-bit is in use.  The options 
> aren't the same as Reuti's "now" version (a 32-bit compiler which hasn't been 
> supported for 3 years or more?).

In the 11.1 documentation they are also still listed:

http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/index.htm

I read it in the way, that -mp is deprecated syntax (therefore listed under 
"Alternate Options"), but -fltconsistency is still a valid and supported option.

-- Reuti


> With ifort 10.1 and more recent, you would set at least
> -assume protect_parens -prec-div -prec-sqrt
> if you are interested in numerical consistency.  If you don't want 
> auto-vectorization of sum reductions, you would use instead
> -fp-model source -ftz
> (ftz sets underflow mode back to abrupt, while "source" sets gradual).
> It may be possible to expose 80-bit x87 by setting the ancient -mp option, 
> but such a course can't be recommended without additional cautions.
> 
> Quoted comment from OP seem to show a somewhat different question: Does 
> OpenMPI implement any operations in a different way from MVAPICH?  I would 
> think it probable that the answer could be affirmative for operations such as 
> allreduce, but this leads well outside my expertise with respect to specific 
> MPI implementations.  It isn't out of the question to suspect that such 
> differences might be aggravated when using excessively aggressive ifort 
> options such as -fast.

Re: [OMPI users] How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Samuel K. Gutierrez
Hi,

Maybe you can leverage some of the techniques outlined in:

Robert W. Robey, Jonathan M. Robey, and Rob Aulwes. 2011. In search of 
numerical consistency in parallel programming. Parallel Comput. 37, 4-5 (April 
2011), 217-229. DOI=10.1016/j.parco.2011.02.009 
http://dx.doi.org/10.1016/j.parco.2011.02.009

Hope that helps,

Samuel K. Gutierrez
Los Alamos National Laboratory

On Sep 20, 2011, at 6:25 AM, Reuti wrote:

> Am 20.09.2011 um 13:52 schrieb Tim Prince:
> 
>> On 9/20/2011 7:25 AM, Reuti wrote:
>>> Hi,
>>> 
>>> On 20.09.2011 at 00:41, Blosch, Edwin L wrote:
>>> 
 I am observing differences in floating-point results from an application 
 program that appear to be related to whether I link with OpenMPI 1.4.3 or 
 MVAPICH 1.2.0.  Both packages were built with the same installation of 
 Intel 11.1, as well as the application program; identical flags passed to 
 the compiler in each case.
 
 I’ve tracked down some differences in a compute-only routine where I’ve 
 printed out the inputs to the routine (to 18 digits) ; the inputs are 
 identical.  The output numbers are different in the 16th place (perhaps a 
 few in the 15th place).  These differences only show up for optimized 
 code, not for -O0.
 
 My assumption is that some optimized math intrinsic is being replaced 
 dynamically, but I do not know how to confirm this.  Anyone have guidance 
 to offer? Or similar experience?
>>> 
>>> yes, I face it often but always at a magnitude where it's not of any 
>>> concern (and not related to any MPI). Due to the limited precision in 
>>> computers, a simple reordering of operation (although being equivalent in a 
>>> mathematical sense) can lead to different results. Removing the anomalies 
>>> with -O0 could proof that.
>>> 
>>> The other point I heard especially for the x86 instruction set is, that the 
>>> internal FPU has still 80 bits, while the presentation in memory is only 64 
>>> bit. Hence when all can be done in the registers, the result can be 
>>> different compared to the case when some interim results need to be stored 
>>> to RAM. For the Portland compiler there is a switch -Kieee -pc64 to force 
>>> it to stay always in 64 bit, and a similar one for Intel is -mp (now 
>>> -fltconsistency) and -mp1.
>>> 
>> Diagnostics below indicate that ifort 11.1 64-bit is in use.  The options 
>> aren't the same as Reuti's "now" version (a 32-bit compiler which hasn't 
>> been supported for 3 years or more?).
> 
> In the 11.1 documentation they are also still listed:
> 
> http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/index.htm
> 
> I read it in the way, that -mp is deprecated syntax (therefore listed under 
> "Alternate Options"), but -fltconsistency is still a valid and supported 
> option.
> 
> -- Reuti
> 
> 
>> With ifort 10.1 and more recent, you would set at least
>> -assume protect_parens -prec-div -prec-sqrt
>> if you are interested in numerical consistency.  If you don't want 
>> auto-vectorization of sum reductions, you would use instead
>> -fp-model source -ftz
>> (ftz sets underflow mode back to abrupt, while "source" sets gradual).
>> It may be possible to expose 80-bit x87 by setting the ancient -mp option, 
>> but such a course can't be recommended without additional cautions.
>> 
>> Quoted comment from OP seem to show a somewhat different question: Does 
>> OpenMPI implement any operations in a different way from MVAPICH?  I would 
>> think it probable that the answer could be affirmative for operations such 
>> as allreduce, but this leads well outside my expertise with respect to 
>> specific MPI implementations.  It isn't out of the question to suspect that 
>> such differences might be aggravated when using excessively aggressive ifort 
>> options such as -fast.
>> 
>> 
libifport.so.5 =>  
 /opt/intel/Compiler/11.1/072/lib/intel64/libifport.so.5 
 (0x2b6e7e081000)
libifcoremt.so.5 =>  
 /opt/intel/Compiler/11.1/072/lib/intel64/libifcoremt.so.5 
 (0x2b6e7e1ba000)
libimf.so =>  /opt/intel/Compiler/11.1/072/lib/intel64/libimf.so 
 (0x2b6e7e45f000)
libsvml.so =>  /opt/intel/Compiler/11.1/072/lib/intel64/libsvml.so 
 (0x2b6e7e7f4000)
libintlc.so.5 =>  
 /opt/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x2b6e7ea0a000)
 
>> 
>> -- 
>> Tim Prince
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users






Re: [OMPI users] Open MPI and Objective C

2011-09-20 Thread Barrett, Brian W
The problem you're running into is not due to Open MPI.  The Objective C and C 
compilers on OS X (and most platforms) are the same binary, so you should be 
able to use mpicc without any problems.  It will see the .m extension and 
switch to Objective C mode.  However, NSLog is in the Foundation framework, so 
you must add the compiler option

  -framework Foundation

to the compiler flags (both when compiling and linking).  If you ripped out all 
the MPI and used gcc directly to compile your example code, you'd run into the 
same linker error without the -framework option.
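
In other words, something like this should do it (one step shown; if you
compile and link separately, the flag goes on the link line too):

    mpicc main.m -framework Foundation -o test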

Hope this helps,

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Jeff 
Squyres [jsquy...@cisco.com]
Sent: Monday, September 19, 2011 6:46 AM
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI and Objective C

+1

You'll probably have to run "mpicc --showme" to see all the flags that OMPI is 
passing to the underlying compiler, and use those (or equivalents) to the ObjC 
compiler.


On Sep 19, 2011, at 8:34 AM, Ralph Castain wrote:

> Nothing to do with us - you call a function "NSLog" that Objective C doesn't 
> recognize. That isn't an MPI function.
>
> On Sep 18, 2011, at 8:20 PM, Scott Wilcox wrote:
>
>> I have been asked to convert some C++ code using Open MPI to Objective C and 
>> I am having problems getting a simple Obj C program to compile.  I have 
>> searched through the FAQs and have not found anything specific.  Is it an 
>> incorrect assumption that the C interfaces work with Obj C, or am I missing 
>> something?
>>
>> Thanks in advance for your help!
>> Scott
>>
>>
>> open MPI version: 1.4.3
>> OSX 10.5.1
>>
>> file: main.m
>>
>> #import <Foundation/Foundation.h>
>> #import "mpi.h"
>>
>> int main (int argc, char** argv)
>>
>> {
>>//***
>>// Variable Declaration
>>//***
>>int theRank;
>>int theSize;
>>
>>//***
>>// Initializing Message Passing Interface
>>//***
>>MPI_Init(&argc,&argv);
>>MPI_Comm_size(MPI_COMM_WORLD,&theSize);
>>MPI_Comm_rank(MPI_COMM_WORLD,&theRank);
>>//*** end
>>
>>NSLog(@"Executing open MPI Objective C");
>>
>> }
>>
>> Compile:
>>
>> [87]UNC ONLY: SAW>mpicc main.m -o test
>> Undefined symbols:
>>   "___CFConstantStringClassReference", referenced from:
>>   cfstring=Executing open MPI Objective C in ccj1AlL9.o
>>   "_NSLog", referenced from:
>>   _main in ccj1AlL9.o
>> ld: symbol(s) not found
>> collect2: ld returned 1 exit status
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI hangs on multiple nodes

2011-09-20 Thread Rolf vandeVaart

>> 1: After a reboot of two nodes I ran again, and the inter-node freeze didn't
>happen until the third iteration. I take that to mean that the basic
>communication works, but that something is saturating. Is there some notion
>of buffer size somewhere in the MPI system that could explain this?
>
>Hmm.  This is not a good sign; it somewhat indicates a problem with your OS.
>Based on this email and your prior emails, I'm guessing you're using TCP for
>communication, and that the problem is based on inter-node communication
>(e.g., the problem would occur even if you only run 1 process per machine,
>but does not occur if you run all N processes on a single machine, per your #4,
>below).
>

I agree with Jeff here.  Open MPI uses lazy connections to establish 
connections and round robins through the interfaces.
So, the first few communications could work as they are using interfaces that 
could communicate between the nodes, but the third iteration uses an interface 
that for some reason cannot establish the connection.

One flag you can use that may help is --mca btl_base_verbose 20, like this;

mpirun --mca btl_base_verbose 20 connectivity_c

It will dump out a bunch of stuff, but there will be a few lines that look like 
this:

[...snip...]
[dt:09880] btl: tcp: attempting to connect() to [[58627,1],1] address 
10.20.14.101 on port 1025
[...snip...]

Rolf





Re: [OMPI users] How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Reuti
On 20.09.2011 at 13:52, Tim Prince wrote:

> On 9/20/2011 7:25 AM, Reuti wrote:
>> Hi,
>> 
>> On 20.09.2011 at 00:41, Blosch, Edwin L wrote:
>> 
>>> I am observing differences in floating-point results from an application 
>>> program that appear to be related to whether I link with OpenMPI 1.4.3 or 
>>> MVAPICH 1.2.0.  Both packages were built with the same installation of 
>>> Intel 11.1, as well as the application program; identical flags passed to 
>>> the compiler in each case.
>>> 
>>> I’ve tracked down some differences in a compute-only routine where I’ve 
>>> printed out the inputs to the routine (to 18 digits) ; the inputs are 
>>> identical.  The output numbers are different in the 16th place (perhaps a 
>>> few in the 15th place).  These differences only show up for optimized code, 
>>> not for -O0.
>>> 
>>> My assumption is that some optimized math intrinsic is being replaced 
>>> dynamically, but I do not know how to confirm this.  Anyone have guidance 
>>> to offer? Or similar experience?
>> 
>> yes, I face it often but always at a magnitude where it's not of any concern 
>> (and not related to any MPI). Due to the limited precision in computers, a 
>> simple reordering of operation (although being equivalent in a mathematical 
>> sense) can lead to different results. Removing the anomalies with -O0 could 
>> proof that.
>> 
>> The other point I heard especially for the x86 instruction set is, that the 
>> internal FPU has still 80 bits, while the presentation in memory is only 64 
>> bit. Hence when all can be done in the registers, the result can be 
>> different compared to the case when some interim results need to be stored 
>> to RAM. For the Portland compiler there is a switch -Kieee -pc64 to force it 
>> to stay always in 64 bit, and a similar one for Intel is -mp (now 
>> -fltconsistency) and -mp1.
>> 
> Diagnostics below indicate that ifort 11.1 64-bit is in use.  The options 
> aren't the same as Reuti's "now" version (a 32-bit compiler which hasn't been 
> supported for 3 years or more?).

In the 11.1 documentation they are also still listed:

http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/index.htm

I read it in the way, that -mp is deprecated syntax (therefore listed under 
"Alternate Options"), but -fltconsistency is still a valid and supported option.

-- Reuti


> With ifort 10.1 and more recent, you would set at least
> -assume protect_parens -prec-div -prec-sqrt
> if you are interested in numerical consistency.  If you don't want 
> auto-vectorization of sum reductions, you would use instead
> -fp-model source -ftz
> (ftz sets underflow mode back to abrupt, while "source" sets gradual).
> It may be possible to expose 80-bit x87 by setting the ancient -mp option, 
> but such a course can't be recommended without additional cautions.
> 
> Quoted comment from OP seem to show a somewhat different question: Does 
> OpenMPI implement any operations in a different way from MVAPICH?  I would 
> think it probable that the answer could be affirmative for operations such as 
> allreduce, but this leads well outside my expertise with respect to specific 
> MPI implementations.  It isn't out of the question to suspect that such 
> differences might be aggravated when using excessively aggressive ifort 
> options such as -fast.
> 
> 
>>> libifport.so.5 =>  
>>> /opt/intel/Compiler/11.1/072/lib/intel64/libifport.so.5 (0x2b6e7e081000)
>>> libifcoremt.so.5 =>  
>>> /opt/intel/Compiler/11.1/072/lib/intel64/libifcoremt.so.5 
>>> (0x2b6e7e1ba000)
>>> libimf.so =>  /opt/intel/Compiler/11.1/072/lib/intel64/libimf.so 
>>> (0x2b6e7e45f000)
>>> libsvml.so =>  /opt/intel/Compiler/11.1/072/lib/intel64/libsvml.so 
>>> (0x2b6e7e7f4000)
>>> libintlc.so.5 =>  
>>> /opt/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x2b6e7ea0a000)
>>> 
> 
> -- 
> Tim Prince
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 




Re: [OMPI users] Latency of 250 microseconds with Open-MPI 1.4.3, Mellanox Infiniband and 256 MPI ranks

2011-09-20 Thread Yevgeny Kliteynik
Hi Sébastien,

If I understand you correctly, you are running your application on two
different MPIs on two different clusters with two different IB vendors.

Could you make a comparison more "apples to apples"-ish?
For instance:
 - run the same version of Open MPI on both clusters
 - run the same version of MVAPICH on both clusters


-- YK

On 18-Sep-11 1:59 AM, Sébastien Boisvert wrote:
> Hello,
> 
> Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250 
> microseconds with 256 MPI ranks on super-computer A (name is colosse).
> 
> The same software gives a latency of 10 microseconds with MVAPICH2 and QLogic 
> Infiniband hardware with 512 MPI ranks on super-computer B (name is 
> guillimin).
> 
> 
> Here are the relevant information listed in 
> http://www.open-mpi.org/community/help/
> 
> 
> 1. Check the FAQ first.
> 
> done !
> 
> 
> 2. The version of Open MPI that you're using.
> 
> Open-MPI 1.4.3
> 
> 
> 3. The config.log file from the top-level Open MPI directory, if available 
> (please compress!).
> 
> See below.
> 
> Command file: http://pastebin.com/mW32ntSJ
> 
> 
> 4. The output of the "ompi_info --all" command from the node where you're 
> invoking mpirun.
> 
> ompi_info -a on colosse: http://pastebin.com/RPyY9s24
> 
> 
> 5. If running on more than one node -- especially if you're having problems 
> launching Open MPI processes -- also include the output of the "ompi_info -v 
> ompi full --parsable" command from each node on which you're trying to run.
> 
> I am not having problems launching Open-MPI processes.
> 
> 
> 6. A detailed description of what is failing.
> 
> Open-MPI 1.4.3 on Mellanox Infiniband hardware give a latency of 250 
> microseconds with 256 MPI ranks on super-computer A (name is colosse).
> 
> The same software gives a latency of 10 microseconds with MVAPICH2 and QLogic 
> Infiniband hardware on 512 MPI ranks on super-computer B (name is guillimin).
> 
> Details follow.
> 
> 
> I am developing a distributed genome assembler that runs with the 
> message-passing interface (I am a PhD student).
> It is called Ray. Link: http://github.com/sebhtml/ray
> 
> I recently added the option -test-network-only so that Ray can be used to 
> test the latency. Each MPI rank has to send 10 messages (4000 bytes 
> each), one by one.
> The destination of any message is picked up at random.
> 
> 
> On colosse, a super-computer located at Laval University, I get an average 
> latency of 250 microseconds with the test done in Ray.
> 
> See http://pastebin.com/9nyjSy5z
> 
> On colosse, the hardware is Mellanox Infiniband QDR ConnectX and the MPI 
> middleware is Open-MPI 1.4.3 compiled with gcc 4.4.2.
> 
> colosse has 8 compute cores per node (Intel Nehalem).
> 
> 
> Testing the latency with ibv_rc_pingpong on colosse gives 11 microseconds.
> 
>local address:  LID 0x048e, QPN 0x1c005c, PSN 0xf7c66b
>remote address: LID 0x018c, QPN 0x2c005c, PSN 0x5428e6
> 8192000 bytes in 0.01 seconds = 5776.64 Mbit/sec
> 1000 iters in 0.01 seconds = 11.35 usec/iter
> 
> So I know that the Infiniband has a correct latency between two HCAs because 
> of the output of ibv_rc_pingpong.
> 
> 
> 
> Adding the parameter --mca btl_openib_verbose 1 to mpirun shows that Open-MPI 
> detects the hardware correctly:
> 
> [r107-n57][[59764,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] 
> Querying INI files for vendor 0x02c9, part ID 26428
> [r107-n57][[59764,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found 
> corresponding INI values: Mellanox Hermon
> 
> see http://pastebin.com/pz03f0B3
> 
> 
> So I don't think this is the problem described in the FAQ ( 
> http://www.open-mpi.org/faq/?category=openfabrics#mellanox-connectx-poor-latency
>  )
> and on the mailing list ( 
> http://www.open-mpi.org/community/lists/users/2007/10/4238.php ) because the 
> INI values are found.
> 
> 
> 
> 
> Running the network test implemented in Ray on 32 MPI ranks, I get an average 
> latency of 65 microseconds.
> 
> See http://pastebin.com/nWDmGhvM
> 
> 
> Thus, with 256 MPI ranks I get an average latency of 250 microseconds and 
> with 32 MPI ranks I get 65 microseconds.
> 
> 
> Running the network test on 32 MPI ranks again but only allowing the MPI rank 
> 0 to send messages gives a latency of 10 microseconds for this rank.
> See http://pastebin.com/dWMXsHpa
> 
> 
> 
> Because I get 10 microseconds in the network test in Ray when only MPI 
> rank 0 sends messages, I would say that there may be some I/O contention.
> 
> To test this hypothesis, I re-ran the test, but allowed only 1 MPI rank per 
> node to send messages (there are 8 MPI ranks per node and a total of 32 MPI 
> ranks).
> Ranks 0, 8, 16 and 24 all reported 13 microseconds. See 
> http://pastebin.com/h84Fif3g
> 
> The next test was to allow 2 MPI ranks on each node to send messages. Ranks 
> 0, 1, 8, 9, 16, 17, 24, and 25 reported 15 microseconds.
> See http://pastebin.com/REdhJXkS
> 
> With 3 MPI ranks per node that can send mess

Re: [OMPI users] MPI hangs on multiple nodes

2011-09-20 Thread Jeff Squyres
On Sep 19, 2011, at 10:23 PM, Ole Nielsen wrote:

> Hi all - and sorry for the multiple postings, but I have more information.

+1 on Eugene's comments.  The test program looks fine to me.

FWIW, you don't need -lmpi to compile your program; OMPI's wrapper compiler 
allows you to just:

mpicc mpi_test.c -o mpi_test -Wall

> 1: After a reboot of two nodes I ran again, and the inter-node freeze didn't 
> happen until the third iteration. I take that to mean that the basic 
> communication works, but that something is saturating. Is there some notion 
> of buffer size somewhere in the MPI system that could explain this?

Hmm.  This is not a good sign; it somewhat indicates a problem with your OS.  
Based on this email and your prior emails, I'm guessing you're using TCP for 
communication, and that the problem is based on inter-node communication (e.g., 
the problem would occur even if you only run 1 process per machine, but does 
not occur if you run all N processes on a single machine, per your #4, below).

> 2: The nodes have 4 ethernet cards each. Could the mapping be a problem?

Shouldn't be.  If it runs at all, then it should run fine.

Do you have all your ethernet cards on a single subnet, or multiple subnets?  I 
have heard of problems when you have multiple ethernet cards on the same subnet 
-- I believe there's some non-determinism in that case in what wire/NIC a 
packet will actually go out, which may be problematic for OMPI.

> 3: The cpus are running at a 100% for all processes involved in the freeze

That's probably right.  OMPI aggressively polls for progress as a way to 
decrease latency.  So all processes are trying to make progress, and therefore 
are aggressively polling, eating up 100% of the CPU.
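
(If the constant polling itself is a problem -- e.g., when nodes are 
oversubscribed -- the mpi_yield_when_idle MCA parameter tells OMPI to yield the 
processor between progress attempts, at some cost in latency; something like 
"mpirun --mca mpi_yield_when_idle 1 ..." should do it, but double-check the 
docs for your version.)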

> 4: The same test program 
> (http://code.google.com/p/pypar/source/browse/source/mpi_test.c) works fine 
> when run within one node so the problem must be with MPI and/or our network. 

This helps identify the issue as the TCP communication, not the shared memory 
communication.

> 5: The network and ssh works otherwise fine.

Good.

> Again many thanks for any hint that can get us going again. The main thing we 
> need is some diagnostics that may point to what causes this problem for MPI.

If you are running with multiple NICs on the same subnet, change them to 
multiple subnets and see if it starts working fine.

If they're on different subnets, try using the btl_tcp_if_include / 
btl_tcp_if_exclude MCA parameters to exclude certain networks and see if 
they're the problematic ones.  Keep in mind that ..._include and ..._exclude 
are mutually exclusive; you should only specify one.  And if you specify 
exclude, be sure to also exclude the loopback interface.  E.g.:

  mpirun --mca btl_tcp_if_include eth0,eth1 -np 16 --hostfile hostfile mpi_test
or
  mpirun --mca btl_tcp_if_exclude lo,eth1 -np 16 --hostfile hostfile mpi_test
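
If your Open MPI is recent enough (the 1.5 series, if I remember right), the 
TCP BTL also accepts subnets in CIDR notation, which avoids depending on the 
interface names being identical on every node, e.g.:

  mpirun --mca btl_tcp_if_include 192.168.1.0/24 -np 16 --hostfile hostfile mpi_test

(192.168.1.0/24 is just a placeholder -- substitute one of your actual subnets.)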

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Jeff Squyres
On Sep 20, 2011, at 7:52 AM, Tim Prince wrote:

> The quoted comment from the OP seems to raise a somewhat different question: Does 
> OpenMPI implement any operations in a different way from MVAPICH?  I would 
> think it probable that the answer could be affirmative for operations such as 
> allreduce, but this leads well outside my expertise with respect to specific 
> MPI implementations.  It isn't out of the question to suspect that such 
> differences might be aggravated when using excessively aggressive ifort 
> options such as -fast.

This is 'zactly what I was going to say -- reductions between Open MPI and 
MVAPICH may well perform global arithmetic operations in different orders.
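
One quick way to see whether reduction order alone accounts for the differences 
-- this is just an illustrative sketch, not either library's actual reduction 
algorithm -- is to compare the library's MPI_Allreduce against a fixed-order sum 
of the same per-rank contributions:

/* reduce_order.c -- illustrative sketch only; not either library's actual
 * reduction algorithm.  Compares MPI_Allreduce with a rank-ordered sum. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    double x, lib_sum = 0.0, ord_sum = 0.0, *all;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Mix large and order-one contributions so the rounded total can
     * depend on the order in which they are combined. */
    x = (rank % 4 == 0) ? 1.0e16 : 1.0;

    /* The library chooses the combining order (linear, tree, ring, ...). */
    MPI_Allreduce(&x, &lib_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Reference: gather all contributions and add them in rank order. */
    all = (double *) malloc(size * sizeof(double));
    MPI_Allgather(&x, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, MPI_COMM_WORLD);
    for (i = 0; i < size; i++)
        ord_sum += all[i];
    free(all);

    if (rank == 0)
        printf("allreduce = %.17g   rank-ordered = %.17g   diff = %g\n",
               lib_sum, ord_sum, lib_sum - ord_sum);

    MPI_Finalize();
    return 0;
}

If the two values differ in the last digit or two under one MPI but agree under 
the other, the combining order -- not anything in the compute routines -- is the 
likely source of the discrepancy.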

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Tim Prince

On 9/20/2011 7:25 AM, Reuti wrote:

Hi,

Am 20.09.2011 um 00:41 schrieb Blosch, Edwin L:


I am observing differences in floating-point results from an application 
program that appear to be related to whether I link with OpenMPI 1.4.3 or 
MVAPICH 1.2.0.  Both packages were built with the same installation of Intel 
11.1, as well as the application program; identical flags passed to the 
compiler in each case.

I’ve tracked down some differences in a compute-only routine where I’ve printed 
out the inputs to the routine (to 18 digits); the inputs are identical.  The 
output numbers are different in the 16th place (perhaps a few in the 15th 
place).  These differences only show up for optimized code, not for -O0.

My assumption is that some optimized math intrinsic is being replaced 
dynamically, but I do not know how to confirm this.  Anyone have guidance to 
offer? Or similar experience?


yes, I see this often, but always at a magnitude where it's not of any concern 
(and not related to any MPI). Due to the limited precision in computers, a 
simple reordering of operations (although equivalent in a mathematical 
sense) can lead to different results. The fact that the anomalies disappear 
at -O0 supports this.

The other point I have heard, specific to the x86 instruction set, is that the 
internal x87 FPU still works with 80 bits, while the representation in memory is 
only 64 bits. Hence, when everything can be kept in registers, the result can 
differ from the case where some intermediate results have to be stored to RAM. 
For the Portland compiler there are the switches -Kieee and -pc64 to force it to 
stay in 64 bits throughout, and similar ones for Intel are -mp (now 
-fltconsistency) and -mp1.

Diagnostics below indicate that ifort 11.1 64-bit is in use.  The 
options aren't the same as Reuti's "now" version (a 32-bit compiler 
which hasn't been supported for 3 years or more?).

With ifort 10.1 and more recent, you would set at least
-assume protect_parens -prec-div -prec-sqrt
if you are interested in numerical consistency.  If you don't want 
auto-vectorization of sum reductions, you would use instead

-fp-model source -ftz
(ftz sets underflow mode back to abrupt, while "source" sets gradual).
It may be possible to expose 80-bit x87 by setting the ancient -mp 
option, but such a course can't be recommended without additional cautions.


The quoted comment from the OP seems to raise a somewhat different question: Does 
OpenMPI implement any operations in a different way from MVAPICH?  I 
would think it probable that the answer could be affirmative for 
operations such as allreduce, but this leads well outside my expertise 
with respect to specific MPI implementations.  It isn't out of the 
question to suspect that such differences might be aggravated when using 
excessively aggressive ifort options such as -fast.




 libifport.so.5 =>  
/opt/intel/Compiler/11.1/072/lib/intel64/libifport.so.5 (0x2b6e7e081000)
 libifcoremt.so.5 =>  
/opt/intel/Compiler/11.1/072/lib/intel64/libifcoremt.so.5 (0x2b6e7e1ba000)
 libimf.so =>  /opt/intel/Compiler/11.1/072/lib/intel64/libimf.so 
(0x2b6e7e45f000)
 libsvml.so =>  /opt/intel/Compiler/11.1/072/lib/intel64/libsvml.so 
(0x2b6e7e7f4000)
 libintlc.so.5 =>  
/opt/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x2b6e7ea0a000)



--
Tim Prince


Re: [OMPI users] How could OpenMPI (or MVAPICH) affect floating-point results?

2011-09-20 Thread Reuti
Hi,

Am 20.09.2011 um 00:41 schrieb Blosch, Edwin L:

> I am observing differences in floating-point results from an application 
> program that appear to be related to whether I link with OpenMPI 1.4.3 or 
> MVAPICH 1.2.0.  Both packages were built with the same installation of Intel 
> 11.1, as well as the application program; identical flags passed to the 
> compiler in each case.
>  
> I’ve tracked down some differences in a compute-only routine where I’ve 
> printed out the inputs to the routine (to 18 digits); the inputs are 
> identical.  The output numbers are different in the 16th place (perhaps a few 
> in the 15th place).  These differences only show up for optimized code, not 
> for -O0.
>  
> My assumption is that some optimized math intrinsic is being replaced 
> dynamically, but I do not know how to confirm this.  Anyone have guidance to 
> offer? Or similar experience?

yes, I see this often, but always at a magnitude where it's not of any concern 
(and not related to any MPI). Due to the limited precision in computers, a 
simple reordering of operations (although equivalent in a mathematical 
sense) can lead to different results. The fact that the anomalies disappear 
at -O0 supports this.
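
A tiny, self-contained illustration of this reordering effect (no MPI and no 
special compiler options involved):

/* Reassociating a double-precision sum changes the rounded result. */
#include <stdio.h>

int main(void)
{
    double a = 1.0e16, b = -1.0e16, c = 1.0;
    printf("(a + b) + c = %.17g\n", (a + b) + c);   /* prints 1 */
    printf("a + (b + c) = %.17g\n", a + (b + c));   /* prints 0 */
    return 0;
}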

The other point I have heard, specific to the x86 instruction set, is that the 
internal x87 FPU still works with 80 bits, while the representation in memory is 
only 64 bits. Hence, when everything can be kept in registers, the result can 
differ from the case where some intermediate results have to be stored to RAM. 
For the Portland compiler there are the switches -Kieee and -pc64 to force it to 
stay in 64 bits throughout, and similar ones for Intel are -mp (now 
-fltconsistency) and -mp1.

http://www.pgroup.com/doc/pgiref.pdf (page 42)

http://software.intel.com/file/6335 (page 260)

You could try the mentioned switches and see whether you get more consistent 
output.


If there were an MPI ABI and you could just drop in any MPI library, it 
would be quite easy to spot the real point where the discrepancy occurred.

-- Reuti


> Thanks very much
>  
> Ed
>  
> Just for what it’s worth, here’s the output of ldd:
>  
> % ldd application_mvapich
> linux-vdso.so.1 =>  (0x7fffe3746000)
> libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x2b5b45fc1000)
> libmpich.so.1.0 => 
> /usr/mpi/intel/mvapich-1.2.0/lib/shared/libmpich.so.1.0 (0x2b5b462cd000)
> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x2b5b465ed000)
> libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x2b5b467fc000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x2b5b46a04000)
> librt.so.1 => /lib64/librt.so.1 (0x2b5b46c21000)
> libm.so.6 => /lib64/libm.so.6 (0x2b5b46e2a000)
> libdl.so.2 => /lib64/libdl.so.2 (0x2b5b47081000)
> libc.so.6 => /lib64/libc.so.6 (0x2b5b47285000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2b5b475e3000)
> /lib64/ld-linux-x86-64.so.2 (0x2b5b45da)
> libimf.so => /opt/intel/Compiler/11.1/072/lib/intel64/libimf.so 
> (0x2b5b477fb000)
> libsvml.so => /opt/intel/Compiler/11.1/072/lib/intel64/libsvml.so 
> (0x2b5b47b8f000)
> libintlc.so.5 => 
> /opt/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x2b5b47da5000)
>  
> % ldd application_openmpi
>linux-vdso.so.1 =>  (0x7fff6ebff000)
> libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x2b6e7c17d000)
> libmpi_f90.so.0 => /usr/mpi/intel/openmpi-1.4.3/lib64/libmpi_f90.so.0 
> (0x2b6e7c489000)
> libmpi_f77.so.0 => /usr/mpi/intel/openmpi-1.4.3/lib64/libmpi_f77.so.0 
> (0x2b6e7c68d000)
> libmpi.so.0 => /usr/mpi/intel/openmpi-1.4.3/lib64/libmpi.so.0 
> (0x2b6e7c8ca000)
> libopen-rte.so.0 => 
> /usr/mpi/intel/openmpi-1.4.3/lib64/libopen-rte.so.0 (0x2b6e7cb9c000)
> libopen-pal.so.0 => 
> /usr/mpi/intel/openmpi-1.4.3/lib64/libopen-pal.so.0 (0x2b6e7ce01000)
> libdl.so.2 => /lib64/libdl.so.2 (0x2b6e7d077000)
> libnsl.so.1 => /lib64/libnsl.so.1 (0x2b6e7d27c000)
> libutil.so.1 => /lib64/libutil.so.1 (0x2b6e7d494000)
> libm.so.6 => /lib64/libm.so.6 (0x2b6e7d697000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x2b6e7d8ee000)
> libc.so.6 => /lib64/libc.so.6 (0x2b6e7db0b000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2b6e7de69000)
> /lib64/ld-linux-x86-64.so.2 (0x2b6e7bf5c000)
> libifport.so.5 => 
> /opt/intel/Compiler/11.1/072/lib/intel64/libifport.so.5 (0x2b6e7e081000)
> libifcoremt.so.5 => 
> /opt/intel/Compiler/11.1/072/lib/intel64/libifcoremt.so.5 (0x2b6e7e1ba000)
> libimf.so => /opt/intel/Compiler/11.1/072/lib/intel64/libimf.so 
> (0x2b6e7e45f000)
> libsvml.so => /opt/intel/Compiler/11.1/072/lib/intel64/libsvml.so 
> (0x2b6e7e7f4000)
> libintlc.so.5 => 
> /opt/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x2b6e