Sorry for the late reply but, as I just said, we are trying to
restore part of our cluster to full operation.
Yes, it was a typo: I usually add the "sm" flag to the "--mca btl"
option. However, I think this is not mandatory, as I suppose
Open MPI follows the so-called "Law of Least Astonishment"
in this case too and adopts "sm" for intra-node communication;
or, if you prefer, omitting the sm string does not mean "do not use
shared memory".
Indeed, if I remove or add this string nothing changes, and if
I run an MPI job on a single multicore node without this
flag, everything works well.
Thanks
Salvatore
On 20/May/11, at 20:53, users-requ...@open-mpi.org wrote:
Date: Fri, 20 May 2011 14:30:13 -0400
From: Gus Correa <g...@ldeo.columbia.edu>
Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer
XE 2011 (aka 12.0)
To: Open MPI Users <us...@open-mpi.org>
Hi Salvatore
Just in case ...
You say you have problems when you use "--mca btl openib,self".
Is this a typo in your email?
I guess this will disable the shared memory btl intra-node,
whereas your other choice "--mca btl_tcp_if_include ib0" will not.
Could this be the problem?
Here we use "--mca btl openib,self,sm",
to enable the shared memory btl intra-node as well,
and it works just fine on programs that do use collective calls.
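The three invocations compared in this thread can be laid out side by side. The sketch below is a dry run that only prints the command lines; the rank count and the ./IMB-MPI1 binary are placeholders, and btl_cmd is a hypothetical helper, not part of Open MPI. Under the usual reading of MCA component selection, an explicit "--mca btl" list restricts Open MPI to the components named in it, so the first line would leave out sm, while the third line sets no btl list at all and only restricts the TCP interface.

```shell
# Dry run: print the mpirun command lines discussed in this thread.
# NP and APP are placeholders; nothing is launched.
NP=64
APP="./IMB-MPI1 barrier"

btl_cmd() {
  # $1 holds the MCA arguments to place on the command line.
  echo "mpirun -np $NP $1 $APP"
}

btl_cmd "--mca btl openib,self"         # explicit list without sm
btl_cmd "--mca btl openib,self,sm"      # explicit list with sm
btl_cmd "--mca btl_tcp_if_include ib0"  # no btl list, TCP limited to ib0
```

Running each printed line on the same node count would show whether the hang follows the missing sm entry or the openib btl itself.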
My two cents,
Gus Correa
Salvatore Podda wrote:
We are still struggling with these problems. Actually, the new version
of the Intel compilers does not seem to be the real issue: we hit the
same errors with the `gcc' compilers too.
We succeeded in building an openmpi-1.2.8 rpm (with different compiler
flavours) from the installation of the cluster section where
everything seems to work well. We are now running an intensive IMB
benchmark campaign.
However, yes, this happens only when we use --mca btl openib,self;
by contrast, if we use --mca btl_tcp_if_include ib0, everything works
well.
Yes, we can try the flag you suggest. I can check the FAQ and the
open-mpi.org documentation, but could you be so kind as to explain
the meaning of this flag?
Thanks
Salvatore Podda
On 20/May/11, at 03:37, Jeff Squyres wrote:
Sorry for the late reply.
Other users have seen something similar but we have never been able
to reproduce it. Is this only when using IB? If you use "mpirun --mca
btl_openib_cpc_if_include rdmacm", does the problem go away?
On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:
I've seen the same thing when I build openmpi 1.4.3 with Intel 12,
but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1
then the collective hangs go away. I don't know what, if anything,
the higher optimization buys you when compiling openmpi, so I'm not
sure if that's an acceptable workaround or not.
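The workaround described here amounts to lowering CFLAGS at configure time. A minimal sketch, assuming the Intel C compiler is invoked as icc and using a hypothetical install prefix; configure_cmd is an illustrative helper that prints the configure line instead of running it, so it can be reviewed first.

```shell
# Sketch: configure line for an Open MPI build with Intel at -O1.
# PREFIX is a hypothetical install location; adjust it for your site.
PREFIX="$HOME/openmpi-1.4.3-intel-O1"

configure_cmd() {
  # CC=icc is an assumption; only the CFLAGS change comes from this thread.
  echo "./configure CC=icc CFLAGS=-O1 --prefix=$PREFIX"
}

configure_cmd
# To build for real, run the printed line in the unpacked source tree,
# followed by: make all install
```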
My system is similar to yours - Intel X5570 with QDR Mellanox IB
running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm
using IMB 3.2.2 with a single iteration of Barrier to reproduce the
hang, and it happens 100% of the time for me when I invoke it
like this:
# salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier
The hang happens on the first Barrier (64 ranks) and each of the
participating ranks has this backtrace:
__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
IMB_barrier ()
IMB_init_buffers_iter ()
main ()
The one non-participating rank has this backtrace:
__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
main ()
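The two backtraces differ only in the barrier routine (recursivedoubling for the 64 participating ranks, bruck for the extra rank), so when many backtraces have been collected it can help to pull out just that name. A minimal sketch, assuming each backtrace is available as text in the format shown here; barrier_algo is a hypothetical helper name.

```shell
# Sketch: report which tuned-barrier algorithm a backtrace is stuck in.
barrier_algo() {
  # Reads a backtrace on stdin and prints the algorithm suffix.
  # The dec_fixed frame is skipped because the pattern requires a
  # space right after the lowercase algorithm name.
  sed -n 's/.*ompi_coll_tuned_barrier_intra_\([a-z]*\) .*/\1/p' | head -n 1
}

# Example with three frames from the non-participating rank:
printf '%s\n' \
  'ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0' \
  'ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0' \
  'ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0' |
  barrier_algo
# prints: bruck
```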
If I use more nodes I can get it to hang with 1ppn, so that seems to
rule out the sm btl (or interactions with it) as a culprit at least.
I can't reproduce this with openmpi 1.5.3, interestingly.
-Marcus
On 05/10/2011 03:37 AM, Salvatore Podda wrote:
Dear all,
we succeeded in building several versions of openmpi, from 1.2.8 to
1.4.3, with Intel composer XE 2011 (aka 12.0).
However, we found a threshold in the number of cores (depending on the
application: IMB, xhpl or user applications, and on the number of
required cores) above which the application hangs (a sort of deadlock).
Builds of openmpi with 'gcc' and 'pgi' do not show the same limits.
Are there any known incompatibilities between openmpi and this
version of the Intel compilers?
The characteristics of our computational infrastructure are:
Intel processors E7330, E5345, E5530 and E5620
CentOS 5.3, CentOS 5.5.
Intel composer XE 2011
gcc 4.1.2
pgi 10.2-1
Regards
Salvatore Podda
ENEA UTICT-HPC
Department for Computer Science Development and ICT
Facilities Laboratory for Science and High Performance Computing
C.R. Frascati
Via E. Fermi, 45
PoBox 65
00044 Frascati (Rome)
Italy
Tel: +39 06 9400 5342
Fax: +39 06 9400 5551
Fax: +39 06 9400 5735
E-mail: salvatore.po...@enea.it
Home Page: www.cresco.enea.it
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
jsquy...@cisco.com