[OMPI users] Symbol not found: _evsignal_base

2011-05-24 Thread charles reid
Hi -

I'm trying to compile a simple hello world program with mpicc,

$ cat test.c
#include <stdio.h>

int main(void)
{
  printf ("Hello World!\n");
  return 0;
}


but I'm seeing this issue:

$ ~/pkg/openmpi/1.4.3_bigmac/bin/mpicc test.c
dyld: Symbol not found: _evsignal_base
  Referenced from: /uufs/chpc.utah.edu/common/home/u0552682/pkg/openmpi/1.4.3_bigmac/lib/libopen-pal.0.dylib
  Expected in: flat namespace
 in /uufs/chpc.utah.edu/common/home/u0552682/pkg/openmpi/1.4.3_bigmac/lib/libopen-pal.0.dylib
Trace/BPT trap


I found this previous thread,
http://comments.gmane.org/gmane.comp.clustering.open-mpi.user/13033 , which
suggested adding the installation directory's lib/ to LD_LIBRARY_PATH would
fix things, but it did not:

$ export LD_LIBRARY_PATH="${HOME}/pkg/openmpi/1.4.3_bigmac/lib:${LD_LIBRARY_PATH}"; ~/pkg/openmpi/1.4.3_bigmac/bin/mpicc test.c
dyld: Symbol not found: _evsignal_base
  Referenced from: /uufs/chpc.utah.edu/common/home/u0552682/pkg/openmpi/1.4.3_bigmac/lib/libopen-pal.0.dylib
  Expected in: flat namespace
 in /uufs/chpc.utah.edu/common/home/u0552682/pkg/openmpi/1.4.3_bigmac/lib/libopen-pal.0.dylib
Trace/BPT trap


Any suggestions on what I might be doing wrong?


Charles


Re: [OMPI users] openmpi self checkpointing - error while running example

2011-05-24 Thread Faisal
Hellmüller  Roman  student.ethz.ch> writes:

> 
> Hi
> 
> I'm trying to get a fault-tolerant Open MPI running on our cluster for my
> semester thesis.
> 
> Build & compile were successful, and BLCR checkpointing works (openmpi 1.5.3,
> blcr 0.8.2).
> 
> Now I'm trying to set up SELF checkpointing. The example from
> http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can run
> the application and also do checkpoints, but restarting won't work. I got the
> following error by doing as suggested:
> 
> mpicc my-app.c -export -export-dynamic -o my-app
> 
> mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app
> 
> hroman  cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
> --
> Error: Unable to obtain the proper restart command to restart from the
>checkpoint file (opal_snapshot_0.ckpt). Returned -1.
> 
> --
> --
> Error: Unable to obtain the proper restart command to restart from the
>checkpoint file (opal_snapshot_1.ckpt). Returned -1.
> 
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> 
> I also tried setting the path in the example file (the restart_path variable),
> changing the checkpoint directories, and running the application in different
> directories...
> 
> do you have an idea where the error could be?
> 
> here
> 
http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz
> (40MB) you'll find the library and the build of openmpi & blcr, as well as the
> env variables and the output of ompi_info. There is one for the login node and
> another for the compute nodes, due to different kernels. And here
> 
> http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz
> 
> is the produced checkpoint. Please let me know if more outputs are needed.
> 
> cheers
> roman
> 

Hi Roman,

Try putting the name of your executable at the end of the path:
char restart_path[128] = "/full/path/to/personal-cr";
Here 'personal-cr' is the executable.
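
A minimal sketch of the idea (illustrative only; the callback name and exact
signature below are assumptions based on the crs_self_prefix used above, not
the verbatim code from the example page). The point is that restart_path must
name the application binary itself, because the restart callback simply
re-executes it:

/* Illustrative sketch: restart_path names the executable that the
 * restart callback re-executes.  "my_personal_restart" mirrors the
 * crs_self_prefix given on the mpirun line; the exact callback
 * signature expected by the SELF CRS is an assumption here. */
#include <stdio.h>
#include <unistd.h>

static char restart_path[128] = "/full/path/to/personal-cr";

int my_personal_restart(void)
{
    char *args[] = { restart_path, (char *)0 };
    execv(restart_path, args);      /* returns only on failure */
    perror("execv");
    return -1;
}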

I hope it helps.

Kind regards,
Faisal




Re: [OMPI users] reading from file

2011-05-24 Thread sushil samant
Hi Rob,
Thanks a lot. But if you could give some example of reading an .h5 file in C++
or Fortran, it would help a lot.

On 5/24/11, users-requ...@open-mpi.org  wrote:
>
> Today's Topics:
>
>1. Invitation to connect on LinkedIn
>   (Nurul Azri Mohd Radzi via LinkedIn)
>2. Re: Invitation to connect on LinkedIn (Jeff Squyres)
>3. Re: MPI_COMM_DUP freeze with OpenMPI 1.4.1
>   (francoise.r...@obs.ujf-grenoble.fr)
>4. Re: users Digest, Vol 1911, Issue 3 (Salvatore Podda)
>5. Re: openmpi (1.2.8 or above) and Intel composer XE  2011 (aka
>   12.0) (Salvatore Podda)
>6. Re: openmpi (1.2.8 or above) and Intel composer XE  2011 (aka
>   12.0) (Salvatore Podda)
>7. Re: btl_openib_cpc_include rdmacm questions (Dave Love)
>8. Re: Trouble with MPI-IO (Rob Latham)
>9. Re: reading from a file (Rob Latham)
>   10. Re: Openib with > 32 cores per node (Dave Love)
>   11. Re: Trouble with MPI-IO (Tom Rosmond)
>
>
> --
>
> Message: 1
> Date: Tue, 24 May 2011 00:16:52 + (UTC)
> From: Nurul Azri Mohd Radzi via LinkedIn 
> Subject: [OMPI users] Invitation to connect on LinkedIn
> To: Mohan L 
> Message-ID:
>   <1621713298.532717.1306196212953.javamail@ela4-bed33.prod>
> Content-Type: text/plain; charset="utf-8"
>
> LinkedIn
> 
>
>
>
>
> Nurul Azri Mohd Radzi requested to add you as a connection on LinkedIn:
>
> --
>
> Mohan,
>
> I'd like to add you to my professional network on LinkedIn.
>
> - Nurul Azri
>
>
> --
>
> Message: 2
> Date: Mon, 23 May 2011 20:52:30 -0400
> From: Jeff Squyres 
> Subject: Re: [OMPI users] Invitation to connect on LinkedIn
> To: Nurul Azri Mohd Radzi ,  Open MPI Users
>   
> Message-ID: <2c83e966-5529-4f59-839e-d25a06579...@cisco.com>
> Content-Type: text/plain; charset=iso-8859-1
>
> Please do not send such invitations to the Open MPI lists.
>
>
> On May 23, 2011, at 8:16 PM, Nurul Azri Mohd Radzi via LinkedIn wrote:
>
>> LinkedIn
>> Nurul Azri Mohd Radzi requested to add you as a connection on LinkedIn:
>> Mohan,
>>
>> I'd like to add you to my professional network on LinkedIn.
>>
>> - Nurul Azri
>>
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> --
>
> Message: 3
> Date: Tue, 24 May 2011 10:42:48 +0200
> From: "francoise.r...@obs.ujf-grenoble.fr"
>   
> Subject: Re: [OMPI users] MPI_COMM_DUP freeze with OpenMPI 1.4.1
> To: Open MPI Users 
> Message-ID: <4ddb6f88.5060...@obs.ujf-grenoble.fr>
> Content-Type: text/plain; charset=us-ascii; format=flowed
>
> Jeff Squyres wrote:
>> On May 13, 2011, at 8:31 AM, francoise.r...@obs.ujf-grenoble.fr wrote:
>>
>>
>>> Here is the MUMPS portion of code (in zmumps_part1.F file) where the
>>> slaves call MPI_COMM_DUP , id%PAR and MASTER are initialized to 0 before
>>> :
>>>
>>> CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )
>>>
>>
>> I re-inden

Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)

2011-05-24 Thread Marcus R. Epperson

On 05/19/2011 07:37 PM, Jeff Squyres wrote:

Other users have seen something similar but we have never been able
to reproduce it.  Is this only when using IB?


Actually no, when I use --mca btl tcp,sm,self it hangs in the same place.


If you use "mpirun --mca btl_openib_cpc_if_include rdmacm", does the
problem go away?


No, that doesn't help with the hang I'm seeing. Though it sounds like 
I'm hitting a different issue than Salvatore, fwiw.


-Marcus



On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:


I've seen the same thing when I build openmpi 1.4.3 with Intel 12, but only 
when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1 then the collective
hangs go away. I don't know what, if anything, the higher optimization buys you
when compiling openmpi, so I'm not sure if that's an acceptable workaround or 
not.

My system is similar to yours - Intel X5570 with QDR Mellanox IB running RHEL 
5, Slurm, and these openmpi btls: openib,sm,self. I'm using IMB 3.2.2 with a 
single iteration of Barrier to reproduce the hang, and it happens 100% of the 
time for me when I invoke it like this:

# salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier

The hang happens on the first Barrier (64 ranks) and each of the participating 
ranks have this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_recursivedoubling () from 
[instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
IMB_barrier ()
IMB_init_buffers_iter ()
main ()

The one non-participating rank has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
main ()

If I use more nodes I can get it to hang with 1ppn, so that seems to rule out 
the sm btl (or interactions with it) as a culprit at least.

I can't reproduce this with openmpi 1.5.3, interestingly.

-Marcus


On 05/10/2011 03:37 AM, Salvatore Podda wrote:

Dear all,

we succeeded in building several versions of openmpi, from 1.2.8 to 1.4.3,
with Intel composer XE 2011 (aka 12.0).
However, we found a threshold in the number of cores (depending on the
application: IMB, xhpl or user applications,
and on the number of required cores) above which the application hangs
(a sort of deadlock).
Building openmpi with 'gcc' and 'pgi' does not show the same limits.
Are there any known incompatibilities of openmpi with this version of the
Intel compilers?

The characteristics of our computational infrastructure are:

Intel processors E7330, E5345, E5530 e E5620

CentOS 5.3, CentOS 5.5.

Intel composer XE 2011
gcc 4.1.2
pgi 10.2-1

Regards

Salvatore Podda

ENEA UTICT-HPC
Department for Computer Science Development and ICT
Facilities Laboratory for Science and High Performance Computing
C.R. Frascati
Via E. Fermi, 45
PoBox 65
00044 Frascati (Rome)
Italy

Tel: +39 06 9400 5342
Fax: +39 06 9400 5551
Fax: +39 06 9400 5735
E-mail: salvatore.po...@enea.it
Home Page: www.cresco.enea.it







Re: [OMPI users] Trouble with MPI-IO

2011-05-24 Thread Tom Rosmond
Rob,

Thanks for the clarification.  I had seen that point about
non-decreasing offsets in the standard and it was just beginning to dawn
on me that maybe it was my problem.  I will rethink my mapping strategy
to comply with the restriction.  Thanks again.

T. Rosmond


On Tue, 2011-05-24 at 10:09 -0500, Rob Latham wrote:
> On Fri, May 20, 2011 at 08:14:07AM -0400, Jeff Squyres wrote:
> > On May 20, 2011, at 6:23 AM, Jeff Squyres wrote:
> > 
> > > Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays?
> > 
> > Ok, if I convert ijlena and ijdisp to 1D arrays, I don't get the compile 
> > error (even though they're allocatable -- so allocate was a red herring, 
> > sorry).  That's all that "use mpi" is complaining about -- that the 
> > function signatures didn't match.
> > 
> > use mpi is your friend -- even if you don't use F90 constructs much.  
> > Compile-time checking is Very Good Thing (you were effectively "getting 
> > lucky" by passing in the 2D arrays, I think).
> > 
> > Attached is my final version.  And with this version, I see the hang when 
> > running it with the "T" parameter.
> > 
> > That being said, I'm not an expert on the MPI IO stuff -- your code *looks* 
> > right to me, but I could be missing something subtle in the interpretation 
> > of MPI_FILE_SET_VIEW.  I tried running your code with MPICH 1.3.2p1 and it 
> > also hung.
> > 
> > Rob (ROMIO guy) -- can you comment this code?  Is it correct?
> 
> There's a kind of obscure but important rule in MPI-IO: the file view
> must describe monotonically non-decreasing offsets. 
> 
> the T type creates a file type with the following flattened
> representation (you can kind of think of the flattened representation
> as a type map, except everything is in terms of bytes):
> 
> (0, 32), (96, 32), (32, 64)
> 
> So, 32 bytes at offset 0, 32 bytes at offset 96 and 64 bytes at offset
> 32. 
> 
> That sort of looks like this (file layout, labeled by the order the pieces
> appear in the type): | piece 1 | piece 3 | piece 2 |
> 
> But you need the 2nd and 3rd pieces to be swapped in the file view.
> 
> It's an annoying part of the standard but as you can see if you
> violate that ROMIO will go off and spin in an infinite loop looking
> for the next piece of I/O (which in this case was "behind" the current
> piece).
> 
> You can work around this by adjusting your memory datatype: data must
> be read off of the disk in this monotonically non-decreasing order but
> it can be jammed into memory any which way you want.
> 
> ROMIO should be better about reporting file views that violate this
> part of the standard.  We report it in a few places but clearly not
> enough. 
> 
> ==rob
> 



Re: [OMPI users] Openib with > 32 cores per node

2011-05-24 Thread Dave Love
Jeff Squyres  writes:

> Assuming you built OMPI with PSM support:
>
> mpirun --mca pml cm --mca mtl psm 
>
> (although probably just the pml/cm setting is sufficient -- the mtl/psm 
> option will probably happen automatically)

For what it's worth, you needn't specify anything to get PSM used if
it's available.

-- 
Excuse the typping -- I have a broken wrist



Re: [OMPI users] reading from a file

2011-05-24 Thread Rob Latham
On Sat, May 21, 2011 at 05:15:13PM +0530, sushil samant wrote:
> hi all,
>  I am a newcomer to Open MPI programming. I have a txt file containing
> seven columns; each column contains double-type data. What I want to do
> is to read the file in parallel and find the average value and
> standard deviation of each column using C++ and Open MPI. If someone
> can provide a sample program with an explanation it will be very useful.
> And if I understand it, I would like to do the same for an .h5 file.

MPI-IO does not do formatted I/O. 

You should just start with the .h5 (HDF5 ? ) file, where decomposing
the dataset over N processors will be more straightforward.
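
As a rough sketch of what that decomposition could look like (illustrative
only, not part of the original reply; the file name "columns.h5", the dataset
name "/data", the 2-D layout of doubles with 7 columns, and the HDF5 1.8 API
are all assumptions, and it needs an HDF5 built with parallel/MPI-IO support):

/* Minimal parallel-read sketch: each rank reads a contiguous block of rows
 * of a 2-D double dataset and the per-column sums are reduced with MPI.
 * Compile with the parallel HDF5 wrapper (h5pcc); names are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    int rank, nprocs, j;
    hsize_t dims[2], rows, start[2], count[2], i, k;
    hid_t fapl, file, dset, fspace, mspace, dxpl;
    double *buf, lsum[7] = {0}, gsum[7];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open the file collectively through the MPI-IO driver. */
    fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    file = H5Fopen("columns.h5", H5F_ACC_RDONLY, fapl);
    dset = H5Dopen(file, "/data", H5P_DEFAULT);

    fspace = H5Dget_space(dset);
    H5Sget_simple_extent_dims(fspace, dims, NULL);

    /* Simple row decomposition; the last rank takes any remainder. */
    rows     = dims[0] / nprocs;
    start[0] = rows * rank;   start[1] = 0;
    count[0] = (rank == nprocs - 1) ? dims[0] - start[0] : rows;
    count[1] = dims[1];
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    mspace = H5Screate_simple(2, count, NULL);

    buf  = malloc(count[0] * count[1] * sizeof(double));
    dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    /* Per-column averages: local sums reduced across all ranks. */
    for (i = 0; i < count[0]; i++)
        for (k = 0; k < count[1] && k < 7; k++)
            lsum[k] += buf[i * count[1] + k];
    MPI_Allreduce(lsum, gsum, 7, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        for (j = 0; j < 7; j++)
            printf("column %d average = %g\n", j, gsum[j] / (double)dims[0]);

    free(buf);
    H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}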

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


Re: [OMPI users] Trouble with MPI-IO

2011-05-24 Thread Rob Latham
On Fri, May 20, 2011 at 08:14:07AM -0400, Jeff Squyres wrote:
> On May 20, 2011, at 6:23 AM, Jeff Squyres wrote:
> 
> > Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays?
> 
> Ok, if I convert ijlena and ijdisp to 1D arrays, I don't get the compile 
> error (even though they're allocatable -- so allocate was a red herring, 
> sorry).  That's all that "use mpi" is complaining about -- that the function 
> signatures didn't match.
> 
> use mpi is your friend -- even if you don't use F90 constructs much.  
> Compile-time checking is Very Good Thing (you were effectively "getting 
> lucky" by passing in the 2D arrays, I think).
> 
> Attached is my final version.  And with this version, I see the hang when 
> running it with the "T" parameter.
> 
> That being said, I'm not an expert on the MPI IO stuff -- your code *looks* 
> right to me, but I could be missing something subtle in the interpretation of 
> MPI_FILE_SET_VIEW.  I tried running your code with MPICH 1.3.2p1 and it also 
> hung.
> 
> Rob (ROMIO guy) -- can you comment this code?  Is it correct?

There's a kind of obscure but important rule in MPI-IO: the file view
must describe monotonically non-decreasing offsets. 

The T type creates a file type with the following flattened
representation (you can kind of think of the flattened representation
as a type map, except everything is in terms of bytes):

(0, 32), (96, 32), (32, 64)

So, 32 bytes at offset 0, 32 bytes at offset 96 and 64 bytes at offset
32. 

That sort of looks like this (file layout, labeled by the order the pieces
appear in the type): | piece 1 | piece 3 | piece 2 |

But you need the 2nd and 3rd pieces to be swapped in the file view.

It's an annoying part of the standard but as you can see if you
violate that ROMIO will go off and spin in an infinite loop looking
for the next piece of I/O (which in this case was "behind" the current
piece).

You can work around this by adjusting your memory datatype: data must
be read off of the disk in this monotonically non-decreasing order but
it can be jammed into memory any which way you want.

ROMIO should be better about reporting file views that violate this
part of the standard.  We report it in a few places but clearly not
enough. 
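
A minimal sketch of a file view that satisfies that rule (illustrative only,
not code from this thread; the file name "data.out" is made up, and the block
sizes are simply the ones from the flattened representation above, listed in
increasing file order):

/* Illustrative sketch: build a file view whose flattened displacements are
 * monotonically non-decreasing.  The three byte blocks from the example,
 * (0,32), (96,32) and (32,64), are listed here in file order instead. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int      blocklens[3] = { 32, 64, 32 };   /* sorted by file offset */
    MPI_Aint disps[3]     = {  0, 32, 96 };
    MPI_Datatype filetype;
    MPI_File fh;
    char buf[128];

    MPI_Init(&argc, &argv);

    MPI_Type_create_hindexed(3, blocklens, disps, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "data.out",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    /* The view is now legal: offsets never decrease. */
    MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, buf, 128, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}

Any permutation back into the application's layout would then be expressed on
the memory-datatype side of the read, as described above.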

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-24 Thread Dave Love
Brock Palen  writes:

> Well, I have a new wrench in this situation.
> We had a power failure at our datacenter that took down our entire system:
> nodes, switch, SM.
> Now I am unable to reproduce the error with oob, default ibflags, etc.

As far as I know, we could still reproduce it.  Mail me if you need an
alternative, but we may have trouble getting access to the relevant
nodes.

-- 
Excuse the typping -- I have a broken wrist


Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)

2011-05-24 Thread Salvatore Podda
OK! I now understand the meaning of the "--mca btl_openib_cpc_include rdmacm"
parameter.
However, as I just said, we are in the meanwhile running several IMB tests on
openmpi 1.2.8, and in this (our) version the RDMA CM support is either not
implemented or was not included at compile time.

Salvatore Podda


On 20/mag/11, at 03:37, Jeff Squyres wrote:


Sorry for the late reply.

Other users have seen something similar but we have never been able
to reproduce it.  Is this only when using IB?  If you use "mpirun --mca
btl_openib_cpc_if_include rdmacm", does the problem go away?



On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:

I've seen the same thing when I build openmpi 1.4.3 with Intel 12,  
but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1  
then the collectives hangs go away. I don't know what, if anything,  
the higher optimization buys you when compiling openmpi, so I'm not  
sure if that's an acceptable workaround or not.


My system is similar to yours - Intel X5570 with QDR Mellanox IB  
running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm  
using IMB 3.2.2 with a single iteration of Barrier to reproduce the  
hang, and it happens 100% of the time for me when I invoke it like  
this:


# salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier

The hang happens on the first Barrier (64 ranks) and each of the  
participating ranks have this backtrace:


__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/ 
lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/ 
libmpi.so.0

PMPI_Barrier () from [instdir]/lib/libmpi.so.0
IMB_barrier ()
IMB_init_buffers_iter ()
main ()

The one non-participating rank has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/ 
libmpi.so.0

PMPI_Barrier () from [instdir]/lib/libmpi.so.0
main ()

If I use more nodes I can get it to hang with 1ppn, so that seems  
to rule out the sm btl (or interactions with it) as a culprit at  
least.


I can't reproduce this with openmpi 1.5.3, interestingly.

-Marcus


On 05/10/2011 03:37 AM, Salvatore Podda wrote:

Dear all,

we succeeded in building several versions of openmpi, from 1.2.8 to 1.4.3,
with Intel composer XE 2011 (aka 12.0).
However, we found a threshold in the number of cores (depending on the
application: IMB, xhpl or user applications, and on the number of required
cores) above which the application hangs (a sort of deadlock).
Building openmpi with 'gcc' and 'pgi' does not show the same limits.
Are there any known incompatibilities of openmpi with this version of the
Intel compilers?

The characteristics of our computational infrastructure are:

Intel processors E7330, E5345, E5530 e E5620

CentOS 5.3, CentOS 5.5.

Intel composer XE 2011
gcc 4.1.2
pgi 10.2-1

Regards

Salvatore Podda

ENEA UTICT-HPC
Department for Computer Science Development and ICT
Facilities Laboratory for Science and High Performance Computing
C.R. Frascati
Via E. Fermi, 45
PoBox 65
00044 Frascati (Rome)
Italy

Tel: +39 06 9400 5342
Fax: +39 06 9400 5551
Fax: +39 06 9400 5735
E-mail: salvatore.po...@enea.it
Home Page: www.cresco.enea.it



--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




==
Investi nel futuro. Investi nelle nostre ricerche.
Destina il 5 x 1000 all'ENEA
Cerchiamo:
- nuove fonti e nuovi modi per produrre energia pulita e sicura.
- modi migliori per utilizzare e risparmiare energia.
- metodologie e tecnologie per innovare e rendere piu' competitivo il sistema 
produttivo nazionale.
- metodologie e tecnologie per la salvaguardia e il recupero dell'ambiente e 
per la tutela della nostra salute e del patrimonio artistico del Paese.
Il nostro codice fiscale e': 01320740580



Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)

2011-05-24 Thread Salvatore Podda

Apologies, I forgot to edit the subject line.
I am sending it again with a sensible subject.

Salvatore

Begin forwarded message:


From: Salvatore Podda 
Date: 24 maggio 2011 12:46:17 GMT+02:00
To: g...@ldeo.columbia.edu
Cc: users open-mpi 
Subject: Re: users Digest, Vol 1911, Issue 3

Sorry for the late reply, but, as I just said, we are attempting
to recover the full operation of part of our cluster.

Yes, it was a typo; I usually add the "sm" flag to the "--mca btl"
option. However, I think this is not mandatory, as I suppose
openmpi follows the so-called "Law of Least Astonishment"
in this case too and adopts "sm" for the intra-node communication;
or, if you prefer, not adding the sm string does not mean "do not use
shared memory".
Indeed, if I remove or add this string nothing changes, and if
I run an MPI job on a single multicore node without this
flag everything works well.

Thanks

Salvatore



On 20/mag/11, at 20:53, users-requ...@open-mpi.org wrote:


Message: 1
Date: Fri, 20 May 2011 14:30:13 -0400
From: Gus Correa 
Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer
XE  2011 (aka 12.0)
To: Open MPI Users 
Message-ID: <4dd6b335.2090...@ldeo.columbia.edu>
Content-Type: text/plain; charset=us-ascii; format=flowed

Hi Salvatore

Just in case ...
You say you have problems when you use "--mca btl openib,self".
Is this a typo in your email?
I guess this will disable the shared memory btl intra-node,
whereas your other choice "--mca btl_tcp_if_include ib0" will not.
Could this be the problem?

Here we use "--mca btl openib,self,sm",
to enable the shared memory btl intra-node as well,
and it works just fine on programs that do use collective calls.

My two cents,
Gus Correa

Salvatore Podda wrote:
We are still struggling with these problems. Actually, the new version of
the intel compilers does not seem to be the real issue: we run into the same
errors when using the `gcc' compilers as well.
We succeeded in building an openmpi-1.2.8 rpm (with different compiler
flavours) from the installation of the cluster section where everything
seems to work well. We are now running an extensive IMB benchmark campaign.

However, yes, this happens only when we use --mca btl openib,self; on
the contrary, if we use --mca btl_tcp_if_include ib0 everything works well.
Yes, we can try the flag you suggest. I can check the FAQ and the
open-mpi.org documentation, but could you be so kind as to explain the
meaning of this flag?

Thanks

Salvatore Podda

On 20/mag/11, at 03:37, Jeff Squyres wrote:


Sorry for the late reply.

Other users have seen something similar but we have never been able to
reproduce it.  Is this only when using IB?  If you use "mpirun --mca
btl_openib_cpc_if_include rdmacm", does the problem go away?


On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:


I've seen the same thing when I build openmpi 1.4.3 with Intel 12,
but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1
then the collective hangs go away. I don't know what, if anything,
the higher optimization buys you when compiling openmpi, so I'm not
sure if that's an acceptable workaround or not.

My system is similar to yours - Intel X5570 with QDR Mellanox IB
running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm
using IMB 3.2.2 with a single iteration of Barrier to reproduce  
the
hang, and it happens 100% of the time for me when I invoke it  
like this:


# salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier

The hang happens on the first Barrier (64 ranks) and each of the
participating ranks have this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_recursivedoubling () from
[instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from
[instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
IMB_barrier ()
IMB_init_buffers_iter ()
main ()

The one non-participating rank has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/ 
libmpi.so.0

ompi_coll_tuned_barrier_intra_dec_fixed () from
[instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
main ()

If I use more nodes I can get it to hang with 1ppn, so that  
seems to
rule out the sm btl (or interactions with it) as a culprit at  
least.


I can't reproduce this with openmpi 1.5.3, interestingly.

-Marcus


On 05/10/2011 03:37 AM, Salvatore Podda wrote:

Dear all,

we succeed in b

Re: [OMPI users] users Digest, Vol 1911, Issue 3

2011-05-24 Thread Salvatore Podda

Sorry for the late reply, but, as I just said, we are attempting
to recover the full operation of part of our cluster.

Yes, it was a typo; I usually add the "sm" flag to the "--mca btl"
option. However, I think this is not mandatory, as I suppose
openmpi follows the so-called "Law of Least Astonishment"
in this case too and adopts "sm" for the intra-node communication;
or, if you prefer, not adding the sm string does not mean "do not use
shared memory".
Indeed, if I remove or add this string nothing changes, and if
I run an MPI job on a single multicore node without this
flag everything works well.

Thanks

Salvatore



On 20/mag/11, at 20:53, users-requ...@open-mpi.org wrote:


Message: 1
Date: Fri, 20 May 2011 14:30:13 -0400
From: Gus Correa 
Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer
XE  2011 (aka 12.0)
To: Open MPI Users 
Message-ID: <4dd6b335.2090...@ldeo.columbia.edu>
Content-Type: text/plain; charset=us-ascii; format=flowed

Hi Salvatore

Just in case ...
You say you have problems when you use "--mca btl openib,self".
Is this a typo in your email?
I guess this will disable the shared memory btl intra-node,
whereas your other choice "--mca btl_tcp_if_include ib0" will not.
Could this be the problem?

Here we use "--mca btl openib,self,sm",
to enable the shared memory btl intra-node as well,
and it works just fine on programs that do use collective calls.

My two cents,
Gus Correa

Salvatore Podda wrote:
We are still struggling with these problems. Actually, the new version of
the intel compilers does not seem to be the real issue: we run into the same
errors when using the `gcc' compilers as well.
We succeeded in building an openmpi-1.2.8 rpm (with different compiler
flavours) from the installation of the cluster section where everything
seems to work well. We are now running an extensive IMB benchmark campaign.

However, yes, this happens only when we use --mca btl openib,self; on
the contrary, if we use --mca btl_tcp_if_include ib0 everything works well.
Yes, we can try the flag you suggest. I can check the FAQ and the
open-mpi.org documentation, but could you be so kind as to explain the
meaning of this flag?

Thanks

Salvatore Podda

On 20/mag/11, at 03:37, Jeff Squyres wrote:


Sorry for the late reply.

Other users have seen something similar but we have never been  
able to

reproduce it.  Is this only when using IB?  If you use "mpirun --mca
btl_openib_cpc_if_include rdmacm", does the problem go away?


On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:


I've seen the same thing when I build openmpi 1.4.3 with Intel 12,
but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1
then the collectives hangs go away. I don't know what, if anything,
the higher optimization buys you when compiling openmpi, so I'm not
sure if that's an acceptable workaround or not.

My system is similar to yours - Intel X5570 with QDR Mellanox IB
running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm
using IMB 3.2.2 with a single iteration of Barrier to reproduce the
hang, and it happens 100% of the time for me when I invoke it  
like this:


# salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier

The hang happens on the first Barrier (64 ranks) and each of the
participating ranks have this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_recursivedoubling () from
[instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from
[instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
IMB_barrier ()
IMB_init_buffers_iter ()
main ()

The one non-participating rank has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/ 
libmpi.so.0

ompi_coll_tuned_barrier_intra_dec_fixed () from
[instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
main ()

If I use more nodes I can get it to hang with 1ppn, so that seems  
to
rule out the sm btl (or interactions with it) as a culprit at  
least.


I can't reproduce this with openmpi 1.5.3, interestingly.

-Marcus


On 05/10/2011 03:37 AM, Salvatore Podda wrote:

Dear all,

we succeed in building several version of openmpi from 1.2.8 to  
1.4.3

with Intel composer XE 2011 (aka 12.0).
However we found a threshold in the number of cores (depending  
from the

application: IMB, xhpl or user applications
and form the number of required cores) above which the application
hangs
(sort of

Re: [OMPI users] MPI_COMM_DUP freeze with OpenMPI 1.4.1

2011-05-24 Thread francoise.r...@obs.ujf-grenoble.fr

Jeff Squyres wrote:

On May 13, 2011, at 8:31 AM, francoise.r...@obs.ujf-grenoble.fr wrote:

  

Here is the MUMPS portion of code (in zmumps_part1.F file) where the slaves 
call MPI_COMM_DUP , id%PAR and MASTER are initialized to 0 before :

CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )



I re-indented so that I could read it better:

  CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )
  IF ( id%PAR .eq. 0 ) THEN
 IF ( id%MYID .eq. MASTER ) THEN
color = MPI_UNDEFINED
 ELSE
color = 0
 END IF
 CALL MPI_COMM_SPLIT( id%COMM, color, 0,
 & id%COMM_NODES, IERR )
 id%NSLAVES = id%NPROCS - 1
  ELSE
 CALL MPI_COMM_DUP( id%COMM, id%COMM_NODES, IERR )
 id%NSLAVES = id%NPROCS
  END IF

  IF (id%PAR .ne. 0 .or. id%MYID .NE. MASTER) THEN
     CALL MPI_COMM_DUP( id%COMM_NODES, id%COMM_LOAD, IERR )
  ENDIF

That doesn't look right -- both MPI_COMM_SPLIT and MPI_COMM_DUP are collective, 
meaning that all processes in the communicator must call them. In the first 
case, only some processes are calling MPI_COMM_SPLIT.  Is there some other 
logic that forces the rest of the processes to call MPI_COMM_SPLIT, too?

  
Actually, we are in the first case, that is id%PAR = 0. But the 
MPI_COMM_SPLIT routine is called by all the processes and creates a new 
communicator named "id%COMM_NODES". This communicator contains all the 
slaves, but not the master. The first MPI_COMM_DUP is not executed; the 
second one is executed on all the slave nodes (id%MYID .NE. MASTER), 
because the communicator is "id%COMM_NODES" and so it involves all the 
processes of that communicator.
So it seems correct to me, but perhaps I am making a mistake, because the 
MPI_COMM_DUP freezes.
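
For reference, the same logic condensed into a small standalone C sketch
(illustrative only; the variable names are made up and this is not the MUMPS
code): every rank calls MPI_COMM_SPLIT, the master gets MPI_COMM_NULL back
because of MPI_UNDEFINED, and the following MPI_COMM_DUP is collective only
over the slave communicator.

/* Illustrative sketch of the PAR = 0 pattern described above:
 * rank 0 (the master) passes MPI_UNDEFINED to MPI_Comm_split and gets
 * MPI_COMM_NULL back; the slaves then duplicate their sub-communicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, color;
    MPI_Comm comm_nodes, comm_load;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Collective over MPI_COMM_WORLD: every rank must call it. */
    color = (rank == 0) ? MPI_UNDEFINED : 0;
    MPI_Comm_split(MPI_COMM_WORLD, color, 0, &comm_nodes);

    if (comm_nodes != MPI_COMM_NULL) {
        /* Collective only over the slave communicator. */
        MPI_Comm_dup(comm_nodes, &comm_load);
        printf("rank %d duplicated the slave communicator\n", rank);
        MPI_Comm_free(&comm_load);
        MPI_Comm_free(&comm_nodes);
    }

    MPI_Finalize();
    return 0;
}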


Françoise