[OMPI users] strange problem with OpenMPI + rankfile + Intel compiler 11.0.074 + centos/fedora-12

2010-03-24 Thread Anton Starikov
Intel compiler 11.0.074
OpenMPI 1.4.1

Two different OSes: centos 5.4 (2.6.18 kernel) and Fedora-12 (2.6.32 kernel)
Two different CPUs: Opteron 248 and Opteron 8356.

The same Open MPI binaries on both, and the same binary for the user code (vasp, 
compiled for the older arch).

When I supply a rankfile, the result depends on the combination of OS and CPU:

centos+Opt8356 : works
centos+Opt248 : works
fedora+Opt8356 : works
fedora+Opt248 : fails

The rankfile (in the Opteron 248 case) is:

rank 0=node014 slot=1
rank 1=node014 slot=0

I tried playing with the format and leaving only one slot (starting one process); 
it doesn't change the result.
Without a rankfile it works on all combinations.
Just in case: all of this happens inside a cpuset that always contains all the slots 
given in the rankfile (I use Torque with cpusets plus a custom Torque patch that 
also creates the rankfile for Open MPI, so MPI tasks are bound to particular cores 
and multithreaded codes are limited to the given cpuset).
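
For what it's worth, the binding itself can be checked independently of vasp with a 
trivial MPI test like the following (a minimal sketch of my own; compile with mpicc 
and run under the same rankfile):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

/* affinity_check.c: each rank prints the CPUs it is allowed to run on */
int main(int argc, char **argv)
{
    int rank;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* query the affinity mask that mpirun/paffinity set for this process */
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        printf("rank %d allowed on cpus:", rank);
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask))
                printf(" %d", cpu);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}

(e.g. mpicc -std=gnu99 affinity_check.c -o affinity_check && mpirun -rf rankfile -np 2 ./affinity_check)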

AFAIR, it also works without problems on both hardware setups with 1.3.x/1.4.0 
and the 2.6.30 kernel from openSUSE 11.1.

Strangely, when I run the OSU benchmarks (osu_bw etc.), everything works without 
any problems.


And finally, two error logs (starting 1 and 2 processes):

mpirun -mca paffinity_base_verbose 8  -np 1 vasp
[node014:26373] mca:base:select:(paffinity) Querying component [linux]
[node014:26373] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node014:26373] mca:base:select:(paffinity) Selected component [linux]
[node014:26373] paffinity slot assignment: slot_list == 1
[node014:26373] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node014:26374] mca:base:select:(paffinity) Querying component [linux]
[node014:26374] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node014:26374] mca:base:select:(paffinity) Selected component [linux]
[node014:26374] paffinity slot assignment: slot_list == 1
[node014:26374] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node014:26374] *** An error occurred in MPI_Comm_rank
[node014:26374] *** on a NULL communicator
[node014:26374] *** Unknown error
[node014:26374] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libmpi.so.02ACC26BB36C3  Unknown   Unknown  Unknown
libmpi.so.02ACC26BA0EB8  Unknown   Unknown  Unknown
libmpi.so.02ACC26BA0B4B  Unknown   Unknown  Unknown
libmpi.so.02ACC26BCF77E  Unknown   Unknown  Unknown
libmpi_f77.so.02ACC269528FB  Unknown   Unknown  Unknown
vasp   0046FE66  Unknown   Unknown  Unknown
vasp   00486102  Unknown   Unknown  Unknown
vasp   0042A1AB  Unknown   Unknown  Unknown
vasp   0042A02C  Unknown   Unknown  Unknown
libc.so.6  00364DE1EB1D  Unknown   Unknown  Unknown
vasp   00429F29  Unknown   Unknown  Unknown
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 26374 on
node node014 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
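
(Side note: as far as I understand, the "MPI_Comm_rank ... on a NULL communicator" 
message is what Open MPI prints when MPI_Comm_rank is handed MPI_COMM_NULL; a trivial 
sketch like the one below aborts with the same text under MPI_ERRORS_ARE_FATAL. This 
is just an illustration, not the vasp code.)

#include <mpi.h>
int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_NULL, &rank);  /* invalid: NULL communicator */
    MPI_Finalize();
    return 0;
}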

$ mpirun -mca paffinity_base_verbose 8  -np 2 vasp
[node014:26402] mca:base:select:(paffinity) Querying component [linux]
[node014:26402] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node014:26402] mca:base:select:(paffinity) Selected component [linux]
[node014:26402] paffinity slot assignment: slot_list == 1
[node014:26402] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node014:26402] paffinity slot assignment: slot_list == 0
[node014:26402] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node014:26403] mca:base:select:(paffinity) Querying component [linux]
[node014:26403] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node014:26403] mca:base:select:(paffinity) Selected component [linux]
[node014:26404] mca:base:select:(paffinity) Querying component [linux]
[node014:26404] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node014:26404] mca:base:select:(paffinity) Selected component [linux]
[node014:26403] paffinity slot assignment: slot_list == 1
[node014:26403] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node014:26403] *** An error occurred in MPI_Comm_rank
[node014:26403] *** on a NULL communicator
[node014:26403] *** Unknown error
[node014:26403] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[node014:26404] paffinity slot assignment: slot_list == 0
[node014:26404] paffinity slot assignment: rank 1 runs on 

[OMPI users] Meaning and the significance of MCA parameter "opal_cr_use_thread"

2010-03-24 Thread ananda.mudar
The description of the MCA parameter "opal_cr_use_thread" at
http://osl.iu.edu/research/ft/ompi-cr/api.php is very short.

Can someone explain the usefulness of enabling this parameter versus
disabling it? In other words, what are the pros and cons of disabling it?



I found that this gets enabled automatically when the Open MPI library is
configured with the -ft-enable-threads option.
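
For reference, I believe the current value and one-line description can be listed 
with ompi_info, and the parameter can be toggled per run with -mca, e.g. (the 
application name below is just a placeholder):

$ ompi_info --param all all | grep opal_cr_use_thread
$ mpirun -mca opal_cr_use_thread 0 -np 4 ./app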



Thanks

Ananda




Re: [OMPI users] error depends on the number of processors

2010-03-24 Thread Jeff Squyres
On Mar 23, 2010, at 12:06 PM, Junwei Huang wrote:

> I am still using LAM/MPI on an old cluster and wonder if I can get
> some help from this mail list.

Please upgrade to Open MPI if possible.  :-)

> Here is the problem. I am using an 18-node cluster; each node has 2 CPUs
> and each CPU supports up to 2 threads, so I assume I can use 18*4
> processors. When running the following code, an error message always pops
> up for np=30 or np=60.

Depending on your CPU type and application behavior, using hyperthreads may be 
more of a hindrance than a help.

> But it works fine for np=12 and np=1. The error message is always the
> same, something like: one of the processors, n15, exited with (0), ip
> 192..,
> 
> Here is the part of the code where n15 exits. All other PEs can finish
> writing their files, except PE15. Then I see the error message about n15,
> and the file written by PE15 is not complete. A quick question here: is
> PE15 necessarily run on node 15 of the cluster? I would appreciate it if
> anyone could share experience in debugging errors like this.
> 
> code:
> 
> sprintf(p_obsfile,"%s%d",obsfile,my_rank); // my_rank is processor ID, each PE opens a different file

If each MPI process is opening a separate file, then the files themselves may not 
be what is causing the problem.  For example, if each process opens /dev/null 
instead, do you have the same problem?

> if ((fp=fopen(p_obsfile,"w"))==NULL)
>     printf("PE_%d: The file %s cannot be opened\n",my_rank,p_obsfile);

I do note that you don't have an escape clause here -- if you fail to open the 
file, you still fall through and try to write to the file.
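
Something like this (just a sketch; adapt as needed) avoids falling through with a 
NULL FILE pointer:

  if ((fp = fopen(p_obsfile, "w")) == NULL) {
      printf("PE_%d: The file %s cannot be opened\n", my_rank, p_obsfile);
      MPI_Abort(MPI_COMM_WORLD, 1);   /* or handle the failure some other way */
  }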

> for (int id=loc*my_rank; id ... ) {   // loc=TotalNum/NumofPE
>     // call a function to calculate U; the function will return the
>     // finishing message
>     // no communication is needed among processors
>     for (int j=0; j ... )
>         fprintf(fp, "%f\n", U[j]);  // output updated U
> }

I think you just want to try standard debugging stuff here -- are you going beyond 
the end of the U array?  Perhaps try running your app through valgrind or under a 
debugger, etc.  Do you get corefiles from the run?  And so on.
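
E.g., something along these lines (exact flags are up to you; "your_app" is a 
placeholder):

  mpirun -np 2 valgrind --log-file=vg.%p ./your_app

With --log-file and %p you get one valgrind log per MPI process.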

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] Non-root install; hang there running on multiple nodes

2010-03-24 Thread haoanyi
Hi, 

I installed Open MPI 1.4.1 as a non-root user on a cluster. Everything is fine when 
I run with mpirun or mpiexec on a single node with many processes. However, when I 
launch many processes on multiple nodes, I can see the jobs being distributed to 
those nodes (using "top"), but all the jobs just hang there and never finish.

I think the nodes use TCP to communicate with each other. The cluster also provides 
MPICH2, which was configured by the sysadmin, and inter-node communication works 
fine with MPICH2. I also read in some posts that this may be caused by a TCP 
firewall. Since I have no root rights, I don't know what I should ask the admin to 
do to fix this problem. So, can you tell me how to fix it, either as the admin 
(root) or as a non-root user (if possible)?

Thank you very much.
Hao


Re: [OMPI users] Non-root install; hang there running on multiple nodes

2010-03-24 Thread Jeff Squyres
Can you mpirun non-MPI applications, like "hostname"?  I frequently run this as 
a first step to debugging a wonky install.  For example:

shell$ hostname
barney
shell$ mpirun hostname
barney
shell$ cat hosts
barney
rubble
shell$ mpirun --hostfile hosts hostname
barney
rubble
shell$


On Mar 24, 2010, at 4:28 PM, haoanyi wrote:

> Hi, 
> 
> I installed Open MPI 1.4.1 as a non-root user on a cluster. Everything is fine 
> when I run with mpirun or mpiexec on a single node with many processes. However, 
> when I launch many processes on multiple nodes, I can see the jobs being 
> distributed to those nodes (using "top"), but all the jobs just hang there and 
> never finish.
> 
> I think the nodes use TCP to communicate with each other. The cluster also 
> provides MPICH2, which was configured by the sysadmin, and inter-node 
> communication works fine with MPICH2. I also read in some posts that this may be 
> caused by a TCP firewall. Since I have no root rights, I don't know what I should 
> ask the admin to do to fix this problem. So, can you tell me how to fix it, 
> either as the admin (root) or as a non-root user (if possible)?
> 
> Thank you very much.
> Hao
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Non-root install; hang there running on multiple nodes

2010-03-24 Thread haoanyi
Yes, I can do all of these on each node.



On 2010-03-25 04:33:24, "Jeff Squyres"  wrote:

>Can you mpirun non-MPI applications, like "hostname"?  I frequently run this 
>as a first step to debugging a wonky install.  For example:
>
>shell$ hostname
>barney
>shell$ mpirun hostname
>barney
>shell$ cat hosts
>barney
>rubble
>shell$ mpirun --hostfile hosts hostname
>barney
>rubble
>shell$
>
>
>On Mar 24, 2010, at 4:28 PM, haoanyi wrote:
>
>> Hi, 
>> 
>> I installed Open MPI 1.4.1 as a non-root user on a cluster. Everything is fine 
>> when I run with mpirun or mpiexec on a single node with many processes. However, 
>> when I launch many processes on multiple nodes, I can see the jobs being 
>> distributed to those nodes (using "top"), but all the jobs just hang there and 
>> never finish.
>> 
>> I think the nodes use TCP to communicate with each other. The cluster also 
>> provides MPICH2, which was configured by the sysadmin, and inter-node 
>> communication works fine with MPICH2. I also read in some posts that this may be 
>> caused by a TCP firewall. Since I have no root rights, I don't know what I should 
>> ask the admin to do to fix this problem. So, can you tell me how to fix it, 
>> either as the admin (root) or as a non-root user (if possible)?
>> 
>> Thank you very much.
>> Hao
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>-- 
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Non-root install; hang there running on multiple nodes

2010-03-24 Thread haoanyi
I run a program with the following command line and get the error message below:
mpirun -x LD_LIBRARY_PATH=/home/haoanyi1/socIntel/goto --prefix 
/home/haoanyi1/openmpi1.4.1 -np 2 -host intel01,intel02  -rf hosts ./main 62 62 
tests/ > newtest_64x64_np2_omp

[btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 
192.168.122.1 failed: Connection refused (111)

In the hosts file passed via -rf, I use the following to do the CPU mapping:
rank 0=intel01 slot=0
rank 1=intel02 slot=1

This file is different from the hosts file that I use with "mpirun --hostfile 
hosts hostname", which reads like:
intel01
intel02
..
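
One more observation: 192.168.122.1 looks to me like a virtual bridge address 
(libvirt's default) rather than the real cluster network. Would it make sense to 
keep the TCP BTL off such interfaces, something like the following (just my guess; 
the interface names on these nodes may differ)?

mpirun -mca btl_tcp_if_exclude lo,virbr0 ...

or, the other way around, only include the real interface:

mpirun -mca btl_tcp_if_include eth0 ...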



Re: [OMPI users] Non-root install; hang there running on multiple nodes

2010-03-24 Thread Trent Creekmore
You may also want to check with the admin. I know on the system I use, he
will prevent you from using many nodes until you demonstrate you know what
you are doing. 


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Jeff Squyres
Sent: Wednesday, March 24, 2010 3:33 PM
To: Open MPI Users
Subject: Re: [OMPI users] Non-root install; hang there running on multiple
nodes

Can you mpirun non-MPI applications, like "hostname"?  I frequently run this
as a first step to debugging a wonky install.  For example:

shell$ hostname
barney
shell$ mpirun hostname
barney
shell$ cat hosts
barney
rubble
shell$ mpirun --hostfile hosts hostname
barney
rubble
shell$


On Mar 24, 2010, at 4:28 PM, haoanyi wrote:

> Hi, 
> 
> I installed Open MPI 1.4.1 as a non-root user on a cluster. Everything is fine
> when I run with mpirun or mpiexec on a single node with many processes.
> However, when I launch many processes on multiple nodes, I can see the jobs
> being distributed to those nodes (using "top"), but all the jobs just hang
> there and never finish.
> 
> I think the nodes use TCP to communicate with each other. The cluster also
> provides MPICH2, which was configured by the sysadmin, and inter-node
> communication works fine with MPICH2. I also read in some posts that this
> may be caused by a TCP firewall. Since I have no root rights, I don't know
> what I should ask the admin to do to fix this problem. So, can you tell me
> how to fix it, either as the admin (root) or as a non-root user (if possible)?
> 
> Thank you very much.
> Hao
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users