[OMPI users] Open MPI via SSH noob issue

2011-08-09 Thread Christopher Jones
Hi again,

I changed the subject of my previous posting to reflect a new problem 
encountered when I changed my strategy to using SSH instead of Xgrid on two Mac 
Pros. I've set up password-less SSH between the two Macs (connected via direct 
ethernet, both running Open MPI 1.2.8 on OS X 10.6.8) per the instructions in 
the FAQ. I can type 'ssh computer-name.local' on either computer and connect 
without a password prompt. From what I can see, the ssh-agent is up and 
running - the following is listed in my ENV:

SSH_AUTH_SOCK=/tmp/launch-5FoCc1/Listeners
SSH_AGENT_PID=61058

My host file simply lists 'localhost' and 
'chrisjones2@allana-welshs-mac-pro.local'. When I run a simple hello_world 
test, I get what seems like a reasonable output:

chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile ./test_hello
Hello world from process 0 of 8
Hello world from process 1 of 8
Hello world from process 2 of 8
Hello world from process 3 of 8
Hello world from process 4 of 8
Hello world from process 7 of 8
Hello world from process 5 of 8
Hello world from process 6 of 8

I can also run hostname and get what seems to be an ok response (unless I'm 
wrong about this):

chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile hostname
allana-welshs-mac-pro.local
allana-welshs-mac-pro.local
allana-welshs-mac-pro.local
allana-welshs-mac-pro.local
quadcore.mikrob.slu.se
quadcore.mikrob.slu.se
quadcore.mikrob.slu.se
quadcore.mikrob.slu.se


However, when I run the ring_c test, it freezes:

chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile ./ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9

(I noted that processors on both computers are active).

ring_c was compiled separately on each computer; however, both have the same 
version of Open MPI and OS X. I've gone through the FAQ and searched the user 
forum, but I can't quite seem to get this problem unstuck.

Many thanks for your time,
Chris


Re: [OMPI users] Open MPI via SSH noob issue

2011-08-09 Thread Reuti
Hi,

On 09.08.2011 at 08:46, Christopher Jones wrote:

> [...]
> However, when I run the ring_c test, it freezes:
> 
> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile ./ring_c
> Process 0 sending 10 to 1, tag 201 (8 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> 
> (I noted that processors on both computers are active).
> 
> ring_c was compiled separately on each computer; however, both have the same 
> version of Open MPI and OS X. I've gone through the FAQ and searched the user 
> forum, but I can't quite seem to get this problem unstuck.

do you have any firewall on the machines?
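
For what it's worth, one quick way to look on OS X 10.6 (a rough sketch only; the actual firewall setup on your machines is an assumption) is to check both Macs with:

  sudo ipfw list   # packet-filter rules; a single "allow ip from any to any" rule means nothing is filtered

and to check System Preferences > Security > Firewall for the application firewall.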

-- Reuti


Re: [OMPI users] How to setup and use nodes for OpenMPI on Windows

2011-08-09 Thread Shiqing Fan

Hi Clinton,

> Just wondering if anyone can point me to the detailed information on 
> how to setup multiple nodes and network them together to use OpenMPI. 
> Also what is the proper way to specify which nodes to run on. I wish 
> to use OpenMPI on the Windows XP or Windows Server 2008 platform, with 
> Intel Fortran 11 as the programming language.


If you use Windows XP, you have to specify the --host or --hostfile option 
on the command line; see mpirun --help for more details. On Windows 
Server 2008, you can also specify the node names through the Job Monitor 
GUI.
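
As a minimal illustration (the node names and executable here are placeholders, not from your setup):

  mpirun -np 4 --host node1,node2 my_program.exe

or, with a host file that lists one machine name per line:

  mpirun -np 4 --hostfile hosts.txt my_program.exe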


> I have searched Google and also looked through the OpenMPI website but 
> there does not seem to be any comprehensive documents to run OpenMPI 
> on Windows, especially setting up the nodes.


For working on multiple nodes on Windows XP, the only thing you have to 
make sure of is that WMI is able to launch processes remotely, as described 
in the two MSDN links referenced in the WINDOWS.TXT file. Other than this, 
no special setting is necessary; just install the pre-compiled installer, 
and it will configure the environment automatically for you.


> Many years ago, I played a bit with Mpich and LAM MPI on a purely 
> Linux platform. On Linux clusters, it is easy -- the head node see 
> each of the compute nodes. As I remember, the programmer need to only 
> write the names of the compute nodes in some sort of config file. The 
> job is submitted and based on the compute nodes listed, the mpi will 
> run on them.
> On Windows, what kind of networking is needed to tie the nodes 
> together? (assuming we are not using any of the MS HPC Pack or Compute 
> Cluster). How do we specify which nodes to use? How do we specify any 
> security or group permissions for the nodes?

You should have at least TCP connectivity among the nodes. I don't 
understand what "security and group permissions" means here.



Regards,
Shiqing










--
---
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Tel: ++49(0)711-685-87234  Nobelstrasse 19
Fax: ++49(0)711-685-65832  70569 Stuttgart
http://www.hlrs.de/organization/people/shiqing-fan/
email: f...@hlrs.de



Re: [OMPI users] Open MPI via SSH noob issue

2011-08-09 Thread David Warren
I don't know if this is it, but if you use the name localhost, won't 
processes on both machines try to talk to 127.0.0.1? I believe you need 
to use the real hostname in your host file. I think that your two tests 
work because there is no interprocess communication, just stdout.
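
As a sketch of what that might look like (the host names are taken from the output above; the slot counts are assumptions about the machines, and whether the remote username belongs in the hostfile or in ~/.ssh/config depends on your setup):

  quadcore.mikrob.slu.se slots=4
  allana-welshs-mac-pro.local slots=4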



Re: [OMPI users] Open MPI via SSH noob issue

2011-08-09 Thread Jeff Squyres
No, Open MPI doesn't use the names in the hostfile to figure out which TCP/IP 
addresses to use (for example).  Each process ends up publishing a list of IP 
addresses at which it can be connected, and OMPI does routability computations 
to figure out which is the "best" address to contact a given peer on.
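
One thing you can try along those lines (a sketch; "en0" is a guess at the interface name on your Macs - check ifconfig for the interface that actually carries the direct ethernet link) is to pin Open MPI's TCP traffic to that interface on both machines:

  mpirun --mca btl_tcp_if_include en0 --mca oob_tcp_if_include en0 -np 8 -hostfile hostfile ./ring_c

That rules out Open MPI picking an address on an interface the other machine cannot reach.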

If you're just starting with Open MPI, can you upgrade?  1.2.8 is pretty 
ancient.  Open MPI 1.4.3 is the most recent stable release; 1.5.3 is our 
"feature" series, but it's also relatively stable (new releases are coming in 
both the 1.4.x and 1.5.x series soon, FWIW).


On Aug 9, 2011, at 12:14 PM, David Warren wrote:

> I don't know if this is it, but if you use the name localhost, won't 
> processes on both machines try to talk to 127.0.0.1? I believe you need to 
> use the real hostname in your host file. I think that your two tests work 
> because there is no interprocess communication, just stdout.

[OMPI users] scaling issue beyond 1024 processes

2011-08-09 Thread CB
Hi,

Currently I'm having trouble scaling an MPI job beyond a certain limit.
I'm running an MPI hello example to test beyond 1024 processes, but it failed
with the following error with 2048 processes.
It worked fine with 1024 processes.  I have a large enough file descriptor
limit (65536) defined for each process.

I would appreciate any suggestions.
I'm running Open MPI 1.4.3.

[x-01-06-a:25989] [[37568,0],69] ORTE_ERROR_LOG: Data unpack had inadequate
space in file base/odls_base_default_fns.c at line 335
[x-01-06-b:09532] [[37568,0],74] ORTE_ERROR_LOG: Data unpack had inadequate
space in file base/odls_base_default_fns.c at line 335
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
[x-03-20-b:23316] *** Process received signal ***
[x-03-20-b:23316] Signal: Segmentation fault (11)
[x-03-20-b:23316] Signal code: Address not mapped (1)
[x-03-20-b:23316] Failing at address: 0x6c
[x-03-20-b:23316] [ 0] /lib64/libpthread.so.0 [0x310860ee90]
[x-03-20-b:23316] [ 1]
/usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x230)
[0x7f0dbe0c5010]
[x-03-20-b:23316] [ 2] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0
[0x7f0dbde5c8f8]
[x-03-20-b:23316] [ 3] mpirun [0x403bbe]
[x-03-20-b:23316] [ 4] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0
[0x7f0dbde5c8f8]
[x-03-20-b:23316] [ 5]
/usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99)
[0x7f0dbde50e49]
[x-03-20-b:23316] [ 6]
/usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_trigger_event+0x42)
[0x7f0dbe0a7ca2]
[x-03-20-b:23316] [ 7]
/usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x22d)
[0x7f0dbe0c500d]
[x-03-20-b:23316] [ 8] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0
[0x7f0dbde5c8f8]
[x-03-20-b:23316] [ 9]
/usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99)
[0x7f0dbde50e49]
[x-03-20-b:23316] [10]
/usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x23d)
[0x7f0dbe0c5ddd]
[x-03-20-b:23316] [11]
/usr/local/MPI/openmpi-1.4.3/lib/openmpi/mca_plm_rsh.so [0x7f0dbd41d679]
[x-03-20-b:23316] [12] mpirun [0x40373f]
[x-03-20-b:23316] [13] mpirun [0x402a1c]
[x-03-20-b:23316] [14] /lib64/libc.so.6(__libc_start_main+0xfd)
[0x3107e1ea2d]
[x-03-20-b:23316] [15] mpirun [0x402939]
[x-03-20-b:23316] *** End of error message ***
[x-01-06-a:25989] [[37568,0],69]-[[37568,0],0] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
[x-01-06-b:09532] [[37568,0],74]-[[37568,0],0] mca_oob_tcp_msg_recv: readv
failed: Connection reset by peer (104)
./sge_jsb.sh: line 9: 23316 Segmentation fault  (core dumped) mpirun -np
$NSLOTS ./hello_openmpi.exe


Re: [OMPI users] scaling issue beyond 1024 processes

2011-08-09 Thread Ralph Castain
That error makes no sense - line 335 is just a variable declaration. Sure you 
are not picking up a different version on that node?
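
A quick way to check that (just a sketch - the node names below are the ones from your error output; substitute your real node list) is to compare what each node reports:

  for n in x-01-06-a x-01-06-b x-03-20-b; do
    ssh $n 'which mpirun; ompi_info | grep "Open MPI:"'
  done

Every node should show the same path and the same "Open MPI: 1.4.3" line.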


On Aug 9, 2011, at 11:37 AM, CB wrote:

> Hi,
> 
> Currently I'm having trouble to scale an MPI job beyond a certain limit.
> So I'm running an MPI hello example to test beyond 1024 but it failed with 
> the following error with 2048 processes.
> It worked fine with 1024 processes.  I have enough file descriptor limit 
> (65536) defined for each process.
> 
> I appreciate if anyone gives me any suggestions. 
> I'm running (Open MPI) 1.4.3
> 
> [x-01-06-a:25989] [[37568,0],69] ORTE_ERROR_LOG: Data unpack had inadequate 
> space in file base/odls_base_default_fns.c at line 335
> [x-01-06-b:09532] [[37568,0],74] ORTE_ERROR_LOG: Data unpack had inadequate 
> space in file base/odls_base_default_fns.c at line 335
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> [x-03-20-b:23316] *** Process received signal ***
> [x-03-20-b:23316] Signal: Segmentation fault (11)
> [x-03-20-b:23316] Signal code: Address not mapped (1)
> [x-03-20-b:23316] Failing at address: 0x6c
> [x-03-20-b:23316] [ 0] /lib64/libpthread.so.0 [0x310860ee90]
> [x-03-20-b:23316] [ 1] 
> /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x230)
>  [0x7f0dbe0c5010]
> [x-03-20-b:23316] [ 2] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 
> [0x7f0dbde5c8f8]
> [x-03-20-b:23316] [ 3] mpirun [0x403bbe]
> [x-03-20-b:23316] [ 4] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 
> [0x7f0dbde5c8f8]
> [x-03-20-b:23316] [ 5] 
> /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99) 
> [0x7f0dbde50e49]
> [x-03-20-b:23316] [ 6] 
> /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_trigger_event+0x42) 
> [0x7f0dbe0a7ca2]
> [x-03-20-b:23316] [ 7] 
> /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_app_report_launch+0x22d)
>  [0x7f0dbe0c500d]
> [x-03-20-b:23316] [ 8] /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0 
> [0x7f0dbde5c8f8]
> [x-03-20-b:23316] [ 9] 
> /usr/local/MPI/openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0x99) 
> [0x7f0dbde50e49]
> [x-03-20-b:23316] [10] 
> /usr/local/MPI/openmpi-1.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x23d)
>  [0x7f0dbe0c5ddd]
> [x-03-20-b:23316] [11] 
> /usr/local/MPI/openmpi-1.4.3/lib/openmpi/mca_plm_rsh.so [0x7f0dbd41d679]
> [x-03-20-b:23316] [12] mpirun [0x40373f]
> [x-03-20-b:23316] [13] mpirun [0x402a1c]
> [x-03-20-b:23316] [14] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3107e1ea2d]
> [x-03-20-b:23316] [15] mpirun [0x402939]
> [x-03-20-b:23316] *** End of error message ***
> [x-01-06-a:25989] [[37568,0],69]-[[37568,0],0] mca_oob_tcp_msg_recv: readv 
> failed: Connection reset by peer (104)
> [x-01-06-b:09532] [[37568,0],74]-[[37568,0],0] mca_oob_tcp_msg_recv: readv 
> failed: Connection reset by peer (104)
> ./sge_jsb.sh: line 9: 23316 Segmentation fault  (core dumped) mpirun -np 
> $NSLOTS ./hello_openmpi.exe
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] scaling issue beyond 1024 processes

2011-08-09 Thread CB
Hi Ralph,

Yes, you are right. Those nodes were still pointing to an old version.
I'll check the installation on all nodes and try to run it again.

Thanks,
- Chansup

On Tue, Aug 9, 2011 at 1:48 PM, Ralph Castain  wrote:

> That error makes no sense - line 335 is just a variable declaration. Sure
> you are not picking up a different version on that node?


[OMPI users] CMAQ crashes with OpenMPI

2011-08-09 Thread Matthew Russell
Hi,

I'm trying to run CMAQ - an air quality model developed by the US EPA - on a
Mac (Lion) using OpenMPI (1.5.3) installed with MacPorts.

I am able to run CMAQ in parallel, and am able to run small programs that
use OpenMPI.

I set the OpenMPI environment variables to use pgf90/pgcc (10.9) as my
compiler.  I'm using PGI because some of the code I need to build is Fortran 77
( *sigh* ), and for some other reasons.

The error I get is:

/opt/local/lib/openmpi/bin/mpirun -v -machinefile
/Users/matt/cmaq/darwin11/scripts/cctm/machines8 -np 2
/Users/matt/cmaq/darwin11/scripts/cctm/CCTM_e1a_Darwin11_x86_64pg
[pontus:72547] *** Process received signal ***
[pontus:72547] Signal: Segmentation fault: 11 (11)
[pontus:72547] Signal code: Address not mapped (1)
[pontus:72547] Failing at address: 0x0
[pontus:72547] [ 0] 2   libsystem_c.dylib
0x7fff91065cfa _sigtramp + 26
[pontus:72547] [ 1] 3   ???
0x7fff5fbe58ab 0x0 + 140734799698091
[pontus:72547] [ 2] 4   CCTM_e1a_Darwin11_x86_64pg
 0x00010003c89b distr_env_ + 971
[pontus:72547] [ 3] 5   CCTM_e1a_Darwin11_x86_64pg
 0x00010003cbe5 par_init_ + 565
[pontus:72547] [ 4] 6   CCTM_e1a_Darwin11_x86_64pg
 0x000100032e1b MAIN_ + 219
[pontus:72547] [ 5] 7   CCTM_e1a_Darwin11_x86_64pg
 0x000116f6 main + 70
[pontus:72547] [ 6] 8   CCTM_e1a_Darwin11_x86_64pg
 0x0001163a _start + 248
[pontus:72547] [ 7] 9   CCTM_e1a_Darwin11_x86_64pg
 0x00011541 start + 33
[pontus:72547] [ 8] 10  ???
0x0001 0x0 + 1
[pontus:72547] *** End of error message ***
--
mpirun noticed that process rank 1 with PID 72547 on node
pontus.cee.carleton.ca exited on signal 11 (Segmentation fault: 11).
--

I don't expect anyone to know the solution from this brief error message;
however, I was wondering if anyone has insight into how I might debug this?  I
am too new to both OpenMPI and CMAQ to get much out of this traceback on my
own.

I'm told by others in my research group that CMAQ with OpenMPI on Linux
works fine, and that the error I'm getting is very similar to the error
others got when trying this on a Mac (Snow Leopard) with ifort.. before they
gave up...

OpenMPI was configured with:
configure.args  --sysconfdir=${prefix}/etc/${name} \
--includedir=${prefix}/include/${name} \
--bindir=${prefix}/lib/${name}/bin \
--mandir=${prefix}/share/man \
--with-memory-manager=none

# enable build on Lion
if {${os.major} >= 11} {
configure.compiler   gcc-4.2
}

The --with-memory-manager is there because I saw it fix potentially similar
problems in other postings to this Mailing list.  It didn't make a
difference though.

Thanks!


Re: [OMPI users] CMAQ crashes with OpenMPI

2011-08-09 Thread Doug Reeder
Matt,

Are you sure you are building against your MacPorts version of openmpi and not 
the one that ships with Lion? In the traceback, items 4-9 end with x86_64pg 
from the PGI compiler. You said you are using pgf90 and pgcc, but in the 
configure input it looks like gcc is being used on Lion.
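
One way to check (a rough sketch; the binary path is taken from the mpirun command above, and everything else depends on your PATH) is to see which wrappers are being picked up, what they wrap, and which MPI library the executable links against:

  which mpirun mpif90
  mpif90 --showme
  otool -L /Users/matt/cmaq/darwin11/scripts/cctm/CCTM_e1a_Darwin11_x86_64pg | grep -i mpi

If the wrappers are not the ones under /opt/local, the build is not going through the MacPorts Open MPI.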

Doug Reeder



Re: [OMPI users] CMAQ crashes with OpenMPI

2011-08-09 Thread Ralph Castain
Also, please be aware that we haven't done any testing of OMPI on Lion, so this 
is truly new ground.





Re: [OMPI users] CMAQ crashes with OpenMPI

2011-08-09 Thread Barrett, Brian W
The error message looks like it's nowhere near an MPI function; I would
guess that this is not an Open MPI problem but (particularly given your
statements about Snow Leopard) a CMAQ problem.  The easiest way to debug
on OS X is to launch the application code in a debugger, something like:

  mpirun -np 2 xterm -e gdb <your_application>

One thing that can get people on OS X is that the maximum stack size is
extremely small compared to Linux.  Fortran apps, in particular, can end
up putting things on the stack which cause an overrun and all kinds of fun.
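
If stack size turns out to be the problem, one quick thing to try (a sketch; bash syntax, and whether it actually helps here is an assumption) is to raise the limit in the shell you launch mpirun from, then re-run the same command:

  ulimit -s hard    # raise the soft stack limit to the hard maximum for this shell

Locally launched ranks inherit that shell's limits, so nothing else needs to change for a single-machine run.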

Brian



-- 
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories