[OMPI users] openmpi 1.2.8 on Xgrid noob issue

2011-08-04 Thread Christopher Jones
Hi there,

I'm currently trying to set up a small Xgrid between two Mac Pros (one with a 
single quad-core processor, the other with two dual-core processors), directly 
connected via an Ethernet cable. I've set up Xgrid using password 
authentication (rather than Kerberos), and from what I can tell in the Xgrid 
admin tool it seems to be working. However, when I try a simple hello world 
program, I get this error:

chris-joness-mac-pro:~ chrisjones$ mpirun -np 4 ./test_hello
mpirun noticed that job rank 0 with PID 381 on node xgrid-node-0 exited on 
signal 15 (Terminated).
1 additional process aborted (not shown)
2011-08-04 10:02:16.329 mpirun[350:903] *** Terminating app due to uncaught 
exception 'NSInvalidArgumentException', reason: '*** 
-[NSKVONotifying_XGConnection<0x1001325a0> finalize]: called when collecting 
not enabled'
*** Call stack at first throw:
(
0   CoreFoundation  0x7fff814237b4 
__exceptionPreprocess + 180
1   libobjc.A.dylib 0x7fff84fe8f03 objc_exception_throw 
+ 45
2   CoreFoundation  0x7fff8143e631 -[NSObject(NSObject) 
finalize] + 129
3   mca_pls_xgrid.so0x0001002a9ce3 -[PlsXGridClient 
dealloc] + 419
4   mca_pls_xgrid.so0x0001002a9837 
orte_pls_xgrid_finalize + 40
5   libopen-rte.0.dylib 0x00010002d0f9 orte_pls_base_close 
+ 249
6   libopen-rte.0.dylib 0x000100012027 orte_system_finalize 
+ 119
7   libopen-rte.0.dylib 0x0001e968 orte_finalize + 40
8   mpirun  0x000111ff orterun + 2042
9   mpirun  0x00010a03 main + 27
10  mpirun  0x000109e0 start + 52
11  ??? 0x0004 0x0 + 4
)
terminate called after throwing an instance of 'NSException'
[chris-joness-mac-pro:00350] *** Process received signal ***
[chris-joness-mac-pro:00350] Signal: Abort trap (6)
[chris-joness-mac-pro:00350] Signal code:  (0)
[chris-joness-mac-pro:00350] [ 0] 2   libSystem.B.dylib   
0x7fff81ca51ba _sigtramp + 26
[chris-joness-mac-pro:00350] [ 1] 3   ??? 
0x0001000cd400 0x0 + 4295808000
[chris-joness-mac-pro:00350] [ 2] 4   libstdc++.6.dylib   
0x7fff830965d2 __tcf_0 + 0
[chris-joness-mac-pro:00350] [ 3] 5   libobjc.A.dylib 
0x7fff84fecb39 _objc_terminate + 100
[chris-joness-mac-pro:00350] [ 4] 6   libstdc++.6.dylib   
0x7fff83094ae1 _ZN10__cxxabiv111__terminateEPFvvE + 11
[chris-joness-mac-pro:00350] [ 5] 7   libstdc++.6.dylib   
0x7fff83094b16 _ZN10__cxxabiv112__unexpectedEPFvvE + 0
[chris-joness-mac-pro:00350] [ 6] 8   libstdc++.6.dylib   
0x7fff83094bfc 
_ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP17_Unwind_Exception + 0
[chris-joness-mac-pro:00350] [ 7] 9   libobjc.A.dylib 
0x7fff84fe8fa2 object_getIvar + 0
[chris-joness-mac-pro:00350] [ 8] 10  CoreFoundation  
0x7fff8143e631 -[NSObject(NSObject) finalize] + 129
[chris-joness-mac-pro:00350] [ 9] 11  mca_pls_xgrid.so
0x0001002a9ce3 -[PlsXGridClient dealloc] + 419
[chris-joness-mac-pro:00350] [10] 12  mca_pls_xgrid.so
0x0001002a9837 orte_pls_xgrid_finalize + 40
[chris-joness-mac-pro:00350] [11] 13  libopen-rte.0.dylib 
0x00010002d0f9 orte_pls_base_close + 249
[chris-joness-mac-pro:00350] [12] 14  libopen-rte.0.dylib 
0x000100012027 orte_system_finalize + 119
[chris-joness-mac-pro:00350] [13] 15  libopen-rte.0.dylib 
0x0001e968 orte_finalize + 40
[chris-joness-mac-pro:00350] [14] 16  mpirun  
0x000111ff orterun + 2042
[chris-joness-mac-pro:00350] [15] 17  mpirun  
0x00010a03 main + 27
[chris-joness-mac-pro:00350] [16] 18  mpirun  
0x000109e0 start + 52
[chris-joness-mac-pro:00350] [17] 19  ??? 
0x0004 0x0 + 4
[chris-joness-mac-pro:00350] *** End of error message ***
Abort trap


I've seen this error in a previous mailing list post, and it seems that the 
issue has something to do with forcing everything to use Kerberos (SSO). 
However, I noticed that on the computer being used as an agent, this option is 
grayed out in the Xgrid sharing configuration (I have no idea why). Is it 
absolutely necessary to use SSO to get Open MPI to run with Xgrid, or am I 
missing something with the password setup? The Kerberos option seems much more 
complicated, and I may even switch to just using Open MPI with ssh.

Many thanks,
Chris


Chris Jones
Post-doctoral Research Assistant,

Department of Microbiology
Swedish University of Agricultural Sciences
Uppsala, Sweden
phone:

[OMPI users] Program hangs on send when run with nodes on remote machine

2011-08-04 Thread Keith Manville
I am having trouble running my MPI program on multiple nodes. I can
run multiple processes on a single node, and I can spawn processes on
remote nodes, but when I call Send from a remote node, the call
never returns, even though there is an appropriate Recv waiting. I'm
pretty sure this is an issue with my configuration, not my code. I've
tried some other sample programs I found and had the same problem of
hanging on a send from one host to another.

Here's an in-depth description:

I wrote a quick test program where each process with rank > 0 sends an
int to the master (rank 0), and the master receives until it gets
something from every other process.
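
For reference, the logic of the attached test boils down to roughly the
following (a minimal sketch in C of the same send/receive pattern; the actual
attached code may differ in its details):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i, value, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("Hi I'm %s:%d\n", host, rank);

    if (rank == 0) {
        /* master: collect one int from every other rank */
        for (i = 1; i < size; i++) {
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &st);
            printf("%s:%d received %d from %d\n", host, rank, value,
                   st.MPI_SOURCE);
        }
        printf("all workers checked in!\n");
    } else {
        /* worker: send a single int to the master */
        value = 10 + rank;
        printf("%s:%d sending %d...\n", host, rank, value);
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        printf("%s:%d sent %d\n", host, rank, value);
    }

    MPI_Finalize();
    return 0;
}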

My test program works fine when I run multiple processes on a single machine.

either the local node:

$ ./mpirun -n 4 ./mpi-test
Hi I'm localhost:2
Hi I'm localhost:1
localhost:1 sending 11...
localhost:2 sending 12...
localhost:2 sent 12
localhost:1 sent 11
Hi I'm localhost:0
localhost:0 received 11 from 1
localhost:0 received 12 from 2
Hi I'm localhost:3
localhost:3 sending 13...
localhost:3 sent 13
localhost:0 received 13 from 3
all workers checked in!

or a remote one:

$ ./mpirun -np 2 -host remotehost ./mpi-test
Hi I'm remotehost:0
remotehost:0 received 11 from 1
all workers checked in!
Hi I'm remotehost:1
remotehost:1 sending 11...
remotehost:1 sent 11

But when I try to run the master locally and the worker(s) remotely
(this is the way I am actually interested in running it), Send never
returns and it hangs indefinitely.

$ ./mpirun -np 2 -host localhost,remotehost ./mpi-test
Hi I'm localhost:0
Hi I'm remotehost:1
remotehost:1 sending 11...

Just to see if it would work, I tried spawning the master on the
remotehost and the worker on the localhost.

$ ./mpirun -np 2 -host remotehost,localhost ./mpi-test
Hi I'm localhost:1
localhost:1 sending 11...
localhost:1 sent 11
Hi I'm remotehost:0
remotehost:0 received 0 from 1
all workers checked in!

It doesn't hang on Send, but the wrong value is received.

Any idea what's going on? I've attached my code, my config.log,
ifconfig output, and ompi_info output.

Thanks,
Keith


mpi.tgz
Description: GNU Zip compressed data


Re: [OMPI users] OpenMPI causing WRF to crash

2011-08-04 Thread Jeff Squyres
Signal 15 is usually SIGTERM on Linux, meaning that some external entity 
probably killed the job.

The OMPI error message you describe is also typical for that kind of scenario 
-- i.e., a process that exits without calling MPI_Finalize may have called 
exit() itself, or some external process may have killed it.
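
For illustration, a contrived sketch (not your code) of the kind of thing that 
produces that message -- one rank bailing out before MPI_Finalize:

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* exiting here, without calling MPI_Finalize, typically makes
           mpirun report "exiting without calling finalize" and tear
           down the remaining ranks */
        exit(1);
    }

    MPI_Finalize();
    return 0;
}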


On Aug 3, 2011, at 7:24 AM, BasitAli Khan wrote:

> I am trying to run a rather heavy WRF simulation with spectral nudging, but 
> the simulation crashes after 1.8 minutes of integration.
> The simulation has two domains, with d01 = 601x601 and d02 = 721x721, and 
> 51 vertical levels. I tried this simulation on two different systems, but the 
> result was more or less the same. For example:
> 
> On our Blue Gene/P with SUSE Linux Enterprise Server 10 ppc and the XLF 
> compiler, I tried to run WRF on 2048 shared-memory nodes (1 compute node = 
> 4 cores, 32 bit, 850 MHz). For the parallel run I used mpixlc, mpixlcxx and 
> mpixlf90.  I
> got the following error message in the wrf.err file
> 
>  BE_MPI (ERROR): The error message in the job
> record is as follows:
>  BE_MPI (ERROR):   "killed with signal 15"
> 
> I also tried to run the same simulation on our Linux cluster (Red Hat 
> Enterprise Linux 5.4, x86_64, and the Intel compiler) with 8, 16 and 64 nodes 
> (1 compute node = 8 cores). For the parallel run I used 
> mpi/openmpi/1.4.2-intel-11. I got the following error message in the error 
> log after a couple of minutes of integration.
> 
> "mpirun has exited due to process rank 45 with PID 19540 on
> node ci118 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here)."
> 
> I tried many things but nothing seems to work. However, if I reduce the 
> number of grid points below 200, the simulation runs fine. It appears that 
> OpenMPI probably has a problem with a large number of grid points, but I have 
> no idea how to fix it. I would greatly appreciate it if you could suggest a 
> solution.
> 
> Best regards, 
> ---
> Basit A. Khan, Ph.D.
> Postdoctoral Fellow
> Division of Physical Sciences & Engineering
> Office# 3204, Level 3, Building 1,
> King Abdullah University of Science & Technology
> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955 –6900,
> Kingdom of Saudi Arabia.
> 
> Office: +966(0)2 808 0276,  Mobile: +966(0)5 9538 7592
> E-mail: basitali.k...@kaust.edu.sa
> Skype name: basit.a.khan 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] OpenMPI causing WRF to crash

2011-08-04 Thread Anthony Chan

If you want to debug this on BG/P, you could set BG_COREDUMPONERROR=1
and look at the backtrace in the lightweight core files
(you probably need to recompile everything with -g).
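
A rough sketch of that workflow, treating the exact mechanics as site-dependent 
assumptions (in particular, how the environment variable is forwarded to the 
compute nodes depends on your job launcher):

  # rebuild with debug info so the core files map back to source lines
  mpixlf90 -g -O0 ...
  mpixlc   -g -O0 ...

  # and make sure the compute-node environment contains
  BG_COREDUMPONERROR=1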

A.Chan

- Original Message -
> Hi Dmitry,
> Thanks for a prompt and fairly detailed response. I have also forwarded
> the email to the WRF community in the hope that somebody will have a
> straightforward solution. I will try to debug the error as you suggested
> if I don't have much luck with the WRF forum.
> 
> Cheers,
> ---
> 
> Basit A. Khan, Ph.D.
> Postdoctoral Fellow
> Division of Physical Sciences & Engineering
> Office# 3204, Level 3, Building 1,
> King Abdullah University of Science & Technology
> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955–6900,
> Kingdom of Saudi Arabia.
> 
> Office: +966(0)2 808 0276, Mobile: +966(0)5 9538 7592
> E-mail: basitali.k...@kaust.edu.sa
> Skype name: basit.a.khan
> 
> 
> 
> 
> On 8/3/11 2:46 PM, "Dmitry N. Mikushin"  wrote:
> 
> >5 apparently means one of the WRF's MPI processes has been
> >unexpectedly terminated, maybe by program decision. No matter, if it
> >is OpenMPI-specifi
> 
> 



Re: [OMPI users] Program hangs on send when run with nodes on remote machine

2011-08-04 Thread Jeff Squyres
I notice that in the worker, you have:

eth2  Link encap:Ethernet  HWaddr 00:1b:21:77:c5:d4  
  inet addr:192.168.1.155  Bcast:192.168.1.255  Mask:255.255.255.0
  inet6 addr: fe80::21b:21ff:fe77:c5d4/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:9225846 errors:0 dropped:75175 overruns:0 frame:0
  TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000 
  RX bytes:1336628768 (1.3 GB)  TX bytes:552 (552.0 B)

eth3  Link encap:Ethernet  HWaddr 00:1b:21:77:c5:d5  
  inet addr:192.168.1.156  Bcast:192.168.1.255  Mask:255.255.255.0
  inet6 addr: fe80::21b:21ff:fe77:c5d5/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:26481809 errors:0 dropped:75059 overruns:0 frame:0
  TX packets:18030236 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000 
  RX bytes:70061260271 (70.0 GB)  TX bytes:11844181778 (11.8 GB)

Two different NICs are on the same subnet -- that doesn't seem like a good 
idea...?  I think this topic has come up on the users list before, and, IIRC, 
the general consensus is "don't do that", because it's not clear which NIC 
Linux will actually use to send outgoing traffic bound for the 192.168.1.x 
subnet.
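
As a workaround while you sort out the addressing, you could try restricting 
Open MPI's TCP traffic to a single interface with the btl_tcp_if_include MCA 
parameter (eth3 here is just taken from your ifconfig output):

$ ./mpirun --mca btl_tcp_if_include eth3 -np 2 -host localhost,remotehost ./mpi-test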



On Aug 4, 2011, at 1:59 PM, Keith Manville wrote:

> I am having trouble running my MPI program on multiple nodes. I can
> run multiple processes on a single node, and I can spawn processes on
> remote nodes, but when I call Send from a remote node, the call
> never returns, even though there is an appropriate Recv waiting. I'm
> pretty sure this is an issue with my configuration, not my code. I've
> tried some other sample programs I found and had the same problem of
> hanging on a send from one host to another.
> 
> Here's an in-depth description:
> 
> I wrote a quick test program where each process with rank > 0 sends an
> int to the master (rank 0), and the master receives until it gets
> something from every other process.
> 
> My test program works fine when I run multiple processes on a single machine.
> 
> either the local node:
> 
> $ ./mpirun -n 4 ./mpi-test
> Hi I'm localhost:2
> Hi I'm localhost:1
> localhost:1 sending 11...
> localhost:2 sending 12...
> localhost:2 sent 12
> localhost:1 sent 11
> Hi I'm localhost:0
> localhost:0 received 11 from 1
> localhost:0 received 12 from 2
> Hi I'm localhost:3
> localhost:3 sending 13...
> localhost:3 sent 13
> localhost:0 received 13 from 3
> all workers checked in!
> 
> or a remote one:
> 
> $ ./mpirun -np 2 -host remotehost ./mpi-test
> Hi I'm remotehost:0
> remotehost:0 received 11 from 1
> all workers checked in!
> Hi I'm remotehost:1
> remotehost:1 sending 11...
> remotehost:1 sent 11
> 
> But when I try to run the master locally and the worker(s) remotely
> (this is the way I am actually interested in running it), Send never
> returns and it hangs indefinitely.
> 
> $ ./mpirun -np 2 -host localhost,remotehost ./mpi-test
> Hi I'm localhost:0
> Hi I'm remotehost:1
> remotehost:1 sending 11...
> 
> Just to see if it would work, I tried spawning the master on the
> remotehost and the worker on the localhost.
> 
> $ ./mpirun -np 2 -host remotehost,localhost ./mpi-test
> Hi I'm localhost:1
> localhost:1 sending 11...
> localhost:1 sent 11
> Hi I'm remotehost:0
> remotehost:0 received 0 from 1
> all workers checked in!
> 
> It doesn't hang on Send, but the wrong value is received.
> 
> Any idea what's going on? I've attached my code, my config.log,
> ifconfig output, and ompi_info output.
> 
> Thanks,
> Keith


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] openmpi 1.2.8 on Xgrid noob issue

2011-08-04 Thread Jeff Squyres
I'm afraid our Xgrid support has lagged, and Apple hasn't shown much interest in 
MPI + Xgrid support -- much less HPC.  :-\

Have you seen the FAQ items about Xgrid?

http://www.open-mpi.org/faq/?category=osx#xgrid-howto
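
If you decide to skip Xgrid entirely, you should be able to fall back on the 
ssh launcher by selecting it explicitly -- in the 1.2 series the launcher 
framework is called "pls", so something along these lines is worth a try 
(hostnames here are placeholders):

$ mpirun --mca pls rsh -np 4 -host node1,node2 ./test_hello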


On Aug 4, 2011, at 4:16 AM, Christopher Jones wrote:

> Hi there,
> 
> I'm currently trying to set up a small Xgrid between two Mac Pros (one with a 
> single quad-core processor, the other with two dual-core processors), directly 
> connected via an Ethernet cable. I've set up Xgrid using password 
> authentication (rather than Kerberos), and from what I can tell in the Xgrid 
> admin tool it seems to be working. However, when I try a simple hello world 
> program, I get this error:
> 
> chris-joness-mac-pro:~ chrisjones$ mpirun -np 4 ./test_hello
> mpirun noticed that job rank 0 with PID 381 on node xgrid-node-0 exited on 
> signal 15 (Terminated). 
> 1 additional process aborted (not shown)
> 2011-08-04 10:02:16.329 mpirun[350:903] *** Terminating app due to uncaught 
> exception 'NSInvalidArgumentException', reason: '*** 
> -[NSKVONotifying_XGConnection<0x1001325a0> finalize]: called when collecting 
> not enabled'
> *** Call stack at first throw:
> (
>   0   CoreFoundation  0x7fff814237b4 
> __exceptionPreprocess + 180
>   1   libobjc.A.dylib 0x7fff84fe8f03 
> objc_exception_throw + 45
>   2   CoreFoundation  0x7fff8143e631 
> -[NSObject(NSObject) finalize] + 129
>   3   mca_pls_xgrid.so0x0001002a9ce3 
> -[PlsXGridClient dealloc] + 419
>   4   mca_pls_xgrid.so0x0001002a9837 
> orte_pls_xgrid_finalize + 40
>   5   libopen-rte.0.dylib 0x00010002d0f9 
> orte_pls_base_close + 249
>   6   libopen-rte.0.dylib 0x000100012027 
> orte_system_finalize + 119
>   7   libopen-rte.0.dylib 0x0001e968 
> orte_finalize + 40
>   8   mpirun  0x000111ff orterun + 
> 2042
>   9   mpirun  0x00010a03 main + 27
>   10  mpirun  0x000109e0 start + 52
>   11  ??? 0x0004 0x0 + 4
> )
> terminate called after throwing an instance of 'NSException'
> [chris-joness-mac-pro:00350] *** Process received signal ***
> [chris-joness-mac-pro:00350] Signal: Abort trap (6)
> [chris-joness-mac-pro:00350] Signal code:  (0)
> [chris-joness-mac-pro:00350] [ 0] 2   libSystem.B.dylib   
> 0x7fff81ca51ba _sigtramp + 26
> [chris-joness-mac-pro:00350] [ 1] 3   ??? 
> 0x0001000cd400 0x0 + 4295808000
> [chris-joness-mac-pro:00350] [ 2] 4   libstdc++.6.dylib   
> 0x7fff830965d2 __tcf_0 + 0
> [chris-joness-mac-pro:00350] [ 3] 5   libobjc.A.dylib 
> 0x7fff84fecb39 _objc_terminate + 100
> [chris-joness-mac-pro:00350] [ 4] 6   libstdc++.6.dylib   
> 0x7fff83094ae1 _ZN10__cxxabiv111__terminateEPFvvE + 11
> [chris-joness-mac-pro:00350] [ 5] 7   libstdc++.6.dylib   
> 0x7fff83094b16 _ZN10__cxxabiv112__unexpectedEPFvvE + 0
> [chris-joness-mac-pro:00350] [ 6] 8   libstdc++.6.dylib   
> 0x7fff83094bfc 
> _ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP17_Unwind_Exception + 0
> [chris-joness-mac-pro:00350] [ 7] 9   libobjc.A.dylib 
> 0x7fff84fe8fa2 object_getIvar + 0
> [chris-joness-mac-pro:00350] [ 8] 10  CoreFoundation  
> 0x7fff8143e631 -[NSObject(NSObject) finalize] + 129
> [chris-joness-mac-pro:00350] [ 9] 11  mca_pls_xgrid.so
> 0x0001002a9ce3 -[PlsXGridClient dealloc] + 419
> [chris-joness-mac-pro:00350] [10] 12  mca_pls_xgrid.so
> 0x0001002a9837 orte_pls_xgrid_finalize + 40
> [chris-joness-mac-pro:00350] [11] 13  libopen-rte.0.dylib 
> 0x00010002d0f9 orte_pls_base_close + 249
> [chris-joness-mac-pro:00350] [12] 14  libopen-rte.0.dylib 
> 0x000100012027 orte_system_finalize + 119
> [chris-joness-mac-pro:00350] [13] 15  libopen-rte.0.dylib 
> 0x0001e968 orte_finalize + 40
> [chris-joness-mac-pro:00350] [14] 16  mpirun  
> 0x000111ff orterun + 2042
> [chris-joness-mac-pro:00350] [15] 17  mpirun  
> 0x00010a03 main + 27
> [chris-joness-mac-pro:00350] [16] 18  mpirun  
> 0x000109e0 start + 52
> [chris-joness-mac-pro:00350] [17] 19  ??? 
> 0x0004 0x0 + 4
> [chris-joness-mac-pro:00350] *** End of error message ***
> Abort trap
> 
> 
> I've seen this error in a previous mailing, and it seems that the issue has 
> something to do with forcing everything to use kerberos (SSO). However, I 
> noticed that in the computer being use