Hi Jeff,
Thank you for your email.  The program makes an MPI_Reduce call as the only form 
of explicit communication between machines. I said it was simple because it's 
effectively a very trivial distributed computation I wrote to learn MPI.  I am 
using the same version, installed by running "brew install openmpi" on each of 
the machines.  They're both running the latest update of OS X 10.7, but their 
PATHs and LD_LIBRARY_PATHs might be slightly different.  I am able to run n-way 
jobs on a single machine.
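For reference, the program is essentially the textbook pi-by-quadrature example. This is only a sketch of what I mean (the variable names and exact layout here are illustrative, not my actual source), but it shows that MPI_Reduce is the sole inter-rank communication:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size;
    long n = (argc > 1) ? atol(argv[1]) : 1000000;  /* number of intervals */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);
    printf("rank and name: %d aka %s\n", rank, name);

    /* Each rank sums its strided slice of the midpoint-rule quadrature
       of 4/(1+x^2) over [0,1], which integrates to pi. */
    double h = 1.0 / (double)n, local = 0.0;
    for (long i = rank; i < n; i += size) {
        double x = h * ((double)i + 0.5);
        local += 4.0 / (1.0 + x * x);
    }
    local *= h;

    /* The only explicit communication between machines: reduce the
       partial sums onto rank 0. */
    double pi = 0.0;
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("rank: %d sum: %g\n", rank, pi);

    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and launched with the mpirun invocation shown in my original message below.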

UPDATE: I wish I could reproduce the error, because now it's gone and I can run 
the same program from each machine in the hostfile.  I would still be very 
interested to know what kinds of MPI situations are likely to cause these 
segfaults.

-Paul

On Feb 11, 2013, at 8:27 AM, Jeff Squyres (jsquyres) wrote:

> Can you provide any more detail?  
> 
> Your report looks weird - you said it's a simple C++ hello world, but the 
> executable you show is "pi", which is typically a simple C example program. 
> 
> Are you using the same version of Open MPI on all nodes?  Are you able to run 
> n-way jobs on single nodes?
> 
> Sent from my phone. No type good. 
> 
> On Feb 9, 2013, at 2:03 PM, "Paul Gribelyuk" <paul.qu...@gmail.com> wrote:
> 
>>> Hello,
>>> I am getting the following stacktrace when running a simple hello world MPI 
>>> C++ program on 2 machines:
>>> 
>>> 
>>> mini:mpi_cw paul$ mpirun --prefix /usr/local/Cellar/open-mpi/1.6.3 
>>> --hostfile hosts_home -np 2 ./pi 1000000
>>> rank and name: 0 aka mini.local
>>> [home-mini:12175] *** Process received signal ***
>>> [home-mini:12175] Signal: Segmentation fault: 11 (11)
>>> [home-mini:12175] Signal code: Address not mapped (1)
>>> [home-mini:12175] Failing at address: 0x1042e0000
>>> [home-mini:12175] [ 0] 2   libsystem_c.dylib                   
>>> 0x00007fff94050cfa _sigtramp + 26
>>> [home-mini:12175] [ 1] 3   mca_btl_tcp.so                      
>>> 0x000000010397092c best_addr + 2620
>>> [home-mini:12175] [ 2] 4   pi                                  
>>> 0x0000000103649d24 start + 52
>>> [home-mini:12175] [ 3] 5   ???                                 
>>> 0x0000000000000002 0x0 + 2
>>> [home-mini:12175] *** End of error message ***
>>> rank: 0 sum: 1.85459
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 12175 on node home-mini.local 
>>> exited on signal 11 (Segmentation fault: 11).
>>> --------------------------------------------------------------------------
>>> 
>>> 
>>> 
>>> I get a similar result even when I don't use --prefix, since the .bashrc 
>>> file on the remote machine correctly sets PATH and LD_LIBRARY_PATH.
>>> 
>>> Any help with this seg fault is greatly appreciated.  Thanks.
>>> 
>>> -Paul
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 

