Hi Jeff,

Thank you for your email. The program makes an MPI_Reduce call as the only form of explicit communication between machines… I said it was simple because it's effectively a very trivial distributed computation for me to learn MPI. I am using the same version of Open MPI, installed with "brew install openmpi" on each of the machines. They're both running the latest update of OS X 10.7, but their PATHs and LD_LIBRARY_PATHs might differ slightly. I am able to run n-way jobs on a single machine.
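For concreteness, this is roughly the shape of the program -- a minimal sketch rather than the exact code (the series used and names like local_sum are just illustrative): each rank sums part of a series for pi, and a single MPI_Reduce combines the partial sums on rank 0.

// Minimal sketch (assumed structure, not the exact program): each rank
// computes a partial sum of an alternating series for pi, and MPI_Reduce
// combines the partials on rank 0 -- the only explicit communication.
#include <mpi.h>
#include <cstdlib>
#include <iostream>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Number of series terms, e.g. "./pi 1000000".
    long n = (argc > 1) ? std::atol(argv[1]) : 1000000L;

    // Each rank sums the terms i = rank, rank + size, rank + 2*size, ...
    double local_sum = 0.0;
    for (long i = rank; i < n; i += size) {
        local_sum += (i % 2 == 0 ? 1.0 : -1.0) / (2.0 * i + 1.0);
    }

    // Combine the partial sums on rank 0.
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0) {
        std::cout << "pi ~= " << 4.0 * global_sum << std::endl;
    }

    MPI_Finalize();
    return 0;
}

It is compiled with the mpicxx wrapper and launched with the same mpirun line shown in the quoted message below.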
UPDATE: I wish I could reproduce the error, because now it's gone and I can run the same program from each machine in the hostfile. I would still be very interested to know what kinds of MPI situations are likely to cause these kinds of seg faults….

-Paul

On Feb 11, 2013, at 8:27 AM, Jeff Squyres (jsquyres) wrote:

> Can you provide any more detail?
>
> Your report looks weird - you said it's a simple C++ hello world, but the
> executable you show is "pi", which is typically a simple C example program.
>
> Are you using the same version of Open MPI on all nodes? Are you able to run
> n-way jobs on single nodes?
>
> Sent from my phone. No type good.
>
> On Feb 9, 2013, at 2:03 PM, "Paul Gribelyuk" <paul.qu...@gmail.com> wrote:
>
>>> Hello,
>>> I am getting the following stack trace when running a simple hello world
>>> MPI C++ program on 2 machines:
>>>
>>> mini:mpi_cw paul$ mpirun --prefix /usr/local/Cellar/open-mpi/1.6.3 --hostfile hosts_home -np 2 ./pi 1000000
>>> rank and name: 0 aka mini.local
>>> [home-mini:12175] *** Process received signal ***
>>> [home-mini:12175] Signal: Segmentation fault: 11 (11)
>>> [home-mini:12175] Signal code: Address not mapped (1)
>>> [home-mini:12175] Failing at address: 0x1042e0000
>>> [home-mini:12175] [ 0] 2 libsystem_c.dylib   0x00007fff94050cfa _sigtramp + 26
>>> [home-mini:12175] [ 1] 3 mca_btl_tcp.so      0x000000010397092c best_addr + 2620
>>> [home-mini:12175] [ 2] 4 pi                  0x0000000103649d24 start + 52
>>> [home-mini:12175] [ 3] 5 ???                 0x0000000000000002 0x0 + 2
>>> [home-mini:12175] *** End of error message ***
>>> rank: 0 sum: 1.85459
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 12175 on node home-mini.local
>>> exited on signal 11 (Segmentation fault: 11).
>>> --------------------------------------------------------------------------
>>>
>>> I get a similar result even when I don't use --prefix, since the .bashrc
>>> file on the remote machine correctly sets PATH and LD_LIBRARY_PATH.
>>>
>>> Any help with this seg fault is greatly appreciated. Thanks.
>>>
>>> -Paul