[OMPI users] Segfault in ompi-restart (ft-enable-cr)

2010-02-28 Thread Fernando Lemos
Hello, I'm trying to come up with a fault tolerant OpenMPI setup for research purposes. I'm doing some tests now, but I'm stuck with a segfault when I try to restart my test program from a checkpoint. My test program is the "ring" program, where messages are sent to the next node in the ring N t

Re: [OMPI users] Segfault in ompi-restart (ft-enable-cr)

2010-03-02 Thread Fernando Lemos
On Sun, Feb 28, 2010 at 11:11 PM, Fernando Lemos wrote: > Hello, > > > I'm trying to come up with a fault tolerant OpenMPI setup for research > purposes. I'm doing some tests now, but I'm stuck with a segfault when > I try to restart my test program from a check

Re: [OMPI users] Segfault in ompi-restart (ft-enable-cr)

2010-03-03 Thread Fernando Lemos
On Wed, Mar 3, 2010 at 5:31 PM, Joshua Hursey wrote: > > Yes, ompi-restart should be printing a helpful message and exiting normally. > Thanks for the bug report. I believe that I have seen and fixed this on a > development branch making its way to the trunk. I'll make sure to move the > fix t

[OMPI users] checkpointing multi node and multi process applications

2010-03-03 Thread Fernando Lemos
Hi, First, I'm hoping setting the subject of this e-mail will get it attached to the thread that starts with this e-mail: http://www.open-mpi.org/community/lists/users/2009/12/11608.php The reason I'm not replying to that thread is that I wasn't subscribed to the list at the time. My environm

Re: [OMPI users] checkpointing multi node and multi process applications

2010-03-04 Thread Fernando Lemos
On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos wrote: > Is there anything I can do to provide more information about this bug? > E.g. try to compile the code in the SVN trunk? I also have kept the > snapshots intact, I can tar them up and upload them somewhere in case > you guys ne

Re: [OMPI users] change hosts to restart the checkpoint

2010-03-07 Thread Fernando Lemos
On Fri, Mar 5, 2010 at 12:03 PM, Josh Hursey wrote: > This type of failure is usually due to prelink'ing being left enabled on one > or more of the systems. This has come up multiple times on the Open MPI > list, but is actually a problem between BLCR and the Linux kernel. BLCR has > a FAQ entry o

Re: [OMPI users] Problem in using openmpi

2010-03-12 Thread Fernando Lemos
On Fri, Mar 12, 2010 at 6:02 PM, Samuel K. Gutierrez wrote: > One more thing.  The line should have been: > > export LD_LIBRARY_PATH=/home/jess/local/ompi/lib64 > > The space in the previous email will make bash unhappy 8-|. > > -- > Samuel K. Gutierrez > Los Alamos National Laboratory > > On Mar

Re: [OMPI users] Problem in remote nodes

2010-03-17 Thread Fernando Lemos
On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres wrote: > On Mar 17, 2010, at 4:39 AM, wrote: > >> Hi everyone I'm a new Open MPI user and I have just installed Open MPI in >> a 6 nodes cluster with Scientific Linux. When I execute it in local it >> works perfectly, but when I try to execute it on t

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread Fernando Lemos
On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian wrote: > I set up a cluster of 18 nodes using Open MPI and BLCR library, and the MPI > program runs well on the clusters, > but how to checkpoint the MPI program on this cluster? > for example: > here is what I do for a test: > mpiu@nimbus: /mirror$

Re: [OMPI users] problem with opal_net_private_ipv4

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 10:25 AM, Nicolas Niclausse wrote: > Hello, > > > I'm trying to run openmpi (1.4.1) on two clusters; on each cluster, several > interfaces are private; > > on cluster1, nodes have 3 interfaces, and only 192.168.159.0/24 is visible > from cluster2. > > chicon-3 > eth0     in

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian wrote: > I have created the shared file system. but I created a /mirror at root > directory, not at the $HOME directory, is that the > problem? thank you Others might be able to give you a more accurate explanation. The way I understood it, in OpenMP

Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 12:24 PM, fengguang tian wrote: > Hi > > I am using open-mpi and blcr in a cluster of 3 machines, and the checkpoint > and restart work fine in single machine, but when doing checkpoint in > clusters environment, the ompi-checkpoint hangs Besides what has been said in anoth

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian wrote: > > I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir > --hostfile .mpihostfile > to store the global checkpoint snapshot into the shared > directory: /mirror, but the problems are still there, > when ompi-checkpoin

Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 1:25 PM, fengguang tian wrote: > now, I set $HOME as shared directory, but when doing ompi-checkpoint, it > shows:(nimbus1 is the remote machine in > my cluster) > > [nimbus1:12630] opal_os_dirpath_create: Error: Unable to create the > sub-directory (/home/mpiu/ompi_global_

Re: [OMPI users] ompi-checkpoint --term

2010-03-31 Thread Fernando Lemos
On Wed, Mar 31, 2010 at 7:39 PM, Addepalli, Srirangam V wrote: > Hello All. > I am trying to checkpoint an MPI application that has been started using the > following mpirun command > > mpirun -am ft-enable-cr -np 8 pw.x  < Ge46.pw.in > Ge46.ph.out > > ompi-checkpoint 31396 (Works) However when i

Re: [OMPI users] orted: error while loading shared libraries

2010-04-08 Thread Fernando Lemos
On Thu, Apr 8, 2010 at 10:31 AM, Jeff Squyres wrote: > Yes.  There is usually a difference between interactive logins and > non-interactive logins on which paths, etc. get set.  Look in your shell > startup and see if there is somewhere that it exits early (or otherwise > doesn't process) for n

[OMPI users] Using a rankfile for ompi-restart

2010-04-08 Thread Fernando Lemos
Hello, I've noticed that ompi-restart doesn't support the --rankfile option. It only supports --hostfile/--machinefile. Is there any reason --rankfile isn't supported? Suppose you have a cluster without a shared file system. When one node fails, you transfer its checkpoint to a spare node and in

Re: [OMPI users] Adding new process to running job

2010-04-10 Thread Fernando Lemos
On Sat, Apr 10, 2010 at 6:07 AM, Juergen Kaiser wrote: > Hi, > > is it possible to add a new MPI process to a set of running MPI processes > such that they can communicate as usual? If so, how? OpenMPI supports MPI-2, so, as far as I can tell, yes, you can do so by using the dynamic process manage

Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-12 Thread Fernando Lemos
On Mon, Apr 12, 2010 at 7:36 AM, Hideyuki Jitsumoto wrote: > Hi Members, > > I tried to use checkpoint/restart by openmpi. > But I can not get collect checkpoint data. > I prepared execution environment as follows, the strings in () mean > name of output file which attached on next e-mail ( for ma

Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-14 Thread Fernando Lemos
On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto wrote: > Fernando, > > Thank you for your reply. > I tried to patch the file you mentioned, but the output did not change. I didn't test the patch, tbh. I'm using 1.5 nightly snapshots, and it works great. >>Are you using a shared file system?

Re: [OMPI users] communicate C++ STL strucutures ??

2010-05-07 Thread Fernando Lemos
On Fri, May 7, 2010 at 5:33 PM, Cristobal Navarro wrote: > Hello, > > my question is the following. > > is it possible to send and receive C++ objects or STL structures (for > example, send map myMap) through openMPI SEND and RECEIVE functions? > at first glance i thought it was possible, but afte

Re: [OMPI users] getc in openmpi

2010-05-12 Thread Fernando Lemos
On Wed, May 12, 2010 at 2:51 PM, Jeff Squyres wrote: > On May 12, 2010, at 1:48 PM, Hanjun Kim wrote: > >> I am working on parallelizing my sequential program using OpenMPI. >> Although I got performance speedup using many threads, there was >> slowdown on a small number of threads like 4 threads.

Re: [OMPI users] Using a rankfile for ompi-restart

2010-05-21 Thread Fernando Lemos
On Tue, May 18, 2010 at 3:53 PM, Josh Hursey wrote: >> I've noticed that ompi-restart doesn't support the --rankfile option. >> It only supports --hostfile/--machinefile. Is there any reason >> --rankfile isn't supported? >> >> Suppose you have a cluster without a shared file system. When one node

Re: [OMPI users] About the necessity of cancelation of pending communication and the use of buffer

2010-05-25 Thread Fernando Lemos
On Tue, May 25, 2010 at 1:03 AM, Yves Caniou wrote: > 2 ** When I use a Isend() operation, the manpage says that I can't use the > buffer until the operation completes. > What happens if I use an Isend() operation in a function, with a buffer > declared inside the function? > Do I have to Wait() f