Hello,
I'm trying to come up with a fault-tolerant OpenMPI setup for research
purposes. I'm doing some tests now, but I'm stuck with a segfault when
I try to restart my test program from a checkpoint.
My test program is the "ring" program, where messages are sent to the
next node in the ring N t
On Sun, Feb 28, 2010 at 11:11 PM, Fernando Lemos wrote:
> Hello,
>
>
> I'm trying to come up with a fault-tolerant OpenMPI setup for research
> purposes. I'm doing some tests now, but I'm stuck with a segfault when
> I try to restart my test program from a check
On Wed, Mar 3, 2010 at 5:31 PM, Joshua Hursey wrote:
>
> Yes, ompi-restart should be printing a helpful message and exiting normally.
> Thanks for the bug report. I believe that I have seen and fixed this on a
> development branch making its way to the trunk. I'll make sure to move the
> fix t
Hi,
First, I'm hoping setting the subject of this e-mail will get it
attached to the thread that starts with this e-mail:
http://www.open-mpi.org/community/lists/users/2009/12/11608.php
The reason I'm not replying to that thread is that I wasn't subscribed
to the list at the time.
My environm
On Wed, Mar 3, 2010 at 10:24 PM, Fernando Lemos wrote:
> Is there anything I can do to provide more information about this bug?
> E.g. try to compile the code in the SVN trunk? I also have kept the
> snapshots intact, I can tar them up and upload them somewhere in case
> you guys ne
On Fri, Mar 5, 2010 at 12:03 PM, Josh Hursey wrote:
> This type of failure is usually due to prelink'ing being left enabled on one
> or more of the systems. This has come up multiple times on the Open MPI
> list, but is actually a problem between BLCR and the Linux kernel. BLCR has
> a FAQ entry o
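
A quick way to check for the prelinking situation described above might look like the sketch below. It assumes a Red Hat-style system where prelink is controlled by a `PRELINKING=yes|no` line in `/etc/sysconfig/prelink`; that path and the `prelink_state` helper are assumptions for illustration and do not come from the thread. The demo runs against a temporary file, so it reads and changes nothing on the real system.

```shell
# Report whether prelinking is enabled according to a Red Hat-style config file.
# Assumption: the file holds a PRELINKING=yes|no line; the default path is the
# usual location on RHEL-like systems and may differ on yours.
prelink_state() {
    conf=${1:-/etc/sysconfig/prelink}
    if [ ! -r "$conf" ]; then
        echo unknown
    elif grep -Eq '^[[:space:]]*PRELINKING="?yes"?[[:space:]]*$' "$conf"; then
        echo enabled
    else
        echo disabled
    fi
}

# Demo on a throwaway file so the real configuration is never touched.
demo_conf=$(mktemp)
printf 'PRELINKING=yes\n' > "$demo_conf"
echo "demo config: $(prelink_state "$demo_conf")"
printf 'PRELINKING=no\n' > "$demo_conf"
echo "demo config: $(prelink_state "$demo_conf")"
rm -f "$demo_conf"
```

Calling `prelink_state` with no argument on each node would report the state of the default file. If any node reports `enabled`, the BLCR FAQ entry mentioned above is the place to look for the exact remedy.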
On Fri, Mar 12, 2010 at 6:02 PM, Samuel K. Gutierrez wrote:
> One more thing. The line should have been:
>
> export LD_LIBRARY_PATH=/home/jess/local/ompi/lib64
>
> The space in the previous email will make bash unhappy 8-|.
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
> On Mar
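
To illustrate why the space matters, here is a small sketch that uses the path from the quoted line. The `OMPI_LIB` and `space_variant_ok` names are only for illustration. The broken form runs in a subshell, so its failure cannot affect the calling shell.

```shell
# Path taken from the quoted line; adjust to your own Open MPI install prefix.
OMPI_LIB=/home/jess/local/ompi/lib64

# Correct form: no whitespace around '='.
export LD_LIBRARY_PATH=$OMPI_LIB

# Broken form, isolated in a subshell. The shell first sets LD_LIBRARY_PATH to
# the empty string, then rejects the path because it is not a valid variable name.
if ( export LD_LIBRARY_PATH= /home/jess/local/ompi/lib64 ) 2>/dev/null; then
    space_variant_ok=yes
else
    space_variant_ok=no
fi

echo "variant with space accepted: $space_variant_ok"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

Even in shells that only print an error and carry on, the variant with the space leaves `LD_LIBRARY_PATH` empty, so the Open MPI libraries would still not be found.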
On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres wrote:
> On Mar 17, 2010, at 4:39 AM, wrote:
>
>> Hi everyone, I'm a new Open MPI user and I have just installed Open MPI on
>> a 6-node cluster with Scientific Linux. When I run it locally it
>> works perfectly, but when I try to execute it on t
On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian wrote:
> I set up a cluster of 18 nodes using Open MPI and the BLCR library, and the MPI
> program runs well on the cluster,
> but how do I checkpoint the MPI program on this cluster?
> for example:
> here is what I do for a test:
> mpiu@nimbus: /mirror$
On Tue, Mar 23, 2010 at 10:25 AM, Nicolas Niclausse wrote:
> Hello,
>
>
> I'm trying to run openmpi (1.4.1) on two clusters; on each cluster, several
> interfaces are private;
>
> on cluster1, nodes have 3 interfaces, and only 192.168.159.0/24 is visible
> from cluster2.
>
> chicon-3
> eth0 in
On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian wrote:
> I have created the shared file system, but I created /mirror in the root
> directory, not in the $HOME directory. Is that the
> problem? Thank you
Others might be able to give you a more accurate explanation. The way
I understood it, in OpenMP
On Tue, Mar 23, 2010 at 12:24 PM, fengguang tian wrote:
> Hi
>
> I am using Open MPI and BLCR in a cluster of 3 machines. Checkpoint
> and restart work fine on a single machine, but when checkpointing in a
> cluster environment, ompi-checkpoint hangs
Besides what has been said in anoth
On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian wrote:
>
> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir
> --hostfile .mpihostfile
> to store the global checkpoint snapshot in the shared
> directory /mirror, but the problems are still there,
> when ompi-checkpoin
On Tue, Mar 23, 2010 at 1:25 PM, fengguang tian wrote:
> Now I have set $HOME as the shared directory, but when running ompi-checkpoint, it
> shows the following (nimbus1 is the remote machine in
> my cluster):
>
> [nimbus1:12630] opal_os_dirpath_create: Error: Unable to create the
> sub-directory (/home/mpiu/ompi_global_
On Wed, Mar 31, 2010 at 7:39 PM, Addepalli, Srirangam V wrote:
> Hello All.
> I am trying to checkpoint an MPI application that has been started using the
> following mpirun command
>
> mpirun -am ft-enable-cr -np 8 pw.x < Ge46.pw.in > Ge46.ph.out
>
> ompi-checkpoint 31396 (works). However, when I
On Thu, Apr 8, 2010 at 10:31 AM, Jeff Squyres wrote:
> Yes. There is usually a difference between interactive logins and
> non-interactive logins on which paths, etc. get set. Look in your shell
> startup and see if there is somewhere that it exits early (or otherwise
> doesn't process) for n
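
One common layout that avoids the problem described above is sketched below. It assumes a bash-style startup file such as `~/.bashrc`. The `/opt/openmpi` prefix and the `is_interactive` helper are hypothetical and do not come from the thread. The idea is to place the settings every shell needs ahead of anything that applies only to interactive shells.

```shell
# Sketch of a shell startup file (for example ~/.bashrc).
# /opt/openmpi is a hypothetical install prefix; substitute your own.

# 1. Settings every shell needs, including the non-interactive shells that
#    are started on remote nodes over ssh. Keep these at the top.
OMPI_PREFIX=/opt/openmpi
case ":$PATH:" in
    *":$OMPI_PREFIX/bin:"*) ;;              # already present, do nothing
    *) PATH="$OMPI_PREFIX/bin:$PATH" ;;     # otherwise prepend it
esac
export PATH

# 2. Interactive-only settings go last. "$-" contains "i" only in an
#    interactive shell, so a non-interactive shell skips this part.
is_interactive() {
    case $- in
        *i*) return 0 ;;
        *)   return 1 ;;
    esac
}

if is_interactive; then
    PS1='\u@\h:\w\$ '
fi
```

To see what a non-interactive login actually gets, compare the output of `ssh somenode 'echo $PATH'` (where `somenode` stands for one of your hosts) with `echo $PATH` typed after logging in to that node interactively.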
Hello,
I've noticed that ompi-restart doesn't support the --rankfile option.
It only supports --hostfile/--machinefile. Is there any reason
--rankfile isn't supported?
Suppose you have a cluster without a shared file system. When one node
fails, you transfer its checkpoint to a spare node and in
On Sat, Apr 10, 2010 at 6:07 AM, Juergen Kaiser wrote:
> Hi,
>
> is it possible to add a new MPI process to a set of running MPI processes
> such that they can communicate as usual? If so, how?
OpenMPI supports MPI-2, so, as far as I can tell, yes, you can do so
by using the dynamic process manage
On Mon, Apr 12, 2010 at 7:36 AM, Hideyuki Jitsumoto wrote:
> Hi Members,
>
> I tried to use checkpoint/restart with Open MPI,
> but I cannot get correct checkpoint data.
> I prepared execution environment as follows, the strings in () mean
> name of output file which attached on next e-mail ( for ma
On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto wrote:
> Fernando,
>
> Thank you for your reply.
> I tried to patch the file you mentioned, but the output did not change.
I didn't test the patch, tbh. I'm using 1.5 nightly snapshots, and it
works great.
>>Are you using a shared file system?
On Fri, May 7, 2010 at 5:33 PM, Cristobal Navarro wrote:
> Hello,
>
> my question is the following.
>
> is it possible to send and receive C++ objects or STL structures (for
> example, send map myMap) through openMPI SEND and RECEIVE functions?
> at first glance i thought it was possible, but afte
On Wed, May 12, 2010 at 2:51 PM, Jeff Squyres wrote:
> On May 12, 2010, at 1:48 PM, Hanjun Kim wrote:
>
>> I am working on parallelizing my sequential program using OpenMPI.
>> Although I got performance speedup using many threads, there was
>> slowdown on a small number of threads like 4 threads.
On Tue, May 18, 2010 at 3:53 PM, Josh Hursey wrote:
>> I've noticed that ompi-restart doesn't support the --rankfile option.
>> It only supports --hostfile/--machinefile. Is there any reason
>> --rankfile isn't supported?
>>
>> Suppose you have a cluster without a shared file system. When one node
On Tue, May 25, 2010 at 1:03 AM, Yves Caniou wrote:
> 2 ** When I use an Isend() operation, the manpage says that I can't use the
> buffer until the operation completes.
> What happens if I use an Isend() operation in a function, with a buffer
> declared inside the function?
> Do I have to Wait() f