[OMPI devel] New MOSIX components draft

2012-03-31 Thread Alex Margolin

Hi,

I think I'm close to finishing an initial version of MOSIX support 
for Open MPI. A preliminary draft is attached.
The support consists of two components: an ODLS component for launching 
processes under MOSIX, and a BTL component for efficient communication 
between processes.
I'm not quite there yet - I'm sure the BTL component needs more work: 
first because it fails (see the error output below), and second because 
I'm not sure I got all the function output right. I've written some 
documentation inside the code, though it is pretty short at the moment. 
The ODLS component is working fine.


Could someone take a look at my code and tell me whether I'm headed in 
the right direction? I would like to submit my code to the repository 
eventually... I know of quite a few Open MPI users interested in MOSIX 
support (they know I'm working on it), and I was hoping to publish some 
benchmark results for it at the upcoming EuroMPI.


P.S. I get the following error - I'm pretty sure my BTL is to blame here:

alex@singularity:~/huji/benchmarks/simple$ mpirun -mca btl_base_verbose 
100 -mca btl self,mosix hello
[singularity:10838] mca: base: component_find: unable to open 
/usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open 
shared object file: No such file or directory (ignored)

[singularity:10838] mca: base: components_open: Looking for btl components
[singularity:10838] mca: base: components_open: opening btl components
[singularity:10838] mca: base: components_open: found loaded component mosix
[singularity:10838] mca: base: components_open: component mosix register 
function successful
[singularity:10838] mca: base: components_open: component mosix open 
function successful

[singularity:10838] mca: base: components_open: found loaded component self
[singularity:10838] mca: base: components_open: component self has no 
register function
[singularity:10838] mca: base: components_open: component self open 
function successful
[singularity:10838] mca: base: component_find: unable to open 
/usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open 
shared object file: No such file or directory (ignored)

[singularity:10838] select: initializing btl component mosix
[singularity:10838] select: init of component mosix returned success
[singularity:10838] select: initializing btl component self
[singularity:10838] select: init of component self returned success
[singularity:10838] *** Process received signal ***
[singularity:10838] Signal: Segmentation fault (11)
[singularity:10838] Signal code: Address not mapped (1)
[singularity:10838] Failing at address: 0x30
[singularity:10838] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36420) 
[0x7fa94a3cd420]
[singularity:10838] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x84391) 
[0x7fa94a41b391]
[singularity:10838] [ 2] /lib/x86_64-linux-gnu/libc.so.6(__strdup+0x16) 
[0x7fa94a41b086]
[singularity:10838] [ 3] 
/usr/local/lib/libmpi.so.0(opal_argv_append_nosize+0xf7) [0x7fa94add66a4]
[singularity:10838] [ 4] /usr/local/lib/openmpi/mca_bml_r2.so(+0x1cf5) 
[0x7fa946177cf5]
[singularity:10838] [ 5] /usr/local/lib/openmpi/mca_bml_r2.so(+0x1e50) 
[0x7fa946177e50]
[singularity:10838] [ 6] 
/usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0x12f) 
[0x7fa946382b6d]
[singularity:10838] [ 7] /usr/local/lib/libmpi.so.0(ompi_mpi_init+0x909) 
[0x7fa94acd1549]
[singularity:10838] [ 8] /usr/local/lib/libmpi.so.0(MPI_Init+0x16c) 
[0x7fa94ad033ec]
[singularity:10838] [ 9] 
/home/alex/huji/benchmarks/simple/hello(_ZN3MPI4InitERiRPPc+0x23) [0x409e2d]
[singularity:10838] [10] 
/home/alex/huji/benchmarks/simple/hello(main+0x22) [0x408f66]
[singularity:10838] [11] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fa94a3b830d]
[singularity:10838] [12] /home/alex/huji/benchmarks/simple/hello() 
[0x408e89]

[singularity:10838] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 10838 on node singularity 
exited on signal 11 (Segmentation fault).

--
alex@singularity:~/huji/benchmarks/simple$ mpirun -mca btl self,tcp hello
[singularity:10841] mca: base: component_find: unable to open 
/usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open 
shared object file: No such file or directory (ignored)
[singularity:10841] mca: base: component_find: unable to open 
/usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open 
shared object file: No such file or directory (ignored)

Hello world!
alex@singularity:~/huji/benchmarks/simple$
Index: ompi/mca/btl/mosix/configure.m4
===
--- ompi/mca/btl/mosix/configure.m4	(revision 0)
+++ ompi/mca/btl/mosix/configure.m4	(revision 0)
@@ -0,0 +1,30 @@
+# -*- shell-script -*-
+#
+# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
+# University Re

Re: [OMPI devel] New MOSIX components draft

2012-03-31 Thread Ralph Castain
I can't speak to the BTL itself, but I do have questions as to how this can 
work. If MOSIX migrates a process, or starts new processes on another node 
during the course of a job, there is no way for MPI to handle the wireup, and 
so it will fail. We need ALL the procs started at the beginning of time, and 
we need them to remain in their initial location throughout the job. There are 
people working on how to handle proc movement, but mostly from a fault-recovery 
perspective - i.e., the process is already known and wired, but fails and 
restarts at a new location, so we can try to re-wire it.

I've looked at MOSIX before for other folks (it's easy enough to fork/exec a 
proc), but I could find no real way to support the way MOSIX wants to manage 
resources without constraining MOSIX to operate only at a job level - i.e., 
requiring it to start all specified procs at the beginning of time and never 
migrate them. That kinda defeated the intent of MOSIX.


On Mar 31, 2012, at 10:04 AM, Alex Margolin wrote:

> [Alex's original message and error log quoted in full - snipped]

Re: [OMPI devel] New MOSIX components draft

2012-03-31 Thread Alex Margolin
MOSIX works as a sandbox, wrapping the executed process. Suppose I run 
with "-n 3": three processes will be launched via MOSIX on nodes A, B 
and C. MOSIX can choose to "migrate" process #2 from B to D - this will 
not restart the process, nor will the process know about its current 
location unless it "asks", for example by reading /proc/mosix/mosip. The 
process will run on D (and consume CPU and memory on D), but it'll think 
it's still on B, and most system calls will still be executed on B. This 
is, of course, better for CPU-intensive apps than for I/O-intensive ones... 
Since MPI would qualify as "communication-intensive", I've prepared a 
special BTL component for it. You don't have to use the BTL to run with 
MOSIX - the ODLS component is enough, but without the BTL you'll get 
reduced communication performance. MPI runs as usual (with the slight 
performance penalty) - no processes are added or removed, so no 
re-wiring is needed...
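
For illustration, here is a minimal sketch (standalone, outside of Open 
MPI) of how a process can ask MOSIX where it is actually hosted. It 
assumes /proc/mosix/mosip holds a single line identifying the current 
node; the exact format may vary between MOSIX setups:

/* Minimal sketch: ask MOSIX where this process is currently hosted.
 * Assumes /proc/mosix/mosip contains a single line identifying the
 * hosting node; the exact format may vary between MOSIX setups. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char node[64] = "";
    FILE* fp = fopen("/proc/mosix/mosip", "r");

    if (NULL == fp) {
        /* Not running under MOSIX (or the proc entry is unavailable) */
        printf("not running under MOSIX\n");
        return 1;
    }
    if (NULL != fgets(node, sizeof(node), fp)) {
        node[strcspn(node, "\n")] = '\0';   /* strip trailing newline */
        printf("currently hosted on: %s\n", node);
    }
    fclose(fp);
    return 0;
}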


I'll be happy to elaborate if you're interested.

On 03/31/2012 10:29 PM, Ralph Castain wrote:

[Ralph's reply and the earlier messages quoted in full - snipped]

Re: [OMPI devel] New MOSIX components draft

2012-03-31 Thread Alex Margolin
I've added some documentation and made a few other changes in the hope 
of making the code more readable (the attached diff replaces the 
previous one), though the BTL is still giving me that error. There are 
some TODOs in the code where I was unsure of the approach (it should 
still work - I'm not aware of any missing code), and I'll appreciate any 
comments...


Here's an example: suppose I have a TCP and a UDP channel in parallel. 
This is not critical for the first version (I'm not using UDP before I 
make TCP work), but I am curious how I can make the upper layers use 
both according to the task at hand. It would seem that TCP requires 
less application-level code but more resources than UDP, yet I ran a 
few tests and it seems TCP can beat UDP in performance in some 
scenarios... It sounds odd to me, but this may be the result of 
intensive kernel optimizations on the TCP side. Still, UDP may perform 
better in fire-and-forget scenarios, as in the sketch below.
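
To make the fire-and-forget case concrete, here's a minimal sketch of a 
UDP send (the address and port are placeholders): there is no 
connect/accept round-trip, a single sendto() hands the datagram to the 
kernel, and delivery is not guaranteed - which is exactly where the 
reduced overhead comes from:

/* Minimal fire-and-forget UDP send: no handshake, one sendto(),
 * no delivery guarantee. The address and port are placeholders. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello";
    struct sockaddr_in peer;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(9999);                  /* placeholder port */
    inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr);

    /* Unlike TCP, the datagram is simply handed to the kernel;
     * it may be silently dropped and no state is kept afterwards. */
    if (0 > sendto(fd, msg, sizeof(msg), 0,
                   (struct sockaddr*)&peer, sizeof(peer))) {
        perror("sendto");
    }
    close(fd);
    return 0;
}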


Thanks a lot (and sorry for the hassle),
Alex


On 03/31/2012 07:04 PM, Alex Margolin wrote:

[Alex's original message quoted in full - snipped]

Index: ompi/mca/btl/mosix/configure.m4
===
--- ompi/mca/btl/mosix/configure.m4	(revision 0)
+++ ompi/mca/btl/mosix/configure.m4	(revision 0)
@@ -0,0 +1,30 @@
+# -*- shell-script -*-
+#
+# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
+# University Research and Technology
+# Corporation.  All rights reserved.
+# Copyright (c) 2004-2005 The University of Tennessee and The University
+# of Tennessee Research Foundation.  All rights
+# reserved.
+# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
+# University of Stuttgart.  All rights reserved.
+# Copyright (c) 2004-2005 The Regents of the University of California.
+# All rights reserved.
+# Copyright (c) 2010  Cisco Systems, Inc.  All rights reserved.
+# $COPYRIGHT$
+# 
+# Additional copyrights may follow
+# 
+# $HEADER$
+#
+
+# MCA_btl_mosix_CONFIG([action-if-found], [action-if-not-found])
+# ---
+AC_DEFUN([MCA_ompi_btl_mosix_CONFIG],[
+AC_CONFIG_FILES([ompi/mca/btl/mosix/Makefile])
+
+# check for mosix presence
+AC_CHECK_FILE([/proc/mosix/mosip], 
+  [$1],
+  [$2])
+])dnl
Index: ompi/mca/btl/mosix/btl_mosix_endpoint.c
===
--- ompi/mca/btl/mosix/btl_mosix_endpoint.c	(revision 0)
+++ ompi/mca/btl/mosix/btl_mosix_endpoint.c	(revision 0)
@@ -0,0 +1,155 @@
+/*
+ * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
+ * University Research and Technology
+ * Corporation.  All rights reserved.
+ * Copyright (c) 2004-2008 The University of Tennessee and The University
+ * of Tennessee Research Foundation.  All rights
+ * reserved.
+ * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
+ * University of Stuttgart.  All rights reserved.
+ * Copyright (c) 2004-2005 The Regents of the University of California.
+ * All rights reserved.
+ * Copyright (c) 2007-2008 Sun Microsystems, Inc.  All rights reserved.
+ * $COPYRIGHT$
+ * 
+ * Additional copyrights may follow
+ * 
+ * $HEADER$
+ *
+ */
+
+#include "ompi_config.h"
+
+#include <stdlib.h>
+#include <string.h>
+#ifdef HAVE_UNISTD_H
+#include <unistd.h>
+#endif
+#ifdef HAVE_SYS_TYPES_H
+#include <sys/types.h>
+#endif
+#ifdef HAVE_FCNTL_H
+#include <fcntl.h>
+#endif
+
+#include "opal/opal_socket_errno.h"
+
+#include "ompi/types.h"
+#include "ompi/mca/btl/base/btl_base_error.h"
+
+#include "btl_mosix.h"
+#include "btl_mosix_endpoint.h"
+
+#define CLOSE_FD(x) { close(x); x = -1; }
+
+/*
+ * Attempt to send a descriptor using a given endpoint.
+ * If the channel has not been in use so far - open it.
+ */
+int mca_btl_mosix_endpoint_send(mca_btl_mosix_endpoint_t* mosix_endpoint,
+		mca_btl_base_descriptor_t* des)
+{
+	int cnt = -1;
+char* remote_mbox_path = NULL;
+	struct iovec writer[] = {{NULL, sizeof(mca_btl_base_header_t)}};
+
+/* Open connection if not open already */
+if( 0 > mosix_endpoint->endpoint_tcp_fd ) {
+