Re: [OMPI devel] Notes from mem hooks call today

2008-05-28 Thread Brian Barrett

On May 28, 2008, at 5:09 PM, Roland Dreier wrote:

> > I think Patrick's point is that it's not too much more expensive to do the
> > syscall on Linux vs just doing the cache lookup, particularly in the
> > context of a long message.  And it means that upper layer protocols like
> > MPI don't have to deal with caches (and since MPI implementors hate
> > registration caches only slightly less than we hate MPI_CANCEL, that will
> > make us happy).
>
> Stick in a separate library then?
>
> I don't think we want the complexity in the kernel -- I personally would
> argue against merging it upstream; and given that the userspace solution
> is actually faster, it becomes pretty hard to justify.


If someone would like to pull the registration cache into OFED, that would
be great.  But something tells me they won't want to.  It's a pain, it
screws up users, and it only works about 50% of the time.


It's a support issue -- pushing it into a separate library doesn't help
anyone unless someone's willing to handle the support.  I sure as heck
don't want to do the support anymore, particularly since OFED is the
*ONLY* major software stack that requires such evil hacks.  MX handles
it at the lower layer.  Portals is specified such that the hardware
and/or Portals library must handle it (by specifying semantics that
require registration per message).  Quadrics (with tports) handles it
in a combination of the kernel and library.  TCP doesn't require
pinning and/or registration.


Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI devel] Notes from mem hooks call today

2008-05-28 Thread Roland Dreier
 > I think Patrick's point is that it's not too much more expensive to do the 
 > syscall on Linux vs just doing the cache lookup, particularly in the 
 > context of a long message.  And it means that upper layer protocols like 
 > MPI don't have to deal with caches (and since MPI implementors hate 
 > registration caches only slightly less than we hate MPI_CANCEL, that will 
 > make us happy).

Stick in a separate library then?

I don't think we want the complexity in the kernel -- I personally would
argue against merging it upstream; and given that the userspace solution
is actually faster, it becomes pretty hard to justify.


Re: [OMPI devel] Notes from mem hooks call today

2008-05-28 Thread Brian W. Barrett

On Wed, 28 May 2008, Roland Dreier wrote:


>- gleb asks: don't we want to avoid the system call when possible?
>- patrick: a single syscall can be/is cheaper than a reg cache
>  lookup in user space

This doesn't really make sense -- syscall + cache lookup in kernel is
"obviously" more expensive than cache lookup in userspace with no
context switch (I don't see any tricks the kernel can do that make the
cache lookup cheaper there).

However the solution I proposed a long time ago (when Pete Wyckoff
originally did his work on having the kernel track this -- and as a side
note, it's not clear to me whether MMU notifiers really help what Pete
did) is for userspace to provide a pointer to a flag when registering
memory with the kernel, and then the kernel can mark the flag if the
mapping changes -- ie keep the userspace cache but have the kernel
manage invalidation "perfectly" without any malloc hooks.


I think Patrick's point is that it's not too much more expensive to do the 
syscall on Linux vs just doing the cache lookup, particularly in the 
context of a long message.  And it means that upper layer protocols like 
MPI don't have to deal with caches (and since MPI implementors hate 
registration caches only slightly less than we hate MPI_CANCEL, that will 
make us happy).
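
To make the tradeoff concrete, here is a minimal sketch (not Open MPI code;
the reg_cache_* helpers are hypothetical) contrasting per-message registration
against a userspace registration cache on the long-message send path:

#include <stddef.h>
#include <infiniband/verbs.h>

/* Hypothetical cache helpers -- not part of libibverbs or Open MPI. */
struct ibv_mr *reg_cache_lookup(void *buf, size_t len);
void reg_cache_insert(void *buf, size_t len, struct ibv_mr *mr);

/* Path A: register on every long-message send (one syscall per message).
 * No cache to maintain and no malloc hooks; deregister when the send
 * completes. */
static struct ibv_mr *register_per_message(struct ibv_pd *pd,
                                           void *buf, size_t len)
{
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}

/* Path B: consult a userspace registration cache first.  A hit avoids the
 * syscall, but the cache has to be invalidated whenever the mapping
 * changes -- the part MPI implementations would rather not own. */
static struct ibv_mr *register_cached(struct ibv_pd *pd,
                                      void *buf, size_t len)
{
    struct ibv_mr *mr = reg_cache_lookup(buf, len);
    if (NULL == mr) {
        mr = ibv_reg_mr(pd, buf, len,
                        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
        if (NULL != mr) {
            reg_cache_insert(buf, len, mr);
        }
    }
    return mr;
}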


Brian


Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-28 Thread Jeff Squyres

On May 28, 2008, at 8:02 AM, Jeff Squyres wrote:


Note that the two /sys checks may be redundant; I'm not entirely sure
how the two files relate to each other.  libibverbs will complain
about the first if it is not present; the second is used to indicate
that the kernel drivers are loaded.


I got some more feedback from Roland off-list explaining that if
/sys/class/infiniband does exist and is non-empty and
/sys/class/infiniband_verbs/abi_version does not exist, then this is
definitely a case where we want to warn because it implies that config
is screwed up -- RDMA devices are present but not usable.


In this case, I think the warning that libibverbs itself prints is  
suitable ("Fatal: couldn't read...").  So let's just eliminate that  
check in OMPI and go with something like the following (pretty much  
exactly what was proposed a while ago by Pasha :-) ):


  # If sysfs/class/infiniband does not exist, the driver was not
  # started.  Therefore: assume that the user does not want RDMA
  # hardware support -- do *not* print a warning message.
  if (! -d "$sysfsdir/class/infiniband") {
  if ($always_want_to_see_warnings)
  print "Warning: $sysfsdir/class/infiniband does not exist\n";
  return SKIP_THIS_BTL;
  }

  # If we get to this point, the drivers are loaded and therefore we
  # will assume that there is supposed to be at least one RDMA device
  # present.  Warn if we don't find any.
  $list = ibv_get_device_list();
  if (empty($list)) {
  print "Warning: couldn't find any RDMA devices -- if you have  
no RDMA devices, stop the driver to avoid this warning message\n";

  return SKIP_THIS_BTL;
  }

  # ...continue with initialization; warnings and errors are
  # *always* displayed after this point

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Notes from mem hooks call today

2008-05-28 Thread Roland Dreier
 >- gleb asks: don't we want to avoid the system call when possible?
 >- patrick: a single syscall can be/is cheaper than a reg cache
 >  lookup in user space

This doesn't really make sense -- syscall + cache lookup in kernel is
"obviously" more expensive than cache lookup in userspace with no
context switch (I don't see any tricks the kernel can do that make the
cache lookup cheaper there).

However the solution I proposed a long time ago (when Pete Wyckoff
originally did his work on having the kernel track this -- and as a side
note, it's not clear to me whether MMU notifiers really help what Pete
did) is for userspace to provide a pointer to a flag when registering
memory with the kernel, and then the kernel can mark the flag if the
mapping changes -- ie keep the userspace cache but have the kernel
manage invalidation "perfectly" without any malloc hooks.

 - R.


Re: [OMPI devel] Open MPI session directory location

2008-05-28 Thread Ralph H Castain
After chatting with Jeff to better understand the ompi_info issue, I
consolidated all the ORTE-level MCA param registrations that are relevant to
users into a single registration function and had ompi_info call it. You will
now see them displayed by ompi_info.
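
For example, to confirm that an ORTE-level param now shows up (the grep
target is just one example):

  ompi_info --param all all | grep orte_tmpdir_base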

Ralph


On 5/27/08 1:57 PM, "Jeff Squyres"  wrote:

> Oops, sorry.
> 
> We were having problems with the memory allocator when ompi_info
> called orte_init().  I think it might be best to call the ORTE MCA
> registration function directly...
> 
> 
> On May 27, 2008, at 10:40 AM, Ralph H Castain wrote:
> 
>> I see the problem (I think). A recent change was made to ompi_info
>> so it no
>> longer calls orte_init. As a result, none of the ORTE-level params
>> (i.e.,
>> those params registered outside of ORTE frameworks) are being
>> reported.
>> 
>> I'll chat with Jeff and see how we resolve the problem.
>> 
>> 
>> On 5/27/08 8:32 AM, "Ralph H Castain"  wrote:
>> 
>>> It "should" be visible nownot sure why it isn't. It conforms to
>>> the
>>> naming rules and -used- to be reported by ompi_info...
>>> 
>>> 
>>> 
>>> On 5/27/08 8:31 AM, "Shipman, Galen M."  wrote:
>>> 
 Make that "ompi_info".
 
 We need to make that visible via orte_info.
 I thought this was done at some point, perhaps it got overwritten?
 
 Thanks,
 
 Galen
 
 On May 27, 2008, at 10:27 AM, Ralph H Castain wrote:
 
> -mca orte_tmpdir_base foo
> 
> 
> 
> On 5/27/08 8:24 AM, "Gleb Natapov"  wrote:
> 
>> Hi,
>> 
>>  Is there a way to change where Open MPI creates session
>> directory. I
>> can't find mca parameter that specifies this.
>> 
>> --
>> Gleb.




[OMPI devel] SM BTL NUMA awareness patches

2008-05-28 Thread Gleb Natapov
Hi,

Attached are two patches that implement NUMA awareness in the SM BTL. The
first one adds two new functions to the maffinity framework that are required
by the second patch. The functions are:

 opal_maffinity_base_node_name_to_id() - takes a string that represents a
                                         memory node name and translates it
                                         to a memory node id.
 opal_maffinity_base_bind()            - binds an address range to a
                                         specific memory node.

The bind() function cannot be implemented by all maffinity components (there
is no way the first_use maffinity component can implement such
functionality). In that case the function pointer can be set to NULL.
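
For illustration, a sketch of how a caller might use the two wrappers (the
mbs_start_addr/mbs_len segment field names are assumed here, and error
handling is trimmed):

#include "opal/constants.h"
#include "opal/mca/maffinity/maffinity.h"
#include "opal/mca/maffinity/base/base.h"

/* Bind one address range to the memory node with the given name.  Returns
 * OPAL_ERR_NOT_IMPLEMENTED if the selected component (e.g. first_use)
 * cannot bind, as described above. */
static int bind_range_to_node(void *addr, size_t len, char *node_name)
{
    int rc, node_id;
    opal_maffinity_base_segment_t seg;

    rc = opal_maffinity_base_node_name_to_id(node_name, &node_id);
    if (OPAL_SUCCESS != rc) {
        return rc;   /* no component selected, or no node-name support */
    }

    seg.mbs_start_addr = addr;   /* assumed field names */
    seg.mbs_len = len;
    return opal_maffinity_base_bind(&seg, 1, node_id);
}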

The second patch adds NUMA awareness support to the SM BTL and SM MPOOL. Each
process determines which CPU it is running on and exchanges this info with
the other local processes. Each process creates a separate MPOOL for every
available memory node and uses them to allocate memory on a specific memory
node when needed. For instance, circular buffer memory is always allocated
on the memory node local to the receiver process.

To use this on a Linux machine, a carto file with the HW topology description
should be provided. Processes should be bound to specific CPUs (by specifying
a rank file, for instance), and the session directory should be created on a
tmpfs file system (otherwise Linux ignores memory binding commands) by
setting the orte_tmpdir_base parameter to point to a tmpfs mount point.
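
For example, an invocation along these lines (paths and the application name
are placeholders; the MCA parameter names for the carto file and the rank
file are assumptions that depend on the components in your build -- only
orte_tmpdir_base is confirmed above):

  mpirun -np 4 --mca carto_file_path ./carto.txt \
               --mca rmaps_rank_file_path ./rankfile \
               --mca orte_tmpdir_base /dev/shm ./my_numa_app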

Questions and suggestions are always welcome.

--
Gleb.
commit 883db5e1ce8c3b49cc1376e6acf9c2d5d0d77983
Author: Gleb Natapov 
Date:   Tue May 27 14:55:11 2008 +0300

Add functions to maffinity.

diff --git a/opal/mca/maffinity/base/base.h b/opal/mca/maffinity/base/base.h
index c44efed..339e6a1 100644
--- a/opal/mca/maffinity/base/base.h
+++ b/opal/mca/maffinity/base/base.h
@@ -105,6 +105,9 @@ OPAL_DECLSPEC int opal_maffinity_base_select(void);
  */
 OPAL_DECLSPEC int opal_maffinity_base_set(opal_maffinity_base_segment_t *segments, size_t num_segments);

+OPAL_DECLSPEC int opal_maffinity_base_node_name_to_id(char *, int *);
+OPAL_DECLSPEC int opal_maffinity_base_bind(opal_maffinity_base_segment_t *, size_t, int);
+
 /**
  * Shut down the maffinity MCA framework.
  *
diff --git a/opal/mca/maffinity/base/maffinity_base_wrappers.c b/opal/mca/maffinity/base/maffinity_base_wrappers.c
index ec843eb..eef5c7d 100644
--- a/opal/mca/maffinity/base/maffinity_base_wrappers.c
+++ b/opal/mca/maffinity/base/maffinity_base_wrappers.c
@@ -31,3 +31,33 @@ int opal_maffinity_base_set(opal_maffinity_base_segment_t *segments,
 }
 return opal_maffinity_base_module->maff_module_set(segments, num_segments);
 }
+
+int opal_maffinity_base_node_name_to_id(char *node_name, int *node_id)
+{
+if (!opal_maffinity_base_selected) {
+return OPAL_ERR_NOT_FOUND;
+}
+
+if (!opal_maffinity_base_module->maff_module_name_to_id) {
+*node_id = 0;
+return OPAL_ERR_NOT_IMPLEMENTED;
+}
+
+return opal_maffinity_base_module->maff_module_name_to_id(node_name,
+node_id);
+}
+
+int opal_maffinity_base_bind(opal_maffinity_base_segment_t *segments,
+size_t num_segments, int node_id)
+{
+if (!opal_maffinity_base_selected) {
+return OPAL_ERR_NOT_FOUND;
+}
+
+if (!opal_maffinity_base_module->maff_module_bind) {
+return OPAL_ERR_NOT_IMPLEMENTED;
+}
+
+return opal_maffinity_base_module->maff_module_bind(segments, num_segments,
+node_id);
+}
diff --git a/opal/mca/maffinity/first_use/maffinity_first_use_module.c b/opal/mca/maffinity/first_use/maffinity_first_use_module.c
index a68c2a9..0ae33e1 100644
--- a/opal/mca/maffinity/first_use/maffinity_first_use_module.c
+++ b/opal/mca/maffinity/first_use/maffinity_first_use_module.c
@@ -41,7 +41,9 @@ static const opal_maffinity_base_module_1_0_0_t loc_module = {
 first_use_module_init,

 /* Module function pointers */
-first_use_module_set
+first_use_module_set,
+NULL,
+NULL
 };

 int opal_maffinity_first_use_component_query(mca_base_module_t **module, int *priority)
diff --git a/opal/mca/maffinity/libnuma/maffinity_libnuma_module.c b/opal/mca/maffinity/libnuma/maffinity_libnuma_module.c
index 1fc2231..b2b109c 100644
--- a/opal/mca/maffinity/libnuma/maffinity_libnuma_module.c
+++ b/opal/mca/maffinity/libnuma/maffinity_libnuma_module.c
@@ -20,6 +20,7 @@

 #include 
 #include 
+#include 

 #include "opal/constants.h"
 #include "opal/mca/maffinity/maffinity.h"
@@ -33,6 +34,8 @@
 static int libnuma_module_init(void);
 static int libnuma_module_set(opal_maffinity_base_segment_t *segments,
   size_t num_segments);
+static int libnuma_module_node_name_to_id(char *, int *);
+static int libnuma_modules_bind(opal_maffinity_base_segment_t *, size_t, int);

 /*
  * Libnuma maffinity module
@@ -42,7 +45,9 @@ static const 

Re: [OMPI devel] mpirun hangs

2008-05-28 Thread Greg Watson
That fixed it, thanks. I wonder if this is the same problem I'm seeing  
for 1.2.x?


Greg

On May 27, 2008, at 10:34 PM, Ralph Castain wrote:

Aha! This is a problem that continues to bite us - it relates to the pty
problem in Mac OSX. Been a ton of chatter about this, but Mac doesn't seem
inclined to fix it.

Try configuring --disable-pty-support and see if that helps. FWIW, you will
find a platform file for Mac OSX in the trunk - I always build with it, and
have spent considerable time fine-tuning it. You configure with:

./configure --prefix=whatever --with-platform=contrib/platform/lanl/macosx-dynamic

In that directory, you will also find platform files for static builds under
both Tiger and Leopard (slight differences).

ralph


On 5/27/08 8:01 PM, "Greg Watson"  wrote:


Ralph,

I tried rolling back to 18513 but no luck. Steps:

$ ./autogen.sh
$ ./configure --prefix=/usr/local/openmpi-1.3-devel
$ make
$ make install
$ mpicc -g -o xxx xxx.c
$ mpirun -np 2 ./xxx
$ ps x
44832 s001  R+ 0:50.00 mpirun -np 2 ./xxx
44833 s001  S+ 0:00.03 ./xxx
$ gdb /usr/local/openmpi-1.3-devel/bin/mpirun
...
(gdb) attach 44832
Attaching to program: `/usr/local/openmpi-1.3-devel/bin/mpirun',
process 44832.
Reading symbols for shared libraries 
+.. done
0x9371b3dd in ioctl ()
(gdb) where
#0  0x9371b3dd in ioctl ()
#1  0x93754812 in grantpt ()
#2  0x9375470b in openpty ()
#3  0x001446d9 in opal_openpty ()
#4  0x000bf3bf in orte_iof_base_setup_prefork ()
#5  0x003da62f in odls_default_fork_local_proc (context=0x216a60,
child=0x216dd0, environ_copy=0x217930) at odls_default_module.c:191
#6  0x000c3e76 in orte_odls_base_default_launch_local ()
#7  0x003daace in orte_odls_default_launch_local_procs (data=0x216780)
    at odls_default_module.c:360
#8  0x000ad2f6 in process_commands (sender=0x216768, buffer=0x216780,
tag=1) at orted/orted_comm.c:441
#9  0x000acd52 in orte_daemon_cmd_processor (fd=-1, opal_event=1,
data=0x216750) at orted/orted_comm.c:346
#10 0x0012bd21 in event_process_active () at opal_object.h:498
#11 0x0012c3c5 in opal_event_base_loop () at opal_object.h:498
#12 0x0012bf8c in opal_event_loop () at opal_object.h:498
#13 0x0011b334 in opal_progress () at runtime/opal_progress.c:169
#14 0x000cd9b4 in orte_plm_base_report_launched () at opal_object.h:498
#15 0x000cc2b7 in orte_plm_base_launch_apps () at opal_object.h:498
#16 0x0003d626 in orte_plm_rsh_launch (jdata=0x200ae0) at
plm_rsh_module.c:1126
#17 0x2604 in orterun (argc=4, argv=0xb880) at orterun.c:549
#18 0x1bd6 in main (argc=4, argv=0xb880) at main.c:13

On May 27, 2008, at 9:11 PM, Ralph Castain wrote:


Yo Greg

I'm not seeing any problem on my Mac OSX - I'm running Leopard. Can
you tell
me how you configured, and the precise command you executed?

Thanks
Ralph



On 5/27/08 5:15 PM, "Ralph Castain"  wrote:


Hmmm...well, it was working about 3 hours ago! I'll try to take a
look
tonight, but it may be tomorrow.

Try rolling it back just a little to r18513 - that's the last rev I
tested
on my Mac.


On 5/27/08 5:00 PM, "Greg Watson"  wrote:

Something seems to be broken in the trunk for MacOS X. I can run a 1
process job, but a >1 process job hangs. It was working a few days ago.

Greg





Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-28 Thread Jeff Squyres
Ok.  With lots more off-list discussion, how's this pseudocode for a  
proposal:


  # Main assumption: if the kernel drivers are loaded, the user wants
  # RDMA hardware support in OMPI.

  $sysfsdir = ibv_get_sysfs_path();
  # Avoid printing "Fatal: couldn't read uverbs ABI version" message.
  if (! -r "$sysfsdir/class/infiniband_verbs/abi_version") {
      if ($always_want_to_see_warnings) {
          print "Warning: verbs ABI version unreadable\n";
      }
      return SKIP_THIS_BTL;
  }

  # If sysfs/class/infiniband does not exist, the driver was not started.
  # Therefore: assume that the user does not want RDMA hardware support --
  # do *not* print a warning message.
  if (! -d "$sysfsdir/class/infiniband") {
      if ($always_want_to_see_warnings) {
          print "Warning: $sysfsdir/class/infiniband does not exist\n";
      }
      return SKIP_THIS_BTL;
  }

  # If we get to this point, the drivers are loaded and therefore we will
  # assume that there is supposed to be at least one RDMA device present.
  # Warn if we don't find any.
  $list = ibv_get_device_list();
  if (empty($list)) {
      print "Warning: couldn't find any RDMA devices -- if you have no RDMA devices, stop the driver to avoid this warning message\n";
      return SKIP_THIS_BTL;
  }

  # ...continue with initialization; warnings and errors are
  # *always* displayed after this point

An overriding assumption here is that if the user requested *only* the  
openib BTL in OMPI and it fails to find any devices, OMPI will always  
print an error that it was unable to reach remote MPI peers  
(regardless of whether the default warning was previously printed or  
not).


Note that the two /sys checks may be redundant; I'm not entirely sure  
how the two files relate to each other.  libibverbs will complain  
about the first if it is not present; the second is used to indicate  
that the kernel drivers are loaded.




On May 26, 2008, at 5:10 AM, Manuel Prinz wrote:


On Saturday, 24.05.2008, at 17:30 +0200, Manuel Prinz wrote:

On Thursday, 22.05.2008, at 17:18 -0400, Jeff Squyres wrote:

Could you check with some of your other Debian maintainers?


I'm sorry that I can't check that before Monday! I'll let you know
then but I'm not aware of that.


I just checked on a box with no InfiniBand hardware: /dev/infiniband
*does not* exist. Loading the IB kernel modules *does not* create the
device. It seems like it only exists if the hardware is present.

Best regards
Manuel




--
Jeff Squyres
Cisco Systems