Re: [OMPI devel] Notes from mem hooks call today
On May 28, 2008, at 5:09 PM, Roland Dreier wrote:

>> I think Patrick's point is that it's not too much more expensive to do
>> the syscall on Linux vs just doing the cache lookup, particularly in
>> the context of a long message. And it means that upper layer protocols
>> like MPI don't have to deal with caches (and since MPI implementors
>> hate registration caches only slightly less than we hate MPI_CANCEL,
>> that will make us happy).
>
> Stick in a separate library then? I don't think we want the complexity
> in the kernel -- I personally would argue against merging it upstream;
> and given that the userspace solution is actually faster, it becomes
> pretty hard to justify.

If someone would like to pull registration cache into OFED, that would be great. But something tells me they won't want to. It's a pain, it screws up users, and it only works about 50% of the time. It's a support issue -- pushing it in a separate library doesn't help anyone unless someone's willing to handle the support. I sure as heck don't want to do the support anymore, particularly since OFED is the *ONLY* major software stack that requires such evil hacks. MX handles it at the lower layer. Portals is specified such that the hardware and/or Portals library must handle it (by specifying semantics that require registration per message). Quadrics (with tports) handles it in a combination of the kernel and library. TCP doesn't require pinning and/or registration.

Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI devel] Notes from mem hooks call today
> I think Patrick's point is that it's not too much more expensive to do
> the syscall on Linux vs just doing the cache lookup, particularly in
> the context of a long message. And it means that upper layer protocols
> like MPI don't have to deal with caches (and since MPI implementors
> hate registration caches only slightly less than we hate MPI_CANCEL,
> that will make us happy).

Stick in a separate library then? I don't think we want the complexity in the kernel -- I personally would argue against merging it upstream; and given that the userspace solution is actually faster, it becomes pretty hard to justify.
Re: [OMPI devel] Notes from mem hooks call today
On Wed, 28 May 2008, Roland Dreier wrote:

>> - gleb asks: don't we want to avoid the system call when possible?
>> - patrick: a single syscall can be/is cheaper than a reg cache
>>   lookup in user space
>
> This doesn't really make sense -- syscall + cache lookup in kernel is
> "obviously" more expensive than cache lookup in userspace with no
> context switch (I don't see any tricks the kernel can do that make the
> cache lookup cheaper there).
>
> However the solution I proposed a long time ago (when Pete Wyckoff
> originally did his work on having the kernel track this -- and as a
> side note, it's not clear to me whether MMU notifiers really help what
> Pete did) is for userspace to provide a pointer to a flag when
> registering memory with the kernel, and then the kernel can mark the
> flag if the mapping changes -- ie keep the userspace cache but have
> the kernel manage invalidation "perfectly" without any malloc hooks.

I think Patrick's point is that it's not too much more expensive to do the syscall on Linux vs just doing the cache lookup, particularly in the context of a long message. And it means that upper layer protocols like MPI don't have to deal with caches (and since MPI implementors hate registration caches only slightly less than we hate MPI_CANCEL, that will make us happy).

Brian
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
On May 28, 2008, at 8:02 AM, Jeff Squyres wrote:

> Note that the two /sys checks may be redundant; I'm not entirely sure
> how the two files relate to each other. libibverbs will complain about
> the first if it is not present; the second is used to indicate that
> the kernel drivers are loaded.

I got some more feedback from Roland off-list explaining that if /sys/class/infiniband does exist and is non-empty and /sys/class/infiniband_verbs/abi_version does not exist, then this is definitely a case where we want to warn, because it implies that the config is screwed up -- RDMA devices are present but not usable. In this case, I think the warning that libibverbs itself prints is suitable ("Fatal: couldn't read..."). So let's just eliminate that check in OMPI and go with something like the following (pretty much exactly what was proposed a while ago by Pasha :-) ):

    # If sysfs/class/infiniband does not exist, the driver was not
    # started.  Therefore: assume that the user does not want RDMA
    # hardware support -- do *not* print a warning message.
    if (! -d "$sysfsdir/class/infiniband") {
        if ($always_want_to_see_warnings) {
            print "Warning: $sysfsdir/class/infiniband does not exist\n";
        }
        return SKIP_THIS_BTL;
    }

    # If we get to this point, the drivers are loaded and therefore we
    # will assume that there is supposed to be at least one RDMA device
    # present.  Warn if we don't find any.
    $list = ibv_get_device_list();
    if (empty($list)) {
        print "Warning: couldn't find any RDMA devices -- if you have no RDMA devices, stop the driver to avoid this warning message\n";
        return SKIP_THIS_BTL;
    }

    # ...continue with initialization; warnings and errors are
    # *always* displayed after this point

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] Notes from mem hooks call today
> - gleb asks: don't we want to avoid the system call when possible?
> - patrick: a single syscall can be/is cheaper than a reg cache
>   lookup in user space

This doesn't really make sense -- syscall + cache lookup in kernel is "obviously" more expensive than cache lookup in userspace with no context switch (I don't see any tricks the kernel can do that make the cache lookup cheaper there).

However the solution I proposed a long time ago (when Pete Wyckoff originally did his work on having the kernel track this -- and as a side note, it's not clear to me whether MMU notifiers really help what Pete did) is for userspace to provide a pointer to a flag when registering memory with the kernel, and then the kernel can mark the flag if the mapping changes -- ie keep the userspace cache but have the kernel manage invalidation "perfectly" without any malloc hooks.

- R.
Re: [OMPI devel] Open MPI session directory location
After chatting with Jeff to better understand the ompi_info issue, I consolidated all the ORTE-level MCA param registrations that are relevant to users and had ompi_info call it. You will now see them displayed by ompi_info.

Ralph

On 5/27/08 1:57 PM, "Jeff Squyres" wrote:

> Oops, sorry.
>
> We were having problems with the memory allocator when ompi_info
> called orte_init(). I think it might be best to call the ORTE MCA
> registration function directly...
>
> On May 27, 2008, at 10:40 AM, Ralph H Castain wrote:
>
>> I see the problem (I think). A recent change was made to ompi_info so
>> it no longer calls orte_init. As a result, none of the ORTE-level
>> params (i.e., those params registered outside of ORTE frameworks) are
>> being reported.
>>
>> I'll chat with Jeff and see how we resolve the problem.
>>
>> On 5/27/08 8:32 AM, "Ralph H Castain" wrote:
>>
>>> It "should" be visible now... not sure why it isn't. It conforms to
>>> the naming rules and -used- to be reported by ompi_info...
>>>
>>> On 5/27/08 8:31 AM, "Shipman, Galen M." wrote:
>>>
>>>> Make that "ompi_info". We need to make that visible via orte_info.
>>>> I thought this was done at some point, perhaps it got overwritten?
>>>>
>>>> Thanks,
>>>> Galen
>>>>
>>>> On May 27, 2008, at 10:27 AM, Ralph H Castain wrote:
>>>>
>>>>> -mca orte_tmpdir_base foo
>>>>>
>>>>> On 5/27/08 8:24 AM, "Gleb Natapov" wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Is there a way to change where Open MPI creates the session
>>>>>> directory? I can't find the mca parameter that specifies this.
>>>>>>
>>>>>> --
>>>>>> Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] SM BTL NUMA awareness patches
Hi,

The attached two patches implement NUMA awareness in the SM BTL.

The first one adds two new functions to the maffinity framework, required by the second patch. The functions are:

  opal_maffinity_base_node_name_to_id() - takes a string that represents
  a memory node name and translates it to a memory node id.

  opal_maffinity_base_bind() - binds an address range to a specific
  memory node.

The bind() function cannot be implemented by all maffinity components (there is no way the first_use maffinity component can implement such functionality). In that case the function can be set to NULL.

The second one adds NUMA awareness support to the SM BTL and SM MPOOL. Each process determines what CPU it is running on and exchanges this info with the other local processes. Each process creates a separate MPOOL for every memory node available and uses them to allocate memory on specific memory nodes when needed. For instance, circular buffer memory is always allocated on the memory node local to the receiver process.

To use this on a Linux machine, a carto file with the HW topology description should be provided, processes should be bound to specific CPUs (by specifying a rank file, for instance), and the session directory should be created on a tmpfs file system (otherwise Linux ignores memory binding commands) by setting the orte_tmpdir_base parameter to point to a tmpfs mount point.

Questions and suggestions are always welcome.

--
Gleb.

commit 883db5e1ce8c3b49cc1376e6acf9c2d5d0d77983
Author: Gleb Natapov
Date:   Tue May 27 14:55:11 2008 +0300

    Add functions to maffinity.
diff --git a/opal/mca/maffinity/base/base.h b/opal/mca/maffinity/base/base.h
index c44efed..339e6a1 100644
--- a/opal/mca/maffinity/base/base.h
+++ b/opal/mca/maffinity/base/base.h
@@ -105,6 +105,9 @@ OPAL_DECLSPEC int opal_maffinity_base_select(void);
  */
 OPAL_DECLSPEC int opal_maffinity_base_set(opal_maffinity_base_segment_t *segments, size_t num_segments);
 
+OPAL_DECLSPEC int opal_maffinity_base_node_name_to_id(char *, int *);
+OPAL_DECLSPEC int opal_maffinity_base_bind(opal_maffinity_base_segment_t *, size_t, int);
+
 /**
  * Shut down the maffinity MCA framework.
  *
diff --git a/opal/mca/maffinity/base/maffinity_base_wrappers.c b/opal/mca/maffinity/base/maffinity_base_wrappers.c
index ec843eb..eef5c7d 100644
--- a/opal/mca/maffinity/base/maffinity_base_wrappers.c
+++ b/opal/mca/maffinity/base/maffinity_base_wrappers.c
@@ -31,3 +31,33 @@ int opal_maffinity_base_set(opal_maffinity_base_segment_t *segments,
     }
     return opal_maffinity_base_module->maff_module_set(segments, num_segments);
 }
+
+int opal_maffinity_base_node_name_to_id(char *node_name, int *node_id)
+{
+    if (!opal_maffinity_base_selected) {
+        return OPAL_ERR_NOT_FOUND;
+    }
+
+    if (!opal_maffinity_base_module->maff_module_name_to_id) {
+        *node_id = 0;
+        return OPAL_ERR_NOT_IMPLEMENTED;
+    }
+
+    return opal_maffinity_base_module->maff_module_name_to_id(node_name,
+                                                              node_id);
+}
+
+int opal_maffinity_base_bind(opal_maffinity_base_segment_t *segments,
+                             size_t num_segments, int node_id)
+{
+    if (!opal_maffinity_base_selected) {
+        return OPAL_ERR_NOT_FOUND;
+    }
+
+    if (!opal_maffinity_base_module->maff_module_bind) {
+        return OPAL_ERR_NOT_IMPLEMENTED;
+    }
+
+    return opal_maffinity_base_module->maff_module_bind(segments, num_segments,
+                                                        node_id);
+}
diff --git a/opal/mca/maffinity/first_use/maffinity_first_use_module.c b/opal/mca/maffinity/first_use/maffinity_first_use_module.c
index a68c2a9..0ae33e1 100644
--- a/opal/mca/maffinity/first_use/maffinity_first_use_module.c
+++ b/opal/mca/maffinity/first_use/maffinity_first_use_module.c
@@ -41,7 +41,9 @@ static const opal_maffinity_base_module_1_0_0_t loc_module = {
     first_use_module_init,
 
     /* Module function pointers */
-    first_use_module_set
+    first_use_module_set,
+    NULL,
+    NULL
 };
 
 int opal_maffinity_first_use_component_query(mca_base_module_t **module, int *priority)
diff --git a/opal/mca/maffinity/libnuma/maffinity_libnuma_module.c b/opal/mca/maffinity/libnuma/maffinity_libnuma_module.c
index 1fc2231..b2b109c 100644
--- a/opal/mca/maffinity/libnuma/maffinity_libnuma_module.c
+++ b/opal/mca/maffinity/libnuma/maffinity_libnuma_module.c
@@ -20,6 +20,7 @@
 
 #include
 #include
+#include
 
 #include "opal/constants.h"
 #include "opal/mca/maffinity/maffinity.h"
@@ -33,6 +34,8 @@ static int libnuma_module_init(void);
 
 static int libnuma_module_set(opal_maffinity_base_segment_t *segments,
                               size_t num_segments);
+static int libnuma_module_node_name_to_id(char *, int *);
+static int libnuma_modules_bind(opal_maffinity_base_segment_t *, size_t, int);
 
 /*
  * Libnuma maffinity module
@@ -42,7 +45,9 @@ static const
Re: [OMPI devel] mpirun hangs
That fixed it, thanks. I wonder if this is the same problem I'm seeing for 1.2.x?

Greg

On May 27, 2008, at 10:34 PM, Ralph Castain wrote:

> Aha! This is a problem that continues to bite us - it relates to the
> pty problem in Mac OSX. Been a ton of chatter about this, but Mac
> doesn't seem inclined to fix it. Try configuring --disable-pty-support
> and see if that helps.
>
> FWIW, you will find a platform file for Mac OSX in the trunk - I
> always build with it, and have spent considerable time fine-tuning it.
> You configure with:
>
>   ./configure --prefix=whatever --with-platform=contrib/platform/lanl/macosx-dynamic
>
> In that directory, you will also find platform files for static builds
> under both Tiger and Leopard (slight differences).
>
> ralph
>
> On 5/27/08 8:01 PM, "Greg Watson" wrote:
>
>> Ralph,
>>
>> I tried rolling back to 18513 but no luck. Steps:
>>
>> $ ./autogen.sh
>> $ ./configure --prefix=/usr/local/openmpi-1.3-devel
>> $ make
>> $ make install
>> $ mpicc -g -o xxx xxx.c
>> $ mpirun -np 2 ./xxx
>> $ ps x
>> 44832 s001  R+  0:50.00 mpirun -np 2 ./xxx
>> 44833 s001  S+  0:00.03 ./xxx
>> $ gdb /usr/local/openmpi-1.3-devel/bin/mpirun
>> ...
>> (gdb) attach 44832
>> Attaching to program: `/usr/local/openmpi-1.3-devel/bin/mpirun', process 44832.
>> Reading symbols for shared libraries +.. done
>> 0x9371b3dd in ioctl ()
>> (gdb) where
>> #0  0x9371b3dd in ioctl ()
>> #1  0x93754812 in grantpt ()
>> #2  0x9375470b in openpty ()
>> #3  0x001446d9 in opal_openpty ()
>> #4  0x000bf3bf in orte_iof_base_setup_prefork ()
>> #5  0x003da62f in odls_default_fork_local_proc (context=0x216a60, child=0x216dd0, environ_copy=0x217930) at odls_default_module.c:191
>> #6  0x000c3e76 in orte_odls_base_default_launch_local ()
>> #7  0x003daace in orte_odls_default_launch_local_procs (data=0x216780) at odls_default_module.c:360
>> #8  0x000ad2f6 in process_commands (sender=0x216768, buffer=0x216780, tag=1) at orted/orted_comm.c:441
>> #9  0x000acd52 in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x216750) at orted/orted_comm.c:346
>> #10 0x0012bd21 in event_process_active () at opal_object.h:498
>> #11 0x0012c3c5 in opal_event_base_loop () at opal_object.h:498
>> #12 0x0012bf8c in opal_event_loop () at opal_object.h:498
>> #13 0x0011b334 in opal_progress () at runtime/opal_progress.c:169
>> #14 0x000cd9b4 in orte_plm_base_report_launched () at opal_object.h:498
>> #15 0x000cc2b7 in orte_plm_base_launch_apps () at opal_object.h:498
>> #16 0x0003d626 in orte_plm_rsh_launch (jdata=0x200ae0) at plm_rsh_module.c:1126
>> #17 0x2604 in orterun (argc=4, argv=0xb880) at orterun.c:549
>> #18 0x1bd6 in main (argc=4, argv=0xb880) at main.c:13
>>
>> On May 27, 2008, at 9:11 PM, Ralph Castain wrote:
>>
>>> Yo Greg
>>>
>>> I'm not seeing any problem on my Mac OSX - I'm running Leopard. Can
>>> you tell me how you configured, and the precise command you executed?
>>>
>>> Thanks
>>> Ralph
>>>
>>> On 5/27/08 5:15 PM, "Ralph Castain" wrote:
>>>
>>>> Hmmm...well, it was working about 3 hours ago! I'll try to take a
>>>> look tonight, but it may be tomorrow. Try rolling it back just a
>>>> little to r18513 - that's the last rev I tested on my Mac.
>>>>
>>>> On 5/27/08 5:00 PM, "Greg Watson" wrote:
>>>>
>>>>> Something seems to be broken in the trunk for MacOS X. I can run a
>>>>> 1 process job, but a >1 process job hangs. It was working a few
>>>>> days ago.
>>>>>
>>>>> Greg
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
Ok. With lots more off-list discussion, how's this pseudocode for a proposal:

    # Main assumption: if the kernel drivers are loaded, the user wants
    # RDMA hardware support in OMPI.
    $sysfsdir = ibv_get_sysfs_path();

    # Avoid printing the "Fatal: couldn't read uverbs ABI version" message.
    if (! -r "$sysfsdir/class/infiniband_verbs/abi_version") {
        if ($always_want_to_see_warnings) {
            print "Warning: verbs ABI version unreadable\n";
        }
        return SKIP_THIS_BTL;
    }

    # If sysfs/class/infiniband does not exist, the driver was not
    # started.  Therefore: assume that the user does not want RDMA
    # hardware support -- do *not* print a warning message.
    if (! -d "$sysfsdir/class/infiniband") {
        if ($always_want_to_see_warnings) {
            print "Warning: $sysfsdir/class/infiniband does not exist\n";
        }
        return SKIP_THIS_BTL;
    }

    # If we get to this point, the drivers are loaded and therefore we
    # will assume that there is supposed to be at least one RDMA device
    # present.  Warn if we don't find any.
    $list = ibv_get_device_list();
    if (empty($list)) {
        print "Warning: couldn't find any RDMA devices -- if you have no RDMA devices, stop the driver to avoid this warning message\n";
        return SKIP_THIS_BTL;
    }

    # ...continue with initialization; warnings and errors are
    # *always* displayed after this point

An overriding assumption here is that if the user requested *only* the openib BTL in OMPI and it fails to find any devices, OMPI will always print an error that it was unable to reach remote MPI peers (regardless of whether the default warning was previously printed or not).

Note that the two /sys checks may be redundant; I'm not entirely sure how the two files relate to each other. libibverbs will complain about the first if it is not present; the second is used to indicate that the kernel drivers are loaded.
On May 26, 2008, at 5:10 AM, Manuel Prinz wrote:

> On Saturday, 24.05.2008, at 17:30 +0200, Manuel Prinz wrote:
>
>> On Thursday, 22.05.2008, at 17:18 -0400, Jeff Squyres wrote:
>>
>>> Could you check with some of your other Debian maintainers?
>>
>> I'm sorry that I can't check that before Monday! I'll let you know
>> then, but I'm not aware of that.
>
> I just checked on a box with no InfiniBand hardware: /dev/infiniband
> *does not* exist. Loading the IB kernel modules *does not* create the
> device. It seems like it only exists if the hardware is present.
>
> Best regards
> Manuel

--
Jeff Squyres
Cisco Systems