Re: [OMPI devel] [PATCH] Not optimal SRQ resource allocation
SRQ hardware vendors -- please review and reply... More below. On Dec 2, 2009, at 10:20 AM, Vasily Philipov wrote: > diff -r a5938d9dcada ompi/mca/btl/openib/btl_openib.c > --- a/ompi/mca/btl/openib/btl_openib.cMon Nov 23 19:00:16 2009 -0800 > +++ b/ompi/mca/btl/openib/btl_openib.cWed Dec 02 16:24:55 2009 +0200 > @@ -214,6 +214,7 @@ > static int create_srq(mca_btl_openib_module_t *openib_btl) > { > int qp; > +int32_t rd_num, rd_curr_num; > > /* create the SRQ's */ > for(qp = 0; qp < mca_btl_openib_component.num_qps; qp++) { > @@ -242,6 +243,24 @@ > > ibv_get_device_name(openib_btl->device->ib_dev)); > return OMPI_ERROR; > } > + > +rd_num = mca_btl_openib_component.qp_infos[qp].rd_num; > +rd_curr_num = openib_btl->qps[qp].u.srq_qp.rd_curr_num = > mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_init; > + > +if(true == mca_btl_openib_component.enable_srq_resize) { > +if(0 == rd_curr_num) { > +openib_btl->qps[qp].u.srq_qp.rd_curr_num = 1; > +} > + > +openib_btl->qps[qp].u.srq_qp.rd_low_local = rd_curr_num - > (rd_curr_num >> 2); > +openib_btl->qps[qp].u.srq_qp.srq_limit_event_flag = true; > +} else { > +openib_btl->qps[qp].u.srq_qp.rd_curr_num = rd_num; > +openib_btl->qps[qp].u.srq_qp.rd_low_local = > mca_btl_openib_component.qp_infos[qp].rd_low; > +/* Not used in this case, but we don't need a garbage */ > +mca_btl_openib_component.qp_infos[qp].u.srq_qp.srq_limit = 0; > +openib_btl->qps[qp].u.srq_qp.srq_limit_event_flag = false; > +} > } > } > > diff -r a5938d9dcada ompi/mca/btl/openib/btl_openib.h > --- a/ompi/mca/btl/openib/btl_openib.hMon Nov 23 19:00:16 2009 -0800 > +++ b/ompi/mca/btl/openib/btl_openib.hWed Dec 02 16:24:55 2009 +0200 > @@ -87,6 +87,12 @@ > > struct mca_btl_openib_srq_qp_info_t { > int32_t sd_max; > +/* The init value for rd_curr_num variables of all SRQs */ > +int32_t rd_init; > +/* The watermark, threshold - if the number of WQEs in SRQ is less then > this value => > + the SRQ limit event (IBV_EVENT_SRQ_LIMIT_REACHED) will be generated > on corresponding SRQ. > + As result the maximal number of pre-posted WQEs on the SRQ will be > increased */ > +int32_t srq_limit; > }; typedef struct mca_btl_openib_srq_qp_info_t mca_btl_openib_srq_qp_info_t; > > struct mca_btl_openib_qp_info_t { > @@ -254,6 +260,8 @@ > ompi_free_list_t recv_user_free; > /**< frags for coalesced massages */ > ompi_free_list_t send_free_coalesced; > +/**< Whether we want a dynamically resizing srq, enabled by default */ > +bool enable_srq_resize; /**< means that the comment refers to the field above. I think you mean /* or /** here (although we haven't used doxygen for a long, long time). I see that other fields are incorrectly marked /**<, but please don't propagate the badness. ;-) > }; typedef struct mca_btl_openib_component_t mca_btl_openib_component_t; > > OMPI_MODULE_DECLSPEC extern mca_btl_openib_component_t > mca_btl_openib_component; > @@ -348,6 +356,16 @@ > int32_t sd_credits; /* the max number of outstanding sends on a QP when > using SRQ */ > /* i.e. the number of frags that can be > outstanding (down counter) */ > opal_list_t pending_frags[2];/**< list of high/low prio frags */ > +/**< The number of max rd that we can post in the current time. > + The value may be increased in the IBV_EVENT_SRQ_LIMIT_REACHED > + event handler. The value starts from (rd_num / 4) and increased up > to rd_num */ > +int32_t rd_curr_num; The comment says "max", but the field name is "curr"[ent]? Seems a little odd. > +/**< We post additional WQEs only if a number of WQEs (in specific SRQ) > is less of this value. > + The value increased together with rd_curr_num. The value is unique > for every SRQ. */ > +int32_t rd_low_local; > +/**< The flag points if we want to get the > + IBV_EVENT_SRQ_LIMIT_REACHED events for dynamically resizing SRQ */ > +bool srq_limit_event_flag; Can you explain how this is different than enable_srq_resize? > }; typedef struct mca_btl_openib_module_srq_qp_t > mca_btl_openib_module_srq_qp_t; > > struct mca_btl_openib_module_qp_t { > diff -r a5938d9dcada ompi/mca/btl/openib/btl_openib_async.c > --- a/ompi/mca/btl/openib/btl_openib_async.c Mon Nov 23 19:00:16 2009 -0800 > +++ b/ompi/mca/btl/openib/btl_openib_async.c Wed Dec 02 16:24:55 2009 +0200 > @@ -1,5 +1,5 @@ > /* > - * Copyright (c) 2008 Mellanox Technologies. All rights reserved. > + * Copyright (c) 2008-2009 Mellanox Technologies. All rights reserved. > * Copyright (c) 2007-2009 Cisco Systems, Inc. All rights
Re: [OMPI devel] Crash when using MPI_REAL8
There is definetly something wrong in types. OMPI_DATATYPE_MAX_PREDEFINED is set to 45, while there are 55 predefined types. When accessing ompi_op_ddt_map[ddt->id] with MPI_REAL8 (ddt->id=54), we're reading the ompi_mpi_op_bxor struct. Depending on various things (padding, uninitialized memory), we may get 0 and not crash. If you're not lucky, you get a random value and crash soon afterwards. So, I extended things a bit and it seems to fix my problem. I'm not sure all types are now handled, I just added some that are not defined. Sylvain diff -r e82b914000bd -r 1a40aee2925c ompi/datatype/ompi_datatype.h --- a/ompi/datatype/ompi_datatype.h Thu Dec 03 04:46:31 2009 + +++ b/ompi/datatype/ompi_datatype.h Fri Dec 04 19:59:26 2009 +0100 @@ -57,7 +57,7 @@ #define OMPI_DATATYPE_FLAG_DATA_FORTRAN 0xC000 #define OMPI_DATATYPE_FLAG_DATA_LANGUAGE 0xC000 -#define OMPI_DATATYPE_MAX_PREDEFINED 45 +#define OMPI_DATATYPE_MAX_PREDEFINED 55 #if OMPI_DATATYPE_MAX_PREDEFINED > OPAL_DATATYPE_MAX_SUPPORTED #error Need to increase the number of supported dataypes by OPAL (value OPAL_DATATYPE_MAX_SUPPORTED). diff -r e82b914000bd -r 1a40aee2925c ompi/op/op.c --- a/ompi/op/op.c Thu Dec 03 04:46:31 2009 + +++ b/ompi/op/op.c Fri Dec 04 19:59:26 2009 +0100 @@ -137,6 +137,14 @@ ompi_op_ddt_map[OMPI_DATATYPE_MPI_2INTEGER] = OMPI_OP_BASE_TYPE_2INTEGER; ompi_op_ddt_map[OMPI_DATATYPE_MPI_LONG_DOUBLE_INT] = OMPI_OP_BASE_TYPE_LONG_DOUBLE_INT; ompi_op_ddt_map[OMPI_DATATYPE_MPI_WCHAR] = OMPI_OP_BASE_TYPE_WCHAR; +ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER2] = OMPI_OP_BASE_TYPE_INTEGER2; +ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER4] = OMPI_OP_BASE_TYPE_INTEGER4; +ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER8] = OMPI_OP_BASE_TYPE_INTEGER8; +ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER16] = OMPI_OP_BASE_TYPE_INTEGER16; +ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL2] = OMPI_OP_BASE_TYPE_REAL2; +ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL4] = OMPI_OP_BASE_TYPE_REAL4; +ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL8] = OMPI_OP_BASE_TYPE_REAL8; +ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL16] = OMPI_OP_BASE_TYPE_REAL16; /* Create the intrinsic ops */ diff -r e82b914000bd -r 1a40aee2925c opal/datatype/opal_datatype.h --- a/opal/datatype/opal_datatype.h Thu Dec 03 04:46:31 2009 + +++ b/opal/datatype/opal_datatype.h Fri Dec 04 19:59:26 2009 +0100 @@ -56,7 +56,7 @@ * * XXX TODO Adapt to whatever the OMPI-layer needs */ -#define OPAL_DATATYPE_MAX_SUPPORTED 46 +#define OPAL_DATATYPE_MAX_SUPPORTED 56 /* flags for the datatypes. */ On Fri, 4 Dec 2009, Sylvain Jeaugey wrote: For the record, and to try to explain why all MTT tests may have missed this "bug", configuring without --enable-debug makes the bug disappear. Still trying to figure out why. Sylvain On Thu, 3 Dec 2009, Sylvain Jeaugey wrote: Hi list, I hope this time I won't be the only one to suffer this bug :) It is very simple indeed, just perform an allreduce with MPI_REAL8 (fortran) and you should get a crash in ompi/op/op.h:411. Tested with trunk and v1.5, working fine on v1.3. From what I understand, in the trunk, MPI_REAL8 has now a fixed id (in ompi/datatype/ompi_datatype_internal.h), but operations do not have an index going as far as 54 (0x36), leading to a crash when looking for op->o_func.intrinsic.fns[ompi_op_ddt_map[ddt->id]] in ompi_op_is_valid() (or, if I disable mpi_param_check, in ompi_op_reduce()). Here is a reproducer, just in case : program main use mpi integer ierr real(8) myreal, realsum call MPI_INIT(ierr) call MPI_ALLREDUCE(myreal, realsum, 1, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, ierr) call MPI_FINALIZE(ierr) stop end Has anyone an idea on how to fix this ? Or am I doing something wrong ? Thanks for any help, Sylvain ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] MPI_Request teleconf next week
To discuss Brian's MPI_Request RFC and the improvements he did with ROMIO's request handling recently. Webex Meeting: Date/Time: Thursday, December 10, 2009 11:00:00 AM ET To join this meeting as an attendee, click the following url: https://cisco.webex.com/cisco/j.php?ED=130394562=484033237 The Meeting Password is: -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] Crash when using MPI_REAL8
For the record, and to try to explain why all MTT tests may have missed this "bug", configuring without --enable-debug makes the bug disappear. Still trying to figure out why. Sylvain On Thu, 3 Dec 2009, Sylvain Jeaugey wrote: Hi list, I hope this time I won't be the only one to suffer this bug :) It is very simple indeed, just perform an allreduce with MPI_REAL8 (fortran) and you should get a crash in ompi/op/op.h:411. Tested with trunk and v1.5, working fine on v1.3. From what I understand, in the trunk, MPI_REAL8 has now a fixed id (in ompi/datatype/ompi_datatype_internal.h), but operations do not have an index going as far as 54 (0x36), leading to a crash when looking for op->o_func.intrinsic.fns[ompi_op_ddt_map[ddt->id]] in ompi_op_is_valid() (or, if I disable mpi_param_check, in ompi_op_reduce()). Here is a reproducer, just in case : program main use mpi integer ierr real(8) myreal, realsum call MPI_INIT(ierr) call MPI_ALLREDUCE(myreal, realsum, 1, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, ierr) call MPI_FINALIZE(ierr) stop end Has anyone an idea on how to fix this ? Or am I doing something wrong ? Thanks for any help, Sylvain
Re: [hwloc-devel] hwloc-bind syntax
On Dec 4, 2009, at 5:36 AM, Brice Goglin wrote: > > It might be good to safely ignore 0x if it's present, but that's a small > > feature enhancement that can be done at any time (I filed a future ticket). > > It seems to work actually :) Hmm -- I don't think so...? "0x1" can't pass this test in hwloc_mask_process_arg(): } else if (strlen(arg) == strspn(arg, "0123456789abcdefABCDEF,")) { In my tests, it's falling through to the "err = -1" case, but just not printing out an error. Even more fun -- note the lack of error shown, and the lack of "ls" output, except for when we specify -v: [8:33] rtp-jsquyres-8711:~/svn/hwloc % ./utils/hwloc-bind 0x1 ls [8:33] rtp-jsquyres-8711:~/svn/hwloc % ./utils/hwloc-bind -v 0x1 ls assuming the command starts at 0x1 execvp: No such file or directory - If think that if execvp() fails, we should *always* print an error, not just if -v was specified. I'll fix. > > Linux is likely to be among the most popular target for hwloc -- so can you > > explain in good words definitions for the following: [snipped] Thanks. > > Additionally -- the word "father" is used in the docs. Should we use the > > gender-neutral "parent" instead? > > I am not sure. The object structure contains a father pointer. We use > parent in the API, but it might refer to different things, like father, > grandfather, ... FWIW, the english word "parent" definitely refers to the immediate ancestor. It does *not* refer to grandparents or great-grandparents, etc. > > What I meant by my question was -- aren't the 3 diagrams above equivalent > > to "core:6"? If so, what's the value of the foo.bar.baz notation? > > If you have a 96 core machine like we do, the hierarchical notation > (foo.bar.baz) is really nice. If I want to bind on > node:2.socket:3.core:4, it's much easier than looking at the topology > and finding that it's core:70. Ah, ok. Fair enough. > Using physical or logical indexes doesn't > change anything here. I agree that we don't do that often in real > applications, but I actually use that quite a lot for my own debugging :) Another good reason. :-) > I actually don't see why people would like to use physical numbers in > such a hierarchical notation since physical socket/core numbers are > often strange/illogical and nobody remembers them. However, I agree that > the physical indexes are useful when *not* using a hierarchical > notation, ie I want to bind on thread OS index #46. As a server vendor, using physical/OS indexes is actually quite useful to me (e.g., to ensure that the hardware and OS are playing nicely). My point is that everyone has a different view here -- we should just support both. IMHO, the common case is logical indexes -- so let's make those the default. But there are definitely cases where physical indexes are useful as well. -- Jeff Squyres jsquy...@cisco.com
Re: [hwloc-devel] hwloc-bind syntax
On Dec 4, 2009, at 5:32 AM, Ashley Pittman wrote: > > It might be good to safely ignore 0x if it's present, but that's a small > > feature enhancement that can be done at any time (I filed a future ticket). > > Maybe not relevant but it bit me so I'll say it here, using "%x" with > sscanf on a string of "0x1" will match the whole thing and give a value > of 1 on Linux but on Solaris it'll match the "0" as a hex value of 0 and > not match the "x1" at all leading to further errors in subsequent > matches as well. The most annoying thing is that sscanf() thinks it's > matched and it's return code will be set accordingly. Yuck! Thankfully, we don't appear to be using sscanf() to convert the cpuset strings. -- Jeff Squyres jsquy...@cisco.com
Re: [hwloc-devel] hwloc-bind again
On Dec 4, 2009, at 1:09 AM, Brice Goglin wrote: > > shell$ hwloc-bind > > > > (i.e., invoking hwloc-bind with no arguments) > > > > returns an exit status of 0. Shouldn't it return non-zero? > > Yeah maybe I'm going to interpret that as "Hell yes! Please implement. THANKS!!!" ;-) -- Jeff Squyres jsquy...@cisco.com
Re: [hwloc-devel] Disabling X component
On Fri, 2009-12-04 at 13:04 +0100, Samuel Thibault wrote: > Ashley Pittman, le Fri 04 Dec 2009 11:06:12 +, a écrit : > > The debian version of -.txt (lstopo 0.9.3rc1) leaves my terminal with > > the colours inverted after I call it, I have to do a reset to get back > > to black on grey background. > > Uh, odd. Which terminal are you using? gnome-terminal with $TERM set to xterm. I've not done anything special with this, it's just a debian unstable install. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
Re: [hwloc-devel] Disabling X component
Ashley Pittman, le Fri 04 Dec 2009 11:06:12 +, a écrit : > The debian version of -.txt (lstopo 0.9.3rc1) leaves my terminal with > the colours inverted after I call it, I have to do a reset to get back > to black on grey background. Uh, odd. Which terminal are you using? Samuel
Re: [hwloc-devel] Disabling X component
Ashley Pittman wrote: > I installed the debian package of hwloc yesterday and discovered that > the default action of lstopo is to display a window with a picture in. > I guess I don't have the right development packages installed for this > to be enabled in my local build. > > In my tool I want to ensure the text version is displayed, padb popping > up a number of windows isn't what people will expect or want. Obviously > I can unset DISPLAY before calling lstopo but a --no-x or --text-based > option would be a nice thing to have as well. > lstopo - tells lstopo to output on /dev/stdout Brice
[hwloc-devel] Disabling X component
I installed the debian package of hwloc yesterday and discovered that the default action of lstopo is to display a window with a picture in. I guess I don't have the right development packages installed for this to be enabled in my local build. In my tool I want to ensure the text version is displayed, padb popping up a number of windows isn't what people will expect or want. Obviously I can unset DISPLAY before calling lstopo but a --no-x or --text-based option would be a nice thing to have as well. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
Re: [hwloc-devel] hwloc-bind syntax
Jeff Squyres wrote: > It might be good to safely ignore 0x if it's present, but that's a small > feature enhancement that can be done at any time (I filed a future ticket). > It seems to work actually :) >> We might want to drop the Linux "cpuset" word and use "cgroup" instead. >> Both are supported by Linux, but the latter now contains the former and >> more, so people are supposed to use cgroup now. hwloc supports both. >> > > Linux is likely to be among the most popular target for hwloc -- so can you > explain in good words definitions for the following: > > - hwloc cpuset > Opaque structure describing a set of logical processors. Each hwloc object structure contains a cpuset field that describes which logical processors are contained in the corresponding physical object. hwloc cpusets are used by hwloc binding routines. > - Linux cpuset > - Linux cgroup > See http://www.mjmwired.net/kernel/Documentation/cgroups.txt, and look for cpusets in there: Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour. [...] On their own, the only use for cgroups is for simple job tracking. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/limiting the resources which processes in a cgroup can access. For example, cpusets allows you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup. > Additionally -- the word "father" is used in the docs. Should we use the > gender-neutral "parent" instead? > I am not sure. The object structure contains a father pointer. We use parent in the API, but it might refer to different things, like father, grandfather, ... >> You don't care about starting with system or something else. You can >> ignore the system level as you could ignore the socket level between >> nodes and cores. >> >> If you have 1 system with 2 nodes with 2 sockets each with 2 cores each, >> you get: >> node:1 core:2 is equivalent to system:0 node:1 socket:2 core:0 and >> equivalent to system:0 core:6 >> > > Did you mean: > > node:1.core:2 == system:0.node:1.socket:2.core:0 == system:0.core:6 > > ? > Yes. > What I meant by my question was -- aren't the 3 diagrams above equivalent to > "core:6"? If so, what's the value of the foo.bar.baz notation? If you have a 96 core machine like we do, the hierarchical notation (foo.bar.baz) is really nice. If I want to bind on node:2.socket:3.core:4, it's much easier than looking at the topology and finding that it's core:70. Using physical or logical indexes doesn't change anything here. I agree that we don't do that often in real applications, but I actually use that quite a lot for my own debugging :) I actually don't see why people would like to use physical numbers in such a hierarchical notation since physical socket/core numbers are often strange/illogical and nobody remembers them. However, I agree that the physical indexes are useful when *not* using a hierarchical notation, ie I want to bind on thread OS index #46. Brice
Re: [hwloc-devel] hwloc-bind syntax
On Thu, 2009-12-03 at 20:32 -0500, Jeff Squyres wrote: > > > Ah, ok. To be clear, is it accurate to say that it is one of the > > > following forms: > > > > > > - a hex number (without leading "0x" -- would "0x" be ignored if it is > > > supplied?) > > > > We never used 0x there. > > Ok. > > It might be good to safely ignore 0x if it's present, but that's a small > feature enhancement that can be done at any time (I filed a future ticket). Maybe not relevant but it bit me so I'll say it here, using "%x" with sscanf on a string of "0x1" will match the whole thing and give a value of 1 on Linux but on Solaris it'll match the "0" as a hex value of 0 and not match the "x1" at all leading to further errors in subsequent matches as well. The most annoying thing is that sscanf() thinks it's matched and it's return code will be set accordingly. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
Re: [hwloc-devel] hwloc-bind again
Jeff Squyres wrote: > I notice that > > shell$ hwloc-bind > > (i.e., invoking hwloc-bind with no arguments) > > returns an exit status of 0. Shouldn't it return non-zero? I'd think it was > an error if you didn't give hwloc-bind anything to do. For example, we > wouldn't want a script with something like this: > > hwloc-bind $actions_to_do > > to return 0 if $actions_to_do was mistakenly empty. > > Right? > Yeah maybe