Re: [OMPI devel] v1.5 r25914 DOA
That's what we needed to know - i.e., that setting num_sockets=1 generates an error instead of segfaulting down the road. I can submit a CMR to do so.

thx!

On Feb 22, 2012, at 4:12 PM, Eugene Loh wrote:

> On 02/22/12 14:54, Ralph Castain wrote:
>> That doesn't really address the issue, though. What I want to know is: what
>> happens when you try to bind processes? What about -bind-to-socket, and
>> -persocket options? Etc. Reason I'm concerned: I'm not sure what happens if
>> the socket layer isn't present. The logic in 1.5 is pretty old, but I
>> believe it relies heavily on sockets being present.
> Okay. So,
>
> *) "out of the box", basically nothing works. For example, "mpirun
> hostname" segfaults.
>
> *) With "--mca orte_num_sockets 1", stuff appears to work.
>
> *) With "--mca orte_num_sockets 1" and adding either "--bysocket
> --bind-to-socket" or "--npersocket ", I get:
>
> --
> Unable to bind to socket -13 on node burl-ct-v20z-10.
> --
> --
> mpirun was unable to start the specified application as it encountered an
> error:
>
> Error name: Fatal
> Node: burl-ct-v20z-10
>
> when attempting to start process rank 0.
> --
> 2 total processes failed to start
>
> So, I hear Brice's comment that this is an old kernel. And, I hear what
> you're saying about a "real" fix being expensive. Nevertheless, to my taste,
> automatically setting num_sockets==1 when num_sockets==0 is detected makes a
> lot of sense. It makes things "basically" work, turning a situation where
> everything including "mpirun hostname" segfaults into a situation where
> default usage works just fine. What remains broken is binding, which
> generates an error message that gives the user a hope of making progress
> (turning off binding). That's in contrast from expecting users to go from
>
> % mpirun hostname
> Segmentation fault
>
> to knowing that they should set orte_num_sockets==1.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] v1.5 r25914 DOA
On 02/22/12 14:54, Ralph Castain wrote:
> That doesn't really address the issue, though. What I want to know is: what
> happens when you try to bind processes? What about -bind-to-socket, and
> -persocket options? Etc. Reason I'm concerned: I'm not sure what happens if
> the socket layer isn't present. The logic in 1.5 is pretty old, but I
> believe it relies heavily on sockets being present.

Okay. So,

*) "out of the box", basically nothing works. For example, "mpirun hostname" segfaults.

*) With "--mca orte_num_sockets 1", stuff appears to work.

*) With "--mca orte_num_sockets 1" and adding either "--bysocket --bind-to-socket" or "--npersocket ", I get:

--
Unable to bind to socket -13 on node burl-ct-v20z-10.
--
--
mpirun was unable to start the specified application as it encountered an
error:

Error name: Fatal
Node: burl-ct-v20z-10

when attempting to start process rank 0.
--
2 total processes failed to start

So, I hear Brice's comment that this is an old kernel. And, I hear what you're saying about a "real" fix being expensive. Nevertheless, to my taste, automatically setting num_sockets==1 when num_sockets==0 is detected makes a lot of sense. It makes things "basically" work, turning a situation where everything including "mpirun hostname" segfaults into a situation where default usage works just fine. What remains broken is binding, which generates an error message that gives the user a hope of making progress (turning off binding). That's in contrast from expecting users to go from

% mpirun hostname
Segmentation fault

to knowing that they should set orte_num_sockets==1.
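The outcome argued for above (default usage keeps working, and only a socket-binding request fails, with a readable message) can be sketched roughly as follows. This is a minimal illustration, not code from the Open MPI tree: the status codes, the bind_policy enum, and check_launch() are all hypothetical names invented for the sketch.

```c
#include <stdio.h>

/* Hypothetical status codes and policy names, used only in this sketch. */
enum { LAUNCH_OK = 0, LAUNCH_ERR_NO_SOCKETS = -1 };
enum bind_policy { BIND_NONE, BIND_TO_CORE, BIND_TO_SOCKET };

/* Decide whether a launch may proceed.
 * detected_sockets is what topology discovery reported (0 = no socket level).
 * Only a request that actually needs socket information is rejected. */
static int check_launch(int detected_sockets, enum bind_policy policy)
{
    if (policy == BIND_TO_SOCKET && detected_sockets <= 0) {
        fprintf(stderr,
                "Cannot bind to socket: no socket information on this node; "
                "rerun without socket binding.\n");
        return LAUNCH_ERR_NO_SOCKETS;
    }
    return LAUNCH_OK;
}
```

With this shape, a plain "mpirun hostname" style launch (BIND_NONE) succeeds even when no sockets were detected, while a socket-binding request returns an error the user can act on.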
Re: [OMPI devel] v1.5 r25914 DOA
On 22/02/2012 20:24, Eugene Loh wrote:
> On 2/22/2012 11:08 AM, Ralph Castain wrote:
>> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:
>>> On 22/02/2012 17:48, Ralph Castain wrote:
>>>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>>>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>>>> ... "sockets" is unknown and hwloc returns 0 for num_sockets and
>>>>>> OMPI pukes on divide by zero. OS info was listed in the original
>>>>>> message (below). Might we want to do something else? E.g.,
>>>>>> assume num_sockets==1 when num_sockets==0 (if you know what I
>>>>>> mean)? So, which one (or more) of the following should be fixed?
>>>>>>
>>>>>> *) on this platform, hwloc finds no socket level
>>>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>>>> *) OMPI divides by 0 and barfs on basically everything
>>>>> Okay. So, Brice's other e-mail indicates that the first two are
>>>>> "not really uncommon":
>>>>>
>>>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>>>> reports nothing interesting (only one machine object with
>>>>>> multiple PU children). So numsockets==0 isn't really uncommon.
>>>>> So, it seems to me that OMPI needs to handle the num_sockets==0
>>>>> case rather than just dividing by num_sockets. This is v1.5
>>>>> orte_odls_base_open() since r25914.
>>>> Unfortunately, just artificially setting the num_sockets to 1 won't
>>>> solve much - you'll get past that point in the code, but attempts to
>>>> bind are likely to fail down the road. Fixing it will require some
>>>> significant effort.
>>>>
>>>> Given we haven't heard reports of this before, I'm not convinced it
>>>> is a widespread problem.
> I assume we don't see the problem as widespread because it was only
> introduced into v1.5 in r25914. In my mind, the real question is how
> common it is for hwloc to decide numsockets==0. On that one, Brice
> asserts it "isn't really uncommon."

On Linux, it's uncommon: it only happens on some platforms with very old kernels (2.6.10 or so).

Solaris, Darwin and Windows should get sockets in some/most cases. FreeBSD should get x86 sockets correctly because we use cpuid directly there. Unless I am missing something, others have nothing related to sockets in their driver: AIX, HPUX, OSF.

Brice
Re: [OMPI devel] v1.5 r25914 DOA
On Feb 22, 2012, at 12:24 PM, Eugene Loh wrote: > On 2/22/2012 11:08 AM, Ralph Castain wrote: >> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote: >>> Le 22/02/2012 17:48, Ralph Castain a écrit : On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote > On 2/21/2012 10:31 PM, Eugene Loh wrote: >> ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI >> pukes on divide by zero. OS info was listed in the original message >> (below). Might we want to do something else? E.g., assume >> num_sockets==1 when num_sockets==0 (if you know what I mean)? So, which >> one (or more) of the following should be fixed? >> >> *) on this platform, hwloc finds no socket level >> *) therefore hwloc returns num_sockets==0 to OMPI >> *) OMPI divides by 0 and barfs on basically everything > Okay. So, Brice's other e-mail indicates that the first two are "not > really uncommon": > > On 2/22/2012 7:55 AM, Brice Goglin wrote: >> Anyway, we have seen other systems (mostly non-Linux) where lstopo >> reports nothing interesting (only one machine object with multiple PU >> children). So numsockets==0 isn't really uncommon. > So, it seems to me that OMPI needs to handle the num_sockets==0 case > rather than just dividing by num_sockets. This is v1.5 > orte_odls_base_open() since r25914. Unfortunately, just artificially setting the num_sockets to 1 won't solve much - you'll get past that point in the code, but attempts to bind are likely to fail down the road. Fixing it will require some significant effort. Given we haven't heard reports of this before, I'm not convinced it is a widespread problem. > I assume we don't see the problem as widespread because it was only > introduced into v1.5 in r25914. In my mind, the real question is how common > it is for hwloc to decide numsockets==0. On that one, Brice asserts it > "isn't really uncommon." For now, let's just use the mca param and see what happens. 
>>> I am probably missing something but: Why would setting num_sockets to 1 >>> work fine as a mca param, while artificially setting it as said above >>> wouldn't ? >> Because the param means that it isn't hardwired into the code base. I want >> to first verify that artificially forcing num_sockets to 1 doesn't break the >> code down the road, so the less change to find out, the better. > That sounds a lot different to me than the earlier statement. Thanks for > asking that question, Brice. Anyhow, I tried using "--mca orte_num_sockets > 1" and that seems to allow basic programs to run. That doesn't really address the issue, though. What I want to know is: what happens when you try to bind processes? What about -bind-to-socket, and -persocket options? Etc. Reason I'm concerned: I'm not sure what happens if the socket layer isn't present. The logic in 1.5 is pretty old, but I believe it relies heavily on sockets being present. > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] v1.5 r25914 DOA
On 2/22/2012 11:08 AM, Ralph Castain wrote: On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote: Le 22/02/2012 17:48, Ralph Castain a écrit : On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote On 2/21/2012 10:31 PM, Eugene Loh wrote: ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by zero. OS info was listed in the original message (below). Might we want to do something else? E.g., assume num_sockets==1 when num_sockets==0 (if you know what I mean)? So, which one (or more) of the following should be fixed? *) on this platform, hwloc finds no socket level *) therefore hwloc returns num_sockets==0 to OMPI *) OMPI divides by 0 and barfs on basically everything Okay. So, Brice's other e-mail indicates that the first two are "not really uncommon": On 2/22/2012 7:55 AM, Brice Goglin wrote: Anyway, we have seen other systems (mostly non-Linux) where lstopo reports nothing interesting (only one machine object with multiple PU children). So numsockets==0 isn't really uncommon. So, it seems to me that OMPI needs to handle the num_sockets==0 case rather than just dividing by num_sockets. This is v1.5 orte_odls_base_open() since r25914. Unfortunately, just artificially setting the num_sockets to 1 won't solve much - you'll get past that point in the code, but attempts to bind are likely to fail down the road. Fixing it will require some significant effort. Given we haven't heard reports of this before, I'm not convinced it is a widespread problem. I assume we don't see the problem as widespread because it was only introduced into v1.5 in r25914. In my mind, the real question is how common it is for hwloc to decide numsockets==0. On that one, Brice asserts it "isn't really uncommon." For now, let's just use the mca param and see what happens. I am probably missing something but: Why would setting num_sockets to 1 work fine as a mca param, while artificially setting it as said above wouldn't ? 
Because the param means that it isn't hardwired into the code base. I want to first verify that artificially forcing num_sockets to 1 doesn't break the code down the road, so the less change to find out, the better. That sounds a lot different to me than the earlier statement. Thanks for asking that question, Brice. Anyhow, I tried using "--mca orte_num_sockets 1" and that seems to allow basic programs to run.
Re: [OMPI devel] v1.5 r25914 DOA
On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote: > Le 22/02/2012 17:48, Ralph Castain a écrit : >> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote: >> >>> On 2/21/2012 10:31 PM, Eugene Loh wrote: ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by zero. OS info was listed in the original message (below). Might we want to do something else? E.g., assume num_sockets==1 when num_sockets==0 (if you know what I mean)? So, which one (or more) of the following should be fixed? *) on this platform, hwloc finds no socket level *) therefore hwloc returns num_sockets==0 to OMPI *) OMPI divides by 0 and barfs on basically everything >>> Okay. So, Brice's other e-mail indicates that the first two are "not >>> really uncommon": >>> >>> On 2/22/2012 7:55 AM, Brice Goglin wrote: Anyway, we have seen other systems (mostly non-Linux) where lstopo reports nothing interesting (only one machine object with multiple PU children). So numsockets==0 isn't really uncommon. >>> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather >>> than just dividing by num_sockets. This is v1.5 orte_odls_base_open() >>> since r25914. >> Unfortunately, just artificially setting the num_sockets to 1 won't solve >> much - you'll get past that point in the code, but attempts to bind are >> likely to fail down the road. Fixing it will require some significant effort. >> >> Given we haven't heard reports of this before, I'm not convinced it is a >> widespread problem. For now, let's just use the mca param and see what >> happens. > > I am probably missing something but: Why would setting num_sockets to 1 > work fine as a mca param, while artificially setting it as said above > wouldn't ? Because the param means that it isn't hardwired into the code base. I want to first verify that artificially forcing num_sockets to 1 doesn't break the code down the road, so the less change to find out, the better. 
Re: [OMPI devel] v1.5 r25914 DOA
On 22/02/2012 17:48, Ralph Castain wrote:
> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>
>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>> ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI
>>> pukes on divide by zero. OS info was listed in the original message
>>> (below). Might we want to do something else? E.g., assume num_sockets==1
>>> when num_sockets==0 (if you know what I mean)? So, which one (or more) of
>>> the following should be fixed?
>>>
>>> *) on this platform, hwloc finds no socket level
>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>> *) OMPI divides by 0 and barfs on basically everything
>> Okay. So, Brice's other e-mail indicates that the first two are "not really
>> uncommon":
>>
>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>> reports nothing interesting (only one machine object with multiple PU
>>> children). So numsockets==0 isn't really uncommon.
>> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather
>> than just dividing by num_sockets. This is v1.5 orte_odls_base_open() since
>> r25914.
> Unfortunately, just artificially setting the num_sockets to 1 won't solve
> much - you'll get past that point in the code, but attempts to bind are
> likely to fail down the road. Fixing it will require some significant effort.
>
> Given we haven't heard reports of this before, I'm not convinced it is a
> widespread problem. For now, let's just use the mca param and see what
> happens.

I am probably missing something but: Why would setting num_sockets to 1 work fine as a mca param, while artificially setting it as said above wouldn't?

Brice
Re: [OMPI devel] v1.5 r25914 DOA
On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote: > On 2/21/2012 10:31 PM, Eugene Loh wrote: >> ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes >> on divide by zero. OS info was listed in the original message (below). >> Might we want to do something else? E.g., assume num_sockets==1 when >> num_sockets==0 (if you know what I mean)? So, which one (or more) of the >> following should be fixed? >> >> *) on this platform, hwloc finds no socket level >> *) therefore hwloc returns num_sockets==0 to OMPI >> *) OMPI divides by 0 and barfs on basically everything > Okay. So, Brice's other e-mail indicates that the first two are "not really > uncommon": > > On 2/22/2012 7:55 AM, Brice Goglin wrote: >> Anyway, we have seen other systems (mostly non-Linux) where lstopo >> reports nothing interesting (only one machine object with multiple PU >> children). So numsockets==0 isn't really uncommon. > So, it seems to me that OMPI needs to handle the num_sockets==0 case rather > than just dividing by num_sockets. This is v1.5 orte_odls_base_open() since > r25914. Unfortunately, just artificially setting the num_sockets to 1 won't solve much - you'll get past that point in the code, but attempts to bind are likely to fail down the road. Fixing it will require some significant effort. Given we haven't heard reports of this before, I'm not convinced it is a widespread problem. For now, let's just use the mca param and see what happens. 
>>> On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote: In r25914, orte/mca/odls/base/odls_base_open.c, we get 222 /* get the number of local sockets unless we were given a number */ 223 if (0 == orte_default_num_sockets_per_board) { 224 opal_paffinity_base_get_socket_info(_odls_globals.num_sockets); 225 } 226 /* get the number of local processors */ 227 opal_paffinity_base_get_processor_info(_odls_globals.num_processors); 228 /* compute the base number of cores/socket, if not given */ 229 if (0 == orte_default_num_cores_per_socket) { 230 orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets; 231 } Well, we execute the branch at line 224, but num_sockets remains 0. This leads to the divide-by-0 at line 230. > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] v1.5 r25914 DOA
Much simpler solution - on that platform, you should add "orte_num_sockets=1" to your default mca param file. Problem solved. It's why that param exists, and we added it specifically at Terry's request for an earlier, similar problem.

On Feb 22, 2012, at 8:55 AM, Brice Goglin wrote:

> On 22/02/2012 07:36, Eugene Loh wrote:
>> On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:
>>> Here are the first of the results of the testing I promised.
>>> I am not 100% sure how to reach the code that Eugene reported as
>>> problematic,
>> I don't think you're going to see it. Somehow, hwloc on the config in
>> question thinks there is no socket level and returns num_sockets==0.
>> If you can run something successfully, your platform won't show the
>> issue.
>
> (Eugene sent hwloc info offlist)
>
> This is an "interesting" case. Last time I used a RHEL4 2.6.9 kernel, it
> had no sysfs topology info, but there was some "physical package" info
> in /proc/cpuinfo. Yours has nothing. Maybe because it's an AMD and/or
> single-core-processor based system. sysfs still has NUMA topology info
> (this was added to the kernel around 2.5 iirc) so we get 2 NUMA nodes
> with one core each but no socket at all. We could assume there is one
> socket per NUMA node but that's a risky hack.
>
> Anyway, we have seen other systems (mostly non-Linux) where lstopo
> reports nothing interesting (only one machine object with multiple PU
> children). So numsockets==0 isn't really uncommon. Replacing 0 with 1
> will likely work for your computations. Make sure the code isn't going
> to use the first hwloc socket object later; it would get NULL obviously.
>
> Brice
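For readers who want to try the workaround suggested above, the parameter file entry would look roughly like this. The parameter name comes from the thread; the file locations are an assumption based on the usual Open MPI conventions (a per-user $HOME/.openmpi/mca-params.conf or a system-wide $prefix/etc/openmpi-mca-params.conf) and should be checked against the installed version.

```
# Assume one socket on nodes where no socket level is detected
orte_num_sockets = 1
```

The same setting can be given for a single run on the command line, as used elsewhere in this thread: "mpirun --mca orte_num_sockets 1 hostname".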
Re: [OMPI devel] v1.5 r25914 DOA
On 2/21/2012 10:31 PM, Eugene Loh wrote:
> ... "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI
> pukes on divide by zero. OS info was listed in the original message
> (below). Might we want to do something else? E.g., assume num_sockets==1
> when num_sockets==0 (if you know what I mean)? So, which one (or more) of
> the following should be fixed?
>
> *) on this platform, hwloc finds no socket level
> *) therefore hwloc returns num_sockets==0 to OMPI
> *) OMPI divides by 0 and barfs on basically everything

Okay. So, Brice's other e-mail indicates that the first two are "not really uncommon":

On 2/22/2012 7:55 AM, Brice Goglin wrote:
> Anyway, we have seen other systems (mostly non-Linux) where lstopo
> reports nothing interesting (only one machine object with multiple PU
> children). So numsockets==0 isn't really uncommon.

So, it seems to me that OMPI needs to handle the num_sockets==0 case rather than just dividing by num_sockets. This is v1.5 orte_odls_base_open() since r25914.

>> On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:
>>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>>
>>>     222     /* get the number of local sockets unless we were given a number */
>>>     223     if (0 == orte_default_num_sockets_per_board) {
>>>     224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>>>     225     }
>>>     226     /* get the number of local processors */
>>>     227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>>>     228     /* compute the base number of cores/socket, if not given */
>>>     229     if (0 == orte_default_num_cores_per_socket) {
>>>     230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>>>     231     }
>>>
>>> Well, we execute the branch at line 224, but num_sockets remains 0.
>>> This leads to the divide-by-0 at line 230.
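To make the "handle num_sockets==0 rather than just dividing" point concrete, here is a minimal sketch of a guarded division. It is not the fix that was committed; cores_per_socket() is a hypothetical helper, and treating an undetected socket level as one socket is the assumption discussed in this thread.

```c
#include <stdio.h>

/* Hypothetical helper, not taken from the Open MPI tree: returns the
 * number of cores per socket, treating an undetected socket level
 * (num_sockets <= 0) as a single socket so the division is always safe. */
static int cores_per_socket(int num_processors, int num_sockets)
{
    if (num_sockets <= 0) {
        fprintf(stderr,
                "warning: no socket level detected; assuming 1 socket\n");
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

On the machine in question (2 processors, 0 detected sockets) this yields 2 cores per socket instead of a divide-by-zero.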
Re: [OMPI devel] v1.5 r25914 DOA
On 22/02/2012 07:36, Eugene Loh wrote:
> On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:
>> Here are the first of the results of the testing I promised.
>> I am not 100% sure how to reach the code that Eugene reported as
>> problematic,
> I don't think you're going to see it. Somehow, hwloc on the config in
> question thinks there is no socket level and returns num_sockets==0.
> If you can run something successfully, your platform won't show the
> issue.

(Eugene sent hwloc info offlist)

This is an "interesting" case. Last time I used a RHEL4 2.6.9 kernel, it had no sysfs topology info, but there was some "physical package" info in /proc/cpuinfo. Yours has nothing. Maybe because it's an AMD and/or single-core-processor based system. sysfs still has NUMA topology info (this was added to the kernel around 2.5 iirc) so we get 2 NUMA nodes with one core each but no socket at all. We could assume there is one socket per NUMA node but that's a risky hack.

Anyway, we have seen other systems (mostly non-Linux) where lstopo reports nothing interesting (only one machine object with multiple PU children). So numsockets==0 isn't really uncommon. Replacing 0 with 1 will likely work for your computations. Make sure the code isn't going to use the first hwloc socket object later; it would get NULL obviously.

Brice
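The NULL caveat above can be illustrated with a small sketch. The types here are stand-ins invented for the example (they are not hwloc's own structures), and falling back to the machine-level object is one possible policy, assumed for illustration only.

```c
#include <stddef.h>

/* Stand-in for a topology object; real hwloc objects carry far more. */
struct topo_obj {
    const char *type;
};

/* Stand-in topology: first_socket is NULL when no socket level exists. */
struct topo {
    struct topo_obj *machine;
    struct topo_obj *first_socket;
};

/* Pick the object to bind within: the first socket if there is one,
 * otherwise the whole machine, so callers never dereference NULL. */
static struct topo_obj *binding_root(const struct topo *t)
{
    return t->first_socket != NULL ? t->first_socket : t->machine;
}
```

The point of the design is that pretending num_sockets is 1 only fixes the arithmetic; any later code that looks up an actual socket object still needs a check like this one.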
Re: [OMPI devel] v1.5 r25914 DOA
On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:
> Here are the first of the results of the testing I promised.
> I am not 100% sure how to reach the code that Eugene reported as
> problematic,

I don't think you're going to see it. Somehow, hwloc on the config in question thinks there is no socket level and returns num_sockets==0. If you can run something successfully, your platform won't show the issue.
Re: [OMPI devel] v1.5 r25914 DOA
On 02/21/12 19:29, Jeffrey Squyres wrote:
> What's the output of running lstopo from hwloc 1.3.2? (this is the
> version that's in the OMPI trunk and v1.5 branches)
>
> http://www.open-mpi.org/software/hwloc/v1.3/
>
> Is there any difference from v1.4 hwloc?
>
> http://www.open-mpi.org/software/hwloc/v1.4/

Machine (8192MB)
  NUMANode L#0 (P#0 4096MB) + PU L#0 (P#0)
  NUMANode L#1 (P#1 4096MB) + PU L#1 (P#1)

No difference between 1.3 and 1.4. No information about sockets. As Paul says, doesn't look like a compiler thing. (I get the same with Intel and gcc.) The hwloc README has a sample program that has ("third example")

    depth = hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET);
    if (depth == HWLOC_TYPE_DEPTH_UNKNOWN) {
        printf("*** The number of sockets is unknown\n");
    } else {
        ...
    }

that reports that the number of sockets is unknown.

So, "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by zero. OS info was listed in the original message (below). Might we want to do something else? E.g., assume num_sockets==1 when num_sockets==0 (if you know what I mean)? So, which one (or more) of the following should be fixed?

*) on this platform, hwloc finds no socket level
*) therefore hwloc returns num_sockets==0 to OMPI
*) OMPI divides by 0 and barfs on basically everything

> On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:
>> We have some amount of MTT testing going on every night and on ONE of
>> our systems v1.5 has been dead since r25914. The system is
>>
>> Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
>>
>> and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256)
>> compilers. I haven't poked around enough yet to figure out what the
>> problematic characteristic of this configuration is.
>>
>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>
>>     222     /* get the number of local sockets unless we were given a number */
>>     223     if (0 == orte_default_num_sockets_per_board) {
>>     224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>>     225     }
>>     226     /* get the number of local processors */
>>     227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>>     228     /* compute the base number of cores/socket, if not given */
>>     229     if (0 == orte_default_num_cores_per_socket) {
>>     230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>>     231     }
>>
>> Well, we execute the branch at line 224, but num_sockets remains 0.
>> This leads to the divide-by-0 at line 230. Digging deeper, the call at
>> line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c
>> (lots of stuff left out):
>>
>>     static int module_get_socket_info(int *num_sockets)
>>     {
>>         hwloc_topology_t *t = &opal_hwloc_topology;
>>         *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
>>         return OPAL_SUCCESS;
>>     }
>>
>> Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning
>> 0. I can poke around more, but does someone want to advise?
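One way to address the second bullet above (the query layer reporting 0 together with a success code) is to have the query report "unknown" explicitly and leave the caller's value alone. The sketch below assumes that approach; the status codes and store_socket_count() are hypothetical names, and raw_count stands in for whatever the topology query returned.

```c
/* Hypothetical status codes, standing in for the project's own. */
enum { SOCKETS_OK = 0, SOCKETS_UNKNOWN = 1 };

/* raw_count is the value a topology query returned for the socket level.
 * When the level is missing (count <= 0), *num_sockets is left untouched
 * so that a value the user supplied beforehand survives, and the caller
 * is told explicitly that nothing was detected. */
static int store_socket_count(int raw_count, int *num_sockets)
{
    if (raw_count <= 0) {
        return SOCKETS_UNKNOWN;
    }
    *num_sockets = raw_count;
    return SOCKETS_OK;
}
```

A caller can then decide what "unknown" means in its context (fall back to a default, warn, or refuse socket-level binding) instead of silently dividing by a zero that arrived with a success code.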
Re: [OMPI devel] v1.5 r25914 DOA
My build with the "2011_sp1.8.273" Intel compilers passes the same tests as I detailed below for "2011_sp1.7.256". I don't suspect any longer that the compiler is at fault, but am willing to try additional/alternate tests to help confirm. -Paul On 2/21/2012 5:40 PM, Paul H. Hargrove wrote: Here are the first of the results of the testing I promised. I am not 100% sure how to reach the code that Eugene reported as problematic, so I tried just running the ring test with various -bind-to-* options. I am quite willing to run additional test cases. All runs are w/ OMPI_MCA_btl=sm,self. + 2011.5.220 FAIL: "make check" fails opal_datatype_test OK: mpirun -np 2 ./ring_c OK: mpirun -np 2 -bind-to-none ./ring_c OK: mpirun -np 2 -bind-to-core ./ring_c OK: mpirun -np 2 -bind-to-socket ./ring_c + 2011_sp1.7.256 OK: "make check" OK: mpirun -np 2 -bind-to-none ./ring_c OK: mpirun -np 2 -bind-to-core ./ring_c OK: mpirun -np 2 -bind-to-socket ./ring_c So, I don't think the "2011_sp1.7.256" compilers are broken (and are "better" than the ones I've been using). I have a build with "2011_sp1.8.273" churning away right now (est. 45minutes to complete - should have disabled the Fortan bindings) If there is something other than the -bind-to-* flags I should be using to reach the problematic code, let me know. But based on what I've seen so far, I think we can probably rule out the compiler as the problem. -Paul On 2/21/2012 4:37 PM, Paul H. Hargrove wrote: I have been testing v1.5 with slightly older Intel "composerxe-2011.5.220" compilers. I see a "make check" failure in opal_datatype_test which is not present with any other compiler (such as gcc on the same node). This has been seen most recently on the 1.5.5rc2r25990 tarball generated earlier today. With "make check -k" I can confirm that opal_datatype_test is the ONLY failure I see with this compiler. So, I have just assumed this was a buggy compiler and thought nothing more of it. 
I have not yet tested them, but also have the same "composer_xe_2011_sp1.7.256" compiler and a more recent "composer_xe_2011_sp1.8.273". I will test both ASAP and report back with my findings. -Paul On 2/21/2012 4:20 PM, Eugene Loh wrote: We have some amount of MTT testing going on every night and on ONE of our systems v1.5 has been dead since r25914. The system is Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) compilers. I haven't poked around enough yet to figure out what the problematic characteristic of this configuration is. In r25914, orte/mca/odls/base/odls_base_open.c, we get 222 /* get the number of local sockets unless we were given a number */ 223 if (0 == orte_default_num_sockets_per_board) { 224 opal_paffinity_base_get_socket_info(_odls_globals.num_sockets); 225 } 226 /* get the number of local processors */ 227 opal_paffinity_base_get_processor_info(_odls_globals.num_processors); 228 /* compute the base number of cores/socket, if not given */ 229 if (0 == orte_default_num_cores_per_socket) { 230 orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets; 231 } Well, we execute the branch at line 224, but num_sockets remains 0. This leads to the divide-by-0 at line 230. Digging deeper, the call at line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out): static int module_get_socket_info(int *num_sockets) { hwloc_topology_t *t = _hwloc_topology; *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET); return OPAL_SUCCESS; } Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0. I can poke around more, but does someone want to advise? ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Paul H. 
Hargrove phhargr...@lbl.gov Future Technologies Group HPC Research Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
Re: [OMPI devel] v1.5 r25914 DOA
Here are the first of the results of the testing I promised. I am not 100% sure how to reach the code that Eugene reported as problematic, so I tried just running the ring test with various -bind-to-* options. I am quite willing to run additional test cases.

All runs are w/ OMPI_MCA_btl=sm,self.

+ 2011.5.220
  FAIL: "make check" fails opal_datatype_test
  OK: mpirun -np 2 ./ring_c
  OK: mpirun -np 2 -bind-to-none ./ring_c
  OK: mpirun -np 2 -bind-to-core ./ring_c
  OK: mpirun -np 2 -bind-to-socket ./ring_c

+ 2011_sp1.7.256
  OK: "make check"
  OK: mpirun -np 2 -bind-to-none ./ring_c
  OK: mpirun -np 2 -bind-to-core ./ring_c
  OK: mpirun -np 2 -bind-to-socket ./ring_c

So, I don't think the "2011_sp1.7.256" compilers are broken (and are "better" than the ones I've been using). I have a build with "2011_sp1.8.273" churning away right now (est. 45 minutes to complete - should have disabled the Fortran bindings).

If there is something other than the -bind-to-* flags I should be using to reach the problematic code, let me know. But based on what I've seen so far, I think we can probably rule out the compiler as the problem.

-Paul

On 2/21/2012 4:37 PM, Paul H. Hargrove wrote:
> I have been testing v1.5 with slightly older Intel
> "composerxe-2011.5.220" compilers. I see a "make check" failure in
> opal_datatype_test which is not present with any other compiler (such as
> gcc on the same node). This has been seen most recently on the
> 1.5.5rc2r25990 tarball generated earlier today. With "make check -k" I
> can confirm that opal_datatype_test is the ONLY failure I see with this
> compiler. So, I have just assumed this was a buggy compiler and thought
> nothing more of it.
>
> I have not yet tested them, but also have the same
> "composer_xe_2011_sp1.7.256" compiler and a more recent
> "composer_xe_2011_sp1.8.273". I will test both ASAP and report back with
> my findings.
>
> -Paul
>
> On 2/21/2012 4:20 PM, Eugene Loh wrote:
>> We have some amount of MTT testing going on every night and on ONE of
>> our systems v1.5 has been dead since r25914. The system is
>>
>> Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
>>
>> and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256)
>> compilers. I haven't poked around enough yet to figure out what the
>> problematic characteristic of this configuration is.
>>
>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>
>>     222     /* get the number of local sockets unless we were given a number */
>>     223     if (0 == orte_default_num_sockets_per_board) {
>>     224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>>     225     }
>>     226     /* get the number of local processors */
>>     227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>>     228     /* compute the base number of cores/socket, if not given */
>>     229     if (0 == orte_default_num_cores_per_socket) {
>>     230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>>     231     }
>>
>> Well, we execute the branch at line 224, but num_sockets remains 0.
>> This leads to the divide-by-0 at line 230. Digging deeper, the call at
>> line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c
>> (lots of stuff left out):
>>
>>     static int module_get_socket_info(int *num_sockets)
>>     {
>>         hwloc_topology_t *t = &opal_hwloc_topology;
>>         *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
>>         return OPAL_SUCCESS;
>>     }
>>
>> Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning
>> 0. I can poke around more, but does someone want to advise?

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] v1.5 r25914 DOA
I have been testing v1.5 with slightly older Intel "composerxe-2011.5.220" compilers. I see a "make check" failure in opal_datatype_test which is not present with any other compiler (such as gcc on the same node). This has been seen most recently on the 1.5.5rc2r25990 tarball generated earlier today. With "make check -k" I can confirm that opal_datatype_test is the ONLY failure I see with this compiler. So, I have just assumed this was a buggy compiler and thought nothing more of it.

I have not yet tested them, but I also have the same "composer_xe_2011_sp1.7.256" compiler and a more recent "composer_xe_2011_sp1.8.273". I will test both ASAP and report back with my findings.

-Paul

On 2/21/2012 4:20 PM, Eugene Loh wrote:

> We have some amount of MTT testing going on every night and on ONE of our systems v1.5 has been dead since r25914. The system is
>
>     Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
>
> and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) compilers. I haven't poked around enough yet to figure out what the problematic characteristic of this configuration is.
>
> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>
>     222     /* get the number of local sockets unless we were given a number */
>     223     if (0 == orte_default_num_sockets_per_board) {
>     224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>     225     }
>     226     /* get the number of local processors */
>     227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>     228     /* compute the base number of cores/socket, if not given */
>     229     if (0 == orte_default_num_cores_per_socket) {
>     230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>     231     }
>
> Well, we execute the branch at line 224, but num_sockets remains 0. This leads to the divide-by-0 at line 230. Digging deeper, the call at line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):
>
>     static int module_get_socket_info(int *num_sockets) {
>         hwloc_topology_t *t = _hwloc_topology;
>         *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
>         return OPAL_SUCCESS;
>     }
>
> Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0.
>
> I can poke around more, but does someone want to advise?

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] v1.5 r25914 DOA
What's the output of running lstopo from hwloc 1.3.2? (this is the version that's in the OMPI trunk and v1.5 branches)

    http://www.open-mpi.org/software/hwloc/v1.3/

Is there any difference from v1.4 hwloc?

    http://www.open-mpi.org/software/hwloc/v1.4/

On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:

> We have some amount of MTT testing going on every night and on ONE of our systems v1.5 has been dead since r25914. The system is
>
>     Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
>
> and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) compilers. I haven't poked around enough yet to figure out what the problematic characteristic of this configuration is.
>
> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>
>     222     /* get the number of local sockets unless we were given a number */
>     223     if (0 == orte_default_num_sockets_per_board) {
>     224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>     225     }
>     226     /* get the number of local processors */
>     227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>     228     /* compute the base number of cores/socket, if not given */
>     229     if (0 == orte_default_num_cores_per_socket) {
>     230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>     231     }
>
> Well, we execute the branch at line 224, but num_sockets remains 0. This leads to the divide-by-0 at line 230. Digging deeper, the call at line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):
>
>     static int module_get_socket_info(int *num_sockets) {
>         hwloc_topology_t *t = _hwloc_topology;
>         *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
>         return OPAL_SUCCESS;
>     }
>
> Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0.
>
> I can poke around more, but does someone want to advise?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] v1.5 r25914 DOA
We have some amount of MTT testing going on every night and on ONE of our systems v1.5 has been dead since r25914. The system is

    Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) compilers. I haven't poked around enough yet to figure out what the problematic characteristic of this configuration is.

In r25914, orte/mca/odls/base/odls_base_open.c, we get

    222     /* get the number of local sockets unless we were given a number */
    223     if (0 == orte_default_num_sockets_per_board) {
    224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
    225     }
    226     /* get the number of local processors */
    227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
    228     /* compute the base number of cores/socket, if not given */
    229     if (0 == orte_default_num_cores_per_socket) {
    230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
    231     }

Well, we execute the branch at line 224, but num_sockets remains 0. This leads to the divide-by-0 at line 230. Digging deeper, the call at line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):

    static int module_get_socket_info(int *num_sockets) {
        hwloc_topology_t *t = _hwloc_topology;
        *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
        return OPAL_SUCCESS;
    }

Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0.

I can poke around more, but does someone want to advise?
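
For illustration only, the divide-by-0 described above could be avoided by checking the socket count before dividing. The following is a minimal sketch, not the code in odls_base_open.c: the function name cores_per_socket and the SKETCH_* status codes are hypothetical, and the real code would presumably use Open MPI's own error constants and globals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; these are NOT Open MPI's OPAL_* values. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NO_SOCKETS = -1 };

/*
 * Compute the base number of cores per socket.
 *
 * If the topology layer reported no socket objects (num_sockets <= 0),
 * return an error and leave *result untouched instead of dividing by zero.
 */
static int cores_per_socket(int num_processors, int num_sockets, int *result)
{
    assert(result != NULL);

    if (num_sockets <= 0) {
        return SKETCH_ERR_NO_SOCKETS;
    }

    *result = num_processors / num_sockets;
    return SKETCH_SUCCESS;
}
```

With 4 processors and 0 sockets, this returns SKETCH_ERR_NO_SOCKETS and the caller can print a diagnostic or pick a fallback; with 4 processors and 2 sockets it stores 2. Whether the caller should fail or fall back to a default is a policy question the sketch deliberately leaves open.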