Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain
That's what we needed to know - i.e., that setting num_sockets=1 generates an 
error instead of segfaulting down the road. I can submit a CMR to do so.

thx!
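
A minimal sketch of the "error instead of segfault" behaviour described above, assuming the launcher knows the raw socket count that topology discovery reported. The names `bind_policy_t`, `BIND_*` and `check_bind_request` are hypothetical and are not Open MPI identifiers; the message text is illustrative only.

```c
#include <stdio.h>

/* Hypothetical binding policies; not Open MPI's actual types. */
typedef enum { BIND_NONE, BIND_CORE, BIND_SOCKET } bind_policy_t;

/* Returns 0 if the request can be honored, -1 otherwise.
 * detected_sockets is the raw count reported by topology discovery
 * (0 when the platform exposes no socket level). */
static int check_bind_request(bind_policy_t policy, int detected_sockets)
{
    if (BIND_SOCKET == policy && detected_sockets <= 0) {
        fprintf(stderr,
                "No socket information is available on this node; "
                "socket binding is not possible. Try running without binding.\n");
        return -1;
    }
    return 0;
}
```

With this shape, an unbound launch on a machine without socket information still proceeds, while a socket-binding request fails early with a message the user can act on.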

On Feb 22, 2012, at 4:12 PM, Eugene Loh wrote:

> On 02/22/12 14:54, Ralph Castain wrote:
>> That doesn't really address the issue, though. What I want to know is: what 
>> happens when you try to bind processes? What about -bind-to-socket, and 
>> -persocket options? Etc. Reason I'm concerned: I'm not sure what happens if 
>> the socket layer isn't present. The logic in 1.5 is pretty old, but I 
>> believe it relies heavily on sockets being present.
> Okay.  So,
> 
> *)  "out of the box", basically nothing works.  For example, "mpirun 
> hostname" segfaults.
> 
> *)  With "--mca orte_num_sockets 1", stuff appears to work.
> 
> *)  With "--mca orte_num_sockets 1" and adding either "--bysocket 
> --bind-to-socket" or "--npersocket ", I get:
> 
> --
> Unable to bind to socket -13 on node burl-ct-v20z-10.
> --
> --
> mpirun was unable to start the specified application as it encountered an 
> error:
> 
> Error name: Fatal
> Node: burl-ct-v20z-10
> 
> when attempting to start process rank 0.
> --
> 2 total processes failed to start
> 
> So, I hear Brice's comment that this is an old kernel.  And, I hear what 
> you're saying about a "real" fix being expensive.  Nevertheless, to my taste, 
> automatically setting num_sockets==1 when num_sockets==0 is detected makes a 
> lot of sense.  It makes things "basically" work, turning a situation where 
> everything including "mpirun hostname" segfaults into a situation where 
> default usage works just fine.  What remains broken is binding, which 
> generates an error message that gives the user a hope of making progress 
> (turning off binding).  That's in contrast to expecting users to go from
> 
> % mpirun hostname
> Segmentation fault
> 
> to knowing that they should set orte_num_sockets==1.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 02/22/12 14:54, Ralph Castain wrote:
> That doesn't really address the issue, though. What I want to know is:
> what happens when you try to bind processes? What about
> -bind-to-socket, and -persocket options? Etc. Reason I'm concerned:
> I'm not sure what happens if the socket layer isn't present. The logic
> in 1.5 is pretty old, but I believe it relies heavily on sockets being
> present.

Okay.  So,

*)  "out of the box", basically nothing works.  For example, "mpirun 
hostname" segfaults.


*)  With "--mca orte_num_sockets 1", stuff appears to work.

*)  With "--mca orte_num_sockets 1" and adding either "--bysocket 
--bind-to-socket" or "--npersocket ", I get:


--
Unable to bind to socket -13 on node burl-ct-v20z-10.
--
--
mpirun was unable to start the specified application as it encountered 
an error:


Error name: Fatal
Node: burl-ct-v20z-10

when attempting to start process rank 0.
--
2 total processes failed to start

So, I hear Brice's comment that this is an old kernel.  And, I hear what 
you're saying about a "real" fix being expensive.  Nevertheless, to my 
taste, automatically setting num_sockets==1 when num_sockets==0 is 
detected makes a lot of sense.  It makes things "basically" work, 
turning a situation where everything including "mpirun hostname" 
segfaults into a situation where default usage works just fine.  What 
remains broken is binding, which generates an error message that gives 
the user a hope of making progress (turning off binding).  That's in
contrast to expecting users to go from


% mpirun hostname
Segmentation fault

to knowing that they should set orte_num_sockets==1.
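
The fallback proposed above can be sketched as follows. This is an illustration under stated assumptions, not the change that was committed: `effective_num_sockets` and `cores_per_socket` are hypothetical helper names, and the division mirrors the cores-per-socket computation quoted elsewhere in this thread from odls_base_open.c.

```c
/* Treat "no socket level detected" (0) as a single socket so that
 * later arithmetic never divides by zero. */
static int effective_num_sockets(int detected_sockets)
{
    return (detected_sockets > 0) ? detected_sockets : 1;
}

/* Base number of cores per socket, with the guard applied before
 * the division. */
static int cores_per_socket(int num_processors, int detected_sockets)
{
    return num_processors / effective_num_sockets(detected_sockets);
}
```

For the two-PU machine discussed here, `cores_per_socket(2, 0)` evaluates to 2 instead of trapping on the division.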


Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Brice Goglin
On 22/02/2012 20:24, Eugene Loh wrote:
> On 2/22/2012 11:08 AM, Ralph Castain wrote:
>> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:
>>> On 22/02/2012 17:48, Ralph Castain wrote:
>>>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>>>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>>>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and
>>>>>> OMPI pukes on divide by zero.  OS info was listed in the original
>>>>>> message (below).  Might we want to do something else?  E.g.,
>>>>>> assume num_sockets==1 when num_sockets==0 (if you know what I
>>>>>> mean)?  So, which one (or more) of the following should be fixed?
>>>>>>
>>>>>> *) on this platform, hwloc finds no socket level
>>>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>>>> *) OMPI divides by 0 and barfs on basically everything
>>>>> Okay.  So, Brice's other e-mail indicates that the first two are
>>>>> "not really uncommon":
>>>>>
>>>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>>>> reports nothing interesting (only one machine object with
>>>>>> multiple PU children). So numsockets==0 isn't really uncommon.
>>>>> So, it seems to me that OMPI needs to handle the num_sockets==0
>>>>> case rather than just dividing by num_sockets.  This is v1.5
>>>>> orte_odls_base_open() since r25914.
>>>> Unfortunately, just artificially setting the num_sockets to 1 won't
>>>> solve much - you'll get past that point in the code, but attempts
>>>> to bind are likely to fail down the road. Fixing it will require
>>>> some significant effort.
>>>>
>>>> Given we haven't heard reports of this before, I'm not convinced it
>>>> is a widespread problem.
> I assume we don't see the problem as widespread because it was only
> introduced into  v1.5 in r25914.  In my mind, the real question is how
> common it is for hwloc to decide numsockets==0.  On that one, Brice
> asserts it "isn't really uncommon."

On Linux, it's uncommon: it only happens on some platforms with very old
kernels (2.6.10 or so).
Solaris, Darwin and Windows should get sockets in some/most cases.
FreeBSD should get x86 sockets correctly because we use cpuid directly
there.

Unless I am missing something, others have nothing related to sockets in
their driver: AIX, HPUX, OSF.

Brice



Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain

On Feb 22, 2012, at 12:24 PM, Eugene Loh wrote:

> On 2/22/2012 11:08 AM, Ralph Castain wrote:
>> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:
>>> On 22/02/2012 17:48, Ralph Castain wrote:
>>>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>>>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>>>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI
>>>>>> pukes on divide by zero.  OS info was listed in the original message
>>>>>> (below).  Might we want to do something else?  E.g., assume
>>>>>> num_sockets==1 when num_sockets==0 (if you know what I mean)?  So, which
>>>>>> one (or more) of the following should be fixed?
>>>>>>
>>>>>> *) on this platform, hwloc finds no socket level
>>>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>>>> *) OMPI divides by 0 and barfs on basically everything
>>>>> Okay.  So, Brice's other e-mail indicates that the first two are "not
>>>>> really uncommon":
>>>>>
>>>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>>>> reports nothing interesting (only one machine object with multiple PU
>>>>>> children). So numsockets==0 isn't really uncommon.
>>>>> So, it seems to me that OMPI needs to handle the num_sockets==0 case
>>>>> rather than just dividing by num_sockets.  This is v1.5
>>>>> orte_odls_base_open() since r25914.
>>>> Unfortunately, just artificially setting the num_sockets to 1 won't solve
>>>> much - you'll get past that point in the code, but attempts to bind are
>>>> likely to fail down the road. Fixing it will require some significant
>>>> effort.
>>>>
>>>> Given we haven't heard reports of this before, I'm not convinced it is a
>>>> widespread problem.
> I assume we don't see the problem as widespread because it was only
> introduced into v1.5 in r25914.  In my mind, the real question is how common
> it is for hwloc to decide numsockets==0.  On that one, Brice asserts it
> "isn't really uncommon."
>>>> For now, let's just use the mca param and see what happens.
>>> I am probably missing something but: Why would setting num_sockets to 1
>>> work fine as a mca param, while artificially setting it as said above
>>> wouldn't ?
>> Because the param means that it isn't hardwired into the code base. I want 
>> to first verify that artificially forcing num_sockets to 1 doesn't break the 
>> code down the road, so the less change to find out, the better.
> That sounds a lot different to me than the earlier statement.  Thanks for 
> asking that question, Brice.  Anyhow, I tried using "--mca orte_num_sockets 
> 1" and that seems to allow basic programs to run.

That doesn't really address the issue, though. What I want to know is: what 
happens when you try to bind processes? What about -bind-to-socket, and 
-persocket options? Etc.

Reason I'm concerned: I'm not sure what happens if the socket layer isn't 
present. The logic in 1.5 is pretty old, but I believe it relies heavily on 
sockets being present.





Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 2/22/2012 11:08 AM, Ralph Castain wrote:
> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:
>> On 22/02/2012 17:48, Ralph Castain wrote:
>>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI
>>>>> pukes on divide by zero.  OS info was listed in the original message
>>>>> (below).  Might we want to do something else?  E.g., assume
>>>>> num_sockets==1 when num_sockets==0 (if you know what I mean)?  So, which
>>>>> one (or more) of the following should be fixed?
>>>>>
>>>>> *) on this platform, hwloc finds no socket level
>>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>>> *) OMPI divides by 0 and barfs on basically everything
>>>> Okay.  So, Brice's other e-mail indicates that the first two are "not
>>>> really uncommon":
>>>>
>>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>>> reports nothing interesting (only one machine object with multiple PU
>>>>> children). So numsockets==0 isn't really uncommon.
>>>> So, it seems to me that OMPI needs to handle the num_sockets==0 case
>>>> rather than just dividing by num_sockets.  This is v1.5
>>>> orte_odls_base_open() since r25914.
>>> Unfortunately, just artificially setting the num_sockets to 1 won't solve
>>> much - you'll get past that point in the code, but attempts to bind are
>>> likely to fail down the road. Fixing it will require some significant
>>> effort.
>>>
>>> Given we haven't heard reports of this before, I'm not convinced it is a
>>> widespread problem.

I assume we don't see the problem as widespread because it was only
introduced into v1.5 in r25914.  In my mind, the real question is how
common it is for hwloc to decide numsockets==0.  On that one, Brice
asserts it "isn't really uncommon."

>>> For now, let's just use the mca param and see what happens.
>> I am probably missing something but: Why would setting num_sockets to 1
>> work fine as a mca param, while artificially setting it as said above
>> wouldn't ?
> Because the param means that it isn't hardwired into the code base. I want
> to first verify that artificially forcing num_sockets to 1 doesn't break the
> code down the road, so the less change to find out, the better.

That sounds a lot different to me than the earlier statement.  Thanks
for asking that question, Brice.  Anyhow, I tried using "--mca
orte_num_sockets 1" and that seems to allow basic programs to run.


Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain

On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:

> On 22/02/2012 17:48, Ralph Castain wrote:
>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>>
>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI
>>>> pukes on divide by zero.  OS info was listed in the original message
>>>> (below).  Might we want to do something else?  E.g., assume num_sockets==1
>>>> when num_sockets==0 (if you know what I mean)?  So, which one (or more) of
>>>> the following should be fixed?
>>>>
>>>> *) on this platform, hwloc finds no socket level
>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>> *) OMPI divides by 0 and barfs on basically everything
>>> Okay.  So, Brice's other e-mail indicates that the first two are "not
>>> really uncommon":
>>>
>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>> reports nothing interesting (only one machine object with multiple PU
>>>> children). So numsockets==0 isn't really uncommon.
>>> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather 
>>> than just dividing by num_sockets.  This is v1.5 orte_odls_base_open() 
>>> since r25914.
>> Unfortunately, just artificially setting the num_sockets to 1 won't solve 
>> much - you'll get past that point in the code, but attempts to bind are 
>> likely to fail down the road. Fixing it will require some significant effort.
>> 
>> Given we haven't heard reports of this before, I'm not convinced it is a 
>> widespread problem. For now, let's just use the mca param and see what 
>> happens.
> 
> I am probably missing something but: Why would setting num_sockets to 1
> work fine as a mca param, while artificially setting it as said above
> wouldn't ?

Because the param means that it isn't hardwired into the code base. I want to 
first verify that artificially forcing num_sockets to 1 doesn't break the code 
down the road, so the less change to find out, the better.
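
The distinction drawn above, a user-supplied parameter versus a value hardwired into the code, can be sketched like this. It is a simplified model, assuming a parameter value of 0 means "not set" (as the `0 == orte_default_num_sockets_per_board` test quoted in this thread suggests); `resolve_num_sockets` is a hypothetical name.

```c
/* Pick the socket count the launcher should use.
 * param_value: user-supplied value (e.g. via an MCA parameter), 0 = unset.
 * detected:    count reported by topology discovery, possibly 0. */
static int resolve_num_sockets(int param_value, int detected)
{
    /* A positive parameter wins and detection is ignored; otherwise the
     * detected value is used unchanged, so it can still be 0. */
    return (param_value > 0) ? param_value : detected;
}
```

Under this model, `--mca orte_num_sockets 1` corresponds to `param_value == 1`, which sidesteps the zero from detection without changing any code path for users who do not set the parameter.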


> 
> Brice
> 




Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Brice Goglin
On 22/02/2012 17:48, Ralph Castain wrote:
> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>
>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI 
>>> pukes on divide by zero.  OS info was listed in the original message 
>>> (below).  Might we want to do something else?  E.g., assume num_sockets==1 
>>> when num_sockets==0 (if you know what I mean)?  So, which one (or more) of 
>>> the following should be fixed?
>>>
>>> *) on this platform, hwloc finds no socket level
>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>> *) OMPI divides by 0 and barfs on basically everything
>> Okay.  So, Brice's other e-mail indicates that the first two are "not really 
>> uncommon":
>>
>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>> reports nothing interesting (only one machine object with multiple PU
>>> children). So numsockets==0 isn't really uncommon.
>> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather 
>> than just dividing by num_sockets.  This is v1.5 orte_odls_base_open() since 
>> r25914.
> Unfortunately, just artificially setting the num_sockets to 1 won't solve 
> much - you'll get past that point in the code, but attempts to bind are 
> likely to fail down the road. Fixing it will require some significant effort.
>
> Given we haven't heard reports of this before, I'm not convinced it is a 
> widespread problem. For now, let's just use the mca param and see what 
> happens.

I am probably missing something but: Why would setting num_sockets to 1
work fine as a mca param, while artificially setting it as said above
wouldn't ?

Brice



Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain

On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:

> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes 
>> on divide by zero.  OS info was listed in the original message (below).  
>> Might we want to do something else?  E.g., assume num_sockets==1 when 
>> num_sockets==0 (if you know what I mean)?  So, which one (or more) of the 
>> following should be fixed?
>> 
>> *) on this platform, hwloc finds no socket level
>> *) therefore hwloc returns num_sockets==0 to OMPI
>> *) OMPI divides by 0 and barfs on basically everything
> Okay.  So, Brice's other e-mail indicates that the first two are "not really 
> uncommon":
> 
> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>> reports nothing interesting (only one machine object with multiple PU
>> children). So numsockets==0 isn't really uncommon.
> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather 
> than just dividing by num_sockets.  This is v1.5 orte_odls_base_open() since 
> r25914.

Unfortunately, just artificially setting the num_sockets to 1 won't solve much 
- you'll get past that point in the code, but attempts to bind are likely to 
fail down the road. Fixing it will require some significant effort.

Given we haven't heard reports of this before, I'm not convinced it is a 
widespread problem. For now, let's just use the mca param and see what happens.

>>> On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:
>>>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>>>
>>>> 222     /* get the number of local sockets unless we were given a number */
>>>> 223     if (0 == orte_default_num_sockets_per_board) {
>>>> 224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>>>> 225     }
>>>> 226     /* get the number of local processors */
>>>> 227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>>>> 228     /* compute the base number of cores/socket, if not given */
>>>> 229     if (0 == orte_default_num_cores_per_socket) {
>>>> 230         orte_odls_globals.num_cores_per_socket =
>>>>                 orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>>>> 231     }
>>>>
>>>> Well, we execute the branch at line 224, but num_sockets remains 0.  This
>>>> leads to the divide-by-0 at line 230.




Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain
Much simpler solution - on that platform, you should add "orte_num_sockets=1" 
to your default mca param file. Problem solved. It's why that param exists, and 
we added it specifically at Terry's request for an earlier, similar problem.


On Feb 22, 2012, at 8:55 AM, Brice Goglin wrote:

> On 22/02/2012 07:36, Eugene Loh wrote:
>> On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:
>>> Here are the first of the results of the testing I promised.
>>> I am not 100% sure how to reach the code that Eugene reported as
>>> problematic,
>> I don't think you're going to see it.  Somehow, hwloc on the config in
>> question thinks there is no socket level and returns num_sockets==0. 
>> If you can run something successfully, your platform won't show the
>> issue.
> 
> (Eugene sent hwloc info offlist)
> 
> This is an "interesting" case. Last time I used a RHEL4 2.6.9 kernel, it
> had no sysfs topology info, but there was some "physical package" info
> in /proc/cpuinfo. Yours has nothing. Maybe because it's an AMD and/or
> single-core-processor based system. sysfs still has NUMA topology info
> (this was added to the kernel around 2.5 iirc) so we get 2 NUMA nodes
> with one core each but no socket at all. We could assume there is one
> socket per NUMA node but that's a risky hack.
> 
> Anyway, we have seen other systems (mostly non-Linux) where lstopo
> reports nothing interesting (only one machine object with multiple PU
> children). So numsockets==0 isn't really uncommon. Replacing 0 with 1
> will likely work for your computations. Make sure the code isn't going
> to use the first hwloc socket object later, it would get NULL obviously.
> 
> Brice
> 




Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 2/21/2012 10:31 PM, Eugene Loh wrote:
> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI
> pukes on divide by zero.  OS info was listed in the original message
> (below).  Might we want to do something else?  E.g., assume
> num_sockets==1 when num_sockets==0 (if you know what I mean)?  So,
> which one (or more) of the following should be fixed?
>
> *) on this platform, hwloc finds no socket level
> *) therefore hwloc returns num_sockets==0 to OMPI
> *) OMPI divides by 0 and barfs on basically everything

Okay.  So, Brice's other e-mail indicates that the first two are "not
really uncommon":

On 2/22/2012 7:55 AM, Brice Goglin wrote:
> Anyway, we have seen other systems (mostly non-Linux) where lstopo
> reports nothing interesting (only one machine object with multiple PU
> children). So numsockets==0 isn't really uncommon.

So, it seems to me that OMPI needs to handle the num_sockets==0 case
rather than just dividing by num_sockets.  This is v1.5
orte_odls_base_open() since r25914.

>> On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:
>>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>>
>>> 222     /* get the number of local sockets unless we were given a number */
>>> 223     if (0 == orte_default_num_sockets_per_board) {
>>> 224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>>> 225     }
>>> 226     /* get the number of local processors */
>>> 227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>>> 228     /* compute the base number of cores/socket, if not given */
>>> 229     if (0 == orte_default_num_cores_per_socket) {
>>> 230         orte_odls_globals.num_cores_per_socket =
>>>                 orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>>> 231     }
>>>
>>> Well, we execute the branch at line 224, but num_sockets remains 0.
>>> This leads to the divide-by-0 at line 230.


Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Brice Goglin
On 22/02/2012 07:36, Eugene Loh wrote:
> On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:
>> Here are the first of the results of the testing I promised.
>> I am not 100% sure how to reach the code that Eugene reported as
>> problematic,
> I don't think you're going to see it.  Somehow, hwloc on the config in
> question thinks there is no socket level and returns num_sockets==0. 
> If you can run something successfully, your platform won't show the
> issue.

(Eugene sent hwloc info offlist)

This is an "interesting" case. Last time I used a RHEL4 2.6.9 kernel, it
had no sysfs topology info, but there was some "physical package" info
in /proc/cpuinfo. Yours has nothing. Maybe because it's an AMD and/or
single-core-processor based system. sysfs still has NUMA topology info
(this was added to the kernel around 2.5 iirc) so we get 2 NUMA nodes
with one core each but no socket at all. We could assume there is one
socket per NUMA node but that's a risky hack.
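
For illustration only, the heuristic mentioned above (and described there as risky) could look like the following. `guess_num_sockets` is a hypothetical helper, not something hwloc or Open MPI provides, and the NUMA-node count is assumed to come from topology discovery.

```c
/* Heuristic: if no sockets were detected but NUMA nodes were, assume one
 * socket per NUMA node; if neither is known, assume a single socket.
 * Detected socket counts are always preferred over the guess. */
static int guess_num_sockets(int detected_sockets, int numa_nodes)
{
    if (detected_sockets > 0) {
        return detected_sockets;
    }
    if (numa_nodes > 0) {
        return numa_nodes;
    }
    return 1;
}
```

On the machine in this thread (2 NUMA nodes, no socket level) this would report 2 sockets. The guess is only as good as the assumption behind it, since NUMA nodes and sockets do not map one-to-one on every system.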

Anyway, we have seen other systems (mostly non-Linux) where lstopo
reports nothing interesting (only one machine object with multiple PU
children). So numsockets==0 isn't really uncommon. Replacing 0 with 1
will likely work for your computations. Make sure the code isn't going
to use the first hwloc socket object later, it would get NULL obviously.

Brice
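
The NULL caveat above can be sketched with a stand-in type. `topo_obj_t` and `bind_target` are hypothetical and deliberately independent of hwloc; the point is only that code which later asks for "the first socket" needs a fallback when no socket object exists.

```c
#include <stddef.h>

/* Hypothetical stand-in for a topology object; not hwloc's hwloc_obj_t. */
typedef struct topo_obj {
    const char *type;   /* e.g. "Machine", "Socket" */
} topo_obj_t;

/* Returns the object to bind against: the first socket if one exists,
 * otherwise the whole-machine object. */
static const topo_obj_t *bind_target(const topo_obj_t *first_socket,
                                     const topo_obj_t *machine)
{
    return (NULL != first_socket) ? first_socket : machine;
}
```

Falling back to the machine object is one possible choice; reporting an error instead would be equally reasonable when the user explicitly asked for socket binding.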



Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:

> Here are the first of the results of the testing I promised.
> I am not 100% sure how to reach the code that Eugene reported as
> problematic,
I don't think you're going to see it.  Somehow, hwloc on the config in 
question thinks there is no socket level and returns num_sockets==0.  If 
you can run something successfully, your platform won't show the issue.


Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 02/21/12 19:29, Jeffrey Squyres wrote:

> What's the output of running lstopo from hwloc 1.3.2?  (this is the version
> that's in the OMPI trunk and v1.5 branches)
>
>  http://www.open-mpi.org/software/hwloc/v1.3/
>
> Is there any difference from v1.4 hwloc?
>
>  http://www.open-mpi.org/software/hwloc/v1.4/

Machine (8192MB)
  NUMANode L#0 (P#0 4096MB) + PU L#0 (P#0)
  NUMANode L#1 (P#1 4096MB) + PU L#1 (P#1)

No difference between 1.3 and 1.4.  No information about sockets.

As Paul says, doesn't look like a compiler thing.  (I get the same with 
Intel and gcc.)


The hwloc README has a sample program that has ("third example")

    depth = hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET);
    if (depth == HWLOC_TYPE_DEPTH_UNKNOWN) {
        printf("*** The number of sockets is unknown\n");
    } else {
        ...
    }

that reports that the number of sockets is unknown.  So, "sockets" is 
unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by 
zero.  OS info was listed in the original message (below).  Might we 
want to do something else?  E.g., assume num_sockets==1 when 
num_sockets==0 (if you know what I mean)?  So, which one (or more) of 
the following should be fixed?


*) on this platform, hwloc finds no socket level
*) therefore hwloc returns num_sockets==0 to OMPI
*) OMPI divides by 0 and barfs on basically everything

> On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:
>> We have some amount of MTT testing going on every night and on ONE of our
>> systems v1.5 has been dead since r25914.  The system is
>>
>> Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64
>> x86_64 x86_64 GNU/Linux
>>
>> and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256)
>> compilers.  I haven't poked around enough yet to figure out what the
>> problematic characteristic of this configuration is.
>>
>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>
>> 222     /* get the number of local sockets unless we were given a number */
>> 223     if (0 == orte_default_num_sockets_per_board) {
>> 224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>> 225     }
>> 226     /* get the number of local processors */
>> 227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>> 228     /* compute the base number of cores/socket, if not given */
>> 229     if (0 == orte_default_num_cores_per_socket) {
>> 230         orte_odls_globals.num_cores_per_socket =
>>                 orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>> 231     }
>>
>> Well, we execute the branch at line 224, but num_sockets remains 0.  This leads
>> to the divide-by-0 at line 230.  Digging deeper, the call at line 224 led us to
>> opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):
>>
>> static int module_get_socket_info(int *num_sockets) {
>>     hwloc_topology_t *t = &opal_hwloc_topology;
>>     *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
>>     return OPAL_SUCCESS;
>> }
>>
>> Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0.
>>
>> I can poke around more, but does someone want to advise?
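
One option raised by the analysis above is for the query itself to report that no socket level was found, instead of returning success with a count of 0. The sketch below models that idea without hwloc: `QUERY_OK`, `QUERY_NOT_FOUND` and `get_socket_info` are hypothetical names, and `raw_count` stands in for whatever the topology library reported.

```c
/* Hypothetical status codes; Open MPI has its own OPAL_* constants. */
enum { QUERY_OK = 0, QUERY_NOT_FOUND = -1 };

/* raw_count is what the topology library reported for the socket type. */
static int get_socket_info(int raw_count, int *num_sockets)
{
    if (raw_count <= 0) {
        *num_sockets = 0;
        return QUERY_NOT_FOUND;   /* lets the caller choose a fallback */
    }
    *num_sockets = raw_count;
    return QUERY_OK;
}
```

A caller that checks the status can then decide between a default, an error message, or skipping socket-related logic, rather than discovering the zero at the first division.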






Re: [OMPI devel] v1.5 r25914 DOA

2012-02-21 Thread Paul H. Hargrove
My build with the "2011_sp1.8.273" Intel compilers passes the same tests 
as I detailed below for "2011_sp1.7.256".
I don't suspect any longer that the compiler is at fault, but am willing 
to try additional/alternate tests to help confirm.


-Paul

On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:

Here are the first of the results of the testing I promised.
I am not 100% sure how to reach the code that Eugene reported as 
problematic, so I tried just running the ring test with various 
-bind-to-* options.   I am quite willing to run additional test 
cases.  All runs are w/ OMPI_MCA_btl=sm,self.


+ 2011.5.220
  FAIL: "make check" fails opal_datatype_test
  OK: mpirun -np 2 ./ring_c
  OK: mpirun -np 2 -bind-to-none ./ring_c
  OK: mpirun -np 2 -bind-to-core ./ring_c
  OK: mpirun -np 2 -bind-to-socket ./ring_c

+ 2011_sp1.7.256
  OK: "make check"
  OK: mpirun -np 2 -bind-to-none ./ring_c
  OK: mpirun -np 2 -bind-to-core ./ring_c
  OK: mpirun -np 2 -bind-to-socket ./ring_c

So, I don't think the "2011_sp1.7.256" compilers are broken (and are 
"better" than the ones I've been using).
I have a build with "2011_sp1.8.273" churning away right now (est.
45 minutes to complete - should have disabled the Fortran bindings)


If there is something other than the -bind-to-* flags I should be 
using to reach the problematic code, let me know.
But based on what I've seen so far, I think we can probably rule out 
the compiler as the problem.


-Paul


On 2/21/2012 4:37 PM, Paul H. Hargrove wrote:
I have been testing v1.5 with slightly older Intel 
"composerxe-2011.5.220" compilers.
I see a "make check" failure in opal_datatype_test which is not 
present with any other compiler (such as gcc on the same node).
This has been seen most recently on the 1.5.5rc2r25990 tarball 
generated earlier today.
With "make check -k" I can confirm that opal_datatype_test is the 
ONLY failure I see with this compiler.
So, I have just assumed this was a buggy compiler and thought nothing 
more of it.


I have not yet tested them, but also have the same 
"composer_xe_2011_sp1.7.256" compiler and a more recent 
"composer_xe_2011_sp1.8.273".  I will test both ASAP and report back 
with my findings.


-Paul


On 2/21/2012 4:20 PM, Eugene Loh wrote:
We have some amount of MTT testing going on every night and on ONE 
of our systems v1.5 has been dead since r25914.  The system is


Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 
2007 x86_64 x86_64 x86_64 GNU/Linux


and I'm encountering the problem with Intel 
(composer_xe_2011_sp1.7.256) compilers.  I haven't poked around 
enough yet to figure out what the problematic characteristic of this 
configuration is.


In r25914, orte/mca/odls/base/odls_base_open.c, we get

222     /* get the number of local sockets unless we were given a number */
223     if (0 == orte_default_num_sockets_per_board) {
224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
225     }
226     /* get the number of local processors */
227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
228     /* compute the base number of cores/socket, if not given */
229     if (0 == orte_default_num_cores_per_socket) {
230         orte_odls_globals.num_cores_per_socket =
                orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
231     }

Well, we execute the branch at line 224, but num_sockets remains 0.
This leads to the divide-by-0 at line 230.  Digging deeper, the call
at line 224 led us to
opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff
left out):

static int module_get_socket_info(int *num_sockets) {
    hwloc_topology_t *t = &opal_hwloc_topology;
    *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
    return OPAL_SUCCESS;
}

Anyhow, SOCKET is somehow an unknown layer, so num_sockets is 
returning 0.


I can poke around more, but does someone want to advise?






--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] v1.5 r25914 DOA

2012-02-21 Thread Paul H. Hargrove

Here are the first of the results of the testing I promised.
I am not 100% sure how to reach the code that Eugene reported as 
problematic, so I tried just running the ring test with various 
-bind-to-* options.   I am quite willing to run additional test cases.  
All runs are w/ OMPI_MCA_btl=sm,self.


+ 2011.5.220
  FAIL: "make check" fails opal_datatype_test
  OK: mpirun -np 2 ./ring_c
  OK: mpirun -np 2 -bind-to-none ./ring_c
  OK: mpirun -np 2 -bind-to-core ./ring_c
  OK: mpirun -np 2 -bind-to-socket ./ring_c

+ 2011_sp1.7.256
  OK: "make check"
  OK: mpirun -np 2 -bind-to-none ./ring_c
  OK: mpirun -np 2 -bind-to-core ./ring_c
  OK: mpirun -np 2 -bind-to-socket ./ring_c

So, I don't think the "2011_sp1.7.256" compilers are broken (and are 
"better" than the ones I've been using).
I have a build with "2011_sp1.8.273" churning away right now (est.
45 minutes to complete - should have disabled the Fortran bindings)


If there is something other than the -bind-to-* flags I should be using 
to reach the problematic code, let me know.
But based on what I've seen so far, I think we can probably rule out the 
compiler as the problem.


-Paul



--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] v1.5 r25914 DOA

2012-02-21 Thread Paul H. Hargrove
I have been testing v1.5 with slightly older Intel 
"composerxe-2011.5.220" compilers.
I see a "make check" failure in opal_datatype_test which is not present 
with any other compiler (such as gcc on the same node).
This has been seen most recently on the 1.5.5rc2r25990 tarball generated 
earlier today.
With "make check -k" I can confirm that opal_datatype_test is the ONLY 
failure I see with this compiler.
So, I have just assumed this was a buggy compiler and thought nothing 
more of it.


I have not yet tested them, but I also have the same 
"composer_xe_2011_sp1.7.256" compiler that Eugene is using and a more 
recent "composer_xe_2011_sp1.8.273".  I will test both ASAP and report 
back with my findings.


-Paul



--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] v1.5 r25914 DOA

2012-02-21 Thread Jeffrey Squyres
What's the output of running lstopo from hwloc 1.3.2?  (this is the version 
that's in the OMPI trunk and v1.5 branches)

http://www.open-mpi.org/software/hwloc/v1.3/

Is there any difference from v1.4 hwloc?

http://www.open-mpi.org/software/hwloc/v1.4/


On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:

> We have some amount of MTT testing going on every night and on ONE of our 
> systems v1.5 has been dead since r25914.  The system is
> 
> Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 
> x86_64 x86_64 x86_64 GNU/Linux
> 
> and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) 
> compilers.  I haven't poked around enough yet to figure out what the 
> problematic characteristic of this configuration is.
> 
> In r25914, orte/mca/odls/base/odls_base_open.c, we get
> 
>    222     /* get the number of local sockets unless we were given a number */
>    223     if (0 == orte_default_num_sockets_per_board) {
>    224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>    225     }
>    226     /* get the number of local processors */
>    227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>    228     /* compute the base number of cores/socket, if not given */
>    229     if (0 == orte_default_num_cores_per_socket) {
>    230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>    231     }
> 
> Well, we execute the branch at line 224, but num_sockets remains 0.  This 
> leads to the divide-by-0 at line 230.  Digging deeper, the call at line 224 
> led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff 
> left out):
> 
> static int module_get_socket_info(int *num_sockets) {
>     hwloc_topology_t *t = &opal_hwloc_topology;
>     *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
>     return OPAL_SUCCESS;
> }
> 
> Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0.
> 
> I can poke around more, but does someone want to advise?


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] v1.5 r25914 DOA

2012-02-21 Thread Eugene Loh
We have some amount of MTT testing going on every night and on ONE of 
our systems v1.5 has been dead since r25914.  The system is


Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 
x86_64 x86_64 x86_64 GNU/Linux


and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) 
compilers.  I haven't poked around enough yet to figure out what the 
problematic characteristic of this configuration is.


In r25914, orte/mca/odls/base/odls_base_open.c, we get

222     /* get the number of local sockets unless we were given a number */
223     if (0 == orte_default_num_sockets_per_board) {
224         opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
225     }
226     /* get the number of local processors */
227     opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
228     /* compute the base number of cores/socket, if not given */
229     if (0 == orte_default_num_cores_per_socket) {
230         orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
231     }

Well, we execute the branch at line 224, but num_sockets remains 0.  
This leads to the divide-by-0 at line 230.  Digging deeper, the call at 
line 224 led us to opal/mca/paffinity/hwloc/paffinity_hwloc_module.c 
(lots of stuff left out):


static int module_get_socket_info(int *num_sockets) {
    hwloc_topology_t *t = &opal_hwloc_topology;
    *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
    return OPAL_SUCCESS;
}

Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0.

I can poke around more, but does someone want to advise?
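
For illustration only, here is one way the division at line 230 could be 
guarded when the topology query reports no sockets.  This is a minimal 
standalone sketch, not the actual Open MPI code or fix: the helper name 
cores_per_socket and its signature are hypothetical, and it assumes that a 
socket count of 0 (or less) just means no socket level was found and can 
be treated as a single socket.

```c
#include <assert.h>

/* Hypothetical helper (not part of Open MPI): compute cores per socket
 * without dividing by zero when no socket level was detected. */
static int cores_per_socket(int num_processors, int num_sockets)
{
    /* A negative processor count would indicate a caller bug. */
    assert(num_processors >= 0);

    /* Assumption: a non-positive socket count means the topology query
     * found no socket layer, so fall back to treating the node as one
     * socket instead of dividing by zero. */
    if (num_sockets <= 0) {
        num_sockets = 1;
    }
    return num_processors / num_sockets;
}
```

With 4 processors and a reported socket count of 0, this returns 4 rather 
than faulting.  It only covers the default (no binding) path; socket 
binding options would still need their own handling.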