Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 02/21/12 19:29, Jeffrey Squyres wrote:

What's the output of running lstopo from hwloc 1.3.2?  (this is the version 
that's in the OMPI trunk and v1.5 branches)

 http://www.open-mpi.org/software/hwloc/v1.3/

Is there any difference from v1.4 hwloc?

 http://www.open-mpi.org/software/hwloc/v1.4/

Machine (8192MB)
  NUMANode L#0 (P#0 4096MB) + PU L#0 (P#0)
  NUMANode L#1 (P#1 4096MB) + PU L#1 (P#1)

No difference between 1.3 and 1.4.  No information about sockets.

As Paul says, doesn't look like a compiler thing.  (I get the same with 
Intel and gcc.)


The hwloc README has a sample program that has ("third example")

 depth = hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET);
 if (depth == HWLOC_TYPE_DEPTH_UNKNOWN) {
     printf("*** The number of sockets is unknown\n");
 } else {
     ...
 }

that reports that the number of sockets is unknown.  So, "sockets" is 
unknown and hwloc returns 0 for num_sockets and OMPI pukes on divide by 
zero.  OS info was listed in the original message (below).  Might we 
want to do something else?  E.g., assume num_sockets==1 when 
num_sockets==0 (if you know what I mean)?  So, which one (or more) of 
the following should be fixed?


*) on this platform, hwloc finds no socket level
*) therefore hwloc returns num_sockets==0 to OMPI
*) OMPI divides by 0 and barfs on basically everything
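
A minimal sketch of the first point, assuming lstopo text output marks each object as "<Type> L#<n>" (as the NUMANode and PU entries above do). It only scans text and does not use the hwloc API; the function name count_lstopo_objects is hypothetical. Applied to the lstopo output shown earlier, it finds two NUMANode entries, two PU entries, and no Socket entries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count how often "<type_name> L#" occurs in lstopo-style text.
 * Illustrative helper only; it is not part of hwloc.
 * Returns -1 if type_name is too long for the local buffer. */
int count_lstopo_objects(const char *lstopo_text, const char *type_name)
{
    char needle[64];
    size_t name_len;
    int count = 0;

    assert(lstopo_text != NULL && type_name != NULL);

    /* Build the search key "<type_name> L#". */
    name_len = strlen(type_name);
    if (name_len + 4 > sizeof(needle)) {
        return -1;
    }
    memcpy(needle, type_name, name_len);
    memcpy(needle + name_len, " L#", 4);   /* 3 characters plus the NUL */

    /* Walk through every occurrence of the key. */
    for (const char *p = strstr(lstopo_text, needle); p != NULL;
         p = strstr(p + 1, needle)) {
        count++;
    }
    return count;
}
```

A count of 0 for "Socket" corresponds to the HWLOC_TYPE_DEPTH_UNKNOWN branch in the README sample.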

On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:

We have some amount of MTT testing going on every night and on ONE of our 
systems v1.5 has been dead since r25914.  The system is

Linux burl-ct-v20z-10 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 
x86_64 x86_64 GNU/Linux

and I'm encountering the problem with Intel (composer_xe_2011_sp1.7.256) 
compilers.  I haven't poked around enough yet to figure out what the 
problematic characteristic of this configuration is.

In r25914, orte/mca/odls/base/odls_base_open.c, we get

222 /* get the number of local sockets unless we were given a number */
223 if (0 == orte_default_num_sockets_per_board) {
224     opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
225 }
226 /* get the number of local processors */
227 opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
228 /* compute the base number of cores/socket, if not given */
229 if (0 == orte_default_num_cores_per_socket) {
230     orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
231 }

Well, we execute the branch at line 224, but num_sockets remains 0.  This leads 
to the divide-by-0 at line 230.  Digging deeper, the call at line 224 led us to 
opal/mca/paffinity/hwloc/paffinity_hwloc_module.c (lots of stuff left out):

static int module_get_socket_info(int *num_sockets) {
    hwloc_topology_t *t = &opal_hwloc_topology;
    *num_sockets = (int) hwloc_get_nbobjs_by_type(*t, HWLOC_OBJ_SOCKET);
    return OPAL_SUCCESS;
}

Anyhow, SOCKET is somehow an unknown layer, so num_sockets is returning 0.

I can poke around more, but does someone want to advise?
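
One way the division at line 230 could be guarded, as a minimal sketch. The helper name cores_per_socket is hypothetical, and treating "no socket level" as one socket is an assumption under discussion in this thread, not something the OMPI code does at r25914.

```c
#include <assert.h>

/* Hypothetical helper: cores per socket, treating "no socket level
 * detected" (num_sockets <= 0) as a single socket so that the division
 * can never be by zero. */
int cores_per_socket(int num_processors, int num_sockets)
{
    assert(num_processors >= 0);
    if (num_sockets <= 0) {
        num_sockets = 1;   /* assumption: fall back to one socket */
    }
    return num_processors / num_sockets;
}
```

On the system above (2 processors, 0 sockets reported) this would yield 2 cores per socket rather than a crash.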
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:

Here are the first of the results of the testing I promised.
I am not 100% sure how to reach the code that Eugene reported as 
problematic,

I don't think you're going to see it.  Somehow, hwloc on the config in 
question thinks there is no socket level and returns num_sockets==0.  If 
you can run something successfully, your platform won't show the issue.


Re: [OMPI devel] v1.5 build failure w/ Solaris Studio 12.2 on Linux

2012-02-22 Thread Paul H. Hargrove

More notes:

I've tested ompi-1.5.4 and it has the same problem.  So, this is NOT a 
regression.


Terry D. has observed that Ubuntu is NOT a supported platform for the 
Solaris Studio compilers.
So, I've reproduced on a Scientific Linux 5.5 platform (Red Hat 
Enterprise Linux 5.5 clone, like CentOS) to be sure that was NOT the cause.


When I configure for the SS12.x compilers, I've been passing  
CXXFLAGS="-library=stlport4" as the VT sub-configure has informed me I 
should, due to something wrong with the default STL.  I tried dropping 
that from configure, and THE BUILD WAS SUCCESSFUL.


So, one has 2 choices:
+ build w/ SS12.2 without VT
+ update to SS12.3 and have VT

I don't think there is sufficient reason to delay 1.5.5 for this.

-Paul

On 2/21/2012 4:39 PM, Paul H. Hargrove wrote:

A few things to note:

1) This is NOT a problem w/ the SS12.3 compilers on the same machine.
So, one could say "upgrade your compiler" (a free download) and not 
delay 1.5.5 for this issue.


2) This is ONLY a problem on Linux, and not on Solaris (both SS12.2 
and SS12.3 tested for x86, x86-64, Sparc/v9 and Sparc/v8plus)


3) Testing the trunk I DON'T see the problem with either SS12.2 or 
SS12.3.
This is interesting, because it probably means that a u_char 
definition is SOMEWHERE in the headers (because libevent *is* getting 
built).


Whatever else may be done, I think this should be fixed "properly" 
(whatever that may equate to) for 1.6.
The way I see it now, it feels like OMPI is getting a definition of 
u_char only "by accident".


-Paul

On 2/21/2012 12:16 PM, Paul H. Hargrove wrote:
Building the v1.5 branch on Linux with the Solaris Studio 12.2 
compilers I see the following failure:
"[srcdir]/opal/event/event.h", line 797: Error: Type name expected 
instead of "u_char".
"[srcdir]/opal/event/event.h", line 798: Error: Type name expected 
instead of "u_char".
"[srcdir]/opal/event/event.h", line 1184: Error: "," expected 
instead of "*".

Where line 1184 is a prototype containing "u_char *".

As far as I can find, only several files below opal/event/ contain 
any use of "u_char".

There is a typedef for u_char in hwloc, but no use that I could see.

To the best of my knowledge u_char is NOT defined by any standard, 
and thus there is no particular header one can reliably find it in.
The alternatives, of course, are "unsigned char" or "uint8_t" (defined 
in stdint.h).


I had a look at the trunk and VISUALLY it appears the same problem 
exists in:

   opal/event/event.h
   opal/mca/event/libevent2013/libevent/event.h
However, my testing is currently confined to the v1.5 branch in the 
hopes of finally getting the next 1.5.5rc out the door.


-Paul





--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Brice Goglin
Le 22/02/2012 07:36, Eugene Loh a écrit :
> On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:
>> Here are the first of the results of the testing I promised.
>> I am not 100% sure how to reach the code that Eugene reported as
>> problematic,
> I don't think you're going to see it.  Somehow, hwloc on the config in
> question thinks there is no socket level and returns num_sockets==0. 
> If you can run something successfully, your platform won't show the
> issue.

(Eugene sent hwloc info offlist)

This is an "interesting" case. Last time I used a RHEL4 2.6.9 kernel, it
had no sysfs topology info, but there was some "physical package" info
in /proc/cpuinfo. Yours has nothing. Maybe because it's an AMD and/or
single-core-processor based system. sysfs still has NUMA topology info
(this was added to the kernel around 2.5 iirc) so we get 2 NUMA nodes
with one core each but no socket at all. We could assume there is one
socket per NUMA node but that's a risky hack.

Anyway, we have seen other systems (mostly non-Linux) where lstopo
reports nothing interesting (only one machine object with multiple PU
children). So numsockets==0 isn't really uncommon. Replacing 0 with 1
will likely work for your computations. Make sure the code isn't going
to use the first hwloc socket object later; it would obviously get NULL.
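
A minimal sketch of that caution, using stand-in types rather than the hwloc API: struct topo, struct topo_obj, get_socket and first_socket_index are all hypothetical names. The point is only that a socket lookup can return NULL when no socket level exists, so callers should check before dereferencing.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a topology object; NOT the hwloc type. */
struct topo_obj {
    int logical_index;
};

/* Stand-in for a topology whose socket level may be absent. */
struct topo {
    struct topo_obj *sockets;  /* NULL when no socket level was detected */
    int num_sockets;           /* 0 when no socket level was detected */
};

/* Return socket `idx`, or NULL when it does not exist. */
struct topo_obj *get_socket(const struct topo *t, int idx)
{
    assert(t != NULL);
    if (t->sockets == NULL || idx < 0 || idx >= t->num_sockets) {
        return NULL;
    }
    return &t->sockets[idx];
}

/* Return the logical index of the first socket, or -1 if there is none,
 * instead of dereferencing a NULL object. */
int first_socket_index(const struct topo *t)
{
    struct topo_obj *s = get_socket(t, 0);
    return (s != NULL) ? s->logical_index : -1;
}
```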

Brice



Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 2/21/2012 10:31 PM, Eugene Loh wrote:
...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI 
pukes on divide by zero.  OS info was listed in the original message 
(below).  Might we want to do something else?  E.g., assume 
num_sockets==1 when num_sockets==0 (if you know what I mean)?  So, 
which one (or more) of the following should be fixed?


*) on this platform, hwloc finds no socket level
*) therefore hwloc returns num_sockets==0 to OMPI
*) OMPI divides by 0 and barfs on basically everything

Okay.  So, Brice's other e-mail indicates that the first two are "not 
really uncommon":


On 2/22/2012 7:55 AM, Brice Goglin wrote:

Anyway, we have seen other systems (mostly non-Linux) where lstopo
reports nothing interesting (only one machine object with multiple PU
children). So numsockets==0 isn't really uncommon.

So, it seems to me that OMPI needs to handle the num_sockets==0 case 
rather than just dividing by num_sockets.  This is v1.5 
orte_odls_base_open() since r25914.

On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:

In r25914, orte/mca/odls/base/odls_base_open.c, we get

222 /* get the number of local sockets unless we were given a number */
223 if (0 == orte_default_num_sockets_per_board) {
224     opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
225 }
226 /* get the number of local processors */
227 opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
228 /* compute the base number of cores/socket, if not given */
229 if (0 == orte_default_num_cores_per_socket) {
230     orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
231 }

Well, we execute the branch at line 224, but num_sockets remains 0.  
This leads to the divide-by-0 at line 230.


Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain
Much simpler solution - on that platform, you should add "orte_num_sockets=1" 
to your default mca param file. Problem solved. It's why that param exists, and 
we added it specifically at Terry's request for an earlier, similar problem.
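
A minimal sketch of why the param sidesteps the problem, modeled on the r25914 excerpt quoted in this thread: detection is only consulted when no value was given. The function name resolve_num_sockets is hypothetical, and that orte_num_sockets feeds the "given" value is an assumption based on that excerpt.

```c
#include <assert.h>

/* Pick the socket count to use. `param_value` models a user-supplied
 * setting where 0 means "not given"; `detected` models what topology
 * detection reported (0 when no socket level was found). */
int resolve_num_sockets(int param_value, int detected)
{
    assert(param_value >= 0 && detected >= 0);
    if (param_value > 0) {
        return param_value;   /* an explicit setting wins */
    }
    return detected;          /* may still be 0 on such platforms */
}
```

With the param set to 1 the result is 1 regardless of detection; without it, a platform that reports no sockets still ends up with 0.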


On Feb 22, 2012, at 8:55 AM, Brice Goglin wrote:

> Le 22/02/2012 07:36, Eugene Loh a écrit :
>> On 2/21/2012 5:40 PM, Paul H. Hargrove wrote:
>>> Here are the first of the results of the testing I promised.
>>> I am not 100% sure how to reach the code that Eugene reported as
>>> problematic,
>> I don't think you're going to see it.  Somehow, hwloc on the config in
>> question thinks there is no socket level and returns num_sockets==0. 
>> If you can run something successfully, your platform won't show the
>> issue.
> 
> (Eugene sent hwloc info offlist)
> 
> This is an "interesting" case. Last time I used a RHEL4 2.6.9 kernel, it
> had no sysfs topology info, but there was some "physical package" info
> in /proc/cpuinfo. Yours has nothing. Maybe because it's an AMD and/or
> single-core-processor based system. sysfs still has NUMA topology info
> (this was added to the kernel around 2.5 iirc) so we get 2 NUMA nodes
> with one core each but no socket at all. We could assume there is one
> socket per NUMA node but that's a risky hack.
> 
> Anyway, we have seen other systems (mostly non-Linux) where lstopo
> reports nothing interesting (only one machine object with multiple PU
> children). So numsockets==0 isn't really uncommon. Replacing 0 with 1
> will likely work for your computations. Make sure the code isn't going
> to use the first hwloc socket object later; it would obviously get NULL.
> 
> Brice
> 




Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain

On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:

> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes 
>> on divide by zero.  OS info was listed in the original message (below).  
>> Might we want to do something else?  E.g., assume num_sockets==1 when 
>> num_sockets==0 (if you know what I mean)?  So, which one (or more) of the 
>> following should be fixed?
>> 
>> *) on this platform, hwloc finds no socket level
>> *) therefore hwloc returns num_sockets==0 to OMPI
>> *) OMPI divides by 0 and barfs on basically everything
> Okay.  So, Brice's other e-mail indicates that the first two are "not really 
> uncommon":
> 
> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>> reports nothing interesting (only one machine object with multiple PU
>> children). So numsockets==0 isn't really uncommon.
> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather 
> than just dividing by num_sockets.  This is v1.5 orte_odls_base_open() since 
> r25914.

Unfortunately, just artificially setting the num_sockets to 1 won't solve much 
- you'll get past that point in the code, but attempts to bind are likely to 
fail down the road. Fixing it will require some significant effort.

Given we haven't heard reports of this before, I'm not convinced it is a 
widespread problem. For now, let's just use the mca param and see what happens.

>>> On Feb 21, 2012, at 7:20 PM, Eugene Loh wrote:
>>>> In r25914, orte/mca/odls/base/odls_base_open.c, we get
>>>>
>>>> 222 /* get the number of local sockets unless we were given a number */
>>>> 223 if (0 == orte_default_num_sockets_per_board) {
>>>> 224     opal_paffinity_base_get_socket_info(&orte_odls_globals.num_sockets);
>>>> 225 }
>>>> 226 /* get the number of local processors */
>>>> 227 opal_paffinity_base_get_processor_info(&orte_odls_globals.num_processors);
>>>> 228 /* compute the base number of cores/socket, if not given */
>>>> 229 if (0 == orte_default_num_cores_per_socket) {
>>>> 230     orte_odls_globals.num_cores_per_socket = orte_odls_globals.num_processors / orte_odls_globals.num_sockets;
>>>> 231 }
>>>>
>>>> Well, we execute the branch at line 224, but num_sockets remains 0.  This
>>>> leads to the divide-by-0 at line 230.




Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Brice Goglin
Le 22/02/2012 17:48, Ralph Castain a écrit :
> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>
>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI 
>>> pukes on divide by zero.  OS info was listed in the original message 
>>> (below).  Might we want to do something else?  E.g., assume num_sockets==1 
>>> when num_sockets==0 (if you know what I mean)?  So, which one (or more) of 
>>> the following should be fixed?
>>>
>>> *) on this platform, hwloc finds no socket level
>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>> *) OMPI divides by 0 and barfs on basically everything
>> Okay.  So, Brice's other e-mail indicates that the first two are "not really 
>> uncommon":
>>
>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>> reports nothing interesting (only one machine object with multiple PU
>>> children). So numsockets==0 isn't really uncommon.
>> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather 
>> than just dividing by num_sockets.  This is v1.5 orte_odls_base_open() since 
>> r25914.
> Unfortunately, just artificially setting the num_sockets to 1 won't solve 
> much - you'll get past that point in the code, but attempts to bind are 
> likely to fail down the road. Fixing it will require some significant effort.
>
> Given we haven't heard reports of this before, I'm not convinced it is a 
> widespread problem. For now, let's just use the mca param and see what 
> happens.

I am probably missing something but: Why would setting num_sockets to 1
work fine as a mca param, while artificially setting it as said above
wouldn't ?

Brice



Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain

On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:

> Le 22/02/2012 17:48, Ralph Castain a écrit :
>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote:
>> 
>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI 
>>>> pukes on divide by zero.  OS info was listed in the original message 
>>>> (below).  Might we want to do something else?  E.g., assume num_sockets==1 
>>>> when num_sockets==0 (if you know what I mean)?  So, which one (or more) of 
>>>> the following should be fixed?
>>>> 
>>>> *) on this platform, hwloc finds no socket level
>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>> *) OMPI divides by 0 and barfs on basically everything
>>> Okay.  So, Brice's other e-mail indicates that the first two are "not 
>>> really uncommon":
>>> 
>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>> reports nothing interesting (only one machine object with multiple PU
>>>> children). So numsockets==0 isn't really uncommon.
>>> So, it seems to me that OMPI needs to handle the num_sockets==0 case rather 
>>> than just dividing by num_sockets.  This is v1.5 orte_odls_base_open() 
>>> since r25914.
>> Unfortunately, just artificially setting the num_sockets to 1 won't solve 
>> much - you'll get past that point in the code, but attempts to bind are 
>> likely to fail down the road. Fixing it will require some significant effort.
>> 
>> Given we haven't heard reports of this before, I'm not convinced it is a 
>> widespread problem. For now, let's just use the mca param and see what 
>> happens.
> 
> I am probably missing something but: Why would setting num_sockets to 1
> work fine as a mca param, while artificially setting it as said above
> wouldn't ?

Because using the param means the value isn't hardwired into the code base. I 
first want to verify that artificially forcing num_sockets to 1 doesn't break 
the code down the road, so the smaller the change needed to find out, the better.


> 
> Brice
> 




Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 2/22/2012 11:08 AM, Ralph Castain wrote:

On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:

Le 22/02/2012 17:48, Ralph Castain a écrit :

On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote

On 2/21/2012 10:31 PM, Eugene Loh wrote:

...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI pukes on 
divide by zero.  OS info was listed in the original message (below).  Might we want to do 
something else?  E.g., assume num_sockets==1 when num_sockets==0 (if you know what I 
mean)?  So, which one (or more) of the following should be fixed?

*) on this platform, hwloc finds no socket level
*) therefore hwloc returns num_sockets==0 to OMPI
*) OMPI divides by 0 and barfs on basically everything

Okay.  So, Brice's other e-mail indicates that the first two are "not really 
uncommon":

On 2/22/2012 7:55 AM, Brice Goglin wrote:

Anyway, we have seen other systems (mostly non-Linux) where lstopo
reports nothing interesting (only one machine object with multiple PU
children). So numsockets==0 isn't really uncommon.

So, it seems to me that OMPI needs to handle the num_sockets==0 case rather 
than just dividing by num_sockets.  This is v1.5 orte_odls_base_open() since 
r25914.

Unfortunately, just artificially setting the num_sockets to 1 won't solve much 
- you'll get past that point in the code, but attempts to bind are likely to 
fail down the road. Fixing it will require some significant effort.

Given we haven't heard reports of this before, I'm not convinced it is a 
widespread problem.

I assume we don't see the problem as widespread because it was only 
introduced into  v1.5 in r25914.  In my mind, the real question is how 
common it is for hwloc to decide numsockets==0.  On that one, Brice 
asserts it "isn't really uncommon."

For now, let's just use the mca param and see what happens.

I am probably missing something but: Why would setting num_sockets to 1
work fine as a mca param, while artificially setting it as said above
wouldn't ?

Because the param means that it isn't hardwired into the code base. I want to 
first verify that artificially forcing num_sockets to 1 doesn't break the code 
down the road, so the less change to find out, the better.

That sounds a lot different to me than the earlier statement.  Thanks 
for asking that question, Brice.  Anyhow, I tried using "--mca 
orte_num_sockets 1" and that seems to allow basic programs to run.


Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain

On Feb 22, 2012, at 12:24 PM, Eugene Loh wrote:

> On 2/22/2012 11:08 AM, Ralph Castain wrote:
>> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:
>>> Le 22/02/2012 17:48, Ralph Castain a écrit :
>>>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote
>>>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>>>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and OMPI 
>>>>>> pukes on divide by zero.  OS info was listed in the original message 
>>>>>> (below).  Might we want to do something else?  E.g., assume 
>>>>>> num_sockets==1 when num_sockets==0 (if you know what I mean)?  So, which 
>>>>>> one (or more) of the following should be fixed?
>>>>>> 
>>>>>> *) on this platform, hwloc finds no socket level
>>>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>>>> *) OMPI divides by 0 and barfs on basically everything
>>>>> Okay.  So, Brice's other e-mail indicates that the first two are "not 
>>>>> really uncommon":
>>>>> 
>>>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>>>> reports nothing interesting (only one machine object with multiple PU
>>>>>> children). So numsockets==0 isn't really uncommon.
>>>>> So, it seems to me that OMPI needs to handle the num_sockets==0 case 
>>>>> rather than just dividing by num_sockets.  This is v1.5 
>>>>> orte_odls_base_open() since r25914.
>>>> Unfortunately, just artificially setting the num_sockets to 1 won't solve 
>>>> much - you'll get past that point in the code, but attempts to bind are 
>>>> likely to fail down the road. Fixing it will require some significant 
>>>> effort.
>>>> 
>>>> Given we haven't heard reports of this before, I'm not convinced it is a 
>>>> widespread problem.
> I assume we don't see the problem as widespread because it was only 
> introduced into  v1.5 in r25914.  In my mind, the real question is how common 
> it is for hwloc to decide numsockets==0.  On that one, Brice asserts it 
> "isn't really uncommon."
>>>> For now, let's just use the mca param and see what happens.
>>> I am probably missing something but: Why would setting num_sockets to 1
>>> work fine as a mca param, while artificially setting it as said above
>>> wouldn't ?
>> Because the param means that it isn't hardwired into the code base. I want 
>> to first verify that artificially forcing num_sockets to 1 doesn't break the 
>> code down the road, so the less change to find out, the better.
> That sounds a lot different to me than the earlier statement.  Thanks for 
> asking that question, Brice.  Anyhow, I tried using "--mca orte_num_sockets 
> 1" and that seems to allow basic programs to run.

That doesn't really address the issue, though. What I want to know is: what 
happens when you try to bind processes? What about -bind-to-socket, and 
-persocket options? Etc.

Reason I'm concerned: I'm not sure what happens if the socket layer isn't 
present. The logic in 1.5 is pretty old, but I believe it relies heavily on 
sockets being present.





Re: [OMPI devel] v1.5 build failure w/ Solaris Studio 12.2 on Linux

2012-02-22 Thread Paul H. Hargrove

I think I have the beginning of a fix for this issue.

I had not even noticed earlier that the error in event.h is from the C++ 
compiler, when compiling file.cxx in the c++ bindings.  That makes the 
vendor-specific addition of "-library=stlport4" to CXXFLAGS quite 
relevant to the problem/solution.


It eventually occurred to me that when VT's sub-configure told me to add 
configure arguments, I could have used --with-contrib-vt-flags to pass 
that ONLY to VT and perhaps NOT mess with whatever karma was providing 
the definition of u_char.  However, when I tried that I was disappointed 
to find that the bit of configure logic that suggests/requires 
CXXFLAGS=-library=stlport4 (from ompi/contrib/vt/configure.m4) runs 
BEFORE the processing of --with-contrib-vt-flags.  So, that was a dead end.


So, the next idea was to look for a fix specific to stlport.  I tried 
adding the following near the top of opal/event/event.h (after the WINDOWS 
equivalent):

#ifdef STLPORT
typedef unsigned char u_char;
#endif


That managed to clear up the original problem w/ SS12.2.
With SS12.3, things also built fine.
This suggests the typedef is not "conflicting" with whatever other defn 
was present.
I think the "safety" of this needs to be examined more widely before 
this can be adopted.
My concern is that some system could "typedef char u_char" if it has 
char unsigned by default, leading to a conflict.

Now that would, I suppose, only be a problem if STLPORT is also defined.
So, maybe I am overthinking this.
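
For comparison, a minimal sketch of the alternative mentioned in the earlier message (using "unsigned char" directly instead of relying on u_char). The typedef name example_u_char and the function sum_bytes are hypothetical, chosen only for illustration; this is not what was adopted for opal/event/event.h.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical project-private name: it cannot collide with a u_char
 * that a system header may or may not provide. */
typedef unsigned char example_u_char;

/* Sum a byte buffer; each element is read as a value in 0..UCHAR_MAX,
 * regardless of whether plain char is signed on this platform. */
unsigned long sum_bytes(const example_u_char *buf, unsigned long len)
{
    unsigned long total = 0;

    assert(buf != NULL || len == 0);
    for (unsigned long i = 0; i < len; i++) {
        total += buf[i];
    }
    return total;
}
```

Because the name is private, no guard such as "#ifdef STLPORT" is needed, and a system that defines u_char differently cannot cause a conflict.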

-Paul

On 2/21/2012 11:10 PM, Paul H. Hargrove wrote:

More notes:

I've tested ompi-1.5.4 and it has the same problem.  So, this is NOT a 
regression.


Terry D. has observed that Ubuntu is NOT a supported platform for the 
Solaris Studio compilers.
So, I've reproduced on a Scientific Linux 5.5 platform (Red Hat 
Enterprise Linux 5.5 clone, like CentOS) to be sure that was NOT the 
cause.


When I configure for the SS12.x compilers, I've been passing  
CXXFLAGS="-library=stlport4" as the VT sub-configure has informed me I 
should, due to something wrong with the default STL.  I tried dropping 
that from configure, and THE BUILD WAS SUCCESSFUL.


So, one has 2 choices:
+ build w/ SS12.2 without VT
+ update to SS12.3 and have VT

I don't think there is sufficient reason to delay 1.5.5 for this.

-Paul

On 2/21/2012 4:39 PM, Paul H. Hargrove wrote:

A few things to note:

1) This is NOT a problem w/ the SS12.3 compilers on the same machine.
So, one could say "upgrade your compiler" (a free download) and not 
delay 1.5.5 for this issue.


2) This is ONLY a problem on Linux, and not on Solaris (both SS12.2 
and SS12.3 tested for x86, x86-64, Sparc/v9 and Sparc/v8plus)


3) Testing the trunk I DON'T see the problem with either SS12.2 or 
SS12.3.
This is interesting, because it probably means that a u_char 
definition is SOMEWHERE in the headers (because libevent *is* getting 
built).


Whatever else may be done, I think this should be fixed "properly" 
(whatever that may equate to) for 1.6.
The way I see it now, it feels like OMPI is getting a definition of 
u_char only "by accident".


-Paul

On 2/21/2012 12:16 PM, Paul H. Hargrove wrote:
Building the v1.5 branch on Linux with the Solaris Studio 12.2 
compilers I see the following failure:
"[srcdir]/opal/event/event.h", line 797: Error: Type name expected 
instead of "u_char".
"[srcdir]/opal/event/event.h", line 798: Error: Type name expected 
instead of "u_char".
"[srcdir]/opal/event/event.h", line 1184: Error: "," expected 
instead of "*".

Where line 1184 is a prototype containing "u_char *".

As far as I can find, only several files below opal/event/ contain 
any use of "u_char".

There is a typedef for u_char in hwloc, but no use that I could see.

To the best of my knowledge u_char is NOT defined by any standard, 
and thus there is no particular header one can reliably find it in.
The alternatives, of course, are "unsigned char" or "uint8_t" 
(defined in stdint.h).


I had a look at the trunk and VISUALLY it appears the same problem 
exists in:

   opal/event/event.h
   opal/mca/event/libevent2013/libevent/event.h
However, my testing is currently confined to the v1.5 branch in the 
hopes of finally getting the next 1.5.5rc out the door.


-Paul







--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Brice Goglin
Le 22/02/2012 20:24, Eugene Loh a écrit :
> On 2/22/2012 11:08 AM, Ralph Castain wrote:
>> On Feb 22, 2012, at 11:59 AM, Brice Goglin wrote:
>>> Le 22/02/2012 17:48, Ralph Castain a écrit :
>>>> On Feb 22, 2012, at 9:39 AM, Eugene Loh wrote
>>>>> On 2/21/2012 10:31 PM, Eugene Loh wrote:
>>>>>> ...  "sockets" is unknown and hwloc returns 0 for num_sockets and
>>>>>> OMPI pukes on divide by zero.  OS info was listed in the original
>>>>>> message (below).  Might we want to do something else?  E.g.,
>>>>>> assume num_sockets==1 when num_sockets==0 (if you know what I
>>>>>> mean)?  So, which one (or more) of the following should be fixed?
>>>>>>
>>>>>> *) on this platform, hwloc finds no socket level
>>>>>> *) therefore hwloc returns num_sockets==0 to OMPI
>>>>>> *) OMPI divides by 0 and barfs on basically everything
>>>>> Okay.  So, Brice's other e-mail indicates that the first two are
>>>>> "not really uncommon":
>>>>>
>>>>> On 2/22/2012 7:55 AM, Brice Goglin wrote:
>>>>>> Anyway, we have seen other systems (mostly non-Linux) where lstopo
>>>>>> reports nothing interesting (only one machine object with
>>>>>> multiple PU
>>>>>> children). So numsockets==0 isn't really uncommon.
>>>>> So, it seems to me that OMPI needs to handle the num_sockets==0
>>>>> case rather than just dividing by num_sockets.  This is v1.5
>>>>> orte_odls_base_open() since r25914.
>>>> Unfortunately, just artificially setting the num_sockets to 1 won't
>>>> solve much - you'll get past that point in the code, but attempts
>>>> to bind are likely to fail down the road. Fixing it will require
>>>> some significant effort.
>>>>
>>>> Given we haven't heard reports of this before, I'm not convinced it
>>>> is a widespread problem.
> I assume we don't see the problem as widespread because it was only
> introduced into  v1.5 in r25914.  In my mind, the real question is how
> common it is for hwloc to decide numsockets==0.  On that one, Brice
> asserts it "isn't really uncommon."

On Linux, it's uncommon: it only happens on some platforms with very old
kernels (2.6.10 or so).
Solaris, Darwin and Windows should get sockets in some/most cases.
FreeBSD should get x86 sockets correctly because we use cpuid directly
there.

Unless I am missing something, others have nothing related to sockets in
their driver: AIX, HPUX, OSF.

Brice



[OMPI devel] 1.5 supported systems

2012-02-22 Thread Jeffrey Squyres
Please verify this list of supported systems for the v1.5.5 release:

- The run-time systems that are currently supported are:
  - rsh / ssh
  - LoadLeveler
  - PBS Pro, Open PBS, Torque
  - Platform LSF (v7.0.2 and later)
  - SLURM
  - Cray XT-3, XT-4, and XT-5
  - Oracle Grid Engine (OGE) 6.1, 6.2 and open source Grid Engine
  - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008)

- Systems that have been tested are:
  - Linux (various flavors/distros), 32 bit, with gcc, and Oracle
Solaris Studio 12
  - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
Intel, Portland, and Oracle Solaris Studio 12 compilers (*)
  - OS X (10.5, 10.6, 10.7), 32 and 64 bit (x86_64), with gcc and
Absoft compilers (*)
  - Oracle Solaris 10, 32 and 64 bit (SPARC, i386, x86_64), with
Oracle Solaris Studio 12

  (*) Be sure to read the Compiler Notes, below.

- Other systems have been lightly (but not fully tested):
  - Other 64 bit platforms (e.g., Linux on PPC64)
  - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008);
see the README.WINDOWS file.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Eugene Loh

On 02/22/12 14:54, Ralph Castain wrote:
That doesn't really address the issue, though. What I want to know is: 
what happens when you try to bind processes? What about 
-bind-to-socket, and -persocket options? Etc. Reason I'm concerned: 
I'm not sure what happens if the socket layer isn't present. The logic 
in 1.5 is pretty old, but I believe it relies heavily on sockets being 
present.

Okay.  So,

*)  "out of the box", basically nothing works.  For example, "mpirun 
hostname" segfaults.


*)  With "--mca orte_num_sockets 1", stuff appears to work.

*)  With "--mca orte_num_sockets 1" and adding either "--bysocket 
--bind-to-socket" or "--npersocket ", I get:


--
Unable to bind to socket -13 on node burl-ct-v20z-10.
--
--
mpirun was unable to start the specified application as it encountered 
an error:


Error name: Fatal
Node: burl-ct-v20z-10

when attempting to start process rank 0.
--
2 total processes failed to start

So, I hear Brice's comment that this is an old kernel.  And, I hear what 
you're saying about a "real" fix being expensive.  Nevertheless, to my 
taste, automatically setting num_sockets==1 when num_sockets==0 is 
detected makes a lot of sense.  It makes things "basically" work, 
turning a situation where everything including "mpirun hostname" 
segfaults into a situation where default usage works just fine.  What 
remains broken is binding, which generates an error message that gives 
the user a hope of making progress (turning off binding).  That's in 
contrast to expecting users to go from


% mpirun hostname
Segmentation fault

to knowing that they should set orte_num_sockets==1.
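
[Editor's sketch] The fallback proposed above, treating a detected socket count of 0 as 1 before anything divides by it, might look roughly like the following in C. This is not the code in orte/mca/odls/base/odls_base_open.c. The names effective_num_sockets and cores_per_socket are hypothetical, and the cores-per-socket division is only an assumed stand-in for whatever arithmetic actually hits the divide by zero.

```c
#include <assert.h>

/* Hypothetical helper (not from the OMPI source): map "no socket level
 * detected" (a count of 0, or anything non-positive) to a single socket. */
static int effective_num_sockets(int detected)
{
    return detected > 0 ? detected : 1;
}

/* Hypothetical stand-in for a computation that divides by the socket count.
 * Because the divisor is clamped to at least 1, it cannot divide by zero. */
static int cores_per_socket(int num_cores, int detected_sockets)
{
    assert(num_cores >= 0);
    return num_cores / effective_num_sockets(detected_sockets);
}
```

With a clamp of this kind, a node that reports no socket level would be treated as a single socket, so default runs could proceed. Explicit socket binding would still need its own error path, as described above.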


Re: [OMPI devel] v1.5 r25914 DOA

2012-02-22 Thread Ralph Castain
That's what we needed to know - i.e., that setting num_sockets=1 generates an 
error instead of segfaulting down the road. I can submit a CMR to do so.

thx!

On Feb 22, 2012, at 4:12 PM, Eugene Loh wrote:

> On 02/22/12 14:54, Ralph Castain wrote:
>> That doesn't really address the issue, though. What I want to know is: what 
>> happens when you try to bind processes? What about -bind-to-socket, and 
>> -persocket options? Etc. Reason I'm concerned: I'm not sure what happens if 
>> the socket layer isn't present. The logic in 1.5 is pretty old, but I 
>> believe it relies heavily on sockets being present.
> Okay.  So,
> 
> *)  "out of the box", basically nothing works.  For example, "mpirun 
> hostname" segfaults.
> 
> *)  With "--mca orte_num_sockets 1", stuff appears to work.
> 
> *)  With "--mca orte_num_sockets 1" and adding either "--bysocket 
> --bind-to-socket" or "--npersocket ", I get:
> 
> --
> Unable to bind to socket -13 on node burl-ct-v20z-10.
> --
> --
> mpirun was unable to start the specified application as it encountered an 
> error:
> 
> Error name: Fatal
> Node: burl-ct-v20z-10
> 
> when attempting to start process rank 0.
> --
> 2 total processes failed to start
> 
> So, I hear Brice's comment that this is an old kernel.  And, I hear what 
> you're saying about a "real" fix being expensive.  Nevertheless, to my taste, 
> automatically setting num_sockets==1 when num_sockets==0 is detected makes a 
> lot of sense.  It makes things "basically" work, turning a situation where 
> everything including "mpirun hostname" segfaults into a situation where 
> default usage works just fine.  What remains broken is binding, which 
> generates an error message that gives the user a hope of making progress 
> (turning off binding).  That's in contrast from expecting users to go from
> 
> % mpirun hostname
> Segmentation fault
> 
> to knowing that they should set orte_num_sockets==1.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] v1.5 build failure w/ Solaris Studio 12.2 on Linux

2012-02-22 Thread Jeffrey Squyres
Terry / Eugene --

Can you comment?


On Feb 22, 2012, at 3:16 PM, Paul H. Hargrove wrote:

> I think I have the beginning of a fix for this issue.
> 
> I had not even noticed earlier that the error in event.h is from the C++ 
> compiler, when compiling file.cxx in the c++ bindings.  That makes the 
> vendor-specific addition of "-library=stlport4" to CXXFLAGS quite relevant to 
> the problem/solution.
> 
> It eventually occurred to me that when VT's sub-configure told me to add 
> configure arguments, I could have used --with-contrib-vt-flags to pass that 
> ONLY to VT and perhaps NOT mess with whatever karma was providing the 
> definition of u_char.  However, when I tried that I was disappointed to find 
> that the bit of configure logic that suggests/requires 
> CXXFLAGS=-library=stlport4 (from ompi/contrib/vt/configure.m4) runs BEFORE 
> the processing of --with-contrib-vt-flags.  So, that was a dead end.
> 
> So, the next idea was to look for a fix specific to stlport.  I tried adding 
> near the top of opal/event/event.h (after the WINDOWS equivalent):
>> #ifdef STLPORT
>> typedef unsigned char u_char;
>> #endif
> 
> That managed to clear up the original problem w/ SS12.2.
> With SS12.3, things also built fine.
> This suggests the typedef is not "conflicting" with whatever other defn was 
> present.
> I think the "safety" of this needs to be examined more widely before this can 
> be adopted.
> My concern is that some system could "typedef char u_char" if it has char 
> unsigned by default, leading to a conflict.
> Now that would, I suppose, only be a problem if STLPORT is also defined.
> So, maybe I am overthinking this.
> 
> -Paul
> 
> On 2/21/2012 11:10 PM, Paul H. Hargrove wrote:
>> More notes:
>> 
>> I've tested ompi-1.5.4 and it has the same problem.  So, this is NOT a 
>> regression.
>> 
>> Terry D. has observed that Ubuntu is NOT a supported platform for the 
>> Solaris Studio compilers.
>> So, I've reproduced on a Scientific Linux 5.5 platform (Red Hat Enterprise 
>> Linux 5.5 clone, like CentOS) to be sure that was NOT the cause.
>> 
>> When I configure for the SS12.x compilers, I've been passing  
>> CXXFLAGS="-library=stlport4" as the VT sub-configure has informed me I 
>> should, due to something wrong with the default STL.  I tried dropping that 
>> from configure, and THE BUILD WAS SUCCESSFUL.
>> 
>> So, one has 2 choices:
>> + build w/ SS12.2 without VT
>> + update to SS12.3 and have VT
>> 
>> I don't think there is sufficient reason to delay 1.5.5 for this.
>> 
>> -Paul
>> 
>> On 2/21/2012 4:39 PM, Paul H. Hargrove wrote:
>>> A few things to note:
>>> 
>>> 1) This is NOT a problem w/ the SS12.3 compilers on the same machine.
>>> So, one could say "upgrade your compiler" (a free download) and not delay 
>>> 1.5.5 for this issue.
>>> 
>>> 2) This is ONLY a problem on Linux, and not on Solaris (both SS12.2 and 
>>> SS12.3 tested for x86, x86-64, Sparc/v9 and Sparc/v8plus)
>>> 
>>> 3) Testing the trunk I DON'T see the problem with either SS12.2 or SS12.3.
>>> This is interesting, because it probably means that a u_char definition is 
>>> SOMEWHERE in the headers (because libevent *is* getting built).
>>> 
>>> Whatever else may be done, I think this should be fixed "properly" 
>>> (whatever that may equate to) for 1.6.
>>> The way I see it now, it feels like OMPI is getting a definition of u_char 
>>> only "by accident".
>>> 
>>> -Paul
>>> 
>>> On 2/21/2012 12:16 PM, Paul H. Hargrove wrote:
>>>> Building the v1.5 branch on Linux with the Solaris Studio 12.2 compilers I 
>>>> see the following failure:
>>>>> "[srcdir]/opal/event/event.h", line 797: Error: Type name expected 
>>>>> instead of "u_char".
>>>>> "[srcdir]/opal/event/event.h", line 798: Error: Type name expected 
>>>>> instead of "u_char".
>>>>> "[srcdir]/opal/event/event.h", line 1184: Error: "," expected instead of 
>>>>> "*".
>>>> Where line 1184 is a prototype containing "u_char *".
>>>> 
>>>> As far as I can find, only several files below opal/event/ contain any use 
>>>> of "u_char".
>>>> There is a typedef for u_char in hwloc, but no use that I could see.
>>>> 
>>>> To the best of my knowledge u_char is NOT defined by any standard, and 
>>>> thus there is no particular header one can reliably find it in.
>>>> The alternatives, of course are "unsigned char" or "uint8_t" (defined in 
>>>> stdint.h).
>>>> 
>>>> I had a look at the trunk and VISUALLY it appears the same problem exists 
>>>> in:
>>>>   opal/event/event.h
>>>>   opal/mca/event/libevent2013/libevent/event.h
>>>> However, my testing is currently confined to the v1.5 branch in the hopes 
>>>> of finally getting the next 1.5.5rc out the door.
>>>> 
>>>> -Paul
>>>> 
>>> 
>> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> HPC Research Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> 
> ___
> devel mail
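
[Editor's sketch] Paul's point in the quoted thread is that u_char is not defined by any standard, and that "unsigned char" or "uint8_t" are the portable alternatives. A minimal illustration of that alternative in C is below. It avoids the name u_char entirely rather than conditionally defining it, which is a different approach from the "#ifdef STLPORT" typedef tried above. The names example_uchar and count_nonzero_bytes are hypothetical and do not come from opal/event/event.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical project-local alias; the name is illustrative only.
 * Spelling the type out avoids depending on a non-standard u_char. */
typedef unsigned char example_uchar;

/* Where uint8_t exists it has the same size as unsigned char, so either
 * spelling could replace u_char in a prototype. */
_Static_assert(sizeof(uint8_t) == sizeof(example_uchar),
               "uint8_t and unsigned char differ in size");

/* Hypothetical function standing in for a prototype that took "u_char *". */
static size_t count_nonzero_bytes(const example_uchar *buf, size_t len)
{
    size_t n = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            n++;
        }
    }
    return n;
}
```

Because the alias has its own name, it cannot collide with a system header that happens to define u_char differently, which is the conflict Paul was worried about.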

Re: [OMPI devel] 1.5 supported systems

2012-02-22 Thread Paul H. Hargrove
Folks at Oracle should decide, but I suspect "Solaris 10" should be 
updated to "Solaris 10 and 11", or just "11".


-Paul

On 2/22/2012 2:44 PM, Jeffrey Squyres wrote:

Please verify this list of supported systems for the v1.5.5 release:

- The run-time systems that are currently supported are:
   - rsh / ssh
   - LoadLeveler
   - PBS Pro, Open PBS, Torque
   - Platform LSF (v7.0.2 and later)
   - SLURM
   - Cray XT-3, XT-4, and XT-5
   - Oracle Grid Engine (OGE) 6.1, 6.2 and open source Grid Engine
   - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008)

- Systems that have been tested are:
   - Linux (various flavors/distros), 32 bit, with gcc, and Oracle
 Solaris Studio 12
   - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
 Intel, Portland, and Oracle Solaris Studio 12 compilers (*)
   - OS X (10.5, 10.6, 10.7), 32 and 64 bit (x86_64), with gcc and
 Absoft compilers (*)
   - Oracle Solaris 10, 32 and 64 bit (SPARC, i386, x86_64), with
 Oracle Solaris Studio 12

   (*) Be sure to read the Compiler Notes, below.

- Other systems have been lightly (but not fully tested):
   - Other 64 bit platforms (e.g., Linux on PPC64)
   - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008);
 see the README.WINDOWS file.



--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] 1.5 supported systems

2012-02-22 Thread Larry Baker

Paul,

Haven't you been running Intel compilers on OS X?

Also, do we have specifics about which gcc's on Mac OS X?  I have (OS  
X 10.5.8):



savaii:~ baker$ ls -l /usr/bin/gcc*
lrwxr-xr-x  1 root  wheel   7 Oct  2  2009 /usr/bin/gcc -> gcc-4.0
-r-xr-xr-x  1 root  wheel  258368 Feb 19  2008 /usr/bin/gcc-3.3
-rwxr-xr-x  1 root  wheel   93088 Feb  5  2009 /usr/bin/gcc-4.0
-rwxr-xr-x  1 root  wheel  105680 Apr 27  2009 /usr/bin/gcc-4.2



savaii:~ baker$ ls -l /usr/bin/cc*
lrwxr-xr-x  1 root  wheel  7 Oct  2  2009 /usr/bin/cc -> gcc-4.0



savaii:~ baker$ ls /Developer/usr/llvm-gcc-4.2/bin/*cc*
/Developer/usr/llvm-gcc-4.2/bin/i686-apple-darwin9-llvm-gcc-4.2
/Developer/usr/llvm-gcc-4.2/bin/llvm-gcc-4.2
/Developer/usr/llvm-gcc-4.2/bin/powerpc-apple-darwin9-llvm-gcc-4.2



Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov

On 22 Feb 2012, at 5:55 PM, Paul H. Hargrove wrote:

Folks at Oracle should decide, but I suspect "Solaris 10" should be  
updated to "Solaris 10 and 11", or just "11".


-Paul

On 2/22/2012 2:44 PM, Jeffrey Squyres wrote:

Please verify this list of supported systems for the v1.5.5 release:

- The run-time systems that are currently supported are:
  - rsh / ssh
  - LoadLeveler
  - PBS Pro, Open PBS, Torque
  - Platform LSF (v7.0.2 and later)
  - SLURM
  - Cray XT-3, XT-4, and XT-5
  - Oracle Grid Engine (OGE) 6.1, 6.2 and open source Grid Engine
  - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008)

- Systems that have been tested are:
  - Linux (various flavors/distros), 32 bit, with gcc, and Oracle
Solaris Studio 12
  - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
Intel, Portland, and Oracle Solaris Studio 12 compilers (*)
  - OS X (10.5, 10.6, 10.7), 32 and 64 bit (x86_64), with gcc and
Absoft compilers (*)
  - Oracle Solaris 10, 32 and 64 bit (SPARC, i386, x86_64), with
Oracle Solaris Studio 12

  (*) Be sure to read the Compiler Notes, below.

- Other systems have been lightly (but not fully tested):
  - Other 64 bit platforms (e.g., Linux on PPC64)
  - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008);
see the README.WINDOWS file.



--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900





Re: [OMPI devel] 1.5 supported systems

2012-02-22 Thread Paul H. Hargrove

I have NOT been running Intel's compilers on Macs, only on Linux.
I *tried* PGI's compilers on MacOS, but that was a flop.
I have used Clang (comes w/ XCode 4.2) on MacOS, and that works for me 
but is not extensively tested.


-Paul

On 2/22/2012 6:13 PM, Larry Baker wrote:

Paul,

Haven't you been running Intel compilers on OS X?

Also, do we have specifics about which gcc's on Mac OS X?  I have (OS 
X 10.5.8):



savaii:~ baker$ ls -l /usr/bin/gcc*
lrwxr-xr-x  1 root  wheel   7 Oct  2  2009 /usr/bin/gcc -> gcc-4.0
-r-xr-xr-x  1 root  wheel  258368 Feb 19  2008 /usr/bin/gcc-3.3
-rwxr-xr-x  1 root  wheel   93088 Feb  5  2009 /usr/bin/gcc-4.0
-rwxr-xr-x  1 root  wheel  105680 Apr 27  2009 /usr/bin/gcc-4.2



savaii:~ baker$ ls -l /usr/bin/cc*
lrwxr-xr-x  1 root  wheel  7 Oct  2  2009 /usr/bin/cc -> gcc-4.0



savaii:~ baker$ ls /Developer/usr/llvm-gcc-4.2/bin/*cc*
/Developer/usr/llvm-gcc-4.2/bin/i686-apple-darwin9-llvm-gcc-4.2
/Developer/usr/llvm-gcc-4.2/bin/llvm-gcc-4.2
/Developer/usr/llvm-gcc-4.2/bin/powerpc-apple-darwin9-llvm-gcc-4.2


Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov 

On 22 Feb 2012, at 5:55 PM, Paul H. Hargrove wrote:

Folks at Oracle should decide, but I suspect "Solaris 10" should be 
updated to "Solaris 10 and 11", or just "11".


-Paul

On 2/22/2012 2:44 PM, Jeffrey Squyres wrote:

Please verify this list of supported systems for the v1.5.5 release:

- The run-time systems that are currently supported are:
  - rsh / ssh
  - LoadLeveler
  - PBS Pro, Open PBS, Torque
  - Platform LSF (v7.0.2 and later)
  - SLURM
  - Cray XT-3, XT-4, and XT-5
  - Oracle Grid Engine (OGE) 6.1, 6.2 and open source Grid Engine
  - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008)

- Systems that have been tested are:
  - Linux (various flavors/distros), 32 bit, with gcc, and Oracle
Solaris Studio 12
  - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
Intel, Portland, and Oracle Solaris Studio 12 compilers (*)
  - OS X (10.5, 10.6, 10.7), 32 and 64 bit (x86_64), with gcc and
Absoft compilers (*)
  - Oracle Solaris 10, 32 and 64 bit (SPARC, i386, x86_64), with
Oracle Solaris Studio 12

  (*) Be sure to read the Compiler Notes, below.

- Other systems have been lightly (but not fully tested):
  - Other 64 bit platforms (e.g., Linux on PPC64)
  - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008);
see the README.WINDOWS file.



--
Paul H. Hargrove phhargr...@lbl.gov 
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] 1.5 supported systems

2012-02-22 Thread Paul H. Hargrove
I can get exact info from my MacOS 10.7 machine later, but its gcc is 
llvm-gcc-4.2 IIRC.

Here are my 10.5 and 10.6:

ProductName:Mac OS X
ProductVersion: 10.5.8
BuildVersion:   9L31a
powerpc
lrwxr-xr-x  1 root  wheel   7 Nov  1  2008 /usr/bin/gcc -> gcc-4.0
-r-xr-xr-x  1 root  wheel  258368 Feb 19  2008 /usr/bin/gcc-3.3
-rwxr-xr-x  1 root  wheel   93088 Jul 17  2008 /usr/bin/gcc-4.0
-rwxr-xr-x  1 root  wheel  105680 May 18  2008 /usr/bin/gcc-4.2

ProductName:Mac OS X
ProductVersion: 10.5.8
BuildVersion:   9L30
i386
lrwxr-xr-x  1 root  wheel  7 Nov  8  2007 /usr/bin/gcc -> gcc-4.0
-rwxr-xr-x  1 root  wheel  93072 Sep 23  2007 /usr/bin/gcc-4.0

ProductName:Mac OS X
ProductVersion: 10.6.8
BuildVersion:   10K549
i386
lrwxr-xr-x  1 root  wheel   7 Sep 29  2009 /usr/bin/gcc -> gcc-4.2
-rwxr-xr-x  1 root  wheel   97392 May 18  2009 /usr/bin/gcc-4.0
-rwxr-xr-x  1 root  wheel  166128 May 18  2009 /usr/bin/gcc-4.2


On 2/22/2012 6:13 PM, Larry Baker wrote:

Paul,

Haven't you been running Intel compilers on OS X?

Also, do we have specifics about which gcc's on Mac OS X?  I have (OS 
X 10.5.8):



savaii:~ baker$ ls -l /usr/bin/gcc*
lrwxr-xr-x  1 root  wheel   7 Oct  2  2009 /usr/bin/gcc -> gcc-4.0
-r-xr-xr-x  1 root  wheel  258368 Feb 19  2008 /usr/bin/gcc-3.3
-rwxr-xr-x  1 root  wheel   93088 Feb  5  2009 /usr/bin/gcc-4.0
-rwxr-xr-x  1 root  wheel  105680 Apr 27  2009 /usr/bin/gcc-4.2



savaii:~ baker$ ls -l /usr/bin/cc*
lrwxr-xr-x  1 root  wheel  7 Oct  2  2009 /usr/bin/cc -> gcc-4.0



savaii:~ baker$ ls /Developer/usr/llvm-gcc-4.2/bin/*cc*
/Developer/usr/llvm-gcc-4.2/bin/i686-apple-darwin9-llvm-gcc-4.2
/Developer/usr/llvm-gcc-4.2/bin/llvm-gcc-4.2
/Developer/usr/llvm-gcc-4.2/bin/powerpc-apple-darwin9-llvm-gcc-4.2


Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov 

On 22 Feb 2012, at 5:55 PM, Paul H. Hargrove wrote:

Folks at Oracle should decide, but I suspect "Solaris 10" should be 
updated to "Solaris 10 and 11", or just "11".


-Paul

On 2/22/2012 2:44 PM, Jeffrey Squyres wrote:

Please verify this list of supported systems for the v1.5.5 release:

- The run-time systems that are currently supported are:
  - rsh / ssh
  - LoadLeveler
  - PBS Pro, Open PBS, Torque
  - Platform LSF (v7.0.2 and later)
  - SLURM
  - Cray XT-3, XT-4, and XT-5
  - Oracle Grid Engine (OGE) 6.1, 6.2 and open source Grid Engine
  - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008)

- Systems that have been tested are:
  - Linux (various flavors/distros), 32 bit, with gcc, and Oracle
Solaris Studio 12
  - Linux (various flavors/distros), 64 bit (x86), with gcc, Absoft,
Intel, Portland, and Oracle Solaris Studio 12 compilers (*)
  - OS X (10.5, 10.6, 10.7), 32 and 64 bit (x86_64), with gcc and
Absoft compilers (*)
  - Oracle Solaris 10, 32 and 64 bit (SPARC, i386, x86_64), with
Oracle Solaris Studio 12

  (*) Be sure to read the Compiler Notes, below.

- Other systems have been lightly (but not fully tested):
  - Other 64 bit platforms (e.g., Linux on PPC64)
  - Microsoft Windows CCP (Microsoft Windows server 2003 and 2008);
see the README.WINDOWS file.



--
Paul H. Hargrove phhargr...@lbl.gov 
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900