[OMPI devel] ud oob is borked

2012-07-31 Thread Jeff Squyres
There's a compile error in the ud oob right now.  I tried a few different ways 
to fix it, but I'm still consistently getting segv's.  

-
[svbu-mpi046:02934] wc status = 2
[svbu-mpi046:02934] *** Process received signal ***
[svbu-mpi046:02934] Signal: Segmentation fault (11)
[svbu-mpi046:02934] Signal code: Address not mapped (1)
[svbu-mpi046:02934] Failing at address: 0x128
[svbu-mpi046:02934] [ 0] /lib64/libpthread.so.0() [0x3d5940f4a0]
[svbu-mpi046:02934] [ 1] 
/home/jsquyres/bogus/lib/libopen-rte.so.0(mca_oob_ud_msg_post_send+0x1ce) 
[0x77c686d7]
[svbu-mpi046:02934] [ 2] 
/home/jsquyres/bogus/lib/libopen-rte.so.0(mca_oob_ud_send_nb+0x5d1) 
[0x77c6a851]
[svbu-mpi046:02934] [ 3] 
/home/jsquyres/bogus/lib/libopen-rte.so.0(orte_rml_oob_send_buffer_nb+0x5bd) 
[0x77cb70f3]
[svbu-mpi046:02934] [ 4] 
/home/jsquyres/bogus/lib/libopen-rte.so.0(orte_daemon+0x17de) [0x77c1c701]
[svbu-mpi046:02934] [ 5] /home/jsquyres/bogus/bin/orted() [0x40082a]
[svbu-mpi046:02934] [ 6] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3d5901ecdd]
[svbu-mpi046:02934] [ 7] /home/jsquyres/bogus/bin/orted() [0x4006e9]
[svbu-mpi046:02934] *** End of error message ***
Segmentation fault (core dumped)
-

So that we don't get another night of 161K MTT failures at Cisco (before I 
killed it), I'm going to .ompi_ignore the ud oob on the trunk.

Nathan: feel free to un-ompi-ignore it when you have it fixed.  Thanks.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] MPI_Mprobe

2012-07-31 Thread Eugene Loh

On 7/31/2012 5:15 AM, Jeff Squyres wrote:

On Jul 31, 2012, at 2:58 AM, Eugene Loh wrote:

The main issue is this.  If I go to ompi/mpi/fortran/mpif-h, I see six files (*recv_f and 
*probe_f) that take status arguments.  Normally, we do some conversion between Fortran 
and C status arguments.  These files test if OMPI_SIZEOF_FORTRAN_INTEGER==SIZEOF_INT, 
however, and bypass the conversion if the two types of integers are the same size.  The 
problem with this is that while the structures may be the same size, the C status has a 
size_t in it.  So, while the Fortran INTEGER array can start on any 4-byte alignment, the 
C status can end up with a 64-bit pointer on a 4-byte alignment.  That's not pleasant in 
general and can incur some serious hand-slapping on SPARC.  Specifically, SPARC/-m64 
errors out on *probe and *recv with MPI_PROC_NULL sources.  Would it be all right if I 
removed these "shorts cuts"?

Ew.  Yes.  You're right.

What specifically do you propose?  I don't remember offhand if the status 
conversion macros are the same as the regular int/INTEGER conversion macros -- 
we want to keep the no-op behavior for the regular int/INTEGER conversion 
macros and only handle the conversion of MPI_Status separately, I think.  
Specifically: for MPI_Status, we can probably still have those shortcuts for 
the int/INTEGERs, but treat the copying to the size_t separately.

I'm embarrassingly unfamiliar with the code.  My impression is that 
internally we deal with C status structures and so our requirements for 
Fortran status are:

*)  enough bytes to hold whatever is in a C status
*)  several words are addressable via the indices MPI_SOURCE, MPI_TAG, 
and MPI_ERROR
So, I think what we do today is sufficient in most respects.  Copying 
between Fortran and C integer-by-integer is fine.  It might be a little 
nonsensical for an 8-byte size_t component to be handled as two 4-byte 
words, but if we do so only for copying and otherwise only use that 
component from the C side, things should be fine.


The only problem is if we try to use the Fortran array in-place.  It's 
big enough, but its alignment might be wrong.


So, specifically, what I propose is getting rid of the short-cuts that 
try to use Fortran statuses in-place if Fortran INTEGERs are as big as C 
ints.  I can make the changes.  Sanity checks on all that are welcome.
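
Something along these lines is what I have in mind (illustration only, not the 
actual mpif-h glue; the wrapper name is made up and the public MPI_Status_c2f 
routine stands in for whatever internal conversion macros the real bindings use):

#include <mpi.h>

/* Illustrative f2c-style wrapper: always go through a naturally aligned
 * C status on the stack instead of casting the Fortran INTEGER array. */
void example_probe_f(MPI_Fint *source, MPI_Fint *tag, MPI_Fint *comm,
                     MPI_Fint *f_status, MPI_Fint *ierr)
{
    MPI_Status c_status;                     /* properly aligned for its size_t member */
    MPI_Comm   c_comm = MPI_Comm_f2c(*comm);

    *ierr = (MPI_Fint) MPI_Probe((int) *source, (int) *tag, c_comm, &c_status);

    /* Copy back word by word; no 8-byte member is ever accessed through a
     * pointer derived from the (possibly only 4-byte aligned) Fortran array. */
    if (MPI_SUCCESS == (int) *ierr && MPI_F_STATUS_IGNORE != f_status) {
        MPI_Status_c2f(&c_status, f_status);
    }
}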

Thanks for fixing the ibm MPROBE tests, btw.  Further proof that I must have 
been clinically insane when I added all those tests.  :-(

Insane, no, but you might copy out long-hand 100x:
for(i=0;i

Re: [OMPI devel] OpenMPI and SGE integration made more stable

2012-07-31 Thread Kenneth A. Lloyd
I haven't used SGE or Oracle Grid Engine in ages, but apparently it is now
called Open Grid Engine
http://gridscheduler.sourceforge.net/


-Original Message-
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
Behalf Of Rayson Ho
Sent: Friday, July 27, 2012 8:25 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] OpenMPI and SGE integration made more stable

On Fri, Jul 27, 2012 at 8:53 AM, Daniel Gruber  wrote:
> A while after u5 the open source repository was closed and most of the 
> German engineers from Sun/Oracle moved to Univa, working on Univa Grid 
> Engine. Currently you have the choice between Univa Grid Engine, Son 
> of Grid Engine (free academic project), and OGS.

Oracle Grid Engine is still alive, and in fact updates are still released by
Oracle from time to time.

(But of course it is not free, and since most people are looking for a free
download, it is usually not mentioned in the mailing list
discussions...)

Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/


> Daniel
>
>






Re: [OMPI devel] The hostfile option

2012-07-31 Thread George Bosilca

On Jul 30, 2012, at 15:29 , Ralph Castain wrote:

> 
> On Jul 30, 2012, at 2:37 AM, George Bosilca wrote:
> 
>> I think that as long as there is a single home area per cluster the 
>> difference between the different approaches might seem irrelevant to most of 
>> the people.
> 
> Yeah, I agree - after thinking about it, it probably didn't accomplish much.
> 
>> 
>> My problem is twofold. First, I have a common home area across several 
>> different development clusters. Thus I have direct access through ssh to any 
>> machine. If I create a single large machinefile, it turns out that every 
>> mpirun will spawn a daemon on every single node, even if I only run a 
>> ping-pong test.
> 
> That shouldn't happen if you specify the hosts you want to use, either via 
> -host or -hostfile. I assume you are specifying nothing and so you get that 
> behavior?
> 
>> Second, while I usually run my apps on the same set of resources, I need on a 
>> regular basis to switch my nodes for a few tests.
>> 
>> What I was hoping to achieve is a machinefile containing the "default" 
>> development cluster (aka the cluster where I'm almost alone so my daemons 
>> have minimal chances to disturb other people's experiments), and then use a 
>> machinefile to sporadically change the cluster where I run for smaller tests. 
>> Unfortunately, this doesn't work due to the filtering behavior described in 
>> my original email.
> 
> Why not just set the default hostfile to point to the new machinefile via the 
> "--default-hostfile foo" option to mpirun, or you can use the corresponding 
> MCA param?

I confirm: if instead of -machinefile I use --default-hostfile, I get the 
behavior I expected (it overrides the default).

> I'm not trying to re-open the hostfile discussion, but I would be interested 
> to hear how you feel -hostfile should work. I kinda gather you feel it should 
> override the default hostfile instead of filter it, yes? My point being that 
> I don't particularly know if anyone would disagree with that behavior, so we 
> might decide to modify things if you want to propose it.

Right, I would have expected it to work the same way as almost all the other 
MCA parameters, by overriding the variants with lower priority.  But I 
don't mind typing "--default-hostfile" instead of "-machinefile" to get the 
behavior I like.

  george.

> 
> Ralph
> 
> 
>> 
>> george.
>> 
>> 
>> On Jul 28, 2012, at 19:24 , Ralph Castain wrote:
>> 
>>> It's been awhile, but I vaguely remember the discussion. IIRC, the 
>>> rationale was that the default hostfile was equivalent to an RM allocation 
>>> and should be treated the same. So hostfile and -host become filters in 
>>> that case.
>>> 
>>> FWIW, I believe the discussion was split on that question. I added a "none" 
>>> option to the default hostfile MCA param so it would be ignored in the case 
>>> where (a) the sys admin has given a default hostfile, but (b) someone wants 
>>> to use hosts outside of it.
>>> 
>>>  MCA orte: parameter "orte_default_hostfile" (current value: 
>>> , data source: default value)
>>>Name of the default hostfile (relative or absolute 
>>> path, "none" to ignore environmental or default MCA param setting)
>>> 
>>> That said, I can see a use-case argument for behaving somewhat differently. 
>>> We've even had cases where users have gotten an allocation from an RM, but 
>>> want to add hosts that are external to the cluster to the job.
>>> 
>>> It would be rather trivial to modify the logic:
>>> 
>>> 1. read the default hostfile or RM allocation for our baseline
>>> 
>>> 2. remove any hosts on that list that are *not* in the given hostfile
>>> 
>>> 3. add any hosts that are in the given hostfile, but weren't in the default 
>>> hostfile
>>> 
>>> And subsequently do the same for -host. I think that would retain the 
>>> spirit of the discussion, but provide more flexibility and provide a tad 
>>> more "expected" behavior.
>>> 
>>> I don't have an iron in this fire as I don't use hostfiles, so I'm happy to 
>>> implement whatever the community would like to see.
>>> Ralph
>>> 
>>> On Jul 27, 2012, at 6:30 PM, George Bosilca wrote:
>>> 
 I'm somewhat puzzled by the behavior of the -hostfile in Open MPI. Based 
 on the FAQ it is supposed to provide a list of resources to be used by the 
 launcher (in my case ssh) to start the processes.  Makes sense so far.
 
 However, if the configuration file contains a value for 
 orte_default_hostfile, then the behavior of the hostfile option changes 
 drastically, and the option becomes a filter (the machines must be on the 
 original list or a cryptic error message is displayed).
 
 Overall, we have a well-defined, [mostly] consistent behavior for 
 parameters in Open MPI.  We have an order of precedence of sources of MCA 
 parameters, clearly defined, which makes understanding where a value comes 
 from straightforward.  I'm absolutely certain there was 

Re: [OMPI devel] MPI_Mprobe

2012-07-31 Thread Jeff Squyres
On Jul 31, 2012, at 2:58 AM, Eugene Loh wrote:

> The main issue is this.  If I go to ompi/mpi/fortran/mpif-h, I see six files 
> (*recv_f and *probe_f) that take status arguments.  Normally, we do some 
> conversion between Fortran and C status arguments.  These files test if 
> OMPI_SIZEOF_FORTRAN_INTEGER==SIZEOF_INT, however, and bypass the conversion 
> if the two types of integers are the same size.  The problem with this is 
> that while the structures may be the same size, the C status has a size_t in 
> it.  So, while the Fortran INTEGER array can start on any 4-byte alignment, 
> the C status can end up with a 64-bit pointer on a 4-byte alignment.  That's 
> not pleasant in general and can incur some serious hand-slapping on SPARC.  
> Specifically, SPARC/-m64 errors out on *probe and *recv with MPI_PROC_NULL 
> sources.  Would it be all right if I removed these "short cuts"?

Ew.  Yes.  You're right.

What specifically do you propose?  I don't remember offhand if the status 
conversion macros are the same as the regular int/INTEGER conversion macros -- 
we want to keep the no-op behavior for the regular int/INTEGER conversion 
macros and only handle the conversion of MPI_Status separately, I think.  
Specifically: for MPI_Status, we can probably still have those shortcuts for 
the int/INTEGERs, but treat the copying to the size_t separately.  

Thanks for fixing the ibm MPROBE tests, btw.  Further proof that I must have 
been clinically insane when I added all those tests.  :-(

Related issue: do we need to (conditionally) add padding for the size_t in the 
Fortran array?

> Here are two more smaller issues.  I'm pretty sure about them and can make 
> the appropriate changes, but if someone wants to give feedback...
> 
> 1)  If I look at, say, the v1.7 MPI_Mprobe man page, it says:
> 
> A  matching  probe  with  MPI_PROC_NULL  as  source  returns
> message  =  MPI_MESSAGE_NULL...
> 
> In contrast, if I look at ibm/pt2pt/mprobe_mpifh.f90, it's checking the 
> message to be MPI_MESSAGE_NO_PROC.  Further, if I look at the source code, 
> mprobe.c seems to set the message to "no proc".  So, I take it the man page 
> is wrong?  It should say "message = MPI_MESSAGE_NO_PROC"?

Oh, yes -- I think the man page is wrong.  The issue here is that the original 
MPI-3 proposal said to return MESSAGE_NULL, but this turns out to be ambiguous. 
 So we amended the original MPI-3 proposal with the new constant 
MPI_MESSAGE_NO_PROC.  So I think we fixed the implementation, but accidentally 
left the man page saying MESSAGE_NULL.

If you care, here's the specifics:

https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/38
https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/328

> 2)  Next, looking further at mprobe.c, it looks like this:
> 
> int MPI_Mprobe(int source, int tag, MPI_Comm comm,
>                MPI_Message *message, MPI_Status *status)
> {
>     if (MPI_PROC_NULL == source) {
>         if (MPI_STATUS_IGNORE != status) {
>             *status = ompi_request_empty.req_status;
>             *message = &ompi_message_no_proc.message;
>         }
>         return MPI_SUCCESS;
>     }
>     ...
> }
> 
> This means that if source==MPI_PROC_NULL and status==MPI_STATUS_IGNORE, the 
> message does not get set.  The assignment to *message should be moved outside 
> the status check, right?

Agreed.  Good catch.

Do the IBM MPROBE tests check for this condition?  If not, we should probably 
extend them to do so.
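
Something like this would cover it (a standalone sketch of the check, not the 
actual ibm test code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Message msg = MPI_MESSAGE_NULL;

    MPI_Init(&argc, &argv);

    /* Status deliberately ignored; per the discussion above, the message
     * handle must still come back as MPI_MESSAGE_NO_PROC. */
    MPI_Mprobe(MPI_PROC_NULL, MPI_ANY_TAG, MPI_COMM_WORLD, &msg, MPI_STATUS_IGNORE);

    if (MPI_MESSAGE_NO_PROC != msg) {
        printf("FAIL: message was not set to MPI_MESSAGE_NO_PROC\n");
    } else {
        printf("PASS\n");
    }

    MPI_Finalize();
    return 0;
}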

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] MPI_Mprobe

2012-07-31 Thread Eugene Loh
I have some questions originally motivated by some mpif-h/MPI_Mprobe 
failures we've seen in SPARC MTT runs at 64-bit in both v1.7 and v1.9, 
but my poking around spread out from there.


The main issue is this.  If I go to ompi/mpi/fortran/mpif-h, I see six 
files (*recv_f and *probe_f) that take status arguments.  Normally, we 
do some conversion between Fortran and C status arguments.  These files 
test if OMPI_SIZEOF_FORTRAN_INTEGER==SIZEOF_INT, however, and bypass the 
conversion if the two types of integers are the same size.  The problem 
with this is that while the structures may be the same size, the C 
status has a size_t in it.  So, while the Fortran INTEGER array can 
start on any 4-byte alignment, the C status can end up with a 64-bit 
pointer on a 4-byte alignment.  That's not pleasant in general and can 
incur some serious hand-slapping on SPARC.  Specifically, SPARC/-m64 
errors out on *probe and *recv with MPI_PROC_NULL sources.  Would it be 
all right if I removed these "short cuts"?
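
To make the alignment problem concrete, here is a toy illustration (the struct 
below is just a stand-in, not the real ompi_status_public_t layout):

#include <stdio.h>
#include <string.h>

typedef struct {
    int    MPI_SOURCE;
    int    MPI_TAG;
    int    MPI_ERROR;
    int    _cancelled;
    size_t _ucount;          /* 8 bytes with -m64 */
} fake_status_t;

int main(void)
{
    int f_status[16];        /* stand-in for a Fortran INTEGER status array */
    fake_status_t *c_status;

    memset(f_status, 0, sizeof(f_status));

    /* &f_status[1] is 4-byte aligned but, in general, not 8-byte aligned. */
    c_status = (fake_status_t *) &f_status[1];

    /* Deliberately undefined behavior: a misaligned 8-byte load.  Merely slow
     * on x86, but SIGBUS on SPARC with -m64 -- the same trap the shortcut hits. */
    printf("ucount = %zu\n", c_status->_ucount);

    return 0;
}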


Here are two more smaller issues.  I'm pretty sure about them and can 
make the appropriate changes, but if someone wants to give feedback...


1)  If I look at, say, the v1.7 MPI_Mprobe man page, it says:

 A  matching  probe  with  MPI_PROC_NULL  as  source  returns
 message  =  MPI_MESSAGE_NULL...

In contrast, if I look at ibm/pt2pt/mprobe_mpifh.f90, it's checking the 
message to be MPI_MESSAGE_NO_PROC.  Further, if I look at the source 
code, mprobe.c seems to set the message to "no proc".  So, I take it the 
man page is wrong?  It should say "message = MPI_MESSAGE_NO_PROC"?


2)  Next, looking further at mprobe.c, it looks like this:

int MPI_Mprobe(int source, int tag, MPI_Comm comm,
               MPI_Message *message, MPI_Status *status)
{
    if (MPI_PROC_NULL == source) {
        if (MPI_STATUS_IGNORE != status) {
            *status = ompi_request_empty.req_status;
            *message = &ompi_message_no_proc.message;
        }
        return MPI_SUCCESS;
    }
    ...
}

This means that if source==MPI_PROC_NULL and status==MPI_STATUS_IGNORE, 
the message does not get set.  The assignment to *message should be 
moved outside the status check, right?
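
In other words, something like this (a sketch only; nothing other than hoisting 
the assignment is changed from the snippet above):

int MPI_Mprobe(int source, int tag, MPI_Comm comm,
               MPI_Message *message, MPI_Status *status)
{
    if (MPI_PROC_NULL == source) {
        if (MPI_STATUS_IGNORE != status) {
            *status = ompi_request_empty.req_status;   /* empty status, only if requested */
        }
        *message = &ompi_message_no_proc.message;       /* always set, even with MPI_STATUS_IGNORE */
        return MPI_SUCCESS;
    }
    /* ... rest of the implementation unchanged ... */
}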


Re: [OMPI devel] The hostfile option

2012-07-31 Thread Rolf vandeVaart
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Ralph Castain
>Sent: Monday, July 30, 2012 9:29 AM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] The hostfile option
>
>
>On Jul 30, 2012, at 2:37 AM, George Bosilca wrote:
>
>> I think that as long as there is a single home area per cluster the 
>> difference
>between the different approaches might seem irrelevant to most of the
>people.
>
>Yeah, I agree - after thinking about it, it probably didn't accomplish much.
>
>>
>> My problem is twofold. First, I have a common home area across several
>different development clusters. Thus I have direct access through ssh to any
>machine. If I create a single large machinefile, it turns out that every mpirun
>will spawn a daemon on every single node, even if I only run a ping-pong test.
>
>That shouldn't happen if you specify the hosts you want to use, either via -
>host or -hostfile. I assume you are specifying nothing and so you get that
>behavior?
>
>> Second, while I usually run my apps on the same set of resources, I need on
>a regular basis to switch my nodes for a few tests.
>>
>> What I was hoping to achieve is a machinefile containing the "default"
>development cluster (aka the cluster where I'm almost alone so my daemons
>have minimal chances to disturb other people's experiments), and then use a
>machinefile to sporadically change the cluster where I run for smaller tests.
>Unfortunately, this doesn't work due to the filtering behavior described in my
>original email.
>
>Why not just set the default hostfile to point to the new machinefile via the 
>"-
>-default-hostfile foo" option to mpirun, or you can use the corresponding
>MCA param?
>
>I'm not trying to re-open the hostfile discussion, but I would be interested to
>hear how you feel -hostfile should work. I kinda gather you feel it should
>override the default hostfile instead of filter it, yes? My point being that I
>don't particularly know if anyone would disagree with that behavior, so we
>might decide to modify things if you want to propose it.
>
>Ralph
>

I wrote up the whole description in the Wiki a long while ago because there was 
a lot of confusion about how things should behave with a resource manager.  The 
general consensus was that folks thought of hostfile and -host as filters when 
running with a resource manager. 

I never wrote anything about the case you are describing, with the hostfile 
filtering the default hostfile.
I would have assumed that the precedence of hostfile that you desire would be 
the way things work.
Therefore, I am fine if we change it with respect to default hostfile and 
hostfile.

The wiki reference is here: https://svn.open-mpi.org/trac/ompi/wiki/HostFilePlan


>>
>>
>> On Jul 28, 2012, at 19:24 , Ralph Castain wrote:
>>
>>> It's been awhile, but I vaguely remember the discussion. IIRC, the rationale
>was that the default hostfile was equivalent to an RM allocation and should be
>treated the same. So hostfile and -host become filters in that case.
>>>
>>> FWIW, I believe the discussion was split on that question. I added a "none"
>option to the default hostfile MCA param so it would be ignored in the case
>where (a) the sys admin has given a default hostfile, but (b) someone wants
>to use hosts outside of it.
>>>
>>>   MCA orte: parameter "orte_default_hostfile" (current value:
>, data source: default value)
>>> Name of the default hostfile (relative or absolute 
>>> path, "none"
>to ignore environmental or default MCA param setting)
>>>
>>> That said, I can see a use-case argument for behaving somewhat
>differently. We've even had cases where users have gotten an allocation from
>an RM, but want to add hosts that are external to the cluster to the job.
>>>
>>> It would be rather trivial to modify the logic:
>>>
>>> 1. read the default hostfile or RM allocation for our baseline
>>>
>>> 2. remove any hosts on that list that are *not* in the given hostfile
>>>
>>> 3. add any hosts that are in the given hostfile, but weren't in the default
>hostfile
>>>
>>> And subsequently do the same for -host. I think that would retain the spirit
>of the discussion, but provide more flexibility and provide a tad more
>"expected" behavior.
>>>
>>> I don't have an iron in this fire as I don't use hostfiles, so I'm happy to
>implement whatever the community would like to see.
>>> Ralph
>>>
>>> On Jul 27, 2012, at 6:30 PM, George Bosilca wrote:
>>>
 I'm somewhat puzzled by the behavior of the -hostfile in Open MPI.
>Based on the FAQ it is supposed to provide a list of resources to be used by
>the launcher (in my case ssh) to start the processes. Makes sense so far.

 However, if the configuration file contains a value for
>orte_default_hostfile, then the behavior of the hostfile option changes
>drastically, and the option becomes a filter (the machines must be on the
>original list or a cryptic error message is displayed).