Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Matt Thompson
Jeff,

Some limited testing shows that that srun does seem to work where the
quote-y one did not. I'm working with our admins now to make sure it lets
the prolog work as expected as well.

I'll keep you informed,
Matt


On Thu, Sep 4, 2014 at 1:26 PM, Jeff Squyres (jsquyres) 
wrote:

> Try this (typed in editor, not tested!):
>
> #! /usr/bin/perl -w
>
> use strict;
> use warnings;
>
> use FindBin;
>
> # Specify the path to the prolog.
> my $prolog = '--task-prolog=/gpfsm//.task.prolog';
>
> # Build the path to the SLURM srun command.
> my $srun_slurm = "${FindBin::Bin}/srun.slurm";
>
> # Add the prolog option, but abort if the user specifies a prolog option.
> my @command = split(/ /, "$srun_slurm $prolog");
> foreach (@ARGV) {
>     if (/^--task-prolog=/) {
>         print("The --task-prolog option is unsupported at . Please " .
>               "contact the  for assistance.\n");
>         exit(1);
>     } else {
>         push(@command, $_);
>     }
> }
> system(@command);
>
>
>
> On Sep 4, 2014, at 1:21 PM, Matt Thompson  wrote:
>
> > Jeff,
> >
> > Here is the script (with a bit of munging for safety's sake):
> >
> > #! /usr/bin/perl -w
> >
> > use strict;
> > use warnings;
> >
> > use FindBin;
> >
> > # Specify the path to the prolog.
> > my $prolog = '--task-prolog=/gpfsm//.task.prolog';
> >
> > # Build the path to the SLURM srun command.
> > my $srun_slurm = "${FindBin::Bin}/srun.slurm";
> >
> > # Add the prolog option, but abort if the user specifies a prolog option.
> > my $command = "$srun_slurm $prolog";
> > foreach (@ARGV) {
> >     if (/^--task-prolog=/) {
> >         print("The --task-prolog option is unsupported at . Please " .
> >               "contact the  for assistance.\n");
> >         exit(1);
> >     } else {
> >         $command .= " $_";
> >     }
> > }
> > system($command);
> >
> > Ideas?
> >
> >
> >
> > On Thu, Sep 4, 2014 at 10:51 AM, Ralph Castain  wrote:
> > Still begs the bigger question, though, as others have used script
> wrappers before - and I'm not sure we (OMPI) want to be in the business of
> dictating the scripting language they can use. :-)
> >
> > Jeff and I will argue that one out
> >
> >
> > On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres) 
> wrote:
> >
> >> Ah, if it's perl, it might be easy. It might just be the difference
> between system("...string...") and system(@argv).
> >>
> >> Sent from my phone. No type good.
> >>
> >> On Sep 4, 2014, at 8:35 AM, "Matt Thompson"  wrote:
> >>
> >>> Jeff,
> >>>
> >>> I actually misspoke earlier. It turns out our srun is a *Perl* script
> around the SLURM srun. I'll speak with our admins to see if they can
> massage the script to not interpret the arguments. If possible, I'll ask
> them if I can share the script with you (privately or on the list) and
> maybe you can see how it is affecting Open MPI's argument passage.
> >>>
> >>> Matt
> >>>
> >>>
> >>> On Thu, Sep 4, 2014 at 8:04 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> >>> On Sep 3, 2014, at 9:27 AM, Matt Thompson  wrote:
> >>>
> >>> > Just saw this, sorry. Our srun is indeed a shell script. It seems to
> be a wrapper around the regular srun that runs a --task-prolog. What it
> does...that's beyond my ken, but I could ask. My guess is that it probably
> does something that helps keep our old PBS scripts running (sets
> $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The
> admins would, of course, prefer all future scripts be SLURM-native scripts,
> but there are a lot of production runs that use many, many PBS scripts.
> Converting that would need slow, careful QC to make sure any "pure SLURM"
> versions act as expected.
> >>>
> >>> Ralph and I haven't had a chance to discuss this in detail yet, but I
> have thought about this quite a bit.
> >>>
> >>> What is happening is that one of the $argv OMPI passes is of the form
> "foo;bar".  Your srun script is interpreting the ";" as the end of the
> command and the "bar" as the beginning of a new command, and mayhem ensues.
> >>>
> >>> Basically, your srun script is violating what should be a very safe
> assumption: that the $argv we pass to it will not be interpreted by a
> shell.  Put differently: your "srun" script behaves differently than
> SLURM's "srun" executable.  This violates OMPI's expectations of how srun
> should behave.
> >>>
> >>> My $0.02 is that if we "fix" this in OMPI, we're effectively
> penalizing all other SLURM installations out there that *don't* violate
> this assumption (i.e., all of them).  Ralph may disagree with me on this
> point, BTW -- like I said, we haven't talked about this in detail since
> Tuesday.  :-)
> >>>
> >>> So here's my question: is there any chance you can change your "srun"
> script to a script language that doesn't recombine $argv?  This is a common
> problem, actually -- sh/csh/etc. script languages tend to recombine $argv,
> but other languages such as perl and python do not (e.g.,
> >>> http://stackoverflow.com/questions/6981533/how-to-preserve-single-and-double-quotes-in-shell-script-arguments-without-the-a).

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Jeff Squyres (jsquyres)
Try this (typed in editor, not tested!):

#! /usr/bin/perl -w

use strict;
use warnings;

use FindBin;

# Specify the path to the prolog.
my $prolog = '--task-prolog=/gpfsm//.task.prolog';

# Build the path to the SLURM srun command.
my $srun_slurm = "${FindBin::Bin}/srun.slurm";

# Add the prolog option, but abort if the user specifies a prolog option.
my @command = split(/ /, "$srun_slurm $prolog");
foreach (@ARGV) {
    if (/^--task-prolog=/) {
        print("The --task-prolog option is unsupported at . Please " .
              "contact the  for assistance.\n");
        exit(1);
    } else {
        push(@command, $_);
    }
}
system(@command);
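
One caveat worth noting with the split() above: it assumes neither
$srun_slurm nor $prolog ever contains a space. A variant that builds the
list directly and also propagates the child's exit status back to the
caller -- just a sketch, same logic otherwise:

my @command = ($srun_slurm, $prolog);
foreach (@ARGV) {
    if (/^--task-prolog=/) {
        print("The --task-prolog option is unsupported at . Please " .
              "contact the  for assistance.\n");
        exit(1);
    }
    push(@command, $_);
}
# system() returns -1 if the exec itself failed; otherwise the child's
# exit code is in the high byte of the wait status.
my $rc = system(@command);
exit($rc == -1 ? 1 : $rc >> 8);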



On Sep 4, 2014, at 1:21 PM, Matt Thompson  wrote:

> Jeff,
> 
> Here is the script (with a bit of munging for safety's sake):
> 
> #! /usr/bin/perl -w
> 
> use strict;
> use warnings;
> 
> use FindBin;
> 
> # Specify the path to the prolog.
> my $prolog = '--task-prolog=/gpfsm//.task.prolog';
> 
> # Build the path to the SLURM srun command.
> my $srun_slurm = "${FindBin::Bin}/srun.slurm";
> 
> # Add the prolog option, but abort if the user specifies a prolog option.
> my $command = "$srun_slurm $prolog";
> foreach (@ARGV) {
>     if (/^--task-prolog=/) {
>         print("The --task-prolog option is unsupported at . Please " .
>               "contact the  for assistance.\n");
>         exit(1);
>     } else {
>         $command .= " $_";
>     }
> }
> system($command);
> 
> Ideas?
> 
> 
> 
> On Thu, Sep 4, 2014 at 10:51 AM, Ralph Castain  wrote:
> Still begs the bigger question, though, as others have used script wrappers 
> before - and I'm not sure we (OMPI) want to be in the business of dictating 
> the scripting language they can use. :-)
> 
> Jeff and I will argue that one out
> 
> 
> On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
>> Ah, if it's perl, it might be easy. It might just be the difference between 
>> system("...string...") and system(@argv). 
>> 
>> Sent from my phone. No type good. 
>> 
>> On Sep 4, 2014, at 8:35 AM, "Matt Thompson"  wrote:
>> 
>>> Jeff,
>>> 
>>> I actually misspoke earlier. It turns out our srun is a *Perl* script 
>>> around the SLURM srun. I'll speak with our admins to see if they can 
>>> massage the script to not interpret the arguments. If possible, I'll ask 
>>> them if I can share the script with you (privately or on the list) and 
>>> maybe you can see how it is affecting Open MPI's argument passage.
>>> 
>>> Matt
>>> 
>>> 
>>> On Thu, Sep 4, 2014 at 8:04 AM, Jeff Squyres (jsquyres) 
>>>  wrote:
>>> On Sep 3, 2014, at 9:27 AM, Matt Thompson  wrote:
>>> 
>>> > Just saw this, sorry. Our srun is indeed a shell script. It seems to be a 
>>> > wrapper around the regular srun that runs a --task-prolog. What it 
>>> > does...that's beyond my ken, but I could ask. My guess is that it 
>>> > probably does something that helps keep our old PBS scripts running (sets 
>>> > $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. 
>>> > The admins would, of course, prefer all future scripts be SLURM-native 
>>> > scripts, but there are a lot of production runs that use many, many PBS 
>>> > scripts. Converting that would need slow, careful QC to make sure any 
>>> > "pure SLURM" versions act as expected.
>>> 
>>> Ralph and I haven't had a chance to discuss this in detail yet, but I have 
>>> thought about this quite a bit.
>>> 
>>> What is happening is that one of the $argv OMPI passes is of the form 
>>> "foo;bar".  Your srun script is interpreting the ";" as the end of the 
>>> command and the "bar" as the beginning of a new command, and mayhem ensues.
>>> 
>>> Basically, your srun script is violating what should be a very safe 
>>> assumption: that the $argv we pass to it will not be interpreted by a 
>>> shell.  Put differently: your "srun" script behaves differently than 
>>> SLURM's "srun" executable.  This violates OMPI's expectations of how srun 
>>> should behave.
>>> 
>>> My $0.02 is that if we "fix" this in OMPI, we're effectively penalizing all 
>>> other SLURM installations out there that *don't* violate this assumption 
>>> (i.e., all of them).  Ralph may disagree with me on this point, BTW -- like 
>>> I said, we haven't talked about this in detail since Tuesday.  :-)
>>> 
>>> So here's my question: is there any chance you can change your "srun" 
>>> script to a script language that doesn't recombine $argv?  This is a common 
>>> problem, actually -- sh/csh/etc. script languages tend to recombine $argv, 
>>> but other languages such as perl and python do not (e.g., 
>>> http://stackoverflow.com/questions/6981533/how-to-preserve-single-and-double-quotes-in-shell-script-arguments-without-the-a).
>>> 
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Matt Thompson
Jeff,

Here is the script (with a bit of munging for safety's sake):

#! /usr/bin/perl -w

use strict;
use warnings;

use FindBin;

# Specify the path to the prolog.
my $prolog = '--task-prolog=/gpfsm//.task.prolog';

# Build the path to the SLURM srun command.
my $srun_slurm = "${FindBin::Bin}/srun.slurm";

# Add the prolog option, but abort if the user specifies a prolog option.
my $command = "$srun_slurm $prolog";
foreach (@ARGV) {
    if (/^--task-prolog=/) {
        print("The --task-prolog option is unsupported at . Please " .
              "contact the  for assistance.\n");
        exit(1);
    } else {
        $command .= " $_";
    }
}
system($command);

Ideas?



On Thu, Sep 4, 2014 at 10:51 AM, Ralph Castain  wrote:

> Still begs the bigger question, though, as others have used script
> wrappers before - and I'm not sure we (OMPI) want to be in the business of
> dictating the scripting language they can use. :-)
>
> Jeff and I will argue that one out
>
>
> On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres) 
> wrote:
>
>  Ah, if it's perl, it might be easy. It might just be the difference
> between system("...string...") and system(@argv).
>
> Sent from my phone. No type good.
>
> On Sep 4, 2014, at 8:35 AM, "Matt Thompson"  wrote:
>
>   Jeff,
>
>  I actually misspoke earlier. It turns out our srun is a *Perl* script
> around the SLURM srun. I'll speak with our admins to see if they can
> massage the script to not interpret the arguments. If possible, I'll ask
> them if I can share the script with you (privately or on the list) and
> maybe you can see how it is affecting Open MPI's argument passage.
>
>  Matt
>
>
> On Thu, Sep 4, 2014 at 8:04 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> On Sep 3, 2014, at 9:27 AM, Matt Thompson  wrote:
>>
>> > Just saw this, sorry. Our srun is indeed a shell script. It seems to be
>> a wrapper around the regular srun that runs a --task-prolog. What it
>> does...that's beyond my ken, but I could ask. My guess is that it probably
>> does something that helps keep our old PBS scripts running (sets
>> $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The
>> admins would, of course, prefer all future scripts be SLURM-native scripts,
> but there are a lot of production runs that use many, many PBS scripts.
>> Converting that would need slow, careful QC to make sure any "pure SLURM"
>> versions act as expected.
>>
>>  Ralph and I haven't had a chance to discuss this in detail yet, but I
>> have thought about this quite a bit.
>>
>> What is happening is that one of the $argv OMPI passes is of the form
>> "foo;bar".  Your srun script is interpreting the ";" as the end of the
>> command and the "bar" as the beginning of a new command, and mayhem ensues.
>>
>> Basically, your srun script is violating what should be a very safe
>> assumption: that the $argv we pass to it will not be interpreted by a
>> shell.  Put differently: your "srun" script behaves differently than
>> SLURM's "srun" executable.  This violates OMPI's expectations of how srun
>> should behave.
>>
>> My $0.02 is that if we "fix" this in OMPI, we're effectively penalizing
>> all other SLURM installations out there that *don't* violate this
>> assumption (i.e., all of them).  Ralph may disagree with me on this point,
>> BTW -- like I said, we haven't talked about this in detail since Tuesday.
>> :-)
>>
>> So here's my question: is there any chance you can change your "srun"
>> script to a script language that doesn't recombine $argv?  This is a common
>> problem, actually -- sh/csh/etc. script languages tend to recombine $argv,
>> but other languages such as perl and python do not (e.g.,
>> http://stackoverflow.com/questions/6981533/how-to-preserve-single-and-double-quotes-in-shell-script-arguments-without-the-a
>> ).
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>  Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/09/25263.php
>>
>
>
>
>  --
>  "And, isn't sanity really just a one-trick pony anyway? I mean all you
>  get is one trick: rational thinking. But when you're good and crazy,
>  oooh, oooh, oooh, the sky is the limit!" -- The Tick
>
>___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/09/25264.php
>
>  ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/09/25269.php
>
>
>
> ___

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Ralph Castain
Still begs the bigger question, though, as others have used script wrappers 
before - and I'm not sure we (OMPI) want to be in the business of dictating the 
scripting language they can use. :-)

Jeff and I will argue that one out


On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres)  wrote:

> Ah, if it's perl, it might be easy. It might just be the difference between 
> system("...string...") and system(@argv). 
> 
> Sent from my phone. No type good. 
> 
> On Sep 4, 2014, at 8:35 AM, "Matt Thompson"  wrote:
> 
>> Jeff,
>> 
>> I actually misspoke earlier. It turns out our srun is a *Perl* script around 
>> the SLURM srun. I'll speak with our admins to see if they can massage the 
>> script to not interpret the arguments. If possible, I'll ask them if I can 
>> share the script with you (privately or on the list) and maybe you can see 
>> how it is affecting Open MPI's argument passage.
>> 
>> Matt
>> 
>> 
>> On Thu, Sep 4, 2014 at 8:04 AM, Jeff Squyres (jsquyres)  
>> wrote:
>> On Sep 3, 2014, at 9:27 AM, Matt Thompson  wrote:
>> 
>> > Just saw this, sorry. Our srun is indeed a shell script. It seems to be a 
>> > wrapper around the regular srun that runs a --task-prolog. What it 
>> > does...that's beyond my ken, but I could ask. My guess is that it probably 
>> > does something that helps keep our old PBS scripts running (sets 
>> > $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. 
>> > The admins would, of course, prefer all future scripts be SLURM-native 
>> > scripts, but there are a lot of production runs that use many, many PBS 
>> > scripts. Converting that would need slow, careful QC to make sure any 
>> > "pure SLURM" versions act as expected.
>> 
>> Ralph and I haven't had a chance to discuss this in detail yet, but I have 
>> thought about this quite a bit.
>> 
>> What is happening is that one of the $argv OMPI passes is of the form 
>> "foo;bar".  Your srun script is interpreting the ";" as the end of the 
>> command and the "bar" as the beginning of a new command, and mayhem ensues.
>> 
>> Basically, your srun script is violating what should be a very safe 
>> assumption: that the $argv we pass to it will not be interpreted by a shell. 
>>  Put differently: your "srun" script behaves differently than SLURM's "srun" 
>> executable.  This violates OMPI's expectations of how srun should behave.
>> 
>> My $0.02 is that if we "fix" this in OMPI, we're effectively penalizing all 
>> other SLURM installations out there that *don't* violate this assumption 
>> (i.e., all of them).  Ralph may disagree with me on this point, BTW -- like 
>> I said, we haven't talked about this in detail since Tuesday.  :-)
>> 
>> So here's my question: is there any chance you can change your "srun" script 
>> to a script language that doesn't recombine $argv?  This is a common 
>> problem, actually -- sh/csh/etc. script languages tend to recombine $argv, 
>> but other languages such as perl and python do not (e.g., 
>> http://stackoverflow.com/questions/6981533/how-to-preserve-single-and-double-quotes-in-shell-script-arguments-without-the-a).
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/09/25263.php
>> 
>> 
>> 
>> -- 
>> "And, isn't sanity really just a one-trick pony anyway? I mean all you
>>  get is one trick: rational thinking. But when you're good and crazy, 
>>  oooh, oooh, oooh, the sky is the limit!" -- The Tick
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/09/25264.php
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/09/25269.php



Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Jeff Squyres (jsquyres)
Ah, if it's perl, it might be easy. It might just be the difference between 
system("...string...") and system(@argv).

Sent from my phone. No type good.
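
A minimal sketch of that difference (the command and argument here are
invented, not from the thread):

#!/usr/bin/perl
use strict;
use warnings;

my @argv = ('echo', 'foo;bar');

# String form: the ";" makes Perl hand the line to /bin/sh, which runs
# "echo foo" and then tries to execute a command named "bar".
system(join(' ', @argv));

# List form: Perl fork/execs echo directly, and "foo;bar" arrives as a
# single argument -- no shell ever sees it.
system(@argv);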

On Sep 4, 2014, at 8:35 AM, "Matt Thompson" <fort...@gmail.com> wrote:

Jeff,

I actually misspoke earlier. It turns out our srun is a *Perl* script around 
the SLURM srun. I'll speak with our admins to see if they can massage the 
script to not interpret the arguments. If possible, I'll ask them if I can 
share the script with you (privately or on the list) and maybe you can see how 
it is affecting Open MPI's argument passage.

Matt


On Thu, Sep 4, 2014 at 8:04 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
On Sep 3, 2014, at 9:27 AM, Matt Thompson <fort...@gmail.com> wrote:

> Just saw this, sorry. Our srun is indeed a shell script. It seems to be a 
> wrapper around the regular srun that runs a --task-prolog. What it 
> does...that's beyond my ken, but I could ask. My guess is that it probably 
> does something that helps keep our old PBS scripts running (sets 
> $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The 
> admins would, of course, prefer all future scripts be SLURM-native scripts, 
> but there are a lot of production runs that uses many, many PBS scripts. 
> Converting that would need slow, careful QC to make sure any "pure SLURM" 
> versions act as expected.

Ralph and I haven't had a chance to discuss this in detail yet, but I have 
thought about this quite a bit.

What is happening is that one of the $argv OMPI passes is of the form 
"foo;bar".  Your srun script is interpreting the ";" as the end of the command 
the the "bar" as the beginning of a new command, and mayhem ensues.

Basically, your srun script is violating what should be a very safe assumption: 
that the $argv we pass to it will not be interpreted by a shell.  Put 
differently: your "srun" script behaves differently than SLURM's "srun" 
executable.  This violates OMPI's expectations of how srun should behave.

My $0.02 is that if we "fix" this in OMPI, we're effectively penalizing all 
other SLURM installations out there that *don't* violate this assumption (i.e., 
all of them).  Ralph may disagree with me on this point, BTW -- like I said, we 
haven't talked about this in detail since Tuesday.  :-)

So here's my question: is there any chance you can change your "srun" script to 
a script language that doesn't recombine $argv?  This is a common problem, 
actually -- sh/csh/etc. script languages tend to recombine $argv, but other 
languages such as perl and python do not (e.g., 
http://stackoverflow.com/questions/6981533/how-to-preserve-single-and-double-quotes-in-shell-script-arguments-without-the-a).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/09/25263.php



--
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/09/25264.php


Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Matt Thompson
Jeff,

I actually misspoke earlier. It turns out our srun is a *Perl* script
around the SLURM srun. I'll speak with our admins to see if they can
massage the script to not interpret the arguments. If possible, I'll ask
them if I can share the script with you (privately or on the list) and
maybe you can see how it is affecting Open MPI's argument passage.

Matt


On Thu, Sep 4, 2014 at 8:04 AM, Jeff Squyres (jsquyres) 
wrote:

> On Sep 3, 2014, at 9:27 AM, Matt Thompson  wrote:
>
> > Just saw this, sorry. Our srun is indeed a shell script. It seems to be
> a wrapper around the regular srun that runs a --task-prolog. What it
> does...that's beyond my ken, but I could ask. My guess is that it probably
> does something that helps keep our old PBS scripts running (sets
> $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The
> admins would, of course, prefer all future scripts be SLURM-native scripts,
> but there are a lot of production runs that use many, many PBS scripts.
> Converting that would need slow, careful QC to make sure any "pure SLURM"
> versions act as expected.
>
> Ralph and I haven't had a chance to discuss this in detail yet, but I have
> thought about this quite a bit.
>
> What is happening is that one of the $argv OMPI passes is of the form
> "foo;bar".  Your srun script is interpreting the ";" as the end of the
> command and the "bar" as the beginning of a new command, and mayhem ensues.
>
> Basically, your srun script is violating what should be a very safe
> assumption: that the $argv we pass to it will not be interpreted by a
> shell.  Put differently: your "srun" script behaves differently than
> SLURM's "srun" executable.  This violates OMPI's expectations of how srun
> should behave.
>
> My $0.02 is that if we "fix" this in OMPI, we're effectively penalizing
> all other SLURM installations out there that *don't* violate this
> assumption (i.e., all of them).  Ralph may disagree with me on this point,
> BTW -- like I said, we haven't talked about this in detail since Tuesday.
> :-)
>
> So here's my question: is there any chance you can change your "srun"
> script to a script language that doesn't recombine $argv?  This is a common
> problem, actually -- sh/csh/etc. script languages tend to recombine $argv,
> but other languages such as perl and python do not (e.g.,
> http://stackoverflow.com/questions/6981533/how-to-preserve-single-and-double-quotes-in-shell-script-arguments-without-the-a
> ).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/09/25263.php
>



-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick


Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Jeff Squyres (jsquyres)
On Sep 3, 2014, at 9:27 AM, Matt Thompson  wrote:

> Just saw this, sorry. Our srun is indeed a shell script. It seems to be a 
> wrapper around the regular srun that runs a --task-prolog. What it 
> does...that's beyond my ken, but I could ask. My guess is that it probably 
> does something that helps keep our old PBS scripts running (sets 
> $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The 
> admins would, of course, prefer all future scripts be SLURM-native scripts, 
> but there are a lot of production runs that use many, many PBS scripts. 
> Converting that would need slow, careful QC to make sure any "pure SLURM" 
> versions act as expected.

Ralph and I haven't had a chance to discuss this in detail yet, but I have 
thought about this quite a bit.

What is happening is that one of the $argv OMPI passes is of the form 
"foo;bar".  Your srun script is interpreting the ";" as the end of the command 
the the "bar" as the beginning of a new command, and mayhem ensues.

Basically, your srun script is violating what should be a very safe assumption: 
that the $argv we pass to it will not be interpreted by a shell.  Put 
differently: your "srun" script behaves differently than SLURM's "srun" 
executable.  This violates OMPI's expectations of how srun should behave.

My $0.02 is that if we "fix" this in OMPI, we're effectively penalizing all 
other SLURM installations out there that *don't* violate this assumption (i.e., 
all of them).  Ralph may disagree with me on this point, BTW -- like I said, we 
haven't talked about this in detail since Tuesday.  :-)

So here's my question: is there any chance you can change your "srun" script to 
a script language that doesn't recombine $argv?  This is a common problem, 
actually -- sh/csh/etc. script languages tend to recombine $argv, but other 
languages such as perl and python do not (e.g., 
http://stackoverflow.com/questions/6981533/how-to-preserve-single-and-double-quotes-in-shell-script-arguments-without-the-a).
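
For what it's worth, a bare-bones Perl passthrough illustrates the point.
This is only a sketch, and the srun path below is a placeholder:

#!/usr/bin/perl
use strict;
use warnings;

# exec() with a list never invokes a shell, so an argument such as
# "1234.0;tcp://10.1.24.169:41684" reaches the real srun as a single
# token -- exactly what SLURM's own srun binary would see.
exec('/usr/bin/srun', @ARGV) or die "exec failed: $!";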

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-03 Thread Ralph Castain
Thanks Matt - that does indeed resolve the "how" question :-)

We'll talk internally about how best to resolve the issue. We could, of course, 
add a flag to indicate "we are using a shellscript version of srun" so we know 
to quote things, but it would mean another thing that the user would have to do 
(as opposed to just running out-of-the-box).

If we quote everything by default, then we have to modify our parser to strip 
the quotes when someone isn't using a script wrapper or else the system gets in 
trouble - but Jeff is concerned about us stripping things by default in case a 
user specifies an MCA param value that actually begins/ends with quotes. I'm 
not sure that's a valid use-case, but we'll debate it.
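
A quick sketch of that edge case, with invented values:

#!/usr/bin/perl
use strict;
use warnings;

# If we strip one layer of quotes unconditionally, a value that
# legitimately begins and ends with quotes is silently damaged:
my $value = '"already quoted"';            # what the user actually set
(my $stripped = $value) =~ s/^"(.*)"$/$1/; # what default stripping yields
print "before: $value\nafter:  $stripped\n";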

Either way, we'll give you a solution.
Ralph


On Sep 3, 2014, at 6:27 AM, Matt Thompson  wrote:

> On Tue, Sep 2, 2014 at 8:38 PM, Jeff Squyres (jsquyres)  
> wrote:
> Matt: Random thought -- is your "srun" a shell script, perchance?  (it 
> shouldn't be, but perhaps there's some kind of local override...?)
> 
> Ralph's point on the call today is that it doesn't matter *how* this problem 
> is happening.  It *is* happening to real users, and so we need to account for 
> it.
> 
> But it really bothers me that we don't understand *how/why* this is happening 
> (e.g., is this OMPI's fault somehow?  I don't think so, but then again, we 
> don't understand how it's happening).  *Somewhere* in there, a shell is 
> getting invoked.  But "srun" shouldn't be invoking a shell on the remote side 
> -- it should be directly fork/exec'ing the tokens with no shell 
> interpretation at all.
> 
> Jeff,
> 
> Just saw this, sorry. Our srun is indeed a shell script. It seems to be a 
> wrapper around the regular srun that runs a --task-prolog. What it 
> does...that's beyond my ken, but I could ask. My guess is that it probably 
> does something that helps keep our old PBS scripts running (sets 
> $PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The 
> admins would, of course, prefer all future scripts be SLURM-native scripts, 
> but there are a lot of production runs that use many, many PBS scripts. 
> Converting that would need slow, careful QC to make sure any "pure SLURM" 
> versions act as expected.
> 
> Matt
> 
> 
> -- 
> "And, isn't sanity really just a one-trick pony anyway? I mean all you
>  get is one trick: rational thinking. But when you're good and crazy, 
>  oooh, oooh, oooh, the sky is the limit!" -- The Tick
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/09/25248.php



Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-03 Thread Matt Thompson
On Tue, Sep 2, 2014 at 8:38 PM, Jeff Squyres (jsquyres) 
wrote:

> Matt: Random thought -- is your "srun" a shell script, perchance?  (it
> shouldn't be, but perhaps there's some kind of local override...?)
>
> Ralph's point on the call today is that it doesn't matter *how* this
> problem is happening.  It *is* happening to real users, and so we need to
> account for it.
>
> But it really bothers me that we don't understand *how/why* this is
> happening (e.g., is this OMPI's fault somehow?  I don't think so, but then
> again, we don't understand how it's happening).  *Somewhere* in there, a
> shell is getting invoked.  But "srun" shouldn't be invoking a shell on the
> remote side -- it should be directly fork/exec'ing the tokens with no shell
> interpretation at all.
>

Jeff,

Just saw this, sorry. Our srun is indeed a shell script. It seems to be a
wrapper around the regular srun that runs a --task-prolog. What it
does...that's beyond my ken, but I could ask. My guess is that it probably
does something that helps keep our old PBS scripts running (sets
$PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The
admins would, of course, prefer all future scripts be SLURM-native scripts,
but there are a lot of production runs that use many, many PBS scripts.
Converting that would need slow, careful QC to make sure any "pure SLURM"
versions act as expected.

Matt


-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick


Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-03 Thread Matt Thompson
Jeff,

I tried your script and I saw:

(1027) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun
-np 8 ./script.sh
(1028) $

Now, the very first time I ran it, I think I might have noticed a blip of
orted on the nodes, but it disappeared fast. When I re-run the same
command, it just seems to exit immediately with nothing showing up.

If I use my "debug-patch" version, I see:

(1028) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch//bin/mpirun
-np 8 ./script.sh
hello world
hello world
hello world
hello world
hello world
hello world
hello world
hello world

And, well, it's there for 10 minutes, I'm guessing. If I ssh to another of
the nodes in my allocation:

(1005) $ ps aux | grep openmpi
mathomp4 20317  0.0  0.0  59952  4256 ?        S    09:17   0:00
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/orted
-mca orte_ess_jobid 1842544640 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
6 -mca orte_hnp_uri 1842544640.0;tcp://10.1.24.169,172.31.1.254,
10.12.24.169:41684
mathomp4 20389  0.0  0.0   5524   844 pts/0    S+   09:19   0:00 grep
--color=auto openmpi


Matt


On Tue, Sep 2, 2014 at 5:35 PM, Jeff Squyres (jsquyres) 
wrote:

> Matt --
>
> We were discussing this issue on our weekly OMPI engineering call today.
>
> Can you check one thing for me?  With the un-edited 1.8.2 tarball
> installation, I see that you're getting no output for commands that you run
> -- but also no errors.
>
> Can you verify and see if your commands are actually *running*?  E.g., try:
>
> $ cat > script.sh <<EOF
> #!/bin/sh
> echo hello world
> sleep 600
> echo goodbye world
> EOF
> $ chmod +x script.sh
> $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-clean/bin/mpirun
> -np 8 script.sh
>
> and then go "ps" on the back-end nodes and see if there is an "orted"
> process and N "sleep 600" processes running on them.
>
> I'm *assuming* you won't see the "hello world" output.
>
> The purpose of this test is that I want to see if OMPI is just totally
> erring out and not even running your job (which is quite unlikely; OMPI
> should be much more noisy when this happens), or whether we're simply not
> seeing the stdout from the job.
>
> Thanks.
>
>
>
> On Sep 2, 2014, at 9:36 AM, Matt Thompson  wrote:
>
> > On that machine, it would be SLES 11 SP1. I think it's soon
> transitioning to SLES 11 SP3.
> >
> > I also use Open MPI on an RHEL 6.5 box (possibly soon to be RHEL 7).
> >
> >
> > On Mon, Sep 1, 2014 at 8:41 PM, Ralph Castain  wrote:
> > Thanks - I expect we'll have to release 1.8.3 soon to fix this in case
> others have similar issues. Out of curiosity, what OS are you using?
> >
> >
> > On Sep 1, 2014, at 9:00 AM, Matt Thompson  wrote:
> >
> >> Ralph,
> >>
> >> Okay that seems to have done it here (well, minus the usual
> shmem_mmap_enable_nfs_warning that our system always generates):
> >>
> >> (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> >> (1034) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun
> -np 8 ./helloWorld.182-debug-patch.x
> >> Process 7 of 8 is on borg01w218
> >> Process 5 of 8 is on borg01w218
> >> Process 1 of 8 is on borg01w218
> >> Process 3 of 8 is on borg01w218
> >> Process 0 of 8 is on borg01w218
> >> Process 2 of 8 is on borg01w218
> >> Process 4 of 8 is on borg01w218
> >> Process 6 of 8 is on borg01w218
> >>
> >> I'll ask the admin to apply the patch locally...and wait for 1.8.3, I
> suppose.
> >>
> >> Thanks,
> >> Matt
> >>
> >> On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain 
> wrote:
> >> Hmmm...I may see the problem. Would you be so kind as to apply the
> attached patch to your 1.8.2 code, rebuild, and try again?
> >>
> >> Much appreciate the help. Everyone's system is slightly different, and
> I think you've uncovered one of those differences.
> >> Ralph
> >>
> >>
> >>
> >> On Aug 31, 2014, at 6:25 AM, Matt Thompson  wrote:
> >>
> >>> Ralph,
> >>>
> >>> Sorry it took me a bit of time. Here you go:
> >>>
> >>> (1002) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
> --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca
> plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
> >>> [borg01w063:03815] mca:base:select:(  plm) Querying component
> [isolated]
> >>> [borg01w063:03815] mca:base:select:(  plm) Query of component
> [isolated] set priority to 0
> >>> [borg01w063:03815] mca:base:select:(  plm) Querying component [rsh]
> >>> [borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
> rsh path NULL
> >>> [borg01w063:03815] mca:base:select:(  plm) Query of component [rsh]
> set priority to 10
> >>> [borg01w063:03815] mca:base:select:(  plm) Querying component [slurm]
> >>> [borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for
> selection
> >>> [borg01w063:03815] mca:base:select:(  plm) Query of component [slurm]
> set priority to 75
>

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Jeff Squyres (jsquyres)
Ah, I see the "sh: tcp://10.1.25.142,172.31.1.254,10.12.25.142:41686: No such 
file or directory" message now -- I was looking for something like that when I 
replied before and missed it.

I really wish I understood why the heck that is happening; it doesn't seem to 
make sense.  

Matt: Random thought -- is your "srun" a shell script, perchance?  (it 
shouldn't be, but perhaps there's some kind of local override...?)

Ralph's point on the call today is that it doesn't matter *how* this problem is 
happening.  It *is* happening to real users, and so we need to account for it.

But it really bothers me that we don't understand *how/why* this is happening 
(e.g., is this OMPI's fault somehow?  I don't think so, but then again, we 
don't understand how it's happening).  *Somewhere* in there, a shell is getting 
invoked.  But "srun" shouldn't be invoking a shell on the remote side -- it 
should be directly fork/exec'ing the tokens with no shell interpretation at all.




On Sep 2, 2014, at 7:04 PM, Ralph Castain  wrote:

> I can answer that for you right now. The launch of the orted's is what is 
> failing, and they are "silently" failing at this time. The reason is simple:
> 
> 1. we are failing due to truncation of the HNP uri at the first semicolon. 
> This causes the orted to emit an ORTE_ERROR_LOG message and then abort with a 
> non-zero exit status
> 
> 2. we throw away the error message unless someone adds --debug-daemons 
> because we redirect the srun output to /dev/null. This is done because slurm 
> spits out other things during our normal operation that confuse users
> 
> 3. srun detects the non-zero exit status of the orted and aborts the rest of 
> the job.
> 
> So when Matt adds --debug-daemons, he then sees the error messages. When he 
> further adds the oob and plm verbosity, the true error is fully exposed.
> 
> 
> On Sep 2, 2014, at 2:35 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
>> Matt --
>> 
>> We were discussing this issue on our weekly OMPI engineering call today.
>> 
>> Can you check one thing for me?  With the un-edited 1.8.2 tarball 
>> installation, I see that you're getting no output for commands that you run 
>> -- but also no errors.
>> 
>> Can you verify and see if your commands are actually *running*?  E.g., try:
>> 
>> $ cat > script.sh <<EOF
>> #!/bin/sh
>> echo hello world
>> sleep 600
>> echo goodbye world
>> EOF
>> $ chmod +x script.sh
>> $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
>> $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-clean/bin/mpirun 
>> -np 8 script.sh
>> 
>> and then go "ps" on the back-end nodes and see if there is an "orted" 
>> process and N "sleep 600" processes running on them.
>> 
>> I'm *assuming* you won't see the "hello world" output.
>> 
>> The purpose of this test is that I want to see if OMPI is just totally 
>> erring out and not even running your job (which is quite unlikely; OMPI 
>> should be much more noisy when this happens), or whether we're simply not 
>> seeing the stdout from the job.
>> 
>> Thanks.
>> 
>> 
>> 
>> On Sep 2, 2014, at 9:36 AM, Matt Thompson  wrote:
>> 
>>> On that machine, it would be SLES 11 SP1. I think it's soon transitioning 
>>> to SLES 11 SP3.
>>> 
>>> I also use Open MPI on an RHEL 6.5 box (possibly soon to be RHEL 7).
>>> 
>>> 
>>> On Mon, Sep 1, 2014 at 8:41 PM, Ralph Castain  wrote:
>>> Thanks - I expect we'll have to release 1.8.3 soon to fix this in case 
>>> others have similar issues. Out of curiosity, what OS are you using?
>>> 
>>> 
>>> On Sep 1, 2014, at 9:00 AM, Matt Thompson  wrote:
>>> 
 Ralph,
 
 Okay that seems to have done it here (well, minus the usual 
 shmem_mmap_enable_nfs_warning that our system always generates):
 
 (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
 (1034) $ 
 /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun
  -np 8 ./helloWorld.182-debug-patch.x
 Process 7 of 8 is on borg01w218
 Process 5 of 8 is on borg01w218
 Process 1 of 8 is on borg01w218
 Process 3 of 8 is on borg01w218
 Process 0 of 8 is on borg01w218
 Process 2 of 8 is on borg01w218
 Process 4 of 8 is on borg01w218
 Process 6 of 8 is on borg01w218
 
 I'll ask the admin to apply the patch locally...and wait for 1.8.3, I 
 suppose.
 
 Thanks,
 Matt
 
 On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain  wrote:
 Hmmm...I may see the problem. Would you be so kind as to apply the 
 attached patch to your 1.8.2 code, rebuild, and try again?
 
 Much appreciate the help. Everyone's system is slightly different, and I 
 think you've uncovered one of those differences.
 Ralph
 
 
 
 On Aug 31, 2014, at 6:25 AM, Matt Thompson  wrote:
 
> Ralph,
> 
> Sorry it took me a bit of time. Here you go:
> 
> (1002) $ 
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/b

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Ralph Castain
I can answer that for you right now. The launch of the orted's is what is 
failing, and they are "silently" failing at this time. The reason is simple:

1. we are failing due to truncation of the HNP uri at the first semicolon. This 
causes the orted to emit an ORTE_ERROR_LOG message and then abort with a 
non-zero exit status

2. we throw away the error message unless someone adds --debug-daemons because 
we redirect the srun output to /dev/null. This is done because slurm spits out 
other things during our normal operation that confuse users

3. srun detects the non-zero exit status of the orted and aborts the rest of 
the job.

So when Matt adds --debug-daemons, he then sees the error messages. When he 
further adds the oob and plm verbosity, the true error is fully exposed.
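
The truncation in step 1 is easy to reproduce outside of OMPI. A sketch,
with echo standing in for orted and a uri modeled on the one from Matt's
log:

#!/usr/bin/perl
use strict;
use warnings;

my $uri = '1842544640.0;tcp://10.1.24.169:41684';

# Shell-parsed: sh runs "echo -mca orte_hnp_uri 1842544640.0" and then
# tries to execute "tcp://10.1.24.169:41684" as a command -- the same
# "sh: tcp://...: No such file or directory" seen in the earlier log.
system("echo -mca orte_hnp_uri $uri");

# Exec'd directly: the uri survives as a single argument.
system('echo', '-mca', 'orte_hnp_uri', $uri);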


On Sep 2, 2014, at 2:35 PM, Jeff Squyres (jsquyres)  wrote:

> Matt --
> 
> We were discussing this issue on our weekly OMPI engineering call today.
> 
> Can you check one thing for me?  With the un-edited 1.8.2 tarball 
> installation, I see that you're getting no output for commands that you run 
> -- but also no errors.
> 
> Can you verify and see if your commands are actually *running*?  E.g., try:
> 
> $ cat > script.sh <<EOF
> #!/bin/sh
> echo hello world
> sleep 600
> echo goodbye world
> EOF
> $ chmod +x script.sh
> $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-clean/bin/mpirun 
> -np 8 script.sh
> 
> and then go "ps" on the back-end nodes and see if there is an "orted" process 
> and N "sleep 600" processes running on them.
> 
> I'm *assuming* you won't see the "hello world" output.
> 
> The purpose of this test is that I want to see if OMPI is just totally erring 
> out and not even running your job (which is quite unlikely; OMPI should be 
> much more noisy when this happens), or whether we're simply not seeing the 
> stdout from the job.
> 
> Thanks.
> 
> 
> 
> On Sep 2, 2014, at 9:36 AM, Matt Thompson  wrote:
> 
>> On that machine, it would be SLES 11 SP1. I think it's soon transitioning to 
>> SLES 11 SP3.
>> 
>> I also use Open MPI on an RHEL 6.5 box (possibly soon to be RHEL 7).
>> 
>> 
>> On Mon, Sep 1, 2014 at 8:41 PM, Ralph Castain  wrote:
>> Thanks - I expect we'll have to release 1.8.3 soon to fix this in case 
>> others have similar issues. Out of curiosity, what OS are you using?
>> 
>> 
>> On Sep 1, 2014, at 9:00 AM, Matt Thompson  wrote:
>> 
>>> Ralph,
>>> 
>>> Okay that seems to have done it here (well, minus the usual 
>>> shmem_mmap_enable_nfs_warning that our system always generates):
>>> 
>>> (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
>>> (1034) $ 
>>> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun
>>>  -np 8 ./helloWorld.182-debug-patch.x
>>> Process 7 of 8 is on borg01w218
>>> Process 5 of 8 is on borg01w218
>>> Process 1 of 8 is on borg01w218
>>> Process 3 of 8 is on borg01w218
>>> Process 0 of 8 is on borg01w218
>>> Process 2 of 8 is on borg01w218
>>> Process 4 of 8 is on borg01w218
>>> Process 6 of 8 is on borg01w218
>>> 
>>> I'll ask the admin to apply the patch locally...and wait for 1.8.3, I 
>>> suppose.
>>> 
>>> Thanks,
>>> Matt
>>> 
>>> On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain  wrote:
>>> Hmmm...I may see the problem. Would you be so kind as to apply the 
>>> attached patch to your 1.8.2 code, rebuild, and try again?
>>> 
>>> Much appreciate the help. Everyone's system is slightly different, and I 
>>> think you've uncovered one of those differences.
>>> Ralph
>>> 
>>> 
>>> 
>>> On Aug 31, 2014, at 6:25 AM, Matt Thompson  wrote:
>>> 
 Ralph,
 
 Sorry it took me a bit of time. Here you go:
 
 (1002) $ 
 /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun 
 --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca 
 plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
 [borg01w063:03815] mca:base:select:(  plm) Querying component [isolated]
 [borg01w063:03815] mca:base:select:(  plm) Query of component [isolated] 
 set priority to 0
 [borg01w063:03815] mca:base:select:(  plm) Querying component [rsh]
 [borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
 path NULL
 [borg01w063:03815] mca:base:select:(  plm) Query of component [rsh] set 
 priority to 10
 [borg01w063:03815] mca:base:select:(  plm) Querying component [slurm]
 [borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for selection
 [borg01w063:03815] mca:base:select:(  plm) Query of component [slurm] set 
 priority to 75
 [borg01w063:03815] mca:base:select:(  plm) Selected component [slurm]
 [borg01w063:03815] plm:base:set_hnp_name: initial bias 3815 nodename hash 
 1757783593
 [borg01w063:03815] plm:base:set_hnp_name: final jobfam 49163
 [borg01w063:03815] mca: base: components_register: registering oob 
 components

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Jeff Squyres (jsquyres)
Matt --

We were discussing this issue on our weekly OMPI engineering call today.

Can you check one thing for me?  With the un-edited 1.8.2 tarball installation, 
I see that you're getting no output for commands that you run -- but also no 
errors.

Can you verify and see if your commands are actually *running*?  E.g., try:

$ cat > script.sh <<EOF
#!/bin/sh
echo hello world
sleep 600
echo goodbye world
EOF
$ chmod +x script.sh
$ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
$ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-clean/bin/mpirun -np 8 script.sh

and then go "ps" on the back-end nodes and see if there is an "orted" process 
and N "sleep 600" processes running on them.

I'm *assuming* you won't see the "hello world" output.

The purpose of this test is that I want to see if OMPI is just totally erring 
out and not even running your job (which is quite unlikely; OMPI should be 
much more noisy when this happens), or whether we're simply not seeing the 
stdout from the job.

Thanks.



On Sep 2, 2014, at 9:36 AM, Matt Thompson  wrote:

> On that machine, it would be SLES 11 SP1. I think it's soon transitioning to 
> SLES 11 SP3.
> 
> I also use Open MPI on an RHEL 6.5 box (possibly soon to be RHEL 7).
> 
> 
> On Mon, Sep 1, 2014 at 8:41 PM, Ralph Castain  wrote:
> Thanks - I expect we'll have to release 1.8.3 soon to fix this in case others 
> have similar issues. Out of curiosity, what OS are you using?
> 
> 
> On Sep 1, 2014, at 9:00 AM, Matt Thompson  wrote:
> 
>> Ralph,
>> 
>> Okay that seems to have done it here (well, minus the usual 
>> shmem_mmap_enable_nfs_warning that our system always generates):
>> 
>> (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
>> (1034) $ 
>> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun
>>  -np 8 ./helloWorld.182-debug-patch.x
>> Process 7 of 8 is on borg01w218
>> Process 5 of 8 is on borg01w218
>> Process 1 of 8 is on borg01w218
>> Process 3 of 8 is on borg01w218
>> Process 0 of 8 is on borg01w218
>> Process 2 of 8 is on borg01w218
>> Process 4 of 8 is on borg01w218
>> Process 6 of 8 is on borg01w218
>> 
>> I'll ask the admin to apply the patch locally...and wait for 1.8.3, I 
>> suppose.
>> 
>> Thanks,
>> Matt
>> 
>> On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain  wrote:
>> Hmmm...I may see the problem. Would you be so kind as to apply the attached 
>> patch to your 1.8.2 code, rebuild, and try again?
>> 
>> Much appreciate the help. Everyone's system is slightly different, and I 
>> think you've uncovered one of those differences.
>> Ralph
>> 
>> 
>> 
>> On Aug 31, 2014, at 6:25 AM, Matt Thompson  wrote:
>> 
>>> Ralph,
>>> 
>>> Sorry it took me a bit of time. Here you go:
>>> 
>>> (1002) $ 
>>> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun 
>>> --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca 
>>> plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
>>> [borg01w063:03815] mca:base:select:(  plm) Querying component [isolated]
>>> [borg01w063:03815] mca:base:select:(  plm) Query of component [isolated] 
>>> set priority to 0
>>> [borg01w063:03815] mca:base:select:(  plm) Querying component [rsh]
>>> [borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
>>> path NULL
>>> [borg01w063:03815] mca:base:select:(  plm) Query of component [rsh] set 
>>> priority to 10
>>> [borg01w063:03815] mca:base:select:(  plm) Querying component [slurm]
>>> [borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for selection
>>> [borg01w063:03815] mca:base:select:(  plm) Query of component [slurm] set 
>>> priority to 75
>>> [borg01w063:03815] mca:base:select:(  plm) Selected component [slurm]
>>> [borg01w063:03815] plm:base:set_hnp_name: initial bias 3815 nodename hash 
>>> 1757783593
>>> [borg01w063:03815] plm:base:set_hnp_name: final jobfam 49163
>>> [borg01w063:03815] mca: base: components_register: registering oob 
>>> components
>>> [borg01w063:03815] mca: base: components_register: found loaded component 
>>> tcp
>>> [borg01w063:03815] mca: base: components_register: component tcp register 
>>> function successful
>>> [borg01w063:03815] mca: base: components_open: opening oob components
>>> [borg01w063:03815] mca: base: components_open: found loaded component tcp
>>> [borg01w063:03815] mca: base: components_open: component tcp open function 
>>> successful
>>> [borg01w063:03815] mca:oob:select: checking available component tcp
>>> [borg01w063:03815] mca:oob:select: Querying component [tcp]
>>> [borg01w063:03815] oob:tcp: component_available called
>>> [borg01w063:03815] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>> [borg01w063:03815] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
>>> [borg01w063:03815] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
>>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.1.24.63 to our list 
>>> of V4 connections
>>> [borg01w063:03815] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
>>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 172.31.1.254 to our 
>>> list of V4 connections
>>> [borg01w063:03815] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
>>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.12.24.63 to our 
>>> list of V4 connections
>>> [borg01w063:03815] [[49163,0],0] TCP STARTUP
>>> [borg01w063:03815] [[49163,0],0] attempting to bind to IPv4 port 0
>>> [borg01w063:03815] [[49163,0],0] assigned IPv4 port 41373
>>> [borg01w063:03815] mca:oob:select: Adding component to end
>>> [borg01w063:03815] mca:oob:select: Found 1 active transports
>>> [borg01w063:03815] [[49163,0],0] plm:base:receive start comm
>>> [borg01w063:03815] [[49163,0],0] plm:base:s

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Matt Thompson
On that machine, it would be SLES 11 SP1. I think it's soon transitioning
to SLES 11 SP3.

I also use Open MPI on an RHEL 6.5 box (possibly soon to be RHEL 7).


On Mon, Sep 1, 2014 at 8:41 PM, Ralph Castain  wrote:

> Thanks - I expect we'll have to release 1.8.3 soon to fix this in case
> others have similar issues. Out of curiosity, what OS are you using?
>
>
> On Sep 1, 2014, at 9:00 AM, Matt Thompson  wrote:
>
> Ralph,
>
> Okay that seems to have done it here (well, minus the
> usual shmem_mmap_enable_nfs_warning that our system always generates):
>
> (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> (1034) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun
> -np 8 ./helloWorld.182-debug-patch.x
> Process 7 of 8 is on borg01w218
> Process 5 of 8 is on borg01w218
> Process 1 of 8 is on borg01w218
> Process 3 of 8 is on borg01w218
> Process 0 of 8 is on borg01w218
> Process 2 of 8 is on borg01w218
> Process 4 of 8 is on borg01w218
> Process 6 of 8 is on borg01w218
>
> I'll ask the admin to apply the patch locally...and wait for 1.8.3, I
> suppose.
>
> Thanks,
> Matt
>
> On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain  wrote:
>
>> Hmmm...I may see the problem. Would you be so kind as to apply the
>> attached patch to your 1.8.2 code, rebuild, and try again?
>>
>> Much appreciate the help. Everyone's system is slightly different, and I
>> think you've uncovered one of those differences.
>> Ralph
>>
>>
>>
>> On Aug 31, 2014, at 6:25 AM, Matt Thompson  wrote:
>>
>> Ralph,
>>
>> Sorry it took me a bit of time. Here you go:
>>
>> (1002) $
>> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
>> --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca
>> plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
>> [borg01w063:03815] mca:base:select:(  plm) Querying component [isolated]
>> [borg01w063:03815] mca:base:select:(  plm) Query of component [isolated]
>> set priority to 0
>> [borg01w063:03815] mca:base:select:(  plm) Querying component [rsh]
>> [borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
>> path NULL
>> [borg01w063:03815] mca:base:select:(  plm) Query of component [rsh] set
>> priority to 10
>> [borg01w063:03815] mca:base:select:(  plm) Querying component [slurm]
>> [borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for selection
>> [borg01w063:03815] mca:base:select:(  plm) Query of component [slurm] set
>> priority to 75
>> [borg01w063:03815] mca:base:select:(  plm) Selected component [slurm]
>> [borg01w063:03815] plm:base:set_hnp_name: initial bias 3815 nodename hash
>> 1757783593
>> [borg01w063:03815] plm:base:set_hnp_name: final jobfam 49163
>> [borg01w063:03815] mca: base: components_register: registering oob
>> components
>> [borg01w063:03815] mca: base: components_register: found loaded component
>> tcp
>> [borg01w063:03815] mca: base: components_register: component tcp register
>> function successful
>> [borg01w063:03815] mca: base: components_open: opening oob components
>> [borg01w063:03815] mca: base: components_open: found loaded component tcp
>> [borg01w063:03815] mca: base: components_open: component tcp open
>> function successful
>> [borg01w063:03815] mca:oob:select: checking available component tcp
>> [borg01w063:03815] mca:oob:select: Querying component [tcp]
>> [borg01w063:03815] oob:tcp: component_available called
>> [borg01w063:03815] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>> [borg01w063:03815] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
>> [borg01w063:03815] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.1.24.63 to our
>> list of V4 connections
>> [borg01w063:03815] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 172.31.1.254 to our
>> list of V4 connections
>> [borg01w063:03815] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.12.24.63 to our
>> list of V4 connections
>> [borg01w063:03815] [[49163,0],0] TCP STARTUP
>> [borg01w063:03815] [[49163,0],0] attempting to bind to IPv4 port 0
>> [borg01w063:03815] [[49163,0],0] assigned IPv4 port 41373
>> [borg01w063:03815] mca:oob:select: Adding component to end
>> [borg01w063:03815] mca:oob:select: Found 1 active transports
>> [borg01w063:03815] [[49163,0],0] plm:base:receive start comm
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_job
>> [borg01w063:03815] [[49163,0],0] plm:slurm: LAUNCH DAEMONS CALLED
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm creating map
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
>> [[49163,0],1]
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon
>> [[49163,0],1] to node borg01w064
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
>> [[49163,0],2]
>>

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-01 Thread Ralph Castain
Thanks - I expect we'll have to release 1.8.3 soon to fix this in case others 
have similar issues. Out of curiosity, what OS are you using?


On Sep 1, 2014, at 9:00 AM, Matt Thompson  wrote:

> Ralph,
> 
> Okay, that seems to have done it here (well, minus the usual 
> shmem_mmap_enable_nfs_warning that our system always generates):
> 
> (1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> (1034) $ 
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun
>  -np 8 ./helloWorld.182-debug-patch.x
> Process 7 of 8 is on borg01w218
> Process 5 of 8 is on borg01w218
> Process 1 of 8 is on borg01w218
> Process 3 of 8 is on borg01w218
> Process 0 of 8 is on borg01w218
> Process 2 of 8 is on borg01w218
> Process 4 of 8 is on borg01w218
> Process 6 of 8 is on borg01w218
> 
> I'll ask the admin to apply the patch locally...and wait for 1.8.3, I suppose.
> 
> Thanks,
> Matt
> 
> On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain  wrote:
> Hmmm... I may see the problem. Would you be so kind as to apply the attached 
> patch to your 1.8.2 code, rebuild, and try again?
> 
> Much appreciate the help. Everyone's system is slightly different, and I 
> think you've uncovered one of those differences.
> Ralph
> 
> 
> 
> On Aug 31, 2014, at 6:25 AM, Matt Thompson  wrote:
> 
>> Ralph,
>> 
>> Sorry it took me a bit of time. Here you go:
>> 
>> (1002) $ 
>> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun 
>> --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca 
>> plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
>> [borg01w063:03815] mca:base:select:(  plm) Querying component [isolated]
>> [borg01w063:03815] mca:base:select:(  plm) Query of component [isolated] set 
>> priority to 0
>> [borg01w063:03815] mca:base:select:(  plm) Querying component [rsh]
>> [borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
>> path NULL
>> [borg01w063:03815] mca:base:select:(  plm) Query of component [rsh] set 
>> priority to 10
>> [borg01w063:03815] mca:base:select:(  plm) Querying component [slurm]
>> [borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for selection
>> [borg01w063:03815] mca:base:select:(  plm) Query of component [slurm] set 
>> priority to 75
>> [borg01w063:03815] mca:base:select:(  plm) Selected component [slurm]
>> [borg01w063:03815] plm:base:set_hnp_name: initial bias 3815 nodename hash 
>> 1757783593
>> [borg01w063:03815] plm:base:set_hnp_name: final jobfam 49163
>> [borg01w063:03815] mca: base: components_register: registering oob components
>> [borg01w063:03815] mca: base: components_register: found loaded component tcp
>> [borg01w063:03815] mca: base: components_register: component tcp register 
>> function successful
>> [borg01w063:03815] mca: base: components_open: opening oob components
>> [borg01w063:03815] mca: base: components_open: found loaded component tcp
>> [borg01w063:03815] mca: base: components_open: component tcp open function 
>> successful
>> [borg01w063:03815] mca:oob:select: checking available component tcp
>> [borg01w063:03815] mca:oob:select: Querying component [tcp]
>> [borg01w063:03815] oob:tcp: component_available called
>> [borg01w063:03815] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>> [borg01w063:03815] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
>> [borg01w063:03815] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.1.24.63 to our list 
>> of V4 connections
>> [borg01w063:03815] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 172.31.1.254 to our 
>> list of V4 connections
>> [borg01w063:03815] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
>> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.12.24.63 to our list 
>> of V4 connections
>> [borg01w063:03815] [[49163,0],0] TCP STARTUP
>> [borg01w063:03815] [[49163,0],0] attempting to bind to IPv4 port 0
>> [borg01w063:03815] [[49163,0],0] assigned IPv4 port 41373
>> [borg01w063:03815] mca:oob:select: Adding component to end
>> [borg01w063:03815] mca:oob:select: Found 1 active transports
>> [borg01w063:03815] [[49163,0],0] plm:base:receive start comm
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_job
>> [borg01w063:03815] [[49163,0],0] plm:slurm: LAUNCH DAEMONS CALLED
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm creating map
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon 
>> [[49163,0],1]
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon 
>> [[49163,0],1] to node borg01w064
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon 
>> [[49163,0],2]
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon 
>> [[49163,0],2] to node borg01w065
>> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon 
>> [[49163,0],3]
>> [borg01w063:0

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-01 Thread Matt Thompson
Ralph,

Okay, that seems to have done it here (well, minus the
usual shmem_mmap_enable_nfs_warning that our system always generates):

(1033) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
(1034) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug-patch/bin/mpirun
-np 8 ./helloWorld.182-debug-patch.x
Process 7 of 8 is on borg01w218
Process 5 of 8 is on borg01w218
Process 1 of 8 is on borg01w218
Process 3 of 8 is on borg01w218
Process 0 of 8 is on borg01w218
Process 2 of 8 is on borg01w218
Process 4 of 8 is on borg01w218
Process 6 of 8 is on borg01w218
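
As an aside, an OMPI_MCA_ environment variable like the one above maps
one-to-one onto the matching command-line MCA parameter, so the warning can
also be silenced per invocation rather than via the environment; a minimal
sketch using the same executable:

$ mpirun --mca shmem_mmap_enable_nfs_warning 0 -np 8 ./helloWorld.182-debug-patch.x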

I'll ask the admin to apply the patch locally...and wait for 1.8.3, I
suppose.

Thanks,
Matt

On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain  wrote:

> Hmmm... I may see the problem. Would you be so kind as to apply the
> attached patch to your 1.8.2 code, rebuild, and try again?
>
> Much appreciate the help. Everyone's system is slightly different, and I
> think you've uncovered one of those differences.
> Ralph
>
>
>
> On Aug 31, 2014, at 6:25 AM, Matt Thompson  wrote:
>
> Ralph,
>
> Sorry it took me a bit of time. Here you go:
>
> (1002) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
> --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca
> plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
> [borg01w063:03815] mca:base:select:(  plm) Querying component [isolated]
> [borg01w063:03815] mca:base:select:(  plm) Query of component [isolated]
> set priority to 0
> [borg01w063:03815] mca:base:select:(  plm) Querying component [rsh]
> [borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
> path NULL
> [borg01w063:03815] mca:base:select:(  plm) Query of component [rsh] set
> priority to 10
> [borg01w063:03815] mca:base:select:(  plm) Querying component [slurm]
> [borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for selection
> [borg01w063:03815] mca:base:select:(  plm) Query of component [slurm] set
> priority to 75
> [borg01w063:03815] mca:base:select:(  plm) Selected component [slurm]
> [borg01w063:03815] plm:base:set_hnp_name: initial bias 3815 nodename hash
> 1757783593
> [borg01w063:03815] plm:base:set_hnp_name: final jobfam 49163
> [borg01w063:03815] mca: base: components_register: registering oob
> components
> [borg01w063:03815] mca: base: components_register: found loaded component
> tcp
> [borg01w063:03815] mca: base: components_register: component tcp register
> function successful
> [borg01w063:03815] mca: base: components_open: opening oob components
> [borg01w063:03815] mca: base: components_open: found loaded component tcp
> [borg01w063:03815] mca: base: components_open: component tcp open function
> successful
> [borg01w063:03815] mca:oob:select: checking available component tcp
> [borg01w063:03815] mca:oob:select: Querying component [tcp]
> [borg01w063:03815] oob:tcp: component_available called
> [borg01w063:03815] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [borg01w063:03815] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> [borg01w063:03815] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.1.24.63 to our
> list of V4 connections
> [borg01w063:03815] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 172.31.1.254 to our
> list of V4 connections
> [borg01w063:03815] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> [borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.12.24.63 to our
> list of V4 connections
> [borg01w063:03815] [[49163,0],0] TCP STARTUP
> [borg01w063:03815] [[49163,0],0] attempting to bind to IPv4 port 0
> [borg01w063:03815] [[49163,0],0] assigned IPv4 port 41373
> [borg01w063:03815] mca:oob:select: Adding component to end
> [borg01w063:03815] mca:oob:select: Found 1 active transports
> [borg01w063:03815] [[49163,0],0] plm:base:receive start comm
> [borg01w063:03815] [[49163,0],0] plm:base:setup_job
> [borg01w063:03815] [[49163,0],0] plm:slurm: LAUNCH DAEMONS CALLED
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm creating map
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
> [[49163,0],1]
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon
> [[49163,0],1] to node borg01w064
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
> [[49163,0],2]
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon
> [[49163,0],2] to node borg01w065
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
> [[49163,0],3]
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon
> [[49163,0],3] to node borg01w069
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
> [[49163,0],4]
> [borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon
> [[49163,0],4] to node borg01w070
> [borg01w063:03815] [[49163,0],0] plm:base:set

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-31 Thread Ralph Castain
Hmmm... I may see the problem. Would you be so kind as to apply the attached patch to your 1.8.2 code, rebuild, and try again?

Much appreciate the help. Everyone's system is slightly different, and I think you've uncovered one of those differences.

Ralph

uri.diff
Description: Binary data
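
For anyone following along, a patch like this is applied from the top of the
unpacked 1.8.2 source tree before rebuilding; a minimal sketch, assuming
uri.diff is a unified diff with paths relative to the source root (adjust the
-p strip level to match the diff headers):

$ cd openmpi-1.8.2
$ patch -p1 < uri.diff
$ make all install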
On Aug 31, 2014, at 6:25 AM, Matt Thompson  wrote:

Ralph,

Sorry it took me a bit of time. Here you go:

(1002) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
[borg01w063:03815] mca:base:select:(  plm) Querying component [isolated]
[borg01w063:03815] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[borg01w063:03815] mca:base:select:(  plm) Querying component [rsh]
[borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[borg01w063:03815] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[borg01w063:03815] mca:base:select:(  plm) Querying component [slurm]
[borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for selection
[borg01w063:03815] mca:base:select:(  plm) Query of component [slurm] set priority to 75
[borg01w063:03815] mca:base:select:(  plm) Selected component [slurm]
[borg01w063:03815] plm:base:set_hnp_name: initial bias 3815 nodename hash 1757783593
[borg01w063:03815] plm:base:set_hnp_name: final jobfam 49163
[borg01w063:03815] mca: base: components_register: registering oob components
[borg01w063:03815] mca: base: components_register: found loaded component tcp
[borg01w063:03815] mca: base: components_register: component tcp register function successful
[borg01w063:03815] mca: base: components_open: opening oob components
[borg01w063:03815] mca: base: components_open: found loaded component tcp
[borg01w063:03815] mca: base: components_open: component tcp open function successful
[borg01w063:03815] mca:oob:select: checking available component tcp
[borg01w063:03815] mca:oob:select: Querying component [tcp]
[borg01w063:03815] oob:tcp: component_available called
[borg01w063:03815] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01w063:03815] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01w063:03815] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.1.24.63 to our list of V4 connections
[borg01w063:03815] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01w063:03815] [[49163,0],0] oob:tcp:init adding 172.31.1.254 to our list of V4 connections
[borg01w063:03815] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.12.24.63 to our list of V4 connections
[borg01w063:03815] [[49163,0],0] TCP STARTUP
[borg01w063:03815] [[49163,0],0] attempting to bind to IPv4 port 0
[borg01w063:03815] [[49163,0],0] assigned IPv4 port 41373
[borg01w063:03815] mca:oob:select: Adding component to end
[borg01w063:03815] mca:oob:select: Found 1 active transports
[borg01w063:03815] [[49163,0],0] plm:base:receive start comm
[borg01w063:03815] [[49163,0],0] plm:base:setup_job
[borg01w063:03815] [[49163,0],0] plm:slurm: LAUNCH DAEMONS CALLED
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm creating map
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon [[49163,0],1]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon [[49163,0],1] to node borg01w064
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon [[49163,0],2]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon [[49163,0],2] to node borg01w065
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon [[49163,0],3]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon [[49163,0],3] to node borg01w069
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon [[49163,0],4]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon [[49163,0],4] to node borg01w070
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon [[49163,0],5]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon [[49163,0],5] to node borg01w071
[borg01w063:03815] [[49163,0],0] plm:slurm: launching on nodes borg01w064,borg01w065,borg01w069,borg01w070,borg01w071
[borg01w063:03815] [[49163,0],0] plm:slurm: Set prefix:/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug
[borg01w063:03815] [[49163,0],0] plm:slurm: final top-level argv:
	srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=5 --nodelist=borg01w064,borg01w065,borg01w069,borg01w070,borg01w071 --ntasks=5 orted -mca orte_debug_daemons 1 -mca orte_leave_session_attached 1 -mca orte_ess_jobid 3221946368 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 6 -mca orte_hnp_uri 3221946368.0;tcp://10.1.24.63,172.31.1.254,10.12.24.63:41373 --mca oob_base_verbose 10 -mca plm_base_verbose 5
[borg01w063:03815] [[49163,0],0

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-31 Thread Matt Thompson
Ralph,

Sorry it took me a bit of time. Here you go:

(1002) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
--leave-session-attached --debug-daemons --mca oob_base_verbose 10 -mca
plm_base_verbose 5 -np 8 ./helloWorld.182-debug.x
[borg01w063:03815] mca:base:select:(  plm) Querying component [isolated]
[borg01w063:03815] mca:base:select:(  plm) Query of component [isolated]
set priority to 0
[borg01w063:03815] mca:base:select:(  plm) Querying component [rsh]
[borg01w063:03815] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
path NULL
[borg01w063:03815] mca:base:select:(  plm) Query of component [rsh] set
priority to 10
[borg01w063:03815] mca:base:select:(  plm) Querying component [slurm]
[borg01w063:03815] [[INVALID],INVALID] plm:slurm: available for selection
[borg01w063:03815] mca:base:select:(  plm) Query of component [slurm] set
priority to 75
[borg01w063:03815] mca:base:select:(  plm) Selected component [slurm]
[borg01w063:03815] plm:base:set_hnp_name: initial bias 3815 nodename hash
1757783593
[borg01w063:03815] plm:base:set_hnp_name: final jobfam 49163
[borg01w063:03815] mca: base: components_register: registering oob
components
[borg01w063:03815] mca: base: components_register: found loaded component
tcp
[borg01w063:03815] mca: base: components_register: component tcp register
function successful
[borg01w063:03815] mca: base: components_open: opening oob components
[borg01w063:03815] mca: base: components_open: found loaded component tcp
[borg01w063:03815] mca: base: components_open: component tcp open function
successful
[borg01w063:03815] mca:oob:select: checking available component tcp
[borg01w063:03815] mca:oob:select: Querying component [tcp]
[borg01w063:03815] oob:tcp: component_available called
[borg01w063:03815] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01w063:03815] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01w063:03815] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.1.24.63 to our list
of V4 connections
[borg01w063:03815] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01w063:03815] [[49163,0],0] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01w063:03815] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01w063:03815] [[49163,0],0] oob:tcp:init adding 10.12.24.63 to our
list of V4 connections
[borg01w063:03815] [[49163,0],0] TCP STARTUP
[borg01w063:03815] [[49163,0],0] attempting to bind to IPv4 port 0
[borg01w063:03815] [[49163,0],0] assigned IPv4 port 41373
[borg01w063:03815] mca:oob:select: Adding component to end
[borg01w063:03815] mca:oob:select: Found 1 active transports
[borg01w063:03815] [[49163,0],0] plm:base:receive start comm
[borg01w063:03815] [[49163,0],0] plm:base:setup_job
[borg01w063:03815] [[49163,0],0] plm:slurm: LAUNCH DAEMONS CALLED
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm creating map
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
[[49163,0],1]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon
[[49163,0],1] to node borg01w064
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
[[49163,0],2]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon
[[49163,0],2] to node borg01w065
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
[[49163,0],3]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon
[[49163,0],3] to node borg01w069
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
[[49163,0],4]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon
[[49163,0],4] to node borg01w070
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm add new daemon
[[49163,0],5]
[borg01w063:03815] [[49163,0],0] plm:base:setup_vm assigning new daemon
[[49163,0],5] to node borg01w071
[borg01w063:03815] [[49163,0],0] plm:slurm: launching on nodes
borg01w064,borg01w065,borg01w069,borg01w070,borg01w071
[borg01w063:03815] [[49163,0],0] plm:slurm: Set
prefix:/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug
[borg01w063:03815] [[49163,0],0] plm:slurm: final top-level argv:
srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=5
--nodelist=borg01w064,borg01w065,borg01w069,borg01w070,borg01w071
--ntasks=5 orted -mca orte_debug_daemons 1 -mca orte_leave_session_attached
1 -mca orte_ess_jobid 3221946368 -mca orte_ess_vpid 1 -mca
orte_ess_num_procs 6 -mca orte_hnp_uri 3221946368.0;tcp://10.1.24.63
,172.31.1.254,10.12.24.63:41373 --mca oob_base_verbose 10 -mca
plm_base_verbose 5
[borg01w063:03815] [[49163,0],0] plm:slurm: reset PATH:
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin:/usr/local/other/SLES11/gcc/4.9.1/bin:/usr/local/other/SLES11.1/git/
1.8.5.2/libexec/git-core:/usr/local/other/SLES11.1/git/1.8.5.2/bin:/usr/local/other/SLES11/svn/1.6.17/bin:/usr/local/other/SLES11/tkcvs/8.2.3/gcc-4.3.2/bin:.:/home/mathomp4/bin:/home/mathomp4/cv

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Ralph Castain
Rats - I also need "-mca plm_base_verbose 5" on there so I can see the cmd line 
being executed. Can you add it?


On Aug 29, 2014, at 11:16 AM, Matt Thompson  wrote:

> Ralph,
> 
> Here you go:
> 
> (1080) $ 
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun 
> --leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8 
> ./helloWorld.182-debug.x
> [borg01x142:29232] mca: base: components_register: registering oob components
> [borg01x142:29232] mca: base: components_register: found loaded component tcp
> [borg01x142:29232] mca: base: components_register: component tcp register 
> function successful
> [borg01x142:29232] mca: base: components_open: opening oob components
> [borg01x142:29232] mca: base: components_open: found loaded component tcp
> [borg01x142:29232] mca: base: components_open: component tcp open function 
> successful
> [borg01x142:29232] mca:oob:select: checking available component tcp
> [borg01x142:29232] mca:oob:select: Querying component [tcp]
> [borg01x142:29232] oob:tcp: component_available called
> [borg01x142:29232] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [borg01x142:29232] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> [borg01x142:29232] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.1.25.142 to our list 
> of V4 connections
> [borg01x142:29232] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 172.31.1.254 to our list 
> of V4 connections
> [borg01x142:29232] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> [borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.12.25.142 to our list 
> of V4 connections
> [borg01x142:29232] [[52298,0],0] TCP STARTUP
> [borg01x142:29232] [[52298,0],0] attempting to bind to IPv4 port 0
> [borg01x142:29232] [[52298,0],0] assigned IPv4 port 41686
> [borg01x142:29232] mca:oob:select: Adding component to end
> [borg01x142:29232] mca:oob:select: Found 1 active transports
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> [borg01x153:01290] mca: base: components_register: registering oob components
> [borg01x153:01290] mca: base: components_register: found loaded component tcp
> [borg01x143:13793] mca: base: components_register: registering oob components
> [borg01x143:13793] mca: base: components_register: found loaded component tcp
> [borg01x153:01290] mca: base: components_register: component tcp register 
> function successful
> [borg01x153:01290] mca: base: components_open: opening oob components
> [borg01x153:01290] mca: base: components_open: found loaded component tcp
> [borg01x153:01290] mca: base: components_open: component tcp open function 
> successful
> [borg01x153:01290] mca:oob:select: checking available component tcp
> [borg01x153:01290] mca:oob:select: Querying component [tcp]
> [borg01x153:01290] oob:tcp: component_available called
> [borg01x153:01290] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [borg01x153:01290] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> [borg01x153:01290] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.1.25.153 to our list 
> of V4 connections
> [borg01x153:01290] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 172.31.1.254 to our list 
> of V4 connections
> [borg01x153:01290] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> [borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.12.25.153 to our list 
> of V4 connections
> [borg01x153:01290] [[52298,0],4] TCP STARTUP
> [borg01x153:01290] [[52298,0],4] attempting to bind to IPv4 port 0
> [borg01x143:13793] mca: base: components_register: component tcp register 
> function successful
> [borg01x153:01290] [[52298,0],4] assigned IPv4 port 38028
> [borg01x143:13793] mca: base: components_open: opening oob components
> [borg01x143:13793] mca: base: components_open: found loaded component tcp
> [borg01x143:13793] mca: base: components_open: component tcp open function 
> successful
> [borg01x143:13793] mca:oob:select: checking available component tcp
> [borg01x143:13793] mca:oob:select: Querying component [tcp]
> [borg01x143:13793] oob:tcp: component_available called
> [borg01x143:13793] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [borg01x143:13793] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
> [borg01x143:13793] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
> [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.1.25.143 to our list 
> of V4 connections
> [borg01x143:13793] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
> [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 172.31.1.254 to our list 
> of V4 connections
> [borg01x143:13793] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
> [borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.12.25.143 to our list 
> of V4 connections
> [borg01x143:13793] [[52298,0

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Matt Thompson
Ralph,

Here you go:

(1080) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2-debug/bin/mpirun
--leave-session-attached --debug-daemons --mca oob_base_verbose 10 -np 8
./helloWorld.182-debug.x
[borg01x142:29232] mca: base: components_register: registering oob
components
[borg01x142:29232] mca: base: components_register: found loaded component
tcp
[borg01x142:29232] mca: base: components_register: component tcp register
function successful
[borg01x142:29232] mca: base: components_open: opening oob components
[borg01x142:29232] mca: base: components_open: found loaded component tcp
[borg01x142:29232] mca: base: components_open: component tcp open function
successful
[borg01x142:29232] mca:oob:select: checking available component tcp
[borg01x142:29232] mca:oob:select: Querying component [tcp]
[borg01x142:29232] oob:tcp: component_available called
[borg01x142:29232] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x142:29232] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x142:29232] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.1.25.142 to our
list of V4 connections
[borg01x142:29232] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x142:29232] [[52298,0],0] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01x142:29232] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x142:29232] [[52298,0],0] oob:tcp:init adding 10.12.25.142 to our
list of V4 connections
[borg01x142:29232] [[52298,0],0] TCP STARTUP
[borg01x142:29232] [[52298,0],0] attempting to bind to IPv4 port 0
[borg01x142:29232] [[52298,0],0] assigned IPv4 port 41686
[borg01x142:29232] mca:oob:select: Adding component to end
[borg01x142:29232] mca:oob:select: Found 1 active transports
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
[borg01x153:01290] mca: base: components_register: registering oob
components
[borg01x153:01290] mca: base: components_register: found loaded component
tcp
[borg01x143:13793] mca: base: components_register: registering oob
components
[borg01x143:13793] mca: base: components_register: found loaded component
tcp
[borg01x153:01290] mca: base: components_register: component tcp register
function successful
[borg01x153:01290] mca: base: components_open: opening oob components
[borg01x153:01290] mca: base: components_open: found loaded component tcp
[borg01x153:01290] mca: base: components_open: component tcp open function
successful
[borg01x153:01290] mca:oob:select: checking available component tcp
[borg01x153:01290] mca:oob:select: Querying component [tcp]
[borg01x153:01290] oob:tcp: component_available called
[borg01x153:01290] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x153:01290] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x153:01290] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.1.25.153 to our
list of V4 connections
[borg01x153:01290] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x153:01290] [[52298,0],4] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01x153:01290] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x153:01290] [[52298,0],4] oob:tcp:init adding 10.12.25.153 to our
list of V4 connections
[borg01x153:01290] [[52298,0],4] TCP STARTUP
[borg01x153:01290] [[52298,0],4] attempting to bind to IPv4 port 0
[borg01x143:13793] mca: base: components_register: component tcp register
function successful
[borg01x153:01290] [[52298,0],4] assigned IPv4 port 38028
[borg01x143:13793] mca: base: components_open: opening oob components
[borg01x143:13793] mca: base: components_open: found loaded component tcp
[borg01x143:13793] mca: base: components_open: component tcp open function
successful
[borg01x143:13793] mca:oob:select: checking available component tcp
[borg01x143:13793] mca:oob:select: Querying component [tcp]
[borg01x143:13793] oob:tcp: component_available called
[borg01x143:13793] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[borg01x143:13793] WORKING INTERFACE 2 KERNEL INDEX 1 FAMILY: V4
[borg01x143:13793] WORKING INTERFACE 3 KERNEL INDEX 2 FAMILY: V4
[borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.1.25.143 to our
list of V4 connections
[borg01x143:13793] WORKING INTERFACE 4 KERNEL INDEX 4 FAMILY: V4
[borg01x143:13793] [[52298,0],1] oob:tcp:init adding 172.31.1.254 to our
list of V4 connections
[borg01x143:13793] WORKING INTERFACE 5 KERNEL INDEX 5 FAMILY: V4
[borg01x143:13793] [[52298,0],1] oob:tcp:init adding 10.12.25.143 to our
list of V4 connections
[borg01x143:13793] [[52298,0],1] TCP STARTUP
[borg01x143:13793] [[52298,0],1] attempting to bind to IPv4 port 0
[borg01x153:01290] mca:oob:select: Adding component to end
[borg01x153:01290] mca:oob:select: Found 1 active transports
[borg01x143:13793] [[52298,0],1] assigned IPv4 port 44719
[borg01x143:13793] mca:oob:select: Adding component to end
[borg01x143:13793] mca:oob:select: 

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Ralph Castain
Okay, something quite weird is happening here. I can't replicate using the 
1.8.2 release tarball on a slurm machine, so my guess is that something else is 
going on here.

Could you please rebuild the 1.8.2 code with --enable-debug on the configure 
line (assuming you haven't already done so), and then rerun that version as 
before but adding "--mca oob_base_verbose 10" to the cmd line?
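
For reference, against the build script quoted later in this archive that is
just one more flag on the existing configure invocation; a sketch with the
other options left as they were:

./configure --with-slurm --enable-debug --disable-wrapper-rpath --enable-shared \
--enable-mca-no-build=btl-usnic ... --prefix=${PREFIX}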


On Aug 29, 2014, at 4:22 AM, Matt Thompson  wrote:

> Ralph,
> 
> For 1.8.2rc4 I get:
> 
> (1003) $ 
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun 
> --leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> Daemon [[47143,0],5] checking in as pid 10990 on host borg01x154
> [borg01x154:10990] [[47143,0],5] orted: up and running - waiting for commands!
> Daemon [[47143,0],1] checking in as pid 23473 on host borg01x143
> Daemon [[47143,0],2] checking in as pid 8250 on host borg01x144
> [borg01x144:08250] [[47143,0],2] orted: up and running - waiting for commands!
> [borg01x143:23473] [[47143,0],1] orted: up and running - waiting for commands!
> Daemon [[47143,0],3] checking in as pid 12320 on host borg01x145
> Daemon [[47143,0],4] checking in as pid 10902 on host borg01x153
> [borg01x153:10902] [[47143,0],4] orted: up and running - waiting for commands!
> [borg01x145:12320] [[47143,0],3] orted: up and running - waiting for commands!
> [borg01x142:01629] [[47143,0],0] orted_cmd: received add_local_procs
> [borg01x144:08250] [[47143,0],2] orted_cmd: received add_local_procs
> [borg01x153:10902] [[47143,0],4] orted_cmd: received add_local_procs
> [borg01x143:23473] [[47143,0],1] orted_cmd: received add_local_procs
> [borg01x145:12320] [[47143,0],3] orted_cmd: received add_local_procs
> [borg01x154:10990] [[47143,0],5] orted_cmd: received add_local_procs
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],0]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],2]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],3]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],1]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],5]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],4]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],6]
> [borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from local 
> proc [[47143,1],7]
>   MPIR_being_debugged = 0
>   MPIR_debug_state = 1
>   MPIR_partial_attach_ok = 1
>   MPIR_i_am_starter = 0
>   MPIR_forward_output = 0
>   MPIR_proctable_size = 8
>   MPIR_proctable:
> (i, host, exe, pid) = (0, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1647)
> (i, host, exe, pid) = (1, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1648)
> (i, host, exe, pid) = (2, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1650)
> (i, host, exe, pid) = (3, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1652)
> (i, host, exe, pid) = (4, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1654)
> (i, host, exe, pid) = (5, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1656)
> (i, host, exe, pid) = (6, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1658)
> (i, host, exe, pid) = (7, borg01x142, 
> /home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1660)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> [borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
> [borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
> [borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
> [borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
> [borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
> [borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
> [borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
> [borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
> [borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
> [borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
> [borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
> Process 2 of 8 is on borg01x142
> Process 5 of 8 is on borg01x142
> Process 4 of 8 is on borg01x142
> Process 1 of 8 is on borg01x142
> Process 0 of 8 is on borg01x142
> Process 3 of 8 is on borg01x142
> Process 6 of 8 is on borg01x142
> Process 7 of 8 is on borg01x142
> [borg01x154:10990] [[47143,0],5] orted_cmd:

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Matt Thompson
Ralph,

For 1.8.2rc4 I get:

(1003) $
/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun
--leave-session-attached --debug-daemons -np 8 ./helloWorld.182.x
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
Daemon [[47143,0],5] checking in as pid 10990 on host borg01x154
[borg01x154:10990] [[47143,0],5] orted: up and running - waiting for
commands!
Daemon [[47143,0],1] checking in as pid 23473 on host borg01x143
Daemon [[47143,0],2] checking in as pid 8250 on host borg01x144
[borg01x144:08250] [[47143,0],2] orted: up and running - waiting for
commands!
[borg01x143:23473] [[47143,0],1] orted: up and running - waiting for
commands!
Daemon [[47143,0],3] checking in as pid 12320 on host borg01x145
Daemon [[47143,0],4] checking in as pid 10902 on host borg01x153
[borg01x153:10902] [[47143,0],4] orted: up and running - waiting for
commands!
[borg01x145:12320] [[47143,0],3] orted: up and running - waiting for
commands!
[borg01x142:01629] [[47143,0],0] orted_cmd: received add_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received add_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received add_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received add_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received add_local_procs
[borg01x154:10990] [[47143,0],5] orted_cmd: received add_local_procs
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],0]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],2]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],3]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],1]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],5]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],4]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],6]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync+nidmap from
local proc [[47143,1],7]
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_forward_output = 0
  MPIR_proctable_size = 8
  MPIR_proctable:
(i, host, exe, pid) = (0, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1647)
(i, host, exe, pid) = (1, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1648)
(i, host, exe, pid) = (2, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1650)
(i, host, exe, pid) = (3, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1652)
(i, host, exe, pid) = (4, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1654)
(i, host, exe, pid) = (5, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1656)
(i, host, exe, pid) = (6, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1658)
(i, host, exe, pid) = (7, borg01x142,
/home/mathomp4/HelloWorldTest/./helloWorld.182.x, 1660)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
[borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
[borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
Process 2 of 8 is on borg01x142
Process 5 of 8 is on borg01x142
Process 4 of 8 is on borg01x142
Process 1 of 8 is on borg01x142
Process 0 of 8 is on borg01x142
Process 3 of 8 is on borg01x142
Process 6 of 8 is on borg01x142
Process 7 of 8 is on borg01x142
[borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
[borg01x142:01629] [[47143,0],0] orted_cmd: received message_local_procs
[borg01x144:08250] [[47143,0],2] orted_cmd: received message_local_procs
[borg01x143:23473] [[47143,0],1] orted_cmd: received message_local_procs
[borg01x153:10902] [[47143,0],4] orted_cmd: received message_local_procs
[borg01x154:10990] [[47143,0],5] orted_cmd: received message_local_procs
[borg01x145:12320] [[47143,0],3] orted_cmd: received message_local_procs
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc
[[47143,1],2]
[borg01x142:01629] [[47143,0],0] orted_recv: received sync from local proc
[[4

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-28 Thread Ralph Castain
I'm unaware of any changes to the Slurm integration between rc4 and final 
release. It sounds like this might be something else going on - try adding 
"--leave-session-attached --debug-daemons" to your 1.8.2 command line and let's 
see if any errors get reported.


On Aug 28, 2014, at 12:20 PM, Matt Thompson  wrote:

> Open MPI List,
> 
> I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on our 
> cluster (reported on this list), and decided to try it with 1.8.2. However, 
> we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even weirder, 
> Open MPI 1.8.2rc4 doesn't show the bug. And the bug is: I get no stdout with 
> Open MPI 1.8.2. That is, HelloWorld doesn't work.
> 
> To wit, our sysadmin has two tarballs:
> 
> (1441) $ sha1sum openmpi-1.8.2rc4.tar.bz2
> 7e7496913c949451f546f22a1a159df25f8bb683  openmpi-1.8.2rc4.tar.bz2
> (1442) $ sha1sum openmpi-1.8.2.tar.gz
> cf2b1e45575896f63367406c6c50574699d8b2e1  openmpi-1.8.2.tar.gz
> 
> I then build each with a script in the method our sysadmin usually does:
> 
> #!/bin/sh 
> set -x
> export PREFIX=/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2
> export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/nlocal/slurm/2.6.3/lib64
> build() {
>   echo `pwd`
>   ./configure --with-slurm --disable-wrapper-rpath --enable-shared 
> --enable-mca-no-build=btl-usnic \
>   CC=gcc CXX=g++ F77=gfortran FC=gfortran \
>   CFLAGS="-mtune=generic -fPIC -m64" CXXFLAGS="-mtune=generic -fPIC -m64" 
> FFLAGS="-mtune=generic -fPIC -m64" \
>   F77FLAGS="-mtune=generic -fPIC -m64" FCFLAGS="-mtune=generic -fPIC 
> -m64" F90FLAGS="-mtune=generic -fPIC -m64" \
>   LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64" 
> CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include" LIBS="-lpciaccess" \
>  --prefix=${PREFIX} 2>&1 | tee configure.1.8.2.log
>   make 2>&1 | tee make.1.8.2.log
>   make check 2>&1 | tee makecheck.1.8.2.log
>   make install 2>&1 | tee makeinstall.1.8.2.log
> }
> echo "calling build"
> build
> echo "exiting"
> 
> The only difference between the two is '1.8.2' or '1.8.2rc4' in the PREFIX 
> and log file tees.  Now, let us test. First, I grab some nodes with slurm:
> 
> $ salloc --nodes=6 --ntasks-per-node=16 --constraint=sand --time=09:00:00 
> --account=g0620 --mail-type=BEGIN
> 
> Once I get my nodes, I run with 1.8.2rc4:
> 
> (1142) $ 
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpifort -o 
> helloWorld.182rc4.x helloWorld.F90
> (1143) $ 
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 
> ./helloWorld.182rc4.x
> Process 0 of 8 is on borg01w044
> Process 5 of 8 is on borg01w044
> Process 3 of 8 is on borg01w044
> Process 7 of 8 is on borg01w044
> Process 1 of 8 is on borg01w044
> Process 2 of 8 is on borg01w044
> Process 4 of 8 is on borg01w044
> Process 6 of 8 is on borg01w044
> 
> Now 1.8.2:
> 
> (1144) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpifort 
> -o helloWorld.182.x helloWorld.F90
> (1145) $ /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun 
> -np 8 ./helloWorld.182.x
> (1146) $
> 
> No output at all. But, if I take the helloWorld.x from 1.8.2 and run it with 
> 1.8.2rc4's mpirun:
> 
> (1146) $ 
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8 
> ./helloWorld.182.x
> Process 5 of 8 is on borg01w044
> Process 7 of 8 is on borg01w044
> Process 2 of 8 is on borg01w044
> Process 4 of 8 is on borg01w044
> Process 1 of 8 is on borg01w044
> Process 3 of 8 is on borg01w044
> Process 6 of 8 is on borg01w044
> Process 0 of 8 is on borg01w044
> 
> So...any idea what is happening here? There did seem to be a few
> SLURM-related changes between the two tarballs involving /dev/null, but
> it's a bit above me to decipher.
> 
> You can find the ompi_info, build, make, config, etc logs at these links 
> (they are ~300kB which is over the mailing list limit according to the Open 
> MPI web page):
> 
> https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2rc4-Output.tar.bz2
> https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2-Output.tar.bz2
> 
> Thank you for any help and please let me know if you need more information,
> Matt
> 
> -- 
> "And, isn't sanity really just a one-trick pony anyway? I mean all you
>  get is one trick: rational thinking. But when you're good and crazy, 
>  oooh, oooh, oooh, the sky is the limit!" -- The Tick
> 



[OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-28 Thread Matt Thompson
Open MPI List,

I recently encountered an odd bug with Open MPI 1.8.1 and GCC 4.9.1 on our
cluster (reported on this list), and decided to try it with 1.8.2. However,
we seem to be having an issue with Open MPI 1.8.2 and SLURM. Even weirder,
Open MPI 1.8.2rc4 doesn't show the bug. And the bug is: I get no stdout
with Open MPI 1.8.2. That is, HelloWorld doesn't work.
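
helloWorld.F90 itself is never attached to the thread, but judging from the
output it is a standard MPI hello world. A minimal sketch consistent with the
"Process N of M is on host" lines shown throughout (the exact format string
is a guess):

program helloWorld
  use mpi                  ! Open MPI Fortran module; build with mpifort
  implicit none
  integer :: ierr, rank, nprocs, namelen
  character(len=MPI_MAX_PROCESSOR_NAME) :: hostname

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)    ! this task's rank
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)  ! total task count
  call MPI_Get_processor_name(hostname, namelen, ierr)
  write (*, '(A,I2,A,I2,A,A)') 'Process', rank, ' of', nprocs, &
        ' is on ', hostname(1:namelen)
  call MPI_Finalize(ierr)
end program helloWorld

Compiled with mpifort and launched with mpirun as below, each rank reports
its rank, the world size, and the node it landed on.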

To wit, our sysadmin has two tarballs:

(1441) $ sha1sum openmpi-1.8.2rc4.tar.bz2
7e7496913c949451f546f22a1a159df25f8bb683  openmpi-1.8.2rc4.tar.bz2
(1442) $ sha1sum openmpi-1.8.2.tar.gz
cf2b1e45575896f63367406c6c50574699d8b2e1  openmpi-1.8.2.tar.gz

I then build each with a script in the method our sysadmin usually does:

#!/bin/sh
> set -x
> export PREFIX=/discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2
> export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/nlocal/slurm/2.6.3/lib64
> build() {
>   echo `pwd`
>   ./configure --with-slurm --disable-wrapper-rpath --enable-shared
> --enable-mca-no-build=btl-usnic \
>   CC=gcc CXX=g++ F77=gfortran FC=gfortran \
>   CFLAGS="-mtune=generic -fPIC -m64" CXXFLAGS="-mtune=generic -fPIC
> -m64" FFLAGS="-mtune=generic -fPIC -m64" \
>   F77FLAGS="-mtune=generic -fPIC -m64" FCFLAGS="-mtune=generic -fPIC
> -m64" F90FLAGS="-mtune=generic -fPIC -m64" \
>   LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64"
> CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include" LIBS="-lpciaccess" \
>  --prefix=${PREFIX} 2>&1 | tee configure.1.8.2.log
>   make 2>&1 | tee make.1.8.2.log
>   make check 2>&1 | tee makecheck.1.8.2.log
>   make install 2>&1 | tee makeinstall.1.8.2.log
> }
> echo "calling build"
> build
> echo "exiting"


The only difference between the two is '1.8.2' or '1.8.2rc4' in the PREFIX
and log file tees.  Now, let us test. First, I grab some nodes with slurm:

$ salloc --nodes=6 --ntasks-per-node=16 --constraint=sand --time=09:00:00
> --account=g0620 --mail-type=BEGIN


Once I get my nodes, I run with 1.8.2rc4:

(1142) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpifort -o
> helloWorld.182rc4.x helloWorld.F90
> (1143) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8
> ./helloWorld.182rc4.x
> Process 0 of 8 is on borg01w044
> Process 5 of 8 is on borg01w044
> Process 3 of 8 is on borg01w044
> Process 7 of 8 is on borg01w044
> Process 1 of 8 is on borg01w044
> Process 2 of 8 is on borg01w044
> Process 4 of 8 is on borg01w044
> Process 6 of 8 is on borg01w044


Now 1.8.2:

(1144) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpifort -o
> helloWorld.182.x helloWorld.F90
> (1145) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2/bin/mpirun -np 8
> ./helloWorld.182.x
> (1146) $


No output at all. But, if I take the helloWorld.x from 1.8.2 and run it
with 1.8.2rc4's mpirun:

(1146) $
> /discover/nobackup/mathomp4/MPI/gcc_4.9.1-openmpi_1.8.2rc4/bin/mpirun -np 8
> ./helloWorld.182.x
> Process 5 of 8 is on borg01w044
> Process 7 of 8 is on borg01w044
> Process 2 of 8 is on borg01w044
> Process 4 of 8 is on borg01w044
> Process 1 of 8 is on borg01w044
> Process 3 of 8 is on borg01w044
> Process 6 of 8 is on borg01w044
> Process 0 of 8 is on borg01w044


So...any idea what is happening here? There did seem to be a few
SLURM-related changes between the two tarballs involving /dev/null, but it's
a bit above me to decipher.

You can find the ompi_info, build, make, config, etc logs at these links
(they are ~300kB which is over the mailing list limit according to the Open
MPI web page):

https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2rc4-Output.tar.bz2
https://dl.dropboxusercontent.com/u/61696/OMPI-1.8.2-Output.tar.bz2

Thank you for any help and please let me know if you need more information,
Matt

-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick