Re: [MTT devel] Analysis of hung jobs.

2009-11-02 Thread Ashley Pittman

For the record, Ethan and I took this off-list and got it working shortly
afterwards; results are now online and the code is in SVN.

Attached is a final patch to clean up the output by removing an extraneous
space which is inserted in the final output.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

Index: lib/MTT/DoCommand.pm
===
--- lib/MTT/DoCommand.pm	(revision 1327)
+++ lib/MTT/DoCommand.pm	(working copy)
@@ -619,7 +619,7 @@
 if (FindProgram(qw(padb))) {

 my $padb_cmd = "padb --config-option rmgr=mpirun --full-report=$pid";
-$ret .= "\n $padb_cmd";
+$ret .= "\n$padb_cmd";
 $ret .= "\n" . `$padb_cmd`;

 } else {

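For anyone who wants to try the same check outside MTT, the logic the patch touches can be sketched in shell. This is only an illustration, not the MTT code: hung_job_report is a made-up name, and the behaviour when padb is missing is an assumption, since that branch is not shown in the diff above.

```shell
#!/bin/sh
# Sketch of the padb call from the patch above, as a shell function.
# hung_job_report is a hypothetical name, not part of MTT or padb.
hung_job_report() {
    pid=$1
    # Only accept a plain numeric pid (the pid of mpirun).
    case $pid in
        ''|*[!0-9]*)
            echo "usage: hung_job_report <mpirun pid>" >&2
            return 2
            ;;
    esac
    if command -v padb >/dev/null 2>&1; then
        padb_cmd="padb --config-option rmgr=mpirun --full-report=$pid"
        # Print the command at the start of its own line (no leading
        # space), then the report itself.
        printf '\n%s\n' "$padb_cmd"
        # Unquoted on purpose so the string splits into command + arguments.
        $padb_cmd
    else
        # Assumption: simply report that padb is unavailable.
        echo "padb not found on PATH" >&2
        return 1
    fi
}
```

Calling hung_job_report with the pid of a hung mpirun prints the padb command line followed by whatever padb reports for that job.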

Re: [MTT devel] Analysis of hung jobs.

2009-10-08 Thread Ashley Pittman
On Thu, 2009-10-08 at 10:46 -0400, Ethan Mallove wrote:
> It looks like it's using a bad option to pdsh?
> 
>   $ padb --debug=all --verbose --config-option rmgr=mpirun --full-report=24303
>   ...
>   padb version 3.n (Revision 283)
>   full job report for job 24303
> 
>   Attaching to job 24303
>   Use of uninitialized value in string ne at padb line 2720.
>   Job has 1 process(es)
>   Job spans 0 host(s)
>   DEBUG (verbose):   0: There are 1 processes over 0 hosts
>   DEBUG (verbose):   0: Remote process data available on frontend
>   DEBUG (show_cmd):   0: pdsh -w  padb --inner --outer="burl-ct-v20z-0:52314"
>   einner: pdsh: illegal option -- -

I see the problem there: I hadn't allowed for dashes in hostnames either.
Try another update and it should match this time.
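
To illustrate the kind of mismatch involved: a host name such as burl-ct-v20z-0 from the debug output contains both digits and dashes, and a pattern that allows only letters will not match it. padb's actual pattern is not shown in this thread, so both patterns below are made up for the example.

```shell
#!/bin/sh
# Illustrative only: neither pattern is taken from padb.

# Too strict: letters only, so digits and dashes cause a mismatch.
matches_letters_only() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z]+$'
}

# More permissive: letters, digits, dots and dashes, starting and
# ending with a letter or digit.
matches_hostname() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?$'
}

# Host name taken from the debug output quoted in this thread.
host="burl-ct-v20z-0"
if matches_hostname "$host" && ! matches_letters_only "$host"; then
    echo "$host: accepted only by the permissive pattern"
fi
```

The output quoted above reports 0 hosts and an empty "-w" argument to pdsh, which would be consistent with the host name failing to match, though that is an inference from the log rather than something stated in the thread.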

Ashley,

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [MTT devel] Analysis of hung jobs.

2009-10-08 Thread Ethan Mallove
On Thu, Oct/08/2009 03:18:07PM, Ashley Pittman wrote:
> On Thu, 2009-10-08 at 09:51 -0400, Ethan Mallove wrote:
> 
> > $ padb --verbose --debug=all --config-option rmgr=mpirun --full-report=6336
> >   ...
> >   full job report for job 6336
> > 
> >   Attaching to job 6336
> >   mpirun resource manager requires pdsh to be installed
> >   Use of uninitialized value in printf at padb line 729.
> >   Use of uninitialized value in printf at padb line 729.
> >   DEBUG (verbose):   0: There are 0 processes over 0 hosts
> >   Fatal problem setting up the resource manager: mpirun
> > 
> > I assume it's referring to the below "pdsh"?
> > 
> >   http://sourceforge.net/projects/pdsh
> 
> Yes, you'll need to be able to ssh freely around from the node where
> padb/pdsh is running to all compute nodes as well.  For Debian I had to
> add "export PDSH_RCMD_TYPE=ssh" to my .bashrc to tell it to use ssh
> rather than rsh.
> 
> Could you update to r283 as well? The "mpirun" resource manager is new
> and I discovered this morning that it didn't like digits in hostnames.
> As an added benefit it won't use pdsh or ssh if all processes are local.

It looks like it's using a bad option to pdsh?

  $ padb --debug=all --verbose --config-option rmgr=mpirun --full-report=24303
  ...
  padb version 3.n (Revision 283)
  full job report for job 24303

  Attaching to job 24303
  Use of uninitialized value in string ne at padb line 2720.
  Job has 1 process(es)
  Job spans 0 host(s)
  DEBUG (verbose):   0: There are 1 processes over 0 hosts
  DEBUG (verbose):   0: Remote process data available on frontend
  DEBUG (show_cmd):   0: pdsh -w  padb --inner --outer="burl-ct-v20z-0:52314"
  einner: pdsh: illegal option -- -
  einner: Usage: pdsh [-options] command ...
  einner: -S                return largest of remote command return values
  einner: -h                output usage menu and quit
  einner: -V                output version information and quit
  einner: -q                list the option settings and quit
  einner: -b                disable ^C status feature (batch mode)
  einner: -d                enable extra debug information from ^C status
  einner: -l user           execute remote commands as user
  einner: -t seconds        set connect timeout (default is 10 sec)
  einner: -u seconds        set command timeout (no default)
  einner: -f n              use fanout of n nodes
  einner: -w host,host,...  set target node list on command line
  einner: -x host,host,...  set node exclusion list on command line
  einner: -R name           set rcmd module to name
  einner: -N                disable hostname: labels on output lines
  einner: -L                list info on all loaded modules and exit
  einner: available rcmd modules: rsh,exec (default: rsh)
  Unexpected EOF from Inner stdout (connecting)
  Unexpected EOF from Inner stderr (connecting)
  Unexpected exit from parallel command (state=connecting)
  result from parallel command is 256 (state=connecting)
  Bad exit code from parallel command (exit_code=1)
  DEBUG (verbose):   5: Completed command

-Ethan

> ___
> mtt-devel mailing list
> mtt-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel


Re: [MTT devel] Analysis of hung jobs.

2009-10-08 Thread Ashley Pittman
On Thu, 2009-10-08 at 09:51 -0400, Ethan Mallove wrote:

> $ padb --verbose --debug=all --config-option rmgr=mpirun --full-report=6336
>   ...
>   full job report for job 6336
> 
>   Attaching to job 6336
>   mpirun resource manager requires pdsh to be installed
>   Use of uninitialized value in printf at padb line 729.
>   Use of uninitialized value in printf at padb line 729.
>   DEBUG (verbose):   0: There are 0 processes over 0 hosts
>   Fatal problem setting up the resource manager: mpirun
> 
> I assume it's referring to the below "pdsh"?
> 
>   http://sourceforge.net/projects/pdsh

Yes, you'll need to be able to ssh freely around from the node where
padb/pdsh is running to all compute nodes as well.  For Debian I had to
add "export PDSH_RCMD_TYPE=ssh" to my .bashrc to tell it to use ssh
rather than rsh.

Could you update to r283 as well? The "mpirun" resource manager is new
and I discovered this morning that it didn't like digits in hostnames.
As an added benefit it won't use pdsh or ssh if all processes are local.
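
A quick pre-flight check along these lines might save a round trip. It is a sketch only: check_pdsh_ssh is a made-up helper, not part of padb or pdsh, and the host names passed to it are whatever compute nodes you use.

```shell
#!/bin/sh
# Pre-flight sketch for rmgr=mpirun with remote processes.
# check_pdsh_ssh is a hypothetical helper, not part of padb or pdsh.
check_pdsh_ssh() {
    if ! command -v pdsh >/dev/null 2>&1; then
        echo "pdsh not found on PATH" >&2
        return 1
    fi
    # Same setting as the .bashrc line above: use ssh rather than rsh.
    PDSH_RCMD_TYPE=ssh
    export PDSH_RCMD_TYPE
    # Optional: confirm password-less ssh to each host given as an argument.
    for host in "$@"; do
        if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "cannot ssh to $host without a password" >&2
            return 1
        fi
    done
}
```

Run it as "check_pdsh_ssh node1 node2" with your own node names before calling padb; with no arguments it only checks for pdsh and sets PDSH_RCMD_TYPE.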

Ashley,

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [MTT devel] Analysis of hung jobs.

2009-10-08 Thread Ethan Mallove
On Wed, Oct/07/2009 09:38:07PM, Ashley Pittman wrote:
> On Wed, 2009-10-07 at 16:21 -0400, Ethan Mallove wrote:
> 
> >   No secret file (/home/em162155/.padb-secret)
> >   Error: Could not load secret file on this node
> 
> You need to do this once to set a secret key for security purposes. Run
> the following two commands and try again.
> 
> echo secret=ochi4aeZ > /home/em162155/.padb-secret
> chmod 0600 /home/em162155/.padb-secret

Getting closer ...

  $ padb --verbose --debug=all --config-option rmgr=mpirun --full-report=6336
  ...
  full job report for job 6336

  Attaching to job 6336
  mpirun resource manager requires pdsh to be installed
  Use of uninitialized value in printf at padb line 729.
  Use of uninitialized value in printf at padb line 729.
  DEBUG (verbose):   0: There are 0 processes over 0 hosts
  Fatal problem setting up the resource manager: mpirun

I assume it's referring to the below "pdsh"?

  http://sourceforge.net/projects/pdsh

-Ethan




Re: [MTT devel] Analysis of hung jobs.

2009-10-07 Thread Ashley Pittman
On Wed, 2009-10-07 at 16:21 -0400, Ethan Mallove wrote:

>   No secret file (/home/em162155/.padb-secret)
>   Error: Could not load secret file on this node

You need to do this once to set a secret key for security purposes. Run
the following two commands and try again.

echo secret=ochi4aeZ > /home/em162155/.padb-secret
chmod 0600 /home/em162155/.padb-secret
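
If you would rather not reuse the example key above, something along these lines would generate a random one. This is a sketch, not part of padb: write_padb_secret is a made-up helper, and it assumes padb accepts any alphanumeric string after "secret=" (the thread only shows an 8-character example).

```shell
#!/bin/sh
# Sketch: create a padb secret file containing a random key.
# write_padb_secret is a hypothetical helper, not part of padb.
write_padb_secret() {
    secret_file=${1:-"$HOME/.padb-secret"}
    # Refuse to overwrite an existing key.
    if [ -e "$secret_file" ]; then
        echo "secret file already exists: $secret_file" >&2
        return 1
    fi
    (
        # Create the file without group/other access from the start.
        umask 077
        # 8 random bytes rendered as 16 hex characters (assumed acceptable).
        key=$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')
        printf 'secret=%s\n' "$key" > "$secret_file"
    )
    # Same permissions as the chmod command above.
    chmod 0600 "$secret_file"
}
```

Calling write_padb_secret with no argument writes to ~/.padb-secret, the path padb complained about in the error message.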

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [MTT devel] Analysis of hung jobs.

2009-10-07 Thread Ethan Mallove
On Wed, Oct/07/2009 09:04:22PM, Ashley Pittman wrote:
> On Wed, 2009-10-07 at 15:41 -0400, Ethan Mallove wrote:
> 
> > I got the following error doing a simple test:
> 
> As it happens, I saw this error earlier on FC8; r279 should fix this
> problem.

Thanks. That eliminates the perl regex error.

> 
> >   $ perl --version
> >   This is perl, v5.8.4 built for sun4-solaris-64int
> 
> I had wondered if you'd be using Solaris; this is not something I've
> tested and not something I'd expect to work.  The stack trace code
> should all be fine but there might be some problems reading data
> from /proc.  In the past padb has worked on Tru64; possibly all that
> needs porting would be getting parent pid and process name from ps
> rather than /proc/status.
> 

Okay. I've moved to Linux for testing:

  $ padb --debug=all --verbose --config-option rmgr=mpirun --full-report=29713
  Loading config from "/etc/padb.conf"
  Loading config from "/home/em162155/.padbrc"
  Loading config from environment
  Loading config from command line
  Setting 'rmgr' to 'mpirun'
  DEBUG (config):   0: Finished setting configuration options
  padb version 3.n (Revision 279)
  full job report for job 29713

  No secret file (/home/em162155/.padb-secret)
  Error: Could not load secret file on this node

-Ethan



Re: [MTT devel] Analysis of hung jobs.

2009-10-07 Thread Ashley Pittman
On Wed, 2009-10-07 at 15:41 -0400, Ethan Mallove wrote:

> I got the following error doing a simple test:

As it happens, I saw this error earlier on FC8; r279 should fix this
problem.

>   $ perl --version
>   This is perl, v5.8.4 built for sun4-solaris-64int

I had wondered if you'd be using Solaris; this is not something I've
tested and not something I'd expect to work.  The stack trace code
should all be fine but there might be some problems reading data
from /proc.  In the past padb has worked on Tru64; possibly all that
needs porting would be getting parent pid and process name from ps
rather than /proc/status.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [MTT devel] Analysis of hung jobs.

2009-10-07 Thread Ethan Mallove
On Tue, Oct/06/2009 04:30:52PM, Ashley Pittman wrote:
> On Tue, 2009-10-06 at 11:25 -0400, Ethan Mallove wrote:
> > On Tue, Oct/06/2009 10:23:48AM, Ashley Pittman wrote:
> > > 
> > > Further to the mail linked below, padb is able to perform diagnostics,
> > > including backtraces on hung jobs and integrates well into automated
> > > testing environments.
> > 
> > Can padb get a backtrace from a non-debuggable MPI (e.g., not compiled
> > with -g)?
> 
> It gets what is available from the application: without -g it will
> give you function names only; with -g it will also give you file names
> and line numbers and optionally variables, their types and values.
> 
> It can show the message queues regardless of the -g option.

I got the following error doing a simple test:

  $ padb --config-option rmgr=mpirun --full-report=12480
  Nested quantifiers in regex; marked by <-- HERE in m/\A# 
Start of str.
 "# Quote
 ((?:[^"\\]++ <-- HERE |\\.)*+) # Anyting which isn't \"
 "# Close quote
 ,?   # An optional comma.
 (.*) # Rest of line
 \z   # end.
 / at padb line 5044.

  $ perl --version
  This is perl, v5.8.4 built for sun4-solaris-64int

-Ethan



Re: [MTT devel] Analysis of hung jobs.

2009-10-06 Thread Ashley Pittman
On Tue, 2009-10-06 at 11:25 -0400, Ethan Mallove wrote:
> On Tue, Oct/06/2009 10:23:48AM, Ashley Pittman wrote:
> > 
> > Further to the mail linked below, padb is able to perform diagnostics,
> > including backtraces on hung jobs and integrates well into automated
> > testing environments.
> 
> Can padb get a backtrace from a non-debuggable MPI (e.g., not compiled
> with -g)?

It gets what is available from the application: without -g it will
give you function names only; with -g it will also give you file names
and line numbers and optionally variables, their types and values.

It can show the message queues regardless of the -g option.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [MTT devel] Analysis of hung jobs.

2009-10-06 Thread Ethan Mallove
On Tue, Oct/06/2009 10:23:48AM, Ashley Pittman wrote:
> 
> Further to the mail linked below, padb is able to perform diagnostics,
> including backtraces on hung jobs and integrates well into automated
> testing environments.

Can padb get a backtrace from a non-debuggable MPI (e.g., not compiled
with -g)?

-Ethan

> 
> The attached patch is a minimal change which should enable the
> functionality.  However, I don't have access to a working MTT
> installation to test this.
> 
> http://www.open-mpi.org/community/lists/mtt-devel/2009/06/0415.php
> 
> This will require a HEAD version of padb, at least r273 to allow it to
> accept the pid of mpirun rather than a jobid assigned by the underlying
> resource manager.
> 
> Yours,
> 
> Ashley,
> 
> -- 
> 
> Ashley Pittman, Bath, UK.
> 
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk

> Index: lib/MTT/DoCommand.pm
> ===
> --- lib/MTT/DoCommand.pm  (revision 1322)
> +++ lib/MTT/DoCommand.pm  (working copy)
> @@ -359,6 +359,7 @@
>  }
>  my $killed_status = undef;
>  my $last_over = 0;
> +my $padb_output;
>  while ($done > 0) {
>  my $nfound = select($rout = $rin, undef, undef, $t);
>  if (vec($rout, fileno(OUTread), 1) == 1) {
> @@ -410,6 +411,8 @@
>  my $timeout_email_recipient = $MTT::Globals::Values->{docommand_timeout_notify_email};
>  my $timeout_notify_timeout  = $MTT::Globals::Values->{docommand_timeout_notify_timeout};
>  
> + $padb_output = `padb --config-option rmgr=mpirun --full-report=$pid`;
> +
>  if (defined($timeout_sentinel_file)) {
>  
>  # Email someone, if an email address has been specified
> @@ -493,6 +496,9 @@
>  # Return an anonymous hash containing the relevant data
>  
>  $ret->{result_stdout} = join('', @out);
> +if ( defined $padb_output ) {
> +  $ret->{result_stdout} .= "\n$padb_output";
> +}
>  $ret->{result_stderr} = join('', @err),
>  if (!$merge_output);
>  return $ret;
