Re: [MTT devel] Analysis of hung jobs.
For the record, Ethan and I took this off-list and got it working shortly afterwards; results are now online and the code is in SVN. Attached is a final patch to clean up the output by removing an extraneous space that is inserted in the final output.

Ashley.

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

Index: lib/MTT/DoCommand.pm
===
--- lib/MTT/DoCommand.pm (revision 1327)
+++ lib/MTT/DoCommand.pm (working copy)
@@ -619,7 +619,7 @@
 if (FindProgram(qw(padb))) {
 my $padb_cmd = "padb --config-option rmgr=mpirun --full-report=$pid";
-$ret .= "\n $padb_cmd";
+$ret .= "\n$padb_cmd";
 $ret .= "\n" . `$padb_cmd`;
 } else {
Re: [MTT devel] Analysis of hung jobs.
On Thu, 2009-10-08 at 10:46 -0400, Ethan Mallove wrote:
> It looks like it's using a bad option to pdsh?
>
> $ padb --debug=all --verbose --config-option rmgr=mpirun --full-report=24303
> ...
> padb version 3.n (Revision 283)
> full job report for job 24303
>
> Attaching to job 24303
> Use of uninitialized value in string ne at padb line 2720.
> Job has 1 process(es)
> Job spans 0 host(s)
> DEBUG (verbose): 0: There are 1 processes over 0 hosts
> DEBUG (verbose): 0: Remote process data available on frontend
> DEBUG (show_cmd): 0: pdsh -w padb --inner --outer="burl-ct-v20z-0:52314"
> einner: pdsh: illegal option -- -

I see the problem there: I hadn't allowed for dashes in hostnames either. Try another update and it should match this time.

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
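The hostname problem described above (first digits, then dashes not being matched) can be illustrated with a small shell sketch. This is not padb's code, which is not shown in this thread; the two function names and both patterns are hypothetical and only show how a too-narrow pattern misses a name such as "burl-ct-v20z-0", which would leave the host list empty.

```shell
#!/bin/sh
# Hypothetical illustration only: padb's real hostname parsing is not
# shown in this thread.

# A letters-only pattern, the kind that rejects digits and dashes.
matches_letters_only() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z]+$'
}

# A wider pattern: letters, digits, dashes and dots, starting and
# ending with a letter or digit.
matches_hostname() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?$'
}

# Compare both patterns on a plain name, a name with digits, and the
# host name taken from the debug output in this thread.
for host in localhost node12 burl-ct-v20z-0; do
    if matches_letters_only "$host"; then narrow=yes; else narrow=no; fi
    if matches_hostname "$host"; then wide=yes; else wide=no; fi
    echo "$host letters-only=$narrow hostname=$wide"
done
```

Only the wider pattern accepts the host name from the failing run, which is consistent with the "Job spans 0 host(s)" line in the quoted output.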
Re: [MTT devel] Analysis of hung jobs.
On Thu, Oct/08/2009 03:18:07PM, Ashley Pittman wrote:
> On Thu, 2009-10-08 at 09:51 -0400, Ethan Mallove wrote:
>
> > $ padb --verbose --debug=all --config-option rmgr=mpirun --full-report=6336
> > ...
> > full job report for job 6336
> >
> > Attaching to job 6336
> > mpirun resource manager requires pdsh to be installed
> > Use of uninitialized value in printf at padb line 729.
> > Use of uninitialized value in printf at padb line 729.
> > DEBUG (verbose): 0: There are 0 processes over 0 hosts
> > Fatal problem setting up the resource manager: mpirun
> >
> > I assume it's referring to the below "pdsh"?
> >
> > http://sourceforge.net/projects/pdsh
>
> Yes, you'll need to be able to ssh freely from the node where
> padb/pdsh is running to all compute nodes as well. On Debian I had to
> add "export PDSH_RCMD_TYPE=ssh" to my .bashrc to tell it to use ssh
> rather than rsh.
>
> Could you update to r283 as well? The "mpirun" resource manager is new,
> and I discovered this morning that it didn't like digits in hostnames.
> As an added benefit, it won't use pdsh or ssh if all processes are local.

It looks like it's using a bad option to pdsh?

$ padb --debug=all --verbose --config-option rmgr=mpirun --full-report=24303
...
padb version 3.n (Revision 283)
full job report for job 24303

Attaching to job 24303
Use of uninitialized value in string ne at padb line 2720.
Job has 1 process(es)
Job spans 0 host(s)
DEBUG (verbose): 0: There are 1 processes over 0 hosts
DEBUG (verbose): 0: Remote process data available on frontend
DEBUG (show_cmd): 0: pdsh -w padb --inner --outer="burl-ct-v20z-0:52314"
einner: pdsh: illegal option -- -
einner: Usage: pdsh [-options] command ...
einner: -S                return largest of remote command return values
einner: -h                output usage menu and quit
einner: -V                output version information and quit
einner: -q                list the option settings and quit
einner: -b                disable ^C status feature (batch mode)
einner: -d                enable extra debug information from ^C status
einner: -l user           execute remote commands as user
einner: -t seconds        set connect timeout (default is 10 sec)
einner: -u seconds        set command timeout (no default)
einner: -f n              use fanout of n nodes
einner: -w host,host,...  set target node list on command line
einner: -x host,host,...  set node exclusion list on command line
einner: -R name           set rcmd module to name
einner: -N                disable hostname: labels on output lines
einner: -L                list info on all loaded modules and exit
einner: available rcmd modules: rsh,exec (default: rsh)
Unexpected EOF from Inner stdout (connecting)
Unexpected EOF from Inner stderr (connecting)
Unexpected exit from parallel command (state=connecting)
result from parallel command is 256 (state=connecting)
Bad exit code from parallel command (exit_code=1)
DEBUG (verbose): 5: Completed command

-Ethan

>
> Ashley,
>
> --
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
>
> ___
> mtt-devel mailing list
> mtt-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
Re: [MTT devel] Analysis of hung jobs.
On Thu, 2009-10-08 at 09:51 -0400, Ethan Mallove wrote:
> $ padb --verbose --debug=all --config-option rmgr=mpirun --full-report=6336
> ...
> full job report for job 6336
>
> Attaching to job 6336
> mpirun resource manager requires pdsh to be installed
> Use of uninitialized value in printf at padb line 729.
> Use of uninitialized value in printf at padb line 729.
> DEBUG (verbose): 0: There are 0 processes over 0 hosts
> Fatal problem setting up the resource manager: mpirun
>
> I assume it's referring to the below "pdsh"?
>
> http://sourceforge.net/projects/pdsh

Yes, you'll need to be able to ssh freely from the node where padb/pdsh is running to all compute nodes as well. On Debian I had to add "export PDSH_RCMD_TYPE=ssh" to my .bashrc to tell it to use ssh rather than rsh.

Could you update to r283 as well? The "mpirun" resource manager is new, and I discovered this morning that it didn't like digits in hostnames. As an added benefit, it won't use pdsh or ssh if all processes are local.

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
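The prerequisites mentioned above (pdsh installed, ssh usable, and PDSH_RCMD_TYPE set) could be checked before running padb with a sketch like the following. The helper name have_tool is hypothetical; the only setting taken from this thread is the PDSH_RCMD_TYPE=ssh export. The sketch only checks that the programs are on PATH; it does not verify passwordless ssh access to the compute nodes.

```shell
#!/bin/sh
# Hypothetical pre-flight check; have_tool is an illustrative name.

# Succeeds if the named program can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Tell pdsh to use ssh rather than rsh, as suggested in the message above.
export PDSH_RCMD_TYPE=ssh

# Report which of the tools mentioned in this thread are available.
for tool in padb pdsh ssh; do
    if have_tool "$tool"; then
        echo "found: $tool"
    else
        echo "missing: $tool"
    fi
done
```

Putting the export line in a shell startup file, as described above, makes the setting apply to every later pdsh invocation rather than to one script.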
Re: [MTT devel] Analysis of hung jobs.
On Wed, Oct/07/2009 09:38:07PM, Ashley Pittman wrote:
> On Wed, 2009-10-07 at 16:21 -0400, Ethan Mallove wrote:
>
> > No secret file (/home/em162155/.padb-secret)
> > Error: Could not load secret file on this node
>
> You need to do this once to set a secret key for security purposes. Run
> the following two commands and try again:
>
> echo secret=ochi4aeZ > /home/em162155/.padb-secret
> chmod 0600 /home/em162155/.padb-secret

Getting closer ...

$ padb --verbose --debug=all --config-option rmgr=mpirun --full-report=6336
...
full job report for job 6336

Attaching to job 6336
mpirun resource manager requires pdsh to be installed
Use of uninitialized value in printf at padb line 729.
Use of uninitialized value in printf at padb line 729.
DEBUG (verbose): 0: There are 0 processes over 0 hosts
Fatal problem setting up the resource manager: mpirun

I assume it's referring to the below "pdsh"?

http://sourceforge.net/projects/pdsh

-Ethan

>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
> ___
> mtt-devel mailing list
> mtt-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
Re: [MTT devel] Analysis of hung jobs.
On Wed, 2009-10-07 at 16:21 -0400, Ethan Mallove wrote:
> No secret file (/home/em162155/.padb-secret)
> Error: Could not load secret file on this node

You need to do this once to set a secret key for security purposes. Run the following two commands and try again:

echo secret=ochi4aeZ > /home/em162155/.padb-secret
chmod 0600 /home/em162155/.padb-secret

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
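The two commands above can be wrapped in a small helper so the file is never briefly readable by others. This is a sketch under assumptions: the function name write_padb_secret is hypothetical, and the "secret=VALUE" file format and 0600 mode are taken from the commands in this message. The demo writes to a temporary directory rather than a real home directory.

```shell
#!/bin/sh
# Hypothetical helper: write_padb_secret FILE VALUE
# Writes "secret=VALUE" to FILE with owner-only permissions.
write_padb_secret() {
    secret_file=$1
    secret_value=$2
    # Create the file under a restrictive umask so it is private from
    # the moment it exists; the subshell keeps the umask change local.
    ( umask 077 && printf 'secret=%s\n' "$secret_value" > "$secret_file" )
    # Also fix the mode in case the file already existed with looser bits.
    chmod 0600 "$secret_file"
}

# Demo in a throwaway directory; the real location would be
# ~/.padb-secret, as in the error message quoted above.
demo_dir=$(mktemp -d)
write_padb_secret "$demo_dir/.padb-secret" ochi4aeZ
ls -l "$demo_dir/.padb-secret"
rm -rf "$demo_dir"
```

For real use the second argument would be a value of your own choosing rather than the example secret posted to the list.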
Re: [MTT devel] Analysis of hung jobs.
On Wed, Oct/07/2009 09:04:22PM, Ashley Pittman wrote:
> On Wed, 2009-10-07 at 15:41 -0400, Ethan Mallove wrote:
>
> > I got the following error doing a simple test:
>
> As it happens, I saw this error earlier on FC8; r279 should fix this
> problem.

Thanks. That eliminates the perl regex error.

>
> > $ perl --version
> > This is perl, v5.8.4 built for sun4-solaris-64int
>
> I had wondered if you'd be using Solaris; this is not something I've
> tested and not something I'd expect to work. The stack trace code
> should all be fine, but there might be some problems reading data
> from /proc. In the past padb has worked on Tru64; possibly all that
> needs porting is getting the parent pid and process name from ps
> rather than /proc/status.
>

Okay. I've moved to Linux for testing:

$ padb --debug=all --verbose --config-option rmgr=mpirun --full-report=29713
Loading config from "/etc/padb.conf"
Loading config from "/home/em162155/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'mpirun'
DEBUG (config): 0: Finished setting configuration options
padb version 3.n (Revision 279)
full job report for job 29713
No secret file (/home/em162155/.padb-secret)
Error: Could not load secret file on this node

-Ethan

> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
> ___
> mtt-devel mailing list
> mtt-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
Re: [MTT devel] Analysis of hung jobs.
On Wed, 2009-10-07 at 15:41 -0400, Ethan Mallove wrote:
> I got the following error doing a simple test:

As it happens, I saw this error earlier on FC8; r279 should fix this problem.

> $ perl --version
> This is perl, v5.8.4 built for sun4-solaris-64int

I had wondered if you'd be using Solaris; this is not something I've tested and not something I'd expect to work. The stack trace code should all be fine, but there might be some problems reading data from /proc. In the past padb has worked on Tru64; possibly all that needs porting is getting the parent pid and process name from ps rather than /proc/status.

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Re: [MTT devel] Analysis of hung jobs.
On Tue, Oct/06/2009 04:30:52PM, Ashley Pittman wrote:
> On Tue, 2009-10-06 at 11:25 -0400, Ethan Mallove wrote:
> > On Tue, Oct/06/2009 10:23:48AM, Ashley Pittman wrote:
> > >
> > > Further to the mail linked below, padb is able to perform diagnostics,
> > > including backtraces on hung jobs, and integrates well into automated
> > > testing environments.
> >
> > Can padb get a backtrace from a non-debuggable MPI (e.g., not compiled
> > with -g)?
>
> It gets what is available from the application: without -g it will
> give you function names only; with -g it will also give you file names
> and line numbers and, optionally, variables with their types and values.
>
> It can show the message queues regardless of the -g option.

I got the following error doing a simple test:

$ padb --config-option rmgr=mpirun --full-report=12480
Nested quantifiers in regex; marked by <-- HERE in m/\A# Start of str.
"# Quote
((?:[^"\\]++ <-- HERE |\\.)*+) # Anyting which isn't \"
"# Close quote
,? # An optional comma.
(.*) # Rest of line
\z # end.
/ at padb line 5044.

$ perl --version
This is perl, v5.8.4 built for sun4-solaris-64int

-Ethan

>
> Ashley.
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
> ___
> mtt-devel mailing list
> mtt-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
Re: [MTT devel] Analysis of hung jobs.
On Tue, 2009-10-06 at 11:25 -0400, Ethan Mallove wrote:
> On Tue, Oct/06/2009 10:23:48AM, Ashley Pittman wrote:
> >
> > Further to the mail linked below, padb is able to perform diagnostics,
> > including backtraces on hung jobs, and integrates well into automated
> > testing environments.
>
> Can padb get a backtrace from a non-debuggable MPI (e.g., not compiled
> with -g)?

It gets what is available from the application: without -g it will give you function names only; with -g it will also give you file names and line numbers and, optionally, variables with their types and values.

It can show the message queues regardless of the -g option.

Ashley.

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Re: [MTT devel] Analysis of hung jobs.
On Tue, Oct/06/2009 10:23:48AM, Ashley Pittman wrote:
>
> Further to the mail linked below, padb is able to perform diagnostics,
> including backtraces on hung jobs, and integrates well into automated
> testing environments.

Can padb get a backtrace from a non-debuggable MPI (e.g., not compiled with -g)?

-Ethan

>
> The attached patch is a minimal change which should enable the
> functionality. However, I don't have access to a working MTT
> installation to test this.
>
> http://www.open-mpi.org/community/lists/mtt-devel/2009/06/0415.php
>
> This will require a HEAD version of padb, at least r273, to allow it to
> accept the pid of mpirun rather than a jobid assigned by the underlying
> resource manager.
>
> Yours,
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
> Index: lib/MTT/DoCommand.pm
> ===
> --- lib/MTT/DoCommand.pm (revision 1322)
> +++ lib/MTT/DoCommand.pm (working copy)
> @@ -359,6 +359,7 @@
> }
> my $killed_status = undef;
> my $last_over = 0;
> +my $padb_output;
> while ($done > 0) {
> my $nfound = select($rout = $rin, undef, undef, $t);
> if (vec($rout, fileno(OUTread), 1) == 1) {
> @@ -410,6 +411,8 @@
> my $timeout_email_recipient = $MTT::Globals::Values->{docommand_timeout_notify_email};
> my $timeout_notify_timeout = $MTT::Globals::Values->{docommand_timeout_notify_timeout};
>
> + $padb_output = `padb --config-option rmgr=mpirun --full-report=$pid`;
> +
> if (defined($timeout_sentinel_file)) {
>
> # Email someone, if an email address has been specified
> @@ -493,6 +496,9 @@
> # Return an anonymous hash containing the relevant data
>
> $ret->{result_stdout} = join('', @out);
> +if ( defined $padb_output ) {
> + $ret->{result_stdout} .= "\n$padb_output";
> +}
> $ret->{result_stderr} = join('', @err),
> if (!$merge_output);
> return $ret;
> ___
> mtt-devel mailing list
> mtt-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
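The idea behind the patch above (on timeout, ask padb for a report on the hung job before it is killed, and add that report to the captured output) can be sketched outside MTT as a shell wrapper. This is not the MTT implementation: the function name run_with_report, the one-second polling loop, and the exit status 124 on timeout are illustrative choices. Only the padb command line is taken from this thread, and the pid passed to it is assumed to be that of mpirun when the wrapped command is mpirun.

```shell
#!/bin/sh
# Hypothetical wrapper: run_with_report TIMEOUT_SECONDS COMMAND [ARGS...]
# Runs COMMAND in the background; if it is still running after the
# timeout, prints a padb report (when padb is installed) and kills it.
run_with_report() {
    timeout_s=$1
    shift
    "$@" &
    pid=$!
    elapsed=0
    # Poll once per second while the child is still alive.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            if command -v padb >/dev/null 2>&1; then
                # Same command line as in the patch quoted in this thread.
                padb_cmd="padb --config-option rmgr=mpirun --full-report=$pid"
                printf '\n%s\n' "$padb_cmd"
                $padb_cmd
            else
                echo "padb not found; no hang report for pid $pid"
            fi
            kill "$pid" 2>/dev/null
            wait "$pid" 2>/dev/null
            return 124   # illustrative "timed out" status
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # The child finished on its own; pass its exit status through.
    wait "$pid"
}

# Demo: a 3-second sleep given a 1-second limit.
status=0
run_with_report 1 sleep 3 || status=$?
echo "exit status: $status"
```

Checking for padb before calling it mirrors the FindProgram guard in the later revision of the patch, so a host without padb still produces its normal timeout handling.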