Ashley,

I'm still having trouble using padb with openmpi/1.4.2:

[dianawon@nyx0862 ~]$ /home/software/rhel5/padb/3.0/padb -a -Q
[nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded. Can not communicate with peer
[nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in file util/comm/comm.c at line 62
[nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in file orte-ps.c at line 799
[nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded. Can not communicate with peer
No active jobs could be found for user 'dianawon'

The job is running, and I get this error running just orte-ps.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

On Sep 2, 2010, at 9:47 AM, Brock Palen wrote:

> Ah ok, I put it there just because the user couldn't read that from my home
> space, and never even thought of that. gahhh.
>
> Thanks,
>
> BTW I tried joining the padb mailing list.
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
>
> On Sep 1, 2010, at 6:11 PM, Ashley Pittman wrote:
>
>> padb as a binary (it's a perl script) needs to exist on all nodes as it
>> calls orterun on itself, try installing it to a shared directory or copying
>> padb to /tmp on every node.
>>
>> To access the message queues padb needs a compiled helper program which is
>> installed in $PREFIX/lib so I would recommend re-building padb giving it a
>> prefix of a NFS shared directory. I can help you more with this if required.
>>
>> Ashley,
>>
>> On 1 Sep 2010, at 23:01, Brock Palen wrote:
>>
>>> We have ddt, but we do not have licenses to attach to the number of cores
>>> these jobs run at.
>>>
>>> I tried padb, but it fails,
>>>
>>> Example:
>>>
>>> ssh to root node for running MPI job:
>>> /tmp/padb -Q -a
>>>
>>> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp:
>>> Communication retries exceeded. Can not communicate with peer
>>> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable
>>> in file util/comm/comm.c at line 62
>>> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable
>>> in file orte-ps.c at line 799
>>> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp:
>>> Communication retries exceeded. Can not communicate with peer
>>> einner:
>>> --------------------------------------------------------------------------
>>> einner: orterun was unable to launch the specified application as it could
>>> not access
>>> einner: or execute an executable:
>>> Unexpected EOF from Inner stdout (connecting)
>>> Unexpected EOF from Inner stderr (connecting)
>>> Unexpected exit from parallel command (state=connecting)
>>> Bad exit code from parallel command (exit_code=131)
>>
>> --
>>
>> Ashley Pittman, Bath, UK.
>>
>> Padb - A parallel job inspection tool for cluster computing
>> http://padb.pittman.org.uk
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users