Ashley, I'm still having trouble using padb with openmpi/1.4.2:

[dianawon@nyx0862 ~]$ /home/software/rhel5/padb/3.0/padb -a -Q
[nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in file util/comm/comm.c at line 62
[nyx0862.engin.umich.edu:30717] [[16608,0],0] ORTE_ERROR_LOG: Unreachable in file orte-ps.c at line 799
[nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded.  Can not communicate with peer
No active jobs could be found for user 'dianawon'


The job is running. I get this error when running just orte-ps.
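
To rule out the earlier point that padb has to exist on every node, here is a minimal sketch of a check, assuming passwordless ssh to the compute nodes. The function name check_padb and the host names are hypothetical; nothing here is part of padb or Open MPI.

```shell
#!/bin/sh
# check_padb PATH HOST...
# Hypothetical helper: prints "<host>: ok" or "<host>: missing" for each
# host, depending on whether PATH is an executable file there.
# Returns 1 if any host lacks it, 0 otherwise.
check_padb() {
    padb_path=$1
    shift
    missing=0
    for host in "$@"; do
        # -n stops ssh from reading stdin; BatchMode=yes makes it fail
        # instead of prompting for a password.
        if ssh -n -o BatchMode=yes "$host" test -x "$padb_path"; then
            echo "$host: ok"
        else
            echo "$host: missing"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (host names are placeholders):
# check_padb /home/software/rhel5/padb/3.0/padb nodeA nodeB
```

The path is passed to the remote shell unquoted, so this only works for paths without spaces. A failed ssh connection is also reported as "missing".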

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Sep 2, 2010, at 9:47 AM, Brock Palen wrote:

> Ah ok, I put it there just because the user couldn't read that from my home 
> space, and never even thought of that.  gahhh.
> 
> Thanks,
> 
> BTW I tried joining the padb mailing list.
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Sep 1, 2010, at 6:11 PM, Ashley Pittman wrote:
> 
>> 
>> The padb executable (it's a Perl script) needs to exist on all nodes, as it 
>> calls orterun on itself. Try installing it to a shared directory or copying 
>> padb to /tmp on every node.
>> 
>> To access the message queues, padb needs a compiled helper program, which is 
>> installed in $PREFIX/lib, so I would recommend rebuilding padb with a 
>> prefix in an NFS-shared directory.  I can help you more with this if required.
>> 
>> Ashley,
>> 
>> On 1 Sep 2010, at 23:01, Brock Palen wrote:
>> 
>>> We have DDT, but we do not have licenses to attach to the number of cores 
>>> these jobs run on.
>>> 
>>> I tried padb, but it fails.
>>> 
>>> Example:
>>> 
>>> ssh to the root node of the running MPI job:
>>> /tmp/padb -Q -a
>>> 
>>> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in file util/comm/comm.c at line 62
>>> [nyx0862.engin.umich.edu:25054] [[22211,0],0] ORTE_ERROR_LOG: Unreachable in file orte-ps.c at line 799
>>> [nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>> einner: 
>>> --------------------------------------------------------------------------
>>> einner: orterun was unable to launch the specified application as it could not access
>>> einner: or execute an executable:
>>> Unexpected EOF from Inner stdout (connecting)
>>> Unexpected EOF from Inner stderr (connecting)
>>> Unexpected exit from parallel command (state=connecting)
>>> Bad exit code from parallel command (exit_code=131)
>> 
>> -- 
>> 
>> Ashley Pittman, Bath, UK.
>> 
>> Padb - A parallel job inspection tool for cluster computing
>> http://padb.pittman.org.uk
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
> 
> 
> 
> 

