On Tue, 9 Jun 2015 15:36:00 +0000
Dan Hyatt <[email protected]> wrote:

 
> >> The execute nodes were updated, and some are not playing well in the
> >> sandbox. When the grid sends a job there, it hangs, sends an error but
> > Not clear what hangs the job or the node?
> > What is the error that is being sent and  how is it being sent?
> The nodes are hanging the job, some did not come up correctly. I had 

So if I understand you correctly, the job is what is hanging but the node is
otherwise OK, i.e. you can ssh into it.  If that is the case, I'd log into one
while a job is hung, look for some sign of the problem, write a load sensor to
detect it, and set an appropriate load_thresholds value.

We have one load sensor that backgrounds a ps command and, if it hasn't
finished by the next time the load sensor is queried, flags an error.  This
catches a Linux problem where a process never finishes exiting (and locks out
attempts to examine its status).
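
A minimal sketch of that kind of sensor, not our actual script.  It follows
the usual load sensor protocol (execd writes a newline to ask for a report
and "quit" to stop; each report is framed by begin/end).  The complex name
ps_hung, the file name ps_sensor.sh, the marker file and the choice of
"ps -e" as the probe are all assumptions made for illustration:

```shell
#!/bin/sh
# Sketch of a "stuck ps" load sensor.  Hypothetical names: ps_hung (complex),
# ps_sensor.sh (file name).  "ps -e" as the probe is also an assumption.

HOST=$(uname -n)                 # assumed to match the host name execd uses
PROBE_CMD=${PROBE_CMD:-"ps -e"}  # command whose completion is being watched
STATE_DIR=$(mktemp -d) || exit 1
DONE="$STATE_DIR/done"           # marker file created when the probe finishes
PROBE_STARTED=0

# Launch the probe in the background; it creates $DONE once it has finished.
start_probe() {
    rm -f "$DONE"
    ( $PROBE_CMD >/dev/null 2>&1; : > "$DONE" ) 2>/dev/null &
    PROBE_STARTED=1
}

# Set HUNG=1 if the probe from the previous poll is still running.
# Otherwise set HUNG=0 and start a fresh probe for the next poll.
poll_once() {
    if [ "$PROBE_STARTED" -eq 1 ] && [ ! -e "$DONE" ]; then
        HUNG=1
    else
        HUNG=0
        start_probe
    fi
}

# One report per line read from stdin; stop on "quit" or end of input.
sensor_loop() {
    while read -r line; do
        [ "$line" = "quit" ] && break
        poll_once
        echo "begin"
        echo "$HOST:ps_hung:$HUNG"
        echo "end"
    done
    rm -rf "$STATE_DIR"
}

# Enter the loop only when run under the hypothetical name ps_sensor.sh, so
# the functions above can also be sourced and exercised on their own.
case "$0" in
    *ps_sensor.sh) sensor_loop ;;
esac
```

To use something like this you would still have to define the ps_hung
complex (qconf -mc), point load_sensor at the script in the host
configuration, and add ps_hung=1 to the queue's load_thresholds.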

Still not sure what you mean by 'sends an error'.  That might give a clue as to 
what to look for.  



 
> > Not clear what was marking the node or how  it was marking the node.
 
> When I run qhost -j
> 
> I assume
> HOSTNAME     ARCH      NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> blade5-5-5   lx-amd64    24    2   12   24  0.00  126.0G    1.2G    2.0G     0.0
> blade5-5-6   lx-amd64    24    2   12   24     -  126.0G       -    2.0G       -
> blade5-5-7   lx-amd64    24    2   12   24     -  126.0G       -    2.0G       -
> blade5-5-8   lx-amd64    24    2   12   24  0.00  126.0G    1.2G    2.0G     0.0
>
You're just looking for the "-" rather than an actual load value there.  If it
is only the job that is hung then the execd will be fine, so that is expected
behavior.  The values from those builtin sensors sometimes get cached, so they
can show up even when the node is hung.
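
If you do want to pull those hosts out of qhost by script, something along
these lines might do as a rough sketch.  The function name is made up, and it
assumes the column layout of the qhost output quoted above, where LOAD is the
7th field:

```shell
# Print the hosts whose LOAD column is "-" in plain qhost output on stdin.
# Assumes LOAD is field 7, as in the layout quoted above; the "global"
# pseudo-host is skipped because it always shows "-".
hosts_without_load() {
    awk 'NF >= 7 && $1 != "global" && $7 == "-" { print $1 }'
}

# Intended use (needs a working Grid Engine environment):
#   qhost | hosts_without_load
```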

qselect -qs u | awk -F@ '{print $2}' | sort -u
should give a more reliable indication of which nodes are currently hung.

 


-- 
William Hay <[email protected]>


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
