On Thu, Jun 02, 2016 at 10:46:46PM +0000, Coleman, Marcus [JRDUS Non-J&J] wrote:
>    Hi all
> 
>    I am having a crazy time fixing an issue I am having with 3 queue
>    instances stuck in state E.
> 
>    [root@c1 active_jobs]# pwd
>    /opt/sge/default/spool/c1/active_jobs
>    [root@c1 active_jobs]#
> 
>    [root@c1 c1]# ls -l
>    total 5980
>    drwxrwxrwx 32000 sgeadmin sgeadmin  999424 May 30 04:54 active_jobs
I suspect that 32000 there may be the problem here.  Apparently Linux
artificially caps the maximum number of hard links to a file at 32000 for
ext2/ext3 and possibly other file systems.  Since a directory's link count
is 2 plus the number of its subdirectories (one link for its entry in the
parent, one for its own '.', and one for each child's '..'), this in turn
limits the number of subdirectories in a directory to two lower than the
cap, i.e. 31998:

https://www.redhat.com/archives/rhl-list/2005-July/msg03301.html
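
If you want to double-check that figure, the second field of ls -ld on the
directory is its hard link count (2 plus the number of subdirectories), and
GNU stat can print it directly.  A quick sketch, using the path from your
transcript above:

    # hard link count is the second field of the listing
    ls -ld /opt/sge/default/spool/c1/active_jobs

    # or print just the link count (2 + number of subdirectories)
    stat -c '%h' /opt/sge/default/spool/c1/active_jobs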

The obvious question is 'Why do you have 31998 subdirectories in
active_jobs?'.  There shouldn't be more than one per job task, and hitting
31998 tasks on a node is not what one normally expects.  I would look in
active_jobs with ls -al to see what directories are actually present.  One
thing that might cause this is the spool directory being exported over NFS,
in which case the NFS layer may translate attempts to delete files or
directories that it thinks are still open remotely into a rename to a
hidden file.
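
A rough sketch of what I would run there (the .nfs* pattern is what the
Linux NFS client uses for its 'silly rename' leftovers; adjust as needed):

    cd /opt/sge/default/spool/c1/active_jobs

    # how many subdirectories are really in there?
    find . -mindepth 1 -maxdepth 1 -type d | wc -l

    # anything unexpected or hidden, e.g. NFS silly-rename files?
    ls -al | head -40
    find . -maxdepth 1 -name '.nfs*' | head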

Once you get the number of hard links/subdirectories down to a more sane
number, grid engine should be able to create directories for new jobs/tasks
normally, and clearing the error state on the queue should stick.
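
For completeness, something along these lines should show and then clear
the error state (the queue instance name all.q@c1 is only an example,
substitute whatever qstat -f reports for your host):

    # show why the queue instance went into the E state
    qstat -f -explain E

    # after cleaning up active_jobs, clear the error state
    qmod -cq all.q@c1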


William
