Hi,

On 31.03.2014 at 18:22, Eric Kaufmann wrote:

> We are using ge 6.2u5 with CentOS 6.4.
> 
> I have jobs that are randomly being killed. Here is the log entry. The jobs 
> that are getting killed are getting an exit status of 127 or 137. I did check 
> /var/log/messages on the nodes and didn't see anything out of the ordinary.
> 
> 03/31/2014 09:55:30|worker|kepler|W|job 33393.1 failed on host 
> research029.cm.cluster assumedly after job because: job 33393.1 died through 
> signal KILL (9)
> 
> 03/31/2014 09:55:34|worker|kepler|W|job 33394.1 failed on host 
> research026.cm.cluster assumedly after job because: job 33394.1 died through 
> signal KILL (9)

Did you request any limits during job submission? The lines above are from the 
messages file of the qmaster. Is there anything in the SGE messages file on 
the nodes (as opposed to the system one, which you already checked)?
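
For what it's worth, the exit status of 137 matches the KILL in the qmaster 
log: by the usual shell convention a status above 128 means 128 plus the signal 
number, and 127 normally means the shell could not find the command. A minimal 
sketch in Python to decode such values (describe_exit_status is a made-up 
helper name, not part of SGE):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status using the common shell convention."""
    if status > 128:
        # 128 + N conventionally means "terminated by signal N".
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    if status == 127:
        return "command not found (shell convention)"
    if status == 126:
        return "command found but not executable (shell convention)"
    return f"exited normally with code {status}"


if __name__ == "__main__":
    # The two values reported in the original post.
    for status in (127, 137):
        print(status, "->", describe_exit_status(status))
```

This only explains what the numbers mean. It does not tell you who sent the 
signal, which is why the SGE messages file on the node is worth a look.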

-- Reuti


> qacct -j 33394
> 
> qname        std                 
> hostname     research026.cm.cluster
> group        justinchem          
> owner        justinchem          
> project      NONE                
> department   defaultdepartment   
> jobname      runCHO-C6H5-Cs_opt.24081
> jobnumber    33394               
> taskid       undefined
> account      sge                 
> priority     0                   
> qsub_time    Mon Mar 31 09:54:53 2014
> start_time   Mon Mar 31 09:55:10 2014
> end_time     Mon Mar 31 09:55:33 2014
> granted_pe   gauss               
> slots        4                   
> failed       100 : assumedly after job
> exit_status  137                 
> ru_wallclock 23           
> ru_utime     0.003        
> ru_stime     0.008        
> ru_maxrss    1380                
> ru_ixrss     0                   
> ru_ismrss    0                   
> ru_idrss     0                   
> ru_isrss     0                   
> ru_minflt    1957                
> ru_majflt    5                   
> ru_nswap     0                   
> ru_inblock   584                 
> ru_oublock   40                  
> ru_msgsnd    0                   
> ru_msgrcv    0                   
> ru_nsignals  0                   
> ru_nvcsw     58                  
> ru_nivcsw    6                   
> cpu          82.570       
> mem          452.669           
> io           0.084             
> iow          0.000             
> maxvmem      5.710G
> arid         undefined
> 
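
In case it helps when comparing against any limit that was requested (for 
example h_vmem or h_rt): the fields of interest in the accounting record are 
failed, exit_status, ru_wallclock and maxvmem. A rough sketch in Python for 
pulling them out of saved qacct -j output, assuming the plain "key value" 
layout shown above (parse_qacct and SAMPLE are names invented here for 
illustration):

```python
def parse_qacct(text: str) -> dict:
    """Parse `qacct -j` style 'key  value' lines into a dict of strings."""
    record = {}
    for line in text.splitlines():
        # Tolerate mail quoting ("> ") in front of each line.
        line = line.lstrip("> ").rstrip()
        # Skip blank lines and any "====" separator lines.
        if not line or line.startswith("="):
            continue
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        record[key] = value
    return record


# A few lines copied from the record in the original post.
SAMPLE = """\
qname        std
hostname     research026.cm.cluster
slots        4
failed       100 : assumedly after job
exit_status  137
ru_wallclock 23
maxvmem      5.710G
"""

if __name__ == "__main__":
    rec = parse_qacct(SAMPLE)
    for field in ("hostname", "failed", "exit_status", "ru_wallclock", "maxvmem"):
        print(f"{field:13s}{rec.get(field, '')}")
```

Values are kept as strings on purpose, since fields such as maxvmem carry a 
unit suffix.
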
> Thanks,
> 
> Eric
> 
> -- 
> Eric Kaufmann |  Application Support Analyst -  Advanced Technology Group | 
> Saint Louis University | 314-977-2257 | [email protected] 
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

