Hi,

On 09.08.2018 at 01:26, Derrick Lin wrote:
> > What state of the job do you see in this line? Is it just hanging there and
> > doing nothing? Do they not appear in `top`? And does it never vanish
> > automatically, so you have to kill the job by hand?
>
> Sorry for the confusion. The job state is "r" according to SGE, but as you
> mentioned, the qstat output is not related to any process.
>
> The line I copied is what is shown in top/htop. So basically, all his jobs
> became:
>
> `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671
> `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187677
> `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187690
>
> Each of these scripts copies & untars a file to the local XFS file system,
> then a python script is called to operate on these untarred files.
>
> The job log shows that untarring is done, but the python script has never
> started and the job process is stuck as shown above.
>
> We don't see any storage-related contention.
>
> I am more interested in knowing where this process bash
> /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671 comes from?

Unless the submitted job is marked as a binary, the jobscript is copied to
SGE's internal database. At this point it would even be possible to change
the jobscript on disk, while the submitted copy keeps its content.

On the exechost, this stored jobscript is saved at the start of the job in a
directory at <execd_spool_dir>/<hostname>/job_scripts/<job_id> and executed.
As a consequence, this also works without a shared file system if the
<execd_spool_dir> is local on the exechost (like /var/spool/sge).

If this <execd_spool_dir> is shared (as it seems to be in your case), the
jobscript is first transferred by SGE's protocol to the node, where the execd
writes the jobscript into the shared space, which is on the headnode again.

If you peek into the given file, you will hence find the user's original
jobscript. Does the jobscript try to modify itself, while the user (of
course) can't write at this location?

-- Reuti

> Cheers,
>
>
> On Wed, Aug 8, 2018 at 6:53 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>
> > On 08.08.2018 at 08:15, Derrick Lin <klin...@gmail.com> wrote:
> >
> > Hi guys,
> >
> > I have a user who reported his jobs stuck running for much longer than usual.
> >
> > So I went to the exec host to check the processes, and all processes owned
> > by that user look like:
> >
> > `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671
>
> What state of the job do you see in this line? Is it just hanging there and
> doing nothing? Do they not appear in `top`? And does it never vanish
> automatically, so you have to kill the job by hand?
>
> > In qstat, it still shows the job is in running state.
>
> The `qstat` output is not really related to any running process. It's just
> what SGE granted and thinks is running or granted to run. Especially with
> parallel jobs across nodes, there might or might not be any process on one
> of the granted slave nodes.
>
> > The user resubmitted the jobs and they ran and completed without a problem.
>
> Could it be a race condition with the shared file system?
>
> -- Reuti
>
> > I am wondering what may have caused this situation in general?
> >
> > Cheers,
> > Derrick

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
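A minimal sketch of how one might check the above on the exechost, assuming
the standard SGE client tools (qconf) are available there; the hostname
omega-6-20 and job id 1187671 are taken from the thread, and the exact spool
path may differ on your installation:

    # Show where the execd spools job data: host-specific configuration first
    # (only present if it overrides the global one), then the global value.
    qconf -sconf omega-6-20 | grep execd_spool_dir
    qconf -sconf | grep execd_spool_dir

    # Peek into the spooled jobscript of the stuck job and compare it with
    # what the user actually submitted.
    head -n 20 /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671

    # Check whether the user could write at this location at all -- relevant
    # if the jobscript tries to modify itself, as asked above.
    ls -ld /opt/gridengine/default/spool/omega-6-20/job_scripts

If both a global and a host-specific execd_spool_dir are set, the
host-specific value applies on that node.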