I forgot to copy this back to the list.

So far, this approach looks very promising for what we need. I'm able to 
create the lock files and modify the job's environment file 
accordingly. The next question is: how do I handle the case of two jobs 
launching simultaneously and writing to the same lockfile?

Thanks for the help everyone.

From: [email protected]
To: [email protected]
Subject: RE: [gridengine users] Requesting a resource OR another resource
Date: Thu, 20 Nov 2014 07:15:32 -0500




How do you handle the issue of two jobs starting simultaneously? Here's what 
I've done so far (our jobs for the moment are OpenGL, so we just need to know 
the screen number, :0.0 or :0.1). I find it easier to write in Perl.

#!/usr/bin/perl

$hostname = `hostname`;
# chomp strips only a trailing newline; chop would remove any last character
chomp $hostname;

$jobnumber = $ENV{'JOB_ID'};
$lockprefix = "/var/tmp/$hostname-gpuinfo";

# Determine if a GPU was requested via the job ID info
$stats = `$ENV{'SGE_BINARY_PATH'}/qstat -j $jobnumber | grep gpu_free`;

# We asked for a GPU
if ($stats) {
   # Total number of GPUS
   # nvidia-smi -L will list the GPUs for a total
   $totalgpus = `/usr/bin/nvidia-smi -L | wc -l`;

   for ($i = 1; $i <= $totalgpus; $i++) {
      # If a lockfile already exists
      $lockfilename = "$lockprefix"."-GPU$i";
      if ( -e "$lockfilename" ) {
        next;
      }
      else {
        system("echo $jobnumber >> $lockfilename");
        $displayname=$i-1;
        system("echo \"GPUNAME=:0.$displayname\" >> $ENV{'SGE_JOB_SPOOL_DIR'}/environment");
        system("chown $ENV{'SGE_O_LOGNAME'} $lockfilename");
        exit 0;
      }
   }

}
# If we didn't ask for a GPU, just exit gracefully.
else {
  exit 0;
}

With this script I do run into a condition where if two jobs start at once, the 
lockfiles aren't there, so they both create the first lockfile.
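
One possible way to close that window, offered as a minimal sketch rather than 
a tested prolog: instead of checking with -e and then creating the file, open 
the lockfile with O_CREAT|O_EXCL through sysopen, so the check and the create 
happen as a single step and only one job can win each lockfile. The function 
name claim_gpu and its arguments are made up for illustration, and this assumes 
/var/tmp is a local filesystem (exclusive create has historically been 
unreliable on some older NFS setups).

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
use Errno qw(EEXIST);

# Hypothetical helper: try to claim one GPU slot by atomically creating
# its lockfile. Returns the claimed slot number (1-based), or an
# undefined value if every slot is already taken.
sub claim_gpu {
    my ($lockprefix, $totalgpus, $jobnumber) = @_;

    for my $i (1 .. $totalgpus) {
        my $lockfilename = "$lockprefix-GPU$i";

        # O_CREAT|O_EXCL makes sysopen fail if the file already exists,
        # so two jobs starting together cannot both create it.
        if (sysopen(my $fh, $lockfilename, O_WRONLY | O_CREAT | O_EXCL, 0644)) {
            print $fh "$jobnumber\n";
            close($fh);
            return $i;
        }

        # "File exists" just means another job holds this slot; try the
        # next one. Anything else is a real error.
        die "cannot create $lockfilename: $!" unless $! == EEXIST;
    }

    return;
}
```

In the script above, the loop body would then shrink to something like 
my $slot = claim_gpu($lockprefix, $totalgpus, $jobnumber); followed by writing 
GPUNAME=:0.($slot-1) to the environment file when $slot is defined.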



> Date: Thu, 20 Nov 2014 08:43:28 +0000
> From: [email protected]
> To: [email protected]
> Subject: Re: [gridengine users] Requesting a resource OR another resource
> 
> On Wed, 19 Nov 2014 16:47:56 +0000
> Kevin Taylor <[email protected]> wrote:
> 
> > So, the lock file you create is just there to identify the assigned GPU? I 
> > haven't done anything with prolog stuff before, but I'll take a look.
> 
> Sort of.  Unix doesn't provide an atomic test-and-set for chgrp, so we 
> use lock files to prevent races when two jobs are starting simultaneously.
> The lock file also contains info that identifies the job uniquely while the 
> per-job groups get reused.  We can use this to detect if anything goes wrong
> with the normal cleanup when a job terminates.
> 
>  
> -- 
> William Hay <[email protected]>
                                                                                
  
