Thank you for the suggestions, John.
I am only a user, though; I can't install anything.
The first problem is not even access to the data, but to a "job list" file. 
When I started working on these data a month or so ago, I got access to a system 
with several compute nodes (and eventually also an HPC). So I thought of a way 
for several compute nodes to work on the same pool of data: create a job list 
file, which each node reads at the beginning; it takes the next available data set, 
flags it as taken, works on it, flags it as done when finished, takes the next 
available one, and so on. This was the only way I could figure out to make several 
nodes work together.
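
The scheme above can be sketched with flock(1) so that two tasks can never edit the list at the same moment or claim the same data set. This is only a sketch under assumptions: the file name, the lock-file name, and the simple "name status" line format are mine, not the original script.

```shell
#!/bin/sh
# Job-list sketch: each entry is "name status", status one of free/taken/done.
# All edits happen under an exclusive flock on fd 9, so concurrent tasks
# cannot grab the same data set or corrupt the file.
JOBLIST=joblist.txt

claim_next() {
    (
        flock -x 9                                 # exclusive lock on fd 9
        next=$(awk '$2 == "free" {print $1; exit}' "$JOBLIST")
        [ -n "$next" ] || exit 1                   # nothing left to claim
        sed -i "s/^$next free$/$next taken/" "$JOBLIST"   # flag as taken
        echo "$next"
    ) 9>>"$JOBLIST.lock"
}

mark_done() {
    (
        flock -x 9
        sed -i "s/^$1 taken$/$1 done/" "$JOBLIST"  # flag as done
    ) 9>>"$JOBLIST.lock"
}
```

Each node would loop: `set=$(claim_next) || exit 0; work_on "$set"; mark_done "$set"`.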
Then I got access to an HPC, so I just use the same script and run it 20 times 
in parallel. Previously, I could start the jobs manually on each node, so I 
didn't have the problem of simultaneous access. Now, SLURM starts all tasks 
simultaneously, they all try to access that job list file, and problems occur. 
Sometimes a task can't edit it, so it crashes. Sometimes different tasks take 
the same data set, and then they crash too.
Maybe there is a better way? I don't know how else to make it work on, let's say, 
20 data sets at a time out of 100 total. Job arrays would let me submit a lot 
of tasks while running only a certain number simultaneously, but that is 
not available on our SLURM.
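
For reference, on SLURM installations recent enough to support job arrays, the throttle form of --array does exactly this limit-the-concurrency behavior (the script name here is hypothetical):

```shell
# Submit array tasks 1-100, running at most 20 at any one time;
# "%20" is the throttle. Not available on older SLURM versions,
# as noted above.
sbatch --array=1-100%20 process_one_dataset.sh
```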
I think I found a solution yesterday: I can use the PID of the shell. I tried 
using the PID of the process, but doing 20+ ps calls is a mess. The shell PID, 
though, is available in the shell variable $$, so I don't need to run any 
command; I just use its last two digits as the delay!
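
The staggered start described above can be sketched like this; the 0-99 second window comes from taking the PID modulo 100, i.e. its last two digits:

```shell
#!/bin/sh
# Stagger start-up using the shell's own PID: $$ is a shell variable,
# so no external command (ps etc.) is needed. PID modulo 100 gives the
# last two digits, spreading the tasks over a 0-99 second window.
delay=$(( $$ % 100 ))
echo "task $$ waiting ${delay}s before opening the job list"
sleep "$delay"
# ...now open the shared job list file as before...
```

Note this only makes collisions unlikely, not impossible: two shells whose PIDs share the same last two digits would still wake at the same time.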

Best,
Renat.

________________________________________
From: slurm-users [[email protected]] On Behalf Of John 
Hearns [[email protected]]
Sent: Thursday, November 09, 2017 4:39 PM
To: Slurm User Community List
Subject: Re: [slurm-users] introduce short delay starting multiple parallel 
jobs with srun

Renat,
   I know that this is not going to be helpful.  I can understand that perhaps, 
if you are using NFS storage, 20(*) processes might not be able to open 
files at the same time.
I would consider the following:

a) looking at your storage. This is why HPC systems have high-performance, 
parallel storage systems.
    You could consider installing a high-performance storage system.

b) if there is no option to get better storage, then I ask: how is this data 
being accessed?
    If you have multiple compute nodes, and the data is being read only, then 
consider copying the data across to TMPDIR on each compute node as a pre-job 
step or at the start of the job.
If the speed of access to the data is critical, then you might even consider 
creating a ramdisk for TMPDIR; then you might see some nice performance 
gains.
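
A minimal sketch of that staging step; the source path and the use of $SLURM_JOB_ID in the scratch directory name are assumptions to adapt to your site:

```shell
#!/bin/sh
# stage_data SRC DEST: copy read-only input from shared (NFS) storage to
# node-local scratch, so the job reads locally instead of over NFS.
stage_data() {
    mkdir -p "$2"
    cp -r "$1"/. "$2"/
}

# Typical use at the top of the job script (paths are hypothetical):
#   stage_data /nfs/project/data "${TMPDIR:-/tmp}/data.$SLURM_JOB_ID"
# ...then point the analysis at the local copy instead of the NFS path.
```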


20 - err that does sound a bit low...

On 9 November 2017 at 15:55, Gennaro Oliva 
<[email protected]> wrote:
Hi Renat,

On Thu, Nov 09, 2017 at 03:46:23PM +0100, Yakupov, Renat /DZNE wrote:
> I tried that. It doesn't even queue the job with an error:
> sbatch: unrecognized option '--array=1-24'
> sbatch: error: Try help for more information.

what version of slurm are you using?
Regards
--
Gennaro Oliva


