Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-07-02 Thread Matteo Guglielmi
I've reported everything back to the actual sysadmin
of the cluster... and the truth behind this story is as
unbelievable as the story itself.

A savvy cluster user asked a "what is linux?" kind of user
to submit 'his' watchdog script to improve the cluster
load.

Basically: "you get the f. out of here twice a day so that
my jobs can start running."

Hahaha!!!





From: slurm-users  on behalf of John 
Hearns 
Sent: Monday, July 2, 2018 12:37:13 PM
To: Slurm User Community List
Subject: Re: [slurm-users] All user's jobs killed at the same time on all nodes

A great detective story!

> June15 but there is no trace of it anywhere on the disk.

Do you have the process ID (pid) of watchdog.sh?
You could look in /proc/<pid>/cmdline and see what that shows.





On 2 July 2018 at 11:37, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:
Unbelievable... and got it by chance.

jobs were killed (again) at 21:04 and in the user's list of running
processes there was a 'sleep 5' command (13 hours + 53
minutes + 20 seconds) which was fired up exactly at the same
time.

The watchdog.sh script (from which the sleep command is fired)
was started on June15 but there is no trace of it anywhere on the
disk.

What's in that script I don't know, but it kills all the user's jobs
almost twice a day... and I've waited for it to do it again this
morning at 10:57... and sure enough all jobs disappeared and
a new sleep 5 command was fired.

Thank you all anyway!

-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764719.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764720.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764721.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764722.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764723.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764724.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764725.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764726.out


[moha@master ~]$ ps aux | grep moha
moha       1695  0.0  0.0 113128  1416 ?        S    Jun15   0:00 sh watchdog.sh
moha      76720  0.0  0.0 150844  2696 ?        S    Jun28   0:00 sshd: moha@pts/10
moha      76724  0.0  0.0 116692  3532 pts/10   Ss+  Jun28   0:00 -bash
moha     149663  0.0  0.0 150400  2240 ?        S    Jun28   0:00 sshd: moha@pts/0
moha     149664  0.0  0.0 116692  3536 pts/0    Ss+  Jun28   0:00 -bash
moha     156670  0.0  0.0 150400  2236 ?        S    Jun28   0:00 sshd: moha@pts/5
moha     156671  0.0  0.0 116692  3604 pts/5    Ss+  Jun28   0:00 -bash
moha     164364  0.0  0.0 107904   608 ?        S    21:04   0:00 sleep 5    <<<<<<<<<<===
moha     190871  0.0  0.0 116684  3472 pts/4    S    21:46   0:00 -bash
moha     194080  0.0  0.0 151060  1820 pts/4    R+   21:52   0:00 ps aux
moha     194081  0.0  0.0 112664   972 pts/4    S+   21:52   0:00 grep --color=auto moha



From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Thomas M. Payerle <paye...@umd.edu>
Sent: Friday, June 29, 2018 7:34:09 PM
To: Slurm User Community List
Subject: Re: [slurm-users] All user's jobs killed at the same time on all nodes

A couple comments/possible suggestions.

First, it looks to me that all the jobs are run from the same directory with 
the same input/output files.  Or am I missing something?

Also, what MPI library is being used?

I would suggest verifying if any of the jobs in question are terminating 
normally.  I.e., is the mysterious issue which is causing all the user's jobs 
to terminate triggered by the completion of one of the jobs.

I recall having an issue years ago with MPICH MPI libraries when having 
multiple MPI jobs from the same user running on the same node.  IIRC, when one 
job terminated (usually successfully), it would call mpdallexit, which would 
happily kill all the mpds for that user on that node, making the other MPI jobs 
that user had on that node quite unhappy.  The solution was to set the 
environmental variable MPD_CON_EXT to unique values for each of the jobs.  See 
e.g. https://lists.mcs.anl.gov/pipermail/mpich-discuss/2008-May/003605.html

My users primarily use OpenMPI, so I do not have much recent experience with 
this issue.  IIRC, this issue only impacted other MPI jobs run by the same 
user on the same node, so it is a bit different from the symptoms as you describe 
them (impacting all MPI jobs run by the same user on ANY node), but since there is 
some similarity in the symptoms I thought I would mention it anyway.


On Fri, Jun 29, 2018 at 7:24 AM, John Hearns <hear...@googlemail.com> wrote:
I have got this all wrong. Paddy Doyle has got it right.

However are you SURE that mpirun is not creating 

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-07-02 Thread John Hearns
A great detective story!

> June15 but there is no trace of it anywhere on the disk.

Do you have the process ID (pid) of watchdog.sh?
You could look in /proc/<pid>/cmdline and see what that shows.
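
A rough sketch of what I mean (the PID 1695 is the 'sh watchdog.sh' process from
the ps output quoted below; reading another user's /proc entries will likely need root):

pid=1695
tr '\0' ' ' < /proc/$pid/cmdline; echo    # full command line as the process was started
ls -l /proc/$pid/cwd /proc/$pid/exe       # working directory and the interpreter binary
ls -l /proc/$pid/fd                       # open files; a deleted script may still appear here marked "(deleted)"
tr '\0' '\n' < /proc/$pid/environ         # environment at launch time

If the script is still held open on one of those fds, its contents may be
recoverable with something like 'cat /proc/$pid/fd/255' even though the file
was deleted from disk.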





On 2 July 2018 at 11:37, Matteo Guglielmi  wrote:

> Unbelievable... and got it by chance.
>
> jobs were killed (again) at 21:04 and in the user's list of running
> processes there was a 'sleep 5' command (13 hours + 53
> minutes + 20 seconds) which was fired up exactly at the same
> time.
>
> The watchdog.sh script (from which the sleep command is fired)
> was started on June15 but there is no trace of it anywhere on the
> disk.
>
> What's in that script I don't know, but it kills all the user's jobs
> almost twice a day... and I've waited for it to do it again this
> morning at 10:57... and sure enough all jobs disappeared and
> a new sleep 5 command was fired.
>
> Thank you all anyway!
>
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764719.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764720.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764721.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764722.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764723.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764724.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764725.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764726.out
>
>
> [moha@master ~]$ ps aux | grep moha
> moha       1695  0.0  0.0 113128  1416 ?        S    Jun15   0:00 sh watchdog.sh
> moha      76720  0.0  0.0 150844  2696 ?        S    Jun28   0:00 sshd: moha@pts/10
> moha      76724  0.0  0.0 116692  3532 pts/10   Ss+  Jun28   0:00 -bash
> moha     149663  0.0  0.0 150400  2240 ?        S    Jun28   0:00 sshd: moha@pts/0
> moha     149664  0.0  0.0 116692  3536 pts/0    Ss+  Jun28   0:00 -bash
> moha     156670  0.0  0.0 150400  2236 ?        S    Jun28   0:00 sshd: moha@pts/5
> moha     156671  0.0  0.0 116692  3604 pts/5    Ss+  Jun28   0:00 -bash
> moha     164364  0.0  0.0 107904   608 ?        S    21:04   0:00 sleep 5    <<<<<<<<<<===
> moha     190871  0.0  0.0 116684  3472 pts/4    S    21:46   0:00 -bash
> moha     194080  0.0  0.0 151060  1820 pts/4    R+   21:52   0:00 ps aux
> moha     194081  0.0  0.0 112664   972 pts/4    S+   21:52   0:00 grep --color=auto moha
>
>
> ________________________
> From: slurm-users  on behalf of
> Thomas M. Payerle 
> Sent: Friday, June 29, 2018 7:34:09 PM
> To: Slurm User Community List
> Subject: Re: [slurm-users] All user's jobs killed at the same time on all
> nodes
>
> A couple comments/possible suggestions.
>
> First, it looks to me that all the jobs are run from the same directory
> with the same input/output files.  Or am I missing something?
>
> Also, what MPI library is being used?
>
> I would suggest verifying if any of the jobs in question are terminating
> normally.  I.e., is the mysterious issue which is causing all the user's
> jobs to terminate triggered by the completion of one of the jobs.
>
> I recall having an issue years ago with MPICH MPI libraries when having
> multiple MPI jobs from the same user running on the same node.  IIRC, when
> one job terminated (usually successfully), it would call mpdallexit, which
> would happily kill all the mpds for that user on that node, making the
> other MPI jobs that user had on that node quite unhappy.  The solution was
> to set the environmental variable MPD_CON_EXT to unique values for each of
> the jobs.  See e.g. https://lists.mcs.anl.gov/
> pipermail/mpich-discuss/2008-May/003605.html
>
> My users primarily use OpenMPI, so I do not have much recent experience
> with this issue.  IIRC, this issue only impacted other MPI jobs run by
> the same user on the same node, so it is a bit different from the symptoms
> as you describe them (impacting all MPI jobs run by the same user on ANY
> node), but since there is some similarity in the symptoms I thought I would
> mention it anyway.
>
>
> On Fri, Jun 29, 2018 at 7:24 AM, John Hearns <hear...@googlemail.com> wrote:
> I have got this all wrong. Paddy Doyle has got it right.
>
> However are you SURE that mpirun is not creating tasks on the other
> machines?
> I would look at the compute nodes while the job is running and do
> ps -eaf --forest
>
> Also using mpirun to run a single core gives me the heebie-jeebies...
>
> https://en.wikipedia.org/wiki/Heebie-jeebies_(idiom)
>
>
>
>
> On 29 June 2018 at 13:16, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:
> You are right but I'm actuall

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-07-02 Thread Matteo Guglielmi
Unbelievable... and got it by chance.

jobs were killed (again) at 21:04 and in the user's list of running
processes there was a 'sleep 5' command (13 hours + 53
minutes + 20 seconds) which was fired up exactly at the same
time.

The watchdog.sh script (from which the sleep command is fired)
was started on June15 but there is no trace of it anywhere on the
disk.

What's in that script I don't know, but it kills all the user's jobs
almost twice a day... and I've waited for it to do it again this
morning at 10:57... and sure enough all jobs disappeared and
a new sleep 5 command was fired.
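
(For what it's worth: 13 hours + 53 minutes + 20 seconds is exactly 50,000
seconds, and 21:04 plus 13:53:20 lands on 10:57 the next morning, which matches
both kill times above, so the watchdog presumably just loops over a kill
followed by a 50,000-second sleep.)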

Thank you all anyway!

-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764719.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764720.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764721.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764722.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764723.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764724.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764725.out
-rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764726.out


[moha@master ~]$ ps aux | grep moha
moha       1695  0.0  0.0 113128  1416 ?        S    Jun15   0:00 sh watchdog.sh
moha      76720  0.0  0.0 150844  2696 ?        S    Jun28   0:00 sshd: moha@pts/10
moha      76724  0.0  0.0 116692  3532 pts/10   Ss+  Jun28   0:00 -bash
moha     149663  0.0  0.0 150400  2240 ?        S    Jun28   0:00 sshd: moha@pts/0
moha     149664  0.0  0.0 116692  3536 pts/0    Ss+  Jun28   0:00 -bash
moha     156670  0.0  0.0 150400  2236 ?        S    Jun28   0:00 sshd: moha@pts/5
moha     156671  0.0  0.0 116692  3604 pts/5    Ss+  Jun28   0:00 -bash
moha     164364  0.0  0.0 107904   608 ?        S    21:04   0:00 sleep 5    <<<<<<<<<<===
moha     190871  0.0  0.0 116684  3472 pts/4    S    21:46   0:00 -bash
moha     194080  0.0  0.0 151060  1820 pts/4    R+   21:52   0:00 ps aux
moha     194081  0.0  0.0 112664   972 pts/4    S+   21:52   0:00 grep --color=auto moha



From: slurm-users  on behalf of Thomas 
M. Payerle 
Sent: Friday, June 29, 2018 7:34:09 PM
To: Slurm User Community List
Subject: Re: [slurm-users] All user's jobs killed at the same time on all nodes

A couple comments/possible suggestions.

First, it looks to me that all the jobs are run from the same directory with 
the same input/output files.  Or am I missing something?

Also, what MPI library is being used?

I would suggest verifying if any of the jobs in question are terminating 
normally.  I.e., is the mysterious issue which is causing all the user's jobs 
to terminate triggered by the completion of one of the jobs.

I recall having an issue years ago with MPICH MPI libraries when having 
multiple MPI jobs from the same user running on the same node.  IIRC, when one 
job terminated (usually successfully), it would call mpdallexit, which would 
happily kill all the mpds for that user on that node, making the other MPI jobs 
that user had on that node quite unhappy.  The solution was to set the 
environmental variable MPD_CON_EXT to unique values for each of the jobs.  See 
e.g. https://lists.mcs.anl.gov/pipermail/mpich-discuss/2008-May/003605.html

My users primarily use OpenMPI, so I do not have much recent experience with 
this issue.  IIRC, this issue only impacted other MPI jobs run by the same 
user on the same node, so it is a bit different from the symptoms as you describe 
them (impacting all MPI jobs run by the same user on ANY node), but since there is 
some similarity in the symptoms I thought I would mention it anyway.


On Fri, Jun 29, 2018 at 7:24 AM, John Hearns <hear...@googlemail.com> wrote:
I have got this all wrong. Paddy Doyle has got it right.

However are you SURE that mpirun is not creating tasks on the other machines?
I would look at the compute nodes while the job is running and do
ps -eaf --forest

Also using mpirun to run a single core gives me the heebie-jeebies...

https://en.wikipedia.org/wiki/Heebie-jeebies_(idiom)




On 29 June 2018 at 13:16, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:
You are right, but I'm actually supporting the system administrator of that 
cluster; I'll mention this to him.

Besides that,

the user runs this for loop to submit the jobs:


# submit.sh #

typeset -i i=1
typeset -i j=12500  #number of frames goes to each core = number of frames 
(100)/40 (cores) =
typeset -i k=1

while [ $i -le 36 ]  #the number of frames
do

sbatch run-5o$i.sh $i $j $k

i=$i+1 # number of frames goes to each node (5*200 = 1000)
done

where each run-5oXX.sh jobfile looks like this:


#!/bin/bash

#SBATCH --job-name=charmm-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

export PATH=/usr/lib64/openmpi/bin/:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH

mpirun -np 1 /opt/c

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread Thomas M. Payerle
A couple comments/possible suggestions.

First, it looks to me that all the jobs are run from the same directory
with the same input/output files.  Or am I missing something?

Also, what MPI library is being used?

I would suggest verifying if any of the jobs in question are terminating
normally.  I.e., is the mysterious issue which is causing all the user's
jobs to terminate triggered by the completion of one of the jobs.

I recall having an issue years ago with MPICH MPI libraries when having
multiple MPI jobs from the same user running on the same node.  IIRC, when
one job terminated (usually successfully), it would call mpdallexit, which
would happily kill all the mpds for that user on that node, making the
other MPI jobs that user had on that node quite unhappy.  The solution was
to set the environmental variable MPD_CON_EXT to unique values for each of
the jobs.  See e.g.
https://lists.mcs.anl.gov/pipermail/mpich-discuss/2008-May/003605.html

My users primarily use OpenMPI, so I do not have much recent experience
with this issue.  IIRC, this issue only impacted other MPI jobs run by
the same user on the same node, so it is a bit different from the symptoms
as you describe them (impacting all MPI jobs run by the same user on ANY
node), but since there is some similarity in the symptoms I thought I would
mention it anyway.
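
If it does turn out to be the mpd-based MPICH launcher, a minimal sketch of
that workaround in each Slurm batch script might look like this (SLURM_JOB_ID
is just a convenient per-job unique value; the program and file names are
hypothetical):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Give each job its own mpd console socket so one job's mpdallexit
# cannot tear down the mpds used by the user's other jobs on the node.
export MPD_CON_EXT="slurm_${SLURM_JOB_ID}"

mpirun -np 1 ./my_mpi_program < input.inp > output.out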


On Fri, Jun 29, 2018 at 7:24 AM, John Hearns  wrote:

> I have got this all wrong. Paddy Doyle has got it right.
>
> However are you SURE that mpirun is not creating tasks on the other
> machines?
> I would look at the compute nodes while the job is running and do
> ps -eaf --forest
>
> Also using mpirun to run a single core gives me the heebie-jeebies...
>
> https://en.wikipedia.org/wiki/Heebie-jeebies_(idiom)
>
>
>
>
> On 29 June 2018 at 13:16, Matteo Guglielmi 
> wrote:
>
>> You are right, but I'm actually supporting the system administrator of
>> that cluster; I'll mention this to him.
>>
>> Besides that,
>>
>> the user runs this for loop to submit the jobs:
>>
>>
>> # submit.sh #
>>
>> typeset -i i=1
>> typeset -i j=12500  #number of frames goes to each core = number of
>> frames (100)/40 (cores) =
>> typeset -i k=1
>>
>> while [ $i -le 36 ]  #the number of frames
>> do
>>
>> sbatch run-5o$i.sh $i $j $k
>>
>> i=$i+1 # number of frames goes to each node (5*200 = 1000)
>> done
>>
>> where each run-5oXX.sh jobfile looks like this:
>>
>>
>> #!/bin/bash
>>
>> #SBATCH --job-name=charmm-test
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=1
>> #SBATCH --cpus-per-task=1
>>
>> export PATH=/usr/lib64/openmpi/bin/:$PATH
>> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>>
>> mpirun -np 1 /opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm <
>> newphcnl99a0.inp > newphcnl99a0.out
>>
>>
>>
>>
>> so they are all independent mpiruns...  if one of them is killed, why
>> would all others go down as well?
>>
>>
>> That would make sense if a single mpirun is running 36 tasks... but the
>> user is not doing this.
>>
>> 
>> From: slurm-users  on behalf of
>> John Hearns 
>> Sent: Friday, June 29, 2018 12:52:41 PM
>> To: Slurm User Community List
>> Subject: Re: [slurm-users] All user's jobs killed at the same time on all
>> nodes
>>
>> Matteo, a stupid question but if these are single CPU jobs why is mpirun
>> being used?
>>
>> Is your user using these 36 jobs to construct a parallel job to run
>> charmm?
>> If the mpirun is killed, yes all the other processes which are started by
>> it on the other compute nodes will be killed.
>>
>> I suspect your user is trying to do something "smart". You should give
>> that person an example of how to reserve 36 cores and submit a charmm job.
>>
>>
>> On 29 June 2018 at 12:13, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:
>> Dear community,
>>
>> I have a user who usually submits 36 (identical) jobs at a time using a
>> simple for loop,
>> thus jobs are sbatched all at the same time.
>>
>> Each job requests a single core and all jobs are independent from one
>> another (read
>> different input files and write to different output files).
>>
>> Jobs are then usually started during the next couple of hours, somewhat
>> at random
>> times.
>>
>> What happens then is that after a certain amount of time (maybe from 2 to
>> 12 hours)
>> ALL jobs belonging to this particular user are killed by slurm on all
>> n

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread John Hearns
I have got this all wrong. Paddy Doyle has got it right.

However are you SURE that mpirun is not creating tasks on the other
machines?
I would look at the compute nodes while the job is running and do
ps -eaf --forest

Also using mpirun to run a single core gives me the heebie-jeebies...

https://en.wikipedia.org/wiki/Heebie-jeebies_(idiom)




On 29 June 2018 at 13:16, Matteo Guglielmi 
wrote:

> You are right, but I'm actually supporting the system administrator of that
> cluster; I'll mention this to him.
>
> Besides that,
>
> the user runs this for loop to submit the jobs:
>
>
> # submit.sh #
>
> typeset -i i=1
> typeset -i j=12500  #number of frames goes to each core = number of frames
> (100)/40 (cores) =
> typeset -i k=1
>
> while [ $i -le 36 ]  #the number of frames
> do
>
> sbatch run-5o$i.sh $i $j $k
>
> i=$i+1 # number of frames goes to each node (5*200 = 1000)
> done
>
> where each run-5oXX.sh jobfile looks like this:
>
>
> #!/bin/bash
>
> #SBATCH --job-name=charmm-test
> #SBATCH --nodes=1
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=1
>
> export PATH=/usr/lib64/openmpi/bin/:$PATH
> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>
> mpirun -np 1 /opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm <
> newphcnl99a0.inp > newphcnl99a0.out
>
>
>
>
> so they are all independent mpiruns...  if one of them is killed, why
> would all others go down as well?
>
>
> That would make sense if a single mpirun is running 36 tasks... but the
> user is not doing this.
>
> ________________
> From: slurm-users  on behalf of
> John Hearns 
> Sent: Friday, June 29, 2018 12:52:41 PM
> To: Slurm User Community List
> Subject: Re: [slurm-users] All user's jobs killed at the same time on all
> nodes
>
> Matteo, a stupid question but if these are single CPU jobs why is mpirun
> being used?
>
> Is your user using these 36 jobs to construct a parallel job to run charmm?
> If the mpirun is killed, yes all the other processes which are started by
> it on the other compute nodes will be killed.
>
> I suspect your user is trying to do something "smart". You should give
> that person an example of how to reserve 36 cores and submit a charmm job.
>
>
> On 29 June 2018 at 12:13, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:
> Dear community,
>
> I have a user who usually submits 36 (identical) jobs at a time using a
> simple for loop,
> thus jobs are sbatched all at the same time.
>
> Each job requests a single core and all jobs are independent from one
> another (read
> different input files and write to different output files).
>
> Jobs are then usually started during the next couple of hours, somewhat at
> random
> times.
>
> What happens then is that after a certain amount of time (maybe from 2 to
> 12 hours)
> ALL jobs belonging to this particular user are killed by slurm on all
> nodes at exactly the
> same time.
>
> One example:
>
> ### master: /var/log/slurmctld.log ###
>
> [2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560
> InitPrio=4294185624 usec=255
> ...
> [2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on
> node38
> ...
> [2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560
> uid 1007
> [2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560
> State=0x8004 NodeCnt=1 successful 0x8004
>
> ### node38: /var/log/slurmd.log ###
>
> [2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560 ran
> for 0 seconds
> [2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
> [2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature
> plugin loaded
> [2018-06-28T19:29:05.431] [718560.batch] debug level = 2
> [2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
> [2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started
> 2018-06-28T19:29:05
> [2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of
> 65536 from submit host: Operation not permitted
> ...
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794
> (charmm)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792
> (mpirun)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791
> (slurm_script)
> [2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to 718560.429496729
> [2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38
> CANCELLED AT 2018-06-28T23:37:53 ***
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794
> (charmm)
> [2018-06-28T23:37:53.488] [718560.b

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread Matteo Guglielmi
You are right, but I'm actually supporting the system administrator of that 
cluster; I'll mention this to him.

Besides that,

the user runs this for loop to submit the jobs:


# submit.sh #

typeset -i i=1
typeset -i j=12500  #number of frames goes to each core = number of frames 
(100)/40 (cores) =
typeset -i k=1

while [ $i -le 36 ]  #the number of frames
do

sbatch run-5o$i.sh $i $j $k

i=$i+1 # number of frames goes to each node (5*200 = 1000)
done

where each run-5oXX.sh jobfile looks like this:


#!/bin/bash

#SBATCH --job-name=charmm-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

export PATH=/usr/lib64/openmpi/bin/:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH

mpirun -np 1 /opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm < 
newphcnl99a0.inp > newphcnl99a0.out




so they are all independent mpiruns...  if one of them is killed, why would all 
others go down as well?


That would make sense if a single mpirun is running 36 tasks... but the user is 
not doing this.


From: slurm-users  on behalf of John 
Hearns 
Sent: Friday, June 29, 2018 12:52:41 PM
To: Slurm User Community List
Subject: Re: [slurm-users] All user's jobs killed at the same time on all nodes

Matteo, a stupid question but if these are single CPU jobs why is mpirun being 
used?

Is your user using these 36 jobs to construct a parallel job to run charmm?
If the mpirun is killed, yes all the other processes which are started by it on 
the other compute nodes will be killed.

I suspect your user is trying to do something "smart". You should give that 
person an example of how to reserve 36 cores and submit a charmm job.


On 29 June 2018 at 12:13, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:
Dear community,

I have a user who usually submits 36 (identical) jobs at a time using a simple 
for loop,
thus jobs are sbatched all at the same time.

Each job requests a single core and all jobs are independent from one another 
(read
different input files and write to different output files).

Jobs are then usually started during the next couple of hours, somewhat at 
random
times.

What happens then is that after a certain amount of time (maybe from 2 to 12 
hours)
ALL jobs belonging to this particular user are killed by slurm on all nodes at 
exactly the
same time.

One example:

### master: /var/log/slurmctld.log ###

[2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560 
InitPrio=4294185624 usec=255
...
[2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on node38
...
[2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560 uid 
1007
[2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560 State=0x8004 
NodeCnt=1 successful 0x8004

### node38: /var/log/slurmd.log ###

[2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560 ran for 
0 seconds
[2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
[2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature plugin 
loaded
[2018-06-28T19:29:05.431] [718560.batch] debug level = 2
[2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
[2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started 
2018-06-28T19:29:05
[2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of 65536 
from submit host: Operation not permitted
...
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794 (charmm)
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792 (mpirun)
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791 
(slurm_script)
[2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to 718560.429496729
[2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38 
CANCELLED AT 2018-06-28T23:37:53 ***
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794 (charmm)
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69792 (mpirun)
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69791 
(slurm_script)
[2018-06-28T23:37:53.488] [718560.batch] Sent signal 15 to 718560.4294967294
[2018-06-28T23:37:53.492] [718560.batch] task 0 (69791) exited. Killed by 
signal 15.
[2018-06-28T23:37:53.512] [718560.batch] job 718560 completed with slurm_rc = 
0, job_rc = 15
[2018-06-28T23:37:53.512] [718560.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, 
error:0 status 15
[2018-06-28T23:37:53.516] [718560.batch] done with job

The slurm cluster has a minimal configuration:

ClusterName=cluster
ControlMachine=master
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
FastSchedule=1
SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm/
SlurmdSpoolDir=/var/spool/slurm/
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/sl

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread Paddy Doyle
Hi Matteo,

On Fri, Jun 29, 2018 at 10:13:33AM +, Matteo Guglielmi wrote:

> Dear community,
> 
> I have a user who usually submits 36 (identical) jobs at a time using a 
> simple for loop,
> thus jobs are sbatched all at the same time.
> 
> Each job requests a single core and all jobs are independent from one another 
> (read
> different input files and write to different output files).
> 
> Jobs are then usually started during the next couple of hours, somewhat at 
> random
> times.
> 
> What happens then is that after a certain amount of time (maybe from 2 to 12 
> hours)
> ALL jobs belonging to this particular user are killed by slurm on all nodes 
> at exactly the
> same time.
> 
> One example:
> 
> ### master: /var/log/slurmctld.log ###
> 
> [2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560 
> InitPrio=4294185624 usec=255
> ...
> [2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on 
> node38
> ...
> [2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560 
> uid 1007
> [2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560 State=0x8004 
> NodeCnt=1 successful 0x8004

That line looks like the user (presuming that uid 1007 is them; otherwise
it's an operator who can kill jobs) killed their job.

Have a look in the slurmctld.log for more lines with 'REQUEST_KILL_JOB'; if
they all appear at basically the same time, then it looks like uid 1007 did
something like 'scancel -u theusername'.
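
Something along these lines (assuming the slurmctld.log path from the config
you posted) would show whether the kill requests arrive as one burst and from
which uid:

# Group REQUEST_KILL_JOB entries by second and by requesting uid; a burst of
# many lines sharing one second and uid 1007 would point at an
# 'scancel -u <user>' style bulk kill.
grep 'REQUEST_KILL_JOB' /var/log/slurmctld.log | awk '{print substr($1,2,19), $NF}' | sort | uniq -c

# And confirm which account uid 1007 actually is:
getent passwd 1007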

That might not be it, but that would be my first guess.

Paddy

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/



Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread John Hearns
Matteo, a stupid question but if these are single CPU jobs why is mpirun
being used?

Is your user using these 36 jobs to construct a parallel job to run charmm?
If the mpirun is killed, yes all the other processes which are started by
it on the other compute nodes will be killed.

I suspect your user is trying to do something "smart". You should give that
person an example of how to reserve 36 cores and submit a charmm job.
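
(Not the user's actual setup, just a sketch: if the 36 runs really are
independent single-core tasks, a Slurm job array avoids both the submit loop
and mpirun entirely; the input/output file naming below is hypothetical.)

#!/bin/bash
#SBATCH --job-name=charmm-frames
#SBATCH --array=1-36
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# One independent single-core charmm run per array task; no mpirun involved,
# so killing one task cannot take the others down with it.
/opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm \
    < frame_${SLURM_ARRAY_TASK_ID}.inp \
    > frame_${SLURM_ARRAY_TASK_ID}.out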


On 29 June 2018 at 12:13, Matteo Guglielmi 
wrote:

> Dear community,
>
> I have a user who usually submits 36 (identical) jobs at a time using a
> simple for loop,
> thus jobs are sbatched all at the same time.
>
> Each job requests a single core and all jobs are independent from one
> another (read
> different input files and write to different output files).
>
> Jobs are then usually started during the next couple of hours, somewhat at
> random
> times.
>
> What happens then is that after a certain amount of time (maybe from 2 to
> 12 hours)
> ALL jobs belonging to this particular user are killed by slurm on all
> nodes at exactly the
> same time.
>
> One example:
>
> ### master: /var/log/slurmctld.log ###
>
> [2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560
> InitPrio=4294185624 usec=255
> ...
> [2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on
> node38
> ...
> [2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560
> uid 1007
> [2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560
> State=0x8004 NodeCnt=1 successful 0x8004
>
> ### node38: /var/log/slurmd.log ###
>
> [2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560 ran
> for 0 seconds
> [2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
> [2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature
> plugin loaded
> [2018-06-28T19:29:05.431] [718560.batch] debug level = 2
> [2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
> [2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started
> 2018-06-28T19:29:05
> [2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of
> 65536 from submit host: Operation not permitted
> ...
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794
> (charmm)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792
> (mpirun)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791
> (slurm_script)
> [2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to 718560.429496729
> [2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38
> CANCELLED AT 2018-06-28T23:37:53 ***
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794
> (charmm)
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69792
> (mpirun)
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69791
> (slurm_script)
> [2018-06-28T23:37:53.488] [718560.batch] Sent signal 15 to
> 718560.4294967294
> [2018-06-28T23:37:53.492] [718560.batch] task 0 (69791) exited. Killed by
> signal 15.
> [2018-06-28T23:37:53.512] [718560.batch] job 718560 completed with
> slurm_rc = 0, job_rc = 15
> [2018-06-28T23:37:53.512] [718560.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> [2018-06-28T23:37:53.516] [718560.batch] done with job
>
> The slurm cluster has a minimal configuration:
>
> ClusterName=cluster
> ControlMachine=master
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> FastSchedule=1
> SlurmUser=slurm
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/var/spool/slurm/
> SlurmdSpoolDir=/var/spool/slurm/
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> Proctracktype=proctrack/linuxproc
> ReturnToService=2
> PropagatePrioProcess=0
> PropagateResourceLimitsExcept=MEMLOCK
> TaskPlugin=task/cgroup
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> SlurmctldDebug=4
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=4
> SlurmdLogFile=/var/log/slurmd.log
> JobCompType=jobcomp/none
> JobAcctGatherType=jobacct_gather/cgroup
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=master
> AccountingStorageLoc=all
> NodeName=node[01-45] Sockets=2 CoresPerSocket=10 State=UNKNOWN
> PartitionName=partition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> Thank you for your help.
>
>


[slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread Matteo Guglielmi
Dear community,

I have a user who usually submits 36 (identical) jobs at a time using a simple 
for loop,
thus jobs are sbatched all at the same time.

Each job requests a single core and all jobs are independent from one another 
(read
different input files and write to different output files).

Jobs are then usually started during the next couple of hours, somewhat at 
random
times.

What happens then is that after a certain amount of time (maybe from 2 to 12 
hours)
ALL jobs belonging to this particular user are killed by slurm on all nodes at 
exactly the
same time.

One example:

### master: /var/log/slurmctld.log ###

[2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560 
InitPrio=4294185624 usec=255
...
[2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on node38
...
[2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560 uid 
1007
[2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560 State=0x8004 
NodeCnt=1 successful 0x8004

### node38: /var/log/slurmd.log ###

[2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560 ran for 
0 seconds
[2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
[2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature plugin 
loaded
[2018-06-28T19:29:05.431] [718560.batch] debug level = 2
[2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
[2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started 
2018-06-28T19:29:05
[2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of 65536 
from submit host: Operation not permitted
...
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794 (charmm)
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792 (mpirun)
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791 
(slurm_script)
[2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to 718560.429496729
[2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38 
CANCELLED AT 2018-06-28T23:37:53 ***
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794 (charmm)
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69792 (mpirun)
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69791 
(slurm_script)
[2018-06-28T23:37:53.488] [718560.batch] Sent signal 15 to 718560.4294967294
[2018-06-28T23:37:53.492] [718560.batch] task 0 (69791) exited. Killed by 
signal 15.
[2018-06-28T23:37:53.512] [718560.batch] job 718560 completed with slurm_rc = 
0, job_rc = 15
[2018-06-28T23:37:53.512] [718560.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, 
error:0 status 15
[2018-06-28T23:37:53.516] [718560.batch] done with job

The slurm cluster has a minimal configuration:

ClusterName=cluster
ControlMachine=master
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
FastSchedule=1
SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm/
SlurmdSpoolDir=/var/spool/slurm/
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
Proctracktype=proctrack/linuxproc
ReturnToService=2
PropagatePrioProcess=0
PropagateResourceLimitsExcept=MEMLOCK
TaskPlugin=task/cgroup
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SlurmctldDebug=4
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=4
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/cgroup
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master
AccountingStorageLoc=all
NodeName=node[01-45] Sockets=2 CoresPerSocket=10 State=UNKNOWN
PartitionName=partition Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Thank you for your help.