[slurm-dev] Re: Selecting a network interface with srun

2017-10-25 Thread John Hearns
Ralph, indeed.
As I have said before, my one piece of advice to everyone managing batch
systems: it is a name resolution problem. No, really it is.
Even if your cluster catches fire, the real reason that your jobs are not
being submitted is that the DNS resolver is burning and the scheduler can't
resolve the hostname of the submit host.

I am having some problems with a GPFS cluster today. Guess what - name
resolution. We have some new hosts which, for good reasons, have multiple
IP addresses.
Name resolution is producing 'fun and games'.




On 25 October 2017 at 17:30, r...@open-mpi.org <r...@open-mpi.org> wrote:

> Good points. I would also caution against renaming nodes using interfaces.
> This frequently causes failure of 3rd party software packages that compare
> the return value of “hostname” to the list of allocated nodes for
> optimization or placement purposes - e.g., mpirun! A quick grep of the
> mailing list logs will reveal all the woes that created.
>
> On Oct 25, 2017, at 8:22 AM, John Hearns <hear...@googlemail.com> wrote:
>
> When using “mpirun” we can specify “-iface ib0” - this is true, and the
> exact syntax depends on your MPI of choice, as noted above.
>
> However, don't get confused between IPoIB and InfiniBand itself. IPoIB is
> of course sending IP traffic over InfiniBand.
> An InfiniBand network can perfectly happily function without any IP
> addresses being assigned.
> The point I am getting at is that the IP connections should be used to set
> up/launch the job, depending on the launcher used by that MPI (e.g. Hydra,
> as above).
> What I am saying is do not get confused between the activity which sets up
> the MPI processes on the remote nodes, and the actual MPI traffic.
>
> Unless you really want to use IPoIB for your MPI traffic (maybe for doing
> a benchmark comparison) I would say just run the srun with the Ethernet
> IPs and let your MPI choose the best byte transfer layer (BTL, in Open MPI
> speak).
>
> What I would do is tag the InfiniBand-equipped nodes with a feature called
> 'IB' ('nonIB' for the others), and choose those nodes.
> (Sorry - my head is in PBSPro world these days so that would be a
> resources_available in that world)
>
> My advice - schedule just on the IB-equipped nodes. Run your MPI with a
> verbose flag and see which BTL it is choosing.
> You may be pleasantly surprised!
> I would say 'take down the ib0 interface' but that may be a bad move - it
> is probably used for storage mounts at least.
>
>
> If I have misunderstood the point, and have been a bit rude here, I
> apologise in advance. Someone with a clue will come along and slap me
> round the head, I am sure.
>
> On 25 October 2017 at 17:03, Le Biot, Pierre-Marie <
> pierre-marie.leb...@hpe.com> wrote:
>
>> Hi Sebastian,
>>
>>
>>
>> Another solution could be to change the configuration of nodes in
>> slurm.conf, making use of NodeName and NodeHostname (and NodeAddr if
>> needed) :
>>
>>
>>
>> “
>>
>> NodeName
>>
>> Name that Slurm uses to refer to a node[...]. Typically this would be the
>> string that "/bin/hostname -s" returns.[...]It may also be an arbitrary
>> string if NodeHostname is specified.[...]
>>
>>
>>
>> NodeHostname
>>
>> Typically this would be the string that "/bin/hostname -s"
>> returns.[...]By default, the NodeHostname will be identical in value to
>> NodeName.
>>
>>
>>
>> NodeAddr
>>
>> Name that a node should be referred to in establishing a communications
>> path.[...] NodeAddr may also contain IP addresses. By default, the NodeAddr
>> will be identical in value to NodeHostname.
>>
>> “
>>
>>
>>
>> For the nodes having an infiniband interface declare the associated name
>> in NodeName and the regular hostname in NodeHostname.
>>
>> SLURM_NODELIST will contain the names declared in NodeName.
>>
>>
>>
>> Regards,
>>
>> Pierre-Marie Le Biot
>>
>>
>>
>> *From:* Sebastian Eastham [mailto:seast...@mit.edu]
>> *Sent:* Tuesday, October 24, 2017 10:02 PM
>> *To:* slurm-dev <slurm-dev@schedmd.com>
>> *Subject:* [slurm-dev] Selecting a network interface with srun
>>
>>
>>
>> Dear Slurm Developers mailing list,
>>
>>
>>
>> When calling the “srun” command, is there any way to specify the desired
>> network interface? Our network is a mix of Ethernet and InfiniBand, such
>> that only a subset of the nodes hav

[slurm-dev] RE: Selecting a network interface with srun

2017-10-25 Thread John Hearns
When using “mpirun” we can specify “-iface ib0” - this is true, and the
exact syntax depends on your MPI of choice, as noted above.

However, don't get confused between IPoIB and InfiniBand itself. IPoIB is
of course sending IP traffic over InfiniBand.
An InfiniBand network can perfectly happily function without any IP
addresses being assigned.
The point I am getting at is that the IP connections should be used to set
up/launch the job, depending on the launcher used by that MPI (e.g. Hydra,
as above).
What I am saying is do not get confused between the activity which sets up
the MPI processes on the remote nodes, and the actual MPI traffic.

Unless you really want to use IPoIB for your MPI traffic (maybe for doing
a benchmark comparison) I would say just run the srun with the Ethernet
IPs and let your MPI choose the best byte transfer layer (BTL, in Open MPI
speak).

What I would do is tag the InfiniBand-equipped nodes with a feature called
'IB' ('nonIB' for the others), and choose those nodes.
(Sorry - my head is in PBSPro world these days so that would be a
resources_available in that world)
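
In Slurm terms that would be a node Feature plus a job constraint - a rough
sketch, with invented node names (check the slurm.conf man page for your
version for the exact keywords):

  NodeName=node[01-16] Feature=IB    CPUs=16 ...
  NodeName=node[17-32] Feature=nonIB CPUs=16 ...

and then in the job script:

  #SBATCH --constraint=IB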

My advice - schedule just on the IB-equipped nodes. Run your MPI with a
verbose flag and see which BTL it is choosing.
You may be pleasantly surprised!
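
For example, with Open MPI (the binary name is just a placeholder):

  mpirun --mca btl_base_verbose 30 ./my_mpi_app

will report which BTLs (openib, tcp, self, ...) get selected for each peer.
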
I would say 'take down the ib0 interface' but that may be a bad move - it
is probably used for storage mounts at least.


If I have misunderstood the point, and have been a bit rude here, I
apologise in advance. Someone with a clue will come along and slap me
round the head, I am sure.

On 25 October 2017 at 17:03, Le Biot, Pierre-Marie <
pierre-marie.leb...@hpe.com> wrote:

> Hi Sebastian,
>
>
>
> Another solution could be to change the configuration of nodes in
> slurm.conf, making use of NodeName and NodeHostname (and NodeAddr if
> needed) :
>
>
>
> “
>
> NodeName
>
> Name that Slurm uses to refer to a node[...]. Typically this would be the
> string that "/bin/hostname -s" returns.[...]It may also be an arbitrary
> string if NodeHostname is specified.[...]
>
>
>
> NodeHostname
>
> Typically this would be the string that "/bin/hostname -s" returns.[...]By
> default, the NodeHostname will be identical in value to NodeName.
>
>
>
> NodeAddr
>
> Name that a node should be referred to in establishing a communications
> path.[...] NodeAddr may also contain IP addresses. By default, the NodeAddr
> will be identical in value to NodeHostname.
>
> “
>
>
>
> For the nodes having an infiniband interface declare the associated name
> in NodeName and the regular hostname in NodeHostname.
>
> SLURM_NODELIST will contain the names declared in NodeName.
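>
> As a rough sketch (hostnames invented), a node line could then look like:
>
> NodeName=node01-ib NodeHostname=node01 CPUs=16 RealMemory=64000
>
> where node01-ib resolves to the IPoIB address and node01 is what
> "/bin/hostname -s" returns on the node.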
>
>
>
> Regards,
>
> Pierre-Marie Le Biot
>
>
>
> *From:* Sebastian Eastham [mailto:seast...@mit.edu]
> *Sent:* Tuesday, October 24, 2017 10:02 PM
> *To:* slurm-dev 
> *Subject:* [slurm-dev] Selecting a network interface with srun
>
>
>
> Dear Slurm Developers mailing list,
>
>
>
> When calling the “srun” command, is there any way to specify the desired
> network interface? Our network is a mix of Ethernet and InfiniBand, such
> that only a subset of the nodes have an infiniband interface. When using
> “mpirun” we can specify “-iface ib0”, but there does not appear to be a
> similar option for “srun”. Although we can successfully run our
> applications with “srun”, we can see from “iftop” that the application is
> communicating purely through the ethernet interface.
>
>
>
> Once again, I appreciate any help or guidance that you can give me!
>
>
>
> Regards,
>
>
>
> Seb
>
>
>
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>
> Dr. Sebastian D. Eastham
>
> Research Scientist
>
> Laboratory for Aviation and the Environment
>
> Massachusetts Institute of Technology
>
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>
>
>
>
>


[slurm-dev] Re: job allocation lag

2017-10-11 Thread John Hearns
Vladimir, in cases where you have a 'hairs on the back of your neck'
feeling, it often indicates something real.
However, you do have to be scientific about this. If you think that uptime
is an influence, you have to record job startup times each hour, and plot
these.
Be scientific.

I would also suggest watching a tail -f on the Slurm logs, and then submit
a job. You might get some indication of where the slow-down is.
Have you increased the debug level in the logs (*SlurmctldDebug*)?
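
A minimal sketch of that (log paths are the usual ones, adjust to your
site):

  scontrol show config | grep -i debug
  tail -f /var/log/slurmctld.log /var/log/slurmd.log

and, if needed, raise SlurmctldDebug (e.g. to debug) in slurm.conf followed
by an 'scontrol reconfigure'.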

Finally, my one piece of advice to everyone managing batch systems. It is a
name resolution problem. No, really it is.
Even if your cluster catches fire, the real reason that your jobs are not
being submitted is that the DNS resolver is burning and the scheduler can't
resolve the hostname of the submit host.
Joking aside, many, many problems with batch systems are due to name
resolution.






On 11 October 2017 at 09:33, Vladimir Daric <
vladimir.da...@ips2.universite-paris-saclay.fr> wrote:

> Hello,
>
> We are running a 10 node cluster in our lab and we are experiencing a job
> allocation lag.
>
> srun commands wait for resource allocation up to 1 minute even if there
> are several idle nodes. It's the same with sbatch scripts. Even if there
> are idle nodes, jobs are waiting for about one minute for resource
> allocation..
>
> Our ControlMachine is on a virtual node. Compute nodes are all physical
> machines.
>
> In our config file we set those values :
> FastSchedule=1
> SchedulerType=sched/backfill
>
> I feel like after the whole cluster reboot, jobs are scheduled pretty fast
> and after few weeks uptime job scheduling slows down (at this moment
> ControlMAchine uptime is 25 days). I'm not quite sure those are related.
>
> Everything looks in order, there is no errors in logfiles ...
>
> I'll be grateful for any hint ... or advice.
>
> Thanks,
> Vladimir
>
>
>
>
>
>
>
>


[slurm-dev] Re: MPI-Jobs on cluster - how to set batchhost

2017-09-28 Thread John Hearns
Brigitte, are you able to tell us more about this scratch filesystem?

You could arrange that the compute nodes mount it directly, so you get the
performance you need.
This can be achieved by putting a routing node onto the cluster network,
or you could route through the cluster head node.

Also you could look at the NFS mount parameters and try to increase
performance. What are your rsize and wsize values?
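
To see the values currently in effect, something like:

  nfsstat -m

(or 'mount | grep nfs') on a compute node will show rsize/wsize and the
other options for each NFS mount.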

On 28 September 2017 at 15:54, Selch, Brigitte (FIDF)  wrote:

> Hello,
>
> but then I'm not able to minimize the amount of core used on the headnode .
> The first running  job uses all cores of my headnode, no next job could
> start ...
>
> I have experimented with OverSubscribe=FORCE, but this can only be defined
> for a whole partition not for one node only.
> I'm just a bit frustrated
>
>
> Thank you!
> Brigitte
>
>


[slurm-dev] Re: MPI-Jobs on cluster - how to set batchhost

2017-09-28 Thread John Hearns
Brigitte, thank you. That makes sense. I guess that there is an NFS
re-export of the scratch filesystem.

I know this is not an answer to the problem at the moment, but maybe you
should look at BeeOND (BeeGFS On Demand) for the future.
https://www.beegfs.io/wiki/BeeOND

With the disclaimer that I have not implemented this!


On 28 September 2017 at 11:57, Gennaro Oliva  wrote:

>
> Hi Brigitte,
>
> On Thu, Sep 28, 2017 at 02:51:53AM -0600, Selch, Brigitte (FIDF) wrote:
> > We have a cluster with one headnodes and x computenodes.
> > Scratch Filesystems are locally attached to the headnode, so the MPI
> task which makes I/O should run on headnode.
> > But how can I determine, which node will be the batchhost?
> >
> > So far I only defined in my sbatch-script:
> > #SBATCH --ntasks=120
>
> you can try with:
>
> #SBATCH -w headnode --ntasks=120
>
> Where headnode is the hostname of your headnode.
>
> This should ensure that at least one task runs on the headnode.
>
> If this task doesn't have id 0 in your mpi prgram you can always change
> your code
> by checking the MPI_Get_processor_name in all your tasks and changing the
> I/O task id from 0 to the first task matching the headnode hostname on
> all your nodes.
>
> I hope I made myself clear.
> Best regards
> --
> Gennaro Oliva
>


[slurm-dev] Re: MPI-Jobs on cluster - how to set batchhost

2017-09-28 Thread John Hearns
Brigitte, I understand what you are trying to achieve.
But may I ask - is there local storage on your compute nodes?
You could run a job where the results are written to local storage, then
transferred to your scratch filesystem at the end of the job.
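
A rough sketch of that pattern (paths and program names invented, and
assuming the node that writes the data also does the copy back):

  #!/bin/bash
  #SBATCH --ntasks=120
  SCRATCH=/local/scratch/$SLURM_JOB_ID
  mkdir -p $SCRATCH
  cd $SCRATCH
  srun ./my_solver                      # writes results under $SCRATCH
  cp -r $SCRATCH /shared/scratch/       # copy back at the end of the job
  rm -rf $SCRATCH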

It is normal on an HPC cluster to have the scratch filesystem mounted on all
the compute nodes.
I guess in this case there are good reasons for mounting this filesystem on
the head node only.

However - how do the compute nodes get access to the simulation files which
they need? Or the libraries and executables.
There must be some shared storage!



On 28 September 2017 at 10:51, Selch, Brigitte (FIDF)  wrote:

> Hello,
>
> How can I define the batchhost (the host with MPI task 0).
>
> We have a cluster with one headnodes and x computenodes.
> Scratch Filesystems are locally attached to the headnode, so the MPI task
> which makes I/O should run on headnode.
> But how can I determine, which node will be the batchhost?
>
> So far I only defined in my sbatch-script:
> #SBATCH --ntasks=120
>
> And then to make a hostfile for my application:
> srun hostname
> …
>
> Do I think wrong or too complicated?
> How can I achieve that the batchhost is always the cluster-headnode?
>
>
> Thank you!
>
>
>
>
>


[slurm-dev] Re: Interaction between cgroups and NFS

2017-09-03 Thread John Hearns
No, I have never seen anything similar.
A small bit of help - the 'nfswatch' utility is useful for tracking down
NFS problems.
Less relevant, but on a system which is running low on memory, 'watch cat
/proc/meminfo' is often good for shining a light.


On 2 September 2017 at 00:16, Brendan Moloney 
wrote:

> Hello,
>
> I am using cgroups to track processes and limit memory. Occasionally it
> seems like a job will use too much memory and instead of getting killed it
> ends up in a unkillable state waiting for NFS I/O.  There are no other
> signs of NFS issues, and in fact other jobs (even on the same node) seem to
> be having no problem communicating with the same NFS server at that same
> time.  I just get hung task errors for that one specific process (that used
> too much memory).
>
> Has anyone else ran into this? Searching this mailing list archive I found
> some similar stuff, but that seemed to be in regards to installing Slurm
> itself onto an NFS4 mount rather than just having jobs use an NFS4 mount.
>
> Any advice is greatly appreciated.
>
> Thanks,
> Brendan
>


[slurm-dev] Re: Slurm and Environments and aliases

2017-08-16 Thread John Hearns
Lachlan, I will have to check when I get into work in the morning. I am
also sorry if I led you down the wrong path here; however, this does feel
like an issue of login versus non-login shells.
Try a bash -l (dash, lowercase L).

I am sure the login/non-login thing has been discussed on here recently.
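
For reference, the suggestion amounts to something like this at the top of
the sbatch script (module name is a placeholder):

  #!/bin/bash -l
  module load SOFTWARE
  ...

The -l makes bash behave as a login shell, so the profile scripts get
sourced. Note also that bash only expands aliases in non-interactive
scripts if 'shopt -s expand_aliases' is set, which may be part of the
problem here.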

On 17 August 2017 at 01:56, Lachlan Musicman  wrote:

> Hola,
>
> I was under the impression that environments travelled with slurm when
> sbatch was executed - so any node could execute any code as if it was the
> env I executed from or built within my sbatch scripts.
>
> We use Environment Modules and this has all worked just great. Very
> pleased.
>
> Recently I learnt about Environment Modules "set-alias" command, which
> seemed pretty nifty, especially for java executables that til now had been
> wrapped in shell scripts that looked like:
>
> #!/bin/sh
>  BASEDIR=$(dirname $0)
>
>  if [[ -z "$TMPDIR" ]]; then
>  TMPDIR=/tmp
>  fi
>
>  java -Xmx8g -Djava.io.tmpdir=$TMPDIR -jar $BASEDIR/SOFTWARE.jar "$@"
>
>
>
> I hated having these shell scripts around because they are messy and
> cumbersome. Setting the alias seemed to be the perfect, modularised,
> solution.
>
> set-alias "java -Xmx8g -Djava.io.tmpdir=$TMPDIR -jar $BASEDIR/SOFTWARE.jar"
>
> But today I have discovered that the alias - while working on the login
> node - doesn't work when sent via
>
> sbatch script.sh
>
>
> Are we doing something wrong, or was I incorrect in thinking that
> set-alias was the balm for our shell script mess?
>
> cheers
> L.
>
> --
> "The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic
> civics is the insistence that we cannot ignore the truth, nor should we
> panic about it. It is a shared consciousness that our institutions have
> failed and our ecosystem is collapsing, yet we are still here — and we are
> creative agents who can shape our destinies. Apocalyptic civics is the
> conviction that the only way out is through, and the only way through is
> together. "
>
> *Greg Bloom* @greggish https://twitter.com/greggish/
> status/873177525903609857
>


[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-15 Thread John Hearns
For the /proc/self check you need to start an interactive job under Slurm.

(I'm speaking from a PBSPro viewpoint here.
What? What?  Maud - release the dogs! Fetch my shotgun! Get off my property
Sir!)
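
A minimal way to do that (resource flags arbitrary):

  srun --pty -n1 bash -i
  cat /proc/self/cgroup

so the shell you are inspecting is the one Slurm placed inside the job's
cgroup.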

On 15 August 2017 at 05:15, Lachlan Musicman  wrote:

> On 15 August 2017 at 11:38, Christopher Samuel 
> wrote:
>
>> On 15/08/17 09:41, Lachlan Musicman wrote:
>>
>> > I guess I'm not 100% sure what I'm looking for, but I do see that there
>> > is a
>> >
>> > 1:name=systemd:/user.slice/user-0.slice/session-373.scope
>> >
>> > in /proc/self/cgroup
>>
>> Something is wrong in your config then. It should look something like:
>>
>> 4:cpuacct:/slurm/uid_3959/job_6779703/step_9/task_1
>> 3:memory:/slurm/uid_3959/job_6779703/step_9/task_1
>> 2:cpuset:/slurm/uid_3959/job_6779703/step_9
>> 1:freezer:/slurm/uid_3959/job_6779703/step_9
>>
>> for /proc/${PID_OF_PROC}/cgroup
>>
>> I notice you have /proc/self - that will be the shell you are running in
>> for your SSH session and not the job!
>>
>
> Oh, that explains more.
>
> Now it looks like:
>
> 2:hugetlb:/
> 11:rdma:/
> 10:perf_event:/
> 9:cpu,cpuacct:/
> 8:cpuset:/slurm/uid_1506/job_1998/step_batch
> 7:pids:/
> 6:freezer:/slurm/uid_1506/job_1998/step_batch
> 5:net_cls,net_prio:/
> 4:devices:/system.slice
> 3:blkio:/
> 2:memory:/
> 1:name=systemd:/system.slice/slurmd.service
>
> I seem to have a lot of guff in there that I don't need?
>
> L.
>
>
> --
> "The antidote to apocalypticism is apocalyptic civics. Apocalyptic civics
> is the insistence that we cannot ignore the truth, nor should we panic
> about it. It is a shared consciousness that our institutions have failed
> and our ecosystem is collapsing, yet we are still here — and we are
> creative agents who can shape our destinies. Apocalyptic civics is the
> conviction that the only way out is through, and the only way through is
> together. "
>
> Greg Bloom @greggish https://twitter.com/greggish/
> status/873177525903609857
>


[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-14 Thread John Hearns
Lachlan, regarding stress-ng --cpu 5: this starts 5 workers (threads).
It will be contained within a cgroup here, and allocated some cores/threads.
You can start as many workers/threads as you like within a cgroup. There
is no signalling back between stress-ng and the cgroup setup! (If I am
wrong forgive me.)

Also, top is your friend here. And, more usefully, 'htop'.
Just look at top with the -H flag to show threads and 'j' to show the last
used CPU.
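
You can also ask the cgroup itself what it was given - for example, using
the uid/job numbers from your own output (cgroup v1 layout):

  cat /sys/fs/cgroup/cpuset/slurm/uid_1506/job_1998/cpuset.cpus

which lists the CPU ids the job is confined to.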




On 14 August 2017 at 08:12, John Hearns <hear...@googlemail.com> wrote:

> Lachlan, forgive me if I am teaching granny to suck eggs...
> I have recently been working with cgroups.
> If you run an interactive job, what do you see when you cat
> /proc/self/cgroup?
> Also, have you explored /sys/fs/cgroup and checked what resources are
> in the cgroups which a job has?
>
> On 14 August 2017 at 07:49, Lachlan Musicman <data...@gmail.com> wrote:
>
>> Hola,
>>
>> Slurm is complicated software, and sometimes the docs can be dense - I'm
>> looking for some clarification please.
>>
>> We have a system set up with Threads as CPUs. 1 socket, 4 cores, 2
>> threads = 8 cpus
>>
>> I would like to implement CGroups because some of our users are quite
>> happy to utilise all threads despite other users.
>>
>> We have TaskPlugin=task/cgroup and when testing I noticed that the # of
>> threads/cpus being allocated was rounded up to the nearest even. I presume
>> this was due to cgroups marking a core as a cpu, rather than a thread as a
>> cpu.
>>
>> So I set TaskPluginParam=Threads, but slurm is still allowing the use of
>> more threads than have been requested.
>>
>> In particular, I'm running this test:
>>
>> #!/bin/bash
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=3
>>
>> stress-ng --cpu 5 --cpu-method all --io 5 --vm 1 --vm-bytes 1G --timeout
>> 600s --quiet
>>
>>
>> I was hoping that the cgroup would kill the job because of too many cpus,
>> but that's not how stress-ng works I've discovered.
>>
>> Regardless, when running this, I noted that squeue shows I've been
>> allocated 3 CPUs, but on the server itself, I'm seeing four cpus being used?
>>
>> What have I done wrong? Is it possible to have granular control at the
>> thread level with cgroups?
>>
>> cheers
>> L.
>>
>>
>> --
>> "The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic
>> civics is the insistence that we cannot ignore the truth, nor should we
>> panic about it. It is a shared consciousness that our institutions have
>> failed and our ecosystem is collapsing, yet we are still here — and we are
>> creative agents who can shape our destinies. Apocalyptic civics is the
>> conviction that the only way out is through, and the only way through is
>> together. "
>>
>> *Greg Bloom* @greggish https://twitter.com/greggish/s
>> tatus/873177525903609857
>>
>
>


[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-14 Thread John Hearns
Lachlan, forgive me if I am teaching granny to suck eggs...
I have recently been working with cgroups.
If you run an interactive job, what do you see when you cat
/proc/self/cgroup?
Also, have you explored /sys/fs/cgroup and checked what resources are in
the cgroups which a job has?

On 14 August 2017 at 07:49, Lachlan Musicman  wrote:

> Hola,
>
> Slurm is complicated software, and sometimes the docs can be dense - I'm
> looking for some clarification please.
>
> We have a system set up with Threads as CPUs. 1 socket, 4 cores, 2 threads
> = 8 cpus
>
> I would like to implement CGroups because some of our users are quite
> happy to utilise all threads despite other users.
>
> We have TaskPlugin=task/cgroup and when testing I noticed that the # of
> threads/cpus being allocated was rounded up to the nearest even. I presume
> this was due to cgroups marking a core as a cpu, rather than a thread as a
> cpu.
>
> So I set TaskPluginParam=Threads, but slurm is still allowing the use of
> more threads than have been requested.
>
> In particular, I'm running this test:
>
> #!/bin/bash
> #SBATCH --nodes=1
> #SBATCH --ntasks=3
>
> stress-ng --cpu 5 --cpu-method all --io 5 --vm 1 --vm-bytes 1G --timeout
> 600s --quiet
>
>
> I was hoping that the cgroup would kill the job because of too many cpus,
> but that's not how stress-ng works I've discovered.
>
> Regardless, when running this, I noted that squeue shows I've been
> allocated 3 CPUs, but on the server itself, I'm seeing four cpus being used?
>
> What have I done wrong? Is it possible to have granular control at the
> thread level with cgroups?
>
> cheers
> L.
>
>
> --
> "The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic
> civics is the insistence that we cannot ignore the truth, nor should we
> panic about it. It is a shared consciousness that our institutions have
> failed and our ecosystem is collapsing, yet we are still here — and we are
> creative agents who can shape our destinies. Apocalyptic civics is the
> conviction that the only way out is through, and the only way through is
> together. "
>
> *Greg Bloom* @greggish https://twitter.com/greggish/
> status/873177525903609857
>


[slurm-dev] Re: Per-job tmp directories and namespaces

2017-08-10 Thread John Hearns
Bill, thankyou very much for that.  I guess I have to get my systemd hat on.
A hat which is very large and composed of many parts, and indeed functions
as a pair of pants too.



On 10 August 2017 at 14:33, Bill Barth <bba...@tacc.utexas.edu> wrote:

> If you use a modern enough OS (RHEL/CentOS 7, etc), XDG_RUNTIME_DIR will
> probably be set and mounted (it’s a tmpfs with a limited max size mounted,
> per-session, under /run/user/) on your login nodes, any node that
> environment propagates to (like the first compute node of a job), and
> anywhere that the user (or MPI stack) sshes to due to the PAM integration
> of pam_systemd.so in the auth process. Just having the environment variable
> set is not quite enough, though, you also need it mounted and unmounted at
> the end of each shell session. If you add the same line from
> /etc/pam.d/system-auth (or your OS’s equivalent) to /etc/pam.d/slurm, then
> srun- and sbatch-initiated shells and processes will also have the
> directory properly set up. MPI jobs that use ssh will get the mount
> automatically due to the ssh PAM integration with systemd, but those that
> use PMI-* and srun need the additional PAM integration.
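>
> For reference, the PAM line in question is typically of the form
>
> session optional pam_systemd.so
>
> but check your distribution's /etc/pam.d/system-auth for the exact
> variant used there.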
>
> Like it or not, this systemd-based/freedesktop.org system for a private,
> ephemeral temporary directory appears to be the future on Linux, and lots
> of GUI-based programs (Qt) are already expecting it. There are instructions
> in the standard for what you’re supposed to do as a developer if it doesn’t
> exist or has the wrong permissions, but this method is at least becoming
> standardized across Linux distributions. We first discovered this recently
> on some new CentOS 7 boxes that we were running under SLURM and were
> complaining in some GUI apps that didn’t have it mounted. It took a little
> while to figure out where in the PAM stack to insert the pam_systemd.so
> configuration line to guarantee that it was working for all our SLURM jobs,
> but the above method seems to solve the problem.
>
> Best,
> Bill.
>
> --
> Bill Barth, Ph.D., Director, HPC
> bba...@tacc.utexas.edu|   Phone: (512) 232-7069
> Office: ROC 1.435|   Fax:   (512) 475-9445
>
>
>
> On 8/10/17, 3:06 AM, "Fokke Dijkstra" <f.dijks...@rug.nl> wrote:
>
> We use the spank-private-tmp plugin developed at HPC2N in Sweden:
>
> https://github.com/hpc2n/spank-private-tmp
>
>
>
> See also: https://slurm.schedmd.com/SUG14/private_tmp.pdf
> for a presentation about the plugin.
>
>
>
>
> 2017-08-10 9:31 GMT+02:00 John Hearns <hear...@googlemail.com>:
>
> I am sure someone discussed this topic on this list a few months
> ago... if it rings any bells please let me know.
> I am not discussing setting the TMPDIR environment variable and
> creating a new TMPDIR directory on a per job basis - though thank you for
> the help I did get when discussing this.
>
>
> Rather I would like to set up a new namespace when a job runs such
> that /tmp is unique to every job.  /tmp can of course be a directory
> uniquely created for that job and deleted afterwards.
> This is to cope with any software which writes to  /tmp rather than
> using the TMPDIR variable.
> If anyone does this let me know what your techniques are please.
>
> --
> Fokke Dijkstra
> <f.dijks...@rug.nl> <mailto:f.dijks...@rug.nl>
> Research and Innovation Support
> Center for Information Technology, University of Groningen
> Postbus 11044, 9700 CA  Groningen, The Netherlands
> +31-50-363 9243
>
>


[slurm-dev] Re: Per-job tmp directories and namespaces

2017-08-10 Thread John Hearns
Fokke, thank you very much for the response.


On 10 August 2017 at 10:07, Fokke Dijkstra <f.dijks...@rug.nl> wrote:

> We use the spank-private-tmp plugin developed at HPC2N in Sweden:
> https://github.com/hpc2n/spank-private-tmp
>
> See also: https://slurm.schedmd.com/SUG14/private_tmp.pdf
> for a presentation about the plugin.
>
>
> 2017-08-10 9:31 GMT+02:00 John Hearns <hear...@googlemail.com>:
>
>> I am sure someone discussed this topic on this list a few months ago...
>> if it rings any bells please let me know.
>> I am not discussing setting the TMPDIR environment variable and creating
>> a new TMPDIR directory on a per job basis - though thank you for the help I
>> did get when discussing this.
>>
>> Rather I would like to set up a new namespace when a job runs such that
>> /tmp is unique to every job.  /tmp can of course be a directory uniquely
>> created for that job and deleted afterwards.
>> This is to cope with any software which writes to  /tmp rather than using
>> the TMPDIR variable.
>> If anyone does this let me know what your techniques are please.
>>
>>
>
>
> --
> Fokke Dijkstra <f.dijks...@rug.nl> <f.dijks...@rug.nl>
> Research and Innovation Support
> Center for Information Technology, University of Groningen
> Postbus 11044, 9700 CA  Groningen, The Netherlands
> +31-50-363 9243 <+31%2050%20363%209243>
>


[slurm-dev] Per-job tmp directories and namespaces

2017-08-10 Thread John Hearns
I am sure someone discussed this topic on this list a few months ago... if
it rings any bells please let me know.
I am not discussing setting the TMPDIR environment variable and creating a
new TMPDIR directory on a per job basis - though thank you for the help I
did get when discussing this.

Rather I would like to set up a new namespace when a job runs such that
/tmp is unique to every job. /tmp can of course be a directory uniquely
created for that job and deleted afterwards.
This is to cope with any software which writes to  /tmp rather than using
the TMPDIR variable.
If anyone does this let me know what your techniques are please.
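
For what it is worth, the bare mechanism I have in mind (run as root, paths
invented) is a per-job bind mount inside a private mount namespace,
something like:

  mkdir -p /scratch/tmp.$SLURM_JOB_ID
  unshare --mount sh -c "mount --make-rprivate / && \
      mount --bind /scratch/tmp.$SLURM_JOB_ID /tmp && exec <job command>"

but I would prefer something that does this (plus cleanup) automatically
for every job.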


[slurm-dev] Re: RebootProgram - who uses it?

2017-08-07 Thread John Hearns
Lachlan, in the Name of the Wee Man, so 'reboot' is now a 'legacy tool'
https://access.redhat.com/solutions/1580343

Jeez... Look HPC compute node - I'm in charge, gottit? Yeah, fight back all
you like with systemd, but I can pull the power plug.
Let's see you deal with that one.

On 7 August 2017 at 06:08, Lachlan Musicman  wrote:

> I've just been asked about implementing a "drain and reboot" for
> nodes/partitions.
>
> In slurm.conf, there is a RebootProgram - does this need to be a direct
> link to a bin or can it be a command?
>
>
> RebootProgram=/usr/sbin/reboot
>
> or
>
> RebootProgram='systemctl disable reboot-guard; reboot'
>
> Cheers
> L.
>
> --
> "The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic
> civics is the insistence that we cannot ignore the truth, nor should we
> panic about it. It is a shared consciousness that our institutions have
> failed and our ecosystem is collapsing, yet we are still here — and we are
> creative agents who can shape our destinies. Apocalyptic civics is the
> conviction that the only way out is through, and the only way through is
> together. "
>
> *Greg Bloom* @greggish https://twitter.com/greggish/
> status/873177525903609857
>


[slurm-dev] RE: [Non-DoD Source] Re: General Post-Processing Question (UNCLASSIFIED)

2017-07-20 Thread John Hearns
Anthony, I back up what Peter says.  I had a project recently where we had
a render farm deployed with Slurm.
There were 'data mover' jobs needed which ran once a render was complete
and we used job dependencies for these.
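
In Slurm that is just a chained submission - something along these lines
(script names invented):

  jobid=$(sbatch --parsable render.sh)
  sbatch --dependency=afterok:$jobid datamover.sh

so the mover only starts once the render has finished successfully.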

I guess though, from what you say, that you will have to monitor how long
the post-processing jobs take to start.
Have you looked at the new job packs feature which was announced on the
list a few days ago?
https://slurm.schedmd.com/SLUG16/Job_Packs_SUG_2016.pdf


I also looked at doing the render farm tasks using an srun which reserved a
compute node, then within the srun script I did an sbatch for the render,
then the data mover phase. That is of course not efficient if you are
running on N compute nodes, as N-1 are left idle while the post-processing
is taking place.

On 19 July 2017 at 19:08, Glover, Anthony E CTR USARMY RDECOM (US) <
anthony.e.glover@mail.mil> wrote:

> CLASSIFICATION: UNCLASSIFIED
>
> Thanks Pete. That looks exactly like what I need. I couldn't think of the
> right search term, but pipeline is exactly what I'm trying to do.
>
> Thanks!
> Tony
>
> -Original Message-
> From: Peter A Ruprecht [mailto:peter.rupre...@colorado.edu]
> Sent: Wednesday, July 19, 2017 11:32 AM
> To: slurm-dev 
> Subject: [Non-DoD Source] [slurm-dev] Re: General Post-Processing Question
> (UNCLASSIFIED)
>
> Tony,
>
> Have you considered using Slurm job dependencies for this workflow?  That
> way you can submit the initial job and the post-processing job at the same
> time, but set a dependency on the post-processing job so that it can't
> start until the first job has finished successfully.  We've had users who
> manage fairly complicated analysis pipelines entirely with job dependencies.
>
> Regards,
> Pete
>
> On 7/19/17, 10:07 AM, "Glover, Anthony E CTR USARMY RDECOM (US)" <
> anthony.e.glover@mail.mil> wrote:
>
>
> CLASSIFICATION: UNCLASSIFIED
>
> Got a general question, but one that might be specifically addressed
> by Slurm - don't know.
>
> We have a multi-process, distributed simulation that runs as a single
> job and generates a significant amount of data. At the end of that run, we
> would like to be able to post-process the data. The post-processing
> currently consists of python scripts wrapped up in luigi workflows/tasks.
> We would like to be able to distribute those tasks across the cluster as
> well to speed up the post-processing.
>
> So, my question is: what is the best way to trigger submitting a job
> to Slurm based upon the completion of a previous job? I see that the
> strigger command can probably do what I need, but maybe it is more of a
> workflow question that I have. If we have say 100 of these simulation jobs
> in the queue, then it would seem like I would want the post-processing to
> run at the end of each job, but if the trigger submits another job with
> multiple CPU needs, then that job would go in at the back of the queue. I
> guess I could set the priority such that it jumps the remaining simulation
> jobs, or maybe a separate post-processing queue is more appropriate.
> Anyway, just looking for some ideas as to how others might be addressing
> this type or problem. Any guidance would be much appreciated.
>
> Thanks,
> Tony
>
> CLASSIFICATION: UNCLASSIFIED
>
>
> CLASSIFICATION: UNCLASSIFIED
>


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread John Hearns
Said,
thank you for letting us know.
I'm going to blame this one on systemd. Just because I can.


On 6 July 2017 at 13:22, Said Mohamed Said <said.moha...@oist.jp> wrote:

> John and Others,
>
>
> Thank you very much for your support. The problem is finally solved.
>
>
> After Installing nmap, it let me realize that some ports were blocked even
> with firewall daemon stopped and disabled. Turned out that iptables was on
> and enabled. After stopping iptables everything work just fine.
>
>
>
> Best Regards,
>
>
> Said.
> --
> *From:* John Hearns <hear...@googlemail.com>
> *Sent:* Thursday, July 6, 2017 6:47:48 PM
>
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Said, you are not out of ideas.
>
> I would suggest 'nmap' as a good tool to start with. Install nmap on
> your compute node and see which ports are open on the controller node.
>
> Also do we have a DNS name resolution problem here?
> I always remember Sun Grid Engine as being notoriously sensitive to name
> resolution, and that was my first question when any SGE problem was
> reported.
> So a couple of questions:
>
> On the controller node and on the compute node run this:
> hostname
> hostname -f
>
> Do the cluster controller node or the compute nodes have more than one
> network interface?
> I bet the cluster controller node does! From the compute node, do an
> nslookup or a dig and see what the COMPUTE NODE thinks the names of
> both of those interfaces are.
>
> Also as Rajul says - how are you making sure that both controller and
> compute nodes have the same slurm.conf file?
> Actually, if the slurm.conf files are different this will be logged when
> the compute node starts up, but let us check everything.
>
> On 6 July 2017 at 11:37, Said Mohamed Said <said.moha...@oist.jp> wrote:
>
>> Even after reinstalling everything from the beginning the problem is
>> still there. Right now I am out of Ideas.
>>
>>
>>
>>
>> Best Regards,
>>
>>
>> Said.
>> --
>> *From:* Said Mohamed Said
>> *Sent:* Thursday, July 6, 2017 2:23:05 PM
>> *To:* slurm-dev
>> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>>
>> Thank you all for your suggestions, the only thing I can do for now is to
>> uninstall and install from the beginning and I will use the most recent
>> version of slurm on both nodes.
>>
>> For Felix who asked, the OS is CentOS 7.3 on both machines.
>>
>> I will let you know if that can solve the issue.
>> --
>> *From:* Rajul Kumar <kumar.r...@husky.neu.edu>
>> *Sent:* Thursday, July 6, 2017 12:41:51 AM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> Sorry for the typo
>> It's generally when one of the controller or compute can reach the other
>> one but it's *not* happening vice-versa.
>>
>>
>> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar <kumar.r...@husky.neu.edu>
>> wrote:
>>
>>> I came across the same problem sometime back. It's generally when one of
>>> the controller or compute can reach to other one but it's happening
>>> vice-versa.
>>>
>>> Have a look at the following points:
>>> - controller and compute can ping to each other
>>> - both share the same slurm.conf
>>> - slurm.conf has the location of both controller and compute
>>> - slurm services are running on the compute node when the controller
>>> says it's down
>>> - TCP connections are not being dropped
>>> - Ports are accessible that are to be used for communication,
>>> specifically response ports
>>> - Check the routing rules if any
>>> - Clocks are synced across
>>> - Hope there isn't any version mismatch but still have a look (doesn't
>>> recognize the nodes for major version differences)
>>>
>>> Hope this helps.
>>>
>>> Best,
>>> Rajul
>>>
>>> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <hear...@googlemail.com>
>>> wrote:
>>>
>>>> Said,
>>>>a problem like this always has a simple cause. We share your
>>>> frustration, and several people here have offered help.
>>>> So please do not get discouraged. We have all been in your situation!
>>>>
>>>> The only way to handle problems like this is
>>>> a) start at the beginning and read the manuals and web

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread John Hearns
Said, you are not out of ideas.

I would suggest 'nmap' as a good tool to start with. Install nmap on your
compute node and see which ports are open on the controller node.
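
For example, assuming the default Slurm ports:

  nmap -p 6817,6818 <controller hostname>

(6817 being the default SlurmctldPort and 6818 the default SlurmdPort).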

Also do we have a DNS name resolution problem here?
I always remember Sun Grid Engine as being notoriously sensitive to name
resolution, and that was my first question when any SGE problem was
reported.
So a couple of questions:

On the controller node and on the compute node run this:
hostname
hostname -f

Do the cluster controller node or the compute nodes have more than one
network interface?
I bet the cluster controller node does! From the compute node, do an
nslookup or a dig and see what the COMPUTE NODE thinks the names of
both of those interfaces are.

Also as Rajul says - how are you making sure that both controller and
compute nodes have the same slurm.conf file?
Actually, if the slurm.conf files are different this will be logged when the
compute node starts up, but let us check everything.
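
A quick check is to compare checksums on both machines, e.g.

  md5sum /etc/slurm/slurm.conf

(adjust the path to wherever your slurm.conf lives).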

On 6 July 2017 at 11:37, Said Mohamed Said <said.moha...@oist.jp> wrote:

> Even after reinstalling everything from the beginning the problem is still
> there. Right now I am out of Ideas.
>
>
>
>
> Best Regards,
>
>
> Said.
> --
> *From:* Said Mohamed Said
> *Sent:* Thursday, July 6, 2017 2:23:05 PM
> *To:* slurm-dev
> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> Thank you all for your suggestions, the only thing I can do for now is to
> uninstall and install from the beginning and I will use the most recent
> version of slurm on both nodes.
>
> For Felix who asked, the OS is CentOS 7.3 on both machines.
>
> I will let you know if that can solve the issue.
> --
> *From:* Rajul Kumar <kumar.r...@husky.neu.edu>
> *Sent:* Thursday, July 6, 2017 12:41:51 AM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Sorry for the typo
> It's generally when one of the controller or compute can reach the other
> one but it's *not* happening vice-versa.
>
>
> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar <kumar.r...@husky.neu.edu>
> wrote:
>
>> I came across the same problem sometime back. It's generally when one of
>> the controller or compute can reach to other one but it's happening
>> vice-versa.
>>
>> Have a look at the following points:
>> - controller and compute can ping to each other
>> - both share the same slurm.conf
>> - slurm.conf has the location of both controller and compute
>> - slurm services are running on the compute node when the controller says
>> it's down
>> - TCP connections are not being dropped
>> - Ports are accessible that are to be used for communication,
>> specifically response ports
>> - Check the routing rules if any
>> - Clocks are synced across
>> - Hope there isn't any version mismatch but still have a look (doesn't
>> recognize the nodes for major version differences)
>>
>> Hope this helps.
>>
>> Best,
>> Rajul
>>
>> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns <hear...@googlemail.com>
>> wrote:
>>
>>> Said,
>>>a problem like this always has a simple cause. We share your
>>> frustration, and several people here have offered help.
>>> So please do not get discouraged. We have all been in your situation!
>>>
>>> The only way to handle problems like this is
>>> a) start at the beginning and read the manuals and webpages closely
>>> b) start at the lowest level, ie here the network and do NOT assume that
>>> any component is working
>>> c) look at all the log files closely
>>> d) start daemon processes in a terminal with any 'verbose' flags set
>>> e) then start on more low-level diagnostics, such as tcpdump of network
>>> adapters and straces of the processes and gstacks
>>>
>>>
>>> you have been doing steps a b and c very well
>>> I suggest staying with these - I myself am going for Adam Huffman's
>>> suggestion of the NTP clock times.
>>> Are you SURE that on all nodes you have run the 'date' command and also
>>> 'ntpq -p'
>>> Are you SURE the master node and the node OBU-N6   are both connecting
>>> to an NTP server?   ntpq -p will tell you that
>>>
>>>
>>> And do not lose heart.  This is how we all learn.
>>>
>>> On 5 July 2017 at 16:23, Said Mohamed Said <said.moha...@oist.j

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread John Hearns
Said,
   a problem like this always has a simple cause. We share your
frustration, and several people here have offered help.
So please do not get discouraged. We have all been in your situation!

The only way to handle problems like this is
a) start at the beginning and read the manuals and webpages closely
b) start at the lowest level, ie here the network and do NOT assume that
any component is working
c) look at all the log files closely
d) start daemon processes in a terminal with any 'verbose' flags set
e) then start on more low-level diagnostics, such as tcpdump of network
adapters and straces of the processes and gstacks


you have been doing steps a b and c very well
I suggest staying with these - I myself am going for Adam Huffman's
suggestion of the NTP clock times.
Are you SURE that on all nodes you have run the 'date' command and also
'ntpq -p'
Are you SURE the master node and the node OBU-N6   are both connecting to
an NTP server?   ntpq -p will tell you that


And do not lose heart.  This is how we all learn.

On 5 July 2017 at 16:23, Said Mohamed Said  wrote:

> Sinfo -R gives "NODE IS NOT RESPONDING"
> ping gives successful results from both nodes
>
> I really can not figure out what is causing the problem.
>
> Regards,
> Said
> --
> *From:* Felix Willenborg 
> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
>
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> When the nodes change to the down state, what is 'sinfo -R' saying?
> Sometimes it gives you a reason for that.
>
> Best,
> Felix
>
> Am 05.07.2017 um 13:16 schrieb Said Mohamed Said:
>
> Thank you Adam, For NTP I did that as well before posting but didn't fix
> the issue.
>
> Regards,
> Said
> --
> *From:* Adam Huffman  
> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> I've seen something similar when node clocks were skewed.
>
> Worth checking that NTP is running and they're all synchronised.
>
> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said 
>  wrote:
> > Thank you all for suggestions. I turned off firewall on both machines but
> > still no luck. I can confirm that No managed switch is preventing the
> nodes
> > from communicating. If you check the log file, there is communication for
> > about 4mins and then the node state goes down.
> > Any other idea?
> > 
> > From: Ole Holm Nielsen 
> 
> > Sent: Wednesday, July 5, 2017 7:07:15 PM
> > To: slurm-dev
> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
> >
> >
> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
> >> in my network I encountered that managed switches were preventing
> >> necessary network communication between the nodes, on which SLURM
> >> relies. You should check if you're using managed switches to connect
> >> nodes to the network and if so, if they're blocking communication on
> >> slurm ports.
> >
> > Managed switches should permit IP layer 2 traffic just like unmanaged
> > switches!  We only have managed Ethernet switches, and they work without
> > problems.
> >
> > Perhaps you meant that Ethernet switches may perform some firewall
> > functions by themselves?
> >
> > Firewalls must be off between Slurm compute nodes as well as the
> > controller host.  See
> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#
> configure-firewall-for-slurm-daemons
> >
> > /Ole
>
>
>


[slurm-dev] Re: Question about default mail path command

2017-06-14 Thread John Hearns
Bas,
you should be able to set that value in slurm.conf when you
install/customise your Slurm setup

MailProg=/usr/bin/mail

On 14 June 2017 at 15:24, Bas van der Vlies 
wrote:

>
> I am just starting with slurm and notice that the default mail path
> command is defined in src/common/read_config.h
> {{{
> #define DEFAULT_MAIL_PROG   "/bin/mail"
> }}}
>
> This is correct for redhat/centos systems, but for debian derivates it is
> /usr/bin/mail
>
> must the variable moved to config.h and let configure generate the correct
> value?
>
> regards
>
>
> --
> ---
> Bas van der Vlies
> | Operations, Support & Development | SURFsara | Science Park 140 | 1098
> XG  Amsterdam
> | T +31 (0) 20 800 1300  | bas.vandervl...@surfsara.nl | www.surfsara.nl |
>


[slurm-dev] Re: Launching a VMWare Virtual Machine

2017-06-02 Thread John Hearns
Sean,
this sounds like the difference between login and non-login shells.

When you log in directly to the node, you have a login shell: the
environment is set up and the /etc/profile.d scripts are sourced.
Someone will be along in a minute with the correct answer, however try
submitting with #!/bin/bash -l




On 2 June 2017 at 01:18, Sean M  wrote:

> Greetings,
>
> I am trying to schedule a VMWare VM to start automatically but once the
> slurm script is submitted and executed, VMWare launches, it's window
> appears, and closes immediately without launching the VM. When I run VMWare
> with "nogui", the VM also does not run. For these cases, there are no
> errors in the VMWare or slurm logs. Also, if I schedule just VMWare to
> open, it opens but requires human interaction to launch the VM, which is
> not feasible for my use case.
>
> On my base case, I have two machines: my node is running Ubuntu Desktop 17
> and the controller Ubuntu Server.
>
> I have tried two methods.
> Method 1: My controller submits a script with the following command:
> vmrun -T ws start 
>
> Method 2: My controller executes a bash script on the node. The node's
> bash script has the following command:
> vmrun -T ws start 
>
> Both methods have the same result: the VMWare window appears briefly and
> then closes. The VM launches perfectly if I execute Method 2's bash script
> directly on the node; the bash script is owned by the same user and group
> with root access on the node and controller and has 777 rights. Here is a
> weird thing, if I change method 1's script (on the same line) to ssh into
> the node and launch the vmrun command, the VM successfully starts
> automatically. The ssh solution is not ideal because I will not know in the
> future which node will get the job. Any suggestions on how to resolve this
> issue?
>
> Thanks!
> Sean
>


[slurm-dev] Re: Multinode setup trouble

2017-05-17 Thread John Hearns
Ben, a stupid question, however - have you installed and configured Munge
authentication on the slave node?
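
A quick test of that, run from the master (the slave hostname is a
placeholder):

  munge -n | unmunge
  munge -n | ssh slave unmunge

If the second command fails, the munge keys or clocks differ between the
two machines.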

On 17 May 2017 at 02:59, Ben Mann  wrote:

> Hello Slurm dev,
>
> I just set up a small test cluster on two Ubuntu 14.04 machines, installed
> SLURM 17.02 from source. I started slurmctld, slurmdbd and slurmd on a
> master and just slurmd on a slave. When I run a job on two nodes, it
> completes instantly on master, but never on slave.
>
> Here are my .conf files, which are on a NAS and symlinked from
> /usr/local/etc/ as well as log files for the srun below
> https://gist.github.com/8enmann/0637ee2cbb6e6f5aaedef6b3c3f24a1d
>
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*   up   infinite  2   idle [91-92]
>
> $ srun -l hostname
> 0: 91.cirrascale.sci.openai.org
>
> $ srun -l -N2 hostname
> 0: 91.cirrascale.sci.openai.org
> $ srun -N2 -l hostname
> 0: 91.cirrascale.sci.openai.org
> srun: error: timeout waiting for task launch, started 1 of 2 tasks
> srun: Job step 36.0 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> $ squeue
>  JOBID PARTITION NAME USER ST   TIME  NODES
> NODELIST(REASON)
> 36 debug hostname  ben  R   8:42  2 [91-92]
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*   up   infinite  2  alloc [91-92]
>
> I'm guessing I misconfigured something, but I don't see anything in the
> logs suggesting what it might be. I've also tried cranking up verbosity and
> didn't see anything. I know it's not recommended to use root to run
> everything, but doesn't at least slurmd need root to manage cgroups?
>
> Thanks in advance!!
> Ben
>


[slurm-dev] Re: Is there anyway to commit job with different user?

2017-05-16 Thread John Hearns
Sun,
as the others have responded, you should make sure your userids are the
same across the cluster.
You really must put in the effort to do that.


However - SGE does have a usermapping feature
https://linux.die.net/man/5/sge_usermapping
I do not know if there is something similar in Slurm.




On 16 May 2017 at 14:26, E.S. Rosenberg 
wrote:

> On Tue, May 16, 2017 at 11:39 AM, Sun Chenggen 
> wrote:
>
>> Yes, user on my cluster synchronized, but I want to submit job on my
>> client machine, not on the cluster.
>>
> So only if you synchronize, for instance by making sure your UID/GID on
> your client matches your UID/GID on the cluster.
>
>>
>> From: Felip Moll 
>> Reply-To: slurm-dev 
>> Date: Tuesday, May 16, 2017, 4:25 PM
>> To: slurm-dev 
>> Subject: [slurm-dev] Re: Is there anyway to commit job with different user?
>>
>> It is not possible, at least in a supported way.
>>
>> The first requirement of the admin guide tells:
>>
>>1. Make sure the clocks, users and groups (UIDs and GIDs) are
>>synchronized across the cluster.
>>
>> From:
>>
>> https://slurm.schedmd.com/quickstart_admin.html
>>
>>
>>
>>
>>
>>
>> * -- Felip Moll Marquès*
>> Computer Science Engineer
>> E-Mail - lip...@gmail.com
>> WebPage - http://lipix.ciutadella.es
>>
>> 2017-05-16 9:24 GMT+02:00 Sun Chenggen :
>>
>>> Hi everyone:
>>> Is there anyway to commit job with different user? My slurm cluster
>>> doesn’t have the same user config as my local  slurm-client machine. If I
>>> commit job on my local machine , it failed with the message “srun: error:
>>> Application launch failed: User not found on host”.
>>> Do I have to distribute my local /etc/passwd to cluster? I don’t want to
>>> do this way. Is there a better way to commit srun job with different user
>>> account?
>>>
>>> Thanks for your help,
>>> Sun
>>>
>>
>>
>


[slurm-dev] Re: How to get pids of a job

2017-05-11 Thread John Hearns
A good tool to use on the nodes, once you have the list of nodes, is 'pgrep':
https://linux.die.net/man/1/pgrep
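
For example (user, node and program names are placeholders):

  squeue -j <jobid> -o "%N"                  # which nodes the job is on
  ssh node01 pgrep -u someuser -f my_program

'scontrol listpids <jobid>', run on the compute node, is also worth a look.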



On 11 May 2017 at 15:44, Jason Bacon  wrote:

>
>
> Parse the node names from squeue output (-o can help if you want to
> automate this) and then run ps or top on those nodes.
>
> Cheers,
>
> JB
>
> On 05/11/17 04:07, GHui wrote:
>
>> How to get pids of a job
>>
>> I want to get a job's pids on nodes. How could I do that?
>>
>> --GHui
>>
>
>
> --
> Earth is a beta site.
>


[slurm-dev] Re: Issue to startup slurm daemon on Compute nodes

2017-05-09 Thread John Hearns
Following on from Maik's response,
it would be worth mentioning the compat-glibc package for CentOS

https://centos-packages.com/7/package/compat-glibc/
https://www.centos.org/forums/viewtopic.php?t=22250

Big get out of jail card - I have never built any version of Slurm on a
CentOS 7 system using the compat-glibc libraries!!!



On 9 May 2017 at 15:59, Maik Schmidt  wrote:

> It means you have to build SLURM on the node with the oldest glibc that
> you might still have in your cluster. It will then also run on the ones
> with newer glibc versions, just not the other way around.
>
> Best,
> Maik
>
>
> Am 09.05.2017 um 15:49 schrieb J. Smith:
>
>> Hi,
>>
>> I have compiled slurm v17.02.2 on Master Nodes running CentOS7.
>> I have no issue to startup slurm on the Master nodes but I am unable to
>> start the daemon on the Compute Nodes running on CentOS6. It is looking
>> for
>> GLIBC 2.14 which is not available on our compute Nodes(using glibc-2.12).
>>
>> Error:
>> service slurm status
>> /home/share/slurm/17.02.2/bin/scontrol: /lib64/libc.so.6: version
>> `GLIBC_2.14' not found (required by /home/share/slurm/17.02.2/bin/
>> scontrol)
>>
>> Does that mean that slurm will only work accross compute nodes running
>> CentOS7 and not CentOS6? Any suggestions?
>>
>> Thanks!
>>
>> --
> Maik Schmidt
> HPC Services
>
> Technische Universität Dresden
> Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
> Willers-Bau A116
> D-01062 Dresden
> Telefon: +49 351 463-32836
>
>
>


[slurm-dev] Re: Communication error

2017-05-08 Thread John Hearns
Jason, note that compute-2-18 is in IDLE* status - which means that it is
not reachable.
As Felip suggests, log into that compute node and  tail -f
/var/log/slurmd.log

I would also suggest, on your master node, running scontrol to set that
node to DRAIN and then RESUME,
then logging into the node and (re)starting the slurmd daemon.

I have had to do that when nodes have 'got their knickers in a  twist' when
I was playing around with the power up/down at idle settings in the past.
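
Something along these lines, with the node name purely as an example:

# On the controller: drain the node first
scontrol update NodeName=compute-2-18 State=DRAIN Reason="slurmd unreachable"
# On the node: restart the daemon (systemd shown; adjust for your init system)
systemctl restart slurmd
# Back on the controller: return the node to service
scontrol update NodeName=compute-2-18 State=RESUME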







On 8 May 2017 at 16:02, Felip Moll  wrote:

> Do you have any kind of firewall in your network?
>
> I would suggest it is a problem with dates but since you tested munge -n
> we could discard that.
> Can you anyway do a pdsh -w compute-* date |dshbak -c ?
> Can you show nodes slurmd log output?
>
>
>
> *--Felip Moll Marquès*
> Computer Science Engineer
> E-Mail - lip...@gmail.com
> WebPage - http://lipix.ciutadella.es
>
> 2017-05-02 17:19 GMT+02:00 Jason Bacon :
>
>>
>>
>> I have a perplexing error since reimaging a compute node.  I've found a
>> few similar issues on Google, none of which were resolved.  Everything
>> seems to be fine until I try to run a job on it.
>>
>> Any suggestions about where to look for clues would be appreciated.
>>
>> Thanks,
>>
>> Jason
>>
>> [r...@login.avi bacon]# scontrol show nodes compute-2-17
>> NodeName=compute-2-17 Arch=x86_64 CoresPerSocket=4
>>CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
>>Gres=(null)
>>NodeAddr=compute-2-17 NodeHostName=compute-2-17 Version=15.08
>>OS=Linux RealMemory=24000 AllocMem=0 FreeMem=18357 Sockets=2 Boards=1
>>State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
>>BootTime=2017-04-22T18:55:28 SlurmdStartTime=2017-04-28T11:19:33
>>CapWatts=n/a
>>CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> [r...@login.avi bacon]# scontrol show nodes compute-2-18
>> NodeName=compute-2-18 Arch=x86_64 CoresPerSocket=4
>>CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.05 Features=(null)
>>Gres=(null)
>>NodeAddr=compute-2-18 NodeHostName=compute-2-18 Version=15.08
>>OS=Linux RealMemory=24000 AllocMem=0 FreeMem=23736 Sockets=2 Boards=1
>>State=IDLE* ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
>>BootTime=2017-04-28T11:14:23 SlurmdStartTime=2017-04-28T11:19:43
>>CapWatts=n/a
>>CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> [r...@login.avi bacon]# munge -n | ssh compute-2-17 unmunge
>> STATUS:   Success (0)
>> ENCODE_HOST:  login.avi.hpc.uwm.edu (129.89.58.18)
>> ENCODE_TIME:  2017-04-28 11:20:36 -0500 (1493396436)
>> DECODE_TIME:  2017-04-28 11:20:36 -0500 (1493396436)
>> TTL:  300
>> CIPHER:   aes128 (4)
>> MAC:  sha1 (3)
>> ZIP:  none (0)
>> UID:  root (0)
>> GID:  root (0)
>> LENGTH:   0
>>
>> [r...@login.avi bacon]# munge -n | ssh compute-2-18 unmunge
>> STATUS:   Success (0)
>> ENCODE_HOST:  login.avi.hpc.uwm.edu (129.89.58.18)
>> ENCODE_TIME:  2017-04-28 11:20:39 -0500 (1493396439)
>> DECODE_TIME:  2017-04-28 11:20:40 -0500 (1493396440)
>> TTL:  300
>> CIPHER:   aes128 (4)
>> MAC:  sha1 (3)
>> ZIP:  none (0)
>> UID:  root (0)
>> GID:  root (0)
>> LENGTH:   0
>>
>> [r...@login.avi bacon]# sinfo
>> PARTITIONAVAIL  TIMELIMIT  NODES  STATE NODELIST
>> batch*  up   infinite  8   drng compute-5-[07-14]
>> batch*  up   infinite 29mix compute-2-[03-14],compute-5-[1
>> 5-28,31,33-34]
>> batch*  up   infinite 66  alloc compute-1-[01-05],compute-2-[1
>> 9-36],compute-3-[01-20],compute-4-[01-14],compute-5-[01-06,29-30,32]
>> batch*  up   infinite 37   idle compute-1-[06-36],compute-2-[0
>> 1-02,15-18]
>> batch-nice  up   infinite  8   drng compute-5-[07-14]
>> batch-nice  up   infinite 29mix compute-2-[03-14],compute-5-[1
>> 5-28,31,33-34]
>> batch-nice  up   infinite 66  alloc compute-1-[01-05],compute-2-[1
>> 9-36],compute-3-[01-20],compute-4-[01-14],compute-5-[01-06,29-30,32]
>> batch-nice  up   infinite 37   idle compute-1-[06-36],compute-2-[0
>> 1-02,15-18]
>> highmem up   infinite  2   idle compute-5-[35-36]
>> highmem-niceup   infinite  2   idle compute-5-[35-36]
>>
>> [r...@login.avi bacon]# srun --nodelist=compute-2-17 hostname
>> compute-2-17.avi
>> [r...@login.avi bacon]# srun --nodelist=compute-2-18 hostname
>> srun: error: Task launch for 360081.0 failed on node compute-2-18:
>> Communication connection failure
>> srun: error: Application launch failed: Communication connection failure
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>
>> /var/log/slurm/slurmctld:
>>
>> [2017-04-28T11:25:09.002] 

[slurm-dev] RE: Two different GPUs in a compute node (my own answer)

2017-05-04 Thread John Hearns
Daniel,
I think that you do not need the CPUs=  at all.

Also look at specifying the use of cgroups. Then when you run a job and
request one GPU, that GPU will be made available to you as
CUDA_VISIBLE_DEVICES.
The other GPU will not be available to you - but can be used by another
batch job.
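
A minimal sketch of that setup, config fragments only (the gres.conf File=
lines stay as in your message below; exact plugin choices depend on your
version and site policy):

# slurm.conf (fragment - sketch)
GresTypes=gpu
TaskPlugin=task/cgroup
# node definition keeps Gres=gpu:2 as you already have it

# cgroup.conf (fragment - sketch): confine each job to the devices it was allocated
ConstrainDevices=yes

# A one-GPU job then only sees the card it was given:
#   srun --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES'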



On 4 May 2017 at 13:34, Daniel Ruiz Molina  wrote:

> Hello,
>
> I have reconfigured slurm:
>
>- slurmd.conf: NodeName=mynode CPUs=8 SocketsPerBoard=1
>CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7812 TmpDisk=50268 Gres=gpu:2
>(without specify gpu model)
>
>
>- gres.conf: two separate lines:
>
> NodeName=mynode Name=gpu Count=1 Type=GeForceGTX680 File=/dev/nvidia0
> CPUs=0-3
> NodeName=mynode Name=gpu Count=1 Type=GeForceGTX1080 File=/dev/nvidia1
> CPUs=4-7
>
>
> With this configuration, slurm starts OK... but I think it would be also
> correct both lines with "CPUs=0-7", isn't it? Because if not, how could I
> use all CPUs with only one GPU?
>
> Thanks.
>
>


[slurm-dev] Re: Announce: Node status tool "pestat" for Slurm updated to version 0.30

2017-05-03 Thread John Hearns
Ole, thankyou.  That works for me!

And a plus one for python-hostlist  - so very useful.



On 3 May 2017 at 13:34, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:

>
> Hi John,
>
> Thanks for your request for HTTP access.  I've configured our web-server
> for providing the FTP files via HTTP also, please see:
>
> http://ftp.fysik.dtu.dk/Slurm/
>
> Does that work for you?
>
> /Ole
>
> On 05/03/2017 12:02 PM, John Hearns wrote:
>
>> Ole,
>> a small ask. Is it possible to put the 'pestat' utility for Slurm and
>> for PBS on a site which uses HTTP?
>> The reason is many (most ?) corporate networks block ftp access.
>>
>> Thankyou
>>
>>
>> On 3 May 2017 at 09:06, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>>
>>
>> I'm announcing an updated version 0.30 of the node status tool
>> "pestat" for Slurm.
>>
>> Download the tool (a short bash script) from
>> ftp://ftp.fysik.dtu.dk/pub/Slurm/pestat
>>
>> New options have been added as shown by the help information:
>>
>> # pestat -h
>> Usage: pestat [-p partition(s)] [-u username] [-q qoslist] [-f] [-V]
>> [-h]
>> where:
>> -p partition: Select only partion 
>> -u username: Print only user 
>> -q qoslist: Print only QOS in the qoslist 
>> -f: Print only nodes that are flagged by * (unexpected load
>> etc.)
>> -h: Print this help information
>> -V: Version information
>>
>> I use "pestat -f" all the time because it prints and flags (in
>> color) only the nodes which have an unexpected CPU load or node
>> status.
>>
>


[slurm-dev] Re: Announce: Node status tool "pestat" for Slurm updated to version 0.30

2017-05-03 Thread John Hearns
Ole,
a small ask. Is it possible to put the 'pestat' utility for Slurm and for
PBS on a site which uses HTTP?
The reason is many (most ?) corporate networks block ftp access.

Thankyou


On 3 May 2017 at 09:06, Ole Holm Nielsen  wrote:

>
> I'm announcing an updated version 0.30 of the node status tool "pestat"
> for Slurm.
>
> Download the tool (a short bash script) from
> ftp://ftp.fysik.dtu.dk/pub/Slurm/pestat
>
> New options have been added as shown by the help information:
>
> # pestat -h
> Usage: pestat [-p partition(s)] [-u username] [-q qoslist] [-f] [-V] [-h]
> where:
> -p partition: Select only partion 
> -u username: Print only user 
> -q qoslist: Print only QOS in the qoslist 
> -f: Print only nodes that are flagged by * (unexpected load etc.)
> -h: Print this help information
> -V: Version information
>
> I use "pestat -f" all the time because it prints and flags (in color) only
> the nodes which have an unexpected CPU load or node status.
>
> --
> Ole Holm Nielsen
> Department of Physics, Technical University of Denmark
>


[slurm-dev] RE: Fwd: job requeued in held state

2017-04-03 Thread John Hearns

Chris,
can the user start an 'srun' session?

From: Chris Woelkers - NOAA Affiliate [chris.woelk...@noaa.gov]
Sent: 03 April 2017 20:31
To: slurm-dev
Subject: [slurm-dev] Fwd: job requeued in held state

I am running a small HPC, only 24 nodes, via slurm and am having an
issue where one of the users is unable to submit any jobs.
The user is new and whenever a job is submitted it shows the "job
requeued in held state" state and is never actually ran. We have left
the job sitting for over three days and it does not start. We have
tried releasing the job and it does not start. Here are the log
entries after an attempted release:

[2017-04-03T19:16:24.173] sched: update_job: releasing hold for job_id
1938 uid 0
[2017-04-03T19:16:24.174] _slurm_rpc_update_job complete JobId=1938
uid=0 usec=375
[2017-04-03T19:16:24.919] sched: Allocate JobId=1938
NodeList=rhinonode[07-14] #CPUs=192
[2017-04-03T19:16:25.017] _slurm_rpc_requeue: Processing RPC:
REQUEST_JOB_REQUEUE from uid=0
[2017-04-03T19:16:25.035] Requeuing JobID=1938 State=0x0 NodeCnt=0

The user has the same permissions as the older users that can run jobs.
The script that is being run is a simple test script and no matter
where the output is redirected, an NFS mount(for our SAN), the local
home directory, or the tmp directory, the result is the same.

Any idea as to what might be happening?

Thanks,

Chris Woelkers
Caelum Research Corp.
Linux Server and Network Administrator
NOAA GLERL


[slurm-dev] RE: Does slurm work well with Supermicro KNL Phi boards?

2017-04-01 Thread John Hearns

Kenneth,
   I can't answer your question directly.
However, I have quite a lot of recent experience using syscfg on Intel-brand 
servers and motherboards,
for an Omnipath cluster at a UK university and my own benchmarking cluster.
I find syscfg to be an excellent tool. No more BIOS Settings by crawling around 
with a keyboard and monitor!

One thing though - you need the latest version.
I found this the hard way - by using a slightly out of date version which 
refused to make BIOS changes.
Simple upgrade - then worked fine.

Checking the release notes for syscfg:
https://downloadmirror.intel.com/26365/eng/ReleaseNotes.txt

I found this very good guide from Prace on Xeon Phi best practices:
http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017/



From: Kenneth Chiu [kc...@binghamton.edu]
Sent: 01 April 2017 23:06
To: slurm-dev
Subject: [slurm-dev] Does slurm work well with Supermicro KNL Phi boards?

Does slurm also work well with Supermicro KNL boards, such as

https://www.supermicro.com/products/system/2U/5028/SYS-5028TK-HTR.cfm

My understanding is that slurm uses syscfg, provided by Intel, to configure a 
KNL node:

https://slurm.schedmd.com/intel_knl.html

I'm not sure if syscfg works on those boards, and whether or not slurm has a 
workaround if it doesn't.


[slurm-dev] RE: Suggestions on node memory cleaning

2017-03-30 Thread John Hearns
I think this thread has the answer 
http://askubuntu.com/questions/609226/freeing-page-cache-using-echo-3-proc-sys-vm-drop-caches-doesnt-work

echo 3 | sudo tee /proc/sys/vm/drop_caches
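
In Slurm the Prolog (Prolog= in slurm.conf) runs as root by default on each
allocated node, so a sketch like this avoids the sudo/redirect dance entirely
(the path is just an example):

#!/bin/bash
# /etc/slurm/prolog.sh (example path; set Prolog=/etc/slurm/prolog.sh in slurm.conf)
# Flush the page cache before the job starts.
sync
echo 3 > /proc/sys/vm/drop_caches
exit 0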


From: John Hearns [mailto:john.hea...@xma.co.uk]
Sent: 30 March 2017 17:11
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Suggestions on node memory cleaning

Aha, follow this thread   
http://www.beowulf.org/pipermail/beowulf/2013-April/031407.html


From: John Hearns [mailto:john.hea...@xma.co.uk]
Sent: 30 March 2017 17:07
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Suggestions on node memory cleaning

Chad,
I did rather a lot of work on that issue in my last job, with PBSPro actually.
We wanted a PBSPro prolog script which would flush the caches using an echo 3 > 
/proc/sys/vm/drop_caches before the job is run.
So far so good.  The wrinkle I found with PBSPro is that the prolog is run 
under the user id.
You need to be root to run that command – so I put it in /etc/sudoers and did:
sudo echo 3 > /proc/sys/vm/drop_caches

As I remember that does not work as written - the redirect is performed by the 
(non-root) shell, not by the sudo'd echo - but you can get it to work, e.g. with 
the tee form above.





From: Chad Cropper [mailto:chad.crop...@genusplc.com]
Sent: 30 March 2017 16:53
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Suggestions on node memory cleaning

We would like to clean our memory buffers/cache on a regular basis without 
rebooting nodes. We can easily do this manually with “sync; echo 3 > 
/proc/sys/vm/drop_caches“.  Is anyone else out there doing anything like this? 
Does SLURM offer anything builtin for running this when it sees an empty node? 
Outside of rotating nodes into a drain state for maintenance and then running 
this command, I have yet to see any other option. Any suggestions are greatly 
appreciated.

-Chad Cropper




[slurm-dev] RE: Suggestions on node memory cleaning

2017-03-30 Thread John Hearns
Aha, follow this thread   
http://www.beowulf.org/pipermail/beowulf/2013-April/031407.html


From: John Hearns [mailto:john.hea...@xma.co.uk]
Sent: 30 March 2017 17:07
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Suggestions on node memory cleaning

Chad,
I did rather a lot of work on that issue in my last job, with PBSPro actually.
We wanted a PBSPro prolog script which would flush the caches using an echo 3 > 
/proc/sys/vm/drop_caches before the job is run.
So far so good.  The wrinkle I found with PBSPro is that the prolog is run 
under the user id.
You need to be root to run that command – so I put it in /etc/sudoers and did:
sudo echo 3 > /proc/sys/vm/drop_caches

As I remember that does not work as written - the redirect is performed by the 
(non-root) shell, not by the sudo'd echo - but you can get it to work.





From: Chad Cropper [mailto:chad.crop...@genusplc.com]
Sent: 30 March 2017 16:53
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Suggestions on node memory cleaning

We would like to clean our memory buffers/cache on a regular basis without 
rebooting nodes. We can easily do this manually with “sync; echo 3 > 
/proc/sys/vm/drop_caches“.  Is anyone else out there doing anything like this? 
Does SLURM offer anything builtin for running this when it sees an empty node? 
Outside of rotating nodes into a drain state for maintenance and then running 
this command, I have yet to see any other option. Any suggestions are greatly 
appreciated.

-Chad Cropper




[slurm-dev] RE: Query about web front ends to slurm

2017-03-29 Thread John Hearns

Sean,
I cannot say if this satisfies your requirements; however, in the past I have 
worked with EnginFrame.
A new version was recently released. It certainly does work with Slurm:

https://www.nice-software.com/products/enginframe

The LDAP (Active Directory) integration works as a user mapping, as I remember,
i.e. if a user can authenticate against AD then you can define which Linux user 
that corresponds to.




From: Sean Doyle [sdo...@gmail.com]
Sent: 29 March 2017 20:56
To: slurm-dev
Subject: [slurm-dev] Query about web front ends to slurm

Hi -

I see that there are a number of libraries that support web dashboards for 
slurm and there are also some that handle job submission (like MyCluster). 
However - I haven't seen a project that combines the two of these.  Does one 
exist?

Here's the problem that I'm trying to solve: I work in a medical environment 
where we want to keep strong privacy walls between different users running 
Docker containers on a cluster. Different research groups have access to 
distinct file shares that we can load dynamically. I'd like to be able to 
submit slurm jobs through a web portal which was hooked up to our LDAP server 
for authentication.

Thanks for any suggestions.

Sean


[slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

2017-03-18 Thread John Hearns

Kesim,
   Touche Sir. I agree with you.


From: kesim [ketiw...@gmail.com]
Sent: 18 March 2017 18:06
To: slurm-dev
Subject: [slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

Dear John,

Thank you for your answer. Obviously you are right that I could slurm up 
everything and thus avoid the issue and your points are taken. However, I still 
insist that it is a serious bug not to take into account the actual CPU load 
when the scheduler submit a job regardless whose fault it is that a non-slurm 
job is running. I would not suspect that from even simplest scheduler and if I 
had such prior knowledge I would not invest so much time and effort  to setup 
slurm.
Best regards,

Ketiw

On Sat, Mar 18, 2017 at 5:42 PM, John Hearns <john.hea...@xma.co.uk> wrote:

Kesim,

what you are saying is that Slurm schedules tasks based on the number of 
allocated CPUs, rather than the actual load factor on the server.
As I recall Gridengine actually used the load factor.

However you comment that "users run programs on the nodes" and "the slurm is 
aware about the load of non-slurm jobs"
IMHO, in any well-run HPC setup any user running jobs without using the 
scheduler would have their fingers broken, or at least bruised with the clue 
stick.

Seriously, three points:

a) tell users to use 'salloc' and 'srun'  to run interactive jobs. They can 
easily open a Bash session on a compute node and do what they like. Under the 
Slurm scheduler.

b) implement the pam_slurm PAM module (see the sketch after this list). It is a 
few minutes' work. This means your users cannot go behind the Slurm scheduler 
and log into the nodes.

c) on Bright clusters, which I configure, you have a healthcheck running which 
warns you when a user is detected as logging in without using Slurm.
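
For point (b), a minimal sketch, assuming a RHEL-style PAM stack and the
classic pam_slurm module (pam_slurm_adopt is the newer alternative):

# /etc/pam.d/sshd on the compute nodes (fragment)
# Refuse ssh logins unless the user has a job running on this node
account    required     pam_slurm.so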


Seriously again. You have implemented an HPC infrastructure, and have gone to 
the time and effort to implement a batch scheduling system.
A batch scheduler can be adapted to let your users do their jobs, including 
interactive shell sessions and remote visualization sessions.
Do not let the users ride roughshod over you.


From: kesim [ketiw...@gmail.com]
Sent: 18 March 2017 16:16
To: slurm-dev
Subject: [slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

Unbelievable but it seems that nobody knows how to do that. It is astonishing 
that such sophisticated system fails with such simple problem. The slurm is 
aware about the cpu load of non-slurm jobs but it does not use the info. My 
original understanding of LLN was apparently correct. I can practically kill 
the CPUs on particular node with nonslurm tasks but slurm will diligently 
submit 7 jobs to this node leaving other idling.  I consider this as a serious 
bug of this program.


On Fri, Mar 17, 2017 at 10:32 AM, kesim <ketiw...@gmail.com> wrote:
Dear All,
Yesterday I did some tests and it seemed that the scheduling is following CPU 
load but I was wrong.
My configuration is at the moment:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU,CR_LLN

Today I submitted 70 threaded jobs to the queue and here is the CPU_LOAD info
node1   0.08  7/0/0/7
node2   0.01  7/0/0/7
node3   0.00  7/0/0/7
node4   2.97  7/0/0/7
node5   0.00  7/0/0/7
node6   0.01  7/0/0/7
node7   0.00  7/0/0/7
node8   0.05  7/0/0/7
node9   0.07  7/0/0/7
node10  0.38  7/0/0/7
node11  0.01  0/7/0/7
As you can see it allocated 7 CPUs on node 4 with CPU_LOAD 2.97 and 0 CPUs on 
idling node11. Why such simple thing is not a default? What am I missing???

On Thu, Mar 16, 2017 at 7:53 PM, kesim <ketiw...@gmail.com> wrote:
Than you for great suggestion. It is working! However the description of CR_LLN 
is misleading "Schedule resources to jobs on the least loaded nodes (based upon 
the number of idle CPUs)" Which I understood that if the two nodes has not 
fully allocated CPUs  the node with smaller number of allocated CPUs will take 
precedence. Therefore the bracketed comment should be removed from the 
description.

On Thu, Mar 16, 2017 at 6:24 PM, Paul Edmon <ped...@cfa.harvard.edu> wrote:

You should look at LLN (least loaded nodes):

https://slurm.schedmd.com/slurm.conf.html

That should do what you want.

-Paul Edmon-

On 03/16/2017 12:54 PM, kesim wrote:

-- Forwarded message --
From: kesim <ketiw...@gmail.com>

[slurm-dev] RE: Job-Specific Working Directory on Local Scratch

2017-03-13 Thread John Hearns
Stefan,
regarding the Prolog/Task Prolog option, David Lee Braun sent me a 
comprehensive reply on that one back in January.
The answer is that you have to set TMPDIR in a separate 
/etc/profile.d/slurm.sh.
The Prolog creates the directory OK, but the TMPDIR variable is only set if 
a profile.d script is used (I am sure there are other ways).

See this thread:  
https://groups.google.com/forum/#!topic/slurm-devel/kqzPbN8NpkQ

And thanks again to David
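
A rough sketch of that profile.d approach (the scratch path is only an
example, and the Prolog is assumed to have created the directory already):

# /etc/profile.d/slurm.sh (sketch)
# Point TMPDIR at the per-job directory the Prolog created, e.g. /scratch/<jobid>
if [ -n "$SLURM_JOB_ID" ]; then
    export TMPDIR=/scratch/$SLURM_JOB_ID
fi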



From: Stefan Seritan [mailto:sseri...@stanford.edu]
Sent: 13 March 2017 17:19
To: slurm-dev 
Subject: [slurm-dev] Job-Specific Working Directory on Local Scratch

Hey,

I am trying to set up SLURM to use job-specific working directories on local 
scratch for sbatch jobs, with the form /u/$USER/$SLURM_JOB_ID. Here's what I 
have tried:

- Manual: Users can put cd $SCRATCH, and copy the files they need. This works, 
but output files would not be copied back in case of things like jobs being 
killed due to exceeding resource limits.

- Prolog/TaskProlog: I can create the scratch directory and export a $SCRATCH 
environment variable, but I cannot cd into it or set any time of working 
directory environment variable.

- job_submit.lua: Have access to working directory variable and $USER, but 
don't have $SLURM_JOB_ID (job_desc.job_id and job_desc.job_id_str are not set)

- SPANK plugin: I can see the scratch directory that was created by the prolog, 
but again I cannot cd into it or change any kind of working directory 
environment. I've tried init, user_init, task_init_privileged, and task_init, 
and none of them worked for me.

Any suggestions on how to set this up would be great.

--
Stefan Seritan


[slurm-dev] gres/mic unable to set OFFLOAD_DEVICES

2017-02-28 Thread John Hearns
Some pointers appreciated please.
I suspect this is a common error message. Slurm version 16.05.8.
In the slurmd logs on compute nodes I am seeing this:

[2017-02-27T20:25:04.886] error: gres/mic unable to set OFFLOAD_DEVICES, no 
device files configured
[2017-02-27T20:25:04.898] _run_prolog: run job script took usec=11995
[2017-02-27T20:25:04.898] _run_prolog: prolog with lock for job 714 ran for 0 
seconds
[2017-02-27T20:25:04.898] Launching batch job 714 for UID 1991667182
[2017-02-27T20:25:04.979] [714] error: gres/mic unable to set OFFLOAD_DEVICES, 
no device files configured
[2017-02-27T23:42:17.732] [714] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 
status 0
[2017-02-27T23:42:17.734] [714] done with job

These compute nodes have no GPUs or Xeon Phis, and there is no gres defined for 
them in the slurm.conf
There ARE some GPU equipped nodes on this cluster, with a gres defined

In slurm.conf, the compute nodes are gm-hpc-01 to gm-hpc-52:

# Nodes
NodeName=gm-hpc-[01-52]  Procs=20
NodeName=gm-hpc-gpu-[01-05]  Procs=20 Gres=gpu:4

PartitionName=stdcomp Default=YES MinNodes=1 MaxNodes=52  MaxTime=1-12:00:00 
AllowGroups=ALL Priority=1 DisableRootJobs=YES RootOnly=NO Hidden=NO Shared=NO 
GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO 
ExclusiveUser=NO PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO State=UP 
MaxMemPerNode=102400 
Nodes=gm-hpc-[01-32],gm-hpc-[37-52]


[slurm-dev] RE: Slurm with sssd - limits help please

2017-02-16 Thread John Hearns
Thankyou!
In this case this option in slurm.conf did the trick:

PropagateResourceLimitsExcept=MEMLOCK

I
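
For the record, a sketch of the two halves of the fix - the slurm.conf line
above plus raising slurmd's own limit under systemd (the standard drop-in
mechanism; the drop-in file name is just an example):

# slurm.conf: do not propagate the submit shell's locked-memory limit;
# the job then inherits slurmd's limit on the compute node
PropagateResourceLimitsExcept=MEMLOCK

# /etc/systemd/system/slurmd.service.d/memlock.conf (drop-in, sketch)
[Service]
LimitMEMLOCK=infinity

# then: systemctl daemon-reload && systemctl restart slurmd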

From: Guy Coates [mailto:guy.coa...@gmail.com]
Sent: 16 February 2017 17:05
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Slurm with sssd - limits help please

You should double check the pam config too:

https://slurm.schedmd.com/faq.html#pam

Thanks,

Guy

On 16 February 2017 at 15:17, John Hearns <john.hea...@xma.co.uk> wrote:
My answer is here maybe?

https://slurm.schedmd.com/faq.html#memlock

RTFM ?

From: John Hearns [mailto:john.hea...@xma.co.uk]
Sent: 16 February 2017 15:13
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Slurm with sssd - limits help please

Looks like there are others out there using slurm with sssd authentication, 
based on a quick mailing list search.

Forgive me if I have not understood something here.
On the cluster I am configuring at the moment, looking at the slurm daemon on 
the compute nodes
it has the max locked memory limit set to unlimited.
/etc/security/limits.conf has soft and hard max locked memory set to unlimited 
for all users.

I tried to run an MPI job and got the familiar warning that there was not 
enough locked memory available..
Looked at /etc/security/limits.conf as I have been bitten by that one before…

I log in using an srun and sure enough:

max locked memory   (kbytes, -l) 64

It looks to me like the limits are being inherited from the /usr/sbin/sssd 
daemon.
In turn, this is started by systemd….
(cue dark clouds, the howling of a wolf and the sound of thunder off-stage)
At the moment I am looking at how to increase limits with systemd

Please, someone tell me I’m an idiot and there is an easy way to do this.
I weep though – is systemd really losing us the dependable and well-known ways 
of doing things with limits.conf??


[slurm-dev] RE: Slurm with sssd - limits help please

2017-02-16 Thread John Hearns
My answer is here maybe?

https://slurm.schedmd.com/faq.html#memlock

RTFM ?

From: John Hearns [mailto:john.hea...@xma.co.uk]
Sent: 16 February 2017 15:13
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Slurm with sssd - limits help please

Looks like there are others out there using slurm with sssd authentication, 
based on a quick mailing list search.

Forgive me if I have not understood something here.
On the cluster I am configuring at the moment, looking at the slurm daemon on 
the compute nodes
it has the max locked memory limit set to unlimited.
/etc/security/limits.conf has soft and hard max locked memory set to unlimited 
for all users.

I tried to run an MPI job and got the familiar warning that there was not 
enough locked memory available..
Looked at /etc/security/limits.conf as I have been bitten by that one before…

I log in using an srun and sure enough:

max locked memory   (kbytes, -l) 64

It looks to me like the limits are being inherited from the /usr/sbin/sssd 
daemon.
In turn, this is started by systemd….
(cue dark clouds, the howling of a wolf and the sound of thunder off-stage)
At the moment I am looking at how to increase limits with systemd

Please, someone tell me I’m an idiot and there is an easy way to do this.
I weep though – is systemd really losing us the dependable and well-known ways 
of doing things with limits.conf??


[slurm-dev] Slurm with sssd - limits help please

2017-02-16 Thread John Hearns
Looks like there are others out there using slurm with sssd authentication, 
based on a quick mailing list search.

Forgive me if I have not understood something here.
On the cluster I am configuring at the moment, looking at the slurm daemon on 
the compute nodes
it has the max locked memory limit set to unlimited.
/etc/security/limits.conf has soft and hard max locked memory set to unlimited 
for all users.

I tried to run an MPI job and got the familiar warning that there was not 
enough locked memory available..
Looked at /etc/security/limits.conf as I have been bitten by that one before...

I log in using an srun and sure enough:

max locked memory   (kbytes, -l) 64

It looks to me like the limits are being inherited from the /usr/sbin/sssd 
daemon.
In turn, this is started by systemd.
(cue dark clouds, the howling of a wolf and the sound of thunder off-stage)
At the moment I am looking at how to increase limits with systemd

Please, someone tell me I'm an idiot and there is an easy way to do this.
I weep though - is systemd really losing us the dependable and well-known ways 
of doing things with limits.conf??


[slurm-dev] Re: Abaqus with Slurm

2017-02-13 Thread John Hearns
In case anyone is interested, here is my solution to creating the mp_host_list

First install the very useful python-hostlist   
https://www.nsc.liu.se/~kent/python-hostlist/
Then use the hostlist command to output the mp_host_list
The syntax is a bit convoluted but it works (under bash):

mp_host_list=$(hostlist  --append-slurm-tasks=$SLURM_TASKS_PER_NODE  -d -e -p 
[\' -a \',  -s ], $SLURM_NODELIST)
# put the keyword string on mp_host_list and add [ ] brackets . Need a second 
trailing bracket at the end
mp_host_list="mp_host_list=["${mp_host_list}"]]"
echo "${mp_host_list}">>abaqus_v6.env

Now to find the correct MPI incantation to get the darn thing running over 
Omnipath.
I foresee meetings at a crossroads at midnight and selling my soul to a shadowy 
figure




-----Original Message-
From: John Hearns [mailto:john.hea...@xma.co.uk]
Sent: 09 February 2017 12:03
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Abaqus with Slurm

Sean, much thankyou.
Guinness owed if I am ever in Temple Bar soon.

-Original Message-
From: Sean McGrath [mailto:smcg...@tchpc.tcd.ie]
Sent: 09 February 2017 11:58
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Abaqus with Slurm


Hi,

We have slurm 16.05.4 and the latest version of Abaqus we use is 6.14. I 
remember running into a similar problem with Abaqus so I wrote some bad bash to 
populate the host list file; http://www.tchpc.tcd.ie/node/1261

The github script seems to be doing something similar but in a better way.

Yes, I too remember thinking it was poorly designed or implemented in Abaqus at 
that time too.

Regards

Sean

On Thu, Feb 09, 2017 at 03:16:09AM -0800, John Hearns wrote:

> I would guess quite a few sites are using Abaqus with Slurm. I would be 
> grateful for some pointers on the submission scripts for MPI parallel Abaqus 
> runs.
> I am setting up Abaqus version 6.14-1 on a system with Slurm 16.05 and an 
> Omnipath interconnect.
>
> Specifically I am using this script to create the mp_host_list and to
> add it to the abaqus_v6.env file
> https://github.com/nesi/hpc-workflows/blob/master/EnvironmentSetupScri
> pts/slurm_setup_abaqus-env.sh
>
> This just seems to go against the grain with me - Abaqus seems to be 'LSF 
> aware' and can create mp_host_list when run with LSF.
> Is Abaqus not more 'slurm aware' ??
> If the advice is 'use the 2016 version' then this advice will be hoisted 
> aboard.
>
> John H

--
Sean McGrath M.Sc

Systems Administrator
Trinity Centre for High Performance and Research Computing Trinity College 
Dublin

sean.mcgr...@tchpc.tcd.ie

https://www.tcd.ie/
https://www.tchpc.tcd.ie/

+353 (0) 1 896 3725

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread John Hearns
Thanks to Ryan,  Sarlo and Sean.

> "Killed" isn't usually a helpful error message that they understand.
Au contraire, I usually find that is a message they understand. Pour 
encourager les autres, you understand.





-Original Message-
From: Ryan Cox [mailto:ryan_...@byu.edu]
Sent: 09 February 2017 15:31
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Stopping compute usage on login nodes


John,

We use /etc/security/limits.conf to set cputime limits on processes:
* hard cpu 60
root hard cpu unlimited

It works pretty well but long running file transfers can get killed.  We have a 
script that looks for whitelisted programs to remove the limit from on a 
periodic basis.  We haven't experienced problems with this approach in users 
(that anyone has reported to us, at least).  Threaded programs get killed more 
quickly than multi-process programs since the limit is per process.

Additionally, we use cgroups for limits in a similar way to Sean but with an 
older approach than pam_cgroup.  We also use the cpu cgroup rather than cpuset 
because it doesn't limit them to particular CPUs and doesn't limit them when no 
one else is running (it's shares-based).  We also have an OOM notifier daemon 
that writes to a user's tty so they know if they ran out of memory.  "Killed" 
isn't usually a helpful error message that they understand.

We have this in a github repo: https://github.com/BYUHPC/uft.
Directories that may be useful include cputime_controls, oom_notifierd, 
loginlimits (lets users see their cgroup limits with some explanations).

Ryan

On 02/09/2017 07:18 AM, Sean McGrath wrote:
> Hi,
>
> We use cgroups to limit usage to 3 cores and 4G of memory on the head
> nodes. I didn't do it but will copy and paste in our documentation below.
>
> Those limits, 3 cores are 4G are global to all non root users I think
> as they apply to a group. We obviously don't do this on the nodes.
>
> We also monitor system utilisation with nagios and will intervene if needed.
> Before we had cgroups in place I very occasionally had to do a pkill
> -u baduser and lock them out temporarily until the situation was explained to 
> them.
>
> Any questions please let me know.
>
> Sean
>
>
>
> = How to configure Cgroups locally on a system =
>
> This is a step-to-step guide to configure Cgroups locally on a system.
>
>  1. Install the libraries to control Cgroups and to enforce it via
> PAM 
>
> $ yum install libcgroup libcgroup-pam
>
>  2. Load the Cgroups module on PAM 
>
> 
> $ echo "session    required    pam_cgroup.so" >> /etc/pam.d/login
> $ echo "session    required    pam_cgroup.so" >> /etc/pam.d/password-auth-ac
> $ echo "session    required    pam_cgroup.so" >> /etc/pam.d/system-auth-ac
> 
>
>  3. Set the Cgroup limits and associate them to a user group 
>
> add to /etc/cgconfig.conf:
> 
> # cpuset.mems may be different in different architectures, e.g. in
> Parsons there # is only "0".
> group users {
>memory {
>  memory.limit_in_bytes="4G";
>  memory.memsw.limit_in_bytes="6G";
>}
>cpuset {
>  cpuset.mems="0-1";
>  cpuset.cpus="0-2";
>}
> }
> 
>
> Note that the ''memory.memsw.limit_in_bytes'' limit is //inclusive//
> of the ''memory.limit_in_bytes'' limit. So in the above example, the
> limit is 4GB of RAM following by a further 2 GB of swap. See:
>
> [[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Lin
> ux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html#p
> roc-cpu_and_mem
> ]]
>
> Set no limit for root and set limits for every other individual user:
>
> 
> $ echo "root*  /">>/etc/cgrules.conf
> $ echo "*   cpuset,memoryusers">>/etc/cgrules.conf
> 
>
> Note also that the ''users'' cgroup defined above is inclusive of
> **all** users (the * wildcard). So it is not a 4GB RAM limit for one
> user, it is a 4GB RAM limit in total for every non-root user.
>
>  4. Start the daemon that manages Cgroups configuration and set it
> to start on boot 
>
> 
> $ /etc/init.d/cgconfig start
> $ chkconfig cgconfig on
> 
>
>
>
>
>
> On Thu, Feb 09, 2017 at 05:12:12AM -0800, John Hearns wrote:
>
>> Does anyone have a good suggestion for this problem?
>>
>> On a cluster I am implementing I noticed a user is running a code on 16 
>> cores, on one of the login nodes, outside the batch system.
>> What are the accepted techniques to combat this? Other than applying a LART, 
>> if you all know what this means.
>>
>> On one system I set up a year or so ago I was asked to implement a

[slurm-dev] Stopping compute usage on login nodes

2017-02-09 Thread John Hearns
Does anyone have a good suggestion for this problem?

On a cluster I am implementing I noticed a user is running a code on 16 cores, 
on one of the login nodes, outside the batch system.
What are the accepted techniques to combat this? Other than applying a LART, if 
you all know what this means.

On one system I set up a year or so ago I was asked to implement a shell 
timeout, so if the user was idle for 30 minutes they would be logged out.
This actually is quite easy to set up as I recall.
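It was roughly this, assuming bash login shells (the file name is an example):

# /etc/profile.d/timeout.sh (sketch) - log idle shells out after 30 minutes
TMOUT=1800
readonly TMOUT
export TMOUT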
I guess in this case as the user is connected to a running process then they 
are not 'idle'.




[slurm-dev] Re: Abaqus with Slurm

2017-02-09 Thread John Hearns
Sean, much thankyou.
Guinness owed if I am ever in Temple Bar soon.

-Original Message-
From: Sean McGrath [mailto:smcg...@tchpc.tcd.ie]
Sent: 09 February 2017 11:58
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Abaqus with Slurm


Hi,

We have slurm 16.05.4 and the latest version of Abaqus we use is 6.14. I 
remember running into a similar problem with Abaqus so I wrote some bad bash to 
populate the host list file; http://www.tchpc.tcd.ie/node/1261

The github script seems to be doing something similar but in a better way.

Yes, I too remember thinking it was poorly designed or implemented in Abaqus at 
that time too.

Regards

Sean

On Thu, Feb 09, 2017 at 03:16:09AM -0800, John Hearns wrote:

> I would guess quite a few sites are using Abaqus with Slurm. I would be 
> grateful for some pointers on the submission scripts for MPI parallel Abaqus 
> runs.
> I am setting up Abaqus version 6.14-1 on a system with Slurm 16.05 and an 
> Omnipath interconnect.
>
> Specifically I am using this script to create the mp_host_list and to
> add it to the abaqus_v6.env file
> https://github.com/nesi/hpc-workflows/blob/master/EnvironmentSetupScri
> pts/slurm_setup_abaqus-env.sh
>
> This just seems to go against the grain with me - Abaqus seems to be 'LSF 
> aware' and can create mp_host_list when run with LSF.
> Is Abaqus not more 'slurm aware' ??
> If the advice is 'use the 2016 version' then this advice will be hoisted 
> aboard.
>
> John H

--
Sean McGrath M.Sc

Systems Administrator
Trinity Centre for High Performance and Research Computing Trinity College 
Dublin

sean.mcgr...@tchpc.tcd.ie

https://www.tcd.ie/
https://www.tchpc.tcd.ie/

+353 (0) 1 896 3725


[slurm-dev] Abaqus with Slurm

2017-02-09 Thread John Hearns
I would guess quite a few sites are using Abaqus with Slurm. I would be 
grateful for some pointers on the submission scripts for MPI parallel Abaqus 
runs.
I am setting up Abaqus version 6.14-1 on a system with Slurm 16.05 and an 
Omnipath interconnect.

Specifically I am using this script to create the mp_host_list and to add it to 
the abaqus_v6.env file
https://github.com/nesi/hpc-workflows/blob/master/EnvironmentSetupScripts/slurm_setup_abaqus-env.sh

This just seems to go against the grain with me - Abaqus seems to be 'LSF 
aware' and can create mp_host_list when run with LSF.
Is Abaqus not more 'slurm aware' ??
If the advice is 'use the 2016 version' then this advice will be hoisted aboard.

John H
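
In case it helps anyone searching the archive, the core of such a script is
only a few lines. A sketch, assuming a homogeneous allocation (the same core
count on every node, taken here from SLURM_CPUS_ON_NODE on the node running
the batch script) and appending to abaqus_v6.env in the working directory:

  #!/bin/bash
  # Expand the compact nodelist (e.g. node[01-04]) into one hostname per line
  hosts=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
  list=""
  for h in $hosts; do
      list="${list}['${h}', ${SLURM_CPUS_ON_NODE}],"
  done
  # Strip the trailing comma and write the Abaqus host list
  echo "mp_host_list=[${list%,}]" >> abaqus_v6.env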


[slurm-dev] RE: how to differentiate regular srun and srun with --pty

2017-02-06 Thread John Hearns
Bhanu

Regarding recording commands etc. then have a look at this project which was 
presented at FOSDEM last weekend:
https://fosdem.org/2017/schedule/event/ogrt/attachments/slides/1574/export/events/attachments/ogrt/slides/1574/005_ogrt.pdf
https://github.com/georg-rath/ogrt


This should do what you want.




From: Prasad, Bhanu [mailto:bhanu_pra...@hms.harvard.edu]
Sent: 06 February 2017 16:32
To: slurm-dev 
Subject: [slurm-dev] how to differentiate regular srun and srun with --pty

Hi,

we recently migrated to slurm-16.05.4

Is there a way to detect, in the job_submit plugin or any other plugin, whether
the user is executing srun with or without --pty?
We need to limit a partition to srun without --pty (i.e. non-interactive use
only), while still allowing regular srun.
Is there a variable that exposes this, or does anyone know of a workaround?

Also: is there a way to capture the commands users execute via srun or sbatch
with --wrap, and to redirect these commands to slurmdbd?
If changing the default slurmdbd is a bad idea, how would one set up a separate
database that stores all job information, including things like the commands?

Finally, are there any default ports defined for the use of mpirun with srun?


--
Thanks
Bhanu



[slurm-dev] Re: Slurm for render farm

2017-01-28 Thread John Hearns

Bill
thanks for the advice.
I will send an update when I get this working.
The RenderPal software seems quite well behaved.  There is a configuration file
which gives the name of the render executable, and it is easy to substitute a
wrapper script for this.  The wrapper script is called with some easily
parseable arguments, which are just the argument list for Blender (Maya, etc.).



From: Bill Broadley [b...@cse.ucdavis.edu]
Sent: 28 January 2017 02:37
To: slurm-dev
Subject: [slurm-dev] Re: Slurm for render farm

On 01/16/2017 10:02 AM, John Hearns wrote:
> The concept at the moment is to run the Renderpal server, which is a Windows
> application and it can detect the Linux render clients via  a 'heartbeat' 
> mechanism.
> I would spawn the Linux clients as needed via slurm.

Sounds reasonable.

> Thinking out loud, I could use slurm to run render clients on all compute 
> nodes
> in the cluster, then use job preemption to kill the jobs when other compute 
> jobs
> need the nodes.  I guess that very much risks 'live' Renderpal jobs being 
> killed
> off.

This works well, but slurm doesn't understand swap.  So it worked for me in
testing, until I used it in production and couldn't figure out why jobs wouldn't
suspend.  For that reason I had to use CR_CPU instead of CR_CPU_MEM to tell
slurm to only manage CPUs and not CPUs + ram.

Seems kinda weird to allow suspend, but not swap.  Ram is pretty expensive to
waste on suspended processes.  So I can't mix high memory/low cpu jobs and low
memory/high CPU jobs and efficiently utilize nodes.

 > The script will have to block until the job completes I think - else the
 > RenderPal server will report it has finished.
 > Is it possible to block and wait till an sbatch has finished?
 > Or shoudl I be thinking on using srun here?

Jobs could depend on other jobs, so that could work.  Your jobs could do
something like touch ~/job/finished/$SLURM_JOBID.  But I do think the easiest
would be srun.

[slurm-dev] Job temporary directory

2017-01-20 Thread John Hearns
As I remember, in SGE and in PBSPro a job has a temporary directory created for
it on the execution host, named with the jobid.
You can define in the batch system configuration where the root of these
directories is.

On running srun env, the only TMPDIR I see is /tmp.
I know - RTFM.  I bet I haven't realised that this is easy to set up...

Specifically I would like a temporary job directory which is /local/$SLURM_JOBID

I guess I can create this in the job and then delete it, but it would be cleaner
if the batch system deleted it and did not allow failed jobs or bad scripts to
leave it on disk.
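
A per-job scratch directory is usually done with a prolog/epilog pair. A
minimal sketch (the /etc/slurm paths are placeholders; the prolog and epilog
run as root on each allocated node and see SLURM_JOB_ID and SLURM_JOB_USER in
their environment):

  # slurm.conf
  Prolog=/etc/slurm/prolog.sh
  Epilog=/etc/slurm/epilog.sh

  /etc/slurm/prolog.sh:
    #!/bin/bash
    # Create the job scratch directory and hand it to the job owner
    mkdir -p "/local/${SLURM_JOB_ID}"
    chown "${SLURM_JOB_USER}" "/local/${SLURM_JOB_ID}"

  /etc/slurm/epilog.sh:
    #!/bin/bash
    # Remove it when the job ends, even if the job failed
    rm -rf "/local/${SLURM_JOB_ID}"

Pointing TMPDIR at that directory for the job's own processes needs a
TaskProlog (or a SPANK plugin) on top of this.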


[slurm-dev] RE: Slurm for render farm

2017-01-20 Thread John Hearns
Hello all.  I am making progress on using RenderPal with Slurm.
I have installed the RenderPal client on the job submission node on our cluster 
(which is the head node).
I am able to run a wrapper script which will get the parameters to run the 
render program.

A couple of questions:


I plan to have this wrapper script run the actual render through slurm.
The script will have to block until the job completes I think - else the 
RenderPal server will report it has finished.
Is it possible to block and wait until an sbatch job has finished?
Or should I be thinking of using srun here?
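
A sketch of both options (render.sbatch, the node/task counts and the poll
interval are made up; some sbatch versions also have a --wait flag - check the
man page):

  # Option 1: srun blocks until the step finishes, so the wrapper can simply do
  srun -N1 -n16 "$@"      # pass the RenderPal-supplied command and args through

  # Option 2: sbatch returns immediately, so poll until the job leaves the queue
  jobid=$(sbatch render.sbatch | awk '{print $4}')
  while squeue -h -j "$jobid" 2>/dev/null | grep -q .; do
      sleep 30
  done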

Second question - we are going to have to do some data staging from and to 
Samba mounted shares.
I have still got the mental scars from last time I did this sort of thing...
Anyway - for the Burst Buffer plugin, which machine actually does the data
moving? The submission host or the execution host?
I know - RTFM!





From: John Hearns
Sent: 16 January 2017 18:01
To: slurm-dev
Subject: Slurm for render farm

Is anyone out there using Slurm in conjunction with RenderPal?
http://www.renderpal.com/

Forestalling the obvious replies... yes I know that a render farm manager and a 
scheduler do basically the same thing. In a rational universe I would be using 
one or 'tother. Perhaps in the next life...

The concept at the moment is to run the Renderpal server, which is a Windows 
application and it can detect the Linux render clients via  a 'heartbeat' 
mechanism.
I would spawn the Linux clients as needed via slurm.

Thinking out loud, I could use slurm to run render clients on all compute nodes 
in the cluster, then use job preemption to kill the jobs when other compute 
jobs need the nodes.  I guess that very much risks 'live' Renderpal jobs being 
killed off.

Any experiences in this area gratefully received.

John H




[slurm-dev] Slurm for render farm

2017-01-16 Thread John Hearns
Is anyone out there using Slurm in conjunction with RenderPal?
http://www.renderpal.com/

Forestalling the obvious replies... yes I know that a render farm manager and a 
scheduler do basically the same thing. In a rational universe I would be using 
one or 'tother. Perhaps in the next life...

The concept at the moment is to run the Renderpal server, which is a Windows 
application and it can detect the Linux render clients via  a 'heartbeat' 
mechanism.
I would spawn the Linux clients as needed via slurm.

Thinking out loud, I could use slurm to run render clients on all compute nodes 
in the cluster, then use job preemption to kill the jobs when other compute 
jobs need the nodes.  I guess that very much risks 'live' Renderpal jobs being 
killed off.

Any experiences in this area gratefully received.

John H
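
Preemption of this kind is configured in slurm.conf. A sketch with made-up
partition and node names (PriorityTier is the option name in recent releases;
older ones call it Priority):

  PreemptType=preempt/partition_prio
  PreemptMode=REQUEUE
  # Low-priority partition for render clients; jobs here get cancelled when
  # the compute partition needs the nodes.
  PartitionName=render  Nodes=comp[01-16] PriorityTier=1  PreemptMode=CANCEL
  PartitionName=compute Nodes=comp[01-16] PriorityTier=10 Default=YES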




[slurm-dev] Re: Using slurm to control container images?

2016-11-16 Thread John Hearns
Lachlan,
I am sure it has been mentioned on this thread, but look at Singularity 
http://singularity.lbl.gov/
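
For reference, running an image under Singularity from a batch script is just
an exec inside the job; a sketch (image path, module name and the Python
command are placeholders):

  #!/bin/bash
  #SBATCH -N 1
  #SBATCH -n 4
  module load singularity   # if your site provides it as a module
  srun singularity exec /apps/images/analysis.img python /code/pipeline.py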

From: Lachlan Musicman [mailto:data...@gmail.com]
Sent: 16 November 2016 01:45
To: slurm-dev 
Subject: [slurm-dev] Re: Using slurm to control container images?

Yes, rkt was probably my preferred option. The researchers I work with aren't
necessarily up to date with best practice in this area, so docker is what they
know best by virtue of branding/promotion. I don't mind which is used in a
solution, if any. But yes, rkt would be my preference.
Cheers
L.

--
The most dangerous phrase in the language is, "We've always done it this way."

- Grace Hopper

On 16 November 2016 at 12:29, Jean Chassoul wrote:
Hi,

Just wondering, have you considered rkt? I wonder whether you run pip inside
virtualenvs; if so, the switch to a container with rkt seems "normal", rather
than the more intrusive one-almighty-process-to-rule-everything approach that
docker had the last time I checked (it's probably better now).

Saludos.
Jean

On Tue, Nov 15, 2016 at 8:21 PM, Lachlan Musicman wrote:
Hola,
We were looking for the ability to make jobs perfectly reproducible. While the
system is set up with environment modules, with the increasing number of
package management tools - pip/conda; npm; CRAN/Bioconductor - and people
building increasingly complex software stacks, our users have started asking
about containerization and slurm.
I have found a discussion on this list from about a year ago

https://groups.google.com/d/msg/slurm-devel/oPmz5em5tAA/BYlDDfRDzTgJ
which mentioned a tool that has not been updated since, and one called Shifter
by NERSC, which is Cray specific?
Has anyone tried Shifter out and has there been any movement on this? I presume 
the licensing issues remain.
Cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this way."

- Grace Hopper




[slurm-dev] Re: Send notification email

2016-10-06 Thread John Hearns
Fany,
Is there a reason why you are choosing to use ssmtp rather than Postfix?
I ask because I know a little about Postfix,
but nothing about ssmtp!

Please look at this page on diagnostics for email:
https://www.port25.com/how-to-check-an-smtp-connection-with-a-manual-telnet-session-2/

You could try this:

telnet 10.8.52.254  25

then type
ehlo cluster.citi.cu


I think you will not get much response!


-Original Message-
From: Fanny Pagés Díaz [mailto:fpa...@citi.cu]
Sent: 05 October 2016 17:06
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Send notification email


According to the previous mail explaining this, I'm trying to configure SLURM +
sSMTP (mail client) without using Postfix.
I execute the jobs like this, but it does not work:

salloc -n 2-N 2 --mail-user=fpa...@gmail.com --mail-type=END mpirun jobs1

/var/log/maillog
Oct  5 11:34:09 cluster sSMTP[2139]: Creating SSL connection to host
Oct  5 11:34:09 cluster sSMTP[2139]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Oct  5 11:34:09 cluster sSMTP[2139]: Sent mail for root@fpa...@citi.cu (221 2.0.0 Bye) uid=0 username=root outbytes=420
Oct  5 11:34:52 compute-0-3 postfix/qmgr[1792]: 2AC6BC006B: from=<root@local>, size=4328, nrcpt=1 (queue active)
Oct  5 11:34:52 compute-0-3 postfix/smtp[6469]: connect to 10.8.52.254[10.8.52.254]:25: Connection refused
Oct  5 11:34:52 compute-0-3 postfix/smtp[6469]: 2AC6BC006B: to=<root@local>, orig_to=, relay=none, delay=8869, delays=8869/0.01/0/0, dsn=4.4.1, status=deferred

-Mensaje original-----
De: John Hearns [mailto:john.hea...@xma.co.uk] Enviado el: miércoles, 5 de 
octubre de 2016 11:33
Para: slurm-dev
Asunto: [slurm-dev] Re: Send notification email

Fany,
Are you able to send us some of the lines from the /var/log/maillog file which
indicate why the email server is rejecting the email?
Thank you



-Original Message-
From: Fany Pages [mailto:fpa...@udio.cujae.edu.cu]
Sent: 05 October 2016 16:30
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Send notification email


Thanks anyway.

All the best.
Fany

-Mensaje original-
De: John Hearns [mailto:john.hea...@xma.co.uk] Enviado el: miércoles, 5 de 
octubre de 2016 11:20
Para: slurm-dev
Asunto: [slurm-dev] Re: Send notification email

Fany,
You are correct.  I understand this a bit better now.
My answer I am afraid is that you will have to ask your corporate IT people to 
allow email from this address.

I recently dealt with a similar case at a university. The mail servers were
refusing to accept mail from the cluster head node, as it did not have a
reverse DNS entry. In the end we had to configure email to go via an
Office365 server!

Other people on the list may be able to offer a better solution though.





-Original Message-
From: Fanny Pagés Díaz [mailto:fpa...@citi.cu]
Sent: 05 October 2016 15:45
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Send notification email


Yes, I mean that the cluster's external network IP is not valid outside; my
domain (@cluster.citi.cu) is not registered, therefore it is not in the MX
records, so when I relay through my Postfix, my corporate mail server refuses
to let the mail go out of my internal network. I think that's what is
happening. Am I wrong?

-Mensaje original-
De: John Hearns [mailto:john.hea...@xma.co.uk] Enviado el: miércoles, 5 de 
octubre de 2016 10:17
Para: slurm-dev
Asunto: [slurm-dev] Re: Send notification email

Fany,
Many clusters have an internal network which is a private network.

However, the other interface on the cluster head node, which is normally called
the 'external' interface, can have a real, proper IP address on your external
network.
It will therefore be able to send email.
The cluster compute nodes can be configured to 'relay' email via the head node.



-Original Message-
From: Fanny Pagés Díaz [mailto:fpa...@citi.cu]
Sent: 05 October 2016 15:13
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Send notification email


Hi,
Thanks for your answer.
My HPC cluster does not have a real IP segment; it is a test cluster. Therefore
it is not recognized on the external network. So I need to try another way.
All the best,
Fany

-Mensaje original-
De: Christopher Samuel [mailto:sam...@unimelb.edu.au] Enviado el: martes, 4 de 
octubre de 2016 18:43
Para: slurm-dev
Asunto: [slurm-dev] Re: Send notification email


On 03/10/16 23:39, Fanny Pagés Díaz wrote:

> I have Slurm running on the same HPC cluster server, but I need to send
> all notifications using my corporate mail server, which runs on
> another server on my internal network. I do not need to use the local
> postfix installed on the slurm server.

The most reliable solution will be to configure Postfix to send emails via the 
corporate server.

All our clusters send using our own mail server quite deliberately.

We set:

relayhost (to say where to relay email via)
myorigin (to set the system name to its proper FQDN)
aliasmaps (to add an LDAP lookup to rewrite users email to the value in LDAP)

[slurm-dev] Re: Send notification email

2016-10-05 Thread John Hearns
Fany,
Are you able to send us some of the lines from the /var/log/maillog file which
indicate why the email server is rejecting the email?
Thank you



-Original Message-
From: Fany Pages [mailto:fpa...@udio.cujae.edu.cu]
Sent: 05 October 2016 16:30
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Send notification email


Thanks anyway.

All the best.
Fany

-Mensaje original-
De: John Hearns [mailto:john.hea...@xma.co.uk] Enviado el: miércoles, 5 de 
octubre de 2016 11:20
Para: slurm-dev
Asunto: [slurm-dev] Re: Send notification email

Fany,
You are correct.  I understand this a bit better now.
My answer I am afraid is that you will have to ask your corporate IT people to 
allow email from this address.

I recently dealt with a similar case at a university. The mail servers were
refusing to accept mail from the cluster head node, as it did not have a
reverse DNS entry. In the end we had to configure email to go via an
Office365 server!

Other people on the list may be able to offer a better solution though.





-Original Message-
From: Fanny Pagés Díaz [mailto:fpa...@citi.cu]
Sent: 05 October 2016 15:45
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Send notification email


Yes, I mean that the cluster's external network IP is not valid outside; my
domain (@cluster.citi.cu) is not registered, therefore it is not in the MX
records, so when I relay through my Postfix, my corporate mail server refuses
to let the mail go out of my internal network. I think that's what is
happening. Am I wrong?

-Mensaje original-----
De: John Hearns [mailto:john.hea...@xma.co.uk] Enviado el: miércoles, 5 de 
octubre de 2016 10:17
Para: slurm-dev
Asunto: [slurm-dev] Re: Send notification email

Fany,
Many clusters have an internal network which is a private network.

However, the other interface on the cluster head node, which is normally called
the 'external' interface, can have a real, proper IP address on your external
network.
It will therefore be able to send email.
The cluster compute nodes can be configured to 'relay' email via the head node.



-Original Message-
From: Fanny Pagés Díaz [mailto:fpa...@citi.cu]
Sent: 05 October 2016 15:13
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Send notification email


Hi,
Thanks for your answer.
My HPC cluster does not have a real IP segment; it is a test cluster. Therefore
it is not recognized on the external network. So I need to try another way.
All the best,
Fany

-Mensaje original-
De: Christopher Samuel [mailto:sam...@unimelb.edu.au] Enviado el: martes, 4 de 
octubre de 2016 18:43
Para: slurm-dev
Asunto: [slurm-dev] Re: Send notification email


On 03/10/16 23:39, Fanny Pagés Díaz wrote:

> I have Slurm running on the same HPC cluster server, but I need to send
> all notifications using my corporate mail server, which runs on
> another server on my internal network. I do not need to use the local
> postfix installed on the slurm server.

The most reliable solution will be to configure Postfix to send emails via the 
corporate server.

All our clusters send using our own mail server quite deliberately.

We set:

relayhost (to say where to relay email via)
myorigin (to set the system name to its proper FQDN)
aliasmaps (to add an LDAP lookup to rewrite users email to the value in LDAP)

But really this isn't a Slurm issue, it's a host config issue for Postfix.

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci
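
For completeness, the change Chris describes is only a couple of lines in
/etc/postfix/main.cf (the relay hostname below is a placeholder for your
corporate server):

  relayhost = [smtp.example.com]:25
  myorigin = $mydomain

followed by a 'postfix reload'.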

[slurm-dev] Re: Send notification email

2016-10-05 Thread John Hearns
Fany,
You are correct.  I understand this a bit better now.
My answer I am afraid is that you will have to ask your corporate IT people to 
allow email from this address.

I recently dealt with a similar case at a university. The mail servers were
refusing to accept mail from the cluster head node,
as it did not have a reverse DNS entry. In the end we had to configure email
to go via an Office365 server!

Other people on the list may be able to offer a better solution though.





-Original Message-
From: Fanny Pagés Díaz [mailto:fpa...@citi.cu]
Sent: 05 October 2016 15:45
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Send notification email


Yes, I mean that the cluster's external network IP is not valid outside; my
domain (@cluster.citi.cu) is not registered, therefore it is not in the MX
records, so when I relay through my Postfix, my corporate mail server refuses
to let the mail go out of my internal network. I think that's what is
happening. Am I wrong?

-Mensaje original-
De: John Hearns [mailto:john.hea...@xma.co.uk] Enviado el: miércoles, 5 de 
octubre de 2016 10:17
Para: slurm-dev
Asunto: [slurm-dev] Re: Send notification email

Fany,
Many clusters have an internal network which is a private network.

However, the other interface on the cluster head node, which is normally called
the 'external' interface, can have a real, proper IP address on your external
network.
It will therefore be able to send email.
The cluster compute nodes can be configured to 'relay' email via the head node.



-Original Message-
From: Fanny Pagés Díaz [mailto:fpa...@citi.cu]
Sent: 05 October 2016 15:13
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Send notification email


Hi,
Thanks for your answer.
My HPC cluster does not have a real IP segment; it is a test cluster. Therefore
it is not recognized on the external network. So I need to try another way.
All the best,
Fany

-Mensaje original-
De: Christopher Samuel [mailto:sam...@unimelb.edu.au] Enviado el: martes, 4 de 
octubre de 2016 18:43
Para: slurm-dev
Asunto: [slurm-dev] Re: Send notification email


On 03/10/16 23:39, Fanny Pagés Díaz wrote:

> I have Slurm running on the same HPC cluster server, but I need to send
> all notifications using my corporate mail server, which runs on
> another server on my internal network. I do not need to use the local
> postfix installed on the slurm server.

The most reliable solution will be to configure Postfix to send emails via the 
corporate server.

All our clusters send using our own mail server quite deliberately.

We set:

relayhost (to say where to relay email via)
myorigin (to set the system name to its proper FQDN)
aliasmaps (to add an LDAP lookup to rewrite users email to the value in LDAP)

But really this isn't a Slurm issue, it's a host config issue for Postfix.

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Send notification email

2016-10-05 Thread John Hearns
Fany,
Many clusters have an internal network which is a private network.

However, the other interface on the cluster head node, which is normally called
the 'external' interface, can have a real, proper IP address on your external
network.
It will therefore be able to send email.
The cluster compute nodes can be configured to 'relay' email via the head node.



-Original Message-
From: Fanny Pagés Díaz [mailto:fpa...@citi.cu]
Sent: 05 October 2016 15:13
To: slurm-dev 
Subject: [slurm-dev] Re: Send notification email


Hi,
Thanks for your answer.
My HPC cluster does not have a real IP segment; it is a test cluster. Therefore
it is not recognized on the external network. So I need to try another way.
All the best,
Fany

-Mensaje original-
De: Christopher Samuel [mailto:sam...@unimelb.edu.au] Enviado el: martes, 4 de 
octubre de 2016 18:43
Para: slurm-dev
Asunto: [slurm-dev] Re: Send notification email


On 03/10/16 23:39, Fanny Pagés Díaz wrote:

> I have Slurm running on the same HPC cluster server, but I need to send
> all notifications using my corporate mail server, which runs on
> another server on my internal network. I do not need to use the local
> postfix installed on the slurm server.

The most reliable solution will be to configure Postfix to send emails via the 
corporate server.

All our clusters send using our own mail server quite deliberately.

We set:

relayhost (to say where to relay email via)
myorigin (to set the system name to its proper FQDN)
aliasmaps (to add an LDAP lookup to rewrite users email to the value in LDAP)

But really this isn't a Slurm issue, it's a host config issue for Postfix.

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Send notification email

2016-09-30 Thread John Hearns
Fanny,
You are getting confused between the mail client – that is, the program you use
to send email as a user – and the mail server, which the system uses to route
the email to the recipient’s mail server.
These are called the MUA (Mail User Agent) and MTA (Mail Transfer Agent), if I
remember correctly.

Please send me an email off the list and I can help if I can. I have configured 
Postfix on many HPC clusters. (Postfix is a mail server.)


From: Fanny Pagés Díaz [mailto:fpa...@citi.cu]
Sent: 29 September 2016 19:09
To: slurm-dev 
Subject: [slurm-dev] Re: Send notification email

Yes, but can I set the mail server in Slurm?

De: Eckert, Phil [mailto:ecke...@llnl.gov]
Enviado el: miércoles, 28 de septiembre de 2016 15:00
Para: slurm-dev
Asunto: [slurm-dev] Re: Send notification email

If I understand your question, you can set it in the slurm.conf file; the
default is:

MailProg = /usr/bin/mail

From: Fanny Pagés Díaz
Reply-To: slurm-dev
Date: Wednesday, September 28, 2016 at 11:45 AM
To: slurm-dev
Subject: [slurm-dev] Send notification email

I need to send notification email from Slurm using another mail server, which
is not the standard one. Can anyone help me?


[slurm-dev] Slurm web dashboards

2016-09-27 Thread John Hearns
Hello all.  What are your thoughts on a Slurm 'dashboard'? The purpose is to
display cluster status on a large screen monitor.

I rather liked the look of this, based on dashing.io:
https://github.com/julcollas/dashing-slurm/blob/master/README.md
Sadly dashing.io is no longer supported, and this looks two years old now.

There is the slurm-web from EDF.
Also a containerised dashboard by Christian Kniep (hello Christian).

Any other packages out there please?



[slurm-dev] Re: cpu identifier

2016-09-14 Thread John Hearns
Squee.(*)


Just a note - for Mellanox IB users, there is the tuning guide which advises
pinning interrupts to the CPU nearest the HBA.
I guess it makes sense, to eke out that last fraction of performance, to make
the reserved cores local to the HBA.
hwloc is your friend here.


(*) and see my frequent references to 'donkey engines' way over there on the 
Beowulf list.
Donkey engine = small engine which is used to crank over a much larger diesel 
engine, and to provide auxiliary power

-Original Message-
From: Christopher Samuel [mailto:sam...@unimelb.edu.au]
Sent: 15 September 2016 01:31
To: slurm-dev 
Subject: [slurm-dev] Re: cpu identifier


On 15/09/16 05:20, andrealphus wrote:

> On a side note, any idea if there is a parameter to not have it use a
> particular cpu? This is a single node workstation, with
> 18 cores. The end goal is to have a default set up where it can say
> run 16 jobs, bound to 16 unique cores, excluding core 1 and 2, which
> are primarily used for system overhead.

Slurm has core specialisation, which is documented here:

http://slurm.schedmd.com/core_spec.html

I *think* it's meant to do what you want it to do, but I don't think I've had 
enough coffee yet to really grok what it's saying..

--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: cpu identifier

2016-09-14 Thread John Hearns
Andrealphus,
You should be using cpusets.
You allocate cores 1 and 2 (actually I think they count from 0) as the 'boot
cpuset' and run the operating system processes in that.
You then create a cpuset for each job.
I have done this with PBSPro and it works very well.

http://slurm.schedmd.com/cgroups.html
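
Roughly what that looks like in practice (a sketch; the node line is
illustrative - check 'slurmd -C' for the real values, and the exact CPU IDs in
CpuSpecList depend on how Slurm numbers the hyperthreads):

  # slurm.conf
  TaskPlugin=task/cgroup
  NodeName=workstation CPUs=36 Sockets=1 CoresPerSocket=18 ThreadsPerCore=2 CpuSpecList=0,1

  # cgroup.conf
  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes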



-Original Message-
From: andrealphus [mailto:andrealp...@gmail.com]
Sent: 14 September 2016 21:19
To: slurm-dev 
Subject: [slurm-dev] Re: cpu identifier


Thanks Dani!  On a side note, any idea if there is a parameter to not have it
use a particular cpu? This is a single node workstation, with
18 cores. The end goal is to have a default set-up where it can, say, run 16
jobs, bound to 16 unique cores, excluding cores 1 and 2, which are primarily
used for system overhead.

The catch is that hyperthreading is enabled, and I also don't want it to use
virtual cores 19-36, but don't want to disable hyperthreading in the BIOS.

On Wed, Sep 14, 2016 at 6:34 AM, dani  wrote:
> Is there binding/affinity involved?
>
> If not, the process might execute each instruction on a different cpu,
> so slurm couldn't really provide something useful.
>
>
> If there is binding, take a look at
>
> SBATCH_CPU_BIND
> Set to value of the --cpu_bind option.
> SBATCH_CPU_BIND_VERBOSE
> Set to "verbose" if the --cpu_bind option includes the verbose option.
> Set to "quiet" otherwise.
> SBATCH_CPU_BIND_TYPE
> Set to the CPU binding type specified with the --cpu_bind option.
> Possible values two possible comma separated strings. The first
> possible string identifies the entity to be bound to: "threads",
> "cores", "sockets", "ldoms" and "boards". The second string identifies
> manner in which tasks are
> bound: "none", "rank", "map_cpu", "mask_cpu", "rank_ldom", "map_ldom"
> or "mask_ldom".
> SBATCH_CPU_BIND_LIST
> Set to bit mask used for CPU binding.
>
>
>
> On 14/09//2016 02:23, andrealphus wrote:
>>
>> is there an environmental variable available in sbatch that holds the
>> cpu/s the current job is being run on? (not the number of cpus, but a
>> cpu identifier).
>
>
>


[slurm-dev] RE: Remote Visualization and Slurm

2016-08-17 Thread John Hearns
Nicholas,
As you say there are several solutions out there.

The one I have experience with is NICE Software, which I admit I integrated
with PBS Pro.
Looking at the code, though, there are options to use it with Slurm.

Please send me an email off list and I can give more information.



-Original Message-
From: Nicholas McCollum [mailto:nmccol...@asc.edu]
Sent: 17 August 2016 15:33
To: slurm-dev 
Subject: [slurm-dev] Remote Visualization and Slurm

Hello All,

I've been looking for remote visualization solutions that integrate with
Slurm. While I have found several companies that say they could work with
Slurm, I have yet to find any that can actually show me a product.

Essentially I am wanting a solution where a class of 40 students could log
on to our supercomputers, be provided a remote CentOS VM desktop that is
attached to a GPU so that they can either interactively send jobs to the
GPU that they are assigned or run programs like Spartan, Maestro, Abacus,
ANSYS, etc.

If anyone has a working remote visualization cluster that integrates well
with slurm, I would love to hear from you.

Thanks!

---
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority


[slurm-dev] RE: defining jobs slots

2016-08-16 Thread John Hearns
Adrian, forgive my asking, but are you running this on a laptop 'natively' or
using a virtual machine, e.g. on VirtualBox?
It could be that if you have a VM it is set to have a different number of cores
than your real laptop.
I could be very, very wrong here (I am working on a VirtualBox VM at the
moment, so that is why I am thinking that way).

'lscpu'  will tell you


-Original Message-
From: Adrian Sevcenco [mailto:adrian.sevce...@spacescience.ro]
Sent: 16 August 2016 16:11
To: slurm-dev 
Subject: [slurm-dev] defining jobs slots


Hi! I have trouble understanding the definition of job slots for each node...
I am running Slurm only on my desktop to get used to it until I move it to the
clusters. It is not clear to me how one can define the job slots for each
node and how it can be done automatically for a list of nodes (I hope that
there is not a requirement of human input of hardware resources for each node).

I have defined
TaskPluginParam=Cores
SelectType=select/cons_res
SelectTypeParameters=CR_CPU

and
NodeName=localhost CPUs=8

PartitionName=local Nodes=localhost Default=YES MaxTime=172800 State=UP

but in the logs i have this:
Aug 16 17:45:39 sev.spacescience.ro slurmd[7371]: Node configuration differs from hardware: CPUs=8:8(hw) Boards=1:1(hw) SocketsPerBoard=8:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:2(hw)
Aug 16 17:45:39 sev.spacescience.ro slurmd[7373]: CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=16006 TmpDisk=255791 Uptime=1313157 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

How can I define my machine to run 8 jobs?
Also, for the next level, how can I set a job slot for each physical core (a
job to run on the physical core + its HT partner)?

Thank you!
Adrian
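
The slurmd log line above gives the numbers to use. A sketch of a node
definition matching that hardware (copy the exact line printed by 'slurmd -C'
to be sure):

  NodeName=localhost CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
  SelectType=select/cons_res
  SelectTypeParameters=CR_CPU     # 8 schedulable CPUs -> up to 8 single-CPU jobs
  # For one job slot per physical core (the core plus its HT partner),
  # allocate by core instead:
  #   SelectTypeParameters=CR_Core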


[slurm-dev] Re: Storage accounting, with web presentation

2016-07-28 Thread John Hearns
Christian is quite correct to flag up Robinhood, which will be the correct tool
for you.

However, if you want something you can implement today, which will give you a
quick overview of storage use and offer a 'drill down' into each user's area,
try agedu.
I have used it in the past:

http://www.chiark.greenend.org.uk/~sgtatham/agedu/

I stress please that this is not a substitute for Robinhood!
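
Typical usage is just two commands (the path is only an example):

  agedu -s /home     # scan the tree and build the index file
  agedu -w           # serve a browsable report; it prints a local URL to open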


-Original Message-
From: Christian Goll [mailto:christian.g...@h-its.org]
Sent: 28 July 2016 08:54
To: slurm-dev 
Subject: [slurm-dev] Re: Storage accounting, with web presentation


Hello Marcin,
have a look at the robinhood policy engine 
https://github.com/cea-hpc/robinhood/wiki

kind regards,
Christian
On 28.07.2016 09:10, Marcin Stolarek wrote:
> Hi,
>
> This is not related to slurm, but I don't know a better place to ask
> such a question. For sure, on the clusters you manage there is a need to
> present the space used by projects and users on particular file systems.
> I am not aware of any open-source solution providing something like an
> "accounting portal" for storage; does anyone know about such a project?
>
> I'm thinking about a solution where used space is periodically updated,
> based on a script getting data from quota, or a simple du, putting it
> into a database, and then displaying it through a web interface.
>
> How do you deal with this on your clusters?
>
> cheers,
> Marcin

--
Dr. Christian Goll
HITS gGmbH
Schloss-Wolfsbrunnenweg 35
69118 Heidelberg
Germany

phone: +49 - 6221 - 533 230
email: christian.g...@h-its.org

Amtsgericht Mannheim / HRB 337446
Managing Director: Dr. Gesa Schönberger


[slurm-dev] RE: SGI UV2000 with SLURM

2016-07-20 Thread John Hearns
As Carlos says.

I don’t have direct experience of running Slurm on a UV, but I did run PBSPro
with cpusets on a UV.
I might be remembering this wrong, but part of the init script was to move the 
pbs daemon out of the bootcpuset.
If you have root privileges you can move your own cpuset.  I might be wrong.
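
A way to check, and if need be fix, where slurmd is sitting (a sketch; the
cgroup mount point is an assumption - on older SGI installs the cpuset
filesystem may be mounted at /dev/cpuset and the file to write is 'tasks'):

  # which cpuset is slurmd in right now?
  cat /proc/$(pidof slurmd)/cpuset

  # move it to the root cpuset, out of the boot cpuset
  echo $(pidof slurmd) > /sys/fs/cgroup/cpuset/cgroup.procs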

From: Carlos Fenoy [mailto:mini...@gmail.com]
Sent: 20 July 2016 09:23
To: slurm-dev 
Subject: [slurm-dev] RE: SGI UV2000 with SLURM

Is the slurmd process running in the bootcpuset?

On Wed, Jul 20, 2016 at 9:29 AM, Christopher Samuel wrote:

On 20/07/16 17:13, A. Podstawka wrote:

> no direct error message, but the jobs get started in the bootcpuset

Do the processes show up in the tasks file for that cgroup?

Is it an exclusive cgroup? (cpuset.cpu_exclusive is 1)

> TaskAffinity=yes

Hmm, I suspect that could be related, try turning that off..

All the best!
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



--
--
Carles Fenoy


[slurm-dev] Re: queue routing

2016-07-20 Thread John Hearns


So plenty of scope then for seeing (*)  Heisenbugs.

I shall get my coat

(*) Or not seeing, depending if the jobs are being run in a forest


From: Christopher Samuel [sam...@unimelb.edu.au]
Sent: 20 July 2016 01:23
To: slurm-dev
Subject: [slurm-dev] Re: queue routing


So whilst Torque is a classically mechanical system in the physics sense
Slurm is more quantum - your job exists in many states until it can
start running and then it collapses into its final state. :-)


[slurm-dev] Re: Increase size of running job/correcting incorrect resource allocations?

2016-06-01 Thread John Hearns

Plus from me too!

I used the cpusets integration with PbsPro in my last job, and it was a godsend.
This was on a large SMP machine, but same lessons apply to clusters.
Applications get a defined set of CPUs - which they 'see' as being numbered 
from 0,
and they get a defined amount of memory.
If applications 'misbehave' then they get terminated when they run out of 
memory etc,
and they can't stomp all over the cores being used by another application.

Also job tear-down seemed to be a lot better - i.e. when you collapse the
cpuset, all the processes associated with it go with it, so you don't have to
worry about orphan processes.



From: Christopher Samuel [sam...@unimelb.edu.au]
Sent: 01 June 2016 00:53
To: slurm-dev
Subject: [slurm-dev] Re: Increase size of running job/correcting incorrect 
resource allocations?

On 31/05/16 18:42, Diego Zuccato wrote:

> What I did is the other way around: I've used cpuset to "pin" the job to
> the allocated CPUs.
[...]
> *Way* less need to watch closely the cluster :)

+lots!

We use Slurm's cgroup support to do just that, and Torque's cpuset
support (which I agitated for) before that.

Unless you're only ever allocating whole nodes to jobs then I'd strongly
suggest taking that path (and use cgroups to restrict memory to what the
job requested too).

All the best,
Chris
--
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci

[slurm-dev] RE: ReqNodeNotAvail - can't see all info

2016-05-26 Thread John Hearns
Just an update - looks like I have solved this.

I would still like to be able to see the full NODELIST(REASON) though!

Looking in the slurmd logs, I saw that the nodes were reporting a different
slurm.conf from the master.
The solution in this case was actually to restart slurmd on the master.
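
On the squeue question, two standard ways to see the whole reason (the format
string below is just an example - %R with no width prints the full
NODELIST(REASON) column):

  scontrol show job <jobid> | grep -i reason
  squeue -o "%.10i %.9P %.12j %.8u %.2t %.10M %.6D %R"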


-Original Message-
From: John Hearns [mailto:john.hea...@xma.co.uk]
Sent: 26 May 2016 09:21
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] ReqNodeNotAvail - can't see all info


I am scheduling an HPCC job on a certain set of nodes using --nodelist.

I am getting informed that a node is not available - but for the life of me I
cannot expand the NODELIST(REASON) field to show it.


JOBID PARTITION NAME USER ST   TIME  NODES NODELIST(REASON)
   340  defq run-hpccjohnh PD   0:00  4 (Resources)
   344  defq run-hpccjohnh PD   0:00  4 
(ReqNodeNotAvail(Unavailable:co
   346  defq run-hpccjohnh PD   0:00  1 
(ReqNodeNotAvail(Unavailable:co
   350  defq run-hpccjohnh PD   0:00  1 
(ReqNodeNotAvail(Unavailable:co
   348  defq run-hpccjohnh  R   1:17  2 comp[15-16]


Also this may be relevant - I have the known problem of a job not terminating
properly.
In slurmdbd

[2016-05-26T09:00:00.838] error: We have more allocated time than is possible 
(172800 > 126000) for cluster slurm_cluster(35) from 2016-05-26T08:00:00 - 
2016-05-26T09:00:00

I ruan the lost.pl  script from the bugzilla and it finds not still-running 
jobs.

slurm version 14.11.6


[slurm-dev] ReqNodeNotAvail - can't see all info

2016-05-26 Thread John Hearns

I am scheduling an HPCC job on a certain set of nodes using --nodelist

I am being informed that a node is not available - but for the life of me I
cannot expand the NODELIST(REASON) field to show it.


JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
  340      defq run-hpcc johnh PD  0:00     4 (Resources)
  344      defq run-hpcc johnh PD  0:00     4 (ReqNodeNotAvail(Unavailable:co
  346      defq run-hpcc johnh PD  0:00     1 (ReqNodeNotAvail(Unavailable:co
  350      defq run-hpcc johnh PD  0:00     1 (ReqNodeNotAvail(Unavailable:co
  348      defq run-hpcc johnh  R  1:17     2 comp[15-16]


Also, this may be relevant - I have the known problem of a job not terminating
properly.
In slurmdbd:

[2016-05-26T09:00:00.838] error: We have more allocated time than is possible 
(172800 > 126000) for cluster slurm_cluster(35) from 2016-05-26T08:00:00 - 
2016-05-26T09:00:00

I ran the lost.pl script from the bugzilla and it finds no still-running
jobs.

Slurm version 14.11.6

[slurm-dev] RE: NFSv4

2016-05-25 Thread John Hearns
They've been doing things like this at CERN for donkey's years - with the Andrew
File System in the past.
Look for Ticket Granting Tickets.  Sorry - my memory is getting hazy.


-Original Message-
From: Mike Johnson [mailto:m.d.john...@durhamonline.org]
Sent: 25 May 2016 12:22
To: slurm-dev 
Subject: [slurm-dev] NFSv4


Hi all,

I know this is a long-standing question, but thought it was worth asking.  I am 
in an environment that uses NFSv4, which obviously needs user credentials to 
grant access to filesystems.  Has anyone else tackled the issue of unattended 
batch jobs successfully?  I'm aware of AUKS.  Is there any other method anyone 
has used?

I'd be receptive to trying something like GlusterFS if it provided similar 
authentication and encryption measures.

Thanks
Mike


[slurm-dev] RE: Time limit on compute nodes?

2016-05-24 Thread John Hearns
I have found out the reason here...
I did not know that Slurm would read PBS directives!

I was using a PBS script as an example, and had left the original PBS
directives at the end of the script, following an exit statement:

#PBS -N OpenFoam
#PBS -l nodes=4:ppn=16
#PBS -l walltime=01:00:00
#PBS -A ds004

The job ran with account ds004 and used that walltime!

You learn something new every day
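
(For the record, the native #SBATCH equivalents of those directives would be
something along these lines - a sketch mirroring the values above:)

#SBATCH --job-name=OpenFoam
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16
#SBATCH --time=01:00:00
#SBATCH --account=ds004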


-Original Message-
From: John Hearns [mailto:john.hea...@xma.co.uk]
Sent: 24 May 2016 07:26
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Time limit on compute nodes?


Last night I was running an OpenFOAM job which failed with a message about a
time limit on comp13:

Time = 23.2

DILUPBiCG:  Solving for Ux, Initial residual = 0.00934715, Final residual = 
3.76516e-05, No Iterations 3
DILUPBiCG:  Solving for Uy, Initial residual = 0.00214164, Final residual = 
2.31061e-06, No Iterations 4
DILUPBiCG:  Solving for Uz, Initial residual = 0.00777681, Final residual = 
6.03314e-05, No Iterations 3
DILUPBiCG:  Solving for e, Initial residual = 0.00417167, Final residual = 
1.34667e-05, No Iterations 3
slurmstepd: *** JOB 313 CANCELLED AT 2016-05-23T19:08:44 DUE TO TIME LIMIT on 
comp13 ***

However, as far as I can see there is no limit on the partition it ran in:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      7   idle comp[05,07-08,13-16]

Have I managed to set a time limit on comp13 somehow?
The job submission script has no time limit that I can see.


[slurm-dev] Time limit on compute nodes?

2016-05-24 Thread John Hearns

Last night I was running an OpenFOAM job which failed with a message about a
time limit on comp13:

Time = 23.2

DILUPBiCG:  Solving for Ux, Initial residual = 0.00934715, Final residual = 
3.76516e-05, No Iterations 3
DILUPBiCG:  Solving for Uy, Initial residual = 0.00214164, Final residual = 
2.31061e-06, No Iterations 4
DILUPBiCG:  Solving for Uz, Initial residual = 0.00777681, Final residual = 
6.03314e-05, No Iterations 3
DILUPBiCG:  Solving for e, Initial residual = 0.00417167, Final residual = 
1.34667e-05, No Iterations 3
slurmstepd: *** JOB 313 CANCELLED AT 2016-05-23T19:08:44 DUE TO TIME LIMIT on 
comp13 ***

However, as far as I can see there is no limit on the partition it ran in:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      7   idle comp[05,07-08,13-16]

Have I managed to set a time limit on comp13 somehow?
The job submission script has no time limit that I can see.

[slurm-dev] RE: Guide for beginners Admin to make priorities

2016-05-18 Thread John Hearns
Free IPA?
Damn. You mean identity management.
Not free beer.

Sent from my Windows Phone

From: Simpson Lachlan
Sent: 18/05/2016 01:41
To: slurm-dev
Subject: [slurm-dev] RE: Guide for beginners Admin to make priorities

The SLURM docs are the best there are, I’ve found.

For “lots of users” I would use something like FreeIPA to manage them.

Cheers
L.

From: David Ramírez [mailto:drami...@sie.es]
Sent: Tuesday, 17 May 2016 5:38 PM
To: slurm-dev
Subject: [SPAM][M] [slurm-dev] Guide for beginners Admin to make priorities

Hi.

I need to deploy an HPC cluster with some groups and define priorities within it.

I need to create a lot of users, delegations and groups. I'm searching for a
manual, guide or video on how to do it correctly.

Is there any documentation available for beginner users?

Thanks in advance


David Ramírez   HPC Integrator & Systems Manager
Sistemas Informáticos Europeos S.L.   LadonOS Project
C/ Marqués de Mondejar 29-31, 2ª Planta, 28028 Madrid, Spain
Phone: (+34) 913611002   WWW.SIE.ES   WWW.LADONOS.ORG
Mobile: (+34) 661369483   Email: drami...@sie.es
Skype: dramirezsie   Twitter: @dramirezhpc @ladon_os






[slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

2016-04-12 Thread John Hearns
Thank you for your help.

It turned out I also had to run an 'scontrol update state=RESUME' on all the nodes
to wake them up.
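
(Concretely, something along these lines - the node list is purely illustrative:)

scontrol update nodename=comp[01-16] state=resume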

I guess that is something I have to file away in my brain for the future!

Thanks once again.



From: John Hearns [john.hea...@xma.co.uk]
Sent: 12 April 2016 04:43
To: slurm-dev
Subject: [slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

Sure enough:

systemctl start slurmd returns immediately

systemctl start slurm   hangs


The node is still down and Not Responding - but I am further along with this!
Thank you.


P.S. Are you related to Lachlan Mor of Angus Og fame ;-)


From: Lachlan Musicman [data...@gmail.com]
Sent: 12 April 2016 04:10
To: slurm-dev
Subject: [slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

I think I saw something like this just now - are you running:

systemctl start slurm

or

systemctl start slurmd ?

And slurmctld is running on the head?

Cheers
L.

--
The most dangerous phrase in the language is, "We've always done it this way."

- Grace Hopper

On 12 April 2016 at 13:04, John Hearns 
<john.hea...@xma.co.uk> wrote:
I am working on an OpenHPC/Warewulf cluster.

When I start the slurmd service on the compute nodes, systemctl sits there
for a long time, then reports:

Starting slurm (via systemctl):  Job for slurm.service failed because a timeout
was exceeded. See "systemctl status slurm.service" and "journalctl -xe" for
details.
   [FAILED]

However, a slurmd process is started on the compute node.
On the head node, sinfo says the compute node is down and Not Responding.

I don't expect my problem to be solved by the list, but would appreciate some
hints on diagnostics.

The debug level is set to 3 in slurm.conf.




[slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

2016-04-11 Thread John Hearns
Sure enough:

systemctl start slurmd returns immediately

systemctl start slurm   hangs


The node is still down and Not Responding - but I am further along with this!
Thank you.


P.S. Are you related to Lachlan Mor of Angus Og fame ;-)


From: Lachlan Musicman [data...@gmail.com]
Sent: 12 April 2016 04:10
To: slurm-dev
Subject: [slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

I think I saw something like this just now - are you running:

systemctl start slurm

or

systemctl start slurmd ?

And slurmctld is running on the head?

Cheers
L.

--
The most dangerous phrase in the language is, "We've always done it this way."

- Grace Hopper

On 12 April 2016 at 13:04, John Hearns <john.hea...@xma.co.uk> wrote:
I am working on an OpenHPC/Warewulf cluster.

When I start the slurmd service on the compute nodes, systemctl sits there
for a long time, then reports:

Starting slurm (via systemctl):  Job for slurm.service failed because a timeout
was exceeded. See "systemctl status slurm.service" and "journalctl -xe" for
details.
   [FAILED]

However, a slurmd process is started on the compute node.
On the head node, sinfo says the compute node is down and Not Responding.

I don't expect my problem to be solved by the list, but would appreciate some
hints on diagnostics.

The debug level is set to 3 in slurm.conf.




[slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

2016-04-11 Thread John Hearns
Thank you, Lachlan.

Actually, I was being old school and running /etc/init.d/slurm start
(yes - I know about systemd!)

I will try systemd.
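
(Roughly, the systemd sequence on a compute node would be as below - a sketch,
assuming the slurmd unit file is already installed:)

systemctl daemon-reload
systemctl enable slurmd
systemctl start slurmd
systemctl status slurmd
journalctl -u slurmd -e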


For what it's worth, /var/log/slurm.log on the compute node reads:

[2016-04-12T03:52:18.813] Message aggregation disabled
[2016-04-12T03:52:18.814] error: _cpu_freq_cpu_avail: Could not open 
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
[2016-04-12T03:52:18.814] Resource spec: Reserved system memory limit not 
configured for this node
[2016-04-12T03:52:18.816] slurmd version 15.08.6 started
[2016-04-12T03:52:18.816] slurmd started on Tue, 12 Apr 2016 03:52:18 +0100
[2016-04-12T03:52:18.816] CPUs=16 Boards=1 Sockets=2 Cores=8 Threads=1 
Memory=64318 TmpDisk=32159 Uptime=494183 CPUSpecList=(null)




From: Lachlan Musicman [data...@gmail.com]
Sent: 12 April 2016 04:10
To: slurm-dev
Subject: [slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

I think I saw something like this just now - are you running:

systemctl start slurm

or

systemctl start slurmd ?

And slurmctld is running on the head?

Cheers
L.

--
The most dangerous phrase in the language is, "We've always done it this way."

- Grace Hopper

On 12 April 2016 at 13:04, John Hearns <john.hea...@xma.co.uk> wrote:
I am working on an OpenHPC/Warewulf cluster.

When I start the slurmd service on the compute nodes, systemctl sits there
for a long time, then reports:

Starting slurm (via systemctl):  Job for slurm.service failed because a timeout
was exceeded. See "systemctl status slurm.service" and "journalctl -xe" for
details.
   [FAILED]

However, a slurmd process is started on the compute node.
On the head node, sinfo says the compute node is down and Not Responding.

I don't expect my problem to be solved by the list, but would appreciate some
hints on diagnostics.

The debug level is set to 3 in slurm.conf.




[slurm-dev] Slurm service timeout - hints on diagnostics please?

2016-04-11 Thread John Hearns
I am working on an OpenHPC/Warewulf cluster.

When I start the slurmd service on the compute nodes, systemctl sits there
for a long time, then reports:

Starting slurm (via systemctl):  Job for slurm.service failed because a timeout
was exceeded. See "systemctl status slurm.service" and "journalctl -xe" for
details.
   [FAILED]

However, a slurmd process is started on the compute node.
On the head node, sinfo says the compute node is down and Not Responding.

I don't expect my problem to be solved by the list, but would appreciate some
hints on diagnostics.

The debug level is set to 3 in slurm.conf.
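
(Two generic checks that often help in this situation - run slurmd in the foreground
with verbose logging, and ask the controller what it thinks of the node; the node
name is illustrative:)

slurmd -D -vvv
scontrol show node comp13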




[slurm-dev] RE: checkpoint/restart feature in SLURM

2016-03-19 Thread John Hearns
O I'll we k lo

Sent from my Windows Phone

From: Husen R
Sent: 17/03/2016 05:56
To: slurm-dev
Subject: [slurm-dev] checkpoint/restart feature in SLURM

Dear Slurm-dev,


Is the checkpoint/restart feature available in SLURM able to relocate an MPI
application from one node to another node while it is running?

For example, I run an MPI application on nodes A, B and C in a cluster and I
want to migrate/relocate the process running on node A to another node, let's
say node C, while it is running.

Is there a way to do this with SLURM? Thank you.


Regards,

Husen




[slurm-dev] Re: What cluster provisioning system do you use?

2016-03-15 Thread John Hearns

I am currently setting up a test cluster and shall be looking at

- Warewulf

If you like Warewulf, you could look at OpenHPC, which uses Warewulf for the 
provisioning.
The slurm version on my OpenHPC server is 15.08.6, and this came from the 
OpenHPC repositories.







[slurm-dev] RE: What cluster provisioning system do you use?

2016-03-15 Thread John Hearns
Bjørn,

You should definitely be looking at Bright Cluster Manager.

I set up a Bright cluster last week with CentOS 7.2 and Slurm.
Bright works right out of the box with Slurm, and it is set up automatically as
you provision the nodes.
It also has the power-saving scripts etc. all set up.

Please ping me an email off the list and I can discuss.

I am also happy to let you log into our cluster remotely and 'test drive' it:
CentOS 7, Slurm, Mellanox FDR InfiniBand, and we have Xeon Phi too.





-Original Message-
From: Bjørn-Helge Mevik [mailto:b.h.me...@usit.uio.no]
Sent: 15 March 2016 12:40
To: slurm-dev 
Subject: [slurm-dev] What cluster provisioning system do you use?


I apologize for the slightly off-topic subject, but I could not think of a 
better forum to ask.  If you know of a more proper place to ask this, I'd be 
happy to know about it.

We are currently in the design fase for a new cluster that is going to be set 
up next year.  We have so far used Rocks (on top of CentOS) for cluster 
provisioning.  However, Rocks don't support CentOS >= 7, and it doesn't look 
like it will in the near future.  Also for other reasons, we are looking for 
alternatives to Rocks.

So, what are you using for cluster provisioning?

- Rocks?
- A different provisioning tool?
- A locally developed solution?

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



[slurm-dev] Re: Problem in loading modules in slurm Batch script.

2015-12-04 Thread John Hearns


From: John Hearns [mailto:john.hea...@xma.co.uk]
Sent: 04 December 2015 09:56
To: slurm-dev
Subject: [slurm-dev] Re: Problem in loading modules in slurm Batch script.

Hello Hezi,
Have you tried making the shell for the batch script a login shell?
#!/bin/bash -l
I have not come across this with Slurm, but I have with other batch systems.
There are differences between interactive and non-interactive login sessions.
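
(i.e. a minimal sketch of such a batch script - the module and program names here
are just placeholders:)

#!/bin/bash -l
#SBATCH --job-name=module-test
#SBATCH --ntasks=1

module load somemodule
srun ./myprog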


I found the thread I started – in 2007!  This was about Gridengine, but may be 
relevant here.


http://sourceforge.net/p/modules/mailman/message/5909573/




[slurm-dev] Re: Jobs stuck in CF state

2015-11-29 Thread John Hearns
Thank you, Werner.

The compute nodes were all in the idle~ state - I now know this means powered
down, but the nodes were up and running.
I restarted Slurm completely, and things are OK now.
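
(The trailing ~ is sinfo's power-save flag; the long, node-oriented listing shows
the state in full - generic commands, node name illustrative:)

sinfo -N -l
scontrol show node comp01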




[slurm-dev] Jobs stuck in CF state

2015-11-27 Thread John Hearns
Yesterday I thought I would do some investigation of the suspend and resume
scripts on my in-house test cluster.

As my Mum would have said, "See what thought done..."

I have backed out of the changes to slurm.conf (or have I... ??).
I have restarted Slurm on the head node and all compute nodes, and whacked the
debug level up to 7 (not going to 11 just yet...).
When I start a job it just sits in the CF state, even a simple 'srun hostname'.
The slurmctld log says the job is allocated to a node, then nothing more.

In the words of Penelope Pitstop,  "Haaayulp"
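
(For reference, the power-saving knobs in slurm.conf that this sort of experiment
touches - parameter names from the slurm.conf man page, paths and times purely
illustrative:)

SuspendProgram=/usr/local/sbin/node_suspend.sh
ResumeProgram=/usr/local/sbin/node_resume.sh
SuspendTime=600
SuspendTimeout=30
ResumeTimeout=300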





[slurm-dev] RE: Distribute M jobs on N nodes without duplication

2015-10-02 Thread John Hearns


So far I have tried my hand with SRUN, SBATCH and SALLOC, and thought SBATCH would
do what I am looking for.  However, SBATCH starts by assigning the requested
resource configuration but then runs every srun command on every node.  For
instance, if my script looks like:


sbatch is the command to submit a batch job – which usually consists of a shell 
script with some #SBATCH style parameters at the start

srun is the command to start an interactive job session, or to run a command 
from the command line.

I wouldn't do an srun in the middle of a batch job...  Why not just submit
separate batch jobs?
Or you could use a small job array: http://slurm.schedmd.com/job_array.html

With the single line in the script:

./mycode.x  < input.in > output${SLURM_ARRAY_TASK_ID}.out


Also as it is running as a batch job you would not need the &






Do you have 2 GPUs in each node? Do you want to run two jobs on each node, or
just one?





[slurm-dev] RE: Distribute M jobs on N nodes without duplication

2015-10-02 Thread John Hearns
I stand corrected.

I find myself in a maze of twisty little passages, all alike

All the examples for SBATCH (in the SLURM manual) use 'SRUN' for execution of
runs.  There are a lot of other websites which give SBATCH examples, and all of
them use SRUN, unless using some version of MPI.






[slurm-dev] Re: srun to existing allocation, but just a specific node

2015-08-27 Thread John Hearns
 srun --output=srunps.%N-job%j-%t.out --jobid=5102027 'ps -ef'

So that it puts each node's ps in a separate file delineated by the hostname.

Oddly, I can't seem to figure out how to pipe inside that call: issuing 'ps 
-ef' does the same as 'ps -ef | grep GEOS'.


Matt, try using the pgrep command:  pgrep -l GEOS
Or something like that!
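
(If you really do want a pipeline rather than pgrep, wrapping it in a shell should
work - a sketch only, reusing the job id from the quoted command:)

srun --jobid=5102027 bash -c 'ps -ef | grep GEOS'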



Also, was there not a discussion on this list recently of a Slurm utility which
ran top on all nodes?
I looked back through my emails and cannot find it, though.

