[slurm-dev] Re: How to distinguish slurmctld.pid of server and backup-server on a shared disk(slurm 14.11)

2015-06-11 Thread Qianqian Sha
Hi, Chrysovalantis.

Thanks a lot for the prompt reply.

Actually we have many nodes, and we put all the logs/pids/tmpfiles in one
folder named slurm-xxx on a shared filesystem.
My boss thinks it's a good idea because we won't have to update every node
on every update attempt.  Well, I think it's not bad either. Hah.

Symbolic links are a feasible solution, but I will have to remove the local
symbolic link whenever a node is removed for any other use.

For now I put the pid files into /var/run, where they are deleted when the
process is killed, and the logs of the primary and backup slurmctld go into
one file.

Anyway, thanks a lot for your kind help.
May you have a good day.


2015-06-10 18:57 GMT+08:00 Chrysovalantis Paschoulas <
c.paschou...@fz-juelich.de>:

>  Hi!
>
> At our site we store the logs and pid files of the slurmctld and slurmdbd
> daemons on local disk. That makes it easier to distinguish which daemon is
> doing what, and since the daemons are not running when a node is down,
> there is no need to put these files on a shared filesystem. For slurmctld
> we have the state dir and jobcomp on a shared active/active filesystem.
> For slurmdbd we have an active/passive filesystem between the master nodes
> where we store the mysql/mariadb database.
>
> I understand that it is your policy to use shared filesystems for those
> files (and in general that is not a bad idea), but in this case, for the
> slurm control daemons, you should use local disks (personal opinion).
>
> Anyway, if you insist on using a shared fs, then I can only think of one
> solution: use symbolic links. E.g.:
> @master1: path-of-pidfile-in-slurm.conf -> /sharedfs/slurm/run/master1.pid
> @master2: path-of-pidfile-in-slurm.conf -> /sharedfs/slurm/run/master2.pid
>
> Cheers,
> Chrysovalantis Paschoulas
>
>
>
> On 06/10/2015 04:47 AM, Qianqian Sha wrote:
>
> Hi,
>
>   We store all slurm logs/pids/tmpfiles on a shared disk. The
> logs/pids/tmpfiles of different slurmds can be distinguished by
> nodename (%n) or hostname (%h), but it seems that the log/pid paths of
> slurmctld or backup-slurmctld do not support %n or %h.
>
> Any assistance or advice is appreciated.
>
>  Thanks.
>
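The symlink workaround described above can be sketched as a small per-host script. This is an illustrative sketch only: the directory names are stand-ins (placed under /tmp so the sketch is self-contained), not paths from the thread.

```shell
# Give each controller a host-specific pid file on the shared filesystem
# by symlinking the path configured in slurm.conf to a per-host target.
shared=/tmp/sharedfs/slurm/run   # stand-in for the shared filesystem
local_run=/tmp/local-var-run     # stand-in for the local /var/run
mkdir -p "$shared" "$local_run"
host=$(hostname -s)
# slurm.conf would then point at the local path, e.g.:
#   SlurmctldPidFile=/tmp/local-var-run/slurmctld.pid
ln -sf "$shared/$host.pid" "$local_run/slurmctld.pid"
ls -l "$local_run/slurmctld.pid"
```

Run once on each master, and each controller's pid file resolves to its own file on the shared filesystem.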
>


[slurm-dev] What is the reload function in /etc/init.d/slurm supposed to do?

2015-06-11 Thread Sean Blanton
I had a bit of an adventure with that: I accidentally brought all the
slurm nodes to a "not responding" state (no job interruption). I'm
mid-upgrade from 2.6.1 to 14.11.7, so this is a transient state; no call
to fix anything, just a question.

After my adventure, I realized I'm not sure what is really supposed to
happen on 'reload'. It looks like the controller asks all the nodes to
reload their configs? I didn't find anything in the docs.

Here are the details, more as an FYI and for entertainment:

I upgraded the dbd, controller, and one compute node - no problems - using
a pre-existing init.d/slurm script.  I'm improving the Debian package for
internal distribution and see that the init.d scripts in the distribution
do not support Debian (1. the location of the init functions is
/lib/lsb/init-functions, and 2. support for force_reload() is mandatory -
I'll be happy to contribute this upstream). I'm also newish to slurm, so
I'm trying to gain a deep understanding of the init.d script to learn
about slurm.

Investigating how to implement force_reload(), I see in /etc/init.d/slurm
that 'reload' equates to:

 killproc $prog -HUP

I think, "fine, I can reload the configs instead of restarting the service
every time!"  So I do a:

 kill -1   #-- that's kill-dash-one

At first, the controller log shows:

[2015-06-11T07:44:12.224] Reconfigure signal (SIGHUP) received
[2015-06-11T07:44:12.235] restoring original state of nodes
[2015-06-11T07:44:12.235] restoring original partition state
[2015-06-11T07:44:12.246] cons_res: select_p_node_init
[2015-06-11T07:44:12.246] cons_res: preparing for 5 partitions
[2015-06-11T07:44:12.447] read_slurm_conf: backup_controller not specified.
[2015-06-11T07:44:12.447] cons_res: select_p_reconfigure
[2015-06-11T07:44:12.447] cons_res: select_p_node_init
[2015-06-11T07:44:12.447] cons_res: preparing for 5 partitions

Then the controller starts spitting out errors.

[2015-06-11T07:45:15.228] agent/is_node_resp: node: rpc:1001 :
Incompatible versions of client and server code

One per node. Then all the nodes stop responding.

 error: Nodes  not responding

I restarted every node's slurmd and everything was back to normal.

I guess I expected only the controller to reload its config file, but all
the better that it asks all the nodes to do the same.

I'll of course not do this again until the upgrade is complete.
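As an aside on the `kill -1` above: signal 1 is SIGHUP, so `kill -1` and `kill -HUP` are interchangeable. A quick check with a throwaway process (not a slurm daemon):

```shell
# Signal 1 is SIGHUP; demonstrate on a disposable background sleep,
# whose default SIGHUP disposition is to terminate.
kill -l 1                              # prints: HUP
sleep 300 &
pid=$!
kill -1 "$pid"                         # equivalent to: kill -HUP "$pid"
wait "$pid" || echo "wait status: $?"  # prints: wait status: 129 (128 + 1)
```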



Thanks
Sean

-- 
Sean Blanton, Ph.D.


[slurm-dev] concurrent job limit

2015-06-11 Thread Martin, Eric

Is there a way for users to self-limit the number of jobs that they
run concurrently?
Eric Martin 
Center for Genome Sciences & Systems Biology
Washington University School of Medicine 
 Forest Park Avenue 
St. Louis, MO 63108
The materials in this message are private and may contain Protected
Healthcare Information or other information of a sensitive nature. If you
are not the intended recipient, be advised that any unauthorized use,
disclosure, copying or the taking of any action in reliance on the
contents of this information is strictly prohibited. If you have received
this email in error, please immediately notify the sender via telephone or
return mail.


[slurm-dev] Re: What is the reload function in /etc/init.d/slurm supposed to do?

2015-06-11 Thread Sean Blanton
Minor correction: that's support for Ubuntu, and the missing directive is
'force-reload', with a dash, not an underscore.



-- 
Sean Blanton, Ph.D.
Quantitative Technologist
Radix Trading, LLC
Desk: 773.985.0456
Cell:   773.960.3495


[slurm-dev] Re: concurrent job limit

2015-06-11 Thread John Desantis

Eric,

We find that the "honor system" frequently needs to be policed.

We use SLURM's QOS and accounting systems to limit the number of
concurrent cores, maximum running jobs, and maximum pending jobs.  You
may want to investigate this.

With a QOS, you can apply limits per user (specifically cores), and
with accounting you can apply limits per user or for a group (account)
as a whole.
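As a sketch of the QOS route: with accounting enabled, a cap on running jobs can be attached to users via a QOS. The QOS name `capped` and the user `emartin` below are placeholders, and these commands need a live slurmdbd, so treat this as a configuration sketch rather than something to paste verbatim.

```shell
# Create a QOS that caps each user at 10 running jobs, then attach it.
sacctmgr -i add qos capped
sacctmgr -i modify qos capped set MaxJobsPerUser=10
sacctmgr -i modify user emartin set qos+=capped
```

For the limits to actually be enforced, slurm.conf must include `limits` (and `qos`) in AccountingStorageEnforce.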

John DeSantis


2015-06-11 10:12 GMT-04:00 Martin, Eric :
> Is there a way for users to self limit the number of jobs that they
> concurrently run?
>
> Eric Martin
> Center for Genome Sciences & Systems Biology
> Washington University School of Medicine
>  Forest Park Avenue
> St. Louis, MO 63108


[slurm-dev] Re: Question about running interactive job on all cores on a list of heterogeneous nodes.

2015-06-11 Thread Christopher B Coffey
Hi Jim,

I think it may depend on how "SelectType=" is set up in your config.
If it's set to cons_res, then you should be able to do something like
the following:

Sum up the total number of cores from the list of nodes beforehand (maybe
this isn't possible, though), then do:

srun -n <total_cores> -w node[1-50]

Or:

srun -n <total_cores> -w /path/to/node_list_file

Hope that helps.

Best,
Chris




On 6/10/15, 2:26 PM, "Jim Robanske"  wrote:

>Hello,
>
>I've been trying to figure out how to run an interactive job on all cores
>on a list of heterogeneous nodes (meaning that the nodes in the list may
>have different numbers of cores).
>
>Anyone out there know how to accomplish this?
>
>Thanks in advance...
>
>-- 
>I may be inconsistent, but at least I'm consistently inconsistent.
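The "sum up the cores beforehand" step can be scripted against a live cluster; the following is an untested sketch (the nodelist is an example, and it assumes `sinfo`/`srun` from a working slurm installation):

```shell
# Count the CPUs on each target node, total them, and request that many
# tasks pinned to exactly those nodes for an interactive shell.
nodes="node[1-50]"                                      # example nodelist
total=$(sinfo -h -N -n "$nodes" -o "%c" | paste -sd+ - | bc)
srun -n "$total" -w "$nodes" --pty bash
```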



[slurm-dev] Re: Question about running interactive job on all cores on a list of heterogeneous nodes.

2015-06-11 Thread Moe Jette


Also try the --exclusive option on the job submit command line.




--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support


[slurm-dev] Re: concurrent job limit

2015-06-11 Thread Ryan Cox

Job arrays can kind of be used for that:

From http://slurm.schedmd.com/job_array.html:
A maximum number of simultaneously running tasks from the job array may 
be specified using a "%" separator. For example "--array=0-15%4" will 
limit the number of simultaneously running tasks from this job array to 4.
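As a concrete submission, the "%" cap from the quoted documentation looks like this (the script name is a placeholder):

```shell
# Submit 16 array tasks (indices 0-15), but let at most 4 run at once.
sbatch --array=0-15%4 my_array_job.sh
```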


Ryan

On 06/11/2015 08:12 AM, Martin, Eric wrote:
Is there a way for users to self limit the number of jobs that they 
concurrently run?


Eric Martin
Center for Genome Sciences & Systems Biology
Washington University School of Medicine
 Forest Park Avenue
St. Louis, MO 63108







--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] slurm on systemd

2015-06-11 Thread Trevor Gale
Hi all,

I was wondering if slurm runs on systemd-based operating systems like
Fedora 22. I installed from the tarball, and right now I can't figure out
how to start the slurm daemons. Does anyone know how to start the daemons
on Fedora, or whether slurm is compatible with systemd?

Thanks,
Trevor


[slurm-dev] Re: slurm on systemd

2015-06-11 Thread Christopher Samuel

On 12/06/15 09:43, Trevor Gale wrote:

> I was wondering if slurm runs on systemd-based operating systems like
> Fedora 22. I installed from the tarball, and right now I can't figure
> out how to start the slurm daemons. Does anyone know how to start the
> daemons on Fedora, or whether slurm is compatible with systemd?

I did run it on CentOS 7 as a test a while ago. We never start slurm on
boot, and just logging in as usual and running our RHEL6 init script to
start it worked fine.

Of course that won't give you proper systemd integration, but it did
seem to work that way for simple testing, at least.

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
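For proper systemd integration, a minimal unit file along the following lines should work. This is a sketch under assumptions: a tarball install under /usr/local and the default pid file path; adjust both to match your slurm.conf. (Later Slurm releases ship their own unit files in etc/.)

```ini
# /etc/systemd/system/slurmctld.service -- illustrative sketch only
[Unit]
Description=Slurm controller daemon
After=network.target

[Service]
Type=forking
ExecStart=/usr/local/sbin/slurmctld
PIDFile=/var/run/slurmctld.pid

[Install]
WantedBy=multi-user.target
```

After installing it, `systemctl daemon-reload && systemctl start slurmctld` replaces the init script invocation.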


[slurm-dev] Re: Slurm and docker/containers

2015-06-11 Thread Christopher Samuel

On 05/06/15 19:31, Nathan Harper wrote:

> Has anyone looked at using LXC rather than Docker specifically?  From
> what I understand, it's possible to run unprivileged LXC containers, so
> no need to be root.

When I spoke to some of the uni's OpenStack folks at lunch today about
containerisation, they recommended looking at LXD from Canonical, which
is an evolution of LXC.

http://www.ubuntu.com/cloud/tools/lxd

Of course getting it to run on anything other than Ubuntu might be a
challenge which could limit its usefulness.

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci