[slurm-dev] Re: building slurm with rpmbuild and hwloc support
Hi!

You need two packages installed on the system where you build Slurm (in my case it is a RHEL-based distro): hwloc and hwloc-devel. You also don't need the .rpmmacros entry for hwloc if you have these packages installed; this option is enabled by default. ;) By the way, I have never used a custom hwloc installation, so I cannot help you if you go in that direction. Sorry for the delayed answer; most probably you have already solved the problem...

Best Regards,
Valantis

On 10/17/2014 05:19 PM, Pancorbo, Juan wrote:

Hello, we are running Slurm 2.6.9 and we are trying to use the task/cgroup plugin, but when we run a job we get the following error:

    slurmd[lxa178]: task/cgroup: plugin not compiled with hwloc support, skipping affinity.

Slurm was built with rpmbuild to be installed as an RPM, so we tried to rebuild Slurm with hwloc support so that the plugin gets recompiled. We first tried to use .rpmmacros:

    $ cat .rpmmacros
    %with_hwloc --with-hwloc=/usr/hwloc/1.10

without success:

    checking for hwloc installation... configure: WARNING: unable to locate hwloc installation

We ran the configure script:

    ./configure --enable-pam --enable-debug --enable-salloc-kill-cmd --with-pam_dir=/etc/pam.d --with-munge=/etc/munge --with-ssl=/etc/ssl --with-hwloc=/usr/hwloc/1.10 --prefix=/usr --sysconfdir=/etc/slurm

and from the output:

    checking for hwloc installation... /usr/hwloc/1.10

We then put the configure results inside the tar and built it again, without success. We also tried including this line in slurm.spec after %configure \:

    %{?with_hwloc:--with-hwloc=/usr/hwloc/1.10}

That also didn't work. Does anybody have an idea of what is needed to compile the plugin with hwloc support using rpmbuild? Thanks in advance.
Juan Pancorbo Armada
juan.panco...@lrz.de
http://www.lrz.de
Leibniz-Rechenzentrum
Abteilung: Hochleistungssysteme
Boltzmannstrasse 1, 85748 Garching
Telefon: +49 (0) 89 35831-8735
Fax: +49 (0) 89 35831-8535
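Valantis' answer above can be condensed into a short recipe (a sketch, assuming a RHEL-like build host and the Slurm 2.6.9 tarball from the thread): install the distribution's hwloc packages before rebuilding, and drop the custom .rpmmacros entry, since configure then finds the system hwloc on its own.

```
# Sketch of the suggested rebuild (run as root on the build host):
# with hwloc-devel present, hwloc support is detected automatically.
yum install hwloc hwloc-devel
rpmbuild -ta slurm-2.6.9.tar.bz2
```

The custom --with-hwloc=/usr/hwloc/1.10 path is only needed for a non-system hwloc installation, which the reply explicitly does not cover.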
[slurm-dev] Re: slurm cannot work with Infiniband after rebooting
Hi!

This is certainly not connected to Slurm; it is a problem with your InfiniBand + Intel MPI configuration. You should ask for help on other forums or mailing lists ;) First, I would suggest you configure the dat.conf file correctly (in my case it is /etc/dat.conf): comment out all the lines with unsupported IB modes. Then you should export some Intel MPI variables to set up the correct environment. Look for the documentation on Intel MPI variables such as I_MPI_DEVICE, I_MPI_FABRICS, I_MPI_FALLBACK, I_MPI_DAPL_PROVIDER_LIST and I_MPI_DEBUG. If you experiment enough, I am sure you will get the desired result. In our case we had set, for example, I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1, which solved similar problems if I remember correctly.

Best Regards,
Chrysovalantis Paschoulas

On 10/20/2014 06:46 PM, Tingyang Xu wrote:

To whom it may concern,

Hello. I am new to Slurm. I am facing a problem using Slurm with InfiniBand: when I run MPI jobs on a freshly rebooted node, I get fabric errors. For example, I tried a simple "hello world" via Intel MPI.
I did something like this:

    $ salloc -N1 -n12 -w cn117    # cn117 is the node just rebooted
    salloc: Granted job allocation 1201
    $ module list
    Currently Loaded Modulefiles:
      1) modules   2) null   3) intelics/2013.1.039
    $ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
    $ srun ./hello
    [3] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
    [4] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
    [5] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
    [6] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
    [7] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
    [8] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
    [10] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
    [11] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
    [0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
    [9] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
    [1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
    [2] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
    srun: error: cn117: tasks 0-11: Exited with exit code 254
    srun: Terminating job step 1201.0

However, as soon as I manually restart slurmd on cn117, the problem is fixed.
For example:

    $ ssh root@cn117
    cn117# service slurm restart
    stopping slurmd:    [ OK ]
    slurmd is stopped
    starting slurmd:    [ OK ]
    # exit
    $ salloc -N1 -n12 -w cn117
    salloc: Granted job allocation 1203
    $ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
    $ srun ./hello
    This is Process 9 out of 12 running on host cn117
    This is Process 3 out of 12 running on host cn117
    This is Process 2 out of 12 running on host cn117
    This is Process 7 out of 12 running on host cn117
    This is Process 6 out of 12 running on host cn117
    This is Process 0 out of 12 running on host cn117
    This is Process 5 out of 12 running on host cn117
    This is Process 1 out of 12 running on host cn117
    This is Process 4 out of 12 running on host cn117
    This is Process 10 out of 12 running on host cn117
    This is Process 8 out of 12 running on host cn117
    This is Process 11 out of 12 running on host cn117

Although I can do this manually, I would like the system to handle it automatically. I tried adding "sleep 10s; /etc/init.d/slurm restart" at the end of rc.local, but the issue is still there. Can anyone help me with that?

Sincerely,
Tingyang Xu
HPC Administrator
University of Connecticut

PS: some information about the InfiniBand setup:

    $ slurmd -V
    slurm 14.03.0
    cn117$ ofed_info | head -n1
    MLNX_OFED_LINUX-2.2-1.0.1 (OFED-2.2-1.0.0):
    cn117$ ibv_devinfo
    hca_id: mlx4_0
        transport:        InfiniBand (0)
        fw_ver:           2.11.550
        node_guid:
        sys_image_guid:   ##
        vendor_id:        ##
        vendor_part_id:
        hw_ver:           0x0
        board_id:
        phys_port_cnt:    2
        port: 1
            state:        PORT_ACTIVE (4)
            max_mtu:      4096 (5)
            active_mtu:   4096 (5)
            sm_lid:       1
            port_lid:     131
            port_lmc:     0x00
            link_layer:   InfiniBand
        port: 2
            state:        PORT_DOWN (1)
            max_mtu:      4096 (5)
            active_mtu:   4096 (5)
            sm_lid:       0
            port_lid:     0
            port_lmc:     0x00
            link_layer:   InfiniBand
    cn117$ cat /etc/redhat-release
    Red Hat Enterprise Linux Workstation release 6.5 (Santiago)
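For reference, the environment Chrysovalantis suggests exporting might look like the sketch below. This is a hedged example: the fabric choice and the provider string are site-specific assumptions, and the provider must match an uncommented entry in your /etc/dat.conf.

```shell
# Hypothetical Intel MPI settings along the lines of the reply above.
# Values are site-specific; check /etc/dat.conf for valid DAPL providers.
export I_MPI_FABRICS=shm:dapl                     # shared memory intra-node, DAPL inter-node
export I_MPI_FALLBACK=0                           # fail loudly instead of silently using TCP
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1   # the provider that worked in the reply
export I_MPI_DEBUG=5                              # verbose startup output for diagnosis
echo "provider=$I_MPI_DAPL_PROVIDER_LIST fallback=$I_MPI_FALLBACK"
```

With I_MPI_DEBUG set, the MPI startup lines will show which fabric and provider were actually selected, which makes misconfiguration much easier to spot.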
[slurm-dev] Re: slurm cannot work with Infiniband after rebooting
Thank you very much, Chrysovalantis. I just created a topic on the Intel forum, since your suggestion did not fix our issue. I will also update this topic if I find the solution, in case other Slurm users hit a similar issue.

Thanks,
Tingyang Xu

From: Chrysovalantis Paschoulas
Sent: Monday, October 27, 2014 10:45 AM
To: slurm-dev
Subject: [slurm-dev] Re: slurm cannot work with Infiniband after rebooting
[slurm-dev] logrotate causing job authentication failure
Had two jobs die yesterday morning with a slurm_load_jobs error (Protocol authentication error) from inside DRMAA, and this interesting message in the log:

    If munged is up, restart with --num-threads=10
    error: Munge encode failed: Unable to access /var/run/munge/munge.socket.2: No such file or directory
    error: authentication: Munged communication error

The slurmctld log has this error at about the same time:

    slurm_receive_msg: Zero Bytes were transmitted or received

Digging deeper, it appears that the jobs' states were changing in slurmctld just as the munge daemon was restarted for a logrotate. I changed logrotate to rotate munge.log based on size instead of daily, which may fix the problem, but it feels more like a workaround. Any other suggestions? It would be nice to have some sort of retry in the code, but I'm not really sure whether it would belong in slurmctld or in the DRMAA code.
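The size-based rotation described above might look roughly like this as a logrotate stanza. This is a sketch: the path, size threshold, and restart command are assumptions for a typical munge installation, not the poster's actual config. The restart in postrotate still briefly interrupts munged; rotating on size just makes it happen far less often than daily.

```
/var/log/munge/munge.log {
    size 100M          # rotate on size rather than daily
    rotate 4
    compress
    missingok
    notifempty
    postrotate
        /sbin/service munge restart > /dev/null 2>&1 || true
    endscript
}
```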
[slurm-dev] Re: logrotate causing job authentication failure
Slurm already has connect retry logic (10 retries with 0.1 seconds between them). DRMAA should need no changes unless it accesses munge directly. Has anyone else seen this problem?

Quoting E V eliven...@gmail.com: [...]

--
Morris "Moe" Jette
CTO, SchedMD LLC
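The retry behavior Moe describes can be sketched roughly as follows. This is a hedged illustration, not Slurm's actual implementation; try_connect is a hypothetical stand-in for a single connection attempt (here it always fails, so the loop exhausts all 10 attempts).

```shell
# Sketch of a bounded retry loop: up to 10 attempts, 0.1 s apart,
# mirroring the numbers quoted in the reply above.
try_connect() { false; }   # hypothetical stand-in; always fails here

attempts=0
ok=no
for i in 1 2 3 4 5 6 7 8 9 10; do
    attempts=$((attempts + 1))
    if try_connect; then ok=yes; break; fi
    sleep 0.1
done
echo "attempts=$attempts ok=$ok"
```

With these numbers the total retry window is about a second, which explains why a munged restart during logrotate can still outlast it and surface as an authentication failure.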
[slurm-dev] Re: recommended software stack for development?
Hi Manuel,

The first rule is: keep it simple! I suggest you start by viewing this as two problems:

1. Learning how to work with Slurm
2. Learning how to work with clusters

For learning how to work with Slurm, cloning a copy of the repo is a good start. In the Developer's notes in the documentation you'll find instructions for running Slurm on a single node, which makes testing and debugging MUCH easier than running on multiple nodes. Once you've got a simple test version running, you can start thinking about writing new code.

As for Puppet, Jenkins, et al., start with something easy, perhaps just ensuring that you can set up two nodes and ssh into them. Once you're comfortable with Slurm, you can add it to your virtual environment.

Hope this helps!
Andy

On 10/27/2014 12:59 PM, Manuel Rodríguez Pascual wrote:

Hi all,

I intend to work on Slurm, modifying it to satisfy my needs and (hopefully) adding some new functionality. I am, however, rather new to this kind of software development, so I am writing to look for advice. My question is: can you recommend any tools for developing Slurm?

As a first layer, my idea is to use plain virtual machines and employ Puppet to configure them and then install MPICH and BLCR. Then Jenkins would install and configure a Slurm-based cluster and run a set of tests. I am, however, new to both tools as well as to developing Slurm, so I am kind of lost right now. So, before starting to build and configure all this, I would really appreciate some suggestions from more experienced developers.

I plan to clone the Slurm GitHub repo into my own GitHub account and then employ Jenkins for continuous integration. I have some doubts about how exactly to do that, in particular regarding the contextualization of the compilation process and the integration of the included regression tests with Jenkins. Have you got any suggestions on this?
Again, any feedback on the best tools for working with Slurm would be welcome. Thanks for your help.

Best regards,
Manuel
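As a concrete starting point for Andy's single-node suggestion, a development build might look roughly like the transcript below. This is a sketch under assumptions: the flags shown are standard configure options (--enable-multiple-slurmd lets several slurmd daemons run on one machine to emulate a small cluster), but check the Developer's notes in the documentation for the current recommendations.

```
$ git clone https://github.com/SchedMD/slurm.git
$ cd slurm
$ ./configure --prefix=$HOME/slurm-test --sysconfdir=$HOME/slurm-test/etc \
              --enable-debug --enable-multiple-slurmd
$ make -j4 && make install
```

Installing into $HOME keeps the test installation fully separate from any system Slurm, so it can be wiped and rebuilt freely while experimenting.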
[slurm-dev] Re: recommended software stack for development?
I wouldn't count what I've done as production-ready, but I have a Puppet module for BLCR [1] and one for SLURM [2]. There is also one for managing SLURM QOS and clusters using native Puppet types [3]. They likely won't aid in development, as the two SLURM-related modules both assume you have built RPMs and placed them in a repository accessible to your hosts to install from. If those modules aren't exactly what you're looking for, they may offer ideas on how to get started with your own. The SLURM module was originally a fork from CERNOps but has since been completely rewritten.

The SLURM module [2] uses beaker to provision 4 VMs: one is the controller, two are compute nodes, and one is a client (in my environment that is the login nodes, web server, etc.). Those automated tests assume you pass a URL to a yum repo containing RPMs. The module relies on exported resources, so the provisioning of those 4 VMs is painfully long due to having to also set up PostgreSQL and PuppetDB.

- Trey

[1]: https://forge.puppetlabs.com/treydock/blcr
[2]: https://github.com/treydock/puppet-slurm
[3]: https://github.com/treydock/puppet-slurm_providers

=
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu

On Mon, Oct 27, 2014 at 12:00 PM, Manuel Rodríguez Pascual manuel.rodriguez.pasc...@gmail.com wrote: [...]
[slurm-dev] Re: recommended software stack for development?
Manuel Rodríguez Pascual manuel.rodriguez.pasc...@gmail.com writes: [...]

Hi Manuel,

I agree with Andy that it's best to view this as two separate tasks (cluster setup/management + Slurm development). For your cluster setup, you could use Qlustar, which will allow you to easily set up a ready-to-run virtual demo cluster, including a functioning Slurm and OpenMPI, in about 30 minutes (no exaggeration; just follow https://www.qlustar.com/book/docs/install-guide and https://www.qlustar.com/book/docs/first-steps). The Qlustar Basic Edition is free for academic usage and has everything needed for your use case. Once it is set up, you have all the tools of Ubuntu or Debian at your fingertips to jump into development.

Good luck,
Roland

---
http://www.q-leap.com / http://qlustar.com
--- HPC / Storage / Cloud Linux Cluster OS ---
[slurm-dev] Re: Understanding Fairshare and effect on background/backfill type partitions
Trey,

I'm not sure why your jobs aren't starting; someone else will have to answer that question.

You can model an organizational hierarchy much better in 14.11 due to changes in Fairshare=parent for accounts. If you only want fairshare to matter at the research-group and user levels but want to maintain an account structure that reflects your organization, set everything above the research group to Fairshare=parent. That makes those accounts disappear for fairshare calculation purposes (but not for limits, accounting, etc.).

As for fairshare, precision loss can be a real issue, and I'm guessing you're affected. I won't rehash our Slurm UG presentation here, but we spent some time discussing precision loss issues. What normalized shares values do you see? Try plugging them into 2^(-EffectvUsage / SharesNorm) to see how small the number is. That number then has to be multiplied by PriorityWeightFairshare, which I see you sized properly.

I would suggest looking at the Fair Tree fairshare algorithm once 14.11 is released. In case you want more information: http://slurm.schedmd.com/SUG14/fair_tree.pdf and https://fsl.byu.edu/documentation/slurm/fair_tree.php. The first link also discusses Fairshare=parent in slides 82-91.

Ryan

Disclaimer: I have some personal interest in both of the suggestions, since we developed them.

On 10/24/2014 10:49 AM, Trey Dockendorf wrote:

In our setup we use a background partition that can be preempted but has access to the entire cluster. The idea is that when stakeholder partitions are not fully utilized, users can opportunistically make use of the cluster when the system is not 100% utilized. Recently I submitted a batch of jobs, ~60, to our background partition. All nodes were idle, but half my jobs ended up pending with a reason of Priority. I checked sshare and my FairShare value was at 0.00.
Would my fairshare dropping to 0 cause my jobs to be queued even when resources were idle and no other jobs were queued in that partition besides my own? I'm also wondering what method is used to come up with sane fairshare values. We have a (likely unnecessarily) complex account structure in slurmdbd that mimics the organizational structure of the departments / colleges / research groups using the cluster. I'd be interested in how other groups have configured fairshare and the multifactor priority.

For completeness, here are the relevant config items I'm working with:

    AccountingStorageEnforce=limits,qos
    PreemptMode=SUSPEND,GANG
    PreemptType=preempt/partition_prio
    PriorityCalcPeriod=5
    PriorityDecayHalfLife=7-0
    PriorityFavorSmall=YES
    PriorityFlags=SMALL_RELATIVE_TO_TIME
    PriorityMaxAge=7-0
    PriorityType=priority/multifactor
    PriorityUsageResetPeriod=NONE
    PriorityWeightAge=2000        # 20%
    PriorityWeightFairshare=4000  # 40%
    PriorityWeightJobSize=3000    # 30%
    PriorityWeightPartition=0     # 0%
    PriorityWeightQOS=1000        # 10%
    SchedulerParameters=assume_swap  # An option for an in-house patch
    SchedulerTimeSlice=30
    SchedulerType=sched/backfill
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU_Memory,CR_CORE_DEFAULT_DIST_BLOCK

Example of a stakeholder partition and background:

    PartitionName=hepx Nodes=c0[101-116,120-132,227,416,530-532,933-936] Priority=100 AllowQOS=hepx MaxNodes=1 MaxTime=120:00:00 State=UP
    PartitionName=background Priority=10 AllowQOS=background MaxNodes=1 MaxTime=96:00:00 State=UP

Thanks,
- Trey

=
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu
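Ryan's precision-loss point can be checked with a quick back-of-the-envelope calculation. This is a sketch: the usage and share values below are made up for illustration, while PriorityWeightFairshare=4000 comes from Trey's config above.

```shell
# Hypothetical numbers: EffectvUsage=0.30 against a normalized share of 0.02.
# The fairshare factor 2^(-usage/share) collapses toward zero, so even a
# large PriorityWeightFairshare contributes almost no priority points.
awk 'BEGIN {
    f = 2 ^ (-0.30 / 0.02)                 # 2^-15, about 0.0000305
    printf "factor=%.7f points=%d\n", f, int(4000 * f)
}'
```

A factor that small multiplied by 4000 truncates to 0 priority points, which is consistent with a FairShare of 0.00 in sshare leaving jobs pending on Priority even against an otherwise idle partition.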