Quoting "Jharrod W. LaFon" <[email protected]>:

> Hello,
>
> You may or may not be aware of a free utility called StarCluster (
> http://web.mit.edu/star/cluster/docs/latest/index.html), which completely
> allocates and configures clusters on Amazon's EC2.  It also provides the
> ability to grow or shrink the cluster using a load balancer.
>
> The default scheduler installed by StarCluster is SGE.  I have nothing
> against SGE, but I used SLURM when I worked at LANL and was very pleased
> with it, and decided to add it to StarCluster.
>
> The SLURM enabled fork of StarCluster is at
> https://github.com/jlafon/StarCluster, with a short set of instructions at
> https://github.com/jlafon/StarCluster/wiki/Getting-started-with-SLURM-on-Amazon's-EC2
> .
> This allows a fully configured SLURM cluster to be up and running in
> minutes.
>
> I do have some questions:
>
> Are there plans to add XML output to the SLURM utilities?  Right now I am
> parsing command output.  It would be much easier to implement a load
> balancer if this feature was available.

There are no immediate plans to do this, but SLURM is open source, so  
if you care to pursue it, feel free. I would suggest adding an  
--xml option to the scontrol command. SLURM does have C and Perl APIs  
available today.
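In the meantime, command output can be parsed directly. A minimal sketch, assuming the usual `Key=Value` layout that `scontrol show job`/`scontrol show node` emit (the sample string below is illustrative, not captured from a real cluster):

```python
import re

def parse_scontrol_record(text):
    """Parse one 'scontrol show' record (whitespace-separated Key=Value
    pairs) into a dict. This simple regex covers the common fields; fields
    whose values contain spaces would need more careful handling."""
    return dict(re.findall(r"(\w+)=(\S+)", text))

# Illustrative, abridged sample of `scontrol show job` output:
sample = "JobId=42 JobName=bench UserId=ec2-user(1000) JobState=PENDING Reason=Resources"
job = parse_scontrol_record(sample)
```

A load balancer could poll this way and key its scaling decisions off fields such as `JobState` and `Reason`.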


> I have enabled slurmdbd.  Should I query the database directly for running
> and completed job information, or only use sreport, sacct, etc?

There are APIs available, which eliminate the risks associated with  
changes in the DB schema and address security concerns (i.e.,  
authenticated communication with the slurmdbd). Parsing command  
output is also an option.
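If you do parse command output, sacct's machine-readable mode is the easiest target. A minimal sketch, assuming `sacct --parsable2` output (pipe-delimited with a header row; the sample string is illustrative):

```python
import csv
import io

def parse_sacct(output):
    """Parse `sacct --parsable2` output (pipe-delimited, header row first)
    into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(output), delimiter="|")
    return list(reader)

# Illustrative sample, as if produced by:
#   sacct --parsable2 --format=JobID,JobName,State,Elapsed
sample = "JobID|JobName|State|Elapsed\n42|bench|COMPLETED|00:05:12\n"
jobs = parse_sacct(sample)
```

Using `--parsable2` avoids guessing at column widths, since fixed-width sacct output changes with the requested format fields.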


> How can I queue the jobs that I submit (rather than having them rejected) when the
> configured resources are not yet available to fulfill the job requirements?  When
> using a load balancer, it is desirable to run only the SLURM controller
> node (with no compute nodes) if no jobs are running, and let the load
> balancer expand the cluster by adding compute nodes as jobs are
> submitted.  I have this feature working, but through a workaround: I
> configure a hidden partition of nodes with dummy entries in /etc/hosts,
> and update /etc/hosts with correct entries when compute nodes are added.
> This allows a job to be queued rather than rejected, and allows slurmctld
> to start with fake nodes in a hidden partition (I noticed that slurmctld
> won't start at all if it can't resolve node hostnames to IP addresses).

Take a look at this document:
http://www.schedmd.com/slurmdocs/elastic_computing.html

Of particular note is the node state of "CLOUD". It is not all  
fleshed out, but there is some infrastructure available in SLURM today  
to help.
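As a rough illustration of the approach in that document, a slurm.conf fragment along these lines defines CLOUD-state nodes and hands power control to site-provided scripts (the script paths and node names here are hypothetical, and other required settings are omitted):

```
# Illustrative slurm.conf fragment for elastic/cloud scheduling
ResumeProgram=/usr/local/sbin/ec2_resume    # hypothetical script that launches EC2 instances
SuspendProgram=/usr/local/sbin/ec2_suspend  # hypothetical script that terminates idle instances
SuspendTime=300                             # seconds a node sits idle before it is suspended

NodeName=compute[001-032] CPUs=2 State=CLOUD
PartitionName=cloud Nodes=compute[001-032] Default=YES MaxTime=INFINITE State=UP
```

Jobs submitted against CLOUD nodes queue rather than fail, which should remove the need for the dummy /etc/hosts entries once the resume/suspend scripts manage the instances.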


> Thanks!
>
> --
> Jharrod LaFon
>
