Quoting "Jharrod W. LaFon" <[email protected]>:

> Hello,
>
> You may or may not be aware of a free utility called StarCluster
> (http://web.mit.edu/star/cluster/docs/latest/index.html), which completely
> allocates and configures clusters on Amazon's EC2. It also provides the
> ability to grow or shrink the cluster using a load balancer.
>
> The default scheduler installed by StarCluster is SGE. I have nothing
> against SGE, but I used SLURM when I worked at LANL and was very pleased
> with it, and decided to add it to StarCluster.
>
> The SLURM-enabled fork of StarCluster is at
> https://github.com/jlafon/StarCluster, with a short set of instructions at
> https://github.com/jlafon/StarCluster/wiki/Getting-started-with-SLURM-on-Amazon's-EC2
> This allows a fully configured SLURM cluster to be up and running in
> minutes.
>
> I do have some questions:
>
> Are there plans to add XML output to the SLURM utilities? Right now I am
> parsing command output. It would be much easier to implement a load
> balancer if this feature were available.
There are no immediate plans to do this, but SLURM is open source, so if you
care to pursue that, feel free. I would suggest just adding an --xml option
to the scontrol command. SLURM does have C and Perl APIs available today.

> I have enabled slurmdbd. Should I query the database directly for running
> and completed job information, or only use sreport, sacct, etc?

There are APIs available, which eliminate risks associated with changes in
the DB schema and address security concerns (i.e. authenticated
communications with the slurmdbd). Parsing command output is also an option.

> How can I queue jobs that I submit (rather than rejecting them) when the
> configured resources are enough to fulfill the job requirements? When
> using a load balancer, it is desirable to only run the SLURM controller
> node (with no compute nodes) if there are no jobs running, and let the
> load balancer expand the cluster by adding compute nodes as jobs get
> submitted. I have this feature working, but through a workaround. I
> configure a hidden partition of nodes with dummy entries in /etc/hosts,
> and update /etc/hosts with correct entries when compute nodes are added.
> This allows a job to be queued rather than rejected, and allows slurmctld
> to start with fake nodes in a hidden partition (I noticed that slurmctld
> won't start at all if it can't resolve node hostnames to IP addresses).

Take a look at this document:
http://www.schedmd.com/slurmdocs/elastic_computing.html
Of particular note is the node state of "CLOUD". It is not fully fleshed
out, but there is some infrastructure available in SLURM today to help.

> Thanks!
>
> --
> Jharrod LaFon
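Until something like an --xml option exists, the one-record-per-line output
of `scontrol -o show job` is the easiest form to parse for a load balancer.
A minimal sketch in Python, assuming the usual space-separated Key=Value
format of that output (note that some values, e.g. Comment, can themselves
contain spaces, so whitespace splitting is only an approximation):

```python
# Sketch: parse `scontrol -o show job` output, where each job is printed
# on one line as space-separated Key=Value pairs. Field names used here
# (JobId, JobState, NumNodes) follow scontrol's usual output; a real load
# balancer would feed in the captured stdout of the command.

def parse_scontrol_line(line):
    """Turn one `scontrol -o show job` line into a dict of its fields."""
    fields = {}
    for token in line.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

def pending_jobs(output):
    """Return parsed records for jobs still waiting for resources."""
    jobs = (parse_scontrol_line(l) for l in output.splitlines() if l.strip())
    return [j for j in jobs if j.get("JobState") == "PENDING"]

if __name__ == "__main__":
    # Sample output in lieu of actually invoking scontrol.
    sample = (
        "JobId=101 Name=train JobState=PENDING NumNodes=2 Partition=cloud\n"
        "JobId=102 Name=infer JobState=RUNNING NumNodes=1 Partition=cloud\n"
    )
    for job in pending_jobs(sample):
        print(job["JobId"], job["NumNodes"])
```

A load balancer could sum NumNodes over the pending records to decide how
many EC2 instances to add.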

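For reference, the elastic computing approach the document describes can be
sketched as a slurm.conf fragment. The hostnames, counts, and script paths
below are placeholders, not values from this thread; the point is that a
node declared with State=CLOUD can be defined before the instance exists,
so jobs queue against it instead of being rejected, and slurmctld can start
without resolving its hostname:

```
# Sketch of the elastic/cloud-scheduling pieces of slurm.conf.
# Paths and node names are illustrative placeholders.

# Scripts SLURM runs to boot instances for jobs that need them,
# and to shut idle ones down again.
ResumeProgram=/opt/slurm/bin/ec2_resume.sh
SuspendProgram=/opt/slurm/bin/ec2_suspend.sh
SuspendTime=300          # seconds idle before a node is powered down

# Nodes that may not exist yet: CLOUD state replaces the /etc/hosts
# dummy-entry workaround described above.
NodeName=compute[001-032] State=CLOUD CPUs=8
PartitionName=cloud Nodes=compute[001-032] Default=YES State=UP
```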