There's a free, open-source application called StarCluster that can do most (if not all) of the EC2 provisioning and cluster setup for a High Throughput Computing cluster:

http://web.mit.edu/stardev/cluster/

StarCluster sets up NFS, SGE, BLAS library, Open MPI, etc. automatically for the user in around 10-15 minutes. StarCluster is licensed under the LGPL, written in Python+Boto, and supports a lot of the new EC2 features (Cluster Compute Instances, Spot Instances, Cluster GPU Instances, etc.). Support for launching clusters with higher node counts (100+ instances) is even better with the new scalability enhancements in the latest version (0.92).

There are also some tutorials on YouTube:

- "StarCluster 0.91 Demo":
  http://www.youtube.com/watch?v=vC3lJcPq1FY

- "Launching a Cluster on Amazon Ec2 Spot Instances Using StarCluster":
  http://www.youtube.com/watch?v=2Ym7epCYnSk

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net


On Wed, Sep 21, 2011 at 7:02 AM, Eugen Leitl <[email protected]> wrote:
>
> http://arstechnica.com/business/news/2011/09/30000-core-cluster-built-on-amazon-ec2-cloud.ars
>
> $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
>
> By Jon Brodkin | Published September 20, 2011 10:49 AM
>
> Amazon EC2 and other cloud services are expanding the market for
> high-performance computing. Without access to a national lab or a
> supercomputer in your own data center, cloud computing lets businesses spin
> up temporary clusters at will and stop paying for them as soon as the
> computing needs are met.
>
> A vendor called Cycle Computing is on a mission to demonstrate the potential
> of Amazon’s cloud by building increasingly large clusters on the Elastic
> Compute Cloud. Even with Amazon, building a cluster takes some work, but
> Cycle combines several technologies to ease the process and recently used
> them to create a 30,000-core cluster running CentOS Linux.
>
> The cluster, announced publicly this week, was created for an unnamed “Top 5
> Pharma” customer, and ran for about seven hours at the end of July at a peak
> cost of $1,279 per hour, including the fees to Amazon and Cycle Computing.
> The details are impressive: 3,809 compute instances, each with eight cores
> and 7GB of RAM, for a total of 30,472 cores, 26.7TB of RAM and 2PB
> (petabytes) of disk space. Security was ensured with HTTPS, SSH and 256-bit
> AES encryption, and the cluster ran across data centers in three Amazon
> regions in the United States and Europe. The cluster was dubbed “Nekomata.”
>
> Spreading the cluster across multiple continents was done partly for disaster
> recovery purposes, and also to guarantee that 30,000 cores could be
> provisioned. “We thought it would improve our probability of success if we
> spread it out,” Cycle Computing’s Dave Powers, manager of product
> engineering, told Ars. “Nobody really knows how many instances you can get at
> any one time from any one [Amazon] region.”
>
> Amazon offers its own special cluster compute instances, at a higher cost
> than regular-sized virtual machines. These cluster instances provide 10
> Gigabit Ethernet networking along with greater CPU and memory, but they
> weren’t necessary to build the Cycle Computing cluster.
>
> The pharmaceutical company’s job, related to molecular modeling, was
> “embarrassingly parallel” so a fast interconnect wasn’t crucial. To further
> reduce costs, Cycle took advantage of Amazon’s low-price “spot instances.” To
> manage the cluster, Cycle Computing used its own management software as well
> as the Condor High-Throughput Computing software and Chef, an open source
> systems integration framework.
>
> Cycle demonstrated the power of the Amazon cloud earlier this year with a
> 10,000-core cluster built for a smaller pharma firm called Genentech. Now,
> 10,000 cores is a relatively easy task, says Powers.
> “We think we’ve mastered
> the small-scale environments,” he said. 30,000 cores isn’t the end game,
> either. Going forward, Cycle plans bigger, more complicated clusters, perhaps
> ones that will require Amazon’s special cluster compute instances.
>
> The 30,000-core cluster may or may not be the biggest one run on EC2. Amazon
> isn’t saying.
>
> “I can’t share specific customer details, but can tell you that we do have
> businesses of all sizes running large-scale, high-performance computing
> workloads on AWS [Amazon Web Services], including distributed clusters like
> the Cycle Computing 30,000 core cluster to tightly-coupled clusters often
> used for science and engineering applications such as computational fluid
> dynamics and molecular dynamics simulation,” an Amazon spokesperson told Ars.
>
> Amazon itself actually built a supercomputer on its own cloud that made it
> onto the list of the world’s Top 500 supercomputers. With 7,000 cores, the
> Amazon cluster ranked number 232 in the world last November with speeds of
> 41.82 teraflops, falling to number 451 in June of this year. So far, Cycle
> Computing hasn’t run the Linpack benchmark to determine the speed of its
> clusters relative to Top 500 sites.
>
> But Cycle’s work is impressive no matter how you measure it. The job
> performed for the unnamed pharma company “would take well over a week for
> them to run internally,” Powers says. In the end, the cluster performed the
> equivalent of 10.9 “compute years of work.”
>
> The task of managing such large cloud-based clusters forced Cycle to step up
> its own game, with a new plug-in for Chef the company calls Grill.
>
> “There is no way that any mere human could keep track of all of the moving
> parts on a cluster of this scale,” Cycle wrote in a blog post.
> “At Cycle,
> we’ve always been fans of extreme IT automation, but we needed to take this
> to the next level in order to monitor and manage every instance, volume,
> daemon, job, and so on in order for Nekomata to be an efficient 30,000 core
> tool instead of a big shiny on-demand paperweight.”
>
> But problems did arise during the 30,000-core run.
>
> “You can be sure that when you run at massive scale, you are bound to run
> into some unexpected gotchas,” Cycle notes. “In our case, one of the gotchas
> included such things as running out of file descriptors on the license
> server. In hindsight, we should have anticipated this would be an issue, but
> we didn’t find that in our prelaunch testing, because we didn’t test at full
> scale. We were able to quickly recover from this bump and keep moving along
> with the workload with minimal impact. The license server was able to keep up
> very nicely with this workload once we increased the number of file
> descriptors.”
>
> Cycle also hit a speed bump related to volume and byte limits on Amazon’s
> Elastic Block Store volumes. But the company is already planning bigger and
> better things.
>
> “We already have our next use-case identified and will be turning up the
> scale a bit more with the next run,” the company says. But ultimately, “it’s
> not about core counts or terabytes of RAM or petabytes of data.
> Rather, it’s
> about how we are helping to transform how science is done.”
>
> _______________________________________________
> Beowulf mailing list, [email protected] sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

--
Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/

Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Raysonho
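
As a rough sanity check on the figures quoted in the article above, the short Python sketch below recomputes the cluster totals from the per-instance numbers (3,809 instances, eight cores and 7GB of RAM each) and derives an approximate peak cost per core-hour from the $1,279-per-hour figure. The variable names are illustrative only, the TB conversion assumes decimal units (1 TB = 1,000 GB), and the per-core and per-instance costs are estimates, since the article does not break down the Amazon and Cycle Computing fees.

```python
# Figures taken from the quoted Ars Technica article.
INSTANCES = 3809              # compute instances
CORES_PER_INSTANCE = 8
RAM_GB_PER_INSTANCE = 7
PEAK_COST_PER_HOUR = 1279.0   # USD, includes Amazon and Cycle Computing fees

# Cluster-wide totals derived from the per-instance figures.
total_cores = INSTANCES * CORES_PER_INSTANCE

# Assumption: decimal units, 1 TB = 1000 GB.
total_ram_tb = INSTANCES * RAM_GB_PER_INSTANCE / 1000

# Rough unit costs at peak (estimates, not figures from the article).
cost_per_core_hour = PEAK_COST_PER_HOUR / total_cores
cost_per_instance_hour = PEAK_COST_PER_HOUR / INSTANCES

print(f"total cores:            {total_cores:,}")
print(f"total RAM (TB):         {total_ram_tb:.1f}")
print(f"cost per core-hour:     ${cost_per_core_hour:.3f}")
print(f"cost per instance-hour: ${cost_per_instance_hour:.3f}")
```

The totals agree with the article's 30,472 cores and 26.7TB of RAM, and the peak cost works out to roughly four cents per core-hour.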
