A few weeks ago, we stress-tested commlib (for those who don't know the code, commlib is the communication library in Grid Engine) to make sure that Grid Engine works, and works efficiently, in clusters larger than 10,000 nodes. We blogged about the experience, and you can read it at:
http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html

We used smaller nodes (called instances in AWS terminology); in terms of core count, the largest ones we used have only 8 cores per node. We could have used larger nodes, like cc2.8xlarge (Cluster Compute Eight Extra Large Instance), which has 16 Intel Xeon E5-2670 cores per node, and used fewer nodes to achieve the same core count, but that would have put less stress on commlib...

There are some performance issues that we would like to fix before we run something even larger (like 20,000 nodes and beyond :-D ), and I think we are hitting the "C10K problem" that was encountered by web servers a few years ago!

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/

On Thu, Nov 15, 2012 at 10:52 AM, Rayson Ho <ray...@scalablelogic.com> wrote:
>
> This year, we tested the scalability of Open Grid Scheduler / Grid
> Engine on the cloud -- we ran a 10,000-node cluster on EC2 (we could
> have used Gompute's hardware but obviously there are more important
> workloads in the dedicated HPC Clouds at Gompute).

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users