I was in a discussion with someone who works for a cloud provider that offers Spark/Hadoop services. We got to talking about performance: the bewildering array of machine types, the problem of choosing between, say, a cluster of 20 "Large" instances vs. 10 "Jumbo" instances, and the trade-offs between running a job longer on a small cluster and shorter on a large one.
He offered to run some standard Spark jobs on clusters of different sizes and machine types and post the results. I thought that if we could find a half dozen benchmarks (including data) that differ in their CPU, I/O, and memory requirements, and that are open source and well known, the post might help users: they could look at the published numbers and pick an optimal configuration based on the benchmark closest to their own workload. Problems sized to take about 15 minutes on a medium 16-node cluster would probably be good, since setup and deployment overhead tends to obscure runtime behavior on shorter jobs. Terasort comes to mind as one problem, and I suspect the ADAM group might have a biological problem such as k-mer counting, but I am looking for a few others.
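For anyone sizing the Terasort case: the record format is simple enough to sketch locally. Below is a minimal stand-in in plain Python (not Hadoop's actual TeraGen/TeraSort jobs, just the same 100-byte record shape) that shows how the data is laid out and why record count maps directly to dataset size:

```python
import random

def teragen(n_records, seed=42):
    # TeraSort-style records: a 10-byte key plus a 90-byte payload,
    # 100 bytes per record. The generator here is a simplified
    # illustration, not Hadoop's TeraGen.
    rng = random.Random(seed)
    records = []
    for _ in range(n_records):
        key = bytes(rng.randrange(256) for _ in range(10))
        payload = bytes(rng.randrange(32, 127) for _ in range(90))
        records.append(key + payload)
    return records

def terasort(records):
    # Sort by the leading 10-byte key, as TeraSort does.
    return sorted(records, key=lambda r: r[:10])

# At 100 bytes per record, 1 TB is 10**10 records; a benchmark meant
# to run ~15 minutes on a 16-node cluster would use some fraction of
# that, tuned to the machine types being compared.
data = teragen(1000)
out = terasort(data)
```

Scaling `n_records` up or down is how you would tune the job to hit a target runtime on a given cluster configuration.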