I was talking with someone who works for a cloud provider that offers
Spark/Hadoop services. We got onto the subject of performance: the
bewildering array of machine types, the problem of choosing between, say, a
cluster of 20 "Large" instances and one of 10 "Jumbo" instances, and the
trade-off between running a problem longer on a small cluster versus
shorter on a large one.

He offered to run some standard Spark jobs on a number of clusters of
different sizes and machine types and post the results.

I thought that if we could find half a dozen benchmarks (including data)
that differ in CPU, I/O, and memory requirements, and that are open source
and well known, such a post could help users: they could look at the
results for the benchmark closest to their own workload and select a
near-optimal configuration.

Problems sized to take about 15 minutes on a medium 16-node cluster would
probably be good, since setup and deployment overhead tends to obscure
runtime differences on shorter jobs.

Terasort comes to mind as one problem. I suspect the ADAM group might have
a biological problem involving k-mers, but I am looking for a few others.
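For concreteness, here is a minimal sketch of what a Terasort-style sort
benchmark could look like as a Spark job in Scala. This is not the official
TeraSort implementation; the record count, payload size, and partition
count are placeholder assumptions that would need to be tuned so the job
runs in roughly the 15-minute range on the target cluster.

    // SortBenchmark.scala -- rough sketch, not the official TeraSort.
    import org.apache.spark.{SparkConf, SparkContext}

    object SortBenchmark {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SortBenchmark")
        val sc = new SparkContext(conf)

        // Placeholder sizing: pass a record count on the command line,
        // otherwise default to 100 million records.
        val numRecords    = if (args.nonEmpty) args(0).toLong else 100000000L
        val numPartitions = 256 // placeholder; match it to the cluster

        // Random long keys with a fixed-size dummy payload; sortByKey
        // forces a full shuffle, which is the interesting part to measure.
        val data = sc.range(0L, numRecords, 1, numPartitions).map { i =>
          val rnd = new scala.util.Random(i)
          (rnd.nextLong(), Array.fill[Byte](90)(0))
        }

        val start   = System.nanoTime()
        val sorted  = data.sortByKey().count() // count() materializes the sort
        val elapsed = (System.nanoTime() - start) / 1e9

        println(s"Sorted $sorted records in $elapsed seconds")
        sc.stop()
      }
    }

The same skeleton (generate data, run one shuffle-heavy or CPU-heavy
stage, report wall-clock time) could be reused for the other candidate
benchmarks, with only the data generator and the measured stage swapped
out.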
