Hi everyone,

I’m interested in empirically measuring how much faster Spark is compared to 
Hadoop for certain problems and an input corpus I currently work with (I’ve read 
Matei Zaharia’s “Resilient Distributed Datasets: A Fault-Tolerant Abstraction 
for In-Memory Cluster Computing” paper and I want to perform a similar test). I 
personally think measuring the speed difference on a single-node cluster 
isn’t enough, so I was wondering what you would recommend for this task with 
regard to cluster size, node specs, etc.
I was thinking it might be possible to launch a couple of CDH5 VMs across a few 
computers, or do you think it would be easier to do it with Amazon EC2?
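Either way, my understanding is that only the master URL changes on the driver 
side, so the measurement code itself shouldn’t care how the cluster was 
launched. This is the rough smoke test I’d run first, where spark://master:7077 
and the partition count are just placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object ClusterSmokeTest {
  def main(args: Array[String]): Unit = {
    // Only the master URL differs between a local VM cluster and EC2;
    // "spark://master:7077" is a placeholder for whatever the cluster reports.
    val conf = new SparkConf()
      .setAppName("cluster-smoke-test")
      .setMaster(args.headOption.getOrElse("spark://master:7077"))
    val sc = new SparkContext(conf)

    // Trivial job just to confirm every worker is reachable before timing anything.
    val n = sc.parallelize(1 to 1000000, 100).map(_ * 2).count()
    println(s"counted $n doubled elements across the cluster")
    sc.stop()
  }
}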

I’m particularly interested in hearing about your overall experience with this 
kind of comparison and your recommendations (what other common problems are 
worth testing and what kind of benchmarks to use).
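To give an idea of what I mean, the test I have in mind looks roughly like the 
sketch below, modelled on the iterative logistic regression workload from the 
paper; the input path, feature count and iteration count are just placeholders, 
and I’d time the first iteration (which reads from HDFS) separately from the 
later ones that hit the in-memory cache:

import java.util.Random
import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch of an iterative timing test in the spirit of the paper's
// logistic regression benchmark; the path and sizes below are placeholders.
object LRTiming {
  val D = 10  // number of features (placeholder)
  case class Point(x: Array[Double], y: Double)

  def parse(line: String): Point = {
    val parts = line.split(' ').map(_.toDouble)
    Point(parts.tail, parts.head)  // first column is the label
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr-timing"))

    // cache() is what should separate Spark from Hadoop here: iterations
    // after the first read the parsed points from memory instead of HDFS.
    val points = sc.textFile("hdfs:///benchmarks/lr-points.txt").map(parse).cache()

    val rand = new Random(42)
    val w = Array.fill(D)(rand.nextDouble())
    for (i <- 1 to 10) {
      val start = System.nanoTime()
      val gradient = points.map { p =>
        val dot = (0 until D).map(j => w(j) * p.x(j)).sum
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      for (j <- 0 until D) w(j) -= gradient(j)
      println(f"iteration $i: ${(System.nanoTime() - start) / 1e9}%.2f s")
    }
    sc.stop()
  }
}

For the Hadoop side I’d run an equivalent MapReduce job on the same input and 
compare per-iteration times.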

Have a great start of the week.
Cheers



Pablo Valdes | Software Engineer | comScore, Inc. (NASDAQ:SCOR)

pval...@comscore.com



Av. Del Cóndor N° 520, oficina 202, Ciudad Empresarial, Comuna de Huechuraba | Santiago | CL


