There are lots of parameters that can be tweaked for performance optimisation: executor cores, the number of executors, executor memory, and the like.
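For example, here is a minimal sketch of setting a few of these in Java through SparkConf. The values and the class name are purely illustrative, not recommendations; the right numbers depend on your cluster and workload:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class TunedContext {
        public static void main(String[] args) {
            // Illustrative values only -- tune these against your own cluster.
            SparkConf conf = new SparkConf()
                    .setAppName("cooccurrence-analysis")
                    .set("spark.executor.memory", "24g")    // heap per executor
                    .set("spark.executor.cores", "8")       // concurrent tasks per executor
                    .set("spark.executor.instances", "12")  // total executors (on YARN)
                    .set("spark.driver.memory", "8g");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... build and run the job on sc here ...
            sc.stop();
        }
    }

The same settings can also be passed with --conf flags when submitting the application, which is often easier than hard-coding them.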
How is your cluster set up? Are you using Spark on YARN? Also, did you inspect the Spark UI? Can you share the driver and executor memory assigned to your application, along with the number of executors?

A great article for performance tuning: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Thank you,
Nikaash Puri

> On 14-Feb-2016, at 12:44 PM, Andrew Musselman <[email protected]> wrote:
>
> I'm not sure; can you try using smaller datasets as input and do some
> rough benchmarking?
>
> On Saturday, February 13, 2016, Ram VISWANADHA <[email protected]> wrote:
>
>> The set has 21,367,781 records. Would it take 17+ hours for 21M records?
>>
>> Best Regards,
>> Ram
>> --
>>
>> On 2/13/16, 11:31 AM, "Ram VISWANADHA" <[email protected]> wrote:
>>
>>> Hi,
>>> I am calling the SimilarityAnalysis.cooccurrencesIDS API from Java. Here
>>> is the code: https://gist.github.com/ramv-dailymotion/38a32f379865e8ee5a58
>>> I am running this on a Spark cluster with 3 worker nodes and 1 master
>>> node. Each machine has 108GB RAM and 32 CPUs. What am I doing wrong?
>>> Thanks in advance.
>>>
>>> Best Regards,
>>> Ram
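PS: on Andrew's suggestion above of benchmarking with smaller inputs, one quick way is to sample the raw interaction data before handing it to SimilarityAnalysis. A rough sketch in Java, where the paths and class name are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SampleInput {
        public static void main(String[] args) {
            JavaSparkContext sc =
                    new JavaSparkContext(new SparkConf().setAppName("sample-input"));
            // Placeholder paths -- substitute your actual locations.
            JavaRDD<String> full = sc.textFile("hdfs:///data/interactions");
            // Keep ~1% of records, without replacement and with a fixed seed,
            // so repeated runs are comparable. Scale the fraction up (1%, 5%,
            // 25%, ...) to see how the runtime grows.
            JavaRDD<String> sample = full.sample(false, 0.01, 42L);
            sample.saveAsTextFile("hdfs:///data/interactions-sample");
            sc.stop();
        }
    }

If the runtime doesn't grow roughly in proportion to the sample size, that usually points at skew or shuffle pressure rather than raw data volume.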
