Hello guys, I am trying to run the following dummy join example in Spark on a 250 MB dataset, using 5 machines with more than 10 GB of RAM each, but the join seems to be taking too long (> 2 hrs).
I am using Spark 0.8.0, but I have also tried the same example on more recent versions, with the same results. Do you have any idea why this is happening?

Thanks a lot,
Kostas

    val sc = new SparkContext(args(0), "DummyJoin",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))

    val file = sc.textFile(args(1))

    val wordTuples = file
      .flatMap(line => line.split(args(2)))
      .map(word => (word, 1))

    val big = wordTuples.filter { case (k, v) => k != "a" }.cache()

    val small = wordTuples
      .filter { case (k, v) => k != "a" && k != "to" && k != "and" }
      .cache()

    val res = big.leftOuterJoin(small)
    res.saveAsTextFile(args(3))