Hello guys,

I am trying to run the following dummy join example on Spark, over a
250 MB dataset, using 5 machines with more than 10 GB of RAM each,
but the join is taking far too long (over 2 hours).

I am using Spark 0.8.0, but I have also tried the same example
on more recent versions, with the same result.

Do you have any idea why this is happening?

Thanks a lot,
Kostas

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // for the pair-RDD implicits (leftOuterJoin)

object DummyJoin {
  def main(args: Array[String]) {
    // args(0): master URL, args(1): input path,
    // args(2): word delimiter, args(3): output path
    val sc = new SparkContext(
      args(0),
      "DummyJoin",
      System.getenv("SPARK_HOME"),
      Seq(System.getenv("SPARK_EXAMPLES_JAR")))

    val file = sc.textFile(args(1))

    val wordTuples = file
      .flatMap(line => line.split(args(2)))
      .map(word => (word, 1))

    // every word except "a"
    val big = wordTuples.filter {
      case (k, _) => k != "a"
    }.cache()

    // every word except "a", "to" and "and"
    val small = wordTuples.filter {
      case (k, _) => k != "a" && k != "to" && k != "and"
    }.cache()

    val res = big.leftOuterJoin(small)
    res.saveAsTextFile(args(3))
  }
}
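In case it helps, this is roughly how I launch it, via the run-example
script that ships with Spark 0.8 (the package name, master URL, and HDFS
paths below are placeholders for my actual ones, and I am assuming the
class is compiled into the examples jar):

    # args: master URL, input path, word delimiter, output path
    ./run-example org.apache.spark.examples.DummyJoin \
        spark://master:7077 \
        hdfs://namenode/data/words.txt \
        " " \
        hdfs://namenode/out/dummyjoin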
