Maybe I'm wrong, but what you are doing here is basically a Cartesian product for each key. So if "hello" appears 100 times in your corpus, it will produce 100*100 = 10,000 elements for that key in the join output.
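To illustrate what I mean (this is just a rough sketch, not your code, reusing the SparkContext sc from your snippet and a made-up key count):

// Sketch: a key occurring n times on the left and m times on the right
// produces n * m rows for that key in the join output.
val left  = sc.parallelize(Seq.fill(100)(("hello", 1)))  // "hello" 100 times
val right = sc.parallelize(Seq.fill(100)(("hello", 1)))  // "hello" 100 times
println(left.join(right).count())  // 10000 rows, all for the single key "hello"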

I don't understand what you're trying to do here, but it's no surprise your join takes forever; as it stands, it makes no sense, IMO.

Guillaume
Hello guys,

I am trying to run the following dummy example for Spark,
on a dataset of 250MB, using 5 machines with >10GB RAM
each, but the join seems to be taking too long (> 2hrs).

I am using Spark 0.8.0 but I have also tried the same example
on more recent versions, with the same results.

Do you have any idea why this is happening?

Thanks a lot,
Kostas
val sc = new SparkContext(
  args(0),
  "DummyJoin",
  System.getenv("SPARK_HOME"),
  Seq(System.getenv("SPARK_EXAMPLES_JAR")))

val file = sc.textFile(args(1))

val wordTuples = file
  .flatMap(line => line.split(args(2)))
  .map(word => (word, 1))

val big = wordTuples.filter {
  case (k, v) => k != "a"
}.cache()

val small = wordTuples.filter {
  case (k, v) => k != "a" && k != "to" && k != "and"
}.cache()

val res = big.leftOuterJoin(small)
res.saveAsTextFile(args(3))


--
eXenSa

Guillaume PITEL, Président
+33(0)626 222 431

eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705
