Maybe I'm wrong, but what you are doing here is basically a Cartesian
product for each key. So if "hello" appears 100 times in your corpus,
it will produce 100*100 = 10,000 elements in the join output for that
key alone.
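To make the blowup concrete, here is a toy sketch in plain Scala (not Spark; the counts are made up) of what a join does when a key is duplicated on both sides:

```scala
// If a key appears n times on the left and m times on the right of a
// join, that key alone contributes n * m rows to the output.
val left  = Seq.fill(3)(("hello", 1))   // "hello" appears 3 times
val right = Seq.fill(2)(("hello", 1))   // "hello" appears 2 times

val joined = for {
  (k1, v1) <- left
  (k2, v2) <- right
  if k1 == k2
} yield (k1, (v1, v2))

// 3 * 2 = 6 output rows for the single key "hello"; with 100
// occurrences per side it would already be 10,000 rows.
```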
I don't understand what you're trying to compute here, but it's no
surprise that your join takes forever; as written, it makes little
sense, IMO.
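If the intent is to join word counts, one fix (assuming that's the goal) is to collapse the duplicate (word, 1) tuples into a single count per key on each side before joining. Here is a plain-Scala sketch of that idea; in Spark you would do the aggregation with reduceByKey:

```scala
// Sketch of the fix: aggregate to one count per key on each side,
// then join. With unique keys the join output is linear, not quadratic.
val bigSide   = Seq(("to", 1), ("to", 1), ("hello", 1))
val smallSide = Seq(("hello", 1), ("hello", 1))

// Sum the 1s per key (plain-Scala stand-in for Spark's reduceByKey).
def counts(side: Seq[(String, Int)]): Map[String, Int] =
  side.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

val bigCounts   = counts(bigSide)
val smallCounts = counts(smallSide)

// Left outer join: every key from the big side, optional match from small.
val res = bigCounts.map { case (k, n) => (k, (n, smallCounts.get(k))) }
```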
Guillaume
Hello guys,
I am trying to run the following dummy example for Spark,
on a dataset of 250MB, using 5 machines with >10GB RAM
each, but the join seems to be taking too long (> 2hrs).
I am using Spark 0.8.0 but I have also tried the same example
on more recent versions, with the same results.
Do you have any idea why this is happening?
Thanks a lot,
Kostas
val sc = new SparkContext(
  args(0),
  "DummyJoin",
  System.getenv("SPARK_HOME"),
  Seq(System.getenv("SPARK_EXAMPLES_JAR")))

val file = sc.textFile(args(1))

val wordTuples = file
  .flatMap(line => line.split(args(2)))
  .map(word => (word, 1))

val big = wordTuples.filter {
  case (k, v) => k != "a"
}.cache()

val small = wordTuples.filter {
  case (k, v) => k != "a" && k != "to" && k != "and"
}.cache()

val res = big.leftOuterJoin(small)
res.saveAsTextFile(args(3))
}
--
eXenSa
*Guillaume PITEL, Président*
+33(0)626 222 431
eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705