If your data has special characteristics, e.g. one side is small and the
other is large, you can consider a map-side join in Spark using broadcast
variables; this will speed things up.
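
A minimal sketch of such a broadcast (map-side) join, assuming the small
side fits in driver memory (bigRdd/smallRdd are placeholder names, not from
the original job):

    val smallMap = smallRdd.collectAsMap()   // pull the small side to the driver
    val smallBc = sc.broadcast(smallMap)     // ship it once to each executor

    val joined = bigRdd.mapPartitions { iter =>
      val lookup = smallBc.value
      iter.flatMap { case (k, v) =>
        lookup.get(k).map(w => (k, (v, w)))  // join locally, no shuffle (inner-join semantics)
      }
    }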

Otherwise, as Pitel mentioned, if there is nothing special about the data and
the join is effectively a cartesian product, it may take forever; you can also
try increasing the number of executors.
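
A small illustration of the per-key blow-up Guillaume describes below
(a standalone, hypothetical snippet, not part of the original job):

    // a key occurring n times on each side of a join yields n*n output pairs
    val left  = sc.parallelize(Seq.fill(100)(("hello", 1)))
    val right = sc.parallelize(Seq.fill(100)(("hello", 1)))
    left.join(right).count()   // 100 * 100 = 10000 pairs for the key "hello"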

On Thu, Apr 9, 2015 at 8:37 PM, Guillaume Pitel <guillaume.pi...@exensa.com>
wrote:

> Maybe I'm wrong, but what you are doing here is basically a cartesian
> product for each key. So if "hello" appears 100 times in your corpus, it
> will produce 100*100 elements in the join output.
>
> I don't understand what you're trying to do here, but it's normal that your
> join takes forever; it makes no sense as it is, IMO.
>
> Guillaume
>
> Hello guys,
>
> I am trying to run the following dummy example on Spark, with a 250MB
> dataset, using 5 machines with >10GB of RAM each, but the join seems to
> be taking too long (> 2 hrs).
>
>  I am using Spark 0.8.0 but I have also tried the same example
> on more recent versions, with the same results.
>
>  Do you have any idea why this is happening?
>
>  Thanks a lot,
> Kostas
>
>     val sc = new SparkContext(
>       args(0),
>       "DummyJoin",
>       System.getenv("SPARK_HOME"),
>       Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>
>     val file = sc.textFile(args(1))
>
>     val wordTuples = file
>       .flatMap(line => line.split(args(2)))
>       .map(word => (word, 1))
>
>     val big = wordTuples.filter {
>       case (k, v) => k != "a"
>     }.cache()
>
>     val small = wordTuples.filter {
>       case (k, v) => k != "a" && k != "to" && k != "and"
>     }.cache()
>
>     val res = big.leftOuterJoin(small)
>     res.saveAsTextFile(args(3))
>   }
>
>
>
> --
> Guillaume PITEL, Président
> +33(0)626 222 431
>
> eXenSa S.A.S. <http://www.exensa.com/>
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)184 163 677 / Fax +33(0)972 283 705
>



-- 
Deepak
