Maybe I'm wrong, but what you are doing here is basically a Cartesian product for each key. So if the key "hello" appears 100 times in your corpus, the join will produce 100*100 = 10,000 elements for that key alone. I don't fully understand what you're doing here, but with that kind of blow-up it's no surprise your join takes forever.
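The per-key blow-up can be sketched in plain Python (not the Spark API; the hypothetical `join` helper below just mirrors the semantics of an RDD inner join on key/value pairs):

```python
# Sketch (plain Python, not Spark) of why a join on a skewed key explodes:
# each key contributes |left occurrences| * |right occurrences| output pairs.
from collections import defaultdict

def join(left, right):
    """Inner join of (key, value) pair lists, mirroring RDD.join semantics."""
    by_key = defaultdict(list)
    for k, w in right:
        by_key[k].append(w)
    return [(k, (v, w)) for k, v in left for w in by_key[k]]

# "hello" appears 100 times on each side -> 100 * 100 = 10,000 output pairs.
left = [("hello", i) for i in range(100)]
right = [("hello", i) for i in range(100)]
print(len(join(left, right)))  # 10000
```

A key that appears a few thousand times on each side already produces millions of output pairs, which is why a modest 250MB input can still stall a join.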
Hello guys,
I am trying to run the following dummy example for Spark, on a dataset of 250MB, using 5 machines with 10GB of RAM each, but the join seems to be taking too long (~2 hours). I am using Spark 0.8.0, but I have also tried the same example on more recent versions, with the same results. Do you have any idea what could be causing this?
If your data has special characteristics, for example one side is small and the other large, then you can do a map-side join in Spark using broadcast variables; this will speed things up.
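The map-side join idea can be sketched in plain Python (this is not the Spark API; in Spark you would ship the small table to every executor with `sc.broadcast` and do the lookup inside a map over the large RDD):

```python
# Sketch (plain Python, not Spark) of a map-side join: the small table is
# turned into a dict and "broadcast" to every task, so the large side is
# joined by local lookup instead of shuffling both sides across the network.
small = [("spark", 1), ("hadoop", 2)]
large = [("spark", "a"), ("spark", "b"), ("hadoop", "c"), ("flink", "d")]

broadcast = dict(small)  # in Spark: sc.broadcast(dict(small)), then .value

# Map over the large side, keeping only keys present in the small table.
joined = [(k, (v, broadcast[k])) for k, v in large if k in broadcast]
print(joined)
```

Because only the small table is replicated, the large side never needs to be repartitioned by key, which avoids the shuffle that makes a regular join on skewed keys so slow.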
Otherwise, as Pitel mentioned, if there is nothing special about the data and it really is just a Cartesian product, it might simply take forever.