Maybe I'm wrong, but what you are doing here is basically a Cartesian product for each key. So if the key "hello" appears 100 times in your corpus, the join will produce 100*100 = 10,000 elements for that key alone. I don't fully understand what you're doing here, but with that kind of blow-up it's no surprise your join takes forever.
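The per-key blow-up can be sketched in plain Python (not the Spark API; the hypothetical `join` helper below just mirrors the semantics of an RDD inner join on key/value pairs):

```python
# Sketch (plain Python, not Spark) of why a join on a skewed key explodes:
# each key contributes |left occurrences| * |right occurrences| output pairs.
from collections import defaultdict

def join(left, right):
    """Inner join of (key, value) pair lists, mirroring RDD.join semantics."""
    by_key = defaultdict(list)
    for k, w in right:
        by_key[k].append(w)
    return [(k, (v, w)) for k, v in left for w in by_key[k]]

# "hello" appears 100 times on each side -> 100 * 100 = 10,000 output pairs.
left = [("hello", i) for i in range(100)]
right = [("hello", i) for i in range(100)]
print(len(join(left, right)))  # 10000
```

A key that appears a few thousand times on each side already produces millions of output pairs, which is why a modest 250MB input can still stall a join.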
Hello guys,
I am trying to run the following dummy example for Spark, on a dataset of 250MB, using 5 machines with 10GB of RAM each, but the join seems to be taking too long (~2 hours). I am using Spark 0.8.0, but I have also tried the same example on more recent versions, with the same results. Do you have any idea what could be causing this?
If your data has special characteristics, for example one side is small and the other large, then you can do a map-side join in Spark using broadcast variables; this will speed things up.
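The map-side join idea can be sketched in plain Python (this is not the Spark API; in Spark you would ship the small table to every executor with `sc.broadcast` and do the lookup inside a map over the large RDD):

```python
# Sketch (plain Python, not Spark) of a map-side join: the small table is
# turned into a dict and "broadcast" to every task, so the large side is
# joined by local lookup instead of shuffling both sides across the network.
small = [("spark", 1), ("hadoop", 2)]
large = [("spark", "a"), ("spark", "b"), ("hadoop", "c"), ("flink", "d")]

broadcast = dict(small)  # in Spark: sc.broadcast(dict(small)), then .value

# Map over the large side, keeping only keys present in the small table.
joined = [(k, (v, broadcast[k])) for k, v in large if k in broadcast]
print(joined)
```

Because only the small table is replicated, the large side never needs to be repartitioned by key, which avoids the shuffle that makes a regular join on skewed keys so slow.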
Otherwise, as Pitel mentioned, if there is nothing special about the data and it really is just a Cartesian product, it might simply take forever.