Hi Matt,

Is there a reason you need to call coalesce on every loop iteration? Most likely it forces Spark to do a lot of unnecessary shuffling. Also, for a really large number of inputs this approach can lead to a StackOverflowError due to too many nested RDD.union calls, since each pairwise union adds another level to the lineage. A safer approach is to call union on the SparkContext once, as soon as you have all the RDDs ready. In Python it looks like this:

    rdds = []
    for i in xrange(cnt):
        rdd = ...          # build the i-th input RDD here
        rdds.append(rdd)
    # one flat union instead of a nested chain of pairwise unions
    finalRDD = sparkContext.union(rdds)
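
Since you're in Java: depending on your Spark version, the same pattern would look roughly like the sketch below. I'm assuming the Java API here (JavaSparkContext/JavaRDD), where union takes a first RDD plus a list of the rest; loadFromConfiguration is your own helper from the snippet below.

    // (imports: java.util.List, java.util.ArrayList, org.apache.spark.api.java.*)
    List<JavaRDD<T>> rdds = new ArrayList<>();
    for (Configuration conf : configurations) {
        rdds.add(loadFromConfiguration(conf));  // your helper, returning JavaRDD<T>
    }
    // One SparkContext-level union builds a single flat UnionRDD
    // instead of a deeply nested chain of pairwise unions.
    JavaRDD<T> finalRdd = sparkContext.union(
            rdds.get(0), rdds.subList(1, rdds.size()));

Either way, the key point is a single flat union over all 60 inputs, with no per-iteration coalesce.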

HTH,
 Tomasz


On 18.06.2015 at 02:53, Matt Forbes wrote:
I have multiple input paths, each of which contains data that needs to be
mapped in a slightly different way into a common data structure. My
approach boils down to:

    RDD<T> rdd = null;
    for (Configuration conf : configurations) {
        RDD<T> nextRdd = loadFromConfiguration(conf);
        rdd = (rdd == null) ? nextRdd : rdd.union(nextRdd);
        rdd = rdd.coalesce(nextRdd.partitions().size());
    }

Now, for a small number of inputs there doesn't seem to be a problem,
but the full set, which is about 60 sub-RDDs coming in at around 500MM
total records, takes a very long time to construct. For a simple
load-then-count example job it takes 13 minutes total, of which the
count() task accounts for only 2 minutes.

Is there something I should be doing differently here? If you can't
tell, this is in Java, so my RDD is probably some mess of nested wrapped
RDDs, but I'm not sure if that is the real issue.

Thanks,
Matt

