Hi Matt,

Is there a reason you need to call coalesce on every loop iteration? Most likely it forces Spark to do a lot of unnecessary shuffling. Also, for a really large number of inputs this approach can lead to a StackOverflowError due to too many nested RDD.union calls, since each pairwise union adds another level to the lineage. A safer approach is to call union on the SparkContext once, as soon as you have all the RDDs ready. In Python it looks like this:

    rdds = []
    for i in xrange(cnt):
        rdd = ...          # build the i-th input RDD here
        rdds.append(rdd)
    # one flat union instead of a nested chain of pairwise unions
    finalRDD = sparkContext.union(rdds)
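
Since you're in Java: depending on your Spark version, the same pattern would look roughly like the sketch below. I'm assuming the Java API here (JavaSparkContext/JavaRDD), where union takes a first RDD plus a list of the rest; loadFromConfiguration is your own helper from the snippet below.

    // (imports: java.util.List, java.util.ArrayList, org.apache.spark.api.java.*)
    List<JavaRDD<T>> rdds = new ArrayList<>();
    for (Configuration conf : configurations) {
        rdds.add(loadFromConfiguration(conf));  // your helper, returning JavaRDD<T>
    }
    // One SparkContext-level union builds a single flat UnionRDD
    // instead of a deeply nested chain of pairwise unions.
    JavaRDD<T> finalRdd = sparkContext.union(
            rdds.get(0), rdds.subList(1, rdds.size()));

Either way, the key point is a single flat union over all 60 inputs, with no per-iteration coalesce.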

HTH,
 Tomasz


On 18.06.2015 at 02:53, Matt Forbes wrote:
I have multiple input paths, each of which contains data that needs to be
mapped in a slightly different way into a common data structure. My
approach boils down to:

    RDD<T> rdd = null;
    for (Configuration conf : configurations) {
        RDD<T> nextRdd = loadFromConfiguration(conf);
        rdd = (rdd == null) ? nextRdd : rdd.union(nextRdd);
        rdd = rdd.coalesce(nextRdd.partitions().size());
    }

Now, for a small number of inputs there doesn't seem to be a problem,
but the full set, which is about 60 sub-RDDs coming in at around 500MM
total records, takes a very long time to construct. For a simple
load-then-count example job it takes 13 minutes total, of which the
count() task accounts for only 2 minutes.

Is there something I should be doing differently here? If you can't
tell, this is in Java, so my RDD is probably some mess of nested wrapped
RDDs, but I'm not sure if that is the real issue.

Thanks,
Matt

