I have multiple input paths which each contain data that need to be mapped in a slightly different way into a common data structure. My approach boils down to:
RDD<T> rdd = null; for (Configuration conf : configurations) { RDD<T> nextRdd = loadFromConfiguration(conf); rdd = (rdd == null) ? nextRdd : rdd.union(nextRdd); rdd = rdd.coalesce(nextRdd.partitions().size()); } Now, for a small number of inputs there doesn't seem to be a problem, but for the full set which is about 60 sub-RDDs coming in at around 500MM total records takes a very long time to construct. Just for a simple load-then-count example job, it takes 13 minutes total, where the count() task only accounts for 2 minutes of that. Is there something I should be doing differently here? If you can't tell, this is in java so my RDD is probably some mess of nested wrapped RDDs but I'm not sure if that would be the real issue. Thanks, Matt