I have multiple input paths which each contain data that need to be mapped
in a slightly different way into a common data structure. My approach boils
down to:

RDD<T> rdd = null;
for (Configuration conf : configurations) {
  RDD<T> nextRdd = loadFromConfiguration(conf);
  rdd = (rdd == null) ? nextRdd : rdd.union(nextRdd);
  rdd = rdd.coalesce(nextRdd.partitions().size());
}

Now, for a small number of inputs there doesn't seem to be a problem, but
for the full set which is about 60 sub-RDDs coming in at around 500MM total
records takes a very long time to construct. Just for a simple
load-then-count example job, it takes 13 minutes total, where the count()
task only accounts for 2 minutes of that.

Is there something I should be doing differently here? If you can't tell,
this is in java so my RDD is probably some mess of nested wrapped RDDs but
I'm not sure if that would be the real issue.

Thanks,
Matt

Reply via email to