Hi Matt,
is there a reason you need to call coalesce on every loop iteration? Most
likely it forces Spark to do a lot of unnecessary shuffles. Also, for a
really large number of inputs this approach can run into trouble due to
too many nested RDD.union calls. A safer approach is to call union from
the SparkContext, passing all of the RDDs in a single call.
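For reference, a batched version could look roughly like the sketch below. This is not a drop-in implementation: `sc`, `configurations`, `loadFromConfiguration`, the element type `T`, and `numPartitions` are placeholders standing in for names from the original snippet, and the exact `JavaSparkContext.union` signature differs between Spark versions (some take `(first, rest)`, newer ones take varargs).

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

// Build all per-configuration RDDs first...
List<JavaRDD<T>> rdds = new ArrayList<>();
for (Configuration conf : configurations) {
    rdds.add(loadFromConfiguration(conf));
}

// ...then union them in one flat call on the context, instead of a chain
// of nested pairwise RDD.union calls, and coalesce once at the very end.
JavaRDD<T> rdd = sc.union(rdds.get(0), rdds.subList(1, rdds.size()))
                   .coalesce(numPartitions);
```

The point is that a single context-level union keeps the lineage flat, whereas pairwise unions in a loop build one nested union per input.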
I have multiple input paths, each of which contains data that needs to be
mapped in a slightly different way into a common data structure. My
approach boils down to:

RDD<T> rdd = null;
for (Configuration conf : configurations) {
    RDD<T> nextRdd = loadFromConfiguration(conf);
    rdd = (rdd == null)
        ? nextRdd
        : rdd.union(nextRdd).coalesce(numPartitions);
}