Re: Union of many RDDs taking a long time

2015-06-29 Thread Tomasz Fruboes
Hi Matt, is there a reason you need to call coalesce on every loop iteration? Most likely it forces Spark to do lots of unnecessary shuffles. Also - for a really large number of inputs this approach can lead to problems due to too many nested RDD.union calls. A safer approach is to call union from
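
To illustrate the nesting concern, here is a minimal sketch in plain Java that does not use Spark at all. `Node`, `chainedUnion`, `flatUnion` and `depth` are hypothetical names made up for this toy model; a `Node` only remembers its parents, roughly the way an RDD remembers its lineage. It compares the depth you get from unioning pairwise in a loop with the depth from one union over the whole list. Spark does offer the latter as `SparkContext.union`, which takes a collection of RDDs, though the real lineage and scheduling details are more involved than this model.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class LineageDepthSketch {
    // Toy stand-in for an RDD (assumption: not a Spark class); it only tracks parents.
    static final class Node {
        final List<Node> parents;
        Node(List<Node> parents) { this.parents = parents; }
    }

    // A source node with no parents, standing in for one loaded input.
    static Node source() { return new Node(new ArrayList<>()); }

    // Pairwise union in a loop, like rdd = rdd.union(nextRdd).
    // Expects a non-empty list; each step wraps the previous result once more.
    static Node chainedUnion(List<Node> inputs) {
        Node acc = null;
        for (Node next : inputs) {
            acc = (acc == null) ? next : new Node(Arrays.asList(acc, next));
        }
        return acc;
    }

    // One n-ary union: every input becomes a direct parent of a single node.
    static Node flatUnion(List<Node> inputs) {
        return new Node(new ArrayList<>(inputs));
    }

    // Longest path from root down to a source, computed iteratively
    // so that measuring a deep chain does not itself overflow the stack.
    static int depth(Node root) {
        int max = 0;
        Deque<Node> nodes = new ArrayDeque<>();
        Deque<Integer> depths = new ArrayDeque<>();
        nodes.push(root);
        depths.push(0);
        while (!nodes.isEmpty()) {
            Node n = nodes.pop();
            int d = depths.pop();
            max = Math.max(max, d);
            for (Node p : n.parents) {
                nodes.push(p);
                depths.push(d + 1);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<Node> inputs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            inputs.add(source());
        }
        System.out.println("chained depth: " + depth(chainedUnion(inputs)));
        System.out.println("flat depth: " + depth(flatUnion(inputs)));
    }
}
```

In this model the chained version grows one level deeper per input, while the flat version stays one level deep however many inputs there are. Any code that walks such a chain recursively has to recurse once per level.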

Union of many RDDs taking a long time

2015-06-17 Thread Matt Forbes
I have multiple input paths which each contain data that need to be mapped in a slightly different way into a common data structure. My approach boils down to: RDD<T> rdd = null; for (Configuration conf : configurations) { RDD<T> nextRdd = loadFromConfiguration(conf); rdd = (rdd == null) ?