Can you simply apply StatCounter (https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.util.StatCounter) to this? You should be able to do something like this:
val stats = RDD.map(x => x._2).stats()

-Todd

On Tue, Apr 28, 2015 at 10:00 AM, subscripti...@prismalytics.io <subscripti...@prismalytics.io> wrote:

> Hello Friends:
>
> I generated a Pair RDD with K/V pairs, like so:
>
> >>>
> >>> rdd1.take(10)  # Show a small sample.
> [(u'2013-10-09', 7.60117302052786),
>  (u'2013-10-10', 9.322709163346612),
>  (u'2013-10-10', 28.264462809917358),
>  (u'2013-10-07', 9.664429530201343),
>  (u'2013-10-07', 12.461538461538463),
>  (u'2013-10-09', 20.76923076923077),
>  (u'2013-10-08', 11.842105263157894),
>  (u'2013-10-13', 32.32514177693762),
>  (u'2013-10-13', 26.249999999999996),
>  (u'2013-10-13', 10.693069306930692)]
>
> Now, from the above RDD, I would like to calculate an average of the VALUES for each KEY.
> I can do so as shown here, which does work:
>
> >>> countsByKey = sc.broadcast(rdd1.countByKey())  # SAMPLE OUTPUT of countsByKey.value: {u'2013-09-09': 215, u'2013-09-08': 69, ... snip ...}
> >>> rdd1 = rdd1.reduceByKey(operator.add)  # Calculate the numerator (i.e. the SUM).
> >>> rdd1 = rdd1.map(lambda x: (x[0], x[1]/countsByKey.value[x[0]]))  # Divide each SUM by its denominator (i.e. the COUNT).
> >>> print(rdd1.collect())
> [(u'2013-10-09', 11.235365503035176),
>  (u'2013-10-07', 23.39500642456595),
>  ... snip ...
> ]
>
> But I wonder whether the above approach is the optimal one, or whether there is a single API call that handles this common use case.
>
> Improvement thoughts welcome. =:)
>
> Thank you,
> nmv
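Note that .stats() computes statistics over ALL values in the RDD, not per key, so it answers a slightly different question than the per-key average asked about above. A minimal PySpark sketch of Todd's suggestion, assuming rdd1 is the pair RDD from the quoted message:

    # Global statistics across every value, ignoring keys.
    # values() drops the keys; stats() returns a StatCounter.
    stats = rdd1.values().stats()
    print(stats.count(), stats.mean(), stats.stdev())

For the per-key average itself, the countByKey()/reduceByKey() combination above can be collapsed into a single distributed pass by reducing (sum, count) pairs per key and dividing at the end. A sketch using aggregateByKey (available since Spark 1.1); the names sum_count and averages are illustrative:

    # Single pass: accumulate (sum, count) per key, then divide.
    # (sum_count and averages are illustrative names.)
    sum_count = rdd1.aggregateByKey(
        (0.0, 0),                                   # zero value: (sum, count)
        lambda acc, v: (acc[0] + v, acc[1] + 1),    # fold one value into the pair
        lambda a, b: (a[0] + b[0], a[1] + b[1]))    # merge partial pairs across partitions
    averages = sum_count.mapValues(lambda p: p[0] / p[1])
    print(averages.collect())

Unlike countByKey(), which pulls the full key-to-count map back to the driver, this keeps the computation distributed, so it also scales to key sets that do not fit in driver memory.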