Dear Spark Users,

I need help understanding the right way to do a sorted reduce in PySpark. Yes, I know every reduce is already grouped by key (and grouping is essentially sorting), but what I mean is that I need to produce a pair whose first field is a key and whose second field is a sorted list of all the items with that key. It should look like this:

  (id, [(id, A), (id, B), (id, C), ...])
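Concretely, with toy records (the field layout here is my own invention for illustration: field 5 is the grouping key, fields 1-4 are the sort key), the input and the desired output would be:

  # Hypothetical flight records; field 5 (the last one) is the key.
  flights = [
      ("x", 2015, 1, 2, 900, "N001"),
      ("x", 2015, 1, 1, 730, "N001"),
      ("x", 2015, 1, 1, 600, "N002"),
  ]

  # Desired result: one pair per key, that key's records sorted by fields 1-4:
  #   ("N001", [("x", 2015, 1, 1, 730, "N001"), ("x", 2015, 1, 2, 900, "N001")])
  #   ("N002", [("x", 2015, 1, 1, 600, "N002")])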
And I'm wondering what the 'right' way to do this is, because it is a slow operation either way.

Option 1: sort inside the reduce (the sort key is fields 1-4, e.g. origin/dest):

  flights_per_airplane = flights \
      .map(lambda nameTuple: (nameTuple[5], [nameTuple])) \
      .reduceByKey(lambda a, b: sorted(a + b, key=lambda x: (x[1], x[2], x[3], x[4])))

Option 2: sort in a subsequent map step:

  flights_per_airplane = flights \
      .map(lambda nameTuple: (nameTuple[5], [nameTuple])) \
      .reduceByKey(lambda a, b: a + b) \
      .map(lambda kv: (kv[0], sorted(kv[1], key=lambda x: (x[1], x[2], x[3], x[4]))))

The second option seems like it should be more efficient, since Option 1 re-sorts the accumulated list on every merge while Option 2 sorts each key's list exactly once. Unless, that is, PySpark is smart and gives you the reduce-side sort for 'free', or applies some other optimization. In any case, both are 'slow', but which one is faster?

Thanks!

Stack Overflow: http://stackoverflow.com/questions/36376369/what-is-the-most-efficient-way-to-do-a-sorted-reduce-in-pyspark
Gist: https://gist.github.com/rjurney/af27f70c76dc6c6ae05c465271331ade

--
Russell Jurney
twitter.com/rjurney russell.jur...@gmail.com relato.io
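PS: In case anyone wants to try this, here is a self-contained toy version of both options. It runs in local mode; the sample data and the sort_key helper are mine, and the field layout (key in field 5, sort key in fields 1-4) is an assumption, not taken from the real dataset:

  from pyspark import SparkContext

  sc = SparkContext("local", "sorted-reduce")

  # Made-up flight records; field 5 is the grouping key.
  flights = sc.parallelize([
      ("x", 2015, 1, 2, 900, "N001"),
      ("x", 2015, 1, 1, 730, "N001"),
      ("x", 2015, 1, 1, 600, "N002"),
  ])

  sort_key = lambda x: (x[1], x[2], x[3], x[4])

  # Option 1: keep each key's list sorted inside the reduce.
  opt1 = flights \
      .map(lambda t: (t[5], [t])) \
      .reduceByKey(lambda a, b: sorted(a + b, key=sort_key))

  # Option 2: just concatenate in the reduce, sort once per key afterwards.
  opt2 = flights \
      .map(lambda t: (t[5], [t])) \
      .reduceByKey(lambda a, b: a + b) \
      .map(lambda kv: (kv[0], sorted(kv[1], key=sort_key)))

  print(sorted(opt1.collect()))
  print(sorted(opt2.collect()))  # identical output either way

Both produce the same grouped, sorted pairs; the question is only which one gets there faster on real data.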