Dear Spark Users,

I need assistance in understanding the right way to do a sorted reduce in
PySpark. Yes, I know a reduce already groups records by key, but what I
mean is that I need to end up with a tuple whose first field is the key
and whose second field is a sorted list of all the items with that key. It
should look like this:

(id, [(id,A),(id,B),(id,C),...])

I'm wondering what the 'right' way to do this is, because it's a slow
operation.
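
To make the target concrete, here's the output I'd consider correct on a few
made-up rows, computed in plain local Python (no Spark); the field positions
are placeholders, with field 5 as the key and fields 1-4 as the sort key:

from itertools import groupby

# Made-up records: the last field is the key (e.g. tail number),
# fields 1-4 form the sort key.
records = [
    ("a", 2, 2, 2, 2, "N123"),
    ("b", 1, 1, 1, 1, "N123"),
    ("c", 3, 3, 3, 3, "N456"),
]

# Sort by the key so groupby sees each key's records contiguously,
# then sort each group by fields 1-4.
by_key = sorted(records, key=lambda x: x[5])
wanted = [(k, sorted(g, key=lambda x: (x[1], x[2], x[3], x[4])))
          for k, g in groupby(by_key, key=lambda x: x[5])]

print(wanted)
# [('N123', [('b', 1, 1, 1, 1, 'N123'), ('a', 2, 2, 2, 2, 'N123')]),
#  ('N456', [('c', 3, 3, 3, 3, 'N456')])]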

Option 1: sort in the reduce

# key is field 5; sort key is fields 1-4 (origin/dest, ...)
flights_per_airplane = flights\
  .map(lambda nameTuple: (nameTuple[5], [nameTuple]))\
  .reduceByKey(lambda a, b: sorted(a + b, key=lambda x: (x[1], x[2], x[3], x[4])))

Option 2: sort in a subsequent map step

# same keying; build the full list per key, then sort it once in a later map
flights_per_airplane = flights\
  .map(lambda nameTuple: (nameTuple[5], [nameTuple]))\
  .reduceByKey(lambda a, b: a + b)\
  .map(lambda kv: (kv[0], sorted(kv[1], key=lambda x: (x[1], x[2], x[3], x[4]))))

The second option would seem to be more efficient, unless PySpark is smart
enough to get the sort 'for free' as part of the reduce, or applies some
other optimization. In any case, both are slow, but which one is faster?
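
If it helps frame the question, this is roughly how I'd compare the two.
It's only a sketch: it assumes the pipelines above are bound to the
hypothetical names option1 and option2, and it calls count() purely to force
Spark's lazy evaluation so the timing covers the whole job:

import time

def timed(label, rdd):
    # count() is an action, so it forces the whole lazy pipeline
    # (map, reduceByKey, sort) to actually run before we stop the clock.
    start = time.time()
    n = rdd.count()
    print("%s: %d keys in %.1fs" % (label, n, time.time() - start))

# option1 / option2 are hypothetical names for the two pipelines shown above.
timed("option 1 (sort inside the reduce)", option1)
timed("option 2 (sort in a later map)", option2)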

Thanks!

Stack Overflow:
http://stackoverflow.com/questions/36376369/what-is-the-most-efficient-way-to-do-a-sorted-reduce-in-pyspark
Gist: https://gist.github.com/rjurney/af27f70c76dc6c6ae05c465271331ade
-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
