Hi,

I have multiple procedure codes that a patient has undergone, in an RDD with a different row for each combination of patient and procedure. I am trying to convert this data to the LibSVM format, so that the result looks as follows:
"0 1:1 2:0 3:1 29:1 30:1 32:1 110:1"

where 1, 2, 3, 29, 30, 32 and 110 are numeric procedure codes a given patient has undergone. Note that Spark needs these feature indices to be one-based and in ascending order, so I am using a combination of groupByKey() and mapValues() to do this conversion:

    procedures_rdd = procedures_rdd.groupByKey().mapValues(combine_procedures)

where combine_procedures() is defined as:

    def combine_procedures(l_procs):
        ascii_procs = map(lambda x: int(custom_encode(x)), l_procs)
        return ' '.join([str(x) + ":1" for x in sorted(ascii_procs)])

Note that this reduction is neither commutative nor associative: naively concatenating "29:1 30:1" and "32:1 110:1" into "32:1 110:1 29:1 30:1" would violate the ascending-index requirement. Can someone suggest a faster alternative to this combination of groupByKey() and mapValues()?

Thanks,
Bibudh

--
Bibudh Lahiri
Senior Data Scientist, Impetus Technologies
720 University Avenue, Suite 130
Los Gatos, CA 95129
http://knowthynumbers.blogspot.com/
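P.S. For anyone who wants to reproduce the per-key combination step outside Spark, here is a minimal standalone sketch of the logic; `encode` is a stand-in parameter for the custom_encode() helper referenced above (its real definition is not shown in this post):

```python
def combine_procedures(l_procs, encode=int):
    # Map each raw procedure code to its one-based numeric index.
    # `encode` stands in for the custom_encode() helper from the post.
    numeric = [int(encode(x)) for x in l_procs]
    # LibSVM requires feature indices in ascending order, so the sort
    # happens once per key, after all of that patient's codes are collected.
    return ' '.join('%d:1' % x for x in sorted(numeric))
```

Because the sort is deferred until the full list of codes for a patient is available, the order in which the per-patient rows arrive does not matter, e.g. `combine_procedures(['30', '110', '29', '32'])` yields the same string as the already-ordered input.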