Hi,

I have multiple procedure codes that a patient has undergone, in an RDD with a different row for each combination of patient and procedure. I am trying to convert this data to the LibSVM format, so that the result looks as follows:
"0 1:1 2:0 3:1 29:1 30:1 32:1 110:1"

where 1, 2, 3, 29, 30, 32 and 110 are numeric procedure codes a given patient has undergone. Note that Spark needs these feature indices to be one-based and in ascending order, so I am using a combination of groupByKey() and mapValues() to do this conversion:

    procedures_rdd = procedures_rdd.groupByKey().mapValues(combine_procedures)

where combine_procedures() is defined as:

    def combine_procedures(l_procs):
        ascii_procs = map(lambda x: int(custom_encode(x)), l_procs)
        return ' '.join([str(x) + ":1" for x in sorted(ascii_procs)])

Note that this reduction is neither commutative nor associative: naively concatenating "29:1 30:1" and "32:1 110:1" into "32:1 110:1 29:1 30:1" would violate the ascending-index requirement. Can someone suggest a faster alternative to this combination of groupByKey() and mapValues()?

Thanks,
Bibudh

--
Bibudh Lahiri
Senior Data Scientist, Impetus Technologies
720 University Avenue, Suite 130
Los Gatos, CA 95129
http://knowthynumbers.blogspot.com/
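P.S. For anyone who wants to reproduce the per-key combination step outside Spark, here is a minimal standalone sketch of the logic; `encode` is a stand-in parameter for the custom_encode() helper referenced above (its real definition is not shown in this post):

```python
def combine_procedures(l_procs, encode=int):
    # Map each raw procedure code to its one-based numeric index.
    # `encode` stands in for the custom_encode() helper from the post.
    numeric = [int(encode(x)) for x in l_procs]
    # LibSVM requires feature indices in ascending order, so the sort
    # happens once per key, after all of that patient's codes are collected.
    return ' '.join('%d:1' % x for x in sorted(numeric))
```

Because the sort is deferred until the full list of codes for a patient is available, the order in which the per-patient rows arrive does not matter, e.g. `combine_procedures(['30', '110', '29', '32'])` yields the same string as the already-ordered input.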