If you put this into a DataFrame, you may be able to use one-hot encoding
and treat these as categorical features. I believe the ML pipeline
components use Project Tungsten, so performance should be good. After
processing the result on the DataFrame, you would then need to assemble
your desired format.
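A rough sketch of the DataFrame route (the Spark calls are shown in
comments since they need a running SparkSession; the column names
"patient" and "code" are my assumptions, and CountVectorizer's vector
indices come from the fitted vocabulary, not from the raw code values,
so you would map back through model.vocabulary if the indices matter):

```python
# Sketch, not a drop-in solution. Group the (patient, code) rows into a
# list per patient, then let CountVectorizer produce a 0/1 sparse vector:
#
#   from pyspark.sql import functions as F
#   from pyspark.ml.feature import CountVectorizer
#
#   grouped = procedures_df.groupBy("patient") \
#                          .agg(F.collect_list("code").alias("codes"))
#   model = CountVectorizer(inputCol="codes", outputCol="features",
#                           binary=True).fit(grouped)
#   vectors = model.transform(grouped)
#
# Assembling the final LibSVM text from a sparse vector's active indices
# is then a small plain-Python step:

def to_libsvm(label, indices):
    """Format a label and 0-based active indices as a one-based LibSVM line."""
    return str(label) + ' ' + ' '.join('%d:1' % (i + 1) for i in sorted(indices))
```

The one-based, ascending-order requirement is satisfied by the single
`sorted()` call in the final formatting step.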
On May 3, 2016 4:29 PM, "Bibudh Lahiri" <bibudhlah...@gmail.com> wrote:

> Hi,
>   I have multiple procedure codes that a patient has undergone, in an RDD
> with a different row for each combination of patient and procedure. I am
> trying to convert this data to the LibSVM format, so that the result
> looks as follows:
>
>   "0 1:1 2:0 3:1 29:1 30:1 32:1 110:1"
>
>   where 1, 2, 3, 29, 30, 32, 110 are numeric procedure codes a given
> patient has undergone. Note that Spark needs these codes to be one-based
> and in ascending order, so I am using a combination of groupByKey() and
> mapValues() to do this conversion as follows:
>
> procedures_rdd = procedures_rdd.groupByKey().mapValues(combine_procedures)
>
> where combine_procedures() is defined as:
>
> def combine_procedures(l_procs):
>   ascii_procs = map(lambda x: int(custom_encode(x)), l_procs)
>   return ' '.join([str(x) + ":1" for x in sorted(ascii_procs)])
>
>   Note that this reduction is neither commutative nor associative, since
> combining "29:1 30:1" and "32:1 110:1" into "32:1 110:1 29:1 30:1" is not
> going to work.
>   Can someone suggest some faster alternative to the combination
> of groupByKey() and mapValues() for this?
>
> Thanks
>            Bibudh
>
>
> --
> Bibudh Lahiri
> Senior Data Scientist, Impetus Technologies
> 720 University Avenue, Suite 130
> Los Gatos, CA 95129
> http://knowthynumbers.blogspot.com/
>
>
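For the RDD version itself, one common workaround, sketched here under
assumptions: the expensive part of groupByKey is that it shuffles every
(patient, code) pair. If the per-key combine step is made commutative
and associative (set union), reduceByKey can merge partial results
map-side, and the ascending-order requirement is deferred to a single
sort per patient at the end. `custom_encode` is the poster's own
function; the other names are illustrative.

```python
# Sketch: replace groupByKey().mapValues() with a commutative,
# associative reduceByKey, sorting only once per key at the end:
#
#   encoded = procedures_rdd.mapValues(lambda p: {int(custom_encode(p))})
#   combined = encoded.reduceByKey(lambda a, b: a | b)   # set union
#   libsvm = combined.mapValues(format_codes)
#
# The two plain-Python pieces:

def merge_codes(a, b):
    """Commutative, associative combine step: set union of procedure codes."""
    return a | b

def format_codes(codes):
    """Run once per patient at the end: sort, then emit LibSVM features."""
    return ' '.join('%d:1' % c for c in sorted(codes))
```

Because set union is order-insensitive, the reduction no longer needs to
preserve order, so the "neither commutative nor associative" problem
goes away.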
