Hi folks, I currently have a DataFrame with a factor (categorical) variable, say gender. I'm hoping to run the RandomForest algorithm on this data, and it appears the DataFrame needs to be converted to RDD[LabeledPoint] first, i.e. all features need to be encoded as doubles.
I see https://issues.apache.org/jira/browse/SPARK-5888 is still open, but I was wondering what the recommended way to add a column is. I can think of featuresDF.map { case Row(f1, f2, f3) => (f1, f2, if (f3 == "male") 0 else 1, if (f3 == "female") 0 else 1) }.toDF("f1", "f2", "f3_dummy", "f3_dummy2"), but that isn't ideal: I already have 80+ features in that DataFrame, so the pattern match itself is a pain. There has got to be a better way to append |levels| columns and then select all columns but "f3".
I see a withColumn method but no constructor to create a Column. Should I be creating the dummy features in a new DataFrame and then selecting them out of there to get a Column? Any pointers are appreciated. I'm sure I'm not the first person to attempt this, just unsure of the least painful way to achieve it.
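To make the "append |levels| columns and drop f3" idea concrete, here is a rough sketch of one possible shape, not a recommended or official approach. It assumes a Spark version where SparkSession, DataFrame.drop and when/otherwise from org.apache.spark.sql.functions are available, and that the factor column is a low-cardinality string column with no nulls. The addDummies helper, the DummyColumns object and the generated column names are hypothetical names made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object DummyColumns {
  // Hypothetical helper: appends one 0.0/1.0 column per distinct level of
  // `column`, then drops the original column.
  def addDummies(df: DataFrame, column: String): DataFrame = {
    // Collect the distinct levels to the driver. Assumption: the column holds
    // non-null strings and has few enough levels for this to be cheap.
    val levels = df.select(column).distinct().collect().map(_.getString(0)).sorted

    // Fold over the levels, adding one indicator column per level.
    // The Column is built from an expression, not from a constructor.
    val withDummies = levels.foldLeft(df) { (acc, level) =>
      acc.withColumn(s"${column}_$level", when(col(column) === level, 1.0).otherwise(0.0))
    }
    withDummies.drop(column)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("dummy-columns").getOrCreate()
    import spark.implicits._

    // Toy stand-in for the real 80+ feature DataFrame.
    val featuresDF = Seq((1.0, 2.0, "male"), (3.0, 4.0, "female")).toDF("f1", "f2", "f3")
    addDummies(featuresDF, "f3").show()
    spark.stop()
  }
}
```

The point of the fold is that none of the other 80+ columns has to be named: each withColumn call carries the existing columns along, and the Column argument comes from an expression over col("f3").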