Re: DataFrame -- help with encoding factor variables

2015-04-06 Thread Xiangrui Meng
Before OneHotEncoder or LabelIndexer is merged, you can define a UDF
to do the mapping.

val labelToIndex = udf { ... }
featuresDF.withColumn("f3_dummy", labelToIndex(col("f3")))

See instructions here
http://spark.apache.org/docs/latest/sql-programming-guide.html#udf-registration-moved-to-sqlcontextudf-java--scala

-Xiangrui
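
For the archive, a minimal sketch of the UDF approach above, extended to one dummy column per level. It assumes a Spark version where `udf` and `col` are available from `org.apache.spark.sql.functions`. The object `DummyEncoding`, the helper `addDummies`, and the explicit list of levels are hypothetical names for illustration, not part of Spark.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object DummyEncoding {
  // Hypothetical helper: appends one 0.0/1.0 column per level of `column`,
  // then selects every column except the original factor column.
  def addDummies(df: DataFrame, column: String, levels: Seq[String]): DataFrame = {
    val withDummies = levels.foldLeft(df) { (acc, level) =>
      // The UDF returns 1.0 when the row's value equals this level, else 0.0.
      // A null value compares unequal to every level, so it maps to 0.0.
      val isLevel = udf { (value: String) => if (value == level) 1.0 else 0.0 }
      acc.withColumn(s"${column}_$level", isLevel(col(column)))
    }
    // Keep all columns but the original one, so 80+ features need no pattern match.
    val kept = withDummies.columns.filter(_ != column).map(col)
    withDummies.select(kept: _*)
  }
}
```

With the names from this thread, `DummyEncoding.addDummies(featuresDF, "f3", Seq("male", "female"))` would add `f3_male` and `f3_female` and leave the other columns untouched. The levels are passed in explicitly here; collecting them with a distinct query over the column is another option.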

On Mon, Apr 6, 2015 at 7:31 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote:
 Hi folks, currently have a DF that has a factor variable -- say gender.

 I am hoping to use the RandomForest algorithm on this data and it appears
 that this needs to be converted to RDD[LabeledPoint] first -- i.e. all
 features need to be encoded as doubles.

 I see https://issues.apache.org/jira/browse/SPARK-5888 is still open but was
 wondering what is the recommended way to add a column? I can think of

 featuresDF.map { case Row(f1, f2, f3) => (f1, f2, if (f3 == "male") 0 else 1,
 if (f3 == "female") 0 else 1) }.toDF("f1", "f2", "f3_dummy", "f3_dummy2")


 but that isn't ideal as I already have 80+ features in that dataframe so the
 matching itself is a pain -- thinking there's got to be a better way to
 append |levels| number of columns and select all columns but f3?

 I see a withColumn method but no constructor to create a column...should I
 be creating the dummy features in a new dataframe and then select them out
 of there to get a Column?

 Any pointers are appreciated -- I'm sure I'm not the first person to attempt
 this, just unsure of the least painful way to achieve it.




DataFrame -- help with encoding factor variables

2015-04-06 Thread Yana Kadiyska
Hi folks, currently have a DF that has a factor variable -- say gender.

I am hoping to use the RandomForest algorithm on this data and it appears
that this needs to be converted to RDD[LabeledPoint] first -- i.e. all
features need to be encoded as doubles.

I see https://issues.apache.org/jira/browse/SPARK-5888 is still open but
was wondering what is the recommended way to add a column? I can think of

featuresDF.map { case Row(f1, f2, f3) => (f1, f2, if (f3 == "male") 0 else 1,
if (f3 == "female") 0 else 1) }.toDF("f1", "f2", "f3_dummy", "f3_dummy2")

but that isn't ideal as I already have 80+ features in that dataframe so
the matching itself is a pain -- thinking there's got to be a better way to
append |levels| number of columns and select all columns but f3?

I see a withColumn method but no constructor to create a column...should I
be creating the dummy features in a new dataframe and then select them out
of there to get a Column?

Any pointers are appreciated -- I'm sure I'm not the first person to
attempt this, just unsure of the least painful way to achieve it.
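
For reference, a rough sketch of the RDD[LabeledPoint] conversion mentioned above, assuming every column is already a Double and that one of them holds the label. The label column name, the object `ToLabeledPoints`, and the helper `toLabeledPoints` are hypothetical, and `Row.getAs` by field name needs a Spark release that provides it.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

object ToLabeledPoints {
  // Hypothetical helper: treats `labelCol` as the label and every other
  // column as a feature. All columns must already hold Double values;
  // any other type fails with a ClassCastException at runtime.
  def toLabeledPoints(df: DataFrame, labelCol: String): RDD[LabeledPoint] = {
    val featureCols = df.columns.filter(_ != labelCol)
    df.rdd.map { row =>
      val label = row.getAs[Double](labelCol)
      val features = featureCols.map(name => row.getAs[Double](name))
      LabeledPoint(label, Vectors.dense(features))
    }
  }
}
```

The feature order in each vector follows the DataFrame's column order, so the dummy columns should be appended before this conversion runs.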