Hi, I'm trying to implement a custom one-hot encoder, since I want the output in a specific layout suitable for Theano. Basically, it should add a new column for each distinct value of each original feature, set to 1 if the observation has that value and 0 otherwise. The columns would be named something like feature1.distinct1, feature1.distinct2...
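To make the target concrete, here is a minimal sketch of the encoding I have in mind, written in plain Scala over in-memory maps rather than Spark DataFrames. The names OneHotSketch, Row and oneHot are made up for illustration; this is not Spark API and not the code I'm actually running, just the intended result under the assumption that every value is a string.

```scala
object OneHotSketch {
  // Hypothetical stand-in for a DataFrame row: feature name -> string value.
  type Row = Map[String, String]

  // For each feature to encode, emit one "<feature>.<value>" key per distinct
  // value, set to 1 when the row holds that value and 0 otherwise.
  def oneHot(rows: Seq[Row], features: Seq[String]): Seq[Map[String, Int]] = {
    // Distinct values per feature, sorted so the column order is stable.
    val distinctValues: Seq[(String, Seq[String])] =
      features.map(f => f -> rows.flatMap(_.get(f)).distinct.sorted)

    rows.map { row =>
      (for {
        (feature, values) <- distinctValues
        value <- values
      } yield s"$feature.$value" -> (if (row.get(feature).contains(value)) 1 else 0)).toMap
    }
  }
}
```

For example, with the rows Seq(Map("color" -> "red"), Map("color" -> "blue")) and the feature list Seq("color"), the first row should come out with color.red = 1 and color.blue = 0, and the second row with the two values swapped.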
Here is my attempt, which seems logically sound:

    for (column <- featuresThatNeedEncoding) {
      for (j <- df.select(column).distinct().collect().toSeq) {
        df = df.withColumn(column + "." + j.get(0).toString,
          expr("CASE WHEN " + column + " = '" + j.get(0).toString +
            "' THEN " + column + "." + j.get(0).toString + " = '1' ELSE " +
            column + "." + j.get(0).toString + " = '0' END"))
      }
    }

And some of the stack trace:

    Exception in thread "main" org.apache.spark.sql.AnalysisException: Can't extract value from someFeature#295;
        at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:72)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:267)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:266)
        at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)

Any ideas on how to resolve this, or why there is a #295 after my column name?

Thanks,
Ian

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Extra-string-added-to-column-name-withColumn-expr-tp27560.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.