Hi,

I'm trying to implement a custom one-hot encoder, since I want the output in
a specific format suitable for Theano. Basically, it should add a new column
for each distinct value of each original feature, set to 1 if the
observation has that value and 0 otherwise. The columns would be named
something like feature1.distinct1, feature1.distinct2, ...
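
To make the target layout concrete, here is a small sketch on plain Scala
collections rather than a DataFrame. The sample rows, the feature names
"color" and "size", and the `encode` helper are all made up for illustration;
only the feature.value naming and the 1/0 values reflect what I want.

```scala
object OneHotExample {
  // Hypothetical in-memory rows: each row maps a feature name to its value.
  val rows: Seq[Map[String, String]] = Seq(
    Map("color" -> "red", "size" -> "S"),
    Map("color" -> "blue", "size" -> "S"),
    Map("color" -> "red", "size" -> "L")
  )

  // Produces one output column per (feature, distinct value) pair,
  // named "feature.value", holding 1 when the row has that value and 0 otherwise.
  def encode(rows: Seq[Map[String, String]],
             features: Seq[String]): Seq[Map[String, Int]] = {
    // Enumerate the distinct values seen for each feature.
    val columns = for {
      feature <- features
      value <- rows.flatMap(_.get(feature)).distinct
    } yield (feature, value)

    rows.map { row =>
      columns.map { case (feature, value) =>
        val hit = row.get(feature) == Some(value)
        s"$feature.$value" -> (if (hit) 1 else 0)
      }.toMap
    }
  }

  def main(args: Array[String]): Unit =
    encode(rows, Seq("color", "size")).foreach(println)
}
```

Every encoded row ends up with the same set of columns, one per distinct
value, which is the shape I am trying to get out of Spark.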

Here is my attempt, which seems logically sound:

for (column <- featuresThatNeedEncoding) {
  for (j <- df.select(column).distinct().collect().toSeq) {
    df = df.withColumn(column + "." + j.get(0).toString,
      expr("CASE WHEN " + column + " = '" + j.get(0).toString +
        "' THEN " + column + "." + j.get(0).toString + " = '1'" +
        " ELSE " + column + "." + j.get(0).toString + " = '0' END"))
  }
}

And some of the stack trace:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Can't extract value from someFeature#295;
        at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:72)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:267)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:266)
        at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)

Any ideas how to resolve this or why there is a #295 after my column name?
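
For reference, here is a sketch of a variant I could try instead. It rests on
two assumptions: that the dot inside the expression string is being parsed as
field access on a struct column (which would match "Can't extract value"), and
that #295 is only the internal ID Catalyst attaches to the column when printing
it. The names `OneHotSketch`, `encodedName`, and `oneHot` are hypothetical, and
the underscore separator is a substitute for the dotted names I originally
wanted.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

object OneHotSketch {
  // Hypothetical helper: builds a column name without dots or other special
  // characters. Two values that sanitize to the same name would collide.
  def encodedName(column: String, value: String): String =
    column + "_" + value.replaceAll("[^A-Za-z0-9_]", "_")

  // Adds one 0/1 integer column per distinct non-null value of each listed column.
  def oneHot(df: DataFrame, columns: Seq[String]): DataFrame =
    columns.foldLeft(df) { (outer, column) =>
      // Distinct values are collected to the driver, so this assumes
      // each feature has a small number of distinct values.
      val values = outer.select(column).distinct().collect().toSeq
        .filter(row => !row.isNullAt(0))
        .map(_.get(0).toString)

      values.foldLeft(outer) { (acc, value) =>
        // when/otherwise yields the value itself (1 or 0) rather than
        // an assignment inside the CASE branches.
        acc.withColumn(
          encodedName(column, value),
          when(col(column).cast("string") === value, 1).otherwise(0))
      }
    }
}
```

This would be called as `OneHotSketch.oneHot(df, featuresThatNeedEncoding)`,
and it also avoids reassigning `df` inside the loop.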

Thanks,

Ian


