Haha, OK, it's one of those days: Array isn't valid. RTFM and it says a Catalyst array maps to a Scala Seq, which makes sense.
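For the archives, here is roughly the shape that works for me. It is the same as the failing Array[Double] snippet quoted below, just with Seq[Double] as the return type, since that is what Catalyst expects for an ArrayType column:

case class ScoringDF(function: Row => Seq[Double]) extends Expression {
  // Same Expression boilerplate as the snippet quoted below; the only
  // change is Seq[Double] in place of Array[Double].
  val dataType = DataTypes.createArrayType(DataTypes.DoubleType)

  override type EvaluatedType = Seq[Double]

  override def eval(input: Row): EvaluatedType = function(input)

  override def nullable: Boolean = false

  override def children: Seq[Expression] = Nil
}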
So it works! Two follow-up questions:

1. Is this the best approach?
2. What if I want my expression to return multiple columns? My binary classification model gives me an Array[Double] with three fields: the prediction, the class A probability, and the class B probability. How could I turn those into three columns from my expression? Clearly .withColumn only expects one column back. (See the sketch at the bottom of this mail for what I'm picturing.)

On Tue, Sep 8, 2015 at 6:21 PM, Night Wolf <nightwolf...@gmail.com> wrote:

> Sorry for the spam - I had some success:
>
> case class ScoringDF(function: Row => Double) extends Expression {
>   val dataType = DataTypes.DoubleType
>
>   override type EvaluatedType = Double
>
>   override def eval(input: Row): EvaluatedType = {
>     function(input)
>   }
>
>   override def nullable: Boolean = false
>
>   override def children: Seq[Expression] = Nil
> }
>
> But this falls over if I want to return an Array[Double]:
>
> case class ScoringDF(function: Row => Array[Double]) extends Expression {
>   val dataType = DataTypes.createArrayType(DataTypes.DoubleType)
>
>   override type EvaluatedType = Array[Double]
>
>   override def eval(input: Row): EvaluatedType = {
>     function(input)
>   }
>
>   override def nullable: Boolean = false
>
>   override def children: Seq[Expression] = Nil
> }
>
> I get the following exception:
>
> scala> dfs.show
> java.lang.ClassCastException: [D cannot be cast to scala.collection.Seq
>   at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToScalaConverter$2.apply(CatalystTypeConverters.scala:282)
>   at org.apache.spark.sql.catalyst.CatalystTypeConverters$.convertRowWithConverters(CatalystTypeConverters.scala:348)
>   at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToScalaConverter$4.apply(CatalystTypeConverters.scala:301)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeTake$2.apply(SparkPlan.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeTake$2.apply(SparkPlan.scala:150)
>
> Any ideas?
>
> On Tue, Sep 8, 2015 at 5:47 PM, Night Wolf <nightwolf...@gmail.com> wrote:
>
>> So basically I need something like:
>>
>> df.withColumn("score", new Column(new Expression {
>>   ...
>>   def eval(input: Row = null): EvaluatedType = myModel.score(input)
>>   ...
>> }))
>>
>> But I can't do this, so how can I make a UDF, or something like it, that
>> can take in a Row and pass back a double value or some struct?
>>
>> On Tue, Sep 8, 2015 at 5:33 PM, Night Wolf <nightwolf...@gmail.com>
>> wrote:
>>
>>> Not sure how that would work. Really I want to tack an extra column
>>> onto the DF with a UDF that can take a Row object.
>>>
>>> On Tue, Sep 8, 2015 at 1:54 AM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> Can you use a map or list with different properties as one parameter?
>>>> Alternatively, a string where parameters are comma-separated...
>>>>
>>>> On Mon, Sep 7, 2015 at 8:35 AM, Night Wolf <nightwolf...@gmail.com>
>>>> wrote:
>>>>
>>>>> Is it possible to have a UDF which takes a variable number of
>>>>> arguments?
>>>>>
>>>>> e.g. df.select(myUdf($"*")) fails with
>>>>>
>>>>> org.apache.spark.sql.AnalysisException: unresolved operator 'Project
>>>>> [scalaUDF(*) AS scalaUDF(*)#26];
>>>>>
>>>>> What I would like to do is pass in a generic DataFrame which can then
>>>>> be passed to a UDF that does scoring of a model. The UDF needs to
>>>>> know the schema to map column names in the model to columns in the
>>>>> DataFrame.
>>>>> The model has hundreds of factors (very wide), so I can't just have
>>>>> a scoring UDF with 500 parameters (for obvious reasons).
>>>>>
>>>>> Cheers,
>>>>> ~N
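To make question 2 above concrete: is the idiomatic way to add the array column once and then split it out by index? A rough sketch of what I'm picturing (untested; scoresCol stands for the Column wrapping the expression above, and the output names are made up):

import org.apache.spark.sql.functions.col

// Add the ArrayType(DoubleType) column once, so the model is only
// evaluated one time per row, then pull the three elements out by index.
val withScores = df.withColumn("scores", scoresCol)

val split = withScores
  .withColumn("prediction", col("scores").getItem(0))
  .withColumn("probA", col("scores").getItem(1))
  .withColumn("probB", col("scores").getItem(2))
  .drop("scores")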
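And for completeness, on my original variable-number-of-arguments question at the bottom of the thread: one workaround I've since come across (a sketch only, not verified on this exact Spark version; score is a stand-in for the real model call) is to pack all the columns into a single struct, so an ordinary UDF receives one Row:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// Stand-in for the real scoring function; the Row it receives carries a
// schema, so fields can be looked up by column name.
def score(row: Row): Double = ???

// Wrap every column of the DataFrame into one struct column, so the UDF
// takes a single Row argument instead of hundreds of parameters.
val scoreUdf = udf((row: Row) => score(row))
val scored = df.withColumn("score", scoreUdf(struct(df.columns.map(col): _*)))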