Haha ok, it's one of those days - Array isn't valid. RTFM and it says a
Catalyst array maps to a Scala Seq, which makes sense.
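
In case anyone hits the same thing, handing Catalyst a Seq instead of the raw
Array seems to be all that's needed - roughly:

case class ScoringDF(function: Row => Array[Double]) extends Expression {
  val dataType = DataTypes.createArrayType(DataTypes.DoubleType)

  // Catalyst represents ArrayType values as a Scala Seq, not a raw Array
  override type EvaluatedType = Seq[Double]

  override def eval(input: Row): EvaluatedType = {
    function(input).toSeq
  }

  override def nullable: Boolean = false

  override def children: Seq[Expression] = Nil
}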

So it works! Two follow-up questions:

1 - Is this the best approach?
2 - What if I want my expression to return multiple columns? My binary
classification model gives me an array of doubles with 3 fields: the prediction,
the class A probability and the class B probability. How could I turn those
into 3 columns from my expression? Clearly .withColumn only expects 1
column back.
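
One rough idea would be to keep the 3 values in a single array column and then
split them out afterwards with getItem - something like the sketch below (the
"prediction"/"probA"/"probB" names are made up, and myModel.score stands in for
the real scoring call) - but maybe there's a nicer way with a struct type?

// Rough sketch: add the array column, then pull each element out into its own column
val scored = df.withColumn("score", new Column(ScoringDF(row => myModel.score(row))))

val expanded = scored
  .withColumn("prediction", scored("score").getItem(0))
  .withColumn("probA", scored("score").getItem(1))
  .withColumn("probB", scored("score").getItem(2))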

On Tue, Sep 8, 2015 at 6:21 PM, Night Wolf <nightwolf...@gmail.com> wrote:

> Sorry for the spam - I had some success:
>
> case class ScoringDF(function: Row => Double) extends Expression {
>   val dataType = DataTypes.DoubleType
>
>   override type EvaluatedType = Double
>
>   override def eval(input: Row): EvaluatedType = {
>     function(input)
>   }
>
>   override def nullable: Boolean = false
>
>   override def children: Seq[Expression] = Nil
> }
>
> But this falls over if I want to return an Array[Double]:
>
> case class ScoringDF(function: Row => Array[Double]) extends Expression {
>   val dataType = DataTypes.createArrayType(DataTypes.DoubleType)
>
>   override type EvaluatedType = Array[Double]
>
>   override def eval(input: Row): EvaluatedType = {
>     function(input)
>   }
>
>   override def nullable: Boolean = false
>
>   override def children: Seq[Expression] = Nil
> }
>
>
> I get the following exception:
>
> scala> dfs.show
> java.lang.ClassCastException: [D cannot be cast to scala.collection.Seq
> at
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToScalaConverter$2.apply(CatalystTypeConverters.scala:282)
> at
> org.apache.spark.sql.catalyst.CatalystTypeConverters$.convertRowWithConverters(CatalystTypeConverters.scala:348)
> at
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToScalaConverter$4.apply(CatalystTypeConverters.scala:301)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeTake$2.apply(SparkPlan.scala:150)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeTake$2.apply(SparkPlan.scala:150)
>
> Any ideas?
>
> On Tue, Sep 8, 2015 at 5:47 PM, Night Wolf <nightwolf...@gmail.com> wrote:
>
>> So basically I need something like
>>
>> df.withColumn("score", new Column(new Expression {
>>  ...
>>
>> def eval(input: Row = null): EvaluatedType = myModel.score(input)
>> ...
>>
>> }))
>>
>> But I can't do this, so how can I make a UDF, or something like it, that
>> can take in a Row and pass back a double value or some struct?
>>
>> On Tue, Sep 8, 2015 at 5:33 PM, Night Wolf <nightwolf...@gmail.com>
>> wrote:
>>
>>> Not sure how that would work. Really I want to tack on an extra column
>>> onto the DF with a UDF that can take a Row object.
>>>
>>> On Tue, Sep 8, 2015 at 1:54 AM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> Can you use a map or list with different properties as one parameter?
>>>> Alternatively, a string where parameters are comma-separated...
>>>>
>>>> On Mon, Sep 7, 2015 at 8:35 AM, Night Wolf <nightwolf...@gmail.com> wrote:
>>>>
>>>>> Is it possible to have a UDF which takes a variable number of
>>>>> arguments?
>>>>>
>>>>> e.g. df.select(myUdf($"*")) fails with
>>>>>
>>>>> org.apache.spark.sql.AnalysisException: unresolved operator 'Project
>>>>> [scalaUDF(*) AS scalaUDF(*)#26];
>>>>>
>>>>> What I would like to do is pass in a generic data frame which can be
>>>>> then passed to a UDF which does scoring of a model. The UDF needs to know
>>>>> the schema to map column names in the model to columns in the DataFrame.
>>>>>
>>>>> The model has 100s of factors (very wide), so I can't just have a
>>>>> scoring UDF that has 500 parameters (for obvious reasons).
>>>>>
>>>>> Cheers,
>>>>> ~N
>>>>>
>>>>
>>>
>>
>
