>
> I am trying to implement the org.apache.spark.ml.Transformer interface in
> Java 8.
>
> My understanding is the pseudocode for transformers is something like:
>
> @Override
> public DataFrame transform(DataFrame df) {
>     // 1. Select the input column
>     // 2. Create a new column
>     // 3. Append the new column to the df argument and return it
> }
>

The following line can be used inside the transform method to return a
DataFrame augmented with a new column, using the stem lambda function
(defined as a UDF below); expr here is org.apache.spark.sql.functions.expr:

return df.withColumn("filteredInput", expr("stem(rawInput)"));

This produces a new column called filteredInput (appended to whatever
columns are already there) by passing the rawInput column through your
arbitrary lambda function.


> Based on my experience the current DataFrame API is very limited. You
> cannot apply a complicated lambda function. As a workaround I convert the
> DataFrame to a JavaRDD, apply my complicated lambda, and then convert the
> resulting RDD back to a DataFrame.
>

This is exactly what this code is doing. You are defining an arbitrary
lambda function as a UDF. The difference, compared to a JavaRDD map, is
that the UDF appends the new column for you; you don't have to manually
stitch the new data onto some existing object.

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register the lambda as a UDF named "stem" so expr() can call it by name.
sqlContext.udf().register("stem", new UDF1<String, String>() {
  @Override
  public String call(String str) {
    return str; // TODO: stemming code here; identity for now
  }
}, DataTypes.StringType);
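
Since you are on Java 8, the same registration can also be written as a
lambda (UDF1 is a single-method interface, and the cast picks the right
register() overload); imports are the same as above:

sqlContext.udf().register("stem",
    (UDF1<String, String>) str -> str, // TODO: stemming code here
    DataTypes.StringType);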

> Now I select the “new column” from the DataFrame and try to call
> df.withColumn().
>
>
> I can try and implement this as a UDF. However, I need to use several 3rd
> party jars. Any idea how to ensure the workers will have the required jar
> files? If I was submitting a normal Java app I would create an uber jar;
> will this work with UDFs?
>

Yeah, UDFs are shipped to the workers and run the same way as your RDD
lambda functions, so the uber jar approach will work here too.
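
For the jar question specifically: building an uber jar and submitting it
as usual is enough, since the shaded 3rd party classes travel inside the
application jar to the executors that run your UDF. A sketch, where
com.example.MyApp and my-app-uber.jar are hypothetical names:

# Build the uber jar (e.g. with the Maven shade plugin), then:
spark-submit --class com.example.MyApp --master yarn my-app-uber.jar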
