> I am trying to implement the org.apache.spark.ml.Transformer interface in
> Java 8. My understanding is the pseudo code for transformers is something
> like:
>
>     @Override
>     public DataFrame transform(DataFrame df) {
>         // 1. Select the input column
>         // 2. Create a new column
>         // 3. Append the new column to the df argument and return
>     }
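For the class around that transform method, here is a rough, hypothetical skeleton. The exact set of abstract methods depends on your Spark version (this sketch assumes the Spark 1.5-era org.apache.spark.ml API), and it assumes a "stem" UDF has already been registered on the SQLContext before transform() runs:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.expr;

// Hypothetical skeleton: assumes the "stem" UDF is registered elsewhere,
// and that rawInput is a string column in the incoming DataFrame.
public class StemmingTransformer extends Transformer {

    @Override
    public DataFrame transform(DataFrame df) {
        // Steps 1-3 of the pseudo code collapse into one call:
        // read rawInput, compute the new column, append it, return.
        return df.withColumn("filteredInput", expr("stem(rawInput)"));
    }

    @Override
    public StructType transformSchema(StructType schema) {
        // Output schema = input schema plus the appended string column.
        List<StructField> fields = new ArrayList<>(Arrays.asList(schema.fields()));
        fields.add(DataTypes.createStructField("filteredInput", DataTypes.StringType, true));
        return DataTypes.createStructType(fields);
    }

    @Override
    public Transformer copy(ParamMap extra) {
        return defaultCopy(extra);
    }

    @Override
    public String uid() {
        return "stemmingTransformer";
    }
}
```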
The following line can be used inside of the transform function to return a
DataFrame that has been augmented with a new column, using the stem lambda
function (registered as a UDF below):

    return df.withColumn("filteredInput", expr("stem(rawInput)"));

This produces a new column called filteredInput (appended to whatever columns
are already there) by passing the column rawInput through your arbitrary
lambda function.

> Based on my experience the current DataFrame api is very limited. You can
> not apply a complicated lambda function. As a work around I convert the
> data frame to a JavaRDD, apply my complicated lambda, and then convert the
> resulting RDD back to a Data Frame.

This is exactly what this code is doing: you are defining an arbitrary lambda
function as a UDF. The difference, compared to a JavaRDD map, is that the UDF
appends the column for you, without your having to manually attach the new
data to some existing object:

    sqlContext.udf().register("stem", new UDF1<String, String>() {
        @Override
        public String call(String str) {
            return // TODO: stemming code here
        }
    }, DataTypes.StringType);

> Now I select the "new column" from the Data Frame and try to call
> df.withColumn().
>
> I can try and implement this as a UDF. However I need to use several 3rd
> party jars. Any idea how to ensure the workers will have the required jar
> files? If I was submitting a normal Java app I would create an uber jar;
> will this work with UDFs?

Yes, an uber jar will work. UDFs are shipped to the workers the same way as
your RDD lambda functions.
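Putting the two pieces together, here is a minimal sketch. It assumes the Spark 1.5-era SQLContext/DataFrame API; the class name, column names, and the trivial suffix-stripping function standing in for a real 3rd-party stemmer are all placeholders:

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class StemUdfSketch {

    // Stand-in stemmer: strips a few common English suffixes.
    // A real implementation would delegate to a 3rd-party stemming library.
    public static String stem(String str) {
        if (str == null) return null;
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (str.endsWith(suffix) && str.length() > suffix.length() + 2) {
                return str.substring(0, str.length() - suffix.length());
            }
        }
        return str;
    }

    // Registers the UDF and appends the stemmed column, as described above.
    public static DataFrame addStemmedColumn(SQLContext sqlContext, DataFrame df) {
        sqlContext.udf().register("stem", new UDF1<String, String>() {
            @Override
            public String call(String str) {
                return stem(str);
            }
        }, DataTypes.StringType);
        // callUDF avoids parsing an expression string;
        // expr("stem(rawInput)") works just as well.
        return df.withColumn("filteredInput", callUDF("stem", col("rawInput")));
    }
}
```

If the stemming logic lives in 3rd-party jars, bundling them into the uber jar (or passing them via spark-submit's --jars flag) makes the classes available on the executors that run the UDF.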