Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Shivaram Venkataraman
My guess is that the `createDataFrame` call is failing here. Can you check if the schema being passed to it includes the column name and type for the newly zipped `features` column? Joseph probably knows this better, but AFAIK the DenseVector here will need to be marked as a VectorUDT while
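A minimal sketch (not the poster's exact code) of passing an explicit schema to createDataFrame in which the zipped "features" column is declared as a VectorUDT. The column names, featuresRDD, and sqlContext are assumptions; in Spark 2.x+ the public equivalent of `new VectorUDT` is org.apache.spark.ml.linalg.SQLDataTypes.VectorType.

import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("features", new VectorUDT, nullable = false) // DenseVector column typed as VectorUDT
))

// featuresRDD: RDD[Array[Double]], aligned 1:1 with dataSet's rows (zip assumes matching partitioning)
val rows = dataSet.rdd.zip(featuresRDD).map { case (row, f) =>
  Row(row.getString(0), Vectors.dense(f))
}
val withFeatures = sqlContext.createDataFrame(rows, schema)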

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Jaonary Rabarisoa
Following your suggestion, I ended up with the following implementation: override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = { val schema = transformSchema(dataSet.schema, paramMap, logging = true); val map = this.paramMap ++ paramMap; val features =

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Jaonary Rabarisoa
In my transformSchema I do specify that the output column type is a VectorUDT: override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = { val map = this.paramMap ++ paramMap; checkInputColumn(schema, map(inputCol), ArrayType(FloatType, false))
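A hedged sketch of a transformSchema along these lines: validate that the input column is Array[Float] and append an output column typed as a VectorUDT. The checkInputColumn helper from the original post is replaced here by a plain require, and the inputCol/outputCol params are assumed.

import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.sql.types.{ArrayType, FloatType, StructField, StructType}

override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
  val map = this.paramMap ++ paramMap
  val inputType = schema(map(inputCol)).dataType
  require(inputType == ArrayType(FloatType, containsNull = false),
    s"Column ${map(inputCol)} must be ArrayType(FloatType), got $inputType")
  require(!schema.fieldNames.contains(map(outputCol)),
    s"Output column ${map(outputCol)} already exists")
  // Declaring the new column as a VectorUDT is what lets createDataFrame accept DenseVector values.
  StructType(schema.fields :+ StructField(map(outputCol), new VectorUDT, nullable = false))
}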

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-30 Thread Jaonary Rabarisoa
Dear all, I'm still struggling to make a pre-trained Caffe model transformer for DataFrames work. The main problem is that creating a Caffe model inside the UDF is very slow and consumes a lot of memory. Some of you suggested broadcasting the model. The problem with broadcasting is that I use a JNI

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-30 Thread Shivaram Venkataraman
One workaround could be to convert the DataFrame into an RDD inside the transform function, use mapPartitions/broadcast to work with the JNI calls, and then convert back to a DataFrame. Thanks Shivaram On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, I'm
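A rough sketch of this workaround, under the assumption that the JNI-backed model must be built once per partition on the worker. createCaffeModel and extractFeatures are hypothetical names, and imports of Row and Vectors are assumed; the real transform would keep the poster's schema logic.

override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
  val schema = transformSchema(dataSet.schema, paramMap, logging = true)
  val map = this.paramMap ++ paramMap
  val inCol = map(inputCol)

  val withFeatures = dataSet.rdd.mapPartitions { rows =>
    val model = createCaffeModel()          // heavy JNI setup happens once per partition, on the worker
    rows.map { row =>
      val raw = row.getAs[Seq[Float]](inCol).toArray
      val features = Vectors.dense(model.extractFeatures(raw).map(_.toDouble))
      Row.fromSeq(row.toSeq :+ features)    // append the new column value to the existing row
    }
  }
  // Convert the RDD[Row] back to a DataFrame using the transformed schema.
  dataSet.sqlContext.createDataFrame(withFeatures, schema)
}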

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-03 Thread Jaonary Rabarisoa
Here is my current implementation with the current master version of Spark: class DeepCNNFeature extends Transformer with HasInputCol with HasOutputCol ... { override def transformSchema(...) { ... } override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-03 Thread Joseph Bradley
I see. I think your best bet is to create the cnnModel on the master and then serialize it to send to the workers. If it's big (1M or so), then you can broadcast it and use the broadcast variable in the UDF. There is not a great way to do something equivalent to mapPartitions with UDFs right
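A hedged sketch of the broadcast approach: the model is serialized from the driver as a broadcast variable and referenced inside the UDF, so each executor fetches it only once. cnnModel and extractFeatures are placeholders, and callUDF with an explicit return DataType is the 1.3-era API (newer Spark uses functions.udf).

import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
import org.apache.spark.sql.functions.callUDF

val cnnModelBc = dataSet.sqlContext.sparkContext.broadcast(cnnModel)

val extract = (raw: Seq[Float]) =>
  Vectors.dense(cnnModelBc.value.extractFeatures(raw.toArray).map(_.toDouble))

val result = dataSet.withColumn(
  map(outputCol),
  callUDF(extract, new VectorUDT, dataSet(map(inputCol))))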

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-03 Thread Jaonary
I think it would be interesting to have an equivalent of mapPartitions for DataFrames. There are many use cases where data are processed in batches. Another example is a simple linear classifier Ax = b, where A is the matrix of feature vectors, x the model and b the output. Here again the product
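An illustrative sketch of the batched idea: gather a partition's feature vectors into a matrix A and compute A * x once per partition instead of one dot product per row. Breeze is used for the product; featureRDD and the coefficient array x are assumptions.

import breeze.linalg.{DenseMatrix, DenseVector => BDV}

val xBc = sc.broadcast(new BDV(x))                    // model coefficients, shipped once

val predictions = featureRDD.mapPartitions { iter =>  // featureRDD: RDD[Array[Double]]
  val rows = iter.toArray
  if (rows.isEmpty) Iterator.empty
  else {
    val d = rows.head.length
    val a = DenseMatrix.tabulate(rows.length, d)((i, j) => rows(i)(j))  // one example per row
    (a * xBc.value).toArray.iterator                  // b = A x for the whole partition
  }
}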

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-02 Thread Joseph Bradley
I see, thanks for clarifying! I'd recommend following existing implementations in spark.ml transformers. You'll need to define a UDF which operates on a single Row to compute the value for the new column. You can then use the DataFrame DSL to create the new column; the DSL provides a nice syntax
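A minimal sketch of the recommended pattern, assuming the feature computation can run row by row: define a function over the input column's value and attach the result as a new column through the DataFrame DSL. Names and the 1.3-era callUDF signature are assumptions.

import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
import org.apache.spark.sql.functions.callUDF

override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
  transformSchema(dataSet.schema, paramMap, logging = true)
  val map = this.paramMap ++ paramMap
  val toFeatures = (raw: Seq[Float]) => Vectors.dense(raw.map(_.toDouble).toArray)
  // withColumn adds the computed column while keeping all existing ones.
  dataSet.withColumn(map(outputCol),
    callUDF(toFeatures, new VectorUDT, dataSet(map(inputCol))))
}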

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-01 Thread Jaonary Rabarisoa
class DeepCNNFeature extends Transformer ... { override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = { // How can I do a mapPartitions on the underlying RDD and then add the column? } } On Sun, Mar 1, 2015 at 10:23 AM, Jaonary Rabarisoa

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-01 Thread Jaonary Rabarisoa
Hi Joseph, Thank you for the tips. I understand what I should do when my data are represented as an RDD. The thing that I can't figure out is how to do the same thing when the data is viewed as a DataFrame and I need to add the result of my pre-trained model as a new column in the DataFrame.

Re: Some questions after playing a little with the new ml.Pipeline.

2015-02-28 Thread Joseph Bradley
Hi Jao, You can use external tools and libraries if they can be called from your Spark program or script (with appropriate conversion of data types, etc.). The best way to apply a pre-trained model to a dataset would be to call the model from within a closure, e.g.: myRDD.map { myDatum =
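A tiny sketch of the closure approach; myModel, myRDD, and predict are placeholders. The model object is captured by the closure, so it must be serializable (or be a broadcast variable) to reach the executors.

val predictions = myRDD.map { myDatum =>
  myModel.predict(myDatum.features)
}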

Some questions after playing a little with the new ml.Pipeline.

2015-02-27 Thread Jaonary Rabarisoa
Dear all, We mainly do large-scale computer vision tasks (image classification, retrieval, ...). The Pipeline is a really great tool for that. We're trying to reproduce the tutorial given on that topic during the latest Spark Summit (