My guess is that the `createDataFrame` call is failing here. Can you check
whether the schema being passed to it includes the column name and type for
the newly zipped `features` column?
Joseph probably knows this better, but AFAIK the DenseVector here will need
to be marked as a VectorUDT while
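For reference, a minimal sketch of declaring the new column as a VectorUDT when calling createDataFrame (assuming the Spark 1.3-era mllib.linalg API; `zippedRows` and the column name are illustrative):

```scala
import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.sql.types.{StructField, StructType}

// Extend the original schema with the zipped `features` column,
// typed as a VectorUDT rather than a plain ArrayType.
val newSchema = StructType(dataSet.schema.fields :+
  StructField("features", new VectorUDT, nullable = false))

// zippedRows: RDD[Row] with the DenseVector appended to each row
val df = sqlContext.createDataFrame(zippedRows, newSchema)
```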
Following your suggestion, I end up with the following implementation:

    override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
      val schema = transformSchema(dataSet.schema, paramMap, logging = true)
      val map = this.paramMap ++ paramMap

      val features =
In my transformSchema I do specify that the output column type is a VectorUDT:

    override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
      val map = this.paramMap ++ paramMap
      checkInputColumn(schema, map(inputCol), ArrayType(FloatType, false))
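For what it's worth, a sketch of how the output column could then be appended in transformSchema (`checkInputColumn` is the author's own helper; the rest assumes the Spark 1.3-era params API):

```scala
override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
  val map = this.paramMap ++ paramMap
  checkInputColumn(schema, map(inputCol), ArrayType(FloatType, false))
  // Append the output column, declared as a VectorUDT so that
  // createDataFrame later accepts the DenseVector values.
  StructType(schema.fields :+
    StructField(map(outputCol), new VectorUDT, nullable = false))
}
```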
Dear all,

I'm still struggling to make a pre-trained Caffe model transformer for
DataFrames work. The main problem is that creating a Caffe model inside the
UDF is very slow and consumes a lot of memory.

Some of you suggested broadcasting the model. The problem with broadcasting
is that I use a JNI

One workaround could be to convert the DataFrame into an RDD inside the
transform function, use mapPartitions/broadcast to work with the JNI calls,
and then convert back to a DataFrame.
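A sketch of that workaround (`loadCaffeModel` and `extractFeatures` are hypothetical stand-ins for the JNI calls, and the output schema is assumed to already include the new column):

```scala
val modelPath = sc.broadcast("/path/to/model.caffemodel")

val withFeatures = dataSet.rdd.mapPartitions { rows =>
  // Heavy JNI initialization happens once per partition, not once per row.
  val model = loadCaffeModel(modelPath.value)
  rows.map(row => Row.fromSeq(row.toSeq :+ extractFeatures(model, row)))
}

// Convert back to a DataFrame with the extended schema.
val result = sqlContext.createDataFrame(withFeatures, newSchema)
```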
Thanks
Shivaram
On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Here is my current implementation with the current master version of Spark:

    class DeepCNNFeature extends Transformer with HasInputCol with HasOutputCol ... {

      override def transformSchema(...) { ... }

      override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
I see. I think your best bet is to create the cnnModel on the master and
then serialize it to send to the workers. If it's big (1M or so), then you
can broadcast it and use the broadcast variable in the UDF. There is not a
great way to do something equivalent to mapPartitions with UDFs right now.
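A sketch of that approach (it assumes cnnModel is serializable and that `extractFeatures` is a hypothetical method on the model; the column names are illustrative):

```scala
import org.apache.spark.sql.functions.udf

val bcModel = sc.broadcast(cnnModel)

// The UDF closes over the broadcast variable, so each executor
// deserializes the model once instead of once per task or row.
val extract = udf { (image: Seq[Float]) =>
  bcModel.value.extractFeatures(image.toArray)
}

val withFeatures = dataSet.withColumn("features", extract(dataSet("image")))
```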
I think it would be interesting to have the equivalent of mapPartitions for
DataFrames. There are many use cases where data are processed in batches. Another
example is a simple linear classifier Ax = b, where A is the matrix of feature
vectors, x the model and b the output. Here again the product
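To illustrate the batching point, a sketch using Breeze (which Spark already depends on) that computes the product Ax one partition at a time instead of row by row (all names are illustrative; the DenseMatrix varargs constructor is assumed to accept one Array[Double] per row):

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

val x = sc.broadcast(DenseVector(weights))  // the model

// featureRDD: RDD[Array[Double]], one feature vector per element
val b = featureRDD.mapPartitions { it =>
  val rows = it.toArray
  // Build one matrix per partition and do a single
  // matrix-vector product instead of many dot products.
  val A = DenseMatrix(rows: _*)
  (A * x.value).toArray.iterator
}
```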
I see, thanks for clarifying!
I'd recommend following existing implementations in spark.ml transformers.
You'll need to define a UDF which operates on a single Row to compute the
value for the new column. You can then use the DataFrame DSL to create the
new column; the DSL provides a nice syntax
    class DeepCNNFeature extends Transformer ... {
      override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = {
        // How can I do a mapPartitions on the underlying RDD and
        // then add the column?
      }
    }
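Concretely, the suggested UDF + DSL pattern could look something like the following sketch (`computeFeatures` is a hypothetical per-row function; as noted above, there is no direct mapPartitions equivalent with UDFs):

```scala
import org.apache.spark.sql.functions.udf

class DeepCNNFeature extends Transformer ... {
  override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = {
    // Define a UDF over the input column's value rather than
    // mapping over the underlying RDD.
    val deepFeatures = udf { (image: Seq[Float]) => computeFeatures(image) }
    data.withColumn("features", deepFeatures(data("image")))
  }
}
```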
On Sun, Mar 1, 2015 at 10:23 AM, Jaonary Rabarisoa
Hi Joseph,

Thank you for the tips. I understand what I should do when my data are
represented as an RDD. The thing that I can't figure out is how to do the
same thing when the data is viewed as a DataFrame and I need to add the
result of my pre-trained model as a new column in the DataFrame.
Hi Jao,
You can use external tools and libraries if they can be called from your
Spark program or script (with appropriate conversion of data types, etc.).
The best way to apply a pre-trained model to a dataset would be to call the
model from within a closure, e.g.:
myRDD.map { myDatum =>
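A completed sketch of that closure pattern (`myModel` is assumed to be a serializable pre-trained model with a hypothetical `predict` method):

```scala
val predictions = myRDD.map { myDatum =>
  // The model is captured by the closure and shipped to the workers.
  myModel.predict(myDatum)
}
```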
Dear all,

We mainly do large-scale computer vision tasks (image classification,
retrieval, ...). The pipeline is really great stuff for that. We're trying
to reproduce the tutorial given on that topic during the latest Spark
Summit (