class DeepCNNFeature extends Transformer ... {

  override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = {
    // How can I do a mapPartitions on the underlying RDD and then add the column?
  }
}
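
One possible sketch, assuming a Spark 1.3-era spark.ml API: run mapPartitions on the underlying RDD[Row], instantiate the network once per partition, append its output to every row, and rebuild a DataFrame with an extended schema. PreTrainedCaffeModel and its predict method below are hypothetical stand-ins for a Caffe wrapper, and the params/transformSchema plumbing is elided:

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}

    class DeepCNNFeature extends Transformer {

      override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = {
        // New schema = input schema + the CNN feature column.
        val outputSchema = StructType(data.schema.fields :+
          StructField("cnnFeature", ArrayType(DoubleType), nullable = false))

        // mapPartitions on the underlying RDD[Row]: instantiate the (hypothetical)
        // pre-trained network once per partition, then append its output
        // (assumed to be a Seq[Double]) to each row.
        val rows = data.rdd.mapPartitions { iter =>
          val model = PreTrainedCaffeModel.load() // hypothetical Caffe wrapper
          iter.map(row => Row.fromSeq(row.toSeq :+ model.predict(row)))
        }

        // Rebuild a DataFrame carrying the extra column.
        data.sqlContext.createDataFrame(rows, outputSchema)
      }
    }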

On Sun, Mar 1, 2015 at 10:23 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:

> Hi Joseph,
>
> Thank you for the tips. I understand what I should do when my data are
> represented as an RDD. The thing that I can't figure out is how to do the
> same thing when the data is viewed as a DataFrame and I need to add the
> result of my pre-trained model as a new column in the DataFrame. Precisely,
> I want to implement the following transformer:
>
> class DeepCNNFeature extends Transformer ... {
>
> }
>
> On Sun, Mar 1, 2015 at 1:32 AM, Joseph Bradley <jos...@databricks.com> wrote:
>
>> Hi Jao,
>>
>> You can use external tools and libraries if they can be called from your
>> Spark program or script (with appropriate conversion of data types, etc.).
>> The best way to apply a pre-trained model to a dataset would be to call the
>> model from within a closure, e.g.:
>>
>> myRDD.map { myDatum => preTrainedModel.predict(myDatum) }
>>
>> If your data is distributed in an RDD (myRDD), then the above call will
>> distribute the computation of prediction using the pre-trained model. It
>> will require that all of your Spark workers be able to run the
>> preTrainedModel; that may mean installing Caffe and its dependencies on all
>> nodes in the compute cluster.
>>
>> For the second question, I would modify the above call as follows:
>>
>> myRDD.mapPartitions { myDataOnPartition =>
>>   val myModel = // instantiate neural network on this partition
>>   myDataOnPartition.map { myDatum => myModel.predict(myDatum) }
>> }
>>
>> I hope this helps!
>> Joseph
>>
>> On Fri, Feb 27, 2015 at 10:27 PM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>
>>> Dear all,
>>>
>>> We mainly do large-scale computer vision tasks (image classification,
>>> retrieval, ...). The pipeline is a really great tool for that. We're
>>> trying to reproduce the tutorial given on that topic during the latest
>>> Spark Summit (
>>> http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html
>>> ) using the master version of the Spark pipeline and DataFrame APIs. The
>>> tutorial shows different examples of feature extraction stages before
>>> running machine learning algorithms. Even though the tutorial is
>>> straightforward to reproduce with this new API, we still have some
>>> questions:
>>>
>>> - Can one use external tools (e.g. via pipe) as a pipeline stage? An
>>> example use case is to extract features learned with a convolutional
>>> neural network. In our case, this corresponds to a pre-trained neural
>>> network built with the Caffe library (http://caffe.berkeleyvision.org/).
>>>
>>> - The second question is about the performance of the pipeline.
>>> Libraries such as Caffe process data in batches, and instantiating one
>>> Caffe network can be time-consuming when the network is very deep. So we
>>> can gain performance if we minimize the number of Caffe network creations
>>> and feed data to the network in batches. In the pipeline, this corresponds
>>> to running transformers that work on a per-partition basis and give the
>>> whole partition to a single Caffe network. How can we create such a
>>> transformer?
>>>
>>> Best,
>>>
>>> Jao
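
Combining Joseph's per-partition suggestion with the batching concern from Jao's second question might look like the sketch below: one network instantiation per partition, with the partition's records fed to it in fixed-size batches. CaffeNet, loadNet, predictBatch, and imagesRDD are all hypothetical stand-ins, not a real Caffe binding:

    val features = imagesRDD.mapPartitions { images =>
      // One (expensive) network instantiation per partition, not per record.
      val net = CaffeNet.loadNet("deploy.prototxt", "weights.caffemodel") // hypothetical wrapper
      // Feed the network fixed-size batches instead of single images;
      // grouped(64) yields the partition's records 64 at a time.
      images.grouped(64).flatMap(batch => net.predictBatch(batch))
    }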