Hi Jao,

You can use external tools and libraries if they can be called from your Spark program or script (with appropriate conversion of data types, etc.). The best way to apply a pre-trained model to a dataset would be to call the model from within a closure, e.g.:
  myRDD.map { myDatum => preTrainedModel.predict(myDatum) }

If your data is distributed in an RDD (myRDD), then the above call will distribute the prediction computation using the pre-trained model. It will require that all of your Spark workers be able to run preTrainedModel; that may mean installing Caffe and its dependencies on all nodes in the compute cluster.

For the second question, I would modify the above call as follows:

  myRDD.mapPartitions { myDataOnPartition =>
    val myModel = // instantiate neural network on this partition
    myDataOnPartition.map { myDatum => myModel.predict(myDatum) }
  }

I hope this helps!
Joseph

On Fri, Feb 27, 2015 at 10:27 PM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> Dear all,
>
> We mainly do large-scale computer vision tasks (image classification,
> retrieval, ...). The pipeline is really great stuff for that. We're trying
> to reproduce the tutorial given on that topic during the latest Spark
> Summit (
> http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html
> ) using the master version of the Spark pipeline and dataframe APIs. The
> tutorial shows different examples of feature-extraction stages run before
> machine learning algorithms. Even though the tutorial is straightforward to
> reproduce with this new API, we still have some questions:
>
> - Can one use external tools (e.g. via pipe) as a pipeline stage? An
> example use case is extracting features learned with a convolutional neural
> network. In our case, this corresponds to a network pre-trained with the
> Caffe library (http://caffe.berkeleyvision.org/).
>
> - The second question is about the performance of the pipeline.
> Libraries such as Caffe process data in batches, and instantiating one
> Caffe network can be time-consuming when the network is very deep. So we
> can gain performance if we minimize the number of Caffe network creations
> and feed data to the network in batches.
> In the pipeline, this corresponds to running transformers that work on a
> per-partition basis and feed the whole partition to a single Caffe network.
> How can we create such a transformer?
>
> Best,
> Jao
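
P.S. To make the mapPartitions pattern above concrete, here is a minimal sketch that also feeds each partition to the network in batches, so the expensive network instantiation happens once per partition. CaffeNet and predictBatch are hypothetical placeholders for whatever binding you use to call Caffe from the JVM (e.g. JNI or an external process); they are not real Caffe APIs.

  import org.apache.spark.rdd.RDD

  // Hypothetical wrapper around a Caffe network; the actual binding
  // to the Caffe library is not shown here.
  trait CaffeNet {
    def predictBatch(batch: Seq[Array[Float]]): Seq[Array[Float]]
  }

  object FeatureExtraction {
    // One network instantiation per partition; data fed in batches.
    def extract(data: RDD[Array[Float]],
                makeNet: () => CaffeNet,
                batchSize: Int = 32): RDD[Array[Float]] = {
      data.mapPartitions { rows =>
        val net = makeNet()  // expensive; done once per partition
        // Iterator.grouped yields batches lazily, so the partition
        // is streamed through the network batchSize rows at a time.
        rows.grouped(batchSize).flatMap { batch =>
          net.predictBatch(batch)
        }
      }
    }
  }

Wrapping this in an ML pipeline Transformer then just means calling extract inside the transform method.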
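
P.P.S. On the first question: RDD.pipe is another option for calling an external tool as a stage. It writes each element of the RDD to the external program's stdin (one line per element) and returns the program's stdout lines as a new RDD of Strings, so you are responsible for serializing to/from text. A minimal sketch (the script path and input file are hypothetical):

  // extract_features.sh is a hypothetical script that reads image
  // paths from stdin and writes one feature vector per line to stdout.
  val imagePaths = sc.textFile("hdfs:///data/image_paths.txt")
  val features: org.apache.spark.rdd.RDD[String] =
    imagePaths.pipe("/path/to/extract_features.sh")

Note that pipe launches the external process on each worker, so the tool must be installed on every node, just as with the closure approach.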