Hi Jao,

You can use external tools and libraries if they can be called from your
Spark program or script (with appropriate conversion of data types, etc.).
The best way to apply a pre-trained model to a dataset would be to call the
model from within a closure, e.g.:

myRDD.map { myDatum => preTrainedModel.predict(myDatum) }

If your data is distributed in an RDD (myRDD), then the above call will
distribute the prediction computation across the cluster using the
pre-trained model.  It does require that all of your Spark workers be able
to run preTrainedModel; that may mean installing Caffe and its dependencies
on all nodes in the compute cluster.
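
If the external tool is a standalone executable rather than a library you
can call in-process, RDD.pipe is another option.  A rough sketch (the
command, its flags, and the parsing helper below are placeholders, not real
names):

myImagePathsRDD
  .pipe("extract_features --model net.caffemodel")  // placeholder command
  .map(line => parseFeatures(line))  // hypothetical parser of the tool's text output

Each element is written to the external process's stdin as a line of text,
and each line the process writes to stdout becomes an element of the
resulting RDD[String], so you have to handle the text (de)serialization
yourself.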

For the second question, I would modify the above call as follows:

myRDD.mapPartitions { myDataOnPartition =>
  val myModel = // instantiate neural network on this partition
  myDataOnPartition.map { myDatum => myModel.predict(myDatum) }
}
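
If Caffe gains from batched inputs, you can also group each partition's
iterator into batches before handing it to the model.  A minimal sketch of
that idea (createModel, predictBatch, and batchSize are placeholders, not
Caffe API):

myRDD.mapPartitions { myDataOnPartition =>
  val myModel = createModel()   // hypothetical factory; called once per partition
  val batchSize = 32            // assumed value; tune for your network and memory
  myDataOnPartition
    .grouped(batchSize)                             // Iterator of batches
    .flatMap(batch => myModel.predictBatch(batch))  // predictBatch is hypothetical
}

This keeps the expensive network instantiation to once per partition while
still feeding the data to the network in batches.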

I hope this helps!
Joseph

On Fri, Feb 27, 2015 at 10:27 PM, Jaonary Rabarisoa <jaon...@gmail.com>
wrote:

> Dear all,
>
>
> We mainly do large-scale computer vision tasks (image classification,
> retrieval, ...). The pipeline API is really great for that. We're trying
> to reproduce the tutorial given on that topic during the latest Spark
> Summit (
> http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html
> )
> using the master version of the Spark pipeline and DataFrame APIs. The tutorial
> shows different examples of feature extraction stages before running
> machine learning algorithms. Even though the tutorial is straightforward to
> reproduce with this new API, we still have some questions:
>
>    - Can one use external tools (e.g. via pipe) as a pipeline stage? An
>    example use case is extracting features learned with a convolutional
>    neural network. In our case, this corresponds to a pre-trained neural
>    network built with the Caffe library (http://caffe.berkeleyvision.org/).
>
>
>    - The second question is about the performance of the pipeline.
>    Libraries such as Caffe process the data in batches, and instantiating
>    one Caffe network can be time-consuming when the network is very deep. So
>    we can gain performance if we minimize the number of Caffe network
>    creations and feed the data to the network in batches. In the pipeline,
>    this corresponds to running transformers that work on a per-partition
>    basis and give the whole partition to a single Caffe network. How can we
>    create such a transformer?
>
>
>
> Best,
>
> Jao
>
