Re: word2vec: how to save an mllib model and reload it?

Carsten Schnober Thu, 05 Feb 2015 09:12:45 -0800

As a Spark newbie, I've come across this thread. I'm playing with Word2Vec in
our Hadoop cluster and here's my issue with classic Java serialization of
the model: I don't have SSH access to the cluster master node.  
Here's my code for computing the model:


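    import java.io.{FileOutputStream, ObjectOutputStream}
    import org.apache.spark.mllib.feature.Word2Vec

    // Tokenize each line into a sequence of words
    val input = sc.textFile("README.md").map(line => line.split(" ").toSeq)
    val word2vec = new Word2Vec()
    val model = word2vec.fit(input)
    // Java-serialize the fitted model to a file on the driver's local filesystem
    val oos = new ObjectOutputStream(new FileOutputStream(modelFile))
    oos.writeObject(model)
    oos.close()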

I can do that locally and get the file as desired. But that is of little use
to me if the file ends up on the master node, which I cannot log in to.
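
One possible way around the local-file limitation, as a minimal sketch and not a tested recipe: keep the Java serialization but write to an arbitrary stream instead of a `FileOutputStream`. The `ModelIO` object and its `save`/`load` methods are hypothetical helpers, not part of MLlib. The commented-out lines assume that Hadoop's `FileSystem` API is on the classpath and that the user may write to the target HDFS path.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, ObjectInputStream, ObjectOutputStream, OutputStream}

// Hypothetical helper (not part of MLlib): Java-serialize to/from any stream.
object ModelIO {
  // Serialize `obj` to `out` and close the stream afterwards.
  def save(obj: AnyRef, out: OutputStream): Unit = {
    val oos = new ObjectOutputStream(out)
    try oos.writeObject(obj) finally oos.close()
  }

  // Read one object back from `in`; the caller states the expected type.
  def load[T](in: InputStream): T = {
    val ois = new ObjectInputStream(in)
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }
}

// Local round trip, with an in-memory buffer standing in for a file.
val buffer = new ByteArrayOutputStream()
ModelIO.save(Map("spark" -> Array(0.1f, 0.2f)), buffer)
val restored = ModelIO.load[Map[String, Array[Float]]](
  new ByteArrayInputStream(buffer.toByteArray))

// On the cluster (assumption: Hadoop FileSystem API available in the shell):
//   import org.apache.hadoop.fs.{FileSystem, Path}
//   val fs = FileSystem.get(sc.hadoopConfiguration)
//   ModelIO.save(model, fs.create(new Path(modelFile)))
//   val reloaded = ModelIO.load[Word2VecModel](fs.open(new Path(modelFile)))
```

Separating the serialization from the destination means the same two methods work for a local file, an in-memory buffer, or an HDFS stream.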

I've alternatively serialized the vectors to HDFS using this code:

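    // getVectors returns a Map[String, Array[Float]] of word -> vector
    val vectors = model.getVectors
    // Distribute the (word, vector) pairs and write them as a Hadoop object file
    val output = sc.parallelize(vectors.toSeq)
    output.saveAsObjectFile(modelFile)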

Indeed, this results in a serialization on HDFS so I can access it as a
user. However, I have not figured out how to create a new Word2VecModel
object from those files.
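
A possible workaround, sketched under assumptions: if rebuilding a Word2VecModel is not an option, the reloaded vectors can be queried directly. The `VectorLookup` object and its `nearest` method are hypothetical helpers, not MLlib API, and the toy vectors below stand in for the real ones. In spark-shell, the map would presumably be read back with `sc.objectFile`, as noted in the comment.

```scala
// Hypothetical helper (not part of MLlib): nearest neighbours by cosine similarity.
object VectorLookup {
  private def cosine(a: Array[Float], b: Array[Float]): Double = {
    val dot = a.zip(b).map { case (x, y) => x.toDouble * y }.sum
    val norms = math.sqrt(a.map(x => x.toDouble * x).sum) *
      math.sqrt(b.map(x => x.toDouble * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Return the k words most similar to `word`, excluding the word itself.
  // Throws NoSuchElementException if `word` is not in the vocabulary.
  def nearest(vectors: Map[String, Array[Float]], word: String, k: Int): Seq[(String, Double)] = {
    val query = vectors(word)
    vectors.toSeq
      .collect { case (w, v) if w != word => (w, cosine(query, v)) }
      .sortBy { case (_, sim) => -sim }
      .take(k)
  }
}

// Toy vectors standing in for the reloaded ones. In spark-shell they might come from:
//   val reloaded = sc.objectFile[(String, Array[Float])](modelFile).collect().toMap
val reloaded = Map(
  "king"  -> Array(1.0f, 0.9f),
  "queen" -> Array(0.9f, 1.0f),
  "apple" -> Array(-1.0f, 0.1f)
)
val top = VectorLookup.nearest(reloaded, "king", 1)
```

This collects all vectors onto the driver, which matches the earlier observation in this thread that the model is not distributed anyway.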

Any clues?
Thanks!
Carsten



MLnick wrote
> Currently I see the word2vec model is collected onto the master, so the
> model itself is not distributed. 
> 
> 
> I guess the question is why do you need a distributed model? Is the vocab
> size so large that it's necessary? For model serving in general, unless
> the model is truly massive (i.e. cannot fit into memory on a modern high-end
> box with 64 or 128 GB of RAM), a single instance is way faster and simpler
> (using a cluster of machines is more for load balancing / fault
> tolerance).
> 
> 
> 
> 
> What is your use case for model serving?
> 
> 
> 
> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <duy.huynh.uiv@> wrote:
> 
>> you're right, serialization works.
>> what is your suggestion on saving a "distributed" model?  so part of the
>> model is in one cluster, and some other parts of the model are in other
>> clusters.  during runtime, these sub-models run independently in their own
>> clusters (load, train, save).  and at some point during run time these
>> sub-models merge into the master model, which also loads, trains, and saves
>> at the master level.
>> much appreciated.
>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <evan.sparks@> wrote:
>>> There's some work going on to support PMML -
>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
>>> merged into master.
>>>
>>> What are you used to doing in other environments? In R I'm used to
>>> running save(), same with MATLAB. In Python, either pickling things or
>>> dumping to JSON seems pretty common (even the scikit-learn docs recommend
>>> pickling: http://scikit-learn.org/stable/modules/model_persistence.html).
>>> These all seem basically equivalent to Java serialization to me.
>>>
>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>> something) make sense to add?
>>>
>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <duy.huynh.uiv@> wrote:
>>>
>>>> that works.  is there a better way in spark?  this seems like the most
>>>> common feature for any machine learning work - to be able to save your
>>>> model after training it and load it later.
>>>>
>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <evan.sparks@> wrote:
>>>>
>>>>> Plain old java serialization is one straightforward approach if you're
>>>>> in java/scala.
>>>>>
>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <duy.huynh.uiv@> wrote:
>>>>>
>>>>>> what is the best way to save an mllib model that you just trained and
>>>>>> reload it in the future?  specifically, i'm using the mllib word2vec
>>>>>> model...
>>>>>> thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscribe@.apache
>>>>>> For additional commands, e-mail: user-help@.apache

>>>>>>
>>>>>>
>>>>>
>>>>
>>>





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329p21517.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

