[jira] [Comment Edited] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel

颜发才 Fri, 07 Jul 2017 23:29:39 -0700

    [ https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078987#comment-16078987 ]


Yan Facai (颜发才) edited comment on SPARK-21341 at 7/8/17 6:28 AM:
-----------------------------------------------------------------

Hi, [~zsellami].
I guess wordVectors is marked private and transient because it is in fact an 
mllib model, which might be removed in the future. More interestingly, the 
word vectors are saved in the data folder as a DataFrame, see:

{code}
override protected def saveImpl(path: String): Unit = {
  DefaultParamsWriter.saveMetadata(instance, path, sc)

  val wordVectors = instance.wordVectors.getVectors
  val dataSeq = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
  val dataPath = new Path(path, "data").toString
  sparkSession.createDataFrame(dataSeq)
    .repartition(calculateNumberOfPartitions)
    .write
    .parquet(dataPath)
}
{code}

In short, the developers did try to save the word vectors; however, as you 
said, it seems to break in a pipeline.
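
For comparison, here is a minimal sketch of going through the ML writer shown above rather than Java serialization. It assumes Spark ML is on the classpath. The object and method names (Word2VecPersistenceSketch, saveAndReload, readSavedVectors) are hypothetical, and the "data" subfolder layout is taken from the saveImpl snippet above, so it may differ in other Spark versions.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of Spark.
object Word2VecPersistenceSketch {

  // Persist a fitted pipeline through the ML writer, then load it back.
  // Each stage's own writer (such as the saveImpl above) does the saving.
  def saveAndReload(model: PipelineModel, path: String): PipelineModel = {
    model.write.overwrite().save(path)
    PipelineModel.load(path)
  }

  // Read the Parquet data that a Word2VecModel saved on its own is
  // assumed to leave under <modelPath>/data (per the saveImpl snippet).
  def readSavedVectors(spark: SparkSession, modelPath: String): DataFrame =
    spark.read.parquet(new Path(modelPath, "data").toString)
}
```

If the layout matches the snippet, the returned DataFrame should hold one row per word, built from Data(word, vector). The exact column names are an assumption here.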

So, could you give some example code that reproduces the bug?
I'd like to dig deeper.


was (Author: facai):
Hi, [~zsellami].
I guess that since the wordVectors is mllib model in fact, which might be 
removed in the future, so it is marked private and transient. More 
interestingly, wordVectors are saved in data folder as dataframe, see:

{code}
override protected def saveImpl(path: String): Unit = {
  DefaultParamsWriter.saveMetadata(instance, path, sc)

  val wordVectors = instance.wordVectors.getVectors
  val dataSeq = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
  val dataPath = new Path(path, "data").toString
  sparkSession.createDataFrame(dataSeq)
    .repartition(calculateNumberOfPartitions)
    .write
    .parquet(dataPath)
}
{code}

In all, developer indeed take a try to save the wordVector information, however 
it seems to break in pipeline as you said.

So, could you give a example code to reproduce the bug?
I'd like to dig deeper.

> Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel 
> -------------------------------------------------------------------------
>
>                 Key: SPARK-21341
>                 URL: https://issues.apache.org/jira/browse/SPARK-21341
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Zied Sellami
>
> I am using sparkContext.saveAsObjectFile to save a complex object containing a 
> pipelineModel with a Word2Vec ML Transformer. When I load the object and call 
> myPipelineModel.transform, Word2VecModel raises a null pointer error on line 
> 292 of Word2Vec.scala, "wordVectors.getVectors". I resolved the problem by 
> removing the @transient annotation on val wordVectors and the @transient lazy 
> val on the getVectors function.
> - Why are these two vals transient?
> - Is there a way to add a boolean option on the Word2Vec Transformer to force 
> the serialization of wordVectors?
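
The null pointer described in the report is consistent with how Java serialization, which saveAsObjectFile relies on, treats @transient fields: they are skipped on write and come back as null on read. The sketch below reproduces that effect in plain Scala, without Spark. ToyModel, TransientDemo and roundTrip are hypothetical names for illustration only, not Spark classes.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}
import scala.util.Try

// Hypothetical stand-in for a model that keeps its payload in a
// @transient field, as Word2VecModel does with wordVectors.
class ToyModel(@transient val vectors: Map[String, Array[Float]])
    extends Serializable {
  // Dereferences the transient field, like wordVectors.getVectors.
  def size: Int = vectors.size
}

object TransientDemo {
  // Write a value with Java serialization and read it back.
  def roundTrip[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(
      new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val original = new ToyModel(Map("spark" -> Array(0.1f, 0.2f)))
    val restored = roundTrip(original)
    // The transient field was not written, so it is null after reading.
    println(s"vectors is null after round trip: ${restored.vectors == null}")
    // Using the field now fails with a NullPointerException.
    println(s"size call failed: ${Try(restored.size).isFailure}")
  }
}
```

Removing @transient, as the report describes, makes the field part of the serialized form, so the restored object can use it again.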


