[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-02-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16607





[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-02-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/16607#discussion_r99263532
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -302,16 +302,36 @@ class Word2VecModel private[ml] (
 @Since("1.6.0")
 object Word2VecModel extends MLReadable[Word2VecModel] {
 
+  private case class Data(word: String, vector: Array[Float])
+
   private[Word2VecModel]
   class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter {
 
-    private case class Data(wordIndex: Map[String, Int], wordVectors: Seq[Float])
-
     override protected def saveImpl(path: String): Unit = {
       DefaultParamsWriter.saveMetadata(instance, path, sc)
-      val data = Data(instance.wordVectors.wordIndex, instance.wordVectors.wordVectors.toSeq)
+
+      val wordVectors = instance.wordVectors.getVectors
+      val dataArray = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }.toArray
--- End diff --

No need to convert back to an Array
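For illustration, a minimal standalone sketch of the point (hedged: the
SparkSession, the sample map, and this local `Data` class are placeholders,
not the PR's code): `createDataFrame` accepts a `Seq` of case-class instances
directly, so the trailing `.toArray` adds nothing.

    import org.apache.spark.sql.SparkSession

    case class Data(word: String, vector: Array[Float])

    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    val wordVectors = Map("hello" -> Array(0.1f, 0.2f), "world" -> Array(0.3f, 0.4f))
    // A Seq works as-is; no .toArray round trip is needed.
    val data = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
    spark.createDataFrame(data).show()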





[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-02-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/16607#discussion_r99263525
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -320,14 +340,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {
     private val className = classOf[Word2VecModel].getName
 
     override def load(path: String): Word2VecModel = {
+      val spark = sparkSession
+      import spark.implicits._
+
       val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+      val (major, minor) = VersionUtils.majorMinorVersion(metadata.sparkVersion)
+
       val dataPath = new Path(path, "data").toString
-      val data = sparkSession.read.parquet(dataPath)
-        .select("wordIndex", "wordVectors")
-        .head()
-      val wordIndex = data.getAs[Map[String, Int]](0)
-      val wordVectors = data.getAs[Seq[Float]](1).toArray
-      val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
+
+      val oldModel = if (major.toInt < 2 || (major.toInt == 2 && minor.toInt < 2)) {
--- End diff --

major, minor are already Ints
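A quick sketch of the fix, in the context of Spark's own source (note
`VersionUtils` is a Spark-internal helper): `majorMinorVersion` already
returns an `(Int, Int)` pair, so the `.toInt` calls can simply be dropped.

    import org.apache.spark.util.VersionUtils

    // majorMinorVersion parses "2.1.0" into (2, 1) -- both already Int.
    val (major, minor) = VersionUtils.majorMinorVersion("2.1.0")
    val isPre22Layout = major < 2 || (major == 2 && minor < 2)
    println(s"major=$major, minor=$minor, pre-2.2 layout: $isPre22Layout")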





[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-02-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/16607#discussion_r99259617
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -18,10 +18,9 @@
 package org.apache.spark.ml.feature
 
 import org.apache.hadoop.fs.Path
-
--- End diff --

Keep newline between non-spark and spark imports
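For reference, a sketch of the grouping this convention expects at the top of
the file (an illustrative subset of imports, not the file's full list):
non-Spark third-party imports and org.apache.spark imports form separate
groups with a blank line between them.

    import org.apache.hadoop.fs.Path

    import org.apache.spark.SparkContext
    import org.apache.spark.ml.util.MLWriter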





[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-01-18 Thread Krimit
Github user Krimit commented on a diff in the pull request:

https://github.com/apache/spark/pull/16607#discussion_r96672243
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -320,14 +341,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {
     private val className = classOf[Word2VecModel].getName
 
     override def load(path: String): Word2VecModel = {
+      val spark = sparkSession
+      import spark.implicits._
+
       val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+
       val dataPath = new Path(path, "data").toString
-      val data = sparkSession.read.parquet(dataPath)
-        .select("wordIndex", "wordVectors")
-        .head()
-      val wordIndex = data.getAs[Map[String, Int]](0)
-      val wordVectors = data.getAs[Seq[Float]](1).toArray
-      val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
+      val rawData = spark.read.parquet(dataPath)
+
+      val oldModel = if (rawData.columns.contains("wordIndex")) {
--- End diff --

@jkbradley - Please see https://issues.apache.org/jira/browse/SPARK-15573, where I left a comment on sniffing model versions; curious to hear your opinion. I'll follow the ✨ version pattern if you think it's best.





[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-01-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/16607#discussion_r96524580
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -320,14 +341,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {
     private val className = classOf[Word2VecModel].getName
 
    override def load(path: String): Word2VecModel = {
+      val spark = sparkSession
+      import spark.implicits._
+
       val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+
       val dataPath = new Path(path, "data").toString
-      val data = sparkSession.read.parquet(dataPath)
-        .select("wordIndex", "wordVectors")
-        .head()
-      val wordIndex = data.getAs[Map[String, Int]](0)
-      val wordVectors = data.getAs[Seq[Float]](1).toArray
-      val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
+      val rawData = spark.read.parquet(dataPath)
+
+      val oldModel = if (rawData.columns.contains("wordIndex")) {
--- End diff --

@Krimit You're right that the versioned SaveLoad code was in spark.mllib 
only.  There isn't a standard to follow yet for spark.ml.  I believe that 
relying on the Spark version is currently the best option.





[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-01-16 Thread Krimit
Github user Krimit commented on a diff in the pull request:

https://github.com/apache/spark/pull/16607#discussion_r96328450
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -320,14 +341,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {
     private val className = classOf[Word2VecModel].getName
 
     override def load(path: String): Word2VecModel = {
+      val spark = sparkSession
+      import spark.implicits._
+
       val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+
       val dataPath = new Path(path, "data").toString
-      val data = sparkSession.read.parquet(dataPath)
-        .select("wordIndex", "wordVectors")
-        .head()
-      val wordIndex = data.getAs[Map[String, Int]](0)
-      val wordVectors = data.getAs[Seq[Float]](1).toArray
-      val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
+      val rawData = spark.read.parquet(dataPath)
+
+      val oldModel = if (rawData.columns.contains("wordIndex")) {
--- End diff --

I'd only ever seen ``SaveLoadV1_0`` used in MLlib; is it still the preferred way to mark versions? In ml land I've seen things like relying on the Spark version:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L981
https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L210
https://github.com/apache/spark/blob/7db09abb0168b77697064c69126ee82ca89609a0/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L234

which I don't really like in this case, since it relies on something extraneous and makes it difficult to backport.





[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-01-16 Thread Krimit
Github user Krimit commented on a diff in the pull request:

https://github.com/apache/spark/pull/16607#discussion_r96327575
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -302,16 +303,36 @@ class Word2VecModel private[ml] (
 @Since("1.6.0")
 object Word2VecModel extends MLReadable[Word2VecModel] {
 
+  private case class Data(word: String, vector: Seq[Float])
+
   private[Word2VecModel]
   class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter {
 
-    private case class Data(wordIndex: Map[String, Int], wordVectors: Seq[Float])
-
     override protected def saveImpl(path: String): Unit = {
       DefaultParamsWriter.saveMetadata(instance, path, sc)
-      val data = Data(instance.wordVectors.wordIndex, instance.wordVectors.wordVectors.toSeq)
+
+      val wordVectors = instance.wordVectors.getVectors
+      val dataArray = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
       val dataPath = new Path(path, "data").toString
-      sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+      sparkSession.createDataFrame(dataArray)
+        .repartition(calculateNumberOfPartitions)
+        .write
+        .parquet(dataPath)
+    }
+
+    val FloatSize = 4
+    val AverageWordSize = 15
+    def calculateNumberOfPartitions(): Int = {
+      // [SPARK-11994] - We want to partition the model in partitions smaller than
+      // spark.kryoserializer.buffer.max
+      val bufferSizeInBytes = Utils.byteStringAsBytes(
+        sc.conf.get("spark.kryoserializer.buffer.max", "64m"))
+      // Calculate the approximate size of the model.
+      // Assuming an average word size of 15 bytes, the formula is:
+      // (floatSize * vectorSize + 15) * numWords
+      val numWords = instance.wordVectors.wordIndex.size
+      val approximateSizeInBytes = (FloatSize * instance.getVectorSize + AverageWordSize) * numWords
+      ((approximateSizeInBytes / bufferSizeInBytes) + 1).toInt
--- End diff --

This is basically copied from here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L661-L671. Could you please clarify what you mean by rounding it?





[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-01-16 Thread Krimit
Github user Krimit commented on a diff in the pull request:

https://github.com/apache/spark/pull/16607#discussion_r96327379
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -302,16 +303,36 @@ class Word2VecModel private[ml] (
 @Since("1.6.0")
 object Word2VecModel extends MLReadable[Word2VecModel] {
 
+  private case class Data(word: String, vector: Seq[Float])
+
   private[Word2VecModel]
   class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter {
 
-    private case class Data(wordIndex: Map[String, Int], wordVectors: Seq[Float])
-
     override protected def saveImpl(path: String): Unit = {
       DefaultParamsWriter.saveMetadata(instance, path, sc)
-      val data = Data(instance.wordVectors.wordIndex, instance.wordVectors.wordVectors.toSeq)
+
+      val wordVectors = instance.wordVectors.getVectors
+      val dataArray = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
       val dataPath = new Path(path, "data").toString
-      sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+      sparkSession.createDataFrame(dataArray)
+        .repartition(calculateNumberOfPartitions)
+        .write
+        .parquet(dataPath)
+    }
+
+    val FloatSize = 4
--- End diff --

I was trying to follow the Scala naming conventions for constants (http://docs.scala-lang.org/style/naming-conventions.html), which to my understanding state that constants should be UpperCamelCase. Coming from Java, I was looking for the equivalent of ``private static final float FLOAT_SIZE``. Happy to just use local vals if that's more idiomatic; see the sketch below.
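A small sketch of the two options under discussion (names taken from the
diff; the method signature is simplified for illustration):

    // Option 1: constants as UpperCamelCase members, per the Scala style guide.
    val FloatSize = 4
    val AverageWordSize = 15

    // Option 2: plain camelCase locals inside the method, as suggested.
    def approximateSizeInBytes(vectorSize: Int, numWords: Long): Long = {
      val floatSize = 4
      val averageWordSize = 15
      (floatSize * vectorSize + averageWordSize) * numWords
    }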





[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-01-16 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16607#discussion_r96307400
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -320,14 +341,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {
     private val className = classOf[Word2VecModel].getName
 
     override def load(path: String): Word2VecModel = {
+      val spark = sparkSession
+      import spark.implicits._
+
       val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+
       val dataPath = new Path(path, "data").toString
-      val data = sparkSession.read.parquet(dataPath)
-        .select("wordIndex", "wordVectors")
-        .head()
-      val wordIndex = data.getAs[Map[String, Int]](0)
-      val wordVectors = data.getAs[Seq[Float]](1).toArray
-      val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
+      val rawData = spark.read.parquet(dataPath)
+
+      val oldModel = if (rawData.columns.contains("wordIndex")) {
--- End diff --

Have a look for `SaveLoadV1_0` elsewhere in the code. I think there's a different standard approach to versioning. I am not so familiar with it, but you can see who wrote it with git. Maybe @jkbradley?
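Roughly, the spark.mllib pattern being referenced keeps each on-disk layout in
its own versioned object and dispatches on a format version stored with the
metadata. A hedged, self-contained sketch of the shape (names and signatures
here are illustrative, not the actual mllib source):

    object ModelLoader {
      private object SaveLoadV1_0 {
        val thisFormatVersion = "1.0"
        def load(path: String): String = s"loaded v1.0 model from $path"
      }

      // Dispatch on the format version recorded alongside the saved model.
      def load(path: String, formatVersion: String): String =
        if (formatVersion == SaveLoadV1_0.thisFormatVersion) SaveLoadV1_0.load(path)
        else throw new IllegalArgumentException(s"unknown format version: $formatVersion")
    }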





[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-01-16 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16607#discussion_r96307257
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -302,16 +303,36 @@ class Word2VecModel private[ml] (
 @Since("1.6.0")
 object Word2VecModel extends MLReadable[Word2VecModel] {
 
+  private case class Data(word: String, vector: Seq[Float])
+
   private[Word2VecModel]
   class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter {
 
-    private case class Data(wordIndex: Map[String, Int], wordVectors: Seq[Float])
-
     override protected def saveImpl(path: String): Unit = {
       DefaultParamsWriter.saveMetadata(instance, path, sc)
-      val data = Data(instance.wordVectors.wordIndex, instance.wordVectors.wordVectors.toSeq)
+
+      val wordVectors = instance.wordVectors.getVectors
+      val dataArray = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
       val dataPath = new Path(path, "data").toString
-      sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+      sparkSession.createDataFrame(dataArray)
+        .repartition(calculateNumberOfPartitions)
+        .write
+        .parquet(dataPath)
+    }
+
+    val FloatSize = 4
+    val AverageWordSize = 15
+    def calculateNumberOfPartitions(): Int = {
+      // [SPARK-11994] - We want to partition the model in partitions smaller than
+      // spark.kryoserializer.buffer.max
+      val bufferSizeInBytes = Utils.byteStringAsBytes(
+        sc.conf.get("spark.kryoserializer.buffer.max", "64m"))
+      // Calculate the approximate size of the model.
+      // Assuming an average word size of 15 bytes, the formula is:
+      // (floatSize * vectorSize + 15) * numWords
+      val numWords = instance.wordVectors.wordIndex.size
+      val approximateSizeInBytes = (FloatSize * instance.getVectorSize + AverageWordSize) * numWords
+      ((approximateSizeInBytes / bufferSizeInBytes) + 1).toInt
--- End diff --

Just round it?
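Sketching the suggestion as a standalone snippet (the sizes are illustrative;
only the last line changes relative to the diff): math.ceil expresses the
round-up in one step instead of integer-divide-plus-one.

    val bufferSizeInBytes = 64L * 1024 * 1024  // spark.kryoserializer.buffer.max default
    val floatSize = 4
    val averageWordSize = 15
    val vectorSize = 300
    val numWords = 3000000L
    val approximateSizeInBytes = (floatSize * vectorSize + averageWordSize) * numWords
    // Replaces ((approximateSizeInBytes / bufferSizeInBytes) + 1).toInt:
    val numPartitions = math.ceil(approximateSizeInBytes / bufferSizeInBytes.toDouble).toInt
    println(numPartitions)

The two forms differ only when the model size is an exact multiple of the
buffer size, where ceil avoids allocating one partition too many.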





[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-01-16 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16607#discussion_r96307241
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -302,16 +303,36 @@ class Word2VecModel private[ml] (
 @Since("1.6.0")
 object Word2VecModel extends MLReadable[Word2VecModel] {
 
+  private case class Data(word: String, vector: Seq[Float])
+
   private[Word2VecModel]
   class Word2VecModelWriter(instance: Word2VecModel) extends MLWriter {
 
-    private case class Data(wordIndex: Map[String, Int], wordVectors: Seq[Float])
-
     override protected def saveImpl(path: String): Unit = {
       DefaultParamsWriter.saveMetadata(instance, path, sc)
-      val data = Data(instance.wordVectors.wordIndex, instance.wordVectors.wordVectors.toSeq)
+
+      val wordVectors = instance.wordVectors.getVectors
+      val dataArray = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
       val dataPath = new Path(path, "data").toString
-      sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+      sparkSession.createDataFrame(dataArray)
+        .repartition(calculateNumberOfPartitions)
+        .write
+        .parquet(dataPath)
+    }
+
+    val FloatSize = 4
--- End diff --

Nit: camelCase here, like floatSize.  These can be local variables?





[GitHub] spark pull request #16607: [SPARK-19247][ML] Save large word2vec models

2017-01-16 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16607#discussion_r96307444
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -320,14 +341,29 @@ object Word2VecModel extends MLReadable[Word2VecModel] {
     private val className = classOf[Word2VecModel].getName
 
     override def load(path: String): Word2VecModel = {
+      val spark = sparkSession
+      import spark.implicits._
+
       val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+
       val dataPath = new Path(path, "data").toString
-      val data = sparkSession.read.parquet(dataPath)
-        .select("wordIndex", "wordVectors")
-        .head()
-      val wordIndex = data.getAs[Map[String, Int]](0)
-      val wordVectors = data.getAs[Seq[Float]](1).toArray
-      val oldModel = new feature.Word2VecModel(wordIndex, wordVectors)
+      val rawData = spark.read.parquet(dataPath)
+
+      val oldModel = if (rawData.columns.contains("wordIndex")) {
+        val data = rawData
+          .select("wordIndex", "wordVectors")
+          .head()
+        val wordIndex = data.getAs[Map[String, Int]](0)
+        val wordVectors = data.getAs[Seq[Float]](1).toArray
+        new feature.Word2VecModel(wordIndex, wordVectors)
+      } else {
+        val wordVectorsMap: Map[String, Array[Float]] = rawData.as[Data]
--- End diff --

Type isn't needed here, nor in general on locals
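A self-contained illustration of the point (the sample data is made up; this
local `Data` class mirrors the one in the diff): the annotation on the local
duplicates what the compiler already infers from the right-hand side.

    case class Data(word: String, vector: Array[Float])

    val rows = Seq(Data("hello", Array(0.1f)), Data("world", Array(0.2f)))

    // Explicit annotation, as in the diff:
    val annotated: Map[String, Array[Float]] = rows.map(d => (d.word, d.vector)).toMap
    // Inferred, as the review suggests -- same type, less noise:
    val inferred = rows.map(d => (d.word, d.vector)).toMap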

