[GitHub] spark pull request #14112: [SPARK-16240][ML] Model loading backward compatib...

2016-09-07 Thread GayathriMurali
Github user GayathriMurali closed the pull request at:

https://github.com/apache/spark/pull/14112


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-09-07 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/14112
  
@jkbradley Sure! Thanks. 





[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-09-06 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/14112
  
@jkbradley I am so sorry I couldn't respond to this in time! I am in a 
transition and might not be able to drive this JIRA to completion at 
this point. Thanks!





[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-08-10 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/14112
  
@jkbradley Can you please help review this? 





[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-07-26 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/14112
  
@jkbradley Please let me know if I can do anything to help get this merged





[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-07-19 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/14112
  
@jkbradley Can you please help review this? 





[GitHub] spark pull request #14112: [SPARK-16240][ML] Model loading backward compatib...

2016-07-18 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/14112#discussion_r71172720
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -728,16 +755,40 @@ object DistributedLDAModel extends MLReadable[DistributedLDAModel] {
 private val className = classOf[DistributedLDAModel].getName
 
 override def load(path: String): DistributedLDAModel = {
+
+  // Pattern to recognize the Spark version that wrote the metadata
+  val pattern = """\d+\.\d+(\.\d+)?(-SNAPSHOT)?""".r
+
   val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
   val modelPath = new Path(path, "oldModel").toString
   val oldModel = OldDistributedLDAModel.load(sc, modelPath)
-  val model = new DistributedLDAModel(
-metadata.uid, oldModel.vocabSize, oldModel, sparkSession, None)
-  DefaultParamsReader.getAndSetParams(model, metadata)
+  val model = new DistributedLDAModel(metadata.uid, oldModel.vocabSize,
+oldModel, sparkSession, None)
+  metadata.sparkVersion match {
+case "1.6" =>
+  implicit val format = DefaultFormats
+  metadata.params match {
+case JObject(pairs) =>
+  pairs.foreach { case (paramName, jsonValue) =>
+val origParam =
+  if (paramName == "topicDistribution") "topicDistributionCol" else paramName
+val param = model.getParam(origParam)
+val value = param.jsonDecode(compact(render(jsonValue)))
+model.set(param, value)
+  }
+case _ =>
+  throw new IllegalArgumentException(
+s"Cannot recognize JSON metadata: ${metadata.metadataJson}.")
+  }
+case pattern(_*) =>
+  DefaultParamsReader.getAndSetParams(model, metadata)
+  }
--- End diff --

@hhbyyh Thank you! I can maybe write this as a separate function. 
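The version-gated loading in the diff above can be sketched outside Spark. The following is a hypothetical standalone Python analogue (the function and table names are illustrative, not Spark's API): it matches the saved Spark version against a version regex and remaps the legacy `topicDistribution` param name to `topicDistributionCol` for 1.6-era metadata.

```python
import re

# Version strings written by post-1.6 writers, e.g. "2.0.0" or "2.1.0-SNAPSHOT".
VERSION_PATTERN = re.compile(r"\d+\.\d+(\.\d+)?(-SNAPSHOT)?")

# Hypothetical rename table: params whose names changed after Spark 1.6.
LEGACY_PARAM_RENAMES = {"topicDistribution": "topicDistributionCol"}

def decode_params(spark_version, params):
    """Return a {currentParamName: value} dict from saved metadata params."""
    if spark_version == "1.6":
        # Legacy metadata: remap old param names to their current equivalents.
        return {LEGACY_PARAM_RENAMES.get(name, name): value
                for name, value in params.items()}
    elif VERSION_PATTERN.fullmatch(spark_version):
        # Recognized modern version: param names are already current.
        return dict(params)
    raise ValueError("Cannot recognize Spark version: " + spark_version)

print(decode_params("1.6", {"topicDistribution": "topics", "k": 10}))
# {'topicDistributionCol': 'topics', 'k': 10}
```

Matching the whole string (`fullmatch` here, or Scala's `Regex.unapplySeq` in a `case pattern(_*) =>`) avoids the pitfall of a bare lowercase `case pattern =>`, which Scala treats as a fresh variable binding that matches anything.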





[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-07-14 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/14112
  
@jkbradley I implemented the model loading logic for DistributedLDAModel as 
well. I am using a version regex for robustness in version checking. Using 
`as[Data].head()` produces a Scala match error on Jenkins but not when I 
run locally on my machine. For now I removed that and am using 
`.select().head()`, and all test cases pass on Jenkins. I am investigating 
why it fails. Meanwhile, it would be great if you could leave your comments 
on the recent commits.





[GitHub] spark pull request #14112: [SPARK-16240][ML] Model loading backward compatib...

2016-07-13 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/14112#discussion_r70749752
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -566,26 +565,52 @@ object LocalLDAModel extends MLReadable[LocalLDAModel] {
 }
   }
 
+  private case class Data(
+  vocabSize: Int,
+  topicsMatrix: Matrix,
+  docConcentration: Vector,
+  topicConcentration: Double,
+  gammaShape: Double)
+
   private class LocalLDAModelReader extends MLReader[LocalLDAModel] {
 
 private val className = classOf[LocalLDAModel].getName
 
 override def load(path: String): LocalLDAModel = {
+  // Import implicits for the Dataset encoder
+  val sparkSession = super.sparkSession
+  import sparkSession.implicits._
+
   val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
   val dataPath = new Path(path, "data").toString
   val data = sparkSession.read.parquet(dataPath)
-.select("vocabSize", "topicsMatrix", "docConcentration", "topicConcentration", "gammaShape")
-.head()
-  val vocabSize = data.getAs[Int](0)
-  val topicsMatrix = data.getAs[Matrix](1)
-  val docConcentration = data.getAs[Vector](2)
-  val topicConcentration = data.getAs[Double](3)
-  val gammaShape = data.getAs[Double](4)
+  val vectorConverted = MLUtils.convertVectorColumnsToML(data, "docConcentration")
+  val Row(vocabSize: Int, topicsMatrix: Matrix, docConcentration: Vector,
--- End diff --

It worked when I ran the unit tests locally, but fails here on Jenkins. 
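The failure mode described here is consistent with a runtime type mismatch, which is presumably why the diff converts the `docConcentration` column with `MLUtils.convertVectorColumnsToML` before the typed `Row(...)` extraction. A loose, hypothetical Python analogue of that convert-then-destructure pattern (all class and function names below are illustrative, not Spark's API):

```python
from dataclasses import dataclass

# Stand-ins for the old (mllib) and new (ml) vector types; names are illustrative.
@dataclass
class OldVector:
    values: list

@dataclass
class NewVector:
    values: list

def convert_vector_columns(row, columns):
    """Convert old-style vector columns to the new type before typed extraction."""
    return {name: NewVector(value.values)
            if name in columns and isinstance(value, OldVector) else value
            for name, value in row.items()}

def extract(row):
    # Typed destructuring: fail loudly if the runtime type is not the one the
    # pattern expects, mirroring the Scala MatchError on the Row.
    doc = row["docConcentration"]
    if not isinstance(doc, NewVector):
        raise TypeError("match error: expected NewVector for docConcentration")
    return row["vocabSize"], doc

raw = {"vocabSize": 10, "docConcentration": OldVector([0.1, 0.2])}
vocab, doc = extract(convert_vector_columns(raw, {"docConcentration"}))
print(vocab, doc.values)  # 10 [0.1, 0.2]
```

Converting first makes the typed extraction independent of which vector type happened to be persisted, which is one way a test can pass locally (new type on disk) yet fail on another machine (old type on disk).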





[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-07-13 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/14112
  
@jkbradley I am sorry, I have been held up with something else. I am 
looking into ways to add this to the DistributedLDAModel. I will have 
something by EOD today. 





[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-07-11 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/14112
  
+1 for separate loading logic. The recent commit includes separate code 
paths depending on sparkVersion.





[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-07-11 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/14112
  
@hhbyyh Thanks for helping out. The updated commit includes logic to handle 
topicDistributionCol. @yanboliang 





[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-07-09 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/14112
  
retest this 





[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-07-08 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/14112
  
@hhbyyh Can you please help review? I am not sure if this is the right way 
to do it, as topicDistributionCol is not included in the MLWriter or load. 





[GitHub] spark pull request #14112: [SPARK-16240][ML] Model loading backward compatib...

2016-07-08 Thread GayathriMurali
GitHub user GayathriMurali opened a pull request:

https://github.com/apache/spark/pull/14112

[SPARK-16240][ML] Model loading backward compatibility for LDA

## What changes were proposed in this pull request?
LDA model loading backward compatibility

## How was this patch tested?

Existing UT 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GayathriMurali/spark SPARK-16240

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14112.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14112


commit 880c3a1bfa67101835a5965c65f9f8942e95be35
Author: GayathriMurali 
Date:   2016-07-09T06:17:22Z

[SPARK-16240][ML] Model loading backward compatibility for LDA







[GitHub] spark pull request #13745: [Spark-15997][DOC][ML] Update user guide for Hash...

2016-06-23 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13745#discussion_r68245367
  
--- Diff: examples/src/main/python/ml/quantile_discretizer_example.py ---
@@ -29,11 +29,12 @@
 # $example on$
 data = [(0, 18.0,), (1, 19.0,), (2, 8.0,), (3, 5.0,), (4, 2.2,)]
 dataFrame = spark.createDataFrame(data, ["id", "hour"])
-
-# Note that we compute exact quantiles here by setting `relativeError` to 0 for
-# illustrative purposes, however in most cases the default parameter value should suffice
-discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result",
-  relativeError=0)
+# $example off$
+# The output of QuantileDiscretizer for such small datasets differs with the
+# number of underlying cores. Allocating a single partition for the dataframe
+# helps produce consistent results.
+.repartition(1)
--- End diff --

@MLnick I ran all the unit tests and also tested manually. It works fine. 
But I guess writing it on the next line makes it look better. I will modify 
that. 





[GitHub] spark issue #13745: [Spark-15997][DOC][ML] Update user guide for HashingTF, ...

2016-06-22 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13745
  
Oops! That works. 





[GitHub] spark issue #13745: [Spark-15997][DOC][ML] Update user guide for HashingTF, ...

2016-06-22 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13745
  
@jkbradley Yes, that works





[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-06-22 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13176
  
@jkbradley @MLnick  My bad. Sorry about that!





[GitHub] spark issue #13745: [Spark-15997][DOC][ML] Update user guide for HashingTF, ...

2016-06-22 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13745
  
@jkbradley @MLnick `repartition` needs to be chained onto the creation of the 
dataframe, like this: 
`val df = spark.createDataFrame(data).toDF("id", "hour").repartition(1)`. 
Since `df` is a `val`, we cannot hide this statement. I could make `df` a 
mutable `var`, but that would seem inconsistent. Am I missing something here? 





[GitHub] spark pull request #12675: [SPARK-14894][PySpark] Add result summary api to ...

2016-06-21 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/12675#discussion_r67954415
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1070,6 +1070,21 @@ def test_logistic_regression_summary(self):
 sameSummary = model.evaluate(df)
 self.assertAlmostEqual(sameSummary.areaUnderROC, s.areaUnderROC)
 
+def test_gaussian_mixture_summary(self):
+from pyspark.mllib.linalg import Vectors
+df = self.spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
--- End diff --

Please let me know if it's OK to load data from a file when all the other 
test cases use hard-coded data values. I tried with a sparse vector, and 
`fit` gave me an error about the data format. 





[GitHub] spark issue #12675: [SPARK-14894][PySpark] Add result summary api to Gaussia...

2016-06-21 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/12675
  
@MLnick It would be great if you can help review this. 





[GitHub] spark issue #13745: [Spark-15997][DOC][ML] Update user guide for HashingTF, ...

2016-06-21 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13745
  
@jkbradley @MLnick I agree with the repartition idea, although I think it 
may not be a bad idea to call out that the approxQuantile calculation for 
smaller datasets may differ across machines depending on the underlying 
cores available, and to leave the example and code as is. Please let me know 
what's best and I can change the documentation accordingly. 
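For context, the exact tertile splits of the example's five `hour` values can be computed deterministically. The sketch below uses Python's `statistics.quantiles` as a rough stand-in for the exact quantiles that `relativeError = 0` on a single partition is intended to pin down; it is not Spark's approxQuantile algorithm:

```python
import statistics

# The "hour" values from the quantile_discretizer_example dataset.
hours = [18.0, 19.0, 8.0, 5.0, 2.2]

# Exact 3-quantile (tertile) cut points; numBuckets=3 needs two interior splits.
splits = statistics.quantiles(hours, n=3)
print(splits)  # [5.0, 18.0]

# Bucket each value by counting the splits it is >= (half-open buckets, illustrative).
buckets = [sum(h >= s for s in splits) for h in hours]
print(buckets)  # [2, 2, 1, 1, 0]
```

Any approximate, partition-wise computation of these cut points can land on different values for such a tiny dataset, which is the nondeterminism `repartition(1)` sidesteps.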





[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-06-18 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13176
  
@MLnick I opened PR #13745 to track this, as @jkbradley suggested. This JIRA 
only covers a partial list of the ml.feature audit. Please help review SPARK-15997.





[GitHub] spark pull request #13745: [Spark 15997][DOC][ML] Update user guide for Hash...

2016-06-17 Thread GayathriMurali
GitHub user GayathriMurali opened a pull request:

https://github.com/apache/spark/pull/13745

[SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileDiscretizer 
and CountVectorizer

## What changes were proposed in this pull request?

Made changes to HashingTF, QuantileDiscretizer and CountVectorizer


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GayathriMurali/spark SPARK-15997

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13745.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13745


commit c9ea8cb5e2c3d1b92ebe5e5e97733d43abeda6f9
Author: GayathriMurali 
Date:   2016-05-20T21:28:55Z

User guide changes to CountVectorizer, QuantileDiscretizer and HashingTF

commit 2d43f673bc2405f170082102104048d58a617a40
Author: GayathriMurali 
Date:   2016-05-20T21:43:11Z

Review comments

commit b12dea1e6c5928840a8e8b8404afa3761dc8e5cd
Author: GayathriMurali 
Date:   2016-05-24T01:37:30Z

Review comments

commit 14b804a9b34bb6c6eab43fbbd13d4d4f0a0b4b26
Author: GayathriMurali 
Date:   2016-05-24T21:26:53Z

Review comments

commit 3e44aa8310a9088eec50956b79d5ec7a27725d5f
Author: GayathriMurali 
Date:   2016-05-26T02:16:24Z

Review comments

commit 015f54ad9d464c8fcfa1681085f199416653cec5
Author: GayathriMurali 
Date:   2016-05-26T18:54:39Z

Review Comments

commit ef9dfa220d55ba598dcfec3747e445b17dd265e0
Author: GayathriMurali 
Date:   2016-05-31T23:11:23Z

Fixing QuantileDiscretizer doc and example

commit 65f9421a8bc281c10921613a57900984122bf1ae
Author: GayathriMurali 
Date:   2016-06-03T03:06:43Z

Including relativeError in all examples with a note

commit 563d65df1569762a9073c64b0f11d0a60b18508b
Author: GayathriMurali 
Date:   2016-06-03T04:33:29Z

Fixed python style issue

commit 695aebe960ba25547014eeab10a8ba34dd1249ff
Author: GayathriMurali 
Date:   2016-06-03T21:19:34Z

Review comments

commit 01e4a08b3a6154a0e04391fa8299821894109bb3
Author: GayathriMurali 
Date:   2016-06-10T22:56:37Z

Remove default value inclusion







[GitHub] spark pull request #13176: [SPARK-15997][DOC] Modified user guide and exampl...

2016-06-17 Thread GayathriMurali
Github user GayathriMurali closed the pull request at:

https://github.com/apache/spark/pull/13176





[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-06-16 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13176
  
@jkbradley @MLnick I have created SPARK-15997 to track the changes 
addressed in this PR. 





[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-06-16 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13176
  
@jkbradley I just tried this:

https://cloud.githubusercontent.com/assets/7002441/16128207/94f835ea-33b4-11e6-9866-369672b7bdae.png

and I am getting this output, which is the same as the one in the example:

https://cloud.githubusercontent.com/assets/7002441/16128258/cf80114c-33b4-11e6-9c8e-34d553cf5c39.png

I will create a new JIRA and link this PR to it. Thanks!






[GitHub] spark issue #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-16 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13285
  
@jkbradley I fixed the review comment. Please let me know if there is 
anything else. Thanks!





[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-06-15 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13176
  
@jkbradley The different results were due to a difference in the underlying 
core count (thread count). @MLnick and I were able to get the same results 
with `local[4]`. We could explicitly specify this in the example and get rid 
of `relativeError = 0`. 





[GitHub] spark issue #12675: [SPARK-14894][PySpark] Add result summary api to Gaussia...

2016-06-14 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/12675
  
@jkbradley This PR has been open for more than 30 days. Can you please help review? 





[GitHub] spark issue #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-14 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13285
  
@yanboliang Please let me know if there is anything else I can do to help 
get this merged. Thanks!





[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-06-14 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13176
  
@MLnick Please let me know if there is anything else I can do to help get 
this merged. Thanks!





[GitHub] spark issue #12675: [SPARK-14894][PySpark] Add result summary api to Gaussia...

2016-06-07 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/12675
  
@jkbradley @holdenk Can you please help review?





[GitHub] spark issue #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-06 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13285
  
@yanboliang Please let me know if there is anything else I can do to get 
this merged.





[GitHub] spark pull request #13176: [SPARK-15100][DOC] Modified user guide and exampl...

2016-06-06 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r66011880
  
--- Diff: docs/ml-features.md ---
@@ -1092,14 +1095,11 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
-categorical features.
-The bin ranges are chosen by taking a sample of the data and dividing it 
into roughly equal parts.
-The lower and upper bin bounds will be `-Infinity` and `+Infinity`, 
covering all real values.
-This attempts to find `numBuckets` partitions based on a sample of the 
given input data, but it may
-find fewer depending on the data sample values.
-
-Note that the result may be different every time you run it, since the 
sample strategy behind it is
-non-deterministic.
+categorical features. The number of bins is set by the `numBuckets` 
parameter.
+The bin ranges are chosen using an approximate algorithm (see the 
documentation for 
[approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
+for a detailed description). The precision of the approximation can be 
controlled with the
+`relativeError` parameter. When set to zero, exact quantiles are 
calculated (**Note:** Computing exact quantiles is an expensive operation). The 
default value of `relativeError` is 0.01.
--- End diff --

@MLnick What do you think?
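The approximate-quantile binning described in the quoted doc diff can be sketched outside Spark. This is a hedged illustration, not Spark's implementation: Spark uses the Greenwald-Khanna approximate algorithm behind `approxQuantile`, while this sketch computes what `QuantileDiscretizer` would produce in the exact case (`relativeError = 0`), including the `-Infinity`/`+Infinity` outer bounds.

```python
def exact_quantile_splits(values, num_buckets):
    """Return split points for `num_buckets` quantile bins (exact case)."""
    ordered = sorted(values)
    n = len(ordered)
    splits = []
    for k in range(1, num_buckets):
        # index of the k-th quantile boundary (nearest-rank style)
        idx = int(k * n / num_buckets)
        splits.append(ordered[min(idx, n - 1)])
    # Spark extends the outermost bins to -Infinity / +Infinity
    return [float("-inf")] + splits + [float("inf")]

def bucketize(value, splits):
    """Map a value to the index of the bin whose range contains it."""
    for i in range(len(splits) - 1):
        if splits[i] <= value < splits[i + 1]:
            return i
    return len(splits) - 2

splits = exact_quantile_splits([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3)
print(splits)                 # interior splits chosen from the sorted data
print(bucketize(2.5, splits))
```

With a nonzero `relativeError`, the chosen splits may drift from these exact boundaries, which is the precision/cost trade-off the doc text describes.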





[GitHub] spark pull request #13176: [SPARK-15100][DOC] Modified user guide and exampl...

2016-06-03 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r65790125
  
--- Diff: docs/ml-features.md ---
@@ -1092,14 +1095,11 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
-categorical features.
-The bin ranges are chosen by taking a sample of the data and dividing it 
into roughly equal parts.
-The lower and upper bin bounds will be `-Infinity` and `+Infinity`, 
covering all real values.
-This attempts to find `numBuckets` partitions based on a sample of the 
given input data, but it may
-find fewer depending on the data sample values.
-
-Note that the result may be different every time you run it, since the 
sample strategy behind it is
-non-deterministic.
+categorical features. The number of bins is set by the `numBuckets` 
parameter.
+The bin ranges are chosen using an approximate algorithm (see the 
documentation for 
[approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
+for a detailed description). The precision of the approximation can be 
controlled with the
+`relativeError` parameter. When set to zero, exact quantiles are 
calculated (**Note:** Computing exact quantiles is an expensive operation). The 
default value of `relativeError` is 0.01.
--- End diff --

@MLnick I specified the default value because in the example we say "however 
in most cases the default parameter value should suffice", and not mentioning 
the default value wouldn't make much sense. 





[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-06-02 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13176
  
@MLnick I agree. Should I make those changes in this same PR?





[GitHub] spark issue #13176: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-06-02 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13176
  
@MLnick Please let me know if there is anything else I can do to help with 
this PR.





[GitHub] spark issue #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-01 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13285
  
Also, #10219 uses include_example with different files, which is not the 
case here. @mengxr We need support for tags with include_example, or we need to 
reformat ml.R (or split every example into a different file) to be used here. I 
can create a JIRA and work on it, if this makes sense. 






[GitHub] spark issue #13285: [Spark-15129][R][DOC]R API changes in ML

2016-06-01 Thread GayathriMurali
Github user GayathriMurali commented on the issue:

https://github.com/apache/spark/pull/13285
  
@yanboliang `$example on$` and `$example off$` need to be included in 
ml.R. All the code enclosed between example on and off would be joined and a 
single code block produced in the html. The mechanism is used to keep comments 
and other cleanup code from appearing in the examples. It is not possible to 
label the markers and select a certain label. I could add example on and off at the 
beginning and end of ml.R, or we need to rewrite ml.R so that certain portions 
like creating a DF remain common for all models. What do you think? 





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-05-31 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176
  
@MLnick +1 for making the change in the example as well. Calling out the 
difference in results due to parallelism might be a little confusing in this 
document. 





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-05-31 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176
  
I just tried with `--master local[8]` and I get the same results as you do. 
Should I call this out in the example? 





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-05-31 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176
  
I just did. It is `local[4]`.





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-05-31 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176
  
@MLnick I am using `local`. I haven't explicitly set up the thread count. 





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-05-31 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176
  
On Mac. Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 
1.8.0_73). I checked again and I consistently get the same output on master. 
@MLnick Please let me know how you would like to proceed. Should I go ahead and 
change the example in the doc and investigate further on my end?





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-05-31 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176
  
@BryanCutler @oliverpierson Looks like something is wrong on my side. I 
just checked again on a fresh build and got the same results. Will dig deeper.





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-05-31 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176
  
I get this: Array[Double] = Array(5.0, 8.0)





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-05-31 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176
  
@MLnick @oliverpierson I checked again with a clean build off master. Here 
is the hash: 2bfc4f15214a870b3e067f06f37eb506b0070a1f. Here is what I see:

https://cloud.githubusercontent.com/assets/7002441/15684116/738724e4-271a-11e6-9e42-a80fdbc11bc1.png






[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and examples for ...

2016-05-31 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r65223909
  
--- Diff: docs/ml-features.md ---
@@ -145,9 +148,11 @@ for more details on the API.
  passed to other algorithms like LDA.
 
  During the fitting process, `CountVectorizer` will select the top 
`vocabSize` words ordered by
- term frequency across the corpus. An optional parameter "minDF" also 
affects the fitting process
+ term frequency across the corpus. An optional parameter `minDF` also 
affects the fitting process
  by specifying the minimum number (or fraction if < 1.0) of documents a 
term must appear in to be
- included in the vocabulary.
+ included in the vocabulary. Another optional binary toggle parameter 
controls the output vector.
--- End diff --

2bfc4f15214a870b3e067f06f37eb506b0070a1f - Commit off master





[GitHub] spark pull request: [Spark-15129][R][DOC]R API changes in ML

2016-05-31 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13285
  
@yanboliang I have included ml.R using include_example; wouldn't that cover 
all the examples? 





[GitHub] spark pull request: [Spark-15129][R][DOC]R API changes in ML

2016-05-31 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13285#discussion_r65221434
  
--- Diff: docs/sparkr.md ---
@@ -285,71 +285,57 @@ head(teenagers)
 
 # Machine Learning
 
-SparkR allows the fitting of generalized linear models over DataFrames 
using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib 
to train a model of the specified family. Currently the gaussian and binomial 
families are supported. We support a subset of the available R formula 
operators for model fitting, including '~', '.', ':', '+', and '-'.
+SparkR supports the following Machine Learning algorithms.
 
-The [summary()](api/R/summary.html) function gives the summary of a model 
produced by [glm()](api/R/glm.html).
+* Generalized Linear Regression Model [glm()](api/R/glm.html)
+* Naive Bayes [naiveBayes()](api/R/naiveBayes.html)
+* KMeans [kmeans()](api/R/kmeans.html)
+* AFT Survival Regression [survreg()](api/R/survreg.html)
 
-* For gaussian GLM model, it returns a list with 'devianceResiduals' and 
'coefficients' components. The 'devianceResiduals' gives the min/max deviance 
residuals of the estimation; the 'coefficients' gives the estimated 
coefficients and their estimated standard errors, t values and p-values. (It 
only available when model fitted by normal solver.)
-* For binomial GLM model, it returns a list with 'coefficients' component 
which gives the estimated coefficients.
+Under the hood, SparkR uses MLlib to train a model of the specified 
family. Currently the gaussian, binomial, Poisson and Gamma families are 
supported. We support a subset of the available R formula operators for model 
fitting, including '~', '.', ':', '+', and '-'.
 
-The examples below show the use of building gaussian GLM model and 
binomial GLM model using SparkR.
+The [summary()](api/R/summary.html) function gives the summary of a model 
produced by different algorithms listed above.
+This summary is same as the result of summary() function in R.
 
-## Gaussian GLM model
+## Model persistence
 
-
-{% highlight r %}
-# Create the DataFrame
-df <- createDataFrame(sqlContext, iris)
+* write.ml allows users to save a fitted model in a given input path
+* read.ml allows users to read/load the model which was saved using 
write.ml
+
+Model persistence is supported for all Machine Learning algorithms for all 
families.
 
-# Fit a gaussian GLM model over the dataset.
-model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = 
"gaussian")
+The examples below show the use of building Gaussian GLM, NaiveBayes, 
kMeans and AFTSurvivalReg using SparkR
 
+{% include_example r/ml.r %}
+
+# GLM Summary() Result
+
+Here is an example of the output from the summary() function for GLM
+
+{% highlight r %}
 # Model summary are returned in a similar format to R's native glm().
 summary(model)
-##$devianceResiduals
-## Min   Max 
-## -1.307112 1.412532
-##
-##$coefficients
-##   Estimate  Std. Error t value  Pr(>|t|)
-##(Intercept)2.251393  0.3697543  6.08889  9.568102e-09
-##Sepal_Width0.8035609 0.106339   7.556598 4.187317e-12
-##Species_versicolor 1.458743  0.1121079  13.01195 0   
-##Species_virginica  1.946817  0.100015   19.46525 0   
-
-# Make predictions based on the model.
-predictions <- predict(model, newData = df)
-head(select(predictions, "Sepal_Length", "prediction"))
-##  Sepal_Length prediction
-##1  5.1   5.063856
-##2  4.9   4.662076
-##3  4.7   4.822788
-##4  4.6   4.742432
-##5  5.0   5.144212
-##6  5.4   5.385281
-{% endhighlight %}
-
+##Deviance Residuals:
--- End diff --

+1 Since summary output is different for different models, it makes sense 
to remove it. I will go ahead and remove. 
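The "Estimate" column of the glm()/summary() table quoted in the diff above can be illustrated outside SparkR. This is a hedged sketch, not SparkR's implementation: a closed-form ordinary-least-squares fit for a gaussian family with a single predictor, showing where the coefficient estimates come from.

```python
def ols_fit(x, y):
    """Return (intercept, slope) minimizing the squared error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# y = 1 + 2x exactly, so the fit recovers intercept 1 and slope 2
intercept, slope = ols_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(intercept, slope)
```

The std. error, t-value, and p-value columns of the summary table are derived from the residuals of this same fit; they are omitted here for brevity.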





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-29 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176#issuecomment-222409058
  
@MLnick Please let me know if there is anything else I can do to help with 
this PR.





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-26 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r64799207
  
--- Diff: docs/ml-features.md ---
@@ -145,9 +148,11 @@ for more details on the API.
  passed to other algorithms like LDA.
 
  During the fitting process, `CountVectorizer` will select the top 
`vocabSize` words ordered by
- term frequency across the corpus. An optional parameter "minDF" also 
affects the fitting process
+ term frequency across the corpus. An optional parameter `minDF` also 
affects the fitting process
  by specifying the minimum number (or fraction if < 1.0) of documents a 
term must appear in to be
- included in the vocabulary.
+ included in the vocabulary. Another optional binary toggle parameter 
controls the output vector.
--- End diff --

@MLnick I am sorry. I did see the email alert, but I was not able to find 
the comment here. I am addressing it now.

I am assuming you mean "This is especially useful for discrete 
probabilistic models that model binary, rather than integer, counts." to be 
consistent in both HashingTF and CountVectorizer. The other details, like term 
frequencies, are different for CountVectorizer (output vector).
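The binary toggle being documented in this diff can be sketched in a few lines. This is a hedged illustration, not Spark's CountVectorizer: it picks the top `vocabSize` terms by corpus-wide frequency (as the doc text describes) and shows how the binary toggle switches the output from counts to 0/1 presence indicators.

```python
from collections import Counter

def fit_vocabulary(corpus, vocab_size):
    """Pick the top `vocab_size` terms by frequency across the corpus."""
    freq = Counter(term for doc in corpus for term in doc)
    return [term for term, _ in freq.most_common(vocab_size)]

def transform(doc, vocab, binary=False):
    """Vectorize one document; `binary=True` emits presence flags."""
    counts = Counter(doc)
    if binary:
        return [1 if counts[t] > 0 else 0 for t in vocab]
    return [counts[t] for t in vocab]

corpus = [["a", "b", "a"], ["a", "c"]]
vocab = fit_vocabulary(corpus, 3)          # "a" is the most frequent term
print(transform(["a", "b", "a"], vocab))               # term counts
print(transform(["a", "b", "a"], vocab, binary=True))  # presence flags
```

The binary output is what the doc text recommends for discrete probabilistic models that model binary, rather than integer, counts.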





[GitHub] spark pull request: [Spark-15129][R][DOC]R API changes in ML

2016-05-26 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13285#issuecomment-221954150
  
@yanboliang Can you please help review?





[GitHub] spark pull request: [Spark-15129][R][DOC][WIP]R API changes in ML

2016-05-25 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13285#issuecomment-221764817
  
@yanboliang Thanks, that's a good idea. However, that would just include 
the example code and not what the output of summary() looks like. It might be useful 
to include that.





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-25 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r64683245
  
--- Diff: docs/ml-features.md ---
@@ -53,7 +53,10 @@ collisions, where different raw features may become the 
same term after hashing.
 chance of collision, we can increase the target feature dimension, i.e. 
the number of buckets 
 of the hash table. Since a simple modulo is used to transform the hash 
function to a column index, 
--- End diff --

@yanboliang I am neutral about adding this.
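The hash-then-modulo column indexing mentioned in this diff can be sketched as follows. This is a hedged illustration, not Spark's HashingTF (Spark uses MurmurHash3; MD5 stands in here only because it is deterministic and in the stdlib): distinct terms can land on the same column index, which is the collision the doc text discusses.

```python
import hashlib

def term_index(term, num_features):
    """Map a term to a column index via hash-then-modulo."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features

def hashing_tf(doc, num_features):
    """Build a term-frequency vector of fixed width `num_features`."""
    vec = [0] * num_features
    for term in doc:
        vec[term_index(term, num_features)] += 1
    return vec

vec = hashing_tf(["spark", "ml", "spark"], 16)
print(sum(vec))   # total term count is preserved regardless of collisions
```

Increasing `num_features` lowers the collision probability, which is exactly the trade-off the doc paragraph describes.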





[GitHub] spark pull request: [Spark 15129][R][DOC][WIP]R API changes in ML

2016-05-24 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13285#issuecomment-221428716
  
@jkbradley @MLnick I have marked this WIP, as I want to get your thoughts 
on whether the format looks OK. I can add examples to KMeans and SurvReg 
if the overall format looks fine. 





[GitHub] spark pull request: [Spark 15129][R][DOC][WIP]R API changes in ML

2016-05-24 Thread GayathriMurali
GitHub user GayathriMurali opened a pull request:

https://github.com/apache/spark/pull/13285

[Spark 15129][R][DOC][WIP]R  API changes in ML

## What changes were proposed in this pull request?

Make user guide changes to SparkR documentation for all changes that 
happened in 2.0 to Machine Learning APIs


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GayathriMurali/spark SPARK-15129

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13285.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13285


commit 490a8e8b038868c56f8393fee180255041f19b7f
Author: GayathriMurali 
Date:   2016-05-20T21:28:55Z

User guide changes to CountVectorizer, QuantileDiscretizer and HashingTF

commit 901fb6df17667440339120fde3e36ae6be1ae2df
Author: GayathriMurali 
Date:   2016-05-20T21:43:11Z

Review comments

commit c95ee2c20917ca2a546544b2e3168f8b67d52a2e
Author: GayathriMurali 
Date:   2016-05-24T22:55:21Z

[SPARK-15129][R][DOC][WIP] R API changes







[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-24 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r64476690
  
--- Diff: docs/ml-features.md ---
@@ -1098,9 +1098,9 @@ for more details on the API.
 
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
 categorical features. The number of bins is set by the `numBuckets` 
parameter.
-The bin ranges are chosen using an approximate algorithm (see the 
documentation for 
[approxQuantile](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala)
+The bin ranges are chosen using an approximate algorithm (see the 
documentation for 
[approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.scala)
 for a detailed description). The precision of the approximation can be 
controlled with the
-`relativeError` parameter. When set to zero, exact quantiles are 
calculated.
+`relativeError` parameter. When set to zero, exact quantiles are 
calculated. Computing exact quantiles is an expensive operation.
 The lower and upper bin bounds will be `-Infinity` and `+Infinity` 
covering all real values.
 
 **Examples**
--- End diff --

@MLnick The example is still valid for the default value of the relativeError 
param (0.001). I will leave it as is
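The relativeError discussion above can be made concrete with a small, self-contained sketch. This is only an illustration of the relativeError = 0 (exact quantile) case described in the doc text, not Spark's implementation: Spark's QuantileDiscretizer normally uses the approximate Greenwald-Khanna algorithm via `approxQuantile`, and the function names here are hypothetical.

```python
import math

def quantile_buckets(values, num_buckets):
    # Illustrative only: exact quantile split points, i.e. what
    # QuantileDiscretizer computes when relativeError is 0.  Spark
    # normally uses the approximate Greenwald-Khanna algorithm instead.
    s = sorted(values)
    n = len(s)
    splits = [s[int(n * i / num_buckets)] for i in range(1, num_buckets)]
    # Outer bounds are -Infinity and +Infinity, covering all real values.
    return [-math.inf] + splits + [math.inf]

def bucketize(x, splits):
    # Assign x to the bin whose [lower, upper) range contains it.
    for i in range(len(splits) - 1):
        if splits[i] <= x < splits[i + 1]:
            return i

splits = quantile_buckets(range(100), 4)  # [-inf, 25, 50, 75, inf]
```

This shows why computing exact quantiles is expensive: it requires a full sort of the data, which the approximate algorithm avoids.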





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-23 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176#issuecomment-221020100
  
@MLnick I fixed all review comments. Can you please let me know if there is 
anything else to be done to help get this merged? 





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-20 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176#issuecomment-220723548
  
@MLnick The latest commit includes just the ml-feature.md changes. I 
removed all the other example files and feature.py. 





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-20 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r64101972
  
--- Diff: docs/ml-features.md ---
@@ -1093,13 +,10 @@ for more details on the API.
 
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
 categorical features.
-The bin ranges are chosen by taking a sample of the data and dividing it 
into roughly equal parts.
-The lower and upper bin bounds will be `-Infinity` and `+Infinity`, 
covering all real values.
-This attempts to find `numBuckets` partitions based on a sample of the 
given input data, but it may
-find fewer depending on the data sample values.
+The bin ranges are chosen using the `approxQuantile` method based on the 
Greenwald-Khanna algorithm.
+The number of bins found is equal to `numBuckets` parameter value. 
`relativeError` sets the target relative precision
--- End diff --

Sure. I was not able to find API doc for `approxQuantile`





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-20 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176#issuecomment-220698824
  
Something messed up the `git push`. I will send another commit 





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-20 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r64087912
  
--- Diff: docs/ml-features.md ---
@@ -26,7 +26,9 @@ This section covers algorithms for working with features, 
roughly divided into t
 
 `HashingTF` is a `Transformer` which takes sets of terms and converts 
those sets into 
 fixed-length feature vectors.  In text processing, a "set of terms" might 
be a bag of words.
-The algorithm combines Term Frequency (TF) counts with the 
+A binary toggle parameter controls term frequency. When set to true all 
nonzero frequencies are
--- End diff --

Yup. That makes sense. Will change it, thanks!
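The binary toggle being documented here can be illustrated with a minimal sketch. This is not Spark's implementation (HashingTF hashes terms with MurmurHash3); `crc32` is used only as a deterministic stand-in hash, and all names are hypothetical.

```python
import zlib

def hashing_tf(terms, num_features=16, binary=False):
    # Sketch of the hashing-trick term-frequency transform.  NOT Spark's
    # implementation (Spark uses MurmurHash3); crc32 just gives a
    # deterministic index per term.
    vec = [0] * num_features
    for term in terms:
        idx = zlib.crc32(term.encode("utf-8")) % num_features
        vec[idx] += 1
    if binary:
        # The binary toggle: every nonzero frequency becomes 1.
        vec = [1 if v > 0 else 0 for v in vec]
    return vec

doc = ["a", "b", "a", "c"]
tf = hashing_tf(doc)                 # raw term counts
btf = hashing_tf(doc, binary=True)   # nonzero slots clamped to 1
```

The same clamping idea applies to CountVectorizer's binary parameter, where it acts on the output vector values rather than on hashed term frequencies.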





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-20 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r64079147
  
--- Diff: docs/ml-features.md ---
@@ -114,7 +116,10 @@ for more details on the API.
  During the fitting process, `CountVectorizer` will select the top 
`vocabSize` words ordered by
  term frequency across the corpus. An optional parameter "minDF" also 
affect the fitting process
  by specifying the minimum number (or fraction if < 1.0) of documents a 
term must appear in to be
- included in the vocabulary.
+ included in the vocabulary. Another optional binary toggle parameter 
controls the output vector.
--- End diff --

I said "Another" because the previous line starts with "An optional 
parameter". It just sounded right.




[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-20 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r64078981
  
--- Diff: docs/ml-features.md ---
@@ -26,7 +26,9 @@ This section covers algorithms for working with features, 
roughly divided into t
 
 `HashingTF` is a `Transformer` which takes sets of terms and converts 
those sets into 
 fixed-length feature vectors.  In text processing, a "set of terms" might 
be a bag of words.
-The algorithm combines Term Frequency (TF) counts with the 
+A binary toggle parameter controls term frequency. When set to true all 
nonzero frequencies are
--- End diff --

It controls the output vector values in CountVectorizer and Term Frequency 
in HashingTF





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-20 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r64075252
  
--- Diff: docs/ml-features.md ---
@@ -1064,7 +1069,8 @@ categorical features.
 The bin ranges are chosen by taking a sample of the data and dividing it 
into roughly equal parts.
 The lower and upper bin bounds will be `-Infinity` and `+Infinity`, 
covering all real values.
 This attempts to find `numBuckets` partitions based on a sample of the 
given input data, but it may
-find fewer depending on the data sample values.
+find fewer depending on the data sample values. Relative precision of the 
approxQuantile is set using
--- End diff --

@MLnick @oliverpierson I can fix the `approxQuantile` documentation on the 
Scala and Python sides to be more consistent with QuantileDiscretizer in 
DataFrameStat in this JIRA itself. Please let me know if that makes sense





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-20 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/13176#discussion_r64073253
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaCountVectorizerExample.java
 ---
@@ -54,6 +54,7 @@ public static void main(String[] args) {
   .setOutputCol("feature")
   .setVocabSize(3)
   .setMinDF(2)
+  .setBinary(true)
--- End diff --

@MLnick Since we introduce the binary toggle in the doc, I thought it would 
make sense to show how to set it. Do you want me to remove it from all 
examples? 





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-19 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/13176#issuecomment-220513197
  
@hhbyyh Can you please help review this? I will resolve the branch conflict 
along with review comments





[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...

2016-05-18 Thread GayathriMurali
GitHub user GayathriMurali opened a pull request:

https://github.com/apache/spark/pull/13176

[SPARK-15100][DOC] Modified user guide and examples for CountVectoriz…

## What changes were proposed in this pull request?

These are partial documentation changes to ml.feature, covering 
CountVectorizer, HashingTF and QuantileDiscretizer


## How was this patch tested?

Unit test and manual testing

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GayathriMurali/spark SPARK-15100

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13176.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13176


commit 46408bbdb13da94ecd40ba380ee8fc219232d481
Author: GayathriMurali 
Date:   2016-05-18T18:58:27Z

[SPARK-15100][DOC] Modified user guide and examples for CountVectorizer, 
HashingTF and QuantileDiscretizer







[GitHub] spark pull request: [SPARK-14894][PySpark] Add result summary api ...

2016-05-11 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/12675#issuecomment-218664815
  
@holdenk I checked the ScalaDoc and removed the evaluate method. Thanks for 
pointing it out. Can you please help review 





[GitHub] spark pull request: [SPARK-14894][PySpark] Add result summary api ...

2016-05-05 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/12675#issuecomment-217318560
  
@holdenk I fixed the pydoc style issue. Can you please help review this? 





[GitHub] spark pull request: [SPARK-14894][PySpark] Add result summary api ...

2016-04-28 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/12675#issuecomment-215529653
  
@jkbradley Can you please ok to test this





[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...

2016-04-28 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/12683#discussion_r61481184
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -71,7 +71,25 @@ test_that("glm and predict", {
data = iris, family = poisson(link = identity)), iris))
   expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
 
-  # Test stats::predict is working
+  # Test model save/load
+  modelPath <- tempfile(pattern = "GLM", fileext = ".tmp")
+  ml.save(model, modelPath)
+  expect_error(ml.save(model, modelPath))
+  ml.save(model, modelPath, overwrite = TRUE)
+  m2 <- ml.load(modelPath)
+  s2 <- summary(m2)
+  expect_equal(s$rCoefficients, s2$rCoefficients)
--- End diff --

I agree. Should we do this test for all of the Gaussian, Poisson, and 
Binomial families? I am assuming doing it for any one of them should be 
sufficient.
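The save/load round trip exercised by the R test above (save, expect an error on re-save, overwrite, reload, compare) follows a generic pattern that can be sketched outside SparkR. A minimal Python analogue, using pickle as a hypothetical stand-in for ml.save/ml.load:

```python
import os
import pickle
import tempfile

def save(obj, path, overwrite=False):
    # Mimic the save semantics the R test exercises: refuse to clobber
    # an existing path unless overwrite is requested.
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(path)
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)

model = {"coefficients": [0.1, 2.5]}   # stand-in for a fitted model
path = os.path.join(tempfile.mkdtemp(), "model.tmp")
save(model, path)
try:
    save(model, path)                  # re-save without overwrite
    raised = False
except FileExistsError:
    raised = True                      # expected, as in expect_error(...)
save(model, path, overwrite=True)
assert load(path) == model             # round trip preserves the summary
```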





[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...

2016-04-27 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/12683#discussion_r61370053
  
--- Diff: R/pkg/R/mllib.R ---
@@ -406,6 +432,8 @@ ml.load <- function(path) {
   jobj <- callJStatic("org.apache.spark.ml.r.RWrappers", "load", path)
   if (isInstanceOf(jobj, "org.apache.spark.ml.r.NaiveBayesWrapper")) {
 return(new("NaiveBayesModel", jobj = jobj))
+  } else if (isInstanceOf(jobj, 
"org.apache.spark.ml.GeneralizedLinearRegressionWrapper")) {
--- End diff --

For some reason the local tests (R/run-tests.sh) are not capturing these 
failures. Let me fix this and submit the code





[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...

2016-04-25 Thread GayathriMurali
GitHub user GayathriMurali opened a pull request:

https://github.com/apache/spark/pull/12683

[SPARK-14315][SparkR]Add model persistence to GLMs

## What changes were proposed in this pull request?

Add model persistence to GLMs in SparkR


## How was this patch tested?

Unit tests added



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GayathriMurali/spark SPARK-14315

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12683.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12683


commit 31b2b679c041068fb60db4d36ecc28d149b04c75
Author: GayathriMurali 
Date:   2016-04-26T04:06:08Z

[SPARK-14315][SparkR]Add model persistence to GLMs







[GitHub] spark pull request: [Spark-14314][SparkR] Add model persistence to...

2016-04-25 Thread GayathriMurali
GitHub user GayathriMurali opened a pull request:

https://github.com/apache/spark/pull/12680

[Spark-14314][SparkR] Add model persistence to KMeans

## What changes were proposed in this pull request?

Add model persistence to KMeans SparkR


## How was this patch tested?

Unit tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GayathriMurali/spark SPARK-14314

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12680.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12680


commit a38f18c6f2b28bd5615072858fd99984066a9f8e
Author: GayathriMurali 
Date:   2016-04-26T03:01:13Z

[Spark-14314][SparkR] Add model persistence to KMeans







[GitHub] spark pull request: [SPARK-14894][PySpark] Add result summary api ...

2016-04-25 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/12675#issuecomment-214569241
  
@wangmiao1981 @jkbradley  Please help review this PR





[GitHub] spark pull request: [SPARK-14894][PySpark] Add result summary api ...

2016-04-25 Thread GayathriMurali
GitHub user GayathriMurali opened a pull request:

https://github.com/apache/spark/pull/12675

[SPARK-14894][PySpark] Add result summary api to Gaussian Mixture

## What changes were proposed in this pull request?

Add summary API to Gaussian Mixture

## How was this patch tested?

Added unit test case to test summary information



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GayathriMurali/spark SPARK-14894

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12675.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12675


commit 727e85506c0dfa9beda3207c02d4ce1c1db22d81
Author: GayathriMurali 
Date:   2016-04-26T00:05:25Z

[SPARK-14894][PySpark] Add result summary api to Gaussian Mixture







[GitHub] spark pull request: [SPARK-14894][Pyspark] Add result summary API ...

2016-04-25 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/12670#issuecomment-214565229
  
I am closing this PR as a file got added by mistake. Will open a new one. 





[GitHub] spark pull request: [SPARK-14894][Pyspark] Add result summary API ...

2016-04-25 Thread GayathriMurali
Github user GayathriMurali closed the pull request at:

https://github.com/apache/spark/pull/12670





[GitHub] spark pull request: [SPARK-14894][Pyspark] Add result summary API ...

2016-04-25 Thread GayathriMurali
GitHub user GayathriMurali opened a pull request:

https://github.com/apache/spark/pull/12670

[SPARK-14894][Pyspark] Add result summary API to Gaussian Mixture

## What changes were proposed in this pull request?
Add summary API to Gaussian Mixture in Pyspark


## How was this patch tested?

Added unit test cases to verify summary information


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GayathriMurali/spark SPARK-14894

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12670.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12670


commit a934ac4cdd8133d2cd1f17bd31cf2a3d99728143
Author: GayathriMurali 
Date:   2016-04-25T22:16:12Z

[SPARK-14894][Pyspark] Add result summary API to Gaussian Mixture







[GitHub] spark pull request: [SPARK-13783] [ML] Model export/import for spa...

2016-04-08 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/12230#discussion_r59054623
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -257,12 +240,61 @@ final class GBTClassificationModel private[ml](
   private[ml] def toOld: OldGBTModel = {
 new OldGBTModel(OldAlgo.Classification, _trees.map(_.toOld), 
_treeWeights)
   }
+
+  @Since("2.0.0")
+  override def write: MLWriter = new 
GBTClassificationModel.GBTClassificationModelWriter(this)
 }
 
-private[ml] object GBTClassificationModel {
+@Since("2.0.0")
+object GBTClassificationModel extends MLReadable[GBTClassificationModel] {
+
+  @Since("2.0.0")
+  override def read: MLReader[GBTClassificationModel] = new 
GBTClassificationModelReader
+
+  @Since("2.0.0")
+  override def load(path: String): GBTClassificationModel = 
super.load(path)
+
+  private[GBTClassificationModel]
+  class GBTClassificationModelWriter(instance: GBTClassificationModel) 
extends MLWriter {
+
+override protected def saveImpl(path: String): Unit = {
+  val extraMetadata: JObject = Map(
+"numFeatures" -> instance.numFeatures,
+"numTrees" -> instance.getNumTrees)
+  EnsembleModelReadWrite.saveImpl(instance, path, sqlContext, 
extraMetadata)
+}
+  }
+
+  private class GBTClassificationModelReader extends 
MLReader[GBTClassificationModel] {
+
+/** Checked against metadata when loading model */
+private val className = classOf[GBTClassificationModel].getName
+private val treeClassName = 
classOf[DecisionTreeRegressionModel].getName
+
+override def load(path: String): GBTClassificationModel = {
+  implicit val format = DefaultFormats
+  val (metadata: Metadata, treesData: Array[(Metadata, Node)], 
treeWeights: Array[Double]) =
+EnsembleModelReadWrite.loadImpl(path, sqlContext, className, 
treeClassName)
+  val numFeatures = (metadata.metadata \ "numFeatures").extract[Int]
+  val numTrees = (metadata.metadata \ "numTrees").extract[Int]
+
+  val trees: Array[DecisionTreeRegressionModel] = treesData.map {
--- End diff --

@yanboliang My bad for not checking that properly. Sorry about that!
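The writer/reader pair in the diff above persists extra metadata (numFeatures, numTrees) alongside the per-tree data and validates it on load. A minimal Python sketch of that metadata round trip, with hypothetical field names mirroring the Scala code (Spark itself persists this through its own JSON/Parquet format):

```python
import json

def save_impl(instance):
    # The writer stores extra metadata next to the serialized trees,
    # as GBTClassificationModelWriter.saveImpl does.
    extra = {"numFeatures": instance["numFeatures"],
             "numTrees": len(instance["trees"])}
    return json.dumps({"metadata": extra, "trees": instance["trees"]})

def load_impl(blob):
    # The reader extracts the fields back by key, mirroring
    # (metadata.metadata \ "numFeatures").extract[Int] in the diff.
    data = json.loads(blob)
    num_features = data["metadata"]["numFeatures"]
    trees = data["trees"]
    assert data["metadata"]["numTrees"] == len(trees)
    return num_features, trees

model = {"numFeatures": 5,
         "trees": [{"uid": "dtr_1"}, {"uid": "dtr_2"}]}
num_features, trees = load_impl(save_impl(model))
```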





[GitHub] spark pull request: [SPARK-13783] [ML] Model export/import for spa...

2016-04-07 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/12230#issuecomment-207051757
  
@yanboliang I did a quick first pass. I have some initial comments. Will 
stay tuned for updates. Thanks!





[GitHub] spark pull request: [SPARK-13783] [ML] Model export/import for spa...

2016-04-07 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/12230#discussion_r58926273
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -257,12 +240,61 @@ final class GBTClassificationModel private[ml](
   private[ml] def toOld: OldGBTModel = {
 new OldGBTModel(OldAlgo.Classification, _trees.map(_.toOld), 
_treeWeights)
   }
+
+  @Since("2.0.0")
+  override def write: MLWriter = new 
GBTClassificationModel.GBTClassificationModelWriter(this)
 }
 
-private[ml] object GBTClassificationModel {
+@Since("2.0.0")
+object GBTClassificationModel extends MLReadable[GBTClassificationModel] {
+
+  @Since("2.0.0")
+  override def read: MLReader[GBTClassificationModel] = new 
GBTClassificationModelReader
+
+  @Since("2.0.0")
+  override def load(path: String): GBTClassificationModel = 
super.load(path)
+
+  private[GBTClassificationModel]
+  class GBTClassificationModelWriter(instance: GBTClassificationModel) 
extends MLWriter {
+
+override protected def saveImpl(path: String): Unit = {
+  val extraMetadata: JObject = Map(
+"numFeatures" -> instance.numFeatures,
+"numTrees" -> instance.getNumTrees)
+  EnsembleModelReadWrite.saveImpl(instance, path, sqlContext, 
extraMetadata)
+}
+  }
+
+  private class GBTClassificationModelReader extends 
MLReader[GBTClassificationModel] {
+
+/** Checked against metadata when loading model */
+private val className = classOf[GBTClassificationModel].getName
+private val treeClassName = 
classOf[DecisionTreeRegressionModel].getName
+
+override def load(path: String): GBTClassificationModel = {
+  implicit val format = DefaultFormats
+  val (metadata: Metadata, treesData: Array[(Metadata, Node)], 
treeWeights: Array[Double]) =
+EnsembleModelReadWrite.loadImpl(path, sqlContext, className, 
treeClassName)
+  val numFeatures = (metadata.metadata \ "numFeatures").extract[Int]
+  val numTrees = (metadata.metadata \ "numTrees").extract[Int]
+
+  val trees: Array[DecisionTreeRegressionModel] = treesData.map {
+case (treeMetadata, root) =>
+  val tree =
+new DecisionTreeRegressionModel(treeMetadata.uid, root, 
numFeatures)
--- End diff --

Same here. Shouldn't this be DecisionTreeClassificationModel? 





[GitHub] spark pull request: [SPARK-13783] [ML] Model export/import for spa...

2016-04-07 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/12230#discussion_r58926192
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -257,12 +240,61 @@ final class GBTClassificationModel private[ml](
   private[ml] def toOld: OldGBTModel = {
 new OldGBTModel(OldAlgo.Classification, _trees.map(_.toOld), 
_treeWeights)
   }
+
+  @Since("2.0.0")
+  override def write: MLWriter = new 
GBTClassificationModel.GBTClassificationModelWriter(this)
 }
 
-private[ml] object GBTClassificationModel {
+@Since("2.0.0")
+object GBTClassificationModel extends MLReadable[GBTClassificationModel] {
+
+  @Since("2.0.0")
+  override def read: MLReader[GBTClassificationModel] = new 
GBTClassificationModelReader
+
+  @Since("2.0.0")
+  override def load(path: String): GBTClassificationModel = 
super.load(path)
+
+  private[GBTClassificationModel]
+  class GBTClassificationModelWriter(instance: GBTClassificationModel) 
extends MLWriter {
+
+override protected def saveImpl(path: String): Unit = {
+  val extraMetadata: JObject = Map(
+"numFeatures" -> instance.numFeatures,
+"numTrees" -> instance.getNumTrees)
+  EnsembleModelReadWrite.saveImpl(instance, path, sqlContext, 
extraMetadata)
+}
+  }
+
+  private class GBTClassificationModelReader extends 
MLReader[GBTClassificationModel] {
+
+/** Checked against metadata when loading model */
+private val className = classOf[GBTClassificationModel].getName
+private val treeClassName = 
classOf[DecisionTreeRegressionModel].getName
+
+override def load(path: String): GBTClassificationModel = {
+  implicit val format = DefaultFormats
+  val (metadata: Metadata, treesData: Array[(Metadata, Node)], 
treeWeights: Array[Double]) =
+EnsembleModelReadWrite.loadImpl(path, sqlContext, className, 
treeClassName)
+  val numFeatures = (metadata.metadata \ "numFeatures").extract[Int]
+  val numTrees = (metadata.metadata \ "numTrees").extract[Int]
+
+  val trees: Array[DecisionTreeRegressionModel] = treesData.map {
--- End diff --

I guess there is a slight mix-up here. This is the GBTClassifier. I see 
you have used DecisionTreeRegressionModel. Am I missing something?





[GitHub] spark pull request: [SPARK-13783] [ML] Model export/import for spa...

2016-04-07 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/12230#discussion_r58925984
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -257,12 +240,61 @@ final class GBTClassificationModel private[ml](
   private[ml] def toOld: OldGBTModel = {
 new OldGBTModel(OldAlgo.Classification, _trees.map(_.toOld), 
_treeWeights)
   }
+
+  @Since("2.0.0")
+  override def write: MLWriter = new 
GBTClassificationModel.GBTClassificationModelWriter(this)
 }
 
-private[ml] object GBTClassificationModel {
+@Since("2.0.0")
+object GBTClassificationModel extends MLReadable[GBTClassificationModel] {
+
+  @Since("2.0.0")
+  override def read: MLReader[GBTClassificationModel] = new 
GBTClassificationModelReader
+
+  @Since("2.0.0")
+  override def load(path: String): GBTClassificationModel = 
super.load(path)
+
+  private[GBTClassificationModel]
+  class GBTClassificationModelWriter(instance: GBTClassificationModel) 
extends MLWriter {
+
+override protected def saveImpl(path: String): Unit = {
+  val extraMetadata: JObject = Map(
+"numFeatures" -> instance.numFeatures,
+"numTrees" -> instance.getNumTrees)
+  EnsembleModelReadWrite.saveImpl(instance, path, sqlContext, 
extraMetadata)
+}
+  }
+
+  private class GBTClassificationModelReader extends 
MLReader[GBTClassificationModel] {
+
+/** Checked against metadata when loading model */
+private val className = classOf[GBTClassificationModel].getName
+private val treeClassName = 
classOf[DecisionTreeRegressionModel].getName
+
+override def load(path: String): GBTClassificationModel = {
+  implicit val format = DefaultFormats
+  val (metadata: Metadata, treesData: Array[(Metadata, Node)], 
treeWeights: Array[Double]) =
+EnsembleModelReadWrite.loadImpl(path, sqlContext, className, 
treeClassName)
+  val numFeatures = (metadata.metadata \ "numFeatures").extract[Int]
+  val numTrees = (metadata.metadata \ "numTrees").extract[Int]
+
--- End diff --

Also here, please define numClasses





[GitHub] spark pull request: [SPARK-13783] [ML] Model export/import for spa...

2016-04-07 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/12230#discussion_r58925589
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -257,12 +240,61 @@ final class GBTClassificationModel private[ml](
   private[ml] def toOld: OldGBTModel = {
 new OldGBTModel(OldAlgo.Classification, _trees.map(_.toOld), 
_treeWeights)
   }
+
+  @Since("2.0.0")
+  override def write: MLWriter = new 
GBTClassificationModel.GBTClassificationModelWriter(this)
 }
 
-private[ml] object GBTClassificationModel {
+@Since("2.0.0")
+object GBTClassificationModel extends MLReadable[GBTClassificationModel] {
+
+  @Since("2.0.0")
+  override def read: MLReader[GBTClassificationModel] = new 
GBTClassificationModelReader
+
+  @Since("2.0.0")
+  override def load(path: String): GBTClassificationModel = 
super.load(path)
+
+  private[GBTClassificationModel]
+  class GBTClassificationModelWriter(instance: GBTClassificationModel) 
extends MLWriter {
+
+override protected def saveImpl(path: String): Unit = {
+  val extraMetadata: JObject = Map(
+"numFeatures" -> instance.numFeatures,
+"numTrees" -> instance.getNumTrees)
--- End diff --

Did you miss numClasses here? 





[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

2016-04-01 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/12118#issuecomment-204635416
  
@jkbradley Thanks for this. This looks great and clarifies a lot of things 
I was trying to do. I had one minor comment; other than that, it looks fine to me. 




[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

2016-04-01 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/12118#discussion_r58287249
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala ---
@@ -358,3 +376,100 @@ private[ml] object DecisionTreeModelReadWrite {
 finalNodes.head
   }
 }
+
+private[ml] object EnsembleModelReadWrite {
+
+  /**
+   * Helper method for saving a tree ensemble to disk.
+   *
+   * @param instance  Tree ensemble model
+   * @param path  Path to which to save the ensemble model.
+   * @param extraMetadata  Metadata such as numFeatures, numClasses, 
numTrees.
+   */
+  def saveImpl[M <: Params with TreeEnsembleModel](
+  instance: M,
+  path: String,
+  sql: SQLContext,
+  extraMetadata: JObject): Unit = {
+DefaultParamsWriter.saveMetadata(instance, path, sql.sparkContext, 
Some(extraMetadata))
+val treesMetadataJson: Array[(Int, String)] = 
instance.trees.zipWithIndex.map {
+  case (tree, treeID) =>
+treeID -> 
DefaultParamsWriter.getMetadataToSave(tree.asInstanceOf[Params], 
sql.sparkContext)
+}
+val treesMetadataPath = new Path(path, "treesMetadata").toString
+sql.createDataFrame(treesMetadataJson).toDF("treeID", "metadata")
+  .write.parquet(treesMetadataPath)
+val dataPath = new Path(path, "data").toString
+val nodeDataRDD = 
sql.sparkContext.parallelize(instance.trees.zipWithIndex).flatMap {
--- End diff --

Is it alright to use flatMap to combine RDDs? Can we use sparkContext.union 
instead? 
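The two approaches raised here can be sketched with plain Scala collections standing in for RDDs (a minimal illustration only, not Spark's actual `EnsembleModelReadWrite` code; the `NodeData` case class and sample data are hypothetical): flatMapping over the indexed trees in a single pass yields the same flattened rows as building one collection per tree and then concatenating (unioning) them.

```scala
// Illustrative stand-in for one row of per-tree node data.
case class NodeData(treeID: Int, nodeID: Int)

object FlatMapVsUnion {
  def main(args: Array[String]): Unit = {
    // Node IDs for two hypothetical trees.
    val trees: Seq[Seq[Int]] = Seq(Seq(0, 1, 2), Seq(0, 1))

    // Option 1: one pass with flatMap over (tree, treeID) pairs,
    // mirroring sc.parallelize(instance.trees.zipWithIndex).flatMap { ... }.
    val viaFlatMap: Seq[NodeData] = trees.zipWithIndex.flatMap {
      case (nodes, treeID) => nodes.map(n => NodeData(treeID, n))
    }

    // Option 2: build one collection per tree, then concatenate them,
    // mirroring SparkContext.union over one RDD per tree.
    val perTree: Seq[Seq[NodeData]] = trees.zipWithIndex.map {
      case (nodes, treeID) => nodes.map(n => NodeData(treeID, n))
    }
    val viaUnion: Seq[NodeData] = perTree.flatten

    // Both produce the same rows in the same order.
    assert(viaFlatMap == viaUnion)
    println(viaFlatMap.length)
  }
}
```

Either form produces identical data; the practical difference in Spark is that `flatMap` over a parallelized index stays within one RDD, while `SparkContext.union` stitches together one RDD per tree, which adds scheduling overhead when the number of trees is large.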





[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

2016-04-01 Thread GayathriMurali
Github user GayathriMurali closed the pull request at:

https://github.com/apache/spark/pull/12023





[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

2016-04-01 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/12023#issuecomment-204576825
  
@jkbradley I was just about to ping you regarding this. I would definitely 
love to help out. I was out at Strata all week and couldn't get to this. Please 
let me know if you need anything else from me. 





[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

2016-03-30 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/12023#issuecomment-203763228
  
@jkbradley I am sorry, I am afraid I will not be able to complete this 
tonight. Can you please help me with reusing the SplitData/build code from 
DecisionTrees in RandomForests? 





[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

2016-03-30 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/12023#discussion_r57993953
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala 
---
@@ -199,21 +210,71 @@ final class RandomForestRegressionModel private[ml] (
   private[ml] def toOld: OldRandomForestModel = {
 new OldRandomForestModel(OldAlgo.Regression, _trees.map(_.toOld))
   }
+
+  @Since("2.0.0")
+  override def write: MLWriter =
+new RandomForestRegressionModel.RandomForestRegressionModelWriter(this)
+
+  @Since("2.0.0")
+  override def read: MLReader[RandomForestRegressionModel] =
+new RandomForestRegressionModel.RandomForestRegressionModelReader(this)
 }
 
-private[ml] object RandomForestRegressionModel {
-
-  /** (private[ml]) Convert a model from the old API */
-  def fromOld(
-  oldModel: OldRandomForestModel,
-  parent: RandomForestRegressor,
-  categoricalFeatures: Map[Int, Int],
-  numFeatures: Int = -1): RandomForestRegressionModel = {
-require(oldModel.algo == OldAlgo.Regression, "Cannot convert 
RandomForestModel" +
-  s" with algo=${oldModel.algo} (old API) to 
RandomForestRegressionModel (new API).")
-val newTrees = oldModel.trees.map { tree =>
-  // parent for each tree is null since there is no good way to set 
this.
-  DecisionTreeRegressionModel.fromOld(tree, null, categoricalFeatures)
+@Since("2.0.0")
+object RandomForestRegressionModel extends 
MLReadable[RandomForestRegressionModel] {
+
+@Since("2.0.0")
+override def load(path: String): RandomForestRegressionModel = 
super.load(path)
+
+private[RandomForestRegressionModel]
+class RandomForestRegressionModelWriter(instance: 
RandomForestRegressionModel)
+  extends MLWriter {
+
+  override protected def saveImpl(path: String): Unit = {
+val extraMetadata: JObject = Map(
+"numFeatures" -> instance.numFeatures)
+DefaultParamsWriter.saveMetadata(instance, path, sc, 
Some(extraMetadata))
+for ( treeIndex <- 1 to instance.getNumTrees) {
--- End diff --

@jkbradley Should the saveImpl and load methods in RandomForestClassifier and 
Regressor override this method? I assume loadImpl will also have the same 
signature. 






[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

2016-03-30 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/12023#discussion_r57968732
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala
 ---
@@ -240,12 +250,66 @@ final class RandomForestClassificationModel 
private[ml] (
   private[ml] def toOld: OldRandomForestModel = {
 new OldRandomForestModel(OldAlgo.Classification, _trees.map(_.toOld))
   }
+
+  @Since("2.0.0")
+  override def write: MLWriter =
+new 
RandomForestClassificationModel.RandomForestClassificationModelWriter(this)
+
+  @Since("2.0.0")
+  override def read: MLReader =
+new 
RandomForestClassificationModel.RandomForestClassificationModelReader(this)
 }
 
-private[ml] object RandomForestClassificationModel {
+@Since("2.0.0")
+object RandomForestClassificationModel extends 
MLReadable[RandomForestClassificationModel] {
+
+
+  @Since("2.0.0")
+  override def load(path: String): RandomForestClassificationModel = 
super.load(path)
+
+  private[RandomForestClassificationModel]
+  class RandomForestClassificationModelWriter(instance: 
RandomForestClassificationModel)
+extends MLWriter {
+
+override protected def saveImpl(path: String): Unit = {
+  val extraMetadata: JObject = Map(
+"numFeatures" -> instance.numFeatures,
+"numClasses" -> instance.numClasses)
+  DefaultParamsWriter.saveMetadata(instance, path, sc, 
Some(extraMetadata))
+  for(treeIndex <- 1 to instance.getNumTrees) {
--- End diff --

@jkbradley Are you thinking of an Array of RDDs that then get flattened into a 
single RDD? Should I use SparkContext.union to combine the multiple RDDs? 





[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

2016-03-30 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/12023#issuecomment-203645597
  
@jkbradley I should be able to update this by tonight. Would that work? 





[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

2016-03-29 Thread GayathriMurali
Github user GayathriMurali commented on a diff in the pull request:

https://github.com/apache/spark/pull/12023#discussion_r57787829
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala
 ---
@@ -240,12 +250,66 @@ final class RandomForestClassificationModel 
private[ml] (
   private[ml] def toOld: OldRandomForestModel = {
 new OldRandomForestModel(OldAlgo.Classification, _trees.map(_.toOld))
   }
+
+  @Since("2.0.0")
+  override def write: MLWriter =
+new 
RandomForestClassificationModel.RandomForestClassificationModelWriter(this)
+
+  @Since("2.0.0")
+  override def read: MLReader =
+new 
RandomForestClassificationModel.RandomForestClassificationModelReader(this)
 }
 
-private[ml] object RandomForestClassificationModel {
+@Since("2.0.0")
+object RandomForestClassificationModel extends 
MLReadable[RandomForestClassificationModel] {
+
+
+  @Since("2.0.0")
+  override def load(path: String): RandomForestClassificationModel = 
super.load(path)
+
+  private[RandomForestClassificationModel]
+  class RandomForestClassificationModelWriter(instance: 
RandomForestClassificationModel)
+extends MLWriter {
+
+override protected def saveImpl(path: String): Unit = {
+  val extraMetadata: JObject = Map(
+"numFeatures" -> instance.numFeatures,
+"numClasses" -> instance.numClasses)
+  DefaultParamsWriter.saveMetadata(instance, path, sc, 
Some(extraMetadata))
+  for(treeIndex <- 1 to instance.getNumTrees) {
--- End diff --

@jkbradley Sorry for the confusion. In the JIRA discussion, I meant that every 
tree would be stored in its own dataframe. I can work on storing all 
of them in a single dataframe instead. 





[GitHub] spark pull request: [Spark 13784][ML][WIP] Model export/import for...

2016-03-28 Thread GayathriMurali
Github user GayathriMurali commented on the pull request:

https://github.com/apache/spark/pull/12023#issuecomment-202671126
  
@yanboliang @jkbradley Please help review the code. 





[GitHub] spark pull request: [Spark 13784][ML][WIP] Model export/import for...

2016-03-28 Thread GayathriMurali
GitHub user GayathriMurali opened a pull request:

https://github.com/apache/spark/pull/12023

[Spark 13784][ML][WIP] Model export/import for spark.ml: RandomForests

Please help review the code. I have included the WIP tag to make sure the 
changes look correct. 

## What changes were proposed in this pull request?

Model export/import for spark.ml RandomForests





You can merge this pull request into a Git repository by running:

$ git pull https://github.com/GayathriMurali/spark SPARK-13784

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12023.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12023


commit 03bb8880e73b6c107d9a13ab90ce7f61a8756c8f
Author: GayathriMurali 
Date:   2016-03-23T21:09:35Z

SPARK-13784 Model export/import for Spark ml RandomForests

commit 68b9358f128c365d573c5881b06f420276fd44ff
Author: GayathriMurali 
Date:   2016-03-29T02:17:41Z

SPARK-13783 Model export/import for spark.ml:RandomForests






