[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14035


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-09-26 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r80449593
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala ---
@@ -29,8 +29,7 @@ class ChiSqSelectorSuite extends SparkFunSuite with 
MLlibTestSparkContext
   with DefaultReadWriteTest {
 
   test("Test Chi-Square selector") {
-val spark = this.spark
-import spark.implicits._
+import testImplicits._
--- End diff --

Nit: this import should be moved out of the test function so that all test 
cases can share it if necessary.
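
A minimal sketch of the class-level placement suggested here, assuming Spark 2.x or 3.x, where `SQLImplicits` exposes a protected `_sqlContext` member. `ImplicitsFixture`, `ExampleChecks` and the local session are hypothetical stand-ins for `MLlibTestSparkContext` and a real suite; only the position of the import is the point.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SQLImplicits, SparkSession}

// Hypothetical stand-in for a shared test trait such as MLlibTestSparkContext.
trait ImplicitsFixture { self =>
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("testImplicits-sketch")
    .getOrCreate()

  // A stable object, so suites can import from it without first binding
  // `spark` to a local val.
  protected object testImplicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = self.spark.sqlContext
  }
}

// Hypothetical suite body: the import sits at class level, so every method
// (or test case) below shares it.
class ExampleChecks extends ImplicitsFixture {
  import testImplicits._

  def labels(): DataFrame = Seq((0, "a"), (1, "b")).toDF("id", "label")
  def ids(): DataFrame = Seq(1, 2, 3).toDF("id")
}
```

With the import at class level, neither method needs its own `val spark = this.spark; import spark.implicits._` preamble.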





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-09-25 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r80380667
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala ---
@@ -85,11 +87,13 @@ class VectorIndexerSuite extends SparkFunSuite with 
MLlibTestSparkContext
 checkPair(densePoints1Seq, sparsePoints1Seq)
 checkPair(densePoints2Seq, sparsePoints2Seq)
 
-densePoints1 = spark.createDataFrame(sc.parallelize(densePoints1Seq, 
2).map(FeatureData))
-sparsePoints1 = spark.createDataFrame(sc.parallelize(sparsePoints1Seq, 
2).map(FeatureData))
-densePoints2 = spark.createDataFrame(sc.parallelize(densePoints2Seq, 
2).map(FeatureData))
-sparsePoints2 = spark.createDataFrame(sc.parallelize(sparsePoints2Seq, 
2).map(FeatureData))
-badPoints = spark.createDataFrame(sc.parallelize(badPointsSeq, 
2).map(FeatureData))
+densePoints1 = densePoints1Seq.map(FeatureData).toDF()
+sparsePoints1 = sparsePoints1Seq.map(FeatureData).toDF()
+// TODO: If we directly use `toDF` without parallelize, the test in
+// "Throws error when given RDDs with different size vectors" is 
failed for an unknown reason.
+densePoints2 = sc.parallelize(densePoints2Seq, 
2).map(FeatureData).toDF()
--- End diff --

BTW, a test seems to fail for an unknown reason when I change this to 
`densePoints2Seq.map(FeatureData).toDF()`.





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-09-25 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r80379623
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -55,7 +56,7 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
 val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
 rdd = sc.parallelize(generateMultinomialLogisticInput(
   coefficients, xMean, xVariance, true, nPoints, 42), 2)
-dataset = spark.createDataFrame(rdd)
+dataset = rdd.toDF()
--- End diff --

It seems the `rdd` is being used in the tests.





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-09-25 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r80376849
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala ---
@@ -282,9 +281,7 @@ class MLUtilsSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 val z = Vectors.dense(4.0).asML
 val p = (5.0, z)
 val w = Vectors.dense(6.0)
-val df = spark.createDataFrame(Seq(
-  (0, x, y, p, w)
-)).toDF("id", "x", "y", "p", "w")
+val df = Seq((0, x, y, p, w)).toDF("id", "x", "y", "p", "w")
   .withColumn("x", col("x"), metadata)
--- End diff --

We prefer to use `col("x")` for DataFrame operations.





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-09-25 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r80376739
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala ---
@@ -57,8 +58,7 @@ class MinMaxScalerSuite extends SparkFunSuite with 
MLlibTestSparkContext with De
 
   test("MinMaxScaler arguments max must be larger than min") {
 withClue("arguments max must be larger than min") {
-  val dummyDF = spark.createDataFrame(Seq(
-(1, Vectors.dense(1.0, 2.0)))).toDF("id", "feature")
+  val dummyDF = Seq((1, Vectors.dense(1.0, 2.0))).toDF("id", "feature")
--- End diff --

+1 @jaceklaskowski 





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-09-25 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r80376695
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/evaluation/RegressionEvaluatorSuite.scala
 ---
@@ -42,9 +43,10 @@ class RegressionEvaluatorSuite
  * data.map(x=> x.label + ", " + x.features(0) + ", " + x.features(1))
  *   .saveAsTextFile("path")
  */
-val dataset = spark.createDataFrame(
-  sc.parallelize(LinearDataGenerator.generateLinearInput(
-6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 100, 42, 
0.1), 2).map(_.asML))
+val dataset = sc.parallelize(
--- End diff --

+1 @jaceklaskowski 





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-09-25 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r80376680
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala
 ---
@@ -116,7 +117,7 @@ class MultilayerPerceptronClassifierSuite
 // the input seed is somewhat magic, to make this test pass
 val rdd = sc.parallelize(generateMultinomialLogisticInput(
   coefficients, xMean, xVariance, true, nPoints, 1), 2)
-val dataFrame = spark.createDataFrame(rdd).toDF("label", "features")
+val dataFrame = rdd.toDF("label", "features")
--- End diff --

+1 @jaceklaskowski 





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-04 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69432798
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -55,7 +56,7 @@ class LogisticRegressionSuite
 generateMultinomialLogisticInput(coefficients, xMean, xVariance,
   addIntercept = true, nPoints, 42)
 
-  spark.createDataFrame(sc.parallelize(testData, 4))
+  sc.parallelize(testData, 4).toDF()
--- End diff --

It'd be nice to know what the purpose of the explicit partition setting was.





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69400558
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/RandomForestClassifierSuite.scala
 ---
@@ -158,7 +159,7 @@ class RandomForestClassifierSuite
   }
 
   test("Fitting without numClasses in metadata") {
-val df: DataFrame = 
spark.createDataFrame(TreeTests.featureImportanceData(sc))
+val df: DataFrame = TreeTests.featureImportanceData(sc).toDF()
--- End diff --

I agree with this too, but judging from this discussion, both seem to be 
fine: https://github.com/apache/spark/pull/12452
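
A small sketch of the two construction styles compared in this review, `spark.createDataFrame(rdd)` and `rdd.toDF()`. The `Point` case class, the object name and the local session are assumptions made for illustration and do not come from the pull request.

```scala
import org.apache.spark.sql.SparkSession

object CreateVsToDF {
  // Hypothetical record type, used only for this illustration.
  final case class Point(label: Double, feature: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("toDF-sketch")
      .getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(Point(0.0, 1.5), Point(1.0, 2.5)), 2)

    val explicitDF = spark.createDataFrame(rdd) // explicit factory call
    val implicitDF = rdd.toDF()                 // implicit conversion from the import

    // Both derive the schema from the case class fields.
    assert(explicitDF.schema == implicitDF.schema)
    assert(explicitDF.collect().toSet == implicitDF.collect().toSet)

    spark.stop()
  }
}
```

In this sketch the two calls yield the same schema and rows, so the choice between them is mostly a matter of style.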





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69400523
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala
 ---
@@ -116,7 +117,7 @@ class MultilayerPerceptronClassifierSuite
 // the input seed is somewhat magic, to make this test pass
 val rdd = sc.parallelize(generateMultinomialLogisticInput(
   coefficients, xMean, xVariance, true, nPoints, 1), 2)
-val dataFrame = spark.createDataFrame(rdd).toDF("label", "features")
+val dataFrame = rdd.toDF("label", "features")
--- End diff --

Again, I agree with this too, but I am hesitant to change it because it 
is explicitly set.





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69400465
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -55,7 +56,7 @@ class LogisticRegressionSuite
 generateMultinomialLogisticInput(coefficients, xMean, xVariance,
   addIntercept = true, nPoints, 42)
 
-  spark.createDataFrame(sc.parallelize(testData, 4))
+  sc.parallelize(testData, 4).toDF()
--- End diff --

Strictly speaking, I guess `sc.parallelize(testData, 4).toDF()` and 
`testData.toDF.repartition(4)` would not be exactly the same. It seems the 
author of this test intended to set the initial number of partitions to `4` 
explicitly. I agree with your point, but I left it as it is because I am 
not 100% sure and it is not part of this issue.
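
A rough sketch of how the two variants can be inspected side by side. The data, column names and local session are hypothetical. The code only prints the partition layout and the physical plans; it does not show whether the difference matters for any of the tests discussed in this thread.

```scala
import org.apache.spark.sql.SparkSession

object PartitioningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("partition-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the generated test data.
    val testData = (1 to 8).map(i => (i.toDouble, s"row$i"))

    // Partition count is fixed when the RDD is created; no shuffle is involved.
    val viaParallelize = spark.sparkContext.parallelize(testData, 4).toDF("label", "name")

    // Local relation first, then a shuffle that redistributes the rows.
    val viaRepartition = testData.toDF("label", "name").repartition(4)

    // Print how many rows land in each partition for the two variants.
    Seq("parallelize" -> viaParallelize, "repartition" -> viaRepartition).foreach {
      case (name, df) =>
        val sizes = df.rdd.glom().map(_.length).collect().mkString(", ")
        println(s"$name: ${df.rdd.getNumPartitions} partitions, rows per partition: $sizes")
    }

    // The physical plans differ: only the second contains an exchange (shuffle).
    viaParallelize.explain()
    viaRepartition.explain()

    spark.stop()
  }
}
```

Because `repartition` shuffles, the row-to-partition assignment (and therefore row order within the DataFrame) may differ from the `parallelize` variant even when the partition count is the same.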





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69391176
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala ---
@@ -282,9 +281,7 @@ class MLUtilsSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 val z = Vectors.dense(4.0).asML
 val p = (5.0, z)
 val w = Vectors.dense(6.0)
-val df = spark.createDataFrame(Seq(
-  (0, x, y, p, w)
-)).toDF("id", "x", "y", "p", "w")
+val df = Seq((0, x, y, p, w)).toDF("id", "x", "y", "p", "w")
   .withColumn("x", col("x"), metadata)
--- End diff --

Replace `col("x")` with `$"x"` or (better) `'x`
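
For reference, a minimal sketch of the three column-reference styles mentioned here. The DataFrame and the local session are hypothetical. `Symbol("x")` is used in place of the `'x` literal because symbol literals are deprecated as of Scala 2.13; both go through the same implicit conversion.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ColumnRefSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("column-sketch")
      .getOrCreate()
    import spark.implicits._ // provides the $"..." interpolator and the Symbol conversion

    val df = Seq((0, 1.0), (1, 2.0)).toDF("id", "x")

    val byFunction = df.select(col("x"))      // needs only the functions import
    val byInterpolator = df.select($"x")      // needs the implicits import
    val bySymbol = df.select(Symbol("x"))     // same conversion that backs 'x

    val expected = byFunction.collect().toSeq
    assert(byInterpolator.collect().toSeq == expected)
    assert(bySymbol.collect().toSeq == expected)

    spark.stop()
  }
}
```

All three resolve to the same column, so the choice is stylistic; `col("x")` has the small advantage of not depending on the implicits import.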





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69391151
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -52,23 +53,20 @@ class GeneralizedLinearRegressionSuite
 
 import GeneralizedLinearRegressionSuite._
 
-datasetGaussianIdentity = spark.createDataFrame(
-  sc.parallelize(generateGeneralizedLinearRegressionInput(
-intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = 
Array(2.9, 10.5),
-xVariance = Array(0.7, 1.2), nPoints = 1, seed, noiseLevel = 
0.01,
-family = "gaussian", link = "identity"), 2))
+datasetGaussianIdentity = 
sc.parallelize(generateGeneralizedLinearRegressionInput(
--- End diff --

Why is this `sc.parallelize` needed here? Why are `2` partitions used?





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69391139
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala ---
@@ -102,7 +103,7 @@ class VectorIndexerSuite extends SparkFunSuite with 
MLlibTestSparkContext
   }
 
   test("Cannot fit an empty DataFrame") {
-val rdd = spark.createDataFrame(sc.parallelize(Array.empty[Vector], 
2).map(FeatureData))
+val rdd = sc.parallelize(Array.empty[Vector], 
2).map(FeatureData).toDF()
--- End diff --

Do you need `sc.parallelize`?





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69391120
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
@@ -39,7 +40,7 @@ class StringIndexerSuite
 
   test("StringIndexer") {
 val data = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), 
(4, "a"), (5, "c")), 2)
--- End diff --

Could you remove `sc.parallelize`, too?





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390423
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderSuite.scala ---
@@ -29,10 +29,11 @@ import org.apache.spark.sql.types._
 
 class OneHotEncoderSuite
   extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+  import testImplicits._
 
   def stringIndexed(): DataFrame = {
 val data = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), 
(4, "a"), (5, "c")), 2)
--- End diff --

Remove `sc.parallelize`.





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390388
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala ---
@@ -61,7 +62,7 @@ class NormalizerSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defa
   Vectors.sparse(3, Seq())
 )
 
-dataFrame = spark.createDataFrame(sc.parallelize(data, 
2).map(NormalizerSuite.FeatureData))
+dataFrame = sc.parallelize(data, 
2).map(NormalizerSuite.FeatureData).toDF()
--- End diff --

Remove `sc.parallelize`





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390273
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala ---
@@ -57,8 +58,7 @@ class MinMaxScalerSuite extends SparkFunSuite with 
MLlibTestSparkContext with De
 
   test("MinMaxScaler arguments max must be larger than min") {
 withClue("arguments max must be larger than min") {
-  val dummyDF = spark.createDataFrame(Seq(
-(1, Vectors.dense(1.0, 2.0)))).toDF("id", "feature")
+  val dummyDF = Seq((1, Vectors.dense(1.0, 2.0))).toDF("id", "feature")
--- End diff --

It's just a column name, but for consistency...`features` (not `feature`) 
(unless there's a reason for this)





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390216
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala ---
@@ -44,7 +45,7 @@ class CountVectorizerSuite extends SparkFunSuite with 
MLlibTestSparkContext
   (3, split(""), Vectors.sparse(4, Seq())), // empty string
--- End diff --

Replace the comment `// empty string` with `val EMPTY_STRING = ""`





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390204
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/evaluation/RegressionEvaluatorSuite.scala
 ---
@@ -42,9 +43,10 @@ class RegressionEvaluatorSuite
  * data.map(x=> x.label + ", " + x.features(0) + ", " + x.features(1))
  *   .saveAsTextFile("path")
  */
-val dataset = spark.createDataFrame(
-  sc.parallelize(LinearDataGenerator.generateLinearInput(
-6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 100, 42, 
0.1), 2).map(_.asML))
+val dataset = sc.parallelize(
--- End diff --

Remove `sc.parallelize`





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390189
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/RandomForestClassifierSuite.scala
 ---
@@ -158,7 +159,7 @@ class RandomForestClassifierSuite
   }
 
   test("Fitting without numClasses in metadata") {
-val df: DataFrame = 
spark.createDataFrame(TreeTests.featureImportanceData(sc))
+val df: DataFrame = TreeTests.featureImportanceData(sc).toDF()
--- End diff --

Why is the type annotation needed here?





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390185
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -55,7 +56,7 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
 val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
 rdd = sc.parallelize(generateMultinomialLogisticInput(
   coefficients, xMean, xVariance, true, nPoints, 42), 2)
-dataset = spark.createDataFrame(rdd)
+dataset = rdd.toDF()
--- End diff --

Merge it with line 57.





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390178
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala 
---
@@ -47,7 +48,7 @@ class NaiveBayesSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defa
   Array(0.10, 0.10, 0.70, 0.10)  // label 2
 ).map(_.map(math.log))
 
-dataset = spark.createDataFrame(generateNaiveBayesInput(pi, theta, 
100, 42))
+dataset = generateNaiveBayesInput(pi, theta, 100, 42).toDF()
--- End diff --

Exactly my point above :)





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390173
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala
 ---
@@ -116,7 +117,7 @@ class MultilayerPerceptronClassifierSuite
 // the input seed is somewhat magic, to make this test pass
 val rdd = sc.parallelize(generateMultinomialLogisticInput(
   coefficients, xMean, xVariance, true, nPoints, 1), 2)
-val dataFrame = spark.createDataFrame(rdd).toDF("label", "features")
+val dataFrame = rdd.toDF("label", "features")
--- End diff --

Could we merge this line with 118? I don't think 118 needs `sc.parallelize`.





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390144
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -55,7 +56,7 @@ class LogisticRegressionSuite
 generateMultinomialLogisticInput(coefficients, xMean, xVariance,
   addIntercept = true, nPoints, 42)
 
-  spark.createDataFrame(sc.parallelize(testData, 4))
+  sc.parallelize(testData, 4).toDF()
--- End diff --

`testData.toDF.repartition(4)`?





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390147
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -869,8 +870,7 @@ class LogisticRegressionSuite
 }
   }
 
-  (spark.createDataFrame(sc.parallelize(data1, 4)),
-spark.createDataFrame(sc.parallelize(data2, 4)))
+  (sc.parallelize(data1, 4).toDF(), sc.parallelize(data2, 4).toDF())
--- End diff --

Same as above





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390132
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -134,15 +135,14 @@ class GBTClassifierSuite extends SparkFunSuite with MLlibTestSparkContext
   */
 
   test("Fitting without numClasses in metadata") {
-val df: DataFrame = spark.createDataFrame(TreeTests.featureImportanceData(sc))
+val df: DataFrame = TreeTests.featureImportanceData(sc).toDF()
 val gbt = new GBTClassifier().setMaxDepth(1).setMaxIter(1)
 gbt.fit(df)
--- End diff --

I wonder why this line is separate rather than part of line 139. Any reason?





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14035#discussion_r69390117
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/ClassifierSuite.scala ---
@@ -71,8 +71,7 @@ class ClassifierSuite extends SparkFunSuite with MLlibTestSparkContext {
 
   test("getNumClasses") {
 def getTestData(labels: Seq[Double]): DataFrame = {
--- End diff --

Repeated. What about moving it outside the `test` methods?





[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...

2016-07-03 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/14035

[SPARK-16356][ML] Add testImplicits for ML unit tests and promote toDF()

## What changes were proposed in this pull request?

This was suggested in 
https://github.com/apache/spark/commit/101663f1ae222a919fc40510aa4f2bad22d1be6f#commitcomment-17114968.

This PR adds `testImplicits` to `MLlibTestSparkContext` so that some 
implicits such as `toDF()` can be used across ML tests.

This PR also changes all the usages of `spark.createDataFrame( ... )` to 
`toDF()` where applicable in ml tests in Scala.
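
For readers unfamiliar with the pattern, here is a minimal sketch of why a helper object such as `testImplicits` is convenient. It deliberately avoids Spark: `Frame`, `Session`, `TestContext`, and `ExampleSuite` are hypothetical stand-ins (for `DataFrame`, `SparkSession`, `MLlibTestSparkContext`, and a test suite), not the actual classes, and the real trait may be implemented differently. The point it illustrates is only that an import prefix has to be a stable identifier, so a session held in a `var` cannot be imported from directly, whereas a nested object can.

```scala
// Hypothetical stand-ins for DataFrame and SparkSession; not Spark's API.
final case class Frame[A](rows: Seq[A])

final class Session {
  // Mirrors the shape of `spark.implicits`: conversions hang off the session instance.
  object implicits {
    implicit class SeqOps[A](rows: Seq[A]) {
      def toDF(): Frame[A] = Frame(rows)
    }
  }
}

// Shared test trait, playing the role of MLlibTestSparkContext in this sketch.
trait TestContext {
  // Assigned during setup. Because this is a `var`, `import session.implicits._`
  // does not compile; tests would first need `val s = this.session`.
  var session: Session = null

  def beforeAll(): Unit = { session = new Session }

  // A nested object is a stable identifier, so suites can import from it once,
  // at class level, before the session exists.
  protected object testImplicits {
    implicit class SeqOps[A](rows: Seq[A]) {
      // The session is only consulted when toDF() is actually called.
      def toDF(): Frame[A] = {
        require(session != null, "session not initialised")
        Frame(rows)
      }
    }
  }
}

class ExampleSuite extends TestContext {
  import testImplicits._ // one import shared by every test in the suite

  def labels(): Frame[Double] = Seq(0.0, 1.0, 1.0).toDF()
}
```

With this shape, a suite calls `beforeAll()` and then uses `toDF()` in any test without the per-test `val spark = this.spark; import spark.implicits._` preamble.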

## How was this patch tested?

Existing tests should work.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark minor-ml-test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14035.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14035


commit 79453ac4806bc55dc5ac57da1fb4c706cdd0a762
Author: hyukjinkwon 
Date:   2016-06-28T04:25:35Z

Promote toDF() instead of createDataFrame from a Product-type RDD

commit 902b9132a029df3f879d70e0f01c04b640a97ebc
Author: hyukjinkwon 
Date:   2016-06-28T04:35:49Z

Fix indentation

commit 0df2e44c1871ce30a29878450b0d2024779a3e73
Author: hyukjinkwon 
Date:   2016-06-29T03:31:15Z

Add some more tests to use toDF API

commit 4f1fc1cfdd9d3cd55ce56b852d5a4a6d6b7ea958
Author: hyukjinkwon 
Date:   2016-07-03T04:46:03Z

Fetch upstream

commit 5f7f85b40709eee0eb261edd24eaaef9b7fc3783
Author: hyukjinkwon 
Date:   2016-07-03T05:45:43Z

Fix some more cases

commit 52e7f1601df73dc35aac7627a6e0466b19cd8248
Author: hyukjinkwon 
Date:   2016-07-03T05:56:24Z

Take out the change in SQL and consistent imports

commit 54c27d4d359a7e6ad445856e06f15e29132d582c
Author: hyukjinkwon 
Date:   2016-07-03T06:12:35Z

Remove unused imports and cleanup nits



