[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14035 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r80449593

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala ---
    @@ -29,8 +29,7 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext
       with DefaultReadWriteTest {
       test("Test Chi-Square selector") {
    -    val spark = this.spark
    -    import spark.implicits._
    +    import testImplicits._
    --- End diff --

    Nit: this import should be moved out of the test function so that it can be shared by all test cases if necessary.
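The nit above can be illustrated without Spark. The following is a minimal plain-Scala sketch, not the PR's implementation: `FakeSession`, `FakeFrame`, `FakeTestContext` and `ExampleSuite` are hypothetical stand-ins. It assumes, as the removed `val spark = this.spark` line suggests, that the session in the test trait is a `var` set up shortly before the tests run, so it cannot be imported from at class level, whereas a nested object can.

```scala
// Hypothetical stand-ins for a session and a DataFrame; not Spark classes.
final case class FakeSession(name: String)
final case class FakeFrame(rows: Seq[Any], session: FakeSession)

trait FakeTestContext {
  // The session is a var that is only assigned in beforeAll(), so
  // `import session.something._` would not compile (not a stable identifier).
  var session: FakeSession = null

  def beforeAll(): Unit = { session = FakeSession("test") }

  // A nested object is a stable path, so it can be imported at class level.
  // The session is only read when a conversion is actually invoked.
  object testImplicits {
    implicit class SeqToFrame[A](rows: Seq[A]) {
      def toDF(): FakeFrame = FakeFrame(rows, session)
    }
  }
}

class ExampleSuite extends FakeTestContext {
  // One class-level import, shared by every test method below.
  import testImplicits._

  def buildFrame(): FakeFrame = Seq(1, 2, 3).toDF()
}

object StableImportDemo {
  def main(args: Array[String]): Unit = {
    val suite = new ExampleSuite
    suite.beforeAll()
    println(suite.buildFrame())
  }
}
```

With this shape, each test body no longer needs its own `val spark = this.spark` followed by an import.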
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r80380667

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala ---
    @@ -85,11 +87,13 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext
         checkPair(densePoints1Seq, sparsePoints1Seq)
         checkPair(densePoints2Seq, sparsePoints2Seq)
    -    densePoints1 = spark.createDataFrame(sc.parallelize(densePoints1Seq, 2).map(FeatureData))
    -    sparsePoints1 = spark.createDataFrame(sc.parallelize(sparsePoints1Seq, 2).map(FeatureData))
    -    densePoints2 = spark.createDataFrame(sc.parallelize(densePoints2Seq, 2).map(FeatureData))
    -    sparsePoints2 = spark.createDataFrame(sc.parallelize(sparsePoints2Seq, 2).map(FeatureData))
    -    badPoints = spark.createDataFrame(sc.parallelize(badPointsSeq, 2).map(FeatureData))
    +    densePoints1 = densePoints1Seq.map(FeatureData).toDF()
    +    sparsePoints1 = sparsePoints1Seq.map(FeatureData).toDF()
    +    // TODO: If we directly use `toDF` without parallelize, the test in
    +    // "Throws error when given RDDs with different size vectors" is failed for an unknown reason.
    +    densePoints2 = sc.parallelize(densePoints2Seq, 2).map(FeatureData).toDF()
    --- End diff --

    BTW, a test fails for a reason I have not identified when I change this to `densePoints2Seq.map(FeatureData).toDF()`.
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r80379623

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
    @@ -55,7 +56,7 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
         val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
         rdd = sc.parallelize(generateMultinomialLogisticInput(
           coefficients, xMean, xVariance, true, nPoints, 42), 2)
    -    dataset = spark.createDataFrame(rdd)
    +    dataset = rdd.toDF()
    --- End diff --

    It seems that `rdd` itself is used in the tests.
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r80376849

    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala ---
    @@ -282,9 +281,7 @@ class MLUtilsSuite extends SparkFunSuite with MLlibTestSparkContext {
         val z = Vectors.dense(4.0).asML
         val p = (5.0, z)
         val w = Vectors.dense(6.0)
    -    val df = spark.createDataFrame(Seq(
    -      (0, x, y, p, w)
    -    )).toDF("id", "x", "y", "p", "w")
    +    val df = Seq((0, x, y, p, w)).toDF("id", "x", "y", "p", "w")
           .withColumn("x", col("x"), metadata)
    --- End diff --

    We prefer to use `col("x")` for DataFrame operations.
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r80376739

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala ---
    @@ -57,8 +58,7 @@ class MinMaxScalerSuite extends SparkFunSuite with MLlibTestSparkContext with De
       test("MinMaxScaler arguments max must be larger than min") {
         withClue("arguments max must be larger than min") {
    -      val dummyDF = spark.createDataFrame(Seq(
    -        (1, Vectors.dense(1.0, 2.0)))).toDF("id", "feature")
    +      val dummyDF = Seq((1, Vectors.dense(1.0, 2.0))).toDF("id", "feature")
    --- End diff --

    +1 @jaceklaskowski
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r80376695

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/RegressionEvaluatorSuite.scala ---
    @@ -42,9 +43,10 @@ class RegressionEvaluatorSuite
          *   data.map(x=> x.label + ", " + x.features(0) + ", " + x.features(1))
          *     .saveAsTextFile("path")
          */
    -    val dataset = spark.createDataFrame(
    -      sc.parallelize(LinearDataGenerator.generateLinearInput(
    -        6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 100, 42, 0.1), 2).map(_.asML))
    +    val dataset = sc.parallelize(
    --- End diff --

    +1 @jaceklaskowski
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r80376680

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
    @@ -116,7 +117,7 @@ class MultilayerPerceptronClassifierSuite
         // the input seed is somewhat magic, to make this test pass
         val rdd = sc.parallelize(generateMultinomialLogisticInput(
           coefficients, xMean, xVariance, true, nPoints, 1), 2)
    -    val dataFrame = spark.createDataFrame(rdd).toDF("label", "features")
    +    val dataFrame = rdd.toDF("label", "features")
    --- End diff --

    +1 @jaceklaskowski
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69432798

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
    @@ -55,7 +56,7 @@ class LogisticRegressionSuite
             generateMultinomialLogisticInput(coefficients, xMean, xVariance,
               addIntercept = true, nPoints, 42)
    -      spark.createDataFrame(sc.parallelize(testData, 4))
    +      sc.parallelize(testData, 4).toDF()
    --- End diff --

    It'd be nice to know the purpose of the explicit partition setting.
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69400558

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/RandomForestClassifierSuite.scala ---
    @@ -158,7 +159,7 @@ class RandomForestClassifierSuite
       }
       test("Fitting without numClasses in metadata") {
    -    val df: DataFrame = spark.createDataFrame(TreeTests.featureImportanceData(sc))
    +    val df: DataFrame = TreeTests.featureImportanceData(sc).toDF()
    --- End diff --

    I agree with this too, but judging from the discussion in https://github.com/apache/spark/pull/12452, both forms seem to be fine.
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69400523

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
    @@ -116,7 +117,7 @@ class MultilayerPerceptronClassifierSuite
         // the input seed is somewhat magic, to make this test pass
         val rdd = sc.parallelize(generateMultinomialLogisticInput(
           coefficients, xMean, xVariance, true, nPoints, 1), 2)
    -    val dataFrame = spark.createDataFrame(rdd).toDF("label", "features")
    +    val dataFrame = rdd.toDF("label", "features")
    --- End diff --

    Again, I agree with this, but I am hesitant to change it because it is set explicitly.
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69400465

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
    @@ -55,7 +56,7 @@ class LogisticRegressionSuite
             generateMultinomialLogisticInput(coefficients, xMean, xVariance,
               addIntercept = true, nPoints, 42)
    -      spark.createDataFrame(sc.parallelize(testData, 4))
    +      sc.parallelize(testData, 4).toDF()
    --- End diff --

    Strictly speaking, I think `sc.parallelize(testData, 4).toDF()` and `testData.toDF.repartition(4)` are not exactly the same. It seems the author of this test code intended to set the initial number of partitions to `4` explicitly. Although I agree with your point, I left it as it is because I am not 100% sure and it is outside the scope of this issue.
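The distinction drawn in this comment can be sketched as follows. This is a hedged illustration rather than code from the PR: it assumes a local Spark 2.x or later session, and the object name `PartitionSketch`, the helper `build`, and the column names are made up for the example. Both DataFrames end up with four partitions, but the first gets them when the RDD is created, while the second starts from local data and adds a shuffle step, which the `explain()` output makes visible.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionSketch {
  // Builds the same rows two ways; names here are illustrative only.
  def build(spark: SparkSession, data: Seq[(Double, Int)]): (DataFrame, DataFrame) = {
    import spark.implicits._
    // Partition count is fixed when the RDD is created; no shuffle is added.
    val fromRdd = spark.sparkContext.parallelize(data, 4).toDF("label", "feature")
    // Local data is converted first, then redistributed by an explicit shuffle.
    val repartitioned = data.toDF("label", "feature").repartition(4)
    (fromRdd, repartitioned)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partition-sketch")
      .getOrCreate()
    try {
      val data = (1 to 100).map(i => (i.toDouble, i))
      val (fromRdd, repartitioned) = build(spark, data)
      println(fromRdd.rdd.getNumPartitions)
      println(repartitioned.rdd.getNumPartitions)
      // The physical plans differ: only the second contains an exchange.
      fromRdd.explain()
      repartitioned.explain()
    } finally {
      spark.stop()
    }
  }
}
```

Which rows land in which partition also differs between the two, which may matter for tests that depend on partition contents.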
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69391176

    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala ---
    @@ -282,9 +281,7 @@ class MLUtilsSuite extends SparkFunSuite with MLlibTestSparkContext {
         val z = Vectors.dense(4.0).asML
         val p = (5.0, z)
         val w = Vectors.dense(6.0)
    -    val df = spark.createDataFrame(Seq(
    -      (0, x, y, p, w)
    -    )).toDF("id", "x", "y", "p", "w")
    +    val df = Seq((0, x, y, p, w)).toDF("id", "x", "y", "p", "w")
           .withColumn("x", col("x"), metadata)
    --- End diff --

    Replace `col("x")` with `$"x"` or (better) `'x`
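For readers unfamiliar with the spellings being compared, here is a small sketch of the column references under discussion. It assumes a local Spark session; `ColumnRefSketch` and `selections` are names invented for this example. The symbol form `'x` is shown only in a comment because, as far as I know, symbol literals are deprecated in Scala 2.13 and unavailable in Scala 3, so it would not compile everywhere.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ColumnRefSketch {
  // Returns the same single-column selection written in two ways.
  def selections(spark: SparkSession): Seq[DataFrame] = {
    import spark.implicits._
    val df = Seq((0, 1.0), (1, 2.0)).toDF("id", "x")
    Seq(
      df.select(col("x")), // plain function call; works without the implicits import
      df.select($"x")      // string interpolator provided by the implicits import
      // df.select('x)     // symbol form; relies on Scala 2 symbol literals
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("column-ref-sketch")
      .getOrCreate()
    try selections(spark).foreach(_.show())
    finally spark.stop()
  }
}
```

All forms resolve to the same column, so the choice is one of style, which is what the reviewers are debating.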
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69391151

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
    @@ -52,23 +53,20 @@ class GeneralizedLinearRegressionSuite
         import GeneralizedLinearRegressionSuite._
    -    datasetGaussianIdentity = spark.createDataFrame(
    -      sc.parallelize(generateGeneralizedLinearRegressionInput(
    -        intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5),
    -        xVariance = Array(0.7, 1.2), nPoints = 1, seed, noiseLevel = 0.01,
    -        family = "gaussian", link = "identity"), 2))
    +    datasetGaussianIdentity = sc.parallelize(generateGeneralizedLinearRegressionInput(
    --- End diff --

    Why is this `sc.parallelize` needed here? Why are `2` partitions used?
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69391139

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala ---
    @@ -102,7 +103,7 @@ class VectorIndexerSuite extends SparkFunSuite with MLlibTestSparkContext
       }
       test("Cannot fit an empty DataFrame") {
    -    val rdd = spark.createDataFrame(sc.parallelize(Array.empty[Vector], 2).map(FeatureData))
    +    val rdd = sc.parallelize(Array.empty[Vector], 2).map(FeatureData).toDF()
    --- End diff --

    Do you need `sc.parallelize`?
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69391120

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala ---
    @@ -39,7 +40,7 @@ class StringIndexerSuite
       test("StringIndexer") {
         val data = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")), 2)
    --- End diff --

    Could you remove `sc.parallelize`, too?
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390423

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderSuite.scala ---
    @@ -29,10 +29,11 @@ import org.apache.spark.sql.types._
     class OneHotEncoderSuite
       extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
    +  import testImplicits._
       def stringIndexed(): DataFrame = {
         val data = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")), 2)
    --- End diff --

    Remove `sc.parallelize`.
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390388

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala ---
    @@ -61,7 +62,7 @@ class NormalizerSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
           Vectors.sparse(3, Seq())
         )
    -    dataFrame = spark.createDataFrame(sc.parallelize(data, 2).map(NormalizerSuite.FeatureData))
    +    dataFrame = sc.parallelize(data, 2).map(NormalizerSuite.FeatureData).toDF()
    --- End diff --

    Remove `sc.parallelize`
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390273

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinMaxScalerSuite.scala ---
    @@ -57,8 +58,7 @@ class MinMaxScalerSuite extends SparkFunSuite with MLlibTestSparkContext with De
       test("MinMaxScaler arguments max must be larger than min") {
         withClue("arguments max must be larger than min") {
    -      val dummyDF = spark.createDataFrame(Seq(
    -        (1, Vectors.dense(1.0, 2.0)))).toDF("id", "feature")
    +      val dummyDF = Seq((1, Vectors.dense(1.0, 2.0))).toDF("id", "feature")
    --- End diff --

    It's just a column name, but for consistency it should be `features`, not `feature` (unless there's a reason for this).
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390216

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala ---
    @@ -44,7 +45,7 @@ class CountVectorizerSuite extends SparkFunSuite with MLlibTestSparkContext
           (3, split(""), Vectors.sparse(4, Seq())), // empty string
    --- End diff --

    Replace the comment `// empty string` with `val EMPTY_STRING = ""`
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390204

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/evaluation/RegressionEvaluatorSuite.scala ---
    @@ -42,9 +43,10 @@ class RegressionEvaluatorSuite
          *   data.map(x=> x.label + ", " + x.features(0) + ", " + x.features(1))
          *     .saveAsTextFile("path")
          */
    -    val dataset = spark.createDataFrame(
    -      sc.parallelize(LinearDataGenerator.generateLinearInput(
    -        6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 100, 42, 0.1), 2).map(_.asML))
    +    val dataset = sc.parallelize(
    --- End diff --

    Remove `sc.parallelize`
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390189

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/RandomForestClassifierSuite.scala ---
    @@ -158,7 +159,7 @@ class RandomForestClassifierSuite
       }
       test("Fitting without numClasses in metadata") {
    -    val df: DataFrame = spark.createDataFrame(TreeTests.featureImportanceData(sc))
    +    val df: DataFrame = TreeTests.featureImportanceData(sc).toDF()
    --- End diff --

    Why is the type annotation needed here?
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390185

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
    @@ -55,7 +56,7 @@ class OneVsRestSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
         val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
         rdd = sc.parallelize(generateMultinomialLogisticInput(
           coefficients, xMean, xVariance, true, nPoints, 42), 2)
    -    dataset = spark.createDataFrame(rdd)
    +    dataset = rdd.toDF()
    --- End diff --

    Merge it with line 57.
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390178

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
    @@ -47,7 +48,7 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
           Array(0.10, 0.10, 0.70, 0.10) // label 2
         ).map(_.map(math.log))
    -    dataset = spark.createDataFrame(generateNaiveBayesInput(pi, theta, 100, 42))
    +    dataset = generateNaiveBayesInput(pi, theta, 100, 42).toDF()
    --- End diff --

    Exactly my point above :)
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390173

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifierSuite.scala ---
    @@ -116,7 +117,7 @@ class MultilayerPerceptronClassifierSuite
         // the input seed is somewhat magic, to make this test pass
         val rdd = sc.parallelize(generateMultinomialLogisticInput(
           coefficients, xMean, xVariance, true, nPoints, 1), 2)
    -    val dataFrame = spark.createDataFrame(rdd).toDF("label", "features")
    +    val dataFrame = rdd.toDF("label", "features")
    --- End diff --

    Could we merge this line with line 118? I don't think line 118 needs `sc.parallelize`.
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390144

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
    @@ -55,7 +56,7 @@ class LogisticRegressionSuite
             generateMultinomialLogisticInput(coefficients, xMean, xVariance,
               addIntercept = true, nPoints, 42)
    -      spark.createDataFrame(sc.parallelize(testData, 4))
    +      sc.parallelize(testData, 4).toDF()
    --- End diff --

    `testData.toDF.repartition(4)`?
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390147

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
    @@ -869,8 +870,7 @@ class LogisticRegressionSuite
             }
           }
    -      (spark.createDataFrame(sc.parallelize(data1, 4)),
    -        spark.createDataFrame(sc.parallelize(data2, 4)))
    +      (sc.parallelize(data1, 4).toDF(), sc.parallelize(data2, 4).toDF())
    --- End diff --

    Same as above
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390132

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
    @@ -134,15 +135,14 @@ class GBTClassifierSuite extends SparkFunSuite with MLlibTestSparkContext
        */
       test("Fitting without numClasses in metadata") {
    -    val df: DataFrame = spark.createDataFrame(TreeTests.featureImportanceData(sc))
    +    val df: DataFrame = TreeTests.featureImportanceData(sc).toDF()
         val gbt = new GBTClassifier().setMaxDepth(1).setMaxIter(1)
         gbt.fit(df)
    --- End diff --

    I wonder why this line is separate rather than part of line 139. Any reason?
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14035#discussion_r69390117

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/ClassifierSuite.scala ---
    @@ -71,8 +71,7 @@ class ClassifierSuite extends SparkFunSuite with MLlibTestSparkContext {
       test("getNumClasses") {
         def getTestData(labels: Seq[Double]): DataFrame = {
    --- End diff --

    Repeated. What about moving it outside the `test` methods?
[GitHub] spark pull request #14035: [SPARK-16356][ML] Add testImplicits for ML unit t...
GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/14035

    [SPARK-16356][ML] Add testImplicits for ML unit tests and promote toDF()

    ## What changes were proposed in this pull request?

    This was suggested in https://github.com/apache/spark/commit/101663f1ae222a919fc40510aa4f2bad22d1be6f#commitcomment-17114968.

    This PR adds `testImplicits` to `MLlibTestSparkContext` so that implicits such as `toDF()` can be used across the ML tests.

    This PR also changes all usages of `spark.createDataFrame( ... )` to `toDF()` where applicable in the Scala ML tests.

    ## How was this patch tested?

    Existing tests should work.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark minor-ml-test

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14035.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14035

commit 79453ac4806bc55dc5ac57da1fb4c706cdd0a762
Author: hyukjinkwon
Date: 2016-06-28T04:25:35Z

    Promote toDF() instead of createDataFrame from a Product-type RDD

commit 902b9132a029df3f879d70e0f01c04b640a97ebc
Author: hyukjinkwon
Date: 2016-06-28T04:35:49Z

    Fix indentation

commit 0df2e44c1871ce30a29878450b0d2024779a3e73
Author: hyukjinkwon
Date: 2016-06-29T03:31:15Z

    Add some more tests to use toDF API

commit 4f1fc1cfdd9d3cd55ce56b852d5a4a6d6b7ea958
Author: hyukjinkwon
Date: 2016-07-03T04:46:03Z

    Fetch upstream

commit 5f7f85b40709eee0eb261edd24eaaef9b7fc3783
Author: hyukjinkwon
Date: 2016-07-03T05:45:43Z

    Fix some more cases

commit 52e7f1601df73dc35aac7627a6e0466b19cd8248
Author: hyukjinkwon
Date: 2016-07-03T05:56:24Z

    Take out the change in SQL and consistent imports

commit 54c27d4d359a7e6ad445856e06f15e29132d582c
Author: hyukjinkwon
Date: 2016-07-03T06:12:35Z

    Remove unused imports and cleanup nits
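The rewrite this pull request describes, from `spark.createDataFrame( ... )` to `toDF()`, can be sketched outside the test suites as follows. This is an illustration under assumptions, not code from the patch: it uses `import spark.implicits._` on a locally created session in place of the `testImplicits` object the PR adds, and `ToDFSketch`, `build`, and the sample rows are invented names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ToDFSketch {
  // Builds the same two-column DataFrame in the old and the new style.
  def build(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val rows = Seq((1, "a"), (2, "b"))
    // Old style: explicit call on the session, then rename the columns.
    val viaCreate = spark.createDataFrame(rows).toDF("id", "value")
    // New style: implicit conversion on the local collection.
    val viaToDF = rows.toDF("id", "value")
    (viaCreate, viaToDF)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("todf-sketch")
      .getOrCreate()
    try {
      val (viaCreate, viaToDF) = build(spark)
      viaCreate.printSchema()
      viaToDF.printSchema()
    } finally {
      spark.stop()
    }
  }
}
```

The two forms are intended to produce equivalent DataFrames; the second is shorter and does not need a reference to the session at the call site once the implicits are in scope.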