[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141851806 Can you merge the master to resolve the conflicts? Also, add warning in training summary that it ignores the training weights currently (except for the objective trace). Other than those small items, LGTM. You may remove WIP. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8631#discussion_r39937401

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -510,4 +513,90 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
             .zip(testSummary.residuals.select("residuals").collect())
             .forall { case (Row(r1: Double), Row(r2: Double)) => r1 ~== r2 relTol 1E-5 }
         }
    +
    +  test("linear regression with weighted samples"){
    +    val (data, weightedData) = {
    +      val activeData = LinearDataGenerator.generateLinearInput(
    +        6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 500, 1, 0.1)
    +
    +      val rnd = new Random(8392)
    +      val signedData = activeData map { case p: LabeledPoint =>
    +        (rnd.nextGaussian() > 0.0, p)
    +      }
    +
    +      val data1 = signedData flatMap {
    +        case (true, p) => Iterator(p, p)
    +        case (false, p) => Iterator(p)
    +      }
    +
    +      val weightedSignedData = signedData flatMap {
    +        case (true, LabeledPoint(label, features)) =>
    +          Iterator(
    +            Instance(label, 1.2, features),
    +            Instance(label, 0.8, features)
    +          )
    +        case (false, LabeledPoint(label, features)) =>
    +          Iterator(
    +            Instance(label, 0.3, features),
    +            Instance(label, 0.1, features),
    +            Instance(label, 0.6, features)
    +          )
    +      }
    +
    +      val noiseData = LinearDataGenerator.generateLinearInput(
    +        2, Array(1, 3), Array(0.9, -1.3), Array(0.7, 1.2), 500, 1, 0.1)
    +      val weightedNoiseData = noiseData map {
    +        case LabeledPoint(label, features) => Instance(label, 0, features)
    +      }
    +      val data2 = weightedSignedData ++ weightedNoiseData
    +
    +      (sqlContext.createDataFrame(sc.parallelize(data1, 4)),
    +        sqlContext.createDataFrame(sc.parallelize(data2, 4)))
    +    }
    +
    +    val trainer1a = (new LinearRegression).setFitIntercept(true)
    +      .setElasticNetParam(0.38).setRegParam(0.21).setStandardization(true)
    +    val trainer1b = (new LinearRegression).setFitIntercept(true).setWeightCol("weight")
    +      .setElasticNetParam(0.38).setRegParam(0.21).setStandardization(true)
    +    val model1a0 = trainer1a.fit(data)
    +    val model1a1 = trainer1a.fit(weightedData)
    +    val model1b = trainer1b.fit(weightedData)
    +    assert(model1a0.weights !~= model1a1.weights absTol 1E-3)
    +    assert(model1a0.intercept !~= model1a1.intercept absTol 1E-3)
    +    assert(model1a0.weights ~== model1b.weights absTol 1E-3)
    +    assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
    +
    +    val trainer2a = (new LinearRegression).setFitIntercept(true)
    +      .setElasticNetParam(0.38).setRegParam(0.21).setStandardization(false)
    +    val trainer2b = (new LinearRegression).setFitIntercept(true).setWeightCol("weight")
    +      .setElasticNetParam(0.38).setRegParam(0.21).setStandardization(false)
    +    val model2a0 = trainer2a.fit(data)
    +    val model2a1 = trainer2a.fit(weightedData)
    +    val model2b = trainer2b.fit(weightedData)
    +    assert(model2a0.weights !~= model2a1.weights absTol 1E-3)
    +    assert(model2a0.intercept !~= model2a1.intercept absTol 1E-3)
    +    assert(model2a0.weights ~== model2b.weights absTol 1E-3)
    +    assert(model2a0.intercept ~== model2b.intercept absTol 1E-3)
    +
    +    val trainer3a = (new LinearRegression).setFitIntercept(false)
    +      .setElasticNetParam(0.38).setRegParam(0.21).setStandardization(true)
    +    val trainer3b = (new LinearRegression).setFitIntercept(false).setWeightCol("weight")
    +      .setElasticNetParam(0.38).setRegParam(0.21).setStandardization(true)
    +    val model3a0 = trainer3a.fit(data)
    +    val model3a1 = trainer3a.fit(weightedData)
    +    val model3b = trainer3b.fit(weightedData)
    +    assert(model3a0.weights !~= model3a1.weights absTol 1E-3)
    +    assert(model3a0.weights ~== model3b.weights absTol 1E-3)
    +
    +    val trainer4a = (new LinearRegression).setFitIntercept(false)
    +      .setElasticNetParam(0.38).setRegParam(0.21).setStandardization(false)
    +    val trainer4b = (new LinearRegression).setFitIntercept(false).setWeightCol("weight")
    +      .setElasticNetParam(0.38).setRegParam(0.21).setStandardization(false)
    +    val model4a0 = trainer4a.fit(data)
    +    val model4a1 = trainer4a.fit(weightedData)
    +    val model4b = trainer4b.fit(weightedData)
    +    assert(model4a0.weights !~= model4a1.weights absTol 1E-3)
    +    assert(model4a0.weights ~== model4b.weights absTol 1E-3)
    +
    --- End diff --

    remove this extra line.
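The test in this diff relies on one property: a sample repeated twice should train the same model as that sample split into instances whose weights sum to 2 (1.2 + 0.8), and zero-weight "noise" rows should have no effect. A minimal sketch of that property follows, assuming plain unregularized one-feature least squares rather than the elastic-net `LinearRegression` under review. `Point`, `fit`, and `close` are hypothetical names for illustration, not Spark APIs.

```scala
object WeightedFitSketch {
  // Hypothetical single-feature stand-in for the Instance(label, weight, features) in the diff.
  final case class Point(label: Double, weight: Double, x: Double)

  /** Closed-form weighted least squares for label = slope * x + intercept (no regularization). */
  def fit(points: Seq[Point]): (Double, Double) = {
    val wSum = points.map(_.weight).sum
    val xMean = points.map(p => p.weight * p.x).sum / wSum
    val yMean = points.map(p => p.weight * p.label).sum / wSum
    val sxy = points.map(p => p.weight * (p.x - xMean) * (p.label - yMean)).sum
    val sxx = points.map(p => p.weight * (p.x - xMean) * (p.x - xMean)).sum
    val slope = sxy / sxx
    (slope, yMean - slope * xMean)
  }

  /** True when both (slope, intercept) pairs agree within an absolute tolerance. */
  def close(a: (Double, Double), b: (Double, Double), tol: Double = 1e-9): Boolean =
    math.abs(a._1 - b._1) <= tol && math.abs(a._2 - b._2) <= tol

  private val base = Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)) // (x, label)

  // Unweighted data: every other point appears twice, all weights are 1.0.
  val duplicated: Seq[Point] = base.zipWithIndex.flatMap { case ((x, y), i) =>
    if (i % 2 == 0) Seq(Point(y, 1.0, x), Point(y, 1.0, x)) else Seq(Point(y, 1.0, x))
  }

  // Weighted data: the same points split into fractional weights summing to 2.0 or 1.0,
  // plus zero-weight noise rows that a weight-aware fit must ignore.
  val weighted: Seq[Point] = base.zipWithIndex.flatMap { case ((x, y), i) =>
    if (i % 2 == 0) Seq(Point(y, 1.2, x), Point(y, 0.8, x))
    else Seq(Point(y, 0.3, x), Point(y, 0.1, x), Point(y, 0.6, x))
  } ++ Seq(Point(100.0, 0.0, 0.5), Point(-50.0, 0.0, 9.0))

  def main(args: Array[String]): Unit = {
    val ignoringWeights = weighted.map(_.copy(weight = 1.0))
    println(close(fit(duplicated), fit(weighted)))        // weights respected
    println(close(fit(duplicated), fit(ignoringWeights))) // weights ignored
  }
}
```

The two comparisons correspond to the `~==` and `!~=` assertions in the test: fitting with the weights should match the duplicated data, while fitting the weighted rows as if every weight were 1.0 should not.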
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8631#discussion_r39937392

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -510,4 +513,90 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
             .zip(testSummary.residuals.select("residuals").collect())
             .forall { case (Row(r1: Double), Row(r2: Double)) => r1 ~== r2 relTol 1E-5 }
         }
    +
    +  test("linear regression with weighted samples"){
    +    val (data, weightedData) = {
    +      val activeData = LinearDataGenerator.generateLinearInput(
    +        6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 500, 1, 0.1)
    +
    +      val rnd = new Random(8392)
    +      val signedData = activeData map { case p: LabeledPoint =>
    +        (rnd.nextGaussian() > 0.0, p)
    +      }
    +
    +      val data1 = signedData flatMap {
    +        case (true, p) => Iterator(p, p)
    +        case (false, p) => Iterator(p)
    +      }
    +
    +      val weightedSignedData = signedData flatMap {
    +        case (true, LabeledPoint(label, features)) =>
    +          Iterator(
    +            Instance(label, 1.2, features),
    +            Instance(label, 0.8, features)
    +          )
    +        case (false, LabeledPoint(label, features)) =>
    +          Iterator(
    +            Instance(label, 0.3, features),
    +            Instance(label, 0.1, features),
    +            Instance(label, 0.6, features)
    +          )
    +      }
    +
    +      val noiseData = LinearDataGenerator.generateLinearInput(
    +        2, Array(1, 3), Array(0.9, -1.3), Array(0.7, 1.2), 500, 1, 0.1)
    +      val weightedNoiseData = noiseData map {
    +        case LabeledPoint(label, features) => Instance(label, 0, features)
    --- End diff --

    Make `case LabeledPoint(label, features) => Instance(label, weight = 0.0, features)` for easier readability.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8631#discussion_r39937361

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -510,4 +513,90 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
             .zip(testSummary.residuals.select("residuals").collect())
             .forall { case (Row(r1: Double), Row(r2: Double)) => r1 ~== r2 relTol 1E-5 }
         }
    +
    +  test("linear regression with weighted samples"){
    +    val (data, weightedData) = {
    +      val activeData = LinearDataGenerator.generateLinearInput(
    +        6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 500, 1, 0.1)
    +
    +      val rnd = new Random(8392)
    +      val signedData = activeData map { case p: LabeledPoint =>
    +        (rnd.nextGaussian() > 0.0, p)
    +      }
    +
    +      val data1 = signedData flatMap {
    --- End diff --

    ditto
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8631#discussion_r39937357

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -510,4 +513,90 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
             .zip(testSummary.residuals.select("residuals").collect())
             .forall { case (Row(r1: Double), Row(r2: Double)) => r1 ~== r2 relTol 1E-5 }
         }
    +
    +  test("linear regression with weighted samples"){
    +    val (data, weightedData) = {
    +      val activeData = LinearDataGenerator.generateLinearInput(
    +        6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 500, 1, 0.1)
    +
    +      val rnd = new Random(8392)
    +      val signedData = activeData map { case p: LabeledPoint =>
    --- End diff --

    Please use `activeData.map`
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8631#discussion_r39937365

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
    @@ -510,4 +513,90 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
             .zip(testSummary.residuals.select("residuals").collect())
             .forall { case (Row(r1: Double), Row(r2: Double)) => r1 ~== r2 relTol 1E-5 }
         }
    +
    +  test("linear regression with weighted samples"){
    +    val (data, weightedData) = {
    +      val activeData = LinearDataGenerator.generateLinearInput(
    +        6.3, Array(4.7, 7.2), Array(0.9, -1.3), Array(0.7, 1.2), 500, 1, 0.1)
    +
    +      val rnd = new Random(8392)
    +      val signedData = activeData map { case p: LabeledPoint =>
    +        (rnd.nextGaussian() > 0.0, p)
    +      }
    +
    +      val data1 = signedData flatMap {
    +        case (true, p) => Iterator(p, p)
    +        case (false, p) => Iterator(p)
    +      }
    +
    +      val weightedSignedData = signedData flatMap {
    --- End diff --

    ditto
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8631#discussion_r39937291

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
    @@ -598,17 +629,14 @@ private class LeastSquaresCostFun(
         featuresMean: Array[Double],
         effectiveL2regParam: Double) extends DiffFunction[BDV[Double]] {
    -  override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
    -    val w = Vectors.fromBreeze(weights)
    +  override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = {
    +    val coeff = Vectors.fromBreeze(coefficients)
    -    val leastSquaresAggregator = data.treeAggregate(new LeastSquaresAggregator(w, labelStd,
    +    val leastSquaresAggregator = data.treeAggregate(new LeastSquaresAggregator(coeff, labelStd,
           labelMean, fitIntercept, featuresStd, featuresMean))(
    -        seqOp = (c, v) => (c, v) match {
    -          case (aggregator, (label, features)) => aggregator.add(label, features)
    -        },
    -        combOp = (c1, c2) => (c1, c2) match {
    -          case (aggregator1, aggregator2) => aggregator1.merge(aggregator2)
    -        })
    +        seqOp = (aggregator, instance) => aggregator.add(instance),
    +        combOp = (aggregator1, aggregator2) => aggregator1.merge(aggregator2)
    +    )
    --- End diff --

    Move `)` to the end of line `combOp`
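The simplified `seqOp`/`combOp` pair in this diff follows the usual aggregate contract: `seqOp` folds one instance into a per-partition aggregator, and `combOp` merges two partial aggregators. A minimal sketch of that contract on plain Scala collections follows. It assumes a hypothetical one-coefficient `LossAggregator` and a hand-written `aggregate` helper standing in for `RDD.treeAggregate`. Neither is the code under review.

```scala
object TreeAggregateSketch {
  // Hypothetical single-feature instance; the PR's Instance carries a feature vector.
  final case class Instance(label: Double, weight: Double, feature: Double)

  // Immutable aggregator of weighted squared error for one fixed coefficient (illustration only).
  final case class LossAggregator(coeff: Double, weightSum: Double = 0.0, lossSum: Double = 0.0) {
    def add(instance: Instance): LossAggregator = {
      val diff = coeff * instance.feature - instance.label
      copy(weightSum = weightSum + instance.weight,
        lossSum = lossSum + instance.weight * diff * diff / 2.0)
    }
    def merge(other: LossAggregator): LossAggregator =
      copy(weightSum = weightSum + other.weightSum, lossSum = lossSum + other.lossSum)
    def loss: Double = lossSum / weightSum
  }

  // Stand-in for treeAggregate: fold each partition with seqOp, then combine partials with combOp.
  def aggregate[T, U](partitions: Seq[Seq[T]], zero: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  def lossOf(partitions: Seq[Seq[Instance]], coeff: Double): Double =
    aggregate(partitions, LossAggregator(coeff))(
      seqOp = (aggregator, instance) => aggregator.add(instance),
      combOp = (aggregator1, aggregator2) => aggregator1.merge(aggregator2)).loss

  def main(args: Array[String]): Unit = {
    val data = Seq(Instance(2.0, 1.0, 1.0), Instance(4.0, 2.0, 2.0), Instance(7.0, 0.5, 3.0))
    println(lossOf(Seq(data), 2.0))                           // one partition
    println(lossOf(Seq(data.take(1), data.drop(1)), 2.0))     // two partitions
  }
}
```

Because `add` and `merge` only accumulate sums, the result should not depend on how the instances are partitioned, which is the property the aggregation relies on.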
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8631#discussion_r39937180

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
    @@ -31,21 +31,30 @@ import org.apache.spark.ml.util.Identifiable
     import org.apache.spark.mllib.evaluation.RegressionMetrics
     import org.apache.spark.mllib.linalg.{Vector, Vectors}
     import org.apache.spark.mllib.linalg.BLAS._
    -import org.apache.spark.mllib.regression.LabeledPoint
     import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
     import org.apache.spark.rdd.RDD
     import org.apache.spark.sql.{DataFrame, Row}
    -import org.apache.spark.sql.functions.{col, udf}
    -import org.apache.spark.sql.types.StructField
    +import org.apache.spark.sql.functions.{col, udf, lit}
     import org.apache.spark.storage.StorageLevel
    -import org.apache.spark.util.StatCounter
    +
    --- End diff --

    remove extra line
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8631#discussion_r39937145

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
    @@ -520,28 +544,28 @@ private class LeastSquaresAggregator(
        * Add a new training data to this LeastSquaresAggregator, and update the loss and gradient
        * of the objective function.
        *
    -   * @param label The label for this data point.
    -   * @param data The features for one data point in dense/sparse vector format to be added
    -   *             into this aggregator.
    +   * @param data The data point to be added.
        * @return This LeastSquaresAggregator object.
    --- End diff --

    make `data` as `instance`
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8631#discussion_r39937155

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
    @@ -493,26 +515,28 @@ private class LeastSquaresAggregator(
         featuresMean: Array[Double]) extends Serializable {
       private var totalCnt: Long = 0L
    +  private var weightSum: Double = 0
    --- End diff --

    `private var weightSum: Double = 0.0`
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8631#discussion_r39937140

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
    @@ -520,28 +544,28 @@ private class LeastSquaresAggregator(
        * Add a new training data to this LeastSquaresAggregator, and update the loss and gradient
        * of the objective function.
        *
    -   * @param label The label for this data point.
    -   * @param data The features for one data point in dense/sparse vector format to be added
    -   *             into this aggregator.
    +   * @param data The data point to be added.
        * @return This LeastSquaresAggregator object.
        */
    -  def add(label: Double, data: Vector): this.type = {
    -    require(dim == data.size, s"Dimensions mismatch when adding new sample." +
    -      s" Expecting $dim but got ${data.size}.")
    +  def add(data: Instance): this.type = data match { case Instance(label, weight, features) =>
    +    require(dim == features.size, s"Dimensions mismatch when adding new sample." +
    +      s" Expecting $dim but got ${features.size}.")
    +    require(weight >= 0.0, s"instance weight, ${weight} has to be >= 0.0")
    --- End diff --

    Please add `if (weight == 0) return this`.
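The requested early return lets a zero-weight instance skip all accumulation work while negative weights are still rejected. A minimal sketch of that control flow follows, assuming a hypothetical `CountingAggregator` that only tracks a count and a weight sum. The real `LeastSquaresAggregator` also updates the loss and gradient, which is omitted here.

```scala
object ZeroWeightSketch {
  // Hypothetical stand-in for the PR's Instance, using a plain array for the features.
  final case class Instance(label: Double, weight: Double, features: Array[Double])

  // Illustration only: tracks how many instances contributed and their total weight.
  final class CountingAggregator(dim: Int) {
    private var totalCnt: Long = 0L
    private var weightSum: Double = 0.0

    def add(instance: Instance): this.type = {
      require(dim == instance.features.length, s"Dimensions mismatch when adding new sample." +
        s" Expecting $dim but got ${instance.features.length}.")
      require(instance.weight >= 0.0, s"instance weight, ${instance.weight} has to be >= 0.0")
      // Zero-weight instances contribute nothing, so return before touching any state.
      if (instance.weight == 0.0) return this
      totalCnt += 1
      weightSum += instance.weight
      this
    }

    def count: Long = totalCnt
    def totalWeight: Double = weightSum
  }

  def main(args: Array[String]): Unit = {
    val agg = new CountingAggregator(2)
    agg.add(Instance(1.0, 0.0, Array(1.0, 2.0))).add(Instance(3.0, 1.5, Array(0.5, 0.5)))
    println(s"count=${agg.count} totalWeight=${agg.totalWeight}")
  }
}
```

Placing the check after the two `require` calls keeps the validation in force for every instance, including the ones that are then skipped.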
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141533658 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141533663 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42670/ Test PASSed.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141533529

    [Test build #42670 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42670/console) for PR 8631 at commit [`854d0bb`](https://github.com/apache/spark/commit/854d0bb58d0a6b43135ce9e750e4f9df36a65003).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141492966 [Test build #42670 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42670/consoleFull) for PR 8631 at commit [`854d0bb`](https://github.com/apache/spark/commit/854d0bb58d0a6b43135ce9e750e4f9df36a65003).
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141489800 Merged build started.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141489727 Merged build triggered.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141374317 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42640/ Test FAILed.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141374237

    [Test build #42640 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42640/console) for PR 8631 at commit [`1f731c2`](https://github.com/apache/spark/commit/1f731c28ad8a59f3bf432435253dc7b0984f46b4).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class AFTSurvivalRegression @Since("1.6.0") (@Since("1.6.0") override val uid: String)`
      * ` require(censor == 1.0 || censor == 0.0, "censor of class AFTPoint must be 1.0 or 0.0")`
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141374315 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141345895 [Test build #42640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42640/consoleFull) for PR 8631 at commit [`1f731c2`](https://github.com/apache/spark/commit/1f731c28ad8a59f3bf432435253dc7b0984f46b4).
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141344287 Merged build triggered.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141344297 Merged build started.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user rotationsymmetry commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141344261 @dbtsai Thanks for the comment on indentation. I have fixed it in the patch.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141301253 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42621/ Test PASSed.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141301251 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141301161

    [Test build #42621 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42621/console) for PR 8631 at commit [`2afa2a1`](https://github.com/apache/spark/commit/2afa2a190368adb99ec398c64744fc7dafc98bed).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Interaction(override val uid: String) extends Transformer`
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/8631#discussion_r39812153
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -123,30 +132,41 @@ class LinearRegression(override val uid: String)
 def setTol(value: Double): this.type = set(tol, value)
 setDefault(tol -> 1E-6)
+ /**
+ * Whether to over-/under-sample training instances according to the given weights in weightCol.
+ * If empty, all instances are treated equally (weight 1.0).
+ * Default is empty, so all instances have weight one.
+ * @group setParam
+ */
+ def setWeightCol(value: String): this.type = set(weightCol, value)
+ setDefault(weightCol -> "")
+
 override protected def train(dataset: DataFrame): LinearRegressionModel = {
 // Extract columns from data. If dataset is persisted, do not persist instances.
-val instances = extractLabeledPoints(dataset).map {
- case LabeledPoint(label: Double, features: Vector) => (label, features)
+val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))
+val instances: RDD[Instance] = dataset.select(col($(labelCol)), w, col($(featuresCol))).map {
+ case Row(label: Double, weight: Double, features: Vector) =>
+Instance(label, weight, features)
 }
+
 val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
 if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
-val (summarizer, statCounter) = instances.treeAggregate(
- (new MultivariateOnlineSummarizer, new StatCounter))(
-seqOp = (c, v) => (c, v) match {
- case ((summarizer: MultivariateOnlineSummarizer, statCounter: StatCounter),
- (label: Double, features: Vector)) =>
-(summarizer.add(features), statCounter.merge(label))
- },
-combOp = (c1, c2) => (c1, c2) match {
- case ((summarizer1: MultivariateOnlineSummarizer, statCounter1: StatCounter),
- (summarizer2: MultivariateOnlineSummarizer, statCounter2: StatCounter)) =>
-(summarizer1.merge(summarizer2), statCounter1.merge(statCounter2))
- })
-
-val numFeatures = summarizer.mean.size
-val yMean = statCounter.mean
-val yStd = math.sqrt(statCounter.variance)
+val (featuresSummarizer, ySummarizer) = {
+ val seqOp = (c: (MultivariateOnlineSummarizer, MultivariateOnlineSummarizer),
+ instance: Instance) =>
+(c._1.add(instance.features, instance.weight),
+ c._2.add(Vectors.dense(instance.label), instance.weight))
+ val combOp = (c1: (MultivariateOnlineSummarizer, MultivariateOnlineSummarizer),
+c2: (MultivariateOnlineSummarizer, MultivariateOnlineSummarizer)) =>
+(c1._1.merge(c2._1), c1._2.merge(c2._2))
--- End diff --
ditto
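The `seqOp`/`combOp` pair in the diff above folds weighted instances into summarizers and then merges the per-partition results. A minimal standalone sketch of that pattern in plain Scala follows. `WeightedStats` and `WeightedLabelStats` are hypothetical names introduced for illustration; they are a simplified stand-in for, not a copy of, Spark's `MultivariateOnlineSummarizer`, and the nested `Seq` merely simulates partitions.

```scala
// Mirrors the field order used in the diff: Instance(label, weight, features).
// Array[Double] is a stand-in for Spark's Vector type (assumption for self-containment).
final case class Instance(label: Double, weight: Double, features: Array[Double])

// Hypothetical weighted accumulator for a single scalar:
// tracks the sum of weights, the weighted sum, and the weighted sum of squares.
final case class WeightedStats(weightSum: Double = 0.0, sum: Double = 0.0, sumSq: Double = 0.0) {
  // Fold one observation x with weight w into the accumulator.
  def add(x: Double, w: Double): WeightedStats =
    WeightedStats(weightSum + w, sum + w * x, sumSq + w * x * x)

  // Combine two partial accumulators (associative and commutative).
  def merge(other: WeightedStats): WeightedStats =
    WeightedStats(weightSum + other.weightSum, sum + other.sum, sumSq + other.sumSq)

  def mean: Double = sum / weightSum

  // Population-style weighted variance (no bias correction).
  def variance: Double = sumSq / weightSum - mean * mean
}

object WeightedLabelStats {
  // Per-partition step: add one instance's label with its weight.
  val seqOp: (WeightedStats, Instance) => WeightedStats =
    (acc, inst) => acc.add(inst.label, inst.weight)

  // Cross-partition step: merge two partial results.
  val combOp: (WeightedStats, WeightedStats) => WeightedStats =
    (a, b) => a.merge(b)

  // Simulates an aggregate over partitions with ordinary collections.
  def summarize(partitions: Seq[Seq[Instance]]): WeightedStats =
    partitions.map(_.foldLeft(WeightedStats())(seqOp)).foldLeft(WeightedStats())(combOp)
}
```

With integer weights, this accumulator gives the same mean and variance as repeating each row by its weight, which is the property the weighted-samples test in this pull request relies on.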
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/8631#discussion_r39812144
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -123,30 +132,41 @@ class LinearRegression(override val uid: String)
 def setTol(value: Double): this.type = set(tol, value)
 setDefault(tol -> 1E-6)
+ /**
+ * Whether to over-/under-sample training instances according to the given weights in weightCol.
+ * If empty, all instances are treated equally (weight 1.0).
+ * Default is empty, so all instances have weight one.
+ * @group setParam
+ */
+ def setWeightCol(value: String): this.type = set(weightCol, value)
+ setDefault(weightCol -> "")
+
 override protected def train(dataset: DataFrame): LinearRegressionModel = {
 // Extract columns from data. If dataset is persisted, do not persist instances.
-val instances = extractLabeledPoints(dataset).map {
- case LabeledPoint(label: Double, features: Vector) => (label, features)
+val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))
+val instances: RDD[Instance] = dataset.select(col($(labelCol)), w, col($(featuresCol))).map {
+ case Row(label: Double, weight: Double, features: Vector) =>
+Instance(label, weight, features)
 }
+
 val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
 if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
-val (summarizer, statCounter) = instances.treeAggregate(
- (new MultivariateOnlineSummarizer, new StatCounter))(
-seqOp = (c, v) => (c, v) match {
- case ((summarizer: MultivariateOnlineSummarizer, statCounter: StatCounter),
- (label: Double, features: Vector)) =>
-(summarizer.add(features), statCounter.merge(label))
- },
-combOp = (c1, c2) => (c1, c2) match {
- case ((summarizer1: MultivariateOnlineSummarizer, statCounter1: StatCounter),
- (summarizer2: MultivariateOnlineSummarizer, statCounter2: StatCounter)) =>
-(summarizer1.merge(summarizer2), statCounter1.merge(statCounter2))
- })
-
-val numFeatures = summarizer.mean.size
-val yMean = statCounter.mean
-val yStd = math.sqrt(statCounter.variance)
+val (featuresSummarizer, ySummarizer) = {
+ val seqOp = (c: (MultivariateOnlineSummarizer, MultivariateOnlineSummarizer),
+ instance: Instance) =>
--- End diff --
indentation. see LoR for example.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141258078 [Test build #42621 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42621/consoleFull) for PR 8631 at commit [`2afa2a1`](https://github.com/apache/spark/commit/2afa2a190368adb99ec398c64744fc7dafc98bed).
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141257039 Merged build started.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141257022 Merged build triggered.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141217999 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141218002 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42611/ Test FAILed.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141217897
[Test build #42611 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42611/console) for PR 8631 at commit [`3f98247`](https://github.com/apache/spark/commit/3f98247801368a86aaffabd78b3755bf36fab330).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141163296 [Test build #42611 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42611/consoleFull) for PR 8631 at commit [`3f98247`](https://github.com/apache/spark/commit/3f98247801368a86aaffabd78b3755bf36fab330).
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141160787 Merged build started.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141160754 Merged build triggered.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141160556 Jenkins, retest this please
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user rotationsymmetry commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-141136871 retest this please. "org.apache.spark.HeartbeatReceiverSuite.reregister if heartbeat from removed executor" failed, which should be unrelated to this patch.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-140996440
[Test build #42579 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42579/console) for PR 8631 at commit [`3f98247`](https://github.com/apache/spark/commit/3f98247801368a86aaffabd78b3755bf36fab330).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-140996513 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42579/ Test FAILed.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-140996512 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-140973032 [Test build #42579 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42579/consoleFull) for PR 8631 at commit [`3f98247`](https://github.com/apache/spark/commit/3f98247801368a86aaffabd78b3755bf36fab330).
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-140972488 Merged build started.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-140972473 Merged build triggered.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user rotationsymmetry commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-140972365 @dbtsai Thank you for your comments. I have revised the patch. Please test.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/8631#discussion_r39580975
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -123,30 +123,48 @@ class LinearRegression(override val uid: String)
 def setTol(value: Double): this.type = set(tol, value)
 setDefault(tol -> 1E-6)
+ /**
+ * Whether to over-/undersamples each of training instance according to the given
--- End diff --
The doc is changed in LoR. Please sync with that.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/8631#discussion_r39580918
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -123,30 +123,48 @@ class LinearRegression(override val uid: String)
 def setTol(value: Double): this.type = set(tol, value)
 setDefault(tol -> 1E-6)
+ /**
+ * Whether to over-/undersamples each of training instance according to the given
+ * weight in `weightCol`. If empty, all samples are supposed to have weights as 1.0.
+ * Default is empty, so all samples have weight one.
+ * @group setParam
+ */
+ def setWeightCol(value: String): this.type = set(weightCol, value)
+ setDefault(weightCol -> "")
+
 override protected def train(dataset: DataFrame): LinearRegressionModel = {
 // Extract columns from data. If dataset is persisted, do not persist instances.
-val instances = extractLabeledPoints(dataset).map {
--- End diff --
Use `lit` and `col` to simplify the code. See the example in LoR.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/8631#discussion_r39580848
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -572,7 +591,7 @@ private class LeastSquaresAggregator(
 this
 }
- def count: Long = totalCnt
+ def count: Double = totalCnt
--- End diff --
We decided to keep `count` as it is, and add `weightSum`.
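The suggestion above keeps `count` as an integer number of instances and tracks the total weight separately. A minimal sketch of an aggregator shaped that way follows. `SquaredLossAggregator` is a hypothetical, heavily simplified class (no gradient, no standardization), and normalizing the loss by `weightSum` is an assumption made for illustration rather than something stated in this thread.

```scala
// Array[Double] stands in for Spark's Vector type so the sketch has no dependencies.
final case class Instance(label: Double, weight: Double, features: Array[Double])

// Hypothetical simplified aggregator: accumulates a weighted squared loss for
// fixed coefficients while keeping the instance count and the weight sum apart.
final class SquaredLossAggregator(coefficients: Array[Double], intercept: Double) {
  private var totalCnt: Long = 0L          // number of instances seen
  private var weightSumAcc: Double = 0.0   // sum of instance weights
  private var lossSum: Double = 0.0        // sum of w * (prediction - label)^2 / 2

  def add(instance: Instance): this.type = {
    require(instance.features.length == coefficients.length, "dimension mismatch")
    val prediction =
      instance.features.zip(coefficients).map { case (x, b) => x * b }.sum + intercept
    val diff = prediction - instance.label
    totalCnt += 1
    weightSumAcc += instance.weight
    lossSum += instance.weight * diff * diff / 2.0
    this
  }

  def merge(other: SquaredLossAggregator): this.type = {
    totalCnt += other.totalCnt
    weightSumAcc += other.weightSumAcc
    lossSum += other.lossSum
    this
  }

  def count: Long = totalCnt               // stays a Long: rows, not weights
  def weightSum: Double = weightSumAcc     // new accessor for the total weight
  def loss: Double = lossSum / weightSumAcc
}
```

Keeping the two accessors separate means a zero-weight row still counts as an instance but contributes nothing to the loss or to the normalizer.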
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on a diff in the pull request:
https://github.com/apache/spark/pull/8631#discussion_r39580880
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -589,7 +608,7 @@ private class LeastSquaresAggregator(
 * It's used in Breeze's convex optimization routines.
 */
 private class LeastSquaresCostFun(
-data: RDD[(Double, Vector)],
+data: RDD[(Double, Vector, Double)],
--- End diff --
Refactor the `Instance` case class out from LoR, and use it for code readability.
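The readability argument for a case class over a positional tuple can be sketched as follows. This is an illustrative comparison only: `TupleVsInstance` and both helper functions are hypothetical, and `Array[Double]` stands in for Spark's `Vector`. The tuple order matches the diff above, `(label, features, weight)`, and the case class follows the `Instance(label, weight, features)` order used later in this pull request.

```scala
object TupleVsInstance {
  // Named fields make the role of each Double explicit.
  final case class Instance(label: Double, weight: Double, features: Array[Double])

  // Tuple version: the reader must remember that the third element is the weight.
  def weightedLabelSumTuples(data: Seq[(Double, Array[Double], Double)]): Double =
    data.map { case (label, _, weight) => label * weight }.sum

  // Case-class version: the same computation, with no positional guessing.
  def weightedLabelSum(data: Seq[Instance]): Double =
    data.map(instance => instance.label * instance.weight).sum
}
```

With two `Double` fields in the record, swapping label and weight in the tuple form still compiles, which is the kind of mistake named fields help avoid.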
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-140583259 Hello, the weighted `MultivariateOnlineSummarizer` has been merged, which unblocks you. Thanks.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user rotationsymmetry commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-139587297 @dbtsai Thank you for OKing the test. My patch depends on the `MultivariateOnlineSummarizer` in your PR for applying weights to logistic regression ([link](https://github.com/apache/spark/pull/7884)). My patch should be OK to test after your PR is merged.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-139481651
[Test build #42318 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42318/console) for PR 8631 at commit [`e9093cb`](https://github.com/apache/spark/commit/e9093cbea2554fbc124899a58e3cbfdade5ea795).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class WeightedLabeledPoint(label: Double, features: Vector, weight: Double)`
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-139481653 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-139481656 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42318/ Test FAILed.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-139481021 [Test build #42318 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42318/consoleFull) for PR 8631 at commit [`e9093cb`](https://github.com/apache/spark/commit/e9093cbea2554fbc124899a58e3cbfdade5ea795).
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-139480026 Merged build started.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-139479994 Merged build triggered.
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-139479960 ok to test
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-139479926 Jenkins, add to whitelist
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8631#issuecomment-138123811 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-9642] [ML] [WIP] LinearRegression shoul...
GitHub user rotationsymmetry opened a pull request:
https://github.com/apache/spark/pull/8631
[SPARK-9642] [ML] [WIP] LinearRegression should supported weighted data
In many modeling applications, data points are not necessarily sampled with equal probabilities. Linear regression should support weighting that accounts for over- or under-sampling.
Work in progress.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rotationsymmetry/spark SPARK-9642
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8631.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #8631
commit e9093cbea2554fbc124899a58e3cbfdade5ea795
Author: Meihua Wu
Date: 2015-09-06T15:15:55Z
[WIP] Add support for weighted sample and associated test.
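The motivation above can be stated as a weighted least-squares objective. The form below is a sketch under assumptions: the symbols $w_i$, $x_i$, $y_i$, $\beta$, $\beta_0$ are introduced here for illustration, the regularization term is omitted, and normalizing by the sum of weights is one common convention, not something this pull request description confirms.

```latex
% Weighted squared loss over n instances (regularization omitted):
L_w(\beta, \beta_0)
  = \frac{1}{2 \sum_{i=1}^{n} w_i}
    \sum_{i=1}^{n} w_i \left( y_i - x_i^{\top}\beta - \beta_0 \right)^2

% If every w_i is a positive integer, repeating row i exactly w_i times gives
% a dataset with N = \sum_i w_i rows whose unweighted loss is identical:
\frac{1}{2N} \sum_{j=1}^{N} \left( \tilde{y}_j - \tilde{x}_j^{\top}\beta - \beta_0 \right)^2
  = L_w(\beta, \beta_0)
```

Under this convention, a weight of 2 acts like over-sampling a row twice and a weight of 0 removes it from the fit, which matches the over-/under-sampling reading in the description.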