[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78252660

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -311,8 +348,25 @@ class LogisticRegression @Since("1.2.0") (
     val histogram = labelSummarizer.histogram
     val numInvalid = labelSummarizer.countInvalid
-    val numClasses = histogram.length
     val numFeatures = summarizer.mean.size
+    val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures
+
+    val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
+      case Some(n: Int) =>
+        require(n >= histogram.length, s"Specified number of classes $n was " +
+          s"less than the number of unique labels ${histogram.length}.")
+        n
+      case None => histogram.length
+    }
+    val isBinaryClassification = numClasses == 1 || numClasses == 2
+    val isMultinomial = ($(family) == "auto" && !isBinaryClassification) ||
+      ($(family) == "multinomial")
+    val numCoefficientSets = if (isMultinomial) numClasses else 1
+
--- End diff --

I switched it up. Let me know if that looks better.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
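A rough Python mirror of the class-count resolution in the diff above may make the shape derivation easier to follow. This is illustrative only: the function name and signature are invented, not part of the Spark API.

```python
def resolve_counts(histogram_len, metadata_num_classes, family, fit_intercept, num_features):
    """Mirror of the quoted Scala logic: prefer the class count from label
    metadata when present, then derive the problem shape from it."""
    if metadata_num_classes is not None:
        # Mirrors the require(...) in the diff: metadata may declare more
        # classes than appear in the data, but never fewer.
        assert metadata_num_classes >= histogram_len, (
            "Specified number of classes %d was less than the number of "
            "unique labels %d." % (metadata_num_classes, histogram_len))
        num_classes = metadata_num_classes
    else:
        num_classes = histogram_len
    is_binary = num_classes in (1, 2)
    is_multinomial = (family == "auto" and not is_binary) or family == "multinomial"
    # Binomial training uses a single coefficient set; multinomial uses one per class.
    num_coefficient_sets = num_classes if is_multinomial else 1
    num_features_plus_intercept = num_features + 1 if fit_intercept else num_features
    return num_classes, is_multinomial, num_coefficient_sets, num_features_plus_intercept
```

For example, three observed labels under `family = "auto"` force the multinomial path with one coefficient set per class, while two labels stay on the binomial path with a single set.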
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78243482

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -333,22 +387,18 @@ class LogisticRegression @Since("1.2.0") (
     val isConstantLabel = histogram.count(_ != 0) == 1
-    if (numClasses > 2) {
-      val msg = s"LogisticRegression with ElasticNet in ML package only supports " +
-        s"binary classification. Found $numClasses in the input dataset. Consider using " +
-        s"MultinomialLogisticRegression instead."
-      logError(msg)
-      throw new SparkException(msg)
-    } else if ($(fitIntercept) && numClasses == 2 && isConstantLabel) {
-      logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
-        s"zeros and the intercept will be positive infinity; as a result, " +
-        s"training is not needed.")
-      (Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
-    } else if ($(fitIntercept) && numClasses == 1) {
-      logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " +
-        s"zeros and the intercept will be negative infinity; as a result, " +
-        s"training is not needed.")
-      (Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+    if ($(fitIntercept) && isConstantLabel) {
+      logWarning(s"All labels are the same value and fitIntercept=true, so the coefficients " +
+        s"will be zeros. Training is not needed.")
+      val constantLabelIndex = Vectors.dense(histogram).argmax
+      val coefficientMatrix = Matrices.sparse(numCoefficientSets, numFeatures,
+        Array.fill(numFeatures + 1)(0), Array.empty[Int], Array.empty[Double])
--- End diff --

I added this logic and also added a test to check it. We can add to that test when we complete the compressed logic.
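The constant-label shortcut above hinges on finding which single label value occurs. A small Python sketch of that step (the helper name is invented; the real code uses `Vectors.dense(histogram).argmax`):

```python
def constant_label_index(histogram):
    """Given a label histogram where only one entry is nonzero
    (the isConstantLabel case in the diff), return that label's index."""
    nonzero = [c for c in histogram if c != 0.0]
    assert len(nonzero) == 1, "not a constant-label problem"
    # argmax over the histogram picks the lone observed label.
    return max(range(len(histogram)), key=lambda i: histogram[i])
```

With all coefficients fixed at zero, the model then only needs the intercept of that class to dominate, so training can be skipped entirely.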
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78243270

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (
       val bcFeaturesStd = instances.context.broadcast(featuresStd)
       val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-        $(standardization), bcFeaturesStd, regParamL2, multinomial = false, $(aggregationDepth))
+        $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+        $(aggregationDepth))

       val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
         new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
       } else {
         val standardizationParam = $(standardization)
         def regParamL1Fun = (index: Int) => {
           // Remove the L1 penalization on the intercept
-          if (index == numFeatures) {
+          val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0)
+          if (isIntercept) {
             0.0
           } else {
             if (standardizationParam) {
               regParamL1
             } else {
+              val featureIndex = if ($(fitIntercept)) {
+                index % numFeaturesPlusIntercept
+              } else {
+                index % numFeatures
+              }
               // If `standardization` is false, we still standardize the data
               // to improve the rate of convergence; as a result, we have to
               // perform this reverse standardization by penalizing each component
               // differently to get effectively the same objective function when
               // the training dataset is not standardized.
-              if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+              if (featuresStd(featureIndex) != 0.0) {
+                regParamL1 / featuresStd(featureIndex)
+              } else {
+                0.0
+              }
             }
           }
         }
         new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
       }

       val initialCoefficientsWithIntercept =
-        Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
-
-      if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
-        val vecSize = optInitialModel.get.coefficients.size
-        logWarning(
-          s"Initial coefficients will be ignored!! As its size $vecSize did not match the " +
-          s"expected size $numFeatures")
+        Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
+
+      val initialModelIsValid = optInitialModel.exists { model =>
+        val providedCoefs = model.coefficientMatrix
+        val modelValid = (providedCoefs.numRows == numCoefficientSets) &&
--- End diff --

See above comment and let me know if you think it needs to be changed.
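The `regParamL1Fun` index arithmetic in the diff above is easy to get wrong, so here is a hedged Python re-statement of it. The flat coefficient vector stores one block per class, each block holding `numFeatures` coefficients plus (optionally) a trailing intercept, so intercept slots are exactly those where `index + 1` is a multiple of the block stride. The function name and argument layout are invented for illustration.

```python
def reg_param_l1(index, num_features, fit_intercept, standardization,
                 features_std, reg_l1):
    """Per-index L1 penalty over the flattened coefficient vector,
    mirroring regParamL1Fun in the quoted diff."""
    stride = num_features + 1 if fit_intercept else num_features
    # Intercepts occupy the last slot of each per-class block and are
    # never L1-penalized.
    if fit_intercept and (index + 1) % stride == 0:
        return 0.0
    if standardization:
        return reg_l1
    # Without standardization, undo the internal scaling by dividing the
    # penalty by that feature's standard deviation.
    feature_index = index % stride
    std = features_std[feature_index]
    return reg_l1 / std if std != 0.0 else 0.0
```

With 2 features and an intercept the stride is 3, so indices 2, 5, 8, ... are intercept slots and get zero penalty, while the remaining indices map back to features via `index % stride`.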
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78243195

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (
       (... same hunk as above, continuing ...)

       val initialCoefficientsWithIntercept =
-        Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
-
-      if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
-        val vecSize = optInitialModel.get.coefficients.size
-        logWarning(
-          s"Initial coefficients will be ignored!! As its size $vecSize did not match the " +
-          s"expected size $numFeatures")
+        Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
+
+      val initialModelIsValid = optInitialModel.exists { model =>
+        val providedCoefs = model.coefficientMatrix
+        val modelValid = (providedCoefs.numRows == numCoefficientSets) &&
+          (providedCoefs.numCols == numFeatures) &&
+          (model.interceptVector.size == numCoefficientSets)
+        if (!modelValid) {
+          logWarning(s"Initial coefficients will be ignored! Its dimensions " +
+            s"(${providedCoefs.numRows}, ${providedCoefs.numCols}) did not match the expected " +
+            s"size ($numCoefficientSets, $numFeatures)")
+        }
+        modelValid
+      }
--- End diff --

I changed this up to use pattern matching as you suggested, and to check the `fitIntercept`.
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78243127

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -452,6 +555,7 @@ class LogisticRegression @Since("1.2.0") (
         logError(msg)
         throw new SparkException(msg)
       }
+      bcFeaturesStd.destroy(blocking = false)
--- End diff --

Done.
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78243150

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (
       (... same hunk as above, continuing ...)

-      if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
-        val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
-        optInitialModel.get.coefficients.foreachActive { case (index, value) =>
-          initialCoefficientsWithInterceptArray(index) = value
+      if (initialModelIsValid) {
+        val initialCoefArray = initialCoefficientsWithIntercept.toArray
+        val providedCoef = optInitialModel.get.coefficientMatrix
+        providedCoef.foreachActive { (row, col, value) =>
+          val flatIndex = row * numFeaturesPlusIntercept + col
+          // We need to scale the coefficients since they will be trained in the scaled space
+          initialCoefArray(flatIndex) = value * featuresStd(col)
--- End diff --

Done.
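The initial-model unrolling quoted above can be sketched in Python. Each `(row, col)` entry of the provided coefficient matrix lands at `row * stride + col` in the flat optimization vector, scaled by that feature's standard deviation because the optimizer works in the standardized space. The helper name and list-of-lists matrix representation are illustrative assumptions, not the Spark API.

```python
def flatten_initial_coefs(coef_matrix, features_std, fit_intercept):
    """Unroll a (numCoefficientSets x numFeatures) coefficient matrix into
    the flat vector layout used during optimization, mirroring the diff.
    coef_matrix is a list of rows, one per coefficient set."""
    num_features = len(features_std)
    # Each per-class block reserves one extra trailing slot for the intercept.
    stride = num_features + 1 if fit_intercept else num_features
    flat = [0.0] * (len(coef_matrix) * stride)
    for row, coefs in enumerate(coef_matrix):
        for col, value in enumerate(coefs):
            # Scale into the standardized space the optimizer trains in.
            flat[row * stride + col] = value * features_std[col]
    return flat
```

Intercept slots are left at zero here; the quoted diff likewise initializes them separately from the coefficient entries.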
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78243092

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -460,33 +564,74 @@ class LogisticRegression @Since("1.2.0") (
           as a result, no scaling is needed.
          */
         val rawCoefficients = state.x.toArray.clone()
-        var i = 0
-        while (i < numFeatures) {
-          rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 }
-          i += 1
+        val coefficientArray = Array.tabulate(numCoefficientSets * numFeatures) { i =>
+          // flatIndex will loop though rawCoefficients, and skip the intercept terms.
+          val flatIndex = if ($(fitIntercept)) i + i / numFeatures else i
+          val featureIndex = i % numFeatures
+          if (featuresStd(featureIndex) != 0.0) {
+            rawCoefficients(flatIndex) / featuresStd(featureIndex)
+          } else {
+            0.0
+          }
+        }
+        val coefficientMatrix =
+          new DenseMatrix(numCoefficientSets, numFeatures, coefficientArray, isTransposed = true)
+
+        if ($(regParam) == 0.0 && isMultinomial) {
+          /*
+            When no regularization is applied, the coefficients lack identifiability because
+            we do not use a pivot class. We can add any constant value to the coefficients and
+            get the same likelihood. So here, we choose the mean centered coefficients for
+            reproducibility. This method follows the approach in glmnet, described here:
+
+            Friedman, et al. "Regularization Paths for Generalized Linear Models via
+              Coordinate Descent," https://core.ac.uk/download/files/153/6287975.pdf
+          */
+          val coefficientMean = coefficientMatrix.values.sum / coefficientMatrix.values.length
+          coefficientMatrix.update(_ - coefficientMean)
         }
-        bcFeaturesStd.destroy(blocking = false)

-        if ($(fitIntercept)) {
-          (Vectors.dense(rawCoefficients.dropRight(1)).compressed, rawCoefficients.last,
-            arrayBuilder.result())
+        val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+          Array.tabulate(numCoefficientSets) { i =>
+            val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+            rawCoefficients(coefIndex)
+          }
+        } else {
+          Array[Double]()
+        }
+        /*
+          The intercepts are never regularized, so we always center the mean.
+        */
+        val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+          val interceptMean = interceptsArray.sum / numClasses
+          interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+          Vectors.dense(interceptsArray)
+        } else if (interceptsArray.length == 1) {
+          Vectors.dense(interceptsArray)
         } else {
-          (Vectors.dense(rawCoefficients).compressed, 0.0, arrayBuilder.result())
+          Vectors.sparse(numCoefficientSets, Seq())
         }
+        (coefficientMatrix, interceptVector, arrayBuilder.result())
       }
     }

     if (handlePersistence) instances.unpersist()

-    val model = copyValues(new LogisticRegressionModel(uid, coefficients, intercept))
-    val (summaryModel, probabilityColName) = model.findSummaryModelAndProbabilityCol()
-    val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
-      summaryModel.transform(dataset),
-      probabilityColName,
-      $(labelCol),
-      $(featuresCol),
-      objectiveHistory)
--- End diff --

Done.
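The post-training unscaling in the diff above walks the `numCoefficientSets * numFeatures` coefficient slots while skipping intercept slots via `flatIndex = i + i / numFeatures`. A hedged Python sketch of that traversal (names invented for illustration):

```python
def unscale_coefficients(raw, num_sets, num_features, fit_intercept, features_std):
    """Map the raw (standardized-space) solution back to the original
    feature scale, skipping the per-class intercept slots as in the diff."""
    out = []
    for i in range(num_sets * num_features):
        # Each completed row of num_features coefficients is followed by one
        # intercept slot, so shift the read index by one per completed row.
        flat_index = i + i // num_features if fit_intercept else i
        feature_index = i % num_features
        std = features_std[feature_index]
        out.append(raw[flat_index] / std if std != 0.0 else 0.0)
    return out
```

For example, with 2 sets, 2 features, and an intercept, the raw layout is `[c00, c01, b0, c10, c11, b1]` and the traversal reads indices 0, 1, 3, 4, dividing each by the matching feature's std.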
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78243118

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -460,33 +564,74 @@ class LogisticRegression @Since("1.2.0") (
       (... same hunk as above, continuing ...)

+        val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+          Array.tabulate(numCoefficientSets) { i =>
+            val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+            rawCoefficients(coefIndex)
+          }
+        } else {
+          Array[Double]()
+        }
+        /*
+          The intercepts are never regularized, so we always center the mean.
+        */
+        val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+          val interceptMean = interceptsArray.sum / numClasses
+          interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+          Vectors.dense(interceptsArray)
--- End diff --

Done.
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78233010

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (
       (... same hunk as above, continuing ...)

       val initialCoefficientsWithIntercept =
-        Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
+        Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
+
+      val initialModelIsValid = optInitialModel.exists { model =>
--- End diff --

TBH, I think `initialModelIsValid` is better. `isValidInitialModel` or `isInitialModelValid` sound like questions (more useful if they were methods that returned an answer), when this is actually a val that contains the answer. `if (initialModelIsValid)` reads more naturally than `if (isValidInitialModel)` also. That said, it's a private variable, so it's not a big issue. If others feel strongly then I can change it.
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78224587

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (
       (... same hunk as above, continuing ...)

       val initialCoefficientsWithIntercept =
-        Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
+        Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
--- End diff --

It doesn't fit.
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78128581

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -460,33 +564,74 @@ class LogisticRegression @Since("1.2.0") (
       (... same hunk as quoted in full above, ending at the summary construction ...)

-    val model = copyValues(new LogisticRegressionModel(uid, coefficients, intercept))
-    val (summaryModel, probabilityColName) = model.findSummaryModelAndProbabilityCol()
-    val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
-      summaryModel.transform(dataset),
-      probabilityColName,
-      $(labelCol),
-      $(featuresCol),
-      objectiveHistory)
--- End diff --

Change the outer

```scala
val (coefficients, intercept, objectiveHistory) = {
  ...
}
```

to

```scala
val (coefficientMatrix, interceptVector, objectiveHistory) = {
  ...
}
```

for clarity.
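The identifiability argument behind the mean-centering quoted in this hunk is easy to verify numerically: shifting every class margin by the same constant leaves the softmax probabilities unchanged, so without a pivot class any constant offset gives the same likelihood. A small self-contained check (plain Python, not Spark code):

```python
import math

def softmax(z):
    """Standard softmax over a list of margins."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Mean-centering is just one choice of the free constant: subtracting the
# mean margin from every class leaves the probabilities identical, which is
# why the diff centers the unregularized multinomial coefficients only for
# reproducibility (the glmnet convention).
z = [0.5, -1.0, 2.0]
shifted = [v - sum(z) / len(z) for v in z]
```

Comparing `softmax(z)` with `softmax(shifted)` element-wise shows agreement to floating-point precision, confirming that centering does not change the fitted model, only its representation.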
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78128374

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -460,33 +564,74 @@ class LogisticRegression @Since("1.2.0") (
       (... same hunk as quoted in full above, continuing ...)

+        val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+          val interceptMean = interceptsArray.sum / numClasses
+          interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+          Vectors.dense(interceptsArray)
--- End diff --

Let's have it in a TODO. This will not work if one of the classes is not present in the training data: the intercept corresponding to that class will be negative infinity, so there is no well-defined interceptMean.
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78128102 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -452,6 +555,7 @@ class LogisticRegression @Since("1.2.0") ( logError(msg) throw new SparkException(msg) } +bcFeaturesStd.destroy(blocking = false) --- End diff -- You could move it up, right after the while loop.
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78127943 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") ( val bcFeaturesStd = instances.context.broadcast(featuresStd) val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), - $(standardization), bcFeaturesStd, regParamL2, multinomial = false, $(aggregationDepth)) + $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial, + $(aggregationDepth)) val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) } else { val standardizationParam = $(standardization) def regParamL1Fun = (index: Int) => { // Remove the L1 penalization on the intercept -if (index == numFeatures) { +val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0) +if (isIntercept) { 0.0 } else { if (standardizationParam) { regParamL1 } else { +val featureIndex = if ($(fitIntercept)) { + index % numFeaturesPlusIntercept +} else { + index % numFeatures +} // If `standardization` is false, we still standardize the data // to improve the rate of convergence; as a result, we have to // perform this reverse standardization by penalizing each component // differently to get effectively the same objective function when // the training dataset is not standardized. 
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0 +if (featuresStd(featureIndex) != 0.0) { + regParamL1 / featuresStd(featureIndex) +} else { + 0.0 +} } } } new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) } val initialCoefficientsWithIntercept = - Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) - -if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) { - val vecSize = optInitialModel.get.coefficients.size - logWarning( -s"Initial coefficients will be ignored!! As its size $vecSize did not match the " + -s"expected size $numFeatures") + Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept) + +val initialModelIsValid = optInitialModel.exists { model => + val providedCoefs = model.coefficientMatrix + val modelValid = (providedCoefs.numRows == numCoefficientSets) && +(providedCoefs.numCols == numFeatures) && +(model.interceptVector.size == numCoefficientSets) + if (!modelValid) { +logWarning(s"Initial coefficients will be ignored! Its dimensions " + + s"(${providedCoefs.numRows}, ${providedCoefs.numCols}) did not match the expected " + + s"size ($numCoefficientSets, $numFeatures)") + } + modelValid } -if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) { - val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray - optInitialModel.get.coefficients.foreachActive { case (index, value) => -initialCoefficientsWithInterceptArray(index) = value +if (initialModelIsValid) { + val initialCoefArray = initialCoefficientsWithIntercept.toArray + val providedCoef = optInitialModel.get.coefficientMatrix + providedCoef.foreachActive { (row, col, value) => +val flatIndex = row * numFeaturesPlusIntercept + col +// We need to scale the coefficients since they will be trained in the scaled space +initialCoefArray(flatIndex) = value * featuresStd(col) --- End diff -- Cool. 
I think the original code doesn't do `value * featuresStd(col)`, which is a bug. Thank you for finding it. Can you change `initialCoefArray` into something with "withIntercept" in the name? Also, please check whether the parent initialModel has fitIntercept set or not.
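The index arithmetic being reviewed here (one block of `numFeatures` coefficients plus an optional trailing intercept per class, flattened into a single raw vector) can be sketched as follows. `CoefLayout` and its method names are illustrative, not the PR's API:

```scala
// Sketch of the flattened coefficient layout. With fitIntercept, each class
// occupies numFeatures + 1 consecutive slots: [coef_0, ..., coef_{F-1}, intercept].
object CoefLayout {
  // Map the i-th coefficient (counting coefficients only, across all classes)
  // into the raw vector: skip one intercept slot per completed class,
  // hence the `i / numFeatures` correction from the diff above.
  def flatIndex(i: Int, numFeatures: Int, fitIntercept: Boolean): Int =
    if (fitIntercept) i + i / numFeatures else i

  // Intercept of class i sits at the end of its block, mirroring the PR's
  // `(i + 1) * numFeaturesPlusIntercept - 1`.
  def interceptIndex(i: Int, numFeatures: Int): Int =
    (i + 1) * (numFeatures + 1) - 1
}
```

With `numFeatures = 3`, coefficient 3 (the first coefficient of class 1) lands at raw slot 4, just past class 0's intercept at slot 3; class 1's intercept sits at slot 7.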
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78127321 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -370,49 +420,102 @@ ... +val initialModelIsValid = optInitialModel.exists { model => + val providedCoefs = model.coefficientMatrix + val modelValid = (providedCoefs.numRows == numCoefficientSets) && +(providedCoefs.numCols == numFeatures) && +(model.interceptVector.size == numCoefficientSets) + if (!modelValid) { +logWarning(s"Initial coefficients will be ignored! Its dimensions " + + s"(${providedCoefs.numRows}, ${providedCoefs.numCols}) did not match the expected " + + s"size ($numCoefficientSets, $numFeatures)") + } + modelValid } --- End diff --
```scala
val isValidInitialModel = optInitialModel match {
  case Some(model) =>
    val providedCoefs = model.coefficientMatrix
    if ((providedCoefs.numRows == numCoefficientSets) &&
        (providedCoefs.numCols == numFeatures) &&
        (model.interceptVector.size == numCoefficientSets)) {
      true
    } else {
      logWarning(.)
      false
    }
  case None => false
}
```
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78126776 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -370,49 +420,102 @@ ... + val modelValid = (providedCoefs.numRows == numCoefficientSets) && ... --- End diff -- isValidModel?
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78111319 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -370,49 +420,102 @@ ... +val initialModelIsValid = optInitialModel.exists { model => ... --- End diff -- `isInitialModelValid`?
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78111049 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -370,49 +420,102 @@ ... val initialCoefficientsWithIntercept = - Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) + Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept) --- End diff -- Move it up to the previous line.
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78110797 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -333,22 +387,18 @@ class LogisticRegression @Since("1.2.0") ( val isConstantLabel = histogram.count(_ != 0) == 1 - if (numClasses > 2) { -val msg = s"LogisticRegression with ElasticNet in ML package only supports " + - s"binary classification. Found $numClasses in the input dataset. Consider using " + - s"MultinomialLogisticRegression instead." -logError(msg) -throw new SparkException(msg) - } else if ($(fitIntercept) && numClasses == 2 && isConstantLabel) { -logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + - s"zeros and the intercept will be positive infinity; as a result, " + - s"training is not needed.") -(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double]) - } else if ($(fitIntercept) && numClasses == 1) { -logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " + - s"zeros and the intercept will be negative infinity; as a result, " + - s"training is not needed.") -(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double]) + if ($(fitIntercept) && isConstantLabel) { +logWarning(s"All labels are the same value and fitIntercept=true, so the coefficients " + + s"will be zeros. Training is not needed.") +val constantLabelIndex = Vectors.dense(histogram).argmax +val coefficientMatrix = Matrices.sparse(numCoefficientSets, numFeatures, + Array.fill(numFeatures + 1)(0), Array.empty[Int], Array.empty[Double]) --- End diff -- Good point. I'll update it soon --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78109655 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -333,22 +387,18 @@ ... +val coefficientMatrix = Matrices.sparse(numCoefficientSets, numFeatures, + Array.fill(numFeatures + 1)(0), Array.empty[Int], Array.empty[Double]) --- End diff -- I think we should be able to tune the sparse matrix implementation later, since it's not efficient to store `Array.fill(numFeatures + 1)(0)` for high-dimensional problems. For now, you can check `if (numClasses > numFeatures + 1)` and store it as CSC; otherwise CSR is preferable, and that covers the majority of use cases.
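The storage concern above is about the pointer arrays: a CSC matrix keeps `numCols + 1` column pointers even when it holds no values, while CSR keeps `numRows + 1` row pointers. A hedged back-of-the-envelope sketch (illustrative names, not Spark's API):

```scala
// Sketch: pointer-array lengths for an empty sparse matrix in the two layouts.
// For a k x F all-zero matrix with F >> k (few classes, many features),
// the row-oriented (CSR) layout needs far fewer pointers.
object SparsePointerCost {
  def cscPtrLength(numCols: Int): Int = numCols + 1 // one pointer per column, plus end
  def csrPtrLength(numRows: Int): Int = numRows + 1 // one pointer per row, plus end
}
```

For numClasses = 3 and numFeatures = 1,000,000, CSC needs 1,000,001 pointers for an all-zero coefficient matrix while CSR needs 4, which is the point being made for high-dimensional problems.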
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78105821 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -827,67 +1743,117 @@ class LogisticRegressionSuite } - test("binary logistic regression with weighted samples") { -val (dataset, weightedDataset) = { - val nPoints = 1000 - val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191) - val xMean = Array(5.843, 3.057, 3.758, 1.199) - val xVariance = Array(0.6856, 0.1899, 3.116, 0.581) - val testData = -generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, 42) - - // Let's over-sample the positive samples twice. - val data1 = testData.flatMap { case labeledPoint: LabeledPoint => -if (labeledPoint.label == 1.0) { - Iterator(labeledPoint, labeledPoint) -} else { - Iterator(labeledPoint) -} - } + test("binary logistic regression with weighted data") { +val numClasses = 2 +val numPoints = 40 +val outlierData = MLTestingUtils.genClassificationInstancesWithWeightedOutliers(spark, + numClasses, numPoints) +val testData = spark.createDataFrame(Array.tabulate[LabeledPoint](numClasses) { i => + LabeledPoint(i.toDouble, Vectors.dense(i.toDouble)) +}) +val lr = new LogisticRegression().setWeightCol("weight") +val model = lr.fit(outlierData) +val results = model.transform(testData).select("label", "prediction").collect() + +// check that the predictions are the one to one mapping +results.foreach { case Row(label: Double, pred: Double) => + assert(label === pred) +} +val (overSampledData, weightedData) = + MLTestingUtils.genEquivalentOversampledAndWeightedInstances(outlierData, "label", "features", +42L) +val weightedModel = lr.fit(weightedData) +val overSampledModel = lr.setWeightCol("").fit(overSampledData) +assert(weightedModel.coefficientMatrix ~== overSampledModel.coefficientMatrix relTol 0.01) + } - val rnd = new Random(8392) - val data2 = testData.flatMap { 
case LabeledPoint(label: Double, features: Vector) => -if (rnd.nextGaussian() > 0.0) { - if (label == 1.0) { -Iterator( - Instance(label, 1.2, features), - Instance(label, 0.8, features), - Instance(0.0, 0.0, features)) - } else { -Iterator( - Instance(label, 0.3, features), - Instance(1.0, 0.0, features), - Instance(label, 0.1, features), - Instance(label, 0.6, features)) - } -} else { - if (label == 1.0) { -Iterator(Instance(label, 2.0, features)) - } else { -Iterator(Instance(label, 1.0, features)) - } -} - } + test("multinomial logistic regression with weighted data") { +val numClasses = 5 +val numPoints = 40 +val outlierData = MLTestingUtils.genClassificationInstancesWithWeightedOutliers(spark, + numClasses, numPoints) +val testData = spark.createDataFrame(Array.tabulate[LabeledPoint](numClasses) { i => + LabeledPoint(i.toDouble, Vectors.dense(i.toDouble)) +}) +val mlr = new LogisticRegression().setWeightCol("weight") +val model = mlr.fit(outlierData) +val results = model.transform(testData).select("label", "prediction").collect() + +// check that the predictions are the one to one mapping +results.foreach { case Row(label: Double, pred: Double) => + assert(label === pred) +} +val (overSampledData, weightedData) = + MLTestingUtils.genEquivalentOversampledAndWeightedInstances(outlierData, "label", "features", +42L) +val weightedModel = mlr.fit(weightedData) +val overSampledModel = mlr.setWeightCol("").fit(overSampledData) +assert(weightedModel.coefficientMatrix ~== overSampledModel.coefficientMatrix relTol 0.01) + } - (spark.createDataFrame(sc.parallelize(data1, 4)), -spark.createDataFrame(sc.parallelize(data2, 4))) + test("set family") { +val lr = new LogisticRegression().setMaxIter(1) +// don't set anything for binary classification +val model1 = lr.fit(binaryDataset) +assert(model1.coefficientMatrix.numRows === 1 && model1.coefficientMatrix.numCols === 4) +assert(model1.interceptVector.size === 1) + +// set to multinomial for binary classification +val model
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78104785 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -232,6 +363,47 @@ class LogisticRegressionSuite } } + test("coefficients and intercept methods") { +val mlr = new LogisticRegression().setMaxIter(1) +val mlrModel = mlr.fit(smallMultinomialDataset) +val thrownCoef = intercept[SparkException] { + mlrModel.coefficients +} +val thrownIntercept = intercept[SparkException] { + mlrModel.intercept +} +assert(thrownCoef.getMessage().contains("use coefficientMatrix instead")) +assert(thrownIntercept.getMessage().contains("use interceptVector instead")) + +val blr = new LogisticRegression().setMaxIter(1) --- End diff -- In tests like these, perhaps we should explicitly set the family to be certain of the code path. We already have a test to check the family detection code. I'll add this in my next commit.
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78101965 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -333,22 +387,18 @@ class LogisticRegression @Since("1.2.0") ( val isConstantLabel = histogram.count(_ != 0) == 1 --- End diff -- Minor: `histogram.count(_ != 0.0) == 1`
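The nit above is about comparing the `Double`-valued histogram counts against `0.0` rather than the `Int` literal `0`. A minimal sketch of the predicate (hypothetical object name):

```scala
// Sketch: the label histogram has one (Double) count per class; the labels
// are constant exactly when only one bucket is nonzero.
object LabelCheck {
  def isConstantLabel(histogram: Array[Double]): Boolean =
    histogram.count(_ != 0.0) == 1
}
```

For example, `LabelCheck.isConstantLabel(Array(0.0, 40.0, 0.0))` holds, while a histogram with two nonzero buckets does not.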
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78101327 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -311,8 +348,25 @@ class LogisticRegression @Since("1.2.0") ( val histogram = labelSummarizer.histogram val numInvalid = labelSummarizer.countInvalid -val numClasses = histogram.length val numFeatures = summarizer.mean.size +val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures + +val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match { + case Some(n: Int) => +require(n >= histogram.length, s"Specified number of classes $n was " + + s"less than the number of unique labels ${histogram.length}.") +n + case None => histogram.length +} +val isBinaryClassification = numClasses == 1 || numClasses == 2 +val isMultinomial = ($(family) == "auto" && !isBinaryClassification) || + ($(family) == "multinomial") +val numCoefficientSets = if (isMultinomial) numClasses else 1 + --- End diff -- Maybe it's easier to read as:
```scala
val (isBinomial, isMultinomial) = $(family) match {
  case "binomial" =>
    require(numClasses == 1 || numClasses == 2, s"Binomial family only supports 1 or 2 " +
      s"outcome classes but found $numClasses.")
    (true, false)
  case "multinomial" => (false, true)
  case "auto" => .
  case _ => throw new IllegalArgumentException("Unsupported family")
}
```
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r78093424 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -50,6 +50,8 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas with HasRegParam with HasElasticNetParam with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol with HasThreshold with HasAggregationDepth { + import LogisticRegression._ --- End diff -- could you import with the full class path instead of using a wildcard?
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r76463590 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala --- @@ -201,11 +201,24 @@ abstract class ProbabilisticClassificationModel[ probability.argmax } else { val thresholds: Array[Double] = getThresholds - val scaledProbability: Array[Double] = -probability.toArray.zip(thresholds).map { case (p, t) => - if (t == 0.0) Double.PositiveInfinity else p / t + val probabilities = probability.toArray --- End diff -- This code was implemented in the PR for MLOR. I included it here since it is more generic. It requires only a single pass through the probabilities array using a while loop, so it should be slightly faster.
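The single-pass thresholded argmax described in the comment above can be sketched in plain Scala. This is an illustrative sketch, not the actual `ProbabilisticClassificationModel` code; the name `rawArgmaxWithThresholds` is assumed:

```scala
// Illustrative sketch of a one-pass thresholded argmax: pick the class i that
// maximizes probabilities(i) / thresholds(i), treating a 0.0 threshold as an
// automatic winner (its ratio is +infinity), in a single while loop with no
// intermediate scaled-probability array.
def rawArgmaxWithThresholds(probabilities: Array[Double], thresholds: Array[Double]): Int = {
  var bestIdx = 0
  var bestRatio = Double.NegativeInfinity
  var i = 0
  while (i < probabilities.length) {
    val ratio =
      if (thresholds(i) == 0.0) Double.PositiveInfinity
      else probabilities(i) / thresholds(i)
    if (ratio > bestRatio) {
      bestRatio = ratio
      bestIdx = i
    }
    i += 1
  }
  bestIdx
}
```

Compared with the old `zip`/`map` version, this allocates nothing and traverses the array once, which is the "slightly faster" point made above.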
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14834#discussion_r76463186 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -676,39 +936,54 @@ object LogisticRegressionModel extends MLReadable[LogisticRegressionModel] { private case class Data( numClasses: Int, numFeatures: Int, -intercept: Double, -coefficients: Vector) +interceptVector: Vector, +coefficientMatrix: Matrix, +isMultinomial: Boolean) override protected def saveImpl(path: String): Unit = { // Save metadata and Params DefaultParamsWriter.saveMetadata(instance, path, sc) // Save model data: numClasses, numFeatures, intercept, coefficients - val data = Data(instance.numClasses, instance.numFeatures, instance.intercept, -instance.coefficients) + val data = Data(instance.numClasses, instance.numFeatures, instance.interceptVector, +instance.coefficientMatrix, instance.isMultinomial) val dataPath = new Path(path, "data").toString sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath) } } - private class LogisticRegressionModelReader -extends MLReader[LogisticRegressionModel] { + private class LogisticRegressionModelReader extends MLReader[LogisticRegressionModel] { /** Checked against metadata when loading model */ private val className = classOf[LogisticRegressionModel].getName override def load(path: String): LogisticRegressionModel = { val metadata = DefaultParamsReader.loadMetadata(path, sc, className) + val versionRegex = "([0-9]+)\\.([0-9]+)\\.(.+)".r + val versionRegex(major, minor, _) = metadata.sparkVersion val dataPath = new Path(path, "data").toString val data = sparkSession.read.format("parquet").load(dataPath) - // We will need numClasses, numFeatures in the future for multinomial logreg support. - // TODO: remove numClasses and numFeatures fields? 
- val Row(numClasses: Int, numFeatures: Int, intercept: Double, coefficients: Vector) = -MLUtils.convertVectorColumnsToML(data, "coefficients") - .select("numClasses", "numFeatures", "intercept", "coefficients") - .head() - val model = new LogisticRegressionModel(metadata.uid, coefficients, intercept) + val model = if (major.toInt < 2 || (major.toInt == 2 && minor.toInt == 0)) { --- End diff -- I did an offline test to make sure that we can successfully load old models into the new API
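The version gate in the loader diff above can be isolated as a small sketch. The helper name `usesLegacyLayout` is assumed (the real reader inlines this check against `metadata.sparkVersion`):

```scala
// Sketch of the save-version gate from the loader above: models written by
// Spark 1.x or 2.0.x used the old (intercept: Double, coefficients: Vector)
// layout, while newer saves store an intercept vector and coefficient matrix.
val versionRegex = "([0-9]+)\\.([0-9]+)\\.(.+)".r

def usesLegacyLayout(sparkVersion: String): Boolean = sparkVersion match {
  case versionRegex(major, minor, _) =>
    major.toInt < 2 || (major.toInt == 2 && minor.toInt == 0)
}
```

Note the patch-version group is `(.+)` rather than digits so that strings like `"2.0.1-SNAPSHOT"` still match.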
[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...
GitHub user sethah opened a pull request: https://github.com/apache/spark/pull/14834 [SPARK-17163][ML][WIP] Unified LogisticRegression interface ## What changes were proposed in this pull request? Merge `MultinomialLogisticRegression` into `LogisticRegression` and remove `MultinomialLogisticRegression`. Marked as WIP because we should discuss the coefficients API in the model. See discussion below. JIRA: [SPARK-17163](https://issues.apache.org/jira/browse/SPARK-17163) ## How was this patch tested? Merged test suites and added some new unit tests. ## Design ### Switching between binomial and multinomial We default to automatically detecting whether we should run binomial or multinomial lor. We expose a new parameter called `family` which defaults to auto. When "auto" is used, we run normal binomial lor with pivoting if there are 1 or 2 label classes. Otherwise, we run multinomial. If the user explicitly sets the family, then we abide by that setting. In the case where "binomial" is set but multiclass lor is detected, we throw an error. ### coefficients/intercept model API (TODO) This is the biggest design point remaining, IMO. We need to decide how to store the coefficients and intercepts in the model, and in turn how to expose them via the API. Two important points: * We must maintain compatibility with the old API, i.e. we must expose `def coefficients: Vector` and `def intercept: Double` * There are two separate cases: binomial lr where we have a single set of coefficients and a single intercept and multinomial lr where we have `numClasses` sets of coefficients and `numClasses` intercepts. Some options: 1. **Store the binomial coefficients as a `2 x numFeatures` matrix.** This means that we would center the model coefficients before storing them in the model. The BLOR algorithm gives `1 * numFeatures` coefficients, but we would convert them to `2 x numFeatures` coefficients before storing them, effectively doubling the storage in the model. 
This has the advantage that we can make the code cleaner (i.e. less `if (isMultinomial) ... else ...`) and we don't have to reason about the different cases as much. It has the disadvantage that we double the storage space and we could see small regressions at prediction time since there are 2x the number of operations in the prediction algorithms. Additionally, we still have to produce the uncentered coefficients/intercept via the API, so we will have to either ALSO store the uncentered version, or compute it in `def coefficients: Vector` every time. 2. **Store the binomial coefficients as a `1 x numFeatures` matrix.** We still store the coefficients as a matrix and the intercepts as a vector. When users call `coefficients` we return them a `Vector` that is backed by the same underlying array as the `coefficientMatrix`, so we don't duplicate any data. At prediction time, we use the old prediction methods that are specialized for binary LOR. The benefits here are that we don't store extra data, and we won't see any regressions in performance. The cost of this is that we have separate implementations for predict methods in the binary vs multiclass cases. The amount of duplicated code is really not very high, but it's still a bit messy. If we do decide to store the 2x coefficients, we would likely want to see some performance tests to understand the potential regressions. ### Threshold/thresholds (TODO) Currently, when `threshold` is set we clear whatever value is in `thresholds` and when `thresholds` is set we clear whatever value is in `threshold`. [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543) was created to prefer thresholds over threshold. We should decide if we should implement this behavior now or if we want to do it in a separate JIRA.
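The claim underlying option 1 in the coefficients discussion above, that a pivoted `1 x numFeatures` binomial model and its centered `2 x numFeatures` expansion are interchangeable, rests on a softmax identity. A minimal numeric check in plain Scala (the margin value is arbitrary, and `softmax2Class1` is just a 2-class softmax helper for this sketch):

```scala
// For a margin m = w.x + b, pivoted binomial LOR predicts P(y=1) = sigmoid(m).
// Centering the coefficients into two rows (margins -m/2 for class 0 and +m/2
// for class 1) and applying a 2-class softmax yields the same probability,
// which is why the centered 2 x numFeatures representation loses no information
// (at the cost of doubled storage).
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def softmax2Class1(z0: Double, z1: Double): Double =
  math.exp(z1) / (math.exp(z0) + math.exp(z1))

val margin = 0.7
val pivoted = sigmoid(margin)
val centered = softmax2Class1(-margin / 2.0, margin / 2.0)
```

Algebraically, `exp(m/2) / (exp(-m/2) + exp(m/2)) = 1 / (1 + exp(-m))`, so the two prediction paths agree exactly; the performance question is only about the extra arithmetic, not the result.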
## Follow up * Summary model for multiclass logistic regression [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139) * Thresholds vs threshold [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543) You can merge this pull request into a Git repository by running: $ git pull https://github.com/sethah/spark SPARK-17163 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14834.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14834 commit b20d2e769b837d3566ca5f1b1a75d17d755d3b9a Author: sethah Date: 2016-08-25T00:13:33Z first pass at merging MLOR with LOR commit d85381abb44ec0c17117aafdea8c5f017f2f6851 Author: sethah Date: 2016-08-25T05:05:46Z add initial model commit 6b1b984f5fb892bfedfc58c39ef9a7ea9863ed29 Author: sethah Date: 2016-08-25T16:16:33Z fixing some todos, added