[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78252660
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -311,8 +348,25 @@ class LogisticRegression @Since("1.2.0") (

 val histogram = labelSummarizer.histogram
 val numInvalid = labelSummarizer.countInvalid
-val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
+val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures
+
+val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
+  case Some(n: Int) =>
+    require(n >= histogram.length, s"Specified number of classes $n was " +
+      s"less than the number of unique labels ${histogram.length}.")
+    n
+  case None => histogram.length
+}
+val isBinaryClassification = numClasses == 1 || numClasses == 2
+val isMultinomial = ($(family) == "auto" && !isBinaryClassification) ||
+  ($(family) == "multinomial")
+val numCoefficientSets = if (isMultinomial) numClasses else 1
+
--- End diff --

I switched it up. Let me know if that looks better.
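
For readers following the diff, a standalone sketch of how these resolved values interact (the `resolve` helper is illustrative, not part of the PR):

```scala
// Mirrors the diff's logic: binomial unless the user forces "multinomial",
// or "auto" sees more than two classes.
def resolve(family: String, numClasses: Int): (Boolean, Int) = {
  val isBinaryClassification = numClasses == 1 || numClasses == 2
  val isMultinomial = (family == "auto" && !isBinaryClassification) ||
    (family == "multinomial")
  val numCoefficientSets = if (isMultinomial) numClasses else 1
  (isMultinomial, numCoefficientSets)
}

assert(resolve("auto", 2) == (false, 1))        // binary labels -> one coefficient set
assert(resolve("auto", 3) == (true, 3))         // three classes -> multinomial
assert(resolve("multinomial", 2) == (true, 2))  // forced multinomial, two sets
```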


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78243482
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -333,22 +387,18 @@ class LogisticRegression @Since("1.2.0") (

   val isConstantLabel = histogram.count(_ != 0) == 1

-  if (numClasses > 2) {
-    val msg = s"LogisticRegression with ElasticNet in ML package only supports " +
-      s"binary classification. Found $numClasses in the input dataset. Consider using " +
-      s"MultinomialLogisticRegression instead."
-    logError(msg)
-    throw new SparkException(msg)
-  } else if ($(fitIntercept) && numClasses == 2 && isConstantLabel) {
-    logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
-      s"zeros and the intercept will be positive infinity; as a result, " +
-      s"training is not needed.")
-    (Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
-  } else if ($(fitIntercept) && numClasses == 1) {
-    logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " +
-      s"zeros and the intercept will be negative infinity; as a result, " +
-      s"training is not needed.")
-    (Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+  if ($(fitIntercept) && isConstantLabel) {
+    logWarning(s"All labels are the same value and fitIntercept=true, so the coefficients " +
+      s"will be zeros. Training is not needed.")
+    val constantLabelIndex = Vectors.dense(histogram).argmax
+    val coefficientMatrix = Matrices.sparse(numCoefficientSets, numFeatures,
+      Array.fill(numFeatures + 1)(0), Array.empty[Int], Array.empty[Double])
--- End diff --

I added this logic and also added a test to check it. We can add to that 
test when we complete the compressed logic.





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78243270
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (

 val bcFeaturesStd = instances.context.broadcast(featuresStd)
 val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-  $(standardization), bcFeaturesStd, regParamL2, multinomial = false, $(aggregationDepth))
+  $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+  $(aggregationDepth))

 val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
   new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
   val standardizationParam = $(standardization)
   def regParamL1Fun = (index: Int) => {
     // Remove the L1 penalization on the intercept
-    if (index == numFeatures) {
+    val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0)
+    if (isIntercept) {
       0.0
     } else {
       if (standardizationParam) {
         regParamL1
       } else {
+        val featureIndex = if ($(fitIntercept)) {
+          index % numFeaturesPlusIntercept
+        } else {
+          index % numFeatures
+        }
         // If `standardization` is false, we still standardize the data
         // to improve the rate of convergence; as a result, we have to
         // perform this reverse standardization by penalizing each component
         // differently to get effectively the same objective function when
         // the training dataset is not standardized.
-        if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+        if (featuresStd(featureIndex) != 0.0) {
+          regParamL1 / featuresStd(featureIndex)
+        } else {
+          0.0
+        }
       }
     }
   }
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
 }

 val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
-
-if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
-  val vecSize = optInitialModel.get.coefficients.size
-  logWarning(
-    s"Initial coefficients will be ignored!! As its size $vecSize did not match the " +
-    s"expected size $numFeatures")
+  Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
+
+val initialModelIsValid = optInitialModel.exists { model =>
+  val providedCoefs = model.coefficientMatrix
+  val modelValid = (providedCoefs.numRows == numCoefficientSets) &&
--- End diff --

See the above comment and let me know if you think it needs to be changed.





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78243195
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (

 val bcFeaturesStd = instances.context.broadcast(featuresStd)
 val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-  $(standardization), bcFeaturesStd, regParamL2, multinomial = false, $(aggregationDepth))
+  $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+  $(aggregationDepth))

 val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
   new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
   val standardizationParam = $(standardization)
   def regParamL1Fun = (index: Int) => {
     // Remove the L1 penalization on the intercept
-    if (index == numFeatures) {
+    val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0)
+    if (isIntercept) {
       0.0
     } else {
       if (standardizationParam) {
         regParamL1
       } else {
+        val featureIndex = if ($(fitIntercept)) {
+          index % numFeaturesPlusIntercept
+        } else {
+          index % numFeatures
+        }
         // If `standardization` is false, we still standardize the data
         // to improve the rate of convergence; as a result, we have to
         // perform this reverse standardization by penalizing each component
         // differently to get effectively the same objective function when
         // the training dataset is not standardized.
-        if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+        if (featuresStd(featureIndex) != 0.0) {
+          regParamL1 / featuresStd(featureIndex)
+        } else {
+          0.0
+        }
       }
     }
   }
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
 }

 val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
-
-if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
-  val vecSize = optInitialModel.get.coefficients.size
-  logWarning(
-    s"Initial coefficients will be ignored!! As its size $vecSize did not match the " +
-    s"expected size $numFeatures")
+  Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
+
+val initialModelIsValid = optInitialModel.exists { model =>
+  val providedCoefs = model.coefficientMatrix
+  val modelValid = (providedCoefs.numRows == numCoefficientSets) &&
+    (providedCoefs.numCols == numFeatures) &&
+    (model.interceptVector.size == numCoefficientSets)
+  if (!modelValid) {
+    logWarning(s"Initial coefficients will be ignored! Its dimensions " +
+      s"(${providedCoefs.numRows}, ${providedCoefs.numCols}) did not match the expected " +
+      s"size ($numCoefficientSets, $numFeatures)")
+  }
+  modelValid
 }
--- End diff --

I changed this up to use pattern matching as you suggested, and to check the `fitIntercept`.





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78243127
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -452,6 +555,7 @@ class LogisticRegression @Since("1.2.0") (
   logError(msg)
   throw new SparkException(msg)
 }
+bcFeaturesStd.destroy(blocking = false)
--- End diff --

Done.





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78243150
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (

 val bcFeaturesStd = instances.context.broadcast(featuresStd)
 val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-  $(standardization), bcFeaturesStd, regParamL2, multinomial = false, $(aggregationDepth))
+  $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+  $(aggregationDepth))

 val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
   new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
   val standardizationParam = $(standardization)
   def regParamL1Fun = (index: Int) => {
     // Remove the L1 penalization on the intercept
-    if (index == numFeatures) {
+    val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0)
+    if (isIntercept) {
       0.0
     } else {
       if (standardizationParam) {
         regParamL1
       } else {
+        val featureIndex = if ($(fitIntercept)) {
+          index % numFeaturesPlusIntercept
+        } else {
+          index % numFeatures
+        }
         // If `standardization` is false, we still standardize the data
         // to improve the rate of convergence; as a result, we have to
         // perform this reverse standardization by penalizing each component
         // differently to get effectively the same objective function when
         // the training dataset is not standardized.
-        if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+        if (featuresStd(featureIndex) != 0.0) {
+          regParamL1 / featuresStd(featureIndex)
+        } else {
+          0.0
+        }
       }
     }
   }
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
 }

 val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
-
-if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
-  val vecSize = optInitialModel.get.coefficients.size
-  logWarning(
-    s"Initial coefficients will be ignored!! As its size $vecSize did not match the " +
-    s"expected size $numFeatures")
+  Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
+
+val initialModelIsValid = optInitialModel.exists { model =>
+  val providedCoefs = model.coefficientMatrix
+  val modelValid = (providedCoefs.numRows == numCoefficientSets) &&
+    (providedCoefs.numCols == numFeatures) &&
+    (model.interceptVector.size == numCoefficientSets)
+  if (!modelValid) {
+    logWarning(s"Initial coefficients will be ignored! Its dimensions " +
+      s"(${providedCoefs.numRows}, ${providedCoefs.numCols}) did not match the expected " +
+      s"size ($numCoefficientSets, $numFeatures)")
+  }
+  modelValid
 }

-if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
-  val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
-  optInitialModel.get.coefficients.foreachActive { case (index, value) =>
-    initialCoefficientsWithInterceptArray(index) = value
+if (initialModelIsValid) {
+  val initialCoefArray = initialCoefficientsWithIntercept.toArray
+  val providedCoef = optInitialModel.get.coefficientMatrix
+  providedCoef.foreachActive { (row, col, value) =>
+    val flatIndex = row * numFeaturesPlusIntercept + col
+    // We need to scale the coefficients since they will be trained in the scaled space
+    initialCoefArray(flatIndex) = value * featuresStd(col)
--- End diff --

Done.



[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78243092
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -460,33 +564,74 @@ class LogisticRegression @Since("1.2.0") (
     as a result, no scaling is needed.
   */
 val rawCoefficients = state.x.toArray.clone()
-var i = 0
-while (i < numFeatures) {
-  rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 }
-  i += 1
+val coefficientArray = Array.tabulate(numCoefficientSets * numFeatures) { i =>
+  // flatIndex will loop though rawCoefficients, and skip the intercept terms.
+  val flatIndex = if ($(fitIntercept)) i + i / numFeatures else i
+  val featureIndex = i % numFeatures
+  if (featuresStd(featureIndex) != 0.0) {
+    rawCoefficients(flatIndex) / featuresStd(featureIndex)
+  } else {
+    0.0
+  }
+}
+val coefficientMatrix =
+  new DenseMatrix(numCoefficientSets, numFeatures, coefficientArray, isTransposed = true)
+
+if ($(regParam) == 0.0 && isMultinomial) {
+  /*
+    When no regularization is applied, the coefficients lack identifiability because
+    we do not use a pivot class. We can add any constant value to the coefficients and
+    get the same likelihood. So here, we choose the mean centered coefficients for
+    reproducibility. This method follows the approach in glmnet, described here:
+
+    Friedman, et al. "Regularization Paths for Generalized Linear Models via
+      Coordinate Descent," https://core.ac.uk/download/files/153/6287975.pdf
+   */
+  val coefficientMean = coefficientMatrix.values.sum / coefficientMatrix.values.length
+  coefficientMatrix.update(_ - coefficientMean)
 }
-bcFeaturesStd.destroy(blocking = false)

-if ($(fitIntercept)) {
-  (Vectors.dense(rawCoefficients.dropRight(1)).compressed, rawCoefficients.last,
-    arrayBuilder.result())
+val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+  Array.tabulate(numCoefficientSets) { i =>
+    val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+    rawCoefficients(coefIndex)
+  }
+} else {
+  Array[Double]()
+}
+/*
+  The intercepts are never regularized, so we always center the mean.
+ */
+val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+  val interceptMean = interceptsArray.sum / numClasses
+  interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+  Vectors.dense(interceptsArray)
+} else if (interceptsArray.length == 1) {
+  Vectors.dense(interceptsArray)
 } else {
-  (Vectors.dense(rawCoefficients).compressed, 0.0, arrayBuilder.result())
+  Vectors.sparse(numCoefficientSets, Seq())
 }
+(coefficientMatrix, interceptVector, arrayBuilder.result())
   }
 }

 if (handlePersistence) instances.unpersist()

-val model = copyValues(new LogisticRegressionModel(uid, coefficients, intercept))
-val (summaryModel, probabilityColName) = model.findSummaryModelAndProbabilityCol()
-val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
-  summaryModel.transform(dataset),
-  probabilityColName,
-  $(labelCol),
-  $(featuresCol),
-  objectiveHistory)
--- End diff --

Done.





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78243118
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -460,33 +564,74 @@ class LogisticRegression @Since("1.2.0") (
     as a result, no scaling is needed.
   */
 val rawCoefficients = state.x.toArray.clone()
-var i = 0
-while (i < numFeatures) {
-  rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 }
-  i += 1
+val coefficientArray = Array.tabulate(numCoefficientSets * numFeatures) { i =>
+  // flatIndex will loop though rawCoefficients, and skip the intercept terms.
+  val flatIndex = if ($(fitIntercept)) i + i / numFeatures else i
+  val featureIndex = i % numFeatures
+  if (featuresStd(featureIndex) != 0.0) {
+    rawCoefficients(flatIndex) / featuresStd(featureIndex)
+  } else {
+    0.0
+  }
+}
+val coefficientMatrix =
+  new DenseMatrix(numCoefficientSets, numFeatures, coefficientArray, isTransposed = true)
+
+if ($(regParam) == 0.0 && isMultinomial) {
+  /*
+    When no regularization is applied, the coefficients lack identifiability because
+    we do not use a pivot class. We can add any constant value to the coefficients and
+    get the same likelihood. So here, we choose the mean centered coefficients for
+    reproducibility. This method follows the approach in glmnet, described here:
+
+    Friedman, et al. "Regularization Paths for Generalized Linear Models via
+      Coordinate Descent," https://core.ac.uk/download/files/153/6287975.pdf
+   */
+  val coefficientMean = coefficientMatrix.values.sum / coefficientMatrix.values.length
+  coefficientMatrix.update(_ - coefficientMean)
 }
-bcFeaturesStd.destroy(blocking = false)

-if ($(fitIntercept)) {
-  (Vectors.dense(rawCoefficients.dropRight(1)).compressed, rawCoefficients.last,
-    arrayBuilder.result())
+val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+  Array.tabulate(numCoefficientSets) { i =>
+    val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+    rawCoefficients(coefIndex)
+  }
+} else {
+  Array[Double]()
+}
+/*
+  The intercepts are never regularized, so we always center the mean.
+ */
+val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+  val interceptMean = interceptsArray.sum / numClasses
+  interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+  Vectors.dense(interceptsArray)
--- End diff --

Done.





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78233010
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (

 val bcFeaturesStd = instances.context.broadcast(featuresStd)
 val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-  $(standardization), bcFeaturesStd, regParamL2, multinomial = false, $(aggregationDepth))
+  $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+  $(aggregationDepth))

 val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
   new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
   val standardizationParam = $(standardization)
   def regParamL1Fun = (index: Int) => {
     // Remove the L1 penalization on the intercept
-    if (index == numFeatures) {
+    val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0)
+    if (isIntercept) {
       0.0
     } else {
       if (standardizationParam) {
         regParamL1
       } else {
+        val featureIndex = if ($(fitIntercept)) {
+          index % numFeaturesPlusIntercept
+        } else {
+          index % numFeatures
+        }
         // If `standardization` is false, we still standardize the data
         // to improve the rate of convergence; as a result, we have to
         // perform this reverse standardization by penalizing each component
         // differently to get effectively the same objective function when
         // the training dataset is not standardized.
-        if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+        if (featuresStd(featureIndex) != 0.0) {
+          regParamL1 / featuresStd(featureIndex)
+        } else {
+          0.0
+        }
       }
     }
   }
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
 }

 val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
-
-if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
-  val vecSize = optInitialModel.get.coefficients.size
-  logWarning(
-    s"Initial coefficients will be ignored!! As its size $vecSize did not match the " +
-    s"expected size $numFeatures")
+  Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
+
+val initialModelIsValid = optInitialModel.exists { model =>
--- End diff --

TBH, I think `initialModelIsValid` is better. `isValidInitialModel` or 
`isInitialModelValid` sound like questions (more useful if they were methods 
that returned an answer), when this is actually a val that contains the answer. 
`if (initialModelIsValid)` reads more naturally than `if (isValidInitialModel)` 
also. That said, it's a private variable and so it's not a big issue. If others 
feel strongly then I can change it.





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78224587
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (

 val bcFeaturesStd = instances.context.broadcast(featuresStd)
 val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-  $(standardization), bcFeaturesStd, regParamL2, multinomial = false, $(aggregationDepth))
+  $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+  $(aggregationDepth))

 val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
   new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
   val standardizationParam = $(standardization)
   def regParamL1Fun = (index: Int) => {
     // Remove the L1 penalization on the intercept
-    if (index == numFeatures) {
+    val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0)
+    if (isIntercept) {
       0.0
     } else {
       if (standardizationParam) {
         regParamL1
       } else {
+        val featureIndex = if ($(fitIntercept)) {
+          index % numFeaturesPlusIntercept
+        } else {
+          index % numFeatures
+        }
         // If `standardization` is false, we still standardize the data
         // to improve the rate of convergence; as a result, we have to
         // perform this reverse standardization by penalizing each component
         // differently to get effectively the same objective function when
         // the training dataset is not standardized.
-        if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+        if (featuresStd(featureIndex) != 0.0) {
+          regParamL1 / featuresStd(featureIndex)
+        } else {
+          0.0
+        }
       }
     }
   }
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
 }

 val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
-
-if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
-  val vecSize = optInitialModel.get.coefficients.size
-  logWarning(
-    s"Initial coefficients will be ignored!! As its size $vecSize did not match the " +
-    s"expected size $numFeatures")
+  Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
--- End diff --

It doesn't fit.





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78128581
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -460,33 +564,74 @@ class LogisticRegression @Since("1.2.0") (
     as a result, no scaling is needed.
   */
 val rawCoefficients = state.x.toArray.clone()
-var i = 0
-while (i < numFeatures) {
-  rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 }
-  i += 1
+val coefficientArray = Array.tabulate(numCoefficientSets * numFeatures) { i =>
+  // flatIndex will loop though rawCoefficients, and skip the intercept terms.
+  val flatIndex = if ($(fitIntercept)) i + i / numFeatures else i
+  val featureIndex = i % numFeatures
+  if (featuresStd(featureIndex) != 0.0) {
+    rawCoefficients(flatIndex) / featuresStd(featureIndex)
+  } else {
+    0.0
+  }
+}
+val coefficientMatrix =
+  new DenseMatrix(numCoefficientSets, numFeatures, coefficientArray, isTransposed = true)
+
+if ($(regParam) == 0.0 && isMultinomial) {
+  /*
+    When no regularization is applied, the coefficients lack identifiability because
+    we do not use a pivot class. We can add any constant value to the coefficients and
+    get the same likelihood. So here, we choose the mean centered coefficients for
+    reproducibility. This method follows the approach in glmnet, described here:
+
+    Friedman, et al. "Regularization Paths for Generalized Linear Models via
+      Coordinate Descent," https://core.ac.uk/download/files/153/6287975.pdf
+   */
+  val coefficientMean = coefficientMatrix.values.sum / coefficientMatrix.values.length
+  coefficientMatrix.update(_ - coefficientMean)
 }
-bcFeaturesStd.destroy(blocking = false)

-if ($(fitIntercept)) {
-  (Vectors.dense(rawCoefficients.dropRight(1)).compressed, rawCoefficients.last,
-    arrayBuilder.result())
+val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+  Array.tabulate(numCoefficientSets) { i =>
+    val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+    rawCoefficients(coefIndex)
+  }
+} else {
+  Array[Double]()
+}
+/*
+  The intercepts are never regularized, so we always center the mean.
+ */
+val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+  val interceptMean = interceptsArray.sum / numClasses
+  interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+  Vectors.dense(interceptsArray)
+} else if (interceptsArray.length == 1) {
+  Vectors.dense(interceptsArray)
 } else {
-  (Vectors.dense(rawCoefficients).compressed, 0.0, arrayBuilder.result())
+  Vectors.sparse(numCoefficientSets, Seq())
 }
+(coefficientMatrix, interceptVector, arrayBuilder.result())
   }
 }

 if (handlePersistence) instances.unpersist()

-val model = copyValues(new LogisticRegressionModel(uid, coefficients, intercept))
-val (summaryModel, probabilityColName) = model.findSummaryModelAndProbabilityCol()
-val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
-  summaryModel.transform(dataset),
-  probabilityColName,
-  $(labelCol),
-  $(featuresCol),
-  objectiveHistory)
--- End diff --

Change the outer

```scala
val (coefficients, intercept, objectiveHistory) = {
  ...
}
```
to
```scala
val (coefficientMatrix, interceptVector, objectiveHistory) = {
  ...
}
```
for clarity.





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78128374
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -460,33 +564,74 @@ class LogisticRegression @Since("1.2.0") (
     as a result, no scaling is needed.
   */
 val rawCoefficients = state.x.toArray.clone()
-var i = 0
-while (i < numFeatures) {
-  rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 }
-  i += 1
+val coefficientArray = Array.tabulate(numCoefficientSets * numFeatures) { i =>
+  // flatIndex will loop though rawCoefficients, and skip the intercept terms.
+  val flatIndex = if ($(fitIntercept)) i + i / numFeatures else i
+  val featureIndex = i % numFeatures
+  if (featuresStd(featureIndex) != 0.0) {
+    rawCoefficients(flatIndex) / featuresStd(featureIndex)
+  } else {
+    0.0
+  }
+}
+val coefficientMatrix =
+  new DenseMatrix(numCoefficientSets, numFeatures, coefficientArray, isTransposed = true)
+
+if ($(regParam) == 0.0 && isMultinomial) {
+  /*
+    When no regularization is applied, the coefficients lack identifiability because
+    we do not use a pivot class. We can add any constant value to the coefficients and
+    get the same likelihood. So here, we choose the mean centered coefficients for
+    reproducibility. This method follows the approach in glmnet, described here:
+
+    Friedman, et al. "Regularization Paths for Generalized Linear Models via
+      Coordinate Descent," https://core.ac.uk/download/files/153/6287975.pdf
+   */
+  val coefficientMean = coefficientMatrix.values.sum / coefficientMatrix.values.length
+  coefficientMatrix.update(_ - coefficientMean)
 }
-bcFeaturesStd.destroy(blocking = false)

-if ($(fitIntercept)) {
-  (Vectors.dense(rawCoefficients.dropRight(1)).compressed, rawCoefficients.last,
-    arrayBuilder.result())
+val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+  Array.tabulate(numCoefficientSets) { i =>
+    val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+    rawCoefficients(coefIndex)
+  }
+} else {
+  Array[Double]()
+}
+/*
+  The intercepts are never regularized, so we always center the mean.
+ */
+val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+  val interceptMean = interceptsArray.sum / numClasses
+  interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+  Vectors.dense(interceptsArray)
--- End diff --

Let's have it as a TODO. This will not work if one of the classes is absent from the training data, since the intercept corresponding to that class will be negative infinity, so there is no well-defined interceptMean.
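
A tiny illustration of the failure mode (hypothetical numbers, not from the PR):

```scala
// If class 2 never appears in training, its intercept diverges toward
// Double.NegativeInfinity; the mean then becomes -Infinity and centering
// produces Infinity/NaN entries instead of finite intercepts.
val intercepts = Array(0.3, -0.1, Double.NegativeInfinity)
val interceptMean = intercepts.sum / intercepts.length  // -Infinity
val centered = intercepts.map(_ - interceptMean)        // Array(Infinity, Infinity, NaN)
```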





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78128102
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -452,6 +555,7 @@ class LogisticRegression @Since("1.2.0") (
   logError(msg)
   throw new SparkException(msg)
 }
+bcFeaturesStd.destroy(blocking = false)
--- End diff --

You could move it up, right after the `while` loop.





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78127943
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (

 val bcFeaturesStd = instances.context.broadcast(featuresStd)
 val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-  $(standardization), bcFeaturesStd, regParamL2, multinomial = false, $(aggregationDepth))
+  $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+  $(aggregationDepth))

 val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
   new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
   val standardizationParam = $(standardization)
   def regParamL1Fun = (index: Int) => {
     // Remove the L1 penalization on the intercept
-    if (index == numFeatures) {
+    val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0)
+    if (isIntercept) {
       0.0
     } else {
       if (standardizationParam) {
         regParamL1
       } else {
+        val featureIndex = if ($(fitIntercept)) {
+          index % numFeaturesPlusIntercept
+        } else {
+          index % numFeatures
+        }
         // If `standardization` is false, we still standardize the data
         // to improve the rate of convergence; as a result, we have to
         // perform this reverse standardization by penalizing each component
         // differently to get effectively the same objective function when
         // the training dataset is not standardized.
-        if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+        if (featuresStd(featureIndex) != 0.0) {
+          regParamL1 / featuresStd(featureIndex)
+        } else {
+          0.0
+        }
       }
     }
   }
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
 }

 val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
-
-if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
-  val vecSize = optInitialModel.get.coefficients.size
-  logWarning(
-    s"Initial coefficients will be ignored!! As its size $vecSize did not match the " +
-    s"expected size $numFeatures")
+  Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
+
+val initialModelIsValid = optInitialModel.exists { model =>
+  val providedCoefs = model.coefficientMatrix
+  val modelValid = (providedCoefs.numRows == numCoefficientSets) &&
+    (providedCoefs.numCols == numFeatures) &&
+    (model.interceptVector.size == numCoefficientSets)
+  if (!modelValid) {
+    logWarning(s"Initial coefficients will be ignored! Its dimensions " +
+      s"(${providedCoefs.numRows}, ${providedCoefs.numCols}) did not match the expected " +
+      s"size ($numCoefficientSets, $numFeatures)")
+  }
+  modelValid
 }

-if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
-  val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
-  optInitialModel.get.coefficients.foreachActive { case (index, value) =>
-    initialCoefficientsWithInterceptArray(index) = value
+if (initialModelIsValid) {
+  val initialCoefArray = initialCoefficientsWithIntercept.toArray
+  val providedCoef = optInitialModel.get.coefficientMatrix
+  providedCoef.foreachActive { (row, col, value) =>
+    val flatIndex = row * numFeaturesPlusIntercept + col
+    // We need to scale the coefficients since they will be trained in the scaled space
+    initialCoefArray(flatIndex) = value * featuresStd(col)
--- End diff --

Cool. I think the original code doesn't do `value * featuresStd(col)`, which is a bug. Thank you for finding it. Can you rename `initialCoefArray` to something containing "withIntercept"? Also, please check whether the parent initialModel has fitIntercept set or not.
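
For context, a minimal sketch of the scaling convention being discussed, assuming training runs on standardized features (the names here are illustrative, not the PR's code):

```scala
// The optimizer sees x / featuresStd, so a coefficient w in the original space
// corresponds to w * featuresStd in the scaled space: component-wise,
// w * x == (w * std) * (x / std). Initial coefficients must therefore be
// multiplied by featuresStd before being handed to the optimizer, and the
// trained result is divided back by featuresStd at the end.
val featuresStd = Array(2.0, 0.5)
val originalCoefs = Array(1.5, -3.0)
val scaledSpaceCoefs = originalCoefs.zip(featuresStd).map { case (w, s) => w * s }
// scaledSpaceCoefs == Array(3.0, -1.5)
```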



[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78127321
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (

 val bcFeaturesStd = instances.context.broadcast(featuresStd)
 val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-  $(standardization), bcFeaturesStd, regParamL2, multinomial = false, $(aggregationDepth))
+  $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+  $(aggregationDepth))

 val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
   new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
   val standardizationParam = $(standardization)
   def regParamL1Fun = (index: Int) => {
     // Remove the L1 penalization on the intercept
-    if (index == numFeatures) {
+    val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0)
+    if (isIntercept) {
       0.0
     } else {
       if (standardizationParam) {
         regParamL1
       } else {
+        val featureIndex = if ($(fitIntercept)) {
+          index % numFeaturesPlusIntercept
+        } else {
+          index % numFeatures
+        }
         // If `standardization` is false, we still standardize the data
         // to improve the rate of convergence; as a result, we have to
         // perform this reverse standardization by penalizing each component
         // differently to get effectively the same objective function when
         // the training dataset is not standardized.
-        if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+        if (featuresStd(featureIndex) != 0.0) {
+          regParamL1 / featuresStd(featureIndex)
+        } else {
+          0.0
+        }
       }
     }
   }
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
 }

 val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
-
-if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
-  val vecSize = optInitialModel.get.coefficients.size
-  logWarning(
-    s"Initial coefficients will be ignored!! As its size $vecSize did not match the " +
-    s"expected size $numFeatures")
+  Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
+
+val initialModelIsValid = optInitialModel.exists { model =>
+  val providedCoefs = model.coefficientMatrix
+  val modelValid = (providedCoefs.numRows == numCoefficientSets) &&
+    (providedCoefs.numCols == numFeatures) &&
+    (model.interceptVector.size == numCoefficientSets)
+  if (!modelValid) {
+    logWarning(s"Initial coefficients will be ignored! Its dimensions " +
+      s"(${providedCoefs.numRows}, ${providedCoefs.numCols}) did not match the expected " +
+      s"size ($numCoefficientSets, $numFeatures)")
+  }
+  modelValid
 }
--- End diff --

```scala
val isValidInitialModel = optInitialModel match {
  case Some(model) =>
    val providedCoefs = model.coefficientMatrix
    if ((providedCoefs.numRows == numCoefficientSets) &&
        (providedCoefs.numCols == numFeatures) &&
        (model.interceptVector.size == numCoefficientSets)) {
      true
    } else {
      logWarning(...)
      false
    }
  case None => false
}
```





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78126776
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (

 val bcFeaturesStd = instances.context.broadcast(featuresStd)
 val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-  $(standardization), bcFeaturesStd, regParamL2, multinomial = false, $(aggregationDepth))
+  $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+  $(aggregationDepth))

 val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
   new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
   val standardizationParam = $(standardization)
   def regParamL1Fun = (index: Int) => {
     // Remove the L1 penalization on the intercept
-    if (index == numFeatures) {
+    val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0)
+    if (isIntercept) {
       0.0
     } else {
       if (standardizationParam) {
         regParamL1
       } else {
+        val featureIndex = if ($(fitIntercept)) {
+          index % numFeaturesPlusIntercept
+        } else {
+          index % numFeatures
+        }
         // If `standardization` is false, we still standardize the data
         // to improve the rate of convergence; as a result, we have to
         // perform this reverse standardization by penalizing each component
         // differently to get effectively the same objective function when
         // the training dataset is not standardized.
-        if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+        if (featuresStd(featureIndex) != 0.0) {
+          regParamL1 / featuresStd(featureIndex)
+        } else {
+          0.0
+        }
       }
     }
   }
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
 }

 val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
-
-if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
-  val vecSize = optInitialModel.get.coefficients.size
-  logWarning(
-    s"Initial coefficients will be ignored!! As its size $vecSize did not match the " +
-    s"expected size $numFeatures")
+  Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
+
+val initialModelIsValid = optInitialModel.exists { model =>
+  val providedCoefs = model.coefficientMatrix
+  val modelValid = (providedCoefs.numRows == numCoefficientSets) &&
--- End diff --

isValidModel?





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78111319
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (

 val bcFeaturesStd = instances.context.broadcast(featuresStd)
 val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-  $(standardization), bcFeaturesStd, regParamL2, multinomial = false, $(aggregationDepth))
+  $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+  $(aggregationDepth))

 val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
   new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
   val standardizationParam = $(standardization)
   def regParamL1Fun = (index: Int) => {
     // Remove the L1 penalization on the intercept
-    if (index == numFeatures) {
+    val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0)
+    if (isIntercept) {
       0.0
     } else {
       if (standardizationParam) {
         regParamL1
       } else {
+        val featureIndex = if ($(fitIntercept)) {
+          index % numFeaturesPlusIntercept
+        } else {
+          index % numFeatures
+        }
         // If `standardization` is false, we still standardize the data
         // to improve the rate of convergence; as a result, we have to
         // perform this reverse standardization by penalizing each component
         // differently to get effectively the same objective function when
         // the training dataset is not standardized.
-        if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+        if (featuresStd(featureIndex) != 0.0) {
+          regParamL1 / featuresStd(featureIndex)
+        } else {
+          0.0
+        }
       }
     }
   }
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
 }

 val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
-
-if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
-  val vecSize = optInitialModel.get.coefficients.size
-  logWarning(
-    s"Initial coefficients will be ignored!! As its size $vecSize did not match the " +
-    s"expected size $numFeatures")
+  Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
+
+val initialModelIsValid = optInitialModel.exists { model =>
--- End diff --

`isInitialModelValid`?





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78111049
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -370,49 +420,102 @@ class LogisticRegression @Since("1.2.0") (

 val bcFeaturesStd = instances.context.broadcast(featuresStd)
 val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
-  $(standardization), bcFeaturesStd, regParamL2, multinomial = false, $(aggregationDepth))
+  $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
+  $(aggregationDepth))

 val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
   new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
   val standardizationParam = $(standardization)
   def regParamL1Fun = (index: Int) => {
     // Remove the L1 penalization on the intercept
-    if (index == numFeatures) {
+    val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0)
+    if (isIntercept) {
       0.0
     } else {
       if (standardizationParam) {
         regParamL1
       } else {
+        val featureIndex = if ($(fitIntercept)) {
+          index % numFeaturesPlusIntercept
+        } else {
+          index % numFeatures
+        }
         // If `standardization` is false, we still standardize the data
         // to improve the rate of convergence; as a result, we have to
         // perform this reverse standardization by penalizing each component
         // differently to get effectively the same objective function when
         // the training dataset is not standardized.
-        if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+        if (featuresStd(featureIndex) != 0.0) {
+          regParamL1 / featuresStd(featureIndex)
+        } else {
+          0.0
+        }
       }
     }
   }
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
 }

 val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
-
-if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
-  val vecSize = optInitialModel.get.coefficients.size
-  logWarning(
-    s"Initial coefficients will be ignored!! As its size $vecSize did not match the " +
-    s"expected size $numFeatures")
+  Vectors.zeros(numCoefficientSets * numFeaturesPlusIntercept)
--- End diff --

Move it up to the previous line.





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78110797
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -333,22 +387,18 @@ class LogisticRegression @Since("1.2.0") (

   val isConstantLabel = histogram.count(_ != 0) == 1

-  if (numClasses > 2) {
-    val msg = s"LogisticRegression with ElasticNet in ML package only supports " +
-      s"binary classification. Found $numClasses in the input dataset. Consider using " +
-      s"MultinomialLogisticRegression instead."
-    logError(msg)
-    throw new SparkException(msg)
-  } else if ($(fitIntercept) && numClasses == 2 && isConstantLabel) {
-    logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
-      s"zeros and the intercept will be positive infinity; as a result, " +
-      s"training is not needed.")
-    (Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
-  } else if ($(fitIntercept) && numClasses == 1) {
-    logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " +
-      s"zeros and the intercept will be negative infinity; as a result, " +
-      s"training is not needed.")
-    (Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+  if ($(fitIntercept) && isConstantLabel) {
+    logWarning(s"All labels are the same value and fitIntercept=true, so the coefficients " +
+      s"will be zeros. Training is not needed.")
+    val constantLabelIndex = Vectors.dense(histogram).argmax
+    val coefficientMatrix = Matrices.sparse(numCoefficientSets, numFeatures,
+      Array.fill(numFeatures + 1)(0), Array.empty[Int], Array.empty[Double])
--- End diff --

Good point. I'll update it soon.





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78109655
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -333,22 +387,18 @@ class LogisticRegression @Since("1.2.0") (

   val isConstantLabel = histogram.count(_ != 0) == 1

-  if (numClasses > 2) {
-    val msg = s"LogisticRegression with ElasticNet in ML package only supports " +
-      s"binary classification. Found $numClasses in the input dataset. Consider using " +
-      s"MultinomialLogisticRegression instead."
-    logError(msg)
-    throw new SparkException(msg)
-  } else if ($(fitIntercept) && numClasses == 2 && isConstantLabel) {
-    logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
-      s"zeros and the intercept will be positive infinity; as a result, " +
-      s"training is not needed.")
-    (Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
-  } else if ($(fitIntercept) && numClasses == 1) {
-    logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " +
-      s"zeros and the intercept will be negative infinity; as a result, " +
-      s"training is not needed.")
-    (Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+  if ($(fitIntercept) && isConstantLabel) {
+    logWarning(s"All labels are the same value and fitIntercept=true, so the coefficients " +
+      s"will be zeros. Training is not needed.")
+    val constantLabelIndex = Vectors.dense(histogram).argmax
+    val coefficientMatrix = Matrices.sparse(numCoefficientSets, numFeatures,
+      Array.fill(numFeatures + 1)(0), Array.empty[Int], Array.empty[Double])
--- End diff --

I think we should be able to tune the sparse matrix implementation later, 
since it's not efficient to store `Array.fill(numFeatures + 1)(0)` for 
high-dimensional problems. For now, check `if (numClasses > numFeatures + 1)`: 
if so, store it as CSC; otherwise CSR is preferable, and that covers the vast 
majority of use cases.
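
If it helps, here is a rough sketch of that check (my reading of the 
suggestion, not code from this PR; `SparseMatrix`'s `isTransposed = true` 
constructor flag is what gives the CSR layout):

```scala
import org.apache.spark.ml.linalg.{Matrix, SparseMatrix}

// All-zeros m x n matrix: CSC stores numFeatures + 1 column pointers, while
// CSR stores numCoefficientSets + 1 row pointers, so pick whichever
// orientation keeps the pointer array smaller.
def zeroCoefficientMatrix(numCoefficientSets: Int, numFeatures: Int): Matrix = {
  if (numCoefficientSets > numFeatures + 1) {
    // tall and skinny: CSC (the default layout) is cheaper
    new SparseMatrix(numCoefficientSets, numFeatures,
      Array.fill(numFeatures + 1)(0), Array.empty[Int], Array.empty[Double])
  } else {
    // the common case: CSR keeps only numCoefficientSets + 1 pointers
    new SparseMatrix(numCoefficientSets, numFeatures,
      Array.fill(numCoefficientSets + 1)(0), Array.empty[Int], Array.empty[Double],
      isTransposed = true)
  }
}
```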





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78105821
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -827,67 +1743,117 @@ class LogisticRegressionSuite
 
   }
 
-  test("binary logistic regression with weighted samples") {
-    val (dataset, weightedDataset) = {
-      val nPoints = 1000
-      val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191)
-      val xMean = Array(5.843, 3.057, 3.758, 1.199)
-      val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
-      val testData =
-        generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, 42)
-
-      // Let's over-sample the positive samples twice.
-      val data1 = testData.flatMap { case labeledPoint: LabeledPoint =>
-        if (labeledPoint.label == 1.0) {
-          Iterator(labeledPoint, labeledPoint)
-        } else {
-          Iterator(labeledPoint)
-        }
-      }
+  test("binary logistic regression with weighted data") {
+    val numClasses = 2
+    val numPoints = 40
+    val outlierData = MLTestingUtils.genClassificationInstancesWithWeightedOutliers(spark,
+      numClasses, numPoints)
+    val testData = spark.createDataFrame(Array.tabulate[LabeledPoint](numClasses) { i =>
+      LabeledPoint(i.toDouble, Vectors.dense(i.toDouble))
+    })
+    val lr = new LogisticRegression().setWeightCol("weight")
+    val model = lr.fit(outlierData)
+    val results = model.transform(testData).select("label", "prediction").collect()
+
+    // check that the predictions are the one to one mapping
+    results.foreach { case Row(label: Double, pred: Double) =>
+      assert(label === pred)
+    }
+    val (overSampledData, weightedData) =
+      MLTestingUtils.genEquivalentOversampledAndWeightedInstances(outlierData, "label", "features",
+        42L)
+    val weightedModel = lr.fit(weightedData)
+    val overSampledModel = lr.setWeightCol("").fit(overSampledData)
+    assert(weightedModel.coefficientMatrix ~== overSampledModel.coefficientMatrix relTol 0.01)
+  }
 
-      val rnd = new Random(8392)
-      val data2 = testData.flatMap { case LabeledPoint(label: Double, features: Vector) =>
-        if (rnd.nextGaussian() > 0.0) {
-          if (label == 1.0) {
-            Iterator(
-              Instance(label, 1.2, features),
-              Instance(label, 0.8, features),
-              Instance(0.0, 0.0, features))
-          } else {
-            Iterator(
-              Instance(label, 0.3, features),
-              Instance(1.0, 0.0, features),
-              Instance(label, 0.1, features),
-              Instance(label, 0.6, features))
-          }
-        } else {
-          if (label == 1.0) {
-            Iterator(Instance(label, 2.0, features))
-          } else {
-            Iterator(Instance(label, 1.0, features))
-          }
-        }
-      }
+  test("multinomial logistic regression with weighted data") {
+    val numClasses = 5
+    val numPoints = 40
+    val outlierData = MLTestingUtils.genClassificationInstancesWithWeightedOutliers(spark,
+      numClasses, numPoints)
+    val testData = spark.createDataFrame(Array.tabulate[LabeledPoint](numClasses) { i =>
+      LabeledPoint(i.toDouble, Vectors.dense(i.toDouble))
+    })
+    val mlr = new LogisticRegression().setWeightCol("weight")
+    val model = mlr.fit(outlierData)
+    val results = model.transform(testData).select("label", "prediction").collect()
+
+    // check that the predictions are the one to one mapping
+    results.foreach { case Row(label: Double, pred: Double) =>
+      assert(label === pred)
+    }
+    val (overSampledData, weightedData) =
+      MLTestingUtils.genEquivalentOversampledAndWeightedInstances(outlierData, "label", "features",
+        42L)
+    val weightedModel = mlr.fit(weightedData)
+    val overSampledModel = mlr.setWeightCol("").fit(overSampledData)
+    assert(weightedModel.coefficientMatrix ~== overSampledModel.coefficientMatrix relTol 0.01)
+  }
 
-      (spark.createDataFrame(sc.parallelize(data1, 4)),
-        spark.createDataFrame(sc.parallelize(data2, 4)))
+  test("set family") {
+    val lr = new LogisticRegression().setMaxIter(1)
+    // don't set anything for binary classification
+    val model1 = lr.fit(binaryDataset)
+    assert(model1.coefficientMatrix.numRows === 1 && model1.coefficientMatrix.numCols === 4)
+    assert(model1.interceptVector.size === 1)
+
+    // set to multinomial for binary classification
+    val model

[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78104785
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -232,6 +363,47 @@ class LogisticRegressionSuite
     }
   }
 
+  test("coefficients and intercept methods") {
+    val mlr = new LogisticRegression().setMaxIter(1)
+    val mlrModel = mlr.fit(smallMultinomialDataset)
+    val thrownCoef = intercept[SparkException] {
+      mlrModel.coefficients
+    }
+    val thrownIntercept = intercept[SparkException] {
+      mlrModel.intercept
+    }
+    assert(thrownCoef.getMessage().contains("use coefficientMatrix instead"))
+    assert(thrownIntercept.getMessage().contains("use interceptVector instead"))
+
+    val blr = new LogisticRegression().setMaxIter(1)
--- End diff --

In tests like these, perhaps we should explicitly set the family to be 
certain of the code path. We already have a test to check the family detection 
code. I'll add this in my next commit.
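
Something along these lines (a sketch; `setFamily` is the setter assumed for 
the new `family` param):

```scala
// pin the family so each test exercises the intended code path explicitly
val blr = new LogisticRegression().setMaxIter(1).setFamily("binomial")
val mlr = new LogisticRegression().setMaxIter(1).setFamily("multinomial")
```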





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78101965
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -333,22 +387,18 @@ class LogisticRegression @Since("1.2.0") (
 
   val isConstantLabel = histogram.count(_ != 0) == 1
--- End diff --

minor, `histogram.count(_ != 0.0) == 1`





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78101327
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -311,8 +348,25 @@ class LogisticRegression @Since("1.2.0") (
 
     val histogram = labelSummarizer.histogram
     val numInvalid = labelSummarizer.countInvalid
-    val numClasses = histogram.length
     val numFeatures = summarizer.mean.size
+    val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures
+
+    val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
+      case Some(n: Int) =>
+        require(n >= histogram.length, s"Specified number of classes $n was " +
+          s"less than the number of unique labels ${histogram.length}.")
+        n
+      case None => histogram.length
+    }
+    val isBinaryClassification = numClasses == 1 || numClasses == 2
+    val isMultinomial = ($(family) == "auto" && !isBinaryClassification) ||
+      ($(family) == "multinomial")
+    val numCoefficientSets = if (isMultinomial) numClasses else 1
+
--- End diff --

Maybe it's easier to read as:

```scala
val (isBinomial, isMultinomial) = $(family) match {
  case "binomial" =>
    require(numClasses == 1 || numClasses == 2, s"Binomial family only supports 1 or 2 " +
      s"outcome classes but found $numClasses.")
    (true, false)
  case "multinomial" =>
    (false, true)
  case "auto" =>
    // fall back to detection: binomial for 1 or 2 label classes, multinomial otherwise
    (numClasses == 1 || numClasses == 2, numClasses > 2)
  case other =>
    throw new IllegalArgumentException(s"Unsupported family: $other")
}
```





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-09-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78093424
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -50,6 +50,8 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas
   with HasRegParam with HasElasticNetParam with HasMaxIter with HasFitIntercept with HasTol
   with HasStandardization with HasWeightCol with HasThreshold with HasAggregationDepth {
 
+  import LogisticRegression._
--- End diff --

Could you import the members with their full class path instead of using a wildcard?
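
For example (the member name below is just a placeholder for whatever the 
wildcard actually pulls in):

```scala
import org.apache.spark.ml.classification.LogisticRegression.supportedFamilyNames
```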





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-08-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r76463590
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala ---
@@ -201,11 +201,24 @@ abstract class ProbabilisticClassificationModel[
       probability.argmax
     } else {
       val thresholds: Array[Double] = getThresholds
-      val scaledProbability: Array[Double] =
-        probability.toArray.zip(thresholds).map { case (p, t) =>
-          if (t == 0.0) Double.PositiveInfinity else p / t
+      val probabilities = probability.toArray
--- End diff --

This code was implemented in the PR for MLOR. I included it here since it 
is more generic. It requires only a single pass through the probabilities array 
using a while loop, so it should be slightly faster.
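
Roughly, the idea is (a simplified sketch, not the PR code verbatim; the 
`t == 0.0` handling mirrors the deleted block above):

```scala
def argmaxScaled(probabilities: Array[Double], thresholds: Array[Double]): Int = {
  var bestIndex = 0
  var bestRatio = Double.NegativeInfinity
  var i = 0
  while (i < probabilities.length) {
    // a zero threshold makes the class always eligible, so treat its ratio as +inf
    val ratio =
      if (thresholds(i) == 0.0) Double.PositiveInfinity else probabilities(i) / thresholds(i)
    if (ratio > bestRatio) {
      bestRatio = ratio
      bestIndex = i
    }
    i += 1
  }
  bestIndex
}
```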





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-08-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r76463186
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -676,39 +936,54 @@ object LogisticRegressionModel extends MLReadable[LogisticRegressionModel] {
     private case class Data(
         numClasses: Int,
         numFeatures: Int,
-        intercept: Double,
-        coefficients: Vector)
+        interceptVector: Vector,
+        coefficientMatrix: Matrix,
+        isMultinomial: Boolean)
 
     override protected def saveImpl(path: String): Unit = {
       // Save metadata and Params
       DefaultParamsWriter.saveMetadata(instance, path, sc)
       // Save model data: numClasses, numFeatures, intercept, coefficients
-      val data = Data(instance.numClasses, instance.numFeatures, instance.intercept,
-        instance.coefficients)
+      val data = Data(instance.numClasses, instance.numFeatures, instance.interceptVector,
+        instance.coefficientMatrix, instance.isMultinomial)
       val dataPath = new Path(path, "data").toString
       sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
     }
   }
 
-  private class LogisticRegressionModelReader
-    extends MLReader[LogisticRegressionModel] {
+  private class LogisticRegressionModelReader extends MLReader[LogisticRegressionModel] {
 
     /** Checked against metadata when loading model */
     private val className = classOf[LogisticRegressionModel].getName
 
     override def load(path: String): LogisticRegressionModel = {
       val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+      val versionRegex = "([0-9]+)\\.([0-9]+)\\.(.+)".r
+      val versionRegex(major, minor, _) = metadata.sparkVersion
 
       val dataPath = new Path(path, "data").toString
       val data = sparkSession.read.format("parquet").load(dataPath)
 
-      // We will need numClasses, numFeatures in the future for multinomial logreg support.
-      // TODO: remove numClasses and numFeatures fields?
-      val Row(numClasses: Int, numFeatures: Int, intercept: Double, coefficients: Vector) =
-        MLUtils.convertVectorColumnsToML(data, "coefficients")
-          .select("numClasses", "numFeatures", "intercept", "coefficients")
-          .head()
-      val model = new LogisticRegressionModel(metadata.uid, coefficients, intercept)
+      val model = if (major.toInt < 2 || (major.toInt == 2 && minor.toInt == 0)) {
--- End diff --

I did an offline test to make sure that we can successfully load old models 
into the new API.
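
For reference, the version gate shown in the diff behaves like this (the 
version string here is just an illustrative value):

```scala
val versionRegex = "([0-9]+)\\.([0-9]+)\\.(.+)".r
val versionRegex(major, minor, _) = "2.0.2"  // e.g. a model saved by Spark 2.0.2
// models written before this change (Spark < 2.1) take the legacy load path
val isLegacyFormat = major.toInt < 2 || (major.toInt == 2 && minor.toInt == 0)
assert(isLegacyFormat)
```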





[GitHub] spark pull request #14834: [SPARK-17163][ML][WIP] Unified LogisticRegression...

2016-08-26 Thread sethah
GitHub user sethah opened a pull request:

https://github.com/apache/spark/pull/14834

[SPARK-17163][ML][WIP] Unified LogisticRegression interface

## What changes were proposed in this pull request?

Merge `MultinomialLogisticRegression` into `LogisticRegression` and remove 
`MultinomialLogisticRegression`.

Marked as WIP because we should discuss the coefficients API in the model. 
See discussion below.

JIRA: [SPARK-17163](https://issues.apache.org/jira/browse/SPARK-17163)


## How was this patch tested?

Merged test suites and added some new unit tests.

## Design

### Switching between binomial and multinomial

We default to automatically detecting whether we should run binomial or 
multinomial lor. We expose a new parameter called `family` which defaults to 
auto. When "auto" is used, we run normal binomial lor with pivoting if there 
are 1 or 2 label classes. Otherwise, we run multinomial. If the user explicitly 
sets the family, then we abide by that setting. In the case where "binomial" is 
set but multiclass lor is detected, we throw an error.
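
A quick usage sketch (assuming the standard generated setter name `setFamily` 
for the new param):

```scala
val lr = new LogisticRegression()                             // family defaults to "auto"
val blor = new LogisticRegression().setFamily("binomial")     // force binary LOR
val mlor = new LogisticRegression().setFamily("multinomial")  // force multinomial LOR
// With family = "binomial", fitting a dataset with more than two label classes
// throws an error, per the behavior described above.
```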

### coefficients/intercept model API (TODO)

This is the biggest design point remaining, IMO. We need to decide how to 
store the coefficients and intercepts in the model, and in turn how to expose 
them via the API. Two important points:

* We must maintain compatibility with the old API, i.e. we must expose `def 
coefficients: Vector` and `def intercept: Double`
* There are two separate cases: binomial lr where we have a single set of 
coefficients and a single intercept and multinomial lr where we have 
`numClasses` sets of coefficients and `numClasses` intercepts.

Some options:

1. **Store the binomial coefficients as a `2 x numFeatures` matrix.** This 
means that we would center the model coefficients before storing them in the 
model. The BLOR algorithm gives `1 * numFeatures` coefficients, but we would 
convert them to `2 x numFeatures` coefficients before storing them, effectively 
doubling the storage in the model. This has the advantage that we can make the 
code cleaner (i.e. less `if (isMultinomial) ... else ...`) and we don't have to 
reason about the different cases as much. It has the disadvantage that we 
double the storage space and we could see small regressions at prediction time 
since there are 2x the number of operations in the prediction algorithms. 
Additionally, we still have to produce the uncentered coefficients/intercept 
via the API, so we will have to either ALSO store the uncentered version, or 
compute it in `def coefficients: Vector` every time.

2. **Store the binomial coefficients as a `1 x numFeatures` matrix.** We 
still store the coefficients as a matrix and the intercepts as a vector. When 
users call `coefficients` we return them a `Vector` that is backed by the same 
underlying array as the `coefficientMatrix`, so we don't duplicate any data. At 
prediction time, we use the old prediction methods that are specialized for 
binary LOR. The benefits here are that we don't store extra data and we won't 
see any performance regressions. The cost is that we have separate 
implementations of the predict methods for the binary vs. multiclass case. The 
amount of duplicated code is really not very high, but it's still a bit messy 
(see the sketch after this list).

If we do decide to store the 2x coefficients, we would likely want to see 
some performance tests to understand the potential regressions.
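
As a rough illustration of option 2 (not final code; this assumes the binomial 
`coefficientMatrix` is a dense `1 x numFeatures` matrix):

```scala
import org.apache.spark.SparkException
import org.apache.spark.ml.linalg.{DenseMatrix, Vector, Vectors}

// A dense 1 x numFeatures matrix and a dense vector of length numFeatures can
// share the same backing values array, so no data is duplicated.
def coefficients(coefficientMatrix: DenseMatrix, isMultinomial: Boolean): Vector = {
  if (isMultinomial) {
    throw new SparkException("Multinomial models store a matrix of coefficients; " +
      "use coefficientMatrix instead.")
  } else {
    Vectors.dense(coefficientMatrix.values)  // zero-copy wrap of the backing array
  }
}
```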

### Threshold/thresholds (TODO)

Currently, when `threshold` is set we clear whatever value is in 
`thresholds` and when `thresholds` is set we clear whatever value is in 
`threshold`. [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543) 
was created to prefer thresholds over threshold. We should decide if we should 
implement this behavior now or if we want to do it in a separate JIRA.

## Follow up

* Summary model for multiclass logistic regression 
[SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139)
* Thresholds vs threshold 
[SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sethah/spark SPARK-17163

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14834.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14834


commit b20d2e769b837d3566ca5f1b1a75d17d755d3b9a
Author: sethah 
Date:   2016-08-25T00:13:33Z

first pass at merging MLOR with LOR

commit d85381abb44ec0c17117aafdea8c5f017f2f6851
Author: sethah 
Date:   2016-08-25T05:05:46Z

add initial model

commit 6b1b984f5fb892bfedfc58c39ef9a7ea9863ed29
Author: sethah 
Date:   2016-08-25T16:16:33Z

fixing some todos, added