Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17094#discussion_r118475804
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LeastSquaresAggregator.scala ---
    @@ -0,0 +1,224 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.ml.optim.aggregator
    +
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.ml.feature.Instance
    +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors}
    +
    +/**
    + * LeastSquaresAggregator computes the gradient and loss for a least-squares loss function,
    + * as used in linear regression, for samples in sparse or dense vectors, in an online fashion.
    + *
    + * Two LeastSquaresAggregators can be merged together to produce a summary of the loss and
    + * gradient of the corresponding joint dataset.
    + *
    + * To improve the convergence rate during the optimization process, and to prevent features
    + * with very large variances from exerting an overly large influence during model training,
    + * packages like R's GLMNET perform scaling to unit variance and remove the mean to reduce
    + * the condition number. The model is then trained in the scaled space, but the coefficients
    + * are returned in the original scale. See page 9 in
    + * http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
    + *
    + * However, we don't want to apply the `StandardScaler` to the training dataset and then
    + * cache the standardized dataset, since doing so creates a lot of overhead. Instead, we
    + * perform the scaling implicitly when we compute the objective function. The following is
    + * the mathematical derivation.
    + *
    + * Note that we don't deal with the intercept by adding a bias term here, because the
    + * intercept can be computed in closed form after the coefficients have converged.
    + * See this discussion for details:
    + * http://stats.stackexchange.com/questions/13617/how-is-the-intercept-computed-in-glmnet
    + *
    + * When training with the intercept enabled,
    + * the objective function in the scaled space is given by
    + *
    + * <blockquote>
    + *    $$
    + *    L = 1/2n ||\sum_i w_i(x_i - \bar{x_i}) / \hat{x_i} - (y - \bar{y}) / \hat{y}||^2,
    + *    $$
    + * </blockquote>
    + *
    + * where $\bar{x_i}$ is the mean of $x_i$, $\hat{x_i}$ is the standard deviation of $x_i$,
    + * $\bar{y}$ is the mean of the label, and $\hat{y}$ is the standard deviation of the label.
    + *
    + * If we are fitting with the intercept disabled (that is, the regression is forced through
    + * 0.0), we can use the same equation, except that we set $\bar{y}$ and $\bar{x_i}$ to 0
    + * instead of the respective means.
    + *
    + * This can be rewritten as
    + *
    + * <blockquote>
    + *    $$
    + *    \begin{align}
    + *     L &= 1/2n ||\sum_i (w_i/\hat{x_i})x_i - \sum_i (w_i/\hat{x_i})\bar{x_i} - y / \hat{y}
    + *          + \bar{y} / \hat{y}||^2 \\
    + *       &= 1/2n ||\sum_i w_i^\prime x_i - y / \hat{y} + offset||^2 = 1/2n diff^2
    + *    \end{align}
    + *    $$
    + * </blockquote>
    + *
    + * where $w_i^\prime$ are the effective coefficients, defined by $w_i/\hat{x_i}$, offset is
    + *
    + * <blockquote>
    + *    $$
    + *    - \sum_i (w_i/\hat{x_i})\bar{x_i} + \bar{y} / \hat{y}.
    + *    $$
    + * </blockquote>
    + *
    + * and diff is
    + *
    + * <blockquote>
    + *    $$
    + *    \sum_i w_i^\prime x_i - y / \hat{y} + offset
    + *    $$
    + * </blockquote>
    + *
    + * Note that the effective coefficients and offset don't depend on the training dataset,
    + * so they can be precomputed.
    + *
    + * Now, the first derivative of the objective function in scaled space is
    + *
    + * <blockquote>
    + *    $$
    + *    \frac{\partial L}{\partial w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i}
    + *    $$
    + * </blockquote>
    + *
    + * However, $(x_i - \bar{x_i})$ will densify the computation, so it's not
    + * an ideal formula when the training dataset is in sparse format.
    + *
    + * This can be addressed by adding the dense $\bar{x_i} / \hat{x_i}$ terms
    + * at the end by keeping the sum of diff. The first derivative of the total
    + * objective function over all the samples is
    + *
    + *
    + * <blockquote>
    + *    $$
    + *    \begin{align}
    + *       \frac{\partial L}{\partial w_i} &=
    + *           1/N \sum_j diff_j (x_{ij} - \bar{x_i}) / \hat{x_i} \\
    + *         &= 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) - diffSum \bar{x_i} / \hat{x_i}) \\
    + *         &= 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + correction_i)
    + *    \end{align}
    + *    $$
    + * </blockquote>
    + *
    + * where $correction_i = - diffSum \bar{x_i} / \hat{x_i}$
    + *
    + * Simple math shows that diffSum is actually zero, so we don't even
    + * need to add the correction terms at the end. From the definition of diff,
    + *
    + * <blockquote>
    + *    $$
    + *    \begin{align}
    + *       diffSum &= \sum_j (\sum_i w_i(x_{ij} - \bar{x_i})
    + *                    / \hat{x_i} - (y_j - \bar{y}) / \hat{y}) \\
    + *         &= N * (\sum_i w_i(\bar{x_i} - \bar{x_i}) / \hat{x_i} - (\bar{y} - \bar{y}) / \hat{y}) \\
    + *         &= 0
    + *    \end{align}
    + *    $$
    + * </blockquote>
    + *
    + * As a result, the first derivative of the total objective function only depends on
    + * the training dataset, can easily be computed in a distributed fashion, and is
    + * friendly to sparse formats:
    + *
    + * <blockquote>
    + *    $$
    + *    \frac{\partial L}{\partial w_i} = 1/N (\sum_j diff_j x_{ij} / \hat{x_i})
    + *    $$
    + * </blockquote>
    + *
    + * @note The constructor is curried, since the cost function will repeatedly create new versions
    + *       of this class for different coefficient vectors.
    + *
    + * @param labelStd The standard deviation value of the label.
    + * @param labelMean The mean value of the label.
    + * @param fitIntercept Whether to fit an intercept term.
    + * @param bcFeaturesStd The broadcast standard deviation values of the features.
    + * @param bcFeaturesMean The broadcast mean values of the features.
    + * @param bcCoefficients The broadcast coefficients corresponding to the features.
    + */
    +private[ml] class LeastSquaresAggregator(
    +    labelStd: Double,
    +    labelMean: Double,
    +    fitIntercept: Boolean,
    +    bcFeaturesStd: Broadcast[Array[Double]],
    +    bcFeaturesMean: Broadcast[Array[Double]])(bcCoefficients: Broadcast[Vector])
    +  extends DifferentiableLossAggregator[Instance, LeastSquaresAggregator] {
    +  require(labelStd > 0.0, s"${this.getClass.getName} requires the label standard" +
    +    s"deviation to be positive.")
    --- End diff ---
    
    Add a space before 'deviation' or at the end of the previous line
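
    (Aside for readers following the derivation in the scaladoc rather than the code: below is a
    minimal, self-contained sketch, in plain Scala with dense arrays and no Spark `Instance` or
    broadcast types, of how the effective coefficients, offset, and sparse-friendly gradient
    update fit together. It is only an illustration of the math above, not the implementation
    under review, and every name in it is hypothetical.)

        // Hypothetical sketch of the implicit-scaling trick; not the class in this PR.
        class SketchLeastSquaresAgg(
            coef: Array[Double],
            featuresStd: Array[Double],
            featuresMean: Array[Double],
            labelStd: Double,
            labelMean: Double,
            fitIntercept: Boolean) {

          // Effective coefficients w_i' = w_i / \hat{x_i}; features with zero std are skipped.
          private val effectiveCoef: Array[Double] =
            coef.indices.map(i => if (featuresStd(i) != 0.0) coef(i) / featuresStd(i) else 0.0).toArray

          // offset = -\sum_i w_i' \bar{x_i} + \bar{y} / \hat{y}, or 0 when no intercept is fit
          // (the means are treated as zero in that case, as noted in the scaladoc).
          private val offset: Double =
            if (fitIntercept) {
              labelMean / labelStd -
                effectiveCoef.indices.map(i => effectiveCoef(i) * featuresMean(i)).sum
            } else 0.0

          private var count = 0L
          private var lossSum = 0.0
          private val gradientSum = new Array[Double](coef.length)

          /** Add one (features, label) sample using the precomputed quantities. */
          def add(features: Array[Double], label: Double): this.type = {
            // diff = \sum_i w_i' x_i - y / \hat{y} + offset
            var dot = 0.0
            var i = 0
            while (i < features.length) { dot += effectiveCoef(i) * features(i); i += 1 }
            val diff = dot - label / labelStd + offset
            // Gradient contribution is diff * x_i / \hat{x_i}; no dense correction term is
            // needed because diffSum is zero, per the derivation above.
            i = 0
            while (i < gradientSum.length) {
              if (featuresStd(i) != 0.0) gradientSum(i) += diff * features(i) / featuresStd(i)
              i += 1
            }
            lossSum += 0.5 * diff * diff
            count += 1
            this
          }

          /** Merge another partition's partial sums, as a distributed aggregator would. */
          def merge(other: SketchLeastSquaresAgg): this.type = {
            count += other.count
            lossSum += other.lossSum
            var i = 0
            while (i < gradientSum.length) { gradientSum(i) += other.gradientSum(i); i += 1 }
            this
          }

          def loss: Double = lossSum / count
          def gradient: Array[Double] = gradientSum.map(_ / count)
        }

    The aggregator in the diff differs in that it receives the statistics and coefficients as
    broadcast variables and consumes `Instance` objects with ml `Vector` features, but the loss
    and gradient bookkeeping follows the same equations.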

