Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r63521258
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,197 @@ regression model and extracting model summary 
statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +When working with data that has a relatively small number of features (< 4096), Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs), which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +
    +In contrast with linear regression, where the output is assumed to follow a Gaussian
    +distribution, GLMs are specifications of linear models where the response variable $Y_i$ may follow _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
    +$$
    +
    +An exponential family distribution is any probability distribution of the 
form
    +
    +$$
    +f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected 
value of the response variable
    +$\mu_i$ by
    +
    +$$
    +\theta_i = h(\mu_i)
    +$$
    +
    +Here, $h(\mu_i)$ is defined by the form of the exponential family 
distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected 
value of the response variable $\mu_i$
    +and the so-called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $h(\mu) = g(\mu)$, which 
yields a simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor 
$\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = h(g^{-1}(\eta_i)) = \eta_i
    +$$
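    +
    +As a concrete instance, consider the Poisson family. A standard exponential-family identity (not derived
    +here) is that the mean satisfies $\mu = b'(\theta)$. Writing the Poisson probability mass function in the
    +exponential family form above gives $b(\theta) = e^{\theta}$, so that
    +
    +$$
    +\mu = b'(\theta) = e^{\theta} \quad \Rightarrow \quad \theta = h(\mu) = \log{\mu}
    +$$
    +
    +which is why the log link is the canonical link for the Poisson family in the table below.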
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the 
likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression 
coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    +$$
    +
    +Spark's generalized linear regression interface also provides summary statistics for diagnosing the
    +fit of GLMs, including residuals, p-values, deviances, the Akaike information criterion, and
    +others.
    +
    +###  Available families
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th></th>
    +      <th>PDF</th>
    +      <th>Response Type</th>
    +      <th>Supported Links</th></tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>Gaussian</td>
    +      <td>$\frac{1}{\sigma \sqrt{2\pi}} \exp \left( -\frac{(x - 
\mu)^2}{2\sigma^2}\right)$</td>
    +      <td>Continuous</td>
    +      <td>Identity*, Log, Inverse</td>
    +    </tr>
    +    <tr>
    +      <td>Binomial</td>
    +      <td>$\binom{n}{k}p^k (1-p)^{n-k}$</td>
    +      <td>Binary</td>
    +      <td>Logit*, Probit, CLogLog</td>
    +    </tr>
    +    <tr>
    +      <td>Poisson</td>
    +      <td>$\frac{\lambda^k e^{-\lambda}}{k!}$</td>
    +      <td>Count</td>
    +      <td>Log*, Identity, Sqrt</td>
    +    </tr>
    +    <tr>
    +      <td>Gamma</td>
    +      <td>$\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$</td>
    +      <td>Continuous</td>
    +      <td>Inverse*, Identity, Log</td>
    +    </tr>
    +  </tbody>
    +  <tfoot><tr><td colspan="4">* Canonical Link</td></tr></tfoot>
    +</table>
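    +
    +As a rough illustration of how a family and link from the table above are selected, the sketch below fits a
    +Poisson model with its canonical log link. The tiny inline dataset and parameter values are placeholders for
    +illustration only, not the example shipped with Spark.
    +
    +{% highlight scala %}
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.regression.GeneralizedLinearRegression
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.builder.appName("GLMFamilySketch").getOrCreate()
    +
    +// Toy count data: (label, features). Labels must be valid for the chosen family
    +// (e.g. non-negative counts for Poisson).
    +val df = spark.createDataFrame(Seq(
    +  (1.0, Vectors.dense(0.0, 1.0)),
    +  (2.0, Vectors.dense(1.0, 1.0)),
    +  (4.0, Vectors.dense(2.0, 2.0)),
    +  (3.0, Vectors.dense(3.0, 1.0))
    +)).toDF("label", "features")
    +
    +// Select the family and link by name, as listed in the table above.
    +val glr = new GeneralizedLinearRegression()
    +  .setFamily("poisson")
    +  .setLink("log")
    +  .setMaxIter(10)
    +
    +val model = glr.fit(df)
    +println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
    +{% endhighlight %}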
    +
    +### Optimization
    +
    +The `spark.ml` GLM implements the method of 
    +[iteratively reweighted least 
squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) 
(IRLS) for finding
    +the optimal regression coefficients. GLMs seek to find a maximum 
likelihood estimate of the
    +regression coefficients by finding zeros of the [score 
equation](https://en.wikipedia.org/wiki/Score_(statistics)). 
    +The IRLS solver uses a first-order Taylor approximation of the score equation to cast the problem as a
    +weighted least squares regression, which it solves iteratively until convergence.
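    +
    +Concretely, one common way to write the resulting update (stated here for reference, using the family's
    +variance function $V(\mu)$, which is not defined above) is: given the current linear predictor
    +$\eta_i = \vec{x_i}^T \cdot \vec{\beta}$ and mean $\mu_i = g^{-1}(\eta_i)$, form the working response and
    +working weights
    +
    +$$
    +z_i = \eta_i + (y_i - \mu_i)\,g'(\mu_i), \qquad
    +\tilde{w}_i = \frac{w_i}{V(\mu_i)\,g'(\mu_i)^2}
    +$$
    +
    +and take the next iterate to be the solution of the weighted least squares problem
    +
    +$$
    +\vec{\beta} \leftarrow \arg\min_{\vec{\beta}} \sum_{i=1}^{N} \tilde{w}_i \left(z_i - \vec{x_i}^T \cdot \vec{\beta}\right)^2
    +$$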
    +
    +### Input Columns
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th align="left">Param name</th>
    +      <th align="left">Type(s)</th>
    +      <th align="left">Default</th>
    +      <th align="left">Description</th>
    +    </tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>labelCol</td>
    +      <td>Double</td>
    +      <td>"label"</td>
    +      <td>Label to predict</td>
    +    </tr>
    +    <tr>
    +      <td>featuresCol</td>
    +      <td>Vector</td>
    +      <td>"features"</td>
    +      <td>Feature vector</td>
    +    </tr>
    +    <tr>
    +      <td>weightCol</td>
    +      <td>Double</td>
    +      <td>""</td>
    +      <td>Sample weights</td>
    +    </tr>
    +  </tbody>
    +</table>
    +
    +### Output Columns
    +
    +<table class="table">
    +  <thead>
    +    <tr>
    +      <th align="left">Param name</th>
    +      <th align="left">Type(s)</th>
    +      <th align="left">Default</th>
    +      <th align="left">Description</th>
    +    </tr>
    +  </thead>
    +  <tbody>
    +    <tr>
    +      <td>predictionCol</td>
    +      <td>Double</td>
    +      <td>"prediction"</td>
    +      <td>Predicted label</td>
    +    </tr>
    +    <tr>
    +      <td>linkPredictionCol</td>
    +      <td>Double</td>
    +      <td>""</td>
    +      <td>Prediction on the link scale (the linear predictor $\eta$)</td>
    +    </tr>
    +  </tbody>
    +</table>
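    +
    +As a rough sketch of how the optional columns above and the summary statistics mentioned earlier fit together
    +(reusing the `spark` session from the earlier sketch; the toy data and column names are illustrative only):
    +
    +{% highlight scala %}
    +import org.apache.spark.ml.linalg.Vectors
    +import org.apache.spark.ml.regression.GeneralizedLinearRegression
    +
    +// Toy data with an explicit per-instance weight column: (label, weight, features).
    +val weighted = spark.createDataFrame(Seq(
    +  (1.0, 1.0, Vectors.dense(0.0, 1.0)),
    +  (2.0, 2.0, Vectors.dense(1.0, 1.0)),
    +  (4.0, 1.0, Vectors.dense(2.0, 2.0)),
    +  (3.0, 3.0, Vectors.dense(3.0, 1.0))
    +)).toDF("label", "weight", "features")
    +
    +val weightedGlr = new GeneralizedLinearRegression()
    +  .setFamily("poisson")
    +  .setLink("log")
    +  .setWeightCol("weight")            // input column of sample weights
    +  .setLinkPredictionCol("linkPred")  // extra output column holding the linear predictor
    +
    +val weightedModel = weightedGlr.fit(weighted)
    +weightedModel.transform(weighted).select("label", "prediction", "linkPred").show()
    +
    +// Training summary with the diagnostics described earlier.
    +val summary = weightedModel.summary
    +println(s"Deviance: ${summary.deviance}, AIC: ${summary.aic}")
    +summary.residuals().show()
    +{% endhighlight %}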
    +
    +
    +**Example**
    +
    +The following example demonstrates training a GLM with a Gaussian response 
and identity link
    +function and extracting model summary statistics.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +{% include_example 
scala/org/apache/spark/examples/ml/GeneralizedLinearRegressionExample.scala %}
    --- End diff --
    
    Other examples usually also include a link to the API docs in this section.

