Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/42#discussion_r66351302
  
    --- Diff: src/ports/postgres/modules/stats/pred_metrics.sql_in ---
    @@ -0,0 +1,831 @@
    +/* ----------------------------------------------------------------------- *//**
    + *
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + *
    + * @file pred_metrics.sql_in
    + *
    + * @brief A collection of summary statistics to gauge model
    + * accuracy based on predicted values vs. ground-truth values.
    + * @date April 2016
    + *
    + *
    + *//* ----------------------------------------------------------------------- */
    +
    +m4_include(`SQLCommon.m4')
    +
    +/* ----------------------------------------------------------------------- */
    +
    +/**
    +@addtogroup grp_pred
    +
    +<div class="toc"><b>Contents</b>
    +<ul>
    +<li><a href="#list">List of Prediction Metric Functions</a></li>
    +<li><a href="#specs">Function Specific Details</a></li>
    +<li><a href="#examples">Examples</a></li>
    +<li><a href="#related">Related Topics</a></li>
    +</ul>
    +</div>
    +
    +@brief Provides various prediction accuracy metrics.
    +
    +This module provides a set of prediction accuracy metrics. It is a support
    +module for several machine learning algorithms that require metrics to validate
    +their models. A typical function will take a set of "prediction" and
    +"observation" values to calculate the desired metric, unless noted otherwise.
    +Grouping is supported by all of these functions (except confusion matrix).
    +
    +@anchor list
    +@par Prediction Metrics Functions
    +<table class="output">
    +<tr><th>mean_abs_error(table_in, table_out, prediction_col, observed_col, grouping_cols)</th><td> Mean Absolute Error. </td></tr>
    +<tr><th>mean_abs_perc_error(table_in, table_out, prediction_col, observed_col, grouping_cols)</th><td> Mean Absolute Percentage Error. </td></tr>
    +<tr><th>mean_perc_error(table_in, table_out, prediction_col, observed_col, grouping_cols)</th><td> Mean Percentage Error. </td></tr>
    +<tr><th>mean_squared_error(table_in, table_out, prediction_col, observed_col, grouping_cols)</th><td> Mean Squared Error. </td></tr>
    +<tr><th>r2_score(table_in, table_out, prediction_col, observed_col, grouping_cols)</th><td> R-squared. </td></tr>
    +<tr><th>adjusted_r2_score(table_in, table_out, prediction_col, observed_col, num_predictors, training_size, grouping_cols)</th><td> Adjusted R-squared. </td></tr>
    +<tr><th>binary_classifier(table_in, table_out, prediction_col, observed_col, grouping_cols)</th><td> Collection of prediction metrics related to binary classification. </td></tr>
    +<tr><th>area_under_roc(table_in, table_out, prediction_col, observed_col, grouping_cols)</th><td> Area under the ROC curve (in binary classification). </td></tr>
    +<tr><th>confusion_matrix(table_in, table_out, prediction_col, observed_col, grouping_cols)</th><td> Confusion matrix for a multi-class classifier. </td></tr>
    +</table>
    +
    +\b Arguments
    +<DL class="arglist">
    +<DT>table_in</DT>
    +<DD>TEXT. Name of the input table.</DD>
    +<DT>table_out</DT>
    +<DD>TEXT. Name of the output table.</DD>
    +<DT>prediction_col</DT>
    +<DD>TEXT. Name of the column of predicted values from input table.</DD>
    +<DT>observed_col</DT>
    +<DD>TEXT. Name of the column of observed values from input table.</DD>
    +<DT>num_predictors</DT>
    +<DD>INTEGER. The number of parameters in the predicting model, not counting the constant term.</DD>
    +<DT>training_size</DT>
    +<DD>INTEGER. The number of rows used for training, excluding any NULL rows.</DD>
    +<DT>grouping_cols (optional)</DT>
    +<DD>TEXT, default: NULL. Name of the column of grouping values from input
    +table.</DD>
    +</DL>
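    +
    +As an illustrative sketch of the calling convention (the schema name 'madlib'
    +and the table/column names 'test_results', 'mae_out', 'mae_grp_out', 'pred',
    +'obs', and 'group_id' below are hypothetical placeholders):
    +
    +<pre class="example">
    +-- compute the mean absolute error between predictions and observations
    +SELECT madlib.mean_abs_error('test_results', 'mae_out', 'pred', 'obs');
    +SELECT * FROM mae_out;
    +
    +-- the same metric computed per group, using the optional grouping_cols argument
    +SELECT madlib.mean_abs_error('test_results', 'mae_grp_out', 'pred', 'obs', 'group_id');
    +SELECT * FROM mae_grp_out;
    +</pre>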
    +
    +@anchor specs
    +@par Function Specific Details
    +
    +<b>R-squared Score</b>
    +
    +This function returns the coefficient of determination (R2) between the
    +predicted and observed values. An R2 of 1 indicates that the regression line
    +perfectly fits the data, while an R2 of 0 indicates that the line does not fit
    +the data at all. Negative values of R2 may occur when fitting non-linear
    +functions to data. Please refer to the reference <a href="#r2">[1]</a> for
    +further details.
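    +
    +For reference, the standard definition of the coefficient of determination
    +(stated here in general form, not specific to this implementation) is
    +\f[
    +    R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
    +\f]
    +where \f$y_i\f$ are the observed values, \f$\hat{y}_i\f$ the predicted values,
    +and \f$\bar{y}\f$ the mean of the observed values. A minimal usage sketch
    +(table and column names are hypothetical):
    +
    +<pre class="example">
    +SELECT madlib.r2_score('test_results', 'r2_out', 'pred', 'obs');
    +SELECT * FROM r2_out;
    +</pre>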
    +
    +<b>Adjusted R-squared Score</b>
    +
    +This function returns the adjusted R2 score. The adjusted R2 score is used to
    +counter the problem of the R2 automatically increasing when extra explanatory
    +variables are added to the model. It takes two additional integers describing
    +the degrees of freedom of the model and the size of the training set over which
    +it was developed, and returns the adjusted R-squared prediction accuracy
    +metric. Please refer to the reference <a href="#r2">[1]</a> for further details.
    +
    +Arguments:
    +
    +- num_predictors: Indicates the number of parameters the model has other than
    +the constant term. For example, if it is set to '3' the model may take the
    +following form, 7 + 5x + 39y + 0.91z.
    +- training_size: Indicates the number of rows in the training set (excluding
    +any NULL rows).
    +
    +Neither of these arguments can be deduced from the predicted values and the
    +test data alone.
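    +
    +Using the standard adjustment formula (stated here for reference, with
    +\f$n\f$ = training_size and \f$p\f$ = num_predictors),
    +\f[
    +    \bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}.
    +\f]
    +A minimal usage sketch (table and column names are hypothetical; '3' and '100'
    +are placeholder values for num_predictors and training_size):
    +
    +<pre class="example">
    +SELECT madlib.adjusted_r2_score('test_results', 'adj_r2_out', 'pred', 'obs', 3, 100);
    +SELECT * FROM adj_r2_out;
    +</pre>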
    +
    +@anchor bc
    +<b>Binary Classification</b>
    +
    +This function returns an output table with a number of metrics commonly used
    +in binary classification.
    +
    +The definitions of the various metrics are as follows.
    +
    +- \f$\textit{tp}\f$ is the count of correctly-classified positives.
    +- \f$\textit{tn}\f$ is the count of correctly-classified negatives.
    +- \f$\textit{fp}\f$ is the count of misclassified negatives.
    +- \f$\textit{fn}\f$ is the count of misclassified positives.
    +- \f$\textit{tpr}=\textit{tp}/(\textit{tp}+\textit{fn})\f$.
    +- \f$\textit{tnr}=\textit{tn}/(\textit{fp}+\textit{tn})\f$.
    +- \f$\textit{ppv}=\textit{tp}/(\textit{tp}+\textit{fp})\f$.
    +- \f$\textit{npv}=\textit{tn}/(\textit{tn}+\textit{fn})\f$.
    +- \f$\textit{fpr}=\textit{fp}/(\textit{fp}+\textit{tn})\f$.
    +- \f$\textit{fdr}=1-\textit{ppv}\f$.
    +- \f$\textit{fnr}=\textit{fn}/(\textit{fn}+\textit{tp})\f$.
    +- \f$\textit{acc}=(\textit{tp}+\textit{tn})/(\textit{tp}+\textit{tn}+\textit{fp}+\textit{fn})\f$.
    +- \f$\textit{f1}=2*\textit{tp}/(2*\textit{tp}+\textit{fp}+\textit{fn})\f$.
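    +
    +A minimal usage sketch (the table and column names are hypothetical; the
    +observed column is expected to hold 0/1 values and the prediction column a
    +score or probability):
    +
    +<pre class="example">
    +SELECT madlib.binary_classifier('test_results', 'bc_out', 'pred', 'obs');
    +SELECT * FROM bc_out;
    +</pre>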
    +
    +
    +<b>Area under ROC Curve</b>
    +
    +This function returns the area under the Receiver Operating Characteristic curve
    +for binary classification (the AUC). The ROC curve is the curve relating the
    +classifier's TPR and FPR metrics. (See <a href="#bc">Binary Classification</a>
    +for a definition of these metrics). Please refer to the reference <a
    +href="#aoc">[2]</a> for further details. Note that the binary classification
    +function can be used to obtain the data (tpr and fpr values) required for
    +drawing the ROC curve.
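    +
    +A minimal usage sketch (table and column names are hypothetical):
    +
    +<pre class="example">
    +SELECT madlib.area_under_roc('test_results', 'auc_out', 'pred', 'obs');
    +SELECT * FROM auc_out;
    +</pre>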
    +
    +@note For the 'binary_classifier' and 'area_under_roc' functions:
    +
    +The 'observed_col' column is assumed to be a numeric column with
    +two levels: 0 and 1. For the purposes of the metric calculation, 0 is considered
    +a negative and 1 is a positive.
    +
    +The 'prediction_col' column is expected to contain numeric values corresponding
    +to a likelihood/probability: a larger value corresponds to greater certainty that
    +the observed value will be '1', and a lower value corresponds to greater certainty
    +that it will be '0'.
    +
    +
    +<b>Confusion Matrix</b>
    +
    +This function returns the confusion matrix of a multi-class classification. Each
    +column of the matrix represents the instances in a predicted class while each
    +row represents the instances in an actual class. This allows more detailed
    +analysis than mere proportion of correct guesses (accuracy). Please refer to the
    +reference <a href="#cm">[3]</a> for further details. As noted earlier, grouping
    --- End diff --
    
    The anchor defined for reference [3] appears to be "aoc", not "cm".

