[GitHub] incubator-madlib pull request #42: Prediction Metrics: New module
Github user mktal commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/42#discussion_r66351302

--- Diff: src/ports/postgres/modules/stats/pred_metrics.sql_in ---
@@ -0,0 +1,831 @@

/* --- *//**
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 * @file pred_metrics.sql_in
 *
 * @brief A collection of summary statistics to gauge model
 * accuracy based on predicted values vs. ground-truth values.
 * @date April 2016
 *
 *//* --- */

m4_include(`SQLCommon.m4')

/* --- */

/**
@addtogroup grp_pred

Contents
List of Prediction Metric Functions
Function Specific Details
Examples
Related Topics

@brief Provides various prediction accuracy metrics.

This module provides a set of prediction accuracy metrics. It is a support
module for several machine learning algorithms that require metrics to validate
their models. Unless noted otherwise, a typical function takes a set of
"prediction" and "observation" values and calculates the desired metric.
Grouping is supported by all of these functions (except confusion matrix).

@anchor list
@par Prediction Metrics Functions

mean_abs_error(table_in, table_out, prediction_col, observed_col, grouping_cols) Mean Absolute Error.
mean_abs_perc_error(table_in, table_out, prediction_col, observed_col, grouping_cols) Mean Absolute Percentage Error.
mean_perc_error(table_in, table_out, prediction_col, observed_col, grouping_cols) Mean Percentage Error.
mean_squared_error(table_in, table_out, prediction_col, observed_col, grouping_cols) Mean Squared Error.
r2_score(table_in, table_out, prediction_col, observed_col, grouping_cols) R-squared.
adjusted_r2_score(table_in, table_out, prediction_col, observed_col, num_predictors, training_size, grouping_cols) Adjusted R-squared.
binary_classifier(table_in, table_out, prediction_col, observed_col, grouping_cols) Collection of prediction metrics related to binary classification.
area_under_roc(table_in, table_out, prediction_col, observed_col, grouping_cols) Area under the ROC curve (for binary classification).
confusion_matrix(table_in, table_out, prediction_col, observed_col, grouping_cols) Confusion matrix for a multi-class classifier.

\b Arguments

table_in
    TEXT. Name of the input table.
table_out
    TEXT. Name of the output table.
prediction_col
    TEXT. Name of the column of predicted values in the input table.
observed_col
    TEXT. Name of the column of observed values in the input table.
num_predictors
    INTEGER. The number of parameters in the predicting model, not counting the constant term.
training_size
    INTEGER. The number of rows used for training, excluding any NULL rows.
grouping_cols (optional)
    TEXT, default: NULL. Name of the grouping column(s) in the input table.

@anchor specs
@par Function Specific Details

R-squared Score

This function returns the coefficient of determination (R2) between the
predicted and observed values. An R2 of 1 indicates that the regression line
perfectly fits the data, while an R2 of 0 indicates that the line does not fit
the data at all. Negative values of R2 may occur when fitting non-linear
functions to data. Please refer to reference [1] for further details.

Adjusted R-squared Score

This function returns the adjusted R2 score. The adjusted R2 score is used to
counter the problem of R2 automatically increasing when extra explanatory
variables are added to the model. It takes two additional integers describing
the degrees of freedom of the model (num_predictors) and the size of the
training set (training_size).
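To make the two scores concrete, here is a small, hedged Python sketch (illustrative only, not MADlib code) computing R2 and adjusted R2 from prediction/observation lists; the argument names mirror the module's num_predictors and training_size:

```python
def r2_score(predicted, observed):
    # R2 = 1 - SS_res / SS_tot, the coefficient of determination.
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    ss_res = sum((p - y) ** 2 for p, y in zip(predicted, observed))
    return 1.0 - ss_res / ss_tot

def adjusted_r2_score(r2, num_predictors, training_size):
    # Penalizes R2 for the number of explanatory variables in the model.
    return 1.0 - (1.0 - r2) * (training_size - 1) / (training_size - num_predictors - 1)
```

A perfect fit yields 1.0, and predicting the observed mean everywhere yields 0.0, matching the description above.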
[GitHub] incubator-madlib pull request #42: Prediction Metrics: New module
Github user mktal commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/42#discussion_r66351192 --- Diff: src/ports/postgres/modules/stats/pred_metrics.sql_in ---
[GitHub] incubator-madlib pull request #42: Prediction Metrics: New module
Github user mktal commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/42#discussion_r66282751

--- Diff: src/ports/postgres/modules/stats/pred_metrics.py_in ---
@@ -0,0 +1,562 @@

# Prediction Metrics
# This module provides a set of prediction accuracy metrics. It is a support
# module for several machine learning algorithms that require metrics to
# validate their models. A typical function will take a set of "prediction" and
# "observation" values to calculate the desired metric, unless noted otherwise.
# Grouping is supported by all of these functions (except confusion matrix).

# Please refer to the pred_metrics.sql_in file for the documentation.

import plpy
from utilities.validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
from utilities.utilities import _assert
from utilities.utilities import split_quoted_delimited_str


def _validate_args(table_in, table_out, validate_cols):
    input_tbl_valid(table_in, "Prediction Metrics")
    output_tbl_valid(table_out, "Prediction Metrics")
    is_var_valid(table_in, ', '.join(validate_cols))


def _parse_grp_col_str(grp_col_str):
    group_set = set(split_quoted_delimited_str(grp_col_str))
    return list(group_set)


def _create_output_table(table_in, table_out, agg_fun, agg_name, grp_col_str=None):
    """ Create an output table with optional groups

    General template function that builds an output table with grouping while
    applying an aggregate function.

    Args:
        @param agg_fun: str, SQL aggregate to be executed
        @param grp_cols: str, Comma-separated list of column names
    """
    grp_cols = _parse_grp_col_str(grp_col_str)
    _validate_args(table_in, table_out, grp_cols)
    if not grp_cols:
        grp_by_str = grp_out_str = ""
    else:
        grp_by_str = "GROUP BY " + grp_col_str
        grp_out_str = grp_col_str + ", "
    plpy.execute("""
        CREATE TABLE {table_out} AS
        SELECT
            {grp_out_str}
            {agg_fun} AS {agg_name}
        FROM {table_in}
        {grp_by_str}
        """.format(**locals()))


# Mean Absolute Error.
def mean_abs_error(table_in, table_out, pred_col, obs_col, grp_cols=None):
    mean_abs_agg = "AVG(ABS({0} - {1}))".format(pred_col, obs_col)
    _create_output_table(table_in, table_out, mean_abs_agg, "mean_abs_error", grp_cols)


# Mean Absolute Percentage Error.
def mean_abs_perc_error(table_in, table_out, pred_col, obs_col, grp_cols=None):
    mean_abs_perc_agg = "AVG(ABS({0} - {1})/NULLIF({1}, 0))".format(pred_col, obs_col)
    _create_output_table(table_in, table_out, mean_abs_perc_agg, "mean_abs_perc_error", grp_cols)


# Mean Percentage Error.
def mean_perc_error(table_in, table_out, pred_col, obs_col, grp_cols=None):
    mean_perc_agg = "AVG(({0} - {1})/NULLIF({1}, 0))".format(pred_col, obs_col)
    _create_output_table(table_in, table_out, mean_perc_agg, "mean_perc_error", grp_cols)


# Mean Squared Error.
def mean_squared_error(table_in, table_out, pred_col, obs_col, grp_cols=None):
    mean_sq_agg = "AVG(({0} - {1})^2)".format(pred_col, obs_col)
    _create_output_table(table_in, table_out, mean_sq_agg, "mean_squared_error", grp_cols)


def metric_agg_help_msg(schema_madlib, message, agg_name, **kwargs):
    if not message:
        help_strin
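The SQL aggregates in the diff above map to simple formulas. A hedged pure-Python sketch (illustrative only, not part of the module) of what each aggregate computes — note that the NULLIF(obs, 0) guard makes SQL's AVG skip zero-observation rows, mirrored here by filtering:

```python
def mean_abs_error(preds, obs):
    # AVG(ABS(pred - obs))
    return sum(abs(p - o) for p, o in zip(preds, obs)) / len(obs)

def mean_squared_error(preds, obs):
    # AVG((pred - obs)^2)
    return sum((p - o) ** 2 for p, o in zip(preds, obs)) / len(obs)

def mean_abs_perc_error(preds, obs):
    # AVG(ABS(pred - obs) / NULLIF(obs, 0)); SQL's AVG ignores the NULLs
    # produced by NULLIF, so rows with obs == 0 simply drop out.
    vals = [abs(p - o) / o for p, o in zip(preds, obs) if o != 0]
    return sum(vals) / len(vals)
```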
[GitHub] incubator-madlib pull request: Prediction Metrics: New module
Github user iyerr3 commented on the pull request: https://github.com/apache/incubator-madlib/pull/42 Along with casting the columns to int in binary classification, we also need to change the docs/online-help/tests to reflect that boolean columns are allowed for observation columns. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-madlib pull request: Prediction Metrics: New module
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/42#discussion_r65259392 --- Diff: src/ports/postgres/modules/stats/pred_metrics.py_in ---
[GitHub] incubator-madlib pull request: Prediction Metrics: New module
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/42#discussion_r65259215 --- Diff: src/ports/postgres/modules/stats/pred_metrics.py_in ---
[GitHub] incubator-madlib pull request: Prediction Metrics: New module
Github user iyerr3 commented on the pull request: https://github.com/apache/incubator-madlib/pull/42#issuecomment-23823 I have made some changes and added validation and online help functions (in my [private fork branch](https://github.com/iyerr3/incubator-madlib/commits/feature/pred_metrics_take2)). I have not implemented the SRF versions as discussed earlier; we can treat those as separate work. @orhankislal could you please pull in the 3 commits on that branch and update your branch to update this PR. Once we've discussed the changes here, I can merge the commits to master.
[GitHub] incubator-madlib pull request: Prediction Metrics: New module
Github user iyerr3 commented on the pull request: https://github.com/apache/incubator-madlib/pull/42#issuecomment-218301178 General comments:
- The distance functions (mean_*_error) all have the same structure except for the distance metric. I suggest refactoring the table creation and the grouping statements into a single function that takes a parameter for the distance.
- Avoid using `__` within Python. It leads to [name mangling](https://docs.python.org/2/tutorial/classes.html#private-variables-and-class-local-references), which doesn't really help in this situation.
- There seems to be considerable overlap between the grouping and non-grouping SQL in the `r2`, `adjusted_r2` and `auc` functions. Maybe we can combine them?
- I'll bump the [suggestion](https://github.com/apache/incubator-madlib/pull/41#issuecomment-216661096) by @decibel in the previous PR. It might be helpful to have 2 versions: one that creates the output table and another that returns the rows directly.
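The first suggestion — one shared helper parameterized by the distance expression — might look like the following hedged sketch (function and variable names are illustrative, not from the PR):

```python
# Each metric is just a different SQL aggregate over the same query template.
# Hypothetical names; the real module may differ.
DISTANCE_AGGREGATES = {
    "mean_abs_error": "AVG(ABS({pred} - {obs}))",
    "mean_abs_perc_error": "AVG(ABS({pred} - {obs}) / NULLIF({obs}, 0))",
    "mean_perc_error": "AVG(({pred} - {obs}) / NULLIF({obs}, 0))",
    "mean_squared_error": "AVG(({pred} - {obs})^2)",
}

def build_metric_query(metric, table_in, table_out, pred_col, obs_col,
                       grouping_cols=None):
    # Build the CREATE TABLE statement once; every metric reuses it.
    agg = DISTANCE_AGGREGATES[metric].format(pred=pred_col, obs=obs_col)
    grp_out = grouping_cols + ", " if grouping_cols else ""
    grp_by = "GROUP BY " + grouping_cols if grouping_cols else ""
    return ("CREATE TABLE {0} AS SELECT {1}{2} AS {3} FROM {4} {5}"
            .format(table_out, grp_out, agg, metric, table_in, grp_by).strip())
```

With this in place each public function reduces to a one-line call, e.g. `build_metric_query('mean_squared_error', 'pred_tbl', 'out_tbl', 'pred', 'obs', 'state')`, and the grouping logic lives in exactly one place.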
[GitHub] incubator-madlib pull request: Prediction Metrics: New module
Github user decibel commented on the pull request: https://github.com/apache/incubator-madlib/pull/41#issuecomment-216655743 I suggest starting with the documentation and holding off a bit on the code. There might be some even better ways to do things than what I initially thought of; I'm hoping that the docs will help me figure that out.
[GitHub] incubator-madlib pull request: Prediction Metrics: New module
Github user orhankislal closed the pull request at: https://github.com/apache/incubator-madlib/pull/41
[GitHub] incubator-madlib pull request: Prediction Metrics: New module
Github user orhankislal commented on the pull request: https://github.com/apache/incubator-madlib/pull/41#issuecomment-216651893 Thanks for your comments, decibel. I'll close this pull request, review your comments, update the code and documentation accordingly, and create a new pull request.
[GitHub] incubator-madlib pull request: Prediction Metrics: New module
Github user decibel commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/41#discussion_r61940732

--- Diff: src/ports/postgres/modules/pred_metrics/pred_metrics.py_in ---
@@ -0,0 +1,391 @@

from utilities.utilities import unique_string
import plpy

def mean_abs_error(
        schema_madlib, table_in, table_out, prediction_col, observed_col,
        grouping_cols=None):
    sql_st1 = """
        CREATE TABLE {table_out} AS
        SELECT
            SUM(ABS({prediction_col} - {observed_col}))
            / COUNT(*) AS mean_abs_error """.format(**locals())
    sql_st2 = ""
    sql_st3 = """ FROM {table_in} """.format(**locals())
    sql_st4 = ""
    if grouping_cols:
        sql_st2 = """ , {grouping_cols} """.format(**locals())
        sql_st4 = """ GROUP BY {grouping_cols}""".format(**locals())
    sql_st = sql_st1 + sql_st2 + sql_st3 + sql_st4
    plpy.execute(sql_st)

def mean_abs_perc_error(
        schema_madlib, table_in, table_out, prediction_col, observed_col,
        grouping_cols=None):
    sql_st1 = """
        CREATE TABLE {table_out} AS
        SELECT
            SUM(ABS({prediction_col} - {observed_col})/{observed_col})
            / COUNT(*) AS mean_abs_perc_error """.format(**locals())
    sql_st2 = ""
    sql_st3 = """ FROM {table_in} """.format(**locals())
    sql_st4 = ""
    if grouping_cols:
        sql_st2 = """ , {grouping_cols} """.format(**locals())
        sql_st4 = """ GROUP BY {grouping_cols}""".format(**locals())
    sql_st = sql_st1 + sql_st2 + sql_st3 + sql_st4
    plpy.execute(sql_st)

def mean_perc_error(
        schema_madlib, table_in, table_out, prediction_col, observed_col,
        grouping_cols=None):
    sql_st1 = """
        CREATE TABLE {table_out} AS
        SELECT
            SUM(({prediction_col} - {observed_col})/{observed_col})
            / COUNT(*) AS mean_perc_error """.format(**locals())
    sql_st2 = ""
    sql_st3 = """ FROM {table_in} """.format(**locals())
    sql_st4 = ""
    if grouping_cols:
        sql_st2 = """ , {grouping_cols} """.format(**locals())
        sql_st4 = """ GROUP BY {grouping_cols}""".format(**locals())
    sql_st = sql_st1 + sql_st2 + sql_st3 + sql_st4
    plpy.execute(sql_st)

def mean_squared_error(
        schema_madlib, table_in, table_out, prediction_col, observed_col,
        grouping_cols=None):
    sql_st1 = """
        CREATE TABLE {table_out} AS
        SELECT
            SUM(({prediction_col} - {observed_col})^2)
            / COUNT(*) AS mean_squared_error """.format(**locals())
    sql_st2 = ""
    sql_st3 = """ FROM {table_in} """.format(**locals())
    sql_st4 = ""
    if grouping_cols:
        sql_st2 = """ , {grouping_cols} """.format(**locals())
        sql_st4 = """ GROUP BY {grouping_cols}""".format(**locals())
    sql_st = sql_st1 + sql_st2 + sql_st3 + sql_st4
    plpy.execute(sql_st)


def __r2_score(

--- End diff --

This scans {table_in} 3 times for no good reason. It would be better to use the query below, but I'm not sure that's even necessary. Do any of the functions at http://www.postgresql.org/docs/9.5/static/functions-aggregate.html#FUNCTIONS-AGGREGATE-STATISTICS-TABLE do what's necessary here?

    CREATE TABLE {table_out} AS
    SELECT 1 - ssres/sstot AS r2_score
    FROM (
        SELECT
            sum(({prediction_col} - {observed_col})^2) AS ssres,
            sum(( {observed_col}
                  - (SELECT SUM({observed_col})/count(*) AS mean FROM {table_in})
                )^2) AS sstot
        FROM {table_in}
    ) intermediate
[GitHub] incubator-madlib pull request: Prediction Metrics: New module
Github user decibel commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/41#discussion_r61935728

--- Diff: src/ports/postgres/modules/pred_metrics/pred_metrics.py_in ---
@@ -0,0 +1,391 @@

from utilities.utilities import unique_string
import plpy

def mean_abs_error(
        schema_madlib, table_in, table_out, prediction_col, observed_col,
        grouping_cols=None):
    sql_st1 = """
        CREATE TABLE {table_out} AS
        SELECT
            SUM(ABS({prediction_col} - {observed_col}))
            / COUNT(*) AS mean_abs_error """.format(**locals())
    sql_st2 = ""
    sql_st3 = """ FROM {table_in} """.format(**locals())
    sql_st4 = ""
    if grouping_cols:
        sql_st2 = """ , {grouping_cols} """.format(**locals())

--- End diff --

FWIW, normal convention is to put grouping columns first in the select clause.
[GitHub] incubator-madlib pull request: Prediction Metrics: New module
GitHub user orhankislal opened a pull request: https://github.com/apache/incubator-madlib/pull/41 Prediction Metrics: New module JIRA: MADLIB-907 A collection of summary statistics to gauge model accuracy based on predicted values vs. ground-truth values. You can merge this pull request into a Git repository by running: $ git pull https://github.com/orhankislal/incubator-madlib feature/pred_metrics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-madlib/pull/41.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #41
Re: Prediction Metrics
Orhan, I think this is a good addition to MADlib. Regarding your questions:

1) Seems like a good set of prediction metrics to start with. If other members of the community would like to add more, they are welcome to create a JIRA for those and work on them.

2) Suggest we do include grouping as an optional param, since it could be very useful. It means an output table is the way to go. Without grouping, an output table with a single value is not ideal but OK, since consistency of output format is useful.

Frank

On Fri, Apr 8, 2016 at 3:54 PM, Orhan Kislal <okis...@pivotal.io> wrote:
> Hello MADlib community,
>
> I think it might make sense to add a module to MADlib for prediction
> metrics. Since there are quite a few options, I decided to start with the
> list of metrics from PDLTools [1]. You can see my proposed interface in the
> attachment to the associated JIRA [2,3]. I'll paste a snippet just as an
> example. I would like the community's feedback on a number of questions
> that came up.
>
> 1) Are there any other metrics that should take precedence over these ones?
> Please note that binary_classifier reports multiple metrics (tpr, fpr, acc,
> f1, etc.).
>
> 2) How should we handle grouping? As you can see in the example, the
> function returns a double value for regular execution, but an output table
> is used if the grouping parameter is passed. This dual interface doesn't
> seem clean, and returning a table with a single value for the regular
> execution feels wrong.
>
> Thanks
>
> Orhan Kislal
>
> [1] http://pivotalsoftware.github.io/PDLTools/group__grp__prediction__metrics.html
> [2] https://issues.apache.org/jira/browse/MADLIB-907
> [3] https://issues.apache.org/jira/secure/attachment/12797816/interface_v1.sql
>
> ---
>
> CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.area_under_roc(
>     table_in        TEXT,
>     prediction_col  TEXT,
>     observed_col    TEXT,
>     table_out       TEXT,
>     grouping_col    TEXT
> ) RETURNS VOID
> AS $$
> PythonFunctionBodyOnly(`pred_metrics', `pred_metrics')
>     return pred_metrics.area_under_roc(schema_madlib,
>         table_in, prediction_col, observed_col, table_out, grouping_col)
> $$ LANGUAGE plpythonu
> m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
>
> CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.area_under_roc(
>     table_in        TEXT,
>     prediction_col  TEXT,
>     observed_col    TEXT
> ) RETURNS DOUBLE PRECISION
> AS $$
> PythonFunctionBodyOnly(`pred_metrics', `pred_metrics')
>     return pred_metrics.area_under_roc(schema_madlib,
>         table_in, prediction_col, observed_col)
> $$ LANGUAGE plpythonu
> m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
>
> ---
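For context on what area_under_roc computes: AUC equals the Mann-Whitney U statistic normalized by the number of positive/negative pairs. A hedged pure-Python sketch (illustrative only; the proposed MADlib implementation may differ):

```python
def area_under_roc(scores, labels):
    # AUC via the rank-sum (Mann-Whitney) formulation, using average
    # ranks for tied scores. labels are 0/1 ground-truth values.
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos = sum(1 for y in labels if y)
    neg = n - pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - pos * (pos + 1) / 2.0) / (pos * neg)
```

This returns 1.0 for a perfect ranking and 0.5 in expectation for random scores, which is the single double value the non-grouping SQL signature above would report.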