[GitHub] incubator-madlib pull request #42: Prediction Metrics: New module

2016-06-08 Thread mktal
Github user mktal commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/42#discussion_r66351302
  
--- Diff: src/ports/postgres/modules/stats/pred_metrics.sql_in ---
@@ -0,0 +1,831 @@
+/* --- 
*//**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ * @file pred_metrics.sql_in
+ *
+ * @brief A collection of summary statistics to gauge model
+ * accuracy based on predicted values vs. ground-truth values.
+ * @date April 2016
+ *
+ *
+ *//* 
--- */
+
+m4_include(`SQLCommon.m4')
+
+/* --- 
*/
+
+/**
+@addtogroup grp_pred
+
+Contents
+
+List of Prediction Metric Functions
+Function Specific Details
+Examples
+Related Topics
+
+
+
+@brief Provides various prediction accuracy metrics.
+
+This module provides a set of prediction accuracy metrics. It is a support
+module for several machine learning algorithms that require metrics to
+validate their models. A typical function will take a set of "prediction" and
+"observation" values to calculate the desired metric, unless noted otherwise.
+Grouping is supported by all of these functions (except confusion matrix).
+
+@anchor list
+@par Prediction Metrics Functions
+
+mean_abs_error(table_in, table_out, prediction_col, observed_col, grouping_cols): Mean Absolute Error.
+mean_abs_perc_error(table_in, table_out, prediction_col, observed_col, grouping_cols): Mean Absolute Percentage Error.
+mean_perc_error(table_in, table_out, prediction_col, observed_col, grouping_cols): Mean Percentage Error.
+mean_squared_error(table_in, table_out, prediction_col, observed_col, grouping_cols): Mean Squared Error.
+r2_score(table_in, table_out, prediction_col, observed_col, grouping_cols): R-squared.
+adjusted_r2_score(table_in, table_out, prediction_col, observed_col, num_predictors, training_size, grouping_cols): Adjusted R-squared.
+binary_classifier(table_in, table_out, prediction_col, observed_col, grouping_cols): Collection of prediction metrics related to binary classification.
+area_under_roc(table_in, table_out, prediction_col, observed_col, grouping_cols): Area under the ROC curve (binary classification).
+confusion_matrix(table_in, table_out, prediction_col, observed_col, grouping_cols): Confusion matrix for a multi-class classifier.
+
+
+\b Arguments
+
+table_in
+TEXT. Name of the input table.
+table_out
+TEXT. Name of the output table.
+prediction_col
+TEXT. Name of the column of predicted values in the input table.
+observed_col
+TEXT. Name of the column of observed values in the input table.
+num_predictors
+INTEGER. The number of parameters in the predicting model, not counting the constant term.
+training_size
+INTEGER. The number of rows used for training, excluding any NULL rows.
+grouping_cols (optional)
+TEXT, default: NULL. Name(s) of the grouping column(s) in the input table.
+
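For illustration, a typical call writes its result to the table named in
table_out (a minimal sketch; it assumes MADlib is installed in a schema named
madlib, and test_set is a hypothetical table with columns pred, obs, and
region):

    SELECT madlib.mean_squared_error('test_set', 'mse_out', 'pred', 'obs', 'region');
    SELECT * FROM mse_out ORDER BY region;

With grouping_cols supplied, mse_out contains one row per group; without it,
a single row.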
+
+@anchor specs
+@par Function Specific Details
+
+R-squared Score
+
+This function returns the coefficient of determination (R2) between the
+predicted and observed values. An R2 of 1 indicates that the regression line
+perfectly fits the data, while an R2 of 0 indicates that the line does not
+fit the data at all. Negative values of R2 may occur when fitting non-linear
+functions to the data. Please refer to reference [1] for further details.
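For reference, the standard definition (not part of the diff excerpt), with
predictions p_i, observations o_i, and observation mean \bar{o}:

\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
       = 1 - \frac{\sum_i (p_i - o_i)^2}{\sum_i (o_i - \bar{o})^2} \]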
+
+Adjusted R-squared Score
+
+This function returns the adjusted R2 score. The adjusted R2 score is used to
+counter the problem of the R2 automatically increasing when extra explanatory
+variables are added to the model. It takes two additional integers describing
+the degrees of freedom: num_predictors, the number of parameters the model
+uses, and training_size, the number of rows the model was trained on.
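For reference, the standard definition (not part of the diff excerpt), with
n = training_size and p = num_predictors:

\[ \bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \]

Unlike R2, this quantity decreases when an added predictor improves the fit
by less than would be expected by chance.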

[GitHub] incubator-madlib pull request #42: Prediction Metrics: New module

2016-06-08 Thread mktal
Github user mktal commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/42#discussion_r66351192
  
--- Diff: src/ports/postgres/modules/stats/pred_metrics.sql_in ---

[GitHub] incubator-madlib pull request #42: Prediction Metrics: New module

2016-06-08 Thread mktal
Github user mktal commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/42#discussion_r66282751
  
--- Diff: src/ports/postgres/modules/stats/pred_metrics.py_in ---
@@ -0,0 +1,562 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Prediction Metrics
+# This module provides a set of prediction accuracy metrics. It is a support
+# module for several machine learning algorithms that require metrics to
+# validate their models. A typical function will take a set of "prediction"
+# and "observation" values to calculate the desired metric, unless noted
+# otherwise. Grouping is supported by all of these functions (except
+# confusion matrix).
+
+# Please refer to the pred_metrics.sql_in file for the documentation.
+
+import plpy
+from utilities.validate_args import input_tbl_valid, output_tbl_valid, is_var_valid
+from utilities.utilities import _assert
+from utilities.utilities import split_quoted_delimited_str
+
+
+def _validate_args(table_in, table_out, validate_cols):
+    input_tbl_valid(table_in, "Prediction Metrics")
+    output_tbl_valid(table_out, "Prediction Metrics")
+    # Validate the grouping columns only if any were supplied.
+    if validate_cols:
+        is_var_valid(table_in, ', '.join(validate_cols))
+
+
+def _parse_grp_col_str(grp_col_str):
+    # De-duplicate the comma-separated (possibly quoted) column names.
+    # Note: set() removes duplicates but does not preserve input order.
+    group_set = set(split_quoted_delimited_str(grp_col_str))
+    return list(group_set)
+# --
+
+
+def _create_output_table(table_in, table_out, agg_fun, agg_name, grp_col_str=None):
+    """ Create an output table with optional grouping
+
+    General template function that builds an output table, applying an
+    aggregate function with optional grouping.
+
+    Args:
+        @param agg_fun: str, SQL aggregate expression to be executed
+        @param agg_name: str, Name of the output column for the aggregate
+        @param grp_col_str: str, Comma-separated list of grouping column names
+    """
+    grp_cols = _parse_grp_col_str(grp_col_str)
+    _validate_args(table_in, table_out, grp_cols)
+    if not grp_cols:
+        grp_by_str = grp_out_str = ""
+    else:
+        grp_by_str = "GROUP BY " + grp_col_str
+        grp_out_str = grp_col_str + ", "
+    plpy.execute("""
+        CREATE TABLE {table_out} AS
+        SELECT
+            {grp_out_str}
+            {agg_fun} AS {agg_name}
+        FROM {table_in}
+        {grp_by_str}
+        """.format(**locals()))
+
+
+# Mean Absolute Error.
+def mean_abs_error(table_in, table_out, pred_col, obs_col, grp_cols=None):
+    mean_abs_agg = "AVG(ABS({0} - {1}))".format(pred_col, obs_col)
+    _create_output_table(table_in, table_out, mean_abs_agg, "mean_abs_error", grp_cols)
+
+
+# Mean Absolute Percentage Error.
+# NULLIF guards against division by zero: rows whose observed value is 0
+# contribute NULL and are ignored by AVG.
+def mean_abs_perc_error(table_in, table_out, pred_col, obs_col, grp_cols=None):
+    mean_abs_perc_agg = "AVG(ABS({0} - {1})/NULLIF({1}, 0))".format(pred_col, obs_col)
+    _create_output_table(table_in, table_out, mean_abs_perc_agg, "mean_abs_perc_error", grp_cols)
+
+
+# Mean Percentage Error (same NULLIF guard as above).
+def mean_perc_error(table_in, table_out, pred_col, obs_col, grp_cols=None):
+    mean_perc_agg = "AVG(({0} - {1})/NULLIF({1}, 0))".format(pred_col, obs_col)
+    _create_output_table(table_in, table_out, mean_perc_agg, "mean_perc_error", grp_cols)
+
+
+# Mean Squared Error.
+def mean_squared_error(table_in, table_out, pred_col, obs_col, grp_cols=None):
+    mean_sq_agg = "AVG(({0} - {1})^2)".format(pred_col, obs_col)
+    _create_output_table(table_in, table_out, mean_sq_agg, "mean_squared_error", grp_cols)
+
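To make the template concrete: a grouped call such as
mean_squared_error('t_in', 't_out', 'pred', 'obs', 'region') (hypothetical
names) would render roughly this SQL:

    CREATE TABLE t_out AS
    SELECT
        region,
        AVG((pred - obs)^2) AS mean_squared_error
    FROM t_in
    GROUP BY region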
+
+def metric_agg_help_msg(schema_madlib, message, agg_name, **kwargs):
+
+    if not message:
+        help_strin

[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-31 Thread iyerr3
Github user iyerr3 commented on the pull request:

https://github.com/apache/incubator-madlib/pull/42
  
Along with casting the columns to int in binary classification, we also need
to update the docs/online help/tests to reflect that boolean columns are
allowed for observation columns.
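
For example (illustrative; obs is a hypothetical BOOLEAN column):

    SELECT obs::INTEGER FROM test_set;  -- TRUE -> 1, FALSE -> 0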




[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-31 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/42#discussion_r65259392
  
--- Diff: src/ports/postgres/modules/stats/pred_metrics.py_in ---

[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-31 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/42#discussion_r65259215
  
--- Diff: src/ports/postgres/modules/stats/pred_metrics.py_in ---

[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-27 Thread iyerr3
Github user iyerr3 commented on the pull request:

https://github.com/apache/incubator-madlib/pull/42#issuecomment-23823
  
I have made some changes and added validation and online-help functions (in
my [private fork
branch](https://github.com/iyerr3/incubator-madlib/commits/feature/pred_metrics_take2)).

I have not implemented the SRF versions discussed earlier; those can be taken
up as separate work.

@orhankislal, could you please pull in the 3 commits on that branch and
update your branch to refresh this PR? Once we've discussed the changes here,
I can merge the commits to master.




[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-10 Thread iyerr3
Github user iyerr3 commented on the pull request:

https://github.com/apache/incubator-madlib/pull/42#issuecomment-218301178
  
General comments: 
- The distance functions (mean_*_error) all have the same structure except
for the distance metric. I suggest refactoring the table creation and the
grouping statements into a single function that takes the distance expression
as a parameter (see the sketch after this list).
- Avoid using `__` within Python. It leads to [name
mangling](https://docs.python.org/2/tutorial/classes.html#private-variables-and-class-local-references),
which doesn't really help in this situation.
- There seems to be considerable overlap between the grouping and
non-grouping SQL in the `r2`, `adjusted_r2` and `auc` functions. Maybe we can
combine them?
- I'll bump the
[suggestion](https://github.com/apache/incubator-madlib/pull/41#issuecomment-216661096)
by @decibel in the previous PR. It might be helpful to have two versions: one
that creates the output table and another that returns the rows directly.
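
A minimal sketch of that refactor (illustrative names only; assumes plpy is
available as in the surrounding module):

    # One template shared by all mean_*_error variants; only the aggregate
    # expression and the output column name change per metric.
    def _metric_table(table_in, table_out, agg_expr, metric_name, grouping_cols=None):
        grp_out = grouping_cols + ", " if grouping_cols else ""
        grp_by = "GROUP BY " + grouping_cols if grouping_cols else ""
        plpy.execute("""
            CREATE TABLE {table_out} AS
            SELECT {grp_out} {agg_expr} AS {metric_name}
            FROM {table_in}
            {grp_by}
            """.format(**locals()))

    def mean_abs_error(table_in, table_out, pred_col, obs_col, grouping_cols=None):
        _metric_table(table_in, table_out,
                      "AVG(ABS({0} - {1}))".format(pred_col, obs_col),
                      "mean_abs_error", grouping_cols)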




[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-03 Thread decibel
Github user decibel commented on the pull request:

https://github.com/apache/incubator-madlib/pull/41#issuecomment-216655743
  
I suggest starting with the documentation and holding off a bit on the code.
There might be even better ways to do things than what I initially thought
of; I'm hoping that the docs will help me figure that out.





[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-03 Thread orhankislal
Github user orhankislal closed the pull request at:

https://github.com/apache/incubator-madlib/pull/41




[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-03 Thread orhankislal
Github user orhankislal commented on the pull request:

https://github.com/apache/incubator-madlib/pull/41#issuecomment-216651893
  
Thanks for your comments, @decibel. I'll close this pull request, review your
comments, update the code and documentation accordingly, and create a new
pull request.




[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-03 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/41#discussion_r61940732
  
--- Diff: src/ports/postgres/modules/pred_metrics/pred_metrics.py_in ---
@@ -0,0 +1,391 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from utilities.utilities import unique_string
+import plpy
+
+def mean_abs_error(
+        schema_madlib, table_in, table_out, prediction_col, observed_col,
+        grouping_cols=None):
+    sql_st1 = """
+        CREATE TABLE {table_out} AS
+        SELECT
+            SUM(ABS({prediction_col} - {observed_col}))
+            / COUNT(*) AS mean_abs_error """.format(**locals())
+    sql_st2 = ""
+    sql_st3 = """ FROM {table_in} """.format(**locals())
+    sql_st4 = ""
+    if grouping_cols:
+        sql_st2 = """ , {grouping_cols} """.format(**locals())
+        sql_st4 = """ GROUP BY {grouping_cols}""".format(**locals())
+    sql_st = sql_st1 + sql_st2 + sql_st3 + sql_st4
+    plpy.execute(sql_st)
+
+def mean_abs_perc_error(
+        schema_madlib, table_in, table_out, prediction_col, observed_col,
+        grouping_cols=None):
+    sql_st1 = """
+        CREATE TABLE {table_out} AS
+        SELECT
+            SUM(ABS({prediction_col} - {observed_col})/{observed_col})
+            / COUNT(*) AS mean_abs_perc_error """.format(**locals())
+    sql_st2 = ""
+    sql_st3 = """ FROM {table_in} """.format(**locals())
+    sql_st4 = ""
+    if grouping_cols:
+        sql_st2 = """ , {grouping_cols} """.format(**locals())
+        sql_st4 = """ GROUP BY {grouping_cols}""".format(**locals())
+    sql_st = sql_st1 + sql_st2 + sql_st3 + sql_st4
+    plpy.execute(sql_st)
+
+def mean_perc_error(
+        schema_madlib, table_in, table_out, prediction_col, observed_col,
+        grouping_cols=None):
+    sql_st1 = """
+        CREATE TABLE {table_out} AS
+        SELECT
+            SUM(({prediction_col} - {observed_col})/{observed_col})
+            / COUNT(*) AS mean_perc_error """.format(**locals())
+    sql_st2 = ""
+    sql_st3 = """ FROM {table_in} """.format(**locals())
+    sql_st4 = ""
+    if grouping_cols:
+        sql_st2 = """ , {grouping_cols} """.format(**locals())
+        sql_st4 = """ GROUP BY {grouping_cols}""".format(**locals())
+    sql_st = sql_st1 + sql_st2 + sql_st3 + sql_st4
+    plpy.execute(sql_st)
+
+def mean_squared_error(
+        schema_madlib, table_in, table_out, prediction_col, observed_col,
+        grouping_cols=None):
+    sql_st1 = """
+        CREATE TABLE {table_out} AS
+        SELECT
+            SUM(({prediction_col} - {observed_col})^2)
+            / COUNT(*) AS mean_squared_error """.format(**locals())
+    sql_st2 = ""
+    sql_st3 = """ FROM {table_in} """.format(**locals())
+    sql_st4 = ""
+    if grouping_cols:
+        sql_st2 = """ , {grouping_cols} """.format(**locals())
+        sql_st4 = """ GROUP BY {grouping_cols}""".format(**locals())
+    sql_st = sql_st1 + sql_st2 + sql_st3 + sql_st4
+    plpy.execute(sql_st)
+
+
+def __r2_score(
--- End diff --

This scans {table_in} 3 times for no good reason. It would be better to use
the query below, but I'm not sure that's even necessary. Do any of the
functions at
http://www.postgresql.org/docs/9.5/static/functions-aggregate.html#FUNCTIONS-AGGREGATE-STATISTICS-TABLE
do what's necessary here?

CREATE TABLE {table_out} AS
SELECT 1 - ssres/sstot AS r2_score FROM (
    SELECT
        SUM(({prediction_col} - {observed_col})^2) AS ssres,
        SUM(({observed_col} - (SELECT SUM({observed_col})/COUNT(*) AS mean
                               FROM {table_in}))^2) AS sstot
    FROM {table_in}
) intermediate
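
For reference: regr_r2(Y, X) from that page computes the squared sample
correlation, which equals 1 - ssres/sstot only when the predictions come from
an ordinary least-squares fit of the same data, so it is not a general
drop-in here. A truly single-scan variant is possible by expanding sstot
algebraically, since SUM((o - AVG(o))^2) = SUM(o^2) - SUM(o)^2/COUNT(*)
(illustrative sketch, not from the PR; pred, obs, and table_in are
placeholders):

    SELECT 1 - SUM((pred - obs)^2)
             / (SUM(obs^2) - SUM(obs)^2 / COUNT(*)) AS r2_score
    FROM table_in;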



[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-03 Thread decibel
Github user decibel commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/41#discussion_r61935728
  
--- Diff: src/ports/postgres/modules/pred_metrics/pred_metrics.py_in ---
+        sql_st2 = """ , {grouping_cols} """.format(**locals())
--- End diff --

FWIW, the normal convention is to put grouping columns first in the select
clause.




[GitHub] incubator-madlib pull request: Prediction Metrics: New module

2016-05-03 Thread orhankislal
GitHub user orhankislal opened a pull request:

https://github.com/apache/incubator-madlib/pull/41

Prediction Metrics: New module

JIRA: MADLIB-907
A collection of summary statistics to gauge model accuracy
based on predicted values vs. ground-truth values.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/orhankislal/incubator-madlib feature/pred_metrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/41.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #41








Re: Prediction Metrics

2016-04-11 Thread Frank McQuillan
Orhan,

I think this is a good addition to MADlib.  Regarding your questions:

1) Seems like a good set of prediction metrics to start with.  If other
members of the community would like to add more, they are welcome to create
a JIRA for those and work on them.

2) Suggest we do include grouping as an optional param, since it could be
very useful.  It means an output table is the way to go.  Without grouping,
an output table with a single value is not ideal but OK, since consistency
of output format is useful.

Frank



On Fri, Apr 8, 2016 at 3:54 PM, Orhan Kislal <okis...@pivotal.io> wrote:

> Hello MADlib community,
>
> I think it might make sense to add a module to MADlib for prediction
> metrics. Since there are quite a few options, I decided to start with the
> list of metrics from PDLTools [1]. You can see my proposed interface in the
> attachment to the associated JIRA [2,3]. I'll paste a snippet just as an
> example. I would like the community's feedback on a number of questions
> that came up.
>
> 1) Are there any other metrics that should take precedence over these ones?
> Please note that binary_classifier reports multiple metrics (tpr, fpr, acc,
> f1, etc.)
>
> 2) How should we handle grouping? As you can see in the example, the
> function returns a double value for regular execution, but an output table
> is used if the grouping parameter is passed. This dual interface doesn't
> seem clean, and returning a table with a single value for regular execution
> feels wrong.
>
> Thanks
>
> Orhan Kislal
>
>
> [1]
>
> http://pivotalsoftware.github.io/PDLTools/group__grp__prediction__metrics.html
>
> [2] https://issues.apache.org/jira/browse/MADLIB-907
>
> [3]
> https://issues.apache.org/jira/secure/attachment/12797816/interface_v1.sql
>
> ---
>
> CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.area_under_roc(
>     table_in       TEXT,
>     prediction_col TEXT,
>     observed_col   TEXT,
>     table_out      TEXT,
>     grouping_col   TEXT
> ) RETURNS VOID
> AS $$
> PythonFunctionBodyOnly(`pred_metrics', `pred_metrics')
> return pred_metrics.area_under_roc(schema_madlib,
>     table_in, prediction_col, observed_col, table_out, grouping_col)
> $$ LANGUAGE plpythonu
> m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
>
> CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.area_under_roc(
>     table_in       TEXT,
>     prediction_col TEXT,
>     observed_col   TEXT
> ) RETURNS DOUBLE PRECISION
> AS $$
> PythonFunctionBodyOnly(`pred_metrics', `pred_metrics')
> return pred_metrics.area_under_roc(schema_madlib,
>     table_in, prediction_col, observed_col)
> $$ LANGUAGE plpythonu
> m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');
>
> ---
>
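
For question 2, one alternative echoed later in the thread (@decibel's
two-version suggestion, not implemented in this PR) is to keep the
table-writing form and add a row-returning form. An interface-only sketch,
where area_under_roc_rows is a hypothetical Python helper; a distinct SQL
name is used because Postgres cannot overload on return type alone:

CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.area_under_roc_rows(
    table_in       TEXT,
    prediction_col TEXT,
    observed_col   TEXT,
    grouping_col   TEXT
) RETURNS TABLE (group_key TEXT, area_under_roc DOUBLE PRECISION)
AS $$
PythonFunctionBodyOnly(`pred_metrics', `pred_metrics')
return pred_metrics.area_under_roc_rows(schema_madlib,
    table_in, prediction_col, observed_col, grouping_col)
$$ LANGUAGE plpythonu
m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `MODIFIES SQL DATA', `');

Returning rows suits ad-hoc queries, while the table-writing form remains for
persisting results; consistency of output format, as Frank notes, favors
keeping the table form as the primary interface.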