This is an automated email from the ASF dual-hosted git repository.

nkak pushed a commit to branch madlib2-master
in repository https://gitbox.apache.org/repos/asf/madlib.git
commit 0cd28f9733927d63beaefc9488db7f8bfdb3bd80
Author: Nikhil Kak <n...@vmware.com>
AuthorDate: Thu Feb 15 17:53:56 2024 -0800

PMML: Do not include intercept as a predictor

JIRA: MADLIB-1517

Note that this commit only fixes GLM, logistic and linear. A future
commit will fix other pmml modules.

Context:
--------------------------------------------------------
MADlib's way of passing the intercept to regression models is a bit
unusual. Usually the intercept is a boolean which indicates whether the
model should be fit with an intercept or not. MADlib makes the user pass
an integer (1 means use an intercept, and no value means don't fit with
an intercept) along with the other independent variables and uses that
directly for computation. For example, ARRAY[1,x1,x2,...] means use an
intercept whereas ARRAY[x1,x2,...] means don't use an intercept.

Problem:
--------------------------------------------------------
* Essentially, all the regression algorithms treat the intercept value
  "1" as just another independent variable (by convention it is the
  first one).
* Because of this implicit assumption, users need to explicitly inject a
  predictor named "1" with a value of 1 for all the input rows. This can
  be very inconvenient, especially when using pmml to predict a stream
  of data or some other preprocessed form of data.

Fix:
--------------------------------------------------------
* Once the model is trained, the only way to know if the model was fit
  with an intercept is to look at the `independent_varname` field in the
  summary table.
* If the value contains ARRAY[1, x1, x2..], then an intercept was used.
* Since this intercept is just another independent variable, there
  aren't any explicit references or logic to handle the intercept in our
  python or c++ code for training or predict.
* Because of this assumption, the pmml code also considers all of
  "ARRAY[1,x1,x2,...]" as independent variables and hence the output
  pmml contains "1" as an input predictor.
* We can just remove all references to the column "1" in the pmml file.
  We will still keep the "p0" variable, which is explicitly marked as an
  intercept and will store the intercept's coefficient.
* The pmml module gets the 'X' and 'Y' values from the summary table and
  then parses them to create a list of all the independent predictors so
  that it can be written to the pmml file.
* It uses a regex to match the expression ARRAY[1,x1,x2]/ARRAY[x1,x2]
  and then returns either ['1','x1','x2'] or ['x1','x2'].
* Our goal with the pmml code is to not treat the intercept "1" as an
  independent predictor but just as an intercept.
* The commit fixes this by changing the regex and using its output to
  determine if an intercept was passed, so that both expressions
  ARRAY[1,x1,x2]/ARRAY[x1,x2] return ['x1', 'x2'].
* Also had to make changes to the various pmml builder classes to treat
  the intercept's coefficient differently than the feature coefficients.
* Note that this commit only fixes GLM, logistic and linear. A future
  commit will fix other pmml modules.

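As a rough sketch of the parsing change described in the list above (not
the actual formula.py code): the pattern below is a tidied form of the
regex used in this commit, with the brackets and the dot escaped, and the
helper name `split_intercept` is made up for illustration. It returns
whether a leading intercept was found together with the remaining
feature names.

```python
import re

# Tidied variant of the commit's regex (an assumption, not the exact pattern):
# group 1 captures an optional leading 0/1/0.0/1.0 intercept token,
# group 2 captures the remaining comma-separated feature names.
ARRAY_EXPR = re.compile(r'array\[([0-1],|[0-1]\.0,)?(["a-z0-9_, .]+)\]',
                        flags=re.I)


def split_intercept(x_str):
    """Return (has_intercept, feature_names) for an independent_varname string.

    For non-array expressions the feature names cannot be recovered, so
    None is returned and the intercept is assumed to be present.
    """
    match = ARRAY_EXPR.match(x_str)
    if match is None:
        return True, None
    intercept_token, features_csv = match.groups()
    features = [s.strip().replace('"', '') for s in features_csv.split(',')]
    return intercept_token is not None, features


print(split_intercept('ARRAY[1,x1,x2]'))   # (True, ['x1', 'x2'])
print(split_intercept('ARRAY[x1,x2]'))     # (False, ['x1', 'x2'])
```

Both array forms yield the same feature list; only the boolean differs,
which is what lets the builders place the first coefficient in the
intercept slot instead of emitting a predictor named "1".
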
Before the fix:
```
<?xml version="1.0" standalone="yes"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header copyright="Copyright (c) 2024 nkak">
    <Extension extender="MADlib" name="user" value="nkak"/>
    <Application name="MADlib" version="2.1.0"/>
    <Timestamp>2024-02-16 12:19:58.798139 PDT</Timestamp>
  </Header>
  <DataDictionary numberOfFields="4">
    <DataField name="second_attack_pmml_prediction" optype="categorical" dataType="boolean">
      <Value value="True"/>
      <Value value="False"/>
    </DataField>
    <DataField name="1" optype="continuous" dataType="double"/>
    <DataField name="treatment" optype="continuous" dataType="double"/>
    <DataField name="trait_anxiety" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="classification" normalizationMethod="softmax">
    <MiningSchema>
      <MiningField name="second_attack_pmml_prediction" usageType="predicted"/>
      <MiningField name="1"/>
      <MiningField name="treatment"/>
      <MiningField name="trait_anxiety"/>
    </MiningSchema>
    <RegressionTable intercept="0.0" targetCategory="True">
      <NumericPredictor name="1" coefficient="-6.363469941781864"/>
      <NumericPredictor name="treatment" coefficient="-1.0241060523932668"/>
      <NumericPredictor name="trait_anxiety" coefficient="0.11904491666860616"/>
    </RegressionTable>
    <RegressionTable intercept="0.0" targetCategory="False"/>
  </RegressionModel>
</PMML>
```

After the fix:
```
<?xml version="1.0" standalone="yes"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header copyright="Copyright (c) 2024 nkak">
    <Extension extender="MADlib" name="user" value="nkak"/>
    <Application name="MADlib" version="2.1.0"/>
    <Timestamp>2024-02-16 13:37:15.367609 PDT</Timestamp>
  </Header>
  <DataDictionary numberOfFields="3">
    <DataField name="second_attack_pmml_prediction" optype="categorical" dataType="boolean">
      <Value value="True"/>
      <Value value="False"/>
    </DataField>
    <DataField name="treatment" optype="continuous" dataType="double"/>
    <DataField name="trait_anxiety" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="classification" normalizationMethod="softmax">
    <MiningSchema>
      <MiningField name="second_attack_pmml_prediction" usageType="predicted"/>
      <MiningField name="treatment"/>
      <MiningField name="trait_anxiety"/>
    </MiningSchema>
    <RegressionTable intercept="-6.36346994178186" targetCategory="True">
      <NumericPredictor name="treatment" coefficient="-1.0241060523932697"/>
      <NumericPredictor name="trait_anxiety" coefficient="0.11904491666860609"/>
    </RegressionTable>
    <RegressionTable intercept="0.0" targetCategory="False"/>
  </RegressionModel>
</PMML>
```

Risks and limitations:
--------------------------------------------------------
1. Straying away from the intrinsic intercept assumption only for the
   pmml code:
* As we have established already, the intercept is not treated any
  differently from an independent variable.
* To fix the pmml file to not include the intercept as an independent
  variable, we will need to break this intrinsic assumption.
* If the pmml code breaks this assumption, there might be some
  unexpected side effects or errors that even exhaustive testing may
  not uncover. For example, the pmml code relies on the length of the
  coefficient array to make some decisions about naming and such (see
  formula.py for an example), which might have some weird edge cases.

We might be fine with this risk for now, and if something breaks in the
future, we can deal with it later. The biggest risk is that something
fundamental breaks in the future that might make us revert this new
logic. But the odds of that are pretty low.

2. If the user passed a non-array expression for the independent
   variable

Consider the following example:
```
-- Create a table where the x variable is an array of the independent variables to train on
CREATE TABLE warpbreaks_dummy_simple_xcol AS
    SELECT breaks AS y, ARRAY[1,"wool_B","tension_M", "tension_H"] AS x_a
    from warpbreaks_dummy;

-- Now use the column 'x_a' created in the previous step.
SELECT madlib.glm('warpbreaks_dummy_simple_xcol',
                  'glm_warpbreaks_intercept_1_simple_xcol',
                  'y' ,
                  'x_a' ,
                  'family=poisson, link=log');
```

Now there's no way for us to know if this model was fit with an
intercept or not. The only way to know is to check the value of
"independent_varname" in the summary table, which would be "x_a" in this
case and tells us nothing about the intercept. Ideally, we would like to
change the fit functions to take a boolean for the intercept arg, but
that would be too big of a change and hence is out of scope for this
commit. The easiest fix for now is to assume that all non-array
expressions always include the intercept. Note that this assumption only
applies to the pmml module.

3. Using the name_spec arg of the pmml function
* The pmml function accepts an optional arg named "name_spec" which is
  used to explicitly name the input and output variables in the pmml
  file.
* The user will now need to remove the "1" from this expression. For
  example,
  `SELECT madlib.pmml('patients_logregr', 'attack~1+anxiety+treatment');`
  will have to be rewritten as
  `SELECT madlib.pmml('patients_logregr', 'attack~anxiety+treatment');`
* We will need to remove this from the pmml user docs, which will be
  done in a separate PR.

4. If the intercept is not the first one in the independent_varname
   array expression

Consider the following examples:
```
SELECT madlib.linregr_train('houses', 'linregr_model', 'price', 'array[bedroom, 1, bath, size]');
SELECT madlib.linregr_predict(coef, ARRAY[bedroom, 1, bath, size]) FROM linregr_model, houses;
```
or
```
SELECT madlib.linregr_train('houses', 'linregr_model', 'price', 'array[bedroom, bath, size, 1]');
SELECT madlib.linregr_predict(coef, ARRAY[bedroom, bath, size, 1]) FROM linregr_model, houses;
```
Both of these are allowed, which makes it really hard for the pmml code
to figure out if the intercept was used or not.

Solution 1:
* Always assume that the intercept arg "1" will be at the start of the
  expression.
* All our regression user docs usually specify the intercept at the
  beginning, so most of our users will be used to that format.
* There is a small risk that when the intercept is not at the beginning
  of the expression, the exported pmml will assume that "1" is a normal
  predictor and not an intercept. This is no different from how it was
  treated before this fix. Users will just need to provide a column
  named "1" when predicting using that pmml.

Solution 2:
* The pmml code will need to get smarter and parse the array expression
  to figure out the position of the intercept and then accordingly get
  the intercept coefficient from the coef array.
* This will require a lot of work and might still not be foolproof since
  we also allow passing random integers in the independent variable
  expression (see previous issue).
* Even if we ignore the integer issue, we will need to make quite a few
  changes to the pmml code, which can be error prone and hard to
  maintain.

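To make the trade-off around the intercept position concrete, here is a
small sketch under stated assumptions. `split_coefs` mirrors the
`_get_intercept_and_x_coefs` helper added in this commit, rewritten as a
free function; `regression_table` and the coefficient values are invented
purely for illustration.

```python
def split_coefs(coefs, has_intercept):
    """Separate the intercept coefficient from the feature coefficients.

    Only a leading intercept is recognised; when none was detected the
    intercept defaults to 0 and every coefficient stays a feature.
    """
    if has_intercept:
        return coefs[0], list(coefs[1:])
    return 0, list(coefs)


def regression_table(names, coefs, has_intercept):
    """Hypothetical stand-in for the intercept/predictor part of a PMML table."""
    intercept, x_coefs = split_coefs(coefs, has_intercept)
    return {"intercept": intercept, "predictors": dict(zip(names, x_coefs))}


# array[1, bedroom, bath, size]: the leading "1" is detected as the intercept.
leading = regression_table(["bedroom", "bath", "size"],
                           [10.0, 1.5, 2.5, 0.5], True)

# array[bedroom, bath, size, 1]: the trailing "1" is not detected, so it
# stays a regular predictor and the caller must supply a column named "1".
trailing = regression_table(["bedroom", "bath", "size", "1"],
                            [1.5, 2.5, 0.5, 10.0], False)

print(leading)
print(trailing)
```

In the second case the model still predicts correctly as long as the
input provides the constant column, which matches the behaviour before
this commit.
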
Decided to go with Solution 1 for ease of use and maintainability
---
 src/ports/postgres/modules/pmml/formula.py_in      |  98 +++++-
 src/ports/postgres/modules/pmml/pmml_builder.py_in |  61 ++--
 .../pmml/test/unit_tests/test_formula.py_in        | 349 +++++++++++++++++++++
 3 files changed, 469 insertions(+), 39 deletions(-)

diff --git a/src/ports/postgres/modules/pmml/formula.py_in b/src/ports/postgres/modules/pmml/formula.py_in
index 4a14e0df..0d575315 100644
--- a/src/ports/postgres/modules/pmml/formula.py_in
+++ b/src/ports/postgres/modules/pmml/formula.py_in
@@ -2,25 +2,61 @@ import plpy
 import re

 class Formula(object):
+
     def __init__(self, y_str, x_str, coef_len):
-        self.n_coef = coef_len
+        """
+        :param y_str: Dependent variable used during training
+        :param x_str: Independent variable used during training. Can take
+                      multiple formats like
+                      'ARRAY[1,x1,x2]', 'ARRAY[x1,x2]' or just 'x'
+        :param coef_len: Length of all the coefficients including the
+                         intercept's coefficient(if any)
+        """
+        # TODO: Fix the nested warning and add explanation for the regex
+        self.array_expr = re.compile(r'array[[]([0-1],|[0-1].0,)?(["a-z0-9_, .]+)[]]', flags=re.I)
+        self.non_array_expr = re.compile(r'["a-z0-9_]+', flags=re.I)
+
+        self.intercept = self.has_intercept(x_str)
+        self.all_coef_len = coef_len
+        if self.intercept:
+            self.feature_coef_len = coef_len - 1
+        else:
+            self.feature_coef_len = coef_len
         self.y = y_str.replace('"','')
         self.x = self.parse(x_str)

     def parse(self, x_str):
-        array_expr = re.compile(r'array[[](["a-z0-9_, .]+)[]]', flags=re.I)
-        simple_col = re.compile(r'["a-z0-9_]+', flags=re.I)
+        """
+        The parse function parses the x_str (that is obtained by querying the model summary table)
+        The goal of this function is to extract the features from this string and
+        ignore the intercept (if present)
+        If a non array expression like 'x' is used for the independent
+        variable, this function will assume that the intercept was used
+        during training
+        :param x_str: Independent variable used during training. Can take
+                      multiple formats like
+                      'ARRAY[1,x1,x2]', 'ARRAY[x1,x2]' or just 'x'
+        :return: array of all the independent features
+        """
         prefix = 'x'
-        if array_expr.match(x_str) is not None:
-            x_csv = array_expr.sub(r'\1', x_str)
+        if self.array_expr.match(x_str) is not None:
+            x_csv = self.array_expr.sub(r'\2', x_str)
             ret = [s.strip().replace('"','') for s in x_csv.split(',')]
-            if len(ret) == self.n_coef:
+            if len(ret) == self.feature_coef_len:
                 return ret
-            else:
-                pass # fall back to using 'x'
-        elif simple_col.match(x_str) is not None:
-            prefix = x_str.replace('"','')
-        return ["{0}[{1}]".format(prefix, str(i+1)) for i in range(self.n_coef)]
+            pass
+        elif self.non_array_expr.match(x_str) is not None:
+            # We assume that if a non array expression was used for training,
+            # it includes the intercept
+            prefix = x_str.replace('"', '')
+            return ["{0}[{1}]".format(prefix, str(i+1)) for i in range(self.feature_coef_len)]
+        # We will only get here if we matched the array format but the
+        # coefficient length didn't match the x_str len. This would be a very
+        # rare and unexpected scenario and there isn't a good solution here.
+        # So we just loop through all the coefficients (including the intercept)
+        # so that all of them are considered as predictors
+        return ["{0}[{1}]".format(prefix, str(i+1)) for i in range(
+            self.all_coef_len)]

     def rename(self, spec):
         if isinstance(spec, str):
@@ -37,20 +73,50 @@ class Formula(object):
                 x = [s.strip() for s in spec.split('+')]
             else:
                 x = [s.strip() for s in spec.split(',')]
-            if self.n_coef != len(x):
+
+            if self.feature_coef_len != len(x):
                 plpy.warning("PMML warning: unexpected namespec '" + \
                     spec + "', using default names")
             else:
                 self.y = y
                 self.x = x
         else:
-            if len(spec) == self.n_coef + 1:
+            if len(spec) == self.feature_coef_len + 1:
                 self.y = spec[0]
                 self.x = spec[1:]
-            elif len(spec) == self.n_coef:
+            elif len(spec) == self.feature_coef_len:
                 self.x = spec
             else:
                 plpy.warning("PMML warning: unexpected namespec '" + \
-                    str(spec) + "', using default names")
-
+                             str(spec) + "', using default names")
+    def has_intercept(self, x_str):
+        """
+        Parses the independent var string and determines if intercept was
+        used during fit. This is important for pmml building because of the
+        following reasons:
+        1. The coef vector includes the coefficient of the intercept as well
+        2. If we don't handle this coefficient separately, the intercept will be
+           treated an independent variable in the pmml output. This isn't
+           ideal since the user will need to pass this intercept "1" as an input
+           for each row while using the pmml for prediction
+        2. Since we don't want to treat intercept as an independent variable,
+           it's important to know if an intercept was used or not and treat
+           it accordingly.
+        :param x_str:
+        :return:
+        """
+        array_expr_match = self.array_expr.match(x_str)
+        if array_expr_match is not None:
+            if array_expr_match.groups()[0] is None:
+                return False
+            else:
+                return True
+        # If the independent var used during training does not match the
+        # "ARRAY[1, x1, x2]" or "ARRAY[x1, x2]" format (for e.g. a simple col
+        # expression like x), we default to intercept being true. This is
+        # because without this format, we have no way to knowing whether the
+        # input table was fit with an intercept or not. So assuming intercept to
+        # be True is a safer assumption since in most cases, the model would
+        # have been trained with an intercept
+        return True
diff --git a/src/ports/postgres/modules/pmml/pmml_builder.py_in b/src/ports/postgres/modules/pmml/pmml_builder.py_in
index 125c616a..b6d56131 100644
--- a/src/ports/postgres/modules/pmml/pmml_builder.py_in
+++ b/src/ports/postgres/modules/pmml/pmml_builder.py_in
@@ -33,6 +33,15 @@ class PMMLBuilder(object):
         self.name_spec = name_spec
         self.pmml_str = None

+    def _get_intercept_and_x_coefs(self, coefs):
+        if self.formula.intercept:
+            intercept_coef = coefs[0]
+            x_coefs = coefs[1:]
+        else:
+            intercept_coef = 0
+            x_coefs = coefs
+        return intercept_coef, x_coefs
+
     def _validate_output_table(self):
         cols_in_tbl_valid(self.model_table,
                           self.__class__.OUTPUT_COLS,
@@ -65,7 +74,7 @@ class PMMLBuilder(object):
         raise NotImplementedError

     def _construct_formula(self):
-        self.formula = Formula(self.y_str, self.x_str, self.n_coef)
+        self.formula = Formula(self.y_str, self.x_str, self.all_coef_len)
         if self.name_spec is not None:
             self.formula.rename(self.name_spec)
         else:
@@ -209,7 +218,7 @@ class RegressionPMMLBuilder(PMMLBuilder):
     def _parse_output(self):
         self.grouped_coefs = self.output
         self.coef0 = self.output[0]['coef']
-        self.n_coef = len(self.coef0)
+        self.all_coef_len = len(self.coef0)
         self.grouping_keys = [k for k in self.output[0] if k != 'coef']

     def _construct_predict_spec(self):
@@ -230,23 +239,26 @@ class RegressionPMMLBuilder(PMMLBuilder):
         self.mining_schema = MiningSchema(*mining_field_forest)

     def _create_numeric_predictors(self, coef):
-        numeric_predictor_forest = []
+        numeric_predictors = []
         for i, e in enumerate(coef):
-            numeric_predictor_forest.append(
+            numeric_predictors.append(
                 NumericPredictor(name=self.formula.x[i],
                                  coefficient=e))
-        return numeric_predictor_forest
+        return numeric_predictors

-    def _create_model_regression(self, numeric_predictor_forest):
+    def _create_model_regression(self, numeric_predictor_forest, intercept_coef):
+        # TODO: fix this intercept value properly
+        # when is this code called
         regression_table_forest = [RegressionTable(*numeric_predictor_forest,
-                                                   intercept='0')]
+                                                   intercept=intercept_coef)]
         return RegressionModel(self.mining_schema, *regression_table_forest,
                                functionName=self.function)

-    def _create_model_classification(self, numeric_predictor_forest):
+    def _create_model_classification(self, numeric_predictor_forest, intercept_coef):
+        #TODO: Will False category always have intercept 0 ?
         regression_table_forest = [
             RegressionTable(*numeric_predictor_forest,
-                            targetCategory=True, intercept='0'),
+                            targetCategory=True, intercept=intercept_coef),
             RegressionTable(targetCategory=False, intercept='0')]
         return RegressionModel(self.mining_schema,
                                *regression_table_forest,
@@ -255,12 +267,13 @@ class RegressionPMMLBuilder(PMMLBuilder):

     def _create_single_model(self, coef):
         self._build_mining_schema()
-        numeric_predictor_forest = self._create_numeric_predictors(coef)
+        intercept_coef, x_coef = self._get_intercept_and_x_coefs(coef)
+        numeric_predictors_regression = self._create_numeric_predictors(x_coef)

         if self.function == 'regression':
-            return self._create_model_regression(numeric_predictor_forest)
+            return self._create_model_regression(numeric_predictors_regression, intercept_coef)
         elif self.function == 'classification':
-            return self._create_model_classification(numeric_predictor_forest)
+            return self._create_model_classification(numeric_predictors_regression, intercept_coef)


 class GeneralRegressionPMMLBuilder(RegressionPMMLBuilder):
@@ -375,17 +388,19 @@ class GLMPMMLBuilder(GeneralRegressionPMMLBuilder):
         self._build_covariate_list()
         self._build_ppmatrix()

+        intercept_coef, x_coef = self._get_intercept_and_x_coefs(coef)
         # pcells
-        pcell_attrib0 = dict(parameterName='p0', beta='0', df='1')
+        pcell_attrib0 = dict(parameterName='p0', beta=intercept_coef, df='1')
         if self.function == 'classification':
             pcell_attrib0['targetCategory'] = True
         pcell_forest = [PCell(**pcell_attrib0)]
-        for i, e in enumerate(coef):
+        for i, e in enumerate(x_coef):
             pcell_attrib = dict(parameterName="p"+str(i+1), beta=e, df='1')
             if self.function == 'classification':
                 pcell_attrib['targetCategory'] = True
             pcell_forest.append(PCell(**pcell_attrib))

+
         return GeneralRegressionModel(self.mining_schema,
                                       self.parameter_list,
                                       FactorList(),
@@ -398,7 +413,6 @@ class GLMPMMLBuilder(GeneralRegressionPMMLBuilder):
                                       functionName=self.function,
                                       **self.link_spec)

-
 class MultiClassRegressionPMMLBuilder(GeneralRegressionPMMLBuilder):
     """Base builder class for Multinomial logistic and Ordinal.
     """
@@ -452,13 +466,13 @@ class OrdinalRegressionPMMLBuilder(MultiClassRegressionPMMLBuilder):
     def _parse_output(self):
         self.grouped_coefs = self.output
         self.coef0 = self.output[0]['coef']
-        self.n_coef = len(self.output[0]['coef_feature'])
+        self.all_coef_len = len(self.output[0]['coef_feature'])
         self.grouping_keys = [k for k in self.output[0]
                               if k not in ('coef', 'coef_feature')]

     def _create_single_model(self, coef):
-        coef_threshold = coef[self.formula.n_coef:]
-        coef_feature = coef[:self.formula.n_coef]
+        coef_threshold = coef[self.formula.all_coef_len:]
+        coef_feature = coef[:self.formula.all_coef_len]

         self._build_mining_schema()
         self._build_parameter_list()
@@ -533,7 +547,7 @@ class MultinomRegressionPMMLBuilder(MultiClassRegressionPMMLBuilder):
         self.grouped_coefs = [dict(list(zip(self.grouping_keys, grp_val))+[('coef', coef)])
                               for grp_val, coef in coef_dict.items()]
         self.coef0 = list(coef_dict.values())[0]
-        self.n_coef = len(list(self.coef0.values())[0])
+        self.all_coef_len = len(list(self.coef0.values())[0])

     def _create_single_model(self, coef):
         self._build_mining_schema()
@@ -544,9 +558,10 @@ class MultinomRegressionPMMLBuilder(MultiClassRegressionPMMLBuilder):
         # pcells
         pcell_forest = []
         for cate, coef_per_cate in coef.items():
+            intercept_coef, x_coef_per_cate = self._get_intercept_and_x_coefs(coef_per_cate)
             pcell_forest.append(PCell(parameterName="p0",
-                                      beta=0, df='1', targetCategory=cate))
-            for i, c in enumerate(coef_per_cate):
+                                      beta=intercept_coef, df='1', targetCategory=cate))
+            for i, c in enumerate(x_coef_per_cate):
                 pcell_forest.append(PCell(parameterName="p"+str(i+1),
                                           beta=c, df='1', targetCategory=cate))

@@ -595,7 +610,7 @@ class DecisionTreePMMLBuilder(PMMLBuilder):
         # assume that summary table sort the independent varnames (cat, con)
         self.x_str = self.summary['independent_varnames']
         self.x = [s.strip() for s in self.x_str.split(',')]
-        self.n_coef = len(self.x)
+        self.all_coef_len = len(self.x)
         self.grouping_col = self.summary['grouping_cols']
         self.grouping_str = ('' if self.grouping_col is None
@@ -982,7 +997,7 @@ class RandomForestPMMLBuilder(DecisionTreePMMLBuilder):
         # assume that summary table sort the independent varnames (cat, con)
         self.x_str = self.summary['independent_varnames']
         self.x = [s.strip() for s in self.x_str.split(',')]
-        self.n_coef = len(self.x)
+        self.all_coef_len = len(self.x)
         self.grouping_col = self.summary['grouping_cols']
         self.grouping_str = ('' if self.grouping_col is None
diff --git a/src/ports/postgres/modules/pmml/test/unit_tests/test_formula.py_in b/src/ports/postgres/modules/pmml/test/unit_tests/test_formula.py_in
new file mode 100644
index 00000000..6075edc4
--- /dev/null
+++ b/src/ports/postgres/modules/pmml/test/unit_tests/test_formula.py_in
@@ -0,0 +1,349 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import sys
+from os import path
+import unittest
+from mock import *
+
+m4_changequote(`<!', `!>')
+
+# Add modules to the pythonpath.
+sys.path.append(path.join(path.dirname(path.abspath(__file__)), '../../..'))
+sys.path.append(path.join(path.dirname(path.abspath(__file__)), '../..'))
+
+class FormulaTestCase(unittest.TestCase):
+    def setUp(self):
+        self.plpy_mock = Mock()
+        patches = {
+            'plpy': self.plpy_mock
+        }
+        self.module_patcher = patch.dict('sys.modules', patches)
+        self.module_patcher.start()
+        from pmml import formula
+        self.subject = formula
+    def tearDown(self):
+        self.module_patcher.stop()
+
+    def test_formula_array_with_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1.0,foo,bar]', 3)
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0,foo,bar]', 3)
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0.0,foo,bar]', 3)
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,"1","bar"]', 3)
+        self.assertEqual(f.x, ['1', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,"1.0","bar"]', 3)
+        self.assertEqual(f.x, ['1.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1.0,"1","bar"]', 3)
+        self.assertEqual(f.x, ['1', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1.0,"1.0","bar"]', 3)
+        self.assertEqual(f.x, ['1.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,"0","bar"]', 3)
+        self.assertEqual(f.x, ['0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1.0,"0","bar"]', 3)
+        self.assertEqual(f.x, ['0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,"0.0","bar"]', 3)
+        self.assertEqual(f.x, ['0.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1.0,"0.0","bar"]', 3)
+        self.assertEqual(f.x, ['0.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0,"1","bar"]', 3)
+        self.assertEqual(f.x, ['1', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0.0,"1","bar"]', 3)
+        self.assertEqual(f.x, ['1', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0.0,"1.0","bar"]', 3)
+        self.assertEqual(f.x, ['1.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0,"0","bar"]', 3)
+        self.assertEqual(f.x, ['0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0.0,"0","bar"]', 3)
+        self.assertEqual(f.x, ['0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0,"0.0","bar"]', 3)
+        self.assertEqual(f.x, ['0.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0.0,"0.0","bar"]', 3)
+        self.assertEqual(f.x, ['0.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+    def test_formula_array_with_invalid_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[10,foo,bar]', 3)
+        self.assertEqual(f.x, ['10', 'foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+        # A negative number shouldn't be allowed technically the train functions
+        # don't error out, so adding this test for the sake of completeness
+        f = self.subject.Formula('baaz', 'ARRAY[-2,foo,bar]', 3)
+        self.assertEqual(f.intercept, True)
+        self.assertEqual(f.x, ['ARRAY[-2,foo,bar][1]', 'ARRAY[-2,foo,bar][2]'])
+        self.assertEqual(f.y, "baaz")
+
+        f = self.subject.Formula('baaz', 'ARRAY[2,foo,bar]', 3)
+        self.assertEqual(f.x, ['2', 'foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+        f = self.subject.Formula('baaz', 'ARRAY[23,foo,bar]', 3)
+        self.assertEqual(f.x, ['23', 'foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+    def test_formula_array_with_intercept_unequal_coef_len(self):
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 2)
+        self.assertEqual(f.x, ['x[1]', 'x[2]'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 4)
+        self.assertEqual(f.x, ['x[1]', 'x[2]', 'x[3]', 'x[4]'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+    def test_formula_array_without_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+        f = self.subject.Formula('baaz', 'ARRAY["1",foo,bar]', 3)
+        self.assertEqual(f.x, ['1', 'foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+        f = self.subject.Formula('baaz', 'ARRAY["1", "foo","bar"]', 3)
+        self.assertEqual(f.x, ['1', 'foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+        f = self.subject.Formula('baaz', 'ARRAY["0", "foo","bar"]', 3)
+        self.assertEqual(f.x, ['0', 'foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+    def test_formula_array_without_intercept_unequal_coef_len(self):
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 1)
+        self.assertEqual(f.x, ['x[1]'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 3)
+        self.assertEqual(f.x, ['x[1]', 'x[2]', 'x[3]'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+    def test_formula_nonarray(self):
+        f = self.subject.Formula('baaz', 'foo', 3)
+        self.assertEqual(f.x, ['foo[1]', 'foo[2]'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', '{1,foo,bar}', 3)
+        self.assertEqual(f.intercept, True)
+        self.assertEqual(f.x, ['x[1]', 'x[2]', 'x[3]'])
+        self.assertEqual(f.y, "baaz")
+
+    def test_rename_string_expression_with_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename('y ~ foo+bar')
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "y")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename('y ~ foo,bar')
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "y")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename('foo+bar')
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename('foo,bar')
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename('{foo,bar}')
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+    def test_rename_string_expression_without_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename('y ~ x1+x2')
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "y")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename('y ~ x1,x2')
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "y")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename('x1+x2')
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename('x1,x2')
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename('{x1,x2}')
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+    def test_rename_string_expression_with_intercept_throws_warning(self):
+        f = self.subject.Formula('baaz', 'ARRAY[1, foo,bar]', 3)
+        f.rename('y ~ x1')
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(1, self.plpy_mock.warning.call_count)
+
+    def test_rename_string_expression_without_intercept_throws_warning(self):
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar,foobar]', 3)
+        f.rename('y ~ x1+x2')
+        self.assertEqual(f.x, ['foo', 'bar', 'foobar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(1, self.plpy_mock.warning.call_count)
+
+    def test_rename_array_expression_with_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename(['x1','x2'])
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename(['y', 'x1','x2'])
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "y")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+    def test_rename_array_expression_without_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename(['x1','x2'])
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename(['y', 'x1','x2'])
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "y")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+    def test_rename_array_expression_with_intercept_throws_warning(self):
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename(['x1'])
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(1, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1, foo,bar]', 3)
+        f.rename(['x1', 'x2', 'x3','x4'])
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(2, self.plpy_mock.warning.call_count)
+
+    def test_rename_array_expression_without_intercept_throws_warning(self):
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename(['x1'])
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(1, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename(['x1', 'x2', 'x3','x4'])
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(2, self.plpy_mock.warning.call_count)
+
+
+if __name__ == '__main__':
+    unittest.main()
+
+# ---------------------------------------------------------------------