(madlib) 02/05: PMML: Do not include intercept as a predictor

nkak Mon, 26 Feb 2024 11:28:54 -0800

This is an automated email from the ASF dual-hosted git repository.

nkak pushed a commit to branch madlib2-master
in repository https://gitbox.apache.org/repos/asf/madlib.git


commit 0cd28f9733927d63beaefc9488db7f8bfdb3bd80
Author: Nikhil Kak <n...@vmware.com>
AuthorDate: Thu Feb 15 17:53:56 2024 -0800

    PMML: Do not include intercept as a predictor
    
    JIRA: MADLIB-1517
    
    Note that this commit only fixes GLM, logistic and linear. A future commit
    will fix other pmml modules.
    
    Context :
    --------------------------------------------------------
    MADlib's way of passing the intercept to regression models is a bit unusual.
    Usually the intercept is a boolean that indicates whether the model should be
    fit with an intercept or not. MADlib instead makes the user pass a constant 1
    along with the other independent variables (omitting it means no intercept)
    and uses that value directly in the computation. For example,
    ARRAY[1,x1,x2,...] means use an intercept, whereas ARRAY[x1,x2,...] means
    don't use an intercept.
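
    A minimal sketch of why the constant 1 acts as an intercept (illustrative
    only; `predict` and the sample numbers are hypothetical and not part of
    MADlib). The prediction is a dot product of the coefficients with the row, so
    the coefficient paired with the constant 1 is simply added unchanged:

```python
def predict(coef, row):
    """Linear predictor: dot product of the coefficients and one input row."""
    return sum(c * x for c, x in zip(coef, row))

# Hypothetical coefficients for ARRAY[1, x1, x2]; the first pairs with the constant 1.
coef = [0.5, 2.0, -1.0]
x1, x2 = 3.0, 4.0

# Passing the constant 1 as a feature ...
with_constant = predict(coef, [1, x1, x2])
# ... is the same as adding coef[0] as an explicit intercept.
explicit_intercept = coef[0] + predict(coef[1:], [x1, x2])
print(with_constant, explicit_intercept)
```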
    
    Problem:
    --------------------------------------------------------
    * So essentially all the regression algorithms treat the intercept value "1" as
      just another independent variable (it's always the first one though).
    * Because of this implicit assumption, users of the exported pmml need to
      explicitly inject a predictor named "1" with a value of 1 for all the input
      rows. This can be very inconvenient, especially when using pmml to predict on
      a stream of data or some other preprocessed form of data.
    
    Fix:
    --------------------------------------------------------
    * Once the model is trained, the only way to know if the model was fit with an
      intercept is to look at the `independent_varname` field in the summary table.
    * If the value has the form ARRAY[1, x1, x2..], then an intercept was used.
    * Since this intercept is just another independent variable, there aren't any
      explicit references or logic to handle the intercept in our python or c++ code
      for training or predict.
    * Because of this assumption, the pmml code also considers all of
      "ARRAY[1,x1,x2,...]" as independent variables and hence the output pmml
      contains "1" as an input predictor.
    * We can just remove all references to the column "1" in the pmml file. We will
      still keep the "p0" variable, which is explicitly marked as an intercept and
      will store the intercept's coefficient.
    * The pmml module gets the 'X' and 'Y' values from the summary table and then
      parses them to create a list of all the independent predictors so that it can
      be written to the pmml file.
    * It uses a regex to match the expression ARRAY[1,x1,x2]/ARRAY[x1,x2] and then
      returns either ['1','x1','x2'] or ['x1','x2'].
    * Our goal with the pmml code is to not treat the intercept "1" as an
      independent predictor but just as an intercept.
    * The commit fixes this by changing the regex and using its output to determine
      if an intercept was passed, so that both expressions
      ARRAY[1,x1,x2]/ARRAY[x1,x2] return ['x1', 'x2'].
    * Also had to make changes to the various pmml builder classes to treat the
      intercept's coefficient differently than the feature coefficients.
    * Note that this commit only fixes GLM, logistic and linear. A future commit
      will fix other pmml modules.
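
    The regex-based detection described above can be sketched roughly as follows.
    This is a simplified, standalone variant of the pattern in formula.py (brackets
    and dot are escaped here, and the coefficient-length check is left out);
    `split_intercept` is a hypothetical name used only for illustration:

```python
import re

# Group 1 is the optional leading intercept (0, 1, 0.0 or 1.0 followed by a comma);
# group 2 is the remaining comma-separated feature list.
ARRAY_EXPR = re.compile(r'array\[([0-1],|[0-1]\.0,)?(["a-z0-9_, .]+)\]', flags=re.I)

def split_intercept(x_str):
    """Return (has_intercept, features) for an ARRAY[...] expression.

    Anything that is not an ARRAY[...] expression yields (True, None), mirroring
    the commit's assumption that non-array expressions include the intercept.
    """
    m = ARRAY_EXPR.match(x_str)
    if m is None:
        return True, None
    features = [s.strip().replace('"', '') for s in m.group(2).split(',')]
    return m.group(1) is not None, features

print(split_intercept('ARRAY[1,x1,x2]'))
print(split_intercept('ARRAY[x1,x2]'))
```

    Both calls report the same feature list; only the intercept flag differs.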
    
    Before the fix:
    ```
    <?xml version="1.0" standalone="yes"?>
    <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
     <Header copyright="Copyright (c) 2024 nkak">
       <Extension extender="MADlib" name="user" value="nkak"/>
       <Application name="MADlib" version="2.1.0"/>
       <Timestamp>2024-02-16 12:19:58.798139 PDT</Timestamp>
     </Header>
     <DataDictionary numberOfFields="4">
       <DataField name="second_attack_pmml_prediction" optype="categorical" dataType="boolean">
         <Value value="True"/>
         <Value value="False"/>
       </DataField>
       <DataField name="1" optype="continuous" dataType="double"/>
       <DataField name="treatment" optype="continuous" dataType="double"/>
       <DataField name="trait_anxiety" optype="continuous" dataType="double"/>
     </DataDictionary>
     <RegressionModel functionName="classification" normalizationMethod="softmax">
       <MiningSchema>
         <MiningField name="second_attack_pmml_prediction" usageType="predicted"/>
         <MiningField name="1"/>
         <MiningField name="treatment"/>
         <MiningField name="trait_anxiety"/>
       </MiningSchema>
       <RegressionTable intercept="0.0" targetCategory="True">
         <NumericPredictor name="1" coefficient="-6.363469941781864"/>
         <NumericPredictor name="treatment" coefficient="-1.0241060523932668"/>
         <NumericPredictor name="trait_anxiety" coefficient="0.11904491666860616"/>
       </RegressionTable>
       <RegressionTable intercept="0.0" targetCategory="False"/>
     </RegressionModel>
    </PMML>
    ```
    
    After the fix:
    ```
    <?xml version="1.0" standalone="yes"?>
    <PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
      <Header copyright="Copyright (c) 2024 nkak">
        <Extension extender="MADlib" name="user" value="nkak"/>
        <Application name="MADlib" version="2.1.0"/>
        <Timestamp>2024-02-16 13:37:15.367609 PDT</Timestamp>
      </Header>
      <DataDictionary numberOfFields="3">
        <DataField name="second_attack_pmml_prediction" optype="categorical" dataType="boolean">
          <Value value="True"/>
          <Value value="False"/>
        </DataField>
        <DataField name="treatment" optype="continuous" dataType="double"/>
        <DataField name="trait_anxiety" optype="continuous" dataType="double"/>
      </DataDictionary>
      <RegressionModel functionName="classification" normalizationMethod="softmax">
        <MiningSchema>
          <MiningField name="second_attack_pmml_prediction" usageType="predicted"/>
          <MiningField name="treatment"/>
          <MiningField name="trait_anxiety"/>
        </MiningSchema>
        <RegressionTable intercept="-6.36346994178186" targetCategory="True">
          <NumericPredictor name="treatment" coefficient="-1.0241060523932697"/>
          <NumericPredictor name="trait_anxiety" coefficient="0.11904491666860609"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="False"/>
      </RegressionModel>
    </PMML>
    
    ```
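
    To illustrate what changes for a consumer of the exported file, the sketch
    below scores one row by hand using the numbers from the "After the fix" output.
    It is not a PMML engine: `score` and the sample row are hypothetical, and it
    assumes the usual reading of a two-table softmax RegressionModel in which the
    "False" table contributes a linear term of 0.

```python
import math

# Values copied from the "After the fix" RegressionTable for targetCategory="True".
INTERCEPT = -6.36346994178186
COEFS = {"treatment": -1.0241060523932697, "trait_anxiety": 0.11904491666860609}

def score(row):
    """Probability of the "True" category for one input row (dict of feature values)."""
    z_true = INTERCEPT + sum(COEFS[name] * row[name] for name in COEFS)
    z_false = 0.0  # the "False" RegressionTable has intercept 0.0 and no predictors
    # Softmax over the two categories.
    e_true, e_false = math.exp(z_true), math.exp(z_false)
    return e_true / (e_true + e_false)

# Hypothetical input row: no column named "1" has to be supplied.
print(score({"treatment": 1.0, "trait_anxiety": 70.0}))
```

    With the "Before the fix" file, the same row would also have needed an entry
    named "1" with the value 1 to reach the same linear term.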
    
    Risks and limitations:
    --------------------------------------------------------
    1. Straying away from the intrinsic intercept assumption only for the pmml code:
      * As we have established already, the intercept is not treated any differently
        from an independent variable.
      * To fix the pmml file to not include the intercept as an independent variable,
        we will need to break this intrinsic assumption.
      * If the pmml code breaks this assumption, it's possible that there might be
        some unexpected side effects or errors that even exhaustive testing may
        not uncover. For example, the pmml code relies on the length of the
        coefficient array to make some decisions about naming and such (see
        formula.py for an example), which might have some weird edge cases.
    
      We might be fine with this risk for now and if something breaks in the future,
      we can deal with it later. The biggest risk is that something fundamental
      breaks in the future that might make us revert this new logic. But the odds of
      that are pretty low.
    
    2. If the user passed a non-array expression for the independent variable
      Consider the following example:
    
      ```
      -- Create a table where the x variable is an array of the independent variables to train on
      CREATE TABLE warpbreaks_dummy_simple_xcol AS SELECT breaks AS y, ARRAY[1,"wool_B","tension_M", "tension_H"] AS x_a from warpbreaks_dummy;
    
      -- Now use the column 'x_a' created in the previous step.
      SELECT madlib.glm('warpbreaks_dummy_simple_xcol', 'glm_warpbreaks_intercept_1_simple_xcol', 'y' , 'x_a' , 'family=poisson, link=log');
      ```
      Now there's no way for us to know if this model was fit with an intercept or
      not. The only way to know is to check the value of "independent_varname" in the
      summary table, which would be "x_a" in this case and won't tell us anything
      about the intercept.
    
      Ideally, we would like to change the fit functions to take a boolean for the
      intercept arg, but that would be too big of a change and hence is out of scope
      of this commit.
    
      The easiest fix for this problem for now is to assume that all non-array
      expressions always include the intercept. Note that this assumption only
      applies to the pmml module.
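
      As a rough sketch of what this assumption means for the names written to the
      pmml file, the snippet below mirrors the non-array branch of Formula.parse in
      this diff. `default_feature_names` is a hypothetical standalone helper, and
      the generated strings are only labels for the predictors:

```python
def default_feature_names(x_str, coef_len, has_intercept=True):
    """Build placeholder names such as x_a[1], x_a[2], ... for a plain column.

    has_intercept defaults to True, matching the assumption made for
    non-array expressions.
    """
    prefix = x_str.replace('"', '')
    # One coefficient is reserved for the intercept when it is assumed present.
    n_features = coef_len - 1 if has_intercept else coef_len
    return ["{0}[{1}]".format(prefix, i + 1) for i in range(n_features)]

# Four coefficients were learned for x_a = ARRAY[1, wool_B, tension_M, tension_H].
print(default_feature_names('x_a', 4))
```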
    
    3. Using the name_spec arg of the pmml function
      * The pmml function accepts an optional arg named "name_spec" which is used to
        explicitly name the input and output variables in the pmml file.
      * The user will now need to remove the "1" from this expression.
        For example, `SELECT madlib.pmml('patients_logregr', 'attack~1+anxiety+treatment');`
        will have to be rewritten as
        `SELECT madlib.pmml('patients_logregr', 'attack~anxiety+treatment');`
      * We will need to remove this from the pmml user docs, which will be done in a
        separate PR.
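
      A small sketch of why the old name_spec no longer fits, assuming the count
      check against the number of feature coefficients shown in Formula.rename.
      `parse_name_spec` is a hypothetical simplification that only handles the
      'y~x1+x2' form:

```python
def parse_name_spec(spec, feature_coef_len):
    """Split 'y~x1+x2' into (y, [x1, x2]); return None if the count is off."""
    y, _, rhs = spec.partition('~')
    x = [s.strip() for s in rhs.split('+')]
    if len(x) != feature_coef_len:
        # The real code emits a warning here and keeps the default names.
        return None
    return y.strip(), x

# A model trained on ARRAY[1, anxiety, treatment] has 3 coefficients, 2 of them features.
print(parse_name_spec('attack~1+anxiety+treatment', 2))
print(parse_name_spec('attack~anxiety+treatment', 2))
```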
    
    4. If the intercept is not the first one in the independent_varname array expression
    
       Consider the following examples
       ```
       SELECT madlib.linregr_train('houses', 'linregr_model', 'price', 'array[bedroom, 1, bath, size]');
       SELECT madlib.linregr_predict(coef, ARRAY[bedroom, 1, bath, size]) FROM linregr_model, houses;
       ```
       or
       ```
       SELECT madlib.linregr_train('houses', 'linregr_model', 'price', 'array[bedroom, bath, size, 1]');
       SELECT madlib.linregr_predict(coef, ARRAY[bedroom, bath, size, 1]) FROM linregr_model, houses;
       ```
       Both of these are allowed, which makes it really hard for the pmml code to
       figure out if the intercept was used or not.
    
       Solution 1:
       * Always assume that the intercept arg "1" will be at the start of the
         expression.
       * All our regression user docs usually specify the intercept at the beginning,
         so most of our users will be used to that format.
       * There is a small risk that when the intercept is not at the beginning of the
         expression, the exported pmml will assume that "1" is a normal predictor and
         not an intercept. This is no different from how it was treated before this
         fix. Users will just need to provide a column named "1" when predicting
         using that pmml.
    
       Solution 2:
       * The pmml code will need to get smarter and parse the array expression to
         figure out the position of the intercept and then accordingly get the
         intercept coefficient from the coef array.
       * This will require a lot of work and might still not be foolproof since we
         also allow passing arbitrary integers in the independent variable expression
         (see previous issue).
       * Even if we ignore the integer issue, we will need to make quite a few changes
         to the pmml code, which can be error prone and hard to maintain.
    
       Decided to go with Solution 1 for ease of use and maintainability.
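
       The consequence of Solution 1 can be sketched with a simplified, standalone
       variant of the array pattern from formula.py (brackets and dot escaped here;
       `describe` is a hypothetical helper). Only a leading 1 is recognised as the
       intercept; a 1 anywhere else stays in the predictor list:

```python
import re

# Group 1 is the optional leading intercept; group 2 is everything after it.
ARRAY_EXPR = re.compile(r'array\[([0-1],|[0-1]\.0,)?(["a-z0-9_, .]+)\]', flags=re.I)

def describe(x_str):
    """Return (has_intercept, predictors) for an ARRAY[...] expression."""
    m = ARRAY_EXPR.match(x_str)
    if m is None:
        raise ValueError("not an ARRAY[...] expression: " + x_str)
    predictors = [s.strip() for s in m.group(2).split(',')]
    return m.group(1) is not None, predictors

for expr in ('array[1, bedroom, bath, size]',
             'array[bedroom, 1, bath, size]',
             'array[bedroom, bath, size, 1]'):
    print(expr, '->', describe(expr))
```

       For the second and third expressions the exported pmml would still list "1"
       as a predictor, which matches the behaviour before this commit.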
---
 src/ports/postgres/modules/pmml/formula.py_in      |  98 +++++-
 src/ports/postgres/modules/pmml/pmml_builder.py_in |  61 ++--
 .../pmml/test/unit_tests/test_formula.py_in        | 349 +++++++++++++++++++++
 3 files changed, 469 insertions(+), 39 deletions(-)

diff --git a/src/ports/postgres/modules/pmml/formula.py_in b/src/ports/postgres/modules/pmml/formula.py_in
index 4a14e0df..0d575315 100644
--- a/src/ports/postgres/modules/pmml/formula.py_in
+++ b/src/ports/postgres/modules/pmml/formula.py_in
@@ -2,25 +2,61 @@ import plpy
 import re
 
 class Formula(object):
+
     def __init__(self, y_str, x_str, coef_len):
-        self.n_coef = coef_len
+        """
+        :param y_str:    Dependent variable used during training
+        :param x_str:    Independent variable used during training. Can take
+                         multiple formats like
+                         'ARRAY[1,x1,x2]', 'ARRAY[x1,x2]' or just 'x'
+        :param coef_len: Length of all the coefficients including the
+                         intercept's coefficient(if any)
+        """
+        # TODO: Fix the nested warning and add explanation for the regex
+        self.array_expr = re.compile(r'array[[]([0-1],|[0-1].0,)?(["a-z0-9_, .]+)[]]', flags=re.I)
+        self.non_array_expr = re.compile(r'["a-z0-9_]+', flags=re.I)
+
+        self.intercept = self.has_intercept(x_str)
+        self.all_coef_len = coef_len
+        if self.intercept:
+            self.feature_coef_len = coef_len - 1
+        else:
+            self.feature_coef_len = coef_len
         self.y = y_str.replace('"','')
         self.x = self.parse(x_str)
 
     def parse(self, x_str):
-        array_expr = re.compile(r'array[[](["a-z0-9_, .]+)[]]', flags=re.I)
-        simple_col = re.compile(r'["a-z0-9_]+', flags=re.I)
+        """
+        The parse function parses the x_str (that is obtained by querying the model summary table)
+        The goal of this function is to extract the features from this string and
+        ignore the intercept (if present)
+        If a non array expression like 'x' is used for the independent
+        variable, this function will assume that the intercept was used
+        during training
+        :param x_str: Independent variable used during training. Can take
+                      multiple formats like
+                     'ARRAY[1,x1,x2]', 'ARRAY[x1,x2]' or just 'x'
+        :return: array of all the independent features
+        """
         prefix = 'x'
-        if array_expr.match(x_str) is not None:
-            x_csv = array_expr.sub(r'\1', x_str)
+        if self.array_expr.match(x_str) is not None:
+            x_csv = self.array_expr.sub(r'\2',  x_str)
             ret = [s.strip().replace('"','') for s in x_csv.split(',')]
-            if len(ret) == self.n_coef:
+            if len(ret) == self.feature_coef_len:
                 return ret
-            else:
-                pass # fall back to using 'x'
-        elif simple_col.match(x_str) is not None:
-            prefix = x_str.replace('"','')
-        return ["{0}[{1}]".format(prefix, str(i+1)) for i in range(self.n_coef)]
+            pass
+        elif self.non_array_expr.match(x_str) is not None:
+            # We assume that if a non array expression was used for training,
+            # it includes the intercept
+            prefix = x_str.replace('"', '')
+            return ["{0}[{1}]".format(prefix, str(i+1)) for i in range(self.feature_coef_len)]
+        # We will only get here if we matched the array format but the
+        # coefficient length didn't match the x_str len. This would be a very
+        # rare and unexpected scenario and there isn't a good solution here.
+        # So we just loop through all the coefficients (including the intercept)
+        # so that all of them are considered as predictors
+        return ["{0}[{1}]".format(prefix, str(i+1)) for i in range(
+            self.all_coef_len)]
 
     def rename(self, spec):
         if isinstance(spec, str):
@@ -37,20 +73,50 @@ class Formula(object):
                 x = [s.strip() for s in spec.split('+')]
             else:
                 x = [s.strip() for s in spec.split(',')]
-            if self.n_coef != len(x):
+
+            if self.feature_coef_len != len(x):
                 plpy.warning("PMML warning: unexpected namespec '" + \
                         spec + "', using default names")
             else:
                 self.y = y
                 self.x = x
         else:
-            if len(spec) == self.n_coef + 1:
+            if len(spec) == self.feature_coef_len + 1:
                 self.y = spec[0]
                 self.x = spec[1:]
-            elif len(spec) == self.n_coef:
+            elif len(spec) == self.feature_coef_len:
                 self.x = spec
             else:
                 plpy.warning("PMML warning: unexpected namespec '" + \
-                        str(spec) + "', using default names")
-
+                             str(spec) + "', using default names")
 
+    def has_intercept(self, x_str):
+        """
+        Parses the independent var string and determines if intercept was
+        used during fit. This is important for pmml building because of the
+        following reasons:
+        1. The coef vector includes the coefficient of the intercept as well
+        2. If we don't handle this coefficient separately, the intercept will be
+           treated an independent variable in the pmml output. This isn't
+           ideal since the user will need to pass this intercept "1" as an input
+           for each row while using the pmml for prediction
+        2. Since we don't want to treat intercept as an independent variable,
+           it's important to know if an intercept was used or not and treat
+           it accordingly.
+        :param x_str:
+        :return:
+        """
+        array_expr_match = self.array_expr.match(x_str)
+        if array_expr_match is not None:
+            if array_expr_match.groups()[0] is None:
+                return False
+            else:
+                return True
+        # If the independent var used during training does not match the
+        # "ARRAY[1, x1, x2]" or "ARRAY[x1, x2]" format (for e.g. a simple col
+        # expression like x), we default to intercept being true. This is
+        # because without this format, we have no way to knowing whether the
+        # input table was fit with an intercept or not. So assuming intercept to
+        # be True is a safer assumption since in most cases, the model would
+        # have been trained with an intercept
+        return True
diff --git a/src/ports/postgres/modules/pmml/pmml_builder.py_in b/src/ports/postgres/modules/pmml/pmml_builder.py_in
index 125c616a..b6d56131 100644
--- a/src/ports/postgres/modules/pmml/pmml_builder.py_in
+++ b/src/ports/postgres/modules/pmml/pmml_builder.py_in
@@ -33,6 +33,15 @@ class PMMLBuilder(object):
         self.name_spec = name_spec
         self.pmml_str = None
 
+    def _get_intercept_and_x_coefs(self, coefs):
+        if self.formula.intercept:
+            intercept_coef = coefs[0]
+            x_coefs = coefs[1:]
+        else:
+            intercept_coef = 0
+            x_coefs = coefs
+        return intercept_coef, x_coefs
+
     def _validate_output_table(self):
         cols_in_tbl_valid(self.model_table,
                           self.__class__.OUTPUT_COLS,
@@ -65,7 +74,7 @@ class PMMLBuilder(object):
         raise NotImplementedError
 
     def _construct_formula(self):
-        self.formula = Formula(self.y_str, self.x_str, self.n_coef)
+        self.formula = Formula(self.y_str, self.x_str, self.all_coef_len)
         if self.name_spec is not None:
             self.formula.rename(self.name_spec)
         else:
@@ -209,7 +218,7 @@ class RegressionPMMLBuilder(PMMLBuilder):
     def _parse_output(self):
         self.grouped_coefs = self.output
         self.coef0 = self.output[0]['coef']
-        self.n_coef = len(self.coef0)
+        self.all_coef_len = len(self.coef0)
         self.grouping_keys = [k for k in self.output[0] if k != 'coef']
 
     def _construct_predict_spec(self):
@@ -230,23 +239,26 @@ class RegressionPMMLBuilder(PMMLBuilder):
             self.mining_schema = MiningSchema(*mining_field_forest)
 
     def _create_numeric_predictors(self, coef):
-        numeric_predictor_forest = []
+        numeric_predictors = []
         for i, e in enumerate(coef):
-            numeric_predictor_forest.append(
+            numeric_predictors.append(
                 NumericPredictor(name=self.formula.x[i], coefficient=e))
-        return numeric_predictor_forest
+        return numeric_predictors
 
-    def _create_model_regression(self, numeric_predictor_forest):
+    def _create_model_regression(self, numeric_predictor_forest, intercept_coef):
+        # TODO: fix this intercept value properly
+        # when is this code called
         regression_table_forest = [RegressionTable(*numeric_predictor_forest,
-                                                   intercept='0')]
+                                                   intercept=intercept_coef)]
         return RegressionModel(self.mining_schema,
                                *regression_table_forest,
                                functionName=self.function)
 
-    def _create_model_classification(self, numeric_predictor_forest):
+    def _create_model_classification(self, numeric_predictor_forest, intercept_coef):
+        #TODO: Will False category always have intercept 0 ?
         regression_table_forest = [
             RegressionTable(*numeric_predictor_forest,
-                            targetCategory=True, intercept='0'),
+                            targetCategory=True, intercept=intercept_coef),
             RegressionTable(targetCategory=False, intercept='0')]
         return RegressionModel(self.mining_schema,
                                *regression_table_forest,
@@ -255,12 +267,13 @@ class RegressionPMMLBuilder(PMMLBuilder):
 
     def _create_single_model(self, coef):
         self._build_mining_schema()
-        numeric_predictor_forest = self._create_numeric_predictors(coef)
+        intercept_coef, x_coef = self._get_intercept_and_x_coefs(coef)
+        numeric_predictors_regression = self._create_numeric_predictors(x_coef)
 
         if self.function == 'regression':
-            return self._create_model_regression(numeric_predictor_forest)
+            return self._create_model_regression(numeric_predictors_regression, intercept_coef)
         elif self.function == 'classification':
-            return self._create_model_classification(numeric_predictor_forest)
+            return self._create_model_classification(numeric_predictors_regression, intercept_coef)
 
 
 class GeneralRegressionPMMLBuilder(RegressionPMMLBuilder):
@@ -375,17 +388,19 @@ class GLMPMMLBuilder(GeneralRegressionPMMLBuilder):
         self._build_covariate_list()
         self._build_ppmatrix()
 
+        intercept_coef, x_coef = self._get_intercept_and_x_coefs(coef)
         # pcells
-        pcell_attrib0 = dict(parameterName='p0', beta='0', df='1')
+        pcell_attrib0 = dict(parameterName='p0', beta=intercept_coef, df='1')
         if self.function == 'classification':
             pcell_attrib0['targetCategory'] = True
         pcell_forest = [PCell(**pcell_attrib0)]
-        for i, e in enumerate(coef):
+        for i, e in enumerate(x_coef):
             pcell_attrib = dict(parameterName="p"+str(i+1), beta=e, df='1')
             if self.function == 'classification':
                 pcell_attrib['targetCategory'] = True
             pcell_forest.append(PCell(**pcell_attrib))
 
+
         return GeneralRegressionModel(self.mining_schema,
                                       self.parameter_list,
                                       FactorList(),
@@ -398,7 +413,6 @@ class GLMPMMLBuilder(GeneralRegressionPMMLBuilder):
                                       functionName=self.function,
                                       **self.link_spec)
 
-
 class MultiClassRegressionPMMLBuilder(GeneralRegressionPMMLBuilder):
     """Base builder class for Multinomial logistic and Ordinal.
     """
@@ -452,13 +466,13 @@ class OrdinalRegressionPMMLBuilder(MultiClassRegressionPMMLBuilder):
     def _parse_output(self):
         self.grouped_coefs = self.output
         self.coef0 = self.output[0]['coef']
-        self.n_coef = len(self.output[0]['coef_feature'])
+        self.all_coef_len = len(self.output[0]['coef_feature'])
         self.grouping_keys = [k for k in self.output[0]
                               if k not in ('coef', 'coef_feature')]
 
     def _create_single_model(self, coef):
-        coef_threshold = coef[self.formula.n_coef:]
-        coef_feature = coef[:self.formula.n_coef]
+        coef_threshold = coef[self.formula.all_coef_len:]
+        coef_feature = coef[:self.formula.all_coef_len]
 
         self._build_mining_schema()
         self._build_parameter_list()
@@ -533,7 +547,7 @@ class MultinomRegressionPMMLBuilder(MultiClassRegressionPMMLBuilder):
         self.grouped_coefs = [dict(list(zip(self.grouping_keys, grp_val))+[('coef', coef)])
                               for grp_val, coef in coef_dict.items()]
         self.coef0 = list(coef_dict.values())[0]
-        self.n_coef = len(list(self.coef0.values())[0])
+        self.all_coef_len = len(list(self.coef0.values())[0])
 
     def _create_single_model(self, coef):
         self._build_mining_schema()
@@ -544,9 +558,10 @@ class MultinomRegressionPMMLBuilder(MultiClassRegressionPMMLBuilder):
         # pcells
         pcell_forest = []
         for cate, coef_per_cate in coef.items():
+            intercept_coef, x_coef_per_cate = self._get_intercept_and_x_coefs(coef_per_cate)
             pcell_forest.append(PCell(parameterName="p0",
-                                      beta=0, df='1', targetCategory=cate))
-            for i, c in enumerate(coef_per_cate):
+                                      beta=intercept_coef, df='1', targetCategory=cate))
+            for i, c in enumerate(x_coef_per_cate):
                 pcell_forest.append(PCell(parameterName="p"+str(i+1),
                                           beta=c, df='1', targetCategory=cate))
 
@@ -595,7 +610,7 @@ class DecisionTreePMMLBuilder(PMMLBuilder):
         # assume that summary table sort the independent varnames (cat, con)
         self.x_str = self.summary['independent_varnames']
         self.x = [s.strip() for s in self.x_str.split(',')]
-        self.n_coef = len(self.x)
+        self.all_coef_len = len(self.x)
 
         self.grouping_col = self.summary['grouping_cols']
         self.grouping_str = ('' if self.grouping_col is None
@@ -982,7 +997,7 @@ class RandomForestPMMLBuilder(DecisionTreePMMLBuilder):
         # assume that summary table sort the independent varnames (cat, con)
         self.x_str = self.summary['independent_varnames']
         self.x = [s.strip() for s in self.x_str.split(',')]
-        self.n_coef = len(self.x)
+        self.all_coef_len = len(self.x)
 
         self.grouping_col = self.summary['grouping_cols']
         self.grouping_str = ('' if self.grouping_col is None
diff --git a/src/ports/postgres/modules/pmml/test/unit_tests/test_formula.py_in b/src/ports/postgres/modules/pmml/test/unit_tests/test_formula.py_in
new file mode 100644
index 00000000..6075edc4
--- /dev/null
+++ b/src/ports/postgres/modules/pmml/test/unit_tests/test_formula.py_in
@@ -0,0 +1,349 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import sys
+from os import path
+import unittest
+from mock import *
+
+m4_changequote(`<!', `!>')
+
+# Add modules to the pythonpath.
+sys.path.append(path.join(path.dirname(path.abspath(__file__)), '../../..'))
+sys.path.append(path.join(path.dirname(path.abspath(__file__)), '../..'))
+
+class FormulaTestCase(unittest.TestCase):
+    def setUp(self):
+        self.plpy_mock = Mock()
+        patches = {
+            'plpy': self.plpy_mock
+        }
+        self.module_patcher = patch.dict('sys.modules', patches)
+        self.module_patcher.start()
+        from pmml import formula
+        self.subject = formula
+    def tearDown(self):
+        self.module_patcher.stop()
+
+    def test_formula_array_with_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1.0,foo,bar]', 3)
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0,foo,bar]', 3)
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0.0,foo,bar]', 3)
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,"1","bar"]', 3)
+        self.assertEqual(f.x, ['1', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,"1.0","bar"]', 3)
+        self.assertEqual(f.x, ['1.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1.0,"1","bar"]', 3)
+        self.assertEqual(f.x, ['1', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1.0,"1.0","bar"]', 3)
+        self.assertEqual(f.x, ['1.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,"0","bar"]', 3)
+        self.assertEqual(f.x, ['0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1.0,"0","bar"]', 3)
+        self.assertEqual(f.x, ['0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,"0.0","bar"]', 3)
+        self.assertEqual(f.x, ['0.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1.0,"0.0","bar"]', 3)
+        self.assertEqual(f.x, ['0.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0,"1","bar"]', 3)
+        self.assertEqual(f.x, ['1', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0.0,"1","bar"]', 3)
+        self.assertEqual(f.x, ['1', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0.0,"1.0","bar"]', 3)
+        self.assertEqual(f.x, ['1.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0,"0","bar"]', 3)
+        self.assertEqual(f.x, ['0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0.0,"0","bar"]', 3)
+        self.assertEqual(f.x, ['0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0,"0.0","bar"]', 3)
+        self.assertEqual(f.x, ['0.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[0.0,"0.0","bar"]', 3)
+        self.assertEqual(f.x, ['0.0', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+    def test_formula_array_with_invalid_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[10,foo,bar]', 3)
+        self.assertEqual(f.x, ['10', 'foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+        # A negative number shouldn't be allowed, but the train functions
+        # technically don't error out, so this test is added for completeness.
+        f = self.subject.Formula('baaz', 'ARRAY[-2,foo,bar]', 3)
+        self.assertEqual(f.intercept, True)
+        self.assertEqual(f.x, ['ARRAY[-2,foo,bar][1]', 'ARRAY[-2,foo,bar][2]'])
+        self.assertEqual(f.y, "baaz")
+
+        f = self.subject.Formula('baaz', 'ARRAY[2,foo,bar]', 3)
+        self.assertEqual(f.x, ['2', 'foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+        f = self.subject.Formula('baaz', 'ARRAY[23,foo,bar]', 3)
+        self.assertEqual(f.x, ['23', 'foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+    def test_formula_array_with_intercept_unequal_coef_len(self):
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 2)
+        self.assertEqual(f.x, ['x[1]', 'x[2]'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 4)
+        self.assertEqual(f.x, ['x[1]', 'x[2]', 'x[3]', 'x[4]'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+    def test_formula_array_without_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+        f = self.subject.Formula('baaz', 'ARRAY["1",foo,bar]', 3)
+        self.assertEqual(f.x, ['1', 'foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+        f = self.subject.Formula('baaz', 'ARRAY["1", "foo","bar"]', 3)
+        self.assertEqual(f.x, ['1', 'foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+        f = self.subject.Formula('baaz', 'ARRAY["0", "foo","bar"]', 3)
+        self.assertEqual(f.x, ['0', 'foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+    def test_formula_array_without_intercept_unequal_coef_len(self):
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 1)
+        self.assertEqual(f.x, ['x[1]'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 3)
+        self.assertEqual(f.x, ['x[1]', 'x[2]', 'x[3]'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, False)
+
+    def test_formula_nonarray(self):
+        f = self.subject.Formula('baaz', 'foo', 3)
+        self.assertEqual(f.x, ['foo[1]', 'foo[2]'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(f.intercept, True)
+
+        f = self.subject.Formula('baaz', '{1,foo,bar}', 3)
+        self.assertEqual(f.intercept, True)
+        self.assertEqual(f.x, ['x[1]', 'x[2]', 'x[3]'])
+        self.assertEqual(f.y, "baaz")
+
+    def test_rename_string_expression_with_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename('y ~ foo+bar')
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "y")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename('y ~ foo,bar')
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "y")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename('foo+bar')
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename('foo,bar')
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename('{foo,bar}')
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+    def test_rename_string_expression_without_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename('y ~ x1+x2')
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "y")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename('y ~ x1,x2')
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "y")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename('x1+x2')
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename('x1,x2')
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename('{x1,x2}')
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+    def test_rename_string_expression_with_intercept_throws_warning(self):
+        f = self.subject.Formula('baaz', 'ARRAY[1, foo,bar]', 3)
+        f.rename('y ~ x1')
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(1, self.plpy_mock.warning.call_count)
+
+    def test_rename_string_expression_without_intercept_throws_warning(self):
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar,foobar]', 3)
+        f.rename('y ~ x1+x2')
+        self.assertEqual(f.x, ['foo', 'bar', 'foobar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(1, self.plpy_mock.warning.call_count)
+
+    def test_rename_array_expression_with_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename(['x1','x2'])
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename(['y', 'x1','x2'])
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "y")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+    def test_rename_array_expression_without_intercept(self):
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename(['x1','x2'])
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename(['y', 'x1','x2'])
+        self.assertEqual(f.x, ['x1', 'x2'])
+        self.assertEqual(f.y, "y")
+        self.assertEqual(0, self.plpy_mock.warning.call_count)
+
+    def test_rename_array_expression_with_intercept_throws_warning(self):
+        f = self.subject.Formula('baaz', 'ARRAY[1,foo,bar]', 3)
+        f.rename(['x1'])
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(1, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[1, foo,bar]', 3)
+        f.rename(['x1', 'x2', 'x3','x4'])
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(2, self.plpy_mock.warning.call_count)
+
+    def test_rename_array_expression_without_intercept_throws_warning(self):
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename(['x1'])
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(1, self.plpy_mock.warning.call_count)
+
+        f = self.subject.Formula('baaz', 'ARRAY[foo,bar]', 2)
+        f.rename(['x1', 'x2', 'x3','x4'])
+        self.assertEqual(f.x, ['foo', 'bar'])
+        self.assertEqual(f.y, "baaz")
+        self.assertEqual(2, self.plpy_mock.warning.call_count)
+
+
+if __name__ == '__main__':
+    unittest.main()
+
+# ---------------------------------------------------------------------
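
For readers skimming the tests above, the intercept rule they exercise can be
sketched roughly as follows. This is only an illustration inferred from the
assertions, not the Formula class from the commit: the helper name
split_intercept and both regular expressions are hypothetical. It covers only
the plain ARRAY[...] case where the array length matches the coefficient
count, and does not reproduce the fallback naming (x[1], x[2], ...) or the
negative-literal case tested above.

```python
import re

# Hypothetical patterns, chosen to match the straightforward test cases only.
ARRAY_RE = re.compile(r'^\s*ARRAY\[(.*)\]\s*$', re.IGNORECASE)
# An unquoted leading literal of 0, 1, 0.0 or 1.0 is treated as the intercept.
INTERCEPT_RE = re.compile(r'^[01](\.0+)?$')


def split_intercept(independent_varname):
    """Return (predictor_names, has_intercept) for an ARRAY[...] expression.

    Naive sketch: splits on commas, so quoted names containing commas
    are not handled.
    """
    match = ARRAY_RE.match(independent_varname)
    if match is None:
        raise ValueError("not an ARRAY[...] expression")
    tokens = [t.strip() for t in match.group(1).split(',')]
    # The check runs on the raw token, so a quoted "1" is a column name.
    has_intercept = bool(INTERCEPT_RE.match(tokens[0]))
    if has_intercept:
        tokens = tokens[1:]
    return [t.strip('"') for t in tokens], has_intercept


print(split_intercept('ARRAY[1,foo,bar]'))    # (['foo', 'bar'], True)
print(split_intercept('ARRAY["1",foo,bar]'))  # (['1', 'foo', 'bar'], False)
print(split_intercept('ARRAY[10,foo,bar]'))   # (['10', 'foo', 'bar'], False)
```

The point of checking the raw token before stripping quotes is that a quoted
"1" names a column, while a bare 1 is the intercept marker.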
