kaknikhil opened a new pull request, #613:
URL: https://github.com/apache/madlib/pull/613
JIRA: MADLIB-1517
Note that this PR only fixes GLM, logisitic and linear. A future PR
will fix other pmml modules.
# Context :
MADlib's way of passing intercept to regression models is a bit unusual.
Usually intercept is a boolean which indicates whether the model needs to be
fit with intercept or not. MADlib makes the user pass an integer (1 means use
an intercept and no value means don't fit with intercept) along with the
other
independent variables and uses that directly for computation. For e.g.
ARRAY[1,x1,x2,...] indicates use an intercept whereas ARRAY[x1,x2,...] means
don't use an intercept
# Problem:
* So essentially all the regression algorithms treat the intercept value "1"
as
just another independent variable(it's always the first one though).
* Because of this implicit assumption, users need to specifically inject a
predictor named "1" with a value of 1 for all the input rows. This can be
very
inconvenient specially when using pmml to predict a stream of data or some
other preprocessed form of data.
# Fix:
* Once the model is trained, the only way to know if the model was fit with
intercept is to look at the `independent_varname` field in the summary
table.
* If the value contains ARRAY[1, x1, x2..], then an intercept was used.
* Since this intercept is just another independent variable, there aren't any
explicit references or logic to handle intercept in our python or c++ code
for
training or predict.
* Because of this assumption, the pmml code also considers all of
"ARRAY[1,x1,x2,...]" as independent variables and hence the output pmml
contains "1" as an input predictor.
* We can just remove all references to the column "1" in the pmml file. We
will
still keep the "p0" variable which is explicitly marked as an intercept and
will store the intercept's coefficient
* The pmml module gets the 'X' and 'Y' values from the summary table and then
parses it to create a list of all the independent predictors so that it can
be written to the pmml file
* It uses regex to match the expression ARRAY[1,x1,x2]/ARRAY[x1,x2] and then
returns either ['1','x1','x2'] or ['x1',x2']
* Our goal with the pmml code is to not treat the intercept "1" as an
independent predictor but just as an intercept
* The commit fixes this by changing the regex and using the output to
determine
if an intercept was passed so that both expressions
ARRAY[1,x1,x2]/ARRAY[x1,x2] return ['x1', 'x2']
* Also had to make changes to the various pmml builder classes to treat
intercept's coefficient differently than the feature coefficients
* Note that this PR only fixes GLM, logisitic and linear. A future PR
will fix other pmml modules.
Before the fix:
```
<?xml version="1.0" standalone="yes"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
<Header copyright="Copyright (c) 2024 nkak">
<Extension extender="MADlib" name="user" value="nkak"/>
<Application name="MADlib" version="2.1.0"/>
<Timestamp>2024-02-16 12:19:58.798139 PDT</Timestamp>
</Header>
<DataDictionary numberOfFields="4">
<DataField name="second_attack_pmml_prediction" optype="categorical"
dataType="boolean">
<Value value="True"/>
<Value value="False"/>
</DataField>
<DataField name="1" optype="continuous" dataType="double"/>
<DataField name="treatment" optype="continuous" dataType="double"/>
<DataField name="trait_anxiety" optype="continuous" dataType="double"/>
</DataDictionary>
<RegressionModel functionName="classification"
normalizationMethod="softmax">
<MiningSchema>
<MiningField name="second_attack_pmml_prediction"
usageType="predicted"/>
<MiningField name="1"/>
<MiningField name="treatment"/>
<MiningField name="trait_anxiety"/>
</MiningSchema>
<RegressionTable intercept="0.0" targetCategory="True">
<NumericPredictor name="1" coefficient="-6.363469941781864"/>
<NumericPredictor name="treatment" coefficient="-1.0241060523932668"/>
<NumericPredictor name="trait_anxiety"
coefficient="0.11904491666860616"/>
</RegressionTable>
<RegressionTable intercept="0.0" targetCategory="False"/>
</RegressionModel>
</PMML>
```
After the fix:
```
<?xml version="1.0" standalone="yes"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
<Header copyright="Copyright (c) 2024 nkak">
<Extension extender="MADlib" name="user" value="nkak"/>
<Application name="MADlib" version="2.1.0"/>
<Timestamp>2024-02-16 13:37:15.367609 PDT</Timestamp>
</Header>
<DataDictionary numberOfFields="3">
<DataField name="second_attack_pmml_prediction" optype="categorical"
dataType="boolean">
<Value value="True"/>
<Value value="False"/>
</DataField>
<DataField name="treatment" optype="continuous" dataType="double"/>
<DataField name="trait_anxiety" optype="continuous" dataType="double"/>
</DataDictionary>
<RegressionModel functionName="classification"
normalizationMethod="softmax">
<MiningSchema>
<MiningField name="second_attack_pmml_prediction"
usageType="predicted"/>
<MiningField name="treatment"/>
<MiningField name="trait_anxiety"/>
</MiningSchema>
<RegressionTable intercept="-6.36346994178186" targetCategory="True">
<NumericPredictor name="treatment" coefficient="-1.0241060523932697"/>
<NumericPredictor name="trait_anxiety"
coefficient="0.11904491666860609"/>
</RegressionTable>
<RegressionTable intercept="0.0" targetCategory="False"/>
</RegressionModel>
</PMML>
```
# Risks and limitations:
1. Straying away from the intrinsic intercept assumption only for the pmml
code:
* As we have established already, the intercept is not treated any
different
from an independent variable.
* To fix the pmml file to not include the intercept as an independent
variable,
we will need to break this intrinsic assumption.
* If the pmml code breaks this assumption, it's possible that there might
be
some unexpected side effects or errors that even with exhaustive testing
may
not be uncovered. For e.g. the pmml code relies on the len of the
coefficient
to make some decisions about naming and such. (See formula.py for an
example)
which might have some weird edge cases
We might be fine with this risk for now and if something breaks in the
future,
we can deal with it later. Biggest risk is that something fundamental
breaks in
the future that might make us revert this new logic. But the odds of
that are
pretty low
2. If the user passed a non array expression for the independent variable
Consider the following example
```
-- Create a table where the x variable is an array of the independent
variables to train on
CREATE TABLE warpbreaks_dummy_simple_xcol AS SELECT breaks AS y,
ARRAY[1,"wool_B","tension_M", "tension_H"]
AS x_a from warpbreaks_dummy;
-- Now use the column 'x_a' created in the previous step.
SELECT madlib.glm('warpbreaks_dummy_simple_xcol',
'glm_warpbreaks_intercept_1_simple_xcol', 'y' , 'x_a' ,
'family=poisson, link=log');
```
Now there's no way for us know if this model was fit with an intercept
or not.
The only way to know is to check the value of "independent_varname" in
the
summary table which would be "x_A" in this case which won't tell us
anything
about the intercept.
Ideally, we would like to change the fit functions to take a boolen for
the
intercept arg but that will too big of a change and hence is out of
scope of this commit.
The easiest fix for this problem for now is that we are going to assume
that
all non array expressions always include the intercept. Note that this
assumption only applies to the pmml module
3. Using the name_spec arg of the pmml function
* The pmml function accepts an optional arg named "name_spec" which is
used to explicitly name the input and output variables in the pmml file.
* The user will now need to remove the "1" from this expression
For e.g. `SELECT madlib.pmml('patients_logregr',
'attack~1+anxiety+treatment');` will have to be rewritten as
`SELECT madlib.pmml('patients_logregr', 'attack~anxiety+treatment');`
* We will need to remove this from the pmml user docs which will be done
in a separate PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]