[ 
https://issues.apache.org/jira/browse/MADLIB-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikhil Kak updated MADLIB-1517:
-------------------------------
    Description: 
Our regression models include the intercept as a predictor in the exported pmml 
file.

Reproduction Steps

Create a glm model and generate the pmml:

CREATE TABLE warpbreaks(
    id      serial,
    breaks  integer,
    wool    char(1),
    tension char(1)
);
INSERT INTO warpbreaks(breaks, wool, tension) VALUES
(26, 'A', 'L'),
(30, 'A', 'L'),
(54, 'A', 'L'),
(25, 'A', 'L'),
(70, 'A', 'L'),
(52, 'A', 'L'),
(51, 'A', 'L'),
(26, 'A', 'L'),
(67, 'A', 'L'),
(18, 'A', 'M'),
(21, 'A', 'M'),
(29, 'A', 'M'),
(17, 'A', 'M'),
(12, 'A', 'M'),
(18, 'A', 'M'),
(35, 'A', 'M'),
(30, 'A', 'M'),
(36, 'A', 'M'),
(36, 'A', 'H'),
(21, 'A', 'H'),
(24, 'A', 'H'),
(18, 'A', 'H'),
(10, 'A', 'H'),
(43, 'A', 'H'),
(28, 'A', 'H'),
(15, 'A', 'H'),
(26, 'A', 'H'),
(27, 'B', 'L'),
(14, 'B', 'L'),
(29, 'B', 'L'),
(19, 'B', 'L'),
(29, 'B', 'L'),
(31, 'B', 'L'),
(41, 'B', 'L'),
(20, 'B', 'L'),
(44, 'B', 'L'),
(42, 'B', 'M'),
(26, 'B', 'M'),
(19, 'B', 'M'),
(16, 'B', 'M'),
(39, 'B', 'M'),
(28, 'B', 'M'),
(21, 'B', 'M'),
(39, 'B', 'M'),
(29, 'B', 'M'),
(20, 'B', 'H'),
(21, 'B', 'H'),
(24, 'B', 'H'),
(17, 'B', 'H'),
(13, 'B', 'H'),
(15, 'B', 'H'),
(15, 'B', 'H'),
(16, 'B', 'H'),
(28, 'B', 'H');
SELECT create_indicator_variables('warpbreaks', 'warpbreaks_dummy', 
'wool,tension');
DROP TABLE IF EXISTS glm_model, glm_model_summary;
SELECT glm('warpbreaks_dummy',
           'glm_model',
           'breaks',
           'ARRAY[1.0,"wool_B","tension_M", "tension_H"]',
           'family=poisson, link=log');
COPY (SELECT madlib.pmml('glm_model')) TO '/tmp/glm.pmml';

SELECT madlib.glm_predict(coef,  ARRAY[1, 0, 1, 0]::float8[], 'log') FROM 
glm_model;
    glm_predict
--------------------
 29.097222222222218

SELECT madlib.glm_predict(coef, ARRAY[1, 0, 0, 0]::float8[], 'log') FROM 
glm_model;
    glm_predict
--------------------
 40.123538011695906
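
To make the role of the intercept concrete, here is a small sketch of what a log-link prediction computes. It assumes glm_predict with 'log' evaluates exp(sum(coef_i * x_i)), and it recovers only the intercept and tension_M coefficients from the two predictions shown above. The helper name predict_log_link is made up for illustration, and the wool_B and tension_H coefficients are placeholders because the report does not list them.

```python
import math

# Predictions reported by madlib.glm_predict above.
PRED_BASELINE = 40.123538011695906   # row [1, 0, 0, 0]
PRED_TENSION_M = 29.097222222222218  # row [1, 0, 1, 0]

# Assumption: with a log link, prediction = exp(sum(coef_i * x_i)), so these
# two coefficients can be recovered from the two reported predictions.
intercept = math.log(PRED_BASELINE)
coef_tension_m = math.log(PRED_TENSION_M / PRED_BASELINE)


def predict_log_link(coefs, row):
    """Apply the inverse log link to the linear predictor (hypothetical helper)."""
    return math.exp(sum(c * x for c, x in zip(coefs, row)))


# Order follows ARRAY[1.0, "wool_B", "tension_M", "tension_H"].
# The wool_B and tension_H coefficients are not shown in the report; 0.0 is a
# placeholder that is harmless here because both rows set those inputs to 0.
coefs = [intercept, 0.0, coef_tension_m, 0.0]

print(predict_log_link(coefs, [1, 0, 0, 0]))
print(predict_log_link(coefs, [1, 0, 1, 0]))
```

The leading 1 in each row is the constant that multiplies the intercept. It is part of the model, not a measured input, which is why a caller should not have to supply it.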

Use pypmml to predict the data using the generated pmml:

from pypmml import Model
model = Model.fromFile('/tmp/glm.pmml')
data = [{ "wool_B": 0, "tension_M": 0, "tension_H": 0},
        { "wool_B": 0, "tension_M": 1, "tension_H": 0},
        ]
for d in data:
    result = model.predict(d)
    print(d)
    print(result)

{'wool_B': 0, 'tension_M': 0, 'tension_H': 0}
{'predicted_breaks_pmml_prediction': nan}
{'wool_B': 0, 'tension_M': 1, 'tension_H': 0}
{'predicted_breaks_pmml_prediction': nan}

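The nan results are wrong: the exported model treats the intercept as an input 
field, so prediction fails when that field is missing. To make it work with the 
existing pmml file, each input row has to include the intercept explicitly.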

from pypmml import Model
model = Model.fromFile('/tmp/glm.pmml')
data = [{ "1.0": 1, "wool_B": 0, "tension_M": 0, "tension_H": 0},
        { "1.0": 1, "wool_B": 0, "tension_M": 1, "tension_H": 0}
        ]


for d in data:
    result = model.predict(d)
    print(d)
    print(result)

{'1.0': 1, 'wool_B': 0, 'tension_M': 0, 'tension_H': 0}
{'predicted_breaks_pmml_prediction': 40.123538011695906}
{'1.0': 1, 'wool_B': 0, 'tension_M': 1, 'tension_H': 0}
{'predicted_breaks_pmml_prediction': 29.097222222222218}

Now these values match the madlib glm_predict output
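
Until the export is fixed, the workaround above can be wrapped in a small helper so callers do not have to add the intercept by hand. This is only a sketch: with_intercept and INTERCEPT_FIELD are hypothetical names, and the field name "1.0" is taken from the example above.

```python
# Name under which the intercept appears in the exported pmml (see above).
INTERCEPT_FIELD = "1.0"


def with_intercept(row, field=INTERCEPT_FIELD):
    """Return a copy of row with the constant intercept input placed first.

    The original row is left untouched. If row already contains the field,
    its existing value is kept.
    """
    patched = {field: 1}
    patched.update(row)
    return patched


rows = [
    {"wool_B": 0, "tension_M": 0, "tension_H": 0},
    {"wool_B": 0, "tension_M": 1, "tension_H": 0},
]
patched_rows = [with_intercept(r) for r in rows]
print(patched_rows)
```

Each patched row can then be passed to model.predict() in the same way as in the loop above.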

Goal

The goal of this story is to fix the glm pmml code so that we don't need to 
awkwardly pass the intercept as "'1.0':1" at the beginning of each data row. 
Note that the name "1.0" is used because that's how it's stored in the pmml 
file (see the predictorName property).
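
One way to check an exported file for this problem is to list every predictorName and flag the ones that are numeric literals rather than column names. The sketch below is an assumption-laden illustration: the XML fragment is a hypothetical, heavily trimmed sample, not the actual output of madlib.pmml(), and the helper names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed fragment for illustration only. The real file
# written by madlib.pmml() has more elements and may be laid out differently.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <GeneralRegressionModel>
    <PPMatrix>
      <PPCell value="1" predictorName="1.0" parameterName="p0"/>
      <PPCell value="1" predictorName="wool_B" parameterName="p1"/>
    </PPMatrix>
  </GeneralRegressionModel>
</PMML>"""


def predictor_names(pmml_text):
    """Collect distinct predictorName attribute values in document order."""
    root = ET.fromstring(pmml_text)
    names = []
    for element in root.iter():
        name = element.get("predictorName")
        if name is not None and name not in names:
            names.append(name)
    return names


def looks_like_constant(name):
    """Heuristic: a name that parses as a number is probably the intercept."""
    try:
        float(name)
    except ValueError:
        return False
    return True


names = predictor_names(SAMPLE)
suspicious = [n for n in names if looks_like_constant(n)]
print(suspicious)
```

After a fix, a check like this should find no numeric predictor names in the exported file. The numeric test is only a heuristic; a real column whose name parses as a number would be flagged too.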

  was:
Currently the PMML export utility ({{madlib.pmml()}}) in madlib supports 
exporting decision tree, logistic regression, and linear regression models as a 
pmml file. When trying to read this file from python code, we see the following:
1. Decision tree model: fails with invalid XML.
2. Logistic/linear regression: the model isn't exported correctly, as the 
predictions are off.
Below is the python code to import the logreg pmml generated as shown in the 
[logreg example|https://madlib.apache.org/docs/latest/group__grp__pmml.html]: 
{code}
from pypmml import Model
model = Model.fromFile('logreg.pmml')
{code}
As part of this JIRA, we need to fix these PMML exports so that they can be 
imported by other apps.


> Do not include intercept as predictor for regression models
> -----------------------------------------------------------
>
>                 Key: MADLIB-1517
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1517
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Utilities
>            Reporter: Ekta Khanna
>            Priority: Major
>             Fix For: v2.2.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
