Hi,
I am currently using PySpark's `GeneralizedLinearRegression` and have a
question about including a categorical:categorical interaction in my model.
I have searched Stack Overflow and the existing mailing list threads, but with
no luck. Let me know if I have missed anything 😊.
My question boils down to this: is the `Interaction` functionality between
categorical variables usable in combination with `GeneralizedLinearRegression`
(and, more generally, with functions using IWLS)? I am unsure whether the
design matrix generated from the feature column takes collinearity into
account.
Please see code and further comments below:
The following code utilizes the RFormula transformation to visualize the model
matrix:
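```
from pyspark.ml.feature import RFormula
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Reuse the active session (or start one) so the snippet runs standalone
spark = SparkSession.builder.getOrCreate()

# Create sample data with all combinations of the two binary factors
data = [("N", "N"), ("Y", "N"), ("N", "Y"), ("Y", "Y")]
df = spark.createDataFrame(data, ["x1", "x2"])
df = df.withColumn("y", lit(1.0))

# Main effects plus the x1:x2 interaction
formula = RFormula(formula="y ~ x1+x2+x1:x2", featuresCol="features")
transformed = formula.fit(df).transform(df)

# Display the generated feature vectors
transformed.select("x1", "x2", "features").show(truncate=False)
```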
This returns:
+---+---+-------------------------+
|x1 |x2 |features                 |
+---+---+-------------------------+
|N  |N  |[1.0,1.0,1.0,0.0,0.0,0.0]|
|Y  |N  |(6,[1,4],[1.0,1.0])      |
|N  |Y  |(6,[0,3],[1.0,1.0])      |
|Y  |Y  |(6,[5],[1.0])            |
+---+---+-------------------------+
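This produces a design matrix with 4 rows and 6 columns (7 with the
intercept), which has two problems:
Perfect Multicollinearity - The interaction columns are perfectly determined by
the main effects. For example, the (N,N) interaction is exactly where x1="N"
AND x2="N". In linear terms, the x1="N" column equals the sum of the (N,N) and
(N,Y) interaction columns, and the four interaction columns sum to the
intercept column.
Rank Deficiency - The design matrix is rank deficient. For a two-factor
interaction model with two levels per factor, we should need only 4 parameters
(intercept + 3 coefficients), but PySpark creates 6 feature columns.
This causes convergence problems when using the IWLS algorithm.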
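The rank deficiency can be checked numerically. The following is a minimal
NumPy sketch, not part of my Spark job: the feature vectors are copied from the
RFormula output above, and the column meanings in the comment are my own
reading of that output rather than something taken from the Spark
documentation.

```python
import numpy as np

# Feature vectors copied from the RFormula output, in row order
# (N,N), (Y,N), (N,Y), (Y,Y). Assumed column meanings (inferred from the
# output, not documented): [x1=N, x2=N, N:N, N:Y, Y:N, Y:Y]
features = np.array([
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
], dtype=float)

# Prepend the intercept column that the regression adds
X_spark = np.hstack([np.ones((4, 1)), features])

n_cols = X_spark.shape[1]
rank = np.linalg.matrix_rank(X_spark)
print(f"columns={n_cols}, rank={rank}")

# Repeating observations adds rows but no new linear information,
# so the rank stays the same on a larger data set with the same levels
X_repeated = np.tile(X_spark, (5, 1))
rank_repeated = np.linalg.matrix_rank(X_repeated)
print(f"rows={X_repeated.shape[0]}, rank={rank_repeated}")
```

The matrix has 7 columns but rank 4, and adding more observations of the same
four level combinations does not change that.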
--------------------
If we look at the same example using Patsy's dmatrix implementation:
```
import pandas as pd
from patsy import dmatrix
data = pd.DataFrame({
'x1': ['N', 'Y', 'N', 'Y'],
'x2': ['N', 'N', 'Y', 'Y'],
'y': [1.0, 1.0, 1.0, 1.0]
})
formula_matrix = dmatrix("x1 + x2 + x1:x2", data, return_type="dataframe")
print(formula_matrix)
```
Patsy formula matrix:
   Intercept  x1[T.Y]  x2[T.Y]  x1[T.Y]:x2[T.Y]
0        1.0      0.0      0.0              0.0
1        1.0      1.0      0.0              0.0
2        1.0      0.0      1.0              0.0
3        1.0      1.0      1.0              1.0
Here we get the model matrix I would expect: 4 columns (intercept, two main
effects, one interaction) with full column rank.
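To make the comparison concrete, here is a small NumPy sketch contrasting the
two encodings. IWLS solves weighted normal equations in each iteration; for
simplicity I assume unit weights here, so the matrix to be solved is X'X. Both
matrices are copied from the outputs above, and this is only an illustration,
not what Spark does internally.

```python
import numpy as np

# Treatment-coded matrix copied from the Patsy output
# columns: [Intercept, x1[T.Y], x2[T.Y], x1[T.Y]:x2[T.Y]]
X_patsy = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
], dtype=float)

# Intercept plus the six RFormula feature columns from the Spark output
X_spark = np.array([
    [1, 1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
], dtype=float)

# Gram matrices X'X, i.e. the normal equations under unit weights
gram_patsy = X_patsy.T @ X_patsy   # 4x4
gram_spark = X_spark.T @ X_spark   # 7x7

rank_gram_patsy = np.linalg.matrix_rank(gram_patsy)
rank_gram_spark = np.linalg.matrix_rank(gram_spark)
print(f"Patsy: {gram_patsy.shape[0]} unknowns, rank {rank_gram_patsy}")
print(f"Spark: {gram_spark.shape[0]} unknowns, rank {rank_gram_spark}")

# Stacking the columns of both encodings adds no rank, so both span the
# same column space and can represent the same fitted values
rank_combined = np.linalg.matrix_rank(np.hstack([X_patsy, X_spark]))
print(f"combined rank {rank_combined}")
```

The Patsy system has 4 unknowns and rank 4, so it has a unique solution. The
RFormula system has 7 unknowns but only rank 4, so its coefficients are not
uniquely determined even though both encodings describe the same model.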
Best Regards
Emil Hofman | Actuary