Shuo Xiang created SPARK-13029:
----------------------------------

             Summary: Logistic regression returns inaccurate results when there 
is a column with identical value, and fit_intercept=false
                 Key: SPARK-13029
                 URL: https://issues.apache.org/jira/browse/SPARK-13029
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 1.6.0, 1.5.2
            Reporter: Shuo Xiang


This bug appears when fitting a Logistic Regression model with 
`.setStandardization(false)` and `.setFitIntercept(false)`. If the data matrix 
has a constant column (every entry has the same value), the resulting model is 
incorrect. Specifically, the constant column always gets a weight of 0 because 
of a special check inside the code. However, the correct solution, which is 
unique for L2-regularized logistic regression, usually assigns it a non-zero 
weight.
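
To illustrate why a constant column generally deserves a non-zero weight when 
no intercept is fitted, here is a minimal pure-Python sketch. It is not Spark 
code: the toy data, the helper names (`sigmoid`, `gradient`, `fit`) and the 
plain gradient-descent solver are all assumptions made for illustration, and 
the objective is assumed to be the averaged logistic loss plus 
`(lam / 2) * ||w||^2`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w, X, y, lam):
    """Gradient of (1/n) * sum(log(1 + exp(-y_i * w.x_i))) + (lam/2) * ||w||^2."""
    n = len(X)
    g = [lam * wj for wj in w]  # gradient of the L2 penalty
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        coeff = -yi * sigmoid(-margin) / n  # derivative of the logistic loss
        for j, xj in enumerate(xi):
            g[j] += coeff * xj
    return g

def fit(X, y, lam=1.0, step=0.5, iters=500):
    """Plain gradient descent, no intercept term (hypothetical helper)."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        g = gradient(w, X, y, lam)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical toy data: one real feature plus a constant column of ones.
X = [(0.5, 1.0), (1.0, 1.0), (-0.3, 1.0), (0.8, 1.0), (-1.2, 1.0), (0.1, 1.0)]
y = [1, 1, 1, 1, -1, 1]  # labels in {-1, +1}, mostly positive

w = fit(X, y)
print("weights:", w)  # the second entry belongs to the constant column
```

Because the labels are imbalanced, the constant column acts like an intercept 
and the minimizer gives it a clearly non-zero weight; forcing that weight to 0 
yields a different, suboptimal point of the same objective.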

I used the [heart_scale 
data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and 
manually augmented the data matrix with a column of ones. The resulting data 
was run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools:
 - libsvm
 - scikit-learn
 - sparkml

(Note that libsvm and scikit-learn use a slightly different formulation, so 
their regularization parameter is set to the equivalent value 1/270.)
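
The 1/270 can be recovered with a short derivation. This is a sketch under two 
assumptions that are not stated in the report: that Spark ML minimizes the 
averaged logistic loss plus an L2 penalty weighted by the regularization 
parameter $\lambda$, and that the other two tools minimize the C-parameterized 
form used by LIBLINEAR-style solvers. The symbols $\lambda$, $C$ and $n$ are 
introduced here for illustration, with $n = 270$ examples as implied by the 
value above.

```latex
\begin{align*}
  \text{Spark ML (assumed):} \quad
    & \min_w \; \frac{1}{n} \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i w^\top x_i}\right)
      + \frac{\lambda}{2} \lVert w \rVert_2^2 \\
  \text{C-form (assumed):} \quad
    & \min_w \; \frac{1}{2} \lVert w \rVert_2^2
      + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i w^\top x_i}\right)
\end{align*}
% Dividing the first objective by \lambda does not change its minimizer:
\begin{equation*}
  \frac{1}{2} \lVert w \rVert_2^2
  + \frac{1}{n \lambda} \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i w^\top x_i}\right)
  \quad \Longrightarrow \quad
  C = \frac{1}{n \lambda} = \frac{1}{270 \cdot 1.0} = \frac{1}{270}.
\end{equation*}
```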

The first two reach an objective value of 0.7275 and give the solution vector:
[0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 
0.02436266296315414, 0.01739437315700921, -0.0006404006623321454, 
0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 
0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 
0.1801661775839843, -0.01248615347419409].

Spark produces an objective value of 0.7278 and gives the solution vector:
[0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0]
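
The two solution vectors can be compared coordinate by coordinate with a few 
lines of Python. The numbers are copied from this report; the variable names 
(`reference`, `spark`, `diffs`, `worst`) are only illustrative.

```python
# Solution reported for libsvm / scikit-learn (copied from this report).
reference = [
    0.03007516959304916, 0.09054186091216457, 0.09540306114820495,
    0.02436266296315414, 0.01739437315700921, -0.0006404006623321454,
    0.06367837291956932, -0.0589096636263823, 0.1382458934368336,
    0.06653302996539669, 0.07988499067852513, 0.1197789052423401,
    0.1801661775839843, -0.01248615347419409,
]

# Solution reported for Spark (copied from this report).
spark = [
    0.029917351003921247, 0.08993936770232434, 0.09458507615360119,
    0.024920710363734895, 0.018259589234194296, 5.929247527202199E-4,
    0.06362198973221662, -0.059307008587031494, 0.13886738997128056,
    0.0678246717525043, 0.08062880450385658, 0.12084979858539521,
    0.180460850026883, 0.0,
]

# Per-coordinate difference and the index where the two disagree most.
diffs = [r - s for r, s in zip(reference, spark)]
worst = max(range(len(diffs)), key=lambda j: abs(diffs[j]))

for j, d in enumerate(diffs):
    print(f"coordinate {j:2d}: difference {d:+.6f}")
print("largest disagreement at index", worst)
```

The largest gap is at the last coordinate, the augmented column of ones, where 
Spark reports exactly 0.0 while the other tools report a non-zero weight.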

I have a fix for it that passes the above test.



