[jira] [Updated] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
[ https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-13029:
---
    Target Version/s: 1.5.3, 2.0.0, 1.6.2  (was: 1.5.3, 1.6.1, 2.0.0)

> Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
> ---
>
>                 Key: SPARK-13029
>                 URL: https://issues.apache.org/jira/browse/SPARK-13029
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.5.2, 1.6.0
>            Reporter: Shuo Xiang
>            Assignee: Shuo Xiang
>
> This is a bug that appears while fitting a Logistic Regression model with
> `.setStandardization(false)` and `.setFitIntercept(false)`. If the data matrix
> has a column whose entries are all identical (a constant column), the
> resulting model is not correct. Specifically, the constant column always gets
> a weight of 0 because of a special check inside the code. However, the correct
> solution, which is unique for L2 logistic regression, usually assigns that
> column a non-zero weight.
>
> I used the heart_scale data
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and
> manually augmented the data matrix with a column of ones (available in the
> PR). The resulting data was run with reg=1.0, max_iter=1000, tol=1e-9 on the
> following tools:
>  - libsvm
>  - scikit-learn
>  - sparkml
> (Note that libsvm and scikit-learn use a slightly different formulation, so
> their regularization parameter is set to the equivalent value 1/270.)
>
> The first two reach an objective value of 0.7275 and give the solution vector:
> [0.03007516959304916, 0.09054186091216457, 0.09540306114820495,
> 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454,
> 0.06367837291956932, -0.0589096636263823, 0.1382458934368336,
> 0.06653302996539669, 0.07988499067852513, 0.1197789052423401,
> 0.1801661775839843, -0.01248615347419409].
>
> Spark produces an objective value of 0.7278 and gives the solution vector:
> [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0]
>
> Notice that the last element of the weight vector is 0.
>
> An even simpler example is:
> {code:title=benchmark.py|borderStyle=solid}
> import numpy as np
> from sklearn.linear_model import LogisticRegression
>
> # Two samples; the second column is constant (all ones).
> x_train = np.array([[1, 1], [0, 1]])
> y_train = np.array([1, 0])
> model = LogisticRegression(tol=1e-9, C=0.5, max_iter=1000,
>                            fit_intercept=False).fit(x_train, y_train)
> print(model.coef_)
> # [[ 0.22478867 -0.02241016]]
> {code}
> Training on the same data with the current Spark solver gives a different
> result; see the unit test in the PR.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
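The toy example quoted above can also be checked without either library. What follows is only a sketch under stated assumptions, not the code from the PR: it assumes the scikit-learn-style objective 0.5*||w||^2 + C*sum(log(1 + exp(-y_i * x_i.w))) with the 1/0 labels mapped to +1/-1, and it uses plain gradient descent in place of whatever solver each library actually runs. The helper names `objective`, `gradient`, and `fit` are hypothetical. Pinning the constant column's weight to 0 imitates the behaviour described in the report.

```python
import numpy as np


def objective(w, X, y, C):
    """0.5*||w||^2 + C * sum(log(1 + exp(-y_i * x_i.w))), labels in {-1, +1}."""
    margins = y * (X @ w)
    return 0.5 * (w @ w) + C * np.sum(np.logaddexp(0.0, -margins))


def gradient(w, X, y, C):
    """Gradient of `objective` with respect to w."""
    margins = y * (X @ w)
    # d/dw log(1 + exp(-m_i)) = -y_i * x_i / (1 + exp(m_i))
    coeff = -y / (1.0 + np.exp(margins))
    return w + C * (X.T @ coeff)


def fit(X, y, C, free=None, step=0.5, iters=500):
    """Gradient descent from zero; `free` marks the weights allowed to move."""
    w = np.zeros(X.shape[1])
    mask = np.ones_like(w) if free is None else np.asarray(free, dtype=float)
    for _ in range(iters):
        w -= step * mask * gradient(w, X, y, C)
    return w


X = np.array([[1.0, 1.0], [0.0, 1.0]])  # second column is constant
y = np.array([1.0, -1.0])               # labels 1/0 mapped to +1/-1
C = 0.5

w_full = fit(X, y, C)                        # unconstrained minimizer
w_pinned = fit(X, y, C, free=[True, False])  # constant column forced to 0

print(w_full)  # roughly [ 0.2248 -0.0224], in line with the scikit-learn output
print(objective(w_full, X, y, C), objective(w_pinned, X, y, C))
```

Because the objective is strictly convex, its minimizer is unique, so any solution that forces the constant column's weight to 0 ends up with a slightly higher objective value. That is the same pattern as the 0.7278 versus 0.7275 gap reported for the heart_scale data.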
[jira] [Updated] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
[ https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-13029: --- Description: This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has one column with identical value, the resulting model is not correct. Specifically, the special column will always get a weight of 0, due to the special check inside the code. However, the correct solution, which is unique for L2 logistic regression, usually has non-zero weight. I use the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of one (available in the PR). The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools: - libsvm - scikit-learn - sparkml (Notice libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270). The first two will have an objective value 0.7275 and give a solution vector: [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409]. Spark will produce an objective value 0.7278 and give a solution vector: [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] Notice the last element of the weight vector is 0. 
A even simpler example is: import numpy as np from sklearn.datasets import load_svmlight_file from sklearn.linear_model import LogisticRegression x_train = np.array([[1, 1], [0, 1]]) y_train = np.array([1, 0]) model = LogisticRegression(tol=1e-9, C=0.5, max_iter=1000, fit_intercept=False).fit(x_train, y_train) print model.coef_ [[ 0.22478867 -0.02241016]] The same data trained by the current solver also gives a different result, see the unit test in the PR. was: This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has one column with identical value, the resulting model is not correct. Specifically, the special column will always get a weight of 0, due to the special check inside the code. However, the correct solution, which is unique for L2 logistic regression, usually has non-zero weight. I use the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of one (available in the PR). The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools: - libsvm - scikit-learn - sparkml (Notice libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270). The first two will have an objective value 0.7275 and give a solution vector: [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409]. 
Spark will produce an objective value 0.7278 and give a solution vector: [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] Notice the last element of the weight vector is 0. > Logistic regression returns inaccurate results when there is a column with > identical value, and fit_intercept=false > --- > > Key: SPARK-13029 > URL: https://issues.apache.org/jira/browse/SPARK-13029 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.5.2, 1.6.0 >Reporter: Shuo Xiang >Assignee: Shuo Xiang > > This is a bug that appears while fitting a Logistic Regression model with > `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix > has one column with identical value, the resulting model is not correct. > Specifically, the special column will always get a weight of 0, due to the > special check inside the code. However, the correct solution, which is unique > for L2 logistic
[jira] [Updated] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
[ https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-13029: --- Description: This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has one column with identical value, the resulting model is not correct. Specifically, the special column will always get a weight of 0, due to the special check inside the code. However, the correct solution, which is unique for L2 logistic regression, usually has non-zero weight. I use the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of one (available in the PR). The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools: - libsvm - scikit-learn - sparkml (Notice libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270). The first two will have an objective value 0.7275 and give a solution vector: [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409]. Spark will produce an objective value 0.7278 and give a solution vector: [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] Notice the last element of the weight vector is 0. 
A even simpler example is: {code:title=benchmark.py|borderStyle=solid} import numpy as np from sklearn.datasets import load_svmlight_file from sklearn.linear_model import LogisticRegression x_train = np.array([[1, 1], [0, 1]]) y_train = np.array([1, 0]) model = LogisticRegression(tol=1e-9, C=0.5, max_iter=1000, fit_intercept=False).fit(x_train, y_train) print model.coef_ [[ 0.22478867 -0.02241016]] {code} The same data trained by the current solver also gives a different result, see the unit test in the PR. was: This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has one column with identical value, the resulting model is not correct. Specifically, the special column will always get a weight of 0, due to the special check inside the code. However, the correct solution, which is unique for L2 logistic regression, usually has non-zero weight. I use the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of one (available in the PR). The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools: - libsvm - scikit-learn - sparkml (Notice libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270). The first two will have an objective value 0.7275 and give a solution vector: [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409]. 
Spark will produce an objective value 0.7278 and give a solution vector: [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] Notice the last element of the weight vector is 0. A even simpler example is: {code:title=benchmark.py|borderStyle=solid} import numpy as np from sklearn.datasets import load_svmlight_file from sklearn.linear_model import LogisticRegression x_train = np.array([[1, 1], [0, 1]]) y_train = np.array([1, 0]) model = LogisticRegression(tol=1e-9, C=0.5, max_iter=1000, fit_intercept=False).fit(x_train, y_train) print model.coef_ [[ 0.22478867 -0.02241016]] {code} The same data trained by the current solver also gives a different result, see the unit test in the PR. > Logistic regression returns inaccurate results when there is a column with > identical value, and fit_intercept=false > --- > > Key: SPARK-13029 > URL: https://issues.apache.org/jira/browse/SPARK-13029 > Project: Spark
[jira] [Updated] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
[ https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-13029: --- Description: This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has one column with identical value, the resulting model is not correct. Specifically, the special column will always get a weight of 0, due to the special check inside the code. However, the correct solution, which is unique for L2 logistic regression, usually has non-zero weight. I use the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of one (available in the PR). The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools: - libsvm - scikit-learn - sparkml (Notice libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270). The first two will have an objective value 0.7275 and give a solution vector: [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409]. Spark will produce an objective value 0.7278 and give a solution vector: [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] Notice the last element of the weight vector is 0. 
A even simpler example is: {code:title=benchmark.py|borderStyle=solid} import numpy as np from sklearn.datasets import load_svmlight_file from sklearn.linear_model import LogisticRegression x_train = np.array([[1, 1], [0, 1]]) y_train = np.array([1, 0]) model = LogisticRegression(tol=1e-9, C=0.5, max_iter=1000, fit_intercept=False).fit(x_train, y_train) print model.coef_ [[ 0.22478867 -0.02241016]] {code} The same data trained by the current solver also gives a different result, see the unit test in the PR. was: This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has one column with identical value, the resulting model is not correct. Specifically, the special column will always get a weight of 0, due to the special check inside the code. However, the correct solution, which is unique for L2 logistic regression, usually has non-zero weight. I use the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of one (available in the PR). The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools: - libsvm - scikit-learn - sparkml (Notice libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270). The first two will have an objective value 0.7275 and give a solution vector: [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409]. 
Spark will produce an objective value 0.7278 and give a solution vector: [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] Notice the last element of the weight vector is 0. A even simpler example is: import numpy as np from sklearn.datasets import load_svmlight_file from sklearn.linear_model import LogisticRegression x_train = np.array([[1, 1], [0, 1]]) y_train = np.array([1, 0]) model = LogisticRegression(tol=1e-9, C=0.5, max_iter=1000, fit_intercept=False).fit(x_train, y_train) print model.coef_ [[ 0.22478867 -0.02241016]] The same data trained by the current solver also gives a different result, see the unit test in the PR. > Logistic regression returns inaccurate results when there is a column with > identical value, and fit_intercept=false > --- > > Key: SPARK-13029 > URL: https://issues.apache.org/jira/browse/SPARK-13029 >
[jira] [Updated] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
[ https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-13029: --- Description: This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has one column with identical value, the resulting model is not correct. Specifically, the special column will always get a weight of 0, due to the special check inside the code. However, the correct solution, which is unique for L2 logistic regression, usually has non-zero weight. I use the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of one. The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools: - libsvm - scikit-learn - sparkml (Notice libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270). The first two will have an objective value 0.7275 and give a solution vector: [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409]. Spark will produce an objective value 0.7278 and give a solution vector: [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] Notice the last element of the weight vector is 0. was: This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. 
If the data matrix has one column with identical value, the resulting model is not correct. Specifically, the special column will always get a weight of 0, due to the special check inside the code. However, the correct solution, which is unique for L2 logistic regression, usually has non-zero weight. I use the [heart_scale data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of one. The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools: - libsvm - scikit-learn - sparkml (Notice libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270). The first two will have an objective value 0.7275 and give a solution vector: [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409]. Spark will produce an objective value 0.7278 and give a solution vector: [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] Notice the last element of the weight vector is 0. > Logistic regression returns inaccurate results when there is a column with > identical value, and fit_intercept=false > --- > > Key: SPARK-13029 > URL: https://issues.apache.org/jira/browse/SPARK-13029 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.5.2, 1.6.0 >Reporter: Shuo Xiang >Assignee: Shuo Xiang > > This is a bug that appears while fitting a Logistic Regression model with > `.setStandardization(false)` and `setFitIntercept(false)`. 
[jira] [Updated] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
[ https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-13029: --- Description: This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has one column with identical value, the resulting model is not correct. Specifically, the special column will always get a weight of 0, due to the special check inside the code. However, the correct solution, which is unique for L2 logistic regression, usually has non-zero weight. I use the [heart_scale data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of one. The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools: - libsvm - scikit-learn - sparkml (Notice libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270). The first two will have an objective value 0.7275 and give a solution vector: [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409]. Spark will produce an objective value 0.7278 and give a solution vector: [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] Notice the last element of the weight vector is 0. was: This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. 
If the data matrix has one column with identical value, the resulting model is not correct. Specifically, the special column will always get a weight of 0, due to the special check inside the code. However, the correct solution, which is unique for L2 logistic regression, usually has non-zero weight. I use the [heart_scale data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of one. The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools: - libsvm - scikit-learn - sparkml (Notice libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270). The first two will have an objective value 0.7275 and give a solution vector: [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409]. Spark will produce an objective value 0.7278 and give a solution vector: [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] Notice the last element of the weight vector is 0. I have a fix for it and passed the above test. > Logistic regression returns inaccurate results when there is a column with > identical value, and fit_intercept=false > --- > > Key: SPARK-13029 > URL: https://issues.apache.org/jira/browse/SPARK-13029 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.5.2, 1.6.0 >Reporter: Shuo Xiang >Assignee: Shuo Xiang > > This is a bug that appears while fitting a Logistic Regression model with > `.setStandardization(false)` and `setFitIntercept(false)`. 
[jira] [Updated] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
[ https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-13029: --- Description: This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has one column with identical value, the resulting model is not correct. Specifically, the special column will always get a weight of 0, due to the special check inside the code. However, the correct solution, which is unique for L2 logistic regression, usually has non-zero weight. I use the [heart_scale data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of one. The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools: - libsvm - scikit-learn - sparkml (Notice libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270). The first two will have an objective value 0.7275 and give a solution vector: [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409]. Spark will produce an objective value 0.7278 and give a solution vector: [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] Notice the last element of the weight vector is 0. I have a fix for it and passed the above test. was: This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. 
If the data matrix has one column with identical value, the resulting model is not correct. Specifically, the special column will always get a weight of 0, due to the special check inside the code. However, the correct solution, which is unique for L2 logistic regression, usually has non-zero weight. I use the [heart_scale data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of one. The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools: - libsvm - scikit-learn - sparkml (Notice libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270). The first two will have an objective value 0.7275 and give a solution vector: [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409]. Spark will produce an objective value 0.7278 and give a solution vector: [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0] I have a fix for it and passed the above test. > Logistic regression returns inaccurate results when there is a column with > identical value, and fit_intercept=false > --- > > Key: SPARK-13029 > URL: https://issues.apache.org/jira/browse/SPARK-13029 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.5.2, 1.6.0 >Reporter: Shuo Xiang > > This is a bug that appears while fitting a Logistic Regression model with > `.setStandardization(false)` and `setFitIntercept(false)`. 
[jira] [Updated] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
[ https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-13029:
--
    Shepherd: Xiangrui Meng
    Target Version/s: 1.5.3, 1.6.1, 2.0.0

> Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
> ---
>
> Key: SPARK-13029
> URL: https://issues.apache.org/jira/browse/SPARK-13029
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Shuo Xiang
> Assignee: Shuo Xiang
>
> This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has a column whose values are all identical, the resulting model is not correct. Specifically, that column always gets a weight of 0 because of a special check inside the code. However, the correct solution, which is unique for L2-regularized logistic regression, usually has a non-zero weight.
>
> I used the [heart_scale data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of ones. The resulting data was run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools:
> - libsvm
> - scikit-learn
> - sparkml
>
> (Notice that libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270.)
>
> The first two reach an objective value of 0.7275 and give the solution vector:
> [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454, 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409].
>
> Spark produces an objective value of 0.7278 and gives the solution vector:
> [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0]
>
> Notice the last element of the weight vector is 0.
>
> I have a fix for it, and it passes the above test.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
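
The claim in the description, that the unique L2-regularized solution generally puts a non-zero weight on a constant column, can be checked on a toy problem. The following is only a sketch under stated assumptions: the two-sample dataset is hypothetical (its second column is all ones), the objective is assumed to be the mean logistic loss plus (reg / 2) * ||w||^2 with no intercept, and plain gradient descent stands in for whatever solver each tool really uses. The `frozen` argument is an illustrative way to mimic a solver that pins the constant column's weight to 0; it is not Spark's actual check.

```python
import math

# Hypothetical two-sample dataset: the second column is constant (all ones).
X = [[1.0, 1.0], [0.0, 1.0]]
y = [1.0, 0.0]
REG = 1.0  # assumed L2 strength in the averaged-loss formulation


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def objective(w):
    """Mean logistic loss plus (REG / 2) * ||w||^2, with no intercept."""
    loss = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))
        # log(1 + exp(z)) - yi * z is the logistic loss for labels in {0, 1}.
        loss += math.log1p(math.exp(z)) - yi * z
    return loss / len(X) + 0.5 * REG * sum(wj * wj for wj in w)


def gradient(w):
    """Gradient of objective(w) with respect to w."""
    g = [REG * wj for wj in w]
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))
        err = (sigmoid(z) - yi) / len(X)
        for j, xj in enumerate(xi):
            g[j] += err * xj
    return g


def fit(frozen=(), steps=500, lr=0.5):
    """Plain gradient descent; indices listed in `frozen` are pinned to 0."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(w)
        w = [0.0 if j in frozen else wj - lr * gj
             for j, (wj, gj) in enumerate(zip(w, g))]
    return w


w_free = fit()               # every weight is optimized
w_pinned = fit(frozen={1})   # constant column forced to weight 0
print("free:  ", w_free, objective(w_free))
print("pinned:", w_pinned, objective(w_pinned))
```

Under these assumptions the unconstrained fit assigns a small non-zero weight to the constant column and reaches a slightly lower objective than the pinned fit, which is the same pattern as the 0.7275 versus 0.7278 gap reported above.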
[jira] [Updated] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
[ https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-13029:
--
    Assignee: Shuo Xiang
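
The issue description notes that libsvm and scikit-learn use a slightly different formulation, so their regularizer is set to 1/270. A minimal sketch of that conversion follows, assuming Spark minimizes the mean loss plus reg * (1/2) * ||w||^2 while scikit-learn minimizes (1/2) * ||w||^2 plus C times the summed loss; dividing the first objective by reg gives C = 1 / (n * reg). The helper names are hypothetical, and 270 is taken to be the number of heart_scale samples, which is consistent with the 1/270 quoted in the report.

```python
def sklearn_c_from_reg(reg_param, n_samples):
    """Map an averaged-loss L2 strength to scikit-learn's C (assumed forms)."""
    if reg_param <= 0 or n_samples <= 0:
        raise ValueError("reg_param and n_samples must be positive")
    return 1.0 / (reg_param * n_samples)


def reg_from_sklearn_c(c, n_samples):
    """Inverse mapping: scikit-learn's C back to the averaged-loss strength."""
    if c <= 0 or n_samples <= 0:
        raise ValueError("c and n_samples must be positive")
    return 1.0 / (c * n_samples)


# reg=1.0 on 270 samples corresponds to C = 1/270 under these assumptions.
c_heart_scale = sklearn_c_from_reg(1.0, 270)
print(c_heart_scale)
```

The mapping only rescales the objective, so the minimizer is unchanged; the reported objective values are comparable only after accounting for that scale factor.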
[jira] [Updated] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
[ https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shuo Xiang updated SPARK-13029:
---
    Description:
This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has a column whose values are all identical, the resulting model is not correct. Specifically, that column always gets a weight of 0 because of a special check inside the code. However, the correct solution, which is unique for L2-regularized logistic regression, usually has a non-zero weight.

I used the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of ones (available in the PR). The resulting data was run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools:
- libsvm
- scikit-learn
- sparkml

(Notice that libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270.)

The first two reach an objective value of 0.7275 and give the solution vector:
[0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454, 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409].

Spark produces an objective value of 0.7278 and gives the solution vector:
[0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0]

Notice the last element of the weight vector is 0.

  was:
This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has a column whose values are all identical, the resulting model is not correct. Specifically, that column always gets a weight of 0 because of a special check inside the code. However, the correct solution, which is unique for L2-regularized logistic regression, usually has a non-zero weight.

I used the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html, also available in the PR) and manually augmented the data matrix with a column of ones. The resulting data was run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools:
- libsvm
- scikit-learn
- sparkml

(Notice that libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270.)

The first two reach an objective value of 0.7275 and give the solution vector:
[0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454, 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409].

Spark produces an objective value of 0.7278 and gives the solution vector:
[0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0]

Notice the last element of the weight vector is 0.
[jira] [Updated] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false
[ https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shuo Xiang updated SPARK-13029:
---
    Description:
This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has a column whose values are all identical, the resulting model is not correct. Specifically, that column always gets a weight of 0 because of a special check inside the code. However, the correct solution, which is unique for L2-regularized logistic regression, usually has a non-zero weight.

I used the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html, also available in the PR) and manually augmented the data matrix with a column of ones. The resulting data was run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools:
- libsvm
- scikit-learn
- sparkml

(Notice that libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270.)

The first two reach an objective value of 0.7275 and give the solution vector:
[0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454, 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409].

Spark produces an objective value of 0.7278 and gives the solution vector:
[0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0]

Notice the last element of the weight vector is 0.

  was:
This is a bug that appears while fitting a Logistic Regression model with `.setStandardization(false)` and `setFitIntercept(false)`. If the data matrix has a column whose values are all identical, the resulting model is not correct. Specifically, that column always gets a weight of 0 because of a special check inside the code. However, the correct solution, which is unique for L2-regularized logistic regression, usually has a non-zero weight.

I used the heart_scale data (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and manually augmented the data matrix with a column of ones. The resulting data was run with reg=1.0, max_iter=1000, tol=1e-9 on the following tools:
- libsvm
- scikit-learn
- sparkml

(Notice that libsvm and scikit-learn use a slightly different formulation, so their regularizer is equivalently set to 1/270.)

The first two reach an objective value of 0.7275 and give the solution vector:
[0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454, 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 0.1801661775839843, -0.01248615347419409].

Spark produces an objective value of 0.7278 and gives the solution vector:
[0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0]

Notice the last element of the weight vector is 0.