GitHub user holdenk opened a pull request: https://github.com/apache/spark/pull/10788

    [SPARK-7780][MLLIB] Intercept in LogisticRegressionWithLBFGS should not be regularized

    The intercept in logistic regression represents a prior on categories, which should not be regularized. In MLlib, regularization is handled through Updater, and the Updater penalizes all components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib, since the majority of users are still using the MLlib API. Note that both implementations apply feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.

    Previously partially reviewed at https://github.com/apache/spark/pull/6386#issuecomment-168781424; re-opening for @dbtsai to review.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10788.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10788

----
commit a529c013fa722748cbd1d3878e4ea3bed5b15181
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-22T20:54:59Z

    document plans

commit f9e26350d15d7d36b75ece4f4718797dbe2a0830
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-22T22:53:29Z

    Some progress.

commit 7ebbd566e20923efc32dee1cfcf12ea315259e30
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-22T23:16:18Z

    Keep track of the number of requested classes so that if its more than 2 we use the legacy implementation.
    Also allow pass through of initialWeights

commit ef2a9b0f5b6cb2e971c2e5371f3394b4dec64574
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-22T23:48:06Z

    Expose a train on instances method within Spark, use numOfLinearPredictors instead of keeping track of class variable, pass through persistence information

commit 407491e38b1a5834d26a137ab20829a3d96f5142
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-24T01:14:04Z

    tests are fun

commit e02bf3a9688d1efa2f3da60b3d9f27911b04955d
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-24T07:42:13Z

    Start updating the tests to run with different updaters.

commit 8517539d0e8829833968dcb7e47ad8ba20849cb1
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-24T08:00:36Z

    get the tests compiling

commit a619d42b821575afd8efa90f2a38edf9690eb0df
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-24T08:04:53Z

    style fixed

commit 4febcc32f524edadeb68dc674e2681a087ffaa38
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-24T08:13:23Z

    make the test method private

commit e8e03a13ba04c6b3100e290a5c435959c2f01912
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-24T20:16:13Z

    CR feedback, pass RDD of Labeled points to ml implemetnation. Also from tests require that feature scaling is turned on to use ml implementation.

commit 38a024bd9a36e83ef8005a5f2af8a4dd44f6760e
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-25T07:24:21Z

    Convert it to a df and use set for the inital params

commit 478b8c5d5ff20478dc4ba913b0c77172e0abdfff
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-25T20:06:57Z

    Handle non-dense weights

commit 08589f58b81bc1e6099b425f86226053c5b6a360
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-26T03:39:54Z

    CR feedback: make the setInitialWeights function private, don't mess with the weights when they are user supploed, validate that the user supplied weights are reasonable.

commit f40c401496ae1e6cc7b39db820fea194d42c25c5
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-05-26T04:19:46Z

    style fix up

commit f35a16aa8110a33c32959db674908d145be6e97f
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-06-02T23:29:11Z

    Copy the number of iterations, convergence tolerance, and if we are fitting an intercept from mllib to ml when training lbfgs model using ml code

commit 4d431a358074f5245abcbc95af3e2bdf75b4f21d
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-06-03T00:39:48Z

    scala style check issue

commit 7e4192849efc6d282633159a15c7dd41376aa1a3
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-06-03T07:30:48Z

    Only the weights if we need to.

commit ed351ffdf862994389b41284f95aa148c6550f41
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-06-03T19:39:56Z

    Use appendBias for adding intercept to initial weights , fix generateInitialWeights

commit 3ac02d72cab72b35b7cc76c50d7088d4b98bfd9d
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-06-08T20:20:19Z

    Merge in master

commit d1ce12ba45f12d93b962ffd560242757eda739c2
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-07-09T20:13:21Z

    Merge in master

commit 8ca0fa927bd2773ceb4ccf740445058ead706f7a
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-08-28T21:57:51Z

    attempt to merge in master

commit 6f66f2cbc7d80335bfb0e2e5b8b430930206d06f
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-10-01T23:05:01Z

    Merge in master (again)

commit 0cedd50368eeda594eafdb9500ed162ff33f2e25
Author: Holden Karau <hol...@pigscanfly.ca>
Date:   2015-10-02T01:44:08Z

    Fix compile error after simple merge

commit 2bf289b2ab92ff9da742d22e1feda0b57f8a796c
Author: Holden Karau <hol...@us.ibm.com>
Date:   2015-12-30T18:41:30Z

    Merge branch 'master' into SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized

commit d7a26318be962eede7d6fa0792f1f4d72178dc8d
Author: Holden Karau <hol...@us.ibm.com>
Date:   2016-01-16T03:21:04Z

    Merge in master

commit b0fe1e68bf8e7fc13cc845db90e7eb27729545d9
Author: Holden Karau <hol...@us.ibm.com>
Date:   2016-01-16T03:24:08Z

    scala style import order fix

commit 827dcdec09414c5b25a66be359c4d651a9e18ee6
Author: Holden Karau <hol...@us.ibm.com>
Date:   2016-01-16T06:24:33Z

    Import ordering

----
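
To make the point in the pull request description concrete (penalizing the intercept shrinks the class prior, and with lambda equal to zero both variants agree), here is a minimal sketch in plain Python. It is not Spark code and is not taken from this patch: the names `fit`, `DATA`, `reg_param`, and `regularize_intercept` are hypothetical, the optimizer is plain gradient descent rather than L-BFGS, and no feature scaling is applied.

```python
import math

# Toy, imbalanced 1-D data set (hypothetical): six positives, three negatives.
DATA = [([-0.3], 1), ([0.1], 1), ([0.2], 1), ([0.5], 1), ([0.8], 1), ([1.0], 1),
        ([-1.0], 0), ([-0.5], 0), ([0.4], 0)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(data, reg_param, regularize_intercept, steps=2000, lr=0.5):
    """Binary logistic regression with an L2 penalty, via gradient descent.

    Illustrative only: `regularize_intercept=True` mimics an updater that
    penalizes every component, `False` leaves the intercept unpenalized.
    Returns (weights, intercept).
    """
    dim = len(data[0][0])
    n = len(data)
    weights = [0.0] * dim
    intercept = 0.0
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        # Gradient of the mean logistic loss.
        for x, y in data:
            margin = sum(w * xi for w, xi in zip(weights, x)) + intercept
            err = sigmoid(margin) - y
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        # The L2 penalty always applies to the feature weights ...
        for j in range(dim):
            grad_w[j] += reg_param * weights[j]
        # ... but reaches the intercept only when explicitly requested.
        if regularize_intercept:
            grad_b += reg_param * intercept
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        intercept -= lr * grad_b
    return weights, intercept
```

Comparing `fit(DATA, 1.0, True)` with `fit(DATA, 1.0, False)` on this toy data shows the penalized intercept pulled toward zero, away from the log-odds of the class imbalance, while `fit(DATA, 0.0, True)` and `fit(DATA, 0.0, False)` return the same parameters.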