[ https://issues.apache.org/jira/browse/SPARK-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
DB Tsai updated SPARK-7685:
---------------------------
    Summary: Handle highly imbalanced data and apply weights to different samples in Logistic Regression  (was: Handle high imbalanced data or apply weights to different samples in Logistic Regression)

> Handle highly imbalanced data and apply weights to different samples in
> Logistic Regression
> -----------------------------------------------------------------------
>
>                 Key: SPARK-7685
>                 URL: https://issues.apache.org/jira/browse/SPARK-7685
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: DB Tsai
>
> In a fraud-detection dataset, almost all the samples are negative and only
> a couple of them are positive. Highly imbalanced data of this kind biases
> the model toward the negative class, resulting in poor performance.
> scikit-learn provides a correction that lets users over-/undersample the
> samples of each class according to given weights; in auto mode, it selects
> weights inversely proportional to the class frequencies in the training
> set. This can be done more efficiently by multiplying the weights into the
> loss and gradient instead of actually over-/undersampling the training
> dataset, which is very expensive.
> http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
> On the other hand, some training data may be more important: for example,
> training samples from tenured users may matter more than training samples
> from new users. We should be able to provide an additional "weight: Double"
> field in LabeledPoint to weight samples differently in the learning
> algorithm.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
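The core idea above (multiply per-sample weights into the loss and gradient rather than physically over-/undersampling) can be sketched in a few lines of NumPy. This is not the Spark implementation; the function names are hypothetical, and the `balanced_class_weights` formula follows scikit-learn's `class_weight='balanced'` convention, n_samples / (n_classes * count_per_class).

```python
import numpy as np

def balanced_class_weights(y):
    """Class weights inversely proportional to class frequency:
    n_samples / (n_classes * count_c), as in scikit-learn's
    class_weight='balanced'. Returns {class_label: weight}."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes, weights))

def weighted_logistic_loss_grad(coef, X, y, sample_weight):
    """Weighted logistic loss and gradient for labels y in {0, 1}.

    Each sample's contribution is scaled by sample_weight, so rare
    (or important) samples can count more without duplicating rows."""
    z = X @ coef
    p = 1.0 / (1.0 + np.exp(-z))          # predicted P(y = 1)
    eps = 1e-12                            # guard against log(0)
    loss = -np.sum(sample_weight *
                   (y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
    grad = X.T @ (sample_weight * (p - y))
    return loss, grad
```

Per-sample weights subsume the class-weight correction: setting each sample's weight to its class's balanced weight handles imbalance, while any other scheme (e.g. up-weighting tenured users) plugs into the same loss and gradient unchanged, which is what a `weight` field in `LabeledPoint` would enable.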