[jira] [Resolved] (SPARK-7685) Handle high imbalanced data and apply weights to different samples in Logistic Regression

Xiangrui Meng (JIRA) Tue, 15 Sep 2015 15:48:06 -0700

     [ https://issues.apache.org/jira/browse/SPARK-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Xiangrui Meng resolved SPARK-7685.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

Issue resolved by pull request 7884
[https://github.com/apache/spark/pull/7884]

> Handle high imbalanced data and apply weights to different samples in 
> Logistic Regression
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-7685
>                 URL: https://issues.apache.org/jira/browse/SPARK-7685
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>            Priority: Critical
>             Fix For: 1.6.0
>
>
> In a fraud detection dataset, almost all the samples are negative while only
> a couple of them are positive. This type of highly imbalanced data will bias
> the models toward the negative class, resulting in poor performance.
> scikit-learn provides a correction that lets users over-/undersample the
> samples of each class according to the given weights. In auto mode, it
> selects weights inversely proportional to class frequencies in the training
> set. This can be done more efficiently by multiplying the weights into the
> loss and gradient instead of actually over-/undersampling the training
> dataset, which is very expensive.
> http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
> On the other hand, some of the training data may be more important, such as
> the training samples from tenured users, while the training samples from new
> users may be less important. We should be able to provide an additional
> "weight: Double" field in the LabeledPoint to weight them differently in the
> learning algorithm.
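
For illustration only, here is a minimal sketch of the idea in the description: class weights chosen inversely proportional to class frequency, and per-sample weights multiplied into the logistic loss and gradient rather than duplicating rows. It is plain Python, not the code from the pull request. The function names are hypothetical, the normalization n / (k * count) is an assumption (one common convention for "balanced" weights), and the intercept and regularization are left out for brevity.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count).

    Assumed normalization; other scalings give the same relative weights.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_loss_and_gradient(coef, features, labels, weights):
    """Weighted binary logistic loss and its gradient (labels are 0 or 1)."""
    loss = 0.0
    grad = [0.0] * len(coef)
    for x, y, w in zip(features, labels, weights):
        margin = sum(c * xi for c, xi in zip(coef, x))
        p = sigmoid(margin)
        # The sample weight scales this sample's loss term ...
        loss += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        # ... and its gradient term, so no rows need to be duplicated.
        for j, xi in enumerate(x):
            grad[j] += w * (p - y) * xi
    return loss, grad


# Hypothetical toy data: three negatives and one positive.
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]]
y = [0, 0, 0, 1]
class_weights = balanced_class_weights(y)
sample_weights = [class_weights[label] for label in y]
loss, grad = weighted_loss_and_gradient([0.1, -0.2], X, y, sample_weights)
```

With integer weights, this gives the same loss and gradient as physically repeating each row that many times, which is why the weighting can stand in for oversampling without enlarging the training set. The same `weights` argument could also carry per-sample importance, such as larger values for samples from long-standing users.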



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
