[ https://issues.apache.org/jira/browse/SPARK-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
DB Tsai updated SPARK-7685:
---------------------------
    Summary: Handle highly imbalanced data and apply weights to different samples in Logistic Regression  (was: Handle high imbalanced data or apply weights to different samples in Logistic Regression)

> Handle highly imbalanced data and apply weights to different samples in
> Logistic Regression
> -----------------------------------------------------------------------
>
>                 Key: SPARK-7685
>                 URL: https://issues.apache.org/jira/browse/SPARK-7685
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: DB Tsai
>
> In a fraud-detection dataset, almost all the samples are negative and only
> a couple of them are positive. Highly imbalanced data of this kind biases
> the model toward the negative class, resulting in poor performance.
> scikit-learn provides a correction that lets users over-/undersample the
> samples of each class according to given weights; in auto mode, it selects
> weights inversely proportional to the class frequencies in the training
> set. This can be done more efficiently by multiplying the weights into the
> loss and gradient instead of actually over-/undersampling the training
> dataset, which is very expensive.
> http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
> On the other hand, some training data may be more important: for example,
> training samples from tenured users may matter more than training samples
> from new users. We should be able to provide an additional "weight: Double"
> field in LabeledPoint to weight samples differently in the learning
> algorithm.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
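The core idea above (multiply per-sample weights into the loss and gradient rather than physically over-/undersampling) can be sketched in a few lines of NumPy. This is not the Spark implementation; the function names are hypothetical, and the `balanced_class_weights` formula follows scikit-learn's `class_weight='balanced'` convention, n_samples / (n_classes * count_per_class).

```python
import numpy as np

def balanced_class_weights(y):
    """Class weights inversely proportional to class frequency:
    n_samples / (n_classes * count_c), as in scikit-learn's
    class_weight='balanced'. Returns {class_label: weight}."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes, weights))

def weighted_logistic_loss_grad(coef, X, y, sample_weight):
    """Weighted logistic loss and gradient for labels y in {0, 1}.

    Each sample's contribution is scaled by sample_weight, so rare
    (or important) samples can count more without duplicating rows."""
    z = X @ coef
    p = 1.0 / (1.0 + np.exp(-z))          # predicted P(y = 1)
    eps = 1e-12                            # guard against log(0)
    loss = -np.sum(sample_weight *
                   (y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
    grad = X.T @ (sample_weight * (p - y))
    return loss, grad
```

Per-sample weights subsume the class-weight correction: setting each sample's weight to its class's balanced weight handles imbalance, while any other scheme (e.g. up-weighting tenured users) plugs into the same loss and gradient unchanged, which is what a `weight` field in `LabeledPoint` would enable.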