spark ml : auc on extreme distributed data

Zhiliang Zhu Sun, 14 Aug 2016 21:12:07 -0700

Hi All, 
I have a dataset of around 1,000,000 rows; 97% of them are negative class
and 3% are positive class. I applied the Random Forest algorithm to build a
model and predict on the testing data.
For the data preparation:
i. first, randomly split all the data into training data and testing data
at 0.7 : 0.3
ii. leave the testing data unchanged, so its negative-to-positive class ratio
is still 97:3
iii. bring the training data to a 50:50 negative-to-positive class ratio by
sampling within each class
iv. train the RF model on the training data and predict on the testing data
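
For reference, steps i to iii can be sketched roughly as below. This is plain Python rather than Spark, just to keep it self-contained. The function name split_and_balance and the (features, label) row layout are made up for illustration, and it assumes the 50:50 ratio is reached by downsampling the negative class (oversampling the positives would be another option).

```python
import random

def split_and_balance(rows, train_frac=0.7, seed=42):
    """rows: list of (features, label) tuples with label in {0, 1}.

    Hypothetical helper: returns (balanced_train, untouched_test).
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)

    # Step i: random 0.7 : 0.3 split.
    cut = int(len(shuffled) * train_frac)
    train, test = shuffled[:cut], shuffled[cut:]

    # Step ii: the test set is left as is, so it keeps roughly the 97:3 ratio.

    # Step iii (assumption): downsample negatives to the number of positives.
    pos = [r for r in train if r[1] == 1]
    neg = [r for r in train if r[1] == 0]
    neg_sampled = rng.sample(neg, min(len(pos), len(neg)))

    balanced = pos + neg_sampled
    rng.shuffle(balanced)
    return balanced, test

if __name__ == "__main__":
    # Toy data: 10,000 rows with 3% positives.
    rows = [((i,), 1 if i < 300 else 0) for i in range(10000)]
    train, test = split_and_balance(rows)
    print(len(train), sum(y for _, y in train), len(test))
```

Only the training set is rebalanced here; the AUC is still measured on the test set with its original class ratio.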
By modifying algorithm parameters and feature work (PCA etc.), it seems that
the AUC on the testing data is always above 0.8, or much higher ...
Then I got confused... It seems that the model or AUC depends a lot on the
original data distribution... In effect, I would like to know: for this data
distribution, what would the AUC be for a random guess? What would the AUC be
for any kind of data distribution?
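
As a sanity check on the random-guess question, here is a small simulation sketch in plain Python (not Spark). AUC is a ranking metric, so scores that are independent of the label should give about 0.5 whatever the class ratio is. The helper names auc and random_guess_auc are made up for illustration; the AUC is computed with the rank-sum (Mann-Whitney U) formula.

```python
import random

def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formula, averaging ranks on ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank for the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def random_guess_auc(pos_rate, n=100_000, seed=0):
    """AUC of uniformly random scores on data with the given positive rate."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < pos_rate else 0 for _ in range(n)]
    scores = [rng.random() for _ in range(n)]  # independent of the labels
    return auc(labels, scores)

if __name__ == "__main__":
    # Compare a 3% positive rate (as in the test data) with a balanced one.
    for rate in (0.03, 0.5):
        print(rate, round(random_guess_auc(rate), 3))
```

Both rates should print a value close to 0.5, up to sampling noise.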
Thanks in advance~~
