Re: spark ml : auc on extreme distributed data

2016-08-15 Thread Sean Owen
Class imbalance can be an issue for algorithms, but decision forests should in general cope reasonably well with imbalanced classes. By default, however, positive and negative classes are treated 'equally', and that may not reflect reality in some cases. Upsampling the under-represented case is a
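
The preview above is cut off, but the upsampling idea it mentions can be sketched roughly as follows. This is only a minimal illustration in plain Python, not Spark code and not necessarily what was suggested in the full message. The helper name `upsample_minority`, the `label_of` parameter, and the toy rows are assumptions made up for this example; the 97% / 3% split is taken from the original question.

```python
import random
from collections import Counter


def upsample_minority(rows, label_of, seed=0):
    """Duplicate rows of the smaller classes (sampling with replacement)
    until every class has as many rows as the largest class.

    `upsample_minority` and `label_of` are hypothetical names for this sketch.
    """
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    # Every class is brought up to the size of the largest one.
    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing rows at random, with replacement.
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


# Toy data mirroring the 97% negative / 3% positive split from the question.
rows = [(i, 0) for i in range(970)] + [(i, 1) for i in range(970, 1000)]
balanced = upsample_minority(rows, label_of=lambda r: r[1])
counts = Counter(label for _, label in balanced)
print(counts)
```

If something like this is used, it would normally be applied to the training split only, so that AUC is still measured on test data with the original class distribution.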

spark ml : auc on extreme distributed data

2016-08-14 Thread Zhiliang Zhu
Hi All, I have a lot of data, around 1,000,000 rows: 97% of them are in the negative class and 3% are in the positive class. I applied the Random Forest algorithm to build the model and predict on the testing data. For the data preparation, i. firstly randomly split all the data as training data