Any help on the below?

On 19-Jan-2018 7:12 PM, "Aakash Basu" <aakash.spark....@gmail.com> wrote:
> Hi all,
>
> I am totally new to the ML APIs. I am trying to get the *ROC_Curve* for
> model evaluation in both *scikit-learn* and *PySpark MLlib*, but I cannot
> find any API for ROC-curve calculation for binary classification in Spark
> MLlib.
>
> Both snippets below live in a wrapper class that builds the respective
> DataFrame (two columns, as attached) from the source data.
>
> I want to reproduce the scikit-learn roc_curve result in Spark. Is there
> any MLlib API that achieves the same?
>
> Python scikit-learn code:
>
>     def roc(self, y_true, y_pred):
>         df_a = self._df.copy()
>         values_1 = df_a[y_true].dropna().values.astype(int)
>         values_2 = df_a[y_pred].dropna().values.astype(int)
>         # roc_curve returns (fpr, tpr, thresholds); note that fpr is
>         # 1 - specificity, not specificity itself
>         fpr, tpr, thresholds = metrics.roc_curve(values_1, values_2,
>                                                  pos_label=2)
>         # area_under_roc = metrics.roc_auc_score(values_1, values_2)
>         print(tpr, fpr)
>         return tpr, fpr
>
> Result:
>
>     [ 0.          0.34138342  0.67412045  1.        ]
>     [ 0.          0.33373458  0.67378875  1.        ]
>
> PySpark code:
>
>     def roc(self, y_true, y_pred):
>         print('using pyspark df')
>         df_a = self._df
>         # collect the (label, prediction) columns to the driver
>         rows = df_a.select(y_true, y_pred).toPandas().values
>         # BinaryClassificationMetrics expects an RDD of
>         # (score, label) pairs, so the prediction goes first
>         pairs = [(float(pred), float(label)) for label, pred in rows]
>         new_rdd = self._sc.parallelize(pairs)
>         metrics = BinaryClassificationMetrics(new_rdd)
>         roc_calc = metrics.areaUnderROC
>         print(roc_calc, type(roc_calc))
>         return roc_calc
>
> This gives me only the area under the ROC, not the curve itself.
> Please help.
>
> Thanks,
> Aakash.
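A note on the question above: in the Python API of these Spark versions, pyspark.mllib.evaluation.BinaryClassificationMetrics exposes only areaUnderROC and areaUnderPR; the roc() method of the Scala BinaryClassificationMetrics is not wrapped in PySpark. One workaround is to collect the (label, score) pairs to the driver and compute the curve points by hand. The sketch below is plain NumPy, not an MLlib API, and the roc_points helper is a name made up here for illustration:

```python
import numpy as np

def roc_points(labels, scores):
    """Compute (fpr, tpr) pairs, one per example, scanning thresholds
    from the highest score down.

    labels: 0/1 ground-truth values
    scores: predicted scores (higher = more likely positive)
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    order = np.argsort(-scores)      # sort descending by score
    labels = labels[order]
    tps = np.cumsum(labels)          # true positives at each cut-off
    fps = np.cumsum(1 - labels)      # false positives at each cut-off
    tpr = tps / tps[-1]              # sensitivity
    fpr = fps / fps[-1]              # 1 - specificity
    return fpr, tpr

# toy example: two positives scored high, two negatives scored low
fpr, tpr = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
```

Summing trapezoids over the resulting (fpr, tpr) points then recovers the AUC, which should agree with metrics.areaUnderROC up to tie handling; for a proper per-threshold curve with tied scores collapsed, sklearn.metrics.roc_curve on the collected pairs remains the simpler route.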