Hi,

If I understand the RandomForest model in the ML Pipeline implementation (the ml package) correctly, I first have to run my outcome label variable through the StringIndexer, even if my labels are already numeric. The StringIndexer then converts the labels into numeric indices ordered by label frequency, so the most frequent label gets index 0.
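To make the behaviour concrete, here is a minimal plain-Python sketch of frequency-based indexing as I understand it. It is an imitation, not Spark code: `frequency_index` is a hypothetical helper, the sample labels are made up, and breaking ties by the label's string form is an assumption.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to an index, most frequent label first.

    Plain-Python imitation of frequency-based indexing, not Spark's code.
    Ties are broken by the label's string form (an assumption here).
    """
    counts = Counter(str(label) for label in labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(idx) for idx, lab in enumerate(ordered)}


# Hypothetical binary outcome where 1 is the more frequent class.
labels = [1, 1, 1, 0, 1, 0]

mapping = frequency_index(labels)
indexed = [mapping[str(label)] for label in labels]

# Inverting the mapping recovers the original labels from the indices.
inverse = {idx: lab for lab, idx in mapping.items()}
restored = [int(inverse[idx]) for idx in indexed]

print(mapping)   # label -> index
print(indexed)   # labels after indexing
print(restored)  # labels after mapping back
```

With this sample the more frequent label 1 lands on index 0.0 and label 0 lands on index 1.0, so the indexed column is the mirror image of the original one.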
This could create situations where, if I'm classifying binary outcomes and my original labels are simply 0 and 1, the StringIndexer actually flips my labels: if my original 1s are more frequent, the 1s become 0s and the 0s become 1s. This transformation would then carry over to the predictions. In the old mllib implementation, the RF does not require the labels to be changed, and I could use 0/1 labels without worrying about them being transformed.

I was wondering:

1. Why is this the default implementation for the Pipeline RF? This seems like it could cause a lot of confusion in cases like the one I outlined above.
2. Is there a way to avoid this, either by controlling how the indices are created in StringIndexer or by bypassing StringIndexer altogether?
3. If 2 is not possible, is there an easy way to see how my original labels were mapped onto the indices, so that I can convert the predictions back to the original labels rather than the transformed ones? I suppose I could do this by counting the original labels and mapping by frequency, but it seems like there should be a more straightforward way to get it out of the StringIndexer.

Thanks!