The ML package aims to provide a machine learning Pipeline API that makes building machine learning workflows more efficient. Keeping StringIndexer as a separate stage is more general, because it can then be chained with any kind of Estimator. I think we can make the StringIndexer step and its reverse process automatic in the future. In the meantime, if you want to get your original labels back, you can use IndexToString.
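To make the label flip from the question below concrete, here is a rough plain-Python sketch of frequency-based indexing and its reverse. It only mimics the idea and does not use the Spark API. The function names (`fit_label_index`, `to_indices`, `to_labels`) are made up for illustration, and the tie-breaking rule is my own assumption.

```python
from collections import Counter

def fit_label_index(labels):
    """Order distinct labels by descending frequency; list position = index."""
    counts = Counter(labels)
    # Assumption: ties are broken by the label's string form, purely so the
    # result is deterministic in this sketch.
    return sorted(counts, key=lambda lab: (-counts[lab], str(lab)))

def to_indices(labels, ordered):
    """Forward step: replace each label with its frequency-based index."""
    lookup = {lab: float(i) for i, lab in enumerate(ordered)}
    return [lookup[lab] for lab in labels]

def to_labels(indices, ordered):
    """Reverse step: map indices (e.g. predictions) back to original labels."""
    return [ordered[int(i)] for i in indices]

original = [1, 1, 1, 0, 0]                # label 1 is the more frequent one
ordered = fit_label_index(original)       # most frequent label gets index 0
indexed = to_indices(original, ordered)   # 1s become 0.0 and 0s become 1.0
restored = to_labels(indexed, ordered)    # back to the original labels
```

The ordered label list is all you need to revert predictions, so you do not have to recount the labels yourself. In a Pipeline, IndexToString plays the role of `to_labels` here.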

2015-08-11 6:56 GMT+08:00 pkphlam <pkph...@gmail.com>:
> Hi,
>
> If I understand the RandomForest model in the ML Pipeline implementation in
> the ml package correctly, I have to first run my outcome label variable
> through the StringIndexer, even if my labels are numeric. The StringIndexer
> then converts the labels into numeric indices based on frequency of the
> label.
>
> This could create situations where if I'm classifying binary outcomes where
> my original labels are simply 0 and 1, the StringIndexer may actually flip
> my labels such that 0s become 1s and 1s become 0s if my original 1s were
> more frequent. This transformation would then extend itself to the
> predictions. In the old mllib implementation, the RF does not require the
> labels to be changed and I could use 0/1 labels without worrying about them
> being transformed.
>
> I was wondering:
> 1. Why is this the default implementation for the Pipeline RF? This seems
> like it could cause a lot of confusion in cases like the one I outlined
> above.
> 2. Is there a way to avoid this by either controlling how the indices are
> created in StringIndexer or bypassing StringIndexer altogether?
> 3. If 2 is not possible, is there an easy way to see how my original labels
> mapped onto the indices so that I can revert the predictions back to the
> original labels rather than the transformed labels? I suppose I could do
> this by counting the original labels and mapping by frequency, but it seems
> like there should be a more straightforward way to get it out of the
> StringIndexer.
>
> Thanks!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-and-StringIndexer-in-pyspark-ML-Pipeline-tp24200.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.