Hi,

If I understand the RandomForest model in the ML Pipeline implementation in
the ml package correctly, I have to first run my outcome label variable
through the StringIndexer, even if my labels are numeric. The StringIndexer
then converts the labels into numeric indices based on frequency of the
label. 

This could create situations where if I'm classifying binary outcomes where
my original labels are simply 0 and 1, the StringIndexer may actually flip
my labels such that 0s become 1s and 1s become 0s if my original 1s were
more frequent. This transformation would then extend itself to the
predictions. In the old mllib implementation, the RF does not require the
labels to be changed and I could use 0/1 labels without worrying about them
being transformed.

I was wondering:
1. Why is this the default implementation for the Pipeline RF? This seems
like it could cause a lot of confusion in cases like the one I outlined
above.
2. Is there a way to avoid this by either controlling how the indices are
created in StringIndexer or bypassing StringIndexer altogether?
3. If 2 is not possible, is there an easy way to see how my original labels
mapped onto the indices so that I can revert the predictions back to the
original labels rather than the transformed labels? I suppose I could do
this by counting the original labels and mapping by frequency, but it seems
like there should be a more straightforward way to get it out of the
StringIndexer.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Random-Forest-and-StringIndexer-in-pyspark-ML-Pipeline-tp24200.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Reply via email to