The ML package aims to provide machine learning pipelines that make
users' machine learning work more efficient.
Chaining StringIndexer with an Estimator is the more general design, since
it works with any kind of Estimator.
I think we can make StringIndexer and the reverse process automatic in the
future.
If you want to recover your original labels, you can use IndexToString.
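
To make the mapping concrete, here is a minimal pure-Python sketch of
frequency-based label indexing and its reversal. It only mimics the behaviour
described in this thread and is not Spark's implementation. The helper names
(fit_label_index, to_index, to_label) are hypothetical, and the alphabetical
tie-break is an assumption. In Spark itself, the fitted StringIndexerModel
exposes the ordered labels through its `labels` attribute, and IndexToString
accepts that list through its `labels` parameter.

```python
from collections import Counter

def fit_label_index(labels):
    """Return the distinct labels ordered by descending frequency.

    The position of a label in the returned list is its index.
    """
    counts = Counter(str(label) for label in labels)
    # Most frequent label first; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ranked]

def to_index(ordered, labels):
    """Map original labels to float indices (hypothetical helper)."""
    lookup = {label: float(i) for i, label in enumerate(ordered)}
    return [lookup[str(label)] for label in labels]

def to_label(ordered, indices):
    """Map indices back to the original labels (hypothetical helper)."""
    return [ordered[int(i)] for i in indices]

raw = [1, 1, 1, 0, 1, 0]               # 1 is more frequent than 0
ordered = fit_label_index(raw)         # label "1" gets index 0.0
indexed = to_index(ordered, raw)       # the 0/1 labels come out flipped
restored = to_label(ordered, indexed)  # reversing recovers the originals
```

Because the ordered label list is kept, predictions expressed as indices can
always be mapped back, even when the most frequent label is not 0.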

2015-08-11 6:56 GMT+08:00 pkphlam <pkph...@gmail.com>:

> Hi,
>
> If I understand the RandomForest model in the ML Pipeline implementation in
> the ml package correctly, I have to first run my outcome label variable
> through the StringIndexer, even if my labels are numeric. The StringIndexer
> then converts the labels into numeric indices based on frequency of the
> label.
>
> This could create situations where if I'm classifying binary outcomes where
> my original labels are simply 0 and 1, the StringIndexer may actually flip
> my labels such that 0s become 1s and 1s become 0s if my original 1s were
> more frequent. This transformation would then extend itself to the
> predictions. In the old mllib implementation, the RF does not require the
> labels to be changed and I could use 0/1 labels without worrying about them
> being transformed.
>
> I was wondering:
> 1. Why is this the default implementation for the Pipeline RF? This seems
> like it could cause a lot of confusion in cases like the one I outlined
> above.
> 2. Is there a way to avoid this by either controlling how the indices are
> created in StringIndexer or bypassing StringIndexer altogether?
> 3. If 2 is not possible, is there an easy way to see how my original labels
> mapped onto the indices so that I can revert the predictions back to the
> original labels rather than the transformed labels? I suppose I could do
> this by counting the original labels and mapping by frequency, but it seems
> like there should be a more straightforward way to get it out of the
> StringIndexer.
>
> Thanks!
