Since you use two steps (StringIndexer and OneHotEncoder) to encode the
categories into a Vector, I guess you want to decode the resulting vector back
into the original categories.
Suppose you have a DataFrame with a single column named "name" that holds
three categories: "b", "a", "c" (ranked by frequency). You can refer to the
following code snippet to encode and decode:

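import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// `spark` is the SparkSession (predefined in spark-shell).
val df = spark.createDataFrame(
  Seq("a", "b", "c", "b", "a", "b").map(Tuple1.apply)).toDF("name")

// Step 1: map each category to an index, most frequent first.
val si = new StringIndexer().setInputCol("name").setOutputCol("indexedName")
val siModel = si.fit(df)
val df2 = siModel.transform(df)

// Step 2: one-hot encode the index; dropLast = false keeps one slot per category.
val encoder = new OneHotEncoder()
  .setDropLast(false)
  .setInputCol("indexedName")
  .setOutputCol("encodedName")
// OneHotEncoder is a Transformer in Spark 2.x. In Spark 3.x it is an
// Estimator, so use encoder.fit(df2).transform(df2) there instead.
val df3 = encoder.transform(df2)
df3.show()

// Decode to get the original categories from the vector column's metadata.
val group = AttributeGroup.fromStructField(df3.schema("encodedName"))
val categories = group.attributes.get.map(_.name.get)
println(categories.mkString(","))
// Output: b,a,c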
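
If you need the category for each row rather than just the list of
categories, one option is to look up the position of the 1 in each vector in
that same metadata. The following is only a rough sketch, not a built-in Spark
API: decodeOneHot and the "decodedName" column are names I made up for
illustration, it reuses df3 from the snippet above, and it assumes
setDropLast(false) so that every category has its own slot (it does not handle
the all-zero vector that dropLast = true produces for the last category).

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper (not part of Spark): adds a string column holding
// the category that each one-hot vector in `vectorCol` stands for.
def decodeOneHot(df: DataFrame, vectorCol: String, outputCol: String): DataFrame = {
  // Category names are stored as ML attributes on the vector column.
  val labels: Array[String] = AttributeGroup
    .fromStructField(df.schema(vectorCol))
    .attributes.get
    .map(_.name.get)
  // With dropLast = false, the index of the largest entry is the category index.
  val toLabel = udf((v: Vector) => labels(v.argmax))
  df.withColumn(outputCol, toLabel(col(vectorCol)))
}

val decoded = decodeOneHot(df3, "encodedName", "decodedName")
decoded.select("name", "decodedName").show()
```

The "decodedName" column should then match "name" row by row. If you only
need to go back from "indexedName" rather than from the vector, the built-in
IndexToString transformer may be simpler.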


Thanks
Yanbo

2016-07-14 6:46 GMT-07:00 rachmaninovquartet <rachmaninovquar...@gmail.com>:

> or would it be common practice to just retain the original categories in
> another df?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Dense-Vectors-outputs-in-feature-engineering-tp27331p27337.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
