Since you use two steps (StringIndexer and OneHotEncoder) to encode categories into a Vector, I guess you want to decode the resulting vector back into the original categories. Suppose you have a DataFrame with a single column named "name" that contains three categories: "b", "a", "c" (ranked by frequency). You can refer to the following code snippet to encode and decode:

import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Example data: "b" appears three times, "a" twice, "c" once.
val df = spark.createDataFrame(Seq("a", "b", "c", "b", "a", "b").map(Tuple1.apply)).toDF("name")

// Step 1: map each category to an index, ordered by descending frequency.
val si = new StringIndexer().setInputCol("name").setOutputCol("indexedName")
val siModel = si.fit(df)
val df2 = siModel.transform(df)

// Step 2: one-hot encode the index. setDropLast(false) keeps a slot for every category.
// Note: OneHotEncoder is a Transformer in Spark 2.x; in Spark 3.x it is an Estimator,
// so there you would call encoder.fit(df2).transform(df2) instead.
val encoder = new OneHotEncoder()
  .setDropLast(false)
  .setInputCol("indexedName")
  .setOutputCol("encodedName")
val df3 = encoder.transform(df2)
df3.show()

// Decode to get the original categories from the ML attribute metadata
// attached to the encoded column.
val group = AttributeGroup.fromStructField(df3.schema("encodedName"))
val categories = group.attributes.get.map(_.name.get)
println(categories.mkString(","))
// Output: b,a,c

Thanks
Yanbo

2016-07-14 6:46 GMT-07:00 rachmaninovquartet <rachmaninovquar...@gmail.com>:

> or would it be common practice to just retain the original categories in
> another df?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Dense-Vectors-outputs-in-feature-engineering-tp27331p27337.html