Re: Dense Vectors outputs in feature engineering
Since you use two steps (StringIndexer and OneHotEncoder) to encode categories into a Vector, I guess you want to decode the resulting vector back into the original categories. Suppose you have a DataFrame with a single column named "name" that holds three categories: "b", "a", "c" (ranked by frequency). You can refer to the following code snippet to encode and decode:

    import org.apache.spark.ml.attribute.AttributeGroup
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val df = spark.createDataFrame(Seq("a", "b", "c", "b", "a", "b").map(Tuple1.apply)).toDF("name")

    // Map each category to an index, most frequent first.
    val si = new StringIndexer().setInputCol("name").setOutputCol("indexedName")
    val siModel = si.fit(df)
    val df2 = siModel.transform(df)

    // Keep all categories (no dropped last column) so every one has a slot.
    val encoder = new OneHotEncoder()
      .setDropLast(false)
      .setInputCol("indexedName")
      .setOutputCol("encodedName")
    val df3 = encoder.transform(df2)
    df3.show()

    // Decode to get the original categories.
    val group = AttributeGroup.fromStructField(df3.schema("encodedName"))
    val categories = group.attributes.get.map(_.name.get)
    println(categories.mkString(","))
    // Output: b,a,c

Thanks
Yanbo

2016-07-14 6:46 GMT-07:00 rachmaninovquartet:
> or would it be common practice to just retain the original categories in
> another df?
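The snippet in Yanbo's reply recovers the list of category names from the column metadata. To map every row back to its category, one possible approach is IndexToString applied to the indexed column. The sketch below is an illustration, not part of the original reply: it assumes the indexed column is still present in the DataFrame, and the output column name "decodedName" and the app name are made up for the example. Because transform only appends columns, the original "name" column also stays available next to the encoded ones.

```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.SparkSession

// Reuses the active session in spark-shell; otherwise starts a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("decode-sketch") // hypothetical name
  .getOrCreate()

val df = spark.createDataFrame(Seq("a", "b", "c", "b", "a", "b").map(Tuple1.apply)).toDF("name")

// StringIndexer stores the labels as metadata on the output column.
val indexed = new StringIndexer()
  .setInputCol("name")
  .setOutputCol("indexedName")
  .fit(df)
  .transform(df)

// IndexToString reads those labels back from the metadata of "indexedName".
val decoded = new IndexToString()
  .setInputCol("indexedName")
  .setOutputCol("decodedName") // hypothetical column name
  .transform(indexed)

decoded.show()
```

If only the one-hot vector survives (for example after dropping "indexedName"), the category list from the AttributeGroup would have to be combined with the position of the non-zero entry instead.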
Re: Dense Vectors outputs in feature engineering
or would it be common practice to just retain the original categories in another df?
Re: Dense Vectors outputs in feature engineering
Thanks Disha, that worked out well. Can you point me to an example of how to decode my feature vectors in the dataframe back into their categories?
Re: Dense Vectors outputs in feature engineering
Hi Ian,

You can create a dense vector of your features as follows:
- String Index your features.
- Invoke One Hot Encoding on them, which generates a sparse vector.
- If you wish to merge these features, use VectorAssembler (optional).
- After transforming the dataframe to return the sparse vector(s) (which you may or may not assemble), you can use Vectors.dense(vector.toArray) on either the individual One Hot features or the assembled sparse vector.

Hope this helps.

Cheers,
Disha
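The steps Disha lists could be sketched roughly as follows. This is a minimal illustration under assumptions, not code from the thread: the column names ("score", "nameIndex", "nameVec", "features", "denseFeatures"), the sample rows, and the app name are invented, and the densifying step is done with a UDF that calls toDense, which is one way to apply the conversion per row. It follows the Spark 2.x API used elsewhere in this thread, where OneHotEncoder is a plain transformer; on Spark 3.x the encoder is an estimator and needs fit(...) before transform(...).

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Reuses the active session in spark-shell; otherwise starts a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dense-sketch") // hypothetical name
  .getOrCreate()

// Hypothetical sample data: one categorical and one numeric feature.
val df = spark.createDataFrame(Seq(
  ("a", 1.0), ("b", 2.0), ("c", 3.0), ("b", 4.0)
)).toDF("name", "score")

// Step 1: string-index the categorical feature.
val indexed = new StringIndexer()
  .setInputCol("name")
  .setOutputCol("nameIndex")
  .fit(df)
  .transform(df)

// Step 2: one-hot encode the index (Spark 2.x transformer API, as in the thread).
val encoded = new OneHotEncoder()
  .setInputCol("nameIndex")
  .setOutputCol("nameVec")
  .transform(indexed)

// Step 3 (optional): merge the features into one vector column.
val assembled = new VectorAssembler()
  .setInputCols(Array("nameVec", "score"))
  .setOutputCol("features")
  .transform(encoded)

// Step 4: convert each row's vector to a dense vector.
val toDense = udf((v: Vector) => v.toDense)
val dense = assembled.withColumn("denseFeatures", toDense(col("features")))

dense.select("name", "features", "denseFeatures").show(false)
```

The same UDF can be applied to "nameVec" directly when the assembling step is skipped.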