Re: Dense Vectors outputs in feature engineering
Since you use two steps (StringIndexer and OneHotEncoder) to encode categories into a Vector, I guess you want to decode the resulting vector back into its original categories. Suppose you have a DataFrame with a single column named "name" holding three categories: "b", "a", "c" (ranked by frequency). You can refer to the following code snippet to encode and decode:

    import org.apache.spark.ml.attribute.AttributeGroup
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val df = spark.createDataFrame(Seq("a", "b", "c", "b", "a", "b").map(Tuple1.apply)).toDF("name")

    // Map each category to an index, most frequent first.
    val si = new StringIndexer().setInputCol("name").setOutputCol("indexedName")
    val siModel = si.fit(df)
    val df2 = siModel.transform(df)

    // Keep the last category so that every category has its own vector slot.
    // Note: from Spark 3.0 on, OneHotEncoder is an Estimator and needs fit() before transform().
    val encoder = new OneHotEncoder()
      .setDropLast(false)
      .setInputCol("indexedName")
      .setOutputCol("encodedName")
    val df3 = encoder.transform(df2)
    df3.show()

    // Decode to get the original categories.
    val group = AttributeGroup.fromStructField(df3.schema("encodedName"))
    val categories = group.attributes.get.map(_.name.get)
    println(categories.mkString(","))  // Output: b,a,c

Thanks
Yanbo

2016-07-14 6:46 GMT-07:00 rachmaninovquartet <rachmaninovquar...@gmail.com>:

> or would it be common practice to just retain the original categories in
> another df?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Dense-Vectors-outputs-in-feature-engineering-tp27331p27337.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
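For decoding each row rather than listing the categories per vector slot, one option is to reverse the StringIndexer step with IndexToString, which reads the labels from the metadata of the indexed column. The following is only a sketch under assumptions: it is written for spark-shell or a Scala script, the local SparkSession settings and the column name "decodedName" are made up for illustration, and it decodes the index column, not the one-hot vector itself.

```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("decode-sketch").getOrCreate()

val df = spark.createDataFrame(Seq("a", "b", "c", "b", "a", "b").map(Tuple1.apply)).toDF("name")

// Encode: category string -> numeric index (labels are stored in the column metadata).
val indexed = new StringIndexer()
  .setInputCol("name")
  .setOutputCol("indexedName")
  .fit(df)
  .transform(df)

// Decode: numeric index -> original category string, using that metadata.
val decoded = new IndexToString()
  .setInputCol("indexedName")
  .setOutputCol("decodedName") // hypothetical column name
  .transform(indexed)

decoded.show()
```

Because the index column travels with the DataFrame, this also avoids keeping the original categories in a separate DataFrame.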
Re: Dense Vectors outputs in feature engineering
or would it be common practice to just retain the original categories in another df?
Re: Dense Vectors outputs in feature engineering
Thanks Disha, that worked out well. Can you point me to an example of how to decode the feature vectors in my dataframe back into their categories?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dense-Vectors-outputs-in-feature-engineering-tp27331p27336.html
Re: Dense Vectors outputs in feature engineering
Hi Ian,

You can create a dense vector of your features as follows:

- String-index your features.
- Invoke one-hot encoding on them, which generates a sparse vector.
- If you wish to merge these features, use VectorAssembler (optional).
- After transforming the dataframe to return the sparse vector(s) (which you may or may not assemble), call Vectors.dense(vector.toArray) on either the individual one-hot features or the assembled sparse vector.

Hope this helps.

Cheers,
Disha

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dense-Vectors-outputs-in-feature-engineering-tp27331p27332.html
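The last step in the list above can be sketched as a UDF applied to the vector column. This is a minimal illustration, not a definitive implementation: the DataFrame is built by hand from sparse vectors as a stand-in for OneHotEncoder output, and the session settings and the column name "denseName" are assumptions made for the example.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("dense-sketch").getOrCreate()

// Stand-in for OneHotEncoder output: one sparse one-hot vector per row.
val encoded = spark.createDataFrame(Seq(
  Tuple1(Vectors.sparse(3, Array(1), Array(1.0))),
  Tuple1(Vectors.sparse(3, Array(0), Array(1.0))),
  Tuple1(Vectors.sparse(3, Array(2), Array(1.0)))
)).toDF("encodedName")

// Rebuild each vector from its full value array, which yields a dense vector.
val toDense = udf((v: Vector) => Vectors.dense(v.toArray))

val dense = encoded.withColumn("denseName", toDense(col("encodedName"))) // hypothetical column name
dense.show(false)
```

The same UDF can be applied to a VectorAssembler output column. Dense storage grows with the number of categories, so this is best suited to features with few distinct values.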
Dense Vectors outputs in feature engineering
Hi,

I'm trying to use the StringIndexer and OneHotEncoder in order to vectorize some of my features. Unfortunately, OneHotEncoder only returns sparse vectors. I can't find a way, much less an efficient one, to convert the columns generated by OneHotEncoder into dense vectors. I need this as I will eventually be doing some deep learning on my data, not something internal to Spark.

If I were to update OneHotEncoder to have a setDense option, is there much of a chance it would be accepted as a PR?

Since the first question seems unlikely, is there a way to turn a dataframe with sparse vector and index columns into separate columns, like the pandas get_dummies method: http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example or is there a better way to replicate the get_dummies functionality?

Thanks,
Ian

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dense-Vectors-outputs-in-feature-engineering-tp27331.html
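One way to approximate the get_dummies behaviour asked about above is to skip StringIndexer and OneHotEncoder and add one 0/1 column per distinct category with plain DataFrame functions. This is a different technique from the one discussed in the replies, shown only as a sketch: the session settings, the sample data, and the "name_" column prefix are assumptions, and the distinct values are collected to the driver, so it fits low-cardinality columns only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("dummies-sketch").getOrCreate()

val df = spark.createDataFrame(Seq("a", "b", "c", "b", "a", "b").map(Tuple1.apply)).toDF("name")

// Collect the distinct categories to the driver (fine for a handful of values).
val categories = df.select("name").distinct().collect().map(_.getString(0)).sorted

// Add one indicator column per category: 1 where the row matches, 0 otherwise.
val dummies = categories.foldLeft(df) { (acc, c) =>
  acc.withColumn(s"name_$c", when(col("name") === c, 1).otherwise(0)) // hypothetical prefix
}

dummies.show()
```

Each indicator column is an ordinary integer column, so the result can be exported to an external deep learning library without any vector conversion.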