Re: Dense Vectors outputs in feature engineering

2016-07-16 Thread Yanbo Liang
Since you use two steps (StringIndexer and OneHotEncoder) to encode
categories into a Vector, I guess you want to decode the resulting vector
back into the original categories.
Suppose you have a DataFrame with a single column named "name" that holds
three categories: "b", "a", "c" (ranked by frequency). You can refer to the
following code snippet to encode and decode:

import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = spark.createDataFrame(
  Seq("a", "b", "c", "b", "a", "b").map(Tuple1.apply)
).toDF("name")

// Index the categories by descending frequency: b -> 0, a -> 1, c -> 2.
val si = new StringIndexer().setInputCol("name").setOutputCol("indexedName")
val siModel = si.fit(df)
val df2 = siModel.transform(df)

// Keep the last category so that every category gets its own vector slot.
val encoder = new OneHotEncoder()
  .setDropLast(false)
  .setInputCol("indexedName")
  .setOutputCol("encodedName")
val df3 = encoder.transform(df2)
df3.show()

// Decode to get the original categories from the column metadata.
val group = AttributeGroup.fromStructField(df3.schema("encodedName"))
val categories = group.attributes.get.map(_.name.get)
println(categories.mkString(","))
// Output: b,a,c
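
To go one step further and decode each row, a minimal sketch could look like
the following. It is only an illustration under stated assumptions: the
`categories` array is hard-coded to the order printed above, the `encoded`
DataFrame is a hand-built stand-in for the "encodedName" column, and the
column name "decodedName" is made up for this example. It also relies on the
encoder having been run with setDropLast(false); with the last category
dropped, that category becomes an all-zero vector and the argmax lookup
would wrongly report index 0.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("decode-sketch")
  .getOrCreate()

// Assumption: category labels in index order, as recovered from the
// column metadata (b -> 0, a -> 1, c -> 2).
val categories = Array("b", "a", "c")

// Hypothetical stand-in for the one-hot "encodedName" column.
val encoded = spark.createDataFrame(Seq(
  Vectors.sparse(3, Array(1), Array(1.0)), // slot 1 -> "a"
  Vectors.sparse(3, Array(0), Array(1.0)), // slot 0 -> "b"
  Vectors.sparse(3, Array(2), Array(1.0))  // slot 2 -> "c"
).map(Tuple1.apply)).toDF("encodedName")

// The position of the single 1.0 entry is the category index.
val decode = udf { v: Vector => categories(v.argmax) }

val decoded = encoded.withColumn("decodedName", decode(col("encodedName")))
decoded.show(false)
```

If only the "indexedName" column needs to be reversed, Spark ML's
IndexToString transformer may be a simpler option, since it reads the labels
from the same column metadata.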


Thanks
Yanbo

2016-07-14 6:46 GMT-07:00 rachmaninovquartet <rachmaninovquar...@gmail.com>:

> or would it be common practice to just retain the original categories in
> another df?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Dense-Vectors-outputs-in-feature-engineering-tp27331p27337.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Dense Vectors outputs in feature engineering

2016-07-14 Thread rachmaninovquartet
or would it be common practice to just retain the original categories in
another df?






Re: Dense Vectors outputs in feature engineering

2016-07-14 Thread rachmaninovquartet
Thanks Disha, that worked out well. Can you point me to an example of how to
decode my feature vectors in the dataframe, back into their categories?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Dense-Vectors-outputs-in-feature-engineering-tp27331p27336.html



Re: Dense Vectors outputs in feature engineering

2016-07-13 Thread disha_dp
Hi Ian,
You can create a dense vector of your features as follows:

- String-index your features.
- Invoke one-hot encoding on them, which generates a sparse vector.
- If you wish to merge these features, use VectorAssembler (optional).
- After transforming the dataframe to return the sparse vector(s) (which you
may or may not have assembled), you can use Vectors.dense(vector.toArray) on
either the individual one-hot features or the assembled sparse vector.
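
The last step above can be sketched roughly as follows. This is only an
illustration, not a definitive implementation: the `sparseDf` DataFrame is a
hand-built stand-in for the output of the encoder or assembler, and the
column names "encodedName" and "denseName" are assumptions made for this
example.

```scala
import org.apache.spark.ml.linalg.{DenseVector, Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("dense-sketch")
  .getOrCreate()

// Hypothetical stand-in for a column of sparse one-hot vectors.
val sparseDf = spark.createDataFrame(Seq(
  Vectors.sparse(3, Array(0), Array(1.0)),
  Vectors.sparse(3, Array(2), Array(1.0))
).map(Tuple1.apply)).toDF("encodedName")

// Copy every vector into a dense representation, row by row.
val toDense = udf { v: Vector => Vectors.dense(v.toArray) }

val denseDf = sparseDf.withColumn("denseName", toDense(col("encodedName")))
denseDf.show(false)
```

Note that the dense copy stores every zero explicitly, so for high-cardinality
features the column can become much larger than its sparse counterpart.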

Hope this helps.

Cheers,
Disha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Dense-Vectors-outputs-in-feature-engineering-tp27331p27332.html



Dense Vectors outputs in feature engineering

2016-07-13 Thread rachmaninovquartet
Hi,

I'm trying to use the StringIndexer and OneHotEncoder, in order to vectorize
some of my features. Unfortunately, OneHotEncoder only returns sparse
vectors. I can't find a way, much less an efficient one, to convert the
columns generated by OneHotEncoder into dense vectors. I need this because I
will eventually be doing some deep learning on my data with a tool outside
of Spark.

If I were to update OneHotEncoder to have a setDense option, is there much
of a chance it would be accepted as a PR?

Since the first option seems unlikely, is there a way to turn a dataframe
with a sparse vector column and an index column into one column per
category, like the pandas get_dummies method:
http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example

or is there a better way to replicate the get_dummies functionality?
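
For reference, one possible way to approximate get_dummies is to add a 0/1
column per distinct category. The sketch below is only an illustration under
stated assumptions: the sample data, the column name "name", and the
"name_<category>" naming scheme are all made up for this example, and
collecting the distinct values to the driver is only reasonable when the
number of categories is small.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("dummies-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data with a single categorical column.
val df = Seq("a", "b", "c", "b", "a", "b").toDF("name")

// Collect the distinct categories to the driver (assumes low cardinality).
val categories = df.select("name").distinct().as[String].collect().sorted

// Add one indicator column per category: 1 if the row matches, else 0.
val dummies = categories.foldLeft(df) { (acc, c) =>
  acc.withColumn(s"name_$c", when(col("name") === c, 1).otherwise(0))
}
dummies.show()
```

Each call to withColumn adds a plain integer column, so the result can be
exported without any vector type involved.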

Thanks,

Ian





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Dense-Vectors-outputs-in-feature-engineering-tp27331.html