It has 4 categories:

a = 1 0 0
b = 0 0 0
c = 0 1 0
d = 0 0 1

--
Oleksiy Dyagilev

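To make the mapping above concrete, here is a minimal plain-Scala sketch (no Spark required) of one-hot encoding with an optional "drop last" step. The names `OneHotSketch` and `oneHot` are hypothetical and not part of Spark; the index assignment (a -> 0, c -> 1, d -> 2, b -> 3) is inferred from the output in the quoted message, where b is the all-zero row. Spark's `OneHotEncoder` has a `dropLast` parameter that defaults to true, so calling `.setDropLast(false)` on the encoder in the quoted code should give 4-element vectors, though I have not run it against this exact dataset.

```scala
// Plain-Scala sketch of one-hot encoding with an optional "drop last" step.
// OneHotSketch and oneHot are illustrative names, not Spark API.
object OneHotSketch {

  /** Encode a category index as a dense 0/1 vector.
    * With dropLast = true the vector has numCategories - 1 slots,
    * so the last index is represented by all zeros.
    */
  def oneHot(index: Int, numCategories: Int, dropLast: Boolean): Vector[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    val size = if (dropLast) numCategories - 1 else numCategories
    Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
  }

  def main(args: Array[String]): Unit = {
    // Assumed indices, inferred from the thread's output (b is the all-zero row).
    val indexOf = Map("a" -> 0, "c" -> 1, "d" -> 2, "b" -> 3)
    for (cat <- Seq("a", "b", "c", "d")) {
      val dropped = oneHot(indexOf(cat), indexOf.size, dropLast = true)
      val full = oneHot(indexOf(cat), indexOf.size, dropLast = false)
      println(s"$cat  dropLast=true: ${dropped.mkString(" ")}  dropLast=false: ${full.mkString(" ")}")
    }
  }
}
```

Dropping the last slot loses no information: the fourth category is still distinguishable as the all-zero vector. It also avoids a redundant column whose value is fully determined by the other three.
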
On Wed, Sep 7, 2016 at 10:42 AM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:

> Hi,
>
> Any help on the use case in the mail above?
>
> Regards,
> Rajesh
>
> On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> I am new to Spark ML and am trying to create a LabeledPoint from a
>> categorical dataset (example code from Spark). For this, I am using the
>> one-hot encoding <http://en.wikipedia.org/wiki/One-hot> feature. Below is
>> my code:
>>
>> val df = sparkSession.createDataFrame(Seq(
>>   (0, "a"),
>>   (1, "b"),
>>   (2, "c"),
>>   (3, "a"),
>>   (4, "a"),
>>   (5, "c"),
>>   (6, "d"))).toDF("id", "category")
>>
>> val indexer = new StringIndexer()
>>   .setInputCol("category")
>>   .setOutputCol("categoryIndex")
>>   .fit(df)
>>
>> val indexed = indexer.transform(df)
>>
>> indexed.select("category", "categoryIndex").show()
>>
>> val encoder = new OneHotEncoder()
>>   .setInputCol("categoryIndex")
>>   .setOutputCol("categoryVec")
>> val encoded = encoder.transform(indexed)
>>
>> encoded.select("id", "category", "categoryVec").show()
>>
>> *Output:*
>>
>> +---+--------+-------------+
>> | id|category|  categoryVec|
>> +---+--------+-------------+
>> |  0|       a|(3,[0],[1.0])|
>> |  1|       b|    (3,[],[])|
>> |  2|       c|(3,[1],[1.0])|
>> |  3|       a|(3,[0],[1.0])|
>> |  4|       a|(3,[0],[1.0])|
>> |  5|       c|(3,[1],[1.0])|
>> |  6|       d|(3,[2],[1.0])|
>> +---+--------+-------------+
>>
>> *Creating a LabeledPoint from the encoded DataFrame:*
>>
>> val data = encoded.rdd.map { x =>
>>   {
>>     val featureVector = Vectors.dense(
>>       x.getAs[org.apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
>>     val label = x.getAs[java.lang.Integer]("id").toDouble
>>     LabeledPoint(label, featureVector)
>>   }
>> }
>>
>> data.foreach { x => println(x) }
>>
>> *Output:*
>>
>> (0.0,[1.0,0.0,0.0])
>> (1.0,[0.0,0.0,0.0])
>> (2.0,[0.0,1.0,0.0])
>> (3.0,[1.0,0.0,0.0])
>> (4.0,[1.0,0.0,0.0])
>> (5.0,[0.0,1.0,0.0])
>> (6.0,[0.0,0.0,1.0])
>>
>> I have four categorical values: a, b, c, d. I am expecting 4 features in
>> the above LabeledPoint, but it has only 3 features.
>>
>> Please help me create a LabeledPoint from categorical features.
>>
>> Regards,
>> Rajesh