Hi,

Could anyone help with the use case from my earlier mail (quoted below)?

Regards,
Rajesh

On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:

> Hi,
>
> I am new to Spark ML and am trying to create a LabeledPoint from a
> categorical dataset (example code from Spark). For this, I am using
> one-hot encoding <http://en.wikipedia.org/wiki/One-hot>. Below is my code:
>
> val df = sparkSession.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c"),
>   (3, "a"),
>   (4, "a"),
>   (5, "c"),
>   (6, "d"))).toDF("id", "category")
>
> val indexer = new StringIndexer()
>   .setInputCol("category")
>   .setOutputCol("categoryIndex")
>   .fit(df)
>
> val indexed = indexer.transform(df)
>
> indexed.select("category", "categoryIndex").show()
>
> val encoder = new OneHotEncoder()
>   .setInputCol("categoryIndex")
>   .setOutputCol("categoryVec")
> val encoded = encoder.transform(indexed)
>
> encoded.select("id", "category", "categoryVec").show()
>
> *Output:*
>
> +---+--------+-------------+
> | id|category|  categoryVec|
> +---+--------+-------------+
> |  0|       a|(3,[0],[1.0])|
> |  1|       b|    (3,[],[])|
> |  2|       c|(3,[1],[1.0])|
> |  3|       a|(3,[0],[1.0])|
> |  4|       a|(3,[0],[1.0])|
> |  5|       c|(3,[1],[1.0])|
> |  6|       d|(3,[2],[1.0])|
> +---+--------+-------------+
>
> *Creating a LabeledPoint from the encoded DataFrame:*
>
> val data = encoded.rdd.map { x =>
>   {
>     val featureVector = Vectors.dense(
>       x.getAs[org.apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
>     val label = x.getAs[java.lang.Integer]("id").toDouble
>     LabeledPoint(label, featureVector)
>   }
> }
>
> data.foreach { x => println(x) }
>
> *Output:*
>
> (0.0,[1.0,0.0,0.0])
> (1.0,[0.0,0.0,0.0])
> (2.0,[0.0,1.0,0.0])
> (3.0,[1.0,0.0,0.0])
> (4.0,[1.0,0.0,0.0])
> (5.0,[0.0,1.0,0.0])
> (6.0,[0.0,0.0,1.0])
>
> I have four categorical values: a, b, c, and d. I expected 4 features in
> the LabeledPoint above, but it has only 3.
>
> Please help me create a LabeledPoint from categorical features.
>
> Regards,
> Rajesh
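
A note on the 3-versus-4 question in the quoted mail: in Spark 2.x, OneHotEncoder has a dropLast parameter that defaults to true, so the category with the highest index is encoded as an all-zero vector. That would be consistent with the (3,[],[]) row for "b" in the output. The following is only a minimal sketch, assuming Spark 2.x on the classpath and a local master; the object name OneHotKeepAllCategories and the encode helper are made up for illustration. It sets dropLast to false so that every vector has one slot per category:

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of the original mail.
object OneHotKeepAllCategories {

  // Builds the same toy dataset as in the mail and one-hot encodes it
  // without dropping the last category.
  def encode(spark: SparkSession): DataFrame = {
    val df = spark.createDataFrame(Seq(
      (0, "a"), (1, "b"), (2, "c"), (3, "a"),
      (4, "a"), (5, "c"), (6, "d"))).toDF("id", "category")

    // Map each category string to a numeric index (most frequent first).
    val indexed = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
      .transform(df)

    // dropLast defaults to true; false keeps one slot per category.
    // Assumption: Spark 2.x, where OneHotEncoder is a plain Transformer.
    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
      .setDropLast(false)

    encoder.transform(indexed)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("one-hot-keep-all-categories")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    encode(spark).select("id", "category", "categoryVec").show()
    spark.stop()
  }
}
```

With dropLast set to false, the categoryVec column should hold vectors of size 4, so the LabeledPoint conversion from the quoted mail would yield 4 features. Note that in Spark 3.x OneHotEncoder is an Estimator, so the encoder there would need .fit(indexed) before .transform(indexed). The default drops the last category because that slot is redundant when the vector entries would otherwise sum to one, which matters for some linear models.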