Hi, I have done this in a different way. Please correct me: is this approach right?

// Assumed imports: the DataFrame-based spark.ml API (Spark 2.x)
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c"),
  (6, "d"))).toDF("id", "category")

val categories: List[String] = List("a", "b", "c", "d")

val labelPoint = df.rdd.map { line =>
  val values = line.getAs[String]("category")
  val id = line.getAs[java.lang.Integer]("id").toDouble
  // Allocate a fresh array per row: Vectors.dense wraps the array without
  // copying, so a buffer shared across rows would be overwritten.
  val categoriesList: Array[Double] = new Array[Double](categories.size)
  var i = -1
  categories.foreach { x => i += 1; categoriesList(i) = if (x == values) 1.0 else 0.0 }
  val denseVector = Vectors.dense(categoriesList)
  LabeledPoint(id, denseVector)
}

labelPoint.foreach { x => println(x) }

*Output :-*

(0.0,[1.0,0.0,0.0,0.0])
(1.0,[0.0,1.0,0.0,0.0])
(2.0,[0.0,0.0,1.0,0.0])
(3.0,[1.0,0.0,0.0,0.0])
(4.0,[1.0,0.0,0.0,0.0])
(5.0,[0.0,0.0,1.0,0.0])
(6.0,[0.0,0.0,0.0,1.0])

Regards,
Rajesh

On Thu, Sep 8, 2016 at 12:35 AM, aka.fe2s <aka.f...@gmail.com> wrote:

> It has 4 categories
> a = 1 0 0
> b = 0 0 0
> c = 0 1 0
> d = 0 0 1
>
> --
> Oleksiy Dyagilev
>
> On Wed, Sep 7, 2016 at 10:42 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> Any help with the use case in the mail above?
>>
>> Regards,
>> Rajesh
>>
>> On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar <
>> mrajaf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am new to Spark ML and am trying to create a LabeledPoint from a
>>> categorical dataset (example code from Spark). For this, I am using the
>>> one-hot encoding <http://en.wikipedia.org/wiki/One-hot> feature.
>>> Below is my code:
>>>
>>> val df = sparkSession.createDataFrame(Seq(
>>>   (0, "a"),
>>>   (1, "b"),
>>>   (2, "c"),
>>>   (3, "a"),
>>>   (4, "a"),
>>>   (5, "c"),
>>>   (6, "d"))).toDF("id", "category")
>>>
>>> val indexer = new StringIndexer()
>>>   .setInputCol("category")
>>>   .setOutputCol("categoryIndex")
>>>   .fit(df)
>>>
>>> val indexed = indexer.transform(df)
>>>
>>> indexed.select("category", "categoryIndex").show()
>>>
>>> val encoder = new OneHotEncoder()
>>>   .setInputCol("categoryIndex")
>>>   .setOutputCol("categoryVec")
>>> val encoded = encoder.transform(indexed)
>>>
>>> encoded.select("id", "category", "categoryVec").show()
>>>
>>> *Output :-*
>>> +---+--------+-------------+
>>> | id|category|  categoryVec|
>>> +---+--------+-------------+
>>> |  0|       a|(3,[0],[1.0])|
>>> |  1|       b|    (3,[],[])|
>>> |  2|       c|(3,[1],[1.0])|
>>> |  3|       a|(3,[0],[1.0])|
>>> |  4|       a|(3,[0],[1.0])|
>>> |  5|       c|(3,[1],[1.0])|
>>> |  6|       d|(3,[2],[1.0])|
>>> +---+--------+-------------+
>>>
>>> *Creating a LabeledPoint from the encoded DataFrame:-*
>>>
>>> val data = encoded.rdd.map { x =>
>>>   {
>>>     val featureVector = Vectors.dense(x.getAs[org.apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
>>>     val label = x.getAs[java.lang.Integer]("id").toDouble
>>>     LabeledPoint(label, featureVector)
>>>   }
>>> }
>>>
>>> data.foreach { x => println(x) }
>>>
>>> *Output :-*
>>>
>>> (0.0,[1.0,0.0,0.0])
>>> (1.0,[0.0,0.0,0.0])
>>> (2.0,[0.0,1.0,0.0])
>>> (3.0,[1.0,0.0,0.0])
>>> (4.0,[1.0,0.0,0.0])
>>> (5.0,[0.0,1.0,0.0])
>>> (6.0,[0.0,0.0,1.0])
>>>
>>> I have four categorical values: a, b, c, d. I expected 4 features in
>>> the above LabeledPoint, but it has only 3.
>>>
>>> Please help me create a LabeledPoint from categorical features.
>>>
>>> Regards,
>>> Rajesh
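
For reference, the 3-versus-4 feature difference in the quoted output seems to come from the encoder dropping the last category, which is what Oleksiy's table shows (b = 0 0 0). Below is a minimal plain-Scala sketch of that behaviour. The helper `oneHot` is hypothetical and not a Spark API, and the index order (a=0, c=1, d=2, b=3) is read off the quoted categoryVec output. It can be pasted into spark-shell or the Scala REPL.

```scala
// Hypothetical helper, not part of Spark: one-hot encode a category index.
// With dropLast = true the last index maps to the all-zeros vector,
// so n categories need only n - 1 slots.
def oneHot(index: Int, numCategories: Int, dropLast: Boolean): Array[Double] = {
  require(index >= 0 && index < numCategories, s"index $index out of range")
  val size = if (dropLast) numCategories - 1 else numCategories
  val vec = new Array[Double](size) // fresh, zero-filled array per call
  if (index < size) vec(index) = 1.0
  vec
}

// Index order inferred from the quoted categoryVec output: a=0, c=1, d=2, b=3.
val index = Map("a" -> 0, "c" -> 1, "d" -> 2, "b" -> 3)

Seq("a", "b", "c", "d").foreach { c =>
  val dropped = oneHot(index(c), index.size, dropLast = true).mkString("[", ",", "]")
  val full = oneHot(index(c), index.size, dropLast = false).mkString("[", ",", "]")
  println(s"$c  dropLast=true: $dropped  dropLast=false: $full")
}
```

In Spark 2.x, OneHotEncoder exposes the same choice through its dropLast parameter (default true). Calling .setDropLast(false) on the encoder should give size-4 vectors without the manual loop.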