Hi,

I have done this in a different way. Please correct me if I am wrong: is
this approach right?

// Assumed imports for the DataFrame-based API; for the RDD-based API use
// org.apache.spark.mllib.regression.LabeledPoint and org.apache.spark.mllib.linalg.Vectors.
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c"),
  (6, "d"))).toDF("id", "category")

val categories: List[String] = List("a", "b", "c", "d")

val labelPoint = df.rdd.map { line =>
  val value = line.getAs[String]("category")
  val id = line.getAs[Int]("id").toDouble
  // Build a fresh array for every row: Vectors.dense wraps the array without
  // copying it, so a buffer shared across rows would be overwritten.
  val oneHot = categories.map(c => if (c == value) 1.0 else 0.0).toArray
  LabeledPoint(id, Vectors.dense(oneHot))
}
labelPoint.foreach { x => println(x) }

*Output :-*

(0.0,[1.0,0.0,0.0,0.0])
(1.0,[0.0,1.0,0.0,0.0])
(2.0,[0.0,0.0,1.0,0.0])
(3.0,[1.0,0.0,0.0,0.0])
(4.0,[1.0,0.0,0.0,0.0])
(5.0,[0.0,0.0,1.0,0.0])
(6.0,[0.0,0.0,0.0,1.0])
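
For comparison with the 3-element vectors in the quoted thread below: as the reply there points out, the encoder leaves the last category out, so that category ("b" here) comes out as all zeros. Here is a minimal plain-Scala sketch of both encodings, with no Spark involved. The names OneHotSketch and oneHot are hypothetical, and the index order (a=0, c=1, d=2, b=3) is an assumption read off the quoted categoryVec output.

```scala
object OneHotSketch {
  // Hypothetical helper, not part of Spark: one-hot encode a category index.
  // With dropLast = true the vector has one slot fewer and the last index
  // maps to the all-zero vector, as in the quoted encoder output for "b".
  def oneHot(index: Int, numCategories: Int, dropLast: Boolean): Array[Double] = {
    val size = if (dropLast) numCategories - 1 else numCategories
    val vec = Array.fill(size)(0.0) // fresh array per call, nothing shared
    if (index < size) vec(index) = 1.0
    vec
  }

  def main(args: Array[String]): Unit = {
    // Index order assumed from the quoted output: a -> 0, c -> 1, d -> 2, b -> 3.
    val index = Map("a" -> 0, "c" -> 1, "d" -> 2, "b" -> 3)

    val dropped = oneHot(index("b"), index.size, dropLast = true)  // 3 features
    val full    = oneHot(index("b"), index.size, dropLast = false) // 4 features

    println(dropped.mkString("[", ",", "]")) // prints [0.0,0.0,0.0]
    println(full.mkString("[", ",", "]"))    // prints [0.0,0.0,0.0,1.0]
  }
}
```

On the Spark side, OneHotEncoder has a setDropLast setter; calling setDropLast(false) on the encoder from the quoted code should give 4-element vectors without the hand-written loop, though I have not run that against this DataFrame.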
Regards,
Rajesh


On Thu, Sep 8, 2016 at 12:35 AM, aka.fe2s <aka.f...@gmail.com> wrote:

> It has 4 categories
> a = 1 0 0
> b = 0 0 0
> c = 0 1 0
> d = 0 0 1
>
> --
> Oleksiy Dyagilev
>
> On Wed, Sep 7, 2016 at 10:42 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> Any help with the use case in the mail below?
>>
>> Regards,
>> Rajesh
>>
>> On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar <
>> mrajaf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am new to Spark ML and am trying to create LabeledPoints from a
>>> categorical dataset (example code from Spark). For this, I am using the
>>> one-hot encoding <http://en.wikipedia.org/wiki/One-hot> feature. Below is
>>> my code:
>>>
>>> val df = sparkSession.createDataFrame(Seq(
>>>       (0, "a"),
>>>       (1, "b"),
>>>       (2, "c"),
>>>       (3, "a"),
>>>       (4, "a"),
>>>       (5, "c"),
>>>       (6, "d"))).toDF("id", "category")
>>>
>>>     val indexer = new StringIndexer()
>>>       .setInputCol("category")
>>>       .setOutputCol("categoryIndex")
>>>       .fit(df)
>>>
>>>     val indexed = indexer.transform(df)
>>>
>>>     indexed.select("category", "categoryIndex").show()
>>>
>>>     val encoder = new OneHotEncoder()
>>>       .setInputCol("categoryIndex")
>>>       .setOutputCol("categoryVec")
>>>     val encoded = encoder.transform(indexed)
>>>
>>>      encoded.select("id", "category", "categoryVec").show()
>>>
>>> *Output :- *
>>> +---+--------+-------------+
>>> | id|category|  categoryVec|
>>> +---+--------+-------------+
>>> |  0|       a|(3,[0],[1.0])|
>>> |  1|       b|    (3,[],[])|
>>> |  2|       c|(3,[1],[1.0])|
>>> |  3|       a|(3,[0],[1.0])|
>>> |  4|       a|(3,[0],[1.0])|
>>> |  5|       c|(3,[1],[1.0])|
>>> |  6|       d|(3,[2],[1.0])|
>>> +---+--------+-------------+
>>>
>>> *Creating LabeledPoint from the encoded DataFrame:-*
>>>
>>> val data = encoded.rdd.map { x =>
>>>       {
>>>         val featureVector = Vectors.dense(
>>>           x.getAs[org.apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
>>>         val label = x.getAs[java.lang.Integer]("id").toDouble
>>>         LabeledPoint(label, featureVector)
>>>       }
>>>     }
>>>
>>>     data.foreach { x => println(x) }
>>>
>>> *Output :-*
>>>
>>> (0.0,[1.0,0.0,0.0])
>>> (1.0,[0.0,0.0,0.0])
>>> (2.0,[0.0,1.0,0.0])
>>> (3.0,[1.0,0.0,0.0])
>>> (4.0,[1.0,0.0,0.0])
>>> (5.0,[0.0,1.0,0.0])
>>> (6.0,[0.0,0.0,1.0])
>>>
>>> I have four categorical values: a, b, c, d. I am expecting 4 features
>>> in the above LabeledPoint, but it has only 3 features.
>>>
>>> Please help me create a LabeledPoint from categorical features.
>>>
>>> Regards,
>>> Rajesh
>>>
>>>
>>>
>>
>
