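It does encode all 4 categories. By default OneHotEncoder drops the last
category (dropLast = true), so the category with the highest index, here b,
is represented by the all-zeros vector: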
a = 1 0 0
b = 0 0 0
c = 0 1 0
d = 0 0 1
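
If you want one slot per category, a minimal sketch is to switch off the
encoder's dropLast parameter. This assumes the Spark 2.0 API used in your
mail, where OneHotEncoder is a plain transformer. The local SparkSession
below is only an assumption for illustration; reuse your own sparkSession.

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val sparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("one-hot-sketch")
  .getOrCreate()

val df = sparkSession.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"), (6, "d")
)).toDF("id", "category")

// Map each category string to a numeric index.
val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)

// dropLast defaults to true; setting it to false keeps a slot for every index.
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(false)

val encoded = encoder.transform(indexed)
encoded.select("id", "category", "categoryVec").show()
```

Each categoryVec should then have size 4. On Spark 3.x OneHotEncoder is an
estimator, so there you would call encoder.fit(indexed).transform(indexed)
instead.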

--
Oleksiy Dyagilev

On Wed, Sep 7, 2016 at 10:42 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi,
>
> Any help with the use case in my mail below?
>
> Regards,
> Rajesh
>
> On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> I am new to Spark ML and am trying to create a LabeledPoint from a
>> categorical dataset (example code from Spark). For this, I am using the
>> one-hot encoding <http://en.wikipedia.org/wiki/One-hot> feature. Below is
>> my code:
>>
>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
>>
>> val df = sparkSession.createDataFrame(Seq(
>>   (0, "a"),
>>   (1, "b"),
>>   (2, "c"),
>>   (3, "a"),
>>   (4, "a"),
>>   (5, "c"),
>>   (6, "d"))).toDF("id", "category")
>>
>> // Map each category string to a numeric index.
>> val indexer = new StringIndexer()
>>   .setInputCol("category")
>>   .setOutputCol("categoryIndex")
>>   .fit(df)
>>
>> val indexed = indexer.transform(df)
>>
>> indexed.select("category", "categoryIndex").show()
>>
>> // One-hot encode the index column into a sparse vector.
>> val encoder = new OneHotEncoder()
>>   .setInputCol("categoryIndex")
>>   .setOutputCol("categoryVec")
>> val encoded = encoder.transform(indexed)
>>
>> encoded.select("id", "category", "categoryVec").show()
>>
>> *Output :- *
>> +---+--------+-------------+
>> | id|category|  categoryVec|
>> +---+--------+-------------+
>> |  0|       a|(3,[0],[1.0])|
>> |  1|       b|    (3,[],[])|
>> |  2|       c|(3,[1],[1.0])|
>> |  3|       a|(3,[0],[1.0])|
>> |  4|       a|(3,[0],[1.0])|
>> |  5|       c|(3,[1],[1.0])|
>> |  6|       d|(3,[2],[1.0])|
>> +---+--------+-------------+
>>
>> *Creating a LabeledPoint from the encoded DataFrame:-*
>>
>> val data = encoded.rdd.map { x =>
>>       {
>>         val featureVector = Vectors.dense(
>>           x.getAs[org.apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
>>         val label = x.getAs[java.lang.Integer]("id").toDouble
>>         LabeledPoint(label, featureVector)
>>       }
>>     }
>>
>>     data.foreach { x => println(x) }
>>
>> *Output :-*
>>
>> (0.0,[1.0,0.0,0.0])
>> (1.0,[0.0,0.0,0.0])
>> (2.0,[0.0,1.0,0.0])
>> (3.0,[1.0,0.0,0.0])
>> (4.0,[1.0,0.0,0.0])
>> (5.0,[0.0,1.0,0.0])
>> (6.0,[0.0,0.0,1.0])
>>
>> I have four categorical values: a, b, c, d. I am expecting 4
>> features in the above LabeledPoint, but it has only 3 features.
>>
>> Please help me create a LabeledPoint from categorical features.
>>
>> Regards,
>> Rajesh
>>
>>
>>
>
