[ 
https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244061#comment-15244061
 ] 

hujiayin commented on SPARK-14623:
----------------------------------

Hi Joseph, I think it is similar to combining StringIndexer and 
OneHotEncoder into one class. The difference is that LabelBinarizer collects 
all occurrences of the same element into one vector (one output column) and 
remembers the positions of that element in the input. 

For example, 
Input is "yellow,green,red,green,0"
LabelBinarizer extracts the distinct labels from the input, and the labels 
are "0, green, red, yellow"
Output is
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
The second column shows that the element "green" appears at positions 1 and 3 
(counting from 0) in the input. The 4 columns correspond to the 4 labels: 
column 0 represents label "0", column 1 represents label "green", and so on. 
If I understand correctly, StringIndexer returns the category index of a 
label, and OneHotEncoder returns the binary vector representation of that 
category index.
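
To make the example above concrete, here is a minimal sketch in plain Python 
of the behaviour described. It is not Spark code and not the proposed 
implementation; the function name label_binarize and the choice to sort the 
labels are assumptions made only so the output matches the example.

```python
def label_binarize(values):
    """Hypothetical helper: return (labels, rows) for a list of raw values.

    labels: the distinct values in sorted order (assumed ordering).
    rows:   one 0/1 row per input element, with a 1 in the column of its label.
    """
    labels = sorted(set(values))
    column_of = {label: col for col, label in enumerate(labels)}
    rows = []
    for value in values:
        row = [0] * len(labels)
        row[column_of[value]] = 1
        rows.append(row)
    return labels, rows


values = "yellow,green,red,green,0".split(",")
labels, rows = label_binarize(values)

# Reading one column top to bottom gives the input positions of that label.
green_col = labels.index("green")
green_positions = [i for i, row in enumerate(rows) if row[green_col] == 1]

print(labels)
for row in rows:
    print(row)
print(green_positions)
```

Each row has exactly one 1, and reading the "green" column from top to bottom 
recovers the input positions where "green" occurs.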

> add label binarizer 
> --------------------
>
>                 Key: SPARK-14623
>                 URL: https://issues.apache.org/jira/browse/SPARK-14623
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: hujiayin
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It relates to https://issues.apache.org/jira/browse/SPARK-7445
> Map the labels to 0/1. 
> For example,
> Input:
> "yellow,green,red,green,0"
> The labels: "0, green, red, yellow"
> Output:
> 0, 0, 0, 1
> 0, 1, 0, 0
> 0, 0, 1, 0
> 0, 1, 0, 0
> 1, 0, 0, 0



