[ 
https://issues.apache.org/jira/browse/SPARK-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356548#comment-15356548
 ] 

mahendra singh commented on SPARK-16290:
----------------------------------------

[~srowen] Hi srowen,
I have an issue with Spark regarding text-type features for naive Bayes.
I have the following data:

Male, Suspicion of Alcohol, Weekday, 12, 75, 30-39
Male, Moving Traffic Violation, Weekday, 12, 20, 20-24
Male, Suspicion of Alcohol, Weekend, 4, 1 2, 40-49
Male, Suspicion of Alcohol, Weekday, 12, 0, 50-59
Female, Road Traffic Collision, Weekend, 12, 0, 20-24
Male, Road Traffic Collision, Weekday, 12, 0, 25-29
Male, Road Traffic Collision, Weekday, 8, 0, Other
Male, Road Traffic Collision, Weekday, 8, 23, 60-69
Male, Moving Traffic Violation, Weekend, 4, 26, 30-39
Female, Road Traffic Collision, Weekend, 8, 61, 16-19
Male, Moving Traffic Violation, Weekend, 4, 74, 25-29
Male, Road Traffic Collision, Weekday, 12, 0, Other
Male, Moving Traffic Violation, Weekday, 8, 0, 16-19
Male, Road Traffic Collision, Weekday, 8, 0, Other
Male, Moving Traffic Violation, Weekend, 4, 0, 30-39

In this data, some of the comma-separated columns are numeric and some are
text. Spark's naive Bayes currently supports only numeric features, so how
can I transform the text columns to numeric ones? The numeric value assigned
to each text value must be the same every time (in both training and
testing); otherwise it will cause problems. Is this possible in Spark now? I
am asking because I did not find a solution. If it is possible, how? If not,
can this issue be solved?
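
For illustration, the consistency requirement above (the same text value
must map to the same number in training and testing) can be met by building
the mapping once from the training data and reusing it unchanged. The sketch
below is plain Python, not Spark code, and every name in it (fit_index,
transform, the sample lists) is hypothetical. Spark ML's StringIndexer in
org.apache.spark.ml.feature follows the same fit-once, transform-many
pattern, though whether it fits this use case is an assumption to verify
against the Spark documentation.

```python
# Sketch only: plain Python, not Spark. All names are hypothetical.

def fit_index(values):
    """Assign each distinct string a fixed integer, most frequent first."""
    counts = {}
    for v in values:
        key = v.strip()  # the sample data has stray spaces around fields
        counts[key] = counts.get(key, 0) + 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda k: (-counts[k], k))
    return {label: i for i, label in enumerate(ordered)}


def transform(values, index):
    """Map strings to their fitted integers; an unseen label raises KeyError."""
    return [index[v.strip()] for v in values]


# A few values from the second column of the data above.
train_reason = [
    "Suspicion of Alcohol",
    "Moving Traffic Violation",
    "Suspicion of Alcohol ",
    "Road Traffic Collision",
]
test_reason = ["Road Traffic Collision", "Suspicion of Alcohol"]

# Fit once on the training data, then reuse the same mapping for both sets.
reason_index = fit_index(train_reason)
train_encoded = transform(train_reason, reason_index)
test_encoded = transform(test_reason, reason_index)
```

Because the mapping is built only from the training values and then reused,
"Suspicion of Alcohol" gets the same number in both sets. The integers are
arbitrary codes, not magnitudes, so a further one-hot encoding step may suit
naive Bayes better than feeding the raw indices.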

> text type features column for classification
> --------------------------------------------
>
>                 Key: SPARK-16290
>                 URL: https://issues.apache.org/jira/browse/SPARK-16290
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>    Affects Versions: 1.6.2
>            Reporter: mahendra singh
>              Labels: features
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> We need to improve how Spark ML and MLlib handle feature columns, so that
> text values can also be given as features.
> Suppose we have four columns:
> id, dept_name, score, result
> Here dept_name is a text column, so Spark should handle it internally,
> that is, convert the text column to a numeric one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
