[ https://issues.apache.org/jira/browse/SPARK-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356548#comment-15356548 ]
mahendra singh commented on SPARK-16290:
----------------------------------------

[~srowen] Hi srowen, I have an issue with Spark regarding text-type features for naive Bayes. I have the following data:

Male, Suspicion of Alcohol, Weekday, 12, 75, 30-39
Male, Moving Traffic Violation, Weekday, 12, 20, 20-24
Male, Suspicion of Alcohol, Weekend, 4, 1 2, 40-49
Male, Suspicion of Alcohol, Weekday, 12, 0, 50-59
Female, Road Traffic Collision, Weekend, 12, 0, 20-24
Male, Road Traffic Collision, Weekday, 12, 0, 25-29
Male, Road Traffic Collision, Weekday, 8, 0, Other
Male, Road Traffic Collision, Weekday, 8, 23, 60-69
Male, Moving Traffic Violation, Weekend, 4, 26, 30-39
Female, Road Traffic Collision, Weekend, 8, 61, 16-19
Male, Moving Traffic Violation, Weekend, 4, 74, 25-29
Male, Road Traffic Collision, Weekday, 12, 0, Other
Male, Moving Traffic Violation, Weekday, 8, 0, 16-19
Male, Road Traffic Collision, Weekday, 8, 0, Other
Male, Moving Traffic Violation, Weekend, 4, 0, 30-39

In this data, some of the comma-separated columns are numeric and some are text. Spark's naive Bayes supports only numeric data, so how can I transform the text columns into numeric ones? The number assigned to a given text value must be the same every time (in both training and testing); otherwise it will cause problems. Is this possible in Spark now? I am asking because I did not find a solution. If it is possible, how is it done? If not, how can this issue be solved?

> text type features column for classification
> --------------------------------------------
>
> Key: SPARK-16290
> URL: https://issues.apache.org/jira/browse/SPARK-16290
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Affects Versions: 1.6.2
> Reporter: mahendra singh
> Labels: features
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> We have to improve Spark ML and MLlib in the case of feature columns.
> This means text values should also be accepted in features.
> Suppose we have 4 feature values:
> id, dept_name, score, result
> Here dept_name is a text column, so Spark has to handle it internally,
> meaning it has to convert the text column to a numeric column.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
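
The requirement raised in the comment, that each text value maps to the same number at training and at test time, is usually met by fitting a value-to-index mapping once on the training data and reusing that fitted mapping afterwards. Spark ML's `StringIndexer` follows this fit-then-transform pattern. The sketch below is plain Python, not the Spark API, and only illustrates the idea; the helper names `fit_index` and `transform` are hypothetical, and sending unseen values to an extra "unknown" bucket is an assumption made here for illustration.

```python
from collections import Counter


def fit_index(values):
    """Learn a string -> integer mapping from training values.

    The most frequent value gets index 0; ties are broken alphabetically
    so the mapping is deterministic for a given training set.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


def transform(values, index):
    """Encode values with an already-fitted mapping.

    Values not seen during fitting go to one extra bucket, len(index).
    This is an illustrative choice, not necessarily what a library does.
    """
    unknown = len(index)
    return [index.get(v, unknown) for v in values]


# Second column of the first six rows quoted in the comment (training data).
train_reason = [
    "Suspicion of Alcohol",
    "Moving Traffic Violation",
    "Suspicion of Alcohol",
    "Suspicion of Alcohol",
    "Road Traffic Collision",
    "Road Traffic Collision",
]

# Fit once on the training data ...
reason_index = fit_index(train_reason)

# ... then reuse the same mapping for training and for test data.
train_encoded = transform(train_reason, reason_index)
test_encoded = transform(
    ["Road Traffic Collision", "Suspicion of Alcohol", "Speeding"],
    reason_index,
)
```

Because the mapping is fitted only once and then stored, "Road Traffic Collision" receives the same index in both calls. Refitting on the test data could silently assign different numbers.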