[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926543#comment-15926543 ]
Sean Owen commented on SPARK-19962:
-----------------------------------

Well, it's maintained by the StringIndexerModel for you. What's your non-integer, non-string use case that can't be converted to a string but is categorical?

> add DictVectorizor for DataFrame
> --------------------------------
>
>                 Key: SPARK-19962
>                 URL: https://issues.apache.org/jira/browse/SPARK-19962
>             Project: Spark
>          Issue Type: Wish
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: yu peng
>              Labels: features
>
> It would be really useful to have something like
> sklearn.feature_extraction.DictVectorizer,
> since our features live in a JSON/DataFrame-like format while
> classifiers/regressors only take vector input, so there is a gap between them.
> Something like:
> ```
> df = sqlCtx.createDataFrame([
>     Row(age=1, gender='male', country='cn', hobbies=['sing', 'dance']),
>     Row(age=3, gender='female', country='us', hobbies=['sing']),
> ])
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1, 1, 1]|
> |[3, 1, 0, 1, 0, 1, 0]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> |hobbies=sing|5|
> |hobbies=dance|6|
> ```
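
For readers following the thread, the mapping the issue asks for can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions, not an existing Spark or scikit-learn API: the function name `dict_vectorize` is hypothetical, rows are assumed to be plain dicts, string values become `name=value` indicator columns, list values become one indicator per element, and feature names are sorted alphabetically, so the column order differs from the one shown in the issue.

```python
def dict_vectorize(rows):
    """Hypothetical helper: return (feature_names, matrix) for a list of dicts."""

    def expand(row):
        # Flatten one row into {feature_name: numeric_value}.
        out = {}
        for key, value in row.items():
            if isinstance(value, str):
                # Categorical string -> single indicator column "key=value".
                out[f"{key}={value}"] = 1
            elif isinstance(value, (list, tuple, set)):
                # Multi-valued field -> one indicator column per element.
                for item in value:
                    out[f"{key}={item}"] = 1
            else:
                # Numeric field is passed through unchanged.
                out[key] = value
        return out

    expanded = [expand(row) for row in rows]
    # Assumption: alphabetical order of feature names defines the dimensions.
    feature_names = sorted({name for e in expanded for name in e})
    index = {name: i for i, name in enumerate(feature_names)}

    matrix = []
    for e in expanded:
        vec = [0] * len(feature_names)
        for name, value in e.items():
            vec[index[name]] = value
        matrix.append(vec)
    return feature_names, matrix


rows = [
    {"age": 1, "gender": "male", "country": "cn", "hobbies": ["sing", "dance"]},
    {"age": 3, "gender": "female", "country": "us", "hobbies": ["sing"]},
]
feature_names, matrix = dict_vectorize(rows)
for name, dim in zip(feature_names, range(len(feature_names))):
    print(name, dim)
for vec in matrix:
    print(vec)
```

The `feature_names` list plays the role of the name-to-dimension table in the request, which is the kind of bookkeeping the comment above says StringIndexerModel already maintains for string columns.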