[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
yu peng updated SPARK-19962:
----------------------------
    Issue Type: New Feature  (was: Wish)

> add DictVectorizor for DataFrame
> --------------------------------
>
>                 Key: SPARK-19962
>                 URL: https://issues.apache.org/jira/browse/SPARK-19962
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: yu peng
>              Labels: features
>
> It would be really useful to have something like
> sklearn.feature_extraction.DictVectorizer.
> Our features live in a JSON/DataFrame-like format, while
> classifiers/regressors only take vector input, so there is a gap between them.
> Something like:
> ```
> from pyspark.sql import Row
>
> df = sqlCtx.createDataFrame([
>     Row(age=1, gender='male', country='cn', hobbies=['sing', 'dance']),
>     Row(age=3, gender='female', country='us', hobbies=['sing']),
> ])
> df.show()
> |age|gender|country|hobbies|
> |1|male|cn|[sing, dance]|
> |3|female|us|[sing]|
> import DictVectorizor  # proposed API, does not exist yet
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1, 1, 1]|
> |[3, 1, 0, 1, 0, 1, 0]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> |hobbies=sing|5|
> |hobbies=dance|6|
> ```
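
To make the requested behavior concrete, here is a minimal pure-Python sketch of the encoding described above. It is not a Spark or scikit-learn API: the class name `SimpleDictVectorizer`, its `feature_index` attribute, and the use of plain dicts in place of DataFrame rows are all assumptions made for illustration. Numeric fields keep their value, string fields become one `name=value` indicator, and list fields become one indicator per element.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


class SimpleDictVectorizer:
    """Hypothetical illustration only; not part of Spark or scikit-learn."""

    def __init__(self) -> None:
        # Maps a feature name such as "gender=male" to its vector position.
        self.feature_index: Dict[str, int] = {}

    @staticmethod
    def _items(row: Dict[str, Any]) -> Iterator[Tuple[str, float]]:
        """Expand one row into (feature_name, value) pairs."""
        for key, value in row.items():
            if isinstance(value, (list, tuple, set)):
                # Multi-valued field: one indicator per element.
                for element in value:
                    yield f"{key}={element}", 1.0
            elif isinstance(value, str):
                # Categorical field: a single indicator.
                yield f"{key}={value}", 1.0
            else:
                # Numeric field: pass the value through.
                yield key, float(value)

    def fit(self, rows: Iterable[Dict[str, Any]]) -> "SimpleDictVectorizer":
        # Assign positions in first-seen order.
        for row in rows:
            for name, _ in self._items(row):
                self.feature_index.setdefault(name, len(self.feature_index))
        return self

    def transform(self, rows: Iterable[Dict[str, Any]]) -> List[List[float]]:
        vectors = []
        for row in rows:
            vector = [0.0] * len(self.feature_index)
            for name, value in self._items(row):
                position = self.feature_index.get(name)
                if position is not None:  # features unseen during fit are dropped
                    vector[position] = value
            vectors.append(vector)
        return vectors

    def fit_transform(self, rows: List[Dict[str, Any]]) -> List[List[float]]:
        return self.fit(rows).transform(rows)


rows = [
    {"age": 1, "gender": "male", "country": "cn", "hobbies": ["sing", "dance"]},
    {"age": 3, "gender": "female", "country": "us", "hobbies": ["sing"]},
]
vec = SimpleDictVectorizer()
matrix = vec.fit_transform(rows)
```

Because this sketch assigns positions in first-seen order, the feature dimensions differ from the ordering shown in the issue; a real transformer would need to pick and document a stable ordering, and would return a Spark vector column rather than Python lists.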