[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
yu peng updated SPARK-19962:
----------------------------
    Description:
It would be really useful to have something like sklearn.feature_extraction.DictVectorizer. Our features live in a JSON/DataFrame-like format, while classifiers and regressors only take vector input, so there is a gap between them.

Something like:
```
df = sqlCtx.createDataFrame([
    Row(age=1, gender='male', country='cn', hobbies=['sing', 'dance']),
    Row(age=3, gender='female', country='us', hobbies=['sing']),
])
import DictVectorizor
vec = DictVectorizor()
matrix = vec.fit_transform(df)
matrix.show()
|features|
|[1, 0, 1, 0, 1, 1, 1]|
|[3, 1, 0, 1, 0, 1, 0]|
vec.show()
|feature_name|feature_dimension|
|age|0|
|gender=female|1|
|gender=male|2|
|country=us|3|
|country=cn|4|
|hobbies=sing|5|
|hobbies=dance|6|
```

    was:
It would be really useful to have something like sklearn.feature_extraction.DictVectorizer. Our features live in a JSON/DataFrame-like format, while classifiers and regressors only take vector input, so there is a gap between them.

Something like:
```
df = sqlCtx.createDataFrame([
    Row(age=1, gender='male', country='cn'),
    Row(age=3, gender='female', country='us'),
])
import DictVectorizor
vec = DictVectorizor()
matrix = vec.fit_transform(df)
matrix.show()
|features|
|[1, 0, 1, 0, 1]|
|[3, 1, 0, 1, 0]|
vec.show()
|feature_name|feature_dimension|
|age|0|
|gender=female|1|
|gender=male|2|
|country=us|3|
|country=cn|4|
```

> add DictVectorizor for DataFrame
> --------------------------------
>
>                 Key: SPARK-19962
>                 URL: https://issues.apache.org/jira/browse/SPARK-19962
>             Project: Spark
>          Issue Type: Wish
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: yu peng
>              Labels: features
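To make the requested behaviour concrete, here is a minimal pure-Python sketch of the dict-to-vector mapping described above. It is only an illustration and not a Spark or scikit-learn API: the class name `SimpleDictVectorizer` and its attributes are hypothetical, rows are plain dicts rather than a DataFrame, and feature names are sorted alphabetically, so the column order differs from the table in the description.

```python
class SimpleDictVectorizer:
    """Hypothetical sketch of dict vectorization; not an existing Spark class."""

    @staticmethod
    def _expand(key, value):
        # Strings become one-hot "key=value" features, lists become one
        # feature per element, and anything else is treated as numeric.
        if isinstance(value, str):
            yield f"{key}={value}", 1
        elif isinstance(value, (list, tuple, set)):
            for item in value:
                yield f"{key}={item}", 1
        else:
            yield key, value

    def fit(self, rows):
        # Collect every feature name seen in the data and assign each one
        # a fixed dimension (alphabetical order, an assumption of this sketch).
        names = set()
        for row in rows:
            for key, value in row.items():
                names.update(name for name, _ in self._expand(key, value))
        self.feature_names_ = sorted(names)
        self.index_ = {name: i for i, name in enumerate(self.feature_names_)}
        return self

    def transform(self, rows):
        # Build one dense vector per row; features unseen during fit are ignored.
        matrix = []
        for row in rows:
            vec = [0] * len(self.feature_names_)
            for key, value in row.items():
                for name, weight in self._expand(key, value):
                    if name in self.index_:
                        vec[self.index_[name]] = weight
            matrix.append(vec)
        return matrix

    def fit_transform(self, rows):
        rows = list(rows)
        return self.fit(rows).transform(rows)


rows = [
    {"age": 1, "gender": "male", "country": "cn", "hobbies": ["sing", "dance"]},
    {"age": 3, "gender": "female", "country": "us", "hobbies": ["sing"]},
]
vec = SimpleDictVectorizer()
matrix = vec.fit_transform(rows)
for name, dim in vec.index_.items():
    print(dim, name)
print(matrix)
```

A distributed version would need to collect the distinct categories across partitions during `fit` and would more likely emit sparse vectors; this sketch deliberately ignores both concerns.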