Hi folks,

I have a set of categorical columns (strings) that I'm parsing and
converting into feature Vectors to pass to an MLlib classifier (random
forest).

In my input data, some columns have null values. Say that one of those
columns has p distinct values plus null.
How should I build my feature Vectors and the categoricalFeaturesInfo map
of the classifier?
* option 1: I declare p values in categoricalFeaturesInfo and use
Double.NaN in my input Vectors? [How are NaNs handled by the classifiers?]
* option 2: I treat null as a value of its own, so I declare (p+1) values in
categoricalFeaturesInfo and map nulls to some int?
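
To make option 2 concrete, here is a minimal sketch of what I have in mind.
It is plain Python with no Spark calls; build_index and encode are
hypothetical helper names of my own, null is represented as None, and I am
assuming the category indices have to run from 0 to k-1 for a feature
declared with k values.

```python
# Sketch of option 2: treat null as one more category.
# build_index and encode are hypothetical helpers, not part of MLlib.

def build_index(values):
    """Map each distinct non-null value to 0..p-1 and null (None) to p."""
    distinct = sorted({v for v in values if v is not None})
    index = {v: i for i, v in enumerate(distinct)}
    index[None] = len(distinct)  # extra slot reserved for nulls
    return index

def encode(rows, indexes):
    """Turn rows of strings/None into rows of category indices (as floats)."""
    return [[float(indexes[j][v]) for j, v in enumerate(row)] for row in rows]

# Toy input: two categorical columns, each containing one null.
rows = [["red", "s"], [None, "m"], ["blue", None], ["red", "m"]]

# One index per column, built from that column's values.
indexes = [build_index([row[j] for row in rows]) for j in range(2)]
features = encode(rows, indexes)

# Number of categories per feature: p + 1, the last one standing for null.
categorical_features_info = {j: len(ix) for j, ix in enumerate(indexes)}
```

Each row of features would then be wrapped in a Vector, and
categorical_features_info passed as the categoricalFeaturesInfo argument.
Note that a value never seen when building the index would raise a KeyError
in this sketch, so that case would need its own handling.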


Thanks for your help.

Mathieu

(PS: I know about the new DataFrame + Pipeline + VectorIndexer API, but for
various reasons it doesn't fit my needs well, so I need to do this myself.)





