This operation requires two transformers: 1) an Indexer, which maps string features to categorical features, and 2) a OneHotEncoder, which flattens categorical features into binary features.
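For illustration, here is a standalone sketch of those two steps in plain Scala (no Spark; the object and method names are hypothetical, not MLlib API): first index the distinct string values, then flatten each index into a binary vector.

```scala
// Hypothetical sketch of index-then-one-hot-encode, in plain Scala.
object IndexOneHotSketch {
  // Step 1 (Indexer): map each distinct string to a categorical index,
  // in order of first appearance.
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.zipWithIndex.toMap

  // Step 2 (OneHotEncoder): flatten a categorical index into a binary vector.
  def oneHot(index: Int, size: Int): Array[Double] =
    Array.tabulate(size)(i => if (i == index) 1.0 else 0.0)

  def main(args: Array[String]): Unit = {
    val colors = Seq("red", "green", "blue", "green")
    val idx = buildIndex(colors) // Map(red -> 0, green -> 1, blue -> 2)
    val encoded = colors.map(c => oneHot(idx(c), idx.size))
    encoded.foreach(v => println(v.mkString("[", ", ", "]")))
    // prints:
    // [1.0, 0.0, 0.0]
    // [0.0, 1.0, 0.0]
    // [0.0, 0.0, 1.0]
    // [0.0, 1.0, 0.0]
  }
}
```

In Spark itself these steps run distributed over an RDD or DataFrame, but the per-row logic is the same.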
We are working on the new dataset implementation, which will let us express those transformations easily. Sorry for the delay! If you want a quick-and-dirty solution, you can try the hashing trick:

    import scala.collection.mutable
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.rdd.RDD

    val rdd: RDD[(Double, Array[String])] = ...
    val training = rdd.mapValues { factors =>
      // Hash each (feature value, position) pair into a fixed-size index space.
      val indices = mutable.Set.empty[Int]
      factors.view.zipWithIndex.foreach { case (f, idx) =>
        indices += math.abs(f.## ^ idx) % 100000
      }
      Vectors.sparse(100000, indices.toSeq.map(x => (x, 1.0)))
    }

This creates a training dataset of all binary features, with some chance of hash collisions. You can use it with SVM, LR, or DecisionTree.

Best,
Xiangrui

On Sun, Nov 2, 2014 at 9:20 AM, ashu <ashutosh.triv...@iiitb.org> wrote:
> Hi,
> Sorry to bounce back the old thread. What is the state now? Is this
> problem solved? How does Spark handle categorical data now?
>
> Regards,
> Ashutosh
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Prediction-using-Classification-with-text-attributes-in-Apache-Spark-MLLib-tp8166p17919.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------