This operation requires two transformers: 1) an Indexer, which maps string features to categorical features, and 2) a OneHotEncoder, which flattens categorical features into binary features.
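For illustration, here is a standalone sketch of those two steps in plain Scala (no Spark; the object and method names are hypothetical, not MLlib API): first index the distinct string values, then flatten each index into a binary vector.

```scala
// Hypothetical sketch of index-then-one-hot-encode, in plain Scala.
object IndexOneHotSketch {
  // Step 1 (Indexer): map each distinct string to a categorical index,
  // in order of first appearance.
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.zipWithIndex.toMap

  // Step 2 (OneHotEncoder): flatten a categorical index into a binary vector.
  def oneHot(index: Int, size: Int): Array[Double] =
    Array.tabulate(size)(i => if (i == index) 1.0 else 0.0)

  def main(args: Array[String]): Unit = {
    val colors = Seq("red", "green", "blue", "green")
    val idx = buildIndex(colors) // Map(red -> 0, green -> 1, blue -> 2)
    val encoded = colors.map(c => oneHot(idx(c), idx.size))
    encoded.foreach(v => println(v.mkString("[", ", ", "]")))
    // prints:
    // [1.0, 0.0, 0.0]
    // [0.0, 1.0, 0.0]
    // [0.0, 0.0, 1.0]
    // [0.0, 1.0, 0.0]
  }
}
```

In Spark itself these steps run distributed over an RDD or DataFrame, but the per-row logic is the same.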
We are working on the new dataset implementation, which will let us express those transformations easily. Sorry for the delay! If you want a quick-and-dirty solution, you can try the hashing trick:

    import scala.collection.mutable
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.rdd.RDD

    val rdd: RDD[(Double, Array[String])] = ...
    val training = rdd.mapValues { factors =>
      // Hash each (feature value, position) pair into a fixed-size index space.
      val indices = mutable.Set.empty[Int]
      factors.view.zipWithIndex.foreach { case (f, idx) =>
        indices += math.abs(f.## ^ idx) % 100000
      }
      Vectors.sparse(100000, indices.toSeq.map(x => (x, 1.0)))
    }

This creates a training dataset of all binary features, with some chance of hash collisions. You can use it with SVM, LR, or DecisionTree.

Best,
Xiangrui

On Sun, Nov 2, 2014 at 9:20 AM, ashu <ashutosh.triv...@iiitb.org> wrote:
> Hi,
> Sorry to bounce back the old thread. What is the state now? Is this
> problem solved? How does Spark handle categorical data now?
>
> Regards,
> Ashutosh
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Prediction-using-Classification-with-text-attributes-in-Apache-Spark-MLLib-tp8166p17919.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------