MLLib sparse vector
Hi All,

I have transformed the data into the following format: the first column is the user id, and all the other columns are class ids. For a given user, only the class ids that appear in that row have value 1; all others are 0. I need to create a sparse vector from this. Does the API for creating a sparse vector directly support this format?

User id   Product class ids
2622572 145447 1620 13421 28565 285556 293 4553 67261 130 3646 1671 18806 183576 3286 51715 57671 57476
Re: MLLib sparse vector
Hi Sameer,

MLLib uses Breeze’s vector format under the hood. You can use that.

http://www.scalanlp.org/api/breeze/index.html#breeze.linalg.SparseVector

For example:

    import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}

    // Note: the vector length must exceed the largest class id, so using the
    // distinct count only works if class ids are dense indices in [0, numClasses).
    val numClasses = classes.distinct.count.toInt
    val userWithClassesAsSparseVector = rows.map(x =>
      (x.userID,
       new BSV[Double](x.classIDs.sortWith(_ < _),               // indices, sorted ascending
                       Seq.fill(x.classIDs.length)(1.0).toArray,  // one 1.0 per class id
                       numClasses).asInstanceOf[BV[Double]]))

Chris

(In reply to Sameer Tilak, Sep 15, 2014, 11:28 AM.)
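To make the Breeze suggestion concrete for a single row, here is a minimal sketch. It assumes Breeze is on the classpath; the object name `BreezeSparseRow`, the helper `toIndicator`, and the sample ids are made up for illustration and are not from the thread. Breeze expects the index array in ascending order, which is why the ids are sorted first.

```scala
import breeze.linalg.{SparseVector => BSV}

object BreezeSparseRow {
  // Hypothetical helper: builds a 0/1 indicator vector of length numClasses
  // from the class ids found in one row.
  def toIndicator(classIds: Array[Int], numClasses: Int): BSV[Double] = {
    // Remove repeated ids and sort ascending before handing them to Breeze.
    val sortedIds = classIds.distinct.sorted
    new BSV[Double](sortedIds, Array.fill(sortedIds.length)(1.0), numClasses)
  }

  def main(args: Array[String]): Unit = {
    // Made-up row: class ids 4 and 1, with ids ranging over 0..5.
    val v = toIndicator(Array(4, 1), numClasses = 6)
    println(v.activeSize) // number of explicitly stored entries
    println(v(4))         // value at a class id present in the row
    println(v(0))         // value at a class id absent from the row
  }
}
```

Entries that are not stored read back as 0.0, so only the class ids present in a row take up memory.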
Re: MLLib sparse vector
Or you can use the factory method `Vectors.sparse`:

    val sv = Vectors.sparse(numProducts, productIds.map(x => (x, 1.0)))

where numProducts should be the largest product id plus one.

Best,
Xiangrui

(In reply to Chris Gore, Sep 15, 2014, 12:46 PM.)
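As a rough sketch of how `Vectors.sparse` could be applied to the row format from the original question, assuming each row is a whitespace-separated line of the form "userId classId classId ...". The object name `SparseRowExample`, the helper `parseRow`, and the sample row are hypothetical, and spark-mllib is assumed to be on the classpath.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object SparseRowExample {
  // Hypothetical helper: turns "userId classId classId ..." into
  // (userId, sparse indicator vector of length numProducts).
  def parseRow(line: String, numProducts: Int): (Long, Vector) = {
    val fields = line.trim.split("\\s+")
    val userId = fields.head.toLong
    // distinct guards against a class id repeated within one row.
    val productIds = fields.tail.map(_.toInt).distinct
    val entries = productIds.map(id => (id, 1.0)).toSeq
    (userId, Vectors.sparse(numProducts, entries))
  }

  def main(args: Array[String]): Unit = {
    // Made-up row: user 42 with class ids 3, 7 and 1; ids range over 0..9.
    val (userId, features) = parseRow("42 3 7 1", numProducts = 10)
    println(s"$userId -> $features")
  }
}
```

On an RDD of such lines, the same helper could be passed to `map`, with numProducts computed once up front as the largest id plus one.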
Re: MLLib sparse vector
Probably worth noting that the factory methods in MLLib create an object of type org.apache.spark.mllib.linalg.Vector, which stores its data in a format similar to that of Breeze vectors.

Chris

(In reply to Xiangrui Meng, Sep 15, 2014, 3:24 PM.)
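To illustrate that point, the sketch below inspects what the factory method returns. It assumes spark-mllib is on the classpath; the object name `InspectSparseVector`, the helper `describe`, and the sample entries are made up. The returned value is typed as `Vector`, and the underlying `SparseVector` exposes a size, an index array, and a value array, much like the Breeze constructor arguments.

```scala
import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}

object InspectSparseVector {
  // Hypothetical helper: returns (size, indices, values) for a sparse vector,
  // or None if the vector is stored densely.
  def describe(v: Vector): Option[(Int, Seq[Int], Seq[Double])] = v match {
    case sv: SparseVector => Some((sv.size, sv.indices.toSeq, sv.values.toSeq))
    case _                => None
  }

  def main(args: Array[String]): Unit = {
    // Entries are given out of order on purpose; the factory sorts them by index.
    val v: Vector = Vectors.sparse(6, Seq((4, 1.0), (1, 1.0)))
    describe(v).foreach { case (size, indices, values) =>
      println(s"size=$size indices=$indices values=$values")
    }
  }
}
```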