Re: How to binarize data in spark

2015-08-07 Thread Adamantios Corais
I have ended up with the following piece of code, but it turns out to be
really slow... Any other ideas, given that I can only use MLlib 1.2?

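import org.apache.spark.mllib.linalg.Vectors

// test11 is presumably an RDD of (user, class, product) rows and
// test12 the collection of all products.
val data = test11.map(x => ((x(0), x(1)), x(2))).groupByKey().map(x =>
  (x._1, x._2.toArray)).map { x =>
  val lt: Array[Double] = new Array[Double](test12.size)
  val id = x._1._1
  val cl = x._1._2
  val dt = x._2
  var i = -1
  // One slot per product: 1.0 if this user has it, 0.0 otherwise.
  test12.foreach { y => i += 1; lt(i) = if (dt contains y) 1.0 else 0.0 }
  val vs = Vectors.dense(lt)
  (id, cl, vs)
}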
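
One likely reason the snippet above is slow: for every (user, class) group it allocates a dense array with one slot per product and runs a linear `contains` scan for each of those slots. A minimal sketch of an alternative, assuming the rows can be represented as (user, class, product) string triples and that a product-to-index map fits in memory on the driver and on each executor: index the products once, then hand only the non-zero positions to `Vectors.sparse`. The names `binarize`, `rows`, and `productIndex` are illustrative and do not come from the thread.

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.2
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Assumption: each row is a (user, class, product) triple of strings.
def binarize(rows: RDD[(String, String, String)]): RDD[(String, String, Vector)] = {
  // Give every distinct product a column index 0 .. n-1 (sorted for determinism).
  val productIndex: Map[String, Int] =
    rows.map(_._3).distinct().collect().sorted.zipWithIndex.toMap
  val numProducts = productIndex.size
  // Ship the lookup table to the executors once instead of once per task.
  val indexBc = rows.sparkContext.broadcast(productIndex)

  rows
    .map { case (user, cls, product) => ((user, cls), indexBc.value(product)) }
    .groupByKey()
    .map { case ((user, cls), idxs) =>
      // Only the non-zero positions are stored; no dense array is built.
      val elements = idxs.toSeq.distinct.sorted.map(i => (i, 1.0))
      (user, cls, Vectors.sparse(numProducts, elements))
    }
}
```

With this shape the work per group is proportional to the number of products that user actually has, not to the total number of products.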



*// Adamantios*



On Fri, Aug 7, 2015 at 8:36 AM, Yanbo Liang yblia...@gmail.com wrote:

 I think you want to flatten the 1M products to a vector of 1M elements, of
 course mostly are zero.
 It looks like HashingTF
 https://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf
 can help you.
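
 A rough sketch of the HashingTF route, using the RDD-based `org.apache.spark.mllib.feature.HashingTF` rather than the spark.ml class described on the linked page. It assumes the rows are (user, class, product) string triples; `hashProducts` and `rows` are made-up names. Note that HashingTF produces term counts, so two different products that hash to the same bucket add up and the result is not strictly 0/1.

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.2
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Assumption: each row is a (user, class, product) triple of strings.
def hashProducts(
    rows: RDD[(String, String, String)],
    numFeatures: Int = 1 << 20): RDD[((String, String), Vector)] = {
  val tf = new HashingTF(numFeatures)
  rows
    .map { case (user, cls, product) => ((user, cls), product) }
    .groupByKey()
    // toSet drops repeated products so each one is counted at most once;
    // hash collisions between distinct products can still yield values > 1.
    .mapValues(products => tf.transform(products.toSet))
}
```

 The upside is that no product-to-index table has to be built or shipped around; the downside is that the columns can no longer be mapped back to product names.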

 2015-08-07 11:02 GMT+08:00 praveen S mylogi...@gmail.com:

 Use StringIndexer in MLlib 1.4:

 https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/ml/feature/StringIndexer.html
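
 For reference, a minimal sketch of what the StringIndexer suggestion could look like. It needs Spark 1.4 or later and a DataFrame, and it assumes a string column named "product"; the column names and the helper `indexProducts` are invented for illustration. StringIndexer only assigns each product a numeric index (most frequent label first), so the per-user vectors would still have to be assembled from those indices.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

// Assumption: df has a string column called "product".
def indexProducts(df: DataFrame): DataFrame =
  new StringIndexer()
    .setInputCol("product")
    .setOutputCol("productIndex") // Double-valued index column
    .fit(df)                      // learns the label -> index mapping
    .transform(df)                // appends the index column
```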

 On Thu, Aug 6, 2015 at 8:49 PM, Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 I have a set of data based on which I want to create a classification
 model. Each row has the following form:

 user1,class1,product1
 user1,class1,product2
 user1,class1,product5
 user2,class1,product2
 user2,class1,product5
 user3,class2,product1
 etc


 There are about 1M users, 2 classes, and 1M products. What I would like
 to do next is create the sparse vectors (something already supported by
 MLlib) BUT in order to apply that function I have to create the dense
 vectors (with the 0s), first. In other words, I have to binarize my data.
 What's the easiest (or most elegant) way of doing that?


 *// Adamantios*







How to binarize data in spark

2015-08-06 Thread Adamantios Corais
I have a set of data based on which I want to create a classification
model. Each row has the following form:

user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1
etc


There are about 1M users, 2 classes, and 1M products. What I would like to
do next is create the sparse vectors (something already supported by MLlib)
BUT in order to apply that function I have to create the dense vectors
(with the 0s), first. In other words, I have to binarize my data. What's
the easiest (or most elegant) way of doing that?
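
For what it's worth, MLlib's `Vectors.sparse` takes the vector size plus the non-zero indices and values, so the dense form does not have to be materialized first. A small sketch for user1 from the sample above, assuming (for illustration only) that productN maps to column N-1 and that there are five product columns:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical mapping: productN -> column N-1, five columns in total.
val numProducts = 5

// user1 has product1, product2 and product5 -> columns 0, 1 and 4.
val user1: Vector = Vectors.sparse(numProducts, Array(0, 1, 4), Array(1.0, 1.0, 1.0))

// The dense equivalent, shown only for comparison.
val user1Dense: Vector = Vectors.dense(1.0, 1.0, 0.0, 0.0, 1.0)
```

Both values describe the same vector; the sparse one stores three entries regardless of how many product columns exist.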


*// Adamantios*