I have ended up with the following piece of code, but it turns out to be really slow... Any other ideas, given that I can only use MLlib 1.2?
val data = test11.map(x => ((x(0), x(1)), x(2)))
  .groupByKey()
  .map(x => (x._1, x._2.toArray))
  .map { x =>
    // One slot per known product; set to 1.0 where the user bought it.
    val lt: Array[Double] = new Array[Double](test12.size)
    val id = x._1._1
    val cl = x._1._2
    val dt = x._2
    var i = -1
    test12.foreach { y => i += 1; lt(i) = if (dt contains y) 1.0 else 0.0 }
    val vs = Vectors.dense(lt)
    (id, cl, vs)
  }

*// Adamantios*

On Fri, Aug 7, 2015 at 8:36 AM, Yanbo Liang <yblia...@gmail.com> wrote:

> I think you want to flatten the 1M products into a vector of 1M elements,
> most of which are, of course, zero.
> It looks like HashingTF
> <https://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf>
> can help you.
>
> 2015-08-07 11:02 GMT+08:00 praveen S <mylogi...@gmail.com>:
>
>> Use StringIndexer in MLlib 1.4:
>>
>> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/ml/feature/StringIndexer.html
>>
>> On Thu, Aug 6, 2015 at 8:49 PM, Adamantios Corais <
>> adamantios.cor...@gmail.com> wrote:
>>
>>> I have a set of data based on which I want to create a classification
>>> model. Each row has the following form:
>>>
>>>> user1,class1,product1
>>>> user1,class1,product2
>>>> user1,class1,product5
>>>> user2,class1,product2
>>>> user2,class1,product5
>>>> user3,class2,product1
>>>> etc
>>>
>>> There are about 1M users, 2 classes, and 1M products. What I would like
>>> to do next is create the sparse vectors (something already supported by
>>> MLlib), BUT in order to apply that function I first have to create the
>>> dense vectors (with the 0s). In other words, I have to binarize my data.
>>> What's the easiest (or most elegant) way of doing that?
>>>
>>> *// Adamantios*
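
A possible alternative, sketched under the assumptions implied by the snippet above (that `test11` is an RDD of `Array(user, class, product)` rows, `test12` is a local collection of all distinct products, and `sc` is the SparkContext): MLlib 1.2 already provides `Vectors.sparse`, so the dense array and the per-user scan over all 1M products can be skipped entirely by broadcasting a product-to-index map and emitting only the non-zero indices. This is untested at scale, not a definitive fix:

```scala
import org.apache.spark.mllib.linalg.Vectors

// Broadcast a product -> column-index map so each task does O(1) lookups
// instead of scanning every product for every user.
val index = sc.broadcast(test12.zipWithIndex.toMap)
val n = test12.size

val data = test11
  .map(x => ((x(0), x(1)), x(2)))
  .groupByKey()
  .map { case ((id, cl), products) =>
    // Materialize only the non-zero entries; Vectors.sparse expects
    // the indices in ascending order, hence the sort.
    val idx = products.map(index.value).toArray.distinct.sorted
    (id, cl, Vectors.sparse(n, idx, Array.fill(idx.length)(1.0)))
  }
```

The sparse representation also avoids allocating a 1M-element `Array[Double]` per user, which is likely where most of the time in the dense version goes.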