Sean, I thought sampleByKey (stratified sampling) in 1.1 was designed to solve the problem that randomSplit can't sample by key...

Xiangrui,

What's the expected behavior of sampleByKey? In the dataset sampled using
sampleByKey, the keys should match the input dataset keys, right? If it is a
bug, I can open a JIRA and look into it...

Thanks.
Deb

On Tue, Nov 18, 2014 at 1:34 AM, Sean Owen <so...@cloudera.com> wrote:

> I use randomSplit to make a train/CV/test set in one go. It definitely
> produces disjoint data sets and is efficient. The problem is you can't
> do it by key.
>
> I am not sure why your subtract does not work. I suspect it is because
> the values do not partition the same way, or they don't evaluate
> equality in the expected way, but I don't see any reason why. Tuples
> work as expected here.
>
> On Tue, Nov 18, 2014 at 4:32 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
> > Hi,
> >
> > I have an RDD whose key is a userId and value is (movieId, rating)...
> >
> > I want to sample 80% of the (movieId,rating) that each userId has seen for
> > train, rest is for test...
> >
> > val indexedRating = sc.textFile(...).map{x=> Rating(x(0), x(1), x(2))
> >
> > val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
> >
> > val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
> >
> > val keyedTest = keyedRatings.subtract(keyedTraining)
> >
> > blocks = sc.maxParallelism
> >
> > println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
> >
> > println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
> >
> > println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
> >
> > My expectation was that the println will produce the same number of keys
> > for keyedRatings, keyedTraining and keyedTest, but this is not the case...
> >
> > On MovieLens for example I am noticing the following:
> >
> > Rating keys 3706
> >
> > Training keys 3676
> >
> > Test keys 3470
> >
> > I also tried sampleByKey as follows:
> >
> > val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
> >
> > val fractions = keyedRatings.map{x=> (x._1, 0.8)}.collect.toMap
> >
> > val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
> >
> > val keyedTest = keyedRatings.subtract(keyedTraining)
> >
> > Still I get the results as:
> >
> > Rating keys 3706
> >
> > Training keys 3682
> >
> > Test keys 3459
> >
> > Any idea what is wrong here...
> >
> > Are my assumptions about the behavior of sample/sampleByKey on a key-value
> > RDD correct? If this is a bug I can dig deeper...
> >
> > Thanks.
> >
> > Deb