Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
e, Nov 18, 2014 at 6:59 AM, Debasish Das
>> wrote:
>> > Sean,
>> >
>> > I thought sampleByKey (stratified sampling) in 1.1 was designed to solve
>> > the problem that randomSplit can't sample by key...
>> >
>> > Xiangrui,
>> >

Re: Using sampleByKey

2014-11-18 Thread Debasish Das
it can't sample by key...
> >
> > Xiangrui,
> >
> > What's the expected behavior of sampleByKey? In the dataset sampled using
> > sampleByKey the keys should match the input dataset keys, right? If it is a
> > bug, I can open up a JIRA and look into

Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
rui,
>
> What's the expected behavior of sampleByKey? In the dataset sampled using
> sampleByKey the keys should match the input dataset keys, right? If it is a
> bug, I can open up a JIRA and look into it...
>
> Thanks.
> Deb
>
> On Tue, Nov 18, 2014 at 1:34 AM, Sean Owe

Re: Using sampleByKey

2014-11-18 Thread Debasish Das
Sean,

I thought sampleByKey (stratified sampling) in 1.1 was designed to solve the problem that randomSplit can't sample by key...

Xiangrui,

What's the expected behavior of sampleByKey? In the dataset sampled using sampleByKey the keys should match the input dataset keys, right? If i

Re: Using sampleByKey

2014-11-18 Thread Sean Owen
I use randomSplit to make a train/CV/test set in one go. It definitely produces disjoint data sets and is efficient. The problem is you can't do it by key. I am not sure why your subtract does not work. I suspect it is because the values do not partition the same way, or they don't evaluate equali
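
A minimal sketch of the randomSplit approach mentioned in this message, assuming a local SparkContext and made-up toy data. The object and method names, the weights, and the seed are illustrative and do not come from the thread. It splits one RDD into three disjoint train/CV/test sets in a single call; the split is over whole records and is not stratified by key.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RandomSplitSketch {
  // Hypothetical helper: split records roughly 80/10/10 into train/CV/test.
  // randomSplit normalizes the weights and samples each record independently,
  // so the resulting sizes are approximate, not exact.
  def trainCvTest(data: RDD[(Int, (Int, Double))],
                  seed: Long): (RDD[(Int, (Int, Double))], RDD[(Int, (Int, Double))], RDD[(Int, (Int, Double))]) = {
    val Array(train, cv, test) = data.randomSplit(Array(0.8, 0.1, 0.1), seed)
    (train, cv, test)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("randomSplit-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy (userId, (movieId, rating)) records, all distinct.
      val ratings = sc.parallelize(1 to 1000).map(i => (i % 10, (i, (i % 5 + 1).toDouble)))
      val (train, cv, test) = trainCvTest(ratings, seed = 11L)
      println(s"train=${train.count()} cv=${cv.count()} test=${test.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

Because the three pieces come from the same seeded pass over the data, every input record lands in exactly one of them, provided the input RDD is computed deterministically.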

Using sampleByKey

2014-11-17 Thread Debasish Das
Hi,

I have an RDD whose key is a userId and value is (movieId, rating)...

I want to sample 80% of the (movieId, rating) pairs that each userId has seen for train; the rest is for test...

val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}
val keyedRatings = indexedRating.map{x => (x.produ
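
A hedged sketch of one way to get the per-user 80/20 split asked about here, using `sampleByKey` for the training set and `subtract` for the remainder. The object name, helper name, toy data, and seed are assumptions for illustration, not the poster's code. Note that `sampleByKey` samples each record independently, so each user gets roughly (not exactly) 80% of their ratings; `sampleByKeyExact` gives exact per-key sizes at the cost of extra passes over the data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, required on Spark 1.1/1.2
import org.apache.spark.rdd.RDD

object SampleByKeySketch {
  // Hypothetical helper: stratified train/test split keyed by userId.
  def stratifiedSplit(keyedRatings: RDD[(Int, (Int, Double))],
                      fraction: Double,
                      seed: Long): (RDD[(Int, (Int, Double))], RDD[(Int, (Int, Double))]) = {
    // sampleByKey needs one sampling fraction per key present in the data.
    val fractions: Map[Int, Double] =
      keyedRatings.keys.distinct().collect().map(k => (k, fraction)).toMap
    val train = keyedRatings.sampleByKey(withReplacement = false, fractions = fractions, seed = seed)
    // subtract relies on equals/hashCode of the records; tuples of primitives
    // compare by value, so the test set is everything not drawn for training.
    val test = keyedRatings.subtract(train)
    (train, test)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sampleByKey-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy (userId, (movieId, rating)) records, all distinct.
      val keyedRatings = sc.parallelize(1 to 200).map(i => (i % 4, (i, (i % 5 + 1).toDouble)))
      val (train, test) = stratifiedSplit(keyedRatings, fraction = 0.8, seed = 42L)
      println(s"train=${train.count()} test=${test.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

The keys of the sampled RDD are always a subset of the input keys, since sampling only drops records and never rewrites them. If the records were instances of a class without value-based equality, `subtract` could fail to remove them, which is one possible explanation for the behavior discussed in the replies.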