Re: Using sampleByKey

Debasish Das Tue, 18 Nov 2014 07:03:48 -0800

Sean,

I thought sampleByKey (stratified sampling) in 1.1 was designed to solve
the problem that randomSplit can't sample by key...


Xiangrui,

What's the expected behavior of sampleByKey? In the dataset sampled using
sampleByKey, the keys should match the input dataset keys, right? If it is a
bug, I can open up a JIRA and look into it...

Thanks.
Deb
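
A minimal sketch in plain Scala (no Spark) of why the key counts quoted below
can differ. It assumes that sampling without replacement keeps each record
independently with the fraction given for its key, which is my understanding
of sampleByKey; sampleByKeyExact is the variant documented to return an exact
per-key sample size. The names bernoulliSampleByKey and probKeyMissing and the
toy data are hypothetical, made up for illustration. Under that assumption a
key with n records is absent from the sample with probability (1 - fraction)^n,
and a key with a single record can never appear in both training and test.

```scala
import scala.util.Random

object SampleByKeySketch {
  // Keep each record independently with the fraction configured for its key.
  // Keys without a configured fraction are dropped (treated as fraction 0.0).
  def bernoulliSampleByKey[K, V](
      data: Seq[(K, V)],
      fractions: Map[K, Double],
      seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.filter { case (k, _) => rng.nextDouble() < fractions.getOrElse(k, 0.0) }
  }

  // Probability that a key with n records has none of them in the sample.
  def probKeyMissing(n: Int, fraction: Double): Double =
    math.pow(1.0 - fraction, n)

  def main(args: Array[String]): Unit = {
    // Hypothetical toy data: product 1 has one rating, product 2 has fifty.
    val ratings: Seq[(Int, (Int, Double))] =
      Seq((1, (100, 4.0))) ++ (0 until 50).map(u => (2, (u, 3.0)))
    val fractions = ratings.map(_._1).distinct.map(k => k -> 0.8).toMap

    val training = bernoulliSampleByKey(ratings, fractions, seed = 1L)
    // Records here are distinct, so diff acts like subtract on the RDD.
    val test = ratings.diff(training)

    println(s"Rating keys ${ratings.map(_._1).distinct.size}")
    println(s"Training keys ${training.map(_._1).distinct.size}")
    println(s"Test keys ${test.map(_._1).distinct.size}")
    println(f"P(key with 1 rating missing from training) = ${probKeyMissing(1, 0.8)}%.2f")
  }
}
```

If that assumption holds, products with only a handful of ratings would
account for the gaps: the test side keeps roughly 20% of the records, so it
loses more keys than the training side.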

On Tue, Nov 18, 2014 at 1:34 AM, Sean Owen <so...@cloudera.com> wrote:

> I use randomSplit to make a train/CV/test set in one go. It definitely
> produces disjoint data sets and is efficient. The problem is you can't
> do it by key.
>
> I am not sure why your subtract does not work. I suspect it is because
> the values do not partition the same way, or they don't evaluate
> equality in the expected way, but I don't see any reason why. Tuples
> work as expected here.
>
> On Tue, Nov 18, 2014 at 4:32 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
> > Hi,
> >
> > I have an RDD whose key is a userId and value is (movieId, rating)...
> >
> > I want to sample 80% of the (movieId, rating) pairs that each userId has
> > seen for train; the rest is for test...
> >
> > val indexedRating = sc.textFile(...).map{x=> Rating(x(0), x(1), x(2))}
> >
> > val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
> >
> > val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
> >
> > val keyedTest = keyedRatings.subtract(keyedTraining)
> >
> > val blocks = sc.maxParallelism
> >
> > println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
> >
> > println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
> >
> > println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
> >
> > My expectation was that the println calls would report the same number of
> > keys for keyedRatings, keyedTraining and keyedTest, but this is not the case...
> >
> > On MovieLens for example I am noticing the following:
> >
> > Rating keys 3706
> >
> > Training keys 3676
> >
> > Test keys 3470
> >
> > I also tried sampleByKey as follows:
> >
> > val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
> >
> > val fractions = keyedRatings.map{x=> (x._1, 0.8)}.collect.toMap
> >
> > val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
> >
> > val keyedTest = keyedRatings.subtract(keyedTraining)
> >
> > Still I get the results as:
> >
> > Rating keys 3706
> >
> > Training keys 3682
> >
> > Test keys 3459
> >
> > Any idea what is wrong here?
> >
> > Are my assumptions about the behavior of sample/sampleByKey on a key-value
> > RDD correct? If this is a bug I can dig deeper...
> >
> > Thanks.
> >
> > Deb
>
