>> On Tue, Nov 18, 2014 at 6:59 AM, Debasish Das
>> wrote:
>> > Sean,
>> >
>> > I thought sampleByKey (stratified sampling) in 1.1 was designed to solve
>> > the problem that randomSplit can't sample by key...
>> >
>> > Xiangrui,
>> >
> What's the expected behavior of sampleByKey ? In the dataset sampled using
> sampleByKey the keys should match the input dataset keys right ? If it is a
> bug, I can open up a JIRA and look into it...
>
> Thanks.
> Deb
>
> On Tue, Nov 18, 2014 at 1:34 AM, Sean Owe
I use randomSplit to make a train/CV/test set in one go. It definitely
produces disjoint data sets and is efficient. The problem is you can't
do it by key.
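
For reference, a minimal sketch of the randomSplit approach described above, run on toy data. The object name, the toy ratings, the 60/20/20 weights and the seed are all assumptions made for illustration and do not come from this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RandomSplitSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("random-split-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy stand-in for the ratings data: (userId, (movieId, rating)), all records distinct.
      val ratings = sc.parallelize(
        for (user <- 1 to 100; movie <- 1 to 20) yield (user, (movie, (movie % 5 + 1).toDouble))
      ).cache()

      // One call yields all three sets; a fixed seed makes the split repeatable.
      val Array(train, cv, test) = ratings.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)

      // Every record lands in exactly one split, so the splits are disjoint and cover the input.
      assert(train.count() + cv.count() + test.count() == ratings.count())
      assert(train.intersection(cv).count() == 0)
      assert(train.intersection(test).count() == 0)
      assert(cv.intersection(test).count() == 0)
    } finally {
      sc.stop()
    }
  }
}
```

The weights apply to the data set as a whole, so nothing here controls how many ratings of a given user end up in each split.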
I am not sure why your subtract does not work. I suspect it is because
the values do not partition the same way, or they don't evaluate
equali
Hi,
I have an RDD whose key is a userId and whose value is (movieId, rating).
I want to sample 80% of the (movieId, rating) pairs that each userId has seen for
training; the rest is for testing.
val indexedRating = sc.textFile(...).map { x => Rating(x(0), x(1), x(2)) }
val keyedRatings = indexedRating.map{x => (x.produ
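
One possible way to get the per-user 80/20 split asked about here is sampleByKey followed by subtract. The sketch below is only an illustration on toy data: the object name, the toy ratings and the seed are assumptions, and it is not the code from the original post. Note that sampleByKey samples each record independently, so the per-user share is only approximately 80%, and subtract relies on element equality, so it assumes every (userId, (movieId, rating)) record is distinct.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x; harmless on later versions

object PerUserSplitSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("per-user-split-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy data: 50 users, each with 40 distinct (movieId, rating) pairs.
      val keyedRatings = sc.parallelize(
        for (user <- 1 to 50; movie <- 1 to 40) yield (user, (movie, (movie % 5 + 1).toDouble))
      ).cache()

      // One sampling fraction per key: keep roughly 80% of each user's ratings.
      val fractions = keyedRatings.keys.distinct().collect().map(user => (user, 0.8)).toMap

      val train = keyedRatings
        .sampleByKey(withReplacement = false, fractions = fractions, seed = 42L)
        .cache()
      // Everything that was not sampled into train becomes the test set.
      val test = keyedRatings.subtract(train)

      // Sampled keys are a subset of the input keys.
      assert(train.keys.subtract(keyedRatings.keys).count() == 0)
      // Train and test are disjoint and together cover the input.
      assert(train.intersection(test).count() == 0)
      assert(train.count() + test.count() == keyedRatings.count())
    } finally {
      sc.stop()
    }
  }
}
```

If duplicate records are possible, subtract would drop every copy of a sampled record from the test set, so a unique id per rating would be needed before splitting this way.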