1. No. 2. The seed per partition is fixed, so the sampler and its cloned complement replay the same random sequence and generate non-overlapping subsets.
3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.

Best,
Xiangrui

On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
> Hi, all
>
> When we use MLUtils.kFold to generate training and validation sets for cross
> validation, we found that there is an overlapping part in the two sets.
>
> From the code, it samples the same dataset twice:
>
> @Experimental
> def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int):
>     Array[(RDD[T], RDD[T])] = {
>   val numFoldsF = numFolds.toFloat
>   (1 to numFolds).map { fold =>
>     val sampler = new BernoulliSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
>       complement = false)
>     val validation = new PartitionwiseSampledRDD(rdd, sampler, true, seed)
>     val training = new PartitionwiseSampledRDD(rdd, sampler.cloneComplement(), true, seed)
>     (training, validation)
>   }.toArray
> }
>
> Even though the sampler is complemented, there still seems to be a possibility of
> generating overlapping training and validation sets, because the sampling method
> looks like:
>
> override def sample(items: Iterator[T]): Iterator[T] = {
>   items.filter { item =>
>     val x = rng.nextDouble()
>     (x >= lb && x < ub) ^ complement
>   }
> }
>
> I'm not a machine learning guy, so I guess I must fall into one of the
> following three situations:
>
> 1. Does it mean we actually allow overlapping training and validation sets?
>    (counterintuitive to me)
>
> 2. Did I misunderstand the code?
>
> 3. Is it a bug?
>
> Can anyone explain it to me?
>
> Best,
>
> --
> Nan Zhu
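To make the fixed-seed argument concrete, here is a minimal Python sketch (not the Spark API; `bernoulli_sample` and `k_fold` are hypothetical stand-ins for `BernoulliSampler.sample` and `MLUtils.kFold`) showing why the same seed on both passes makes the complement exact: each pass replays the identical random sequence, so every item gets the same draw `x`, and the predicate `lb <= x < ub` and its complement split the items disjointly.

```python
import random

def bernoulli_sample(items, lb, ub, seed, complement=False):
    # Mirrors the sample method quoted above: one random draw per item,
    # filtered by the acceptance interval [lb, ub) or its complement.
    rng = random.Random(seed)
    return [item for item in items
            if (lb <= rng.random() < ub) ^ complement]

def k_fold(items, num_folds, seed):
    # Mirrors the kFold structure: fold i keeps [(i-1)/k, i/k) as
    # validation and its complement as training, both with the SAME seed.
    k = float(num_folds)
    folds = []
    for fold in range(1, num_folds + 1):
        lb, ub = (fold - 1) / k, fold / k
        validation = bernoulli_sample(items, lb, ub, seed)
        training = bernoulli_sample(items, lb, ub, seed, complement=True)
        folds.append((training, validation))
    return folds

data = list(range(1000))
for training, validation in k_fold(data, num_folds=3, seed=42):
    # Same seed on both passes => no overlap, and together they
    # cover the whole dataset.
    assert not (set(training) & set(validation))
    assert sorted(training + validation) == data
```

Because every fold also uses the same seed, each item's draw `x` lands in exactly one fold's interval, so the validation sets across folds partition the data as well. The bug mentioned above would correspond to the two passes consuming different random sequences, which breaks this guarantee.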