Thanks, Xiangrui, I found the reason for the overlapping training and test sets
…. Another counter-intuitive issue related to https://github.com/apache/spark/pull/2508

Best,

--
Nan Zhu

On Friday, October 10, 2014 at 2:19 AM, Xiangrui Meng wrote:

> 1. No.
>
> 2. The seed per partition is fixed. So it should generate
> non-overlapping subsets.
>
> 3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.
>
> Best,
> Xiangrui
>
> On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
> > Hi, all
> >
> > When we use MLUtils.kFold to generate training and validation sets for
> > cross-validation, we found that there is an overlapping part in the two
> > sets.
> >
> > From the code, it samples the same dataset twice:
> >
> > @Experimental
> > def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int): Array[(RDD[T], RDD[T])] = {
> >   val numFoldsF = numFolds.toFloat
> >   (1 to numFolds).map { fold =>
> >     val sampler = new BernoulliSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
> >       complement = false)
> >     val validation = new PartitionwiseSampledRDD(rdd, sampler, true, seed)
> >     val training = new PartitionwiseSampledRDD(rdd, sampler.cloneComplement(), true, seed)
> >     (training, validation)
> >   }.toArray
> > }
> >
> > Even though the second sampler is the complement, there still seems to be
> > a possibility of generating overlapping training and validation sets,
> > because the sampling method looks like this:
> >
> > override def sample(items: Iterator[T]): Iterator[T] = {
> >   items.filter { item =>
> >     val x = rng.nextDouble()
> >     (x >= lb && x < ub) ^ complement
> >   }
> > }
> >
> > I'm not a machine learning guy, so I guess I must fall into one of the
> > following three situations:
> >
> > 1. Does it mean we actually allow overlapping training and validation
> > sets? (counter-intuitive to me)
> >
> > 2. Did I misunderstand the code?
> >
> > 3. Is it a bug?
> >
> > Can anyone explain this to me?
> >
> > Best,
> >
> > --
> > Nan Zhu
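[Editor's note: the point Xiangrui makes in (2) is that both `PartitionwiseSampledRDD`s are driven by the *same* seed per partition, so the sampler and its complement see identical `rng.nextDouble()` draws for each item, and the XOR in `sample` makes the split an exact partition. The following is a minimal Python sketch of that argument, not Spark code; `bernoulli_split` is a hypothetical helper that mimics the `BernoulliSampler` logic above.]

```python
import random

def bernoulli_split(items, lb, ub, seed):
    """Mimic BernoulliSampler and its complement under a shared seed:
    one uniform draw per item decides which side of the split it lands on."""
    rng = random.Random(seed)  # fixed seed == fixed draw sequence, as in Spark's per-partition seeding
    validation, training = [], []
    for item in items:
        x = rng.random()
        # (x >= lb && x < ub) ^ complement: same x for both samplers,
        # so each item goes to exactly one side.
        if lb <= x < ub:
            validation.append(item)
        else:
            training.append(item)
    return training, validation

items = list(range(1000))
seed, k = 42, 5
for fold in range(1, k + 1):
    training, validation = bernoulli_split(items, (fold - 1) / k, fold / k, seed)
    # Because the same seed drives both "samplers", the split is an exact partition:
    assert set(training) & set(validation) == set()
    assert sorted(training + validation) == items
```

If the two RDDs used *different* seeds (as the 1.0 bug Xiangrui mentions apparently caused), the draws would no longer line up and the overlap described above would appear.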