Re: MLUtil.kfold generates overlapped training and validation set?
Thanks, Xiangrui. I found the reason for the overlapping training and test sets…. Another counter-intuitive issue related to https://github.com/apache/spark/pull/2508

Best,

--
Nan Zhu

On Friday, October 10, 2014 at 2:19 AM, Xiangrui Meng wrote:

> 1. No.
>
> 2. The seed per partition is fixed. So it should generate
> non-overlapping subsets.
>
> 3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.
>
> Best,
> Xiangrui
>
> [...]
Re: MLUtil.kfold generates overlapped training and validation set?
1. No.

2. The seed per partition is fixed, so it should generate non-overlapping subsets.

3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.

Best,
Xiangrui

On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu wrote:
> Hi, all
>
> When we use MLUtils.kFold to generate training and validation sets for
> cross validation, we found that the two sets overlap….
>
> [...]
>
> 1. Does it mean we actually allow overlapping training and validation
> sets? (counter-intuitive to me)
>
> 2. Do I have some misunderstanding of the code?
>
> 3. Is it a bug?
>
> Can anyone explain it to me?
>
> Best,
>
> --
> Nan Zhu
MLUtil.kfold generates overlapped training and validation set?
Hi, all

When we use MLUtils.kFold to generate training and validation sets for cross validation, we found that the two sets overlap….

From the code, it samples the same dataset twice:

  @Experimental
  def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int): Array[(RDD[T], RDD[T])] = {
    val numFoldsF = numFolds.toFloat
    (1 to numFolds).map { fold =>
      val sampler = new BernoulliSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
        complement = false)
      val validation = new PartitionwiseSampledRDD(rdd, sampler, true, seed)
      val training = new PartitionwiseSampledRDD(rdd, sampler.cloneComplement(), true, seed)
      (training, validation)
    }.toArray
  }

Even though the training sampler is the complement of the validation sampler, it still seems possible to generate overlapping training and validation sets, because the sampling method looks like:

  override def sample(items: Iterator[T]): Iterator[T] = {
    items.filter { item =>
      val x = rng.nextDouble()
      (x >= lb && x < ub) ^ complement
    }
  }

I'm not a machine learning guy, so I guess I must fall into one of the following three situations:

1. Does it mean we actually allow overlapping training and validation sets? (counter-intuitive to me)

2. Do I have some misunderstanding of the code?

3. Is it a bug?

Can anyone explain it to me?

Best,

--
Nan Zhu
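[Editor's note: the crux of the answer above is that if both passes over a partition start from the same seed, each item draws the same x, so exactly one of the two complementary filters keeps it. A minimal standalone sketch of that argument in plain Scala (not the actual Spark classes; `sample` and `KFoldSketch` are hypothetical names for illustration):]

```scala
import scala.util.Random

// Standalone sketch: with the SAME seed, the in-range filter and its
// complement consume identical random sequences, so together they
// partition the data exactly -- no overlap, nothing dropped.
object KFoldSketch {
  def sample(items: Seq[Int], lb: Double, ub: Double,
             complement: Boolean, seed: Long): Seq[Int] = {
    val rng = new Random(seed) // fresh RNG per pass, like a fixed per-partition seed
    items.filter { _ =>
      val x = rng.nextDouble()
      (x >= lb && x < ub) ^ complement
    }
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 100
    // Fold 1 of 4: validation takes [0, 0.25), training takes the rest.
    val validation = sample(data, 0.0, 0.25, complement = false, seed = 42L)
    val training   = sample(data, 0.0, 0.25, complement = true,  seed = 42L)
    // Same seed => same x for each item => disjoint, exhaustive split.
    assert(validation.toSet.intersect(training.toSet).isEmpty)
    assert((validation ++ training).toSet == data.toSet)
    println(s"validation=${validation.size}, training=${training.size}")
  }
}
```

[If the two passes used different RNG states for the same partition — which is what the pre-1.0.1 bug amounted to — the same item could draw different x values in each pass and land in both sets, or in neither.]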