1. No.

2. The seed per partition is fixed, so the validation sampler and its
complement replay the same random draws and should generate
non-overlapping subsets (see the sketch below).

3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.
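
To make point 2 concrete, here is a minimal sketch of why a fixed seed keeps
the two subsets disjoint. It uses plain scala.util.Random instead of Spark's
BernoulliSampler/PartitionwiseSampledRDD, and the object name and numbers are
made up for illustration: because both passes are seeded identically, every
item gets the same draw x, which is either inside [lb, ub) (validation) or
outside it (training), never both.

import scala.util.Random

object DisjointFoldsSketch {
  def main(args: Array[String]): Unit = {
    val items = (1 to 100).toList
    val seed = 42L
    // Bounds for the first of three folds, mirroring (fold - 1) / numFoldsF and fold / numFoldsF.
    val lb = 0.0
    val ub = 1.0 / 3.0

    // Validation pass: keep items whose draw lands inside [lb, ub).
    val rngV = new Random(seed)
    val validation = items.filter { _ =>
      val x = rngV.nextDouble()
      x >= lb && x < ub
    }

    // Training pass: the same fixed seed replays the same draws; keep the complement.
    val rngT = new Random(seed)
    val training = items.filter { _ =>
      val x = rngT.nextDouble()
      !(x >= lb && x < ub)
    }

    // Identical draws => the two subsets are disjoint and together cover every item.
    assert(validation.toSet.intersect(training.toSet).isEmpty)
    assert(validation.size + training.size == items.size)
    println(s"validation=${validation.size}, training=${training.size}, overlap=0")
  }
}

If the two passes were seeded differently, the draws would no longer line up,
overlap could occur, and the assertions above could fail.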

Best,
Xiangrui

On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
> Hi, all
>
> When we use MLUtils.kFold to generate training and validation sets for
> cross validation,
>
> we found that the two sets overlap….
>
> From the code, it samples the same dataset twice
>
>   @Experimental
>   def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int): Array[(RDD[T], RDD[T])] = {
>     val numFoldsF = numFolds.toFloat
>     (1 to numFolds).map { fold =>
>       val sampler = new BernoulliSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
>         complement = false)
>       val validation = new PartitionwiseSampledRDD(rdd, sampler, true, seed)
>       val training = new PartitionwiseSampledRDD(rdd, sampler.cloneComplement(), true, seed)
>       (training, validation)
>     }.toArray
>   }
>
> Even though the second sampler is the complement of the first, there
> still seems to be a possibility of generating overlapping training and
> validation sets,
>
> because the sampling method looks like:
>
>   override def sample(items: Iterator[T]): Iterator[T] = {
>     items.filter { item =>
>       val x = rng.nextDouble()
>       (x >= lb && x < ub) ^ complement
>     }
>   }
>
> I’m not a machine learning guy, so I guess I must be in one of the
> following three situations:
>
> 1. Does it mean we actually allow overlapping training and validation
> sets? (counterintuitive to me)
>
> 2. Am I misunderstanding the code?
>
> 3. Is it a bug?
>
> Can anyone explain it to me?
>
> Best,
>
> --
> Nan Zhu
>
