Re: MLUtil.kfold generates overlapped training and validation set?

2014-10-10 Thread Xiangrui Meng
1. No.
2. The seed per partition is fixed. So it should generate non-overlapping subsets.
3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.

Best,
Xiangrui

On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
> Hi, all When we use MLUtils.kfold to generate training

Re: MLUtil.kfold generates overlapped training and validation set?

2014-10-10 Thread Nan Zhu
Thanks, Xiangrui. I found the reason for the overlapping training and test sets .... Another counter-intuitive issue related to https://github.com/apache/spark/pull/2508

Best,
--
Nan Zhu

On Friday, October 10, 2014 at 2:19 AM, Xiangrui Meng wrote:
> 1. No. 2. The seed per partition

MLUtil.kfold generates overlapped training and validation set?

2014-10-09 Thread Nan Zhu
Hi, all. When we use MLUtils.kFold to generate training and validation sets for cross-validation, we found that the two sets overlap .... From the code, it samples the same dataset twice:

@Experimental def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed:
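
For readers following the thread, here is a minimal sketch of why sampling the same dataset twice can still yield disjoint sets when the seed is fixed, as Xiangrui's reply describes. It is not Spark's implementation: `KFoldSketch`, `sample`, and this `kFold` are hypothetical names working on plain Scala collections, and the sketch assumes the data is traversed in the same order on both passes. If the second pass saw different elements or a different order, the two draws would no longer line up and the sets could overlap.

```scala
import scala.util.Random

object KFoldSketch {
  // Hypothetical helper, not a Spark API: one pass over the data with a seeded RNG.
  // An element is kept when its draw falls in [lb, ub), or outside that range
  // when complement = true.
  def sample[T](data: Seq[T], lb: Double, ub: Double,
                complement: Boolean, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    data.filter { _ =>
      val x = rng.nextDouble()
      val inRange = x >= lb && x < ub
      inRange != complement
    }
  }

  // Returns (training, validation) pairs. Each fold samples the data twice
  // with the same seed, so every element receives the same draw in both passes.
  def kFold[T](data: Seq[T], numFolds: Int, seed: Long): Seq[(Seq[T], Seq[T])] =
    (0 until numFolds).map { fold =>
      val lb = fold.toDouble / numFolds
      val ub = (fold + 1).toDouble / numFolds
      val validation = sample(data, lb, ub, complement = false, seed)
      val training = sample(data, lb, ub, complement = true, seed)
      (training, validation)
    }

  def main(args: Array[String]): Unit = {
    val data = 1 to 100
    for ((training, validation) <- kFold(data, numFolds = 3, seed = 42L)) {
      // Same seed and same traversal order: the two sets partition the data.
      assert(training.intersect(validation).isEmpty)
      assert(training.size + validation.size == data.size)
    }
    println("all folds are disjoint")
  }
}
```

The disjointness here depends entirely on the second pass reproducing the first one exactly, which is the property to check when the input is recomputed rather than cached.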