1. No.
2. The seed per partition is fixed, so the two sampling passes should
generate non-overlapping subsets.
3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.
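To illustrate point 2, here is a minimal sketch in plain Scala. It is not Spark's actual implementation: `FixedSeedFolds` and `sample` are hypothetical names made up for this example, and a local `Seq` stands in for one partition. The idea is that if both passes replay the same seeded random draws, keeping the elements whose draw falls inside a range in one pass and outside it in the other splits the data cleanly.

```scala
import scala.util.Random

object FixedSeedFolds {
  // Hypothetical helper (not a Spark API): draw one uniform number per element
  // from a seeded RNG and keep the element if the draw lies in [lb, ub),
  // or outside that range when complement = true.
  def sample[T](data: Seq[T], seed: Long, lb: Double, ub: Double,
                complement: Boolean): Seq[T] = {
    val rng = new Random(seed)
    data.filter { _ =>
      val x = rng.nextDouble()
      val inRange = x >= lb && x < ub
      inRange != complement
    }
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 100
    val numFolds = 3
    val seed = 42L // assumed value; any fixed seed works
    for (fold <- 0 until numFolds) {
      val lb = fold.toDouble / numFolds
      val ub = (fold + 1).toDouble / numFolds
      // Same seed in both passes, so each element sees the same draw twice.
      val validation = sample(data, seed, lb, ub, complement = false)
      val training = sample(data, seed, lb, ub, complement = true)
      assert(validation.intersect(training).isEmpty)
      assert((validation ++ training).sorted == data)
      println(s"fold $fold: ${validation.size} validation, ${training.size} training")
    }
  }
}
```

In this sketch the guarantee depends entirely on both passes using the same seed and visiting the elements in the same order; with different seeds the two subsets could overlap or miss elements.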
Best,
Xiangrui
On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
Hi, all
When we use MLUtils.kFold to generate training
Thanks, Xiangrui,
I found the reason for the overlap between the training set and the test set
….
Another counter-intuitive issue related to
https://github.com/apache/spark/pull/2508
Best,
--
Nan Zhu
On Friday, October 10, 2014 at 2:19 AM, Xiangrui Meng wrote:
1. No.
2. The seed per partition
Hi, all
When we use MLUtils.kFold to generate training and validation sets for
cross-validation,
we found that the two sets overlap….
From the code, it samples the same dataset twice:
@Experimental
def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: