Re: MLUtil.kfold generates overlapped training and validation set?

2014-10-10 Thread Xiangrui Meng
1. No.

2. The seed per partition is fixed, so the two complementary samplers
replay the same random draws and generate non-overlapping subsets.

3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.
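
To see why a fixed seed yields disjoint folds, here is a minimal sketch outside Spark (plain Python, not the Spark API; the function name and the simulation of per-item draws are mine for illustration). Because every sampler replays the same random sequence, each item gets one draw x in [0, 1) and falls into exactly one fold interval [lb, ub):

```python
import random

def fold_assignments(items, num_folds, seed):
    """Simulate BernoulliSampler-style k-fold with a fixed seed: each item
    draws one x in [0, 1); the fold whose interval [lb, ub) contains x takes
    it as validation data, and every other fold takes it as training data."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in items]  # identical draws for every sampler
    folds = []
    for fold in range(1, num_folds + 1):
        lb, ub = (fold - 1) / num_folds, fold / num_folds
        validation = [it for it, x in zip(items, xs) if lb <= x < ub]
        training = [it for it, x in zip(items, xs) if not (lb <= x < ub)]
        folds.append((training, validation))
    return folds

folds = fold_assignments(list(range(100)), num_folds=3, seed=42)
for training, validation in folds:
    assert not set(training) & set(validation)  # disjoint within each fold
# every item lands in exactly one validation set across the folds
assert sorted(x for _, v in folds for x in v) == list(range(100))
```

If the two samplers drew fresh random numbers instead of sharing the seed, an item could fall inside the interval for one sampler and outside it for the other, which is exactly the overlap described below.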

Best,
Xiangrui

On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu zhunanmcg...@gmail.com wrote:
 Hi, all

 When we use MLUtils.kFold to generate training and validation sets for cross
 validation,

 we found that there is an overlapping part in the two sets….

 From the code, it samples the same dataset twice:

  @Experimental
  def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int):
      Array[(RDD[T], RDD[T])] = {
    val numFoldsF = numFolds.toFloat
    (1 to numFolds).map { fold =>
      val sampler = new BernoulliSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
        complement = false)
      val validation = new PartitionwiseSampledRDD(rdd, sampler, true, seed)
      val training = new PartitionwiseSampledRDD(rdd, sampler.cloneComplement(), true, seed)
      (training, validation)
    }.toArray
  }

 even though the training sampler is the complement of the validation sampler,
 there is still a possibility of generating overlapping training and validation
 sets,

 because the sampling method looks like:

  override def sample(items: Iterator[T]): Iterator[T] = {
    items.filter { item =>
      val x = rng.nextDouble()
      (x >= lb && x < ub) ^ complement
    }
  }

 I’m not a machine learning guy, so I guess I must fall into one of the
 following three situations:

 1. Does it mean we actually allow overlapping training and validation sets?
 (counter-intuitive to me)

 2. Do I have some misunderstanding of the code?

 3. Is it a bug?

 Can anyone explain this to me?

 Best,

 --
 Nan Zhu
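
[As an aside on the quoted sample method: the `^ complement` check is what lets `cloneComplement()` select exactly the opposite items when both samplers replay identical random draws. A minimal sketch of that selection logic (plain Python, outside Spark; the bounds and sample values are made up for illustration):]

```python
# Mirror of `(x >= lb && x < ub) ^ complement`: with complement=False the
# filter keeps draws inside [lb, ub); with complement=True the XOR flips
# the test, keeping exactly the draws outside the interval.
def keep(x, lb, ub, complement):
    return (lb <= x < ub) ^ complement

xs = [0.05, 0.2, 0.5, 0.8, 0.95]
inside = [x for x in xs if keep(x, 0.0, 0.5, complement=False)]
outside = [x for x in xs if keep(x, 0.0, 0.5, complement=True)]
assert inside == [0.05, 0.2]
assert outside == [0.5, 0.8, 0.95]
assert sorted(inside + outside) == xs  # the two selections partition xs
```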


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLUtil.kfold generates overlapped training and validation set?

2014-10-10 Thread Nan Zhu
Thanks, Xiangrui,   

I found the reason for the overlapping training and test sets

….

Another counter-intuitive issue is related to
https://github.com/apache/spark/pull/2508

Best,  

--  
Nan Zhu


On Friday, October 10, 2014 at 2:19 AM, Xiangrui Meng wrote:

 1. No.
  
 2. The seed per partition is fixed. So it should generate
 non-overlapping subsets.
  
 3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.
  
 Best,
 Xiangrui
  