Creating a single java.util.Random up front and calling it inside map is not really the best way to handle random number generation. There have been multiple discussions on the spark-users list (https://groups.google.com/forum/?fromgroups=#!forum/spark-users) and elsewhere about how to use mapPartitions or mapWith to create higher-performance Spark code that uses PRNGs.
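
For what it's worth, here is a minimal sketch of the per-partition idea. It assumes a Spark version where RDD.mapPartitionsWithIndex is available and where ClassTag has replaced ClassManifest. The names SplitSketch and tagPartition, and the seed + index seeding scheme, are my own illustrative choices and are not taken from the list discussions.

```scala
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

object SplitSketch {
  // Tag every element of one partition with a Bernoulli(p) draw.
  // One generator is created per partition and seeded from (seed, partition index),
  // so partitions do not all replay the same random sequence.
  // Assumption: seed + index is good enough here; a stronger seed-mixing step may be preferable.
  def tagPartition[T](index: Int, items: Iterator[T], p: Double, seed: Long): Iterator[(T, Boolean)] = {
    val rng = new Random(seed + index)
    items.map(x => (x, rng.nextDouble() < p))
  }

  // Split data into (selected, rest), selecting each element with probability p.
  def split[T: ClassTag](data: RDD[T], p: Double, seed: Long): (RDD[T], RDD[T]) = {
    val tagged = data.mapPartitionsWithIndex((index, items) => tagPartition(index, items, p, seed))
    (tagged.filter(_._2).map(_._1), tagged.filter(!_._2).map(_._1))
  }
}
```

With points as an RDD[DataPoint], something like `val (trainingSet, testSet) = SplitSketch.split(points, 0.6, 7L)` should give roughly a 60/40 split. Note that both filters recompute the tagged RDD unless it is cached; the two halves stay consistent only as long as each partition yields its elements in the same order on every pass, because the generator is reseeded the same way each time.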

On Thu, Sep 12, 2013 at 9:55 AM, Jason Lenderman <[email protected]> wrote:

> Even if it worked, using subtract doesn't seem like a good way to achieve
> this. You could try something like:
>
> def split[T : ClassManifest](data: RDD[T], p: Double, seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
>   val rand = new java.util.Random(seed)
>   val temp = data.map(x => (x, rand.nextDouble))
>   (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
> }
>
> Note: this code compiles, but I haven't tested it yet...
>
>
> On Thu, Sep 12, 2013 at 1:18 AM, Hao REN <[email protected]> wrote:
>
>> Hi,
>>
>> I am writing a logistic regression program with Spark, based on the
>> SparkLR example.
>>
>> Say I have a data set containing 10000 DataPoints, where DataPoint is a
>> case class like:
>>
>> case class DataPoint(x: Vector, y: Double)
>>
>> as defined in the SparkLR example.
>>
>> In order to divide the data set into 2 parts, a training set and a test
>> set, I tried the code below:
>>
>> val trainingSet = points.sample(false, 0.6, 7)
>> val testSet = points.subtract(trainingSet)
>>
>> where points is an RDD[DataPoint] containing 10000 points.
>>
>> sample works well: trainingSet.count gives a number around 6000, but
>> testSet.count gives 10000, which is not the expected 4000.
>>
>> It seems that subtract can't work with some custom classes, such as
>> DataPoint here.
>>
>> 2 questions:
>>
>> 1) What is the best way to divide data with a ratio, say 6/4, especially
>> when the data is not a primitive type but some custom class?
>>
>> 2) Why doesn't subtract work? Maybe ordering and compare should be
>> implemented for the DataPoint class?
>>
>> I have also checked the SubtractedRDD class. Without background on the
>> Spark source code, I cannot understand what the problem is.
>>
>> https://github.com/mesos/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/SubtractedRDD.scala
>>
>> Any help is highly appreciated!
>>
>> Thank you in advance. =)
>>
>> Hao
>>
>> --
>> REN Hao
>>
>> Exchange student at the Ecole Polytechnique Fédérale de Lausanne (EPFL)
>> Computer Science
>>
>> Student at the Université de Technologie de Compiègne (UTC)
>> Computer Engineering - Data Mining
>>
>> Tel: +33 06 14 54 57 24 / +41 07 86 47 52 69
