Spark will load the whole dataset. The sampling action can be viewed as a filter. The real implementation is more complicated, but this simple version conveys the idea:

    val rand = new Random()
    val subRdd = rdd.filter(x => rand.nextDouble() <= 0.3)
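To illustrate why this kind of per-record (Bernoulli) sampling needs no total count up front: each record is kept independently with probability 0.3, so roughly 30% of the data survives without ever counting it first. A minimal plain-Scala sketch (no Spark required; the seed and collection size are arbitrary choices for the example):

```scala
import scala.util.Random

object BernoulliSampleSketch {
  def main(args: Array[String]): Unit = {
    val rand = new Random(42)            // arbitrary fixed seed, for reproducibility
    val records = (1 to 10000).toVector  // stand-in for an RDD's records

    // Each record is tested independently, the same way the filter in the
    // reply above works: no pass over the data to count it is required.
    val sample = records.filter(_ => rand.nextDouble() <= 0.3)

    val fraction = sample.size.toDouble / records.size
    println(f"kept ${sample.size} of ${records.size} records ($fraction%.3f)")
  }
}
```

Note that this still touches every record, which matches the behaviour Patrick observed: sampling reduces downstream computation, not the cost of reading the input.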
To prevent recomputing data, you can cache the data or make a checkpoint
(http://alvincjin.blogspot.com/2014/12/cache-vs-checkpoint-in-spark.html).

2016-05-31 15:11 GMT+07:00 Patrick Baier <patrick.ba...@zalando.de>:

> I would assume that the driver has to count the number of lines in the
> json file anyway.
> Otherwise, how could it tell the workers which lines they should work on?
>
> 2016-05-31 10:03 GMT+02:00 Gavin Yue <yue.yuany...@gmail.com>:
>
>> If not reading the whole dataset, how do you know the total number of
>> records? If not knowing the total number, how do you choose 30%?
>>
>> > On May 31, 2016, at 00:45, pbaier <patrick.ba...@zalando.de> wrote:
>> >
>> > Hi all,
>> >
>> > I have the following use case:
>> > I have around 10k of jsons that I want to use for learning.
>> > The jsons are all stored in one file.
>> >
>> > For learning a ML model, however, I only need around 30% of the jsons
>> > (the rest is not needed at all).
>> > So, my idea was to load all data into a RDD and then use the rdd.sample
>> > method to get my fraction of the data.
>> > I implemented this, and in the end it took as long as loading the whole
>> > data set.
>> > So I was wondering if Spark is still loading the whole dataset from
>> > disk and does the filtering afterwards?
>> > If this is the case, why does Spark not push down the filtering and load
>> > only a fraction of data from the disk?
>> >
>> > Cheers,
>> >
>> > Patrick
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Behaviour-of-RDD-sampling-tp27052.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>
> --
> Patrick Baier
> Payment Analytics
>
> E-Mail: patrick.ba...@zalando.de