Spark will load the whole dataset.
The sampling action can be viewed as a filter. The real implementation is
more complicated, but this simple implementation gives you the idea:

import scala.util.Random

val rand = new Random()  // serialized into the closure; each task gets its own copy
val subRdd = rdd.filter(_ => rand.nextDouble() <= 0.3)
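
For comparison, the built-in RDD.sample API amounts to the same kind of
per-record coin flip (a sketch; the optional seed argument is omitted):

val sampled = rdd.sample(withReplacement = false, fraction = 0.3)

Either way, every record is still read from disk; the fraction only
controls which records survive.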

To avoid recomputing the data, you can cache the RDD or checkpoint it (see
http://alvincjin.blogspot.com/2014/12/cache-vs-checkpoint-in-spark.html)
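
A minimal sketch of both options, assuming the subRdd from above is reused
several times (the checkpoint directory is just a placeholder path):

// cache: keep the sampled data in memory after the first action computes it
subRdd.cache()

// checkpoint: write to reliable storage and truncate the lineage
sc.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder path
subRdd.checkpoint()

subRdd.count()  // run one action so the cache/checkpoint is actually materialized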

2016-05-31 15:11 GMT+07:00 Patrick Baier <patrick.ba...@zalando.de>:

> I would assume that the driver has to count the number of lines in the
> json file anyway.
> Otherwise, how could it tell the workers which lines they should work on?
>
>
>
> 2016-05-31 10:03 GMT+02:00 Gavin Yue <yue.yuany...@gmail.com>:
>
>> If you are not reading the whole dataset, how do you know the total
>> number of records? And without knowing the total number, how do you
>> choose 30%?
>>
>>
>>
>> > On May 31, 2016, at 00:45, pbaier <patrick.ba...@zalando.de> wrote:
>> >
>> > Hi all,
>> >
>> > I have the following use case:
>> > I have around 10k JSON records that I want to use for learning.
>> > They are all stored in one file.
>> >
>> > For learning an ML model, however, I only need around 30% of the
>> > records (the rest is not needed at all).
>> > So my idea was to load all the data into an RDD and then use the
>> > rdd.sample method to get my fraction of the data.
>> > I implemented this, and in the end it took as long as loading the
>> > whole dataset.
>> > So I was wondering: is Spark still loading the whole dataset from
>> > disk and doing the filtering afterwards?
>> > If so, why does Spark not push the filtering down and load only a
>> > fraction of the data from disk?
>> >
>> > Cheers,
>> >
>> > Patrick
>> >
>>
>
> --
> Patrick Baier
> Payment Analytics
> E-Mail: patrick.ba...@zalando.de
