If Spark did not read the whole dataset, how would it know the total number of records? And without knowing the total, how could it pick 30% of them?
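
For reference, here is a minimal sketch of the pattern you describe (the file name "events.json" and the local master are hypothetical, just for illustration). Note that sample() makes an independent Bernoulli draw per record, so every record still has to be read from disk before the keep/drop decision can be made; only the downstream work shrinks, not the I/O.

import org.apache.spark.{SparkConf, SparkContext}

object SampleDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-sample-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // One JSON object per line, as in the setup below.
    val lines = sc.textFile("events.json")

    // Keep roughly 30% of the records, without replacement.
    // This is a full scan: each record is read, then kept or dropped.
    val subset = lines.sample(withReplacement = false, fraction = 0.3, seed = 42L)

    println(s"kept ${subset.count()} of ${lines.count()} records")
    sc.stop()
  }
}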



> On May 31, 2016, at 00:45, pbaier <patrick.ba...@zalando.de> wrote:
> 
> Hi all,
> 
> I have the following use case:
> I have around 10k JSON records that I want to use for learning.
> The records are all stored in one file.
> 
> For training an ML model, however, I only need around 30% of the records
> (the rest is not needed at all).
> So my idea was to load all the data into an RDD and then use the rdd.sample
> method to get my fraction of the data.
> I implemented this, and in the end it took as long as loading the whole
> dataset.
> So I was wondering: is Spark still loading the whole dataset from disk and
> doing the filtering afterwards?
> If so, why does Spark not push the filtering down and load only a fraction
> of the data from disk?
> 
> Cheers,
> 
> Patrick
> 

