Hi Ram,

Regarding your caching question:
DataFrames are evaluated lazily. That means the data isn’t cached at the 
moment you invoke .cache(); that call only marks the DataFrame for caching. 
The actual work happens on the first action you call on it (in your case 
count): the files are loaded into memory and the rows are counted. On the 
second call to count, the data is already in memory, and that’s why it’s 
faster.
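
Here is a minimal sketch of that sequence (assuming a SparkSession named 
spark, as in your snippet; the bucket path is just your placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    files_df = spark.read.json("s3a://<bucket-name>/*.json")

    # cache() is lazy: it only marks the DataFrame for caching.
    files_df.cache()

    # First action: reads from S3, materializes the cache, then counts.
    files_df.count()   # slow, pays the full read cost

    # Second action: answered from the in-memory cache.
    files_df.count()   # fast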

I do not know if it’s allowed to recommend resources here, but I really liked 
the Big Data Analysis with Scala and Spark course by Heather Miller on 
Coursera.
And the Spark documentation is also a good place to start.
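
One more tip: you can check whether the DataFrame is actually cached in the 
Storage tab of the Spark UI, or programmatically (a small sketch, reusing the 
files_df variable and the temp view name "files" from your mail):

    # Reports whether the view's data is marked as cached
    spark.catalog.isCached("files")

    # The storage level cache() requested (MEMORY_AND_DISK by default)
    files_df.storageLevel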

Regards,
Steffen

> On 25. May 2017, at 07:28, ramnavan <hirampra...@gmail.com> wrote:
> 
> Hi,
> 
> I’m new to Spark and am trying to understand its inner workings in the
> scenarios described below. I’m using PySpark and Spark 2.1.1.
> 
> spark.read.json():
> 
> I am executing the line
> “spark.read.json(‘s3a://<bucket-name>/*.json’)” on a cluster with three
> worker nodes (AWS m4.xlarge instances). The bucket has about 19949 JSON
> files and the total size is about 4.4 GB. The line created three Spark jobs:
> the first with 10000 tasks, the second with 19949 tasks, and the third with
> 10000 tasks. Each job has a single stage. Please refer to the attached
> images:
> job0.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job0.jpg>
> job1.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job1.jpg>
> job2.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job2.jpg>
> I was expecting it to create one job with 19949 tasks. I’d like to
> understand why there are three jobs instead of one, and why reading JSON
> files involves a map operation.
> 
> Caching and count():
> 
> Once Spark reads the 19949 JSON files into a DataFrame (let’s call it
> files_df), I call files_df.createOrReplaceTempView(“files”) and
> files_df.cache(). I expect files_df.cache() to cache the entire DataFrame
> in memory so that any subsequent operation is faster. My next statement is
> files_df.count(). This operation took a full 8.8 minutes, and it looks like
> it read the files from S3 again to calculate the count. Please refer to the
> attached count.jpg
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/count.jpg>
> for reference. Why is this happening? If I call files_df.count() a second
> time, it comes back within a few seconds. Can someone explain this?
> 
> In general, I am looking for a good source to learn about Spark internals
> and to understand what’s happening under the hood.
> 
> Thanks in advance!
> 
> Ram
> 
