Hi Steffen,

Thanks for your response.
Isn't spark.read.json() an action? It reads the files from the source
directory, infers the schema, and creates a dataframe, right?
dataframe.cache() prints out this schema as well. I am not sure why
dataframe.count() would do the same work again (reading the files from the
source). spark.read.json() and count() each took about 8 minutes in my
scenario, and I'd expect only one of those actions to incur the expense of
reading 19949 files from S3. Am I missing anything?
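To make the question concrete, here is a minimal sketch of the sequence I am
running, with the timings I observed (it assumes an existing SparkSession
named spark, on PySpark 2.1.1):

    # This single line launched three jobs and took ~8 minutes;
    # Spark scans the files up front to infer the JSON schema.
    files_df = spark.read.json('s3a://<bucket-name>/*.json')

    files_df.createOrReplaceTempView('files')

    # Returns immediately (the REPL also echoes the schema here).
    files_df.cache()

    # First count after cache(): ~8.8 minutes, appears to read from S3 again.
    files_df.count()

    # Second count: comes back within a few seconds.
    files_df.count()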
Thank you!
Ram

On Thu, May 25, 2017 at 1:34 AM, Steffen Schmitz <steffenschm...@hotmail.de> wrote:

> Hi Ram,
>
> Regarding your caching question:
> The data frame is evaluated lazily. That means it isn't cached directly
> when .cache() is invoked, but when the first action is called on it (in
> your case count). Only then is it loaded into memory and the rows are
> counted, not on the call to .cache(). On the second call to count it is
> already cached in memory, and that's why it's faster.
>
> I do not know if it's allowed to recommend resources here, but I really
> liked the Big Data Analysis with Spark course by Heather Miller on
> Coursera. And the Spark documentation is also a good place to start.
>
> Regards,
> Steffen
>
> > On 25. May 2017, at 07:28, ramnavan <hirampra...@gmail.com> wrote:
> >
> > Hi,
> >
> > I'm new to Spark and trying to understand its inner workings in the
> > scenarios mentioned below. I'm using PySpark and Spark 2.1.1.
> >
> > spark.read.json():
> >
> > I am executing the line “spark.read.json(‘s3a://<bucket-name>/*.json’)”
> > on a cluster with three worker nodes (AWS m4.xlarge instances). The
> > bucket has about 19949 JSON files with a total size of about 4.4 GB.
> > The line created three Spark jobs: the first with 10000 tasks, the
> > second with 19949 tasks, and the third with 10000 tasks. Each job has
> > one stage. Please refer to the attached images:
> > job0.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job0.jpg>
> > job1.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job1.jpg>
> > job2.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job2.jpg>
> > I was expecting it to create one job with 19949 tasks. I'd like to
> > understand why there are three jobs instead of just one, and why
> > reading JSON files calls for a map operation.
> >
> > Caching and count():
> >
> > Once Spark reads the 19949 JSON files into a dataframe (let's call it
> > files_df), I call these two operations:
> > files_df.createOrReplaceTempView(“files”) and files_df.cache(). I
> > expect files_df.cache() to cache the entire dataframe in memory so that
> > any subsequent operation will be faster. My next statement is
> > files_df.count(). This operation took an entire 8.8 minutes, and it
> > looks like it read the files again from S3 to calculate the count.
> > Please refer to the attached count.jpg for reference:
> > count.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/count.jpg>
> > Why is this happening? If I call files_df.count() a second time, it
> > comes back within a few seconds. Can someone explain this?
> >
> > In general, I am looking for a good source to learn about Spark
> > internals and understand what's happening under the hood.
> >
> > Thanks in advance!
> >
> > Ram

--
Ram