Hi Steffen,

Thanks for your response.

Isn't spark.read.json() an action? It reads the files from the source
directory, infers the schema, and creates a dataframe, right?
dataframe.cache() echoes this schema as well. I am not sure why
dataframe.count() would read the files from the source again.
spark.read.json() and count() each took about 8 minutes in my scenario. I'd
expect only one of these actions to incur the expense of reading 19949
files from S3. Am I missing anything?

Thank you!

Ram


On Thu, May 25, 2017 at 1:34 AM, Steffen Schmitz <steffenschm...@hotmail.de>
wrote:

> Hi Ram,
>
> Regarding your caching question:
> The data frame is evaluated lazily. That means it isn’t cached when
> .cache() is invoked, but when the first action is called on it (in your
> case count()).
> Only then is it loaded into memory and the rows are counted, not on the
> call to .cache().
> On the second call to count() it is already cached in memory, and that’s
> why it’s faster.
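>
> A minimal sketch of that sequence (reusing your files_df as the example):
>
>     files_df = spark.read.json('s3a://<bucket-name>/*.json')
>     files_df.cache()    # only marks the dataframe for caching, no work yet
>     files_df.count()    # first action: loads the data, fills the cache, counts
>     files_df.count()    # second action: answered from the in-memory cache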
>
> I do not know if it’s allowed to recommend resources here, but I really
> liked the Big Data Analysis with Spark Course by Heather Miller on Coursera.
> And the Spark documentation is also a good place to start.
>
> Regards,
> Steffen
>
> > On 25. May 2017, at 07:28, ramnavan <hirampra...@gmail.com> wrote:
> >
> > Hi,
> >
> > I’m new to Spark and trying to understand its inner workings in the
> > scenarios mentioned below. I’m using PySpark and Spark 2.1.1.
> >
> > spark.read.json():
> >
> > I am executing the line
> > spark.read.json(‘s3a://<bucket-name>/*.json’) on a cluster with three
> > worker nodes (AWS m4.xlarge instances). The bucket has about 19949 json
> > files and the total size is about 4.4 GB. The line created three Spark
> > jobs: the first with 10000 tasks, the second with 19949 tasks, and the
> > third with 10000 tasks. Each job has a single stage. Please refer to the
> > attached images job0.jpg, job1.jpg and job2.jpg.
> > job0.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job0.jpg>
> > job1.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job1.jpg>
> > job2.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job2.jpg>
> > I was expecting it to create one job with 19949 tasks. I’d like to
> > understand why there are three jobs instead of just one, and why reading
> > json files calls for a map operation.
> >
> > Caching and Count():
> >
> > Once Spark reads the 19949 json files into a dataframe (let’s call it
> > files_df), I call files_df.createOrReplaceTempView(“files”) and
> > files_df.cache(). I expect files_df.cache() to cache the entire dataframe
> > in memory so that any subsequent operation will be faster. My next
> > statement is files_df.count(). This operation took a full 8.8 minutes,
> > and it looks like it read the files from S3 again to calculate the count.
> > Please refer to the attached count.jpg for reference.
> > count.jpg <http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/count.jpg>
> > Why is this happening? If I call files_df.count() a second time, it
> > comes back within a few seconds. Can someone explain this?
> >
> > In general, I am looking for a good source to learn about Spark internals
> > and to understand what’s happening under the hood.
> >
> > Thanks in advance!
> >
> > Ram
> >
> >
> >
>
>


-- 
Ram
