If you do not specify a schema, the json() function will attempt to
determine it, which requires a full scan of the input. Any subsequent
action will then have to read the data again.
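For example, here is a minimal sketch of supplying the schema up front
(assuming the spark session from the pyspark shell; the field names are
hypothetical, substitute the ones your json files actually contain):

    from pyspark.sql.types import LongType, StringType, StructField, StructType

    # An explicit schema lets the reader skip the inference pass over the input.
    schema = StructType([
        StructField("id", LongType()),
        StructField("name", StringType()),
    ])
    df = spark.read.json("s3a://<bucket-name>/*.json", schema=schema)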
See the documentation at:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json

"If the schema parameter is not specified, this function goes through the
input once to determine the input schema."

Nicholas Szandor Hakobian, Ph.D.
Senior Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Thu, May 25, 2017 at 9:24 AM, Steffen Schmitz
<steffenschm...@hotmail.de> wrote:

> Hi Ram,
>
> spark.read.json() should be evaluated on the first call of .count(). The
> data should then be read into memory once and the rows counted. After
> this operation it will be in memory and access will be faster.
> If you add println statements between your function calls, you should
> see that Spark starts to work only after the call of count().
>
> Regards,
> Steffen
>
> On 25. May 2017, at 17:02, Ram Navan <hirampra...@gmail.com> wrote:
>
> Hi Steffen,
>
> Thanks for your response.
>
> Isn't spark.read.json() an action? It reads the files from the source
> directory, infers the schema and creates a dataframe, right?
> dataframe.cache() prints out this schema as well. I am not sure why
> dataframe.count() would try to do the same thing again (reading the
> files from the source). spark.read.json() and count() each took 8
> minutes in my scenario. I'd expect only one of the two actions to incur
> the expense of reading 19949 files from S3. Am I missing anything?
>
> Thank you!
>
> Ram
>
>
> On Thu, May 25, 2017 at 1:34 AM, Steffen Schmitz
> <steffenschm...@hotmail.de> wrote:
>
>> Hi Ram,
>>
>> Regarding your caching question:
>> The data frame is evaluated lazily. That means it isn't cached directly
>> on the invocation of .cache(), but on the first action called on it (in
>> your case count()). Only then is it loaded into memory and the rows are
>> counted, not on the call of .cache().
>> On the second call to count() it is already in memory and cached, and
>> that's why it's faster.
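>>
>> A minimal sketch of that behavior (assuming the spark session from the
>> pyspark shell; the path is the one from your post, and the timing
>> comments are illustrative, not measured):
>>
>>     df = spark.read.json("s3a://<bucket-name>/*.json")
>>     df.cache()   # lazy: only marks the dataframe for caching
>>
>>     df.count()   # first action: reads from S3 and fills the cache (slow)
>>     df.count()   # answered from the cached data (fast)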
>>
>> I do not know if it's allowed to recommend resources here, but I really
>> liked the Big Data Analysis with Spark course by Heather Miller on
>> Coursera. The Spark documentation is also a good place to start.
>>
>> Regards,
>> Steffen
>>
>> > On 25. May 2017, at 07:28, ramnavan <hirampra...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I'm new to Spark and trying to understand its inner workings in the
>> > two scenarios below. I'm using PySpark and Spark 2.1.1.
>> >
>> > spark.read.json():
>> >
>> > I am executing the line spark.read.json('s3a://<bucket-name>/*.json')
>> > on a cluster with three worker nodes (AWS m4.xlarge instances). The
>> > bucket has about 19949 json files with a total size of about 4.4 GB.
>> > The line created three Spark jobs: the first with 10000 tasks, the
>> > second with 19949 tasks and the third with 10000 tasks. Each job has
>> > one stage. Please refer to the attached images job0.jpg, job1.jpg and
>> > job2.jpg:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job0.jpg
>> > http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job1.jpg
>> > http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job2.jpg
>> > I was expecting it to create one job with 19949 tasks. I'd like to
>> > understand why there are three jobs instead of just one, and why
>> > reading json files calls for a map operation.
>> >
>> > Caching and count():
>> >
>> > Once Spark reads the 19949 json files into a dataframe (let's call it
>> > files_df), I call files_df.createOrReplaceTempView("files") and
>> > files_df.cache(). I am expecting files_df.cache() to cache the entire
>> > dataframe in memory so that any subsequent operation will be faster.
>> > My next statement is files_df.count(). This operation took 8.8
>> > minutes, and it looks like it read the files from S3 again and then
>> > calculated the count. Please refer to the attached count.jpg for
>> > reference:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/count.jpg
>> > Why is this happening? If I call files_df.count() a second time, it
>> > comes back within a few seconds. Can someone explain this?
>> >
>> > In general, I am looking for a good source to learn about Spark
>> > internals and to understand what's happening beneath the hood.
>> >
>> > Thanks in advance!
>> >
>> > Ram
>>
>
>
> --
> Ram