If you do not specify a schema, the json() function will attempt to
determine it, which requires a full scan of the input. Any subsequent
action will then have to read the data again.
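For example, here is a minimal sketch of supplying the schema up front
(assuming the spark session from the pyspark shell; the field names are
hypothetical, substitute the ones your json files actually contain):

    from pyspark.sql.types import LongType, StringType, StructField, StructType

    # An explicit schema lets the reader skip the inference pass over the input.
    schema = StructType([
        StructField("id", LongType()),
        StructField("name", StringType()),
    ])
    df = spark.read.json("s3a://<bucket-name>/*.json", schema=schema)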
See the documentation at:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json

"If the schema parameter is not specified, this function goes through the
input once to determine the input schema."

Nicholas Szandor Hakobian, Ph.D.
Senior Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Thu, May 25, 2017 at 9:24 AM, Steffen Schmitz
<steffenschm...@hotmail.de> wrote:

> Hi Ram,
>
> spark.read.json() should be evaluated on the first call of .count(). The
> data should then be read into memory once and the rows counted. After
> this operation it will be in memory and access will be faster.
> If you add println statements between your function calls, you should
> see that Spark starts to work only after the call of count().
>
> Regards,
> Steffen
>
> On 25. May 2017, at 17:02, Ram Navan <hirampra...@gmail.com> wrote:
>
> Hi Steffen,
>
> Thanks for your response.
>
> Isn't spark.read.json() an action? It reads the files from the source
> directory, infers the schema and creates a dataframe, right?
> dataframe.cache() prints out this schema as well. I am not sure why
> dataframe.count() would try to do the same thing again (reading the
> files from the source). spark.read.json() and count() each took 8
> minutes in my scenario. I'd expect only one of the two actions to incur
> the expense of reading 19949 files from S3. Am I missing anything?
>
> Thank you!
>
> Ram
>
>
> On Thu, May 25, 2017 at 1:34 AM, Steffen Schmitz
> <steffenschm...@hotmail.de> wrote:
>
>> Hi Ram,
>>
>> Regarding your caching question:
>> The data frame is evaluated lazily. That means it isn't cached directly
>> on the invocation of .cache(), but on the first action called on it (in
>> your case count()). Only then is it loaded into memory and the rows are
>> counted, not on the call of .cache().
>> On the second call to count() it is already in memory and cached, and
>> that's why it's faster.
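>>
>> A minimal sketch of that behavior (assuming the spark session from the
>> pyspark shell; the path is the one from your post, and the timing
>> comments are illustrative, not measured):
>>
>>     df = spark.read.json("s3a://<bucket-name>/*.json")
>>     df.cache()   # lazy: only marks the dataframe for caching
>>
>>     df.count()   # first action: reads from S3 and fills the cache (slow)
>>     df.count()   # answered from the cached data (fast)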
>>
>> I do not know if it's allowed to recommend resources here, but I really
>> liked the Big Data Analysis with Spark course by Heather Miller on
>> Coursera. The Spark documentation is also a good place to start.
>>
>> Regards,
>> Steffen
>>
>> > On 25. May 2017, at 07:28, ramnavan <hirampra...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I'm new to Spark and trying to understand its inner workings in the
>> > two scenarios below. I'm using PySpark and Spark 2.1.1.
>> >
>> > spark.read.json():
>> >
>> > I am executing the line spark.read.json('s3a://<bucket-name>/*.json')
>> > on a cluster with three worker nodes (AWS m4.xlarge instances). The
>> > bucket has about 19949 json files with a total size of about 4.4 GB.
>> > The line created three Spark jobs: the first with 10000 tasks, the
>> > second with 19949 tasks and the third with 10000 tasks. Each job has
>> > one stage. Please refer to the attached images job0.jpg, job1.jpg and
>> > job2.jpg:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job0.jpg
>> > http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job1.jpg
>> > http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/job2.jpg
>> > I was expecting it to create one job with 19949 tasks. I'd like to
>> > understand why there are three jobs instead of just one, and why
>> > reading json files calls for a map operation.
>> >
>> > Caching and count():
>> >
>> > Once Spark reads the 19949 json files into a dataframe (let's call it
>> > files_df), I call files_df.createOrReplaceTempView("files") and
>> > files_df.cache(). I am expecting files_df.cache() to cache the entire
>> > dataframe in memory so that any subsequent operation will be faster.
>> > My next statement is files_df.count(). This operation took 8.8
>> > minutes, and it looks like it read the files from S3 again and then
>> > calculated the count. Please refer to the attached count.jpg for
>> > reference:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/file/n28708/count.jpg
>> > Why is this happening? If I call files_df.count() a second time, it
>> > comes back within a few seconds. Can someone explain this?
>> >
>> > In general, I am looking for a good source to learn about Spark
>> > internals and to understand what's happening beneath the hood.
>> >
>> > Thanks in advance!
>> >
>> > Ram
>>
>
>
> --
> Ram