That's two jobs. `SparkPlan.executeTake` will call `runJob` twice in this case.
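The incremental strategy behind that behavior can be illustrated with a standalone sketch (this is not Spark's actual source; `executeTake`, the partition lists, and the 4x scale-up factor here are simplified stand-ins): scan one partition first, and launch another job over more partitions only if too few rows came back. With two partitions and fewer rows than `show()` asks for, that is exactly two jobs.

```scala
// Hypothetical standalone sketch of the incremental take strategy used by
// SparkPlan.executeTake -- not Spark's real implementation. Each "partition"
// is just a list of rows, and each batch scan stands in for one runJob call.
object ExecuteTakeSketch {
  def executeTake(partitions: Seq[Seq[String]], n: Int): (Seq[String], Int) = {
    var jobs = 0
    val buf = scala.collection.mutable.ArrayBuffer.empty[String]
    var scanned = 0
    var numToScan = 1 // start by scanning a single partition
    while (buf.size < n && scanned < partitions.size) {
      val batch = partitions.slice(scanned, scanned + numToScan)
      jobs += 1 // one "job" per batch of partitions
      buf ++= batch.flatten
      scanned += batch.size
      numToScan *= 4 // scale up if the first job returned too few rows
    }
    (buf.take(n).toSeq, jobs)
  }

  def main(args: Array[String]): Unit = {
    // Two partitions, three rows total -- like people.json in this thread.
    val parts = Seq(Seq("Michael"), Seq("Andy", "Justin"))
    val (rows, jobs) = executeTake(parts, 21) // show() asks for up to 21 rows
    println(s"rows=${rows.size} jobs=$jobs")  // first job is short, so a second runs
  }
}
```

Running it shows the first batch (one partition, one row) cannot satisfy the request, so a second batch covers the remaining partition: two jobs in total.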
Best Regards,
Shixiong Zhu

2015-08-25 14:01 GMT+08:00 Cheng, Hao <hao.ch...@intel.com>:

> Oh, sorry, I missed reading your reply!
>
> I know the minimum number of tasks will be 2 for scanning, but Jeff is
> talking about 2 jobs, not 2 tasks.
>
> *From:* Shixiong Zhu [mailto:zsxw...@gmail.com]
> *Sent:* Tuesday, August 25, 2015 1:29 PM
> *To:* Cheng, Hao
> *Cc:* Jeff Zhang; user@spark.apache.org
> *Subject:* Re: DataFrame#show cost 2 Spark Jobs ?
>
> Hao,
>
> I can reproduce it using the master branch. I'm curious why you cannot
> reproduce it. Did you check whether the input HadoopRDD had two partitions?
> My test code is:
>
> val df = sqlContext.read.json("examples/src/main/resources/people.json")
> df.show()
>
> Best Regards,
> Shixiong Zhu
>
> 2015-08-25 13:01 GMT+08:00 Cheng, Hao <hao.ch...@intel.com>:
>
> Hi Jeff, which version are you using? I couldn't reproduce the 2 Spark
> jobs in `df.show()` with the latest code. We refactored the code for the
> JSON data source recently, so you may be running an earlier version of it.
>
> Also, a known issue is that Spark SQL will re-list the files every time
> it loads JSON data, which probably causes a longer ramp-up time with a
> large number of files/partitions.
>
> *From:* Jeff Zhang [mailto:zjf...@gmail.com]
> *Sent:* Tuesday, August 25, 2015 8:11 AM
> *To:* Cheng, Hao
> *Cc:* user@spark.apache.org
> *Subject:* Re: DataFrame#show cost 2 Spark Jobs ?
>
> Hi Cheng,
>
> I know that sqlContext.read will trigger one Spark job to infer the
> schema. What I mean is that DataFrame#show costs 2 Spark jobs, so overall
> it would cost 3 jobs.
> Here's the command I use:
>
> >> val df = sqlContext.read.json("file:///Users/hadoop/github/spark/examples/src/main/resources/people.json") // trigger one spark job to infer schema
> >> df.show() // trigger 2 spark jobs which is weird
>
> On Mon, Aug 24, 2015 at 10:56 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
> The first job is to infer the JSON schema, and the second one is what you
> mean by the query.
>
> You can provide the schema while loading the JSON file, like below:
>
> sqlContext.read.schema(xxx).json("…")
>
> Hao
>
> *From:* Jeff Zhang [mailto:zjf...@gmail.com]
> *Sent:* Monday, August 24, 2015 6:20 PM
> *To:* user@spark.apache.org
> *Subject:* DataFrame#show cost 2 Spark Jobs ?
>
> It's weird to me that the simple show function costs 2 Spark jobs.
> DataFrame#explain shows it is a very simple operation; I am not sure why
> it needs 2 jobs.
>
> == Parsed Logical Plan ==
> Relation[age#0L,name#1] JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>
> == Analyzed Logical Plan ==
> age: bigint, name: string
> Relation[age#0L,name#1] JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>
> == Optimized Logical Plan ==
> Relation[age#0L,name#1] JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>
> == Physical Plan ==
> Scan JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
>
> --
> Best Regards
> Jeff Zhang
>
> --
> Best Regards
> Jeff Zhang
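Cheng Hao's suggestion in the thread above, supplying the schema up front so the inference job is skipped, might look like the sketch below in the spark-shell. This is an assumption-laden illustration, not code from the thread: the column names and types (`age: bigint`, `name: string`) are taken from the analyzed plan Jeff posted, and `sqlContext` is the pre-2.0 entry point used throughout this discussion. It requires a running Spark shell, so it is not standalone.

```scala
import org.apache.spark.sql.types._

// Declaring the schema up front skips the job that scans the file to infer
// it, so only the job(s) launched by show() itself remain.
val schema = StructType(Seq(
  StructField("age", LongType, nullable = true),
  StructField("name", StringType, nullable = true)
))

val df = sqlContext.read
  .schema(schema)
  .json("examples/src/main/resources/people.json")

df.show() // no schema-inference job beforehand
```

The trade-off is that malformed records no longer surface at read time the way inference would surface them; rows that do not match the declared schema come back with nulls instead.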