Hi Cheng, I know that sqlContext.read will trigger one spark job to infer the schema. What I mean is that DataFrame#show alone costs 2 spark jobs, so overall it would cost 3 jobs.
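For reference, supplying the schema up front (as Hao suggests below) should avoid the inference job. A rough sketch, where the StructType fields are just my guess at the people.json layout (age: bigint, name: string, per the analyzed plan):

import org.apache.spark.sql.types.{StructType, StructField, LongType, StringType}

// assumed schema for people.json: age as bigint, name as string
val peopleSchema = StructType(Seq(
  StructField("age", LongType, nullable = true),
  StructField("name", StringType, nullable = true)))

// with an explicit schema, no extra job should be needed to infer it
val df = sqlContext.read
  .schema(peopleSchema)
  .json("file:///Users/hadoop/github/spark/examples/src/main/resources/people.json")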
Here's the command I use:

>> val df = sqlContext.read.json("file:///Users/hadoop/github/spark/examples/src/main/resources/people.json")  // triggers one spark job to infer the schema
>> df.show()  // triggers 2 spark jobs, which is weird

On Mon, Aug 24, 2015 at 10:56 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> The first job is to infer the json schema, and the second one is what you mean by the query.
>
> You can provide the schema while loading the json file, like below:
>
> sqlContext.read.schema(xxx).json("…")?
>
> Hao
>
> *From:* Jeff Zhang [mailto:zjf...@gmail.com]
> *Sent:* Monday, August 24, 2015 6:20 PM
> *To:* user@spark.apache.org
> *Subject:* DataFrame#show cost 2 Spark Jobs ?
>
> It's weird to me that the simple show function costs 2 spark jobs. DataFrame#explain shows it is a very simple operation, so I'm not sure why it needs 2 jobs.
>
> == Parsed Logical Plan ==
> Relation[age#0L,name#1] JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>
> == Analyzed Logical Plan ==
> age: bigint, name: string
> Relation[age#0L,name#1] JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>
> == Optimized Logical Plan ==
> Relation[age#0L,name#1] JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>
> == Physical Plan ==
> Scan JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
>
> --
> Best Regards
> Jeff Zhang

--
Best Regards
Jeff Zhang