Because defaultMinPartitions is 2 (see https://github.com/apache/spark/blob/642c43c81c835139e3f35dfd6a215d668a474203/core/src/main/scala/org/apache/spark/SparkContext.scala#L2057 ), your input "people.json" will be split into 2 partitions.
`show` calls `take`, which first starts a job for the first partition only. The limit is 21 (`show` displays 20 rows by default and asks for one extra row to tell whether more rows exist), but the first partition has only 2 records, so `take` starts a second job for the second partition. You can check the implementation details in SparkPlan.executeTake:
https://github.com/apache/spark/blob/642c43c81c835139e3f35dfd6a215d668a474203/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L185

Best Regards,
Shixiong Zhu

2015-08-25 8:11 GMT+08:00 Jeff Zhang <zjf...@gmail.com>:

> Hi Cheng,
>
> I know that sqlContext.read will trigger one spark job to infer the
> schema. What I mean is DataFrame#show cost 2 spark jobs. So overall it
> would cost 3 jobs.
>
> Here's the command I use:
>
> >> val df =
> sqlContext.read.json("file:///Users/hadoop/github/spark/examples/src/main/resources/people.json")
> // trigger one spark job to infer schema
> >> df.show() // trigger 2 spark jobs which is weird
>
> On Mon, Aug 24, 2015 at 10:56 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
>> The first job is to infer the json schema, and the second one is what you
>> mean of the query.
>>
>> You can provide the schema while loading the json file, like below:
>>
>> sqlContext.read.schema(xxx).json(“…”)?
>>
>> Hao
>>
>> *From:* Jeff Zhang [mailto:zjf...@gmail.com]
>> *Sent:* Monday, August 24, 2015 6:20 PM
>> *To:* user@spark.apache.org
>> *Subject:* DataFrame#show cost 2 Spark Jobs ?
>>
>> It's weird to me that the simple show function will cost 2 spark jobs.
>> DataFrame#explain shows it is a very simple operation, not sure why need 2
>> jobs.
>>
>> == Parsed Logical Plan ==
>>
>> Relation[age#0L,name#1]
>> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>>
>> == Analyzed Logical Plan ==
>>
>> age: bigint, name: string
>>
>> Relation[age#0L,name#1]
>> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>>
>> == Optimized Logical Plan ==
>>
>> Relation[age#0L,name#1]
>> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>>
>> == Physical Plan ==
>>
>> Scan
>> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
>>
>> --
>>
>> Best Regards
>>
>> Jeff Zhang
>>
>
> --
> Best Regards
>
> Jeff Zhang
>
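
P.S. For anyone who wants to see the job counting in isolation, here is a rough sketch in plain Scala (no Spark needed) of the idea behind the reply at the top: scan one partition first, then scan more only while rows are still missing. This is a simplified stand-in, not the actual SparkPlan.executeTake code. The names `TakeJobCount` and `simulateTake`, the growth rule for later rounds, the row values, and the single record in the second partition are all assumptions made for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object TakeJobCount {
  // Hypothetical helper: returns (rows collected, number of simulated jobs).
  // Each loop iteration stands for one Spark job over a range of partitions.
  def simulateTake(partitions: Seq[Seq[String]], limit: Int): (Seq[String], Int) = {
    val buf = ArrayBuffer.empty[String]
    var partsScanned = 0
    var jobs = 0
    while (buf.size < limit && partsScanned < partitions.length) {
      // The first round tries a single partition. Later rounds scale up
      // based on how many rows were found so far (assumed growth rule).
      val numPartsToTry =
        if (partsScanned == 0) 1
        else if (buf.isEmpty) partitions.length - partsScanned
        else math.max(1, (1.5 * limit * partsScanned / buf.size).toInt)
      val end = math.min(partsScanned + numPartsToTry, partitions.length)
      jobs += 1
      // Collect rows from the chosen partitions, never exceeding the limit.
      for (p <- partitions.slice(partsScanned, end)) buf ++= p.take(limit - buf.size)
      partsScanned = end
    }
    (buf.toSeq, jobs)
  }

  def main(args: Array[String]): Unit = {
    // Two partitions: 2 records in the first (as stated above) and,
    // by assumption, 1 record in the second. Row values are placeholders.
    val partitions = Seq(Seq("row1", "row2"), Seq("row3"))
    val (rows, jobs) = simulateTake(partitions, limit = 21)
    println(s"rows=${rows.size}, jobs=$jobs") // prints "rows=3, jobs=2"
  }
}
```

With a limit of 2 instead of 21, the same sketch stops after the first round, which is why a small enough `take` on this file would need only one job.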