Because defaultMinPartitions is 2 (see
https://github.com/apache/spark/blob/642c43c81c835139e3f35dfd6a215d668a474203/core/src/main/scala/org/apache/spark/SparkContext.scala#L2057
), your input "people.json" will be split into 2 partitions.
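
You can verify this from the shell (a quick sanity check; the exact
partition count can vary with your setup):

val df = sqlContext.read.json(
  "file:///Users/hadoop/github/spark/examples/src/main/resources/people.json")
sc.defaultMinPartitions    // math.min(defaultParallelism, 2), so 2 here
df.rdd.partitions.length   // 2 for this small file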

`take` first starts a job that scans only the first partition. However,
show asks for 21 rows (its default of 20 plus one extra row, used to
decide whether to print the "only showing top 20 rows" footer), and the
first partition has only 2 records, so `take` starts a second job for
the remaining partition.

You can check the implementation details in SparkPlan.executeTake:
https://github.com/apache/spark/blob/642c43c81c835139e3f35dfd6a215d668a474203/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L185
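
The logic there is roughly the following (a simplified sketch, not the
actual code; the real executeTake scales up the number of partitions
scanned in each round instead of going one at a time):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def takeSketch[T: ClassTag](rdd: RDD[T], n: Int): Array[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var partsScanned = 0
  while (buf.size < n && partsScanned < rdd.partitions.length) {
    val left = n - buf.size
    // Each runJob call is a separate job in the UI, and it scans only
    // the next partition.
    val res = rdd.sparkContext.runJob(
      rdd,
      (it: Iterator[T]) => it.take(left).toArray,
      Seq(partsScanned))
    res.foreach(buf ++= _)
    partsScanned += 1
  }
  buf.take(n).toArray
}

Running this against your 2-partition input with n = 21 would launch
exactly the two jobs you observed.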

Best Regards,
Shixiong Zhu

2015-08-25 8:11 GMT+08:00 Jeff Zhang <zjf...@gmail.com>:

> Hi Cheng,
>
> I know that sqlContext.read will trigger one Spark job to infer the
> schema. What I mean is that DataFrame#show costs 2 Spark jobs, so
> overall it costs 3 jobs.
>
> Here's the command I use:
>
> >> val df =
> sqlContext.read.json("file:///Users/hadoop/github/spark/examples/src/main/resources/people.json")
>        // triggers one Spark job to infer the schema
> >> df.show()            // triggers 2 Spark jobs, which is weird
>
> On Mon, Aug 24, 2015 at 10:56 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
>> The first job is to infer the JSON schema, and the second one is the
>> actual query you ran.
>>
>> You can provide the schema while loading the JSON file, like below:
>>
>>
>>
>> sqlContext.read.schema(xxx).json("...")
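>>
>> For instance, a minimal sketch (the age/name fields simply mirror the
>> analyzed schema shown further down in this thread):
>>
>> import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
>>
>> val schema = StructType(Seq(
>>   StructField("age", LongType),
>>   StructField("name", StringType)))
>> val df = sqlContext.read.schema(schema)
>>   .json("file:///Users/hadoop/github/spark/examples/src/main/resources/people.json")
>> df.show()   // no schema-inference job; only the jobs started by show itself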
>>
>>
>>
>> Hao
>>
>> *From:* Jeff Zhang [mailto:zjf...@gmail.com]
>> *Sent:* Monday, August 24, 2015 6:20 PM
>> *To:* user@spark.apache.org
>> *Subject:* DataFrame#show cost 2 Spark Jobs ?
>>
>>
>>
>> It's weird to me that the simple show function costs 2 Spark jobs.
>> DataFrame#explain shows it is a very simple operation, so I'm not sure
>> why it needs 2 jobs.
>>
>>
>>
>> == Parsed Logical Plan ==
>>
>> Relation[age#0L,name#1]
>> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>>
>>
>>
>> == Analyzed Logical Plan ==
>>
>> age: bigint, name: string
>>
>> Relation[age#0L,name#1]
>> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>>
>>
>>
>> == Optimized Logical Plan ==
>>
>> Relation[age#0L,name#1]
>> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
>>
>>
>>
>> == Physical Plan ==
>>
>> Scan
>> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
>>
>>
>>
>> --
>>
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
