RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-25 Thread Cheng, Hao
Ok, I see, thanks for the correction, but this should be optimized.



RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-25 Thread Cheng, Hao
Oh, sorry, I missed your reply!

I know the minimum number of tasks for scanning will be 2, but Jeff is talking
about 2 jobs, not 2 tasks.



Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-25 Thread Shixiong Zhu
That's two jobs. `SparkPlan.executeTake` will call `runJob` twice in this
case.

Best Regards,
Shixiong Zhu


RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Cheng, Hao
The first job is to infer the JSON schema, and the second one is the actual query.
You can provide the schema while loading the JSON file, like below:

sqlContext.read.schema(xxx).json(“…”)?
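
For example, a minimal sketch of that (the StructType here is an assumption
matching the age: bigint, name: string schema shown in the plan output below):

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Explicit schema matching people.json (age: bigint, name: string),
// so no extra Spark job is needed to infer it
val schema = StructType(Seq(
  StructField("age", LongType, nullable = true),
  StructField("name", StringType, nullable = true)))

val df = sqlContext.read.schema(schema)
  .json("examples/src/main/resources/people.json")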

Hao
From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Monday, August 24, 2015 6:20 PM
To: user@spark.apache.org
Subject: DataFrame#show cost 2 Spark Jobs ?

It's weird to me that the simple show function costs 2 Spark jobs.
DataFrame#explain shows it is a very simple operation; I'm not sure why it needs 2 jobs.

== Parsed Logical Plan ==
Relation[age#0L,name#1] 
JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]

== Analyzed Logical Plan ==
age: bigint, name: string
Relation[age#0L,name#1] 
JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]

== Optimized Logical Plan ==
Relation[age#0L,name#1] 
JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]

== Physical Plan ==
Scan 
JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]



--
Best Regards

Jeff Zhang


Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Jeff Zhang
Hi Cheng,

I know that sqlContext.read will trigger one Spark job to infer the schema.
What I mean is that DataFrame#show itself costs 2 Spark jobs, so overall it
costs 3 jobs.

Here's the command I use:

val df = sqlContext.read.json(
  "file:///Users/hadoop/github/spark/examples/src/main/resources/people.json"
)  // trigger one spark job to infer schema
df.show()  // trigger 2 spark jobs, which is weird




-- 
Best Regards

Jeff Zhang


Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Shixiong Zhu
Hao,

I can reproduce it using the master branch. I'm curious why you cannot
reproduce it. Did you check whether the input HadoopRDD had two partitions?
My test code is:

val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()
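
As a quick check (a one-line sketch; `df` as defined above), the partition
count of the underlying RDD can be printed with:

// Expect 2 here given defaultMinPartitions
println(df.rdd.partitions.length)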


Best Regards,
Shixiong Zhu



RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Cheng, Hao
Hi Jeff, which version are you using? I couldn't reproduce the 2 Spark jobs for
`df.show()` with the latest code. We recently refactored the JSON data source,
so you may be running an earlier version of it.

And a known issue: Spark SQL will re-list the files every time it loads JSON
data, which probably causes a longer ramp-up time with a large number of
files/partitions.



Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Shixiong Zhu
Because defaultMinPartitions is 2 (see
https://github.com/apache/spark/blob/642c43c81c835139e3f35dfd6a215d668a474203/core/src/main/scala/org/apache/spark/SparkContext.scala#L2057
), your input people.json will be split into 2 partitions.

At first, `take` starts a job for only the first partition. However, the
limit is 21 and the first partition has only 2 records, so it continues by
starting a new job for the second partition.

You can check implementation details in SparkPlan.executeTake:
https://github.com/apache/spark/blob/642c43c81c835139e3f35dfd6a215d668a474203/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L185
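For intuition, here is a simplified sketch of that incremental loop (not the
actual executeTake source; the name takeSketch and the strictly
one-partition-at-a-time batching are illustrative assumptions):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Illustrative sketch only, not the real SparkPlan.executeTake: each
// runJob call below is a separate Spark job, scanning one more partition
// until n rows have been collected or all partitions are exhausted.
def takeSketch[T: ClassTag](rdd: RDD[T], n: Int): Array[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var partsScanned = 0
  while (buf.size < n && partsScanned < rdd.partitions.length) {
    val res = rdd.sparkContext.runJob(
      rdd,
      (it: Iterator[T]) => it.take(n - buf.size).toArray,
      Seq(partsScanned))
    res.foreach(buf ++= _)
    partsScanned += 1
  }
  buf.take(n).toArray
}

With people.json split into 2 partitions and show()'s limit of 21 rows, the
first job returns only 2 rows, so a second job runs for the remaining
partition, which is exactly the 2 jobs Jeff observed.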

Best Regards,
Shixiong Zhu
