date:20180125

[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163770850
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -325,6 +326,32 @@ object WholeStageCodegenExec {
   }
 }
 
+object WholeStageCodegenId {
+  // codegenStageId: ID for codegen stages within a query plan.
+  // It does not affect equality, nor does it participate in destructuring 
pattern matching.
+  //
+  // Within a query, a codegen stage in a plan starts counting from 1, in 
"insertion order".
+  // WholeStageCodegenExec operators are inserted into a plan in 
depth-first post-order.
+  // See CollapseCodegenStages.insertWholeStageCodegen for the definition 
of insertion order.
+  //
+  // 0 is reserved as a special ID value to indicate a temporary 
WholeStageCodegenExec object
+  // is created, e.g. for special fallback handling when an existing 
WholeStageCodegenExec
+  // failed to generate/compile code.
+
--- End diff --

I think we should describe about the usage of such codegen stage id, e.g., 
the codegen stage id would show up in explain string and generated class name.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread rednaxelafx

Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163771262
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 ---
@@ -228,4 +229,35 @@ class WholeStageCodegenSuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test("codegen stage IDs should be preserved in transformations after 
CollapseCodegenStages") {
+// test case adapted from DataFrameSuite to trigger ReuseExchange
+withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2") {
+  val df = spark.range(100)
+  val join = df.join(df, "id")
+  val plan = join.queryExecution.executedPlan
+  assert(!plan.find(p =>
+p.isInstanceOf[WholeStageCodegenExec] &&
+  p.asInstanceOf[WholeStageCodegenExec].codegenStageId == 
0).isDefined,
+"codegen stage IDs should be preserved through ReuseExchange")
+  checkAnswer(join, df.toDF)
+}
+  }
+
+  test("including codegen stage ID in generated class name should not 
regress codegen caching") {
+import testImplicits._
+
+withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME.key -> 
"true") {
+  val bytecodeSizeHisto = 
CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE
+  spark.range(3).select('id + 2).collect
+  val after1 = bytecodeSizeHisto.getCount
+  spark.range(3).select('id + 2).collect
+  val after2 = bytecodeSizeHisto.getCount // same query shape as 
above, deliberately
+  assert(after1 == after2, "the same query run twice should hit the 
codegen cache")
+
+  spark.range(5).select('id * 2).collect
+  val after3 = bytecodeSizeHisto.getCount
+  assert(after3 >= after2, "a different query can result in codegen 
cache miss, that's okay")
--- End diff --

That assert is actually "useless" in the sense that the CodegenMetrics are 
always monotonically increasing, so `after3 >= after2` will always be true 
(note the `=` in there). I only put it in to show the intent that it's by 
design that a different query can cause codegen cache miss.

Would you have any concrete suggestions in wording that assertion message? 
I'm totally open to suggestions here. I can move the current message into a 
comment and make the assert message look more like an assert message


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19285
  
**[Test build #86630 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86630/testReport)**
 for PR 19285 at commit 
[`40bdcac`](https://github.com/apache/spark/commit/40bdcacfc14b24c913c5979e0b2cf8b90154c543).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20375: [SPARK-23199][SQL]improved Removes repetition from group...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20375
  
**[Test build #86625 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86625/testReport)**
 for PR 20375 at commit 
[`5050e50`](https://github.com/apache/spark/commit/5050e50aedb25463ff690b54356211dd859af690).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18931
  
**[Test build #86624 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86624/testReport)**
 for PR 18931 at commit 
[`11946e7`](https://github.com/apache/spark/commit/11946e7a62928304560c0602d71b3064789086d6).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19285
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20224
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86627/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20224
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19285
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86630/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163771343
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -629,6 +629,14 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME =
+buildConf("spark.sql.codegen.useIdInClassName")
+.internal()
+.doc("When true, embed the (whole-stage) codegen stage ID into " +
+  "the class name of the generated class as a suffix")
+.booleanConf
+.createWithDefault(true)
--- End diff --

Shall we disable codegen stage id in both explain result and generated 
class name at the same time? It seems not be useful if we disable it in class 
name but keep it in explain result.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19285
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20224
  
**[Test build #86627 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86627/testReport)**
 for PR 20224 at commit 
[`a11232e`](https://github.com/apache/spark/commit/a11232e162c50a1b9312410debb9fb7c4766f9a2).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class WholeStageCodegenExec(child: SparkPlan)(val codegenStageId: 
Int)`
  * `  final class $className extends $`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20375: [SPARK-23199][SQL]improved Removes repetition from group...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20375
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86625/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19285
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86629/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20375: [SPARK-23199][SQL]improved Removes repetition from group...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20375
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19285
  
**[Test build #86629 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86629/testReport)**
 for PR 19285 at commit 
[`9e0759f`](https://github.com/apache/spark/commit/9e0759fb49eb4994099c10c8f8ec3a05637c915b).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86624/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163771658
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 ---
@@ -228,4 +229,35 @@ class WholeStageCodegenSuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test("codegen stage IDs should be preserved in transformations after 
CollapseCodegenStages") {
+// test case adapted from DataFrameSuite to trigger ReuseExchange
+withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2") {
+  val df = spark.range(100)
+  val join = df.join(df, "id")
+  val plan = join.queryExecution.executedPlan
+  assert(!plan.find(p =>
+p.isInstanceOf[WholeStageCodegenExec] &&
+  p.asInstanceOf[WholeStageCodegenExec].codegenStageId == 
0).isDefined,
+"codegen stage IDs should be preserved through ReuseExchange")
+  checkAnswer(join, df.toDF)
+}
+  }
+
+  test("including codegen stage ID in generated class name should not 
regress codegen caching") {
+import testImplicits._
+
+withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME.key -> 
"true") {
+  val bytecodeSizeHisto = 
CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE
+  spark.range(3).select('id + 2).collect
+  val after1 = bytecodeSizeHisto.getCount
+  spark.range(3).select('id + 2).collect
+  val after2 = bytecodeSizeHisto.getCount // same query shape as 
above, deliberately
+  assert(after1 == after2, "the same query run twice should hit the 
codegen cache")
+
+  spark.range(5).select('id * 2).collect
+  val after3 = bytecodeSizeHisto.getCount
+  assert(after3 >= after2, "a different query can result in codegen 
cache miss, that's okay")
--- End diff --

It is an error message so it's weird to use it to explain this assert. I 
think you can make it as a comment to explain the assert and remove the error 
message.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20224
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20224
  
**[Test build #86631 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86631/testReport)**
 for PR 20224 at commit 
[`a11232e`](https://github.com/apache/spark/commit/a11232e162c50a1b9312410debb9fb7c4766f9a2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20224
  
LGTM with minor comments.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20292: [SPARK-23129][CORE] Make deserializeStream of DiskMapIte...

2018-01-25 Thread caneGuy

Github user caneGuy commented on the issue:

https://github.com/apache/spark/pull/20292
  
Thanks for your time! @cloud-fan @jerryshao @jiangxb1987 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-25 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18931
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20224
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/224/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20224
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/225/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20224
  
**[Test build #86632 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86632/testReport)**
 for PR 20224 at commit 
[`a11232e`](https://github.com/apache/spark/commit/a11232e162c50a1b9312410debb9fb7c4766f9a2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18931
  
**[Test build #86633 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86633/testReport)**
 for PR 18931 at commit 
[`11946e7`](https://github.com/apache/spark/commit/11946e7a62928304560c0602d71b3064789086d6).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163773021
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -629,6 +629,14 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME =
+buildConf("spark.sql.codegen.useIdInClassName")
+.internal()
+.doc("When true, embed the (whole-stage) codegen stage ID into " +
+  "the class name of the generated class as a suffix")
+.booleanConf
+.createWithDefault(true)
--- End diff --

I think it's always good to have id in explain and generated classes. The 
only concern is we may have codegen cache issues if putting id in the class 
name, so we need a config to turn it off.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163773516
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 ---
@@ -228,4 +229,35 @@ class WholeStageCodegenSuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test("codegen stage IDs should be preserved in transformations after 
CollapseCodegenStages") {
+// test case adapted from DataFrameSuite to trigger ReuseExchange
+withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2") {
+  val df = spark.range(100)
+  val join = df.join(df, "id")
+  val plan = join.queryExecution.executedPlan
+  assert(!plan.find(p =>
+p.isInstanceOf[WholeStageCodegenExec] &&
+  p.asInstanceOf[WholeStageCodegenExec].codegenStageId == 
0).isDefined,
+"codegen stage IDs should be preserved through ReuseExchange")
+  checkAnswer(join, df.toDF)
+}
+  }
+
+  test("including codegen stage ID in generated class name should not 
regress codegen caching") {
+import testImplicits._
+
+withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME.key -> 
"true") {
+  val bytecodeSizeHisto = 
CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE
+  spark.range(3).select('id + 2).collect
+  val after1 = bytecodeSizeHisto.getCount
+  spark.range(3).select('id + 2).collect
+  val after2 = bytecodeSizeHisto.getCount // same query shape as 
above, deliberately
+  assert(after1 == after2, "the same query run twice should hit the 
codegen cache")
+
+  spark.range(5).select('id * 2).collect
+  val after3 = bytecodeSizeHisto.getCount
+  assert(after3 >= after2, "a different query can result in codegen 
cache miss, that's okay")
--- End diff --

ok seems this assert is not very useful, how about we just remove it? We 
can just add a comment before `assert(after1 == after2, "the same query run 
twice should hit the codegen cache")` and say that different queries can cause 
codegen cache miss, so this assert proves the query is same?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163773528
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -629,6 +629,14 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME =
+buildConf("spark.sql.codegen.useIdInClassName")
+.internal()
+.doc("When true, embed the (whole-stage) codegen stage ID into " +
+  "the class name of the generated class as a suffix")
+.booleanConf
+.createWithDefault(true)
--- End diff --

Make sense to me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20377#discussion_r163774842
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala
 ---
@@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader(
 
   /** The isolated client interface to Hive. */
   private[hive] def createClient(): HiveClient = synchronized {
-val warehouseDir = 
Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname))
 if (!isolationOn) {
-  return new HiveClientImpl(version, warehouseDir, sparkConf, 
hadoopConf, config,
-baseClassLoader, this)
+  return new HiveClientImpl(version, sparkConf, hadoopConf, config, 
baseClassLoader, this)
--- End diff --

so the major concern is to hide the `Configuration` class through the code 
path. How about we create a wrapper?
```
trait HadoopConfWrapper {
  def get(key: String): String
  def toIterator: Iterator[(String, String)]
}
```

and here
```
val wrapper = new HadoopConfWrapper {
  def get(key: String) = hadoopConf.get(key)

  def toIterator = hadoopConf.iterator().asScala
}
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163774923
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 ---
@@ -228,4 +229,35 @@ class WholeStageCodegenSuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test("codegen stage IDs should be preserved in transformations after 
CollapseCodegenStages") {
+// test case adapted from DataFrameSuite to trigger ReuseExchange
+withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2") {
+  val df = spark.range(100)
+  val join = df.join(df, "id")
+  val plan = join.queryExecution.executedPlan
+  assert(!plan.find(p =>
+p.isInstanceOf[WholeStageCodegenExec] &&
+  p.asInstanceOf[WholeStageCodegenExec].codegenStageId == 
0).isDefined,
+"codegen stage IDs should be preserved through ReuseExchange")
+  checkAnswer(join, df.toDF)
+}
+  }
+
+  test("including codegen stage ID in generated class name should not 
regress codegen caching") {
+import testImplicits._
+
+withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME.key -> 
"true") {
+  val bytecodeSizeHisto = 
CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE
+  spark.range(3).select('id + 2).collect
+  val after1 = bytecodeSizeHisto.getCount
+  spark.range(3).select('id + 2).collect
+  val after2 = bytecodeSizeHisto.getCount // same query shape as 
above, deliberately
+  assert(after1 == after2, "the same query run twice should hit the 
codegen cache")
+
+  spark.range(5).select('id * 2).collect
+  val after3 = bytecodeSizeHisto.getCount
+  assert(after3 >= after2, "a different query can result in codegen 
cache miss, that's okay")
--- End diff --

+1


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20377#discussion_r163774967
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala
 ---
@@ -251,10 +251,8 @@ private[hive] class IsolatedClientLoader(
 
   /** The isolated client interface to Hive. */
   private[hive] def createClient(): HiveClient = synchronized {
-val warehouseDir = 
Option(hadoopConf.get(ConfVars.METASTOREWAREHOUSE.varname))
 if (!isolationOn) {
-  return new HiveClientImpl(version, warehouseDir, sparkConf, 
hadoopConf, config,
-baseClassLoader, this)
+  return new HiveClientImpl(version, sparkConf, hadoopConf, config, 
baseClassLoader, this)
--- End diff --

cc @vanzin 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19285
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19285
  
**[Test build #86634 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86634/testReport)**
 for PR 19285 at commit 
[`40bdcac`](https://github.com/apache/spark/commit/40bdcacfc14b24c913c5979e0b2cf8b90154c543).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20391: [SPARK-23208][SQL] Fix code generation for complex creat...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20391
  
thanks, merging to master/2.3!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20391: [SPARK-23208][SQL] Fix code generation for comple...

2018-01-25 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20391


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/...

2018-01-25 Thread jiangxb1987

GitHub user jiangxb1987 opened a pull request:

https://github.com/apache/spark/pull/20393

[SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFrame could lead to 
Data Loss

## What changes were proposed in this pull request?

Currently shuffle repartition uses RoundRobinPartitioning, the generated 
result is nondeterministic since the sequence of input rows are not determined.

The bug can be triggered when there is a repartition call following a 
shuffle (which would lead to non-deterministic row ordering), as the pattern 
shows below:
upstream stage -> repartition stage -> result stage
(-> indicate a shuffle)
When one of the executors process goes down, some tasks on the repartition 
stage will be retried and generate inconsistent ordering, and some tasks of the 
result stage will be retried generating different data.

The following code returns 931532, instead of 100:
```
import scala.sys.process._

import org.apache.spark.TaskContext
val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 
2) {
throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
```

In this PR, we propose a most straight-forward way to fix this problem by 
performing a local sort before partitioning, after we make the input row 
ordering deterministic, the function from rows to partitions is fully 
deterministic too.

The downside of the approach is that with extra local sort inserted, the 
performance of repartition() will go down, so we add a new config named 
`spark.sql.execution.sortBeforeRepartition` to control whether this patch is 
applied. The patch is default enabled to be safe-by-default, but user may 
choose to manually turn it off to avoid performance regression.

This patch also changes the output rows ordering of repartition(), that 
leads to a bunch of test cases failure because they are comparing the results 
directly.

## How was this patch tested?

Add unit test in ExchangeSuite.

With this patch(and `spark.sql.execution.sortBeforeRepartition` set to 
true), the following query returns 100:
```
import scala.sys.process._

import org.apache.spark.TaskContext

spark.conf.set("spark.sql.execution.sortBeforeRepartition", "true")

val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 
2) {
throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()

res7: Long = 100
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jiangxb1987/spark shuffle-repartition

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20393.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20393


commit 7fd964e964d6385f263b45fc264871d16163772d
Author: Xingbo Jiang 
Date:   2018-01-24T22:38:35Z

make RoundRobinPartitioning output deterministic.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20393
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20393
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/226/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20393
  
**[Test build #86635 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86635/testReport)**
 for PR 20393 at commit 
[`7fd964e`](https://github.com/apache/spark/commit/7fd964e964d6385f263b45fc264871d16163772d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread rednaxelafx

Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163784286
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -325,6 +326,32 @@ object WholeStageCodegenExec {
   }
 }
 
+object WholeStageCodegenId {
+  // codegenStageId: ID for codegen stages within a query plan.
+  // It does not affect equality, nor does it participate in destructuring 
pattern matching.
+  //
+  // Within a query, a codegen stage in a plan starts counting from 1, in 
"insertion order".
+  // WholeStageCodegenExec operators are inserted into a plan in 
depth-first post-order.
+  // See CollapseCodegenStages.insertWholeStageCodegen for the definition 
of insertion order.
+  //
+  // 0 is reserved as a special ID value to indicate a temporary 
WholeStageCodegenExec object
+  // is created, e.g. for special fallback handling when an existing 
WholeStageCodegenExec
+  // failed to generate/compile code.
+
--- End diff --

Sure thing. Will address it in the next update. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread rednaxelafx

Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163784607
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 ---
@@ -228,4 +229,35 @@ class WholeStageCodegenSuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test("codegen stage IDs should be preserved in transformations after 
CollapseCodegenStages") {
+// test case adapted from DataFrameSuite to trigger ReuseExchange
+withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2") {
+  val df = spark.range(100)
+  val join = df.join(df, "id")
+  val plan = join.queryExecution.executedPlan
+  assert(!plan.find(p =>
+p.isInstanceOf[WholeStageCodegenExec] &&
+  p.asInstanceOf[WholeStageCodegenExec].codegenStageId == 
0).isDefined,
+"codegen stage IDs should be preserved through ReuseExchange")
+  checkAnswer(join, df.toDF)
+}
+  }
+
+  test("including codegen stage ID in generated class name should not 
regress codegen caching") {
+import testImplicits._
+
+withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME.key -> 
"true") {
+  val bytecodeSizeHisto = 
CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE
+  spark.range(3).select('id + 2).collect
+  val after1 = bytecodeSizeHisto.getCount
+  spark.range(3).select('id + 2).collect
+  val after2 = bytecodeSizeHisto.getCount // same query shape as 
above, deliberately
+  assert(after1 == after2, "the same query run twice should hit the 
codegen cache")
+
+  spark.range(5).select('id * 2).collect
+  val after3 = bytecodeSizeHisto.getCount
+  assert(after3 >= after2, "a different query can result in codegen 
cache miss, that's okay")
--- End diff --

Thanks for the suggestions, @viirya and @cloud-fan ! I'll move the assert 
message to a comment in the next update.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19575: [WIP][SPARK-22221][DOCS] Adding User Documentatio...

2018-01-25 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19575#discussion_r163784133
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1640,6 +1640,154 @@ Configuration of Hive is done by placing your 
`hive-site.xml`, `core-site.xml` a
 You may run `./bin/spark-sql --help` for a complete list of all available
 options.
 
+# Usage Guide for Pandas with Arrow
+
+## Arrow in Spark
+
+Apache Arrow is an in-memory columnar data format that is used in Spark to 
efficiently transfer
+data between JVM and Python processes. This currently is most beneficial 
to Python users that
+work with Pandas/NumPy data. Its usage is not automatic and might require 
some minor
+changes to configuration or code to take full advantage and ensure 
compatibility. This guide will
+give a high-level description of how to use Arrow in Spark and highlight 
any differences when
+working with Arrow-enabled data.
+
+## Ensure pyarrow Installed
+
+If you install pyspark using pip, then pyarrow can be brought in as an 
extra dependency of the sql
+module with the command "pip install pyspark[sql]". Otherwise, you must 
ensure that pyarrow is
+installed and available on all cluster node Python environments. The 
current supported version is
+0.8.0. You can install using pip or conda from the conda-forge channel. 
See pyarrow
+[installation](https://arrow.apache.org/docs/python/install.html) for 
details.
+
+## How to Enable for Conversion to/from Pandas
+
+Arrow is available as an optimization when converting a Spark DataFrame to 
Pandas using the call 
+`toPandas()` and when creating a Spark DataFrame from Pandas with 
`createDataFrame(pandas_df)`.
+To use Arrow when executing these calls, it first must be enabled by 
setting the Spark conf
+'spark.sql.execution.arrow.enabled' to 'true', this is disabled by default.
+
+
+
+{% highlight python %}
+
+import numpy as np
+import pandas as pd
+
+# Enable Arrow, 'spark' is an existing SparkSession
+spark.conf.set("spark.sql.execution.arrow.enabled", "true")
+
+# Generate sample data
+pdf = pd.DataFrame(np.random.rand(100, 3))
+
+# Create a Spark DataFrame from Pandas data using Arrow
+df = spark.createDataFrame(pdf)
+
+# Convert the Spark DataFrame to a local Pandas DataFrame
+selpdf = df.select(" * ").toPandas()
+
+{% endhighlight %}
+
+
+
+Using the above optimizations with Arrow will produce the same results as 
when Arrow is not
+enabled. Not all Spark data types are currently supported and an error 
will be raised if a column
+has an unsupported type, see [Supported Types](#supported-types).
+
+## How to Write Vectorized UDFs
+
+A vectorized UDF is similar to a standard UDF in Spark except the inputs 
and output will be
+Pandas Series, which allow the function to be composed with vectorized 
operations. This function
+can then be run very efficiently in Spark where data is sent in batches to 
Python and then
+is executed using Pandas Series as the inputs. The exected output of the 
function is also a Pandas
+Series of the same length as the inputs. A vectorized UDF is declared 
using the `pandas_udf`
+keyword, no additional configuration is required.
+
+The following example shows how to create a vectorized UDF that computes 
the product of 2 columns.
+
+
+
+{% highlight python %}
+
+import pandas as pd
+from pyspark.sql.functions import col, pandas_udf
+from pyspark.sql.types import LongType
+
+# Declare the function and create the UDF
+def multiply_func(a, b):
+return a * b
+
+multiply = pandas_udf(multiply_func, returnType=LongType())
+
+# The function for a pandas_udf should be able to execute with local 
Pandas data
+x = pd.Series([1, 2, 3])
+print(multiply_func(x, x))
+# 01
+# 14
+# 29
+# dtype: int64
+
+# Create a Spark DataFrame, 'spark' is an existing SparkSession
+df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
+
+# Execute function as a Spark vectorized UDF
+df.select(multiply(col("x"), col("x"))).show()
+# +---+
+# |multiply_func(x, x)|
+# +---+
+# |  1|
+# |  4|
+# |  9|
+# +---+
+
+{% endhighlight %}
+
+
+
+## GroupBy-Apply UDFs
+
+## Usage Notes
+
+### Supported types
+
+Currently, all Spark SQL data types are supported except `MapType`, 
`ArrayType` of `TimestampType`, and
+nested `StructType`.
+
+### Setting Arrow Batch Size
+
+Data partitions in Spark are converted into Arrow record batches, which 
can temporarily lead to
+high memo

[GitHub] spark issue #17422: [SPARK-20087] Attach accumulators / metrics to 'TaskKill...

2018-01-25 Thread advancedxy

Github user advancedxy commented on the issue:

https://github.com/apache/spark/pull/17422
  
@noodle-fb are you still working on this? If not, I may work on it based on 
your current impl.

I am facing same issue here. The accumulator updates are lost for killed 
tasks.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread rednaxelafx

Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20224
  
Updated again. Addressed @viirya 's comments:
1. added comments to explain where this codegen stage ID is used
2. moved an assertion message to a comment.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20224
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/227/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20224
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20224
  
**[Test build #86636 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86636/testReport)**
 for PR 20224 at commit 
[`5c99777`](https://github.com/apache/spark/commit/5c99777a6a9aa21905158d53f9e393c0ad7acc9f).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163790082
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 ---
@@ -228,4 +229,38 @@ class WholeStageCodegenSuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test("codegen stage IDs should be preserved in transformations after 
CollapseCodegenStages") {
+// test case adapted from DataFrameSuite to trigger ReuseExchange
+withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2") {
+  val df = spark.range(100)
+  val join = df.join(df, "id")
+  val plan = join.queryExecution.executedPlan
+  assert(!plan.find(p =>
+p.isInstanceOf[WholeStageCodegenExec] &&
+  p.asInstanceOf[WholeStageCodegenExec].codegenStageId == 
0).isDefined,
+"codegen stage IDs should be preserved through ReuseExchange")
+  checkAnswer(join, df.toDF)
+}
+  }
+
+  test("including codegen stage ID in generated class name should not 
regress codegen caching") {
+import testImplicits._
+
+withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME.key -> 
"true") {
+  val bytecodeSizeHisto = 
CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE
+
+  // the same query run twice should hit the codegen cache
+  spark.range(3).select('id + 2).collect
+  val after1 = bytecodeSizeHisto.getCount
+  spark.range(3).select('id + 2).collect
+  val after2 = bytecodeSizeHisto.getCount // same query shape as 
above, deliberately
+  assert(after1 == after2, "Should hit codegen cache. No new 
compilation to bytecode expected")
+
+  // a different query can result in codegen cache miss, that's by 
design
+  spark.range(5).select('id * 2).collect
+  val after3 = bytecodeSizeHisto.getCount
+  assert(after3 >= after2, "always")
--- End diff --

hmm, I think we can just remove the above 3 lines and add a comment before 
L258, to say that the CodegenMetrics are always monotonically increasing, so 
`after1 == after2` proves we hit the codegen cache.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20390: [SPARK-23081][PYTHON]Add colRegex API to PySpark

2018-01-25 Thread felixcheung

Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/20390#discussion_r163791518
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -819,6 +819,29 @@ def columns(self):
 """
 return [f.name for f in self.schema.fields]
 
+@since(2.4)
+def colRegex(self, colName):
+"""
+Selects column based on the column name specified as a regex and 
return it
+as :class:`Column`.
+
+:param colName: string, column name specified as a regex.
+
+>>> df = spark.createDataFrame([("a", 1), ("b", 2), ("c",  3)])
+>>> df.select(df.colRegex("`(_1)?+.+`")).show()
--- End diff --

nit: perhaps a bit obscure to pick the default column name of `_1`?
how about we name the columns in the line above?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19575: [WIP][SPARK-22221][DOCS] Adding User Documentation for A...

2018-01-25 Thread felixcheung

Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/19575
  
it looks like maybe we have a blocker for RC2?
let's try to get this in soon so it could get into 2.3.0?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20354: [SPARK-23163][DOC][PYTHON] Sync ML Python API with Scala

2018-01-25 Thread felixcheung

Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/20354
  
merged to master/2.3


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20354: [SPARK-23163][DOC][PYTHON] Sync ML Python API wit...

2018-01-25 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20354


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-25 Thread kiszk

Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/18931
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20394: [SPARK-23214][SQL] cached data should not carry e...

2018-01-25 Thread cloud-fan

GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/20394

[SPARK-23214][SQL] cached data should not carry extra hint info

## What changes were proposed in this pull request?

This is a regression introduced by 
https://github.com/apache/spark/pull/19864

When we lookup cache, we should not carry the hint info, as this cache 
entry might be added by a plan having hint info, while the input plan for this 
lookup may not have hint info, or have different hint info.

## How was this patch tested?

a new test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark cache

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20394.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20394


commit 87ac064db0fcb1d9ffe5c7f77069c09ba47a9f01
Author: Wenchen Fan 
Date:   2018-01-25T10:07:26Z

cached data should not carry extra hint info




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20394: [SPARK-23214][SQL] cached data should not carry e...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20394#discussion_r163798968
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -169,14 +169,17 @@ class CacheManager extends Logging {
   /** Replaces segments of the given logical plan with cached versions 
where possible. */
   def useCachedData(plan: LogicalPlan): LogicalPlan = {
 val newPlan = plan transformDown {
+  // Do not lookup the cache by hint node. Hint node is special, we 
should ignore it when
+  // canonicalizing plans, so that plans which are same except hint 
can hit the same cache.
+  // However, we also want to keep the hint info after cache lookup. 
Here we skip the hint
+  // node, so that the returned caching plan won't replace the hint 
node and drop the hint info
+  // from the original plan.
+  case hint: ResolvedHint => hint
--- End diff --

A small clean up for https://github.com/apache/spark/pull/20365


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20394: [SPARK-23214][SQL] cached data should not carry extra hi...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20394
  
**[Test build #86637 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86637/testReport)**
 for PR 20394 at commit 
[`87ac064`](https://github.com/apache/spark/commit/87ac064db0fcb1d9ffe5c7f77069c09ba47a9f01).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20394: [SPARK-23214][SQL] cached data should not carry e...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20394#discussion_r163799084
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
 ---
@@ -62,7 +62,7 @@ case class InMemoryRelation(
 @transient child: SparkPlan,
 tableName: Option[String])(
 @transient var _cachedColumnBuffers: RDD[CachedBatch] = null,
-val batchStats: LongAccumulator = 
child.sqlContext.sparkContext.longAccumulator,
+val sizeInBytesStats: LongAccumulator = 
child.sqlContext.sparkContext.longAccumulator,
--- End diff --

unrelated but this name is more accurate.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20394: [SPARK-23214][SQL] cached data should not carry e...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20394#discussion_r163799202
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 ---
@@ -70,8 +71,8 @@ class BroadcastJoinSuite extends QueryTest with 
SQLTestUtils {
   private def testBroadcastJoin[T: ClassTag](
   joinType: String,
   forceBroadcast: Boolean = false): SparkPlan = {
-val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
"value")
-val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
"value")
+val df1 = Seq((1, "4"), (2, "2")).toDF("key", "value")
+val df2 = Seq((1, "1"), (2, "2")).toDF("key", "value")
--- End diff --

some code style fixing


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20394: [SPARK-23214][SQL] cached data should not carry extra hi...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20394
  
cc @CodingCat @gatorsmile @hvanhovell @dongjoon-hyun 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-25 Thread rednaxelafx

Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163799937
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
 ---
@@ -228,4 +229,38 @@ class WholeStageCodegenSuite extends QueryTest with 
SharedSQLContext {
   }
 }
   }
+
+  test("codegen stage IDs should be preserved in transformations after 
CollapseCodegenStages") {
+// test case adapted from DataFrameSuite to trigger ReuseExchange
+withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "2") {
+  val df = spark.range(100)
+  val join = df.join(df, "id")
+  val plan = join.queryExecution.executedPlan
+  assert(!plan.find(p =>
+p.isInstanceOf[WholeStageCodegenExec] &&
+  p.asInstanceOf[WholeStageCodegenExec].codegenStageId == 
0).isDefined,
+"codegen stage IDs should be preserved through ReuseExchange")
+  checkAnswer(join, df.toDF)
+}
+  }
+
+  test("including codegen stage ID in generated class name should not 
regress codegen caching") {
+import testImplicits._
+
+withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME.key -> 
"true") {
+  val bytecodeSizeHisto = 
CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE
+
+  // the same query run twice should hit the codegen cache
+  spark.range(3).select('id + 2).collect
+  val after1 = bytecodeSizeHisto.getCount
+  spark.range(3).select('id + 2).collect
+  val after2 = bytecodeSizeHisto.getCount // same query shape as 
above, deliberately
+  assert(after1 == after2, "Should hit codegen cache. No new 
compilation to bytecode expected")
+
+  // a different query can result in codegen cache miss, that's by 
design
+  spark.range(5).select('id * 2).collect
+  val after3 = bytecodeSizeHisto.getCount
+  assert(after3 >= after2, "always")
--- End diff --

I like that. Updating now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20394: [SPARK-23214][SQL] cached data should not carry extra hi...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20394
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20394: [SPARK-23214][SQL] cached data should not carry extra hi...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20394
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/228/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20224
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20224
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/229/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20224
  
**[Test build #86638 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86638/testReport)**
 for PR 20224 at commit 
[`ce8171a`](https://github.com/apache/spark/commit/ce8171aa115d05d0a198964a64e7ef60d9637502).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread rednaxelafx

Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20224
  
Updated again to address @cloud-fan 's comments: removed unneeded test case 
and added a bit more comments.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19285: [SPARK-22068][CORE]Reduce the duplicate code betw...

2018-01-25 Thread Ngone51

Github user Ngone51 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19285#discussion_r163807859
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -702,6 +634,93 @@ private[spark] class MemoryStore(
   }
 }
 
+private trait ValuesBuilder[T] {
+  def preciseSize: Long
--- End diff --

Hey guys, why not name the trait as `MemoryEntryBuilder`?  As I see from 
the code, it is used to build the `MemoryEntry`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18931
  
**[Test build #86633 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86633/testReport)**
 for PR 18931 at commit 
[`11946e7`](https://github.com/apache/spark/commit/11946e7a62928304560c0602d71b3064789086d6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86633/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19285: [SPARK-22068][CORE]Reduce the duplicate code betw...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19285#discussion_r163819767
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -702,6 +634,93 @@ private[spark] class MemoryStore(
   }
 }
 
+private trait ValuesBuilder[T] {
+  def preciseSize: Long
--- End diff --

good idea


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-25 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18931
  
thanks, merging to master/2.3!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20224
  
**[Test build #86631 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86631/testReport)**
 for PR 20224 at commit 
[`a11232e`](https://github.com/apache/spark/commit/a11232e162c50a1b9312410debb9fb7c4766f9a2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class WholeStageCodegenExec(child: SparkPlan)(val codegenStageId: 
Int)`
  * `  final class $className extends $`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20224
  
**[Test build #86632 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86632/testReport)**
 for PR 20224 at commit 
[`a11232e`](https://github.com/apache/spark/commit/a11232e162c50a1b9312410debb9fb7c4766f9a2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class WholeStageCodegenExec(child: SparkPlan)(val codegenStageId: 
Int)`
  * `  final class $className extends $`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2018-01-25 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18931


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20224
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86631/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20224
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20224
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20224
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86632/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20393
  
**[Test build #86635 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86635/testReport)**
 for PR 20393 at commit 
[`7fd964e`](https://github.com/apache/spark/commit/7fd964e964d6385f263b45fc264871d16163772d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public final class RecordBinaryComparator extends RecordComparator `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20393
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86635/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20393
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20393
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20393
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20393
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/230/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20393
  
**[Test build #86639 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86639/testReport)**
 for PR 20393 at commit 
[`7fd964e`](https://github.com/apache/spark/commit/7fd964e964d6385f263b45fc264871d16163772d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18906: [SPARK-21692][PYSPARK][SQL] Add nullability suppo...

2018-01-25 Thread ptkool

Github user ptkool commented on a diff in the pull request:

https://github.com/apache/spark/pull/18906#discussion_r163828796
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2105,6 +2105,14 @@ def udf(f=None, returnType=StringType()):
 >>> import random
 >>> random_udf = udf(lambda: int(random.random() * 100), 
IntegerType()).asNondeterministic()
 
+.. note:: The user-defined functions are considered to be able to 
return null values by default.
+If your function is not nullable, call `asNonNullable` on the user 
defined function.
+E.g.:
+
+>>> from pyspark.sql.types import StringType
+>>> import getpass
+>>> getuser_udf = udf(lambda: getpass.getuser(), 
StringType()).asNonNullable()
--- End diff --

Why? Use of StringType() is consistent with other tests.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19575: [WIP][SPARK-22221][DOCS] Adding User Documentation for A...

2018-01-25 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19575
  
How about a critical @felixcheung? I will focus on this one anyway to get 
this into 2.3.0. Seems RC3 will be going on if I didn't misunderstand.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18906: [SPARK-21692][PYSPARK][SQL] Add nullability suppo...

2018-01-25 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18906#discussion_r163831513
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2105,6 +2105,14 @@ def udf(f=None, returnType=StringType()):
 >>> import random
 >>> random_udf = udf(lambda: int(random.random() * 100), 
IntegerType()).asNondeterministic()
 
+.. note:: The user-defined functions are considered to be able to 
return null values by default.
+If your function is not nullable, call `asNonNullable` on the user 
defined function.
+E.g.:
+
+>>> from pyspark.sql.types import StringType
+>>> import getpass
+>>> getuser_udf = udf(lambda: getpass.getuser(), 
StringType()).asNonNullable()
--- End diff --

Oh, simply because it's shorter :-). That's fine if you prefer this way. I 
don't feel strongly.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20382
  
**[Test build #86640 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86640/testReport)**
 for PR 20382 at commit 
[`56c60f3`](https://github.com/apache/spark/commit/56c60f3d9d920cea095e78695544b371435ca6f5).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-25 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19285
  
**[Test build #86634 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86634/testReport)**
 for PR 19285 at commit 
[`40bdcac`](https://github.com/apache/spark/commit/40bdcacfc14b24c913c5979e0b2cf8b90154c543).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19285
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19285
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86634/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20382
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/231/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2

2018-01-25 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20382
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 3 4 5 6 7 >

1 - 100 of 630 matches

Mail list logo