[GitHub] [spark] SparkQA commented on pull request #29643: [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan

2020-09-03 Thread GitBox


SparkQA commented on pull request #29643:
URL: https://github.com/apache/spark/pull/29643#issuecomment-686954059


   **[Test build #128282 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128282/testReport)**
 for PR 29643 at commit 
[`30e6c4a`](https://github.com/apache/spark/commit/30e6c4af86f890b2ee19343bbbe2affb2157e46b).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29643: [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29643:
URL: https://github.com/apache/spark/pull/29643#issuecomment-686951102










[GitHub] [spark] AmplabJenkins commented on pull request #29643: [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29643:
URL: https://github.com/apache/spark/pull/29643#issuecomment-686951102










[GitHub] [spark] HeartSaVioR commented on a change in pull request #29461: [SPARK-32456][SS][FOLLOWUP] Update doc to note about using SQL statement with streaming Dataset

2020-09-03 Thread GitBox


HeartSaVioR commented on a change in pull request #29461:
URL: https://github.com/apache/spark/pull/29461#discussion_r483420929



##
File path: docs/structured-streaming-programming-guide.md
##
@@ -861,6 +861,10 @@ isStreaming(df)
 
 
 
+You may want to check the logical plan of the query, as Spark converts the 
operation into another operation, which includes adding streaming aggregation. 
(e.g. count, distinct, union, etc.)

Review comment:
   The thing is whether Spark injects a streaming aggregation which end users 
have to maintain or not, and that can be checked by looking into the logical 
plan, right? I didn't mean they need to find the distinct in the logical plan 
and see how Spark changes the operation. They just need to check for stateful 
operations.
   
   SQL distinct and Dataset dropDuplicates aren't the only difference. SQL union 
and Dataset union are also different. The cases can increase and decrease 
according to the Spark catalyst rules, which is not something we can keep the 
doc in sync with.
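   As a minimal illustration of that check (written against a batch Dataset for 
simplicity; names and schema are assumed, not from this PR), one can look for an 
injected aggregation in the optimized logical plan:
   
   ```scala
   import org.apache.spark.sql.catalyst.plans.logical.Aggregate
   
   val df = spark.range(10).selectExpr("id % 3 AS value")
   df.createOrReplaceTempView("events")
   val distinctDf = spark.sql("SELECT DISTINCT value FROM events")
   
   // The optimizer rewrites the DISTINCT clause into an Aggregate node.
   println(distinctDf.queryExecution.optimizedPlan.treeString)
   
   // Programmatic check: did catalyst inject an aggregation into the plan?
   val hasAggregate = distinctDf.queryExecution.optimizedPlan.collect {
     case a: Aggregate => a
   }.nonEmpty
   ```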








[GitHub] [spark] SparkQA commented on pull request #29643: [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan

2020-09-03 Thread GitBox


SparkQA commented on pull request #29643:
URL: https://github.com/apache/spark/pull/29643#issuecomment-686950450


   **[Test build #128281 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128281/testReport)**
 for PR 29643 at commit 
[`e796ea7`](https://github.com/apache/spark/commit/e796ea71b0cc87f85ff65fe1dad236bd109c1955).







[GitHub] [spark] viirya commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

2020-09-03 Thread GitBox


viirya commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r483419615



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##
@@ -2030,7 +2030,47 @@ class Dataset[T] private[sql](
* @group typedrel
* @since 2.3.0
*/
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and 
another Dataset.
+   *
+   * The difference between this function and [[union]] is that this function
+   * resolves columns by name (not by position).
+   *
+   * When the parameter `allowMissingColumns` is true, this function allows 
different set
+   * of column names between two Datasets. Missing columns at each side, will 
be filled with
+   * null values. The missing columns at left Dataset will be added at the end 
in the schema
+   * of the union result:
+   *
+   * {{{
+   *   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
+   *   val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")
+   *   df1.unionByName(df2, true).show
+   *
+   *   // output: "col3" is missing at left df1 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col0|col1|col2|col3|
+   *   // +----+----+----+----+
+   *   // |   1|   2|   3|null|
+   *   // |   5|   4|null|   6|
+   *   // +----+----+----+----+
+   *
+   *   df2.unionByName(df1, true).show
+   *
+   *   // output: "col2" is missing at left df2 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col1|col0|col3|col2|
+   *   // +----+----+----+----+
+   *   // |   4|   5|   6|null|
+   *   // |   2|   1|null|   3|
+   *   // +----+----+----+----+
+   * }}}
+   *
+   * @group typedrel
+   * @since 3.1.0
+   */
+  def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T] 
= withSetOperator {

Review comment:
   I should create a follow-up PR for Python and R. But it is okay as a 
beginner task too.








[GitHub] [spark] gatorsmile commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

2020-09-03 Thread GitBox


gatorsmile commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r483418810



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##
@@ -2030,7 +2030,47 @@ class Dataset[T] private[sql](
* @group typedrel
* @since 2.3.0
*/
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and 
another Dataset.
+   *
+   * The difference between this function and [[union]] is that this function
+   * resolves columns by name (not by position).
+   *
+   * When the parameter `allowMissingColumns` is true, this function allows 
different set
+   * of column names between two Datasets. Missing columns at each side, will 
be filled with
+   * null values. The missing columns at left Dataset will be added at the end 
in the schema
+   * of the union result:
+   *
+   * {{{
+   *   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
+   *   val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")
+   *   df1.unionByName(df2, true).show
+   *
+   *   // output: "col3" is missing at left df1 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col0|col1|col2|col3|
+   *   // +----+----+----+----+
+   *   // |   1|   2|   3|null|
+   *   // |   5|   4|null|   6|
+   *   // +----+----+----+----+
+   *
+   *   df2.unionByName(df1, true).show
+   *
+   *   // output: "col2" is missing at left df2 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col1|col0|col3|col2|
+   *   // +----+----+----+----+
+   *   // |   4|   5|   6|null|
+   *   // |   2|   1|null|   3|
+   *   // +----+----+----+----+
+   * }}}
+   *
+   * @group typedrel
+   * @since 3.1.0
+   */
+  def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T] 
= withSetOperator {

Review comment:
   This is a good beginner task for new contributors.








[GitHub] [spark] AmplabJenkins commented on pull request #29646: [SPARK-XXXXX][SQL][TEST] Add tests to check if since fields are set correctly in ExpressionInfo

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29646:
URL: https://github.com/apache/spark/pull/29646#issuecomment-686947407










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29646: [SPARK-XXXXX][SQL][TEST] Add tests to check if since fields are set correctly in ExpressionInfo

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29646:
URL: https://github.com/apache/spark/pull/29646#issuecomment-686947407










[GitHub] [spark] gatorsmile commented on a change in pull request #28996: [SPARK-29358][SQL] Make unionByName optionally fill missing columns with nulls

2020-09-03 Thread GitBox


gatorsmile commented on a change in pull request #28996:
URL: https://github.com/apache/spark/pull/28996#discussion_r483418373



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##
@@ -2030,7 +2030,47 @@ class Dataset[T] private[sql](
* @group typedrel
* @since 2.3.0
*/
-  def unionByName(other: Dataset[T]): Dataset[T] = withSetOperator {
+  def unionByName(other: Dataset[T]): Dataset[T] = unionByName(other, false)
+
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and 
another Dataset.
+   *
+   * The difference between this function and [[union]] is that this function
+   * resolves columns by name (not by position).
+   *
+   * When the parameter `allowMissingColumns` is true, this function allows 
different set
+   * of column names between two Datasets. Missing columns at each side, will 
be filled with
+   * null values. The missing columns at left Dataset will be added at the end 
in the schema
+   * of the union result:
+   *
+   * {{{
+   *   val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
+   *   val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3")
+   *   df1.unionByName(df2, true).show
+   *
+   *   // output: "col3" is missing at left df1 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col0|col1|col2|col3|
+   *   // +----+----+----+----+
+   *   // |   1|   2|   3|null|
+   *   // |   5|   4|null|   6|
+   *   // +----+----+----+----+
+   *
+   *   df2.unionByName(df1, true).show
+   *
+   *   // output: "col2" is missing at left df2 and added at the end of schema.
+   *   // +----+----+----+----+
+   *   // |col1|col0|col3|col2|
+   *   // +----+----+----+----+
+   *   // |   4|   5|   6|null|
+   *   // |   2|   1|null|   3|
+   *   // +----+----+----+----+
+   * }}}
+   *
+   * @group typedrel
+   * @since 3.1.0
+   */
+  def unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T] 
= withSetOperator {

Review comment:
   Do we have a JIRA to add the corresponding API for Python? 








[GitHub] [spark] SparkQA commented on pull request #29646: [SPARK-XXXXX][SQL][TEST] Add tests to check if since fields are set correctly in ExpressionInfo

2020-09-03 Thread GitBox


SparkQA commented on pull request #29646:
URL: https://github.com/apache/spark/pull/29646#issuecomment-686946728


   **[Test build #128280 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128280/testReport)**
 for PR 29646 at commit 
[`21496cc`](https://github.com/apache/spark/commit/21496cc98118d5377d776fa5058fc4a02f137c58).






[GitHub] [spark] maropu commented on pull request #29646: [SPARK-XXXXX][SQL][TEST] Add tests to check if since fields are set correctly in ExpressionInfo

2020-09-03 Thread GitBox


maropu commented on pull request #29646:
URL: https://github.com/apache/spark/pull/29646#issuecomment-686945795


   NOTE: I will file a JIRA later if necessary.






[GitHub] [spark] maropu commented on pull request #29646: [SPARK-XXXXX][SQL][TEST] Add tests to check if since fields are set correctly in ExpressionInfo

2020-09-03 Thread GitBox


maropu commented on pull request #29646:
URL: https://github.com/apache/spark/pull/29646#issuecomment-686945464


   I don't have much time to check the versions (SPARK-32780) expr-by-expr now, 
but I think it's worth adding the test to prevent one from forgetting to set a 
`since` field when adding a new expression. WDYT? @HyukjinKwon 
   (NOTE: I made this PR instead because @tanelk looks absent)






[GitHub] [spark] maropu opened a new pull request #29646: [SPARK-XXXXX][SQL][TEST] Add tests to check if since fields are set correctly in ExpressionInfo

2020-09-03 Thread GitBox


maropu opened a new pull request #29646:
URL: https://github.com/apache/spark/pull/29646


   
   
   ### What changes were proposed in this pull request?
   
   This PR intends to add a test to check if `since` fields are set correctly 
in `ExpressionInfo`.
   This comes from the discussion in 
https://github.com/apache/spark/pull/29577#discussion_r479794502.
   
   The credit should go to @tanelk.
   
   ### Why are the changes needed?
   
   To prevent one from forgetting to set a `since` field when adding a new 
expression.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Added tests.
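   
   A minimal sketch of such a check (registry and accessor names are recalled 
from memory, so treat them as assumptions rather than the exact test in this 
PR): walk the builtin function registry and assert that each `ExpressionInfo` 
carries a version-like `since` string.
   
   ```scala
   import org.apache.spark.sql.catalyst.analysis.FunctionRegistry
   
   // Version strings look like "1.5.0" or "3.0.0"; the pattern is an assumption.
   val versionPattern = """^\d+\.\d+(\.\d+)?$""".r
   
   FunctionRegistry.builtin.listFunction().foreach { funcId =>
     FunctionRegistry.builtin.lookupFunction(funcId).foreach { info =>
       assert(versionPattern.findFirstIn(info.getSince).isDefined,
         s"Expression ${funcId.funcName} is missing a valid `since` field")
     }
   }
   ```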






[GitHub] [spark] xuanyuanking commented on a change in pull request #29461: [SPARK-32456][SS][FOLLOWUP] Update doc to note about using SQL statement with streaming Dataset

2020-09-03 Thread GitBox


xuanyuanking commented on a change in pull request #29461:
URL: https://github.com/apache/spark/pull/29461#discussion_r483416180



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##
@@ -2525,14 +2525,19 @@ class Dataset[T] private[sql](
 
   /**
* Returns a new Dataset that contains only the unique rows from this 
Dataset.
-   * This is an alias for `distinct`.
+   * This is an alias for `distinct` on batch [[Dataset]]. For streaming 
[[Dataset]], it would show
+   * slightly different behavior. (see below)
*
* For a static batch [[Dataset]], it just drops duplicate rows. For a 
streaming [[Dataset]], it
* will keep all data across triggers as intermediate state to drop 
duplicates rows. You can use
* [[withWatermark]] to limit how late the duplicate data can be and system 
will accordingly limit
* the state. In addition, too late data older than watermark will be 
dropped to avoid any
* possibility of duplicates.
*
+   * Note that for a streaming [[Dataset]], this method only returns distinct 
rows only once,
+   * regardless of the output mode. Spark may convert the `distinct` operation 
to aggregation`,

Review comment:
   +1 for the reworded version; actually, the original comment is also 
confusing. We need to emphasize the distinct clause in SQL.








[GitHub] [spark] xuanyuanking commented on a change in pull request #29461: [SPARK-32456][SS][FOLLOWUP] Update doc to note about using SQL statement with streaming Dataset

2020-09-03 Thread GitBox


xuanyuanking commented on a change in pull request #29461:
URL: https://github.com/apache/spark/pull/29461#discussion_r483414767



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##
@@ -2525,14 +2525,19 @@ class Dataset[T] private[sql](
 
   /**
* Returns a new Dataset that contains only the unique rows from this 
Dataset.
-   * This is an alias for `distinct`.
+   * This is an alias for `distinct` on batch [[Dataset]]. For streaming 
[[Dataset]], it would show
+   * slightly different behavior. (see below)
*
* For a static batch [[Dataset]], it just drops duplicate rows. For a 
streaming [[Dataset]], it
* will keep all data across triggers as intermediate state to drop 
duplicates rows. You can use
* [[withWatermark]] to limit how late the duplicate data can be and system 
will accordingly limit
* the state. In addition, too late data older than watermark will be 
dropped to avoid any
* possibility of duplicates.
*
+   * Note that for a streaming [[Dataset]], this method only returns distinct 
rows only once
+   * regardless of the output mode, which the behavior may not be same with 
using distinct in

Review comment:
   +1 for the second version; actually, the original comment is also 
confusing. We need to emphasize the distinct clause in SQL.

##
File path: docs/structured-streaming-programming-guide.md
##
@@ -861,6 +861,10 @@ isStreaming(df)
 
 
 
+You may want to check the logical plan of the query, as Spark converts the 
operation into another operation, which includes adding streaming aggregation. 
(e.g. count, distinct, union, etc.)

Review comment:
   I think the operation conversion is internal behavior; it may not be clear 
enough to ask the end user to check it. How about we just comment on the 
behavior difference between SQL distinct and Dataset dropDuplicates? WDYT?
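   
   A small illustration of the difference being discussed (column names are 
assumed): the same intent expressed through the Dataset API and through a SQL 
DISTINCT clause shows up differently in the analyzed logical plan, which 
catalyst rules may then rewrite in different ways.
   
   ```scala
   // Dataset API: dropDuplicates is planned as a Deduplicate node.
   val viaApi = df.dropDuplicates("value")
   
   // SQL statement: the DISTINCT clause is planned as a Distinct node.
   df.createOrReplaceTempView("events")
   val viaSql = spark.sql("SELECT DISTINCT value FROM events")
   
   // Comparing the analyzed plans makes the difference visible.
   println(viaApi.queryExecution.analyzed.treeString)
   println(viaSql.queryExecution.analyzed.treeString)
   ```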








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29461: [SPARK-32456][SS][FOLLOWUP] Update doc to note about using SQL statement with streaming Dataset

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29461:
URL: https://github.com/apache/spark/pull/29461#issuecomment-686943825










[GitHub] [spark] AmplabJenkins commented on pull request #29461: [SPARK-32456][SS][FOLLOWUP] Update doc to note about using SQL statement with streaming Dataset

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29461:
URL: https://github.com/apache/spark/pull/29461#issuecomment-686943825










[GitHub] [spark] SparkQA commented on pull request #29461: [SPARK-32456][SS][FOLLOWUP] Update doc to note about using SQL statement with streaming Dataset

2020-09-03 Thread GitBox


SparkQA commented on pull request #29461:
URL: https://github.com/apache/spark/pull/29461#issuecomment-686943331


   **[Test build #128279 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128279/testReport)**
 for PR 29461 at commit 
[`41817cb`](https://github.com/apache/spark/commit/41817cb302dd3b1fe1c6ac3eb647e4fd521e616a).






[GitHub] [spark] viirya commented on a change in pull request #29645: [SPARK-32796][SQL] Make withField API support nested struct in array

2020-09-03 Thread GitBox


viirya commented on a change in pull request #29645:
URL: https://github.com/apache/spark/pull/29645#discussion_r483408113



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveWithFields.scala
##
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.WithFields
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Resolves `UnresolvedWithFields`.
+ */
+object ResolveWithFields extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp 
{
+case e if !e.childrenResolved => e
+
+case q: LogicalPlan =>
+  q.transformExpressions {
+case expr if !expr.childrenResolved => expr
+case e: UnresolvedWithFields => WithFields(e.col, e.fieldName, e.expr)

Review comment:
   This can be moved to other proper rule. Just not sure which one is good, 
so put as an individual rule first.








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686937104










[GitHub] [spark] HeartSaVioR commented on a change in pull request #29461: [SPARK-32456][SS][FOLLOWUP] Update doc to note about using SQL statement with streaming Dataset

2020-09-03 Thread GitBox


HeartSaVioR commented on a change in pull request #29461:
URL: https://github.com/apache/spark/pull/29461#discussion_r483410078



##
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##
@@ -2525,14 +2525,19 @@ class Dataset[T] private[sql](
 
   /**
* Returns a new Dataset that contains only the unique rows from this 
Dataset.
-   * This is an alias for `distinct`.
+   * This is an alias for `distinct` on batch [[Dataset]]. For streaming 
[[Dataset]], it would show
+   * slightly different behavior. (see below)
*
* For a static batch [[Dataset]], it just drops duplicate rows. For a 
streaming [[Dataset]], it
* will keep all data across triggers as intermediate state to drop 
duplicates rows. You can use
* [[withWatermark]] to limit how late the duplicate data can be and system 
will accordingly limit
* the state. In addition, too late data older than watermark will be 
dropped to avoid any
* possibility of duplicates.
*
+   * Note that for a streaming [[Dataset]], this method only returns distinct 
rows only once,
+   * regardless of the output mode. Spark may convert the `distinct` operation 
to aggregation`,

Review comment:
   Probably I need to reword it, as the behavior is tied to the catalyst 
rules, which can change at any time. It would be enough to say ", whose 
behavior may not be the same as using `distinct` in a SQL statement". I'll fix 
it.








[GitHub] [spark] SparkQA removed a comment on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


SparkQA removed a comment on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686925219


   **[Test build #128277 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128277/testReport)**
 for PR 29634 at commit 
[`9dac672`](https://github.com/apache/spark/commit/9dac672d0d0545fd7060e7ee9f14ac50428bbbce).






[GitHub] [spark] AmplabJenkins commented on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686937104










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29645: [SPARK-32796][SQL] Make withField API support nested struct in array

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29645:
URL: https://github.com/apache/spark/pull/29645#issuecomment-686936496










[GitHub] [spark] SparkQA commented on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


SparkQA commented on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686936689


   **[Test build #128277 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128277/testReport)**
 for PR 29634 at commit 
[`9dac672`](https://github.com/apache/spark/commit/9dac672d0d0545fd7060e7ee9f14ac50428bbbce).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins commented on pull request #29645: [SPARK-32796][SQL] Make withField API support nested struct in array

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29645:
URL: https://github.com/apache/spark/pull/29645#issuecomment-686936496










[GitHub] [spark] SparkQA commented on pull request #29645: [SPARK-32796][SQL] Make withField API support nested struct in array

2020-09-03 Thread GitBox


SparkQA commented on pull request #29645:
URL: https://github.com/apache/spark/pull/29645#issuecomment-686935891


   **[Test build #128278 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128278/testReport)**
 for PR 29645 at commit 
[`643728d`](https://github.com/apache/spark/commit/643728d9cb6bdaac1ccc135b71cd24835c57c47f).






[GitHub] [spark] viirya commented on a change in pull request #29645: [SPARK-32796][SQL] Make withField API support nested struct in array

2020-09-03 Thread GitBox


viirya commented on a change in pull request #29645:
URL: https://github.com/apache/spark/pull/29645#discussion_r483408113



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveWithFields.scala
##
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions.WithFields
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Resolves `UnresolvedWithFields`.
+ */
+object ResolveWithFields extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp 
{
+case e if !e.childrenResolved => e
+
+case q: LogicalPlan =>
+  q.transformExpressions {
+case expr if !expr.childrenResolved => expr
+case e: UnresolvedWithFields => WithFields(e.col, e.fieldName, e.expr)

Review comment:
   This can be moved to another, more appropriate rule.








[GitHub] [spark] viirya opened a new pull request #29645: [SPARK-32796][SQL] Make withField API support nested struct in array

2020-09-03 Thread GitBox


viirya opened a new pull request #29645:
URL: https://github.com/apache/spark/pull/29645


   
   
   ### What changes were proposed in this pull request?
   
   
   This patch adds nested struct support to `Column.withField` API.
   
   ### Why are the changes needed?
   
   
   Currently, `Column.withField` only supports `StructType`; it does not 
support nested structs in `ArrayType`. Supporting nested structs in arrays 
makes the API more general and useful.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes. Adding nested struct support to `Column.withField` API.
   
   ### How was this patch tested?
   
   
   Unit tests.
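   
   A hedged usage sketch of the API being extended (the one-row schema is made 
up for illustration): `withField` adds or replaces a field of a struct column, 
and this PR extends the same API to structs nested inside arrays.
   
   ```scala
   import org.apache.spark.sql.functions._
   
   // A DataFrame with a struct column `s` holding fields `a` and `b`.
   val df = spark.range(1).select(struct(lit(1).as("a"), lit(2).as("b")).as("s"))
   
   // Add a new field `c` to the struct.
   val updated = df.withColumn("s", col("s").withField("c", lit(3)))
   updated.printSchema()
   ```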






[GitHub] [spark] wzhfy commented on a change in pull request #29589: [SPARK-32748][SQL] Support local property propagation in SubqueryBroadcastExec

2020-09-03 Thread GitBox


wzhfy commented on a change in pull request #29589:
URL: https://github.com/apache/spark/pull/29589#discussion_r483406090



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
##
@@ -1342,6 +1345,52 @@ abstract class DynamicPartitionPruningSuiteBase
   }
 }
   }
+
+  test("SPARK-32748: propagate local properties to dynamic pruning thread") {
+def checkPropertyValueByUdfResult(propKey: String, propValue: String): 
Unit = {
+  spark.sparkContext.setLocalProperty(propKey, propValue)
+  val df = sql(
+s"""
+   |SELECT compare_property_value(f.date_id, '$propKey', '$propValue') 
as col

Review comment:
   Yes, IIUC a UDF created by `spark.udf.register` is not foldable, so it 
will be evaluated in tasks. Besides, `TaskContext.get()` would return null 
if it's not run in a task, I think.
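   
   A minimal sketch of how such a UDF could be defined (the name comes from the 
quoted test; the body is an assumption): it reads the local property from the 
task context on the executor side and compares it against the expected value.
   
   ```scala
   import org.apache.spark.TaskContext
   
   // The registered UDF runs inside tasks, where TaskContext is available.
   spark.udf.register("compare_property_value",
     (id: Int, key: String, expected: String) =>
       TaskContext.get().getLocalProperty(key) == expected)
   ```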








[GitHub] [spark] wzhfy commented on a change in pull request #29589: [SPARK-32748][SQL] Support local property propagation in SubqueryBroadcastExec

2020-09-03 Thread GitBox


wzhfy commented on a change in pull request #29589:
URL: https://github.com/apache/spark/pull/29589#discussion_r483404918



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
##
@@ -1342,6 +1345,52 @@ abstract class DynamicPartitionPruningSuiteBase
   }
 }
   }
+
+  test("SPARK-32748: propagate local properties to dynamic pruning thread") {
+def checkPropertyValueByUdfResult(propKey: String, propValue: String): 
Unit = {
+  spark.sparkContext.setLocalProperty(propKey, propValue)
+  val df = sql(
+s"""
+   |SELECT compare_property_value(f.date_id, '$propKey', '$propValue') 
as col
+   |FROM fact_sk f
+   |INNER JOIN dim_store s
+   |ON f.store_id = s.store_id AND s.country = 'NL'
+  """.stripMargin)
+
+  checkPartitionPruningPredicate(df, false, true)
+  assert(df.collect().forall(_.toSeq == Seq(true)))
+}
+
+try {
+  
SQLConf.get.setConf(StaticSQLConf.BROADCAST_EXCHANGE_MAX_THREAD_THRESHOLD, 1)

Review comment:
   Yes, you are right, but it's not because it's a static conf; it's 
because `executionContext` is in the `SubqueryBroadcastExec` object.
   This makes it hard to write a unit test. Do you have any suggestions?








[GitHub] [spark] zero323 commented on a change in pull request #29639: [SPARK-32186][DOCS][PYTHON] Development - Debugging

2020-09-03 Thread GitBox


zero323 commented on a change in pull request #29639:
URL: https://github.com/apache/spark/pull/29639#discussion_r483402746



##
File path: python/docs/source/development/debugging.rst
##
@@ -0,0 +1,187 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+=================
+Debugging PySpark
+=================
+
+PySpark uses Spark as an engine. If a PySpark application does not require 
interaction
+between Python workers and JVMs, Python workers are not launched. They are 
lazily launched only when
+Python native functions or data have to be handled, for example, when you 
execute pandas UDFs or
+PySpark RDD APIs.
+
+This page describes how to debug such Python applications and workers instead 
of focusing on debugging with JVM.
+Profiling and debugging JVM is described at `Useful Developer Tools 
`_.
+
+
+Remote Debugging (PyCharm)

Review comment:
   Can we explicitly state that the remote debugger is available only in 
PyCharm Professional (I know, it is clearly explained in the PyCharm docs, but 
not everyone will get there)? And maybe link to [pydev remote debugger 
docs](https://www.pydev.org/manual_adv_remote_debugger.html) as an alternative.








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686925870










[GitHub] [spark] AmplabJenkins commented on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686925870










[GitHub] [spark] SparkQA commented on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


SparkQA commented on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686925219


   **[Test build #128277 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128277/testReport)**
 for PR 29634 at commit 
[`9dac672`](https://github.com/apache/spark/commit/9dac672d0d0545fd7060e7ee9f14ac50428bbbce).






[GitHub] [spark] HyukjinKwon commented on a change in pull request #29640: [SPARK-32180][PYTHON][DOCS] Installation page of Getting Started in PySpark documentation

2020-09-03 Thread GitBox


HyukjinKwon commented on a change in pull request #29640:
URL: https://github.com/apache/spark/pull/29640#discussion_r483400388



##
File path: python/docs/source/getting_started/installation.rst
##
@@ -0,0 +1,119 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+Installation
+============
+
+The official release channel is to download it from `the Apache Spark website 
`_.
+Alternatively, you can also install it via pip from PyPI.  PyPI installation 
is usually to use
+standalone locally or as a client to connect to a cluster instead of setting a 
cluster up.  
+ 
+This page includes the instructions for installing PySpark by using pip, 
Conda, downloading manually, and building it from the source.
+
+Python Version Supported
+------------------------
+
+Python 3.6 and above.
+
+Using PyPI
+----------
+
+PySpark installation using `PyPI `_
+
+.. code-block:: bash
+
+pip install pyspark
+   
+Using Conda
+-----------
+
+Conda is an open-source package management and environment management system 
which is a part of `Anaconda `_ 
distribution. It is both cross-platform and language agnostic.
+  
+Conda can be used to create a virtual environment from terminal as shown below:
+
+.. code-block:: bash
+
+   conda create -n pyspark_env 
+
+After the virtual environment is created, it should be visible under the list 
of conda environments which can be seen using the following command:
+
+.. code-block:: bash
+
+   conda env list
+
+The newly created environment can be accessed using the following command:
+
+.. code-block:: bash
+
+   conda activate pyspark_env
+
+In lower Conda version, the following command might be used:
+
+.. code-block:: bash
+
+   source activate pyspark_env
+
+PySpark installation using ``pip`` under Conda environment is official. 
+
+PySpark can be installed in this newly created environment using PyPI as shown 
before:
+
+.. code-block:: bash
+
+   pip install pyspark
+
+`PySpark at Conda `_ is not the 
official release.
+
+Official Release Channel
+------------------------
+
+Different flavor of PySpark is available in `the official release channel 
`_.
+Any suitable version can be downloaded and extracted as below:
+
+.. code-block:: bash
+
+tar xzvf spark-3.0.0-bin-hadoop2.7.tgz
+
+An important step is to ensure ``SPARK_HOME`` environment variable points to 
the directory where the code has been extracted. 
+The next step is to properly define ``PYTHONPATH`` such that it can find the 
PySpark and 
+Py4J under ``$SPARK_HOME/python/lib``, one example of doing this is shown 
below:
+
+.. code-block:: bash
+
+cd spark-3.0.0-bin-hadoop2.7
+export SPARK_HOME=`pwd`
+export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo 
"${ZIPS[*]}"):$PYTHONPATH
+
+Installing from Source
+----------------------
+
+To install PySpark from source, refer `Building Spark 
`_.
+
+Steps for defining ``PYTHONPATH`` is same as described in `Official Release 
Channel <#official-release-channel>`_. 
+
+Dependencies
+------------
+========= ========================= ========
+Package   Minimum supported version Note
+========= ========================= ========
+`pandas`  0.23.2                    Optional
+`NumPy`   1.7                       Optional
+`pyarrow` 0.15.1                    Optional
+`Py4J`    0.10.9                    Required
+========= ========================= ========
+
+**Note**: A prerequisite for PySpark installation is the availability of 
``JAVA 8 or 11`` and ``JAVA_HOME`` properly set.

Review comment:
   I would also add this note as well:
   
   ```
   If you are using JDK 11, you should set 
``-Dio.netty.tryReflectionSetAccessible=true`` for Arrow related
   features. See also `Downloading 
`_.
   ```





[GitHub] [spark] HyukjinKwon commented on a change in pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


HyukjinKwon commented on a change in pull request #29634:
URL: https://github.com/apache/spark/pull/29634#discussion_r483399207



##
File path: python/docs/source/development/testing.rst
##
@@ -0,0 +1,61 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+===============
+Testing PySpark
+===============
+
+In order to run PySpark tests, you should build Spark itself first via Maven
+or SBT. For example,
+
+.. code-block:: bash
+
+build/mvn -DskipTests clean package
+
+After that, the PySpark test cases can be run via using ``python/run-tests``. 
For example,
+
+.. code-block:: bash
+
+python/run-tests --python-executable=python3
+
+Note that:
+
+* If you are running tests on Mac OS, you may set 
``OBJC_DISABLE_INITIALIZE_FORK_SAFETY`` environment variable to ``YES``.
+* If you are using JDK 11, you should set 
``-Dio.netty.tryReflectionSetAccessible=true`` for Arrow related features. See 
also `Downloading `_.

Review comment:
   Oh! But we set it in the testing scripts by default, so it is mostly needed 
for debugging when we don't use our own testing script. Yes, I got the point 
now. Let me just remove this here.








[GitHub] [spark] cloud-fan commented on a change in pull request #29593: [SPARK-32753][SQL] Only copy tags to node with no tags

2020-09-03 Thread GitBox


cloud-fan commented on a change in pull request #29593:
URL: https://github.com/apache/spark/pull/29593#discussion_r483396414



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
##
@@ -91,7 +91,9 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] 
extends Product {
   private val tags: mutable.Map[TreeNodeTag[_], Any] = mutable.Map.empty
 
   protected def copyTagsFrom(other: BaseType): Unit = {
-tags ++= other.tags
+if (tags.isEmpty) {

Review comment:
   can we add some comments?
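   
   A sketch of what such a comment could say (the wording is an assumption, not 
the merged text):
   
   ```scala
   protected def copyTagsFrom(other: BaseType): Unit = {
     // Only copy tags if this node has no tags yet, so that tags already set
     // on a newly created node are not overwritten by the old node's tags.
     if (tags.isEmpty) {
       tags ++= other.tags
     }
   }
   ```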








[GitHub] [spark] maropu commented on a change in pull request #29643: [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan

2020-09-03 Thread GitBox


maropu commented on a change in pull request #29643:
URL: https://github.com/apache/spark/pull/29643#discussion_r483396030



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
##
@@ -168,6 +170,85 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] 
extends TreeNode[PlanT
 }.toSeq
   }
 
+
+  /**
+   * Rewrites this plan tree based on the given plan mappings from old plan 
nodes to new nodes.
+   * This method also updates all the related references in this plan tree 
accordingly, in case
+   * the replaced node has different output expr ID than the old node.
+   */
+  def rewriteWithPlanMapping(
+  planMapping: Map[PlanType, PlanType],
+  canGetOutput: PlanType => Boolean = _ => true): PlanType = {
+def internalRewrite(plan: PlanType): (PlanType, Seq[(Attribute, 
Attribute)]) = {
+  if (planMapping.contains(plan)) {

Review comment:
   Hm, yeah, this is a little complicated; I remember the existing tests 
failing for this reason. That might be the case below:
   ```
   SQLQueryTestSuite.sql
   org.scalatest.exceptions.TestFailedException: union.sql
   Expected "struct<[c1:decimal(11,1),c2:string]>", but got "struct<[]>" Schema 
did not match for query #3
   SELECT *
   FROM   (SELECT * FROM t1
   UNION ALL
   SELECT * FROM t2
   UNION ALL
   SELECT * FROM t2): -- !query
   SELECT *
   FROM   (SELECT * FROM t1
   UNION ALL
   SELECT * FROM t2
   UNION ALL
   SELECT * FROM t2)
   -- !query schema
   struct<>
   -- !query output
   org.apache.spark.sql.catalyst.errors.package$TreeNodeException
   After applying rule 
org.apache.spark.sql.catalyst.optimizer.RemoveNoopOperators in batch Operator 
Optimization before Inferring Filters, the structural integrity of the plan is 
broken., tree:
   'Union false, false
   ```








[GitHub] [spark] maropu commented on a change in pull request #29643: [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan

2020-09-03 Thread GitBox


maropu commented on a change in pull request #29643:
URL: https://github.com/apache/spark/pull/29643#discussion_r483396030



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
##
@@ -168,6 +170,85 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] 
extends TreeNode[PlanT
 }.toSeq
   }
 
+
+  /**
+   * Rewrites this plan tree based on the given plan mappings from old plan 
nodes to new nodes.
+   * This method also updates all the related references in this plan tree 
accordingly, in case
+   * the replaced node has different output expr ID than the old node.
+   */
+  def rewriteWithPlanMapping(
+  planMapping: Map[PlanType, PlanType],
+  canGetOutput: PlanType => Boolean = _ => true): PlanType = {
+def internalRewrite(plan: PlanType): (PlanType, Seq[(Attribute, 
Attribute)]) = {
+  if (planMapping.contains(plan)) {

Review comment:
   hm, yea, this is a little complicated; I remember the existing tests failing 
for this reason. That might be the case below:
   ```
   SQLQueryTestSuite.sql
   org.scalatest.exceptions.TestFailedException: union.sql
   Expected "struct<[c1:decimal(11,1),c2:string]>", but got "struct<[]>" Schema 
did not match for query #3
   SELECT *
   FROM   (SELECT * FROM t1
   UNION ALL
   SELECT * FROM t2
   UNION ALL
   SELECT * FROM t2): -- !query
   SELECT *
   FROM   (SELECT * FROM t1
   UNION ALL
   SELECT * FROM t2
   UNION ALL
   SELECT * FROM t2)
   -- !query schema
   struct<>
   -- !query output
   org.apache.spark.sql.catalyst.errors.package$TreeNodeException
   After applying rule 
org.apache.spark.sql.catalyst.optimizer.RemoveNoopOperators in batch Operator 
Optimization before Inferring Filters, the structural integrity of the plan is 
broken., tree:
   'Union false, false
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] jainshashank24 removed a comment on pull request #28828: [SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numRowsDroppedByWatermark"

2020-09-03 Thread GitBox


jainshashank24 removed a comment on pull request #28828:
URL: https://github.com/apache/spark/pull/28828#issuecomment-686435584


   Hi, I have used this MR and the other one 
https://github.com/apache/spark/pull/28607/files.
   Even after that I can't see the counter value increasing, though I can see 
the counter under "stateOperators":
   
   "stateOperators" : [ {
     "numRowsTotal" : 0,
     "numRowsUpdated" : 0,
     "memoryUsedBytes" : 1344,
     "numRowsDroppedByWatermark" : 0,
     "customMetrics" : {
       "loadedMapCacheHitCount" : 0,
       "loadedMapCacheMissCount" : 0,
       "stateOnCurrentVersionSizeBytes" : 480
     }
   } ],
   
   @HeartSaVioR if you can help out
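
   In case it helps to reproduce: the counter only moves when a stateful 
operator actually evicts late rows, so the query needs a watermark *and* data 
arriving behind it. A hedged sketch (it uses `MemoryStream`, an internal 
testing source, and illustrative timestamps):

   ```
   import java.sql.Timestamp
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.execution.streaming.MemoryStream
   import org.apache.spark.sql.functions._

   val spark = SparkSession.builder().master("local[2]").appName("late-rows").getOrCreate()
   import spark.implicits._
   implicit val sqlCtx = spark.sqlContext

   val input = MemoryStream[Timestamp]
   val counts = input.toDF().toDF("eventTime")
     .withWatermark("eventTime", "10 seconds")
     .groupBy(window($"eventTime", "5 seconds"))
     .count()

   val query = counts.writeStream.outputMode("update").format("console").start()

   input.addData(Timestamp.valueOf("2020-09-03 10:00:00"))
   query.processAllAvailable() // watermark advances to ~09:59:50
   input.addData(Timestamp.valueOf("2020-09-03 09:00:00")) // far behind the watermark
   query.processAllAvailable() // this row should be dropped by the aggregation

   query.lastProgress.stateOperators.foreach(println) // expect numRowsDroppedByWatermark > 0
   ```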



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #29643: [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan

2020-09-03 Thread GitBox


cloud-fan commented on a change in pull request #29643:
URL: https://github.com/apache/spark/pull/29643#discussion_r483394790



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
##
@@ -168,6 +170,85 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] 
extends TreeNode[PlanT
 }.toSeq
   }
 
+
+  /**
+   * Rewrites this plan tree based on the given plan mappings from old plan 
nodes to new nodes.
+   * This method also updates all the related references in this plan tree 
accordingly, in case
+   * the replaced node has different output expr ID than the old node.
+   */
+  def rewriteWithPlanMapping(
+  planMapping: Map[PlanType, PlanType],
+  canGetOutput: PlanType => Boolean = _ => true): PlanType = {
+def internalRewrite(plan: PlanType): (PlanType, Seq[(Attribute, 
Attribute)]) = {
+  if (planMapping.contains(plan)) {

Review comment:
   hmm, is this a real-world case? I think this is too complicated if we 
need to replace nodes in the values of `planMapping`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29593: [SPARK-32753][SQL] Only copy tags to node with no tags

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29593:
URL: https://github.com/apache/spark/pull/29593#issuecomment-686909589







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29593: [SPARK-32753][SQL] Only copy tags to node with no tags

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29593:
URL: https://github.com/apache/spark/pull/29593#issuecomment-686909589







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29593: [SPARK-32753][SQL] Only copy tags to node with no tags

2020-09-03 Thread GitBox


SparkQA removed a comment on pull request #29593:
URL: https://github.com/apache/spark/pull/29593#issuecomment-686823149


   **[Test build #128270 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128270/testReport)**
 for PR 29593 at commit 
[`b990a06`](https://github.com/apache/spark/commit/b990a06239c37b7475b86665d8e7751bb7ca7c9a).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29593: [SPARK-32753][SQL] Only copy tags to node with no tags

2020-09-03 Thread GitBox


SparkQA commented on pull request #29593:
URL: https://github.com/apache/spark/pull/29593#issuecomment-686908644


   **[Test build #128270 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128270/testReport)**
 for PR 29593 at commit 
[`b990a06`](https://github.com/apache/spark/commit/b990a06239c37b7475b86665d8e7751bb7ca7c9a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on a change in pull request #29643: [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan

2020-09-03 Thread GitBox


maropu commented on a change in pull request #29643:
URL: https://github.com/apache/spark/pull/29643#discussion_r483387646



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
##
@@ -168,6 +170,85 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] 
extends TreeNode[PlanT
 }.toSeq
   }
 
+
+  /**
+   * Rewrites this plan tree based on the given plan mappings from old plan 
nodes to new nodes.
+   * This method also updates all the related references in this plan tree 
accordingly, in case
+   * the replaced node has different output expr ID than the old node.
+   */
+  def rewriteWithPlanMapping(
+  planMapping: Map[PlanType, PlanType],
+  canGetOutput: PlanType => Boolean = _ => true): PlanType = {
+def internalRewrite(plan: PlanType): (PlanType, Seq[(Attribute, 
Attribute)]) = {
+  if (planMapping.contains(plan)) {

Review comment:
   IIUC this check cannot correctly handle nested cases in `planMapping`; 
for example,
   ```
   Project
   +- Union
      :- (1) Project
      :  +- Union
      :     :- (2) Project
      :     +- Project
      +- Project
         +- ...
   ```
   If the two nested `Project`s above, `(1)` and `(2)`, are stored in 
`planMapping`, I think only case `(1)` is matched by this condition and case 
`(2)` is just ignored. So, I rewrote the logic a bit in the previous PR so that 
plans are replaced in a bottom-up way:
   
https://github.com/apache/spark/blob/1de272f98d0ff22d0dd151797f22b8faf310963a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L140-L144
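
   A toy, self-contained illustration of the bottom-up idea (a hypothetical 
`Plan` case class, not Spark's `QueryPlan` API): rewriting the children first 
and then checking the node itself means a mapping entry nested inside another 
mapped node, like `(2)` inside `(1)`, is still applied.

   ```
   case class Plan(name: String, children: List[Plan] = Nil)

   object BottomUpDemo extends App {
     def rewriteBottomUp(plan: Plan, mapping: Map[Plan, Plan]): Plan = {
       // Rewrite the children first, then look the *original* node up in the
       // mapping, keeping the already-rewritten children on the replacement.
       val newChildren = plan.children.map(rewriteBottomUp(_, mapping))
       mapping.get(plan) match {
         case Some(replacement) => replacement.copy(children = newChildren)
         case None              => plan.copy(children = newChildren)
       }
     }

     val p2 = Plan("Project(2)")
     val p1 = Plan("Project(1)", List(Plan("Union", List(p2, Plan("Project")))))
     val root = Plan("Project", List(Plan("Union", List(p1, Plan("Project")))))
     val mapping = Map(p1 -> Plan("NewProject(1)"), p2 -> Plan("NewProject(2)"))
     // Both (1) and (2) end up replaced, even though (2) sits under (1).
     println(rewriteBottomUp(root, mapping))
   }
   ```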





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686906378







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686906378







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


SparkQA commented on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686906035


   **[Test build #128275 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128275/testReport)**
 for PR 29639 at commit 
[`70adf81`](https://github.com/apache/spark/commit/70adf815715405060796b7ff8d34f58149b2d0bb).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


SparkQA removed a comment on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686895732


   **[Test build #128275 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128275/testReport)**
 for PR 29639 at commit 
[`70adf81`](https://github.com/apache/spark/commit/70adf815715405060796b7ff8d34f58149b2d0bb).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on pull request #29087: [SPARK-28227][SQL] Support TRANSFORM with aggregation

2020-09-03 Thread GitBox


AngersZhuuuu commented on pull request #29087:
URL: https://github.com/apache/spark/pull/29087#issuecomment-686902515


   @HyukjinKwon, as mentioned in 
https://github.com/apache/spark/pull/29087#discussion_r454101882, can you also 
help review that PR?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang edited a comment on pull request #29638: [SPARK-32687][SQL] Let CostBasedJoinReorder produce relatively deterministic optimization result

2020-09-03 Thread GitBox


LuciferYang edited a comment on pull request #29638:
URL: https://github.com/apache/spark/pull/29638#issuecomment-686862866


   > @LuciferYang are you sure? tests passed for that PR. Build jobs seem fine. 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
   > What are you seeing?
   
   @srowen Sorry, it was my mistake. What I meant was that SPARK-32755 blocks 
compilation of the sql/catalyst module in Scala 2.13, not Scala 2.12.
   
   It's not related to this PR; it just brings some fix work in Scala 2.13 :(



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on a change in pull request #29643: [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan

2020-09-03 Thread GitBox


maropu commented on a change in pull request #29643:
URL: https://github.com/apache/spark/pull/29643#discussion_r483382255



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##
@@ -123,127 +123,6 @@ object AnalysisContext {
   }
 }
 
-object Analyzer {
-
-  /**
-   * Rewrites a given `plan` recursively based on rewrite mappings from old 
plans to new ones.
-   * This method also updates all the related references in the `plan` 
accordingly.
-   *
-   * @param plan to rewrite
-   * @param rewritePlanMap has mappings from old plans to new ones for the 
given `plan`.
-   * @return a rewritten plan and updated references related to a root node of
-   * the given `plan` for rewriting it.
-   */
-  def rewritePlan(plan: LogicalPlan, rewritePlanMap: Map[LogicalPlan, 
LogicalPlan])
-: (LogicalPlan, Seq[(Attribute, Attribute)]) = {
-if (plan.resolved) {
-  val attrMapping = new mutable.ArrayBuffer[(Attribute, Attribute)]()
-  val newChildren = plan.children.map { child =>
-// If not, we'd rewrite child plan recursively until we find the
-// conflict node or reach the leaf node.
-val (newChild, childAttrMapping) = rewritePlan(child, rewritePlanMap)
-attrMapping ++= childAttrMapping.filter { case (oldAttr, _) =>
-  // `attrMapping` is not only used to replace the attributes of the 
current `plan`,
-  // but also to be propagated to the parent plans of the current 
`plan`. Therefore,
-  // the `oldAttr` must be part of either `plan.references` (so that 
it can be used to
-  // replace attributes of the current `plan`) or `plan.outputSet` (so 
that it can be
-  // used by those parent plans).
-  (plan.outputSet ++ plan.references).contains(oldAttr)
-}
-newChild
-  }
-
-  val newPlan = if (rewritePlanMap.contains(plan)) {
-rewritePlanMap(plan).withNewChildren(newChildren)
-  } else {
-plan.withNewChildren(newChildren)
-  }
-
-  assert(!attrMapping.groupBy(_._1.exprId)
-.exists(_._2.map(_._2.exprId).distinct.length > 1),
-"Found duplicate rewrite attributes")
-
-  val attributeRewrites = AttributeMap(attrMapping)
-  // Using attrMapping from the children plans to rewrite their parent 
node.
-  // Note that we shouldn't rewrite a node using attrMapping from its 
sibling nodes.
-  val p = newPlan.transformExpressions {
-case a: Attribute =>
-  updateAttr(a, attributeRewrites)
-case s: SubqueryExpression =>
-  s.withNewPlan(updateOuterReferencesInSubquery(s.plan, 
attributeRewrites))
-  }
-  attrMapping ++= plan.output.zip(p.output)
-.filter { case (a1, a2) => a1.exprId != a2.exprId }
-  p -> attrMapping
-} else {
-  // Just passes through unresolved nodes
-  plan.mapChildren {

Review comment:
   Ah, I see. Nice catch.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang edited a comment on pull request #29638: [SPARK-32687][SQL] Let CostBasedJoinReorder produce relatively deterministic optimization result

2020-09-03 Thread GitBox


LuciferYang edited a comment on pull request #29638:
URL: https://github.com/apache/spark/pull/29638#issuecomment-686890566


   > Hmm, why this is needed? Firstly I thought CostBasedJoinReorder will 
produce non-deterministic for same query. But I looked at the JIRA description, 
seems for different input, the rule will produce different output. Doesn't it 
sound reasonable? Different input causes different output.
   
   @viirya Sorry, I didn't describe it clearly. Actually, there are 2 problems 
we found in SPARK-32526:
   
   1. For the same Scala version, different input causes different output, as I 
describe in SPARK-32687. For example:
   
   ```
   d1.join(t3).join(t4).join(f1).join(d3).join(d2)
     .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
       (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
       (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
       (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
       (nameToAttr("f1_fk3") === nameToAttr("d3_pk")))
   ```
   
   and 
   
   ```
   d1.join(t3).join(f1).join(d2).join(t4).join(d3)
     .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
       (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
       (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
       (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
       (nameToAttr("f1_fk3") === nameToAttr("d3_pk")))
   ```
   
   have different optimization results. I think this is acceptable if the 
candidates have the same cost, but @cloud-fan may have a different view in 
https://github.com/apache/spark/pull/29434; I'm not sure I understand it 
correctly.
   
   2. For different Scala versions (2.12 vs 2.13), the same input may cause 
different output. For example:
   
   ```
   d1.join(t3).join(t4).join(f1).join(d2).join(t5).join(t6).join(d3).join(t1).join(t2)
     .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
       (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
       (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
       (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
       (nameToAttr("d2_c2") === nameToAttr("t5_c1")) &&
       (nameToAttr("t5_c2") === nameToAttr("t6_c2")) &&
       (nameToAttr("f1_fk3") === nameToAttr("d3_pk")) &&
       (nameToAttr("d3_c2") === nameToAttr("t1_c1")) &&
       (nameToAttr("t1_c2") === nameToAttr("t2_c2")))
   ```
   
   has different optimization results in Scala 2.12 and Scala 2.13. This PR can 
also fix this problem. If everyone thinks that `different input causes 
different output` is reasonable, I will close this first. But we may also need 
to resolve problem 2; I will describe it in another JIRA and try to fix it.
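
   For what it's worth, a tiny sketch of the underlying mechanism as I 
understand it (illustrative names only, not the actual join-reorder code): when 
equal-cost candidates live in a hash-based map, the winner depends on iteration 
order, which changed between Scala 2.12 and 2.13; tie-breaking on a stable key 
makes the pick deterministic.

   ```
   import scala.collection.mutable

   object DeterministicPick extends App {
     case class Candidate(planId: String, cost: Double)

     def pickBest(candidates: mutable.HashMap[String, Candidate]): Candidate =
       candidates.values.toSeq
         // Break ties on a stable key rather than relying on hash iteration
         // order, which differs between Scala 2.12 and 2.13.
         .sortBy(c => (c.cost, c.planId))
         .head

     val tied = mutable.HashMap(
       "planA" -> Candidate("planA", 10.0),
       "planB" -> Candidate("planB", 10.0)) // same cost: a genuine tie
     println(pickBest(tied).planId) // always "planA", on both Scala versions
   }
   ```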



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #29638: [SPARK-32687][SQL] Let CostBasedJoinReorder produce relatively deterministic optimization result

2020-09-03 Thread GitBox


LuciferYang commented on pull request #29638:
URL: https://github.com/apache/spark/pull/29638#issuecomment-686899631


   @viirya I'm also struggling with this issue :(



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang edited a comment on pull request #29638: [SPARK-32687][SQL] Let CostBasedJoinReorder produce relatively deterministic optimization result

2020-09-03 Thread GitBox


LuciferYang edited a comment on pull request #29638:
URL: https://github.com/apache/spark/pull/29638#issuecomment-686890566


   > Hmm, why this is needed? Firstly I thought CostBasedJoinReorder will 
produce non-deterministic for same query. But I looked at the JIRA description, 
seems for different input, the rule will produce different output. Doesn't it 
sound reasonable? Different input causes different output.
   
   @viirya Sorry, I didn't describe it clearly. Actually, there are 2 problems 
we found in SPARK-32526:
   
   1. For the same Scala version, different input causes different output, as I 
describe in SPARK-32687. For example:
   
   ```
   d1.join(t3).join(t4).join(f1).join(d3).join(d2)
     .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
       (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
       (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
       (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
       (nameToAttr("f1_fk3") === nameToAttr("d3_pk")))
   ```
   
   and 
   
   ```
   d1.join(t3).join(f1).join(d2).join(t4).join(d3)
     .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
       (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
       (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
       (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
       (nameToAttr("f1_fk3") === nameToAttr("d3_pk")))
   ```
   
   have different optimization results. I think this is acceptable if the 
candidates have the same cost, but @cloud-fan has a different view in 
https://github.com/apache/spark/pull/29434; I'm not sure I understand it 
correctly.
   
   2. For different Scala versions (2.12 vs 2.13), the same input may cause 
different output. For example:
   
   ```
   d1.join(t3).join(t4).join(f1).join(d2).join(t5).join(t6).join(d3).join(t1).join(t2)
     .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
       (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
       (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
       (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
       (nameToAttr("d2_c2") === nameToAttr("t5_c1")) &&
       (nameToAttr("t5_c2") === nameToAttr("t6_c2")) &&
       (nameToAttr("f1_fk3") === nameToAttr("d3_pk")) &&
       (nameToAttr("d3_c2") === nameToAttr("t1_c1")) &&
       (nameToAttr("t1_c2") === nameToAttr("t2_c2")))
   ```
   
   has different optimization results in Scala 2.12 and Scala 2.13. This PR can 
also resolve this problem. If everyone thinks that `different input causes 
different output` is reasonable, I will close this first. But we may also need 
to resolve problem 2; I will describe it in another JIRA and try to fix it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29087: [SPARK-28227][SQL] Support TRANSFORM with aggregation

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29087:
URL: https://github.com/apache/spark/pull/29087#issuecomment-686898464







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang edited a comment on pull request #29638: [SPARK-32687][SQL] Let CostBasedJoinReorder produce relatively deterministic optimization result

2020-09-03 Thread GitBox


LuciferYang edited a comment on pull request #29638:
URL: https://github.com/apache/spark/pull/29638#issuecomment-686890566


   > Hmm, why this is needed? Firstly I thought CostBasedJoinReorder will 
produce non-deterministic for same query. But I looked at the JIRA description, 
seems for different input, the rule will produce different output. Doesn't it 
sound reasonable? Different input causes different output.
   
   @viirya Sorry, I didn't describe it clearly. Actually, there are 2 problems 
we found in SPARK-32526:
   
   1. For the same Scala version, different input causes different output, as I 
describe in SPARK-32687. For example:
   
   ```
   d1.join(t3).join(t4).join(f1).join(d3).join(d2)
     .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
       (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
       (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
       (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
       (nameToAttr("f1_fk3") === nameToAttr("d3_pk")))
   ```
   
   and 
   
   ```
   d1.join(t3).join(f1).join(d2).join(t4).join(d3)
     .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
       (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
       (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
       (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
       (nameToAttr("f1_fk3") === nameToAttr("d3_pk")))
   ```
   
   have different optimization results. I think this is acceptable if the 
candidates have the same cost, but @cloud-fan has a different view in 
https://github.com/apache/spark/pull/29434; I'm not sure I understand it 
correctly.
   
   2. For different Scala versions (2.12 vs 2.13), the same input may cause 
different output. For example:
   
   ```
   d1.join(t3).join(t4).join(f1).join(d2).join(t5).join(t6).join(d3).join(t1).join(t2)
     .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
       (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
       (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
       (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
       (nameToAttr("d2_c2") === nameToAttr("t5_c1")) &&
       (nameToAttr("t5_c2") === nameToAttr("t6_c2")) &&
       (nameToAttr("f1_fk3") === nameToAttr("d3_pk")) &&
       (nameToAttr("d3_c2") === nameToAttr("t1_c1")) &&
       (nameToAttr("t1_c2") === nameToAttr("t2_c2")))
   ```
   
   has different optimization results in Scala 2.12 and Scala 2.13. This PR can 
also fix this problem. If everyone thinks that `different input causes 
different output` is reasonable, I will close this first. But we may also need 
to resolve problem 2; I will describe it in another JIRA and try to fix it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29087: [SPARK-28227][SQL] Support TRANSFORM with aggregation

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29087:
URL: https://github.com/apache/spark/pull/29087#issuecomment-686898464







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29087: [SPARK-28227][SQL] Support TRANSFORM with aggregation

2020-09-03 Thread GitBox


SparkQA commented on pull request #29087:
URL: https://github.com/apache/spark/pull/29087#issuecomment-686897991


   **[Test build #128276 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128276/testReport)**
 for PR 29087 at commit 
[`5d85160`](https://github.com/apache/spark/commit/5d85160abca388a53054551ad7ce9e48e363dcd5).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29087: [SPARK-28227][SQL] Support TRANSFORM with aggregation

2020-09-03 Thread GitBox


HyukjinKwon commented on pull request #29087:
URL: https://github.com/apache/spark/pull/29087#issuecomment-686896927


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686896591







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686896591







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


SparkQA removed a comment on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686888751


   **[Test build #128274 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128274/testReport)**
 for PR 29634 at commit 
[`46934b3`](https://github.com/apache/spark/commit/46934b3470480d7c1cda711289546ae9ba419a6b).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


SparkQA commented on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686896263


   **[Test build #128274 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128274/testReport)**
 for PR 29634 at commit 
[`46934b3`](https://github.com/apache/spark/commit/46934b3470480d7c1cda711289546ae9ba419a6b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


SparkQA commented on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686895732


   **[Test build #128275 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128275/testReport)**
 for PR 29639 at commit 
[`70adf81`](https://github.com/apache/spark/commit/70adf815715405060796b7ff8d34f58149b2d0bb).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686893915







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686893915







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29364: [SPARK-32548][SQL] - Add Application attemptId support to SQL Rest API

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29364:
URL: https://github.com/apache/spark/pull/29364#issuecomment-686886945







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28269: [SPARK-31493][SQL] Optimize InSet to In according partition size at InSubqueryExec

2020-09-03 Thread GitBox


SparkQA commented on pull request #28269:
URL: https://github.com/apache/spark/pull/28269#issuecomment-686887154


   **[Test build #128272 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128272/testReport)**
 for PR 28269 at commit 
[`54d75f5`](https://github.com/apache/spark/commit/54d75f5fbffe97eae0429b9fa727995eeab0b4c7).
* This patch **fails Spark unit tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29364: [SPARK-32548][SQL] - Add Application attemptId support to SQL Rest API

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29364:
URL: https://github.com/apache/spark/pull/29364#issuecomment-686886945







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28269: [SPARK-31493][SQL] Optimize InSet to In according partition size at InSubqueryExec

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #28269:
URL: https://github.com/apache/spark/pull/28269#issuecomment-686887480







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


HyukjinKwon commented on a change in pull request #29634:
URL: https://github.com/apache/spark/pull/29634#discussion_r483369994



##
File path: python/docs/source/development/testing.rst
##
@@ -0,0 +1,61 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+===============
+Testing PySpark
+===============
+
+In order to run PySpark tests, you should build Spark itself first via Maven
+or SBT. For example,
+
+.. code-block:: bash
+
+build/mvn -DskipTests clean package
+
+After that, the PySpark test cases can be run using ``python/run-tests``. For 
example,
+
+.. code-block:: bash
+
+python/run-tests --python-executable=python3
+
+Note that:
+
+* If you are running tests on Mac OS, you may set the 
``OBJC_DISABLE_INITIALIZE_FORK_SAFETY`` environment variable to ``YES``.
+* If you are using JDK 11, you should set 
``-Dio.netty.tryReflectionSetAccessible=true`` for Arrow-related features. See 
also `Downloading `_.

Review comment:
   Yeah, ideally we should let it work out of the box, but it is actually 
required to set the `tryReflectionSetAccessible` property to run with JDK 11 
(see https://github.com/apache/spark/pull/26552); otherwise, the Arrow-related 
code paths fail. See also ARROW-7223.
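
   A hedged example of wiring that property into a local test run (the 
`JAVA_HOME` path and the `SPARK_SUBMIT_OPTS` plumbing are assumptions about a 
typical setup, not something this PR prescribes):

   ```
   # Build first, then run the PySpark tests on JDK 11; the extra JVM flag is
   # needed for the Arrow code paths (see ARROW-7223).
   export JAVA_HOME=/path/to/jdk-11   # assumption: adjust to your install
   export SPARK_SUBMIT_OPTS="-Dio.netty.tryReflectionSetAccessible=true"
   build/mvn -DskipTests clean package
   python/run-tests --python-executable=python3
   ```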





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #28269: [SPARK-31493][SQL] Optimize InSet to In according partition size at InSubqueryExec

2020-09-03 Thread GitBox


SparkQA removed a comment on pull request #28269:
URL: https://github.com/apache/spark/pull/28269#issuecomment-686835937


   **[Test build #128272 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128272/testReport)**
 for PR 28269 at commit 
[`54d75f5`](https://github.com/apache/spark/commit/54d75f5fbffe97eae0429b9fa727995eeab0b4c7).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


HyukjinKwon commented on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686890672


   Thanks @srowen and @viirya.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #29638: [SPARK-32687][SQL] Let CostBasedJoinReorder produce relatively deterministic optimization result

2020-09-03 Thread GitBox


LuciferYang commented on pull request #29638:
URL: https://github.com/apache/spark/pull/29638#issuecomment-686890566


   > Hmm, why this is needed? Firstly I thought CostBasedJoinReorder will 
produce non-deterministic for same query. But I looked at the JIRA description, 
seems for different input, the rule will produce different output. Doesn't it 
sound reasonable? Different input causes different output.
   
   @viirya Sorry, I didn't describe it clearly. Actually, there are 2 problems 
we found in SPARK-32526:
   
   1. For the same Scala version, different input causes different output, as I 
describe in SPARK-32687. For example:
   
   ```
   d1.join(t3).join(t4).join(f1).join(d3).join(d2)
     .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
       (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
       (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
       (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
       (nameToAttr("f1_fk3") === nameToAttr("d3_pk")))
   ```
   
   and 
   
   ```
   d1.join(t3).join(f1).join(d2).join(t4).join(d3)
     .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
       (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
       (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
       (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
       (nameToAttr("f1_fk3") === nameToAttr("d3_pk")))
   ```
   
   have different optimization results. I think this is acceptable if the 
candidates have the same cost, but @cloud-fan has a different view in 
https://github.com/apache/spark/pull/29434; I'm not sure I understand it 
correctly.
   
   2. For different Scala versions (2.12 vs 2.13), the same input may cause 
different output. For example:
   
   ```
   d1.join(t3).join(t4).join(f1).join(d2).join(t5).join(t6).join(d3).join(t1).join(t2)
     .where((nameToAttr("d1_c2") === nameToAttr("t3_c1")) &&
       (nameToAttr("t3_c2") === nameToAttr("t4_c2")) &&
       (nameToAttr("d1_pk") === nameToAttr("f1_fk1")) &&
       (nameToAttr("f1_fk2") === nameToAttr("d2_pk")) &&
       (nameToAttr("d2_c2") === nameToAttr("t5_c1")) &&
       (nameToAttr("t5_c2") === nameToAttr("t6_c2")) &&
       (nameToAttr("f1_fk3") === nameToAttr("d3_pk")) &&
       (nameToAttr("d3_c2") === nameToAttr("t1_c1")) &&
       (nameToAttr("t1_c2") === nameToAttr("t2_c2")))
   ```
   
   has different optimization results in Scala 2.12 and Scala 2.13. If everyone 
thinks that `different input causes different output` is reasonable, I will 
close this first. But we may also need to resolve problem 2; I will describe it 
in another JIRA and try to fix it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28269: [SPARK-31493][SQL] Optimize InSet to In according partition size at InSubqueryExec

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #28269:
URL: https://github.com/apache/spark/pull/28269#issuecomment-686887489


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128272/
   Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


SparkQA commented on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686888751


   **[Test build #128274 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128274/testReport)**
 for PR 29634 at commit 
[`46934b3`](https://github.com/apache/spark/commit/46934b3470480d7c1cda711289546ae9ba419a6b).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28269: [SPARK-31493][SQL] Optimize InSet to In according partition size at InSubqueryExec

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #28269:
URL: https://github.com/apache/spark/pull/28269#issuecomment-686887480


   Build finished. Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686889107







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29634: [SPARK-32783][DOCS][PYTHON] Development - Testing PySpark

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29634:
URL: https://github.com/apache/spark/pull/29634#issuecomment-686889107







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29364: [SPARK-32548][SQL] - Add Application attemptId support to SQL Rest API

2020-09-03 Thread GitBox


SparkQA removed a comment on pull request #29364:
URL: https://github.com/apache/spark/pull/29364#issuecomment-686812047


   **[Test build #128269 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128269/testReport)**
 for PR 29364 at commit 
[`27455d0`](https://github.com/apache/spark/commit/27455d007c49ceec991b8813196a8430e3edee1e).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29364: [SPARK-32548][SQL] - Add Application attemptId support to SQL Rest API

2020-09-03 Thread GitBox


SparkQA commented on pull request #29364:
URL: https://github.com/apache/spark/pull/29364#issuecomment-686886390


   **[Test build #128269 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128269/testReport)**
 for PR 29364 at commit 
[`27455d0`](https://github.com/apache/spark/commit/27455d007c49ceec991b8813196a8430e3edee1e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `case class Person(id: Int, name: String, age: Int)`
 * `case class Salary(personId: Int, salary: Double)`
 * `class SqlResourceWithActualMetricsSuite extends SharedSparkSession with 
SQLMetricsTestUtils `



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686884086


   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/128273/
   Test PASSed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686884077







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


viirya commented on a change in pull request #29639:
URL: https://github.com/apache/spark/pull/29639#discussion_r483366839



##
File path: python/docs/source/development/debugging.rst
##
@@ -0,0 +1,187 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+=================
+Debugging PySpark
+=================
+
+PySpark uses Spark as an engine. If a PySpark application does not require 
interaction
+between Python workers and JVMs, Python workers are not launched. They are 
lazily launched only when
+Python native functions or data have to be handled, for example, when you 
execute pandas UDFs or
+PySpark RDD APIs.
+
+This page describes how to debug such Python applications and workers, rather than focusing on debugging the JVM.
+Profiling and debugging the JVM are described at `Useful Developer Tools <https://spark.apache.org/developer-tools.html>`_.
+
+
+Remote Debugging (PyCharm)
+--------------------------
+
+In order to debug the Python workers remotely, you should connect from the Python worker to the debug server in PyCharm.
+This section describes remote debugging on a single machine for ease of demonstration.
+In order to debug PySpark applications on other machines, please refer to the full instructions that are specific
+to PyCharm, documented `here <https://www.jetbrains.com/help/pycharm/remote-debugging-with-product.html>`_.
+
+Firstly, choose **Edit Configurations...** from the **Run** menu. It opens the Run/Debug Configurations dialog.
+Click ``+`` on the toolbar and, from the list of available configurations, select **Python Debug Server**.
+Enter the name of this run/debug configuration, for example ``MyRemoteDebugger``, and also specify the port number, for example ``12345``.
+
+.. image:: ../../../../docs/img/pyspark-remote-debug1.png
+:alt: PyCharm remote debugger setting
+
+| After that, you should install the corresponding version of the ``pydevd-pycharm`` package. The dialog from the previous step shows the command to install.
+
+.. code-block:: text
+
+    pip install pydevd-pycharm~=<version of PyCharm on the local machine>
+
+In your current working directory, prepare a Python file as below:
+
+.. code-block:: bash
+
+echo "from pyspark import daemon, worker
+def remote_debug_wrapped(*args, **kwargs):
+#==Copy and paste from the previous 
dialog===
+import pydevd_pycharm
+pydevd_pycharm.settrace('localhost', port=12345, stdoutToServer=True, 
stderrToServer=True)
+
#
+worker.main(*args, **kwargs)
+daemon.worker_main = remote_debug_wrapped
+if __name__ == '__main__':
+daemon.manager()" > remote_debug.py
+
+You will use this file as the Python worker in your PySpark applications by 
using the ``spark.python.daemon.module`` configuration.
+Run the ``pyspark`` shell with the configuration below:
+
+.. code-block:: bash
+
+    pyspark --conf spark.python.daemon.module=remote_debug
+
+Now you're ready to remote debug. Start debugging with your 
``MyRemoteDebugger``.

Review comment:
   remote debug -> remotely debug
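
   For context, the `settrace` hook injected above only fires once a Python worker actually
   launches. A minimal sketch of forcing a worker to launch from the `pyspark` shell started
   with the configuration above (assuming local mode and the example port `12345`; the job
   itself is purely illustrative):

       from pyspark.sql import SparkSession

       spark = SparkSession.builder.getOrCreate()

       # Any Python-native computation (an RDD lambda here) launches a Python
       # worker; the worker runs remote_debug.py and hits the
       # pydevd_pycharm.settrace call, attaching to the PyCharm debug server.
       spark.sparkContext.parallelize(range(10)).map(lambda x: x * 2).collect()

   If the PyCharm debug server is listening, execution should pause inside the worker once
   the job runs.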

##
File path: python/docs/source/development/debugging.rst
##
@@ -0,0 +1,187 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+=================
+Debugging PySpark
+=================
+
+PySpark uses Spark as an engine.

[GitHub] [spark] AmplabJenkins removed a comment on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686884077


   Merged build finished. Test PASSed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


SparkQA removed a comment on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686874465


   **[Test build #128273 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128273/testReport)**
 for PR 29639 at commit 
[`cea4be4`](https://github.com/apache/spark/commit/cea4be49ad0dfce4622959cc7e8782afd4c90e90).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


SparkQA commented on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686883777


   **[Test build #128273 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128273/testReport)**
 for PR 29639 at commit 
[`cea4be4`](https://github.com/apache/spark/commit/cea4be49ad0dfce4622959cc7e8782afd4c90e90).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] KevinSmile removed a comment on pull request #29644: [SPARK-32598][Scheduler] fix missing driver logs in UI Executors tab in standalone mode

2020-09-03 Thread GitBox


KevinSmile removed a comment on pull request #29644:
URL: https://github.com/apache/spark/pull/29644#issuecomment-686875689


   Direct cause of the bug: 
   the original author never overrode `getDriverLogUrls` in 
`StandaloneSchedulerBackend`, so the default implementation (which returns `None`) is used:
   
   
https://github.com/apache/spark/blob/1de272f98d0ff22d0dd151797f22b8faf310963a/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala#L70-L75



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] KevinSmile edited a comment on pull request #29644: [SPARK-32598][Scheduler] fix missing driver logs in UI Executors tab in standalone mode

2020-09-03 Thread GitBox


KevinSmile edited a comment on pull request #29644:
URL: https://github.com/apache/spark/pull/29644#issuecomment-686875689


   Direct cause of the bug: 
   the original author never overrode `getDriverLogUrls` in 
`StandaloneSchedulerBackend`, so the default implementation (which returns `None`) is used:
   
   
https://github.com/apache/spark/blob/1de272f98d0ff22d0dd151797f22b8faf310963a/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala#L70-L75



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] KevinSmile commented on pull request #29644: [SPARK-32598][Scheduler] fix missing driver logs in UI Executors tab in standalone mode

2020-09-03 Thread GitBox


KevinSmile commented on pull request #29644:
URL: https://github.com/apache/spark/pull/29644#issuecomment-686875689


   Direct cause of the bug: 
   the original author never overrode `getDriverLogUrls` in 
`StandaloneSchedulerBackend`, so the default implementation (which returns `None`) is used:
   
   
https://github.com/apache/spark/blob/1de272f98d0ff22d0dd151797f22b8faf310963a/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala#L71-L75



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686874895







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29644: [SPARK-32598][Scheduler] fix missing driver logs in UI Executors tab in standalone mode

2020-09-03 Thread GitBox


AmplabJenkins removed a comment on pull request #29644:
URL: https://github.com/apache/spark/pull/29644#issuecomment-686874349


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686874895







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29644: [SPARK-32598][Scheduler] fix missing driver logs in UI Executors tab in standalone mode

2020-09-03 Thread GitBox


AmplabJenkins commented on pull request #29644:
URL: https://github.com/apache/spark/pull/29644#issuecomment-686874710


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29639: [SPARK-32186][DOCS][PYTHON] User Guide - Debugging

2020-09-03 Thread GitBox


SparkQA commented on pull request #29639:
URL: https://github.com/apache/spark/pull/29639#issuecomment-686874465


   **[Test build #128273 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128273/testReport)**
 for PR 29639 at commit 
[`cea4be4`](https://github.com/apache/spark/commit/cea4be49ad0dfce4622959cc7e8782afd4c90e90).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


