[GitHub] [spark] AmplabJenkins removed a comment on pull request #29587: [SPARK-32376][SQL] Make unionByName null-filling behavior work with struct columns

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29587:
URL: https://github.com/apache/spark/pull/29587#issuecomment-695171420







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29587: [SPARK-32376][SQL] Make unionByName null-filling behavior work with struct columns

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29587:
URL: https://github.com/apache/spark/pull/29587#issuecomment-695171420










[GitHub] [spark] SparkQA commented on pull request #29587: [SPARK-32376][SQL] Make unionByName null-filling behavior work with struct columns

2020-09-18 Thread GitBox


SparkQA commented on pull request #29587:
URL: https://github.com/apache/spark/pull/29587#issuecomment-695171322


   **[Test build #128886 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128886/testReport)** for PR 29587 at commit [`9040c56`](https://github.com/apache/spark/commit/9040c56b86a932d7740186637f1e4c9accbee216).



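For readers following SPARK-32376: the PR above extends `unionByName` null-filling so that fields missing on one side — including nested struct fields — are filled with nulls when unioning by name. The semantics can be illustrated with a pure-Python sketch (this is not the Spark implementation; rows are plain dicts, nested structs are nested dicts, and each side's rows are assumed to share a schema):

```python
def union_by_name(rows_a, rows_b):
    """Union two lists of dict-rows by field name, null-filling fields
    (including nested struct fields) missing on either side."""
    def merged_schema(a, b):
        # Union of keys; recurse where either side has a nested struct.
        keys = dict.fromkeys(list(a) + list(b))
        schema = {}
        for k in keys:
            va, vb = a.get(k), b.get(k)
            if isinstance(va, dict) or isinstance(vb, dict):
                schema[k] = merged_schema(va if isinstance(va, dict) else {},
                                          vb if isinstance(vb, dict) else {})
            else:
                schema[k] = None  # leaf field
        return schema

    def conform(row, schema):
        # Project a row onto the merged schema, filling absent fields with None.
        out = {}
        for k, sub in schema.items():
            v = row.get(k)
            if isinstance(sub, dict):
                out[k] = conform(v if isinstance(v, dict) else {}, sub)
            else:
                out[k] = v
        return out

    schema = merged_schema(rows_a[0] if rows_a else {},
                           rows_b[0] if rows_b else {})
    return [conform(r, schema) for r in rows_a + rows_b]

a = [{"id": 1, "s": {"x": 10}}]
b = [{"id": 2, "s": {"y": 20}}]
print(union_by_name(a, b))
# [{'id': 1, 's': {'x': 10, 'y': None}}, {'id': 2, 's': {'x': None, 'y': 20}}]
```

The point of the PR is exactly the recursive step: before it, null-filling worked only for top-level columns, not for fields inside struct columns.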



[GitHub] [spark] manuzhang commented on a change in pull request #29797: [SPARK-32932][SQL] Do not use local shuffle reader on RepartitionByExpression when coalescing disabled

2020-09-18 Thread GitBox


manuzhang commented on a change in pull request #29797:
URL: https://github.com/apache/spark/pull/29797#discussion_r491281977



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
##
@@ -705,7 +705,8 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
 exchange.ShuffleExchangeExec(
   r.partitioning,
   planLater(r.child),
-  noUserSpecifiedNumPartition = r.optNumPartitions.isEmpty) :: Nil
+  noUserSpecifiedNumPartition = conf.coalesceShufflePartitionsEnabled &&
+    r.optNumPartitions.isEmpty) :: Nil

Review comment:
   @viirya Thanks for the good questions.
   
   This solution is not ideal until we find a way to avoid applying the local shuffle reader when the partitioning doesn't match that of the dynamic partition. It's not ideal either that every AQE config is global.
   
   With `coalesceShufflePartitionsEnabled` and `localShuffleReaderEnabled`, the output partitioning of the shuffle is uncertain, which arguably contradicts the purpose of `RepartitionByExpression`.
   
   >  If the user needs to coalesce shuffle partition in the query, but also 
needs dynamic partition overwrite?
   
   Another option is to introduce a new config specific for repartition, e.g. 
`spark.sql.adaptive.repartition.canChangeNumPartitions` 
   





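The one-line change under review gates `noUserSpecifiedNumPartition` on the coalesce config, so AQE no longer treats a `RepartitionByExpression` shuffle as freely coalescible when coalescing is disabled. The decision itself is a small boolean sketch (names here are illustrative Python, not Spark's Scala):

```python
def no_user_specified_num_partition(coalesce_enabled, opt_num_partitions):
    """Mirror of the patched condition: AQE may change the partition count
    only if coalescing is enabled AND the user gave no explicit count."""
    return coalesce_enabled and opt_num_partitions is None

# AQE may coalesce: config on, no explicit repartition(n)
print(no_user_specified_num_partition(True, None))    # True
# User asked for repartition(200): never treated as "unspecified"
print(no_user_specified_num_partition(True, 200))     # False
# Coalescing disabled: keep the exact shuffle partitioning
print(no_user_specified_num_partition(False, None))   # False
```

The third case is the behavior change: previously the flag depended only on whether the user supplied a partition count.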



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29807: [SPARK-32939][SQL]Avoid re-compute expensive expression

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29807:
URL: https://github.com/apache/spark/pull/29807#issuecomment-695169067










[GitHub] [spark] AmplabJenkins commented on pull request #29807: [SPARK-32939][SQL]Avoid re-compute expensive expression

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29807:
URL: https://github.com/apache/spark/pull/29807#issuecomment-695169067










[GitHub] [spark] SparkQA commented on pull request #29807: [SPARK-32939][SQL]Avoid re-compute expensive expression

2020-09-18 Thread GitBox


SparkQA commented on pull request #29807:
URL: https://github.com/apache/spark/pull/29807#issuecomment-695168995


   **[Test build #128885 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128885/testReport)** for PR 29807 at commit [`73e94c3`](https://github.com/apache/spark/commit/73e94c39d2871d76e9072556ab02666f31085ab4).






[GitHub] [spark] AngersZhuuuu opened a new pull request #29807: [SPARK-32939][SQL]Avoid re-compute expensive expression

2020-09-18 Thread GitBox


AngersZhuuuu opened a new pull request #29807:
URL: https://github.com/apache/spark/pull/29807


   ### What changes were proposed in this pull request?
   ```
test("SPARK-32939: Expensive expr re-compute demo") {
  withTable("t") {
    withTempDir { loc =>
      sql(
        s"""CREATE TABLE t(c1 INT, s STRING) PARTITIONED BY(P1 STRING)
           | LOCATION '${loc.getAbsolutePath}'
           |""".stripMargin)
      sql(
        """
          |SELECT c1,
          |case
          |  when get_json_object(s,'$.a')=1 then "a"
          |  when get_json_object(s,'$.a')=2 then "b"
          |end as s_type
          |FROM t
          |WHERE get_json_object(s,'$.a') in (1, 2)
        """.stripMargin).explain(true)
    }
  }
}
   ```
   will got plan as 
   ```
   == Physical Plan ==
   *(1) Project [c1#1, CASE WHEN (cast(get_json_object(s#2, $.a) as int) = 1) THEN a WHEN (cast(get_json_object(s#2, $.a) as int) = 2) THEN b END AS s_type#0]
   +- *(1) Filter get_json_object(s#2, $.a) IN (1,2)
      +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)
   ```
   We can see that `get_json_object(s#2, $.a)` will be computed three times. Expensive expressions are often re-computed many times in this kind of query. This case is frequent in production SQL, where the repeated expression is complex and identical each time. So, after checking for such duplicates, we can compute these duplicated complex expressions first with a projection.
   The resulting plan looks like
   ```
   == Physical Plan ==
   *(1) Project [c1#1, CASE WHEN (cast(expensive_col_6#6 as int) = 1) THEN a WHEN (cast(expensive_col_6#6 as int) = 2) THEN b END AS s_type#0]
   +- *(1) Filter expensive_col_6#6 IN (1,2)
      +- Project [c1#1, s#2, get_json_object(s#2, $.a) AS expensive_col_6#6]
         +- Scan hive default.t [c1#1, s#2], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#1, s#2], [P1#3], Statistics(sizeInBytes=8.0 EiB)
   ```
   The ProjectExec near the Scan is not contained in WholeStageCodegen, since it does not match that case yet (this won't occur in the current code).
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Need to add UT.
   



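The rewrite proposed above (compute duplicated expensive expressions once in a projection below the Filter/Project) can be sketched outside Spark. Here expressions are modeled as nested tuples; any duplicated function subtree is hoisted into a pre-projection under an alias (all names — `hoist_duplicates`, `_hoisted_0`, the tuple encoding — are illustrative, not Spark's):

```python
from collections import Counter

def contains(tree, sub):
    """True if `sub` occurs anywhere inside expression tree `tree`."""
    if tree == sub:
        return True
    return isinstance(tree, tuple) and any(contains(c, sub) for c in tree[1:])

def count_subtrees(expr, counts):
    if isinstance(expr, tuple):
        counts[expr] += 1
        for child in expr[1:]:
            count_subtrees(child, counts)

def hoist_duplicates(exprs):
    """Return (pre_projection, rewritten): every function subtree occurring
    more than once is computed once under an alias column."""
    counts = Counter()
    for e in exprs:
        count_subtrees(e, counts)
    # candidates: duplicated non-trivial subtrees (skip bare column refs)
    dups = [t for t, n in counts.items() if n > 1 and t[0] != "col"]
    # keep only maximal ones (not nested inside another hoisted subtree)
    dups = [t for t in dups if not any(t != o and contains(o, t) for o in dups)]
    aliases = {t: f"_hoisted_{i}" for i, t in enumerate(dups)}

    def rewrite(e):
        if isinstance(e, tuple) and e in aliases:
            return ("col", aliases[e])
        if isinstance(e, tuple):
            return (e[0],) + tuple(rewrite(c) for c in e[1:])
        return e

    pre_projection = {alias: t for t, alias in aliases.items()}
    return pre_projection, [rewrite(e) for e in exprs]

# Mirror of the PR's demo: get_json_object(s, '$.a') appears three times.
gjo = ("get_json_object", ("col", "s"), "$.a")
project = ("case", ("eq", gjo, 1), "a", ("eq", gjo, 2), "b")
filt = ("in", gjo, 1, 2)
pre, rewritten = hoist_duplicates([project, filt])
print(pre)          # {'_hoisted_0': ('get_json_object', ('col', 's'), '$.a')}
print(rewritten[1]) # ('in', ('col', '_hoisted_0'), 1, 2)
```

The "after judgement" step in the PR corresponds to deciding which duplicated subtrees are worth hoisting; the sketch uses a crude count-and-maximality rule, whereas a real optimizer would also weigh expression cost.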



[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-09-18 Thread GitBox


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-695165544


   > BTW, welcome to the Apache Spark community and thank you for your first 
contribution, @cchighman .
   
   @Dooyoung-Hwang 
   Thank you very much!  Hopefully, after compiling the key feedback that was previously spread across this massive PR, reviewers will find the context they need to stamp the PR with their seal of approval.
   
   Looking back over the last 3 months, there's something poetic about this PR's journey and how it's transformed into the shiny, polished version now!!!






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29781: [SPARK-32189][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29781:
URL: https://github.com/apache/spark/pull/29781#issuecomment-695164896










[GitHub] [spark] AmplabJenkins commented on pull request #29781: [SPARK-32189][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29781:
URL: https://github.com/apache/spark/pull/29781#issuecomment-695164896










[GitHub] [spark] SparkQA removed a comment on pull request #29781: [SPARK-32189][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-18 Thread GitBox


SparkQA removed a comment on pull request #29781:
URL: https://github.com/apache/spark/pull/29781#issuecomment-695162982


   **[Test build #128884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128884/testReport)** for PR 29781 at commit [`ee355b0`](https://github.com/apache/spark/commit/ee355b0e0c6bc9f414886da6486e746090f86ec0).






[GitHub] [spark] SparkQA commented on pull request #29781: [SPARK-32189][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-18 Thread GitBox


SparkQA commented on pull request #29781:
URL: https://github.com/apache/spark/pull/29781#issuecomment-695164840


   **[Test build #128884 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128884/testReport)** for PR 29781 at commit [`ee355b0`](https://github.com/apache/spark/commit/ee355b0e0c6bc9f414886da6486e746090f86ec0).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-09-18 Thread GitBox


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-695164039


   > I roughly checked the code and I think you need to brush up more by 
referring the other code and the style doc. Could you? The proposed idea itself 
looks good and I think its useful.
   
   After much discussion and evolution of this PR from valuable reviewer 
feedback, it was @maropu's turn.  @maropu's superpower involves relentless 
dedication to consistency throughout files and overall consideration to how 
they're structured.  While resolving provided feedback, file names would 
change, unit tests would be consolidated, separated, and then consolidated
again, public facing documentation would be reconsidered and simplified for 
best readability, and not a single out-of-band whitespace would survive.
   
   @maropu's contributions towards this code review resulted in simplification, 
clarification, robustness, and a shiny polish of quality from start to finish.  
At one point he rewrote part of my code according to how he felt it should look 
and then pushed into my fork, asking me to approve those changes!  It was quite 
humorous for me, thinking about how engineers I've worked with in the past would very possibly have destroyed their keyboard or monitor, hahaha.
   
   The bigger picture here, of course, is understanding and appreciating the 
diversity in thought and perspective @maropu offers during the review 
experience.  Thank you, @maropu 






[GitHub] [spark] AmplabJenkins commented on pull request #29781: [SPARK-32189][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29781:
URL: https://github.com/apache/spark/pull/29781#issuecomment-695163079










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29781: [SPARK-32189][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29781:
URL: https://github.com/apache/spark/pull/29781#issuecomment-695163079










[GitHub] [spark] SparkQA commented on pull request #29781: [SPARK-32189][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-18 Thread GitBox


SparkQA commented on pull request #29781:
URL: https://github.com/apache/spark/pull/29781#issuecomment-695162982


   **[Test build #128884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128884/testReport)** for PR 29781 at commit [`ee355b0`](https://github.com/apache/spark/commit/ee355b0e0c6bc9f414886da6486e746090f86ec0).






[GitHub] [spark] itholic commented on a change in pull request #29781: [SPARK-32189][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-18 Thread GitBox


itholic commented on a change in pull request #29781:
URL: https://github.com/apache/spark/pull/29781#discussion_r491267799



##
File path: python/docs/source/development/setting_ide.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+===============
+Setting up IDEs
+===============
+
+
+PyCharm
+-------
+
+This section describes how to setup PySpark on PyCharm.
+It guides step by step to the process of downloading the source code from 
GitHub and running the test code successfully.
+
+Firstly, download the Spark source code from GitHub using git url. You can 
download the source code by simply using ``git clone`` command as shown below.
+If you want to download the code from any forked repository rather than spark 
original repository, please change the url properly.
+
+.. code-block:: bash
+
+git clone https://github.com/apache/spark.git
+
+When the download is completed, go to the ``spark`` directory and build the 
package.
+SBT build is generally much faster than others. More details about the build 
are documented `here 
`_.

Review comment:
   Thanks, @dongjoon-hyun :)








[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-09-18 Thread GitBox


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-695161492


   > Shall we use `modificationTimeFilter`? the word `date` is not accurate 
since it can be referred to date without any hour/minute/seconds
   
   Semantics for the new options were thoughtfully considered.  @gengliangwang's comment here led to realizing that describing this feature in terms of dates did not accurately convey its purpose.  As a result, everything was adjusted to "time".






[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-09-18 Thread GitBox


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-695161242


   > @cchighman
   > Thanks for understanding. I still think even in SS case we want to provide 
timestamp to effectively filter out files - the point is whether the timestamp 
is static or dynamic. Let's focus on the batch case here.
   
   This PR initially allowed support for both the batch and streaming use cases.  Semantics for the option were to be shared and filtering implemented in the same place.  A rather long, in-depth, and thoughtful discussion ensued, where @HeartSaVioR patiently outlined subtle differences between the approaches.
   
   The result of this conversation was to restrict this PR to focus only on the batch use case, while creating a JIRA item (SPARK-32155) to track file-based semantics for the streaming scenario.






[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-09-18 Thread GitBox


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-695160233


   > These timestamps should be inclusive or exclusive? e.g., `(modifiedAfter, 
modifiedBefore)`, `[modifiedAfter, modifiedBefore]`... Looks like the current 
code follows `(modifiedAfter, modifiedBefore]`? Anyway, I think we need to test 
all the boundary cases.
   
   @maropu's request to explore these type of conditions as test cases led to 
finding a corner case that was missed.  Four tests were added to provide 
protection around the bounded contexts of our timeZone parameter.



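The boundary question quoted above — whether the window behaves as `(modifiedAfter, modifiedBefore)`, `[modifiedAfter, modifiedBefore]`, or `(modifiedAfter, modifiedBefore]` — is exactly the kind of thing those added tests pin down. A minimal Python sketch of the predicate, assuming both bounds are exclusive (the PR's actual boundary semantics should be read off its tests, not this sketch):

```python
def accepts(mod_time_ms, modified_after=None, modified_before=None):
    """File passes when strictly newer than modifiedAfter and strictly
    older than modifiedBefore, i.e. the open interval (after, before)."""
    if modified_after is not None and not mod_time_ms > modified_after:
        return False
    if modified_before is not None and not mod_time_ms < modified_before:
        return False
    return True

print(accepts(100, modified_after=100))                       # False: boundary excluded
print(accepts(101, modified_after=100))                       # True
print(accepts(200, modified_before=200))                      # False: boundary excluded
print(accepts(150, modified_after=100, modified_before=200))  # True
```

Writing the boundary cases out like this is what surfaced the missed corner case mentioned above: equality at either bound is where implementations silently disagree.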



[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-09-18 Thread GitBox


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-695159898


   > Should we specify the timezone here? What should the default timezone be?
   > @cloud-fan @MaxGekk
   
   @gengliangwang's comment here drew quick attention to the fact that timezones were not being respected as would be expected for consistency.  After a short period of research and comments provided in code, timezone support was added by closely following existing idioms for identifying the zone id.  This typically means first checking the timeZone option, then checking for a default configured within the SQLConf, and finally falling back to UTC as a last resort.
   
   Nearly 20 unit tests were added in order to fully test the validity of this 
feature.



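The resolution order described above (per-source `timeZone` option, then the session default from SQLConf, then UTC) is a simple first-match fallback chain. A hedged Python sketch — the function name and dict-based "conf" are illustrative, though `spark.sql.session.timeZone` is the real session-level key:

```python
def resolve_zone_id(options, session_conf):
    """First non-empty value wins: per-reader timeZone option,
    then the session default (SQLConf), then UTC as the last resort."""
    return (options.get("timeZone")
            or session_conf.get("spark.sql.session.timeZone")
            or "UTC")

print(resolve_zone_id({"timeZone": "America/Los_Angeles"}, {}))           # America/Los_Angeles
print(resolve_zone_id({}, {"spark.sql.session.timeZone": "Asia/Seoul"}))  # Asia/Seoul
print(resolve_zone_id({}, {}))                                            # UTC
```

Following the existing idiom matters here: a reader option that silently ignored the session timezone would make `modifiedBefore`/`modifiedAfter` behave differently from other time-sensitive options.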



[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-09-18 Thread GitBox


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-695158666


   > > I haven't looked into the detail yet, but Hadoop FS and a PathFilter 
API, does Spark file source support the path filter option?
   > 
   > Hadoop's PathFilter only gives the user Path without metadata. But this is 
an interesting point. The user needs a file metadata filter essentially. For 
example, some user may ask to filter files using the file size. But supporting 
a filter in multiple languages is hard.
   
   @zsxwing Circling back around, you introduced the _latestFirst_ option for determining how a file-based data source would process files.  Any thoughts surrounding this feature from your end?






[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-09-18 Thread GitBox


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-695158320


   > > The need arises since we only have one filtering strategy where 
pathGlobFilter is used in PartitioningAwareFileIndex and handling cleanly if we 
now have more than one approach to path filtering so the implementation I'm 
suggesting would be nearly identical to above except moving it to where 
pathGlobFilter is implemented.
   > 
   > @cchighman the option should be effective for both `MetadataLogFileIndex` 
and `InMemoryFileIndex`. That's why I suggest implementing it in 
`PartitioningAwareFileIndex`. I think we can create a method to determine which 
files are matched according to the parameters.
   
   Further feedback from @gengliangwang provided clear direction for 
fine-tuning this implementation.






[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-09-18 Thread GitBox


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-695158061


   > `f` here is `FileStatus`. It looks inefficient to get the `FileStatus` 
again in 
https://github.com/apache/spark/pull/28841/files#diff-3c816e65d0bbdf16d46c7ea4e0b8cbaeR55
   
   Feedback from @cloud-fan prompted a closer look at optimization. The code in 
question has been optimized and the issue resolved.






[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-09-18 Thread GitBox


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-695157152


   > @cchighman I like the idea of such filtering.
   > How about we follow the approach of the option `pathGlobFilter`: 
https://github.com/apache/spark/pull/24518/files#diff-e458973c0425211675054a6e6d742957R59
   > 
   > The modification time is also available in `FileStatus`
   
   Because there is a lot of info in this PR, I wanted to quote key reviewer 
feedback to help illustrate this change better.
   
   With @gengliangwang's feedback, the _modifiedBefore_ and _modifiedAfter_ 
options were introduced in solid alignment with existing practices.
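   As a rough sketch of the proposed option semantics (hypothetical Python, not the PR's Scala; option names follow the discussion, and the exact boundary behavior is an assumption): a file is kept only when its modification time falls inside the configured window.

```python
def within_modification_window(modification_time,
                               modified_after=None,
                               modified_before=None):
    """Keep a file whose modification time (epoch millis, as reported
    by Hadoop's FileStatus) is strictly after `modified_after` and
    strictly before `modified_before`; either bound may be omitted."""
    if modified_after is not None and modification_time <= modified_after:
        return False
    if modified_before is not None and modification_time >= modified_before:
        return False
    return True

# Keep only files last modified between t=1000 and t=2000 (exclusive).
times = [500, 1500, 2500]
kept = [t for t in times
        if within_modification_window(t, modified_after=1000,
                                      modified_before=2000)]
print(kept)  # [1500]
```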






[GitHub] [spark] AmplabJenkins commented on pull request #29804: [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29804:
URL: https://github.com/apache/spark/pull/29804#issuecomment-695156624










[GitHub] [spark] AmplabJenkins removed a comment on pull request #29804: [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29804:
URL: https://github.com/apache/spark/pull/29804#issuecomment-695156624










[GitHub] [spark] SparkQA removed a comment on pull request #29804: [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically

2020-09-18 Thread GitBox


SparkQA removed a comment on pull request #29804:
URL: https://github.com/apache/spark/pull/29804#issuecomment-695115619


   **[Test build #128883 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128883/testReport)**
 for PR 29804 at commit 
[`7ffcdbc`](https://github.com/apache/spark/commit/7ffcdbc288679238a4c8f16446f8259396a34a8c).






[GitHub] [spark] SparkQA commented on pull request #29804: [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically

2020-09-18 Thread GitBox


SparkQA commented on pull request #29804:
URL: https://github.com/apache/spark/pull/29804#issuecomment-695156470


   **[Test build #128883 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128883/testReport)**
 for PR 29804 at commit 
[`7ffcdbc`](https://github.com/apache/spark/commit/7ffcdbc288679238a4c8f16446f8259396a34a8c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

2020-09-18 Thread GitBox


cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-695155611


   @gengliangwang @cloud-fan @maropu @Dooyoung-Hwang @HeartSaVioR @HyukjinKwon 
   Following up on this PR for next steps. We've received a good deal of 
constructive feedback during the review process, but haven't yet secured an 
approval.
   
   I revisited the commit log to identify other potential reviewers with 
expertise in this file. I found our group represented quite well here, with 
@gengliangwang suggesting early on that we follow an existing method for 
consistency.
   
   Comments?   
   
   Thank you!
   
   






[GitHub] [spark] wzhfy commented on pull request #29764: [SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-18 Thread GitBox


wzhfy commented on pull request #29764:
URL: https://github.com/apache/spark/pull/29764#issuecomment-695145607


   @mridulm Closed. Thanks!






[GitHub] [spark] wzhfy closed pull request #29764: [SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-18 Thread GitBox


wzhfy closed pull request #29764:
URL: https://github.com/apache/spark/pull/29764


   






[GitHub] [spark] balajij81 edited a comment on pull request #23576: [SPARK-26655] [SS] Support multiple aggregates in append mode

2020-09-18 Thread GitBox


balajij81 edited a comment on pull request #23576:
URL: https://github.com/apache/spark/pull/23576#issuecomment-695144270


   I too agree that this PR is very valuable; there are a lot of use cases 
where Spark could be used efficiently for real-time streaming.






[GitHub] [spark] balajij81 commented on pull request #23576: [SPARK-26655] [SS] Support multiple aggregates in append mode

2020-09-18 Thread GitBox


balajij81 commented on pull request #23576:
URL: https://github.com/apache/spark/pull/23576#issuecomment-695144270


   I too agree that this PR is very valuable; there are a lot of use cases 
where Spark could be used efficiently for real-time streaming.






[GitHub] [spark] github-actions[bot] commented on pull request #28497: [SPARK-31677][SS] Use kvstore to cache stream query progress

2020-09-18 Thread GitBox


github-actions[bot] commented on pull request #28497:
URL: https://github.com/apache/spark/pull/28497#issuecomment-695139730


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!






[GitHub] [spark] github-actions[bot] commented on pull request #28755: [SPARK-29394][SQL]Support ISO 8601 format for intervals

2020-09-18 Thread GitBox


github-actions[bot] commented on pull request #28755:
URL: https://github.com/apache/spark/pull/28755#issuecomment-695139727


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!






[GitHub] [spark] github-actions[bot] commented on pull request #28147: [SPARK-31357][SQL][WIP] Catalog API for view metadata

2020-09-18 Thread GitBox


github-actions[bot] commented on pull request #28147:
URL: https://github.com/apache/spark/pull/28147#issuecomment-695139736


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!






[GitHub] [spark] github-actions[bot] commented on pull request #28036: [SPARK-26341][CORE]Expose executor memory metrics at the stage level, in the Stages tab

2020-09-18 Thread GitBox


github-actions[bot] commented on pull request #28036:
URL: https://github.com/apache/spark/pull/28036#issuecomment-695139741


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!






[GitHub] [spark] viirya commented on a change in pull request #26935: [SPARK-30294][SS] Explicitly defines read-only StateStore and optimize for HDFSBackedStateStore

2020-09-18 Thread GitBox


viirya commented on a change in pull request #26935:
URL: https://github.com/apache/spark/pull/26935#discussion_r491243565



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala
##
@@ -253,7 +253,8 @@ case class StateStoreRestoreExec(
   stateManager.getStateValueSchema,
   indexOrdinal = None,
   sqlContext.sessionState,
-  Some(sqlContext.streams.stateStoreCoordinator)) { case (store, iter) =>
+  Some(sqlContext.streams.stateStoreCoordinator),
+  readOnly = true) { case (store, iter) =>

Review comment:
   Is this the only place we use read-only state store?








[GitHub] [spark] maropu commented on a change in pull request #29804: [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically

2020-09-18 Thread GitBox


maropu commented on a change in pull request #29804:
URL: https://github.com/apache/spark/pull/29804#discussion_r491243552



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -2764,6 +2764,14 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val DYNAMIC_DECIDE_BUCKETING_ENABLED =
+buildConf("spark.sql.sources.dynamic.decide.bucketing.enabled")

Review comment:
   How about adding this param under `spark.sql.sources.bucketing.xxx`? 
`spark.sql.sources.bucketing.enableAutoBucketScan` or something?








[GitHub] [spark] viirya commented on a change in pull request #26935: [SPARK-30294][SS] Explicitly defines read-only StateStore and optimize for HDFSBackedStateStore

2020-09-18 Thread GitBox


viirya commented on a change in pull request #26935:
URL: https://github.com/apache/spark/pull/26935#discussion_r491241918



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
##
@@ -109,6 +109,41 @@ trait StateStore {
   def hasCommitted: Boolean
 }
 
+/** A versioned key-value store which is same as [[StateStore]], but 
write-protected. */

Review comment:
   For "write-protected" we have `WrappedReadOnlyStateStore`; 
"write-protected" sounds like the store can still be written, just guarded. 
`ReadOnlyStateStore`, however, is actually a read-only state store. Can we 
update this comment?

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
##
@@ -79,6 +78,23 @@ private[state] class HDFSBackedStateStoreProvider extends 
StateStoreProvider wit
   //   java.util.ConcurrentModificationException
   type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, UnsafeRow]
 
+  class HDFSBackedReadOnlyStateStore(val version: Long, map: MapType)
+extends ReadOnlyStateStore {
+
+override def id: StateStoreId = 
HDFSBackedStateStoreProvider.this.stateStoreId
+
+override def get(key: UnsafeRow): UnsafeRow = map.get(key)
+
+override def abort(): Unit = {}

Review comment:
   Isn't this a `ReadOnlyStateStore`? Why do we need to override `abort`? 
Can't we have `throwNotAllowed` for `abort` in `ReadOnlyStateStore` too?

##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDD.scala
##
@@ -76,7 +76,7 @@ class StateStoreRDD[T: ClassTag, U: ClassTag](
 
 store = StateStore.get(
   storeProviderId, keySchema, valueSchema, indexOrdinal, storeVersion,
-  storeConf, hadoopConfBroadcast.value.value)
+  storeConf, hadoopConfBroadcast.value.value, readOnly)
 val inputIter = dataRDD.iterator(partition, ctxt)
 storeUpdateFunction(store, inputIter)

Review comment:
   For read-only, do we still need to call `storeUpdateFunction`?
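   One way to frame the read-only questions above (`abort`, and whether mutation callbacks are needed), sketched in hypothetical Python rather than Spark's actual Scala `StateStore` API: mutating calls are rejected outright, while `abort` can legitimately be a no-op because a read-only store stages no changes to roll back.

```python
class ReadOnlyStateStore:
    """Read-only view over a committed version of a state map.

    get() is allowed; put()/remove() are rejected. abort() is a no-op
    on the assumption that a read-only store never stages changes.
    """
    def __init__(self, version, state_map):
        self.version = version
        self._map = state_map

    def get(self, key):
        return self._map.get(key)

    def put(self, key, value):
        raise RuntimeError("put is not allowed on a read-only state store")

    def remove(self, key):
        raise RuntimeError("remove is not allowed on a read-only state store")

    def abort(self):
        # Nothing staged, so nothing to roll back.
        pass

store = ReadOnlyStateStore(version=3, state_map={"k": 1})
print(store.get("k"))  # 1
```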








[GitHub] [spark] maropu edited a comment on pull request #29804: [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically

2020-09-18 Thread GitBox


maropu edited a comment on pull request #29804:
URL: https://github.com/apache/spark/pull/29804#issuecomment-695134989


   Looks like an interesting idea, @c21! I'll check this later.






[GitHub] [spark] maropu commented on a change in pull request #29804: [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically

2020-09-18 Thread GitBox


maropu commented on a change in pull request #29804:
URL: https://github.com/apache/spark/pull/29804#discussion_r491242019



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
##
@@ -165,6 +166,7 @@ case class FileSourceScanExec(
 partitionFilters: Seq[Expression],
 optionalBucketSet: Option[BitSet],
 optionalNumCoalescedBuckets: Option[Int],
+optionalDynamicDecideBucketing: Option[Boolean],

Review comment:
   nit: could you add this param in `metadata` for better explain output?








[GitHub] [spark] maropu commented on pull request #29804: [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically

2020-09-18 Thread GitBox


maropu commented on pull request #29804:
URL: https://github.com/apache/spark/pull/29804#issuecomment-695134989


   Looks like an interesting idea, @c21!






[GitHub] [spark] mridulm commented on pull request #29764: [SPARK-32738][CORE][2.4] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-18 Thread GitBox


mridulm commented on pull request #29764:
URL: https://github.com/apache/spark/pull/29764#issuecomment-695131542


   @wzhfy Can you close the PR please? It has been merged to branch-2.4.






[GitHub] [spark] viirya commented on a change in pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-18 Thread GitBox


viirya commented on a change in pull request #29779:
URL: https://github.com/apache/spark/pull/29779#discussion_r491235435



##
File path: python/docs/source/getting_started/install.rst
##
@@ -0,0 +1,138 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+Installation
+
+
+PySpark is included in the official releases of Spark available in the `Apache 
Spark website `_.
+For Python users, PySpark also provides ``pip`` installation from PyPI. This 
is usually for local usage or as
+a client to connect to a cluster instead of setting up a cluster itself.
+ 
+This page includes instructions for installing PySpark by using pip, Conda, 
downloading manually,
+and building from source.
+
+
+Python Version Supported
+
+
+Python 3.6 and above.
+
+
+Using PyPI
+--
+
+PySpark installation using `PyPI `_ is as 
follows:
+
+.. code-block:: bash
+
+pip install pyspark
+
+If you want to install extra dependencies for a specific component, you can 
install them as below:
+
+.. code-block:: bash
+
+pip install pyspark[sql]
+
+
+Using Conda
+---
+
+Conda is an open-source package management and environment management system 
which is a part of
+the `Anaconda `_ distribution. It is both 
cross-platform and
+language agnostic. In practice, Conda can replace both `pip 
`_ and
+`virtualenv `_.
+
+Create new virtual environment from your terminal as shown below:
+
+.. code-block:: bash
+
+conda create -n pyspark_env
+
+After the virtual environment is created, it should be visible under the list 
of Conda environments
+which can be seen using the following command:
+
+.. code-block:: bash
+
+conda env list
+
+Now activate the newly created environment with the following command:
+
+.. code-block:: bash
+
+conda activate pyspark_env
+
+You can follow `Using PyPI <#using-pypi>`_ to install PySpark in the newly 
created
+environment, for example as below. It will install PySpark under the new 
virtual environment
+``pyspark_env`` created above.
+
+.. code-block:: bash
+
+pip install pyspark
+
+Alternatively, you can install PySpark from Conda itself as below:
+
+.. code-block:: bash
+
+conda install pyspark
+
+However, note that `PySpark at Conda 
`_ is not necessarily
+synced with the PySpark release cycle because it is maintained by the 
community separately.
+
+
+Manually Downloading
+
+
+PySpark is included in the distributions available at the `Apache Spark 
website `_.
+You can download the distribution you want from the site. After that, 
uncompress the tar file into the directory where you want
+to install Spark, as below:
+
+.. code-block:: bash
+
+tar xzvf spark-3.0.0-bin-hadoop2.7.tgz
+
+Ensure the ``SPARK_HOME`` environment variable points to the directory where 
the code has been extracted. 

Review comment:
   `where the tar file has been extracted.`?








[GitHub] [spark] maropu commented on pull request #29803: [SPARK-32874][SQL][FOLLOWUP][test-hive1.2][test-hadoop2.7] Fix spark-master-test-sbt-hadoop-2.7-hive-1.2

2020-09-18 Thread GitBox


maropu commented on pull request #29803:
URL: https://github.com/apache/spark/pull/29803#issuecomment-695124152


   late LGTM, thanks for the quick fix, @yaooqinn !






[GitHub] [spark] maropu commented on pull request #29805: [SPARK-32635][SQL][2.4] Fix foldable propagation

2020-09-18 Thread GitBox


maropu commented on pull request #29805:
URL: https://github.com/apache/spark/pull/29805#issuecomment-695122960


   late LGTM, thanks, @peter-toth !






[GitHub] [spark] maropu commented on pull request #29802: [SPARK-32635][SQL][FOLLOW-UP] Add a new test case in catalyst module

2020-09-18 Thread GitBox


maropu commented on pull request #29802:
URL: https://github.com/apache/spark/pull/29802#issuecomment-695122624


   late LGTM, thanks, all!






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29804: [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29804:
URL: https://github.com/apache/spark/pull/29804#issuecomment-695115867










[GitHub] [spark] AmplabJenkins commented on pull request #29804: [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29804:
URL: https://github.com/apache/spark/pull/29804#issuecomment-695115867










[GitHub] [spark] SparkQA commented on pull request #29804: [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically

2020-09-18 Thread GitBox


SparkQA commented on pull request #29804:
URL: https://github.com/apache/spark/pull/29804#issuecomment-695115619


   **[Test build #128883 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128883/testReport)**
 for PR 29804 at commit 
[`7ffcdbc`](https://github.com/apache/spark/commit/7ffcdbc288679238a4c8f16446f8259396a34a8c).






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29788: [SPARK-32913][CORE][K8S] Improve ExecutorDecommissionInfo and ExecutorDecommissionState for different use cases

2020-09-18 Thread GitBox


dongjoon-hyun commented on a change in pull request #29788:
URL: https://github.com/apache/spark/pull/29788#discussion_r491214032



##
File path: core/src/main/scala/org/apache/spark/ExecutorAllocationClient.scala
##
@@ -17,7 +17,7 @@
 
 package org.apache.spark
 
-import org.apache.spark.scheduler.ExecutorDecommissionInfo
+import org.apache.spark.scheduler.ExecutorDecommissionReason

Review comment:
   I guess @Ngone51 is trying to follow `TaskEndReason`.








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29788: [SPARK-32913][CORE][K8S] Improve ExecutorDecommissionInfo and ExecutorDecommissionState for different use cases

2020-09-18 Thread GitBox


dongjoon-hyun commented on a change in pull request #29788:
URL: https://github.com/apache/spark/pull/29788#discussion_r491214032



##
File path: core/src/main/scala/org/apache/spark/ExecutorAllocationClient.scala
##
@@ -17,7 +17,7 @@
 
 package org.apache.spark
 
-import org.apache.spark.scheduler.ExecutorDecommissionInfo
+import org.apache.spark.scheduler.ExecutorDecommissionReason

Review comment:
   I guess @Ngone51 is trying to follow the style of `TaskEndReason`.








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29781: [SPARK-32189][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-18 Thread GitBox


dongjoon-hyun commented on a change in pull request #29781:
URL: https://github.com/apache/spark/pull/29781#discussion_r491193767



##
File path: python/docs/source/development/setting_ide.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+===============
+Setting up IDEs
+===============
+
+
+PyCharm
+-------
+
+This section describes how to setup PySpark on PyCharm.
+It guides step by step to the process of downloading the source code from 
GitHub and running the test code successfully.
+
+Firstly, download the Spark source code from GitHub using git url. You can 
download the source code by simply using ``git clone`` command as shown below.
+If you want to download the code from any forked repository rather than spark 
original repository, please change the url properly.
+
+.. code-block:: bash
+
+git clone https://github.com/apache/spark.git
+
+When the download is completed, go to the ``spark`` directory and build the 
package.
+SBT build is generally much faster than others. More details about the build 
are documented `here 
`_.

Review comment:
   `others` -> `Maven` because there is only one alternative.

##
File path: python/docs/source/development/setting_ide.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+===============
+Setting up IDEs
+===============
+
+
+PyCharm
+-------
+
+This section describes how to setup PySpark on PyCharm.
+It guides step by step to the process of downloading the source code from 
GitHub and running the test code successfully.
+
+Firstly, download the Spark source code from GitHub using git url. You can 
download the source code by simply using ``git clone`` command as shown below.
+If you want to download the code from any forked repository rather than spark 
original repository, please change the url properly.
+
+.. code-block:: bash
+
+git clone https://github.com/apache/spark.git
+
+When the download is completed, go to the ``spark`` directory and build the 
package.
+SBT build is generally much faster than others. More details about the build 
are documented `here 
`_.

Review comment:
   `others` -> `Maven` because there is only one alternative?








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29781: [SPARK-32189][DOCS][PYTHON] Development - Setting PySpark with PyCharm

2020-09-18 Thread GitBox


dongjoon-hyun commented on a change in pull request #29781:
URL: https://github.com/apache/spark/pull/29781#discussion_r491193274



##
File path: python/docs/source/development/setting_ide.rst
##
@@ -0,0 +1,62 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+===============
+Setting up IDEs
+===============
+
+
+PyCharm
+-------
+
+This section describes how to setup PySpark on PyCharm.
+It guides step by step to the process of downloading the source code from 
GitHub and running the test code successfully.
+
+Firstly, download the Spark source code from GitHub using git url. You can 
download the source code by simply using ``git clone`` command as shown below.
+If you want to download the code from any forked repository rather than spark 
original repository, please change the url properly.

Review comment:
   `spark` -> `Spark` because it's a name.








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29779: [SPARK-32180][PYTHON][DOCS][FOLLOW-UP] Rephrase and add some more information in installation guide

2020-09-18 Thread GitBox


dongjoon-hyun commented on a change in pull request #29779:
URL: https://github.com/apache/spark/pull/29779#discussion_r491192048



##
File path: python/docs/source/getting_started/install.rst
##
@@ -0,0 +1,138 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+Installation
+============
+
+PySpark is included in the official releases of Spark available in the `Apache 
Spark website `_.
+For Python users, PySpark also provides ``pip`` installation from PyPI. This 
is usually for local usage or as
+a client to connect to a cluster instead of setting up a cluster itself.
+ 
+This page includes instructions for installing PySpark by using pip, Conda, 
downloading manually,
+and building from the source.
+
+
+Python Version Supported
+------------------------
+
+Python 3.6 and above.
+
+
+Using PyPI
+----------
+
+PySpark installation using `PyPI `_ is as 
follows:
+
+.. code-block:: bash
+
+pip install pyspark
+
+If you want to install extra dependencies for a specific component, you can 
install it as below:
+
+.. code-block:: bash
+
+pip install pyspark[sql]
+
+
+Using Conda
+-----------
+
+Conda is an open-source package management and environment management system 
which is a part of
+the `Anaconda `_ distribution. It is both 
cross-platform and
+language agnostic. In practice, Conda can replace both `pip 
`_ and
+`virtualenv `_.
+
+Create new virtual environment from your terminal as shown below:
+
+.. code-block:: bash
+
+conda create -n pyspark_env
+
+After the virtual environment is created, it should be visible under the list 
of Conda environments
+which can be seen using the following command:
+
+.. code-block:: bash
+
+conda env list
+
+Now activate the newly created environment with the following command:
+
+.. code-block:: bash
+
+conda activate pyspark_env
+
+You can install pyspark by `Using PyPI <#using-pypi>`_ to install PySpark in 
the newly created
+environment, for example as below. It will install PySpark under the new 
virtual environment
+``pyspark_env`` created above.
+
+.. code-block:: bash
+
+pip install pyspark
+
+Alternatively, you can install PySpark from Conda itself as below:
+
+.. code-block:: bash
+
+conda install pyspark
+
+However, note that `PySpark at Conda 
`_ is not necessarily
+synced with PySpark release cycle because it is maintained by the community 
separately.
+
+
+Manually Downloading
+--------------------
+
+PySpark is included in the distributions available at the `Apache Spark 
website `_.
+You can download a distribution you want from the site. After that, uncompress 
the tar file into the directory where you want
+to install Spark as below:
+
+.. code-block:: bash
+
+tar xzvf spark-3.0.0-bin-hadoop2.7.tgz

Review comment:
   Is `3.0.0` used because we need to land this to `branch-3.0`?








[GitHub] [spark] dongjoon-hyun commented on pull request #29789: [SPARK-32898][CORE] Fix wrong executorRunTime when task killed before real start

2020-09-18 Thread GitBox


dongjoon-hyun commented on pull request #29789:
URL: https://github.com/apache/spark/pull/29789#issuecomment-695086068


   Merged to master/3.0. 
   
   @Ngone51 . Could you make a backport to `branch-2.4`, please?






[GitHub] [spark] dongjoon-hyun closed pull request #29789: [SPARK-32898][CORE] Fix wrong executorRunTime when task killed before real start

2020-09-18 Thread GitBox


dongjoon-hyun closed pull request #29789:
URL: https://github.com/apache/spark/pull/29789


   






[GitHub] [spark] dongjoon-hyun commented on pull request #29798: [SPARK-32931][SQL] Unevaluable Expressions are not Foldable

2020-09-18 Thread GitBox


dongjoon-hyun commented on pull request #29798:
URL: https://github.com/apache/spark/pull/29798#issuecomment-695084171


   cc @sunchao and @viirya 






[GitHub] [spark] dongjoon-hyun closed pull request #29802: [SPARK-32635][SQL][FOLLOW-UP] Add a new test case in catalyst module

2020-09-18 Thread GitBox


dongjoon-hyun closed pull request #29802:
URL: https://github.com/apache/spark/pull/29802


   






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-695029859










[GitHub] [spark] SparkQA removed a comment on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


SparkQA removed a comment on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-695020379


   **[Test build #128882 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128882/testReport)** for PR 29799 at commit [`6a50bfe`](https://github.com/apache/spark/commit/6a50bfe77eae3d6311c6cd727ed896a0eaf0d407).






[GitHub] [spark] AmplabJenkins commented on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-695029859










[GitHub] [spark] SparkQA commented on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


SparkQA commented on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-695029502


   **[Test build #128882 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128882/testReport)** for PR 29799 at commit [`6a50bfe`](https://github.com/apache/spark/commit/6a50bfe77eae3d6311c6cd727ed896a0eaf0d407).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] mridulm commented on pull request #29763: [SPARK-32738][CORE][3.0] Should reduce the number of active threads if fatal error happens in `Inbox.process`

2020-09-18 Thread GitBox


mridulm commented on pull request #29763:
URL: https://github.com/apache/spark/pull/29763#issuecomment-695027046


   Thanks for clarifying, @HyukjinKwon. I was not aware of that!
   I learn something new every day :-)






[GitHub] [spark] SparkQA commented on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


SparkQA commented on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-695020379


   **[Test build #128882 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128882/testReport)** for PR 29799 at commit [`6a50bfe`](https://github.com/apache/spark/commit/6a50bfe77eae3d6311c6cd727ed896a0eaf0d407).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-695017669










[GitHub] [spark] AmplabJenkins commented on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-695017669










[GitHub] [spark] agrawaldevesh commented on a change in pull request #29788: [SPARK-32913][CORE][K8S] Improve ExecutorDecommissionInfo and ExecutorDecommissionState for different use cases

2020-09-18 Thread GitBox


agrawaldevesh commented on a change in pull request #29788:
URL: https://github.com/apache/spark/pull/29788#discussion_r491107142



##
File path: 
core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionReason.scala
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+private[spark] sealed trait ExecutorDecommissionReason {

Review comment:
   Duh. Sorry for my n00bness. I can totally see why this shouldn't be a 
sealed trait: For example it is forcing the TestExecutorDecommissionInfo to be 
in this file. 
   
   @Ngone51 is there a strong reason for making this a sealed trait? Is that required by the RPC framework, for example? If not, I don't think it's worth it.
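For readers following the thread, the trade-off under discussion can be sketched as follows (a minimal sketch with hypothetical, simplified names, not the actual Spark classes): a `sealed` trait requires all of its implementations, including any test-only ones, to live in the same source file, and in exchange the compiler can check pattern matches for exhaustiveness.

```scala
// Simplified sketch: `sealed` forces every subclass (including test-only ones)
// into this file, in exchange for exhaustiveness checking on matches.
sealed trait DecommissionReason { def message: String }
case class DynamicAllocation(message: String = "spark scale down") extends DecommissionReason
case class NodeLoss(message: String = "node is lost") extends DecommissionReason

// Because the trait is sealed, the compiler warns if a case is missing here.
def describe(r: DecommissionReason): String = r match {
  case DynamicAllocation(m) => s"dynamic allocation: $m"
  case NodeLoss(m)          => s"node loss: $m"
}
```

Dropping `sealed` would let test subclasses live elsewhere, at the cost of losing the missing-case warning.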








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29788: [SPARK-32913][CORE][K8S] Improve ExecutorDecommissionInfo and ExecutorDecommissionState for different use cases

2020-09-18 Thread GitBox


agrawaldevesh commented on a change in pull request #29788:
URL: https://github.com/apache/spark/pull/29788#discussion_r491083572



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2368,7 +2368,7 @@ private[scheduler] class 
DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler
 case ExecutorLost(execId, reason) =>
   val workerHost = reason match {
 case ExecutorProcessLost(_, workerHost, _) => workerHost
-case ExecutorDecommission(workerHost) => workerHost
+case ExecutorDecommission(_, host) => host

Review comment:
   See comment below for `ExecutorDecommission` ... Should this be changed 
to a:
   
   ```
   case decom: ExecutorDecommission => decom.workerHost // or decom.host
   ```
   
   You don't need to add an extra '_' then.
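A hedged sketch of the suggestion above (using a simplified stand-in for the PR's class, with assumed field names, not the real definition): a type pattern with a binder matches on the class and extracts the whole value, so unused constructor fields need no `_` placeholders.

```scala
// Simplified stand-in for the class under review (field names assumed).
case class ExecutorDecommission(reason: String, workerHost: Option[String] = None)

// Bind the whole matched value via a type pattern; no per-field underscores,
// so adding fields to the case class later does not break this arm.
def hostOf(loss: Any): Option[String] = loss match {
  case decom: ExecutorDecommission => decom.workerHost
  case _                           => None
}
```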

##
File path: 
core/src/main/scala/org/apache/spark/scheduler/ExecutorLossReason.scala
##
@@ -71,7 +71,8 @@ case class ExecutorProcessLost(
  * This is used by the task scheduler to remove state associated with the 
executor, but
  * not yet fail any tasks that were running in the executor before the 
executor is "fully" lost.
  *
- * @param workerHost it is defined when the worker is decommissioned too
+ * @param reason the reason why the executor is decommissioned
+ * @param host it is defined when the host where the executor located is 
decommissioned too
  */
-private [spark] case class ExecutorDecommission(workerHost: Option[String] = 
None)
- extends ExecutorLossReason("Executor decommission.")
+private [spark] case class ExecutorDecommission(reason: String, host: 
Option[String] = None)

Review comment:
   My scala knowledge is really really poor, but I would rather we make 
this be a non case class if you are planning to do this. Currently, I think the 
field "reason" is going to be duplicated in the base class ExecutorLossReason  
and the ExecutorDecommission. 
   
   That's also the reason why you are pattern matching it above with an 
additional _ (for the `reason`) argument, when you really don't care about the 
`reason`. 

##
File path: 
core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionReason.scala
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+private[spark] sealed trait ExecutorDecommissionReason {
+  val reason: String = "decommissioned"

Review comment:
   I don't think the `reason` field is really needed anywhere, besides it 
being used for `toString` ? Should we just require overriding `toString` by 
marking `toString` abstract ? I don't think that child classes need to override 
both `toString` and `reason` : I would prefer we just override methods instead 
of fields.

##
File path: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##
@@ -970,6 +970,9 @@ private[spark] class TaskSchedulerImpl(
   logDebug(s"Executor $executorId on $hostPort lost, but reason not yet known.")
 case ExecutorKilled =>
   logInfo(s"Executor $executorId on $hostPort killed by driver.")
+case ExecutorDecommission(reason, _) =>
+  // use logInfo instead of logError as the loss of decommissioned executor is what we expect
+  logInfo(s"Decommissioned executor $executorId on $hostPort shutdown: $reason")

Review comment:
   Instead of 'shutdown', should we say 'is finally lost'? That would be more accurate in this setting.
   
   +1 on this change to avoid log spam.

##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
##
@@ -991,7 +991,7 @@ private[spark] class TaskSetManager(
 for ((tid, info) <- taskInfos if info.running && info.executorId == 
execId) {
   val exitCausedByApp: Boolean = reason match {
 case exited: ExecutorExited => exited.exitCausedByApp
-case ExecutorKilled | ExecutorDecommission(_) => false
+case ExecutorKilled | ExecutorDecommission(_, _) => false

Review comment:
   I am wondering if we should instead pattern match in a separate arm 
like: 
   
   ```
   case _: ExecutorDecommission => false
   ```
   
   To avoid having to change the case arms when we make changes to `ExecutorDecommission` later.

[GitHub] [spark] SparkQA removed a comment on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


SparkQA removed a comment on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-694982896


   **[Test build #128881 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128881/testReport)** for PR 29799 at commit [`814d933`](https://github.com/apache/spark/commit/814d933e51ecfbd08726caf7762d6dde284d50f4).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-694993304










[GitHub] [spark] AmplabJenkins commented on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-694993304










[GitHub] [spark] SparkQA commented on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


SparkQA commented on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-694992925


   **[Test build #128881 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128881/testReport)** for PR 29799 at commit [`814d933`](https://github.com/apache/spark/commit/814d933e51ecfbd08726caf7762d6dde284d50f4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] rednaxelafx commented on a change in pull request #29798: [SPARK-32931][SQL] Unevaluable Expressions are not Foldable

2020-09-18 Thread GitBox


rednaxelafx commented on a change in pull request #29798:
URL: https://github.com/apache/spark/pull/29798#discussion_r491088433



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
##
@@ -318,7 +321,6 @@ trait Unevaluable extends Expression {
  */
 trait RuntimeReplaceable extends UnaryExpression with Unevaluable {
   override def nullable: Boolean = child.nullable
-  override def foldable: Boolean = child.foldable

Review comment:
   This change is not what you'd actually want.
   I'm actually really curious why `RuntimeReplaceable` has to implement 
`Unevaluable` in the first place. It doesn't really make sense. We could just 
turn `RuntimeReplaceable` into a proper wrapper that delegates everything, 
right?
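One way to read the delegation idea (a sketch under assumed, heavily simplified types; the real `RuntimeReplaceable` contract in Spark is more involved): the wrapper would forward properties such as `foldable` to the expression it will be rewritten into, instead of inheriting the `Unevaluable` defaults.

```scala
// Minimal model: a replaceable node forwards `foldable` to its replacement child.
trait Expr { def foldable: Boolean }
case class Literal(value: Int) extends Expr { val foldable = true }
case class ReplacedBy(child: Expr) extends Expr {
  def foldable: Boolean = child.foldable // delegate rather than hard-code
}
```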

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
##
@@ -327,7 +327,6 @@ case class InSubquery(values: Seq[Expression], query: 
ListQuery)
 
   override def children: Seq[Expression] = values :+ query
   override def nullable: Boolean = children.exists(_.nullable)
-  override def foldable: Boolean = children.forall(_.foldable)

Review comment:
   Let's get more confirmation of why it was written this way 
originally, and whether we're losing anything by just making `foldable` 
always `false` here.

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
##
@@ -365,14 +360,6 @@ abstract class OffsetWindowFunction
 
   override def children: Seq[Expression] = Seq(input, offset, default)
 
-  /*

Review comment:
   Can we keep the original comment and just add a new line of comment that 
says `foldable` is properly set to `false` from `Unevaluable`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #29805: [SPARK-32635][SQL][2.4] Fix foldable propagation

2020-09-18 Thread GitBox


dongjoon-hyun closed pull request #29805:
URL: https://github.com/apache/spark/pull/29805


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-694983574







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-694983574







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29799: [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods

2020-09-18 Thread GitBox


SparkQA commented on pull request #29799:
URL: https://github.com/apache/spark/pull/29799#issuecomment-694982896


   **[Test build #128881 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128881/testReport)**
 for PR 29799 at commit 
[`814d933`](https://github.com/apache/spark/commit/814d933e51ecfbd08726caf7762d6dde284d50f4).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

2020-09-18 Thread GitBox


viirya commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r491080641



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala
##
@@ -316,4 +317,33 @@ class HiveOrcQuerySuite extends OrcQueryTest with 
TestHiveSingleton {
   }
 }
   }
+
+  test("SPARK-32864: Support ORC forced positional evolution") {
+Seq("native", "hive").foreach { orcImpl =>
+  Seq(true, false).foreach { forcePositionalEvolution =>
+withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl,
+  OrcConf.FORCE_POSITIONAL_EVOLUTION.getAttribute -> 
forcePositionalEvolution.toString) {

Review comment:
   Do we want to use this config as-is, or append the usual 
`spark.hadoop` prefix?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29806: [SPARK-32187][PYTHON][DOCS] Doc on Python packaging

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29806:
URL: https://github.com/apache/spark/pull/29806#issuecomment-694979963







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29806: [SPARK-32187][PYTHON][DOCS] Doc on Python packaging

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29806:
URL: https://github.com/apache/spark/pull/29806#issuecomment-694979963







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29806: [SPARK-32187][PYTHON][DOCS] Doc on Python packaging

2020-09-18 Thread GitBox


SparkQA removed a comment on pull request #29806:
URL: https://github.com/apache/spark/pull/29806#issuecomment-694966780


   **[Test build #128880 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128880/testReport)**
 for PR 29806 at commit 
[`d58eab4`](https://github.com/apache/spark/commit/d58eab4e9b864a0aff17f66997a60b405879b140).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29806: [SPARK-32187][PYTHON][DOCS] Doc on Python packaging

2020-09-18 Thread GitBox


SparkQA commented on pull request #29806:
URL: https://github.com/apache/spark/pull/29806#issuecomment-694979561


   **[Test build #128880 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128880/testReport)**
 for PR 29806 at commit 
[`d58eab4`](https://github.com/apache/spark/commit/d58eab4e9b864a0aff17f66997a60b405879b140).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] peter-toth commented on a change in pull request #29737: [SPARK-32864][SQL] Support ORC forced positional evolution

2020-09-18 Thread GitBox


peter-toth commented on a change in pull request #29737:
URL: https://github.com/apache/spark/pull/29737#discussion_r491077606



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##
@@ -142,13 +142,17 @@ object OrcUtils extends Logging {
   reader: Reader,
   conf: Configuration): Option[(Array[Int], Boolean)] = {
 val orcFieldNames = reader.getSchema.getFieldNames.asScala
+val forcePositionalEvolution = 
OrcConf.FORCE_POSITIONAL_EVOLUTION.getBoolean(conf)

Review comment:
   I see your point now, but this is an ORC-specific setting in Hive. With 
Parquet tables you get `NULL` on a renamed column:
   ```
   > set orc.force.positional.evolution;
   +--+
   | set  |
   +--+
   | orc.force.positional.evolution=true  |
   +--+
   > create external table t2 (c1 string, c2 string) stored as parquet;
   > insert into t2 values ('foo', 'bar');
   > alter table t2 change c1 c3 string;
   > select * from t2;
   +++
   | t2.c3  | t2.c2  |
   +++
   | NULL   | bar|
   +++
   ```
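
   The difference between the two resolution modes can be sketched in 
plain Scala (hypothetical helpers, not Spark's `OrcUtils` code): with 
name-based matching a renamed column no longer resolves and yields 
`NULL`, while positional matching keeps returning the old value by 
ordinal:

```scala
// File written with columns (c1, c2); the table was then altered to
// rename c1 -> c3. `resolveByName`/`resolveByPosition` are sketches.
val fileSchema = Seq("c1", "c2")                 // names stored in the file
val fileRow    = Map("c1" -> "foo", "c2" -> "bar")
val readSchema = Seq("c3", "c2")                 // current table schema

// Name-based matching: the renamed column is not found -> None (NULL).
def resolveByName(col: String): Option[String] =
  fileRow.get(col)

// Positional matching: look the column up by its ordinal in the
// current schema, then read the file column at the same position.
def resolveByPosition(col: String): Option[String] = {
  val idx = readSchema.indexOf(col)
  if (idx >= 0 && idx < fileSchema.length) fileRow.get(fileSchema(idx))
  else None
}

println(readSchema.map(resolveByName))     // None for c3, Some(bar) for c2
println(readSchema.map(resolveByPosition)) // Some(foo) for c3, Some(bar) for c2
```

   That matches the Parquet output quoted above: name-based matching 
returns `NULL` for the renamed `c3`, while forced positional evolution 
would return `foo`.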





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29749: [SPARK-32877][SQL] Fix Hive UDF not support decimal type in complex type

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29749:
URL: https://github.com/apache/spark/pull/29749#issuecomment-694969733







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29749: [SPARK-32877][SQL] Fix Hive UDF not support decimal type in complex type

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29749:
URL: https://github.com/apache/spark/pull/29749#issuecomment-694969733







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29749: [SPARK-32877][SQL] Fix Hive UDF not support decimal type in complex type

2020-09-18 Thread GitBox


SparkQA removed a comment on pull request #29749:
URL: https://github.com/apache/spark/pull/29749#issuecomment-694908702


   **[Test build #128879 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128879/testReport)**
 for PR 29749 at commit 
[`e13f2b9`](https://github.com/apache/spark/commit/e13f2b94ff0c2897663dd81e596c775b728db862).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29749: [SPARK-32877][SQL] Fix Hive UDF not support decimal type in complex type

2020-09-18 Thread GitBox


SparkQA commented on pull request #29749:
URL: https://github.com/apache/spark/pull/29749#issuecomment-694969073


   **[Test build #128879 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128879/testReport)**
 for PR 29749 at commit 
[`e13f2b9`](https://github.com/apache/spark/commit/e13f2b94ff0c2897663dd81e596c775b728db862).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29806: [SPARK-32187][PYTHON][DOCS] Doc on Python packaging

2020-09-18 Thread GitBox


AmplabJenkins removed a comment on pull request #29806:
URL: https://github.com/apache/spark/pull/29806#issuecomment-694967424







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29806: [SPARK-32187][PYTHON][DOCS] Doc on Python packaging

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #29806:
URL: https://github.com/apache/spark/pull/29806#issuecomment-694967424







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29806: [SPARK-32187][PYTHON][DOCS] Doc on Python packaging

2020-09-18 Thread GitBox


SparkQA commented on pull request #29806:
URL: https://github.com/apache/spark/pull/29806#issuecomment-694966780


   **[Test build #128880 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128880/testReport)**
 for PR 29806 at commit 
[`d58eab4`](https://github.com/apache/spark/commit/d58eab4e9b864a0aff17f66997a60b405879b140).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #29797: [SPARK-32932][SQL] Do not use local shuffle reader on RepartitionByExpression when coalescing disabled

2020-09-18 Thread GitBox


viirya commented on a change in pull request #29797:
URL: https://github.com/apache/spark/pull/29797#discussion_r490732936



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
##
@@ -705,7 +705,8 @@ abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
 exchange.ShuffleExchangeExec(
   r.partitioning,
   planLater(r.child),
-  noUserSpecifiedNumPartition = r.optNumPartitions.isEmpty) :: Nil
+  noUserSpecifiedNumPartition = conf.coalesceShufflePartitionsEnabled 
&&
+r.optNumPartitions.isEmpty) :: Nil

Review comment:
   Two questions so far.
   
   1. It seems `RepartitionByExpression` here is not always for the 
dynamic partition overwrite case. With this change, all 
`RepartitionByExpression` nodes are affected by the 
`coalesceShufflePartitionsEnabled` config.
   
   2. So users of dynamic partition overwrite need to set 
`conf.coalesceShufflePartitionsEnabled` to false to avoid this? What if a 
user needs to coalesce shuffle partitions in the query but also needs 
dynamic partition overwrite?
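
   The flag's behavior under the proposed change can be sketched as a 
small truth table in plain Scala (`coalesceEnabled` and 
`optNumPartitions` stand in for the config value and the user-specified 
partition count; this is not the actual `SparkStrategies` code):

```scala
// After the change, the shuffle reports "no user-specified partition
// number" (making it eligible for coalescing / local shuffle reader)
// only when coalescing is enabled AND the user left the count unset.
def noUserSpecifiedNumPartition(coalesceEnabled: Boolean,
                                optNumPartitions: Option[Int]): Boolean =
  coalesceEnabled && optNumPartitions.isEmpty

println(noUserSpecifiedNumPartition(true, None))      // true
println(noUserSpecifiedNumPartition(false, None))     // false: config off
println(noUserSpecifiedNumPartition(true, Some(10)))  // false: user set it
```

   This makes concrete the first question above: with coalescing 
disabled, every `RepartitionByExpression` behaves as if the user had 
pinned the partition count, not just the dynamic partition overwrite 
case.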





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28921: [SPARK-32086][YARN]Bug fix for RemoveBroadcast RPC failed after executor is shutdown

2020-09-18 Thread GitBox


AmplabJenkins commented on pull request #28921:
URL: https://github.com/apache/spark/pull/28921#issuecomment-694943473


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen closed pull request #29711: [SPARK-32808][SQL] Pass all test of sql/core module in Scala 2.13

2020-09-18 Thread GitBox


srowen closed pull request #29711:
URL: https://github.com/apache/spark/pull/29711


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


