[GitHub] [spark] sarutak opened a new pull request #32691: Docker integration test ga take2

2021-05-27 Thread GitBox


sarutak opened a new pull request #32691:
URL: https://github.com/apache/spark/pull/32691


   ### What changes were proposed in this pull request?
   
   This PR proposes to add `docker-integration-tests` to `run-tests.py` and GA (GitHub Actions).
   #32631 was merged once, but it missed one consideration and was reverted.
   
   The diff between this change and https://github.com/apache/spark/pull/32631/commits/692d95d1458993cbb9cbd47014202e84cd6aa328 (merged in #32631) is as follows.
   
   ```
          if: github.repository != 'apache/spark'
          id: sync-branch
          run: |
   +        apache_spark_ref=`git rev-parse HEAD`
            git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
            git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' merge --no-commit --progress --squash FETCH_HEAD
            git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' commit -m "Merged commit"
   +        echo "::set-output name=APACHE_SPARK_REF::$apache_spark_ref"
        - name: Cache Scala, SBT and Maven
          uses: actions/cache@v2
          with:
   ```
   
   ### Why are the changes needed?
   
   CI for `docker-integration-tests` is absent for now.
   
   ### Does this PR introduce _any_ user-facing change?
   
   GA.
   
   ### How was this patch tested?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


SparkQA commented on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850157887


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43564/
   





[GitHub] [spark] SparkQA commented on pull request #32653: [SPARK-35312][SS] Introduce new Option in Kafka source to specify minimum number of records to read per trigger

2021-05-27 Thread GitBox


SparkQA commented on pull request #32653:
URL: https://github.com/apache/spark/pull/32653#issuecomment-850152542


   **[Test build #139048 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139048/testReport)** for PR 32653 at commit [`b59e02e`](https://github.com/apache/spark/commit/b59e02e131f82f80489c527f202064d3de5f4fb9).





[GitHub] [spark] SparkQA commented on pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


SparkQA commented on pull request #32688:
URL: https://github.com/apache/spark/pull/32688#issuecomment-850150535


   **[Test build #139047 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139047/testReport)** for PR 32688 at commit [`f5bafee`](https://github.com/apache/spark/commit/f5bafeeb22f677a2dc823b7a9e95590020a02f8e).





[GitHub] [spark] LuciferYang commented on pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


LuciferYang commented on pull request #32688:
URL: https://github.com/apache/spark/pull/32688#issuecomment-850149331


   thx @HyukjinKwon 





[GitHub] [spark] SparkQA commented on pull request #32690: [SPARK-35510][PYTHON] Fix and reenable test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true

2021-05-27 Thread GitBox


SparkQA commented on pull request #32690:
URL: https://github.com/apache/spark/pull/32690#issuecomment-850148551


   **[Test build #139046 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139046/testReport)** for PR 32690 at commit [`22780dd`](https://github.com/apache/spark/commit/22780ddf9fe367693b0ba30260a5455fbb364807).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-27 Thread GitBox


AmplabJenkins removed a comment on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-850146664


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43565/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


AmplabJenkins removed a comment on pull request #32688:
URL: https://github.com/apache/spark/pull/32688#issuecomment-850146667


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139039/
   





[GitHub] [spark] sarutak commented on a change in pull request #32631: [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA.

2021-05-27 Thread GitBox


sarutak commented on a change in pull request #32631:
URL: https://github.com/apache/spark/pull/32631#discussion_r641279731



##
File path: .github/workflows/build_and_test.yml
##
@@ -625,3 +625,83 @@ jobs:
       with:
         name: unit-tests-log-tpcds--8-hadoop3.2-hive2.3
         path: "**/target/unit-tests.log"
+
+  docker-integration-tests:
+    name: Run docker integration tests
+    runs-on: ubuntu-20.04
+    env:
+      HADOOP_PROFILE: hadoop3.2
+      HIVE_PROFILE: hive2.3
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      SPARK_LOCAL_IP: localhost
+      ORACLE_DOCKER_IMAGE_NAME: oracle/database:18.4.0-xe
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      with:
+        fetch-depth: 0
+        repository: apache/spark
+        ref: master
+    - name: Sync the current branch with the latest in Apache Spark
+      if: github.repository != 'apache/spark'
+      id: sync-branch
+      run: |
+        git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
+        git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' merge --no-commit --progress --squash FETCH_HEAD
+        git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' commit -m "Merged commit"

Review comment:
   Ah, O.K. I'll do it. Thanks for letting me know.







[GitHub] [spark] AmplabJenkins removed a comment on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


AmplabJenkins removed a comment on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850092153









[GitHub] [spark] AmplabJenkins removed a comment on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


AmplabJenkins removed a comment on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850146665









[GitHub] [spark] AmplabJenkins commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-27 Thread GitBox


AmplabJenkins commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-850146664


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43565/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


AmplabJenkins commented on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850146665









[GitHub] [spark] AmplabJenkins commented on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


AmplabJenkins commented on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850146669


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139040/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


AmplabJenkins commented on pull request #32688:
URL: https://github.com/apache/spark/pull/32688#issuecomment-850146667


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139039/
   





[GitHub] [spark] HyukjinKwon edited a comment on pull request #32689: [SPARK-35552][SQL] Make query stage materialized more readable

2021-05-27 Thread GitBox


HyukjinKwon edited a comment on pull request #32689:
URL: https://github.com/apache/spark/pull/32689#issuecomment-850145032


   @LuciferYang, the docker test failure should now be fixed in the latest master branch.





[GitHub] [spark] HyukjinKwon commented on pull request #32689: [SPARK-35552][SQL] Make query stage materialized more readable

2021-05-27 Thread GitBox


HyukjinKwon commented on pull request #32689:
URL: https://github.com/apache/spark/pull/32689#issuecomment-850145032


   @LuciferYang, the docker test failure should be fixed in the latest master branch.





[GitHub] [spark] HyukjinKwon edited a comment on pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


HyukjinKwon edited a comment on pull request #32688:
URL: https://github.com/apache/spark/pull/32688#issuecomment-850144781


   @LuciferYang, the test failure should now be fixed in the latest master branch.





[GitHub] [spark] HyukjinKwon commented on pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


HyukjinKwon commented on pull request #32688:
URL: https://github.com/apache/spark/pull/32688#issuecomment-850144781


   @LuciferYang, the test failure should be fixed in the latest master branch.





[GitHub] [spark] HyukjinKwon edited a comment on pull request #32631: [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA.

2021-05-27 Thread GitBox


HyukjinKwon edited a comment on pull request #32631:
URL: https://github.com/apache/spark/pull/32631#issuecomment-850143899


   Sorry for the quick revert. I reverted first because, although the issue is fairly minor, testing a fix for it takes a while.





[GitHub] [spark] HyukjinKwon commented on pull request #32631: [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA.

2021-05-27 Thread GitBox


HyukjinKwon commented on pull request #32631:
URL: https://github.com/apache/spark/pull/32631#issuecomment-850143899


   Sorry for the quick revert. I reverted first because, although the issue is fairly minor, testing a fix for it takes a while.





[GitHub] [spark] HyukjinKwon commented on a change in pull request #32631: [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA.

2021-05-27 Thread GitBox


HyukjinKwon commented on a change in pull request #32631:
URL: https://github.com/apache/spark/pull/32631#discussion_r641277034



##
File path: .github/workflows/build_and_test.yml
##
@@ -625,3 +625,83 @@ jobs:
       with:
         name: unit-tests-log-tpcds--8-hadoop3.2-hive2.3
         path: "**/target/unit-tests.log"
+
+  docker-integration-tests:
+    name: Run docker integration tests
+    runs-on: ubuntu-20.04
+    env:
+      HADOOP_PROFILE: hadoop3.2
+      HIVE_PROFILE: hive2.3
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      SPARK_LOCAL_IP: localhost
+      ORACLE_DOCKER_IMAGE_NAME: oracle/database:18.4.0-xe
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      with:
+        fetch-depth: 0
+        repository: apache/spark
+        ref: master
+    - name: Sync the current branch with the latest in Apache Spark
+      if: github.repository != 'apache/spark'
+      id: sync-branch
+      run: |
+        git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
+        git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' merge --no-commit --progress --squash FETCH_HEAD
+        git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' commit -m "Merged commit"

Review comment:
   @sarutak would you mind opening a PR again for this?







[GitHub] [spark] HyukjinKwon commented on a change in pull request #32631: [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA.

2021-05-27 Thread GitBox


HyukjinKwon commented on a change in pull request #32631:
URL: https://github.com/apache/spark/pull/32631#discussion_r641276796



##
File path: .github/workflows/build_and_test.yml
##
@@ -625,3 +625,83 @@ jobs:
       with:
         name: unit-tests-log-tpcds--8-hadoop3.2-hive2.3
         path: "**/target/unit-tests.log"
+
+  docker-integration-tests:
+    name: Run docker integration tests
+    runs-on: ubuntu-20.04
+    env:
+      HADOOP_PROFILE: hadoop3.2
+      HIVE_PROFILE: hive2.3
+      GITHUB_PREV_SHA: ${{ github.event.before }}
+      SPARK_LOCAL_IP: localhost
+      ORACLE_DOCKER_IMAGE_NAME: oracle/database:18.4.0-xe
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      with:
+        fetch-depth: 0
+        repository: apache/spark
+        ref: master
+    - name: Sync the current branch with the latest in Apache Spark
+      if: github.repository != 'apache/spark'
+      id: sync-branch
+      run: |
+        git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
+        git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' merge --no-commit --progress --squash FETCH_HEAD
+        git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' commit -m "Merged commit"

Review comment:
   Oh, we should add `echo "::set-output name=APACHE_SPARK_REF::$apache_spark_ref"` after this line because we're running tests with `run-tests.py`.
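
For context, here is a minimal sketch of why a test runner would need this value: a script such as `run-tests.py` can only restrict itself to changed modules if it knows which commit to diff against. The helpers below are hypothetical (the names `pick_base_ref` and `changed_files`, and the fallback order, are assumptions and not the actual `run-tests.py` code). The sketch also assumes the workflow passes the step output to the script as an environment variable named `APACHE_SPARK_REF`.

```python
import subprocess
from typing import List, Mapping, Optional


def pick_base_ref(env: Mapping[str, str]) -> Optional[str]:
    """Return the commit to diff against, or None if it is unknown.

    Hypothetical helper: prefers APACHE_SPARK_REF (the upstream HEAD recorded
    before the fork branch was merged in), then falls back to GITHUB_PREV_SHA.
    Empty strings are treated as "not set".
    """
    for name in ("APACHE_SPARK_REF", "GITHUB_PREV_SHA"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def changed_files(base_ref: str, head: str = "HEAD") -> List[str]:
    """List files that differ between base_ref and head.

    Must be called from inside a git checkout that contains both commits.
    """
    out = subprocess.check_output(
        ["git", "diff", "--name-only", base_ref, head], text=True
    )
    return [line for line in out.splitlines() if line]
```

If neither variable is set, `pick_base_ref` returns `None`, and a caller would presumably fall back to running every module instead of a subset.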


Review comment:
   I will revert this for now; it seems to break other tests.







[GitHub] [spark] SparkQA commented on pull request #32689: [SPARK-35552][SQL] Make query stage materialized more readable

2021-05-27 Thread GitBox


SparkQA commented on pull request #32689:
URL: https://github.com/apache/spark/pull/32689#issuecomment-850142972


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43563/
   





[GitHub] [spark] SparkQA commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-27 Thread GitBox


SparkQA commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-850142364


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43565/
   





[GitHub] [spark] SparkQA commented on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


SparkQA commented on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850140897


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43564/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


SparkQA removed a comment on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850068550


   **[Test build #139040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139040/testReport)** for PR 32686 at commit [`408851d`](https://github.com/apache/spark/commit/408851d641be4aa13146c640a48ebfb9bc158be8).





[GitHub] [spark] SparkQA commented on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


SparkQA commented on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850137924


   **[Test build #139040 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139040/testReport)** for PR 32686 at commit [`408851d`](https://github.com/apache/spark/commit/408851d641be4aa13146c640a48ebfb9bc158be8).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


SparkQA commented on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850137712


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43562/
   





[GitHub] [spark] SparkQA commented on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


SparkQA commented on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850135302


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43561/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


SparkQA removed a comment on pull request #32688:
URL: https://github.com/apache/spark/pull/32688#issuecomment-850068523


   **[Test build #139039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139039/testReport)** for PR 32688 at commit [`72c425c`](https://github.com/apache/spark/commit/72c425c2022eb76436af499f27f514556da18444).





[GitHub] [spark] SparkQA commented on pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


SparkQA commented on pull request #32688:
URL: https://github.com/apache/spark/pull/32688#issuecomment-850134234


   **[Test build #139039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139039/testReport)** for PR 32688 at commit [`72c425c`](https://github.com/apache/spark/commit/72c425c2022eb76436af499f27f514556da18444).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] HyukjinKwon commented on pull request #32690: [SPARK-35510][PYTHON] Fix and reenable test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true

2021-05-27 Thread GitBox


HyukjinKwon commented on pull request #32690:
URL: https://github.com/apache/spark/pull/32690#issuecomment-850130270


   cc @xinrong-databricks and @itholic too fyi





[GitHub] [spark] HyukjinKwon opened a new pull request #32690: [SPARK-35510][PYTHON] Fix and reenable test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true

2021-05-27 Thread GitBox


HyukjinKwon opened a new pull request #32690:
URL: https://github.com/apache/spark/pull/32690


   ### What changes were proposed in this pull request?
   
   This PR proposes to fix and re-enable 
`test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true`, 
which was disabled when we upgraded to Python 3.9 in CI in 
https://github.com/apache/spark/pull/32657.
   
   This appears to be caused by a behaviour change in the latest NumPy; see also 
`https://github.com/numpy/numpy/pull/16273#discussion_r641264085`.
   
   pandas inherits this behaviour, but it doesn't make sense when `numeric_only` 
is set to `True` in pandas. I will track and follow the status of the issue 
between pandas and NumPy.
   
   For the time being, I propose to exclude only the boolean case from the 
percentile/quartile test case.
   
   ### Why are the changes needed?
   
   To keep the test coverage.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   I only tested this roughly locally, but it should pass in CI.





[GitHub] [spark] SparkQA commented on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


SparkQA commented on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850126315


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43562/
   





[GitHub] [spark] lidiyag commented on pull request #32664: [SPARK-35516][WEBUI] Storage UI tab Storage Level tool tip correction

2021-05-27 Thread GitBox


lidiyag commented on pull request #32664:
URL: https://github.com/apache/spark/pull/32664#issuecomment-850124776


   @dongjoon-hyun  @srowen please take a look





[GitHub] [spark] SparkQA commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark

2021-05-27 Thread GitBox


SparkQA commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-850124566


   **[Test build #139045 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139045/testReport)**
 for PR 32473 at commit 
[`d4d39d3`](https://github.com/apache/spark/commit/d4d39d3fdecccd3551b07f8249a4015a0420a170).





[GitHub] [spark] SparkQA commented on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


SparkQA commented on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850124515


   **[Test build #139044 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139044/testReport)**
 for PR 32658 at commit 
[`a4a2bb2`](https://github.com/apache/spark/commit/a4a2bb239e428b6da21c0a2214f348c89067048d).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32582: [SPARK-35436][SS] RocksDBFileManager - save checkpoint to DFS

2021-05-27 Thread GitBox


AmplabJenkins removed a comment on pull request #32582:
URL: https://github.com/apache/spark/pull/32582#issuecomment-850124238


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43559/
   





[GitHub] [spark] SparkQA commented on pull request #32689: [SPARK-35552][SQL] Make query stage materialized more readable

2021-05-27 Thread GitBox


SparkQA commented on pull request #32689:
URL: https://github.com/apache/spark/pull/32689#issuecomment-850124454


   **[Test build #139043 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139043/testReport)**
 for PR 32689 at commit 
[`06a5cd7`](https://github.com/apache/spark/commit/06a5cd7889190714ec13db3f3124fc249398038e).





[GitHub] [spark] ulysses-you commented on pull request #32689: [SPARK-35552][SQL] Make query stage materialized more readable

2021-05-27 Thread GitBox


ulysses-you commented on pull request #32689:
URL: https://github.com/apache/spark/pull/32689#issuecomment-850124397


   cc @maropu @cloud-fan 





[GitHub] [spark] ulysses-you opened a new pull request #32689: [SPARK-35552][SQL] Make query stage materialized more readable

2021-05-27 Thread GitBox


ulysses-you opened a new pull request #32689:
URL: https://github.com/apache/spark/pull/32689


   
   
   ### What changes were proposed in this pull request?
   
   Add a new method `isMaterialized` in `QueryStageExec`.
   
   ### Why are the changes needed?
   
   Currently, we use `resultOption().get.isDefined` to check whether a query stage 
has materialized. That is not readable at a glance, so it's better to express 
the check through a new method such as `isMaterialized`.
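   
   As a rough illustration of the readability argument (a minimal sketch, not the 
actual patch): `StageSketch` below is a hypothetical stand-in for `QueryStageExec`, 
and it assumes the result is held in an `AtomicReference[Option[Any]]`. The 
`materialize` helper and the demo object are invented for the example.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for QueryStageExec; only the result holder is modelled.
final class StageSketch {
  // Holds None until the stage result is available (assumed representation).
  private val resultOption = new AtomicReference[Option[Any]](None)

  // Named check that replaces the raw Option inspection at call sites.
  def isMaterialized: Boolean = resultOption.get().isDefined

  // Records the result; invented helper so the sketch can be exercised.
  def materialize(result: Any): Unit = resultOption.set(Some(result))
}

object IsMaterializedDemo {
  // Returns the check's value before and after the result is recorded.
  def run(): (Boolean, Boolean) = {
    val stage = new StageSketch
    val before = stage.isMaterialized
    stage.materialize("rows")
    (before, stage.isMaterialized)
  }
}
```

   Call sites then read `stage.isMaterialized` instead of reaching into the 
holder, which is the whole point of the proposed method.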
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Pass CI.





[GitHub] [spark] AmplabJenkins commented on pull request #32582: [SPARK-35436][SS] RocksDBFileManager - save checkpoint to DFS

2021-05-27 Thread GitBox


AmplabJenkins commented on pull request #32582:
URL: https://github.com/apache/spark/pull/32582#issuecomment-850124238


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43559/
   





[GitHub] [spark] SparkQA commented on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


SparkQA commented on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850123795


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43561/
   





[GitHub] [spark] SparkQA commented on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


SparkQA commented on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-85015


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43560/
   





[GitHub] [spark] SparkQA commented on pull request #32582: [SPARK-35436][SS] RocksDBFileManager - save checkpoint to DFS

2021-05-27 Thread GitBox


SparkQA commented on pull request #32582:
URL: https://github.com/apache/spark/pull/32582#issuecomment-850121547


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43559/
   





[GitHub] [spark] otterc commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-27 Thread GitBox


otterc commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r641257596



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2000,6 +2023,147 @@ private[spark] class DAGScheduler(
 }
   }
 
+  /**
+   * Schedules shuffle merge finalize.
+   */
+  private[scheduler] def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = {
+    logInfo(("%s (%s) scheduled for finalizing" +
+      " shuffle merge in %s s").format(stage, stage.name, shuffleMergeFinalizeWaitSec))
+    shuffleMergeFinalizeScheduler.schedule(
+      new Runnable {
+        override def run(): Unit = finalizeShuffleMerge(stage)
+      },
+      shuffleMergeFinalizeWaitSec,
+      TimeUnit.SECONDS
+    )
+  }
+
+  /**
+   * DAGScheduler notifies all the remote shuffle services chosen to serve shuffle merge request for
+   * the given shuffle map stage to finalize the shuffle merge process for this shuffle. This is
+   * invoked in a separate thread to reduce the impact on the DAGScheduler main thread, as the
+   * scheduler might need to talk to 1000s of shuffle services to finalize shuffle merge.
+   */
+  private[scheduler] def finalizeShuffleMerge(stage: ShuffleMapStage): Unit = {
+    logInfo("%s (%s) finalizing the shuffle merge".format(stage, stage.name))
+    externalShuffleClient.foreach { shuffleClient =>
+      val shuffleId = stage.shuffleDep.shuffleId
+      val numMergers = stage.shuffleDep.getMergerLocs.length
+      val numResponses = new AtomicInteger()
+      val results = (0 until numMergers).map(_ => SettableFuture.create[Boolean]())
+      val timedOut = new AtomicBoolean()
+
+      def increaseAndCheckResponseCount(): Unit = {
+        if (numResponses.incrementAndGet() == numMergers) {
+          logInfo("%s (%s) shuffle merge finalized".format(stage, stage.name))
+          // Since this runs in the netty client thread and is outside of DAGScheduler
+          // event loop, we only post ShuffleMergeFinalized event into the event queue.
+          // The processing of this event should be done inside the event loop, so it
+          // can safely modify scheduler's internal state.
+          eventProcessLoop.post(ShuffleMergeFinalized(stage))
+        }
+      }
+
+      stage.shuffleDep.getMergerLocs.zipWithIndex.foreach {
+        case (shuffleServiceLoc, index) =>
+          // Sends async request to shuffle service to finalize shuffle merge on that host
+          // TODO: SPARK-35536: Cancel finalizeShuffleMerge if the stage is cancelled
+          // TODO: during shuffleMergeFinalizeWaitSec
+          shuffleClient.finalizeShuffleMerge(shuffleServiceLoc.host,
+            shuffleServiceLoc.port, shuffleId,
+            new MergeFinalizerListener {
+              override def onShuffleMergeSuccess(statuses: MergeStatuses): Unit = {
+                assert(shuffleId == statuses.shuffleId)
+                if (!timedOut.get()) {
+                  eventProcessLoop.post(RegisterMergeStatuses(stage, MergeStatus.
+                    convertMergeStatusesToMergeStatusArr(statuses, shuffleServiceLoc)))
+                  increaseAndCheckResponseCount()
+                  results(index).set(true)
+                }
+              }
+
+              override def onShuffleMergeFailure(e: Throwable): Unit = {
+                if (!timedOut.get()) {
+                  logWarning(s"Exception encountered when trying to finalize shuffle " +
+                    s"merge on ${shuffleServiceLoc.host} for shuffle $shuffleId", e)
+                  increaseAndCheckResponseCount()
+                  // Do not fail the future as this would cause dag scheduler to prematurely
+                  // give up on waiting for merge results from the remaining shuffle services
+                  // if one fails
+                  results(index).set(false)
+                }
+              }
+            })
+      }
+      // DAGScheduler only waits for a limited amount of time for the merge results.
+      // It will attempt to submit the next stage(s) irrespective of whether merge results
+      // from all shuffle services are received or not.
+      // TODO: SPARK-33701: Instead of waiting for a constant amount of time for finalization
+      // TODO: for all the stages, adaptively tune timeout for merge finalization
+      try {
+        Futures.allAsList(results: _*).get(shuffleMergeResultsTimeoutSec, TimeUnit.SECONDS)
+      } catch {
+        case _: TimeoutException =>
+          logInfo(s"Timed out on waiting for merge results from all " +
+            s"$numMergers mergers for shuffle $shuffleId")
+          timedOut.set(true)
+          eventProcessLoop.post(ShuffleMergeFinalized(stage))
+      }
+    }
+  }
+
+  private def processShuffleMapStageCompletion(shuffleStage: ShuffleMapStage): Unit = {
+    markStageAsFinished(shuffleStage)
+    logInfo("looking for newly runnable stages")

[GitHub] [spark] allisonwang-db commented on a change in pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-27 Thread GitBox


allisonwang-db commented on a change in pull request #32303:
URL: https://github.com/apache/spark/pull/32303#discussion_r641255370



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala
##
@@ -107,6 +107,11 @@ case class UsingJoin(tpe: JoinType, usingColumns: Seq[String]) extends JoinType
   override def sql: String = "USING " + tpe.sql
 }
 
+case class LateralJoin(tpe: JoinType) extends JoinType {
+  require(Seq(Inner, LeftOuter, Cross).contains(tpe), "Unsupported lateral join type " + tpe)

Review comment:
   @maropu Postgres supports INNER, CROSS, and LEFT lateral joins, and it 
doesn't make sense to support RIGHT OUTER and FULL OUTER lateral joins. How 
about adding the other two supported left join types, left semi and left 
anti, here? Then the set of lateral join types shouldn't need to change in the future.
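   
   A minimal sketch of what the widened check might look like. The join types 
below are simplified, hypothetical stand-ins (not Spark's actual `JoinType` 
hierarchy), and `supported` and `isAccepted` are names invented for the example.

```scala
object LateralJoinSketch {
  // Simplified stand-ins for the real join types (assumption for illustration).
  sealed trait JoinType
  case object Inner extends JoinType
  case object Cross extends JoinType
  case object LeftOuter extends JoinType
  case object LeftSemi extends JoinType
  case object LeftAnti extends JoinType
  case object RightOuter extends JoinType
  case object FullOuter extends JoinType

  // The three types from the diff plus the two proposed in the comment.
  private val supported: Set[JoinType] = Set(Inner, Cross, LeftOuter, LeftSemi, LeftAnti)

  final case class LateralJoin(tpe: JoinType) extends JoinType {
    require(supported.contains(tpe), s"Unsupported lateral join type $tpe")
  }

  // True when constructing the lateral join does not trip the require check.
  def isAccepted(tpe: JoinType): Boolean =
    scala.util.Try(LateralJoin(tpe)).isSuccess
}
```

   With this shape, RIGHT OUTER and FULL OUTER are still rejected at construction 
time, while all left-side join types are accepted.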







[GitHub] [spark] HyukjinKwon edited a comment on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


HyukjinKwon edited a comment on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850107310


   @itholic the generated doc looks a bit weird:
   
   ![Screen Shot 2021-05-28 at 1 09 38 
PM](https://user-images.githubusercontent.com/6477701/119928166-04c67480-bfb6-11eb-8449-428b01f2144a.png)
   
   It includes `# noqa`
   
   Can you double check and fix? Seems like we should fix other places for 
JSON, etc.
   





[GitHub] [spark] HyukjinKwon commented on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


HyukjinKwon commented on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850107310


   @itholic the generated doc looks a bit weird:
   ![Screen Shot 2021-05-28 at 1 09 38 
PM](https://user-images.githubusercontent.com/6477701/119928166-04c67480-bfb6-11eb-8449-428b01f2144a.png)
   
   Can you double check and fix? Seems like we should fix other places for 
JSON, etc.
   





[GitHub] [spark] allisonwang-db commented on a change in pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries

2021-05-27 Thread GitBox


allisonwang-db commented on a change in pull request #32303:
URL: https://github.com/apache/spark/pull/32303#discussion_r641253529



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
##
@@ -168,6 +168,21 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
   }
 }
 
+/**
+ * Rewrite lateral joins by rewriting all dependent joins (if any) inside the right
+ * sub-tree of the lateral join and converting the lateral join into a base join type.
+ */
+object RewriteLateralJoin extends Rule[LogicalPlan] with PredicateHelper {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case j @ Join(left, right, LateralJoin(joinType), condition, _) =>
+      val conditions = condition.map(splitConjunctivePredicates).getOrElse(Nil)
+      val newRight = DecorrelateInnerQuery.rewriteDomainJoins(left, right, conditions)
+      // TODO: handle the COUNT bug

Review comment:
   Created a new ticket for handling the COUNT bug in lateral subqueries: 
https://issues.apache.org/jira/browse/SPARK-35551
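   
   For context, a small sketch of the COUNT bug in general terms, assuming the 
TODO refers to the classic decorrelation problem: once a correlated COUNT is 
rewritten as an aggregate followed by a left outer join, outer rows without 
matches get NULL instead of 0. The example uses plain Scala collections with 
`None` standing in for NULL; all names are hypothetical and nothing here is 
Spark code.

```scala
object CountBugSketch {
  val outer: Seq[Int] = Seq(1, 2) // outer keys
  val inner: Seq[Int] = Seq(1, 1) // inner rows, keyed by the same column

  // Correlated evaluation: COUNT over an empty match set is 0.
  def correlated: Map[Int, Option[Long]] =
    outer.map(k => k -> Option(inner.count(_ == k).toLong)).toMap

  // Naive decorrelation: aggregate the inner side first, then left-outer join.
  // Keys without matches have no aggregate row, so the join yields None (NULL).
  def naiveDecorrelated: Map[Int, Option[Long]] = {
    val counts: Map[Int, Long] =
      inner.groupBy(identity).map { case (k, rows) => k -> rows.size.toLong }
    outer.map(k => k -> counts.get(k)).toMap
  }
}
```

   Key 2 has no inner rows: the correlated form reports a count of 0, while the 
naive rewrite reports no value at all, which is the discrepancy a fix has to 
patch up.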







[GitHub] [spark] HyukjinKwon commented on a change in pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


HyukjinKwon commented on a change in pull request #32658:
URL: https://github.com/apache/spark/pull/32658#discussion_r641253439



##
File path: docs/sql-data-sources-csv.md
##
@@ -38,3 +36,217 @@ Spark SQL provides `spark.read().csv("file_name")` to read 
a file or directory o
 
 
 
+
+## Data Source Option
+
+Data source options of CSV can be set via:
+* the `.option`/`.options` methods of
+  *  `DataFrameReader`
+  *  `DataFrameWriter`
+  *  `DataStreamReader`
+  *  `DataStreamWriter`
+* the built-in functions below
+  * `from_csv`
+  * `to_csv`
+  * `schema_of_csv`
+* `OPTIONS` clause at [CREATE TABLE USING 
DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+
+
+  Property Name | Default | Meaning | Scope
+  
+sep
+,
+Sets a separator (one or more characters) for each field and 
value.
+read/write
+  
+  
+encoding
+UTF-8 for reading, not set for writing
+Specifies encoding (charset) for reading or writing CSV files
+read/write
+  
+  
+quote
+"
+Sets a single character used for escaping quoted values where the 
separator can be part of the value. If you would like to turn off quotations, 
you need to set an empty string. If an empty string is set, it uses 
u0000 (null character) for writing, and it disables the quotation 
handling for reading.
+read/write
+  
+  
+quoteAll
+false
+A flag indicating whether all values should always be enclosed in 
quotes. It only escapes values containing a quote character by default.
+write
+  
+  
+escape
+\
+Sets a single character used for escaping quotes inside an already 
quoted value.
+read/write
+  
+  
+escapeQuotes
+true
+A flag indicating whether values containing quotes should always be 
enclosed in quotes. It escapes all values containing a quote character by 
default.
+write
+  
+  
+comment
+empty string
+Sets a single character used for skipping lines beginning with this 
character. It's disabled by default.
+read
+  
+  
+header
+false
+For reading, uses the first line as names of columns. For writing, 
writes the names of columns as the first line. Note that if the given path is an 
RDD of Strings, this header option will remove all lines identical to the header, 
if one exists.
+read/write
+  
+  
+inferSchema
+false
+Infers the input schema automatically from data. It requires one extra 
pass over the data.
+read
+  
+  
+enforceSchema
+true
+If it is set to true, the specified or inferred schema 
will be forcibly applied to datasource files, and headers in CSV files will be 
ignored. If the option is set to false, the schema will be 
validated against all headers in CSV files or the first header in RDD if the 
header option is set to true. Field names in the 
schema and column names in CSV headers are checked by their positions taking 
into account spark.sql.caseSensitive. Though the default value is 
true, it is recommended to disable the enforceSchema 
option to avoid incorrect results.
+read
+  
+  
+ignoreLeadingWhiteSpace
+false (for reading), true (for writing)
+A flag indicating whether or not leading whitespaces from values being 
read/written should be skipped.
+read/write
+  
+  
+ignoreTrailingWhiteSpace
+false (for reading), true (for writing)
+A flag indicating whether or not trailing whitespaces from values 
being read/written should be skipped.
+read/write
+  
+  
+nullValue
+empty string
+Sets the string representation of a null value. Since 2.0.1, this 
nullValue param applies to all supported types including the 
string type.
+read/write
+  
+  
+nanValue
+NaN
+Sets the string representation of a non-number value.
+read
+  
+  
+positiveInf
+Inf
+Sets the string representation of a positive infinity value.
+read
+  
+  
+negativeInf
+-Inf
+Sets the string representation of a negative infinity value.
+read
+  
+  
+dateFormat
+yyyy-MM-dd
+Sets the string that indicates a date format. Custom date formats 
follow the formats at Datetime Patterns 
(https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html). This applies to date type.
+read/write
+  
+  
+timestampFormat
+yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
+Sets the string that indicates a timestamp format. Custom date formats 
follow the formats at Datetime Patterns 
(https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html). This applies to timestamp type.
+read/write
+  
+  
+maxColumns
+20480
+Defines a hard limit of how many columns a record can have.
+read
+  
+  
+maxCharsPerColumn
+-1
+Defines the maximum number of characters allowed for any given value 
being read. The default value -1 means unlimited length.
+read
+  
+  
+mode
+PERMISSIVE
+Allows a mode for dealing with corrupt records during parsing. Note 
that Spark 

[GitHub] [spark] AmplabJenkins commented on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


AmplabJenkins commented on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850092153


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43558/
   





[GitHub] [spark] SparkQA commented on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


SparkQA commented on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850092142


   Kubernetes integration test status success
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43558/
   





[GitHub] [spark] itholic commented on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


itholic commented on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850092067


   Thanks, @HyukjinKwon 





[GitHub] [spark] SparkQA commented on pull request #32658: [SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

2021-05-27 Thread GitBox


SparkQA commented on pull request #32658:
URL: https://github.com/apache/spark/pull/32658#issuecomment-850089349


   **[Test build #139042 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139042/testReport)**
 for PR 32658 at commit 
[`1991031`](https://github.com/apache/spark/commit/1991031ea3871acc0a6ea20b96e3299bc1d6c51c).





[GitHub] [spark] SparkQA commented on pull request #32582: [SPARK-35436][SS] RocksDBFileManager - save checkpoint to DFS

2021-05-27 Thread GitBox


SparkQA commented on pull request #32582:
URL: https://github.com/apache/spark/pull/32582#issuecomment-850087350


   **[Test build #139041 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139041/testReport)**
 for PR 32582 at commit 
[`703f59f`](https://github.com/apache/spark/commit/703f59fd4055d68e5bf957e1dc4f17159256a65a).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32687: [SPARK-35545][SQL] Split SubqueryExpression's children field into outer attributes and join conditions

2021-05-27 Thread GitBox


AmplabJenkins removed a comment on pull request #32687:
URL: https://github.com/apache/spark/pull/32687#issuecomment-850086867


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139036/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


AmplabJenkins removed a comment on pull request #32688:
URL: https://github.com/apache/spark/pull/32688#issuecomment-850086868


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43557/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


AmplabJenkins commented on pull request #32688:
URL: https://github.com/apache/spark/pull/32688#issuecomment-850086868


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43557/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32687: [SPARK-35545][SQL] Split SubqueryExpression's children field into outer attributes and join conditions

2021-05-27 Thread GitBox


AmplabJenkins commented on pull request #32687:
URL: https://github.com/apache/spark/pull/32687#issuecomment-850086867


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139036/
   





[GitHub] [spark] wangyum commented on pull request #32675: [SPARK-35531][SQL] Can not insert into hive bucket table if create table with upper case schema

2021-05-27 Thread GitBox


wangyum commented on pull request #32675:
URL: https://github.com/apache/spark/pull/32675#issuecomment-850086524


   cc @cloud-fan @yaooqinn @AngersZh





[GitHub] [spark] wangyum commented on a change in pull request #32675: [SPARK-35531][SQL] Can not insert into hive bucket table if create table with upper case schema

2021-05-27 Thread GitBox


wangyum commented on a change in pull request #32675:
URL: https://github.com/apache/spark/pull/32675#discussion_r641220605



##
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala
##
@@ -870,4 +871,68 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
   assert(e.contains("Partition spec is invalid"))
 }
   }
+
+  test("Insert data with different cases") {

Review comment:
   Add `SPARK-35531` prefix to test name?







[GitHub] [spark] wangyum commented on a change in pull request #32675: [SPARK-35531][SQL] Can not insert into hive bucket table if create table with upper case schema

2021-05-27 Thread GitBox


wangyum commented on a change in pull request #32675:
URL: https://github.com/apache/spark/pull/32675#discussion_r641220214



##
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
##
@@ -1092,14 +1092,28 @@ private[hive] object HiveClientImpl extends Logging {
   hiveTable.setViewExpandedText(t)
 }
 
+    // hive may convert schema into lower cases while bucketSpec will not
+    // only convert if case not match
+    def convertColumnNames(schema: StructType, names: Seq[String]): Seq[String] = {
+      names.map(name => {
+        val s = schema.find(col => col.name.equalsIgnoreCase(name))
+        if (s.isDefined) {
+          s.get.name
+        } else {
+          name
+        }
+      })
+    }

Review comment:
   Rewrite `convertColumnNames`?
   ```scala
   def restoreHiveBucketSpecColNames(schema: StructType, names: Seq[String]): Seq[String] = {
     names.map { name =>
       schema.find(col => SQLConf.get.resolver(col.name, name)).map(_.name).getOrElse(name)
     }
   }
   ```
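
For illustration only, here is a minimal, Spark-free sketch of the same lookup. `Field`, `RestoreColNamesSketch`, and the `caseInsensitive` resolver are hypothetical stand-ins for `StructField`, the enclosing object, and `SQLConf.get.resolver`; they are assumptions made so the snippet runs on its own, not code from the PR.

```scala
object RestoreColNamesSketch {
  // Hypothetical stand-in for Spark's StructField; only the name matters here.
  final case class Field(name: String)

  // Hypothetical stand-in for the resolver: compares names ignoring case.
  val caseInsensitive: (String, String) => Boolean =
    (a, b) => a.equalsIgnoreCase(b)

  // For each requested name, prefer the spelling stored in the schema and
  // fall back to the requested name when no field matches.
  def restoreColNames(
      schema: Seq[Field],
      names: Seq[String],
      resolver: (String, String) => Boolean = caseInsensitive): Seq[String] = {
    names.map { name =>
      schema.find(col => resolver(col.name, name)).map(_.name).getOrElse(name)
    }
  }
}
```

Passing the resolver as a parameter mirrors the design choice in the suggestion: as far as I know, `SQLConf.get.resolver` follows `spark.sql.caseSensitive`, whereas a hard-coded `equalsIgnoreCase` always ignores case.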







[GitHub] [spark] SparkQA commented on pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


SparkQA commented on pull request #32688:
URL: https://github.com/apache/spark/pull/32688#issuecomment-850083738


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43557/
   





[GitHub] [spark] viirya commented on pull request #32582: [SPARK-35436][SS] RocksDBFileManager - save checkpoint to DFS

2021-05-27 Thread GitBox


viirya commented on pull request #32582:
URL: https://github.com/apache/spark/pull/32582#issuecomment-850083083


   Thanks @xuanyuanking. I will find some time to review this.





[GitHub] [spark] SparkQA commented on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


SparkQA commented on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850082809


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43558/
   





[GitHub] [spark] xuanyuanking commented on pull request #32582: [SPARK-35436] RocksDBFileManager - save checkpoint to DFS

2021-05-27 Thread GitBox


xuanyuanking commented on pull request #32582:
URL: https://github.com/apache/spark/pull/32582#issuecomment-850079062


   Now that #32272 is merged, I have rebased and addressed the comment, so this one is ready for review. cc @viirya and @HeartSaVioR 





[GitHub] [spark] xuanyuanking commented on pull request #32272: [SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata

2021-05-27 Thread GitBox


xuanyuanking commented on pull request #32272:
URL: https://github.com/apache/spark/pull/32272#issuecomment-850078509


   Thanks for the review and help!





[GitHub] [spark] xuanyuanking commented on a change in pull request #32272: [SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata

2021-05-27 Thread GitBox


xuanyuanking commented on a change in pull request #32272:
URL: https://github.com/apache/spark/pull/32272#discussion_r641196335



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala
##
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.state
+
+import java.io.File
+import java.nio.charset.StandardCharsets.UTF_8
+import java.nio.file.Files
+
+import scala.collection.Seq
+
+import com.fasterxml.jackson.annotation.JsonInclude.Include
+import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
+import com.fasterxml.jackson.module.scala.{DefaultScalaModule, 
ScalaObjectMapper}
+import org.json4s.NoTypeHints
+import org.json4s.jackson.Serialization
+
+/**
+ * Classes to represent metadata of checkpoints saved to DFS. Since this is 
converted to JSON, any
+ * changes to this MUST be backward-compatible.
+ */
+case class RocksDBCheckpointMetadata(
+sstFiles: Seq[RocksDBSstFile],
+logFiles: Seq[RocksDBLogFile],
+numKeys: Long) {
+  import RocksDBCheckpointMetadata._
+
+  def json: String = {
+// We turn this field into a null to avoid write a empty logFiles field in 
the json.
+val nullified = if (logFiles.isEmpty) this.copy(logFiles = null) else this
+mapper.writeValueAsString(nullified)
+  }
+
+  def prettyJson: String = 
Serialization.writePretty(this)(RocksDBCheckpointMetadata.format)
+
+  def writeToFile(metadataFile: File): Unit = {
+val writer = Files.newBufferedWriter(metadataFile.toPath, UTF_8)
+try {
+  writer.write(s"v$VERSION\n")
+  writer.write(this.json)
+} finally {
+  writer.close()
+}
+  }
+
+  def immutableFiles: Seq[RocksDBImmutableFile] = sstFiles ++ logFiles
+}
+
+/** Helper class for [[RocksDBCheckpointMetadata]] */
+object RocksDBCheckpointMetadata {
+  val VERSION = 1
+
+  implicit val format = Serialization.formats(NoTypeHints)
+
+  /** Used to convert between classes and JSON. */
+  lazy val mapper = {
+val _mapper = new ObjectMapper with ScalaObjectMapper
+_mapper.setSerializationInclusion(Include.NON_ABSENT)
+_mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
+_mapper.registerModule(DefaultScalaModule)
+_mapper
+  }
+
+  def readFromFile(metadataFile: File): RocksDBCheckpointMetadata = {
+val reader = Files.newBufferedReader(metadataFile.toPath, UTF_8)
+try {
+  val versionLine = reader.readLine()
+  if (versionLine != s"v$VERSION") {
+throw new IllegalStateException(
+  s"Cannot read RocksDB checkpoint metadata of version $versionLine")
+  }
+  Serialization.read[RocksDBCheckpointMetadata](reader)
+} finally {
+  reader.close()
+}
+  }
+
+  def apply(rocksDBFiles: Seq[RocksDBImmutableFile], numKeys: Long): 
RocksDBCheckpointMetadata = {
+val sstFiles = rocksDBFiles.collect { case file: RocksDBSstFile => file }
+val logFiles = rocksDBFiles.collect { case file: RocksDBLogFile => file }
+
+RocksDBCheckpointMetadata(sstFiles, logFiles, numKeys)
+  }
+}
+
+/**
+ * A RocksDBImmutableFile maintains a mapping between a local RocksDB file 
name and the name of
+ * its copy on DFS. Since these files are immutable, their DFS copies can be 
reused.
+ */
+sealed trait RocksDBImmutableFile {
+  def localFileName: String
+  def dfsFileName: String
+  def sizeBytes: Long
+
+  /**
+   * Whether another local file is same as the file described by this class.
+   * A file is same only when the name and the size are same.
+   */
+  def isSameFile(otherFile: File): Boolean = {
+otherFile.getName == localFileName && otherFile.length() == sizeBytes
+  }
+}
+
+/**
+ * Class to represent a RocksDB SST file. Since this is converted to JSON,
+ * any changes to these MUST be backward-compatible.
+ */
+private[sql] case class RocksDBSstFile(
+localFileName: String,
+dfsSstFileName: String,
+sizeBytes: Long) extends RocksDBImmutableFile {
+
+  override def dfsFileName: String = dfsSstFileName
+}
+
+/**
+ * Class to represent a RocksDB Log file. Since this is converted to JSON,
+ * any changes to these MUST be 
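
The `writeToFile` / `readFromFile` pair in the diff above follows a common pattern: a version header line, then the payload, with the reader rejecting any header it does not recognize. A minimal sketch of that pattern, assuming a plain string payload in place of the Jackson/json4s serialization (`VersionedFileSketch` and its method names are hypothetical, not part of the PR):

```scala
import java.io.File
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.Files

object VersionedFileSketch {
  val Version = 1

  // Write a version header line followed by the payload.
  def write(file: File, payload: String): Unit = {
    val writer = Files.newBufferedWriter(file.toPath, UTF_8)
    try {
      writer.write(s"v$Version\n")
      writer.write(payload)
    } finally {
      writer.close()
    }
  }

  // Read the payload back, rejecting files whose header line does not match.
  def read(file: File): String = {
    val reader = Files.newBufferedReader(file.toPath, UTF_8)
    try {
      val versionLine = reader.readLine()
      if (versionLine != s"v$Version") {
        throw new IllegalStateException(
          s"Cannot read metadata of version $versionLine")
      }
      // The payload is a single line in this sketch; an empty file body yields "".
      Option(reader.readLine()).getOrElse("")
    } finally {
      reader.close()
    }
  }
}
```

Checking the header before parsing lets a later format change fail fast with a clear message instead of a confusing deserialization error.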

[GitHub] [spark] SparkQA removed a comment on pull request #32687: [SPARK-35545][SQL] Split SubqueryExpression's children field into outer attributes and join conditions

2021-05-27 Thread GitBox


SparkQA removed a comment on pull request #32687:
URL: https://github.com/apache/spark/pull/32687#issuecomment-849987071


   **[Test build #139036 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139036/testReport)**
 for PR 32687 at commit 
[`d37d01a`](https://github.com/apache/spark/commit/d37d01a6ae8fd5404dca172e748fc9994c0d66b3).





[GitHub] [spark] SparkQA commented on pull request #32687: [SPARK-35545][SQL] Split SubqueryExpression's children field into outer attributes and join conditions

2021-05-27 Thread GitBox


SparkQA commented on pull request #32687:
URL: https://github.com/apache/spark/pull/32687#issuecomment-850074626


   **[Test build #139036 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139036/testReport)**
 for PR 32687 at commit 
[`d37d01a`](https://github.com/apache/spark/commit/d37d01a6ae8fd5404dca172e748fc9994c0d66b3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] HyukjinKwon closed pull request #32673: [SPARK-35530][ML][TESTS] Fix rounding error in DifferentiableLossAggregatorSuite with Java 11

2021-05-27 Thread GitBox


HyukjinKwon closed pull request #32673:
URL: https://github.com/apache/spark/pull/32673


   





[GitHub] [spark] HyukjinKwon commented on pull request #32673: [SPARK-35530][ML][TESTS] Fix rounding error in DifferentiableLossAggregatorSuite with Java 11

2021-05-27 Thread GitBox


HyukjinKwon commented on pull request #32673:
URL: https://github.com/apache/spark/pull/32673#issuecomment-850068886


   Merged to master.





[GitHub] [spark] SparkQA commented on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


SparkQA commented on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850068550


   **[Test build #139040 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139040/testReport)**
 for PR 32686 at commit 
[`408851d`](https://github.com/apache/spark/commit/408851d641be4aa13146c640a48ebfb9bc158be8).





[GitHub] [spark] SparkQA commented on pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


SparkQA commented on pull request #32688:
URL: https://github.com/apache/spark/pull/32688#issuecomment-850068523


   **[Test build #139039 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139039/testReport)**
 for PR 32688 at commit 
[`72c425c`](https://github.com/apache/spark/commit/72c425c2022eb76436af499f27f514556da18444).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32397: [SPARK-35084][CORE] Spark 3: supporting "--packages" in k8s cluster mode

2021-05-27 Thread GitBox


AmplabJenkins removed a comment on pull request #32397:
URL: https://github.com/apache/spark/pull/32397#issuecomment-850067873









[GitHub] [spark] LuciferYang opened a new pull request #32688: [SPARK-35550][BUILD] Upgrade Jackson to 2.12.3

2021-05-27 Thread GitBox


LuciferYang opened a new pull request #32688:
URL: https://github.com/apache/spark/pull/32688


   ### What changes were proposed in this pull request?
   This PR upgrades Jackson to 2.12.3.
   Jackson Release 2.12.3: 
[https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.12.3](https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.12.3)
 
   
   
   ### Why are the changes needed?
   Upgrade to a new version to bring potential bug fixes like 
[https://github.com/FasterXML/jackson-modules-java8/issues/207](https://github.com/FasterXML/jackson-modules-java8/issues/207)
 
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass Jenkins or GitHub Actions.
   





[GitHub] [spark] AmplabJenkins commented on pull request #32397: [SPARK-35084][CORE] Spark 3: supporting "--packages" in k8s cluster mode

2021-05-27 Thread GitBox


AmplabJenkins commented on pull request #32397:
URL: https://github.com/apache/spark/pull/32397#issuecomment-850067873









[GitHub] [spark] SparkQA removed a comment on pull request #32397: [SPARK-35084][CORE] Spark 3: supporting "--packages" in k8s cluster mode

2021-05-27 Thread GitBox


SparkQA removed a comment on pull request #32397:
URL: https://github.com/apache/spark/pull/32397#issuecomment-850011679


   **[Test build #139038 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139038/testReport)**
 for PR 32397 at commit 
[`b47599f`](https://github.com/apache/spark/commit/b47599fe82a2b6d4cd3896c4117d3de19cc62d5e).





[GitHub] [spark] SparkQA commented on pull request #32397: [SPARK-35084][CORE] Spark 3: supporting "--packages" in k8s cluster mode

2021-05-27 Thread GitBox


SparkQA commented on pull request #32397:
URL: https://github.com/apache/spark/pull/32397#issuecomment-850058005


   **[Test build #139038 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139038/testReport)**
 for PR 32397 at commit 
[`b47599f`](https://github.com/apache/spark/commit/b47599fe82a2b6d4cd3896c4117d3de19cc62d5e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA commented on pull request #32397: [SPARK-35084][CORE] Spark 3: supporting "--packages" in k8s cluster mode

2021-05-27 Thread GitBox


SparkQA commented on pull request #32397:
URL: https://github.com/apache/spark/pull/32397#issuecomment-850056962


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43556/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32301: [SPARK-35194][SQL] Refactor nested column aliasing for readability

2021-05-27 Thread GitBox


AmplabJenkins removed a comment on pull request #32301:
URL: https://github.com/apache/spark/pull/32301#issuecomment-850053730


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139035/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32301: [SPARK-35194][SQL] Refactor nested column aliasing for readability

2021-05-27 Thread GitBox


AmplabJenkins commented on pull request #32301:
URL: https://github.com/apache/spark/pull/32301#issuecomment-850053730


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139035/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32301: [SPARK-35194][SQL] Refactor nested column aliasing for readability

2021-05-27 Thread GitBox


SparkQA removed a comment on pull request #32301:
URL: https://github.com/apache/spark/pull/32301#issuecomment-849958280


   **[Test build #139035 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139035/testReport)**
 for PR 32301 at commit 
[`8a29e94`](https://github.com/apache/spark/commit/8a29e943447808391c17f860598e3f11ae41d54d).





[GitHub] [spark] SparkQA commented on pull request #32301: [SPARK-35194][SQL] Refactor nested column aliasing for readability

2021-05-27 Thread GitBox


SparkQA commented on pull request #32301:
URL: https://github.com/apache/spark/pull/32301#issuecomment-850053105


   **[Test build #139035 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139035/testReport)**
 for PR 32301 at commit 
[`8a29e94`](https://github.com/apache/spark/commit/8a29e943447808391c17f860598e3f11ae41d54d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] otterc commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-27 Thread GitBox


otterc commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r641118134



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2000,6 +2023,147 @@ private[spark] class DAGScheduler(
 }
   }
 
+  /**
+   * Schedules shuffle merge finalize.
+   */
+  private[scheduler] def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): 
Unit = {
+logInfo(("%s (%s) scheduled for finalizing" +
+  " shuffle merge in %s s").format(stage, stage.name, 
shuffleMergeFinalizeWaitSec))
+shuffleMergeFinalizeScheduler.schedule(
+  new Runnable {
+override def run(): Unit = finalizeShuffleMerge(stage)
+  },
+  shuffleMergeFinalizeWaitSec,
+  TimeUnit.SECONDS
+)
+  }
+
+  /**
+   * DAGScheduler notifies all the remote shuffle services chosen to serve 
shuffle merge request for
+   * the given shuffle map stage to finalize the shuffle merge process for 
this shuffle. This is
+   * invoked in a separate thread to reduce the impact on the DAGScheduler 
main thread, as the
+   * scheduler might need to talk to 1000s of shuffle services to finalize 
shuffle merge.
+   */
+  private[scheduler] def finalizeShuffleMerge(stage: ShuffleMapStage): Unit = {
+logInfo("%s (%s) finalizing the shuffle merge".format(stage, stage.name))
+externalShuffleClient.foreach { shuffleClient =>
+  val shuffleId = stage.shuffleDep.shuffleId
+  val numMergers = stage.shuffleDep.getMergerLocs.length
+  val numResponses = new AtomicInteger()
+  val results = (0 until numMergers).map(_ => 
SettableFuture.create[Boolean]())
+  val timedOut = new AtomicBoolean()
+
+  def increaseAndCheckResponseCount(): Unit = {
+if (numResponses.incrementAndGet() == numMergers) {
+  logInfo("%s (%s) shuffle merge finalized".format(stage, stage.name))
+  // Since this runs in the netty client thread and is outside of 
DAGScheduler
+  // event loop, we only post ShuffleMergeFinalized event into the 
event queue.
+  // The processing of this event should be done inside the event 
loop, so it
+  // can safely modify scheduler's internal state.
+  eventProcessLoop.post(ShuffleMergeFinalized(stage))
+}
+  }
+
+  stage.shuffleDep.getMergerLocs.zipWithIndex.foreach {
+case (shuffleServiceLoc, index) =>
+  // Sends async request to shuffle service to finalize shuffle merge 
on that host
+  // TODO: SPARK-35536: Cancel finalizeShuffleMerge if the stage is 
cancelled
+  // TODO: during shuffleMergeFinalizeWaitSec
+  shuffleClient.finalizeShuffleMerge(shuffleServiceLoc.host,
+shuffleServiceLoc.port, shuffleId,
+new MergeFinalizerListener {
+  override def onShuffleMergeSuccess(statuses: MergeStatuses): 
Unit = {
+assert(shuffleId == statuses.shuffleId)
+if (!timedOut.get()) {
+  eventProcessLoop.post(RegisterMergeStatuses(stage, 
MergeStatus.
+convertMergeStatusesToMergeStatusArr(statuses, 
shuffleServiceLoc)))
+  increaseAndCheckResponseCount()
+  results(index).set(true)
+}
+  }
+
+  override def onShuffleMergeFailure(e: Throwable): Unit = {
+if (!timedOut.get()) {
+  logWarning(s"Exception encountered when trying to finalize 
shuffle " +
+s"merge on ${shuffleServiceLoc.host} for shuffle 
$shuffleId", e)
+  increaseAndCheckResponseCount()
+  // Do not fail the future as this would cause dag scheduler 
to prematurely
+  // give up on waiting for merge results from the remaining 
shuffle services
+  // if one fails
+  results(index).set(false)
+}
+  }
+})
+  }
+  // DAGScheduler only waits for a limited amount of time for the merge results.
+  // It will attempt to submit the next stage(s) irrespective of whether merge results
+  // from all shuffle services are received or not.
+  // TODO: SPARK-33701: Instead of waiting for a constant amount of time for finalization
+  // TODO: for all the stages, adaptively tune timeout for merge finalization
+  try {
+Futures.allAsList(results: _*).get(shuffleMergeResultsTimeoutSec, TimeUnit.SECONDS)
+  } catch {
+case _: TimeoutException =>
+  logInfo(s"Timed out on waiting for merge results from all " +
+s"$numMergers mergers for shuffle $shuffleId")
+  timedOut.set(true)
+  eventProcessLoop.post(ShuffleMergeFinalized(stage))
+  }
+}
+  }
+
+  private def processShuffleMapStageCompletion(shuffleStage: ShuffleMapStage): Unit = {
+markStageAsFinished(shuffleStage)
+logInfo("looking for newly runnable stages")
+
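
A minimal sketch of the bounded wait used at the end of `finalizeShuffleMerge` above, assuming plain `java.util.concurrent.CompletableFuture` in place of the Guava `SettableFuture` and `Futures.allAsList` that the patch uses. `MergeWaitSketch` and `awaitAll` are hypothetical names for illustration only, not part of Spark. As in the patch, a merger that fails is expected to complete its future with `false` rather than fail it, so one failure does not cut the wait short.

```scala
import java.util.concurrent.{CompletableFuture, TimeUnit, TimeoutException}
import java.util.concurrent.atomic.AtomicBoolean

object MergeWaitSketch {
  /**
   * Waits up to `timeoutMs` for every future in `results` to complete.
   * Returns true if all completed in time. On timeout it sets `timedOut`
   * (so late callbacks can be ignored by the caller) and returns false.
   */
  def awaitAll(
      results: Seq[CompletableFuture[Boolean]],
      timeoutMs: Long,
      timedOut: AtomicBoolean): Boolean = {
    try {
      // allOf completes once every future has completed, whatever its value.
      CompletableFuture.allOf(results: _*).get(timeoutMs, TimeUnit.MILLISECONDS)
      true
    } catch {
      case _: TimeoutException =>
        timedOut.set(true)
        false
    }
  }
}
```

In the patch, the timeout branch also posts `ShuffleMergeFinalized` so that the next stage(s) can be submitted without the missing merge results.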

[GitHub] [spark] AmplabJenkins removed a comment on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


AmplabJenkins removed a comment on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850049662


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139037/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


AmplabJenkins commented on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850049662


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139037/
   





[GitHub] [spark] SparkQA removed a comment on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


SparkQA removed a comment on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850010173


   **[Test build #139037 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139037/testReport)** for PR 32686 at commit [`ab3b61f`](https://github.com/apache/spark/commit/ab3b61fc05e1df8ba75320d30c7c96834c291db2).





[GitHub] [spark] SparkQA commented on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


SparkQA commented on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850049444


   **[Test build #139037 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139037/testReport)** for PR 32686 at commit [`ab3b61f`](https://github.com/apache/spark/commit/ab3b61fc05e1df8ba75320d30c7c96834c291db2).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] venkata91 commented on pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-27 Thread GitBox


venkata91 commented on pull request #30691:
URL: https://github.com/apache/spark/pull/30691#issuecomment-850045368


   Addressed all the comments AFAIK, please review @mridulm @Victsm @Ngone51 @otterc





[GitHub] [spark] venkata91 commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-27 Thread GitBox


venkata91 commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r641095507



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2004,6 +2020,131 @@ private[spark] class DAGScheduler(
 }
   }
 
+  /**
+   * Schedules shuffle merge finalize.
+   */
+  private[scheduler] def scheduleShuffleMergeFinalize(stage: ShuffleMapStage): Unit = {
+logInfo(("%s (%s) scheduled for finalizing" +
+  " shuffle merge in %s s").format(stage, stage.name, shuffleMergeFinalizeWaitSec))
+shuffleMergeFinalizeScheduler.schedule(
+  new Runnable {
+override def run(): Unit = finalizeShuffleMerge(stage)
+  },
+  shuffleMergeFinalizeWaitSec,
+  TimeUnit.SECONDS
+)
+  }
+
+  /**
+   * DAGScheduler notifies all the remote shuffle services chosen to serve shuffle merge request for
+   * the given shuffle map stage to finalize the shuffle merge process for this shuffle. This is
+   * invoked in a separate thread to reduce the impact on the DAGScheduler main thread, as the
+   * scheduler might need to talk to 1000s of shuffle services to finalize shuffle merge.
+   */
+  private[scheduler] def finalizeShuffleMerge(stage: ShuffleMapStage): Unit = {
+logInfo("%s (%s) finalizing the shuffle merge".format(stage, stage.name))

Review comment:
   Added tests covering stage cancellation, barrier stages, late arrival of 
merge results, etc.







[GitHub] [spark] venkata91 commented on a change in pull request #30691: [SPARK-32920][SHUFFLE] Finalization of Shuffle push/merge with Push based shuffle and preparation step for the reduce stage

2021-05-27 Thread GitBox


venkata91 commented on a change in pull request #30691:
URL: https://github.com/apache/spark/pull/30691#discussion_r641094842



##
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##
@@ -2136,9 +2137,24 @@ private[spark] class DAGScheduler(
 }
   }
 
-  private[scheduler] def handleShuffleMergeFinalized(stage: ShuffleMapStage): Unit = {
-stage.shuffleDep.markShuffleMergeFinalized
-processShuffleMapStageCompletion(stage)
+  private[scheduler] def handleRegisterMergeStatuses(
+  stage: ShuffleMapStage,
+  mergeStatuses: Seq[(Int, MergeStatus)]): Unit = {
+// Register merge statuses if the stage is still running and shuffle merge is not finalized yet.
+if (runningStages.contains(stage) && !stage.shuffleDep.shuffleMergeFinalized) {
+  mapOutputTracker.registerMergeResults(stage.shuffleDep.shuffleId, mergeStatuses)
+}
+  }
+
+  private[scheduler] def handleShuffleMergeFinalized(
+  stage: ShuffleMapStage): Unit = {
+// Only update MapOutputTracker metadata if the stage is still active. i.e not cancelled.
+if (runningStages.contains(stage)) {
+  stage.shuffleDep.markShuffleMergeFinalized()
+  processShuffleMapStageCompletion(stage)
+} else {
+  mapOutputTracker.unregisterAllMergeResult(stage.shuffleDep.shuffleId)

Review comment:
   Discussed offline with @mridulm. There are currently a few corner cases 
that need to be carefully thought through before adopting this behavior. 
Created a TODO and a corresponding follow-up JIRA: 
https://issues.apache.org/jira/browse/SPARK-35549
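
For reference, the guards in the hunk above can be sketched with simplified stand-ins. `MergeStatusGuardSketch`, `Tracker` and its fields are hypothetical and only mimic the shape of the checks (stage still running, merge not yet finalized); they are not the DAGScheduler or MapOutputTracker classes, and the deferred cleanup branch is intentionally left out.

```scala
import scala.collection.mutable

object MergeStatusGuardSketch {
  // Hypothetical, simplified scheduler state keyed by stage id.
  final class Tracker {
    val running = mutable.Set.empty[Int]      // ids of stages still running
    val finalized = mutable.Set.empty[Int]    // ids whose shuffle merge is finalized
    val mergeResults = mutable.Map.empty[Int, Vector[String]]

    /** Registers statuses only while the stage runs and its merge is not finalized. */
    def registerMergeStatuses(stageId: Int, statuses: Seq[String]): Unit =
      if (running.contains(stageId) && !finalized.contains(stageId)) {
        mergeResults(stageId) = mergeResults.getOrElse(stageId, Vector.empty) ++ statuses
      }

    /** Marks the merge finalized only for a running stage; returns whether it did. */
    def finalizeMerge(stageId: Int): Boolean =
      if (running.contains(stageId)) { finalized += stageId; true } else false
  }
}
```

With these checks, merge results that arrive after finalization, or for a stage that is no longer running, are dropped instead of being registered.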







[GitHub] [spark] allisonwang-db commented on pull request #32687: [SPARK-35545][SQL] Split SubqueryExpression's children field into outer attributes and join conditions

2021-05-27 Thread GitBox


allisonwang-db commented on pull request #32687:
URL: https://github.com/apache/spark/pull/32687#issuecomment-850041129


   cc @cloud-fan 





[GitHub] [spark] sunchao commented on pull request #31998: [SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

2021-05-27 Thread GitBox


sunchao commented on pull request #31998:
URL: https://github.com/apache/spark/pull/31998#issuecomment-850040241


   @lxian In the current approach we'd have to copy values from one vector to 
another. I think a better and more efficient approach may be to feed the row 
indexes to `VectorizedRleValuesReader#readXXX` and skip rows if they are not in 
the range, so basically we increment both `rowId` and row indexes in parallel. 
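
   A rough sketch of that idea, assuming the row indexes arrive as sorted, non-overlapping inclusive ranges. `RowRangeSketch`, `RowRange` and `readSelected` are hypothetical names, and a plain `IndexedSeq[Int]` stands in for the decoded values; this is not the `VectorizedRleValuesReader` API.

```scala
object RowRangeSketch {
  /** Inclusive range of row indexes to keep (hypothetical stand-in). */
  final case class RowRange(start: Long, end: Long)

  /**
   * Walks the values and the ranges in lockstep. `firstRowIndex` is the row
   * index of values(0). Values inside a range are emitted directly; values
   * outside every range are skipped, so nothing is copied between vectors.
   */
  def readSelected(
      values: IndexedSeq[Int],
      firstRowIndex: Long,
      ranges: Seq[RowRange]): Vector[Int] = {
    val out = Vector.newBuilder[Int]
    val it = ranges.iterator.buffered
    var rowId = firstRowIndex
    var i = 0
    while (i < values.length && it.hasNext) {
      val r = it.head
      if (rowId > r.end) {
        it.next() // current range is exhausted: advance only the range cursor
      } else {
        if (rowId >= r.start) out += values(i) // in range: keep the value
        rowId += 1 // otherwise the row is before the range: skip it
        i += 1
      }
    }
    out.result()
  }
}
```

   For example, with values 10 to 19 starting at row index 0 and ranges [2, 3] and [7, 8], only the values at rows 2, 3, 7 and 8 are kept.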





[GitHub] [spark] zhouyejoe edited a comment on pull request #32007: [SPARK-33350][SHUFFLE] Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data

2021-05-27 Thread GitBox


zhouyejoe edited a comment on pull request #32007:
URL: https://github.com/apache/spark/pull/32007#issuecomment-850036241


   Created ticket for later improvement [SPARK-35546](https://issues.apache.org/jira/browse/SPARK-35546)





[GitHub] [spark] zhouyejoe commented on pull request #32007: [SPARK-33350][SHUFFLE] Add support to DiskBlockManager to create merge directory and to get the local shuffle merged data

2021-05-27 Thread GitBox


zhouyejoe commented on pull request #32007:
URL: https://github.com/apache/spark/pull/32007#issuecomment-850036241


   Created ticket for later improvement https://issues.apache.org/jira/browse/SPARK-35546





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


AmplabJenkins removed a comment on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850035894


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43555/
   





[GitHub] [spark] AmplabJenkins commented on pull request #32686: [WIP][SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

2021-05-27 Thread GitBox


AmplabJenkins commented on pull request #32686:
URL: https://github.com/apache/spark/pull/32686#issuecomment-850035894


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43555/
   




