[spark] branch master updated (4144b6d -> 125cbe3)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 4144b6d  [SPARK-32764][SQL] -0.0 should be equal to 0.0
     add 125cbe3  [SPARK-32736][CORE] Avoid caching the removed decommissioned executors in TaskSchedulerImpl

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/ExecutorAllocationManager.scala   |  2 +-
 .../org/apache/spark/deploy/DeployMessage.scala    |  4 +-
 .../spark/deploy/client/StandaloneAppClient.scala  |  6 +--
 .../client/StandaloneAppClientListener.scala       |  2 +-
 .../org/apache/spark/deploy/master/Master.scala    | 11 ++--
 .../executor/CoarseGrainedExecutorBackend.scala    |  7 +--
 .../org/apache/spark/internal/config/package.scala | 10
 .../org/apache/spark/scheduler/DAGScheduler.scala  | 17 +++---
 .../spark/scheduler/ExecutorDecommissionInfo.scala | 10 ++--
 .../spark/scheduler/ExecutorLossReason.scala       | 10 ++--
 .../apache/spark/scheduler/TaskSchedulerImpl.scala | 61 +++---
 .../apache/spark/scheduler/TaskSetManager.scala    |  2 +-
 .../cluster/CoarseGrainedSchedulerBackend.scala    | 38 ++
 .../cluster/StandaloneSchedulerBackend.scala       |  7 ++-
 .../spark/deploy/client/AppClientSuite.scala       |  4 +-
 .../apache/spark/scheduler/DAGSchedulerSuite.scala | 16 --
 .../spark/scheduler/TaskSchedulerImplSuite.scala   | 36 -
 .../spark/scheduler/TaskSetManagerSuite.scala      |  9 ++--
 .../WorkerDecommissionExtendedSuite.scala          |  2 +-
 .../spark/scheduler/WorkerDecommissionSuite.scala  |  2 +-
 .../BlockManagerDecommissionIntegrationSuite.scala |  2 +-
 .../scheduler/ExecutorAllocationManager.scala      |  2 +-
 .../scheduler/ExecutorAllocationManagerSuite.scala |  2 +-
 23 files changed, 98 insertions(+), 164 deletions(-)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-32736][CORE] Avoid caching the removed decommissioned executors in TaskSchedulerImpl
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 125cbe3  [SPARK-32736][CORE] Avoid caching the removed decommissioned executors in TaskSchedulerImpl
125cbe3 is described below

commit 125cbe3ae0d664ddc80b5b83cc82a43a0cefb5ca
Author: yi.wu
AuthorDate: Tue Sep 8 04:40:13 2020 +

    [SPARK-32736][CORE] Avoid caching the removed decommissioned executors in TaskSchedulerImpl

    ### What changes were proposed in this pull request?

    The motivation of this PR is to avoid caching the removed decommissioned executors in `TaskSchedulerImpl`. The cache was introduced in https://github.com/apache/spark/pull/29422. It holds the `isHostDecommissioned` info for a while, so that if a task's `FetchFailure` event arrives after the executor loss event, `DAGScheduler` can still read `isHostDecommissioned` from the cache and unregister the host's shuffle map status when the host is decommissioned too.

    This PR achieves the same goal without the cache. Instead of saving `workerLost` in `ExecutorUpdated` / `ExecutorDecommissionInfo` / `ExecutorDecommissionState`, we save the `hostOpt` directly. When the host is decommissioned or lost too, `hostOpt` holds the specific host address; otherwise it is `None`, indicating that only the executor is decommissioned or lost.

    Now that we have the host info, we can also unregister the host's shuffle map status when `executorLost` is triggered for the decommissioned executor.

    Besides, this PR also includes a few cleanups around the touched code.

    ### Why are the changes needed?

    It helps to unregister the shuffle map status earlier for both the decommission and the normal executor-lost cases. It also saves memory in `TaskSchedulerImpl` and simplifies the code a little bit.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    This PR only refactors the code. The original behaviour should be covered by `DecommissionWorkerSuite`.

    Closes #29579 from Ngone51/impr-decom.

    Authored-by: yi.wu
    Signed-off-by: Wenchen Fan
---
 .../apache/spark/ExecutorAllocationManager.scala   |  2 +-
 .../org/apache/spark/deploy/DeployMessage.scala    |  4 +-
 .../spark/deploy/client/StandaloneAppClient.scala  |  6 +--
 .../client/StandaloneAppClientListener.scala       |  2 +-
 .../org/apache/spark/deploy/master/Master.scala    | 11 ++--
 .../executor/CoarseGrainedExecutorBackend.scala    |  7 +--
 .../org/apache/spark/internal/config/package.scala | 10
 .../org/apache/spark/scheduler/DAGScheduler.scala  | 17 +++---
 .../spark/scheduler/ExecutorDecommissionInfo.scala | 10 ++--
 .../spark/scheduler/ExecutorLossReason.scala       | 10 ++--
 .../apache/spark/scheduler/TaskSchedulerImpl.scala | 61 +++---
 .../apache/spark/scheduler/TaskSetManager.scala    |  2 +-
 .../cluster/CoarseGrainedSchedulerBackend.scala    | 38 ++
 .../cluster/StandaloneSchedulerBackend.scala       |  7 ++-
 .../spark/deploy/client/AppClientSuite.scala       |  4 +-
 .../apache/spark/scheduler/DAGSchedulerSuite.scala | 16 --
 .../spark/scheduler/TaskSchedulerImplSuite.scala   | 36 -
 .../spark/scheduler/TaskSetManagerSuite.scala      |  9 ++--
 .../WorkerDecommissionExtendedSuite.scala          |  2 +-
 .../spark/scheduler/WorkerDecommissionSuite.scala  |  2 +-
 .../BlockManagerDecommissionIntegrationSuite.scala |  2 +-
 .../scheduler/ExecutorAllocationManager.scala      |  2 +-
 .../scheduler/ExecutorAllocationManagerSuite.scala |  2 +-
 23 files changed, 98 insertions(+), 164 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala b/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
index c298931..b6e14e8 100644
--- a/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
+++ b/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
@@ -580,7 +580,7 @@ private[spark] class ExecutorAllocationManager(
       // when the task backlog decreased.
       if (decommissionEnabled) {
         val executorIdsWithoutHostLoss = executorIdsToBeRemoved.toSeq.map(
-          id => (id, ExecutorDecommissionInfo("spark scale down", false))).toArray
+          id => (id, ExecutorDecommissionInfo("spark scale down"))).toArray
         client.decommissionExecutors(executorIdsWithoutHostLoss, adjustTargetNumExecutors = false)
       } else {
         client.killExecutors(executorIdsToBeRemoved.toSeq, adjustTargetNumExecutors = false,
diff --git a/core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala b/core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala
index b7a64d75..83f373d 100644
---
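The core of the change above is replacing a cached `workerLost: Boolean` flag with an optional host carried on the decommission event itself, so no side cache of removed executors is needed. A minimal Python sketch of that idea (the class and helper names here are illustrative, not Spark's actual Scala API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ExecutorDecommissionInfo:
    # Reason the executor is being decommissioned.
    message: str
    # When the whole host is decommissioned or lost, carry its address;
    # None means only this one executor is affected.
    host: Optional[str] = None

def host_to_unregister(info: ExecutorDecommissionInfo) -> Optional[str]:
    # Because the event carries the host itself, the scheduler can decide
    # whether to unregister the host's shuffle map status without consulting
    # any cache of previously removed executors.
    return info.host

only_exec = ExecutorDecommissionInfo("spark scale down")
whole_host = ExecutorDecommissionInfo("node drained", host="10.0.0.7")
```

With the boolean, a later `FetchFailure` handler had to look the host up somewhere; with the `Optional[str]`, the event is self-describing.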
[spark] branch branch-3.0 updated (6b47abd -> f42b56c)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 6b47abd  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
     add f42b56c  [SPARK-32764][SQL] -0.0 should be equal to 0.0

No new revisions were added by this update.

Summary of changes:
 .../expressions/codegen/CodeGenerator.scala        | 10 ++-
 .../spark/sql/catalyst/util/SQLOrderingUtil.scala  | 30 +
 .../org/apache/spark/sql/types/DoubleType.scala    |  3 +-
 .../org/apache/spark/sql/types/FloatType.scala     |  3 +-
 .../org/apache/spark/sql/types/numerics.scala      |  5 +-
 .../sql/catalyst/expressions/PredicateSuite.scala  | 16 +
 .../sql/catalyst/util/SQLOrderingUtilSuite.scala   | 75 ++
 .../org/apache/spark/sql/DataFrameSuite.scala      |  5 ++
 8 files changed, 126 insertions(+), 21 deletions(-)
 copy core/src/main/scala/org/apache/spark/SparkFiles.scala => sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala (57%)
 create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtilSuite.scala

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
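The SPARK-32764 fix makes SQL comparisons treat -0.0 and 0.0 as equal while keeping a total order in which NaN sorts last. A hedged Python sketch of a comparator with those semantics (illustrative only; the actual change lives in `SQLOrderingUtil.scala` and the codegen paths, and its exact implementation may differ):

```python
import math

def compare_doubles(a: float, b: float) -> int:
    """Total-order comparator: -0.0 == 0.0, NaN greater than everything."""
    if a == b:
        # IEEE-754 equality already treats -0.0 == 0.0, which is exactly the
        # behavior SQL wants; a raw bit-pattern comparison would not.
        return 0
    if math.isnan(a):
        return 0 if math.isnan(b) else 1   # NaN sorts after all other values
    if math.isnan(b):
        return -1
    return -1 if a < b else 1
```

The bug arose because a bitwise-style comparison distinguishes -0.0 from 0.0 (their bit patterns differ); checking value equality first restores the expected SQL semantics.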
[spark] branch master updated (c8c082c -> 4144b6d)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from c8c082c  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
     add 4144b6d  [SPARK-32764][SQL] -0.0 should be equal to 0.0

No new revisions were added by this update.

Summary of changes:
 .../expressions/codegen/CodeGenerator.scala        | 10 ++-
 .../spark/sql/catalyst/util/SQLOrderingUtil.scala  | 30 +
 .../org/apache/spark/sql/types/DoubleType.scala    |  3 +-
 .../org/apache/spark/sql/types/FloatType.scala     |  3 +-
 .../org/apache/spark/sql/types/numerics.scala      |  5 +-
 .../sql/catalyst/expressions/PredicateSuite.scala  | 16 +
 .../sql/catalyst/util/SQLOrderingUtilSuite.scala   | 75 ++
 .../org/apache/spark/sql/DataFrameSuite.scala      |  5 ++
 8 files changed, 126 insertions(+), 21 deletions(-)
 copy core/src/main/scala/org/apache/spark/SparkFiles.scala => sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala (57%)
 create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtilSuite.scala

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 277ccba  [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe
277ccba is described below

commit 277ccba382ac19debeaf15cd648cf6ab69603012
Author: sychen
AuthorDate: Tue Sep 8 03:23:59 2020 +

    [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe

    ### What changes were proposed in this pull request?

    This PR increases the thread safety of the `BytesToBytesMap`:
    - It makes the `iterator()` and `destructiveIterator()` methods use their own `Location` object. This used to be shared, which caused issues when the map was being iterated over in two threads by two different iterators.
    - It removes the `safeIterator()` function, which is no longer needed.
    - It improves the documentation of a couple of methods w.r.t. thread-safety.

    ### Why are the changes needed?

    It is unexpected that an iterator shares the object it returns with all other iterators. This violates the iterator contract, and it causes issues with iterators over a map that are consumed in different threads.

    ### Does this PR introduce any user-facing change?

    No.

    ### How was this patch tested?

    Added a unit test.

    Closes #29605 from cxzl25/SPARK-31511.

    Lead-authored-by: sychen
    Co-authored-by: herman
    Signed-off-by: Wenchen Fan
---
 .../apache/spark/unsafe/map/BytesToBytesMap.java  | 18 +-
 .../sql/execution/joins/HashedRelationSuite.scala | 39 ++
 2 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
index 2b23fbbf..5ab52cc 100644
--- a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
+++ b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
@@ -421,11 +421,11 @@ public final class BytesToBytesMap extends MemoryConsumer {
    *
    * For efficiency, all calls to `next()` will return the same {@link Location} object.
    *
-   * If any other lookups or operations are performed on this map while iterating over it, including
-   * `lookup()`, the behavior of the returned iterator is undefined.
+   * The returned iterator is thread-safe. However if the map is modified while iterating over it,
+   * the behavior of the returned iterator is undefined.
    */
   public MapIterator iterator() {
-    return new MapIterator(numValues, loc, false);
+    return new MapIterator(numValues, new Location(), false);
   }

   /**
@@ -435,19 +435,20 @@ public final class BytesToBytesMap extends MemoryConsumer {
    *
    * For efficiency, all calls to `next()` will return the same {@link Location} object.
    *
-   * If any other lookups or operations are performed on this map while iterating over it, including
-   * `lookup()`, the behavior of the returned iterator is undefined.
+   * The returned iterator is thread-safe. However if the map is modified while iterating over it,
+   * the behavior of the returned iterator is undefined.
    */
   public MapIterator destructiveIterator() {
     updatePeakMemoryUsed();
-    return new MapIterator(numValues, loc, true);
+    return new MapIterator(numValues, new Location(), true);
   }

   /**
    * Looks up a key, and return a {@link Location} handle that can be used to test existence
    * and read/write values.
    *
-   * This function always return the same {@link Location} instance to avoid object allocation.
+   * This function always returns the same {@link Location} instance to avoid object allocation.
+   * This function is not thread-safe.
    */
   public Location lookup(Object keyBase, long keyOffset, int keyLength) {
     safeLookup(keyBase, keyOffset, keyLength, loc,
@@ -459,7 +460,8 @@ public final class BytesToBytesMap extends MemoryConsumer {
    * Looks up a key, and return a {@link Location} handle that can be used to test existence
    * and read/write values.
    *
-   * This function always return the same {@link Location} instance to avoid object allocation.
+   * This function always returns the same {@link Location} instance to avoid object allocation.
+   * This function is not thread-safe.
    */
   public Location lookup(Object keyBase, long keyOffset, int keyLength, int hash) {
     safeLookup(keyBase, keyOffset, keyLength, loc, hash);
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
index d9b34dc..1bdd6fe 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
+++
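The hazard this patch fixes is easy to reproduce in miniature: when every iterator hands out the same mutable cursor, advancing one iterator silently clobbers the entry another caller is still reading. Below is an illustrative Python model of the shared-cursor bug and the per-iterator fix, not the actual Java `BytesToBytesMap` code:

```python
class Location:
    """Mutable cursor; for efficiency, next() reuses it instead of allocating."""
    def __init__(self):
        self.key = None
        self.value = None

class TinyMap:
    def __init__(self, pairs):
        self._pairs = list(pairs)
        self._shared_loc = Location()  # analogue of the old map-owned `loc` field

    def iterator(self, shared=False):
        # shared=True models the pre-fix behavior: every iterator mutates the
        # single map-owned cursor. shared=False models the fix: each iterator
        # allocates its own Location.
        loc = self._shared_loc if shared else Location()
        for k, v in self._pairs:
            loc.key, loc.value = k, v
            yield loc

m = TinyMap([("a", 1), ("b", 2)])

# Pre-fix: a second iterator clobbers the first one's current entry.
it1, it2 = m.iterator(shared=True), m.iterator(shared=True)
cur = next(it1)            # cur points at ("a", 1)
next(it2); next(it2)       # it2 advances the *same* cursor to ("b", 2)
clobbered = cur.key        # now "b" -- not what it1's caller expects

# Post-fix: independent cursors.
it1, it2 = m.iterator(), m.iterator()
cur = next(it1)
next(it2); next(it2)       # it2 advances only its own cursor
intact = cur.key           # still "a"
```

This is why `iterator()` in the patch constructs `new Location()` per `MapIterator`, while `lookup()` keeps the shared instance and is documented as not thread-safe.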
[spark] branch branch-2.4 updated: [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new f2bcc93  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
f2bcc93 is described below

commit f2bcc9349d86be71dba491b8348ac8d83f0764a8
Author: itholic
AuthorDate: Tue Sep 8 12:22:13 2020 +0900

    [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py

    ### What changes were proposed in this pull request?

    In certain environments, running the `run-tests.py` script fails as below:

    ```
    Traceback (most recent call last):
      File "", line 1, in
    ...
        raise RuntimeError('''
    RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
    Traceback (most recent call last):
    ...
        raise EOFError
    EOFError
    ```

    The reason is that `Manager.dict()` launches another process when the main process is initiated. It happens to work in most environments for an unknown reason, but it is better to avoid this pattern, as Python itself advises.

    ### Why are the changes needed?

    To prevent the test failure for Python.

    ### Does this PR introduce _any_ user-facing change?

    No, it fixes a test script.

    ### How was this patch tested?

    Manually ran the script after fixing:

    ```
    Running PySpark tests.
    Output is in /.../python/unit-tests.log
    Will test against the following Python executables: ['/.../python3', 'python3.8']
    Will test the following Python tests: ['pyspark.sql.dataframe']
    /.../python3 python_implementation is CPython
    /.../python3 version is: Python 3.8.5
    python3.8 python_implementation is CPython
    python3.8 version is: Python 3.8.5
    Starting test(/.../python3): pyspark.sql.dataframe
    Starting test(python3.8): pyspark.sql.dataframe
    Finished test(/.../python3): pyspark.sql.dataframe (33s)
    Finished test(python3.8): pyspark.sql.dataframe (34s)
    Tests passed in 34 seconds
    ```

    Closes #29666 from itholic/SPARK-32812.

    Authored-by: itholic
    Signed-off-by: HyukjinKwon
    (cherry picked from commit c8c082ce380b2357623511c6625503fb3f1d65bf)
    Signed-off-by: HyukjinKwon
---
 python/run-tests.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index c34e48a..9a95c96 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -53,7 +53,7 @@ def print_red(text):
     print('\033[31m' + text + '\033[0m')


-SKIPPED_TESTS = Manager().dict()
+SKIPPED_TESTS = None
 LOG_FILE = os.path.join(SPARK_HOME, "python/unit-tests.log")
 FAILURE_REPORTING_LOCK = Lock()
 LOGGER = logging.getLogger()
@@ -141,6 +141,7 @@ def run_individual_python_test(target_dir, test_name, pyspark_python):
         skipped_counts = len(skipped_tests)
         if skipped_counts > 0:
             key = (pyspark_python, test_name)
+            assert SKIPPED_TESTS is not None
             SKIPPED_TESTS[key] = skipped_tests
         per_test_output.close()
     except:
@@ -293,4 +294,5 @@ def main():


 if __name__ == "__main__":
+    SKIPPED_TESTS = Manager().dict()
     main()

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
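The pattern the diff applies generalizes: creating a `multiprocessing` object at module import time starts a child process while the module is still bootstrapping, which fails under the "spawn" start method because the child re-imports the module and hits the same top-level creation again. A simplified sketch of the guard pattern (not the actual run-tests.py; names follow the diff for readability):

```python
import multiprocessing

# Pre-fix anti-pattern (can fail under the "spawn" start method with the
# RuntimeError shown in the commit message, since the Manager's child
# process re-imports this module before bootstrapping finishes):
#
#     SKIPPED_TESTS = multiprocessing.Manager().dict()
#
# Post-fix pattern: defer creation until we know we are the main process.
SKIPPED_TESTS = None

def record_skipped(key, tests):
    # Fails loudly if a caller runs before initialization, mirroring the
    # `assert SKIPPED_TESTS is not None` added in the diff.
    assert SKIPPED_TESTS is not None, "initialize under the __main__ guard first"
    SKIPPED_TESTS[key] = tests

def main():
    global SKIPPED_TESTS
    # Safe here: by the time main() runs, module bootstrapping is complete.
    SKIPPED_TESTS = multiprocessing.Manager().dict()
    record_skipped(("python3", "pyspark.sql.dataframe"), ["test_skip"])
    return dict(SKIPPED_TESTS)

if __name__ == "__main__":
    main()
```

Under the default "fork" start method on Linux the anti-pattern happens to work, which is why the bug only surfaced "in certain environments".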
[spark] branch branch-3.0 updated (a22e1c5 -> 6b47abd)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from a22e1c5  [SPARK-32810][SQL][3.0] CSV/JSON data sources should avoid globbing paths when inferring schema
     add 6b47abd  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py

No new revisions were added by this update.

Summary of changes:
 python/run-tests.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated (6b47abd -> f42b56c)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 6b47abd  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
     add f42b56c  [SPARK-32764][SQL] -0.0 should be equal to 0.0

No new revisions were added by this update.

Summary of changes:
 .../expressions/codegen/CodeGenerator.scala        | 10 ++-
 .../spark/sql/catalyst/util/SQLOrderingUtil.scala  | 30 +
 .../org/apache/spark/sql/types/DoubleType.scala    |  3 +-
 .../org/apache/spark/sql/types/FloatType.scala     |  3 +-
 .../org/apache/spark/sql/types/numerics.scala      |  5 +-
 .../sql/catalyst/expressions/PredicateSuite.scala  | 16 +
 .../sql/catalyst/util/SQLOrderingUtilSuite.scala   | 75 ++
 .../org/apache/spark/sql/DataFrameSuite.scala      |  5 ++
 8 files changed, 126 insertions(+), 21 deletions(-)
 copy core/src/main/scala/org/apache/spark/SparkFiles.scala => sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala (57%)
 create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtilSuite.scala
-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (c8c082c -> 4144b6d)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from c8c082c  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
     add 4144b6d  [SPARK-32764][SQL] -0.0 should be equal to 0.0

No new revisions were added by this update.

Summary of changes:
 .../expressions/codegen/CodeGenerator.scala        | 10 ++-
 .../spark/sql/catalyst/util/SQLOrderingUtil.scala  | 30 +
 .../org/apache/spark/sql/types/DoubleType.scala    |  3 +-
 .../org/apache/spark/sql/types/FloatType.scala     |  3 +-
 .../org/apache/spark/sql/types/numerics.scala      |  5 +-
 .../sql/catalyst/expressions/PredicateSuite.scala  | 16 +
 .../sql/catalyst/util/SQLOrderingUtilSuite.scala   | 75 ++
 .../org/apache/spark/sql/DataFrameSuite.scala      |  5 ++
 8 files changed, 126 insertions(+), 21 deletions(-)
 copy core/src/main/scala/org/apache/spark/SparkFiles.scala => sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtil.scala (57%)
 create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/SQLOrderingUtilSuite.scala
-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
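The SPARK-32764 change above concerns a genuine IEEE 754 subtlety: negative zero compares equal to positive zero, yet the two values are bit-distinguishable, so a bit-based total order (such as Java's `Double.compare`) ranks them differently. A quick sketch of the semantics, in Python for brevity (the `sql_compare` helper is illustrative, not Spark's `SQLOrderingUtil`):

```python
import math
import struct

# Under IEEE 754, -0.0 compares equal to 0.0 with ordinary equality...
assert -0.0 == 0.0
# ...but the sign bit differs, and so does the raw 64-bit pattern.
assert math.copysign(1.0, -0.0) == -1.0
assert struct.pack(">d", -0.0) != struct.pack(">d", 0.0)

# An engine that wants -0.0 == 0.0 in both comparisons and ordering must
# normalize the two zeros explicitly before comparing.
def sql_compare(a: float, b: float) -> int:
    a = 0.0 if a == 0.0 else a   # folds -0.0 into 0.0
    b = 0.0 if b == 0.0 else b
    return (a > b) - (a < b)

assert sql_compare(-0.0, 0.0) == 0
```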
[spark] branch branch-2.4 updated: [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 277ccba  [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe

277ccba is described below

commit 277ccba382ac19debeaf15cd648cf6ab69603012
Author: sychen
AuthorDate: Tue Sep 8 03:23:59 2020 +

    [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe

    ### What changes were proposed in this pull request?
    This PR increases the thread safety of the `BytesToBytesMap`:
    - It makes the `iterator()` and `destructiveIterator()` methods use their own `Location` object. The object used to be shared, which caused issues when the map was iterated over by two different iterators in two threads.
    - It removes the `safeIterator()` function, which is no longer needed.
    - It improves the documentation of a couple of methods w.r.t. thread safety.

    ### Why are the changes needed?
    It is unexpected that an iterator shares the object it returns with all other iterators. This violates the iterator contract, and it causes issues with iterators over a map that are consumed in different threads.

    ### Does this PR introduce any user-facing change?
    No

    ### How was this patch tested?
    Added a unit test.

    Closes #29605 from cxzl25/SPARK-31511.
Lead-authored-by: sychen
Co-authored-by: herman
Signed-off-by: Wenchen Fan
---
 .../apache/spark/unsafe/map/BytesToBytesMap.java   | 18 +-
 .../sql/execution/joins/HashedRelationSuite.scala  | 39 ++
 2 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
index 2b23fbbf..5ab52cc 100644
--- a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
+++ b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
@@ -421,11 +421,11 @@ public final class BytesToBytesMap extends MemoryConsumer {
    *
    * For efficiency, all calls to `next()` will return the same {@link Location} object.
    *
-   * If any other lookups or operations are performed on this map while iterating over it, including
-   * `lookup()`, the behavior of the returned iterator is undefined.
+   * The returned iterator is thread-safe. However if the map is modified while iterating over it,
+   * the behavior of the returned iterator is undefined.
    */
   public MapIterator iterator() {
-    return new MapIterator(numValues, loc, false);
+    return new MapIterator(numValues, new Location(), false);
   }

   /**
@@ -435,19 +435,20 @@ public final class BytesToBytesMap extends MemoryConsumer {
    *
    * For efficiency, all calls to `next()` will return the same {@link Location} object.
    *
-   * If any other lookups or operations are performed on this map while iterating over it, including
-   * `lookup()`, the behavior of the returned iterator is undefined.
+   * The returned iterator is thread-safe. However if the map is modified while iterating over it,
+   * the behavior of the returned iterator is undefined.
    */
   public MapIterator destructiveIterator() {
     updatePeakMemoryUsed();
-    return new MapIterator(numValues, loc, true);
+    return new MapIterator(numValues, new Location(), true);
   }

   /**
    * Looks up a key, and return a {@link Location} handle that can be used to test existence
    * and read/write values.
    *
-   * This function always return the same {@link Location} instance to avoid object allocation.
+   * This function always returns the same {@link Location} instance to avoid object allocation.
+   * This function is not thread-safe.
    */
   public Location lookup(Object keyBase, long keyOffset, int keyLength) {
     safeLookup(keyBase, keyOffset, keyLength, loc,
@@ -459,7 +460,8 @@ public final class BytesToBytesMap extends MemoryConsumer {
    * Looks up a key, and return a {@link Location} handle that can be used to test existence
    * and read/write values.
    *
-   * This function always return the same {@link Location} instance to avoid object allocation.
+   * This function always returns the same {@link Location} instance to avoid object allocation.
+   * This function is not thread-safe.
    */
   public Location lookup(Object keyBase, long keyOffset, int keyLength, int hash) {
     safeLookup(keyBase, keyOffset, keyLength, loc, hash);
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
index d9b34dc..1bdd6fe 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
+++
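The bug this patch fixes is easy to reproduce outside Spark: when every iterator hands back the same shared mutable cursor, advancing a second iterator silently moves the first one's view. A minimal Python sketch of the flawed design (class names echo the Java code but are purely illustrative):

```python
class Location:
    """Mutable cursor pointing at the entry most recently visited."""
    def __init__(self):
        self.key = None

class SharedCursorMap:
    """Flawed design: all iterators reuse one Location, like the old map."""
    def __init__(self, keys):
        self._keys = keys
        self._loc = Location()       # shared by every iterator -- the bug

    def __iter__(self):
        # The patch's one-line fix is to give each iterator its own
        # cursor instead, i.e. loc = Location() here.
        loc = self._loc
        for k in self._keys:
            loc.key = k              # mutates the shared cursor
            yield loc

m = SharedCursorMap(["a", "b", "c"])
it1, it2 = iter(m), iter(m)
first = next(it1)                    # first.key is "a" at this point
next(it2); next(it2)                 # it2 advances the *same* Location
assert first.key == "b"              # it1's handle was clobbered by it2
```

Per-iterator cursors keep the intra-iterator optimization (`next()` reuses one object) while restoring the iterator contract across threads.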
[spark] branch master updated (c336ae3 -> c8c082c)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from c336ae3  [SPARK-32186][DOCS][PYTHON] Development - Debugging
     add c8c082c  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py

No new revisions were added by this update.

Summary of changes:
 python/run-tests.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 277ccba [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe 277ccba is described below commit 277ccba382ac19debeaf15cd648cf6ab69603012 Author: sychen AuthorDate: Tue Sep 8 03:23:59 2020 + [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe ### What changes were proposed in this pull request? This PR increases the thread safety of the `BytesToBytesMap`: - It makes the `iterator()` and `destructiveIterator()` methods used their own `Location` object. This used to be shared, and this was causing issues when the map was being iterated over in two threads by two different iterators. - Removes the `safeIterator()` function. This is not needed anymore. - Improves the documentation of a couple of methods w.r.t. thread-safety. ### Why are the changes needed? It is unexpected an iterator shares the object it is returning with all other iterators. This is a violation of the iterator contract, and it causes issues with iterators over a map that are consumed in different threads. ### Does this PR introduce any user-facing change? No ### How was this patch tested? add ut Closes #29605 from cxzl25/SPARK-31511. 
Lead-authored-by: sychen Co-authored-by: herman Signed-off-by: Wenchen Fan --- .../apache/spark/unsafe/map/BytesToBytesMap.java | 18 +- .../sql/execution/joins/HashedRelationSuite.scala | 39 ++ 2 files changed, 49 insertions(+), 8 deletions(-) diff --git a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java index 2b23fbbf..5ab52cc 100644 --- a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java +++ b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java @@ -421,11 +421,11 @@ public final class BytesToBytesMap extends MemoryConsumer { * * For efficiency, all calls to `next()` will return the same {@link Location} object. * - * If any other lookups or operations are performed on this map while iterating over it, including - * `lookup()`, the behavior of the returned iterator is undefined. + * The returned iterator is thread-safe. However if the map is modified while iterating over it, + * the behavior of the returned iterator is undefined. */ public MapIterator iterator() { -return new MapIterator(numValues, loc, false); +return new MapIterator(numValues, new Location(), false); } /** @@ -435,19 +435,20 @@ public final class BytesToBytesMap extends MemoryConsumer { * * For efficiency, all calls to `next()` will return the same {@link Location} object. * - * If any other lookups or operations are performed on this map while iterating over it, including - * `lookup()`, the behavior of the returned iterator is undefined. + * The returned iterator is thread-safe. However if the map is modified while iterating over it, + * the behavior of the returned iterator is undefined. 
*/ public MapIterator destructiveIterator() { updatePeakMemoryUsed(); -return new MapIterator(numValues, loc, true); +return new MapIterator(numValues, new Location(), true); } /** * Looks up a key, and return a {@link Location} handle that can be used to test existence * and read/write values. * - * This function always return the same {@link Location} instance to avoid object allocation. + * This function always returns the same {@link Location} instance to avoid object allocation. + * This function is not thread-safe. */ public Location lookup(Object keyBase, long keyOffset, int keyLength) { safeLookup(keyBase, keyOffset, keyLength, loc, @@ -459,7 +460,8 @@ public final class BytesToBytesMap extends MemoryConsumer { * Looks up a key, and return a {@link Location} handle that can be used to test existence * and read/write values. * - * This function always return the same {@link Location} instance to avoid object allocation. + * This function always returns the same {@link Location} instance to avoid object allocation. + * This function is not thread-safe. */ public Location lookup(Object keyBase, long keyOffset, int keyLength, int hash) { safeLookup(keyBase, keyOffset, keyLength, loc, hash); diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala index d9b34dc..1bdd6fe 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala +++
[spark] branch branch-2.4 updated: [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new f2bcc93 [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py f2bcc93 is described below commit f2bcc9349d86be71dba491b8348ac8d83f0764a8 Author: itholic AuthorDate: Tue Sep 8 12:22:13 2020 +0900 [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py ### What changes were proposed in this pull request? In certain environments, seems it fails to run `run-tests.py` script as below: ``` Traceback (most recent call last): File "", line 1, in ... raise RuntimeError(''' RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase. This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module: if __name__ == '__main__': freeze_support() ... The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable. Traceback (most recent call last): ... raise EOFError EOFError ``` The reason is that `Manager.dict()` launches another process when the main process is initiated. It works in most environments for an unknown reason but it should be good to avoid such pattern as guided from Python itself. ### Why are the changes needed? To prevent the test failure for Python. ### Does this PR introduce _any_ user-facing change? No, it fixes a test script. ### How was this patch tested? Manually ran the script after fixing. ``` Running PySpark tests. 
Output is in /.../python/unit-tests.log Will test against the following Python executables: ['/.../python3', 'python3.8'] Will test the following Python tests: ['pyspark.sql.dataframe'] /.../python3 python_implementation is CPython /.../python3 version is: Python 3.8.5 python3.8 python_implementation is CPython python3.8 version is: Python 3.8.5 Starting test(/.../python3): pyspark.sql.dataframe Starting test(python3.8): pyspark.sql.dataframe Finished test(/.../python3): pyspark.sql.dataframe (33s) Finished test(python3.8): pyspark.sql.dataframe (34s) Tests passed in 34 seconds ``` Closes #29666 from itholic/SPARK-32812. Authored-by: itholic Signed-off-by: HyukjinKwon (cherry picked from commit c8c082ce380b2357623511c6625503fb3f1d65bf) Signed-off-by: HyukjinKwon --- python/run-tests.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/python/run-tests.py b/python/run-tests.py index c34e48a..9a95c96 100755 --- a/python/run-tests.py +++ b/python/run-tests.py @@ -53,7 +53,7 @@ def print_red(text): print('\033[31m' + text + '\033[0m') -SKIPPED_TESTS = Manager().dict() +SKIPPED_TESTS = None LOG_FILE = os.path.join(SPARK_HOME, "python/unit-tests.log") FAILURE_REPORTING_LOCK = Lock() LOGGER = logging.getLogger() @@ -141,6 +141,7 @@ def run_individual_python_test(target_dir, test_name, pyspark_python): skipped_counts = len(skipped_tests) if skipped_counts > 0: key = (pyspark_python, test_name) +assert SKIPPED_TESTS is not None SKIPPED_TESTS[key] = skipped_tests per_test_output.close() except: @@ -293,4 +294,5 @@ def main(): if __name__ == "__main__": +SKIPPED_TESTS = Manager().dict() main() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated (a22e1c5 -> 6b47abd)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git. from a22e1c5 [SPARK-32810][SQL][3.0] CSV/JSON data sources should avoid globbing paths when inferring schema add 6b47abd [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py No new revisions were added by this update. Summary of changes: python/run-tests.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (c336ae3 -> c8c082c)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from c336ae3 [SPARK-32186][DOCS][PYTHON] Development - Debugging add c8c082c [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py No new revisions were added by this update. Summary of changes: python/run-tests.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 277ccba [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe 277ccba is described below commit 277ccba382ac19debeaf15cd648cf6ab69603012 Author: sychen AuthorDate: Tue Sep 8 03:23:59 2020 + [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe ### What changes were proposed in this pull request? This PR increases the thread safety of the `BytesToBytesMap`: - It makes the `iterator()` and `destructiveIterator()` methods used their own `Location` object. This used to be shared, and this was causing issues when the map was being iterated over in two threads by two different iterators. - Removes the `safeIterator()` function. This is not needed anymore. - Improves the documentation of a couple of methods w.r.t. thread-safety. ### Why are the changes needed? It is unexpected an iterator shares the object it is returning with all other iterators. This is a violation of the iterator contract, and it causes issues with iterators over a map that are consumed in different threads. ### Does this PR introduce any user-facing change? No ### How was this patch tested? add ut Closes #29605 from cxzl25/SPARK-31511. 
Lead-authored-by: sychen Co-authored-by: herman Signed-off-by: Wenchen Fan --- .../apache/spark/unsafe/map/BytesToBytesMap.java | 18 +- .../sql/execution/joins/HashedRelationSuite.scala | 39 ++ 2 files changed, 49 insertions(+), 8 deletions(-) diff --git a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java index 2b23fbbf..5ab52cc 100644 --- a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java +++ b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java @@ -421,11 +421,11 @@ public final class BytesToBytesMap extends MemoryConsumer { * * For efficiency, all calls to `next()` will return the same {@link Location} object. * - * If any other lookups or operations are performed on this map while iterating over it, including - * `lookup()`, the behavior of the returned iterator is undefined. + * The returned iterator is thread-safe. However if the map is modified while iterating over it, + * the behavior of the returned iterator is undefined. */ public MapIterator iterator() { -return new MapIterator(numValues, loc, false); +return new MapIterator(numValues, new Location(), false); } /** @@ -435,19 +435,20 @@ public final class BytesToBytesMap extends MemoryConsumer { * * For efficiency, all calls to `next()` will return the same {@link Location} object. * - * If any other lookups or operations are performed on this map while iterating over it, including - * `lookup()`, the behavior of the returned iterator is undefined. + * The returned iterator is thread-safe. However if the map is modified while iterating over it, + * the behavior of the returned iterator is undefined. 
*/ public MapIterator destructiveIterator() { updatePeakMemoryUsed(); -return new MapIterator(numValues, loc, true); +return new MapIterator(numValues, new Location(), true); } /** * Looks up a key, and return a {@link Location} handle that can be used to test existence * and read/write values. * - * This function always return the same {@link Location} instance to avoid object allocation. + * This function always returns the same {@link Location} instance to avoid object allocation. + * This function is not thread-safe. */ public Location lookup(Object keyBase, long keyOffset, int keyLength) { safeLookup(keyBase, keyOffset, keyLength, loc, @@ -459,7 +460,8 @@ public final class BytesToBytesMap extends MemoryConsumer { * Looks up a key, and return a {@link Location} handle that can be used to test existence * and read/write values. * - * This function always return the same {@link Location} instance to avoid object allocation. + * This function always returns the same {@link Location} instance to avoid object allocation. + * This function is not thread-safe. */ public Location lookup(Object keyBase, long keyOffset, int keyLength, int hash) { safeLookup(keyBase, keyOffset, keyLength, loc, hash); diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala index d9b34dc..1bdd6fe 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala +++
[spark] branch branch-2.4 updated: [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new f2bcc93 [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py f2bcc93 is described below commit f2bcc9349d86be71dba491b8348ac8d83f0764a8 Author: itholic AuthorDate: Tue Sep 8 12:22:13 2020 +0900 [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py ### What changes were proposed in this pull request? In certain environments, seems it fails to run `run-tests.py` script as below: ``` Traceback (most recent call last): File "", line 1, in ... raise RuntimeError(''' RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase. This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module: if __name__ == '__main__': freeze_support() ... The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable. Traceback (most recent call last): ... raise EOFError EOFError ``` The reason is that `Manager.dict()` launches another process when the main process is initiated. It works in most environments for an unknown reason but it should be good to avoid such pattern as guided from Python itself. ### Why are the changes needed? To prevent the test failure for Python. ### Does this PR introduce _any_ user-facing change? No, it fixes a test script. ### How was this patch tested? Manually ran the script after fixing. ``` Running PySpark tests. 
Output is in /.../python/unit-tests.log Will test against the following Python executables: ['/.../python3', 'python3.8'] Will test the following Python tests: ['pyspark.sql.dataframe'] /.../python3 python_implementation is CPython /.../python3 version is: Python 3.8.5 python3.8 python_implementation is CPython python3.8 version is: Python 3.8.5 Starting test(/.../python3): pyspark.sql.dataframe Starting test(python3.8): pyspark.sql.dataframe Finished test(/.../python3): pyspark.sql.dataframe (33s) Finished test(python3.8): pyspark.sql.dataframe (34s) Tests passed in 34 seconds ``` Closes #29666 from itholic/SPARK-32812. Authored-by: itholic Signed-off-by: HyukjinKwon (cherry picked from commit c8c082ce380b2357623511c6625503fb3f1d65bf) Signed-off-by: HyukjinKwon --- python/run-tests.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/python/run-tests.py b/python/run-tests.py index c34e48a..9a95c96 100755 --- a/python/run-tests.py +++ b/python/run-tests.py @@ -53,7 +53,7 @@ def print_red(text): print('\033[31m' + text + '\033[0m') -SKIPPED_TESTS = Manager().dict() +SKIPPED_TESTS = None LOG_FILE = os.path.join(SPARK_HOME, "python/unit-tests.log") FAILURE_REPORTING_LOCK = Lock() LOGGER = logging.getLogger() @@ -141,6 +141,7 @@ def run_individual_python_test(target_dir, test_name, pyspark_python): skipped_counts = len(skipped_tests) if skipped_counts > 0: key = (pyspark_python, test_name) +assert SKIPPED_TESTS is not None SKIPPED_TESTS[key] = skipped_tests per_test_output.close() except: @@ -293,4 +294,5 @@ def main(): if __name__ == "__main__": +SKIPPED_TESTS = Manager().dict() main() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated (a22e1c5 -> 6b47abd)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git. from a22e1c5 [SPARK-32810][SQL][3.0] CSV/JSON data sources should avoid globbing paths when inferring schema add 6b47abd [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py No new revisions were added by this update. Summary of changes: python/run-tests.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (c336ae3 -> c8c082c)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from c336ae3 [SPARK-32186][DOCS][PYTHON] Development - Debugging add c8c082c [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py No new revisions were added by this update. Summary of changes: python/run-tests.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 277ccba  [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe
277ccba is described below

commit 277ccba382ac19debeaf15cd648cf6ab69603012
Author: sychen
AuthorDate: Tue Sep 8 03:23:59 2020 +

    [SPARK-31511][SQL][2.4] Make BytesToBytesMap iterators thread-safe

    ### What changes were proposed in this pull request?
    This PR increases the thread safety of the `BytesToBytesMap`:
    - It makes the `iterator()` and `destructiveIterator()` methods use their own `Location` object. This used to be shared, and this was causing issues when the map was being iterated over in two threads by two different iterators.
    - Removes the `safeIterator()` function. This is not needed anymore.
    - Improves the documentation of a couple of methods w.r.t. thread-safety.

    ### Why are the changes needed?
    It is unexpected that an iterator shares the object it is returning with all other iterators. This is a violation of the iterator contract, and it causes issues with iterators over a map that are consumed in different threads.

    ### Does this PR introduce any user-facing change?
    No

    ### How was this patch tested?
    Added a unit test.

    Closes #29605 from cxzl25/SPARK-31511.

    Lead-authored-by: sychen
    Co-authored-by: herman
    Signed-off-by: Wenchen Fan
---
 .../apache/spark/unsafe/map/BytesToBytesMap.java   | 18 +-
 .../sql/execution/joins/HashedRelationSuite.scala  | 39 ++
 2 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
index 2b23fbbf..5ab52cc 100644
--- a/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
+++ b/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java
@@ -421,11 +421,11 @@ public final class BytesToBytesMap extends MemoryConsumer {
    *
    * For efficiency, all calls to `next()` will return the same {@link Location} object.
    *
-   * If any other lookups or operations are performed on this map while iterating over it, including
-   * `lookup()`, the behavior of the returned iterator is undefined.
+   * The returned iterator is thread-safe. However if the map is modified while iterating over it,
+   * the behavior of the returned iterator is undefined.
    */
   public MapIterator iterator() {
-    return new MapIterator(numValues, loc, false);
+    return new MapIterator(numValues, new Location(), false);
   }
 
   /**
@@ -435,19 +435,20 @@ public final class BytesToBytesMap extends MemoryConsumer {
    *
    * For efficiency, all calls to `next()` will return the same {@link Location} object.
    *
-   * If any other lookups or operations are performed on this map while iterating over it, including
-   * `lookup()`, the behavior of the returned iterator is undefined.
+   * The returned iterator is thread-safe. However if the map is modified while iterating over it,
+   * the behavior of the returned iterator is undefined.
    */
   public MapIterator destructiveIterator() {
     updatePeakMemoryUsed();
-    return new MapIterator(numValues, loc, true);
+    return new MapIterator(numValues, new Location(), true);
   }
 
   /**
    * Looks up a key, and return a {@link Location} handle that can be used to test existence
    * and read/write values.
    *
-   * This function always return the same {@link Location} instance to avoid object allocation.
+   * This function always returns the same {@link Location} instance to avoid object allocation.
+   * This function is not thread-safe.
    */
   public Location lookup(Object keyBase, long keyOffset, int keyLength) {
     safeLookup(keyBase, keyOffset, keyLength, loc,
@@ -459,7 +460,8 @@ public final class BytesToBytesMap extends MemoryConsumer {
    * Looks up a key, and return a {@link Location} handle that can be used to test existence
    * and read/write values.
    *
-   * This function always return the same {@link Location} instance to avoid object allocation.
+   * This function always returns the same {@link Location} instance to avoid object allocation.
+   * This function is not thread-safe.
    */
   public Location lookup(Object keyBase, long keyOffset, int keyLength, int hash) {
     safeLookup(keyBase, keyOffset, keyLength, loc, hash);

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
index d9b34dc..1bdd6fe 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
+++
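The bug this patch fixes is a general iterator anti-pattern: every iterator reused one map-wide mutable cursor, so advancing a second iterator silently changed the `Location` the first one had already handed out. The sketch below reproduces the contract violation in miniature using hypothetical Python names (it is not Spark's actual Java code), and shows the same fix the patch applies: give each iterator its own cursor.

```python
class Location:
    """Mutable cursor, reused across next() calls to avoid allocation."""
    def __init__(self):
        self.key = None
        self.value = None


class SharedCursorMap:
    def __init__(self, data):
        self._data = data
        self._shared_loc = Location()  # one cursor shared by ALL iterators

    def iterator(self, loc=None):
        # Anti-pattern when loc is None: mirrors the old iterator(), which
        # passed the map-wide `loc` field into every MapIterator.
        cursor = loc if loc is not None else self._shared_loc
        for k, v in self._data.items():
            cursor.key, cursor.value = k, v
            yield cursor


m = SharedCursorMap({"a": 1, "b": 2})

# Buggy behavior: advancing a second iterator mutates the Location the
# first iterator already returned.
it1, it2 = m.iterator(), m.iterator()
loc1 = next(it1)          # cursor at ("a", 1)
next(it2); next(it2)      # same object, now at ("b", 2)
print(loc1.key)           # "b" -- it1's result changed under it

# The fix mirrors the patch: each iterator gets its own Location.
it3, it4 = m.iterator(Location()), m.iterator(Location())
loc3 = next(it3)
next(it4); next(it4)
print(loc3.key)           # still "a"
```

Reusing one `Location` per iterator keeps the allocation-avoidance benefit (documented in the Javadoc above) while restoring the expectation that independent iterators do not interfere.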
[spark] branch branch-2.4 updated: [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new f2bcc93  [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py
f2bcc93 is described below

commit f2bcc9349d86be71dba491b8348ac8d83f0764a8
Author: itholic
AuthorDate: Tue Sep 8 12:22:13 2020 +0900

    [SPARK-32812][PYTHON][TESTS] Avoid initiating a process during the main process for run-tests.py

    ### What changes were proposed in this pull request?
    In certain environments, running the `run-tests.py` script fails as below:

    ```
    Traceback (most recent call last):
      File "", line 1, in
    ...
        raise RuntimeError('''
    RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
    Traceback (most recent call last):
    ...
        raise EOFError
    EOFError
    ```

    The reason is that `Manager().dict()` launches another process when the main module is initiated. It happens to work in most environments for an unknown reason, but it is better to avoid this pattern, as Python itself advises.

    ### Why are the changes needed?
    To prevent the test failure for Python.

    ### Does this PR introduce _any_ user-facing change?
    No, it fixes a test script.

    ### How was this patch tested?
    Manually ran the script after fixing:

    ```
    Running PySpark tests. Output is in /.../python/unit-tests.log
    Will test against the following Python executables: ['/.../python3', 'python3.8']
    Will test the following Python tests: ['pyspark.sql.dataframe']
    /.../python3 python_implementation is CPython
    /.../python3 version is: Python 3.8.5
    python3.8 python_implementation is CPython
    python3.8 version is: Python 3.8.5
    Starting test(/.../python3): pyspark.sql.dataframe
    Starting test(python3.8): pyspark.sql.dataframe
    Finished test(/.../python3): pyspark.sql.dataframe (33s)
    Finished test(python3.8): pyspark.sql.dataframe (34s)
    Tests passed in 34 seconds
    ```

    Closes #29666 from itholic/SPARK-32812.

    Authored-by: itholic
    Signed-off-by: HyukjinKwon
    (cherry picked from commit c8c082ce380b2357623511c6625503fb3f1d65bf)
    Signed-off-by: HyukjinKwon
---
 python/run-tests.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index c34e48a..9a95c96 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -53,7 +53,7 @@ def print_red(text):
     print('\033[31m' + text + '\033[0m')
 
 
-SKIPPED_TESTS = Manager().dict()
+SKIPPED_TESTS = None
 LOG_FILE = os.path.join(SPARK_HOME, "python/unit-tests.log")
 FAILURE_REPORTING_LOCK = Lock()
 LOGGER = logging.getLogger()
@@ -141,6 +141,7 @@ def run_individual_python_test(target_dir, test_name, pyspark_python):
             skipped_counts = len(skipped_tests)
             if skipped_counts > 0:
                 key = (pyspark_python, test_name)
+                assert SKIPPED_TESTS is not None
                 SKIPPED_TESTS[key] = skipped_tests
         per_test_output.close()
     except:
@@ -293,4 +294,5 @@ def main():
 
 
 if __name__ == "__main__":
+    SKIPPED_TESTS = Manager().dict()
     main()
[spark] branch master updated (117a6f1 -> c336ae3)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from  117a6f1  [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan
     add  c336ae3  [SPARK-32186][DOCS][PYTHON] Development - Debugging

No new revisions were added by this update.

Summary of changes:
 docs/img/pyspark-remote-debug1.png           | Bin 0 -> 91214 bytes
 docs/img/pyspark-remote-debug2.png           | Bin 0 -> 10058 bytes
 python/docs/source/development/debugging.rst | 280 +++
 python/docs/source/development/index.rst     |   1 +
 4 files changed, 281 insertions(+)
 create mode 100644 docs/img/pyspark-remote-debug1.png
 create mode 100644 docs/img/pyspark-remote-debug2.png
 create mode 100644 python/docs/source/development/debugging.rst
[spark] branch master updated (954cd9f -> 117a6f1)
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from  954cd9f  [SPARK-32810][SQL] CSV/JSON data sources should avoid globbing paths when inferring schema
     add  117a6f1  [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala     | 130 ++---
 .../spark/sql/catalyst/analysis/TypeCoercion.scala |  58 -
 .../spark/sql/catalyst/plans/QueryPlan.scala       |  85 ++
 .../catalyst/plans/logical/AnalysisHelper.scala    |  15 ++-
 .../sql/catalyst/analysis/TypeCoercionSuite.scala  |   9 +-
 5 files changed, 136 insertions(+), 161 deletions(-)
[spark] branch branch-3.0 updated (c2c7c9e -> a22e1c5)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from  c2c7c9e  [SPARK-32779][SQL] Avoid using synchronized API of SessionCatalog in withClient flow, this leads to DeadLock
     add  a22e1c5  [SPARK-32810][SQL][3.0] CSV/JSON data sources should avoid globbing paths when inferring schema

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/DataSource.scala     | 32 ++
 .../execution/datasources/csv/CSVDataSource.scala  |  2 +-
 .../datasources/json/JsonDataSource.scala          |  2 +-
 .../sql/execution/datasources/v2/FileTable.scala   | 10 ++-
 .../sql/test/DataFrameReaderWriterSuite.scala      | 23
 5 files changed, 60 insertions(+), 9 deletions(-)
[spark] branch branch-2.4 updated: [SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths when inferring schema
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new bc471f3  [SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths when inferring schema
bc471f3 is described below

commit bc471f3ec76618409fc9cc5791ddd2e3beca9c5c
Author: Max Gekk
AuthorDate: Tue Sep 8 09:45:17 2020 +0900

    [SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths when inferring schema

    ### What changes were proposed in this pull request?
    In the PR, I propose to fix an issue with the CSV and JSON data sources in Spark SQL when both of the following are true:
    * no user-specified schema
    * some file paths contain escaped glob metacharacters, such as `[``]`, `{``}`, `*` etc.

    ### Why are the changes needed?
    To fix the issue when the following two queries try to read from paths `[abc].csv` and `[abc].json`:
    ```scala
    spark.read.csv("""/tmp/\[abc\].csv""").show
    spark.read.json("""/tmp/\[abc\].json""").show
    ```
    but end up hitting an exception:
    ```
    org.apache.spark.sql.AnalysisException: Path does not exist: file:/tmp/[abc].csv;
      at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:722)
      at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
      at scala.collection.immutable.List.foreach(List.scala:392)
    ```

    ### Does this PR introduce _any_ user-facing change?
    Yes

    ### How was this patch tested?
    Added new test cases in `DataFrameReaderWriterSuite`.

    Closes #29663 from MaxGekk/globbing-paths-when-inferring-schema-2.4.

    Authored-by: Max Gekk
    Signed-off-by: HyukjinKwon
---
 .../sql/execution/datasources/DataSource.scala     | 27 ++
 .../execution/datasources/csv/CSVDataSource.scala  |  2 +-
 .../datasources/json/JsonDataSource.scala          |  2 +-
 .../sql/test/DataFrameReaderWriterSuite.scala      | 23 ++
 4 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
index ff5fe09..31c91f7 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
@@ -65,8 +65,9 @@ import org.apache.spark.util.Utils
  * metadata. For example, when reading a partitioned table from a file system, partition columns
  * will be inferred from the directory layout even if they are not specified.
  *
- * @param paths A list of file system paths that hold data. These will be globbed before and
- *              qualified. This option only works when reading from a [[FileFormat]].
+ * @param paths A list of file system paths that hold data. These will be globbed before if
+ *              the "__globPaths__" option is true, and will be qualified. This option only works
+ *              when reading from a [[FileFormat]].
  * @param userSpecifiedSchema An optional specification of the schema of the data. When present
  *                            we skip attempting to infer the schema.
  * @param partitionColumns A list of column names that the relation is partitioned by. This list is
@@ -97,6 +98,15 @@ case class DataSource(
   private val caseInsensitiveOptions = CaseInsensitiveMap(options)
   private val equality = sparkSession.sessionState.conf.resolver
 
+  /**
+   * Whether or not paths should be globbed before being used to access files.
+   */
+  def globPaths: Boolean = {
+    options.get(DataSource.GLOB_PATHS_KEY)
+      .map(_ == "true")
+      .getOrElse(true)
+  }
+
   bucketSpec.map { bucket =>
     SchemaUtils.checkColumnNameDuplication(
       bucket.bucketColumnNames, "in the bucket definition", equality)
@@ -223,7 +233,7 @@ case class DataSource(
         // For glob pattern, we do not check it because the glob pattern might only make sense
         // once the streaming job starts and some upstream source starts dropping data.
         val hdfsPath = new Path(path)
-        if (!SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
+        if (!globPaths || !SparkHadoopUtil.get.isGlobPath(hdfsPath)) {
           val fs = hdfsPath.getFileSystem(newHadoopConfiguration())
           if (!fs.exists(hdfsPath)) {
             throw new AnalysisException(s"Path does not exist: $path")
@@ -550,7 +560,11 @@ case class DataSource(
       val hdfsPath = new Path(path)
       val fs = hdfsPath.getFileSystem(hadoopConf)
       val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
-      val globPath = SparkHadoopUtil.get.globPathIfNecessary(fs,
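The root cause is that `[abc]` is a glob character class matching a *single* character from the set, so a file literally named `[abc].csv` never matches the pattern built from its own name. The stdlib sketch below illustrates the mismatch and the escape-based remedy with Python's `fnmatch`/`glob` (it is plain Python for illustration, not Spark's Scala/Hadoop globbing code, but the semantics of `[...]` are the same).

```python
import fnmatch
import glob

# "[abc]" is a character class: the pattern "[abc].csv" matches any of
# "a.csv", "b.csv", "c.csv" -- but NOT a file literally named "[abc].csv".
print(fnmatch.fnmatch("a.csv", "[abc].csv"))       # True
print(fnmatch.fnmatch("[abc].csv", "[abc].csv"))   # False

# Escaping the metacharacters produces a pattern that matches the literal
# name -- analogous to skipping globbing entirely, as the patch's
# __globPaths__=false path does for schema inference.
escaped = glob.escape("[abc].csv")
print(escaped)                                     # [[]abc].csv
print(fnmatch.fnmatch("[abc].csv", escaped))       # True
```

This is why the patch threads a `globPaths` flag through `DataSource`: when the caller has already resolved concrete file paths, re-globbing them can only lose files whose names contain metacharacters.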
[spark] branch master updated (8bd3770 -> 954cd9f)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from  8bd3770  [SPARK-32798][PYTHON] Make unionByName optionally fill missing columns with nulls in PySpark
     add  954cd9f  [SPARK-32810][SQL] CSV/JSON data sources should avoid globbing paths when inferring schema

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/DataSource.scala     | 32 ++
 .../execution/datasources/csv/CSVDataSource.scala  |  2 +-
 .../datasources/json/JsonDataSource.scala          |  2 +-
 .../sql/execution/datasources/v2/FileTable.scala   | 10 ++-
 .../execution/datasources/DataSourceSuite.scala    | 18
 .../sql/test/DataFrameReaderWriterSuite.scala      | 23
 6 files changed, 72 insertions(+), 15 deletions(-)
[spark] branch master updated (c43460c -> 8bd3770)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from c43460c [SPARK-32753][SQL] Only copy tags to node with no tags add 8bd3770 [SPARK-32798][PYTHON] Make unionByName optionally fill missing columns with nulls in PySpark No new revisions were added by this update. Summary of changes: python/pyspark/sql/dataframe.py | 26 +++--- 1 file changed, 23 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (954cd9f -> 117a6f1)
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 954cd9f [SPARK-32810][SQL] CSV/JSON data sources should avoid globbing paths when inferring schema add 117a6f1 [SPARK-32638][SQL][FOLLOWUP] Move the plan rewriting methods to QueryPlan No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/analysis/Analyzer.scala | 130 ++--- .../spark/sql/catalyst/analysis/TypeCoercion.scala | 58 - .../spark/sql/catalyst/plans/QueryPlan.scala | 85 ++ .../catalyst/plans/logical/AnalysisHelper.scala| 15 ++- .../sql/catalyst/analysis/TypeCoercionSuite.scala | 9 +- 5 files changed, 136 insertions(+), 161 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated (c2c7c9e -> a22e1c5)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git. from c2c7c9e [SPARK-32779][SQL] Avoid using synchronized API of SessionCatalog in withClient flow, this leads to DeadLock add a22e1c5 [SPARK-32810][SQL][3.0] CSV/JSON data sources should avoid globbing paths when inferring schema No new revisions were added by this update. Summary of changes: .../sql/execution/datasources/DataSource.scala | 32 ++ .../execution/datasources/csv/CSVDataSource.scala | 2 +- .../datasources/json/JsonDataSource.scala | 2 +- .../sql/execution/datasources/v2/FileTable.scala | 10 ++- .../sql/test/DataFrameReaderWriterSuite.scala | 23 5 files changed, 60 insertions(+), 9 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths when inferring schema
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new bc471f3 [SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths when inferring schema bc471f3 is described below commit bc471f3ec76618409fc9cc5791ddd2e3beca9c5c Author: Max Gekk AuthorDate: Tue Sep 8 09:45:17 2020 +0900 [SPARK-32810][SQL][2.4] CSV/JSON data sources should avoid globbing paths when inferring schema ### What changes were proposed in this pull request? In the PR, I propose to fix an issue with the CSV and JSON data sources in Spark SQL when both of the following are true: * no user-specified schema * some file paths contain escaped glob metacharacters, such as `[``]`, `{``}`, `*` etc. ### Why are the changes needed? To fix the issue when the following two queries try to read from paths `[abc].csv` and `[abc].json`: ```scala spark.read.csv("""/tmp/\[abc\].csv""").show spark.read.json("""/tmp/\[abc\].json""").show ``` but would end up hitting an exception: ``` org.apache.spark.sql.AnalysisException: Path does not exist: file:/tmp/[abc].csv; at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:722) at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:392) ``` ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added new test cases in `DataFrameReaderWriterSuite`. Closes #29663 from MaxGekk/globbing-paths-when-inferring-schema-2.4.
[spark] branch master updated (04f7f6da -> c43460c)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 04f7f6da [SPARK-32748][SQL] Support local property propagation in SubqueryBroadcastExec add c43460c [SPARK-32753][SQL] Only copy tags to node with no tags No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/catalyst/trees/TreeNode.scala | 7 ++- .../sql/execution/adaptive/AdaptiveQueryExecSuite.scala| 14 ++ 2 files changed, 20 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-32753][SQL] Only copy tags to node with no tags
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c43460c [SPARK-32753][SQL] Only copy tags to node with no tags c43460c is described below commit c43460cf82a075fd071717489798cde6a61b8515 Author: manuzhang AuthorDate: Mon Sep 7 16:08:57 2020 + [SPARK-32753][SQL] Only copy tags to node with no tags ### What changes were proposed in this pull request? Only copy tags to node with no tags when transforming plans. ### Why are the changes needed? cloud-fan [made a good point](https://github.com/apache/spark/pull/29593#discussion_r482013121) that it doesn't make sense to append tags to existing nodes when nodes are removed. That will cause such bugs as duplicate rows when deduplicating and repartitioning by the same column with AQE. ``` spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") val df = spark.sql("select id from v1 group by id distribute by id") println(df.collect().toArray.mkString(",")) println(df.queryExecution.executedPlan) // With AQE [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] AdaptiveSparkPlan(isFinalPlan=true) +- CustomShuffleReader local +- ShuffleQueryStage 0 +- Exchange hashpartitioning(id#183L, 10), true +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) +- Union :- *(1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2) // Without AQE [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Exchange hashpartitioning(id#206L, 10), true +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Union :- *(1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2) ``` It's too expensive to detect node removal so we make a compromise only to copy tags to node with no tags. 
### Does this PR introduce _any_ user-facing change? Yes. Fix a bug. ### How was this patch tested? Add test. Closes #29593 from manuzhang/spark-32753. Authored-by: manuzhang Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/catalyst/trees/TreeNode.scala | 7 ++- .../sql/execution/adaptive/AdaptiveQueryExecSuite.scala| 14 ++ 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala index 616572d..8003012 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala @@ -92,7 +92,12 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { private val tags: mutable.Map[TreeNodeTag[_], Any] = mutable.Map.empty protected def copyTagsFrom(other: BaseType): Unit = { -tags ++= other.tags +// SPARK-32753: it only makes sense to copy tags to a new node +// but it's too expensive to detect other cases likes node removal +// so we make a compromise here to copy tags to node with no tags +if (tags.isEmpty) { + tags ++= other.tags +} } def setTagValue[T](tag: TreeNodeTag[T], value: T): Unit = { diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala index f892e66..628bafa 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala @@ -1224,4 +1224,18 @@ class AdaptiveQueryExecSuite }) } } + + test("SPARK-32753: Only copy tags to node with no tags") { +withSQLConf( + SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true" +) { + spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") + + val 
(_, adaptivePlan) = runAdaptiveAndVerifyResult( +"SELECT id FROM v1 GROUP BY id DISTRIBUTE BY id") + assert(collect(adaptivePlan) { +case s: ShuffleExchangeExec => s + }.length == 1) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
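The tag-copy rule described in the SPARK-32753 commit message can be mimicked in a few lines (an illustrative Python sketch, not Spark's Scala `TreeNode`): copying becomes a no-op when the destination node already carries tags of its own, which is the compromise chosen because detecting node removal outright would be too expensive.

```python
# Minimal stand-in for a plan node with a tag map (hypothetical, for
# illustration only; Spark's real implementation lives in TreeNode.scala).
class Node:
    def __init__(self):
        self.tags = {}

    def copy_tags_from(self, other):
        # SPARK-32753 rule: only copy tags to a node that has none yet,
        # so tags are never appended onto an already-tagged survivor
        # when another node is removed during plan transformation.
        if not self.tags:
            self.tags.update(other.tags)

src = Node()
src.tags["logical_plan"] = "p1"

fresh = Node()
fresh.copy_tags_from(src)       # destination untagged: tags are copied
assert fresh.tags["logical_plan"] == "p1"

tagged = Node()
tagged.tags["logical_plan"] = "p2"
tagged.copy_tags_from(src)      # destination already tagged: unchanged
assert tagged.tags["logical_plan"] == "p2"
```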
svn commit: r41342 - /dev/spark/v3.0.1-rc3-docs/
Author: wenchen Date: Mon Sep 7 15:30:54 2020 New Revision: 41342 Log: Remove RC artifacts Removed: dev/spark/v3.0.1-rc3-docs/ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r41344 - /release/spark/spark-3.0.0/
Author: wenchen Date: Mon Sep 7 15:30:55 2020 New Revision: 41344 Log: Remove old release Removed: release/spark/spark-3.0.0/ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r41343 - /dev/spark/v3.0.1-rc3-bin/ /release/spark/spark-3.0.1/
Author: wenchen Date: Mon Sep 7 15:30:55 2020 New Revision: 41343 Log: Apache Spark 3.0.1 Added: release/spark/spark-3.0.1/ - copied from r41342, dev/spark/v3.0.1-rc3-bin/ Removed: dev/spark/v3.0.1-rc3-bin/ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] tag v3.0.1 created (now 2b147c4)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to tag v3.0.1 in repository https://gitbox.apache.org/repos/asf/spark.git. at 2b147c4 (commit) No new revisions were added by this update. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (b0322bf -> 04f7f6da)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from b0322bf [SPARK-32779][SQL] Avoid using synchronized API of SessionCatalog in withClient flow, this leads to DeadLock add 04f7f6da [SPARK-32748][SQL] Support local property propagation in SubqueryBroadcastExec No new revisions were added by this update. Summary of changes: .../sql/execution/SubqueryBroadcastExec.scala | 16 -- .../sql/internal/ExecutorSideSQLConfSuite.scala| 63 +- 2 files changed, 72 insertions(+), 7 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-32779][SQL] Avoid using synchronized API of SessionCatalog in withClient flow, this leads to DeadLock
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new c2c7c9e  [SPARK-32779][SQL] Avoid using synchronized API of SessionCatalog in withClient flow, this leads to DeadLock
c2c7c9e is described below

commit c2c7c9ef78441682a585abb1dede9b668802a224
Author: sandeep.katta
AuthorDate: Mon Sep 7 15:10:33 2020 +0900

    [SPARK-32779][SQL] Avoid using synchronized API of SessionCatalog in withClient flow, this leads to DeadLock

    ### What changes were proposed in this pull request?

    There is no need to use the database name in the `loadPartition` API of `Shim_v3_0` to get the Hive table: Hive has an overloaded method that returns the table by table name alone. Using that API removes the dependency on `SessionCatalog` in the shim layer.

    ### Why are the changes needed?

    To avoid a deadlock when communicating with Hive metastore 3.1.x:

    ```
    Found one Java-level deadlock:
    =
    "worker3":
      waiting to lock monitor 0x7faf0be602b8 (object 0x0007858f85f0, a org.apache.spark.sql.hive.HiveSessionCatalog),
      which is held by "worker0"
    "worker0":
      waiting to lock monitor 0x7faf0be5fc88 (object 0x000785c15c80, a org.apache.spark.sql.hive.HiveExternalCatalog),
      which is held by "worker3"

    Java stack information for the threads listed above:
    ===
    "worker3":
      at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCurrentDatabase(SessionCatalog.scala:256)
      - waiting to lock <0x0007858f85f0> (a org.apache.spark.sql.hive.HiveSessionCatalog)
      at org.apache.spark.sql.hive.client.Shim_v3_0.loadPartition(HiveShim.scala:1332)
      at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$loadPartition$1(HiveClientImpl.scala:870)
      at org.apache.spark.sql.hive.client.HiveClientImpl$$Lambda$4459/1387095575.apply$mcV$sp(Unknown Source)
      at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
      at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
      at org.apache.spark.sql.hive.client.HiveClientImpl$$Lambda$2227/313239499.apply(Unknown Source)
      at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
      at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
      - locked <0x000785ef9d78> (a org.apache.spark.sql.hive.client.IsolatedClientLoader)
      at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
      at org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:860)
      at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$loadPartition$1(HiveExternalCatalog.scala:911)
      at org.apache.spark.sql.hive.HiveExternalCatalog$$Lambda$4457/2037578495.apply$mcV$sp(Unknown Source)
      at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
      at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
      - locked <0x000785c15c80> (a org.apache.spark.sql.hive.HiveExternalCatalog)
      at org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:890)
      at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179)
      at org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadPartition(SessionCatalog.scala:512)
      at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:383)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
      - locked <0x0007b1690ff8> (a org.apache.spark.sql.execution.command.ExecutedCommandExec)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
      at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
      at org.apache.spark.sql.Dataset$$Lambda$2084/428667685.apply(Unknown Source)
      at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
      at org.apache.spark.sql.Dataset$$Lambda$2085/559530590.apply(Unknown Source)
      at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
      at org.apache.spark.sql.execution.SQLExecution$$$Lambda$2093/139449177.apply(Unknown Source)
      at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
      at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
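The deadlock above is a classic lock-ordering cycle: one thread holds the `HiveExternalCatalog` monitor and waits for `HiveSessionCatalog`, while another holds them in the opposite order. The fix's shape, reading the needed value before entering the other monitor so that the two locks are never held at once, can be sketched with hypothetical classes (the names below are illustrative, not Spark's real API):

```scala
// Illustrative stand-in for a catalog whose getter is synchronized.
class SessionCatalogLike {
  private var currentDb = "default"
  def getCurrentDatabase: String = synchronized { currentDb }
}

class ExternalCatalogLike(session: SessionCatalogLike) {
  // Deadlock-prone shape: takes the session monitor while already
  // holding this object's monitor, creating a nested-lock dependency.
  def loadPartitionNested(): String = synchronized {
    s"loading into ${session.getCurrentDatabase}"
  }

  // Fixed shape, mirroring the patch's idea: capture the database name
  // first (the session monitor is released before we lock `this`), so
  // no thread ever holds both monitors at the same time.
  def loadPartitionFlat(): String = {
    val db = session.getCurrentDatabase
    synchronized { s"loading into $db" }
  }
}
```

In the actual patch the nested call is removed entirely, since Hive's overloaded `loadPartition` does not need the database name at all; the sketch only shows why breaking the nested locking removes the cycle.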
[spark] branch master updated (05fcf26 -> b0322bf)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 05fcf26 [SPARK-32677][SQL] Load function resource before create add b0322bf [SPARK-32779][SQL] Avoid using synchronized API of SessionCatalog in withClient flow, this leads to DeadLock No new revisions were added by this update. Summary of changes: .../src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (de44e9c -> 05fcf26)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from de44e9c [SPARK-32785][SQL] Interval with dangling parts should not results null add 05fcf26 [SPARK-32677][SQL] Load function resource before create No new revisions were added by this update. Summary of changes: python/pyspark/sql/tests/test_catalog.py | 5 +++-- .../spark/sql/catalyst/catalog/SessionCatalog.scala | 20 ++-- .../spark/sql/execution/command/functions.scala | 13 + .../test/resources/sql-tests/results/udaf.sql.out| 7 +-- .../resources/sql-tests/results/udf/udf-udaf.sql.out | 7 +-- .../spark/sql/execution/command/DDLSuite.scala | 15 +++ .../spark/sql/hive/HiveUDFDynamicLoadSuite.scala | 4 +++- 7 files changed, 54 insertions(+), 17 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org